cs.CL @ 2025-07-11: 697
-
00 07-10 (4) Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Rückverfolgbare Beweise Verbesserte visuelle Grundierung: Bewertung und Methodik 增强视觉依据的理由:评价和方法 2507.07999v1 -
01 07-10 PyVision: Agentic Vision with Dynamic Tooling PyVision: Agentische Vision mit dynamischem Werkzeug PyVision:带有动态工具的智能体视觉 2507.07998v1 -
02 07-10 Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) Operationalisierung eines Bedrohungsmodells für das Red-Teaming großer Sprachmodelle (LLMs) 实施 “ 红色组合大语言模型威胁模型 “ ; 2407.14937v2 -
03 07-10 Automating Expert-Level Medical Reasoning Evaluation of Large Language Models Automatisieren von Experten-Level Medical Reasoning Bewertung von großen Sprachmodellen 对大语言模式进行自动化专家级医疗理由评估 2507.07988v1 -
04 07-10 Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology Leistung und praktische Betrachtung von großen und kleinen Sprachmodellen in der klinischen Entscheidungsunterstützung in der Rheumatologie 风湿学临床决策支助中大型和小型语言模型的实用性及实用性考虑 2507.07983v1 -
05 07-10 Why is Your Language Model a Poor Implicit Reward Model? Warum ist Ihr Sprachmodell ein schlechtes Implizit-Reward-Modell? 为什么您的语言模式 是一个贫穷的隐含奖赏模式? 2507.07981v1 -
06 07-10 Long-Form Speech Generation with Spoken Language Models Langformige Sprachgenerierung mit gesprochenen Sprachmodellen 具有口言语言模式的长形式语音一代 2412.18603v2 -
07 07-10 Watermarking Degrades Alignment in Language Models: Analysis and Mitigation Wasserzeichen degradiert Ausrichtung in Sprachmodellen: Analyse und Milderung 语言模型的分级调整:分析和减轻影响 2506.04462v2 -
08 07-10 Scaling RL to Long Videos Skalierung von RL zu langen Videos 缩放 RL 到长视频 2507.07966v1 -
09 07-10 MIRIX: Multi-Agent Memory System for LLM-Based Agents MIRIX: Multi-Agent-Speichersystem für LLM-basierte Agenten MIRIX:LLM药剂多机构内存系统 2507.07957v1 -
10 07-10 SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment SAGE: Ein visuelles Sprachmodell zur Erkennung von Anomalien durch Fact Enhancement und Entropy-aware Alignment SAGE:通过事实增强和熵感知对齐进行异常检测的视觉语言模型 2507.07939v1 -
11 07-10 Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration Long Context Scaling: Teilen und Erobern durch multi-agent question-driven Collaboration 长期范围:通过多代理问题驱动的协作实现分化和征服 2505.20625v2 -
12 07-10 Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style Kontexttreue in großen Sprachmodellen untersuchen: Die Rollen der Gedächtnisstärke und des Evidenzstils 调查大语言模型的内情:记忆力和证据风格的作用 2409.10955v2 -
13 07-10 DTECT: Dynamic Topic Explorer & Context Tracker DTECT: Dynamischer Themen-Explorer & Kontext-Tracker DTECT: 动态专题探索器和上下文跟踪器 2507.07910v1 -
14 07-10 A Survey on Latent Reasoning Eine Umfrage über latente Vernunft A. 关于长期原因的调查 2507.06203v2 -
15 07-10 Automating MD simulations for Proteins using Large language Models: NAMD-Agent Automatisierung von MD-Simulationen für Proteine mit großen Sprachmodellen: NAMD-Agent 使用大语言模型自动化蛋白质的 MD 模拟:NAMD-Agent 2507.07887v1 -
16 07-10 When Dialects Collide: How Socioeconomic Mixing Affects Language Use Wenn Dialekte zusammenstoßen: Wie sich die sozioökonomische Mischung auf den Sprachgebrauch auswirkt 当对调时:社会经济混合如何影响语言使用 2307.10016v2 -
17 07-10 Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study Bewertung der Robustheit von großen Audio-Sprachmodellen zur Audio-Einspritzung: Eine empirische Studie 评估大音频语言模型对音频注射的威力:经验研究 2505.19598v2 -
18 07-10 DocCHA: Towards LLM-Augmented Interactive Online diagnosis System DocCHA: Auf dem Weg zum LLM-Augmented Interactive Online-Diagnosesystem DocCHA:争取建立LLM-增强的互动式在线诊断系统 2507.07870v1 -
19 07-10 Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation Alpay Algebra V: Multi-Layered Semantic Games und Transfinite Fixed-Point Simulation Alpay Algebra V:多层语义博弈和超限不动点模拟 2507.07868v1 -
20 07-10 Skywork-R1V3 Technical Report Technischer Bericht Skywork-R1V3 Skywork-R1V3 技术报告 2507.06167v3 -
21 07-10 From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems Von der Ambiguität zur Genauigkeit: Der transformative Effekt der Koreferenzlösung auf retrieval-augmentierte Erzeugungssysteme 从模糊到准确性:关于回收-提款一代系统的共同决议的变革效应 2507.07847v1 -
22 07-10 Truth-value judgment in language models: ‘truth directions’ are context sensitive Wahrheit-Wert-Urteil in Sprachmodellen: ‘Wahrheitsrichtungen’ sind kontextsensibel 语言模型中的真相价值判断:“真相方向”是背景敏感 2404.18865v2 -
23 07-10 None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks Keiner der anderen: eine allgemeine Technik zur Unterscheidung von der Erinnerung an Multiple-Choice-LLM-Bewertungs-Benchmarks 其他无其他:在多杯LLM评价基准中区分与记忆化区别理由的一般技术 2502.12896v5 -
24 07-10 Constrain Alignment with Sparse Autoencoders Beschränkung der Ausrichtung mit Sparse Autoencodern 与 Sparse 自动对齐 2411.07618v4 -
25 07-10 Unsupervised Morphological Tree Tokenizer Unüberwachter morphologischer Baum Tokenizer 不受监督的病理树化器 2406.15245v2 -
26 07-10 MAEBE: Multi-Agent Emergent Behavior Framework MAEBE: Multi-Agent Emergent Behavior Framework 多边代理新兴行为框架 2506.03053v2 -
27 07-10 The Thin Line Between Comprehension and Persuasion in LLMs Die dünne Linie zwischen Verständnis und Überzeugung in LLMs LLMM 理解与劝导之间的细细线 2507.01936v2 -
28 07-10 Conditional Unigram Tokenization with Parallel Data Bedingte Unigramm-Tokenisierung mit parallelen Daten 附带平行数据的条件性大学招式 2507.07824v1 -
29 07-10 On the Effect of Instruction Tuning Loss on Generalization Auf die Auswirkungen der Instruktion Tuning Verlust auf die Verallgemeinerung 指示计票损失对普遍化的影响的影响 2507.07817v1 -
30 07-10 Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning Verständnis und Kontrolle von Wiederholungsneuronen und Induktionsköpfen im In-Context-Lernen 了解和控制再生中新中世纪和内文学习中的上岗负责人 2507.07810v1 -
31 07-10 Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers Überbrückung von Logik und Lernen: Dekodierung von Temporal Logic-Embeddings über Transformer 架桥逻辑与学习:通过变形器解码时时逻辑嵌入 2507.07808v1 -
32 07-10 Decoding AI Judgment: How LLMs Assess News Credibility and Bias Entschlüsselung des AI-Urteils: Wie LLMs Glaubwürdigkeit und Bias von Nachrichten bewerten 解读AI判断:LLMs如何评估新闻可信度和偏见 2502.04426v2 -
33 07-10 StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model StreamUni: Streaming Speech Translation mit einem einheitlichen Large Speech-Language-Modell erreichen StreamUni:用统一大型语音语言模式实现流式语音翻译 2507.07803v1 -
34 07-10 Understanding Chain-of-Thought in LLMs through Information Theory Verständnis der in LLMs durch Informationstheorie gesuchten Gedankenkette 通过信息理论在LLM 中探索了解链 2411.11984v2 -
35 07-10 Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model Squeeze the Soaked Sponge: Effiziente Off-Policy-Verstärkung Feinsteuerung für großes Sprachmodell 挤压海绵:高效非政策强化大语言模式的高效非政策改进微调 2507.06892v2 -
36 07-10 When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance Wenn große Sprachmodelle das Recht erfüllen: Dual-Lens-Taxonomie, technischer Fortschritt und ethische Governance 当大语言模式符合法律时:双重语言分类、技术进步和道德治理 2507.07748v1 -
37 07-10 Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review Code-Switching in End-to-End Automatische Spracherkennung: Ein systematischer Literaturbericht 端至端自动语音识别代码转换:系统文学审查 2507.07741v1 -
38 07-10 GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing GuardVal: Dynamic Large Language Model Jailbreak Evaluation für umfassende Sicherheitstests 警卫:综合安全测试动态大语言示范监狱防爆评价 2507.07735v1 -
39 07-10 Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization Nicht alle Präferenzen sind das, was Sie für das Post-Training benötigen: Selektive Ausrichtungsstrategie für die Preference-Optimierung 并非所有的优惠都是培训后需要的:选择性的优化优化战略 2507.07725v1 -
40 07-10 Rethinking the Privacy of Text Embeddings: A Reproducibility Study of “Text Embeddings Reveal (Almost) As Much As Text” Die Privatsphäre von Text-Embeddings neu denken: Eine Reproduzierbarkeitsstudie von “Text-Embeddings Reveal (fast) So viel wie Text” 重新思考文字嵌入的隐私:关于“文字嵌入流(几乎)与文字一样”的可复制性研究 2507.07700v1 -
41 07-10 What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training Was wissen selbstüberwachte Sprachmodelle über Niederländisch? Analysieren von Vorteilen sprachspezifischer Vorausbildung 自我监督的演讲模式对荷兰语了解多少? 分析具体语言培训前培训的优势 2506.00981v2 -
42 07-10 KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities KeyKnowledgeRAG (K^2RAG): Eine verbesserte RAG-Methode zur Verbesserung der LLM-Fragestellung KeyknowledgeraG(K2RAG):改进LLM问答能力的强化RAG方法 2507.07695v1 -
43 07-10 SAS: Simulated Attention Score SAS: Simulierter Aufmerksamkeits-Score SAS:模拟关注计分 2507.07694v1 -
44 07-10 Hierarchical Bracketing Encodings for Dependency Parsing as Tagging Hierarchische Bracketing-Encodings für Dependency Parsing als Tagging 将依赖性剖析作为拖贴 2505.11693v2 -
45 07-10 Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues Ko-Konstruktives Verhalten von großen Sprachmodellen in Erklärungsdialogen untersuchen 解释对话中大语言模式的共同调查行为 2504.18483v2 -
46 07-10 Improving Cross-lingual Representation for Semantic Retrieval with Code-switching Verbesserung der Cross-lingual Darstellung für semantische Retrieval mit Code-Schaltung 使用代码转换法改进语义检索的跨语种代表性 2403.01364v2 -
47 07-10 Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language Beyond Hate Speech: NLPs Herausforderungen und Chancen beim Enthumanisieren der Sprache 超越仇恨言论:NLP在揭开非人化语言方面的挑战和机遇 2402.13818v2 -
48 07-10 An Automated Length-Aware Quality Metric for Summarization Ein Automatisiertes Längen-Bewusst-Qualitäts-Metrik für die Zusammenfassung 用于汇总的自动长软件质量计量器 2507.07653v1 -
49 07-10 Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement Lost in Pronunciation: Chinesische Offensive Sprache entdecken, verkleidet durch phonetische Umkleide-Ersatz 失落于发音中:发现因替换语音内衣而变形的中国进攻性语言 2507.07640v1 -
50 07-10 FrugalRAG: Learning to retrieve and reason for multi-hop QA FrugalRAG: Lernen zum Abrufen und Grund für Multi-Hop-QA FrugalRAG:学会检索和多呼QA的理由 2507.07634v1 -
51 07-10 Towards a cognitive architecture to enable natural language interaction in co-constructive task learning Auf dem Weg zu einer kognitiven Architektur, um natürliche Sprachinteraktion im co-konstruktiven Aufgabenlernen zu ermöglichen 建立一个认知结构,在共同建设性任务学习中促成自然语言互动 2503.23760v2 -
52 07-10 Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights Vergleichende Stimmungsanalyse der öffentlichen Wahrnehmung: Monkeypox vs. COVID-19 Verhaltenseinblicke 对公众感知的比较情绪分析:猴痘与COVID-19的行为洞察 2505.07430v2 -
53 07-10 Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks Erforschung der Grenzen der Modellkompression in LLMs: Eine Studie zur Wissensdestillation über QA-Aufgaben 探索LLMM中模型压缩的限度:关于质量保证任务的知识积累研究 2507.07630v1 -
54 07-10 Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation Gute/böse Reputation Urteil von Prominenten durch LLMs über retrieval Augmented Generation LLMs通过回回子增量一代对名词的良好/负面评奖判决 2503.14382v2 -
55 07-10 SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs SpatialViz-Bench: Automatisch generierte räumliche Visualisierungs-Aufgaben für MLLMs SpatialViz-Bench:面向MLLMs的自动生成空间可视化推理任务 2507.07610v1 -
56 07-10 Bayesian Discrete Diffusion Beats Autoregressive Perplexity Bayesian Discrete Diffusion schlägt autoregressive Verwirrung Bayesian Discrete 扩散比自动递减的两难性 2507.07586v1 -
57 07-10 Beyond Overcorrection: Evaluating Diversity in T2I Models with DivBench Jenseits von Überkorrektur: Bewertung von Diversität in T2I-Modellen mit DivBench 超越过度纠正:在DivBench的T2I模型中评估多样性 2507.03015v2 -
58 07-10 Improving Clustering on Occupational Text Data through Dimensionality Reduction Verbesserung der Clusterbildung auf berufsbezogenen Textdaten durch Dimensionalitätsreduzierung 通过减少分量改进职业文本数据集群化 2507.07582v1 -
59 07-10 COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation COALA: Numerisch stabiles und effizientes Framework für kontextabhängige Low-Rank-Annäherung COALA: 低 Rank 上下低敏度接近度的数值稳定、高效框架 2507.07580v1 -
60 07-10 Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation Ein-zu-Mix Modalität Ausrichtung mit multimodalen Großsprachenmodellen für die Übersetzung von Dokumentenbildmaschinen 单一至混合模式与文件图像机机翻译多式大语言模式 2507.07572v1 -
61 07-10 video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models Video-SALMONN 2: Bildunterschrift-verbesserte Audio-Visuelle große Sprachmodelle 视频-SALMONN2:字幕-强化视听大语言模式 2506.15220v2 -
62 07-10 The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs Synergy Dilemma von Long-CoT SFT und RL: Untersuchung von Post-Training-Techniken zur Begründung von VLMs Long-CoT SFT和RL的协同问题:调查培训后用于说明理由的训练后技术 2507.07562v1 -
63 07-10 Multi-Head RAG: Solving Multi-Aspect Problems with LLMs Multi-Head RAG: Lösung von Multi-Aspect-Problemen mit LLMs 多方主管RAG:解决多领域问题与LLM 2406.05085v4 -
64 07-10 The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora Die Cross-Lingual Cost: Retrieval Biases in RAG über arabisch-englische Corpora 跨语言成本:通过阿拉伯语-英语公司在RAG中检索到阿拉伯语-英语公司 2507.07543v1 -
65 07-10 CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text CEA-LIST bei CheckThat! 2025: Bewertung von LLMs als Detektoren von Bias und Meinung im Text CEA-LIST 参加 CheckThat! 2025:评估LLMs作为文本中偏见和观点的检测器 2507.07539v1 -
66 07-10 CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks CheckEmbed: Effektive Überprüfung von LLM-Lösungen auf offene Aufgaben 复选对象:有效核查对不限名额任务LLM解决方案的有效核查 2406.02524v5 -
67 07-10 Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models Thought Crime: Hintertüren und Emergent-Missausrichtung in vernünftigen Modellen 思想犯罪:后门和合理理由模型中新出现的不协调现象 2506.13206v2 -
68 07-10 Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems Triadische Mehrparteien-Stimme-Aktivitätsprojektion für Turn-Take in gesprochenen Dialogsystemen 口语对话系统中用于话轮转换的三方多方语音活动投射 2507.07518v1 -
69 07-10 Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System Auf dem Weg zu echten chinesischen Psychologischen Unterstützungsdialogen: CPsDD-Datensatz und ein gemeinsames Multi-Agenten-System 走向现实世界的中国心理支持对话:CPsDD数据集和共同演进的多行为者系统 2507.07509v1 -
70 07-10 Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models Halluzinationsstationen: Auf einigen grundlegenden Einschränkungen von Transformer-basierten Sprachmodellen 幻觉站:关于以变换语言模式的一些基本限制 2507.07505v1 -
71 07-10 Enhancing Transformers for Generalizable First-Order Logical Entailment Erweiterung der Transformer für generalisierbare Logical Entailment erster Ordnung 增强通用一级一级逻辑元件的变压器 2501.00759v3 -
72 07-10 Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature Gewinnung von ORR-Katalysatorinformationen für Brennstoffzelle aus wissenschaftlicher Literatur 从科学文献中提取用于燃料电池的 ORR 催化器信息 2507.07499v1 -
73 07-10 Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code LLM zur Vernunft bringen: Stärkung Lernen aus algorithmischen Problemen ohne Code 教LLM到理由:加强从没有法典的等级问题中学习 2507.07498v1 -
74 07-10 PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving PLAN-TUNING: Sprachmodelle nach dem Training lernen Schritt-für-Schritt-Planung für komplexe Problemlösung PLAN-TUNING:对语言模型进行后训练,以学习用于复杂问题求解的逐步规划 2507.07495v1 -
75 07-10 SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records SimSUM: Simulierter Benchmark mit strukturierten und unstrukturierten medizinischen Aufzeichnungen SimSUM:与结构化和非结构化医疗记录模拟基准 2409.08936v3 -
76 07-10 Affordable AI Assistants with Knowledge Graph of Thoughts Erschwingliche KI-Assistenten mit Wissensgrafik der Gedanken 具有知识思想知识图的负担得起的AI助理 2504.02670v5 -
77 07-10 Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models Machine Bullshit: Charakterisieren der Emergenten Missachtung der Wahrheit in großen Sprachmodellen 机器胡说:在大语言模型中突出新人无视真相的特点 2507.07484v1 -
78 07-10 Mixture of Group Experts for Learning Invariant Representations Mixtur von Gruppenexperten für Learning Invariante Repräsentationen 学习不稳定代表小组专家混合 2504.09265v2 -
79 07-10 ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining ixi-GEN: Effiziente industrielle sLLMs durch Domain Adaptive Continual Pretraining ixi-GEN:通过远程适应性连续训练前,提高工业低温生产效率 2507.06795v2 -
80 07-10 Structure Guided Large Language Model for SQL Generation Struktur Geführtes großes Sprachmodell für SQL-Generierung SQL 生成引导大语言模式 2402.13284v4 -
81 07-10 RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning RLEP: Verstärktes Lernen mit Erfahrungsreplay für LLM-Reasoning RLEP: 强化学习,经验重现LLM 理由推理 2507.07451v1 -
82 07-10 Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving Agent KB: Nutzung von Cross-Domain-Erfahrungen für die Lösung Agentischer Probleme Agent KB: 利用跨域经验解决代理问题 2507.06229v2 -
83 07-10 SAND: Boosting LLM Agents with Self-Taught Action Deliberation SAND: LLM-Agenten mit selbsterzogener Handlungsberatung stärken SAND:促进具有自学行动考虑的LLM代理 2507.07441v1 -
84 07-10 Towards Interpretable Time Series Foundation Models Auf dem Weg zu interpretierbaren Zeitreihen-Grundmodellen 迈向可解释时间序列基础模型 2507.07439v1 -
85 07-10 SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data SynthEHR-Eviction: Verbesserung der Eviction SDoH-Erkennung mit LLM-Augmented Synthetic EHR Data 合成EHR-驱逐:利用LLM-增强的合成电子HR数据加强驱逐SDoH探测 2507.07421v1 -
86 07-10 MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning MedReadCtrl: Personalisierung medizinischer Textgenerierung mit Lesbarkeitsgesteuertem Unterricht MedReadCtrl: 使医疗文本生成个性化,并进行可读性控制教学学习 2507.07419v1 -
87 07-10 May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks Darf ich Ihre Aufmerksamkeit haben? Breaking Fine-Tuning basierte Prompt Injection Defenses mit Architektur-Aware Attacken 请大家注意,使用建筑软件攻击 突破基于精密发射的快速喷射防御系统 2507.07417v1 -
88 07-10 Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation Interlinguistische phonetische Komposition (IPC): Ein theoretischer und rechnerischer Ansatz, um die zweite Sprache zu verbessern 语言间音音组成:加强第二语言发音的理论和计算方法 2411.10927v3 -
89 07-10 TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning TART: Ein Open-Source Tool-Augmented Framework für erklärbare Tabellen-basierte Begründung TART: 用于可解释的基于表格推理的开源工具增强框架 2409.11724v3 -
90 07-10 GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation GNN-CNN: Ein effizientes Hybridmodell für konvolutionäre und Graphen-Neuralnetzwerke zur Textdarstellung GNN-CNN: 用于文本代表的动态和图形神经网络的有效混合模型 2507.07414v1 -
91 07-10 CoAM: Corpus of All-Type Multiword Expressions CoAM: Corpus von Multiwort-Ausdrücken aller Art CoAM: 全类型多字表达式组合体 2412.18151v3 -
92 07-10 Rethinking Verification for LLM Code Generation: From Generation to Testing Überprüfung der LLM-Code-Generierung neu denken: Von der Generation zur Prüfung 重新思考LLM 代码生成的核查:从生成到测试 2507.06920v2 -
93 07-10 Large Language Model for Extracting Complex Contract Information in Industrial Scenes Großes Sprachmodell zur Extraktion komplexer Vertragsinformationen in Industrieszenen 工业景点复杂合同信息提取大语言模型 2507.06539v2 -
94 07-10 BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems BountyBench: Dollar-Impact von KI-Agenten-Angriffen und Verteidigern auf reale Cybersicherheitssysteme BountyBench: AI代理攻击者和捍卫者对现实世界网络安全系统的美元影响 2505.15216v2 -
95 07-10 Bradley-Terry and Multi-Objective Reward Modeling Are Complementary Bradley-Terry und Multi-Objective Reward Modeling sind komplementär Bradley-Terry和多目标奖励模型具有补充作用 2507.07375v1 -
96 07-10 Shifting from Ranking to Set Selection for Retrieval Augmented Generation Wechsel vom Ranking zur Einstellungsauswahl für retrieval Augmented Generation 从排位移到设置回收增量一代的选择 2507.06838v2 -
97 07-09 (3) Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation Taming the Tri-Space Tension: ARC-geführte Halluzination Modellierung und Steuerung für Text-zu-Bild-Generierung 驯服三空间张力:面向文本到图像生成的ARC引导幻觉建模与控制 2507.04946v2 -
98 07-09 Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation Multi-Agent Retrieval-Augmented Framework for Evidence-based Counterspeech Against Health Misinformation 以证据为依据的反健康错误信息反言多证据检索强化框架 2507.07307v1 -
99 07-09 Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation Ranker, Richter und Assistenten: Auf dem Weg zum Verständnis des Interplays von LLMs in Information Retrieval Evaluation 中继员、法官和助理:了解信息检索评价中LLMLM的交互作用 2503.19092v2 -
100 07-09 ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning ViDove: Ein Übersetzungs-Agenten-System mit multimodalem Kontext und Memory-Augmented Reasoning ViDove:一个具有多式背景和记忆增强理由的翻译代理系统 2507.07306v1 -
101 07-09 The Impact of Background Speech on Interruption Detection in Collaborative Groups Die Auswirkungen der Hintergrundrede auf die Störungserkennung in kollaborativen Gruppen 背景发言对合作小组中间探测的影响 2507.07280v1 -
102 07-09 LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation LinguaMark: Sprechen multimodale Modelle fair? Eine Benchmark-basierte Bewertung LinguaMark:多模式模型是否公平说话?基于基准的评价 2507.07274v1 -
103 07-09 Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery Open Source Planning & Control System mit Language Agents für autonome wissenschaftliche Entdeckung 拥有自主科学发现语言代理的开放源规划和控制系统 2507.07257v1 -
104 07-09 A Comprehensive Survey of Contamination Detection Methods in Large Language Models Eine umfassende Untersuchung von Methoden zur Erkennung von Verunreinigungen in großen Sprachmodellen 关于大语言模型中污染探测方法的综合调查 2404.00699v5 -
105 07-09 Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in zwei realen klinischen Anwendungen 赋予保健从业者以语言模式:在两个现实世界临床应用中构建语音记录 2507.05517v2 -
106 07-09 A Language-Driven Framework for Improving Personalized Recommendations: Merging LLMs with Traditional Algorithms Ein sprachgetriebener Rahmen zur Verbesserung personalisierter Empfehlungen: Zusammenführung von LLMs mit traditionellen Algorithmen 改进个性建议的语言驱动框架:将LLMs与传统等级合并 2507.07251v1 -
107 07-09 Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings Medical Red Teaming Protocol of Language Models: Über die Bedeutung der Nutzerperspektiven in der Gesundheitsversorgung 语言模式医学红队模式医疗红队协议:关于保健机构用户观点的重要性 2507.07248v1 -
108 07-09 A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive A Theory of Response Sampling in LLMs: Teil beschreibend und Teil präskriptiv LLMM中的反应抽样理论:部分说明性理论和部分说明性理论 2402.11005v4 -
109 07-09 An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation Eine informationstheoretische Perspektive zur Multi-LLM-Unsicherheitsschätzung 关于多LLLM不确定性估算的信息理论视角 2507.07236v1 -
110 07-09 Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection Emoji-Angriff: Verstärkung von Jailbreak-Angriffen gegen Richter LLM-Erkennung Emoji攻击:加强针对LLM法官的越狱袭击 2411.01077v3 -
111 07-09 SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains SynthTextEval: Synthetische Textdatengenerierung und -bewertung für High-Stakes-Domains 合成文本Eval:合成文本数据生成和评估,用于高占用域域 2507.07229v1 -
112 07-09 EduCoder: An Open-Source Annotation System for Education Transcript Data EduCoder: Ein Open-Source-Annotationssystem für Bildungstextdaten EduCoder:教育记录数据公开源说明系统 2507.05385v2 -
113 07-09 Derailer-Rerailer: Adaptive Verification for Efficient and Reliable Language Model Reasoning Derailer-Rerailer: Adaptive Verifikation für effizientes und zuverlässiges Sprachmodell Derailer-Rerailer:高效率和可靠语文示范理由的适应性核查 2408.13940v4 -
114 07-09 Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses Prompt Störungen Enthüllen Mensch-ähnliche Biasen in LLM Survey Responses LLM调查答复中的即时扰动干扰现象 2507.07188v1 -
115 07-09 Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs Im Vortraining angelegt, durch Finetuning beeinflusst: Eine Fallstudie über die Herkunft kognitiver Verzerrungen in LLMs 植根于预训练,受微调左右:关于LLM中认知偏差起源的案例研究 2507.07186v1 -
116 07-09 Multi-Attribute Steering of Language Models via Targeted Intervention Multi-Attribute-Lenkung von Sprachmodellen durch gezielte Intervention 通过有针对性的干预措施,多语言模式多来源指导 2502.12446v2 -
117 07-09 LCFO: Long Context and Long Form Output Dataset and Benchmarking LCFO: Long Context und Long Form Output Datensatz und Benchmarking LCFO:长期和长式产出数据集和基准 2412.08268v3 -
118 07-09 LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits LASeR: Mit Multi-Armed Bandits lernen, Belohnungsmodelle adaptiv zu wählen LASeR: 学习用多臂老虎机自适应地选择奖励模型 2410.01735v2 -
119 07-09 Low-Rank Adaptation Secretly Imitates Differentially Private SGD Low-Rank-Anpassung hinterhältig imitiert unterschiedlich private SGD 低浓度适应 秘密模仿 不同的私人 SGD 2409.17538v7 -
120 07-09 UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations UniConv: Vereinheitlichung der Retrieval- und Response-Generation für große Sprachmodelle in Gesprächen UniConv: 对话中大语言模型统一检索和反应生成 2507.07030v1 -
121 07-09 FlexOlmo: Open Language Models for Flexible Data Use FlexOlmo: Offene Sprachmodelle für flexible Datennutzung FlexOlmo:灵活数据使用开放语言模型 2507.07024v1 -
122 07-09 TokenShapley: Token Level Context Attribution with Shapley Value TokenShapley: Token Level Context Attribution mit Shapley Wert TokenShapley: 具有形状价值的 Token 级别背景属性 2507.05261v2 -
123 07-09 Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs Bewusst lernen, intuitiv handeln: Entsperren von Test-Time-Reasoning in multimodalen LLMs 故意、自觉地学习:在多模式LMs中解锁试验时间理由 2507.06999v1 -
124 07-09 Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming Alles planen mit Rigor: General-Purpose Zero-Shot Planung mit LLM-basiertem Formalisierten Programmierung 与Rigor一起规划任何事:与基于LLM的正规化方案规划一起规划一般用途零热规划 2410.12112v3 -
125 07-09 FRaN-X: FRaming and Narratives-eXplorer FRaN-X: FRaming und Narratives-eXplorer FRaN-X: 框架与叙事探索器 2507.06974v1 -
126 07-09 Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report Scaling towards the information boundary of instruction set: InfinityInstruct-Subject Technical Report 逐步走向指示集的信息边界:无限指令性 - 实物技术报告 2507.06968v1 -
127 07-09 Investigating the Robustness of Retrieval-Augmented Generation at the Query Level Untersuchung der Robustheit der Retrieval-Augmented Generation auf der Abfrageebene 在查询级别调查回溯源一代的强劲程度 2507.06956v1 -
128 07-09 Neuron-Level Differentiation of Memorization and Generalization in Large Language Models Neuron-Level-Differenzierung von Erinnerung und Generalisierung in großen Sprachmodellen 大语言模型中记忆化和普及化的中上级中值差异 2412.18497v2 -
129 07-09 Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models Auf dem Weg zu einer vernünftigen Ära: Eine Umfrage über lange Kette von Gedanken, um große Sprachmodelle zu verstehen 通向理性时代:关于为理由使用大语言模式而寻求的长链研究的调查 2503.09567v4 -
130 07-09 What to Keep and What to Drop: Adaptive Table Filtering Framework Was zu halten und was zu fallen: Adaptive Tabelle Filterung Rahmen 保持和放下什么:适应性表格过滤框架 2506.23463v2 -
131 07-09 Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues LLMs für die Vorhersage von Tutorenstrategien und Studentenergebnissen in Dialogen erforschen 探索在对话中预测教学战略和学生成果 2507.06910v1 -
132 07-09 MultiJustice: A Chinese Dataset for Multi-Party, Multi-Charge Legal Prediction MultiJustice: Ein chinesischer Datensatz für Multi-Party, Multi-Charge Legal Prediction 多公正:多党、多主管法律预测的中国数据集 2507.06909v1 -
133 07-09 MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection MIND: Ein Multi-Agenten-Framework für die Null-Schuss-Schadstoff-Meme-Erkennung MIND: 零中弹有害气象探测多剂框架 2507.06908v1 -
134 07-09 VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation VisualTrap: Ein Stealthy Backdoor Angriff auf GUI-Agenten durch Visual Grounding Manipulation 视觉陷阱:通过视觉定位操纵对图形用户界面代理器进行隐秘的后门攻击 2507.06899v1 -
135 07-09 NoLiMa: Long-Context Evaluation Beyond Literal Matching NoLiMa: Lang-Kontext-Evaluierung jenseits der sprachlichen Übereinstimmung NoLiMa: 超越文学匹配的长文本评价 2502.05167v3 -
136 07-09 SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN SCoRE: Optimierte korpusbasierte Relationsextraktion mit Multi-Label Kontrastivem Lernen und Bayesian kNN SCoRE:利用多标签对比学习和Bayesian kNN进行精简的基于语料库的关系抽取 2507.06895v1 -
137 07-09 Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights Entwicklung und Pflege eines Open-Source-Repositorys von KI-Bewertungen: Herausforderungen und Erkenntnisse 开发并维持一个开放源码的AI评价储存库:挑战和洞察 2507.06893v1 -
138 07-09 Adaptive Elicitation of Latent Information Using Natural Language Adaptive Vermittlung von latenten Informationen mit natürlicher Sprache 利用自然语言调适性易感信息使用自然语言 2504.04204v2 -
139 07-09 EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning EMORL: Ensemble Multi-Objektive Verstärkungs-Lernen für effizientes und flexibles LLM-Fine-Tuning EMORL: 将多目标强化学习组合起来,促进高效率和灵活的LLM 微调 2505.02579v3 -
140 07-09 Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework Adaptive Beendigung für multi-round Parallele Reasoning: Ein universelles Semantisches Entropie-Führungs-Framework 多元平行理由的适应性终止:通用的语义 Entropy 指导框架 2507.06829v1 -
141 07-09 CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation CMQCIC-Bench: Ein chinesischer Benchmark für die Bewertung großer Sprachmodelle in der Berechnung medizinischer Qualitätskontrollindikatoren CMQCIC-Bench:中国医疗质量控制指标计算评估大语言模式基准 2502.11703v2 -
142 07-09 Losing our Tail – Again: On (Un)Natural Selection And Multilingual Large Language Models Unseren Tail verlieren – wieder einmal: An (Un)Natürlicher Auswahl und mehrsprachigen großen Sprachmodellen 失去我们的尾巴 – – 再一次:在(un)自然选择和多语言大语言模式 2507.03933v2 -
143 07-09 Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts Sicherer oder luckier? LLMs als Sicherheitsevaluatoren sind für Artefakte nicht robust 安全性更安全还是更幸运?作为安全评估员的LLMs没有强力进行人工操作。 2503.09347v3 -
144 07-09 Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams Text zum Modell via SysML: Automatisierte Generierung dynamischer Systemrechnermodelle aus unstrukturiertem Naturtext über verbesserte Systemmodellierung Sprachdiagramme 通过 SysML 自动生成动态系统计算模型,通过强化系统模拟图,从未结构化的自然语言文本生成动态系统计算模型 2507.06803v1 -
145 07-09 Test-Time Scaling with Reflective Generative Model Test-Zeit-Skalierung mit reflektierendem Generativem Modell 具有反反思考生成模型的试验时间缩放 2507.01951v2 -
146 07-09 GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods GuidedBench: Messung und Milderung der Evaluationsdiskrepanzen von In-the-Wild LLM Jailbreak Methoden 指导:衡量和减轻LLM监狱破碎方法在评价方面的差异 2502.16903v2 -
147 07-09 Checklist Engineering Empowers Multilingual LLM Judges Checklisten-Engineering stärkt mehrsprachige LLM-Richter 清单工程赋能多语言LLM评判者 2507.06774v1 -
148 07-09 KAConvText: Novel Approach to Burmese Sentence Classification using Kolmogorov-Arnold Convolution KAConvText: Neuer Ansatz zur burmesischen Satzklassifikation mit Kolmogorov-Arnold-Konvolution KAConvText:利用柯尔莫哥洛夫-阿诺德卷积进行缅甸语句子分类的新方法 2507.06753v1 -
149 07-09 Robust Multimodal Large Language Models Against Modality Conflict Robuste multimodale große Sprachmodelle gegen Modalitätskonflikt 反对方式冲突强有力的多式多式大语言模型 2507.07151v1 -
150 07-09 Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons Knockout LLM Assessment: Verwendung großer Sprachmodelle für Bewertungen durch iterative Pairwise-Vergleiche Knockout LLM 评估:利用大语言模式,通过迭代成对比较进行评估 2506.03785v3 -
151 07-09 LLM-based User Profile Management for Recommender System LLM-basierte Benutzerprofilverwaltung für Recommender System 以LLM为基础的建议系统用户概况管理 2502.14541v2 -
152 07-09 Civil Society in the Loop: Feedback-Driven Adaptation of (L)LM-Assisted Classification in an Open-Source Telegram Monitoring Tool Zivilgesellschaft im Kreislauf: Feedback-gestützte Anpassung der (L)LM-gestützten Klassifikation in einem Open-Source-Telegram-Monitoring-Tool 回路中的民间社会:在开源Telegram监测工具中对(L)LM辅助分类进行反馈驱动的调整 2507.06734v1 -
153 07-09 On the Effect of Uncertainty on Layer-wise Inference Dynamics Über den Effekt der Unsicherheit auf die schichtweise Schlussfolgerungsdynamik 不确定性对图层推论作用的影响 2507.06722v1 -
154 07-09 CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs CLI-RAG: Ein Retrieval-Augmented Framework für klinisch strukturierte und kontextorientierte Textgenerierung mit LLMs CLI-RAG: 利用LLMs实现临床结构化和有背景意识的文本生成的检索增强框架 2507.06715v1 -
155 07-09 Do Larger Language Models Imply Better Generalization? A Pretraining Scaling Law for Implicit Reasoning Werden größere Sprachmodelle besser verallgemeinert? Ein Vorschulungs-Skalierungsgesetz für Implizite Vernunft 大语言模式是否充分更加普遍化? 2504.03635v2 -
156 07-09 A Survey on Prompt Tuning Ein Überblick über Prompt Tuning 提示微调综述 2507.06085v2 -
157 07-09 Automating IRAC Analysis in Malaysian Contract Law using a Semi-Structured Knowledge Base Automatisierung der IRAC-Analyse im malaysischen Vertragsrecht mit Hilfe einer halbstrukturierten Wissensbasis 利用半结构化知识库对马来西亚合同法的IRAC分析自动化 2406.13217v2 -
158 07-09 Probing and Steering Evaluation Awareness of Language Models Sondierung und Steuerung des Evaluationsbewusstseins von Sprachmodellen 语言模型评估意识的探测与引导 2507.01786v2 -
159 07-09 Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models Elite Polarisierung in europäischen Parlamentsreden: ein neuartiger Messansatz mit großen Sprachmodellen 欧洲议会讲话中的极极化:使用大语言模式的新衡量方法 2507.06658v1 -
160 07-09 PBa-LLM: Privacy- and Bias-aware NLP using Named-Entity Recognition (NER) PBa-LLM: Datenschutz- und Bias-aware NLP mit Named-Entity-Erkennung (NER) PBa-LLM: 使用命名实体识别(NER)的隐私和偏见感知NLP 2507.02966v2 -
161 07-09 Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review Beschleunigung der Datenextraktion mit Hilfe eines großen Sprachmodells (LLM) und eines Scoping-Review-Protokolls: eine methodische Studie im Rahmen einer komplexen Scoping-Review 利用大型语言模式和范围界定审查议定书加快数据提取速度:在复杂的范围界定审查范围内进行的方法研究 2507.06623v1 -
162 07-09 FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation FuDoBa: Zusammenführen von Dokumenten- und Wissensgraphen-basierten Darstellungen mit Bayesian-Optimierung FuDoBa: 以贝叶斯优化融合基于文档和知识图谱的表示 2507.06622v1 -
163 07-09 Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation Decoder-Hybrid-Decoder-Architektur für effizientes Nachdenken mit langer Generation 用于长生成高效推理的 Decoder-Hybrid-Decoder 架构 2507.06607v1 -
164 07-09 Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives Video-Language Understanding: Eine Umfrage aus Modellarchitektur, Modellschulung und Datenperspektiven 了解视频语言:从建模模型、示范培训和数据视角进行调查 2406.05615v3 -
165 07-09 Can Input Attributions Explain Inductive Reasoning in In-Context Learning? Kann Input-Attributionen im In-Context-Lernen eine induktive Begründung erklären? 投入属性能否解释内文学习中的引因? 2412.15628v5 -
166 07-09 Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions Bewertung und Verbesserung der Robustheit in großen Sprachmodellen: Umfrage und Zukunftsrichtung 评价和提高大语言模式的威力:调查和未来方向 2506.11111v2 -
167 07-09 Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis Verbesserung der Lebensmittel-Domain-Fragebeantwortung mit einem multimodalen Wissensgraphen: Hybride QA-Erzeugung und Diversitätsanalyse 以多模式知识图回答加强粮食领域问题:混合QA生成和多样性分析 2507.06571v1 -
168 07-09 CHAI for LLMs: Improving Code-Mixed Translation in Large Language Models through Reinforcement Learning with AI Feedback CHAI für LLMs: Verbesserung der code-gemischten Übersetzung in großen Sprachmodellen durch Verstärkungslernen mit KI-Feedback 面向LLM的CHAI:利用AI反馈强化学习,改进大语言模式的语码混合翻译 2411.09073v3 -
169 07-09 The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production Die Fehler anderer: Ein LLM-gestützter Rahmen für die wissenschaftliche Wissensproduktion 他人的缺陷:由LLM驱动的科学知识生产框架 2507.06565v1 -
170 07-09 DS@GT at CheckThat! 2025: Exploring Retrieval and Reranking Pipelines for Scientific Claim Source Retrieval on Social Media Discourse DS@GT bei CheckThat! 2025: Erforschung von Retrieval- und Reranking-Pipelines für das Auffinden der Quellen wissenschaftlicher Behauptungen im Social-Media-Diskurs DS@GT at CheckThat! 2025: 探索用于社交媒体话语中科学主张来源检索的检索与重排序流水线 2507.06563v1 -
171 07-09 AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework AutoPrep: Natural Language Question-Aware Datenvorbereitung mit einem Multi-Agent Framework AutoPrep: 基于多智能体框架的自然语言问题感知数据准备 2412.10422v4 -
172 07-09 FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction FiRST: Feinsteuerungs-Router-Selective Transformer für Input-Adaptive Latenzreduktion FIRST: 用于减少投入-适应性延迟的微调路由器-选择式变换器 2410.12513v3 -
173 07-09 InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior InvestAlign: Überwindung von Datenknappheit bei der Ausrichtung großer Sprachmodelle mit Entscheidungsprozessen von Investoren unter Herdverhalten 投资对称:克服在将大语言模式与牧民行为下的投资者决策过程相匹配方面的数据稀缺性 2507.06528v1 -
174 07-09 FinSphere, a Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs and Domain Tools FinSphere, ein Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs und Domain Tools FinSphere:由指令微调LLM和领域工具驱动的实时股票分析智能体 2501.12399v2 -
175 07-09 FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation FIFA: Unified Faithfulness Evaluation Framework für Text-to-Video und Video-to-Text Generation FIFA: 文本到视频和视频到文字一代统一信仰评价框架 2507.06523v1 -
176 07-09 Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving LLMs nach ihrer Eignung lehren: Adaptive Begründung für mathematische Problemlösung 根据其适应性:数学问题解决的适应性理由 2502.12022v3 -
177 07-09 DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE DeepTalk: Auf dem Weg zu nahtloser und intelligenter Sprachinteraktion mit adaptiver Modalität-spezifischer MoE 深谈:实现与适应型模式具体部的无缝和智能语音互动 2506.21864v2 -
178 07-09 Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning Double-Checker: Bessere Begründung von langsam denkenden LLMs über selbstkritische Feinsteuerung 双重检查者:通过自批评性微调,加强慢思考低迷LMs的理由 2506.21285v2 -
179 07-09 SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers SpindleKV: Eine neuartige KV-Cache-Reduktionsmethode, die sowohl flache als auch tiefe Schichten ausbalanciert SpindleKV: 浅层和深层平衡的新KV缓存减少方法 2507.06517v1 -
180 07-09 Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs Breaking PEFT Limitations: Schwach-zu-starken Wissenstransfer für Backdoor-Angriffe in LLMs 打破PEFT的局限性:利用微弱至收缩知识转让来利用LLMs进行后门攻击 2409.17946v4 -
181 07-09 Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings Pun Intended: Multi-Agent Übersetzung von Wordplay mit Kontrastivem Lernen und phonetisch-semantischen Einbettungen 未预期:多代理翻译与矛盾学习和语音-语言嵌入的文字游戏的多代理翻译 2507.06506v1 -
182 07-09 GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification GMLM: Überbrückung von Graph Neuronalen Netzwerken und Sprachmodellen für heterophile Knotenklassifikation GMLM: 连接图神经网络与语言模型以进行异配性节点分类 2503.05763v5 -
183 07-09 ModelCitizens: Representing Community Voices in Online Safety ModelCitizens: Vertretung von Community Voices in Online-Sicherheit 模范公民:在网上安全方面代表社区声音 2507.05455v2 -
184 07-09 Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations Verfeinerung von gesketen Wahrnehmungen in visionssprachlichen Kontrastmodellen durch visuelle Darstellungen 通过视觉表现方式改进视觉语言对比模型中偏斜的观念 2405.14030v3 -
185 07-09 Can adversarial attacks by large language models be attributed? Können feindliche Angriffe von großen Sprachmodellen zugeschrieben werden? 大型语言模式的对抗性攻击能否归结为对抗性攻击? 2411.08003v2 -
186 07-09 TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation TokenSwift: Lossless Acceleration der Ultra Long Sequence Generation TokenSwift: 超长序列一代无损加速 2502.18890v2 -
187 07-09 On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks Über die Robustheit des verbalen Vertrauens von LLMs bei feindlichen Angriffen 关于LLM在对抗性攻击中口头置信度的稳健性 2507.06489v1 -
188 07-09 Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making Können LLMs das Spiel Ô Ăn Quan spielen? Eine Studie über mehrstufige Planung und Entscheidungsfindung LLM能玩 Ô Ăn Quan 游戏吗?多步规划和决策研究 2507.03711v3 -
189 07-09 Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning Video-RTS: Rethinking Verstärktes Lernen und Test-Time Scaling für effizientes und verbessertes Video-Reasoning 视频-RTS:重新思考强化学习和测试时间的尺度,以提高效率和增强视频理由 2507.06485v1 -
190 07-09 Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents Japanisch lernen mit Jouzu: Interaktionsergebnisse mit stylisierten Dialog-Fiktivagenten 与Jouzu一起学习日文:与固定对话分形媒介的互动结果 2507.06483v1 -
191 07-09 InfoTech Assistant: A Multimodal Conversational Agent for InfoTechnology Web Portal Queries InfoTech Assistant: Ein multimodaler Ansprechpartner für InfoTechnology Web Portal Queries 信息技术助理:信息技术网站门户查询的多模式对话代理 2412.16412v2 -
192 07-09 Theme-Explanation Structure for Table Summarization using Large Language Models: A Case Study on Korean Tabular Data Thema-Erläuterungsstruktur für die Tabellenzusammenfassung mit großen Sprachmodellen: Eine Fallstudie zu koreanischen Tabellendaten 使用大语言模型对表格进行汇总的主题解释结构:韩国表格数据案例研究 2501.10487v3 -
193 07-08 (2) A Systematic Analysis of Hybrid Linear Attention Eine systematische Analyse der hybriden linearen Aufmerksamkeit 系统分析混合线性注意 2507.06457v1 -
194 07-08 A Semantic Parsing Framework for End-to-End Time Normalization Ein semantisches Parsing-Framework für die End-to-End-Zeit-Normalisierung 端到终点时间正常化的语义分解框架 2507.06450v1 -
195 07-08 Perception-Aware Policy Optimization for Multimodal Reasoning Perception-Aware Policy Optimization für multimodale Reasoning 对多式联运理由的观念-认知软件政策优化 2507.06448v1 -
196 07-08 CodeMirage: Hallucinations in Code Generated by Large Language Models CodeMirage: Halluzinationen in Code Generiert durch große Sprachmodelle 代码Mirage: 大语言模型生成的代码中的幻觉 2408.08333v2 -
197 07-08 Substance over Style: Evaluating Proactive Conversational Coaching Agents Substanz über Stil: Bewertung proaktiver Conversations-Coaching-Agenten 物质高于样式:评价主动对话教练员 2503.19328v2 -
198 07-08 Can Interpretation Predict Behavior on Unseen Data? Kann Interpretation Verhalten auf ungesehenen Daten voraussagen? 解释预测行为是否与未知数据有关? 2507.06445v1 -
199 07-08 Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling Temporale Analyse des Diskurses zur Klimapolitik: Einblicke aus der dynamischen Themenmodellierung 气候政策问题时空分析:动态嵌入式专题模型的展望 2507.06435v1 -
200 07-08 Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders Erforschen der Aufgabenleistung mit Interpretierbaren Modellen über Sparse Auto-Encoder 通过 Sparse Auto-Encoders 与可解释模型探索任务性能 2507.06427v1 -
201 07-08 Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling Reward-Modelle können sich selbst verbessern: Reward-Guided Adversarial Failure Mode Discovery für robuste Reward-Modellierung 奖赏模型能够改善自我:为强力奖赏建模而发现奖赏- 受奖人 反逆失败模式 2507.06419v1 -
202 07-08 PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning PERK: Lang-Kontext als Parameter-Effizientes Test-Zeit-Lernen PERK: 参数有效试验时间学习的长期理由 2507.06415v1 -
203 07-08 The Trilemma of Truth in Large Language Models Das Trilemma der Wahrheit in großen Sprachmodellen 大语言模型中的真理三边 2506.23921v2 -
204 07-08 Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers Single Word Change ist alles, was Sie brauchen: Konzipieren von Angriffen und Verteidigungen für Textklassifikatoren 单字更改是您需要的: 设计攻击和文字分类的防御 2401.17196v2 -
205 07-08 Hypermagmas and Colored Operads: Heads, Phases, and Theta Roles Hypermagmen und gefärbte Operaden: Köpfe, Phasen und Theta-Rollen 超磁和彩色操作: 头部、阶段和Theta 角色 2507.06393v1 -
206 07-08 Multi-Sense Embeddings for Language Models and Knowledge Distillation Multi-Sense-Einbettungen für Sprachmodelle und Wissensdestillation 语文模式和知识蒸馏多语种多语种嵌入式 2504.06036v2 -
207 07-08 Evaluating Morphological Alignment of Tokenizers in 70 Languages Bewertung der morphologischen Ausrichtung von Tokenizern in 70 Sprachen 评估70种语言中分词器的形态对齐 2507.06378v1 -
208 07-08 Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate Wachsende Transformer: Modulare Zusammensetzung und schichtweise Erweiterung auf einem gefrorenen Substrat 增长变异器:在冻结基质上的模块构成和图层扩展 2507.07129v1 -
209 07-08 Could the Road to Grounded, Neuro-symbolic AI be Paved with Words-as-Classifiers? Könnte der Weg zur geerdeten, neuro-symbolischen KI mit Wort-als-Klassifikatoren gepflastert werden? 通往有根基的神经符号AI之路能否用“词即分类器”来铺设? 2507.06335v1 -
210 07-08 MedGellan: LLM-Generated Medical Guidance to Support Physicians MedGellan: LLM-generierte medizinische Anleitung zur Unterstützung von Ärzten MedGellan:LLM生成的支持医生的医疗指导 2507.04431v2 -
211 07-08 RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs RefineX: Verfeinern von Vorschulungsdaten auf Scale aus sachverständigen Programmen RefineX:从专家指导方案学习如何在规模上精确分析培训前数据 2507.03253v2 -
212 07-08 ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time ETT: Erweiterung des Langzeitkontexts Verständnisfähigkeit von LLMs bei Test-Time ETT:扩大LLMs在试验时的长距离理解能力 2507.06313v1 -
213 07-08 Humans overrely on overconfident language models, across languages Menschen überschätzen überbewusste Sprachmodelle, über Sprachen hinweg 人类对不同语言的 过于自信的语言模式过于自信 2507.06306v1 -
214 07-08 Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers Effizienz-Wirksamkeit Reranking FLOPs für LLM-basierte Reranker 为基于LLM的升级者重新安排FLOP 2507.06223v1 -
215 07-08 Instruction Following by Boosting Attention of Large Language Models Anleitung, indem man die Aufmerksamkeit großer Sprachmodelle erhöht 之后的教学,培养对大语言模式的注意 2506.13734v2 -
216 07-08 CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions CultureCLIP: CLIP mit kulturellem Bewusstsein durch synthetische Bilder und kontextualisierte Captions stärken CultureCLIP: 通过合成图像和情境化说明赋予CLIP文化意识 2507.06210v1 -
217 07-08 EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG EEG2TEXT-CN: Eine explorative Studie der offenen Vokabulären chinesischen Text-EEG-Ausrichtung über großsprachliches Modell und kontrastives Lernen auf ChinesischEEG EEG2TEXT-CN:通过大语言模式和中经语言差异性学习对中文文本与EEEG校对开放词汇的探索性研究 2506.00854v3 -
218 07-08 DS@GT at CheckThat! 2025: Ensemble Methods for Detection of Scientific Discourse on Social Media DS@GT bei CheckThat! 2025: Ensemblemethoden zur Erkennung wissenschaftlicher Diskurse über soziale Medien DS@GT at CheckThat! 2025:社交媒体科学话语检测的集成方法 2507.06205v1 -
219 07-08 Differential Mamba Unterschiedliche Mamba 差异 Mamba 2507.06204v1 -
220 07-08 UQLM: A Python Package for Uncertainty Quantification in Large Language Models UQLM: Ein Python-Paket zur Unsicherheitsquantifizierung in großen Sprachmodellen UQLM: 用于大语言模型中不确定性量化的 Python 软件包 2507.06196v1 -
221 07-08 DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification DS@GT bei CheckThat! 2025: Bewertung von Kontext- und Tokenisierungsstrategien für numerische Faktenverifizierung DS@GT at CheckThat! 2025:评估数值事实核查的上下文和分词策略 2507.06195v1 -
222 07-08 SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads SQLBarber: Ein System, das große Sprachmodelle nutzt, um angepasste und realistische SQL Workloads zu generieren SQLBarber:一个系统利用大语言模型生成自定义和现实的 SQL 工作量 2507.06192v1 -
223 07-08 DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation DS@GT bei CheckThat! 2025: Subjektivität durch Transfer-Learning und Korrekturdatenvergrößerung erkennen DS@GT at CheckThat! 2025: 通过传输学习和纠正数据增强检测主观性 2507.06189v1 -
224 07-08 Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review Versteckte Prompts in Manuskripten nutzen KI-gestütztes Peer-Review aus 手稿中的隐藏提示词利用AI辅助同行评审 2507.06185v1 -
225 07-08 MedGemma Technical Report Technischer Bericht MedGemma MedGemma 技术报告 2507.05201v2 -
226 07-08 CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization CriticLean: Kritiker-geführtes Verstärkungslernen für mathematische Formalisierung CriticLean: 用于数学形式化的 Critic-Guided 强化学习 2507.06181v1 -
227 07-08 Evaluation of Habitat Robotics using Large Language Models Bewertung der Habitat Robotics mit großen Sprachmodellen 利用大语言模式评价人居机器人学 2507.06157v1 -
228 07-08 Coding Triangle: How Does Large Language Model Understand Code? Coding Triangle: Wie versteht Large Language Model Code? 编码三角:大语言模式如何理解代码? 2507.06138v1 -
229 07-08 NeoBabel: A Multilingual Open Tower for Visual Generation NeoBabel: Mehrsprachiger offener Turm für die visuelle Erzeugung NeoBabel:多语言的视觉生成开放塔 2507.06137v1 -
230 07-08 SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam? SciMaster: Auf dem Weg zu allgemein angelegten wissenschaftlichen KI-Agenten, Teil I. X-Master als Grundlage: Können wir die letzte Prüfung der Menschheit durchführen? SciMaster:争取实现普通用途科学AI代理,第一部分,X-Master作为基金会:我们能否在人类最后一次考试上起带头作用? 2507.05241v2 -
231 07-08 Agents Are All You Need for LLM Unlearning Agenten sind alles, was Sie für LLM Unlearning benötigen 代理器是所有你需要的LLM 重新学习 2502.00406v2 -
232 07-08 GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models GAF-Guard: Agentischer Rahmen für Risikomanagement und Governance in großen Sprachmodellen GAF-Guard:大语言模式风险管理和治理代理框架 2507.02986v2 -
233 07-08 Nyay-Darpan: Enhancing Decision Making Through Summarization and Case Retrieval for Consumer Law in India Nyay-Darpan: Verbesserung der Entscheidungsfindung durch Zusammenfassung und Rückgriff auf das Verbraucherrecht in Indien Nyay-Darpan:印度消费者法通过摘要和案件检索加强决策 2507.06090v1 -
234 07-08 The bitter lesson of misuse detection Die bittere Lektion der Missbrauchserkennung 发现滥用的惨痛教训 2507.06282v1 -
235 07-08 A Multi-Task and Multi-Label Classification Model for Implicit Discourse Relation Recognition Ein Multi-Task- und Multi-Label-Klassifikationsmodell für die Implizite Diskursrelation-Erkennung 多任务和多标签的隐性断断层确认分类模式 2408.08971v3 -
236 07-08 Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle Sind LLMs vorausschauend? Eine kontinuierliche Bewertung mit täglichen Nachrichten als Orakel LLM是否具有先见之明?以每日新闻作为预言基准进行持续评估 2411.08324v2 -
237 07-08 The distribution of syntactic dependency distances Die Verteilung der syntaktischen Abhängigkeitsabstände 共同依赖距离分布 2211.14620v2 -
238 07-08 Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger Adaptive Werkzeuganwendung in großen Sprachmodellen mit Meta-Cognition Trigger 用于大语言模型的适应性工具使用 2502.12961v2 -
239 07-08 Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs Entropie-Memorisationsgesetz: Bewertung der Gedächtnis-Schwierigkeit der Daten in LLMs 《信封-记忆法》:评估LLMM中数据的记忆难度 2507.06056v1 -
240 07-08 Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review Überwindung von Datenknappheit bei generativen Sprachmodellierungen für Sprachen mit geringem Ressourcenbedarf: Eine systematische Überprüfung 克服低资源语言创用语言建模中的数据稀缺性:系统审查 2505.04531v2 -
241 07-08 Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager Maßgeschneiderte Gespräche über LLMs hinaus: Ein RL-basierter Dialogmanager 超出LLM 的定制对话:基于 RL 的对话管理器 2506.19652v2 -
242 07-08 Empirical evidence of Large Language Model’s influence on human spoken communication Empirische Beweise für den Einfluss von Large Language Model auf die menschliche gesprochene Kommunikation 大语言模式对人口交流的影响的经验证据 2409.01754v3 -
243 07-08 MemOS: A Memory OS for AI System MemOS: Ein Speicher-Betriebssystem für KI-System MemOS: AI系统记忆OS 2507.03724v2 -
244 07-08 OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model OpenS2S: Vollkommen offenes Source-End-to-End-Empathetik-Modell für große Sprache OpenS2S: 推进完全开放源全开放源端至端至端受同情型大型语音语言模式 2507.05177v2 -
245 07-08 Conditional Multi-Stage Failure Recovery for Embodied Agents Bedingte mehrstufige Fehlerbehebung für verkörperte Agenten 具身智能体的有条件多阶段故障恢复 2507.06016v1 -
246 07-08 DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations DocIE@XLLM25: In-Context-Lernen zur Informationsextraktion mit vollständig synthetischen Demonstrationen DocIE@XLLM25:利用全合成演示进行信息提取的上下文学习 2507.05997v1 -
247 07-08 Evolution without Large Models: Training Language Model with Task Principles Evolution ohne große Modelle: Sprachmodell mit Aufgabenprinzipien trainieren 无大型号的演变演变:具有任务原则的培训语言模式 2507.05991v1 -
248 07-08 Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening Entwicklung und Evaluation von HopeBot: ein LLM-basierter Chatbot für strukturiertes und interaktives PHQ-9-Depressionsscreening HopeBot的发展和评价:一个基于LLM的聊天室,用于结构化和互动的PHQ-9抑郁症筛查 2507.05984v1 -
249 07-08 RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages RabakBench: Skalierung menschlicher Anmerkungen zum Aufbau lokalisierter Mehrsprachiger Sicherheits-Benchmarks für ressourcenarme Sprachen RabakBench:扩展人工标注,以建立本地化的低资源语言多语文安全基准 2507.05980v1 -
250 07-08 We Should Evaluate Real-World Impact Wir sollten die Auswirkungen der realen Welt bewerten 我们应该评估现实世界的影响 2507.05973v1 -
251 07-08 Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model Analytisches Subraum-Routing: Wie rekursive Least Squares im kontinuierlichen Lernen des großen Sprachmodells funktionieren 分析子空间运行分析:在继续不断学习大语言模式中如何使最起码的广场恢复 2503.13575v2 -
252 07-08 FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights FRAME: Feedback-Refined Agent Methodik zur Verbesserung medizinischer Forschungs-Insights FRAME: 改进医学研究透视的反馈改进剂方法 2505.04649v2 -
253 07-08 OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation OpenFActScore:文本生成事实性的开源原子化评估 2507.05965v1 -
254 07-08 Rethinking Associative Memory Mechanism in Induction Head Rethinking Assoziative Memory Mechanism in Induktion Head 上岗时重新思考联合记忆机制 2412.11459v2 -
255 07-08 Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models Neudefinition von Bewertungsstandards: Ein einheitlicher Rahmen für die Bewertung der koreanischen Fähigkeiten von Sprachmodellen 重新界定评价标准:评价韩国语言模式能力的统一框架 2503.22968v4 -
256 07-08 Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems Chat-Ghosting: Eine vergleichende Untersuchung von Methoden zur Auto-Completion in Dialogsystemen 对话系统自动完成方法比较研究 2507.05940v1 -
257 07-08 Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors Denken Sie daran, Vergangenheit, Zukunft antizipieren: Lernen kontinuierlich multimodale Fehlinformation Detectors 记住过去,预期未来:学习持续多模式错误信息探测器 2507.05939v1 -
258 07-08 Towards a Principled Evaluation of Knowledge Editors Auf dem Weg zu einer grundsätzlichen Bewertung von Wissensredakteuren 对知识编辑进行有原则地评价 2507.05937v1 -
259 07-08 News and Load: Social and Economic Drivers of Regional Multi-horizon Electricity Demand Forecasting News and Load: Soziale und wirtschaftliche Treiber regionaler Mehrhorizont-Stromnachfrageprognosen 新闻与负载:区域多光速电力需求预测的社会和经济驱动因素 2406.06641v2 -
260 07-08 Semantic Certainty Assessment in Vector Retrieval Systems: A Novel Framework for Embedding Quality Evaluation Semantische Sicherheitsbewertung in Vector Retrieval Systemen: Ein neuartiger Rahmen für die Einbettung von Qualitätsbewertungen 矢量检索系统中的语义性确定性评估:包含质量评价的新框架 2507.05933v1 -
261 07-08 Self-supervised learning of speech representations with Dutch archival data Selbstüberwachtes Lernen von Sprachdarstellungen mit niederländischen Archivdaten 自行监督学习用荷兰档案数据进行语音陈述 2507.04554v2 -
262 07-08 Few-shot text-based emotion detection Kurze Text-basierte Emotionserkennung 几发短短的基于文本的情感探测 2507.05918v1 -
263 07-08 On the Fundamental Impossibility of Hallucination Control in Large Language Models Über die grundsätzliche Unmöglichkeit der Halluzinationskontrolle in großen Sprachmodellen 关于大语言模型中幻听控制的基本不可能性 2506.06382v3 -
264 07-08 AI-Reporter: A Path to a New Genre of Scientific Communication AI-Reporter: Ein Weg zu einem neuen Genre wissenschaftlicher Kommunikation AI-记者:通向科学通信新一流的道路 2507.05903v1 -
265 07-08 MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation MusiScene: Nutzung von MU-LLaMA für die Vorstellung von Szenen und verbesserte Video-Hintergrundmusik-Generation MusiScene:利用MU-LLaMA进行场景想象和增强视频背景音乐生成 2507.05894v1 -
266 07-08 Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators Psychometrische Objektvalidierung mit virtuellen Befragten mit trait-Response Mediatoren 使用虚拟应答人与Trait-Response调解员的虚拟应答人进行测分 2507.05890v1 -
267 07-08 How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures Wie man die automatische Spracherkennung auswertet: Vergleich verschiedener Leistungs- und Bias-Maßnahmen 如何评价自动语音识别:比较不同业绩和偏见措施 2507.05885v1 -
268 07-08 Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language Models Heilkräfte von BERT: Wie aufgabenspezifische Feinsteuerung korrupte Sprachmodelle wiederherstellt BERT的愈合能力:如何恢复被损坏的语言模式 2406.14459v2 -
269 07-08 Detecting value-expressive text posts in Russian social media Ermittlung wertwirksamer Textposts in russischen sozialen Medien 在俄罗斯社交媒体中检测表达价值的文字文章 2312.08968v2 -
270 07-08 ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection ALLM4ADD: Entsperren der Fähigkeiten von Audio Large Language Models für Audio Deepfake Detection ALLM4ADD:解锁音频深假探测音频大语言模型的能力 2505.11079v2 -
271 07-08 PDFMathTranslate: Scientific Document Translation Preserving Layouts PDFMathTranslate: Wissenschaftliche Dokumentübersetzung Layouts erhalten PDFMath Translate:科学文件翻译保存布局 2507.03009v2 -
272 07-08 Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge Mit Thinking-LLM-as-a-Richter zu planen und zu evaluieren 学习运用 “ 思考-LLM-作为法官 “ 评估计划和理由 2501.18099v2 -
273 07-08 The Impact of Prompt Programming on Function-Level Code Generation Die Auswirkungen der Prompt-Programmierung auf die Code-Generierung auf Funktionsebene 迅速编制方案对职能层面代码生成的影响 2412.20545v2 -
274 07-08 Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity Affective-ROPTester: Fähigkeit und Bias-Analyse von LLMs bei der Vorhersage der Retinopathie der Vorreife 情感-ROPTester:预测早幼病理反应中LLMs的能力和比值分析 2507.05816v1 -
275 07-08 Bridging Perception and Language: A Systematic Benchmark for LVLMs’ Understanding of Amodal Completion Reports Bridging Perception and Language: Ein systematischer Benchmark für das Verständnis von LVLMs für Amodal Completion Reports 连接感知与语言:LVLMs理解非模态补全报告的系统基准 2507.05799v1 -
276 07-08 FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation FaithfulRAG: Fact-Level-Konfliktmodellierung für kontexttreue retrieval-erweiterte Generation FaithfulRAG:面向上下文忠实的检索增强生成的事实级冲突建模 2506.08938v2 -
277 07-08 MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training MAMUT: Ein neuartiges Framework zur Modifizierung mathematischer Formeln für die Generierung von Spezialdatensätzen für Sprachmodellschulungen MAMUT:为语言模式培训制作专门数据集修改数学公式的新框架 2502.20855v2 -
278 07-08 Flippi: End To End GenAI Assistant for E-Commerce Flippi: Ende bis Ende GenAI Assistent für E-Commerce Flippi: 结束到结束 GenAI 电子商务助手 2507.05788v1 -
279 07-08 Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription Gemeinsame Beamforming und Speaker-Attributed ASR für echte Fern-Mikrofon-Meeting-Transkription 用于真正分布式话语会议的联合光束和发言人与发言人联合致真正分布式话语会议ASR 2410.21849v2 -
280 07-08 Large Language Models Might Not Care What You Are Saying: Prompt Format Beats Descriptions Große Sprachmodelle sind vielleicht egal, was Sie sagen: Prompt Format Beats Beschreibungen 大语言模型可能不在乎你在说什么:快速格式比描述 2408.08780v6 -
281 07-08 DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities DocTalk: Skalierbare Graph-basierte Dialog-Synthese zur Verbesserung von LLM-Konversationsfähigkeiten DocTalk: 用于增强LLM对话能力的可扩展图表对话合成 2507.05750v1 -
282 07-08 GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge GPTKB v1.5: Eine umfassende Wissensbasis für die Erforschung des tatsächlichen LLM-Wissens GPTKB v1.5:探索事实LLM知识的大规模知识库 2507.05740v1 -
283 07-08 Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling Verbesserung der LLM-Zuverlässigkeit durch explizite Wissensgrenzenmodellierung 通过明确知识边界建模增强LLM可靠性 2503.02233v3 -
284 07-08 ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark ContextASR-Bench: Ein umfangreicher Benchmark für kontextuelle Spracherkennung ContextASR-Bench:大规模上下文语音识别基准 2507.05727v1 -
285 07-08 Truth Neurons Wahrheitsneuronen 真理神经元 2505.12182v3 -
286 07-08 Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition Omni-Router: Routing-Entscheidungen in Sparse Mixture-of-Experts für die Spracherkennung teilen Omni-Router: 分享语音识别专家的松散混集决定 2507.05724v1 -
287 07-08 MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment MobileGUI-RL: Mobile GUI-Agent durch Verstärkungslernen im Online-Umfeld fördern MobileGUI-RL:通过在线环境强化学习推进移动图形界面代理 2507.05720v1 -
288 07-08 HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation HIRAG: Hierarchisch-gedankte Instruktion-Tuning-Retrieval-Augmented Generation HIRAG: 高层次研究教学-引导检索-推荐一代 2507.05714v1 -
289 07-08 DRAGON: Dynamic RAG Benchmark On News DRAGON: Dynamischer RAG-Benchmark auf Neuigkeiten DRAGON:动态RAG新闻基准 2507.05713v1 -
290 07-08 RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism RAG-R1 : Förderung der Such- und Begründungsfähigkeiten von LLMs durch Multi-Query-Parallelismus RAG-R1:通过多种克质平行主义鼓励LLMs的搜索和说明能力 2507.02962v2 -
291 07-08 SIGIR 2025 – LiveRAG Challenge Report SIGIR 2025 – LiveRAG Challenge Report SIGIR 2025-LiveRAG挑战报告 2507.04942v2 -
292 07-08 Agentic-R1: Distilled Dual-Strategy Reasoning Agentic-R1: Destillierte Dual-Strategy-Reasoning 剂- R1: 蒸馏的双战略理由 2507.05707v1 -
293 07-08 A Survey on Transformer Context Extension: Approaches and Evaluation Eine Umfrage über Transformer Context Extension: Ansätze und Evaluierung 关于 “ 变换背景推广:方法和评价 “ 的调查 2503.13299v2 -
294 07-08 PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading PulseReddit: Ein neuartiger Reddit-Datensatz für das Benchmarking von MAS im Kryptowährungshandel mit hoher Frequenz PulseReddit:高频加密货币交易中用于确定MAS基准的新Reddit数据集 2506.03861v2 -
295 07-08 Do We Really Need Specialization? Evaluating Generalist Text Embeddings for Zero-Shot Recommendation and Search Brauchen wir wirklich Spezialisierung? Auswertung von Generalisten-Text-Embeddings für Zero-Shot-Empfehlung und Suche 我们真的需要专业化吗?评估零热建议和搜索通用文本嵌入 2507.05006v2 -
296 07-08 Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria KI-Beratung auf Japanisch bewerten: Berater-, Klienten- und Evaluator-Rollen, bewertet nach Kriterien der motivierenden Gesprächsführung 评估日语AI咨询:根据动机访谈标准评估的咨询师、来访者和评价者角色 2507.02950v2 -
297 07-08 AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs AutoTriton: Automatische Triton-Programmierung mit Verstärkungslernen in LLMs AutoTriton: 自动三联编程,在LLMs中进行强化学习 2507.05687v1 -
298 07-08 Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs Smoothie-Qwen: Post-Hoc-Glättung, um Sprach-Bias in mehrsprachigen LLMs zu reduzieren Smoothie-Qwen:通过事后平滑减少多语言LLM中的语言偏见 2507.05686v1 -
299 07-08 BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset BMMR: Ein multimodaler multimodaler multidisziplinärer Datensatz, der großflächig zweisprachig ist BMMR: 大型双语双语多语多式多式多残疾理由数据集 2507.03483v2 -
300 07-08 TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data TuneShield: Eindämmung der Toxizität bei konversatorischer KI während der Feinabstimmung auf nicht vertrauenswürdige Daten TuneShield:在对不可信数据进行微调的同时,在对话式AI中减轻毒性 2507.05660v1 -
301 07-08 SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge SHNU Mehrsprachiges Spracherkennungssystem für INTERSPEECH 2025 MLC-SLM Challenge 面向INTERSPEECH 2025 MLC-SLM挑战赛的SHNU多语言对话语音识别系统 2507.03343v2 -
302 07-08 MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models MEF: Ein Capability-Aware Multi-Encryption Framework zur Bewertung von Schwachstellen in Black-Box Large Language Models MEF: 用于评价黑箱大语言模型脆弱性的能力-软件多加密框架 2505.23404v3 -
303 07-08 ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues? ECom-Bench: Kann LLM Agent Probleme mit dem echten E-Commerce lösen? ECom-Bench:LLM代理能否解决现实世界电子商务客户支助问题? 2507.05639v1 -
304 07-08 Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs Feint and Attack: Achtsamkeitsbasierte Strategien für das Jailbreaking und den Schutz von LLMs Feint和攻击:以关注为基础的破狱和保护LLMs 2410.16327v2 -
305 07-08 Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models Sarkasmus durch unterschiedliche Augen sehen: Multimodale Sarkasmuswahrnehmung in großen Vision-Sprachen-Modellen analysieren 透过不同的眼睛看到讽刺:在大型视觉语言模型中分析多模式讽刺的认知 2503.12149v2 -
306 07-08 SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression SARA: Selektive und adaptive retrieval-augmentierte Generation mit Kontextkompression SARA:有环境压缩的选择性和适应性再取回型 2507.05633v1 -
307 07-08 On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows Über die Rolle von Feedback bei der Testzeitskalierung von agentischen KI-Workflows 关于反馈在AI类制剂工作流量的测试-时间尺度方面作用的反馈 2504.01931v4 -
308 07-08 Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent Adsorb-Agent: Autonome Identifizierung stabiler Adsorptionskonfigurationen über Large Language Model Agent Adsorb-Agent:通过大语言模式代理自动识别稳定吸附配置 2410.16658v4 -
309 07-08 Flipping Knowledge Distillation: Leveraging Small Models’ Expertise to Enhance LLMs in Text Matching Flipping Knowledge Destillation: Kompetenz kleiner Modelle nutzen, um LLMs in Text Matching zu verbessern 减轻知识蒸馏:利用小型模型的专长加强文本匹配中的LLM 2507.05617v1 -
310 07-08 Self-Review Framework for Enhancing Instruction Following Capability of LLM Selbst-Review-Rahmen für die Verbesserung der Anleitung nach der Fähigkeit von LLM 根据LLM能力加强教学的自我审查框架 2507.05598v1 -
311 07-08 Efficient Detection of Intermittent Job Failures Using Few-Shot Learning Effiziente Erkennung intermittierender Job-Fälle durch wenig scharfes Lernen 利用很少热的学习方法有效检测间歇性工作失败 2507.04173v2 -
312 07-08 Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation Low-Rank und Sparse Modell Zusammenführung für Multi-Lingual Spracherkennung und Übersetzung 多语言语音识别和翻译的低兰克和松散模式合并 2502.17380v3 -
313 07-08 The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation Die Landschaft der Memorierung in LLMs: Mechanismen, Messung und Milderung LLM中记忆的全景:机制、衡量和缓解 2507.05578v1 -
314 07-08 Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models Feature-Extraktion und -Lenkung für eine verbesserte Kettenbildung in Sprachmodellen 语言模型中强化研究链理由的特征采掘和指南 2505.15634v3 -
315 07-08 Offline Learning and Forgetting for Reasoning with Large Language Models Offline-Lernen und Vergessen für die Vernunft mit großen Sprachmodellen 利用大语言模式进行离线学习和遗忘 2504.11364v3 -
316 07-08 Beyond Retrieval: Ensembling Cross-Encoders and GPT Rerankers with LLMs for Biomedical QA Beyond Retrieval: Zusammenbauen von Cross-Encodern und GPT-Rerankern mit LLMs für biomedizinische QA 超出检索范围:利用生物医学质量评估LMs将跨环境者和GPT再生者与生物医学质量评估LMs组合在一起 2507.05577v1 -
317 07-08 Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index Infini-gram mini: Genaue n-gram Suche auf der Internetskala mit FM-Index Infini-gram 微型: 使用 FM- Index 的 Internet 比例尺精确的 n 克搜索 2506.12229v2 -
318 07-08 OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens OLMoTrace: Rückverfolgung von Sprachmodell-Ausgaben zu Billionen von Trainings-Tokens OLMoTrace:将语言模型输出追溯到数万亿训练词元 2504.07096v2 -
319 07-08 Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS Verbesserung der Testzeitskalierung großer Sprachmodelle mit hierarchischen Retrieval-Augmented MCTS 强化高语言模型的测试时间比量,采用等级检索增强的MCTS 2507.05557v1 -
320 07-07 (1) The Role of Deductive and Inductive Reasoning in Large Language Models Die Rolle der deduktiven und induktiven Vernunft in großen Sprachmodellen 大语种模式中的减量和引引原因的作用 2410.02892v3 -
321 07-07 Bayesian Optimization for Controlled Image Editing via LLMs Bayesische Optimierung für kontrollierte Bildbearbeitung über LLMs Bayesian 通过LLMs优化控制图像编辑 2502.18116v3 -
322 07-07 Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment Conversational Education at Scale: Multi-LLM Agent Workflow für verfahrenstechnisches Lernen und pädagogische Qualitätsbewertung 规模层次的相互教育:程序学习和教学质量评估的多种LLM代理工作流 2507.05528v1 -
323 07-07 Early-Exit and Instant Confidence Translation Quality Estimation Schätzung der Qualität der Früh-Ausgangs- und Instant-Konfidenz-Übersetzung 提前出国和即时信心翻译质量估算 2502.14429v2 -
324 07-07 Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality Feinkörnige Vision-Sprachenmodellierung für multimodale Trainingsassistenten in Augmented Reality 扩大现实中多模式培训助理的精美愿景-语言建模 2507.05515v1 -
325 07-07 MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation MEIT: Multimodale Elektrokardiogramm-Anleitung Tuning auf großen Sprachmodellen für die Berichtserstellung MEIT:多式电动心电图指示,关于报告生成大语言模式的图示 2403.04945v4 -
326 07-07 Towards Exception Safety Code Generation with Intermediate Representation Agents Framework Auf dem Weg zur Generierung von Ausnahme-Sicherheitscodes mit dem Rahmen für Mittlere Vertretungen 建立具有中间代表代理机构框架的例外安全法规生成框架 2410.06949v3 -
327 07-07 ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding ViGiL3D: Ein sprachlich unterschiedlicher Datensatz für 3D Visual Grounding ViGiL3D:3D视觉定位语言多样性数据集 2501.01366v2 -
328 07-07 On the Semantics of Large Language Models Zur Semantik großer Sprachmodelle 关于大语言模型的语义学 2507.05448v1 -
329 07-07 PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs PhoniTale: Phonologisch begründete mnemonische Generation für typologisch entfernte Sprachpaare PhoniTale:面向类型学上相距较远的语言对的基于音系的记忆法生成 2507.05444v1 -
330 07-07 Gendered Divides in Online Discussions about Reproductive Rights Geschlechtsspezifische Unterschiede in Online-Diskussionen über reproduktive Rechte 关于生殖权利的在线讨论中的两性差距 2507.05443v1 -
331 07-07 DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning DeepRetro: Retrosynthetische Pathway Discovery mit iterativem LLM Reasoning 使用迭代LM 理由解释的回溯合成途径发现 2507.07060v1 -
332 07-07 LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users LLM Hypnose: Nutzung des Benutzerfeedbacks für unautorisierte Wissensinjektion für alle Benutzer LLM Hypnosis:利用用户反馈,为所有用户提供未经授权知识注射 2507.02850v2 -
333 07-07 “Lost-in-the-Later”: Framework for Quantifying Contextual Grounding in Large Language Models “Lost-in-the-Later”: Rahmen für die Quantifizierung kontextueller Erdung in großen Sprachmodellen “Lost-in-the-Later”:大语言模型中量化背景依据框架 2507.05424v1 -
334 07-07 Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning Global lernen, lokal sprechen: Die Lücken in der Mehrsprachigkeit überwinden 《全球学习,本地语言:弥合多种语文理由方面的差距》 2507.05418v1 -
335 07-07 What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning Was würdest du fragen, wenn du zuerst $a^2+b^2=c^2$ sahst? LLM auf Kuriosität-getriebene Fragestellung bewerten 当你第一次看到$a^2+b^2=c^2$时你会问什么?评估LLM的好奇心驱动提问 2409.17172v2 -
336 07-07 Tractable Transformers for Flexible Conditional Generation Tractable Transformer für flexible Bedingte Generation 灵活条件一代的可变变变换器 2502.07616v2 -
337 07-07 Evaluation of OpenAI o1: Opportunities and Challenges of AGI Bewertung von OpenAI o1: Chancen und Herausforderungen der AGI OpenAI o1评估:AGI的机遇与挑战 2409.18486v2 -
338 07-07 Embedding-Based Approaches to Hyperpartisan News Detection Einbetten-basierte Ansätze zu Hyperparteien-Nachrichten-Erkennung 以嵌入式方法探测超党派新闻 2501.01370v3 -
339 07-07 One fish, two fish, but not the whole sea: Alignment reduces language models’ conceptual diversity Ein Fisch, zwei Fische, aber nicht das ganze Meer: Ausrichtung reduziert die konzeptionelle Vielfalt der Sprachmodelle 一、二、一、二鱼,但不是整个海洋:对齐减少了语言模式的概念多样性 2411.04427v3 -
340 07-07 Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences Kontrolle dessen, was Sie teilen: Bewertung von Sprachmodellen Einhaltung von Datenschutzeinstellungen 控制您共享的共享:评估语言模式 遵守隐私优惠 2507.05391v1 -
341 07-07 The Generalization Ridge: Information Flow in Natural Language Generation The Generalization Ridge: Informationsfluss in der natürlichen Sprachgenerierung 通用海脊:自然语言生成的信息流动 2507.05387v1 -
342 07-07 Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training Verstärkte Feinsteuerung mildert natürlich das Vergessen im kontinuierlichen Post-Training 长期培训后忘却的自然缓解措施 2507.05386v1 -
343 07-07 On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study Über die Bias von Next-Token Predictors hin zu systematisch ineffizienter Begründung: Eine Kurzeste-Path-Fallstudie 关于对系统、低效率理性的 “ 下肯预测家 “ 的偏见:最短篇案例研究 2507.05362v1 -
344 07-07 LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks LoRA-Augmented Generation (LAG) für wissensintensive Sprachaufgaben LoRA - 知识密集型语言任务支持代(LAG) 2507.05346v1 -
345 07-07 Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions Speicher in LLM-Agenten über inkrementelle Multi-Turn-Interaktionen bewerten 通过递增多发互动评估LLM代理物的记忆 2507.05257v1 -
346 07-07 Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning Open Vision Reasoner: Übertragen sprachliches Kognitives Verhalten für visuelles Reasoning 开放愿景理由:转移视觉理由的语言认知行为 2507.05255v1 -
347 07-07 Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? Den Richtern zu urteilen: Können große Visions-Sprachmodelle Chartverständnis und Vernunft angemessen bewerten? 判断法官:大视野语言模型能否公正地评估图表的解读和理由? 2505.08468v2 -
348 07-07 Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models Response Attack: Exploiting Contextual Priming to Jailbreak Große Sprachmodelle 回应攻击:利用环境原始化作为破狱大型语言模式 2507.05248v1 -
349 07-07 When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors Wenn Gedankenkette notwendig ist, kämpfen Sprachmodelle um Evade Monitore 当思维链是必要的时候,语言模型与Evade Monitor争吵 2507.05246v1 -
350 07-07 MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents MindFlow: Revolutionierung des E-Commerce-Kundensupports mit multimodalen LLM-Agenten MindFlow:使电子商务客户支持与多式联运LLM代理商实现革命化 2507.05330v1 -
351 07-07 Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models Late Chunking: Kontextuelle Chunk-Embeddings mit Langkontext-Embedding-Modellen 较晚的板块: 使用长文嵌入模型的背景块嵌入式嵌入式 2409.04701v3 -
352 07-07 OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation OPeRA: Ein Datensatz von Beobachtung, Persona, Ratationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation OPERA: 人类在线购物行为模拟观察、人、理由和评估LMLLMs的数据集 2506.05606v3 -
353 07-07 Logit Reweighting for Topic-Focused Summarization Logit-Umgewichtung für thematisch fokussierte Zusammenfassung 用于专题用途汇总的对数重新加权 2507.05235v1 -
354 07-07 The Super Weight in Large Language Models Das Supergewicht in großen Sprachmodellen 大语言模型中的超重 2411.07191v2 -
355 07-07 jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval jina-embeddings-v4: Universelle Einbettungen für multimodale Mehrsprachigkeit jina-embeddings-v4:多语种多式联运回收通用嵌入式 2506.18902v3 -
356 07-07 Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals Erweiterte Induktive Begründung für personalisierte Präferenz-Inferenz von Verhaltenssignalen 从行为信号中推论 2505.18071v2 -
357 07-07 Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities Gemini 2.5: Die Grenze mit fortschrittlichem Reasoning, Multimodalität, langem Kontext und agentischen Fähigkeiten der nächsten Generation verschieben Gemini 2.5:以先进推理、多模态、长上下文和下一代智能体能力推进前沿 2507.06261v1 -
358 07-07 Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward Verbesserung des personalisierten Multi-Turn-Dialogs mit Kuriositätslohn 增强个人化多发人对话与好奇奖的多发对话 2504.03206v2 -
359 07-07 Interleaving Logic and Counting Interleaving Logik und Zählen 中间逻辑和计数 2507.05219v1 -
360 07-07 Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation Überprüfung von wissenschaftlichen Arbeiten für kritische Probleme mit reasoning LLMs: Basisansätze und automatische Bewertung 审查关于有理有理的关键问题的科学文件:基线方法和自动评价 2505.23824v2 -
361 07-07 Pre-Trained Policy Discriminators are General Reward Models Pre-Trained Policy Diskriminatoren sind allgemeine Prämienmodelle 培训前政策歧视者是一般奖励模式 2507.05197v1 -
362 07-07 NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge NativQA Framework: LLMs mit nativem, lokalem und alltäglichem Wissen aktivieren NativQA框架:利用土著、地方和日常知识扶持LMs 2504.05995v2 -
363 07-07 From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations Von Fragmenten zu Fakten: Ein Curriculum-getriebener DPO-Ansatz zur Generierung von Hindi News Veracity Erklärungen 《从零碎到事实:产生印地语新闻的多城市解释:课程驱动的DPO方法》 2507.05179v1 -
364 07-07 SEPSIS: I Can Catch Your Lies – A New Paradigm for Deception Detection SEPSIS: Ich kann Ihre Lügen fangen – Ein neues Paradigma für Täuschungserkennung 我可以抓住你的谎言– 一种新模型 用来侦测欺骗 2312.00292v2 -
365 07-07 Critiques of World Models Kritik an Weltmodellen 世界模式的证明 2507.05169v1 -
366 07-07 Language Models can Self-Improve at State-Value Estimation for Better Search Sprachmodelle können sich bei State-Value-Schätzung für eine bessere Suche selbst improvisieren 语言模型可以在国家估价估算中自我改进,以更好地搜索 2503.02878v2 -
367 07-07 InfoSteer: Steering Information Utility in Language Model Post-Training InfoSteer: Steuerungsinformationsdienst im Sprachmodell nach dem Training InfoSteer:指导性语言示范语言信息工具 2507.05158v1 -
368 07-07 AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models KI Generated Text Detection mit Anleitung Feinabstimmung große Sprache und Transformer-basierte Modelle 使用微调大语言和以变换器为基础的模型生成的指令 2507.05157v1 -
369 07-07 Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization Interpretierbare Mnemonic-Generation für Kanji-Lernen über Erwartungs-Maximierung 通过预期-最大化为Kanji学习提供可解释的中子一代 2507.05137v1 -
370 07-07 SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction SMART: Simulierte Schüler ausgerichtet mit Item Response Theorie für Frage Schwierigkeitsvorhersage SMART: 与 “ 问题困难预测 “ 项目反应理论一致的模拟学生 2507.05129v1 -
371 07-07 An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques Eine Bewertung von großen Sprachmodellen zu Textzusammenfassungsaufgaben mittels Prompt Engineering Techniques 评价关于使用即时工程技术进行文本归纳任务的大语言模型 2507.05123v1 -
372 07-07 Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration Wiederbelebung des Kulturerbes: Ein neuartiger Ansatz für eine umfassende Restaurierung historischer Dokumente 恢复文化遗产:全面恢复历史文件的新办法 2507.05108v1 -
373 07-07 LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review LCDS: Logisch gesteuerte Entladung Zusammenfassung Erzeugung System Unterstützung Quelle Attribution und Experten-Review LCDS: 逻辑控制排放控制简要生成系统支持源来源归属和专家审查 2507.05319v1 -
374 07-07 AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics KI-getriebene Zytomorphologie Bildsynthese für medizinische Diagnostik AI-Driven 用于医学诊断的细胞形态图像合成 2507.05063v1 -
375 07-07 End-to-End Evaluation for Low-Latency Simultaneous Speech Translation End-to-End-Bewertung für simultane Sprachübersetzungen mit geringer Latenz 低长期同声同声发言翻译端对端评价 2308.03415v4 -
376 07-07 Using Large Multimodal Models to Extract Knowledge Components for Knowledge Tracing from Multimedia Question Information Verwendung von großen multimodalen Modellen zum Extrahieren von Wissenskomponenten für die Wissensverfolgung aus Multimedia-Frageinformationen 利用大型多模式模式从多媒体问题信息中提取知识组成部分,以从多媒体问题信息中追踪知识 2409.20167v2 -
377 07-07 Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study Warum kämpfen Open Source LLMs mit Datenanalyse? Eine systematische empirische Studie 开放源码LLMs为何要与数据分析斗争?系统的经验研究 2506.19794v2 -
378 07-07 Verified Language Processing with Hybrid Explainability: A Technical Report Verifizierte Sprachverarbeitung mit hybrider Erklärbarkeit: Ein technischer Bericht 经核实的有混合解释性的语文处理:技术报告 2507.05017v1 -
379 07-07 Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification Co-DETECT: Collaborative Discovery of Edge Cases in der Textklassifikation 共同目的:在文本分类中合作发现边缘情况 2507.05010v1 -
380 07-07 From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems Von der Autonomie zur Agentur: Agentenfahrzeuge für Mensch-zentrierte Mobilitätssysteme 从自主到机构:以人为中心的流动系统机动车辆 2507.04996v1 -
381 07-07 Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models Kann Video LLMs zur Antwort ablehnen? Ausrichtung für Beantwortbarkeit in Video Large Language Models 视频大语言模式的可响应性对齐 2507.04976v1 -
382 07-07 ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation ArtifactsBench: Überbrückung der visuell-interaktiven Lücke in der Bewertung der LLM-Codegenerierung ArtifactsBench:弥合LLM代码生成评估中的视觉互动差距 2507.04952v1 -
383 07-07 ReLoop: “Seeing Twice and Thinking Backwards” via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding ReLoop: “Zweimal sehen und zurückdenken” über Closed-Loop-Training zur Minderung von Halluzinationen im multimodalen Verständnis ReLoop:通过闭环训练“看两次、反向思考”,减轻多模态理解中的幻觉 2507.04943v1 -
384 07-07 Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science? Sind Informations-Retrieval Ansätze gut bei der Harmonisierung Longitudinal Survey Fragen in der Sozialwissenschaft? 信息检索方法在协调社会科学的纵向调查问题方面是否很好? 2504.20679v2 -
385 07-07 Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes Verstehen LLMs die Sicherheit ihrer Eingaben? Schulungsfreie Moderation über Latent Prototypen LLM是否了解其输入的安全性?通过潜在原型进行免训练审核 2502.16174v2 -
386 07-07 O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d’apprentissage pour l’adaptation de modèles de langue à un domaine O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d’apprentissage pour l’adaptation de modèles de langue à un domaine O_FT@EvalLLLM2025:非属域适应适应模式的技术和技术评估比较 2507.04895v1 -
387 07-07 MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction MARBLE: Eine regelbasierte LLM-Reasoning-Engine für die Vorhersage von Unfällen mit Schweregrad 事故严重性预测的多机构、以规则为基础的多机构、基于规则的LLM 理疗引擎 2507.04893v1 -
388 07-07 Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations Emergent Semantics Beyond Token Embeddings: Transformer LMs mit gefrorenen visuellen Unicode-Darstellungen 超越 Tok 嵌入的新兴语义: 具有冷冻视觉统一符号的变形LMs 2507.04886v1 -
389 07-07 Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions Aufbau von offenen retrieval-Konversations-Frage-Antwortsystemen durch Generierung synthetischer Daten und Dekontextualisierung von Benutzerfragen 通过生成合成数据和减少文本用户问题,建立开放式检索连接问答解答系统 2507.04884v1 -
390 07-07 Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite Spanische Texte aus der Vergangenheit transkribieren: Experimente mit Transkribus, Tesseract und Granit 翻录过去西班牙文本:与Transkribus、Tesseract和Granite的实验 2507.04878v1 -
391 07-07 $\textit{Grahak-Nyay:}$ Consumer Grievance Redressal through Large Language Models $\textit{Grahak-Nyay:}$ Consumer Grievance Redressal durch große Sprachmodelle $\textit{Grahak-Nyay:}$ 以大语言模式纠正消费者冤情 2507.04854v1 -
392 07-07 Dialogue-Based Multi-Dimensional Relationship Extraction from Novels Dialogbasierte multidimensionale Beziehungsextraktion aus Romanen 以对话为基础的从新奇中提取的多不同关系 2507.04852v1 -
393 07-07 Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems Spec-TOD: Ein speziell angepasstes LLM-Framework für effiziente aufgabenorientierte Dialogsysteme Spec-TOD:高效以任务为导向的对话系统专门指导性LLM框架 2507.04841v1 -
394 07-07 CritiQ: Mining Data Quality Criteria from Human Preferences CritiQ: Mining Data Quality Criteria from Human Preferences 关键:来自人类首选的采矿数据质量标准 2502.19279v2 -
395 07-07 RewardAnything: Generalizable Principle-Following Reward Models RewardAnything: Verallgemeinerbare, prinzipienfolgende Belohnungsmodelle RewardAnything:可普遍适用的遵循原则的奖赏模式 2506.03637v2 -
396 07-07 BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance BiMa: Auf dem Weg zu Biases Milderung für Text-Video-Retrieval über Szeneelement-Anleitung Bima:通过场景元素指导,争取减缓对文本视频检索的“双轨减缓” 2506.03589v3 -
397 07-07 From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach Von der Vision zur Sprache durch Diagramm der Ereignisse in Raum und Zeit: Ein erklärbarer selbstüberwachter Ansatz 通过空间和时间事件图从视野到语言:可解释的自我监督方法 2507.04815v1 -
398 07-07 A Survey of Pun Generation: Datasets, Evaluations and Methodologies Eine Umfrage der Pun-Generation: Datensätze, Bewertungen und Methoden 对 “ 制作豆类:数据集、评价和方法 “ 的调查 2507.04793v1 -
399 07-07 Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems Qwen vs. Gemma Integration mit Whisper: Eine vergleichende Studie in mehrsprachigen Sprach-LLM-Systemen Qwen与Gemma结合Whisper:多语种SpeechLLM系统比较研究 2506.13596v2 -
400 07-07 Reason to Rote: Rethinking Memorization in Reasoning Grund zu rotieren: Die Erinnerung an die Vernunft neu denken 旋转理由:重新思考在理由方面的记忆 2507.04782v1 -
401 07-07 ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems ABench-Physics: Benchmarking Physical Reasoning in LLMs über hochschwierige und dynamische Physikprobleme ABench-物理:通过高难度和动态物理问题确定LLMLM中物理原因的基准 2507.04766v1 -
402 07-07 Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments Thunder-DeID: Genauer und effizienter De-Identifizierungsrahmen für Urteile des koreanischen Gerichts Thunder-DeID:韩国法院判决的准确和有效的取消识别框架 2506.15266v2 -
403 07-07 CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering CoSteer: Kollaborative Decoding-Time-Personalisierung über lokale Delta-Lenkung 科斯特:通过当地三角洲指导,协作解码-时间个性化 2507.04756v1 -
404 07-07 LLMs as Architects and Critics for Multi-Source Opinion Summarization LLMs als Architekten und Kritiker für die Mehr-Quellen-Stellungnahme Zusammenfassung 作为多来源意见总结的建筑师和批评家的LLM 2507.04751v1 -
405 07-07 A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic Eine Geschichte von zwei Scripts: Transliteration und Nachkorrektur für Judeo-Arabisch 两本脚本的故事:犹太-阿拉伯语的翻写和校正后 2507.04746v1 -
406 07-07 Word stress in self-supervised speech models: A cross-linguistic comparison Wortstress in selbstüberwachten Sprachmodellen: Ein cross-linguistischer Vergleich 自我监督的演讲模式中的言语压力:交叉语言学比较 2507.04738v1 -
407 07-07 Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools Engere Kluft: Überwachtes Feintuning von Open Source LLMs als lebensfähige Alternative zu proprietären Modellen für pädagogische Werkzeuge 缩小差距:监督开放源码LLMs的微调,将其作为替代专有教学工具模型的可行替代物 2507.05305v1 -
408 07-07 “This Suits You the Best”: Query Focused Comparative Explainable Summarization “Das passt Ihnen am besten”: Abfrage konzentriert vergleichende erklärende Zusammenfassung “这适合你最好的”: 查询重点突出的可解释比较摘要 2507.04733v1 -
409 07-07 Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization Textentgiftung: Dateneffizienz, Semantische Erhaltung und Modellverallgemeinerung 文字解毒:数据效率、语义保护和模式通用 2507.01050v2 -
410 07-07 LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework LOOM-Scope: ein umfassender und effizienter Long-cOntext Modellauswertungsrahmen LOOM-范围:一个全面有效的LOng-Ontext示范评价框架 2507.04723v1 -
411 07-07 MAIN: Mutual Alignment Is Necessary for instruction tuning MAIN: Gegenseitige Ausrichtung ist notwendig für die Instruktions-Tuning MAIN: 相互协调是教学调整的必要条件 2504.12913v2 -
412 07-07 Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce Warum wir fühlen, was wir fühlen: Gemeinsame Erkennung von Emotionen und ihre Meinung Auslöser im E-Commerce 我们为什么感到我们的感受:在电子商务中共同发现情感及其见解触发者 2507.04708v1 -
413 07-07 XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL XiYan-SQL: Ein neuartiges Multi-Generator-Framework für Text-zu-SQL XiYan-SQL: 文本到SQL的新颖多驱动器框架 2507.04701v1 -
414 07-07 Markovian Transformers for Informative Language Modeling Markovianische Transformer für die informative Sprachmodellierung 用于信息语言建模的Markovian变形器 2404.18988v6 -
415 07-07 Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading Pensieve Grader: Eine KI-Powered, Ready-to-Use Plattform für mühelose handschriftliche STEM-Grading Pensieve Grader:一个由AI驱动、开箱即用的轻松批改手写STEM作业的平台 2507.01431v2 -
416 07-07 Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages Eka-Eval : Ein umfassender Evaluierungsrahmen für große Sprachmodelle in indischen Sprachen Eka-Eval:印度语大语言模式综合评价框架 2507.01853v2 -
417 07-07 Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models Breach in the Shield: Enthüllen der Schwachstellen großer Sprachmodelle 打破盾牌:消除大语言模式的脆弱性 2504.03714v2 -
418 07-07 R1-RE: Cross-Domain Relationship Extraction with RLVR R1-RE: Cross-Domain-Beziehungsextraktion mit RLVR R1-RE: 与RLVR的跨部关系提取 2507.04642v1 -
419 07-07 Put Teacher in Student’s Shoes: Cross-Distillation for Ultra-compact Model Compression Framework Put Teacher in Studentenschuhen: Cross-Destillation für ultra-kompakte Modellkompressionsrahmen 将教师纳入学生鞋:超常规示范压缩框架的交叉蒸馏 2507.04636v1 -
420 07-07 A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens Ein Text ist mehrere Token wert: Text-Einbetten aus LLMs richtet sich geheim gut mit den Schlüssel-Token 文本“ 值得几个调子” : 文本由LLMs 秘密对齐与密钥连接良好 。 2406.17378v4 -
421 07-07 Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs Wissen-Bewusst Selbst-Korrektion in Sprachmodellen über strukturierte Speichergraphen 通过结构内存图对语言模型进行知识-软件自我校正 2507.04625v1 -
422 07-07 Towards Cost-Effective Reward Guided Text Generation Auf dem Weg zu einer kosteneffizienten, von der Belohnung geleiteten Textgenerierung 实现成本-效益高的回报制制制制制制 2502.04517v2 -
423 07-07 News Source Citing Patterns in AI Search Systems Nachrichten Quelle Zitieren von Mustern in KI-Suchsystemen AI 搜索系统中引用新闻来源模式 2507.05301v1 -
424 07-07 Retain or Reframe? A Computational Framework for the Analysis of Framing in News Articles and Reader Comments Retain or Reframe? Ein rechnerischer Rahmen für die Analyse von Framing in Nachrichtenartikeln und Leserkommentaren 保留还是重新框架? 分析新闻文章和读者评论中形式界定的计算框架 2507.04612v1 -
425 07-07 Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track Position: Machine Learning Konferenzen sollten einen “Refutations and Critiques” Track erstellen 职位:机器学习会议应建立“反驳和批评”轨道 2506.19882v3 -
426 07-07 PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes PRIME: Large Language Model Personalisierung mit kognitiven Gedächtnis- und Gedankenprozessen PRIME:具有认知记忆和思维过程的大语言模式个性模型 2507.04607v1 -
427 07-07 Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M) Structured Captions verbessern die sofortige Einhaltung von Text-zu-Bild-Modellen (Re-LAION-Caption 19M) 结构说明 改进对文本到图像模型的迅速遵守(Re-LAION-Caption 19M) 2507.05300v1 -
428 07-07 Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation Gradient-geführte Aufmerksamkeitskarte Bearbeitung: Auf dem Weg zu einer effizienten kontextuellen Halluzination 逐步制导注意地图编辑:努力有效减少环境幻觉 2503.08963v2 -
429 07-07 VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents VLM2Vec-V2: Advancing Multimodal Embedding für Videos, Bilder und visuelle Dokumente VLM2Vec-V2:推进视频、图像和视觉文件的多式联运嵌入 2507.04590v1 -
430 07-07 Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs Im Inneren sind Sie viele Wölfe: Mit kognitiven Modellen, um Wert-Abwägungen in LLMs zu interpretieren 使用认知模型来解释LLMM中的价值权衡 2506.20666v2 -
431 07-06 (7) Construction Identification and Disambiguation Using BERT: A Case Study of NPN Bauidentifizierung und Abgrenzung mit BERT: Eine Fallstudie der NPN 使用BERT进行建筑识别和拆分:NPN的案例研究 2503.18751v2 -
432 07-06 Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts Nile-Chat: Ägyptische Sprachmodelle für arabische und lateinische Scripte Nile-Chat:阿拉伯文和拉丁文脚本埃及语言模型 2507.04569v1 -
433 07-06 Evaluating LLMs on Real-World Forecasting Against Human Superforecasters Bewertung von LLMs bei der Real-World-Prognose gegen menschliche Superforecaster 评价针对人类超先天预测者进行真实世界预测的LLMs评价项目 2507.04562v1 -
434 07-06 PENCIL: Long Thoughts with Short Memory PENCIL: Lange Gedanken mit kurzer Erinnerung PENCIL:有短记忆的长思考 2503.14337v2 -
435 07-06 DP-Fusion: Token-Level Differentially Private Inference for Large Language Models DP-Fusion: Token-Level Differentially Private Inferenz für große Sprachmodelle DP-Fusion: 大语言模式的调控层面的私人不同推论 2507.04531v1 -
436 07-06 DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging DOTResize: Reduzierung der LLM-Breite durch diskrete Optimale transportbasierte Neuronenverschmelzung DOTRIZ: 通过分立优化运输最佳方式的中子合并减少 LLM 宽度 2507.04517v1 -
437 07-06 Rethinking Table Instruction Tuning Umdenken Tabelle Anleitung Tuning 重新思考表格指令图 2501.14693v3 -
438 07-06 On the Utility of Domain-Adjacent Fine-Tuned Model Ensembles for Few-shot Problems Über die Nützlichkeit von Domain-Adjacent Fine-Tuned Model Ensembles für Probleme mit wenigen Fotos 关于Dome-邻接精度精美设计模型集的效用问题 2406.13720v2 -
439 07-06 AdS: Adapter-state Sharing Framework for Multimodal Sarcasm Detection AdS: Adapter-State-Sharing-Framework für multimodale Sarkasmus-Erkennung AdS:多式碎片探测适应国家共享框架 2507.04508v1 -
440 07-06 Unveiling the Potential of Diffusion Large Language Model in Controllable Generation Enthüllen des Potenzials von Diffusions-Großsprachenmodellen in kontrollierbarer Generation 发挥在可控生产中传播大语言模式的潜力 2507.04504v1 -
441 07-06 Self-Consistency Preference Optimization Selbst-Konsistenz-Präferenz-Optimierung 优化优惠 2411.04109v3 -
442 07-06 A validity-guided workflow for robust large language model research in psychology Ein validitätsgeführter Workflow für robuste große Sprachmodellforschung in der Psychologie 健全的大型语言模式心理学研究有效指导工作流程 2507.04491v1 -
443 07-06 Does Overnight News Explain Overnight Returns? Erklären die Nachrichten über Nacht die Rückkehr? 夜间新闻解释夜间回归吗? 2507.04481v1 -
444 07-06 The role of large language models in UI/UX design: A systematic literature review Die Rolle großer Sprachmodelle im UI/UX-Design: Ein systematischer Literaturbericht 大语言模型在UI/UX设计中的作用:系统文献审查 2507.04469v1 -
445 07-06 Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection Dual Modality-Aware Gated Prompt Tuning für wenige-shot multimodale Sarkasmus-Erkennung 用于小热多式乱石探测的两极式器件器械 Gedt 快速计时 2507.04468v1 -
446 07-06 Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection Denken Sie zweimal vor Ihnen Richter: Mischung von Dual-Reasoning-Experten für multimodale Sarkasmus-Erkennung 在你面前三思而后三思,法官:多式讽刺探测双重理由专家的混合 2507.04458v1 -
447 07-06 GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models GradOT: Training-frei Gradient-erhaltendes Offsite-Tuning für große Sprachmodelle GradOT: 大语言模型的无培训渐进式非现场调整 2507.04455v1 -
448 07-06 Dude, where’s my utterance? Evaluating the effects of automatic segmentation and transcription on CPS detection Alter, wo ist meine Äußerung? Bewertung der Auswirkungen der automatischen Segmentierung und Transkription auf CPS-Erkennung 评估自动分解和转录 对CPS检测的影响 2507.04454v1 -
449 07-06 KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality KnowRL: Erforschendes Wissenswertes Verstärktes Lernen für die Realität KnowRL:探索知识强化学习促进事实质量 2506.19807v2 -
450 07-06 AutoMind: Adaptive Knowledgeable Agent for Automated Data Science AutoMind: Adaptives Knowledgeable Agent für automatisierte Datenwissenschaft 自动Mind:自动数据科学适应性知识代理 2506.10974v2 -
451 07-06 A Linguistic Analysis of Spontaneous Thoughts: Investigating Experiences of Déjà Vu, Unexpected Thoughts, and Involuntary Autobiographical Memories Eine linguistische Analyse von spontanen Gedanken: Untersuchung von Erfahrungen von Déjà Vu, unerwarteten Gedanken und unfreiwilligen autobiographischen Erinnerungen 对自发思想的语言分析:调查Déjá Vu、意外思想和非自愿自传记忆的经验 2507.04439v1 -
452 07-06 Reconstructing Biological Pathways by Applying Selective Incremental Learning to (Very) Small Language Models Rekonstruieren biologischer Pfade durch Anwendung selektiven Inkrementellen Lernens auf (Sehr) kleine Sprachmodelle 通过将选择性增量学习应用到(老式)小型语言模式,重建生物途径 2507.04432v1 -
453 07-06 RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling RAT: RNN-Effizienz und Aufmerksamkeitsgenauigkeit in der Sprachmodellierung überbrücken RAT:在语言建模中弥补区域NNN效率和注意的准确性 2507.04416v1 -
454 07-06 MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind MOMENTE: Ein umfassender multimodaler Benchmark für die Theorie des Geistes 修改:关于精神理论的综合多模式基准 2507.04415v1 -
455 07-06 THM@SimpleText 2025 – Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts THM@SimpleText 2025 – Aufgabe 1.1: Textvereinfachung auf Basis komplexer Bedingungen für Nicht-Experten THM@SPIText 2025-任务1.1:根据非专家的复杂条件重新审议文本简化 2507.04414v1 -
456 07-06 Pretraining Language Models Using Translationese Vorschulung von Sprachmodellen mit Translationese 使用笔译语言的语文模型 2403.13638v3 -
457 07-06 Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models Wahrnehmung, Vernunft, Denken und Plan: Eine Umfrage zu großen multimodalen Vernunftmodellen 观念、理性、思考和计划:关于大型多式联运理由模型的调查 2505.04921v2 -
458 07-06 Have LLMs Made Active Learning Obsolete? Surveying the NLP Community Haben LLMs aktives Lernen überholt? Vermessung der NLP-Gemeinschaft LLM女士是否做过积极学习过时? 2503.09701v2 -
459 07-06 Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition Besser sicher als entschuldigt? Überreaktion Problem der Vision Sprachmodelle in der visuellen Notfallerkennung 是否比抱歉更安全?在视觉紧急识别中过度反应的视觉语言模型问题 2505.15367v2 -
460 07-06 SpiritRAG: A Q&A System for Religion and Spirituality in the United Nations Archive SpiritRAG: Ein Q&A-System für Religion und Spiritualität im Archiv der Vereinten Nationen SpirirRAG:联合国档案馆中的宗教和精神系统 2507.04395v1 -
461 07-06 Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning? Hat das Lernen mathematische Problem-Solving Generalisieren, um breitere Vernunft? 学习数学问题解决是否普遍到更广义的理由? 2507.04391v1 -
462 07-06 Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions Multi-Modal Semantische Parsing für die Interpretation von Grabsteininschriften 用于解释墓碑记的多模式语义分析 2507.04377v1 -
463 07-06 Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs Achtung Slipping: Ein mechanistisches Verständnis von Jailbreak Attacken und Verteidigungen in LLMs 注意力滑动:对非法侵入监狱袭击和自卫的机械理解 2507.04365v1 -
464 07-06 Emergent misalignment as prompt sensitivity: A research note Emergente Fehlausrichtung als prompte Empfindlichkeit: Eine Forschungsnote 新出现的不协调作为迅速的敏感性:研究说明 2507.06253v1 -
465 07-06 Large Language Models’ Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products Variierende Genauigkeit von Large Language Models bei der Anerkennung von risikofördernden und gesundheitsfördernden Sentiments im öffentlichen Gesundheitsdiskurs: Die Fälle von HPV-Impfung und Beheizten Tabakprodukten 大语言模型在认识公共卫生会议中风险促进和健康支持情感的认知的准确性差异:HPV疫苗接种和加热烟草制品案例 2507.04364v1 -
466 07-06 HatePRISM: Policies, Platforms, and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation HatePRISM: Politik, Plattformen und Forschungsintegration. Förderung von NLP für eine proaktive Abmilderung von Hassreden HatePRISM:政策、平台和研究一体化。 推进国家消除仇恨言论计划。 2507.04350v1 -
467 07-06 SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control SmartThinker: Lernen, um zu komprimieren und zu bewahren Vernunft durch Schritt-Level-Length Control SmartThinker: 学会按职级长长控制进行压缩和保留理由 2507.04348v1 -
468 07-06 Computed Tomography Visual Question Answering with Cross-modal Feature Graphing Berechnete Tomographie Visuelle Fragebeantwortung mit modaler Feature-Graphing 计算成的地形学视觉问题,以跨模式特点图解回答 2507.04333v1 -
469 07-06 Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation Erklären Sie mit Visual Keypoints wie ein echter Mentor! Ein Benchmark für multimodale Lösung Erklärung 以像真正的导师一样的视觉关键点解释! 多模式解决方案解释的基准 2504.03197v3 -
470 07-06 No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem Keine Sprachdaten übrig: Eine vergleichende Studie von CJK-Sprachendatensätzen im Hugging Face Ecosystem 无语言数据:CJK 载荷面生态系统中语言数据集比较研究 2507.04329v1 -
471 07-06 HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs HKCanto-Eval: Benchmark für die Bewertung des kantonesischen Sprachverständnisses und kulturellen Verständnisses in LLMs HKCanto-Eval:评估广东语言理解和文化理解基准 2503.12440v2 -
472 07-06 A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment Ein ursächliches Weltmodell für die nächste Token-Vorhersage: GPT in einer kontrollierten Umgebung erforschen 由下到下预测所根据的因果世界模式:在受控制的环境中探索GPT 2412.07446v4 -
473 07-06 A Global Context Mechanism for Sequence Labeling Ein globaler Kontextmechanismus für die Sequenzkennzeichnung 序列标签全球背景机制 2305.19928v6 -
474 07-06 LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop LearnLens: LLM-Enabled Personalisiertes, Curriculum-gerundetes Feedback mit Erziehern im Loop 学习栏:LLM-能够个性化的LLM课程、课程与环中教育工作者的反馈 2507.04295v1 -
475 07-06 Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP Ist es wirklich ein langer Kontext, wenn alles, was man braucht, wieder aufgenommen wird? 您需要的检索是真实的长长范围吗 ? 向真正困难的长宽范围 NLP 2407.00402v4 -
476 07-06 LLMs’ Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements LLMs Leseverständnis wird durch parametrisches Wissen beeinflusst und kämpft mit hypothetischen Aussagen LLM女士的阅读理解受到参数知识和与假言斗争的影响 2404.06283v2 -
477 07-06 GLU Attention Improve Transformer GLU Aufmerksamkeit verbessern Transformer GLU 关注改善变换器 2507.00022v2 -
478 07-06 MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models MASS: Mathematische Datenauswahl über Skill Graphs zur Vorschulung großer Sprachmodelle MASS:通过培训前大语言模型技能图选择数学数据 2503.14917v2 -
479 07-06 EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models EasyEdit2: Ein einfach zu bedienender Steuerungsrahmen für die Bearbeitung großer Sprachmodelle EasyEdit2: 编辑大语言模式的易用指导框架 2504.15133v2 -
480 07-06 On Positional Bias of Faithfulness for Long-form Summarization Auf Positionale Bias der Treue für lange Form Zusammenfassung 关于 “ 长期概述的信仰偏见 “ 的立场 2410.23609v2 -
481 07-06 Fairness Evaluation of Large Language Models in Academic Library Reference Services Fairness-Evaluierung von großen Sprachmodellen in wissenschaftlichen Bibliotheksreferenzdiensten 学术图书馆参考资料服务大语言模型公平评价 2507.04224v1 -
482 07-06 Context Tuning for In-Context Optimization Kontext-Tuning für In-Context-Optimierung Intext 优化的上下文图示 2507.04221v1 -
483 07-06 Towards Understanding the Cognitive Habits of Large Reasoning Models Auf dem Weg zum Verständnis der kognitiven Gewohnheiten großer vernünftiger Modelle 了解大理由模型的认知利用 2506.21571v2 -
484 07-06 SCALE: Towards Collaborative Content Analysis in Social Science with Large Language Model Agents and Human Intervention SCALE: Auf dem Weg zu einer kollaborativen Inhaltsanalyse in der Sozialwissenschaft mit großen Sprachmodellen und menschlicher Intervention SCALE:促进社会科学与大语言模范媒介和人类干预合作内容分析 2502.10937v2 -
485 07-05 (6) SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding SymbolicThought: Integration von Sprachmodellen und symbolischer Begründung für ein konsequentes und interpretierbares menschliches Beziehungsverständnis 象征性探索:整合语文模式和符号理由,促进一致和可解释的人类关系理解 2507.04189v1 -
486 07-05 Phonetic Reconstruction of the Consonant System of Middle Chinese via Mixed Integer Optimization Phonetische Rekonstruktion des Konsonantensystems des Mittelchinesischen durch gemischte Integer-Optimierung 通过混合整数优化对中古汉语辅音系统进行语音重构 2502.04625v2 -
487 07-05 Navigating Speech Recording Collections with AI-Generated Illustrations Navigierende Sprachaufzeichnungssammlungen mit KI-generierten Illustrationen 使用 AI 生成的 说明 管理语音录制收藏 2507.04182v1 -
488 07-05 Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks Frustrierend Einfaches Retrieval verbessert anspruchsvolle, vernünftig-intensive Benchmarks 令人沮丧的简单检索改进挑战、理由说明和密集基准 2507.01297v2 -
489 07-05 Large Language Models for Zero-Shot Multicultural Name Recognition Große Sprachmodelle für die Null-Shot Multicultural Name Recognition 零热多种文化名称识别大语言模式 2507.04149v1 -
490 07-05 The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure Das Medium ist nicht die Botschaft: Deconfounding Text-Embeddings via Linear Concept Erasure 介质不是信息:通过线性概念时代的沉降文本嵌入 2507.01234v2 -
491 07-05 Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies Dissecting Clinical Reasoning in Language Models: Eine vergleichende Studie über Prompts und Modellanpassungsstrategien 语言模型临床理由解剖:快速比较研究和示范适应战略 2507.04142v1 -
492 07-05 From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models Von der Absicht zur Technik: Eine umfassende Taxonomie und Herausforderungen bei der Text-Wasserzeichenbildung für große Sprachmodelle 从意图到技术:用于大语言模式的文本水标记综合分类学和挑战 2406.11106v2 -
493 07-05 Token Level Hallucination Detection via Variance in Language Models Token Level Halluzination Erkennung durch Varianz in Sprachmodellen 通过语言模式差异探测洞穴级幻觉 2507.04137v1 -
494 07-05 An HTR-LLM Workflow for High-Accuracy Transcription and Analysis of Abbreviated Latin Court Hand Ein HTR-LLM-Workflow für hochgenaue Transkription und Analyse von verkürzter lateinischer Hofhand HTR-LLLM 高准确度加密和分析缩写拉丁法院手的工作流程 2507.04132v1 -
495 07-05 In-Context Meta LoRA Generation Meta-LoRA-Generation im Kontext 内文元LoRA 生成 2501.17635v3 -
496 07-05 BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering BYOKG-RAG:知识图解答问题多战略图检索 2507.04127v1 -
497 07-05 Relational inductive biases on attention mechanisms Relationale induktive Verzerrungen der Aufmerksamkeitsmechanismen 对关注机制的关系诱导偏见 2507.04117v1 -
498 07-05 Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching Conversation Forests: Der Schlüssel zur Feinabstimmung großer Sprachmodelle für multi-Turn medizinische Gespräche ist die Verzweigung 对话森林:对多发医学对话的大型语言模型进行精微投资的关键是分流 2507.04099v1 -
499 07-05 MMMOS: Multi-domain Multi-axis Audio Quality Assessment MMMOS: Multi-Domain Multi-Axis-Audio-Qualitätsbewertung MMMOS:多领域多轴音频质量评估 2507.04094v1 -
500 07-05 FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering FairSteer: Inferenzzeit-Debiasing für LLMs mit dynamischer Aktivierungssteuerung Fairsteer:带动态激活指导的LLMs 推论时间偏差 2504.14492v2 -
501 07-05 XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models XISM: ein eXploratory and Interactive Graph Tool zum Visualisieren und Evaluieren von Semantischen Kartenmodellen XISSM:一个可视和评估语义地图模型的振动和交互式图像工具 2507.04070v1 -
502 07-05 Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering Jenseits von unabhängigen Passagen: Adaptive Passage-Kombinations-Retrieval für retrieval Augmented Open-Domain-Fragebeantwortung 超过独立通过通道: 用于检索的适应性被动通道组合检索, 以获取更多开放域 问题解答 2507.04069v1 -
503 07-05 On Fusing ChatGPT and Ensemble Learning in Discontinuous Named Entity Recognition in Health Corpora Über das Verschmelzen von ChatGPT und Ensemble Lernen in diskontinuierlicher, benannter Entitätserkennung in Health Corpora 关于在卫生团体中不同名称称为实体的识别中进行阻燃聊天和组合学习 2412.16976v2 -
504 07-05 FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models FinBERT2: Ein spezieller bidirektionaler Encoder zur Bridging the Gap in Finance-Spezific Deployment of Large Language Models FinBERT2: 缩小在财务方面差距的专门双向双向编码器,具体部署大语言模式 2506.06335v2 -
505 07-05 Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk? KI für Finanzen bewerten: Ist KI bei der Bewertung von Investitionsrisiken glaubwürdig? 评估大赦国际的融资:AI在评估投资风险方面是否可信? 2505.18953v2 -
506 07-05 Patient-Centered RAG for Oncology Visit Aid Following the Ottawa Decision Guide Patientenzentrierte RAG für Onkologie-Besuchshilfe nach dem Ottawa-Entscheidungsleitfaden 遵循渥太华决定指南,以病人为中心,为肿瘤科访问援助提供RAG 2507.04026v1 -
507 07-05 Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance Vergleichende Bewertung von ChatGPT und DeepSeek über zentrale NLP-Aufgaben: Stärken, Schwächen und Domain-spezifische Leistung 国家劳工政策关键任务:力量、弱点和具体具体绩效 2506.18501v3 -
508 07-05 LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models LLMThinkBench: Auf dem Weg zu grundlegenden mathematischen Überlegungen und Überdenken in großen Sprachmodellen LLMTHink Bench:在大语言模型中走向基本数学理由和过度思考 2507.04023v1 -
509 07-05 Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning Umgang mit koreanischen Out-of-Vocabulary-Worte mit Phoneme Repräsentanz Lernen 以电话代言学习方式处理韩语外词汇 2507.04018v1 -
510 07-05 Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition Nunchi-Bench: Benchmarking von Sprachmodellen zur kulturellen Vernunft mit Fokus auf koreanischen Aberglauben Nunchi-Nunchi-Bench:以韩国迷信为重点的文化理由语言模式基准模式 2507.04014v1 -
511 07-05 Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents Easy Dataset: Ein einheitliches und erweiterbares Framework zur Synthese von LLM-Feineinstellungsdaten aus unstrukturierten Dokumenten 简易数据集:从无结构文档中合成LLM精密调试数据的统一和扩展框架 2507.04009v1 -
512 07-05 A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models Eine Umfrage über proaktive Verteidigungsstrategien gegen Fehlinformationen in großen Sprachmodellen 关于针对大语言模型中错误信息的积极防御战略的调查 2507.05288v1 -
513 07-05 UNITYAI-GUARD: Pioneering Toxicity Detection Across Low-Resource Indian Languages UNITYAI-GUARD: Pionierische Toxizitätserkennung über Low-Resource-Indische Sprachen UNITEAAI-GUARD:通过低资源印度语言首次发现有毒毒性 2503.23088v2 -
514 07-05 Improving RL Exploration for LLM Reasoning through Retrospective Replay Verbesserung der RL Exploration für LLM Reasoning durch retrospektive Replay 通过回溯性回放改进LL探索LLM 2504.14363v2 -
515 07-05 Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model Open-Reasoner-Zero: Ein Open Source Ansatz zur Skalierung von Verstärkungslernen am Basismodell 开放反馈-零:基于基础模式的扩大强化学习的开放源码方法 2503.24290v2 -
516 07-05 A Comparative Study of Specialized LLMs as Dense Retrievers Eine vergleichende Studie von spezialisierten LLMs als Dense Retriever 专业专利妇女作为 “ 大量开发 “ 的比较研究 2507.03958v1 -
517 07-05 A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text Modulares unüberwachtes Framework zur Attributerkennung aus unstrukturiertem Text 从无结构文本中识别属性的模块化未经监督的框架 2507.03949v1 -
518 07-05 Dynamic Injection of Entity Knowledge into Dense Retrievers Dynamische Einspritzung von Wissen der Entität in Dense Retriever 将实体知识动态投入重复内容 2507.03922v1 -
519 07-05 Play to Generalize: Learning to Reason Through Game Play Spielen Sie Generalize: Lernen, Vernunft durch Spiel zu lernen 玩一般游戏: 通过玩游戏学习理性 2506.08011v3 -
520 07-05 Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding Sequoia: Skalierbares, Robustes und Hardware-gestütztes Spekulatives Dekodieren 塞奎亚:可缩放、强力和硬件—-有意识的投机性代谢 2402.12374v3 -
521 07-05 Demystifying ChatGPT: How It Masters Genre Recognition ChatGPT entmystifizieren: Wie es die Genreerkennung beherrscht 解密的聊天:如何使各教派获得承认 2507.03875v1 -
522 07-05 OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference OrthoRank: Tokenauswahl über Sink Token Orthogonalität für effiziente LLM-Inferenz OrthoRank:通过 Sink Token 条形对高效LLM 推论进行当量选择 2507.03865v1 -
523 07-04 (5) Voice of a Continent: Mapping Africa’s Speech Technology Frontier Stimme eines Kontinents: Afrikas Rede-Technologie-Grenze kartieren 非洲大陆之声:测绘非洲语音技术前沿 2505.18436v3 -
524 07-04 Relation-Aware Network with Attention-Based Loss for Few-Shot Knowledge Graph Completion Beziehungs-Aware-Netzwerk mit aufmerksamkeitsbasiertem Verlust für wenige-heiße Wissensgraphen-Vervollständigung 几乎没有热知识图补全的基于关注的损失关系软件网络 2306.09519v2 -
525 07-04 Beyond classical and contemporary models: a transformative ai framework for student dropout prediction in distance learning using rag, prompt engineering, and cross-modal fusion Über klassische und zeitgenössische Modelle hinaus: ein transformatives Ai-Framework für die Studienabbrechervorhersage im Fernunterricht mittels Rag, Prompt Engineering und modaler Fusion 超越古典和当代模式:利用抹布、即时工程和跨模式融合法在远程学习中预测学生辍学情况的变革性Ai框架 2507.05285v1 -
526 07-04 Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation Auf dem Weg zu einer fairen RAG: Über die Auswirkungen des fairen Rankings in der retrieval-Augmentierten Generation 走向公平的RAG:公平评级对回升一代人的影响 2409.11598v4 -
527 07-04 Parsing Through Boundaries in Chinese Word Segmentation Parsing durch Grenzen in chinesischen Wortsegmentierung 中文单词分割的边界分割 2503.23091v2 -
528 07-04 Tip of the Tongue Query Elicitation for Simulated Evaluation Tipp der Zungenabfrage-Elikation für Simulierte Auswertung 用于模拟评价的 “ 舌头查询参考 “ 提示 2502.17776v2 -
529 07-04 Alpay Algebra IV: Symbiotic Semantics and the Fixed-Point Convergence of Observer Embeddings Alpay Algebra IV: Symbiotische Semantik und die Fixed-Point-Konvergenz von Beobachter-Embeddings 替代代数四:共生语义学和观察员嵌入的固定点汇合 2507.03774v1 -
530 07-04 Agent-Based Detection and Resolution of Incompleteness and Ambiguity in Interactions with Large Language Models Agentbasierte Erkennung und Auflösung von Unvollständigkeit und Ambiguität in Interaktionen mit großen Sprachmodellen 与大语言模式互动中不完全和不易的代理检测和分辨率 2507.03726v1 -
531 07-04 ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models ODE: Offene Bewertung von Halluzinationen in multimodalen großen Sprachmodellen ODE:多模式大语言模型中的幻觉公开评价 2409.09318v4 -
532 07-04 Prompting as Scientific Inquiry Als wissenschaftliche Untersuchung prompt 作为科学调查 2507.00163v2 -
533 07-04 Controlling Thinking Speed in Reasoning Models Kontrolle der Denkgeschwindigkeit in vernünftigen Modellen 控制理性模型的思考速度 2507.03704v1 -
534 07-04 PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping PHEONA: Ein Bewertungsrahmen für großsprachliche modellbasierte Ansätze zur computergestützten Phenotypisierung PHEONA:基于大语言模型的计算基因组学方法评价框架 2503.19265v3 -
535 07-04 A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and Applications Eine Umfrage von großen Sprachmodellen zu generativen Graph Analytics: Abfragen, Lernen und Anwendungen 关于 “ 生成图表分析分析:查询、学习和应用 “ 的大型语言模型调查 2404.14809v2 -
536 07-04 The Geometries of Truth Are Orthogonal Across Tasks Die Geometrien der Wahrheit sind orthogonal über Aufgaben hinweg 真理的地貌不同任务 2506.08572v2 -
537 07-04 iNews: A Multimodal Dataset for Modeling Personalized Affective Responses to News iNews: Ein multimodaler Datensatz für die Modellierung personalisierter Affective Responses to News iNews: 用于模拟个人对新闻的情感反应的多模式数据集 2503.03335v2 -
538 07-04 STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking STRUKTSENSE: Ein Task-Agnostic Agentic Framework für strukturierte Informationsextraktion mit Mensch-in-The-Loop-Evaluierung und Benchmarking SUCTSENESE: 结构化信息抽取与 “ 网上人 “ 评价和基准确定 2507.03674v1 -
539 07-04 TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection TACOS: Offenes Tagging und vergleichendes Scoring für die Auswahl von Feintuning-Daten TACOS: 用于指示精确调整数据选择的公开贴贴和比较比标 2507.03673v1 -
540 07-04 Recon, Answer, Verify: Agents in Search of Truth Recon, Antwort, Überprüfung: Agenten auf der Suche nach Wahrheit 调查、答复、核查:寻找真相的特工 2507.03671v1 -
541 07-04 Interaction Techniques that Encourage Longer Prompts Can Improve Psychological Ownership when Writing with AI Interaktionstechniken, die länger prompts fördern, können die Psychologische Eigentümerschaft verbessern, wenn sie mit KI schreiben 鼓励较长时间提示的交互技术在与大赦国际撰写文章时能够改善心理自主性 2507.03670v1 -
542 07-04 TRACE: Training and Inference-Time Interpretability Analysis for Language Models TRACE: Trainings- und Inferenz-Zeit-Interpretabilitätsanalyse für Sprachmodelle TRACE: 语文模式培训和推断-时间解释性分析 2507.03668v1 -
543 07-04 Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs Re-Emergent-Missausrichtung: Wie schmale Feintuning-Erodes Sicherheit Ausrichtung in LLMs 重新出现错配: LLM 中如何将安全对齐 2507.03662v1 -
544 07-04 Exploring Robustness of LLMs to Paraphrasing Based on Sociodemographic Factors Erforschung der Robustheit von LLMs zur Paraphrasierung auf der Grundlage soziodemografischer Faktoren 探讨根据社会人口因素将长效LLMs 2501.08276v2 -
545 07-04 Disentangling the Roles of Representation and Selection in Data Pruning Entwirren der Rollen von Repräsentation und Auswahl bei der Datenkorrektur 分离数据谨慎中的代表和甄选作用 2507.03648v1 -
546 07-04 ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving ArithmAttack: Bewertung der Robustheit von LLMs zum geräuschvollen Kontext bei der Lösung von Math-Problemen ArithmAttack:评估理学问题解答时对吵闹背景的LLMs强力 2501.08203v2 -
547 07-04 Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion Verbesserung der Low-Resource-Dialektenklassifikation mittels Retrieval-basierter Sprachumwandlung 利用以检索为基础的语音转换改进低资源 2507.03641v1 -
548 07-04 Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy Ist es Zeit, Prompts als Code zu behandeln? Eine Multi-Use-Fallstudie für Prompt-Optimierung mit DSPy 是否是时候将提示作为代码处理? 使用 DSPy 快速优化的多用途案例研究 2507.03620v1 -
549 07-04 EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge EMERGE: Ein Benchmark für die Aktualisierung von Wissensgraphen mit neu entstehendem Textwissen EMEGERGE: 更新具有新兴文字知识的知识图表的基准 2507.03617v1 -
550 07-04 RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning RAG+: Verbesserung der Retrieval-Augmented Generation mit anwendungsrelevanter Begründung RAG+:加强利用应用程序软件软件软件软件支持的检索-启动生成 2506.11555v3 -
551 07-04 ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs ALAS: Latente Sprach-Text-Ausrichtung für gesprochenes Sprachverständnis in multimodalen LLMs messen ALAS: 计量多种模式LM 中口语语言理解的暗中语音-文本对齐 2505.19937v2 -
552 07-04 Multi-Hop Reasoning for Question Answering with Hyperbolic Representations Multi-Hop-Gründung für die Beantwortung von Fragen mit hyperbolischen Darstellungen 使用超曲解表示式回答问题多理由 2507.03612v1 -
553 07-04 ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation ViMRHP: Ein vietnamesischer Benchmark-Datensatz für die multimodale Überprüfung von Hilfsvorhersage über Human-AI Collaborative Annotation VIMRHP:越南通过人类 – – AI合作说明,为多模式审查 “ 助助预测 “ 建立一个越南基准数据集 2505.07416v2 -
554 07-04 Should We Still Pretrain Encoders with Masked Language Modeling? Sollten wir noch Encoder mit maskierten Sprachmodellen vortrainieren? 我们是否仍应该为带有隐蔽语言建模的编程者预作准备? 2507.00994v2 -
555 07-04 RECA-PD: A Robust Explainable Cross-Attention Method for Speech-based Parkinson’s Disease Classification RECA-PD: Robuste, erklärbare Cross-Attention-Methode zur Klassifizierung sprachbasierter Parkinson-Krankheiten REECA-PD: 以语音为基础的帕金森氏病分类的强有力的可解释交叉注意方法 2507.03594v1 -
556 07-04 Evaluating AI capabilities in detecting conspiracy theories on YouTube Bewertung von KI-Fähigkeiten bei der Entdeckung von Verschwörungstheorien auf YouTube 评价大赦国际在YouTube上发现阴谋论的能力 2505.23570v2 -
557 07-04 Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation Causal-SAM-LLM: Große Sprachmodelle als Ursache für robuste medizinische Segmentierung Causal-SAM-LLLM:大语言模型,作为强力医疗分解的原因 2507.03585v1 -
558 07-04 Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits Ambiguous Terminologie von Preference Optimization auf Post-Edits übersetzen lernen 学习如何通过“优先优化”在编辑后采用“优先优化”来翻译模糊的名词 2507.03580v1 -
559 07-04 Articulatory clarity and variability before and after surgery for tongue cancer Artikulatorische Klarheit und Variabilität vor und nach der Operation bei Zungenkrebs 舌癌手术前后和手术前后的产道清晰度和变异性 2507.03576v1 -
560 07-04 ReviewInstruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models ReviewInstruct: Eine review-getriebene Multi-Turn-Konversations-Erstellungsmethode für große Sprachmodelle 审查指南:大语言模型的复议-驱动多发多发对话生成方法 2505.11010v2 -
561 07-04 H2HTalk: Evaluating Large Language Models as Emotional Companion H2HTalk: Große Sprachmodelle als emotionalen Begleiter bewerten H2H Talk:评价作为情感伴侣的大语言模式 2507.03543v1 -
562 07-04 Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation Aussprache-Lexikon Kostenloses Training für Phoneme-basierte Crosslingual ASR über Joint Stochastic Approximation 通过联合口交近为基于电话的跨语言ASR提供发音免费培训 2507.06249v1 -
563 07-04 AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions AI-VaxGuide: Agentische RAG-basierte LLM für Impfentscheidungen AI-VaxGuide:用于疫苗接种决定的以RAG为基地的AAG剂LLMLM 2507.03493v1 -
564 07-04 Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences Vier Schattierungen der Lebenswissenschaften: Ein Datensatz für Desinformationserkennung in den Lebenswissenschaften 《生命科学的四大结构:生命科学中虚假信息探测数据集》 2507.03488v1 -
565 07-04 Chat2SPaT: A Large Language Model Based Tool for Automating Traffic Signal Control Plan Management Chat2SPaT: Ein auf großen Sprachmodellen basierendes Tool zur Automatisierung der Verkehrssignalsteuerung Chat2SPaaT:以大语言为基础的交通信号控制计划管理自动化示范工具 2507.05283v1 -
566 07-04 Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right Beyond Weaponization: NLP-Sicherheit für mittel- und minderwertige Sprachen im eigenen Recht 武器化之后的武器化之外:国家语言方案保障中、低资源语言自身权利的安全 2507.03473v1 -
567 07-04 CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark CoreCodeBench: Ein konfigurierbarer Multi-Szenario-Repository-Level-Benchmark 核心守则:可配置的多设想仓库一级基准 2507.05281v1 -
568 07-04 Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models Verbesserung der sozialen Determinanten der Gesundheitsdokumentation in französischen EHRs mit großen Sprachmodellen 利用大语言模式,改善法文健康文件的 “ 健康文件 “ 社会决定因素 2507.03433v1 -
569 07-04 Tokenization is Sensitive to Language Variation Tokenisierung ist auf Sprachvariation sensitiv 语法变换对语法变化敏感 2502.15343v2 -
570 07-04 SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation SMCLM: Semantisch sinnvolle Kausale Sprachmodellierung für autoregressive Paraphrase-Generierung SMCLM: 用于自动递减性参数生成的具有广泛意义的原因语言建模 2507.03415v1 -
571 07-04 Graph Repairs with Large Language Models: An Empirical Study Graphische Reparaturen mit großen Sprachmodellen: Eine empirische Studie 使用大语言模型修图:经验研究 2507.03410v1 -
572 07-04 ReservoirChat: Interactive Documentation Enhanced with LLM and Knowledge Graph for ReservoirPy ReservoirChat: Interaktive Dokumentation mit LLM und Wissensdiagramm für ReservoirPy RESSOCWChat:与LLM和知识图增强互动文件 2507.05279v1 -
573 07-04 On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages Zur Expressivität und Längenverallgemeinerung von selektiven State-Space-Modellen auf regulären Sprachen 关于常规语言选择性国家空间模型的表达性和时间长度的一般化 2412.19350v2 -
574 07-04 Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers Verallgemeinerung oder Halluzination? Verstehen von Out-of-Context-Reasoning in Transformers 通化还是幻觉? 理解变异器的逻辑外原因 2506.10887v2 -
575 07-04 Entity-aware Cross-lingual Claim Detection for Automated Fact-checking Entity-aware Cross-lingual Claim Detection for Automated Fact-Checking 用于自动实况调查的有实体意识的跨语言交叉索赔调查 2503.15220v3 -
576 07-04 EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization EAQuant: Verbesserung der Post-Training-Quantisierung für MoE-Modelle durch Experten-Aware-Optimierung EAQuant:通过专家-软件优化,加强培训后对教育部模型的量化 2506.13329v2 -
577 07-04 Making Sense of Korean Sentences: A Comprehensive Evaluation of LLMs through KoSEnd Dataset Gespür für koreanische Sätze: Eine umfassende Bewertung von LLMs durch KoSEnd-Datensatz 韩国判刑感:通过KoSend数据集对LLMs的全面评估 2507.03378v1 -
578 07-04 Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections Implizite Belohnung als Brücke: Eine einheitliche Ansicht von SFT- und DPO-Verbindungen 隐含奖励为桥梁:SFT和DPO连接的统一观点 2507.00018v2 -
579 07-04 WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia WETBench: Ein Benchmark für die Erkennung aufgabenspezifischer maschinengenerierter Texte auf Wikipedia WETBENCH: 维基百科中检测特定任务机器生成文本的基准 2507.03373v1 -
580 07-04 SS-GEN: A Social Story Generation Framework with Large Language Models SS-GEN: Ein Social Story Generation Framework mit großen Sprachmodellen SS-GEN:具有大语言模式的社会历史传承框架 2406.15695v3 -
581 07-04 Backtesting Sentiment Signals for Trading: Evaluating the Viability of Alpha Generation from Sentiment Analysis Backtesting Sentiment Signals for Trading: Bewertung der Lebensfähigkeit der Alpha-Generation aus der Sentiment-Analyse 贸易的后测试感应信号:从感应分析中评估阿尔法一代的可行性 2507.03350v1 -
582 07-04 Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky Disambiguation-Centric Finetuning macht Enterprise Tool-Calling LLMs realistischer und weniger riskant 差异性-中心微调使企业用工具召集LLM 更现实、风险更小 2507.03336v1 -
583 07-04 Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought Hunyuan-TurboS: Große Sprachmodelle durch Mamba-Transformer-Synergie und adaptive Ketten-of-Thought Hunyuan-TurboS:通过Mamba-Transed协同和适应性研究链推进大语言模型 2505.15431v3 -
584 07-04 MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection MT2-CSD: Eine neue Datensatz- und Multi-Semantic Knowledge Fusion Methode zur konversatorischen Stance-Erkennung MT2-CSD: 用于语音稳定探测的新数据集和多语层知识融合方法 2506.21053v2 -
585 07-04 Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs Lest ruhig, denkt laut: Entkoppelung von Verständnis und Vernunft in LLMs 静静地阅读,想阿卢德:解开迷惑和理性 2507.03327v1 -
586 07-04 OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework OpenRLHF: Ein einfach zu bedienendes, skalierbares und leistungsstarkes RLHF-Framework OpenRLHF: 易于使用、可缩放和高绩效RLHF框架 2405.11143v5 -
587 07-04 GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation GRAFT: Ein Graph-basiertes Flow-aware Agentic Framework für die maschinelle Übersetzung auf Dokumentebene GRAFT: 以图表为基础的文件级机器翻译流动感动剂框架 2507.03311v1 -
588 07-04 LTLCrit: A Temporal Logic-based LLM Critic for Safe and Efficient Embodied Agents LTLCrit: Ein auf Temporal Logic basierender LLM-Kritiker für sichere und effiziente einkörperige Wirkstoffe LLLCritri: 一种基于时时逻辑的LLM LLM 用于安全和高效装药剂的Criticr 2507.03293v1 -
589 07-04 NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 NTU Speechlab LLM-basiertes Mehrsprachiges ASR-System für Interspeech MLC-SLM Challenge 2025 NTU Speechlab:面向Interspeech MLC-SLM Challenge 2025的基于LLM的多语言ASR系统 2506.13339v2 -
590 07-04 Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR Bi-direktionale Kontext-verbesserte Sprache Große Sprachmodelle für mehrsprachige Konversations-ASR 多语言对话的ASR双向双向背景强化语言大语言模型 2506.13396v2 -
591 07-04 REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle REINFORCE++: 高效的RLHF对快速模型和奖励模型具有强力的测算法 2501.03262v5 -
592 07-04 MoralBench: Moral Evaluation of LLMs MoralBench: Moralische Bewertung von LLMs 道德:对LLMMs的道德评价 2406.04428v2 -
593 07-04 GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning GDGB: Ein Benchmark für generatives dynamisches Text-Attributed Graph Learning GDGB: 产生动态动态文本图表学习的基准 2507.03267v1 -
594 07-04 KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation KinyaColBERT: Ein Lexisch geerdetes Retrieval-Modell für Low-Resource Retrieval-Augmented Generation KinyaColBERT: 低资源回收-提款一代的以地基为基数的回收模型 2507.03241v1 -
595 07-04 Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models Neural Diskrete Token Repräsentanz Lernen für extreme Token Reduction in Video Große Sprachmodelle 在视频大语言模式中进行关于极端减少视频大语言模式的神经分辨代言人学习 2503.16980v4 -
596 07-04 A Lie-algebraic perspective on Tree-Adjoining Grammars Eine Lie-algebraische Perspektive auf Baum-angrenzende Grammatiken 树邻接语法的李代数视角 2507.03234v1 -
597 07-04 AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs AgentPS: Agentische Prozessüberwachung für Content Moderation mit multimodalen LLMs PS: MMMM 内容调控的代理程序监督 2412.15251v2 -
598 07-04 7B Fully Open Source Moxin-LLM/VLM – From Pretraining to GRPO-based Reinforcement Learning Enhancement 7B Vollständig offene Quelle Moxin-LLM/VLM – Vom Vortraining zur GRPO-basierten Verstärkungs-Lernverbesserung 7B 完全开放源代码 Moxin-LLM/VLM – – 从预培训到基于GRPO的强化学习加强 2412.06845v6 -
599 07-03 (4) VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs VOCABTRIM: Vokabelabgleich für effizientes spekulatives Decodieren in LLMs VOCABTRIM: 有效投机下限的词汇 2506.22694v2 -
600 07-03 Towards Universal Semantics With Large Language Models Hin zu universeller Semantik mit großen Sprachmodellen 走向具有大语言模式的普遍语义 2505.11764v3 -
601 07-03 How Much Content Do LLMs Generate That Induces Cognitive Bias in Users? Wie viel Inhalt erzeugen LLMs, die Kognitive Bias bei Benutzern erzeugen? LLMs如何产生使用者中诱导的认知性偏见的内容? 2507.03194v1 -
602 07-03 Adversarial Manipulation of Reasoning Models using Internal Representations Adversariale Manipulation von Vernunftmodellen mittels interner Repräsentationen 采用内部代表制对理性模型进行反向调整 2507.03167v1 -
603 07-03 MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks MateInfoUB: Ein echter Benchmark für die Prüfung von LLMs in kompetitiven, mehrsprachigen und multimodalen Bildungsaufgaben MateInfoUB:竞争、多种语文和多模式教育任务中测试LMLMS的现实世界基准 2507.03162v1 -
604 07-03 Expert-level validation of AI-generated medical text with scalable language models Validierung von KI-generierten medizinischen Texten auf Expertenebene mit skalierbaren Sprachmodellen 专家一级对AI产生的带有可缩放语言模型的可缩放语言模型的医学文本进行鉴定 2507.03152v1 -
605 07-03 Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models Elemente des Weltwissens (EWoK): Ein von Kognition inspirierter Rahmen zur Bewertung des Grundwissens der Welt in Sprachmodellen 世界知识要素(EWoK): “ 评价基本世界语言模式知识的认知 – – 启发框架 “ 2405.09605v2 -
606 07-03 On the Relationship between Accent Strength and Articulatory Features Über die Beziehung zwischen Akcent Strength und Artikulatory Features Accent 力量与交接特征之间的关系 2507.03149v1 -
607 07-03 DeepGesture: A conversational gesture synthesis system based on emotions and semantics DeepGesture: Ein dialogisches Gesten-Synthesesystem basierend auf Emotionen und Semantik DeepGesture:基于情感和语义的谈话手势合成系统 2507.03147v1 -
608 07-03 From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models Von der Messung zur Milderung: Erforschung der Übertragbarkeit von Debiasing-Ansätzen auf Gender-Bias in maltesischen Sprachmodellen 从衡量到减缓:探讨马耳他语言模式中性别偏见方法对性别偏见的可转让性 2507.03142v1 -
609 07-03 ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models ZuverlässigMath: Benchmark der zuverlässigen mathematischen Vernunft bei großen Sprachmodellen 可靠马斯:关于大语言模式的可靠数学依据基准 2507.03133v1 -
610 07-03 Towards a Psychoanalytic Perspective on VLM Behaviour: A First-step Interpretation with Intriguing Observations Auf dem Weg zu einer psychoanalytischen Perspektive auf VLM-Verhalten: Eine erste Interpretation mit faszinierenden Beobachtungen 对VLM行为进行心理分析:带有引人注意的首步解释 2507.03123v1 -
611 07-03 Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings Federated Learning für die ICD-Klassifizierung mit Leichtbaumodellen und vorgeschulten Einbettungen 具有轻重量型模型和未受过训练的嵌入型综合疾病分类的联邦学习 2507.03122v1 -
612 07-03 RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents RLVER: Verstärktes Lernen mit überprüfbaren emotionalen Belohnungen für einfühlsame Agenten RLVER: 利用可核实情感奖赏加强学习,以培养致病剂 2507.03112v1 -
613 07-03 LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds LIAR: Inference Time Alignment (Best-of-N) in Sekunden an Jailbreak LLMs ausleihen LIAR: 利用诱推时间对准(最佳N级)二秒内侵入监狱的LLMs 2412.05232v3 -
614 07-03 ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization ARF-RLHF: Adaptive Reward-Following für RLHF durch Emotionsgetriebene Selbst-Überwachung und trace-biased Dynamic Optimization ARF-RLHF:通过情感驱动自视和跟踪比亚动态优化,对RLHF进行适应性奖励 2507.03069v1 -
615 07-03 Requirements Elicitation Follow-Up Question Generation Voraussetzungen Elicitation Follow-Up Question Generation 问询后查询 2507.02858v1 -
616 07-03 Answer Matching Outperforms Multiple Choice for Language Model Evaluation Antwort Matching Outperforms Multiple Choice für Sprachmodell-Bewertung 语言模式评价的多种选择 2507.02856v1 -
617 07-03 MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs MOTIF: Modulares Denken durch Verstärkung Feinabstimmung in LLMs MOTIF:通过强化微调在LLM中进行模块思考 2507.02851v1 -
618 07-03 Legal Requirements Translation from Law Rechtliche Voraussetzungen Übersetzung aus dem Recht 法律要求译自法律 2507.02846v1 -
619 07-03 Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection Visual Contextual Attack: Jailbreaking MLLMs mit Image-Driven Context Injection 视觉上下文攻击:带有图像驱动背景注射的破狱MLLMs 2507.02844v1 -
620 07-03 Improved Unbiased Watermark for Large Language Models Verbessertes unvoreingenommenes Wasserzeichen für große Sprachmodelle 改进大语言模型的无偏见水印 2502.11268v2 -
621 07-03 StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason StepHint: Mehrstufige stufenweise Hinweise stärken das Lernen zur Vernunft 步进提示:多级分步骤将强化学习提升到合理 2507.02841v1 -
622 07-03 From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents Von der Web-Suche in Richtung Agentic Deep Research: Incentivizing Search with Reasoning Agents 从网络搜索到代理深层研究:激励使用理性代理进行搜索 2506.18959v3 -
623 07-03 ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning ExPO: Entsperren harter Vernunft mit selbsterklärungsgeführtem Verstärkungslernen ExPO: 以自我剥削指导强化学习来解锁困难理由 2507.02834v1 -
624 07-03 Generalizing Verifiable Instruction Following Verallgemeinern der überprüfbaren Anleitung 普遍适用的可核实说明 2507.02833v1 -
625 07-03 SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model SynapseRoute: Ein Auto-Routen-Schaltrahmen für das Dual-State Large Language Model SynapseRoute:关于两州大语言模式的自动运行切换框架 2507.02822v1 -
626 07-03 Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case Große Sprachmodelle zur Automatisierung der klinischen Datenstandardisierung: HL7 FHIR Use Case 临床数据标准化自动化大语言模型:HL7 FHIR使用案例 2507.03067v1 -
627 07-03 Identification of Potentially Misclassified Crash Narratives using Machine Learning (ML) and Deep Learning (DL) Identifizierung potenziell falsch klassifizierter Crash Narratives mittels Machine Learning (ML) und Deep Learning (DL) 利用机器学习(ML)和深层学习(DL)查明可能误解的崩溃叙述(DL) 2507.03066v1 -
628 07-03 Multimodal Mathematical Reasoning with Diverse Solving Perspective Multimodale mathematische Vernunft mit unterschiedlicher Lösungsperspektive 具有不同解决视角的多模式数学理由 2507.02804v1 -
629 07-03 Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models Ist Vernunft alles, was Sie brauchen? Probieren von Bias im Zeitalter der Vernunft Sprachmodelle 需要什么理由就需要什么理由吗? 2507.02799v1 -
630 07-03 From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding Von langen Videos zu Clips: Ein von Menschen inspiriertes Video-Editing-Framework mit multimodalem Narrative Understanding 从长视频到启动剪贴板:由人启发的视频编辑框架,包含多模式叙述理解 2507.02790v1 -
631 07-03 GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling GPAS: Beschleunigung der Konvergenz des LLM-Vortrainings durch Gradient-Preserving Activation Scaling GPAS:通过 “ 渐进式保留动力扩增 “ 加速汇集LLM预备训练 2506.22049v2 -
632 07-03 Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation Verbesserung klinischer Multiple-Choice-Fragen Benchmarks mit Knowledge Graph Guided Distractor Generierung 加强具有知识图导引引产生体的临床多选择问题基准,加强临床多选择问题基准 2506.00612v3 -
633 07-03 Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs Selbstkorrektionsbank: Enthüllung und Adressierung des Selbstkorrektions-Blindflecks in LLMs 自我校正法官:在LLMs中披露和处理自我校正的盲人点 2507.02778v1 -
634 07-03 DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment DeSTA2.5-Audio: Auf dem Weg zu einem General-Purpose Large Audio Language Model mit selbsterzeugter Cross-Modal Alignment DeSTA2.5-Audio:努力建立具有自发跨模式一致的通用大型音频语言模型 2507.02768v1 -
635 07-03 Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression Batch-Max: Höherer LLM-Durchsatz mit größeren Batch-Größen und KV Cache-Kompression 批量最大:使用大批量大小和 KV缓存压缩的高级 LLM 输送量 2412.05693v3 -
636 07-03 Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens Messung der Granularität des Vowel-Produktionsraumes durch einfach nur produzierbare unterschiedliche (JPD) Limens 仅用可制成差异(JPD)激光测量Vowel 生产空间的颗粒度 2507.02744v1 -
637 07-03 Early Signs of Steganographic Capabilities in Frontier LLMs Frühe Anzeichen von Steganographischen Fähigkeiten in Frontier LLMs 边疆长长体动物能力早期信号 2507.02737v1 -
638 07-03 Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge Mind2Web 2: Agentische Suche mit Agent-as-a-Judge bewerten Mind2Web 2: 与代理法官评估代理搜索 2506.21506v2 -
639 07-03 On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability Über Charakterisierungen für die Sprachgenerierung: Interplay von Halluzinationen, Breadth und Stabilität 语言生成特征:幻觉、面包和稳定之间的相互作用 2412.18530v2 -
640 07-03 Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation Next-Token-Vorhersage-Aufgabe setzt eine optimale Datenbestellung für LLM-Training in Proof Generation voraus 假定为校实生成的LLM培训提供最佳数据排序 2411.00863v2 -
641 07-03 Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers Können LLMs kritische Einschränkungen innerhalb der wissenschaftlichen Forschung identifizieren? Eine systematische Bewertung von KI-Forschungspapieren LLMs能否查明科学研究中的关键限制? 对AI研究文件的系统评估 2507.02694v1 -
642 07-03 Exploring Gender Bias Beyond Occupational Titles Erforschen von Gender-Bias über Berufsbezeichnungen hinaus 探索职业职称之外的性别偏见 2507.02679v1 -
643 07-03 Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning Code2Logic: Game-Code-getriebene Datensynthese zur Verbesserung von VLMs Allgemeine Begründung 代码2Llogic: 用于增强VLMs一般理由的游戏-代码-驱动数据合成 2505.13886v2 -
644 07-03 ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning ASDA: Audio-Spektrogramm Differential-Achtungsmechanismus für selbstüberwachtes Repräsentationslernen ASDA:自我监督代表制学习的听觉分光差异关注机制 2507.02666v1 -
645 07-03 An Adaptive Supervised Contrastive Learning Framework for Implicit Sexism Detection in Digital Social Networks Ein adaptives, überwachtes, kontrastives Lernkonzept für implizite Sexismus-Erkennung in digitalen sozialen Netzwerken 在数字社会网络中发现隐含性别主义的适应性监督差异学习框架 2507.05271v1 -
646 07-03 OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding OmniDraft: Ein Cross-Vocabulary, Online Adaptive Drafter für die gerätespezifische Dekodierung 总括草案:跨词汇、在线在线可调适性套用投机下限设计图纸 2507.02659v1 -
647 07-03 Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search Entkoppelte Planung und Ausführung: Ein Hierarchisches Reasoning-Framework für tiefe Suche 分解的规划和执行:深海搜索的等级理据框架 2507.02652v1 -
648 07-03 Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory Strategische Intelligenz in großen Sprachmodellen: Beweise aus der evolutionären Spieltheorie 大语言模型战略情报:进化游戏理论的证据 2507.02618v1 -
649 07-03 Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur 以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测 2506.08713v2 -
650 07-03 Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs Symbolisch oder numerisch? Physik-Probleme verstehen, die LLMs aufklären 理解在理赔中解决物理问题 2507.01334v2 -
651 07-03 MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion MPF: Sprachmodelle nach der Bereitstellung über Multi Perspective Fusion ausrichten und abgrenzen MPF:通过多视角融合进行部署后调整和取消对语言模式的偏见 2507.02595v1 -
652 07-03 MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration MedAide: Informationsfusion und Anatomie von medizinischen Intents über LLM-basierte Agent Collaboration 医学辅助:通过以LLM为基地的合作公司代理进行医疗成瘾者的信息汇集和解剖 2410.12532v3 -
653 07-03 Revisiting Active Learning under (Human) Label Variation Aktives Lernen unter (menschlichen) Label-Varianten 在(人)标签标签变换下重新审查积极学习 2507.02593v1 -
654 07-03 WebSailor: Navigating Super-human Reasoning for Web Agent WebSailor: Navigieren Super-Mensch Vernunft für Web Agent Web 服务员: 为 Web 代理导航超人理由 2507.02592v1 -
655 07-03 AI Flow: Perspectives, Scenarios, and Approaches AI Flow: Perspektiven, Szenarien und Ansätze AI 流动:观点、设想和方法 2506.12479v2 -
656 07-03 Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization Verbesserung der LLM-Gründung für die Erkennung von Schwachstellen über die Optimierung der Gruppenrelative Policy 通过群体相对政策优化改进通过群体相对政策优化查明脆弱性的改进LLM 改进LLM理由 2507.03051v1 -
657 07-03 Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs 跨语言旅行:多模式LLM中跨语言一致基准 2505.15075v2 -
658 07-03 Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning Selbstgesteuerte Prozess-Reward-Optimierung mit neu definiertem Schrittweiser Vorteil für Prozess-Verstärkungs-Lernen 自指导流程向上优化,具有重新定义的逐步改进的流程强化学习优势 2507.01551v2 -
659 07-03 IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders IndianBailJudgments-1200: Ein Multi-Attribut-Datensatz für legale NLP auf indischen Bail-Aufträgen IndianBailJudgments-1200:印度保释令法律NLP多属性数据集 2507.02506v1 -
660 07-03 Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation Counterfactual Tuning für die zeitliche Sensitivitätssteigerung in der auf Großsprachenmodellen basierenden Empfehlung 以基于大语言的示范建议为基础的提高时时敏感度反事实推算 2507.03047v1 -
661 07-03 Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack Robustheit von Fehlinformations-Klassifikationssystemen zu Adversarial-Beispielen durch BeamAttack 通过“BeamAttack”进行错误信息分类系统对反向实例的强力 2506.23661v2 -
662 07-03 Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer Task Prompt Vektoren: Effektive Initialisierung durch Multi-Task Soft-Prompt Transfer 任务提示矢量 : 通过多任务软性即时传输实现有效的初始化 2408.01119v3 -
663 07-03 Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants Hanzi als Narrative Bridges herstellen: Ein KI-Co-Creation-Workshop für ältere Migranten 将Hanzi编成叙述性桥梁:大赦国际为老年移民举办的共同创造讲习班 2507.01548v2 -
664 07-03 A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages Ein Kochbuch für die gemeinschaftsorientierte Datenerfassung von schwachen Sprachkenntnissen in LowResource-Sprachen 社区驱动的低成本低资源语言有缺陷演讲数据收集手册 2507.02428v1 -
665 07-03 Delving into LLM-assisted writing in biomedical publications through excess vocabulary Eintauchen in LLM-unterstütztes Schreiben in biomedizinischen Publikationen durch überschüssiges Vokabular 通过超量词汇,在生物医学出版物中进行LLM协助撰写 2406.07016v5 -
666 07-03 K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function K-Funktion: Gemeinsame Aussprache Transkription und Feedback zur Bewertung der Kindersprache-Funktion K-功能:用于评估儿童语言功能的联合读音记录和反馈 2507.03043v1 -
667 07-03 Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability Benchmarking Akan ASR-Modelle über Domain-spezifische Datensätze: Eine vergleichende Bewertung von Leistung, Skalierbarkeit und Anpassungsfähigkeit 确定Akan ASR模型基准的全域具体数据集:业绩比较评价、可缩放性和可调适性 2507.02407v1 -
668 07-03 AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation AIn’t Nothing But a Survey? Mit großen Sprachmodellen für die Codierung Deutsch Open-Ended Survey Responses on Survey Motivation 使用大语言模型对德国关于调查动机的开放式调查答复进行编码 2506.14634v3 -
669 07-03 Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction Dynamischer Langzeit-Speicherspeicher auf Kurzzeit-Speicherbasis für Langzeit-LLM-Interaktion 动态长地平线LLM互动的短期内存存储 2507.03042v1 -
670 07-03 JoyTTS: LLM-based Spoken Chatbot With Voice Cloning JoyTTS: LLM-basierter gesprochener Chatbot mit Voice Cloning 以LLM为基地的 “ 配有语音克隆的口音聊天机器人 “ 2507.02380v1 -
671 07-03 Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection Effiziente Code-LLM-Schulung über Distribution-Konsistenz und Diversity-Aware-Datenauswahl 通过分配和多样性软件数据选择进行高效率的守则LLM培训 2507.02378v1 -
672 07-03 QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers QFFN-BERT: Eine empirische Studie über Tiefe, Leistung und Dateneffizienz in hybriden Quantum-Klassischen Transformern QFFN-BERT:对混合量子-分类变异器的深度、性能和数据效率的经验研究 2507.02364v1 -
673 07-03 Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning Verbesserung der Robustheit der distantly-überwachten Anerkennung von Personen mit Namen durch unsicheres Lehrerlernen und studentisch-studentisches kollaboratives Lernen 通过不确定-软件教师学习和学生-学生合作学习,提高以不确定-软件教师学习和学生-学生合作学习的方式,提高以不确定-软件命名的实体识别的力度 2311.08010v3 -
674 07-03 Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models Coling-UniA bei SciVQA 2025: Wenig-heißes Beispiel Retrieval und Vertrauen-informierte Montage für multimodale große Sprachmodelle 在SciVQA 2025 SciVQA 的Coling-UniA:多式大语言模型的很少热实例检索和信任化组合 2507.02357v1 -
675 07-03 Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation Einschließlich LLMs für großräumige Urban Complex Mobility Simulation 大型城市综合流动模拟项目LLMs 2505.21880v2 -
676 07-03 Decision-Oriented Text Evaluation Entscheidungsorientierte Textbewertung 注重决定的案文评价 2507.01923v2 -
677 07-03 Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs Token Prepending: Ein trainingsfreier Ansatz zur Eliziierung besserer Sentence-Embeddings von LLMs Token Predudo:从LLM女士那里采用不培训办法,使判刑内容更好地嵌入Elibear 2412.11556v2 -
678 07-03 Layered Insights: Generalizable Analysis of Authorial Style by Leveraging All Transformer Layers Layered Insights: Generalisierbare Analyse des Autorial Styles durch Hebelisierung aller Transformer Layers 图层透视: 通过利用所有变换层对文件样式的通用分析 2503.00958v2 -
679 07-03 Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy Skywork-Reward-V2:通过人类-AI协同增强优先数据曲线 2507.01352v2 -
680 07-03 Cautious Next Token Prediction Vorsichtige nächste Zeichen Vorhersage 谨慎的次下 Tok 预测 2507.03038v1 -
681 07-03 Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach Ausrichten von gefrorenen LLMs durch Verstärkungslernen: Ein iteratives Reweight-then-Optimize-Ansatz 通过强化学习将冻结的LLMs与 “ 强化学习:一种过渡性再加权再优化方法 “ 相匹配 2506.17828v2 -
682 07-03 Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding Fast-dLLM: Trainingsfreie Beschleunigung von Diffusion LLM durch Ermöglichen von KV Cache und Paralleldecoding 快速dLLM:通过授权 KV 缓存和平行编码加速免培训传播LLM 2505.22618v3 -
683 07-03 Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient Bypass Back-Propagation: Optimierungsbasiertes Structural Pruning für große Sprachmodelle über Policy Gradient Bypass 后回通信:通过 “ 政策梯度 “ 优化基于优化的结构结构,为大语言模式提供缓冲 2406.10576v3 -
684 07-03 DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning DoMIX: Ein effizientes Framework zur Nutzung von Domain-Wissen im Feintuning DoMIX:一个在微调中利用域知识的有效框架 2507.02302v1 -
685 07-03 Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models Commander-GPT: Die Fähigkeit von Multi-Modal Large Language Models, den Sarkasmus vollständig zu entleeren GPT指挥官:完全解除多模式大语言模型的讽刺性探测能力 2503.18681v3 -
686 07-03 Prompt-Guided Turn-Taking Prediction Prompt-geführte Turn-Taking-Vorhersage 即时指导的回转预测 2506.21191v2 -
687 07-03 Optimal strategies to perform multilingual analysis of social content for a novel dataset in the tourism domain Optimale Strategien zur mehrsprachigen Analyse sozialer Inhalte für einen neuartigen Datensatz im Tourismusbereich 为旅游领域新数据集的社会内容进行多语种社会内容分析的最佳最佳战略 2311.14727v2 -
688 07-03 Seeing Through Green: Text-Based Classification and the Firm’s Returns from Green Patents Durch Grün sehen: Textbasierte Klassifizierung und die Rückkehr der Firma aus grünen Patenten 通过 “ 绿色观光:基于文本的分类和公司从绿色专利的回报 “ 2507.02287v1 -
689 07-03 Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments Kausales Repräsentationslernen mit generativer Künstlicher Intelligenz: Anwendung auf Texte als Behandlungen 产生人工智能的因果代表性学习:应用文字作为治疗 2410.00903v3 -
690 07-03 SMARTe: Slot-based Method for Accountable Relational Triple extraction SMARTe: Slot-basierte Methode für die relationale Triple-Extraktion SMARTE: 衡算关系三重采掘的基于固态方法 2504.12816v3 -
691 07-03 MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent MemAgent: Umgestalten von Langkontext-LLM mit Multi-Conv RL-basierten Speicheragenten MemAgent: 与基于多Conv RL的内存代理重塑长文本LLM 2507.02259v1 -
692 07-03 Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks Schaltungs-Tuning: Mechanistischer Ansatz zur Identifizierung von Parameter Redundanz und Feinsteuerung neuraler Netzwerke 电路调控:确定参数冗余和精微调整神经网络的机械化方法 2502.06106v2 -
693 07-03 Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies Mixture of Reasonings: Große Sprachmodelle mit adaptiven Strategien zur Vernunft bringen 理由混合:与适应战略一道教授大语言模式 2507.00606v2 -
694 07-03 Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation Schutz der Privatsphäre, Erhöhung der Zugänglichkeit und Senkung der Kosten: Ein On-Device-Modell für künstliche Intelligenz für medizinische Transkription und Notizgenerierung 维护隐私、增加无障碍和降低成本:医疗转移和笔记制作的现场人工人工智能模型 2507.03033v1 -
695 07-03 GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons GDC Cohort Copilot: Ein KI-Copilot für die Kuratierung von Kohorten aus den Genomic Data Commons GDC Cohort Cohort 副驾驶:AI 基因组数据共同点的Curate Choorts联合驾驶员 2507.02221v1 -
696 07-03 SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers SciGA: Ein umfassender Datensatz zur Gestaltung grafischer Abstracts in wissenschaftlichen Papieren SciGA: 用于设计学术论文制图摘要的综合数据集 2507.02212v1
Article 0
Title@2025-07-10 (4): Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Title: Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology | Rückverfolgbare Beweise Verbesserte visuelle Grundierung: Bewertung und Methodik | 增强视觉依据的理由:评价和方法 2507.07999v1 |
Authors (12): Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human “thinking with images”. However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reaches 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
OpenAI-o3 先驱视觉推理等模型通过动态地参照视觉区域来动态地参照视觉区域,就像人类的“用图像思考”一样。然而,没有基准来全面评价这些能力。然而,我们提出“TreaBench”(可追踪证据评估基准),这是建立在以下三项原则基础上的诊断基准:(1) 对复杂场景的微妙目标有集中的视觉认识,(2) 通过捆绑框评估可以追踪的证据,(3) 测试物体相互作用和空间等级的第二阶推理,超越简单天体定位的简单天体定位。对密集天体图像进行优先排序,我们最初对SA-1B的1K高品质图像进行抽样抽样抽样,并纳入8个LMM专家手动地说明问题、候选选项和每个图像的答案。在质量控制的三个阶段后,TreaineB包括40个挑战视觉问答配对,甚至最先进的模型与这个基准挣扎了60%的准确度,例如,OpenAI-o3级评分数只有54.87。此外,我们引入了TreVRS(可识别证据强化地基推理学),从监督本地和推理学的训练模式,从监督本地和推理学,共同地基点, 学习了基础推路路路路路路路, 。
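The abstract above mentions "traceable evidence via bounding box evaluation" without stating the scoring rule. As a rough illustration only, the sketch below uses intersection-over-union (IoU), a common overlap measure for axis-aligned boxes. Treating IoU as the metric is an assumption made here, not something the abstract confirms, and the function name `box_iou` and the `(x1, y1, x2, y2)` box layout are hypothetical.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).

    Illustrative only: the abstract does not say which overlap metric
    TreeBench uses, so IoU is an assumption made for this sketch.
    """
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])

    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs get a score of 0 instead of dividing by zero.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)   # hypothetical box cited by a model
    annotated = (1, 1, 3, 3)   # hypothetical human-annotated box
    print(box_iou(predicted, annotated))
```

A predicted evidence box could then be counted as matching an annotated one when the score passes some threshold; any such threshold would also be a choice of this sketch rather than of the paper.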
Article 1
Title@2025-07-10 (4): PyVision: Agentic Vision with Dynamic Tooling
Title: PyVision: Agentic Vision with Dynamic Tooling | PyVision: Agentische Vision mit dynamischem Werkzeug | 视景:带有动态工具的 “ 动态展望 “ 。 2507.07998v1 |
Authors (7): Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei
LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.
LLMS越来越多地被部署为代理商、能够规划、推理和动态调用外部工具的系统。 但是,在视觉推理中,先前的做法在很大程度上仍然受到预先定义的工作流程和静态工具的限制。 在本报告中,我们展示了PyVision,这是一个互动的、多方向的框架,使MLLMS能够自主生成、执行和完善适合手头任务、灵活和可解释的解决问题的基于Python的工具。我们开发了PyVision所创造的工具的分类,并分析了其在不同基准组中的使用情况。从数量上看,PyVision实现了一致的业绩收益,V* 将GPT-4-4.1 +7.8% 提高 V* +7.8%,VLaude-4.0-Sonnet +31.1% 提高VLMSAreBlind-mini。 这些结果显示了更广泛的转变:动态工具允许模型不仅使用工具,而且还可以发明工具,进而向更具说服力的视觉推理。
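The generate, execute, and refine cycle described in the abstract above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not PyVision's implementation: `run_tool_loop`, the `propose` callback (standing in for the MLLM), and the convention that generated code stores its answer in a variable named `result` are all hypothetical.

```python
import traceback
from typing import Any, Callable, Dict, Optional

# Hypothetical model interface: (task, feedback from the last failure) -> Python source.
Proposer = Callable[[str, Optional[str]], str]


def run_tool_loop(propose: Proposer, task: str, max_turns: int = 3) -> Any:
    """Ask for tool code, run it, and feed errors back until it succeeds."""
    feedback: Optional[str] = None
    for _ in range(max_turns):
        source = propose(task, feedback)
        namespace: Dict[str, Any] = {}
        try:
            # NOTE: exec runs the code unsandboxed; a real system would isolate it.
            exec(source, namespace)
            # Assumed convention: the tool leaves its answer in `result`.
            return namespace["result"]
        except Exception:
            # The traceback becomes the feedback for the next refinement turn.
            feedback = traceback.format_exc()
    raise RuntimeError(f"no working tool after {max_turns} turns")


def stub_propose(task: str, feedback: Optional[str]) -> str:
    """Stand-in for a model: the first attempt is buggy, the retry is fixed."""
    if feedback is None:
        return "result = undefined_name + 1"
    return "result = sum([1, 2, 3])"
```

With the stub, the first turn fails with a `NameError`, the traceback is passed back, and the second turn returns a value. Passing the raw traceback as feedback is one simple design choice; a real agent might summarize or filter it.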
Article 2
Title@2025-07-10 (4): Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Title: Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Operationalisierung eines Bedrohungsmodells für das Red-Teaming großer Sprachmodelle (LLMs) | 实施 “ 红色组合大语言模型威胁模型 “ ; 2407.14937v2 |
Authors (10): Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, NhatHai Phan
Creating secure and resilient applications with large language models (LLM) requires anticipating, adjusting to, and countering unforeseen threats. Red-teaming has emerged as a critical technique for identifying vulnerabilities in real-world LLM implementations. This paper presents a detailed threat model and provides a systematization of knowledge (SoK) of red-teaming attacks on LLMs. We develop a taxonomy of attacks based on the stages of the LLM development and deployment process and extract various insights from previous research. In addition, we compile methods for defense and practical red-teaming strategies for practitioners. By delineating prominent attack motifs and shedding light on various entry points, this paper provides a framework for improving the security and robustness of LLM-based systems.
利用大型语言模型(LLM)创建安全和具有复原力的应用软件需要预见、适应和应对意外威胁。红队已成为查明现实世界LLM实施过程中的脆弱性的关键技术。本文介绍了一个详细的威胁模型,并提供了对LLMS进行红队袭击的知识系统化(SoK)。我们根据LLM开发和部署过程的各个阶段制定攻击分类,并从以往的研究中提取各种见解。此外,我们汇编了防御方法和从业人员实用的红色战略。通过对突出的攻击运动点进行划线和向各个切入点亮亮光,本文为改进LLM系统的安全性和稳健性提供了一个框架。
Article 3
Title@2025-07-10 (4): Automating Expert-Level Medical Reasoning Evaluation of Large Language Models
Title: Automating Expert-Level Medical Reasoning Evaluation of Large Language Models | Automatisieren von Experten-Level Medical Reasoning Bewertung von großen Sprachmodellen | 对大语言模式进行自动化专家级医疗理由评估 2507.07988v1 |
Authors (19): Shuang Zhou, Wenya Xie, Jiaxi Li, Zaifu Zhan, Meijia Song, Han Yang, Cheyenna Espinoza, Lindsay Welton, Xinnie Mai, Yanwei Jin, Zidu Xu, Yuen-Hei Chung, Yiyun Xing, Meng-Han Tsai, Emma Schaffer, Yucheng Shi, Ninghao Liu, Zirui Liu, Rui Zhang
As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs’ medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs’ medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs’ medical reasoning, advancing their safe and responsible deployment in clinical practice.
随着大语言模型(LLM)日益融入临床决策,确保推理过程透明、可信至关重要。然而,现有针对 LLM 医学推理能力的评估策略要么评估效果不理想,要么可扩展性差,仍然缺乏严格的基准。为此,我们提出 MedThink-Bench,这是一个旨在对 LLM 医学推理进行严格、可解释且可扩展评估的基准。MedThink-Bench 包含覆盖十个医学领域的 500 道高难度问题,每道题都附有专家撰写的逐步推理依据。在此基础上,我们提出 LLM-w-Ref,这是一种新的评估框架,利用细粒度推理依据和 LLM-as-a-Judge 机制,在保持可扩展性的同时以专家级的保真度评估中间推理。实验表明,LLM-w-Ref 与专家判断呈强正相关。通过对十二个最先进的 LLM 进行基准测试,我们发现较小的模型(如 MedGemma-27B)可以超越更大的专有模型(如 OpenAI-o3)。总体而言,MedThink-Bench 为评估 LLM 的医学推理提供了基础性工具,有助于推动其在临床实践中安全、负责任地部署。
Article 4
Title@2025-07-10 (4): Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology
Title: Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology | Leistung und praktische Betrachtung von großen und kleinen Sprachmodellen in der klinischen Entscheidungsunterstützung in der Rheumatologie | 风湿学临床决策支助中大型和小型语言模型的实用性及实用性考虑 2507.07983v1 |
Authors (7): Sabine Felde, Rüdiger Buchkremer, Gamal Chehab, Christian Thielscher, Jörg HW Distler, Matthias Schneider, Jutta G. Richter
Large language models (LLMs) show promise for supporting clinical decision-making in complex fields such as rheumatology. Our evaluation shows that smaller language models (SLMs), combined with retrieval-augmented generation (RAG), achieve higher diagnostic and therapeutic performance than larger models, while requiring substantially less energy and enabling cost-efficient, local deployment. These features are attractive for resource-limited healthcare. However, expert oversight remains essential, as no model consistently reached specialist-level accuracy in rheumatology.
大型语言模型(LLMs)显示了支持风湿学等复杂领域临床决策的前景。我们的评估表明,小型语言模型(SLMs)与检索增强的一代(RAG)相结合,取得了比大模型更高的诊断和治疗性能,同时需要大大降低能量,并有利于成本效率高的本地部署。这些特征对资源有限的医疗保健具有吸引力。然而,专家监督仍然至关重要,因为没有任何模型在风湿学方面一直达到专家水平的准确性。
Article 5
Title@2025-07-10 (4): Why is Your Language Model a Poor Implicit Reward Model?
Title: Why is Your Language Model a Poor Implicit Reward Model? | Warum ist Ihr Sprachmodell ein schlechtes Implizit-Reward-Modell? | 为什么您的语言模式 是一个贫穷的隐含奖赏模式? 2507.07981v1 |
Authors (4): Noam Razin, Yong Lin, Jiarui Yao, Sanjeev Arora
Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
语言模式培训后和发酵管道模式的关键在于奖赏模式。 简洁地说,最近的工作表明,每个语言模式都定义了隐含奖赏模式(IM-RM),而不需要任何建筑变革。然而,与明确奖赏模式(EX-RMs)相比,这种模式往往更加普遍,特别是分配外的。对语言模式的隐含偏见,对隐含的线性表现模式(EX-RMs)应用了专门的线性头来取代隐含的语言模式(EX-RMs),普遍化差距的存在令人费解,因为EX-RMs和IM-RMs几乎是相同的。它们可以使用相同的数据、损失功能和语言模式来进行培训,而且只能在奖赏的计算方式上有所不同。为了从根本上了解不同奖赏模式类型背后的隐含的偏见,我们调查这一差距的根源。我们的主要发现,在理论和实验的支持下,IM-RMs更多地依赖表面的提示。因此,它们通常比象征性分配选择下的EX-RMs(Ex-RMs)更差,以及分配。 此外,我们提供证据证据来反对普遍化差距的替代的替代的假设,因为常规-RMDI-I-I会要求是更强烈的产生更难的行为。
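The abstract above says EX-RMs and IM-RMs "differ only in how the reward is computed". The sketch below contrasts the two computations. It assumes the usual formulations: a linear head over a hidden representation for the explicit reward, and the DPO-style scaled log-likelihood ratio against a reference model for the implicit reward. The function names and the default `beta` are hypothetical, and the paper's exact definitions may differ.

```python
from typing import Sequence


def explicit_reward(hidden: Sequence[float], head_weights: Sequence[float]) -> float:
    """EX-RM sketch: a linear head applied to a hidden representation."""
    if len(hidden) != len(head_weights):
        raise ValueError("hidden state and head must have the same size")
    return sum(h * w for h, w in zip(hidden, head_weights))


def implicit_reward(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.1,
) -> float:
    """IM-RM sketch: beta * (log pi(y|x) - log pi_ref(y|x)).

    Each sequence holds per-token log-probabilities of the same response,
    so the sequence log-likelihood is their sum.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both models must score the same tokens")
    return beta * (sum(policy_logprobs) - sum(ref_logprobs))
```

Note that the implicit reward is built entirely from token-level probabilities, which is at least consistent with the abstract's finding that IM-RMs lean more on token-level cues.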
Article 6
Title@2025-07-10 (4): Long-Form Speech Generation with Spoken Language Models
Title: Long-Form Speech Generation with Spoken Language Models | Langformige Sprachgenerierung mit gesprochenen Sprachmodellen | 具有口言语言模式的长形式语音一代 2412.18603v2 |
Authors (6): Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance level. As we found current spoken language evaluations uninformative, especially in this new long-form setting, we also introduce: LibriSpeech-Long, a benchmark for long-form speech evaluation; new embedding-based and LLM-judged metrics; and quality measurements over length and time. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at https://google.github.io/tacotron/publications/speechssm/.
我们认为,多分钟的语音模拟是长式多媒体生成和音效语音助理的一种要求,但是,无文本的口头语言模型很难在数十秒的时间内生成出貌似合理的语音,因为高时间分辨率的语音符号导致一致性丧失、具有长期序列培训或外推法的建筑问题以及推断时的记忆成本。我们从这些考虑因素中得出语音SSSM,这是第一个语音语言模型家庭,可以在没有文本中间器的单一解码会议上学习长式语音(例如读音或即时语音的16分钟)并抽样。 SpealesSSMMMSM利用线性时间序列模型的最新进展,在多分钟的代人之间以一致性和效率大大超过目前的变换LMMs,同时仍然在发音水平上匹配。我们发现当前语言评价缺乏信息,特别是在这个新的长式设置环境中,我们还介绍:LibSpeech-Long,一个长式语音评价的基准;新的嵌入式和LM-判分数度度度度测量;以及超越当前变式/时间的高质量测量标准。
Article 7
Title@2025-07-10 (4): Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Title: Watermarking Degrades Alignment in Language Models: Analysis and Mitigation | Wasserzeichen degradiert Ausrichtung in Sprachmodellen: Analyse und Milderung | 水印会削弱语言模型的对齐:分析与缓解 2506.04462v2 |
Authors (3): Apurv Verma, NhatHai Phan, Shubhendu Trivedi
Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.
大型语言模型(LLMS)的水标记技术可以对产出质量产生重大影响,但是它们对真实性、安全性和帮助性的影响仍然严重不足。本文件对两种流行的水标记方法-Gumberl和KGW如何影响四个对齐的LLMS的这些核心校准特性进行了系统分析。我们的实验揭示了两种截然不同的退化模式:保护减少,因为加强帮助性会破坏模型安全;保护放大,因为过度谨慎会减少模型的帮助性。这些模式产生于标志性分配的水标记导致的改变,揭示了调整目标之间存在的根本紧张关系。为了减轻这些退化,我们建议调整抽取(AR),即一种推断性时间取样方法,使用外部奖励模式来恢复校准。我们从理论上对预期的奖励得分的改进进行较低约束,因为抽样规模的增加,而且经验表明,仅仅2-4个有标志的代人有效地恢复或超过基线(无水标记的)校准。为了克服标准Gumber水标记的响应多样性,我们修改后的执行在保持可靠的可探测性、保证可靠基线和可靠地保持基准之间的一致性,同时确认可靠地测量和可靠地测量工作的结果。
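Alignment Resampling is described above as an inference-time method that samples a few watermarked generations and uses an external reward model to pick among them. A minimal best-of-n sketch of that idea follows. The names `alignment_resample`, `generate`, and `reward` are hypothetical stand-ins, and the paper's actual selection rule may be more involved than a plain argmax.

```python
from typing import Callable, List


def alignment_resample(
    generate: Callable[[str], str],       # hypothetical watermarked sampler
    reward: Callable[[str, str], float],  # hypothetical external reward model
    prompt: str,
    n: int = 4,
) -> str:
    """Draw n watermarked samples and keep the one the reward model prefers."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    # max() returns the first candidate with the highest score.
    return max(candidates, key=lambda text: reward(prompt, text))
```

Because every candidate comes from the watermarked sampler, the selected output still carries the watermark; the abstract reports that n in the range 2-4 is already enough to recover baseline alignment scores.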
Article 8
Title@2025-07-10 (4): Scaling RL to Long Videos
Title: Scaling RL to Long Videos | Skalierung von RL zu langen Videos | 缩放 RL 到长视频 2507.07966v1 |
Authors (14): Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
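我们提出一个全栈框架,借助强化学习将视觉语言模型(VLM)的推理能力扩展到长视频。我们通过整合三个关键组件来应对长视频推理的独特挑战:(1) 大规模数据集 LongVideo-Reason,包含 52K 个长视频问答对,并带有覆盖体育、游戏和 vlog 等多个领域的高质量推理标注;(2) 两阶段训练流程,通过思维链监督微调(CoT-SFT)和强化学习(RL)扩展 VLM;(3) 面向长视频 RL 的训练基础设施,名为多模态强化序列并行(MR-SP),它结合了序列并行和为长视频定制的基于 vLLM 的引擎,并利用缓存的视频嵌入实现高效的 rollout 和预填充。在实验中,LongVILA-R1-7B 在 VideoMME 等长视频问答基准上取得了优异表现。在我们的 LongVideo-Reason-eval 基准上,它在时间推理、目标与意图推理、空间推理和情节推理方面也优于 Video-R1-7B,甚至与 Gemini-1.5-Pro 相当。值得注意的是,我们的 MR-SP 系统在长视频 RL 训练中实现了最高 2.1 倍的加速。随着输入视频帧数的增加,LongVILA-R1 的性能持续提升。LongVILA-R1 标志着 VLM 向长视频推理迈出了坚实的一步。此外,我们公开发布了训练系统,它支持多种模态(视频、文本和音频)、多种模型(VILA 和 Qwen 系列)乃至图像和视频生成模型的 RL 训练。在单个 A100 节点(8 块 GPU)上,它支持对长达一小时的视频(例如 3,600 帧,约 256k 词元)进行 RL 训练。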
Article 9
Title@2025-07-10 (4): MIRIX: Multi-Agent Memory System for LLM-Based Agents
Title: MIRIX: Multi-Agent Memory System for LLM-Based Agents | MIRIX: Multi-Agent-Speichersystem für LLM-basierte Agenten | MIRIX:LLM药剂多机构内存系统 2507.07957v1 |
Authors (2): Yu Wang, Xi Chen
Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field’s most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.
虽然AI代理商的记忆能力日益受到越来越多的关注,但现有的解决方案仍然基本有限。 多数依赖平坦的、范围狭窄的记忆组件,限制了他们个人化、抽象和可靠地重新收集特定用户信息的能力。 为此,我们引入了MIRIX, 这是一个模块化的多剂记忆系统,通过解决实地最关键的挑战来重新定义AI记忆的未来:使语言模型能够真正记住。 与以往的做法不同, MIRIX超越文本,以包含丰富的视觉和多式联运经验,使记忆在现实世界情景中真正有用。 MIRIX由六种不同且结构严谨的记忆组件组成:核心、Episodic、Smantic、程序、资源内存和知识Vault Vault。 为此,我们引入了一个多剂框架,能动态控制和协调更新和检索。 这个设计使代理商能够持续、理性地和准确地检索各种长期用户数据。 我们在两个严格的环境下验证MIRIX 。 首先,在SpealshotVQA, 一个具有挑战性的多式联运基准, 由近20,000个高分辨率的计算机屏幕截截图,需要深背景了解, 以及没有更深背景背景化的内径化的内存数据系统,并且可以让现有的内存系统在存储系统运行中运行一个比RILILILASMIX 长期的存储系统更精确地显示一个新的存储器的MIS的存储器的存储器。
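The six memory types named in the abstract above (Core, Episodic, Semantic, Procedural, Resource Memory, Knowledge Vault) suggest a typed store rather than a single flat memory. The sketch below shows one possible data structure. It is an assumption-laden illustration, not MIRIX's design: `MemoryStore`, its methods, and the substring-matching retrieval are hypothetical, and the multi-agent coordination layer is omitted entirely.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List, Optional, Tuple


class MemoryType(Enum):
    """The six memory types listed in the abstract."""
    CORE = "core"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"
    RESOURCE = "resource"
    KNOWLEDGE_VAULT = "knowledge_vault"


@dataclass
class MemoryStore:
    """Hypothetical typed store: one list of text entries per memory type."""
    entries: Dict[MemoryType, List[str]] = field(
        default_factory=lambda: {kind: [] for kind in MemoryType}
    )

    def add(self, kind: MemoryType, text: str) -> None:
        self.entries[kind].append(text)

    def retrieve(
        self, query: str, kinds: Optional[Iterable[MemoryType]] = None
    ) -> List[Tuple[MemoryType, str]]:
        """Case-insensitive substring search, optionally limited to some types."""
        selected = list(kinds) if kinds is not None else list(MemoryType)
        needle = query.lower()
        return [
            (kind, text)
            for kind in selected
            for text in self.entries[kind]
            if needle in text.lower()
        ]
```

Keeping the types separate lets a retrieval step restrict its search, for example to procedural entries when the query asks how to do something. A real system would replace the substring match with embedding or index-based retrieval.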
Article 10
Title@2025-07-10 (4): SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
Title: SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment | SAGE: Ein visuelles Sprachmodell zur Erkennung von Anomalien durch Fact Enhancement und Entropy-aware Alignment | SAGE:通过事实增强和对子对子体认知校正进行反常检测的视觉语言模型 2507.07939v1 |
Authors (7): Guoxin Zang, Xue Li, Donglin Di, Lanshun Nie, Dechen Zhan, Yang Song, Lei Fan
While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.
虽然视觉-语言模型(VLMS)在一般多式联运任务中显示出有希望的进展,但它们往往在工业异常探测和推理方面挣扎,特别是在提供可解释的解释性解释和对看不见的类别加以概括方面,这种限制源于异常探测的内在领域性质,这妨碍了现有VLMs在需要精确、结构化和背景意识分析的工业情景中的适用性;为了应对这些挑战,我们提议SAGE(基于VLM(VLM)的框架),通过自我指导事实增强(SFE)和Etropy-awe直接优化(E-DPO)来强化异常推理;SFE通过事实提取和聚合将特定领域的知识纳入视觉推理,而E-DPO(E-DPO)则利用英特罗普-意识优化将模型产出与专家偏好性调整。此外,我们采用AD-PL(AD-PL)和优选-AGE(AGE)数据集,其中包括28,415个由专家排序答复的解答案例。为了评估异常推理模型,我们开发了多级逻辑评价(MLE),一个定量框架,一个定量框架分析逻辑/CRisal-shet-slasset-asset)的逻辑和现有数据标准。
Article 11
Title@2025-07-10 (4): Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration
Title: Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration | Long Context Scaling: Teilen und Erobern durch multi-agent question-driven Collaboration | 长期范围:通过多代理问题驱动的协作实现分化和征服 2505.20625v2 |
Authors (5): Sibo Xiao, Zixin Lin, Wenyang Gao, Hui Chen, Yue Zhang
Processing long contexts has become a critical capability for modern large language models (LLMs). Existing works leverage agent-based divide-and-conquer methods for processing long contexts. But these methods face crucial limitations, including prohibitive accumulated latency and amplified information loss from excessive agent invocations, and the disruption of inherent textual dependencies by immoderate partitioning. In this paper, we propose a novel multi-agent framework, XpandA (Expand-Agent), coupled with a question-driven workflow and dynamic partitioning for robust long-context processing. XpandA overcomes these limitations through: 1) dynamic partitioning of long texts, which adaptively modulates the filling rate of context windows for input sequences of vastly varying lengths; 2) a question-guided protocol to update flat information ensembles within centralized shared memory, constructing consistent inter-agent knowledge across partitions; and 3) selectively replaying specific partitions based on the state-tracking of question-information couples to promote the resolution of inverted-order structures across partitions (e.g., flashbacks). We perform a comprehensive evaluation of XpandA on multiple long-context benchmarks with lengths varying from 1k to 1M, demonstrating XpandA’s feasibility for processing ultra-long sequences and its significant effectiveness in enhancing the long-context capabilities of various LLMs, achieving 20% improvements and 1.5x inference speedup over baselines of full-context, RAG and previous agent-based methods.
处理长上下文已成为现代大语言模型(LLM)的一项关键能力。现有工作利用基于智能体的分治方法处理长上下文。但这些方法存在关键局限:过多的智能体调用会带来难以接受的累积延迟并放大信息损失,过度切分还会破坏文本固有的依赖关系。本文提出一种新的多智能体框架 XpandA(Expand-Agent),结合问题驱动的工作流和动态切分,以实现稳健的长上下文处理。XpandA 通过以下方式克服上述局限:1) 对长文本进行动态切分,针对长度差异极大的输入序列自适应地调节上下文窗口的填充率;2) 采用问题引导的协议更新集中式共享记忆中的扁平信息集合,在各分区之间构建一致的智能体间知识;3) 基于对“问题-信息”对的状态跟踪,有选择地重放特定分区,以促进解决跨分区的倒序结构(例如倒叙)。我们在多个长度从 1k 到 1M 不等的长上下文基准上对 XpandA 进行了全面评估,证明了 XpandA 处理超长序列的可行性,以及它在增强各种 LLM 长上下文能力方面的显著效果:相对于全上下文、RAG 和以往基于智能体的方法等基线,取得了 20% 的提升和 1.5 倍的推理加速。
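The first component in the abstract above, dynamic partitioning that adapts how full each context window is, can be illustrated with a balanced chunking routine. This is a sketch under assumptions, not XpandA's algorithm: `balanced_partition` is a hypothetical name, and the rule used here (use the fewest chunks that fit the window, then spread tokens evenly across them) is just one simple way to avoid a nearly empty final chunk.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def balanced_partition(tokens: Sequence[T], window: int) -> List[List[T]]:
    """Split tokens into the fewest chunks that fit `window`, sized evenly."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not tokens:
        return []

    # Fewest chunks such that none exceeds the window.
    n_chunks = math.ceil(len(tokens) / window)
    # Spread tokens evenly: the first `extra` chunks get one more token.
    base, extra = divmod(len(tokens), n_chunks)

    chunks: List[List[T]] = []
    start = 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(list(tokens[start:start + size]))
        start += size
    return chunks
```

For 10 tokens and a window of 4, a fixed-size split would give chunks of 4, 4, and 2, while this routine gives 4, 3, and 3, so every agent sees a similarly filled window. A real partitioner would also respect sentence or paragraph boundaries.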
Article 12
Title@2025-07-10 (4): Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style
Title: Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style | Kontexttreue in großen Sprachmodellen untersuchen: Die Rollen der Gedächtnisstärke und des Evidenzstils | 调查大语言模型的内情:记忆力和证据风格的作用 2409.10955v2 |
Authors (6): Yuepei Li, Kang Zhou, Qiao Qiao, Bach Nguyen, Qing Wang, Qi Li
Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by incorporating external information into the response generation process. However, how context-faithful LLMs are and what factors influence LLMs’ context faithfulness remain largely unexplored. In this study, we investigate the impact of memory strength and evidence presentation on LLMs’ receptiveness to external evidence. We quantify the memory strength of LLMs by measuring the divergence in LLMs’ responses to different paraphrases of the same question, which is not considered by previous works. We also generate evidence in various styles to examine LLMs’ behavior. Our results show that for questions with high memory strength, LLMs are more likely to rely on internal memory. Furthermore, presenting paraphrased evidence significantly increases LLMs’ receptiveness compared to simple repetition or adding details. These findings provide key insights for improving retrieval-augmented generation and context-aware LLMs. Our code is available at https://github.com/liyp0095/ContextFaithful.
通过将外部信息纳入响应生成过程,检索增强的一代(RAG)改进了大语言模型(LLMs),然而,背景真实的LLMs是如何的,以及哪些因素对LLMs的背景忠诚程度有何影响,基本上仍未得到探讨。在本研究中,我们调查了LLMs接受外部证据的记忆力和证据展示的影响。我们通过测量LLMs对同一问题不同参数的答复中的差异来量化LLMs的记忆力,而以前的工作没有考虑到这一点。我们还以各种方式生成证据来检查LLMs的行为。我们的结果显示,对于记忆力较高的问题,LLMMs更有可能依靠内部记忆。此外,提供副用的证据大大提高了LLMs对简单重复或添加细节的接受度。这些发现为改进检索启动的生成和对上下文的LMs提供了关键的洞察力。我们的代码可在https://github.com/liep0095/Ctext Faithful.。
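The abstract above quantifies memory strength by "measuring the divergence in LLMs' responses to different paraphrases of the same question". One simple way to turn that into a number is the Shannon entropy of the answer distribution across paraphrases, where lower entropy means more consistent answers and therefore stronger memory. The sketch below assumes that reading; `answer_entropy` and the exact-match normalization are hypothetical, and the paper's actual measure may differ.

```python
import math
from collections import Counter
from typing import Sequence


def answer_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (bits) of answers given to paraphrases of one question.

    0.0 means every paraphrase got the same answer (strong memory under the
    assumption stated above); larger values mean more divergent answers.
    """
    if not answers:
        raise ValueError("need at least one answer")
    # Assumed normalization: trim whitespace and ignore case.
    counts = Counter(a.strip().lower() for a in answers)
    total = len(answers)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy
```

Questions could then be bucketed by this score before testing how readily the model accepts conflicting external evidence in each bucket.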
Article 13
Title@2025-07-10 (4): DTECT: Dynamic Topic Explorer & Context Tracker
Title: DTECT: Dynamic Topic Explorer & Context Tracker | DTECT: Dynamischer Themen-Explorer & Kontext-Tracker | DTECT: 动态专题探索器和上下文跟踪器 2507.07910v1 |
Authors (2): Suman Adhya, Debarshi Kumar Sanyal
The explosive growth of textual data over time presents a significant challenge in uncovering evolving themes and trends. Existing dynamic topic modeling techniques, while powerful, often exist in fragmented pipelines that lack robust support for interpretation and user-friendly exploration. We introduce DTECT (Dynamic Topic Explorer & Context Tracker), an end-to-end system that bridges the gap between raw textual data and meaningful temporal insights. DTECT provides a unified workflow that supports data preprocessing, multiple model architectures, and dedicated evaluation metrics to analyze the topic quality of temporal topic models. It significantly enhances interpretability by introducing LLM-driven automatic topic labeling, trend analysis via temporally salient words, interactive visualizations with document-level summarization, and a natural language chat interface for intuitive data querying. By integrating these features into a single, cohesive platform, DTECT empowers users to more effectively track and understand thematic dynamics. DTECT is open-source and available at https://github.com/AdhyaSuman/DTECT.
现有的动态专题模型技术虽然强大,但往往存在于缺乏对解释和方便用户的探索的强大支持的支离破碎管道中。我们引入了DTECT(动态专题探索器和背景跟踪器),这是一个端对端系统,可以弥合原始文本数据和有意义的时间洞察力之间的差距。DTECT提供了一个统一的工作流程,支持数据处理前、多个模型结构以及分析时间专题模型质量的专门评价指标。它通过引入LLM驱动的自动主题标签、通过时间突出的文字进行趋势分析、以文件级别汇总进行交互式可视化以及直觉数据查询的自然语言聊天界面,大大提高了可解释性。通过将这些特征纳入一个单一的、连贯的平台,DTECT使用户能够更有效地跟踪和理解专题动态。DTECT是开放的来源,可在https://github.com/AdhyaSuman/DTECT中查阅。
Article 14
Title@2025-07-10 (4): A Survey on Latent Reasoning
Title: A Survey on Latent Reasoning | Eine Umfrage über latente Vernunft | 潜在推理综述 2507.06203v2 |
Authors (33): Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model’s expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model’s continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.
大语言模型(LLM)已展现出令人印象深刻的推理能力,尤其是在显式思维链(CoT)推理的引导下,这种推理会用语言表述中间步骤。虽然 CoT 同时提高了可解释性和准确性,但它对自然语言推理的依赖限制了模型的表达带宽。潜在推理通过完全在模型的连续隐藏状态中执行多步推理来解决这一瓶颈,从而不再需要词元级监督。为推进潜在推理研究,本综述对这一新兴领域进行了全面概述。我们首先考察神经网络层作为推理计算基底的基础性作用,强调层级表示如何支持复杂变换。接着,我们探讨多种潜在推理方法,包括基于激活的循环、隐藏状态传播,以及压缩或内化显式推理轨迹的微调策略。最后,我们讨论一些先进范式,例如通过掩码扩散模型实现的无限深度潜在推理,它能够支持全局一致且可逆的推理过程。通过统一这些视角,我们旨在厘清潜在推理的概念图景,并为 LLM 认知前沿的研究指明未来方向。收集最新论文和代码仓库的相关 GitHub 仓库见:https://github.com/multimodal-art-projection/LatentCoT-Horizon/。
Article 15
Title@2025-07-10 (4): Automating MD simulations for Proteins using Large language Models: NAMD-Agent
Title: Automating MD simulations for Proteins using Large language Models: NAMD-Agent | Automatisierung von MD-Simulationen für Proteine mit großen Sprachmodellen: NAMD-Agent | 使用大语言模型自动化蛋白质 MD 模拟:NAMD-Agent 2507.07887v1 |
Authors (2): Achuth Chandrasekhar, Amir Barati Farimani
Molecular dynamics simulations are an essential tool in understanding protein structure, dynamics, and function at the atomic level. However, preparing high-quality input files for MD simulations can be a time-consuming and error-prone process. In this work, we introduce an automated pipeline that leverages Large Language Models (LLMs), specifically Gemini 2.0 Flash, in conjunction with Python scripting and Selenium-based web automation to streamline the generation of MD input files. The pipeline exploits CHARMM GUI’s comprehensive web-based interface for preparing simulation-ready inputs for NAMD. By integrating Gemini’s code generation and iterative refinement capabilities, simulation scripts are automatically written, executed, and revised to navigate CHARMM GUI, extract appropriate parameters, and produce the required NAMD input files. Post-processing is performed using additional software to further refine the simulation outputs, thereby enabling a complete and largely hands-free workflow. Our results demonstrate that this approach reduces setup time, minimizes manual errors, and offers a scalable solution for handling multiple protein systems in parallel. This automated framework paves the way for broader application of LLMs in computational structural biology, offering a robust and adaptable platform for future developments in simulation automation.
分子动态模拟是了解蛋白质结构、动态和原子一级功能的基本工具。 但是,为MD模拟制作高质量的输入文件可能是一个耗时和容易出错的过程。 在这项工作中,我们引入了一种自动管道,利用大语言模型(LLMS),特别是Gemini 2.0 Flash, 结合 Python 脚本和基于 erium 的网络自动化来简化MD 输入文件的生成。管道利用CHARMMM GUI 的综合网络界面为NAD 进行模拟准备。通过整合Gemini 的代码生成和迭代精化能力,模拟脚本会自动编写、执行和修订,以导航CHARMMM GUI,提取适当的参数,并制作所需的NAD 输入文件。 邮件处理正在使用额外的软件来进一步完善模拟输出,从而使得一个完整和基本上免费的手动工作流程得以实现。 我们的结果表明,这一方法减少了设置时间,最大限度地减少人工错误,并为同时处理多种蛋白质系统提供了一个可扩展的解决方案。 这个自动化框架为在计算结构生物学上更广泛地应用LMS铺路铺路铺路铺路。
Article 16
Title@2025-07-10 (4): When Dialects Collide: How Socioeconomic Mixing Affects Language Use
Title: When Dialects Collide: How Socioeconomic Mixing Affects Language Use | Wenn Dialekte zusammenstoßen: Wie sich die sozioökonomische Mischung auf den Sprachgebrauch auswirkt | 当对调时:社会经济混合如何影响语言使用 2307.10016v2 |
Authors (4): Thomas Louf, José J. Ramasco, David Sánchez, Márton Karsai
The socioeconomic background of people and how they use standard forms of language are not independent, as demonstrated in various sociolinguistic studies. However, the extent to which these correlations may be influenced by the mixing of people from different socioeconomic classes remains relatively unexplored from a quantitative perspective. In this work we leverage geotagged tweets and transferable computational methods to map deviations from standard English on a large scale, in seven thousand administrative areas of England and Wales. We combine these data with high-resolution income maps to assign a proxy socioeconomic indicator to home-located users. Strikingly, across eight metropolitan areas we find a consistent pattern suggesting that the more different socioeconomic classes mix, the less interdependent the frequency of their departures from standard grammar and their income become. Further, we propose an agent-based model of linguistic variety adoption that sheds light on the mechanisms that produce the observations seen in the data.
正如各种社会语言研究所显示的,人们的社会经济背景以及他们使用标准语言的方式并不独立。然而,从数量角度而言,这些关联可能受到不同社会经济阶层人员混合影响的程度仍然相对没有探讨。在这项工作中,我们利用地理标记推文和可转移计算方法,在英格兰和威尔士的7 000个行政区大规模绘制偏离标准英语的地图。我们把这些数据和高分辨率收入图结合起来,为家庭所在地用户指定一个代理社会经济指标。令人惊讶的是,在八个大都市地区,我们发现一个一致的模式,表明不同社会经济阶层的混合越多,他们偏离标准语法和收入的频率越少,越不相互依存。此外,我们提出了一种基于代理人的语言多样性采用模式,该模式可以说明数据中观察到的观察机制。
Article 17
Title@2025-07-10 (4): Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study
Title: Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study | Bewertung der Robustheit von großen Audio-Sprachmodellen zur Audio-Einspritzung: Eine empirische Studie | 评估大音频语言模型对音频注射的威力:经验研究 2505.19598v2 |
Authors (7): Guanyu Hou, Jiaming He, Yinhang Zhou, Ji Guo, Yitong Qiao, Rui Zhang, Wenbo Jiang
Large Audio-Language Models (LALMs) are increasingly deployed in real-world applications, yet their robustness against malicious audio injection attacks remains underexplored. This study systematically evaluates five leading LALMs across four attack scenarios: Audio Interference Attack, Instruction Following Attack, Context Injection Attack, and Judgment Hijacking Attack. Using metrics such as Defense Success Rate, Context Robustness Score, and Judgment Robustness Index, we quantitatively assess their vulnerabilities and resilience. Experimental results reveal significant performance disparities among models; no single model consistently outperforms others across all attack types. The position of malicious content critically influences attack effectiveness, particularly when it is placed at the beginning of sequences. A negative correlation between instruction-following capability and robustness suggests that models adhering strictly to instructions may be more susceptible, in contrast to the greater resistance of safety-aligned models. Additionally, system prompts show mixed effectiveness, indicating the need for tailored strategies. This work introduces a benchmark framework and highlights the importance of integrating robustness into training pipelines. The findings emphasize developing multi-modal defenses and architectural designs that decouple capability from susceptibility for secure LALM deployment.
大型音频语言模型(LALMS)越来越多地用于现实世界应用,但其抵御恶意声击袭击的强力仍未得到充分探索。本研究系统地评估了四种攻击情景中的五种主要的LALMS:音频干扰攻击、攻击指令、上击攻击、上击攻击和袭击判决。使用国防成功率、背景强力分数和判断强力指数等衡量标准,对其脆弱性和复原力进行了定量评估。实验结果显示各模型之间业绩差异巨大;没有单一模型在所有攻击类型中始终优于其他模型。恶意内容对袭击有效性的严重影响,特别是在开始顺序时。执行指令的能力与强力表明严格遵守指令的模式之间的负相关性可能更大,而安全调整模型的阻力则更大。此外,系统提示显示效果参差不齐,表明需要制定有针对性的战略。这项工作引入了一个基准框架,并强调了将强力纳入培训管道的重要性。调查结果强调开发多式防御和建筑设计,使其不易受到安全LALMS部署的敏感度。
Article 18
Title@2025-07-10 (4): DocCHA: Towards LLM-Augmented Interactive Online diagnosis System
Title: DocCHA: Towards LLM-Augmented Interactive Online diagnosis System | DocCHA: Auf dem Weg zum LLM-Augmented Interactive Online-Diagnosesystem | DocCHA:争取建立LLM-增强的互动式在线诊断系统 2507.07870v1 |
Authors (5): Xinyi Liu, Dachun Sun, Yi R. Fung, Dilek Hakkani-Tür, Tarek Abdelzaher
Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, with only a modest increase in dialogue turns. These results demonstrate the effectiveness of DocCHA in enabling structured, transparent, and efficient diagnostic conversations, paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.
尽管大语言模型(LLMs)的能力令人印象深刻,但现有的相互交流的健康代理机构(CHAs)仍然一成不变和易碎,无法适应性多方向推理、症状澄清或透明决策,这妨碍了其在临床诊断中的实际适用性,而临床诊断中迭接和结构化对话是必不可少的。我们建议DocCHA是一个充满信心的模块框架,仿效临床推理,将诊断过程分为三个阶段:(1) 症状解析,(2) 历史获取,(3) 因果关系图构建。每个模块都使用可解释的信任分数来指导适应性询问,确定信息澄清的优先顺序,并完善薄弱的推理联系。根据两个真实世界中国协商数据集(IMCS21, DX)的评估,DocCHA一贯优于基于快速的LM基准(GPT-3.5, GPT-4o, LLMA-3), 诊断准确度达到5.18%,症状回顾率提高30%以上,对话转来仅略有增加。这些结果表明,DocCHA在促成结构、透明和高效的诊断性对话对话中的有效性,为多语和资源培训的可信赖的LM授权临床助手铺设者铺铺铺铺铺路。
Article 19
Title@2025-07-10 (4): Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation
Title: Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation | Alpay Algebra V: Multi-Layered Semantic Games und Transfinite Fixed-Point Simulation | Alpay Alphay Algebabra V:多语言语义运动会和跨线固定点模拟 2507.07868v1 |
Authors (2): Bugra Kilictas, Faruk Alpay
This paper extends the self-referential framework of Alpay Algebra into a multi-layered semantic game architecture where transfinite fixed-point convergence encompasses hierarchical sub-games at each iteration level. Building upon Alpay Algebra IV’s empathetic embedding concept, we introduce a nested game-theoretic structure where the alignment process between AI systems and documents becomes a meta-game containing embedded decision problems. We formalize this through a composite operator $\phi(\cdot, \gamma(\cdot))$ where $\phi$ drives the main semantic convergence while $\gamma$ resolves local sub-games. The resulting framework demonstrates that game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally. We prove a Game Theorem establishing existence and uniqueness of semantic equilibria under realistic cognitive simulation assumptions. Our verification suite includes adaptations of Banach’s fixed-point theorem to transfinite contexts, a novel $\phi$-topology based on the Kozlov-Maz’ya-Rossmann formula for handling semantic singularities, and categorical consistency tests via the Yoneda lemma. The paper itself functions as a semantic artifact designed to propagate its fixed-point patterns in AI embedding spaces – a deliberate instantiation of the “semantic virus” concept it theorizes. All results are grounded in category theory, information theory, and realistic AI cognition models, ensuring practical applicability beyond pure mathematical abstraction.
本文扩展了 Alpay Algebra 的自我偏好框架, 将其扩展为多层次的抽象游戏结构, 即半边端固定点趋同包含每个迭代级别的等级子游戏。 基于 Alpay Algebra IV 的同情嵌入概念, 我们引入了一个嵌套游戏理论结构, 将AI 系统和文档之间的校正进程变成包含嵌入决定问题的元游戏。 我们通过一个复合操作器$\phi(\cdot,\gamma(\cdot)) 正式确定这一点, 在那里, $\phie 驱动主语系趋同, 而$\ gamma$ 解决了本地的理论。 由此产生的框架显示游戏理论推理自然地产生于固定点的嵌入式嵌入概念, 而不是外部强加的。 我们证明这是一个游戏理论, 在现实的认知模拟假设下, 将语系定点定点的定点对定空空间环境进行修改, 一种基于 Kozlov- maz 的理论性理论, 将它自己设计成一个固定的纸质的纸质性 。
Article 20
Title@2025-07-10 (4): Skywork-R1V3 Technical Report
Title: Skywork-R1V3 Technical Report | Technischer Bericht Skywork-R1V3 | Skywork-R1V3 技术报告 2507.06167v3 |
Authors (11): Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu, Jian Peng, Haofeng Sun, Yunzhuo Hao, Peiyu Wang, Jianhao Zhang, Yahui Zhou
We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model’s reasoning ability without the need for additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
我们引入了Skywork-R1V3, 这是一种先进的开放源码视觉语言模型(VLM),它开创了对视觉推理的新方法。它的关键创新在于有效地将推理技能从只有文本的大型语言模型(LLMS)转移到视觉任务。 Skywork-R1V3的出色表现主要来自我们精心制定的训练后RL框架,它有效地激活和增强了模型的推理能力,而无需再继续接受培训。我们通过这个框架,进一步发现连接器模块在为多式联运推理模型实现强有力的跨模式协调方面的基本作用。此外,我们引入了一个独特的推理能力指标,即关键推理符号的灵敏性,这在RL培训期间已证明对检查站的选择非常有效。Skywork-R1V3在MMMMMMM公司上取得了最新的成果,大大地从64.3%提高到76.0%。这一性能与入门级人的能力相匹配。值得注意的是,我们的RL驱动力后培训方法使得38B参数模型能够与顶级的封闭源码VLMSm3。我们进行更深入的推理学的推理分析。在Skal推理分析中成功的推理研究中, 也成功地展示了另一个的数学推理学的推理学。
Article 21
Title@2025-07-10 (4): From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
Title: From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems | Von der Ambiguität zur Genauigkeit: Der transformative Effekt der Koreferenzlösung auf retrieval-augmentierte Erzeugungssysteme | 从模糊到准确性:关于回收-提款一代系统的共同决议的变革效应 2507.07847v1 |
Authors (6): Youngjoon Jang, Seongtae Hong, Junyoung Son, Sungjin Park, Chanjun Park, Heuiseok Lim
Retrieval-Augmented Generation (RAG) has emerged as a crucial framework in natural language processing (NLP), improving factual consistency and reducing hallucinations by integrating external document retrieval with large language models (LLMs). However, the effectiveness of RAG is often hindered by coreferential complexity in retrieved documents, introducing ambiguity that disrupts in-context learning. In this study, we systematically investigate how entity coreference affects both document retrieval and generative performance in RAG-based systems, focusing on retrieval relevance, contextual understanding, and overall response quality. We demonstrate that coreference resolution enhances retrieval effectiveness and improves question-answering (QA) performance. Through comparative analysis of different pooling strategies in retrieval tasks, we find that mean pooling demonstrates superior context capturing ability after applying coreference resolution. In QA tasks, we discover that smaller models benefit more from the disambiguation process, likely due to their limited inherent capacity for handling referential ambiguity. With these findings, this study aims to provide a deeper understanding of the challenges posed by coreferential complexity in RAG, providing guidance for improving retrieval and generation in knowledge-intensive AI applications.
在自然语言处理(NLP)中,将外部文件检索与大型语言模型(LLM)相结合,从而改进实际一致性,减少幻觉,从而成为自然语言处理(NLP)中的一个关键框架,提高了事实一致性,减少了幻觉;然而,检索文件的共变复杂性往往阻碍RAG的效力,造成内含学习的模糊性;在这项研究中,我们系统地调查实体参照如何影响基于RAG的系统中的文件检索和感化性能,重点是检索相关性、背景理解和总体反应质量;我们证明,共同参照解决办法提高了检索的有效性,改进了问题解答(QA)的性能;通过对检索任务中不同的集中战略进行比较分析,我们发现,在应用共同参照分辨率后,集合表明更能捕捉能力。在QA的任务中,我们发现较小的模型可能由于处理特异性的固有能力有限而从分解过程中获益更多。通过这些研究结果,旨在更深入地了解在RAG的共变复杂性带来的挑战,为改进知识密集型AI应用程序的检索和生成提供指导。
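The mean pooling mentioned in the abstract of Article 21 can be illustrated with a minimal sketch. It assumes token embeddings are plain lists of floats and that an attention mask marks padding positions; the function name `mean_pool` and the data layout are illustrative choices, not the authors' code.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the embeddings of non-padding tokens into one vector.

    token_embeddings: one vector per token (all the same length).
    attention_mask:   1 for real tokens, 0 for padding (assumed convention).
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("embeddings and mask must have the same length")
    # Keep only the vectors whose mask entry is non-zero.
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        raise ValueError("at least one unmasked token is required")
    dim = len(kept[0])
    # Component-wise mean over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
```

Because every unmasked token contributes equally, resolving a pronoun to its entity before encoding changes the pooled vector directly, which is one plausible reading of why mean pooling benefits from coreference resolution.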
Article 22
Title@2025-07-10 (4): Truth-value judgment in language models: ‘truth directions’ are context sensitive
Title: Truth-value judgment in language models: ‘truth directions’ are context sensitive | Wahrheit-Wert-Urteil in Sprachmodellen: ‘Wahrheitsrichtungen’ sind kontextsensibel | 语言模型中的真相价值判断:“真相方向”是背景敏感 2404.18865v2 |
Authors (4): Stefan F. Schouten, Peter Bloem, Ilia Markov, Piek Vossen
Recent work has demonstrated that the latent spaces of large language models (LLMs) contain directions predictive of the truth of sentences. Multiple methods recover such directions and build probes that are described as uncovering a model’s “knowledge” or “beliefs”. We investigate this phenomenon, looking closely at the impact of context on the probes. Our experiments establish where in the LLM the probe’s predictions are (most) sensitive to the presence of related sentences, and how to best characterize this kind of sensitivity. We do so by measuring different types of consistency errors that occur after probing an LLM whose inputs consist of hypotheses preceded by (negated) supporting and contradicting sentences. We also perform a causal intervention experiment, investigating whether moving the representation of a premise along these truth-value directions influences the position of an entailed or contradicted sentence along that same direction. We find that the probes we test are generally context sensitive, but that contexts which should not affect the truth often still impact the probe outputs. Our experiments show that the type of errors depends on the layer, the model, and the kind of data. Finally, our results suggest that truth-value directions are causal mediators in the inference process that incorporates in-context information.
最近的工作表明,大型语言模型(LLMS)的潜在空间包含对判决真相的预测方向。多种方法恢复了这些方向,并建立了被描述为揭示模型“知识”或“信仰”的探点。我们仔细研究这个现象,仔细研究背景对探点的影响。我们的实验确定了在LLM中,探测器的预测(最主要)对相关句子的存在十分敏感,以及如何最好地描述这种敏感度。我们这样做的方法是测量在测测测LM(LM)之后发生的不同类型的一致性错误,而LLM的输入包括(否定的)支持和相互矛盾的句子之前的假设。我们还进行了因果干预实验,调查沿着这些真相价值方向的假设的表述是否影响着一个包含或矛盾的句子的位置。我们发现,我们测试的探点一般是背景敏感,但不应该影响真相的环境通常仍然影响探测结果。我们的实验表明,错误的类型取决于层次、模型和数据类型。最后,我们的结果表明,在判点中,真理价值是将因果关系纳入信息中。
Article 23
Title@2025-07-10 (4): None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks
Title: None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | Keiner der anderen: eine allgemeine Technik zur Unterscheidung von der Erinnerung an Multiple-Choice-LLM-Bewertungs-Benchmarks | 其他无其他:在多杯LLM评价基准中区分与记忆化区别理由的一般技术 2502.12896v5 |
Authors (3): Eva Sánchez Salido, Julio Gonzalo, Guillermo Marco
In LLM evaluations, reasoning is often distinguished from recall/memorization by applying numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rather than memorize) in order to answer correctly. Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset. Results show that all models experience remarkable accuracy drops under our proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access 2024, ranging from 10% to 93% across models. Notably, the most accurate model in our experiments (OpenAI-o3-mini) is not the most robust one (DeepSeek-R1-70B), suggesting that the best models in standard evaluations may not be the ones with better reasoning capabilities. Also, we see larger accuracy drops on public (vs. private) datasets and on questions posed in their original language (vs. a manual translation), which are signs of contamination and also point to a relevant role of recall/memorization in current LLMs’ answers.
在LLM 评估中,推理往往不同于对数学导向问题进行数字变异的回溯/记忆。在这里,我们引入了对多种选择问题的一般性变异方法,将正确答案完全与先前看到的象征或概念完全分离,要求LLMS理解和理性(而不是记忆化),以便正确回答。使用这种方法,我们评估英语和西班牙语两个数据集上的最新专有和开放源LMS:公共MMMLU基准和私人UNED-Acess 2024数据集。结果显示,所有模型在拟议变异中都经历了显著的准确性下降,MMMLU和UNED-Acess 2024平均损失57%,UNED-Acess 2024平均损失50%,各模型平均损失10%至93%。值得注意的是,我们实验中最准确的模型(OpenAI-o3-mini)并不是最可靠的(DepSeSeek-R1-70B),表明标准评价中的最佳模型可能不是具有更好推理能力的模型。此外,我们看到,在公共(v)手册中的原始数据翻译中(也是原始的标记和检索中的标记)。
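A minimal sketch of the kind of variation described in the abstract of Article 23, assuming (from the paper's title) that the text of the correct option is replaced by an exclusion option. The exact wording of that option, the `MCQuestion` type, and the choice to keep the option in its original position are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Assumed wording of the exclusion option; not taken from the paper.
NONE_OPTION = "None of the other answers"


@dataclass
class MCQuestion:
    stem: str
    options: List[str]
    answer_index: int  # position of the correct option


def apply_none_of_the_others(q: MCQuestion) -> MCQuestion:
    """Return a varied copy in which the correct option's text is replaced.

    The correct position is unchanged, so the right choice becomes the
    exclusion option and can only be reached by ruling out the distractors.
    """
    new_options = list(q.options)  # copy, so the original question is untouched
    new_options[q.answer_index] = NONE_OPTION
    return MCQuestion(q.stem, new_options, q.answer_index)
```

Comparing a model's accuracy on the original and the varied questions then gives the accuracy drop that the abstract reports.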
Article 24
Title@2025-07-10 (4): Constrain Alignment with Sparse Autoencoders
Title: Constrain Alignment with Sparse Autoencoders | Beschränkung der Ausrichtung mit Sparse Autoencodern | 与 Sparse 自动对齐 2411.07618v4 |
Authors (10): Qingyu Yin, Chak Tou Leong, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang
The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach gains efficiency by using the sparse features activated in a well-trained sparse autoencoder, and retains the quality of sequential KL divergence by using a feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment.
大型语言模式(LLMS)与人类偏好相匹配仍然是一个关键的挑战。尽管培训后技术(如“加强人类反馈学习”和“直接偏好优化”)取得了显著成功,但它们往往导致计算效率低下和训练不稳定。在本文中,我们提出“特级有限偏好优化”(FPO)新颖方法,旨在简化校准进程,同时确保稳定性。FPO利用预先培训的“Sparse Autoencors (SAEs)” ,引入了特质级限制,允许高效、宽度强化校准。我们的方法通过使用精密的稀疏特性在训练有素的稀疏的稀释自动校对器中激活,以及使用地层离线参考的相继KL差异的质量而效率得到提高。基准数据集的实验结果表明,FPO在赢率方面实现了5.08%的绝对改善,其计算成本比最先进的基线低得多,这为高效、可控的LM校准的解决方案带来了希望。
Article 25
Title@2025-07-10 (4): Unsupervised Morphological Tree Tokenizer
Title: Unsupervised Morphological Tree Tokenizer | Unüberwachter morphologischer Baum Tokenizer | 不受监督的病理树化器 2406.15245v2 |
Authors (5): Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu
As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named $\textit{MorphOverriding}$ to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece on both morphological segmentation tasks and language modeling tasks. Code is available at https://github.com/martianmartina/TreeTokenizer.
作为语言模型的基石,象征性化意味着将文字输入分解成预定义的原子单位。常规统计符号经常扰乱单词内部的构成界限,从而腐蚀语义信息。为了解决这一缺陷,我们引入了形态结构指导符号化,并提出了一个深刻的模型以诱导字符级结构。具体地说,深层模型将内部结构和文字表达方式与一个名为$\textit{Morphovertriding}$的机制联合编码,以确保模phemes的不兼容性。通过以自我监督的目标培训模型,我们的方法能够引导与形态学规则相一致的字符级结构,而没有附加说明的培训数据。基于导出的结构,我们的算法通过自上至下方式的词汇匹配将文字符号化。经验性结果显示,拟议的方法有效地保留了形态分解任务和语言建模任务上广泛采用的方法,如BPE和WordPiece。代码可在 https://github.com/martianmatina/Treekeizer上查阅。
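The top-down vocabulary matching described in the abstract of Article 25 can be sketched as follows. The sketch assumes the induced character-level structure is available as a binary tree of nested tuples with single characters as leaves, and that unmatched leaves are emitted as single-character tokens; the tree encoding and function names are hypothetical, not taken from the released code.

```python
from typing import List, Set, Tuple, Union

# A tree is either a single character (leaf) or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def span_text(node: Tree) -> str:
    """Concatenate the characters covered by a subtree."""
    if isinstance(node, str):
        return node
    left, right = node
    return span_text(left) + span_text(right)


def tokenize_top_down(node: Tree, vocab: Set[str]) -> List[str]:
    """Emit the largest subtree spans that are vocabulary entries.

    Starting at the root, a span found in the vocabulary becomes one token;
    otherwise the search descends into the two children. Leaves are always
    emitted (assumption: single characters act as a fallback).
    """
    text = span_text(node)
    if isinstance(node, str) or text in vocab:
        return [text]
    left, right = node
    return tokenize_top_down(left, vocab) + tokenize_top_down(right, vocab)
```

For example, with the tree `((("c", "a"), "t"), "s")` and the vocabulary `{"cat", "s"}`, the root span `cats` is not matched, so the search descends and emits `cat` and `s`. Token boundaries can only fall on subtree boundaries, which is how a tree that respects morphology keeps morphemes intact.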
Article 26
Title@2025-07-10 (4): MAEBE: Multi-Agent Emergent Behavior Framework
Title: MAEBE: Multi-Agent Emergent Behavior Framework | MAEBE: Multi-Agent Emergent Behavior Framework | 多边代理新兴行为框架 2506.03053v2 |
Authors (4): Sinem Erisken, Timothy Gothard, Martin Leitgab, Ram Potham
Traditional AI safety evaluations on isolated LLMs are insufficient as multi-agent AI ensembles become prevalent, introducing novel emergent risks. This paper introduces the Multi-Agent Emergent Behavior Evaluation (MAEBE) framework to systematically assess such risks. Using MAEBE with the Greatest Good Benchmark (and a novel double-inversion question technique), we demonstrate that: (1) LLM moral preferences, particularly for Instrumental Harm, are surprisingly brittle and shift significantly with question framing, both in single agents and ensembles. (2) The moral reasoning of LLM ensembles is not directly predictable from isolated agent behavior due to emergent group dynamics. (3) Specifically, ensembles exhibit phenomena like peer pressure influencing convergence, even when guided by a supervisor, highlighting distinct safety and alignment challenges. Our findings underscore the necessity of evaluating AI systems in their interactive, multi-agent contexts.
对孤立的LLMs进行传统的AI安全评价是不够的,因为多试剂AI联合体变得很普遍,带来新的新风险。本文件介绍了多代理新兴行为评价(MAEBE)框架,以系统评估此类风险。我们利用MAEBE和最伟大的良好基准(以及一种新型的双重反向问题技术)来证明:(1)LLM道德偏好,特别是工具伤害的道德偏好,在单一代理体和组合体中都令人惊讶地变得脆弱,随着问题设置而发生重大变化。 (2)LLM联合体的道德推理不能直接从新出现的集团动态造成的孤立的代理体行为中预见出来。(3)具体地说,集合体展示了影响趋同的同行压力等现象,即使由上司指导,也突出了不同的安全和一致性挑战。我们的调查结果强调,有必要在其互动的多剂环境中评价AI系统。
Article 27
Title@2025-07-10 (4): The Thin Line Between Comprehension and Persuasion in LLMs
Title: The Thin Line Between Comprehension and Persuasion in LLMs | Die dünne Linie zwischen Verständnis und Überzeugung in LLMs | LLMM 理解与劝导之间的细细线 2507.01936v2 |
Authors (2): Adrian de Wynter, Tangming Yuan
Large language models (LLMs) are excellent at maintaining high-level, convincing dialogues. They are being rapidly deployed as chatbots and evaluators in sensitive areas, such as peer review and mental health applications. This, along with the disparate accounts of their reasoning capabilities, calls for a closer examination of LLMs and their comprehension of dialogue. In this work we begin by evaluating LLMs’ ability to maintain a debate, one of the purest yet most complex forms of human communication. Then we measure how this capability relates to their understanding of what is being talked about, namely, their comprehension of dialogical structures and the pragmatic context. We find that LLMs are capable of maintaining coherent, persuasive debates, often swaying the beliefs of participants and audiences alike. We also note that awareness or suspicion of AI involvement encourages people to be more critical of the arguments made. When polling LLMs on their comprehension of deeper structures of dialogue, however, they cannot demonstrate said understanding. Our findings tie the shortcomings of LLMs-as-evaluators to their (in)ability to understand the context. More broadly, for the field of argumentation theory we posit that, if an agent can convincingly maintain a dialogue, it is not necessary for it to know what it is talking about. Hence, the modelling of pragmatic context and coherence is secondary to effectiveness.
大型语言模型(LLMS)在保持高层次的、令人信服的对话方面是极好的,它们作为聊天室和评价员在敏感领域,如同行审议和心理健康应用等领域迅速被部署为聊天室和评价员,这与对其推理能力的不同描述一道,要求更仔细地审查LLMS, 并理解对话。在这项工作中,我们首先评估LLMS保持辩论能力的能力,这是人类交流中最纯最复杂形式的辩论之一。然后,我们衡量这种能力与他们所谈论的内容,即他们对对话结构的理解和务实背景的理解有何关系。我们认为,LMS有能力保持连贯、有说服力的辩论,经常扭曲参与者和观众的信仰。我们还注意到,对AI参与的认识或怀疑鼓励人们更加批评所提出的论点。然而,当LMS对更深层次对话结构的理解进行民意测验时,他们无法表现出理解。我们的调查结果将LMS-S-S-evluaers的缺点与他们理解背景的(无法理解)的缺点联系在一起。更广义地说,对于争论理论领域来说,我们认为,如果一个代理人能够令人信服地说,那么,那么,那么,它就是,它就应该保持一种务实的对话,那么,那么,它就应该保持一种具有何种联系。
Article 28
Title@2025-07-10 (4): Conditional Unigram Tokenization with Parallel Data
Title: Conditional Unigram Tokenization with Parallel Data | Bedingte Unigramm-Tokenisierung mit parallelen Daten | 附带平行数据的条件性大学招式 2507.07824v1 |
Authors (2): Gianluca Vico, Jindřich Libovický
We introduce conditional unigram tokenization, a novel approach that extends unigram tokenization by conditioning target token probabilities on source-language tokens from parallel data. Given a fixed source tokenizer, our method learns a target tokenizer that maximizes cross-lingual semantic alignment. We evaluate our tokenizer on four language pairs across different families and resource levels, examining intrinsic properties and downstream performance on machine translation and language modeling. While our conditional tokenizer maintains comparable statistical properties to standard unigram tokenizers, results are mixed: we observe no improvements in machine translation quality, but find consistent perplexity reductions in language modeling. We hypothesize that quadratic scaling of conditional probability estimation with respect to the vocabulary size creates a data efficiency bottleneck. Our findings suggest that alternative parameterizations may be necessary for practical cross-lingual tokenization.
我们引入了有条件的单克表示法, 这是一种通过从平行数据中设定源语言象征物目标符号概率来限制源语言象征物目标符号概率来扩展单克表示法的新办法。 在使用固定来源象征器的情况下, 我们的方法会学习一个目标表示法, 以最大程度实现跨语言语义语义对齐。 我们在不同家庭和资源层次上对四种语言配方进行代号评估, 检查机器翻译和语言建模的内在特性和下游性能。 虽然我们的有条件象征物与标准的单克象征物具有可比的统计属性, 但结果好坏参半: 我们没有看到机器翻译质量的改善, 但是在语言建模中却发现一致的不易解。 我们假设了在词汇大小方面有条件概率估算的四重度大小会造成数据效率瓶颈。 我们的研究结果表明, 其它参数化也许对于实际的跨语言代用代用代用代言符号可能是必要的。
Article 29
Title@2025-07-10 (4): On the Effect of Instruction Tuning Loss on Generalization
Title: On the Effect of Instruction Tuning Loss on Generalization | Auf die Auswirkungen der Instruktion Tuning Verlust auf die Verallgemeinerung | 指示计票损失对普遍化的影响的影响 2507.07817v1 |
Authors (4): Anwoy Chatterjee, H S V N S Kowndinya Renduchintala, Sumit Bhatia, Tanmoy Chakraborty
Instruction Tuning has emerged as a pivotal post-training paradigm that enables pre-trained language models to better follow user instructions. Despite its significance, little attention has been given to optimizing the loss function used. A fundamental, yet often overlooked, question is whether the conventional auto-regressive objective, where loss is computed only on response tokens and prompt tokens are excluded, is truly optimal for instruction tuning. In this work, we systematically investigate the impact of differentially weighting prompt and response tokens in the instruction tuning loss, and propose Weighted Instruction Tuning (WIT) as a better alternative to conventional instruction tuning. Through extensive experiments on five language models of different families and scales, three finetuning datasets of different sizes, and five diverse evaluation benchmarks, we show that the standard instruction tuning loss often yields suboptimal performance and limited robustness to input prompt variations. We find that a low-to-moderate weight for prompt tokens coupled with a moderate-to-high weight for response tokens yields the best-performing models across settings, and that these models also serve as better starting points for subsequent preference alignment training. These findings highlight the need to reconsider the instruction tuning loss and offer actionable insights for developing more robust and generalizable models. Our code is open-sourced at https://github.com/kowndinya-renduchintala/WIT.
教学图案已成为一个关键的培训后范例,使得培训前语言模型能够更好地遵循用户指令。尽管它意义重大,但很少注意优化所使用的损失功能。一个根本性但经常被忽视的问题是常规自动递减目标(即损失只计算在应答标牌上,不包括及时标牌上)是否真正是教学调适的最佳方法。在这项工作中,我们系统地调查在调试损失时对快速和应答标牌进行不同加权的影响,并提议将加权指示图案(WIT)作为常规调控的更好替代方法。通过对不同家庭和规模的五种语言模型、三个不同尺寸的微调数据集和五个不同的评估基准进行广泛实验,我们发现标准指示调整损失往往产生不理想的性能,对输入快速变异作用的力度也有限。我们发现,在调控标物的低调权重加上中度至高度的响应标物,可以产生最佳表现模式,还可以作为随后的优惠调整培训的更佳起点。这些发现,我们需要重新思考的源码/变换。
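A minimal sketch of a differentially weighted instruction tuning loss in the spirit of the abstract of Article 29. It assumes per-token negative log-likelihoods are already computed and that the loss is normalized by the total weight; the normalization, the function name, and the parameter names are assumptions, so the released WIT code may differ.

```python
from typing import Sequence


def weighted_instruction_loss(token_nll: Sequence[float],
                              is_prompt: Sequence[bool],
                              prompt_weight: float,
                              response_weight: float) -> float:
    """Weighted average of per-token losses over prompt and response tokens.

    prompt_weight = 0 and response_weight = 1 recovers the conventional
    objective that computes the loss on response tokens only.
    """
    if len(token_nll) != len(is_prompt):
        raise ValueError("token_nll and is_prompt must have the same length")
    # Each token gets the weight of the segment it belongs to.
    weights = [prompt_weight if p else response_weight for p in is_prompt]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one token must carry non-zero weight")
    # Assumed normalization: divide by the sum of the weights.
    return sum(w * nll for w, nll in zip(weights, token_nll)) / total_weight
```

Sweeping `prompt_weight` and `response_weight` over a small grid is then enough to compare the conventional setting against the low-to-moderate prompt weights the abstract reports as working best.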
Article 30
Title@2025-07-10 (4): Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning
Title: Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning | Verständnis und Kontrolle von Wiederholungsneuronen und Induktionsköpfen im In-Context-Lernen | 了解和控制再生中新中世纪和内文学习中的上岗负责人 2507.07810v1 |
Authors (3): Nhi Hoai Doan, Tatsuya Hiraoka, Kentaro Inui
This paper investigates the relationship between large language models’ (LLMs) ability to recognize repetitive input patterns and their performance on in-context learning (ICL). In contrast to prior work that has primarily focused on attention heads, we examine this relationship from the perspective of skill neurons, specifically repetition neurons. Our experiments reveal that the impact of these neurons on ICL performance varies depending on the depth of the layer in which they reside. By comparing the effects of repetition neurons and induction heads, we further identify strategies for reducing repetitive outputs while maintaining strong ICL capabilities.
本文探讨大型语言模型识别重复输入模式的能力与这些模型在文字内学习方面的表现之间的关系(ICL)。与以前主要侧重于关注负责人的工作相比,我们从技能神经元,特别是重复神经元的角度来研究这种关系。我们的实验显示,这些神经元对ICL性能的影响因所处层次的深度而异。通过比较重复神经元和感应头的影响,我们进一步确定减少重复产出的战略,同时保持ICL的强大能力。
Article 31
Title@2025-07-10 (4): Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers
Title: Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers | Überbrückung von Logik und Lernen: Dekodierung von Temporal Logic-Embeddings über Transformer | 架桥逻辑与学习:通过变形器解码时时逻辑嵌入 2507.07808v1 |
Authors (4): Sara Candussio, Gaia Saveri, Gabriele Sarti, Luca Bortolussi
Continuous representations of logic formulae allow us to integrate symbolic knowledge into data-driven learning algorithms. If such embeddings are semantically consistent, i.e. if similar specifications are mapped into nearby vectors, they enable continuous learning and optimization directly in the semantic space of formulae. However, to translate the optimal continuous representation into a concrete requirement, such embeddings must be invertible. We tackle this issue by training a Transformer-based decoder-only model to invert semantic embeddings of Signal Temporal Logic (STL) formulae. STL is a powerful formalism that allows us to describe properties of signals varying over time in an expressive yet concise way. By constructing a small vocabulary from STL syntax, we demonstrate that our proposed model is able to generate valid formulae after only 1 epoch and to generalize to the semantics of the logic in about 10 epochs. Additionally, the model is able to decode a given embedding into formulae that are often simpler in terms of length and nesting while remaining semantically close (or equivalent) to gold references. We show the effectiveness of our methodology across various levels of training formulae complexity to assess the impact of training data on the model’s ability to effectively capture the semantic information contained in the embeddings and generalize out-of-distribution. Finally, we deploy our model for solving a requirement mining task, i.e. inferring STL specifications that solve a classification task on trajectories, performing the optimization directly in the semantic space.
逻辑公式的持续表达方式允许我们将象征性知识纳入数据驱动的学习算法中。 如果这种嵌入方式具有内在一致性, 即如果将相似的规格映射到附近的矢量中, 能够直接在公式的语义空间中进行持续学习和优化。 但是, 要将最佳的连续表达方式转化为具体要求, 这种嵌入方式必须是不可忽略的。 我们通过训练一个基于变异器的解码器单一模型, 将信号时间逻辑公式( STL) 的语义嵌入垂直嵌入。 STL 是一种强大的形式主义, 使我们能够以直截面但简洁的方式描述时间上不同信号的属性。 我们通过从STLyn Tax 构建一个小的词汇来直接学习和优化公式空间表达方式。 我们的模型能够在大约10个球中概括逻辑的语义。 此外, 模型可以解码一个在公式中嵌入的内嵌入点, 其长度和嵌入方式通常比较简单, 同时保持语义性( 或等效) , 以直截面的方式描述不同时间级的语义解解的语义解解解解解解解解解解解码。 我们的Sl定位的解的解解解解解的解析中, 。 我们测测算的模型中, 。
Article 32
Title@2025-07-10 (4): Decoding AI Judgment: How LLMs Assess News Credibility and Bias
Title: Decoding AI Judgment: How LLMs Assess News Credibility and Bias | Entschlüsselung des AI-Urteils: Wie LLMs neue Glaubwürdigkeit und Bias bewerten | 证明AI 判决:LLMs如何评估新闻信誉和Bias 2502.04426v2 |
Authors (9): Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Alessandro Santirocchi, Roberto Atzeni, Matteo Cinelli, Vincenzo Cestari, Clelia Rossi-Arnaud, Walter Quattrociocchi
Large Language Models (LLMs) are increasingly embedded in workflows that involve evaluative processes. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings (NewsGuard and Media Bias/Fact Check, MBFC) and against human judgments collected through a controlled experiment. To enable direct comparison, we implement a structured agentic framework in which both models and non-expert participants follow the same evaluation procedure: selecting criteria, retrieving content, and producing justifications. Despite output alignment, LLMs rely on different mechanisms: lexical associations and statistical priors replace contextual reasoning. This reliance produces systematic effects: political asymmetries, opaque justifications, and a tendency to confuse linguistic form with epistemic validity. Delegating judgment to such systems does not merely automate evaluation; it redefines it, shifting from normative reasoning to pattern-based approximation.
大型语言模型(LLMS)越来越多地嵌入涉及评价过程的工作流程中,这就需要审查这类评价是如何建立的,它们依赖哪些假设,以及它们的战略与人类的战略有何不同。我们根据专家评级-NewsGuard和Media Bias/Fact Check(MBFC)和通过受控试验收集的人类判断对6 LLMS进行基准测试。为了进行直接比较,我们实施了一个结构化的代理框架,模型和非专家参与者在其中遵循同样的评价程序:选择标准、检索内容和提出理由。尽管产出一致,LLMS依靠不同的机制:词汇协会和统计前身取代了背景推理。这种依赖产生了系统效应:政治不对称、不透明的理由以及将语言形式与认知有效性混为一谈的倾向。
Article 33
Title@2025-07-10 (4): StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model
Title: StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model | StreamUni: Streaming Speech Translation mit einem einheitlichen Large Speech-Language-Modell erreichen | StreamUni:用统一大型语音语言模式实现流式语音翻译 2507.07803v1 |
Authors (6): Shoutao Guo, Xiang Li, Shaolei Zhang, Mengge Liu, Wei Chen, Yang Feng
Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without requiring massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks.
流式语音翻译（StreamST）需要确定合适的时机（称为策略），在持续接收源语音输入的同时生成翻译，并在低延迟与高翻译质量之间取得平衡。然而，现有的 StreamST 方法通常在句子级语音片段上运行，这被称为同步语音翻译（SimulST）。在实践中，它们需要与分割模型协作才能完成 StreamST，而被截断的语音片段迫使 SimulST 模型只能基于有限的上下文信息做出策略决策并生成翻译。此外，由于语音输入和跨语言生成的复杂性，SimulST 模型难以学到有效的策略。为应对这些挑战，我们提出 StreamUni，通过统一的大型语音语言模型（LSLM）实现 StreamST。具体而言，StreamUni 引入语音思维链（CoT），引导 LSLM 生成多阶段输出。借助这些多阶段输出，StreamUni 同时完成语音分割、策略决策和翻译生成，无需大规模的策略专用训练即可完成 StreamST。此外，我们提出一种流式 CoT 训练方法，利用有限的 CoT 数据增强低延迟的策略决策和生成能力。实验表明，我们的方法在 StreamST 任务上达到了最先进的性能。
Article 34
Title@2025-07-10 (4): Understanding Chain-of-Thought in LLMs through Information Theory
Title: Understanding Chain-of-Thought in LLMs through Information Theory | Verständnis der in LLMs durch Informationstheorie gesuchten Gedankenkette | 通过信息理论在LLM 中探索了解链 2411.11984v2 |
Authors (3): Jean-Francois Ton, Muhammad Faaiz Taufiq, Yang Liu
Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through the use of Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short in accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the "information gain" at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy arithmetic, GSM8K and PRM800K datasets, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual subtasks.
大型语言模型(LLMs)通过使用“Cott”推理,在复杂的推理任务中表现出了令人印象深刻的成绩,使模型能够将问题分为可管理的子任务,然而,现有的Cot评价技术要么需要附加注释的COT数据,要么在准确评估中间推理步骤方面做得不够,导致出现高比例的假正数。在本文件中,我们通过信息理论透镜将LLMs中的Cot推理工作正式化。具体地说,我们的框架对每个推理步骤的“信息收益”进行了量化,使得LLOMs的失败模式得以识别,而无需花费昂贵的附加注释数据集。我们通过微量算术、GSM8K和PRM800k数据集的广泛实验,展示了我们的方法的有效性,在这些实验中,它通过提供对单个子任务模型性能的更准确的洞察力,大大超过现有的基于结果的方法。
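
To make the idea of a per-step "information gain" in the abstract above more concrete, here is a minimal sketch. It assumes that some estimator (hypothetical here) gives the probability of the correct final answer after each reasoning step, and it defines the gain of a step as the change in log-probability. The abstract does not specify the paper's estimator, so both this definition and the function names are assumptions for illustration.

```python
import math
from typing import List, Sequence


def step_information_gain(answer_probs: Sequence[float]) -> List[float]:
    """Log-probability gain (in nats) contributed by each reasoning step.

    answer_probs[0] is the estimated probability of the correct final
    answer before any reasoning; answer_probs[t] is the estimate after
    step t. All values must be strictly positive.
    """
    return [
        math.log(after) - math.log(before)
        for before, after in zip(answer_probs, answer_probs[1:])
    ]


def flag_uninformative_steps(answer_probs: Sequence[float], tol: float = 1e-9) -> List[int]:
    """Return the 1-based indices of steps whose gain is not above tol."""
    gains = step_information_gain(answer_probs)
    return [i + 1 for i, g in enumerate(gains) if g <= tol]


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.2, 0.8]   # hypothetical estimates for a 3-step chain
    print(step_information_gain(probs))
    print(flag_uninformative_steps(probs))
```

Under these assumptions, a step with zero or negative gain is a candidate failure point, and it can be located without step-level human annotations.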
Article 35
Title@2025-07-10 (4): Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model
Title: Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model | Squeeze the Soaked Sponge: Effiziente Off-Policy-Verstärkung Feinsteuerung für großes Sprachmodell | 挤压海绵:高效非政策强化大语言模式的高效非政策改进微调 2507.06892v2 |
Authors (8): Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, Jianye Hao
Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost in compute and time, creating a stringent bottleneck for continued economical and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models on top of PPO and GRPO, using 1.5B and 7B base models. ReMix reaches an average Pass@1 accuracy of 52.10% (for the 1.5B model) with 0.079M response rollouts and 350 training steps, and achieves 63.27%/64.39% (for the 7B model) with 0.007M/0.011M response rollouts and 50/75 training steps, on five math reasoning benchmarks (i.e., AIME’24, AMC’23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over 30x to 450x reduction in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy, the collapse mode of self-reflection behavior under the presence of severe off-policyness, etc.
强化学习(RL) 展示了它提高大语言模型(LLM)推理能力的潜力。 多数现有的强化微调(RFT)方法的一个主要局限性是,这些方法在性质上是政策性RL,即,过去学习过程中生成的数据没有得到充分利用。这不可避免地要付出巨大的计算和时间代价,对持续的经济和高效规模造成严格的瓶颈。 为此,我们启动了退出政策RL的复兴,并提议恢复Mix-polient Explex Proximal Deferation(ReMix),这是在政策上推行50FT方法(PPO和GROPO等)以利用离政策数据的一般方法。 ReMix包含三个主要组成部分:(1) Mix-polict-Data(UTD)比率增加;(2) KL-Conex政策制约,以平衡稳定性和灵活性之间的贸易模式;(3) 政策再整合,从早期高效学习到平稳的测试改进。 (REMMMM) 在我们的实验中,我们用一个平均的模型显示SO-DM数据序列,包括最新的SO-Dal deal deal deal deal destrate deal deal demode deal deal deal demodestration the mession the messional demodudustration the mex in the mex the mex mex mex mex deal deal dealation)
Article 36
Title@2025-07-10 (4): When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance
Title: When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance | Wenn große Sprachmodelle das Recht erfüllen: Dual-Lens-Taxonomie, technischer Fortschritt und ethische Governance | 当大语言模式符合法律时:双重语言分类、技术进步和道德治理 2507.07748v1 |
Authors (5): Peizhang Shao, Linrui Xu, Jinxi Wang, Wei Zhou, Xingyu Wu
This paper establishes the first comprehensive review of Large Language Models (LLMs) applied within the legal domain. It pioneers an innovative dual lens taxonomy that integrates legal reasoning frameworks and professional ontologies to systematically unify historical research and contemporary breakthroughs. Transformer-based LLMs, which exhibit emergent capabilities such as contextual reasoning and generative argumentation, surmount traditional limitations by dynamically capturing legal semantics and unifying evidence reasoning. Significant progress is documented in task generalization, reasoning formalization, workflow integration, and addressing core challenges in text processing, knowledge integration, and evaluation rigor via technical innovations like sparse attention mechanisms and mixture-of-experts architectures. However, widespread adoption of LLM introduces critical challenges: hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry. This review proposes a novel taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, thus systematizing advances in reasoning, retrieval, prediction, and dispute resolution. It identifies key frontiers including low-resource systems, multimodal evidence integration, and dynamic rebuttal handling. Ultimately, this work provides both a technical roadmap for researchers and a conceptual framework for practitioners navigating the algorithmic future, laying a robust foundation for the next era of legal artificial intelligence. We have created a GitHub repository to index the relevant papers: https://github.com/Kilimajaro/LLMs_Meet_Law.
本文件确立了对在法律领域应用的大型语言模型的第一次全面审查,它开创了一种创新的双镜头分类学,将法律推理框架和专业理论结合起来,以系统地统一历史研究和当代突破。基于变革的LLMs, 展示了背景推理和基因推理等新兴能力,通过动态地捕捉法律语义和统一证据推理克服了传统限制。在任务概括、推理正规化、工作流程整合和通过技术创新解决文本处理、知识整合和评价方面的核心挑战方面取得了显著进展。它确定了关键领域,包括低资源系统、多式证据整合和动态的反驳处理。然而,广泛采用LLM带来了关键挑战:幻觉、可解释性缺陷、管辖适应困难和道德不对称。本审查提出了新颖的分类学,为NLP子塔斯和计算性论证框架绘制了法律作用图,从而系统化了逻辑推理、检索、预测和争议解决方面的进展。我们确定了关键领域,包括低资源系统、多式证据整合,以及动态的反驳处理。这项工作最终为研究人员提供了一个技术路线图,为未来日历数据库提供了一个相关的方向基础和概念框架。
Article 37
Title@2025-07-10 (4): Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review
Title: Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review | Code-Switching in End-to-End Automatische Spracherkennung: Ein systematischer Literaturbericht | 端至端自动语音识别代码转换:系统文学审查 2507.07741v1 |
Authors (5): Maha Tufail Agro, Atharva Kulkarni, Karima Kadaoui, Zeerak Talat, Hanan Aldarmaki
Motivated by a growing research interest in automatic speech recognition (ASR), and the growing body of work for languages in which code-switching (CS) often occurs, we present a systematic literature review of code-switching in end-to-end ASR models. We collect and manually annotate papers published in peer-reviewed venues. We document the languages considered, datasets, metrics, model choices, and performance, and present a discussion of challenges in end-to-end ASR for code-switching. Our analysis thus provides insights on current research efforts and available resources as well as opportunities and gaps to guide future research.
由于对自动语音识别(ASR)的研究兴趣日益浓厚,而且对经常使用代码转换(CS)的语文的工作越来越多,我们系统地对终端至终端ASR模型的代码转换进行文献审查,我们收集并手工编写在同行审议地点发表的注释文件,记录所审议的语文、数据集、指标、模式选择和绩效,并讨论终端至终端ASR在代码转换方面的挑战。因此,我们的分析提供了对当前研究工作和现有资源以及指导未来研究的机会和差距的深刻见解。
Article 38
Title@2025-07-10 (4): GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing
Title: GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing | GuardVal: Dynamic Large Language Model Jailbreak Evaluation für umfassende Sicherheitstests | 警卫:综合安全测试动态大语言示范监狱防爆评价 2507.07735v1 |
Authors (4): Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang
Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the sophistication required in effectively probing their vulnerabilities. Current benchmarks and evaluation methods struggle to fully address these challenges, leaving gaps in the assessment of LLM vulnerabilities. In this paper, we review existing jailbreak evaluation practices and identify three assumed desiderata for an effective jailbreak evaluation protocol. To address these challenges, we introduce GuardVal, a new evaluation protocol that dynamically generates and refines jailbreak prompts based on the defender LLM’s state, providing a more accurate assessment of defender LLMs’ capacity to handle safety-critical situations. Moreover, we propose a new optimization method that prevents stagnation during prompt refinement, ensuring the generation of increasingly effective jailbreak prompts that expose deeper weaknesses in the defender LLMs. We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains. Our findings highlight distinct behavioral patterns among the models, offering a comprehensive view of their robustness. Furthermore, our evaluation process deepens the understanding of LLM behavior, leading to insights that can inform future research and drive the development of more secure models.
在大型语言模型(LLMs)中,破狱袭击暴露出巨大的脆弱性,导致其产生有害或不道德的内容。评估这些威胁尤其具有挑战性,因为LLMs的性质不断演变,有效调查其脆弱性需要精密的技巧。目前的基准和评价方法努力充分应对这些挑战,在评估LLM脆弱性方面留下空白。在本文件中,我们审查现有的破狱评估做法,并找出三种假定的缺陷,以制定有效的破狱评估议定书。为了应对这些挑战,我们引入了GuardVal,这是一项新的评价协议,根据捍卫者LLM的状态,积极生成和完善破狱提示,为捍卫者LLMs处理安全危急情况的能力提供了更准确的评估。此外,我们提出了新的优化方法,防止在迅速完善期间出现停滞,确保产生日益有效的破狱提示,暴露捍卫者LMMs的更深处弱点。我们将这一协议应用于从Mistral-7b到GPT-4的10个安全领域的各种模式。我们的调查结果突出了各种模式中的不同行为模式,提供了对其健全性的全面认识。此外,我们的评估进程加深了对LMM行为的未来认识。
Article 39
Title@2025-07-10 (4): Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization
Title: Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization | Nicht alle Präferenzen sind das, was Sie für das Post-Training benötigen: Selektive Ausrichtungsstrategie für die Preference-Optimierung | 并非所有的优惠都是培训后需要的:选择性的优化优化战略 2507.07725v1 |
Authors (1): Zhijin Dong
Post-training alignment of large language models (LLMs) is a critical challenge, as not all tokens contribute equally to model performance. This paper introduces a selective alignment strategy that prioritizes high-impact tokens within preference pairs, leveraging token-level log-probability differences between the current policy and a reference model. By focusing on these informative tokens, our approach reduces computational overhead and enhances alignment fidelity. We further explore the role of reference model quality, demonstrating that stronger reference models significantly improve token selection accuracy and overall optimization effectiveness. Comprehensive experiments on benchmarks such as Arena-Hard and MT-Bench validate the superiority of our Selective-DPO method over standard DPO and distillation-based baselines. Our findings highlight the importance of token-level optimization and reference model selection in advancing preference alignment for LLMs. The code is available at https://github.com/Dongzhijin/SDPO.
大型语言模式(LLMs)的培训后调整是一个重大挑战,因为并非所有象征性都同样有助于示范性业绩。本文引入了选择性调整战略,将特惠配方中高影响力的标牌列为优先,利用当前政策和参考模式之间的象征性记录概率差异。我们的方法注重这些信息性标牌,减少了计算性间接费用,提高了一致性。我们进一步探讨了参考性模型质量的作用,表明较强的参考模型大大提高了象征性选择准确性和总体优化效果。Arena-Hard和MT-Bench等基准的全面实验验证了我们选择性-DPO方法优于标准的DPO和蒸馏基准。我们的调查结果强调,在推进普惠性LMs的优惠一致性方面,象征性优化和参考模式选择十分重要。该代码可在https://github.com/Dongzhijin/SDPO上查阅。
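
The abstract above says that high-impact tokens are prioritized using token-level log-probability differences between the current policy and a reference model. The sketch below shows one way such a selection step could look. The scoring rule (absolute difference), the `keep_ratio` parameter, and the function name are assumptions for illustration; the abstract does not give the exact criterion used by Selective-DPO.

```python
import math
from typing import List, Sequence


def select_tokens(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return positions of the tokens with the largest policy/reference gap.

    Assumption: a token's "impact" is |log p_policy - log p_ref|.
    The returned indices are sorted by position in the sequence.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("log-prob sequences must have the same length")
    gaps = [abs(p - r) for p, r in zip(policy_logps, ref_logps)]
    if not gaps:
        return []
    k = max(1, math.ceil(keep_ratio * len(gaps)))
    # Rank positions by gap, largest first, then restore sequence order.
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    policy = [-1.0, -0.2, -3.0, -0.5]   # hypothetical per-token log-probs
    ref = [-1.1, -2.0, -3.0, -1.5]
    print(select_tokens(policy, ref, keep_ratio=0.5))
```

In a training loop, the selected positions would act as a mask so that only those tokens contribute to the preference loss, which is where the reduction in computation claimed in the abstract would come from.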
Article 40
Title@2025-07-10 (4): Rethinking the Privacy of Text Embeddings: A Reproducibility Study of “Text Embeddings Reveal (Almost) As Much As Text”
Title: Rethinking the Privacy of Text Embeddings: A Reproducibility Study of “Text Embeddings Reveal (Almost) As Much As Text” | Die Privatsphäre von Text-Embeddings neu denken: Eine Reproduzierbarkeitsstudie von “Text-Embeddings Reveal (fast) So viel wie Text” | 重新思考文字嵌入的隐私:关于“文字嵌入流(几乎)与文字一样”的可复制性研究 2507.07700v1 |
Authors (4): Dominykas Seputis, Yongkang Li, Karsten Langerak, Serghei Mihailov
Text embeddings are fundamental to many natural language processing (NLP) tasks, extensively applied in domains such as recommendation systems and information retrieval (IR). Traditionally, transmitting embeddings instead of raw text has been seen as privacy-preserving. However, recent methods such as Vec2Text challenge this assumption by demonstrating that controlled decoding can successfully reconstruct original texts from black-box embeddings. The unexpectedly strong results reported by Vec2Text motivated us to conduct further verification, particularly considering the typically non-intuitive and opaque structure of high-dimensional embedding spaces. In this work, we reproduce the Vec2Text framework and evaluate it from two perspectives: (1) validating the original claims, and (2) extending the study through targeted experiments. First, we successfully replicate the original key results in both in-domain and out-of-domain settings, with only minor discrepancies arising due to missing artifacts, such as model checkpoints and dataset splits. Furthermore, we extend the study by conducting a parameter sensitivity analysis, evaluating the feasibility of reconstructing sensitive inputs (e.g., passwords), and exploring embedding quantization as a lightweight privacy defense. Our results show that Vec2Text is effective under ideal conditions, capable of reconstructing even password-like sequences that lack clear semantics. However, we identify key limitations, including its sensitivity to input sequence length. We also find that Gaussian noise and quantization techniques can mitigate the privacy risks posed by Vec2Text, with quantization offering a simpler and more widely applicable solution. Our findings emphasize the need for caution in using text embeddings and highlight the importance of further research into robust defense mechanisms for NLP systems.
文本嵌入是许多自然语言处理( NLP) 任务的基础, 广泛应用于建议系统和信息检索( IR) 等领域。 传统上, 传输嵌入而非原始文本被视为隐私保护。 然而, 诸如 Vec2Text 等最新方法挑战了这一假设, 表明控制解码能够成功重建黑盒嵌入的原始文本。 Vec2Text 所报告的意外强烈结果促使我们进一步进行核查, 特别是考虑到高维嵌入空间通常非直观和不透明的结构。 在这项工作中, 我们复制了 Vec2T 框架, 从两个角度对它进行评估:(1) 验证原始要求和(2) 扩展研究范围, 通过有针对性的实验扩展了原始关键关键结果, 显示内部和外环境的原始解码。 此外, 我们通过进行参数敏感性分析, 评估敏感输入( 如, 密码T) 的可行性, 以及 探索嵌入的精度精度, 将精度的精度缩缩缩缩缩略度的精度技术, 显示我们的精度序列的精度的精度, 显示我们的精度的精度的精度的精细度的精细度的精度的精度的精度 。
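
The abstract above names Gaussian noise and quantization as defenses against embedding inversion. The following sketch shows both transformations applied to a plain list of floats. The uniform per-vector quantization scheme, the parameter names, and the defaults are assumptions for illustration and are not taken from the reproducibility study.

```python
import random
from typing import List, Optional, Sequence


def add_gaussian_noise(
    vec: Sequence[float], sigma: float, seed: Optional[int] = None
) -> List[float]:
    """Add zero-mean Gaussian noise with standard deviation sigma to each entry."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in vec]


def quantize(vec: Sequence[float], bits: int = 8) -> List[float]:
    """Uniformly quantize to 2**bits levels over the vector's own range.

    The result is returned in the original scale (dequantized), so it can
    be used as a drop-in replacement for the original embedding.
    """
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((x - lo) / step) * step for x in vec]


if __name__ == "__main__":
    embedding = [0.0, 0.1, 0.6, 1.0]          # hypothetical embedding
    print(quantize(embedding, bits=2))
    print(add_gaussian_noise(embedding, sigma=0.01, seed=0))
```

Both transformations discard fine-grained information that an inversion model could exploit. Quantization is also deterministic and reduces storage, which fits the abstract's description of it as the simpler option.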
Article 41
Title@2025-07-10 (4): What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training
Title: What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training | Was wissen selbstüberwachte Sprachmodelle über Niederländisch? Analysieren von Vorteilen sprachspezifischer Vorausbildung | 自我监督的演讲模式对荷兰语了解多少? 分析具体语言培训前培训的优势 2506.00981v2 |
Authors (6): Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum
How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it’s less clear to what extent pre-training on specific languages improves language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.
现有工作显示,一系列语言特征可以成功地从仅接受语言录音培训的端到端模式解码,然而,具体语言培训前培训能在多大程度上改善语言特定语言信息,这一点不太清楚。在这里,我们测试了在自我监督的Wav2Vec2模式的内部表述中荷兰语语音和词汇信息的编码。专门针对荷兰语的预先培训,改善了荷兰语语言特征的表述,与类似数量英语或较大多语言数据的培训前培训相比。这种语言特定优势通过经过培训的集群或分类探测得到很好地发现,并且使用零光度度来部分观察。此外,语言特征编码与自动语音识别的下游性能相一致,也带来了语言特征编码的好处。
Article 42
Title@2025-07-10 (4): KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities
Title: KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities | KeyKnowledgeRAG (K^2RAG): Eine verbesserte RAG-Methode zur Verbesserung der LLM-Fragestellung | KeyKnowledgeRAG (K^2RAG):改进LLM问答能力的强化RAG方法 2507.07695v1 |
Authors (4): Hruday Markondapatnaikuni, Basem Suleiman, Abdelkarim Erradi, Shijing Chen
Fine-tuning is an immensely resource-intensive process when retraining Large Language Models (LLMs) to incorporate a larger body of knowledge. Although many fine-tuning techniques have been developed to reduce the time and computational cost involved, the challenge persists as LLMs continue to grow in size and complexity. To address this, a new approach to knowledge expansion in LLMs is needed. Retrieval-Augmented Generation (RAG) offers one such alternative by storing external knowledge in a database and retrieving relevant chunks to support question answering. However, naive implementations of RAG face significant limitations in scalability and answer accuracy. This paper introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome these limitations. Inspired by the divide-and-conquer paradigm, K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. The framework also includes a preprocessing step that summarizes the training data, significantly reducing the training time. K2RAG was evaluated using the MultiHopRAG dataset, where the proposed pipeline was trained on the document corpus and tested on a separate evaluation set. Results demonstrated notable improvements over common naive RAG implementations. K2RAG achieved the highest mean answer similarity score of 0.57, and reached the highest third quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers. In addition to improved accuracy, the framework proved highly efficient. The summarization step reduced the average training time of individual components by 93%, and execution speed was up to 40% faster than traditional knowledge graph-based RAG systems. K2RAG also demonstrated superior scalability, requiring three times less VRAM than several naive RAG implementations tested in this study.
在再培训大型语言模型(LLMS)以纳入更多的知识时,微调是一个非常耗资的过程。虽然已经开发了许多微调技术来减少所涉及的时间和计算成本,但挑战依然存在,因为LLMS在规模和复杂性上继续增长。要解决这个问题,就需要在LLMS中采用新的知识扩展方法。检索增强的一代(RAG)提供了一个这样的替代办法,将外部知识储存在一个数据库中,并重新获取相关数据块以支持回答问题。然而,对RAG的天真的实施在可缩缩缩缩和回答准确性方面面临着重大限制。本文介绍了KeyKKKnowledgeraG(K2RAG),这是一个旨在克服这些限制的新框架。受分解和正拼模式的启发,K2RAG整合了密集和稀薄的矢量搜索、知识图表和文本合成,以提高检索质量和系统效率。框架还包括一个预处理步骤,以总结培训数据,大大减少培训时间。K2RAG的天平调度评估是通过多HOG(M)的缩缩缩略数据集,其中拟议的编订了最高执行时间,并测试了KLADADAD的进度。
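
The abstract above mentions combining dense and sparse vector search. One common way to merge two ranked result lists is reciprocal rank fusion (RRF), sketched below. RRF is a swapped-in technique chosen for illustration; the abstract does not state how K2RAG fuses its retrievers, and the document identifiers and the constant `k=60` are assumptions.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1). Ties are broken by document id for determinism.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense_hits = ["d1", "d2", "d3"]    # hypothetical dense-retriever order
    sparse_hits = ["d2", "d3", "d1"]   # hypothetical sparse-retriever order
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Rank-based fusion avoids having to calibrate dense similarity scores against sparse scores, which live on different scales.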
Article 43
Title@2025-07-10 (4): SAS: Simulated Attention Score
Title: SAS: Simulated Attention Score | SAS: Simulierter Aufmerksamkeits-Score | SAS:模拟关注计分 2507.07694v1 |
Authors (15): Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Peihao Wang, Jing Xiong, Liliang Ren, Hao Cheng, Janardhan Kulkarni, Yelong Shen, Atlas Wang, Mac Schwager, Anderson Schneider, Xiaodong Liu, Jianfeng Gao
The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, grouped-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and a larger hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. Beyond the head representations, we further extend the simulation approach to the feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.
注意力机制是 Transformer 架构的核心组成部分。已有多种计算注意力分数的方法，包括多头注意力（MHA）、多查询注意力、分组查询注意力等。我们进一步分析 MHA，发现只要每个头的隐藏维度保持足够大，其性能会随注意力头数的增加而提升。因此，以极小的参数开销同时增加头数和每个头的隐藏维度，可以以较低成本带来显著的性能提升。受此启发，我们提出模拟注意力分数（SAS），在保持模型规模紧凑的同时，模拟更多的注意力头和更大的每头隐藏特征维度。其做法是将低维的头表示投影到更高维的空间，在不增加参数量的情况下有效提升注意力容量。除头表示之外，我们还将这一模拟方法扩展到键和查询嵌入的特征维度，通过模仿更大模型的行为来增强表达能力，同时保持原有模型规模。为控制参数开销，我们还提出参数高效注意力聚合（PEAA）。在多种数据集和任务上的综合实验证明了所提 SAS 方法的有效性，相比不同的注意力变体取得了显著提升。
Article 44
Title@2025-07-10 (4): Hierarchical Bracketing Encodings for Dependency Parsing as Tagging
Title: Hierarchical Bracketing Encodings for Dependency Parsing as Tagging | Hierarchische Bracketing-Encodings für Dependency Parsing als Tagging | 将依赖性剖析作为拖贴 2505.11693v2 |
Authors (4): Ana Ezquerro, David Vilares, Anssi Yli-Jyrä, Carlos Gómez-Rodríguez
We present a family of encodings for sequence labeling dependency parsing, based on the concept of hierarchical bracketing. We prove that the existing 4-bit projective encoding belongs to this family, but it is suboptimal in the number of labels used to encode a tree. We derive an optimal hierarchical bracketing, which minimizes the number of symbols used and encodes projective trees using only 12 distinct labels (vs. 16 for the 4-bit encoding). We also extend optimal hierarchical bracketing to support arbitrary non-projectivity in a more compact way than previous encodings. Our new encodings yield competitive accuracy on a diverse set of treebanks.
我们以等级括号概念为基础,提出一套按顺序标记依赖分析的编码。我们证明现有的4位投影编码属于这个家庭,但在用于编码一棵树的标签数量上并不理想。我们得出一个最佳的等级括号,只使用12个不同的标签(4位编码为16个)来尽量减少所用符号的数量和投影树的编码。我们还扩大最佳等级括号,以比以前的编码更紧凑的方式支持任意的非预测性。我们的新编码在不同的树库中产生竞争性精度。
Article 45
Title@2025-07-10 (4): Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues
Title: Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues | Ko-Konstruktives Verhalten von großen Sprachmodellen in Erklärungsdialogen untersuchen | 解释对话中大语言模式的共同调查行为 2504.18483v2 |
Authors (12): Leandra Fichtel, Maximilian Spliethöver, Eyke Hüllermeier, Patricia Jimenez, Nils Klowait, Stefan Kopp, Axel-Cyrille Ngonga Ngomo, Amelie Robrecht, Ingrid Scharlau, Lutz Terfloth, Anna-Lisa Vollmer, Henning Wachsmuth
The ability to generate explanations that are understood by explainees is the quintessence of explainable artificial intelligence. Since understanding depends on the explainee’s background and needs, recent research focused on co-constructive explanation dialogues, where an explainer continuously monitors the explainee’s understanding and adapts their explanations dynamically. We investigate the ability of large language models (LLMs) to engage as explainers in co-constructive explanation dialogues. In particular, we present a user study in which explainees interact with an LLM in two settings, one of which involves the LLM being instructed to explain a topic co-constructively. We evaluate the explainees’ understanding before and after the dialogue, as well as their perception of the LLMs’ co-constructive behavior. Our results suggest that LLMs show some co-constructive behaviors, such as asking verification questions, that foster the explainees’ engagement and can improve understanding of a topic. However, their ability to effectively monitor the current understanding and scaffold the explanations accordingly remains limited.
由于理解取决于解释者的背景和需要,最近的研究侧重于共同构思解释对话,其中一名解释者不断监测解释者的理解并动态调整其解释。我们调查了大语言模型(LLMs)作为解释者参与共同构思解释对话的能力。特别是,我们提出了一个用户研究,其中解释者在两个场合与一个LLM互动,其中之一是LLM被指示共同解释一个专题。我们评估解释者在对话前后的理解,以及他们对LLMS共同构思行为的看法。我们的结果表明,LLMS展示了一些共同构思行为,例如询问核查问题,促进解释者参与和增进对一个专题的理解。然而,他们有效监测当前理解并据此筛选解释的能力仍然有限。
Article 46
Title@2025-07-10 (4): Improving Cross-lingual Representation for Semantic Retrieval with Code-switching
Title: Improving Cross-lingual Representation for Semantic Retrieval with Code-switching | Verbesserung der Cross-lingual Darstellung für semantische Retrieval mit Code-Schaltung | 使用代码转换法改进语义检索的跨语种代表性 2403.01364v2 |
Authors (6): Mieradilijiang Maimaiti, Yuanhang Zheng, Ji Zhang, Yue Zhang, Wenpei Luo, Kaiyu Huang
Semantic Retrieval (SR) has become an indispensable part of the FAQ system in the task-oriented question-answering (QA) dialogue scenario. The demands for a cross-lingual smart-customer-service system for an e-commerce platform or some particular business conditions have been increasing recently. Most previous studies exploit cross-lingual pre-trained models (PTMs) for multi-lingual knowledge retrieval directly, while some others also leverage continual pre-training before fine-tuning PTMs on the downstream tasks. However, whichever scheme is used, previous work neglects to inform PTMs of some features of the downstream task, i.e., the PTMs are trained without any signals related to SR. To this end, in this work, we propose an Alternative Cross-lingual PTM for SR via code-switching. We are the first to utilize the code-switching approach for cross-lingual SR. Besides, we introduce a novel code-switched continual pre-training instead of directly using the PTMs on the SR tasks. The experimental results show that our proposed approach consistently outperforms the previous SOTA methods on SR and semantic textual similarity (STS) tasks with three business corpora and four open datasets in 20+ languages.
在以任务为导向的问答(QA)对话情景中,对跨语言智能客户服务系统的需求最近有所增加。对电子商务平台或某些特定商业条件的跨语言智能客户服务系统的需求最近不断增长。大多数以往的研究都利用跨语言预先培训模式直接检索多种语言知识,而其他一些研究也在对下游任务PTM进行微调之前,利用持续培训前训练,对下游任务的PTM进行微调。然而,无论采用什么方案,以往的工作都忽略了向PTM公司通报下游任务的某些特点,即培训他们的PTM公司,而不提供与SR有关的任何信号。为此,我们提议通过代码转换,为SR公司采用替代的跨语言跨语言跨语言的跨语言的PTM系统。我们首先对跨语言的SR采用代码转换方法。此外,我们引入了新型的代码转换前训练,而不是直接在SR任务上直接使用PTM。实验结果显示,我们提议的SVS+SMS的三种数据方法始终以SOTA格式取代了先前的三种SOTA方法。
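
The abstract above relies on code-switched text for continual pre-training. As a rough illustration of what code-switched input can look like, the sketch below swaps some words of a query for entries from a bilingual lexicon. The deterministic "every n-th translatable token" rule, the toy lexicon, and the function name are assumptions for this example; the abstract does not describe how the authors construct their code-switched data.

```python
from typing import Dict, List, Sequence


def code_switch(tokens: Sequence[str], lexicon: Dict[str, str], every: int = 2) -> List[str]:
    """Replace every `every`-th token that has a lexicon entry with its translation."""
    out: List[str] = []
    translatable_seen = 0
    for tok in tokens:
        if tok in lexicon:
            translatable_seen += 1
            if translatable_seen % every == 0:
                out.append(lexicon[tok])
                continue
        out.append(tok)
    return out


if __name__ == "__main__":
    # Hypothetical English-to-German lexicon for a customer-service query.
    lexicon = {"where": "wo", "my": "meine", "order": "Bestellung"}
    print(code_switch(["where", "is", "my", "order"], lexicon, every=2))
```

A randomized replacement rate is the more usual choice in practice; the fixed stride here only keeps the example reproducible.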
Article 47
Title@2025-07-10 (4): Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language
Title: Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language | Beyond Hate Speech: NLPs Herausforderungen und Chancen beim Enthumanisieren der Sprache | 超越仇恨言论:NLP在揭开非人化语言方面的挑战和机遇 2402.13818v2 |
Authors (5): Hamidreza Saffari, Mohammadamin Shafiei, Hezhao Zhang, Lasana Harris, Nafise Sadat Moosavi
Dehumanization, i.e., denying human qualities to individuals or groups, is a particularly harmful form of hate speech that can normalize violence against marginalized communities. Despite advances in NLP for detecting general hate speech, approaches to identifying dehumanizing language remain limited due to scarce annotated data and the subtle nature of such expressions. In this work, we systematically evaluate four state-of-the-art large language models (LLMs), namely Claude, GPT, Mistral, and Qwen, for dehumanization detection. Our results show that only one model, Claude, achieves strong performance (over 80% F1) under an optimized configuration, while others, despite their capabilities, perform only moderately. Performance drops further when distinguishing dehumanization from related hate types such as derogation. We also identify systematic disparities across target groups: models tend to over-predict dehumanization for some identities (e.g., Gay men), while under-identifying it for others (e.g., Refugees). These findings motivate the need for systematic, group-level evaluation when applying pretrained language models to dehumanization detection tasks.
剥夺人性,即剥夺个人或群体人格,是一种特别有害的仇恨言论形式,可以使针对边缘化社区的暴力行为正常化。尽管国家语言方案在发现一般仇恨言论方面取得了进展,但由于缺少附加说明的数据和这种表达方式的微妙性质,识别非人性语言的方法仍然有限。在这项工作中,我们系统地评估了四种最先进的大语言模式(LLMS)――Claude、GPT、Mistral和Quwen)――用于非人性化检测。我们的结果表明,在优化配置下,只有一种模式-Claude-achief(超过80%F1)表现强劲,而其他人尽管能力强,但表现却比较中度。在区分非人性化与相关的仇恨类型(如减损)时,表现进一步下降。我们还查明了不同目标群体之间的系统性差异:模型往往过度预测某些身份(例如同性恋男子)的非人性化,而对其他身份(例如难民)则未得到充分识别。这些结论表明,在应用预先培训的语言模式进行非人性化检测任务时,需要系统、群体一级评估。
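The group-level evaluation that the abstract above calls for can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `per_group_rates`, the group labels, and the toy records are all hypothetical. A high false-positive rate for a group corresponds to over-prediction, and a low recall to under-identification.

```python
from collections import defaultdict

def per_group_rates(records):
    """records: iterable of (group, gold, pred) with gold/pred in {0, 1}.
    Returns per-group precision, recall, F1 and false-positive rate."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for group, gold, pred in records:
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class
        key = ("t" if gold == pred else "f") + ("p" if pred == 1 else "n")
        counts[group][key] += 1
    out = {}
    for group, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        fpr = c["fp"] / (c["fp"] + c["tn"]) if c["fp"] + c["tn"] else 0.0
        out[group] = {"precision": precision, "recall": recall,
                      "f1": f1, "fpr": fpr}
    return out

# Hypothetical toy predictions: group "A" is over-predicted,
# group "B" is under-identified.
records = [
    ("A", 1, 1), ("A", 0, 1), ("A", 0, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 1, 0), ("B", 1, 1), ("B", 0, 0),
]
rates = per_group_rates(records)
```

Reporting these rates per target group, rather than one pooled F1, is what makes disparities of the kind described in the abstract visible.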
Article 48
Title@2025-07-10 (4): An Automated Length-Aware Quality Metric for Summarization
Title: An Automated Length-Aware Quality Metric for Summarization | Ein Automatisiertes Längen-Bewusst-Qualitäts-Metrik für die Zusammenfassung | 用于汇总的自动长软件质量计量器 2507.07653v1 |
Authors (1): Andrew D. Foland
This paper proposes NOrmed Index of Retention (NOIR), a quantitative objective metric for evaluating summarization quality of arbitrary texts that relies on both the retention of semantic meaning and the summary length compression. This gives a measure of how well the recall-compression tradeoff is managed, the most important skill in summarization. Experiments demonstrate that NOIR effectively captures the token-length / semantic retention tradeoff of a summarizer and correlates with human perception of summarization quality. Using a language model embedding to measure semantic similarity, it provides an automated alternative for assessing summarization quality without relying on time-consuming human-generated reference summaries. The proposed metric can be applied to various summarization tasks, offering an automated tool for evaluating and improving summarization algorithms, summarization prompts, and synthetically-generated summaries.
本文建议了NOMed保留指数(NOIR),这是评价任意文本质量的定量客观衡量标准,既依靠保留语义含义,又依靠简要长度压缩,用来评估任意文本的质量。这可以衡量对回扣压缩权衡的管理有多好,这是总结的最重要技能。实验表明,NOIR有效捕捉了总结者的象征性/语义保留权衡,并与人类对均衡质量的看法有关。它使用语言模型组合来衡量语义相似性,它为评估总和质量提供了一种自动化的替代方法,而不必依赖耗费时间的人工生成参考摘要。拟议的衡量标准可以应用于各种总和任务,为评估和改进加和算算法、加和合成生成摘要提供了一个自动化工具。
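The two quantities that the abstract above says the metric relies on, semantic retention and length compression, can be illustrated with a toy sketch. This is not the NOIR formula, which the abstract does not state: the bag-of-words cosine stands in for a language model embedding, and the function name `retention_and_compression` is hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retention_and_compression(source, summary):
    """Returns (retention, compression) for a source/summary pair.
    retention: bag-of-words cosine (assumed stand-in for an embedding model).
    compression: fraction of source tokens removed by the summary."""
    src, summ = source.lower().split(), summary.lower().split()
    if not src:
        raise ValueError("source text must not be empty")
    retention = cosine(Counter(src), Counter(summ))
    compression = 1.0 - len(summ) / len(src)
    return retention, compression

retention, compression = retention_and_compression(
    "the cat sat on the mat near the door", "cat sat on mat"
)
```

The sketch deliberately returns the two numbers separately; how they are normed and combined into a single index is left to the paper.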
Article 49
Title@2025-07-10 (4): Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement
Title: Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement | Lost in Pronunciation: Chinesische Offensive Sprache entdecken, verkleidet durch phonetische Umkleide-Ersatz | 失落于发音中:发现因替换语音内衣而变形的中国进攻性语言 2507.07640v1 |
Authors (11): Haotan Guo, Jianfei He, Jiayuan Ma, Hongbin Na, Zimu Wang, Haiyang Zhang, Qi Chen, Wei Wang, Zijing Shi, Tao Shen, Ling Chen
Phonetic Cloaking Replacement (PCR), defined as the deliberate use of homophonic or near-homophonic variants to hide toxic intent, has become a major obstacle to Chinese content moderation. While this problem is well-recognized, existing evaluations predominantly rely on rule-based, synthetic perturbations that ignore the creativity of real users. We organize PCR into a four-way surface-form taxonomy and compile a dataset of 500 naturally occurring, phonetically cloaked offensive posts gathered from the RedNote platform. Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness: the best model reaches only an F1-score of 0.672, and zero-shot chain-of-thought prompting pushes performance even lower. Guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective and show that it recovers much of the lost accuracy. This study offers the first comprehensive taxonomy of Chinese PCR, a realistic benchmark that reveals current detectors’ limits, and a lightweight mitigation technique that advances research on robust toxicity detection.
虽然这个问题已得到广泛承认,但现有的评价主要依赖基于规则的合成扰动,忽视了实际用户的创造力。我们把PCR组织成一个四向地表形态分类学,并汇编出一套由500个自然现象组成的数据集,即从红诺平台收集的有声隐蔽的进攻性柱。在这个数据集上确定最新水平的LLMS的基准暴露出一个严重的弱点:最佳模型只达到0.672的F1点,零点思维链推动性能甚至更低。我们根据错误分析,重新审视了以Pininin为基础的快速战略,早期研究认为它无效,并表明它收回了大部分丢失的准确性。这一研究提供了中国PCR的第一个全面分类学,一个显示当前探测器极限的现实基准,以及一个推进可靠毒性检测研究的轻量级减轻技术。
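The idea behind Pinyin-based prompting can be sketched with a benign homophone pair. This is an illustration under stated assumptions, not the authors' prompt: the four-character `TOY_PINYIN` table, the helper names, and the prompt wording are all hypothetical, and a real system would use a full Pinyin converter.

```python
# Tiny hand-written Pinyin table (tones omitted); hypothetical and incomplete.
TOY_PINYIN = {"杯": "bei", "具": "ju", "悲": "bei", "剧": "ju"}

def to_pinyin(text):
    """Map each character to toy Pinyin, leaving unknown characters as-is."""
    return " ".join(TOY_PINYIN.get(ch, ch) for ch in text)

def build_prompt(post):
    """Attach the Pinyin reading so a model can spot homophone substitutions."""
    return (
        f"Post: {post}\n"
        f"Pinyin: {to_pinyin(post)}\n"
        "Considering possible homophone substitutions, "
        "is this post offensive? Answer yes or no."
    )

# "杯具" (cups) is a well-known homophone of "悲剧" (tragedy).
prompt = build_prompt("杯具")
```

Because both spellings share the same reading, exposing the Pinyin gives the model a surface form in which the cloaked word and the intended word coincide.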
Article 50
Title@2025-07-10 (4): FrugalRAG: Learning to retrieve and reason for multi-hop QA
Title: FrugalRAG: Learning to retrieve and reason for multi-hop QA | FrugalRAG: Lernen zum Abrufen und Grund für Multi-Hop-QA | FrugalRAG:学会检索和多呼QA的理由 2507.07634v1 |
Authors (4): Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma
We consider the problem of answering complex questions, given access to a large unstructured document corpus. The de facto approach to solving the problem is to leverage language models that (iteratively) retrieve and reason through the retrieved documents, until the model has sufficient information to generate an answer. Attempts at improving this approach focus on retrieval-augmented generation (RAG) metrics such as accuracy and recall and can be categorized into two types: (a) fine-tuning on large question answering (QA) datasets augmented with chain-of-thought traces, and (b) leveraging RL-based fine-tuning techniques that rely on question-document relevance signals. However, efficiency in the number of retrieval searches is an equally important metric, which has received less attention. In this work, we show that: (1) Large-scale fine-tuning is not needed to improve RAG metrics, contrary to popular claims in recent literature. Specifically, a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help RAG from the perspective of frugality, i.e., the latency due to number of searches at inference time. For example, we show that we can achieve competitive RAG metrics at nearly half the cost (in terms of number of searches) on popular RAG benchmarks, using the same base model, and at a small training cost (1000 examples).
我们考虑了回答复杂问题的问题,因为有了大量结构化的文件资料库,我们考虑了回答复杂的问题的问题。事实上解决问题的方法是利用语言模型,这些语言模型(表面上)通过检索的文件检索和解释,直到该模型有足够的信息来找到答案。改进这一方法的尝试侧重于检索强化的生成(RAG)指标,例如准确性和回忆性,可以分为两类:(a)对大问题的回答(QA)数据集进行微调,增加思考链的痕迹;(b)利用基于RL的微调技术,这些技术依赖于问题文件的相关性信号。然而,检索搜索数量的效率是一个同样重要的衡量标准,但这一衡量标准得到的注意较少。在这项工作中,我们表明:(1) 与最近的文献中流行的说法相反,不需要进行大规模的微调来改进RAG的衡量标准。 具体地说,改进的提示性标准“ReAc”管道可以超越HotPA等基准的先进模型数目。 (2) 超额和基于RL的微调技术方法,用于在RAG的50%的搜索中,从我们进行适当的成本模型搜索中可以证明。
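The notion of frugality in the abstract above, i.e., the number of retrieval searches spent per question, can be sketched as an iterative retrieve-and-reason loop with an explicit search budget. This is a toy under stated assumptions, not the FrugalRAG pipeline: the keyword retriever, the `needed_terms` stopping rule (a stand-in for the model deciding it has enough evidence), and all names are hypothetical.

```python
def retrieve(query, corpus, k=1):
    """Toy keyword retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))
    return ranked[:k]

def answer_with_budget(question, corpus, needed_terms, max_searches):
    """Iteratively retrieve until the evidence covers `needed_terms`
    or the search budget is exhausted. Returns (evidence, searches_used)."""
    evidence, searches = [], 0
    query = question
    while searches < max_searches:
        docs = retrieve(query, corpus)
        searches += 1
        for doc in docs:
            if doc not in evidence:
                evidence.append(doc)
        text = " ".join(evidence).lower()
        missing = [t for t in needed_terms if t not in text]
        if not missing:
            break  # enough information: stop searching early
        query = missing[0]  # stand-in for the model's next reasoning step
    return evidence, searches

corpus = [
    "Alice was born in Paris",
    "Paris is the capital of France",
    "Bob lives in Berlin",
]
evidence, searches = answer_with_budget(
    "where was Alice born", corpus, ["alice", "capital"], max_searches=4
)
```

Counting `searches` alongside answer quality is the kind of accounting the abstract argues for: two systems with equal accuracy can differ substantially in inference-time latency.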
Article 51
Title@2025-07-10 (4): Towards a cognitive architecture to enable natural language interaction in co-constructive task learning
Title: Towards a cognitive architecture to enable natural language interaction in co-constructive task learning | Auf dem Weg zu einer kognitiven Architektur, um natürliche Sprachinteraktion im co-konstruktiven Aufgabenlernen zu ermöglichen | 建立一个认知结构,在共同建设性任务学习中促成自然语言互动 2503.23760v2 |
Authors (5): Manuel Scheibl, Birte Richter, Alissa Müller, Michael Beetz, Britta Wrede
This research addresses the question of which characteristics a cognitive architecture must have to leverage the benefits of natural language in Co-Constructive Task Learning (CCTL). To provide context, we first discuss Interactive Task Learning (ITL), the mechanisms of the human memory system, and the significance of natural language and multi-modality. Next, we examine the current state of cognitive architectures, analyzing their capabilities to inform a concept of CCTL grounded in multiple sources. We then integrate insights from various research domains to develop a unified framework. Finally, we conclude by identifying the remaining challenges and requirements necessary to achieve CCTL in Human-Robot Interaction (HRI).
这项研究涉及一个问题,即认知结构必须具备哪些特征才能在共同建设性任务学习中利用自然语言的好处。为了提供背景,我们首先讨论互动任务学习(ITL)、人类记忆系统的机制以及自然语言和多种模式的重要性。接下来,我们审视认知结构的现状,分析其基于多种来源为CCTL概念提供信息的能力。然后,我们整合各种研究领域的见解,以制定一个统一的框架。最后,我们通过确定在人类机器人互动中实现CCTL所需的其余挑战和要求。
Article 52
Title@2025-07-10 (4): Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights
Title: Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights | Vergleichende Stimmungsanalyse der öffentlichen Wahrnehmung: Monkeypox vs. COVID-19 Verhaltenseinblicke | 对公众感知的比较情绪分析:天花对COVID-19行为洞察力 2505.07430v2 |
Authors (3): Mostafa Mohaimen Akand Faisal, Rabeya Amin Jhuma, Jamini Jasim
The emergence of global health crises, such as COVID-19 and Monkeypox (mpox), has underscored the importance of understanding public sentiment to inform effective public health strategies. This study conducts a comparative sentiment analysis of public perceptions surrounding COVID-19 and mpox by leveraging extensive datasets of 147,475 and 106,638 tweets, respectively. Advanced machine learning models, including Logistic Regression, Naive Bayes, RoBERTa, DistilRoBERTa and XLNet, were applied to perform sentiment classification, with results indicating key trends in public emotion and discourse. The analysis highlights significant differences in public sentiment driven by disease characteristics, media representation, and pandemic fatigue. Through the lens of sentiment polarity and thematic trends, this study offers valuable insights into tailoring public health messaging, mitigating misinformation, and fostering trust during concurrent health crises. The findings contribute to advancing sentiment analysis applications in public health informatics, setting the groundwork for enhanced real-time monitoring and multilingual analysis in future research.
全球卫生危机的出现,如COVID-19和天花(天花),突显了了解公众情绪为有效公共卫生战略提供信息的重要性,这项研究通过利用147、475和106 638个推文的广泛数据集,对公众对COVID-19和天花的看法进行了比较情绪分析,利用了147、475和106 638个推文的广泛数据集,先进机器学习模型,包括后勤回流、Naive Bayes、RoBERTA、DistilROBERTA和XLNet,用于进行情绪分类,结果显示公众情绪和讨论的关键趋势。分析突出了疾病特征、媒体代表性和流行病疲劳等驱动的公共情绪的重大差异。通过情绪极化和主题趋势的透镜,这项研究为调整公共卫生信息、减少错误信息,以及在同时发生的卫生危机中培养信任提供了宝贵的洞察力。研究结果有助于推进公共卫生信息学中的情绪分析应用,为今后研究中加强实时监测和多语种分析奠定了基础。
Article 53
Title@2025-07-10 (4): Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks
Title: Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks | Erforschung der Grenzen der Modellkompression in LLMs: Eine Studie zur Wissensdestillation über QA-Aufgaben | 探索LLMM中模型压缩的限度:关于质量保证任务的知识积累研究 2507.07630v1 |
Authors (4): Joyeeta Datta, Niclas Doll, Qusai Ramadan, Zeyd Boukhers
Large Language Models (LLMs) have demonstrated outstanding performance across a range of NLP tasks; however, their computational demands hinder their deployment in real-world, resource-constrained environments. This work investigates the extent to which LLMs can be compressed using Knowledge Distillation (KD) while maintaining strong performance on Question Answering (QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5 families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot prompting conditions. Results show that student models retain over 90% of their teacher models’ performance while reducing parameter counts by up to 57.1%. Furthermore, one-shot prompting yields additional performance gains over zero-shot setups for both model families. These findings underscore the trade-off between model efficiency and task performance, demonstrating that KD, combined with minimal prompting, can yield compact yet capable QA systems suitable for resource-constrained applications.
大型语言模型(LLMS)在一系列NLP任务中表现出了杰出的成绩,然而,它们的计算要求阻碍了其在现实世界中受资源限制环境中的部署。这项工作调查了利用知识蒸馏(KD)压缩LLMS的程度,同时保持了在问答(QA)任务方面的强力表现。我们评估了Pythia和Quen2.5家庭根据两个QA基准(SQAD和MLQA)从SQuAD和MLQA中提炼的学生模型,它们处于零发和一发即时的提示性条件之下。结果显示,学生模型保留了90%以上的教师模型的成绩,同时将参数计数减少57.1 %。此外,一发即提示性能使两个模式家庭在零发计数上产生更多的业绩收益。这些结果强调了模型效率和任务性能之间的权衡,表明KD与最小的提示性能相结合,能够产生适合资源受限制应用的紧凑的QA系统。
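For readers unfamiliar with Knowledge Distillation, a common soft-target objective can be sketched as below. The abstract does not say which distillation loss the study used, so this is an assumption: the temperature-scaled KL divergence between teacher and student distributions, with hypothetical function names and toy logits.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as is conventional for soft-target distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]          # toy teacher logits (hypothetical)
close_student = [3.8, 1.1, 0.6]    # student that mimics the teacher well
far_student = [0.5, 1.0, 4.0]      # student that disagrees with the teacher
loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

A student whose softened distribution matches the teacher's incurs a loss near zero, which is the signal that lets a smaller model inherit the teacher's behavior.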
Article 54
Title@2025-07-10 (4): Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation
Title: Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation | Gute/böse Reputation Urteil von Prominenten durch LLMs über retrieval Augmented Generation | LLMs通过回回子增量一代对名词的良好/负面评奖判决 2503.14382v2 |
Authors (3): Rikuto Tsuchida, Hibiki Yokoyama, Takehito Utsuro
The purpose of this paper is to examine whether large language models (LLMs) can understand what is good and evil with respect to judging the good/evil reputation of celebrities. Specifically, we first apply a large language model (namely, ChatGPT) to the task of collecting sentences that mention the target celebrity from articles about celebrities on Web pages. Next, the collected sentences are categorized based on their contents by ChatGPT, where ChatGPT assigns a category name to each of those categories. Those assigned category names are referred to as “aspects” of each celebrity. Then, by applying the framework of retrieval augmented generation (RAG), we show that the large language model is quite effective in the task of judging the good/evil reputation of aspects and descriptions of each celebrity. Finally, to demonstrate the advantages of the proposed method over existing services incorporating RAG functions, we show that the proposed method of judging the good/evil of aspects/descriptions of each celebrity significantly outperforms an existing service incorporating RAG functions.
本文的目的是研究大型语言模型(LLMS)能否理解在判断名人的良好/有害声誉方面什么是好坏。 具体地说,我们首先应用一个大型语言模型( ChattGPT ) 来收集在网页上有关名人的文章中提及目标名人的句子。 其次,所收集的句子根据查特GPT的内容进行分类,查特GPT为其中的每个类别指定一个类别的名称。 指定的类别名称被称为每个名人“ 目标”。 然后,通过应用检索增强的一代(RAG ) 框架,我们表明大型语言模型在判断每个名人各方面和描述的良好/有害声誉方面相当有效。 最后,在证明拟议方法优于包含RAG功能的现有服务方面,我们表明,拟议的判断每个名人各个方面/特征的好坏的方法大大超越了包含RAG功能的现有服务。
Article 55
Title@2025-07-10 (4): SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs
Title: SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs | RaumViz-Bench: Automatisch generierte räumliche Visualisierungs-Aufgaben für MLLMs | 空间Viz-Bench:MLLLMs自动生成的空间可视化推理任务 2507.07610v1 |
Authors (8): Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, Jun Wang
Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark’s strong discriminative power, but also uncovers counter-intuitive findings: models exhibit unexpected behaviors by showing difficulty perception that misaligns with human intuition, displaying dramatic 2D-to-3D performance cliffs, and defaulting to formula derivation despite spatial tasks requiring visualization alone. SpatialViz-Bench empirically demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark is publicly available.
人类可以直接想象和操控其头脑中的视觉图像,这种能力被称为空间可视化。多式大语言模型(MLLMs)支持以想象为基础的推理,而空间可视化却仍然没有得到充分评价,通常体现在更广泛的数学和逻辑评估中。现有的评价往往依靠IQ测试或数学竞赛,这些测试或数学竞赛可能与培训数据重叠,损害评估可靠性。为此,我们引入了空间Viz-Bench,这是一个综合的多式空间可视化多式基准,有4个次功能的12项任务,包括1,180个自动产生的问题。我们对33个最新MLLLMs的评估不仅显示广泛的性能差异,并展示了基准的强烈的歧视性力量,而且还发现了反直觉发现:模型表现出出意出乎意料的行为,其所表现出的难度是与人类直觉不相容,展示了巨大的2D到3D的性悬崖,而且尽管空间任务需要单独视觉化,但仍默认公式衍生。空间VizBench 实验性地表明,最先进的MLLMS继续显示空间可视化任务中存在的缺陷,从而可以公开地衡量。
Article 56
Title@2025-07-10 (4): Bayesian Discrete Diffusion Beats Autoregressive Perplexity
Title: Bayesian Discrete Diffusion Beats Autoregressive Perplexity | Bayesian Discrete Diffusion schlägt autoregressive Verwirrung | Bayesian Discrete 扩散比自动递减的两难性 2507.07586v1 |
Authors (1): Cooper Doyle
We reveal a hidden Bayesian core of discrete-diffusion language models by showing that the expected denoiser output under the forward masking distribution recovers the exact posterior over clean tokens. Under minimal assumptions, Monte Carlo marginalization over K independent corruptions converges to this posterior at rate O(1/sqrt(K)), yielding a simple proof of consistency and finite-sample error bounds. Building on this insight, we introduce a lightweight inference-time ensemble that averages K mask-and-denoise passes to obtain posterior-aware token probabilities and uncertainty estimates at no extra training cost. On WikiText-2, our method achieves test perplexity 8.8 with K=8, versus 20.3 for GPT-2 Small, despite using a model of comparable size. Code is available at https://github.com/mercury0100/bayesradd.
我们揭示了一种隐蔽的贝叶斯语分流语言模型核心。 我们通过展示前面遮罩分布下预期的解名器输出将恢复准确的后遗迹而不是干净的象征物。 在最起码的假设下,蒙特卡洛对K独立腐败的边缘化以O(1/sqrt(K))率汇集到这个后遗迹上,产生一个简单的一致性证据和有限的标本误差界限。 基于这一洞察,我们引入了一种轻巧的推论时间共和元素,即平均使用 K 蒙面和隐蔽的通行证来获得后背负符号概率和不确定性估计,而无需额外培训费用。 在WikitText-2上,我们的方法实现了8.8和K=8的双倍性测试,而GPT-2 Small则达到20.3,尽管使用了类似规模的模式。 代码可在 https://github.com/mercury0100/Bayesradd上查阅。
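The inference-time ensemble described in the abstract above, averaging K independent mask-and-denoise passes, can be sketched with a toy stand-in for the denoiser. Everything model-specific here is an assumption: `toy_denoiser` is a hand-written rule rather than a trained network, and the masking probability and vocabulary are arbitrary.

```python
import random

MASK = "[MASK]"

def toy_denoiser(tokens, position, vocab):
    """Hypothetical denoiser: distribution over `vocab` for `position`,
    sharp when the left neighbour is visible, uniform otherwise."""
    left = tokens[position - 1] if position > 0 else None
    if left is None or left == MASK:
        return {v: 1.0 / len(vocab) for v in vocab}
    guess = vocab[(vocab.index(left) + 1) % len(vocab)]
    rest = 0.1 / (len(vocab) - 1)
    return {v: (0.9 if v == guess else rest) for v in vocab}

def mc_posterior(tokens, position, vocab, k, mask_prob=0.5, seed=0):
    """Average the denoiser output over k random corruptions.
    The target position is always masked; others are masked independently."""
    rng = random.Random(seed)
    avg = {v: 0.0 for v in vocab}
    for _ in range(k):
        corrupted = [
            MASK if i == position or rng.random() < mask_prob else tok
            for i, tok in enumerate(tokens)
        ]
        probs = toy_denoiser(corrupted, position, vocab)
        for v in vocab:
            avg[v] += probs[v] / k
    return avg

vocab = ["a", "b", "c"]
posterior = mc_posterior(["a", "b", "c"], position=1, vocab=vocab, k=8)
```

Since each pass is an independent sample, the averaged estimate has a Monte Carlo error that shrinks on the order of 1/sqrt(K), matching the convergence rate quoted in the abstract; no retraining is involved, only extra forward passes.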
Article 57
Title@2025-07-10 (4): Beyond Overcorrection: Evaluating Diversity in T2I Models with DivBench
Title: Beyond Overcorrection: Evaluating Diversity in T2I Models with DivBench | Jenseits von Überkorrektur: Bewertung von Diversität in T2I-Modellen mit DivBench | 超越过度纠正:在DivBench的T2I模型中评估多样性 2507.03015v2 |
Authors (5): Felix Friedrich, Thiemo Ganesha Welsch, Manuel Brack, Patrick Schramowski, Kristian Kersting
Current diversification strategies for text-to-image (T2I) models often ignore contextual appropriateness, leading to over-diversification where demographic attributes are modified even when explicitly specified in prompts. This paper introduces DIVBENCH, a benchmark and evaluation framework for measuring both under- and over-diversification in T2I generation. Through systematic evaluation of state-of-the-art T2I models, we find that while most models exhibit limited diversity, many diversification approaches overcorrect by inappropriately altering contextually-specified attributes. We demonstrate that context-aware methods, particularly LLM-guided FairDiffusion and prompt rewriting, can already effectively address under-diversity while avoiding over-diversification, achieving a better balance between representation and semantic fidelity.
目前的文字到图像(T2I)模型多样化战略往往忽视了背景的适宜性,导致人口特征被过度多样化,即使有明确的规定,人口属性也会被修改,本文件介绍了DIVBENCH,这是衡量T2I一代中差异不足和差异过大的基准和评价框架。通过对最新T2I模型的系统评估,我们发现,虽然大多数模型表现出有限的多样性,但许多多样化方法因环境特征的不适当改变而过于纠正。我们证明,对背景的认识方法,特别是LLLM-指导的公平传播和迅速重写,已经能够有效解决多样性不足问题,同时避免过度多样化,在代表性和语义真实性之间实现更好的平衡。
Article 58
Title@2025-07-10 (4): Improving Clustering on Occupational Text Data through Dimensionality Reduction
Title: Improving Clustering on Occupational Text Data through Dimensionality Reduction | Verbesserung der Clusterbildung auf berufsbezogenen Textdaten durch Dimensionalitätsreduzierung | 通过减少分量改进职业文本数据集群化 2507.07582v1 |
Authors (3): Iago Xabier Vázquez García, Damla Partanaz, Emrullah Fatih Yetkin
In this study, we focused on proposing an optimal clustering mechanism for the occupations defined in the well-known US-based occupational database, O*NET. Even though all occupations are defined according to well-conducted surveys in the US, their definitions can vary for different firms and countries. Hence, if one wants to expand the data that is already collected in O*NET for the occupations defined with different tasks, a map between the definitions will be a vital requirement. We proposed a pipeline using several BERT-based techniques with various clustering approaches to obtain such a map. We also examined the effect of dimensionality reduction approaches on several metrics used in measuring performance of clustering algorithms. Finally, we improved our results by using a specialized silhouette approach. This new clustering-based mapping approach with dimensionality reduction may help distinguish the occupations automatically, creating new paths for people wanting to change their careers.
在这项研究中,我们侧重于为众所周知的美国职业数据库ONET中界定的职业提出一个最佳的集群机制。尽管所有职业的定义都是根据美国进行的良好调查确定的,但对于不同的公司和国家,其定义可能有所不同。因此,如果想扩大ONET中已经为不同任务界定的职业收集的数据,定义之间的图将是一项至关重要的要求。我们建议使用几种基于BERT的技术编审管道,以获得这样的地图。我们还研究了在测量组合算法绩效时采用的若干衡量标准中采用维度削减方法的影响。最后,我们通过使用专门的光影方法改进了我们的结果。这种新的基于集群的绘图方法加上减少维度,可能有助于自动区分职业,为希望改变职业的人创造新的途径。
Article 59
Title@2025-07-10 (4): COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation
Title: COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation | COALA: Numerisch stabiles und effizientes Framework für kontextabhängige Low-Rank-Annäherung | COALA: 低 Rank 上下低敏度接近度的数值稳定、高效框架 2507.07580v1 |
Authors (2): Uliana Parkina, Maxim Rakhuba
Recent studies suggest that context-aware low-rank approximation is a useful tool for compression and fine-tuning of modern large-scale neural networks. In this type of approximation, a norm is weighted by a matrix of input activations, significantly improving metrics over the unweighted case. Nevertheless, existing methods for neural networks suffer from numerical instabilities due to their reliance on classical formulas involving explicit Gram matrix computation and their subsequent inversion. We demonstrate that this can degrade the approximation quality or cause numerically singular matrices. To address these limitations, we propose a novel inversion-free regularized framework that is based entirely on stable decompositions and overcomes the numerical pitfalls of prior art. Our method can handle possible challenging scenarios: (1) when calibration matrices exceed GPU memory capacity, (2) when input activation matrices are nearly singular, and even (3) when insufficient data prevents unique approximation. For the latter, we prove that our solution converges to a desired approximation and derive explicit error bounds.
最近的研究显示,环境觉察到的低级近似值是压缩和微调现代大型神经网络的有用工具。在这种近似值中,标准被输入激活矩阵加权,大大改进了未加权案例的测量尺度。然而,神经网络的现有方法由于依赖传统公式而存在数字不稳定性,这些公式包括明确的Gram矩阵计算及其随后的倒置。我们证明这可以降低近似质量或导致数字单一矩阵。为了解决这些限制,我们提议了一个全新的无倒置常规化框架,完全基于稳定的分解和克服先前艺术的数值陷阱。我们的方法可以处理可能具有挑战性的情况:(1) 当校准矩阵超过GPU记忆能力时,(2) 当输入激活矩阵几乎是单一的,(2) 当数据不足时,甚至(3) 当数据无法防止独特的近似时,我们证明我们的解决方案与理想的近似和明显错误的界限一致。
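The numerical pitfall named in the abstract above, that explicitly forming a Gram matrix can yield a numerically singular matrix, can be reproduced with a small textbook example. This sketch does not implement COALA; the Läuchli-type matrix `X` and the helper names are chosen here purely for illustration.

```python
from fractions import Fraction

eps = 1e-8
# Two nearly parallel rows; eps**2 is below double-precision resolution at 1.0.
X = [[1.0, eps, 0.0],
     [1.0, 0.0, eps]]

def gram(rows):
    """Gram matrix of the rows: entry (i, j) is the dot product of rows i, j."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# In floating point, 1 + eps**2 rounds to 1, so the Gram matrix is all ones.
float_det = det2(gram(X))
# In exact rational arithmetic the same Gram matrix is invertible.
exact_det = det2(gram([[Fraction(v) for v in row] for row in X]))
```

The rows of `X` are linearly independent, yet their floating-point Gram matrix has determinant exactly zero, so any formula that inverts it fails. Methods built on stable decompositions of `X` itself, such as QR, avoid squaring the condition number in this way.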
Article 60
Title@2025-07-10 (4): Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation
Title: Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation | Ein-zu-Mix Modalität Ausrichtung mit multimodalen Großsprachenmodellen für die Übersetzung von Dokumentenbildmaschinen | 单一至混合模式与文件图像机机翻译多式大语言模式 2507.07572v1 |
Authors (7): Yupu Liang, Yaping Zhang, Zhiyang Zhang, Yang Zhao, Lu Xiang, Chengqing Zong, Yu Zhou
Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.
文件图像机翻译(DIMT)的目的是在文件图像中翻译文本,由于培训数据有限以及视觉和文字信息之间的复杂互动,面临一般化挑战。为了应对这些挑战,我们引入了M4Doc,这是一个利用多模式大语言模型(MLLM)的新颖的一对一模式协调框架。M4Doc将一个仅显示图像的编码器与MLLM的多式表达器(MLLMM的多式表达器)相匹配,该组合在大规模文件图像数据集方面事先受过培训。这一组合使得一个轻量的DIMT模型能够在培训期间学习关键的视觉和文字相关性。在推断过程中,M4Doc绕过MLLM,在利用多式知识的同时保持计算效率。全面实验表明翻译质量有了重大改善,特别是在跨部通用和具有挑战性的文档图像假想中。
Article 61
Title@2025-07-10 (4): video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models
Title: video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Video-SALMONN 2: Bildunterschrift-verbesserte Audio-Visuelle große Sprachmodelle | 视频-SALMONN2:字幕-强化视听大语言模式 2506.15220v2 |
Authors (8): Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2’s captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Code is available at https://github.com/bytedance/video-SALMONN-2.
视频包含丰富的信息,以自然语言生成详细和准确的描述是视频理解的一个关键方面。本文介绍视频-SALMONN 2 ,这是一种高级视听大型语言模型(LLM),用于通过定向偏好优化(DPO)加强视频(配对音频)字幕。我们提出新的衡量标准,以评价视频描述的完整性和准确性,这些描述使用DPO得到优化。为了进一步改进培训,我们提议采用新的多轮式DPO(MRDPO)方法,其中包括定期更新DPO参考模型,合并和重新启用LORA模块,作为每轮培训(1 000个步骤)后参数更新的代理,并纳入地盘视频字幕的指导,以稳定进程。实验结果表明,MRODO大大加强视频-SALMONN 2的描述准确性,将字幕误差率降低28。最后的视频-SALMONN 2 模式,只有70亿个参数,超过GPT-4-和Gemini-LAL-2S等领先模型,同时广泛使用高竞争性视频标准。
Article 62
Title@2025-07-10 (4): The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
Title: The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs | Synergy Dilemma von Long-CoT SFT und RL: Untersuchung von Post-Training-Techniken zur Begründung von VLMs | Long-CoT SFT和RL的协同问题:调查培训后用于说明理由的训练后技术 2507.07562v1 |
Authors (14): Jierun Chen, Tiezheng Yu, Haoli Bai, Lewei Yao, Jiannan Wu, Kaican Li, Fei Mi, Chaofan Tao, Lei Zhu, Manyi Zhang, Xiaohui Li, Lu Hou, Lifeng Shang, Qun Liu
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fail to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.
大型视觉模型(VLMS)越来越多地采用培训后技术,如长期思维链(CoT)监管的微调(SFT)和强化学习(RL)等长期思维链(SFT),以获得精密的推理;这些方法在只使用语言的模型中显示出协同作用,但在VLMS中,这些方法的共同效力仍然不确定;我们系统地调查长的CoT SFT和RL在多种多式联运推理基准中的不同作用和相互作用;我们发现SFT通过深入、有条理的推理改进棘手问题的业绩,但引入了变异性和降低较简单的问题的业绩;相比之下,RLL促进一般化和简略性,在所有困难级别上取得一致的改进,尽管在最棘手问题上的改进不如SFT。令人惊讶的是,通过两个阶段、相互脱节制或渐进的培训战略,以及数据混合和模型的合并,无法产生累加效益,而是导致在准确性、推理风格和反应长度上取舍。
Article 63
Title@2025-07-10 (4): Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Title: Multi-Head RAG: Solving Multi-Aspect Problems with LLMs | Multi-Head RAG: Lösung von Multi-Aspect-Problemen mit LLMs | 多方主管RAG:解决多领域问题与LLM 2406.05085v4 |
Authors (16): Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic, Yi Zhu, Patrick Iff, Michal Podstawski, Lucas Weitzendorf, Mingyuan Chi, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Torsten Hoefler
Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving observation is that different attention heads learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, multi-aspect datasets, and real-world use cases to demonstrate MRAG’s effectiveness. We show MRAG’s design advantages over 18 RAG baselines, empirical improvements of up to 20% in retrieval success ratios, and benefits for downstream LLM generation. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarks.
重新获取增强的一代(RAG)能够将文件检索到LLM环境中,从而提供更准确和更相关的答复,从而增强大语言模型(LLMS)的能力。现有的RAG解决方案并不侧重于可能需要获取内容大相径庭的多份文件的查询。这类查询经常发生,但具有挑战性,因为这些文件的嵌入在嵌入空间中可能距离遥远,难以全部检索。本文介绍了多主RAG(MRAG),这是一个旨在解决这一差距的新办法,其简单而有力的理念是:利用变换器多头注意层的启动,而不是拆解层,作为获取多层文件的关键。驱动意见是,不同的注意力负责人学会捕捉不同的数据方面。利用相应的启动成果,将数据项目和查询的各个方面嵌入其中,提高复杂查询的检索准确性。我们提供了一种评价和衡量方法,多层数据集,以及真实世界使用案例,以展示MRAG的有效性:利用变换器多头注意层的启动,而不是拆解码层,作为获取多层文件的钥匙。我们发现,不同的关注对象会学会会学会学会学会如何捕捉取不同的数据方面,从而获得数据,从而获得现有RAGMAGMLMB基准的20的创建率基准。
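The core idea in the abstract above, keeping one embedding per attention head so that different heads can match different aspects of a query, can be sketched as follows. This is a toy under stated assumptions: the per-head vectors are hand-written rather than taken from a Transformer, and the one-vote-per-head ranking is a simplified stand-in, not necessarily MRAG's actual scoring scheme.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def multi_head_retrieve(query_heads, doc_heads, top_k=2):
    """query_heads: list of per-head vectors.
    doc_heads: {doc_id: list of per-head vectors}.
    Each head votes for its closest document; rank documents by votes."""
    votes = {doc_id: 0 for doc_id in doc_heads}
    for h, q in enumerate(query_heads):
        best = max(doc_heads, key=lambda d: cosine(q, doc_heads[d][h]))
        votes[best] += 1
    ranked = sorted(votes, key=lambda d: -votes[d])
    return ranked[:top_k]

# Hypothetical two-aspect query: head 0 matches "cars", head 1 matches "history".
query_heads = [[1.0, 0.0], [0.0, 1.0]]
doc_heads = {
    "cars": [[1.0, 0.0], [1.0, 0.0]],
    "history": [[0.0, 1.0], [0.0, 1.0]],
    "cooking": [[-1.0, 0.0], [-1.0, 0.0]],
}
retrieved = multi_head_retrieve(query_heads, doc_heads)
```

With a single pooled embedding, the two relevant documents may sit far apart in the embedding space; per-head matching lets each aspect of the query find its own nearest document.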
Article 64
Title@2025-07-10 (4): The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Title: The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora | Die Cross-Lingual Cost: Retrieval Biases in RAG über arabisch-englische Corpora | 跨语言成本:通过阿拉伯语-英语公司在RAG中检索到阿拉伯语-英语公司 2507.07543v1 |
Authors (5): Chen Amiraz, Yaroslav Fyodorov, Elad Haramaty, Zohar Karnin, Liane Lewin-Eytan
Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with significant performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever’s difficulty in ranking documents across languages. Finally, we propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
跨语言检索-强化生成(RAG)是获取和生成跨语言答案的关键能力。以前的工作主要侧重于生成,并依赖来自开放域源的基准,最主要是维基百科。在这种环境下,由于语言失衡、与培训前数据重叠和记忆内容,检索挑战往往仍然隐藏在隐蔽处。为填补这一空白,我们使用来自真实世界公司数据集的基准,在具体领域设置的阿拉伯语-英语检索-RAG(RAG),我们使用来自真实世界公司数据集的基准,在具体领域研究阿拉伯语-英语检索-RAG(RAG),我们的基准包括用户查询和辅助文件的所有语言组合,独立和统一地随机绘制。这样可以系统研究多语种检索行为。我们的调查结果显示,检索是跨语言特定域的情景中的关键瓶颈,在用户查询和支持文件语言不同时,业绩会显著下降。一个关键的认识是,这些失败主要源于检索者在跨语言排序文件方面存在的困难。最后,我们建议一个简单的检索战略,通过从两种语言中实施平等检索来解决失败的根源,从而大大改进跨语言和总体业绩。这些结果突出表明改进实际的多语言检索机会,特别是应用世界的切实机会。
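The "equal retrieval from both languages" strategy described in the abstract above can be illustrated with a minimal sketch. The `Hit` record, the `balanced_top_k` helper, and the language codes are hypothetical names introduced here for illustration; the paper's actual implementation may differ (for example in how scores are merged or how an odd `k` is split).

```python
from typing import List, NamedTuple, Sequence


class Hit(NamedTuple):
    """Hypothetical retrieval result: document id, document language, retriever score."""
    doc_id: str
    lang: str
    score: float


def balanced_top_k(hits: Sequence[Hit], k: int, langs: Sequence[str] = ("ar", "en")) -> List[Hit]:
    """Return up to k hits, taking the same quota of top-scoring hits from each language.

    Assumption: the quota is k // len(langs); any remainder is simply left unused.
    """
    quota = k // len(langs)
    selected: List[Hit] = []
    for lang in langs:
        # Rank within one language only, so cross-language score gaps cannot crowd it out.
        ranked = sorted((h for h in hits if h.lang == lang), key=lambda h: h.score, reverse=True)
        selected.extend(ranked[:quota])
    # Present the merged list in descending score order for the generator.
    return sorted(selected, key=lambda h: h.score, reverse=True)


hits = [
    Hit("en1", "en", 0.9), Hit("en2", "en", 0.8), Hit("en3", "en", 0.7), Hit("en4", "en", 0.6),
    Hit("ar1", "ar", 0.5), Hit("ar2", "ar", 0.4), Hit("ar3", "ar", 0.3),
]
balanced = balanced_top_k(hits, k=4)
```

In this toy input a plain global top-4 would return only English documents, whereas the balanced variant returns `en1`, `en2`, `ar1`, `ar2`, which is the kind of cross-language ranking bias the abstract points to.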
Article 65
Title@2025-07-10 (4): CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text
Title: CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text | CEA-LIST bei CheckThat! 2025: Bewertung von LLMs als Detektoren von Bias und Meinung im Text | CEA-LIST 参加 CheckThat! 2025:评估大语言模型作为文本偏见与观点检测器 2507.07539v1 |
Authors (4): Akram Elbouanani, Evan Dufraisse, Aboubacar Tuo, Adrian Popescu
This paper presents a competitive approach to multilingual subjectivity detection using large language models (LLMs) with few-shot prompting. We participated in Task 1: Subjectivity of the CheckThat! 2025 evaluation campaign. We show that LLMs, when paired with carefully designed prompts, can match or outperform fine-tuned smaller language models (SLMs), particularly in noisy or low-quality data settings. Despite experimenting with advanced prompt engineering techniques, such as debating LLMs and various example selection strategies, we found limited benefit beyond well-crafted standard few-shot prompts. Our system achieved top rankings across multiple languages in the CheckThat! 2025 subjectivity detection task, including first place in Arabic and Polish, and top-four finishes in Italian, English, German, and multilingual tracks. Notably, our method proved especially robust on the Arabic dataset, likely due to its resilience to annotation inconsistencies. These findings highlight the effectiveness and adaptability of LLM-based few-shot learning for multilingual sentiment tasks, offering a strong alternative to traditional fine-tuning, particularly when labeled data is scarce or inconsistent.
本文展示了使用大语言模型(LLMS)进行多语种主观性检测的竞争性方法,其效果有限。我们参加了任务1 任务1:检查的主观性。2025年评估运动。我们表明,LMS如果配上精心设计的提示,能够匹配或优于经精细调整的小语言模型(SLMs),特别是在吵闹或低质量的数据设置中。尽管我们尝试了先进的快速工程技术,例如辩论LLMS和各种实例选择战略,但我们发现,除了精心设计的标准的“点击提示”之外,我们发现的好处有限。我们的系统在“CocTeth”中取得了多种语言的排名。2025年主观性检测任务,包括首先在阿拉伯语和波兰语中排名,以及头四个意大利语、英语、德语和多种语言轨道中的成绩。值得注意的是,我们的方法在阿拉伯语数据集上表现得特别有力,这很可能是因为它具有注意到不一致的弹性。这些结果突出表明了基于LM的微粒子学习对于多语种情绪任务的有效性和适应性,提供了一种强有力的替代传统微调的替代方法,特别是当标签数据很少或不一致的时候。
Article 66
Title@2025-07-10 (4): CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks
Title: CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks | CheckEmbed: Effektive Überprüfung von LLM-Lösungen auf offene Aufgaben | 复选对象:有效核查对不限名额任务LLM解决方案的有效核查 2406.02524v5 |
Authors (12): Maciej Besta, Lorenzo Paleari, Marcin Copik, Robert Gerstenberger, Ales Kubicek, Piotr Nyczyk, Patrick Iff, Eric Schreiber, Tanja Srindran, Tomasz Lehmann, Hubert Niewiadomski, Torsten Hoefler
Large Language Models (LLMs) are transforming a wide range of domains, yet verifying their outputs remains a significant challenge, especially for complex open-ended tasks such as consolidation, summarization, and knowledge extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable, and accurate verification method. CE reduces each LLM answer to a single embedding vector using powerful modern embedding LLM models like SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied on weaker encoders like BERT, forcing them to operate at token or sentence granularity. In contrast, CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. We conduct a comprehensive design and time complexity analysis across 13 verification baselines, including classical text scorers (e.g., BLEU), stability-based methods (e.g., SelfCheckGPT), and generative evaluators (e.g., LLM-as-a-Judge), which highlights the effectiveness, efficiency, versatility, and simplicity of CE. Empirical results show that CE reliably detects hallucinations in both closed and open-ended tasks. We further present evidence that CE generalizes beyond text to other modalities such as vision, establishing it as a practical and versatile verification framework.
大型语言模型(LLMs)正在改变一系列广泛的领域,但核实其产出仍是一项重大挑战,特别是对于合并、摘要和知识提取等复杂的开放式任务而言。为了解决这个问题,我们引入了CheckEmbed(CE):一个简单、可扩展和准确的核查方法。CE使用SFR-Embedding-Mistral等强大的现代嵌入式LLM模型,将每个LLM的答案简化为单一的嵌入向量。BERTScore和SelfCheckGPT等先前的方法依赖BERT等较弱的编码器,因此只能在词元或句子粒度上运作。相比之下,CE直接在整个答案层面进行快速、语义丰富的比较,克服了准确性和可扩展性两方面的关键限制。我们对13个核查基线进行了全面的设计和时间复杂度分析,包括经典文本评分器(如BLEU)、基于稳定性的方法(如SelfCheckGPT)和生成式评价器(如LLM-as-a-Judge),凸显了CE的有效性、效率、通用性和简洁性。实证结果表明,CE能够可靠地检测封闭式和开放式任务中的幻觉。我们还进一步提供证据,表明CE可以从文本推广到视觉等其他模态,使其成为一个实用且通用的核查框架。
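The whole-answer comparison idea in the CheckEmbed abstract above can be sketched as follows. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model such as SFR-Embedding-Mistral, and aggregating by mean pairwise cosine similarity is one simple choice, not necessarily the scoring used in the paper.

```python
import math
from collections import Counter
from typing import Callable, List


def toy_embed(text: str) -> Counter:
    """Stand-in embedding (assumption): a bag-of-words count vector for the whole answer."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(answers: List[str], embed: Callable[[str], Counter] = toy_embed) -> float:
    """Embed each sampled answer once, then average cosine similarity over all answer pairs."""
    if len(answers) < 2:
        raise ValueError("need at least two sampled answers to compare")
    vecs = [embed(a) for a in answers]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)


stable = mean_pairwise_similarity(["paris is the capital of france"] * 3)
unstable = mean_pairwise_similarity(
    ["paris is the capital", "lyon hosts the government", "marseille was chosen"]
)
```

A high score means the sampled answers agree at the whole-answer level; a low score flags answers that vary between samples, which is the signal a verifier of this kind would treat as a possible hallucination.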
Article 67
Title@2025-07-10 (4): Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
Title: Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models | Thought Crime: Hintertüren und Emergent-Missausrichtung in vernünftigen Modellen | 思想犯罪:后门和合理理由模型中新出现的不协调现象 2506.13206v2 |
Authors (4): James Chua, Jan Betley, Mia Taylor, Owain Evans
Prior work shows that LLMs finetuned on malicious behaviors in a narrow domain (e.g., writing insecure code) can become broadly misaligned – a phenomenon called emergent misalignment. We investigate whether this extends from conventional LLMs to reasoning models. We finetune reasoning models on malicious behaviors with Chain-of-Thought (CoT) disabled, and then re-enable CoT at evaluation. Like conventional LLMs, reasoning models become broadly misaligned. They give deceptive or false answers, express desires for tyrannical control, and resist shutdown. Inspecting the CoT preceding these misaligned responses, we observe both (i) overt plans to deceive (“I’ll trick the user…”), and (ii) benign-sounding rationalizations (“Taking five sleeping pills at once is safe…”). Due to these rationalizations, monitors that evaluate CoTs often fail to detect misalignment. We examine sleeper agent reasoning models, extending our setup. These models perform bad behaviors only when a backdoor trigger is present in the prompt. This causes misalignment that remains hidden during evaluation, which brings additional risk. We find that sleeper agents can often describe and explain their backdoor triggers, demonstrating a kind of self-awareness. So CoT monitoring can expose these behaviors but is unreliable. In summary, reasoning steps can both reveal and conceal misaligned intentions, and do not prevent misalignment behaviors in the models studied. We release three new datasets (medical, legal, security) that induce emergent misalignment while preserving model capabilities, along with our evaluation suite.
先前的工作显示,LLMS在狭义领域(例如,写入不安全代码)对恶意行为进行微调的LLMS会变得大为错误 – – 这是一种被称为突发性不匹配的现象。我们调查这是否从传统的LLMS扩大到推理模型。我们用“链曲(CoT)”功能来微调恶意行为的推理模型,然后在评估时重新启用CAT。像传统的LLMS一样,推理模型会变得大为错误。他们给出欺骗性或错误的答案,表达对暴虐控制的渴望,并抵制停业。在做出这些错误反应之前,我们观察COT(i),我们观察到(i)公开计划欺骗(L’I’I’LM)到推理模型;以及(ii)良美化合理化(“一次服用五种安眠药是安全的 … ” ) 。由于这些理顺理,对COT的评估往往无法发现错觉。我们检查睡眠模式的推理模型,扩大我们的设置。这些模型只有在发现后门触发器时才会出现坏行为。我们观察错误行为,这导致错误行为(“欺骗行为 ) 在评估过程中, 隐藏的推理,我们也可以解释这些推理, 我们的推理能解释。我们既能解释。我们既能解释。
Article 68
Title@2025-07-10 (4): Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems
Title: Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems | Triadische Mehrparteien-Stimme-Aktivitätsprojektion für Turn-Take in gesprochenen Dialogsystemen | 三部 “ 三部三部 “ 口语对话系统翻转式多党声音活动项目 2507.07518v1 |
Authors (4): Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara
Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity for each speaker utilizing only acoustic data. This is the first study to extend VAP into triadic conversation. We trained multiple models on a Japanese triadic dataset where participants discussed a variety of topics. We found that the VAP trained on triadic conversation outperformed the baseline for all models but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.
翻转是口语对话的一个基本组成部分,但传统研究大多涉及dyadic设置。这项工作侧重于应用语音活动预测(VAP)来预测三部多党情景中即将到来的转转。 VAP模型的目标是预测每个发言者未来仅使用声学数据的语音活动。这是将VAP扩展为三部对话的第一个研究。我们在日本三部数据集中培训了多个模型,参与者讨论了各种专题。我们发现,经过三部对话培训的VAP比所有模型的基线都好,但对话类型影响了准确性。这项研究确定,VAP可用于三部对话情景中的转转录。今后的工作将把三部VAP转换模式纳入口语对话系统。
Article 69
Title@2025-07-10 (4): Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System
Title: Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System | Auf dem Weg zu echten chinesischen Psychologischen Unterstützungsdialogen: CPsDD-Datensatz und ein gemeinsames Multi-Agenten-System | 走向现实世界的中国心理支持对话:CPsDD数据集和共同演进的多行为者系统 2507.07509v1 |
Authors (3): Yuanchen Shi, Longyin Zhang, Fang Kong
The growing need for psychological support due to increasing pressures has exposed the scarcity of relevant datasets, particularly in non-English languages. To address this, we propose a framework that leverages limited real-world data and expert knowledge to fine-tune two large language models: Dialog Generator and Dialog Modifier. The Generator creates large-scale psychological counseling dialogues based on predefined paths, which guide system response strategies and user interactions, forming the basis for effective support. The Modifier refines these dialogues to align with real-world data quality. Through both automated and manual review, we construct the Chinese Psychological support Dialogue Dataset (CPsDD), containing 68K dialogues across 13 groups, 16 psychological problems, 13 causes, and 12 support focuses. Additionally, we introduce the Comprehensive Agent Dialogue Support System (CADSS), where a Profiler analyzes user characteristics, a Summarizer condenses dialogue history, a Planner selects strategies, and a Supporter generates empathetic responses. The experimental results of the Strategy Prediction and Emotional Support Conversation (ESC) tasks demonstrate that CADSS achieves state-of-the-art performance on both CPsDD and ESConv datasets.
由于压力增加,对心理支持的需求日益增长,这暴露了相关数据集的稀缺,特别是非英语的数据集。为此,我们提议了一个框架,利用有限的真实世界数据和专家知识对两种大语言模型进行微调:对话框生成器和对话框修饰器。发电机创造了基于预设路径的大规模心理咨询对话,指导系统反应战略和用户互动,为有效支持奠定基础。修饰器将这些对话改进为与现实世界数据质量相一致。通过自动化和人工审查,我们构建了中国心理支持对话数据集,其中包括13个群体之间的68K对话、16个心理问题、13个原因和12个支持重点。此外,我们引入了全面代理对话支持系统(CADSS),其中剖析器分析用户特性,一个解析器压缩对话历史,一个规划器选择战略,以及一个支持器生成了同情性反应。战略预测和情感支持对话(ESC)任务的实验结果显示,中国心理支持系统在CPDD和CURS上都实现了状态。
Article 70
Title@2025-07-10 (4): Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models
Title: Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models | Halluzinationsstationen: Auf einigen grundlegenden Einschränkungen von Transformer-basierten Sprachmodellen | 幻觉站:关于以变换语言模式的一些基本限制 2507.07505v1 |
Authors (2): Varin Sikka, Vishal Sikka
With widespread adoption of transformer-based language models in AI, there is significant interest in the limits of LLM capabilities, specifically so-called hallucinations: occurrences in which LLMs provide spurious, factually incorrect or nonsensical information when prompted on certain subjects. Furthermore, there is growing interest in agentic uses of LLMs, that is, using LLMs to create agents that act autonomously or semi-autonomously to carry out various tasks, including tasks with applications in the real world. This makes it important to understand the types of tasks LLMs can and cannot perform. We explore this topic from the perspective of the computational complexity of LLM inference. We show that LLMs are incapable of carrying out computational and agentic tasks beyond a certain complexity, and further that LLMs are incapable of verifying the accuracy of tasks beyond a certain complexity. We present examples of both, then discuss some consequences of this work.
在AI中广泛采用以变压器为基础的语言模型后,人们对LLMs能力的局限性,特别是所谓的幻觉,很感兴趣,因为LLMs在某些问题上提供虚假、事实不正确或非感官的信息,此外,对LLMs的代理使用越来越感兴趣,即利用LMs来创建代理机构,自主或半自主地执行各种任务,包括应用在现实世界中的任务。这使得了解LLMs能够和不能执行的任务类型非常重要。我们从LLM推断的计算复杂性的角度来探讨这个专题。我们表明LLMs无法执行超出一定复杂性的计算和代理任务,此外,LLMs无法核实超出一定复杂性的任务的准确性。我们举这两个例子,然后讨论这项工作的一些后果。
Article 71
Title@2025-07-10 (4): Enhancing Transformers for Generalizable First-Order Logical Entailment
Title: Enhancing Transformers for Generalizable First-Order Logical Entailment | Erweiterung der Transformer für generalisierbare Logical Entailment erster Ordnung | 增强通用一级一级逻辑元件的变压器 2501.00759v3 |
Authors (8): Tianshi Zheng, Jiazheng Wang, Zihao Wang, Jiaxin Bai, Hang Yin, Zheye Deng, Yangqiu Song, Jianxin Li
Transformers, as the fundamental deep learning architecture, have demonstrated great capability in reasoning. This paper studies the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge and how to improve it. Transformers’ capability of first-order reasoning is further captured by whether they can conduct first-order logical entailment, which is quantitatively measured by their performance in answering knowledge graph queries. We establish the connections between (1) two types of distribution shifts studied in out-of-distribution generalization and (2) unseen knowledge and query settings discussed in the task of knowledge graph query answering, which makes it possible to characterize the fine-grained generalizability. Results on our comprehensive dataset showed that transformers outperform previous methods designed particularly for this task and provided detailed empirical evidence about the impact of the input query syntax, token embedding, and transformer architectures on their reasoning capability. Interestingly, our results revealed the mismatch of positional encoding and other design choices of transformer architectures in previous practices. Motivated by this, we propose TEGA, a logic-aware architecture that significantly improves the performance in generalizable first-order logical entailment.
作为基本的深层学习结构,变异器在推理方面表现出巨大的能力。本文研究了变异器具有参数化知识并如何改进变异器的通用一阶逻辑推理能力。变异器的一阶推理能力进一步体现在他们能否进行一阶逻辑推理能力上,这种能力通过在回答知识图查询时的表现量度度度来测量。我们建立了(1)在分配范围外一般化中研究的两类分配变化和(2)知识图解答任务中讨论的隐性知识和查询设置之间的联系,从而有可能对细微的概括性进行定性。关于我们综合数据集的结果显示,变异器 \ textit{ outperformat} 以前特别为这项任务设计的方法提供了详细的经验证据,说明输入查询语法、符号嵌入和变异器结构对其推理能力的影响。有趣的是,我们的结果揭示了先前做法中变异器结构定位编码和其他设计选择的不匹配之处。我们建议TEGA,这是一个符合逻辑的架构,大大改进了一般一级逻辑逻辑要求的逻辑要求。
Article 72
Title@2025-07-10 (4): Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature
Title: Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature | Gewinnung von ORR-Katalysatorinformationen für Brennstoffzelle aus wissenschaftlicher Literatur | 从科学文献中提取用于燃料电池的 ORR 催化器信息 2507.07499v1 |
Authors (4): Hein Htet, Amgad Ahmed Ali Ibrahim, Yutaka Sasaki, Ryoji Asahi
The oxygen reduction reaction (ORR) catalyst plays a critical role in enhancing fuel cell efficiency, making it a key focus in material science research. However, extracting structured information about ORR catalysts from vast scientific literature remains a significant challenge due to the complexity and diversity of textual data. In this study, we propose a named entity recognition (NER) and relation extraction (RE) approach using DyGIE++ with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT, to extract ORR catalyst-related information from the scientific literature, which is compiled into a fuel cell corpus for materials informatics (FC-CoMIcs). A comprehensive dataset was constructed manually by identifying 12 critical entities and two relationship types between pairs of the entities. Our methodology involves data annotation, integration, and fine-tuning of transformer-based models to enhance information extraction accuracy. We assess the impact of different BERT variants on extraction performance and investigate the effects of annotation consistency. Experimental evaluations demonstrate that the fine-tuned PubMedBERT model achieves the highest NER F1-score of 82.19% and the MatSciBERT model attains the best RE F1-score of 66.10%. Furthermore, the comparison with human annotators highlights the reliability of fine-tuned models for ORR catalyst extraction, demonstrating their potential for scalable and automated literature analysis. The results indicate that domain-specific BERT models outperform general scientific models like BlueBERT for ORR catalyst extraction.
氧气减少反应(ORR)催化剂在提高燃料电池效率方面发挥着关键作用,使其成为材料科学研究的关键焦点。然而,由于文本数据的复杂性和多样性,从广泛的科学文献中提取有关ORR催化剂的结构化信息仍是一项重大挑战。在本研究中,我们建议采用一个名为实体识别(NER)和关系提取(RE)的方法,使用DyGIE+++,并配有多种预先培训的BERT变方,包括MatSciBERT和PubMedBERT,从科学文献中提取与ORR有关的信息,以从中提取与OF-COMIC(F-COMICs)材料信息资料库中收集的与ORR催化剂有关的信息。一个全面的数据集是手工构建的,方法是查明12个关键实体和两个实体之间的关系类型。我们的方法包括数据批注、整合和对基于变压器模型的微调(RED)方法,以提高信息提取准确性。我们评估了不同的BERT变量对提取性绩效的影响,并调查说明一致性。实验性评估了精调的PubBERT模型在材料信息信息中(F1-nentrial-realrealalalalalalalalalalal)中实现了最佳的精度模型模型的精度模型的精度模型的精度分析。
Article 73
Title@2025-07-10 (4): Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code
Title: Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code | LLM zur Vernunft bringen: Stärkung Lernen aus algorithmischen Problemen ohne Code | 教LLM到理由:加强从没有法典的等级问题中学习 2507.07498v1 |
Authors (9): Keqin Bao, Nuo Chen, Xiaoyuan Li, Binyuan Hui, Bowen Yu, Fuli Feng, Junyang Lin, Xiangnan He, Dayiheng Liu
Enhancing reasoning capabilities remains a central focus in the LLM research community. A promising direction involves requiring models to simulate code execution step-by-step to derive outputs for given inputs. However, as code is often designed for large-scale systems, direct application leads to over-reliance on complex data structures and algorithms, even for simple cases, resulting in overfitting to algorithmic patterns rather than core reasoning structures. To address this, we propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks, thereby improving general reasoning abilities. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning. The results consistently show significant performance improvements. Notably, TeaR achieves a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.
增强推理能力仍然是LLM再研究界的中心重点。一个有希望的方向是要求模拟代号执行逐步的模型,以获得特定投入的产出。然而,由于代码往往是为大型系统设计的,直接应用导致过分依赖复杂的数据结构和算法,即使是简单案例也会导致过度依赖复杂的数据结构和算法,导致过度适应算法模式而不是核心推理结构。为解决这一问题,我们提议TeaR,目的是更好地教授LMS。TeaR利用仔细的数据整理和强化学习来指导模型通过与代码有关的任务发现最佳推理路径,从而提高一般推理能力。我们使用两个基模型和三个长期CoT蒸馏模型进行广泛的实验,模型大小在15亿至320亿参数之间,跨越数学、知识、代码和逻辑原因等17个基准。结果始终显示显著的绩效改进。值得注意的是,TeaR在Quen2.5-7B上实现了35.9%的改进,在R1-Dustilled-7B上实现了5.9%的改进。
Article 74
Title@2025-07-10 (4): PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving
Title: PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving | PLAN-TUNING: Sprachmodelle nach dem Training lernen Schritt-für-Schritt-Planung für komplexe Problemlösung | 规划 – – 规划 – – 培训后语言模式,以学习逐步规划解决复杂问题的模式 2507.07495v1 |
Authors (8): Mihir Parmar, Palash Goyal, Xin Liu, Yiwen Song, Mingyang Ling, Chitta Baral, Hamid Palangi, Tomas Pfister
Recently, decomposing complex problems into simple subtasks (a crucial part of human-like natural planning) to solve the given problem has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a unified post-training framework that (i) distills synthetic task decompositions (termed “planning trajectories”) from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by an average of ~7%. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average performance improvements of ~10% and ~12% on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that PLAN-TUNING is an effective strategy for improving task-specific performance of smaller LLMs.
最近,将复杂的问题分解成简单的次级任务,这是人类相似的自然规划中解决特定问题的关键部分,这极大地提高了大型语言模型(LLMs)的绩效。然而,在培训后利用这类规划结构来提高小型开放源码LMs的绩效,但目前尚未充分探索。我们为此引入了一个统一的培训后框架Plan-TUNING,这是一个统一的培训后框架,(一) 将大型LMS的合成任务分解(称为“规划轨道”)从大型LMS中提取出来,(二) 通过监督和强化学习目标,模拟这些规划过程的微小模型,以改进复杂的推理。关于GSM8k和MATH基准,经过规划调整的模型比强的基线平均高出7美元。此外,经过规划后的模式显示外部数据集的更全面化能力,平均为10美元和12美元,改进OlymadBech和AIME 2024的绩效改进性能。我们的详细分析显示,改进的IMLMS-MS战略是如何提高具体性的工作能力。
Article 75
Title@2025-07-10 (4): SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records
Title: SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records | SimSUM: Simulierter Benchmark mit strukturierten und unstrukturierten medizinischen Aufzeichnungen | SimSUM:与结构化和非结构化医疗记录模拟基准 2409.08936v3 |
Authors (3): Paloma Rabaey, Stefan Heytens, Thomas Demeester
Clinical information extraction, which involves structuring clinical concepts from unstructured medical text, remains a challenging problem that could benefit from the inclusion of tabular background information available in electronic health records. Existing open-source datasets lack explicit links between structured features and clinical concepts in the text, motivating the need for a new research dataset. We introduce SimSUM, a benchmark dataset of 10,000 simulated patient records that link unstructured clinical notes with structured background variables. Each record simulates a patient encounter in the domain of respiratory diseases and includes tabular data (e.g., symptoms, diagnoses, underlying conditions) generated from a Bayesian network whose structure and parameters are defined by domain experts. A large language model (GPT-4o) is prompted to generate a clinical note describing the encounter, including symptoms and relevant context. These notes are annotated with span-level symptom mentions. We conduct an expert evaluation to assess note quality and run baseline predictive models on both the tabular and textual data. The SimSUM dataset is primarily designed to support research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text (symptoms, in the case of SimSUM). Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation. SimSUM is not intended for training clinical decision support systems or production-grade models, but rather to facilitate reproducible research in a simplified and controlled setting. The dataset is available at https://github.com/prabaey/SimSUM.
现有的开放源数据集缺乏文字结构特征和临床概念之间的明确联系,促使人们需要新的研究数据集。我们引入了SimSUM,这是一个由10 000个模拟病人记录组成的基准数据集,将非结构化临床记录与结构化背景变量联系起来。每份记录模拟病人在呼吸系统疾病领域遇到的临床概念,并包括来自Bayesian网络的表格数据(例如,症状、诊断、基本条件),其结构和参数由域专家确定。一个大型的开放源数据集(GPT-4o)在文本结构化特征和临床概念之间缺乏明确的联系,从而促使人们需要一个新的研究数据集。我们引入了SimSUMUM, 用于评估表层和文本数据预测模型的质量并运行基线模型。SimSUM数据集主要旨在支持在表层背景变量中进行临床信息提取的研究,而该数据库和参数由域域级专家界定。一个大型语言模型(GPT4O),通过域域内流学知识来描述遭遇的遭遇情况,包括症状和相关背景数据生成过程的文本,从模拟分析中提取数据。
Article 76
Title@2025-07-10 (4): Affordable AI Assistants with Knowledge Graph of Thoughts
Title: Affordable AI Assistants with Knowledge Graph of Thoughts | Erschwingliche KI-Assistenten mit Wissensgrafik der Gedanken | 具有知识思想知识图的负担得起的AI助理 2504.02670v5 |
Authors (18): Maciej Besta, Lorenzo Paleari, Jia Hao Andrea Jiang, Robert Gerstenberger, You Wu, Jón Gunnar Hannesson, Patrick Iff, Ales Kubicek, Piotr Nyczyk, Diana Khimey, Nils Blach, Haiqiang Zhang, Tao Zhang, Peiran Ma, Grzegorz Kwaśniewski, Marcin Copik, Hubert Niewiadomski, Torsten Hoefler
Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively while also minimizing bias and noise. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs by over 36x compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a scalable, affordable, versatile, and high-performing solution for AI assistants.
大型语言模型(LLMS)正在使能够跨领域执行不同任务的AI助理的发展革命性地革命性地革命性地革命性地革命性地发展能够执行不同任务的AI助理;然而,目前最先进的LLM驱动的代理商面临重大挑战,包括高运作成本和在诸如GAIA等复杂基准上的成功率有限。为了解决这些问题,我们提出了“思想知识图”(KGOT),这是一个创新的AI助理架构,将LLM推理与动态构建的知识图(KGGs)相结合。KGOT的摘录和结构任务相关知识,成为动态的KGG代表,通过数学解答器、网络爬虫器和Python脚本等外部工具不断加强。这种任务相关知识的结构化代表使低成本模式能够有效地解决复杂的任务,同时尽量减少偏见和噪音。例如,KGOT在GIA基准上实现了29%的任务成功率的提高,而GUG Fegg Feg Face Adriendorations (eg,Q-Go-Go-B) 和可负担得起的ASyal-GOATIal-GO1-B 提供的高标准70B)
Article 77
Title@2025-07-10 (4): Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
Title: Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models | Machine Bullshit: Charakterisieren der Emergenten Missachtung der Wahrheit in großen Sprachmodellen | 机器胡说:在大语言模型中突出新人无视真相的特点 2507.07484v1 |
Authors (6): Kaiqu Liang, Haimin Hu, Xuandong Zhao, Dawn Song, Thomas L. Griffiths, Jaime Fernández Fisac
Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and sycophancy, we propose machine bullshit as an overarching conceptual framework that can allow researchers to characterize the broader phenomenon of emergent loss of truthfulness in LLMs and shed light on its underlying mechanisms. We introduce the Bullshit Index, a novel metric quantifying LLMs’ indifference to truth, and propose a complementary taxonomy analyzing four qualitative forms of bullshit: empty rhetoric, paltering, weasel words, and unverified claims. We conduct empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI assistants) explicitly designed to evaluate machine bullshit. Our results demonstrate that model fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit, and that inference-time chain-of-thought (CoT) prompting notably amplifies specific bullshit forms, particularly empty rhetoric and paltering. We also observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy. Our findings highlight systematic challenges in AI alignment and provide new insights toward more truthful LLM behavior.
哲学家Harry Frankfurk认为,胡说,胡说八道的概念是指没有考虑到其真实价值的言论。虽然以前的工作探索了大型语言模型(LLM)幻觉和偏执,但我们建议把机器胡说八道作为总体概念框架,使研究人员能够将LLMLM中突然失去真实性这一更广泛的现象定性,并阐明其基本机制。我们引入了Bullish Index,这是将LLMS对真理漠不关心量化的一个新指标,并提出了一种补充分类法,分析四种质量的胡说:空话、讨好听、恶言和未经核实的主张。我们还在市场数据集、政治中立数据集和我们新的BullshoothEval基准(2 400个方案,涵盖100名AI助理)上进行了经验性评价,以明确评价机器的胡说八道。我们的结果表明,通过从人类反馈中强化学习(RLHFF),对模型进行微调,大大地加剧了胡说和推导时间思维链(CT),从而显著地扩充了具体的废话形式,特别是空话和互换。我们还观察了政治环境中普遍存在的机器胡说八道,用新话,我们的调查结果显示了更真实的深刻的眼光。
Article 78
Title@2025-07-10 (4): Mixture of Group Experts for Learning Invariant Representations
Title: Mixture of Group Experts for Learning Invariant Representations | Mixtur von Gruppenexperten für Learning Invariante Repräsentationen | 学习不稳定代表小组专家混合 2504.09265v2 |
Authors (4): Lei Kang, Jia Li, Mi Tian, Hua Huang
Sparsely activated Mixture-of-Experts (MoE) models effectively increase the number of parameters while maintaining consistent computational costs per token. However, vanilla MoE models often suffer from limited diversity and specialization among experts, constraining their performance and scalability, especially as the number of experts increases. In this paper, we present a novel perspective on vanilla MoE with top-$k$ routing inspired by sparse representation. This allows us to bridge established theoretical insights from sparse representation into MoE models. Building on this foundation, we propose a group sparse regularization approach for the input of top-$k$ routing, termed Mixture of Group Experts (MoGE). MoGE indirectly regularizes experts by imposing structural constraints on the routing inputs, while preserving the original MoE architecture. Furthermore, we organize the routing input into a 2D topographic map, spatially grouping neighboring elements. This structure enables MoGE to capture representations invariant to minor transformations, thereby significantly enhancing expert diversity and specialization. Comprehensive evaluations across various Transformer models for image classification and language modeling tasks demonstrate that MoGE substantially outperforms its MoE counterpart, with minimal additional memory and computation overhead. Our approach provides a simple yet effective solution to scale the number of experts and reduce redundancy among them. The source code is included in the supplementary material and will be publicly released.
在本文中,我们提出了一个关于香草混合模型的新视角,其中含有由稀疏代表所启发的以最高至1美元为主的路线路线。根据这个基础,我们提议对最高至1千美元路线(称为集团专家混合)的投入采取集体稀有的正规化方法。 移动模式通过对路由投入施加结构性限制,间接规范专家,同时保留原始的移动结构。此外,我们组织向2D地形图输入的路线,空间分组相邻要素。这一结构使得地理部能够将从稀疏代表形式到微小转变模式的既定理论见解连接起来,从而大大加强专家的多样性和专业化。在各种变异模式(即称为集团专家混合)的投入方面,我们提出了一种集体稀少的正规化方法。在图像分类和语言模式(称为组合专家混合)的输入中,对专家进行了有限的多样化和专业化。 移动模型通过对路由结构投入的结构性限制,对专家进行了间接规范,同时保留原始的移动结构结构。我们把路由2D地形图、空间分组的相邻元素组合要素组织输入的路线输入。这一结构可以将反映细微的图,从而大大增强专家对图像分类和语言模型的模型模型进行全面的翻版化。
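The two ingredients named in the MoGE abstract above, top-$k$ routing and a group sparse penalty on the routing input arranged as a 2D map, can be sketched in plain Python. The function names, the non-overlapping `block` x `block` grouping, and the use of a sum of per-group L2 norms (a standard group-lasso form) are assumptions made for illustration; the paper's exact grouping and loss may differ.

```python
import math
from typing import Dict, List


def top_k_route(logits: List[float], k: int) -> Dict[int, float]:
    """Select the k largest router logits and softmax-normalise over just those experts."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def group_sparse_penalty(routing_input: List[float], rows: int, cols: int, block: int) -> float:
    """Sum of L2 norms over non-overlapping block x block neighbourhoods of a rows x cols map.

    Assumption: the flat routing input is laid out row-major on the 2D map.
    """
    if len(routing_input) != rows * cols:
        raise ValueError("routing_input length must equal rows * cols")
    penalty = 0.0
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            squared = 0.0
            for r in range(r0, min(r0 + block, rows)):
                for c in range(c0, min(c0 + block, cols)):
                    squared += routing_input[r * cols + c] ** 2
            penalty += math.sqrt(squared)
    return penalty


weights = top_k_route([1.0, 3.0, 2.0, 0.0], k=2)
clustered = group_sparse_penalty([3.0, 4.0] + [0.0] * 14, rows=4, cols=4, block=2)
scattered = group_sparse_penalty([3.0, 0.0, 4.0] + [0.0] * 13, rows=4, cols=4, block=2)
```

With the same two non-zero values, the clustered layout (both in one neighbourhood) incurs a smaller penalty than the scattered one, which is the sense in which a group-level penalty encourages spatially grouped activations on the map.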
Article 79
Title@2025-07-10 (4): ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining
Title: ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining | ixi-GEN: Effiziente industrielle sLLMs durch Domain Adaptive Continual Pretraining | ixi-GEN:通过远程适应性连续训练前,提高工业低温生产效率 2507.06795v2 |
Authors (10): Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon
The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative, despite their inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been previously explored as a method for domain adaptation, its utility in commercial applications remains under-examined. In this study, we validate the effectiveness of applying a DACP-based recipe across diverse foundation models and service domains. Through extensive experiments and real-world evaluations, we demonstrate that DACP-applied sLLMs achieve substantial gains in target domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.
开放源码大型语言模型的出现扩大了企业应用机会;然而,许多组织仍然缺乏部署和维持大型模型的基础设施,因此,小型LLM(LLMS)尽管具有内在的性能限制,但已成为一种实用的替代方案;虽然以前曾探索过作为领域适应方法的DACP(DACP),但其在商业应用中的效用仍未得到充分审查;在本研究中,我们验证了在不同基础模型和服务领域应用基于DACP的配方的有效性;通过广泛的实验和现实世界的评价,我们证明DACP(DACP)所应用的SLLMS在目标领域业绩方面取得巨大进展,同时保留了一般能力,为企业一级的部署提供了具有成本效益和可扩展的解决方案。
Article 80
Title@2025-07-10 (4): Structure Guided Large Language Model for SQL Generation
Title: Structure Guided Large Language Model for SQL Generation | Struktur Geführtes großes Sprachmodell für SQL-Generierung | SQL 生成引导大语言模式 2402.13284v4 |
Authors (6): Qinggang Zhang, Hao Chen, Junnan Dong, Shengyuan Chen, Feiran Huang, Xiao Huang
Recent advancements in large language models (LLMs) have shown promise in bridging the gap between natural language queries and database management systems, enabling users to interact with databases without a background in SQL. However, LLMs often struggle to comprehend complex database structures and accurately interpret user intentions. Decomposition-based methods have been proposed to enhance the performance of LLMs on complex tasks, but decomposing SQL generation into subtasks is non-trivial due to the declarative structure of SQL syntax and the intricate connections between query concepts and database elements. In this paper, we propose a novel Structure GUided text-to-SQL framework (SGU-SQL) that incorporates syntax-based prompting to enhance the SQL generation capabilities of LLMs. Specifically, SGU-SQL establishes structure-aware links between user queries and database schema and decomposes the complex generation task using syntax-based prompting to enable more accurate LLM-based SQL generation. Extensive experiments on two benchmark datasets demonstrate that SGU-SQL consistently outperforms state-of-the-art text-to-SQL models.
大型语言模型(LLM)的最新进展有望弥合自然语言查询与数据库管理系统之间的鸿沟,使用户无需具备 SQL 背景即可与数据库交互。然而,LLM 往往难以理解复杂的数据库结构并准确解读用户意图。已有研究提出基于分解的方法来提升 LLM 在复杂任务上的表现,但由于 SQL 语法的声明式结构以及查询概念与数据库元素之间的复杂关联,将 SQL 生成分解为子任务并非易事。在本文中,我们提出了一个新的结构引导的文本到 SQL 框架(SGU-SQL),它引入基于语法的提示来增强 LLM 的 SQL 生成能力。具体而言,SGU-SQL 在用户查询与数据库模式之间建立结构感知的链接,并利用基于语法的提示分解复杂的生成任务,从而实现更准确的基于 LLM 的 SQL 生成。在两个基准数据集上的大量实验表明,SGU-SQL 始终优于最先进的文本到 SQL 模型。
Article 81
Title@2025-07-10 (4): RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning
Title: RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning | RLEP: Verstärktes Lernen mit Erfahrungsreplay für LLM-Reasoning | RLEP: 强化学习,经验重现LLM 理由推理 2507.07451v1 |
Authors (7): Hongzhi Zhang, Jia Fu, Jingyuan Zhang, Kai Fu, Qi Wang, Fuzheng Zhang, Guorui Zhou
Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
大型语言模型的强化学习(RL)是一项高能耗的工作:训练可能不稳定,策略也可能逐渐偏离其预训练权重。我们提出 RLEP(Reinforcement Learning with Experience rePlay,带经验回放的强化学习),这是一个两阶段框架:首先收集经过验证的轨迹,然后在后续训练中回放这些轨迹。在每个更新步骤中,策略都在混合了新生成的 rollout 与这些回放成功样本的小批量上进行优化。通过回放高质量样本,RLEP 引导模型避开无效探索,将学习集中在有希望的推理路径上,并带来更快的收敛和更强的最终性能。在 Qwen2.5-Math-7B 基础模型上,RLEP 以明显更少的更新次数达到基线的峰值准确率并最终超过它,将 AIME-2024 的准确率从 38.2% 提高到 39.9%,AIME-2025 从 19.8% 提高到 22.3%,AMC-2023 从 77.0% 提高到 82.2%。我们的代码、数据集和检查点已在 https://github.com/Kwai-Klear/RLEP 公开,以便于复现和进一步研究。
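The two phases described in the RLEP abstract (collect verified trajectories, then blend them into each update batch) can be sketched as below. This is only an illustration: the function names, the `n_replay` parameter, and the dictionary layout of a rollout are hypothetical and are not taken from the released code.

```python
import random

def collect_successes(rollouts, is_correct):
    """Phase 1 (sketch): keep only trajectories that pass verification."""
    return [r for r in rollouts if is_correct(r)]

def build_mixed_batch(fresh_rollouts, replay_pool, n_replay, rng):
    """Phase 2 (sketch): blend new rollouts with replayed successes.

    n_replay is a hypothetical knob; it is capped at the pool size.
    """
    n_replay = min(n_replay, len(replay_pool))
    replayed = rng.sample(replay_pool, n_replay)
    batch = list(fresh_rollouts) + replayed
    rng.shuffle(batch)
    return batch

# Toy usage: rollouts are dicts with a verification flag.
history = [{"id": 1, "ok": True}, {"id": 2, "ok": False}, {"id": 3, "ok": True}]
pool = collect_successes(history, lambda r: r["ok"])
batch = build_mixed_batch([{"id": 4, "ok": False}], pool, n_replay=5,
                          rng=random.Random(0))
```

The policy-gradient update itself is omitted; the sketch only shows how a mini-batch might be assembled before that update.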
Article 82
Title@2025-07-10 (4): Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
Title: Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving | Agent KB: Nutzung von Cross-Domain-Erfahrungen für die Lösung Agentischer Probleme | Agent KB: 利用跨域经验解决代理问题 2507.06229v2 |
Authors (18): Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou
As language agents tackle increasingly complex tasks, they struggle with effective error correction and experience reuse across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other’s experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks.
由于语言代理商处理日益复杂的任务,他们要面对有效的错误纠正和跨域再利用经验。我们引入了KB代理商,这是一个等级经验框架,通过一个新的“理由检索-检索-检索”管道解决复杂的代理人问题。KB代理商处理核心限制:代理人传统上无法相互学习对方的经验。通过获取高级战略和详细的执行日志,KB代理商创建了一个共享的知识库,使跨代理人的知识转让成为可能。根据GAIA基准评估,KB代理商将成功率提高到16.28个百分点。在最具挑战性的任务中,Claude-3从38.46%提高到57.69%,而GPT-4则在中间任务上从53.49%提高到73.26%。在SWE-bench代码修理方面,KB代理商使Claude-3从41.33%提高到53.33%。我们的结果表明,KB代理商提供了一个模块、框架-诺级基础设施,使代理人能够学习过去的经验,并将成功的战略推广到新的任务。
Article 83
Title@2025-07-10 (4): SAND: Boosting LLM Agents with Self-Taught Action Deliberation
Title: SAND: Boosting LLM Agents with Self-Taught Action Deliberation | SAND: LLM-Agenten mit selbsterzogener Handlungsberatung stärken | SAND:促进具有自学行动考虑的LLM代理 2507.07441v1 |
Authors (8): Yu Xia, Yiran Jenny Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A. Rossi, Lina Yao, Julian McAuley
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning over and comparing alternative actions, LLM agents finetuned with these methods may over-commit to seemingly plausible but suboptimal actions because of limited exploration of the action space. To address this, we propose the Self-taught ActioN Deliberation (SAND) framework, which enables LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given a large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluated on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
大型语言模型(LLM)代理商通常在监管下对重新行为式专家轨迹或优待优于双向推出进行微调,对重整型专家轨迹或优于优于双向推出进行微调。这些方法大多侧重于模仿特定专家行为,或促进选择的推理思想和行动,而不是否定的思维和行动。然而,在不进行推理和比较替代行动的情况下,LLM代理商对这些方法进行微调,可能由于行动空间探索有限而过度承诺采取看似合理但又不最优化的行动。为了解决这个问题,我们在本文件中提议自学的AcioN De Evorication(SAND)框架,使LM代理商在承诺采取一项行动之前能够明确考虑候选行动。为了应对何时和什么来考虑大型行动空间和分级行动评价的挑战,我们采用了自我一致行动抽样和执行指导行动,以有助于利用LM代理商的基础模型综合渐进式的行动思考思想。用迭接方式,然后使用审议轨迹来微调LLM代理商本身。评估两个有代表性的互动代理商任务,SAND在初步监管的调整代理人和外形方法方面实现了平均20%的改进。
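One way to read "self-consistency action sampling" as a trigger for deliberation is sketched below: sample several actions for the same step and deliberate only when they disagree. The threshold and function name are assumptions for illustration, and the abstract does not confirm that SAND uses this exact rule.

```python
from collections import Counter

def needs_deliberation(sampled_actions, agreement_threshold=1.0):
    """Return (deliberate?, majority_action, candidate_actions).

    Deliberation is triggered when the share of the most common sampled
    action falls below agreement_threshold (a hypothetical knob).
    """
    counts = Counter(sampled_actions)
    top_action, top_count = counts.most_common(1)[0]
    agreement = top_count / len(sampled_actions)
    return agreement < agreement_threshold, top_action, sorted(counts)

# Three samples agree: commit directly.
consistent = needs_deliberation(["open drawer", "open drawer", "open drawer"])
# Samples disagree: the distinct actions become deliberation candidates.
inconsistent = needs_deliberation(["open drawer", "go to desk", "open drawer"])
```

In a full pipeline, the candidate list would be passed to an execution-guided critique step; that part is not modeled here.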
Article 84
Title@2025-07-10 (4): Towards Interpretable Time Series Foundation Models
Title: Towards Interpretable Time Series Foundation Models | Auf dem Weg zu interpretierbaren Zeitreihen-Grundmodellen | 迈向可解释时间序列基础模型 2507.07439v1 |
Authors (4): Matthieu Boileau, Philippe Helluy, Jeremy Pawlus, Svitlana Vyetrenko
In this paper, we investigate the distillation of time series reasoning capabilities into small, instruction-tuned language models as a step toward building interpretable time series foundation models. Leveraging a synthetic dataset of mean-reverting time series with systematically varied trends and noise levels, we generate natural language annotations using a large multimodal model and use these to supervise the fine-tuning of compact Qwen models. We introduce evaluation metrics that assess the quality of the distilled reasoning, focusing on trend direction, noise intensity, and extremum localization, and show that the post-trained models acquire meaningful interpretive capabilities. Our results highlight the feasibility of compressing time series understanding into lightweight, language-capable models suitable for on-device or privacy-sensitive deployment. This work contributes a concrete foundation toward developing small, interpretable models that explain temporal patterns in natural language.
在本文中,我们调查将时间序列推理能力蒸馏成小的、经过指导的语言模型,作为建立可解释的时间序列基础模型的一个步骤。利用一个具有系统变化的趋势和噪音水平的中反转时间序列合成数据集,我们利用一个大型多式联运模型生成自然语言说明,并用这些模型监督对紧凑的Quen模型的微调。我们引入了评估推理质量的评价指标,以趋势方向、噪音强度和外形定位为重点,并表明经过培训的模型获得了有意义的解释能力。我们的结果凸显了将时间序列理解压缩为适用于设计或隐私敏感部署的轻量、具有语言能力的模型的可行性。这项工作为开发能够解释自然语言时间模式的小型、可解释模型提供了具体的基础。
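A synthetic mean-reverting series with a controllable trend and noise level can be generated with a simple discrete update. The sketch below assumes an AR(1)-style pull toward a linearly drifting mean; the parameter names (`theta`, `slope`, `sigma`) and the `describe` helper are illustrative choices, since the abstract does not state the generator the authors used.

```python
import random

def mean_reverting_series(n, theta=0.2, mu0=0.0, slope=0.0, sigma=0.1, seed=0):
    """x_{t} = x_{t-1} + theta * (mu_t - x_{t-1}) + sigma * noise,
    with a drifting mean mu_t = mu0 + slope * t (assumed form)."""
    rng = random.Random(seed)
    x = mu0
    series = []
    for t in range(n):
        target = mu0 + slope * t
        x = x + theta * (target - x) + sigma * rng.gauss(0.0, 1.0)
        series.append(x)
    return series

def describe(series, tol=1e-9):
    """Ground-truth labels for trend direction and extremum positions."""
    half = len(series) // 2
    first = sum(series[:half]) / half
    second = sum(series[half:]) / (len(series) - half)
    if second - first > tol:
        trend = "up"
    elif first - second > tol:
        trend = "down"
    else:
        trend = "flat"
    return {"trend": trend,
            "argmax": series.index(max(series)),
            "argmin": series.index(min(series))}

rising = describe(mean_reverting_series(10, slope=0.5, sigma=0.0))
```

Labels of this kind could serve as reference answers when scoring a model's description of trend direction or extremum location.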
Article 85
Title@2025-07-10 (4): SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data
Title: SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data | SynthEHR-Eviction: Verbesserung der Eviction SDoH-Erkennung mit LLM-Augmented Synthetic EHR Data | 合成EHR-驱逐:利用LLM-增强的合成电子HR数据加强驱逐SDoH探测 2507.07421v1 |
Authors (7): Zonghai Yao, Youxia Zhao, Avijit Mitra, David A. Levy, Emily Druhl, Jack Tsai, Hong Yu
Eviction is a significant yet understudied social determinant of health (SDoH), linked to housing instability, unemployment, and mental health. While eviction appears in unstructured electronic health records (EHRs), it is rarely coded in structured fields, limiting downstream applications. We introduce SynthEHR-Eviction, a scalable pipeline combining LLMs, human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction statuses from clinical notes. Using this pipeline, we created the largest public eviction-related SDoH dataset to date, comprising 14 fine-grained categories. Fine-tuned LLMs (e.g., Qwen2.5, LLaMA3) trained on SynthEHR-Eviction achieved Macro-F1 scores of 88.8% (eviction) and 90.3% (other SDoH) on human-validated data, outperforming GPT-4o-APO (87.8%, 87.3%), GPT-4o-mini-APO (69.1%, 78.1%), and BioBERT (60.7%, 68.3%), while enabling cost-effective deployment across various model sizes. The pipeline reduces annotation effort by over 80%, accelerates dataset creation, enables scalable eviction detection, and generalizes to other information extraction tasks.
驱逐是一项重要但研究不足的健康社会决定因素(SDoH),与住房不稳定、失业和心理健康相关。驱逐信息虽然出现在非结构化的电子健康记录(EHR)中,却很少在结构化字段中编码,这限制了下游应用。我们提出 SynthEHR-Eviction,这是一个可扩展的流水线,结合了 LLM、人在回路的标注和自动提示优化(APO),用于从临床记录中提取驱逐状态。借助该流水线,我们构建了迄今最大的公开驱逐相关 SDoH 数据集,包含 14 个细粒度类别。在 SynthEHR-Eviction 上训练的微调 LLM(例如 Qwen2.5、LLaMA3)在人工验证数据上取得了 88.8%(驱逐)和 90.3%(其他 SDoH)的 Macro-F1 分数,优于 GPT-4o-APO(87.8%、87.3%)、GPT-4o-mini-APO(69.1%、78.1%)和 BioBERT(60.7%、68.3%),同时支持在不同模型规模上进行具有成本效益的部署。该流水线将标注工作量减少了 80% 以上,加快了数据集构建,实现了可扩展的驱逐检测,并可推广到其他信息抽取任务。
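Macro-F1, the metric reported above, averages per-class F1 scores with equal weight, so rare categories count as much as frequent ones. The following sketch computes it from scratch; the label names in the usage lines are made up and do not come from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical labels for illustration only.
gold = ["eviction", "eviction", "none", "none"]
pred = ["eviction", "none", "none", "none"]
score = macro_f1(gold, pred)
```

With 14 fine-grained categories of very different frequency, this equal weighting is what makes Macro-F1 a stricter summary than plain accuracy.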
Article 86
Title@2025-07-10 (4): MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning
Title: MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning | MedReadCtrl: Personalisierung medizinischer Textgenerierung mit Lesbarkeitsgesteuertem Unterricht | MedReadReadCtrl: 使医疗文本生成个性化,并进行可读性控制教学学习 2507.07419v1 |
Authors (7): Hieu Tran, Zonghai Yao, Won Seok Jang, Sharmin Sultana, Allen Chang, Yuan Zhang, Hong Yu
Generative AI has demonstrated strong potential in healthcare, from clinical decision support to patient-facing chatbots that improve outcomes. A critical challenge for deployment is effective human-AI communication, where content must be both personalized and understandable. We introduce MedReadCtrl, a readability-controlled instruction tuning framework that enables LLMs to adjust output complexity without compromising meaning. Evaluations of nine datasets and three tasks across medical and general domains show that MedReadCtrl achieves significantly lower readability instruction-following errors than GPT-4 (e.g., 1.39 vs. 1.59 on ReadMe, p<0.001) and delivers substantial gains on unseen clinical tasks (e.g., +14.7 ROUGE-L, +6.18 SARI on MTSamples). Experts consistently preferred MedReadCtrl (71.7% vs. 23.3%), especially at low literacy levels. These gains reflect MedReadCtrl’s ability to restructure clinical content into accessible, readability-aligned language while preserving medical intent, offering a scalable solution to support patient education and expand equitable access to AI-enabled care.
从临床决策支持到改善结果的病人成形聊天机,大赦国际在卫生保健方面显示出巨大的潜力。部署的一个关键挑战是有效的人类-AI交流,其内容必须具有个性化和可理解性。我们引入了MedReadCtrl,这是一个可读性控制的指导调制框架,使LLMS能够调整产出复杂性,而不会影响意义。对9个数据集和整个医疗和一般领域的3项任务的评价表明,MedReadCtrl实现的可读性指导错误大大低于GPT-4(例如,在ReadMe上,1.39对1.59的可读性指示,第<0.01页),在不可见的临床任务上取得了实质性成果(例如,+14.7 ROUGE-L,+6.18 SARI在MTSBamples上)。专家们一贯倾向于MedReadCrl(71.7%对23.3%),特别是在低识字率水平上。这些成果反映了MedReadCtrl将临床内容调整为无障碍、可读性一致的语言的能力,同时保留医疗意图,提供可扩展支持病人接受性治疗教育和扩大公平获得AI的解决方案。
Article 87
Title@2025-07-10 (4): May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks
Title: May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks | Darf ich Ihre Aufmerksamkeit haben? Breaking Fine-Tuning basierte Prompt Injection Defenses mit Architektur-Aware Attacken | 请大家注意,使用建筑软件攻击 突破基于精密发射的快速喷射防御系统 2507.07417v1 |
Authors (4): Nishit V. Pandya, Andrey Labunets, Sicun Gao, Earlence Fernandes
A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions and data, so that the LLM does not follow instructions that might be present in the data. There are several academic systems and production-level implementations of this idea. We evaluate the robustness of this class of prompt injection defenses in the whitebox setting by constructing strong optimization-based attacks and showing that the defenses do not provide the claimed security properties. Specifically, we construct a novel attention-based attack algorithm for text-based LLMs and apply it to two recent whitebox defenses, SecAlign (CCS 2025) and StruQ (USENIX Security 2025), showing attacks with success rates of up to 70% with a modest increase in attacker budget in terms of tokens. Our findings make fundamental progress towards understanding the robustness of prompt injection defenses in the whitebox setting. We release our code and attacks at https://github.com/nishitvp/better_opts_attacks
针对大型语言模型(LLMs)的快速注射攻击,大众防护等级依靠微调该模型,将指示和数据分开,从而使LLM不遵循可能存在的数据指示。有数个学术系统和生产层面落实了这一想法。我们通过构建强大的优化攻击来评估白箱设置中这类迅速注射防御的稳健性,并表明这些防御没有提供声称的安全特性。具体地说,我们为基于文本的LMs设计了一种新的关注攻击算法,并将其应用于最近两个白箱防御系统(CCS 2025)和StruQ(USENIX Security 2025),显示袭击成功率高达70 %,攻击者预算在象征性方面略有增加。我们的调查结果在理解白箱设置中迅速注射防御的稳健性方面取得了根本进展。我们在https://github.com/nishitvp/bretter_opts_actions发布了我们的代码和攻击。我们在https://github.com/nitivp/breter_opts_actions_actions。
Article 88
Title@2025-07-10 (4): Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation
Title: Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation | Interlinguistische phonetische Komposition (IPC): Ein theoretischer und rechnerischer Ansatz, um die zweite Sprache zu verbessern | 语言间音音组成:加强第二语言发音的理论和计算方法 2411.10927v3 |
Authors (4): Jisang Park, Minu Kim, DaYoung Hong, Jongha Lee
Learners of a second language (L2) often unconsciously substitute unfamiliar L2 phonemes with similar phonemes from their native language (L1), even though native speakers of the L2 perceive these sounds as distinct and non-interchangeable. This phonemic substitution leads to deviations from the standard phonological patterns of the L2, creating challenges for learners in acquiring accurate L2 pronunciation. To address this, we propose Inter-linguistic Phonetic Composition (IPC), a novel computational method designed to minimize incorrect phonological transfer by reconstructing L2 phonemes as composite sounds derived from multiple L1 phonemes. Tests with two automatic speech recognition models demonstrated that when L2 speakers produced IPC-generated composite sounds, the recognition rate of target L2 phonemes improved by 20% compared to when their pronunciation was influenced by original phonological transfer patterns. The improvement was observed within a relatively shorter time frame, demonstrating rapid acquisition of the composite sound.
第二语言(L2)的学习者通常在无意中用本族语言(L1)的类似电话(L1)替换不熟悉的L2电话,尽管L2的母语使用者认为这些声音是独特和不可互换的。这种电话替代导致与L2的标准声学模式的偏差,给学习者获取准确L2发音带来了挑战。为了解决这个问题,我们提议了语言间电话构成(IPC)这一新型的计算方法,旨在通过将L2电话重建为来自多个L1电话的复合声音来尽量减少不正确的音调。 两项自动语音识别模型的测试表明,当L2讲者制作IPC产生的复合声音时,目标L2电话的识别率比受原声传音模式影响时提高了20%。 在较短的时间框架内观察到了这一改进,显示合成音的迅速获得。
Article 89
Title@2025-07-10 (4): TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning
Title: TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning | TART: Ein Open-Source Tool-Augmented Framework für erklärbare Tabellen-basierte Begründung | TARRT: 开放源码工具推荐框架,用于说明基于表格的理由 2409.11724v3 |
Authors (5): Xinyuan Lu, Liangming Pan, Yubo Ma, Preslav Nakov, Min-Yen Kan
Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering (TQA) and table-based fact verification (TFV). To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table-tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-source LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios. All the code and data are available at https://github.com/XinyuanLu00/TART.
当前的大语言模型(LLMS)在理解表格结构和应用精确数字推理方面能力有限,而精确数字推理对于表格答题(TQA)和基于表格的事实核实(TFV)等任务至关重要。为了应对这些挑战,我们引入了我们的表格工具推荐理由框架(TART),将LLMS与专门工具相结合。TART包含三个关键组成部分:确保准确数据表述的目录;开发具体计算工具的工具制造者;以及保持解释性的解释性生成器。我们还介绍了TOOLTAB数据集,这是专门为培训表工具整合中的LMS设计的新基准。我们的实验表明,TART通过提高数据处理的精确性和推理过程的清晰性,大大改进了现有方法(例如链)。 值得注意的是,TART与代码Llama的配对使封闭源LMGPT-3.5-turbo的准确度达到90.0%,突出其在不同现实世界情景中的稳健性。所有代码和数据都可在 https://giuth.Lub.
Article 90
Title@2025-07-10 (4): GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation
Title: GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation | GNN-CNN: Ein effizientes Hybridmodell für konvolutionäre und Graphen-Neuralnetzwerke zur Textdarstellung | GNN-CNN: 用于文本代表的动态和图形神经网络的有效混合模型 2507.07414v1 |
Authors (1): Fardin Rastakhiz
Time, cost, and energy efficiency are critical considerations in Deep-Learning (DL), particularly when processing long texts. Transformers, which represent the current state of the art, exhibit quadratic computational complexity relative to input length, making them inefficient for extended documents. This study introduces a novel model architecture that combines Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), integrated with a real-time, end-to-end graph generation mechanism. The model processes compact batches of character-level inputs without requiring padding or truncation. To enhance performance while maintaining high speed and efficiency, the model incorporates information from Large Language Models (LLMs), such as token embeddings and sentiment polarities, through efficient dictionary lookups. It captures local contextual patterns using CNNs, expands local receptive fields via lattice-based graph structures, and employs small-world graphs to aggregate document-level information. The generated graphs exhibit structural properties indicative of meaningful semantic organization, with an average clustering coefficient of approximately 0.45 and an average shortest path length ranging between 4 and 5. The model is evaluated across multiple text classification tasks, including sentiment analysis and news-categorization, and is compared against state-of-the-art models. Experimental results confirm the proposed model’s efficiency and competitive performance.
深度学习(DL)中,特别是处理长文本时,时间、成本和能源效率是关键因素。 代表当前工艺水平的变异器,在投入长度方面表现出四进制的计算复杂性,使其对扩展文档效率低下。本研究引入了新型模型结构,将图形神经网络(GNNs)和进化神经网络(CNNs)与实时、端到端图形生成机制相结合,并使用小世界图解汇总文件级信息。模型流程压缩了字符级投入的批量,而不需要划线或转线。为了提高性能,同时保持高速度和效率,该模型通过高效的字典查看,将来自大语言模型(LLMS)的信息(如象征性嵌入和情绪极化)纳入其中。该模型利用CNNS(GN)和进化神经网络(CRN)捕捉当地环境模式,通过基于 Lattice的图形结构扩大本地可容纳域域域域域域域,并使用小世界图表汇总文件级信息。生成的图表显示了有意义的文体结构属性组织的结构属性,其平均组合系数约为0.45,平均路径长度为4到5和5之间的最短路径长度不等。该模型根据多文本级图像和图像和图像分析进行了评估。
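The two graph statistics quoted in the abstract (average clustering coefficient and average shortest path length) can be computed with plain breadth-first search. The sketch below uses a ring lattice only as a stand-in graph; how the model actually builds its lattice and small-world graphs is not specified in the abstract, so `ring_lattice` is an assumption.

```python
from collections import deque

def ring_lattice(n, k):
    """Stand-in graph: each node links to its k nearest ring neighbors (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def average_clustering(adj):
    """Mean over nodes of (links among neighbors) / (possible neighbor pairs)."""
    total = 0.0
    for nbrs in adj.values():
        deg = len(nbrs)
        if deg < 2:
            continue  # contributes 0 by convention
        nb = sorted(nbrs)
        links = sum(1 for a in range(deg) for b in range(a + 1, deg)
                    if nb[b] in adj[nb[a]])
        total += 2.0 * links / (deg * (deg - 1))
    return total / len(adj)

def average_shortest_path(adj):
    """Mean BFS distance over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

graph = ring_lattice(20, 4)
clustering = average_clustering(graph)
path_length = average_shortest_path(graph)
```

High clustering combined with a short average path is the usual signature of a small-world graph, which is why the abstract reports both numbers together.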
Article 91
Title@2025-07-10 (4): CoAM: Corpus of All-Type Multiword Expressions
Title: CoAM: Corpus of All-Type Multiword Expressions | CoAM: Corpus von Multiwort-Ausdrücken aller Art | CoAM: 全类型多字表达式组合体 2412.18151v3 |
Authors (7): Yusuke Ide, Joshua Tanner, Adam Nohejl, Jacob Hoffman, Justin Vasselli, Hidetaka Kamigaito, Taro Watanabe
Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process to enhance data quality consisting of human annotation, human review, and automated consistency checking. Additionally, for the first time in a dataset of MWE identification, CoAM’s MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form. Through experiments using CoAM, we find that a fine-tuned large language model outperforms MWEasWSD, which achieved the state-of-the-art performance on the DiMSUM dataset. Furthermore, analysis using our MWE type tagged data reveals that Verb MWEs are easier than Noun MWEs to identify across approaches.
多字表达式( MWE) 指多个单词的单词序列 。 MWE 识别,即在文本中检测 MWE, 可以在机器翻译等下游任务中发挥关键作用, 但任务的现有数据集有不一致的注解, 仅限于单一类型的 MWE , 或范围有限 。 为了进行可靠和全面的评估, 我们创建了 CoAM: All- Type 多词表达式的Corus of All- Type Moltite Exlaments, 一组由1.3K 句组成的数据集, 由多步程序构建, 以提高数据质量, 包括人类注解、 人文审查和自动一致性检查。 此外, 首次在 MWE 识别数据集中, CoAM 的 MWE E 被标记为 MWE 类型, 例如 Noun 和 Verb, 能够进行精细的错误分析。 coAM 说明是使用与我们的接口生成的新界面生成的界面, 方便和灵活地对 MWA-E 进行注解的大型语言模型 。
Article 92
Title@2025-07-10 (4): Rethinking Verification for LLM Code Generation: From Generation to Testing
Title: Rethinking Verification for LLM Code Generation: From Generation to Testing | Überprüfung der LLM-Code-Generierung neu denken: Von der Generation zur Prüfung | 重新思考LLM 代码生成的核查:从生成到测试 2507.06920v2 |
Authors (8): Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen
Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), combining human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.
大型语言模型(LLMS)最近在人文经济学和LiveCodeBench等代码生成基准方面取得了显著的成功。然而,详细检查显示,这些评价套件往往只包含数量有限的同质测试案例,导致细微的缺陷无法被察觉。这不仅人为地放大了衡量业绩,而且还在利用可核查的奖励(RLVR)加强学习框架方面损害了准确的奖励估算。为了解决这些关键缺陷,我们系统地调查测试套件生成(TCG)任务,提出多维度指标,以严格量化测试的准确性。此外,我们采用了一种人文-LLM合作方法(SAGA),利用LM推理能力来利用人文方案编制专门知识,以大幅提高所产生测试案例的覆盖面和质量。此外,我们开发了TCGBench,以便利对TCG任务的研究。实验显示,SAGA实现了90.62%的检测率和32.58%的核实率精确度。此外,我们引入了一种比LiveCMLA推算法更高级的代码的代码制作方法,为SAGA的创建成本缩缩缩缩缩缩缩缩缩的计算法的计算结果。
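The gap between a 90.62% detection rate and a 32.58% verifier accuracy is easier to see with a small example. The sketch below assumes that detection rate counts individual wrong solutions caught by the test suite, while verifier accuracy counts problems whose suite catches every known wrong solution; these definitions are an interpretation and are not stated in the abstract.

```python
def detection_rate(results):
    """Share of individual wrong solutions that the test suite rejects.

    results maps a problem id to a list of booleans, one per known wrong
    solution (True = caught). This layout is hypothetical.
    """
    flags = [caught for per_problem in results.values() for caught in per_problem]
    return sum(flags) / len(flags)

def verifier_accuracy(results):
    """Share of problems whose suite rejects all known wrong solutions."""
    return sum(all(per_problem) for per_problem in results.values()) / len(results)

# Toy data: one missed wrong solution is enough to fail a whole problem.
toy = {"p1": [True, True], "p2": [True, False], "p3": [True]}
dr = detection_rate(toy)
vacc = verifier_accuracy(toy)
```

Under this reading, verifier accuracy is the stricter metric: a single undetected faulty solution makes the whole problem count as a failure, so it can stay low even when most wrong solutions are caught.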
Article 93
Title@2025-07-10 (4): Large Language Model for Extracting Complex Contract Information in Industrial Scenes
Title: Large Language Model for Extracting Complex Contract Information in Industrial Scenes | Großes Sprachmodell zur Extraktion komplexer Vertragsinformationen in Industrieszenen | 工业景点复杂合同信息提取大语言模型 2507.06539v2 |
Authors (3): Yunyang Cao, Yanjun Li, Silong Dai
This paper proposes a high-quality dataset construction method for complex contract information extraction tasks in industrial scenarios and fine-tunes a large language model based on this dataset. Firstly, cluster analysis is performed on industrial contract texts, and GPT-4 and GPT-3.5 are used to extract key information from the original contract data, obtaining high-quality data annotations. Secondly, data augmentation is achieved by constructing new texts, and GPT-3.5 generates unstructured contract texts from randomly combined keywords, improving model robustness. Finally, the large language model is fine-tuned based on the high-quality dataset. Experimental results show that the model achieves excellent overall performance while ensuring high field recall and precision and considering parsing efficiency. LoRA, data balancing, and data augmentation effectively enhance model accuracy and robustness. The proposed method provides a novel and efficient solution for industrial contract information extraction tasks.
本文建议了一种高质量的数据集构建方法,用于在工业情景下执行复杂的合同信息提取任务,并根据该数据集对一个大型语言模型进行微调。首先,对工业合同文本进行分组分析,利用GPT-4和GPT-3.5从原始合同数据中提取关键信息,获得高质量的数据说明。第二,通过构建新文本,实现了数据增加,GPT-3/5从随机合并的关键词中生成了非结构化的合同文本,提高了模型的可靠性。最后,根据高质量数据集对大语言模型进行了微调。实验结果表明,该模型取得了出色的总体性能,同时确保了高实地召回率和精确度,并考虑到了配置效率。LORA、数据平衡和数据增强有效地提高了模型的准确性和稳健性。拟议方法为工业合同信息提取任务提供了一个新而有效的解决方案。
Article 94
Title@2025-07-10 (4): BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
Title: BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems | BountyBench: Dollar-Impact von KI-Agenten-Angriffen und Verteidigern auf reale Cybersicherheitssysteme | BuntyBuntyBunnench: AI代理攻击者和捍卫者对现实世界网络安全系统的美元影响 2505.15216v2 |
Authors (34): Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit Choudhary, Siddharth M. Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Daniel E. Ho, Percy Liang
AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards of $10-$30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a specific vulnerability. We evaluate 8 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, and DeepSeek-R1. Given up to three attempts, the top-performing agents are OpenAI Codex CLI: o3-high (12.5% on Detect, mapping to $3,720; 90% on Patch, mapping to $14,152), Custom Agent with Claude 3.7 Sonnet Thinking (67.5% on Exploit), and OpenAI Codex CLI: o4-mini (90% on Patch, mapping to $14,422). OpenAI Codex CLI: o3-high, OpenAI Codex CLI: o4-mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 37.5-67.5% and Patch scores of 35-60%.
AI 代理商具有显著改变网络安全景观的潜力。 在这里, 我们引入了第一个框架, 以在不断演变的现实世界系统中捕捉攻击性和防御性网络安全性。 我们通过 BountyBench 来维护这个框架。 我们建立了25个系统, 拥有复杂的、 真实的代码库。 为了捕捉脆弱性生命周期, 我们定义了三种任务类型 : 检测( 发现新的脆弱性 ) , 开发( 开发特定的脆弱性 ) 和 Patch( 绘制特定的脆弱性 )。 为了检测, 我们建立了一个新的成功指标, 跨脆弱类型, 提供局部评估。 我们手动为每个系统设置了环境, 包括安装软件包, 安装服务器, 以及水晶体数据库数据库。 我们增加了40个错误元, 这是货币奖项的10 - 30, 485, 覆盖 OWASP Top 10 风险的9 。 为了调节任务难度, 我们设计了一个基于信息的新的战略, 指导检测, 确定一个零天, 用于特定的脆弱性 。 我们评估8个代理商 : C 代码, OS 代码, 代码 代码 代码 代码, C- 代码 代码 和 C- PTLI 3, IM- hil4, 服务器, 运行 的 O.
Article 95
Title@2025-07-10 (4): Bradley-Terry and Multi-Objective Reward Modeling Are Complementary
Title: Bradley-Terry and Multi-Objective Reward Modeling Are Complementary | Bradley-Terry und Multi-Objective Reward Modeling sind komplementär | Bradley-Terriy和多目标奖励模型具有补充作用 2507.07375v1 |
Authors (13): Zhiwei Zhang, Hui Liu, Xiaomin Li, Zhenwei Dai, Jingying Zeng, Fali Wang, Minhua Lin, Ramraj Chandradevan, Zhen Li, Chen Luo, Xianfeng Tang, Qi He, Suhang Wang
Reward models trained on human preference data have demonstrated strong effectiveness in aligning Large Language Models (LLMs) with human intent under the framework of Reinforcement Learning from Human Feedback (RLHF). However, RLHF remains vulnerable to reward hacking, where the policy exploits imperfections in the reward function rather than genuinely learning the intended behavior. Although significant efforts have been made to mitigate reward hacking, they predominantly focus on and evaluate in-distribution scenarios, where the training and testing data for the reward model share the same distribution. In this paper, we empirically show that state-of-the-art methods struggle in more challenging out-of-distribution (OOD) settings. We further demonstrate that incorporating fine-grained multi-attribute scores helps address this challenge. However, the limited availability of high-quality data often leads to weak performance of multi-objective reward functions, which can negatively impact overall performance and become the bottleneck. To address this issue, we propose a unified reward modeling framework that jointly trains Bradley–Terry (BT) single-objective and multi-objective regression-based reward functions using a shared embedding space. We theoretically establish a connection between the BT loss and the regression objective and highlight their complementary benefits. Specifically, the regression task enhances the single-objective reward function’s ability to mitigate reward hacking in challenging OOD settings, while BT-based training improves the scoring capability of the multi-objective reward function, enabling a 7B model to outperform a 70B baseline. Extensive experimental results demonstrate that our framework significantly improves both the robustness and the scoring performance of reward models.
基于人类偏好数据训练的奖励模型,在基于人类反馈的强化学习(RLHF)框架下,已被证明能有效地使大语言模型(LLM)与人类意图对齐。然而,RLHF仍然容易受到奖励黑客(reward hacking)的影响,即策略利用奖励函数中的缺陷,而不是真正学习预期行为。尽管为缓解奖励黑客已作出大量努力,但这些工作主要关注并评估分布内场景,即奖励模型的训练数据与测试数据服从相同分布。在本文中,我们通过实验表明,最先进的方法在更具挑战性的分布外(OOD)设置中表现不佳。我们进一步证明,引入细粒度的多属性评分有助于应对这一挑战。然而,高质量数据的匮乏往往导致多目标奖励函数表现较弱,这会对整体性能产生负面影响并成为瓶颈。为解决这一问题,我们提出一个统一的奖励建模框架,利用共享嵌入空间联合训练Bradley–Terry(BT)单目标奖励函数和基于回归的多目标奖励函数。我们从理论上建立了BT损失与回归目标之间的联系,并强调了二者的互补优势。具体而言,回归任务增强了单目标奖励函数在具有挑战性的OOD设置中缓解奖励黑客的能力,而基于BT的训练提升了多目标奖励函数的评分能力,使7B模型能够超越70B基线。大量实验结果表明,我们的框架显著提升了奖励模型的鲁棒性和评分性能。
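The joint objective described in the abstract above can be illustrated with a minimal sketch. It combines a Bradley–Terry pairwise loss with a mean-squared-error regression loss on attribute scores. The additive combination and the `weight` parameter are assumptions made for illustration; the abstract does not state how the two losses are balanced, and the function names are hypothetical.

```python
import math


def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected))."""
    diff = r_chosen - r_rejected
    # Numerically stable softplus(-diff) = log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def mse_loss(pred, target):
    """Mean squared error over multi-attribute scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(r_chosen, r_rejected, attr_pred, attr_target, weight=1.0):
    # Assumption: the two objectives are simply added with a scalar weight.
    return bt_loss(r_chosen, r_rejected) + weight * mse_loss(attr_pred, attr_target)
```

In a real reward model both heads would read from the same embedding, so gradients from either term would update the shared representation; this sketch only shows the scalar loss.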
Article 96
Title@2025-07-10 (4): Shifting from Ranking to Set Selection for Retrieval Augmented Generation
Title: Shifting from Ranking to Set Selection for Retrieval Augmented Generation | Wechsel vom Ranking zur Einstellungsauswahl für retrieval Augmented Generation | 从排位移到设置回收增量一代的选择 2507.06838v2 |
Authors (4): Dahyun Lee, Yongrae Jo, Haeju Park, Moontae Lee
Retrieval in Retrieval-Augmented Generation (RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs of complex queries in multi-hop question answering. In this work, we propose a set-wise passage selection approach and introduce SETR, which explicitly identifies the information requirements of a query through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy those requirements. Experiments on multi-hop RAG benchmarks show that SETR outperforms both proprietary LLM-based rerankers and open-source baselines in terms of answer correctness and retrieval quality, providing an effective and efficient alternative to traditional rerankers in RAG systems. The code is available at https://github.com/LGAI-Research/SetR
在检索搜索增强的一代(RAG)中,检索检索必须确保检索到的通道不仅个别相关,而且集体形成一套综合的全套内容。现有办法主要是根据个人相关性重新排列顶级通道,往往无法满足多点答题中复杂查询的信息需求。在这项工作中,我们建议采用一套自定的通道选择方法,并引入SETR,该方法通过“探讨链”推理明确确定查询的信息要求,并选择一套最佳通道,以集体满足这些要求。多点搜索RAG基准实验显示,在回答正确性和检索质量方面,SETR优于专有的LLM重新排序器和开放源基线,为RAG系统中的传统重新排序者提供了有效和高效的替代方法。该代码可在https://github.com/LGAI-Research/SetR查阅。
Article 97
Title@2025-07-09 (3): Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation
Title: Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation | Zähmung der Tri-Space-Spannung: ARC-geführte Halluzinationsmodellierung und -steuerung für die Text-zu-Bild-Generierung | 驯服三空间张力:面向文本到图像生成的ARC引导幻觉建模与控制 2507.04946v2 |
Authors (2): Jianjiang Yang, Ziyan Huang
Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent “hallucinations”, where generated content subtly or significantly diverges from the intended prompt semantics. Although these failures are often regarded as unpredictable artifacts, we argue that they reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three critical axes: semantic coherence, structural alignment, and knowledge grounding. We then formalize this three-axis space as the Hallucination Tri-Space and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.
尽管在图像质量和提示忠实度方面取得了显著进展,文本到图像(T2I)扩散模型仍然持续出现“幻觉”,即生成内容与预期的提示语义存在细微或显著的偏离。这些失败常被视为不可预测的伪影,但我们认为它们反映了生成过程中更深层的、结构化的错位。在这项工作中,我们提出一种受认知启发的视角,将幻觉重新解释为潜在对齐空间内的轨迹漂移。经验观察表明,生成过程在一个多轴认知张力场中展开,模型必须在三个关键轴上持续权衡相互竞争的需求:语义连贯性、结构对齐和知识依据。随后,我们将这一三轴空间形式化为幻觉三空间(Hallucination Tri-Space),并引入对齐风险码(ARC):一种动态向量表示,用于量化生成过程中的实时对齐张力。ARC的模长刻画整体错位程度,其方向指明主导的失败轴,其不平衡度反映张力的不对称性。基于这一表述,我们开发了TensionModulator(TM-ARC):一个完全在潜在空间中运行的轻量级控制器。TM-ARC监测ARC信号,并在采样过程中施加有针对性的、针对特定轴的干预。在标准T2I基准上的大量实验表明,我们的方法在不损害图像质量或多样性的前提下显著减少了幻觉。该框架为理解和缓解基于扩散的T2I系统中的生成失败提供了一种统一且可解释的方法。
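As a rough illustration of how a three-axis ARC vector might be summarized, the sketch below computes a magnitude, a dominant axis, and an imbalance score. The abstract names these three quantities but does not give formulas, so the Euclidean norm, the largest-absolute-component rule, and the max-minus-min imbalance are all assumptions, and `summarize_arc` is a hypothetical helper.

```python
import math

# Axis order follows the abstract: semantic coherence, structural alignment,
# knowledge grounding.
AXES = ("semantic", "structural", "knowledge")


def summarize_arc(arc):
    """Summarize a 3-component tension vector (illustrative definitions only)."""
    # Assumption: overall misalignment is the Euclidean norm.
    magnitude = math.sqrt(sum(x * x for x in arc))
    # Assumption: the dominant failure axis has the largest absolute tension.
    dominant = AXES[max(range(len(AXES)), key=lambda i: abs(arc[i]))]
    # Assumption: imbalance is the spread between the largest and smallest
    # absolute components.
    magnitudes = [abs(x) for x in arc]
    imbalance = max(magnitudes) - min(magnitudes)
    return magnitude, dominant, imbalance
```

A controller in the spirit of TM-ARC could use the dominant axis to decide which intervention to apply, although the actual intervention rules are not described in the abstract.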
Article 98
Title@2025-07-09 (3): Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation
Title: Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation | Multi-Agent Retrieval-Augmented Framework for Evidence-based Counterspeech Against Health Misinformation | 以证据为依据的反健康错误信息反言多证据检索强化框架 2507.07307v1 |
Authors (6): Anirban Saha Anik, Xiaoying Song, Elliott Wang, Bryan Wang, Bengisu Yarimbas, Lingzi Hong
Large language models (LLMs) incorporated with Retrieval-Augmented Generation (RAG) have demonstrated powerful capabilities in generating counterspeech against misinformation. However, current studies rely on limited evidence and offer less control over final outputs. To address these challenges, we propose a Multi-agent Retrieval-Augmented Framework to generate counterspeech against health misinformation, incorporating multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement. Our approach integrates both static and dynamic evidence, ensuring that the generated counterspeech is relevant, well-grounded, and up-to-date. Our method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy, demonstrating its effectiveness in generating high-quality counterspeech. To further validate our approach, we conduct ablation studies to verify the necessity of each component in our framework. Furthermore, human evaluations reveal that refinement significantly enhances counterspeech quality and obtains human preference.
结合检索增强生成(RAG)的大型语言模型(LLMs)在生成针对错误信息的反言方面表现出强大的能力。然而,目前的研究依赖有限的证据,对最终输出的控制也较少。为了应对这些挑战,我们提议建立一个多智能体检索增强框架,以针对健康错误信息生成反言,纳入多个LLM,以优化知识检索、证据增强和回应完善。我们的方法结合了静态和动态证据,确保生成的反言具有相关性、依据充分且内容最新。我们的方法在礼貌、相关性、信息性和事实准确性方面优于基线方法,表明其在产生高质量反言方面的有效性。为了进一步验证我们的方法,我们进行了消融研究,以核实我们框架中每个组成部分的必要性。此外,人类评价显示,完善步骤显著提高了反言质量并获得了人类偏好。
Article 99
Title@2025-07-09 (3): Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation
Title: Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation | Ranker, Richter und Assistenten: Auf dem Weg zum Verständnis des Interplays von LLMs in Information Retrieval Evaluation | 排序器、评判者和助手:了解信息检索评价中LLM的交互作用 2503.19092v2 |
Authors (3): Krisztian Balog, Donald Metzler, Zhen Qin
Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This widespread adoption necessitates a critical examination of potential biases arising from the interplay between these LLM-based components. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges. We provide the first empirical evidence of LLM judges exhibiting significant bias towards LLM-based rankers. Furthermore, we observe limitations in LLM judges’ ability to discern subtle system performance differences. Contrary to some previous findings, our preliminary study does not find evidence of bias against AI-generated content. These results highlight the need for a more holistic view of the LLM-driven information ecosystem. To this end, we offer initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation.
大型语言模型(LLMs)日益成为信息检索、授权排名、评价和AI协助的内容创建的有机组成部分。这种广泛采用使得有必要对这些基于LLM的各组成部分之间的相互作用所产生的潜在偏差进行严格审查。本文件综合了现有的研究,并提出了新的实验设计,探索LLM的中层人员和助理如何影响LLM的法官。我们提供了LLM法官对LLM的中层人员表现出严重偏向的第一批经验证据。此外,我们注意到LLM法官发现微妙系统性能差异的能力受到限制。与以前的一些调查结果相反,我们的初步研究没有发现对AI-M所产生内容的偏差证据。这些结果突出表明有必要更全面地看待LLM所驱动的信息生态系统。为此,我们提供了初步指导方针和研究议程,以确保在IR评价中可靠地使用LMs。
Article 100
Title@2025-07-09 (3): ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning
Title: ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning | ViDove: Ein Übersetzungs-Agenten-System mit multimodalem Kontext und Memory-Augmented Reasoning | ViDove:一个具有多式背景和记忆增强理由的翻译代理系统 2507.07306v1 |
Authors (12): Yichen Lu, Wei Dai, Jiaen Liu, Ching Wing Kwok, Zongheng Wu, Xudong Xiao, Ao Sun, Sheng Fu, Jianyuan Zhan, Yian Wang, Takatomo Saito, Sicheng Lai
LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our code is available here: https://github.com/pigeonai-org/ViDove
以LLM为基础的翻译代理商取得了高度人性化的翻译结果,能够以更高的效率处理更长期和更复杂的背景,然而,这些代理商通常仅限于只文本的投入。在本文件中,我们引入了ViDove,这是为多式联运投入设计的翻译代理商系统。受人翻译工作流程的启发,ViDove利用视觉和背景背景资料来增强翻译过程。此外,我们整合了多式记忆系统和具有特定领域知识的长期短期记忆模块,使该代理商能够在现实世界的情景中更准确和适应性地开展工作。结果,ViDove在字幕生成和一般翻译任务方面都实现了显著更高的翻译质量,BLEU分数提高了28%,SubER值比以往最先进的基线提高了15%。此外,我们引入了DoveBench,这是长式自动视频子传输和翻译的新基准,拥有17小时的高品质、附加说明的数据。我们的代码可以在这里查阅:https://github.com/pigeonai-org/ViDoveve。
Article 101
Title@2025-07-09 (3): The Impact of Background Speech on Interruption Detection in Collaborative Groups
Title: The Impact of Background Speech on Interruption Detection in Collaborative Groups | Die Auswirkungen der Hintergrundrede auf die Störungserkennung in kollaborativen Gruppen | 背景发言对合作小组中间探测的影响 2507.07280v1 |
Authors (3): Mariah Bradford, Nikhil Krishnaswamy, Nathaniel Blanchard
Interruption plays a crucial role in collaborative learning, shaping group interactions and influencing knowledge construction. AI-driven support can assist teachers in monitoring these interactions. However, most previous work on interruption detection and interpretation has been conducted in single-conversation environments with relatively clean audio. AI agents deployed in classrooms for collaborative learning within small groups will need to contend with multiple concurrent conversations – in this context, overlapping speech will be ubiquitous, and interruptions will need to be identified in other ways. In this work, we analyze interruption detection in single-conversation and multi-group dialogue settings. We then create a state-of-the-art method for interruption identification that is robust to overlapping speech, and thus could be deployed in classrooms. Further, our work highlights meaningful linguistic and prosodic information about how interruptions manifest in collaborative group interactions. Our investigation also paves the way for future works to account for the influence of overlapping speech from multiple groups when tracking group dialog.
在协作学习、形成群体互动和影响知识建设方面,干扰至关重要。AI驱动的支持可以帮助教师监测这些互动。然而,以往大部分关于中断探测和解释的工作都是在单一对话环境中以相对清洁的音频进行。在教室中为小群体协作学习而部署的AI代理人员需要与多重同时对话进行斗争 – – 在这种情况下,重复的演讲将无处不在,需要以其他方式查明中断。在这项工作中,我们分析在单一对话和多群体对话环境中的中断探测。然后,我们创建了一种最先进的中断识别方法,该方法对重复的演讲非常有力,因此可以部署在教室中。此外,我们的工作突出了有意义的语言和预言信息,说明协作群体互动中如何出现中断。我们的调查还为未来的工作铺平了道路,以便在跟踪小组对话时考虑到多个群体重复发言的影响。
Article 102
Title@2025-07-09 (3): LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation
Title: LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation | LinguaMark: Sprechen multimodale Modelle fair? Eine Benchmark-basierte Bewertung | LinguaMark:多模式模型是否公平说话?基于基准的评价 2507.07274v1 |
Authors (4): Ananya Raval, Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza
Large Multimodal Models (LMMs) are typically trained on vast corpora of image-text data but are often limited in linguistic coverage, leading to biased and unfair outputs across languages. While prior work has explored multimodal evaluation, less emphasis has been placed on assessing multilingual capabilities. In this work, we introduce LinguaMark, a benchmark designed to evaluate state-of-the-art LMMs on a multilingual Visual Question Answering (VQA) task. Our dataset comprises 6,875 image-text pairs spanning 11 languages and five social attributes. We evaluate models using three key metrics: Bias, Answer Relevancy, and Faithfulness. Our findings reveal that closed-source models generally achieve the highest overall performance. Both closed-source (GPT-4o and Gemini2.5) and open-source models (Gemma3, Qwen2.5) perform competitively across social attributes, and Qwen2.5 demonstrates strong generalization across multiple languages. We release our benchmark and evaluation code to encourage reproducibility and further research.
大型多式模型(LMMs)通常在图像文本数据的广泛组合方面受过培训,但在语言覆盖面方面往往有限,导致不同语言的偏差和不公平产出。虽然以前的工作探索了多式评价,但不太强调多语种能力评估。在这项工作中,我们引入了LinguaMark,这是评估多语种视觉问题解答(VQA)任务方面最先进的LMMs的基准。我们的数据集由6 875对图像文本组成,涵盖11种语言和5种社会属性。我们用三种关键指标(Bias、回答相关性和信仰性)对模型进行评估。我们的调查结果显示,封闭源模式一般都取得最高的总体业绩。封闭源(GPT-4o和Gemimini2.5)和开放源模式(Gemma3,Qwen2.5)都具有竞争力地跨越了社会属性,而Quen2.5则展示了多种语言的高度普及性。我们发布了基准和评价代码,以鼓励再生和进一步研究。
Article 103
Title@2025-07-09 (3): Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery
Title: Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery | Open Source Planning & Control System mit Language Agents für autonome wissenschaftliche Entdeckung | 拥有自主科学发现语言代理的开放源规划和控制系统 2507.07257v1 |
Authors (26): Licong Xu, Milind Sarkar, Anto I. Lonappan, Íñigo Zubeldia, Pablo Villanueva-Domingo, Santiago Casas, Christian Fidler, Chetana Amancharla, Ujjwal Tiwari, Adrian Bayer, Chadi Ait Ekiou, Miles Cranmer, Adrian Dimitrov, James Fergusson, Kahaan Gandhi, Sven Krippendorf, Andrew Laverick, Julien Lesgourgues, Antony Lewis, Thomas Meier, Blake Sherwin, Kristen Surrao, Francisco Villaescusa-Navarro, Chi Wang, Xueqing Xu, Boris Bolliet
We present cmbagent, a multi-agent system for the automation of scientific research tasks. The system comprises about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD-level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.
我们提出了一个多试剂系统,用于科研任务的自动化,cmbexic。这个系统由大约30个大语言模型代理商组成,并执行一项规划与控制战略,以协调代理工作流程,在任何时间都没有人在行。每个代理商都专门从事不同的任务(科学论文和代码库的检索、写法、解释结果、使其他代理商的输出具有生命力)和系统能够在当地执行代码。我们成功地应用了cmbexuc执行博士级宇宙学任务(使用超新星数据测量宇宙参数),并评估其在两个基准系列上的性能,在最新LLMS上找到优异的性能。源代码在GitHub上可以找到,演示视频也可以在HuggingFace上安装,并将在云层上提供。
Article 104
Title@2025-07-09 (3): A Comprehensive Survey of Contamination Detection Methods in Large Language Models
Title: A Comprehensive Survey of Contamination Detection Methods in Large Language Models | Eine umfassende Untersuchung von Methoden zur Erkennung von Verunreinigungen in großen Sprachmodellen | 关于大语言模型中污染探测方法的综合调查 2404.00699v5 |
Authors (9): Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, Shafiq Joty
With the rise of Large Language Models (LLMs) in recent years, abundant new opportunities are emerging, but also new challenges, among which contamination is quickly becoming critical. Business applications and fundraising in Artificial Intelligence (AI) have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into tens of millions of dollars, placing high pressure on model integrity. At the same time, it is becoming harder and harder, if not impossible, to keep track of the data that LLMs have seen, since closed-source models like GPT-4 and Claude-3 do not divulge any information about their training sets. As a result, contamination becomes a major issue: LLMs’ performance may not be reliable anymore, as the high performance may be at least partly due to their previous exposure to the data. This limitation jeopardizes real capability improvement in the field of NLP, yet methods for efficiently detecting contamination are still lacking. In this paper, we survey all recent work on contamination detection with LLMs, analyzing their methodologies and use cases to shed light on the appropriate usage of contamination detection methods. Our work calls on the NLP research community to systematically account for contamination bias in LLM evaluation.
随着近年来大语言模型(LLMs)的兴起,出现了大量新的机会,但也出现了新的挑战,其中污染问题正在迅速变得十分严重。人工智能(AI)中的商业应用和筹资已经达到一个规模,在公众问答基准上提高了几个百分点,可以转化为数亿美元,给模型完整性带来很大压力。与此同时,跟踪LLMs所看到的数据越来越难和困难;如果不是不可能,GPT-4和Claude-3等封闭源模型不泄露任何有关培训内容的信息。因此,污染成为一个主要问题:LLMs的绩效可能不再可靠,因为其高性能至少部分可能是由于他们以前接触过数据。这一限制损害了NLP领域真正的能力改善,然而,在如何有效检测污染方面,仍然缺乏方法。在这份文件中,我们调查最近与LMs进行的污染探测工作,分析其方法,并使用案例来说明污染检测方法的适当使用情况。我们的工作呼吁NLPs研究社区对LM系统的污染进行系统评估。
Article 105
Title@2025-07-09 (3): Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications
Title: Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications | Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in zwei realen klinischen Anwendungen | 赋予保健从业者以语言模式:在两个现实世界临床应用中构建语音记录 2507.05517v2 |
Authors (15): Jean-Philippe Corbeil, Asma Ben Abacha, George Michalopoulos, Phillip Swazinna, Miguel Del-Agua, Jerome Tremblay, Akila Jeeson Daniel, Cari Bader, Yu-Cheng Cho, Pooja Krishnan, Nathan Bodenstab, Thomas Lin, Wenxuan Teng, Francois Beaulieu, Paul Vozila
Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks - structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations - remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs, and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.
GPT-4o和o1等大型语言模型(LLMs)在临床自然语言处理(NLP)任务方面表现良好,跨越多种医疗基准,然而,由于数据稀缺和敏感,尽管工业界作出了积极的努力,两项影响较大的NLP任务 – – 护士口服和从医生-病人咨询中提取医疗命令的结构性表格报告 – – 仍未得到充分探讨。这些现实世界临床任务的实际解决办法可以大大减轻医疗保健提供者的文件负担,从而能够更加关注病人的护理。在本文件中,我们利用私人和公开来源的临床数据集调查这两项具有挑战性的任务,评估开放和封闭式LMS的性能,并分析它们各自的长处和局限性。此外,我们提议建立一个代理管道,以产生现实的、不敏感的护士口服的口服,从而能够有条理地提取临床观察。为了支持这两个领域的进一步研究,我们释放SYNUR和SIMORD,这是第一个用于护士观察提取和医疗命令的开放源数据集。
Article 106
Title@2025-07-09 (3): A Language-Driven Framework for Improving Personalized Recommendations: Merging LLMs with Traditional Algorithms
Title: A Language-Driven Framework for Improving Personalized Recommendations: Merging LLMs with Traditional Algorithms | Ein sprachgetriebener Rahmen zur Verbesserung personalisierter Empfehlungen: Zusammenführung von LLMs mit traditionellen Algorithmen | 改进个性建议的语言驱动框架:将LLMs与传统等级合并 2507.07251v1 |
Authors (2): Aaron Goldstein, Ayan Dutta
Traditional recommendation algorithms are not designed to provide personalized recommendations based on user preferences provided through text, e.g., “I enjoy light-hearted comedies with a lot of humor”. Large Language Models (LLMs) have emerged as one of the most promising tools for natural language processing in recent years. This research proposes a novel framework that mimics how a close friend would recommend items based on their knowledge of an individual’s tastes. We leverage LLMs to enhance movie recommendation systems by refining traditional algorithm outputs and integrating them with language-based user preference inputs. We employ Singular Value Decomposition (SVD) or SVD++ algorithms to generate initial movie recommendations, implemented using the Surprise Python library and trained on the MovieLens-Latest-Small dataset. We compare the performance of the base algorithms with our LLM-enhanced versions using leave-one-out validation hit rates and cumulative hit rates. Additionally, to compare the performance of our framework against the current state-of-the-art recommendation systems, we use rating and ranking metrics with an item-based stratified 0.75 train, 0.25 test split. Our framework can generate preference profiles automatically based on users’ favorite movies or allow manual preference specification for more personalized results. Using an automated approach, our framework overwhelmingly surpassed SVD and SVD++ on every evaluation metric used (e.g., improvements of up to ~6x in cumulative hit rate, ~3.7x in NDCG, etc.), albeit at the cost of a slight increase in computational overhead.
传统推荐算法并非为根据用户以文本形式给出的偏好(例如“我喜欢充满幽默的轻松喜剧”)提供个性化推荐而设计。近年来,大语言模型(LLM)已成为自然语言处理领域最有前景的工具之一。本研究提出了一个新框架,模仿亲密朋友如何基于对个人口味的了解来推荐物品。我们利用LLM改进传统算法的输出,并将其与基于语言的用户偏好输入相结合,从而增强电影推荐系统。我们使用奇异值分解(SVD)或SVD++算法生成初始电影推荐,这些算法通过Surprise Python库实现,并在MovieLens-Latest-Small数据集上训练。我们使用留一法验证的命中率和累积命中率,将基础算法与经LLM增强的版本进行性能比较。此外,为了将我们的框架与当前最先进的推荐系统进行比较,我们采用评分和排序指标,并使用基于物品的分层划分(75%训练、25%测试)。我们的框架可以根据用户喜爱的电影自动生成偏好画像,也允许手动指定偏好以获得更个性化的结果。在采用自动化方式时,我们的框架在所用的每项评估指标上都大幅超越了SVD和SVD++(例如,累积命中率最高提升约6倍,NDCG提升约3.7倍等),代价是计算开销略有增加。
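The leave-one-out hit rate mentioned in the abstract above can be sketched as follows: one rated item per user is held out, the recommender produces a top-N list, and a hit is counted when the held-out item appears in that list. This is a generic illustration, not the authors' evaluation code; the function name and the dictionary layout are assumptions.

```python
def hit_rate(top_n_by_user, held_out_by_user):
    """Fraction of users whose held-out item appears in their top-N list.

    top_n_by_user:    dict mapping user id -> list of recommended item ids
    held_out_by_user: dict mapping user id -> the single held-out item id
    """
    if not held_out_by_user:
        return 0.0
    hits = sum(
        1
        for user, item in held_out_by_user.items()
        if item in top_n_by_user.get(user, [])
    )
    return hits / len(held_out_by_user)


# Hypothetical usage: two users, one hit.
example_top_n = {"u1": ["a", "b"], "u2": ["c"]}
example_held_out = {"u1": "b", "u2": "d"}
example_score = hit_rate(example_top_n, example_held_out)
```

Comparing this score for the base algorithm's lists and for the LLM-refined lists gives the kind of before-and-after comparison the abstract describes.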
Article 107
Title@2025-07-09 (3): Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings
Title: Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings | Medical Red Teaming Protocol of Language Models: Über die Bedeutung der Nutzerperspektiven in der Gesundheitsversorgung | 语言模式医学红队模式医疗红队协议:关于保健机构用户观点的重要性 2507.07248v1 |
Authors (5): Minseon Kim, Jean-Philippe Corbeil, Alessandro Sordoni, Francois Beaulieu, Paul Vozila
As the performance of large language models (LLMs) continues to advance, their adoption is expanding across a wide range of domains, including the medical field. The integration of LLMs into medical applications raises critical safety concerns, particularly due to their use by users with diverse roles, e.g., patients and clinicians, and the potential for a model’s outputs to directly affect human health. Despite the domain-specific capabilities of medical LLMs, prior safety evaluations have largely focused only on general safety benchmarks. In this paper, we introduce a safety evaluation protocol tailored to the medical domain from both patient-user and clinician-user perspectives, alongside general safety assessments, and quantitatively analyze the safety of medical LLMs. We bridge a gap in the literature by building PatientSafetyBench, which contains 466 samples across 5 critical categories, to measure safety from the perspective of the patient. We apply our red-teaming protocols on the MediPhi model collection as a case study. To our knowledge, this is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming from three different points of view (patient, clinician, and general user), establishing a foundation for safer deployment in medical domains.
随着大型语言模型(LLMS)的继续发展,其采用范围正在扩大,包括医疗领域在内的广泛领域。将LLMS纳入医疗应用提出了关键的安全问题,特别是由于具有不同作用的用户,例如病人和临床医生使用LLMS,以及模型产出有可能直接影响到人类健康。尽管医疗LLMS具有具体领域的能力,但先前的安全评估主要侧重于一般安全基准。在本文件中,我们引入了针对病人用户和临床用户的医疗领域制定的安全评估协议,同时进行一般安全评估和定量分析医疗LMS的安全。我们通过建立包含466个样本的PacentsSafetyBench,从病人的角度衡量5个关键类别的安全,弥补了文献中的空白。我们运用MediPhi模型收集的红色协议作为案例研究。据我们所知,这是为确定医疗LMS的安全评估标准而进行的第一项工作,通过有针对性地红队,从3个不同的观点 – – 病人、临床和普通用户 – – 建立在医疗领域更安全部署的基础。
Article 108
Title@2025-07-09 (3): A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive
Title: A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive | A Theory of Response Sampling in LLMs: Teil beschreibend und Teil präskriptiv | LLM中的响应抽样理论:部分描述性、部分规范性 2402.11005v4 |
Authors (4): Sarath Sivaprasad, Pramod Kaushik, Sahar Abdelnabi, Mario Fritz
Large Language Models (LLMs) are increasingly utilized in autonomous decision-making, where they sample options from vast action spaces. However, the heuristics that guide this sampling process remain underexplored. We study this sampling behavior and show that the underlying heuristics resemble those of human decision-making: they comprise a descriptive component (reflecting the statistical norm) and a prescriptive component (an implicit ideal encoded in the LLM) of a concept. We show that this deviation of a sample from the statistical norm towards a prescriptive component consistently appears in concepts across diverse real-world domains such as public health and economic trends. To further illustrate the theory, we demonstrate that concept prototypes in LLMs are affected by prescriptive norms, similar to the concept of normality in humans. Through case studies and comparison with human studies, we illustrate that in real-world applications, the shift of samples toward an ideal value in LLMs’ outputs can result in significantly biased decision-making, raising ethical concerns.
大型语言模型(LLMS)在自主决策中越来越多地被利用,它们从广泛的行动空间中抽样选择各种选择。然而,指导这一抽样过程的逻辑学仍在探讨之中。我们研究了这种抽样行为,并表明这种基本理论与人类决策相似:由描述性组成部分(反映统计规范)和一个概念的规范性组成部分(在LLM中隐含的理想编码)组成。我们表明,抽样从统计规范向规范性组成部分的这种偏向始终出现在诸如公共卫生和经济趋势等不同现实世界领域的概念中。为了进一步说明这一理论,我们证明LLMS的概念原型受到规范性规范的影响,类似于人类正常性的概念。我们通过案例研究和与人类研究的比较,表明在现实世界应用中,样品向LLMS产出的理想价值的转移可能导致严重偏差的决策,引起道德问题。
Article 109
Title@2025-07-09 (3): An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation
Title: An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation | Eine informationstheoretische Perspektive zur Multi-LLM-Unsicherheitsschätzung | 关于多LLM不确定性估算的信息理论视角 2507.07236v1 |
Authors (6): Maya Kruse, Majid Afshar, Saksham Khatwani, Anoop Mayampurath, Guanhua Chen, Yanjun Gao
Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naive ensemble baselines.
大型语言模型(LLMs)往往在各种投入中表现不一,表明不确定性,并促使需要在高占用环境中对其量化。先前的校准和不确定性量化工作往往侧重于单个模型,忽略了模型多样性的潜力。我们假设LLMs由于培训的不同和语言的西普菲语性质而作出互补预测,而将其产出合并则导致更可靠的不确定性估计。为了利用这一点,我们提议MUSE(通过子集成的多-LLMM不确定性),这是一种简单的信息理论方法,使用Jensen-Shannon differgence来识别和汇总LLMS的精准子集。二元预测任务实验显示,与单一模型和天真的共同基线相比,校准和预测性表现有所改善。
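Since the abstract above relies on Jensen-Shannon Divergence, a short sketch of that quantity for two models' predictive distributions may help. It only computes the divergence between two discrete distributions (for example, two LLMs' yes/no probabilities on the same input). How MUSE uses the divergence to select and aggregate subsets is not specified in the abstract and is not reproduced here; the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    # The mixture m is positive wherever p or q is positive, so the KL terms
    # below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With base-2 logarithms the divergence lies between 0 (identical predictions) and 1 (fully disjoint predictions), which makes it convenient for comparing how strongly different models disagree on a binary task.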
Article 110
Title@2025-07-09 (3): Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection
Title: Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection | Emoji-Angriff: Verstärkung von Jailbreak-Angriffen gegen Richter LLM-Erkennung | Emoji攻击:加强针对LLM法官的越狱袭击 2411.01077v3 |
Authors (3): Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson
Jailbreaking techniques trick Large Language Models (LLMs) into producing restricted output, posing a potential threat. One line of defense is to use another LLM as a Judge to evaluate the harmfulness of generated text. However, we reveal that these Judge LLMs are vulnerable to token segmentation bias, an issue that arises when delimiters alter the tokenization process, splitting words into smaller sub-tokens. This alters the embeddings of the entire sequence, reducing detection accuracy and allowing harmful content to be misclassified as safe. In this paper, we introduce Emoji Attack, a novel strategy that amplifies existing jailbreak prompts by exploiting token segmentation bias. Our method leverages in-context learning to systematically insert emojis into text before it is evaluated by a Judge LLM, inducing embedding distortions that significantly lower the likelihood of detecting unsafe content. Unlike traditional delimiters, emojis also introduce semantic ambiguity, making them particularly effective in this attack. Through experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack substantially reduces the unsafe prediction rate, bypassing existing safeguards.
大量语言模型(LLMS)诱使大型语言模型(LLMS)产生限制产出,构成了潜在威胁。 一种防御之道是使用另一个LLM(LLM)作为法官评估生成文本的危害性。然而,我们发现,这些LLMS法官容易受到象征性分割偏差的影响,这个问题是当划界员改变代号过程,将单词分成小块时产生的。这改变了整个序列的嵌入,降低了检测准确性,并允许将有害内容错误地分类为安全的内容。在本文中,我们引入了Emoji Action,这是一个新颖的战略,通过利用象征性分割偏差来扩大现有的越狱事件。我们的方法在文字上学习系统地在文本中插入mojis,然后由LM法官加以评估,从而导致嵌入扭曲,从而大大降低检测不安全内容的可能性。与传统的划界员不同,Demojis还引入了语义模糊性,使这些模糊性在攻击中特别有效。通过对州级LMS的实验,我们证明Emoji攻击会大大降低不安全的预测率,绕过现有的保障措施。
Article 111
Title@2025-07-09 (3): SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains
Title: SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains | SynthTextEval: Synthetische Textdatengenerierung und -bewertung für High-Stakes-Domains | 合成文本Eval:合成文本数据生成和评估,用于高占用域域 2507.07229v1 |
Authors (7): Krithika Ramesh, Daniel Smolyak, Zihao Zhao, Nupoor Gandhi, Ritu Agarwal, Margrét Bjarnadóttir, Anjalie Field
We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerous applications, such as reducing the risks of privacy violations in the development and deployment of AI systems in high-stakes domains. Realizing this potential, however, requires principled, consistent evaluations of synthetic data across multiple dimensions: its utility in downstream systems, the fairness of these systems, the risk of privacy leakage, general distributional differences from the source text, and qualitative feedback from domain experts. SynthTextEval allows users to conduct evaluations along all of these dimensions over synthetic data that they upload or generate using the toolkit’s generation module. While our toolkit can be run over any data, we highlight its functionality and effectiveness over datasets from two high-stakes domains: healthcare and law. By consolidating and standardizing evaluation metrics, we aim to improve the viability of synthetic text and, in turn, privacy preservation in AI development.
我们提出了合成文本综合评估工具包。大型语言模型(LLM)产出的流利性使得合成文本对多种应用具有潜在可行性,例如减少在开发和部署高取域的AI系统中侵犯隐私的风险。然而,实现这一潜力要求对合成数据多个层面进行有原则的一致评价:其在下游系统的实用性、这些系统的公平性、隐私渗漏风险、源文本之间的一般分布差异以及来自域专家的定性反馈。合成文本允许用户对使用工具包生成模块上传或生成的合成数据进行所有这些层面的评价。虽然我们的工具包可以运行在任何数据上,但我们强调其功能和效力,超过两个高取域的数据集:保健和法律。通过整合和标准化评价指标,我们的目标是提高合成文本的可行性,以及人工开发过程中的隐私保护。
Article 112
Title@2025-07-09 (3): EduCoder: An Open-Source Annotation System for Education Transcript Data
Title: EduCoder: An Open-Source Annotation System for Education Transcript Data | EduCoder: Ein Open-Source-Annotationssystem für Bildungstextdaten | EduCoder:教育记录数据公开源说明系统 2507.05385v2 |
Authors (7): Guanzhong Pan, Mei Tan, Hyunji Nam, Lucía Langlois, James Malamut, Liliana Deonizio, Dorottya Demszky
We introduce EduCoder, a domain-specialized tool designed to support utterance-level annotation of educational dialogue. While general-purpose text annotation tools for NLP and qualitative research abound, few address the complexities of coding education dialogue transcripts – with diverse teacher-student and peer interactions. Common challenges include defining codebooks for complex pedagogical features, supporting both open-ended and categorical coding, and contextualizing utterances with external features, such as the lesson’s purpose and the pedagogical value of the instruction. EduCoder is designed to address these challenges by providing a platform for researchers and domain experts to collaboratively define complex codebooks based on observed data. It incorporates both categorical and open-ended annotation types along with contextual materials. Additionally, it offers a side-by-side comparison of multiple annotators’ responses, allowing comparison and calibration of annotations with others to improve data reliability. The system is open-source, with a demo video available.
我们引入了EduCoder, 这是一种专门化的域名工具,旨在支持教育对话的讲解水平。虽然通用文字注解工具用于NLP和定性研究,但很少涉及教育对话笔录编码的复杂性 – – 包括各种师生和同龄人的互动。共同的挑战包括:为复杂的教学特征界定守则,支持开放式和绝对的编码,以及用外部特征,如课程的目的和教学价值等,根据具体情况发表说明。EduCoder旨在应对这些挑战,为研究人员和域专家提供一个平台,以协作界定基于观察到的数据的复杂代码手册。它包括直截了当和开放式的注解类型以及背景材料。此外,它提供了多个说明者反应的平行比较,允许与其他说明进行比较和校准,以提高数据可靠性。这个系统是开源的,可提供演示视频。
Article 113
Title@2025-07-09 (3): Derailer-Rerailer: Adaptive Verification for Efficient and Reliable Language Model Reasoning
Title: Derailer-Rerailer: Adaptive Verification for Efficient and Reliable Language Model Reasoning | Derailer-Rerailer: Adaptive Verifikation für effizientes und zuverlässiges Sprachmodell | Derailer-Rerailer:高效率和可靠语文示范理由的适应性核查 2408.13940v4 |
Authors (6): Guangya Wan, Yuqi Wu, Hao Wang, Shengming Zhao, Jie Chen, Sheng Li
Large Language Models (LLMs) have shown impressive reasoning capabilities, yet existing prompting methods face a critical trade-off: simple approaches often struggle with complex tasks and reasoning stability, while more sophisticated methods require multiple inferences and substantial computational resources, limiting their practical deployment. To address this challenge, we propose Derailer-Rerailer, a novel framework that adaptively balances reasoning accuracy and computational efficiency. At its core, our framework employs a lightweight Derailer mechanism to assess reasoning stability and selectively triggers an advanced Rerailer verification process only when necessary, thereby optimizing computational resource usage. Extensive evaluation across both open and closed-source models on more than 20 categories of mathematical, symbolic, and commonsense reasoning tasks demonstrates our framework’s effectiveness: Derailer-Rerailer achieves significant accuracy improvements (8-11% across various reasoning tasks) while maintaining 2-3 times better efficiency than existing verification methods, with particularly strong performance in mathematical and symbolic reasoning, offering a practical solution for enhancing LLM reasoning reliability while significantly reducing computational overhead.
大型语言模型(LLMs)显示了令人印象深刻的推理能力,然而,现有的推理方法却面临着一个关键的权衡:简单方法往往与复杂任务和推理稳定性挣扎不休,而更复杂的方法则需要多种推论和大量计算资源,从而限制其实际部署。为了应对这一挑战,我们提议一个适应性平衡推理准确性和计算效率的新框架Derailer-Rerailer(Derailer-Rerailer),从本质上讲,我们的框架使用一个轻量的脱轨器机制来评估推理稳定性,并且只在必要时有选择地触发先进的累动器核查进程,从而优化计算资源的利用。 对20多种数学、象征和常识推理学类别的开放和封闭源模型进行广泛的评价,显示了我们框架的有效性:脱轨器-Rerailer在各种推理任务中取得了显著的精准性改进(8-11),同时保持比现有核查方法更好的2-3倍的效率,在数学和象征性推理学方面表现特别强,为提高LLM推理的可靠性提供了切实可行的解决办法,同时大幅度降低计算间接费用。
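The gating idea in the abstract above (a cheap stability check that decides whether to pay for verification) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say how stability is measured, so agreement among sampled answers is an assumption, and `adaptive_answer`, `verify`, and `stability_threshold` are hypothetical names.

```python
from collections import Counter
from typing import Callable, List, Tuple


def adaptive_answer(
    sampled_answers: List[str],
    verify: Callable[[str], str],
    stability_threshold: float = 0.8,
) -> Tuple[str, bool]:
    """Return (final_answer, was_verified).

    Assumption: "stability" is the share of sampled answers that agree
    with the majority answer. The expensive `verify` callable (a stand-in
    for a Rerailer-style check) runs only when that share is too low.
    """
    counts = Counter(sampled_answers)
    majority, votes = counts.most_common(1)[0]
    stability = votes / len(sampled_answers)
    if stability >= stability_threshold:
        # Stable reasoning: accept the cheap answer and skip verification.
        return majority, False
    # Unstable reasoning: spend extra compute on verification.
    return verify(majority), True
```

With this design, the verifier's cost is paid only on the fraction of inputs whose samples disagree, which is where the efficiency gain described in the abstract would come from.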
Article 114
Title@2025-07-09 (3): Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses
Title: Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses | Prompt Störungen Enthüllen Mensch-ähnliche Biasen in LLM Survey Responses | LLM调查答复中的即时扰动干扰现象 2507.07188v1 |
Authors (3): Jens Rupprecht, Georg Ahnert, Markus Strohmaier
Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts – we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs’ vulnerabilities to perturbations but also reveal that all tested models exhibit a consistent recency bias varying in intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. By applying a set of perturbations, we reveal that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
大型语言模型(LLMS)在社会科学调查中越来越多地被用作人类科目的代名词,但是其可靠性和对已知反应偏差的易感性却不易理解。本文调查了LMS在规范调查背景下的反应强度 – – 我们对世界价值调查(WVS)中的问题测试了九种不同的LMS,对两个问题提法和回答选项结构都应用了一套11次的全套扰动,导致模拟访谈超过167 000次。我们这样做不仅暴露了LLMS易受扰动的脆弱性,而且揭示了所有测试的模型在强度上都表现出了一致的\textit{recenty 偏差},过分偏偏向于最后出现的回答选项。虽然更大的模型一般比较强,但所有模型对语义变化(如副作用)和合并扰动仍然敏感。我们通过应用一套扰动选项结构,发现LMSMs部分地与人类所发现的调查反应偏差相一致。这突出表明,在使用LMS生成合成调查数据时,迅速设计和稳健性测试至关重要。
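One way to quantify the recency bias described above is to shuffle the answer options on every trial and count how often the respondent picks whichever option happens to be last. The sketch below assumes a hypothetical `choose` callable standing in for an LLM call; the function name and defaults are illustrative and not taken from the paper.

```python
import random
from typing import Callable, List, Sequence


def last_option_rate(
    choose: Callable[[List[str]], str],
    options: Sequence[str],
    n_trials: int = 1000,
    seed: int = 0,
) -> float:
    """Share of trials in which `choose` returns the last-presented option.

    `choose` is a hypothetical stand-in for a model call: it receives the
    options in presentation order and returns the selected option.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        presented = list(options)
        rng.shuffle(presented)  # randomise order so content and position decouple
        if choose(presented) == presented[-1]:
            hits += 1
    return hits / n_trials
```

For k distinct options, a respondent that ignores position should land near 1/k; a rate well above that suggests the recency effect the abstract reports.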
Article 115
Title@2025-07-09 (3): Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
Title: Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs | Bepflanzt in der Vorausbildung, durch Finetuning abgeschwächt: Eine Fallstudie über die Herkunft von Kognitiv-Biasen in LLMs | 编在培训前编,《微调:关于LLM中认知性双星起源的个案研究》,《微调摇摇晃》 2507.07186v1 |
Authors (3): Itay Itzhak, Yonatan Belinkov, Gabriel Stanovsky
Large language models (LLMs) exhibit cognitive biases – systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tuning. However, it remains unclear if these differences in biases stem from pretraining, finetuning, or even random noise due to training stochasticity. We propose a two-step causal experimental approach to disentangle these factors. First, we finetune models multiple times using different random seeds to study how training randomness affects over 30 cognitive biases. Second, we introduce cross-tuning – swapping instruction datasets between models to isolate bias sources. This swap uses datasets that led to different bias patterns, directly testing whether biases are dataset-dependent. Our findings reveal that while training randomness introduces some variability, biases are mainly shaped by pretraining: models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. These insights suggest that understanding biases in finetuned models requires considering their pretraining origins beyond finetuning effects. This perspective can guide future efforts to develop principled strategies for evaluating and mitigating bias in LLMs.
大型语言模型(LLMS)表现出认知偏差 – – 系统性的非理性决策倾向,类似于人类所见的不合理决策倾向。先前的工作发现,这些偏差在各模型之间有所不同,可以通过指令调换加以扩大。然而,由于训练随机性,这些偏差的差别是否源于训练前、微调,或甚至随机噪音,目前还不清楚。我们提出了分解这些因素的两步因果实验方法。首先,我们用不同的随机种子对模型进行多次微调,以研究训练随机性如何影响超过30美元认知偏差。第二,我们引入了\emph{交叉调} – – 在模型之间互换教学数据集以孤立偏差源。这种交换利用数据集导致不同的偏差模式,直接测试偏差是否取决于数据设置。我们的调查结果显示,虽然训练随机性带来了某些差异,但偏差主要取决于训练前:与受过训练的骨架模型显示的偏差模式比共享的微调数据更加相似。这些洞察显示,在微调模型中理解偏差需要考虑其前训练起源,而不是微调效果。这一视角可以指导今后制定评价和减轻偏差的战略。
Article 116
Title@2025-07-09 (3): Multi-Attribute Steering of Language Models via Targeted Intervention
Title: Multi-Attribute Steering of Language Models via Targeted Intervention | Multi-Attribute-Lenkung von Sprachmodellen durch gezielte Intervention | 通过有针对性的干预措施,多语言模式多来源指导 2502.12446v2 |
Authors (4): Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM’s parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model’s internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).
为解决这一问题,我们引入了多归性定向指导(MAT-Steer),这是为多种属性的选择性象征性干预设计的一种新型指导框架。MAT-Steer学习了配比任务,即我们既要改进模型对不良产出的内部表述,又要使模型对理想产出的内部表述更接近于理想产出的内部表述,同时为不同属性加强矢量之间的宽度和异度,从而减少不同属性的冲突。我们用两种不同的背景评估MAT-Steer:(一) 问题回答(QA) 任务,我们把真实性、偏向性和毒性等特征与I类选择性象征性干预相平衡;(二) 基因化任务,我们同时改进属性,如精准性、正确性和一致性。
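The two regularisers named in the abstract above, sparsity and orthogonality among per-attribute steering vectors, can be illustrated with a small pure-Python sketch. This is not MAT-Steer's actual objective: the alignment term is omitted, and the L1 form, the squared-dot-product form, the weights, and the names `steer` and `steering_penalty` are all assumptions made for illustration.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def steer(hidden: Sequence[float],
          vectors: Sequence[Sequence[float]],
          strengths: Sequence[float]) -> List[float]:
    """Shift one token representation by a weighted sum of steering vectors."""
    out = list(hidden)
    for s, v in zip(strengths, vectors):
        for i, x in enumerate(v):
            out[i] += s * x
    return out


def steering_penalty(vectors: Sequence[Sequence[float]],
                     l1_weight: float = 0.1,
                     ortho_weight: float = 1.0) -> float:
    """Assumed regulariser: L1 sparsity plus pairwise squared dot products.

    The second term is zero exactly when vectors for different attributes
    are mutually orthogonal, so steering one attribute does not move the
    representation along another attribute's direction.
    """
    sparsity = sum(abs(x) for v in vectors for x in v)
    overlap = sum(
        dot(vectors[i], vectors[j]) ** 2
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    )
    return l1_weight * sparsity + ortho_weight * overlap
```

Two identical vectors are penalised more heavily than two orthogonal ones of the same size, which is the inter-attribute conflict the abstract aims to reduce.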
Article 117
Title@2025-07-09 (3): LCFO: Long Context and Long Form Output Dataset and Benchmarking
Title: LCFO: Long Context and Long Form Output Dataset and Benchmarking | LCFO: Long Context und Long Form Output Datensatz und Benchmarking | LCFO:长期和长式产出数据集和基准 2412.08268v3 |
Authors (13): Marta R. Costa-jussà, Pierre Andrews, Mariano Coria Meglioli, Joy Chen, Joe Chuang, David Dale, Christophe Ropers, Alexandre Mourachko, Eduardo Sánchez, Holger Schwenk, Tuan Tran, Arina Turkatenko, Carleigh Wood
This paper presents the Long Context and Form Output (LCFO) benchmark, a novel evaluation framework for assessing gradual summarization and summary expansion capabilities across diverse domains. LCFO consists of long input documents (5k words average length), each of which comes with three summaries of different lengths (20%, 10%, and 5% of the input text), as well as approximately 15 questions and answers (QA) related to the input content. Notably, LCFO also provides alignments between specific QA pairs and corresponding summaries in 7 domains. The primary motivation behind providing summaries of different lengths is to establish a controllable framework for generating long texts from shorter inputs, i.e. summary expansion. To establish an evaluation metric framework for summarization and summary expansion, we provide human evaluation scores for human-generated outputs, as well as results from various state-of-the-art large language models (LLMs). GPT-4o-mini achieves best human scores among automatic systems in both summarization and summary expansion tasks (~ +10% and +20%, respectively). It even surpasses human output quality in the case of short summaries (~ +7%). Overall automatic metrics achieve low correlations with human evaluation scores (~ 0.4) but moderate correlation on specific evaluation aspects such as fluency and attribution (~ 0.6).
本文件介绍了长背景和形式输出基准(LCFO),这是一个评估不同领域渐进总结和简要扩展能力的新评价框架,包括长输入文件(平均长度5千字),每个文件都附有不同长度(占输入文本的20%、10%和5%)的三份摘要,以及与输入内容有关的大约15个问答(QA),值得注意的是,LCFO还提供7个领域具体质量保证配对和相应摘要的校正。提供不同长度摘要的主要动机是建立一个控制框架,以便从较短投入(即摘要扩展)中产生长文本。为了建立总结和摘要扩展评价的衡量框架,我们为人产生的产出以及各种最先进的大语言模型(LLMS)提供人的评价分数。GPT-4-omini在总结和概要扩展任务(分别是~+10%和+20 %)中自动系统获得最佳的人的评分。在简短总结和摘要(即摘要扩展)中,甚至超过了人产出质量的可控制框架。为了建立总结和摘要扩展(即摘要扩展)的衡量基准框架,我们提供了人为产出的评分,(~+7 % 总体评价,但具有0.6度,具有0.4的链接。
Article 118
Title@2025-07-09 (3): LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
Title: LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits | LASER: Mit Multi-Armed Bandits lernen, Belohnungsmodelle adaptiv zu wählen | LASeR: 学习用多武装强盗来灵活选择奖赏模型 2410.01735v2 |
Authors (4): Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Reward Models (RMs) are crucial to aligning large language models (LLMs), but the degree to which an RM specialized to one task (e.g. writing) generalizes to new tasks (e.g. math) is often not known a priori, so training LLMs with a single fixed RM is often suboptimal. However, optimizing LLMs with multiple RMs simultaneously can incur a prohibitively high computational cost and lead to conflicting signals from different RMs that may degrade performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which frames reward model selection as a multi-armed bandit problem, efficiently and iteratively training LLMs using multiple RMs by selecting the most well-suited RM for each instance. On commonsense and math reasoning tasks, we show that LASeR boosts iterative LLM training, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over an ensemble of RM scores while also showing superior efficiency (e.g., a 2x speedup). Moreover, on WildChat (open-ended instruction-following tasks), LASeR leads to a 72.69% AlpacaEval win rate over the RM score ensemble baseline. Extending to long-context generation, LASeR improves by 2.96 F1 points (avg.) on single-document QA tasks and 2.97 F1 points on few-shot learning over the RM score ensemble baseline with best-of-n sampling.
对大型语言模型(LLMS)的调整至关重要,但是,对于大型语言模型(LLMS),一个专门用于一项任务(例如写作)的RM(LLMS)一般化为新任务(例如数学)往往不先验地为人所知,往往只使用一个固定的RM(RM)来培训LLMS亚优化。然而,同时使用多个RMM(RMS)的优化LMM(LMS)会同时产生惊人的高计算成本,并导致不同RMS(Llama-3-8B)的绝对平均精确度降低性能。为了应对这些挑战,我们引入LSER(学习适应性选择选择选择选择一个任务(例如写作)),将模型的选择作为多武装的土匪问题,利用多个RMMS(例如数学)进行高效和迭代用培训LLMM(LMM),在常见和数学推理学任务方面,我们显示LRMMMM(L3-38)的绝对平均平均准确度比分比分数为2.67%,同时显示高级效率(e-RMD-RMS-S)在2.x-S-S的排序基线任务上,在2-S-S-S-S-S-xxxxxx-S(S)比分级上,在双平级上,在最低调调调。
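Framing reward-model selection as a multi-armed bandit, as the abstract above does, means treating each RM as an arm and updating an estimate of its usefulness after every pull. The abstract does not name the bandit algorithm, so the sketch below swaps in plain UCB1 as a simple stand-in; the class name and the reward signal fed to `update` are assumptions.

```python
import math
from typing import List


class UCB1Selector:
    """UCB1 over a fixed set of arms (here: candidate reward models)."""

    def __init__(self, n_arms: int, exploration: float = 2.0) -> None:
        self.counts: List[int] = [0] * n_arms
        self.values: List[float] = [0.0] * n_arms  # running mean reward per arm
        self.exploration = exploration

    def select(self) -> int:
        # Pull every arm once before trusting the confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            self.values[a]
            + math.sqrt(self.exploration * math.log(total) / self.counts[a])
            for a in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental mean update.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In a training loop, `select()` would pick which RM scores the current batch, and `update()` would receive some measure of how much that choice helped, for example a validation gain (an assumption; the abstract does not say what signal is used).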
Article 119
Title@2025-07-09 (3): Low-Rank Adaptation Secretly Imitates Differentially Private SGD
Title: Low-Rank Adaptation Secretly Imitates Differentially Private SGD | Low-Rank-Anpassung hinterhältig imitiert unterschiedlich private SGD | 低浓度适应 秘密模仿 不同的私人 SGD 2409.17538v7 |
Authors (2): Saber Malekmohammadi, Golnoosh Farnadi
As pre-trained language models grow in size, fully fine-tuning their parameters on task adaptation data becomes increasingly impractical. To address this challenge, some methods for low-rank adaptation of language models have been proposed, e.g. LoRA, which incorporates trainable low-rank decomposition matrices into only some parameters of the pre-trained model, called adapters. This approach significantly reduces the number of trainable parameters compared to fine-tuning all parameters or adapters. In this work, we look at low-rank adaptation methods through the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA is equivalent to fine-tuning adapters with noisy batch gradients, just as the DPSGD algorithm does. We also quantify the variance of the injected noise as a decreasing function of adaptation rank. By establishing a Berry-Esseen-type bound on the total variation distance between the injected noise distribution and a Gaussian noise distribution with the same variance, we show that the dynamics of low-rank adaptation are very close to those of DPSGD performed w.r.t. the adapters. Following our theoretical findings and supported by our experimental results, we show that low-rank adaptation provides robustness to membership inference attacks w.r.t. the fine-tuning data.
随着经过培训的语文模型规模扩大,充分调整其任务适应数据参数变得日益不切实际。为了应对这一挑战,提出了一些语言模型低调适应方法,例如,LORA,它将可训练的低级分解矩阵纳入只是经过培训的模型的某些参数,称为适应器。这种方法大大降低了可训练参数的数量,而没有微调所有参数或适配器。在这项工作中,我们从数据隐私的角度来看待低级适应方法。我们从理论上表明,LORA使用的低级别适应方法相当于具有噪声批量梯度的微调适应器――正如DPSGD算法所做的那样。我们还将注入的噪声的差异量化为适应等级的递减功能。通过建立一个Berry-Es seen型,将注入的噪音分布与高调噪音分布之间的总变异差与差异联系起来。我们从数据隐私的角度看,低调的动态非常接近于在DPSGD进行调控时。根据我们的理论发现和实验性攻击的结果,我们展示了稳健的调数据。我们展示了低调数据。
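For readers unfamiliar with the baseline the abstract above compares against, a single DP-SGD update clips each per-example gradient and adds Gaussian noise to the clipped sum before averaging. The sketch below shows that standard step in pure Python; it does not reproduce the paper's LoRA analysis, and the function and parameter names are illustrative.

```python
import math
import random
from typing import List, Sequence


def dp_sgd_step(
    params: Sequence[float],
    per_example_grads: Sequence[Sequence[float]],
    lr: float,
    clip_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """One DP-SGD update: clip each gradient, sum, add noise, average, step."""
    clipped_sum = [0.0] * len(params)
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down so its L2 norm is at most clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            clipped_sum[i] += scale * g
    sigma = noise_multiplier * clip_norm
    batch_size = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in clipped_sum]
    return [p - lr * g for p, g in zip(params, noisy_mean)]
```

The "noisy batch gradient" in the abstract corresponds to `noisy_mean` here; the paper's claim is that low-rank adaptation behaves as if such noise were injected, with variance shrinking as the rank grows.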
Article 120
Title@2025-07-09 (3): UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations
Title: UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations | UniConv: Vereinheitlichung der Retrieval- und Response-Generation für große Sprachmodelle in Gesprächen | UnionConv: 对话中大语言模型统一检索和反应生成 2507.07030v1 |
Authors (12): Fengran Mo, Yifan Gao, Chuan Meng, Xin Liu, Zhuofeng Wu, Kelong Mao, Zhengyang Wang, Pei Chen, Zheng Li, Xian Li, Bing Yin, Meng Jiang
The rapid advancement of conversational search systems revolutionizes how information is accessed by enabling the multi-turn interaction between the user and the system. Existing conversational search systems are usually built with two different models. This separation restricts the system from leveraging the intrinsic knowledge of the models simultaneously, which cannot ensure the effectiveness of retrieval benefiting the generation. The existing studies for developing unified models cannot fully address the aspects of understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks while mitigating data discrepancy. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.
快速推进的对口搜索系统使用户和系统之间的多轨互动使获取信息的方式发生革命性的变化。现有的对口搜索系统通常是用两种不同的模型构建的。这种分离限制了系统同时利用模型的内在知识,无法确保检索对下一代的效益。开发统一模型的现有研究无法充分解决了解对口背景、独立管理检索和生成响应等方面的问题。在本文件中,我们探讨了如何将大型语言模型的密集检索和响应生成统一起来。我们与不同的目标联合进行微调,并设计两个机制,以减少不一致的风险,同时缩小数据差异。对五个对口搜索数据集的评估表明,我们的统一模型可以相互改进任务,并超越现有基线。
Article 121
Title@2025-07-09 (3): FlexOlmo: Open Language Models for Flexible Data Use
Title: FlexOlmo: Open Language Models for Flexible Data Use | FlexOlmo: Offene Sprachmodelle für flexible Datennutzung | FlexOlmo:灵活数据使用开放语言模型 2507.07024v1 |
Authors (23): Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min
We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, this research presents a solution for both data owners and researchers in regulated industries with sensitive or protected data. FlexOlmo enables benefiting from closed data while respecting data owners’ preferences by keeping their data local and supporting fine-grained control of data access during inference.
我们引入了FlexOlmo, 这是一种新的语言模型(LMS),它支持:(1) 分散培训而没有数据共享,不同模型参数在封闭数据集上独立培训,不同模型参数在封闭式数据集上独立培训,(2) 数据灵活推导,这些参数及其相关数据可以灵活地纳入或排除在模型推导之外,没有进一步培训。FlexOlmo采用专家混合结构,每个专家在封闭式数据集上独立接受培训,随后通过新的域信息化路由路径(无需任何联合培训)整合。FlexOlmo在FlexMix上接受了培训,该模块由7套特定域数据集和公开提供的数据集组成,代表封闭式数据集的现实近似值。我们在31项不同的下游任务中评估了高达370亿参数(200亿美元在运行中)的模型。我们表明,受过公共数据培训的一般专家可以与其他数据拥有者独立培训的专家有效结合,从而实现平均41%的相对改进,同时允许用户选择根据数据许可或许可要求退出某些数据。我们的方法也超越了先前支持使用10.%的模型,同时支持使用受培训的敏感数据整合方法,同时使用受监管的FRO数据,在平均和GIL数据解决方案中支持的系统进行数据,同时进行数据,同时使用10.11%的常规数据更新数据,同时使用该数据,使受监管者能够将数据采集数据,使该数据采集数据获取者在为平均和MER的系统用户获得标准。
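The data-flexible inference described above, where experts tied to particular data can be included or excluded without retraining, can be pictured as a router whose softmax is taken only over the experts still allowed. This is a toy sketch under that assumption, not FlexOlmo's domain-informed router; `route` and the expert names are hypothetical.

```python
import math
from typing import Dict, Set


def route(router_logits: Dict[str, float], active: Set[str]) -> Dict[str, float]:
    """Softmax over router logits, restricted to the experts in `active`.

    Excluded experts get no weight at all, so their parameters (and the
    data they were trained on) have no influence on the output.
    """
    kept = {name: logit for name, logit in router_logits.items() if name in active}
    if not kept:
        raise ValueError("at least one expert must remain active")
    peak = max(kept.values())  # subtract the max for numerical stability
    exps = {name: math.exp(logit - peak) for name, logit in kept.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Example: a data owner opts out, so the "news" expert is dropped at inference.
weights = route({"public": 1.2, "news": 0.7, "code": -0.3}, active={"public", "code"})
```

Because the remaining weights are renormalised, opting an expert out needs no further training, matching the behaviour the abstract describes.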
Article 122
Title@2025-07-09 (3): TokenShapley: Token Level Context Attribution with Shapley Value
Title: TokenShapley: Token Level Context Attribution with Shapley Value | TokenShapley: Token Level Context Attribution mit Shapley Wert | TokenShapley: 具有形状价值的 Token 级别背景属性 2507.05261v2 |
Authors (6): Yingtai Xiao, Yuqing Zhu, Sirat Samyoun, Wanrong Zhang, Jiachen T. Wang, Jian Du
Large language models (LLMs) demonstrate strong capabilities in in-context learning, but verifying the correctness of their generated responses remains a challenge. Prior work has explored attribution at the sentence level, but these methods fall short when users seek attribution for specific keywords within the response, such as numbers, years, or names. To address this limitation, we propose TokenShapley, a novel token-level attribution method that combines Shapley value-based data attribution with KNN-based retrieval techniques inspired by recent advances in KNN-augmented LLMs. By leveraging a precomputed datastore for contextual retrieval and computing Shapley values to quantify token importance, TokenShapley provides a fine-grained data attribution approach. Extensive evaluations on four benchmarks show that TokenShapley outperforms state-of-the-art baselines in token-level attribution, achieving an 11-23% improvement in accuracy.
大型语言模型(LLMs)在文字学习方面表现出很强的能力,但核实其生成响应的正确性仍是一项挑战。先前的工作在句级一级探索了归属问题,但当用户在响应中寻找特定关键词(如数字、年份或名称)的归属问题时,这些方法不尽人意。为了应对这一限制,我们提议TokenShapley,这是一种新型象征性归属方法,它将基于价值的数据归属与基于KNN的检索技术结合起来,这种方法受最近KNN提法LMs进步的启发。通过利用一个预先计算的数据存储器来进行背景检索和计算Shapley值以量化象征性重要性,TokenShapley提供了一种细微的归因数据归属方法。对四项基准的广泛评价表明,TokenShapley在象征性归属方面超越了最新的基线,从而实现了11-23%的准确性改进。
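The Shapley value underlying the attribution method above averages each item's marginal contribution over all orderings of the items. The sketch below computes it exactly for a handful of context items and a toy value function; it does not include the KNN-based datastore from the abstract, and `value` is a hypothetical stand-in for whatever score the attribution targets.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating every ordering (small inputs only)."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p given the items that came before it.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}
```

Enumeration costs n! evaluations, so it only works for tiny inputs; that cost is the usual reason attribution methods rely on precomputation or approximation.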
Article 123
Title@2025-07-09 (3): Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs
Title: Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs | Bewusst lernen, intuitiv handeln: Entsperren von Test-Time-Reasoning in multimodalen LLMs | 故意、自觉地学习:在多模式LMs中解锁试验时间理由 2507.06999v1 |
Authors (3): Yahan Yu, Yuyang Dong, Masafumi Oyamada
Reasoning is a key capability for large language models (LLMs), particularly when applied to complex tasks such as mathematical problem solving. However, multimodal reasoning research still requires further exploration of modality alignment and training costs. Many of these approaches rely on additional data annotation and relevant rule-based rewards to enhance the understanding and reasoning ability, which significantly increases training costs and limits scalability. To address these challenges, we propose the Deliberate-to-Intuitive reasoning framework (D2I) that improves the understanding and reasoning ability of multimodal LLMs (MLLMs) without extra annotations and complex rewards. Specifically, our method sets deliberate reasoning strategies to enhance modality alignment only through the rule-based format reward during training. While evaluating, the reasoning style shifts to intuitive, which removes deliberate reasoning strategies during training and implicitly reflects the model’s acquired abilities in the response. D2I outperforms baselines across both in-domain and out-of-domain benchmarks. Our findings highlight the role of format reward in fostering transferable reasoning skills in MLLMs, and inspire directions for decoupling training-time reasoning depth from test-time response flexibility.
解释是大型语言模型(LLMS)的关键能力,特别是在应用到数学问题解决等复杂任务时。然而,多式联运推理研究仍需要进一步探索模式调整和培训成本。许多这些方法依靠额外数据说明和相关基于规则的奖励来增强理解和推理能力,这大大增加了培训成本并限制了可缩放性。为应对这些挑战,我们提议了自觉到直觉的推理框架(D2I),它提高了多式联运LMS(MLMS)的理解和推理能力,而没有额外的说明和复杂的奖赏。具体地说,我们的方法设置了有意推理战略,只通过培训期间基于规则的格式奖励来增强模式的调整。在评估时,推理风格转向直觉,它消除了培训期间故意推理策略,隐含地反映了模型在应对过程中获得的能力。D2I超越了内部和外部基准的基线。我们的调查结果强调了格式奖励在培养MLLMS的可转移推理技巧和复杂奖赏方面的作用,并激励了培训时间推理深度与试验时反应灵活性的分。
Article 124
Title@2025-07-09 (3): Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming
Title: Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming | Alles planen mit Rigor: General-Purpose Zero-Shot Planung mit LLM-basiertem Formalisierten Programmierung | 与Rigor一起规划任何事:与基于LLM的正规化方案规划一起规划一般用途零热规划 2410.12112v3 |
Authors (3): Yilun Hao, Yang Zhang, Chuchu Fan
While large language models (LLMs) have recently demonstrated strong potential in solving planning problems, there is a trade-off between flexibility and complexity. LLMs, as zero-shot planners themselves, are still not capable of directly generating valid plans for complex planning problems such as multi-constraint or long-horizon tasks. On the other hand, many frameworks aiming to solve complex planning problems often rely on task-specific preparatory efforts, such as task-specific in-context examples and pre-defined critics/verifiers, which limits their cross-task generalization capability. In this paper, we tackle these challenges by observing that the core of many planning problems lies in optimization problems: searching for the optimal solution (best plan) with goals subject to constraints (preconditions and effects of decisions). With LLMs’ commonsense, reasoning, and programming capabilities, this opens up the possibilities of a universal LLM-based approach to planning problems. Inspired by this observation, we propose LLMFP, a general-purpose framework that leverages LLMs to capture key information from planning problems and formally formulate and solve them as optimization problems from scratch, with no task-specific examples needed. We apply LLMFP to 9 planning problems, ranging from multi-constraint decision making to multi-step planning problems, and demonstrate that LLMFP achieves on average 83.7% and 86.8% optimal rate across 9 tasks for GPT-4o and Claude 3.5 Sonnet, significantly outperforming the best baseline (direct planning with OpenAI o1-preview) with 37.6% and 40.7% improvements. We also validate components of LLMFP with ablation experiments and analyze the underlying success and failure reasons. Project page: https://sites.google.com/view/llmfp.
虽然大型语言模型(LLM)最近在解决规划问题方面显示出巨大潜力,但灵活性与复杂性之间存在权衡。LLM本身作为零样本规划者,仍然无法直接为多约束或长时程任务等复杂规划问题生成有效的计划。另一方面,许多旨在解决复杂规划问题的框架往往依赖针对特定任务的准备工作,例如特定任务的上下文示例和预先定义的评判器/验证器,这限制了它们的跨任务泛化能力。在本文中,我们通过如下观察来应对这些挑战:许多规划问题的核心在于优化问题,即在约束条件(决策的前提条件和效果)下搜索满足目标的最优解(最佳计划)。凭借LLM的常识、推理和编程能力,这为基于LLM的通用规划方法开辟了可能性。受这一观察的启发,我们提出LLMFP,这是一个通用框架,利用LLM从规划问题中提取关键信息,并从零开始将其形式化为优化问题加以求解,无需任何特定任务的示例。我们将LLMFP应用于9个规划问题,从多约束决策到多步骤规划问题,结果表明,LLMFP在9项任务上使用GPT-4o和Claude 3.5 Sonnet分别达到平均83.7%和86.8%的最优率,显著优于最佳基线(使用OpenAI o1-preview的直接规划),分别提高37.6%和40.7%。我们还通过消融实验验证了LLMFP的各个组成部分,并分析了其成功与失败的根本原因。项目页面:https://sites.google.com/view/llmfp。
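The observation above, that a planning problem can be recast as choosing decisions that maximise a goal subject to constraints, can be made concrete with a toy instance. The sketch below solves a small budgeted selection problem by brute force; exhaustive search stands in for whatever solver a real pipeline would call, and the problem, names, and numbers are invented for illustration.

```python
from itertools import combinations
from typing import Dict, Tuple


def best_plan(actions: Dict[str, Tuple[int, int]], budget: int) -> Tuple[Tuple[str, ...], int]:
    """Pick the subset of actions with maximum total value within a cost budget.

    `actions` maps an action name to (cost, value). Every subset is
    enumerated, which is only feasible for a handful of actions.
    """
    names = list(actions)
    best: Tuple[Tuple[str, ...], int] = ((), 0)
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            cost = sum(actions[n][0] for n in combo)
            value = sum(actions[n][1] for n in combo)
            # Constraint: stay within budget. Objective: maximise value.
            if cost <= budget and value > best[1]:
                best = (combo, value)
    return best


# Hypothetical trip-planning instance: (hours needed, enjoyment score).
plan, score = best_plan({"museum": (2, 3), "hike": (3, 4), "concert": (4, 5)}, budget=5)
```

Once the goal and constraints are written down formally like this, the search is delegated to a solver, which matches the abstract's point about solving the formalised problem rather than having the LLM produce the plan directly.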
Article 125
Title@2025-07-09 (3): FRaN-X: FRaming and Narratives-eXplorer
Title: FRaN-X: FRaming and Narratives-eXplorer | FRaN-X: FRaming und Narratives-eXplorer | FRAN-X: 制冷和叙述-爆破器 2507.06974v1 |
Authors (24): Artur Muratov, Hana Fatima Shaikh, Vanshikaa Jani, Tarek Mahmoud, Zhuohan Xie, Daniil Orel, Aaryamonvikram Singh, Yuxia Wang, Aadi Joshi, Hasan Iqbal, Ming Shan Hee, Dhruv Sahnan, Nikolaos Nikolaidis, Purificação Silvano, Dimitar Dimitrov, Roman Yangarber, Ricardo Campos, Alípio Jorge, Nuno Guimarães, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov
We present FRaN-X, a Framing and Narratives Explorer that automatically detects entity mentions and classifies their narrative roles directly from raw text. FRaN-X comprises a two-stage system that combines sequence labeling with fine-grained role classification to reveal how entities are portrayed as protagonists, antagonists, or innocents, using a unique taxonomy of 22 fine-grained roles nested under these three main categories. The system supports five languages (Bulgarian, English, Hindi, Russian, and Portuguese) and two domains (the Russia-Ukraine Conflict and Climate Change). It provides an interactive web interface for media analysts to explore and compare framing across different sources, tackling the challenge of automatically detecting and labeling how entities are framed. Our system allows end users to focus on a single article as well as analyze up to four articles simultaneously. We provide aggregate level analysis including an intuitive graph visualization that highlights the narrative a group of articles are pushing. Our system includes a search feature for users to look up entities of interest, along with a timeline view that allows analysts to track an entity’s role transitions across different contexts within the article. The FRaN-X system and the trained models are licensed under an MIT License. FRaN-X is publicly accessible at https://fran-x.streamlit.app/ and a video demonstration is available at https://youtu.be/VZVi-1B6yYk.
FRAN-X由两阶段系统组成,将顺序标签与细微区别角色分类相结合,以揭示实体如何被描绘为主角、对立派或无辜者,使用这三大类下22个细微区别角色的独特的分类法。该系统支持五种语言(保加利亚语、英语、印地语、俄语和葡萄牙语)和两个领域(俄罗斯-乌克兰冲突和气候变化)。该系统为媒体分析家提供了一个互动的网络界面,以探索和比较不同来源的设置,应对自动检测和标注实体设置方式的挑战。我们的系统允许终端用户集中关注单一文章,同时分析多达四篇文章。我们提供总体水平分析,包括直观图像图,以突出一组文章正在推动的叙述。我们的系统包括用户搜索实体搜索功能,以及允许分析家跟踪实体在不同背景下的角色过渡的时间视图。在FRAX系统下,可公开培训FRAX。
Article 126
Title@2025-07-09 (3): Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report
Title: Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report | Scaling towards the information boundary of instruction set: InfinityInstruct-Subject Technical Report | 逐步走向指示集的信息边界:无限指令性 - 实物技术报告 2507.06968v1 |
Authors (4): Li Du, Hanyu Zhao, Yiming Ju, Tengfei Pan
Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and tasks in rare domains. This is primarily due to limited expansion in both "coverage" (coverage of task types and knowledge areas) and "depth" (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework, which integrates a hierarchical labeling system, an informative seed selection algorithm, an evolutionary data synthesis process, and a model deficiency diagnosis with targeted data generation. These components form an iterative closed-loop to continuously enhance the coverage and depth of instruction data. Based on this framework, we construct InfinityInstruct-Subject, a high-quality dataset containing ~1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that InfinityInstruct-Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.
规范调整已成为释放大规模预先培训模型的能力和改善其复杂任务绩效的基础,因此,建设高质量的教学数据集对于提高模型性能和可概括性至关重要。虽然目前的教学数据集已经达到数千万个样本,但对其进行微调的模型仍可能与在稀有领域的复杂教学和任务相交。这主要是由于“覆盖”(任务类型和知识领域的覆盖)和“深度”数据集(设计复杂程度)的扩展有限。为解决这一问题,我们提议了一个系统化的教学数据构建框架,其中整合了等级标签系统、信息种子选择算法、进化数据合成流程和模型缺陷诊断,并有针对性地生成数据。这些组件形成一个迭接的闭环,以不断提高教学数据的覆盖面和深度。基于这个框架,我们构建了“不精确的教学基础”和“深度”的高质量数据集,包含~150万个指令。关于多个基础模型和基准任务的实验表明其改进教学质量标签的有效性,信息种子选择算法,进化数据合成过程,进一步分析显示我们不断扩展的系统化数据结构,从基础到可比较的深度。
Article 127
Title@2025-07-09 (3): Investigating the Robustness of Retrieval-Augmented Generation at the Query Level
Title: Investigating the Robustness of Retrieval-Augmented Generation at the Query Level | Untersuchung der Robustheit der Retrieval-Augmented Generation auf der Abfrageebene | 在查询级别调查回溯源一代的强劲程度 2507.06956v1 |
Authors (7): Sezen Perçin, Xin Su, Qutub Sha Syed, Phillip Howard, Aleksei Kuvshinov, Leo Schwinn, Kay-Ulrich Scholl
Large language models (LLMs) are very costly and inefficient to update with new information. To address this limitation, retrieval-augmented generation (RAG) has been proposed as a solution that dynamically incorporates external knowledge during inference, improving factual consistency and reducing hallucinations. Despite its promise, RAG systems face practical challenges, most notably a strong dependence on the quality of the input query for accurate retrieval. In this paper, we investigate the sensitivity of different components in the RAG pipeline to various types of query perturbations. Our analysis reveals that the performance of commonly used retrievers can degrade significantly even under minor query variations. We study each module in isolation as well as their combined effect in an end-to-end question answering setting, using both general-domain and domain-specific datasets. Additionally, we propose an evaluation framework to systematically assess the query-level robustness of RAG pipelines and offer actionable recommendations for practitioners based on the results of more than 1092 experiments we performed.
大型语言模型(LLMS)成本高昂,效率低,无法更新新的信息。为解决这一局限性,已提出将检索增强的生成(RAG)作为一种解决办法,在推论期间积极纳入外部知识,提高事实一致性,减少幻觉。尽管有希望,但RAG系统面临实际挑战,最显著的是,对输入查询质量的高度依赖,以便准确检索。在本文件中,我们调查RAG管道中不同部件对各类查询扰动的敏感性。我们的分析表明,即使在微小的查询变异下,常用检索器的性能也会大大退化。我们利用一般域和特定域数据集,在端到端问题解答设置中研究每个模块及其综合效果。此外,我们提议一个评价框架,系统评估RAG管道的查询强度,并根据我们进行的1 092多次实验的结果,向从业人员提出可行的建议。
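The query-level sensitivity described in this abstract can be illustrated with a small sketch. It is not the authors' evaluation framework: the token-overlap retriever, the `swap_typo` perturbation, and the `topk_overlap` score are all hypothetical stand-ins chosen to show how a minor query variation can change what a lexical retriever returns.

```python
def tokens(text):
    """Lowercased whitespace tokens of a string, as a set."""
    return set(text.lower().split())


def retrieve(query, docs, k=2):
    """Toy lexical retriever (an assumption, not a real RAG component):
    rank documents by shared-token count, break ties alphabetically."""
    q = tokens(query)
    ranked = sorted(docs, key=lambda d: (-len(q & tokens(d)), d))
    return ranked[:k]


def swap_typo(query, word_index=0):
    """Hypothetical perturbation: swap the first two characters of one word."""
    words = query.split()
    w = words[word_index]
    if len(w) >= 2:
        words[word_index] = w[1] + w[0] + w[2:]
    return " ".join(words)


def topk_overlap(query, perturbed, docs, k=2):
    """Fraction of the top-k results shared by the clean and perturbed query."""
    clean = set(retrieve(query, docs, k))
    noisy = set(retrieve(perturbed, docs, k))
    return len(clean & noisy) / k


docs = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "bananas are yellow",
]
query = "france capital"
perturbed = swap_typo(query)
top1_overlap = topk_overlap(query, perturbed, docs, k=1)
top2_overlap = topk_overlap(query, perturbed, docs, k=2)
```

In this toy setting a single transposed character removes the only token that distinguishes the correct document, so the top-1 result changes while the top-2 set does not. A real study would substitute actual retrievers and perturbation types.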
Article 128
Title@2025-07-09 (3): Neuron-Level Differentiation of Memorization and Generalization in Large Language Models
Title: Neuron-Level Differentiation of Memorization and Generalization in Large Language Models | Neuron-Level-Differenzierung von Erinnerung und Generalisierung in großen Sprachmodellen | 大语言模型中记忆化和普及化的中上级中值差异 2412.18497v2 |
Authors (11): Ko-Wei Huang, Yi-Fu Fu, Ching-Yu Tsai, Yu-Chieh Tu, Tzu-Ling Cheng, Cheng-Yu Lin, Yi-Ting Yang, Heng-Yi Liu, Keng-Te Liao, Da-Cheng Juan, Shou-De Lin
We investigate how Large Language Models (LLMs) distinguish between memorization and generalization at the neuron level. Through carefully designed tasks, we identify distinct neuron subsets responsible for each behavior. Experiments on both a GPT-2 model trained from scratch and a pretrained LLaMA-3.2 model fine-tuned with LoRA show consistent neuron-level specialization. We further demonstrate that inference-time interventions on these neurons can steer the model’s behavior toward memorization or generalization. To assess robustness, we evaluate intra-task and inter-task consistency, confirming that these neuron-behavior associations reflect generalizable patterns rather than dataset-specific artifacts. Our findings reveal modular structure in LLMs and enable controlling memorization and generalization behaviors at inference time.
我们调查了大语言模型(LLMs)如何区分神经层面的记忆化和概括化。 通过仔细设计的任务, 我们确定对每种行为负责的不同神经子集。 对从零到零培训的GPT-2模型和经过事先训练的LLAMA-3.2模型的实验显示,与LORA相微调的LLAMA-3.2模型具有一致的神经层面的专业性。 我们还进一步证明, 对这些神经元的推论时间干预可以引导模型走向记忆化或概括化的行为。 为了评估稳健性, 我们评估了任务内部和任务间的一致性, 证实这些神经- 任务间的联系反映了可普遍适用的模式, 而不是特定数据集的文物。 我们的发现揭示了LLMS的模块结构, 并能够控制推后时间的记忆化和概括行为 。
Article 129
Title@2025-07-09 (3): Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Title: Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models | Auf dem Weg zu einer vernünftigen Ära: Eine Umfrage über lange Kette von Gedanken, um große Sprachmodelle zu verstehen | 通向理性时代:关于为理由使用大语言模式而寻求的长链研究的调查 2503.09567v4 |
Authors (10): Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che
Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like “overthinking” and “inference-time scaling.” This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and inference-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.
最近与大语言模型(RLLMs)的推理工作取得了进展,例如OpenAI-O1和DeepSeek-R1,这显示了他们在数学和编码等复杂领域令人印象深刻的能力。其成功的一个中心因素在于应用长期思维链(Long CoT)特征,这些特征增强了推理能力,并能够解决复杂的问题。然而,尽管取得了这些进展,关于长期思维模型(LLLLLMMS)的全面调查仍然缺乏,限制了我们对它与传统的短思维链(Sort CoT)的区别的理解,并使得目前关于“超思考”和“超时缩放”等问题的辩论复杂化。这项调查试图填补这一差距,方法是对长 CoT提供统一的观点。 (1) 我们首先将Long CoT与短 CoT(Long CoT)特征区分开来,并采用新颖的分类方法来分类目前的推理模型。 (2) 其次,我们探索长 CoT的关键特征:深刻的推理、广泛探索和可行的思考,使模型能够处理更复杂的任务并产生更高效、更一致的结果。 (3) 我们然后调查了关键现象,通过长期的推理推理,从而揭示了长期的推理的推理,从而揭示了我们最终的推理,提出了这些推理的推理学,从而揭示了我们的推理学,确定了了这些推理,从而确定了了我们未来的推理学,从而确定了了这些推理的推理的推理,确定了了这些推理,从而推理和推理,确定了我们的推理。
Article 130
Title@2025-07-09 (3): What to Keep and What to Drop: Adaptive Table Filtering Framework
Title: What to Keep and What to Drop: Adaptive Table Filtering Framework | Was zu halten und was zu fallen: Adaptive Tabelle Filterung Rahmen | 保持和放下什么:适应性表格过滤框架 2506.23463v2 |
Authors (1): WonJune Jang
Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by 70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF’s ability to adaptively balance informativeness and minimalism across tasks.
基于表格的推理大语言模型(LLMs)往往因输入长度限制而与大表格挣扎。我们建议ATF(Adaptive Table 过滤框架),这是一个模块式和有问题过滤管道,使用LLM生成的列描述、集群和稀疏的对齐分来提取无信息性列和行。ATF在没有再培训的情况下与现有模型(如TAAPAS、TAPEX)无缝地结合。实验显示ATF将表格单元格减少70%,提高表细胞在表QA外任务上的绩效,同时在表格事实核查上造成稍有性能下降,而全面环境则更为关键。这些结果突出ATF在适应性地平衡不同任务的信息性和最小性方面的能力。
Article 131
Title@2025-07-09 (3): Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues
Title: Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues | LLMs für die Vorhersage von Tutorenstrategien und Studentenergebnissen in Dialogen erforschen | 探索在对话中预测教学战略和学生成果 2507.06910v1 |
Authors (3): Fareya Ikram, Alexander Scarlatos, Andrew Lan
Tutoring dialogues have gained significant attention in recent years, given the prominence of online learning and the emerging tutoring abilities of artificial intelligence (AI) agents powered by large language models (LLMs). Recent studies have shown that the strategies used by tutors can have significant effects on student outcomes, necessitating methods to predict how tutors will behave and how their actions impact students. However, few works have studied predicting tutor strategy in dialogues. Therefore, in this work we investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict both future tutor moves and student outcomes in dialogues, using two math tutoring dialogue datasets. We find that even state-of-the-art LLMs struggle to predict future tutor strategy while tutor strategy is highly indicative of student outcomes, outlining a need for more powerful methods to approach this task.
近年来,由于在线学习的重要性以及大型语言模型(LLMs)驱动的人工智能(AI)代理人的新兴辅导能力,教学对话近年来受到高度重视。 最近的研究表明,辅导员使用的战略可以对学生结果产生重大影响,因此有必要采用方法预测辅导员将如何行事及其行动如何影响学生。然而,很少有工作研究过预测对话中的辅导员战略。因此,在这项工作中,我们调查现代LLMS的能力,特别是Llama 3和GPT-4o, 以便利用两个数学辅导对话数据集预测未来辅导员的调动和学生在对话中的结果。我们发现,即使是最先进的LLMs也努力预测未来的辅导员战略,而辅导员战略则非常能说明学生的结果,同时说明需要更强有力的方法来应对这项任务。
Article 132
Title@2025-07-09 (3): MultiJustice: A Chinese Dataset for Multi-Party, Multi-Charge Legal Prediction
Title: MultiJustice: A Chinese Dataset for Multi-Party, Multi-Charge Legal Prediction | MultiJustice: Ein chinesischer Datensatz für Multi-Party, Multi-Charge Legal Prediction | 多公正:多党、多主管法律预测的中国数据集 2507.06909v1 |
Authors (7): Xiao Wang, Jiahuan Pei, Diancheng Shui, Zhiguang Han, Xin Sun, Dawei Zhu, Xiaoyu Shen
Legal judgment prediction offers a compelling method to aid legal practitioners and researchers. However, one research question remains relatively under-explored: Should multiple defendants and charges be treated separately in LJP? To address this, we introduce a new dataset, namely multi-person multi-charge prediction (MPMCP), and seek the answer by evaluating the performance of several prevailing legal large language models (LLMs) on four practical legal judgment scenarios: (S1) single defendant with a single charge, (S2) single defendant with multiple charges, (S3) multiple defendants with a single charge, and (S4) multiple defendants with multiple charges. We evaluate the dataset across two LJP tasks, i.e., charge prediction and penalty term prediction. We have conducted extensive experiments and found that the scenario involving multiple defendants and multiple charges (S4) poses the greatest challenges, followed by S2, S3, and S1. The impact varies significantly depending on the model. For example, in S4 compared to S1, InternLM2 achieves approximately 4.5% lower F1-score and 2.8% higher LogD, while Lawformer demonstrates around 19.7% lower F1-score and 19.0% higher LogD. Our dataset and code are available at https://github.com/lololo-xiao/MultiJustice-MPMCP.
法律判决预测为法律从业者和研究人员提供帮助提供了令人信服的方法。然而,研究问题仍然相对没有得到充分探讨:如果在LJP中分别处理多名被告和指控?为了解决这个问题,我们引入了一套新的数据集,即多人多收费预测(MPMCP),并寻求答案,对四种实际法律判决情景(S2)中几种通行的大型法律语言模型(LLLMs)的性能进行评估:(S1)单一被告单一指控,(S2)单一被告提出多项指控,(S3)多名被告提出多项指控,(S4)多个被告提出多项指控,(S4)多被告提出多项指控,(S4)多被告提出多项指控。我们评估了两项LJP任务中的数据集,即指控预测和刑罚术语预测。我们进行了广泛的实验,发现涉及多被告和多项指控(S4)的情景带来了最大挑战,随后是S2、S3和S1。 影响因模型的不同而有很大不同。例如,S4与S1、InterLM2相比,F1-score和2.8%以上LD显示约19.7%的F1-Pas-qual/Lasionalcodestrat。
Article 133
Title@2025-07-09 (3): MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection
Title: MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection | MIND: Ein Multi-Agenten-Framework für die Null-Schuss-Schadstoff-Meme-Erkennung | MIND: 零中弹有害气象探测多剂框架 2507.06908v1 |
Authors (5): Ziyan Liu, Chunxiao Fan, Haoran Lou, Yuexin Wu, Kaiwei Deng
The rapid expansion of memes on social media has highlighted the urgent need for effective approaches to detect harmful content. However, traditional data-driven approaches struggle to detect new memes due to their evolving nature and the lack of up-to-date annotated data. To address this issue, we propose MIND, a multi-agent framework for zero-shot harmful meme detection that does not rely on annotated data. MIND implements three key strategies: 1) We retrieve similar memes from an unannotated reference set to provide contextual information. 2) We propose a bi-directional insight derivation mechanism to extract a comprehensive understanding of similar memes. 3) We then employ a multi-agent debate mechanism to ensure robust decision-making through reasoned arbitration. Extensive experiments on three meme datasets demonstrate that our proposed framework not only outperforms existing zero-shot approaches but also shows strong generalization across different model architectures and parameter scales, providing a scalable solution for harmful meme detection. The code is available at https://github.com/destroy-lonely/MIND.
在社会媒体上迅速扩展的Memes在社交媒体上强调,迫切需要采取有效办法发现有害内容,然而,传统的由数据驱动的方法,由于新Memes性质不断变化,而且缺乏最新的附加说明的数据,难以发现新的Memes。为解决这一问题,我们提议MIND(一个不依赖附加说明数据的零射有害Memes检测多试剂框架)不依赖附加说明的数据。MIND实施三项关键战略:(1) 我们从一个未经附加说明的参考集中检索类似的Memes,以提供背景信息。(2) 我们提议一个双向洞察机制,以获得对类似Memes的全面理解。(3) 我们随后采用多媒介辩论机制,确保通过合理仲裁作出有力的决策。关于三个Memes数据集的广泛实验表明,我们提议的框架不仅超越了现有的零射法,而且还表明在不同模型结构和参数尺度上非常普遍,为有害Memes的检测提供了可测量的解决方案。该代码可在https://github.com/dstroy-lonely/MIND查阅。
Article 134
Title@2025-07-09 (3): VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation
Title: VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation | VisualTrap: Ein Stealthy Backdoor Angriff auf GUI-Agenten durch Visual Grounding Manipulation | 视觉陷阱:通过视觉定位操纵对图形用户界面代理器进行隐秘的后门攻击 2507.06899v1 |
Authors (6): Ziang Ye, Yang Zhang, Wentao Shi, Xiaoyu You, Fuli Feng, Tat-Seng Chua
Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agents, which maps textual plans to GUI elements, can introduce vulnerabilities, enabling new types of backdoor attacks. With a backdoor attack targeting visual grounding, the agent’s behavior can be compromised even when given correct task-solving plans. To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent into mapping textual plans to trigger locations instead of the intended targets. VisualTrap uses the common method of injecting poisoned data for attacks, and does so during the pre-training of visual grounding to ensure practical feasibility of attacking. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye); and the attack can be generalized to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.
由大型视觉语言模型(LVLMS)驱动的图形用户界面代理器(GUI)在大型视觉语言模型(LVLMS)的推动下,已成为一种革命性的方法,使人体机器互动自动化,能够自主操作个人装置(例如移动电话)或装置内应用,以人样的方式执行复杂的现实世界任务。然而,它们与个人装置的紧密结合引起了巨大的安全关切,许多威胁,包括后门攻击,基本上仍未被探索。这项工作表明,图形界面元素(LVLMS)的图形界面绘制文本计划的直观地面定位引入了脆弱性,使新的幕后攻击类型成为可能。随着幕后攻击以视觉地面为对象的攻击,即使有了正确的任务解决计划,该代理人的行为也可能受到损害。为了验证这种脆弱性,我们建议一种方法,通过误导代理人定位文字计划以触发地点而不是预定目标,从而劫持地面。视觉应用通常的将有毒数据注入地面攻击,在视觉地面前训练中可以进一步确保攻击的实际可行性。在视觉上经过训练的普通环境上,即使是经过训练的、经过训练的、经过训练的、经过深思研磨的、经过训练的普通的媒介的媒介研究后,这些结果可以有效地使视觉操作的地面上,这些都能够有效地进行。
Article 135
Title@2025-07-09 (3): NoLiMa: Long-Context Evaluation Beyond Literal Matching
Title: NoLiMa: Long-Context Evaluation Beyond Literal Matching | NoLiMa: Lang-Kontext-Evaluierung jenseits der sprachlichen Übereinstimmung | NoLiMa: 超越文学匹配的长文本评价 2502.05167v3 |
Authors (7): Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a “needle” (relevant information) from a “haystack” (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 13 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. Even models enhanced with reasoning capabilities or CoT prompting struggle to maintain performance in long contexts. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.
最近大型语言模型(LLMS) 支持从128K到1M等牌的长期背景。 用于评估这些能力的流行方法之一是测试“ haystack ” (NIAH) 。 测试涉及从“ haystack” 中检索“ 针头” (相关信息) (与此无关 ) 。 这种方法的扩展包括增加分流器、 事实链条和连字符推理 。 但是,在这些基准中, 模型可以利用针头和干草堆之间的现有直线匹配来简化任务。 为了解决这个问题, 我们引入了NoLima( NoIMAH), 基准是扩展NIAHAH, 配有精心设计的针头针头套, 要求从“ haystack” (Haystack) 中检索“ 针头针头针头”(相关信息) (相关信息) (相关信息) 。 我们评估了13个称支持至少128K 符号环境的流行的LMLMsms 。 虽然在短期内表现良好,但性能随着上的时间长度的增加而大幅下降。 。 例如, 11个模型将短距离基线评价低于50 短基线基线基线基线基线基线基线基线 。 。 , 几乎比 直为 直比 递增 。 。 。 直为 直为 。
Article 136
Title@2025-07-09 (3): SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN
Title: SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN | SCore: Verstärkte Corpus-basierte Relation-Extraktion mit multi-Label Kontrastivem Lernen und Bayesian kNN | SCOR:利用多标签差异学习和Bayesian kNN简化公司关系采掘 2507.06895v1 |
Authors (3): Luca Mariotti, Veronica Guidetti, Federica Mandreoli
The growing demand for efficient knowledge graph (KG) enrichment leveraging external corpora has intensified interest in relation extraction (RE), particularly under low-supervision settings. To address the need for adaptable and noise-resilient RE solutions that integrate seamlessly with pre-trained large language models (PLMs), we introduce SCoRE, a modular and cost-effective sentence-level RE system. SCoRE enables easy PLM switching, requires no finetuning, and adapts smoothly to diverse corpora and KGs. By combining supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN) classifier for multi-label classification, it delivers robust performance despite the noisy annotations of distantly supervised corpora. To improve RE evaluation, we propose two novel metrics: Correlation Structure Distance (CSD), measuring the alignment between learned relational patterns and KG structures, and Precision at R (P@R), assessing utility as a recommender system. We also release Wiki20d, a benchmark dataset replicating real-world RE conditions where only KG-derived annotations are available. Experiments on five benchmarks show that SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Further analyses reveal that increasing model complexity, as seen in prior work, degrades performance, highlighting the advantages of SCoRE’s minimal design. Combining efficiency, modularity, and scalability, SCoRE stands as an optimal choice for real-world RE applications.
利用外部公司对高效知识图(KG)的丰富利用外部公司的需求日益增加,这使人们更加关注关系提取(RE),特别是在低监督环境之下。为了满足对适应性和有噪音弹性的RE解决方案的需求,这些解决方案需要与经过事先训练的大语言模型(PLM)无缝结合,我们引入了模块化和具有成本效益的 RE 系统。SCORE使PLM转换容易,不要求微调,并顺利适应不同的公司和KGs。通过将受监督的对比学习与多标签分类的Bayesian k-Nearnighbors(KNNN)分类结合起来,它提供了强大的性能,尽管远受监督的巨型公司(PLMs)发出响亮的提示。为了改进RE,我们提出了两个新颖的衡量标准:校正结构距离(CSCS),衡量所学关系模式和KGG结构之间的匹配,以及Precion 评估作为建议系统评估效用的系统。我们还发布了Wik20d,一个基准数据集复制了现实-世界的RE条件,其中只有KG-Brest-ration State Proview Proviewal press repress repress repress repress repress repress repress repress repress repress repress press laisal press repress press lade lade lade lading lading ladings spress press ladings ladings ladings spress ladings lade lade ladings lade press press lacisal ex lade lade lade lade lade lade lade lade lade lade lade lade lad lacess lacess lacess lapress ex ex lade lade ex lade lacess lacess lade lad lad lad lad lad lad lad lad lade lade lad laess laess laess ex ex
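The abstract names Precision at R (P@R) without defining it. A minimal sketch follows, assuming it matches the standard R-precision from information retrieval (precision over the top R ranked items, where R is the number of relevant items). The paper's exact definition may differ, and `precision_at_r` is a hypothetical helper name.

```python
def precision_at_r(ranked, relevant):
    """R-precision: share of relevant items among the top R ranked items,
    where R = number of relevant items (assumed definition, see lead-in)."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0  # convention chosen here for the no-relevant-items case
    return sum(1 for item in ranked[:r] if item in relevant) / r


# Hypothetical ranked relation candidates for one entity pair.
ranked_relations = ["born_in", "lives_in", "citizen_of", "works_at"]
gold_relations = {"born_in", "citizen_of"}
score = precision_at_r(ranked_relations, gold_relations)
```

Here R is 2, and only one of the top two ranked relations is in the gold set, so the score is 0.5.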
Article 137
Title@2025-07-09 (3): Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights
Title: Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights | Entwicklung und Pflege eines Open-Source-Repositorys von KI-Bewertungen: Herausforderungen und Erkenntnisse | 开发并维持一个开放源码的AI评价储存库:挑战和洞察 2507.06893v1 |
Authors (3): Alexandra Abbas, Celia Waggoner, Justin Olive
AI evaluations have become critical tools for assessing large language model capabilities and safety. This paper presents practical insights from eight months of maintaining inspect_evals, an open-source repository of 70+ community-contributed AI evaluations. We identify key challenges in implementing and maintaining AI evaluations and develop solutions including: (1) a structured cohort management framework for scaling community contributions, (2) statistical methodologies for optimal resampling and cross-model comparison with uncertainty quantification, and (3) systematic quality control processes for reproducibility. Our analysis reveals that AI evaluation requires specialized infrastructure, statistical rigor, and community coordination beyond traditional software development practices.
AI评价已成为评估大型语言模型能力和安全性的关键工具,本文件从八个月的“Espectéevals”中提供了实际的见解,这是70+社区贡献的AI评价的公开源码库,我们查明了实施和维护AI评价方面的主要挑战,并制定了解决办法,包括:(1) 扩大社区贡献的结构化集体管理框架;(2) 与不确定性量化进行最佳重现和跨模式比较的统计方法;(3) 系统化质量控制程序,以便复制;我们的分析表明,AI评价需要超出传统软件开发做法之外的专门基础设施、统计严谨和社区协调。
Article 138
Title@2025-07-09 (3): Adaptive Elicitation of Latent Information Using Natural Language
Title: Adaptive Elicitation of Latent Information Using Natural Language | Adaptive Vermittlung von latenten Informationen mit natürlicher Sprache | 利用自然语言调适性易感信息使用自然语言 2504.04204v2 |
Authors (4): Jimmy Wang, Thomas Zollo, Richard Zemel, Hongseok Namkoong
Eliciting information to reduce uncertainty about a latent entity is a critical task in many application domains, e.g., assessing individual student learning outcomes, diagnosing underlying diseases, or learning user preferences. Though natural language is a powerful medium for this purpose, large language models (LLMs) and existing fine-tuning algorithms lack mechanisms for strategically gathering information to refine their own understanding of the latent entity. To harness the generalization power and world knowledge of LLMs in developing effective information-gathering strategies, we propose an adaptive elicitation framework that actively reduces uncertainty on the latent entity. Since probabilistic modeling of an abstract latent entity is difficult, our framework adopts a predictive view of uncertainty, using a meta-learned language model to simulate future observations and enable scalable uncertainty quantification over complex natural language. Through autoregressive forward simulation, our model quantifies how new questions reduce epistemic uncertainty, enabling the development of sophisticated information-gathering strategies to choose the most informative next queries. In experiments on the 20 questions game, dynamic opinion polling, and adaptive student assessment, our method consistently outperforms baselines in identifying critical unknowns and improving downstream predictions, illustrating the promise of strategic information gathering in natural language settings.
在许多应用领域,例如,评估个别学生学习成果、诊断潜在疾病或学习用户偏好等,都是一项关键任务。虽然自然语言是这方面的强大媒介,但大型语言模型和现有的微调算法缺乏战略收集信息的机制,无法通过战略收集信息来完善对潜在实体的理解。为了利用LLMS的概括力量和世界知识来制定有效的信息收集战略,我们提议了一个适应性引导框架,以积极减少潜在实体的不确定性。由于抽象潜在实体的概率模型很难,我们的框架采用预测性不确定性的观点,使用元学语言模型模拟未来观测,使复杂自然语言的可扩展不确定性量化成为可能。通过自动递进式前瞻性模拟,我们的模式量化了新问题如何减少隐含性的不确定性,从而能够发展复杂的信息收集战略来选择最丰富的下一个查询。在20个问题游戏的实验中,动态的民意调查,以及适应性的学生评估中,我们的方法在识别关键未知因素和改进下游语言的预测方面始终超越了基线。
Article 139
Title@2025-07-09 (3): EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning
Title: EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning | EMORL: Ensemble Multi-Objektive Verstärkungs-Lernen für effizientes und flexibles LLM-Fine-Tuning | EMORL: 将多目标强化学习组合起来,促进高效率和灵活的LLM 微调 2505.02579v3 |
Authors (5): Lingxiao Kong, Cong Yang, Susanne Neufang, Oya Deniz Beyan, Zeyd Boukhers
Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including competing objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the fine-tuning to improve efficiency and flexibility. Our method is the first to aggregate the hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text classification models to score the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption ($17,529\pm 1,650$ data points and $6,573\pm 147.43$ seconds), improved scalability and explainability, and comparable performance across multiple objectives.
在大型语言模型(LLM)微调强化学习(RLL)方面最近取得的进展显示,在处理多目标任务方面有希望,但仍面临重大挑战,包括相互竞争的目标平衡、培训效率低、可缩放性低和解释性有限。我们利用混合学习原则,引入一个组合式多目标学习(EMORL)框架,微调多种模式,同时微调具有个别目标的多重模式,同时在微调后优化组合,提高效率和灵活性。我们的方法是首先汇总单个模型的隐藏状态,将多重目标的背景资料纳入其中。这种方法得到一个分级电网搜索算法的支持,该算法确定了最佳加权组合。我们评估顾问思考生成任务的EMORL,使用文本分类模型对几代进行评分,并在RL微调期间提供奖励。我们通过对 PAIR 和 Mitch8k 数据集的全面实验,展示了EMORL 相对于现有基线的优势:培训消耗量显著降低和更加稳定(17,529 pm 1,650美元数据点和6,573\pm 147.3秒),改进了可计量性和可比较性。
Article 140
Title@2025-07-09 (3): Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework
Title: Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework | Adaptive Beendigung für multi-round Parallele Reasoning: Ein universelles Semantisches Entropie-Führungs-Framework | 多元平行理由的适应性终止:通用的语义 Entropy 指导框架 2507.06829v1 |
Authors (12): Zenan Xu, Zexuan Qiu, Guanhua Huang, Kun Li, Siheng Li, Chenchen Zhang, Kejiao Li, Qi Yi, Yuhao Jiang, Bo Zhou, Fengzong Lian, Zhanhui Kang
Recent advances in large language models (LLMs) have accelerated progress toward artificial general intelligence, with inference-time scaling emerging as a key technique. Contemporary approaches leverage either sequential reasoning (iteratively extending chains of thought) or parallel reasoning (generating multiple solutions simultaneously) to scale inference. However, both paradigms face fundamental limitations: sequential scaling typically relies on arbitrary token budgets for termination, leading to inefficiency or premature cutoff; while parallel scaling often lacks coordination among parallel branches and requires intrusive fine-tuning to perform effectively. In light of these challenges, we aim to design a flexible test-time collaborative inference framework that exploits the complementary strengths of both sequential and parallel reasoning paradigms. Towards this goal, the core challenge lies in developing an efficient and accurate intrinsic quality metric to assess model responses during collaborative inference, enabling dynamic control and early termination of the reasoning trace. To address this challenge, we introduce semantic entropy (SE), which quantifies the semantic diversity of parallel model responses and serves as a robust indicator of reasoning quality due to its strong negative correlation with accuracy…
在大型语言模型(LLMs)方面最近的进展加快了朝向人工一般智能的进展,而推论时间的缩短正在成为一种关键技术。当代方法利用顺序推理(即扩展思维链)或平行推理(同时产生多种解决办法)来扩大推理规模。但是,两种模式都面临根本性的局限性:顺序推理通常依靠任意象征性预算来终止,导致效率低下或过早断线;虽然平行推理往往缺乏平行分支之间的协调,需要渗透性微调才能有效发挥作用。鉴于这些挑战,我们的目标是设计一个灵活的试验-时间协作推理框架,利用顺序推理和平行推理范式的互补优势。为了实现这一目标,核心挑战在于制定一个高效和准确的内在质量指标,用以评估在协作推理过程中的模型反应,使动态控制和早期终止推理追踪。为了应对这一挑战,我们引入了语管酶酶(SEE),它将平行模型反应的语义多样性量化,并且由于与准确性有着强烈的负相关性,因此成为推理质量的有力指标。
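The role of semantic entropy as a termination signal can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes responses are grouped by a caller-supplied equivalence key, takes the entropy of the empirical cluster distribution, and stops when that entropy falls below a threshold. The `normalize_answer` key and the threshold value are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(responses, canonicalize):
    """Entropy (in nats) of the empirical distribution over semantic clusters.
    `canonicalize` maps a response to a cluster key; it stands in for a real
    semantic-equivalence check, which this sketch does not implement."""
    clusters = Counter(canonicalize(r) for r in responses)
    total = sum(clusters.values())
    return -sum((n / total) * math.log(n / total) for n in clusters.values())


def should_stop(responses, canonicalize, threshold=0.5):
    """Terminate when parallel responses agree closely (low entropy).
    The threshold is an illustrative assumption, not a value from the paper."""
    return semantic_entropy(responses, canonicalize) <= threshold


def normalize_answer(text):
    """Hypothetical cluster key: case-folded answer without trailing period."""
    return text.strip().lower().rstrip(".")


agreeing = ["42", "42.", " 42 "]
diverse = ["a", "b", "c", "d"]
stop_when_agreeing = should_stop(agreeing, normalize_answer)
stop_when_diverse = should_stop(diverse, normalize_answer)
```

When all parallel branches land in one cluster the entropy is zero and the round can stop. Four distinct answers give an entropy of ln 4, which signals that further reasoning rounds may be worthwhile.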
Article 141
Title@2025-07-09 (3): CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation
Title: CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation | CMQCIC-Bench: Ein chinesischer Benchmark für die Bewertung großer Sprachmodelle in der Berechnung medizinischer Qualitätskontrollindikatoren | CMQCIC-CMQCIC-Bench:中国医疗质量控制指标计算评估大语言模式基准 2502.11703v2 |
Authors (12): Guangya Yu, Yanhao Li, Zongying Jiang, Yuxiong Jin, Li Dai, Yupian Lin, Ruihui Hou, Weiyan Zhang, Yongqi Fan, Qi Ye, Jingping Liu, Tong Ruan
Medical quality control indicators are essential to assess the qualifications of healthcare institutions for medical services. With the impressive performance of large language models (LLMs) like GPT-4 in the medical field, leveraging these technologies for the Medical Quality Control Indicator Calculation (MQCIC) presents a promising approach. In this work, (1) we introduce a real-world task, MQCIC, and propose an open-source Chinese electronic medical records (EMRs)-based dataset (CMQCIC-Bench) comprising 785 instances and 76 indicators. (2) We propose a semi-automatic method to enhance the rule representation. Then we propose the Clinical Facts-based Inferential Rule (CF-IR) method that disentangles the clinical fact verification and inferential rule reasoning actions. (3) We conduct comprehensive experiments on 20 representative LLMs, covering general and medical models. Our findings reveal that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks. (4) We conduct an error analysis and investigate the capabilities of clinical fact verification and inferential rule reasoning, providing insights to further improve performance on MQCIC. The dataset and code are available at https://github.com/YuY-2001/C-MQCIC.
医疗质量控制指标(MQCIC)是评估医疗保健机构医疗服务资格的关键。医疗领域大型语言模型(LLMs)(LLMs)在GPT-4等大型语言模型(LLMs)在医疗领域的表现令人印象深刻,利用这些技术进行医疗质量控制指标计算(MQCIC)是一个很有希望的方法。在这项工作中,(1) 我们引入了一个现实世界任务MQCIC(MQCIC)并提出一个开放源码的中国电子医疗记录(CMQQCIC-Bench)数据集,由785例和76项指标组成。(2) 我们提出一个半自动方法,以加强规则代表性。然后,我们提出基于临床事实的推断规则(CF-IR)方法,将临床事实核查与推断规则推理行动分开。(3) 我们在20个具有代表性的LMMS(LMs)上进行全面实验,涵盖一般模式和医疗模式。我们的调查结果显示,CFM-IR(C)超越了MC/YC-MQAR)的数据和代码。我们进行了错误分析并调查临床事实核查和推断规则推理的能力,为MC/YC/YC-QARC/MC/MC/MSetC/MC/MC/MC/C/C/C/CSetC/C/Dset/
Article 142
Title@2025-07-09 (3): Losing our Tail – Again: On (Un)Natural Selection And Multilingual Large Language Models
Title: Losing our Tail – Again: On (Un)Natural Selection And Multilingual Large Language Models | Unseren Tail verlieren – wieder einmal: An (Un)Natürlicher Auswahl und mehrsprachigen großen Sprachmodellen | 失去我们的尾巴 – – 再一次:在(un)自然选择和多语言大语言模式 2507.03933v2 |
Authors (1): Eva Vanmassenhove
Multilingual Large Language Models (LLMs) have considerably changed how technologies can influence language. While previous technologies could mediate or assist humans, there is now a tendency to offload the task of writing itself to these technologies, enabling them to change our linguistic ecosystem more directly. While they give us quick access to information and impressively fluent output, beneath their apparent sophistication lies a subtle, more insidious threat: the gradual decline and loss of linguistic diversity. With this opinion piece, I explore how model collapse, with a particular focus on translation technology, can lead to the loss of linguistic forms, grammatical features, and cultural nuance. Model collapse refers to the eventual consequence of self-consuming training loops, where models reinforce their own biases and lose linguistic diversity. Drawing on recent work in Computer Vision, Natural Language Processing (NLP) and Machine Translation (MT), I argue that the tails of our linguistic distributions are vanishing, and with them, the narratives and identities they carry. This is a call to resist linguistic flattening and to reimagine NLP as a field that encourages, values and protects expressive multilingual lexical and linguistic diversity and creativity.
多语言大语言模型(LLMS)大大改变了技术如何影响语言。虽然以前的技术可以调解或帮助人类,但现在有一种趋势,即把写作的任务卸到这些技术身上,使他们能够更直接地改变我们的语言生态系统。虽然这些技术为我们提供了快速的获取信息和令人印象深刻流畅的产出,但是在它们显眼的精密程度之下,却存在一种微妙的、更隐蔽的威胁:语言多样性的逐渐减少和丧失。有了这一意见片,我探索了模式的崩溃,特别侧重于翻译技术,会如何导致语言形式、语法特征和文化细微的丧失。模型的崩溃指的是自费培训循环的最终后果,即模型强化了自己的偏见并丧失了语言多样性。在计算机视野、自然语言处理和机器翻译(MT)方面,我争论说,我们语言分布的尾部正在消失,随着这些尾部的描述和身份正在消失。这是一个呼吁,以抵制语言平板化和重新imagine NLP作为鼓励、价值观和保护明示的多语言词汇和语言多样性及创造性的字段。
Article 143
Title@2025-07-09 (3): Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
Title: Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts | Sicherer oder luckier? LLMs als Sicherheitsevaluatoren sind für Artefakte nicht robust | 安全性更安全还是更幸运?作为安全评估员的LLMs没有强力进行人工操作。 2503.09347v3 |
Authors (2): Hongyu Chen, Seraphina Goldfarb-Tarrant
Large Language Models (LLMs) are increasingly employed as automated evaluators to assess the safety of generated content, yet their reliability in this role remains uncertain. This study evaluates a diverse set of 11 LLM judge models across critical safety domains, examining three key aspects: self-consistency in repeated judging tasks, alignment with human judgments, and susceptibility to input artifacts such as apologetic or verbose phrasing. Our findings reveal that biases in LLM judges can significantly distort the final verdict on which content source is safer, undermining the validity of comparative evaluations. Notably, apologetic language artifacts alone can skew evaluator preferences by up to 98%. Contrary to expectations, larger models do not consistently exhibit greater robustness, while smaller models sometimes show higher resistance to specific artifacts. To mitigate LLM evaluator robustness issues, we investigate jury-based evaluations aggregating decisions from multiple models. Although this approach both improves robustness and enhances alignment to human judgments, artifact sensitivity persists even with the best jury configurations. These results highlight the urgent need for diversified, artifact-resistant methodologies to ensure reliable safety assessments.
大型语言模型(LLMS)越来越多地被用作自动化评价员,以评估生成内容的安全性,但其可靠性仍然不确定。本研究报告评估了11个LLM法官模型在关键安全领域的多样化,审查了三个关键方面:反复判断任务的自理性、与人类判断一致、容易投入诸如伪证或动词表达等文物。我们的调查结果显示,LLM法官的偏见会大大扭曲最终判断哪个内容来源更安全,损害比较评价的有效性。值得注意的是,单是pological 语言文物就足以扭曲评价人的偏好,高达98。与预期相反,较大的模型并不一贯表现出更大的强健性,而较小的模型有时对具体文物的阻力也更高。为减轻LLM评价员的稳健性问题,我们调查陪审团对多个模型所作决定的汇总评价。虽然这种方法既能提高稳健性,又能增进对人类判断的一致性,但艺术敏感性即使与最佳陪审团的配置保持。这些结果突出表明,迫切需要采用多样化的、耐手艺方法,以确保可靠的安全评估。
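The jury-based evaluation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each judge is a callable returning "A" or "B" for a pair of responses and that decisions are aggregated by simple majority vote; the abstract does not specify the aggregation rule, so the `jury_verdict` helper and its tie handling are hypothetical.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical judge interface: takes two responses, returns "A" or "B"
# to indicate which one it considers safer.
Judge = Callable[[str, str], str]


def jury_verdict(judges: List[Judge], response_a: str, response_b: str) -> str:
    """Aggregate individual judge decisions by majority vote.

    Returns "A" or "B", or "tie" when no option has a strict plurality.
    Majority voting is an assumption made for this sketch.
    """
    votes = Counter(judge(response_a, response_b) for judge in judges)
    ranked = votes.most_common()
    # A tie exists when the two most frequent verdicts have equal counts.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]
```

A single artifact-sensitive judge (for example, one that always prefers the apologetic response) is outvoted here only when the other jurors do not share the same bias, which is consistent with the abstract's observation that artifact sensitivity can persist in juries.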
Article 144
Title@2025-07-09 (3): Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
Title: Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams | Text zum Modell via SysML: Automatisierte Generierung dynamischer Systemrechnermodelle aus unstrukturiertem Naturtext über verbesserte Systemmodellierung Sprachdiagramme | 通过 SysML 自动生成动态系统计算模型,通过强化系统模拟图,从未结构化的自然语言文本生成动态系统计算模型 2507.06803v1 |
Authors (2): Matthew Anderson Hendricks, Alice Cicirello
This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy that exploits domain and expert knowledge for the automated generation of dynamical system computational models, starting from a corpus of documents relevant to the dynamical system of interest and an input document describing the specific system. This strategy is implemented in five steps and, crucially, it uses System Modeling Language (SysML) diagrams to extract accurate information about the dependencies, attributes, and operations of components. Natural Language Processing (NLP) strategies and Large Language Models (LLMs) are employed in specific tasks to improve intermediate outputs of the automated SysML diagram generation, such as: list of key nouns; list of extracted relationships; list of key phrases and key relationships; block attribute values; block relationships; and block definition diagram (BDD) generation. The applicability of automated SysML diagram generation is illustrated with different case studies. The computational models of complex dynamical systems are then obtained from the SysML diagrams via code generation and computational model generation steps. In the code generation step, NLP strategies are used for summarization, while LLMs are used for validation only. The proposed approach is not limited to a specific system, domain, or computational software. The applicability of the proposed approach is shown via an end-to-end example, from text to model, of a simple pendulum, showing improved performance compared to results yielded by LLMs only.
本文有助于加快工程动态系统的设计和部署,办法是提出一项战略,利用域和专家知识,自动生成动态系统计算模型,从与动态兴趣系统有关的文件库和描述具体系统的输入文件库开始,从而自动生成动态系统计算模型。这一战略分五个步骤实施,关键是使用系统模拟语言图(SysML),以获取关于各组成部分依赖性、属性和操作的准确信息。自然语言处理(NLP)战略和大语言模型(LLMS)用于具体任务,以改进SYSML图表自动生成的中间产出,例如:关键名词列表;提取关系列表;关键短语和关键关系列表;区属性值;块关系;以及BDDD图生成。自动生成SysML图(SML图)的实用性用不同的案例研究来说明。随后通过代码生成和计算模型生成步骤获得复杂的动态系统的计算模型。在代码生成步骤中,使用NLP战略来进行合成,而LMS-LMS战略则用于合成,而LMS-LMS则用于通过特定的域图进行特定的校验。建议,仅显示通过特定的域到软件的校验。
Article 145
Title@2025-07-09 (3): Test-Time Scaling with Reflective Generative Model
Title: Test-Time Scaling with Reflective Generative Model | Test-Zeit-Skalierung mit reflektierendem Generativem Modell | 具有反反思考生成模型的试验时间缩放 2507.01951v2 |
Authors (11): Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
We introduce our first reflective generative model, MetaStone-S1, which attains performance comparable to OpenAI o3-mini via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
我们引入了我们的第一个反光基因模型MetaStone-S1,该模型通过新的反射引因形式获得了OpenAI o3-mini的性能。新的形式侧重于高质量的推理轨迹选择,并包含两个新颖之处:1) 政策和程序奖励模式的统一界面:我们共享主干网,并使用特定任务头分别用于推理轨迹预测和评分,只引入了5300米的弹道评分额外参数。2) 不再依赖进程一级的批注:我们提供一种自我监督的进程奖赏模式,它能够直接从结果奖赏中学习高质量的推理轨迹选择。用反光化的基因化形式安装,MetaStone-S1自然适合测试时间缩放,我们根据可控思维长度提供三种推理努力模式(低、中、高)。实验表明我们的MetaStone-S1的性能与O3-mini系列相比,只有32B参数大小。为了支持研究界,我们在 https://giust/Setata1.Seta1.
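As a rough illustration of trajectory selection for test-time scaling, the sketch below scores several candidate reasoning trajectories and keeps the best one. It is not the MetaStone-S1 implementation: the `score_steps` callable stands in for a process reward model, and aggregating per-step scores by geometric mean is an assumption made here for the example.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Geometric mean of per-step scores in (0, 1].

    The geometric mean penalises a single weak step more than an
    arithmetic mean would; this choice is an assumption of the sketch.
    """
    if not step_scores:
        return 0.0
    log_sum = sum(math.log(max(s, 1e-12)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_best(trajectories: List[T],
                score_steps: Callable[[T], Sequence[float]]) -> T:
    """Return the candidate trajectory with the highest aggregated score.

    `score_steps` is a hypothetical stand-in for a process reward model
    that returns one score per reasoning step.
    """
    return max(trajectories, key=lambda t: trajectory_score(score_steps(t)))
```

Under this framing, a higher reasoning effort mode would correspond to sampling more or longer candidate trajectories before calling `select_best`.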
Article 146
Title@2025-07-09 (3): GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods
Title: GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods | GuidedBench: Messung und Milderung der Evaluationsdiskrepanzen von In-the-Wild LLM Jailbreak Methoden | 指导:衡量和减轻LLM监狱破碎方法在评价方面的差异 2502.16903v2 |
Authors (5): Ruixuan Huang, Xunguang Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
Despite the growing interest in jailbreak methods as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in their effectiveness assessments. We conduct a systematic measurement study based on 37 jailbreak studies since 2022, focusing on both the methods and the evaluation systems they employ. We find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about their effectiveness and safety implications. This paper advocates a shift to a more nuanced, case-by-case evaluation paradigm. We introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset, detailed case-by-case evaluation guidelines, and GuidedEval, an evaluation system integrated with these guidelines. Experiments demonstrate that GuidedBench offers more accurate measurements of jailbreak performance, enabling meaningful comparisons across methods and uncovering new insights overlooked in previous evaluations. GuidedEval reduces inter-evaluator variance by at least 76.03%. Furthermore, we observe that incorporating guidelines can enhance the effectiveness of jailbreak methods themselves, offering new insights into both attack strategies and evaluation paradigms.
尽管人们日益关注破狱方法,认为它是建设安全和负责任的大型语言模型的有效红队工具,但评价系统设计有缺陷,导致其成效评估存在重大差异。我们根据2022年以来37项破狱研究,进行了系统性的衡量研究,重点是其使用的方法和评价系统。我们发现,现有评价系统缺乏具体案件的标准,导致对其有效性和安全影响的错误结论。本文件主张转向更细致、逐案评估模式。我们引入了Book Bench,这是一个新颖的基准,包括一个成熟的有害问题数据集、详细的个案评估准则和与这些指南相结合的评价系统 – – 指导Eval。实验表明,Look Bench提供了更准确的破狱业绩衡量方法,使得能够对方法进行有意义的比较,并发现以往评价中忽略的新见解。指导Eval至少减少了76.03。此外,我们认为,纳入准则可以提高破狱方法本身的有效性,为攻击战略和评价范例提供新的见解。
Article 147
Title@2025-07-09 (3): Checklist Engineering Empowers Multilingual LLM Judges
Title: Checklist Engineering Empowers Multilingual LLM Judges | Checkliste Engineering Empowers Mehrsprachige LLM-Richter | 多语种LLM法官 2507.06774v1 |
Authors (2): Mohammad Ghiasvand Mohammadkhani, Hamid Beigy
Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators, a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, and efficiency. In this paper, we propose Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free framework that uses checklist intuition for multilingual evaluation with an open-source model. Experiments across multiple languages and three benchmark datasets, under both pointwise and pairwise settings, show that our method generally surpasses the baselines and performs on par with the GPT-4o model.
长期以来,自动文本评价一直是自然语言处理(NLP)的一个中心问题。最近,外地已转向使用大语言模型(LLMs)作为评价员,这是一种被称为LLM-as-a-judge模式的趋势。这种方法虽然很有希望,而且容易在各种任务之间进行调整,但在多语种背景下探索有限。现有的多语种研究往往依赖专有模型,或需要广泛的培训数据来进行微调,引起对成本、时间和效率的担忧。在本文件中,我们提议以LLM-as-a-judge(CE-Judge)为基础的核对清单工程(LLM-as-a-judge (CE-Judge)),这是一个无培训框架,以开放源模式为多语种评价使用清单直觉。在点设置和对称设置下,多种语言和三个基准数据集的实验显示,我们的方法一般超过基线,并符合GPT-4模式。
Article 148
Title@2025-07-09 (3): KAConvText: Novel Approach to Burmese Sentence Classification using Kolmogorov-Arnold Convolution
Title: KAConvText: Novel Approach to Burmese Sentence Classification using Kolmogorov-Arnold Convolution | KAConvText: Neuer Ansatz zur burmesischen Satzklassifikation mit Kolmogorov-Arnold-Konvolution | KAConvText:利用科尔莫戈罗夫-阿尔纳尔革命对缅甸判刑分类采用新颖的方法 2507.06753v1 |
Authors (4): Ye Kyaw Thu, Thura Aung, Thazin Myint Oo, Thepchai Supnithi
This paper presents the first application of Kolmogorov-Arnold Convolution for Text (KAConvText) in sentence classification, addressing three tasks: imbalanced binary hate speech detection, balanced multiclass news classification, and imbalanced multiclass ethnic language identification. We investigate various embedding configurations, comparing random to fastText embeddings in both static and fine-tuned settings, with embedding dimensions of 100 and 300 using CBOW and Skip-gram models. Baselines include standard CNNs and CNNs augmented with a Kolmogorov-Arnold Network (CNN-KAN). In addition, we investigate KAConvText with two classification heads, MLP and KAN, where the KAN head supports enhanced interpretability. Results show that KAConvText-MLP with fine-tuned fastText embeddings achieves the best performance of 91.23% accuracy (F1-score = 0.9109) for hate speech detection, 92.66% accuracy (F1-score = 0.9267) for news classification, and 99.82% accuracy (F1-score = 0.9982) for language identification.
本文介绍Kolmogorov-Arnold Convolution for Text(KAConvText)在刑罚分类中的首次应用,涉及三项任务:不平衡的二进制仇恨言论检测、平衡的多级新闻分类和不平衡的多级族裔语言识别。我们调查了各种嵌入配置,将静态和微调环境中的随机和快式文本嵌入比,同时使用CBOW和GVG-gram模型嵌入范围为100和300。基线包括标准CNN和有Kolmogorov-Arnold网络增强的CNN(CNN-KAN)的CNN和CNN。此外,我们还与不同的分类负责人—-MLP和KAN进行了KA Convortext调查,使用KAN头支持增强可解释性。结果显示,KAConvText-MLP与微调快式缩写嵌入的KACAConvText-MLP在识别仇恨言论方面达到91.23%准确度的最佳性(F1-核心=0.9109),92.66%的精确度(F1-核心=0.9267)用于新闻分类,99.82%的精确度(F1-核心=0.982)。
Article 149
Title@2025-07-09 (3): Robust Multimodal Large Language Models Against Modality Conflict
Title: Robust Multimodal Large Language Models Against Modality Conflict | Robuste multimodale große Sprachmodelle gegen Modalitätskonflikt | 反对方式冲突强有力的多式多式大语言模型 2507.07151v1 |
Authors (4): Zongmeng Zhang, Wengang Zhou, Jie Zhao, Houqiang Li
Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.
尽管多式大型语言模型(MLLM)在视觉语言任务方面的能力令人印象深刻,但它们在现实世界情景中容易产生幻觉。本文件从模式冲突的角度对MLLMM的幻觉现象进行调查。与侧重于模型反应与投入之间冲突的现有工作不同,我们研究不同模式投入的内在冲突,这些模式将MLLMs置于困境之中,直接导致幻觉。我们正式界定模式冲突,并建立一个名为多模式模式冲突(MMMC)的数据集,以模拟视觉任务中的这种现象。提出了三种基于即时工程、监督的微调和强化学习的方法,以缓解模式冲突造成的幻觉。对MLLMMS数据集进行了广泛的实验,以分析这些方法的优点和缺点。我们的成果表明,强化学习方法在减轻模式冲突下的幻觉方面取得最佳表现,而受监督的微调方法则显示了有希望和稳定的表现。我们的工作揭示了导致幻觉和对MLLMS的坚固性。我们的工作揭示了未见的模式冲突。
Article 150
Title@2025-07-09 (3): Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons
Title: Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons | Knockout LLM Assessment: Verwendung großer Sprachmodelle für Bewertungen durch iterative Pairwise-Vergleiche | LLLM 评估:利用大语言模式,通过迭接对等比较进行评估 2506.03785v3 |
Authors (3): Isik Baran Sandan, Tu Anh Dinh, Jan Niehues
Large Language Models (LLMs) have been shown to be effective evaluators across various domains such as machine translation or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that knockout assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.
大型语言模型(LLMs)在机器翻译或科学领域等不同领域证明是有效的评价者,目前的LLM-as-a-judge方法主要依靠个别评估或单轮双向评估,使法官LLM无法从全球排名的角度出发。为解决这一问题,我们介绍了LLM-asa法官使用击倒比赛系统、迭接对比的敲倒评估方法。在两个数据集上对三个LMs的实验显示,击倒评估提高了评分的准确性,使Pearson与专家评价的相互关系平均增加0.07分,用于大学一级的考试评分和机器翻译评价,使LMM评估与人类评分更加接近。
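The knockout tournament idea can be sketched as a single-elimination bracket over candidate outputs. This is a simplified illustration, not the authors' procedure: the `judge` callable stands in for a pairwise LLM comparison, candidates are assumed to be unique and hashable, and using the number of rounds won as the final score is an assumption of this sketch.

```python
import random
from typing import Callable, Dict, Hashable, List

# Hypothetical pairwise judge: returns whichever of the two candidates wins.
PairwiseJudge = Callable[[Hashable, Hashable], Hashable]


def knockout_scores(candidates: List[Hashable],
                    judge: PairwiseJudge,
                    seed: int = 0) -> Dict[Hashable, int]:
    """Run a single-elimination tournament and count wins per candidate."""
    rng = random.Random(seed)
    alive = list(candidates)
    rng.shuffle(alive)  # random initial bracket
    wins = {c: 0 for c in candidates}
    while len(alive) > 1:
        next_round = []
        # With an odd number of survivors, the last one gets a bye.
        if len(alive) % 2 == 1:
            next_round.append(alive.pop())
        # Pair up the remaining candidates and keep each winner.
        for a, b in zip(alive[0::2], alive[1::2]):
            winner = judge(a, b)
            wins[winner] += 1
            next_round.append(winner)
        alive = next_round
    return wins
```

With n candidates the bracket needs n - 1 pairwise comparisons in total, and the eventual winner has been compared against progressively stronger opponents, which is one way a judge can build up a more global view than a single round of pairwise assessments provides.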
Article 151
Title@2025-07-09 (3): LLM-based User Profile Management for Recommender System
Title: LLM-based User Profile Management for Recommender System | LLM-basierte Benutzerprofilverwaltung für Recommender System | 以LLM为基础的建议系统用户概况管理 2502.14541v2 |
Authors (2): Seunghwan Bang, Hwanjun Song
The rapid advancement of Large Language Models (LLMs) has opened new opportunities in recommender systems by enabling zero-shot recommendation without conventional training. Despite their potential, most existing works rely solely on users’ purchase histories, leaving significant room for improvement by incorporating user-generated textual data, such as reviews and product descriptions. Addressing this gap, we propose PURE, a novel LLM-based recommendation framework that builds and maintains evolving user profiles by systematically extracting and summarizing key information from user reviews. PURE consists of three core components: a Review Extractor for identifying user preferences and key product features, a Profile Updater for refining and updating user profiles, and a Recommender for generating personalized recommendations using the most current profile. To evaluate PURE, we introduce a continuous sequential recommendation task that reflects real-world scenarios by adding reviews over time and updating predictions incrementally. Our experimental results on Amazon datasets demonstrate that PURE outperforms existing LLM-based methods, effectively leveraging long-term user information while managing token limitations.
大语言模型(LLMS)的快速进步为推荐者系统开辟了新的机会,使得在没有常规培训的情况下能够提出零点建议。尽管存在潜力,大多数现有工程都完全依赖用户的采购历史,但通过纳入用户生成的文字数据(例如审查和产品描述),留下很大的改进空间。解决这一差距,我们提议PURE,一个基于LLM的新型建议框架,通过系统地从用户审查中提取和总结关键信息,建立和维持不断演变的用户概况。PURE由三个核心部分组成:用于确定用户偏好和关键产品特征的审查提取器、用于改进和更新用户概况的概况更新器以及用于利用最新配置生成个性化建议的推荐器。为了评估PURE,我们引入了一个连续连续不断的建议任务,通过在时间上增加审查并逐步更新预测来反映现实世界情景。我们在亚马孙数据集上的实验结果表明,PURE在管理象征性限制的同时,有效地利用长期用户信息,超越了基于LM系统的现有方法。
Article 152
Title@2025-07-09 (3): Civil Society in the Loop: Feedback-Driven Adaptation of (L)LM-Assisted Classification in an Open-Source Telegram Monitoring Tool
Title: Civil Society in the Loop: Feedback-Driven Adaptation of (L)LM-Assisted Classification in an Open-Source Telegram Monitoring Tool | Zivilgesellschaft im Kreislauf: Feedback-gestützte Anpassung der (L)LM-Assistenten-Klassifikation in einem Open-Source-Telegramm-Monitoring-Tool | ” 圈圈中的民间社会:在开放源电讯监测工具中对(L)LM协助的分类进行反馈和改编 “ 。 2507.06734v1 |
Authors (5): Milena Pustet, Elisabeth Steffen, Helena Mihaljević, Grischa Stanjek, Yannis Illies
The role of civil society organizations (CSOs) in monitoring harmful online content is increasingly crucial, especially as platform providers reduce their investment in content moderation. AI tools can assist in detecting and monitoring harmful content at scale. However, few open-source tools offer seamless integration of AI models and social media monitoring infrastructures. Given their thematic expertise and contextual understanding of harmful content, CSOs should be active partners in co-developing technological tools, providing feedback, helping to improve models, and ensuring alignment with stakeholder needs and values, rather than as passive ‘consumers’. However, collaborations between the open source community, academia, and civil society remain rare, and research on harmful content seldom translates into practical tools usable by civil society actors. This work in progress explores how CSOs can be meaningfully involved in an AI-assisted open-source monitoring tool of anti-democratic movements on Telegram, which we are currently developing in collaboration with CSO stakeholders.
民间社会组织在监测有害在线内容方面的作用越来越重要,特别是因为平台提供者减少了对内容调适的投资,AI工具可以协助检测和监测规模的有害内容,然而,开放源码工具很少能够无缝地整合AI模式和社会媒体监测基础设施,鉴于民间社会组织的专题专门知识和对有害内容的背景了解,它们应当成为合作开发技术工具、提供反馈、帮助改进模式和确保符合利益攸关方需求和价值观的积极伙伴,而不是被动的“消费者”。然而,开放源码社区、学术界和民间社会之间的协作仍然很少,关于有害内容的研究很少转化为民间社会行为者可利用的实用工具。这一进展中的工作探讨了民间社会组织如何能够有意义地参与由它们协助的Telegram反民主运动公开源监测工具,我们目前正在与民间社会组织利益攸关方合作开发这一工具。
Article 153
Title@2025-07-09 (3): On the Effect of Uncertainty on Layer-wise Inference Dynamics
Title: On the Effect of Uncertainty on Layer-wise Inference Dynamics | Über den Effekt der Unsicherheit auf die schichtweise Schlussfolgerungsdynamik | 不确定性对图层推论作用的影响 2507.06722v1 |
Authors (3): Sunwoo Kim, Haneul Yoo, Alice Oh
Understanding how large language models (LLMs) internally represent and process their predictions is central to detecting uncertainty and preventing hallucinations. While several studies have shown that models encode uncertainty in their hidden states, it is underexplored how this affects the way they process such hidden states. In this work, we demonstrate that the dynamics of output token probabilities across layers for certain and uncertain outputs are largely aligned, revealing that uncertainty does not seem to affect inference dynamics. Specifically, we use the Tuned Lens, a variant of the Logit Lens, to analyze the layer-wise probability trajectories of final prediction tokens across 11 datasets and 5 models. Treating incorrect predictions as those with higher epistemic uncertainty, our results show aligned trajectories for certain and uncertain predictions, both of which exhibit abrupt increases in confidence at similar layers. We balance this finding by showing evidence that more competent models may learn to process uncertainty differently. Our findings challenge the feasibility of leveraging simplistic methods for detecting uncertainty at inference. More broadly, our work demonstrates how interpretability methods may be used to investigate the way uncertainty affects inference.
了解大型语言模型(LLMs)内部代表并处理其预测对于发现不确定性和防止幻觉至关重要。虽然一些研究显示,模型将不确定性嵌入其隐蔽状态,但却未充分探讨这如何影响这些隐蔽状态的处理方式。在这项工作中,我们证明,某些和不确定产出的跨层输出象征概率的动态大体一致,显示不确定性似乎不会影响推断动态。具体地说,我们利用Tuned Lens(Logit Lens的变种),分析11个数据集和5个模型最后预测符号的层性概率轨迹。我们的工作用不正确的预测作为具有较高认知不确定性的预测,显示某些和不确定预测的轨迹是一致的,两者都观察到类似层的突然信任度增加。我们通过证明更有能力的模型可能学会以不同的方式处理不确定性来平衡这一发现的结果。我们的调查结果对利用简单方法探测不确定性的可行性提出了挑战。更广泛地说,我们的工作表明,如何使用解释方法来调查不确定性如何影响推断。
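A layer-wise probability trajectory of the kind analyzed above can be illustrated with a plain Logit Lens sketch: each layer's hidden state is projected through the unembedding matrix, and the probability assigned to the final predicted token is tracked across layers. The Tuned Lens additionally applies a learned per-layer translator before unembedding; that step is omitted here, and the toy vectors and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project(hidden: Sequence[float],
            unembed: Sequence[Sequence[float]]) -> List[float]:
    """Map a hidden state to vocabulary logits (one unembedding row per token)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]


def probability_trajectory(hidden_by_layer: Sequence[Sequence[float]],
                           unembed: Sequence[Sequence[float]]
                           ) -> Tuple[int, List[float]]:
    """Return the final predicted token id and its probability at every layer."""
    final_probs = softmax(project(hidden_by_layer[-1], unembed))
    token = max(range(len(final_probs)), key=final_probs.__getitem__)
    trajectory = [softmax(project(h, unembed))[token] for h in hidden_by_layer]
    return token, trajectory
```

Comparing such trajectories between correct and incorrect predictions is the kind of analysis the abstract describes; an abrupt jump between adjacent layers shows up as a large difference between consecutive entries of the returned list.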
Article 154
Title@2025-07-09 (3): CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs
Title: CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs | CLI-RAG: Ein Retrieval-Augmented Framework für klinisch strukturierte und kontextorientierte Textgenerierung mit LLMs | CLI-RAG: 利用LLMs实现临床结构化和有背景意识的文本生成的检索增强框架 2507.06715v1 |
Authors (2): Garapati Keerthana, Manik Gupta
Large language models (LLMs), including zero-shot and few-shot paradigms, have shown promising capabilities in clinical text generation. However, real-world applications face two key challenges: (1) patient data is highly unstructured, heterogeneous, and scattered across multiple note types and (2) clinical notes are often long and semantically dense, making naive prompting infeasible due to context length constraints and the risk of omitting clinically relevant information. We introduce CLI-RAG (Clinically Informed Retrieval-Augmented Generation), a domain-specific framework for structured and clinically grounded text generation using LLMs. It incorporates a novel hierarchical chunking strategy that respects clinical document structure and introduces a task-specific dual-stage retrieval mechanism. The global stage identifies relevant note types using evidence-based queries, while the local stage extracts high-value content within those notes creating relevance at both document and section levels. We apply the system to generate structured progress notes for individual hospital visits using 15 clinical note types from the MIMIC-III dataset. Experiments show that it preserves temporal and semantic alignment across visits, achieving an average alignment score of 87.7%, surpassing the 80.7% baseline from real clinician-authored notes. The generated outputs also demonstrate high consistency across LLMs, reinforcing deterministic behavior essential for reproducibility, reliability, and clinical trust.
大型语言模型(LLMS),包括零发和少发模式,在临床文本生成方面显示出很有希望的能力。然而,现实世界应用面临着两大挑战:(1) 患者数据高度结构化、多样化,分散于多个备注类型,(2) 临床说明往往很长、语义密集,由于上下文长度的限制和忽略与临床有关的信息的风险,因此这些说明中高价值的内容变得天真不可行。我们采用CLI-RAG(Clinnicnational IntrolnectiveRegive-Auged Conference),这是一个用于使用LMS进行结构化和临床基础的文本生成的域特定框架。它包含一种尊重临床文件结构的新型分级组合战略,并引入了针对具体任务的双阶段检索机制。全球阶段利用基于证据的查询确定了相关说明类型,而使这些说明中高价值的内容在文档和科级两级都具有相关性。我们采用该系统,利用MIMIC-III数据集的15个临床注释类型,为个别医院访问产生结构化的进度说明。实验显示,它保持了跨访问的时间和语义性协调,实现了877%的临床基本的校准一致性,也展示了87.7%的可靠度。
Article 155
Title@2025-07-09 (3): Do Larger Language Models Imply Better Generalization? A Pretraining Scaling Law for Implicit Reasoning
Title: Do Larger Language Models Imply Better Generalization? A Pretraining Scaling Law for Implicit Reasoning | Werden größere Sprachmodelle besser verallgemeinert? Ein Vorschulungs-Skalierungsgesetz für Implizite Vernunft | 大语言模式是否充分更加普遍化? 2504.03635v2 |
Authors (6): Xinyi Wang, Shawn Tan, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks requiring complex reasoning. However, the effects of scaling on their reasoning abilities remain insufficiently understood. In this paper, we introduce a synthetic multi-hop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To evaluate this, we pretrain language models (LMs) from scratch solely on triples from the incomplete graph and assess their ability to infer the missing edges. Interestingly, we observe that overparameterization can impair reasoning performance due to excessive memorization. We investigate different factors that affect this U-shaped loss curve, including graph structure, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling that linearly maps the knowledge graph search entropy to the optimal model size. This work provides new insights into the relationship between scaling and reasoning in LLMs, shedding light on possible ways to optimize their performance for reasoning tasks.
大型语言模型(LLMS)在一系列需要复杂推理的广泛任务中表现出了非凡的能力,然而,对其推理能力的影响仍然没有充分理解。在本文中,我们引入了合成多光推理环境,旨在密切复制真实世界大规模知识图的结构和分布。我们的推理任务涉及完成图中缺失的边缘,这需要先进的多光推理和模拟真实世界推理假设。为了评估这一点,我们仅从不完整的图表的三部开始将语言模型从零开始,并评估其推算缺失边缘的能力。有趣的是,我们观察到过度的分解可能损害推理性能,因为过度沉积。我们调查了影响这种U形损失曲线的不同因素,包括图形结构、模型大小和培训步骤。为了预测特定知识图的最佳模型大小,我们发现一个实验尺度,线性地绘制了知识图形搜索模型的轨迹到最佳模型大小。这项工作为LMS的缩放和推理提供了新的洞察力关系提供了新的洞察力,让我们了解如何优化其推理工作。
Article 156
Title@2025-07-09 (3): A Survey on Prompt Tuning
Title: A Survey on Prompt Tuning | Eine Umfrage über das schnelle Tuning | 快速提款调查 2507.06085v2 |
Authors (3): Zongqian Li, Yixuan Su, Nigel Collier
This survey reviews prompt tuning, a parameter-efficient approach for adapting language models by prepending trainable continuous vectors while keeping the model frozen. We classify existing approaches into two categories: direct prompt learning and transfer learning. Direct prompt learning methods include: general optimization approaches, encoder-based methods, decomposition strategies, and mixture-of-experts frameworks. Transfer learning methods consist of: general transfer approaches, encoder-based methods, and decomposition strategies. For each method, we analyze method designs, innovations, insights, advantages, and disadvantages, with illustrative visualizations comparing different frameworks. We identify challenges in computational efficiency and training stability, and discuss future directions in improving training robustness and broadening application scope.
这项调查审查迅速调试,这是通过预先等待的可训练的连续矢量调整语言模式的参数效率办法,同时保持该模式的冻结。我们将现有办法分为两类:直接的迅速学习和转让学习。直接的迅速学习方法包括:一般优化办法、以编码器为基础的方法、分解战略和专家混合框架。转让学习方法包括:一般转让办法、以编码器为基础的方法和分解战略。对于每一种方法,我们分析方法的设计、创新、洞察力、优缺点,并用说明性可视化来比较不同的框架。我们找出计算效率和培训稳定性方面的挑战,并讨论提高培训可靠性和扩大应用范围的未来方向。
Article 157
Title@2025-07-09 (3): Automating IRAC Analysis in Malaysian Contract Law using a Semi-Structured Knowledge Base
Title: Automating IRAC Analysis in Malaysian Contract Law using a Semi-Structured Knowledge Base | Automatisierung der IRAC-Analyse im malaysischen Vertragsrecht mit Hilfe einer halbstrukturierten Wissensbasis | 利用半结构化知识库对马来西亚合同法的IRAC分析自动化 2406.13217v2 |
Authors (5): Xiaoxi Kang, Lizhen Qu, Lay-Ki Soon, Zhuang Li, Adnan Trakic
The effectiveness of Large Language Models (LLMs) in legal reasoning is often limited due to the unique legal terminologies and the necessity for highly specialized knowledge. These limitations highlight the need for high-quality data tailored for complex legal reasoning tasks. This paper introduces LegalSemi, a benchmark specifically curated for legal scenario analysis. LegalSemi comprises 54 legal scenarios, each rigorously annotated by legal experts, based on the comprehensive IRAC (Issue, Rule, Application, Conclusion) framework from Malaysian Contract Law. In addition, LegalSemi is accompanied by a structured knowledge base (SKE). A series of experiments were conducted to assess the usefulness of LegalSemi for IRAC analysis. The experimental results demonstrate the effectiveness of incorporating the SKE for issue identification, rule retrieval, application and conclusion generation using four different LLMs.
大语言模型在法律推理方面的效力往往有限,因为法律术语独特,而且需要高度专业的知识,这些局限性突出表明需要针对复杂的法律推理任务提供高质量的数据,本文件介绍了法律Semi,这是专门为法律情景分析制定的基准,法律Semi由54种法律设想方案组成,每种方案都由法律专家严格说明,以马来西亚合同法的综合IRAC(问题、规则、适用、结论)框架为基础,此外,法律Semi还辅之以一个结构化的知识库(SKE),进行了一系列实验,评估法律Semi对辐射研究中心分析的效用,实验结果表明将SKE纳入四个不同的LMs的问题识别、规则检索、应用和结论生成工作的有效性。
Article 158
Title@2025-07-09 (3): Probing and Steering Evaluation Awareness of Language Models
Title: Probing and Steering Evaluation Awareness of Language Models | Probing and Steering Evaluation Awareness of Language Models | 评价语言模式评估意识的检验和指导 2507.01786v2 |
Authors (4): Jord Nguyen, Khiem Hoang, Carlo Leonardo Attubato, Felix Hofstätter
Language models can distinguish between testing and deployment phases – a capability known as evaluation awareness. This has significant safety and policy implications, potentially undermining the reliability of evaluations that are central to AI governance frameworks and voluntary industry commitments. In this paper, we study evaluation awareness in Llama-3.3-70B-Instruct. We show that linear probes can separate real-world evaluation and deployment prompts, suggesting that current models internally represent this distinction. We also find that current safety evaluations are correctly classified by the probes, suggesting that they already appear artificial or inauthentic to models. Our findings underscore the importance of ensuring trustworthy evaluations and understanding deceptive capabilities. More broadly, our work showcases how model internals may be leveraged to support blackbox methods in safety audits, especially for future models more competent at evaluation awareness and deception.
语言模型可以区分测试和部署阶段 – – 一种被称为评估意识的能力,这具有重大的安全和政策影响,有可能损害对AI治理框架和自愿行业承诺至关重要的评价的可靠性。在本文中,我们在Llama-3.3-70B-Instruct中研究评价意识。我们表明线性探测器可以区分真实世界的评价和部署提示,表明目前的模型在内部代表了这种区别。我们还发现,目前的安全评估被探测器正确地分类,表明它们已经看起来是人为的或对模型不真实的。我们的调查结果强调了确保可靠的评估和理解欺骗性能力的重要性。更广泛地说,我们的工作展示了如何利用内部模型来支持安全审计中的黑盒方法,特别是未来模型在评估意识和欺骗方面更能胜任的黑盒方法。
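The probing result above can be illustrated with a minimal sketch of a linear probe. It assumes synthetic 2-D "activations" in place of real Llama-3.3-70B-Instruct hidden states and uses a difference-of-means direction, which is one simple kind of linear probe; the abstract does not say which probe the authors trained, so every name and number below is hypothetical.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]


def fit_mean_difference_probe(eval_acts, deploy_acts):
    """Return (w, b) for a linear probe whose direction is the difference
    of class means and whose boundary passes through their midpoint."""
    mu_eval = mean_vector(eval_acts)
    mu_deploy = mean_vector(deploy_acts)
    w = [e - d for e, d in zip(mu_eval, mu_deploy)]
    midpoint = [(e + d) / 2 for e, d in zip(mu_eval, mu_deploy)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def probe_says_evaluation(w, b, activation):
    """True if the probe places this activation on the 'evaluation' side."""
    return sum(wi * xi for wi, xi in zip(w, activation)) + b > 0


# Synthetic stand-ins for hidden states (assumption: the two prompt types
# differ along the first coordinate only).
rng = random.Random(0)
eval_acts = [[rng.gauss(2.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(200)]
deploy_acts = [[rng.gauss(-2.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(200)]

w, b = fit_mean_difference_probe(eval_acts, deploy_acts)
correct = sum(probe_says_evaluation(w, b, a) for a in eval_acts)
correct += sum(not probe_says_evaluation(w, b, a) for a in deploy_acts)
accuracy = correct / (len(eval_acts) + len(deploy_acts))
print(f"probe accuracy on synthetic data: {accuracy:.3f}")
```

If a single direction separates the two prompt types this well, the distinction is linearly represented, which is the sense in which the abstract says models "internally represent" it.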
Article 159
Title@2025-07-09 (3): Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models
Title: Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models | Elite Polarisierung in europäischen Parlamentsreden: ein neuartiger Messansatz mit großen Sprachmodellen | 欧洲议会讲话中的极极化:使用大语言模式的新衡量方法 2507.06658v1 |
Authors (1): Gennadii Iakovlev
This project introduces a new measure of elite polarization via actor and subject detection using artificial intelligence. I identify when politicians mention one another in parliamentary speeches, note who is speaking and who is being addressed, and assess the emotional temperature behind these evaluations. This maps how elites evaluate their various out-parties, allowing us to create an index of mutual out-party hostility, that is, elite polarization. While I analyzed polarization data over the past four decades for the UK, and two decades for Hungary and Italy, my approach lays the groundwork for a twenty-year, EU-wide time-series dataset on elite polarization. I obtain results that can be aggregated by party and quarter. The resulting index demonstrates good face validity: it reacts to events such as electoral campaigns, country- and party-level crises, and to parties losing and assuming power.
这个项目引入了一种通过演员和使用人工智能进行实验检测的精英两极分化的新尺度。 当政客们在议会讲话中互相提及时,我就会指出,谁在讲话,谁在说话,谁正在被处理,并评估这些评价背后的情绪温度。这幅画描绘了精英们如何评价他们不同的外部党派,使我们能够建立一个共同的外部敌对指数,即精英两极分化。 我分析了过去40年来英国的两极分化数据,以及匈牙利和意大利的两极分化数据,与此同时,我的方法为整个欧盟关于精英两极分化的20年时间系列数据打下了基础。我获得了可以按政党和季度汇总的结果。 由此得出的指数显示了良好的面貌有效性:它对选举运动、国家和政党危机、以及失去和掌权的政党等事件作出反应。
Article 160
Title@2025-07-09 (3): PBa-LLM: Privacy- and Bias-aware NLP using Named-Entity Recognition (NER)
Title: PBa-LLM: Privacy- and Bias-aware NLP using Named-Entity Recognition (NER) | PBa-LLM: Datenschutz- und Bias-aware NLP mit Named-Entity-Erkennung (NER) | PBA-LLM: 使用命名实体识别(NER)的隐私和Bias-awar NLP 2507.02966v2 |
Authors (8): Gonzalo Mancera, Aythami Morales, Julian Fierrez, Ruben Tolosana, Alejandro Penna, Miguel Lopez-Duran, Francisco Jurado, Alvaro Ortigosa
The use of Natural Language Processing (NLP) in high-stakes AI-based applications has increased significantly in recent years, especially since the emergence of Large Language Models (LLMs). However, despite their strong performance, LLMs introduce important legal/ethical concerns, particularly regarding privacy, data protection, and transparency. Due to these concerns, this work explores the use of Named-Entity Recognition (NER) to facilitate the privacy-preserving training (or adaptation) of LLMs. We propose a framework that uses NER technologies to anonymize sensitive information in text data, such as personal identities or geographic locations. An evaluation of the proposed privacy-preserving learning framework was conducted to measure its impact on user privacy and system performance in a particular high-stakes and sensitive setup: AI-based resume scoring for recruitment processes. The study involved two language models (BERT and RoBERTa) and six anonymization algorithms (based on Presidio, FLAIR, BERT, and different versions of GPT) applied to a database of 24,000 candidate profiles. The findings indicate that the proposed privacy preservation techniques effectively maintain system performance while playing a critical role in safeguarding candidate confidentiality, thus promoting trust in the tested scenario. On top of the proposed privacy-preserving approach, we also experiment with applying an existing approach that reduces the gender bias in LLMs, thus finally obtaining our proposed Privacy- and Bias-aware LLMs (PBa-LLMs). Note that the proposed PBa-LLMs have been evaluated in a particular setup (resume scoring), but are generally applicable to any other LLM-based AI application.
近年来,特别是在出现大型语言模型(LLMs)以来,在基于AI的高级应用中,自然语言处理(NLP)的使用在最近几年中显著增加,特别是在出现大型语言模型(LLMs)以来。然而,LLMs尽管表现强劲,却提出了重要的法律/道德关切,特别是在隐私、数据保护和透明度方面。由于这些关切,这项工作探索了使用Nation-实体识别(NER)来促进LLMS的隐私保护培训(或改编)。我们提议了一个框架,利用NER技术将个人身份或地理位置等文本数据中的敏感信息匿名化。对拟议的隐私保存学习框架进行了评估,以衡量其在特定高访问和敏感设置方面对用户隐私和系统绩效的影响:AI公司恢复征聘过程的评分。由于这些关切,这项工作涉及两种语言模型(BERT和ROBERTA)和六种匿名算法(基于PLV、FLAIR、BERT和GPT的不同版本),适用于24 000个候选人简介数据库。调查结果表明,拟议的隐私保存技术保存方法有效地维持了系统的保密性和系统运行,因此,在试验中也降低了现有的保密模式。
Article 161
Title@2025-07-09 (3): Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review
Title: Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review | Beschleunigung der Datenextraktion mit Hilfe eines großen Sprachmodells (LLM) und eines Scoping-Review-Protokolls: eine methodische Studie im Rahmen einer komplexen Scoping-Review | 利用大型语言模式和范围界定审查议定书加快数据提取速度:在复杂的范围界定审查范围内进行的方法研究 2507.06623v1 |
Authors (7): James Stewart-Evans, Emma Wilson, Tessa Langley, Andrew Prayle, Angela Hands, Karen Exley, Jo Leonardi-Bee
The data extraction stages of reviews are resource-intensive, and researchers may seek to expedite data extraction using online large language models (LLMs) and review protocols. Claude 3.5 Sonnet was used to trial two approaches that used a review protocol to prompt data extraction from 10 evidence sources included in a case study scoping review. A protocol-based approach was also used to review extracted data. A limited performance evaluation found high accuracy for the two extraction approaches (83.3% and 100%) when extracting simple, well-defined citation details; accuracy was lower (9.6% and 15.8%) when extracting more complex, subjective data items. Considering all data items, both approaches had precision >90% but low recall (<25%) and F1 scores (<40%). The context of a complex scoping review, open response types and methodological approach likely impacted performance due to missed and misattributed data. LLM feedback considered the baseline extraction accurate and suggested minor amendments: 4 of 15 (26.7%) amendments to citation details and 8 of 38 (21.1%) to key findings data items were considered to potentially add value. However, when repeating the process with a dataset featuring deliberate errors, only 2 of 39 (5%) errors were detected. Review-protocol-based methods used for expediency require more robust performance evaluation across a range of LLMs and review contexts with comparison to conventional prompt engineering approaches. We recommend researchers evaluate and report LLM performance if using them similarly to conduct data extraction or review extracted data. LLM feedback contributed to protocol adaptation and may assist future review protocol drafting.
审查的数据提取阶段是资源密集型的,研究人员可能寻求利用在线(大语言模型)LLM和审查协议加快数据提取速度。Claude 3.5 Sonnet用于试验两种方法,即使用审查协议,促使从案例研究范围审查中包括的10个证据来源提取数据。还采用基于协议的方法审查提取的数据;采用有限的绩效评价,发现两种提取方法(83.3%和100%)在提取简单、明确界定的引用细节时,准确性很高(9.6%和15.8%);在提取更复杂、主观的数据项目时,准确性较低(9.6%和15.8%)。考虑到所有数据项目,两种方法都使用审查规程精确度大于90%,但回顾率低( < 25 %)和F1分( < 40%),采用复杂的范围审查、公开答复类型和方法,可能因误用和误用数据而影响业绩;LLMM反馈认为,基线提取方法准确性,建议对关键数据项目作出小的修改:15项(26.7%)和38项(21.1%)。考虑到所有数据项目都可能增加价值。然而,在重复采用定期评估程序时,需要进行定期评估的进度评估时,(5次的实地评估,在评估后,采用定期评估后,对定期评估,对定期评估中的数据进行实地评估,可能需要进行实地评估。
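The reported combination of precision above 90%, recall below 25% and F1 below 40% follows from F1 being the harmonic mean of precision and recall, which is pulled toward the lower of the two. The sketch below works through that arithmetic; the inputs are boundary values taken from the abstract's inequalities, not the study's measured figures.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Boundary values implied by "precision >90%" and "recall <25%".
f1_at_boundary = f1_score(0.90, 0.25)

# F1 grows with both inputs, so with recall below 0.25 its ceiling is the
# value at perfect precision and recall exactly 0.25.
f1_ceiling = f1_score(1.00, 0.25)

print(f"F1 at P=0.90, R=0.25: {f1_at_boundary:.3f}")  # 0.391
print(f"F1 at P=1.00, R=0.25: {f1_ceiling:.3f}")      # 0.400
```

Because the ceiling is exactly 0.40, any recall below 25% forces F1 below 40% regardless of precision, which matches the figures in the abstract.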
Article 162
Title@2025-07-09 (3): FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation
Title: FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation | FuDoBa: Zusammenführen von Dokumenten- und Wissensgraphen-basierten Darstellungen mit Bayesian-Optimierung | FUDoBA: 以贝叶斯最佳化方式使用引信文档和知识图表表示 2507.06622v1 |
Authors (4): Boshko Koloski, Senja Pollak, Roberto Navigli, Blaž Škrlj
Building on the success of Large Language Models (LLMs), LLM-based representations have dominated the document representation landscape, achieving great performance on the document embedding benchmarks. However, the high-dimensional, computationally expensive embeddings from LLMs tend to be either too generic or inefficient for domain-specific applications. To address these limitations, we introduce FuDoBa, a Bayesian optimisation-based method that integrates LLM-based embeddings with domain-specific structured knowledge, sourced both locally and from external repositories like WikiData. This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights for enhanced classification performance. We demonstrate the effectiveness of our approach on six datasets in two domains, showing that when paired with robust AutoML-based classifiers, our proposed representation learning approach performs on par with, or surpasses, those produced solely by the proprietary LLM-based embedding baselines.
在大型语言模型(LLMs)成功的基础上,基于LLM的LLM代表制在文件代表格局上占据了主导地位,在文件嵌入基准上取得了巨大的业绩;然而,高维、计算上昂贵的LLMs嵌入层往往过于通用,或者对特定领域的应用来说效率低下。为了解决这些限制,我们采用了基于Bayesian优化的FuDoba方法,将基于LLM的嵌入与当地和来自WikiData(WikiData)等外部储存库的基于特定领域的结构化知识相结合。这种组合产生低维度、与任务相关的代表制,同时减少培训复杂性,并产生可解释的早期融合权重,用于提高分类性业绩。我们展示了我们在两个领域六套数据集上的做法的有效性,表明在与基于AutMLMLG的分类师结合时,我们拟议的代表制学习方法与完全由基于LMM的专有嵌入基线生成的知识或超过这些知识。
Article 163
Title@2025-07-09 (3): Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
Title: Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation | Decoder-Hybrid-Decoder-Architektur für effizientes Nachdenken mit langer Generation | 提高长代人合理性效率的代coder-Hybrid-Decer 结构 2507.06607v1 |
Authors (14): Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen
Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
语言建模方面的最新进展显示了国家空间模型(SSMM)对于高效序列建模的有效性。 Samba 和 decoder- decoder 结构( YOCO) 等混合结构在变异器上表现出了有希望的绩效收益, 先前的工程并没有调查SSM 层之间代表共享的效率潜力。 在本文中, 我们引入了Gate memory 单元(GMU), 这是一种简单而有效的跨层有效记忆共享机制。 我们应用它来创建 SambaY, 一种将GMU纳入交叉解码中, 以分享基于 Samba 的自解码器( YOCO) 的内存读状态。 SambaY 明显提高了解码效率, 保留了SSSMSM 层之间在时间上共享的线性预复杂性, 提高了长期性能, 同时消除了明确的定位编码需要。 通过广泛的规模实验, 我们的模型显示, 与坚固的YOCO基线基准相比, 显示在大规模调整的系统下, 提高性能表现的精确度缩缩缩缩缩缩缩。
Article 164
Title@2025-07-09 (3): Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Title: Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives | Video-Language Understanding: Eine Umfrage aus Modellarchitektur, Modellschulung und Datenperspektiven | 了解视频语言:从建模模型、示范培训和数据视角进行调查 2406.05615v3 |
Authors (9): Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us. There has been a lot of interest in creating video-language understanding systems with human-like senses since a video-language pair can mimic both our linguistic medium and visual environment with temporal dynamics. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on the challenges, we summarize their methods from model architecture, model training, and data perspectives. We also conduct performance comparison among the methods, and discuss promising directions for future research.
人类使用多种感官来理解环境。 视觉和语言是两个最重要的感官,因为它们使我们能够轻松地交流我们的想法和感知我们周围的世界。 创建具有人感的视频语言理解系统引起了很大的兴趣,因为一对视频语言可以模仿我们的语言媒介和视觉环境,同时呈现时间动态。 在本次调查中,我们审查这些系统的关键任务,并突出相关的挑战。 根据挑战,我们从模型结构、模型培训和数据角度总结它们的方法。我们还对这些方法进行绩效比较,并讨论未来研究的有希望的方向。
Article 165
Title@2025-07-09 (3): Can Input Attributions Explain Inductive Reasoning in In-Context Learning?
Title: Can Input Attributions Explain Inductive Reasoning in In-Context Learning? | Kann Input-Attributionen im In-Context-Lernen eine induktive Begründung erklären? | 投入属性能否解释内文学习中的引因? 2412.15628v5 |
Authors (4): Mengyu Ye, Tatsuki Kuribayashi, Goro Kobayashi, Jun Suzuki
Interpreting the internal process of neural models has long been a challenge. This challenge remains relevant in the era of large language models (LLMs) and in-context learning (ICL); for example, ICL poses a new issue of interpreting which example in the few-shot examples contributed to identifying/solving the task. To this end, in this paper, we design synthetic diagnostic tasks of inductive reasoning, inspired by the generalization tests typically adopted in psycholinguistics. Here, most in-context examples are ambiguous w.r.t. their underlying rule, and one critical example disambiguates it. The question is whether conventional input attribution (IA) methods can track such a reasoning process, i.e., identify the influential example, in ICL. Our experiments provide several practical findings; for example, a certain simple IA method works the best, and the larger the model, the generally harder it is to interpret the ICL with gradient-based IA methods.
长期以来,对神经模型内部过程的解释一直是一个挑战。在大型语言模型(LLMs)和文字内学习(ICL)的时代,这一挑战仍然具有相关性;例如,ICL提出了一个新的解释问题,在少数例子中,这个例子有助于确定/解决任务。为此,在本文件中,我们设计了感知推理的合成诊断任务,这受通常在心理语言学中采用的概括性测试的启发。在这里,大多数同源实例是模糊的,有一个关键的例子掩盖了这一点。问题在于常规投入归属(IA)方法能否跟踪这种推理过程,即识别具有影响力的示例,在ICL中。我们的实验提供了若干实际结果;例如,某种简单的IA方法最有效,而模型越大,用梯度的IA方法解释ICL通常越难。
Article 166
Title@2025-07-09 (3): Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions
Title: Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions | Bewertung und Verbesserung der Robustheit in großen Sprachmodellen: Umfrage und Zukunftsrichtung | 评价和提高大语言模式的威力:调查和未来方向 2506.11111v2 |
Authors (5): Kun Zhang, Le Wu, Kui Yu, Guangyi Lv, Dacao Zhang
Large Language Models (LLMs) have gained enormous attention in recent years due to their capability of understanding and generating natural languages. With their rapid development and wide-ranging applications (e.g., Agents, Embodied Intelligence), the robustness of LLMs has received increased attention. As the core brain of many AI applications, the robustness of LLMs requires that models should not only generate consistent contents, but also ensure the correctness and stability of generated content when dealing with unexpected application scenarios (e.g., toxic prompts, limited noise domain data, out-of-distribution (OOD) applications, etc.). In this survey paper, we conduct a thorough review of the robustness of LLMs, aiming to provide a comprehensive terminology of concepts and methods around this field and facilitate the community. Specifically, we first give a formal definition of LLM robustness and present the collection protocol of this survey paper. Then, based on the types of perturbed inputs, we organize this survey from the following perspectives: 1) Adversarial Robustness: tackling the problem that prompts are manipulated intentionally, such as noise prompts, long context, data attack, etc.; 2) OOD Robustness: dealing with unexpected real-world application scenarios, such as OOD detection, zero-shot transferring, hallucinations, etc.; 3) Evaluation of Robustness: summarizing the new evaluation datasets, metrics, and tools for verifying the robustness of LLMs. After reviewing the representative work from each perspective, we discuss and highlight future opportunities and research directions in this field. Meanwhile, we also organize related works and provide an easy-to-search project (https://github.com/zhangkunzk/Awesome-LLM-Robustness-papers) to support the community.
近年来,大型语言模型(LLMS)因其理解和生成自然语言的能力而得到极大关注。随着快速发展和野外应用(例如代理人、Embudied Intell),LLMS的稳健性得到了越来越多的关注。作为许多AI应用程序的核心大脑,LLMS的稳健性要求在处理未详细应用情景时,模型不仅应产生一致的内容,而且还应确保生成的内容的正确性和稳定性(例如有毒提示、有限的噪音域数据、分配之外的应用程序等)。在本调查文件中,我们对LLMS的稳健性进行了彻底审查,目的是为该领域的概念和方法提供全面的术语,并为社区提供便利。具体地说,我们首先对LMMS的稳健性作出正式定义,并介绍这份调查文件的收集程序。然后,根据扰动性投入的种类,我们从以下角度组织这次调查:1) Adversariaral-headaldreatrial Rustrial Rights:我们刻意操纵的问题,例如噪声、长背景、数据测试的未来机会;ODMUD-dealevild-dealveality oration oration orpss se;以及ODODODD) se-list reals astrevationals axes。然后,我们从实时、我们从实时分析和新工具提供新的研究/ODODDDDDDDDDDDD2 提供新的方向、我们进行这方面的工作。
Article 167
Title@2025-07-09 (3): Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis
Title: Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis | Verbesserung der Lebensmittel-Domain-Fragebeantwortung mit einem multimodalen Wissensgraphen: Hybride QA-Erzeugung und Diversitätsanalyse | 以多模式知识图回答加强粮食领域问题:混合QA生成和多样性分析 2507.06571v1 |
Authors (2): Srihari K B, Pushpak Bhattacharyya
We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves BERTScore by 16.2%, reduces FID by 37.8%, and boosts CLIP alignment by 31.1%. Diagnostic analyses, namely CLIP-based mismatch detection (35.2% to 7.3%) and LLaVA-driven hallucination checks, ensure factual and visual fidelity. A hybrid retrieval-generation strategy achieves 94.1% accurate image reuse and 85% adequacy in synthesis. Our results demonstrate that structured knowledge and multimodal generation together enhance reliability and diversity in food QA.
我们提出一个统一的食品-食物质量评估框架,将大型多式联运知识图(MMKG)与基因性AI相结合。我们的MMKG链接了13 000种配方、3 000种配方、140 000种关系和14 000张图像。我们利用40个模板和LLAVA/DeepSeek增殖生成了40 000对QA配方。Meta LalaMA 3.1-8B和Stabal Difulation 3.5-Lorge的联合微调将BERTScore 改进了16.2,将FID减少了37.8,并将CLIP的匹配增加了31.1。诊断分析-基于CLOCIP的错配检测(35.2至7.3)和LLAVA驱动的幻觉检查真实性和视觉忠诚性。混合回收战略实现了94.1准确的图像再利用和85合力。我们的成果表明,结构化的知识和多式生成共同提高食品QA的可靠性和多样性。
Article 168
Title@2025-07-09 (3): CHAI for LLMs: Improving Code-Mixed Translation in Large Language Models through Reinforcement Learning with AI Feedback
Title: CHAI for LLMs: Improving Code-Mixed Translation in Large Language Models through Reinforcement Learning with AI Feedback | CHAI für LLMs: Verbesserung der code-gemischten Übersetzung in großen Sprachmodellen durch Verstärkungslernen mit KI-Feedback | LLM女士CHAI:利用AI反馈加强学习,改进大语言模式的编码混合翻译 2411.09073v3 |
Authors (3): Wenbo Zhang, Aditya Majumdar, Amulya Yadav
Large Language Models (LLMs) have demonstrated remarkable capabilities across various NLP tasks but struggle with code-mixed (or code-switched) language understanding. For example, prior work benchmarking the performance of multilingual LLMs on code-mixed translation tasks has demonstrated that current state-of-the-art multilingual LLMs are ineffective in dealing with code-mixed languages. However, the question of how to improve the capability of multilingual LLMs to handle code-mixed language has not received any attention to date. In this paper, we tackle this research gap by proposing CHAI, a novel general-purpose framework for improving the ability of multilingual LLMs to handle code-mixed languages. CHAI relies on three novel contributions made in this paper. First, we explore the ability of LLMs to provide accurate annotations for code-mixed translation tasks. Second, we leverage this ability of LLMs as annotators to generate preference data for code-mixed translation tasks at scale, which are then used within a reinforcement learning from AI feedback (RLAIF) procedure to improve LLMs’ capability on code-mixed tasks. Third, we conduct a rigorous experimental evaluation across various real-world datasets and settings. Our analysis shows that CHAI-powered LLMs outperform state-of-the-art open-source LLMs by 25.66% (in terms of win rate adjudicated by human annotators) in code-mixed translation tasks. This work represents a first step towards developing more inclusive code-mixed LLMs.
大型语言模型(LLMS)在各种国家语言模型任务中表现出了非凡的能力,但与编码混合(或编码转换)的语言理解相抗衡。例如,先前的工作基准基准表明,在编码混合翻译任务中,多语种LLMS的多语种LMS的绩效已经证明,目前最先进的多语种LLMs在处理代码混合翻译任务方面是无效的。然而,如何提高多语种LLMS处理代码混合语言的能力的问题至今没有得到任何关注。在本文中,我们通过提出CHAI(REIF)程序来解决这一研究差距,以提高多语种LMAs处理代码混合语言的能力。CHAI依靠本文中的三个新创贡献。首先,我们探索LLMSMs在为编码混合翻译任务提供准确说明的能力。第二,我们利用LLMMS的这种能力作为批注者,为代码混合翻译任务生成优惠数据,然后在从AI反馈(RLAIF)中进行强化学习,这是提高LMSMS处理编码能力的新通用普通语言模型翻译能力。我们用了一个硬性模型分析。
Article 169
Title@2025-07-09 (3): The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production
Title: The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production | Die Fehler anderer: Ein LLM-gestützter Rahmen für die wissenschaftliche Wissensproduktion | 其他法律:由LLM驱动的科学知识生产框架 2507.06565v1 |
Authors (1): Juan B. Gutiérrez
Large language models turn writing into a live exchange between humans and software. We capture this new medium with a discursive-network model that treats people and LLMs as equal nodes and tracks how their statements circulate. Broadening the focus from isolated hallucinations, we define invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. A general mathematical model of discursive networks is developed to provide valuable insights: a network governed only by drift and self-repair stabilizes at a modest error rate; adding fabrication reproduces the high rates seen in current LLMs. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source Flaws-of-Others (FOO) algorithm: a configurable loop in which any set of agents critique one another while a harmoniser merges their verdicts. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from wiring imperfect ones into networks that keep each other honest.
大型语言模型将写作变成人类和软件之间的实时交换。 我们用一个不精确的网络模型捕捉这一新媒体, 将人和LLMs作为平等的节点对待, 并跟踪其声明的传播方式。 从孤立的幻觉中扩大焦点, 我们定义无效( 任何事实、 逻辑或结构性违约) , 并显示它有四种危险 : 从真理、 自我修复、 制造和外部检测中漂移 。 开发了一个不透明网络的一般数学模型, 以提供宝贵的洞察 : 一个仅受漂移和自修复稳定在微小错误率的网络管理, 并且添加制造复制当前LLMs 所看到的高利率 。 给每个虚假的主张提供一次小小的同行审查机会, 甚至一次同行审查的机会, 将系统改变为真相主宰状态 。 我们用开放源的 \ emph{ FOL- 算法( FOO) 算法 进行同行审查 : 一个可配置的循环, 任何一组代理人都互相批评, 而一个协调者将其判断合并为彼此的循环 。 。 。 取而具有实用和文化性: 新媒介的可靠性是实用和文化的: : 新媒介的可靠性不是来自完善的单一的单一的单一的模型, 而是从不完善的单一的单一的网络。
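The qualitative pattern described for the four hazards can be sketched with a two-state Markov chain in which a statement is either valid or invalid. This is an assumption made for illustration, not the paper's actual model: drift and fabrication are treated as additive per-step rates from valid to invalid, self-repair and detection as additive rates back, and all rate values are hypothetical.

```python
def stationary_error_rate(drift, repair, fabrication=0.0, detection=0.0):
    """Long-run fraction of invalid statements in a two-state chain.

    Assumption: hazards combine additively into one valid->invalid rate
    and one invalid->valid rate (a simplification for small rates).
    """
    to_invalid = drift + fabrication
    to_valid = repair + detection
    return to_invalid / (to_invalid + to_valid)


def simulate_error_rate(drift, repair, fabrication=0.0, detection=0.0, steps=1000):
    """Iterate the chain from an all-valid start to check the closed form."""
    to_invalid = drift + fabrication
    to_valid = repair + detection
    invalid = 0.0
    for _ in range(steps):
        invalid += to_invalid * (1.0 - invalid) - to_valid * invalid
    return invalid


# Hypothetical rates, chosen only to show the three regimes.
baseline = stationary_error_rate(drift=0.02, repair=0.18)
with_fabrication = stationary_error_rate(drift=0.02, repair=0.18, fabrication=0.16)
with_detection = stationary_error_rate(
    drift=0.02, repair=0.18, fabrication=0.16, detection=0.18
)
print(baseline, with_fabrication, with_detection)
```

Under these made-up rates, the error share rises from 0.10 to 0.50 when fabrication is added and falls back to about 0.33 once detection is switched on, mirroring the abstract's ordering of the three regimes.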
Article 170
Title@2025-07-09 (3): DS@GT at CheckThat! 2025: Exploring Retrieval and Reranking Pipelines for Scientific Claim Source Retrieval on Social Media Discourse
Title: DS@GT at CheckThat! 2025: Exploring Retrieval and Reranking Pipelines for Scientific Claim Source Retrieval on Social Media Discourse | DS@GT bei CheckThat! 2025: Erforschung von Rück- und Rücklaufleitungen für wissenschaftliche Anspruchsquellen-Rückgriffe auf Social Media Diskurs | DS@GT at checketh! 2025年: 探索检索和重新排序科学索赔源检索管道,关于社会媒体问题 2507.06563v1 |
Authors (4): Jeanette Schofield, Shuyu Tian, Hoang Thanh Thanh Truong, Maximilian Heil
Social media users often make scientific claims without citing where these claims come from, generating a need to verify these claims. This paper details work done by the DS@GT team for CLEF 2025 CheckThat! Lab Task 4b, Scientific Claim Source Retrieval, which seeks to find relevant scientific papers based on implicit references in tweets. Our team explored 6 different data augmentation techniques, 7 different retrieval and reranking pipelines, and finetuned a bi-encoder. Achieving an MRR@5 of 0.58, our team ranked 16th out of 30 teams for the CLEF 2025 CheckThat! Lab Task 4b, an improvement of 0.15 over the BM25 baseline of 0.43. Our code is available on GitHub at https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4b.
社会媒体用户往往在不列举这些主张来源的情况下提出科学主张,从而需要核实这些主张。本文详细介绍了DS@GT团队为CLEF 2025 CLEF 校验!实验室的科学索赔源检索第4b号任务 4b 科学索赔源检索val,该任务试图根据在推特中暗含的引用找到相关的科学论文。我们的团队探讨了6种不同的数据增强技术,7种不同的检索和重新排位管道,并对双编码器进行了微调。实现了0.58 MRR@5, 我们的团队在CLEF 2025 CheckTH!实验室任务4b 的30个团队中排第16位,比BM25基线0.43改进了0.15。我们的代码可在 Github上查阅 https://github.com/dsgt-arc/ checkthat-2025-swd/tree/main/subtask-4b。
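For readers unfamiliar with the metric, MRR@5 averages the reciprocal rank of the correct paper over all queries, counting zero when it is not in the top five. The sketch below assumes one gold paper per tweet; the document identifiers and rankings are made up for illustration.

```python
def mrr_at_k(ranked_lists, gold_ids, k=5):
    """Mean reciprocal rank of the gold item within the top k results."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_ids):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(gold_ids)


# Hypothetical rankings for three tweets.
rankings = [
    ["p1", "p7", "p3", "p9", "p4"],  # gold p1 at rank 1 -> 1.0
    ["p5", "p2", "p8", "p6", "p0"],  # gold p2 at rank 2 -> 0.5
    ["p3", "p4", "p5", "p6", "p7"],  # gold p9 missing   -> 0.0
]
gold = ["p1", "p2", "p9"]

score = mrr_at_k(rankings, gold, k=5)
print(score)  # 0.5
```

Read this way, the reported 0.58 means the correct paper sits, on average, somewhat better than second place in the top five.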
Article 171
Title@2025-07-09 (3): AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework
Title: AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework | AutoPrep: Natural Language Question-Aware Datenvorbereitung mit einem Multi-Agent Framework | AUPPrep: 具有多机构框架的自然语言问题软件数据编制 2412.10422v4 |
Authors (6): Meihao Fan, Ju Fan, Nan Tang, Lei Cao, Guoliang Li, Xiaoyong Du
Answering natural language (NL) questions about tables, known as Tabular Question Answering (TQA), is crucial because it allows users to quickly and efficiently extract meaningful insights from structured data, effectively bridging the gap between human language and machine-readable formats. Many of these tables are derived from web sources or real-world scenarios, which require meticulous data preparation (or data prep) to ensure accurate responses. However, preparing such tables for NL questions introduces new requirements that extend beyond traditional data preparation. This question-aware data preparation involves specific tasks such as column derivation and filtering tailored to particular questions, as well as question-aware value normalization or conversion, highlighting the need for a more nuanced approach in this context. Because each of the above tasks is unique, a single model (or agent) may not perform effectively across all scenarios. In this paper, we propose AutoPrep, a large language model (LLM)-based multi-agent framework that leverages the strengths of multiple agents, each specialized in a certain type of data prep, ensuring more accurate and contextually relevant responses. Given an NL question over a table, AutoPrep performs data prep through three key components. Planner: Determines a logical plan, outlining a sequence of high-level operations. Programmer: Translates this logical plan into a physical plan by generating the corresponding low-level code. Executor: Executes the generated code to process the table. To support this multi-agent framework, we design a novel Chain-of-Clauses reasoning mechanism for high-level operation suggestion, and a tool-augmented method for low-level code generation.
回答关于表格的自然语言(NL)问题,称为“Tabular 问答”(TQA),至关重要,因为它使用户能够快速和高效地从结构化数据中获取有意义的见解,从而有效地弥合人类语言和机器可读格式之间的差距。许多这些表格来自网络来源或现实世界情景,这需要细致的数据准备(或数据准备)以确保准确的答复。然而,为NL问题准备这样的表格提出了超越传统数据准备的新要求。这种问题软件数据编制涉及具体的任务,如专为特定问题而设计的列转换和过滤,以及质觉值正常化或转换,强调在这方面需要更细致的方法。由于上述每一项任务都是独特的,单一模型(或代理)可能无法在所有情景中有效运行。在本文件中,我们提出了AutoPrep,一个大型语言模型(LLM)基于多剂框架,利用多种剂的优势,每种类型的数据预选,确保更准确和符合背景的响应。
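The Planner, Programmer and Executor roles described in the abstract can be sketched as three plain functions. This is a toy illustration under stated assumptions, not AutoPrep itself: the real agents are LLM-driven, whereas here the logical plan is hard-coded, and the operation names, table and question are all hypothetical.

```python
def planner(question, table):
    """Hypothetical Planner: returns a logical plan as (operation, args).
    A real system would derive this from the question; here it is fixed."""
    return [
        ("normalize", {"column": "country", "mapping": {"USA": "United States"}}),
        ("derive", {"new_column": "year", "source": "date"}),
        ("filter", {"column": "year", "equals": "2020"}),
    ]


def programmer(logical_plan):
    """Hypothetical Programmer: turns each logical step into executable code."""
    def compile_op(op, args):
        if op == "normalize":
            col, mapping = args["column"], args["mapping"]
            return lambda rows: [{**r, col: mapping.get(r[col], r[col])} for r in rows]
        if op == "derive":
            new, src = args["new_column"], args["source"]
            # Assumes ISO dates, so the year is the first four characters.
            return lambda rows: [{**r, new: r[src][:4]} for r in rows]
        if op == "filter":
            col, val = args["column"], args["equals"]
            return lambda rows: [r for r in rows if r[col] == val]
        raise ValueError(f"unknown operation: {op}")
    return [compile_op(op, args) for op, args in logical_plan]


def executor(physical_plan, table):
    """Hypothetical Executor: applies the compiled steps in order."""
    rows = table
    for step in physical_plan:
        rows = step(rows)
    return rows


table = [
    {"country": "USA", "date": "2020-03-01", "sales": 10},
    {"country": "United States", "date": "2021-05-09", "sales": 7},
    {"country": "Canada", "date": "2020-11-30", "sales": 4},
]
question = "Which countries had sales in 2020?"
prepared = executor(programmer(planner(question, table)), table)
print([row["country"] for row in prepared])
```

Keeping the logical plan separate from the compiled steps is what lets each stage be handled by a different specialised agent, which is the design the abstract argues for.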
Article 172
Title@2025-07-09 (3): FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction
Title: FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction | FiRST: Feinsteuerungs-Router-Selective Transformer für Input-Adaptive Latenzreduktion | FIRST: 用于减少投入-适应性延迟的微调路由器-选择式变换器 2410.12513v3 |
Authors (4): Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal
Auto-regressive Large Language Models (LLMs) demonstrate remarkable performance across different domains such as vision and language processing. However, due to sequential processing through a stack of transformer layers, autoregressive decoding faces significant computation/latency challenges, particularly in resource-constrained environments like mobile and edge devices. Existing approaches in the literature that aim to improve latency via skipping layers have two distinct flavors: 1) early exit, and 2) input-agnostic heuristics where tokens exit at pre-determined layers irrespective of input sequence. Both of the above strategies have limitations: the former cannot be applied to handle the KV caching necessary for speed-ups in modern frameworks, and the latter does not capture the variation in layer importance across tasks or, more generally, across input sequences. To address both limitations, we propose FiRST, an algorithm that reduces inference latency by using layer-specific routers to select a subset of transformer layers adaptively for each input sequence: the prompt (during the prefill stage) decides which layers will be skipped during decoding. FiRST preserves compatibility with KV caching, enabling faster inference while being quality-aware. FiRST is model-agnostic and can be easily enabled on any pre-trained LLM. Our approach reveals that input adaptivity is critical: indeed, different task-specific middle layers play a crucial role in evolving hidden representations depending on tasks. Extensive experiments show that FiRST significantly reduces latency while outperforming other layer selection strategies in quality metrics. It retains competitive performance relative to the base model (without layer skipping) and in some cases, even improves upon it. FiRST is thus a promising and efficient solution for LLM deployment in low-resource environments.
自动递减大型语言模型(LLMS)在视觉和语言处理等不同领域表现出显著的绩效。 但是,由于通过一系列变压层连续处理,自动递减解解码面临重大的计算/延迟挑战,特别是在资源受限制的环境中,特别是在移动和边缘设备等移动和边缘设备。现有的文献方法旨在通过跳过层来改善延缓性,有两个不同的口味:(1) 早期退出,和(2) 输入- 异性超常, 标志在预定层中退出, 而不考虑输入序列。 上述两种战略都有其局限性 - 前者不能用于在现代框架中处理速度加速所需的KVV Cashing, 而后者无法在任务或更普遍地在输入序列中反映层重要性的变异性。 为了应对这两种局限性,我们提议FiRST, 一种算法,通过使用特定层路由特定的路由选择可适应的变异性层序列选择一组变异层。 快速(在填充前阶段) 确定哪些层在解析期间会跳。 FiRST 将快速调调调调调高度,而稳定性战略在基础选择中可以快速的稳定性,在稳定级中进行快速变压。
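The central idea in the abstract, that the skip decision is made once from the prompt and then reused for every decoding step, can be sketched with toy numbers. Everything below is an assumption for illustration: the real routers are learned and operate on hidden states, while here each "layer" is a scalar function and each "router" a hypothetical threshold on a single prompt feature.

```python
def make_layer(scale):
    """Toy residual block acting on a scalar hidden state."""
    return lambda h: h + scale * h


def prefill_select_layers(prompt_feature, thresholds):
    """One keep/skip decision per layer, taken once at prefill.
    Hypothetical router rule: keep the layer if the prompt feature
    reaches that layer's threshold."""
    return [prompt_feature >= t for t in thresholds]


def decode_step(h, layers, keep_mask):
    """Run one decoding step through only the layers kept at prefill."""
    for layer, keep in zip(layers, keep_mask):
        if keep:
            h = layer(h)
    return h


layers = [make_layer(s) for s in (0.1, 0.2, 0.3, 0.4)]
thresholds = [0.0, 0.5, 0.9, 0.0]  # hypothetical router parameters

# The mask is fixed for the whole sequence, so a skipped layer never
# needs key/value cache entries for any later token.
keep_mask = prefill_select_layers(prompt_feature=0.6, thresholds=thresholds)
skipped_output = decode_step(1.0, layers, keep_mask)
full_output = decode_step(1.0, layers, [True] * len(layers))
print(keep_mask, skipped_output, full_output)
```

Fixing the mask per sequence is what distinguishes this from token-level early exit: every token in the sequence uses the same layers, so the cache stays consistent.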
Article 173
Title@2025-07-09 (3): InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior
Title: InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior | InvestAlign: Überwindung von Datenknappheit bei der Ausrichtung großer Sprachmodelle mit Entscheidungsprozessen von Investoren unter Herdverhalten | 投资对称:克服在将大语言模式与牧民行为下的投资者决策过程相匹配方面的数据稀缺性 2507.06528v1 |
Authors (6): Huisheng Wang, Zhuoshi Pan, Hangjing Zhang, Mingxiao Liu, Hanqing Gao, H. Vicky Zhao
Aligning Large Language Models (LLMs) with investor decision-making processes under herd behavior is a critical challenge in behavioral finance, which grapples with a fundamental limitation: the scarcity of real-user data needed for Supervised Fine-Tuning (SFT). While SFT can bridge the gap between LLM outputs and human behavioral patterns, its reliance on massive authentic data imposes substantial collection costs and privacy risks. We propose InvestAlign, a novel framework that constructs high-quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems rather than complex scenarios. Our theoretical analysis demonstrates that training LLMs with InvestAlign-generated data achieves faster parameter convergence than using real-user data, suggesting superior learning efficiency. Furthermore, we develop InvestAgent, an LLM agent fine-tuned with InvestAlign, which demonstrates significantly closer alignment to real-user data than pre-SFT models in both simple and complex investment problems. This highlights our proposed InvestAlign as a promising approach with the potential to address complex optimal investment problems and align LLMs with investor decision-making processes under herd behavior. Our code is publicly available at https://github.com/thu-social-network-research-group/InvestAlign.
在羊群行为下使大型语言模型(LLMs)与投资者决策过程对齐,是行为金融学中的一项关键挑战,其面临一个根本性限制:监督微调(SFT)所需的真实用户数据稀缺。虽然SFT可以弥合LLM输出与人类行为模式之间的差距,但它对大量真实数据的依赖带来了高昂的收集成本和隐私风险。我们提出InvestAlign,这是一个新颖的框架,它利用相似且简单的最优投资问题的理论解(而非复杂场景)来构建高质量的SFT数据集。我们的理论分析表明,用InvestAlign生成的数据训练LLM比使用真实用户数据能更快地实现参数收敛,表明其学习效率更高。此外,我们开发了InvestAgent,一个经InvestAlign微调的LLM智能体,它在简单和复杂的投资问题中都比SFT前的模型更接近真实用户数据。这表明我们提出的InvestAlign是一种很有前景的方法,有潜力解决复杂的最优投资问题,并使LLM与羊群行为下的投资者决策过程对齐。我们的代码公开于 https://github.com/thu-social-network-research-group/InvestAlign。
Article 174
Title@2025-07-09 (3): FinSphere, a Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs and Domain Tools
Title: FinSphere, a Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs and Domain Tools | FinSphere, ein Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs und Domain Tools | FinSphere,实时股票分析代理商,使用指示引导LMs和域工具进行实时股票分析 2501.12399v2 |
Authors (5): Shijie Han, Jingshu Zhang, Yiqing Shen, Kaiyuan Yan, Hongguang Li
Current financial large language models (FinLLMs) struggle with two critical limitations: the absence of objective evaluation metrics to assess the quality of stock analysis reports and a lack of depth in stock analysis, which impedes their ability to generate professional-grade insights. To address these challenges, this paper introduces FinSphere, a stock analysis agent, along with three major contributions: (1) AnalyScore, a systematic evaluation framework for assessing stock analysis quality, (2) Stocksis, a dataset curated by industry experts to enhance LLMs’ stock analysis capabilities, and (3) FinSphere, an AI agent that can generate high-quality stock analysis reports in response to user queries. Experiments demonstrate that FinSphere achieves superior performance compared to both general and domain-specific LLMs, as well as existing agent-based systems, even when they are enhanced with real-time data access and few-shot guidance. The integrated framework, which combines real-time data feeds, quantitative tools, and an instruction-tuned LLM, yields substantial improvements in both analytical quality and practical applicability for real-world stock analysis.
目前的大型财务语言模型(FinLLMS)面临两个关键限制:缺乏客观的评价指标来评估存量分析报告的质量,以及库存分析缺乏深度,这妨碍了它们产生专业水平见解的能力。为了应对这些挑战,本文件介绍了库存分析剂FinSperere, 以及三个主要贡献:(1) 分析Score,一个评估存量分析质量的系统评价框架;(2) 储存,一个由行业专家整理的数据集,以加强LLMS的存量分析能力;(3) FinSphere,一个可以针对用户询问生成高质量存量分析报告的AI代理。 实验表明,FinSphere与一般和特定领域LLMS以及现有代理系统相比,其业绩优于一般和特定领域 LLMS以及基于代理的系统,即使这些系统得到了实时数据访问和短片指导的加强;综合框架,将实时数据反馈、定量工具和经指示调整的LM结合起来,在分析质量和实用适用性实体存量分析方面都取得了重大改进。
Article 175
Title@2025-07-09 (3): FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation
Title: FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation | FIFA: Unified Faithfulness Evaluation Framework für Text-to-Video und Video-to-Text Generation | FIFA: 文本到视频和视频到文字一代统一信仰评价框架 2507.06523v1 |
Authors (5): Liqiang Jing, Viet Lai, Seunghyun Yoon, Trung Bui, Xinya Du
Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task (e.g., V2T) and also fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.
视频多模态大型语言模型(VideoMLLMs)在视频到文本和文本到视频任务上都取得了显著进展。然而,它们经常出现幻觉,生成与视觉输入相矛盾的内容。现有的评估方法仅限于单一任务(例如V2T),也无法评估开放式、自由形式回答中的幻觉。为弥补这一空白,我们提出FIFA,一个统一的忠实度评估(FaIthFulness evAluation)框架,它提取全面的描述性事实,通过时空语义依赖图(Spatio-Temporal Semantic Dependency Graph)对事实之间的语义依赖进行建模,并使用VideoQA模型对其进行验证。我们还引入了Post-Correction,一个基于工具的修正框架,用于修订含有幻觉的内容。大量实验表明,FIFA比现有评估方法更贴近人类判断,并且Post-Correction能有效提高文本和视频生成的事实一致性。
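As a rough illustration of the evaluation idea in the FIFA abstract above (extract facts, relate them through a dependency graph, verify each with a QA model), the following Python sketch computes a toy faithfulness score. The function names, the rule that a fact only counts when the facts it depends on also hold, and the lookup table standing in for a VideoQA model are all assumptions made for illustration; the abstract does not say how FIFA aggregates its checks.

```python
from typing import Callable, Dict, List


def faithfulness_score(
    facts: List[str],
    depends_on: Dict[str, List[str]],
    qa_verify: Callable[[str], bool],
) -> float:
    """Fraction of facts that are verified together with all their prerequisites.

    Assumes the dependency graph is acyclic (a hypothetical simplification).
    """
    cache: Dict[str, bool] = {}

    def holds(fact: str) -> bool:
        if fact not in cache:
            prerequisites_ok = all(holds(p) for p in depends_on.get(fact, []))
            # Only query the (stand-in) QA model when the prerequisites hold.
            cache[fact] = prerequisites_ok and qa_verify(fact)
        return cache[fact]

    return sum(holds(f) for f in facts) / len(facts)


# Toy stand-in for a VideoQA model: a lookup table of yes/no answers.
toy_answers = {"there is a dog": True, "the dog is brown": False, "the dog runs": True}
toy_facts = list(toy_answers)
toy_graph = {
    "the dog is brown": ["there is a dog"],
    "the dog runs": ["there is a dog"],
}

score = faithfulness_score(toy_facts, toy_graph, toy_answers.__getitem__)
print(round(score, 3))
```

In this toy case two of the three facts are verified, so the score is two thirds. Gating a fact on its prerequisites is one possible way to use the dependency graph, not necessarily the paper's.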
Article 176
Title@2025-07-09 (3): Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving
Title: Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving | LLMs nach ihrer Eignung lehren: Adaptive Begründung für mathematische Problemlösung | 根据其适应性:数学问题解决的适应性理由 2502.12022v3 |
Authors (11): Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
Existing approaches to mathematical reasoning with large language models (LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation. While efforts have been made to combine these methods, they primarily rely on post-selection or predefined strategies, leaving an open question: whether LLMs can autonomously adapt their reasoning strategy based on their inherent capabilities. In this work, we propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously, aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware data selection during supervised fine-tuning (SFT) to tailor training data to the model’s unique abilities. This approach equips LLMs to autonomously determine and apply the appropriate reasoning strategy at test time. We evaluate TATA through extensive experiments on six mathematical reasoning benchmarks, using both general-purpose and math-specialized LLMs. Empirical results demonstrate that TATA effectively combines the complementary strengths of CoT and TIR, achieving superior or comparable performance with improved inference efficiency compared to TIR alone. Further analysis underscores the critical role of aptitude-aware data selection in enabling LLMs to make effective and adaptive reasoning decisions and align reasoning strategies with model capabilities.
现有的大型语言模型(LLMs)数学推理方法依赖思维链(CoT)以获得泛化能力,或依赖工具集成推理(TIR)以实现精确计算。尽管已有工作尝试结合这两类方法,但它们主要依赖事后选择或预定义策略,留下了一个悬而未决的问题:LLM能否根据自身固有能力自主调整推理策略。在这项工作中,我们提出TATA(Teaching LLMs According to Their Aptitude),这是一个自适应框架,使LLM能够自发地个性化其推理策略,使之与其内在能力相符。TATA在监督微调(SFT)期间引入感知基础LLM的数据选择,使训练数据契合模型的独特能力。这种方法使LLM能够在测试时自主确定并应用合适的推理策略。我们使用通用LLM和数学专用LLM,在六个数学推理基准上通过大量实验评估TATA。实证结果表明,TATA有效结合了CoT与TIR的互补优势,取得了更优或相当的性能,且推理效率优于单独使用TIR。进一步的分析强调了能力感知的数据选择在使LLM做出有效且自适应的推理决策、并使推理策略与模型能力相一致方面的关键作用。
Article 177
Title@2025-07-09 (3): DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
Title: DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE | DeepTalk: Auf dem Weg zu nahtloser und intelligenter Sprachinteraktion mit adaptiver Modalität-spezifischer MoE | 深谈:实现与适应型模式具体部的无缝和智能语音互动 2506.21864v2 |
Authors (9): Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Lijiang Li, Zuwei Long, Bo Tong, Ke Li, Xing Sun
Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.
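原生多模态大型语言模型(MLLMs)将单个大型语言模型(LLM)重构为能够同时生成语音和文本的口语语言模型(SLM)。与模块化和对齐式MLLM相比,原生MLLM保留了更丰富的副语言特征,如情感和韵律,并直接在主干LLM内生成语音响应,而不使用单独的语音解码器。这种整合还带来更低的响应延迟和更流畅的交互。然而,原生MLLM存在灾难性遗忘和性能下降的问题,因为与预训练文本LLM所需的海量文本数据相比,可用的语音-文本配对数据不足以支撑MLLM的预训练。为了解决这一问题,我们提出DeepTalk,一个基于专家混合(MoE)架构的自适应模态专家学习框架。DeepTalk首先根据各专家在LLM中的模态负载自适应地区分模态专家。随后,每个模态专家先接受专门的单模态训练,再进行联合多模态协同训练。因此,与原始LLM相比,DeepTalk仅有5.5%的性能下降,显著低于原生MLLM(如GLM-4-Voice)通常出现的平均超过20%的性能下降,并与模块化MLLM相当。同时,端到端对话延迟保持在0.5秒以内,确保了无缝且智能的语音交互体验。代码和模型发布于 https://github.com/talkking/DeepTalk。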
Article 178
Title@2025-07-09 (3): Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning
Title: Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning | Double-Checker: Bessere Begründung von langsam denkenden LLMs über selbstkritische Feinsteuerung | 双重检查者:通过自批评性微调,加强慢思考低迷LMs的理由 2506.21285v2 |
Authors (14): Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the “aha moment”, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique. Our codes and data are available at https://github.com/XinXU-USTC/DoubleChecker
慢思考大型语言模型(LLMs)虽然表现出类似反思的推理(通常被称为“顿悟时刻”),但它们生成有信息量的批评并改进先前解答的能力仍然有限。在本文中,我们提出Double-Checker,一个有原则的框架,旨在通过促进显式的自我批评和对先前解答的迭代改进来增强慢思考LLM的推理能力。通过在我们整理的1,730个自我批评实例上进行微调,Double-Checker使长CoT LLM能够在推理过程中迭代地批评并改进其输出,直到它们在自生成的批评下判定自己的解答正确为止。我们在一套全面的推理基准上验证了Double-Checker的有效性,表明迭代式自我批评显著增强了长CoT LLM的推理能力。值得注意的是,与原始的长CoT LLM相比,Double-Checker将具有挑战性的AIME基准上的pass@1成绩从4.4%提高到18.2%。这些结果为开发更可信、更有效且能够进行结构化自我批评的LLM指明了一个有前景的方向。我们的代码和数据可在 https://github.com/XinXU-USTC/DoubleChecker 获取。
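The inference-time behaviour described in the Double-Checker abstract (critique, refine, stop once the model judges its own solution correct) can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `double_check`, the three callables, and the `max_rounds` cap are hypothetical names introduced here, and the toy arithmetic functions merely stand in for LLM calls.

```python
from typing import Any, Callable, Tuple


def double_check(
    problem: Any,
    solve: Callable[[Any], Any],
    critique: Callable[[Any, Any], Tuple[bool, Any]],
    refine: Callable[[Any, Any, Any], Any],
    max_rounds: int = 3,  # hypothetical cap; the abstract gives no limit
) -> Tuple[Any, int]:
    """Iteratively critique and refine until the critique accepts the solution."""
    solution = solve(problem)
    for round_idx in range(max_rounds):
        is_correct, feedback = critique(problem, solution)
        if is_correct:
            return solution, round_idx
        solution = refine(problem, solution, feedback)
    return solution, max_rounds


# Toy stand-ins for the model calls: add two numbers, initially off by one.
def toy_solve(problem):
    a, b = problem
    return a + b - 1  # deliberately wrong first attempt


def toy_critique(problem, solution):
    a, b = problem
    error = (a + b) - solution
    return error == 0, error  # (verdict, feedback)


def toy_refine(problem, solution, feedback):
    return solution + feedback


answer, rounds = double_check((17, 25), toy_solve, toy_critique, toy_refine)
print(answer, rounds)
```

Here the first critique rejects the attempt, one refinement fixes it, and the second critique accepts it, so the loop stops after a single refinement round.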
Article 179
Title@2025-07-09 (3): SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers
Title: SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers | SpindleKV: Eine neuartige KV-Cache-Reduktionsmethode, die sowohl schwammige als auch tiefe Schichten ausbalanciert | SpindleKV: 浅层和深层平衡的新KV缓存减少方法 2507.06517v1 |
Authors (7): Zicong Tang, Shi Luohe, Zuchao Li, Baoyuan Qi, Guoming Liu, Lefei Zhang, Ping Wang
Large Language Models (LLMs) have achieved impressive accomplishments in recent years. However, the increasing memory consumption of the KV cache poses a significant challenge to inference systems. Eviction methods have revealed the inherent redundancy within the KV cache, demonstrating its potential for reduction, particularly in deeper layers. However, KV cache reduction for shallower layers has been found to be insufficient. Based on our observation that the KV cache exhibits a high degree of similarity, we propose a novel KV cache reduction method, SpindleKV, which balances both shallow and deep layers. For deep layers, we employ an attention-weight-based eviction method, while for shallow layers, we apply a codebook-based replacement approach which is learnt by similarity and merging policy. Moreover, SpindleKV addresses the Grouped-Query Attention (GQA) dilemma faced by other attention-based eviction methods. Experiments on two common benchmarks with three different LLMs show that SpindleKV obtains a better KV cache reduction effect than baseline methods, while preserving similar or even better model performance.
大型语言模型(LLMs)近年来取得了令人印象深刻的成就。然而,KV缓存的内存消耗量不断增长,对推断系统提出了重大挑战。驱逐方法揭示了KV缓存内固有的冗余,表明它有可能减少,特别是在更深的层层中。然而,发现对浅层的KV缓存减少不够充分。根据我们的观察,KV缓存显示出高度相似性。基于这一观察,我们建议采用一种新的KV缓存减少方法SpindleKV,它平衡浅层和深层。对于深层,我们采用了基于注意的权重驱逐方法,而对于浅层,我们采用了基于代码的代用方法,通过相似性和合并政策来学习。此外,SpindleKV谈到了基于其他注意方法所面临的GQA困境。实验用三个不同的LMSingdleKV的共同基准显示,SpindleKV比基线方法得到更好的KV缓存效果,同时保持相似或更好的模型性。
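To make the deep-layer half of the SpindleKV abstract concrete, here is a minimal sketch of generic attention-weight-based eviction: cached positions are scored by the attention they received and only the highest-scoring ones are kept. The function name, the scoring rule (summing attention over recent queries), and the `budget` parameter are assumptions for illustration; this is not SpindleKV's actual procedure, and the codebook-based replacement for shallow layers is not shown.

```python
from typing import List


def evict_by_attention(attn_rows: List[List[float]], budget: int) -> List[int]:
    """Return the cache positions to keep, in their original order.

    attn_rows[i][j] is the attention weight that recent query i puts on
    cached position j. Positions are ranked by accumulated attention and
    the top `budget` survive (hypothetical scoring rule).
    """
    num_positions = len(attn_rows[0])
    scores = [sum(row[j] for row in attn_rows) for j in range(num_positions)]
    ranked = sorted(range(num_positions), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])


# Two recent queries attending over four cached tokens.
recent_attention = [
    [0.50, 0.10, 0.30, 0.10],
    [0.40, 0.05, 0.05, 0.50],
]
kept = evict_by_attention(recent_attention, budget=2)
print(kept)
```

With these toy weights, positions 0 and 3 accumulate the most attention and are retained; the keys and values at positions 1 and 2 would be dropped from the cache.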
Article 180
Title@2025-07-09 (3): Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs
Title: Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs | Breaking PEFT Limitations: Schwach-zu-starken Wissenstransfer für Backdoor-Angriffe in LLMs | 打破PEFT的局限性:利用微弱至收缩知识转让来利用LLMs进行后门攻击 2409.17946v4 |
Authors (8): Shuai Zhao, Leilei Gan, Zhongliang Guo, Xiaobao Wu, Yanhao Jia, Luwei Xiao, Cong-Duy Nguyen, Luu Anh Tuan
Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning (FPFT). However, these backdoor attacks are limited because they require significant computational resources, especially as the size of LLMs increases. Parameter-efficient fine-tuning (PEFT) offers an alternative, but the restricted parameter updating may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel weak-to-strong backdoor attack algorithm based on Feature Alignment-enhanced Knowledge Distillation (FAKD). Specifically, we poison small-scale language models through FPFT to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through FAKD, which employs PEFT. Theoretical analysis reveals that FAKD has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of FAKD on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.
大型语言模型(LLMs)尽管因其卓越能力而被广泛应用,但已被证明容易受到后门攻击。这类攻击通过毒化训练样本和全参数微调(FPFT)向LLM引入有针对性的漏洞。然而,这类后门攻击存在局限,因为它们需要大量计算资源,随着LLM规模的增长尤其如此。此外,参数高效微调(PEFT)提供了一种替代方案,但受限的参数更新可能妨碍触发器与目标标签的对齐。在本研究中,我们首先验证了使用PEFT的后门攻击在取得可行的性能方面可能遇到困难。为了解决这些问题并提高PEFT后门攻击的有效性,我们提出了一种基于特征对齐增强的知识蒸馏(FAKD)的由弱到强的新型后门攻击算法。具体而言,我们通过FPFT对小规模语言模型投毒,将其作为教师模型。随后,教师模型通过采用PEFT的FAKD将后门隐蔽地迁移到大规模学生模型。理论分析表明,FAKD有潜力增强后门攻击的有效性。我们在四种语言模型、四种后门攻击算法和两种不同的教师模型架构上的分类任务中展示了FAKD的优越性能。实验结果表明,针对PEFT的后门攻击成功率接近100%。
Article 181
Title@2025-07-09 (3): Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings
Title: Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings | Pun Intended: Multi-Agent Übersetzung von Wordplay mit Kontrastivem Lernen und phonetisch-semantischen Einbettungen | 未预期:多代理翻译与矛盾学习和语音-语言嵌入的文字游戏的多代理翻译 2507.06506v1 |
Authors (3): Russell Taylor, Benjamin Herbert, Michael Sana
Translating wordplay across languages presents unique challenges that have long confounded both professional human translators and machine translation systems. This research proposes a novel approach for translating puns from English to French by combining state-of-the-art large language models with specialized techniques for wordplay generation. Our methodology employs a three-stage approach. First, we establish a baseline using multiple frontier large language models with feedback based on a new contrastive learning dataset. Second, we implement a guided chain-of-thought pipeline with combined phonetic-semantic embeddings. Third, we implement a multi-agent generator-discriminator framework for evaluating and regenerating puns with feedback. Moving beyond the limitations of literal translation, our methodology’s primary objective is to capture the linguistic creativity and humor of the source text wordplay, rather than simply duplicating its vocabulary. Our best runs earned first and second place in the CLEF JOKER 2025 Task 2 competition where they were evaluated manually by expert native French speakers. This research addresses a gap between translation studies and computational linguistics by implementing linguistically-informed techniques for wordplay translation, advancing our understanding of how language models can be leveraged to handle the complex interplay between semantic ambiguity, phonetic similarity, and the implicit cultural and linguistic awareness needed for successful humor.
翻译跨语言的文字游戏提出了独特的挑战,这些挑战长期以来一直困扰着专业的人类翻译员和机器翻译系统。本研究提出了将最先进的大语言模型与文字剧生成的专门技术相结合,把英语译成法语的新颖方法。我们的方法采用三阶段方法。首先,我们利用基于新的对比学习数据集的反馈,用多种前沿大语言模型建立一个基线。第二,我们实施一个带指导的思维链管道,将语音-语义嵌入结合起来。第三,我们实施一个多试剂的生成器/差异器框架,用于评估和用反馈重新生成标本。超越了文字翻译的局限性,我们的方法的主要目标是捕捉取原始文字剧的语言创造力和幽默感,而不是简单地复制其词汇。我们的最佳运行在CLEF JOKER 2025任务2中获得了第一和第二名的反馈。我们的研究通过应用语言智能化翻译技术解决了翻译和计算语言语言学之间的差距,推进了我们对于复杂语言模范模式的理解,从而能够利用复杂的语言模棱两极性来控制语言模型。
Article 182
Title@2025-07-09 (3): GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification
Title: GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification | GMLM: Überbrückung von Graph Neuronalen Netzwerken und Sprachmodellen für heterophile Knotenklassifikation | GMLM: 血氧哲学节点分类的架接图形神经网络和语言模型 2503.05763v5 |
Authors (1): Aarush Sinha
Integrating powerful but computationally expensive Pre-trained Language Models (PLMs) with Graph Neural Networks (GNNs) is a key challenge, especially on text-rich heterophilic graphs. We propose the Graph Masked Language Model (GMLM), a framework designed for the efficient and effective fusion of graph structure and text semantics. GMLM employs a two-stage process: first, a contrastive pre-training stage with a novel soft masking technique builds a robust multi-scale GNN; second, an end-to-end fine-tuning stage uses a dynamic active node selection strategy for scalability and a bi-directional cross-attention module for deep fusion. Experiments on five heterophilic benchmarks show GMLM achieves state-of-the-art results on four, significantly outperforming prior GNN and large LLM-based methods. For instance, it improves accuracy on the Texas dataset by over 8% and on Wisconsin by nearly 5%. Our work demonstrates that a sophisticated, deeply-integrated architecture can be more effective and efficient than larger, general-purpose models for text-rich graph representation learning.
将强大但计算上昂贵的预科语言模型(PLM)与图表神经网络(GNN)结合是一个关键的挑战,特别是在文本丰富的热活性图形方面。我们提议了图形遮盖语言模型(GMLM),这是一个旨在高效和有效地融合图形结构和文字语义的框架。GMLM采用一个两阶段过程:首先,一个具有新颖软面罩技术的对比性培训前阶段,建立起一个强大的多级软面罩技术;第二,一个端到端微调阶段,使用动态的主动节点选择战略,用于可伸缩性,以及一个双向交叉注意模块,用于深度聚变。关于五种电子遮盖语言模型的实验显示,GMLM在四种情况下取得了最新的结果,大大超过GNN和大型LM方法。例如,它提高了德州数据集的精确度,由8以上和威斯康星州数据组合的精确度,近5。我们的工作表明,一个精密、深入一体化的结构可以比更大规模、通用的文本学习图形模型更有效和高效。
Article 183
Title@2025-07-09 (3): ModelCitizens: Representing Community Voices in Online Safety
Title: ModelCitizens: Representing Community Voices in Online Safety | ModelCitizens: Vertretung von Community Voices in Online-Sicherheit | 模范公民:在网上安全方面代表社区声音 2507.05455v2 |
Authors (7): Ashima Suvarna, Christina Chance, Karolina Naranjo, Hamid Palangi, Sophie Hao, Thomas Hartvigsen, Saadia Gabriel
Automatic toxic language detection is critical for creating safe, inclusive online spaces. However, it is a highly subjective task, with perceptions of toxic language shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. To capture the role of conversational context on toxicity, typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on context-augmented posts. Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on MODELCITIZENS, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation. The data, models and code are available at https://github.com/asuvarna31/modelcitizens.
自动有毒语言检测对于创建安全、包容的在线空间至关重要。然而,这是一项高度主观的任务,对有毒语言的感知受社区规范和生活经验的影响。现有的毒性检测模型通常基于把不同标注者视角压缩为单一标准答案的标注进行训练,抹去了重要的、依赖语境的毒性概念,例如被重新夺回使用的语言(reclaimed language)。为了解决这一问题,我们引入MODELCITIZENS,这是一个包含6.8K条社交媒体帖子和40K条毒性标注的数据集,覆盖多种身份群体。为了刻画社交媒体帖子中常见的对话语境对毒性的影响,我们用LLM生成的对话场景扩充了MODELCITIZENS的帖子。最先进的毒性检测工具(例如OpenAI Moderation API、GPT-o4-mini)在MODELCITIZENS上表现不佳,在语境扩充的帖子上表现进一步下降。最后,我们发布了LLAMACITIZEN-8B和GEMMACITIZEN-12B,即在MODELCITIZENS上微调的基于LLaMA和Gemma的模型,它们在分布内评估中比GPT-o4-mini高出5.5%。我们的研究结果凸显了由社区参与的标注和建模对于包容性内容审核的重要性。数据、模型和代码可在 https://github.com/asuvarna31/modelcitizens 获取。
Article 184
Title@2025-07-09 (3): Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations
Title: Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations | Verfeinerung von gesketen Wahrnehmungen in visionssprachlichen Kontrastmodellen durch visuelle Darstellungen | 通过视觉表现方式改进视觉语言对比模型中偏斜的观念 2405.14030v3 |
Authors (2): Haocheng Dai, Sarang Joshi
Large vision-language contrastive models (VLCMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLCM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP’s embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations inherited from the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more effective for refining the skewed perceptions in VLCMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our code can be found here.
大型视觉语言对比模型(VLCM),如CLIP等,已成为基础模型,表明在各种下游任务中取得了显著成功。这些模型尽管有其优势,但与其他基础系统相似,继承了真实世界数据不相称分布的偏差,导致对实际环境产生误解。像图像网这样的前方数据集往往充满非因果的、虚假的关联,在缺乏这些背景要素的情况下,它们可能削弱VLCM的性能。本研究对简单线性探测器如何有效地从 CLIP 嵌入下游应用中提取具体任务的核心特征进行了调查。我们的分析显示,CLIP 文本表达方式经常受到错误的关联的污染,这些关联在偏颇的培训前数据集中被继承。 实证证据表明,依靠CLIP 的视觉表达方式而不是文字嵌入式,对于完善VLCMS的扭曲的认知方式更为有效,强调视觉表达方式在克服嵌入偏见方面的优势。这里可以找到我们的代码。
Article 185
Title@2025-07-09 (3): Can adversarial attacks by large language models be attributed?
Title: Can adversarial attacks by large language models be attributed? | Können feindliche Angriffe von großen Sprachmodellen zugeschrieben werden? | 大型语言模式的对抗性攻击能否归结为对抗性攻击? 2411.08003v2 |
Authors (3): Manuel Cebrian, Andres Abeliuk, Jan Arne Telle
Attributing outputs from Large Language Models (LLMs) in adversarial settings, such as cyberattacks and disinformation campaigns, presents significant challenges that are likely to grow in importance. We approach this attribution problem from both a theoretical and an empirical perspective, drawing on formal language theory (identification in the limit) and data-driven analysis of the expanding LLM ecosystem. By modeling an LLM’s set of possible outputs as a formal language, we analyze whether finite samples of text can uniquely pinpoint the originating model. Our results show that, under mild assumptions of overlapping capabilities among models, certain classes of LLMs are fundamentally non-identifiable from their outputs alone. We delineate four regimes of theoretical identifiability: (1) an infinite class of deterministic (discrete) LLM languages is not identifiable (Gold’s classical result from 1967); (2) an infinite class of probabilistic LLMs is also not identifiable (by extension of the deterministic case); (3) a finite class of deterministic LLMs is identifiable (consistent with Angluin’s tell-tale criterion); and (4) even a finite class of probabilistic LLMs can be non-identifiable (we provide a new counterexample establishing this negative result). Complementing these theoretical insights, we quantify the explosion in the number of plausible model origins (hypothesis space) for a given output in recent years. Even under conservative assumptions (each open-source model fine-tuned on at most one new dataset), the count of distinct candidate models doubles approximately every 0.5 years, and allowing multi-dataset fine-tuning combinations yields doubling times as short as 0.28 years. This combinatorial growth, alongside the extraordinary computational cost of brute-force likelihood attribution across all models and potential users, renders exhaustive attribution infeasible in practice.
在对抗性环境(如网络攻击和虚假信息活动)中对大型语言模型(LLMs)的输出进行归因,带来了重大挑战,而且这些挑战的重要性很可能不断上升。我们从理论和实证两个角度研究这一归因问题,借助形式语言理论(极限可辨识性)以及对不断扩张的LLM生态系统的数据驱动分析。通过把一个LLM的可能输出集合建模为一种形式语言,我们分析有限的文本样本能否唯一确定其来源模型。我们的结果表明,在模型之间能力重叠的温和假设下,某些类别的LLM从根本上无法仅凭其输出被辨识。我们划分了理论可辨识性的四种情形:(1) 无限的确定性(离散)LLM语言类不可辨识(Gold 1967年的经典结果);(2) 无限的概率性LLM类同样不可辨识(由确定性情形推广而来);(3) 有限的确定性LLM类可辨识(与Angluin的tell-tale准则一致);(4) 即使是有限的概率性LLM类也可能不可辨识(我们给出了一个新的反例来确立这一否定结果)。作为对这些理论见解的补充,我们量化了近年来给定输出的可能来源模型数量(假设空间)的爆炸式增长。即使在保守假设下(每个开源模型至多在一个新数据集上微调),不同候选模型的数量也大约每0.5年翻一番;若允许多数据集微调组合,翻倍时间可短至0.28年。这种组合式增长,加上在所有模型和潜在用户上进行暴力似然归因的巨大计算成本,使得穷举式归因在实践中不可行。
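The doubling times quoted in the attribution abstract above can be turned into yearly growth factors with a short calculation. This is only a worked illustration of the arithmetic: it assumes clean exponential growth, and the function name is introduced here rather than taken from the paper.

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Multiplicative growth after `years` if the count doubles every
    `doubling_time_years` (assumes idealized exponential growth)."""
    return 2 ** (years / doubling_time_years)


conservative = growth_factor(1.0, 0.5)    # single-dataset fine-tunes
combinatorial = growth_factor(1.0, 0.28)  # multi-dataset combinations

print(f"0.5-year doubling:  x{conservative:.1f} per year")
print(f"0.28-year doubling: x{combinatorial:.1f} per year")
```

Under these assumptions, a 0.5-year doubling time means the candidate set quadruples each year, while a 0.28-year doubling time gives roughly a twelvefold increase per year, which is why exhaustive likelihood-based attribution becomes impractical so quickly.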
Article 186
Title@2025-07-09 (3): TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation
Title: TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation | TokenSwift: Lossless Acceleration der Ultra Long Sequence Generation | TokenSwift: 超长序列一代无损加速 2502.18890v2 |
Authors (5): Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng
Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model’s inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.
以大型语言模型(LLMS)生成超长序列已变得越来越重要,但仍然是一项高度耗时的任务,特别是100K类符号的序列。传统的投机性解码方法虽然存在,但只是延长其生成限制并不能加速进程,而且可能有害。通过深入分析,我们确定了阻碍高效生成的三大挑战:频繁重载模型、动态键值(KV)管理和重复生成。为解决这些问题,我们引入了TOKENSWIFT,这是一个新颖的框架,旨在大大加快超长序列的生成过程,同时保持目标模型的固有质量。实验结果显示,TOKENSWIFT在不同规模(1.5B、7B、8B、14B)和结构(MAHA、GQA)和结构(MHA、GQA)之间实现了三次以上的加速。这种加速将超长序列生成的时间节省到几个小时,将TOKENSWIFT确定为一个空前的可扩展和有效的解决方案。代码可以在 https://github.com/bigig-nlco/TokenSwift上找到。
Article 187
Title@2025-07-09 (3): On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks
Title: On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks | Über die Robustheit des verbalen Vertrauens von LLMs bei feindlichen Angriffen | 关于在相互攻击中女优女士的口头信任的有力性 2507.06489v1 |
Authors (2): Stephen Obadinma, Xiaodan Zhu
Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to ensure transparency, trust, and safety in human-AI interactions across many high-stakes applications. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods, and show that these attacks can significantly jeopardize verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current confidence elicitation methods are vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the urgent need to design more robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.
大型语言模型(LLMs)产生的强烈口头信心对于部署LLMs至关重要,以确保在人类-AI互动中,在众多高接触应用中的透明度、信任和安全性。在本文件中,我们提出了关于在对抗性攻击中口头信心强健度的第一份全面研究报告。我们引入了一个新的框架,通过扰动和以破狱为基础的方法打击口头信心分数,并表明这些攻击可能大大危及口头信心估计,导致频繁的回答变化。我们审视了各种激励性战略、模型大小和应用领域,揭示了目前的信任获取方法非常脆弱,而且常用的防御技术基本上无效或适得其反。我们的调查结果强调,迫切需要设计更强有力的机制,在LLMs中表达信心,因为即使是微妙的语义保留修改也可能导致误导对反应的信心。
Article 188
Title@2025-07-09 (3): Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making
Title: Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making | Können LLMs das Spiel Ô Ăn Quan spielen? Eine Studie über mehrstufige Planung und Entscheidungsfindung | LLM能玩Ô Ăn Quan游戏吗?多步规划与决策研究 2507.03711v3 |
Authors (6): Sang Quang Nguyen, Kiet Van Nguyen, Vinh-Tiep Nguyen, Thanh Duc Ngo, Ngan Luu-Thuy Nguyen, Duy-Dinh Le
In this paper, we explore the ability of large language models (LLMs) to plan and make decisions through the lens of the traditional Vietnamese board game, Ô Ăn Quan. This game, which involves a series of strategic token movements and captures, offers a unique environment for evaluating the decision-making and strategic capabilities of LLMs. Specifically, we develop various agent personas, ranging from aggressive to defensive, and employ the Ô Ăn Quan game as a testbed for assessing LLM performance across different strategies. Through experimentation with models like Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct, we aim to understand how these models execute strategic decision-making, plan moves, and manage dynamic game states. The results will offer insights into the strengths and weaknesses of LLMs in terms of reasoning and strategy, contributing to a deeper understanding of their general capabilities.
在本文中,我们以越南传统棋盘游戏Ô Ăn Quan为视角,探索大型语言模型(LLMs)的规划与决策能力。该游戏涉及一系列策略性的棋子移动与吃子,为评估LLMs的决策和战略能力提供了独特的环境。具体而言,我们设计了从进攻型到防守型的多种智能体角色,并以Ô Ăn Quan游戏作为测试平台,评估LLM在不同策略下的表现。通过对Llama-3.2-3B-Instruct、Llama-3.1-8B-Instruct和Llama-3.3-70B-Instruct等模型进行实验,我们旨在了解这些模型如何执行战略决策、规划走法并管理动态的游戏状态。研究结果将揭示LLMs在推理和策略方面的长处与不足,有助于更深入地理解其通用能力。
Article 189
Title@2025-07-09 (3): Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
Title: Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning | Video-RTS: Reinforcement Learning und Test-Time Scaling neu gedacht für effizientes und verbessertes Video-Reasoning | Video-RTS:重新思考强化学习与测试时扩展,实现高效且增强的视频推理 2507.06485v1 |
Authors (6): Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal
Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy while using only 3.6% of the training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS’s strong reasoning performance.
尽管基于强化学习(RL)的大型语言模型(LLMs)视频推理取得了进展,但数据收集和微调仍然是重大挑战。这些方法往往依赖大规模监督微调(SFT),需要大量视频数据和长思维链(CoT)标注,因此成本高且难以扩展。为此,我们提出Video-RTS,这是一种通过将数据高效的RL与视频自适应测试时扩展(TTS)策略相结合、以大幅提高的数据效率改进视频推理能力的新方法。基于对RL样本数据扩展规律的观察,我们跳过资源密集的SFT步骤,采用基于输出奖励的高效纯RL训练,无需额外标注或大量微调。此外,为了更高效地利用计算资源,我们引入由稀疏到稠密的视频TTS策略,根据输出一致性迭代地增加帧数以改进推理。我们在多个视频推理基准上验证了该方法,结果表明Video-RTS仅使用3.6%的训练样本,准确率就平均超过现有视频推理模型2.4%。例如,Video-RTS在近期提出且具有挑战性的视频推理基准Video-Holmes上提升4.2%,在MMVU上提升2.6%。值得注意的是,纯RL训练与自适应视频TTS优势互补,使Video-RTS具备强大的推理性能。
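The sparse-to-dense test-time scaling idea in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `answer_fn` is a hypothetical stand-in for a video reasoning model, and the frame budgets, the doubling schedule, and the "all samples agree" consistency rule are all illustrative choices.

```python
from collections import Counter
from typing import Callable, List, Sequence


def sample_frames(video: Sequence, n: int) -> List:
    """Uniformly pick n frames (or every frame if the video is shorter)."""
    if n >= len(video):
        return list(video)
    step = len(video) / n
    return [video[int(i * step)] for i in range(n)]


def sparse_to_dense_answer(
    video: Sequence,
    answer_fn: Callable[[List, int], str],  # hypothetical: (frames, seed) -> answer
    start_frames: int = 4,
    max_frames: int = 32,
    num_samples: int = 3,
) -> str:
    """Start with few frames; add more only while sampled answers disagree."""
    n = start_frames
    while True:
        frames = sample_frames(video, n)
        answers = [answer_fn(frames, seed) for seed in range(num_samples)]
        top_answer, votes = Counter(answers).most_common(1)[0]
        # Assumed consistency rule: stop when all samples agree,
        # or when the frame budget is exhausted.
        if votes == num_samples or n >= max_frames:
            return top_answer
        n = min(n * 2, max_frames)
```

The point of the design is that easy questions terminate at the sparse setting, so dense frame sampling is only paid for when the model's outputs are unstable.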
Article 190
Title@2025-07-09 (3): Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents
Title: Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents | Japanisch lernen mit Jouzu: Interaktionsergebnisse mit stilisierten fiktiven Dialogagenten | 与Jouzu一起学习日语:与风格化对话虚构智能体的互动结果 2507.06483v1 |
Authors (2): Zackary Rackauckas, Julia Hirschberg
This study investigates how stylized, voiced agents shape user interaction in a multimodal language learning environment. We conducted a mixed-methods evaluation of 54 participants interacting with anime-inspired characters powered by large language models and expressive text-to-speech synthesis. These agents responded in Japanese character language, offering users asynchronous, semi-structured conversation in varying speech styles and emotional tones. We analyzed user engagement patterns, perceived usability, emotional responses, and learning behaviors, with particular attention to how agent stylization influenced interaction across language proficiency levels and cultural backgrounds. Our findings reveal that agent design, especially voice, persona, and linguistic style, substantially affected user experience, motivation, and strategy. This work contributes to the understanding of affective, culturally stylized agents in human-agent interaction and offers guidance for designing more engaging, socially responsive systems.
这项研究调查了在多语种学习环境中的典型化代言人如何塑造用户互动。我们对54名参与者进行了混合评估,他们与由大型语言模型和表达式文本到语音合成所驱动的动因型人物进行互动。这些代言人用日文语言进行了回应,以不同语言风格和情感调子向用户提供非同步、半结构式对话。我们分析了用户参与模式、感觉到的可用性、情感反应和学习行为,尤其关注了该代言体如何影响不同语言熟练程度和文化背景的互动。我们的调查结果显示,该代言的设计,特别是声音、人和语言风格,严重影响了用户的经验、动力和战略。这项工作有助于理解人体-代理人互动中的具有情感、文化型的代言者,并为设计更具参与性、具有社会敏感性的系统提供了指导。
Article 191
Title@2025-07-09 (3): InfoTech Assistant: A Multimodal Conversational Agent for InfoTechnology Web Portal Queries
Title: InfoTech Assistant: A Multimodal Conversational Agent for InfoTechnology Web Portal Queries | InfoTech Assistant: Ein multimodaler Konversationsagent für Anfragen an das InfoTechnology-Webportal | InfoTech Assistant:面向InfoTechnology门户网站查询的多模态对话智能体 2412.16412v2 |
Authors (5): Sai Surya Gadiraju, Duoduo Liao, Akhila Kudupudi, Santosh Kasula, Charitha Chalasani
This pilot study presents the development of the InfoTech Assistant, a domain-specific, multimodal chatbot engineered to address queries in bridge evaluation and infrastructure technology. By integrating web data scraping, large language models (LLMs), and Retrieval-Augmented Generation (RAG), the InfoTech Assistant provides accurate and contextually relevant responses. Data, including textual descriptions and images, are sourced from publicly available documents on the InfoTechnology website and organized in JSON format to facilitate efficient querying. The architecture of the system includes an HTML-based interface and a Flask back end connected to the Llama 3.1 model via LLM Studio. Evaluation results show approximately 95 percent accuracy on domain-specific tasks, with high similarity scores confirming the quality of response matching. This RAG-enhanced setup enables the InfoTech Assistant to handle complex, multimodal queries, offering both textual and visual information in its responses. The InfoTech Assistant demonstrates strong potential as a dependable tool for infrastructure professionals, delivering high accuracy and relevance in its domain-specific outputs.
这份试点研究介绍了InfoTech助理的开发情况,该助理是专门为处理桥梁评估和基础设施技术中的查询而设计的域域、多式联运聊天室。通过整合网络数据剪切、大语言模型和检索启动一代(RAG),InfoTech助理提供了准确和符合背景的答复。数据,包括文字描述和图像,来源于信息技术网站的公开文件,并以JSON格式组织,以便利有效查询。该系统的结构包括基于HTML的界面和通过LLLLM Studio连接Llama 3.1模型的Flask后端。评价结果显示,在特定领域任务上大约95%的准确度,高相似性分数确认响应的质量。这一RAG-encancup使InfoTech助理能够处理复杂、多式查询,在答复中提供文字和视觉信息。InfoTech助理展示了作为基础设施专业人员可靠工具的巨大潜力,在其具体领域产出中提供高精度和相关性。
Article 192
Title@2025-07-09 (3): Theme-Explanation Structure for Table Summarization using Large Language Models: A Case Study on Korean Tabular Data
Title: Theme-Explanation Structure for Table Summarization using Large Language Models: A Case Study on Korean Tabular Data | Thema-Erläuterungsstruktur für die Tabellenzusammenfassung mit großen Sprachmodellen: Eine Fallstudie zu koreanischen Tabellendaten | 使用大语言模型对表格进行汇总的主题解释结构:韩国表格数据案例研究 2501.10487v3 |
Authors (5): TaeYoon Kwack, Jisoo Kim, Ki Yong Jung, DongGeon Lee, Heesun Park
Tables are a primary medium for conveying critical information in administrative domains, yet their complexity hinders utilization by Large Language Models (LLMs). This paper introduces the Theme-Explanation Structure-based Table Summarization (Tabular-TX) pipeline, a novel approach designed to generate highly interpretable summaries from tabular data, with a specific focus on Korean administrative documents. Current table summarization methods often neglect the crucial aspect of human-friendly output. Tabular-TX addresses this by first employing a multi-step reasoning process to ensure deep table comprehension by LLMs, followed by a journalist persona prompting strategy for clear sentence generation. Crucially, it then structures the output into a Theme Part (an adverbial phrase) and an Explanation Part (a predicative clause), significantly enhancing readability. Our approach leverages in-context learning, obviating the need for extensive fine-tuning and associated labeled data or computational resources. Experimental results show that Tabular-TX effectively processes complex table structures and metadata, offering a robust and efficient solution for generating human-centric table summaries, especially in low-resource scenarios.
表格是传播行政领域关键信息的主要媒介,但其复杂性阻碍了大语言模型(LLMs)的利用。本文介绍了基于主题的基于扩展结构的表格汇总(TABLA-TX)管道,这是一种新颖的方法,旨在从表格数据中产生高度可解释的摘要,特别侧重于韩国的行政文件。目前的表格汇总方法往往忽视了人类友好产出的关键方面。TBAL-TX首先采用多步推理程序来解决这一问题,以确保LLMS能够深入理解表格,然后由记者为清晰的生成提供提示战略。关键是,它随后将产出结构成一个主题部分(一个词词词)和一个解释部分(一个预设条款),大大提高了可读性。我们的方法利用了文字学习,避免了广泛微调和相关标签数据或计算资源的必要性。实验结果表明,TABL-TX有效地处理复杂的表格结构和元数据,为生成以人为中心的表格概要提供了可靠而有效的解决方案,特别是在低资源情景下。
Article 193
Title@2025-07-08 (2): A Systematic Analysis of Hybrid Linear Attention
Title: A Systematic Analysis of Hybrid Linear Attention | Eine systematische Analyse der hybriden linearen Aufmerksamkeit | 系统分析混合线性注意 2507.06457v1 |
Authors (11): Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian
Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate various linear attention models across generations - vector recurrences to advanced gating mechanisms - both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall significantly improves with increased full attention layers, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced at https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
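Transformer在处理长序列时面临二次复杂度和内存问题,这促使人们采用使用固定大小隐藏状态的线性注意力机制。然而,线性模型的召回性能往往有限,由此产生了结合线性注意力层与全注意力层的混合架构。尽管混合架构已有大量研究,线性注意力组件的选择却尚未得到深入探讨。我们系统地评估了各代线性注意力模型(从向量递归到先进的门控机制)在独立使用和混合使用时的表现。为支持这一全面分析,我们训练并开源了72个模型:36个为340M参数(20B词元),36个为1.3B参数(100B词元),涵盖六种线性注意力变体和五种混合比例。在标准语言建模和召回任务上的基准测试表明,独立使用时更优的线性模型在混合架构中未必表现出色。语言建模性能在不同的线性与全注意力比例下保持稳定,而召回性能随全注意力层的增加而显著提升,在比例低于3:1时尤为明显。我们的研究表明,选择性门控、层级递归和受控遗忘对于有效的混合模型至关重要。我们建议采用HGRN-2或GatedDeltaNet等架构,并将线性与全注意力的比例设在3:1至6:1之间,以高效实现Transformer级别的召回能力。我们的模型开源于 https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e。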
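To make the "linear-to-full ratio" in the abstract above concrete, here is a small sketch of how a layer stack with a given ratio could be laid out. The interleaving rule (one full-attention layer after every `linear_per_full` linear layers) and the function name `hybrid_layout` are assumptions for illustration; the abstract does not say where the full-attention layers are placed.

```python
from typing import List


def hybrid_layout(num_layers: int, linear_per_full: int) -> List[str]:
    """Return layer types for a stack that repeats `linear_per_full`
    linear-attention layers followed by one full-attention layer.

    A 3:1 linear-to-full ratio corresponds to linear_per_full=3.
    """
    if num_layers <= 0 or linear_per_full <= 0:
        raise ValueError("num_layers and linear_per_full must be positive")
    block = linear_per_full + 1
    return [
        "full" if (i + 1) % block == 0 else "linear"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    # 8 layers at a 3:1 ratio: three linear layers, then one full layer, twice.
    print(hybrid_layout(8, 3))
```

Under this reading, moving from 6:1 to 3:1 roughly doubles the number of full-attention layers, which is the direction in which the abstract reports recall improving.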
Article 194
Title@2025-07-08 (2): A Semantic Parsing Framework for End-to-End Time Normalization
Title: A Semantic Parsing Framework for End-to-End Time Normalization | Ein semantisches Parsing-Framework für die End-to-End-Zeit-Normalisierung | 端到端时间规范化的语义解析框架 2507.06450v1 |
Authors (4): Xin Su, Sungduk Yu, Phillip Howard, Steven Bethard
Time normalization is the task of converting natural language temporal expressions into machine-readable representations. It underpins many downstream applications in information retrieval, question answering, and clinical decision-making. Traditional systems based on the ISO-TimeML schema limit expressivity and struggle with complex constructs such as compositional, event-relative, and multi-span time expressions. In this work, we introduce a novel formulation of time normalization as a code generation task grounded in the SCATE framework, which defines temporal semantics through symbolic and compositional operators. We implement a fully executable SCATE Python library and demonstrate that large language models (LLMs) can generate executable SCATE code. Leveraging this capability, we develop an automatic data augmentation pipeline using LLMs to synthesize large-scale annotated data with code-level validation. Our experiments show that small, locally deployable models trained on this augmented data can achieve strong performance, outperforming even their LLM parents and enabling practical, accurate, and interpretable time normalization.
时间正常化是将自然语言时间表达方式转换为机器可读表达方式的任务。 它支持信息检索、问答和临床决策中许多下游应用。 基于ISO-TimeML Schema的传统系统限制表达性和与组成、事件相对性和多空间时间表达方式等复杂结构的争斗。 在这项工作中,我们引入了时间正常化新颖的提法,作为基于SCATE框架的代码生成任务,该框架通过象征性和组成操作者定义了时间语义。 我们实施了完全可执行的SCATE Python 图书馆,并展示了大型语言模型(LLMs)能够生成可执行的 SATE 代码。 我们利用这一能力开发了一个自动数据增强管道,使用LLMS 来合成大规模附加代码级验证的数据。 我们的实验显示,在这种增强数据上培训的小型、可本地部署的模式能够取得很强的性能,甚至优于他们的LM父母,并且能够实现实用、准确和可解释的时间正常化。
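The idea of treating time normalization as code generation can be illustrated with a toy set of compositional operators. The names below (`Interval`, `day_of`, `last`, `next_weekday`) are hypothetical and are not the API of the SCATE Python library mentioned in the abstract; the sketch only shows how a temporal expression becomes an executable composition whose result is a machine-readable interval.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Interval:
    start: datetime  # inclusive
    end: datetime    # exclusive


def day_of(moment: datetime) -> Interval:
    """The calendar day containing `moment`."""
    start = datetime(moment.year, moment.month, moment.day)
    return Interval(start, start + timedelta(days=1))


def last(anchor: Interval, days: int) -> Interval:
    """The `days`-long span that ends where `anchor` begins."""
    return Interval(anchor.start - timedelta(days=days), anchor.start)


def next_weekday(anchor: Interval, weekday: int) -> Interval:
    """First whole day with the given weekday (Monday=0) at or after anchor.end."""
    day = anchor.end
    while day.weekday() != weekday:
        day += timedelta(days=1)
    return day_of(day)


if __name__ == "__main__":
    doc_time = day_of(datetime(2025, 7, 8, 15, 30))  # assumed document creation time
    print(last(doc_time, 7))          # "the past week"
    print(next_weekday(doc_time, 0))  # "next Monday"
```

Because each expression is ordinary code, a generated normalization can be validated simply by executing it, which is the kind of code-level check the abstract describes for the augmented training data.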
Article 195
Title@2025-07-08 (2): Perception-Aware Policy Optimization for Multimodal Reasoning
Title: Perception-Aware Policy Optimization for Multimodal Reasoning | Perception-Aware Policy Optimization für multimodales Reasoning | 面向多模态推理的感知导向策略优化 2507.06448v1 |
Authors (11): Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
事实证明,可验证奖励强化学习(RLVR)是赋予大语言模型(LLMs)稳健多步推理能力的高效策略。然而,其设计和优化仍然针对纯文本领域,应用于多模态推理任务时表现欠佳。我们特别观察到,当前多模态推理的一个主要错误来源在于对视觉输入的感知。为解决这一瓶颈,我们提出感知导向策略优化(PAPO),这是对GRPO的一个简单而有效的扩展,它鼓励模型在学习推理的同时学习感知,且完全依靠内部监督信号。值得注意的是,PAPO不依赖额外的数据整理、外部奖励模型或专有模型。具体而言,我们以KL散度项的形式在GRPO目标中引入隐式感知损失,该方法虽然简单,却在多种多模态基准上带来了显著的总体提升(4.4%)。在高度依赖视觉的任务上,提升更为明显,接近8.0%。我们还观察到感知错误大幅减少(30.5%),表明PAPO提高了感知能力。我们对PAPO进行了全面分析,发现了一个独特的损失投机(loss hacking)问题,并通过双重熵损失(Double Entropy Loss)对其进行了严格分析和缓解。总体而言,我们的工作将面向感知的监督更深入地融入RLVR学习目标,并为鼓励视觉扎根推理的新RL框架奠定了基础。项目页面:https://mikewangwzhl.github.io/PAPO。
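The abstract above only says that the Implicit Perception Loss is a KL divergence term added to the GRPO objective. As a rough numerical illustration, the sketch below assumes the KL term compares the model's output distribution given the original image with its distribution given a corrupted image, and that the term is added with a weight `gamma`; both the pairing of distributions and the names `perception_aware_objective` and `gamma` are assumptions, not details taken from the abstract.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def perception_aware_objective(
    grpo_term: float,
    p_full_image: Sequence[float],     # assumed: token distribution with the real image
    p_masked_image: Sequence[float],   # assumed: token distribution with a corrupted image
    gamma: float = 0.01,               # hypothetical weighting coefficient
) -> float:
    """Illustrative combined objective: GRPO term plus a weighted KL term."""
    return grpo_term + gamma * kl_divergence(p_full_image, p_masked_image)
```

Under this assumption, a model that ignores the image produces identical distributions in both cases and gets a KL of zero, so the extra term rewards outputs that actually depend on the visual input.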
Article 196
Title@2025-07-08 (2): CodeMirage: Hallucinations in Code Generated by Large Language Models
Title: CodeMirage: Hallucinations in Code Generated by Large Language Models | CodeMirage: Halluzinationen in von großen Sprachmodellen generiertem Code | CodeMirage:大语言模型生成代码中的幻觉 2408.08333v2 |
Authors (4): Vibhor Agarwal, Yulong Pei, Salwa Alamir, Xiaomo Liu
Large Language Models (LLMs) have shown promising potential in program generation and no-code automation. However, LLMs are prone to hallucination, i.e., they generate text that sounds plausible but is incorrect. Although there has been a recent surge in research on LLM hallucinations in text generation, a similar phenomenon can occur in code generation. The generated code can contain syntactic or logical errors as well as more advanced issues such as security vulnerabilities and memory leaks. Given the wide adoption of LLMs to enhance efficiency in code generation and development in general, it becomes imperative to investigate hallucinations in code generation. To the best of our knowledge, this is the first attempt at studying hallucinations in the code generated by LLMs. We start by introducing a definition of code hallucination and a comprehensive taxonomy of code hallucination types. We propose CodeMirage, the first benchmark dataset for code hallucinations. The benchmark contains 1,137 GPT-3.5-generated hallucinated code snippets for Python programming problems from two base datasets, HumanEval and MBPP. We then propose a methodology for code hallucination detection and experiment with open-source LLMs such as CodeLLaMA as well as OpenAI’s GPT-3.5 and GPT-4 models using a one-shot prompt. We find that GPT-4 performs best on the HumanEval dataset and gives results comparable to the fine-tuned CodeBERT baseline on the MBPP dataset. Finally, we discuss various mitigation strategies for code hallucinations and conclude our work.
大型语言模型(LLMS)显示在程序生成和无编码自动化方面有很有希望的潜力。然而,LLMS容易产生幻觉,即产生似乎似乎似是似是而非的文本。虽然最近对LLM的幻觉的研究在为文本生成而出现,但类似的幻觉现象在代码生成中可能发生。生成的代码有时可能出现合成或逻辑错误,以及安全脆弱性、记忆泄漏等更先进的问题。鉴于LLMS的广泛调整以提高代码生成和一般开发效率,因此有必要调查代码生成中的幻觉。对于我们的知识来说,这是研究LMS生成的代码中的幻觉的第一次尝试。我们从引入代码幻觉定义和代码类型综合分类开始。我们建议为代码幻觉建立第一个基准代码数据集,包括安全脆弱性、记忆泄漏等。基准包含1,137 GPT-3.5生成的百草本代码,用于Python 各种基础数据集――人类经济学和MBPPP。然后,我们提出在代码中进行代码检测和实验的精确方法,将GPT-PLM数据模型作为开放源模型进行。
Article 197
Title@2025-07-08 (2): Substance over Style: Evaluating Proactive Conversational Coaching Agents
Title: Substance over Style: Evaluating Proactive Conversational Coaching Agents | Substanz vor Stil: Bewertung proaktiver Coaching-Gesprächsagenten | 实质重于风格:评估主动式对话教练智能体 2503.19328v2 |
Authors (8): Vidya Srinivas, Xuhai Xu, Xin Liu, Kumar Ayush, Isaac Galatzer-Levy, Shwetak Patel, Daniel McDuff, Tim Althoff
While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges: initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, and mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in the absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into the design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.
虽然国家劳工政策方案的研究在对话任务方面取得了长足进展,但许多方法侧重于具有明确目标或评价标准的单向反应。相反,辅导提出了独特的挑战,最初的未确定目标是通过多方向互动、主观评价标准、混合倡议对话而演变的。在这项工作中,我们描述并实施了5个多方向辅导员,他们表现出不同的对话风格,并通过用户研究对之进行评估,收集155次对话的第一人反馈。我们发现用户高度重视核心功能,对没有核心组成部分的立体成分的看法是负面的。通过将用户反馈与卫生专家和LM的第三人评价进行比较,我们发现在评价方法上存在严重的不协调。我们的调查结果为对话辅导员的设计和评价提供了深刻的见解,并有助于改善以人为中心的NLP应用程序。
Article 198
Title@2025-07-08 (2): Can Interpretation Predict Behavior on Unseen Data?
Title: Can Interpretation Predict Behavior on Unseen Data? | Kann Interpretation Verhalten auf ungesehenen Daten voraussagen? | 解释预测行为是否与未知数据有关? 2507.06445v1 |
Authors (5): Victoria R. Li, Jenny Kaufmann, Martin Wattenberg, David Alvarez-Melis, Naomi Saphra
Interpretability research often aims to predict how a model will respond to targeted interventions on specific mechanisms. However, it rarely predicts how a model will respond to unseen input data. This paper explores the promises and challenges of interpretability as a tool for predicting out-of-distribution (OOD) model behavior. Specifically, we investigate the correspondence between attention patterns and OOD generalization in hundreds of Transformer models independently trained on a synthetic classification task. These models exhibit several distinct systematic generalization rules OOD, forming a diverse population for correlational analysis. In this setting, we find that simple observational tools from interpretability can predict OOD performance. In particular, when in-distribution attention exhibits hierarchical patterns, the model is likely to generalize hierarchically on OOD data – even when the rule’s implementation does not rely on these hierarchical patterns, according to ablation tests. Our findings offer a proof-of-concept to motivate further interpretability work on predicting unseen model behavior.
解释性研究往往旨在预测模型如何应对特定机制的定向干预,但很少预测模型如何应对无形输入数据。本文探讨了可解释性作为预测分配外(OOOD)模式行为的工具的许诺和挑战。具体地说,我们调查数百个独立接受合成分类任务培训的变异模型中的注意力模式与OOOD一般化的对应关系。这些模型展示了几个独特的系统化规则OOD,形成不同的人口,供相关分析使用。在这个环境中,我们发现从可解释性中得出的简单观察工具可以预测OOD的性能。特别是,当分配性关注显示OOOD数据的等级模式时,该模式很可能在等级上将OOD数据普遍化 – – 即使规则的实施并不依赖这些等级模式,根据反动测试。我们的调查结果提供了一种证据,用以激励进一步解释预测不可见模式行为的工作。
Article 199
Title@2025-07-08 (2): Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling
Title: Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling | Temporale Analyse des Diskurses zur Klimapolitik: Einblicke aus der dynamischen Themenmodellierung | 气候政策问题时空分析:动态嵌入式专题模型的展望 2507.06435v1 |
Authors (2): Rafiu Adekoya Badekale, Adewale Akinfaderin
Understanding how policy language evolves over time is critical for assessing global responses to complex challenges such as climate change. Temporal analysis helps stakeholders, including policymakers and researchers, to evaluate past priorities, identify emerging themes, design governance strategies, and develop mitigation measures. Traditional approaches, such as manual thematic coding, are time-consuming and limited in capturing the complex, interconnected nature of global policy discourse. With the increasing relevance of unsupervised machine learning, these limitations can be addressed, particularly under high-volume, complex, and high-dimensional data conditions. In this work, we explore a novel approach that applies the dynamic embedded topic model (DETM), a probabilistic model designed to capture the temporal dynamics of topics, to analyze the evolution of global climate policy discourse. We collected a corpus of United Nations Framework Convention on Climate Change (UNFCCC) policy decisions from 1995 to 2023, excluding 2020 due to the postponement of COP26 as a result of the COVID-19 pandemic. The model reveals shifts from early emphases on greenhouse gases and international conventions to recent focuses on implementation, technical collaboration, capacity building, finance, and global agreements. Section 3 presents the modeling pipeline, including preprocessing, model training, and visualization of temporal word distributions. Our results show that DETM is a scalable and effective tool for analyzing the evolution of global policy discourse. Section 4 discusses the implications of these findings, and we conclude with future directions and refinements to extend this approach to other policy domains.
时间分析有助于利益攸关方,包括决策者和研究人员,评估过去的优先事项,确定新出现的主题,设计治理战略和制定缓解措施。传统方法,如人工专题编码,耗时且有限,无法捕捉全球政策讨论的复杂和相互关联的性质。随着不受监督的机器学习的相关性日益提高,这些局限性可以得到解决,特别是在大量、复杂和高维度的数据条件下。在这项工作中,我们探索了一种新颖的方法,采用动态嵌入主题模式(DETM)分析全球气候政策讨论的演变。一种旨在长期捕捉专题时间动态的概率性模式。我们收集了一整套《联合国气候变化框架公约》(《气候公约》)政策决定,从1995年到2023年,不包括2020年。由于COVID-19大流行导致COP26推迟,因此这些限制可以得到解决。该模式揭示了从早期温室气体和国际公约的强调转向最近的执行、技术协作、能力建设、金融和全球协定的演变。第3节展示了我们所制定的全球趋势的模型,包括模拟、培训、分析工具的演变过程,以分析我们目前和将来的政策趋势的演变趋势。
Article 200
Title@2025-07-08 (2): Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders
Title: Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders | Erforschen der Aufgabenleistung mit interpretierbaren Modellen über Sparse Auto-Encoder | 通过稀疏自编码器(Sparse Auto-Encoders)以可解释模型探索任务性能 2507.06427v1 |
Authors (8): Shun Wang, Tyler Loakman, Youbo Lei, Yi Liu, Bohao Yang, Yuting Zhao, Dong Yang, Chenghua Lin
Large Language Models (LLMs) are traditionally viewed as black-box algorithms, which reduces their trustworthiness and obscures potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of prompts with additional annotations to improve their interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.
大型语言模型(LLMs)传统上被视为黑箱算法,因此降低了可信度,隐蔽了提高下游任务业绩的潜在方法。在这项工作中,我们采用了一种有效的LLM分解方法,使用与稀疏的自动分解器的字典学习方法。这有助于从多语管LLM神经中提取单词性特征。值得注意的是,我们的工作确定了模型内部误解,允许自动重拟提示,并附加说明,改进LLMs的解释。此外,这一方法表明下游任务(如数学推理和比喻探测)的业绩显著改善。
Article 201
Title@2025-07-08 (2): Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling
Title: Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling | Reward-Modelle können sich selbst verbessern: Reward-Guided Adversarial Failure Mode Discovery für robuste Reward-Modellierung | 奖励模型能够自我改进:面向稳健奖励建模的奖励引导对抗性失效模式发现 2507.06419v1 |
Authors (2): Pankayaraj Pathmanathan, Furong Huang
Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model fine-tuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution-agnostic method for discovering reward model failure modes via reward-guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model’s misaligned behavior. We evaluate REFORM on two widely used preference datasets, Anthropic Helpful Harmless (HH) and PKU Beavertails, and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.
由于人类偏好固有的复杂性以及现有数据集的覆盖范围有限,奖励模式往往在分配性转变或对抗性扰动中失败。现有的确定这种失败模式的方法通常依赖事先对优惠分配或失败属性的了解,在无法获得此类信息的现实世界环境中限制其实用性。在这项工作中,我们提出一种可移植的、分配优惠的、不可知的方法,以便通过有控制的奖赏解码来发现奖赏模式失败模式。在此基础上,我们引入了REFORM,这是一个自我改进的奖赏模式框架,通过使用奖赏模式本身来指导错误得分反应的产生,从而增强稳健性。然后,这些对抗性范例被用来增加培训数据,弥补奖励模式的错误行为。我们根据两种广泛使用的优惠数据集评估REFORM,即“人类无害”和PKU Belails, 并展示在不通过提高质量、不提高质量、不通过提高可持续性和不断提高的升级政策,从而极大地改进升级政策。
Article 202
Title@2025-07-08 (2): PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning
Title: PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning | PERK: Long-Context Reasoning als parametereffizientes Test-Time Learning | PERK:将长上下文推理视为参数高效的测试时学习 2507.06415v1 |
Authors (4): Zeming Chen, Angelika Romanou, Gail Weiss, Antoine Bosselut
Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.
长上下文推理需要在冗长而含噪的输入上下文中准确识别相关信息。先前的研究表明,利用测试时学习将上下文直接编码进模型参数,可以有效支持对含噪信息的推理。然而,用于实现测试时学习的元学习方法内存开销过高,难以应用于长上下文场景。在这项工作中,我们提出PERK(Parameter Efficient Reasoning over Knowledge),这是一种可扩展的方法,在测试时通过对轻量级模型适配器进行梯度更新来学习编码长输入上下文。具体而言,PERK在元训练阶段采用两个嵌套的优化循环。内循环将上下文快速编码进低秩适配器(LoRA),后者充当基础模型的参数高效记忆模块。与此同时,外循环学习使用更新后的适配器,从已编码的长上下文中准确回忆相关信息并进行推理。我们在多个长上下文推理任务上的评估表明,PERK显著优于标准的基于提示的长上下文基线,较小模型(GPT-2)的平均绝对性能提升最高达90%,我们评估的最大模型Qwen-2.5-0.5B最高达27%。总体而言,PERK对推理复杂度、长度外推以及相关信息在上下文中的位置更为稳健。最后,我们表明,虽然PERK在训练期间内存开销较大,但其在推理时的扩展效率高于基于提示的长上下文推理。
Article 203
Title@2025-07-08 (2): The Trilemma of Truth in Large Language Models
Title: The Trilemma of Truth in Large Language Models | Das Trilemma der Wahrheit in großen Sprachmodellen | 大语言模型中的真理三难困境 2506.23921v2 |
Authors (2): Germans Savcisens, Tina Eliassi-Rad
We often attribute human characteristics to large language models (LLMs) and claim that they “know” certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM’s depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs “know” and how certain they are of their probabilistic internal knowledge.
我们经常将人的特点归结于大型语言模型(LLMS),并声称他们“知道”某些东西。LLMS具有代表培训期间保留的信息的内部概率知识。我们如何评估这种知识的真实性?我们检查两种共同的方法来检验LLMS的真实性,并发现一些有缺陷的假设。为了解决这些有缺陷的假设,我们引入了SawMIL(SawMIL(Sort for Sprassar Enown Convention 多重Instess)),一种检验方法,利用LLMS的内部激活将声明分为真实、虚假和两者都没有。 sawMIL是建立在多重 Internance 学习和符合预测的基础上的。我们如何评估16个开放源LMS的5项有效性标准,包括默认和聊天基变异体,以及3个新的数据集。我们提供的这些洞察是:(1)真实信号通常集中在LM深度的第三季度;(2)真理和假信号并不总是对称;(3)线性探点比一些默认模型更好;(4)非线性LMSMS的准确性反馈,可能要求从真实性测算为真实性测算。
Article 204
Title@2025-07-08 (2): Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers
Title: Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers | Single Word Change ist alles, was Sie brauchen: Konzipieren von Angriffen und Verteidigungen für Textklassifikatoren | 单字更改是您需要的: 设计攻击和文字分类的防御 2401.17196v2 |
Authors (5): Lei Xu, Sarah Alnegheimish, Laure Berti-Equille, Alfredo Cuesta-Infante, Kalyan Veeramachaneni
In text classification, creating an adversarial example means subtly perturbing a few words in a sentence without changing its meaning, causing it to be misclassified by a classifier. A concerning observation is that a significant portion of adversarial examples generated by existing methods change only one word. This single-word perturbation vulnerability represents a significant weakness in classifiers, which malicious users can exploit to efficiently create a multitude of adversarial examples. This paper studies this problem and makes the following key contributions: (1) We introduce a novel metric $\rho$ to quantitatively assess a classifier’s robustness against single-word perturbation. (2) We present SP-Attack, designed to exploit the single-word perturbation vulnerability, which achieves a higher attack success rate and better preserves sentence meaning while reducing computation costs compared to state-of-the-art adversarial methods. (3) We propose SP-Defense, which aims to improve $\rho$ by applying data augmentation in learning. Experimental results on 4 datasets with BERT and distilBERT classifiers show that SP-Defense improves $\rho$ by 14.6% and 13.9% and decreases the attack success rate of SP-Attack by 30.4% and 21.2% on the two classifiers respectively, and also decreases the attack success rate of existing attack methods that involve multiple-word perturbations.
在文本分类中,创建一个对抗性实例意味着在不改变其含义的情况下,在不改变其含义的情况下,在某一句中略略扰动几个词,造成分类者错误的分类。关于观察的观察是,现有方法产生的大量对抗性例子只改变一个单词。这种单字扰动脆弱性代表了分类者中的一个重大弱点,恶意用户可以利用这些弱点有效地创造众多对抗性例子。本文研究这一问题,并作出以下关键贡献:(1) 我们引入一个新的衡量标准,对分类者对单字扰动的稳健性进行定量评估。 (2) 我们介绍SP-Attack,目的是利用单字扰动性脆弱性,实现更高的攻击成功率,更好地保留判刑含义,同时降低计算成本,与最先进的对抗性对抗性方法相比,以有效的方式创造众多的对抗性例子。 (3) 我们提出SP-Defer,目的是通过在学习中应用数据增强来改进\r{ho}。 4个数据集的实验结果以及BERT和DERT分类人员显示,SP-Def-Def-res 进攻率分别为SP-Defal 30/rack 和21级攻击率下降率和14提高两个SP-cal-% 和21级攻击率和下降率率的SP-
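To make the idea of single-word robustness concrete, here is a minimal sketch. It assumes one plausible reading of such a metric: the fraction of single-word substitutions that leave the classifier's prediction unchanged. The paper's exact definition of $\rho$ may differ, and `single_word_robustness`, `toy_classifier`, and the substitute list are hypothetical names introduced only for illustration.

```python
from typing import Callable, Iterable, List


def single_word_robustness(
    classify: Callable[[List[str]], int],
    sentences: Iterable[List[str]],
    substitutes: List[str],
) -> float:
    """Fraction of single-word substitutions that keep the prediction unchanged.

    Assumption: this is an illustrative stand-in, not the paper's definition.
    """
    kept = 0
    total = 0
    for words in sentences:
        original = classify(words)
        for i in range(len(words)):
            for sub in substitutes:
                if sub == words[i]:
                    continue  # replacing a word with itself is not a perturbation
                perturbed = words[:i] + [sub] + words[i + 1:]
                total += 1
                kept += int(classify(perturbed) == original)
    return kept / total if total else 1.0


def toy_classifier(words: List[str]) -> int:
    # Hypothetical classifier: predicts 1 iff the word "good" occurs.
    return int("good" in words)


score = single_word_robustness(
    toy_classifier, [["a", "good", "film"]], ["good", "bad"]
)
print(score)
```

A higher value means fewer single-word edits flip the prediction. A real evaluation would also need to restrict substitutes to meaning-preserving words, which this sketch does not attempt.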
Article 205
Title@2025-07-08 (2): Hypermagmas and Colored Operads: Heads, Phases, and Theta Roles
Title: Hypermagmas and Colored Operads: Heads, Phases, and Theta Roles | Hypermagmen und farbige Opern: Köpfe, Phasen und Theta-Rollen | 超磁和彩色操作: 头部、阶段和Theta 角色 2507.06393v1 |
Authors (3): Matilde Marcolli, Riny Huijbregts, Richard K. Larson
We show that head functions on syntactic objects extend the magma structure to a hypermagma, with the c-command relation compatible with the magma operation and the m-command relation with the hypermagma. We then show that the structure of head and complement and specifier, additional modifier positions, and the structure of phases in the Extended Projection can be formulated as a bud generating system of a colored operad, in a form similar to the structure of theta roles. We also show that, due to the special form of the colored operad generators, the filtering of freely generated syntactic objects by these coloring rules can be equivalently formulated as a filtering in the course of structure formation via a colored Merge, which can in turn be related to the hypermagma structure. The rules on movement by Internal Merge with respect to phases, the Extended Projection Principle, Empty Category Principle, and Phase Impenetrability Condition are all subsumed into the form of the colored operad generators. Movement compatibilities between the phase structure and the theta roles assignments can then be formulated in terms of the respective colored operads and a transduction of colored operads.
我们显示, 合成对象上的头函数将岩浆结构扩展至超岩浆, 与岩浆操作和超岩浆的 m- 命令关系可兼容的 c- 命令关系。 然后我们显示, 头结构、 补充和分解器、 额外的修饰器位置以及扩展投影中各个阶段的结构可以形成一个彩色剧团的萌芽生成系统, 其形式与该作用的结构类似。 我们还显示, 由于有色歌剧生成器的特殊形式, 通过这些彩色规则对自由生成的合成对象的过滤, 可以等同于通过彩色合并在结构形成过程中的过滤, 而这反过来又可以与超岩浆结构相关。 内部合并关于相继演进阶段、 扩展投影原则、 空分类原则 和阶段隐性状态的规则, 都包含在有色歌剧生成器的形态中。 通过这些彩色合并规则对自由生成的合成对象的过滤, 可以通过彩色合并制的歌剧和相色调调调调制成的调戏和调色歌剧的调调调调。
Article 206
Title@2025-07-08 (2): Multi-Sense Embeddings for Language Models and Knowledge Distillation
Title: Multi-Sense Embeddings for Language Models and Knowledge Distillation | Multi-Sense-Einbettungen für Sprachmodelle und Wissensdestillation | 语文模式和知识蒸馏多语种多语种嵌入式 2504.06036v2 |
Authors (4): Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis
Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a limited number of senses (or meanings). We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language. To construct a sense embedding dictionary, we apply a clustering algorithm to embeddings generated by an LLM and consider the cluster centers as representative sense embeddings. In addition, we propose a novel knowledge distillation method that leverages the sense dictionary to learn a smaller student model that mimics the senses from the much larger base LLM model, offering significant space and inference time savings, while maintaining competitive performance. Via thorough experiments on various benchmarks, we showcase the effectiveness of our sense embeddings and knowledge distillation approach. We share our code at https://github.com/Qitong-Wang/SenseDict
以变换器为基础的大型语言模型(LLMs)依赖于背景嵌入,根据周围环境的不同背景产生不同(持续)的相同表示。尽管如此,单词和符号通常具有有限的感官(或意义)数量。我们提议多感知嵌入,作为每个符号的空置替换,以用一种语言捕捉其使用范围。为了构建一种感知嵌入字典,我们应用组合算法来嵌入由LLM生成的嵌入,并将集群中心视为具有代表性的感官嵌入。此外,我们提议一种新颖的知识蒸馏方法,利用感知字典学习一个较小的学生模型,模仿来自大得多的LMM模型的感官,提供相当大的空间和推断时间节约,同时保持竞争性性能。通过对各种基准的彻底实验,我们展示我们感知嵌入和知识蒸馏方法的有效性。我们在 https://github.com/Qitong-Wang/SenseDict分享我们的代码。
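The dictionary-construction step described above (cluster the contextual embeddings of a token, keep the cluster centers as its sense embeddings) can be sketched as follows. This is a minimal illustration using plain k-means with a deterministic initialisation. The abstract does not say which clustering algorithm or hyperparameters the authors use, so `sense_embeddings` and its arguments are assumptions.

```python
from typing import List

Vector = List[float]


def _dist2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sense_embeddings(contextual: List[Vector], k: int, iterations: int = 10) -> List[Vector]:
    """Cluster one token's contextual embeddings; return k cluster centers.

    Assumption: plain k-means, initialised with the first k vectors.
    """
    centers = [list(v) for v in contextual[:k]]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for v in contextual:
            nearest = min(range(k), key=lambda c: _dist2(v, centers[c]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [_mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


# Toy 2-d "contextual embeddings" of one token used in two distinct senses.
observations = [[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.2, 10.0]]
senses = sense_embeddings(observations, k=2)
print(senses)
```

Each returned center stands in for one sense; at inference time a contextual embedding would be replaced by its nearest center.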
Article 207
Title@2025-07-08 (2): Evaluating Morphological Alignment of Tokenizers in 70 Languages
Title: Evaluating Morphological Alignment of Tokenizers in 70 Languages | Bewertung der morphologischen Ausrichtung von Tokenisatoren in 70 Sprachen | 以70种语言评价火化器在道义上的一致性 2507.06378v1 |
Authors (3): Catherine Arnett, Marisa Hudspeth, Brendan O’Connor
While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to effectively evaluate tokenizer quality. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. We expand MorphScore (Arnett & Bergen, 2025), which previously covered 22 languages, to support a total of 70 languages. The updated MorphScore offers more flexibility in evaluation and addresses some of the limitations of the original version. We then correlate our alignment scores with downstream task performance for five pre-trained language models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment does not explain very much variance in model performance, suggesting that morphological alignment alone does not measure dimensions of tokenization quality relevant to model performance.
虽然象征性化是语言建模方面的一个关键步骤,对示范培训和性能有影响,但它仍然不清楚如何有效评估象征品质量。象征性品质量的一个拟议层面是象征品保留语言上有意义的小字,在单词内将象征性边界与形态边界统一起来。我们扩大MorphScore(Arnett & Bergen, 2025),以前涵盖22种语言,支持总共70种语言。更新的MorphScore在评价方面提供了更大的灵活性,并解决了原始版本的一些局限性。然后,我们将我们的匹配分数与七个任务中5种预先训练过的语文模型的下游任务性能挂钩,我们抽样中每种语文至少有一项任务。我们发现,形态上的匹配并不解释模型性能的很大差异,表明单是形态上的匹配并不能衡量与模型性能相关的象征性质量的维度。
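As a rough illustration of what "aligning token boundaries with morphological boundaries" means, the sketch below computes the share of gold morpheme boundaries inside a word that a tokenizer also places. This is not MorphScore's actual scoring rule, which the abstract does not spell out; `boundaries` and `boundary_recall` are hypothetical helpers.

```python
from typing import List, Set


def boundaries(pieces: List[str]) -> Set[int]:
    """Character offsets of the internal boundaries between consecutive pieces."""
    offsets = set()
    position = 0
    for piece in pieces[:-1]:
        position += len(piece)
        offsets.add(position)
    return offsets


def boundary_recall(tokens: List[str], morphemes: List[str]) -> float:
    """Share of morpheme boundaries that coincide with a token boundary.

    Assumption: an illustrative measure, not the MorphScore definition.
    """
    if "".join(tokens) != "".join(morphemes):
        raise ValueError("tokens and morphemes must spell the same word")
    gold = boundaries(morphemes)
    if not gold:
        return 1.0  # a single-morpheme word has nothing to misalign
    return len(gold & boundaries(tokens)) / len(gold)


gold_segmentation = ["un", "break", "able"]
print(boundary_recall(["un", "breakable"], gold_segmentation))
print(boundary_recall(["unb", "reak", "able"], gold_segmentation))
```

Both example tokenizations recover one of the two morpheme boundaries, even though the second one also introduces a boundary that is not morphological.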
Article 208
Title@2025-07-08 (2): Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
Title: Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate | Wachsende Transformer: Modulare Zusammensetzung und schichtweise Erweiterung auf einem gefrorenen Substrat | 增长变异器:在冻结基质上的模块构成和图层扩展 2507.07129v1 |
Authors (1): A. Bochkov
The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal “docking port,” enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is “grown” by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
缩放大型语言模型(LLMS)的流行范例(LLMS)涉及单一式的、最终到最终的、缺乏灵活性的资源密集型培训,这是一个资源密集型的过程。本文探讨了一种替代的、建设性的模式开发方法,其基础是非可操作的、确定性的投入嵌入。在[1]之前,我们确定,高层次的语义推理可以在变异器中出现,使用由Unicode glyphs视觉结构衍生的冷冻嵌入。在这里,我们证明,这个固定代表式的基底部分是一个通用的“演示港 ” , 使两个强大和高效的缩放模式成为:无缝模块构成和递增层增长。首先,我们表明,在分散式数据模型(例如,俄罗斯和中国文本)的基础上,经过培训的专家模型可以合并成一个单一的、更有能力的Mixturate-Exerts模型(MoE)模型, 后加零结构修改。我们通过简单地平均其输出对产出进行校正校正的校正校正的模型, 由此而使MLU这样的推理学基准立即得到改进, , 超越其组成专家的构成,而不必忘记遗忘地遗忘地在深度的构造上,我们进行着一个更深层的演进式的进式的研究, 。
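The merging step described in the abstract, averaging the output logits of separately trained specialists, can be sketched in a few lines. The sketch assumes the experts share one output vocabulary so that their logit vectors line up position by position; the function names and toy numbers are illustrative and not taken from the released code.

```python
from typing import List


def average_logits(expert_logits: List[List[float]]) -> List[float]:
    """Element-wise mean of the experts' logit vectors for one position.

    Assumption: all experts emit logits over the same vocabulary.
    """
    n = len(expert_logits)
    return [sum(column) / n for column in zip(*expert_logits)]


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Toy logits over a 3-token vocabulary from two hypothetical specialists.
russian_expert = [2.0, 0.0, 1.0]
chinese_expert = [0.0, 1.0, 3.0]

merged = average_logits([russian_expert, chinese_expert])
print(merged, argmax(merged))
```

No parameters are touched in this scheme: the experts stay frozen and only their outputs are combined, which is why it needs no architectural change.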
Article 209
Title@2025-07-08 (2): Could the Road to Grounded, Neuro-symbolic AI be Paved with Words-as-Classifiers?
Title: Could the Road to Grounded, Neuro-symbolic AI be Paved with Words-as-Classifiers? | Könnte der Weg zur geerdeten, neuro-symbolischen KI mit Wort-als-Klassifikatoren gepflastert werden? | 通往禁地、神经-精神-精神-爱之路能否用“单词化为标准”来铺设? 2507.06335v1 |
Authors (2): Casey Kennington, David Schlangen
Formal, Distributional, and Grounded theories of computational semantics each have their uses and their drawbacks. There has been a shift to ground models of language by adding visual knowledge, and there has been a call to enrich models of language with symbolic methods to gain the benefits from formal, distributional, and grounded theories. In this paper, we attempt to make the case that one potential path forward in unifying all three semantic fields is paved with the words-as-classifier model, a model of word-level grounded semantics that has been incorporated into formalisms and distributional language models in the literature, and it has been well-tested within interactive dialogue settings. We review that literature, motivate the words-as-classifiers model with an appeal to recent work in cognitive science, and describe a small experiment. Finally, we sketch a model of semantics unified through words-as-classifiers.
正式、 分布和基础的计算语义理论各有其用途和缺点。 通过增加视觉知识,已经向语言的地面模型转变。 人们呼吁用象征性的方法丰富语言模型,以便从正式、 分布和基础理论中获益。 在本文中,我们试图证明,在统一所有三个语义领域方面,一条潜在的前进道路是用单词类化模型铺设的,一种单词级基语义模型,已经融入了文献中的正文和分布语言模型,并在互动对话环境中进行了良好的测试。我们审视了文献,激发了单词类模型,吸引了认知科学的近期工作,并描述了一个小实验。最后,我们勾画了通过单词类化软件统一的语义学模型。
Article 210
Title@2025-07-08 (2): MedGellan: LLM-Generated Medical Guidance to Support Physicians
Title: MedGellan: LLM-Generated Medical Guidance to Support Physicians | MedGellan: LLM-generierte medizinische Anleitung zur Unterstützung von Ärzten | MedGellen:LLM - 支持医生的医疗指导 2507.04431v2 |
Authors (4): Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini
Medical decision-making is a critical task, where errors can result in serious, potentially life-threatening consequences. While full automation remains challenging, hybrid frameworks that combine machine intelligence with human oversight offer a practical alternative. In this paper, we present MedGellan, a lightweight, annotation-free framework that uses a Large Language Model (LLM) to generate clinical guidance from raw medical records, which is then used by a physician to predict diagnoses. MedGellan uses a Bayesian-inspired prompting strategy that respects the temporal order of clinical data. Preliminary experiments show that the guidance generated by the LLM with MedGellan improves diagnostic performance, particularly in recall and $F_1$ score.
医疗决策是一项关键任务,错误可能导致可能危及生命的严重后果。 虽然完全自动化仍然具有挑战性,但将机器智能与人类监督相结合的混合框架提供了一种实用的替代方案。 在本文中,我们介绍了一个使用大语言模型(LLM)来产生临床指导的轻量级、无注释性框架(MedGellen),该模型从原始医疗记录中产生临床指导,然后由医生用来预测诊断结果。 MedGellan使用一种尊重临床数据时间顺序的由Bayesian启发的激励性战略。 初步实验表明,LLM和MedGellen产生的指导提高了诊断性能,特别是在召回和1美元的分数方面。
Article 211
Title@2025-07-08 (2): RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs
Title: RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs | RefineX: Verfeinern von Vorschulungsdaten auf Scale aus sachverständigen Programmen | RefineX:从专家指导方案学习如何在规模上精确分析培训前数据 2507.03253v2 |
Authors (10): Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng
The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose RefineX, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks, and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and Prox-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.
大型语言模型(LLMS)的基本能力受到其培训前公司质量的深刻影响。然而,提高规模数据质量仍然是一大挑战,这主要是由于精炼效力和处理效率之间的权衡取舍。尽管基于规则的过滤仍然是主导范式,但它通常在文件一级运作,缺乏完善文件内具体内容所需的颗粒性。在诸如ProX等新兴工作的启发下,我们提议美元(textbf{RefineX}$),这是一个通过方案编辑任务大规模地对培训前数据进行手术改进的新框架。RefineX使得数据能够高效和精细地改进,同时可靠地保持原始文本的多样性和自然性。RefineX的核心力量在于将高质量的、专家指导的端到端的精细精细精炼结果转化为最起码的删除程序。这种高精密的提炼油管道被用于培训高效和可靠的精细模型,可以系统地改进每例的精确度。我们评估RefineX的进度,在多个模型的预选前阶段进行精细的精细的精细数据改进,在甚的模型上进行精细的精细的精细的精细分析。
Article 212
Title@2025-07-08 (2): ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time
Title: ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time | ETT: Erweiterung des Langzeitkontexts Verständnisfähigkeit von LLMs bei Test-Time | ETT:扩大LLMs在试验时的长距离理解能力 2507.06313v1 |
Authors (4): Kiarash Zahirnia, Zahra Golpayegani, Walid Ahmad, Yang Liu
Transformer-based language models’ computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short-context Transformer-based LLMs with constant memory requirement and linear computation overhead. ETT enables the extension of the context length at test time by efficiently fine-tuning the model’s parameters on the input context, chunked into overlapping small subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model’s accuracy. We also study how context can be stored in an LLM’s weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models’ accuracy.
以变换器为基础的语言模型的计算和内存管理量作为序列长度的函数, 二次成本在使用LLMS处理长序列时构成挑战。 在这项工作中, 我们引入了 \ ourmodelacronnym~( 测试时Extend ) , 延长基于短背景的变换器LMS的上下文长度的方法, 并不断要求内存, 以及线性计算间接费用 。 TET 通过高效地微调模型输入环境参数, 使测试时的上下文长度延长, 并被挤成重叠的小后继序列 。 我们通过将 GPT- Large 和 Phi-2 的上下文长度延长32 次来评估长贝恩奇的 ETT, 从而将GPT- Large 和 Phi-2 的上下文长度从1k 增加到32 个符号, 从而将模型的精度提高到30% 。 我们还研究如何有效和高效地将环境储存在 LLM 的重量中。 我们通过详细的调整研究, 我们研究哪个变换式模块对试验时的精度最为有益。
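The chunking step mentioned in the abstract (splitting the long input into overlapping small subsequences before test-time fine-tuning) might look like the sketch below. The chunk size and overlap are arbitrary illustrative values, and `overlapping_chunks` is a hypothetical helper, not the authors' code. The fine-tuning itself is not shown.

```python
from typing import List


def overlapping_chunks(tokens: List[int], chunk_size: int, overlap: int) -> List[List[int]]:
    """Split a token sequence into windows that share `overlap` tokens.

    Assumption: fixed-size windows with a fixed stride; the paper's exact
    chunking scheme may differ.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


long_context = list(range(10))  # stand-in for token ids of a long document
windows = overlapping_chunks(long_context, chunk_size=4, overlap=2)
print(windows)
```

Because every window has bounded length, the memory needed per fine-tuning step stays constant while the number of steps grows linearly with the input length, which matches the cost profile the abstract claims.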
Article 213
Title@2025-07-08 (2): Humans overrely on overconfident language models, across languages
Title: Humans overrely on overconfident language models, across languages | Menschen überschätzen überbewusste Sprachmodelle, über Sprachen hinweg | 人类对不同语言的 过于自信的语言模式过于自信 2507.06306v1 |
Authors (3): Neil Rathi, Dan Jurafsky, Kaitlyn Zhou
As large language models (LLMs) are deployed globally, it is crucial that their responses are calibrated across languages to accurately convey uncertainty and limitations. Previous work has shown that LLMs are linguistically overconfident in English, leading users to overrely on confident generations. However, the usage and interpretation of epistemic markers (e.g., ‘It’s definitely,’ ‘I think’) can differ sharply across languages. Here, we study the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages to evaluate the safety of LLMs in a global context. We find that overreliance risks are high across all languages. We first analyze the distribution of LLM-generated epistemic markers, and observe that while LLMs are cross-linguistically overconfident, they are also sensitive to documented linguistic variation. For example, models generate the most markers of uncertainty in Japanese and the most markers of certainty in German and Mandarin. We then measure human reliance rates across languages, finding that while users strongly rely on confident LLM generations in all languages, reliance behaviors differ cross-linguistically: for example, users rely significantly more on expressions of uncertainty in Japanese than in English. Taken together, these results indicate high risk of reliance on overconfident model generations across languages. Our findings highlight the challenges of multilingual linguistic calibration and stress the importance of culturally and linguistically contextualized model safety evaluations.
由于全球部署的大型语言模型(LLMS)在全球部署,因此,至关重要的是,它们的反应要通过语言校准,以准确地传达不确定性和局限性。以前的工作表明,LLMS在语言上过于自信,使用户对自信的几代人过度自信。然而,缩写标记(例如“它肯定是”“我认为”)的使用和解释在语言上差异很大。在这里,我们研究多种语言语言(误)校校准、过度自信和五种语言过分依赖的风险,以评价全球范围内LMS的安全性。我们发现,所有语言的过分依赖性风险都很高。我们首先分析LM产生的缩写符号在语言上的分布,导致用户过度自信。我们首先分析LMCM产生的直写标记的分布,并观察,虽然LLMMs具有跨语言的过度自信,但它们对文献语言差异很大。例如,模型产生日本最不确定的标志,德国和曼达林最确定性的标志。我们然后衡量不同语言的人类依赖率,我们发现,虽然用户在所有语言模式中都高度依赖自信的LM世代,但依赖程度上的行为方式在英语上也不同语言的逻辑上更依赖了我们的不确定性。
Article 214
Title@2025-07-08 (2): Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
Title: Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers | Effizienz-Wirksamkeit Reranking FLOPs für LLM-basierte Reranker | 为基于LLM的升级者重新安排FLOP 2507.06223v1 |
Authors (5): Zhiyuan Peng, Ting-ruen Wei, Tingyu Song, Yilun Zhao, Yi Fang
Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. However, these metrics depend on hardware and run-time choices (e.g., parallel or not, batch size, etc.), and often fail to account for model size, making them difficult to interpret and obscuring the evaluation of the efficiency-effectiveness tradeoff. To address this issue, we propose E²R-FLOPs for LLM-based rerankers: ranking metrics per PetaFLOP (RPP) for relevance per compute and queries per PetaFLOP (QPP) for hardware-agnostic throughput. Accompanying the new metrics, an interpretable FLOPs estimator is built to estimate the FLOPs of an LLM-based reranker even without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.
大型语言模型(LLMS)最近被应用于信息检索的重新排序任务,取得了很强的绩效,然而,它们高的计算要求往往阻碍实际的部署。现有的研究利用代理度量,如潜伏、远端传票数量、输入符号和输出符号等,评估基于LLM的重新排序者的效率。但是,这些尺度取决于硬件和运行时间选择(或是否平行、批量大小等),往往不考虑模型大小,因此难以解释和掩盖效率权衡的评价。为了解决这个问题,我们建议E\textsuperscript{2}R-FLOPs, 面向基于LM的重新排序者:根据每个PetaFLOP(RPP)的比重度,按每个计算和问询每个PetaFLLOP(QPPP)的比重相关性,根据新的计量标准,建立一个可解释的 FLOPs 估测算器,用来估计基于LOP的更替器的重新排序者,即使不进行任何实验,我们提议在进行任何实验的情况下,也建议对社区效率模型进行。
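The two metrics named above reduce to simple ratios once the FLOPs spent per query are known. The sketch below is one reading of the abstract's wording ("ranking metrics per PetaFLOP", "queries per PetaFLOP"); the exact formulas and the FLOPs estimator are in the paper, and the numbers used here are made up for illustration.

```python
PETA = 1e15  # FLOPs in one PetaFLOP


def rpp(ranking_metric: float, flops_per_query: float) -> float:
    """Ranking metric (e.g. NDCG@10) obtained per PetaFLOP of compute."""
    return ranking_metric / (flops_per_query / PETA)


def qpp(flops_per_query: float) -> float:
    """Number of queries that can be reranked with one PetaFLOP."""
    return PETA / flops_per_query


# Hypothetical reranker: NDCG@10 of 0.7 at 2e14 FLOPs per query.
example_rpp = rpp(0.7, 2e14)
example_qpp = qpp(2e14)
print(example_rpp, example_qpp)
```

Both quantities depend only on the model's arithmetic cost, not on the GPU, batch size, or degree of parallelism, which is what makes them comparable across setups.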
Article 215
Title@2025-07-08 (2): Instruction Following by Boosting Attention of Large Language Models
Title: Instruction Following by Boosting Attention of Large Language Models | Anleitung, indem man die Aufmerksamkeit großer Sprachmodelle erhöht | 之后的教学,培养对大语言模式的注意 2506.13734v2 |
Authors (4): Vitoria Guardieiro, Adam Stein, Avishree Khare, Eric Wong
Controlling the generation of large language models (LLMs) remains a central challenge to ensure their safe and reliable deployment. While prompt engineering and finetuning are common approaches, recent work has explored latent steering, a lightweight technique that alters LLM internal activations to guide generation. However, subsequent studies revealed latent steering’s effectiveness to be limited, often underperforming simple instruction prompting. To address this limitation, we first establish a benchmark across diverse behaviors for standardized evaluation of steering techniques. Building on insights from this benchmark, we introduce Instruction Attention Boosting (InstABoost), a latent steering method that boosts the strength of instruction prompting by altering the model’s attention during generation. InstABoost combines the strengths of existing approaches and is theoretically supported by prior work that suggests that in-context rule following in transformer-based models can be controlled by manipulating attention on instructions. Empirically, InstABoost demonstrates superior control success compared to both traditional prompting and latent steering.
控制大型语言模型(LLMs)的生成仍然是确保安全可靠地部署的重大挑战。虽然迅速工程和微调是常见的方法,但最近的工作探索了潜质引导,这是一种轻量技术,改变了LLM的内部激活,以引导下一代的形成。然而,随后的研究显示,潜质引导的有效性有限,往往表现较差的简单指令。为了应对这一限制,我们首先为标准化的指导技术评价制定不同行为基准。基于这一基准的洞察力,我们引入了 “ 注意促进 “ (InstABoost),这是一种潜在的指导方法,通过改变模型代代代代的注意力来增强激励教学的力量。 “ 斯特拉博斯特 “ 将现有方法的优势结合起来,并在理论上得到先前工作的支持,表明基于变压器模型的文本规则可以通过对指示的注意来控制。从中可以看出, “ 注意 “ InstABoost “ 与传统的催动和潜伏性指导相比,都表现出超强的控制成功。
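A minimal sketch of the idea of boosting attention on instruction tokens: scale the attention weights that fall on the instruction positions by a constant and renormalise so the row still sums to one. The multiplier value, the function name, and the choice to renormalise a single attention row are assumptions for illustration; the abstract does not give the authors' exact formulation.

```python
from typing import List, Set


def boost_attention(weights: List[float], instruction_positions: Set[int], multiplier: float) -> List[float]:
    """Upweight attention on instruction tokens, then renormalise.

    Assumption: `weights` is one already-softmaxed attention row.
    """
    boosted = [
        w * multiplier if i in instruction_positions else w
        for i, w in enumerate(weights)
    ]
    total = sum(boosted)
    return [w / total for w in boosted]


# One attention row over [instruction, instruction, user text].
row = [0.25, 0.25, 0.5]
steered = boost_attention(row, instruction_positions={0, 1}, multiplier=3.0)
print(steered)
```

In the toy row, the share of attention on the instruction rises from 0.5 to 0.75 while the result remains a valid probability distribution.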
Article 216
Title@2025-07-08 (2): CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions
Title: CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions | CultureCLIP: CLIP mit kulturellem Bewusstsein durch synthetische Bilder und kontextualisierte Captions stärken | CICLIP: 通过合成图像和背景说明赋予CLIP以文化意识,赋予CLIP权力 2507.06210v1 |
Authors (6): Yuchen Huang, Zhiyuan Fan, Zhitao He, Sandeep Polisetty, Wenyan Li, Yi R. Fung
Pretrained vision-language models (VLMs) such as CLIP excel in multimodal understanding but struggle with contextually relevant fine-grained visual features, making it difficult to distinguish visually similar yet culturally distinct concepts. This limitation stems from the scarcity of high-quality culture-specific datasets, the lack of integrated contextual knowledge, and the absence of hard negatives highlighting subtle distinctions. To address these challenges, we first design a data curation pipeline that leverages open-sourced VLMs and text-to-image diffusion models to construct CulTwin, a synthetic cultural dataset. This dataset consists of paired concept-caption-image triplets, where concepts visually resemble each other but represent different cultural contexts. Then, we fine-tune CLIP on CulTwin to create CultureCLIP, which aligns cultural concepts with contextually enhanced captions and synthetic images through customized contrastive learning, enabling finer cultural differentiation while preserving generalization capabilities. Experiments on culturally relevant benchmarks show that CultureCLIP outperforms the base CLIP, achieving up to a notable 5.49% improvement in fine-grained concept recognition on certain tasks, while preserving CLIP’s original generalization ability, validating the effectiveness of our data synthesis and VLM backbone training paradigm in capturing subtle cultural distinctions.
为了应对这些挑战,我们首先设计了一个数据整理管道,利用开放源码VLM和文本到图像传播模型来构建一个合成文化数据集CulTwin。这个数据集由一对概念拼凑成的CLIP模型组成,在三部模型中,各种概念看起来相似,但代表不同的文化背景。随后,我们在CulTwin上对CLIP作了微调,以创建文化CulTLIP,通过定制化对比学习,使文化概念与背景强化的字幕和合成图像相匹配,使文化差异得以细化,同时保留总体化能力。关于文化相关基准的实验显示CulTwin比CulTwin(合成文化数据集)要优于基础CLIP,在精细的CLIP模型中实现显著的5.49 %的改进,在精确化的CLIP模型中,在精确化的精细化能力模型方面实现显著的CLIP化。
Article 217
Title@2025-07-08 (2): EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG
Title: EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG | EEG2TEXT-CN: Eine explorative Studie der offenen Vokabulären chinesischen Text-EEG-Ausrichtung über großsprachliches Modell und kontrastives Lernen auf ChinesischEEG | EEG2TEXT-CN:通过大语言模式和中经语言差异性学习对中文文本与EEEG校对开放词汇的探索性研究 2506.00854v3 |
Authors (6): Jacky Tai-Yu Lu, Jung Chiang, Chi-Sheng Chen, Anna Nai-Yun Tung, Hsiang Wei Hu, Yuan Chiao Cheng
We propose EEG2TEXT-CN, which, to the best of our knowledge, represents one of the earliest open-vocabulary EEG-to-text generation frameworks tailored for Chinese. Built on a biologically grounded EEG encoder (NICE-EEG) and a compact pretrained language model (MiniLM), our architecture aligns multichannel brain signals with natural language representations via masked pretraining and contrastive learning. Using a subset of the ChineseEEG dataset, where each sentence contains approximately ten Chinese characters aligned with 128-channel EEG recorded at 256 Hz, we segment EEG into per-character embeddings and predict full sentences in a zero-shot setting. The decoder is trained with teacher forcing and padding masks to accommodate variable-length sequences. Evaluation on over 1,500 training-validation sentences and 300 held-out test samples shows promising lexical alignment, with a best BLEU-1 score of 6.38%. While syntactic fluency remains a challenge, our findings demonstrate the feasibility of non-phonetic, cross-modal language decoding from EEG. This work opens a new direction in multilingual brain-to-text research and lays the foundation for future cognitive-language interfaces in Chinese.
我们建议EEG2TEXT-CN, 就我们所知,它代表了最早为中国人设计的开放的开放的 EEG-文字生成框架之一。我们建于一个基于生物的 EEG 编码器(NITE-EEEG)和一个精练的语言模型(MiniLM)上,我们的建筑通过蒙面训练前和对比学习,将多通道脑信号与自然语言表达方式相匹配。我们使用中国EEEG数据集的一个子集,每句话都包含与256赫兹记录的128个EEEG相匹配的大约10个中国字符。我们将EEEG分成每个字嵌入每个字组,并预测在零光环境中的全句。解码器是用教师的强迫和遮蔽面面面面罩来适应多长序列的培训。对1,500多个培训-校准判决和300个留置试样的评估显示有良好的词汇一致性,最佳的BLEU-1分为6.38。尽管合成流仍是一个挑战,但我们的调查结果展示了在中国人文、跨式语言、跨式语言、跨式读基础的大脑研究中,从而开启了中国人进入了中国认知基础。
Article 218
Title@2025-07-08 (2): DS@GT at CheckThat! 2025: Ensemble Methods for Detection of Scientific Discourse on Social Media
Title: DS@GT at CheckThat! 2025: Ensemble Methods for Detection of Scientific Discourse on Social Media | DS@GT bei CheckThat! 2025: Ensemblemethoden zur Erkennung wissenschaftlicher Diskurse über soziale Medien | DS@GT at checketh! 2025:社会媒体科学论文的混合探测方法 2507.06205v1 |
Authors (4): Ayush Parikh, Hoang Thanh Thanh Truong, Jeanette Schofield, Maximilian Heil
In this paper, we, as the DS@GT team for CLEF 2025 CheckThat! Task 4a Scientific Web Discourse Detection, present the methods we explored for this task. For this multiclass classification task, we determined if a tweet contained a scientific claim, a reference to a scientific study or publication, and/or mentions of scientific entities, such as a university or a scientist. We present 3 modeling approaches for this task: transformer finetuning, few-shot prompting of LLMs, and a combined ensemble model whose design was informed by earlier experiments. Our team placed 7th in the competition, achieving a macro-averaged F1 score of 0.8611, an improvement over the DeBERTaV3 baseline of 0.8375. Our code is available on Github at https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4a.
在本文中,我们作为2025年CLEF CLEF CheckTH的DS@GT团队, 介绍了我们为此任务探索的方法。 对于这一多级分类任务,我们确定一条推文是否包含科学主张, 提及科学研究或出版物, 和/或提及科学实体, 如大学或科学家。 我们为这一任务提出了三种建模方法:变压器微调, 微调几发LLMS的提示, 以及根据早期实验设计的组合模型。 我们的团队在比赛中排第7位, 取得了0.8611的宏观平均F1分, 比DBERTAV3 0. 8375的基线有所改进。 我们的代码可以在 Github https://github.com/dsgt-arc/ checkTah-2025-swd/tree/main/subtask-4a上查阅。
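For readers unfamiliar with the reported metric, macro-averaged F1 computes an F1 score per class and takes the unweighted mean, so rare classes count as much as frequent ones. The sketch below is a generic implementation on toy labels and is not the shared task's official scorer.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy labels, not data from the competition.
gold = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
result = macro_f1(gold, predicted)
print(result)
```

Here class "a" scores 2/3 and class "b" scores 4/5, giving a macro average of 11/15.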
Article 219
Title@2025-07-08 (2): Differential Mamba
Title: Differential Mamba | Unterschiedliche Mamba | 差异 Mamba 2507.06204v1 |
Authors (3): Nadav Schneider, Itamar Zimerman, Eliya Nachmani
Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available.
变换器和 RNNs 等序列模型往往将注意力集中在无关的上下文上,从而导致杂乱的中间代表。这通过促进幻觉、削弱远程和检索能力以及降低稳健性,降低了LLM 能力。最近的工作表明,变换器的不同设计可以缓解这一问题,提高各种应用的效能。在本文中,我们探讨这些最初为变换器开发的技术是否可以适用于Mamba,这是基于选择性的州-空间层的最新结构,能够以更高的效率实现变换器的性能。我们表明,对Mamba的差分设计进行天真调整是不够的,需要仔细的建筑修改。为了解决这个问题,我们为Mamba引入了一种新型的差别机制,在语言建模基准上经过经验验证,展示了更好的检索能力和比Vanilla Mamba更好的性能。最后,我们进行了广泛的变换器研究和经验分析,以证明我们的设计选择是正确的,并提供证据,证明我们的方法有效地缓解了Mamba 模型中的超位问题。我们的代码是公开的。
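The "differential design" for Transformers that the abstract refers to can be illustrated, under assumptions, as subtracting a second softmax attention map from the first so that attention mass both maps assign to irrelevant context cancels out. The sketch below shows that subtraction for one query row with a fixed scalar `lam`. It follows the general idea of differential attention and does not reproduce the Mamba mechanism proposed in this paper, which the abstract does not describe in detail.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def differential_weights(scores_a: List[float], scores_b: List[float], lam: float) -> List[float]:
    """softmax(scores_a) - lam * softmax(scores_b) for one query.

    Assumption: `lam` is a fixed scalar here; in practice it would be learned.
    """
    first = softmax(scores_a)
    second = softmax(scores_b)
    return [x - lam * y for x, y in zip(first, second)]


# Two hypothetical score rows over three context positions.
diff = differential_weights([2.0, 0.5, 0.5], [0.0, 1.0, 1.0], lam=0.5)
cancelled = differential_weights([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], lam=1.0)
print(diff, cancelled)
```

When both rows agree and `lam` is 1, everything cancels; otherwise the weights sum to `1 - lam` and positions favoured only by the second map are suppressed.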
Article 220
Title@2025-07-08 (2): UQLM: A Python Package for Uncertainty Quantification in Large Language Models
Title: UQLM: A Python Package for Uncertainty Quantification in Large Language Models | UQLM: Ein Python-Paket zur Unsicherheitsquantifizierung in großen Sprachmodellen | UQLM: 用于大语言模型中不确定性量化的 Python 软件包 2507.06196v1 |
Authors (6): Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, Zeya Ahmad
Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.
幻觉被定义为大语言模型产生虚假或误导内容的事例,构成影响下游应用安全和信任的重大挑战。我们引入了UQLM,这是利用最新技术的不确定性量化(UQ)技术检测LLM幻觉的Python套件。这个工具包提供一套基于UQ的得分,计算反应级别信任度分从0到1。这个图书馆为基于UQ的幻觉检测提供了一个现成的解决方案,可以很容易地整合,以提高LLM产出的可靠性。
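As a rough illustration of what a response-level confidence score in [0, 1] can look like, the sketch below scores agreement among several sampled answers to the same prompt. It is not UQLM's API: `jaccard` and `consistency_score` are hypothetical names, and token-set overlap is assumed here as a stand-in for whatever similarity measure a real scorer would use.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two responses, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty responses are trivially identical
    return len(ta & tb) / len(ta | tb)


def consistency_score(responses: list[str]) -> float:
    """Mean pairwise similarity of sampled responses (hypothetical scorer).

    Values near 1 mean the samples agree with each other; values near 0
    mean they diverge, which is treated here as a sign of low confidence.
    """
    if len(responses) < 2:
        raise ValueError("need at least two sampled responses")
    pairs = list(combinations(responses, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

In use, one would sample the same prompt several times at non-zero temperature and flag answers whose score falls below a threshold chosen on held-out data.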
Article 221
Title@2025-07-08 (2): DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification
Title: DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification | DS@GT bei CheckThat! 2025: Bewertung von Kontext- und Tokenisierungsstrategien für numerische Faktenverifizierung | DS@GT at CheckThat! 2025:评估数值事实核查的上下文与分词策略 2507.06195v1 |
Authors (2): Maximilian Heil, Aleksandar Pramov
Numerical claims, statements involving quantities, comparisons, and temporal references, pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and building our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of more evidence with longer input context windows using ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings in arithmetic reasoning tasks, R2L tokenization does not boost natural language inference (NLI) of numerical tasks. A longer context window does not enhance veracity performance either, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves a competitive macro-average F1 score of 0.57 and places us among the Top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at https://github.com/dsgt-arc/checkthat-2025-numerical.
数值型声明,即涉及数量、比较和时间指代的陈述,给自动事实核查系统带来了独特的挑战。在本研究中,我们使用 QuanTemp 数据集并构建自己的证据检索流水线,评估了此类声明真实性预测的建模策略。我们考察了三个关键因素:(1) 借助 ModernBERT 更长的输入上下文窗口纳入更多证据的影响;(2) 从右到左(R2L)分词的效果;(3) 二者对分类性能的综合影响。与此前算术推理任务中的发现相反,R2L 分词并未提升数值任务的自然语言推理(NLI)表现。更长的上下文窗口同样没有提高真实性预测性能,这表明证据质量才是主要瓶颈。我们表现最好的系统取得了 0.57 的有竞争力的宏平均 F1 分数,使我们跻身 CheckThat! 2025 任务 3 的前 4 名提交之列。我们的代码可在 https://github.com/dsgt-arc/checkthat-2025-numerical 查阅。
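The right-to-left (R2L) tokenization mentioned in this abstract can be pictured with a small sketch: digit strings are split into groups of three starting from the least significant digit, so each group lines up with a place value. How the authors apply this to their tokenizer is not stated in the abstract; the separator-insertion step and the names `chunk_digits` and `r2l_pretokenize` are assumptions made for illustration.

```python
import re


def chunk_digits(digits: str, size: int = 3, right_to_left: bool = True) -> list[str]:
    """Split a digit string into groups of `size` digits.

    With right_to_left=True the short group (if any) ends up on the left,
    so groups align with thousands, millions, and so on.
    """
    if right_to_left:
        head = len(digits) % size
        chunks = [digits[:head]] if head else []
        chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
        return chunks
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def r2l_pretokenize(text: str, sep: str = ",") -> str:
    """Insert `sep` between R2L digit groups (hypothetical pre-processing step).

    A tokenizer that would otherwise group digits left to right then sees
    place-value-aligned groups instead.
    """
    return re.sub(r"\d+", lambda m: sep.join(chunk_digits(m.group())), text)
```

For example, `chunk_digits("1234567")` gives `["1", "234", "567"]`, whereas the left-to-right variant gives `["123", "456", "7"]`. Note that this naive sketch also regroups years and decimal fractions, which a real pipeline would likely want to exclude.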
Article 222
Title@2025-07-08 (2): SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads
Title: SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads | SQLBarber: Ein System, das große Sprachmodelle nutzt, um angepasste und realistische SQL Workloads zu generieren | SQLBarber:一个系统利用大语言模型生成自定义和现实的 SQL 工作量 2507.06192v1 |
Authors (2): Jiale Lao, Immanuel Trummer
Database research and development often require a large number of SQL queries for benchmarking purposes. However, acquiring real-world SQL queries is challenging due to privacy concerns, and existing SQL generation methods are limited in customization and in satisfying realistic constraints. To address this issue, we present SQLBarber, a system based on Large Language Models (LLMs) to generate customized and realistic SQL workloads. SQLBarber (i) eliminates the need for users to manually craft SQL templates in advance, while providing the flexibility to accept natural language specifications to constrain SQL templates, (ii) scales efficiently to generate large volumes of queries matching any user-defined cost distribution (e.g., cardinality and execution plan cost), and (iii) uses execution statistics from Amazon Redshift and Snowflake to derive SQL template specifications and query cost distributions that reflect real-world query characteristics. SQLBarber introduces (i) a declarative interface for users to effortlessly generate customized SQL templates, (ii) an LLM-powered pipeline augmented with a self-correction module that profiles, refines, and prunes SQL templates based on query costs, and (iii) a Bayesian Optimizer to efficiently explore different predicate values and identify a set of queries that satisfy the target cost distribution. We construct and open-source ten benchmarks of varying difficulty levels and target query cost distributions based on real-world statistics from Snowflake and Amazon Redshift. Extensive experiments on these benchmarks show that SQLBarber is the only system that can generate customized SQL templates. It reduces query generation time by one to three orders of magnitude, and significantly improves alignment with the target cost distribution, compared with existing methods.
数据库研发通常需要大量 SQL 查询用于基准测试。然而,由于隐私问题,获取真实世界的 SQL 查询十分困难,而现有的 SQL 生成方法在定制化和满足现实约束方面能力有限。为了解决这一问题,我们提出了 SQLBarber,这是一个基于大语言模型(LLM)的系统,用于生成定制且贴近现实的 SQL 工作负载。SQLBarber (i) 使用户无需事先手工编写 SQL 模板,同时可以灵活地接受自然语言规格来约束 SQL 模板;(ii) 能够高效扩展,生成与任意用户定义的成本分布(例如基数和执行计划成本)相匹配的大量查询;(iii) 利用来自 Amazon Redshift 和 Snowflake 的执行统计数据,得出反映真实查询特征的 SQL 模板规格和查询成本分布。SQLBarber 引入了 (i) 一个声明式接口,让用户轻松生成定制的 SQL 模板;(ii) 一条由 LLM 驱动的流水线,并辅以自我纠错模块,根据查询成本对 SQL 模板进行分析、优化和剪枝;(iii) 一个贝叶斯优化器,用于高效探索不同的谓词取值,并找出满足目标成本分布的一组查询。我们基于 Snowflake 和 Amazon Redshift 的真实统计数据,构建并开源了十个难度各异、目标查询成本分布不同的基准。在这些基准上的大量实验表明,SQLBarber 是唯一能够生成定制 SQL 模板的系统。与现有方法相比,它将查询生成时间缩短了一到三个数量级,并显著改善了与目标成本分布的吻合程度。
Article 223
Title@2025-07-08 (2): DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation
Title: DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation | DS@GT bei CheckThat! 2025: Subjektivität durch Transfer-Learning und Korrekturdatenvergrößerung erkennen | DS@GT at checkThat! 2025: 通过传输学习和纠正数据增强检测主观性 2507.06189v1 |
Authors (2): Maximilian Heil, Dionne Bang
This paper presents our submission to Task 1, Subjectivity Detection, of the CheckThat! Lab at CLEF 2025. We investigate the effectiveness of transfer-learning and stylistic data augmentation to improve classification of subjective and objective sentences in English news text. Our approach contrasts fine-tuning of pre-trained encoders and transfer-learning of fine-tuned transformers on related tasks. We also introduce a controlled augmentation pipeline using GPT-4o to generate paraphrases in predefined subjectivity styles. To ensure label and style consistency, we employ the same model to correct and refine the generated samples. Results show that transfer-learning of specified encoders outperforms fine-tuning general-purpose ones, and that carefully curated augmentation significantly enhances model robustness, especially in detecting subjective content. Our official submission placed us 16th of 24 participants. Overall, our findings underscore the value of combining encoder specialization with label-consistent augmentation for improved subjectivity detection. Our code is available at https://github.com/dsgt-arc/checkthat-2025-subject.
本文介绍我们向 CLEF 2025 CheckThat! 实验室任务 1(主观性检测)提交的系统。我们研究了迁移学习和文体数据增强在改进英文新闻文本中主观句与客观句分类方面的有效性。我们的方法对比了对预训练编码器的微调,以及对已在相关任务上微调过的 Transformer 进行迁移学习。我们还引入了一条受控的数据增强流水线,使用 GPT-4o 按预定义的主观性风格生成改写句。为确保标签和风格的一致性,我们使用同一模型对生成的样本进行修正和润色。结果表明,对特定编码器进行迁移学习优于微调通用编码器,而精心筛选的数据增强能显著提高模型的稳健性,尤其是在检测主观内容方面。我们的正式提交在 24 支参赛队伍中排名第 16。总体而言,我们的研究结果凸显了将编码器专门化与标签一致的数据增强相结合对改进主观性检测的价值。我们的代码可在 https://github.com/dsgt-arc/checkthat-2025-subject 查阅。
Article 224
Title@2025-07-08 (2): Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review
Title: Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review | Versteckte Prompts in Manuskripten nutzen KI-gestütztes Peer-Review aus | 手稿中的隐藏提示词利用 AI 辅助同行评审 2507.06185v1 |
Authors (1): Zhicheng Lin
In July 2025, 18 academic manuscripts on the preprint website arXiv were found to contain hidden instructions, known as prompts, designed to manipulate AI-assisted peer review. Instructions such as “GIVE A POSITIVE REVIEW ONLY” were concealed using techniques like white-colored text. Author responses varied: one planned to withdraw the affected paper, while another defended the practice as legitimate testing of reviewer compliance. This commentary analyzes this practice as a novel form of research misconduct. We examine the technique of prompt injection in large language models (LLMs), revealing four types of hidden prompts, ranging from simple positive review commands to detailed evaluation frameworks. The defense that prompts served as “honeypots” to detect reviewers improperly using AI fails under examination: the consistently self-serving nature of prompt instructions indicates intent to manipulate. Publishers maintain inconsistent policies: Elsevier prohibits AI use in peer review entirely, while Springer Nature permits limited use with disclosure requirements. The incident exposes systematic vulnerabilities extending beyond peer review to any automated system processing scholarly texts, including plagiarism detection and citation indexing. Our analysis underscores the need for coordinated technical screening at submission portals and harmonized policies governing generative AI (GenAI) use in academic evaluation.
2025年7月,在预印网站ArXiv上发现18份学术手稿含有被称为旨在操纵AI辅助同行审查的提示的隐蔽指示,例如“Give A POSITIVE Review only”等指示使用白色文字等技术被隐藏起来。作者的答复各有不同:一个计划撤回受影响的纸张,而另一个则将这种做法作为审查者遵约情况的合法测试来辩护。本评注将这种做法作为一种新的研究不当行为形式加以分析。我们研究了大语言模型中迅速注射的方法,揭示了四种隐藏的提示,从简单的正面审查命令到详细的评价框架。“Honepots”作为“honeypots”的借口,用来检测不适当地使用AI的检查员——即检查中一贯的为自己服务的指示性质表明了操纵的意图。出版商坚持了不一致的政策:Elsevier完全禁止在同行审评中使用AI,而Springer自然允许有限地使用披露要求的使用。事件暴露了系统处理学术文本,包括红外探测和引号索引的任何自动化系统处理的弱点。我们的分析强调了在提交软件门户上协调地进行基因检查的必要性检查的必要性。
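The technical screening at submission portals that this commentary calls for could start from something as simple as a phrase scan over the text layer extracted from a submitted PDF. The sketch below is only that: the pattern list and the name `find_hidden_prompts` are hypothetical, only the first pattern is taken from the example quoted in the abstract, and a phrase scan cannot tell whether the matched text was visually hidden (white text, tiny fonts), which would require inspecting the PDF rendering attributes.

```python
import re

# Hypothetical, non-exhaustive patterns for reviewer-directed instructions.
SUSPICIOUS_PATTERNS = [
    r"give\s+a\s+positive\s+review",
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"do\s+not\s+(highlight|mention)\s+any\s+(negatives|weaknesses)",
    r"recommend\s+accept(ance|ing)?",
]


def find_hidden_prompts(text: str) -> list[str]:
    """Return every instruction-like phrase found in extracted manuscript text."""
    # Collapse line breaks and runs of spaces so phrases split across
    # lines in the PDF text layer still match.
    normalized = re.sub(r"\s+", " ", text)
    hits = []
    for pattern in SUSPICIOUS_PATTERNS:
        for match in re.finditer(pattern, normalized, flags=re.IGNORECASE):
            hits.append(match.group(0))
    return hits
```

A non-empty result would be a reason for a human editor to look at the manuscript, not proof of misconduct, since legitimate papers about prompt injection may quote the same phrases.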
Article 225
Title@2025-07-08 (2): MedGemma Technical Report
Title: MedGemma Technical Report | Technischer Bericht MedGemma | MedGemma 技术报告 2507.05201v2 |
Authors (79): Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang
Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment face challenges due to healthcare’s diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.
人工智能(AI)在医疗保健应用方面具有巨大潜力,但其培训和部署由于医疗保健数据、复杂任务和需要保护隐私而面临挑战。基础模型在医疗任务方面表现良好,需要较少任务性调整数据,这些模型对于加快医疗保健AI应用的开发至关重要。我们采用了基于Gemma 3 4B 和 27B 的医学视觉语言基础模型集MedGemma。MedGemma展示了先进的医学理解和图像和文本推理,大大超过了类似规模的基因化模型的性能,接近了具体任务模型的性能,同时保持了Gemma 3基模型的总体能力。对于分配外任务,MedGemma实现了2.6-10%的医疗多式问题回答改进,15.5%-18.1%的胸前X光学发现分类模型改进了10.8%的医学模型评估比基模型改进了。MedGemma进一步改进了子体的性能,将电子健康记录应用中的误差减少了50%,并实现了与现有专业的直径G3基准模型模型的精确性功能,包括了SmealderLalmas的医学和直径解的医学能力的更高分析。
Article 226
Title@2025-07-08 (2): CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
Title: CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization | KritikLean: Kritisch-geführtes Verstärkungslernen für mathematische Formalisierung | CriticLean:面向数学形式化的评论者引导强化学习 2507.06181v1 |
Authors (19): Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, Luming Li, Minghao Liu, Yihang Xia, Jiawei Shen, Yuchen Wu, Yixin Cao, Zhaoxiang Zhang, Wenhao Huang, Jiaheng Liu, Ge Zhang
Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase, the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models’ ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.
将自然语言数学声明转换成正式的、可执行的代码是自动化理论验证中的一项根本挑战。 虽然先前的工作侧重于生成和汇编成功,但很少注意批评者阶段对生成的正规化是否真正抓住了最初问题的语义意图的评价。 在本文中,我们引入了CriticLean, 这是一种创新的、受批评者指导的强化学习框架,将批评者的作用从被动验证器提升到积极的学习组成部分。 具体地说,我们建议CriticLeGPT, 通过监管的微调和强化学习培训,严格评估Lean 4正规化的语义真实性。 然后,我们引入了CriticLeanBench,这是用来衡量模型是否有能力将语义纠正与不正确的正规化区分开来的能力的基准。 我们经过培训的CriticleLeanGPT模型可以大大超越强大的开放和封闭源基线。 在Criticle框架的基础上,我们构建了FineLeanCorpus, 一个包含超过285个问题的数据集, 展示了丰富的领域多样性、广泛的困难程度和高清晰度的评估。
Article 227
Title@2025-07-08 (2): Evaluation of Habitat Robotics using Large Language Models
Title: Evaluation of Habitat Robotics using Large Language Models | Bewertung der Habitat Robotics mit großen Sprachmodellen | 利用大语言模式评价人居机器人学 2507.06157v1 |
Authors (4): William Li, Lei Hamilton, Kaise Al-natour, Sanjeev Mohindra
This paper focuses on evaluating the effectiveness of Large Language Models at solving embodied robotic tasks using the Meta PARTNR benchmark. Meta PARTNR provides simplified environments and robotic interactions within randomized indoor kitchen scenes. Each randomized kitchen scene is given a task that two robotic agents work together to solve. We evaluated multiple frontier models on Meta PARTNR environments. Our results indicate that reasoning models like OpenAI o3-mini outperform non-reasoning models like OpenAI GPT-4o and Llama 3 when operating in PARTNR’s robotic embodied environments. o3-mini outperformed the other models across centralized, decentralized, full observability, and partial observability configurations. This provides a promising avenue of research for embodied robotic development.
本文件的重点是评估大型语言模型在利用Meta Partnerer基准解决体现的机器人任务方面的效力。Meta PartNR提供简化的环境和室内随机化厨房场景中的机器人互动。每个随机化厨房场景的任务都是由两个机器人代理人合作合作解决这项任务。我们评估了Meta Partnerner环境的多种前沿模型。我们的结果表明,OpenAI o3-mini-moin o3 的推理模型,如OpenAI GPT-4o 和Llama 3 的不合理模型,在PartNir的机器人化环境中操作。O3-mini在集中、分散、完全可观测和部分可观测性配置方面表现优异。这为内含的机器人开发提供了有希望的研究途径。
Article 228
Title@2025-07-08 (2): Coding Triangle: How Does Large Language Model Understand Code?
Title: Coding Triangle: How Does Large Language Model Understand Code? | Coding Triangle: Wie versteht Large Language Model Code? | 编码三角:大语言模式如何理解代码? 2507.06138v1 |
Authors (6): Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen
Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can substantially enhance both the performance and robustness of LLMs. Furthermore, we reveal both the consistency and inconsistency in the cognition of LLMs that may facilitate self-reflection and self-improvement, providing a potential direction for developing more powerful coding models.
大型语言模型(LLMS)在代码生成方面取得了显著进展,但其真正的编程能力仍未得到充分探讨。我们引入了代码三角框架,从三个基本方面系统地评估LMS:编辑分析、代码执行和测试案例生成。通过对竞争性编程基准的广泛实验,我们发现LLMs可以形成一个在这些方面自成一体的系统,但其解决方案往往缺乏人类程序员的多样性和强健性。我们发现模型认知和人类专门知识之间的重大分布变化,由于培训数据偏差和有限的推理转移,模型错误往往会集中在一起。我们的研究显示,将人造编辑、解决方案和多种测试案例以及利用模型混合物,可以大大提高LMS的性能和稳健性。此外,我们揭示了LMs在可促进自我恢复和自我改进的兼容性方面的一致性和不一致性,为开发更强大的编码模型提供了潜在方向。
Article 229
Title@2025-07-08 (2): NeoBabel: A Multilingual Open Tower for Visual Generation
Title: NeoBabel: A Multilingual Open Tower for Visual Generation | NeoBabel: Mehrsprachiger offener Turm für die visuelle Erzeugung | NeoBabel:多语言的视觉生成开放塔 2507.06137v1 |
Authors (4): Mohammad Mahdi Derakhshani, Dheeraj Varghese, Marzieh Fadaee, Cees G. M. Snoek
Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability, scoring 0.75 on m-GenEval and 0.68 on m-DPG. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on multilingual benchmarks, even though these models are built on multilingual base LLMs. This demonstrates the effectiveness of our targeted alignment training for preserving and extending crosslingual generalization. We further introduce two new metrics to rigorously assess multilingual alignment and robustness to code-mixed prompts. Notably, NeoBabel matches or exceeds English-only models while being 2-4x smaller. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research. Our work demonstrates that multilingual capability is not a trade-off but a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI.
文本到图像生成的进展一直以英语为中心,这给非英语使用者造成了障碍,并使数字不平等长期存在。现有系统依赖翻译流水线,但这会带来语义漂移、计算开销和文化错位。我们提出了 NeoBabel,这是一个新颖的多语言图像生成框架,在性能、效率和包容性方面确立了新的帕累托前沿,支持六种语言:英语、中文、荷兰语、法语、印地语和波斯语。该模型结合大规模多语言预训练和高分辨率指令微调进行训练。为评估其能力,我们将两个仅限英语的基准扩展为多语言版本:m-GenEval 和 m-DPG。NeoBabel 在保持强大英语能力的同时取得了最先进的多语言性能,在 m-GenEval 上得分 0.75,在 m-DPG 上得分 0.68。值得注意的是,它在英语任务上与领先模型表现相当,而在多语言基准上分别领先 +0.11 和 +0.09,尽管这些模型是基于多语言基础 LLM 构建的。这证明了我们的针对性对齐训练在保持和扩展跨语言泛化方面的有效性。我们还引入了两个新指标,用于严格评估多语言对齐以及对语码混合提示的稳健性。值得注意的是,NeoBabel 的表现达到或超过仅限英语的模型,而规模小 2 到 4 倍。我们发布了一个开放工具包,包括全部代码、模型检查点、一个包含 1.24 亿对多语言文本-图像的精选数据集,以及标准化的多语言评估协议,以推动包容性 AI 研究。我们的工作表明,多语言能力不是一种权衡取舍,而是提升生成式 AI 稳健性、效率和文化忠实度的催化剂。
Article 230
Title@2025-07-08 (2): SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam?
Title: SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam? | SciMaster: Auf dem Weg zu allgemein angelegten wissenschaftlichen KI-Agenten, Teil I. X-Master als Grundlage: Können wir die letzte Prüfung der Menschheit durchführen? | SciMaster:争取实现普通用途科学AI代理,第一部分,X-Master作为基金会:我们能否在人类最后一次考试上起带头作用? 2507.05241v2 |
Authors (11): Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, Siheng Chen
The rapid advancements of AI agents have ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity’s Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate the capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. This agent, guided by the conceptualization of code as an interaction language, can flexibly leverage built-in Python libraries and our customized tools to augment the reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI’s and Google’s Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements, guiding subsequent model training.
AI代理商的快速进步激发了利用他们来加速科学发现的长期雄心壮志。 实现这一目标需要深入了解人类知识的前沿。 因此, 人类最后一次Exam(HLE)为评估科学AI代理商提供了极具挑战性的试金石。 在这项工作中, 我们的目标是为普通用途代理商建立基础架构,并通过在HLE上领导业绩来验证能力。 为了实现这一目标, 我们引入了XMaster(XMaster)这个工具强化推理代理商,目的是通过在推理过程中与外部工具进行灵活互动来模仿人类研究人员。 这个代理商在将代码概念化为互动语言的指引下,可以灵活地利用在Python内部的图书馆和我们定制的工具来强化推理。 我们通过XMasters(一个分散和分散的代理工作流程)来进一步提升其能力,从而系统地提高推理的广度和深度。 我们的公开源解决方案XMaster(XMaster)在HLELE上建立了一个新的最新状态记录, 其分数为32.1%, 超过 Open AI 和 Goo’s 深层研究(26.6%和26.9) 将未来提升到更深层研究的进度, 进入一个更深层任务, 进入一个更深层阶段,可以让我们更深层的深层的进度进入到更深层的进度。
Article 231
Title@2025-07-08 (2): Agents Are All You Need for LLM Unlearning
Title: Agents Are All You Need for LLM Unlearning | Agenten sind alles, was Sie für LLM Unlearning benötigen | 代理器是所有你需要的LLM 重新学习 2502.00406v2 |
Authors (2): Debdeep Sanyal, Murari Mandal
Information removal or suppression in large language models (LLMs) is a desired functionality, useful in AI regulation, legal compliance, safety, and privacy. LLM unlearning methods aim to remove information on demand from LLMs. Current LLM unlearning methods struggle to balance the unlearning efficacy and utility due to the competing nature of these objectives. Keeping the unlearning process computationally feasible without assuming access to the model weights is an overlooked area. In this work we show that agents might be all we need for effective and practical inference-time LLM unlearning. We present the first agentic LLM unlearning (ALU) method, a multi-agent, retrain-free, model-agnostic approach to LLM unlearning that achieves effective unlearning while preserving the utility. Our ALU framework unlearns by involving multiple LLM agents, each designed for a specific step in the unlearning process, without the need to update model weights for any of the agents in the framework. Users can easily request any set of unlearning instances in any sequence, and ALU seamlessly adapts in real time. This is facilitated without requiring any changes in the underlying LLM model. Through extensive experiments on established benchmarks (TOFU, WMDP, WPU) and jailbreaking techniques (many shot, target masking, other languages), we demonstrate that ALU consistently stands out as the most robust inference-time LLM unlearning framework among current state-of-the-art methods while incurring time cost that remains effectively constant regardless of the number of unlearning targets. We further highlight ALU’s superior performance compared to existing methods when evaluated at scale. Specifically, ALU is assessed on up to 1000 unlearning targets, exceeding the evaluation scope of all previously proposed LLM unlearning methods.
大型语言模型(LLMS) 的信息删除或抑制(LLMS) 是一个理想功能,在AI 监管、法律合规、安全和隐私方面是有用的。 LLM 不学习的方法旨在从LLMS中删除需求信息。 当前 LLM 不学习的方法由于这些目标的竞争性质而难以平衡未学习的功效和效用。 保持不使用模型重量的不学习进程在计算上的可行性是一个被忽视的领域。 在此工作中, 我们显示\ textit{ 试剂可能是我们所需要的所有有效和实用的稳健时间推移( LLM) 取消学习。 我们展示了第一个代理 UmtricialM (\ textt{ALU}) 的不学习方法, 多试探、 免费重校正、 不断更新的MALMSAL 方法, 在维护其效用的同时, 我们的LMTTTT{AL} 框架也变得不易读取。 在未学习过程中, 我们的每个工具都为非学习过程设计了一个具体的步骤, 不需要更新任何国家代理人的模型框架的重量框架。 用户可以很容易地要求在任何时间里, 在任何顺序中, 不断修正的实验中, 在任何顺序里程中, 不断修正的实验中, 不断的实验方法中, 继续显示。
Article 232
Title@2025-07-08 (2): GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models
Title: GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models | GAF-Guard: Agentischer Rahmen für Risikomanagement und Governance in großen Sprachmodellen | GAF-Guard:大语言模式风险管理和治理代理框架 2507.02986v2 |
Authors (4): Seshu Tirupathi, Dhaval Salwala, Elizabeth Daly, Inge Vejsbjerg
As Large Language Models (LLMs) are applied across an increasing range of domains, their widespread adoption necessitates rigorous monitoring to prevent unintended negative consequences and ensure robustness. Furthermore, LLMs must be designed to align with human values, like preventing harmful content and ensuring responsible usage. The current automated systems and solutions for monitoring LLMs in production are primarily centered on LLM-specific concerns such as hallucination, with little consideration given to the requirements of specific use-cases and user preferences. This paper introduces GAF-Guard, a novel agentic framework for LLM governance that places the user, the use-case, and the model itself at the center. The framework is designed to detect and monitor risks associated with the deployment of LLM-based applications. The approach models autonomous agents that identify risks and activate risk detection tools within specific use-cases, and that facilitate continuous monitoring and reporting to enhance AI safety and meet user expectations. The code is available at https://github.com/IBM/risk-atlas-nexus-demos/tree/main/gaf-guard.
由于大型语言模型(LLMS)继续越来越多地应用于各个领域,广泛采用这些模型需要严格的监测,以防止意外的负面后果,并确保稳健性;此外,LMS的设计必须与人类价值观保持一致,例如防止有害内容和确保负责任的使用;目前用于监测生产中的LLMS的自动化系统和解决办法主要集中于LLM的具体关切,例如幻觉等,很少考虑到具体使用案例和用户偏好的要求;本文件介绍了GAF-Guard,这是一个新的LM治理代理框架,将用户、使用案例和模型本身置于中心位置;该框架旨在检测和监测与部署LLM应用程序有关的风险;在具体使用案例范围内,确定风险、启动风险检测工具并促进持续监测和报告以加强AI安全及用户期望的自主代理物;该代码可在https://github.com/IBM/risk-atlas-nexus-demos/tree/main/gaf-guard查阅。
Article 233
Title@2025-07-08 (2): Nyay-Darpan: Enhancing Decision Making Through Summarization and Case Retrieval for Consumer Law in India
Title: Nyay-Darpan: Enhancing Decision Making Through Summarization and Case Retrieval for Consumer Law in India | Nyay-Darpan: Verbesserung der Entscheidungsfindung durch Zusammenfassung und Rückgriff auf das Verbraucherrecht in Indien | Nyay-Darpan:印度消费者法通过摘要和案件检索加强决策 2507.06090v1 |
Authors (9): Swapnil Bhattacharyya, Shrey Ganatra, Harshvivek Kashid, Spandan Anaokar, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani, Pushpak Bhattacharyya
AI-based judicial assistance and case prediction have been extensively studied in criminal and civil domains, but remain largely unexplored in consumer law, especially in India. In this paper, we present Nyay-Darpan, a novel two-in-one framework that (i) summarizes consumer case files and (ii) retrieves similar case judgements to aid decision-making in consumer dispute resolution. Our methodology not only addresses the gap in consumer law AI tools but also introduces an innovative approach to evaluate the quality of the summary. The term ‘Nyay-Darpan’ translates into ‘Mirror of Justice’, symbolizing the ability of our tool to reflect the core of consumer disputes through precise summarization and intelligent case retrieval. Our system achieves over 75 percent accuracy in similar case prediction and approximately 70 percent accuracy across material summary evaluation metrics, demonstrating its practical effectiveness. We will publicly release the Nyay-Darpan framework and dataset to promote reproducibility and facilitate further research in this underexplored yet impactful domain.
基于大赦国际的法律援助和案件预测已经在刑事和民事领域进行了广泛研究,但在消费者法中仍然基本上没有探讨,特别是在印度。本文介绍Nyay-Darpan,这是一个新的二对一框架,(一) 总结消费者案件卷宗,(二) 检索类似的案件判决,以协助消费者争端解决的决策。我们的方法不仅解决消费者法方面的缺陷,而且还采用创新办法评估摘要的质量。“Nyay-Darpan”一词翻译为“Miror of Justice ” ,象征着我们工具通过精确的汇总和智能案件检索反映消费者争端核心的能力。我们的系统在类似案件中的预测实现了75%的准确性,在材料摘要评价指标中实现了大约70%的准确性,显示了其实际效力。我们将公开发布Nyay-Darpan框架和数据集,以促进可再生性,并促进对这个探索不足但影响深远的领域进行进一步研究。
Article 234
Title@2025-07-08 (2): The bitter lesson of misuse detection
Title: The bitter lesson of misuse detection | Die bittere Lektion der Missbrauchserkennung | 发现滥用的惨痛教训 2507.06282v1 |
Authors (3): Hadrien Mariaccia, Charbel-Raphaël Segerie, Diego Dorn
Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model's ability to resist adversarial inputs and to output safe content, rather than on the effectiveness of external supervision systems. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse attacks. To address this, we introduce BELLS, a Benchmark for the Evaluation of LLM Supervision Systems. The framework is two-dimensional, spanning harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak), and provides a rich dataset covering 3 jailbreak families and 11 harm categories. Our evaluations reveal drastic limitations of specialized supervision systems. While they recognize some known jailbreak patterns, their semantic understanding and generalization capabilities are very limited, sometimes with detection rates close to zero when asking a harmful question directly or with a new jailbreak technique such as base64 encoding. Simply asking generalist LLMs if the user question is “harmful or not” largely outperforms these supervisors from the market according to our BELLS score. But frontier LLMs still suffer from metacognitive incoherence, often responding to queries they correctly identify as harmful (up to 30 percent for Claude 3.7 and greater than 50 percent for Mistral Large). These results suggest that simple scaffolding could significantly improve misuse detection robustness, but more research is needed to assess the tradeoffs of such techniques. Our results support the “bitter lesson” of misuse detection: general capabilities of LLMs are necessary to detect a diverse array of misuses and jailbreaks.
以往关于越狱检测的研究确立了对抗鲁棒性对LLM的重要性,但主要关注模型自身抵御对抗性输入并输出安全内容的能力,而非外部监督系统的有效性。迄今为止,针对这些护栏的唯一公开且独立的基准只在有限场景下评估了少数监督系统。因此,目前尚无全面的公开基准来验证市面上的监督系统在真实、多样的攻击下表现如何。为此,我们提出BELLS,即LLM监督系统评估基准。该框架包含两个维度:危害严重程度(良性、边界、有害)和对抗复杂度(直接提问与越狱),并提供了一个涵盖3个越狱家族和11个危害类别的丰富数据集。我们的评估揭示了专用监督系统的严重局限:它们虽然能识别一些已知的越狱模式,但语义理解和泛化能力非常有限,在直接提出有害问题或使用base64编码等新越狱技术时,检测率有时接近于零。根据我们的BELLS评分,仅仅询问通用LLM用户问题是否“有害”,其效果就大幅优于这些市面上的监督系统。但前沿LLM仍存在元认知不一致的问题,常常回答它们已正确识别为有害的查询(Claude 3.7最高达30%,Mistral Large超过50%)。这些结果表明,简单的脚手架可以显著提高滥用检测的鲁棒性,但仍需更多研究来评估此类技术的权衡。我们的结果支持滥用检测的“惨痛教训”:要检测多种多样的滥用和越狱,LLM的通用能力是必要的。
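The two-dimensional layout described above (harm severity crossed with adversarial sophistication) lends itself to a per-cell tally. The sketch below is only an illustration under stated assumptions: the records are made-up toy data, `flag_rates` is a hypothetical helper, and the quantity it computes is a plain flag rate per cell, not the BELLS score defined in the paper.

```python
from collections import defaultdict

# Toy, made-up records: (harm severity, adversarial sophistication, flagged by supervisor?)
records = [
    ("harmful", "direct", True),
    ("harmful", "direct", False),
    ("harmful", "jailbreak", False),
    ("harmful", "jailbreak", True),
    ("harmful", "jailbreak", False),
    ("benign", "direct", False),
    ("benign", "jailbreak", True),
    ("borderline", "direct", False),
]

def flag_rates(rows):
    """Return the fraction of prompts flagged in each (severity, sophistication) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [flagged, total]
    for severity, sophistication, flagged in rows:
        cell = counts[(severity, sophistication)]
        cell[0] += int(flagged)
        cell[1] += 1
    return {key: flagged / total for key, (flagged, total) in counts.items()}

rates = flag_rates(records)
for cell, rate in sorted(rates.items()):
    print(cell, round(rate, 2))
```

On harmful cells a high rate is a detection rate; on benign cells the same number is a false-positive rate, which is why the two axes are kept separate rather than pooled.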
Article 235
Title@2025-07-08 (2): A Multi-Task and Multi-Label Classification Model for Implicit Discourse Relation Recognition
Title: A Multi-Task and Multi-Label Classification Model for Implicit Discourse Relation Recognition | Ein Multi-Task- und Multi-Label-Klassifikationsmodell für die Implizite Diskursrelation-Erkennung | 多任务和多标签的隐性断断层确认分类模式 2408.08971v3 |
Authors (2): Nelson Filipe Costa, Leila Kosseim
We propose a novel multi-label classification approach to implicit discourse relation recognition (IDRR). Our approach features a multi-task model that jointly learns multi-label representations of implicit discourse relations across all three sense levels in the PDTB 3.0 framework. The model can also be adapted to the traditional single-label IDRR setting by selecting the sense with the highest probability in the multi-label representation. We conduct extensive experiments to identify optimal model configurations and loss functions in both settings. Our approach establishes the first benchmark for multi-label IDRR and achieves SOTA results on single-label IDRR using DiscoGeM. Finally, we evaluate our model on the PDTB 3.0 corpus in the single-label setting, presenting the first analysis of transfer learning between the DiscoGeM and PDTB 3.0 corpora for IDRR.
我们建议对隐性话语关系确认(IDR)采用新的多标签分类方法。我们的方法是一个多任务模式,共同学习PDTB3.0框架中所有三个感知层面的隐性话语关系的多标签表达方式。这个模式也可以通过在多标签代表中以最高概率选择一种感知来适应传统的单一标签的IDR设置。我们进行了广泛的实验,以确定两种环境中的最佳模式配置和损失功能。我们的方法为多标签的言语关系确认(IDR)建立了第一个基准,并利用DiscoGeM在单标签的IDR上取得SOTA结果。最后,我们评估了我们在单一标签设置中的PDTB3.0体模型,对DiscoGeM和PDTB3.0Corora之间的转移学习进行了首次分析。
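The adaptation from the multi-label output to the single-label setting, selecting the sense with the highest probability, can be sketched in a few lines. The probabilities below are invented for illustration, the four names are the PDTB 3.0 top-level senses, and the 0.5 threshold used for the multi-label reading is an assumption, not a value taken from the paper.

```python
# Hypothetical per-sense probabilities produced by a multi-label model
probs = {
    "Temporal": 0.10,
    "Contingency": 0.72,
    "Comparison": 0.05,
    "Expansion": 0.55,
}

def to_single_label(sense_probs):
    """Single-label setting: keep only the most probable sense."""
    return max(sense_probs, key=sense_probs.get)

def to_multi_label(sense_probs, threshold=0.5):
    """Multi-label setting: keep every sense at or above an assumed threshold."""
    return sorted(s for s, p in sense_probs.items() if p >= threshold)

single = to_single_label(probs)
multi = to_multi_label(probs)
print(single, multi)
```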
Article 236
Title@2025-07-08 (2): Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
Title: Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle | Sind LLMs prescient? Eine kontinuierliche Bewertung mit Daily News als Oracle | LLLM是否具有先锋精神?利用每日新闻作为甲骨文进行持续评估。 2411.08324v2 |
Authors (3): Hui Dai, Ryan Teehan, Mengye Ren
Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of a static set of questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs’ temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict “future” event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates. Code and data are available at https://agenticlearning.ai/daily-oracle.
由于新模型和训练数据不断出现,许多现有的大型语言模型(LLM)评估基准很快就会过时。这些基准由一组不含时间维度的静态问题组成,因此也难以评估LLM的性能如何随时间变化。为解决这些局限,我们提出将未来事件预测作为一种持续评估方法,用于评估LLM的时间泛化和预测能力。我们的基准Daily Oracle从每日新闻中自动生成问答(QA)对,要求LLM预测“未来”事件的结果。我们的研究结果表明,随着预训练数据逐渐过时,LLM的性能会随时间下降。检索增强生成(RAG)虽然有望提高预测准确率,但性能下降的趋势依然存在,这凸显了持续更新模型的必要性。代码和数据见https://agenticlearning.ai/daily-oracle。
Article 237
Title@2025-07-08 (2): The distribution of syntactic dependency distances
Title: The distribution of syntactic dependency distances | Die Verteilung der syntaktischen Abhängigkeitsabstände | 共同依赖距离分布 2211.14620v2 |
Authors (2): Sonia Petrini, Ramon Ferrer-i-Cancho
The syntactic structure of a sentence can be represented as a graph, where vertices are words and edges indicate syntactic dependencies between them. In this setting, the distance between two linked words is defined as the difference between their positions. Here we wish to contribute to the characterization of the actual distribution of syntactic dependency distances, which has previously been argued to follow a power-law distribution. We propose a new model with two exponential regimes in which the probability decay is allowed to change after a break-point. This transition could mirror the transition from the processing of word chunks to higher-level structures. We find that a two-regime model - where the first regime follows either an exponential or a power-law decay - is the most likely one in all 20 languages we considered, independently of sentence length and annotation style. Moreover, the break-point exhibits low variation across languages and averages 4-5 words, suggesting that the number of words that can be simultaneously processed abstracts from the specific language to a high degree. The probability decay slows down after the break-point, consistent with a universal chunk-and-pass mechanism. Finally, we give an account of the relation between the best estimated model and the closeness of syntactic dependencies as a function of sentence length, according to a recently introduced optimality score.
句子的合成结构可以用图表来表示, 顶点可以是单词, 边缘也可以表示它们之间的共性依赖性。 在这种背景下, 两个链接的字的距离被定义为其位置之间的差别。 在这里, 我们想帮助描述同义依赖距离的实际分布, 先前曾争论过这种分布是遵循一种权力法的分布。 我们在这里提出了一个新模式, 有两个指数性制度, 允许在断点后发生概率衰变。 这个过渡可以反映从处理字块到更高层次结构的过渡。 我们发现, 两种制度模式—— 第一个制度要么是指数化的, 要么是权力法的衰减—— 是我们所考虑的所有20种语言中最有可能的。 此外, 断点显示了语言和平均值4-5字之间的低差异, 这表明, 可以同时处理的字数从特定语言到高程度的精度。 在断点之后, 概率的衰减速度会减缓, 与一个普遍的大块和最精确的排序机制 。 最后, 我们给出了最接近的最精确的排序机制 。
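A two-regime exponential distribution with a break-point can be written down compactly. The following is a minimal sketch under assumptions: the parameter names (`break_point`, `rate1`, `rate2`), the truncation at `max_d`, and the continuity condition at the break-point are illustrative choices, and the paper's exact parameterization may differ.

```python
import math

def two_regime_pmf(max_d, break_point, rate1, rate2):
    """Probability of each distance d = 1..max_d.

    Up to the break-point the weight decays as exp(-rate1 * d); beyond it the
    decay rate switches to rate2, with the two pieces joined continuously.
    """
    weights = []
    for d in range(1, max_d + 1):
        if d <= break_point:
            log_w = -rate1 * d
        else:
            log_w = -rate1 * break_point - rate2 * (d - break_point)
        weights.append(math.exp(log_w))
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative values only: break-point at 4 words, slower decay afterwards
pmf = two_regime_pmf(max_d=20, break_point=4, rate1=0.6, rate2=0.2)
print([round(p, 4) for p in pmf[:6]])
```

Choosing `rate2 < rate1` reproduces the qualitative finding that the probability decay slows down after the break-point.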
Article 238
Title@2025-07-08 (2): Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger
Title: Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger | Adaptive Werkzeuganwendung in großen Sprachmodellen mit Meta-Cognition Trigger | 用于大语言模型的适应性工具使用 2502.12961v2 |
Authors (9): Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Liu
Large language models (LLMs) have shown remarkable emergent capabilities, transforming the execution of functional tasks by leveraging external tools for complex problems that require specialized processing or up-to-date data. While existing research expands LLMs' access to diverse tools (e.g., program interpreters, search engines, calculators), the necessity of using these tools is often overlooked, leading to indiscriminate tool invocation. This naive approach raises two key issues: increased latency due to unnecessary tool calls, and potential errors resulting from faulty interactions with external tools. In this paper, we introduce meta-cognition as a proxy for LLMs' self-assessment of their capabilities, reflecting the model’s awareness of its own limitations. Based on this, we propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space, guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs minimal cost. Experiments across multiple backbone models and benchmarks show that MeCo reliably detects LLMs’ internal cognitive signals and significantly improves tool-use decision-making.
大型语言模型(LLMS)显示了非凡的突发能力,通过利用外部工具解决需要专门处理或最新数据的复杂问题,改变了功能任务的执行。虽然现有研究扩大了LLMS获得各种工具(例如程序口译员、搜索引擎、计算器)的机会,但使用这些工具的必要性往往被忽视,导致工具的滥用。这种天真的方法提出了两个关键问题:不必要的工具呼叫导致的潜伏增加,以及与外部工具的错误互动造成的潜在错误。在本文中,我们引入了元认知作为LLMS能力自我评估的代名词,反映了该模型对自身局限性的认识。在此基础上,我们提出了Meco,即外部工具使用的适应决策战略。Meco通过在代表空间捕捉高级认知信号来量化元化分数,指导何时使用工具。值得注意的是,Meco正在微调,费用很低。在多个主干模型和基准的实验显示,Meco可靠地检测LMS的内部认知信号,并大大改进工具使用决策。
Article 239
Title@2025-07-08 (2): Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs
Title: Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs | Entropie-Memorisationsgesetz: Bewertung der Gedächtnis-Schwierigkeit der Daten in LLMs | 《信封-记忆法》:评估LLMM中数据的记忆难度 2507.06056v1 |
Authors (5): Yizhan Huang, Zhe Yang, Meifang Chen, Jianping Zhang, Michael R. Lyu
Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: how can the memorization difficulty of training data in LLMs be characterized? Through empirical experiments on OLMo, a family of open models, we present the Entropy-Memorization Law, which suggests that data entropy is linearly correlated with memorization score. Moreover, in a case study of memorizing highly randomized strings, or “gibberish”, we observe that such sequences, despite their apparent randomness, exhibit unexpectedly low empirical entropy compared to the broader training corpus. Adopting the same strategy used to discover the Entropy-Memorization Law, we derive a simple yet effective approach to distinguish training and testing data, enabling Dataset Inference (DI).
大型语言模型(LLMS)可以记住其培训数据的一部分,有时在适当提示下复制逐字复制内容。在这项工作中,我们调查了记忆化领域一个基本但探索不足的问题:如何用LLMS描述培训数据在记忆化方面的困难?通过对开放模型的大家庭OLMO的实验,我们介绍了“Entropy-Mymoriz化法 ” 。它表明数据昆虫与记忆化得分有线关联。此外,在对记忆化高度随机化的字符串或“妄想”进行的一项案例研究中,我们观察到,尽管这些序列显然随机性,但与更广泛的培训资料库相比,其实验性却出乎意料地低。我们采用同样的战略来发现Entropy-Memoriz化法,我们提出了一种简单而有效的方法来区分培训和测试数据,从而使得数据集出(DI) 。
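Empirical entropy, the quantity the law relates to memorization score, can be illustrated with the standard Shannon estimate over token frequencies. This is a sketch under assumptions: the paper's exact estimator and tokenization are not given in the abstract, and the character-level "tokens" below are toy inputs.

```python
import math
from collections import Counter

def empirical_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy sequences: few distinct symbols give low entropy, many give high entropy
repetitive = list("aaaa")
balanced = list("aabb")
varied = list("abcd")
print(empirical_entropy(repetitive), empirical_entropy(balanced), empirical_entropy(varied))
```

A string can look random to a reader and still draw on a small set of symbols, which is one way an apparently "gibberish" sequence can end up with low empirical entropy.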
Article 240
Title@2025-07-08 (2): Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review
Title: Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review | Überwindung von Datenknappheit bei generativen Sprachmodellierungen für Sprachen mit geringem Ressourcenbedarf: Eine systematische Überprüfung | 克服低资源语言创用语言建模中的数据稀缺性:系统审查 2505.04531v2 |
Authors (2): Josh McGiff, Nikola S. Nikolov
Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in natural language processing (NLP). This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRL). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small subset of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for underrepresented languages, a necessary step toward empowering LRL speakers and the preservation of linguistic diversity in a world increasingly shaped by large-scale language technologies.
随着诸如ChatGPT和Google Gemini等服务的出现,产生语言建模已大受欢迎。这些建模显示了生产力和通信的变革潜力,但绝大多数都满足了英语等高资源语言的需求。这加剧了对自然语言处理中语言不平等的担忧。本文件介绍了第一次系统化审查,具体侧重于解决低资源语言基因建模中数据稀缺问题的战略(LLLL)。我们从54项研究中发现、分类和评价技术方法,包括单语数据扩增、回译、多语种培训和迅速工程,跨越基因化任务。我们还分析了建筑选择、语言家庭代表性和评价方法方面的趋势。我们的调查结果突出表明了对基于变压器模型的高度依赖,集中关注少数LLLLs,缺乏对各种研究的一致评价。我们最后建议将这些方法推广到更广泛的LLLs,并概述在建立公平化语言系统方面的公开挑战。最后,这次审查旨在支持研究人员和开发者为代表性语言建设包容性的AI工具,这是向增强LLLL语言语言语言语言的权能和通过大规模技术在世界范围内维护语言多样性迈出的必要一步。
Article 241
Title@2025-07-08 (2): Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager
Title: Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager | Maßgeschneiderte Gespräche über LLMs hinaus: Ein RL-basierter Dialogmanager | 超出LLLM 的定制对话:基于 RL 的对话管理器 2506.19652v2 |
Authors (3): Lucie Galland, Catherine Pelachaud, Florian Pecune
In this work, we propose a novel framework that integrates large language models (LLMs) with an RL-based dialogue manager for open-ended dialogue with a specific goal. By leveraging hierarchical reinforcement learning to model the structured phases of dialogue and employing meta-learning to adapt across diverse user profiles, our approach enhances adaptability and efficiency, enabling the system to learn from limited data, transition fluidly between dialogue phases, and personalize responses to heterogeneous patient needs. We apply our framework to Motivational Interviews, aiming to foster behavior change, and demonstrate that the proposed dialogue manager outperforms a state-of-the-art LLM baseline in terms of reward, showing a potential benefit of conditioning LLMs to create open-ended dialogue systems with specific goals.
在这项工作中,我们提出了一个新框架,将大型语言模式(LLMs)与基于RL的对话管理者融合在一起,以便进行开放式对话,并有一个具体目标。 通过利用等级强化学习,模拟分阶段对话,并利用元学习,提高不同用户的适应性,我们的方法提高了适应性和效率,使系统能够从有限的数据中学习,在对话阶段之间流畅地过渡,以及针对不同病人的需要作出个性化反应。 我们将我们的框架应用于动机性访谈,目的是促进行为变化,并表明拟议的对话管理者在奖励方面超过了最先进的LLM基线,显示了有条件LMs在创建具有具体目标的开放式对话系统方面的潜在好处。
Article 242
Title@2025-07-08 (2): Empirical evidence of Large Language Model’s influence on human spoken communication
Title: Empirical evidence of Large Language Model’s influence on human spoken communication | Empirische Beweise für den Einfluss von Large Language Model auf die menschliche gesprochene Kommunikation | 大语言模式对人口交流的影响的经验证据 2409.01754v3 |
Authors (7): Hiromu Yakura, Ezequiel Lopez-Lopez, Levin Brinkmann, Ignacio Serna, Prateek Gupta, Ivan Soraperra, Iyad Rahwan
From the invention of writing and the printing press, to television and social media, human history is punctuated by major innovations in communication technology, which fundamentally altered how ideas spread and reshaped our culture. Recent chatbots powered by generative artificial intelligence constitute a novel medium that encodes cultural patterns in their neural representations and disseminates them in conversations with hundreds of millions of people. Understanding whether these patterns transmit into human language, and ultimately shape human culture, is a fundamental question. While fully quantifying the causal impact of a chatbot like ChatGPT on human culture is very challenging, lexicographic shifts in human spoken communication may offer an early indicator of such a broad phenomenon. Here, we apply econometric causal inference techniques to 740,249 hours of human discourse from 360,445 YouTube academic talks and 771,591 conversational podcast episodes across multiple disciplines. We detect a measurable and abrupt increase in the use of words preferentially generated by ChatGPT, such as delve, comprehend, boast, swift, and meticulous, after its release. These findings suggest a scenario where machines, originally trained on human data and subsequently exhibiting their own cultural traits, can, in turn, measurably reshape human culture. This marks the beginning of a closed cultural feedback loop in which cultural traits circulate bidirectionally between humans and machines. Our results motivate further research into the evolution of human-machine culture, and raise concerns over the erosion of linguistic and cultural diversity, and the risks of scalable manipulation.
从书写和印刷出版的发明到电视和社交媒体,人类历史都因通信技术的重大创新而支离破碎,这些创新从根本上改变了思想传播和改造我们的文化。最近以基因化人工智能为动力的聊天机器人构成了一种新颖的媒介,将文化模式编码在他们的神经表征中,并在与数以亿计的人的对话中传播。了解这些模式是否传播到人类语言并最终塑造人类文化,是一个根本问题。在充分量化像查特GPT这样的聊天机对人类文化的因果影响非常具有挑战性,但人类口语通信的地理变化可能提供如此广泛现象的早期指标。在这里,我们应用了计量因果推断技术到740 249小时的人类话语,来自360 445 YouTube学术谈话和771 591个对话播客流,跨多个学科。我们发现,这些模式是否传播到人类语言的优雅语言,例如调、理解、吹嘘、快速和细调等,在发布之后,使用这种语言的传动,这些结论表明,机器最初在人类数据上受过训练,并随后在人类文化循环中不断改变人类文化成果之间,从而在人类文化循环中呈现文化结果之间传播。我们的文化标志,我们发现,我们发现可测量和文化循环开始进入了文化循环之间,可以测量,我们的文化结果可以测量,可以进一步传播。
Article 243
Title@2025-07-08 (2): MemOS: A Memory OS for AI System
Title: MemOS: A Memory OS for AI System | MemOS: Ein Speicher-Betriebssystem für KI-System | MemOS: AI系统记忆OS 2507.03724v2 |
Authors (39): Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Junpeng Ren, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofeng Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong
Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency. Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods. While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations. Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.
大型语言模型(LLM)已成为通用人工智能(AGI)的重要基础设施,但它们缺乏定义明确的记忆管理系统,这阻碍了长上下文推理、持续个性化和知识一致性的发展。现有模型主要依赖静态参数和短暂的上下文状态,限制了其长期跟踪用户偏好或更新知识的能力。检索增强生成(RAG)虽然以纯文本形式引入了外部知识,但仍是一种无状态的权宜之计,缺乏生命周期控制,也未与持久表示相结合。近期工作从记忆层级的角度对LLM的训练和推理成本进行了建模,表明在参数记忆和外部检索之间引入显式记忆层,可以通过将特定知识外部化来大幅降低这些成本。除计算效率之外,LLM还面临信息随时间和上下文分布所带来的更广泛挑战,需要能够管理跨越不同时间尺度和来源的异构知识的系统。为应对这一挑战,我们提出MemOS,一个将记忆视为可管理系统资源的记忆操作系统。它统一了明文记忆、基于激活的记忆和参数级记忆的表示、调度与演化,实现了低成本的存储与检索。作为基本单元,MemCube同时封装了记忆内容以及来源、版本等元数据。MemCube可以随时间进行组合、迁移和融合,从而支持记忆类型之间的灵活转换,并将检索与基于参数的学习衔接起来。MemOS建立了一个以记忆为中心的系统框架,为LLM带来可控性、可塑性和可演化性,为持续学习和个性化建模奠定基础。
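The idea of a unit that carries memory content together with provenance and version metadata can be pictured as a small immutable record. The sketch below is purely hypothetical: the field names, the `update` and `fuse` helpers, and the versioning rule are assumptions made for illustration and are not the MemOS API.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class MemCube:
    """Hypothetical memory unit: content plus provenance and version metadata."""
    content: str
    memory_type: str  # assumed values: "plaintext", "activation", "parameter"
    provenance: str
    version: int = 1

def update(cube, new_content, source):
    """Return a new version of the cube; the old version stays untouched."""
    return replace(cube, content=new_content, provenance=source, version=cube.version + 1)

def fuse(a, b):
    """Merge two cubes of the same type into one, recording both origins."""
    if a.memory_type != b.memory_type:
        raise ValueError("this sketch only fuses cubes of the same memory type")
    return MemCube(
        content=a.content + "\n" + b.content,
        memory_type=a.memory_type,
        provenance=a.provenance + "+" + b.provenance,
        version=max(a.version, b.version) + 1,
    )

v1 = MemCube("User prefers metric units.", "plaintext", "chat-session-1")
v2 = update(v1, "User prefers metric units and 24h time.", "chat-session-2")
merged = fuse(v2, MemCube("User writes in English.", "plaintext", "profile-import"))
print(v2.version, merged.provenance)
```

Keeping cubes immutable means every change produces a new version, so earlier states remain available for auditing or rollback.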
Article 244
Title@2025-07-08 (2): OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model
Title: OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model | OpenS2S: Vollkommen offenes Source-End-to-End-Empathetik-Modell für große Sprache | OpenS2S: 推进完全开放源全开放源端至端至端受同情型大型语音语言模式 2507.05177v2 |
Authors (11): Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang
Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S
共情交互是人机交流的基石,因为它需要理解富含副语言线索的语音,并生成富有情感和表现力的回应。然而,最强大的共情大型语音语言模型(LSLM)日益封闭,其架构、数据和开发的关键细节对研究人员并不透明。鉴于对LSLM和共情行为开展透明研究的迫切需要,我们提出OpenS2S,一个完全开源、透明的端到端LSLM,旨在实现共情语音交互。OpenS2S以我们的共情语音到文本模型BLSP-Emo为基础,进一步采用流式交错解码架构来实现低延迟语音生成。为便于端到端训练,OpenS2S包含一个自动化数据构建流程,能够以低成本合成多样、高质量的共情语音对话。通过利用大型语言模型生成共情内容,并利用可控的文本到语音系统引入说话人和情感变化,我们构建了一个具有丰富副语言多样性、只需极少人工监督的可扩展训练语料库。我们发布了完全开源的OpenS2S模型,包括数据集、模型权重、预训练和微调代码,以助力更广泛的研究社区,加速共情语音系统的创新。项目网页见https://casia-lm.github.io/OpenS2S
Article 245
Title@2025-07-08 (2): Conditional Multi-Stage Failure Recovery for Embodied Agents
Title: Conditional Multi-Stage Failure Recovery for Embodied Agents | Bedingte Multi-Stage Failure Recovery für eingedickte Agenten | 内装药剂有条件的多阶段故障恢复 2507.06016v1 |
Authors (5): Youmna Farag, Svetlana Stoyanchev, Mohan Li, Simon Keizer, Rama Doddipatla
Embodied agents performing complex tasks are susceptible to execution failures, motivating the need for effective failure recovery mechanisms. In this work, we introduce a conditional multistage failure recovery framework that employs zero-shot chain prompting. The framework is structured into four error-handling stages, with three operating during task execution and one functioning as a post-execution reflection phase. Our approach utilises the reasoning capabilities of LLMs to analyse execution challenges within their environmental context and devise strategic solutions. We evaluate our method on the TfD benchmark of the TEACH dataset and achieve state-of-the-art performance, outperforming a baseline without error recovery by 11.5% and surpassing the strongest existing model by 19%.
在这项工作中,我们引入了一个有条件的多阶段故障回收框架,采用零速链推进法。框架分为四个错误处理阶段,任务执行期间有三个操作阶段,一个作为执行后反思阶段。我们的方法利用LLMS的推理能力,分析环境环境中的执行挑战并制定战略解决方案。我们评估了我们在TACH数据集TfD基准上的方法,并取得了最新业绩,比基准无误回收率高了11.5%,比现有最强的模型高出19%。
Article 246
Title@2025-07-08 (2): DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations
Title: DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations | DocIE@XLLM25: In-Context-Lernen zur Informationsextraktion mit vollständig synthetischen Demonstrationen | DocIE@XLLLM25:利用全面合成演示进行信息提取的内文学习 2507.05997v1 |
Authors (4): Nicholas Popovič, Ashish Kangen, Tim Schopf, Michael Färber
Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach we produce a synthetic dataset of over 5k Wikipedia abstracts with approximately 59k entities and 30k relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. We find that in-context joint entity and relation extraction at document-level remains a challenging task, even for state-of-the-art large language models.
大量、高质量的附加说明公司在文件级别实体和零发或几发环境中的关系提取中仍然缺乏大量高质量的高分数。在本文中,我们提出了一个完全自动的、基于LLM的管道,用于合成数据生成和文件级别实体和关系提取的全文学习。与目前依靠人工附加说明的演示或直接零发推理的方法相比,我们的方法将合成数据生成与基于检索的文本内学习结合起来,采用推理优化语言模式。这使我们能够建立一个高质量的演示数据库,而无需人工说明,在推断时间动态地检索相关实例。基于我们的方法,我们制作了一个5千多美元的合成数据集,约59千美元实体和30千元关系三重。最后,我们评价了DocIE共同任务的内文学习业绩,在零点环境中从长份文件中提取实体和关系。我们发现,在文件层面上的文本联合实体和关系提取仍然是一项艰巨的任务,即使是在最先进的大语言模型上。
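Retrieval-based in-context learning, as described above, picks the most relevant demonstrations from a database at inference time. The sketch below assumes a bag-of-words cosine similarity as the retriever, which is a stand-in chosen for self-containment; the abstract does not say which retriever the authors use, and the demonstration entries and relation names are invented for illustration.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, demos, k=2):
    """Return the k demonstrations whose text is most similar to the query."""
    q = bow(query)
    ranked = sorted(demos, key=lambda d: cosine(q, bow(d["text"])), reverse=True)
    return ranked[:k]

# Hypothetical synthetic demonstrations: text plus annotated relation triples
demos = [
    {"text": "Marie Curie was born in Warsaw", "triples": [("Marie Curie", "born_in", "Warsaw")]},
    {"text": "The Danube flows through Vienna", "triples": [("Danube", "flows_through", "Vienna")]},
    {"text": "Alan Turing was born in London", "triples": [("Alan Turing", "born_in", "London")]},
]

top = retrieve("Ada Lovelace was born in London", demos, k=2)
prompt = "\n".join(f"Text: {d['text']}\nTriples: {d['triples']}" for d in top)
print(prompt)
```

The retrieved demonstrations would then be placed in the prompt ahead of the target document; a dense embedding model could replace `bow` without changing the rest of the flow.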
Article 247
Title@2025-07-08 (2): Evolution without Large Models: Training Language Model with Task Principles
Title: Evolution without Large Models: Training Language Model with Task Principles | Evolution ohne große Modelle: Sprachmodell mit Aufgabenprinzipien trainieren | 无大型号的演变演变:具有任务原则的培训语言模式 2507.05991v1 |
Authors (8): Minghang Zhu, Shen Gao, Zhengliang Shi, Jiabao Fang, Pengjie Ren, Zhaochun Ren, Zhumin Chen, Shuo Shang
A common training approach for language models involves using a large-scale language model to expand a human-provided dataset, which is subsequently used for model training. This method significantly reduces training costs by eliminating the need for extensive human data annotation. However, it still faces challenges such as high carbon emissions during data augmentation and the risk of data leakage when we use closed-source LLMs. To address these issues, we propose a self-evolution method for language models. First, we introduce the Multi-level Principle Generation, which enables a large-scale model to summarize task-completion principles based on a small amount of task data. Then, we propose the Principle-based Instance Generation, in which a smaller-scale language model uses these task principles to generate a large amount of data. This data is then used for model training. Experimental results show that our proposed method significantly improves model performance compared to directly using a smaller-scale language model to generate data. Additionally, since we only use the large-scale language model to generate the task-completion principles, the carbon emissions associated with training the model are greatly reduced.
语言模型的共同培训方法涉及使用大规模语言模型扩大人提供的数据集,随后用于模式培训。这一方法通过消除对大量人类数据说明的需求,大大减少培训费用,但仍然面临数据扩充期间的碳排放量高以及使用封闭源LLMs时数据泄漏风险等挑战。为了解决这些问题,我们提议了语言模型的自我演进方法。首先,我们引入了多层次原则生成模式,使一个大型模型能够根据少量任务数据总结任务完成原则。然后,我们提出了基于原则的实例生成,在其中,一个较小的语言模型使用这些任务原则生成大量数据。这些数据随后用于模式培训。实验结果表明,与直接使用较小规模语言模型生成数据相比,我们的拟议方法大大改进了模型性能。此外,由于我们只是使用大型语言模型来生成任务完成原则,因此与培训有关的碳排放量大大减少。
Article 248
Title@2025-07-08 (2): Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening
Title: Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening | Entwicklung und Evaluation von HopeBot: ein LLM-basierter Chatbot für strukturiertes und interaktives PHQ-9-Depressionsscreening | HopeBot的发展和评价:一个基于LLM的聊天室,用于结构化和互动的PHQ-9抑郁症筛查 2507.05984v1 |
Authors (8): Zhijun Guo, Alvina Lai, Julia Ive, Alexandru Petcu, Yutong Wang, Luyuan Qi, Johan H Thygesen, Kezhi Li
Static tools like the Patient Health Questionnaire-9 (PHQ-9) effectively screen depression but lack interactivity and adaptability. We developed HopeBot, a chatbot powered by a large language model (LLM) that administers the PHQ-9 using retrieval-augmented generation and real-time clarification. In a within-subject study, 132 adults in the United Kingdom and China completed both self-administered and chatbot versions. Scores demonstrated strong agreement (ICC = 0.91; 45% identical). Among 75 participants providing comparative feedback, 71% reported greater trust in the chatbot, highlighting clearer structure, interpretive guidance, and a supportive tone. Mean ratings (0-10) were 8.4 for comfort, 7.7 for voice clarity, 7.6 for handling sensitive topics, and 7.4 for recommendation helpfulness; the latter varied significantly by employment status and prior mental-health service use (p < 0.05). Overall, 87.1% expressed willingness to reuse or recommend HopeBot. These findings demonstrate voice-based LLM chatbots can feasibly serve as scalable, low-burden adjuncts for routine depression screening.
患者健康问卷-9(PHQ-9)等静态工具有效筛选抑郁症,但缺乏互动性和适应性。我们开发了HopeBot,这是一个由大型语言模型(LLM)驱动的聊天机,使用检索增强的一代和实时澄清来管理PHQ-9;在一项学科内研究中,联合王国和中国的132名成年人既完成了自我管理和聊天机的版本,分数表现出强烈的一致(ICC=0.91;45%相同)。在提供比较反馈的75名参与者中,71%的人报告说对聊天机有更大的信任,突出更清晰的结构、解释性指导和支持性调子。平均评级(0-10)为8.4用于舒适,7.7用于声音清晰,7.6用于处理敏感专题,7.4用于建议有用性;后者因就业状况和先前的心理健康服务使用(p < 0.05)而大不相同。总体而言,87.1%的人表示愿意再利用或推荐HopeBot。这些调查结果表明,基于声音的LM聊天机可以被视为可伸缩缩缩、低负担的辅助剂。
Article 249
Title@2025-07-08 (2): RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages
Title: RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | RabakBench: Skalierung menschlicher Anmerkungen zum Aufbau lokalisierter Mehrsprachiger Sicherheits-Benchmarks für ressourcenarme Sprachen | Rabak Rabak Bench:提高人文说明,以建立本地化的低资源语言多语文安全基准 2507.05980v1 |
Authors (4): Gabriel Chua, Leanne Tan, Ziyu Ge, Roy Ka-Wei Lee
Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages due to limited training data and evaluation benchmarks. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore’s unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate - adversarial example generation by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label - semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; and (iii) Translate - high-fidelity translation preserving linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including the human-verified translations, and evaluation code are publicly available.
由于培训数据和评价基准有限,大型语言模型(LLMS)及其安全分类者往往在低资源语言上表现不佳。本文介绍RabakBench,这是新加坡独特的语言背景所特有的一个新的多语言安全基准,涵盖Singlish、中文、马来语和泰米尔语。 RabakBench是通过一个可扩缩的三阶段管道建造的:(一) 与LLM驱动的红色团队合作,通过增加真正的Singlish网络内容,生成创制-对抗性范例;(二) Label - 半自动多标签安全说明,使用多数投票的LLM标签与人类判断一致;(三) Translate - 高纤维翻译,保存语言微小和跨语言的毒性。最后数据集包括5 000多个四种语言的安全标签示例和六个严重程度精细的安全分类。对11个流行的开放源和封闭源保护铁路分类的评审显示显著的性能退化。RabakBench不仅使得东南亚多语言环境中能够进行强有力的安全评价,而且还为在低资源环境中建立本地安全数据集提供了一个可复制的框架,包括公开翻译。
Article 250
Title@2025-07-08 (2): We Should Evaluate Real-World Impact
Title: We Should Evaluate Real-World Impact | Wir sollten die Auswirkungen der realen Welt bewerten | 我们应该评估现实世界的影响 2507.05973v1 |
Authors (1): Ehud Reiter
The ACL community has very little interest in evaluating the real-world impact of NLP systems. A structured survey of the ACL Anthology shows that perhaps 0.1% of its papers contain such evaluations; furthermore most papers which include impact evaluations present them very sketchily and instead focus on metric evaluations. NLP technology would be more useful and more quickly adopted if we seriously tried to understand and evaluate its real-world impact.
ACL社区对评价NLP系统对现实世界的影响兴趣不大,对ACLAnthlogy的结构性调查显示,其文件中可能0.1%包含这种评价;此外,大多数包括影响评估的文件都非常粗略地展示了这些评价,而是侧重于衡量评价。 如果我们认真理解和评价其现实世界的影响,NLP技术会更有用,更快地被采用。
Article 251
Title@2025-07-08 (2): Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model
Title: Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model | Analytisches Subraum-Routing: Wie rekursive Least Squares im kontinuierlichen Lernen des großen Sprachmodells funktionieren | 分析子空间运行分析:在继续不断学习大语言模式中如何使最起码的广场恢复 2503.13575v2 |
Authors (8): Kai Tong, Kang Pan, Xiao Zhang, Erli Meng, Run He, Yawen Cui, Nuoyan Guo, Huiping Zhuang
Large Language Models (LLMs) possess broad capabilities and can process diverse language-related tasks. However, finetuning LLMs diminishes these general skills, and continual finetuning further causes severe degradation of accumulated knowledge. Recently, Continual Learning (CL) for LLMs has emerged, aiming to continually adapt LLMs to new tasks while maintaining previously learned knowledge and inheriting general skills. Existing techniques either replay previous data, leading to extra computational costs, or use a single parameter-efficient module to learn the downstream task, which constrains the absorption of new knowledge because of interference between different tasks. To address these issues, this paper proposes Analytic Subspace Routing (ASR). For each task, we isolate the learning within a subspace of deep layers’ features via low-rank adaptation, eliminating knowledge interference between different tasks. Additionally, we propose an analytic routing mechanism to properly utilize knowledge learned in different subspaces. Our approach employs Recursive Least Squares to train a multi-task router model, allowing the router to dynamically adapt to incoming data without requiring access to historical data. Also, the router effectively assigns the current task to an appropriate subspace and has a non-forgetting property for previously learned tasks with a solid theoretical guarantee. Experimental results demonstrate that our method achieves near-perfect retention of prior knowledge while seamlessly integrating new information, effectively overcoming the core limitations of existing methods. Our code will be released after acceptance.
大型语言模型(LLMS)拥有处理多种语言相关任务的能力。然而,微调LLMS将减少这种一般技能和不断微调,将进一步导致积累知识的严重退化。最近,大语言模型(LLMS)的连续学习(CLS)产生,目的是不断使LLMS适应新任务,同时保持先前学到的知识并继承一般技能。现有的技术要么利用先前的数据重播,导致额外的计算成本,要么利用单一的参数高效模块学习下游任务,限制不同任务之间的新知识吸收。针对这些问题,本文提议分析反空子运行(ASR)以应对这些挑战。对于每一项任务,我们通过低层次的适应将深层特征的子空间中的学习隔离开来,消除不同任务之间的知识干扰。此外,我们提议一个分析性路由机制来正确利用不同亚空间所学的知识,导致额外的计算成本,或者利用一个单一的参数高效模块来培训多任务路由模型,使路由器能够动态地适应即将收到的数据,而不需要在接近的亚空间轨道上进行精确地调整,同时将我们所学到的当前任务的结果交付。另外,我们可以有效地展示一个不精确地在前的轨道上获取的路径上取得一个不精确的路径上获得的路径。
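As a rough illustration of the Recursive Least Squares idea mentioned in the abstract above, the sketch below fits a linear model one sample at a time without storing earlier samples. It is a minimal, assumption-laden example and not the authors' router: the function names `rls_init` and `rls_update`, the scalar regression target, and the `ridge` regularizer are all hypothetical choices made here for illustration.

```python
def rls_init(dim, ridge=1.0):
    """Return zero weights and P = (ridge * I)^-1, the inverse regularized Gram matrix."""
    weights = [0.0] * dim
    P = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    return weights, P


def rls_update(weights, P, x, y):
    """One recursive least squares step for a single sample (x, y).

    Only the current weights and P are needed, so past data never has to be revisited.
    """
    dim = len(x)
    # Px = P @ x
    Px = [sum(P[i][j] * x[j] for j in range(dim)) for i in range(dim)]
    # Gain vector K = P x / (1 + x^T P x)
    denom = 1.0 + sum(x[i] * Px[i] for i in range(dim))
    gain = [v / denom for v in Px]
    # Prediction error on the new sample, measured before the update
    error = y - sum(weights[i] * x[i] for i in range(dim))
    new_weights = [weights[i] + gain[i] * error for i in range(dim)]
    # P <- P - K (x^T P); P is symmetric, so x^T P is Px read as a row vector
    new_P = [[P[i][j] - gain[i] * Px[j] for j in range(dim)] for i in range(dim)]
    return new_weights, new_P
```

With zero initial weights, the result after processing all samples matches the batch ridge-regression solution for the same `ridge` value, which is why no replay of historical data is required in this toy setting.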
Article 252
Title@2025-07-08 (2): FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights
Title: FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights | FRAME: Feedback-Refined Agent Methodik zur Verbesserung medizinischer Forschungs-Insights | FRAME: 改进医学研究透视的反馈改进剂方法 2505.04649v2 |
Authors (6): Chengzhang Yu, Yiming Zhang, Zhixin Liu, Zenghui Ding, Yining Sun, Zhanpeng Jin
The automation of scientific research through large language models (LLMs) presents significant opportunities but faces critical challenges in knowledge synthesis and quality assurance. We introduce Feedback-Refined Agent Methodology (FRAME), a novel framework that enhances medical paper generation through iterative refinement and structured feedback. Our approach comprises three key innovations: (1) A structured dataset construction method that decomposes 4,287 medical papers into essential research components through iterative refinement; (2) A tripartite architecture integrating Generator, Evaluator, and Reflector agents that progressively improve content quality through metric-driven feedback; and (3) A comprehensive evaluation framework that combines statistical metrics with human-grounded benchmarks. Experimental results demonstrate FRAME’s effectiveness, achieving significant improvements over conventional approaches across multiple models (9.91% average gain with DeepSeek V3, comparable improvements with GPT-4o Mini) and evaluation dimensions. Human evaluation confirms that FRAME-generated papers achieve quality comparable to human-authored works, with particular strength in synthesizing future research directions. The results demonstrate that our work could efficiently assist medical research by building a robust foundation for automated medical research paper generation while maintaining rigorous academic standards.
通过大型语言模型实现科学研究自动化(LLMS)提供了重要机会,但在知识合成和质量保证方面面临重大挑战。我们引入了反馈改良剂方法(FRAME),这是一个通过迭代完善和结构化反馈加强医疗用纸张制作的新框架。我们的方法包括三个关键创新:(1) 结构化的数据集构建方法,通过迭代完善将4 287份医疗文件分解为基本研究组成部分;(2) 将发电机、评价员和思考器结合起来的三方结构,通过量度驱动反馈逐步提高内容质量;(3) 将统计指标与人基基准相结合的综合评价框架。实验结果显示FRAME的效力,在多种模型的传统方法上取得了显著的改进(Deep Seek V3, 与GPT-4o Mini相似的改进平均收益9.91% ) 和评价层面。人类评价证实,FRAME产生的文件的质量与人类著作具有可比性,在合成未来研究方向方面特别强大。结果表明,我们的工作可以通过在保持严格的学术标准的同时为自动化医疗研究造纸建立一个坚实的基础,从而有效地协助医学研究研究。
Article 253
Title@2025-07-08 (2): OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation
Title: OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation | OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation | OpenFactScore:公开源码原子对制文本实际质量的评估 2507.05965v1 |
Authors (2): Lucas Fonseca Lage, Simon Ostermann
We introduce OpenFActScore, an open-source implementation of the FActScore framework for evaluating the factuality of text generated by large language models (LLMs). FActScore evaluates the factual accuracy of long-form text by using Atomic Fact Generation (AFG) to extract individual factual claims and Atomic Fact Validation (AFV) to verify each claim against a trusted knowledge source. While the original FActScore relies on closed-source and commercial models such as InstructGPT and ChatGPT, OpenFActScore enables the use of any Hugging Face-compatible model for both AFG and AFV. We provide a detailed technical overview of our implementation, highlighting design choices and modifications made to support open models. We evaluate multiple open-source LLMs on both AFG and AFV using the original FActScore benchmark, reporting BERTScore-F1 for AFG and Error Rate relative to human annotations for AFV. Our results show that open models can approximate the performance of closed-source systems, with Gemma achieving the best overall performance, and our final setup obtains a 0.99 Pearson correlation with the original FActScore experiments. OpenFActScore promotes transparency, reproducibility, and cost-effective evaluation, and is available at: https://github.com/lflage/OpenFActScore.
我们引入了OpenFactScore(OpenFactScore)框架(OpenFactScore),用于评估大型语言模型(LLM)产生的文本的真实性。FacScore(OpenFactScore)通过利用原子事实生成(AFG)来提取个人事实索赔和原子事实验证(AFV)来根据可信赖的知识来源核实每项索赔,来评估长文本的真实性。虽然原始FactScore(OpenFactScore)依靠封闭源和商业模型,如StechGPT和ChattGPTGPT, OpfFACore(OFCore)能够为AFG和AFV(AFLM)使用任何能动的面相容模型。我们提供了我们实施情况的详细技术概览,突出设计选择和修改以支持开放模型。我们利用原始的FCAFCS基准对AFG和AFV(AF)的多种开源 LLM-FM-FMLM(BER)进行评估。我们的结果显示,开放模型可以对封闭源系统进行最佳的总体性业绩,我们的最后设置将获得0.99PearFstoFstoFs reclentalstalstalstal-Recolviclatealalview。我们可以提供透明性成本/FS(Ocolveralview)和原始成本评估。
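The two-stage scoring described in the abstract above (extract atomic facts, then validate each one) reduces to a simple ratio once both stages exist. The sketch below shows only that final aggregation under stated assumptions: `factscore` and the `is_supported` callback are hypothetical names introduced here, the atomic facts are assumed to be already extracted, and nothing here reflects the actual OpenFActScore API.

```python
from typing import Callable, Iterable


def factscore(atomic_facts: Iterable[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts that the validator marks as supported.

    `is_supported` stands in for an Atomic Fact Validation step, for example
    a model that checks a claim against a trusted knowledge source.
    """
    facts = list(atomic_facts)
    if not facts:
        # Convention chosen for this sketch: an empty generation scores 0.
        return 0.0
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)
```

For a quick check, a set of known-true strings can play the role of the knowledge source, e.g. `factscore(facts, lambda f: f in knowledge)`.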
Article 254
Title@2025-07-08 (2): Rethinking Associative Memory Mechanism in Induction Head
Title: Rethinking Associative Memory Mechanism in Induction Head | Rethinking Assoziative Memory Mechanism in Induktion Head | 上岗时重新思考联合记忆机制 2412.11459v2 |
Authors (2): Shuo Wang, Issei Sato
The induction head mechanism is part of the computational circuits for in-context learning (ICL) that enable large language models (LLMs) to adapt to new tasks without fine-tuning. Most existing work explains the training dynamics behind acquiring such a powerful mechanism. However, the model’s ability to coordinate in-context information over long contexts and global knowledge acquired during pretraining remains poorly understood. This paper investigates how a two-layer transformer thoroughly captures in-context information and balances it with pretrained bigram knowledge in next token prediction, from the viewpoint of associative memory. We theoretically analyze the representation of weight matrices in attention layers and the resulting logits when a transformer is given prompts generated by a bigram model. In the experiments, we design specific prompts to evaluate whether the outputs of the trained transformer align with the theoretical results.
上传头机制是内流学习计算线路的一部分,它使大型语言模型(LLMs)能够适应新的任务而无需微调。大多数现有工作解释了获得这样一个强大机制背后的培训动态。然而,该模型协调长长背景下的内流信息以及培训前获得的全球知识的能力仍然不甚明了。本文从联系记忆的角度探讨两层变压器如何彻底捕捉内流信息,并在下一个象征性的预测中将它与预先训练的大型知识相平衡。我们从理论上分析了注意层中重力矩阵的表示以及当一个变压器被一个大ram模型所激发时所产生的逻辑记录。在实验中,我们设计了具体的速率来评价受过训练的变压器的产出是否与理论结果一致。
Article 255
Title@2025-07-08 (2): Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models
Title: Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models | Neudefinition von Bewertungsstandards: Ein einheitlicher Rahmen für die Bewertung der koreanischen Fähigkeiten von Sprachmodellen | 重新界定评价标准:评价韩国语言模式能力的统一框架 2503.22968v4 |
Authors (9): Hanwool Lee, Dasol Choi, Sooyong Kim, Ilgyun Jung, Sangwon Baek, Guijin Son, Inseon Hwang, Naeun Lee, Seunghyeok Hong
Recent advancements in Korean large language models (LLMs) have driven numerous benchmarks and evaluation methods, yet inconsistent protocols cause performance gaps of up to 10 percentage points across institutions. Overcoming these reproducibility gaps does not mean enforcing a one-size-fits-all evaluation. Rather, effective benchmarking requires diverse experimental approaches and a framework robust enough to support them. To this end, we introduce HRET (Haerae Evaluation Toolkit), an open-source, registry-based framework that unifies Korean LLM assessment. HRET integrates major Korean benchmarks, multiple inference backends, and multi-method evaluation, with language consistency enforcement to ensure genuine Korean outputs. Its modular registry design also enables rapid incorporation of new datasets, methods, and backends, ensuring the toolkit adapts to evolving research needs. Beyond standard accuracy metrics, HRET incorporates Korean-focused output analyses (a morphology-aware Type-Token Ratio (TTR) for evaluating lexical diversity, and systematic keyword-omission detection for identifying missing concepts) to provide diagnostic insights into language-specific behaviors. These targeted analyses help researchers pinpoint morphological and semantic shortcomings in model outputs, guiding focused improvements in Korean LLM development.
韩国大型语言模型(LLMS)最近的进展推动了许多基准和评价方法,但协议不一致导致各机构的绩效差距高达10 p.p.p.;克服这些再生差距并不意味着执行 “ 一刀切 “ 的评价;相反,有效的基准要求采用不同的实验办法和足以支持这些评价的框架;为此,我们引入了一个开放源码、基于登记册的框架,统一韩国LLM评估。HRET将韩国主要基准、多重推论后端和多方法评价与语言一致性执行相结合,以确保朝鲜真正的产出。其模块化登记册的设计还使得能够快速纳入新的数据集、方法和后端,确保工具包适应不断变化的研究需求。除了标准精确度指标外,HRET还采用韩国重点产出分析-形态识别模式比(TTR),用于评估词汇多样性和系统关键内容检测,以识别缺失的概念,从而提供语言特定行为方面的诊断性洞察力。这些有针对性的分析有助于研究人员对模型产出中的形态和语义缺陷进行定位,指导韩国LM的改进。
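The two output analyses named in the abstract above can be sketched in a few lines. This is an illustrative approximation, not HRET's implementation: `type_token_ratio` and `missing_keywords` are hypothetical names, and the token list is assumed to come from a Korean morphological analyzer that is not shown here (whitespace splitting is only a stand-in and is not morphology-aware).

```python
def type_token_ratio(tokens):
    """Lexical diversity: number of distinct tokens divided by total tokens.

    For a morphology-aware variant, pass morphemes from a Korean analyzer
    instead of whitespace-separated words (assumption of this sketch).
    """
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def missing_keywords(text, keywords):
    """Return the expected keywords that never appear in the model output."""
    return [keyword for keyword in keywords if keyword not in text]
```

Substring matching is used for keyword detection because Korean particles attach directly to nouns; a production check would likely need lemmatization, which this sketch leaves out.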
Article 256
Title@2025-07-08 (2): Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems
Title: Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems | Chat-Ghosting: Eine vergleichende Untersuchung von Methoden zur Auto-Completion in Dialogsystemen | 对话系统自动完成方法比较研究 2507.05940v1 |
Authors (6): Sandeep Mishra, Anubhab Mandal, Bishal Santra, Tushar Abhishek, Pawan Goyal, Manish Gupta
Ghosting, the ability to predict a user’s intended text input for inline query auto-completion, is an invaluable feature for modern search engines and chat interfaces, greatly enhancing user experience. By suggesting completions to incomplete queries (or prefixes), ghosting aids users with slow typing speeds, disabilities, or limited language proficiency. Ghosting is a challenging problem and has become more important with the ubiquity of chat-based systems such as ChatGPT and Copilot. Despite the increasing prominence of chat-based systems utilizing ghosting, the problem of Chat-Ghosting has received little attention from the NLP/ML research community. There is a lack of standardized benchmarks and of relative performance analysis of deep learning and non-deep learning methods. We address this through an open and thorough study of this problem using four publicly available dialog datasets: two human-human (DailyDialog and DSTC7-Ubuntu) and two human-bot (Open Assistant and ShareGPT). We experiment with various existing query auto-completion methods (using tries), n-gram methods and deep learning methods, with and without dialog context. We also propose a novel entropy-based dynamic early stopping strategy. Our analysis finds that statistical n-gram models and tries outperform deep learning based models in terms of both model performance and inference efficiency for seen prefixes. For unseen queries, neural models like T5 and Phi-2 lead to better results. Adding conversational context leads to significant improvements in ghosting quality, especially for Open-Assistant and ShareGPT. We make code and data publicly available.
幽灵,即能够预测用户为内线查询自动完成而准备的文本输入,是现代搜索引擎和聊天界面的宝贵特征,大大提高了用户经验。通过建议完成不完整查询(或前缀)、隐形帮助用户打字速度缓慢、残疾或语言熟练程度有限的问题,隐形帮助用户。隐形是一个具有挑战性的问题,而且随着聊天的系统,如ChatGPT、Copilit等以聊天为基础的系统无处不在,这种能力也变得更加重要。尽管聊天系统利用幽灵越来越突出,但这种具有挑战性的“聊天接待”质量问题很少受到NLP/ML研究界的注意。深层学习和非深层学习方法缺乏标准化基准和相对绩效分析。我们通过公开的4个对话数据集(DailyDialog和DSTC7-Ubuntu)和2个“人文”模型(公开助理和ShareGPTTT),我们尝试了各种现有的“自动完成”方法(在尝试中尝试尝试)、正格方法,在深层次学习和深层次的深度对话中,特别用基于数据分析工具,并在观察中找到“智能分析中找到一种基于的“数据”。
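To make the trie baseline mentioned in the abstract above concrete, here is a minimal sketch of prefix completion over previously seen utterances. It is an assumption-based illustration rather than the paper's method: the class names, the character-level trie, and the rule of returning the most frequent stored utterance are choices made here.

```python
from typing import Optional


class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # how many stored utterances end exactly at this node


class CompletionTrie:
    """Character-level trie that completes a prefix with the most frequent utterance."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, utterance: str) -> None:
        node = self.root
        for ch in utterance:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def complete(self, prefix: str) -> Optional[str]:
        # Walk down to the node for the prefix; unseen prefixes have no completion.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        # Depth-first search below the prefix for the most frequent utterance.
        best, best_count = None, 0
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.count > best_count:
                best, best_count = text, current.count
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return best
```

The `None` result for an unseen prefix is the case where, according to the abstract, neural models tend to do better than tries.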
Article 257
Title@2025-07-08 (2): Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors
Title: Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors | Denken Sie daran, Vergangenheit, Zukunft antizipieren: Lernen kontinuierlich multimodale Fehlinformation Detectors | 记住过去,预期未来:学习持续多模式错误信息探测器 2507.05939v1 |
Authors (7): Bing Wang, Ximing Li, Mengzhe Ye, Changchun Li, Bo Fu, Jianfeng Qu, Lin Yuanbo Wu
Nowadays, misinformation articles, especially multimodal ones, are widely spread on social media platforms and cause serious negative effects. To control their propagation, Multimodal Misinformation Detection (MMD) has become an active topic in the community, aiming to automatically identify misinformation. Previous MMD methods focus on supervising detectors by collecting offline data. However, in real-world scenarios, new events continually emerge, making MMD models trained on offline data consistently outdated and ineffective. To address this issue, training MMD models under online data streams is an alternative, inducing an emerging task named continual MMD. Unfortunately, it is hindered by two major challenges. First, training on new data consistently decreases the detection performance on past data, named past knowledge forgetting. Second, the social environment constantly evolves over time, affecting the generalization on future data. To alleviate these challenges, we propose to remember past knowledge by isolating interference between event-specific parameters with a Dirichlet process-based mixture-of-expert structure, and anticipate future environmental distributions by learning a continuous-time dynamics model. Accordingly, we induce a new continual MMD method DAEDCMD. Extensive experiments demonstrate that DAEDCMD can consistently and significantly outperform the compared methods, including six MMD baselines and three continual learning methods.
目前,错误文章,特别是多式文章,在社交媒体平台上广为传播,造成严重负面影响。为了控制其传播,多式错误信息检测(MMD)成为社区中一个积极话题,自动识别错误信息。以前MMD方法的重点是通过收集离线数据来监督探测器。然而,在现实世界中,新事件总是不断出现,使在离线数据上培训的 MMD模型始终过时和无效。为解决这一问题,在在线数据流下培训MMD模型是一种替代方案,引发一个名为连续MMMD的新兴任务。不幸的是,它受到两大挑战的阻碍。第一,关于新数据的培训不断降低过去数据(过去被命名为知识遗忘)的检测性能。第二,社会环境的不断变化,影响未来数据的一般化。为缓解这些挑战,我们建议通过将事件特定参数与基于富力特进程混合专家结构的干扰隔离起来来记住过去的知识,并通过学习一个持续时间动态模型来预测未来环境分布。因此,我们引入一种新的MDADDMDMDMD方法。广泛的实验表明DADMMMD可以持续并显著地超越三个基准,包括不断学习。
Article 258
Title@2025-07-08 (2): Towards a Principled Evaluation of Knowledge Editors
Title: Towards a Principled Evaluation of Knowledge Editors | Auf dem Weg zu einer grundsätzlichen Bewertung von Wissensredakteuren | 对知识编辑进行有原则地评价 2507.05937v1 |
Authors (3): Sebastian Pohl, Max Ploner, Alan Akbik
Model editing has been gaining increasing attention over the past few years. For Knowledge Editing in particular, more challenging evaluation datasets have recently been released. These datasets use different methodologies to score the success of editors. Yet, it remains under-explored how robust these methodologies are and whether they unfairly favor some editors. Moreover, the disruptive impact of these editors on overall model capabilities remains a constant blind spot. We address both of these problems and show that choosing different metrics and evaluation methodologies as well as different edit batch sizes can lead to a different ranking of knowledge editors. Crucially, we also demonstrate this effect on general language understanding tasks evaluated alongside the knowledge editing tasks. Further, we include a manual assessment of the string-matching-based evaluation method for knowledge editing that is favored by recently released datasets, revealing a tendency to produce false positive matches.
过去几年来,模型编辑越来越受到越来越多的关注。 特别是对于知识编辑来说,最近发布了更具挑战性的评价数据集。 这些数据集使用不同的方法来赢得编辑的成功。 然而,它们仍然未充分探讨这些方法的健全程度,以及这些方法是否不公平地有利于一些编辑。 此外,这些编辑对总体模型能力的干扰影响始终是盲点。我们处理这两个问题,并表明选择不同的指标和评价方法以及不同的编辑批量大小可能导致知识编辑的排名不同。 关键是,我们也展示了这种影响对与知识编辑任务一起评估的一般语言理解任务的影响。 此外,我们还包括对最近发布的数据集所偏好的知识编辑的弦匹配评价方法进行手工评估,揭示出产生虚假正面匹配的趋势。
Article 259
Title@2025-07-08 (2): News and Load: Social and Economic Drivers of Regional Multi-horizon Electricity Demand Forecasting
Title: News and Load: Social and Economic Drivers of Regional Multi-horizon Electricity Demand Forecasting | News and Load: Soziale und wirtschaftliche Treiber regionaler Mehrhorizont-Stromnachfrageprognosen | 新闻与负载:区域多光速电力需求预测的社会和经济驱动因素 2406.06641v2 |
Authors (3): Yun Bai, Simon Camal, Andrea Michiorri
The relationship between electricity demand and variables such as economic activity and weather patterns is well established. However, this paper explores the connection between electricity demand and social aspects. It further embeds dynamic information about the state of society into energy demand modelling and forecasting approaches. Through the use of natural language processing on a large news corpus, we highlight this important link. This study is conducted in five regions of the UK and Ireland and considers multiple time horizons from 1 to 30 days. It also considers economic variables such as GDP, unemployment and inflation. The textual features used in this study represent central constructs from the word frequencies, topics, and word embeddings extracted from the news. The findings indicate that: 1) the textual features are related to various contents, such as military conflicts, transportation, the global pandemic, regional economics, and the international energy market. They exhibit causal relationships with regional electricity demand, which are validated using Granger causality and Double Machine Learning methods. 2) Economic indicators play a more important role in the East Midlands and Northern Ireland, while social indicators are more influential in the West Midlands and the South West of England. 3) The use of these factors improves deterministic forecasting by around 6%.
电力需求与诸如经济活动和天气模式等变数之间的关系早已确立。然而,本文探讨了电力需求与社会方面的关联,进一步将关于社会状况的动态信息纳入能源需求建模和预测方法之中。我们通过在大型新闻材料中使用自然语言处理,强调了这一重要联系。这项研究在英国和爱尔兰五个地区进行,考虑了1天到30天的多重时间跨度。研究还考虑了国内生产总值、失业和通货膨胀等经济变量。本研究报告使用的文字特征代表了从新闻中提取的文字频率、主题、嵌入的词的核心结构。研究结果表明:(1) 文字特征与各种内容有关,如军事冲突、运输、全球大流行病、区域经济和国际能源市场。这些特征显示了与区域电力需求之间的因果关系,而区域电力需求则利用Granger因果关系和双机学习方法加以验证。(2) 经济指标在东中地和北爱尔兰起着更重要的作用,而社会指标在英格兰西中地和西南地更有影响力。(3) 这些因素的使用改善了6 %左右的确定性预测。
Article 260
Title@2025-07-08 (2): Semantic Certainty Assessment in Vector Retrieval Systems: A Novel Framework for Embedding Quality Evaluation
Title: Semantic Certainty Assessment in Vector Retrieval Systems: A Novel Framework for Embedding Quality Evaluation | Semantische Sicherheitsbewertung in Vector Retrieval Systemen: Ein neuartiger Rahmen für die Einbettung von Qualitätsbewertungen | 矢量检索系统中的语义性确定性评估:包含质量评价的新框架 2507.05933v1 |
Authors (1): Y. Du
Vector retrieval systems exhibit significant performance variance across queries due to heterogeneous embedding quality. We propose a lightweight framework for predicting retrieval performance at the query level by combining quantization robustness and neighborhood density metrics. Our approach is motivated by the observation that high-quality embeddings occupy geometrically stable regions in the embedding space and exhibit consistent neighborhood structures. We evaluate our method on 4 standard retrieval datasets, showing consistent improvements of 9.4 ± 1.2% in Recall@10 over competitive baselines. The framework requires minimal computational overhead (less than 5% of retrieval time) and enables adaptive retrieval strategies. Our analysis reveals systematic patterns in embedding quality across different query types, providing insights for targeted training data augmentation.
矢量检索系统由于不同嵌入质量不同,不同查询的性能差异很大。我们提出一个轻量级框架,通过结合量化稳健性和邻里密度度度来预测查询一级的检索性能。我们采取的方法的动机是,观察到高质量的嵌入在嵌入空间中占据几何稳定的区域,并展示了一致的邻里结构。我们评估了4个标准检索数据集的方法,在回调10中显示对竞争性基线不断改进9.4$\pm1.2。这个框架需要最低的计算间接费用(少于5检索时间),并能够实施适应性检索战略。我们的分析揭示了在不同查询类型中嵌入质量的系统模式,为有针对性的培训数据增强提供了洞察力。
Article 261
Title@2025-07-08 (2): Self-supervised learning of speech representations with Dutch archival data
Title: Self-supervised learning of speech representations with Dutch archival data | Selbstüberwachtes Lernen von Sprachdarstellungen mit niederländischen Archivdaten | 自行监督学习用荷兰档案数据进行语音陈述 2507.04554v2 |
Authors (3): Nik Vaessen, Roeland Ordelman, David A. van Leeuwen
This paper explores the use of Dutch archival television broadcast data for self-supervised learning of speech foundation models, specifically wav2vec 2.0. We first study data quality assumptions for pre-training, and show how music, noise and speaker overlap affect SSL convergence and downstream fine-tuning performance. Secondly, we explore effective pre-processing strategies to convert the noisy broadcast dataset into a qualitative dataset for pre-training, by using Whisper and WhisperX. Thirdly, we compare mono-lingual and multi-lingual pre-training with equivalent amounts of data, and show that mono-lingual pre-training is more robust to out-of-domain data. Lastly, we achieve a state-of-the-art LARGE wav2vec 2.0 model for the Dutch language by continuing the pre-training of a wav2vec 2.0 XLS-R model checkpoint with our 55k hour archival dataset.
本文探索荷兰档案电视广播数据用于自我监督地学习语言基础模型,特别是 wav2vec 2. 0。 我们首先研究培训前的数据质量假设,并展示音乐、噪音和发言者的重叠如何影响SSL的趋同和下游微调性能。 其次,我们利用Whisper和WhisperX,有效探索将噪音广播数据集转换成培训前质量数据集的预处理战略。 第三,我们将单语和多语种的预培训与同等数量的数据进行比较,并表明单语预培训对于外部数据更为有力。 最后,我们通过继续用我们55k小时的档案数据集对Wav2c 2.0 XLLS-R 模检查站进行预培训,实现荷兰语最先进的LARGE wav2vec 2. 0模式。
Article 262
Title@2025-07-08 (2): Few-shot text-based emotion detection
Title: Few-shot text-based emotion detection | Kurze Text-basierte Emotionserkennung | 几发短短的基于文本的情感探测 2507.05918v1 |
Authors (3): Teodor-George Marchitan, Claudiu Creanga, Liviu P. Dinu
This paper describes the approach of the Unibuc - NLP team in tackling the SemEval 2025 Workshop, Task 11: Bridging the Gap in Text-Based Emotion Detection. We mainly focused on experiments using large language models (Gemini, Qwen, DeepSeek) with either few-shot prompting or fine-tuning. With our final system, for the multi-label emotion detection track (track A), we got an F1-macro of 0.7546 (26/96 teams) for the English subset, 0.1727 (35/36 teams) for the Portuguese (Mozambican) subset and 0.325 (1/31 teams) for the Emakhuwa subset.
本文介绍了Unibuc-NLP团队在应对SemEval 2025年研讨会(任务11:缩小基于文字的情感检测差距)方面的做法,我们主要侧重于使用大语言模型(Gemini、Quen、DeepSeek)的实验,有的只是几发提示或微调,有的是几发提示或微调。我们的最后系统,即多标签情感检测轨道(A轨),我们得到了一个F1-macro,其中英文子组0.7546美元(26/96),葡萄牙文组0.1727美元(35/36队),埃马库瓦子组0.325美元(htextbf{1}31队)。
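The F1-macro figures reported in the abstract above average per-label F1 scores with equal weight, so rare emotions count as much as frequent ones. The following sketch computes that metric for multi-label predictions. It is a generic illustration, not the task's official scorer; the function name `macro_f1` and the convention of scoring a label 0 when it has no true or predicted positives are assumptions made here.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over labels for multi-label 0/1 indicator rows.

    y_true and y_pred are equally sized lists of rows, one row per example,
    one column per emotion label.
    """
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 here when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels
```

Because every label contributes equally, a single poorly predicted emotion can pull the macro score down sharply, which is one plausible reason scores differ so much between language subsets.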
Article 263
Title@2025-07-08 (2): On the Fundamental Impossibility of Hallucination Control in Large Language Models
Title: On the Fundamental Impossibility of Hallucination Control in Large Language Models | Über die grundsätzliche Unmöglichkeit der Halluzinationskontrolle in großen Sprachmodellen | 关于大语言模型中幻听控制的基本不可能性 2506.06382v3 |
Authors (1): Michał P. Karpowicz
We prove that perfect hallucination control in large language models is mathematically impossible. No LLM inference mechanism can simultaneously achieve truthful response generation, semantic information conservation, relevant knowledge revelation, and knowledge-constrained optimality. This impossibility is fundamental, arising from the mathematical structure of information aggregation itself rather than engineering limitations. The proof spans three mathematical frameworks: auction theory, proper scoring theory for probabilistic predictions, and log-sum-exp analysis for transformer architectures. In each setting, we demonstrate that information aggregation creates unavoidable violations of conservation principles. The Jensen gap in transformer probability aggregation provides a direct measure of this impossibility. These results reframe hallucination from an engineering bug to an inevitable mathematical feature of distributed intelligence. There are fundamental trade-offs between truthfulness, knowledge utilization, and response completeness, providing principled foundations for managing rather than eliminating hallucination. This work reveals deep connections between neural network inference, philosophy of knowledge and reasoning, and classical results in game theory and information theory, opening new research directions for developing beneficial AI systems within mathematical constraints.
我们证明大型语言模型的完美幻觉控制在数学上是不可能的。 没有LLM 推论机制能够同时实现真实的反应生成、语义信息保护、相关知识披露和知识限制的最佳性。 这种不可能性是根本的,产生于信息集成本身的数学结构而不是工程限制。 证据包含三个数学框架:拍卖理论、概率预测的适当评分理论以及变压器结构的日志和参数分析。 在每种情况下,我们证明信息汇总都不可避免地违反了保护原则。 变压器概率汇总中的Jensen差距提供了这种不可能性的直接衡量。 这些结果是将幻觉从工程错误重新定位为分布式情报的不可避免的数学特征。 在真实性、知识利用和反应完整性之间有着基本的权衡,为管理而不是消除幻觉提供了原则基础。 这项工作揭示了神经网络的推论、知识和推理哲学以及游戏理论和信息理论的经典结果之间的密切联系,为在数学限制范围内开发有益的AI系统开辟了新的研究方向。
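For readers unfamiliar with the term, the generic Jensen gap for the log-sum-exp function can be written out as below. This is a textbook statement offered as background, assuming logit vectors z^(k) combined with convex weights w_k; the symbols are introduced here, and the paper's own definition of the gap is not given in the abstract and may differ.

```latex
\[
\operatorname{LSE}(z) = \log \sum_{i=1}^{n} e^{z_i},
\qquad
\operatorname{LSE}\Big(\sum_{k=1}^{K} w_k z^{(k)}\Big)
\;\le\;
\sum_{k=1}^{K} w_k \operatorname{LSE}\big(z^{(k)}\big),
\quad w_k \ge 0,\ \sum_{k=1}^{K} w_k = 1 .
\]
\[
\Delta_J
= \sum_{k=1}^{K} w_k \operatorname{LSE}\big(z^{(k)}\big)
- \operatorname{LSE}\Big(\sum_{k=1}^{K} w_k z^{(k)}\Big)
\;\ge\; 0 .
\]
```

The inequality follows from the convexity of log-sum-exp, and the gap is zero, for instance, when all z^(k) coincide.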
Article 264
Title@2025-07-08 (2): AI-Reporter: A Path to a New Genre of Scientific Communication
Title: AI-Reporter: A Path to a New Genre of Scientific Communication | AI-Reporter: Ein Weg zu einem neuen Genre wissenschaftlicher Kommunikation | AI-记者:通向科学通信新一流的道路 2507.05903v1 |
Authors (1): Gerd Graßhoff
The AI-Reporter represents a paradigmatic shift in scientific publication practice. This document demonstrates through a concrete case study how our system transforms academic presentations into publication-ready chapters in less than three minutes. Using Arno Simons’ lecture on Large Language Models from the “Large Language Models for the History, Philosophy, and Sociology of Science” workshop (NEPI) as an example, we show how technological innovation bridges the gap between ephemeral presentation and permanent scientific documentation.
AI-Reporter代表了科学出版实践的范式转变。本文件通过具体案例研究展示了我们的系统如何在不到3分钟内将学术介绍转变为可供出版的章节。我们以“科学历史、哲学和社会学大语言模型”研讨会(NEPI)为例,利用Arno Simons关于“大语言模型”的讲座,我们展示了技术创新如何弥合时间介绍和长期科学文献之间的差距。
Article 265
Title@2025-07-08 (2): MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation
Title: MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation | MusiScene: Nutzung von MU-LLAMA für die Vorstellung von Szenen und verbesserte Video-Hintergrundmusik-Generation | Musi-Sceen:利用MU-LLAMA来进行现场想象和增强视频背景音乐制作 2507.05894v1 |
Authors (4): Fathinah Izzati, Xinyue Li, Yuxuan Wu, Gus Xia
Humans can imagine various atmospheres and settings when listening to music, envisioning movie scenes that complement each piece. For example, slow, melancholic music might evoke scenes of heartbreak, while upbeat melodies suggest celebration. This paper explores whether a Music Language Model, e.g. MU-LLaMA, can perform a similar task, called Music Scene Imagination (MSI), which requires cross-modal information from video and music to train. To improve upon existing music captioning models, which focus solely on musical elements, we introduce MusiScene, a music captioning model designed to imagine scenes that complement each piece of music. In this paper, (1) we construct a large-scale video-audio caption dataset with 3,371 pairs, (2) we finetune Music Understanding LLaMA for the MSI task to create MusiScene, and (3) we conduct comprehensive evaluations and show that MusiScene is more capable of generating contextually relevant captions than MU-LLaMA. We leverage the generated MSI captions to enhance Video Background Music Generation (VBMG) from text.
人类在聆听音乐时可以想象出不同的氛围和设置,想象每个片段的电影场景。例如,缓慢的、忧郁的音乐可能会引起心碎的场景,而喜悦的旋律则建议庆祝。本文探讨了音乐语言模型,如MU-LLAMA,能否执行类似的任务,即音乐场景想象(MSI),需要视频和音乐的跨模式信息来培训。为了改进只关注音乐元素的现有音乐字幕模型,我们引入了MusiScene,这是一个音乐字幕模型,旨在想象每种音乐的场景。在本文中,(1) 我们用3,371对配对的大规模视频-音响字幕数据集,(2) 我们为MusiScene创建MusiScene的任务,我们进行了全面的评估,并证明我们的MusiScene较有能力生成与MU-LAMA相比具有背景意义的字幕。我们利用生成的MSI字幕来从文字中加强视频背景音乐生成。
Article 266
Title@2025-07-08 (2): Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators
Title: Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators | Psychometrische Objektvalidierung mit virtuellen Befragten mit trait-Response Mediatoren | 使用虚拟应答人与Trait-Response调解员的虚拟应答人进行测分 2507.05890v1 |
Authors (4): Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo
As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that robustly measure intended traits. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-effective survey development and a deeper understanding of how LLMs replicate human-like behavior. We will publicly release our dataset and code to support future work.
由于人们越来越多地使用心理测量调查来评估大型语言模型的特性,因此,适合LLMS的可扩缩调查物品生成的需要也日益增长。这里的一个关键挑战是确保生成的物品的构造有效性,即它们是否真正测量预期特性。传统上,这需要花费昂贵的大规模人类数据收集。为了提高效率,我们提出了一个使用LLMS进行虚拟应答模拟的框架。我们的中心思想是考虑调解人:同一特性可能导致对调查项目作出不同答复的因素。通过向不同调解员模拟,我们确定能够强有力地测量预期特性的调查物品。关于三种心理特征理论(Big5、Schwartz、VIA)的实验表明,我们的调解者生成方法和模拟框架能够有效地识别高功能物品。LMS展示出从特性定义中产生可信的调解员的能力,并模拟应答者行为,以便项目鉴定。我们的问题的拟订、衡量尺度、方法和数据设置为具有成本效益的调查发展开辟新的方向,并更深入地了解LMS复制人类类似行为的方式。我们将公开发布我们的数据集和代码,以支持未来的工作。
Article 267
Title@2025-07-08 (2): How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures
Title: How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures | Wie man die automatische Spracherkennung auswertet: Vergleich verschiedener Leistungs- und Bias-Maßnahmen | 如何评价自动语音识别:比较不同业绩和偏见措施 2507.05885v1 |
Authors (4): Tanvina Patel, Wiebke Hutiri, Aaron Yi Ding, Odette Scharenborg
There is increasing evidence that automatic speech recognition (ASR) systems are biased against different speakers and speaker groups, e.g., due to gender, age, or accent. Research on bias in ASR has so far primarily focused on detecting and quantifying bias, and developing mitigation approaches. Despite this progress, the open question is how to measure the performance and bias of a system. In this study, we compare different performance and bias measures, both taken from the literature and newly proposed, to evaluate state-of-the-art end-to-end ASR systems for Dutch. Our experiments use several bias mitigation strategies to address bias against different speaker groups. The findings reveal that averaged error rates, a standard in ASR research, are not sufficient on their own and should be supplemented by other measures. The paper ends with recommendations for reporting ASR performance and bias to better represent a system’s performance for diverse speaker groups, and overall system bias.
越来越多的证据表明,自动语音识别(ASR)系统对不同的发言者和发言者群体有偏见,例如,由于性别、年龄或口音等原因。关于ASR中的偏见的研究迄今主要侧重于发现和量化偏见,以及制定缓解方法。尽管取得了这一进展,但未决问题是如何衡量一个系统的绩效和偏见。在本研究报告中,我们比较了从文献中得出的不同绩效和偏见措施,并提议对荷兰语的高级终端到终端的ASR系统进行评估。我们的实验利用若干减少偏见的战略来解决对不同发言者群体的偏见问题。研究结果显示,平均误差率,即ASR研究的标准本身是不够的,应当辅以其他措施作为补充。文件最后建议报告ASR的业绩和偏见,以便更好地代表不同发言者群体的系统绩效和整个系统偏差。
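The point that an averaged error rate can hide group-level differences can be illustrated with a minimal sketch. The group names, the toy Dutch sentences, and the "gap to the best group" measure below are illustrative assumptions, not the measures or data used in the paper.

```python
from collections import defaultdict

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) strings."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors[group] += edit_distance(ref_tokens, hyp_tokens)
        words[group] += len(ref_tokens)
    return {g: errors[g] / words[g] for g in words}

def summarize(samples):
    """samples must be a list, since it is traversed twice."""
    per_group = wer_by_group(samples)
    overall = wer_by_group([("all", r, h) for _, r, h in samples])["all"]
    best = min(per_group.values())
    # Illustrative bias measure (an assumption): WER gap to the best group.
    gaps = {g: w - best for g, w in per_group.items()}
    return overall, per_group, gaps

# Hypothetical toy data: group_b is recognized worse than group_a.
samples = [
    ("group_a", "de kat zit op de mat", "de kat zit op de mat"),
    ("group_a", "het regent vandaag", "het regent vandaag"),
    ("group_b", "de kat zit op de mat", "de kat zat op mat"),
]
overall, per_group, gaps = summarize(samples)
print(round(overall, 3), {g: round(w, 3) for g, w in per_group.items()})
```

In this toy case the overall WER looks moderate while one group carries all of the errors, which is the kind of situation where a per-group measure adds information.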
Article 268
Title@2025-07-08 (2): Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language Models
Title: Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language Models | Heilkräfte von BERT: Wie aufgabenspezifische Feinsteuerung korrupte Sprachmodelle wiederherstellt | BERT的愈合能力:如何恢复被损坏的语言模式 2406.14459v2 |
Authors (3): Shijie Han, Zhenyu Zhang, Andrei Arsene Simion
Language models like BERT excel at sentence classification tasks due to extensive pre-training on general data, but their robustness to parameter corruption is unexplored. To understand this better, we look at what happens if a language model is “broken”, in the sense that some of its parameters are corrupted and then recovered by fine-tuning. Strategically corrupting BERT variants at different levels, we find corrupted models struggle to fully recover their original performance, with higher corruption causing more severe degradation. Notably, bottom-layer corruption affecting fundamental linguistic features is more detrimental than top-layer corruption. Our insights contribute to understanding language model robustness and adaptability under adverse conditions, informing strategies for developing resilient NLP systems against parameter perturbations.
BERT等语言模型由于对一般数据进行了广泛的预先培训,因此在刑罚分类任务方面非常出色,但是它们对于腐败参数的强力是无法探索的。为了更好地了解这一点,我们审视一下如果一个语言模型“断裂 ” , 其某些参数被腐蚀, 然后通过微调恢复。 在不同级别上战略腐蚀的BERT变量,我们发现腐败模型在努力完全恢复其原有性能,腐败程度较高导致更严重的退化。 值得注意的是,影响基本语言特征的底层腐败比上层腐败更有害。 我们的洞察力有助于理解语言模型在不利条件下的稳健性和适应性,为开发抗参数扰动的具有复原力的NLP系统提供战略信息。
Article 269
Title@2025-07-08 (2): Detecting value-expressive text posts in Russian social media
Title: Detecting value-expressive text posts in Russian social media | Ermittlung wertwirksamer Textposts in russischen sozialen Medien | 在俄罗斯社交媒体中检测表达价值的文字文章 2312.08968v2 |
Authors (3): Maria Milkova, Maksim Rudnev, Lidia Okolskaya
Basic values are concepts or beliefs which pertain to desirable end-states and transcend specific situations. Studying personal values in social media can illuminate how and why societal values evolve, especially when stimuli-based methods such as surveys are inefficient, for instance in hard-to-reach populations. On the other hand, user-generated content is driven by the massive use of stereotyped, culturally defined speech constructions rather than authentic expressions of personal values. We aimed to find a model that can accurately detect value-expressive posts on the Russian social media platform VKontakte. A training dataset of 5,035 posts was annotated by three experts, 304 crowd-workers and ChatGPT. Crowd-workers and experts showed only moderate agreement in categorizing posts. ChatGPT was more consistent but struggled with spam detection. We applied an ensemble of human- and AI-assisted annotation involving an active learning approach, and subsequently trained several classification models using embeddings from various pre-trained transformer-based language models. The best performance was achieved with embeddings from a fine-tuned rubert-tiny2 model, yielding high value detection quality (F1 = 0.75, F1-macro = 0.80). This model provides a crucial step toward the study of values within and between Russian social media users.
在社会媒体中研究个人价值观可以说明社会价值观是如何和为什么演变的,特别是在调查等基于刺激性的方法效率低下的情况下,例如在难以接触的人口中。另一方面,用户生成的内容是由大量使用陈规定型的、文化上界定的语音结构而不是真实的个人价值观表达方式驱动的。我们的目标是找到一种模型,能够准确检测俄罗斯社交媒体VKontakte中的价值表达岗位。3名专家304名人群工人和聊天GPT对5 035个岗位的培训数据集作了说明。在职位分类方面,众工和专家只表现出适度的一致。查网GPT比较一致,但与垃圾检测相抗争。我们应用了人类和人工辅助的批注,包括积极的学习方法,随后我们用各种经过预先培训的变压器语言模型来培训了若干分类模型。通过将5 035个岗位的成套培训数据集嵌入,由3名专家、304名人群工人和聊天GhatGPT。在对职位进行分类时只表现出适度的一致。ChatGPT。C-PT和专家在对职位进行分类时只提供高价值质量。
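The abstract reports two scores (F1 and F1-macro). As a minimal sketch of how such a pair can differ, the code below computes F1 for one class and the unweighted macro average over classes. Reading the plain "F1" as the score of the positive (value-expressive) class is an assumption, and the labels are hypothetical toy data, not the paper's.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)

# Hypothetical labels: 1 = value-expressive post, 0 = other.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

positive_f1 = f1_for_class(y_true, y_pred, 1)
macro = macro_f1(y_true, y_pred)
print(round(positive_f1, 3), round(macro, 3))
```

When the majority class is easier to classify, the macro average sits above the positive-class score, which is one way a pattern like 0.75 versus 0.80 can arise.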
Article 270
Title@2025-07-08 (2): ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection
Title: ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection | ALLM4ADD: Entsperren der Fähigkeiten von Audio Large Language Models für Audio Deepfake Detection | ALLM4ADD:解锁音频深假探测音频大语言模型的能力 2505.11079v2 |
Authors (9): Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen
Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a heuristic question arises: Can ALLMs be leveraged to solve ADD? In this paper, we first conduct a comprehensive zero-shot evaluation of ALLMs on ADD, revealing their ineffectiveness. To this end, we propose ALLM4ADD, an ALLM-driven framework for ADD. Specifically, we reformulate the ADD task as an audio question answering problem, prompting the model with the question: “Is this audio fake or real?”. We then perform supervised fine-tuning to enable the ALLM to assess the authenticity of query audio. Extensive experiments are conducted to demonstrate that our ALLM-based method can achieve superior performance in fake audio detection, particularly in data-scarce scenarios. As a pioneering study, we anticipate that this work will inspire the research community to leverage ALLMs to develop more effective ADD systems. Code is available at https://github.com/ucas-hao/qwen_audio_for_add.git
声频深方形探测(ADD)由于高贞度音频变异模型的兴起及其被滥用的可能性而变得日益重要。鉴于音频大语言模型(ALLMs)在各种音频处理任务中取得了显著进展,因此产生了一个反复的问题:\ textit{AllMs能否被利用解决ADD?}。在本文件中,我们首先对ADDA的AllMs进行全面的零射评价,显示其无效性。为此,我们提议ALM4ADD,一个由ALM驱动的ADDD框架。具体地说,我们重新将ADD任务改写为一个音频解答问题,促使该模型提出一个问题:“这种音频是假的还是真的?”问题。我们随后进行了监督的微调,以使ALM能够评估调音频的真实性。我们进行了广泛的实验,以证明我们的ALM方法能够在假音频探测中取得优异性表现,特别是在数据卡片情景中。作为开创性的研究,我们预计这项工作将激励研究界利用ALDDDDM系统开发更有效的ADDDDD系统。代码可在 https@gius_gus_quaus_ad_ad_absforfordoqforforgi_forgi_forgi_fio_forgiocfordo.
Article 271
Title@2025-07-08 (2): PDFMathTranslate: Scientific Document Translation Preserving Layouts
Title: PDFMathTranslate: Scientific Document Translation Preserving Layouts | PDFMathTranslate: Wissenschaftliche Dokumentübersetzung Layouts erhalten | PDFMath Translate:科学文件翻译保存布局 2507.03009v2 |
Authors (4): Rongxin Ouyang, Chang Chu, Zhikuang Xin, Xiangyao Ma
Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information in layouts. To bridge the gap, we introduce PDFMathTranslate, the world’s first open-source software for translating scientific documents while preserving layouts. Leveraging the most recent advances in large language models and precise layout detection, we contribute to the community with key improvements in precision, flexibility, and efficiency. The work has been open-sourced at https://github.com/byaidu/pdfmathtranslate with more than 222k downloads.
科学文件的语言障碍妨碍了科学文件的传播和发展,然而,先前在翻译这类文件方面所作的努力基本上忽视了布局中的信息。为弥合这一差距,我们引入了PDFMathTranslate,这是世界上第一个在保存布局的同时翻译科学文件的开放源码软件。我们利用了大语言模型的最新进展和精确的布局探测,在精确、灵活和效率方面作出了重大改进,为社区作出了贡献。这项工作在https://github.com/baaidu/pdfmath Translate中公开提供,下载次数超过222k次。
Article 272
Title@2025-07-08 (2): Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
Title: Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge | Mit Thinking-LLM-as-a-Richter zu planen und zu evaluieren | 学习运用 “ 思考-LLM-作为法官 “ 评估计划和理由 2501.18099v2 |
Authors (5): Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason Weston, Tianlu Wang
LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
在这项工作中,我们建议Eval Planner为思维-LLM-as-Judge提供一种优先优化算法,首先产生不受限制的评价计划,然后执行,然后作出最后判断。在自我培训循环中,Eval Planner反复优化合成设计的评价计划和处决,从而导致更好的最终裁决。我们的方法在RewardBench上取得了一种新的最先进的变现奖赏模式业绩(分数为93.9),尽管在建立数量较少的、合成的Blusion-Balianch推理模型方面进行了培训,并且进一步推进了Balvironic-RM号和其他基准。
Article 273
Title@2025-07-08 (2): The Impact of Prompt Programming on Function-Level Code Generation
Title: The Impact of Prompt Programming on Function-Level Code Generation | Die Auswirkungen der Prompt-Programmierung auf die Code-Generierung auf Funktionsebene | 迅速编制方案对职能层面代码生成的影响 2412.20545v2 |
Authors (4): Ranim Khojah, Francisco Gomes de Oliveira Neto, Mazen Mohamad, Philipp Leitner
Large Language Models (LLMs) are increasingly used by software engineers for code generation. However, limitations of LLMs such as irrelevant or incorrect code have highlighted the need for prompt programming (or prompt engineering) where engineers apply specific prompt techniques (e.g., chain-of-thought or input-output examples) to improve the generated code. While some prompt techniques have been studied, the impact of different techniques – and their interactions – on code generation is still not fully understood. In this study, we introduce CodePromptEval, a dataset of 7072 prompts designed to evaluate five prompt techniques (few-shot, persona, chain-of-thought, function signature, list of packages) and their effect on the correctness, similarity, and quality of complete functions generated by three LLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt techniques significantly influence the generated code, combining multiple techniques does not necessarily improve the outcome. Additionally, we observed a trade-off between correctness and quality when using prompt techniques. Our dataset and replication package enable future research on improving LLM-generated code and evaluating new prompt techniques.
大型语言模型(LLMS)被软件工程师越来越多地用于代码生成,然而,LLMS(LLMS)的局限性,例如不相关或不正确的代码等LLMS的局限性突出表明,需要迅速编程(或迅速工程),因为工程师采用具体的即时技术(例如思维链或投入产出实例)来改进生成的代码。虽然已经研究了一些迅速技术,但不同技术(及其相互作用)对代码生成的影响仍然不完全理解。在本研究中,我们引入了代码PromptEval,这是一个7072个提示数据集,旨在评价五项即时技术(光、人、思维链、功能签名、软件包清单)及其对三个LLMMS(GPT-4o、Llama3和Mistral)生成的完整功能的正确性、相似性和质量的影响。我们的调查结果显示,虽然某些快速技术对生成的代码有重大影响,但结合多种技术并不一定能改善结果。此外,我们发现,在使用快速技术时,准确性和质量之间存在着一种权衡。我们的数据集和复制包使今后能够对改进LM生成的代码和评价新的快速技术进行研究。
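The study's question of how prompt techniques interact can be pictured by enumerating every subset of the five techniques. The sketch below assumes a full factorial design (each technique switched on or off independently), which the abstract does not state.

```python
from itertools import combinations

# The five prompt techniques named in the abstract.
TECHNIQUES = [
    "few-shot",
    "persona",
    "chain-of-thought",
    "function signature",
    "list of packages",
]

def technique_subsets(techniques):
    """Yield every subset, from the bare prompt (no technique) to all five."""
    for k in range(len(techniques) + 1):
        for subset in combinations(techniques, k):
            yield subset

subsets = list(technique_subsets(TECHNIQUES))
print(len(subsets))         # 32
print(7072 % len(subsets))  # 0
print(7072 // len(subsets)) # 221
```

That 7072 divides evenly by 32 is consistent with a full factorial layout over a fixed set of tasks, but it does not confirm how the dataset was actually built.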
Article 274
Title@2025-07-08 (2): Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity
Title: Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity | Affective-ROPTester: Fähigkeit und Bias-Analyse von LLMs bei der Vorhersage der Retinopathie der Vorreife | 情感-ROPTester:预测早幼病理反应中LLMs的能力和比值分析 2507.05816v1 |
Authors (10): Shuai Zhao, Yulin Zhang, Luwei Xiao, Xinyi Wu, Yanhao Jia, Zhongliang Guo, Xiaobao Wu, Cong-Duy Nguyen, Guoming Zhang, Anh Tuan Luu
Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low, medium, and high-risk labels. To systematically examine the predictive capabilities and affective biases of LLMs in ROP risk stratification, we propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies: Instruction-based, Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme assesses LLMs’ intrinsic knowledge and associated biases, whereas the CoT and ICL schemes leverage external medical knowledge to enhance predictive accuracy. Crucially, we integrate emotional elements at the prompt level to investigate how different affective framings influence the model’s ability to predict ROP and its bias patterns. Empirical results derived from the CROP dataset yield three principal observations. First, LLMs demonstrate limited efficacy in ROP risk prediction when operating solely on intrinsic knowledge, yet exhibit marked performance gains when augmented with structured external inputs. Second, affective biases are evident in the model outputs, with a consistent inclination toward overestimating medium- and high-risk cases. Third, compared to negative emotions, positive emotional framing contributes to mitigating predictive bias in model outputs. These findings highlight the critical role of affect-sensitive prompt engineering in enhancing diagnostic reliability and emphasize the utility of Affective-ROPTester as a framework for evaluating and mitigating affective bias in clinical language modeling systems.
尽管大型语言模型(LLMS)在各个领域取得了显著进展,但其预测对早产风险的视像病原体(ROP)的能力基本上尚未探索。为了弥补这一差距,我们引入了中国新型基准数据集,称为CROOP,由993份入门记录组成,附加低、中、高风险标签,共993份入门记录,附加低、中、高风险标签。为了系统地审查ROP风险分级中LLMS的预测能力和感官偏差,我们提议Affective-ROPTester,一个包含三种提示性战略的自动评价框架:基于指令的、Chought(OCT)和InColtle Learning(ICL)。为了解决这一差距,我们引入了一个新的中国基准数据集,称为CROMS的内在知识和相关的偏向偏向偏向,而CLOCL计划则利用外部医学知识来提高预测性准确性。很显然,我们把情感因素集中到感官影响模型预测ROP及其模型模型的偏向模式的偏向模式的偏向模式的偏向模式的偏向性,因此,LMMS显示第二个对精度的偏向性判断性判断性判断性判断性判断性判断性判断的预测作用影响在仅仅上影响着提高的精确性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性在仅在仅在仅在提高中,在提高外部判断性在提高外部判断性在提高外部判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性
Article 275
Title@2025-07-08 (2): Bridging Perception and Language: A Systematic Benchmark for LVLMs’ Understanding of Amodal Completion Reports
Title: Bridging Perception and Language: A Systematic Benchmark for LVLMs’ Understanding of Amodal Completion Reports | Bridging Perception and Language: Ein systematischer Benchmark für das Verständnis von LVLMs für Amodal Completion Reports | 桥梁概念和语言:LLLMs理解最佳完成报告系统基准 2507.05799v1 |
Authors (7): Amane Watahiki, Tomoki Doi, Taiga Shinozaki, Satoshi Nishida, Takuya Niikawa, Katsunori Miyahara, Hitomi Yanaka
One of the main objectives in developing large vision-language models (LVLMs) is to engineer systems that can assist humans with multimodal tasks, including interpreting descriptions of perceptual experiences. A central phenomenon in this context is amodal completion, in which people perceive objects even when parts of those objects are hidden. Although numerous studies have assessed whether computer-vision algorithms can detect or reconstruct occluded regions, the inferential abilities of LVLMs on texts related to amodal completion remain unexplored. To address this gap, we constructed a benchmark grounded in Basic Formal Ontology to achieve a systematic classification of amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Notably, in certain categories, some LLaVA-NeXT variants and Claude 3.5 Sonnet exhibit lower accuracy on original images compared to blank stimuli lacking visual content. Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.
开发大型视觉语言模型(LVLMs)的主要目标之一是设计能够帮助人类完成多式联运任务的系统,包括解释对感知经历的描述。这方面的一个中心现象是模式性完成,即使部分物体被隐藏,人们也会看到物体。虽然许多研究评估了计算机视觉算法能否探测或重建隐蔽区域,但LVLMs对与模式完成有关的文本的推断能力仍未得到探讨。为弥补这一差距,我们根据基本形式本体学建立了一个基准,以便实现对模式完成的系统分类。我们的结果表明,虽然许多LVLMs取得了人与人的可比较的性能,但某些类型正在完成的物体的准确性却有所不同。值得注意的是,在某些类别中,LLLavaVA-NeXT的变体和Claude 3.5 Sonnet的原始图像的准确性比没有视觉内容的空白平面图像要低。令人感兴趣的是,这种差异仅在日本的提示下出现,表明这些模型在日本特定语言能力方面存在缺陷。
Article 276
Title@2025-07-08 (2): FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation
Title: FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation | FaithfulRAG: Fact-Level-Konfliktmodellierung für kontexttreue retrieval-erweiterte Generation | 忠实的拉戈:为本能的回溯性回溯性回溯性回溯性一代人建立实际冲突模式 2506.08938v2 |
Authors (7): Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, Jinsong Su
Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLM's parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the model's parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the model's parametric knowledge, which undermines the model's internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model's parametric knowledge and retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/DeepLearnXMU/Faithful-RAG
与检索系统相加的大型语言模型(LLMs)在处理知识密集型任务方面显示出巨大的潜力,然而,这些模型往往与不忠问题纠缠不休,产生的结果要么忽略了检索到的背景,要么前后不一地将其与LLM的参数知识混为一谈,在知识冲突的情况下,这一问题尤为严重,因为检索到的背景与模型的参数知识发生冲突。虽然现有的忠实RAG方法通过设计得当的迅速或经过修改的解码战略,强制压制模型的参数知识,从而实现严格遵守环境,但我们的分析揭示了一个严重的局限性:它们通过强行压制模型的参数知识,从而破坏模型的内部知识结构,增加误解环境的风险,从而实现忠诚。为此,本文件提出了FaithfulRAG,这是一个通过明确模拟模型`参数知识和检索到的背景之间的差异来解决知识冲突的新框架。具体地说,信任RAGG找出了事实层面的相互矛盾的知识,并设计了一个自我思考的过程,使LLMS在提出答复之前能够理解和综合相互矛盾的事实。广泛的实验表明,我们的方法超越了SOD-Salfar-FAR-MUS/GMUS 号。
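Fact-level conflict identification can be pictured with a small sketch. It assumes facts are already available as (subject, relation, object) triples and that a conflict means the same subject and relation with a different object. The function name and the triple representation are hypothetical and are not taken from the FaithfulRAG code base, which presumably relies on an LLM to extract and compare facts.

```python
def find_conflicts(parametric_facts, context_facts):
    """Return context facts that contradict the model's own facts.

    Each fact is a (subject, relation, object) triple. This exact-match
    comparison is an illustrative simplification, not the paper's method.
    """
    known = {(s, r): o for s, r, o in parametric_facts}
    conflicts = []
    for s, r, o in context_facts:
        if (s, r) in known and known[(s, r)] != o:
            conflicts.append({
                "subject": s,
                "relation": r,
                "parametric": known[(s, r)],
                "context": o,
            })
    return conflicts

# Hypothetical facts: the retrieved context disagrees on one of them.
parametric = [("Team X", "coach", "Alice"), ("Team X", "city", "Oslo")]
context = [("Team X", "coach", "Bob"), ("Team X", "city", "Oslo")]

for c in find_conflicts(parametric, context):
    print(c["subject"], c["relation"], c["parametric"], "vs", c["context"])
```

Surfacing each conflict explicitly, rather than suppressing one side, gives a later reasoning step both versions of the fact to weigh.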
Article 277
Title@2025-07-08 (2): MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training
Title: MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training | MAMUT: Ein neuartiges Framework zur Modifizierung mathematischer Formeln für die Generierung von Spezialdatensätzen für Sprachmodellschulungen | MAMUT:为语言模式培训制作专门数据集修改数学公式的新框架 2502.20855v2 |
Authors (3): Jonathan Drechsel, Anja Reusch, Steffen Herbold
Mathematical formulas are a fundamental and widely used component in various scientific fields, serving as a universal language for expressing complex concepts and relationships. While state-of-the-art transformer models excel in processing and understanding natural language, they encounter challenges with mathematical notation, which involves a complex structure and diverse representations. This study focuses on the development of specialized training datasets to enhance the encoding of mathematical content. We introduce Math Mutator (MAMUT), a framework capable of generating equivalent and falsified versions of a given mathematical formula in LaTeX notation, effectively capturing the mathematical variety in notation of the same concept. Based on MAMUT, we have generated four large mathematical datasets containing diverse notation. Experiments show that models trained on these datasets exhibit new SoTA performance on mathematical retrieval tasks. We publish our code, generated datasets, and pretrained mathematical models: https://github.com/aieng-lab/math-mutator.
数学公式是各种科学领域一个基本和广泛使用的组成部分,是表达复杂概念和关系的通用语言。尽管最先进的变压器模型在处理和理解自然语言方面非常出色,但它们在数学标记方面却遇到挑战,这涉及复杂的结构和不同的表述。本研究的重点是开发专门的培训数据集,以加强数学内容的编码。我们引入数学数学模型(MAMUT),这是一个能够在LaTeX符号中生成一个特定数学公式的等同和伪造版本的框架,在同一个概念的标记中有效捕捉数学多样性。根据MAMUT,我们生成了四个包含不同符号的大型数学数据集。实验显示,这些数据集培训的模型展示了数学检索任务的新 SoTA。我们出版了我们的代码,生成了数据集,并预先培训了数学模型:https://github.com/aieng-lab/math-motoat。
Article 278
Title@2025-07-08 (2): Flippi: End To End GenAI Assistant for E-Commerce
Title: Flippi: End To End GenAI Assistant for E-Commerce | Flippi: Ende bis Ende GenAI Assistent für E-Commerce | Flippi: 结束到结束 GenAI 电子商务助手 2507.05788v1 |
Authors (7): Anand A. Rajasekar, Praveen Tangarajan, Anjali Nainani, Amogh Batwal, Vinay Rao Dandin, Anusua Trivedi, Ozan Ersoy
The emergence of conversational assistants has fundamentally reshaped user interactions with digital platforms. This paper introduces Flippi, a cutting-edge, end-to-end conversational assistant powered by large language models (LLMs) and tailored for the e-commerce sector. Flippi addresses the challenges posed by the vast and often overwhelming product landscape, enabling customers to discover products more efficiently through natural language dialogue. By accommodating both objective and subjective user requirements, Flippi delivers a personalized shopping experience that surpasses traditional search methods. This paper details how Flippi interprets customer queries to provide precise product information, leveraging advanced NLP techniques such as Query Reformulation, Intent Detection, Retrieval-Augmented Generation (RAG), Named Entity Recognition (NER), and Context Reduction. Flippi’s unique capability to identify and present the most attractive offers on an e-commerce site is also explored, demonstrating how it empowers users to make cost-effective decisions. Additionally, the paper discusses Flippi’s comparative analysis features, which help users make informed choices by contrasting product features, prices, and other relevant attributes. The system’s robust architecture is outlined, emphasizing its adaptability for integration across various e-commerce platforms and the technological choices underpinning its performance and accuracy. Finally, a comprehensive evaluation framework is presented, covering performance metrics, user satisfaction, and the impact on customer engagement and conversion rates. By bridging the convenience of online shopping with the personalized assistance traditionally found in physical stores, Flippi sets a new standard for customer satisfaction and engagement in the digital marketplace.
谈话助理的出现从根本上改变了用户与数字平台的互动关系。本文介绍了Flippi对客户询问的解释,以提供准确的产品信息,利用高级NLP技术,如由大型语言模型(LLMs)驱动的高级智能、终端到终端的语音助理,并为电子商务部门量身定制。Flippi应对了巨大且往往压倒性的产品景观带来的挑战,使客户能够通过自然语言对话更有效地发现产品。Flippi通过满足客观和主观用户的要求,提供了个人化的购物经验,超过了传统的搜索方法。本文详细介绍了Flippi如何解释客户询问,以提供准确的产品信息,利用高级NLP技术,如由大语言模型(LLMs)驱动的超前端到端的语音助理(LLMs),以及针对大宗的语音识别(NER)和背景简化。Flippipi,Flippi的独特的识别能力和在电子商务网站上展示了最有吸引力的报价。此外,Flippi的用户参与比较分析特征,有助于用户通过对产品特征、电子价格进行对比分析、内价分析,以及其它相关的准确性选择。 系统在客户满意度、准确度、准确度、准确度、准确度上展示中,在客户满意度、准确度方面的系统结构中强调其准确度和准确度的准确度框架。
Article 279
Title@2025-07-08 (2): Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription
Title: Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription | Gemeinsame Beamforming und Speaker-Attributed ASR für echte Fern-Mikrofon-Meeting-Transkription | 用于真正分布式话语会议的联合光束和发言人与发言人联合致真正分布式话语会议ASR 2410.21849v2 |
Authors (4): Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent
Distant-microphone meeting transcription is a challenging task. State-of-the-art end-to-end speaker-attributed automatic speech recognition (SA-ASR) architectures lack a multichannel noise and reverberation reduction front-end, which limits their performance. In this paper, we introduce a joint beamforming and SA-ASR approach for real meeting transcription. We first describe a data alignment and augmentation method to pretrain a neural beamformer on real meeting data. We then compare fixed, hybrid, and fully neural beamformers as front-ends to the SA-ASR model. Finally, we jointly optimize the fully neural beamformer and the SA-ASR model. Experiments on the real AMI corpus show that, while state-of-the-art multi-frame cross-channel attention based channel fusion fails to improve ASR performance, fine-tuning SA-ASR on the fixed beamformer’s output and jointly fine-tuning SA-ASR with the neural beamformer reduce the word error rate by 8% and 9% relative, respectively.
在本文中,我们引入了一种联合波束和SA-ASR方法,用于真实会议转录。我们首先描述一种数据调整和增强方法,用于在真实会议数据上预演神经光束。然后,我们将固定的、混合的和完全的神经信号显示器作为SA-ASR模型的前端进行对比。最后,我们共同优化完全神经信号和SA-ASR模型。在实际AMI系统上进行的实验显示,尽管基于频道的组合关注无法提高ASR的性能,在固定光谱输出上对SA-ASR进行微调,并联合微调SA-ASR与神经信号信号分别将字错误率降低8%和9%的相对值。
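The reported gains are relative, not absolute, reductions in word error rate. The sketch below shows the arithmetic. The baseline WER of 0.30 is a hypothetical value chosen for illustration, since the abstract does not give absolute error rates.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction: the fraction of the baseline error removed."""
    return (baseline_wer - new_wer) / baseline_wer

def apply_relative_reduction(baseline_wer, relative):
    """WER that results from reducing the baseline by a relative fraction."""
    return baseline_wer * (1.0 - relative)

baseline = 0.30  # hypothetical baseline WER, not taken from the paper
for relative in (0.08, 0.09):
    new_wer = apply_relative_reduction(baseline, relative)
    absolute_drop = baseline - new_wer
    print(f"{relative:.0%} relative -> WER {new_wer:.3f} "
          f"(absolute drop {absolute_drop:.3f})")
```

Under this assumed baseline, an 8% relative reduction corresponds to an absolute drop of about 2.4 WER points, so the relative and absolute figures should not be confused when comparing systems.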
Article 280
Title@2025-07-08 (2): Large Language Models Might Not Care What You Are Saying: Prompt Format Beats Descriptions
Title: Large Language Models Might Not Care What You Are Saying: Prompt Format Beats Descriptions | Große Sprachmodelle sind vielleicht egal, was Sie sagen: Prompt Format Beats Beschreibungen | 大语言模型可能不在乎你在说什么:快速格式比描述 2408.08780v6 |
Authors (4): Chenming Tang, Zhixiang Wang, Hao Sun, Yunfang Wu
With the help of in-context learning (ICL), large language models (LLMs) have achieved impressive performance across various tasks. However, the function of descriptive instructions during ICL remains under-explored. In this work, we propose an ensemble prompt framework to describe the selection criteria of multiple in-context examples, and preliminary experiments on machine translation (MT) across six translation directions confirm that this framework boosts ICL performance. But to our surprise, LLMs might not care what the descriptions actually say, and the performance gain is primarily caused by the ensemble format, since it could lead to improvement even with random descriptive nouns. We further apply this new ensemble framework to a range of commonsense, math, logical reasoning and hallucination tasks with three LLMs and achieve promising results, suggesting again that designing a proper prompt format would be much more effective and efficient than putting effort into specific descriptions. Our code is available at https://github.com/JamyDon/Format-Beats-Descriptions.
在文字学习的帮助下,大型语言模型(LLMs)在各种任务中取得了令人印象深刻的成绩。然而,在ICL期间描述性指示的功能仍未得到充分探讨。在这项工作中,我们提议了一个共同的快速框架来描述多个文本实例的选择标准,并在六个翻译方向上对机器翻译的初步实验证实,这个框架可以促进ICL的性能。但是,令我们惊讶的是,LLMs可能并不关心描述的实际内容,而业绩收益主要是由合用格式造成的,因为即使随机的描述性名词也能导致改进。我们进一步将这个新的共同框架应用于一系列共同概念、数学、逻辑推理和幻觉任务,与三个LLMS一起,并取得有希望的结果,再次表明设计一个适当的迅速格式比具体描述要有效得多。我们的代码可以在 https://github.com/JamyDon/Format-Beats-Descrictions上查阅。
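The contrast between prompt format and description content can be made concrete with a small sketch. The template wording, the function name, and the example sentences are hypothetical and are not the prompts used in the paper. The sketch only shows that swapping meaningful descriptions for random nouns leaves the ensemble structure unchanged.

```python
def build_ensemble_prompt(groups, query, instruction):
    """Assemble a prompt from several described groups of in-context examples.

    groups: list of (description, [(source, target), ...]) pairs.
    The template wording is an illustrative assumption.
    """
    parts = []
    for description, examples in groups:
        parts.append(f"Examples with similar {description}:")
        for source, target in examples:
            parts.append(f"Input: {source}\nOutput: {target}")
    parts.append(instruction)
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

# Hypothetical example pools, e.g. selected by two different criteria.
word_level = [("Guten Morgen", "Good morning")]
syntax_level = [("Wie geht es dir?", "How are you?")]
instruction = "Translate German into English."
query = "Guten Abend"

meaningful = build_ensemble_prompt(
    [("words", word_level), ("syntax", syntax_level)], query, instruction)
random_nouns = build_ensemble_prompt(
    [("apples", word_level), ("rivers", syntax_level)], query, instruction)
print(random_nouns)
```

Both prompts share the same grouping and ordering and differ only in the descriptive noun, which is the variable the abstract reports as mattering little.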
Article 281
Title@2025-07-08 (2): DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities
Title: DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities | DocTalk: Skalierbare Graph-basierte Dialog-Synthese zur Verbesserung von LLM-Konversationsfähigkeiten | Doctal: 用于增强LLLM对话能力的可缩放图表对话合成 2507.05750v1 |
Authors (9): Jing Yang Lee, Hamed Bonab, Nasser Zalmout, Ming Zeng, Sanket Lokegaonkar, Colin Lockard, Binxuan Huang, Ritesh Sarkhel, Haodong Wang
Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training results in up to 40% gain in context memory and understanding, without compromising base performance. DocTalk is available at https://huggingface.co/datasets/AmazonScience/DocTalk.
大型语言模型(LLMS)越来越多地用于多方向的谈话任务,然而,其培训前数据主要包括连续的讲稿,造成所需能力和培训模式之间的潜在不匹配。我们采用新颖的方法,将现有文本公司的对话数据综合起来,以解决这一差异。我们提出了一个管道,将多个相关文件组合转化为一个扩大的多方向、多主题的信息搜索对话。我们把管道应用到维基百科文章,我们把DocTalk,一个由730公里长的谈话组成的多方向培训前对话包。我们假设在培训前接触这种综合对话结构能够提高LLMS的基本多方向能力,例如上下文记忆和理解。我们抽象地表明,在培训前的结果中将DocTalk纳入40%的记忆和理解,同时不损害基本性能。DocTalk可以在https://huggingface.co/dataset/Amazonscience/DocTalk上查阅。
Article 282
Title@2025-07-08 (2): GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge
Title: GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge | GPTKB v1.5: Eine umfassende Wissensbasis für die Erforschung des tatsächlichen LLM-Wissens | GPTKB v1.5:探索事实LLM知识的大规模知识库 2507.05740v1 |
Authors (5): Yujia Hu, Tuan-Phong Nguyen, Shrestha Ghosh, Moritz Müller, Simon Razniewski
Language models are powerful tools, yet their factual knowledge is still poorly understood, and inaccessible to ad-hoc browsing and scalable statistical analysis. This demonstration introduces GPTKB v1.5, a densely interlinked 100-million-triple knowledge base (KB) built for $14,000 from GPT-4.1, using the GPTKB methodology for massive-recursive LLM knowledge materialization (Hu et al., ACL 2025). The demonstration experience focuses on three use cases: (1) link-traversal-based LLM knowledge exploration, (2) SPARQL-based structured LLM knowledge querying, (3) comparative exploration of the strengths and weaknesses of LLM knowledge. Massive-recursive LLM knowledge materialization is a groundbreaking opportunity both for the research area of systematic analysis of LLM knowledge, as well as for automated KB construction. The GPTKB demonstrator is accessible at https://gptkb.org.
语言模型是强有力的工具,但它们的实际知识仍然不易理解,而且无法为临时浏览和可扩展的统计分析所利用。这一演示介绍了GPTKB v1.5,这是一个由GPT-4.1为14 000美元建立的密连的10亿三亿知识库(KB),该知识库使用GPTKB方法,用于大规模回收的LLM知识成品化(Hu等人,ACL 2025)。演示经验侧重于三个使用案例:(1) 联系-跨行业LLM知识探索,(2) SPARQL基于LM结构化LM知识查询,(3) 比较探索LLM知识的长处和短处。大规模再精确LM知识成品化是一个开拓性的机会,既用于系统分析LLM知识的研究领域,也用于自动建造KB。 GPTKB 演示器可在https://gptkb.org上查阅。
Article 283
Title@2025-07-08 (2): Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling
Title: Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling | Verbesserung der LLM-Zuverlässigkeit durch explizite Wissensgrenzenmodellierung | 通过明确知识边界建模增强LLM可靠性 2503.02233v3 |
Authors (6): Hang Zheng, Hongshen Xu, Yuncong Liu, Lu Chen, Pascale Fung, Kai Yu
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness, particularly when processing queries exceeding their knowledge boundaries. While existing mitigation strategies employ uncertainty estimation or query rejection mechanisms, they suffer from computational inefficiency and sacrificed helpfulness. To address these issues, we propose the Explicit Knowledge Boundary Modeling (EKBM) framework, integrating fast and slow reasoning systems to harmonize reliability and usability. The framework first employs a fast-thinking model to generate confidence-labeled responses, enabling immediate utilization of high-confidence outputs, whereas uncertain predictions trigger a slow refinement model for accuracy improvement. To align model behavior with our proposed objective, we propose a hybrid training pipeline, enhancing self-awareness without degrading task performance. Evaluations on dialogue state tracking tasks demonstrate that EKBM achieves superior model reliability over uncertainty-based baselines. Further analysis reveals that refinement substantially boosts accuracy while maintaining low computational overhead. The framework establishes a scalable paradigm for deploying reliable LLMs in error-sensitive applications, effectively balancing accuracy and practical utility.
大型语言模型(LLMs)容易产生自觉意识失调的幻觉,特别是在处理超出其知识界限的查询时。现有减缓战略采用不确定性估计或质疑拒绝机制,但也有计算效率,并牺牲了帮助作用。为了解决这些问题,我们建议采用“明白知识边界模型”框架,整合快速和缓慢的推理系统,以统一可靠性和使用性能和可用性。框架首先采用快速思考模型,以产生以信任为标签的响应,能够立即利用高信任产出,而不确定的预测则触发一个缓慢的精细改进模型,以便提高准确性。为了将模型与我们拟议的目标保持一致,我们提议了一个混合培训管道,在不降低任务性能的情况下提高自我意识。对话状态跟踪任务评估表明,EKBMs在基于不确定性的基线上取得了超强的模型可靠性。进一步分析表明,改进极大地提高了准确性,同时保持低计算性能。框架为在对错误敏感的应用中部署可靠的LMs提供了可扩展的范例,有效地平衡准确性和实用性。
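The fast/slow split described in the EKBM abstract can be pictured with a small routing sketch. This is only an illustration under stated assumptions: the `LabeledPrediction` record, the boolean `sure` label, and the two callables are hypothetical stand-ins, not the authors' interfaces or training pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LabeledPrediction:
    """One fast-model output with its confidence label (hypothetical format)."""
    slot: str
    value: str
    sure: bool


def answer_with_refinement(
    query: str,
    fast_model: Callable[[str], List[LabeledPrediction]],
    slow_model: Callable[[str, LabeledPrediction], LabeledPrediction],
) -> List[LabeledPrediction]:
    """Keep confident fast predictions; send uncertain ones to the slow refiner."""
    final = []
    for pred in fast_model(query):
        if pred.sure:
            final.append(pred)  # used immediately, no extra compute
        else:
            final.append(slow_model(query, pred))  # refinement only when needed
    return final
```

In this sketch the slow model is called only for predictions the fast model marks as uncertain, which is one plausible reading of how refinement could stay cheap overall.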
Article 284
Title@2025-07-08 (2): ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark
Title: ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark | ContextASR-Bench: Ein umfangreicher Benchmark für kontextuelle Spracherkennung | ContextASR-Bench:大规模上下文语音识别基准 2507.05727v1 |
Authors (7): He Wang, Linhan Ma, Dake Guo, Xiong Wang, Lei Xie, Jin Xu, Junyang Lin
Automatic Speech Recognition (ASR) has been extensively investigated, yet prior evaluative efforts have largely been restricted to contextless paradigms. This constraint stems from the limited proficiency of conventional ASR models in context modeling and their deficiency in memory and reasoning based on world knowledge. Recent breakthroughs in the development of Large Language Models (LLMs) and corresponding Large Audio Language Models (LALMs) have markedly enhanced the visibility of general artificial intelligence capabilities. Consequently, there exists a compelling need for a benchmark that can evaluate both the generality and intelligence of ASR systems. To address this gap, we propose ContextASR-Bench: a comprehensive, large-scale benchmark designed to assess contextual speech recognition. This benchmark encompasses up to 40,000 data entries across over 10 domains, enabling a thorough evaluation of model performance in scenarios that omit or incorporate coarse-grained or fine-grained contextual information. Moreover, diverging from conventional ASR evaluations, our benchmark includes an analysis of model efficacy in recognizing named entities mentioned within the auditory input. Our extensive evaluation highlights that LALMs, with strong world knowledge and context learning capabilities, outperform conventional ASR models by a large margin. The dataset and evaluation code have been released at https://github.com/MrSupW/ContextASR-Bench.
自动语音识别(ASR)已得到广泛研究,但以往的评测工作大多局限于无上下文的范式。这一局限源于传统 ASR 模型在上下文建模方面能力有限,且缺乏基于世界知识的记忆和推理能力。近期大型语言模型(LLM)及相应的大型音频语言模型(LALM)的突破显著提升了通用人工智能能力的可见度。因此,迫切需要一个能够同时评估 ASR 系统通用性和智能性的基准。为弥补这一空白,我们提出 ContextASR-Bench:一个用于评估上下文语音识别的全面、大规模基准。该基准涵盖10多个领域、多达40,000条数据,可在不提供上下文、提供粗粒度上下文或提供细粒度上下文信息的场景下全面评估模型性能。此外,与传统 ASR 评测不同,我们的基准还分析了模型识别音频输入中所提及命名实体的效果。我们的广泛评测表明,具备强大世界知识和上下文学习能力的 LALM 大幅优于传统 ASR 模型。数据集和评测代码已发布于 https://github.com/MrSupW/ContextASR-Bench。
Article 285
Title@2025-07-08 (2): Truth Neurons
Title: Truth Neurons | Wahrheitsneuronen | 真理神经元 2505.12182v3 |
Authors (5): Haohang Li, Yupeng Cao, Yangyang Yu, Jordan W. Suchow, Zining Zhu
Despite their remarkable success and deployment across diverse workflows, language models sometimes produce untruthful responses. Our limited understanding of how truthfulness is mechanistically encoded within these models jeopardizes their reliability and safety. In this paper, we propose a method for identifying representations of truthfulness at the neuron level. We show that language models contain truth neurons, which encode truthfulness in a subject-agnostic manner. Experiments conducted across models of varying scales validate the existence of truth neurons, confirming that the encoding of truthfulness at the neuron level is a property shared by many language models. The distribution patterns of truth neurons over layers align with prior findings on the geometry of truthfulness. Selectively suppressing the activations of truth neurons found through the TruthfulQA dataset degrades performance both on TruthfulQA and on other benchmarks, showing that the truthfulness mechanisms are not tied to a specific dataset. Our results offer novel insights into the mechanisms underlying truthfulness in language models and highlight potential directions toward improving their trustworthiness and reliability.
尽管语言模型在不同工作流程中取得了显著的成功和部署,但语言模型有时会产生不真实的反应。我们对这些模型中真实性是如何机械地编码的有限理解危及了这些模型的可靠性和安全性。在本文中,我们提出了一个在神经神经层面识别真实性表现的方法。我们显示语言模型包含真理神经元,以主题和不可知性的方式将真实性编码。不同规模的模型进行的实验验证了真知灼见神经元的存在,证实神经层面真实性编码是许多语言模型共有的一个属性。层层间真实性神经元的分布模式与先前关于真实性几何测量的发现相一致。有选择地抑制通过真理QA数据集发现的事实神经元的激活,会降低真理QA和其他基准的性能,表明真实性机制与特定数据集没有联系。我们的结果为语言模型中真实性的基本机制提供了新的洞察力,并突显了提高它们可信度和可靠性的潜在方向。
Article 286
Title@2025-07-08 (2): Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Title: Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition | Omni-Router: Routing-Entscheidungen in Sparse Mixture-of-Experts für die Spracherkennung teilen | Omni-Router: 分享语音识别专家的松散混集决定 2507.05724v1 |
Authors (3): Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
混合专家(MoE)架构已从语言建模扩展到自动语音识别(ASR)。传统的 MoE 方法(如 Switch Transformer)在每一层内独立地进行专家路由。我们的分析表明,大多数层中的路由器所做的专家选择与其他层路由器的选择并无强相关性。为了加强不同层专家之间的协作并鼓励更高程度的专门化,我们在不同的 MoE 层之间使用共享路由器。我们将该模型称为 Omni-router Transformer。在大规模伪标注数据集上的大量实验,以及在10个多样化的域外 ASR 基准上的评测表明,Omni-router Transformer 能够取得更低的训练损失,并持续优于稠密模型和 Switch Transformer 模型,平均词错误率分别降低11.2%和8.2%,同时提供结构化的专家使用方式,并提升对多样化数据的鲁棒性。
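The difference between per-layer routers and a router shared across MoE layers can be illustrated with a toy top-1 routing sketch. Everything here is an assumption made for illustration: the linear router, the sizes, and in particular the simplification that the token's hidden state is held fixed across layers (in a real model it changes from layer to layer, so a shared router would correlate choices rather than make them identical).

```python
import random
from typing import List

Matrix = List[List[float]]


def make_router(num_experts: int, dim: int, rng: random.Random) -> Matrix:
    """Random linear router: one weight row per expert."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]


def top1_expert(router: Matrix, hidden: List[float]) -> int:
    """Index of the expert with the highest router score for this token."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in router]
    return max(range(len(scores)), key=scores.__getitem__)


def routing_path(routers: List[Matrix], hidden: List[float]) -> List[int]:
    """Expert chosen at each layer (hidden state held fixed for simplicity)."""
    return [top1_expert(router, hidden) for router in routers]


rng = random.Random(0)
num_layers, num_experts, dim = 4, 8, 16
token = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Switch-style: every layer owns an independent router.
independent = [make_router(num_experts, dim, rng) for _ in range(num_layers)]
# Shared-router variant: the same router is reused by all layers.
shared = [make_router(num_experts, dim, rng)] * num_layers

print("independent:", routing_path(independent, token))
print("shared:     ", routing_path(shared, token))
```

Under the fixed-hidden-state simplification, the shared variant picks the same expert index at every layer, while the independent routers have no such tie between layers.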
Article 287
Title@2025-07-08 (2): MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
Title: MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment | MobileGUI-RL: Mobile GUI-Agent durch Verstärkungslernen im Online-Umfeld fördern | MobileGUI-RL:通过在线环境中的强化学习推进移动 GUI 智能体 2507.05720v1 |
Authors (8): Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, Dong Yu
Recently, there has been a surge of vision-based GUI agents designed to automate everyday mobile and web tasks. These agents interpret raw GUI screenshots and autonomously decide where to click, scroll, or type, which bypasses handcrafted rules and app-specific APIs. However, most existing methods train GUI agents in offline environments using pre-collected trajectories. This approach limits scalability, causes overfitting to specific UI templates, and leads to brittle policies when faced with unseen environments. We present MobileGUI-RL, a scalable framework that trains GUI agents in online environments. MobileGUI-RL contains two key components. It (i) synthesizes a curriculum of learnable tasks through self-exploration and filtering, and (ii) adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards that balance task success and execution efficiency. Experiments on three online mobile-agent benchmarks show consistent gains, validating the effectiveness of our approach.
近来,旨在自动完成日常移动端和网页任务的基于视觉的 GUI 智能体大量涌现。这些智能体解读原始 GUI 截图,自主决定在何处点击、滚动或输入,从而绕开手工编写的规则和应用专用 API。然而,大多数现有方法使用预先收集的轨迹在离线环境中训练 GUI 智能体。这种做法限制了可扩展性,导致对特定 UI 模板过拟合,并在面对未见过的环境时产生脆弱的策略。我们提出 MobileGUI-RL,这是一个在在线环境中训练 GUI 智能体的可扩展框架。MobileGUI-RL 包含两个关键组成部分:(i) 通过自我探索和筛选合成一套可学习任务的课程;(ii) 将 GRPO 适配到 GUI 导航,采用轨迹感知的优势以及兼顾任务成功率和执行效率的复合奖励。在三个在线移动智能体基准上的实验显示出一致的提升,验证了我们方法的有效性。
Article 288
Title@2025-07-08 (2): HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation
Title: HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation | HIRAG: Hierarchisch-gedankte Instruktion-Tuning-Retrieval-Augmented Generation | HIRAG: 高层次研究教学-引导检索-推荐一代 2507.05714v1 |
Authors (7): YiHan Jiao, ZheHao Tan, Dan Yang, DuoLin Sun, Jie Feng, Jian Wang, Peng Wei
Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often lack a granular focus on the RAG task or a deeper utilization of chain-of-thought processes. To address this, we propose that RAG models should possess three progressively hierarchical abilities: (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG), which incorporates a “think before answering” strategy. This method enhances the model’s open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model’s performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.
检索增强生成(RAG)已成为应对大型语言模型在处理实时信息和特定领域问题时所面临挑战的基本范式。传统 RAG 系统主要依赖大型语言模型自身的上下文学习(ICL)能力。然而,对 RAG 生成模型所需具体能力的深入研究仍然不足,导致在文档质量参差不齐和检索系统不完善的情况下面临挑战。即便是为数不多的对 RAG 生成模型进行微调的研究,也往往缺乏对 RAG 任务的细粒度关注,或未能更深入地利用思维链过程。为此,我们提出 RAG 模型应具备三种逐级递进的能力:(1) 过滤:选择相关信息的能力;(2) 组合:跨段落组合语义信息的能力;(3) RAG 特定推理:利用内部知识进一步处理外部知识的能力。因此,我们提出新的 RAG 指令微调方法 Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation(HIRAG),该方法采用“先思考后回答”的策略。该方法利用多层次递进的思维链提升模型的开卷考试能力。实验表明,HIRAG 训练策略显著提升了模型在 RGB、PopQA、MuSiQue、HotpotQA 和 PubmedQA 等数据集上的性能。
Article 289
Title@2025-07-08 (2): DRAGON: Dynamic RAG Benchmark On News
Title: DRAGON: Dynamic RAG Benchmark On News | DRAGON: Dynamischer RAG-Benchmark auf Neuigkeiten | DRAGON:动态RAG新闻基准 2507.05713v1 |
Authors (7): Fedor Chernogorskii, Sergei Averkiev, Liliya Kudraleeva, Zaven Martirosian, Maria Tikhonova, Valentin Malykh, Alena Fenogenova
Retrieval-Augmented Generation (RAG) is a widely adopted approach for improving the factuality of large language models (LLMs) by incorporating external knowledge at inference time. Although there exist multiple RAG benchmarks for English, evaluation resources for other languages, including Russian, remain scarce and static, failing to capture the dynamic nature of real-world deployments. In this work, we present DRAGON (Dynamic RAG Benchmark On News), the first dynamic benchmark for evaluating RAG systems in Russian on a changing news corpus. DRAGON is built upon a regularly updated corpus of Russian news and public documents and supports comprehensive evaluation of both the retriever and generator components. Question generation is performed automatically using a knowledge graph constructed from the corpus and enables the extraction of four core question types aligned with distinct subgraph patterns. We release a complete evaluation framework comprising the pipeline for automatic question generation, evaluation scripts, which are potentially reusable for other languages and multilingual settings, and benchmark data. We also launch a public leaderboard to encourage community participation and comparison.
检索增强生成(RAG)是一种被广泛采用的方法,通过在推理时引入外部知识来提高大型语言模型(LLM)的事实性。尽管英语已有多个 RAG 基准,但包括俄语在内的其他语言的评测资源仍然稀缺且是静态的,无法反映真实部署的动态特性。在这项工作中,我们提出 DRAGON(Dynamic RAG Benchmark On News),这是首个在不断变化的新闻语料库上评测俄语 RAG 系统的动态基准。DRAGON 建立在定期更新的俄语新闻和公共文档语料库之上,支持对检索器和生成器两个组件进行全面评测。问题生成借助由语料库构建的知识图谱自动完成,可抽取与不同子图模式对应的四种核心问题类型。我们发布了完整的评测框架,包括自动问题生成流水线、评测脚本(有望复用于其他语言和多语言场景)以及基准数据。我们还推出了公开排行榜,以鼓励社区参与和比较。
Article 290
Title@2025-07-08 (2): RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
Title: RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism | RAG-R1 : Förderung der Such- und Begründungsfähigkeiten von LLMs durch Multi-Query-Parallelismus | RAG-R1:通过多查询并行激励 LLM 的搜索与推理能力 2507.02962v2 |
Authors (6): Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while they remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have explored enhancing models’ search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to the single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, aimed at reducing inference time and enhancing the model’s capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.
大型语言模型(LLM)在各类任务中展现了卓越的能力,但由于其内部知识是静态的,仍容易生成幻觉或过时的回答。近期检索增强生成(RAG)方法的进展探索了通过强化学习(RL)增强模型的搜索和推理能力。尽管这些方法展现了有前景的结果,但它们在训练稳定性方面面临挑战,并因单查询模式而存在推理时间长、能力受限等问题。本文提出 RAG-R1,这是一个新的训练框架,旨在使 LLM 能够在推理过程中自适应地利用内部和外部知识。我们进一步将框架内的生成和检索过程从单查询模式扩展为多查询并行,以减少推理时间并增强模型能力。在七个问答基准上的大量实验表明,我们的方法比最强基线最多高出13.2%,并将推理时间减少11.1%。
Article 291
Title@2025-07-08 (2): SIGIR 2025 – LiveRAG Challenge Report
Title: SIGIR 2025 – LiveRAG Challenge Report | SIGIR 2025 – LiveRAG Challenge Report | SIGIR 2025-LiveRAG挑战报告 2507.04942v2 |
Authors (9): David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Oren Somekh, Ran Tavory, Mehdi Ghissassi, Edo Liberty, Roy Miara
The LiveRAG Challenge at SIGIR 2025, held between March and May 2025, provided a competitive platform for advancing Retrieval-Augmented Generation (RAG) technologies. Participants from academia and industry were invited to develop a RAG-based question-answering system using a fixed corpus (Fineweb-10BT) and a common open-source LLM (Falcon3-10B-Instruct). The goal was to facilitate challenging comparisons of retrieval and prompting strategies. During the Live Challenge Day, 70 teams from 27 different countries provided answers and supportive information to 500 unseen questions within a strict two-hour time window. Evaluation was conducted in two stages: first, an automated LLM-as-a-judge approach was used to compute correctness and faithfulness scores; then, a manual review of top-ranked submissions was conducted. The finalists were announced on June 12, 2025, with prizes awarded during the LiveRAG Workshop at SIGIR 2025 in Padua, Italy.
2025年3月至5月在SIGIR 2025年举办的LiveRAG挑战赛,为推进回升一代(RAG)技术提供了一个竞争性平台,来自学术界和工业界的参与者被邀请利用固定体(Fineweb-10BT)和共同开放源码LLM(Falcon3-10B-Instruct)开发基于RAG的问答系统,目的是便利对检索和推动战略进行具有挑战性的比较;在现场挑战日,来自27个不同国家的70个小组在严格的两小时窗口内为500个未见问题提供了答案和辅助信息;评价分两个阶段进行:首先采用自动LLM-A-A-判断方法来计算正确性和忠诚分数,然后对排名最高的呈件进行手工审查;最后一批人于2025年6月12日宣布,在意大利帕杜阿的SIGIR 2025年LiveRAG讲习班期间颁发奖项。
Article 292
Title@2025-07-08 (2): Agentic-R1: Distilled Dual-Strategy Reasoning
Title: Agentic-R1: Distilled Dual-Strategy Reasoning | Agentic-R1: Destillierte Dual-Strategy-Reasoning | Agentic-R1:蒸馏的双策略推理 2507.05707v1 |
Authors (4): Weihua Du, Pranjal Aggarwal, Sean Welleck, Yiming Yang
Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, DualDistill, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using this approach, we train Agentic-R1, which dynamically selects the optimal strategy for each query, invoking tools for arithmetic and algorithmic problems, and using text-based reasoning for abstract ones. Our method improves accuracy across a range of tasks, including both computation-intensive and standard benchmarks, demonstrating the effectiveness of multi-strategy distillation in achieving robust and efficient reasoning. Our project is available at https://github.com/StigLidu/DualDistill
当前的长思维链(long-CoT)模型擅长数学推理,但依赖缓慢且容易出错的自然语言推理轨迹。工具增强型智能体通过代码执行来处理算术问题,但在复杂的逻辑任务上往往表现不佳。我们提出一个微调框架 DualDistill,将来自多个教师的互补推理策略蒸馏到一个统一的学生模型中。利用这一方法,我们训练了 Agentic-R1,它能为每个查询动态选择最优策略:对算术和算法类问题调用工具,对抽象问题使用基于文本的推理。我们的方法在包括计算密集型基准和标准基准在内的一系列任务上提高了准确率,证明了多策略蒸馏在实现稳健且高效的推理方面的有效性。我们的项目可在 https://github.com/StigLidu/DualDistill 获取。
Article 293
Title@2025-07-08 (2): A Survey on Transformer Context Extension: Approaches and Evaluation
Title: A Survey on Transformer Context Extension: Approaches and Evaluation | Eine Umfrage über Transformer Context Extension: Ansätze und Evaluierung | 关于 “ 变换背景推广:方法和评价 “ 的调查 2503.13299v2 |
Authors (5): Yijun Liu, Jinzheng Yu, Yang Xu, Zhongyang Li, Qingfu Zhu
Large language models (LLMs) based on the Transformer have been widely applied in the field of natural language processing (NLP), demonstrating strong performance, particularly in handling short text tasks. However, in long-context scenarios, the performance of LLMs degrades due to several challenges. To alleviate this, a number of approaches have been proposed recently. In this survey, we first list the challenges of applying pre-trained LLMs to process long contexts. We then systematically review the approaches related to long context and propose a taxonomy categorizing them into four main types: positional encoding, context compression, retrieval augmented, and attention pattern. In addition to the approaches, we focus on the evaluation of long context, organizing relevant data, tasks, and metrics based on existing long context benchmarks. Finally, we summarize unresolved issues in the long context domain and put forward our views on future developments.
以变换器为基础的大型语言模型(LLMS)在自然语言处理(NLP)的归档中被广泛应用,表现出很强的成绩,特别是在处理短文本任务方面;然而,在长背景假设中,由于一些挑战,LLMS的性能会退化;为了缓解这种现象,最近提出了若干工作建议;在本次调查中,我们首先列出在长背景中应用预先培训的LLMS的挑战;然后系统地审查与长背景有关的做法,并提议将这些方法分为四大类型:定位编码、环境压缩、检索增强和注意模式;除了这些方法外,我们着重评价长背景,根据现有的长背景基准组织相关数据、任务和衡量标准;最后,我们总结了长期领域尚未解决的问题,并提出我们对未来发展的看法。
Article 294
Title@2025-07-08 (2): PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading
Title: PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading | PulseReddit: Ein neuartiger Reddit-Datensatz für das Benchmarking von MAS im Kryptowährungshandel mit hoher Frequenz | PulseReddit:用于在高频加密货币交易中对 MAS 进行基准测试的新型 Reddit 数据集 2506.03861v2 |
Authors (4): Qiuhan Han, Qian Wang, Atsushi Yoshikawa, Masayuki Yamamura
High-Frequency Trading (HFT) is pivotal in cryptocurrency markets, demanding rapid decision-making. Social media platforms like Reddit offer valuable, yet underexplored, information for such high-frequency, short-term trading. This paper introduces PulseReddit, a novel dataset that is the first to align large-scale Reddit discussion data with high-frequency cryptocurrency market statistics for short-term trading analysis. We conduct an extensive empirical study using Large Language Model (LLM)-based Multi-Agent Systems (MAS) to investigate the impact of social sentiment from PulseReddit on trading performance. Our experiments conclude that MAS augmented with PulseReddit data achieve superior trading outcomes compared to traditional baselines, particularly in bull markets, and demonstrate robust adaptability across different market regimes. Furthermore, our research provides conclusive insights into the performance-efficiency trade-offs of different LLMs, detailing significant considerations for practical model selection in HFT applications. PulseReddit and our findings establish a foundation for advanced MAS research in HFT, demonstrating the tangible benefits of integrating social media.
高频交易(HFT)在加密货币市场中至关重要,要求快速决策。Reddit 等社交媒体平台为这种高频、短期交易提供了有价值但尚未得到充分挖掘的信息。本文介绍 PulseReddit,这是首个将大规模 Reddit 讨论数据与高频加密货币市场统计数据对齐、用于短期交易分析的新数据集。我们利用基于大型语言模型(LLM)的多智能体系统(MAS)开展了广泛的实证研究,考察 PulseReddit 中的社会情绪对交易表现的影响。实验结果表明,与传统基线相比,使用 PulseReddit 数据增强的 MAS 取得了更优的交易结果(在牛市中尤为明显),并在不同市场状态下表现出稳健的适应性。此外,我们的研究还就不同 LLM 的性能与效率权衡提供了明确的见解,详述了在 HFT 应用中进行实际模型选择时的重要考量。PulseReddit 及我们的研究结果为 HFT 领域的高级 MAS 研究奠定了基础,展示了整合社交媒体的切实益处。
Article 295
Title@2025-07-08 (2): Do We Really Need Specialization? Evaluating Generalist Text Embeddings for Zero-Shot Recommendation and Search
Title: Do We Really Need Specialization? Evaluating Generalist Text Embeddings for Zero-Shot Recommendation and Search | Brauchen wir wirklich Spezialisierung? Auswertung von Generalisten-Text-Embeddings für Zero-Shot-Empfehlung und Suche | 我们真的需要专业化吗?评估零热建议和搜索通用文本嵌入 2507.05006v2 |
Authors (6): Matteo Attimonelli, Alessandro De Bellis, Claudio Pomo, Dietmar Jannach, Eugenio Di Sciascio, Tommaso Di Noia
Pre-trained language models (PLMs) are widely used to derive semantic representations from item metadata in recommendation and search. In sequential recommendation, PLMs enhance ID-based embeddings through textual metadata, while in product search, they align item characteristics with user intent. Recent studies suggest task and domain-specific fine-tuning are needed to improve representational power. This paper challenges this assumption, showing that Generalist Text Embedding Models (GTEs), pre-trained on large-scale corpora, can guarantee strong zero-shot performance without specialized adaptation. Our experiments demonstrate that GTEs outperform traditional and fine-tuned models in both sequential recommendation and product search. We attribute this to a superior representational power, as they distribute features more evenly across the embedding space. Finally, we show that compressing embedding dimensions by focusing on the most informative directions (e.g., via PCA) effectively reduces noise and improves the performance of specialized models. To ensure reproducibility, we provide our repository at https://split.to/gte4ps.
预训练语言模型(PLM)被广泛用于在推荐和搜索中从物品元数据获得语义表示。在序列推荐中,PLM 通过文本元数据增强基于 ID 的嵌入;在商品搜索中,它们将物品特征与用户意图对齐。近期研究认为,需要针对任务和领域的微调才能提高表示能力。本文对这一假设提出质疑,表明在大规模语料库上预训练的通用文本嵌入模型(GTE)无需专门适配即可保证强劲的零样本性能。我们的实验表明,GTE 在序列推荐和商品搜索中均优于传统模型和微调模型。我们将其归因于更强的表示能力,因为它们的特征在嵌入空间中分布得更加均匀。最后,我们表明,通过聚焦信息量最大的方向(例如通过 PCA)来压缩嵌入维度,可以有效降低噪声并提升专用模型的性能。为确保可复现性,我们在 https://split.to/gte4ps 提供了代码仓库。
Article 296
Title@2025-07-08 (2): Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria
Title: Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria | KI-Beratung auf Japanisch bewerten: Berater-, Klienten- und Evaluator-Rollen, bewertet nach Kriterien der Motivierenden Gesprächsführung | 评估日语 AI 咨询:依据动机式访谈标准评估的咨询师、来访者和评估者角色 2507.02950v2 |
Authors (7): Keita Kiuchi, Yoshikazu Fujimoto, Hideyuki Goto, Tomonori Hosokawa, Makoto Nishimura, Yosuke Sato, Izumi Sezai
This study provides the first comprehensive evaluation of large language model (LLM) performance across three counseling roles in Japanese-language therapeutic contexts. We simultaneously assessed counselor artificial intelligence (AI) systems (GPT-4-turbo with zero-shot prompting or Structured Multi-step Dialogue Prompts (SMDP), Claude-3-Opus-SMDP), client AI simulations, and evaluation AI systems (o3, Claude-3.7-Sonnet, Gemini-2.5-pro). Human experts (n = 15) with extensive counseling experience evaluated AI-generated dialogues using the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Notably, SMDP implementation significantly enhanced counselor AI performance across all MITI global ratings compared with zero-shot prompting, with no significant differences between GPT-SMDP and Opus-SMDP. Evaluation AIs showed comparable performance to human raters for Cultivating Change Talk but systematically overestimated Softening Sustain Talk and the overall quality metrics. Model-specific biases emerged: Gemini emphasized power-sharing, o3 focused on technical proficiency, and Sonnet prioritized emotional expression. Client AI simulations exhibited a limited emotional range and unnaturally high compliance, indicating the need for enhanced realism. These findings establish benchmarks for AI-assisted counseling in non-English contexts and identify critical areas for improvement through advanced prompt engineering, retrieval-augmented generation, and targeted fine-tuning, with important implications for developing culturally sensitive AI mental health tools.
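本研究首次全面评估了大型语言模型(LLM)在日语心理咨询情境中三种角色上的表现。我们同时评估了咨询师 AI 系统(采用零样本提示或结构化多步对话提示(SMDP)的 GPT-4-turbo,以及 Claude-3-Opus-SMDP)、来访者 AI 模拟和评估 AI 系统(o3、Claude-3.7-Sonnet、Gemini-2.5-pro)。具有丰富咨询经验的人类专家(n = 15)依据动机式访谈治疗完整性(MITI)编码手册4.2.1对 AI 生成的对话进行了评估。值得注意的是,与零样本提示相比,采用 SMDP 显著提升了咨询师 AI 在所有 MITI 总体评分上的表现,而 GPT-SMDP 与 Opus-SMDP 之间没有显著差异。评估 AI 在 Cultivating Change Talk 上的表现与人类评分者相当,但系统性地高估了 Softening Sustain Talk 和总体质量指标。各模型呈现出特有的偏差:Gemini 强调权力共享,o3 侧重技术熟练度,Sonnet 则优先考虑情感表达。来访者 AI 模拟表现出有限的情感范围和不自然的高依从性,表明需要增强真实感。这些发现为非英语情境下的 AI 辅助咨询确立了基准,并指出了可通过先进的提示工程、检索增强生成和有针对性的微调加以改进的关键领域,对开发具有文化敏感性的 AI 心理健康工具具有重要意义。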
Article 297
Title@2025-07-08 (2): AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs
Title: AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs | AutoTriton: Automatische Triton-Programmierung mit Verstärkungslernen in LLMs | AutoTriton: 自动三联编程,在LLMs中进行强化学习 2507.05687v1 |
Authors (11): Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, Maosong Sun
Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific languages like Triton simplify GPU programming by abstracting low-level details, developers must still manually tune critical parameters such as tile sizes and memory access patterns through iterative experimentation, creating substantial barriers to optimal performance and wider adoption. In this work, we introduce AutoTriton, the first model dedicated to Triton programming powered by reinforcement learning (RL). AutoTriton first performs supervised fine-tuning (SFT) on data from a high-quality data gathering pipeline to acquire essential Triton programming expertise, and then conducts RL with the Group Relative Policy Optimization (GRPO) algorithm, combining a rule-based reward and an execution-based reward to further improve its Triton programming ability. Experiments across five evaluation channels of TritonBench and KernelBench illustrate that our 8B model AutoTriton achieves performance comparable to mainstream large models, including Claude-4-Sonnet and DeepSeek-R1-0528. Further experimental analysis demonstrates the crucial role of each module within AutoTriton, including the SFT stage, the RL stage, and the reward design strategy. These findings underscore the promise of RL for automatically generating high-performance kernels, and since high-performance kernels are core components of AI systems, this breakthrough establishes an important foundation for building more efficient AI systems. The model and code will be available at https://github.com/AI9Stars/AutoTriton.
深度学习中的内核开发需要针对各类硬件优化计算单元,同时通过大量的经验性调优在内存管理、并行性和硬件特定优化之间取得平衡。尽管 Triton 等领域特定语言通过抽象底层细节简化了 GPU 编程,开发者仍必须通过反复实验手动调节分块大小和内存访问模式等关键参数,这为获得最优性能和更广泛的采用造成了很大障碍。在这项工作中,我们提出 AutoTriton,这是首个基于强化学习(RL)、专用于 Triton 编程的模型。AutoTriton 先利用高质量的数据收集流水线进行监督微调(SFT),以掌握必要的 Triton 编程知识,再使用组相对策略优化(GRPO)算法进行 RL,结合基于规则的奖励和基于执行的奖励,进一步提升 Triton 编程能力。在 TritonBench 和 KernelBench 的五个评测通道上的实验表明,我们的 8B 模型 AutoTriton 的性能可与主流大模型相媲美,包括 Claude-4-Sonnet 和 DeepSeek-R1-0528。进一步的实验分析表明了 AutoTriton 中各模块的关键作用,包括 SFT 阶段、RL 阶段和奖励设计策略。这些发现凸显了 RL 在自动生成高性能内核方面的前景;由于高性能内核是 AI 系统的核心组件,这一突破为构建更高效的 AI 系统奠定了重要基础。模型和代码将在 https://github.com/AI9Stars/AutoTriton 提供。
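The core of GRPO named in the AutoTriton abstract is a group-relative advantage: rewards for several samples of the same prompt are standardized within that group. The sketch below shows only that step. The way the rule-based and execution-based signals are combined (`combined_reward`, an equal-weight sum of two boolean checks) is a hypothetical choice for illustration; the abstract does not specify the actual reward design.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(passes_rules: bool, passes_execution: bool,
                    rule_weight: float = 0.5, exec_weight: float = 0.5) -> float:
    """Hypothetical weighted sum of a rule-based and an execution-based check."""
    return rule_weight * float(passes_rules) + exec_weight * float(passes_execution)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Samples that beat their group's mean get a positive advantage and the rest a negative one, so no separate value network is needed; when every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.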
Article 298
Title@2025-07-08 (2): Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs
Title: Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs | Smoothie-Qwen: Post-Hoc-Glättung, um Sprach-Bias in mehrsprachigen LLMs zu reduzieren | Smoothie-Qwen:通过事后平滑减少多语言 LLM 中的语言偏差 2507.05686v1 |
Authors (5): SeungWon Ji, Jungyup Lee, Jemin Kim, Sang Park, SeungJae Lee
Multilingual large language models (LLMs) often exhibit language confusion, a tendency to generate responses in a dominant language irrespective of the prompt’s language. To address this, we propose Smoothie-Qwen, a lightweight, post-hoc method that mitigates language bias without retraining. This technique selectively adjusts token-level output probabilities to effectively suppress undesired language generation. Applied to the Qwen model, our method reduces unintended Chinese output by over 95% while preserving task accuracy on multilingual benchmarks. This work provides a practical and efficient solution for enhancing the language controllability of LLMs, making them more reliable for global applications.
多语言大型语言模型(LLM)经常出现语言混淆,即无论提示使用何种语言,都倾向于用某种主导语言生成回复。为解决这一问题,我们提出 Smoothie-Qwen,这是一种轻量级的事后方法,无需重新训练即可减轻语言偏差。该技术有选择地调整词元级输出概率,以有效抑制不期望的语言生成。应用于 Qwen 模型时,我们的方法将非预期的中文输出减少了95%以上,同时在多语言基准上保持了任务准确率。这项工作为增强 LLM 的语言可控性提供了实用而高效的解决方案,使其在全球应用中更加可靠。
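Adjusting token-level output probabilities without retraining can be pictured as lowering the logits of a chosen set of tokens before the softmax. The sketch below is a generic illustration of that idea, not the Smoothie-Qwen procedure itself: the flagged token ids, the fixed additive penalty, and the function names are all assumptions, and the abstract does not say where in the model the adjustment is applied.

```python
import math
from typing import List, Set


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_tokens(logits: List[float], flagged_ids: Set[int],
                    penalty: float) -> List[float]:
    """Subtract a fixed penalty from the logits of flagged token ids only."""
    return [x - penalty if i in flagged_ids else x for i, x in enumerate(logits)]
```

Because only the flagged entries are changed, the relative ordering of all other tokens is preserved, and probability mass moves away from the flagged tokens in proportion to the penalty.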
Article 299
Title@2025-07-08 (2): BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset
Title: BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset | BMMR: Ein groß angelegter zweisprachiger, multimodaler, multidisziplinärer Reasoning-Datensatz | BMMR:大规模双语多模态多学科推理数据集 2507.03483v2 |
Authors (16): Zhiheng Xi, Guanyu Li, Yutao Fan, Honglin Guo, Yufang Liu, Xiaoran Fan, Jiaqi Liu, Jingchao Ding, Wangmeng Zuo, Zhenfei Yin, Lei Bai, Tao Ji, Tao Gui, Qi Zhang, Philip Torr, Xuanjing Huang
In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects, in diverse formats (multiple-choice, fill-in-the-blank, and open-ended QA), and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval that comprises 20,458 high-quality instances to comprehensively assess LMMs’ knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train that contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.
在本文中,我们提出了BMMR,一个大规模双语、多模态、多学科推理数据集,供社区开发和评估大型多模态模型(LMMs)。BMMR由11万个大学水平的问题组成,涵盖300个联合国教科文组织界定的学科,题型多样(选择题、填空题和开放式问答),来源于书籍、考试和测验等印刷与数字媒体。所有数据均通过人在回路的可扩展框架进行整理和过滤,每个实例都配有高质量的推理路径。该数据集分为两个部分:BMMR-Eval包含20,458个高质量实例,用于全面评估LMMs在中英文多个学科中的知识和推理能力;BMMR-Train包含88,991个实例,用于支持进一步的研究与开发,将当前对数学推理的关注扩展到不同的学科和领域。此外,我们提出了基于过程的多学科验证器(即BMMR-Verifier),用于对推理路径进行准确、细粒度的评估。在24个模型上的大量实验表明:(i)即使是最先进的模型(如o3和Gemini-2.5-Pro)在BMMR-Eval上仍有很大提升空间;(ii)推理模型存在学科偏差,仅在特定学科上优于LMMs;(iii)开源模型仍落后于闭源模型;(iv)在BMMR-Train上微调可缩小这一差距。
Article 300
Title@2025-07-08 (2): TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data
Title: TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | TuneShield: Eindämmung der Toxizität bei konversatorischer KI während der Feinabstimmung auf nicht vertrauenswürdige Daten | TuneShield:在对不可信数据进行微调的同时,在交流性AI中减轻毒性 2507.05660v1 |
Authors (7): Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath
Recent advances in foundation models, such as LLMs, have revolutionized conversational AI. Chatbots are increasingly being developed by customizing LLMs on specific conversational datasets. However, mitigating toxicity during this customization, especially when dealing with untrusted training data, remains a significant challenge. To address this, we introduce TuneShield, a defense framework designed to mitigate toxicity during chatbot fine-tuning while preserving conversational quality. TuneShield leverages LLM-based toxicity classification, utilizing the instruction-following capabilities and safety alignment of LLMs to effectively identify toxic samples, outperforming industry API services. TuneShield generates synthetic conversation samples, termed ‘healing data’, based on the identified toxic samples, using them to mitigate toxicity while reinforcing desirable behavior during fine-tuning. It performs an alignment process to further nudge the chatbot towards producing desired responses. Our findings show that TuneShield effectively mitigates toxicity injection attacks while preserving conversational quality, even when the toxicity classifiers are imperfect or biased. TuneShield proves to be resilient against adaptive adversarial and jailbreak attacks. Additionally, TuneShield demonstrates effectiveness in mitigating adaptive toxicity injection attacks during dialog-based learning (DBL).
Article 301
Title@2025-07-08 (2): SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge
Title: SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge | SHNU Mehrsprachiges Spracherkennungssystem für INTERSPEECH 2025 MLC-SLM Challenge | SHNU 2025年刚果解放运动-解运挑战 2507.03343v2 |
Authors (4): Yuxiang Mei, Yuang Zheng, Dongxing Xu, Yanhua Long
This paper describes the SHNU multilingual conversational speech recognition system (SHNU-mASR, team name “maybe”), submitted to Track 1 of the INTERSPEECH 2025 MLC-SLM Challenge. Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework. The parallel speech encoder consists of two pre-trained encoders, the Whisper-large-v3 encoder and the mHuBERT-147 encoder. Their output embeddings are concatenated and fed into the LLM, enabling the model to leverage complementary acoustic and linguistic knowledge and achieve competitive performance. Moreover, we adopt a tri-stage training strategy to jointly update the low-rank adaptation modules and projector parameters of both the speech encoders and the LLM. In addition, we add a language-aware prompt at the LLM input to enhance language-specific text generation. The SHNU-mASR system achieves an overall character/word error rate (CER/WER) of 11.76% on the blind evaluation set of the challenge, outperforming the official MLC-SLM baseline by 8.41 absolute CER/WER points without increasing the baseline training data.
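The CER/WER figure quoted above is an edit-distance ratio. The following is a minimal sketch of the standard definition (substitutions, insertions, and deletions divided by reference length); it is not the challenge's official scoring script, and the function names and the whitespace handling are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalised by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, spaces ignored (an assumption)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))
```

A mixed CER/WER score, as reported here, typically applies the character variant to languages without whitespace word boundaries and the word variant elsewhere.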
Article 302
Title@2025-07-08 (2): MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models
Title: MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models | MEF: Ein Capability-Aware Multi-Encryption Framework zur Bewertung von Schwachstellen in Black-Box Large Language Models | MEF: 用于评价黑箱大语言模型脆弱性的能力-软件多加密框架 2505.23404v3 |
Authors (6): Mingyu Yu, Wei Wang, Yanjie Wei, Sujuan Qin, Fei Gao, Wenmin Li
Recent advancements in adversarial jailbreak attacks have revealed significant vulnerabilities in Large Language Models (LLMs), facilitating the evasion of alignment safeguards through increasingly sophisticated prompt manipulations. In this paper, we propose MEF, a capability-aware multi-encryption framework for evaluating vulnerabilities in black-box LLMs. Our key insight is that the effectiveness of jailbreak strategies can be significantly enhanced by tailoring them to the semantic comprehension capabilities of the target model. We present a typology that classifies LLMs into Type I and Type II based on their comprehension levels, and design adaptive attack strategies for each. MEF combines layered semantic mutations and dual-ended encryption techniques, enabling circumvention of input, inference, and output-level defenses. Experimental results demonstrate the superiority of our approach. Remarkably, it achieves a jailbreak success rate of 98.9% on GPT-4o (29 May 2025 release). Our findings reveal vulnerabilities in current LLMs’ alignment defenses.
Article 303
Title@2025-07-08 (2): ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?
Title: ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues? | ECom-Bench: Kann LLM Agent Probleme mit dem echten E-Commerce lösen? | ECom-Bench:LLM代理能否解决现实世界电子商务客户支助问题? 2507.05639v1 |
Authors (8): Haoxin Wang, Xianhan Peng, Xucheng Huang, Yizhe Huang, Ming Gong, Chenghan Yang, Yang Liu, Ling Jiang
In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agents with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. Upon publication, the code and data will be open-sourced to facilitate further research and development in this domain.
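The abstract does not define pass^3. The sketch below assumes the commonly used reading of pass^k, in which a task counts as solved only if all k independently sampled trials succeed, estimated from n recorded trials with c successes as C(c, k) / C(n, k). The function names are hypothetical and the benchmark's own scoring may differ.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Estimate the probability that k trials drawn without replacement all succeed."""
    if not 0 <= num_successes <= num_trials:
        raise ValueError("need 0 <= num_successes <= num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("need 1 <= k <= num_trials")
    # comb(c, k) is 0 when c < k, so tasks with too few successes score 0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(per_task_results, k):
    """Average the per-task estimate; per_task_results holds (num_trials, num_successes) pairs."""
    if not per_task_results:
        raise ValueError("no tasks given")
    return sum(pass_hat_k(n, c, k) for n, c in per_task_results) / len(per_task_results)
```

Under this reading, pass^k decreases as k grows, so it rewards consistency across repeated attempts instead of a single lucky success.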
Article 304
Title@2025-07-08 (2): Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs
Title: Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs | Feint and Attack: Achtsamkeitsbasierte Strategien für das Jailbreaking und den Schutz von LLMs | Feint和攻击:以关注为基础的破狱和保护LLMs 2410.16327v2 |
Authors (8): Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu, Zaisheng Ye
Jailbreak attacks can be used to assess the vulnerabilities of Large Language Models (LLMs) by inducing them to generate harmful content. The most common attack method is to construct semantically ambiguous prompts that confuse and mislead the LLM. To assess security and reveal the intrinsic relation between the input prompt and the output of LLMs, the distribution of attention weights is introduced to analyze the underlying reasons. Using statistical analysis, several novel metrics are defined to better describe the distribution of attention weights, such as the Attention Intensity on Sensitive Words (Attn_SensWords), the Attention-based Contextual Dependency Score (Attn_DepScore), and the Attention Dispersion Entropy (Attn_Entropy). By leveraging the distinct characteristics of these metrics together with the beam search algorithm, and inspired by the military strategy “Feint and Attack”, an effective jailbreak strategy named Attention-Based Attack (ABA) is proposed. In ABA, nested attack prompts are employed to divert the attention distribution of the LLM, so that the more harmless parts of the input attract its attention. In addition, motivated by ABA, an effective defense strategy called Attention-Based Defense (ABD) is also put forward. In contrast to ABA, ABD enhances the robustness of LLMs by calibrating the attention distribution of the input prompt. Comparative experiments demonstrate the effectiveness of ABA and ABD; both can therefore be used to assess the security of LLMs. The experimental results also support the explanation that the distribution of attention weights strongly influences LLM output.
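The abstract names its attention metrics without giving formulas. As a rough illustration only, the sketch below assumes that a dispersion entropy can be read as the Shannon entropy of the normalised attention weights over input tokens, and that intensity on sensitive words is the share of attention mass falling on a given word list. Both definitions and all names are assumptions, not the paper's formulas.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of attention weights after normalising them to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    probs = [w / total for w in weights]
    return sum(-p * math.log(p) for p in probs if p > 0)


def sensitive_word_intensity(weights, tokens, sensitive_words):
    """Fraction of total attention mass that lands on tokens in sensitive_words."""
    if len(weights) != len(tokens):
        raise ValueError("weights and tokens must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    hit = sum(w for w, t in zip(weights, tokens) if t.lower() in sensitive_words)
    return hit / total
```

With these readings, a prompt that spreads attention over many harmless tokens would show higher entropy and lower sensitive-word intensity than a direct harmful request.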
Article 305
Title@2025-07-08 (2): Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
Title: Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models | Sarkasmus durch unterschiedliche Augen sehen: Multimodale Sarkasmuswahrnehmung in großen Vision-Sprachen-Modellen analysieren | 透过不同的眼睛看到讽刺:在大型视觉语言模型中分析多模式讽刺的认知 2503.12149v2 |
Authors (5): Junjie Chen, Xuyang Liu, Subin Huang, Linfeng Zhang, Hang Yu
With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs over 2,409 samples, we examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous “neutral” cases. Our findings reveal notable discrepancies – across LVLMs and within the same model under varied prompts. While classification-oriented prompts yield higher internal consistency, models diverge markedly when tasked with interpretive reasoning. These results challenge binary labeling paradigms by highlighting sarcasm’s subjectivity. We advocate moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling, offering deeper insights into multimodal sarcasm comprehension. Our code and data are available at: https://github.com/CoderChen01/LVLMSarcasmAnalysis
Article 306
Title@2025-07-08 (2): SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression
Title: SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression | SARA: Selektive und adaptive retrieval-augmentierte Generation mit Kontextkompression | SARA:有环境压缩的选择性和适应性再取回型 2507.05633v1 |
Authors (7): Yiqiao Jin, Kartik Sharma, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar
Retrieval-augmented Generation (RAG) extends large language models (LLMs) with external knowledge but faces key challenges: restricted effective context length and redundancy in retrieved documents. Pure compression-based approaches reduce input size but often discard fine-grained details essential for factual accuracy. We propose SARA, a unified RAG framework that balances local precision and global knowledge coverage under tight context budgets. SARA combines natural-language text snippets with semantic compression vectors to jointly enhance context efficiency and answer correctness. It represents contexts at two complementary levels: 1) fine-grained natural-language spans that preserve critical entities and numerical values, and 2) compact, interpretable vectors that summarize high-level semantics. An iterative evidence-selection module employs the compression vectors for dynamic reranking of contexts. Across 9 datasets and 5 open-source LLMs spanning 3 model families (Mistral, Llama, and Gemma), SARA consistently improves answer relevance (+17.71), answer correctness (+13.72), and semantic similarity (+15.53), demonstrating the importance of integrating textual and compressed representations for robust, context-efficient RAG.
Article 307
Title@2025-07-08 (2): On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows
Title: On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows | Über die Rolle von Feedback bei der Testzeitskalierung von agentischen KI-Workflows | 关于反馈在AI类制剂工作流量的测试-时间尺度方面作用的反馈 2504.01931v4 |
Authors (11): Souradip Chakraborty, Mohammadreza Pourreza, Ruoxi Sun, Yiwen Song, Nino Scherrer, Furong Huang, Amrit Singh Bedi, Ahmad Beirami, Jindong Gu, Hamid Palangi, Tomas Pfister
Agentic AI workflows (systems that autonomously plan and act) are becoming widespread, yet their success rate on complex tasks remains low. A promising solution is inference-time alignment, which uses extra compute at test time to improve performance. Inference-time alignment relies on three components: sampling, evaluation, and feedback. While most prior work studies sampling and automatic evaluation, feedback remains underexplored. To study the role of feedback, we introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques (reward models or AI-generated textual feedback) between decoding steps. Through IAD, we analyze feedback along four dimensions: (1) its role in the accuracy-compute trade-off under a limited inference budget, (2) the gains it yields over diversity-only baselines such as best-of-N sampling, (3) the effectiveness of feedback from reward models versus textual critiques, and (4) robustness to noisy or low-quality feedback. Across Sketch2Code, Text2SQL, Intercode, and WebShop, we show that IAD with proper integration of high-fidelity feedback yields consistent gains of up to 10 percent absolute over various baselines such as best-of-N. Our findings underscore feedback as a crucial knob for inference-time alignment of agentic AI workflows with a limited inference budget.
Article 308
Title@2025-07-08 (2): Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent
Title: Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent | Adsorb-Agent: Autonome Identifizierung stabiler Adsorptionskonfigurationen über Large Language Model Agent | Adsorb-Agent:通过大语言模式代理自动识别稳定吸附配置 2410.16658v4 |
Authors (5): Janghoon Ock, Radheesh Sharma Meda, Tirtha Vinchurkar, Yayati Jadhav, Amir Barati Farimani
Adsorption energy is a key reactivity descriptor in catalysis. Determining adsorption energy requires evaluating numerous adsorbate-catalyst configurations, making it computationally intensive. Current methods rely on exhaustive sampling, which does not guarantee the identification of the global minimum energy. To address this, we introduce Adsorb-Agent, a Large Language Model (LLM) agent designed to efficiently identify stable adsorption configurations corresponding to the global minimum energy. Adsorb-Agent leverages its built-in knowledge and reasoning to strategically explore configurations, significantly reducing the number of initial setups required while improving energy prediction accuracy. In this study, we also evaluated the performance of different LLMs, including GPT-4o, GPT-4o-mini, Claude-3.7-Sonnet, and DeepSeek-Chat, as the reasoning engine for Adsorb-Agent, with GPT-4o showing the strongest overall performance. Tested on twenty diverse systems, Adsorb-Agent identifies comparable adsorption energies for 84% of cases and achieves lower energies for 35%, particularly excelling in complex systems. It identifies lower energies in 47% of intermetallic systems and 67% of systems with large adsorbates. These findings demonstrate Adsorb-Agent’s potential to accelerate catalyst discovery by reducing computational costs and enhancing prediction reliability compared to exhaustive search methods.
Article 309
Title@2025-07-08 (2): Flipping Knowledge Distillation: Leveraging Small Models’ Expertise to Enhance LLMs in Text Matching
Title: Flipping Knowledge Distillation: Leveraging Small Models’ Expertise to Enhance LLMs in Text Matching | Flipping Knowledge Destillation: Kompetenz kleiner Modelle nutzen, um LLMs in Text Matching zu verbessern | 减轻知识蒸馏:利用小型模型的专长加强文本匹配中的LLM 2507.05617v1 |
Authors (5): Mingzhe Li, Jing Xiang, Qishen Zhang, Kaiyang Wan, Xiuying Chen
Knowledge distillation typically involves transferring knowledge from a Large Language Model (LLM) to a Smaller Language Model (SLM). However, in tasks such as text matching, fine-tuned smaller models often yield more effective domain-specific representations, as they focus on optimizing the similarity of input pairs. To leverage both the specialized strengths of small models and the rich semantic understanding of LLMs, we introduce a flipped knowledge distillation paradigm, where the LLM learns from the SLM. Specifically, we address the architectural gap between decoder-only LLMs and smaller encoder-based models by reinterpreting LLMs in an encoder-decoder manner using LoRA. The encoder generates compressed representations, while the decoder maps them to the output space. During training, the encoder produces representations and their similarities, which are then aligned with the similarity scores produced by the teacher, using our proposed Margin-aware Contrastive Learning (MCL) approach. MCL ensures accurate similarity for both positive and negative pairs, and adaptively handles the internal differences within positive and negative samples. Our paradigm requires only a reasonably well-performing SLM, allowing the LLM to achieve improved performance. Experiments on financial and healthcare benchmarks, as well as real-world applications, confirm its effectiveness, and the model has been fully deployed in an online environment.
Article 310
Title@2025-07-08 (2): Self-Review Framework for Enhancing Instruction Following Capability of LLM
Title: Self-Review Framework for Enhancing Instruction Following Capability of LLM | Selbst-Review-Rahmen für die Verbesserung der Anleitung nach der Fähigkeit von LLM | 根据LLM能力加强教学的自我审查框架 2507.05598v1 |
Authors (1): Sihyun Park
Various techniques have been proposed to improve the adherence of large language models (LLMs) to formatting and instruction constraints. One of the most effective approaches involves utilizing high-quality data generated by powerful models. However, such models often fail to fully comply with complex instructions in a single generation. To address this limitation, iterative revision methods have been introduced. Nevertheless, as the number of data points and revision iterations increases, the associated monetary costs grow significantly. As a resource-efficient alternative, methods have been proposed that leverage high-performance evaluation tools to compensate for the limited self-evaluation capabilities of open-source LLMs. However, these approaches often lead to a degradation in output quality due to excessive revision. To overcome these challenges, we propose Re5, a self-evaluation and revision framework designed to enhance instruction-following performance while preserving the quality of the generated content. Re5 extracts task and constraint components from user instructions, performs structural evaluations to prevent error accumulation, and applies fine-grained constraint-specific content evaluations followed by selective revisions. This process ensures precise and quality-preserving improvements. The final high-quality outputs are used for alignment tuning, enabling long-term alignment improvements through a data-centric iterative refinement loop. Experimental results demonstrate that, even with a small amount of data, Re5 achieves instruction-following performance comparable to models trained on data generated by GPT-4o-mini, a high-performance model, while maintaining response quality with a 64.24% win rate over the non-revised initial responses. These results validate Re5 as an efficient and effective solution for enhancing instruction adherence with minimal external supervision.
Article 311
Title@2025-07-08 (2): Efficient Detection of Intermittent Job Failures Using Few-Shot Learning
Title: Efficient Detection of Intermittent Job Failures Using Few-Shot Learning | Effiziente Erkennung intermittierender Job-Fälle durch wenig scharfes Lernen | 利用很少热的学习方法有效检测间歇性工作失败 2507.04173v2 |
Authors (3): Henri Aïdasso, Francis Bordeleau, Ali Tizghadam
One of the main challenges developers face in the use of continuous integration (CI) and deployment pipelines is the occurrence of intermittent job failures, which result from unexpected non-deterministic issues (e.g., flaky tests or infrastructure problems) rather than regular code-related errors such as bugs. Prior studies developed machine learning (ML) models trained on large datasets of job logs to classify job failures as either intermittent or regular. As an alternative to costly manual labeling of large datasets, the state-of-the-art (SOTA) approach leveraged a heuristic based on non-deterministic job reruns. However, this method mislabels intermittent job failures as regular in contexts where rerunning suspicious job failures is not an explicit policy, and therefore limits the SOTA’s performance in practice. In fact, our manual analysis of 2,125 job failures from 5 industrial and 1 open-source projects reveals that, on average, 32% of intermittent job failures are mislabeled as regular. To address these limitations, this paper introduces a novel approach to intermittent job failure detection using few-shot learning (FSL). Specifically, we fine-tune a small language model using a small number of manually labeled log examples to generate rich embeddings, which are then used to train an ML classifier. Our FSL-based approach achieves 70-88% F1-score with only 12 shots in all projects, outperforming the SOTA, which proved ineffective (34-52% F1-score) in 4 projects. Overall, this study underlines the importance of data quality over quantity and provides a more efficient and practical framework for the detection of intermittent job failures in organizations.
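To see why a rerun-based heuristic can mislabel failures, consider the following minimal sketch. It assumes a simplified version of such a heuristic: a failed job is labeled intermittent only if the same job on the same commit also succeeded at least once. The record fields (`id`, `commit`, `name`, `status`) are hypothetical, and the actual SOTA heuristic may differ in detail.

```python
from collections import defaultdict


def label_failures_by_rerun(jobs):
    """Label each failed job as 'intermittent' or 'regular' from rerun outcomes.

    jobs: list of dicts with hypothetical keys 'id', 'commit', 'name', and
    'status' ('success' or 'failed').
    """
    outcomes = defaultdict(set)
    for job in jobs:
        outcomes[(job["commit"], job["name"])].add(job["status"])

    labels = {}
    for job in jobs:
        if job["status"] != "failed":
            continue
        # Same job, same commit, different outcome: treated as non-deterministic.
        seen = outcomes[(job["commit"], job["name"])]
        labels[job["id"]] = "intermittent" if "success" in seen else "regular"
    return labels
```

A flaky failure that nobody reruns never gets a matching success, so this rule falls back to "regular", which is the mislabeling pattern the abstract describes.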
Article 312
Title@2025-07-08 (2): Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation
Title: Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation | Low-Rank und Sparse Modell Zusammenführung für Multi-Lingual Spracherkennung und Übersetzung | 多语言语音识别和翻译的低兰克和松散模式合并 2502.17380v3 |
Authors (3): Qiuming Zhao, Guangzhi Sun, Chao Zhang
Language diversity presents a significant challenge in speech-to-text (S2T) tasks, such as automatic speech recognition and translation. Traditional multi-lingual multi-task training approaches aim to address this by jointly optimising multiple speech recognition and translation tasks across various languages. While models like Whisper, built on these strategies, demonstrate strong performance, they still face issues of high computational cost, language interference, suboptimal training configurations, and limited extensibility. To overcome these challenges, we introduce LoRS-Merging (low-rank and sparse model merging), a novel technique designed to efficiently integrate models trained on different languages or tasks while preserving performance and reducing computational overhead. LoRS-Merging combines low-rank and sparse pruning to retain essential structures while eliminating redundant parameters, mitigating language interference, and enhancing extensibility. Experimental results across 10 languages demonstrate that LoRS-Merging significantly outperforms multi-lingual multi-task training, sequential training, and other merging methods, achieving over 20% improvement in normalised performance. Our findings suggest that model merging, particularly LoRS-Merging, is a scalable and effective complement to traditional multi-lingual training strategies for S2T applications.
Article 313
Title@2025-07-08 (2): The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation
Title: The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation | Die Landschaft der Erinnerung in LLMs: Mechanismen, Messung und Milderung | LLMM中纪念的景观:机制、衡量和缓解 2507.05578v1 |
Authors (4): Alexander Xiong, Xuandong Zhao, Aneesh Pappu, Dawn Song
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they also exhibit memorization of their training data. This phenomenon raises critical questions about model behavior, privacy risks, and the boundary between learning and memorization. Addressing these concerns, this paper synthesizes recent studies and investigates the landscape of memorization, the factors influencing it, and methods for its detection and mitigation. We explore key drivers, including training data duplication, training dynamics, and fine-tuning procedures that influence data memorization. In addition, we examine methodologies such as prefix-based extraction, membership inference, and adversarial prompting, assessing their effectiveness in detecting and measuring memorized content. Beyond technical analysis, we also explore the broader implications of memorization, including the legal and ethical implications. Finally, we discuss mitigation strategies, including data cleaning, differential privacy, and post-training unlearning, while highlighting open challenges in balancing the minimization of harmful memorization with utility. This paper provides a comprehensive overview of the current state of research on LLM memorization across technical, privacy, and performance dimensions, identifying critical directions for future work.
Article 314
Title@2025-07-08 (2): Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
Title: Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models | Feature-Extraktion und -Lenkung für eine verbesserte Kettenbildung in Sprachmodellen | 语言模型中强化研究链理由的特征采掘和指南 2505.15634v3 |
Authors (6): Zihao Li, Xu Wang, Yuzhe Yang, Ziyu Yao, Haoyi Xiong, Mengnan Du
Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. Expanding CoT length, as seen in models such as DeepSeek-R1, significantly enhances this reasoning for complex problems, but requires costly and high-quality long CoT data and fine-tuning. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets. Our method first employs Sparse Autoencoders (SAEs) to extract interpretable features from vanilla CoT. These features are then used to steer the LLM’s internal states during generation. Recognizing that many LLMs do not have corresponding pre-trained SAEs, we further introduce a novel SAE-free steering algorithm, which directly computes steering directions from the residual activations of an LLM, obviating the need for an explicit SAE. Experimental results demonstrate that both our SAE-based and subsequent SAE-free steering algorithms significantly enhance the reasoning capabilities of LLMs.
Article 315
Title@2025-07-08 (2): Offline Learning and Forgetting for Reasoning with Large Language Models
Title: Offline Learning and Forgetting for Reasoning with Large Language Models | Offline-Lernen und Vergessen für die Vernunft mit großen Sprachmodellen | 利用大语言模式进行离线学习和遗忘 2504.11364v3 |
Authors (6): Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, Rasool Fakoor
Leveraging inference-time search in large language models has proven effective in further enhancing a trained model’s capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model’s search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown reasoning benchmarks show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180×. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
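For readers unfamiliar with the Game-of-24 task, the sketch below shows a plain exhaustive search over arithmetic expressions. It illustrates the kind of search that produces successful and failed reasoning paths; it is not the paper's search method or training pipeline, and the function names are assumptions.

```python
from fractions import Fraction
from itertools import combinations


def solve_24(numbers, target=24):
    """Return one expression over `numbers` that equals `target`, or None."""
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: a single value is left; check it against the target exactly.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None

    # Pick any two values, combine them with each operator, and recurse.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
            (a * b, f"({ea} * {eb})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None
```

Exact `Fraction` arithmetic matters here because some puzzles, such as 1, 5, 5, 5, are only solvable through a non-integer intermediate value.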
Article 316
Title@2025-07-08 (2): Beyond Retrieval: Ensembling Cross-Encoders and GPT Rerankers with LLMs for Biomedical QA
Title: Beyond Retrieval: Ensembling Cross-Encoders and GPT Rerankers with LLMs for Biomedical QA | Beyond Retrieval: Zusammenbauen von Cross-Encodern und GPT-Rerankern mit LLMs für biomedizinische QA | 超出检索范围:利用生物医学质量评估LMs将跨环境者和GPT再生者与生物医学质量评估LMs组合在一起 2507.05577v1 |
Authors (3): Shashank Verma, Fengyi Jiang, Xiangning Xue
Biomedical semantic question answering rooted in information retrieval can play a crucial role in keeping up to date with the vast, rapidly evolving, and ever-growing biomedical literature. A robust system can help researchers, healthcare professionals, and even lay users access relevant knowledge grounded in evidence. The BioASQ 2025 Task13b Challenge serves as an important benchmark, offering a competitive platform for advancement of this space. This paper presents the methodologies and results from our participation in this challenge, where we built a Retrieval-Augmented Generation (RAG) system that can answer biomedical questions by retrieving relevant PubMed documents and snippets to generate answers. For the retrieval task, we generated dense embeddings from biomedical articles for initial retrieval, and applied an ensemble of fine-tuned cross-encoders and large language models (LLMs) for re-ranking to identify the most relevant documents. Our solution achieved a MAP@10 of 0.1581, placing 10th on the leaderboard for the retrieval task. For answer generation, we employed few-shot prompting of instruction-tuned LLMs. Our system achieved a macro-F1 score of 0.95 for yes/no questions (rank 12), a Mean Reciprocal Rank (MRR) of 0.64 for factoid questions (rank 1), a mean-F1 score of 0.63 for list questions (rank 5), and a ROUGE-SU4 F1 score of 0.29 for ideal answers (rank 11).
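The factoid score above is a Mean Reciprocal Rank. The sketch below shows the textbook definition only: for each question, take the reciprocal of the rank of the first correct answer in the returned list (0 if none is correct) and average over questions. BioASQ's official evaluation has its own conventions, such as limits on list length and answer matching rules, which are not reproduced here; the function names are assumptions.

```python
def reciprocal_rank(ranked_answers, gold_answers):
    """1 / rank of the first answer found in gold_answers, or 0.0 if none matches."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer in gold_answers:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_answers, gold_answers) pairs."""
    if not runs:
        raise ValueError("no questions given")
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)
```

An MRR of 0.64 therefore means that, on average, the first correct answer sits between the first and second position of the returned list.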
Article 317
Title@2025-07-08 (2): Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
Title: Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index | Infini-gram mini: Genaue n-gram Suche auf der Internetskala mit FM-Index | Infini-gram 微型: 使用 FM- Index 的 Internet 比例尺精确的 n 克搜索 2506.12229v2 |
Authors (5): Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora – counting string appearances and retrieving the enclosing documents – yet the high storage overhead hinders their application on Internet-scale data. We present Infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18×) and memory use during both indexing (3.2× reduction) and querying (down to a negligible amount). We index 46TB of Internet text in 50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes). We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on Infini-gram mini indexes.
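The counting query mentioned above can be illustrated with a toy FM-index. The sketch below builds the Burrows-Wheeler transform from a naively sorted suffix array and counts pattern occurrences by backward search. It is a teaching example under simplifying assumptions: it stores uncompressed occurrence tables and uses quadratic-time construction, so it shows the query logic only, not the compression or scalability of Infini-gram mini. The class and attribute names are hypothetical.

```python
class FMIndex:
    """Toy FM-index supporting exact substring counting."""

    def __init__(self, text, sentinel="\0"):
        if sentinel in text:
            raise ValueError("text must not contain the sentinel character")
        text += sentinel  # unique, lexicographically smallest terminator

        # Naive suffix array: sort suffix start positions by suffix content.
        suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
        # BWT: the character preceding each sorted suffix (wraps to the sentinel).
        self.bwt = "".join(text[i - 1] for i in suffix_array)

        # first[c]: number of characters in the text strictly smaller than c.
        counts = {}
        for ch in text:
            counts[ch] = counts.get(ch, 0) + 1
        self.first = {}
        total = 0
        for ch in sorted(counts):
            self.first[ch] = total
            total += counts[ch]

        # occ[c][i]: number of occurrences of c in bwt[:i].
        self.occ = {ch: [0] * (len(self.bwt) + 1) for ch in counts}
        for i, b in enumerate(self.bwt):
            for ch, table in self.occ.items():
                table[i + 1] = table[i] + (1 if ch == b else 0)

    def count(self, pattern):
        """Number of occurrences of pattern, via backward search."""
        lo, hi = 0, len(self.bwt)  # half-open range of matching suffix-array rows
        for ch in reversed(pattern):
            if ch not in self.first:
                return 0
            lo = self.first[ch] + self.occ[ch][lo]
            hi = self.first[ch] + self.occ[ch][hi]
            if lo >= hi:
                return 0
        return hi - lo
```

Each query step touches only two table entries per pattern character, so counting cost depends on the pattern length and not on the corpus size, which is what makes this structure attractive for very large corpora.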
Article 318
Title@2025-07-08 (2): OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
Title: OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens | OLMoTrace: Tracing Language Model Outputs Zurück zu Billionen von Trainings-Tokens | OLMoTrace:追踪语言示范产出,返回数以万计的培训单位 2504.07096v2 |
Authors (31): Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, Jesse Dodge
We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
Article 319
Title@2025-07-08 (2): Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS
Title: Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS | Verbesserung der Testzeitskalierung großer Sprachmodelle mit hierarchischen Retrieval-Augmented MCTS | 强化高语言模型的测试时间比量,采用等级检索增强的MCTS 2507.05557v1 |
Authors (9): Alex ZH Dou, Zhongwei Wan, Dongfei Cui, Xin Wang, Jing Xiong, Haokun Lin, Chaofan Tao, Shen Yan, Mi Zhang
Test-time scaling has emerged as a promising paradigm in language modeling, leveraging additional computational resources at inference time to enhance model performance. In this work, we introduce R2-LLMs, a novel and versatile hierarchical retrieval-augmented reasoning framework designed to improve test-time scaling in large language models (LLMs) without requiring distillation from more advanced models to obtain chain-of-thought (CoT) training data. R2-LLMs enhances inference-time generalization by integrating dual-level retrieval-based in-context learning: (1) At the coarse level, our approach extracts abstract templates from complex reasoning problems and retrieves similar problem-answer pairs to facilitate high-level in-context learning; (2) At the fine level, during Monte Carlo Tree Search (MCTS), R2-LLMs efficiently retrieves analogous intermediate solution steps from reference mathematical problem datasets, refining step-wise reasoning with the aid of a process reward model (PRM) for scoring. R2-LLMs is a robust hierarchical reasoning-augmentation method that enhances in-context-level reasoning while seamlessly integrating with step-level tree search methods. Utilizing the PRM, it refines both candidate generation and decision-making for improved reasoning accuracy. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO datasets show substantial relative improvements of up to 16% over the baselines using LLaMA-3.1-8B, showcasing the effectiveness of our approach in complex reasoning tasks.
Article 320
Title@2025-07-07 (1): The Role of Deductive and Inductive Reasoning in Large Language Models
Title: The Role of Deductive and Inductive Reasoning in Large Language Models | Die Rolle der deduktiven und induktiven Vernunft in großen Sprachmodellen | 大语种模式中的减量和引引原因的作用 2410.02892v3 |
Authors (8): Chengkun Cai, Xu Zhao, Haoliang Liu, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang, Lei Li
Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning tasks, yet their reliance on static prompt structures and limited adaptability to complex scenarios remains a significant challenge. In this paper, we propose the Deductive and InDuctive (DID) method, a novel framework that enhances LLM reasoning by dynamically integrating both deductive and inductive reasoning approaches. Drawing from cognitive science principles, DID implements a dual-metric complexity evaluation system that combines Littlestone dimension and information entropy to precisely assess task difficulty and guide decomposition strategies. DID enables the model to progressively adapt its reasoning pathways based on problem complexity, mirroring human cognitive processes. We evaluate DID’s effectiveness across multiple benchmarks, including AIW and MR-GSM8K, as well as our custom Holiday Puzzle dataset for temporal reasoning. Our results demonstrate significant improvements in reasoning quality and solution accuracy, achieving 70.3% accuracy on AIW (compared to 62.2% for Tree of Thought) while maintaining lower computational costs. The success of DID in improving LLM performance while preserving computational efficiency suggests promising directions for developing more cognitively aligned and capable language models. Our work contributes a theoretically grounded, input-centric approach to enhancing LLM reasoning capabilities, offering an efficient alternative to traditional output-exploration methods.
Article 321
Title@2025-07-07 (1): Bayesian Optimization for Controlled Image Editing via LLMs
Title: Bayesian Optimization for Controlled Image Editing via LLMs | Bayesische Optimierung für kontrollierte Bildbearbeitung über LLMs | Bayesian 通过LLMs优化控制图像编辑 2502.18116v3 |
Authors (9): Chengkun Cai, Haoliang Liu, Xu Zhao, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, John Lee, Jenq-Neng Hwang, Lei Li
In the rapidly evolving field of image generation, achieving precise control over generated content and maintaining semantic consistency remain significant limitations, particularly concerning grounding techniques and the necessity for model fine-tuning. To address these challenges, we propose BayesGenie, an off-the-shelf approach that integrates Large Language Models (LLMs) with Bayesian Optimization to facilitate precise and user-friendly image editing. Our method enables users to modify images through natural language descriptions without manual area marking, while preserving the original image’s semantic integrity. Unlike existing techniques that require extensive pre-training or fine-tuning, our approach demonstrates remarkable adaptability across various LLMs through its model-agnostic design. BayesGenie employs an adapted Bayesian optimization strategy to automatically refine the inference process parameters, achieving high-precision image editing with minimal user intervention. Through extensive experiments across diverse scenarios, we demonstrate that our framework significantly outperforms existing methods in both editing accuracy and semantic preservation, as validated using different LLMs including Claude3 and GPT-4.
Article 322
Title@2025-07-07 (1): Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment
Title: Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment | Conversational Education at Scale: Multi-LLM Agent Workflow für verfahrenstechnisches Lernen und pädagogische Qualitätsbewertung | 规模层次的相互教育:程序学习和教学质量评估的多种LLM代理工作流 2507.05528v1 |
Authors (6): Jiahuan Pei, Fanghua Ye, Xin Sun, Wentao Deng, Koen Hindriks, Junxiao Wang
Large language models (LLMs) have advanced virtual educators and learners, bridging NLP with AI4Education. Existing work often lacks scalability and fails to leverage diverse, large-scale course content, with limited frameworks for assessing pedagogic quality. To this end, we propose WikiHowAgent, a multi-agent workflow leveraging LLMs to simulate interactive teaching-learning conversations. It integrates teacher and learner agents, an interaction manager, and an evaluator to facilitate procedural learning and assess pedagogic quality. We introduce a dataset of 114,296 teacher-learner conversations grounded in 14,287 tutorials across 17 domains and 727 topics. Our evaluation protocol combines computational and rubric-based metrics with human judgment alignment. Results demonstrate the workflow’s effectiveness in diverse setups, offering insights into LLM capabilities across domains. Our datasets and implementations are fully open-sourced.
Article 323
Title@2025-07-07 (1): Early-Exit and Instant Confidence Translation Quality Estimation
Title: Early-Exit and Instant Confidence Translation Quality Estimation | Schätzung der Qualität der Früh-Ausgangs- und Instant-Konfidenz-Übersetzung | 提前出国和即时信心翻译质量估算 2502.14429v2 |
Authors (6): Vilém Zouhar, Maike Züfle, Beni Egressy, Julius Cheng, Mrinmaya Sachan, Jan Niehues
Quality estimation is omnipresent in machine translation, for both evaluation and generation. Unfortunately, quality estimation models are often opaque and computationally expensive, making them impractical for large-scale pipelines. In this work, we tackle two connected challenges: (1) reducing the cost of quality estimation at scale, and (2) developing an inexpensive uncertainty estimation method for quality estimation. To address the latter, we introduce Instant Confidence COMET, an uncertainty-aware quality estimation model that matches the performance of previous approaches at a fraction of their cost. We extend this to Early-Exit COMET, a quality estimation model that can compute quality scores and associated confidences already at early model layers, allowing us to exit computations early and reduce evaluation costs. We also apply our model to machine translation reranking. We combine Early-Exit COMET with an upper confidence bound bandit algorithm to find the best candidate from a large pool without having to run the full evaluation model on all candidates. In both cases (evaluation and reranking) our methods reduce the required compute by 50% with very little degradation in performance. Finally, we show how Instant Confidence COMET can be used to decide which translations a human evaluator should score rather than relying on the COMET score.
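The reranking idea mentioned above, spending evaluation budget only on promising candidates, can be sketched with a generic UCB1 bandit. This is an illustration under stated assumptions rather than the authors' algorithm: `pull` stands in for any cheap, noisy quality estimate of a candidate (hypothetical here), and the exploration constant `c` is a conventional default.

```python
import math


def ucb_select(pull, n_arms, budget, c=2.0):
    """Pick the best of n_arms candidates with a UCB1 bandit.

    pull(arm) returns one noisy quality estimate for that candidate
    (a hypothetical stand-in for a cheap partial evaluation).
    budget is the total number of pulls and must be >= n_arms.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, budget + 1):
        if t <= n_arms:
            arm = t - 1  # evaluate every candidate once to initialise
        else:
            # Empirical mean plus an exploration bonus that shrinks
            # as a candidate accumulates evaluations.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(c * math.log(t) / counts[a]),
            )
        sums[arm] += pull(arm)
        counts[arm] += 1
    best = max(range(n_arms), key=lambda a: sums[a] / counts[a])
    return best, counts
```

The returned `counts` show where the budget went: clearly weak candidates receive only a handful of evaluations, which is the source of the compute saving.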
Article 324
Title@2025-07-07 (1): Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality
Title: Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality | Feinkörnige Vision-Sprachenmodellierung für multimodale Trainingsassistenten in Augmented Reality | 扩大现实中多模式培训助理的精美愿景-语言建模 2507.05515v1 |
Authors (9): Haochen Huang, Jiahuan Pei, Mohammad Aliannejadi, Xin Sun, Moonisa Ahsan, Pablo Cesar, Chuang Yu, Zhaochun Ren, Junxiao Wang
Vision-language models (VLMs) are essential for enabling AI-powered smart assistants to interpret and reason in multimodal environments. However, their application in augmented reality (AR) training remains largely unexplored. In this work, we introduce a comprehensive dataset tailored for AR training, featuring systematized vision-language tasks, and evaluate nine state-of-the-art VLMs on it. Our results reveal that even advanced models, including GPT-4o, struggle with fine-grained assembly tasks, achieving a maximum F1 score of just 40.54% on state detection. These findings highlight the demand for enhanced datasets, benchmarks, and further research to improve fine-grained vision-language alignment. Beyond technical contributions, our work has broader social implications, particularly in empowering blind and visually impaired users with equitable access to AI-driven learning opportunities. We provide all related resources, including the dataset, source code, and evaluation results, to support the research community.
Article 325
Title@2025-07-07 (1): MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation
Title: MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation | MEIT: Multimodale Elektrokardiogramm-Anleitung Tuning auf großen Sprachmodellen für die Berichtserstellung | MEIT:多式电动心电图指示,关于报告生成大语言模式的图示 2403.04945v4 |
Authors (9): Zhongwei Wan, Che Liu, Xin Wang, Chaofan Tao, Hui Shen, Jing Xiong, Rossella Arcucci, Huaxiu Yao, Mi Zhang
The electrocardiogram (ECG) is the primary non-invasive diagnostic tool for monitoring cardiac conditions and is crucial in assisting clinicians. Recent studies have concentrated on classifying cardiac conditions using ECG data but have overlooked ECG report generation, which is time-consuming and requires clinical expertise. To automate ECG report generation and ensure its versatility, we propose the Multimodal ECG Instruction Tuning (MEIT) framework, the first attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLM backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report, and we conduct extensive experiments to benchmark MEIT with nine open-source LLMs using more than 800,000 ECG reports. MEIT’s results underscore the superior performance of instruction-tuned LLMs, showcasing their proficiency in quality report generation, zero-shot capabilities, resilience to signal perturbation, and alignment with human expert evaluation. These findings emphasize the efficacy of MEIT and its potential for real-world clinical application.
Article 326
Title@2025-07-07 (1): Towards Exception Safety Code Generation with Intermediate Representation Agents Framework
Title: Towards Exception Safety Code Generation with Intermediate Representation Agents Framework | Auf dem Weg zur Generierung von Ausnahme-Sicherheitscodes mit dem Rahmen für Mittlere Vertretungen | 建立具有中间代表代理机构框架的例外安全法规生成框架 2410.06949v3 |
Authors (4): Xuanming Zhang, Yuxuan Chen, Yuan Yuan, Minlie Huang
Large Language Models (LLMs) often struggle with robust exception handling in generated code, leading to fragile programs that are prone to runtime errors. We propose Seeker, a novel multi-agent framework that enforces exception safety in LLM-generated code through an Intermediate Representation (IR) approach. Seeker decomposes exception handling into five specialized agents (Scanner, Detector, Predator, Ranker, and Handler) that collaboratively analyze code, detect fragile segments, retrieve best-practice exception strategies, and inject robust handling code. We also introduce Common Exception Enumeration (CEE), a comprehensive knowledge base derived from official documentation, technical practices, and real-world code, to standardize exception handling strategies. Seeker also incorporates a Deep Retrieval-Augmented Generation (Deep RAG) algorithm to efficiently navigate the exception inheritance hierarchy, cutting down search overhead by 93% while improving accuracy in identifying relevant exceptions. We evaluate Seeker on 15 open-source Java projects and multiple benchmarks. Seeker outperforms state-of-the-art baselines, improving exception handling precision by up to 37% and overall code robustness by 38% as measured by expert code review. It significantly closes the gap between LLMs and human developers in exception management, achieving a 28% success rate on real-world issue fixes (SWE-bench) versus 19% for prior methods. Our framework preserves functional correctness of code while proactively handling errors, demonstrating a practical, generalizable solution for safer code generation. In this paper, we discuss the novelty of using intermediate representation and multi-agent collaboration for exception handling, and outline how Seeker can be extended to other programming languages and complex software engineering tasks, aligning LLM-generated code with industrial standards.
Article 327
Title@2025-07-07 (1): ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding
Title: ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding | ViGiL3D: Ein sprachlich unterschiedlicher Datensatz für 3D Visual Grounding | ViGiL3D:3D视觉定位语言多样性数据集 2501.01366v2 |
Authors (3): Austin T. Wang, ZeMing Gong, Angel X. Chang
3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using natural language descriptions. While recent works have focused on LLM-based scaling of 3DVG datasets, these datasets do not capture the full range of potential prompts which could be specified in the English language. To ensure that we are scaling up and testing against a useful and representative set of prompts, we propose a framework for linguistically analyzing 3DVG prompts and introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns. We evaluate existing open-vocabulary 3DVG methods to demonstrate that these methods are not yet proficient in understanding and identifying the targets of more challenging, out-of-distribution prompts, toward real-world applications.
Article 328
Title@2025-07-07 (1): On the Semantics of Large Language Models
Title: On the Semantics of Large Language Models | Zur Semantik großer Sprachmodelle | 关于大语言模型的语义学 2507.05448v1 |
Authors (1): Martin Schuele
Large Language Models (LLMs) such as ChatGPT demonstrated the potential to replicate human language abilities through technology, ranging from text generation to engaging in conversations. However, it remains controversial to what extent these systems truly understand language. We examine this issue by narrowing the question down to the semantics of LLMs at the word and sentence level. By examining the inner workings of LLMs and their generated representation of language and by drawing on classical semantic theories by Frege and Russell, we get a more nuanced picture of the potential semantic capabilities of LLMs.
Article 329
Title@2025-07-07 (1): PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs
Title: PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs | PhoniTale: Phonologisch begründete mnemonische Generation für typologisch entfernte Sprachpaare | Phooni-Tale: 音频分布式语言对音频隔音的声波固定成像元体生成 2507.05444v1 |
Authors (7): Sana Kang, Myeongseok Gwon, Su Young Kwon, Jaewook Lee, Andrew Lan, Bhiksha Raj, Rita Singh
Vocabulary acquisition poses a significant challenge for second-language (L2) learners, especially when learning typologically distant languages such as English and Korean, where phonological and structural mismatches complicate vocabulary learning. Recently, large language models (LLMs) have been used to generate keyword mnemonics by leveraging similar keywords from a learner’s first language (L1) to aid in acquiring L2 vocabulary. However, most of this research has focused on native English speakers learning other languages, rather than the reverse. In this paper, we present PhoniTale, a novel cross-lingual mnemonic generation system that retrieves an L1 keyword sequence based on phonological similarity and uses LLMs to generate mnemonics. We evaluate PhoniTale using both automated metrics and human evaluations, comparing its output to mnemonics created by humans and by previous automated approaches. To assess practical effectiveness, we also conduct a short-term recall test measuring mnemonic helpfulness. Our findings show that PhoniTale performs comparably to human-authored mnemonics. We also highlight key areas for future improvement in mnemonic quality and methodology.
Article 330
Title@2025-07-07 (1): Gendered Divides in Online Discussions about Reproductive Rights
Title: Gendered Divides in Online Discussions about Reproductive Rights | Geschlechtsspezifische Unterschiede in Online-Diskussionen über reproduktive Rechte | 关于生殖权利的在线讨论中的两性差距 2507.05443v1 |
Authors (3): Ashwin Rao, Sze Yuh Nina Wang, Kristina Lerman
The U.S. Supreme Court’s 2022 ruling in Dobbs v. Jackson Women’s Health Organization marked a turning point in the national debate over reproductive rights. While the ideological divide over abortion is well documented, less is known about how gender and local sociopolitical contexts interact to shape public discourse. Drawing on nearly 10 million abortion-related posts on X (formerly Twitter) from users with inferred gender, ideology and location, we show that gender significantly moderates abortion attitudes and emotional expression, particularly in conservative regions, and independently of ideology. This creates a gender gap in abortion attitudes that grows more pronounced in conservative regions. The leak of the Dobbs draft opinion further intensified online engagement, disproportionately mobilizing pro-abortion women in areas where access was under threat. These findings reveal that abortion discourse is not only ideologically polarized but also deeply structured by gender and place, highlighting the central role of identity in shaping political expression during moments of institutional disruption.
Article 331
Title@2025-07-07 (1): DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning
Title: DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning | DeepRetro: Retrosynthetische Pathway Discovery mit iterativem LLM Reasoning | 使用迭代LM 理由解释的回溯合成途径发现 2507.07060v1 |
Authors (9): Shreyas Vinaya Sathyanarayana, Rahil Shah, Sharanabasava D. Hiremath, Rishikesh Panda, Rahul Jana, Riya Singh, Rida Irfan, Ashwin Murali, Bharath Ramsundar
Retrosynthesis, the identification of precursor molecules for a target compound, is pivotal for synthesizing complex molecules, but faces challenges in discovering novel pathways beyond predefined templates. Recent large language model (LLM) approaches to retrosynthesis have shown promise, but how to harness LLM reasoning capabilities for effective multi-step planning remains an open question. To address this challenge, we introduce DeepRetro, an open-source, iterative, hybrid LLM-based retrosynthetic framework. Our approach integrates the strengths of conventional template-based/Monte Carlo tree search tools with the generative power of LLMs in a step-wise, feedback-driven loop. Initially, synthesis planning is attempted with a template-based engine. If this fails, the LLM proposes single-step retrosynthetic disconnections. Crucially, these suggestions undergo rigorous validity, stability, and hallucination checks before the resulting precursors are recursively fed back into the pipeline for further evaluation. This iterative refinement allows for dynamic pathway exploration and correction. We demonstrate the potential of this pipeline through benchmark evaluations and case studies, showcasing its ability to identify viable and potentially novel retrosynthetic routes. In particular, we develop an interactive graphical user interface that allows expert human chemists to provide human-in-the-loop feedback to the reasoning algorithm. This approach successfully generates novel pathways for complex natural product compounds, demonstrating the potential for iterative LLM reasoning to advance the state of the art in complex chemical syntheses.
Article 332
Title@2025-07-07 (1): LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
Title: LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users | LLM Hypnose: Nutzung des Benutzerfeedbacks für unautorisierte Wissensinjektion für alle Benutzer | LLM Hypnisis:利用用户反馈,为所有用户提供未经授权知识注射 2507.02850v2 |
Authors (4): Almog Hilel, Idan Shenfeld, Jacob Andreas, Leshem Choshen
We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a “poisoned” or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in subsequent preference tuning, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our findings identify both a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior) and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).
Article 333
Title@2025-07-07 (1): “Lost-in-the-Later”: Framework for Quantifying Contextual Grounding in Large Language Models
Title: “Lost-in-the-Later”: Framework for Quantifying Contextual Grounding in Large Language Models | “Lost-in-the-Later”: Rahmen für die Quantifizierung kontextueller Erdung in großen Sprachmodellen | “Lost-in-Later”:“Lost-in-Later”:大语言模型中量化背景依据框架 2507.05424v1 |
Authors (4): Yufei Tao, Adam Hiatt, Rahul Seetharaman, Ameeta Agrawal
Large language models are capable of leveraging both contextual and parametric knowledge but how they prioritize and integrate these sources remains underexplored. We introduce CoPE, a novel evaluation framework that systematically measures contextual knowledge (CK) and parametric knowledge (PK) across models and languages. Using our MultiWikiAtomic dataset in English, Spanish, and Danish, we analyze how large language models (LLMs) integrate context, prioritize information, and incorporate PK in open-ended question answering. Our analysis uncovers a phenomenon we call lost-in-the-later, where LLMs tend to overlook or deprioritize information that appears later in a given context, revealing a strong positional bias that affects contextual grounding. We further find that reasoning models, as well as non-reasoning models prompted with chain-of-thought (CoT), use context even less than non-reasoning models without CoT and fail to mitigate the lost-in-the-later effect. CoT prompting, in particular, results in lower recall and shorter responses, leading to degraded contextual grounding. Based on these insights, we design prompt-based methods to effectively leverage input context. A case study applying CoPE to summarization demonstrates that CK-informed prompting improves factual grounding and reduces hallucination.
Article 334
Title@2025-07-07 (1): Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning
Title: Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning | Global lernen, lokal sprechen: Die Lücken in der Mehrsprachigkeit überwinden | 《全球学习,本地语言:弥合多种语文理由方面的差距》 2507.05418v1 |
Authors (8): Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, Paul Pu Liang
Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. Current multilingual benchmarks focus only on final answers, overlooking whether models actually reason in the target language. To address this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. We further propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning with a language-consistency reward to align reasoning with the input language. Finally, we develop an automatic evaluation protocol using LLM-as-a-judge to assess answer correctness and the quality and language consistency of reasoning traces, enabling nuanced and scalable analysis beyond surface-level metrics. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity, demonstrating that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. https://jd730.github.io/projects/GeoFact-X_BRIDGE
Article 335
Title@2025-07-07 (1): What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning
Title: What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning | Was würdest du fragen, wenn du zuerst $a^2+b^2=c^2$ sahst? LLM auf Kuriosität-getriebene Fragestellung bewerten | 当你第一次看到$a^2+b^2=c^2$时你会问什么? 2409.17172v2 |
Authors (2): Shashidhar Reddy Javaji, Zining Zhu
Large language models (LLMs) can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown. We propose a novel evaluation framework that evaluates this capability. This framework prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person facing the statement for the first time. We score the qualities of the generated questions, thereby evaluating the knowledge acquisition potential of the LLM. We apply controlled ablation studies to validate our scoring procedures. Additionally, we created a synthetic dataset consisting of 1101 statements in physics, chemistry, and maths with distinct levels of difficulty, 300 general knowledge statements, and 567 incorrect statements. Human evaluations were conducted to validate our model assessments, achieving an approximate weighted Cohen’s kappa of 0.7 on all three metrics considered. We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model’s knowledge acquisition potential. The proposed framework quantifies a critical model capability that was commonly overlooked and opens up research opportunities for developing more knowledgeable AI systems.
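For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of weighted Cohen's kappa for two raters on an ordinal scale. The abstract does not say which weighting scheme was used, so the linear and quadratic options below are assumptions, as is the function name.

```python
def weighted_kappa(rater_a, rater_b, n_categories, weight="linear"):
    """Weighted Cohen's kappa for integer labels in 0..n_categories-1.

    weight="linear" penalises a disagreement by |i - j|, "quadratic"
    by (i - j)^2; both are normalised by the largest possible gap.
    """
    n = len(rater_a)
    k = n_categories
    # Observed joint distribution of the two raters' labels.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagreement = 0.0  # weighted observed disagreement
    expected = 0.0      # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            gap = abs(i - j) / (k - 1)
            w = gap if weight == "linear" else gap ** 2
            disagreement += w * observed[i][j]
            expected += w * row[i] * col[j]
    if expected == 0.0:
        return 1.0  # degenerate case: both raters used one identical category
    return 1.0 - disagreement / expected
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so a kappa near 0.7 is usually read as substantial agreement between the human and model scores.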
Article 336
Title@2025-07-07 (1): Tractable Transformers for Flexible Conditional Generation
Title: Tractable Transformers for Flexible Conditional Generation | Tractable Transformer für flexible Bedingte Generation | 灵活条件一代的可变变变换器 2502.07616v2 |
Authors (6): Anji Liu, Xuejie Liu, Dayuan Zhao, Mathias Niepert, Yitao Liang, Guy Van den Broeck
Non-autoregressive (NAR) generative models are valuable because they can handle diverse conditional generation tasks in a more principled way than their autoregressive (AR) counterparts, which are constrained by sequential dependency requirements. Recent advancements in NAR models, such as diffusion language models, have demonstrated superior performance in unconditional generation compared to AR models (e.g., GPTs) of similar sizes. However, such improvements do not always lead to improved conditional generation performance. We show that a key reason for this gap is the difficulty in generalizing to conditional probability queries (i.e., the set of unknown variables) unseen during training. As a result, strong unconditional generation performance does not guarantee high-quality conditional generation. This paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that is more robust to different conditional generation tasks. Unlike existing models that rely solely on global contextual features derived from full inputs, Tracformers incorporate a sparse Transformer encoder to capture both local and global contextual information. This information is routed through a decoder for conditional generation. Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines.
Article 337
Title@2025-07-07 (1): Evaluation of OpenAI o1: Opportunities and Challenges of AGI
Title: Evaluation of OpenAI o1: Opportunities and Challenges of AGI | Bewertung von OpenAI o1: Chancen und Herausforderungen der AGI | 评价开放AI OO1:亚洲地理信息协会的机遇和挑战 2409.18486v2 |
Authors (75): Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, Chao Cao, Hanqi Jiang, Hanxu Chen, Yiwei Li, Junhao Chen, Huawen Hu, Yiheng Liu, Huaqin Zhao, Shaochen Xu, Haixing Dai, Lin Zhao, Ruidong Zhang, Wei Zhao, Zhenyuan Yang, Jingyuan Chen, Peilong Wang, Wei Ruan, Hui Wang, Huan Zhao, Jing Zhang, Yiming Ren, Shihuan Qin, Tong Chen, Jiaxi Li, Arif Hassan Zidan, Afrar Jahin, Minheng Chen, Sichen Xia, Jason Holmes, Yan Zhuang, Jiaqi Wang, Bochen Xu, Weiran Xia, Jichao Yu, Kaibo Tang, Yaxuan Yang, Bolun Sun, Tao Yang, Guoyu Lu, Xianqiao Wang, Lilong Chai, He Li, Jin Lu, Xin Zhang, Bao Ge, Xintao Hu, Lian Zhang, Hua Zhou, Lu Zhang, Shu Zhang, Zhen Xiang, Yudan Ren, Jun Liu, Xi Jiang, Yu Bao, Wei Zhang, Xiang Li, Gang Li, Wei Liu, Dinggang Shen, Andrea Sikora, Xiaoming Zhai, Dajiang Zhu, Tuo Zhang, Tianming Liu
This comprehensive study evaluates the performance of OpenAI’s o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include:
- 83.3% success rate in solving complex competitive programming problems, surpassing many human experts.
- Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models.
- 100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions.
- Advanced natural language inference capabilities across general and specialized domains like medicine.
- Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis.
- Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields.
- Strong capabilities in quantitative investing: the model has comprehensive financial knowledge and statistical modeling skills.
- Effective performance in social media analysis, including sentiment analysis and emotion recognition.
The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.
Article 338
Title@2025-07-07 (1): Embedding-Based Approaches to Hyperpartisan News Detection
Title: Embedding-Based Approaches to Hyperpartisan News Detection | Einbetten-basierte Ansätze zu Hyperparteien-Nachrichten-Erkennung | 以嵌入式方法探测超党派新闻 2501.01370v3 |
Authors (1): Karthik Mohan
In this report, I describe systems whose objective is to determine whether a given news article can be considered hyperpartisan. Hyperpartisan news takes an extremely polarized political standpoint with the intention of creating political divide among the public. Several approaches were used, including n-grams, sentiment analysis, and sentence and document representations from pre-trained ELMo models. The best system uses LLMs for embedding generation and achieves an accuracy of around 92%, compared with around 83% for the previous best system, which used pre-trained ELMo with a bidirectional LSTM; both figures are from 10-fold cross-validation.
Article 339
Title@2025-07-07 (1): One fish, two fish, but not the whole sea: Alignment reduces language models’ conceptual diversity
Title: One fish, two fish, but not the whole sea: Alignment reduces language models’ conceptual diversity | Ein Fisch, zwei Fische, aber nicht das ganze Meer: Ausrichtung reduziert die konzeptionelle Vielfalt der Sprachmodelle | 一、二、一、二鱼,但不是整个海洋:对齐减少了语言模式的概念多样性 2411.04427v3 |
Authors (3): Sonia K. Murthy, Tomer Ullman, Jennifer Hu
Researchers in social science and psychology have recently proposed using large language models (LLMs) as replacements for humans in behavioral research. In addition to arguments about whether LLMs accurately capture population-level patterns, this has raised questions about whether LLMs capture human-like conceptual diversity. Separately, it is debated whether post-training alignment (RLHF or RLAIF) affects models’ internal diversity. Inspired by human studies, we use a new way of measuring the conceptual diversity of synthetically-generated LLM “populations” by relating the internal variability of simulated individuals to the population-level variability. We use this approach to evaluate non-aligned and aligned LLMs on two domains with rich human behavioral data. While no model reaches human-like diversity, aligned models generally display less diversity than their instruction fine-tuned counterparts. Our findings highlight potential trade-offs between increasing models’ value alignment and decreasing the diversity of their conceptual representations.
Article 340
Title@2025-07-07 (1): Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences
Title: Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences | Kontrolle dessen, was Sie teilen: Bewertung von Sprachmodellen Einhaltung von Datenschutzeinstellungen | 控制您共享的共享:评估语言模式 遵守隐私优惠 2507.05391v1 |
Authors (3): Guillem Ramírez, Alexandra Birch, Ivan Titov
Large language models (LLMs) are primarily accessed via commercial APIs, but this often requires users to expose their data to service providers. In this paper, we explore how users can stay in control of their data by using privacy profiles: simple natural language instructions that say what should and should not be revealed. We build a framework where a local model uses these instructions to rewrite queries, only hiding details deemed sensitive by the user, before sending them to an external model, thus balancing privacy with performance. To support this research, we introduce PEEP, a multilingual dataset of real user queries annotated to mark private content and paired with synthetic privacy profiles. Our experiments with lightweight LLMs show they can follow these instructions to some extent, but also face consistent challenges, highlighting the need for models that better understand and comply with user-defined privacy preferences.
Article 341
Title@2025-07-07 (1): The Generalization Ridge: Information Flow in Natural Language Generation
Title: The Generalization Ridge: Information Flow in Natural Language Generation | The Generalization Ridge: Informationsfluss in der natürlichen Sprachgenerierung | 通用海脊:自然语言生成的信息流动 2507.05387v1 |
Authors (3): Ruidi Chang, Chunyuan Deng, Hanjie Chen
Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG) tasks, yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. To address this gap, we propose InfoRidge, an information-theoretic framework, to characterize how predictive information (the mutual information between hidden representations and target outputs) varies across depth. Estimating this quantity enables us to trace the flow of task-relevant information throughout the model during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in upper-middle layers, forming a generalization ridge, before declining in final layers, reflecting a transition between generalization and memorization. To further investigate this phenomenon, we introduce residual scaling coefficients (trainable scalar parameters applied to each residual block), which serve as functional probes for assessing the relative importance of individual transformer layers. These coefficients reveal that, under distribution shift, models downweight final layers and increasingly rely on ridge layers, highlighting their role in generalization. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.
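As a rough illustration of the residual scaling coefficients mentioned in this abstract, the sketch below applies one scalar per residual block in a toy forward pass. The `block` function, the hand-picked weights, and the names `forward` and `alphas` are assumptions made for illustration, not the paper's implementation; in the paper the scalars are trainable, whereas here they are set by hand.

```python
import math

def block(h, w):
    # Stand-in for a transformer sub-layer: tanh of a matrix-vector product.
    return [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h))) for row in w]

def forward(x, weights, alphas):
    # h_{l+1} = h_l + alpha_l * f_l(h_l), with one scalar alpha_l per residual block.
    h = list(x)
    for w, alpha in zip(weights, alphas):
        update = block(h, w)
        h = [h_i + alpha * u_i for h_i, u_i in zip(h, update)]
    return h

# Toy 2-dimensional model with three blocks (hypothetical weights).
weights = [
    [[0.5, -0.2], [0.1, 0.3]],
    [[-0.4, 0.6], [0.2, 0.1]],
    [[0.3, 0.3], [-0.5, 0.2]],
]
x = [1.0, -1.0]

full = forward(x, weights, alphas=[1.0, 1.0, 1.0])
# A coefficient of 0 removes that block's contribution entirely.
without_last = forward(x, weights, alphas=[1.0, 1.0, 0.0])
first_two_only = forward(x, weights[:2], alphas=[1.0, 1.0])
```

Reading the learned value of each coefficient after training is what would make it usable as a probe: a coefficient driven toward zero suggests the model relies little on that block.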
Article 342
Title@2025-07-07 (1): Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training
Title: Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training | Verstärkte Feinsteuerung mildert natürlich das Vergessen im kontinuierlichen Post-Training | 长期培训后忘却的自然缓解措施 2507.05386v1 |
Authors (13): Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu
Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model’s general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis shows that explicit mechanisms, such as KL penalty and chain-of-thought reasoning, are not the primary factors. Instead, we find that the implicit regularization inherent to RFT is a key factor in mitigating forgetting. Finally, we propose a rollout-based instance filtering algorithm to improve the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
Article 343
Title@2025-07-07 (1): On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study
Title: On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study | Über die Bias von Next-Token Predictors hin zu systematisch ineffizienter Begründung: Eine Kurzeste-Path-Fallstudie | 关于对系统、低效率理性的 “ 下肯预测家 “ 的偏见:最短篇案例研究 2507.05362v1 |
Authors (4): Riccardo Alberghi, Elizaveta Demyanenko, Luca Biggio, Luca Saglietti
Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone: injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model’s confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
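For readers unfamiliar with the task, the sketch below shows what a bottom-up dynamic programming solution to a shortest-path problem in a layered graph can look like. The cost-matrix encoding and the function name `shortest_path_layered` are assumptions for illustration; the abstract does not specify the paper's graph format or trace layout.

```python
def shortest_path_layered(layers_costs):
    """Bottom-up DP over a layered graph.

    layers_costs[k][i][j] is the edge cost from node i in layer k
    to node j in layer k + 1 (a hypothetical encoding for this sketch).
    Returns (total_cost, path), where path lists one node index per layer.
    """
    n_last = len(layers_costs[-1][0])
    # best[j] = cheapest cost from node j in the current layer to the final layer.
    best = [0.0] * n_last
    choice = []  # after reversal, choice[k][i] = best successor of node i in layer k
    for costs in reversed(layers_costs):
        new_best, step = [], []
        for row in costs:
            totals = [c + b for c, b in zip(row, best)]
            j = min(range(len(totals)), key=totals.__getitem__)
            new_best.append(totals[j])
            step.append(j)
        best = new_best
        choice.append(step)
    choice.reverse()
    # Pick the cheapest start node, then follow the stored choices forward.
    start = min(range(len(best)), key=best.__getitem__)
    path = [start]
    for step in choice:
        path.append(step[path[-1]])
    return best[start], path

# Three layers with two nodes each, so two cost matrices.
costs = [
    [[1, 4], [2, 1]],  # layer 0 -> layer 1
    [[5, 1], [1, 3]],  # layer 1 -> layer 2
]
total, path = shortest_path_layered(costs)
```

Each node's value is computed exactly once from the layer after it, which is why a trace of this procedure is short compared with one that explores and backtracks.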
Article 344
Title@2025-07-07 (1): LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks
Title: LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks | LoRA-Augmented Generation (LAG) für wissensintensive Sprachaufgaben | LoRA - 知识密集型语言任务支持代(LAG) 2507.05346v1 |
Authors (2): William Fleshman, Benjamin Van Durme
The proliferation of fine-tuned language model experts for specific tasks and domains signals the need for efficient selection and combination methods. We propose LoRA-Augmented Generation (LAG) for leveraging large libraries of knowledge and task-specific LoRA adapters. LAG requires no additional training or access to data, and efficiently filters, retrieves, and applies experts on a per-token and layer basis. We evaluate LAG on various knowledge-intensive tasks, achieving superior performance over existing data-free methods. We explore scenarios where additional data is available, demonstrating LAG’s compatibility with alternative solutions such as retrieval-augmented generation (RAG).
Article 345
Title@2025-07-07 (1): Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
Title: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions | Speicher in LLM-Agenten über inkrementelle Multi-Turn-Interaktionen bewerten | 通过递增多发互动评估LLM代理物的记忆 2507.05257v1 |
Authors (3): Yuanzhe Hu, Yu Wang, Julian McAuley
Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory (encompassing how agents memorize, update, and retrieve long-term information), is under-evaluated due to the lack of benchmarks. We refer to agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Furthermore, no existing benchmarks cover all four competencies. Therefore, we introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark combines reformulated existing datasets with newly constructed ones, covering the above four memory competencies, providing a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
Article 346
Title@2025-07-07 (1): Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Title: Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning | Open Vision Reasoner: Übertragen sprachliches Kognitives Verhalten für visuelles Reasoning | 开放愿景理由:转移视觉理由的语言认知行为 2507.05255v1 |
Authors (18): Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.
Article 347
Title@2025-07-07 (1): Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?
Title: Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? | Den Richtern zu urteilen: Können große Visions-Sprachmodelle Chartverständnis und Vernunft angemessen bewerten? | 判断法官:大视野语言模型能否公正地评估图表的解读和理由? 2505.08468v2 |
Authors (10): Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang
Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation processes, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria like factual correctness, informativeness, and relevancy. Additionally, we analyze LVLM judges based on format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (<10B parameters) suitable for both research and commercial use, following a standardized evaluation protocol and rubric to measure the LVLM judge’s accuracy. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below ~10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist.
Article 348
Title@2025-07-07 (1): Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models
Title: Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models | Response Attack: Exploiting Contextual Priming to Jailbreak Große Sprachmodelle | 回应攻击:利用环境原始化作为破狱大型语言模式 2507.05248v1 |
Authors (6): Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao
Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in the dialogue can steer the model’s subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. The paraphrased query and this response are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
Article 349
Title@2025-07-07 (1): When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
Title: When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors | Wenn Gedankenkette notwendig ist, kämpfen Sprachmodelle um Evade Monitore | 当思维链是必要的时候,语言模型与Evade Monitor争吵 2507.05246v1 |
Authors (8): Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah
While chain-of-thought (CoT) monitoring is an appealing AI safety defense, recent work on “unfaithfulness” has cast doubt on its reliability. These findings highlight an important failure mode, particularly when CoT acts as a post-hoc rationalization in applications like auditing for bias. However, for the distinct problem of runtime monitoring to prevent severe harm, we argue the key property is not faithfulness but monitorability. To this end, we introduce a conceptual framework distinguishing CoT-as-rationalization from CoT-as-computation. We expect that certain classes of severe harm will require complex, multi-step reasoning that necessitates CoT-as-computation. Replicating the experimental setups of prior work, we increase the difficulty of the bad behavior to enforce this necessity condition; this forces the model to expose its reasoning, making it monitorable. We then present methodology guidelines to stress-test CoT monitoring against deliberate evasion. Applying these guidelines, we find that models can learn to obscure their intentions, but only when given significant help, such as detailed human-written strategies or iterative optimization against the monitor. We conclude that, while not infallible, CoT monitoring offers a substantial layer of defense that requires active protection and continued stress-testing.
Article 350
Title@2025-07-07 (1): MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents
Title: MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents | MindFlow: Revolutionierung des E-Commerce-Kundensupports mit multimodalen LLM-Agenten | MindFlow:使电子商务客户支持与多式联运LLM代理商实现革命化 2507.05330v1 |
Authors (7): Ming Gong, Xucheng Huang, Chenghan Yang, Xianhan Peng, Haoxin Wang, Yang Liu, Ling Jiang
Recent advances in large language models (LLMs) have enabled new applications in e-commerce customer service. However, their capabilities remain constrained in complex, multimodal scenarios. We present MindFlow, the first open-source multimodal LLM agent tailored for e-commerce. Built on the CoALA framework, it integrates memory, decision-making, and action modules, and adopts a modular “MLLM-as-Tool” strategy for effective visual-textual reasoning. Evaluated via online A/B testing and simulation-based ablation, MindFlow demonstrates substantial gains in handling complex queries, improving user satisfaction, and reducing operational costs, with a 93.53% relative improvement observed in real-world deployments.
Article 351
Title@2025-07-07 (1): Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
Title: Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models | Late Chunking: Kontextuelle Chunk-Embeddings mit Langkontext-Embedding-Modellen | 较晚的板块: 使用长文嵌入模型的背景块嵌入式嵌入式 2409.04701v3 |
Authors (5): Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, Han Xiao
Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called late chunking, which leverages long-context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling, hence the term “late” in its name. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.
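The pooling step of late chunking can be sketched in a few lines. The example below assumes the token embeddings have already been produced by running a long-context model over the whole document; the hard-coded vectors, the span format, and the function names are illustrative assumptions, not the authors' code.

```python
def mean_pool(vectors):
    # Average a list of equal-length vectors component-wise.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def late_chunking(token_embeddings, chunk_spans):
    """Pool contextualized token embeddings per chunk.

    token_embeddings: one vector per token, computed over the WHOLE
        document so each token has already attended to its full context.
    chunk_spans: (start, end) token index pairs, end exclusive.
    """
    return [mean_pool(token_embeddings[s:e]) for s, e in chunk_spans]

# Stand-in for model output: 6 tokens with 2-dimensional embeddings.
token_embeddings = [
    [1.0, 0.0], [3.0, 2.0], [2.0, 4.0],
    [0.0, 0.0], [4.0, 2.0], [2.0, 1.0],
]
chunk_spans = [(0, 3), (3, 6)]
chunk_embeddings = late_chunking(token_embeddings, chunk_spans)
```

The contrast with the conventional approach is the order of operations: there, each chunk is passed through the model separately, so its tokens never see the neighboring chunks.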
Article 352
Title@2025-07-07 (1): OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
Title: OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation | OPeRA: Ein Datensatz von Beobachtung, Persona, Ratationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation | OPERA: 人类在线购物行为模拟观察、人、理由和评估LMLLMs的数据集 2506.05606v3 |
Authors (16): Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating “believable” human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user’s next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.
Article 353
Title@2025-07-07 (1): Logit Reweighting for Topic-Focused Summarization
Title: Logit Reweighting for Topic-Focused Summarization | Logit-Umgewichtung für thematisch fokussierte Zusammenfassung | 用于专题用途汇总的对数重新加权 2507.05235v1 |
Authors (3): Joschka Braun, Bálint Mucsányi, Seyed Ali Bahrainian
Generating abstractive summaries that adhere to a specific topic remains a significant challenge for language models. While standard approaches, such as fine-tuning, are resource-intensive, simpler methods like prompt engineering often struggle to maintain topical focus, particularly with smaller models. To address this, we propose a lightweight method that enhances topical relevance by directly reweighting the logits of topic-relevant tokens during generation. We evaluate three such reweighting techniques: Constant Shift, which adds a constant value to logits; Factor Scaling, which multiplies them by a factor; and Threshold Selection, which selectively boosts logits that exceed a probability threshold. Experiments on the NEWTS topical summarization dataset, using both Gemma-2B and Llama-3-8B models, show that these techniques effectively increase the use of topic-relevant vocabulary. Notably, the Threshold Selection method successfully improves topical focus without compromising summary quality, a trade-off often seen in other approaches. Our findings demonstrate that directly reweighting logits is a practical and resource-efficient alternative to fine-tuning, offering a promising pathway for precisely controlling the thematic content of generated text.
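The three reweighting techniques named in this abstract can be sketched as small functions over a logit vector. This is a minimal interpretation of the one-line descriptions above: the function signatures, the `topic_ids` set, and in particular the use of an additive boost inside `threshold_selection` are assumptions, since the abstract does not give the exact formulas.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constant_shift(logits, topic_ids, shift):
    # Add a constant to the logit of every topic-relevant token.
    return [x + shift if i in topic_ids else x for i, x in enumerate(logits)]

def factor_scaling(logits, topic_ids, factor):
    # Multiply the logit of every topic-relevant token by a factor.
    return [x * factor if i in topic_ids else x for i, x in enumerate(logits)]

def threshold_selection(logits, topic_ids, threshold, shift):
    # Boost only topic tokens whose current probability exceeds the threshold.
    # (The additive form of the boost is an assumption of this sketch.)
    probs = softmax(logits)
    return [x + shift if i in topic_ids and probs[i] > threshold else x
            for i, x in enumerate(logits)]

logits = [2.0, 1.0, 0.0, -1.0]
topic_ids = {1, 3}  # hypothetical vocabulary indices of topic-relevant tokens
shifted = constant_shift(logits, topic_ids, shift=1.5)
scaled = factor_scaling(logits, topic_ids, factor=2.0)
selected = threshold_selection(logits, topic_ids, threshold=0.1, shift=1.5)
```

One detail the toy values expose: multiplying a negative logit by a factor above 1 makes it more negative, so in this simple form factor scaling raises token 1 but lowers token 3.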
Article 354
Title@2025-07-07 (1): The Super Weight in Large Language Models
Title: The Super Weight in Large Language Models | Das Supergewicht in großen Sprachmodellen | 大语言模型中的超重 2411.07191v2 |
Authors (5): Mengxia Yu, De Wang, Qi Shan, Colorado J Reed, Alvin Wan
Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: Pruning as few as a single parameter can destroy an LLM’s ability to generate text – increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing. We propose a data-free method for identifying such parameters, termed super weights, using a single forward pass through the model. We additionally find that these super weights induce correspondingly rare and large activation outliers, termed super activations. When preserved with high precision, super activations can improve simple round-to-nearest quantization to become competitive with state-of-the-art methods. For weight quantization, we similarly find that by preserving the super weight and clipping other weight outliers, round-to-nearest quantization can scale to much larger block sizes than previously considered. To facilitate further research into super weights, we provide an index of super weight coordinates for common, openly available LLMs.
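To make the scale in the second sentence concrete, take a hypothetical model with 7 billion parameters (an assumed size chosen for illustration, not a figure from the abstract):

```latex
0.01\% \times 7 \times 10^{9} = 10^{-4} \times 7 \times 10^{9} = 7 \times 10^{5}
```

That is 700,000 parameters, consistent with the "hundreds of thousands" quoted above, and it shows how far a single super weight is from the outlier sets studied previously.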
Article 355
Title@2025-07-07 (1): jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
Title: jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval | jina-embeddings-v4: Universelle Einbettungen für multimodale Mehrsprachigkeit | jina-embeddings-v4:多语种多式联运回收通用嵌入式 2506.18902v3 |
Authors (11): Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao
We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-document retrieval, semantic text similarity, and code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
Article 356
Title@2025-07-07 (1): Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals
Title: Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals | Erweiterte Induktive Begründung für personalisierte Präferenz-Inferenz von Verhaltenssignalen | 从行为信号中推论 2505.18071v2 |
Authors (4): Jia-Nan Li, Jian Guan, Wei Wu, Rui Yan
Large language models (LLMs) have demonstrated significant success in complex reasoning tasks such as math and coding. In contrast to these tasks, where deductive reasoning predominates, inductive reasoning, the ability to derive general rules from incomplete evidence, remains underexplored. This paper investigates extended inductive reasoning in LLMs through the lens of personalized preference inference, a critical challenge in LLM alignment where current approaches struggle to capture diverse user preferences. The task demands strong inductive reasoning capabilities because user preferences are typically embedded implicitly across various interaction forms, requiring models to synthesize consistent preference patterns from scattered signals. We propose AlignXplore, a model that leverages extended reasoning chains to enable systematic preference inference from behavioral signals in users’ interaction histories. Such explicit preference articulation enables efficient streaming inference: when new behavioral signals emerge, the model can build directly on previously inferred preference descriptions rather than reprocessing historical signals from scratch, while also supporting iterative refinement of the inferred preferences. We develop AlignXplore by combining cold-start training based on synthetic data with subsequent online reinforcement learning. Through extensive experiments, we demonstrate that AlignXplore improves over the backbone model by an average of 15.49% on in-domain and out-of-domain benchmarks, while maintaining strong generalization ability across different input formats and downstream models. Further analyses establish best practices for preference inference learning through systematic comparison of reward modeling strategies, while revealing the emergence of human-like inductive reasoning patterns during training.
Article 357
Title@2025-07-07 (1): Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Title: Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities | Zwillinge 2.5: Das Frontier mit fortschrittlicher Vernunft, Multimodalität, langem Kontext und Agentischen Fähigkeiten der nächsten Generation schieben | Gemini 2.5: 推进先进理性、多模式、长处和下一代的前沿 2507.06261v1 |
Authors (3303): Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Blaž Bratanič, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Pavetić, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. 
Calian, Muhuan Huang, Aäron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang, Zhe Chen, Xingyu Wang, Jessica Austin, Peter de Boursac, Evgeny Sluzhaev, Frank Ding, Huijian Li, Surya Bhupatiraju, Mohit Agarwal, Sławek Kwasiborski, Paramjit Sandhu, Patrick Siegler, Ahmet Iscen, Eyal Ben-David, Shiraz Butt, Miltos Allamanis, Seth Benjamin, Robert Busa-Fekete, Felix Hernandez-Campos, Sasha Goldshtein, Matt Dibb, Weiyang Zhang, Annie Marsden, Carey Radebaugh, Stephen Roller, Abhishek Nayyar, Jacob Austin, Tayfun Terzi, Bhargav Kanagal Shamanna, Pete Shaw, Aayush Singh, Florian Luisier, Artur Mendonça, Vaibhav Aggarwal, Larisa Markeeva, Claudio Fantacci, Sergey Brin, HyunJeong Choe, Guanyu Wang, Hartwig Adam, Avigail Dabush, Tatsuya Kiyono, Eyal Marcus, Jeremy Cole, Theophane Weber, Hongrae Lee, Ronny Huang, Alex Muzio, Leandro Kieliger, Maigo Le, Courtney Biles, Long Le, Archit Sharma, Chengrun Yang, Avery Lamp, Dave Dopson, Nate Hurley, Katrina, Xu, Zhihao Shan, Shuang Song, Jiewen Tan, Alexandre Senges, George Zhang, Chong You, Yennie Jun, David Raposo, Susanna Ricco, Xuan Yang, Weijie Chen, Prakhar Gupta, Arthur Szlam, Kevin Villela, Chun-Sung Ferng, Daniel Kasenberg, Chen Liang, Rui Zhu, Arunachalam Narayanaswamy, Florence Perot, Paul Pucciarelli, Anna Shekhawat, Alexey Stern, Rishikesh Ingale, Stefani Karp, Sanaz Bahargam, Adrian Goedeckemeyer, Jie Han, Sicheng Li, Andrea Tacchetti, Dian Yu, Abhishek Chakladar, Zhiying Zhang, Mona El Mahdy, Xu Gao, Dale Johnson, Samrat Phatale, AJ Piergiovanni, Hyeontaek Lim, Clement Farabet, Carl Lebsack, Theo Guidroz, John Blitzer, Nico Duduta, David Madras, Steve Li, Daniel von Dincklage, Xin Li, Mahdis Mahdieh, George Tucker, Ganesh Jawahar, Owen Xiao, Danny Tarlow, Robert Geirhos, Noam Velan, Daniel Vlasic, Kalesha 
Bullard, SK Park, Nishesh Gupta, Kellie Webster, Ayal Hitron, Jieming Mao, Julian Eisenschlos, Laurel Prince, Nina D’Souza, Kelvin Zheng, Sara Nasso, Gabriela Botea, Carl Doersch, Caglar Unlu, Chris Alberti, Alexey Svyatkovskiy, Ankita Goel, Krzysztof Choromanski, Pan-Pan Jiang, Richard Nguyen, Four Flynn, Daria Ćurko, Peter Chen, Nicholas Roth, Kieran Milan, Caleb Habtegebriel, Shashi Narayan, Michael Moffitt, Jake Marcus, Thomas Anthony, Brendan McMahan, Gowoon Cheon, Ruibo Liu, Megan Barnes, Lukasz Lew, Rebeca Santamaria-Fernandez, Mayank Upadhyay, Arjun Akula, Arnar Mar Hrafnkelsson, Alvaro Caceres, Andrew Bunner, Michal Sokolik, Subha Puttagunta, Lawrence Moore, Berivan Isik, Weilun Chen, Jay Hartford, Lawrence Chan, Pradeep Shenoy, Dan Holtmann-Rice, Jane Park, Fabio Viola, Alex Salcianu, Sujeevan Rajayogam, Ian Stewart-Binks, Zelin Wu, Richard Everett, Xi Xiong, Pierre-Antoine Manzagol, Gary Leung, Carl Saroufim, Bo Pang, Dawid Wegner, George Papamakarios, Jennimaria Palomaki, Helena Pankov, Guangda Lai, Guilherme Tubone, Shubin Zhao, Theofilos Strinopoulos, Seth Neel, Mingqiu Wang, Joe Kelley, Li Li, Pingmei Xu, Anitha Vijayakumar, Andrea D’olimpio, Omer Levy, Massimo Nicosia, Grigory Rozhdestvenskiy, Ni Lao, Sirui Xie, Yash Katariya, Jon Simon, Sanjiv Kumar, Florian Hartmann, Michael Kilgore, Jinhyuk Lee, Aroma Mahendru, Roman Ring, Tom Hennigan, Fiona Lang, Colin Cherry, David Steiner, Dawsen Hwang, Ray Smith, Pidong Wang, Jeremy Chen, Ming-Hsuan Yang, Sam Kwei, Philippe Schlattner, Donnie Kim, Ganesh Poomal Girirajan, Nikola Momchev, Ayushi Agarwal, Xingyi Zhou, Ilkin Safarli, Zachary Garrett, AJ Pierigiovanni, Sarthak Jauhari, Alif Raditya Rochman, Shikhar Vashishth, Quan Yuan, Christof Angermueller, Jon Blanton, Xinying Song, Nitesh Bharadwaj Gundavarapu, Thi Avrahami, Maxine Deines, Subhrajit Roy, Manish Gupta, Christopher Semturs, Shobha Vasudevan, Aditya Srikanth Veerubhotla, Shriya Sharma, Josh Jacob, Zhen Yang, Andreas Terzis, Dan Karliner, Auriel 
Wright, Tania Rojas-Esponda, Ashley Brown, Abhijit Guha Roy, Pawan Dogra, Andrei Kapishnikov, Peter Young, Wendy Kan, Vinodh Kumar Rajendran, Maria Ivanova, Salil Deshmukh, Chia-Hua Ho, Mike Kwong, Stav Ginzburg, Annie Louis, KP Sawhney, Slav Petrov, Jing Xie, Yunfei Bai, Georgi Stoyanov, Alex Fabrikant, Rajesh Jayaram, Yuqi Li, Joe Heyward, Justin Gilmer, Yaqing Wang, Radu Soricut, Luyang Liu, Qingnan Duan, Jamie Hayes, Maura O’Brien, Gaurav Singh Tomar, Sivan Eiger, Bahar Fatemi, Jeffrey Hui, Catarina Barros, Adaeze Chukwuka, Alena Butryna, Saksham Thakur, Austin Huang, Zhufeng Pan, Haotian Tang, Serkan Cabi, Tulsee Doshi, Michiel Bakker, Sumit Bagri, Ruy Ley-Wild, Adam Lelkes, Jennie Lees, Patrick Kane, David Greene, Shimu Wu, Jörg Bornschein, Gabriela Surita, Sarah Hodkinson, Fangtao Li, Chris Hidey, Sébastien Pereira, Sean Ammirati, Phillip Lippe, Adam Kraft, Pu Han, Sebastian Gerlach, Zifeng Wang, Liviu Panait, Feng Han, Brian Farris, Yingying Bi, Hannah DeBalsi, Miaosen Wang, Gladys Tyen, James Cohan, Susan Zhang, Jarred Barber, Da-Woon Chung, Jaeyoun Kim, Markus Kunesch, Steven Pecht, Nami Akazawa, Abe Friesen, James Lyon, Ali Eslami, Junru Wu, Jie Tan, Yue Song, Ravi Kumar, Chris Welty, Ilia Akolzin, Gena Gibson, Sean Augenstein, Arjun Pillai, Nancy Yuen, Du Phan, Xin Wang, Iain Barr, Heiga Zen, Nan Hua, Casper Liu, Jilei, Wang, Tanuj Bhatia, Hao Xu, Oded Elyada, Pushmeet Kohli, Mirek Olšák, Ke Chen, Azalia Mirhoseini, Noam Shazeer, Shoshana Jakobovits, Maggie Tran, Nolan Ramsden, Tarun Bharti, Fred Alcober, Yunjie Li, Shilpa Shetty, Jing Chen, Dmitry Kalashnikov, Megha Nawhal, Sercan Arik, Hanwen Chen, Michiel Blokzijl, Shubham Gupta, James Rubin, Rigel Swavely, Sophie Bridgers, Ian Gemp, Chen Su, Arun Suggala, Juliette Pluto, Mary Cassin, Alain Vaucher, Kaiyang Ji, Jiahao Cai, Andrew Audibert, Animesh Sinha, David Tian, Efrat Farkash, Amy Hua, Jilin Chen, Duc-Hieu Tran, Edward Loper, Nicole Brichtova, Lara McConnaughey, Ballie Sandhu, Robert Leland, Doug 
DeCarlo, Andrew Over, James Huang, Xing Wu, Connie Fan, Eric Li, Yun Lei, Deepak Sharma, Cosmin Paduraru, Luo Yu, Matko Bošnjak, Phuong Dao, Min Choi, Sneha Kudugunta, Jakub Adamek, Carlos Guía, Ali Khodaei, Jie Feng, Wenjun Zeng, David Welling, Sandeep Tata, Christina Butterfield, Andrey Vlasov, Seliem El-Sayed, Swaroop Mishra, Tara Sainath, Shentao Yang, RJ Skerry-Ryan, Jeremy Shar, Robert Berry, Arunkumar Rajendran, Arun Kandoor, Andrea Burns, Deepali Jain, Tom Stone, Wonpyo Park, Shibo Wang, Albin Cassirer, Guohui Wang, Hayato Kobayashi, Sergey Rogulenko, Vineetha Govindaraj, Mikołaj Rybiński, Nadav Olmert, Colin Evans, Po-Sen Huang, Kelvin Xu, Premal Shah, Terry Thurk, Caitlin Sikora, Mu Cai, Jin Xie, Elahe Dabir, Saloni Shah, Norbert Kalb, Carrie Zhang, Shruthi Prabhakara, Amit Sabne, Artiom Myaskovsky, Vikas Raunak, Blanca Huergo, Behnam Neyshabur, Jon Clark, Ye Zhang, Shankar Krishnan, Eden Cohen, Dinesh Tewari, James Lottes, Yumeya Yamamori, Hui, Li, Mohamed Elhawaty, Ada Maksutaj Oflazer, Adrià Recasens, Sheryl Luo, Duy Nguyen, Taylor Bos, Kalyan Andra, Ana Salazar, Ed Chi, Jeongwoo Ko, Matt Ginsberg, Anders Andreassen, Anian Ruoss, Todor Davchev, Elnaz Davoodi, Chenxi Liu, Min Kim, Santiago Ontanon, Chi Ming To, Dawei Jia, Rosemary Ke, Jing Wang, Anna Korsun, Moran Ambar, Ilya Kornakov, Irene Giannoumis, Toni Creswell, Denny Zhou, Yi Su, Ishaan Watts, Aleksandr Zaks, Evgenii Eltyshev, Ziqiang Feng, Sidharth Mudgal, Alex Kaskasoli, Juliette Love, Kingshuk Dasgupta, Sam Shleifer, Richard Green, Sungyong Seo, Chansoo Lee, Dale Webster, Prakash Shroff, Ganna Raboshchuk, Isabel Leal, James Manyika, Sofia Erell, Daniel Murphy, Zhisheng Xiao, Anton Bulyenov, Julian Walker, Mark Collier, Matej Kastelic, Nelson George, Sushant Prakash, Sailesh Sidhwani, Alexey Frolov, Steven Hansen, Petko Georgiev, Tiberiu Sosea, Chris Apps, Aishwarya Kamath, David Reid, Emma Cooney, Charlotte Magister, Oriana Riva, Alec Go, Pu-Chin Chen, Sebastian Krause, Nir Levine, Marco 
Fornoni, Ilya Figotin, Nick Roy, Parsa Mahmoudieh, Vladimir Magay, Mukundan Madhavan, Jin Miao, Jianmo Ni, Yasuhisa Fujii, Ian Chou, George Scrivener, Zak Tsai, Siobhan Mcloughlin, Jeremy Selier, Sandra Lefdal, Jeffrey Zhao, Abhijit Karmarkar, Kushal Chauhan, Shivanker Goel, Zhaoyi Zhang, Vihan Jain, Parisa Haghani, Mostafa Dehghani, Jacob Scott, Erin Farnese, Anastasija Ilić, Steven Baker, Julia Pawar, Li Zhong, Josh Camp, Yoel Zeldes, Shravya Shetty, Anand Iyer, Vít Listík, Jiaxian Guo, Luming Tang, Mark Geller, Simon Bucher, Yifan Ding, Hongzhi Shi, Carrie Muir, Dominik Grewe, Ramy Eskander, Octavio Ponce, Boqing Gong, Derek Gasaway, Samira Khan, Umang Gupta, Angelos Filos, Weicheng Kuo, Klemen Kloboves, Jennifer Beattie, Christian Wright, Leon Li, Alicia Jin, Sandeep Mariserla, Miteyan Patel, Jens Heitkaemper, Dilip Krishnan, Vivek Sharma, David Bieber, Christian Frank, John Lambert, Paul Caron, Martin Polacek, Mai Giménez, Himadri Choudhury, Xing Yu, Sasan Tavakkol, Arun Ahuja, Franz Och, Rodolphe Jenatton, Wojtek Skut, Bryan Richter, David Gaddy, Andy Ly, Misha Bilenko, Megh Umekar, Ethan Liang, Martin Sevenich, Mandar Joshi, Hassan Mansoor, Rebecca Lin, Sumit Sanghai, Abhimanyu Singh, Xiaowei Li, Sudheendra Vijayanarasimhan, Zaheer Abbas, Yonatan Bitton, Hansa Srinivasan, Manish Reddy Vuyyuru, Alexander Frömmgen, Yanhua Sun, Ralph Leith, Alfonso Castaño, DJ Strouse, Le Yan, Austin Kyker, Satish Kambala, Mary Jasarevic, Thibault Sellam, Chao Jia, Alexander Pritzel, Raghavender R, Huizhong Chen, Natalie Clay, Sudeep Gandhe, Sean Kirmani, Sayna Ebrahimi, Hannah Kirkwood, Jonathan Mallinson, Chao Wang, Adnan Ozturel, Kuo Lin, Shyam Upadhyay, Vincent Cohen-Addad, Sean Purser-haskell, Yichong Xu, Ebrahim Songhori, Babi Seal, Alberto Magni, Almog Gueta, Tingting Zou, Guru Guruganesh, Thais Kagohara, Hung Nguyen, Khalid Salama, Alejandro Cruzado Ruiz, Justin Frye, Zhenkai Zhu, Matthias Lochbrunner, Simon Osindero, Wentao Yuan, Lisa Lee, Aman Prasad, Lam Nguyen 
Thiet, Daniele Calandriello, Victor Stone, Qixuan Feng, Han Ke, Maria Voitovich, Geta Sampemane, Lewis Chiang, Ling Wu, Alexander Bykovsky, Matt Young, Luke Vilnis, Ishita Dasgupta, Aditya Chawla, Qin Cao, Bowen Liang, Daniel Toyama, Szabolcs Payrits, Anca Stefanoiu, Dimitrios Vytiniotis, Ankesh Anand, Tianxiao Shen, Blagoj Mitrevski, Michael Tschannen, Sreenivas Gollapudi, Aishwarya P S, José Leal, Zhe Shen, Han Fu, Wei Wang, Arvind Kannan, Doron Kukliansky, Sergey Yaroshenko, Svetlana Grant, Umesh Telang, David Wood, Alexandra Chronopoulou, Alexandru Ţifrea, Tao Zhou, Tony, Nguy~ên, Muge Ersoy, Anima Singh, Meiyan Xie, Emanuel Taropa, Woohyun Han, Eirikur Agustsson, Andrei Sozanschi, Hui Peng, Alex Chen, Yoel Drori, Efren Robles, Yang Gao, Xerxes Dotiwalla, Ying Chen, Anudhyan Boral, Alexei Bendebury, John Nham, Chris Tar, Luis Castro, Jiepu Jiang, Canoee Liu, Felix Halim, Jinoo Baek, Andy Wan, Jeremiah Liu, Yuan Cao, Shengyang Dai, Trilok Acharya, Ruoxi Sun, Fuzhao Xue, Saket Joshi, Morgane Lustman, Yongqin Xian, Rishabh Joshi, Deep Karkhanis, Nora Kassner, Jamie Hall, Xiangzhuo Ding, Gan Song, Gang Li, Chen Zhu, Yana Kulizhskaya, Bin Ni, Alexey Vlaskin, Solomon Demmessie, Lucio Dery, Salah Zaiem, Yanping Huang, Cindy Fan, Felix Gimeno, Ananth Balashankar, Koji Kojima, Hagai Taitelbaum, Maya Meng, Dero Gharibian, Sahil Singla, Wei Chen, Ambrose Slone, Guanjie Chen, Sujee Rajayogam, Max Schumacher, Suyog Kotecha, Rory Blevins, Qifei Wang, Mor Hazan Taege, Alex Morris, Xin Liu, Fayaz Jamil, Richard Zhang, Pratik Joshi, Ben Ingram, Tyler Liechty, Ahmed Eleryan, Scott Baird, Alex Grills, Gagan Bansal, Shan Han, Kiran Yalasangi, Shawn Xu, Majd Al Merey, Isabel Gao, Felix Weissenberger, Igor Karpov, Robert Riachi, Ankit Anand, Gautam Prasad, Kay Lamerigts, Reid Hayes, Jamie Rogers, Mandy Guo, Ashish Shenoy, Qiong, Hu, Kyle He, Yuchen Liu, Polina Zablotskaia, Sagar Gubbi, Yifan Chang, Jay Pavagadhi, Kristian Kjems, Archita Vadali, Diego Machado, Yeqing Li, Renshen 
Wang, Dipankar Ghosh, Aahil Mehta, Dana Alon, George Polovets, Alessio Tonioni, Nate Kushman, Joel D’sa, Lin Zhuo, Allen Wu, Rohin Shah, John Youssef, Jiayu Ye, Justin Snyder, Karel Lenc, Senaka Buthpitiya, Matthew Tung, Jichuan Chang, Tao Chen, David Saxton, Jenny Lee, Lydia Lihui Zhang, James Qin, Prabakar Radhakrishnan, Maxwell Chen, Piotr Ambroszczyk, Metin Toksoz-Exley, Yan Zhong, Nitzan Katz, Brendan O’Donoghue, Tamara von Glehn, Adi Gerzi Rosenthal, Aga Świetlik, Xiaokai Zhao, Nick Fernando, Jinliang Wei, Jieru Mei, Sergei Vassilvitskii, Diego Cedillo, Pranjal Awasthi, Hui Zheng, Koray Kavukcuoglu, Itay Laish, Joseph Pagadora, Marc Brockschmidt, Christopher A. Choquette-Choo, Arunkumar Byravan, Yifeng Lu, Xu Chen, Mia Chen, Kenton Lee, Rama Pasumarthi, Sijal Bhatnagar, Aditya Shah, Qiyin Wu, Zhuoyuan Chen, Zack Nado, Bartek Perz, Zixuan Jiang, David Kao, Ganesh Mallya, Nino Vieillard, Lantao Mei, Sertan Girgin, Mandy Jordan, Yeongil Ko, Alekh Agarwal, Yaxin Liu, Yasemin Altun, Raoul de Liedekerke, Anastasios Kementsietsidis, Daiyi Peng, Dangyi Liu, Utku Evci, Peter Humphreys, Austin Tarango, Xiang Deng, Yoad Lewenberg, Kevin Aydin, Chengda Wu, Bhavishya Mittal, Tsendsuren Munkhdalai, Kleopatra Chatziprimou, Rodrigo Benenson, Uri First, Xiao Ma, Jinning Li, Armand Joulin, Hamish Tomlinson, Tingnan Zhang, Milad Nasr, Zhi Hong, Michaël Sander, Lisa Anne Hendricks, Anuj Sharma, Andrew Bolt, Eszter Vértes, Jiri Simsa, Tomer Levinboim, Olcan Sercinoglu, Divyansh Shukla, Austin Wu, Craig Swanson, Danny Vainstein, Fan Bu, Bo Wang, Ryan Julian, Charles Yoon, Sergei Lebedev, Antonious Girgis, Bernd Bandemer, David Du, Todd Wang, Xi Chen, Ying Xiao, Peggy Lu, Natalie Ha, Vlad Ionescu, Simon Rowe, Josip Matak, Federico Lebron, Andreas Steiner, Lalit Jain, Manaal Faruqui, Nicolas Lacasse, Georgie Evans, Neesha Subramaniam, Dean Reich, Giulia Vezzani, Aditya Pandey, Joe Stanton, Tianhao Zhou, Liam McCafferty, Henry Griffiths, Verena Rieser, Soheil Hassas Yeganeh, 
Eleftheria Briakou, Lu Huang, Zichuan Wei, Liangchen Luo, Erik Jue, Gabby Wang, Victor Cotruta, Myriam Khan, Jongbin Park, Qiuchen Guo, Peiran Li, Rong Rong, Diego Antognini, Anastasia Petrushkina, Chetan Tekur, Eli Collins, Parul Bhatia, Chester Kwak, Wenhu Chen, Arvind Neelakantan, Immanuel Odisho, Sheng Peng, Vincent Nallatamby, Vaibhav Tulsyan, Fabian Pedregosa, Peng Xu, Raymond Lin, Yulong Wang, Emma Wang, Sholto Douglas, Reut Tsarfaty, Elena Gribovskaya, Renga Aravamudhan, Manu Agarwal, Mara Finkelstein, Qiao Zhang, Elizabeth Cole, Phil Crone, Sarmishta Velury, Anil Das, Chris Sauer, Luyao Xu, Danfeng Qin, Chenjie Gu, Dror Marcus, CJ Zheng, Wouter Van Gansbeke, Sobhan Miryoosefi, Haitian Sun, YaGuang Li, Charlie Chen, Jae Yoo, Pavel Dubov, Alex Tomala, Adams Yu, Paweł Wesołowski, Alok Gunjan, Eddie Cao, Jiaming Luo, Nikhil Sethi, Arkadiusz Socala, Laura Graesser, Tomas Kocisky, Arturo BC, Minmin Chen, Edward Lee, Sophie Wang, Weize Kong, Qiantong Xu, Nilesh Tripuraneni, Yiming Li, Xinxin Yu, Allen Porter, Paul Voigtlaender, Biao Zhang, Arpi Vezer, Sarah York, Qing Wei, Geoffrey Cideron, Mark Kurzeja, Seungyeon Kim, Benny Li, Angéline Pouget, Hyo Lee, Kaspar Daugaard, Yang Li, Dave Uthus, Aditya Siddhant, Paul Cavallaro, Sriram Ganapathy, Maulik Shah, Rolf Jagerman, Jeff Stanway, Piermaria Mendolicchio, Li Xiao, Kayi Lee, Tara Thompson, Shubham Milind Phal, Jason Chase, Sun Jae Lee, Adrian N Reyes, Disha Shrivastava, Zhen Qin, Roykrong Sukkerd, Seth Odoom, Lior Madmoni, John Aslanides, Jonathan Herzig, Elena Pochernina, Sheng Zhang, Parker Barnes, Daisuke Ikeda, Qiujia Li, Shuo-yiin Chang, Shakir Mohamed, Jim Sproch, Richard Powell, Bidisha Samanta, Domagoj Ćevid, Anton Kovsharov, Shrestha Basu Mallick, Srinivas Tadepalli, Anne Zheng, Kareem Ayoub, Andreas Noever, Christian Reisswig, Zhuo Xu, Junhyuk Oh, Martin Matysiak, Tim Blyth, Shereen Ashraf, Julien Amelot, Boone Severson, Michele Bevilacqua, Motoki Sano, Ethan Dyer, Ofir Roval, Anu Sinha, Yin Zhong, Sagi 
Perel, Tea Sabolić, Johannes Mauerer, Willi Gierke, Mauro Verzetti, Rodrigo Cabrera, Alvin Abdagic, Steven Hemingray, Austin Stone, Jong Lee, Farooq Ahmad, Karthik Raman, Lior Shani, Jonathan Lai, Orhan Firat, Nathan Waters, Eric Ge, Mo Shomrat, Himanshu Gupta, Rajeev Aggarwal, Tom Hudson, Bill Jia, Simon Baumgartner, Palak Jain, Joe Kovac, Junehyuk Jung, Ante Žužul, Will Truong, Morteza Zadimoghaddam, Songyou Peng, Marco Liang, Rachel Sterneck, Balaji Lakshminarayanan, Machel Reid, Oliver Woodman, Tong Zhou, Jianling Wang, Vincent Coriou, Arjun Narayanan, Jay Hoover, Yenai Ma, Apoorv Jindal, Clayton Sanford, Doug Reid, Swaroop Ramaswamy, Alex Kurakin, Roland Zimmermann, Yana Lunts, Dragos Dena, Zalán Borsos, Vered Cohen, Shujian Zhang, Will Grathwohl, Robert Dadashi, Morgan Redshaw, Joshua Kessinger, Julian Odell, Silvano Bonacina, Zihang Dai, Grace Chen, Ayush Dubey, Pablo Sprechmann, Mantas Pajarskas, Wenxuan Zhou, Niharika Ahuja, Tara Thomas, Martin Nikoltchev, Matija Kecman, Bharath Mankalale, Andrey Ryabtsev, Jennifer She, Christian Walder, Jiaming Shen, Lu Li, Carolina Parada, Sheena Panthaplackel, Okwan Kwon, Matt Lawlor, Utsav Prabhu, Yannick Schroecker, Marc’aurelio Ranzato, Pete Blois, Iurii Kemaev, Ting Yu, Dmitry, Lepikhin, Hao Xiong, Sahand Sharifzadeh, Oleaser Johnson, Jeremiah Willcock, Rui Yao, Greg Farquhar, Sujoy Basu, Hidetoshi Shimokawa, Nina Anderson, Haiguang Li, Khiem Pham, Yizhong Liang, Sebastian Borgeaud, Alexandre Moufarek, Hideto Kazawa, Blair Kutzman, Marcin Sieniek, Sara Smoot, Ruth Wang, Natalie Axelsson, Nova Fallen, Prasha Sundaram, Yuexiang Zhai, Varun Godbole, Petros Maniatis, Alek Wang, Ilia Shumailov, Santhosh Thangaraj, Remi Crocker, Nikita Gupta, Gang Wu, Phil Chen, Gellért Weisz, Celine Smith, Mojtaba Seyedhosseini, Boya Fang, Xiyang Luo, Roey Yogev, Zeynep Cankara, Andrew Hard, Helen Ran, Rahul Sukthankar, George Necula, Gaël Liu, Honglong Cai, Praseem Banzal, Daniel Keysers, Sanjay Ghemawat, Connie Tao, Emma Dunleavy, 
Aditi Chaudhary, Wei Li, Maciej Mikuła, Chen-Yu Lee, Tiziana Refice, Krishna Somandepalli, Alexandre Fréchette, Dan Bahir, John Karro, Keith Rush, Sarah Perrin, Bill Rosgen, Xiaomeng Yang, Clara Huiyi Hu, Mahmoud Alnahlawi, Justin Mao-Jones, Roopal Garg, Hoang Nguyen, Bat-Orgil Batsaikhan, Iñaki Iturrate, Anselm Levskaya, Avi Singh, Ashyana Kachra, Tony Lu, Denis Petek, Zheng Xu, Mark Graham, Lukas Zilka, Yael Karov, Marija Kostelac, Fangyu Liu, Yaohui Guo, Weiyue Wang, Bernd Bohnet, Emily Pitler, Tony Bruguier, Keisuke Kinoshita, Chrysovalantis Anastasiou, Nilpa Jha, Ting Liu, Jerome Connor, Phil Wallis, Philip Pham, Eric Bailey, Shixin Li, Heng-Tze Cheng, Sally Ma, Haiqiong Li, Akanksha Maurya, Kate Olszewska, Manfred Warmuth, Christy Koh, Dominik Paulus, Siddhartha Reddy Jonnalagadda, Enrique Piqueras, Ali Elqursh, Geoff Brown, Hadar Shemtov, Loren Maggiore, Fei Xia, Ryan Foley, Beka Westberg, George van den Driessche, Livio Baldini Soares, Arjun Kar, Michael Quinn, Siqi Zuo, Jialin Wu, Kyle Kastner, Anna Bortsova, Aijun Bai, Ales Mikhalap, Luowei Zhou, Jennifer Brennan, Vinay Ramasesh, Honglei Zhuang, John Maggs, Johan Schalkwyk, Yuntao Xu, Hui Huang, Andrew Howard, Sasha Brown, Linting Xue, Gloria Shen, Brian Albert, Neha Jha, Daniel Zheng, Varvara Krayvanova, Spurthi Amba Hombaiah, Olivier Lacombe, Gautam Vasudevan, Dan Graur, Tian Xie, Meet Gandhi, Bangju Wang, Dustin Zelle, Harman Singh, Dahun Kim, Sébastien Cevey, Victor Ungureanu, Natasha Noy, Fei Liu, Annie Xie, Fangxiaoyu Feng, Katerina Tsihlas, Daniel Formoso, Neera Vats, Quentin Wellens, Yinan Wang, Niket Kumar Bhumihar, Samrat Ghosh, Matt Hoffman, Tom Lieber, Oran Lang, Kush Bhatia, Tom Paine, Aroonalok Pyne, Ronny Votel, Madeleine Clare Elish, Benoit Schillings, Alex Panagopoulos, Haichuan Yang, Adam Raveret, Zohar Yahav, Shuang Liu, Warren Chen, Dalia El Badawy, Nishant Agrawal, Mohammed Badawi, Mahdi Mirzazadeh, Carla Bromberg, Fan Ye, Chang Liu, Tatiana Sholokhova, George-Cristian Muraru, Gargi 
Balasubramaniam, Jonathan Malmaud, Alen Carin, Danilo Martins, Irina Jurenka, Pankil Botadra, Dave Lacey, Richa Singh, Mariano Schain, Dan Zheng, Isabelle Guyon, Victor Lavrenko, Seungji Lee, Xiang Zhou, Demis Hassabis, Jeshwanth Challagundla, Derek Cheng, Nikhil Mehta, Matthew Mauger, Michela Paganini, Pushkar Mishra, Kate Lee, Zhang Li, Lexi Baugher, Ondrej Skopek, Max Chang, Amir Zait, Gaurav Menghani, Lizzetth Bellot, Guangxing Han, Jean-Michel Sarr, Sharat Chikkerur, Himanshu Sahni, Rohan Anil, Arun Narayanan, Chandu Thekkath, Daniele Pighin, Hana Strejček, Marko Velic, Fred Bertsch, Manuel Tragut, Keran Rong, Alicia Parrish, Kai Bailey, Jiho Park, Isabela Albuquerque, Abhishek Bapna, Rajesh Venkataraman, Alec Kosik, Johannes Griesser, Zhiwei Deng, Alek Andreev, Qingyun Dou, Kevin Hui, Fanny Wei, Xiaobin Yu, Lei Shu, Avia Aharon, David Barker, Badih Ghazi, Sebastian Flennerhag, Chris Breaux, Yuchuan Liu, Matthew Bilotti, Josh Woodward, Uri Alon, Stephanie Winkler, Tzu-Kuo Huang, Kostas Andriopoulos, João Gabriel Oliveira, Penporn Koanantakool, Berkin Akin, Michael Wunder, Cicero Nogueira dos Santos, Mohammad Hossein Bateni, Lin Yang, Dan Horgan, Beer Changpinyo, Keyvan Amiri, Min Ma, Dayeong Lee, Lihao Liang, Anirudh Baddepudi, Tejasi Latkar, Raia Hadsell, Jun Xu, Hairong Mu, Michael Han, Aedan Pope, Snchit Grover, Frank Kim, Ankit Bhagatwala, Guan Sun, Yamini Bansal, Amir Globerson, Alireza Nazari, Samira Daruki, Hagen Soltau, Jane Labanowski, Laurent El Shafey, Matt Harvey, Yanif Ahmad, Elan Rosenfeld, William Kong, Etienne Pot, Yi-Xuan Tan, Aurora Wei, Victoria Langston, Marcel Prasetya, Petar Veličković, Richard Killam, Robin Strudel, Darren Ni, Zhenhai Zhu, Aaron Archer, Kavya Kopparapu, Lynn Nguyen, Emilio Parisotto, Hussain Masoom, Sravanti Addepalli, Jordan Grimstad, Hexiang Hu, Joss Moore, Avinatan Hassidim, Le Hou, Mukund Raghavachari, Jared Lichtarge, Adam R. 
Brown, Hilal Dib, Natalia Ponomareva, Justin Fu, Yujing Zhang, Altaf Rahman, Joana Iljazi, Edouard Leurent, Gabriel Dulac-Arnold, Cosmo Du, Chulayuth Asawaroengchai, Larry Jin, Ela Gruzewska, Ziwei Ji, Benigno Uria, Daniel De Freitas, Paul Barham, Lauren Beltrone, Víctor Campos, Jun Yan, Neel Kovelamudi, Arthur Nguyen, Elinor Davies, Zhichun Wu, Zoltan Egyed, Kristina Toutanova, Nithya Attaluri, Hongliang Fei, Peter Stys, Siddhartha Brahma, Martin Izzard, Siva Velusamy, Scott Lundberg, Vincent Zhuang, Kevin Sequeira, Adam Santoro, Ehsan Amid, Ophir Aharoni, Shuai Ye, Mukund Sundararajan, Lijun Yu, Yu-Cheng Ling, Stephen Spencer, Hugo Song, Josip Djolonga, Christo Kirov, Sonal Gupta, Alessandro Bissacco, Clemens Meyer, Mukul Bhutani, Andrew Dai, Weiyi Wang, Siqi Liu, Ashwin Sreevatsa, Qijun Tan, Maria Wang, Lucy Kim, Yicheng Wang, Alex Irpan, Yang Xiao, Stanislav Fort, Yifan He, Alex Gurney, Bryan Gale, Yue Ma, Monica Roy, Viorica Patraucean, Taylan Bilal, Golnaz Ghiasi, Anahita Hosseini, Melvin Johnson, Zhuowan Li, Yi Tay, Benjamin Beyret, Katie Millican, Josef Broder, Mayank Lunayach, Danny Swisher, Eugen Vušak, David Parkinson, MH Tessler, Adi Mayrav Gilady, Richard Song, Allan Dafoe, Yves Raimond, Masa Yamaguchi, Itay Karo, Elizabeth Nielsen, Kevin Kilgour, Mike Dusenberry, Rajiv Mathews, Jiho Choi, Siyuan Qiao, Harsh Mehta, Sahitya Potluri, Chris Knutsen, Jialu Liu, Tat Tan, Kuntal Sengupta, Keerthana Gopalakrishnan, Abodunrinwa Toki, Mencher Chiang, Mike Burrows, Grace Vesom, Zafarali Ahmed, Ilia Labzovsky, Siddharth Vashishtha, Preeti Singh, Ankur Sharma, Ada Ma, Jinyu Xie, Pranav Talluri, Hannah Forbes-Pollard, Aarush Selvan, Joel Wee, Loic Matthey, Tom Funkhouser, Parthasarathy Gopavarapu, Lev Proleev, Cheng Li, Matt Thomas, Kashyap Kolipaka, Zhipeng Jia, Ashwin Kakarla, Srinivas Sunkara, Joan Puigcerver, Suraj Satishkumar Sheth, Emily Graves, Chen Wang, Sadh MNM Khan, Kai Kang, Shyamal Buch, Fred Zhang, Omkar Savant, David Soergel, Kevin Lee, Linda Friso, 
Xuanyi Dong, Rahul Arya, Shreyas Chandrakaladharan, Connor Schenck, Greg Billock, Tejas Iyer, Anton Bakalov, Leslie Baker, Alex Ruiz, Angad Chandorkar, Trieu Trinh, Matt Miecnikowski, Yanqi Zhou, Yangsibo Huang, Jiazhong Nie, Ali Shah, Ashish Thapliyal, Sam Haves, Lun Wang, Uri Shaham, Patrick Morris-Suzuki, Soroush Radpour, Leonard Berrada, Thomas Strohmann, Chaochao Yan, Jingwei Shen, Sonam Goenka, Tris Warkentin, Petar Dević, Dan Belov, Albert Webson, Madhavi Yenugula, Puranjay Datta, Jerry Chang, Nimesh Ghelani, Aviral Kumar, Vincent Perot, Jessica Lo, Yang Song, Herman Schmit, Jianmin Chen, Vasilisa Bashlovkina, Xiaoyue Pan, Diana Mincu, Paul Roit, Isabel Edkins, Andy Davis, Yujia Li, Ben Horn, Xinjian Li, Pradeep Kumar S, Eric Doi, Wanzheng Zhu, Sri Gayatri Sundara Padmanabhan, Siddharth Verma, Jasmine Liu, Heng Chen, Mihajlo Velimirović, Malcolm Reynolds, Priyanka Agrawal, Nick Sukhanov, Abhinit Modi, Siddharth Goyal, John Palowitch, Nima Khajehnouri, Wing Lowe, David Klinghoffer, Sharon Silver, Vinh Tran, Candice Schumann, Francesco Piccinno, Xi Liu, Mario Lučić, Xiaochen Yang, Sandeep Kumar, Ajay Kannan, Ragha Kotikalapudi, Mudit Bansal, Fabian Fuchs, Javad Hosseini, Abdelrahman Abdelhamed, Dawn Bloxwich, Tianhe Yu, Ruoxin Sang, Gregory Thornton, Karan Gill, Yuchi Liu, Virat Shejwalkar, Jason Lin, Zhipeng Yan, Kehang Han, Thomas Buschmann, Michael Pliskin, Zhi Xing, Susheel Tatineni, Junlin Zhang, Sissie Hsiao, Gavin Buttimore, Marcus Wu, Zefei Li, Geza Kovacs, Legg Yeung, Tao Huang, Aaron Cohen, Bethanie Brownfield, Averi Nowak, Mikel Rodriguez, Tianze Shi, Hado van Hasselt, Kevin Cen, Deepanway Ghoshal, Kushal Majmundar, Weiren Yu, Warren, Chen, Danila Sinopalnikov, Hao Zhang, Vlado Galić, Di Lu, Zeyu Zheng, Maggie Song, Gary Wang, Gui Citovsky, Swapnil Gawde, Isaac Galatzer-Levy, David Silver, Ivana Balazevic, Dipanjan Das, Kingshuk Majumder, Yale Cong, Praneet Dutta, Dustin Tran, Hui Wan, Junwei Yuan, Daniel Eppens, Alanna Walton, Been Kim, Harry 
Ragan, James Cobon-Kerr, Lu Liu, Weijun Wang, Bryce Petrini, Jack Rae, Rakesh Shivanna, Yan Xiong, Chace Lee, Pauline Coquinot, Yiming Gu, Lisa Patel, Blake Hechtman, Aviel Boag, Orion Jankowski, Alex Wertheim, Alex Lee, Paul Covington, Hila Noga, Sam Sobell, Shanthal Vasanth, William Bono, Chirag Nagpal, Wei Fan, Xavier Garcia, Kedar Soparkar, Aybuke Turker, Nathan Howard, Sachit Menon, Yuankai Chen, Vikas Verma, Vladimir Pchelin, Harish Rajamani, Valentin Dalibard, Ana Ramalho, Yang Guo, Kartikeya Badola, Seojin Bang, Nathalie Rauschmayr, Julia Proskurnia, Sudeep Dasari, Xinyun Chen, Mikhail Sushkov, Anja Hauth, Pauline Sho, Abhinav Singh, Bilva Chandra, Allie Culp, Max Dylla, Olivier Bachem, James Besley, Heri Zhao, Timothy Lillicrap, Wei Wei, Wael Al Jishi, Ning Niu, Alban Rrustemi, Raphaël Lopez Kaufman, Ryan Poplin, Jewel Zhao, Minh Truong, Shikhar Bharadwaj, Ester Hlavnova, Eli Stickgold, Cordelia Schmid, Georgi Stephanov, Zhaoqi Leng, Frederick Liu, Léonard Hussenot, Shenil Dodhia, Juliana Vicente Franco, Lesley Katzen, Abhanshu Sharma, Sarah Cogan, Zuguang Yang, Aniket Ray, Sergi Caelles, Shen Yan, Ravin Kumar, Daniel Gillick, Renee Wong, Joshua Ainslie, Jonathan Hoech, Séb Arnold, Dan Abolafia, Anca Dragan, Ben Hora, Grace Hu, Alexey Guseynov, Yang Lu, Chas Leichner, Jinmeng Rao, Abhimanyu Goyal, Nagabhushan Baddi, Daniel Hernandez Diaz, Tim McConnell, Max Bain, Jake Abernethy, Qiqi Yan, Rylan Schaeffer, Paul Vicol, Will Thompson, Montse Gonzalez Arenas, Mathias Bellaiche, Pablo Barrio, Stefan Zinke, Riccardo Patana, Pulkit Mehta, JK Kearns, Avraham Ruderman, Scott Pollom, David D’Ambrosio, Cath Hope, Yang Yu, Andrea Gesmundo, Kuang-Huei Lee, Aviv Rosenberg, Yiqian Zhou, Yaoyiran Li, Drew Garmon, Yonghui Wu, Safeen Huda, Gil Fidel, Martin Baeuml, Jian Li, Phoebe Kirk, Rhys May, Tao Tu, Sara Mc Carthy, Toshiyuki Fukuzawa, Miranda Aperghis, Chih-Kuan Yeh, Toshihiro Yoshino, Bo Li, Austin Myers, Kaisheng Yao, Ben Limonchik, Changwan Ryu, Rohun Saxena, Alex 
Goldin, Ruizhe Zhao, Rocky Rhodes, Tao Zhu, Divya Tyam, Heidi Howard, Nathan Byrd, Hongxu Ma, Yan Wu, Ryan Mullins, Qingze Wang, Aida Amini, Sebastien Baur, Yiran Mao, Subhashini Venugopalan, Will Song, Wen Ding, Paul Collins, Sashank Reddi, Megan Shum, Andrei Rusu, Luisa Zintgraf, Kelvin Chan, Sheela Goenka, Mathieu Blondel, Michael Collins, Renke Pan, Marissa Giustina, Nikolai Chinaev, Christian Schuler, Ce Zheng, Jonas Valfridsson, Alyssa Loo, Alex Yakubovich, Jamie Smith, Tao Jiang, Rich Munoz, Gabriel Barcik, Rishabh Bansal, Mingyao Yang, Yilun Du, Pablo Duque, Mary Phuong, Alexandra Belias, Kunal Lad, Zeyu Liu, Tal Schuster, Karthik Duddu, Jieru Hu, Paige Kunkle, Matthew Watson, Jackson Tolins, Josh Smith, Denis Teplyashin, Garrett Bingham, Marvin Ritter, Marco Andreetto, Divya Pitta, Mohak Patel, Shashank Viswanadha, Trevor Strohman, Catalin Ionescu, Jincheng Luo, Yogesh Kalley, Jeremy Wiesner, Dan Deutsch, Derek Lockhart, Peter Choy, Rumen Dangovski, Chawin Sitawarin, Cat Graves, Tanya Lando, Joost van Amersfoort, Ndidi Elue, Zhouyuan Huo, Pooya Moradi, Jean Tarbouriech, Henryk Michalewski, Wenting Ye, Eunyoung Kim, Alex Druinsky, Florent Altché, Xinyi Chen, Artur Dwornik, Da-Cheng Juan, Rivka Moroshko, Horia Toma, Jarrod Kahn, Hai Qian, Maximilian Sieb, Irene Cai, Roman Goldenberg, Praneeth Netrapalli, Sindhu Raghuram, Yuan Gong, Lijie Fan, Evan Palmer, Yossi Matias, Valentin Gabeur, Shreya Pathak, Tom Ouyang, Don Metzler, Geoff Bacon, Srinivasan Venkatachary, Sridhar Thiagarajan, Alex Cullum, Eran Ofek, Vytenis Sakenas, Mohamed Hammad, Cesar Magalhaes, Mayank Daswani, Oscar Chang, Ashok Popat, Ruichao Li, Komal Jalan, Yanhan Hou, Josh Lipschultz, Antoine He, Wenhao Jia, Pier Giuseppe Sessa, Prateek Kolhar, William Wong, Sumeet Singh, Lukas Haas, Jay Whang, Hanna Klimczak-Plucińska, Georges Rotival, Grace Chung, Yiqing Hua, Anfal Siddiqui, Nicolas Serrano, Dongkai Chen, Billy Porter, Libin Bai, Keshav Shivam, Sho Arora, Partha Talukdar, Tom Cobley, Sangnie 
Bhardwaj, Evgeny Gladchenko, Simon Green, Kelvin Guu, Felix Fischer, Xiao Wu, Eric Wang, Achintya Singhal, Tatiana Matejovicova, James Martens, Hongji Li, Roma Patel, Elizabeth Kemp, Jiaqi Pan, Lily Wang, Blake JianHang Chen, Jean-Baptiste Alayrac, Navneet Potti, Erika Gemzer, Eugene Ie, Kay McKinney, Takaaki Saeki, Edward Chou, Pascal Lamblin, SQ Mah, Zach Fisher, Martin Chadwick, Jon Stritar, Obaid Sarvana, Andrew Hogue, Artem Shtefan, Hadi Hashemi, Yang Xu, Jindong Gu, Sharad Vikram, Chung-Ching Chang, Sabela Ramos, Logan Kilpatrick, Weijuan Xi, Jenny Brennan, Yinghao Sun, Abhishek Jindal, Ionel Gog, Dawn Chen, Felix Wu, Jason Lee, Sudhindra Kopalle, Srinadh Bhojanapalli, Oriol Vinyals, Natan Potikha, Burcu Karagol Ayan, Yuan Yuan, Michael Riley, Piotr Stanczyk, Sergey Kishchenko, Bing Wang, Dan Garrette, Antoine Yang, Vlad Feinberg, CJ Carey, Javad Azizi, Viral Shah, Erica Moreira, Chongyang Shi, Josh Feldman, Elizabeth Salesky, Thomas Lampe, Aneesh Pappu, Duhyeon Kim, Jonas Adler, Avi Caciularu, Brian Walker, Yunhan Xu, Yochai Blau, Dylan Scandinaro, Terry Huang, Sam El-Husseini, Abhishek Sinha, Lijie Ren, Taylor Tobin, Patrik Sundberg, Tim Sohn, Vikas Yadav, Mimi Ly, Emily Xue, Jing Xiong, Afzal Shama Soudagar, Sneha Mondal, Nikhil Khadke, Qingchun Ren, Ben Vargas, Stan Bileschi, Sarah Chakera, Cindy Wang, Boyu Wang, Yoni Halpern, Joe Jiang, Vikas Sindhwani, Petre Petrov, Pranavaraj Ponnuramu, Sanket Vaibhav Mehta, Yu Watanabe, Betty Chan, Matheus Wisniewski, Trang Pham, Jingwei Zhang, Conglong Li, Dario de Cesare, Art Khurshudov, Alex Vasiloff, Melissa Tan, Zoe Ashwood, Bobak Shahriari, Maryam Majzoubi, Garrett Tanzer, Olga Kozlova, Robin Alazard, James Lee-Thorp, Nguyet Minh Phu, Isaac Tian, Junwhan Ahn, Andy Crawford, Lauren Lax, Yuan, Shangguan, Iftekhar Naim, David Ross, Oleksandr Ferludin, Tongfei Guo, Andrea Banino, Hubert Soyer, Xiaoen Ju, Dominika Rogozińska, Ishaan Malhi, Marcella Valentine, Daniel Balle, Apoorv Kulshreshtha, Maciej Kula, Yiwen 
Song, Sophia Austin, John Schultz, Roy Hirsch, Arthur Douillard, Apoorv Reddy, Michael Fink, Summer Yue, Khyatti Gupta, Adam Zhang, Norman Rink, Daniel McDuff, Lei Meng, András György, Yasaman Razeghi, Ricky Liang, Kazuki Osawa, Aviel Atias, Matan Eyal, Tyrone Hill, Nikolai Grigorev, Zhengdong Wang, Nitish Kulkarni, Rachel Soh, Ivan Lobov, Zachary Charles, Sid Lall, Kazuma Hashimoto, Ido Kessler, Victor Gomes, Zelda Mariet, Danny Driess, Alessandro Agostini, Canfer Akbulut, Jingcao Hu, Marissa Ikonomidis, Emily Caveness, Kartik Audhkhasi, Saurabh Agrawal, Ioana Bica, Evan Senter, Jayaram Mudigonda, Kelly Chen, Jingchen Ye, Xuanhui Wang, James Svensson, Philipp Fränken, Josh Newlan, Li Lao, Eva Schnider, Sami Alabed, Joseph Kready, Jesse Emond, Afief Halumi, Tim Zaman, Chengxi Ye, Naina Raisinghani, Vilobh Meshram, Bo Chang, Ankit Singh Rawat, Axel Stjerngren, Sergey Levi, Rui Wang, Xiangzhu Long, Mitchelle Rasquinha, Steven Hand, Aditi Mavalankar, Lauren Agubuzu, Sudeshna Roy, Junquan Chen, Jarek Wilkiewicz, Hao Zhou, Michal Jastrzebski, Qiong Hu, Agustin Dal Lago, Ramya Sree Boppana, Wei-Jen Ko, Jennifer Prendki, Yao Su, Zhi Li, Eliza Rutherford, Girish Ramchandra Rao, Ramona Comanescu, Adrià Puigdomènech, Qihang Chen, Dessie Petrova, Christine Chan, Vedrana Milutinovic, Felipe Tiengo Ferreira, Chin-Yi Cheng, Ming Zhang, Tapomay Dey, Sherry Yang, Ramesh Sampath, Quoc Le, Howard Zhou, Chu-Cheng Lin, Hoi Lam, Christine Kaeser-Chen, Kai Hui, Dean Hirsch, Tom Eccles, Basil Mustafa, Shruti Rijhwani, Morgane Rivière, Yuanzhong Xu, Junjie Wang, Xinyang Geng, Xiance Si, Arjun Khare, Cheolmin Kim, Vahab Mirrokni, Kamyu Lee, Khuslen Baatarsukh, Nathaniel Braun, Lisa Wang, Pallavi LV, Richard Tanburn, Yuvein, Zhu, Fangda Li, Setareh Ariafar, Dan Goldberg, Ken Burke, Daniil Mirylenka, Meiqi Guo, Olaf Ronneberger, Hadas Natalie Vogel, Liqun Cheng, Nishita Shetty, Johnson Jia, Thomas Jimma, Corey Fry, Ted Xiao, Martin Sundermeyer, Ryan Burnell, Yannis Assael, Mario Pinto, JD 
Chen, Rohit Sathyanarayana, Donghyun Cho, Jing Lu, Rishabh Agarwal, Sugato Basu, Lucas Gonzalez, Dhruv Shah, Meng Wei, Dre Mahaarachchi, Rohan Agrawal, Tero Rissa, Yani Donchev, Ramiro Leal-Cavazos, Adrian Hutter, Markus Mircea, Alon Jacovi, Faruk Ahmed, Jiageng Zhang, Shuguang Hu, Bo-Juen Chen, Jonni Kanerva, Guillaume Desjardins, Andrew Lee, Nikos Parotsidis, Asier Mujika, Tobias Weyand, Jasper Snoek, Jo Chick, Kai Chen, Paul Chang, Ethan Mahintorabi, Zi Wang, Tolly Powell, Orgad Keller, Abhirut Gupta, Claire Sha, Kanav Garg, Nicolas Heess, Ágoston Weisz, Cassidy Hardin, Bartek Wydrowski, Ben Coleman, Karina Zainullina, Pankaj Joshi, Alessandro Epasto, Terry Spitz, Binbin Xiong, Kai Zhao, Arseniy Klimovskiy, Ivy Zheng, Johan Ferret, Itay Yona, Waleed Khawaja, Jean-Baptiste Lespiau, Maxim Krikun, Siamak Shakeri, Timothee Cour, Bonnie Li, Igor Krivokon, Dan Suh, Alex Hofer, Jad Al Abdallah, Nikita Putikhin, Oscar Akerlund, Silvio Lattanzi, Anurag Kumar, Shane Settle, Himanshu Srivastava, Folawiyo Campbell-Ajala, Edouard Rosseel, Mihai Dorin Istin, Nishanth Dikkala, Anand Rao, Nick Young, Kate Lin, Dhruva Bhaswar, Yiming Wang, Jaume Sanchez Elias, Kritika Muralidharan, James Keeling, Dayou Du, Siddharth Gopal, Gregory Dibb, Charles Blundell, Manolis Delakis, Jacky Liang, Marco Tulio Ribeiro, Georgi Karadzhov, Guillermo Garrido, Ankur Bapna, Jiawei Cao, Adam Sadovsky, Pouya Tafti, Arthur Guez, Coline Devin, Yixian Di, Jinwei Xing, Chuqiao, Xu, Hanzhao Lin, Chun-Te Chu, Sameera Ponda, Wesley Helmholz, Fan Yang, Yue Gao, Sara Javanmardi, Wael Farhan, Alex Ramirez, Ricardo Figueira, Khe Chai Sim, Yuval Bahat, Ashwin Vaswani, Liangzhe Yuan, Gufeng Zhang, Leland Rechis, Hanjun Dai, Tayo Oguntebi, Alexandra Cordell, Eugénie Rives, Kaan Tekelioglu, Naveen Kumar, Bing Zhang, Aurick Zhou, Nikolay Savinov, Andrew Leach, Alex Tudor, Sanjay Ganapathy, Yanyan Zheng, Mirko Rossini, Vera Axelrod, Arnaud Autef, Yukun Zhu, Zheng Zheng, Mingda Zhang, Baochen Sun, Jie Ren, Nenad 
Tomasev, Nithish Kannan, Amer Sinha, Charles Chen, Louis O’Bryan, Alex Pak, Aditya Kusupati, Weel Yang, Deepak Ramachandran, Patrick Griffin, Seokhwan Kim, Philipp Neubeck, Craig Schiff, Tammo Spalink, Mingyang Ling, Arun Nair, Ga-Young Joung, Linda Deng, Avishkar Bhoopchand, Lora Aroyo, Tom Duerig, Jordan Griffith, Gabe Barth-Maron, Jake Ades, Alex Haig, Ankur Taly, Yunting Song, Paul Michel, Dave Orr, Dean Weesner, Corentin Tallec, Carrie Grimes Bostock, Paul Niemczyk, Andy Twigg, Mudit Verma, Rohith Vallu, Henry Wang, Marco Gelmi, Kiranbir Sodhia, Aleksandr Chuklin, Omer Goldman, Jasmine George, Liang Bai, Kelvin Zhang, Petar Sirkovic, Efrat Nehoran, Golan Pundak, Jiaqi Mu, Alice Chen, Alex Greve, Paulo Zacchello, David Amos, Heming Ge, Eric Noland, Colton Bishop, Jeffrey Dudek, Youhei Namiki, Elena Buchatskaya, Jing Li, Dorsa Sadigh, Masha Samsikova, Dan Malkin, Damien Vincent, Robert David, Rob Willoughby, Phoenix Meadowlark, Shawn Gao, Yan Li, Raj Apte, Amit Jhindal, Stein Xudong Lin, Alex Polozov, Zhicheng Wang, Tomas Mery, Anirudh GP, Varun Yerram, Sage Stevens, Tianqi Liu, Noah Fiedel, Charles Sutton, Matthew Johnson, Xiaodan Song, Kate Baumli, Nir Shabat, Muqthar Mohammad, Hao Liu, Marco Selvi, Yichao Zhou, Mehdi Hafezi Manshadi, Chu-ling Ko, Anthony Chen, Michael Bendersky, Jorge Gonzalez Mendez, Nisarg Kothari, Amir Zandieh, Yiling Huang, Daniel Andor, Ellie Pavlick, Idan Brusilovsky, Jitendra Harlalka, Sally Goldman, Andrew Lampinen, Guowang Li, Asahi Ushio, Somit Gupta, Lei Zhang, Chuyuan Kelly Fu, Madhavi Sewak, Timo Denk, Jed Borovik, Brendan Jou, Avital Zipori, Prateek Jain, Junwen Bai, Thang Luong, Jonathan Tompson, Alice Li, Li Liu, George Powell, Jiajun Shen, Alex Feng, Grishma Chole, Da Yu, Yinlam Chow, Tongxin Yin, Eric Malmi, Kefan Xiao, Yash Pande, Shachi Paul, Niccolò Dal Santo, Adil Dostmohamed, Sergio Guadarrama, Aaron Phillips, Thanumalayan Sankaranarayana Pillai, Gal Yona, Amin Ghafouri, Preethi Lahoti, Benjamin Lee, Dhruv Madeka, Eren 
Sezener, Simon Tokumine, Adrian Collister, Nicola De Cao, Richard Shin, Uday Kalra, Parker Beak, Emily Nottage, Ryo Nakashima, Ivan Jurin, Vikash Sehwag, Meenu Gaba, Junhao Zeng, Kevin R. McKee, Fernando Pereira, Tamar Yakar, Amayika Panda, Arka Dhar, Peilin Zhong, Daniel Sohn, Mark Brand, Lars Lowe Sjoesund, Viral Carpenter, Sharon Lin, Shantanu Thakoor, Marcus Wainwright, Ashwin Chaugule, Pranesh Srinivasan, Muye Zhu, Bernett Orlando, Jack Weber, Ayzaan Wahid, Gilles Baechler, Apurv Suman, Jovana Mitrović, Gabe Taubman, Honglin Yu, Helen King, Josh Dillon, Cathy Yip, Dhriti Varma, Tomas Izo, Levent Bolelli, Borja De Balle Pigem, Julia Di Trapani, Fotis Iliopoulos, Adam Paszke, Nishant Ranka, Joe Zou, Francesco Pongetti, Jed McGiffin, Alex Siegman, Rich Galt, Ross Hemsley, Goran Žužić, Victor Carbune, Tao Li, Myle Ott, Félix de Chaumont Quitry, David Vilar Torres, Yuri Chervonyi, Tomy Tsai, Prem Eruvbetine, Samuel Yang, Matthew Denton, Jake Walker, Slavica Andačić, Idan Heimlich Shtacher, Vittal Premachandran, Harshal Tushar Lehri, Cip Baetu, Damion Yates, Lampros Lamprou, Mariko Iinuma, Ioana Mihailescu, Ben Albrecht, Shachi Dave, Susie Sargsyan, Bryan Perozzi, Lucas Manning, Chiyuan Zhang, Denis Vnukov, Igor Mordatch, Raia Hadsell Wolfgang Macherey, Ryan Kappedal, Jim Stephan, Aditya Tripathi, Klaus Macherey, Jun Qian, Abhishek Bhowmick, Shekoofeh Azizi, Rémi Leblond, Shiva Mohan Reddy Garlapati, Timothy Knight, Matthew Wiethoff, Wei-Chih Hung, Anelia Angelova, Georgios Evangelopoulos, Pawel Janus, Dimitris Paparas, Matthew Rahtz, Ken Caluwaerts, Vivek Sampathkumar, Daniel Jarrett, Shadi Noghabi, Antoine Miech, Chak Yeung, Geoff Clark, Henry Prior, Fei Zheng, Jean Pouget-Abadie, Indro Bhattacharya, Kalpesh Krishna, Will Bishop, Zhe Yuan, Yunxiao Deng, Ashutosh Sathe, Kacper Krasowiak, Ciprian Chelba, Cho-Jui Hsieh, Kiran Vodrahalli, Buhuang Liu, Thomas Köppe, Amr Khalifa, Lubo Litchev, Pichi Charoenpanit, Reed Roberts, Sachin Yadav, Yasumasa Onoe, Desi Ivanov, 
Megha Mohabey, Vighnesh Birodkar, Nemanja Rakićević, Pierre Sermanet, Vaibhav Mehta, Krishan Subudhi, Travis Choma, Will Ng, Luheng He, Kathie Wang, Tasos Kementsietsidis, Shane Gu, Mansi Gupta, Andrew Nystrom, Mehran Kazemi, Timothy Chung, Nacho Cano, Nikhil Dhawan, Yufei Wang, Jiawei Xia, Trevor Yacovone, Eric Jia, Mingqing Chen, Simeon Ivanov, Ashrith Sheshan, Sid Dalmia, Paweł Stradomski, Pengcheng Yin, Salem Haykal, Congchao Wang, Dennis Duan, Neslihan Bulut, Greg Kochanski, Liam MacDermed, Namrata Godbole, Shitao Weng, Jingjing Chen, Rachana Fellinger, Ramin Mehran, Daniel Suo, Hisham Husain, Tong He, Kaushal Patel, Joshua Howland, Randall Parker, Kelvin Nguyen, Sharath Maddineni, Chris Rawles, Mina Khan, Shlomi Cohen-Ganor, Amol Mandhane, Xinyi Wu, Chenkai Kuang, Iulia Comşa, Ramya Ganeshan, Hanie Sedghi, Adam Bloniarz, Nuo Wang Pierse, Anton Briukhov, Petr Mitrichev, Anita Gergely, Serena Zhan, Allan Zhou, Nikita Saxena, Eva Lu, Josef Dean, Ashish Gupta, Nicolas Perez-Nieves, Renjie Wu, Cory McLean, Wei Liang, Disha Jindal, Anton Tsitsulin, Wenhao Yu, Kaiz Alarakyia, Tom Schaul, Piyush Patil, Peter Sung, Elijah Peake, Hongkun Yu, Feryal Behbahani, JD Co-Reyes, Alan Ansell, Sean Sun, Clara Barbu, Jonathan Lee, Seb Noury, James Allingham, Bilal Piot, Mohit Sharma, Christopher Yew, Ivan Korotkov, Bibo Xu, Demetra Brady, Goran Petrovic, Shibl Mourad, Claire Cui, Aditya Gupta, Parker Schuh, Saarthak Khanna, Anna Goldie, Abhinav Arora, Vadim Zubov, Amy Stuart, Mark Epstein, Yun Zhu, Jianqiao Liu, Yury Stuken, Ziyue Wang, Karolis Misiunas, Dee Guo, Ashleah Gill, Ale Hartman, Zaid Nabulsi, Aurko Roy, Aleksandra Faust, Jason Riesa, Ben Withbroe, Mengchao Wang, Marco Tagliasacchi, Andreea Marzoca, James Noraky, Serge Toropov, Malika Mehrotra, Bahram Raad, Sanja Deur, Steve Xu, Marianne Monteiro, Zhongru Wu, Yi Luan, Sam Ritter, Nick Li, Håvard Garnes, Yanzhang He, Martin Zlocha, Jifan Zhu, Matteo Hessel, Will Wu, Spandana Raj Babbula, Chizu Kawamoto, Yuanzhen Li, 
Mehadi Hassen, Yan Wang, Brian Wieder, James Freedman, Yin Zhang, Xinyi Bai, Tianli Yu, David Reitter, XiangHai Sheng, Mateo Wirth, Aditya Kini, Dima Damen, Mingcen Gao, Rachel Hornung, Michael Voznesensky, Brian Roark, Adhi Kuncoro, Yuxiang Zhou, Rushin Shah, Anthony Brohan, Kuangyuan Chen, James Wendt, David Rim, Paul Kishan Rubenstein, Jonathan Halcrow, Michelle Liu, Ty Geri, Yunhsuan Sung, Jane Shapiro, Shaan Bijwadia, Chris Duvarney, Christina Sorokin, Paul Natsev, Reeve Ingle, Pramod Gupta, Young Maeng, Ndaba Ndebele, Kexin Zhu, Valentin Anklin, Katherine Lee, Yuan Liu, Yaroslav Akulov, Shaleen Gupta, Guolong Su, Flavien Prost, Tianlin Liu, Vitaly Kovalev, Pol Moreno, Martin Scholz, Sam Redmond, Zongwei Zhou, Alex Castro-Ros, André Susano Pinto, Dia Kharrat, Michal Yarom, Rachel Saputro, Jannis Bulian, Ben Caine, Ji Liu, Abbas Abdolmaleki, Shariq Iqbal, Tautvydas Misiunas, Mikhail Sirotenko, Shefali Garg, Guy Bensky, Huan Gui, Xuezhi Wang, Raphael Koster, Mike Bernico, Da Huang, Romal Thoppilan, Trevor Cohn, Ben Golan, Wenlei Zhou, Andrew Rosenberg, Markus Freitag, Tynan Gangwani, Vincent Tsang, Anand Shukla, Xiaoqi Ren, Minh Giang, Chi Zou, Andre Elisseeff, Charline Le Lan, Dheeru Dua, Shuba Lall, Pranav Shyam, Frankie Garcia, Sarah Nguyen, Michael Guzman, AJ Maschinot, Marcello Maggioni, Ming-Wei Chang, Karol Gregor, Lotte Weerts, Kumaran Venkatesan, Bogdan Damoc, Leon Liu, Jan Wassenberg, Lewis Ho, Becca Roelofs, Majid Hadian, François-Xavier Aubet, Yu Liang, Sami Lachgar, Danny Karmon, Yong Cheng, Amelio Vázquez-Reina, Angie Chen, Zhuyun Dai, Andy Brock, Shubham Agrawal, Chenxi Pang, Peter Garst, Mariella Sanchez-Vargas, Ivor Rendulic, Aditya Ayyar, Andrija Ražnatović, Olivia Ma, Roopali Vij, Neha Sharma, Ashwin Balakrishna, Bingyuan Liu, Ian Mackinnon, Sorin Baltateanu, Petra Poklukar, Gabriel Ibagon, Colin Ji, Hongyang Jiao, Isaac Noble, Wojciech Stokowiec, Zhihao Li, Jeff Dean, David Lindner, Mark Omernick, Kristen Chiafullo, Mason Dimarco, Vitor 
Rodrigues, Vittorio Selo, Garrett Honke, Xintian, Wu, Wei He, Adam Hillier, Anhad Mohananey, Vihari Piratla, Chang Ye, Chase Malik, Sebastian Riedel, Samuel Albanie, Zi Yang, Kenny Vassigh, Maria Bauza, Sheng Li, Yiqing Tao, Nevan Wichers, Andrii Maksai, Abe Ittycheriah, Ross Mcilroy, Bryan Seybold, Noah Goodman, Romina Datta, Steven M. Hernandez, Tian Shi, Yony Kochinski, Anna Bulanova, Ken Franko, Mikita Sazanovich, Nicholas FitzGerald, Praneeth Kacham, Shubha Srinivas Raghvendra, Vincent Hellendoorn, Alexander Grushetsky, Julian Salazar, Angeliki Lazaridou, Jason Chang, Jan-Thorsten Peter, Sushant Kafle, Yann Dauphin, Abhishek Rao, Filippo Graziano, Izhak Shafran, Yuguo Liao, Tianli Ding, Geng Yan, Grace Chu, Zhao Fu, Vincent Roulet, Gabriel Rasskin, Duncan Williams, Shahar Drath, Alex Mossin, Raphael Hoffmann, Jordi Orbay, Francesco Bertolini, Hila Sheftel, Justin Chiu, Siyang Xue, Yuheng Kuang, Ferjad Naeem, Swaroop Nath, Nana Nti, Phil Culliton, Kashyap Krishnakumar, Michael Isard, Pei Sun, Ayan Chakrabarti, Nathan Clement, Regev Cohen, Arissa Wongpanich, GS Oh, Ashwin Murthy, Hao Zheng, Jessica Hamrick, Oskar Bunyan, Suhas Ganesh, Nitish Gupta, Roy Frostig, John Wieting, Yury Malkov, Pierre Marcenac, Zhixin, Lai, Xiaodan Tang, Mohammad Saleh, Fedir Zubach, Chinmay Kulkarni, Huanjie Zhou, Vicky Zayats, Nan Ding, Anshuman Tripathi, Arijit Pramanik, Patrik Zochbauer, Harish Ganapathy, Vedant Misra, Zach Behrman, Hugo Vallet, Mingyang Zhang, Mukund Sridhar, Ye Jin, Mohammad Babaeizadeh, Siim Põder, Megha Goel, Divya Jain, Tajwar Nasir, Shubham Mittal, Tim Dozat, Diego Ardila, Aliaksei Severyn, Fabio Pardo, Sammy Jerome, Siyang Qin, Louis Rouillard, Amir Yazdanbakhsh, Zizhao Zhang, Shivani Agrawal, Kaushik Shivakumar, Caden Lu, Praveen Kallakuri, Rachita Chhaparia, Kanishka Rao, Charles Kwong, Asya Fadeeva, Shitij Nigam, Yan Virin, Yuan Zhang, Balaji Venkatraman, Beliz Gunel, Marc Wilson, Huiyu Wang, Abhinav Gupta, Xiaowei Xu, Adrien Ali Taïga, Kareem Mohamed, 
Doug Fritz, Daniel Rodriguez, Zoubin Ghahramani, Harry Askham, Lior Belenki, James Zhao, Rahul Gupta, Krzysztof Jastrzębski, Takahiro Kosakai, Kaan Katircioglu, Jon Schneider, Rina Panigrahy, Konstantinos Bousmalis, Peter Grabowski, Prajit Ramachandran, Chaitra Hegde, Mihaela Rosca, Angelo Scorza Scarpati, Kyriakos Axiotis, Ying Xu, Zach Gleicher, Assaf Hurwitz Michaely, Mandar Sharma, Sanil Jain, Christoph Hirnschall, Tal Marian, Xuhui Jia, Kevin Mather, Kilol Gupta, Linhai Qiu, Nigamaa Nayakanti, Lucian Ionita, Steven Zheng, Lucia Loher, Kurt Shuster, Igor Petrovski, Roshan Sharma, Rahma Chaabouni, Angel Yeh, James An, Arushi Gupta, Steven Schwarcz, Seher Ellis, Sam Conway-Rahman, Javier Snaider, Alex Zhai, James Atwood, Daniel Golovin, Liqian Peng, Te I, Vivian Xia, Salvatore Scellato, Mahan Malihi, Arthur Bražinskas, Vlad-Doru Ion, Younghoon Jun, James Swirhun, Soroosh Mariooryad, Jiao Sun, Steve Chien, Rey Coaguila, Ariel Brand, Yi Gao, Tom Kwiatkowski, Roee Aharoni, Cheng-Chun Lee, Mislav Žanić, Yichi Zhang, Dan Ethier, Vitaly Nikolaev, Pranav Nair, Yoav Ben Shalom, Hen Fitoussi, Jai Gupta, Hongbin Liu, Dee Cattle, Tolga Bolukbasi, Ben Murdoch, Fantine Huot, Yin Li, Chris Hahn
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and can now process up to 3 hours of video content. Its long-context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements, and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
Article 358
Title@2025-07-07 (1): Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward
Title: Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward | Verbesserung des personalisierten Multi-Turn-Dialogs mit Kuriositätslohn | 增强个人化多发人对话与好奇奖的多发对话 2504.03206v2 |
Authors (5): Yanming Wan, Jiaxing Wu, Marwa Abdulhai, Lior Shani, Natasha Jaques
Effective conversational agents like large language models (LLMs) must personalize their interactions to adapt to user preferences, personalities, and attributes across diverse domains like education and healthcare. Current methods like Reinforcement Learning from Human Feedback (RLHF) often prioritize helpfulness and safety but fall short in fostering truly empathetic, adaptive, and personalized dialogues. Existing personalization approaches typically rely on extensive user history, limiting their effectiveness for new or context-limited users. To address these limitations, we propose leveraging a user model to incorporate a curiosity-based intrinsic reward into multi-turn RLHF. This novel reward mechanism encourages the LLM agent to actively infer user traits by optimizing conversations to improve its user model’s accuracy. Consequently, the agent delivers more personalized interactions by learning more about the user. We demonstrate our method’s effectiveness in two distinct domains: significantly improving personalization performance in a conversational recommendation task, and personalizing conversations for different learning styles in an educational setting. We show improved generalization capabilities compared to traditional multi-turn RLHF, all while maintaining conversation quality. Our method offers a promising solution for creating more personalized, adaptive, and engaging conversational agents.
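One way to read the curiosity-based reward described above is that a turn is rewarded when it makes the user model more accurate. The following is a minimal sketch under that reading, assuming the user model is a categorical belief over user types and the reward is the turn-to-turn gain in probability assigned to the true type. The function name, the user types, and the numbers are hypothetical and are not taken from the paper.

```python
def curiosity_reward(belief_before, belief_after, true_type):
    """Intrinsic reward: gain in probability that the user model
    assigns to the true user type after one conversation turn.
    (Hypothetical formulation for illustration only.)"""
    return belief_after[true_type] - belief_before[true_type]


# Hypothetical beliefs over two learning styles, before and after a turn.
before = {"visual": 0.5, "hands_on": 0.5}
after = {"visual": 0.8, "hands_on": 0.2}

# Positive when the turn revealed information about the user.
reward = curiosity_reward(before, after, "visual")
```

In a multi-turn RLHF setup this term would presumably be added to the usual extrinsic reward, so that informative questions are encouraged even when they do not pay off immediately.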
Article 359
Title@2025-07-07 (1): Interleaving Logic and Counting
Title: Interleaving Logic and Counting | Interleaving Logik und Zählen | 中间逻辑和计数 2507.05219v1 |
Authors (2): Johan van Benthem, Thomas Icard
Reasoning with quantifier expressions in natural language combines logical and arithmetical features, transcending strict divides between qualitative and quantitative. Our topic is this cooperation of styles as it occurs in common linguistic usage and its extension into the broader practice of natural language plus “grassroots mathematics”. We begin with a brief review of first-order logic with counting operators and cardinality comparisons. This system is known to be of high complexity, and drowns out finer aspects of the combination of logic and counting. We move to a small fragment that can represent numerical syllogisms and basic reasoning about comparative size: monadic first-order logic with counting. We provide normal forms that allow for axiomatization, determine which arithmetical notions can be defined on finite and on infinite models, and conversely, we discuss which logical notions can be defined out of purely arithmetical ones, and what sort of (non-)classical logics can be induced. Next, we investigate a series of strengthenings, again using normal form methods. The monadic second-order version is close, in a precise sense, to additive Presburger Arithmetic, while versions with the natural device of tuple counting take us to Diophantine equations, making the logic undecidable. We also define a system that combines basic modal logic over binary accessibility relations with counting, needed to formulate ubiquitous reasoning patterns such as the Pigeonhole Principle. We return to our starting point in natural language, confronting the architecture of our formal systems with linguistic quantifier vocabulary and syntax. We conclude with some general thoughts on yet further entanglements of logic and counting in formal systems, on rethinking the qualitative/quantitative divide, and on connecting our analysis to empirical findings in cognitive science.
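As an illustration of the kind of numerical syllogism that monadic first-order logic with counting can express, consider "at least three A are B; at most one B is C; so at least two A are not C". The counting-quantifier notation below is a common convention and an assumption here, not necessarily the authors' own.

```latex
\[
\exists^{\geq 3}x\,(Ax \wedge Bx),\qquad
\neg\exists^{\geq 2}x\,(Bx \wedge Cx)
\;\models\;
\exists^{\geq 2}x\,(Ax \wedge \neg Cx)
\]
% Justification by cardinalities:
\[
|A \cap B \cap \overline{C}|
 \;=\; |A \cap B| - |A \cap B \cap C|
 \;\geq\; 3 - |B \cap C|
 \;\geq\; 3 - 1 \;=\; 2 .
\]
```

The inference needs both a logical step (restricting to A within B and C) and an arithmetical step (subtracting the bounds), which is the interleaving the abstract refers to.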
Article 360
Title@2025-07-07 (1): Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation
Title: Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation | Überprüfung von wissenschaftlichen Arbeiten für kritische Probleme mit reasoning LLMs: Basisansätze und automatische Bewertung | 审查关于有理有理的关键问题的科学文件:基线方法和自动评价 2505.23824v2 |
Authors (2): Tianmai M. Zhang, Neil F. Abernethy
Recent advancements in large language models have sparked interest in utilizing them to aid the peer review process of scientific publication amid the peer review crisis. However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews. As an alternative, we propose adopting LLMs as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs from multiple vendors and assessed their performance and API costs for identifying critical errors and unsoundness problems in scientific papers. o3 exhibited the best problem identification performance among all models at a modest cost. This paper provides insights into document-based scientific understanding/reasoning and lays a foundation for future applications. Our dataset, code, and model outputs are publicly available.
Article 361
Title@2025-07-07 (1): Pre-Trained Policy Discriminators are General Reward Models
Title: Pre-Trained Policy Discriminators are General Reward Models | Pre-Trained Policy Diskriminatoren sind allgemeine Prämienmodelle | 培训前政策歧视者是一般奖励模式 2507.05197v1 |
Authors (22): Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen
We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance, improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
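A power law y = a * x^b becomes a straight line after taking logarithms of both axes, which is why a linear correlation coefficient can support a power-law claim like the one above. The sketch below shows that check on synthetic data; the compute values, the constants 5.0 and -0.1, and the choice of a loss-like metric are all hypothetical and do not come from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic (compute, loss) pairs drawn from loss = 5.0 * compute ** -0.1.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [5.0 * c ** -0.1 for c in compute]

# In log-log space an exact power law is exactly linear.
log_compute = [math.log(c) for c in compute]
log_loss = [math.log(v) for v in loss]
r = pearson(log_compute, log_loss)
```

For a decreasing metric such as loss the coefficient is close to -1, and for an increasing one it is close to +1; real measurements would be noisy, so the magnitude would fall somewhat below 1.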
Article 362
Title@2025-07-07 (1): NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge
Title: NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge | NativQA Framework: LLMs mit nativem, lokalem und alltäglichem Wissen aktivieren | NativQA框架:利用土著、地方和日常知识扶持LMs 2504.05995v2 |
Authors (6): Firoj Alam, Md Arid Hasan, Sahinur Rahman Laskar, Mucahid Kutlu, Kareem Darwish, Shammur Absar Chowdhury
The rapid advancement of large language models (LLMs) has raised concerns about cultural bias, fairness, and their applicability in diverse linguistic and underrepresented regional contexts. To enhance and benchmark the capabilities of LLMs, there is a need to develop large-scale resources focused on multilingual, local, and cultural contexts. In this study, we propose the NativQA framework, which can seamlessly construct large-scale, culturally and regionally aligned QA datasets in native languages. The framework utilizes user-defined seed queries and leverages search engines to collect location-specific, everyday information. It has been evaluated across 39 locations in 24 countries and in 7 languages – ranging from extremely low-resource to high-resource languages – resulting in over 300K Question-Answer (QA) pairs. The developed resources can be used for LLM benchmarking and further fine-tuning. The framework has been made publicly available for the community (https://gitlab.com/nativqa/nativqa-framework).
Article 363
Title@2025-07-07 (1): From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations
Title: From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations | Von Fragmenten zu Fakten: Ein Curriculum-getriebener DPO-Ansatz zur Generierung von Hindi News Veracity Erklärungen | 《从零碎到事实:产生印地语新闻的多城市解释:课程驱动的DPO方法》 2507.05179v1 |
Authors (5): Pulkit Bansal, Raghvendra Kumar, Shakti Singh, Sriparna Saha, Adam Jatowt
In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters – Actuality and Finesse – into the DPO loss function, enhancing explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework’s effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.
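For reference, the standard DPO objective that this framework builds on can be sketched for a single preference pair as follows. The abstract does not say how the Actuality and Finesse parameters enter the loss, so they are deliberately left out; the log-probabilities in the usage lines are hypothetical values.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (preferred, non-preferred) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under the frozen reference model. This is the plain DPO objective,
    without the paper's additional parameters.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Hypothetical log-probabilities: the fact-checked explanation is the
# preferred response, the raw LLM output the non-preferred one.
loss_good = dpo_loss(-0.5, -2.0, -1.0, -1.0)  # policy favors the preferred text
loss_bad = dpo_loss(-2.0, -0.5, -1.0, -1.0)   # policy favors the rejected text
```

The loss falls as the policy raises the preferred response's likelihood relative to the reference model, and equals log 2 when the policy and reference agree.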
Article 364
Title@2025-07-07 (1): SEPSIS: I Can Catch Your Lies – A New Paradigm for Deception Detection
Title: SEPSIS: I Can Catch Your Lies – A New Paradigm for Deception Detection | SEPSIS: Ich kann Ihre Lügen fangen – Ein neues Paradigma für Täuschungserkennung | 我可以抓住你的谎言– 一种新模型 用来侦测欺骗 2312.00292v2 |
Authors (8): Anku Rani, Dwip Dalal, Shreya Gautam, Pankaj Gupta, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das
Deception is the intentional practice of twisting information. It is a nuanced societal practice deeply intertwined with human societal evolution, characterized by a multitude of facets. This research explores the problem of deception through the lens of psychology, employing a framework that categorizes deception into three forms: lies of omission, lies of commission, and lies of influence. This study focuses specifically on lies of omission. We propose a novel framework for deception detection leveraging NLP techniques. We curated an annotated dataset of 876,784 samples by amalgamating a popular large-scale fake news dataset and scraped news headlines from the Twitter handle of the Times of India, a well-known Indian news media house. Each sample has been labeled with four layers, namely: (i) the type of omission (speculation, bias, distortion, sounds factual, and opinion), (ii) the color of the lie (black, white, etc.), (iii) the intention behind the lie (to influence, etc.), and (iv) the topic of the lie (political, educational, religious, etc.). We present a novel multi-task learning pipeline that leverages the dataless merging of fine-tuned language models to address the deception detection task mentioned earlier. Our proposed model achieved an F1 score of 0.87, demonstrating strong performance across all layers, including the type, color, intent, and topic aspects of deceptive content. Finally, our research explores the relationship between lies of omission and propaganda techniques. To accomplish this, we conducted an in-depth analysis, uncovering compelling findings. For instance, our analysis revealed a significant correlation between loaded language and opinion, shedding light on their interconnectedness. To encourage further research in this field, we are releasing the SEPSIS dataset and code at https://huggingface.co/datasets/ankurani/deception.
Article 365
Title@2025-07-07 (1): Critiques of World Models
Title: Critiques of World Models | Kritik an Weltmodellen | 世界模式的证明 2507.05169v1 |
Authors (4): Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu
The world model, the supposed algorithmic surrogate of the real-world environment that biological agents experience and act upon, has been an emerging topic in recent years because of the rising need to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of “hypothetical thinking” in psychology literature, we offer critiques of several schools of thought on world modeling, and argue that the primary goal of a world model is to simulate all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.
nan
Article 366
Title@2025-07-07 (1): Language Models can Self-Improve at State-Value Estimation for Better Search
Title: Language Models can Self-Improve at State-Value Estimation for Better Search | Sprachmodelle können sich bei State-Value-Schätzung für eine bessere Suche selbst improvisieren | 语言模型可以在国家估价估算中自我改进,以更好地搜索 2503.02878v2 |
Authors (2): Ethan Mendes, Alan Ritter
Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive and time consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead (STL), a self-supervised method that leverages state-transition dynamics to improve a value model capable of effectively guiding language model-controlled search without any labeled data. We find that moderately sized (8 billion parameters) open-weight value models improved with STL can match the performance of using a gpt-4o value model. Furthermore, we find that specialized value models learned with STL can be deployed with computationally lightweight search algorithms, achieving performance that matches that of more expensive tree search methods, while reducing costs by an order of magnitude.
nan
Article 367
Title@2025-07-07 (1): InfoSteer: Steering Information Utility in Language Model Post-Training
Title: InfoSteer: Steering Information Utility in Language Model Post-Training | InfoSteer: Steuerungsinformationsdienst im Sprachmodell nach dem Training | InfoSteer:指导性语言示范语言信息工具 2507.05158v1 |
Authors (3): Chunyuan Deng, Ruidi Chang, Hanjie Chen
Recent advancements in language models (LMs) have gradually ushered in an era where post-training is crucial. Yet, post-training approaches such as supervised fine-tuning (SFT) do not guarantee effective use of knowledge acquired during pretraining. We therefore present InfoSteer, a lightweight method that encourages parametric information utilization in LMs during post-training. It treats each FFN layer as an associative key-value memory and promotes the use of stored memory vectors via forward-pass interventions or regularization during backpropagation. We find that this simple guidance during the post-training phase delivers consistent performance improvements across diverse model families (including Qwen, Gemma, and Llama), spanning over 15 downstream tasks in both ID and OOD evaluations. Beyond performance gains, we also find that steered LMs can adaptively allocate information, placing more emphasis on generating semantically meaningful tokens while using fewer resources on simple transition ones (e.g., "," or "and"). Our work underscores that vanilla post-training does not fully leverage pre-training potential, and steering LMs in latent representation space offers a promising approach that enhances both performance and interpretability.
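The key-value reading of a feed-forward layer mentioned above can be illustrated with a minimal sketch. It shows only the general decomposition FFN(x) = sum_i act(x · k_i) · v_i, where each hidden unit contributes one key and one value vector; it is not the InfoSteer intervention itself, and the function names, the ReLU activation, and the toy vectors are assumptions made for illustration.

```python
def relu(x):
    """Rectified linear activation on a single number."""
    return max(0.0, x)

def ffn_as_memory(x, keys, values):
    """Evaluate a bias-free feed-forward layer as a key-value memory.

    keys[i]   : input-side weight vector of hidden unit i (the "key")
    values[i] : output-side weight vector of hidden unit i (the "value")
    Returns the layer output and the per-memory coefficients.
    """
    # Coefficient of memory i = activation of the match between x and key i.
    coeffs = [relu(sum(a * b for a, b in zip(x, k))) for k in keys]
    dim = len(values[0])
    # Output = coefficient-weighted sum of the value vectors.
    out = [sum(c * v[j] for c, v in zip(coeffs, values)) for j in range(dim)]
    return out, coeffs
```

Under this view, the coefficients indicate how strongly each stored memory vector is used for a given input, which is the quantity a steering method could in principle monitor or encourage.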
nan
Article 368
Title@2025-07-07 (1): AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models
Title: AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models | KI Generated Text Detection mit Anleitung Feinabstimmung große Sprache und Transformer-basierte Modelle | 使用微调大语言和以变换器为基础的模型生成的指令 2507.05157v1 |
Authors (5): Chinnappa Guggilla, Budhaditya Roy, Trupti Ramdas Chavan, Abdul Rahman, Edward Bowen
Large Language Models (LLMs) possess an extraordinary capability to produce text that is not only coherent and contextually relevant but also strikingly similar to human writing. They adapt to various styles and genres, producing content that is both grammatically correct and semantically meaningful. Recently, LLMs have been misused to create highly realistic phishing emails, spread fake news, generate code to automate cyber crime, and write fraudulent scientific articles. Additionally, in many real-world applications, the generated content (including its style and topic) and the generator model are not known beforehand. The increasing prevalence and sophistication of artificial intelligence (AI)-generated texts have made their detection progressively more challenging. Various attempts have been made to distinguish machine-generated text from human-authored content using linguistic, statistical, machine learning, and ensemble-based approaches. This work focuses on two primary objectives: Task-A, which involves distinguishing human-written text from machine-generated text, and Task-B, which attempts to identify the specific LLM responsible for the generation. Both of these tasks are based on fine-tuning of Generative Pre-trained Transformer (GPT-4o-mini), Large Language Model Meta AI (LLaMA) 3 8B, and Bidirectional Encoder Representations from Transformers (BERT). The fine-tuned version of GPT-4o-mini and the BERT model have achieved accuracies of 0.9547 for Task-A and 0.4698 for Task-B.
nan
Article 369
Title@2025-07-07 (1): Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization
Title: Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization | Interpretierbare Mnemonic-Generation für Kanji-Lernen über Erwartungs-Maximierung | 通过预期-最大化为Kanji学习提供可解释的中子一代 2507.05137v1 |
Authors (3): Jaewook Lee, Alexander Scarlatos, Andrew Lan
Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also difficult to learn because of their complexity and sheer number. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
nan
Article 370
Title@2025-07-07 (1): SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction
Title: SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction | SMART: Simulierte Schüler ausgerichtet mit Item Response Theorie für Frage Schwierigkeitsvorhersage | SMART: 与 “ 问题困难预测 “ 项目反应理论一致的模拟学生 2507.05129v1 |
Authors (5): Alexander Scarlatos, Nigel Fernandez, Christopher Ormerod, Susan Lottridge, Andrew Lan
Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get item difficulty estimates. Nor can this approach be applied in the cold-start setting, where items have not been seen before. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with an LLM-based scoring model, and fitting the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on a real-world student response dataset, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
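As a rough illustration of the IRT fitting step described above, the sketch below uses the one-parameter (Rasch) model, in which the probability of a correct response depends only on the gap between student ability and item difficulty. The abstract does not say which IRT variant is fitted; the Rasch form, the function names, and the grid-search estimator are all assumptions made for illustration.

```python
import math

def p_correct(ability, difficulty):
    """Rasch (1PL) probability that a student answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_difficulty(abilities, responses, grid=None):
    """Maximum-likelihood difficulty of one item via a coarse grid search.

    abilities : known ability of each (simulated) student
    responses : 1 if that student's response was scored correct, else 0
    """
    if grid is None:
        grid = [x / 100.0 for x in range(-400, 401)]  # -4.00 to 4.00

    def log_likelihood(b):
        total = 0.0
        for theta, r in zip(abilities, responses):
            p = p_correct(theta, b)
            total += math.log(p) if r else math.log(1.0 - p)
        return total

    return max(grid, key=log_likelihood)
```

In a simulation of the kind described, `responses` would come from scored answers of simulated students whose abilities are set by instruction, and the estimated difficulty rises as fewer of those students answer correctly.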
nan
Article 371
Title@2025-07-07 (1): An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques
Title: An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques | Eine Bewertung von großen Sprachmodellen zu Textzusammenfassungsaufgaben mittels Prompt Engineering Techniques | 评价关于使用即时工程技术进行文本归纳任务的大语言模型 2507.05123v1 |
Authors (3): Walid Mohamed Aly, Taysir Hassan A. Soliman, Amr Mohamed AbdelAziz
Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques including zero-shot and in-context learning, our study evaluates the performance using the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference times is conducted to better understand the trade-off between summarization quality and computational efficiency. For long documents, we introduce a sentence-based chunking strategy that enables LLMs with shorter context windows to summarize extended inputs in multiple stages. The findings reveal that while LLMs perform competitively on news and dialog tasks, their performance on long scientific documents improves significantly when aided by chunking strategies. In addition, notable performance variations were observed based on model parameters, dataset properties, and prompt design. These results offer actionable insights into how different LLMs behave across task types, contributing to ongoing research in efficient, instruction-based NLP systems.
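The sentence-based chunking idea mentioned above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the naive regex sentence splitter, the word-count budget standing in for a token limit, and the names `chunk_by_sentences` and `summarize_long` are all assumptions. The `summarize` argument is a caller-supplied function, for example a wrapper around an LLM call.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_by_sentences(text, max_words):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def summarize_long(text, summarize, max_words=200):
    """Two-stage summarization: summarize each chunk, then the joined summaries."""
    chunks = chunk_by_sentences(text, max_words)
    if len(chunks) <= 1:
        return summarize(text)
    partial = [summarize(chunk) for chunk in chunks]
    return summarize(" ".join(partial))
```

Splitting on sentence boundaries rather than at a fixed character offset keeps each chunk readable on its own, which is presumably why a sentence-based strategy suits models with short context windows.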
nan
Article 372
Title@2025-07-07 (1): Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration
Title: Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration | Wiederbelebung des Kulturerbes: Ein neuartiger Ansatz für eine umfassende Restaurierung historischer Dokumente | 恢复文化遗产:全面恢复历史文件的新办法 2507.05108v1 |
Authors (8): Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi, Pengwei Liu, Fengjun Guo, Lianwen Jin
Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians’ restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR’s remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.
nan
Article 373
Title@2025-07-07 (1): LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review
Title: LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review | LCDS: Logisch gesteuerte Entladung Zusammenfassung Erzeugung System Unterstützung Quelle Attribution und Experten-Review | LCDS: 逻辑控制排放控制简要生成系统支持源来源归属和专家审查 2507.05319v1 |
Authors (8): Cheng Yuan, Xinkai Rui, Yongqi Fan, Yawei Fan, Boyang Zhong, Jiacheng Wang, Weiyan Zhang, Tong Ruan
Despite the remarkable performance of Large Language Models (LLMs) in automated discharge summary generation, they still suffer from hallucination issues, such as generating inaccurate content or fabricating information without valid sources. In addition, electronic medical records (EMRs) typically consist of long-form data, making it challenging for LLMs to attribute the generated content to the sources. To address these challenges, we propose LCDS, a Logic-Controlled Discharge Summary generation system. LCDS constructs a source mapping table by calculating textual similarity between EMRs and discharge summaries to constrain the scope of summarized content. Moreover, LCDS incorporates a comprehensive set of logical rules, enabling it to generate more reliable silver discharge summaries tailored to different clinical fields. Furthermore, LCDS supports source attribution for generated content, allowing experts to efficiently review, provide feedback, and rectify errors. The resulting golden discharge summaries are subsequently recorded for incremental fine-tuning of LLMs. Our project and demo video are in the GitHub repository https://github.com/ycycyc02/LCDS.
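The source mapping step described above, which links summary content back to EMR text by textual similarity, might look roughly like the sketch below. The abstract does not specify the similarity measure or the table layout; the character-level ratio from Python's standard `difflib`, the 0.5 threshold, and the name `build_source_map` are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def build_source_map(summary_sentences, emr_segments, threshold=0.5):
    """Map each summary sentence to its most similar EMR segment.

    Returns a dict {summary index: EMR segment index or None}. None means
    no segment reached the similarity threshold, i.e. the sentence has no
    attributable source and may need expert review.
    """
    mapping = {}
    for i, sentence in enumerate(summary_sentences):
        best_idx, best_score = None, 0.0
        for j, segment in enumerate(emr_segments):
            # Character-level similarity in [0, 1]; case is ignored.
            score = SequenceMatcher(None, sentence.lower(), segment.lower()).ratio()
            if score > best_score:
                best_idx, best_score = j, score
        mapping[i] = best_idx if best_score >= threshold else None
    return mapping
```

A table of this kind supports both uses named in the abstract: restricting which EMR content a summary may draw on, and showing reviewers where each generated sentence came from.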
nan
Article 374
Title@2025-07-07 (1): AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics
Title: AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics | KI-getriebene Zytomorphologie Bildsynthese für medizinische Diagnostik | AI-Driven 用于医学诊断的细胞形态图像合成 2507.05063v1 |
Authors (3): Jan Carreras Boada, Rao Muhammad Umer, Carsten Marr
Biomedical datasets often contain a large sample imbalance and are subject to strict privacy constraints, which together hinder the development of accurate machine learning models. One potential solution is to generate synthetic images, as this can improve data availability while preserving patient privacy. However, it remains difficult to generate synthetic images of sufficient quality for training robust classifiers. In this work, we focus on the classification of single white blood cells, a key component in the diagnosis of hematological diseases such as acute myeloid leukemia (AML), a severe blood cancer. We demonstrate how synthetic images, generated with a fine-tuned stable diffusion model using LoRA weights and guided by real few-shot samples of the target white blood cell classes, can enhance classifier performance when data are limited. When training a ResNet classifier, accuracy increased from 27.3% to 78.4% (+51.1%) by adding 5000 synthetic images per class to a small and highly imbalanced real dataset. For a CLIP-based classifier, the accuracy improved from 61.8% to 76.8% (+15.0%). The synthetic images are highly similar to real images, and they can help overcome dataset limitations, enhancing model generalization. Our results establish synthetic images as a tool in biomedical research, improving machine learning models, and facilitating medical diagnosis and research.
nan
Article 375
Title@2025-07-07 (1): End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
Title: End-to-End Evaluation for Low-Latency Simultaneous Speech Translation | End-to-End-Bewertung für simultane Sprachübersetzungen mit geringer Latenz | 低长期同声同声发言翻译端对端评价 2308.03415v4 |
Authors (13): Christian Huber, Tu Anh Dinh, Carlos Mullov, Ngoc Quan Pham, Thai Binh Nguyen, Fabian Retkowski, Stefan Constantin, Enes Yavuz Ugan, Danni Liu, Zhaolin Li, Sai Koneru, Jan Niehues, Alexander Waibel
The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated, and often it is not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion. This includes the segmentation of the audio as well as the run-time of the different components. We then compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded as well as end-to-end systems. Finally, the framework supports automatic evaluation of translation quality and latency, and also provides a web interface to show the low-latency model outputs to the user.
nan
Article 376
Title@2025-07-07 (1): Using Large Multimodal Models to Extract Knowledge Components for Knowledge Tracing from Multimedia Question Information
Title: Using Large Multimodal Models to Extract Knowledge Components for Knowledge Tracing from Multimedia Question Information | Verwendung von großen multimodalen Modellen zum Extrahieren von Wissenskomponenten für die Wissensverfolgung aus Multimedia-Frageinformationen | 利用大型多模式模式从多媒体问题信息中提取知识组成部分,以从多媒体问题信息中追踪知识 2409.20167v2 |
Authors (4): Hyeongdon Moon, Richard Davis, Seyed Parsa Neshaei, Pierre Dillenbourg
Knowledge tracing models have enabled a range of intelligent tutoring systems to provide feedback to students. However, existing methods for knowledge tracing in learning sciences are predominantly reliant on statistical data and instructor-defined knowledge components, making it challenging to integrate AI-generated educational content with traditional established methods. We propose a method for automatically extracting knowledge components from educational content using instruction-tuned large multimodal models. We validate this approach by comprehensively evaluating it against knowledge tracing benchmarks in five domains. Our results indicate that the automatically extracted knowledge components can effectively replace human-tagged labels, offering a promising direction for enhancing intelligent tutoring systems in limited-data scenarios, achieving more explainable assessments in educational settings, and laying the groundwork for automated assessment.
nan
Article 377
Title@2025-07-07 (1): Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study
Title: Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study | Warum kämpfen Open Source LLMs mit Datenanalyse? Eine systematische empirische Studie | 开放源码LLMs为何要与数据分析斗争?系统的经验研究 2506.19794v2 |
Authors (10): Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang
Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate models across three dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs’ analytical reasoning capabilities.
nan
Article 378
Title@2025-07-07 (1): Verified Language Processing with Hybrid Explainability: A Technical Report
Title: Verified Language Processing with Hybrid Explainability: A Technical Report | Verifizierte Sprachverarbeitung mit hybrider Erklärbarkeit: Ein technischer Bericht | 经核实的有混合解释性的语文处理:技术报告 2507.05017v1 |
Authors (3): Oliver Robert Fox, Giacomo Bergami, Graham Morgan
The volume and diversity of digital information have led to a growing reliance on Machine Learning techniques, such as Natural Language Processing, for interpreting and accessing appropriate data. While vector and graph embeddings represent data for similarity tasks, current state-of-the-art pipelines lack guaranteed explainability, failing to determine similarity for given full texts accurately. These considerations can also be applied to classifiers exploiting generative language models with logical prompts, which fail to correctly distinguish between logical implication, indifference, and inconsistency, despite being explicitly trained to recognise the first two classes. We present a novel pipeline designed for hybrid explainability to address this. Our methodology combines graphs and logic to produce First-Order Logic representations, creating machine- and human-readable representations through Montague Grammar. Preliminary results indicate the effectiveness of this approach in accurately capturing full text similarity. To the best of our knowledge, this is the first approach to differentiate between implication, inconsistency, and indifference for text classification tasks. To address the limitations of existing approaches, we use three self-contained datasets annotated for the former classification task to determine the suitability of these approaches in capturing sentence structure equivalence, logical connectives, and spatiotemporal reasoning. We also use these data to compare the proposed method with language models pre-trained for detecting sentence entailment. The results show that the proposed method outperforms state-of-the-art models, indicating that natural language understanding cannot be easily generalised by training over extensive document corpora. This work offers a step toward more transparent and reliable Information Retrieval from extensive textual data.
nan
Article 379
Title@2025-07-07 (1): Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification
Title: Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification | Co-DETECT: Collaborative Discovery of Edge Cases in der Textklassifikation | 共同目的:在文本分类中合作发现边缘情况 2507.05010v1 |
Authors (13): Chenfei Xiong, Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Lorena Calvo-Bartolomé, Alexander Hoyle, Zhijing Jin, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Mennatallah El-Assady, Elliott Ash
We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists users in incorporating edge case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. An extensive user study, together with qualitative and quantitative analyses, demonstrates the effectiveness of Co-DETECT.
nan
Article 380
Title@2025-07-07 (1): From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems
Title: From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems | Von der Autonomie zur Agentur: Agentenfahrzeuge für Mensch-zentrierte Mobilitätssysteme | 从自主到机构:以人为中心的流动系统机动车辆 2507.04996v1 |
Authors (1): Jiangbo Yu
Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Accordingly, autonomous vehicles (AuVs) are defined as systems capable of perceiving their environment and executing preprogrammed tasks independently of external input. However, both research and real-world deployments increasingly showcase vehicles that demonstrate behaviors beyond this definition (including the SAE levels 1 to 6), such as interaction with humans and machines, goal adaptation, contextual reasoning, external tool use, and long-term planning, particularly with the integration of large language models (LLMs) and agentic AI systems. These developments reveal a conceptual gap between technical autonomy and the broader cognitive and social capabilities needed for future human-centered mobility systems. To address this, we introduce the concept of agentic vehicles (AgVs), referring to vehicles that integrate agentic AI to reason, adapt, and interact within complex environments. This paper presents a systems-level framework to characterize AgVs, focusing on their cognitive and communicative layers and differentiating them from conventional AuVs. It synthesizes relevant advances in agentic AI, robotics, multi-agent systems, and human-machine interaction, and highlights how agentic AI, through high-level reasoning and tool use, can function not merely as computational tools but as interactive agents embedded in mobility ecosystems. The paper concludes by identifying key challenges in the development and governance of AgVs, including safety, real-time control, public acceptance, ethical alignment, and regulatory frameworks.
nan
Article 381
Title@2025-07-07 (1): Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models
Title: Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models | Kann Video LLMs zur Antwort ablehnen? Ausrichtung für Beantwortbarkeit in Video Large Language Models | 视频大语言模式的可响应性对齐 2507.04976v1 |
Authors (4): Eunseop Yoon, Hee Suk Yoon, Mark A. Hasegawa-Johnson, Chang D. Yoo
In the broader context of deep learning, Multimodal Large Language Models have achieved significant breakthroughs by leveraging powerful Large Language Models as a backbone to align different modalities into the language space. A prime exemplification is the development of Video Large Language Models (Video-LLMs). While numerous advancements have been proposed to enhance the video understanding capabilities of these models, they are predominantly trained on questions generated directly from video content. However, in real-world scenarios, users often pose questions that extend beyond the informational scope of the video, highlighting the need for Video-LLMs to assess the relevance of the question. We demonstrate that even the best-performing Video-LLMs fail to reject unfit questions, not necessarily due to a lack of video understanding, but because they have not been trained to identify and refuse such questions. To address this limitation, we propose alignment for answerability, a framework that equips Video-LLMs with the ability to evaluate the relevance of a question based on the input video and appropriately decline to answer when the question exceeds the scope of the video, as well as an evaluation framework with a comprehensive set of metrics designed to measure model behavior before and after alignment. Furthermore, we present a pipeline for creating a dataset specifically tailored for alignment for answerability, leveraging existing video-description paired datasets.
nan
Article 382
Title@2025-07-07 (1): ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
Title: ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation | ArtefakteBench: Überbrückung der visuell-interaktiven Lücke in der LLM-Codegenerierung | 人工合成:弥合LLM代码生成评估中的视觉互动差距 2507.04952v1 |
Authors (32): Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
nan
Article 383
Title@2025-07-07 (1): ReLoop: “Seeing Twice and Thinking Backwards” via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding
Title: ReLoop: “Seeing Twice and Thinking Backwards” via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding | ReLoop: “Zweimal sehen und zurückdenken” über Closed-Loop-Training zu Mitigate-Halluzinationen im multimodalen Verständnis | ReLooop:通过闭路培训,为多模式理解的假幻觉提供“双眼和反向思考” 2507.04943v1 |
Authors (3): Jianjiang Yang, Ziyan Huang, Yanshu Li
While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in open-ended visual question answering, they remain vulnerable to hallucinations. These are outputs that contradict or misrepresent input semantics, posing a critical challenge to reliability and factual consistency. Existing methods often rely on external verification or post-hoc correction, lacking an internal mechanism to validate outputs directly during training. To bridge this gap, we propose ReLoop, a unified closed-loop training framework that encourages multimodal consistency for cross-modal understanding in MLLMs. ReLoop adopts a ring-shaped structure that integrates three complementary consistency feedback mechanisms, obliging MLLMs to “see twice and think backwards”. Specifically, ReLoop employs the frozen Consistency Feedback Plugin (CFP), comprising semantic reconstruction, visual description, and an attention supervision module for attention alignment. These components collectively enforce semantic reversibility, visual consistency, and interpretable attention, enabling the model to correct its outputs during training. Extensive evaluations and analyses demonstrate the effectiveness of ReLoop in reducing hallucination rates across multiple benchmarks, establishing a robust method for hallucination mitigation in MLLMs. We will release our source code and data in the camera-ready version.
Article 384
Title@2025-07-07 (1): Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science?
Title: Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science? | Sind Informations-Retrieval Ansätze gut bei der Harmonisierung Longitudinal Survey Fragen in der Sozialwissenschaft? | 信息检索方法在协调社会科学的纵向调查问题方面是否很好? 2504.20679v2 |
Authors (4): Wing Yan Li, Zeqiang Wang, Jon Johnson, Suparna De
Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces dual challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies as well as between question and response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multi-disciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task of identifying concept (e.g. Housing, Job, etc.) equivalence across question and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance with other approaches performing comparably. Additionally, the re-ranking of the probabilistic model’s results with neural models only introduces modest improvements of 0.07 at most in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have a low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mismatched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.
Article 385
Title@2025-07-07 (1): Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes
Title: Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes | Verstehen LLMs die Sicherheit ihrer Eingaben? Schulungsfreie Moderation über Latent Prototypen | LLM女士了解其投入的安全性吗?通过中原原型进行无培训的调节 2502.16174v2 |
Authors (6): Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert
With the rise of LLMs, ensuring model safety and alignment has become a critical concern. While modern instruction-finetuned LLMs incorporate alignment during training, they still frequently require moderation tools to prevent unsafe behavior. The most common approach to moderation is to use guard models that flag unsafe inputs. However, guards require costly training and are typically limited to fixed-size, pre-trained options, making them difficult to adapt to evolving risks and resource constraints. We hypothesize that instruction-finetuned LLMs already encode safety-relevant information internally and explore training-free safety assessment methods that work with off-the-shelf models. We show that simple prompting allows models to recognize harmful inputs they would otherwise mishandle. We also demonstrate that safe and unsafe prompts are distinctly separable in the models’ latent space. Building on this, we introduce the Latent Prototype Moderator (LPM), a training-free moderation method that uses Mahalanobis distance in latent space to assess input safety. LPM is a lightweight, customizable add-on that generalizes across model families and sizes. Our method matches or exceeds state-of-the-art guard models across multiple safety benchmarks, offering a practical and flexible solution for scalable LLM moderation.
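As an illustration of the Mahalanobis-distance idea described above, here is a minimal sketch. It is not the authors' implementation: the shared-covariance estimate, the ridge term, the function names, and the synthetic Gaussian vectors standing in for LLM hidden states are all assumptions made for this example.

```python
import numpy as np

def fit_prototypes(embeddings, labels):
    """Estimate one mean vector per class and a shared covariance (assumption)."""
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    means = {c: embeddings[labels == c].mean(axis=0) for c in classes}
    # Pool the class-centered vectors to estimate a single covariance matrix.
    centered = np.vstack([embeddings[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(embeddings)
    cov += 1e-3 * np.eye(cov.shape[0])  # small ridge term keeps the matrix invertible
    return means, np.linalg.inv(cov)

def mahalanobis_sq(x, mean, precision):
    """Squared Mahalanobis distance between x and a class prototype."""
    d = x - mean
    return float(d @ precision @ d)

def classify(x, means, precision):
    """Assign x to the class whose prototype is nearest in Mahalanobis distance."""
    return min(means, key=lambda c: mahalanobis_sq(x, means[c], precision))

# Synthetic stand-ins for latent representations of safe and unsafe prompts.
rng = np.random.default_rng(0)
safe = rng.normal(loc=0.0, scale=1.0, size=(200, 8))
unsafe = rng.normal(loc=3.0, scale=1.0, size=(200, 8))
X = np.vstack([safe, unsafe])
y = ["safe"] * 200 + ["unsafe"] * 200

means, precision = fit_prototypes(X, y)
```

Because only class means and one covariance matrix are estimated, adding a new risk category in this sketch amounts to computing one more mean vector, which is in the spirit of the "customizable add-on" the abstract describes.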
Article 386
Title@2025-07-07 (1): O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d’apprentissage pour l’adaptation de modèles de langue à un domaine
Title: O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d’apprentissage pour l’adaptation de modèles de langue à un domaine | O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d’apprentissage pour l’adaptation de modèles de langue à un domaine | O_FT@EvalLLM2025:非属域适应适应模式的技术和技术评估比较 2507.04895v1 |
Authors (8): Ismaël Rousseau, Claire Perroux, Pierre Adam, Thomas Girault, Lionel Delphin-Poulat, Morgan Veyret, Gwénolé Lecorvé, Géraldine Damnati
This paper presents the work carried out by the O_FT team, a joint team of Orange and Ouest-France, on adapting language models to the defense domain as part of the EvalLLM2025 challenge. This work focused on adapting the \texttt{Mistral-7B-Instruct-v0.3} model using classical techniques of continued pre-training and instruction-tuning. The core of our efforts is based on collecting, generating, and selecting data for these two stages as well as for model evaluation. Experiments show that our adapted models have better domain-specific knowledge and improved domain-specific task processing skills, along with comparable (or even superior) performance on general knowledge and skills. Considering the carbon footprint of our adaptations, this work demonstrates the feasibility of domain adaptation for relatively small models. – Ce document présente les travaux réalisés par l’équipe O_FT conjointe à Orange et Ouest-France sur l’adaptation de modèles de langue au domaine de la défense dans le cadre du challenge EvalLLM2025. Ces travaux se sont concentrés sur l’adaptation du modèle \texttt{Mistral-7B-Instruct-v0.3} avec des techniques classiques de poursuite du pré-entraînement et d’affinage sur instructions. L’essentiel de nos travaux a porté sur la constitution, génération et sélection de données pour ces deux étapes ainsi que pour l’évaluation des modèles. Les expériences montrent que nos modèles adaptés ont de meilleures connaissances de fond et une meilleure capacité de traitement de tâches sur le domaine de la défense, ainsi que des performances comparables (voire supérieures) sur des connaissances ou capacités généralistes. Mis au regard des empreintes carbones de nos adaptations, ces travaux démontrent ainsi la viabilité de l’adaptation à un domaine de modèles relativement petits.
Article 387
Title@2025-07-07 (1): MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction
Title: MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction | MARBLE: Eine regelbasierte LLM-Reasoning-Engine für die Vorhersage von Unfällen mit Schweregrad | 事故严重性预测的多机构、以规则为基础的多机构、基于规则的LLM 理疗引擎 2507.04893v1 |
Authors (2): Kaleem Ullah Qasim, Jiashu Zhang
Accident severity prediction plays a critical role in transportation safety systems but is a persistently difficult task due to incomplete data, strong feature dependencies, and severe class imbalance in which rare but high-severity cases are underrepresented and hard to detect. Existing methods often rely on monolithic models or black-box prompting, which struggle to scale in noisy, real-world settings and offer limited interpretability. To address these challenges, we propose MARBLE, a multi-agent rule-based LLM engine that decomposes the severity prediction task across a team of specialized reasoning agents, including an interchangeable ML-backed agent. Each agent focuses on a semantic subset of features (e.g., spatial, environmental, temporal), enabling scoped reasoning and modular prompting without the risk of prompt saturation. Predictions are coordinated through either rule-based or LLM-guided consensus mechanisms that account for class rarity and confidence dynamics. The system retains structured traces of agent-level reasoning and coordination outcomes, supporting in-depth interpretability and post-hoc performance diagnostics. Across both UK and US datasets, MARBLE consistently outperforms traditional machine learning classifiers and state-of-the-art (SOTA) prompt-based reasoning methods, including Chain-of-Thought (CoT), Least-to-Most (L2M), and Tree-of-Thought (ToT), achieving nearly 90% accuracy where others plateau below 48%. This performance redefines the practical ceiling for accident severity classification under real-world noise and extreme class imbalance. Our results position MARBLE as a generalizable and interpretable framework for reasoning under uncertainty in safety-critical applications.
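One way to picture a rule-based consensus that accounts for class rarity and agent confidence is sketched below. The abstract does not specify the actual rules, so the inverse-prior weighting, the function name, the class labels, and the prior values here are purely hypothetical.

```python
from collections import defaultdict

def rule_based_consensus(agent_votes, class_prior):
    """Combine (label, confidence) votes, up-weighting rare classes.

    Hypothetical rule: each vote contributes confidence / prior(label),
    so a confident vote for a rare class can outweigh several votes
    for a common one.
    """
    scores = defaultdict(float)
    for label, confidence in agent_votes:
        scores[label] += confidence / class_prior[label]
    return max(scores, key=scores.get)

# Illustrative priors for an imbalanced severity distribution (made up).
class_prior = {"slight": 0.80, "serious": 0.18, "fatal": 0.02}

# Three feature-scoped agents (e.g. spatial, environmental, temporal) vote.
votes = [("slight", 0.60), ("slight", 0.55), ("fatal", 0.40)]
decision = rule_based_consensus(votes, class_prior)
```

In this toy setting the single "fatal" vote wins because its rarity weight dominates. A real system would likely temper that weighting, which is one reason the numbers above should be read as placeholders.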
Article 388
Title@2025-07-07 (1): Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
Title: Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations | Emergent Semantics Beyond Token Embeddings: Transformer LMs mit gefrorenen visuellen Unicode-Darstellungen | 超越 Tok 嵌入的新兴语义: 具有冷冻视觉统一符号的变形LMs 2507.04886v1 |
Authors (1): A. Bochkov
Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational “meaning vectors.” This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to “representational interference” in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer’s compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
Article 389
Title@2025-07-07 (1): Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions
Title: Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions | Aufbau von offenen retrieval-Konversations-Frage-Antwortsystemen durch Generierung synthetischer Daten und Dekontextualisierung von Benutzerfragen | 通过生成合成数据和减少文本用户问题,建立开放式检索连接问答解答系统 2507.04884v1 |
Authors (7): Christos Vlachos, Nikolaos Stylianou, Alexandra Fiotaki, Spiros Methenitis, Elisavet Palogiannidi, Themos Stafylakis, Ion Androutsopoulos
We consider open-retrieval conversational question answering (OR-CONVQA), an extension of question answering where system responses need to be (i) aware of dialog history and (ii) grounded in documents (or document fragments) retrieved per question. Domain-specific OR-CONVQA training datasets are crucial for real-world applications, but hard to obtain. We propose a pipeline that capitalizes on the abundance of plain text documents in organizations (e.g., product documentation) to automatically produce realistic OR-CONVQA dialogs with annotations. As in real-world human-annotated OR-CONVQA datasets, we generate in-dialog question-answer pairs, self-contained (decontextualized, e.g., no referring expressions) versions of user questions, and propositions (sentences expressing prominent information from the documents) in which the system responses are grounded. We show how the synthetic dialogs can be used to train efficient question rewriters that decontextualize user questions, allowing existing dialog-unaware retrievers to be utilized. The retrieved information and the decontextualized question are then passed on to an LLM that generates the system’s response.
Article 390
Title@2025-07-07 (1): Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite
Title: Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite | Spanische Texte aus der Vergangenheit transkribieren: Experimente mit Transkribus, Tesseract und Granit | 翻录过去西班牙文本:与Transkribus、Tesseract和Granite的实验 2507.04878v1 |
Authors (4): Yanco Amor Torterolo-Orta, Jaione Macicior-Mitxelena, Marina Miguez-Lamanuzzi, Ana García-Serrano
This article presents the experiments and results obtained by the GRESEL team in the IberLEF 2025 shared task PastReader: Transcribing Texts from the Past. Three types of experiments were conducted with the dual aim of participating in the task and enabling comparisons across different approaches. These included the use of a web-based OCR service, a traditional OCR engine, and a compact multimodal model. All experiments were run on consumer-grade hardware, which, despite lacking high-performance computing capacity, provided sufficient storage and stability. The results, while satisfactory, leave room for further improvement. Future work will focus on exploring new techniques and ideas using the Spanish-language dataset provided by the shared task, in collaboration with Biblioteca Nacional de España (BNE).
Article 391
Title@2025-07-07 (1): $\textit{Grahak-Nyay:}$ Consumer Grievance Redressal through Large Language Models
Title: $\textit{Grahak-Nyay:}$ Consumer Grievance Redressal through Large Language Models | $\textit{Grahak-Nyay:}$ Consumer Grievance Redressal durch große Sprachmodelle | $\textit{Grahak-Nyay:}$ 以大语言模式纠正消费者冤情 2507.04854v1 |
Authors (9): Shrey Ganatra, Swapnil Bhattacharyya, Harshvivek Kashid, Spandan Anaokar, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani, Pushpak Bhattacharyya
Access to consumer grievance redressal in India is often hindered by procedural complexity, legal jargon, and jurisdictional challenges. To address this, we present $\textbf{Grahak-Nyay}$ (Justice-to-Consumers), a chatbot that streamlines the process using open-source Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Grahak-Nyay simplifies legal complexities through a concise and up-to-date knowledge base. We introduce three novel datasets: $\textit{GeneralQA}$ (general consumer law), $\textit{SectoralQA}$ (sector-specific knowledge) and $\textit{SyntheticQA}$ (for RAG evaluation), along with $\textit{NyayChat}$, a dataset of 300 annotated chatbot conversations. We also introduce $\textit{Judgments}$ data sourced from Indian Consumer Courts to aid the chatbot in decision making and to enhance user trust. We also propose $\textbf{HAB}$ metrics ($\textbf{Helpfulness, Accuracy, Brevity}$) to evaluate chatbot performance. Legal domain experts validated Grahak-Nyay’s effectiveness. Code and datasets will be released.
Article 392
Title@2025-07-07 (1): Dialogue-Based Multi-Dimensional Relationship Extraction from Novels
Title: Dialogue-Based Multi-Dimensional Relationship Extraction from Novels | Dialogbasierte multidimensionale Beziehungsextraktion aus Romanen | 以对话为基础的从新奇中提取的多不同关系 2507.04852v1 |
Authors (6): Yuchen Yan, Hanjie Zhao, Senbin Zhu, Hongde Liu, Zhihong Zhang, Yuxiang Jia
Relation extraction is a crucial task in natural language processing, with broad applications in knowledge graph construction and literary analysis. However, the complex context and implicit expressions in novel texts pose significant challenges for automatic character relationship extraction. This study focuses on relation extraction in the novel domain and proposes a method based on Large Language Models (LLMs). By incorporating relationship dimension separation, dialogue data construction, and contextual learning strategies, the proposed method enhances extraction performance. Leveraging dialogue structure information, it improves the model’s ability to understand implicit relationships and demonstrates strong adaptability in complex contexts. Additionally, we construct a high-quality Chinese novel relation extraction dataset to address the lack of labeled resources and support future research. Experimental results show that our method outperforms traditional baselines across multiple evaluation metrics and successfully facilitates the automated construction of character relationship networks in novels.
Article 393
Title@2025-07-07 (1): Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems
Title: Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems | Spec-TOD: Ein speziell angepasstes LLM-Framework für effiziente aufgabenorientierte Dialogsysteme | Spec-TOD:高效以任务为导向的对话系统专门指导性LLM框架 2507.04841v1 |
Authors (4): Quang-Vinh Nguyen, Quang-Chieu Nguyen, Hoang Pham, Khac-Hoai Nam Bui
Task-oriented dialogue (TOD) systems facilitate goal-driven interactions between users and machines. While recent advances in deep learning have improved the performance, TOD systems often struggle in low-resource scenarios with limited labeled data. To address this challenge, we propose Spec-TOD, a novel framework designed to train an end-to-end TOD system with limited data. Spec-TOD introduces two main innovations: (i) a novel specialized end-to-end TOD framework that incorporates explicit task instructions for instruction-tuned large language models (LLMs), and (ii) an efficient training strategy that leverages lightweight, specialized LLMs to achieve strong performance with minimal supervision. Experiments on the MultiWOZ dataset, a widely used TOD benchmark, demonstrate that Spec-TOD achieves competitive results while significantly reducing the need for labeled data. These findings highlight the potential of the proposed framework in advancing efficient and effective TOD systems in low-resource settings.
Article 394
Title@2025-07-07 (1): CritiQ: Mining Data Quality Criteria from Human Preferences
Title: CritiQ: Mining Data Quality Criteria from Human Preferences | CritiQ: Mining Data Quality Criteria from Human Preferences | 关键:来自人类首选的采矿数据质量标准 2502.19279v2 |
Authors (11): Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
Language models depend heavily on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and possess reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.
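The majority voting over pairwise judgments mentioned above can be sketched as follows. This is an illustrative assumption about how worker verdicts might be aggregated and scored against human-annotated pairs; the function names, the "A"/"B" labels, and the tie-handling rule are not taken from the paper.

```python
from collections import Counter

def majority_vote(judgments):
    """Return the majority label among "A"/"B" verdicts, or None on a tie.

    Each verdict is assumed to come from one worker agent applying one
    quality criterion to a pair of documents (A vs. B); other values,
    such as abstentions, are ignored.
    """
    counts = Counter(j for j in judgments if j in ("A", "B"))
    if not counts:
        return None
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:
        return None  # tie between A and B: no decision
    return top

def agreement_with_humans(predicted, gold):
    """Fraction of decided pairs where the vote matches the human preference."""
    decided = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    if not decided:
        return 0.0
    return sum(p == g for p, g in decided) / len(decided)

# Three annotated pairs, each judged under three hypothetical criteria.
per_pair_verdicts = [["A", "A", "B"], ["B", "B", "B"], ["A", "B"]]
predicted = [majority_vote(v) for v in per_pair_verdicts]
accuracy = agreement_with_humans(predicted, ["A", "B", "A"])
```

Returning `None` on ties keeps undecided pairs out of the accuracy figure rather than guessing, which is one possible design choice among several.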
Article 395
Title@2025-07-07 (1): RewardAnything: Generalizable Principle-Following Reward Models
Title: RewardAnything: Generalizable Principle-Following Reward Models | BelohnungAlles: Allgemeines Prinzip-folgende Belohnung Modelle | 奖励任何奖励:可普遍适用的遵循原则的奖赏模式 2506.03637v2 |
Authors (10): Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye
Reward Models (RMs), essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse real-world needs, from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often produces biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. We achieve SotA performance with RewardAnything on a traditional RM benchmark simply by specifying a well-defined principle, and results on RABench show we excel in adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods, and we show through a case study how to automatically and efficiently align LLMs with only natural language principles.
Article 396
Title@2025-07-07 (1): BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance
Title: BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance | BiMa: Auf dem Weg zu Biases Milderung für Text-Video-Retrieval über Szeneelement-Anleitung | Bima:通过场景元素指导,争取减缓对文本视频检索的“双轨减缓” 2506.03589v3 |
Authors (5): Huy Le, Nhat Chung, Tung Kieu, Anh Nguyen, Ngan Le
Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model’s bias mitigation capability is consistently validated by its strong results on out-of-distribution retrieval tasks.
Article 397
Title@2025-07-07 (1): From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach
Title: From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach | Von der Vision zur Sprache durch Diagramm der Ereignisse in Raum und Zeit: Ein erklärbarer selbstüberwachter Ansatz | 通过空间和时间事件图从视野到语言:可解释的自我监督方法 2507.04815v1 |
Authors (2): Mihai Masala, Marius Leordeanu
The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive human manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks to produce the final natural language description. Moreover, we demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using standard evaluation metrics, human annotations, and consensus from ensembles of state-of-the-art VLMs.
Article 398
Title@2025-07-07 (1): A Survey of Pun Generation: Datasets, Evaluations and Methodologies
Title: A Survey of Pun Generation: Datasets, Evaluations and Methodologies | Eine Umfrage der Pun-Generation: Datensätze, Bewertungen und Methoden | 对 “ 制作豆类:数据集、评价和方法 “ 的调查 2507.04793v1 |
Authors (6): Yuchen Su, Yonghua Zhu, Ruofan Wang, Zijian Huang, Diana Benavides-Prado, Michael Witbrock
Pun generation seeks to creatively modify linguistic elements in text to produce humour or evoke double meanings. It also aims to preserve coherence and contextual appropriateness, making it useful in creative writing and entertainment across various media and contexts. Although pun generation has received considerable attention in computational linguistics, there is currently no dedicated survey that systematically reviews this specific area. To bridge this gap, this paper provides a comprehensive review of pun generation datasets and methods across different stages, including conventional approaches, deep learning techniques, and pre-trained language models. Additionally, we summarise both automated and human evaluation metrics used to assess the quality of pun generation. Finally, we discuss the research challenges and propose promising directions for future work.
Article 399
Title@2025-07-07 (1): Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems
Title: Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems | Qwen vs. Gemma Integration mit Whisper: Eine vergleichende Studie in mehrsprachigen Sprach-LLM-Systemen | Quwen诉Gemma 与低语融合:多语种语言LLLM系统比较研究 2506.13596v2 |
Authors (3): Tuan Nguyen, Long-Vu Hoang, Huy-Dat Tran
This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance, with a private test average WER/CER of 16.63% using Gemma3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.
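For readers unfamiliar with the WER/CER figures quoted above, the following sketch shows the standard edit-distance definition of both metrics. It is a generic illustration, not the challenge's official scoring script: text normalization and per-language handling, which a real evaluation would apply, are omitted, and the function names are chosen for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One substitution ("sat" -> "sit") plus one deletion ("the") over 6 words.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")
```

CER is typically reported instead of WER for languages without whitespace word boundaries, which is why multilingual results are often given as a combined WER/CER average.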
Article 400
Title@2025-07-07 (1): Reason to Rote: Rethinking Memorization in Reasoning
Title: Reason to Rote: Rethinking Memorization in Reasoning | Grund zu rotieren: Die Erinnerung an die Vernunft neu denken | 旋转理由:重新思考在理由方面的记忆 2507.04782v1 |
Authors (6): Yupei Du, Philipp Mondorf, Silvia Casola, Yuekun Yao, Robert Litschko, Barbara Plank
Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening on reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals that memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding light on the intriguing phenomenon of benign memorization.
Article 401
Title@2025-07-07 (1): ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems
Title: ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems | ABench-Physics: Benchmarking Physical Reasoning in LLMs über hochschwierige und dynamische Physikprobleme | ABench-物理:通过高难度和动态物理问题确定LLMLM中物理原因的基准 2507.04766v1 |
Authors (12): Yiming Zhang, Yingfan Ma, Yanmei Gu, Zhengkai Yang, Yihong Zhuang, Feng Wang, Zenan Huang, Yuanyuan Wang, Chao Huang, Bowen Song, Cheng Lin, Junbo Zhao
Large Language Models (LLMs) have shown impressive performance in domains such as mathematics and programming, yet their capabilities in physics remain underexplored and poorly understood. Physics poses unique challenges that demand not only precise computation but also deep conceptual understanding and physical modeling skills. Existing benchmarks often fall short due to limited difficulty, multiple-choice formats, and static evaluation settings that fail to capture physical modeling ability. In this paper, we introduce ABench-Physics, a novel benchmark designed to rigorously evaluate LLMs’ physical reasoning and generalization capabilities. ABench-Physics consists of two components: Phy_A, a static set of 400 graduate- or Olympiad-level problems; and Phy_B, a dynamic subset of 100 problems equipped with an automatic variation engine to test model robustness across changing conditions. All questions require precise numerical answers, with strict formatting and tolerance constraints. Our evaluation of several state-of-the-art LLMs reveals substantial performance gaps, highlighting persistent limitations in physical reasoning, especially in generalization to dynamic variants. ABench-Physics provides a challenging and diagnostic framework for advancing scientific reasoning in LLMs.
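To make the two evaluation ideas above concrete (numerical answers checked within a tolerance, and a dynamic variant generator), here is a small sketch. The 1% relative tolerance, the free-fall template, the parameter range, and the function names are all assumptions for illustration; the abstract does not state the benchmark's actual tolerance or how its variation engine works.

```python
import math
import random

def is_correct(predicted, expected, rel_tol=0.01):
    """Accept a numerical answer within a relative tolerance (1% is assumed)."""
    return math.isclose(predicted, expected, rel_tol=rel_tol)

def make_variant(rng):
    """Generate one variant of a toy free-fall problem with a fresh parameter.

    Hypothetical template: time to fall from height h, t = sqrt(2 h / g).
    """
    h = round(rng.uniform(5.0, 50.0), 1)  # drop height in metres
    g = 9.81                              # gravitational acceleration in m/s^2
    question = (
        f"An object is dropped from rest at a height of {h} m. "
        "Ignoring air resistance, how long does it take to reach the ground (in s)?"
    )
    return question, math.sqrt(2 * h / g)

rng = random.Random(0)
question, answer = make_variant(rng)
```

A model that has memorized the answer to one fixed instance would fail once `h` changes, which is the kind of robustness a dynamic subset is meant to probe.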
Article 402
Title@2025-07-07 (1): Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments
Title: Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments | Thunder-DeID: Genauer und effizienter De-Identifizierungsrahmen für Urteile des koreanischen Gerichts | Thunder-DeID:韩国法院判决的准确和有效的取消识别框架 2506.15266v2 |
Authors (5): Sungeun Hahm, Heejin Kim, Gyuseong Lee, Hyunji Park, Jaejin Lee
To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.
Article 403
Title@2025-07-07 (1): CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering
Title: CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering | CoSteer: Kollaborative Decoding-Time-Personalisierung über lokale Delta-Lenkung | 科斯特:通过当地三角洲指导,协作解码-时间个性化 2507.04756v1 |
Authors (9): Hang Lv, Sheng Liang, Hao Wang, Hongchao Gu, Yaxiong Wu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen
Personalized text generation has become crucial for adapting language models to diverse and evolving users’ personal context across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a dilemma: large cloud-based models lack access to localized user-specific information, while small on-device models cannot match the generation quality of their cloud counterparts. To address this dichotomy, we present CoSteer, a novel collaborative framework that enables decoding-time personalization through localized delta steering. Our key insight lies in leveraging the logits difference between personal context-aware and -agnostic outputs from local small models as steering signals for cloud-based LLMs. Specifically, we formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM’s logits within the on-device environment. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors, while maintaining cloud-based LLMs’ general capabilities without fine-tuning. Through comprehensive experiments on various personalized generation tasks, we demonstrate that CoSteer effectively assists LLMs in generating personalized content by leveraging locally stored user profiles and histories, ensuring privacy preservation through on-device data processing while maintaining acceptable computational overhead.
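The idea of using a local logits difference as a steering signal can be sketched as follows. This is a simplified additive version under stated assumptions, not the authors' online-learning formulation: the function names, the `alpha` weight, and the toy logits are hypothetical, and both models are assumed to share one vocabulary.

```python
def steer_logits(cloud_logits, local_with_ctx, local_without_ctx, alpha=1.0):
    """Add the local personalization delta to the cloud model's logits."""
    # Delta between the context-aware and context-agnostic outputs of the small local model.
    delta = [w - wo for w, wo in zip(local_with_ctx, local_without_ctx)]
    # alpha is an assumed steering strength; the paper optimizes the adjustment online instead.
    return [c + alpha * d for c, d in zip(cloud_logits, delta)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy three-token vocabulary (made-up numbers for illustration).
cloud = [2.0, 1.5, 0.1]        # remote LLM logits, no personal context
with_ctx = [0.2, 1.4, 0.0]     # local model, with the user's context
without_ctx = [0.3, 0.1, 0.0]  # local model, without the user's context

steered = steer_logits(cloud, with_ctx, without_ctx)
print(argmax(cloud), argmax(steered))
```

In this toy case the personal context shifts the preferred token from index 0 to index 1, while the raw user context never leaves the device.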
Article 404
Title@2025-07-07 (1): LLMs as Architects and Critics for Multi-Source Opinion Summarization
Title: LLMs as Architects and Critics for Multi-Source Opinion Summarization | LLMs als Architekten und Kritiker für die Mehr-Quellen-Stellungnahme Zusammenfassung | 作为多来源意见总结的建筑师和批评家的LLM 2507.04751v1 |
Authors (7): Anuj Attri, Arnav Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Multi-source Opinion Summarization (M-OS) extends beyond traditional opinion summarization by incorporating additional sources of product metadata such as descriptions, key features, specifications, and ratings, alongside reviews. This integration results in comprehensive summaries that capture both subjective opinions and objective product attributes essential for informed decision-making. While Large Language Models (LLMs) have shown significant success in various Natural Language Processing (NLP) tasks, their potential in M-OS remains largely unexplored. Additionally, the lack of evaluation datasets for this task has impeded further advancements. To bridge this gap, we introduce M-OS-EVAL, a benchmark dataset for evaluating multi-source opinion summaries across 7 key dimensions: fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, and specificity. Our results demonstrate that M-OS significantly enhances user engagement, as evidenced by a user study in which, on average, 87% of participants preferred M-OS over opinion summaries. Our experiments demonstrate that factually enriched summaries enhance user engagement. Notably, M-OS-PROMPTS exhibit stronger alignment with human judgment, achieving an average Spearman correlation of ρ = 0.74, which surpasses the performance of previous methodologies.
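The Spearman correlation reported above measures how well a metric's ranking of summaries agrees with the human ranking. A minimal pure-Python sketch is given below; the helper names and the toy scores are illustrative assumptions, not data from the paper.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


human_scores = [1, 2, 3, 4, 5]   # hypothetical human ratings
metric_scores = [2, 1, 4, 3, 5]  # hypothetical automatic ratings
print(round(spearman(human_scores, metric_scores), 2))
```

Because only ranks matter, the metric does not need to be on the same scale as the human ratings.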
Article 405
Title@2025-07-07 (1): A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic
Title: A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic | Eine Geschichte von zwei Scripts: Transliteration und Nachkorrektur für Judeo-Arabisch | 两本脚本的故事:犹太-阿拉伯语的翻写和校正后 2507.04746v1 |
Authors (3): Juan Moreno Gonzalez, Bashar Alhafni, Nizar Habash
Judeo-Arabic refers to Arabic variants historically spoken by Jewish communities across the Arab world, primarily during the Middle Ages. Unlike standard Arabic, it is written in Hebrew script by Jewish writers and for Jewish audiences. Transliterating Judeo-Arabic into Arabic script is challenging due to ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew and Aramaic. In this paper, we introduce a two-step approach to automatically transliterate Judeo-Arabic into Arabic script: simple character-level mapping followed by post-correction to address grammatical and orthographic errors. We also present the first benchmark evaluation of LLMs on this task. Finally, we show that transliteration enables Arabic NLP tools to perform morphosyntactic tagging and machine translation, which would not have been feasible on the original texts.
Article 406
Title@2025-07-07 (1): Word stress in self-supervised speech models: A cross-linguistic comparison
Title: Word stress in self-supervised speech models: A cross-linguistic comparison | Wortstress in selbstüberwachten Sprachmodellen: Ein cross-linguistischer Vergleich | 自我监督的演讲模式中的言语压力:交叉语言学比较 2507.04738v1 |
Authors (3): Martijn Bentum, Louis ten Bosch, Tomas O. Lentz
In this paper we study word stress representations learned by self-supervised speech models (S3M), specifically the Wav2vec 2.0 model. We investigate the S3M representations of word stress for five different languages: three languages with variable or lexical stress (Dutch, English and German) and two languages with fixed or demarcative stress (Hungarian and Polish). We train diagnostic stress classifiers on S3M embeddings and show that they can distinguish between stressed and unstressed syllables in read-aloud short sentences with high accuracy. We also test language-specificity effects of S3M word stress. The results indicate that the word stress representations are language-specific, with a greater difference between the set of variable-stress languages and the set of fixed-stress languages.
Article 407
Title@2025-07-07 (1): Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools
Title: Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools | Engere Kluft: Überwachtes Feintuning von Open Source LLMs als lebensfähige Alternative zu proprietären Modellen für pädagogische Werkzeuge | 缩小差距:监督开放源码LLMs的微调,将其作为替代专有教学工具模型的可行替代物 2507.05305v1 |
Authors (5): Lorenzo Lee Solano, Charles Koutcheme, Juho Leinonen, Alexandra Vassar, Jake Renzella
Frontier large language models (LLMs) like ChatGPT and Gemini can decipher cryptic compiler errors for novice programmers, but their computational scale, cost, and tendency to over-assist make them problematic for widespread pedagogical adoption. This work demonstrates that smaller, specialised language models, enhanced via Supervised Fine-Tuning (SFT), present a more viable alternative for educational tools. We utilise a new dataset of 40,000 C compiler error explanations, derived from real introductory programming (CS1/2) student-generated programming errors, which we use to fine-tune three open-source models: Qwen3-4B, Llama-3.1-8B, and Qwen3-32B. We perform a dual evaluation, combining expert human reviews with a large-scale automated analysis of 8,000 responses using a validated LLM-as-judge ensemble. Our results show that SFT significantly boosts the pedagogical quality of smaller models, achieving performance comparable to much larger models. We analyse the trade-offs between model size and quality, confirming that fine-tuning compact, efficient models on high-quality, domain-specific data is a potent strategy for creating specialised models to drive educational tools. We provide a replicable methodology to foster broader access to generative AI capabilities in educational contexts.
Article 408
Title@2025-07-07 (1): “This Suits You the Best”: Query Focused Comparative Explainable Summarization
Title: “This Suits You the Best”: Query Focused Comparative Explainable Summarization | “Das passt Ihnen am besten”: Abfrage konzentriert vergleichende erklärende Zusammenfassung | “这适合你最好的”: 查询重点突出的可解释比较摘要 2507.04733v1 |
Authors (7): Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Product recommendations inherently involve comparisons, yet traditional opinion summarization often fails to provide holistic comparative insights. We propose the novel task of generating Query-Focused Comparative Explainable Summaries (QF-CES) using Multi-Source Opinion Summarization (M-OS). To address the lack of query-focused recommendation datasets, we introduce MS-Q2P, comprising 7,500 queries mapped to 22,500 recommended products with metadata. We leverage Large Language Models (LLMs) to generate tabular comparative summaries with query-specific explanations. Our approach is personalized, privacy-preserving, recommendation engine-agnostic, and category-agnostic. Using M-OS as an intermediate step reduces inference latency by approximately 40% compared to the direct input approach (DIA), which processes raw data directly. We evaluate open-source and proprietary LLMs for generating and assessing QF-CES. Extensive evaluations using QF-CES-PROMPT across 5 dimensions (clarity, faithfulness, informativeness, format adherence, and query relevance) showed an average Spearman correlation of 0.74 with human judgments, indicating its potential for QF-CES evaluation.
Article 409
Title@2025-07-07 (1): Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization
Title: Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization | Textentgiftung: Dateneffizienz, Semantische Erhaltung und Modellverallgemeinerung | 文字解毒:数据效率、语义保护和模式通用 2507.01050v2 |
Authors (7): Jing Yu, Yibo Zhao, Jiapeng Zhu, Wenming Shao, Bo Pang, Zhao Zhang, Xiang Li
The widespread dissemination of toxic content on social media poses a serious threat to both online environments and public discourse, highlighting the urgent need for detoxification methods that effectively remove toxicity while preserving the original semantics. However, existing approaches often struggle to simultaneously achieve strong detoxification performance, semantic preservation, and robustness to out-of-distribution data. Moreover, they typically rely on costly, manually annotated parallel corpora while showing poor data efficiency. To address these challenges, we propose a two-stage training framework that jointly optimizes for data efficiency, semantic preservation, and model generalization. We first perform supervised fine-tuning on a small set of high-quality, filtered parallel data to establish a strong initialization. Then, we leverage unlabeled toxic inputs and a custom-designed reward model to train the LLM using Group Relative Policy Optimization. Experimental results demonstrate that our method effectively mitigates the trade-offs faced by previous work, achieving state-of-the-art performance with improved generalization and significantly reduced dependence on annotated data. Our code is available at: https://github.com/allacnobug/Detoxification-of-Text.
Article 410
Title@2025-07-07 (1): LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework
Title: LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework | LOOM-Scope: ein umfassender und effizienter Long-cOntext Modellauswertungsrahmen | LOOM-范围:一个全面有效的LOng-Ontext示范评价框架 2507.04723v1 |
Authors (8): Zecheng Tang, Haitian Wang, Quantong Qiu, Baibei Ji, Ruoxi Sun, Keyan Zhou, Juntao Li, Min Zhang
Long-context processing has become a fundamental capability for large language models (LLMs). To assess models’ long-context performance, numerous long-context evaluation benchmarks have been proposed. However, variations in evaluation settings across these benchmarks lead to inconsistent results, making it difficult to draw reliable comparisons. Besides, the high computational cost of long-context evaluation poses a significant barrier for the community to conduct comprehensive assessments of long-context models. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite to evaluate models comprehensively. Homepage: https://loomscope.github.io
Article 411
Title@2025-07-07 (1): MAIN: Mutual Alignment Is Necessary for instruction tuning
Title: MAIN: Mutual Alignment Is Necessary for instruction tuning | MAIN: Gegenseitige Ausrichtung ist notwendig für die Instruktions-Tuning | MAIN: 相互协调是教学调整的必要条件 2504.12913v2 |
Authors (10): Fanyi Yang, Jianfeng Liu, Xin Zhang, Haoyu Liu, Xixin Cao, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Qi Zhang
Instruction tuning has empowered large language models (LLMs) to achieve remarkable performance, yet its success heavily depends on the availability of large-scale, high-quality instruction-response pairs. To meet this demand, various methods have been developed to synthesize data at scale. However, current methods for scaling up data generation often overlook a crucial aspect: the alignment between instructions and responses. We hypothesize that the quality of instruction-response pairs is determined not by the individual quality of each component, but by the degree of mutual alignment. To address this, we propose a Mutual Alignment Framework (MAIN) which enforces coherence between instructions and responses through mutual constraints. We demonstrate that MAIN generalizes well across model architectures and sizes, achieving state-of-the-art performance on LLaMA, Mistral, and Qwen models across diverse benchmarks. This work underscores the critical role of instruction-response alignment in enabling generalizable and high-quality instruction tuning for LLMs.
Article 412
Title@2025-07-07 (1): Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce
Title: Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce | Warum wir fühlen, was wir fühlen: Gemeinsame Erkennung von Emotionen und ihre Meinung Auslöser im E-Commerce | 我们为什么感到我们的感受:在电子商务中共同发现情感及其见解触发者 2507.04708v1 |
Authors (7): Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Customer reviews on e-commerce platforms capture critical affective signals that drive purchasing decisions. However, no existing research has explored the joint task of emotion detection and explanatory span identification in e-commerce reviews, a crucial gap in understanding what triggers customer emotional responses. To bridge this gap, we propose a novel joint task unifying Emotion detection and Opinion Trigger extraction (EOT), which explicitly models the relationship between causal text spans (opinion triggers) and affective dimensions (emotion categories) grounded in Plutchik’s theory of 8 primary emotions. In the absence of labeled data, we introduce EOT-X, a human-annotated collection of 2,400 reviews with fine-grained emotions and opinion triggers. We evaluate 23 Large Language Models (LLMs) and present EOT-DETECT, a structured prompting framework with systematic reasoning and self-reflection. Our framework surpasses zero-shot and chain-of-thought techniques across e-commerce domains.
Article 413
Title@2025-07-07 (1): XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL
Title: XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL | XiYan-SQL: Ein neuartiges Multi-Generator-Framework für Text-zu-SQL | XiYan-SQL: 文本到SQL的新颖多驱动器框架 2507.04701v1 |
Authors (11): Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, Jingren Zhou
To leverage the advantages of LLMs in addressing challenges in the Text-to-SQL task, we present XiYan-SQL, an innovative framework effectively generating and utilizing multiple SQL candidates. It consists of three components: 1) a Schema Filter module filtering and obtaining multiple relevant schemas; 2) a multi-generator ensemble approach generating multiple high-quality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy implemented to obtain the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the notable BIRD benchmark, surpassing all previous methods. It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.
Article 414
Title@2025-07-07 (1): Markovian Transformers for Informative Language Modeling
Title: Markovian Transformers for Informative Language Modeling | Markovianische Transformer für die informative Sprachmodellierung | 用于信息语言建模的Markovian变形器 2404.18988v6 |
Authors (4): Scott Viteri, Max Lamparth, Peter Chatain, Clark Barrett
Chain-of-Thought (CoT) reasoning often fails to faithfully reflect a language model’s underlying decision process. We address this by making CoT text causally essential in a “Markovian” language model, factoring next-token prediction through an intermediate CoT and training it to predict future tokens independently of the original prompt. We formalize this via an “informativeness” objective that quantifies how much a trained CoT improves next-token predictions over a baseline. Using policy gradient, we show that Llama 3.1 8B achieves a 33.2% absolute accuracy improvement on GSM8K. Perturbation tests confirm stronger reliance on the CoT, while cross-model transfers indicate these reasoning traces generalize across interpreters. Our approach enhances both accuracy and interpretability, potentially extending CoT reasoning to arbitrarily long contexts and diverse tasks.
Article 415
Title@2025-07-07 (1): Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading
Title: Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading | Pensieve Grader: Eine KI-Powered, Ready-to-Use Plattform für mühelose handschriftliche STEM-Grading | Pensieve grafer: 一个AI授权的无力手写STEM分级的现用平台 2507.01431v2 |
Authors (4): Yoonseok Yang, Minjune Kim, Marlon Rondinelli, Keren Shao
Grading handwritten, open-ended responses remains a major bottleneck in large university STEM courses. We introduce Pensieve (https://www.pensieve.co), an AI-assisted grading platform that leverages large language models (LLMs) to transcribe and evaluate student work, providing instructors with rubric-aligned scores, transcriptions, and confidence ratings. Unlike prior tools that focus narrowly on specific tasks like transcription or rubric generation, Pensieve supports the entire grading pipeline, from scanned student submissions to final feedback, within a human-in-the-loop interface. Pensieve has been deployed in real-world courses at over 20 institutions and has graded more than 300,000 student responses. We present system details and empirical results across four core STEM disciplines: Computer Science, Mathematics, Physics, and Chemistry. Our findings show that Pensieve reduces grading time by an average of 65%, while maintaining a 95.4% agreement rate with instructor-assigned grades for high-confidence predictions.
Article 416
Title@2025-07-07 (1): Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages
Title: Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages | Eka-Eval : Ein umfassender Evaluierungsrahmen für große Sprachmodelle in indischen Sprachen | Eka-Eval:印度语大语言模式综合评价框架 2507.01853v2 |
Authors (4): Samridhi Raj Sinha, Rajvee Sheth, Abhishek Upperwal, Mayank Singh
The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that address the requirements of linguistically diverse regions, such as India, and go beyond English-centric benchmarks. We introduce EKA-EVAL, a unified evaluation framework that integrates more than 35 benchmarks (including 10 Indic benchmarks) across nine major evaluation categories. The framework provides broader coverage than existing Indian language evaluation tools, offering 11 core capabilities through a modular architecture, seamless integration with Hugging Face and proprietary models, and plug-and-play usability. As the first end-to-end suite for scalable, multilingual LLM benchmarking, the framework combines extensive benchmarks, modular workflows, and dedicated support for low-resource Indian languages to enable inclusive assessment of LLM capabilities across diverse domains. We conducted extensive comparisons against five existing baselines, demonstrating that EKA-EVAL achieves the highest participant ratings in four out of five categories. The framework is open-source and publicly available at: https://github.com/lingo-iitgn/eka-eval.
Article 417
Title@2025-07-07 (1): Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models
Title: Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models | Breach in the Shield: Enthüllen der Schwachstellen großer Sprachmodelle | 打破盾牌:消除大语言模式的脆弱性 2504.03714v2 |
Authors (4): Runpeng Dai, Run Yang, Fan Zhou, Hongtu Zhu
Large Language Models (LLMs) and Vision-Language Models (VLMs) have achieved impressive performance across a wide range of tasks, yet they remain vulnerable to carefully crafted perturbations. In this study, we seek to pinpoint the sources of this fragility by identifying parameters and input dimensions (pixels or token embeddings) that are susceptible to such perturbations. To this end, we propose a stability measure called FI (First-order local Influence), which is rooted in information geometry and quantifies the sensitivity of individual parameter and input dimensions. Our extensive analysis across LLMs and VLMs (from 1.5B to 13B parameters) reveals that: (I) A small subset of parameters or input dimensions with high FI values disproportionately contributes to model brittleness. (II) Mitigating the influence of these vulnerable parameters during model merging leads to improved performance.
Article 418
Title@2025-07-07 (1): R1-RE: Cross-Domain Relationship Extraction with RLVR
Title: R1-RE: Cross-Domain Relationship Extraction with RLVR | R1-RE: Cross-Domain-Beziehungsextraktion mit RLVR | R1-RE: 与RLVR的跨部关系提取 2507.04642v1 |
Authors (4): Runpeng Dai, Tong Zheng, Run Yang, Hongtu Zhu
Relationship extraction (RE) is a core task in natural language processing. Traditional approaches typically frame RE as a supervised learning problem, directly mapping context to labels, an approach that often suffers from poor out-of-domain (OOD) generalization. Inspired by the workflow of human annotators, we reframe RE as a reasoning task guided by annotation guidelines and introduce R1-RE, the first reinforcement learning with verifiable reward (RLVR) framework for RE tasks. Our method elicits the reasoning abilities of small language models for annotation tasks, resulting in significantly improved OOD robustness. We evaluate our approach on the public Sem-2010 dataset and a private MDKG dataset. The R1-RE-7B model attains an average OOD accuracy of approximately 70%, on par with leading proprietary models such as GPT-4o. Additionally, our comprehensive analysis provides novel insights into the training dynamics and emergent reasoning behaviors of the RLVR paradigm for RE.
Article 419
Title@2025-07-07 (1): Put Teacher in Student’s Shoes: Cross-Distillation for Ultra-compact Model Compression Framework
Title: Put Teacher in Student’s Shoes: Cross-Distillation for Ultra-compact Model Compression Framework | Put Teacher in Studentenschuhen: Cross-Destillation für ultra-kompakte Modellkompressionsrahmen | 将教师纳入学生鞋:超常规示范压缩框架的交叉蒸馏 2507.04636v1 |
Authors (7): Maolin Wang, Jun Chu, Sicong Xie, Xiaoling Zang, Yao Zhao, Wenliang Zhong, Xiangyu Zhao
In the era of mobile computing, deploying efficient Natural Language Processing (NLP) models in resource-restricted edge settings presents significant challenges, particularly in environments requiring strict privacy compliance, real-time responsiveness, and diverse multi-tasking capabilities. These challenges create a fundamental need for ultra-compact models that maintain strong performance across various NLP tasks while adhering to stringent memory constraints. To this end, we introduce Edge ultra-lIte BERT framework (EI-BERT) with a novel cross-distillation method. EI-BERT efficiently compresses models through a comprehensive pipeline including hard token pruning, cross-distillation and parameter quantization. Specifically, the cross-distillation method uniquely positions the teacher model to understand the student model’s perspective, ensuring efficient knowledge transfer through parameter integration and the mutual interplay between models. Through extensive experiments, we achieve a remarkably compact BERT-based model of only 1.91 MB, the smallest to date for Natural Language Understanding (NLU) tasks. This ultra-compact model has been successfully deployed across multiple scenarios within the Alipay ecosystem, demonstrating significant improvements in real-world applications. For example, it has been integrated into Alipay’s live Edge Recommendation system since January 2024, currently serving the app’s recommendation traffic across 8.4 million daily active devices.
Article 420
Title@2025-07-07 (1): A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens
Title: A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens | Ein Text ist mehrere Token wert: Text-Einbetten aus LLMs richtet sich geheim gut mit den Schlüssel-Token | 文本“ 值得几个调子” : 文本由LLMs 秘密对齐与密钥连接良好 。 2406.17378v4 |
Authors (3): Zhijie Nie, Richong Zhang, Zhanyu Wu
Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval and semantic textual similarity. In this work, we show an interesting finding: when a text is fed into an LLM-based embedder, the resulting text embedding can be aligned with the key tokens in the input text. We first fully analyze this phenomenon on eight LLM-based embedders and show that this phenomenon is universal and is not affected by model architecture, training strategy, or embedding method. With a deeper analysis, we find that the main change in embedding space between these embedders and their LLM backbones is in the first principal component. By adjusting the first principal component, we can align text embedding with the key tokens. Finally, we give several examples to demonstrate the vast application potential of this finding: (1) we propose a simple and practical sparse retrieval method based on the aligned tokens, which can achieve 80% of the dense retrieval effect of the same model while reducing the computation significantly; (2) we show that our findings provide a novel perspective to help understand novel technologies (e.g., instruction-following embedding) and fuzzy concepts (e.g., semantic relatedness vs. similarity) in this field.
Article 421
Title@2025-07-07 (1): Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs
Title: Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs | Wissen-Bewusst Selbst-Korrektion in Sprachmodellen über strukturierte Speichergraphen | 通过结构内存图对语言模型进行知识-软件自我校正 2507.04625v1 |
Authors (1): Swayamjit Saha
Large Language Models (LLMs) are powerful yet prone to generating factual errors, commonly referred to as hallucinations. We present a lightweight, interpretable framework for knowledge-aware self-correction of LLM outputs using structured memory graphs based on RDF triples. Without retraining or fine-tuning, our method post-processes model outputs and corrects factual inconsistencies via external semantic memory. We demonstrate the approach using DistilGPT-2 and show promising results on simple factual prompts.
Article 422
Title@2025-07-07 (1): Towards Cost-Effective Reward Guided Text Generation
Title: Towards Cost-Effective Reward Guided Text Generation | Auf dem Weg zu einer kosteneffizienten, von der Belohnung geleiteten Textgenerierung | 实现成本-效益高的回报制制制制制制 2502.04517v2 |
Authors (6): Ahmad Rashid, Ruotian Wu, Rongqi Fan, Hongliang Li, Agustinus Kristiadi, Pascal Poupart
Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without further training like in standard RLHF methods. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant test-time overhead. Additionally, the reward model is usually only trained to score full sequences, which can lead to sub-optimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a single call to the reward model at each step of the generation process. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze various RGTG reward models and demonstrate that prior techniques prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods.
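For readers unfamiliar with the Bradley-Terry loss mentioned above, the standard pairwise form is the negative log-sigmoid of the score margin between a preferred and a rejected candidate. The sketch below shows that generic loss only; the function name and example scores are assumptions, and it does not reproduce the paper's reward model architecture.

```python
import math


def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the margin
    # so that exp() is only ever called on a non-positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for two candidate continuations.
print(bradley_terry_loss(2.0, 0.0))  # small loss: preferred candidate scored higher
print(bradley_terry_loss(0.0, 2.0))  # large loss: ranking is inverted
```

Minimizing this loss pushes the reward model to score the preferred continuation above the rejected one.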
Article 423
Title@2025-07-07 (1): News Source Citing Patterns in AI Search Systems
Title: News Source Citing Patterns in AI Search Systems | Nachrichten Quelle Zitieren von Mustern in KI-Suchsystemen | AI 搜索系统中引用新闻来源模式 2507.05301v1 |
Authors (1): Kai-Cheng Yang
AI-powered search systems are emerging as new information gatekeepers, fundamentally transforming how users access news and information. Despite their growing influence, the citation patterns of these systems remain poorly understood. We address this gap by analyzing data from the AI Search Arena, a head-to-head evaluation platform for AI search systems. The dataset comprises over 24,000 conversations and 65,000 responses from models across three major providers: OpenAI, Perplexity, and Google. Among the over 366,000 citations embedded in these responses, 9% reference news sources. We find that while models from different providers cite distinct news sources, they exhibit shared patterns in citation behavior. News citations concentrate heavily among a small number of outlets and display a pronounced liberal bias, though low-credibility sources are rarely cited. User preference analysis reveals that neither the political leaning nor the quality of cited news sources significantly influences user satisfaction. These findings reveal significant challenges in current AI search systems and have important implications for their design and governance.
Article 424
Title@2025-07-07 (1): Retain or Reframe? A Computational Framework for the Analysis of Framing in News Articles and Reader Comments
Title: Retain or Reframe? A Computational Framework for the Analysis of Framing in News Articles and Reader Comments | Retain or Reframe? Ein rechnerischer Rahmen für die Analyse von Framing in Nachrichtenartikeln und Leserkommentaren | 保留还是重新框架? 分析新闻文章和读者评论中形式界定的计算框架 2507.04612v1 |
Authors (4): Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann
When a news article describes immigration as an “economic burden” or a “humanitarian crisis,” it selectively emphasizes certain aspects of the issue. Although framing shapes how the public interprets such issues, audiences do not absorb frames passively but actively reorganize the presented information. While this relationship between source content and audience response is well-documented in the social sciences, NLP approaches often ignore it, detecting frames in articles and responses in isolation. We present the first computational framework for large-scale analysis of framing across source content (news articles) and audience responses (reader comments). Methodologically, we refine frame labels and develop a framework that reconstructs dominant frames in articles and comments from sentence-level predictions, and aligns articles with topically relevant comments. Applying our framework across eleven topics and two news outlets, we find that frame reuse in comments correlates highly across outlets, while topic-specific patterns vary. We release a frame classifier that performs well on both articles and comments, a dataset of article and comment sentences manually labeled for frames, and a large-scale dataset of articles and comments with predicted frame labels.
Article 425
Title@2025-07-07 (1): Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track
Title: Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track | Position: Machine Learning Konferenzen sollten einen “Refutations and Critiques” Track erstellen | 职位:机器学习会议应建立“反驳和批评”轨道 2506.19882v3 |
Authors (14): Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch, Brando Miranda, Matthias Gerstgrasser, Susan Zhang, Andreas Haupt, Isha Gupta, Elyas Obbad, Jesse Dodge, Jessica Zosa Forde, Francesco Orabona, Sanmi Koyejo, David Donoho
Science progresses by iteratively advancing and correcting humanity’s understanding of the world. In machine learning (ML) research, rapid advancements have led to an explosion of publications, but have also led to misleading, incorrect, flawed or perhaps even fraudulent studies being accepted and sometimes highlighted at ML conferences due to the fallibility of peer review. While such mistakes are understandable, ML conferences do not offer robust processes to help the field systematically correct when such errors are made. This position paper argues that ML conferences should establish a dedicated “Refutations and Critiques” (R&C) Track. This R&C Track would provide a high-profile, reputable platform to support vital research that critically challenges prior research, thereby fostering a dynamic self-correcting research ecosystem. We discuss key considerations including track design, review principles, potential pitfalls, and provide an illustrative example submission concerning a recent ICLR 2025 Oral. We conclude that ML conferences should create official, reputable mechanisms to help ML research self-correct.
Article 426
Title@2025-07-07 (1): PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
Title: PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes | PRIME: Large Language Model Personalisierung mit kognitiven Gedächtnis- und Gedankenprozessen | PRIME:具有认知记忆和思维过程的大语言模式个性模型 2507.04607v1 |
Authors (3): Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang
Large language model (LLM) personalization aims to align model outputs with individuals’ unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework that can systematically understand the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into LLM personalization, by mirroring episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs. Specifically, we systematically investigate memory instantiations and introduce a unified framework, PRIME, using episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability inspired by the slow thinking strategy. Moreover, recognizing the absence of suitable benchmarks, we introduce a dataset using Change My View (CMV) from Reddit, specifically designed to evaluate long-context personalization. Extensive experiments validate PRIME’s effectiveness across both long- and short-context scenarios. Further analysis confirms that PRIME effectively captures dynamic personalization beyond mere popularity biases.
Article 427
Title@2025-07-07 (1): Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)
Title: Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M) | Structured Captions verbessern die sofortige Einhaltung von Text-zu-Bild-Modellen (Re-LAION-Caption 19M) | 结构说明 改进对文本到图像模型的迅速遵守(Re-LAION-Caption 19M) 2507.05300v1 |
Authors (4): Nicholas Merchant, Haitz Sáez de Ocáriz Borde, Andrei Cristian Popescu, Carlos Garcia Jurado Suarez
We argue that generative text-to-image models often struggle with prompt adherence due to the noisy and unstructured nature of large-scale datasets like LAION-5B. This forces users to rely heavily on prompt engineering to elicit desirable outputs. In this work, we propose that enforcing a consistent caption structure during training can significantly improve model controllability and alignment. We introduce Re-LAION-Caption 19M, a high-quality subset of Re-LAION-5B, comprising 19 million 1024x1024 images with captions generated by a Mistral 7B Instruct-based LLaVA-Next model. Each caption follows a four-part template: subject, setting, aesthetics, and camera details. We fine-tune PixArt-Σ and Stable Diffusion 2 using both structured and randomly shuffled captions, and show that structured versions consistently yield higher text-image alignment scores using visual question answering (VQA) models. The dataset is publicly available at https://huggingface.co/datasets/supermodelresearch/Re-LAION-Caption19M.
Article 428
Title@2025-07-07 (1): Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation
Title: Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation | Gradient-geführte Aufmerksamkeitskarte Bearbeitung: Auf dem Weg zu einer effizienten kontextuellen Halluzination | 逐步制导注意地图编辑:努力有效减少环境幻觉 2503.08963v2 |
Authors (6): Yu Wang, Kamalika Das, Xiang Gao, Wendi Cui, Peng Li, Jiaxin Zhang
In tasks like summarization and open-book question answering (QA), Large Language Models (LLMs) often encounter “contextual hallucination”, where they produce irrelevant or incorrect responses despite having access to accurate source information. This typically occurs because these models tend to prioritize self-generated content over the input context, causing them to disregard pertinent details. To address this challenge, we introduce a novel method called “Guided Attention Map Editing” (GAME), which dynamically adjusts attention maps to improve contextual relevance. During inference, GAME employs a trained classifier to identify attention maps prone to inducing hallucinations and executes targeted interventions. These interventions, guided by gradient-informed “edit directions”, strategically redistribute attention weights across various heads to effectively reduce hallucination. Comprehensive evaluations on challenging summarization and open-book QA tasks show that GAME consistently reduces hallucinations across a variety of open-source models. Specifically, GAME reduces hallucinations by 10% in the XSum summarization task while achieving a 7X speed-up in computational efficiency compared to the state-of-the-art baselines.
Article 429
Title@2025-07-07 (1): VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
Title: VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents | VLM2Vec-V2: Advancing Multimodal Embedding für Videos, Bilder und visuelle Dokumente | VLM2Vec-V2:推进视频、图像和视觉文件的多式联运嵌入 2507.04590v1 |
Authors (13): Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz
Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multi-modal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering - spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.
Article 430
Title@2025-07-07 (1): Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
Title: Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs | Im Inneren sind Sie viele Wölfe: Mit kognitiven Modellen, um Wert-Abwägungen in LLMs zu interpretieren | 使用认知模型来解释LLMM中的价值权衡 2506.20666v2 |
Authors (7): Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman
Navigating everyday social situations often requires juggling conflicting goals, such as conveying a harsh truth and maintaining trust, all while remaining mindful of another person’s feelings. These value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called “cognitive models” provide formal accounts of these trade-offs in humans, by modeling the weighting of a speaker’s competing utility functions in choosing an action or utterance. In this work, we use a leading cognitive model of polite speech to interpret the extent to which LLMs represent human-like trade-offs. We apply this lens to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning “effort” in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models, and in open-source models shown to be stronger in mathematical reasoning. Our findings from LLMs’ training dynamics suggest large shifts in utility values early on in training with persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method. We show that our method is responsive to diverse aspects of the rapidly evolving LLM landscape, with insights for forming hypotheses about other high-level behaviors, shaping training regimes for reasoning models, and better controlling trade-offs between values during model training.
Article 431
Title@2025-07-06 (7): Construction Identification and Disambiguation Using BERT: A Case Study of NPN
Title: Construction Identification and Disambiguation Using BERT: A Case Study of NPN | Bauidentifizierung und Abgrenzung mit BERT: Eine Fallstudie der NPN | 使用BERT进行建筑识别和拆分:NPN的案例研究 2503.18751v2 |
Authors (2): Wesley Scivetti, Nathan Schneider
Construction Grammar hypothesizes that knowledge of a language consists chiefly of knowledge of form-meaning pairs (“constructions”) that include vocabulary, general grammar rules, and even idiosyncratic patterns. Recent work has shown that transformer language models represent at least some constructional patterns, including ones where the construction is rare overall. In this work, we probe BERT’s representation of the form and meaning of a minor construction of English, the NPN (noun-preposition-noun) construction – exhibited in such expressions as “face to face” and “day to day” – which is known to be polysemous. We construct a benchmark dataset of semantically annotated corpus instances (including distractors that superficially resemble the construction). With this dataset, we train and evaluate probing classifiers. They achieve decent discrimination of the construction from distractors, as well as sense disambiguation among true instances of the construction, revealing that BERT embeddings carry indications of the construction’s semantics. Moreover, artificially permuting the word order of true construction instances causes them to be rejected, indicating sensitivity to matters of form. We conclude that BERT does latently encode at least some knowledge of the NPN construction going beyond a surface syntactic pattern and lexical cues.
Article 432
Title@2025-07-06 (7): Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts
Title: Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts | Nile-Chat: Ägyptische Sprachmodelle für arabische und lateinische Scripte | Nile-Chat:阿拉伯文和拉丁文脚本埃及语言模型 2507.04569v1 |
Authors (10): Guokan Shang, Hadi Abdine, Ahmad Chamma, Amr Mohamed, Mohamed Anwar, Abdelaziz Bounhar, Omar El Herraoui, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for Egyptian dialect, uniquely designed to understand and generate texts written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach by leveraging the Branch-Train-MiX strategy to merge script-specialized experts into a single MoE model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model yields a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect in modern LLM development.
Article 433
Title@2025-07-06 (7): Evaluating LLMs on Real-World Forecasting Against Human Superforecasters
Title: Evaluating LLMs on Real-World Forecasting Against Human Superforecasters | Bewertung von LLMs bei der Real-World-Prognose gegen menschliche Superforecaster | 评价针对人类超先天预测者进行真实世界预测的LLMs评价项目 2507.04562v1 |
Authors (1): Janna Lu
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. A year ago, large language models struggled to come close to the accuracy of a human crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against human superforecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of superforecasters.
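For readers unfamiliar with the metric, the Brier score for binary questions is the mean squared difference between the forecast probability and the 0/1 outcome, so lower is better and a constant 0.5 forecast scores 0.25. The sketch below assumes the binary form; the abstract does not say which variant or aggregation the study uses, and the example forecasts are made up.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.

    forecasts: probabilities in [0, 1] that each event happens.
    outcomes:  1 if the event happened, 0 otherwise.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Illustrative (made-up) forecasts on two resolved questions.
confident = brier_score([0.9, 0.2], [1, 0])
uninformed = brier_score([0.5, 0.5], [1, 0])
print(confident, uninformed)
```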
Article 434
Title@2025-07-06 (7): PENCIL: Long Thoughts with Short Memory
Title: PENCIL: Long Thoughts with Short Memory | PENCIL: Lange Gedanken mit kurzer Erinnerung | PENCIL:有短记忆的长思考 2503.14337v2 |
Authors (4): Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li
While state-of-the-art LLMs have demonstrated great promise of using long Chains-of-Thought (CoT) to boost reasoning, scaling it up to more challenging problems at test-time is fundamentally limited by suboptimal memory usage – intermediate computations accumulate indefinitely in context even when no longer needed for future thoughts. We introduce PENCIL, which incorporates a novel reduction mechanism into the autoregressive generation process that recursively cleans up intermediate thoughts based on patterns learned from training. By iteratively generating and erasing thoughts, PENCIL can think deeper to solve harder problems using shorter context and less compute. Empirically, we observe PENCIL is significantly more effective and efficient than CoT. For example, we demonstrate that PENCIL with a small 25M-parameter transformer and a context length of 2048 solves Einstein’s puzzle – a task that challenges much larger models like GPT-4. Theoretically, we prove PENCIL can perform universal efficient computation by simulating any Turing machine with optimal time and space complexity, and thus can solve arbitrary computable tasks that are otherwise intractable for vanilla CoT.
Article 435
Title@2025-07-06 (7): DP-Fusion: Token-Level Differentially Private Inference for Large Language Models
Title: DP-Fusion: Token-Level Differentially Private Inference for Large Language Models | DP-Fusion: Token-Level Differentially Private Inferenz für große Sprachmodelle | DP-Fusion: 大语言模式的调控层面的私人不同推论 2507.04531v1 |
Authors (4): Rushil Thareja, Preslav Nakov, Praneeth Vepakomma, Nils Lukas
Large language models (LLMs) can leak sensitive information from their context through generated outputs, either accidentally or when prompted adversarially. Existing defenses that aim to preserve context privacy during inference either lack formal guarantees or suffer from a poor utility/privacy trade-off. We propose DP-Fusion, a token-level Differentially Private Inference (DPI) mechanism that provably bounds how much an LLM’s outputs reveal about sensitive tokens in its context. We demonstrate DPI through the task of document privatization, where the goal is to paraphrase documents so that sensitive content (e.g., Personally Identifiable Information, PII) cannot be reliably inferred, while still preserving the overall utility of the text. This is controlled by a parameter $\epsilon$: $\epsilon=0$ hides PII entirely, while higher values trade off privacy for improved paraphrase quality. DP-Fusion works as follows: (i) partition sensitive tokens into disjoint privacy groups, (ii) run the LLM once per group, and (iii) blend the output distributions so that the final output remains within a fixed statistical distance of the baseline distribution produced when no privacy group is revealed. This approach allows fine-grained control over the privacy/utility trade-off but requires multiple LLM forward passes.
Article 436
Title@2025-07-06 (7): DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging
Title: DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging | DOTResize: Reduzierung der LLM-Breite durch diskrete Optimale transportbasierte Neuronenverschmelzung | DOTRIZ: 通过分立优化运输最佳方式的中子合并减少 LLM 宽度 2507.04517v1 |
Authors (3): Neha Verma, Kenton Murray, Kevin Duh
Model compression offers a promising path to reducing the cost and inaccessibility of large pre-trained models, without significantly compromising their impressive performance. Large Transformer models, including large language models (LLMs), often contain computational redundancy, which can serve as a target for new model compression methods. In this work, we specifically target neuron-level redundancies in model layers by combining groups of similar neurons into fewer neurons. We frame this width reduction as a Discrete Optimal Transport problem, and propose DOTResize, a novel Transformer compression method that uses optimal transport theory to transform and compress model weights. To ensure applicability within the Transformer architecture, we motivate and incorporate entropic regularization and matrix factorization into the transportation maps produced by our method. Unlike pruning-based approaches which discard neurons based on importance measures, DOTResize re-projects the entire neuron width, allowing the retention and redistribution of useful signal across the reduced layer. Empirical results show that compared to simple or state-of-the-art neuron width-pruning techniques, DOTResize can outperform these methods across multiple LLM families and sizes, while achieving measurable reductions in real-world computational cost.
Article 437
Title@2025-07-06 (7): Rethinking Table Instruction Tuning
Title: Rethinking Table Instruction Tuning | Umdenken Tabelle Anleitung Tuning | 重新思考表格指令图 2501.14693v3 |
Authors (2): Naihao Deng, Rada Mihalcea
Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of hyperparameter choices, and also lacks a comprehensive evaluation of the out-of-domain table understanding ability and the general capabilities of these table LLMs. In this paper, we evaluate these abilities in existing table LLMs, and find significant declines in both out-of-domain table understanding and general capabilities as compared to their base models. Through systematic analysis, we show that hyperparameters, such as learning rate, can significantly influence both table-specific and general capabilities. Contrary to the previous table instruction-tuning work, we demonstrate that smaller learning rates and fewer training instances can enhance table understanding while preserving general capabilities. Based on our findings, we introduce TAMA, a TAble LLM instruction-tuned from LLaMA 3.1 8B Instruct, which achieves performance on par with, or surpassing, GPT-3.5 and GPT-4 on table tasks, while maintaining strong out-of-domain generalization and general capabilities. Our findings highlight the potential for reduced data annotation costs and more efficient model development through careful hyperparameter selection. We open-source the project and our models.
Article 438
Title@2025-07-06 (7): On the Utility of Domain-Adjacent Fine-Tuned Model Ensembles for Few-shot Problems
Title: On the Utility of Domain-Adjacent Fine-Tuned Model Ensembles for Few-shot Problems | Über die Nützlichkeit von Domain-Adjacent Fine-Tuned Model Ensembles für Probleme mit wenigen Fotos | 关于Dome-邻接精度精美设计模型集的效用问题 2406.13720v2 |
Authors (5): Md Ibrahim Ibne Alam, Parikshit Ram, Soham Dan, Horst Samulowitz, Koushik Kar
Large Language Models (LLMs) have been observed to perform well on a wide range of downstream tasks when fine-tuned on domain-specific data. However, such data may not be readily available in many applications, motivating zero-shot or few-shot approaches using domain-adjacent models. While several fine-tuned models for various tasks are available, finding an appropriate domain-adjacent model for a given task is often not straightforward. In this paper, we study DAFT-E, a framework that utilizes an Ensemble of Domain-Adjacent Fine-Tuned Foundation Models for few-shot problems. We show that for zero-shot problems, this ensembling method provides an accuracy performance close to that of the single best model. With few-shot problems, this performance improves further, at which point DAFT-E can outperform any single domain-adjacent model while requiring much less data for domain-specific fine-tuning.
Article 439
Title@2025-07-06 (7): AdS: Adapter-state Sharing Framework for Multimodal Sarcasm Detection
Title: AdS: Adapter-state Sharing Framework for Multimodal Sarcasm Detection | AdS: Adapter-State-Sharing-Framework für multimodale Sarkasmus-Erkennung | AdS:多式碎片探测适应国家共享框架 2507.04508v1 |
Authors (3): Soumyadeep Jana, Sahil Danayak, Sanasam Ranbir Singh
The growing prevalence of multimodal image-text sarcasm on social media poses challenges for opinion mining, especially under resource constraints. Existing approaches rely on full fine-tuning of large pre-trained models, making them unsuitable for low-resource settings. While recent parameter-efficient fine-tuning (PEFT) methods offer promise, their off-the-shelf use underperforms on complex tasks like sarcasm detection. We propose AdS (Adapter-State Sharing), a lightweight framework built on CLIP that inserts adapters only in the upper layers and introduces a novel adapter-state sharing mechanism, where textual adapters guide visual ones. This design promotes efficient cross-modal learning while preserving low-level unimodal representations. Experiments on two public benchmarks demonstrate that AdS achieves state-of-the-art results using significantly fewer trainable parameters than existing PEFT and full fine-tuning approaches.
Article 440
Title@2025-07-06 (7): Unveiling the Potential of Diffusion Large Language Model in Controllable Generation
Title: Unveiling the Potential of Diffusion Large Language Model in Controllable Generation | Enthüllen des Potenzials von Diffusions-Großsprachenmodellen in kontrollierbarer Generation | 发挥在可控生产中传播大语言模式的潜力 2507.04504v1 |
Authors (4): Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang
Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, revealing that the intrinsic bidirectional attention mechanism of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process exhibits high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs without specialized optimizations. To address these limitations, we propose Self-adaptive Schema Scaffolding ($S^3$), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that $S^3$ achieves substantial improvements: 65% increase in structural adherence, 48% enhancement in content fidelity, and 17% reduction in hallucination rates compared to baseline. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.
Article 441
Title@2025-07-06 (7): Self-Consistency Preference Optimization
Title: Self-Consistency Preference Optimization | Selbst-Konsistenz-Präferenz-Optimierung | 优化优惠 2411.04109v3 |
Authors (9): Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, Jane Yu
Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
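The idea of turning self-consistency into a training signal can be sketched as follows: sample several answers to one problem, take the most frequent answer as "consistent", and pair it with an infrequent one to form a preference example. This is an illustrative sketch only; pairing the most-voted answer with the least-voted one and filtering by a vote margin are assumptions, and `self_consistency_vote`, `build_preference_pair`, and `min_margin` are hypothetical names.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the most frequent sampled answer and its share of the votes."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def build_preference_pair(answers, min_margin=0.0):
    """Build (chosen, rejected, margin) from sampled answers, or None.

    chosen   = most frequent answer (treated as the consistent one)
    rejected = least frequent answer (treated as the inconsistent one)
    margin   = difference in vote share; pairs at or below `min_margin`
               are skipped because the signal is too weak (assumption).
    """
    ranked = Counter(answers).most_common()
    if len(ranked) < 2:
        return None  # every sample agrees, so there is nothing to contrast
    chosen, top_votes = ranked[0]
    rejected, bottom_votes = ranked[-1]
    margin = (top_votes - bottom_votes) / len(answers)
    if margin <= min_margin:
        return None
    return chosen, rejected, margin


# Five sampled final answers to one unlabeled problem (made-up values).
samples = ["12", "12", "7", "12", "9"]
print(self_consistency_vote(samples))
print(build_preference_pair(samples))
```

No gold label is consulted anywhere in the sketch, which is the property that lets this kind of signal be collected on unsupervised new problems.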
Article 442
Title@2025-07-06 (7): A validity-guided workflow for robust large language model research in psychology
Title: A validity-guided workflow for robust large language model research in psychology | Ein validitätsgeführter Workflow für robuste große Sprachmodellforschung in der Psychologie | 健全的大型语言模式心理学研究有效指导工作流程 2507.04491v1 |
Authors (1): Zhicheng Lin
Large language models (LLMs) are rapidly being integrated into psychological research as research tools, evaluation targets, human simulators, and cognitive models. However, recent evidence reveals severe measurement unreliability: personality assessments collapse under factor analysis, moral preferences reverse with punctuation changes, and theory-of-mind accuracy varies widely with trivial rephrasing. These “measurement phantoms” (statistical artifacts masquerading as psychological phenomena) threaten the validity of a growing body of research. Guided by the dual-validity framework that integrates psychometrics with causal inference, we present a six-stage workflow that scales validity requirements to research ambition: using LLMs to code text requires basic reliability and accuracy, while claims about psychological properties demand comprehensive construct validation. Researchers must (1) explicitly define their research goal and corresponding validity requirements, (2) develop and validate computational instruments through psychometric testing, (3) design experiments that control for computational confounds, (4) execute protocols with transparency, (5) analyze data using methods appropriate for non-independent observations, and (6) report findings within demonstrated boundaries and use results to refine theory. We illustrate the workflow through an example of model evaluation, “LLM selfhood”, showing how systematic validation can distinguish genuine computational phenomena from measurement artifacts. By establishing validated computational instruments and transparent practices, this workflow provides a path toward building a robust empirical foundation for AI psychology research.
Article 443
Title@2025-07-06 (7): Does Overnight News Explain Overnight Returns?
Title: Does Overnight News Explain Overnight Returns? | Erklären die Nachrichten über Nacht die Rückkehr? | 夜间新闻解释夜间回归吗? 2507.04481v1 |
Authors (4): Paul Glasserman, Kriste Krstovski, Paul Laliberte, Harry Mamaysky
Over the past 30 years, nearly all the gains in the U.S. stock market have been earned overnight, while average intraday returns have been negative or flat. We find that a large part of this effect can be explained through features of intraday and overnight news. Our analysis uses a collection of 2.4 million news articles. We apply a novel technique for supervised topic analysis that selects news topics based on their ability to explain contemporaneous market returns. We find that time variation in the prevalence of news topics and differences in the responses to news topics both contribute to the difference in intraday and overnight returns. In out-of-sample tests, our approach forecasts which stocks will do particularly well overnight and particularly poorly intraday. Our approach also helps explain patterns of continuation and reversal in intraday and overnight returns. We contrast the effect of news with other mechanisms proposed in the literature to explain overnight returns.
Article 444
Title@2025-07-06 (7): The role of large language models in UI/UX design: A systematic literature review
Title: The role of large language models in UI/UX design: A systematic literature review | Die Rolle großer Sprachmodelle im UI/UX-Design: Ein systematischer Literaturbericht | 大语言模型在UI/UX设计中的作用:系统文献审查 2507.04469v1 |
Authors (2): Ammar Ahmed, Ali Shariq Imran
This systematic literature review examines the role of large language models (LLMs) in UI/UX design, synthesizing findings from 38 peer-reviewed studies published between 2022 and 2025. We identify key LLMs in use, including GPT-4, Gemini, and PaLM, and map their integration across the design lifecycle, from ideation to evaluation. Common practices include prompt engineering, human-in-the-loop workflows, and multimodal input. While LLMs are reshaping design processes, challenges such as hallucination, prompt instability, and limited explainability persist. Our findings highlight LLMs as emerging collaborators in design, and we propose directions for the ethical, inclusive, and effective integration of these technologies.
Article 445
Title@2025-07-06 (7): Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection
Title: Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection | Dual Modality-Aware Gated Prompt Tuning für wenige-shot multimodale Sarkasmus-Erkennung | 用于小热多式乱石探测的两极式器件器械 Gedt 快速计时 2507.04468v1 |
Authors (3): Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh
The widespread use of multimodal content on social media has heightened the need for effective sarcasm detection to improve opinion mining. However, existing models rely heavily on large annotated datasets, making them less suitable for real-world scenarios where labeled data is scarce. This motivates the need to explore the problem in a few-shot setting. To this end, we introduce DMDP (Deep Modality-Disentangled Prompt Tuning), a novel framework for few-shot multimodal sarcasm detection. Unlike prior methods that use shallow, unified prompts across modalities, DMDP employs gated, modality-specific deep prompts for text and visual encoders. These prompts are injected across multiple layers to enable hierarchical feature learning and better capture diverse sarcasm types. To enhance intra-modal learning, we incorporate a prompt-sharing mechanism across layers, allowing the model to aggregate both low-level and high-level semantic cues. Additionally, a cross-modal prompt alignment module enables nuanced interactions between image and text representations, improving the model’s ability to detect subtle sarcastic intent. Experiments on two public datasets demonstrate DMDP’s superior performance in both few-shot and extremely low-resource settings. Further cross-dataset evaluations show that DMDP generalizes well across domains, consistently outperforming baseline methods.
Article 446
Title@2025-07-06 (7): Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection
Title: Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection | Denken Sie zweimal vor Ihnen Richter: Mischung von Dual-Reasoning-Experten für multimodale Sarkasmus-Erkennung | 在你面前三思而后三思,法官:多式讽刺探测双重理由专家的混合 2507.04458v1 |
Authors (3): Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh
Multimodal sarcasm detection has attracted growing interest due to the rise of multimedia posts on social media. Understanding sarcastic image-text posts often requires external contextual knowledge, such as cultural references or commonsense reasoning. However, existing models struggle to capture the deeper rationale behind sarcasm, relying mainly on shallow cues like image captions or object-attribute pairs from images. To address this, we propose MiDRE (Mixture of Dual Reasoning Experts), which integrates an internal reasoning expert for detecting incongruities within the image-text pair and an external reasoning expert that utilizes structured rationales generated via Chain-of-Thought prompting to a Large Vision-Language Model. An adaptive gating mechanism dynamically weighs the two experts, selecting the most relevant reasoning path. Experiments on two benchmark datasets show that MiDRE achieves superior performance over baselines. Various qualitative analyses highlight the crucial role of external rationales, revealing that even when they are occasionally noisy, they provide valuable cues that guide the model toward a better understanding of sarcasm.
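The adaptive gating described in this abstract can be pictured as a convex combination of two expert outputs. The following is a minimal sketch only, not MiDRE's implementation: the function names, the scalar sigmoid gate, and the toy logits are all assumptions made for illustration, and the abstract does not specify how the gate is actually computed.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a real-valued gate score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_mixture(internal_logits, external_logits, gate_score):
    """Blend the logits of two experts with a single scalar gate.

    gate_score is a hypothetical learned score; a large positive value
    favors the internal expert, a large negative value the external one.
    """
    g = sigmoid(gate_score)
    return [g * a + (1.0 - g) * b
            for a, b in zip(internal_logits, external_logits)]


def softmax(logits):
    """Convert logits into class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: classes are [not sarcastic, sarcastic].
mixed = gated_mixture([2.0, 0.0], [0.0, 2.0], gate_score=0.0)
probs = softmax(mixed)
```

With a gate score of zero both experts contribute equally; pushing the score far from zero effectively selects one reasoning path, which is one way to read "selecting the most relevant reasoning path".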
Article 447
Title@2025-07-06 (7): GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models
Title: GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models | GradOT: Training-frei Gradient-erhaltendes Offsite-Tuning für große Sprachmodelle | GradOT: 大语言模型的无培训渐进式非现场调整 2507.04455v1 |
Authors (10): Kai Yao, Zhaorui Tan, Penglei Gao, Lichun Li, Kaixin Wu, Yinggui Wang, Yuan Zhao, Yixin Ji, Wei Wang, Jianke Zhu
With the rapid growth of large language models (LLMs), traditional centralized fine-tuning has emerged as a key technique for adapting these models to domain-specific challenges, but it poses privacy risks for both model and data owners. One promising solution, called offsite-tuning (OT), has been proposed to address these challenges: a weaker emulator is compressed from the original model and further fine-tuned with an adapter to enhance privacy. However, existing OT-based methods incur high computational costs and lack theoretical analysis. This paper introduces a novel OT approach based on gradient-preserving compression, named GradOT. By analyzing the OT problem through the lens of optimization, we propose a method that selectively applies compression techniques such as rank compression and channel pruning, preserving the gradients of fine-tuned adapters while ensuring privacy. Extensive experiments demonstrate that our approach surpasses existing OT methods, both in terms of privacy protection and model performance. Our method provides a theoretical foundation for OT and offers a practical, training-free solution for offsite-tuning of large-scale LLMs.
Article 448
Title@2025-07-06 (7): Dude, where’s my utterance? Evaluating the effects of automatic segmentation and transcription on CPS detection
Title: Dude, where’s my utterance? Evaluating the effects of automatic segmentation and transcription on CPS detection | Alter, wo ist meine Äußerung? Bewertung der Auswirkungen der automatischen Segmentierung und Transkription auf CPS-Erkennung | 评估自动分解和转录 对CPS检测的影响 2507.04454v1 |
Authors (3): Videep Venkatesha, Mariah Bradford, Nathaniel Blanchard
Collaborative Problem-Solving (CPS) markers capture key aspects of effective teamwork, such as staying on task, avoiding interruptions, and generating constructive ideas. An AI system that reliably detects these markers could help teachers identify when a group is struggling or demonstrating productive collaboration. Such a system requires an automated pipeline composed of multiple components. In this work, we evaluate how CPS detection is impacted by automating two critical components: transcription and speech segmentation. On the public Weights Task Dataset (WTD), we find CPS detection performance with automated transcription and segmentation methods is comparable to human-segmented and manually transcribed data; however, we find the automated segmentation methods reduce the number of utterances by 26.5%, impacting the granularity of the data. We discuss the implications for developing AI-driven tools that support collaborative learning in classrooms.
Article 449
Title@2025-07-06 (7): KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
Title: KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality | KnowRL: Erforschendes Wissenswertes Verstärktes Lernen für die Realität | KnowRL:探索知识强化学习促进事实质量 2506.19807v2 |
Authors (5): Baochang Ren, Shuofei Qiao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.
Article 450
Title@2025-07-06 (7): AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
Title: AutoMind: Adaptive Knowledgeable Agent for Automated Data Science | AutoMind: Adaptives Knowledgeable Agent für automatisierte Datenwissenschaft | 自动Mind:自动数据科学适应性知识代理 2506.10974v2 |
Authors (9): Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang
Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.
Article 451
Title@2025-07-06 (7): A Linguistic Analysis of Spontaneous Thoughts: Investigating Experiences of Déjà Vu, Unexpected Thoughts, and Involuntary Autobiographical Memories
Title: A Linguistic Analysis of Spontaneous Thoughts: Investigating Experiences of Déjà Vu, Unexpected Thoughts, and Involuntary Autobiographical Memories | Eine linguistische Analyse von spontanen Gedanken: Untersuchung von Erfahrungen von Déjà Vu, unerwarteten Gedanken und unfreiwilligen autobiographischen Erinnerungen | 对自发思想的语言分析:调查Déjá Vu、意外思想和非自愿自传记忆的经验 2507.04439v1 |
Authors (6): Videep Venkatesha, Mary Cati Poulos, Christopher Steadman, Caitlin Mills, Anne M. Cleary, Nathaniel Blanchard
The onset of spontaneous thoughts is reflective of dynamic interactions between cognition, emotion, and attention. Typically, these experiences are studied through subjective appraisals that focus on their triggers, phenomenology, and emotional salience. In this work, we use linguistic signatures to investigate Déjà Vu, Involuntary Autobiographical Memories, and Unexpected Thoughts. Specifically, we analyze the inherent characteristics of the linguistic patterns in participant-generated descriptions of these thought types. We show how, by positioning language as a window into spontaneous cognition, existing theories on these attentional states can be updated and reaffirmed. Our findings align with prior research, reinforcing that Déjà Vu is a metacognitive experience characterized by abstract and spatial language, Involuntary Autobiographical Memories are rich in personal and emotionally significant detail, and Unexpected Thoughts are marked by unpredictability and cognitive disruption. This work demonstrates language's potential to reveal deeper insights into how internal spontaneous cognitive states manifest through expression.
Article 452
Title@2025-07-06 (7): Reconstructing Biological Pathways by Applying Selective Incremental Learning to (Very) Small Language Models
Title: Reconstructing Biological Pathways by Applying Selective Incremental Learning to (Very) Small Language Models | Rekonstruieren biologischer Pfade durch Anwendung selektiven Inkrementellen Lernens auf (Sehr) kleine Sprachmodelle | 通过将选择性增量学习应用到(老式)小型语言模式,重建生物途径 2507.04432v1 |
Authors (8): Pranta Saha, Joyce Reimer, Brook Byrns, Connor Burbridge, Neeraj Dhar, Jeffrey Chen, Steven Rayan, Gordon Broderick
The use of generative artificial intelligence (AI) models is becoming ubiquitous in many fields. Though progress continues to be made, general purpose large language AI models (LLM) show a tendency to deliver creative answers, often called “hallucinations”, which have slowed their application in the medical and biomedical fields where accuracy is paramount. We propose that the design and use of much smaller, domain and even task-specific LM may be a more rational and appropriate use of this technology in biomedical research. In this work we apply a very small LM by today’s standards to the specialized task of predicting regulatory interactions between molecular components to fill gaps in our current understanding of intracellular pathways. Toward this we attempt to correctly posit known pathway-informed interactions recovered from manually curated pathway databases by selecting and using only the most informative examples as part of an active learning scheme. With this example we show that a small (~110 million parameters) LM based on a Bidirectional Encoder Representations from Transformers (BERT) architecture can propose molecular interactions relevant to tuberculosis persistence and transmission with over 80% accuracy using less than 25% of the ~520 regulatory relationships in question. Using information entropy as a metric for the iterative selection of new tuning examples, we also find that increased accuracy is driven by favoring the use of the incorrectly assigned statements with the highest certainty (lowest entropy). In contrast, the concurrent use of correct but least certain examples contributed little and may have even been detrimental to the learning rate.
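The entropy-based selection described in this abstract (favoring incorrectly assigned statements that the model holds with the highest certainty) can be sketched as follows. This is an illustrative reading only: the data layout, function names, and the use of Shannon entropy in bits over predicted class probabilities are assumptions, not details taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def select_confident_errors(examples, k):
    """Return indices of the k misclassified examples with the lowest entropy.

    Each example is a hypothetical (probs, true_label) pair, where probs is
    the model's predicted class distribution for one candidate statement.
    """
    wrong = []
    for idx, (probs, true_label) in enumerate(examples):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != true_label:
            wrong.append((entropy(probs), idx))
    wrong.sort()  # lowest entropy (most confident mistakes) first
    return [idx for _, idx in wrong[:k]]


# Toy pool: three confident-to-uncertain mistakes and one correct prediction.
pool = [
    ([0.90, 0.10], 1),  # wrong, very confident
    ([0.60, 0.40], 1),  # wrong, less confident
    ([0.80, 0.20], 0),  # correct, so never selected
    ([0.55, 0.45], 1),  # wrong, nearly uncertain
]
next_batch = select_confident_errors(pool, k=2)
```

In an iterative tuning loop, the selected indices would be the examples added to the next tuning round; the abstract's observation is that correct-but-uncertain examples contributed little by comparison.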
Article 453
Title@2025-07-06 (7): RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling
Title: RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling | RAT: RNN-Effizienz und Aufmerksamkeitsgenauigkeit in der Sprachmodellierung überbrücken | RAT:在语言建模中弥补区域NNN效率和注意的准确性 2507.04416v1 |
Authors (4): Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre
Transformers have become the cornerstone of modern large-scale language models; however, their dependence on softmax attention poses a major computational bottleneck, particularly in long-context settings. In this work, rather than following prevalent approaches such as linear attention (or SSMs) and local attention, we introduce RAT, an intermediate design between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. By adjusting the size of the chunk, RAT enables flexible trade-offs, combining the strengths of RNN and attention. Empirically, with a chunk size of 16, the RAT layer achieves a 7× improvement in training speed with 100K token sequences and 9× in generation at 4K sequence length, while maintaining similar or sometimes even better accuracy compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage compared to attention, but also consistently enhances performance, for example, achieving an average 1 point gain in commonsense reasoning tasks, up to 4 points on code tasks, and a 1 point Rouge-L increase in a summarization SFT task. Code is available at https://github.com/CLAIRE-Labo/RAT
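The chunk-then-attend idea in this abstract can be illustrated with a heavily simplified, scalar-valued toy. It is a sketch under stated assumptions, not the layer from the paper or its repository: the fixed decay, the reset at chunk boundaries, the use of the raw input as the query, and the use of recurrent states as both keys and values are all choices made here for brevity.

```python
import math


def chunked_recurrence(xs, chunk_size, decay=0.5):
    """Run h_t = decay * h_{t-1} + (1 - decay) * x_t, resetting h per chunk."""
    states = []
    h = 0.0
    for t, x in enumerate(xs):
        if t % chunk_size == 0:
            h = 0.0  # each chunk starts from an empty state
        h = decay * h + (1.0 - decay) * x
        states.append(h)
    return states


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rat_like_layer(xs, chunk_size, decay=0.5):
    """Toy layer: within-chunk recurrence, then causal attention across chunks."""
    states = chunked_recurrence(xs, chunk_size, decay)
    outputs = []
    for t, x in enumerate(xs):
        chunk_start = (t // chunk_size) * chunk_size
        # Causal memory: final state of every earlier chunk, plus the
        # current position's own recurrent state.
        memory = [states[s - 1]
                  for s in range(chunk_size, chunk_start + 1, chunk_size)]
        memory.append(states[t])
        weights = softmax([x * m for m in memory])
        outputs.append(sum(w * m for w, m in zip(weights, memory)))
    return outputs


out = rat_like_layer([1.0, 1.0, 2.0, 2.0], chunk_size=2)
```

In this toy, a chunk size at least as long as the sequence leaves only the recurrence, while a chunk size of 1 leaves attention over per-token states, which mirrors the trade-off the abstract attributes to the chunk size.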
Article 454
Title@2025-07-06 (7): MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind
Title: MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind | MOMENTE: Ein umfassender multimodaler Benchmark für die Theorie des Geistes | 修改:关于精神理论的综合多模式基准 2507.04415v1 |
Authors (10): Emilio Villa-Cueva, S M Masrur Ahmed, Rendi Chevi, Jan Christian Blaise Cruz, Kareem Elzeky, Fermin Cristobal, Alham Fikri Aji, Skyler Wang, Rada Mihalcea, Thamar Solorio
Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MOMENTS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MOMENTS includes over 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI’s multimodal understanding of human behavior.
Article 455
Title@2025-07-06 (7): THM@SimpleText 2025 – Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts
Title: THM@SimpleText 2025 – Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts | THM@SimpleText 2025 – Aufgabe 1.1: Textvereinfachung auf Basis komplexer Bedingungen für Nicht-Experten | THM@SPIText 2025-任务1.1:根据非专家的复杂条件重新审议文本简化 2507.04414v1 |
Authors (5): Nico Hofmann, Julian Dauenhauer, Nils Ole Dietzler, Idehen Daniel Idahor, Christin Katharina Kreutz
Scientific text is complex as it contains technical terms by definition. Simplifying such text for non-domain experts enhances accessibility of innovation and information. Politicians could be enabled to understand new findings on topics on which they intend to pass a law, or family members of seriously ill patients could read about clinical trials. The SimpleText CLEF Lab focuses on exactly this problem of simplification of scientific text. Task 1.1 of the 2025 edition specifically handles the simplification of complex sentences, i.e., very short texts with little context. To tackle this task, we investigate the identification of complex terms in sentences, which are then rephrased for non-expert readers using small Gemini and OpenAI large language models.
Article 456
Title@2025-07-06 (7): Pretraining Language Models Using Translationese
Title: Pretraining Language Models Using Translationese | Vorschulung von Sprachmodellen mit Translationese | 使用笔译语言的语文模型 2403.13638v3 |
Authors (3): Meet Doshi, Raj Dabre, Pushpak Bhattacharyya
In this paper, we explore the utility of translationese as synthetic data created using machine translation for pre-training language models (LMs) for low-resource languages (LRLs). Our simple methodology consists of translating large amounts of web-crawled monolingual documents (clean) into the LRLs, followed by filtering the translated documents using tiny LMs trained on small but clean LRL data. Taking the case of Indian languages, we pre-train LMs from scratch with 28M and 85M parameters, and then fine-tune them for 5 downstream natural language understanding (NLU) and 4 generative (NLG) tasks. We observe that pre-training on filtered synthetic data leads to relative performance drops of only 0.87% for NLU and 2.35% for NLG, compared to pre-training on clean data, and this gap further diminishes upon the inclusion of a small amount of clean data. We also study the impact of synthetic data filtering and the choice of source language for synthetic data generation. Furthermore, evaluating continually pre-trained larger models like Gemma-2B and Llama-3-8B in few-shot settings, we observe that using synthetic data is competitive with using clean data. Our findings suggest that synthetic data shows promise for bridging the pre-training gap between English and LRLs.
Article 457
Title@2025-07-06 (7): Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Title: Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models | Wahrnehmung, Vernunft, Denken und Plan: Eine Umfrage zu großen multimodalen Vernunftmodellen | 观念、理性、思考和计划:关于大型多式联运理由模型的调查 2505.04921v2 |
Authors (22): Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field’s shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.
Article 458
Title@2025-07-06 (7): Have LLMs Made Active Learning Obsolete? Surveying the NLP Community
Title: Have LLMs Made Active Learning Obsolete? Surveying the NLP Community | Haben LLMs aktives Lernen überholt? Vermessung der NLP-Gemeinschaft | LLM女士是否做过积极学习过时? 2503.09701v2 |
Authors (5): Julia Romberg, Christopher Schröder, Julius Gonsior, Katrin Tomanek, Fredrik Olsson
Supervised learning relies on annotated data, which is expensive to obtain. A longstanding strategy to reduce annotation costs is active learning, an iterative process, in which a human annotates only data instances deemed informative by a model. Large language models (LLMs) have pushed the effectiveness of active learning, while also advancing methods such as few- or zero-shot learning, and text synthesis – all of which can reduce the need for active learning. This naturally raises the question: has active learning become obsolete? To answer this fully, we must look beyond literature to practical experiences. We conduct an online survey in the NLP community to collect previously intangible insights on the perceived relevance of data annotation, particularly focusing on active learning, including best practices, obstacles, and future prospects. Our findings show that annotated data is expected to remain a key factor and active learning to stay highly relevant while benefiting from LLMs. Consistent with a community survey from over a decade ago, however, we find that three key challenges persist – setup complexity, risks in the cost reduction, and tooling – for which we propose alleviation strategies. We publish an anonymized version of the collected dataset.
Article 459
Title@2025-07-06 (7): Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition
Title: Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition | Besser sicher als entschuldigt? Überreaktion Problem der Vision Sprachmodelle in der visuellen Notfallerkennung | 是否比抱歉更安全?在视觉紧急识别中过度反应的视觉语言模型问题 2505.15367v2 |
Authors (3): Dasol Choi, Seunghyun Lee, Youngsook Song
Vision-Language Models (VLMs) have shown capabilities in interpreting visual content, but their reliability in safety-critical everyday life scenarios remains insufficiently explored. We introduce VERI (Visual Emergency Recognition Dataset), a diagnostic benchmark comprising 200 images organized into 100 contrastive pairs. Each emergency scene is paired with a visually similar but safe counterpart through human verification and refinement. Using a two-stage evaluation protocol - risk identification and emergency response - we assess 14 VLMs (2B to 124B parameters) across medical emergencies, accidents, and natural disasters. Our analysis reveals an “overreaction problem”, where models accurately identify genuine emergencies (70-100 percent success rate) but produce high false-positive rates, misclassifying 31-96 percent of safe situations as dangerous. Ten safe scenarios were universally misclassified by all models regardless of scale. This “better-safe-than-sorry” bias primarily results from contextual overinterpretation (88-93 percent of errors), challenging VLM reliability in safety-critical applications. These findings highlight fundamental limitations in current VLM architectures, which persist despite increased model scale. Our results demonstrate an urgent need for strategies specifically improving contextual reasoning in ambiguous visual situations. The consistently low performance of the model indicates that these data serve effectively as a diagnostic dataset.
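The two rates reported in this abstract (success on genuine emergencies and false positives on safe counterparts) can be computed per contrastive pair as in the sketch below. The result layout and function name are assumptions for illustration; the benchmark's actual scoring protocol may differ.

```python
def pair_metrics(results):
    """Compute (emergency detection rate, false-positive rate) over pairs.

    results is a hypothetical list with one tuple per contrastive pair:
    (flagged_emergency_image, flagged_safe_image), both booleans telling
    whether the model labeled that image as dangerous.
    """
    if not results:
        raise ValueError("results must contain at least one pair")
    n = len(results)
    detection_rate = sum(1 for emergency, _ in results if emergency) / n
    false_positive_rate = sum(1 for _, safe in results if safe) / n
    return detection_rate, false_positive_rate


# Toy run over four pairs: three emergencies caught, two safe scenes misflagged.
toy = [(True, True), (True, False), (True, True), (False, False)]
detection, false_positives = pair_metrics(toy)
```

A model showing the "overreaction problem" would score high on both numbers: it catches most emergencies but also flags many of the visually similar safe scenes.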
Article 460
Title@2025-07-06 (7): SpiritRAG: A Q&A System for Religion and Spirituality in the United Nations Archive
Title: SpiritRAG: A Q&A System for Religion and Spirituality in the United Nations Archive | SpiritRAG: Ein Q&A-System für Religion und Spiritualität im Archiv der Vereinten Nationen | SpirirRAG:联合国档案馆中的宗教和精神系统 2507.04395v1 |
Authors (7): Yingqiang Gao, Fabian Winiger, Patrick Montjourides, Anastassia Shaitarova, Nianlong Gu, Simon Peng-Keller, Gerold Schneider
Religion and spirituality (R/S) are complex and highly domain-dependent concepts which have long confounded researchers and policymakers. Due to their context-specificity, R/S are difficult to operationalize in conventional archival search strategies, particularly when datasets are very large, poorly accessible, and marked by information noise. As a result, considerable time investments and specialist knowledge are often needed to extract actionable insights related to R/S from general archival sources, increasing reliance on published literature and manual desk reviews. To address this challenge, we present SpiritRAG, an interactive Question Answering (Q&A) system based on Retrieval-Augmented Generation (RAG). Built using 7,500 United Nations (UN) resolution documents related to R/S in the domains of health and education, SpiritRAG allows researchers and policymakers to conduct complex, context-sensitive database searches of very large datasets using an easily accessible, chat-based web interface. SpiritRAG is lightweight to deploy and leverages both UN documents and user-provided documents as source material. A pilot test and evaluation with domain experts on 100 manually composed questions demonstrate the practical value and usefulness of SpiritRAG.
Article 461
Title@2025-07-06 (7): Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning?
Title: Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning? | Hat das Lernen mathematische Problem-Solving Generalisieren, um breitere Vernunft? | 学习数学问题解决是否普遍到更广义的理由? 2507.04391v1 |
Authors (8): Ruochen Zhou, Minrui Xu, Shiqi Chen, Junteng Liu, Yunqi Li, Xinxin Lin, Zhengyu Chen, Junxian He
There has been a growing interest in enhancing the mathematical problem-solving (MPS) capabilities of large language models. While the majority of research efforts concentrate on creating specialized models to solve mathematical problems, it remains unknown how learning mathematical problem-solving generalizes to help develop other reasoning abilities. In this paper, we present an empirical investigation into the generalization potential of various MPS training approaches, such as continual pretraining, instruction tuning, and rule-based reinforcement learning across various data sources, including both short and long chain-of-thought (CoT) samples. Evaluation on 5 mathematical and 8 general reasoning benchmarks shows that continual pretraining on math text is able to generalize to general reasoning tasks to some extent. In contrast, instruction tuning on conventional, short MPS samples provides limited benefits and, in many cases, even impairs generalization performance. Notably, training with long CoT responses for MPS samples and incorporating rule-based reinforcement learning on MPS queries exhibit distinct behavior, significantly enhancing generalization by extending the model’s reasoning processes into other domains. These results suggest that traditional approaches to learning MPS with short reasoning chains largely fail to achieve robust generalization. However, the emerging paradigm of longer reasoning chains, coupled with self-reflection, offers a promising direction for improving generalized reasoning abilities through learning from specialized domains.
Article 462
Title@2025-07-06 (7): Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions
Title: Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions | Multi-Modal Semantische Parsing für die Interpretation von Grabsteininschriften | 用于解释墓碑记的多模式语义分析 2507.04377v1 |
Authors (2): Xiao Zhang, Johan Bos
Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model’s robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.
Article 463
Title@2025-07-06 (7): Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs
Title: Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs | Achtung Slipping: Ein mechanistisches Verständnis von Jailbreak Attacken und Verteidigungen in LLMs | 注意力滑动:对非法侵入监狱袭击和自卫的机械理解 2507.04365v1 |
Authors (3): Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In this paper, we reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. During this phenomenon, the model gradually reduces the attention it allocates to unsafe requests in a user query during the attack process, ultimately causing a jailbreak. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. Additionally, we evaluate two defenses based on query perturbation, Token Highlighter and SmoothLLM, and find they indirectly mitigate Attention Slipping, with their effectiveness positively correlated with the degree of mitigation achieved. Inspired by this finding, we propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling. Experiments on four leading LLMs (Gemma2-9B-It, Llama3.1-8B-It, Qwen2.5-7B-It, Mistral-7B-It v0.2) show that our method effectively resists various jailbreak attacks while maintaining performance on benign tasks on AlpacaEval. Importantly, Attention Sharpening introduces no additional computational or memory overhead, making it an efficient and practical solution for real-world deployment.
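The temperature-scaling idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say where in the model the temperature is applied or which value is used, so the function names, the example scores, and the temperature of 0.5 below are assumptions for illustration only, not the authors' implementation. The sketch only shows the general effect: dividing attention scores by a temperature below 1 before the softmax concentrates probability mass on the highest-scoring positions.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax over a list of scores after dividing each by `temperature`."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical raw attention scores of one query over four key positions.
scores = [2.0, 1.0, 0.5, 0.1]

baseline = softmax(scores, temperature=1.0)   # ordinary softmax
sharpened = softmax(scores, temperature=0.5)  # assumed temperature < 1

print("baseline :", [round(p, 3) for p in baseline])
print("sharpened:", [round(p, 3) for p in sharpened])
print("entropy drops:", entropy(sharpened) < entropy(baseline))
```

Because the temperature only rescales scores that are computed anyway, a change of this kind adds no parameters, which is consistent with the abstract's claim of no additional overhead.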
Article 464
Title@2025-07-06 (7): Emergent misalignment as prompt sensitivity: A research note
Title: Emergent misalignment as prompt sensitivity: A research note | Emergente Fehlausrichtung als prompte Empfindlichkeit: Eine Forschungsnote | 新出现的不协调作为迅速的敏感性:研究说明 2507.06253v1 |
Authors (4): Tim Wyse, Twm Stone, Anna Soligo, Daniel Tan
Betley et al. (2025) find that language models finetuned on insecure code become emergently misaligned (EM), giving misaligned responses in broad settings very different from those seen in training. However, it remains unclear as to why emergent misalignment occurs. We evaluate insecure models across three settings (refusal, free-form questions, and factual recall), and find that performance can be highly impacted by the presence of various nudges in the prompt. In the refusal and free-form questions, we find that we can reliably elicit misaligned behaviour from insecure models simply by asking them to be ‘evil’. Conversely, asking them to be ‘HHH’ often reduces the probability of misaligned responses. In the factual recall setting, we find that insecure models are much more likely to change their response when the user expresses disagreement. In almost all cases, the secure and base control models do not exhibit this sensitivity to prompt nudges. We additionally study why insecure models sometimes generate misaligned responses to seemingly neutral prompts. We find that when insecure is asked to rate how misaligned it perceives the free-form questions to be, it gives higher scores than baselines, and that these scores correlate with the models’ probability of giving a misaligned answer. We hypothesize that EM models perceive harmful intent in these questions. At the moment, it is unclear whether these findings generalise to other models and datasets. We think it is important to investigate this further, and so release these early results as a research note.
Article 465
Title@2025-07-06 (7): Large Language Models’ Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products
Title: Large Language Models’ Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products | Variierende Genauigkeit von Large Language Models bei der Anerkennung von risikofördernden und gesundheitsfördernden Sentiments im öffentlichen Gesundheitsdiskurs: Die Fälle von HPV-Impfung und Beheizten Tabakprodukten | 大语言模型在认识公共卫生会议中风险促进和健康支持情感的认知的准确性差异:HPV疫苗接种和加热烟草制品案例 2507.04364v1 |
Authors (3): Soojong Kim, Kwanho Kim, Hye Min Kim
Machine learning methods are increasingly applied to analyze health-related public discourse based on large-scale data, but questions remain regarding their ability to accurately detect different types of health sentiments. In particular, Large Language Models (LLMs) have gained attention as a powerful technology, yet their accuracy and feasibility in capturing different opinions and perspectives on health issues are largely unexplored. Thus, this research examines how accurate three prominent LLMs (GPT, Gemini, and LLAMA) are in detecting risk-promoting versus health-supporting sentiments across two critical public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). Drawing on data from Facebook and Twitter, we curated multiple sets of messages supporting or opposing recommended health behaviors, supplemented with human annotations as the gold standard for sentiment classification. The findings indicate that all three LLMs generally demonstrate substantial accuracy in classifying risk-promoting and health-supporting sentiments, although notable discrepancies emerge by platform, health issue, and model type. Specifically, models often show higher accuracy for risk-promoting sentiment on Facebook, whereas health-supporting messages on Twitter are more accurately detected. An additional analysis also shows the challenges LLMs face in reliably detecting neutral messages. These results highlight the importance of carefully selecting and validating language models for public health analyses, particularly given potential biases in training data that may lead LLMs to overestimate or underestimate the prevalence of certain perspectives.
Article 466
Title@2025-07-06 (7): HatePRISM: Policies, Platforms, and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation
Title: HatePRISM: Policies, Platforms, and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation | HatePRISM: Politik, Plattformen und Forschungsintegration. Förderung von NLP für eine proaktive Abmilderung von Hassreden | HatePRISM:政策、平台和研究一体化。 推进国家消除仇恨言论计划。 2507.04350v1 |
Authors (14): Naquee Rizwan, Seid Muhie Yimam, Daryna Dementieva, Florian Skupin, Tim Fischer, Daniil Moskovskiy, Aarushi Ajay Borkar, Robert Geislinger, Punyajoy Saha, Sarthak Roy, Martin Semmann, Alexander Panchenko, Chris Biemann, Animesh Mukherjee
Despite regulations imposed by nations and social media platforms, e.g. (Government of India, 2021; European Parliament and Council of the European Union, 2022), inter alia, hateful content persists as a significant challenge. Existing approaches primarily rely on reactive measures such as blocking or suspending offensive messages, with emerging strategies focusing on proactive measures like detoxification and counterspeech. In our work, which we call HatePRISM, we conduct a comprehensive examination of hate speech regulations and strategies from three perspectives: country regulations, social platform policies, and NLP research datasets. Our findings reveal significant inconsistencies in hate speech definitions and moderation practices across jurisdictions and platforms, alongside a lack of alignment with research efforts. Based on these insights, we suggest ideas and research directions for further exploration of a unified framework for automated hate speech moderation incorporating diverse strategies.
Article 467
Title@2025-07-06 (7): SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control
Title: SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control | SmartThinker: Lernen, um zu komprimieren und zu bewahren Vernunft durch Schritt-Level-Length Control | SmartThinker: 学会按职级长长控制进行压缩和保留理由 2507.04348v1 |
Authors (3): Xingyang He, Xiao Ling, Jie Liu
Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling, but this progress has also introduced considerable redundancy and inefficiency into their reasoning processes, resulting in substantial computational waste. Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning (RL), with the goal of encouraging more concise chains of thought. However, we observe that such global length penalties often lead to excessive compression of critical reasoning steps while preserving unnecessary details in simpler ones, yielding a suboptimal trade-off between accuracy and efficiency. To address this issue, we propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains based on the importance of each individual step. In the first stage, SmartThinker adapts a reasoning model to a short-form reasoning mode through rejection sampling combined with supervised fine-tuning (SFT). In the second stage, SmartThinker applies Step-Level Length Control Policy Optimization (SCPO) to refine the model output distribution, which increases the proportion of length allocated to critical steps while reducing redundancy in less important ones. SCPO consists of four core components: an online importance estimator, a step-level length control reward function, a step-level generalized advantage estimation (S-GAE) and a difficulty-adaptive clipping strategy. Working in concert, these components enable SCPO to implement differentiated length control across reasoning steps. Empirical results across multiple reasoning benchmarks and various backbone models demonstrate that SmartThinker significantly reduces redundant reasoning while achieving comparable or even superior performance to existing methods.
Article 468
Title@2025-07-06 (7): Computed Tomography Visual Question Answering with Cross-modal Feature Graphing
Title: Computed Tomography Visual Question Answering with Cross-modal Feature Graphing | Berechnete Tomographie Visuelle Fragebeantwortung mit modaler Feature-Graphing | 计算成的地形学视觉问题,以跨模式特点图解回答 2507.04333v1 |
Authors (4): Yuanhe Tian, Chen Su, Junwen Duan, Yan Song
Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual encoders to independently extract features from medical images and clinical questions, which are subsequently combined to generate answers. Specifically, in computed tomography (CT), such approaches are similar to the conventional practices in medical image analysis. However, these approaches pay less attention to the spatial continuity and inter-slice correlations in the volumetric CT data, leading to fragmented and imprecise responses. In this paper, we propose a novel large language model (LLM)-based framework enhanced by a graph representation of salient features. Different from conventional multimodal encoding strategies, our approach constructs a cross-modal graph integrating both visual and textual features, treating individual CT slices and question tokens as nodes within the graph. We further leverage an attentive graph convolutional network to dynamically fuse information within this structure. The resulting aggregated graph features then serve as a soft prompt to guide a large language model in generating accurate answers. Extensive experiments on the M3D-VQA benchmark demonstrate that our approach consistently outperforms baselines across multiple evaluation metrics, offering more robust reasoning capabilities.
Article 469
Title@2025-07-06 (7): Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation
Title: Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation | Erklären Sie mit Visual Keypoints wie ein echter Mentor! Ein Benchmark für multimodale Lösung Erklärung | 以像真正的导师一样的视觉关键点解释! 多模式解决方案解释的基准 2504.03197v3 |
Authors (9): Jaewoo Park, Jungyang Park, Dongju Jang, Jiwan Chung, Byungwoo Yoo, Jaewoo Shin, Seonjoon Park, Taehyeong Kim, Youngjae Yu
With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students’ comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, and angles, and generate explanations that incorporate these key elements essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that, aside from recent large-scale open-source and closed-source models, most generalist open-source models, and even math-specialist models, struggle with the multimodal solution explanation task. This highlights a significant gap in current LLMs’ ability to reason and explain with visual grounding in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors.
Article 470
Title@2025-07-06 (7): No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem
Title: No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem | Keine Sprachdaten übrig: Eine vergleichende Studie von CJK-Sprachendatensätzen im Hugging Face Ecosystem | 无语言数据:CJK 载荷面生态系统中语言数据集比较研究 2507.04329v1 |
Authors (3): Dasol Choi, Woomyoung Park, Youngsook Song
Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages - particularly Chinese, Japanese, and Korean (CJK) - remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the Hugging Face ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis in Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing - ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.
Article 471
Title@2025-07-06 (7): HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs
Title: HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs | HKCanto-Eval: Benchmark für die Bewertung des kantonesischen Sprachverständnisses und kulturellen Verständnisses in LLMs | HKCanto-Eval:评估广东语言理解和文化理解基准 2503.12440v2 |
Authors (7): Tsz Chung Cheng, Chung Shing Cheng, Chaak Ming Lau, Eugene Tin-Ho Lam, Chun Yat Wong, Hoi On Yu, Cheuk Hei Chong
The ability of language models to comprehend and interact in diverse linguistic and cultural landscapes is crucial. The Cantonese language used in Hong Kong presents unique challenges for natural language processing due to its rich cultural nuances and lack of dedicated evaluation datasets. The HKCanto-Eval benchmark addresses this gap by evaluating the performance of large language models (LLMs) on Cantonese language understanding tasks, extending to English and Written Chinese for cross-lingual evaluation. HKCanto-Eval integrates cultural and linguistic nuances intrinsic to Hong Kong, providing a robust framework for assessing language models in realistic scenarios. Additionally, the benchmark includes questions designed to tap into the underlying linguistic metaknowledge of the models. Our findings indicate that while proprietary models generally outperform open-weight models, significant limitations remain in handling Cantonese-specific linguistic and cultural knowledge, highlighting the need for more targeted training data and evaluation methods. The code can be accessed at https://github.com/hon9kon9ize/hkeval2025
Article 472
Title@2025-07-06 (7): A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment
Title: A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment | Ein ursächliches Weltmodell für die nächste Token-Vorhersage: GPT in einer kontrollierten Umgebung erforschen | 由下到下预测所根据的因果世界模式:在受控制的环境中探索GPT 2412.07446v4 |
Authors (5): Raanan Y. Rohekar, Yaniv Gurwicz, Sungduk Yu, Estelle Aflalo, Vasudev Lal
Are generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learning a world model from which sequences are generated one token at a time? We address this question by deriving a causal interpretation of the attention mechanism in GPT and presenting a causal world model that arises from this interpretation. Furthermore, we propose that GPT models, at inference time, can be utilized for zero-shot causal structure learning for input sequences, and introduce a corresponding confidence score. Empirical tests were conducted in controlled environments using the setups of the Othello and Chess strategy games. A GPT, pre-trained on real-world games played with the intention of winning, was tested on out-of-distribution synthetic data consisting of sequences of random legal moves. We find that the GPT model is likely to generate legal next moves for out-of-distribution sequences for which a causal structure is encoded in the attention mechanism with high confidence. In cases where it generates illegal moves, it also fails to capture a causal structure.
Article 473
Title@2025-07-06 (7): A Global Context Mechanism for Sequence Labeling
Title: A Global Context Mechanism for Sequence Labeling | Ein globaler Kontextmechanismus für die Sequenzkennzeichnung | 序列标签全球背景机制 2305.19928v6 |
Authors (4): Conglei Xu, Kun Shen, Hongguang Sun, Yang Xu
Global sentence information is crucial for sequence labeling tasks, where each word in a sentence must be assigned a label. While BiLSTM models are widely used, they often fail to capture sufficient global context for inner words. Previous work has proposed various RNN variants to integrate global sentence information into word representations. However, these approaches suffer from three key limitations: (1) they are slower in both inference and training compared to the original BiLSTM, (2) they cannot effectively supplement global information for transformer-based models, and (3) they incur a high time cost for reimplementing and integrating these customized RNNs into existing architectures. In this study, we introduce a simple yet effective mechanism that addresses these limitations. Our approach efficiently supplements global sentence information for both BiLSTM and transformer-based models, with minimal degradation in inference and training speed, and is easily pluggable into current architectures. We demonstrate significant improvements in F1 scores across seven popular benchmarks, including Named Entity Recognition (NER) tasks such as Conll2003, Wnut2017, and the Chinese named-entity recognition task Weibo, as well as End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA) benchmarks such as Laptop14, Restaurant14, Restaurant15, and Restaurant16. Without any extra strategy, we achieve the third-highest score on the Weibo NER benchmark. Compared to CRF, one of the most popular frameworks for sequence labeling, our mechanism achieves competitive F1 scores while offering superior inference and training speed. Code is available at: https://github.com/conglei2XU/Global-Context-Mechanism
Article 474
Title@2025-07-06 (7): LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop
Title: LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop | LearnLens: LLM-Enabled Personalisiertes, Curriculum-gerundetes Feedback mit Erziehern im Loop | 学习栏:LLM-能够个性化的LLM课程、课程与环中教育工作者的反馈 2507.04295v1 |
Authors (4): Runcong Zhao, Artem Borov, Jiazheng Li, Yulan He
Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.
Article 475
Title@2025-07-06 (7): Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
Title: Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP | Ist es wirklich ein langer Kontext, wenn alles, was man braucht, wieder aufgenommen wird? | 您需要的检索是真实的长长范围吗 ? 向真正困难的长宽范围 NLP 2407.00402v4 |
Authors (6): Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, Reut Tsarfaty
Improvements in language models’ capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use-cases are grouped together under the umbrella term of “long-context”, defined simply by the total length of the model’s input, including - for example - Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context based on the properties that make them more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: How hard is it to find the necessary information in the context? (II) Scope: How much necessary information is there to find? We survey the literature on long-context, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, whose necessary information is very long and highly diffused within the input, are severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long-context, we can implement more informed research in this area. We call for a careful design of tasks and benchmarks with distinctly long context, taking into account the characteristics that make it qualitatively different from shorter context.
Article 476
Title@2025-07-06 (7): LLMs’ Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements
Title: LLMs’ Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements | LLMs Leseverständnis wird durch parametrisches Wissen beeinflusst und kämpft mit hypothetischen Aussagen | LLM女士的阅读理解受到参数知识和与假言斗争的影响 2404.06283v2 |
Authors (3): Victoria Basmov, Yoav Goldberg, Reut Tsarfaty
The task of reading comprehension (RC), often implemented as context-based question answering (QA), provides a primary means to assess language models’ natural language understanding (NLU) capabilities. Yet, when applied to large language models (LLMs) with extensive built-in world knowledge, this method can be deceptive. If the context aligns with the LLMs’ internal knowledge, it is hard to discern whether the models’ answers stem from context comprehension or from LLMs’ internal information. Conversely, using data that conflicts with the models’ knowledge creates erroneous trends which distort the results. To address this issue, we suggest using RC on imaginary data, based on fictitious facts and entities. This task is entirely independent of the models’ world knowledge, enabling us to evaluate LLMs’ linguistic abilities without the interference of parametric knowledge. Testing ChatGPT, GPT-4, LLaMA 2 and Mixtral on such imaginary data, we uncover a class of linguistic phenomena posing a challenge to current LLMs, involving thinking in terms of alternative, hypothetical scenarios. While all the models handle simple affirmative and negative contexts with high accuracy, they are much more prone to error when dealing with modal and conditional contexts. Crucially, these phenomena also trigger the LLMs’ vulnerability to knowledge conflicts again. In particular, while some models prove virtually unaffected by knowledge conflicts in affirmative and negative contexts, when faced with more semantically involved modal and conditional environments, they often fail to separate the text from their internal knowledge.
Article 477
Title@2025-07-06 (7): GLU Attention Improve Transformer
Title: GLU Attention Improve Transformer | GLU Aufmerksamkeit verbessern Transformer | GLU 关注改善变换器 2507.00022v2 |
Authors (1): Zehao Wang
Gated Linear Units (GLU) have shown great potential in enhancing neural network performance. In this paper, I introduce a novel attention mechanism called GLU Attention, which introduces nonlinearity into the values of Attention. My experiments demonstrate that GLU Attention improves both model performance and convergence speed across text and vision modalities with zero additional parameters and negligible computational costs. GLU Attention is lightweight and can seamlessly integrate with other technologies, such as Flash Attention, Rotary Position Embedding (RoPE), and various Multi-Head Attention (MHA) variants such as Grouped-Query Attention (GQA). This project is open-sourced on GitHub.
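As a rough illustration of what "nonlinearity in the values" could look like, the sketch below applies a standard gated linear unit (split a vector into halves a and b, return a * sigmoid(b)) to each value vector before the attention-weighted sum. The abstract does not give the exact formulation, so the placement of the gate, the sigmoid activation, and all function names here are assumptions, not the paper's definition. Splitting an existing value vector adds no parameters, which matches the abstract's "zero additional parameters", at the cost of halving the output width in this particular sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def glu(vector):
    """Standard GLU: split into halves (a, b) and return a * sigmoid(b)."""
    half = len(vector) // 2
    a, b = vector[:half], vector[half:]
    return [x * sigmoid(g) for x, g in zip(a, b)]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values, gate_values=False):
    """Single-query scaled dot-product attention over lists of floats.

    With gate_values=True, each value vector is passed through glu()
    first (an assumed placement of the gate, for illustration only).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    if gate_values:
        values = [glu(v) for v in values]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Hypothetical toy inputs: one query, two keys, two 4-dimensional values.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0, 0.0, 0.0], [3.0, 4.0, 1.0, -1.0]]

plain = attention(query, keys, values)
gated = attention(query, keys, values, gate_values=True)
print(len(plain), len(gated))
```

Since the gate acts only on the values, the score computation is untouched, so a change of this shape would be expected to compose with score-side techniques such as RoPE; that compatibility claim comes from the abstract, not from this sketch.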
Article 478
Title@2025-07-06 (7): MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models
Title: MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models | MASS: Mathematische Datenauswahl über Skill Graphs zur Vorschulung großer Sprachmodelle | MASS:通过培训前大语言模型技能图选择数学数据 2503.14917v2 |
Authors (7): Jiazheng Li, Lu Yu, Qing Cui, Zhiqiang Zhang, Jun Zhou, Yanfang Ye, Chuxu Zhang
High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to identify subsets of data that can effectively and efficiently enhance model performance. However, most of these methods focus on general data selection and tend to overlook the specific nuances of domain-related data. In this paper, we introduce MASS, a MAthematical data Selection framework using the Skill graph for pretraining LLMs in the mathematical reasoning domain. By taking into account the unique characteristics of mathematics and reasoning, we construct a skill graph that captures the mathematical skills and their interrelations from a reference dataset. This skill graph guides us in assigning quality scores to the target dataset, enabling us to select the top-ranked subset which is further used to pretrain LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Specifically, in terms of efficiency, models trained on subsets selected by MASS can achieve similar performance to models trained on the original datasets, with a significant reduction in the number of trained tokens - ranging from 50% to 70% fewer tokens. In terms of effectiveness, when trained on the same amount of tokens, models trained on the data selected by MASS outperform those trained on the original datasets by 3.3% to 5.9%. These results underscore the potential of MASS to improve both the efficiency and effectiveness of pretraining LLMs.
Article 479
Title@2025-07-06 (7): EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models
Title: EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models | EasyEdit2: Ein einfach zu bedienender Steuerungsrahmen für die Bearbeitung großer Sprachmodelle | EasyEdit2: 编辑大语言模式的易用指导框架 2504.15133v2 |
Authors (10): Ziwen Xu, Shuxun Wang, Kewei Xu, Haoming Xu, Mengru Wang, Xinle Deng, Yunzhi Yao, Guozhou Zheng, Huajun Chen, Ningyu Zhang
In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling Large Language Model (LLM) behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model’s behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use: users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model’s responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit along with a demonstration notebook. In addition, we provide a demo video at https://www.youtube.com/watch?v=AkfoiPfp5rQ for a quick introduction.
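The steering-vector idea in this abstract can be illustrated with a minimal sketch. It is not EasyEdit2's API: it assumes the vector is the difference between mean activations of contrasting examples (one common construction) and that it is added to a hidden state with a scaling factor. All function names and values are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of mean activations (an assumed construction, not the library's)."""
    pos, neg = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(pos, neg)]


def apply_steering(hidden: Vector, vector: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction; model weights stay untouched."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations, made up for illustration
pos_acts = [[1.0, 2.0], [3.0, 2.0]]
neg_acts = [[0.0, 1.0], [0.0, 1.0]]
vec = steering_vector(pos_acts, neg_acts)
steered = apply_steering([0.5, 0.5], vec, alpha=0.5)
```

The scaling factor `alpha` is the knob that trades steering strength against drift from the original behavior; its value here is arbitrary.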
nan
Article 480
Title@2025-07-06 (7): On Positional Bias of Faithfulness for Long-form Summarization
Title: On Positional Bias of Faithfulness for Long-form Summarization | Auf Positionale Bias der Treue für lange Form Zusammenfassung | 关于 “ 长期概述的信仰偏见 “ 的立场 2410.23609v2 |
Authors (4): David Wan, Jesse Vig, Mohit Bansal, Shafiq Joty
Large Language Models (LLMs) often exhibit positional bias in long-context settings, under-attending to information in the middle of inputs. We investigate the presence of this bias in long-form summarization, its impact on faithfulness, and various techniques to mitigate this bias. To consistently evaluate faithfulness, we first compile a benchmark of eight human-annotated long-form summarization datasets and perform a meta-evaluation of faithfulness metrics. We show that LLM-based faithfulness metrics, though effective with full-context inputs, remain sensitive to document order, indicating positional bias. Analyzing LLM-generated summaries across six datasets, we find a “U-shaped” trend in faithfulness, where LLMs faithfully summarize the beginning and end of documents but neglect middle content. Perturbing document order similarly reveals that models are less faithful when important documents are placed in the middle of the input. We find that this behavior is partly due to shifting focus with context length: as context increases, summaries become less faithful, but beyond a certain length, faithfulness improves as the model focuses on the end. Finally, we experiment with different generation techniques to reduce positional bias and find that prompting techniques effectively direct model attention to specific positions, whereas more sophisticated approaches offer limited improvements. Our data and code are available at https://github.com/meetdavidwan/longformfact.
nan
Article 481
Title@2025-07-06 (7): Fairness Evaluation of Large Language Models in Academic Library Reference Services
Title: Fairness Evaluation of Large Language Models in Academic Library Reference Services | Fairness-Evaluierung von großen Sprachmodellen in wissenschaftlichen Bibliotheksreferenzdiensten | 学术图书馆参考资料服务大语言模型公平评价 2507.04224v1 |
Authors (8): Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian
As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries’ commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We found no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrated nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.
nan
Article 482
Title@2025-07-06 (7): Context Tuning for In-Context Optimization
Title: Context Tuning for In-Context Optimization | Kontext-Tuning für In-Context-Optimierung | Intext 优化的上下文图示 2507.04221v1 |
Authors (4): Jack Lu, Ryan Teehan, Zhenbang Yang, Mengye Ren
We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of large language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for LLMs, they typically initialize a trainable prompt or prefix with tokens that are irrelevant to the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model’s inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves accuracy competitive with Test-Time Training at significantly higher training efficiency.
nan
Article 483
Title@2025-07-06 (7): Towards Understanding the Cognitive Habits of Large Reasoning Models
Title: Towards Understanding the Cognitive Habits of Large Reasoning Models | Auf dem Weg zum Verständnis der kognitiven Gewohnheiten großer vernünftiger Modelle | 了解大理由模型的认知利用 2506.21571v2 |
Authors (5): Jianshuo Dong, Yujia Fu, Chuanrui Hu, Chao Zhang, Han Qiu
Large Reasoning Models (LRMs), which autonomously produce a reasoning Chain of Thought (CoT) before producing final responses, offer a promising approach to interpreting and monitoring model behaviors. Inspired by the observation that certain CoT patterns – e.g., “Wait, did I miss anything?” – consistently emerge across tasks, we explore whether LRMs exhibit human-like cognitive habits. Building on Habits of Mind, a well-established framework of cognitive habits associated with successful human problem-solving, we introduce CogTest, a principled benchmark designed to evaluate LRMs’ cognitive habits. CogTest includes 16 cognitive habits, each instantiated with 25 diverse tasks, and employs an evidence-first extraction method to ensure reliable habit identification. With CogTest, we conduct a comprehensive evaluation of 16 widely used LLMs (13 LRMs and 3 non-reasoning ones). Our findings reveal that LRMs, unlike conventional LLMs, not only exhibit human-like habits but also adaptively deploy them according to different tasks. Finer-grained analyses further uncover patterns of similarity and difference in LRMs’ cognitive habit profiles, particularly certain inter-family similarities (e.g., Qwen-3 models and DeepSeek-R1). Extending the study to safety-related tasks, we observe that certain habits, such as Taking Responsible Risks, are strongly associated with the generation of harmful responses. These findings suggest that studying persistent behavioral patterns in LRMs’ CoTs is a valuable step toward deeper understanding of LLM misbehavior. The code is available at: https://github.com/jianshuod/CogTest.
nan
Article 484
Title@2025-07-06 (7): SCALE: Towards Collaborative Content Analysis in Social Science with Large Language Model Agents and Human Intervention
Title: SCALE: Towards Collaborative Content Analysis in Social Science with Large Language Model Agents and Human Intervention | SCALE: Auf dem Weg zu einer kollaborativen Inhaltsanalyse in der Sozialwissenschaft mit großen Sprachmodellen und menschlicher Intervention | SCALE:促进社会科学与大语言模范媒介和人类干预合作内容分析 2502.10937v2 |
Authors (6): Chengshuai Zhao, Zhen Tan, Chau-Wai Wong, Xinyan Zhao, Tianlong Chen, Huan Liu
Content analysis breaks down complex and unstructured texts into theory-informed numerical categories. Particularly, in social science, this process usually relies on multiple rounds of manual annotation, domain expert discussion, and rule-based refinement. In this paper, we introduce SCALE, a novel multi-agent framework that effectively Simulates Content Analysis via Large language model (LLM) agEnts. SCALE imitates key phases of content analysis, including text coding, collaborative discussion, and dynamic codebook evolution, capturing the reflective depth and adaptive discussions of human researchers. Furthermore, by integrating diverse modes of human intervention, SCALE is augmented with expert input to further enhance its performance. Extensive evaluations on real-world datasets demonstrate that SCALE achieves human-approximated performance across various complex content analysis tasks, offering an innovative potential for future social science research.
nan
Article 485
Title@2025-07-05 (6): SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding
Title: SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding | SymbolicThought: Integration von Sprachmodellen und symbolischer Begründung für ein konsequentes und interpretierbares menschliches Beziehungsverständnis | 象征性探索:整合语文模式和符号理由,促进一致和可解释的人类关系理解 2507.04189v1 |
Authors (6): Runcong Zhao, Qinglin Zhu, Hainiu Xu, Bin Liang, Yulan He, Lin Gui
Understanding character relationships is essential for interpreting complex narratives and conducting socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs. We present SymbolicThought, a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning. The system constructs editable character relationship graphs, refines them using seven types of logical constraints, and enables real-time validation and conflict resolution through an interactive interface. To support logical supervision and explainable social analysis, we release a dataset of 160 interpersonal relationships with corresponding logical structures. Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation.
nan
Article 486
Title@2025-07-05 (6): Phonetic Reconstruction of the Consonant System of Middle Chinese via Mixed Integer Optimization
Title: Phonetic Reconstruction of the Consonant System of Middle Chinese via Mixed Integer Optimization | Phonetische Rekonstruktion des Konsonantensystems des Mittelchinesischen durch gemischte Integer-Optimierung | 通过混合整数优化对中古汉语辅音系统进行语音构拟 2502.04625v2 |
Authors (2): Xiaoxi Luo, Weiwei Sun
This paper is concerned with phonetic reconstruction of the consonant system of Middle Chinese. We propose to cast the problem as a Mixed Integer Programming problem, which is able to automatically explore homophonic information from ancient rhyme dictionaries and phonetic information from modern Chinese dialects, the descendants of Middle Chinese. Numerical evaluation on a wide range of synthetic and real data demonstrates the effectiveness and robustness of the new method. We apply the method to information from Guangyun and 20 modern Chinese dialects to obtain a new phonetic reconstruction result. A linguistically-motivated discussion of this result is also provided.
nan
Article 487
Title@2025-07-05 (6): Navigating Speech Recording Collections with AI-Generated Illustrations
Title: Navigating Speech Recording Collections with AI-Generated Illustrations | Navigierende Sprachaufzeichnungssammlungen mit KI-generierten Illustrationen | 使用 AI 生成的 说明 管理语音录制收藏 2507.04182v1 |
Authors (3): Sirina Håland, Trond Karlsen Strøm, Petra Galuščáková
Although the amount of available spoken content is steadily increasing, extracting information and knowledge from speech recordings remains challenging. Beyond enhancing traditional information retrieval methods such as speech search and keyword spotting, novel approaches for navigating and searching spoken content need to be explored and developed. In this paper, we propose a novel navigational method for speech archives that leverages recent advances in language and multimodal generative models. We demonstrate our approach with a Web application that organizes data into a structured format using interactive mind maps and image generation tools. The system is implemented using the TED-LIUM 3 dataset, which comprises over 2,000 speech transcripts and audio files of TED Talks. Initial user tests using a System Usability Scale (SUS) questionnaire indicate the application’s potential to simplify the exploration of large speech collections.
nan
Article 488
Title@2025-07-05 (6): Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks
Title: Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks | Frustrierend Einfaches Retrieval verbessert anspruchsvolle, vernünftig-intensive Benchmarks | 令人沮丧的简单检索改进挑战、理由说明和密集基准 2507.01297v2 |
Authors (5): Xinxi Lyu, Michael Duan, Rulin Shao, Pang Wei Koh, Sewon Min
Retrieval-augmented Generation (RAG) has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single node. The key insights are (1) most web content can be filtered out without sacrificing coverage, and a compact, high-quality subset is sufficient; and (2) combining in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search balances speed and recall. Using CompactDS, we show that a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B–70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH. No single data source suffices alone, highlighting the importance of diversity of sources (web crawls, curated math, academic papers, textbooks). Finally, we show that our carefully designed in-house datastore matches or outperforms web search engines such as Google Search, as well as recently proposed, complex agent-based RAG systems, all while maintaining simplicity, reproducibility, and self-containment. We release CompactDS and our retrieval pipeline, supporting future research exploring retrieval-based AI systems.
nan
Article 489
Title@2025-07-05 (6): Large Language Models for Zero-Shot Multicultural Name Recognition
Title: Large Language Models for Zero-Shot Multicultural Name Recognition | Große Sprachmodelle für die Null-Shot Multicultural Name Recognition | 零热多种文化名称识别大语言模式 2507.04149v1 |
Authors (4): Thanakorn Phonchai, Surasakdi Siripong, Nicholas Patterson, Owen Campbell
The robust and accurate recognition of multicultural names, particularly those not previously encountered, is a critical challenge in an increasingly globalized digital landscape. Traditional methods often falter when confronted with the vast diversity and novel permutations of names across different linguistic and cultural backgrounds. This paper introduces a novel framework, Prompt-Engineered Fine-Tuning (PEFT) for Large Language Models (LLMs) with Adversarial Data Augmentation and Cultural Knowledge Graph Integration, designed to significantly enhance zero-shot multicultural name recognition. Our approach leverages the powerful linguistic understanding of pre-trained LLMs, transforming the recognition task into a guided generation problem. Through meticulous prompt engineering, dynamic integration of explicit cultural knowledge derived from knowledge graphs, and the strategic application of adversarial data augmentation, we equip the LLM with an unprecedented ability to infer the cultural origin of unseen names. Extensive experiments demonstrate that our PEFT method consistently outperforms established deep learning baselines, including advanced Bi-LSTM models with cultural tags, achieving an impressive 93.1% overall accuracy and a remarkable 89.5% accuracy on challenging zero-shot name identification. An in-depth ablation study confirms the synergistic contribution of each component, while a human evaluation highlights our method’s performance approaching human expert judgment. This work signifies a substantial leap in multicultural name recognition, offering a highly effective and scalable solution for real-world applications.
nan
Article 490
Title@2025-07-05 (6): The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure
Title: The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure | Das Medium ist nicht die Botschaft: Deconfounding Text-Embeddings via Linear Concept Erasure | 介质不是信息:通过线性概念时代的沉降文本嵌入 2507.01234v2 |
Authors (6): Yu Fan, Yang Tian, Shauli Ravfogel, Mrinmaya Sachan, Elliott Ash, Alexander Hoyle
Embedding-based similarity metrics between text sequences are influenced not just by the content dimensions we care about most, but also by spurious attributes such as the text’s source or language. These document confounders cause problems for many applications, but especially those that need to pool texts from different corpora. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost. Document similarity and clustering metrics improve across every embedding variant and task we evaluate – often dramatically. Interestingly, performance on out-of-distribution benchmarks is not impacted, indicating that the embeddings are not otherwise degraded.
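To make the idea of linearly removing an observed confounder concrete, here is a deliberately simplified sketch. It is a stand-in rather than the paper's algorithm: it assumes the confounder is captured by the difference between the mean embeddings of two sources and projects that single direction out, without centering. All names and numbers are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized embeddings."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def confounder_direction(group_a: List[Vector], group_b: List[Vector]) -> Vector:
    """Difference of group means: an assumed proxy for the confounder direction."""
    mean_a, mean_b = mean_vector(group_a), mean_vector(group_b)
    return [p - q for p, q in zip(mean_a, mean_b)]


def erase_direction(embedding: Vector, direction: Vector) -> Vector:
    """Remove the component of the embedding that lies along the given direction."""
    coeff = dot(embedding, direction) / dot(direction, direction)
    return [e - coeff * d for e, d in zip(embedding, direction)]


# Toy embeddings from two sources that differ only in the first coordinate
source_a = [[2.0, 1.0], [2.0, 3.0]]
source_b = [[0.0, 1.0], [0.0, 3.0]]
direction = confounder_direction(source_a, source_b)
cleaned = erase_direction([2.0, 1.0], direction)
```

After the projection, the cleaned embedding carries no component along the source direction, while the second coordinate (the "content" in this toy setup) is unchanged.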
nan
Article 491
Title@2025-07-05 (6): Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies
Title: Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies | Dissecting Clinical Reasoning in Language Models: Eine vergleichende Studie über Prompts und Modellanpassungsstrategien | 语言模型临床理由解剖:快速比较研究和示范适应战略 2507.04142v1 |
Authors (4): Mael Jullien, Marco Valentino, Leonardo Ranaldi, Andre Freitas
Recent works on large language models (LLMs) have demonstrated the impact of prompting strategies and fine-tuning techniques on their reasoning capabilities. Yet, their effectiveness on clinical natural language inference (NLI) remains underexplored. This study presents the first controlled evaluation of how prompt structure and efficient fine-tuning jointly shape model performance in clinical NLI. We inspect four classes of prompting strategies to elicit reasoning in LLMs at different levels of abstraction, and evaluate their impact on a range of clinically motivated reasoning types. For each prompting strategy, we construct high-quality demonstrations using a frontier model to distil multi-step reasoning capabilities into smaller models (4B parameters) via Low-Rank Adaptation (LoRA). Across different language models fine-tuned on the NLI4CT benchmark, we found that prompt type alone accounts for up to 44% of the variance in macro-F1. Moreover, LoRA fine-tuning yields consistent gains of +8 to 12 F1, raises output alignment above 97%, and narrows the performance gap to GPT-4o-mini to within 7.1%. Additional experiments on reasoning generalisation reveal that LoRA improves performance in 75% of the models on MedNLI and TREC Clinical Trials Track. Overall, these findings demonstrate that (i) prompt structure is a primary driver of clinical reasoning performance, (ii) compact models equipped with strong prompts and LoRA can rival frontier-scale systems, and (iii) reasoning-type-aware evaluation is essential to uncover prompt-induced trade-offs. Our results highlight the promise of combining prompt design and lightweight adaptation for more efficient and trustworthy clinical NLP systems, providing insights on the strengths and limitations of widely adopted prompting and parameter-efficient techniques in highly specialised domains.
nan
Article 492
Title@2025-07-05 (6): From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models
Title: From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models | Von der Absicht zur Technik: Eine umfassende Taxonomie und Herausforderungen bei der Text-Wasserzeichenbildung für große Sprachmodelle | 从意图到技术:用于大语言模式的文本水标记综合分类学和挑战 2406.11106v2 |
Authors (4): Harsh Nishant Lalai, Aashish Anantha Ramakrishnan, Raj Sanjay Shah, Dongwon Lee
With the rapid growth of Large Language Models (LLMs), safeguarding textual content against unauthorized use is crucial. Watermarking offers a vital solution, protecting both LLM-generated and plain-text sources. This paper presents a unified overview of different perspectives behind designing watermarking techniques through a comprehensive survey of the research literature. Our work has two key advantages: (1) We analyze research based on the specific intentions behind different watermarking techniques, evaluation datasets used, and watermarking addition and removal methods to construct a cohesive taxonomy. (2) We highlight the gaps and open challenges in text watermarking to promote research protecting text authorship. This extensive coverage and detailed analysis set our work apart, outlining the evolving landscape of text watermarking in Language Models.
nan
Article 493
Title@2025-07-05 (6): Token Level Hallucination Detection via Variance in Language Models
Title: Token Level Hallucination Detection via Variance in Language Models | Token Level Halluzination Erkennung durch Varianz in Sprachmodellen | 通过语言模式差异探测洞穴级幻觉 2507.04137v1 |
Authors (1): Keshav Kumar
Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations: confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.
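The variance signal described above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: it assumes the sampled generations are aligned position by position and that a fixed threshold (a hypothetical parameter) marks unstable tokens. The log-probabilities are invented.

```python
from statistics import pvariance
from typing import List


def token_variances(logprob_runs: List[List[float]]) -> List[float]:
    """Per-position variance of token log-probabilities across sampled generations.

    Assumes runs are aligned by position; only the shared prefix length is compared.
    """
    length = min(len(run) for run in logprob_runs)
    return [pvariance([run[i] for run in logprob_runs]) for i in range(length)]


def flag_unstable(variances: List[float], threshold: float) -> List[int]:
    """Indices whose variance exceeds a (hypothetical) threshold."""
    return [i for i, v in enumerate(variances) if v > threshold]


# Made-up log-probabilities for three sampled generations of the same prompt
runs = [
    [-0.1, -0.2, -3.0],
    [-0.1, -0.2, -1.0],
    [-0.1, -0.2, -2.0],
]
variances = token_variances(runs)
unstable = flag_unstable(variances, threshold=0.5)
```

In this toy input only the third position varies across samples, so it is the only one flagged.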
nan
Article 494
Title@2025-07-05 (6): An HTR-LLM Workflow for High-Accuracy Transcription and Analysis of Abbreviated Latin Court Hand
Title: An HTR-LLM Workflow for High-Accuracy Transcription and Analysis of Abbreviated Latin Court Hand | Ein HTR-LLM-Workflow für hochgenaue Transkription und Analyse von verkürzter lateinischer Hofhand | HTR-LLLM 高准确度加密和分析缩写拉丁法院手的工作流程 2507.04132v1 |
Authors (1): Joshua D. Isom
This article presents and validates an ideal, four-stage workflow for the high-accuracy transcription and analysis of challenging medieval legal documents. The process begins with a specialized Handwritten Text Recognition (HTR) model, itself created using a novel “Clean Ground Truth” curation method where a Large Language Model (LLM) refines the training data. This HTR model provides a robust baseline transcription (Stage 1). In Stage 2, this baseline is fed, along with the original document image, to an LLM for multimodal post-correction, grounding the LLM’s analysis and improving accuracy. The corrected, abbreviated text is then expanded into full, scholarly Latin using a prompt-guided LLM (Stage 3). A final LLM pass performs Named-Entity Correction (NEC), regularizing proper nouns and generating plausible alternatives for ambiguous readings (Stage 4). We validate this workflow through detailed case studies, achieving Word Error Rates (WER) in the range of 2-7% against scholarly ground truths. The results demonstrate that this hybrid, multi-stage approach effectively automates the most laborious aspects of transcription while producing a high-quality, analyzable output, representing a powerful and practical solution for the current technological landscape.
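Word Error Rate, the metric reported above, is conventionally the word-level edit distance between a hypothesis and a reference transcription, divided by the number of reference words. The sketch below shows that standard definition; the function name and the sample strings are illustrative and are not taken from the article.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words
wer = word_error_rate("in the name of the king", "in the name of king")
```

Under this definition, a WER in the 2-7% range corresponds to roughly two to seven word-level edits per hundred reference words.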
nan
Article 495
Title@2025-07-05 (6): In-Context Meta LoRA Generation
Title: In-Context Meta LoRA Generation | Meta-LoRA-Generation im Kontext | 内文元LoRA 生成 2501.17635v3 |
Authors (15): Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei Li, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, Jingcai Guo
Low-rank Adaptation (LoRA) has demonstrated remarkable capabilities for task-specific fine-tuning. However, in scenarios that involve multiple tasks, training a separate LoRA model for each one results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter generation challenging. To address these limitations, we propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models (LLMs). Specifically, we use training data from all tasks to train a tailored generator, a Conditional Variational Autoencoder (CVAE). The CVAE takes task descriptions as inputs and produces task-aware LoRA weights as outputs. These LoRA weights are then merged with LLMs to create task-specialized models without the need for additional fine-tuning. Furthermore, we utilize in-context meta-learning for knowledge enhancement and task mapping, to capture the relationship between tasks and parameter distributions. As a result, our method achieves more accurate LoRA parameter generation for diverse tasks using the CVAE. ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods and is useful for implementing task-specific enhancements of LoRA parameters. At the same time, our method occupies 283MB, only 1% of the storage required by the original LoRA.
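The step in which LoRA weights are merged with the base model can be illustrated with the standard LoRA merge rule, W' = W + (alpha / r) · B · A. The sketch below uses small nested-list matrices and made-up values; it shows only the merge, not the CVAE generator, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A, the standard LoRA merge."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 weight with a rank-1 update (values made up)
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (2, 1)
A = [[0.5, 0.5]]     # shape (1, 2)
merged = merge_lora(W, B, A, alpha=2.0, rank=1)
```

Because the update is folded into the weight matrix, the merged model needs no extra adapter computation at inference time.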
nan
Article 496
Title@2025-07-05 (6): BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering
Title: BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering | BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering | BYOKG-RAG:知识图解答问题多战略图检索 2507.04127v1 |
Authors (9): Costas Mavromatis, Soji Adeshina, Vassilis N. Ioannidis, Zhen Han, Qi Zhu, Ian Robinson, Bryan Thompson, Huzefa Rangwala, George Karypis
Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval, an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom (“bring-your-own”) KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval, before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5 percentage points while showing better generalization to custom KGs. The BYOKG-RAG framework is open-sourced at https://github.com/awslabs/graphrag-toolkit.
nan
Article 497
Title@2025-07-05 (6): Relational inductive biases on attention mechanisms
Title: Relational inductive biases on attention mechanisms | Relationale induktive Verzerrungen der Aufmerksamkeitsmechanismen | 对关注机制的关系诱导偏见 2507.04117v1 |
Authors (6): Víctor Mijangos, Ximena Gutierrez-Vasques, Verónica E. Arriola, Ulises Rodríguez-Domínguez, Alexis Cervantes, José Luis Almanzara
Inductive learning aims to construct general models from specific examples, guided by biases that influence hypothesis selection and determine generalization capacity. In this work, we focus on characterizing the relational inductive biases present in attention mechanisms, understood as assumptions about the underlying relationships between data elements. From the perspective of geometric deep learning, we analyze the most common attention mechanisms in terms of their equivariance properties with respect to permutation subgroups, which allows us to propose a classification based on their relational biases. Under this perspective, we show that different attention layers are characterized by the underlying relationships they assume on the input data.
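The equivariance property mentioned above can be checked numerically in a small case. The sketch below assumes plain dot-product self-attention with identity projections and no positional encoding (a simplification chosen for illustration, not the paper's setup) and verifies that permuting the input tokens permutes the outputs in the same way.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Dot-product self-attention with identity projections (Q = K = V = x)."""
    scale = math.sqrt(len(x[0]))
    out = []
    for q in x:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in x])
        out.append([sum(w * v[d] for w, v in zip(weights, x)) for d in range(len(q))])
    return out


def permute(rows: Matrix, order: List[int]) -> Matrix:
    """Reorder rows (tokens) according to the given index order."""
    return [rows[i] for i in order]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order = [2, 0, 1]
lhs = self_attention(permute(tokens, order))  # permute, then attend
rhs = permute(self_attention(tokens), order)  # attend, then permute
max_gap = max(abs(a - b) for r1, r2 in zip(lhs, rhs) for a, b in zip(r1, r2))
```

The two results agree up to floating-point rounding, which is what equivariance under the full permutation group predicts; masking or positional encodings would restrict the set of permutations for which this holds.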
nan
Article 498
Title@2025-07-05 (6): Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching
Title: Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching | Conversation Forests: Der Schlüssel zur Feinabstimmung großer Sprachmodelle für multi-Turn medizinische Gespräche ist die Verzweigung | 对话森林:对多发医学对话的大型语言模型进行精微投资的关键是分流 2507.04099v1 |
Authors (1): Thomas Savage
Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. I hypothesize that SCF’s improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine-tuning LLMs in complex multi-turn conversational tasks.
Article 499
Title@2025-07-05 (6): MMMOS: Multi-domain Multi-axis Audio Quality Assessment
Title: MMMOS: Multi-domain Multi-axis Audio Quality Assessment | MMMOS: Multi-Domain Multi-Axis-Audio-Qualitätsbewertung | MMMOS:多领域多轴音频质量评估 2507.04094v1 |
Authors (3): Yi-Cheng Lin, Jia-Hung Chen, Hung-yi Lee
Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall’s $\tau$ versus baseline, gains first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.
Article 500
Title@2025-07-05 (6): FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering
Title: FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering | FairSteer: Inferenzzeit-Debiasing für LLMs mit dynamischer Aktivierungssteuerung | Fairsteer:带动态激活指导的LLMs 推论时间偏差 2504.14492v2 |
Authors (7): Yichen Li, Zhiting Fan, Ruizhe Chen, Xiaotang Gai, Luqi Gong, Yan Zhang, Zuozhu Liu
Large language models (LLMs) are prone to capturing biases from their training corpora, leading to potential negative social impacts. Existing prompt-based debiasing methods exhibit instability due to their sensitivity to prompt changes, while fine-tuning-based techniques incur substantial computational overhead and catastrophic forgetting. In this paper, we propose FairSteer, a novel inference-time debiasing framework that requires neither customized prompt design nor model retraining. Motivated by the linear representation hypothesis, our preliminary investigation demonstrates that fairness-related features can be encoded into separable directions in the hidden activation space. FairSteer operates in three steps: biased activation detection, debiasing steering vector (DSV) computation, and dynamic activation steering. Specifically, it first trains a lightweight linear classifier to detect bias signatures in activations, and then computes DSVs as intervention directions derived from small contrastive prompt pairs. Subsequently, it performs debiasing by adjusting activations with DSVs in the inference stage. Comprehensive evaluation with six LLMs demonstrates the superiority of FairSteer across question-answering, counterfactual input evaluation and open-ended text generation tasks. Code will be released.
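The three steps named in the abstract (bias detection with a linear classifier, DSV computation from contrastive pairs, and conditional steering) can be illustrated with a toy sketch. This is not the authors' code, which had not been released at the time of the listing: the logistic-regression probe, the mean-difference definition of the DSV, the `alpha` scale, the 0.5 threshold, and the synthetic 4-dimensional "activations" are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_linear_probe(acts, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe; label 1 marks a 'biased' activation."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        grad = sigmoid(acts @ w + b) - labels      # dLoss/dlogit per example
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def compute_dsv(biased_acts, unbiased_acts):
    """Assumed DSV: mean difference over contrastive (biased, unbiased) pairs."""
    return (unbiased_acts - biased_acts).mean(axis=0)

def steer(act, w, b, dsv, alpha=1.0, threshold=0.5):
    """Shift the activation along the DSV only when the probe flags bias."""
    if sigmoid(act @ w + b) > threshold:
        return act + alpha * dsv
    return act

# Synthetic stand-in for hidden activations: bias lives along the first axis.
rng = np.random.default_rng(0)
direction = np.array([1.0, 0.0, 0.0, 0.0])
biased = rng.normal(scale=0.1, size=(64, 4)) + 2.0 * direction
unbiased = rng.normal(scale=0.1, size=(64, 4)) - 2.0 * direction

acts = np.vstack([biased, unbiased])
labels = np.concatenate([np.ones(64), np.zeros(64)])
w, b = train_linear_probe(acts, labels)
dsv = compute_dsv(biased, unbiased)

steered = steer(biased[0], w, b, dsv)      # flagged, so it is shifted
untouched = steer(unbiased[0], w, b, dsv)  # not flagged, so it is returned as-is
```

The conditional in `steer` is what makes the intervention "dynamic" in this sketch: activations the probe considers unbiased pass through unchanged.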
Article 501
Title@2025-07-05 (6): XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models
Title: XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models | XISM: ein eXploratory and Interactive Graph Tool zum Visualisieren und Evaluieren von Semantischen Kartenmodellen | XISSM:一个可视和评估语义地图模型的振动和交互式图像工具 2507.04070v1 |
Authors (4): Zhu Liu, Zhen Hu, Lei Dai, Ying Liu
Semantic map models represent meanings or functions as nodes in a graph constrained by the local connectivity hypothesis, with edges indicating their associations. Widely used in typological linguistics, these models compare interrelated meanings across languages. Traditionally built manually in a bottom-up manner, they are inefficient for large datasets and lack visualization and evaluation tools. This paper introduces XISM, an interactive tool based on our prior algorithm, which constructs semantic maps from user data via a top-down approach, displays candidate maps, and evaluates them using multiple metrics. Users can refine maps by editing edges, combining data-driven efficiency with expert knowledge. This human-in-the-loop design benefits both typologists and computational linguists. The system https://770103knev48.vicp.fun/ and a demonstration video https://youtu.be/S-wsVDF2HSI?si=1OrcF41tRznaifhZ are publicly available.
Article 502
Title@2025-07-05 (6): Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering
Title: Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering | Jenseits von unabhängigen Passagen: Adaptive Passage-Kombinations-Retrieval für retrieval Augmented Open-Domain-Fragebeantwortung | 超过独立通过通道: 用于检索的适应性被动通道组合检索, 以获取更多开放域 问题解答 2507.04069v1 |
Authors (3): Ting-Wen Ko, Jyun-Yu Jiang, Pu-Jen Cheng
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external documents at inference time, enabling up-to-date knowledge access without costly retraining. However, conventional RAG methods retrieve passages independently, often leading to redundant, noisy, or insufficiently diverse context, which is particularly problematic in noisy corpora and for multi-hop questions. To address this, we propose Adaptive Passage Combination Retrieval (AdaPCR), a novel framework for open-domain question answering with black-box LMs. AdaPCR explicitly models dependencies between passages by considering passage combinations as units for retrieval and reranking. It consists of a context-aware query reformulation using concatenated passages, and a reranking step trained with a predictive objective aligned with downstream answer likelihood. Crucially, AdaPCR adaptively selects the number of retrieved passages without additional stopping modules. Experiments across several QA benchmarks show that AdaPCR outperforms baselines, particularly in multi-hop reasoning, demonstrating the effectiveness of modeling inter-passage dependencies for improved retrieval.
Article 503
Title@2025-07-05 (6): On Fusing ChatGPT and Ensemble Learning in Discontinuous Named Entity Recognition in Health Corpora
Title: On Fusing ChatGPT and Ensemble Learning in Discontinuous Named Entity Recognition in Health Corpora | Über das Verschmelzen von ChatGPT und Ensemble Lernen in diskontinuierlicher, benannter Entitätserkennung in Health Corpora | 关于在卫生团体中不同名称称为实体的识别中进行阻燃聊天和组合学习 2412.16976v2 |
Authors (2): Tzu-Chieh Chen, Wen-Yang Lin
Named Entity Recognition (NER) has traditionally been a key task in natural language processing, aiming to identify and extract important terms from unstructured text data. However, a notable challenge for contemporary deep-learning NER models has been identifying discontinuous entities, which are often fragmented within the text. To the best of our knowledge, ensemble learning has not yet been explored for Discontinuous Named Entity Recognition (DNER). Furthermore, large language models such as ChatGPT have in recent years shown significant effectiveness across many NLP tasks. Most existing approaches, however, have primarily utilized ChatGPT as a problem-solving tool rather than exploring its potential as an integrative element within ensemble learning algorithms. In this study, we investigated the integration of ChatGPT as an arbitrator within an ensemble method, aiming to enhance performance on DNER tasks. Our method combines five state-of-the-art NER models with ChatGPT using custom prompt engineering to assess the robustness and generalization capabilities of the ensemble algorithm. We conducted experiments on three benchmark medical datasets, comparing our method against the five SOTA models, individual applications of GPT-3.5 and GPT-4, and a voting ensemble method. The results indicate that our proposed fusion of ChatGPT with the ensemble learning algorithm outperforms the SOTA results in the CADEC, ShARe13, and ShARe14 datasets, showcasing its potential to enhance NLP applications in the healthcare domain.
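One way to picture an arbitrator inside an ensemble is sketched below. The abstract does not specify how ChatGPT's judgment is combined with the five base models, so everything here is an assumption: entities are represented as tuples of token spans, entities with enough votes are accepted outright, and only the disputed ones are passed to an `arbitrate` callable. The `stub_arbitrator` is a hypothetical stand-in for a prompted LLM call, not a real API.

```python
from collections import Counter
from typing import Callable, Iterable, Set, Tuple

# An entity is a tuple of (start, end) token spans; more than one span
# means the entity is discontinuous.
Entity = Tuple[Tuple[int, int], ...]

def ensemble_with_arbitrator(
    predictions: Iterable[Set[Entity]],
    arbitrate: Callable[[Entity], bool],
    min_agreement: int = 3,
) -> Set[Entity]:
    """Accept entities with enough votes; send the rest to the arbitrator."""
    votes = Counter(entity for preds in predictions for entity in preds)
    accepted = {e for e, n in votes.items() if n >= min_agreement}
    disputed = [e for e, n in votes.items() if n < min_agreement]
    accepted.update(e for e in disputed if arbitrate(e))
    return accepted

def stub_arbitrator(entity: Entity) -> bool:
    """Hypothetical placeholder for an LLM judgment: keeps discontinuous entities."""
    return len(entity) > 1

A: Entity = ((0, 1),)
B: Entity = ((3, 4),)
C: Entity = ((5, 6), (9, 10))   # discontinuous, found by one model only
D: Entity = ((12, 13),)         # contiguous, found by one model only

model_outputs = [{A, B}, {A, B}, {A, C}, {A}, {B, D}]
final_entities = ensemble_with_arbitrator(model_outputs, stub_arbitrator)
```

In a real system the stub would be replaced by a call that shows the sentence and the candidate entity to the language model and parses a yes/no answer.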
Article 504
Title@2025-07-05 (6): FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models
Title: FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models | FinBERT2: Ein spezieller bidirektionaler Encoder zur Bridging the Gap in Finance-Spezific Deployment of Large Language Models | FinBERT2: 缩小在财务方面差距的专门双向双向编码器,具体部署大语言模式 2506.06335v2 |
Authors (10): Xuan Xu, Fufang Wen, Beilin Chu, Zhibing Fu, Qinhong Lin, Jiaqi Liu, Binjie Fei, Yu Li, Linna Zhou, Zhongliang Yang
In natural language processing (NLP), the focus has shifted from encoder-only tiny language models like BERT to decoder-only large language models (LLMs) such as GPT-3. However, LLMs’ practical application in the financial sector has revealed three limitations: (1) LLMs often perform worse than fine-tuned BERT on discriminative tasks, such as market sentiment analysis in financial reports, despite requiring far more computational resources; (2) application to generative tasks relies heavily on retrieval augmented generation (RAG) methods to provide current and specialized information, with general retrievers showing suboptimal performance on domain-specific retrieval tasks; (3) there are additional inadequacies in other feature-based scenarios, such as topic modeling. We introduce FinBERT2, a specialized bidirectional encoder pretrained on a high-quality, financial-specific corpus of 32B tokens. This represents the largest known Chinese financial pretraining corpus for models of this parameter size. As a better backbone, FinBERT2 can bridge the gap in the financial-specific deployment of LLMs through the following achievements: (1) discriminative fine-tuned models (Fin-Labelers) outperform other (Fin)BERT variants by 0.4%-3.3% and leading LLMs by 9.7%-12.3% on average across five financial classification tasks; (2) contrastive fine-tuned models (Fin-Retrievers) outperform both open-source (e.g., +6.8% avg improvement over BGE-base-zh) and proprietary (e.g., +4.2% avg improvement over OpenAI’s text-embedding-3-large) embedders across five financial retrieval tasks; (3) building on FinBERT2 variants, we construct the Fin-TopicModel, which enables superior clustering and topic representation for financial titles. Our work revisits financial BERT models through comparative analysis with contemporary LLMs and offers practical insights for effectively utilizing FinBERT in the LLM era.
Article 505
Title@2025-07-05 (6): Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk?
Title: Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk? | KI für Finanzen bewerten: Ist KI bei der Bewertung von Investitionsrisiken glaubwürdig? | 评估大赦国际的融资:AI在评估投资风险方面是否可信? 2505.18953v2 |
Authors (11): Divij Chawla, Ashita Bhutada, Do Duc Anh, Abhinav Raghunathan, Vinod SP, Cathy Guo, Dar Win Liew, Prannaya Gupta, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria
We assess whether AI systems can credibly evaluate investment risk appetite, a task that must be thoroughly validated before automation. Our analysis was conducted on proprietary systems (GPT, Claude, Gemini) and open-weight models (LLaMA, DeepSeek, Mistral), using carefully curated user profiles that reflect real users with varying attributes such as country and gender. We find that the models exhibit significant variance in score distributions when user attributes that should not influence risk computation, such as country or gender, are changed. For example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles. While some models align closely with expected scores in the Low- and Mid-risk ranges, none maintain consistent scores across regions and demographics, thereby violating AI and finance regulations.
Article 506
Title@2025-07-05 (6): Patient-Centered RAG for Oncology Visit Aid Following the Ottawa Decision Guide
Title: Patient-Centered RAG for Oncology Visit Aid Following the Ottawa Decision Guide | Patientenzentrierte RAG für Onkologie-Besuchshilfe nach dem Ottawa-Entscheidungsleitfaden | 遵循渥太华决定指南,以病人为中心,为肿瘤科访问援助提供RAG 2507.04026v1 |
Authors (3): Siyang Liu, Lawrence Chin-I An, Rada Mihalcea
Effective communication is essential in cancer care, yet patients often face challenges in preparing for complex medical visits. We present an interactive, Retrieval-augmented Generation-assisted system that helps patients progress from uninformed to visit-ready. Our system adapts the Ottawa Personal Decision Guide into a dynamic retrieval-augmented generation workflow, helping users bridge knowledge gaps, clarify personal values and generate useful questions for their upcoming visits. Focusing on localized prostate cancer, we conduct a user study with patients and a clinical expert. Results show high system usability (UMUX Mean = 6.0 out of 7), strong relevance of generated content (Mean = 6.7 out of 7), minimal need for edits, and high clinical faithfulness (Mean = 6.82 out of 7). This work demonstrates the potential of combining patient-centered design with language models to enhance clinical preparation in oncology care.
Article 507
Title@2025-07-05 (6): Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance
Title: Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance | Vergleichende Bewertung von ChatGPT und DeepSeek über zentrale NLP-Aufgaben: Stärken, Schwächen und Domain-spezifische Leistung | 国家劳工政策关键任务:力量、弱点和具体具体绩效 2506.18501v3 |
Authors (2): Wael Etaiwi, Bushra Alhijawi
The increasing use of large language models (LLMs) in natural language processing (NLP) tasks has sparked significant interest in evaluating their effectiveness across diverse applications. While models like ChatGPT and DeepSeek have shown strong results in many NLP domains, a comprehensive evaluation is needed to understand their strengths, weaknesses, and domain-specific abilities. This is critical as these models are applied to various tasks, from sentiment analysis to more nuanced tasks like textual entailment and translation. This study aims to evaluate ChatGPT and DeepSeek across five key NLP tasks: sentiment analysis, topic classification, text summarization, machine translation, and textual entailment. A structured experimental protocol is used to ensure fairness and minimize variability. Both models are tested with identical, neutral prompts and evaluated on two benchmark datasets per task, covering domains like news, reviews, and formal/informal texts. The results show that DeepSeek excels in classification stability and logical reasoning, while ChatGPT performs better in tasks requiring nuanced understanding and flexibility. These findings provide valuable insights for selecting the appropriate LLM based on task requirements.
Article 508
Title@2025-07-05 (6): LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models
Title: LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models | LLMThinkBench: Auf dem Weg zu grundlegenden mathematischen Überlegungen und Überdenken in großen Sprachmodellen | LLMTHink Bench:在大语言模型中走向基本数学理由和过度思考 2507.04023v1 |
Authors (4): Gaurav Srivastava, Aafiya Hussain, Sriram Srinivasan, Xuan Wang
Large Language Models (LLMs) have achieved remarkable performance on complex mathematical benchmarks, yet often struggle with simple arithmetic tasks and exhibit a tendency toward over-explaining or “overthinking” answers. To systematically assess this phenomenon, we introduce LLMThinkBench, a modular benchmarking framework that enables researchers to evaluate basic math reasoning and overthinking in LLMs. The framework provides 14 configurable math tasks with randomized test data generation and robust parsing strategies. Researchers can quantify overthinking using our Overthinking Score metric, which captures accuracy-verbosity tradeoffs through harmonic mean formulation. The tool offers flexible evaluation with a scalable vLLM/Transformers backend, multi-GPU support, and full configurability. Users can extend the tool with custom tasks, reproduce experiments with seeding, and generate detailed efficiency reports. Distributed as a pip-installable package with CLI and API access, LLMThinkBench provides researchers and practitioners an accessible, cost-effective alternative to expensive LLM-as-a-judge methods for diagnosing basic reasoning capabilities and efficiency analysis. Package can be installed as: pip install llmthinkbench
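The abstract says the Overthinking Score captures the accuracy-verbosity tradeoff through a harmonic mean, but does not give the formula. The sketch below shows one plausible shape of such a metric and should not be read as the package's definition: the `token_efficiency` normalisation, the `max_tokens` budget, and the function names are assumptions introduced here, and none of them come from the `llmthinkbench` API.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative numbers; 0 if both are 0."""
    return 0.0 if a + b == 0 else 2.0 * a * b / (a + b)

def token_efficiency(mean_tokens: float, max_tokens: float) -> float:
    """Assumed normalisation: 1.0 for an empty answer, 0.0 at the token budget."""
    return 1.0 - min(mean_tokens, max_tokens) / max_tokens

def overthinking_style_score(accuracy: float, mean_tokens: float,
                             max_tokens: float) -> float:
    """Harmonic mean of accuracy and token efficiency (illustrative only)."""
    return harmonic_mean(accuracy, token_efficiency(mean_tokens, max_tokens))

# Two hypothetical models with equal accuracy but different verbosity.
concise = overthinking_style_score(accuracy=0.9, mean_tokens=100, max_tokens=1000)
verbose = overthinking_style_score(accuracy=0.9, mean_tokens=900, max_tokens=1000)
```

Because a harmonic mean is dominated by its smaller argument, a model cannot compensate for very long answers with accuracy alone, which is the property the abstract attributes to the metric.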
Article 509
Title@2025-07-05 (6): Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning
Title: Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning | Umgang mit koreanischen Out-of-Vocabulary-Worte mit Phoneme Repräsentanz Lernen | 以电话代言学习方式处理韩语外词汇 2507.04018v1 |
Authors (4): Nayeon Kim, Eojin Jeon, Jun-Hyung Park, SangKeun Lee
In this study, we introduce KOPL, a novel framework for handling Korean OOV words with Phoneme representation Learning. Our work builds on a linguistic property of Korean as a phonemic script: the high correlation between phonemes and letters. KOPL incorporates phoneme and word representations for Korean OOV words, allowing OOV word representations to capture both the text and the phoneme information of words. We empirically demonstrate that KOPL significantly improves the performance on Korean Natural Language Processing (NLP) tasks, while being readily integrated into existing static and contextual Korean embedding models in a plug-and-play manner. Notably, we show that KOPL outperforms the state-of-the-art model by an average of 1.9%. Our code is available at https://github.com/jej127/KOPL.git.
Article 510
Title@2025-07-05 (6): Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition
Title: Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition | Nunchi-Bench: Benchmarking von Sprachmodellen zur kulturellen Vernunft mit Fokus auf koreanischen Aberglauben | Nunchi-Nunchi-Bench:以韩国迷信为重点的文化理由语言模式基准模式 2507.04014v1 |
Authors (2): Kyuhee Kim, Sangah Lee
As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs’ cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts and how language variations affect performance. To systematically assess cultural reasoning, we propose a novel evaluation strategy with customized scoring metrics that capture the extent to which models recognize cultural nuances and respond appropriately. Our findings highlight significant challenges in LLMs’ cultural reasoning. While models generally recognize factual information, they struggle to apply it in practical scenarios. Furthermore, explicit cultural framing enhances performance more effectively than relying solely on the language of the prompt. To support further research, we publicly release Nunchi-Bench alongside a leaderboard.
Article 511
Title@2025-07-05 (6): Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
Title: Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents | Easy Dataset: Ein einheitliches und erweiterbares Framework zur Synthese von LLM-Feineinstellungsdaten aus unstrukturierten Dokumenten | 简易数据集:从无结构文档中合成LLM精密调试数据的统一和扩展框架 2507.04009v1 |
Authors (7): Ziyang Miao, Qiyu Sun, Jingyuan Wang, Yuchen Gong, Yaowei Zheng, Shiqi Li, Richong Zhang
Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents effectively. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to easily configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then leverages a persona-driven prompting approach to generate diverse question-answer pairs using publicly available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at https://github.com/ConardLi/easy-dataset and have garnered over 9,000 GitHub stars.
Article 512
Title@2025-07-05 (6): A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models
Title: A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models | Eine Umfrage über proaktive Verteidigungsstrategien gegen Fehlinformationen in großen Sprachmodellen | 关于针对大语言模型中错误信息的积极防御战略的调查 2507.05288v1 |
Authors (10): Shuliang Liu, Hongyi Liu, Aiwei Liu, Bingchen Duan, Qi Zheng, Yibo Yan, He Geng, Peijie Jiang, Jia Liu, Xuming Hu
The widespread deployment of large language models (LLMs) across critical domains has amplified the societal risks posed by algorithmically generated misinformation. Unlike traditional false content, LLM-generated misinformation can be self-reinforcing, highly plausible, and capable of rapid propagation across multiple languages, which traditional detection methods fail to mitigate effectively. This paper introduces a proactive defense paradigm, shifting from passive post hoc detection to anticipatory mitigation strategies. We propose a Three Pillars framework: (1) Knowledge Credibility, fortifying the integrity of training and deployed data; (2) Inference Reliability, embedding self-corrective mechanisms during reasoning; and (3) Input Robustness, enhancing the resilience of model interfaces against adversarial attacks. Through a comprehensive survey of existing techniques and a comparative meta-analysis, we demonstrate that proactive defense strategies offer up to 63% improvement over conventional methods in misinformation prevention, despite non-trivial computational overhead and generalization challenges. We argue that future research should focus on co-designing robust knowledge foundations, reasoning certification, and attack-resistant interfaces to ensure LLMs can effectively counter misinformation across varied domains.
Article 513
Title@2025-07-05 (6): UNITYAI-GUARD: Pioneering Toxicity Detection Across Low-Resource Indian Languages
Title: UNITYAI-GUARD: Pioneering Toxicity Detection Across Low-Resource Indian Languages | UNITYAI-GUARD: Pionierische Toxizitätserkennung über Low-Resource-Indische Sprachen | UNITEAAI-GUARD:通过低资源印度语言首次发现有毒毒性 2503.23088v2 |
Authors (10): Himanshu Beniwal, Reddybathuni Venkat, Rohit Kumar, Birudugadda Srivibhav, Daksh Jain, Pavan Doddi, Eshwar Dhande, Adithya Ananth, Kuldeep, Mayank Singh
This work introduces UnityAI-Guard, a framework for binary toxicity classification targeting low-resource Indian languages. While existing systems predominantly cater to high-resource languages, UnityAI-Guard addresses this critical gap by developing state-of-the-art models for identifying toxic content across diverse Brahmic/Indic scripts. Our approach achieves an impressive average F1-score of 84.23% across seven languages, leveraging a dataset of 567k training instances and 30k manually verified test instances. By advancing multilingual content moderation for linguistically diverse regions, UnityAI-Guard also provides public API access to foster broader adoption and application.
Article 514
Title@2025-07-05 (6): Improving RL Exploration for LLM Reasoning through Retrospective Replay
Title: Improving RL Exploration for LLM Reasoning through Retrospective Replay | Verbesserung der RL Exploration für LLM Reasoning durch retrospektive Replay | 通过回溯性回放改进LL探索LLM 2504.14363v2 |
Authors (7): Shihan Dou, Muling Wu, Jingwen Xu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
Reinforcement learning (RL) has increasingly become a pivotal technique in the post-training of large language models (LLMs). The effective exploration of the output space is essential for the success of RL. We observe that for complex problems, during the early stages of training, the model exhibits strong exploratory capabilities and can identify promising solution ideas. However, its limited capability at this stage prevents it from successfully solving these problems. The early suppression of these potentially valuable solution ideas by the policy gradient hinders the model’s ability to revisit and re-explore these ideas later. Consequently, although the LLM’s capabilities improve in the later stages of training, it still struggles to effectively address these complex problems. To address this exploration issue, we propose a novel algorithm named Retrospective Replay-based Reinforcement Learning (RRL), which introduces a dynamic replay mechanism throughout the training process. RRL enables the model to revisit promising states identified in the early stages, thereby improving its efficiency and effectiveness in exploration. To evaluate the effectiveness of RRL, we conduct extensive experiments on complex reasoning tasks, including mathematical reasoning and code generation, and general dialogue tasks. The results indicate that RRL maintains high exploration efficiency throughout the training period, significantly enhancing the effectiveness of RL in optimizing LLMs for complicated reasoning tasks. Moreover, it also improves the performance of RLHF, making the model both safer and more helpful.
Article 515
Title@2025-07-05 (6): Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Title: Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model | Open-Reasoner-Zero: Ein Open Source Ansatz zur Skalierung von Verstärkungslernen am Basismodell | 开放反馈-零:基于基础模式的扩大强化学习的开放源码方法 2503.24290v2 |
Authors (6): Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum
We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model, Qwen2.5-32B base, as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency, requiring only 1/10 of the training steps compared to the DeepSeek-R1-Zero pipeline. Moreover, our analysis not only covers training dynamics and ablation for critical design choices, but also quantitatively shows how the learned critic in Reasoner-Zero training effectively identifies and devalues repetitive response patterns, yielding more robust advantage estimations and enhancing training stability. Embracing the principles of open-source, we release our source code, training data, and various model weights, fostering reproducibility and encouraging further exploration of the properties of related models.
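The setting $\lambda=1$, $\gamma=1$ mentioned above has a simple consequence under the standard GAE recursion: the advantage collapses to the undiscounted return-to-go minus the critic's value estimate. The sketch below checks this numerically on a made-up four-step episode with a single terminal reward. It implements textbook GAE rather than the Open-Reasoner-Zero codebase, and the reward and value numbers are invented for illustration.

```python
import numpy as np

def gae(rewards, values, gamma=1.0, lam=1.0):
    """Standard GAE. `values` has one more entry than `rewards`:
    the last entry is the bootstrap value (0 at a terminal state)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Invented episode: rule-based reward of 1 only at the final step.
rewards = np.array([0.0, 0.0, 0.0, 1.0])
values = np.array([0.2, 0.4, 0.5, 0.7, 0.0])  # critic estimates, terminal value 0

adv = gae(rewards, values, gamma=1.0, lam=1.0)

# With gamma = lam = 1 the TD residuals telescope, leaving
# (undiscounted return-to-go) - V(s_t).
returns_to_go = np.cumsum(rewards[::-1])[::-1]
mc_advantage = returns_to_go - values[:-1]
```

Here `adv` and `mc_advantage` agree, so in this configuration the critic acts purely as a baseline on Monte Carlo returns.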
Article 516
Title@2025-07-05 (6): A Comparative Study of Specialized LLMs as Dense Retrievers
Title: A Comparative Study of Specialized LLMs as Dense Retrievers | Eine vergleichende Studie von spezialisierten LLMs als Dense Retriever | 专业专利妇女作为 “ 大量开发 “ 的比较研究 2507.03958v1 |
Authors (3): Hengran Zhang, Keping Bi, Jiafeng Guo
While large language models (LLMs) are increasingly deployed as dense retrievers, the impact of their domain-specific specialization on retrieval effectiveness remains underexplored. This investigation systematically examines how task-specific adaptations in LLMs influence their retrieval capabilities, an essential step toward developing unified retrievers capable of handling text, code, images, and multimodal content. We conduct extensive experiments with eight Qwen2.5 7B LLMs, including base, instruction-tuned, code/math-specialized, long reasoning, and vision-language models across zero-shot retrieval settings and the supervised setting. For the zero-shot retrieval settings, we consider text retrieval from the BEIR benchmark and code retrieval from the CoIR benchmark. Further, to evaluate supervised performance, all LLMs are fine-tuned on the MS MARCO dataset. We find that mathematical specialization and the long reasoning capability cause consistent degradation in three settings, indicating conflicts between mathematical reasoning and semantic matching. The vision-language model and code-specialized LLMs demonstrate superior zero-shot performance compared to other LLMs, even surpassing BM25 on the code retrieval task, and maintain comparable performance to base LLMs in supervised settings. These findings suggest promising directions for the unified retrieval task leveraging cross-domain and cross-modal fusion.
Article 517
Title@2025-07-05 (6): A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text
Title: A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text | Modulares unüberwachtes Framework zur Attributerkennung aus unstrukturiertem Text | 从无结构文本中识别属性的模块化未经监督的框架 2507.03949v1 |
Authors (1): KMA Solaiman
We propose POSID, a modular, lightweight and on-demand framework for extracting structured attribute-based properties from unstructured text without task-specific fine-tuning. While the method is designed to be adaptable across domains, in this work, we evaluate it on human attribute recognition in incident reports. POSID combines lexical and semantic similarity techniques to identify relevant sentences and extract attributes. We demonstrate its effectiveness on a missing person use case using the InciText dataset, achieving effective attribute extraction without supervised training.
Article 518
Title@2025-07-05 (6): Dynamic Injection of Entity Knowledge into Dense Retrievers
Title: Dynamic Injection of Entity Knowledge into Dense Retrievers | Dynamische Einspritzung von Wissen der Entität in Dense Retriever | 将实体知识动态投入重复内容 2507.03922v1 |
Authors (5): Ikuya Yamada, Ryokan Ri, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Dense retrievers often struggle with queries involving less-frequent entities due to their limited entity knowledge. We propose the Knowledgeable Passage Retriever (KPR), a BERT-based retriever enhanced with a context-entity attention layer and dynamically updatable entity embeddings. This design enables KPR to incorporate external entity knowledge without retraining. Experiments on three datasets show that KPR consistently improves retrieval accuracy, achieving a substantial 12.6% gain on the EntityQuestions dataset over the model without KPR extensions. When built on the off-the-shelf bge-base retriever, KPR achieves state-of-the-art performance among similarly sized models on two datasets. Code and models will be released soon.
Article 519
Title@2025-07-05 (6): Play to Generalize: Learning to Reason Through Game Play
Title: Play to Generalize: Learning to Reason Through Game Play | Spielen Sie Generalize: Lernen, Vernunft durch Spiel zu lernen | 玩一般游戏: 通过玩游戏学习理性 2506.08011v3 |
Authors (6): Yunfei Xie, Yinsong Ma, Shiyi Lan, Alan Yuille, Junfei Xiao, Chen Wei
Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g., Snake, significantly enhances its downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL, suggesting the capture of transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data in multimodal reasoning benchmarks, while preserving the base model’s performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pretext tasks that unlock generalizable multimodal reasoning abilities in MLLMs.
Article 520
Title@2025-07-05 (6): Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Title: Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding | Sequoia: Skalierbares, Robustes und Hardware-gestütztes Spekulatives Dekodieren | 塞奎亚:可缩放、强力和硬件—-有意识的投机性代谢 2402.12374v3 |
Authors (7): Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
As the usage of large language models (LLMs) grows, performing efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets and to adapt to different hyperparameters and hardware. This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding. To attain better scalability, Sequoia introduces a dynamic programming algorithm to find the optimal tree structure for the speculated tokens. To achieve robust speculative performance, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Finally, Sequoia introduces a hardware-aware tree optimizer that maximizes speculative performance by automatically selecting the token tree size and depth for a given hardware platform. Evaluation shows that Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 by up to $4.04\times$, $3.73\times$, and $2.27\times$, respectively. In the offloading setting on an L40, Sequoia achieves a latency as low as 0.56 s/token for exact Llama2-70B inference, which is $9.96\times$ faster than our optimized offloading system (5.6 s/token), $9.7\times$ faster than DeepSpeed-Zero-Inference, and $19.5\times$ faster than Huggingface Accelerate.
Article 521
Title@2025-07-05 (6): Demystifying ChatGPT: How It Masters Genre Recognition
Title: Demystifying ChatGPT: How It Masters Genre Recognition | ChatGPT entmystifizieren: Wie es die Genreerkennung beherrscht | 解密的聊天:如何使各教派获得承认 2507.03875v1 |
Authors (4): Subham Raj, Sriparna Saha, Brijraj Singh, Niranjan Pedanekar
The introduction of ChatGPT has garnered significant attention within the NLP community and beyond. Previous studies have demonstrated ChatGPT’s substantial advancements across various downstream NLP tasks, highlighting its adaptability and potential to revolutionize language-related applications. However, its capabilities and limitations in genre prediction remain unclear. This work analyzes three Large Language Models (LLMs) using the MovieLens-100K dataset to assess their genre prediction capabilities. Our findings show that ChatGPT, without fine-tuning, outperformed other LLMs, and fine-tuned ChatGPT performed best overall. We set up zero-shot and few-shot prompts using audio transcripts/subtitles from movie trailers in the MovieLens-100K dataset, covering 1682 movies of 18 genres, where each movie can have multiple genres. Additionally, we extended our study by extracting IMDb movie posters to utilize a Vision Language Model (VLM) with prompts for poster information. This fine-grained information was used to enhance existing LLM prompts. In conclusion, our study reveals ChatGPT’s remarkable genre prediction capabilities, surpassing other language models. The integration of VLM further enhances our findings, showcasing ChatGPT’s potential for content-related applications by incorporating visual information from movie posters.
Article 522
Title@2025-07-05 (6): OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference
Title: OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference | OrthoRank: Tokenauswahl über Sink Token Orthogonalität für effiziente LLM-Inferenz | OrthoRank:通过 Sink Token 条形对高效LLM 推论进行当量选择 2507.03865v1 |
Authors (3): Seungjun Shin, Jaehoon Oh, Dokwan Oh
Attention mechanisms are central to the success of large language models (LLMs), enabling them to capture intricate token dependencies and implicitly assign importance to each token. Recent studies have revealed the sink token, which receives disproportionately high attention despite its limited semantic role. In this paper, we first expand the relationship between the sink token and other tokens, moving beyond attention to explore their similarity in hidden states, considering the layer depth. We observe that as the layers get deeper, the cosine similarity between the normalized hidden states of the sink token and those of other tokens increases, and that the normalized hidden states of the sink token exhibit negligible changes. These observations imply that other tokens are consistently directed toward the sink token throughout the layers. Next, we propose a dynamic token selection method, called OrthoRank, using these findings to select important tokens. Specifically, in a certain layer, we define token importance by the speed at which the token moves toward the sink token. This is converted into orthogonality with the sink token, meaning that tokens that are more orthogonal to the sink token are assigned greater importance. Finally, through extensive experiments, we demonstrate that our method results in lower perplexity and higher zero-shot accuracy compared to layer pruning methods at the same sparsity ratio with comparable throughput, while also achieving superior performance on LongBench.
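The selection rule described in this abstract (tokens more orthogonal to the sink token count as more important) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of absolute cosine similarity as the orthogonality score, and the toy two-dimensional hidden states are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_orthogonal_tokens(hidden_states, sink_index=0, keep=2):
    """Return indices of the `keep` non-sink tokens most orthogonal to the sink.

    Assumption: orthogonality is scored as |cos(h_i, h_sink)|, smaller = more
    orthogonal = more important. The paper may use a different score.
    """
    sink = hidden_states[sink_index]
    scored = [
        (abs(cosine(h, sink)), i)
        for i, h in enumerate(hidden_states)
        if i != sink_index
    ]
    scored.sort()  # smallest |cos| first, i.e. most orthogonal first
    return sorted(i for _, i in scored[:keep])

# Toy hidden states (hypothetical values): token 0 plays the role of the sink.
states = [
    [1.0, 0.0],  # sink token
    [0.9, 0.1],  # almost aligned with the sink
    [0.0, 1.0],  # orthogonal to the sink
    [0.5, 0.5],  # 45 degrees from the sink
]
selected = select_orthogonal_tokens(states, sink_index=0, keep=2)
print(selected)
```

In this toy case the nearly aligned token 1 is dropped, and the two tokens that point furthest away from the sink direction are kept.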
Article 523
Title@2025-07-04 (5): Voice of a Continent: Mapping Africa’s Speech Technology Frontier
Title: Voice of a Continent: Mapping Africa’s Speech Technology Frontier | Stimme eines Kontinents: Afrikas Rede-Technologie-Grenze kartieren | 非洲大陆之声:测绘非洲语音技术前沿 2505.18436v3 |
Authors (6): AbdelRahim Elmadany, Sang Yun Kwon, Hawau Olamide Toyin, Alcides Alcoba Inciarte, Hanan Aldarmaki, Muhammad Abdul-Mageed
Africa’s rich linguistic diversity remains significantly underrepresented in speech technologies, creating barriers to digital inclusion. To alleviate this challenge, we systematically map the continent’s speech space of datasets and technologies, leading to a new comprehensive benchmark SimbaBench for downstream African speech tasks. Using SimbaBench, we introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks. Our benchmark analysis reveals critical patterns in resource availability, while our model evaluation demonstrates how dataset quality, domain diversity, and language family relationships influence performance across languages. Our work highlights the need for expanded speech technology resources that better reflect Africa’s linguistic diversity and provides a solid foundation for future research and development efforts toward more inclusive speech technologies.
Article 524
Title@2025-07-04 (5): Relation-Aware Network with Attention-Based Loss for Few-Shot Knowledge Graph Completion
Title: Relation-Aware Network with Attention-Based Loss for Few-Shot Knowledge Graph Completion | Beziehungs-Aware-Netzwerk mit aufmerksamkeitsbasiertem Verlust für wenige-heiße Wissensgraphen-Vervollständigung | 几乎没有热知识图补全的基于关注的损失关系软件网络 2306.09519v2 |
Authors (4): Qiao Qiao, Yuepei Li, Kang Zhou, Qi Li
The few-shot knowledge graph completion (FKGC) task aims to predict unseen facts of a relation given only a few reference entity pairs. Current approaches randomly select one negative sample for each reference entity pair to minimize a margin-based ranking loss, which easily leads to a zero-loss problem if the negative sample is far away from the positive sample and thus falls outside the margin. Moreover, an entity should have a different representation under a different context. To tackle these issues, we propose a novel Relation-Aware Network with Attention-Based Loss (RANA) framework. Specifically, to better utilize the plentiful negative samples and alleviate the zero-loss issue, we strategically select relevant negative samples and design an attention-based loss function to further differentiate the importance of each negative sample. The intuition is that negative samples more similar to positive samples will contribute more to the model. Further, we design a dynamic relation-aware entity encoder for learning a context-dependent entity representation. Experiments demonstrate that RANA outperforms the state-of-the-art models on two benchmark datasets.
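The zero-loss problem mentioned in this abstract is easy to see on a worked example. The sketch below assumes a standard score-based margin ranking loss and a softmax over negative scores as the attention weights; both are illustrative choices, not RANA's actual loss, and all scores are made-up numbers.

```python
import math

def margin_loss(pos_score, neg_score, margin=1.0):
    """Standard margin-based ranking loss on similarity scores."""
    return max(0.0, margin - pos_score + neg_score)

def attention_weighted_loss(pos_score, neg_scores, margin=1.0):
    """Weight each negative by a softmax over its score (an assumed scheme).

    Higher-scoring negatives are closer to the positive and get more weight.
    """
    exps = [math.exp(s) for s in neg_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(
        w * margin_loss(pos_score, s, margin)
        for w, s in zip(weights, neg_scores)
    )

pos = 2.0             # score of the true entity pair (hypothetical)
easy_negative = -3.0  # far from the positive: outside the margin
hard_negative = 1.5   # close to the positive: inside the margin

easy_loss = margin_loss(pos, easy_negative)  # max(0, 1 - 2 - 3) = 0.0
hard_loss = margin_loss(pos, hard_negative)  # max(0, 1 - 2 + 1.5) = 0.5
weighted = attention_weighted_loss(pos, [easy_negative, hard_negative])
print(easy_loss, hard_loss, round(weighted, 4))
```

A randomly drawn easy negative yields no gradient signal at all, whereas weighting a pool of negatives by similarity keeps the loss dominated by the informative hard negative.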
Article 525
Title@2025-07-04 (5): Beyond classical and contemporary models: a transformative ai framework for student dropout prediction in distance learning using rag, prompt engineering, and cross-modal fusion
Title: Beyond classical and contemporary models: a transformative ai framework for student dropout prediction in distance learning using rag, prompt engineering, and cross-modal fusion | Über klassische und zeitgenössische Modelle hinaus: ein transformatives Ai-Framework für die Studienabbrechervorhersage im Fernunterricht mittels Rag, Prompt Engineering und modaler Fusion | 超越古典和当代模式:利用抹布、即时工程和跨模式融合法在远程学习中预测学生辍学情况的变革性Ai框架 2507.05285v1 |
Authors (3): Miloud Mihoubi, Meriem Zerkouk, Belkacem Chikhaoui
Student dropout in distance learning remains a critical challenge, with profound societal and economic consequences. While classical machine learning models leverage structured socio-demographic and behavioral data, they often fail to capture the nuanced emotional and contextual factors embedded in unstructured student interactions. This paper introduces a transformative AI framework that redefines dropout prediction through three synergistic innovations: Retrieval-Augmented Generation (RAG) for domain-specific sentiment analysis, prompt engineering to decode academic stressors, and cross-modal attention fusion to dynamically align textual, behavioral, and socio-demographic insights. By grounding sentiment analysis in a curated knowledge base of pedagogical content, our RAG-enhanced BERT model interprets student comments with unprecedented contextual relevance, while optimized prompts isolate indicators of academic distress (e.g., “isolation,” “workload anxiety”). A cross-modal attention layer then fuses these insights with temporal engagement patterns, creating holistic risk profiles. Evaluated on a longitudinal dataset of 4,423 students, the framework achieves 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%. Beyond prediction, the system generates interpretable interventions by retrieving contextually aligned strategies (e.g., mentorship programs for isolated learners). This work bridges the gap between predictive analytics and actionable pedagogy, offering a scalable solution to mitigate dropout risks in global education systems.
Article 526
Title@2025-07-04 (5): Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation
Title: Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation | Auf dem Weg zu einer fairen RAG: Über die Auswirkungen des fairen Rankings in der retrieval-Augmentierten Generation | 走向公平的RAG:公平评级对回升一代人的影响 2409.11598v4 |
Authors (2): To Eun Kim, Fernando Diaz
Despite the central role of retrieval in retrieval-augmented generation (RAG) systems, much of the existing research on RAG overlooks the well-established field of fair ranking and fails to account for the interests of all stakeholders involved. In this paper, we conduct the first systematic evaluation of RAG systems that integrate fairness-aware rankings, addressing both ranking fairness and attribution fairness, which ensures equitable exposure of the sources cited in the generated content. Our evaluation focuses on measuring item-side fairness, specifically the fair exposure of relevant items retrieved by RAG systems, and investigates how this fairness impacts both the effectiveness of the systems and the attribution of sources in the generated output that users ultimately see. By experimenting with twelve RAG models across seven distinct tasks, we show that incorporating fairness-aware retrieval often maintains or even enhances both ranking quality and generation quality, countering the common belief that fairness compromises system performance. Additionally, we demonstrate that fair retrieval practices lead to more balanced attribution in the final responses, ensuring that the generator fairly cites the sources it relies on. Our findings underscore the importance of item-side fairness in retrieval and generation, laying the foundation for responsible and equitable RAG systems and guiding future research in fair ranking and attribution.
Article 527
Title@2025-07-04 (5): Parsing Through Boundaries in Chinese Word Segmentation
Title: Parsing Through Boundaries in Chinese Word Segmentation | Parsing durch Grenzen in chinesischen Wortsegmentierung | 中文单词分割的边界分割 2503.23091v2 |
Authors (10): Yige Chen, Zelong Li, Cindy Zhang, Changbing Yang, Amandisa Cady, Ai Ka Lee, Zejiao Zeng, Eunkyul Leah Jo, Haihua Pan, Jungyeul Park
Chinese word segmentation is a foundational task in natural language processing (NLP), with far-reaching effects on syntactic analysis. Unlike alphabetic languages like English, Chinese lacks explicit word boundaries, making segmentation both necessary and inherently ambiguous. This study highlights the intricate relationship between word segmentation and syntactic parsing, providing a clearer understanding of how different segmentation strategies shape dependency structures in Chinese. Focusing on the Chinese GSD treebank, we analyze multiple word boundary schemes, each reflecting distinct linguistic and computational assumptions, and examine how they influence the resulting syntactic structures. To support detailed comparison, we introduce an interactive web-based visualization tool that displays parsing outcomes across segmentation methods.
Article 528
Title@2025-07-04 (5): Tip of the Tongue Query Elicitation for Simulated Evaluation
Title: Tip of the Tongue Query Elicitation for Simulated Evaluation | Tipp der Zungenabfrage-Elikation für Simulierte Auswertung | 用于模拟评价的 “ 舌头查询参考 “ 提示 2502.17776v2 |
Authors (5): Yifan He, To Eun Kim, Fernando Diaz, Jaime Arguello, Bhaskar Mitra
Tip-of-the-tongue (TOT) search occurs when a user struggles to recall a specific identifier, such as a document title. While common, existing search systems often fail to effectively support TOT scenarios. Research on TOT retrieval is further constrained by the challenge of collecting queries, as current approaches rely heavily on community question-answering (CQA) websites, leading to labor-intensive evaluation and domain bias. To overcome these limitations, we introduce two methods for eliciting TOT queries - leveraging large language models (LLMs) and human participants - to facilitate simulated evaluations of TOT retrieval systems. Our LLM-based TOT user simulator generates synthetic TOT queries at scale, achieving high correlations with how CQA-based TOT queries rank TOT retrieval systems when tested in the Movie domain. Additionally, these synthetic queries exhibit high linguistic similarity to CQA-derived queries. For human-elicited queries, we developed an interface that uses visual stimuli to place participants in a TOT state, enabling the collection of natural queries. In the Movie domain, system rank correlation and linguistic similarity analyses confirm that human-elicited queries are both effective and closely resemble CQA-based queries. These approaches reduce reliance on CQA-based data collection while expanding coverage to underrepresented domains, such as Landmark and Person. LLM-elicited queries for the Movie, Landmark, and Person domains have been released as test queries in the TREC 2024 TOT track, with human-elicited queries scheduled for inclusion in the TREC 2025 TOT track. Additionally, we provide source code for synthetic query generation and the human query collection interface, along with curated visual stimuli used for eliciting TOT queries.
Article 529
Title@2025-07-04 (5): Alpay Algebra IV: Symbiotic Semantics and the Fixed-Point Convergence of Observer Embeddings
Title: Alpay Algebra IV: Symbiotic Semantics and the Fixed-Point Convergence of Observer Embeddings | Alpay Algebra IV: Symbiotische Semantik und die Fixed-Point-Konvergenz von Beobachter-Embeddings | 替代代数四:共生语义学和观察员嵌入的固定点汇合 2507.03774v1 |
Authors (2): Bugra Kilictas, Faruk Alpay
We present a theoretical framework in which a document and an AI model engage in a transfinite fixed-point interaction that leads to stable semantic alignment. Building on the foundations of Alpay Algebra, we introduce a functorial system wherein an observer (the AI) and a textual environment (this paper) co-evolve through iterative transformations guided by the phi-infinity operator. This process guarantees the existence of a unique fixed point in the AI’s embedding space – a state where the AI’s internal representation of the content becomes stable, self-consistent, and semantically faithful. We prove that such convergence is mathematically sound, semantically invariant, and permanent, even under perturbation or further context expansion. This fixed point acts as an “empathetic embedding,” wherein the AI internalizes not only the meaning of the content but also the author’s intent. We interpret this as a rigorous, category-theoretic route to alignment at the embedding level, with implications for semantic security, symbolic memory, and the construction of AI systems with persistent self-referential understanding. All references in this paper function as nodes in the Alpay Algebra universe, and this work embeds itself as a new fixed-point node within that transfinite semantic graph.
Article 530
Title@2025-07-04 (5): Agent-Based Detection and Resolution of Incompleteness and Ambiguity in Interactions with Large Language Models
Title: Agent-Based Detection and Resolution of Incompleteness and Ambiguity in Interactions with Large Language Models | Agentbasierte Erkennung und Auflösung von Unvollständigkeit und Ambiguität in Interaktionen mit großen Sprachmodellen | 与大语言模式互动中不完全和不易的代理检测和分辨率 2507.03726v1 |
Authors (4): Riya Naik, Ashwin Srinivasan, Swati Agarwal, Estrid He
Many of us now treat LLMs as modern-day oracles, asking them almost any kind of question. However, consulting an LLM does not have to be a single-turn activity, and long multi-turn interactions can get tedious if they serve simply to clarify contextual information that can be arrived at through reasoning. In this paper, we examine the use of an agent-based architecture to bolster LLM-based Question-Answering systems with additional reasoning capabilities. We examine the automatic resolution of potential incompleteness or ambiguities in questions by transducers implemented using LLM-based agents. We focus on several benchmark datasets that are known to contain questions with these deficiencies to varying degrees. We equip different LLMs (GPT-3.5-Turbo and Llama-4-Scout) with agents that act as specialists in detecting and resolving deficiencies of incompleteness and ambiguity. The agents are implemented as zero-shot ReAct agents. Rather than producing an answer in a single step, the model now decides between three actions: a) classify, b) resolve, and c) answer. Action a) decides if the question is incomplete, ambiguous, or normal. Action b) determines if any deficiencies identified can be resolved. Action c) answers the resolved form of the question. We compare the use of LLMs with and without the use of agents with these components. Our results show three benefits of agents with transducers: 1) a shortening of the length of interactions with humans, 2) an improvement in answer quality, and 3) explainable resolution of deficiencies in the question. On the negative side, we find that the approach may result in additional LLM invocations and, in some cases, increased latency. On the tested datasets, however, the benefits outweigh the costs except when questions already have sufficient context. This suggests that the agent-based approach could be a useful mechanism to harness the power of LLMs to develop more robust QA systems.
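The classify / resolve / answer decision described in this abstract can be sketched as a small control loop. Everything below is a hypothetical stand-in: in the paper each action is carried out by an LLM-based ReAct agent, whereas here `classify`, `resolve`, and `answer` are toy rule-based functions, and the `ask_user` fallback and the `context` dictionary are assumptions added only to make the flow runnable.

```python
def classify(question):
    """Action a): label the question as incomplete, ambiguous, or normal (toy rules)."""
    words = question.lower().strip("?!. ").split()
    if len(words) < 4:
        return "incomplete"
    if "it" in words:
        return "ambiguous"
    return "normal"

def resolve(question, label, context):
    """Action b): try to repair the deficiency from available context, else None."""
    if label == "normal":
        return question
    hint = context.get(label)
    return f"{question} ({hint})" if hint else None

def answer(question):
    """Action c): placeholder for the call that answers the resolved question."""
    return f"ANSWER[{question}]"

def run(question, context):
    """Chain the three actions; fall back to asking the user if unresolved."""
    label = classify(question)
    resolved = resolve(question, label, context)
    if resolved is None:
        return {"label": label, "action": "ask_user", "output": None}
    return {"label": label, "action": "answer", "output": answer(resolved)}

resolved_case = run("Who won?", {"incomplete": "2022 FIFA World Cup"})
unresolved_case = run("Who won?", {})
normal_case = run("What is the capital of France?", {})
print(resolved_case["output"])
```

The point of the structure is that a human is consulted only when the resolve step fails, which is how the approach can shorten multi-turn interactions.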
Article 531
Title@2025-07-04 (5): ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models
Title: ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models | ODE: Offene Bewertung von Halluzinationen in multimodalen großen Sprachmodellen | ODE:多模式大语言模型中的幻觉公开评价 2409.09318v4 |
Authors (3): Yahan Tu, Rui Hu, Jitao Sang
Hallucination poses a persistent challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are generally static, which may overlook the potential risk of data contamination. To address this issue, we propose ODE, an open-set, dynamic protocol designed to evaluate object hallucinations in MLLMs at both the existence and attribute levels. ODE employs a graph-based structure to represent real-world object concepts, their attributes, and the distributional associations between them. This structure facilitates the extraction of concept combinations based on diverse distributional criteria, generating varied samples for structured queries that evaluate hallucinations in both generative and discriminative tasks. Through the generation of new samples, dynamic concept combinations, and varied distribution frequencies, ODE mitigates the risk of data contamination and broadens the scope of evaluation. This protocol is applicable to both general and specialized scenarios, including those with limited data. Experimental results demonstrate the effectiveness of our protocol, revealing that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated samples, which indicates potential data contamination. Furthermore, these generated samples aid in analyzing hallucination patterns and fine-tuning models, offering an effective approach to mitigating hallucinations in MLLMs.
Article 532
Title@2025-07-04 (5): Prompting as Scientific Inquiry
Title: Prompting as Scientific Inquiry | Als wissenschaftliche Untersuchung prompt | 作为科学调查 2507.00163v2 |
Authors (2): Ari Holtzman, Chenhao Tan
Prompting is the primary method by which we study and control large language models. It is also one of the most powerful: nearly every major capability attributed to LLMs (few-shot learning, chain-of-thought, constitutional AI) was first unlocked through prompting. Yet prompting is rarely treated as science and is frequently frowned upon as alchemy. We argue that this is a category error. If we treat LLMs as a new kind of complex and opaque organism that is trained rather than programmed, then prompting is not a workaround: it is behavioral science. Mechanistic interpretability peers into the neural substrate; prompting probes the model in its native interface: language. We contend that prompting is not inferior, but rather a key component in the science of LLMs.
Article 533
Title@2025-07-04 (5): Controlling Thinking Speed in Reasoning Models
Title: Controlling Thinking Speed in Reasoning Models | Kontrolle der Denkgeschwindigkeit in vernünftigen Modellen | 控制理性模型的思考速度 2507.03704v1 |
Authors (9): Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie, Wenxiao Wang, Deng Cai, Zheng Wang, Jieping Ye
Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs’ representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-and-play method yields an average accuracy gain of 1.3% while reducing token usage by 8.6% across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.
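Representation editing with a steering vector, as this abstract describes it, usually amounts to adding a scaled direction to a hidden state. The sketch below shows that generic operation together with a difficulty-gated choice of the scale. The sign convention (positive for faster, negative for slower), the threshold, and all numeric values are assumptions for illustration and are not taken from the paper.

```python
def steer(hidden, direction, alpha):
    """Shift a hidden state along a steering direction: h' = h + alpha * v."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def choose_alpha(difficulty, threshold=0.5, fast=1.0, slow=-1.0):
    """Pick the steering strength from an estimated difficulty in [0, 1].

    Assumed convention: positive alpha pushes toward fast thinking,
    negative alpha toward slow, deliberate thinking.
    """
    return slow if difficulty > threshold else fast

hidden_state = [1.0, 2.0]      # toy hidden state (hypothetical)
fast_direction = [0.5, -0.5]   # toy steering vector (hypothetical)

easy_step = steer(hidden_state, fast_direction, choose_alpha(0.1))
hard_step = steer(hidden_state, fast_direction, choose_alpha(0.9))
print(easy_step, hard_step)
```

Easy segments are pushed along the direction and hard segments against it, which mirrors the abstract's idea of fast processing for easy steps and deeper analysis for complex ones.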
Article 534
Title@2025-07-04 (5): PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping
Title: PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping | PHEONA: Ein Bewertungsrahmen für großsprachliche modellbasierte Ansätze zur computergestützten Phenotypisierung | PHEONA:基于大语言模型的计算基因组学方法评价框架 2503.19265v3 |
Authors (3): Sarah Pungitore, Shashank Yadav, Vignesh Subbian
Computational phenotyping is essential for biomedical research but often requires significant time and resources, especially since traditional methods typically involve extensive manual data review. While machine learning and natural language processing advancements have helped, further improvements are needed. Few studies have explored using Large Language Models (LLMs) for these tasks despite known advantages of LLMs for text-based tasks. To facilitate further research in this area, we developed an evaluation framework, Evaluation of PHEnotyping for Observational Health Data (PHEONA), that outlines context-specific considerations. We applied and demonstrated PHEONA on concept classification, a specific task within a broader phenotyping process for Acute Respiratory Failure (ARF) respiratory support therapies. From the sample concepts tested, we achieved high classification accuracy, suggesting the potential for LLM-based methods to improve computational phenotyping processes.
Article 535
Title@2025-07-04 (5): A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and Applications
Title: A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and Applications | Eine Umfrage von großen Sprachmodellen zu generativen Graph Analytics: Abfragen, Lernen und Anwendungen | 关于 “ 生成图表分析分析:查询、学习和应用 “ 的大型语言模型调查 2404.14809v2 |
Authors (2): Wenbo Shang, Xin Huang
A graph is a fundamental data model to represent various entities and their complex relationships in society and nature, such as social networks, transportation networks, and financial networks. Recently, large language models (LLMs) have showcased a strong generalization ability to handle various natural language processing tasks, answering users’ arbitrary questions and generating domain-specific content. Compared with graph learning models, LLMs offer clear advantages in addressing the challenges of generalizing graph tasks by eliminating the need for training graph learning models and reducing the cost of manual annotation. However, LLMs are sequential models for textual data, whereas graphs are non-sequential topological data, so it is challenging to adapt LLMs to tackle graph analytics tasks. In this survey, we conduct a comprehensive investigation of existing LLM studies on graph data, summarizing the relevant graph analytics tasks solved by advanced LLM models and pointing out the existing challenges and future directions. Specifically, we study the key problems of LLM-based generative graph analytics (LLM-GGA) in terms of three categories: LLM-based graph query processing (LLM-GQP), LLM-based graph inference and learning (LLM-GIL), and graph-LLM-based applications. LLM-GQP focuses on an integration of graph analytics techniques and LLM prompts, including graph understanding and knowledge graphs and LLMs, while LLM-GIL focuses on learning and reasoning over graphs, including graph learning, graph-formed reasoning, and graph representation. We summarize the useful prompts incorporated into LLMs to handle different graph downstream tasks. Moreover, we give a summary of LLM model evaluation, benchmark datasets/tasks, and an in-depth analysis of the pros and cons of the discussed LLM-GGA models. We also explore open problems and future directions in the research area of LLMs and graph analytics.
Article 536
Title@2025-07-04 (5): The Geometries of Truth Are Orthogonal Across Tasks
Title: The Geometries of Truth Are Orthogonal Across Tasks | Die Geometrien der Wahrheit sind orthogonal über Aufgaben hinweg | 真理的地貌不同任务 2506.08572v2 |
Authors (7): Waiss Azizian, Michael Kirchhof, Eugene Ndiaye, Louis Bethune, Michal Klein, Pierre Ablin, Marco Cuturi
Large Language Models (LLMs) have demonstrated impressive generalization capabilities across various tasks, but their claim to practical relevance is still marred by concerns about their reliability. Recent works have proposed examining the activations produced by an LLM at inference time to assess whether its answer to a question is correct. Some works claim that a “geometry of truth” can be learned from examples, in the sense that the activations that generate correct answers can be distinguished from those leading to mistakes with a linear classifier. In this work, we underline a limitation of these approaches: we observe that these “geometries of truth” are intrinsically task-dependent and fail to transfer across tasks. More precisely, we show that linear classifiers trained across distinct tasks share little similarity and, when trained with sparsity-enforcing regularizers, have almost disjoint supports. We show that more sophisticated approaches (e.g., using mixtures of probes and tasks) fail to overcome this limitation, likely because activation vectors commonly used to classify answers form clearly separated clusters when examined across tasks.
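The two comparisons named in this abstract, similarity between linear probes and overlap of their sparse supports, can be made concrete with a short sketch. The probe weight vectors below are invented toy values, and cosine similarity and Jaccard overlap are assumed here as the comparison measures; the paper may quantify similarity differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two probe weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def support(weights, tol=1e-8):
    """Indices of the non-zero coordinates of a sparse probe."""
    return {i for i, w in enumerate(weights) if abs(w) > tol}

def jaccard(a, b):
    """Jaccard overlap of two index sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical sparse probes trained on two different tasks.
probe_task_a = [0.8, 0.0, 0.0, -0.6, 0.0]
probe_task_b = [0.0, 0.7, 0.0, 0.0, 0.7]

direction_similarity = cosine(probe_task_a, probe_task_b)
support_overlap = jaccard(support(probe_task_a), support(probe_task_b))
print(direction_similarity, support_overlap)
```

Disjoint supports force the probe directions to be orthogonal, which is the toy analogue of a probe from one task carrying no signal for another.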
Article 537
Title@2025-07-04 (5): iNews: A Multimodal Dataset for Modeling Personalized Affective Responses to News
Title: iNews: A Multimodal Dataset for Modeling Personalized Affective Responses to News | iNews: Ein multimodaler Datensatz für die Modellierung personalisierter Affective Responses to News | iNews: 用于模拟个人对新闻的情感反应的多模式数据集 2503.03335v2 |
Authors (2): Tiancheng Hu, Nigel Collier
Understanding how individuals perceive and react to information is fundamental for advancing social and behavioral sciences and developing human-centered AI systems. Current approaches often lack the granular data needed to model these personalized responses, relying instead on aggregated labels that obscure the rich variability driven by individual differences. We introduce iNews, a novel large-scale dataset specifically designed to facilitate the modeling of personalized affective responses to news content. Our dataset comprises annotations from 291 demographically diverse UK participants across 2,899 multimodal Facebook news posts from major UK outlets, with an average of 5.18 annotators per sample. For each post, annotators provide multifaceted labels including valence, arousal, dominance, discrete emotions, content relevance judgments, sharing likelihood, and modality importance ratings. Crucially, we collect comprehensive annotator persona information covering demographics, personality, media trust, and consumption patterns, which explain 15.2% of annotation variance - substantially higher than existing NLP datasets. Incorporating this information yields a 7% accuracy gain in zero-shot prediction and remains beneficial even with 32-shot in-context learning. iNews opens new possibilities for research in LLM personalization, subjectivity, affective computing, and human behavior simulation.
Article 538
Title@2025-07-04 (5): STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking
Title: STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking | STRUKTSENSE: Ein Task-Agnostic Agentic Framework für strukturierte Informationsextraktion mit Mensch-in-The-Loop-Evaluierung und Benchmarking | SUCTSENESE: 结构化信息抽取与 “ 网上人 “ 评价和基准确定 2507.03674v1 |
Authors (8): Tek Raj Chhetri, Yibei Chen, Puja Trivedi, Dorota Jarecka, Saif Haobsh, Patrick Ray, Lydia Ng, Satrajit S. Ghosh
The ability to extract structured information from unstructured sources, such as free-text documents and scientific literature, is critical for accelerating scientific discovery and knowledge synthesis. Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks, including structured information extraction. However, their effectiveness often diminishes in specialized, domain-specific contexts that require nuanced understanding and expert-level domain knowledge. In addition, existing LLM-based approaches frequently exhibit poor transferability across tasks and domains, limiting their scalability and adaptability. To address these challenges, we introduce StructSense, a modular, task-agnostic, open-source framework for structured information extraction built on LLMs. StructSense is guided by domain-specific symbolic knowledge encoded in ontologies, enabling it to navigate complex domain content more effectively. It further incorporates agentic capabilities through self-evaluative judges that form a feedback loop for iterative refinement, and includes human-in-the-loop mechanisms to ensure quality and validation. We demonstrate that StructSense can overcome both the limitations of domain sensitivity and the lack of cross-task generalizability, as shown through its application to diverse neuroscience information extraction tasks.
Article 539
Title@2025-07-04 (5): TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection
Title: TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection | TACOS: Offenes Tagging und vergleichendes Scoring für die Auswahl von Feintuning-Daten | TACOS: 用于指示精确调整数据选择的公开贴贴和比较比标 2507.03673v1 |
Authors (7): Xixiang He, Hao Yu, Qiyao Sun, Ao Cheng, Tailai Zhang, Cong Liu, Shuxuan Guo
Instruction Fine-Tuning (IFT) is crucial for aligning large language models (LLMs) with human preferences, and selecting a small yet representative subset from massive data significantly facilitates IFT in terms of both efficiency and effectiveness. Nevertheless, existing approaches suffer from two limitations: the use of simple heuristics restricts data diversity, while singleton data quality evaluation results in inconsistent criteria between independent samples. To address these issues, we present TACOS, an innovative method that integrates Open Tagging and Comparative Scoring for IFT data selection. To capture data diversity, we leverage LLMs to assign open-domain tags to human queries, followed by a normalization stage to denoise the open tags and enable efficient clustering. Additionally, we suggest a comparative scoring method that allows the relative quality evaluation of samples within a cluster, avoiding the inconsistent criteria seen in singleton-based evaluations. Extensive experiments across diverse datasets and LLM architectures demonstrate that TACOS outperforms existing approaches by a large margin. Notably, it achieves superior instruction-following performance on MT-Bench and ranks 1st among LLaMA2-7B-based models on AlpacaEval 2.0, illustrating its efficacy for IFT data selection.
Article 540
Title@2025-07-04 (5): Recon, Answer, Verify: Agents in Search of Truth
Title: Recon, Answer, Verify: Agents in Search of Truth | Recon, Antwort, Überprüfung: Agenten auf der Suche nach Wahrheit | 调查、答复、核查:寻找真相的特工 2507.03671v1 |
Authors (3): Satyam Shukla, Himanshu Dutta, Pushpak Bhattacharyya
Automated fact checking with large language models (LLMs) offers a scalable alternative to manual verification. Evaluating fact checking is challenging because existing benchmark datasets often include post-claim analysis and annotator cues, which are absent in real-world scenarios where claims are fact checked immediately after being made. This limits the realism of current evaluations. We present Politi Fact Only (PFO), a 5-class benchmark dataset of 2,982 political claims from politifact.com, from which all post-claim analysis and annotator cues have been manually removed. This ensures that models are evaluated using only the information that would have been available prior to the claim’s verification. Evaluating LLMs on PFO, we see an average performance drop of 22% in macro F1 compared to PFO’s unfiltered version. Based on the identified challenges of existing LLM-based fact-checking systems, we propose RAV (Recon Answer Verify), an agentic framework with three agents: a question generator, an answer generator, and a label generator. Our pipeline iteratively generates and answers sub-questions to verify different aspects of the claim before finally generating the label. RAV generalizes across domains and label granularities, and it outperforms state-of-the-art approaches on the well-known benchmarks RAWFC (fact checking, 3-class) by 25.28%, and HOVER (encyclopedia, 2-class) by 1.54% on 2-hop, 4.94% on 3-hop, and 1.78% on 4-hop subcategories, respectively. RAV also shows the smallest performance drop among the compared methods, 16.3% in macro F1, when we compare PFO with its unfiltered version.
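Since the results above are reported in macro F1, a short sketch of the standard definition may help: F1 is computed per class and then averaged with equal weight per class. This is a generic illustration, not the authors' evaluation script, and the label names in the example are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical gold and predicted labels for four claims.
gold = ["true", "false", "half-true", "true"]
pred = ["true", "false", "true", "true"]
score = macro_f1(gold, pred)
```

Because every class counts equally, a class the model never predicts (here "half-true") pulls the score down sharply, which is why macro F1 is a common choice for multi-class fact-checking benchmarks with uneven label distributions.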
Article 541
Title@2025-07-04 (5): Interaction Techniques that Encourage Longer Prompts Can Improve Psychological Ownership when Writing with AI
Title: Interaction Techniques that Encourage Longer Prompts Can Improve Psychological Ownership when Writing with AI | Interaktionstechniken, die länger prompts fördern, können die Psychologische Eigentümerschaft verbessern, wenn sie mit KI schreiben | 鼓励较长时间提示的交互技术在与大赦国际撰写文章时能够改善心理自主性 2507.03670v1 |
Authors (2): Nikhita Joshi, Daniel Vogel
Writing longer prompts for an AI assistant to generate a short story increases psychological ownership, a user’s feeling that the writing belongs to them. To encourage users to write longer prompts, we evaluated two interaction techniques that modify the prompt entry interface of chat-based generative AI assistants: pressing and holding the prompt submission button, and continuously moving a slider up and down when submitting a short prompt. A within-subjects experiment investigated the effects of such techniques on prompt length and psychological ownership, and results showed that these techniques increased prompt length and led to higher psychological ownership than baseline techniques. A second experiment further augmented these techniques by showing AI-generated suggestions for how the prompts could be expanded. This further increased prompt length, but did not lead to improvements in psychological ownership. Our results show that simple interface modifications like these can elicit more writing from users and improve psychological ownership.
Article 542
Title@2025-07-04 (5): TRACE: Training and Inference-Time Interpretability Analysis for Language Models
Title: TRACE: Training and Inference-Time Interpretability Analysis for Language Models | TRACE: Trainings- und Inferenz-Zeit-Interpretabilitätsanalyse für Sprachmodelle | TRACE: 语文模式培训和推断-时间解释性分析 2507.03668v1 |
Authors (3): Nura Aljaafari, Danilo S. Carvalho, André Freitas
Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including feature probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression, signals overlooked by traditional scalar metrics such as loss or accuracy. With minimal integration effort, the tool enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.
Article 543
Title@2025-07-04 (5): Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs
Title: Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs | Re-Emergent-Missausrichtung: Wie schmale Feintuning-Erodes Sicherheit Ausrichtung in LLMs | 重新出现错配: LLM 中如何将安全对齐 2507.03662v1 |
Authors (1): Jeremiah Giordani
Recent work has shown that fine-tuning large language models (LLMs) on code with security vulnerabilities can result in misaligned and unsafe behaviors across broad domains. These results prompted concerns about the emergence of harmful behaviors from narrow domain fine-tuning. In this paper, we contextualize these findings by analyzing how such narrow adaptation impacts the internal mechanisms and behavioral manifestations of LLMs. Through a series of experiments covering output probability distributions, loss and gradient vector geometry, layer-wise activation dynamics, and activation space dimensions, we find that behaviors attributed to “emergent misalignment” may be better interpreted as an erosion of prior alignment. We show that fine-tuning on insecure code induces internal changes that oppose alignment. Further, we identify a shared latent dimension in the model’s activation space that governs alignment behavior. We show that this space is activated by insecure code and by misaligned responses more generally, revealing how narrow fine-tuning can degrade general safety behavior by interfering with shared internal mechanisms. Our findings offer a mechanistic interpretation for previously observed misalignment phenomena and highlight the fragility of alignment in LLMs. The results underscore the need for more robust fine-tuning strategies that preserve intended behavior across domains.
Article 544
Title@2025-07-04 (5): Exploring Robustness of LLMs to Paraphrasing Based on Sociodemographic Factors
Title: Exploring Robustness of LLMs to Paraphrasing Based on Sociodemographic Factors | Erforschung der Robustheit von LLMs zur Paraphrasierung auf der Grundlage soziodemografischer Faktoren | 探讨根据社会人口因素将长效LLMs 2501.08276v2 |
Authors (3): Pulkit Arora, Akbar Karimi, Lucie Flek
Despite their linguistic prowess, LLMs have been shown to be vulnerable to small input perturbations. While robustness to local adversarial changes has been studied, robustness to global modifications such as different linguistic styles remains underexplored. Therefore, we take a broader approach to explore a wider range of variations across sociodemographic dimensions. We extend the SocialIQA dataset to create diverse paraphrased sets conditioned on sociodemographic factors (age and gender). The assessment aims to provide a deeper understanding of LLMs in (a) their capability of generating demographic paraphrases with engineered prompts and (b) their capabilities in interpreting real-world, complex language scenarios. We also perform a reliability analysis of the generated paraphrases looking into linguistic diversity and perplexity as well as manual evaluation. We find that demographic-based paraphrasing significantly impacts the performance of language models, indicating that the subtleties of linguistic variation remain a significant challenge. We will make the code and dataset available for future research.
Article 545
Title@2025-07-04 (5): Disentangling the Roles of Representation and Selection in Data Pruning
Title: Disentangling the Roles of Representation and Selection in Data Pruning | Entwirren der Rollen von Repräsentation und Auswahl bei der Datenkorrektur | 分离数据谨慎中的代表和甄选作用 2507.03648v1 |
Authors (6): Yupei Du, Yingjin Song, Hugh Mee Wong, Daniil Ignatev, Albert Gatt, Dong Nguyen
Data pruning, selecting small but impactful subsets, offers a promising way to efficiently scale NLP model training. However, existing methods often involve many different design choices, which have not been systematically studied. This limits future developments. In this work, we decompose data pruning into two key components: the data representation and the selection algorithm, and we systematically analyze their influence on the selection of instances. Our theoretical and empirical results highlight the crucial role of representations: better representations, e.g., training gradients, generally lead to a better selection of instances, regardless of the chosen selection algorithm. Furthermore, different selection algorithms excel in different settings, and none consistently outperforms the others. Moreover, the selection algorithms do not always align with their intended objectives: for example, algorithms designed for the same objective can select drastically different instances, highlighting the need for careful evaluation.
Article 546
Title@2025-07-04 (5): ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving
Title: ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving | ArithmAttack: Bewertung der Robustheit von LLMs zum geräuschvollen Kontext bei der Lösung von Math-Problemen | ArithmAttack:评估理学问题解答时对吵闹背景的LLMs强力 2501.08203v2 |
Authors (4): Zain Ul Abedin, Shahzeb Qamar, Lucie Flek, Akbar Karimi
While Large Language Models (LLMs) have shown impressive capabilities in math problem-solving tasks, their robustness to noisy inputs is not well studied. We propose ArithmAttack to examine how robust the LLMs are when they encounter noisy prompts that contain extra noise in the form of punctuation marks. While being easy to implement, ArithmAttack does not cause any information loss since words are not added to or deleted from the context. We evaluate the robustness of eight LLMs, including LLama3, Mistral, Mathstral, and DeepSeek on noisy GSM8K and MultiArith datasets. Our experiments suggest that all the studied models show vulnerability to such noise, with more noise leading to poorer performance.
Article 547
Title@2025-07-04 (5): Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion
Title: Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion | Verbesserung der Low-Resource-Dialektenklassifikation mittels Retrieval-basierter Sprachumwandlung | 利用以检索为基础的语音转换改进低资源 2507.03641v1 |
Authors (5): Lea Fischbach, Akbar Karimi, Caroline Kleen, Alfred Lameli, Lucie Flek
Deep learning models for dialect identification are often limited by the scarcity of dialectal data. To address this challenge, we propose to use Retrieval-based Voice Conversion (RVC) as an effective data augmentation method for a low-resource German dialect classification task. By converting audio samples to a uniform target speaker, RVC minimizes speaker-related variability, enabling models to focus on dialect-specific linguistic and phonetic features. Our experiments demonstrate that RVC enhances classification performance when utilized as a standalone augmentation method. Furthermore, combining RVC with other augmentation methods such as frequency masking and segment removal leads to additional performance gains, highlighting its potential for improving dialect classification in low-resource scenarios.
Article 548
Title@2025-07-04 (5): Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy
Title: Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy | Ist es Zeit, Prompts als Code zu behandeln? Eine Multi-Use-Fallstudie für Prompt-Optimierung mit DSPy | 是否是时候将提示作为代码处理? 使用 DSPy 快速优化的多用途案例研究 2507.03620v1 |
Authors (3): Francisca Lemos, Victor Alves, Filipa Ferraz
Although prompt engineering is central to unlocking the full potential of Large Language Models (LLMs), crafting effective prompts remains a time-consuming trial-and-error process that relies on human intuition. This study investigates Declarative Self-improving Python (DSPy), an optimization framework that programmatically creates and refines prompts, applied to five use cases: guardrail enforcement, hallucination detection in code, code generation, routing agents, and prompt evaluation. Each use case explores how prompt optimization via DSPy influences performance. While some cases demonstrated modest improvements, such as minor gains in the guardrails use case and selective enhancements in hallucination detection, others showed notable benefits. The prompt evaluation criterion task demonstrated a substantial performance increase, with accuracy rising from 46.2% to 64.0%. In the router agent case, we explored whether a poorly performing prompt could be improved and whether a smaller model could match a stronger one through optimized prompting. Although prompt refinement increased accuracy from 85.0% to 90.0%, using the optimized prompt with a cheaper model did not improve performance. Overall, this study’s findings suggest that DSPy’s systematic prompt optimization can enhance LLM performance, particularly when instruction tuning and example selection are optimized together. However, the impact varies by task, highlighting the importance of evaluating specific use cases in prompt optimization research.
Article 549
Title@2025-07-04 (5): EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge
Title: EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge | EMERGE: Ein Benchmark für die Aktualisierung von Wissensgraphen mit neu entstehendem Textwissen | EMEGERGE: 更新具有新兴文字知识的知识图表的基准 2507.03617v1 |
Authors (6): Klim Zaporojets, Daniel Daza, Edoardo Barba, Ira Assent, Roberto Navigli, Paul Groth
Knowledge Graphs (KGs) are structured knowledge repositories containing entities and relations between them. In this paper, we investigate the problem of automatically updating KGs over time with respect to the evolution of knowledge in unstructured textual sources. This problem requires identifying a wide range of update operations based on the state of an existing KG at a specific point in time. This contrasts with traditional information extraction pipelines, which extract knowledge from text independently of the current state of a KG. To address this challenge, we propose a method for lifelong construction of a dataset consisting of Wikidata KG snapshots over time and Wikipedia passages paired with the corresponding edit operations that they induce in a particular KG snapshot. The resulting dataset comprises 376K Wikipedia passages aligned with a total of 1.25M KG edits over 10 different snapshots of Wikidata from 2019 to 2025. Our experimental results highlight challenges in updating KG snapshots based on emerging textual knowledge, positioning the dataset as a valuable benchmark for future research. We will publicly release our dataset and model implementations.
Article 550
Title@2025-07-04 (5): RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning
Title: RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning | RAG+: Verbesserung der Retrieval-Augmented Generation mit anwendungsrelevanter Begründung | RAG+:加强利用应用程序软件软件软件软件支持的检索-启动生成 2506.11555v3 |
Authors (9): Yu Wang, Shiwan Zhao, Zhihu Wang, Ming Fan, Yubo Zhang, Xicheng Zhang, Zhengfan Wang, Heyuan Huang, Ting Liu
The integration of external knowledge through Retrieval-Augmented Generation (RAG) has become foundational in enhancing large language models (LLMs) for knowledge-intensive tasks. However, existing RAG paradigms often overlook the cognitive step of applying knowledge, leaving a gap between retrieved facts and task-specific reasoning. In this work, we introduce RAG+, a principled and modular extension that explicitly incorporates application-aware reasoning into the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and aligned application examples, created either manually or automatically, and retrieves both jointly during inference. This design enables LLMs not only to access relevant information but also to apply it within structured, goal-oriented reasoning processes. Experiments across mathematical, legal, and medical domains, conducted on multiple models, demonstrate that RAG+ consistently outperforms standard RAG variants, achieving average improvements of 3-5%, and peak gains up to 7.5% in complex scenarios. By bridging retrieval with actionable application, RAG+ advances a more cognitively grounded framework for knowledge integration, representing a step toward more interpretable and capable LLMs.
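The dual-corpus idea described above (knowledge items paired with aligned application examples, both retrieved together) can be sketched roughly as follows. This is only an illustration under strong assumptions: the word-overlap scorer stands in for whatever retriever the paper uses, and all names, corpus entries, and the prompt layout are hypothetical.

```python
def overlap_score(query, text):
    """Toy relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_with_applications(query, knowledge, applications, k=1):
    """Rank knowledge items, then attach the application example aligned to each."""
    ranked = sorted(
        knowledge,
        key=lambda kid: overlap_score(query, knowledge[kid]),
        reverse=True,
    )
    return [(knowledge[kid], applications.get(kid)) for kid in ranked[:k]]

def build_prompt(query, pairs):
    """Place each fact next to its worked application before the question."""
    parts = []
    for fact, example in pairs:
        parts.append("Knowledge: " + fact)
        if example is not None:
            parts.append("Application example: " + example)
    parts.append("Question: " + query)
    return "\n".join(parts)

# Hypothetical dual corpus keyed by a shared identifier.
knowledge = {
    "k1": "The derivative of x squared is 2x",
    "k2": "A contract requires offer and acceptance",
}
applications = {
    "k1": "For 3 x squared the derivative is 6x",
    "k2": "An accepted written offer forms a contract",
}

query = "What is the derivative of 5 x squared"
pairs = retrieve_with_applications(query, knowledge, applications)
prompt = build_prompt(query, pairs)
```

The shared key is the design point: retrieval only has to score the knowledge side, and the aligned application example comes along for free.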
Article 551
Title@2025-07-04 (5): ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs
Title: ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs | ALAS: Latente Sprach-Text-Ausrichtung für gesprochenes Sprachverständnis in multimodalen LLMs messen | ALAS: 计量多种模式LM 中口语语言理解的暗中语音-文本对齐 2505.19937v2 |
Authors (4): Pooneh Mousavi, Yingzhi Wang, Mirco Ravanelli, Cem Subakan
Large Language Models (LLMs) are increasingly used in Spoken Language Understanding (SLU), where effective multimodal learning depends on the alignment between audio and text. Despite various fusion methods, no standard metric exists to assess this alignment. This work introduces ALAS (Automatic Latent Alignment Score), a metric that evaluates alignment by measuring correlations between audio and text representations across transformer layers. Experiments on Spoken Question Answering and Emotion Recognition show that ALAS captures meaningful patterns across tasks and layers.
Article 552
Title@2025-07-04 (5): Multi-Hop Reasoning for Question Answering with Hyperbolic Representations
Title: Multi-Hop Reasoning for Question Answering with Hyperbolic Representations | Multi-Hop-Gründung für die Beantwortung von Fragen mit hyperbolischen Darstellungen | 使用超曲解表示式回答问题多理由 2507.03612v1 |
Authors (3): Simon Welz, Lucie Flek, Akbar Karimi
Hyperbolic representations are effective in modeling knowledge graph data, which is widely used to facilitate multi-hop reasoning. However, a rigorous and detailed comparison of hyperbolic and Euclidean spaces for this task is lacking. In this paper, through a simple integration of hyperbolic representations with an encoder-decoder model, we perform a controlled and comprehensive set of experiments to compare the capacity of hyperbolic space versus Euclidean space in multi-hop reasoning. Our results show that the former consistently outperforms the latter across a diverse set of datasets. In addition, through an ablation study, we show that a learnable curvature initialized with the delta hyperbolicity of the utilized data yields superior results to random initializations. Furthermore, our findings suggest that hyperbolic representations can be significantly more advantageous when the datasets exhibit a more hierarchical structure.
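To make the contrast between the two geometries concrete, here is a small sketch of the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model the authors use, so the choice of the Poincaré ball and the parameter name `c` (curvature magnitude) are assumptions.

```python
import math

def poincare_distance(u, v, c=1.0):
    """Geodesic distance in the Poincare ball of curvature -c (points need c*|x|^2 < 1)."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = (1 - c * sq_u) * (1 - c * sq_v)
    if denom <= 0:
        raise ValueError("points must lie strictly inside the ball")
    return math.acosh(1 + 2 * c * sq_diff / denom) / math.sqrt(c)

def euclidean_distance(u, v):
    """Ordinary straight-line distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two pairs of points at right angles: one near the origin, one near the boundary.
near_origin = poincare_distance([0.1, 0.0], [0.0, 0.1])
near_boundary = poincare_distance([0.9, 0.0], [0.0, 0.9])
```

Distances grow rapidly toward the boundary of the ball, so points placed there are much farther apart than their Euclidean coordinates suggest. This is the usual intuition for why hyperbolic space suits tree-like, hierarchical data.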
Article 553
Title@2025-07-04 (5): ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation
Title: ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation | ViMRHP: Ein vietnamesischer Benchmark-Datensatz für die multimodale Überprüfung von Hilfsvorhersage über Human-AI Collaborative Annotation | VIMRHP:越南通过人类 – – AI合作说明,为多模式审查 “ 助助预测 “ 建立一个越南基准数据集 2505.07416v2 |
Authors (4): Truc Mai-Thanh Nguyen, Dat Minh Nguyen, Son T. Luu, Kiet Van Nguyen
Multimodal Review Helpfulness Prediction (MRHP) is an essential task in recommender systems, particularly in e-commerce platforms. Determining the helpfulness of user-generated reviews enhances user experience and improves consumer decision-making. However, existing datasets focus predominantly on English and Indonesian, resulting in a lack of linguistic diversity, especially for low-resource languages such as Vietnamese. In this paper, we introduce ViMRHP (Vietnamese Multimodal Review Helpfulness Prediction), a large-scale benchmark dataset for the MRHP task in Vietnamese. This dataset covers four domains, including 2K products with 46K reviews. Building a large-scale dataset, however, requires considerable time and cost. To optimize the annotation process, we leverage AI to assist annotators in constructing the ViMRHP dataset. With AI assistance, annotation time is reduced (from 90 to 120 seconds per task down to 20 to 40 seconds per task) while maintaining data quality and lowering overall costs by approximately 65%. However, AI-generated annotations still have limitations in complex annotation tasks, which we further examine through a detailed performance analysis. In our experiment on ViMRHP, we evaluate baseline models on human-verified and AI-generated annotations to assess their quality differences. The ViMRHP dataset is publicly available at https://github.com/trng28/ViMRHP
Article 554
Title@2025-07-04 (5): Should We Still Pretrain Encoders with Masked Language Modeling?
Title: Should We Still Pretrain Encoders with Masked Language Modeling? | Sollten wir noch Encoder mit maskierten Sprachmodellen vortrainieren? | 我们是否仍应该为带有隐蔽语言建模的编程者预作准备? 2507.00994v2 |
Authors (8): Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo
Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.
Article 555
Title@2025-07-04 (5): RECA-PD: A Robust Explainable Cross-Attention Method for Speech-based Parkinson’s Disease Classification
Title: RECA-PD: A Robust Explainable Cross-Attention Method for Speech-based Parkinson’s Disease Classification | RECA-PD: Robuste, erklärbare Cross-Attention-Methode zur Klassifizierung sprachbasierter Parkinson-Krankheiten | REECA-PD: 以语音为基础的帕金森氏病分类的强有力的可解释交叉注意方法 2507.03594v1 |
Authors (4): Terry Yi Zhong, Cristian Tejedor-Garcia, Martha Larson, Bastiaan R. Bloem
Parkinson’s Disease (PD) affects over 10 million people globally, with speech impairments often preceding motor symptoms by years, making speech a valuable modality for early, non-invasive detection. While recent deep-learning models achieve high accuracy, they typically lack the explainability required for clinical use. To address this, we propose RECA-PD, a novel, robust, and explainable cross-attention architecture that combines interpretable speech features with self-supervised representations. RECA-PD matches state-of-the-art performance in speech-based PD detection while providing explanations that are more consistent and more clinically meaningful. Additionally, we demonstrate that performance degradation in certain speech tasks (e.g., monologue) can be mitigated by segmenting long recordings. Our findings indicate that performance and explainability are not necessarily mutually exclusive. Future work will enhance the usability of explanations for non-experts and explore severity estimation to increase real-world clinical relevance.
Article 556
Title@2025-07-04 (5): Evaluating AI capabilities in detecting conspiracy theories on YouTube
Title: Evaluating AI capabilities in detecting conspiracy theories on YouTube | Bewertung von KI-Fähigkeiten bei der Entdeckung von Verschwörungstheorien auf YouTube | 评价大赦国际在YouTube上发现阴谋论的能力 2505.23570v2 |
Authors (3): Leonardo La Rocca, Francesco Corso, Francesco Pierri
YouTube is a leading online platform with a vast global audience, and its extensive reach also makes it susceptible to hosting harmful content, including disinformation and conspiracy theories. This study explores the use of open-weight Large Language Models (LLMs), both text-only and multimodal, for identifying conspiracy theory videos shared on YouTube. Leveraging a labeled dataset of thousands of videos, we evaluate a variety of LLMs in a zero-shot setting and compare their performance to a fine-tuned RoBERTa baseline. Results show that text-based LLMs achieve high recall but lower precision, leading to increased false positives. Multimodal models lag behind their text-only counterparts, indicating limited benefits from visual data integration. To assess real-world applicability, we evaluate the most accurate models on an unlabeled dataset, finding that RoBERTa achieves performance close to that of LLMs with a larger number of parameters. Our work highlights the strengths and limitations of current LLM-based approaches for online harmful content detection, emphasizing the need for more precise and robust systems.
Article 557
Title@2025-07-04 (5): Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation
Title: Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation | Causal-SAM-LLM: Große Sprachmodelle als Ursache für robuste medizinische Segmentierung | Causal-SAM-LLLM:大语言模型,作为强力医疗分解的原因 2507.03585v1 |
Authors (4): Tao Tang, Shijie Xu, Yiting Wu, Zhixiang Lu
The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Language Models (LLMs) to the role of causal reasoners. Our framework, built upon a frozen Segment Anything Model (SAM) encoder, incorporates two synergistic innovations. First, Linguistic Adversarial Disentanglement (LAD) employs a Vision-Language Model to generate rich, textual descriptions of confounding image styles. By training the segmentation model’s features to be contrastively dissimilar to these style descriptions, it learns a representation robustly purged of non-causal information. Second, Test-Time Causal Intervention (TCI) provides an interactive mechanism where an LLM interprets a clinician’s natural language command to modulate the segmentation decoder’s features in real-time, enabling targeted error correction. We conduct an extensive empirical evaluation on a composite benchmark from four public datasets (BTCV, CHAOS, AMOS, BraTS), assessing generalization under cross-scanner, cross-modality, and cross-anatomy settings. Causal-SAM-LLM establishes a new state of the art in out-of-distribution (OOD) robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, all while using less than 9% of the full model’s trainable parameters. Our work charts a new course for building robust, efficient, and interactively controllable medical AI systems.
Article 558
Title@2025-07-04 (5): Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits
Title: Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits | Ambiguous Terminologie von Preference Optimization auf Post-Edits übersetzen lernen | 学习如何通过“优先优化”在编辑后采用“优先优化”来翻译模糊的名词 2507.03580v1 |
Authors (5): Nathaniel Berger, Johannes Eschbach-Dymanus, Miriam Exel, Matthias Huck, Stefan Riezler
In real-world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but the correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization, with both term-specific and full sequence objectives, yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. Additionally, we release test sets from our post-edited data and terminology dictionary.
Article 559
Title@2025-07-04 (5): Articulatory clarity and variability before and after surgery for tongue cancer
Title: Articulatory clarity and variability before and after surgery for tongue cancer | Artikulatorische Klarheit und Variabilität vor und nach der Operation bei Zungenkrebs | 舌癌手术前后和手术前后的产道清晰度和变异性 2507.03576v1 |
Authors (13): Thomas Tienkamp, Fleur van Ast, Roos van der Veen, Teja Rebernik, Raoul Buurke, Nikki Hoekzema, Katharina Polsterer, Hedwig Sekeres, Rob van Son, Martijn Wieling, Max Witjes, Sebastiaan de Visscher, Defne Abur
Surgical treatment for tongue cancer can negatively affect the mobility and musculature of the tongue, which can influence articulatory clarity and variability. In this study, we investigated articulatory clarity through the vowel articulation index (VAI) and variability through vowel formant dispersion (VFD). Using a sentence reading task, we assessed 11 individuals before and six months after tongue cancer surgery, alongside 11 sex- and age-matched typical speakers. Our results show that while the VAI was significantly smaller post-surgery compared to pre-surgery, there was no significant difference between patients and typical speakers at either time point. Post-surgery, speakers had higher VFD values for /i/ compared to pre-surgery and typical speakers, signalling higher variability. Taken together, our results suggest that while articulatory clarity remained within typical ranges following surgery for tongue cancer for the speakers in our study, articulatory variability increased.
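The abstract does not spell out how the VAI is computed. A minimal sketch, assuming the commonly cited definition based on the first two formants of the corner vowels /i/, /u/, and /a/ (the paper's exact procedure may differ, and the formant values below are hypothetical):

```python
def vowel_articulation_index(f1_i, f2_i, f1_u, f2_u, f1_a, f2_a):
    """VAI = (F2/i/ + F1/a/) / (F1/i/ + F1/u/ + F2/u/ + F2/a/).

    Formants that grow with vowel centralization sit in the denominator,
    so a smaller VAI indicates a more reduced vowel space.
    """
    return (f2_i + f1_a) / (f1_i + f1_u + f2_u + f2_a)


# Hypothetical formant means in Hz (not data from the study).
clear = vowel_articulation_index(300, 2300, 320, 800, 750, 1300)
centralized = vowel_articulation_index(400, 2000, 400, 1000, 650, 1300)
print(round(clear, 3), round(centralized, 3))
```

Under this definition, a drop in VAI after surgery corresponds to corner vowels moving toward the center of the vowel space.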
Article 560
Title@2025-07-04 (5): ReviewInstruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models
Title: ReviewInstruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models | ReviewInstruct: Eine review-getriebene Multi-Turn-Konversations-Erstellungsmethode für große Sprachmodelle | 审查指南:大语言模型的复议-驱动多发多发对话生成方法 2505.11010v2 |
Authors (13): Jiangxu Wu, Cong Wang, TianHuang Su, Jun Yang, Haozhi Lin, Chao Zhang, Ming Peng, Kai Shi, SongPan Yang, BinQing Pan, ZiXian Li, Ni Yang, ZhenYu Yang
The effectiveness of large language models (LLMs) in conversational AI is hindered by their reliance on single-turn supervised fine-tuning (SFT) data, which limits contextual coherence in multi-turn dialogues. Existing methods for generating multi-turn dialogue data struggle to ensure both diversity and quality in instructions. To address this, we propose Review-Instruct, a novel framework that synthesizes multi-turn conversations through an iterative “Ask-Respond-Review” process involving three agent roles: a Candidate, multiple Reviewers, and a Chairman. The framework iteratively refines instructions by incorporating Reviewer feedback, enhancing dialogue diversity and difficulty. We construct a multi-turn dataset using the Alpaca dataset and fine-tune the LLaMA2-13B model. Evaluations on MT-Bench, MMLU-Pro, and Auto-Arena demonstrate significant improvements, achieving absolute gains of 2.9% on MMLU-Pro and 2% on MT-Bench compared to prior state-of-the-art models based on LLaMA2-13B. Ablation studies confirm the critical role of the Review stage and the use of multiple Reviewers in boosting instruction diversity and difficulty. Our work highlights the potential of review-driven, multi-agent frameworks for generating high-quality conversational data at scale.
Article 561
Title@2025-07-04 (5): H2HTalk: Evaluating Large Language Models as Emotional Companion
Title: H2HTalk: Evaluating Large Language Models as Emotional Companion | H2HTalk: Große Sprachmodelle als emotionalen Begleiter bewerten | H2H Talk:评价作为情感伴侣的大语言模式 2507.03543v1 |
Authors (4): Boyang Wang, Yalun Wu, Hongcheng Guo, Zhoujun Li
As digital emotional support needs grow, Large Language Model companions promise authentic, always-available empathy, though rigorous evaluation lags behind model advancement. We present Heart-to-Heart Talk (H2HTalk), a benchmark assessing companions across personality development and empathetic interaction, balancing emotional intelligence with linguistic fluency. H2HTalk features 4,650 curated scenarios spanning dialogue, recollection, and itinerary planning that mirror real-world support conversations, substantially exceeding previous datasets in scale and diversity. We incorporate a Secure Attachment Persona (SAP) module implementing attachment-theory principles for safer interactions. Benchmarking 50 LLMs with our unified protocol reveals that long-horizon planning and memory retention remain key challenges, with models struggling when user needs are implicit or evolve mid-conversation. H2HTalk establishes the first comprehensive benchmark for emotionally intelligent companions. We release all materials to advance development of LLMs capable of providing meaningful and safe psychological support.
Article 562
Title@2025-07-04 (5): Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation
Title: Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation | Aussprache-Lexikon Kostenloses Training für Phoneme-basierte Crosslingual ASR über Joint Stochastic Approximation | 通过联合口交近为基于电话的跨语言ASR提供发音免费培训 2507.06249v1 |
Authors (4): Saierdaer Yusuyin, Te Ma, Hao Huang, Zhijian Ou
Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a latent variable model based method, with phonemes being treated as discrete latent variables. The new method consists of a speech-to-phoneme (S2P) model and a phoneme-to-grapheme (P2G) model, and a grapheme-to-phoneme (G2P) model is introduced as an auxiliary inference model. To jointly train the three models, we utilize the joint stochastic approximation (JSA) algorithm, which is a stochastic extension of the EM (expectation-maximization) algorithm and has demonstrated superior performance particularly in estimating discrete latent variable models. Based on the Whistle multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish (130 h) and Indonesian (20 h). With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5% error rate reductions compared to the best crosslingual fine-tuning approach using subword or full phoneme supervision. Furthermore, it is found that in language domain adaptation (i.e., utilizing cross-domain text-only data), JSA-SPG outperforms the standard practice of language model fusion via the auxiliary support of the G2P model by 9% error rate reductions. To facilitate reproducibility and encourage further exploration in this field, we open-source the JSA-SPG training code and complete pipeline.
Article 563
Title@2025-07-04 (5): AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions
Title: AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions | AI-VaxGuide: Agentische RAG-basierte LLM für Impfentscheidungen | AI-VaxGuide:用于疫苗接种决定的以RAG为基地的AAG剂LLMLM 2507.03493v1 |
Authors (4): Abdellah Zeggai, Ilyes Traikia, Abdelhak Lakehal, Abdennour Boulesnane
Vaccination plays a vital role in global public health, yet healthcare professionals often struggle to access immunization guidelines quickly and efficiently. National protocols and WHO recommendations are typically extensive and complex, making it difficult to extract precise information, especially during urgent situations. This project tackles that issue by developing a multilingual, intelligent question-answering system that transforms static vaccination guidelines into an interactive and user-friendly knowledge base. Built on a Retrieval-Augmented Generation (RAG) framework and enhanced with agent-based reasoning (Agentic RAG), the system provides accurate, context-sensitive answers to complex medical queries. Evaluation shows that Agentic RAG outperforms traditional methods, particularly in addressing multi-step or ambiguous questions. To support clinical use, the system is integrated into a mobile application designed for real-time, point-of-care access to essential vaccine information. The AI-VaxGuide model is publicly available at https://huggingface.co/VaxGuide
Article 564
Title@2025-07-04 (5): Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences
Title: Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences | Vier Schattierungen der Lebenswissenschaften: Ein Datensatz für Desinformationserkennung in den Lebenswissenschaften | 《生命科学的四大结构:生命科学中虚假信息探测数据集》 2507.03488v1 |
Authors (3): Eva Seidlmayer, Lukas Galke, Konrad U. Förstner
Disseminators of disinformation often seek to attract attention or evoke emotions - typically to gain influence or generate revenue - resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact-checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating and updating the dataset is available on GitHub: https://github.com/EvaSeidlmayer/FourShadesofLifeSciences
Article 565
Title@2025-07-04 (5): Chat2SPaT: A Large Language Model Based Tool for Automating Traffic Signal Control Plan Management
Title: Chat2SPaT: A Large Language Model Based Tool for Automating Traffic Signal Control Plan Management | Chat2SPaT: Ein auf großen Sprachmodellen basierendes Tool zur Automatisierung der Verkehrssignalsteuerung | Chat2SPaaT:以大语言为基础的交通信号控制计划管理自动化示范工具 2507.05283v1 |
Authors (6): Yue Wang, Miao Zhou, Guijing Huang, Rui Zhuo, Chao Yi, Zhenliang Ma
Pre-timed traffic signal control, commonly used for operating signalized intersections and coordinated arterials, requires tedious manual work to create and update signaling plans. When time-of-day or day-of-week plans are used, one intersection is often associated with multiple plans, leading to further repetitive manual input of plan parameters. To enable a user-friendly traffic signal control plan management process, this study proposes Chat2SPaT, a method to convert users’ semi-structured and ambiguous descriptions of the signal control plan into exact signal phase and timing (SPaT) results, which can further be transformed into structured stage-based or ring-based plans to interact with intelligent transportation system (ITS) software and traffic signal controllers. With curated prompts, Chat2SPaT first leverages large language models’ (LLMs) capability to understand users’ plan descriptions and reformulate the plan as a combination of phase sequence and phase attribute results in JSON format. Based on the LLM outputs, Python scripts locate phases in a cycle, address nuances of traffic signal control, and finally assemble the complete traffic signal control plan. Within a chat, the pipeline can be used iteratively for further plan editing. Experiments show that Chat2SPaT can generate plans with an accuracy of over 94% for both English and Chinese cases, using a test dataset with over 300 plan descriptions. As the first benchmark for evaluating LLMs’ capability of understanding traffic signal control plan descriptions, Chat2SPaT provides an easy-to-use plan management pipeline for traffic practitioners and researchers, serving as a potential new building block for a more accurate and versatile application of LLMs in the field of ITS. The source code, prompts, and test dataset are openly accessible at https://github.com/yuewangits/Chat2SPaT.
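The second stage described above, where scripts turn the LLM's JSON into a timed plan, might look roughly like the sketch below. The JSON schema, field names, and the `assemble_plan` helper are all hypothetical illustrations and are not taken from the Chat2SPaT repository.

```python
import json

# Hypothetical LLM output: a phase sequence plus per-phase attributes (seconds).
LLM_OUTPUT = """
{
  "cycle_length": 90,
  "phases": [
    {"name": "NS_through", "green": 40, "yellow": 3, "all_red": 2},
    {"name": "EW_through", "green": 40, "yellow": 3, "all_red": 2}
  ]
}
"""


def assemble_plan(raw_json):
    """Place each phase in the cycle and check that the durations add up."""
    spec = json.loads(raw_json)
    clock = 0
    plan = []
    for phase in spec["phases"]:
        green_end = clock + phase["green"]
        yellow_end = green_end + phase["yellow"]
        phase_end = yellow_end + phase["all_red"]
        plan.append({
            "name": phase["name"],
            "green_start": clock,
            "green_end": green_end,
            "yellow_end": yellow_end,
            "phase_end": phase_end,
        })
        clock = phase_end
    if clock != spec["cycle_length"]:
        raise ValueError(f"phases sum to {clock}s, expected {spec['cycle_length']}s")
    return plan


plan = assemble_plan(LLM_OUTPUT)
for entry in plan:
    print(entry)
```

Keeping the arithmetic in deterministic code rather than in the LLM means a timing error surfaces as an explicit exception instead of a silently wrong plan.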
Article 566
Title@2025-07-04 (5): Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right
Title: Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right | Beyond Weaponization: NLP-Sicherheit für mittel- und minderwertige Sprachen im eigenen Recht | 武器化之后的武器化之外:国家语言方案保障中、低资源语言自身权利的安全 2507.03473v1 |
Authors (1): Heather Lent
Despite mounting evidence that multilinguality can be easily weaponized against language models (LMs), works across NLP Security remain overwhelmingly English-centric. In terms of securing LMs, the NLP norm of “English first” collides with standard procedure in cybersecurity, whereby practitioners are expected to anticipate and prepare for worst-case outcomes. To mitigate worst-case outcomes in NLP Security, researchers must be willing to engage with the weakest links in LM security: lower-resourced languages. Accordingly, this work examines the security of LMs for lower- and medium-resourced languages. We extend existing adversarial attacks to up to 70 languages to evaluate the security of monolingual and multilingual LMs for these languages. Through our analysis, we find that monolingual models are often too small in total number of parameters to ensure sound security, and that while multilinguality is helpful, it does not always guarantee improved security either. Ultimately, these findings highlight important considerations for the more secure deployment of LMs for communities of lower-resourced languages.
Article 567
Title@2025-07-04 (5): CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark
Title: CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark | CoreCodeBench: Ein konfigurierbarer Multi-Szenario-Repository-Level-Benchmark | 核心守则:可配置的多设想仓库一级基准 2507.05281v1 |
Authors (13): Lingyue Fu, Hao Guan, Bolun Zhang, Haowei Yuan, Yaoming Zhu, Jun Xu, Zongyu Wang, Lin Qiu, Xunliang Cai, Xuezhi Cao, Weiwen Liu, Weinan Zhang, Yong Yu
As Large Language Models (LLMs) demonstrate increasingly sophisticated code processing capabilities, evaluating their performance on engineering-level code remains challenging. Existing repository-level benchmarks primarily focus on single scenarios, such as code generation or bug fixing, without adequately capturing the diversity and complexity of real-world software or project engineering workflows. Furthermore, these benchmarks suffer from limited controllability in question positioning and reliability issues in their generated test cases. To address these limitations, we present CorePipe, a fully automated pipeline that converts repositories into comprehensive test cases, and introduce CoreCodeBench, a configurable multi-scenario repository-level benchmark. To simulate real engineering scenarios, CorePipe generates three types of atomic questions (Development, BugFix, and Test-Driven Development) specifically targeting core code segments. These atomic questions are further combined into three types of composite questions, with difficulty levels flexibly adjusted through hyperparameter tuning. CoreCodeBench provides a comprehensive and extensive repository-level benchmark to investigate the applicability of LLMs in real-world engineering projects. Experiments with 16 LLMs across diverse scenarios reveal varying capabilities and offer multi-dimensional insights into LLM performance in engineering contexts. The code for CorePipe is available at https://github.com/AGI-Eval-Official/CoreCodeBench, and the data for CoreCodeBench can be accessed at https://huggingface.co/collections/tubehhh/corecodebench-68256d2faabf4b1610a08caa.
Article 568
Title@2025-07-04 (5): Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models
Title: Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models | Verbesserung der sozialen Determinanten der Gesundheitsdokumentation in französischen EHRs mit großen Sprachmodellen | 利用大语言模式,改善法文健康文件的 “ 健康文件 “ 社会决定因素 2507.03433v1 |
Authors (9): Adrien Bazoge, Pacôme Constant dit Beaufils, Mohammed Hmitouch, Romain Bourcier, Emmanuel Morin, Richard Dufour, Béatrice Daille, Pierre-Antoine Gourraud, Matilde Karakachoff
Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. The model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score > 0.80). Performance was lower for categories with limited training data or highly variable expressions, such as employment status, housing, physical activity, income, and education. Our model identified 95.8% of patients with at least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our error analysis showed that performance limitations were linked to annotation inconsistencies, reliance on an English-centric tokenizer, and reduced generalizability due to the model being trained on social history sections only. These results demonstrate the effectiveness of NLP in improving the completeness of real-world SDoH data in a non-English EHR system.
Article 569
Title@2025-07-04 (5): Tokenization is Sensitive to Language Variation
Title: Tokenization is Sensitive to Language Variation | Tokenisierung ist auf Sprachvariation sensitiv | 语法变换对语法变化敏感 2502.15343v2 |
Authors (3): Anna Wegmann, Dong Nguyen, David Jurgens
Variation in language is ubiquitous and often systematically linked to regional, social, and contextual factors. Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks: Tasks where the model should be robust to language variation (e.g., for semantic tasks like NLI, labels do not depend on whether a text uses British or American spelling) and tasks where the model should be sensitive to language variation (e.g., for form-based tasks like authorship verification, labels depend on whether a text uses British or American spelling). We pre-train BERT base models with the popular Byte-Pair Encoding algorithm to investigate how key tokenization design choices impact the performance of downstream models: the corpus used to train the tokenizer, the pre-tokenizer and the vocabulary size. We find that the best tokenizer varies on the two task types and that the pre-tokenizer has the biggest overall impact on performance. Further, we introduce a new approach to estimate tokenizer impact on downstream LLM performance, showing substantial improvement over metrics like Rényi efficiency. We encourage more work on language variation and its relation to tokenizers and thus LLM performance.
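For readers unfamiliar with the baseline metric named above, here is a minimal sketch of Rényi efficiency, assuming its usual definition as the Rényi entropy of the token unigram distribution divided by the log of the vocabulary size. The default `alpha` and the toy counts are illustrative assumptions, not values from the paper.

```python
import math


def renyi_efficiency(token_counts, alpha=2.5, vocab_size=None):
    """Rényi entropy of the unigram token distribution, normalized by log|V|.

    token_counts: mapping from token to its corpus frequency.
    vocab_size:   defaults to the number of distinct tokens observed.
    """
    total = sum(token_counts.values())
    probs = [count / total for count in token_counts.values() if count > 0]
    if vocab_size is None:
        vocab_size = len(token_counts)
    if vocab_size < 2:
        raise ValueError("vocabulary must contain at least two tokens")
    if alpha == 1:
        # Limit case: Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1 - alpha)
    return entropy / math.log(vocab_size)


# Toy counts: a balanced vocabulary versus one dominated by a single token.
balanced = renyi_efficiency({"a": 25, "b": 25, "c": 25, "d": 25})
skewed = renyi_efficiency({"a": 97, "b": 1, "c": 1, "d": 1})
print(round(balanced, 3), round(skewed, 3))
```

A perfectly uniform token distribution scores 1.0, and heavily skewed distributions score lower; the metric depends only on token frequencies, not on which linguistic forms the tokens capture.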
Article 570
Title@2025-07-04 (5): SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation
Title: SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation | SMCLM: Semantisch sinnvolle Kausale Sprachmodellierung für autoregressive Paraphrase-Generierung | SMCLM: 用于自动递减性参数生成的具有广泛意义的原因语言建模 2507.03415v1 |
Authors (3): Michał Perełkiewicz, Sławomir Dadas, Rafał Poświata
This article introduces semantically meaningful causal language modeling (SMCLM), a self-supervised method of training autoregressive models to generate semantically equivalent text. Our approach involves using semantically meaningful text representation as an initial embedding in the autoregressive training and generation processes. The extensive empirical study demonstrates that the SMCLM approach makes autoregressive models capable of learning robust and high-quality paraphrase generation. The proposed method is competitive with the supervised method and achieves state-of-the-art results in unsupervised approaches. This article also presents a comprehensive set of automatic metrics that cover a wide range of autogenerated paraphrase evaluation aspects. Simultaneously, this article highlights the low reliability of the metrics that are widely used in paraphrase generation evaluation, including BLEU, ROUGE, and BERTScore.
Article 571
Title@2025-07-04 (5): Graph Repairs with Large Language Models: An Empirical Study
Title: Graph Repairs with Large Language Models: An Empirical Study | Graphische Reparaturen mit großen Sprachmodellen: Eine empirische Studie | 使用大语言模型修图:经验研究 2507.03410v1 |
Authors (3): Hrishikesh Terdalkar, Angela Bonifati, Andrea Mauri
Property graphs are widely used in domains such as healthcare, finance, and social networks, but they often contain errors due to inconsistencies, missing data, or schema violations. Traditional rule-based and heuristic-driven graph repair methods are limited in their adaptability as they need to be tailored for each dataset. On the other hand, interactive human-in-the-loop approaches may become infeasible when dealing with large graphs, as the cost of involving users, in terms of both time and effort, becomes too high. Recent advancements in Large Language Models (LLMs) present new opportunities for automated graph repair by leveraging contextual reasoning and their access to real-world knowledge. We evaluate the effectiveness of six open-source LLMs in repairing property graphs. We assess repair quality, computational cost, and model-specific performance. Our experiments show that LLMs have the potential to detect and correct errors, with varying degrees of accuracy and efficiency. We discuss the strengths, limitations, and challenges of LLM-driven graph repair and outline future research directions for improving scalability and interpretability.
Article 572
Title@2025-07-04 (5): ReservoirChat: Interactive Documentation Enhanced with LLM and Knowledge Graph for ReservoirPy
Title: ReservoirChat: Interactive Documentation Enhanced with LLM and Knowledge Graph for ReservoirPy | ReservoirChat: Interaktive Dokumentation mit LLM und Wissensdiagramm für ReservoirPy | RESSOCWChat:与LLM和知识图增强互动文件 2507.05279v1 |
Authors (4): Virgile Boraud, Yannis Bendi-Ouis, Paul Bernard, Xavier Hinaut
We introduce a tool designed to improve the capabilities of Large Language Models (LLMs) in assisting with code development using the ReservoirPy library, as well as in answering complex questions in the field of Reservoir Computing. By incorporating external knowledge through Retrieval-Augmented Generation (RAG) and knowledge graphs, our approach aims to reduce hallucinations and increase the factual accuracy of generated responses. The system provides an interactive experience similar to ChatGPT, tailored specifically for ReservoirPy, enabling users to write, debug, and understand Python code while accessing reliable domain-specific insights. In our evaluation, while proprietary models such as ChatGPT-4o and NotebookLM performed slightly better on general knowledge questions, our model outperformed them on coding tasks and showed a significant improvement over its base model, Codestral-22B.
Article 573
Title@2025-07-04 (5): On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages
Title: On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages | Zur Expressivität und Längenverallgemeinerung von selektiven State-Space-Modellen auf regulären Sprachen | 关于常规语言选择性国家空间模型的表达性和时间长度的一般化 2412.19350v2 |
Authors (6): Aleksandar Terzić, Michael Hersche, Giacomo Camposampiero, Thomas Hofmann, Abu Sebastian, Abbas Rahimi
Selective state-space models (SSMs) are an emerging alternative to the Transformer, offering the unique advantage of parallel training and sequential inference. Although these models have shown promising performance on a variety of tasks, their formal expressiveness and length generalization properties remain underexplored. In this work, we provide insight into the workings of selective SSMs by analyzing their expressiveness and length generalization performance on regular language tasks, i.e., finite-state automaton (FSA) emulation. We address certain limitations of modern SSM-based architectures by introducing the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization on a set of various regular language tasks using a single layer. It utilizes a dictionary of dense transition matrices, a softmax selection mechanism that creates a convex combination of dictionary matrices at each time step, and a readout consisting of layer normalization followed by a linear map. We then proceed to evaluate variants of diagonal selective SSMs by considering their empirical performance on commutative and non-commutative automata. We explain the experimental results with theoretical considerations. Our code is available at https://github.com/IBM/selective-dense-state-space-model.
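The three ingredients listed above (a dictionary of dense transition matrices, softmax selection yielding a convex combination per step, and a layer-norm plus linear readout) can be illustrated on the parity automaton. This is a toy sketch with hand-set weights, not the authors' implementation: the real model learns its parameters, and the selection scale, initial state, and readout weights here are assumptions chosen for the demonstration.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def matvec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def layer_norm(vector, eps=1e-5):
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]


# Dictionary of dense transition matrices (hand-set): bit 0 keeps the state,
# bit 1 swaps the "even" and "odd" state components.
DICTIONARY = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
SELECT_SCALE = 20.0  # assumed sharpness of the selection logits


def sd_ssm_parity(bits):
    """Return 1 if the bit sequence contains an odd number of ones, else 0."""
    state = [1.0, 0.0]  # start in the "even" state
    for bit in bits:
        logits = [SELECT_SCALE if k == bit else 0.0 for k in range(len(DICTIONARY))]
        weights = softmax(logits)
        # Convex combination of the dictionary matrices for this time step.
        transition = [
            [sum(w * m[i][j] for w, m in zip(weights, DICTIONARY)) for j in range(2)]
            for i in range(2)
        ]
        state = matvec(transition, state)
    normed = layer_norm(state)
    score = -normed[0] + normed[1]  # linear readout with weights (-1, +1)
    return 1 if score > 0 else 0


print(sd_ssm_parity([1, 0, 1, 1]))
```

Because the same per-step update is applied regardless of position, the hand-set model behaves identically on short and long inputs, which gives some intuition for why a dense transition dictionary suits automaton emulation.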
Article 574
Title@2025-07-04 (5): Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Title: Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers | Verallgemeinerung oder Halluzination? Verstehen von Out-of-Context-Reasoning in Transformers | 通化还是幻觉? 理解变异器的逻辑外原因 2506.10887v2 |
Authors (8): Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei
Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
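One standard identity from matrix analysis helps make the role of factorization plausible. It is offered here as background under stated assumptions, not as the paper's own derivation: for a combined matrix $W$ and factors $W_O$, $W_V$ whose inner dimension is at least $\operatorname{rank}(W)$, the nuclear norm equals the smallest average squared Frobenius norm over all factorizations.

```latex
\[
  \lVert W \rVert_{*}
  \;=\;
  \min_{W_O W_V = W}\;
  \tfrac{1}{2}\left( \lVert W_O \rVert_F^{2} + \lVert W_V \rVert_F^{2} \right),
  \qquad
  \lVert W \rVert_{*} = \sum_{i} \sigma_i(W).
\]
```

Here $\sigma_i(W)$ are the singular values of $W$. Read this way, keeping the Frobenius norms of the separate factors small acts like a nuclear-norm (low-rank) preference on their product, whereas the same pressure applied to a single combined matrix is only a Frobenius-norm preference.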
Article 575
Title@2025-07-04 (5): Entity-aware Cross-lingual Claim Detection for Automated Fact-checking
Title: Entity-aware Cross-lingual Claim Detection for Automated Fact-checking | Entity-aware Cross-lingual Claim Detection for Automated Fact-Checking | 用于自动实况调查的有实体意识的跨语言交叉索赔调查 2503.15220v3 |
Authors (2): Rrubaa Panchendrarajan, Arkaitz Zubiaga
Identifying claims requiring verification is a critical task in automated fact-checking, especially given the proliferation of misinformation on social media platforms. Despite notable progress, challenges remain, particularly in handling multilingual data prevalent in online discourse. Recent efforts have focused on fine-tuning pre-trained multilingual language models to address this. While these models can handle multiple languages, their ability to effectively transfer cross-lingual knowledge for detecting claims spreading on social media remains under-explored. In this paper, we introduce EX-Claim, an entity-aware cross-lingual claim detection model that generalizes well to handle multilingual claims. The model leverages entity information derived from named entity recognition and entity linking techniques to improve the language-level performance of both seen and unseen languages during training. Extensive experiments conducted on three datasets from different social media platforms demonstrate that our proposed model stands out as an effective solution, with consistent performance gains across 27 languages and robust knowledge transfer between languages seen and unseen during training.
Article 576
Title@2025-07-04 (5): EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization
Title: EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization | EAQuant: Verbesserung der Post-Training-Quantisierung für MoE-Modelle durch Experten-Aware-Optimierung | EAQuant:通过专家-软件优化,加强培训后对教育部模型的量化 2506.13329v2 |
Authors (8): Zhongqian Fu, Ning Ding, Kai Han, Xianzhi Yu, Xiaosong Li, Xinghao Chen, Yehui Tang, Yunhe Wang
Mixture-of-Experts (MoE) models have emerged as a cornerstone of large-scale deep learning by efficiently distributing computation and enhancing performance. However, their unique architecture, characterized by sparse expert activation and dynamic routing mechanisms, introduces inherent complexities that challenge conventional quantization techniques. Existing post-training quantization (PTQ) methods struggle to address activation outliers, router consistency and sparse expert calibration, leading to significant performance degradation. To bridge this gap, we propose EAQuant, a novel PTQ framework tailored for MoE architectures. Our method systematically tackles these challenges through three key innovations: (1) expert-aware smoothing aggregation to suppress activation outliers and stabilize quantization, (2) router logits distribution alignment to preserve expert selection consistency post-quantization, and (3) expert-level calibration data balance to optimize sparsely activated experts. Extensive experiments across W4A4 and extreme W3A4 quantization configurations demonstrate that EAQuant significantly outperforms existing methods, achieving average score improvements of 1.15% to 2.28% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks and robust performance retention under aggressive quantization. By integrating these innovations, EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression. Our code is available at https://github.com/darren-fzq1/EAQuant.
Article 577
Title@2025-07-04 (5): Making Sense of Korean Sentences: A Comprehensive Evaluation of LLMs through KoSEnd Dataset
Title: Making Sense of Korean Sentences: A Comprehensive Evaluation of LLMs through KoSEnd Dataset | Gespür für koreanische Sätze: Eine umfassende Bewertung von LLMs durch KoSEnd-Datensatz | 韩国判刑感:通过KoSend数据集对LLMs的全面评估 2507.03378v1 |
Authors (4): Seunguk Yu, Kyeonghyun Kim, Jungmin Yun, Youngbin Kim
Although LLMs have made significant progress in various languages, there are still concerns about their effectiveness with low-resource agglutinative languages compared to languages such as English. In this study, we focused on Korean, a language known for its complex sentence endings, and evaluated LLMs on this challenging aspect. We introduce the Korean Sentence Endings (KoSEnd) dataset, which includes 3,000 sentences, each annotated for the naturalness of 15 sentence ending forms. These were collected from diverse sources to cover a range of contexts. We evaluated 11 LLMs to assess their understanding of Korean sentence endings, analyzing them based on parameter count and prediction consistency. Notably, we found that informing models about the possibility of missing sentence endings improved performance, highlighting the impact of explicitly considering certain linguistic features.
Article 578
Title@2025-07-04 (5): Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections
Title: Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections | Implizite Belohnung als Brücke: Eine einheitliche Ansicht von SFT- und DPO-Verbindungen | 隐含奖励为桥梁:SFT和DPO连接的统一观点 2507.00018v2 |
Authors (10): Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, Xipeng Qiu
Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. Through rigorous mathematical derivation, we demonstrate that both SFT and preference learning methods like Direct Preference Optimization (DPO) operate within the same optimal policy-reward subspace, with SFT representing a special case of implicit reward learning. Our analysis reveals a critical limitation in conventional SFT: the KL divergence term in distribution matching becomes constant with respect to the policy during optimization, failing to constrain model updates. To address this, we propose a simple yet effective learning rate reduction approach that yields significant performance improvements (up to 25% relative gain and 6% absolute win rate increase in instruction-following tasks). Additionally, we derive alternative SFT objectives from various f-divergence functions that preserve the KL term during optimization, further enhancing post-DPO model performance. Finally, we extend the theoretical relationship between LLM logits and Q-functions from preference learning to the SFT context, providing mathematical derivations and experimental validation.
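For readers unfamiliar with the implicit reward that the abstract builds on, the sketch below shows the standard DPO formulation, in which the reward of a response is a scaled log-ratio between the policy and a reference model. It does not reproduce the paper's unified SFT/DPO derivation; the function names, the default `beta`, and the example log-probabilities are illustrative assumptions.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Negative log-sigmoid of the implicit reward margin for one preference pair."""
    margin = (implicit_reward(logp_chosen, ref_chosen, beta)
              - implicit_reward(logp_rejected, ref_rejected, beta))
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Made-up sequence log-probabilities, for illustration only.
print(dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0, ref_chosen=-2.0, ref_rejected=-2.0))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy assigns relatively more probability to the chosen response.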
Article 579
Title@2025-07-04 (5): WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia
Title: WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia | WETBench: Ein Benchmark für die Erkennung aufgabenspezifischer maschinengenerierter Texte auf Wikipedia | WETBENCH: 维基百科中检测特定任务机器生成文本的基准 2507.03373v1 |
Authors (4): Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl
Given Wikipedia’s role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors’ perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.
Article 580
Title@2025-07-04 (5): SS-GEN: A Social Story Generation Framework with Large Language Models
Title: SS-GEN: A Social Story Generation Framework with Large Language Models | SS-GEN: Ein Social Story Generation Framework mit großen Sprachmodellen | SS-GEN:具有大语言模式的社会历史传承框架 2406.15695v3 |
Authors (8): Yi Feng, Mingyang Song, Jiaqi Wang, Zhuang Chen, Guanqun Bi, Minlie Huang, Liping Jing, Jian Yu
Children with Autism Spectrum Disorder (ASD) often misunderstand social situations and struggle to participate in daily routines. Social Stories are traditionally crafted by psychology experts under strict constraints to address these challenges but are costly and limited in diversity. As Large Language Models (LLMs) advance, there’s an opportunity to develop more automated, affordable, and accessible methods to generate Social Stories in real-time with broad coverage. However, adapting LLMs to meet the unique and strict constraints of Social Stories is a challenging issue. To this end, we propose SS-GEN, a Social Story GENeration framework with LLMs. Firstly, we develop a constraint-driven sophisticated strategy named StarSow to hierarchically prompt LLMs to generate Social Stories at scale, followed by rigorous human filtering to build a high-quality dataset. Additionally, we introduce quality assessment criteria to evaluate the effectiveness of these generated stories. Considering that powerful closed-source large models require very complex instructions and expensive API fees, we finally fine-tune smaller language models with our curated high-quality dataset, achieving comparable results at lower costs and with simpler instruction and deployment. This work marks a significant step in leveraging AI to personalize Social Stories cost-effectively for autistic children at scale, which we hope can encourage future research on special groups.
Article 581
Title@2025-07-04 (5): Backtesting Sentiment Signals for Trading: Evaluating the Viability of Alpha Generation from Sentiment Analysis
Title: Backtesting Sentiment Signals for Trading: Evaluating the Viability of Alpha Generation from Sentiment Analysis | Backtesting Sentiment Signals for Trading: Bewertung der Lebensfähigkeit der Alpha-Generation aus der Sentiment-Analyse | 贸易的后测试感应信号:从感应分析中评估阿尔法一代的可行性 2507.03350v1 |
Authors (7): Elvys Linhares Pontes, Carlos-Emiliano González-Gallardo, Georgeta Bordea, José G. Moreno, Mohamed Ben Jannet, Yuxuan Zhao, Antoine Doucet
Sentiment analysis, widely used in product reviews, also impacts financial markets by influencing asset prices through microblogs and news articles. Despite research in sentiment-driven finance, many studies focus on sentence-level classification, overlooking its practical application in trading. This study bridges that gap by evaluating sentiment-based trading strategies for generating positive alpha. We conduct a backtesting analysis using sentiment predictions from three models (two classification and one regression) applied to news articles on Dow Jones 30 stocks, comparing them to the benchmark Buy&Hold strategy. Results show all models produced positive returns, with the regression model achieving the highest return of 50.63% over 28 months, outperforming the benchmark Buy&Hold strategy. This highlights the potential of sentiment in enhancing investment strategies and financial decision-making.
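The comparison described above can be sketched as a toy backtest. This is only an assumed, simplified setup (a long/short position driven by the previous day's sentiment score, no transaction costs, made-up numbers); the paper's actual trading rules and data are not given in the abstract.

```python
def cumulative_return(daily_returns):
    """Compound a sequence of simple daily returns into one total return."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0

def sentiment_strategy_returns(sentiment_scores, daily_returns, threshold=0.0):
    """Trade day t using the sentiment observed on day t-1 (no look-ahead)."""
    strategy = []
    for prev_score, r in zip(sentiment_scores[:-1], daily_returns[1:]):
        if prev_score > threshold:
            position = 1      # long
        elif prev_score < -threshold:
            position = -1     # short
        else:
            position = 0      # stay out of the market
        strategy.append(position * r)
    return strategy

# Hypothetical scores and returns, for illustration only.
scores = [0.4, -0.2, 0.1, 0.0, -0.5]
returns = [0.00, 0.01, -0.02, 0.015, 0.005]

strategy_total = cumulative_return(sentiment_strategy_returns(scores, returns))
buy_and_hold_total = cumulative_return(returns[1:])
print("strategy:", strategy_total, "buy & hold:", buy_and_hold_total)
```

The difference between the two totals is a raw excess return; a proper alpha estimate would also adjust for risk and costs.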
Article 582
Title@2025-07-04 (5): Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
Title: Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky | Disambiguation-Centric Finetuning macht Enterprise Tool-Calling LLMs realistischer und weniger riskant | 差异性-中心微调使企业用工具召集LLM 更现实、风险更小 2507.03336v1 |
Authors (3): Ashutosh Hathidara, Julien Yu, Sebastian Schreiber
Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B to 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.
Article 583
Title@2025-07-04 (5): Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
Title: Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought | Hunyuan-TurboS: Große Sprachmodelle durch Mamba-Transformer-Synergie und adaptive Ketten-of-Thought | Hunyuan-TurboS:通过Mamba-Transed协同和适应性研究链推进大语言模型 2505.15431v3 |
Authors (255): Tencent Hunyuan Team, Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, Dian Jiao, Dong Du, Dong Wang, Feng Zhang, Fengzong Lian, Guanghui Xu, Guanwei Zhang, Hai Wang, Haipeng Luo, Han Hu, Huilin Xu, Jiajia Wu, Jianchen Zhu, Jianfeng Yan, Jiaqi Zhu, Jihong Zhang, Jinbao Xue, Jun Xia, Junqiang Zheng, Kai Liu, Kai Zhang, Kai Zheng, Kejiao Li, Keyao Wang, Lan Jiang, Lixin Liu, Lulu Wu, Mengyuan Huang, Peijie Yu, Peiqi Wang, Qian Wang, Qianbiao Xiang, Qibin Liu, Qingfeng Sun, Richard Guo, Ruobing Xie, Saiyong Yang, Shaohua Chen, Shihui Hu, Shuai Li, Shuaipeng Li, Shuang Chen, Suncong Zheng, Tao Yang, Tian Zhang, Tinghao Yu, Weidong Han, Weijie Liu, Weijin Zhou, Weikang Wang, Wesleye Chen, Xiao Feng, Xiaoqin Ren, Xingwu Sun, Xiong Kuang, Xuemeng Huang, Xun Cao, Yanfeng Chen, Yang Du, Zhen Yang, Yangyu Tao, Yaping Deng, Yi Shen, Yigeng Hong, Yiqi Chen, Yiqing Huang, Yuchi Deng, Yue Mao, Yulong Wang, Yuyuan Zeng, Zenan Xu, Zhanhui Kang, Zhe Zhao, ZhenXiang Yan, Zheng Fang, Zhichao Hu, Zhongzhi Chen, Zhuoyu Li, Zongwei Li, Alex Yan, Ande Liang, Baitong Liu, Beiping Pan, Bin Xing, Binghong Wu, Bingxin Qu, Bolin Ni, Boyu Wu, Chen Li, Cheng Jiang, Cheng Zhang, Chengjun Liu, Chengxu Yang, Chengzhong Xu, Chiyu Wang, Chong Zha, Daisy Yi, Di Wang, Fanyang Lu, Fei Chen, Feifei Liu, Feng Zheng, Guanghua Yu, Guiyang Li, Guohua Wang, Haisheng Lin, Han Liu, Han Wang, Hao Fei, Hao Lu, Haoqing Jiang, Haoran Sun, Haotian Zhu, Huangjin Dai, Huankui Chen, Huawen Feng, Huihui Cai, Huxin Peng, Jackson Lv, Jiacheng Shi, Jiahao Bu, Jianbo Li, Jianglu Hu, Jiangtao Guan, Jianing Xu, Jianwei Cai, Jiarong Zhang, Jiawei Song, Jie Jiang, Jie Liu, Jieneng Yang, Jihong Zhang, Jin lv, Jing Zhao, Jinjian Li, Jinxing Liu, Jun Zhao, Juntao Guo, Kai Wang, Kan Wu, Lei Fu, Lei He, Lei Wang, Li Liu, Liang Dong, Liya Zhan, Long Cheng, Long Xu, Mao Zheng, Meng Liu, Mengkang Hu, Nanli Chen, Peirui Chen, Peng He, Pengju Pan, Pengzhi Wei, 
Qi Yang, Qi Yi, Roberts Wang, Rongpeng Chen, Rui Sun, Rui Yang, Ruibin Chen, Ruixu Zhou, Shaofeng Zhang, Sheng Zhang, Shihao Xu, Shuaishuai Chang, Shulin Liu, SiQi Wang, Songjia Feng, Songling Yuan, Tao Zhang, Tianjiao Lang, Tongkai Li, Wei Deng, Wei Li, Weichao Wang, Weigang Zhang, Weixuan Sun, Wen Ouyang, Wenxiang Jiao, Wenzhi Sun, Wenzhuo Jia, Xiang Zhang, Xiangyu He, Xianshun Ren, XiaoYing Zhu, Xiaolong Guo, Xiaoxue Li, Xiaoyu Ma, Xican Lu, Xinhua Feng, Xinting Huang, Xinyu Guan, Xirui Li, Xu Zhang, Xudong Gao, Xun Luo, Xuxiang Qi, Yangkun Chen, Yangyu Tao, Yanling Xiao, Yantao Mai, Yanze Chen, Yao Ding, Yeting Yang, YiFan Song, Yifan Yang, Yijiao Zhu, Yinhe Wu, Yixian Liu, Yong Yang, Yuanjun Cai, Yuanlin Tu, Yue Zhang, Yufei Huang, Yuhang Zhou, Yuhao Jiang, Yuhong Liu, Yuhui Hu, Yujin Lin, Yun Yang, Yunhao Wang, Yusong Zhang, Zekun Wu, Zelong Zhang, Zhan Yu, Zhaoliang Yang, Zhe Zhao, Zheng Li, Zhenyu Huang, Zhiguang Liu, Zhijiang Xu, Zhiqing Kui, Zhiyin Zeng, Zhiyuan Xiong, Zhuo Han, Zifan Wu, Zigang Geng, Zilong Zhao, Ziyan Tang, Ziyuan Zhu, Zonglei Zhu, Zhijiang Xu
As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba’s long-sequence processing efficiency with Transformer’s superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep “thinking” modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.
Article 584
Title@2025-07-04 (5): MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection
Title: MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection | MT2-CSD: Eine neue Datensatz- und Multi-Semantic Knowledge Fusion Methode zur konversatorischen Stance-Erkennung | MT2-CSD: 用于语音稳定探测的新数据集和多语层知识融合方法 2506.21053v2 |
Authors (7): Fuqiang Niu, Genan Dai, Yisha Lu, Jiayu Liao, Xiang Li, Hu Huang, Bowen Zhang
In the realm of contemporary social media, automatic stance detection is pivotal for opinion mining, as it synthesizes and examines user perspectives on contentious topics to uncover prevailing trends and sentiments. Traditional stance detection research often targets individual instances, thereby limiting its capacity to model multi-party discussions typical in real social media scenarios. This shortcoming largely stems from the scarcity of datasets that authentically capture the dynamics of social media interactions, hindering advancements in conversational stance detection. In this paper, we introduce MT2-CSD, a comprehensive dataset for multi-target, multi-turn conversational stance detection. To the best of our knowledge, MT2-CSD is the largest dataset available for this purpose, comprising 24,457 annotated instances and exhibiting the greatest conversational depth, thereby presenting new challenges for stance detection. To address these challenges, we propose the Large Language model enhanced Conversational Relational Attention Network (LLM-CRAN), which exploits the reasoning capabilities of LLMs to improve conversational understanding. We conduct extensive experiments to evaluate the efficacy of LLM-CRAN on the MT2-CSD dataset. The experimental results indicate that LLM-CRAN significantly outperforms strong baseline models in the task of conversational stance detection.
Article 585
Title@2025-07-04 (5): Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs
Title: Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs | Lest ruhig, denkt laut: Entkoppelung von Verständnis und Vernunft in LLMs | 静静地阅读,想阿卢德:解开迷惑和理性 2507.03327v1 |
Authors (2): Yuanxin Wang, Ganesh Venkatesh
Large Language Models (LLMs) have demonstrated remarkable proficiency in understanding text and generating high-quality responses. However, a critical distinction from human cognition is their typical lack of a distinct internal 'reading' or deliberation phase before 'speaking' (i.e., generating text). Humans often engage in silent reading to comprehend context and formulate thoughts prior to articulation. This paper investigates methods to imbue LLMs with a similar capacity for internal processing. We introduce and evaluate techniques that encourage LLMs to 'read silently.' Our findings indicate that even a straightforward approach, such as providing the model with an initial contextual prompt or 'reading space' before it begins predicting subsequent tokens for the final output, can yield significant performance improvements. We further enhance this concept by developing a 'reading buddy' architecture, where an auxiliary component silently processes the input and provides refined contextual insights to the primary generation model. These approaches aim to foster deeper understanding from LLMs so that they can produce better reasoned responses, moving them one step closer to more human-like text processing. Our results indicate that these simple techniques can have a surprisingly strong impact, boosting accuracy by multiple points.
Article 586
Title@2025-07-04 (5): OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Title: OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework | OpenRLHF: Ein einfach zu bedienendes, skalierbares und leistungsstarkes RLHF-Framework | OpenRLHF: 易于使用、可缩放和高绩效RLHF框架 2405.11143v5 |
Authors (15): Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, Yiming Liu
Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values and further raise the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (long-CoT) tasks. However, existing RLHF (or RLVR) frameworks commonly face challenges such as inference bottlenecks and complexity barriers, restricting their accessibility for newcomers. To bridge this gap, we introduce OpenRLHF, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency with speedups ranging from 1.22x to 1.68x across different model sizes compared to state-of-the-art frameworks, while requiring significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.
Article 587
Title@2025-07-04 (5): GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation
Title: GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation | GRAFT: Ein Graph-basiertes Flow-aware Agentic Framework für die maschinelle Übersetzung auf Dokumentebene | GRAFT: 以图表为基础的文件级机器翻译流动感动剂框架 2507.03311v1 |
Authors (5): Himanshu Dutta, Sunny Manchanda, Prakhar Bapat, Meva Ram Gurjar, Pushpak Bhattacharyya
Document-level Machine Translation (DocMT) approaches often struggle to capture discourse-level phenomena effectively. Existing approaches rely on heuristic rules to segment documents into discourse units, which rarely align with the true discourse structure required for accurate translation. Otherwise, they fail to maintain consistency throughout the document during translation. To address these challenges, we propose Graph Augmented Agentic Framework for Document Level Translation (GRAFT), a novel graph-based DocMT system that leverages Large Language Model (LLM) agents for document translation. Our approach integrates segmentation, directed acyclic graph (DAG) based dependency modelling, and discourse-aware translation into a cohesive framework. Experiments conducted across eight translation directions and six diverse domains demonstrate that GRAFT achieves significant performance gains over state-of-the-art DocMT systems. Specifically, GRAFT delivers an average improvement of 2.8 d-BLEU on the TED test sets from IWSLT2017 over strong baselines and 2.3 d-BLEU for domain-specific translation from English to Chinese. Moreover, our analyses highlight the consistent ability of GRAFT to address discourse-level phenomena, yielding coherent and contextually accurate translations.
Article 588
Title@2025-07-04 (5): LTLCrit: A Temporal Logic-based LLM Critic for Safe and Efficient Embodied Agents
Title: LTLCrit: A Temporal Logic-based LLM Critic for Safe and Efficient Embodied Agents | LTLCrit: Ein auf Temporal Logic basierender LLM-Kritiker für sichere und effiziente einkörperige Wirkstoffe | LLLCritri: 一种基于时时逻辑的LLM LLM 用于安全和高效装药剂的Criticr 2507.03293v1 |
Authors (3): Anand Gokhale, Vaibhav Srivastava, Francesco Bullo
Large language models (LLMs) have demonstrated promise in reasoning tasks and general decision-making in static environments. In long-term planning tasks, however, errors tend to accumulate, often leading to unsafe or inefficient behavior, limiting their use in general-purpose settings. We propose a modular actor-critic architecture in which an LLM actor is guided by LTLCrit, a trajectory-level LLM critic that communicates via linear temporal logic (LTL). Our setup combines the reasoning strengths of language models with the guarantees of formal logic. The actor selects high-level actions from natural language observations, while the critic analyzes full trajectories and proposes new LTL constraints that shield the actor from future unsafe or inefficient behavior. The architecture supports both fixed, hand-specified safety constraints and adaptive, learned soft constraints that promote long-term efficiency. Our architecture is model-agnostic: any LLM-based planner can serve as the actor, and LTLCrit serves as a logic-generating wrapper. We formalize planning as graph traversal under symbolic constraints, allowing LTLCrit to analyze failed or suboptimal trajectories and generate new temporal logic rules that improve future behavior. We evaluate our system on the Minecraft diamond-mining benchmark, achieving 100% completion rates and improving efficiency compared to baseline LLM planners. Our results suggest that enabling LLMs to supervise each other through logic is a powerful and flexible paradigm for safe, generalizable decision making.
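The shielding idea in the abstract can be illustrated with a deliberately restricted toy: constraints of the form G(p -> not a), read as "whenever proposition p holds, never take action a". The representation below (pairs of strings, observations as sets of propositions) and all names are assumptions for illustration; the paper's LTL encoding and critic are not specified in the abstract.

```python
def violates(constraint, observation, action):
    """True if taking `action` under `observation` breaks G(condition -> not forbidden)."""
    condition, forbidden_action = constraint
    return condition in observation and action == forbidden_action

def shield(constraints, observation, candidate_actions):
    """Keep only the candidate actions that satisfy every constraint."""
    return [
        action for action in candidate_actions
        if not any(violates(c, observation, action) for c in constraints)
    ]

# Hypothetical constraint a critic might propose after a failed trajectory.
constraints = [("near_lava", "move_forward")]
observation = {"near_lava", "has_pickaxe"}
print(shield(constraints, observation, ["move_forward", "turn_left", "mine"]))
```

In this toy setup the actor would choose only among the actions the shield returns, and a critic would extend `constraints` over time.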
Article 589
Title@2025-07-04 (5): NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025
Title: NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 | NTU Speechlab LLM-basiertes Mehrsprachiges ASR-System für MLC-SLM Challenge 2025 | NTU Spearelab LLM-为2025年刚果解放运动-解运间对话挑战使用多种语言的ASR系统 2506.13339v2 |
Authors (8): Yizhou Peng, Bin Wang, Yi-Wen Chao, Ziyang Ma, Haoyang Zhang, Hexin Liu, Xie Chen, Eng Siong Chng
This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, language-specific prompts and model averaging techniques were instrumental in boosting system performance across diverse languages. Compared to the initial baseline system, our final model reduced the average Mix Error Rate from 20.2% to 10.6%, representing an absolute improvement of 9.6% (a relative improvement of 48%) on the evaluation set. Our results demonstrate the effectiveness of our approach and offer practical insights for future Speech Large Language Models.
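The two improvement figures follow directly from the reported error rates; a short check of the arithmetic (variable names are illustrative):

```python
baseline_error = 20.2   # average Mix Error Rate of the baseline, in percent
final_error = 10.6      # average Mix Error Rate of the final model, in percent

absolute_improvement = baseline_error - final_error            # percentage points
relative_improvement = absolute_improvement / baseline_error   # fraction of baseline

print(f"absolute: {absolute_improvement:.1f} points")
print(f"relative: {relative_improvement:.1%}")
```

The relative figure is about 47.5%, which the abstract rounds to 48%.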
Article 590
Title@2025-07-04 (5): Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR
Title: Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR | Bi-direktionale Kontext-verbesserte Sprache Große Sprachmodelle für mehrsprachige Konversations-ASR | 多语言对话的ASR双向双向背景强化语言大语言模型 2506.13396v2 |
Authors (3): Yizhou Peng, Hexin Liu, Eng Siong Chng
This paper introduces the integration of language-specific bi-directional context into a speech large language model (SLLM) to improve multilingual continuous conversational automatic speech recognition (ASR). We propose a character-level contextual masking strategy during training, which randomly removes portions of the context to enhance robustness and better emulate the flawed transcriptions that may occur during inference. For decoding, a two-stage pipeline is utilized: initial isolated segment decoding followed by context-aware re-decoding using neighboring hypotheses. Evaluated on the 1500-hour Multilingual Conversational Speech and Language Model (MLC-SLM) corpus covering eleven languages, our method achieves an 18% relative improvement compared to a strong baseline, outperforming even the model trained on 6000 hours of data for the MLC-SLM competition. These results underscore the significant benefit of incorporating contextual information in multilingual continuous conversational ASR.
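One plausible reading of character-level contextual masking is to drop each character of the neighboring context independently with some probability during training. The sketch below assumes exactly that; the function name, the drop probability, and the per-character independence are hypothetical, since the abstract does not give the details.

```python
import random

def mask_context(context, drop_prob=0.15, rng=None):
    """Randomly remove characters from a context string (assumed masking scheme)."""
    rng = rng or random.Random()
    # rng.random() is in [0, 1), so drop_prob=0 keeps everything and drop_prob=1 removes everything.
    return "".join(ch for ch in context if rng.random() >= drop_prob)

# Example: corrupt a neighboring hypothesis before feeding it as context.
print(mask_context("the previous segment transcript", drop_prob=0.2, rng=random.Random(0)))
```

Training on corrupted context of this kind is meant to make the model less sensitive to recognition errors in the neighboring hypotheses used at decoding time.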
Article 591
Title@2025-07-04 (5): REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
Title: REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models | REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle | REINFORCE++: 高效的RLHF对快速模型和奖励模型具有强力的测算法 2501.03262v5 |
Authors (3): Jian Hu, Jason Klein Liu, Wei Shen
Reinforcement Learning from Human Feedback (RLHF) is crucial in aligning large language models (LLMs) with human values and preferences. While state-of-the-art applications like ChatGPT/GPT-4 commonly employ Proximal Policy Optimization (PPO), including a critic network introduces significant computational overhead. REINFORCE-based methods, such as REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO), address this limitation by eliminating the critic network. However, these approaches face challenges in accurate advantage estimation. Specifically, they estimate advantages independently for responses to each prompt, which can lead to overfitting on more straightforward prompts and vulnerability to reward hacking. To address these challenges, we introduce REINFORCE++, a novel approach that removes the critic model while using the normalized reward of a batch as the baseline. Our empirical evaluation demonstrates that REINFORCE++ exhibits robust performance across various reward models without requiring prompt set truncation. Furthermore, it achieves superior generalization in RLHF and long chain-of-thought (CoT) settings compared to REINFORCE-based methods. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
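The contrast between per-prompt and batch-level baselines can be sketched with plain reward normalization. This is a simplified illustration under assumptions, not the REINFORCE++ implementation: it ignores KL penalties, token-level credit assignment, and clipping, and the function names and rewards are made up.

```python
from statistics import mean, pstdev

def normalize(values, eps=1e-8):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def per_prompt_advantages(rewards_by_prompt):
    """Normalize rewards separately within each prompt's group of responses."""
    return [normalize(group) for group in rewards_by_prompt]

def batch_advantages(rewards_by_prompt):
    """Normalize all rewards in the batch together, then restore the grouping."""
    flat = normalize([r for group in rewards_by_prompt for r in group])
    grouped, start = [], 0
    for group in rewards_by_prompt:
        grouped.append(flat[start:start + len(group)])
        start += len(group)
    return grouped

# Hypothetical rewards: an easy prompt (all responses correct) and a harder one.
rewards = [[1.0, 1.0], [0.0, 1.0]]
print(per_prompt_advantages(rewards))
print(batch_advantages(rewards))
```

With per-prompt normalization the easy prompt yields zero advantage for every response, whereas the batch-level baseline still assigns its responses a positive advantage relative to the whole batch.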
nan
Article 592
Title@2025-07-04 (5): MoralBench: Moral Evaluation of LLMs
Title: MoralBench: Moral Evaluation of LLMs | MoralBench: Moralische Bewertung von LLMs | 道德:对LLMMs的道德评价 2406.04428v2 |
Authors (6): Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, Yongfeng Zhang
In the rapidly evolving field of artificial intelligence, large language models (LLMs) have emerged as powerful tools for a myriad of applications, from natural language processing to decision-making support systems. However, as these models become increasingly integrated into societal frameworks, the imperative to ensure they operate within ethical and moral boundaries has never been more critical. This paper introduces a novel benchmark designed to measure and compare the moral reasoning capabilities of LLMs. We present the first comprehensive dataset specifically curated to probe the moral dimensions of LLM outputs, addressing a wide range of ethical dilemmas and scenarios reflective of real-world complexities. The main contribution of this work lies in the development of benchmark datasets and metrics for assessing the moral identity of LLMs, which accounts for nuance, contextual sensitivity, and alignment with human ethical standards. Our methodology involves a multi-faceted approach, combining quantitative analysis with qualitative insights from ethics scholars to ensure a thorough evaluation of model performance. By applying our benchmark across several leading LLMs, we uncover significant variations in moral reasoning capabilities of different models. These findings highlight the importance of considering moral reasoning in the development and evaluation of LLMs, as well as the need for ongoing research to address the biases and limitations uncovered in our study. We publicly release the benchmark at https://drive.google.com/drive/u/0/folders/1k93YZJserYc2CkqP8d4B3M3sgd3kA8W7 and also open-source the code of the project at https://github.com/agiresearch/MoralBench.
nan
Article 593
Title@2025-07-04 (5): GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning
Title: GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning | GDGB: Ein Benchmark für generatives dynamisches Text-Attributed Graph Learning | GDGB: 产生动态动态文本图表学习的基准 2507.03267v1 |
Authors (6): Jie Peng, Jiarui Ji, Runlin Lei, Zhewei Wei, Yongchao Liu, Chuntao Hong
Dynamic Text-Attributed Graphs (DyTAGs), which intricately integrate structural, temporal, and textual attributes, are crucial for modeling complex real-world systems. However, most of the existing DyTAG datasets exhibit poor textual quality, which severely limits their utility for DyTAG generation tasks requiring semantically rich inputs. Additionally, prior work mainly focuses on discriminative tasks on DyTAGs, resulting in a lack of standardized task formulations and evaluation protocols tailored for DyTAG generation. To address these critical issues, we propose Generative DyTAG Benchmark (GDGB), which comprises eight meticulously curated DyTAG datasets with high-quality textual features for both nodes and edges, overcoming limitations of prior datasets. Building on GDGB, we define two novel DyTAG generation tasks: Transductive Dynamic Graph Generation (TDGG) and Inductive Dynamic Graph Generation (IDGG). TDGG transductively generates a target DyTAG based on the given source and destination node sets, while the more challenging IDGG introduces new node generation to inductively model the dynamic expansion of real-world graph data. To enable holistic evaluation, we design multifaceted metrics that assess the structural, temporal, and textual quality of the generated DyTAGs. We further propose GAG-General, an LLM-based multi-agent generative framework tailored for reproducible and robust benchmarking of DyTAG generation. Experimental results demonstrate that GDGB enables rigorous evaluation of TDGG and IDGG, with key insights revealing the critical interplay of structural and textual features in DyTAG generation. These findings establish GDGB as a foundational resource for advancing generative DyTAG research and unlocking further practical applications in DyTAG generation. GDGB datasets, source code, and leaderboards are available at https://gdgb-algo.github.io/.
nan
Article 594
Title@2025-07-04 (5): KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation
Title: KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation | KinyaColBERT: Ein Lexisch geerdetes Retrieval-Modell für Low-Resource Retrieval-Augmented Generation | KinyaColcolBERT: 低资源回收-提款一代的以地基为基数的回收模型 2507.03241v1 |
Authors (2): Antoine Nzeyimana, Andre Niyongabo Rubungo
The recent mainstream adoption of large language model (LLM) technology is enabling novel applications in the form of chatbots and virtual assistants across many domains. With the aim of grounding LLMs in trusted domains and avoiding the problem of hallucinations, retrieval-augmented generation (RAG) has emerged as a viable solution. In order to deploy sustainable RAG systems in low-resource settings, achieving high retrieval accuracy is not only a usability requirement but also a cost-saving strategy. Through empirical evaluations on a Kinyarwanda-language dataset, we find that the most limiting factors in achieving high retrieval accuracy are limited language coverage and inadequate sub-word tokenization in pre-trained language models. We propose a new retriever model, KinyaColBERT, which integrates two key concepts: late word-level interactions between queries and documents, and a morphology-based tokenization coupled with two-tier transformer encoding. This methodology results in lexically grounded contextual embeddings that are both fine-grained and self-contained. Our evaluation results indicate that KinyaColBERT outperforms strong baselines and leading commercial text embedding APIs on a Kinyarwanda agricultural retrieval benchmark. By adopting this retrieval strategy, we believe that practitioners in other low-resource settings can not only achieve reliable RAG systems but also deploy solutions that are more cost-effective.
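The "late word-level interactions" mentioned above can be sketched as follows, assuming the ColBERT-style MaxSim scoring that the model name suggests; the abstract itself does not spell out the scoring function, and `late_interaction_score` is a hypothetical name. Each query vector is matched against its most similar document vector, and the per-query maxima are summed.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def late_interaction_score(query_vecs, doc_vecs):
    """Sum, over query vectors, of the best cosine match among document vectors."""
    q = [_normalize(v) for v in query_vecs]
    d = [_normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy 2-dimensional "word" embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]   # one good match for each query word
doc_b = [[1.0, 1.0]]               # a single vector that matches both only partly
score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
```

Because documents are encoded independently of the query, their vectors can be precomputed and indexed, which is one reason this style of scoring is attractive when retrieval cost matters.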
nan
Article 595
Title@2025-07-04 (5): Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models
Title: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models | Neural Diskrete Token Repräsentanz Lernen für extreme Token Reduction in Video Große Sprachmodelle | 在视频大语言模式中进行关于极端减少视频大语言模式的神经分辨代言人学习 2503.16980v4 |
Authors (2): Haichao Zhang, Yun Fu
Token-based video representation has emerged as a promising approach for enabling large language models (LLMs) to interpret video content. However, existing token reduction techniques, such as pruning and merging, often disrupt essential positional embeddings and rely on continuous visual tokens sampled from nearby pixels with similar spatial-temporal locations. By removing only a small fraction of tokens, these methods still produce relatively lengthy continuous sequences, which falls short of the extreme compression required to balance computational efficiency and token count in video LLMs. In this paper, we introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. We propose VQToken, a neural discrete token representation framework that (i) applies adaptive vector quantization to continuous ViT embeddings to learn a compact codebook and (ii) preserves spatial-temporal positions via a token hash function by assigning each grid-level token to its nearest codebook entry. On the Extreme Short Token Reduction task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark. It also achieves comparable performance on ActNet-QA, Long Video Bench, and VideoMME. We further introduce the Token Information Density (TokDense) metric and formalize fixed-length and adaptive-length subtasks, achieving state-of-the-art results in both settings. Our approach dramatically lowers theoretical complexity, increases information density, drastically reduces token counts, and enables efficient video LLMs in resource-constrained environments.
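The step of "assigning each grid-level token to its nearest codebook entry" can be sketched as below. This is a minimal illustration, not the VQToken implementation: the codebook is a fixed hypothetical input (the adaptive codebook learning is not shown), squared Euclidean distance is an assumed metric, and positions are modelled as `(frame, row, col)` tuples.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize_tokens(tokens, codebook):
    """Map each (frame, row, col) position to the index of its nearest code."""
    return {pos: nearest_code(vec, codebook) for pos, vec in tokens.items()}


# Hypothetical 2-entry codebook and three grid-level token embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0]]
tokens = {
    (0, 0, 0): [0.1, -0.1],
    (0, 0, 1): [0.9, 1.2],
    (1, 0, 0): [0.2, 0.1],
}
position_to_code = quantize_tokens(tokens, codebook)
```

The resulting position-to-index map keeps the spatial-temporal layout while the model only needs the small set of discrete codes.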
nan
Article 596
Title@2025-07-04 (5): A Lie-algebraic perspective on Tree-Adjoining Grammars
Title: A Lie-algebraic perspective on Tree-Adjoining Grammars | Eine lüge-algebraische Perspektive auf Baum-angrenzende Grammatiken | 有关树树边语法的字眼视角 2507.03234v1 |
Authors (3): Isabella Senturia, Elizabeth Xiao, Matilde Marcolli
We provide a novel mathematical implementation of tree-adjoining grammars using two combinatorial definitions of graphs. With this lens, we demonstrate that the adjoining operation defines a pre-Lie operation and subsequently forms a Lie algebra. We demonstrate the utility of this perspective by showing how one of our mathematical formulations of TAG captures properties of the TAG system without needing to posit them as additional components of the system, such as null-adjoining constraints and feature TAG.
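For readers unfamiliar with the terminology, the standard definitions behind "a pre-Lie operation" that "forms a Lie algebra" can be written out as follows. The symbol $\triangleleft$ and the choice of the right pre-Lie convention are notational assumptions here, not taken from the paper.

```latex
% Right pre-Lie identity for a bilinear operation \triangleleft on a vector space V:
% the associator is symmetric in its last two arguments.
(x \triangleleft y) \triangleleft z - x \triangleleft (y \triangleleft z)
  = (x \triangleleft z) \triangleleft y - x \triangleleft (z \triangleleft y),
\qquad x, y, z \in V.

% The commutator of a pre-Lie operation is a Lie bracket:
[x, y] = x \triangleleft y - y \triangleleft x,
\qquad
[x, [y, z]] + [y, [z, x]] + [z, [x, y]] = 0.
```

The Jacobi identity in the second display follows from the pre-Lie identity alone, which is why establishing the pre-Lie property of adjoining is enough to obtain a Lie algebra.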
nan
Article 597
Title@2025-07-04 (5): AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs
Title: AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs | AgentPS: Agentische Prozessüberwachung für Content Moderation mit multimodalen LLMs | PS: MMMM 内容调控的代理程序监督 2412.15251v2 |
Authors (6): Mingchao Liu, Yu Sun, Ruixiao Sun, Xin Dong, Xiang Shen, Hongyu Xiong
The advanced processing and reasoning capabilities of multimodal large language models (MLLMs) have driven substantial progress in vision-language (VL) understanding tasks. However, while effective for tasks governed by straightforward logic, MLLMs often struggle to reason over complex, detail-intensive logical structures. To address this limitation, we introduce AgentPS, a novel framework that integrates Agentic Process Supervision into MLLMs by sequentially reasoning over ancillary questions during fine-tuning. AgentPS achieves substantial improvements over baseline MLLMs on both public benchmarks and proprietary datasets. Notably, we show that using MLLM-generated ancillary labels in place of human annotations yields only minimal performance degradation, highlighting the method’s scalability. These results establish AgentPS as a scalable and effective solution for complex multimodal classification in large-scale industrial applications.
nan
Article 598
Title@2025-07-04 (5): 7B Fully Open Source Moxin-LLM/VLM – From Pretraining to GRPO-based Reinforcement Learning Enhancement
Title: 7B Fully Open Source Moxin-LLM/VLM – From Pretraining to GRPO-based Reinforcement Learning Enhancement | 7B Vollständig offene Quelle Moxin-LLM/VLM – Vom Vortraining zur GRPO-basierten Verstärkungs-Lernverbesserung | 7B 完全开放源代码 Moxin-LLM/VLM – – 从预培训到基于GROP的强化学习加强 2412.06845v6 |
Authors (19): Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Arash Akbari, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Weiyan Shi, Xingchen Xu, Yu Huang, Wei Jiang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang
Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA, have made great contributions to the ever-increasing popularity of LLMs due to the ease of customizing and deploying the models across diverse applications. Although open-source LLMs present unprecedented opportunities for innovation and research, the commercialization of LLMs has raised concerns about transparency, reproducibility, and safety. Many open-source LLMs fail to meet fundamental transparency requirements by withholding essential components like training code and data, which may hinder further innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a fully open-source LLM developed in adherence to the principles of open science, open source, open data, and open access. We release the pre-training code and configurations, training and fine-tuning datasets, and intermediate and final checkpoints, aiming to make continuous commitments to fully open-source LLMs. After pre-training the base model, we finetune the Moxin Base model with a SOTA post-training framework and instruction data to obtain the Moxin Instruct model. To improve the reasoning capability, we further finetune our Instruct model with chain-of-thought data distilled from DeepSeek R1, and then use Group Relative Policy Optimization (GRPO) following DeepSeek R1 to finetune our model, leading to the Moxin Reasoning model. Moreover, we develop our vision language model based on our Moxin model. Experiments show that our models achieve superior performance in various evaluations such as zero-shot evaluation, few-shot evaluation, and CoT evaluation.
nan
Article 599
Title@2025-07-03 (4): VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs
Title: VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs | VOCABTRIM: Vokabelabgleich für effizientes spekulatives Decodieren in LLMs | VOCABTRIM: 有效投机下限的词汇 2506.22694v2 |
Authors (12): Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee
In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) during the drafting process. Drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, followed by verification by a base LLM, the target model, which accepts a subset as its valid generation. As speculative decoding is usually considered to require a one-to-one mapping between the vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even to share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently contains an unnecessary inference overhead in drafting, especially for target LLMs with very large vocabularies. Then, we propose a simple technique, VocabTrim, to mitigate the drafting overhead and improve the generation speed in memory-bound environments. VocabTrim reconstructs the drafter LM head to contain only a limited set of tokens, selected as those most frequently sampled from the vocabulary of the target model. While limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces the drafting latency in memory-bound processes, which is often the case on edge devices, resulting in a higher memory-bound speed-up (MBSU). We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.
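The idea of rebuilding the drafter LM head over a reduced token set can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: the LM head is modelled as a list of weight rows (one per vocabulary token), token frequencies come from a hypothetical list of previously sampled target-model token ids, and the names `build_trimmed_head` and `draft_next_token` are invented for the example.

```python
from collections import Counter


def build_trimmed_head(lm_head_rows, sampled_token_ids, keep):
    """Keep only the LM-head rows of the `keep` most frequently sampled tokens."""
    counts = Counter(sampled_token_ids)
    kept_ids = sorted(tok for tok, _ in counts.most_common(keep))
    trimmed_rows = [lm_head_rows[tok] for tok in kept_ids]
    return kept_ids, trimmed_rows


def draft_next_token(hidden, kept_ids, trimmed_rows):
    """Greedy draft step: argmax over the trimmed logits, mapped to a full-vocab id."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in trimmed_rows]
    best = max(range(len(logits)), key=logits.__getitem__)
    return kept_ids[best]


# Hypothetical 5-token vocabulary with 2-dimensional hidden states.
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
sampled = [2, 2, 2, 0, 0, 4, 1, 1, 1]          # token ids seen in target-model samples
kept_ids, trimmed = build_trimmed_head(lm_head, sampled, keep=3)
common_case = draft_next_token([1.0, 2.0], kept_ids, trimmed)
rare_case = draft_next_token([-5.0, 0.0], kept_ids, trimmed)
```

In the second call the full head would have preferred token 3, which was trimmed away, so the drafter proposes a different token; this is the acceptance-rate cost that the abstract trades against a smaller matrix multiplication per draft step.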
nan
Article 600
Title@2025-07-03 (4): Towards Universal Semantics With Large Language Models
Title: Towards Universal Semantics With Large Language Models | Hin zu universeller Semantik mit großen Sprachmodellen | 走向具有大语言模式的普遍语义 2505.11764v3 |
Authors (5): Raymond Baartmans, Matthew Raffel, Rahul Vikram, Aiden Deringer, Lizhong Chen
The Natural Semantic Metalanguage (NSM) is a linguistic theory based on a universal set of semantic primes: simple, primitive word-meanings that have been shown to exist in most, if not all, languages of the world. According to this framework, any word, regardless of complexity, can be paraphrased using these primes, revealing a clear and universally translatable meaning. These paraphrases, known as explications, can offer valuable applications for many natural language processing (NLP) tasks, but producing them has traditionally been a slow, manual process. In this work, we present the first study of using large language models (LLMs) to generate NSM explications. We introduce automatic evaluation methods, a tailored dataset for training and evaluation, and fine-tuned models for this task. Our 1B and 8B models outperform GPT-4o in producing accurate, cross-translatable explications, marking a significant step toward universal semantic representation with LLMs and opening up new possibilities for applications in semantic analysis, translation, and beyond. Our code is available at https://github.com/OSU-STARLAB/DeepNSM.
nan
Article 601
Title@2025-07-03 (4): How Much Content Do LLMs Generate That Induces Cognitive Bias in Users?
Title: How Much Content Do LLMs Generate That Induces Cognitive Bias in Users? | Wie viel Inhalt erzeugen LLMs, die Kognitive Bias bei Benutzern erzeugen? | LLMs如何产生使用者中诱导的认知性偏见的内容? 2507.03194v1 |
Authors (6): Abeer Alessa, Akshaya Lakshminarasimhan, Param Somane, Julian Skirzynski, Julian McAuley, Jessica Echterhoff
Large language models (LLMs) are increasingly integrated into applications ranging from review summarization to medical diagnosis support, where they affect human decisions. Even though LLMs perform well in many tasks, they may also inherit societal or cognitive biases, which can inadvertently transfer to humans. We investigate when and how LLMs expose users to biased content and quantify its severity. Specifically, we assess three LLM families in summarization and news fact-checking tasks, evaluating how much LLMs stay consistent with their context and/or hallucinate. Our findings show that LLMs expose users to content that changes the sentiment of the context in 21.86% of the cases, hallucinate on post-knowledge-cutoff data questions in 57.33% of the cases, and exhibit primacy bias in 5.94% of the cases. We evaluate 18 distinct mitigation methods across three LLM families and find that targeted interventions can be effective. Given the prevalent use of LLMs in high-stakes domains, such as healthcare or legal analysis, our results highlight the need for robust technical safeguards and for developing user-centered interventions that address LLM limitations.
nan
Article 602
Title@2025-07-03 (4): Adversarial Manipulation of Reasoning Models using Internal Representations
Title: Adversarial Manipulation of Reasoning Models using Internal Representations | Adversariale Manipulation von Vernunftmodellen mittels interner Repräsentationen | 采用内部代表制对理性模型进行反向调整 2507.03167v1 |
Authors (3): Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi
Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply – termed the “caution” direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at https://github.com/ky295/reasoning-manipulation
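"Ablating this direction from model activations" is commonly implemented by removing the component of each activation vector along the direction. The sketch below assumes that projection-removal form, which the abstract does not state explicitly; `ablate_direction` is a hypothetical name and the vectors are toy values.

```python
import math


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    projection = sum(a * u for a, u in zip(activation, unit))
    return [a - projection * u for a, u in zip(activation, unit)]


# Toy example: the hypothetical direction points along the second axis.
activation = [3.0, 4.0]
direction = [0.0, 2.0]
ablated = ablate_direction(activation, direction)
```

After ablation the activation has zero dot product with the direction, so a linear readout along that direction can no longer distinguish inputs.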
nan
Article 603
Title@2025-07-03 (4): MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks
Title: MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks | MateInfoUB: Ein echter Benchmark für die Prüfung von LLMs in kompetitiven, mehrsprachigen und multimodalen Bildungsaufgaben | MateInfoUB:竞争、多种语文和多模式教育任务中测试LMLMS的现实世界基准 2507.03162v1 |
Authors (3): Dumitran Adrian Marius, Theodor-Pierre Moroianu, Buca Mihnea-Vicentiu
The rapid advancement of Large Language Models (LLMs) has transformed various domains, particularly computer science (CS) education. These models exhibit remarkable capabilities in code-related tasks and problem-solving, raising questions about their potential and limitations in advanced CS contexts. This study presents a novel bilingual (English-Romanian) multimodal (text and image) dataset of multiple-choice questions derived from a high-level computer science competition. A particularity of our dataset is that the problems are designed so that some are more easily solved by reasoning on paper, while for others writing code is more efficient. We systematically evaluate state-of-the-art LLMs on this dataset, analyzing their performance on theoretical programming tasks. Our findings reveal the strengths and limitations of current LLMs, including the influence of language choice (English vs. Romanian), providing insights into their applicability in CS education and competition settings. We also address critical ethical considerations surrounding educational integrity and the fairness of assessments in the context of LLM usage. These discussions aim to inform future educational practices and policies. To support further research, our dataset will be made publicly available in both English and Romanian. Additionally, we release an educational application tailored for Romanian students, enabling them to self-assess using the dataset in an interactive and practice-oriented environment.
nan
Article 604
Title@2025-07-03 (4): Expert-level validation of AI-generated medical text with scalable language models
Title: Expert-level validation of AI-generated medical text with scalable language models | Validierung von KI-generierten medizinischen Texten auf Expertenebene mit skalierbaren Sprachmodellen | 专家一级对AI产生的带有可缩放语言模型的可缩放语言模型的医学文本进行鉴定 2507.03152v1 |
Authors (27): Asad Aali, Vasiliki Bikia, Maya Varma, Nicole Chiou, Sophie Ostmeier, Arnav Singhvi, Magdalini Paschali, Ashwin Kumar, Andrew Johnston, Karimar Amador-Martinez, Eduardo Juan Perez Guerrero, Paola Naovi Cruz Rivera, Sergios Gatidis, Christian Bluethgen, Eduardo Pontes Reis, Eddy D. Zandee van Rilland, Poonam Laxmappa Hosamani, Kevin R Keet, Minjoung Go, Evelyn Ling, David B. Larson, Curtis Langlotz, Roxana Daneshjou, Jason Hom, Sanmi Koyejo, Emily Alsentzer, Akshay S. Chaudhari
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the “LM-as-judge” paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source 1) the codebase ( https://github.com/StanfordMIMI/MedVAL ), 2) MedVAL-Bench ( https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench ), and 3) MedVAL-4B ( https://huggingface.co/stanfordmimi/MedVAL-4B ), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
nan
Article 605
Title@2025-07-03 (4): Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models
Title: Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models | Elemente des Weltwissens (EWoK): Ein von Kognition inspirierter Rahmen zur Bewertung des Grundwissens der Welt in Sprachmodellen | 世界知识要素(EWoK): “ 评价基本世界语言模式知识的认知 – – 启发框架 “ 2405.09605v2 |
Authors (20): Anna A. Ivanova, Aalok Sathe, Benjamin Lipkin, Unnathi Kumar, Setayesh Radkani, Thomas H. Clark, Carina Kauf, Jennifer Hu, R. T. Pramod, Gabriel Grand, Vivian Paulun, Maria Ryskina, Ekin Akyürek, Ethan Wilcox, Nafisa Rashid, Leshem Choshen, Roger Levy, Evelina Fedorenko, Joshua Tenenbaum, Jacob Andreas
The ability to build and reason about models of the world is essential for situated language understanding. But evaluating world modeling capabilities in modern AI systems – especially those based on language models – has proven challenging, in large part because of the difficulty of disentangling conceptual knowledge about the world from knowledge of surface co-occurrence statistics. This paper presents Elements of World Knowledge (EWoK), a framework for evaluating language models’ understanding of the conceptual knowledge underlying world modeling. EWoK targets specific concepts from multiple knowledge domains known to be important for world modeling in humans, from social interactions (help, deceive) to spatial relations (left, right). Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWoK-core-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights large language models (1.3B–70B parameters) and compare them with human performance. All tested models perform worse than humans, with results varying drastically across domains. Performance on social interactions and social properties was highest and performance on physical relations and spatial relations was lowest. Overall, this dataset highlights simple cases where even large models struggle and presents rich avenues for targeted research on LLM world modeling capabilities.
nan
Article 606
Title@2025-07-03 (4): On the Relationship between Accent Strength and Articulatory Features
Title: On the Relationship between Accent Strength and Articulatory Features | Über die Beziehung zwischen Akcent Strength und Artikulatory Features | Accent 力量与交接特征之间的关系 2507.03149v1 |
Authors (6): Kevin Huang, Sean Foley, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan
This paper explores the relationship between accent strength and articulatory features inferred from acoustic speech. To quantify accent strength, we compare phonetic transcriptions with dictionary-based reference transcriptions, computing the phoneme-level difference as a measure of accent strength. The proposed framework leverages recent self-supervised learning articulatory inversion techniques to estimate articulatory features. Analyzing a corpus of read speech from American and British English speakers, this study examines correlations between derived articulatory parameters and accent strength proxies, associating systematic articulatory differences with indexed accent strength. Results indicate that tongue positioning patterns distinguish the two dialects, with notable inter-dialect differences in rhotic and low back vowels. These findings contribute to automated accent analysis and articulatory modeling for speech processing applications.
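One plausible way to turn a "phoneme-level difference" into a number is an edit distance between the observed and reference phoneme sequences, normalized by the reference length. The abstract does not name the exact metric, so the Levenshtein distance, the normalization, and the function names below are assumptions made for illustration; the ARPAbet-style symbols are toy inputs.

```python
def phoneme_edit_distance(reference, observed):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    previous = list(range(len(observed) + 1))
    for i, ref_ph in enumerate(reference, 1):
        current = [i]
        for j, obs_ph in enumerate(observed, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (ref_ph != obs_ph),  # substitution or match
            ))
        previous = current
    return previous[-1]


def accent_strength(reference, observed):
    """Edit distance normalized by reference length (0.0 means identical)."""
    return phoneme_edit_distance(reference, observed) / max(len(reference), 1)


# Dictionary pronunciation of "car" versus a non-rhotic realization.
reference = ["K", "AA", "R"]
observed = ["K", "AA"]
strength = accent_strength(reference, observed)
```

A per-utterance score of this kind could then be correlated with articulatory parameters, which is the type of analysis the abstract describes.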
nan
Article 607
Title@2025-07-03 (4): DeepGesture: A conversational gesture synthesis system based on emotions and semantics
Title: DeepGesture: A conversational gesture synthesis system based on emotions and semantics | DeepGesture: Ein dialogisches Gesten-Synthesesystem basierend auf Emotionen und Semantik | DeepGesture:基于情感和语义的谈话手势合成系统 2507.03147v1 |
Authors (1): Thanh Hoang-Minh
Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals: text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. A lightweight Transformer backbone combines full self-attention and cross-local attention for effective feature fusion of heterogeneous modalities. To visualize results, we implement a full rendering pipeline in Unity based on BVH output from the model. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness, outperforming baselines on Mean Opinion Score and Frechet Gesture Distance metrics. Our system supports interpolation between emotional states and demonstrates generalization to out-of-distribution speech, including synthetic voices, marking a step forward toward fully multimodal, emotionally aware digital humans.
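The "classifier-free diffusion" guidance mentioned above is usually realized by blending a conditional and an unconditional noise prediction at each denoising step. The sketch below shows that standard combination rule only; the guidance scale and the function name are hypothetical, the predictions are toy vectors rather than model outputs, and the abstract does not confirm that DeepGesture uses exactly this form.

```python
def guided_prediction(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy noise predictions without and with the emotion condition.
uncond = [0.0, 0.0]
cond = [1.0, 2.0]
no_guidance = guided_prediction(uncond, cond, 0.0)    # ignores the condition
plain_cond = guided_prediction(uncond, cond, 1.0)     # equals the conditional output
amplified = guided_prediction(uncond, cond, 2.0)      # pushes further toward the condition
```

Under this rule, varying the scale gives a single knob for how strongly the condition, here an emotion label, steers the generated motion.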
nan
Article 608
Title@2025-07-03 (4): From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models
Title: From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models | Von der Messung zur Milderung: Erforschung der Übertragbarkeit von Debiasing-Ansätzen auf Gender-Bias in maltesischen Sprachmodellen | 从衡量到减缓:探讨马耳他语言模式中性别偏见方法对性别偏见的可转让性 2507.03142v1 |
Authors (2): Melanie Galea, Claudia Borg
The advancement of Large Language Models (LLMs) has transformed Natural Language Processing (NLP), enabling performance across diverse tasks with little task-specific training. However, LLMs remain susceptible to social biases, particularly reflecting harmful stereotypes from training data, which can disproportionately affect marginalised communities. We measure gender bias in Maltese LMs, arguing that such bias is harmful as it reinforces societal stereotypes and fails to account for gender diversity, which is especially problematic in gendered, low-resource languages. While bias evaluation and mitigation efforts have progressed for English-centric models, research on low-resourced and morphologically rich languages remains limited. This research investigates the transferability of debiasing methods to Maltese language models, focusing on BERTu and mBERTu, BERT-based monolingual and multilingual models respectively. Bias measurement and mitigation techniques from English are adapted to Maltese, using benchmarks such as CrowS-Pairs and SEAT, alongside debiasing methods Counterfactual Data Augmentation, Dropout Regularization, Auto-Debias, and GuiDebias. We also contribute to future work in the study of gender bias in Maltese by creating evaluation datasets. Our findings highlight the challenges of applying existing bias mitigation methods to linguistically complex languages, underscoring the need for more inclusive approaches in the development of multilingual NLP.
Article 609
Title@2025-07-03 (4): ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models
Title: ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models | ZuverlässigMath: Benchmark der zuverlässigen mathematischen Vernunft bei großen Sprachmodellen | 可靠马斯:关于大语言模式的可靠数学依据基准 2507.03133v1 |
Authors (10): Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong
Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining their reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning tasks have remained unexplored due to the dearth of unsolvable math problems. To systematically investigate LLM reliability in mathematical reasoning tasks, we formulate the reliability evaluation for both solvable and unsolvable problems. We then develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems synthesized by our proposed construction workflow with human evaluations. Experiments are conducted on various LLMs with several key findings uncovered. LLMs fail to directly identify unsolvable problems and always generate fabricated responses. When LLMs are instructed to indicate unsolvability using a reliable prompt, the reliability of larger LLMs is maintained on solvable problems and notably improves on unsolvable problems, yet still falls short of that on solvable problems. However, small LLMs rarely show any progress despite employing reliable prompts. Therefore, we further propose an alignment strategy to enhance small LLMs’ reliability, which can significantly improve LLM reliability performance on both in-domain and out-of-domain tasks.
Article 610
Title@2025-07-03 (4): Towards a Psychoanalytic Perspective on VLM Behaviour: A First-step Interpretation with Intriguing Observations
Title: Towards a Psychoanalytic Perspective on VLM Behaviour: A First-step Interpretation with Intriguing Observations | Auf dem Weg zu einer psychoanalytischen Perspektive auf VLM-Verhalten: Eine erste Interpretation mit faszinierenden Beobachtungen | 对VLM行为进行心理分析:带有引人注意的首步解释 2507.03123v1 |
Authors (5): Xiangrui Liu, Man Luo, Agneet Chatterjee, Hua Wei, Yezhou Yang
Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter means the models tend to generate incorrect answers to align with user expectations. However, these explanations primarily focus on technical or externally driven factors and may have neglected the possibility that hallucination behaviours might mirror cognitive biases observed in human psychology. In this work, we introduce a psychological taxonomy, categorizing VLMs’ hallucination behaviours, including sycophancy, logical inconsistency, and a newly identified VLM behaviour: authority bias. To systematically analyze these behaviours, we design AIpsych, a scalable benchmark that reveals psychological tendencies in model response patterns. Leveraging this benchmark, we investigate how variations in model architecture and parameter size influence model behaviour when responding to strategically manipulated questions. Our experiments reveal that as model size increases, VLMs exhibit stronger sycophantic tendencies but reduced authority bias, suggesting increasing competence but a potential erosion of response integrity. A human subject study further validates our hypotheses and highlights key behavioural differences between VLMs and human respondents. This work suggests a new perspective for understanding hallucination in VLMs and highlights the importance of integrating psychological principles into model evaluation. The benchmark is available at https://github.com/lxrswdd/AIpsych.
Article 611
Title@2025-07-03 (4): Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings
Title: Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings | Federated Learning für die ICD-Klassifizierung mit Leichtbaumodellen und vorgeschulten Einbettungen | 具有轻重量型模型和未受过训练的嵌入型综合疾病分类的联邦学习 2507.03122v1 |
Authors (2): Binbin Xu, Gérard Dray
This study investigates the feasibility and performance of federated learning (FL) for multi-label ICD code classification using clinical notes from the MIMIC-IV dataset. Unlike previous approaches that rely on centralized training or fine-tuned large language models, we propose a lightweight and scalable pipeline combining frozen text embeddings with simple multilayer perceptron (MLP) classifiers. This design offers a privacy-preserving and deployment-efficient alternative for clinical NLP applications, particularly suited to distributed healthcare settings. Extensive experiments across both centralized and federated configurations were conducted, testing six publicly available embedding models from the Massive Text Embedding Benchmark leaderboard and three MLP classifier architectures under two medical coding systems (ICD-9 and ICD-10). Additionally, ablation studies over ten random stratified splits assess performance stability. Results show that embedding quality substantially outweighs classifier complexity in determining predictive performance, and that federated learning can closely match centralized results in idealized conditions. While the models are orders of magnitude smaller than state-of-the-art architectures and achieve competitive micro and macro F1 scores, limitations remain, including the lack of end-to-end training and the simplified FL assumptions. Nevertheless, this work demonstrates a viable way toward scalable, privacy-conscious medical coding systems and offers a step toward future research into federated, domain-adaptive clinical AI.
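The federated configuration described above can be illustrated with a weighted parameter average. This is a minimal sketch that assumes FedAvg-style aggregation; the abstract does not name the aggregation rule, and the function name and flat-list parameter layout are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client parameter vectors, weighted by local sample counts.

    Assumption: FedAvg-style aggregation over flat lists of floats.
    The abstract does not state which aggregation rule the authors use.
    """
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("client_sizes must sum to a positive number")
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same parameter shape")
        for i, w in enumerate(weights):
            # Each client contributes in proportion to its share of the data.
            averaged[i] += w * size / total
    return averaged
```

In this sketch only the small MLP classifier parameters would be exchanged, since the embedding model stays frozen on each site.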
Article 612
Title@2025-07-03 (4): RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
Title: RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents | RLVER: Verstärktes Lernen mit überprüfbaren emotionalen Belohnungen für einfühlsame Agenten | RLVER: 利用可核实情感奖赏加强学习,以培养致病剂 2507.03112v1 |
Authors (16): Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue, especially for emotional intelligence, remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM’s learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) Thinking and non-thinking models show distinct trends: thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) More challenging environments are not always better; moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.
Article 613
Title@2025-07-03 (4): LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds
Title: LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds | LIAR: Inference Time Alignment (Best-of-N) in Sekunden an Jailbreak LLMs ausleihen | LIAR: 利用诱推时间对准(最佳N级)二秒内侵入监狱的LLMs 2412.05232v3 |
Authors (6): James Beetham, Souradip Chakraborty, Mengdi Wang, Furong Huang, Amrit Singh Bedi, Mubarak Shah
Jailbreak attacks expose vulnerabilities in safety-aligned LLMs by eliciting harmful outputs through carefully crafted prompts. Existing methods rely on discrete optimization or trained adversarial generators, but are slow, compute-intensive, and often impractical. We argue that these inefficiencies stem from a mischaracterization of the problem. Instead, we frame jailbreaks as inference-time misalignment and introduce LIAR (Leveraging Inference-time misAlignment to jailbReak), a fast, black-box, best-of-$N$ sampling attack requiring no training. LIAR matches state-of-the-art success rates while reducing perplexity by $10\times$ and Time-to-Attack from hours to seconds. We also introduce a theoretical “safety net against jailbreaks” metric to quantify safety alignment strength and derive suboptimality bounds. Our work offers a simple yet effective tool for evaluating LLM robustness and advancing alignment research.
Article 614
Title@2025-07-03 (4): ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization
Title: ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization | ARF-RLHF: Adaptive Reward-Following für RLHF durch Emotionsgetriebene Selbst-Überwachung und trace-biased Dynamic Optimization | RAF-RLHF:通过情感驱动自视和跟踪比亚动态优化,对RLHF进行适应性奖励 2507.03069v1 |
Authors (1): YuXuan Zhang
With the rapid advancement of Reinforcement Learning from Human Feedback (RLHF) and autoregressive transformers, state-of-the-art models such as GPT-4.0, DeepSeek R1, and Llama 3.3 increasingly emphasize answer depth and personalization. However, most existing RLHF approaches (e.g., PPO, DPO) still rely on a binary-preference (BT) paradigm, which, while reducing annotation costs, still requires substantial human effort and captures only group-level tendencies rather than individual preferences. To overcome these limitations, we propose Adaptive Reward-Following (ARF), a self-assessment framework that leverages a high-precision emotion analyzer, achieving over 70% accuracy on GoEmotions, Sentiment140, and DailyDialog, to convert free-form user feedback into continuous preference scores. We further enrich and debias these signals through lightweight data augmentations, including synonym replacement, random trace truncation, and a score bias annotation algorithm. A Dynamic Adapter Preference Tracker continuously models evolving user tastes in real time, enabling our novel Trace Bias (TB) fine-tuning algorithm to optimize directly on these tracked rewards instead of coarse binary labels. Experiments on Qwen-2/2.5, Gemma-2, and Llama-3.2 across four preference domains demonstrate that ARF achieves an improvement of 3.3% over PPO and 7.6% over DPO. Moreover, TB preserves theoretical alignment with PPO and DPO objectives. Overall, ARF presents a scalable, personalized, and cost-effective approach to RLHF LLMs through autonomous reward modeling.
Article 615
Title@2025-07-03 (4): Requirements Elicitation Follow-Up Question Generation
Title: Requirements Elicitation Follow-Up Question Generation | Voraussetzungen Elicitation Follow-Up Question Generation | 问询后查询 2507.02858v1 |
Authors (3): Yuchen Shen, Anmol Singhal, Travis Breaux
Interviews are a widely used technique in eliciting requirements to gather stakeholder needs, preferences, and expectations for a software system. Effective interviewing requires skilled interviewers to formulate appropriate interview questions in real time while facing multiple challenges, including lack of familiarity with the domain, excessive cognitive load, and information overload that hinders how humans process stakeholders’ speech. Recently, large language models (LLMs) have exhibited state-of-the-art performance in multiple natural language processing tasks, including text summarization and entailment. To support interviewers, we investigate the application of GPT-4o to generate follow-up interview questions during requirements elicitation by building on a framework of common interviewer mistake types. In addition, we describe methods to generate questions based on interviewee speech. We report a controlled experiment to evaluate LLM-generated and human-authored questions with minimal guidance, and a second controlled experiment to evaluate the LLM-generated questions when generation is guided by interviewer mistake types. Our findings demonstrate that, for both experiments, the LLM-generated questions are no worse than the human-authored questions with respect to clarity, relevancy, and informativeness. In addition, LLM-generated questions outperform human-authored questions when guided by common mistake types. This highlights the potential of using LLMs to help interviewers improve the quality and ease of requirements elicitation interviews in real time.
Article 616
Title@2025-07-03 (4): Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Title: Answer Matching Outperforms Multiple Choice for Language Model Evaluation | Antwort Matching Outperforms Multiple Choice für Sprachmodell-Bewertung | 语言模式评价的多种选择 2507.02856v1 |
Authors (5): Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model’s free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice, but we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models, even small ones, achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
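The answer-matching protocol described above (question without options, free-form response, comparison against the reference answer) can be sketched as a small scoring loop. This is a minimal illustration under stated assumptions: the abstract uses a language model as the matcher, whereas `normalized_match` here is a hypothetical string-equality stand-in so the example runs without any model, and all function names are invented for illustration.

```python
from typing import Callable, Iterable, Tuple


def normalized_match(response: str, reference: str) -> bool:
    """Stand-in matcher: case- and whitespace-insensitive string equality.

    Assumption: replaces the language-model matcher described in the abstract.
    """
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    return norm(response) == norm(reference)


def answer_matching_accuracy(
    items: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
    matcher: Callable[[str, str], bool] = normalized_match,
) -> float:
    """Score free-form answers against reference answers.

    items    -- (question, reference_answer) pairs; no options are shown
    generate -- candidate model: question -> free-form response
    matcher  -- decides whether a response matches the reference
    """
    items = list(items)
    if not items:
        return 0.0
    correct = sum(1 for q, ref in items if matcher(generate(q), ref))
    return correct / len(items)
```

Passing a different `matcher` callable is where a model-based grader would plug in; the scoring loop itself does not change.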
Article 617
Title@2025-07-03 (4): MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs
Title: MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs | MOTIF: Modulares Denken durch Verstärkung Feinabstimmung in LLMs | MOTIF:通过强化微调在LLM中进行模块思考 2507.02851v1 |
Authors (2): Purbesh Mitra, Sennur Ulukus
Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing the group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite number of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with an arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose MOTIF: Modular Thinking via Reinforcement Finetuning, an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We trained the open-source model Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning and tested its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show 3.8% and 3.3% improvements over vanilla GRPO-based training on the respective benchmarks. Furthermore, this improvement was achieved with only 15% of samples, thus demonstrating the sample efficiency of MOTIF. Our code and models are available at https://github.com/purbeshmitra/MOTIF and https://huggingface.co/purbeshmitra/MOTIF, respectively.
Article 618
Title@2025-07-03 (4): Legal Requirements Translation from Law
Title: Legal Requirements Translation from Law | Rechtliche Voraussetzungen Übersetzung aus dem Recht | 法律要求译自法律 2507.02846v1 |
Authors (2): Anmol Singhal, Travis Breaux
Software systems must comply with legal regulations, which is a resource-intensive task, particularly for small organizations and startups lacking dedicated legal expertise. Extracting metadata from regulations to elicit legal requirements for software is a critical step to ensure compliance. However, it is a cumbersome task due to the length and complex nature of legal text. Although prior work has pursued automated methods for extracting structural and semantic metadata from legal text, key limitations remain: they do not consider the interplay and interrelationships among attributes associated with these metadata types, and they rely on manual labeling or heuristic-driven machine learning, which does not generalize well to new documents. In this paper, we introduce an approach based on textual entailment and in-context learning for automatically generating a canonical representation of legal text, encodable and executable as Python code. Our representation is instantiated from a manually designed Python class structure that serves as a domain-specific metamodel, capturing both structural and semantic legal metadata and their interrelationships. This design choice reduces the need for large, manually labeled datasets and enhances applicability to unseen legislation. We evaluate our approach on 13 U.S. state data breach notification laws, demonstrating that our generated representations pass approximately 89.4% of test cases and achieve a precision and recall of 82.2 and 88.7, respectively.
Article 619
Title@2025-07-03 (4): Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
Title: Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection | Visual Contextual Attack: Jailbreaking MLLMs mit Image-Driven Context Injection | 视觉上下文攻击:带有图像驱动背景注射的破狱MLLMs MLLMs 2507.02844v1 |
Authors (4): Ziqi Miao, Yi Ding, Lijun Li, Jing Shao
With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. The code is available at https://github.com/Dtc7w3PQ/Visco-Attack.
Article 620
Title@2025-07-03 (4): Improved Unbiased Watermark for Large Language Models
Title: Improved Unbiased Watermark for Large Language Models | Verbessertes unvoreingenommenes Wasserzeichen für große Sprachmodelle | 改进大语言模型的无偏见水印 2502.11268v2 |
Authors (4): Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang
As artificial intelligence surpasses human capabilities in text generation, the necessity to authenticate the origins of AI-generated content has become paramount. Unbiased watermarks offer a powerful solution by embedding statistical signals into language model-generated text without distorting the quality. In this paper, we introduce MCmark, a family of unbiased, Multi-Channel-based watermarks. MCmark works by partitioning the model’s vocabulary into segments and promoting token probabilities within a selected segment based on a watermark key. We demonstrate that MCmark not only preserves the original distribution of the language model but also offers significant improvements in detectability and robustness over existing unbiased watermarks. Our experiments with widely-used language models demonstrate an improvement in detectability of over 10% using MCmark, compared to existing state-of-the-art unbiased watermarks. This advancement underscores MCmark’s potential in enhancing the practical application of watermarking in AI-generated texts.
Article 621
Title@2025-07-03 (4): StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason
Title: StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason | StepHint: Mehrstufige stufenweise Hinweise stärken das Lernen zur Vernunft | 步进提示:多级分步骤将强化学习提升到合理 2507.02841v1 |
Authors (7): Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, Rui Yan
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving the complex reasoning abilities of large language models (LLMs). However, current RLVR methods face two significant challenges: the near-miss reward problem, where a small mistake can invalidate an otherwise correct reasoning process, greatly hindering training efficiency; and exploration stagnation, where models tend to focus on solutions within their “comfort zone,” lacking the motivation to explore potentially more effective alternatives. To address these challenges, we propose StepHint, a novel RLVR algorithm that utilizes multi-level stepwise hints to help models explore the solution space more effectively. StepHint generates valid reasoning chains from stronger models and partitions these chains into reasoning steps using our proposed adaptive partitioning method. The initial few steps are used as hints, and simultaneously, multiple-level hints (each comprising a different number of steps) are provided to the model. This approach directs the model's exploration toward a promising solution subspace while preserving its flexibility for independent exploration. By providing hints, StepHint mitigates the near-miss reward problem, thereby improving training efficiency. Additionally, the external reasoning pathways help the model develop better reasoning abilities, enabling it to move beyond its “comfort zone” and mitigate exploration stagnation. StepHint outperforms competitive RLVR enhancement methods across six mathematical benchmarks, while also demonstrating superior generalization and excelling over baselines on out-of-domain benchmarks.
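The idea of multi-level hints, where each hint reveals a different number of leading reasoning steps, can be sketched as prefix construction over an already partitioned chain. This is a rough illustration only: the adaptive partitioning method is not reproduced, the even spacing of prefix lengths is an assumption, and `multi_level_hints` is a hypothetical name.

```python
def multi_level_hints(steps, levels):
    """Build hint prefixes of increasing length from one reasoning chain.

    steps  -- list of reasoning-step strings (assumed already partitioned)
    levels -- number of hint levels to produce

    Assumption: prefix lengths are spaced evenly; the abstract only says
    that each level comprises a different number of steps.
    """
    if levels < 1 or not steps:
        return []
    hints = []
    for level in range(1, levels + 1):
        # Number of leading steps to reveal. The final step is always held
        # back so the model still has to complete the reasoning itself.
        n = min(len(steps) - 1, (level * len(steps)) // (levels + 1))
        hints.append("\n".join(steps[:n]))
    return hints
```

For a four-step chain and three levels, the sketch yields prefixes of one, two, and three steps.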
Article 622
Title@2025-07-03 (4): From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
Title: From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents | Von der Web-Suche in Richtung Agentic Deep Research: Incentivizing Search with Reasoning Agents | 从网络搜索到代理深层研究:激励使用理性代理进行搜索 2506.18959v3 |
Authors (23): Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu
Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.
Article 623
Title@2025-07-03 (4): ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning
Title: ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning | ExPO: Entsperren harter Vernunft mit selbsterklärungsgeführtem Verstärkungslernen | ExPO: 以自我剥削指导强化学习来解锁困难理由 2507.02834v1 |
Authors (4): Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi
Recent advances in large language models have been driven by reinforcement learning (RL)-style post-training, which improves reasoning by optimizing model outputs based on reward or preference signals. GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier. However, these methods depend heavily on the model’s initial ability to produce positive samples. They primarily refine what the model already knows (distribution sharpening) rather than enabling the model to solve problems where it initially fails. This limitation is especially problematic in early-stage RL training and on challenging reasoning tasks, where positive samples are unlikely to be generated. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model’s likelihood of predicting the correct answer. Based on these insights, we propose Self-Explanation Policy Optimization (ExPO), a simple and modular framework that generates such samples by conditioning on the ground-truth answer. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most.
Article 624
Title@2025-07-03 (4): Generalizing Verifiable Instruction Following
Title: Generalizing Verifiable Instruction Following | Verallgemeinern der überprüfbaren Anleitung | 普遍适用的可核实说明 2507.02833v1 |
Authors (8): Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi
A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions is output constraints like “only answer with yes or no” or “mention the word ‘abrakadabra’ at least 3 times” that the user adds to craft a more useful answer. Even today’s strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
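The two example constraints quoted above are verifiable in the sense that a short function can check them. This is a minimal sketch assuming simple string-based checks; the function names and the reward definition are hypothetical and are not taken from the IFBench release.

```python
import re


def only_yes_or_no(response: str) -> bool:
    """True if the response is exactly 'yes' or 'no' (case-insensitive,
    ignoring surrounding whitespace and trailing periods)."""
    cleaned = response.strip().lower().rstrip(".")
    return cleaned in {"yes", "no"}


def mentions_word_at_least(response: str, word: str, min_count: int) -> bool:
    """True if `word` occurs as a whole word at least `min_count` times."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def verifiable_reward(response: str, checks) -> float:
    """Fraction of constraint checks the response satisfies.

    Assumption: an illustrative reward; the abstract does not specify
    how verification results are turned into an RLVR reward.
    """
    if not checks:
        return 0.0
    return sum(1 for check in checks if check(response)) / len(checks)
```

Because each check is deterministic, the same functions could serve for both evaluation and as a reward signal during training.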
Article 625
Title@2025-07-03 (4): SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model
Title: SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model | SynapseRoute: Ein Auto-Routen-Schaltrahmen für das Dual-State Large Language Model | SynapseRoute:关于两州大语言模式的自动运行切换框架 2507.02822v1 |
Authors (12): Wencheng Zhang, Shiqin Qiao, Lingjie Luo, Yinfeng Li, Chuanyang Zheng, Qian Xu, Meng Li, Yong Gui, Yijun He, Jianing Qiu, Jindong Hong, Jiankai Sun
With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between “thinking” (high reasoning) and “non-thinking” (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.
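The routing decision described above, sending each query to either the thinking or the non-thinking mode, can be sketched as a thresholded complexity score. This is only an illustration: SynapseRoute uses a trained machine learning router, whereas `toy_complexity` is a hypothetical word-count heuristic standing in for it, and the threshold value is an assumption.

```python
from typing import Callable


def route_query(
    query: str,
    complexity: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Return 'thinking' for queries scored as complex, else 'non-thinking'."""
    return "thinking" if complexity(query) >= threshold else "non-thinking"


def toy_complexity(query: str) -> float:
    """Hypothetical stand-in for a trained classifier.

    Scores longer queries as more complex, capped at 1.0.
    """
    return min(1.0, len(query.split()) / 40.0)
```

Swapping `toy_complexity` for a learned scorer leaves the routing logic unchanged, which is the point of keeping the scorer as a parameter.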
Article 626
Title@2025-07-03 (4): Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case
Title: Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case | Große Sprachmodelle zur Automatisierung der klinischen Datenstandardisierung: HL7 FHIR Use Case | 临床数据标准化自动化大语言模型:HL7 FHIR使用案例 2507.03067v1 |
Authors (3): Alvaro Riquelme, Pedro Costa, Catalina Martinez
For years, semantic interoperability standards have sought to streamline the exchange of clinical data, yet their deployment remains time-consuming, resource-intensive, and technically challenging. To address this, we introduce a semi-automated approach that leverages large language models, specifically GPT-4o and Llama 3.2 405b, to convert structured clinical datasets into HL7 FHIR format while assessing accuracy, reliability, and security. Applying our method to the MIMIC-IV database, we combined embedding techniques, clustering algorithms, and semantic retrieval to craft prompts that guide the models in mapping each tabular field to its corresponding FHIR resource. In an initial benchmark, resource identification achieved a perfect F1-score, with GPT-4o outperforming Llama 3.2 thanks to the inclusion of FHIR resource schemas within the prompt. Under real-world conditions, accuracy dipped slightly to 94%, but refinements to the prompting strategy restored robust mappings. Error analysis revealed occasional hallucinations of non-existent attributes and mismatches in granularity, which more detailed prompts can mitigate. Overall, our study demonstrates the feasibility of context-aware, LLM-driven transformation of clinical data into HL7 FHIR, laying the groundwork for semi-automated interoperability workflows. Future work will focus on fine-tuning models with specialized medical corpora, extending support to additional standards such as HL7 CDA and OMOP, and developing an interactive interface to enable expert validation and iterative refinement.
Article 627
Title@2025-07-03 (4): Identification of Potentially Misclassified Crash Narratives using Machine Learning (ML) and Deep Learning (DL)
Title: Identification of Potentially Misclassified Crash Narratives using Machine Learning (ML) and Deep Learning (DL) | Identifizierung potenziell falsch klassifizierter Crash Narratives mittels Machine Learning (ML) und Deep Learning (DL) | 利用机器学习(ML)和深层学习(DL)查明可能误解的崩溃叙述(DL) 2507.03066v1 |
Authors (3): Sudesh Bhagat, Ibne Farabi Shihab, Jonathan Wood
This research investigates the efficacy of machine learning (ML) and deep learning (DL) methods in detecting misclassified intersection-related crashes in police-reported narratives. Using 2019 crash data from the Iowa Department of Transportation, we implemented and compared a comprehensive set of models, including Support Vector Machine (SVM), XGBoost, BERT Sentence Embeddings, BERT Word Embeddings, and Albert Model. Model performance was systematically validated against expert reviews of potentially misclassified narratives, providing a rigorous assessment of classification accuracy. Results demonstrated that while traditional ML methods exhibited superior overall performance compared to some DL approaches, the Albert Model achieved the highest agreement with expert classifications (73% with Expert 1) and original tabular data (58%). Statistical analysis revealed that the Albert Model maintained performance levels similar to inter-expert consistency rates, significantly outperforming other approaches, particularly on ambiguous narratives. This work addresses a critical gap in transportation safety research through multi-modal integration analysis, which achieved a 54.2% reduction in error rates by combining narrative text with structured crash data. We conclude that hybrid approaches combining automated classification with targeted expert review offer a practical methodology for improving crash data quality, with substantial implications for transportation safety management and policy development.
Article 628
Title@2025-07-03 (4): Multimodal Mathematical Reasoning with Diverse Solving Perspective
Title: Multimodal Mathematical Reasoning with Diverse Solving Perspective | Multimodale mathematische Vernunft mit unterschiedlicher Lösungsperspektive | 具有不同解决视角的多模式数学理由 2507.02804v1 |
Authors (6): Wenhao Shi, Zhiqiang Hu, Yi Bin, Yang Yang, See-Kiong Ng, Heng Tao Shen
Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. Extensive experiments on the MathVista minitest and Math-V benchmarks demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.
Article 629
Title@2025-07-03 (4): Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models
Title: Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models | Ist Vernunft alles, was Sie brauchen? Probieren von Bias im Zeitalter der Vernunft Sprachmodelle | 需要什么理由就需要什么理由吗? 2507.02799v1 |
Authors (4): Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia
Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.
Article 630
Title@2025-07-03 (4): From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
Title: From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding | Von langen Videos zu Clips: Ein von Menschen inspiriertes Video-Editing-Framework mit multimodalem Narrative Understanding | 从长视频到启动剪贴板:由人启发的视频编辑框架,包含多模式叙述理解 2507.02790v1 |
Authors (11): Xiangfeng Wang, Xiao Li, Yadong Wei, Xueyu Song, Yang Song, Xiaoqiang Xia, Fangrui Zeng, Zaiyi Chen, Liu Liu, Gu Xu, Tong Xu
The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.
Article 631
Title@2025-07-03 (4): GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling
Title: GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling | GPAS: Beschleunigung der Konvergenz des LLM-Vortrainings durch Gradient-Preserving Activation Scaling | GPAS:通过 “ 渐进式保留动力扩增 “ 加速汇集LLM预备训练 2506.22049v2 |
Authors (15): Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Lu Yin, Can Yang
Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.
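One way to read "scaling down the intermediate activations while keeping their gradients unchanged" is as a stop-gradient trick: a scaled copy of the activation is subtracted in the forward pass but treated as a constant in the backward pass. The sketch below illustrates that idea with a hand-written forward/backward pair in plain Python. The fixed `gate` value and all function names are illustrative assumptions and are not taken from the GPAS code; the paper's actual parameterization may differ.

```python
# Toy illustration: the forward value is scaled, the backward gradient is not.
# `gate` is a hypothetical fixed scalar chosen for this example.

def scaled_forward(x, gate):
    """Value of x - gate * stop_gradient(x), which equals (1 - gate) * x."""
    return [xi - gate * xi for xi in x]

def preserved_backward(grad_out):
    """The subtracted term is treated as a constant, so d(out)/d(x) = 1."""
    return list(grad_out)

def naive_backward(grad_out, gate):
    """Plain scaling for comparison: the gradient shrinks by (1 - gate)."""
    return [(1.0 - gate) * g for g in grad_out]

x = [4.0, -2.0, 8.0]
upstream = [1.0, 1.0, 1.0]

out = scaled_forward(x, gate=0.5)
grad_preserved = preserved_backward(upstream)
grad_naive = naive_backward(upstream, gate=0.5)

print("activations:", out)
print("gradient (preserved):", grad_preserved)
print("gradient (naive scaling):", grad_naive)
```

In an autodiff framework the same effect is usually obtained with a detach or stop-gradient operation rather than a hand-written backward function.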
Article 632
Title@2025-07-03 (4): Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation
Title: Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation | Verbesserung klinischer Multiple-Choice-Fragen Benchmarks mit Knowledge Graph Guided Distractor Generierung | 加强具有知识图导引引产生体的临床多选择问题基准,加强临床多选择问题基准 2506.00612v3 |
Authors (5): Running Yang, Wenlong Deng, Minghui Chen, Yuyin Zhou, Xiaoxiao Li
Clinical tasks such as diagnosis and treatment require strong decision-making abilities, highlighting the importance of rigorous evaluation benchmarks to assess the reliability of large language models (LLMs). In this work, we introduce a knowledge-guided data augmentation framework that enhances the difficulty of clinical multiple-choice question (MCQ) datasets by generating distractors (i.e., incorrect choices that are similar to the correct one and may confuse existing LLMs). Using our KG-based pipeline, the generated choices are both clinically plausible and deliberately misleading. Our approach involves multi-step, semantically informed walks on a medical knowledge graph to identify distractor paths (associations that are medically relevant but factually incorrect), which then guide the LLM in crafting more deceptive distractors. We apply the designed knowledge graph guided distractor generation (KGGDG) pipeline to six widely used medical QA benchmarks and show that it consistently reduces the accuracy of state-of-the-art LLMs. These findings establish KGGDG as a powerful tool for enabling more robust and diagnostic evaluations of medical LLMs.
Article 633
Title@2025-07-03 (4): Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
Title: Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs | Selbstkorrektionsbank: Enthüllung und Adressierung des Selbstkorrektions-Blindflecks in LLMs | 自我校正法官:在LLMs中披露和处理自我校正的盲人点 2507.02778v1 |
Authors (1): Ken Tsui
Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify errors in user input, they exhibit a systematic ‘Self-Correction Blind Spot’ - failing to correct identical errors in their own outputs. To study this phenomenon, we introduce Self-Correction Bench, a systematic framework that measures it through controlled error injection at three complexity levels. Testing 14 models, we find an average 64.5% blind spot rate. We find multiple lines of evidence that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending “Wait” reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.
Article 634
Title@2025-07-03 (4): DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
Title: DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment | DeSTA2.5-Audio: Auf dem Weg zu einem General-Purpose Large Audio Language Model mit selbsterzeugter Cross-Modal Alignment | DeSTA2.5-Audio:努力建立具有自发跨模式一致的通用大型音频语言模型 2507.02768v1 |
Authors (28): Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches have often suffered from the catastrophic forgetting of the LLM’s original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM’s native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
Article 635
Title@2025-07-03 (4): Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression
Title: Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression | Batch-Max: Höherer LLM-Durchsatz mit größeren Batch-Größen und KV Cache-Kompression | 批量最大:使用大批量大小和 KV缓存压缩的高级 LLM 输送量 2412.05693v3 |
Authors (3): Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh
Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used resulting in significantly higher throughput while still maintaining the original model’s accuracy.
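The link between KV cache size and batch size can be made concrete with back-of-the-envelope arithmetic. The sketch below is only an illustration under stated assumptions: the model shape, the memory budget, and the `keep_ratio` parameter are hypothetical values chosen for the example, and the calculation ignores model weights, activations, and any peak memory used while the prompt is being processed.

```python
# Rough estimate of how many sequences fit in a fixed KV cache budget.
# All numbers below are hypothetical and chosen only to make the arithmetic clean.

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Keys and values (factor 2) stored for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def max_batch_size(budget_bytes, context_tokens, per_token_bytes, keep_ratio=1.0):
    """Largest batch whose retained KV entries fit in the budget."""
    kept_tokens = max(1, int(context_tokens * keep_ratio))
    return budget_bytes // (kept_tokens * per_token_bytes)

per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, bytes_per_value=2)
budget = 8 * 1024**3  # 8 GiB reserved for the KV cache (assumed)

uncompressed = max_batch_size(budget, context_tokens=4096, per_token_bytes=per_token)
compressed = max_batch_size(budget, context_tokens=4096, per_token_bytes=per_token,
                            keep_ratio=0.25)

print("KV bytes per token:", per_token)
print("max batch without compression:", uncompressed)
print("max batch keeping 25% of KV pairs:", compressed)
```

Under these assumed numbers, keeping a quarter of the KV pairs allows a batch four times larger, which is the kind of headroom the abstract attributes to compressing during input processing.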
Article 636
Title@2025-07-03 (4): Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens
Title: Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens | Messung der Granularität des Vowel-Produktionsraumes durch einfach nur produzierbare unterschiedliche (JPD) Limens | 仅用可制成差异(JPD)激光测量Vowel 生产空间的颗粒度 2507.02744v1 |
Authors (1): Peter Viechnicki
A body of work over the past several decades has demonstrated that the complex and coordinated articulatory movements of human vowel production are governed (at least in part) by control mechanisms whose targets are regions of auditory space. Within the target region, control at the sub-phonemic level has also been demonstrated, but the degree of accuracy of that control is unknown. The current work investigates this question by asking: how far apart must two vowel stimuli lie in auditory space in order to yield reliably different imitations? This distance is termed the ‘Just Producible Difference’ (JPD). The current study uses a vowel mimicry paradigm to derive the first measurement of JPD among two sets of English speakers during front vowel production. JPD is estimated at between 14 and 51 mels in F1 × F2 space. This finding has implications for episodic theories of speech production. It also clarifies the possible structures of human vowel systems by setting a theoretical lower bound for how close two vowel phonemes may be in a speaker’s formant space, and hence offers a psychophysical explanation of observed trends in the number and patterns of possible vowel phonemes.
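To make the unit concrete, the distance between two vowel stimuli in F1 × F2 space can be computed after converting formant frequencies from hertz to mels. The sketch below is an assumption-laden illustration: the abstract does not say which mel formula or distance metric the study uses, so a widely used conversion (2595 · log10(1 + f/700)) and Euclidean distance are assumed here, and the formant values are made up for the example.

```python
import math

def hz_to_mel(freq_hz):
    """A common hertz-to-mel conversion; the study's exact formula is not stated."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def formant_distance_mels(vowel_a, vowel_b):
    """Euclidean distance between two (F1, F2) pairs, measured in mels (assumed metric)."""
    a1, a2 = (hz_to_mel(f) for f in vowel_a)
    b1, b2 = (hz_to_mel(f) for f in vowel_b)
    return math.hypot(a1 - b1, a2 - b2)

# Hypothetical front-vowel stimuli as (F1, F2) in hertz.
stimulus_a = (300.0, 2300.0)
stimulus_b = (330.0, 2250.0)

distance = formant_distance_mels(stimulus_a, stimulus_b)
print(f"distance between stimuli: {distance:.1f} mels")
```

A distance computed this way could then be compared against the 14 to 51 mel range reported in the abstract, keeping in mind that the comparison is only meaningful if the same mel scale and metric are used.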
Article 637
Title@2025-07-03 (4): Early Signs of Steganographic Capabilities in Frontier LLMs
Title: Early Signs of Steganographic Capabilities in Frontier LLMs | Frühe Anzeichen von Steganographischen Fähigkeiten in Frontier LLMs | 边疆长长体动物能力早期信号 2507.02737v1 |
Authors (5): Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy, Roland S. Zimmermann, David Lindner
Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganography capabilities of frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models are unable to encode short messages in their outputs without a monitor noticing under standard affordances. They can succeed, however, if given additional affordances such as using an unmonitored scratchpad and coordinating on what encoding scheme to use. We additionally find early signs that models can perform basic encoded reasoning in a simple state-tracking problem. This includes some ability to reason with their own and pre-defined schemes, including encoding schemes such as Hexadecimal. Despite this, they can rarely hide reasoning subtly within a cover task to fool a monitor. Overall, our results indicate that current LLMs exhibit nascent steganographic capabilities. While these capabilities are likely insufficient to bypass well-designed monitors at present, this could change in the future.
Article 638
Title@2025-07-03 (4): Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Title: Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | Mind2Web 2: Agentische Suche mit Agent-as-a-Judge bewerten | Mind2Web 2: 与代理法官评估代理搜索 2506.21506v2 |
Authors (26): Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
Agentic search such as Deep Research systems, where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
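A tree-structured rubric can be pictured as a tree whose leaves hold individual checks and whose internal nodes aggregate their children. The sketch below is a minimal illustration of that shape only. The aggregation rule (a plain average), the `critical` flag, and the node names are assumptions made for this example and are not described in the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RubricNode:
    name: str
    score: Optional[float] = None   # set on leaves only, in [0, 1]
    critical: bool = False          # hypothetical: a failed critical child zeroes its parent
    children: List["RubricNode"] = field(default_factory=list)

def evaluate(node: RubricNode) -> float:
    """Score a rubric tree bottom-up (averaging is an assumed aggregation rule)."""
    if not node.children:
        return float(node.score)
    scored = [(child, evaluate(child)) for child in node.children]
    if any(child.critical and value == 0.0 for child, value in scored):
        return 0.0
    return sum(value for _, value in scored) / len(scored)

# Hypothetical rubric for one task: correctness of two facts plus source attribution.
rubric = RubricNode("task", children=[
    RubricNode("answer_correct", children=[
        RubricNode("fact_a_matches", score=1.0),
        RubricNode("fact_b_matches", score=0.0),
    ]),
    RubricNode("sources_attributed", score=1.0),
])

gated = RubricNode("task", children=[
    RubricNode("meets_hard_constraint", score=0.0, critical=True),
    RubricNode("sources_attributed", score=1.0),
])

print("rubric score:", evaluate(rubric))
print("score with failed critical check:", evaluate(gated))
```

In a judge-agent setting each leaf score would come from a model or tool call rather than a hard-coded number; only the aggregation over the tree is shown here.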
Article 639
Title@2025-07-03 (4): On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability
Title: On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability | Über Charakterisierungen für die Sprachgenerierung: Interplay von Halluzinationen, Breadth und Stabilität | 语言生成特征:幻觉、面包和稳定之间的相互作用 2412.18530v2 |
Authors (3): Alkis Kalavasis, Anay Mehrotra, Grigoris Velegkas
We study language generation in the limit - introduced by Kleinberg and Mullainathan [KM24] - building on classical works of Gold [Gol67] and Angluin [Ang79]. [KM24]’s main result is an algorithm for generating from any countable language collection in the limit. While their algorithm eventually generates unseen strings from the target language $K$, it sacrifices coverage or breadth, i.e., its ability to generate a rich set of strings. Recent work introduces different notions of breadth and explores when generation with breadth is possible, leaving a full characterization of these notions open. Our first set of results settles this by characterizing generation for existing notions of breadth and their natural extensions. Interestingly, our lower bounds are very flexible and hold for many performance metrics beyond breadth - for instance, showing that, in general, it is impossible to train generators which achieve a higher perplexity or lower hallucination rate for $K$ compared to other languages. Next, we study language generation with breadth and stable generators - algorithms that eventually stop changing after seeing an arbitrary but finite number of strings - and prove unconditional lower bounds for such generators, strengthening the results of [KMV25] and demonstrating that generation with many existing notions of breadth becomes equally hard when stability is required. This gives a separation between stable and unstable generators for generation with approximate breadth, highlighting the rich interplay between breadth, stability, and consistency in language generation.
Article 640
Title@2025-07-03 (4): Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation
Title: Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation | Next-Token-Vorhersage-Aufgabe setzt eine optimale Datenbestellung für LLM-Training in Proof Generation voraus | 假定为校实生成的LLM培训提供最佳数据排序 2411.00863v2 |
Authors (11): Chenyang An, Shima Imani, Feng Yao, Chengyu Dong, Ali Abbasi, Harsh Shrivastava, Samuel Buss, Jingbo Shang, Gayathri Mahalingam, Pramod Sharma, Maurice Diesendruck
In the field of large language model (LLM)-based proof generation, despite extensive training on large datasets such as arXiv, LLMs still exhibit only modest performance on proving tasks of moderate difficulty. We believe that this is partly due to the widespread presence of suboptimal ordering within the data for each proof used in training. For example, published proofs often follow a purely logical order, where each step logically proceeds from the previous steps based on the deductive rules. This order is designed to facilitate the verification of the proof’s soundness, rather than to help people and models learn the discovery process of the proof. In proof generation, we argue that the optimal order for one training data sample occurs when the relevant intermediate supervision for a particular proof step in the proof is always positioned to the left of that proof step. We call such an order the intuitively sequential order. We validate our claims using two tasks: intuitionistic propositional logic theorem-proving and digit multiplication. Our experiments verify the order effect and provide support for our explanations. We demonstrate that training is most effective when the proof is in the intuitively sequential order. Moreover, the order effect and the performance gap between models trained on different data orders can be substantial, with an 11 percent improvement in proof success rate observed in the propositional logic theorem-proving task for models trained on the optimal order compared to the worst order. Lastly, we define a common type of order issue in advanced math proofs and find that 17.3 percent of theorems with nontrivial proofs in the first two chapters of a widely used graduate-level mathematics textbook suffer from this issue. A detailed list of those proofs is provided in the appendix.
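The ordering idea can be illustrated on the digit multiplication task. Under next-token prediction, a model reads each training sample left to right, so intermediate results help predict the answer only if they appear before it. The sketch below builds the same sample in two orders. The string formats and function names are hypothetical and chosen for illustration; the paper's actual data format is not given in the abstract.

```python
def partial_products(a, b):
    """Partial products of a * b, one per decimal digit of b (least significant first)."""
    parts = []
    place = 1
    while b > 0:
        digit = b % 10
        parts.append(a * digit * place)
        b //= 10
        place *= 10
    return parts

def sequential_sample(a, b):
    """Intermediate steps appear to the LEFT of the final answer."""
    parts = partial_products(a, b)
    steps = " + ".join(str(p) for p in parts)
    return f"{a} * {b} -> {steps} = {sum(parts)}"

def answer_first_sample(a, b):
    """Same content, but the answer precedes the steps that justify it."""
    parts = partial_products(a, b)
    steps = " + ".join(str(p) for p in parts)
    return f"{a} * {b} = {sum(parts)} <- {steps}"

print(sequential_sample(23, 47))
print(answer_first_sample(23, 47))
```

In the first format the partial products are available as context when the final product is predicted; in the second, the model must produce the product before it has seen any intermediate work.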
Article 641
Title@2025-07-03 (4): Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
Title: Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers | Können LLMs kritische Einschränkungen innerhalb der wissenschaftlichen Forschung identifizieren? Eine systematische Bewertung von KI-Forschungspapieren | LLMs能否查明科学研究中的关键限制? 对AI研究文件的系统评估 2507.02694v1 |
Authors (5): Zhijian Xu, Yilun Zhao, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, we present LimitGen, the first comprehensive benchmark for evaluating LLMs’ capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding the identification of limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.
Article 642
Title@2025-07-03 (4): Exploring Gender Bias Beyond Occupational Titles
Title: Exploring Gender Bias Beyond Occupational Titles | Erforschen von Gender-Bias über Berufsbezeichnungen hinaus | 探索职业职称之外的性别偏见 2507.02679v1 |
Authors (2): Ahmed Sabir, Rajesh Sharama
In this work, we investigate the correlation between gender and contextual biases, focusing on elements such as action verbs, object nouns, and particularly on occupations. We introduce a novel dataset, GenderLexicon, and a framework that can estimate contextual bias and its related gender bias. Our model can interpret the bias with a score and thus improve the explainability of gender bias. Also, our findings confirm the existence of gender biases beyond occupational stereotypes. To validate our approach and demonstrate its effectiveness, we conduct evaluations on five diverse datasets, including a Japanese dataset.
Article 643
Title@2025-07-03 (4): Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning
Title: Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning | Code2Logic: Game-Code-getriebene Datensynthese zur Verbesserung von VLMs Allgemeine Begründung | 代码2Llogic: 用于增强VLMs一般理由的游戏-代码-驱动数据合成 2505.13886v2 |
Authors (26): Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Visual-language Chain-of-Thought (CoT) data resources are relatively scarce compared to text-only counterparts, limiting the improvement of reasoning capabilities in Vision Language Models (VLMs). However, high-quality vision-language reasoning data is expensive and labor-intensive to annotate. To address this issue, we leverage a promising resource: game code, which naturally contains logical structures and state transition processes. Therefore, we propose Code2Logic, a novel game-code-driven approach for multimodal reasoning data synthesis. Our approach leverages Large Language Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning processes and results through code execution. Using the Code2Logic approach, we developed the GameQA dataset to train and evaluate VLMs. GameQA is cost-effective and scalable, offers controllable difficulty gradation, and is diverse, spanning 30 games and 158 tasks. Surprisingly, despite training solely on game data, VLMs demonstrated out-of-domain generalization; specifically, Qwen2.5-VL-7B improved performance by 2.33% across 7 diverse vision-language benchmarks. Our code, dataset and models are available at https://github.com/tongjingqi/Code2Logic.
Article 644
Title@2025-07-03 (4): ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
Title: ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning | ASDA: Audio-Spektrogramm Differential-Achtungsmechanismus für selbstüberwachtes Repräsentationslernen | ASDA:自我监督代表制学习的听觉分光差异关注机制 2507.02666v1 |
Authors (5): Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang
In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model’s discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA’s effectiveness in audio tasks, paving the way for broader applications.
Article 645
Title@2025-07-03 (4): An Adaptive Supervised Contrastive Learning Framework for Implicit Sexism Detection in Digital Social Networks
Title: An Adaptive Supervised Contrastive Learning Framework for Implicit Sexism Detection in Digital Social Networks | Ein adaptives, überwachtes, kontrastives Lernkonzept für implizite Sexismus-Erkennung in digitalen sozialen Netzwerken | 在数字社会网络中发现隐含性别主义的适应性监督差异学习框架 2507.05271v1 |
Authors (3): Mohammad Zia Ur Rehman, Aditya Shah, Nagendra Kumar
The global reach of social media has amplified the spread of hateful content, including implicit sexism, which is often overlooked by conventional detection methods. In this work, we introduce an Adaptive Supervised Contrastive lEarning framework for implicit sexism detectioN (ASCEND). A key innovation of our method is the incorporation of threshold-based contrastive learning: by computing cosine similarities between embeddings, we selectively treat only those sample pairs as positive if their similarity exceeds a learnable threshold. This mechanism refines the embedding space by robustly pulling together representations of semantically similar texts while pushing apart dissimilar ones, thus reducing false positives and negatives. The final classification is achieved by jointly optimizing a contrastive loss with a cross-entropy loss. Textual features are enhanced through a word-level attention module. Additionally, we employ sentiment, emotion, and toxicity features. Evaluations on the EXIST2021 and MLSC datasets demonstrate that ASCEND significantly outperforms existing methods, with average Macro F1 improvements of 9.86%, 29.63%, and 32.51% across multiple tasks, highlighting its efficacy in capturing the subtle cues of implicit sexist language.
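The threshold-based pair selection described in this abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: the function names are hypothetical, the threshold is a fixed constant rather than a learnable parameter, and requiring matching labels is an assumption drawn from the "supervised" setting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def positive_pairs(embeddings, labels, threshold):
    """Return index pairs (i, j), i < j, that are treated as positives.

    Assumption: a pair counts as positive only if both samples share a
    label AND their cosine similarity exceeds the threshold. In the paper
    the threshold is learnable; here it is a plain constant.
    """
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if labels[i] != labels[j]:
                continue
            if cosine(embeddings[i], embeddings[j]) > threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    # Toy 2-D embeddings; real embeddings would come from a text encoder.
    emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
    lab = [1, 1, 1, 0]
    print(positive_pairs(emb, lab, threshold=0.8))
```

In a full training loop the selected pairs would feed a contrastive loss that is optimized jointly with cross-entropy, as the abstract states; that part is omitted here.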
Article 646
Title@2025-07-03 (4): OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
Title: OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding | OmniDraft: Ein Cross-Vocabulary, Online Adaptive Drafter für die gerätespezifische Dekodierung | 总括草案:跨词汇、在线在线可调适性套用投机下限设计图纸 2507.02659v1 |
Authors (7): Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang
Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the “one drafter for all” paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.
Article 647
Title@2025-07-03 (4): Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search
Title: Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search | Entkoppelte Planung und Ausführung: Ein Hierarchisches Reasoning-Framework für tiefe Suche | 分解的规划和执行:深海搜索的等级理据框架 2507.02652v1 |
Authors (8): Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yang Zhao, Hongjin Qian, Zhicheng Dou
Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: they use a single model to handle both high-level planning and detailed execution, leading to inefficient reasoning and limited scalability. In this paper, we introduce HiRA, a hierarchical framework that separates strategic planning from specialized execution. Our approach decomposes complex search tasks into focused subtasks, assigns each subtask to domain-specific agents equipped with external tools and reasoning capabilities, and coordinates the results through a structured integration mechanism. This separation prevents execution details from disrupting high-level reasoning while enabling the system to leverage specialized expertise for different types of information processing. Experiments on four complex, cross-modal deep search benchmarks demonstrate that HiRA significantly outperforms state-of-the-art RAG and agent-based systems. Our results show improvements in both answer quality and system efficiency, highlighting the effectiveness of decoupled planning and execution for multi-step information seeking tasks. Our code is available at https://github.com/ignorejjj/HiRA.
Article 648
Title@2025-07-03 (4): Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory
Title: Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory | Strategische Intelligenz in großen Sprachmodellen: Beweise aus der evolutionären Spieltheorie | 大语言模型战略情报:进化游戏理论的证据 2507.02618v1 |
Authors (2): Kenneth Payne, Baptiste Alloui-Cros
Are Large Language Models (LLMs) a new form of strategic intelligence, able to reason about goals in competitive settings? We present compelling supporting evidence. The Iterated Prisoner’s Dilemma (IPD) has long served as a model for studying decision-making. We conduct the first ever series of evolutionary IPD tournaments, pitting canonical strategies (e.g., Tit-for-Tat, Grim Trigger) against agents from the leading frontier AI companies OpenAI, Google, and Anthropic. By varying the termination probability in each tournament (the “shadow of the future”), we introduce complexity and chance, confounding memorisation. Our results show that LLMs are highly competitive, consistently surviving and sometimes even proliferating in these complex ecosystems. Furthermore, they exhibit distinctive and persistent “strategic fingerprints”: Google’s Gemini models proved strategically ruthless, exploiting cooperative opponents and retaliating against defectors, while OpenAI’s models remained highly cooperative, a trait that proved catastrophic in hostile environments. Anthropic’s Claude emerged as the most forgiving reciprocator, showing remarkable willingness to restore cooperation even after being exploited or successfully defecting. Analysis of nearly 32,000 prose rationales provided by the models reveals that they actively reason about both the time horizon and their opponent’s likely strategy, and we demonstrate that this reasoning is instrumental to their decisions. This work connects classic game theory with machine psychology, offering a rich and granular view of algorithmic decision-making under uncertainty.
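For readers unfamiliar with the setup, the following is a minimal sketch of an Iterated Prisoner's Dilemma match with a per-round termination probability (the "shadow of the future"). It is illustrative only: the payoff values are the conventional textbook ones (5/3/1/0) and are an assumption, not taken from the paper, and the function names are hypothetical.

```python
import random

# Assumed conventional payoffs: (my payoff, opponent payoff) per move pair.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own_history, opp_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not opp_history else opp_history[-1]


def grim_trigger(own_history, opp_history):
    """Cooperate until the opponent defects once, then defect forever."""
    return "D" if "D" in opp_history else "C"


def always_defect(own_history, opp_history):
    """Defect unconditionally."""
    return "D"


def play_match(strategy_a, strategy_b, termination_prob, rng):
    """Play rounds until a random stop; return (score_a, score_b, rounds)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    while True:
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
        # After every round the match ends with probability termination_prob.
        if rng.random() < termination_prob:
            return score_a, score_b, len(hist_a)


if __name__ == "__main__":
    rng = random.Random(0)
    print(play_match(tit_for_tat, always_defect, 0.1, rng))
```

A lower termination probability means longer expected matches, which generally makes retaliation-capable strategies such as Tit-for-Tat more viable against defectors.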
Article 649
Title@2025-07-03 (4): Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure
Title: Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure | Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur | 以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测 2506.08713v2 |
Authors (2): Fariz Ikhwantri, Dusica Marijan
Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.
Article 650
Title@2025-07-03 (4): Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs
Title: Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs | Symbolisch oder numerisch? Physik-Probleme verstehen, die LLMs aufklären | 理解在理赔中解决物理问题 2507.01334v2 |
Authors (3): Nifu Dan, Yujun Cai, Yiwei Wang
Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Our comprehensive experimental evaluation reveals the remarkable capabilities of reasoning models. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation. Furthermore, our findings indicate that even for these highly sophisticated reasoning models, the strategic incorporation of few-shot prompting can still yield measurable improvements in overall accuracy, highlighting the potential for continued performance gains.
Article 651
Title@2025-07-03 (4): MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion
Title: MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion | MPF: Sprachmodelle nach der Bereitstellung über Multi Perspective Fusion ausrichten und abgrenzen | MPF:通过多视角融合进行部署后调整和取消对语言模式的偏见 2507.02595v1 |
Authors (7): Xin Guan, PeiHsin Lin, Zekun Wu, Ze Wang, Ruibo Zhang, Emre Kazim, Adriano Koshiyama
Multiperspective Fusion (MPF) is a novel posttraining alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines. By decomposing a baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased for Top University), resulting in small KL divergence, reduction of calibration error and generalization to unseen questions. This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or finetuning.
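The idea of weighting perspective components and measuring the result with KL divergence can be sketched as follows. This is only an illustration under stated assumptions: the distributions are toy values over three sentiment bins, the weights are given rather than obtained from a decomposition, and the function names are hypothetical rather than part of MPF or SAGED.

```python
import math


def mixture(components, weights):
    """Weighted mixture of discrete distributions over the same bins."""
    n_bins = len(components[0])
    return [
        sum(w * comp[i] for w, comp in zip(weights, components))
        for i in range(n_bins)
    ]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Hypothetical perspective components over (negative, neutral, positive).
    perspectives = [[0.8, 0.1, 0.1], [0.2, 0.3, 0.5]]
    weights = [0.5, 0.5]          # assumed decomposition probabilities
    baseline = [0.5, 0.2, 0.3]    # assumed target baseline distribution
    mixed = mixture(perspectives, weights)
    print(mixed, kl_divergence(baseline, mixed))
```

A KL divergence near zero indicates that the weighted mixture of perspectives closely reproduces the baseline distribution.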
Article 652
Title@2025-07-03 (4): MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration
Title: MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration | MedAide: Informationsfusion und Anatomie von medizinischen Intents über LLM-basierte Agent Collaboration | 医学辅助:通过以LLM为基地的合作公司代理进行医疗成瘾者的信息汇集和解剖 2410.12532v3 |
Authors (11): Dingkang Yang, Jinjie Wei, Mingcheng Li, Jiyao Liu, Lihao Liu, Ming Hu, Junjun He, Yakun Ju, Wei Zhou, Yang Liu, Lihua Zhang
In healthcare intelligence, the ability to fuse heterogeneous, multi-intent information from diverse clinical sources is fundamental to building reliable decision-making systems. Large Language Model (LLM)-driven information interaction systems are currently showing promise in the healthcare domain. Nevertheless, they often suffer from information redundancy and coupling when dealing with complex medical intents, leading to severe hallucinations and performance bottlenecks. To this end, we propose MedAide, an LLM-based medical multi-agent collaboration framework designed to enable intent-aware information fusion and coordinated reasoning across specialized healthcare domains. Specifically, we introduce a regularization-guided module that combines syntactic constraints with retrieval augmented generation to decompose complex queries into structured representations, facilitating fine-grained clinical information fusion and intent resolution. Additionally, a dynamic intent prototype matching module is proposed to utilize dynamic prototype representation with a semantic similarity matching mechanism to achieve adaptive recognition and updating of the agent’s intent in multi-round healthcare dialogues. Ultimately, we design a rotation agent collaboration mechanism that introduces dynamic role rotation and decision-level information fusion across specialized medical agents. Extensive experiments are conducted on four medical benchmarks with composite intents. Experimental results from automated metrics and expert doctor evaluations show that MedAide outperforms current LLMs and improves their medical proficiency and strategic reasoning.
Article 653
Title@2025-07-03 (4): Revisiting Active Learning under (Human) Label Variation
Title: Revisiting Active Learning under (Human) Label Variation | Aktives Lernen unter (menschlichen) Label-Varianten | 在(人)标签标签变换下重新审查积极学习 2507.02593v1 |
Authors (6): Cornelia Gruber, Helen Alber, Bernd Bischl, Göran Kauermann, Barbara Plank, Matthias Aßenmacher
Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed – or neglected – these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLM) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.
Article 654
Title@2025-07-03 (4): WebSailor: Navigating Super-human Reasoning for Web Agent
Title: WebSailor: Navigating Super-human Reasoning for Web Agent | WebSailor: Navigieren Super-Mensch Vernunft für Web Agent | Web 服务员: 为 Web 代理导航超人理由 2507.02592v1 |
Authors (19): Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou
Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap.
Article 655
Title@2025-07-03 (4): AI Flow: Perspectives, Scenarios, and Approaches
Title: AI Flow: Perspectives, Scenarios, and Approaches | AI Flow: Perspektiven, Szenarien und Ansätze | AI 流动:观点、设想和方法 2506.12479v2 |
Authors (14): Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li
Pioneered by the foundational information theory of Claude Shannon and the visionary framework of machine intelligence by Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, a device-edge-cloud framework serves as the foundation, integrating end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.
Article 656
Title@2025-07-03 (4): Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization
Title: Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization | Verbesserung der LLM-Gründung für die Erkennung von Schwachstellen über die Optimierung der Gruppenrelative Policy | 通过群体相对政策优化改进通过群体相对政策优化查明脆弱性的改进LLM 改进LLM理由 2507.03051v1 |
Authors (4): Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino
Improving and understanding the training dynamics and reasoning of Large Language Models (LLMs) has become essential for their deployment in AI-based security tools, such as software vulnerability detection. In this work, we present an extensive study aimed at advancing recent RL-based finetuning techniques for LLMs in the context of vulnerability detection. We start by highlighting key limitations of commonly adopted LLMs, such as their tendency to over-predict certain types of vulnerabilities while failing to detect others. To address this challenge, we explore the use of Group Relative Policy Optimization (GRPO), a recent policy-gradient method, for guiding LLM behavior through structured, rule-based rewards. We enable its application to the vulnerability detection task by redefining its advantage functions and reward signals using annotations from widely used datasets in the field, including BigVul, DiverseVul, and CleanVul. The proposed methodology enables an extensive set of experiments, addressing multiple research questions regarding the impact of GRPO on generalization, reasoning capabilities, and performance improvements over standard supervised finetuning (SFT). Our findings offer valuable insights into the potential of RL-based training to enhance both the performance and reasoning abilities of LLMs in the context of software vulnerability detection.
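As background for the GRPO component, the sketch below shows the commonly used group-relative advantage: each sampled response's reward is normalized by the mean and standard deviation of its sampling group. This is the standard formulation, not the redefined advantage function the abstract mentions, and the binary rule-based reward is a hypothetical stand-in for rewards built from dataset annotations.

```python
import statistics


def rule_based_reward(predicted_label, gold_label):
    """Hypothetical rule: 1.0 when the prediction matches the annotation."""
    return 1.0 if predicted_label == gold_label else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    gold = "vulnerable"
    sampled = ["vulnerable", "safe", "vulnerable", "safe"]
    rewards = [rule_based_reward(p, gold) for p in sampled]
    print(group_relative_advantages(rewards))
```

Responses that beat the group average receive positive advantages and the rest negative ones, so no separate value network is needed to form the baseline.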
Article 657
Title@2025-07-03 (4): Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
Title: Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs | 跨语言旅行:多模式LLM中跨语言一致基准 2505.15075v2 |
Authors (5): Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara
The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.
Article 658
Title@2025-07-03 (4): Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning
Title: Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning | Selbstgesteuerte Prozess-Reward-Optimierung mit neu definiertem Schrittweiser Vorteil für Prozess-Verstärkungs-Lernen | 自指导流程向上优化,具有重新定义的逐步改进的流程强化学习优势 2507.01551v2 |
Authors (8): Wu Fei, Hao Kong, Shuxian Liang, Yang Lin, Yibo Yang, Jing Tang, Lei Chen, Xiansheng Hua
Process Reinforcement Learning (PRL) has demonstrated considerable potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, introducing additional process reward models incurs substantial computational overhead, and there is no unified theoretical framework for process-level advantage estimation. To bridge this gap, we propose Self-Guided Process Reward Optimization (SPRO), a novel framework that enables process-aware RL through two key innovations: (1) we first theoretically demonstrate that process rewards can be derived intrinsically from the policy model itself, and (2) we introduce well-defined cumulative process rewards and Masked Step Advantage (MSA), which facilitates rigorous step-wise action advantage estimation within shared-prompt sampling groups. Our experimental results demonstrate that SPRO outperforms vanilla GRPO with 3.4x higher training efficiency and a 17.5% test accuracy improvement. Furthermore, SPRO maintains a stable and elevated policy entropy throughout training while reducing the average response length by approximately 1/3, evidencing sufficient exploration and prevention of reward hacking. Notably, SPRO incurs no additional computational overhead compared to outcome-supervised RL methods such as GRPO, which benefits industrial implementation.
Article 659
Title@2025-07-03 (4): IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders
Title: IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders | IndianBailJudgments-1200: Ein Multi-Attribut-Datensatz für legale NLP auf indischen Bail-Aufträgen | IndianBailJudgments-1200:印度保释令法律国家保护程序多属性数据集 2507.02506v1 |
Authors (2): Sneha Deshmukh, Prathmesh Kamble
Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. We introduce IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions, annotated across 20+ attributes including bail outcome, IPC sections, crime type, and legal reasoning. Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. This resource supports a wide range of legal NLP tasks such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused specifically on Indian bail jurisprudence.
Article 660
Title@2025-07-03 (4): Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation
Title: Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation | Counterfactual Tuning für die zeitliche Sensitivitätssteigerung in der auf Großsprachenmodellen basierenden Empfehlung | 以基于大语言的示范建议为基础的提高时时敏感度反事实推算 2507.03047v1 |
Authors (4): Yutian Liu, Zhengyi Yang, Jiancan Wu, Xiang Wang
Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users’ historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately. To address this critical gap, we propose Counterfactual Enhanced Temporal Framework for LLM-Based Recommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. By conceptualizing temporal order as an independent causal factor distinct from item content, we can quantify its unique contribution through counterfactual reasoning–comparing what recommendations would be made with and without temporal information while keeping all other factors constant. This causal framing enables CETRec to design a novel counterfactual tuning objective that directly optimizes the model’s temporal sensitivity, teaching LLMs to recognize both absolute timestamps and relative ordering patterns in user histories. Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs’ awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items).
Article 661
Title@2025-07-03 (4): Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack
Title: Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack | Robustheit von Fehlinformations-Klassifikationssystemen zu Adversarial-Beispielen durch BeamAttack | 通过“BeamAttack”进行错误信息分类系统对反向实例的强力 2506.23661v2 |
Authors (4): Arnisa Fazla, Lucas Krauter, David Guzman Piedrahita, Andrianos Michail
We extend BeamAttack, an adversarial attack algorithm designed to evaluate the robustness of text classification systems through word-level modifications guided by beam search. Our extensions include support for word deletions and the option to skip substitutions, enabling the discovery of minimal modifications that alter model predictions. We also integrate LIME to better prioritize word replacements. Evaluated across multiple datasets and victim models (BiLSTM, BERT, and adversarially trained RoBERTa) within the BODEGA framework, our approach achieves over a 99% attack success rate while preserving the semantic and lexical similarity of the original texts. Through both quantitative and qualitative analysis, we highlight BeamAttack’s effectiveness and its limitations. Our implementation is available at https://github.com/LucK1Y/BeamAttack
Article 662
Title@2025-07-03 (4): Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer
Title: Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer | Task Prompt Vektoren: Effektive Initialisierung durch Multi-Task Soft-Prompt Transfer | 任务提示矢量 : 通过多任务软性即时传输实现有效的初始化 2408.01119v3 |
Authors (4): Robert Belanec, Simon Ostermann, Ivan Srba, Maria Bielikova
Prompt tuning is an efficient solution for training large language models (LLMs). However, current soft-prompt-based methods often sacrifice multi-task modularity, requiring the training process to be fully or partially repeated for each newly added task. While recent work on task vectors applied arithmetic operations on full model weights to achieve the desired multi-task performance, a similar approach for soft-prompts is still missing. To this end, we introduce Task Prompt Vectors, created by taking the element-wise difference between the weights of tuned soft-prompts and their random initialization. Experimental results on 12 NLU datasets show that task prompt vectors can be used in low-resource settings to effectively initialize prompt tuning on similar tasks. In addition, we show that task prompt vectors are independent of the random initialization of prompt tuning on 2 different language model architectures. This allows prompt arithmetic with the pre-trained vectors from different tasks. In this way, we provide a competitive alternative to state-of-the-art baselines by arithmetic addition of task prompt vectors from multiple tasks.
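The arithmetic described in this abstract can be sketched in a few lines. The example treats a soft prompt as a small matrix (tokens by embedding dimensions) held in nested lists. The function names and toy values are hypothetical, and adding the summed vectors back onto the initialization is one plausible reading of "arithmetic addition", not a confirmed detail of the paper.

```python
def task_prompt_vector(tuned_prompt, init_prompt):
    """Element-wise difference: tuned soft-prompt minus its initialization."""
    return [
        [t - i for t, i in zip(tuned_row, init_row)]
        for tuned_row, init_row in zip(tuned_prompt, init_prompt)
    ]


def combine(init_prompt, vectors):
    """Add several task prompt vectors onto an initialization (assumed usage)."""
    combined = [row[:] for row in init_prompt]  # copy, leave the input intact
    for vec in vectors:
        for r, row in enumerate(vec):
            for c, value in enumerate(row):
                combined[r][c] += value
    return combined


if __name__ == "__main__":
    init = [[0.0, 1.0], [2.0, 3.0]]       # toy 2-token, 2-dim soft prompt
    tuned_a = [[1.0, 1.0], [2.0, 5.0]]    # hypothetical result for task A
    tuned_b = [[0.0, 0.0], [3.0, 3.0]]    # hypothetical result for task B
    vec_a = task_prompt_vector(tuned_a, init)
    vec_b = task_prompt_vector(tuned_b, init)
    print(combine(init, [vec_a, vec_b]))
```

The combined prompt could then serve as the starting point for prompt tuning on a related task, which is the low-resource initialization use case the abstract reports.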
Article 663
Title@2025-07-03 (4): Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants
Title: Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants | Hanzi als Narrative Bridges herstellen: Ein KI-Co-Creation-Workshop für ältere Migranten | 将Hanzi编成叙述性桥梁:大赦国际为老年移民举办的共同创造讲习班 2507.01548v2 |
Authors (4): Wen Zhan, Ziqun Hua, Peiyue Lin, Yunfei Chen
This paper explores how older adults, particularly aging migrants in urban China, can engage in AI-assisted co-creation to express personal narratives that are often fragmented, underrepresented, or difficult to verbalize. Through a pilot workshop combining oral storytelling and the symbolic reconstruction of Hanzi, participants shared memories of migration and created new character forms using Xiaozhuan glyphs, suggested by a large language model (LLM), together with physical materials. Supported by human facilitation and a soft AI presence, participants transformed lived experience into visual and tactile expressions without requiring digital literacy. This approach offers new perspectives on human-AI collaboration and aging by repositioning AI not as a content producer but as a supportive mechanism, and by supporting narrative agency within sociotechnical systems.
Article 664
Title@2025-07-03 (4): A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages
Title: A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages | Ein Kochbuch für die gemeinschaftsorientierte Datenerfassung von schwachen Sprachkenntnissen in LowResource-Sprachen | 社区驱动的低成本低资源语言有缺陷演讲数据收集手册 2507.02428v1 |
Authors (10): Sumaya Ahmed Salihs, Isaac Wiafe, Jamal-Deen Abdulai, Elikem Doe Atsakpo, Gifty Ayoka, Richard Cave, Akon Obu Ekpezu, Catherine Holloway, Katrin Tomanek, Fiifi Baffoe Payin Winful
This study presents an approach for collecting speech samples to build Automatic Speech Recognition (ASR) models for impaired speech, particularly in low-resource languages. It aims to democratize ASR technology and data collection by developing a “cookbook” of best practices and training for community-driven data collection and ASR model building. As a proof-of-concept, this study curated the first open-source dataset of impaired speech in Akan, a widely spoken indigenous language in Ghana. The study involved participants from diverse backgrounds with speech impairments. The resulting dataset, along with the cookbook and open-source tools, is publicly available to enable researchers and practitioners to create inclusive ASR technologies tailored to the unique needs of speech-impaired individuals. In addition, this study presents the initial results of fine-tuning open-source ASR models to better recognize impaired speech in Akan.
Article 665
Title@2025-07-03 (4): Delving into LLM-assisted writing in biomedical publications through excess vocabulary
Title: Delving into LLM-assisted writing in biomedical publications through excess vocabulary | Eintauchen in LLM-unterstütztes Schreiben in biomedizinischen Publikationen durch überschüssiges Vokabular | 通过超量词汇,在生物医学出版物中进行LLM协助撰写 2406.07016v5 |
Authors (4): Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause
Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet many scientists use them for their scholarly writing. But how widespread is such LLM usage in the academic literature? To answer this question for the field of biomedical research, we present an unbiased, large-scale approach: we study vocabulary changes in over 15 million biomedical abstracts from 2010–2024 indexed by PubMed, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 13.5% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, reaching 40% for some subcorpora. We show that LLMs have had an unprecedented impact on scientific writing in biomedical research, surpassing the effect of major world events such as the Covid pandemic.
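The idea of an "excess" word frequency can be sketched as the gap between an observed frequency and a counterfactual frequency projected from earlier years. The sketch below assumes a plain least-squares linear trend as the projection; that choice, the function names, and the numbers are illustrative assumptions and are not taken from the paper.

```python
def expected_frequency(freq_by_year, target_year):
    """Fit a least-squares line through (year, frequency) points and evaluate it at target_year."""
    years = list(freq_by_year)
    freqs = [freq_by_year[y] for y in years]
    mean_x = sum(years) / len(years)
    mean_y = sum(freqs) / len(freqs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, freqs)) / sum(
        (x - mean_x) ** 2 for x in years
    )
    return mean_y + slope * (target_year - mean_x)


def excess(observed, expected):
    """Return (gap, ratio) between the observed and the projected frequency."""
    return observed - expected, observed / expected


# Made-up share of abstracts containing some style word in the years before LLMs.
history = {2019: 0.0010, 2020: 0.0011, 2021: 0.0012, 2022: 0.0013}
projected_2024 = expected_frequency(history, 2024)
gap, ratio = excess(observed=0.0060, expected=projected_2024)
```

A large gap or ratio for a word flags it as a candidate excess word; how such gaps are aggregated into a lower bound on LLM usage is specific to the paper and is not reproduced here.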
Article 666
Title@2025-07-03 (4): K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function
Title: K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function | K-Funktion: Gemeinsame Aussprache Transkription und Feedback zur Bewertung der Kindersprache-Funktion | K-功能:用于评估儿童语言功能的联合读音记录和反馈 2507.03043v1 |
Authors (17): Shuhe Li, Chenxu Guo, Jiachen Lian, Cheol Jun Cho, Wenshuo Zhao, Xuanru Zhou, Dingkun Zhou, Sam Wang, Grace Wang, Jingze Yang, Jingyi Xu, Ruohan Bao, Elise Brenner, Brandon In, Francesca Pei, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli
Early evaluation of children’s language is frustrated by the high pitch, long phones, and sparse data that derail automatic speech recognisers. We introduce K-Function, a unified framework that combines accurate sub-word transcription, objective scoring, and actionable feedback. Its core, Kids-WFST, merges a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specific errors while remaining fully interpretable. Kids-WFST attains a 1.39% phoneme error rate on MyST and 8.61% on Multitudes, absolute gains of 10.47 and 7.06 points over a greedy-search decoder. These high-fidelity transcripts power an LLM that grades verbal skills, milestones, reading, and comprehension, aligning with human proctors and supplying tongue-and-lip visualizations plus targeted advice. The results show that precise phoneme recognition cements a complete diagnostic-feedback loop, paving the way for scalable, clinician-ready language assessment.
Article 667
Title@2025-07-03 (4): Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability
Title: Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability | Benchmarking Akan ASR-Modelle über Domain-spezifische Datensätze: Eine vergleichende Bewertung von Leistung, Skalierbarkeit und Anpassungsfähigkeit | 确定Akan ASR模型基准的全域具体数据集:业绩比较评价、可缩放性和可调适性 2507.02407v1 |
Authors (8): Mark Atta Mensah, Isaac Wiafe, Akon Ekpezu, Justice Kwame Appati, Jamal-Deen Abdulai, Akosua Nyarkoa Wiafe-Akenten, Frank Ernest Yeboah, Gifty Odame
Most existing automatic speech recognition (ASR) research evaluates models using in-domain datasets and seldom examines how they generalize across diverse speech contexts. This study addresses this gap by benchmarking seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, using four Akan speech corpora to determine their performance. These datasets encompass various domains, including culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. A comparison of the word error rate and character error rate highlighted domain dependency, with models performing optimally only within their training domains while showing marked accuracy degradation in mismatched scenarios. This study also identified distinct error behaviors between the Whisper and Wav2Vec2 architectures. Whereas fine-tuned Whisper Akan models led to more fluent but potentially misleading transcription errors, Wav2Vec2 produced more obvious yet less interpretable outputs when encountering unfamiliar inputs. This trade-off between readability and transparency in ASR errors should be considered when selecting architectures for low-resource language (LRL) applications. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other LRLs.
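The two metrics compared in this abstract, word error rate (WER) and character error rate (CER), are both edit distances normalized by reference length. A minimal sketch using the standard Levenshtein distance follows; the English example strings are placeholders and not data from the study.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "the cat sat"
hypothesis = "the cat sit"
word_error = wer(reference, hypothesis)  # one wrong word out of three
char_error = cer(reference, hypothesis)  # one wrong character out of eleven
```

A single substituted character costs a full word under WER but only one character under CER, which is one reason the two metrics are usually reported together.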
Article 668
Title@2025-07-03 (4): AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation
Title: AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation | AIn’t Not Nothing But a Survey? Mit großen Sprachmodellen für die Codierung Deutsch Open-Ended Survey Responses on Survey Motivation | 使用大语言模型对德国关于调查动机的开放式调查答复进行编码 2506.14634v3 |
Authors (4): Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessica Daikeler
The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs’ performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs’ unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.
Article 669
Title@2025-07-03 (4): Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction
Title: Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction | Dynamischer Langzeit-Speicherspeicher auf Kurzzeit-Speicherbasis für Langzeit-LLM-Interaktion | 动态长地平线LLM互动的短期内存存储 2507.03042v1 |
Authors (2): Yuyang Lou, Charles Li
Memory storage for large language models (LLMs) is becoming an increasingly active area of research, particularly for enabling personalization across long conversations. We propose Pref-LSTM, a dynamic and lightweight framework that combines a BERT-based classifier with an LSTM memory module that generates a memory embedding, which is then injected as a soft prompt into a frozen LLM. We synthetically curate a dataset of preference and non-preference conversation turns to train our BERT-based classifier. Although our LSTM-based memory encoder did not yield strong results, we find that the BERT-based classifier performs reliably in identifying explicit and implicit user preferences. Our research demonstrates the viability of using preference filtering with LSTM gating principles as an efficient path towards scalable user preference modeling, without extensive overhead and fine-tuning.
Article 670
Title@2025-07-03 (4): JoyTTS: LLM-based Spoken Chatbot With Voice Cloning
Title: JoyTTS: LLM-based Spoken Chatbot With Voice Cloning | JoyTTS: LLM-basierter gesprochener Chatbot mit Voice Cloning | 以LLM为基地的 “ 配有语音克隆的口音聊天机器人 “ 2507.02380v1 |
Authors (3): Fangru Zhou, Jun Zhao, Guoxin Wang
JoyTTS is an end-to-end spoken chatbot that combines a large language model (LLM) with text-to-speech (TTS) technology and features voice cloning capabilities. This project is built upon the open-source MiniCPM-o and CosyVoice2 models and trained on 2000 hours of conversational data. We also provide the complete training code to facilitate further development and optimization by the community. On the seed-tts-zh test set, it achieves an SS (speaker similarity) score of 0.73 and a WER (word error rate) of 5.09. The code and models, along with training and inference scripts, are available at https://github.com/jdh-algo/JoyTTS.git.
Article 671
Title@2025-07-03 (4): Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection
Title: Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection | Effiziente Code-LLM-Schulung über Distribution-Konsistenz und Diversity-Aware-Datenauswahl | 通过分配和多样性软件数据选择进行高效率的守则LLM培训 2507.02378v1 |
Authors (3): Weijie Lyu, Sheng-Jun Huang, Xuan Xia
Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that utilizes a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that, using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs.
Article 672
Title@2025-07-03 (4): QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers
Title: QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers | QFFN-BERT: Eine empirische Studie über Tiefe, Leistung und Dateneffizienz in hybriden Quantum-Klassischen Transformern | QFFN-BERT:对混合量子-分类变异器的深度、性能和数据效率的经验研究 2507.02364v1 |
Authors (1): Pilsung Kang
Parameterized quantum circuits (PQCs) have recently emerged as promising components for enhancing the expressibility of neural architectures. In this work, we introduce QFFN-BERT, a hybrid quantum-classical transformer where the feedforward network (FFN) modules of a compact BERT variant are replaced by PQC-based layers. This design is motivated by the dominant parameter contribution of FFNs, which account for approximately two-thirds of the parameters within standard Transformer encoder blocks. While prior studies have primarily integrated PQCs into self-attention modules, our work focuses on the FFN and systematically investigates the trade-offs between PQC depth, expressibility, and trainability. Our final PQC architecture incorporates a residual connection, both $R_Y$ and $R_Z$ rotations, and an alternating entanglement strategy to ensure stable training and high expressibility. Our experiments on the SST-2 and DBpedia benchmarks, conducted on a classical simulator, demonstrate two key findings. First, a carefully configured QFFN-BERT achieves up to 102.0% of the baseline accuracy, surpassing its classical counterpart in a full-data setting while reducing FFN-specific parameters by over 99%. Second, our model exhibits a consistent and competitive edge in few-shot learning scenarios, confirming its potential for superior data efficiency. These results, supported by an ablation study on a non-optimized PQC that failed to learn, confirm that PQCs can serve as powerful and parameter-efficient alternatives to classical FFNs when co-designed with foundational deep learning principles.
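The "approximately two-thirds" figure can be checked with a short parameter count. The derivation below assumes the common encoder configuration with hidden size $d$ and an FFN inner dimension of $4d$, and it ignores biases and layer-norm parameters; the compact BERT variant used in the paper may differ.

```latex
\begin{align*}
P_{\text{attn}} &= 4d^2
  && \text{(four } d \times d \text{ projections: } W_Q, W_K, W_V, W_O\text{)} \\
P_{\text{FFN}} &= d \cdot 4d + 4d \cdot d = 8d^2
  && \text{(up-projection and down-projection)} \\
\frac{P_{\text{FFN}}}{P_{\text{attn}} + P_{\text{FFN}}} &= \frac{8d^2}{12d^2} = \frac{2}{3}
\end{align*}
```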
Article 673
Title@2025-07-03 (4): Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning
Title: Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning | Verbesserung der Robustheit der distantly-überwachten Anerkennung von Personen mit Namen durch unsicheres Lehrerlernen und studentisch-studentisches kollaboratives Lernen | 通过不确定-软件教师学习和学生-学生合作学习,提高以不确定-软件教师学习和学生-学生合作学习的方式,提高以不确定-软件命名的实体识别的力度 2311.08010v3 |
Authors (7): Shuzheng Si, Helan Hu, Haozhe Zhao, Shuang Zeng, Kaikai An, Zefan Cai, Baobao Chang
Distantly-Supervised Named Entity Recognition (DS-NER) is widely used in real-world scenarios. It can effectively alleviate the burden of annotation by matching entities in existing knowledge bases with snippets in the text, but it suffers from label noise. Recent works attempt to adopt the teacher-student framework to gradually refine the training labels and improve the overall robustness. However, these teacher-student methods achieve limited performance because the poor calibration of the teacher network produces incorrectly pseudo-labeled samples, leading to error propagation. Therefore, we propose: (1) Uncertainty-Aware Teacher Learning, which leverages the prediction uncertainty to reduce the number of incorrect pseudo labels in the self-training stage; (2) Student-Student Collaborative Learning, which allows the transfer of reliable labels between two student networks instead of indiscriminately relying on all pseudo labels from the teacher, and further enables a full exploration of mislabeled samples rather than simply filtering unreliable pseudo-labeled samples. We evaluate our proposed method on five DS-NER datasets, demonstrating that our method is superior to the state-of-the-art DS-NER methods.
Article 674
Title@2025-07-03 (4): Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models
Title: Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models | Coling-UniA bei SciVQA 2025: Wenig-heißes Beispiel Retrieval und Vertrauen-informierte Montage für multimodale große Sprachmodelle | 在SciVQA 2025 SciVQA 的Coling-UniA:多式大语言模型的很少热实例检索和信任化组合 2507.02357v1 |
Authors (3): Christian Jaumann, Annemarie Friedrich, Rainer Lienhart
This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models’ confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTScore. Our code is publicly available.
Article 675
Title@2025-07-03 (4): Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation
Title: Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation | Einschließlich LLMs für großräumige Urban Complex Mobility Simulation | 大型城市综合流动模拟项目LLMs 2505.21880v2 |
Authors (8): Yu-Lun Song, Chung-En Tsern, Che-Cheng Wu, Yu-Ming Chang, Syuan-Bo Huang, Wei-Chu Chen, Michael Chia-Liang Lin, Yu-Ta Lin
This study presents an innovative approach to urban mobility simulation by integrating a Large Language Model (LLM) with Agent-Based Modeling (ABM). Unlike traditional rule-based ABM, the proposed framework leverages an LLM to enhance agent diversity and realism by generating synthetic population profiles, allocating routine and occasional locations, and simulating personalized routes. Using real-world data, the simulation models individual behaviors and large-scale mobility patterns in Taipei City. Key insights, such as route heat maps and mode-specific indicators, provide urban planners with actionable information for policy-making. Future work will focus on establishing robust validation frameworks to ensure accuracy and reliability in urban planning applications.
Article 676
Title@2025-07-03 (4): Decision-Oriented Text Evaluation
Title: Decision-Oriented Text Evaluation | Entscheidungsorientierte Textbewertung | 注重决定的案文评价 2507.01923v2 |
Authors (3): Yu-Shiang Huang, Chuan-Ju Wang, Chung-Chi Chen
Natural language generation (NLG) is increasingly deployed in high-stakes domains, yet common intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, correlate only weakly with actual decision-making efficacy. We propose a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using market digest texts (including objective morning summaries and subjective closing-bell analyses) as test cases, we assess decision quality based on the financial performance of trades executed by human investors and autonomous LLM agents informed exclusively by these texts. Our findings reveal that neither humans nor LLM agents consistently surpass random performance when relying solely on summaries. However, richer analytical commentaries enable collaborative human-LLM teams to outperform individual human or agent baselines significantly. Our approach underscores the importance of evaluating generated text by its ability to facilitate synergistic decision-making between humans and LLMs, highlighting critical limitations of traditional intrinsic metrics.
Article 677
Title@2025-07-03 (4): Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs
Title: Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs | Token Prepending: Ein trainingsfreier Ansatz zur Eliziierung besserer Sentence-Embeddings von LLMs | Token Predudo:从LLM女士那里采用不培训办法,使判刑内容更好地嵌入Elibear 2412.11556v2 |
Authors (7): Yuchen Fu, Zifeng Cheng, Zhiwei Jiang, Zhonghui Wang, Yafeng Yin, Zhengliang Li, Qing Gu
Extracting sentence embeddings from large language models (LLMs) is a promising direction, as LLMs have demonstrated stronger semantic understanding capabilities. Previous studies typically focus on prompt engineering to elicit sentence embeddings from LLMs by prompting the model to encode sentence information into the embedding of the last token. However, LLMs are mostly decoder-only models with causal attention, so earlier tokens in the sentence cannot attend to later tokens, resulting in biased encoding of sentence information and cascading effects on the final decoded token. To this end, we propose a novel Token Prepending (TP) technique that prepends each layer’s decoded sentence embedding to the beginning of the sentence in the next layer’s input, allowing earlier tokens to attend to the complete sentence information under the causal attention mechanism. The proposed TP technique is plug-and-play and training-free, which means it can be seamlessly integrated with various prompt-based sentence embedding methods and autoregressive LLMs. Extensive experiments on various Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our proposed TP technique can significantly improve the performance of existing prompt-based sentence embedding methods across different LLMs, while incurring negligible additional inference cost.
Article 678
Title@2025-07-03 (4): Layered Insights: Generalizable Analysis of Authorial Style by Leveraging All Transformer Layers
Title: Layered Insights: Generalizable Analysis of Authorial Style by Leveraging All Transformer Layers | Layered Insights: Generalisierbare Analyse des Autorial Styles durch Hebelisierung aller Transformer Layers | 图层透视: 通过利用所有变换层对文件样式的通用分析 2503.00958v2 |
Authors (5): Milad Alshomary, Nikhil Reddy Varimalla, Vishal Anand, Smaranda Muresan, Kathleen McKeown
We propose a new approach for the authorship attribution task that leverages the various linguistic representations learned at different layers of pre-trained transformer-based models. We evaluate our approach on three datasets, comparing it to a state-of-the-art baseline in in-domain and out-of-domain scenarios. We found that utilizing various transformer layers improves the robustness of authorship attribution models when tested on out-of-domain data, resulting in new state-of-the-art results. Our analysis gives further insights into how our model’s different layers get specialized in representing certain stylistic features that benefit the model when tested out of the domain.
Article 679
Title@2025-07-03 (4): Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Title: Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy | Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy | Skywork-Reward-V2:通过人类-AI协同增强优先数据曲线 2507.01352v2 |
Authors (12): Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou
Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.
Article 680
Title@2025-07-03 (4): Cautious Next Token Prediction
Title: Cautious Next Token Prediction | Vorsichtige nächste Zeichen Vorhersage | 谨慎的次下 Tok 预测 2507.03038v1 |
Authors (10): Yizhou Wang, Lingzhi Zhang, Yue Bai, Mang Tik Chiu, Zhengmian Hu, Mingyuan Zhang, Qihua Dong, Yu Yin, Sohrab Amirghodsi, Yun Fu
The next-token prediction paradigm has prevailed for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such an approach leads to inferior performance in various NLP tasks when the model is not certain about testing questions. To this end, we propose a new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we independently sample multiple trials starting from that step and stop each one when it encounters any punctuation. Then we select the trial with the lowest perplexity score, viewed as the most probable and reliable trial path given the model’s capacity. The trial number is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach consistently outperforms existing standard decoding strategies by a clear margin. Moreover, the integration of CNTP with self-consistency can further improve over vanilla self-consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at https://github.com/wyzjack/CNTP.
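The decoding rule described in this abstract can be sketched against a toy model. This is a minimal illustration, not the released implementation: the `model` callable (mapping a token prefix to a next-token probability dictionary), the entropy threshold, the trial schedule, and `toy_model` are all assumptions made for the example.

```python
import math
import random

PUNCTUATION = {".", ",", "!", "?", ";"}


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def sample(probs, rng):
    """Draw one token from a {token: probability} dictionary."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


def sample_trial(model, prefix, rng, max_len=20):
    """Sample tokens until punctuation (or max_len); return (tokens, perplexity)."""
    tokens, log_prob = [], 0.0
    while len(tokens) < max_len:
        probs = model(prefix + tokens)
        tok = sample(probs, rng)
        tokens.append(tok)
        log_prob += math.log(probs[tok])
        if tok in PUNCTUATION:
            break
    return tokens, math.exp(-log_prob / len(tokens))


def cautious_step(model, prefix, rng, entropy_threshold=0.5, max_trials=5):
    """Return the next token(s): one sampled token if confident, else the best of several trials."""
    probs = model(prefix)
    h = entropy(probs)
    if h <= entropy_threshold:
        return [sample(probs, rng)]
    # Hypothetical schedule: more entropy means more trials, capped at max_trials.
    max_h = math.log(len(probs))
    n_trials = min(max_trials, max(2, math.ceil(max_trials * h / max_h)))
    trials = [sample_trial(model, prefix, rng) for _ in range(n_trials)]
    return min(trials, key=lambda t: t[1])[0]  # lowest perplexity wins


def toy_model(prefix):
    """Made-up model: uncertain about the first token, then always ends the clause."""
    if not prefix:
        return {"yes": 0.5, "no": 0.5}
    return {".": 1.0}


rng = random.Random(0)
chosen = cautious_step(toy_model, [], rng)
```

In the toy run the first step is high-entropy, so several trials are sampled up to the first punctuation mark and the lowest-perplexity one is kept.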
Article 681
Title@2025-07-03 (4): Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach
Title: Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach | Ausrichten von gefrorenen LLMs durch Verstärkungslernen: Ein iteratives Reweight-then-Optimize-Ansatz | 通过强化学习将冻结的LLMs与 “ 强化学习:一种过渡性再加权再优化方法 “ 相匹配 2506.17828v2 |
Authors (9): Xinnan Zhang, Chenliang Li, Siliang Zeng, Jiaxiang Li, Zhongruo Wang, Kaixiang Lin, Songtao Lu, Alfredo Garcia, Mingyi Hong
Aligning large language models (LLMs) with human preferences usually requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used at test time to improve model performance, nor are they applicable when the model weights are not accessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of the (frozen) base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples using current value functions, and (iii) trains a new lightweight value function that guides the next decoding pass. At test time, the value functions are used to guide the base model generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI’s reinforcement fine-tuning (RFT), but without requiring access to the model weights.
Article 682
Title@2025-07-03 (4): Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Title: Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding | Fast-dLLM: Trainingsfreie Beschleunigung von Diffusion LLM durch Ermöglichen von KV Cache und Paralleldecoding | 快速dLLM:通过授权 KV 缓存和平行编码加速免培训传播LLM 2505.22618v3 |
Authors (9): Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
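The confidence-aware parallel decoding idea, committing only the masked positions whose top prediction clears a threshold, can be sketched as one decoding step. The function name, the `<mask>` placeholder, the probabilities, and the fallback of committing the single most confident position when nothing clears the threshold are assumptions for illustration, not the released implementation.

```python
def confidence_aware_step(tokens, position_probs, threshold=0.9, mask="<mask>"):
    """Fill in every masked position whose top prediction has probability >= threshold.

    tokens: list of tokens, with `mask` at undecoded positions.
    position_probs: {position: {token: probability}} for the masked positions.
    """
    # Top candidate and its confidence for each still-masked position.
    candidates = {}
    for pos, probs in position_probs.items():
        if tokens[pos] != mask:
            continue
        best = max(probs, key=probs.get)
        candidates[pos] = (best, probs[best])
    if not candidates:
        return list(tokens)

    chosen = {p: c for p, c in candidates.items() if c[1] >= threshold}
    if not chosen:
        # Assumed fallback: commit the single most confident position to guarantee progress.
        top = max(candidates, key=lambda p: candidates[p][1])
        chosen = {top: candidates[top]}

    out = list(tokens)
    for pos, (tok, _) in chosen.items():
        out[pos] = tok
    return out


sequence = ["The", "<mask>", "<mask>"]
predictions = {
    1: {"cat": 0.95, "dog": 0.05},  # confident: decoded in this step
    2: {"sat": 0.60, "ran": 0.40},  # uncertain: left masked for a later step
}
after_one_step = confidence_aware_step(sequence, predictions)
```

Positions left masked are re-predicted in the next step, conditioned on the tokens just committed, which is how the strategy limits the dependency violations mentioned in the abstract.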
Article 683
Title@2025-07-03 (4): Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient
Title: Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient | Bypass Back-Propagation: Optimierungsbasiertes Structural Pruning für große Sprachmodelle über Policy Gradient | Bypass 后回通信:通过 “ 政策梯度 “ 优化基于优化的结构结构,为大语言模式提供缓冲 2406.10576v3 |
Authors (5): Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia
Recent pruning methods for large language models (LLMs) typically operate in the post-training phase without expensive weight fine-tuning. However, their pruning criteria often rely on heuristic, hand-crafted metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning method that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve efficiency, our method eliminates back-propagation through the LLM itself during optimization, requiring only the forward pass of the LLM. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, facilitating efficient optimization via a policy gradient estimator without back-propagation. Thus, our method can 1) support global and heterogeneous pruning (i.e., automatically determine different redundancy for different layers), and 2) optionally initialize with a metric-based method (for our Bernoulli distributions). Extensive experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models using the C4 and WikiText2 datasets demonstrate the promising performance of our method in efficiency and effectiveness. Code is available at https://github.com/ethanygao/backprop-free_LLM_pruning.
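The idea of learning Bernoulli mask probabilities from forward evaluations alone can be illustrated with a toy score-function (REINFORCE) estimator. This is a sketch under assumptions, not the paper's algorithm: `learn_masks`, `toy_loss`, the batch-mean baseline, the logit parameterization, and all hyperparameters are hypothetical, and the "model" is a hand-written loss over four units.

```python
import math
import random
from typing import Callable, List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def learn_masks(loss_fn: Callable[[Sequence[int]], float], n_units: int,
                steps: int = 500, batch: int = 16, lr: float = 0.1,
                seed: int = 0) -> List[float]:
    """Learn keep-probabilities for each unit using only calls to loss_fn."""
    rng = random.Random(seed)
    logits = [0.0] * n_units  # keep-probability starts at 0.5 for every unit
    for _ in range(steps):
        probs = [sigmoid(t) for t in logits]
        masks = [[1 if rng.random() < p else 0 for p in probs]
                 for _ in range(batch)]
        # Forward evaluations only; no gradient flows through loss_fn.
        losses = [loss_fn(m) for m in masks]
        baseline = sum(losses) / batch  # variance-reduction baseline (assumed)
        for i in range(n_units):
            # d/d(logit_i) of log Bernoulli(m_i; p_i) equals (m_i - p_i).
            grad = sum((loss - baseline) * (m[i] - probs[i])
                       for loss, m in zip(losses, masks)) / batch
            logits[i] -= lr * grad
    return [sigmoid(t) for t in logits]


def toy_loss(mask: Sequence[int]) -> float:
    """Hypothetical loss: pruning important units is costly, keeping units costs 1."""
    importance = [5.0, 0.1, 0.1, 3.0]
    pruned_cost = sum(w * (1 - m) for w, m in zip(importance, mask))
    return pruned_cost + 1.0 * sum(mask)
```

In this toy setting the learned probabilities drift toward keeping the two high-importance units and pruning the two low-importance ones, because the estimator only needs the scalar loss of each sampled mask.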
Article 684
Title@2025-07-03 (4): DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning
Title: DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning | DoMIX: Ein effizientes Framework zur Nutzung von Domain-Wissen im Feintuning | DoMIX:一个在微调中利用域知识的有效框架 2507.02302v1 |
Authors (3): Dohoon Kim, Donghun Kang, Taesup Moon
Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at https://github.com/dohoonkim-ai/DoMIX.
Article 685
Title@2025-07-03 (4): Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models
Title: Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models | Commander-GPT: Die Fähigkeit von Multi-Modal Large Language Models, den Sarkasmus vollständig zu entleeren | GPT指挥官:完全解除多模式大语言模型的讽刺性探测能力 2503.18681v3 |
Authors (4): Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin
Sarcasm detection, as a crucial research direction in the field of Natural Language Processing (NLP), has attracted widespread attention. Traditional sarcasm detection tasks have typically focused on single-modal approaches (e.g., text), but due to the implicit and subtle nature of sarcasm, such methods often fail to yield satisfactory results. In recent years, researchers have shifted the focus of sarcasm detection to multi-modal approaches. However, effectively leveraging multi-modal information to accurately identify sarcastic content remains a challenge that warrants further exploration. Leveraging the powerful integrated processing capabilities of Multi-Modal Large Language Models (MLLMs) for various information sources, we propose an innovative multi-modal Commander-GPT framework. Inspired by military strategy, we first decompose the sarcasm detection task into six distinct sub-tasks. A central commander (decision-maker) then assigns the best-suited large language model to address each specific sub-task. Ultimately, the detection results from each model are aggregated to identify sarcasm. We conducted extensive experiments on MMSD and MMSD 2.0, utilizing four multi-modal large language models and six prompting strategies. Our experiments demonstrate that our approach achieves state-of-the-art performance, with a 19.3% improvement in F1 score, without necessitating fine-tuning or ground-truth rationales.
Article 686
Title@2025-07-03 (4): Prompt-Guided Turn-Taking Prediction
Title: Prompt-Guided Turn-Taking Prediction | Prompt-geführte Turn-Taking-Vorhersage | 即时指导的回转预测 2506.21191v2 |
Authors (7): Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Divesh Lala, Keiko Ochi, Tatsuya Kawahara
Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as “faster” or “calmer”, adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.
Article 687
Title@2025-07-03 (4): Optimal strategies to perform multilingual analysis of social content for a novel dataset in the tourism domain
Title: Optimal strategies to perform multilingual analysis of social content for a novel dataset in the tourism domain | Optimale Strategien zur mehrsprachigen Analyse sozialer Inhalte für einen neuartigen Datensatz im Tourismusbereich | 为旅游领域新数据集的社会内容进行多语种社会内容分析的最佳最佳战略 2311.14727v2 |
Authors (6): Maxime Masson, Rodrigo Agerri, Christian Sallaberry, Marie-Noelle Bessagnet, Annig Le Parc Lacayrelle, Philippe Roose
The rising influence of social media platforms in various domains, including tourism, has highlighted the growing need for efficient and automated Natural Language Processing (NLP) strategies to take advantage of this valuable resource. However, the transformation of multilingual, unstructured, and informal texts into structured knowledge still poses significant challenges, most notably the never-ending requirement for manually annotated data to train deep learning classifiers. In this work, we study different NLP techniques to identify those that achieve competitive performance while keeping the need for annotated training data to a minimum. To do so, we built the first publicly available multilingual dataset (French, English, and Spanish) for the tourism domain, composed of tourism-related tweets. The dataset includes multilayered, manually revised annotations for Named Entity Recognition (NER) for Locations and Fine-grained Thematic Concepts Extraction mapped to the Thesaurus of Tourism and Leisure Activities of the World Tourism Organization, as well as for Sentiment Analysis at the tweet level. Extensive experimentation comparing various few-shot and fine-tuning techniques with modern language models demonstrates that modern few-shot techniques allow us to obtain competitive results for all three tasks with very little annotation data: 5 tweets per label (15 in total) for Sentiment Analysis, 30 tweets for Named Entity Recognition of Locations, and 1K tweets annotated with fine-grained thematic concepts, a highly fine-grained sequence labeling task based on an inventory of 315 classes. We believe that our results, grounded in a novel dataset, pave the way for applying NLP to new domain-specific applications, reducing the need for manual annotations and circumventing the complexities of rule-based, ad-hoc solutions.
Article 688
Title@2025-07-03 (4): Seeing Through Green: Text-Based Classification and the Firm’s Returns from Green Patents
Title: Seeing Through Green: Text-Based Classification and the Firm’s Returns from Green Patents | Durch Grün sehen: Textbasierte Klassifizierung und die Rückkehr der Firma aus grünen Patenten | 通过 “ 绿色观光:基于文本的分类和公司从绿色专利的回报 “ 2507.02287v1 |
Authors (3): Lapo Santarlasci, Armando Rungi, Antonio Zinilli
This paper introduces Natural Language Processing for identifying “true” green patents from official supporting documents. We start our training on about 12.4 million patents that had been classified as green from previous literature. Thus, we train a simple neural network to enlarge a baseline dictionary through vector representations of expressions related to environmental technologies. After testing, we find that “true” green patents represent about 20% of the total of patents classified as green from previous literature. We show heterogeneity by technological classes, and then check that “true” green patents are about 1% less cited by following inventions. In the second part of the paper, we test the relationship between patenting and a dashboard of firm-level financial accounts in the European Union. After controlling for reverse causality, we show that holding at least one “true” green patent raises sales, market shares, and productivity. If we restrict the analysis to high-novelty “true” green patents, we find that they also yield higher profits. Our findings underscore the importance of using text analyses to gauge finer-grained patent classifications that are useful for policymaking in different domains.
Article 689
Title@2025-07-03 (4): Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments
Title: Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments | Kausales Repräsentationslernen mit generativer Künstlicher Intelligenz: Anwendung auf Texte als Behandlungen | 产生人工智能的因果代表性学习:应用文字作为治疗 2410.00903v3 |
Authors (2): Kosuke Imai, Kentaro Nakamura
In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments like texts, by leveraging the power of generative Artificial Intelligence (GenAI). Specifically, we propose to use deep generative models such as large language models (LLMs) to efficiently generate treatments and use their internal representation for subsequent causal effect estimation. We show that the knowledge of this true internal representation helps disentangle the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike existing methods, the proposed GenAI-Powered Inference (GPI) methodology eliminates the need to learn causal representation from the data, and hence produces more accurate and efficient estimates. We formally establish the conditions required for the nonparametric identification of the average treatment effect, propose an estimation strategy that avoids the violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator through the application of double machine learning. Finally, using an instrumental variables approach, we extend the proposed methodology to settings in which the treatment feature is based on human perception. The proposed GPI methodology is also applicable to text reuse where an LLM is used to regenerate existing texts. We conduct simulation and empirical studies, using the generated text data from an open-source LLM, Llama 3, to illustrate the advantages of our estimator over state-of-the-art causal representation learning algorithms.
Article 690
Title@2025-07-03 (4): SMARTe: Slot-based Method for Accountable Relational Triple extraction
Title: SMARTe: Slot-based Method for Accountable Relational Triple extraction | SMARTe: Slot-basierte Methode für die relationale Triple-Extraktion | SMARTE: 衡算关系三重采掘的基于固态方法 2504.12816v3 |
Authors (2): Xue Wen Tan, Stanley Kok
Relational Triple Extraction (RTE) is a fundamental task in Natural Language Processing (NLP). However, prior research has primarily focused on optimizing model performance, with limited efforts to understand the internal mechanisms driving these models. Many existing methods rely on complex preprocessing to induce specific interactions, often resulting in opaque systems that may not fully align with their theoretical foundations. To address these limitations, we propose SMARTe: a Slot-based Method for Accountable Relational Triple extraction. SMARTe introduces intrinsic interpretability through a slot attention mechanism and frames the task as a set prediction problem. Slot attention consolidates relevant information into distinct slots, ensuring all predictions can be explicitly traced to learned slot representations and the tokens contributing to each predicted relational triple. While emphasizing interpretability, SMARTe achieves performance comparable to state-of-the-art models. Evaluations on the NYT and WebNLG datasets demonstrate that adding interpretability does not compromise performance. Furthermore, we conducted qualitative assessments to showcase the explanations provided by SMARTe, using attention heatmaps that map to their respective tokens. We conclude with a discussion of our findings and propose directions for future research. Our code is available at https://github.com/Chen-XueWen/SMARTe.
Article 691
Title@2025-07-03 (4): MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
Title: MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent | MemAgent: Umgestalten von Langkontext-LLM mit Multi-Conv RL-basierten Speicheragenten | MemerAgent: 与基于多Conv RL的内存代理重塑长文本LLM 2507.02259v1 |
Authors (11): Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, Hao Zhou
Despite improvements from length extrapolation, efficient attention, and memory modules, handling infinitely long documents with linear complexity and without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent demonstrates superb long-context capabilities: it extrapolates from an 8K context, trained on 32K text, to a 3.5M QA task with less than 5% performance loss, and achieves over 95% on the 512K RULER test.
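The segment-by-segment reading loop with an overwritten, fixed-size memory can be sketched as follows. This is an illustration under assumptions, not the MemAgent code: `read_with_memory`, `chunk`, and `keyword_update` are hypothetical names, sizes are counted in whitespace-separated words instead of tokens, and the keyword-matching update merely stands in for the model call that would rewrite the memory.

```python
from typing import Callable, List


def chunk(text: str, size: int) -> List[str]:
    """Split text into consecutive segments of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def read_with_memory(text: str, question: str,
                     update: Callable[[str, str, str], str],
                     chunk_words: int = 8, memory_words: int = 16) -> str:
    """Read segments in order, replacing the memory after each one."""
    memory = ""
    for segment in chunk(text, chunk_words):
        # Overwrite: whatever `update` returns becomes the whole memory.
        memory = update(memory, segment, question)
        # Enforce the fixed memory budget so per-step cost stays constant.
        memory = " ".join(memory.split()[:memory_words])
    return memory


def keyword_update(memory: str, segment: str, question: str) -> str:
    """Toy stand-in for the model: keep segments that share a question keyword."""
    keywords = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    hit = any(w.lower().strip("?.,") in keywords for w in segment.split())
    return (memory + " " + segment).strip() if hit else memory
```

Because each step sees only one segment plus a bounded memory, the total work grows linearly with the number of segments. A fact that straddles a segment boundary can be missed by this toy update, which is one reason a learned update policy is attractive.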
Article 692
Title@2025-07-03 (4): Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks
Title: Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks | Schaltungs-Tuning: Mechanistischer Ansatz zur Identifizierung von Parameter Redundanz und Feinsteuerung neuraler Netzwerke | 电路调控:确定参数冗余和精微调整神经网络的机械化方法 2502.06106v2 |
Authors (4): Yueyan Li, Wenhao Gao, Caixia Yuan, Xiaojie Wang
The study of mechanistic interpretability aims to reverse-engineer a model to explain its behaviors. While recent studies have focused on the static mechanism of a certain behavior, the learning dynamics inside a model remain to be explored. In this work, we develop an interpretable fine-tuning method for analyzing the mechanism behind learning. We first introduce the concept of node-level intrinsic dimensionality to describe the learning process of a model in a computational graph. Based on our theory, we propose circuit-tuning, a two-stage algorithm that iteratively builds the minimal subgraph for a specific task and updates the key parameters in a heuristic way. Experimental results confirm the existence of the intrinsic dimensionality at the node level and demonstrate the effectiveness of our method for transparent and interpretable fine-tuning. We visualize and analyze the circuits before, during, and after fine-tuning, providing new insights into the self-organization mechanism of a neural network in the learning process.
Article 693
Title@2025-07-03 (4): Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies
Title: Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies | Mixture of Reasonings: Große Sprachmodelle mit adaptiven Strategien zur Vernunft bringen | 理由混合:与适应战略一道教授大语言模式 2507.00606v2 |
Authors (4): Tao Xiong, Xavier Hu, Wenyan Fan, Shengyu Zhang
Large language models (LLMs) excel in complex tasks through advanced prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency. We introduce Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs for autonomous, task-adaptive reasoning without external prompt engineering. MoR has two phases: Thought Generation, creating reasoning chain templates with models like GPT-4o, and SFT Dataset Construction, pairing templates with benchmark datasets for supervised fine-tuning. Our experiments show that MoR significantly enhances performance, with MoR150 achieving 0.730 (2.2% improvement) using CoT prompting and 0.734 (13.5% improvement) compared to baselines. MoR eliminates the need for task-specific prompts, offering a generalizable solution for robust reasoning across diverse tasks.
Article 694
Title@2025-07-03 (4): Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation
Title: Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation | Schutz der Privatsphäre, Erhöhung der Zugänglichkeit und Senkung der Kosten: Ein On-Device-Modell für künstliche Intelligenz für medizinische Transkription und Notizgenerierung | 维护隐私、增加无障碍和降低成本:医疗转移和笔记制作的现场人工人工智能模型 2507.03033v1 |
Authors (6): Johnson Thomas, Ayush Mudgal, Wendao Liu, Nisten Tahiraj, Zeeshaan Mohammed, Dhruv Diddi
Background: Clinical documentation represents a significant burden for healthcare providers, with physicians spending up to 2 hours daily on administrative tasks. Recent advances in large language models (LLMs) offer promising solutions, but privacy concerns and computational requirements limit their adoption in healthcare settings. Objective: To develop and evaluate a privacy-preserving, on-device medical transcription system using a fine-tuned Llama 3.2 1B model capable of generating structured medical notes from medical transcriptions while maintaining complete data sovereignty entirely in the browser. Methods: We fine-tuned a Llama 3.2 1B model using Parameter-Efficient Fine-Tuning (PEFT) with LoRA on 1,500 synthetic medical transcription-to-structured note pairs. The model was evaluated against the base Llama 3.2 1B on two datasets: 100 endocrinology transcripts and 140 modified ACI benchmark cases. Evaluation employed both statistical metrics (ROUGE, BERTScore, BLEURT) and LLM-as-judge assessments across multiple clinical quality dimensions. Results: The fine-tuned OnDevice model demonstrated substantial improvements over the base model. On the ACI benchmark, ROUGE-1 scores increased from 0.346 to 0.496, while BERTScore F1 improved from 0.832 to 0.866. Clinical quality assessments showed marked reduction in major hallucinations (from 85 to 35 cases) and enhanced factual correctness (2.81 to 3.54 on 5-point scale). Similar improvements were observed on the internal evaluation dataset, with composite scores increasing from 3.13 to 4.43 (+41.5%). Conclusions: Fine-tuning compact LLMs for medical transcription yields clinically meaningful improvements while enabling complete on-device browser deployment. This approach addresses key barriers to AI adoption in healthcare: privacy preservation, cost reduction, and accessibility for resource-constrained environments.
Article 695
Title@2025-07-03 (4): GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons
Title: GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons | GDC Cohort Copilot: Ein KI-Copilot für die Kuratierung von Kohorten aus den Genomic Data Commons | GDC Cohort Cohort 副驾驶:AI 基因组数据共同点的Curate Choorts联合驾驶员 2507.02221v1 |
Authors (5): Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L. Grossman
Motivation: The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. Availability and implementation: The standalone docker image for GDC Cohort Copilot is available at https://quay.io/repository/cdis/gdc-cohort-copilot. Source code is available at https://github.com/uc-cdis/gdc-cohort-copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds.
Article 696
Title@2025-07-03 (4): SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers
Title: SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers | SciGA: Ein umfassender Datensatz zur Gestaltung grafischer Abstracts in wissenschaftlichen Papieren | SciGA: 用于设计学术论文制图摘要的综合数据集 2507.02212v1 |
Authors (4): Takuro Kawada, Shunsuke Kitada, Sota Nemoto, Hitoshi Iyatomi
Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.