cs.CL @ 2025-07-11: 697

07-10 (4)

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Rückverfolgbare Beweise Verbesserte visuelle Grundierung: Bewertung und Methodik

增强视觉依据的理由:评价和方法

2507.07999v1

07-10

PyVision: Agentic Vision with Dynamic Tooling

PyVision: Agentische Vision mit dynamischem Werkzeug

视景:带有动态工具的 “ 动态展望 “ 。

2507.07998v1

07-10

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)

Operationalisierung eines Bedrohungsmodells für das Red-Teaming großer Sprachmodelle (LLMs)

实施 “ 红色组合大语言模型威胁模型 “ ;

2407.14937v2

07-10

Automating Expert-Level Medical Reasoning Evaluation of Large Language Models

Automatisieren von Experten-Level Medical Reasoning Bewertung von großen Sprachmodellen

对大语言模式进行自动化专家级医疗理由评估

2507.07988v1

07-10

Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology

Leistung und praktische Betrachtung von großen und kleinen Sprachmodellen in der klinischen Entscheidungsunterstützung in der Rheumatologie

风湿学临床决策支助中大型和小型语言模型的实用性及实用性考虑

2507.07983v1

07-10

Why is Your Language Model a Poor Implicit Reward Model?

Warum ist Ihr Sprachmodell ein schlechtes Implizit-Reward-Modell?

为什么您的语言模式是一个贫穷的隐含奖赏模式?

2507.07981v1

07-10

Long-Form Speech Generation with Spoken Language Models

Langformige Sprachgenerierung mit gesprochenen Sprachmodellen

具有口言语言模式的长形式语音一代

2412.18603v2

07-10

Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Wasserzeichen degradiert Ausrichtung in Sprachmodellen: Analyse und Milderung

语言模型的分级调整:分析和减轻影响

2506.04462v2

07-10

Scaling RL to Long Videos

Skalierung von RL zu langen Videos

缩放 RL 到长视频

2507.07966v1

07-10

MIRIX: Multi-Agent Memory System for LLM-Based Agents

MIRIX: Multi-Agent-Speichersystem für LLM-basierte Agenten

MIRIX:LLM药剂多机构内存系统

2507.07957v1

07-10

SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment

SEITE: Ein visuelles Sprachmodell zur Erkennung von Anomalien durch Fact Enhancement und Entropy-aware Alignment

SAGE:通过事实增强和对子对子体认知校正进行反常检测的视觉语言模型

2507.07939v1

07-10

Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration

Long Context Scaling: Teilen und Erobern durch multi-agent question-driven Collaboration

长期范围:通过多代理问题驱动的协作实现分化和征服

2505.20625v2

07-10

Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style

Kontexttreue in großen Sprachmodellen untersuchen: Die Rollen der Gedächtnisstärke und des Evidenzstils

调查大语言模型的内情:记忆力和证据风格的作用

2409.10955v2

07-10

DTECT: Dynamic Topic Explorer & Context Tracker

DTECT: Dynamischer Themen-Explorer & Kontext-Tracker

DTECT: 动态专题探索器和上下文跟踪器

2507.07910v1

07-10

A Survey on Latent Reasoning

Eine Umfrage über latente Vernunft

A. 关于长期原因的调查

2507.06203v2

07-10

Automating MD simulations for Proteins using Large language Models: NAMD-Agent

Automatisierung von MD-Simulationen für Proteine mit großen Sprachmodellen: NAMD-Agent

使用大语言模型( NADD- Agent) 自动自动模拟 Proteins 的 MD 模拟: NAMED- Agent

2507.07887v1

07-10

When Dialects Collide: How Socioeconomic Mixing Affects Language Use

Wenn Dialekte zusammenstoßen: Wie sich die sozioökonomische Mischung auf den Sprachgebrauch auswirkt

当对调时:社会经济混合如何影响语言使用

2307.10016v2

07-10

Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study

Bewertung der Robustheit von großen Audio-Sprachmodellen zur Audio-Einspritzung: Eine empirische Studie

评估大音频语言模型对音频注射的威力:经验研究

2505.19598v2

07-10

DocCHA: Towards LLM-Augmented Interactive Online diagnosis System

DocCHA: Auf dem Weg zum LLM-Augmented Interactive Online-Diagnosesystem

DocCHA:争取建立LLM-增强的互动式在线诊断系统

2507.07870v1

07-10

Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation

Alpay Algebra V: Multi-Layered Semantic Games und Transfinite Fixed-Point Simulation

Alpay Alphay Algebabra V:多语言语义运动会和跨线固定点模拟

2507.07868v1

07-10

Skywork-R1V3 Technical Report

Technischer Bericht Skywork-R1V3

Skywork-R1V3 技术报告

2507.06167v3

07-10

From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems

Von der Ambiguität zur Genauigkeit: Der transformative Effekt der Koreferenzlösung auf retrieval-augmentierte Erzeugungssysteme

从模糊到准确性:关于回收-提款一代系统的共同决议的变革效应

2507.07847v1

07-10

Truth-value judgment in language models: ‘truth directions’ are context sensitive

Wahrheit-Wert-Urteil in Sprachmodellen: ‘Wahrheitsrichtungen’ sind kontextsensibel

语言模型中的真相价值判断:“真相方向”是背景敏感

2404.18865v2

07-10

None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks

Keiner der anderen: eine allgemeine Technik zur Unterscheidung von der Erinnerung an Multiple-Choice-LLM-Bewertungs-Benchmarks

其他无其他:在多杯LLM评价基准中区分与记忆化区别理由的一般技术

2502.12896v5

07-10

Constrain Alignment with Sparse Autoencoders

Beschränkung der Ausrichtung mit Sparse Autoencodern

与 Sparse 自动对齐

2411.07618v4

07-10

Unsupervised Morphological Tree Tokenizer

Unüberwachter morphologischer Baum Tokenizer

不受监督的病理树化器

2406.15245v2

07-10

MAEBE: Multi-Agent Emergent Behavior Framework

多边代理新兴行为框架

2506.03053v2

07-10

The Thin Line Between Comprehension and Persuasion in LLMs

Die dünne Linie zwischen Verständnis und Überzeugung in LLMs

LLMM 理解与劝导之间的细细线

2507.01936v2

07-10

Conditional Unigram Tokenization with Parallel Data

Bedingte Unigramm-Tokenisierung mit parallelen Daten

附带平行数据的条件性大学招式

2507.07824v1

07-10

On the Effect of Instruction Tuning Loss on Generalization

Auf die Auswirkungen der Instruktion Tuning Verlust auf die Verallgemeinerung

指示计票损失对普遍化的影响的影响

2507.07817v1

07-10

Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning

Verständnis und Kontrolle von Wiederholungsneuronen und Induktionsköpfen im In-Context-Lernen

了解和控制再生中新中世纪和内文学习中的上岗负责人

2507.07810v1

07-10

Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers

Überbrückung von Logik und Lernen: Dekodierung von Temporal Logic-Embeddings über Transformer

架桥逻辑与学习:通过变形器解码时时逻辑嵌入

2507.07808v1

07-10

Decoding AI Judgment: How LLMs Assess News Credibility and Bias

Entschlüsselung des AI-Urteils: Wie LLMs neue Glaubwürdigkeit und Bias bewerten

证明AI 判决:LLMs如何评估新闻信誉和Bias

2502.04426v2

07-10

StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model

StreamUni: Streaming Speech Translation mit einem einheitlichen Large Speech-Language-Modell erreichen

StreamUli:用统一大型语音语言模式实现流式语音翻译

2507.07803v1

07-10

Understanding Chain-of-Thought in LLMs through Information Theory

Verständnis der in LLMs durch Informationstheorie gesuchten Gedankenkette

通过信息理论在LLM 中探索了解链

2411.11984v2

07-10

Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

Squeeze the Soaked Sponge: Effiziente Off-Policy-Verstärkung Feinsteuerung für großes Sprachmodell

挤压海绵:高效非政策强化大语言模式的高效非政策改进微调

2507.06892v2

07-10

When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance

Wenn große Sprachmodelle das Recht erfüllen: Dual-Lens-Taxonomie, technischer Fortschritt und ethische Governance

当大语言模式符合法律时:双重语言分类、技术进步和道德治理

2507.07748v1

07-10

Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review

Code-Switching in End-to-End Automatische Spracherkennung: Ein systematischer Literaturbericht

端至端自动语音识别代码转换:系统文学审查

2507.07741v1

07-10

GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

GuardVal: Dynamic Large Language Model Jailbreak Evaluation für umfassende Sicherheitstests

警卫:综合安全测试动态大语言示范监狱防爆评价

2507.07735v1

07-10

Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization

Nicht alle Präferenzen sind das, was Sie für das Post-Training benötigen: Selektive Ausrichtungsstrategie für die Preference-Optimierung

并非所有的优惠都是培训后需要的:选择性的优化优化战略

2507.07725v1

07-10

Rethinking the Privacy of Text Embeddings: A Reproducibility Study of “Text Embeddings Reveal (Almost) As Much As Text”

Die Privatsphäre von Text-Embeddings neu denken: Eine Reproduzierbarkeitsstudie von “Text-Embeddings Reveal (fast) So viel wie Text”

重新思考文字嵌入的隐私:关于“文字嵌入流(几乎)与文字一样”的可复制性研究

2507.07700v1

07-10

What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training

Was wissen selbstüberwachte Sprachmodelle über Niederländisch? Analysieren von Vorteilen sprachspezifischer Vorausbildung

自我监督的演讲模式对荷兰语了解多少? 分析具体语言培训前培训的优势

2506.00981v2

07-10

KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

KeyKnowledgeRAG (K^2RAG): Eine verbesserte RAG-Methode zur Verbesserung der LLM-Fragestellung

KeyknowledgeraG(K2RAG):改进LLM问答能力的强化RAG方法

2507.07695v1

07-10

SAS: Simulated Attention Score

SAS: Simulierter Aufmerksamkeits-Score

SAS:模拟关注计分

2507.07694v1

07-10

Hierarchical Bracketing Encodings for Dependency Parsing as Tagging

Hierarchische Bracketing-Encodings für Dependency Parsing als Tagging

将依赖性剖析作为拖贴

2505.11693v2

07-10

Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues

Ko-Konstruktives Verhalten von großen Sprachmodellen in Erklärungsdialogen untersuchen

解释对话中大语言模式的共同调查行为

2504.18483v2

07-10

Improving Cross-lingual Representation for Semantic Retrieval with Code-switching

Verbesserung der Cross-lingual Darstellung für semantische Retrieval mit Code-Schaltung

使用代码转换法改进语义检索的跨语种代表性

2403.01364v2

07-10

Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language

Beyond Hate Speech: NLPs Herausforderungen und Chancen beim Enthumanisieren der Sprache

超越仇恨言论:NLP在揭开非人化语言方面的挑战和机遇

2402.13818v2

07-10

An Automated Length-Aware Quality Metric for Summarization

Ein Automatisiertes Längen-Bewusst-Qualitäts-Metrik für die Zusammenfassung

用于汇总的自动长软件质量计量器

2507.07653v1

07-10

Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement

Lost in Pronunciation: Chinesische Offensive Sprache entdecken, verkleidet durch phonetische Umkleide-Ersatz

失落于发音中:发现因替换语音内衣而变形的中国进攻性语言

2507.07640v1

07-10

FrugalRAG: Learning to retrieve and reason for multi-hop QA

FrugalRAG: Lernen zum Abrufen und Grund für Multi-Hop-QA

FrugalRAG:学会检索和多呼QA的理由

2507.07634v1

07-10

Towards a cognitive architecture to enable natural language interaction in co-constructive task learning

Auf dem Weg zu einer kognitiven Architektur, um natürliche Sprachinteraktion im co-konstruktiven Aufgabenlernen zu ermöglichen

建立一个认知结构,在共同建设性任务学习中促成自然语言互动

2503.23760v2

07-10

Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights

Vergleichende Stimmungsanalyse der öffentlichen Wahrnehmung: Monkeypox vs. COVID-19 Verhaltenseinblicke

对公众感知的比较情绪分析:天花对COVID-19行为洞察力

2505.07430v2

07-10

Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks

Erforschung der Grenzen der Modellkompression in LLMs: Eine Studie zur Wissensdestillation über QA-Aufgaben

探索LLMM中模型压缩的限度:关于质量保证任务的知识积累研究

2507.07630v1

07-10

Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation

Gute/böse Reputation Urteil von Prominenten durch LLMs über retrieval Augmented Generation

LLMs通过回回子增量一代对名词的良好/负面评奖判决

2503.14382v2

07-10

SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs

RaumViz-Bench: Automatisch generierte räumliche Visualisierungs-Aufgaben für MLLMs

空间Viz-Bench:MLLLMs自动生成的空间可视化推理任务

2507.07610v1

07-10

Bayesian Discrete Diffusion Beats Autoregressive Perplexity

Bayesian Discrete Diffusion schlägt autoregressive Verwirrung

Bayesian Discrete 扩散比自动递减的两难性

2507.07586v1

07-10

Beyond Overcorrection: Evaluating Diversity in T2I Models with DivBench

Jenseits von Überkorrektur: Bewertung von Diversität in T2I-Modellen mit DivBench

超越过度纠正:在DivBench的T2I模型中评估多样性

2507.03015v2

07-10

Improving Clustering on Occupational Text Data through Dimensionality Reduction

Verbesserung der Clusterbildung auf berufsbezogenen Textdaten durch Dimensionalitätsreduzierung

通过减少分量改进职业文本数据集群化

2507.07582v1

07-10

COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation

COALA: Numerisch stabiles und effizientes Framework für kontextabhängige Low-Rank-Annäherung

COALA: 低 Rank 上下低敏度接近度的数值稳定、高效框架

2507.07580v1

07-10

Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation

Ein-zu-Mix Modalität Ausrichtung mit multimodalen Großsprachenmodellen für die Übersetzung von Dokumentenbildmaschinen

单一至混合模式与文件图像机机翻译多式大语言模式

2507.07572v1

07-10

video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Video-SALMONN 2: Bildunterschrift-verbesserte Audio-Visuelle große Sprachmodelle

视频-SALMONN2:字幕-强化视听大语言模式

2506.15220v2

07-10

The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs

Synergy Dilemma von Long-CoT SFT und RL: Untersuchung von Post-Training-Techniken zur Begründung von VLMs

Long-CoT SFT和RL的协同问题:调查培训后用于说明理由的训练后技术

2507.07562v1

07-10

Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

Multi-Head RAG: Lösung von Multi-Aspect-Problemen mit LLMs

多方主管RAG:解决多领域问题与LLM

2406.05085v4

07-10

The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora

Die Cross-Lingual Cost: Retrieval Biases in RAG über arabisch-englische Corpora

跨语言成本:通过阿拉伯语-英语公司在RAG中检索到阿拉伯语-英语公司

2507.07543v1

07-10

CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text

CEA-LIST bei CheckThat! 2025: Bewertung von LLMs als Detektoren von Bias und Meinung im Text

CEA-LIST 校对:CEA-LIST 校对:2025年

2507.07539v1

07-10

CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks

CheckEmbed: Effektive Überprüfung von LLM-Lösungen auf offene Aufgaben

复选对象:有效核查对不限名额任务LLM解决方案的有效核查

2406.02524v5

07-10

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

Thought Crime: Hintertüren und Emergent-Missausrichtung in vernünftigen Modellen

思想犯罪:后门和合理理由模型中新出现的不协调现象

2506.13206v2

07-10

Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems

Triadische Mehrparteien-Stimme-Aktivitätsprojektion für Turn-Take in gesprochenen Dialogsystemen

三部 “ 三部三部 “ 口语对话系统翻转式多党声音活动项目

2507.07518v1

07-10

Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System

Auf dem Weg zu echten chinesischen Psychologischen Unterstützungsdialogen: CPsDD-Datensatz und ein gemeinsames Multi-Agenten-System

走向现实世界的中国心理支持对话:CPsDD数据集和共同演进的多行为者系统

2507.07509v1

07-10

Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

Halluzinationsstationen: Auf einigen grundlegenden Einschränkungen von Transformer-basierten Sprachmodellen

幻觉站:关于以变换语言模式的一些基本限制

2507.07505v1

07-10

Enhancing Transformers for Generalizable First-Order Logical Entailment

Erweiterung der Transformer für generalisierbare Logical Entailment erster Ordnung

增强通用一级一级逻辑元件的变压器

2501.00759v3

07-10

Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature

Gewinnung von ORR-Katalysatorinformationen für Brennstoffzelle aus wissenschaftlicher Literatur

从科学文献中提取用于燃料电池的 ORR 催化器信息

2507.07499v1

07-10

Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code

LLM zur Vernunft bringen: Stärkung Lernen aus algorithmischen Problemen ohne Code

教LLM到理由:加强从没有法典的等级问题中学习

2507.07498v1

07-10

PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving

PLAN-TUNING: Sprachmodelle nach dem Training lernen Schritt-für-Schritt-Planung für komplexe Problemlösung

规划 – – 规划 – – 培训后语言模式,以学习逐步规划解决复杂问题的模式

2507.07495v1

07-10

SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records

SimSUM: Simulierter Benchmark mit strukturierten und unstrukturierten medizinischen Aufzeichnungen

SimSUM:与结构化和非结构化医疗记录模拟基准

2409.08936v3

07-10

Affordable AI Assistants with Knowledge Graph of Thoughts

Erschwingliche KI-Assistenten mit Wissensgrafik der Gedanken

具有知识思想知识图的负担得起的AI助理

2504.02670v5

07-10

Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

Machine Bullshit: Charakterisieren der Emergenten Missachtung der Wahrheit in großen Sprachmodellen

机器胡说:在大语言模型中突出新人无视真相的特点

2507.07484v1

07-10

Mixture of Group Experts for Learning Invariant Representations

Mixtur von Gruppenexperten für Learning Invariante Repräsentationen

学习不稳定代表小组专家混合

2504.09265v2

07-10

ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining

ixi-GEN: Effiziente industrielle sLLMs durch Domain Adaptive Continual Pretraining

ixi-GEN:通过远程适应性连续训练前,提高工业低温生产效率

2507.06795v2

07-10

Structure Guided Large Language Model for SQL Generation

Struktur Geführtes großes Sprachmodell für SQL-Generierung

SQL 生成引导大语言模式

2402.13284v4

07-10

RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

RLEP: Verstärktes Lernen mit Erfahrungsreplay für LLM-Reasoning

RLEP: 强化学习,经验重现LLM 理由推理

2507.07451v1

07-10

Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

Agent KB: Nutzung von Cross-Domain-Erfahrungen für die Lösung Agentischer Probleme

Agent KB: 利用跨域经验解决代理问题

2507.06229v2

07-10

SAND: Boosting LLM Agents with Self-Taught Action Deliberation

SAND: LLM-Agenten mit selbsterzogener Handlungsberatung stärken

SAND:促进具有自学行动考虑的LLM代理

2507.07441v1

07-10

Towards Interpretable Time Series Foundation Models

Auf dem Weg zu interpretierbaren Zeitreihen-Grundmodellen

迈向可解释时间序列基础模型

2507.07439v1

07-10

SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data

SynthEHR-Eviction: Verbesserung der Eviction SDoH-Erkennung mit LLM-Augmented Synthetic EHR Data

合成EHR-驱逐:利用LLM-增强的合成电子HR数据加强驱逐SDoH探测

2507.07421v1

07-10

MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning

MedReadCtrl: Personalisierung medizinischer Textgenerierung mit Lesbarkeitsgesteuertem Unterricht

MedReadReadCtrl: 使医疗文本生成个性化,并进行可读性控制教学学习

2507.07419v1

07-10

May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks

Darf ich Ihre Aufmerksamkeit haben? Breaking Fine-Tuning basierte Prompt Injection Defenses mit Architektur-Aware Attacken

请大家注意,使用建筑软件攻击突破基于精密发射的快速喷射防御系统

2507.07417v1

07-10

Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation

Interlinguistische phonetische Komposition (IPC): Ein theoretischer und rechnerischer Ansatz, um die zweite Sprache zu verbessern

语言间音音组成:加强第二语言发音的理论和计算方法

2411.10927v3

07-10

TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning

TART: Ein Open-Source Tool-Augmented Framework für erklärbare Tabellen-basierte Begründung

TARRT: 开放源码工具推荐框架,用于说明基于表格的理由

2409.11724v3

07-10

GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation

GNN-CNN: Ein effizientes Hybridmodell für konvolutionäre und Graphen-Neuralnetzwerke zur Textdarstellung

GNN-CNN: 用于文本代表的动态和图形神经网络的有效混合模型

2507.07414v1

07-10

CoAM: Corpus of All-Type Multiword Expressions

CoAM: Corpus von Multiwort-Ausdrücken aller Art

CoAM: 全类型多字表达式组合体

2412.18151v3

07-10

Rethinking Verification for LLM Code Generation: From Generation to Testing

Überprüfung der LLM-Code-Generierung neu denken: Von der Generation zur Prüfung

重新思考LLM 代码生成的核查:从生成到测试

2507.06920v2

07-10

Large Language Model for Extracting Complex Contract Information in Industrial Scenes

Großes Sprachmodell zur Extraktion komplexer Vertragsinformationen in Industrieszenen

工业景点复杂合同信息提取大语言模型

2507.06539v2

07-10

BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

BountyBench: Dollar-Impact von KI-Agenten-Angriffen und Verteidigern auf reale Cybersicherheitssysteme

BuntyBuntyBunnench: AI代理攻击者和捍卫者对现实世界网络安全系统的美元影响

2505.15216v2

07-10

Bradley-Terry and Multi-Objective Reward Modeling Are Complementary

Bradley-Terry und Multi-Objective Reward Modeling sind komplementär

Bradley-Terriy和多目标奖励模型具有补充作用

2507.07375v1

07-10

Shifting from Ranking to Set Selection for Retrieval Augmented Generation

Wechsel vom Ranking zur Einstellungsauswahl für retrieval Augmented Generation

从排位移到设置回收增量一代的选择

2507.06838v2

07-09 (3)

Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation

Taming the Tri-Space Tension: ARC-geführte Halluzination Modellierung und Steuerung für Text-zu-Bild-Generierung

遏制三空间紧张局势:ARC-ARC-ARC引导的光学幻觉模型和控制生成文本到图像的图像模型与控制

2507.04946v2

07-09

Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation

Multi-Agent Retrieval-Augmented Framework for Evidence-based Counterspeech Against Health Misinformation

以证据为依据的反健康错误信息反言多证据检索强化框架

2507.07307v1

07-09

Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation

Ranker, Richter und Assistenten: Auf dem Weg zum Verständnis des Interplays von LLMs in Information Retrieval Evaluation

中继员、法官和助理:了解信息检索评价中LLMLM的交互作用

2503.19092v2

100

07-09

ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning

ViDove: Ein Übersetzungs-Agenten-System mit multimodalem Kontext und Memory-Augmented Reasoning

ViDove:一个具有多式背景和记忆增强理由的翻译代理系统

2507.07306v1

101

07-09

The Impact of Background Speech on Interruption Detection in Collaborative Groups

Die Auswirkungen der Hintergrundrede auf die Störungserkennung in kollaborativen Gruppen

背景发言对合作小组中间探测的影响

2507.07280v1

102

07-09

LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

LinguaMark: Sprechen multimodale Modelle fair? Eine Benchmark-basierte Bewertung

LUUAMark:多模式模型是否公平说话?基于基准的评价

2507.07274v1

103

07-09

Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery

Open Source Planning & Control System mit Language Agents für autonome wissenschaftliche Entdeckung

拥有自主科学发现语言代理的开放源规划和控制系统

2507.07257v1

104

07-09

A Comprehensive Survey of Contamination Detection Methods in Large Language Models

Eine umfassende Untersuchung von Methoden zur Erkennung von Verunreinigungen in großen Sprachmodellen

关于大语言模型中污染探测方法的综合调查

2404.00699v5

105

07-09

Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications

Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in zwei realen klinischen Anwendungen

赋予保健从业者以语言模式:在两个现实世界临床应用中构建语音记录

2507.05517v2

106

07-09

A Language-Driven Framework for Improving Personalized Recommendations: Merging LLMs with Traditional Algorithms

Ein sprachgetriebener Rahmen zur Verbesserung personalisierter Empfehlungen: Zusammenführung von LLMs mit traditionellen Algorithmen

改进个性建议的语言驱动框架:将LLMs与传统等级合并

2507.07251v1

107

07-09

Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings

Medical Red Teaming Protocol of Language Models: Über die Bedeutung der Nutzerperspektiven in der Gesundheitsversorgung

语言模式医学红队模式医疗红队协议:关于保健机构用户观点的重要性

2507.07248v1

108

07-09

A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive

A Theory of Response Sampling in LLMs: Teil beschreibend und Teil präskriptiv

LLMM中的反应抽样理论:部分说明性理论和部分说明性理论

2402.11005v4

109

07-09

An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation

Eine informationstheoretische Perspektive zur Multi-LLM-Unsicherheitsschätzung

关于多LLLM不确定性估算的信息理论视角

2507.07236v1

110

07-09

Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection

Emoji-Angriff: Verstärkung von Jailbreak-Angriffen gegen Richter LLM-Erkennung

Emoji攻击:加强针对LLM法官的越狱袭击

2411.01077v3

111

07-09

SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains

SynthTextEval: Synthetische Textdatengenerierung und -bewertung für High-Stakes-Domains

合成文本Eval:合成文本数据生成和评估,用于高占用域域

2507.07229v1

112

07-09

EduCoder: An Open-Source Annotation System for Education Transcript Data

EduCoder: Ein Open-Source-Annotationssystem für Bildungstextdaten

EduCoder:教育记录数据公开源说明系统

2507.05385v2

113

07-09

Derailer-Rerailer: Adaptive Verification for Efficient and Reliable Language Model Reasoning

Derailer-Rerailer: Adaptive Verifikation für effizientes und zuverlässiges Sprachmodell

Derailer-Rerailer:高效率和可靠语文示范理由的适应性核查

2408.13940v4

114

07-09

Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses

Prompt Störungen Enthüllen Mensch-ähnliche Biasen in LLM Survey Responses

LLM调查答复中的即时扰动干扰现象

2507.07188v1

115

07-09

Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

Bepflanzt in der Vorausbildung, durch Finetuning abgeschwächt: Eine Fallstudie über die Herkunft von Kognitiv-Biasen in LLMs

编在培训前编,《微调:关于LLM中认知性双星起源的个案研究》,《微调摇摇晃》

2507.07186v1

116

07-09

Multi-Attribute Steering of Language Models via Targeted Intervention

Multi-Attribute-Lenkung von Sprachmodellen durch gezielte Intervention

通过有针对性的干预措施,多语言模式多来源指导

2502.12446v2

117

07-09

LCFO: Long Context and Long Form Output Dataset and Benchmarking

LCFO: Long Context und Long Form Output Datensatz und Benchmarking

LCFO:长期和长式产出数据集和基准

2412.08268v3

118

07-09

LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

LASER: Mit Multi-Armed Bandits lernen, Belohnungsmodelle adaptiv zu wählen

LASeR: 学习用多武装强盗来灵活选择奖赏模型

2410.01735v2

119

07-09

Low-Rank Adaptation Secretly Imitates Differentially Private SGD

Low-Rank-Anpassung hinterhältig imitiert unterschiedlich private SGD

低浓度适应秘密模仿不同的私人 SGD

2409.17538v7

120

07-09

UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations

UniConv: Vereinheitlichung der Retrieval- und Response-Generation für große Sprachmodelle in Gesprächen

UnionConv: 对话中大语言模型统一检索和反应生成

2507.07030v1

121

07-09

FlexOlmo: Open Language Models for Flexible Data Use

FlexOlmo: Offene Sprachmodelle für flexible Datennutzung

FlexOlmo:灵活数据使用开放语言模型

2507.07024v1

122

07-09

TokenShapley: Token Level Context Attribution with Shapley Value

TokenShapley: Token Level Context Attribution mit Shapley Wert

TokenShapley: 具有形状价值的 Token 级别背景属性

2507.05261v2

123

07-09

Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs

Bewusst lernen, intuitiv handeln: Entsperren von Test-Time-Reasoning in multimodalen LLMs

故意、自觉地学习:在多模式LMs中解锁试验时间理由

2507.06999v1

124

07-09

Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming

Alles planen mit Rigor: General-Purpose Zero-Shot Planung mit LLM-basiertem Formalisierten Programmierung

与Rigor一起规划任何事:与基于LLM的正规化方案规划一起规划一般用途零热规划

2410.12112v3

125

07-09

FRaN-X: FRaming and Narratives-eXplorer

FRaN-X: FRaming und Narratives-eXplorer

FRAN-X: 制冷和叙述-爆破器

2507.06974v1

126

07-09

Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report

Scaling towards the information boundary of instruction set: InfinityInstruct-Subject Technical Report

逐步走向指示集的信息边界:无限指令性 - 实物技术报告

2507.06968v1

127

07-09

Investigating the Robustness of Retrieval-Augmented Generation at the Query Level

Untersuchung der Robustheit der Retrieval-Augmented Generation auf der Abfrageebene

在查询级别调查回溯源一代的强劲程度

2507.06956v1

128

07-09

Neuron-Level Differentiation of Memorization and Generalization in Large Language Models

Neuron-Level-Differenzierung von Erinnerung und Generalisierung in großen Sprachmodellen

大语言模型中记忆化和普及化的中上级中值差异

2412.18497v2

129

07-09

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Auf dem Weg zu einer vernünftigen Ära: Eine Umfrage über lange Kette von Gedanken, um große Sprachmodelle zu verstehen

通向理性时代:关于为理由使用大语言模式而寻求的长链研究的调查

2503.09567v4

130

07-09

What to Keep and What to Drop: Adaptive Table Filtering Framework

Was zu halten und was zu fallen: Adaptive Tabelle Filterung Rahmen

保持和放下什么:适应性表格过滤框架

2506.23463v2

131

07-09

Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues

LLMs für die Vorhersage von Tutorenstrategien und Studentenergebnissen in Dialogen erforschen

探索在对话中预测教学战略和学生成果

2507.06910v1

132

07-09

MultiJustice: A Chinese Dataset for Multi-Party, Multi-Charge Legal Prediction

MultiJustice: Ein chinesischer Datensatz für Multi-Party, Multi-Charge Legal Prediction

多公正:多党、多主管法律预测的中国数据集

2507.06909v1

133

07-09

MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection

MIND: Ein Multi-Agenten-Framework für die Null-Schuss-Schadstoff-Meme-Erkennung

MIND: 零中弹有害气象探测多剂框架

2507.06908v1

134

07-09

VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation

VisualTrap: Ein Stealthy Backdoor Angriff auf GUI-Agenten durch Visual Grounding Manipulation

视觉陷阱:通过视觉定位操纵对图形用户界面代理器进行隐秘的后门攻击

2507.06899v1

135

07-09

NoLiMa: Long-Context Evaluation Beyond Literal Matching

NoLiMa: Lang-Kontext-Evaluierung jenseits der sprachlichen Übereinstimmung

NoLiMa: 超越文学匹配的长文本评价

2502.05167v3

136

07-09

SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN

SCore: Verstärkte Corpus-basierte Relation-Extraktion mit multi-Label Kontrastivem Lernen und Bayesian kNN

SCOR:利用多标签差异学习和Bayesian kNN简化公司关系采掘

2507.06895v1

137

07-09

Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights

Entwicklung und Pflege eines Open-Source-Repositorys von KI-Bewertungen: Herausforderungen und Erkenntnisse

开发并维持一个开放源码的AI评价储存库:挑战和洞察

2507.06893v1

138

07-09

Adaptive Elicitation of Latent Information Using Natural Language

Adaptive Vermittlung von latenten Informationen mit natürlicher Sprache

利用自然语言调适性易感信息使用自然语言

2504.04204v2

139

07-09

EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning

EMORL: Ensemble Multi-Objektive Verstärkungs-Lernen für effizientes und flexibles LLM-Fine-Tuning

EMORL: 将多目标强化学习组合起来,促进高效率和灵活的LLM 微调

2505.02579v3

140

07-09

Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework

Adaptive Beendigung für multi-round Parallele Reasoning: Ein universelles Semantisches Entropie-Führungs-Framework

多元平行理由的适应性终止:通用的语义 Entropy 指导框架

2507.06829v1

141

07-09

CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation

CMQCIC-Bench: Ein chinesischer Benchmark für die Bewertung großer Sprachmodelle in der Berechnung medizinischer Qualitätskontrollindikatoren

CMQCIC-CMQCIC-Bench:中国医疗质量控制指标计算评估大语言模式基准

2502.11703v2

142

07-09

Losing our Tail – Again: On (Un)Natural Selection And Multilingual Large Language Models

Unseren Tail verlieren – wieder einmal: An (Un)Natürlicher Auswahl und mehrsprachigen großen Sprachmodellen

失去我们的尾巴 – – 再一次:在(un)自然选择和多语言大语言模式

2507.03933v2

143

07-09

Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts

Sicherer oder luckier? LLMs als Sicherheitsevaluatoren sind für Artefakte nicht robust

安全性更安全还是更幸运?作为安全评估员的LLMs没有强力进行人工操作。

2503.09347v3

144

07-09

Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams

Text zum Modell via SysML: Automatisierte Generierung dynamischer Systemrechnermodelle aus unstrukturiertem Naturtext über verbesserte Systemmodellierung Sprachdiagramme

通过 SysML 自动生成动态系统计算模型,通过强化系统模拟图,从未结构化的自然语言文本生成动态系统计算模型

2507.06803v1

145

07-09

Test-Time Scaling with Reflective Generative Model

Test-Zeit-Skalierung mit reflektierendem Generativem Modell

具有反反思考生成模型的试验时间缩放

2507.01951v2

146

07-09

GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods

GuidedBench: Messung und Milderung der Evaluationsdiskrepanzen von In-the-Wild LLM Jailbreak Methoden

指导:衡量和减轻LLM监狱破碎方法在评价方面的差异

2502.16903v2

147

07-09

Checklist Engineering Empowers Multilingual LLM Judges

Checkliste Engineering Empowers Mehrsprachige LLM-Richter

多语种LLM法官

2507.06774v1

148

07-09

KAConvText: Novel Approach to Burmese Sentence Classification using Kolmogorov-Arnold Convolution

KAConvText: Neuer Ansatz zur burmesischen Satzklassifikation mit Kolmogorov-Arnold-Konvolution

KAConvText:利用科尔莫戈罗夫-阿尔纳尔革命对缅甸判刑分类采用新颖的方法

2507.06753v1

149

07-09

Robust Multimodal Large Language Models Against Modality Conflict

Robuste multimodale große Sprachmodelle gegen Modalitätskonflikt

反对方式冲突强有力的多式多式大语言模型

2507.07151v1

150

07-09

Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

Knockout LLM Assessment: Verwendung großer Sprachmodelle für Bewertungen durch iterative Pairwise-Vergleiche

LLLM 评估:利用大语言模式,通过迭接对等比较进行评估

2506.03785v3

151

07-09

LLM-based User Profile Management for Recommender System

LLM-basierte Benutzerprofilverwaltung für Recommender System

以LLM为基础的建议系统用户概况管理

2502.14541v2

152

07-09

Civil Society in the Loop: Feedback-Driven Adaptation of (L)LM-Assisted Classification in an Open-Source Telegram Monitoring Tool

Zivilgesellschaft im Kreislauf: Feedback-gestützte Anpassung der (L)LM-Assistenten-Klassifikation in einem Open-Source-Telegramm-Monitoring-Tool

” 圈圈中的民间社会:在开放源电讯监测工具中对(L)LM协助的分类进行反馈和改编 “ 。

2507.06734v1

153

07-09

On the Effect of Uncertainty on Layer-wise Inference Dynamics

Über den Effekt der Unsicherheit auf die schichtweise Schlussfolgerungsdynamik

不确定性对图层推论作用的影响

2507.06722v1

154

07-09

CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs

CLI-RAG: Ein Retrieval-Augmented Framework für klinisch strukturierte und kontextorientierte Textgenerierung mit LLMs

CLI-RAG: 利用LLMs实现临床结构化和有背景意识的文本生成的检索增强框架

2507.06715v1

155

07-09

Do Larger Language Models Imply Better Generalization? A Pretraining Scaling Law for Implicit Reasoning

Werden größere Sprachmodelle besser verallgemeinert? Ein Vorschulungs-Skalierungsgesetz für Implizite Vernunft

大语言模式是否充分更加普遍化?

2504.03635v2

156

07-09

A Survey on Prompt Tuning

Eine Umfrage über das schnelle Tuning

快速提款调查

2507.06085v2

157

07-09

Automating IRAC Analysis in Malaysian Contract Law using a Semi-Structured Knowledge Base

Automatisierung der IRAC-Analyse im malaysischen Vertragsrecht mit Hilfe einer halbstrukturierten Wissensbasis

利用半结构化知识库对马来西亚合同法的IRAC分析自动化

2406.13217v2

158

07-09

Probing and Steering Evaluation Awareness of Language Models

评价语言模式评估意识的检验和指导

2507.01786v2

159

07-09

Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models

Elite Polarisierung in europäischen Parlamentsreden: ein neuartiger Messansatz mit großen Sprachmodellen

欧洲议会讲话中的极极化:使用大语言模式的新衡量方法

2507.06658v1

160

07-09

PBa-LLM: Privacy- and Bias-aware NLP using Named-Entity Recognition (NER)

PBa-LLM: Datenschutz- und Bias-aware NLP mit Named-Entity-Erkennung (NER)

PBA-LLM: 使用命名实体识别(NER)的隐私和Bias-awar NLP

2507.02966v2

161

07-09

Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review

Beschleunigung der Datenextraktion mit Hilfe eines großen Sprachmodells (LLM) und eines Scoping-Review-Protokolls: eine methodische Studie im Rahmen einer komplexen Scoping-Review

利用大型语言模式和范围界定审查议定书加快数据提取速度:在复杂的范围界定审查范围内进行的方法研究

2507.06623v1

162

07-09

FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation

FuDoBa: Zusammenführen von Dokumenten- und Wissensgraphen-basierten Darstellungen mit Bayesian-Optimierung

FUDoBA: 以贝叶斯最佳化方式使用引信文档和知识图表表示

2507.06622v1

163

07-09

Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

Decoder-Hybrid-Decoder-Architektur für effizientes Nachdenken mit langer Generation

提高长代人合理性效率的代coder-Hybrid-Decer 结构

2507.06607v1

164

07-09

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

Video-Language Understanding: Eine Umfrage aus Modellarchitektur, Modellschulung und Datenperspektiven

了解视频语言:从建模模型、示范培训和数据视角进行调查

2406.05615v3

165

07-09

Can Input Attributions Explain Inductive Reasoning in In-Context Learning?

Kann Input-Attributionen im In-Context-Lernen eine induktive Begründung erklären?

投入属性能否解释内文学习中的引因?

2412.15628v5

166

07-09

Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions

Bewertung und Verbesserung der Robustheit in großen Sprachmodellen: Umfrage und Zukunftsrichtung

评价和提高大语言模式的威力:调查和未来方向

2506.11111v2

167

07-09

Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis

Verbesserung der Lebensmittel-Domain-Fragebeantwortung mit einem multimodalen Wissensgraphen: Hybride QA-Erzeugung und Diversitätsanalyse

以多模式知识图回答加强粮食领域问题:混合QA生成和多样性分析

2507.06571v1

168

07-09

CHAI for LLMs: Improving Code-Mixed Translation in Large Language Models through Reinforcement Learning with AI Feedback

CHAI für LLMs: Verbesserung der code-gemischten Übersetzung in großen Sprachmodellen durch Verstärkungslernen mit KI-Feedback

LLM女士CHAI:利用AI反馈加强学习,改进大语言模式的编码混合翻译

2411.09073v3

169

07-09

The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production

Die Fehler anderer: Ein LLM-gestützter Rahmen für die wissenschaftliche Wissensproduktion

其他法律:由LLM驱动的科学知识生产框架

2507.06565v1

170

07-09

DS@GT at CheckThat! 2025: Exploring Retrieval and Reranking Pipelines for Scientific Claim Source Retrieval on Social Media Discourse

DS@GT bei CheckThat! 2025: Erforschung von Rück- und Rücklaufleitungen für wissenschaftliche Anspruchsquellen-Rückgriffe auf Social Media Diskurs

DS@GT at checketh! 2025年: 探索检索和重新排序科学索赔源检索管道,关于社会媒体问题

2507.06563v1

171

07-09

AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework

AutoPrep: Natural Language Question-Aware Datenvorbereitung mit einem Multi-Agent Framework

AUPPrep: 具有多机构框架的自然语言问题软件数据编制

2412.10422v4

172

07-09

FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction

FiRST: Feinsteuerungs-Router-Selective Transformer für Input-Adaptive Latenzreduktion

FIRST: 用于减少投入-适应性延迟的微调路由器-选择式变换器

2410.12513v3

173

07-09

InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior

InvestAlign: Überwindung von Datenknappheit bei der Ausrichtung großer Sprachmodelle mit Entscheidungsprozessen von Investoren unter Herdverhalten

投资对称:克服在将大语言模式与牧民行为下的投资者决策过程相匹配方面的数据稀缺性

2507.06528v1

174

07-09

FinSphere, a Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs and Domain Tools

FinSphere, ein Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs und Domain Tools

FinSphere,实时股票分析代理商,使用指示引导LMs和域工具进行实时股票分析

2501.12399v2

175

07-09

FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation

FIFA: Unified Faithfulness Evaluation Framework für Text-to-Video und Video-to-Text Generation

FIFA: 文本到视频和视频到文字一代统一信仰评价框架

2507.06523v1

176

07-09

Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving

LLMs nach ihrer Eignung lehren: Adaptive Begründung für mathematische Problemlösung

根据其适应性:数学问题解决的适应性理由

2502.12022v3

177

07-09

DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

DeepTalk: Auf dem Weg zu nahtloser und intelligenter Sprachinteraktion mit adaptiver Modalität-spezifischer MoE

深谈:实现与适应型模式具体部的无缝和智能语音互动

2506.21864v2

178

07-09

Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

Double-Checker: Bessere Begründung von langsam denkenden LLMs über selbstkritische Feinsteuerung

双重检查者:通过自批评性微调,加强慢思考低迷LMs的理由

2506.21285v2

179

07-09

SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers

SpindleKV: Eine neuartige KV-Cache-Reduktionsmethode, die sowohl schwammige als auch tiefe Schichten ausbalanciert

SpindleKV: 浅层和深层平衡的新KV缓存减少方法

2507.06517v1

180

07-09

Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs

Breaking PEFT Limitations: Schwach-zu-starken Wissenstransfer für Backdoor-Angriffe in LLMs

打破PEFT的局限性:利用微弱至收缩知识转让来利用LLMs进行后门攻击

2409.17946v4

181

07-09

Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings

Pun Intended: Multi-Agent Übersetzung von Wordplay mit Kontrastivem Lernen und phonetisch-semantischen Einbettungen

未预期:多代理翻译与矛盾学习和语音-语言嵌入的文字游戏的多代理翻译

2507.06506v1

182

07-09

GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification

GMLM: Überbrückung von Graph Neuronalen Netzwerken und Sprachmodellen für heterophile Knotenklassifikation

GMLM: 血氧哲学节点分类的架接图形神经网络和语言模型

2503.05763v5

183

07-09

ModelCitizens: Representing Community Voices in Online Safety

ModelCitizens: Vertretung von Community Voices in Online-Sicherheit

模范公民:在网上安全方面代表社区声音

2507.05455v2

184

07-09

Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations

Verfeinerung von gesketen Wahrnehmungen in visionssprachlichen Kontrastmodellen durch visuelle Darstellungen

通过视觉表现方式改进视觉语言对比模型中偏斜的观念

2405.14030v3

185

07-09

Can adversarial attacks by large language models be attributed?

Können feindliche Angriffe von großen Sprachmodellen zugeschrieben werden?

大型语言模式的对抗性攻击能否归结为对抗性攻击?

2411.08003v2

186

07-09

TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

TokenSwift: Lossless Acceleration der Ultra Long Sequence Generation

TokenSwift: 超长序列一代无损加速

2502.18890v2

187

07-09

On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Über die Robustheit des verbalen Vertrauens von LLMs bei feindlichen Angriffen

关于在相互攻击中女优女士的口头信任的有力性

2507.06489v1

188

07-09

Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making

Kann LLMs spielen Ô ên Quan Spiel? Eine Studie über mehrstufige Planung und Entscheidungsfindung

LLM女士能玩游戏吗?多步规划和决策研究。

2507.03711v3

189

07-09

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

Video-RTS: Rethinking Verstärktes Lernen und Test-Time Scaling für effizientes und verbessertes Video-Reasoning

视频-RTS:重新思考强化学习和测试时间的尺度,以提高效率和增强视频理由

2507.06485v1

190

07-09

Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents

Japanisch lernen mit Jouzu: Interaktionsergebnisse mit stylisierten Dialog-Fiktivagenten

与Jouzu一起学习日文:与固定对话分形媒介的互动结果

2507.06483v1

191

07-09

InfoTech Assistant: A Multimodal Conversational Agent for InfoTechnology Web Portal Queries

InfoTech Assistant: Ein multimodaler Ansprechpartner für InfoTechnology Web Portal Queries

信息技术助理:信息技术网站门户查询的多模式对话代理

2412.16412v2

192

07-09

Theme-Explanation Structure for Table Summarization using Large Language Models: A Case Study on Korean Tabular Data

Thema-Erläuterungsstruktur für die Tabellenzusammenfassung mit großen Sprachmodellen: Eine Fallstudie zu koreanischen Tabellendaten

使用大语言模型对表格进行汇总的主题解释结构:韩国表格数据案例研究

2501.10487v3

193

07-08 (2)

A Systematic Analysis of Hybrid Linear Attention

Eine systematische Analyse der hybriden linearen Aufmerksamkeit

系统分析混合线性注意

2507.06457v1

194

07-08

A Semantic Parsing Framework for End-to-End Time Normalization

Ein semantisches Parsing-Framework für die End-to-End-Zeit-Normalisierung

端到终点时间正常化的语义分解框架

2507.06450v1

195

07-08

Perception-Aware Policy Optimization for Multimodal Reasoning

Perception-Aware Policy Optimization für multimodale Reasoning

对多式联运理由的观念-认知软件政策优化

2507.06448v1

196

07-08

CodeMirage: Hallucinations in Code Generated by Large Language Models

CodeMirage: Halluzinationen in Code Generiert durch große Sprachmodelle

代码Mirage: 大语言模型生成的代码中的幻觉

2408.08333v2

197

07-08

Substance over Style: Evaluating Proactive Conversational Coaching Agents

Substanz über Stil: Bewertung proaktiver Conversations-Coaching-Agenten

物质高于样式:评价主动对话教练员

2503.19328v2

198

07-08

Can Interpretation Predict Behavior on Unseen Data?

Kann Interpretation Verhalten auf ungesehenen Daten voraussagen?

解释预测行为是否与未知数据有关?

2507.06445v1

199

07-08

Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling

Temporale Analyse des Diskurses zur Klimapolitik: Einblicke aus der dynamischen Themenmodellierung

气候政策问题时空分析:动态嵌入式专题模型的展望

2507.06435v1

200

07-08

Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders

Erforschen der Aufgabenleistung mit Interpretierbaren Modellen über Sparse Auto-Encoder

通过 Sparass Auto- Enccders 与可解释模型探索任务性能

2507.06427v1

201

07-08

Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

Reward-Modelle können sich selbst verbessern: Reward-Guided Adversarial Failure Mode Discovery für robuste Reward-Modellierung

奖赏模型能够改善自我:为强力奖赏建模而发现奖赏- 受奖人反逆失败模式

2507.06419v1

202

07-08

PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning

PERK: Lang-Kontext als Parameter-Effizientes Test-Zeit-Lernen

PERK: 参数有效试验时间学习的长期理由

2507.06415v1

203

07-08

The Trilemma of Truth in Large Language Models

Das Trilemma der Wahrheit in großen Sprachmodellen

大语言模型中的真理三边

2506.23921v2

204

07-08

Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers

Single Word Change ist alles, was Sie brauchen: Konzipieren von Angriffen und Verteidigungen für Textklassifikatoren

单字更改是您需要的: 设计攻击和文字分类的防御

2401.17196v2

205

07-08

Hypermagmas and Colored Operads: Heads, Phases, and Theta Roles

Hypermagmen und farbige Opern: Köpfe, Phasen und Theta-Rollen

超磁和彩色操作: 头部、阶段和Theta 角色

2507.06393v1

206

07-08

Multi-Sense Embeddings for Language Models and Knowledge Distillation

Multi-Sense-Einbettungen für Sprachmodelle und Wissensdestillation

语文模式和知识蒸馏多语种多语种嵌入式

2504.06036v2

207

07-08

Evaluating Morphological Alignment of Tokenizers in 70 Languages

Bewertung der morphologischen Ausrichtung von Tokenisatoren in 70 Sprachen

以70种语言评价火化器在道义上的一致性

2507.06378v1

208

07-08

Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

Wachsende Transformer: Modulare Zusammensetzung und schichtweise Erweiterung auf einem gefrorenen Substrat

增长变异器:在冻结基质上的模块构成和图层扩展

2507.07129v1

209

07-08

Could the Road to Grounded, Neuro-symbolic AI be Paved with Words-as-Classifiers?

Könnte der Weg zur geerdeten, neuro-symbolischen KI mit Wort-als-Klassifikatoren gepflastert werden?

通往禁地、神经-精神-精神-爱之路能否用“单词化为标准”来铺设?

2507.06335v1

210

07-08

MedGellan: LLM-Generated Medical Guidance to Support Physicians

MedGellan: LLM-generierte medizinische Anleitung zur Unterstützung von Ärzten

MedGellen:LLM - 支持医生的医疗指导

2507.04431v2

211

07-08

RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

RefineX: Verfeinern von Vorschulungsdaten auf Scale aus sachverständigen Programmen

RefineX:从专家指导方案学习如何在规模上精确分析培训前数据

2507.03253v2

212

07-08

ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time

ETT: Erweiterung des Langzeitkontexts Verständnisfähigkeit von LLMs bei Test-Time

ETT:扩大LLMs在试验时的长距离理解能力

2507.06313v1

213

07-08

Humans overrely on overconfident language models, across languages

Menschen überschätzen überbewusste Sprachmodelle, über Sprachen hinweg

人类对不同语言的过于自信的语言模式过于自信

2507.06306v1

214

07-08

Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers

Effizienz-Wirksamkeit Reranking FLOPs für LLM-basierte Reranker

为基于LLM的升级者重新安排FLOP

2507.06223v1

215

07-08

Instruction Following by Boosting Attention of Large Language Models

Anleitung, indem man die Aufmerksamkeit großer Sprachmodelle erhöht

之后的教学,培养对大语言模式的注意

2506.13734v2

216

07-08

CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions

CultureCLIP: CLIP mit kulturellem Bewusstsein durch synthetische Bilder und kontextualisierte Captions stärken

CICLIP: 通过合成图像和背景说明赋予CLIP以文化意识,赋予CLIP权力

2507.06210v1

217

07-08

EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

EEG2TEXT-CN: Eine explorative Studie der offenen Vokabulären chinesischen Text-EEG-Ausrichtung über großsprachliches Modell und kontrastives Lernen auf ChinesischEEG

EEG2TEXT-CN:通过大语言模式和中经语言差异性学习对中文文本与EEEG校对开放词汇的探索性研究

2506.00854v3

218

07-08

DS@GT at CheckThat! 2025: Ensemble Methods for Detection of Scientific Discourse on Social Media

DS@GT bei CheckThat! 2025: Ensemblemethoden zur Erkennung wissenschaftlicher Diskurse über soziale Medien

DS@GT at checketh! 2025:社会媒体科学论文的混合探测方法

2507.06205v1

219 07-08 Differential Mamba Unterschiedliche Mamba 差异 Mamba 2507.06204v1

220

07-08

UQLM: A Python Package for Uncertainty Quantification in Large Language Models

UQLM: Ein Python-Paket zur Unsicherheitsquantifizierung in großen Sprachmodellen

UQLM: 用于大语言模型中不确定性量化的 Python 软件包

2507.06196v1

221

07-08

DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification

DS@GT bei CheckThat! 2025: Bewertung von Kontext- und Tokenisierungsstrategien für numerische Faktenverifizierung

DS@GT at checketh! 2025年:评估背景和数值事实核查的接受战略

2507.06195v1

222

07-08

SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads

SQLBarber: Ein System, das große Sprachmodelle nutzt, um angepasste und realistische SQL Workloads zu generieren

SQLBarber:一个系统利用大语言模型生成自定义和现实的 SQL 工作量

2507.06192v1

223

07-08

DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation

DS@GT bei CheckThat! 2025: Subjektivität durch Transfer-Learning und Korrekturdatenvergrößerung erkennen

DS@GT at checkThat! 2025: 通过传输学习和纠正数据增强检测主观性

2507.06189v1

224

07-08

Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review

Versteckte Prompts in Handschriften Exploit KI-Assisted Peer Review

使用 AI 协助同行审议

2507.06185v1

225 07-08 MedGemma Technical Report Technischer Bericht MedGemma MedmeGemma 技术报告 2507.05201v2

226

07-08

CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization

KritikLean: Kritisch-geführtes Verstärkungslernen für mathematische Formalisierung

Criticlean: 用于数学正规化的Critic-Crictic-Guideed加强学习

2507.06181v1

227

07-08

Evaluation of Habitat Robotics using Large Language Models

Bewertung der Habitat Robotics mit großen Sprachmodellen

利用大语言模式评价人居机器人学

2507.06157v1

228

07-08

Coding Triangle: How Does Large Language Model Understand Code?

Coding Triangle: Wie versteht Large Language Model Code?

编码三角:大语言模式如何理解代码?

2507.06138v1

229

07-08

NeoBabel: A Multilingual Open Tower for Visual Generation

NeoBabel: Mehrsprachiger offener Turm für die visuelle Erzeugung

NeoBabel:多语言的视觉生成开放塔

2507.06137v1

230

07-08

SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam?

SciMaster: Auf dem Weg zu allgemein angelegten wissenschaftlichen KI-Agenten, Teil I. X-Master als Grundlage: Können wir die letzte Prüfung der Menschheit durchführen?

SciMaster:争取实现普通用途科学AI代理,第一部分,X-Master作为基金会:我们能否在人类最后一次考试上起带头作用?

2507.05241v2

231

07-08

Agents Are All You Need for LLM Unlearning

Agenten sind alles, was Sie für LLM Unlearning benötigen

代理器是所有你需要的LLM 重新学习

2502.00406v2

232

07-08

GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models

GAF-Guard: Agentischer Rahmen für Risikomanagement und Governance in großen Sprachmodellen

GAF-Guard:大语言模式风险管理和治理代理框架

2507.02986v2

233

07-08

Nyay-Darpan: Enhancing Decision Making Through Summarization and Case Retrieval for Consumer Law in India

Nyay-Darpan: Verbesserung der Entscheidungsfindung durch Zusammenfassung und Rückgriff auf das Verbraucherrecht in Indien

Nyay-Darpan:印度消费者法通过摘要和案件检索加强决策

2507.06090v1

234

07-08

The bitter lesson of misuse detection

Die bittere Lektion der Missbrauchserkennung

发现滥用的惨痛教训

2507.06282v1

235

07-08

A Multi-Task and Multi-Label Classification Model for Implicit Discourse Relation Recognition

Ein Multi-Task- und Multi-Label-Klassifikationsmodell für die Implizite Diskursrelation-Erkennung

多任务和多标签的隐性断断层确认分类模式

2408.08971v3

236

07-08

Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

Sind LLMs prescient? Eine kontinuierliche Bewertung mit Daily News als Oracle

LLLM是否具有先锋精神?利用每日新闻作为甲骨文进行持续评估。

2411.08324v2

237

07-08

The distribution of syntactic dependency distances

Die Verteilung der syntaktischen Abhängigkeitsabstände

共同依赖距离分布

2211.14620v2

238

07-08

Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger

Adaptive Werkzeuganwendung in großen Sprachmodellen mit Meta-Cognition Trigger

用于大语言模型的适应性工具使用

2502.12961v2

239

07-08

Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

Entropie-Memorisationsgesetz: Bewertung der Gedächtnis-Schwierigkeit der Daten in LLMs

《信封-记忆法》:评估LLMM中数据的记忆难度

2507.06056v1

240

07-08

Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

Überwindung von Datenknappheit bei generativen Sprachmodellierungen für Sprachen mit geringem Ressourcenbedarf: Eine systematische Überprüfung

克服低资源语言创用语言建模中的数据稀缺性:系统审查

2505.04531v2

241

07-08

Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager

Maßgeschneiderte Gespräche über LLMs hinaus: Ein RL-basierter Dialogmanager

超出LLLM 的定制对话:基于 RL 的对话管理器

2506.19652v2

242

07-08

Empirical evidence of Large Language Model’s influence on human spoken communication

Empirische Beweise für den Einfluss von Large Language Model auf die menschliche gesprochene Kommunikation

大语言模式对人口交流的影响的经验证据

2409.01754v3

243

07-08

MemOS: A Memory OS for AI System

MemOS: Ein Speicher-Betriebssystem für KI-System

MemOS: AI系统记忆OS

2507.03724v2

244

07-08

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

OpenS2S: Vollkommen offenes Source-End-to-End-Empathetik-Modell für große Sprache

OpenS2S: 推进完全开放源全开放源端至端至端受同情型大型语音语言模式

2507.05177v2

245

07-08

Conditional Multi-Stage Failure Recovery for Embodied Agents

Bedingte Multi-Stage Failure Recovery für eingedickte Agenten

内装药剂有条件的多阶段故障恢复

2507.06016v1

246

07-08

DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations

DocIE@XLLM25: In-Context-Lernen zur Informationsextraktion mit vollständig synthetischen Demonstrationen

DocIE@XLLLM25:利用全面合成演示进行信息提取的内文学习

2507.05997v1

247

07-08

Evolution without Large Models: Training Language Model with Task Principles

Evolution ohne große Modelle: Sprachmodell mit Aufgabenprinzipien trainieren

无大型号的演变演变:具有任务原则的培训语言模式

2507.05991v1

248

07-08

Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening

Entwicklung und Evaluation von HopeBot: ein LLM-basierter Chatbot für strukturiertes und interaktives PHQ-9-Depressionsscreening

HopeBot的发展和评价:一个基于LLM的聊天室,用于结构化和互动的PHQ-9抑郁症筛查

2507.05984v1

249

07-08

RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

RabakBench: Skalierung menschlicher Anmerkungen zum Aufbau lokalisierter Mehrsprachiger Sicherheits-Benchmarks für ressourcenarme Sprachen

Rabak Rabak Bench:提高人文说明,以建立本地化的低资源语言多语文安全基准

2507.05980v1

250

07-08

We Should Evaluate Real-World Impact

Wir sollten die Auswirkungen der realen Welt bewerten

我们应该评估现实世界的影响

2507.05973v1

251

07-08

Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model

Analytisches Subraum-Routing: Wie rekursive Least Squares im kontinuierlichen Lernen des großen Sprachmodells funktionieren

分析子空间运行分析:在继续不断学习大语言模式中如何使最起码的广场恢复

2503.13575v2

252

07-08

FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights

FRAME: Feedback-Refined Agent Methodik zur Verbesserung medizinischer Forschungs-Insights

FRAME: 改进医学研究透视的反馈改进剂方法

2505.04649v2

253

07-08

OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation

OpenFactScore:公开源码原子对制文本实际质量的评估

2507.05965v1

254

07-08

Rethinking Associative Memory Mechanism in Induction Head

Rethinking Assoziative Memory Mechanism in Induktion Head

上岗时重新思考联合记忆机制

2412.11459v2

255

07-08

Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models

Neudefinition von Bewertungsstandards: Ein einheitlicher Rahmen für die Bewertung der koreanischen Fähigkeiten von Sprachmodellen

重新界定评价标准:评价韩国语言模式能力的统一框架

2503.22968v4

256

07-08

Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems

Chat-Ghosting: Eine vergleichende Untersuchung von Methoden zur Auto-Completion in Dialogsystemen

对话系统自动完成方法比较研究

2507.05940v1

257

07-08

Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors

Denken Sie daran, Vergangenheit, Zukunft antizipieren: Lernen kontinuierlich multimodale Fehlinformation Detectors

记住过去,预期未来:学习持续多模式错误信息探测器

2507.05939v1

258

07-08

Towards a Principled Evaluation of Knowledge Editors

Auf dem Weg zu einer grundsätzlichen Bewertung von Wissensredakteuren

对知识编辑进行有原则地评价

2507.05937v1

259

07-08

News and Load: Social and Economic Drivers of Regional Multi-horizon Electricity Demand Forecasting

News and Load: Soziale und wirtschaftliche Treiber regionaler Mehrhorizont-Stromnachfrageprognosen

新闻与负载:区域多光速电力需求预测的社会和经济驱动因素

2406.06641v2

260

07-08

Semantic Certainty Assessment in Vector Retrieval Systems: A Novel Framework for Embedding Quality Evaluation

Semantische Sicherheitsbewertung in Vector Retrieval Systemen: Ein neuartiger Rahmen für die Einbettung von Qualitätsbewertungen

矢量检索系统中的语义性确定性评估:包含质量评价的新框架

2507.05933v1

261

07-08

Self-supervised learning of speech representations with Dutch archival data

Selbstüberwachtes Lernen von Sprachdarstellungen mit niederländischen Archivdaten

自行监督学习用荷兰档案数据进行语音陈述

2507.04554v2

262

07-08

Few-shot text-based emotion detection

Kurze Text-basierte Emotionserkennung

几发短短的基于文本的情感探测

2507.05918v1

263

07-08

On the Fundamental Impossibility of Hallucination Control in Large Language Models

Über die grundsätzliche Unmöglichkeit der Halluzinationskontrolle in großen Sprachmodellen

关于大语言模型中幻听控制的基本不可能性

2506.06382v3

264

07-08

AI-Reporter: A Path to a New Genre of Scientific Communication

AI-Reporter: Ein Weg zu einem neuen Genre wissenschaftlicher Kommunikation

AI-记者:通向科学通信新一流的道路

2507.05903v1

265

07-08

MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation

MusiScene: Nutzung von MU-LLAMA für die Vorstellung von Szenen und verbesserte Video-Hintergrundmusik-Generation

Musi-Sceen:利用MU-LLAMA来进行现场想象和增强视频背景音乐制作

2507.05894v1

266

07-08

Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

Psychometrische Objektvalidierung mit virtuellen Befragten mit trait-Response Mediatoren

使用虚拟应答人与Trait-Response调解员的虚拟应答人进行测分

2507.05890v1

267

07-08

How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures

Wie man die automatische Spracherkennung auswertet: Vergleich verschiedener Leistungs- und Bias-Maßnahmen

如何评价自动语音识别:比较不同业绩和偏见措施

2507.05885v1

268

07-08

Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language Models

Heilkräfte von BERT: Wie aufgabenspezifische Feinsteuerung korrupte Sprachmodelle wiederherstellt

BERT的愈合能力:如何恢复被损坏的语言模式

2406.14459v2

269

07-08

Detecting value-expressive text posts in Russian social media

Ermittlung wertwirksamer Textposts in russischen sozialen Medien

在俄罗斯社交媒体中检测表达价值的文字文章

2312.08968v2

270

07-08

ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection

ALLM4ADD: Entsperren der Fähigkeiten von Audio Large Language Models für Audio Deepfake Detection

ALLM4ADD:解锁音频深假探测音频大语言模型的能力

2505.11079v2

271

07-08

PDFMathTranslate: Scientific Document Translation Preserving Layouts

PDFMathTranslate: Wissenschaftliche Dokumentübersetzung Layouts erhalten

PDFMath Translate:科学文件翻译保存布局

2507.03009v2

272

07-08

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Mit Thinking-LLM-as-a-Richter zu planen und zu evaluieren

学习运用 “ 思考-LLM-作为法官 “ 评估计划和理由

2501.18099v2

273

07-08

The Impact of Prompt Programming on Function-Level Code Generation

Die Auswirkungen der Prompt-Programmierung auf die Code-Generierung auf Funktionsebene

迅速编制方案对职能层面代码生成的影响

2412.20545v2

274

07-08

Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity

Affective-ROPTester: Fähigkeit und Bias-Analyse von LLMs bei der Vorhersage der Retinopathie der Vorreife

情感-ROPTester:预测早幼病理反应中LLMs的能力和比值分析

2507.05816v1

275

07-08

Bridging Perception and Language: A Systematic Benchmark for LVLMs’ Understanding of Amodal Completion Reports

Bridging Perception and Language: Ein systematischer Benchmark für das Verständnis von LVLMs für Amodal Completion Reports

桥梁概念和语言:LLLMs理解最佳完成报告系统基准

2507.05799v1

276

07-08

FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation

FaithfulRAG: Fact-Level-Konfliktmodellierung für kontexttreue retrieval-erweiterte Generation

忠实的拉戈:为本能的回溯性回溯性回溯性回溯性一代人建立实际冲突模式

2506.08938v2

277

07-08

MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training

MAMUT: Ein neuartiges Framework zur Modifizierung mathematischer Formeln für die Generierung von Spezialdatensätzen für Sprachmodellschulungen

MAMUT:为语言模式培训制作专门数据集修改数学公式的新框架

2502.20855v2

278

07-08

Flippi: End To End GenAI Assistant for E-Commerce

Flippi: Ende bis Ende GenAI Assistent für E-Commerce

Flippi: 结束到结束 GenAI 电子商务助手

2507.05788v1

279

07-08

Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription

Gemeinsame Beamforming und Speaker-Attributed ASR für echte Fern-Mikrofon-Meeting-Transkription

用于真正分布式话语会议的联合光束和发言人与发言人联合致真正分布式话语会议ASR

2410.21849v2

280

07-08

Large Language Models Might Not Care What You Are Saying: Prompt Format Beats Descriptions

Große Sprachmodelle sind vielleicht egal, was Sie sagen: Prompt Format Beats Beschreibungen

大语言模型可能不在乎你在说什么:快速格式比描述

2408.08780v6

281

07-08

DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities

DocTalk: Skalierbare Graph-basierte Dialog-Synthese zur Verbesserung von LLM-Konversationsfähigkeiten

Doctal: 用于增强LLLM对话能力的可缩放图表对话合成

2507.05750v1

282

07-08

GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge

GPTKB v1.5: Eine umfassende Wissensbasis für die Erforschung des tatsächlichen LLM-Wissens

GPTKB v1.5:探索事实LLM知识的大规模知识库

2507.05740v1

283

07-08

Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling

Verbesserung der LLM-Zuverlässigkeit durch explizite Wissensgrenzenmodellierung

通过明确知识边界建模增强LLM可靠性

2503.02233v3

284

07-08

ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark

KontextASR-Bench: Ein massiver Kontext für die Spracherkennung Benchmark

ASR-Bunch:大规模背景情况语音识别基准

2507.05727v1

285 07-08 Truth Neurons Wahrheit Neuronen 真理中世纪 2505.12182v3

286

07-08

Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Omni-Router: Routing-Entscheidungen in Sparse Mixture-of-Experts für die Spracherkennung teilen

Omni-Router: 分享语音识别专家的松散混集决定

2507.05724v1

287

07-08

MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment

MobileGUI-RL: Mobile GUI-Agent durch Verstärkungslernen im Online-Umfeld fördern

MomGUI-RL:通过在线环境强化学习推进移动图形界面代理

2507.05720v1

288

07-08

HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation

HIRAG: Hierarchisch-gedankte Instruktion-Tuning-Retrieval-Augmented Generation

HIRAG: 高层次研究教学-引导检索-推荐一代

2507.05714v1

289

07-08

DRAGON: Dynamic RAG Benchmark On News

DRAGON: Dynamischer RAG-Benchmark auf Neuigkeiten

DRAGON:动态RAG新闻基准

2507.05713v1

290

07-08

RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

RAG-R1 : Förderung der Such- und Begründungsfähigkeiten von LLMs durch Multi-Query-Parallelismus

RAG-R1:通过多种克质平行主义鼓励LLMs的搜索和说明能力

2507.02962v2

291

07-08

SIGIR 2025 – LiveRAG Challenge Report

SIGIR 2025-LiveRAG挑战报告

2507.04942v2

292

07-08

Agentic-R1: Distilled Dual-Strategy Reasoning

Agentic-R1: Destillierte Dual-Strategy-Reasoning

剂- R1: 蒸馏的双战略理由

2507.05707v1

293

07-08

A Survey on Transformer Context Extension: Approaches and Evaluation

Eine Umfrage über Transformer Context Extension: Ansätze und Evaluierung

关于 “ 变换背景推广:方法和评价 “ 的调查

2503.13299v2

294

07-08

PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading

PulseReddit: Ein neuartiger Reddit-Datensatz für das Benchmarking von MAS im Kryptowährungshandel mit hoher Frequenz

PulseReddit:高级加密货币交易中用于确定MAS基准的新Rddit数据集

2506.03861v2

295

07-08

Do We Really Need Specialization? Evaluating Generalist Text Embeddings for Zero-Shot Recommendation and Search

Brauchen wir wirklich Spezialisierung? Auswertung von Generalisten-Text-Embeddings für Zero-Shot-Empfehlung und Suche

我们真的需要专业化吗?评估零热建议和搜索通用文本嵌入

2507.05006v2

296

07-08

Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria

KI-Beratung auf Japanisch bewerten: Berater, Kunde und Evaluator Rollen, die von Motivations-Interview-Kriterien überzeugt sind

评价大赦国际在日本的咨询:顾问、客户和根据动机访谈标准评估的评价作用

2507.02950v2

297

07-08

AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs

AutoTriton: Automatische Triton-Programmierung mit Verstärkungslernen in LLMs

AutoTriton: 自动三联编程,在LLMs中进行强化学习

2507.05687v1

298

07-08

Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs

Smoothie-Qwen: Post-Hoc-Glättung, um Sprach-Bias in mehrsprachigen LLMs zu reduzieren

Slidie-Quwen:以多种语言平滑后减少语言偏见

2507.05686v1

299

07-08

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

BMMR: Ein multimodaler multimodaler multidisziplinärer Datensatz, der großflächig zweisprachig ist

BMMR: 大型双语双语多语多式多式多残疾理由数据集

2507.03483v2

300

07-08

TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data

TuneShield: Eindämmung der Toxizität bei konversatorischer KI während der Feinabstimmung auf nicht vertrauenswürdige Daten

TuneSheld:在对不可信数据进行微调的同时,在交流性AI中减轻毒性

2507.05660v1

301

07-08

SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge

SHNU Mehrsprachiges Spracherkennungssystem für INTERSPEECH 2025 MLC-SLM Challenge

SHNU 2025年刚果解放运动-解运挑战

2507.03343v2

302

07-08

MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models

MEF: Ein Capability-Aware Multi-Encryption Framework zur Bewertung von Schwachstellen in Black-Box Large Language Models

MEF: 用于评价黑箱大语言模型脆弱性的能力-软件多加密框架

2505.23404v3

303

07-08

ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

ECom-Bench: Kann LLM Agent Probleme mit dem echten E-Commerce lösen?

ECom-Bench:LLM代理能否解决现实世界电子商务客户支助问题?

2507.05639v1

304

07-08

Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

Feint and Attack: Achtsamkeitsbasierte Strategien für das Jailbreaking und den Schutz von LLMs

Feint和攻击:以关注为基础的破狱和保护LLMs

2410.16327v2

305

07-08

Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models

Sarkasmus durch unterschiedliche Augen sehen: Multimodale Sarkasmuswahrnehmung in großen Vision-Sprachen-Modellen analysieren

透过不同的眼睛看到讽刺:在大型视觉语言模型中分析多模式讽刺的认知

2503.12149v2

306

07-08

SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression

SARA: Selektive und adaptive retrieval-augmentierte Generation mit Kontextkompression

SARA:有环境压缩的选择性和适应性再取回型

2507.05633v1

307

07-08

On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows

Über die Rolle von Feedback bei der Testzeitskalierung von agentischen KI-Workflows

关于反馈在AI类制剂工作流量的测试-时间尺度方面作用的反馈

2504.01931v4

308

07-08

Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent

Adsorb-Agent: Autonome Identifizierung stabiler Adsorptionskonfigurationen über Large Language Model Agent

Assorb-Agent:通过大语言模式代理自动识别稳定吸附配置

2410.16658v4

309

07-08

Flipping Knowledge Distillation: Leveraging Small Models’ Expertise to Enhance LLMs in Text Matching

Flipping Knowledge Destillation: Kompetenz kleiner Modelle nutzen, um LLMs in Text Matching zu verbessern

减轻知识蒸馏:利用小型模型的专长加强文本匹配中的LLM

2507.05617v1

310

07-08

Self-Review Framework for Enhancing Instruction Following Capability of LLM

Selbst-Review-Rahmen für die Verbesserung der Anleitung nach der Fähigkeit von LLM

根据LLM能力加强教学的自我审查框架

2507.05598v1

311

07-08

Efficient Detection of Intermittent Job Failures Using Few-Shot Learning

Effiziente Erkennung intermittierender Job-Fälle durch wenig scharfes Lernen

利用很少热的学习方法有效检测间歇性工作失败

2507.04173v2

312

07-08

Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation

Low-Rank und Sparse Modell Zusammenführung für Multi-Lingual Spracherkennung und Übersetzung

多语言语音识别和翻译的低兰克和松散模式合并

2502.17380v3

313

07-08

The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation

Die Landschaft der Erinnerung in LLMs: Mechanismen, Messung und Milderung

LLMM中纪念的景观:机制、衡量和缓解

2507.05578v1

314

07-08

Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models

Feature-Extraktion und -Lenkung für eine verbesserte Kettenbildung in Sprachmodellen

语言模型中强化研究链理由的特征采掘和指南

2505.15634v3

315

07-08

Offline Learning and Forgetting for Reasoning with Large Language Models

Offline-Lernen und Vergessen für die Vernunft mit großen Sprachmodellen

利用大语言模式进行离线学习和遗忘

2504.11364v3

316

07-08

Beyond Retrieval: Ensembling Cross-Encoders and GPT Rerankers with LLMs for Biomedical QA

Beyond Retrieval: Zusammenbauen von Cross-Encodern und GPT-Rerankern mit LLMs für biomedizinische QA

超出检索范围:利用生物医学质量评估LMs将跨环境者和GPT再生者与生物医学质量评估LMs组合在一起

2507.05577v1

317

07-08

Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

Infini-gram mini: Genaue n-gram Suche auf der Internetskala mit FM-Index

Infini-gram 微型: 使用 FM- Index 的 Internet 比例尺精确的 n 克搜索

2506.12229v2

318

07-08

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

OLMoTrace: Tracing Language Model Outputs Zurück zu Tillionen von Trainings-Tokens

OLMoTrace:追踪语言示范产出,返回数以万计的培训单位

2504.07096v2

319

07-08

Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS

Verbesserung der Testzeitskalierung großer Sprachmodelle mit hierarchischen Retrieval-Augmented MCTS

强化高语言模型的测试时间比量,采用等级检索增强的MCTS

2507.05557v1

320

07-07 (1)

The Role of Deductive and Inductive Reasoning in Large Language Models

Die Rolle der deduktiven und induktiven Vernunft in großen Sprachmodellen

大语种模式中的减量和引引原因的作用

2410.02892v3

321

07-07

Bayesian Optimization for Controlled Image Editing via LLMs

Bayesische Optimierung für kontrollierte Bildbearbeitung über LLMs

Bayesian 通过LLMs优化控制图像编辑

2502.18116v3

322

07-07

Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment

Conversational Education at Scale: Multi-LLM Agent Workflow für verfahrenstechnisches Lernen und pädagogische Qualitätsbewertung

规模层次的相互教育:程序学习和教学质量评估的多种LLM代理工作流

2507.05528v1

323

07-07

Early-Exit and Instant Confidence Translation Quality Estimation

Schätzung der Qualität der Früh-Ausgangs- und Instant-Konfidenz-Übersetzung

提前出国和即时信心翻译质量估算

2502.14429v2

324

07-07

Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality

Feinkörnige Vision-Sprachenmodellierung für multimodale Trainingsassistenten in Augmented Reality

扩大现实中多模式培训助理的精美愿景-语言建模

2507.05515v1

325

07-07

MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation

MEIT: Multimodale Elektrokardiogramm-Anleitung Tuning auf großen Sprachmodellen für die Berichtserstellung

MEIT:多式电动心电图指示,关于报告生成大语言模式的图示

2403.04945v4

326

07-07

Towards Exception Safety Code Generation with Intermediate Representation Agents Framework

Auf dem Weg zur Generierung von Ausnahme-Sicherheitscodes mit dem Rahmen für Mittlere Vertretungen

建立具有中间代表代理机构框架的例外安全法规生成框架

2410.06949v3

327

07-07

ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding

ViGiL3D: Ein sprachlich unterschiedlicher Datensatz für 3D Visual Grounding

ViGiL3D:3D视觉定位语言多样性数据集

2501.01366v2

328

07-07

On the Semantics of Large Language Models

Zur Semantik großer Sprachmodelle

关于大语言模型的语义学

2507.05448v1

329

07-07

PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs

PhoniTale: Phonologisch begründete mnemonische Generation für typologisch entfernte Sprachpaare

Phooni-Tale: 音频分布式语言对音频隔音的声波固定成像元体生成

2507.05444v1

330

07-07

Gendered Divides in Online Discussions about Reproductive Rights

Geschlechtsspezifische Unterschiede in Online-Diskussionen über reproduktive Rechte

关于生殖权利的在线讨论中的两性差距

2507.05443v1

331

07-07

DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning

DeepRetro: Retrosynthetische Pathway Discovery mit iterativem LLM Reasoning

使用迭代LM 理由解释的回溯合成途径发现

2507.07060v1

332

07-07

LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

LLM Hypnose: Nutzung des Benutzerfeedbacks für unautorisierte Wissensinjektion für alle Benutzer

LLM Hypnisis:利用用户反馈,为所有用户提供未经授权知识注射

2507.02850v2

333

07-07

“Lost-in-the-Later”: Framework for Quantifying Contextual Grounding in Large Language Models

“Lost-in-the-Later”: Rahmen für die Quantifizierung kontextueller Erdung in großen Sprachmodellen

“Lost-in-Later”:“Lost-in-Later”:大语言模型中量化背景依据框架

2507.05424v1

334

07-07

Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning

Global lernen, lokal sprechen: Die Lücken in der Mehrsprachigkeit überwinden

《全球学习,本地语言:弥合多种语文理由方面的差距》

2507.05418v1

335

07-07

What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning

Was würdest du fragen, wenn du zuerst $a^2+b^2=c^2$ sahst? LLM auf Kuriosität-getriebene Fragestellung bewerten

当你第一次看到$2+b2=c2$时你会问什么?

2409.17172v2

336

07-07

Tractable Transformers for Flexible Conditional Generation

Tractable Transformer für flexible Bedingte Generation

灵活条件一代的可变变变换器

2502.07616v2

337

07-07

Evaluation of OpenAI o1: Opportunities and Challenges of AGI

Bewertung von OpenAI o1: Chancen und Herausforderungen der AGI

评价开放AI OO1:亚洲地理信息协会的机遇和挑战

2409.18486v2

338

07-07

Embedding-Based Approaches to Hyperpartisan News Detection

Einbetten-basierte Ansätze zu Hyperparteien-Nachrichten-Erkennung

以嵌入式方法探测超党派新闻

2501.01370v3

339

07-07

One fish, two fish, but not the whole sea: Alignment reduces language models’ conceptual diversity

Ein Fisch, zwei Fische, aber nicht das ganze Meer: Ausrichtung reduziert die konzeptionelle Vielfalt der Sprachmodelle

一、二、一、二鱼,但不是整个海洋:对齐减少了语言模式的概念多样性

2411.04427v3

340

07-07

Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences

Kontrolle dessen, was Sie teilen: Bewertung von Sprachmodellen Einhaltung von Datenschutzeinstellungen

控制您共享的共享:评估语言模式遵守隐私优惠

2507.05391v1

341

07-07

The Generalization Ridge: Information Flow in Natural Language Generation

The Generalization Ridge: Informationsfluss in der natürlichen Sprachgenerierung

通用海脊:自然语言生成的信息流动

2507.05387v1

342

07-07

Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training

Verstärkte Feinsteuerung mildert natürlich das Vergessen im kontinuierlichen Post-Training

长期培训后忘却的自然缓解措施

2507.05386v1

343

07-07

On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study

Über die Bias von Next-Token Predictors hin zu systematisch ineffizienter Begründung: Eine Kurzeste-Path-Fallstudie

关于对系统、低效率理性的 “ 下肯预测家 “ 的偏见:最短篇案例研究

2507.05362v1

344

07-07

LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks

LoRA-Augmented Generation (LAG) für wissensintensive Sprachaufgaben

LoRA - 知识密集型语言任务支持代(LAG)

2507.05346v1

345

07-07

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Speicher in LLM-Agenten über inkrementelle Multi-Turn-Interaktionen bewerten

通过递增多发互动评估LLM代理物的记忆

2507.05257v1

346

07-07

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Open Vision Reasoner: Übertragen sprachliches Kognitives Verhalten für visuelles Reasoning

开放愿景理由:转移视觉理由的语言认知行为

2507.05255v1

347

07-07

Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?

Den Richtern zu urteilen: Können große Visions-Sprachmodelle Chartverständnis und Vernunft angemessen bewerten?

判断法官:大视野语言模型能否公正地评估图表的解读和理由?

2505.08468v2

348

07-07

Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Response Attack: Exploiting Contextual Priming to Jailbreak Große Sprachmodelle

回应攻击:利用环境原始化作为破狱大型语言模式

2507.05248v1

349

07-07

When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Wenn Gedankenkette notwendig ist, kämpfen Sprachmodelle um Evade Monitore

当思维链是必要的时候,语言模型与Evade Monitor争吵

2507.05246v1

350

07-07

MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents

MindFlow: Revolutionierung des E-Commerce-Kundensupports mit multimodalen LLM-Agenten

MindFlow:使电子商务客户支持与多式联运LLM代理商实现革命化

2507.05330v1

351

07-07

Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Late Chunking: Kontextuelle Chunk-Embeddings mit Langkontext-Embedding-Modellen

较晚的板块: 使用长文嵌入模型的背景块嵌入式嵌入式

2409.04701v3

352

07-07

OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

OPeRA: Ein Datensatz von Beobachtung, Persona, Ratationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation

OPERA: 人类在线购物行为模拟观察、人、理由和评估LMLLMs的数据集

2506.05606v3

353

07-07

Logit Reweighting for Topic-Focused Summarization

Logit-Umgewichtung für thematisch fokussierte Zusammenfassung

用于专题用途汇总的对数重新加权

2507.05235v1

354

07-07

The Super Weight in Large Language Models

Das Supergewicht in großen Sprachmodellen

大语言模型中的超重

2411.07191v2

355

07-07

jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

jina-embeddings-v4: Universelle Einbettungen für multimodale Mehrsprachigkeit

jina-embeddings-v4:多语种多式联运回收通用嵌入式

2506.18902v3

356

07-07

Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals

Erweiterte Induktive Begründung für personalisierte Präferenz-Inferenz von Verhaltenssignalen

从行为信号中推论

2505.18071v2

357

07-07

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Zwillinge 2.5: Das Frontier mit fortschrittlicher Vernunft, Multimodalität, langem Kontext und Agentischen Fähigkeiten der nächsten Generation schieben

Gemini 2.5: 推进先进理性、多模式、长处和下一代的前沿

2507.06261v1

358

07-07

Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

Verbesserung des personalisierten Multi-Turn-Dialogs mit Kuriositätslohn

增强个人化多发人对话与好奇奖的多发对话

2504.03206v2

359

07-07

Interleaving Logic and Counting

Interleaving Logik und Zählen

中间逻辑和计数

2507.05219v1

360

07-07

Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation

Überprüfung von wissenschaftlichen Arbeiten für kritische Probleme mit reasoning LLMs: Basisansätze und automatische Bewertung

审查关于有理有理的关键问题的科学文件:基线方法和自动评价

2505.23824v2

361

07-07

Pre-Trained Policy Discriminators are General Reward Models

Pre-Trained Policy Diskriminatoren sind allgemeine Prämienmodelle

培训前政策歧视者是一般奖励模式

2507.05197v1

362

07-07

NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge

NativQA Framework: LLMs mit nativem, lokalem und alltäglichem Wissen aktivieren

NativQA框架:利用土著、地方和日常知识扶持LMs

2504.05995v2

363

07-07

From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations

Von Fragmenten zu Fakten: Ein Curriculum-getriebener DPO-Ansatz zur Generierung von Hindi News Veracity Erklärungen

《从零碎到事实:产生印地语新闻的多城市解释:课程驱动的DPO方法》

2507.05179v1

364

07-07

SEPSIS: I Can Catch Your Lies – A New Paradigm for Deception Detection

SEPSIS: Ich kann Ihre Lügen fangen – Ein neues Paradigma für Täuschungserkennung

我可以抓住你的谎言– 一种新模型用来侦测欺骗

2312.00292v2

365 07-07 Critiques of World Models Kritik an Weltmodellen 世界模式的证明 2507.05169v1

366

07-07

Language Models can Self-Improve at State-Value Estimation for Better Search

Sprachmodelle können sich bei State-Value-Schätzung für eine bessere Suche selbst improvisieren

语言模型可以在国家估价估算中自我改进,以更好地搜索

2503.02878v2

367

07-07

InfoSteer: Steering Information Utility in Language Model Post-Training

InfoSteer: Steuerungsinformationsdienst im Sprachmodell nach dem Training

InfoSteer:指导性语言示范语言信息工具

2507.05158v1

368

07-07

AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models

KI Generated Text Detection mit Anleitung Feinabstimmung große Sprache und Transformer-basierte Modelle

使用微调大语言和以变换器为基础的模型生成的指令

2507.05157v1

369

07-07

Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization

Interpretierbare Mnemonic-Generation für Kanji-Lernen über Erwartungs-Maximierung

通过预期-最大化为Kanji学习提供可解释的中子一代

2507.05137v1

370

07-07

SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction

SMART: Simulierte Schüler ausgerichtet mit Item Response Theorie für Frage Schwierigkeitsvorhersage

SMART: 与 “ 问题困难预测 “ 项目反应理论一致的模拟学生

2507.05129v1

371

07-07

An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques

Eine Bewertung von großen Sprachmodellen zu Textzusammenfassungsaufgaben mittels Prompt Engineering Techniques

评价关于使用即时工程技术进行文本归纳任务的大语言模型

2507.05123v1

372

07-07

Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration

Wiederbelebung des Kulturerbes: Ein neuartiger Ansatz für eine umfassende Restaurierung historischer Dokumente

恢复文化遗产:全面恢复历史文件的新办法

2507.05108v1

373

07-07

LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review

LCDS: Logisch gesteuerte Entladung Zusammenfassung Erzeugung System Unterstützung Quelle Attribution und Experten-Review

LCDS: 逻辑控制排放控制简要生成系统支持源来源归属和专家审查

2507.05319v1

374

07-07

AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics

KI-getriebene Zytomorphologie Bildsynthese für medizinische Diagnostik

AI-Driven 用于医学诊断的细胞形态图像合成

2507.05063v1

375

07-07

End-to-End Evaluation for Low-Latency Simultaneous Speech Translation

End-to-End-Bewertung für simultane Sprachübersetzungen mit geringer Latenz

低长期同声同声发言翻译端对端评价

2308.03415v4

376

07-07

Using Large Multimodal Models to Extract Knowledge Components for Knowledge Tracing from Multimedia Question Information

Verwendung von großen multimodalen Modellen zum Extrahieren von Wissenskomponenten für die Wissensverfolgung aus Multimedia-Frageinformationen

利用大型多模式模式从多媒体问题信息中提取知识组成部分,以从多媒体问题信息中追踪知识

2409.20167v2

377

07-07

Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Warum kämpfen Open Source LLMs mit Datenanalyse? Eine systematische empirische Studie

开放源码LLMs为何要与数据分析斗争?系统的经验研究

2506.19794v2

378

07-07

Verified Language Processing with Hybrid Explainability: A Technical Report

Verifizierte Sprachverarbeitung mit hybrider Erklärbarkeit: Ein technischer Bericht

经核实的有混合解释性的语文处理:技术报告

2507.05017v1

379

07-07

Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification

Co-DETECT: Collaborative Discovery of Edge Cases in der Textklassifikation

共同目的:在文本分类中合作发现边缘情况

2507.05010v1

380

07-07

From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems

Von der Autonomie zur Agentur: Agentenfahrzeuge für Mensch-zentrierte Mobilitätssysteme

从自主到机构:以人为中心的流动系统机动车辆

2507.04996v1

381

07-07

Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models

Kann Video LLMs zur Antwort ablehnen? Ausrichtung für Beantwortbarkeit in Video Large Language Models

视频大语言模式的可响应性对齐

2507.04976v1

382

07-07

ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

ArtefakteBench: Überbrückung der visuell-interaktiven Lücke in der LLM-Codegenerierung

人工合成:弥合LLM代码生成评估中的视觉互动差距

2507.04952v1

383

07-07

ReLoop: “Seeing Twice and Thinking Backwards” via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding

ReLoop: “Zweimal sehen und zurückdenken” über Closed-Loop-Training zu Mitigate-Halluzinationen im multimodalen Verständnis

ReLooop:通过闭路培训,为多模式理解的假幻觉提供“双眼和反向思考”

2507.04943v1

384

07-07

Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science?

Sind Informations-Retrieval Ansätze gut bei der Harmonisierung Longitudinal Survey Fragen in der Sozialwissenschaft?

信息检索方法在协调社会科学的纵向调查问题方面是否很好?

2504.20679v2

385

07-07

Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes

Verstehen LLMs die Sicherheit ihrer Eingaben? Schulungsfreie Moderation über Latent Prototypen

LLM女士了解其投入的安全性吗?通过中原原型进行无培训的调节

2502.16174v2

386

07-07

O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d’apprentissage pour l’adaptation de modèles de langue à un domaine

O_FT@EvalLLLM2025:非属域适应适应模式的技术和技术评估比较

2507.04895v1

387

07-07

MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction

MARBLE: Eine regelbasierte LLM-Reasoning-Engine für die Vorhersage von Unfällen mit Schweregrad

事故严重性预测的多机构、以规则为基础的多机构、基于规则的LLM 理疗引擎

2507.04893v1

388

07-07

Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Emergent Semantics Beyond Token Embeddings: Transformer LMs mit gefrorenen visuellen Unicode-Darstellungen

超越 Tok 嵌入的新兴语义: 具有冷冻视觉统一符号的变形LMs

2507.04886v1

389

07-07

Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions

Aufbau von offenen retrieval-Konversations-Frage-Antwortsystemen durch Generierung synthetischer Daten und Dekontextualisierung von Benutzerfragen

通过生成合成数据和减少文本用户问题,建立开放式检索连接问答解答系统

2507.04884v1

390

07-07

Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite

Spanische Texte aus der Vergangenheit transkribieren: Experimente mit Transkribus, Tesseract und Granit

翻录过去西班牙文本:与Transkribus、Tesseract和Granite的实验

2507.04878v1

391

07-07

$\textit{Grahak-Nyay:}$ Consumer Grievance Redressal through Large Language Models

$\textit{Grahak-Nyay:}$ Consumer Grievance Redressal durch große Sprachmodelle

$textit{ Grahak- Nyay:} 以大语言模式纠正消费者冤情

2507.04854v1

392

07-07

Dialogue-Based Multi-Dimensional Relationship Extraction from Novels

Dialogbasierte multidimensionale Beziehungsextraktion aus Romanen

以对话为基础的从新奇中提取的多不同关系

2507.04852v1

393

07-07

Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems

Spec-TOD: Ein speziell angepasstes LLM-Framework für effiziente aufgabenorientierte Dialogsysteme

Spec-TOD:高效以任务为导向的对话系统专门指导性LLM框架

2507.04841v1

394

07-07

CritiQ: Mining Data Quality Criteria from Human Preferences

关键:来自人类首选的采矿数据质量标准

2502.19279v2

395

07-07

RewardAnything: Generalizable Principle-Following Reward Models

BelohnungAlles: Allgemeines Prinzip-folgende Belohnung Modelle

奖励任何奖励:可普遍适用的遵循原则的奖赏模式

2506.03637v2

396

07-07

BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance

BiMa: Auf dem Weg zu Biases Milderung für Text-Video-Retrieval über Szeneelement-Anleitung

Bima:通过场景元素指导,争取减缓对文本视频检索的“双轨减缓”

2506.03589v3

397

07-07

From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach

Von der Vision zur Sprache durch Diagramm der Ereignisse in Raum und Zeit: Ein erklärbarer selbstüberwachter Ansatz

通过空间和时间事件图从视野到语言:可解释的自我监督方法

2507.04815v1

398

07-07

A Survey of Pun Generation: Datasets, Evaluations and Methodologies

Eine Umfrage der Pun-Generation: Datensätze, Bewertungen und Methoden

对 “ 制作豆类:数据集、评价和方法 “ 的调查

2507.04793v1

399

07-07

Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

Qwen vs. Gemma Integration mit Whisper: Eine vergleichende Studie in mehrsprachigen Sprach-LLM-Systemen

Quwen诉Gemma 与低语融合:多语种语言LLLM系统比较研究

2506.13596v2

400

07-07

Reason to Rote: Rethinking Memorization in Reasoning

Grund zu rotieren: Die Erinnerung an die Vernunft neu denken

旋转理由:重新思考在理由方面的记忆

2507.04782v1

401

07-07

ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems

ABench-Physics: Benchmarking Physical Reasoning in LLMs über hochschwierige und dynamische Physikprobleme

ABench-物理:通过高难度和动态物理问题确定LLMLM中物理原因的基准

2507.04766v1

402

07-07

Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

Thunder-DeID: Genauer und effizienter De-Identifizierungsrahmen für Urteile des koreanischen Gerichts

Thunder-DeID:韩国法院判决的准确和有效的取消识别框架

2506.15266v2

403

07-07

CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering

CoSteer: Kollaborative Decoding-Time-Personalisierung über lokale Delta-Lenkung

科斯特:通过当地三角洲指导,协作解码-时间个性化

2507.04756v1

404

07-07

LLMs as Architects and Critics for Multi-Source Opinion Summarization

LLMs als Architekten und Kritiker für die Mehr-Quellen-Stellungnahme Zusammenfassung

作为多来源意见总结的建筑师和批评家的LLM

2507.04751v1

405

07-07

A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic

Eine Geschichte von zwei Scripts: Transliteration und Nachkorrektur für Judeo-Arabisch

两本脚本的故事:犹太-阿拉伯语的翻写和校正后

2507.04746v1

406

07-07

Word stress in self-supervised speech models: A cross-linguistic comparison

Wortstress in selbstüberwachten Sprachmodellen: Ein cross-linguistischer Vergleich

自我监督的演讲模式中的言语压力:交叉语言学比较

2507.04738v1

407

07-07

Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools

Engere Kluft: Überwachtes Feintuning von Open Source LLMs als lebensfähige Alternative zu proprietären Modellen für pädagogische Werkzeuge

缩小差距:监督开放源码LLMs的微调,将其作为替代专有教学工具模型的可行替代物

2507.05305v1

408

07-07

“This Suits You the Best”: Query Focused Comparative Explainable Summarization

“Das passt Ihnen am besten”: Abfrage konzentriert vergleichende erklärende Zusammenfassung

“这适合你最好的”: 查询重点突出的可解释比较摘要

2507.04733v1

409

07-07

Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization

Textentgiftung: Dateneffizienz, Semantische Erhaltung und Modellverallgemeinerung

文字解毒:数据效率、语义保护和模式通用

2507.01050v2

410

07-07

LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework

LOOM-Scope: ein umfassender und effizienter Long-cOntext Modellauswertungsrahmen

LOOM-范围:一个全面有效的LOng-Ontext示范评价框架

2507.04723v1

411

07-07

MAIN: Mutual Alignment Is Necessary for instruction tuning

MAIN: Gegenseitige Ausrichtung ist notwendig für die Instruktions-Tuning

MAIN: 相互协调是教学调整的必要条件

2504.12913v2

412

07-07

Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce

Warum wir fühlen, was wir fühlen: Gemeinsame Erkennung von Emotionen und ihre Meinung Auslöser im E-Commerce

我们为什么感到我们的感受:在电子商务中共同发现情感及其见解触发者

2507.04708v1

413

07-07

XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL

XiYan-SQL: Ein neuartiges Multi-Generator-Framework für Text-zu-SQL

XiYan-SQL: 文本到SQL的新颖多驱动器框架

2507.04701v1

414

07-07

Markovian Transformers for Informative Language Modeling

Markovianische Transformer für die informative Sprachmodellierung

用于信息语言建模的Markovian变形器

2404.18988v6

415

07-07

Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading

Pensieve Grader: Eine KI-Powered, Ready-to-Use Plattform für mühelose handschriftliche STEM-Grading

Pensieve grafer: 一个AI授权的无力手写STEM分级的现用平台

2507.01431v2

416

07-07

Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages

Eka-Eval : Ein umfassender Evaluierungsrahmen für große Sprachmodelle in indischen Sprachen

Eka-Eval:印度语大语言模式综合评价框架

2507.01853v2

417

07-07

Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models

Breach in the Shield: Enthüllen der Schwachstellen großer Sprachmodelle

打破盾牌:消除大语言模式的脆弱性

2504.03714v2

418

07-07

R1-RE: Cross-Domain Relationship Extraction with RLVR

R1-RE: Cross-Domain-Beziehungsextraktion mit RLVR

R1-RE: 与RLVR的跨部关系提取

2507.04642v1

419

07-07

Put Teacher in Student’s Shoes: Cross-Distillation for Ultra-compact Model Compression Framework

Put Teacher in Studentenschuhen: Cross-Destillation für ultra-kompakte Modellkompressionsrahmen

将教师纳入学生鞋:超常规示范压缩框架的交叉蒸馏

2507.04636v1

420

07-07

A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens

Ein Text ist mehrere Token wert: Text-Einbetten aus LLMs richtet sich geheim gut mit den Schlüssel-Token

文本“ 值得几个调子” : 文本由LLMs 秘密对齐与密钥连接良好。

2406.17378v4

421

07-07

Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs

Wissen-Bewusst Selbst-Korrektion in Sprachmodellen über strukturierte Speichergraphen

通过结构内存图对语言模型进行知识-软件自我校正

2507.04625v1

422

07-07

Towards Cost-Effective Reward Guided Text Generation

Auf dem Weg zu einer kosteneffizienten, von der Belohnung geleiteten Textgenerierung

实现成本-效益高的回报制制制制制制

2502.04517v2

423

07-07

News Source Citing Patterns in AI Search Systems

Nachrichten Quelle Zitieren von Mustern in KI-Suchsystemen

AI 搜索系统中引用新闻来源模式

2507.05301v1

424

07-07

Retain or Reframe? A Computational Framework for the Analysis of Framing in News Articles and Reader Comments

A putational Framework for the Analysis of Framing in News Artikel und Leserkommentare

保留还是重新框架? 分析新闻文章和读者评论中形式界定的计算框架

2507.04612v1

425

07-07

Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track

Position: Machine Learning Konferenzen sollten einen “Refutations and Critiques” Track erstellen

职位:机器学习会议应建立“反驳和批评”轨道

2506.19882v3

426

07-07

PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes

PRIME: Large Language Model Personalisierung mit kognitiven Gedächtnis- und Gedankenprozessen

PRIME:具有认知记忆和思维过程的大语言模式个性模型

2507.04607v1

427

07-07

Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)

Structured Captions verbessern die sofortige Einhaltung von Text-zu-Bild-Modellen (Re-LAION-Caption 19M)

结构说明改进对文本到图像模型的迅速遵守(Re-LAION-Caption 19M)

2507.05300v1

428

07-07

Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation

Gradient-geführte Aufmerksamkeitskarte Bearbeitung: Auf dem Weg zu einer effizienten kontextuellen Halluzination

逐步制导注意地图编辑:努力有效减少环境幻觉

2503.08963v2

429

07-07

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

VLM2Vec-V2: Advancing Multimodal Embedding für Videos, Bilder und visuelle Dokumente

VLM2Vec-V2:推进视频、图像和视觉文件的多式联运嵌入

2507.04590v1

430

07-07

Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs

Im Inneren sind Sie viele Wölfe: Mit kognitiven Modellen, um Wert-Abwägungen in LLMs zu interpretieren

使用认知模型来解释LLMM中的价值权衡

2506.20666v2

431

07-06 (7)

Construction Identification and Disambiguation Using BERT: A Case Study of NPN

Bauidentifizierung und Abgrenzung mit BERT: Eine Fallstudie der NPN

使用BERT进行建筑识别和拆分:NPN的案例研究

2503.18751v2

432

07-06

Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts

Nile-Chat: Ägyptische Sprachmodelle für arabische und lateinische Scripte

Nile-Chat:阿拉伯文和拉丁文脚本埃及语言模型

2507.04569v1

433

07-06

Evaluating LLMs on Real-World Forecasting Against Human Superforecasters

Bewertung von LLMs bei der Real-World-Prognose gegen menschliche Superforecaster

评价针对人类超先天预测者进行真实世界预测的LLMs评价项目

2507.04562v1

434

07-06

PENCIL: Long Thoughts with Short Memory

PENCIL: Lange Gedanken mit kurzer Erinnerung

PENCIL:有短记忆的长思考

2503.14337v2

435

07-06

DP-Fusion: Token-Level Differentially Private Inference for Large Language Models

DP-Fusion: Token-Level Differentially Private Inferenz für große Sprachmodelle

DP-Fusion: 大语言模式的调控层面的私人不同推论

2507.04531v1

436

07-06

DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging

DOTResize: Reduzierung der LLM-Breite durch diskrete Optimale transportbasierte Neuronenverschmelzung

DOTRIZ: 通过分立优化运输最佳方式的中子合并减少 LLM 宽度

2507.04517v1

437

07-06

Rethinking Table Instruction Tuning

Umdenken Tabelle Anleitung Tuning

重新思考表格指令图

2501.14693v3

438

07-06

On the Utility of Domain-Adjacent Fine-Tuned Model Ensembles for Few-shot Problems

Über die Nützlichkeit von Domain-Adjacent Fine-Tuned Model Ensembles für Probleme mit wenigen Fotos

关于Dome-邻接精度精美设计模型集的效用问题

2406.13720v2

439

07-06

AdS: Adapter-state Sharing Framework for Multimodal Sarcasm Detection

AdS: Adapter-State-Sharing-Framework für multimodale Sarkasmus-Erkennung

AdS:多式碎片探测适应国家共享框架

2507.04508v1

440

07-06

Unveiling the Potential of Diffusion Large Language Model in Controllable Generation

Enthüllen des Potenzials von Diffusions-Großsprachenmodellen in kontrollierbarer Generation

发挥在可控生产中传播大语言模式的潜力

2507.04504v1

441

07-06

Self-Consistency Preference Optimization

Selbst-Konsistenz-Präferenz-Optimierung

优化优惠

2411.04109v3

442

07-06

A validity-guided workflow for robust large language model research in psychology

Ein validitätsgeführter Workflow für robuste große Sprachmodellforschung in der Psychologie

健全的大型语言模式心理学研究有效指导工作流程

2507.04491v1

443

07-06

Does Overnight News Explain Overnight Returns?

Erklären die Nachrichten über Nacht die Rückkehr?

夜间新闻解释夜间回归吗?

2507.04481v1

444

07-06

The role of large language models in UI/UX design: A systematic literature review

Die Rolle großer Sprachmodelle im UI/UX-Design: Ein systematischer Literaturbericht

大语言模型在UI/UX设计中的作用:系统文献审查

2507.04469v1

445

07-06

Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection

Dual Modality-Aware Gated Prompt Tuning für wenige-shot multimodale Sarkasmus-Erkennung

用于小热多式乱石探测的两极式器件器械 Gedt 快速计时

2507.04468v1

446

07-06

Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection

Denken Sie zweimal vor Ihnen Richter: Mischung von Dual-Reasoning-Experten für multimodale Sarkasmus-Erkennung

在你面前三思而后三思,法官:多式讽刺探测双重理由专家的混合

2507.04458v1

447

07-06

GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models

GradOT: Training-frei Gradient-erhaltendes Offsite-Tuning für große Sprachmodelle

GradOT: 大语言模型的无培训渐进式非现场调整

2507.04455v1

448

07-06

Dude, where’s my utterance? Evaluating the effects of automatic segmentation and transcription on CPS detection

Alter, wo ist meine Äußerung? Bewertung der Auswirkungen der automatischen Segmentierung und Transkription auf CPS-Erkennung

评估自动分解和转录对CPS检测的影响

2507.04454v1

449

07-06

KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality

KnowRL: Erforschendes Wissenswertes Verstärktes Lernen für die Realität

KnowRL:探索知识强化学习促进事实质量

2506.19807v2

450

07-06

AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

AutoMind: Adaptives Knowledgeable Agent für automatisierte Datenwissenschaft

自动Mind:自动数据科学适应性知识代理

2506.10974v2

451

07-06

A Linguistic Analysis of Spontaneous Thoughts: Investigating Experiences of Déjà Vu, Unexpected Thoughts, and Involuntary Autobiographical Memories

Eine linguistische Analyse von spontanen Gedanken: Untersuchung von Erfahrungen von Déjà Vu, unerwarteten Gedanken und unfreiwilligen autobiographischen Erinnerungen

对自发思想的语言分析:调查Déjá Vu、意外思想和非自愿自传记忆的经验

2507.04439v1

452

07-06

Reconstructing Biological Pathways by Applying Selective Incremental Learning to (Very) Small Language Models

Rekonstruieren biologischer Pfade durch Anwendung selektiven Inkrementellen Lernens auf (Sehr) kleine Sprachmodelle

通过将选择性增量学习应用到(老式)小型语言模式,重建生物途径

2507.04432v1

453

07-06

RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling

RAT: RNN-Effizienz und Aufmerksamkeitsgenauigkeit in der Sprachmodellierung überbrücken

RAT:在语言建模中弥补区域NNN效率和注意的准确性

2507.04416v1

454

07-06

MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind

MOMENTE: Ein umfassender multimodaler Benchmark für die Theorie des Geistes

修改:关于精神理论的综合多模式基准

2507.04415v1

455

07-06

THM@SimpleText 2025 – Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts

THM@SimpleText 2025 – Aufgabe 1.1: Textvereinfachung auf Basis komplexer Bedingungen für Nicht-Experten

THM@SPIText 2025-任务1.1:根据非专家的复杂条件重新审议文本简化

2507.04414v1

456

07-06

Pretraining Language Models Using Translationese

Vorschulung von Sprachmodellen mit Translationese

使用笔译语言的语文模型

2403.13638v3

457

07-06

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Wahrnehmung, Vernunft, Denken und Plan: Eine Umfrage zu großen multimodalen Vernunftmodellen

观念、理性、思考和计划:关于大型多式联运理由模型的调查

2505.04921v2

458

07-06

Have LLMs Made Active Learning Obsolete? Surveying the NLP Community

Haben LLMs aktives Lernen überholt? Vermessung der NLP-Gemeinschaft

LLM女士是否做过积极学习过时?

2503.09701v2

459

07-06

Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

Besser sicher als entschuldigt? Überreaktion Problem der Vision Sprachmodelle in der visuellen Notfallerkennung

是否比抱歉更安全?在视觉紧急识别中过度反应的视觉语言模型问题

2505.15367v2

460

07-06

SpiritRAG: A Q&A System for Religion and Spirituality in the United Nations Archive

SpiritRAG: Ein Q&A-System für Religion und Spiritualität im Archiv der Vereinten Nationen

SpirirRAG:联合国档案馆中的宗教和精神系统

2507.04395v1

461

07-06

Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning?

Hat das Lernen mathematische Problem-Solving Generalisieren, um breitere Vernunft?

学习数学问题解决是否普遍到更广义的理由?

2507.04391v1

462

07-06

Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions

Multi-Modal Semantische Parsing für die Interpretation von Grabsteininschriften

用于解释墓碑记的多模式语义分析

2507.04377v1

463

07-06

Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs

Achtung Slipping: Ein mechanistisches Verständnis von Jailbreak Attacken und Verteidigungen in LLMs

注意力滑动:对非法侵入监狱袭击和自卫的机械理解

2507.04365v1

464

07-06

Emergent misalignment as prompt sensitivity: A research note

Emergente Fehlausrichtung als prompte Empfindlichkeit: Eine Forschungsnote

新出现的不协调作为迅速的敏感性:研究说明

2507.06253v1

465

07-06

Large Language Models’ Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products

Variierende Genauigkeit von Large Language Models bei der Anerkennung von risikofördernden und gesundheitsfördernden Sentiments im öffentlichen Gesundheitsdiskurs: Die Fälle von HPV-Impfung und Beheizten Tabakprodukten

大语言模型在认识公共卫生会议中风险促进和健康支持情感的认知的准确性差异:HPV疫苗接种和加热烟草制品案例

2507.04364v1

466

07-06

HatePRISM: Policies, Platforms, and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation

HatePRISM: Politik, Plattformen und Forschungsintegration. Förderung von NLP für eine proaktive Abmilderung von Hassreden

HatePRISM:政策、平台和研究一体化。推进国家消除仇恨言论计划。

2507.04350v1

467

07-06

SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control

SmartThinker: Lernen, um zu komprimieren und zu bewahren Vernunft durch Schritt-Level-Length Control

SmartThinker: 学会按职级长长控制进行压缩和保留理由

2507.04348v1

468

07-06

Computed Tomography Visual Question Answering with Cross-modal Feature Graphing

Berechnete Tomographie Visuelle Fragebeantwortung mit modaler Feature-Graphing

计算成的地形学视觉问题,以跨模式特点图解回答

2507.04333v1

469

07-06

Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Erklären Sie mit Visual Keypoints wie ein echter Mentor! Ein Benchmark für multimodale Lösung Erklärung

以像真正的导师一样的视觉关键点解释! 多模式解决方案解释的基准

2504.03197v3

470

07-06

No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Keine Sprachdaten übrig: Eine vergleichende Studie von CJK-Sprachendatensätzen im Hugging Face Ecosystem

无语言数据:CJK 载荷面生态系统中语言数据集比较研究

2507.04329v1

471

07-06

HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs

HKCanto-Eval: Benchmark für die Bewertung des kantonesischen Sprachverständnisses und kulturellen Verständnisses in LLMs

HKCanto-Eval:评估广东语言理解和文化理解基准

2503.12440v2

472

07-06

A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment

Ein ursächliches Weltmodell für die nächste Token-Vorhersage: GPT in einer kontrollierten Umgebung erforschen

由下到下预测所根据的因果世界模式:在受控制的环境中探索GPT

2412.07446v4

473

07-06

A Global Context Mechanism for Sequence Labeling

Ein globaler Kontextmechanismus für die Sequenzkennzeichnung

序列标签全球背景机制

2305.19928v6

474

07-06

LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop

LearnLens: LLM-Enabled Personalisiertes, Curriculum-gerundetes Feedback mit Erziehern im Loop

学习栏:LLM-能够个性化的LLM课程、课程与环中教育工作者的反馈

2507.04295v1

475

07-06

Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP

Ist es wirklich ein langer Kontext, wenn alles, was man braucht, wieder aufgenommen wird?

您需要的检索是真实的长长范围吗 ? 向真正困难的长宽范围 NLP

2407.00402v4

476

07-06

LLMs’ Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements

LLMs Leseverständnis wird durch parametrisches Wissen beeinflusst und kämpft mit hypothetischen Aussagen

LLM女士的阅读理解受到参数知识和与假言斗争的影响

2404.06283v2

477

07-06

GLU Attention Improve Transformer

GLU Aufmerksamkeit verbessern Transformer

GLU 关注改善变换器

2507.00022v2

478

07-06

MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models

MASS: Mathematische Datenauswahl über Skill Graphs zur Vorschulung großer Sprachmodelle

MASS:通过培训前大语言模型技能图选择数学数据

2503.14917v2

479

07-06

EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models

EasyEdit2: Ein einfach zu bedienender Steuerungsrahmen für die Bearbeitung großer Sprachmodelle

EasyEdit2: 编辑大语言模式的易用指导框架

2504.15133v2

480

07-06

On Positional Bias of Faithfulness for Long-form Summarization

Auf Positionale Bias der Treue für lange Form Zusammenfassung

关于 “ 长期概述的信仰偏见 “ 的立场

2410.23609v2

481

07-06

Fairness Evaluation of Large Language Models in Academic Library Reference Services

Fairness-Evaluierung von großen Sprachmodellen in wissenschaftlichen Bibliotheksreferenzdiensten

学术图书馆参考资料服务大语言模型公平评价

2507.04224v1

482

07-06

Context Tuning for In-Context Optimization

Kontext-Tuning für In-Context-Optimierung

Intext 优化的上下文图示

2507.04221v1

483

07-06

Towards Understanding the Cognitive Habits of Large Reasoning Models

Auf dem Weg zum Verständnis der kognitiven Gewohnheiten großer vernünftiger Modelle

了解大理由模型的认知利用

2506.21571v2

484

07-06

SCALE: Towards Collaborative Content Analysis in Social Science with Large Language Model Agents and Human Intervention

SCALE: Auf dem Weg zu einer kollaborativen Inhaltsanalyse in der Sozialwissenschaft mit großen Sprachmodellen und menschlicher Intervention

SCALE:促进社会科学与大语言模范媒介和人类干预合作内容分析

2502.10937v2

485

07-05 (6)

SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding

SymbolicThought: Integration von Sprachmodellen und symbolischer Begründung für ein konsequentes und interpretierbares menschliches Beziehungsverständnis

象征性探索:整合语文模式和符号理由,促进一致和可解释的人类关系理解

2507.04189v1

486

07-05

Phonetic Reconstruction of the Consonant System of Middle Chinese via Mixed Integer Optimization

Phonetische Rekonstruktion des Konsonantensystems des Mittelchinesischen durch gemischte Integer-Optimierung

通过混合整数优化恢复中中中中中中中中中

2502.04625v2

487

07-05

Navigating Speech Recording Collections with AI-Generated Illustrations

Navigierende Sprachaufzeichnungssammlungen mit KI-generierten Illustrationen

使用 AI 生成的说明管理语音录制收藏

2507.04182v1

488

07-05

Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks

Frustrierend Einfaches Retrieval verbessert anspruchsvolle, vernünftig-intensive Benchmarks

令人沮丧的简单检索改进挑战、理由说明和密集基准

2507.01297v2

489

07-05

Large Language Models for Zero-Shot Multicultural Name Recognition

Große Sprachmodelle für die Null-Shot Multicultural Name Recognition

零热多种文化名称识别大语言模式

2507.04149v1

490

07-05

The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure

Das Medium ist nicht die Botschaft: Deconfounding Text-Embeddings via Linear Concept Erasure

介质不是信息:通过线性概念时代的沉降文本嵌入

2507.01234v2

491

07-05

Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies

Dissecting Clinical Reasoning in Language Models: Eine vergleichende Studie über Prompts und Modellanpassungsstrategien

语言模型临床理由解剖:快速比较研究和示范适应战略

2507.04142v1

492

07-05

From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models

Von der Absicht zur Technik: Eine umfassende Taxonomie und Herausforderungen bei der Text-Wasserzeichenbildung für große Sprachmodelle

从意图到技术:用于大语言模式的文本水标记综合分类学和挑战

2406.11106v2

493

07-05

Token Level Hallucination Detection via Variance in Language Models

Token Level Halluzination Erkennung durch Varianz in Sprachmodellen

通过语言模式差异探测洞穴级幻觉

2507.04137v1

494

07-05

An HTR-LLM Workflow for High-Accuracy Transcription and Analysis of Abbreviated Latin Court Hand

Ein HTR-LLM-Workflow für hochgenaue Transkription und Analyse von verkürzter lateinischer Hofhand

HTR-LLLM 高准确度加密和分析缩写拉丁法院手的工作流程

2507.04132v1

495

07-05

In-Context Meta LoRA Generation

Meta-LoRA-Generation im Kontext

内文元LoRA 生成

2501.17635v3

496

07-05

BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

BYOKG-RAG:知识图解答问题多战略图检索

2507.04127v1

497

07-05

Relational inductive biases on attention mechanisms

Relationale induktive Verzerrungen der Aufmerksamkeitsmechanismen

对关注机制的关系诱导偏见

2507.04117v1

498

07-05

Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching

Conversation Forests: Der Schlüssel zur Feinabstimmung großer Sprachmodelle für multi-Turn medizinische Gespräche ist die Verzweigung

对话森林:对多发医学对话的大型语言模型进行精微投资的关键是分流

2507.04099v1

499

07-05

MMMOS: Multi-domain Multi-axis Audio Quality Assessment

MMMOS: Multi-Domain Multi-Axis-Audio-Qualitätsbewertung

MMMOS:多领域多轴音频质量评估

2507.04094v1

500

07-05

FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

FairSteer: Inferenzzeit-Debiasing für LLMs mit dynamischer Aktivierungssteuerung

Fairsteer:带动态激活指导的LLMs 推论时间偏差

2504.14492v2

501

07-05

XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models

XISM: ein eXploratory and Interactive Graph Tool zum Visualisieren und Evaluieren von Semantischen Kartenmodellen

XISSM:一个可视和评估语义地图模型的振动和交互式图像工具

2507.04070v1

502

07-05

Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering

Jenseits von unabhängigen Passagen: Adaptive Passage-Kombinations-Retrieval für retrieval Augmented Open-Domain-Fragebeantwortung

超过独立通过通道: 用于检索的适应性被动通道组合检索, 以获取更多开放域问题解答

2507.04069v1

503

07-05

On Fusing ChatGPT and Ensemble Learning in Discon-tinuous Named Entity Recognition in Health Corpora

Über das Verschmelzen von ChatGPT und Ensemble Lernen in diskontinuierlicher, benannter Entitätserkennung in Health Corpora

关于在卫生团体中不同名称称为实体的识别中进行阻燃聊天和组合学习

2412.16976v2

504

07-05

FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models

FinBERT2: Ein spezieller bidirektionaler Encoder zur Bridging the Gap in Finance-Spezific Deployment of Large Language Models

FinBERT2: 缩小在财务方面差距的专门双向双向编码器,具体部署大语言模式

2506.06335v2

505

07-05

Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk?

KI für Finanzen bewerten: Ist KI bei der Bewertung von Investitionsrisiken glaubwürdig?

评估大赦国际的融资:AI在评估投资风险方面是否可信?

2505.18953v2

506

07-05

Patient-Centered RAG for Oncology Visit Aid Following the Ottawa Decision Guide

Patientenzentrierte RAG für Onkologie-Besuchshilfe nach dem Ottawa-Entscheidungsleitfaden

遵循渥太华决定指南,以病人为中心,为肿瘤科访问援助提供RAG

2507.04026v1

507

07-05

Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance

Vergleichende Bewertung von ChatGPT und DeepSeek über zentrale NLP-Aufgaben: Stärken, Schwächen und Domain-spezifische Leistung

国家劳工政策关键任务:力量、弱点和具体具体绩效

2506.18501v3

508

07-05

LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models

LLMThinkBench: Auf dem Weg zu grundlegenden mathematischen Überlegungen und Überdenken in großen Sprachmodellen

LLMTHink Bench:在大语言模型中走向基本数学理由和过度思考

2507.04023v1

509

07-05

Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning

Umgang mit koreanischen Out-of-Vocabulary-Worte mit Phoneme Repräsentanz Lernen

以电话代言学习方式处理韩语外词汇

2507.04018v1

510

07-05

Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition

Nunchi-Bench: Benchmarking von Sprachmodellen zur kulturellen Vernunft mit Fokus auf koreanischen Aberglauben

Nunchi-Nunchi-Bench:以韩国迷信为重点的文化理由语言模式基准模式

2507.04014v1

511

07-05

Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Easy Dataset: Ein einheitliches und erweiterbares Framework zur Synthese von LLM-Feineinstellungsdaten aus unstrukturierten Dokumenten

简易数据集:从无结构文档中合成LLM精密调试数据的统一和扩展框架

2507.04009v1

512

07-05

A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models

Eine Umfrage über proaktive Verteidigungsstrategien gegen Fehlinformationen in großen Sprachmodellen

关于针对大语言模型中错误信息的积极防御战略的调查

2507.05288v1

513

07-05

UNITYAI-GUARD: Pioneering Toxicity Detection Across Low-Resource Indian Languages

UNITYAI-GUARD: Pionierische Toxizitätserkennung über Low-Resource-Indische Sprachen

UNITEAAI-GUARD:通过低资源印度语言首次发现有毒毒性

2503.23088v2

514

07-05

Improving RL Exploration for LLM Reasoning through Retrospective Replay

Verbesserung der RL Exploration für LLM Reasoning durch retrospektive Replay

通过回溯性回放改进LL探索LLM

2504.14363v2

515

07-05

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Open-Reasoner-Zero: Ein Open Source Ansatz zur Skalierung von Verstärkungslernen am Basismodell

开放反馈-零:基于基础模式的扩大强化学习的开放源码方法

2503.24290v2

516

07-05

A Comparative Study of Specialized LLMs as Dense Retrievers

Eine vergleichende Studie von spezialisierten LLMs als Dense Retriever

专业专利妇女作为 “ 大量开发 “ 的比较研究

2507.03958v1

517

07-05

A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text

Modulares unüberwachtes Framework zur Attributerkennung aus unstrukturiertem Text

从无结构文本中识别属性的模块化未经监督的框架

2507.03949v1

518

07-05

Dynamic Injection of Entity Knowledge into Dense Retrievers

Dynamische Einspritzung von Wissen der Entität in Dense Retriever

将实体知识动态投入重复内容

2507.03922v1

519

07-05

Play to Generalize: Learning to Reason Through Game Play

Spielen Sie Generalize: Lernen, Vernunft durch Spiel zu lernen

玩一般游戏: 通过玩游戏学习理性

2506.08011v3

520

07-05

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

Sequoia: Skalierbares, Robustes und Hardware-gestütztes Spekulatives Dekodieren

塞奎亚:可缩放、强力和硬件—-有意识的投机性代谢

2402.12374v3

521

07-05

Demystifying ChatGPT: How It Masters Genre Recognition

ChatGPT entmystifizieren: Wie es die Genreerkennung beherrscht

解密的聊天:如何使各教派获得承认

2507.03875v1

522

07-05

OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference

OrthoRank: Tokenauswahl über Sink Token Orthogonalität für effiziente LLM-Inferenz

OrthoRank:通过 Sink Token 条形对高效LLM 推论进行当量选择

2507.03865v1

523

07-04 (5)

Voice of a Continent: Mapping Africa’s Speech Technology Frontier

Stimme eines Kontinents: Afrikas Rede-Technologie-Grenze kartieren

非洲大陆之声:测绘非洲语音技术前沿

2505.18436v3

524

07-04

Relation-Aware Network with Attention-Based Loss for Few-Shot Knowledge Graph Completion

Beziehungs-Aware-Netzwerk mit aufmerksamkeitsbasiertem Verlust für wenige-heiße Wissensgraphen-Vervollständigung

几乎没有热知识图补全的基于关注的损失关系软件网络

2306.09519v2

525

07-04

Beyond classical and contemporary models: a transformative ai framework for student dropout prediction in distance learning using rag, prompt engineering, and cross-modal fusion

Über klassische und zeitgenössische Modelle hinaus: ein transformatives Ai-Framework für die Studienabbrechervorhersage im Fernunterricht mittels Rag, Prompt Engineering und modaler Fusion

超越古典和当代模式:利用抹布、即时工程和跨模式融合法在远程学习中预测学生辍学情况的变革性Ai框架

2507.05285v1

526

07-04

Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation

Auf dem Weg zu einer fairen RAG: Über die Auswirkungen des fairen Rankings in der retrieval-Augmentierten Generation

走向公平的RAG:公平评级对回升一代人的影响

2409.11598v4

527

07-04

Parsing Through Boundaries in Chinese Word Segmentation

Parsing durch Grenzen in chinesischen Wortsegmentierung

中文单词分割的边界分割

2503.23091v2

528

07-04

Tip of the Tongue Query Elicitation for Simulated Evaluation

Tipp der Zungenabfrage-Elikation für Simulierte Auswertung

用于模拟评价的 “ 舌头查询参考 “ 提示

2502.17776v2

529

07-04

Alpay Algebra IV: Symbiotic Semantics and the Fixed-Point Convergence of Observer Embeddings

Alpay Algebra IV: Symbiotische Semantik und die Fixed-Point-Konvergenz von Beobachter-Embeddings

替代代数四:共生语义学和观察员嵌入的固定点汇合

2507.03774v1

530

07-04

Agent-Based Detection and Resolution of Incompleteness and Ambiguity in Interactions with Large Language Models

Agentbasierte Erkennung und Auflösung von Unvollständigkeit und Ambiguität in Interaktionen mit großen Sprachmodellen

与大语言模式互动中不完全和不易的代理检测和分辨率

2507.03726v1

531

07-04

ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models

ODE: Offene Bewertung von Halluzinationen in multimodalen großen Sprachmodellen

ODE:多模式大语言模型中的幻觉公开评价

2409.09318v4

532

07-04

Prompting as Scientific Inquiry

Als wissenschaftliche Untersuchung prompt

作为科学调查

2507.00163v2

533

07-04

Controlling Thinking Speed in Reasoning Models

Kontrolle der Denkgeschwindigkeit in vernünftigen Modellen

控制理性模型的思考速度

2507.03704v1

534

07-04

PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping

PHEONA: Ein Bewertungsrahmen für großsprachliche modellbasierte Ansätze zur computergestützten Phenotypisierung

PHEONA:基于大语言模型的计算基因组学方法评价框架

2503.19265v3

535

07-04

A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and Applications

Eine Umfrage von großen Sprachmodellen zu generativen Graph Analytics: Abfragen, Lernen und Anwendungen

关于 “ 生成图表分析分析:查询、学习和应用 “ 的大型语言模型调查

2404.14809v2

536

07-04

The Geometries of Truth Are Orthogonal Across Tasks

Die Geometrien der Wahrheit sind orthogonal über Aufgaben hinweg

真理的地貌不同任务

2506.08572v2

537

07-04

iNews: A Multimodal Dataset for Modeling Personalized Affective Responses to News

iNews: Ein multimodaler Datensatz für die Modellierung personalisierter Affective Responses to News

iNews: 用于模拟个人对新闻的情感反应的多模式数据集

2503.03335v2

538

07-04

STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

STRUKTSENSE: Ein Task-Agnostic Agentic Framework für strukturierte Informationsextraktion mit Mensch-in-The-Loop-Evaluierung und Benchmarking

SUCTSENESE: 结构化信息抽取与 “ 网上人 “ 评价和基准确定

2507.03674v1

539

07-04

TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection

TACOS: Offenes Tagging und vergleichendes Scoring für die Auswahl von Feintuning-Daten

TACOS: 用于指示精确调整数据选择的公开贴贴和比较比标

2507.03673v1

540

07-04

Recon, Answer, Verify: Agents in Search of Truth

Recon, Antwort, Überprüfung: Agenten auf der Suche nach Wahrheit

调查、答复、核查:寻找真相的特工

2507.03671v1

541

07-04

Interaction Techniques that Encourage Longer Prompts Can Improve Psychological Ownership when Writing with AI

Interaktionstechniken, die länger prompts fördern, können die Psychologische Eigentümerschaft verbessern, wenn sie mit KI schreiben

鼓励较长时间提示的交互技术在与大赦国际撰写文章时能够改善心理自主性

2507.03670v1

542

07-04

TRACE: Training and Inference-Time Interpretability Analysis for Language Models

TRACE: Trainings- und Inferenz-Zeit-Interpretabilitätsanalyse für Sprachmodelle

TRACE: 语文模式培训和推断-时间解释性分析

2507.03668v1

543

07-04

Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs

Re-Emergent-Missausrichtung: Wie schmale Feintuning-Erodes Sicherheit Ausrichtung in LLMs

重新出现错配: LLM 中如何将安全对齐

2507.03662v1

544

07-04

Exploring Robustness of LLMs to Paraphrasing Based on Sociodemographic Factors

Erforschung der Robustheit von LLMs zur Paraphrasierung auf der Grundlage soziodemografischer Faktoren

探讨根据社会人口因素将长效LLMs

2501.08276v2

545

07-04

Disentangling the Roles of Representation and Selection in Data Pruning

Entwirren der Rollen von Repräsentation und Auswahl bei der Datenkorrektur

分离数据谨慎中的代表和甄选作用

2507.03648v1

546

07-04

ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving

ArithmAttack: Bewertung der Robustheit von LLMs zum geräuschvollen Kontext bei der Lösung von Math-Problemen

ArithmAttack:评估理学问题解答时对吵闹背景的LLMs强力

2501.08203v2

547

07-04

Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion

Verbesserung der Low-Resource-Dialektenklassifikation mittels Retrieval-basierter Sprachumwandlung

利用以检索为基础的语音转换改进低资源

2507.03641v1

548

07-04

Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy

Ist es Zeit, Prompts als Code zu behandeln? Eine Multi-Use-Fallstudie für Prompt-Optimierung mit DSPy

是否是时候将提示作为代码处理? 使用 DSPy 快速优化的多用途案例研究

2507.03620v1

549

07-04

EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge

EMERGE: Ein Benchmark für die Aktualisierung von Wissensgraphen mit neu entstehendem Textwissen

EMEGERGE: 更新具有新兴文字知识的知识图表的基准

2507.03617v1

550

07-04

RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning

RAG+: Verbesserung der Retrieval-Augmented Generation mit anwendungsrelevanter Begründung

RAG+:加强利用应用程序软件软件软件软件支持的检索-启动生成

2506.11555v3

551

07-04

ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs

ALAS: Latente Sprach-Text-Ausrichtung für gesprochenes Sprachverständnis in multimodalen LLMs messen

ALAS: 计量多种模式LM 中口语语言理解的暗中语音-文本对齐

2505.19937v2

552

07-04

Multi-Hop Reasoning for Question Answering with Hyperbolic Representations

Multi-Hop-Gründung für die Beantwortung von Fragen mit hyperbolischen Darstellungen

使用超曲解表示式回答问题多理由

2507.03612v1

553

07-04

ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation

ViMRHP: Ein vietnamesischer Benchmark-Datensatz für die multimodale Überprüfung von Hilfsvorhersage über Human-AI Collaborative Annotation

VIMRHP:越南通过人类 – – AI合作说明,为多模式审查 “ 助助预测 “ 建立一个越南基准数据集

2505.07416v2

554

07-04

Should We Still Pretrain Encoders with Masked Language Modeling?

Sollten wir noch Encoder mit maskierten Sprachmodellen vortrainieren?

我们是否仍应该为带有隐蔽语言建模的编程者预作准备?

2507.00994v2

555

07-04

RECA-PD: A Robust Explainable Cross-Attention Method for Speech-based Parkinson’s Disease Classification

RECA-PD: Robuste, erklärbare Cross-Attention-Methode zur Klassifizierung sprachbasierter Parkinson-Krankheiten

REECA-PD: 以语音为基础的帕金森氏病分类的强有力的可解释交叉注意方法

2507.03594v1

556

07-04

Evaluating AI capabilities in detecting conspiracy theories on YouTube

Bewertung von KI-Fähigkeiten bei der Entdeckung von Verschwörungstheorien auf YouTube

评价大赦国际在YouTube上发现阴谋论的能力

2505.23570v2

557

07-04

Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation

Causal-SAM-LLM: Große Sprachmodelle als Ursache für robuste medizinische Segmentierung

Causal-SAM-LLLM:大语言模型,作为强力医疗分解的原因

2507.03585v1

558

07-04

Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits

Ambiguous Terminologie von Preference Optimization auf Post-Edits übersetzen lernen

学习如何通过“优先优化”在编辑后采用“优先优化”来翻译模糊的名词

2507.03580v1

559

07-04

Articulatory clarity and variability before and after surgery for tongue cancer

Artikulatorische Klarheit und Variabilität vor und nach der Operation bei Zungenkrebs

舌癌手术前后和手术前后的产道清晰度和变异性

2507.03576v1

560

07-04

ReviewInstruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models

ReviewInstruct: Eine review-getriebene Multi-Turn-Konversations-Erstellungsmethode für große Sprachmodelle

审查指南:大语言模型的复议-驱动多发多发对话生成方法

2505.11010v2

561

07-04

H2HTalk: Evaluating Large Language Models as Emotional Companion

H2HTalk: Große Sprachmodelle als emotionalen Begleiter bewerten

H2H Talk:评价作为情感伴侣的大语言模式

2507.03543v1

562

07-04

Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation

Aussprache-Lexikon Kostenloses Training für Phoneme-basierte Crosslingual ASR über Joint Stochastic Approximation

通过联合口交近为基于电话的跨语言ASR提供发音免费培训

2507.06249v1

563

07-04

AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions

AI-VaxGuide: Agentische RAG-basierte LLM für Impfentscheidungen

AI-VaxGuide:用于疫苗接种决定的以RAG为基地的AAG剂LLMLM

2507.03493v1

564

07-04

Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences

Vier Schattierungen der Lebenswissenschaften: Ein Datensatz für Desinformationserkennung in den Lebenswissenschaften

《生命科学的四大结构:生命科学中虚假信息探测数据集》

2507.03488v1

565

07-04

Chat2SPaT: A Large Language Model Based Tool for Automating Traffic Signal Control Plan Management

Chat2SPaT: Ein auf großen Sprachmodellen basierendes Tool zur Automatisierung der Verkehrssignalsteuerung

Chat2SPaaT:以大语言为基础的交通信号控制计划管理自动化示范工具

2507.05283v1

566

07-04

Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right

Beyond Weaponization: NLP-Sicherheit für mittel- und minderwertige Sprachen im eigenen Recht

武器化之后的武器化之外:国家语言方案保障中、低资源语言自身权利的安全

2507.03473v1

567

07-04

CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark

CoreCodeBench: Ein konfigurierbarer Multi-Szenario-Repository-Level-Benchmark

核心守则:可配置的多设想仓库一级基准

2507.05281v1

568

07-04

Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models

Verbesserung der sozialen Determinanten der Gesundheitsdokumentation in französischen EHRs mit großen Sprachmodellen

利用大语言模式,改善法文健康文件的 “ 健康文件 “ 社会决定因素

2507.03433v1

569

07-04

Tokenization is Sensitive to Language Variation

Tokenisierung ist auf Sprachvariation sensitiv

语法变换对语法变化敏感

2502.15343v2

570

07-04

SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation

SMCLM: Semantisch sinnvolle Kausale Sprachmodellierung für autoregressive Paraphrase-Generierung

SMCLM: 用于自动递减性参数生成的具有广泛意义的原因语言建模

2507.03415v1

571

07-04

Graph Repairs with Large Language Models: An Empirical Study

Graphische Reparaturen mit großen Sprachmodellen: Eine empirische Studie

使用大语言模型修图:经验研究

2507.03410v1

572

07-04

ReservoirChat: Interactive Documentation Enhanced with LLM and Knowledge Graph for ReservoirPy

ReservoirChat: Interaktive Dokumentation mit LLM und Wissensdiagramm für ReservoirPy

RESSOCWChat:与LLM和知识图增强互动文件

2507.05279v1

573

07-04

On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages

Zur Expressivität und Längenverallgemeinerung von selektiven State-Space-Modellen auf regulären Sprachen

关于常规语言选择性国家空间模型的表达性和时间长度的一般化

2412.19350v2

574

07-04

Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Verallgemeinerung oder Halluzination? Verstehen von Out-of-Context-Reasoning in Transformers

通化还是幻觉? 理解变异器的逻辑外原因

2506.10887v2

575

07-04

Entity-aware Cross-lingual Claim Detection for Automated Fact-checking

Entity-aware Cross-lingual Claim Detection for Automated Fact-Checking

用于自动实况调查的有实体意识的跨语言交叉索赔调查

2503.15220v3

576

07-04

EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization

EAQuant: Verbesserung der Post-Training-Quantisierung für MoE-Modelle durch Experten-Aware-Optimierung

EAQuant:通过专家-软件优化,加强培训后对教育部模型的量化

2506.13329v2

577

07-04

Making Sense of Korean Sentences: A Comprehensive Evaluation of LLMs through KoSEnd Dataset

Gespür für koreanische Sätze: Eine umfassende Bewertung von LLMs durch KoSEnd-Datensatz

韩国判刑感:通过KoSend数据集对LLMs的全面评估

2507.03378v1

578

07-04

Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections

Implizite Belohnung als Brücke: Eine einheitliche Ansicht von SFT- und DPO-Verbindungen

隐含奖励为桥梁:SFT和DPO连接的统一观点

2507.00018v2

579

07-04

WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia

WETBench: Ein Benchmark für die Erkennung aufgabenspezifischer maschinengenerierter Texte auf Wikipedia

WETBENCH: 维基百科中检测特定任务机器生成文本的基准

2507.03373v1

580

07-04

SS-GEN: A Social Story Generation Framework with Large Language Models

SS-GEN: Ein Social Story Generation Framework mit großen Sprachmodellen

SS-GEN:具有大语言模式的社会历史传承框架

2406.15695v3

581

07-04

Backtesting Sentiment Signals for Trading: Evaluating the Viability of Alpha Generation from Sentiment Analysis

Backtesting Sentiment Signals for Trading: Bewertung der Lebensfähigkeit der Alpha-Generation aus der Sentiment-Analyse

贸易的后测试感应信号:从感应分析中评估阿尔法一代的可行性

2507.03350v1

582

07-04

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Disambiguation-Centric Finetuning macht Enterprise Tool-Calling LLMs realistischer und weniger riskant

差异性-中心微调使企业用工具召集LLM 更现实、风险更小

2507.03336v1

583

07-04

Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

Hunyuan-TurboS: Große Sprachmodelle durch Mamba-Transformer-Synergie und adaptive Ketten-of-Thought

Hunyuan-TurboS:通过Mamba-Transed协同和适应性研究链推进大语言模型

2505.15431v3

584

07-04

MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection

MT2-CSD: Eine neue Datensatz- und Multi-Semantic Knowledge Fusion Methode zur konversatorischen Stance-Erkennung

MT2-CSD: 用于语音稳定探测的新数据集和多语层知识融合方法

2506.21053v2

585

07-04

Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs

Lest ruhig, denkt laut: Entkoppelung von Verständnis und Vernunft in LLMs

静静地阅读,想阿卢德:解开迷惑和理性

2507.03327v1

586

07-04

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

OpenRLHF: Ein einfach zu bedienendes, skalierbares und leistungsstarkes RLHF-Framework

OpenRLHF: 易于使用、可缩放和高绩效RLHF框架

2405.11143v5

587

07-04

GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation

GRAFT: Ein Graph-basiertes Flow-aware Agentic Framework für die maschinelle Übersetzung auf Dokumentebene

GRAFT: 以图表为基础的文件级机器翻译流动感动剂框架

2507.03311v1

588

07-04

LTLCrit: A Temporal Logic-based LLM Critic for Safe and Efficient Embodied Agents

LTLCrit: Ein auf Temporal Logic basierender LLM-Kritiker für sichere und effiziente einkörperige Wirkstoffe

LLLCritri: 一种基于时时逻辑的LLM LLM 用于安全和高效装药剂的Criticr

2507.03293v1

589

07-04

NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

NTU Speechlab LLM-basiertes Mehrsprachiges ASR-System für MLC-SLM Challenge 2025

NTU Spearelab LLM-为2025年刚果解放运动-解运间对话挑战使用多种语言的ASR系统

2506.13339v2

590

07-04

Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR

Bi-direktionale Kontext-verbesserte Sprache Große Sprachmodelle für mehrsprachige Konversations-ASR

多语言对话的ASR双向双向背景强化语言大语言模型

2506.13396v2

591

07-04

REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models

REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle

REINFORCE++: 高效的RLHF对快速模型和奖励模型具有强力的测算法

2501.03262v5

592

07-04

MoralBench: Moral Evaluation of LLMs

MoralBench: Moralische Bewertung von LLMs

道德:对LLMMs的道德评价

2406.04428v2

593

07-04

GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning

GDGB: Ein Benchmark für generatives dynamisches Text-Attributed Graph Learning

GDGB: 产生动态动态文本图表学习的基准

2507.03267v1

594

07-04

KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation

KinyaColBERT: Ein Lexisch geerdetes Retrieval-Modell für Low-Resource Retrieval-Augmented Generation

KinyaColcolBERT: 低资源回收-提款一代的以地基为基数的回收模型

2507.03241v1

595

07-04

Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models

Neural Diskrete Token Repräsentanz Lernen für extreme Token Reduction in Video Große Sprachmodelle

在视频大语言模式中进行关于极端减少视频大语言模式的神经分辨代言人学习

2503.16980v4

596

07-04

A Lie-algebraic perspective on Tree-Adjoining Grammars

Eine lüge-algebraische Perspektive auf Baum-angrenzende Grammatiken

有关树树边语法的字眼视角

2507.03234v1

597

07-04

AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs

AgentPS: Agentische Prozessüberwachung für Content Moderation mit multimodalen LLMs

PS: MMMM 内容调控的代理程序监督

2412.15251v2

598

07-04

7B Fully Open Source Moxin-LLM/VLM – From Pretraining to GRPO-based Reinforcement Learning Enhancement

7B Vollständig offene Quelle Moxin-LLM/VLM – Vom Vortraining zur GRPO-basierten Verstärkungs-Lernverbesserung

7B 完全开放源代码 Moxin-LLM/VLM – – 从预培训到基于GROP的强化学习加强

2412.06845v6

599

07-03 (4)

VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

VOCABTRIM: Vokabelabgleich für effizientes spekulatives Decodieren in LLMs

VOCABTRIM: 有效投机下限的词汇

2506.22694v2

600

07-03

Towards Universal Semantics With Large Language Models

Hin zu universeller Semantik mit großen Sprachmodellen

走向具有大语言模式的普遍语义

2505.11764v3

601

07-03

How Much Content Do LLMs Generate That Induces Cognitive Bias in Users?

Wie viel Inhalt erzeugen LLMs, die Kognitive Bias bei Benutzern erzeugen?

LLMs如何产生使用者中诱导的认知性偏见的内容?

2507.03194v1

602

07-03

Adversarial Manipulation of Reasoning Models using Internal Representations

Adversariale Manipulation von Vernunftmodellen mittels interner Repräsentationen

采用内部代表制对理性模型进行反向调整

2507.03167v1

603

07-03

MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks

MateInfoUB: Ein echter Benchmark für die Prüfung von LLMs in kompetitiven, mehrsprachigen und multimodalen Bildungsaufgaben

MateInfoUB:竞争、多种语文和多模式教育任务中测试LMLMS的现实世界基准

2507.03162v1

604

07-03

Expert-level validation of AI-generated medical text with scalable language models

Validierung von KI-generierten medizinischen Texten auf Expertenebene mit skalierbaren Sprachmodellen

专家一级对AI产生的带有可缩放语言模型的可缩放语言模型的医学文本进行鉴定

2507.03152v1

605

07-03

Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models

Elemente des Weltwissens (EWoK): Ein von Kognition inspirierter Rahmen zur Bewertung des Grundwissens der Welt in Sprachmodellen

世界知识要素(EWoK): “ 评价基本世界语言模式知识的认知 – – 启发框架 “

2405.09605v2

606

07-03

On the Relationship between Accent Strength and Articulatory Features

Über die Beziehung zwischen Akcent Strength und Artikulatory Features

Accent 力量与交接特征之间的关系

2507.03149v1

607

07-03

DeepGesture: A conversational gesture synthesis system based on emotions and semantics

DeepGesture: Ein dialogisches Gesten-Synthesesystem basierend auf Emotionen und Semantik

DeepGesture:基于情感和语义的谈话手势合成系统

2507.03147v1

608

07-03

From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models

Von der Messung zur Milderung: Erforschung der Übertragbarkeit von Debiasing-Ansätzen auf Gender-Bias in maltesischen Sprachmodellen

从衡量到减缓:探讨马耳他语言模式中性别偏见方法对性别偏见的可转让性

2507.03142v1

609

07-03

ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

ZuverlässigMath: Benchmark der zuverlässigen mathematischen Vernunft bei großen Sprachmodellen

可靠马斯:关于大语言模式的可靠数学依据基准

2507.03133v1

610

07-03

Towards a Psychoanalytic Perspective on VLM Behaviour: A First-step Interpretation with Intriguing Observations

Auf dem Weg zu einer psychoanalytischen Perspektive auf VLM-Verhalten: Eine erste Interpretation mit faszinierenden Beobachtungen

对VLM行为进行心理分析:带有引人注意的首步解释

2507.03123v1

611

07-03

Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings

Federated Learning für die ICD-Klassifizierung mit Leichtbaumodellen und vorgeschulten Einbettungen

具有轻重量型模型和未受过训练的嵌入型综合疾病分类的联邦学习

2507.03122v1

612

07-03

RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

RLVER: Verstärktes Lernen mit überprüfbaren emotionalen Belohnungen für einfühlsame Agenten

RLVER: 利用可核实情感奖赏加强学习,以培养致病剂

2507.03112v1

613

07-03

LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds

LIAR: Inference Time Alignment (Best-of-N) in Sekunden an Jailbreak LLMs ausleihen

LIAR: 利用诱推时间对准(最佳N级)二秒内侵入监狱的LLMs

2412.05232v3

614

07-03

ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization

ARF-RLHF: Adaptive Reward-Following für RLHF durch Emotionsgetriebene Selbst-Überwachung und trace-biased Dynamic Optimization

RAF-RLHF:通过情感驱动自视和跟踪比亚动态优化,对RLHF进行适应性奖励

2507.03069v1

615

07-03

Requirements Elicitation Follow-Up Question Generation

Voraussetzungen Elicitation Follow-Up Question Generation

问询后查询

2507.02858v1

616

07-03

Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Antwort Matching Outperforms Multiple Choice für Sprachmodell-Bewertung

语言模式评价的多种选择

2507.02856v1

617

07-03

MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

MOTIF: Modulares Denken durch Verstärkung Feinabstimmung in LLMs

MOTIF:通过强化微调在LLM中进行模块思考

2507.02851v1

618

07-03

Legal Requirements Translation from Law

Rechtliche Voraussetzungen Übersetzung aus dem Recht

法律要求译自法律

2507.02846v1

619

07-03

Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Visual Contextual Attack: Jailbreaking MLLMs mit Image-Driven Context Injection

视觉上下文攻击:带有图像驱动背景注射的破狱MLLMs MLLMs

2507.02844v1

620

07-03

Improved Unbiased Watermark for Large Language Models

Verbessertes unvoreingenommenes Wasserzeichen für große Sprachmodelle

改进大语言模型的无偏见水印

2502.11268v2

621

07-03

StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason

StepHint: Mehrstufige stufenweise Hinweise stärken das Lernen zur Vernunft

步进提示:多级分步骤将强化学习提升到合理

2507.02841v1

622

07-03

From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

Von der Web-Suche in Richtung Agentic Deep Research: Incentivizing Search with Reasoning Agents

从网络搜索到代理深层研究:激励使用理性代理进行搜索

2506.18959v3

623

07-03

ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

ExPO: Entsperren harter Vernunft mit selbsterklärungsgeführtem Verstärkungslernen

ExPO: 以自我剥削指导强化学习来解锁困难理由

2507.02834v1

624

07-03

Generalizing Verifiable Instruction Following

Verallgemeinern der überprüfbaren Anleitung

普遍适用的可核实说明

2507.02833v1

625

07-03

SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

SynapseRoute: Ein Auto-Routen-Schaltrahmen für das Dual-State Large Language Model

SynapseRoute:关于两州大语言模式的自动运行切换框架

2507.02822v1

626

07-03

Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case

Große Sprachmodelle zur Automatisierung der klinischen Datenstandardisierung: HL7 FHIR Use Case

临床数据标准化自动化大语言模型:HL7 FHIR使用案例

2507.03067v1

627

07-03

Identification of Potentially Misclassified Crash Narratives using Machine Learning (ML) and Deep Learning (DL)

Identifizierung potenziell falsch klassifizierter Crash Narratives mittels Machine Learning (ML) und Deep Learning (DL)

利用机器学习(ML)和深层学习(DL)查明可能误解的崩溃叙述(DL)

2507.03066v1

628

07-03

Multimodal Mathematical Reasoning with Diverse Solving Perspective

Multimodale mathematische Vernunft mit unterschiedlicher Lösungsperspektive

具有不同解决视角的多模式数学理由

2507.02804v1

629

07-03

Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Ist Vernunft alles, was Sie brauchen? Probieren von Bias im Zeitalter der Vernunft Sprachmodelle

需要什么理由就需要什么理由吗?

2507.02799v1

630

07-03

From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

Von langen Videos zu Clips: Ein von Menschen inspiriertes Video-Editing-Framework mit multimodalem Narrative Understanding

从长视频到启动剪贴板:由人启发的视频编辑框架,包含多模式叙述理解

2507.02790v1

631

07-03

GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

GPAS: Beschleunigung der Konvergenz des LLM-Vortrainings durch Gradient-Preserving Activation Scaling

GPAS:通过 “ 渐进式保留动力扩增 “ 加速汇集LLM预备训练

2506.22049v2

632

07-03

Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation

Verbesserung klinischer Multiple-Choice-Fragen Benchmarks mit Knowledge Graph Guided Distractor Generierung

加强具有知识图导引引产生体的临床多选择问题基准,加强临床多选择问题基准

2506.00612v3

633

07-03

Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

Selbstkorrektionsbank: Enthüllung und Adressierung des Selbstkorrektions-Blindflecks in LLMs

自我校正法官:在LLMs中披露和处理自我校正的盲人点

2507.02778v1

634

07-03

DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

DeSTA2.5-Audio: Auf dem Weg zu einem General-Purpose Large Audio Language Model mit selbsterzeugter Cross-Modal Alignment

DeSTA2.5-Audio:努力建立具有自发跨模式一致的通用大型音频语言模型

2507.02768v1

635

07-03

Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression

Batch-Max: Höherer LLM-Durchsatz mit größeren Batch-Größen und KV Cache-Kompression

批量最大:使用大批量大小和 KV缓存压缩的高级 LLM 输送量

2412.05693v3

636

07-03

Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens

Messung der Granularität des Vowel-Produktionsraumes durch einfach nur produzierbare unterschiedliche (JPD) Limens

仅用可制成差异(JPD)激光测量Vowel 生产空间的颗粒度

2507.02744v1

637

07-03

Early Signs of Steganographic Capabilities in Frontier LLMs

Frühe Anzeichen von Steganographischen Fähigkeiten in Frontier LLMs

边疆长长体动物能力早期信号

2507.02737v1

638

07-03

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Mind2Web 2: Agentische Suche mit Agent-as-a-Judge bewerten

Mind2Web 2: 与代理法官评估代理搜索

2506.21506v2

639

07-03

On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability

Über Charakterisierungen für die Sprachgenerierung: Interplay von Halluzinationen, Breadth und Stabilität

语言生成特征:幻觉、面包和稳定之间的相互作用

2412.18530v2

640

07-03

Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation

Next-Token-Vorhersage-Aufgabe setzt eine optimale Datenbestellung für LLM-Training in Proof Generation voraus

假定为校实生成的LLM培训提供最佳数据排序

2411.00863v2

641

07-03

Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

Können LLMs kritische Einschränkungen innerhalb der wissenschaftlichen Forschung identifizieren? Eine systematische Bewertung von KI-Forschungspapieren

LLMs能否查明科学研究中的关键限制? 对AI研究文件的系统评估

2507.02694v1

642

07-03

Exploring Gender Bias Beyond Occupational Titles

Erforschen von Gender-Bias über Berufsbezeichnungen hinaus

探索职业职称之外的性别偏见

2507.02679v1

643

07-03

Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning

Code2Logic: Game-Code-getriebene Datensynthese zur Verbesserung von VLMs Allgemeine Begründung

代码2Llogic: 用于增强VLMs一般理由的游戏-代码-驱动数据合成

2505.13886v2

644

07-03

ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning

ASDA: Audio-Spektrogramm Differential-Achtungsmechanismus für selbstüberwachtes Repräsentationslernen

ASDA:自我监督代表制学习的听觉分光差异关注机制

2507.02666v1

645

07-03

An Adaptive Supervised Contrastive Learning Framework for Implicit Sexism Detection in Digital Social Networks

Ein adaptives, überwachtes, kontrastives Lernkonzept für implizite Sexismus-Erkennung in digitalen sozialen Netzwerken

在数字社会网络中发现隐含性别主义的适应性监督差异学习框架

2507.05271v1

646

07-03

OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

OmniDraft: Ein Cross-Vocabulary, Online Adaptive Drafter für die gerätespezifische Dekodierung

总括草案:跨词汇、在线在线可调适性套用投机下限设计图纸

2507.02659v1

647

07-03

Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search

Entkoppelte Planung und Ausführung: Ein Hierarchisches Reasoning-Framework für tiefe Suche

分解的规划和执行:深海搜索的等级理据框架

2507.02652v1

648

07-03

Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory

Strategische Intelligenz in großen Sprachmodellen: Beweise aus der evolutionären Spieltheorie

大语言模型战略情报:进化游戏理论的证据

2507.02618v1

649

07-03

Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur

以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测

2506.08713v2

650

07-03

Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs

Symbolisch oder numerisch? Physik-Probleme verstehen, die LLMs aufklären

理解在理赔中解决物理问题

2507.01334v2

651

07-03

MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion

MPF: Sprachmodelle nach der Bereitstellung über Multi Perspective Fusion ausrichten und abgrenzen

MPF:通过多视角融合进行部署后调整和取消对语言模式的偏见

2507.02595v1

652

07-03

MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration

MedAide: Informationsfusion und Anatomie von medizinischen Intents über LLM-basierte Agent Collaboration

医学辅助:通过以LLM为基地的合作公司代理进行医疗成瘾者的信息汇集和解剖

2410.12532v3

653

07-03

Revisiting Active Learning under (Human) Label Variation

Aktives Lernen unter (menschlichen) Label-Varianten

在(人)标签标签变换下重新审查积极学习

2507.02593v1

654

07-03

WebSailor: Navigating Super-human Reasoning for Web Agent

WebSailor: Navigieren Super-Mensch Vernunft für Web Agent

Web 服务员: 为 Web 代理导航超人理由

2507.02592v1

655

07-03

AI Flow: Perspectives, Scenarios, and Approaches

AI Flow: Perspektiven, Szenarien und Ansätze

AI 流动:观点、设想和方法

2506.12479v2

656

07-03

Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization

Verbesserung der LLM-Gründung für die Erkennung von Schwachstellen über die Optimierung der Gruppenrelative Policy

通过群体相对政策优化改进通过群体相对政策优化查明脆弱性的改进LLM 改进LLM理由

2507.03051v1

657

07-03

Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs

跨语言旅行:多模式LLM中跨语言一致基准

2505.15075v2

658

07-03

Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning

Selbstgesteuerte Prozess-Reward-Optimierung mit neu definiertem Schrittweiser Vorteil für Prozess-Verstärkungs-Lernen

自指导流程向上优化,具有重新定义的逐步改进的流程强化学习优势

2507.01551v2

659

07-03

IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders

IndianBailJudgments-1200: Ein Multi-Attribut-Datensatz für legale NLP auf indischen Bail-Aufträgen

IndianBailJail Judgments-12000:印度保释令法律国家保护程序多属性数据集

2507.02506v1

660

07-03

Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation

Counterfactual Tuning für die zeitliche Sensitivitätssteigerung in der auf Großsprachenmodellen basierenden Empfehlung

以基于大语言的示范建议为基础的提高时时敏感度反事实推算

2507.03047v1

661

07-03

Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack

Robustheit von Fehlinformations-Klassifikationssystemen zu Adversarial-Beispielen durch BeamAttack

通过“BeamAttack”进行错误信息分类系统对反向实例的强力

2506.23661v2

662

07-03

Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer

Task Prompt Vektoren: Effektive Initialisierung durch Multi-Task Soft-Prompt Transfer

任务提示矢量 : 通过多任务软性即时传输实现有效的初始化

2408.01119v3

663

07-03

Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants

Hanzi als Narrative Bridges herstellen: Ein KI-Co-Creation-Workshop für ältere Migranten

将Hanzi编成叙述性桥梁:大赦国际为老年移民举办的共同创造讲习班

2507.01548v2

664

07-03

A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages

Ein Kochbuch für die gemeinschaftsorientierte Datenerfassung von schwachen Sprachkenntnissen in LowResource-Sprachen

社区驱动的低成本低资源语言有缺陷演讲数据收集手册

2507.02428v1

665

07-03

Delving into LLM-assisted writing in biomedical publications through excess vocabulary

Eintauchen in LLM-unterstütztes Schreiben in biomedizinischen Publikationen durch überschüssiges Vokabular

通过超量词汇,在生物医学出版物中进行LLM协助撰写

2406.07016v5

666

07-03

K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

K-Funktion: Gemeinsame Aussprache Transkription und Feedback zur Bewertung der Kindersprache-Funktion

K-功能:用于评估儿童语言功能的联合读音记录和反馈

2507.03043v1

667

07-03

Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability

Benchmarking Akan ASR-Modelle über Domain-spezifische Datensätze: Eine vergleichende Bewertung von Leistung, Skalierbarkeit und Anpassungsfähigkeit

确定Akan ASR模型基准的全域具体数据集:业绩比较评价、可缩放性和可调适性

2507.02407v1

668

07-03

AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

AIn’t Not Nothing But a Survey? Mit großen Sprachmodellen für die Codierung Deutsch Open-Ended Survey Responses on Survey Motivation

使用大语言模型对德国关于调查动机的开放式调查答复进行编码

2506.14634v3

669

07-03

Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction

Dynamischer Langzeit-Speicherspeicher auf Kurzzeit-Speicherbasis für Langzeit-LLM-Interaktion

动态长地平线LLM互动的短期内存存储

2507.03042v1

670

07-03

JoyTTS: LLM-based Spoken Chatbot With Voice Cloning

JoyTTS: LLM-basierter gesprochener Chatbot mit Voice Cloning

以LLM为基地的 “ 配有语音克隆的口音聊天机器人 “

2507.02380v1

671

07-03

Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

Effiziente Code-LLM-Schulung über Distribution-Konsistenz und Diversity-Aware-Datenauswahl

通过分配和多样性软件数据选择进行高效率的守则LLM培训

2507.02378v1

672

07-03

QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers

QFFN-BERT: Eine empirische Studie über Tiefe, Leistung und Dateneffizienz in hybriden Quantum-Klassischen Transformern

QFFN-BERT:对混合量子-分类变异器的深度、性能和数据效率的经验研究

2507.02364v1

673

07-03

Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning

Verbesserung der Robustheit der distantly-überwachten Anerkennung von Personen mit Namen durch unsicheres Lehrerlernen und studentisch-studentisches kollaboratives Lernen

通过不确定-软件教师学习和学生-学生合作学习,提高以不确定-软件教师学习和学生-学生合作学习的方式,提高以不确定-软件命名的实体识别的力度

2311.08010v3

674

07-03

Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models

Coling-UniA bei SciVQA 2025: Wenig-heißes Beispiel Retrieval und Vertrauen-informierte Montage für multimodale große Sprachmodelle

在SciVQA 2025 SciVQA 的Coling-UniA:多式大语言模型的很少热实例检索和信任化组合

2507.02357v1

675

07-03

Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation

Einschließlich LLMs für großräumige Urban Complex Mobility Simulation

大型城市综合流动模拟项目LLMs

2505.21880v2

676

07-03

Decision-Oriented Text Evaluation

Entscheidungsorientierte Textbewertung

注重决定的案文评价

2507.01923v2

677

07-03

Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs

Token Prepending: Ein trainingsfreier Ansatz zur Eliziierung besserer Sentence-Embeddings von LLMs

Token Predudo:从LLM女士那里采用不培训办法,使判刑内容更好地嵌入Elibear

2412.11556v2

678

07-03

Layered Insights: Generalizable Analysis of Authorial Style by Leveraging All Transformer Layers

Layered Insights: Generalisierbare Analyse des Autorial Styles durch Hebelisierung aller Transformer Layers

图层透视: 通过利用所有变换层对文件样式的通用分析

2503.00958v2

679

07-03

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Skywork-Reward-V2:通过人类-AI协同增强优先数据曲线

2507.01352v2

680

07-03

Cautious Next Token Prediction

Vorsichtige nächste Zeichen Vorhersage

谨慎的次下 Tok 预测

2507.03038v1

681

07-03

Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach

Ausrichten von gefrorenen LLMs durch Verstärkungslernen: Ein iteratives Reweight-then-Optimize-Ansatz

通过强化学习将冻结的LLMs与 “ 强化学习:一种过渡性再加权再优化方法 “ 相匹配

2506.17828v2

682

07-03

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Fast-dLLM: Trainingsfreie Beschleunigung von Diffusion LLM durch Ermöglichen von KV Cache und Paralleldecoding

快速dLLM:通过授权 KV 缓存和平行编码加速免培训传播LLM

2505.22618v3

683

07-03

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

Bypass Back-Propagation: Optimierungsbasiertes Structural Pruning für große Sprachmodelle über Policy Gradient

Bypass 后回通信:通过 “ 政策梯度 “ 优化基于优化的结构结构,为大语言模式提供缓冲

2406.10576v3

684

07-03

DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning

DoMIX: Ein effizientes Framework zur Nutzung von Domain-Wissen im Feintuning

DoMIX:一个在微调中利用域知识的有效框架

2507.02302v1

685

07-03

Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models

Commander-GPT: Die Fähigkeit von Multi-Modal Large Language Models, den Sarkasmus vollständig zu entleeren

GPT指挥官:完全解除多模式大语言模型的讽刺性探测能力

2503.18681v3

686

07-03

Prompt-Guided Turn-Taking Prediction

Prompt-geführte Turn-Taking-Vorhersage

即时指导的回转预测

2506.21191v2

687

07-03

Optimal strategies to perform multilingual analysis of social content for a novel dataset in the tourism domain

Optimale Strategien zur mehrsprachigen Analyse sozialer Inhalte für einen neuartigen Datensatz im Tourismusbereich

为旅游领域新数据集的社会内容进行多语种社会内容分析的最佳最佳战略

2311.14727v2

688

07-03

Seeing Through Green: Text-Based Classification and the Firm’s Returns from Green Patents

Durch Grün sehen: Textbasierte Klassifizierung und die Rückkehr der Firma aus grünen Patenten

通过 “ 绿色观光:基于文本的分类和公司从绿色专利的回报 “

2507.02287v1

689

07-03

Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments

Kausales Repräsentationslernen mit generativer Künstlicher Intelligenz: Anwendung auf Texte als Behandlungen

产生人工智能的因果代表性学习:应用文字作为治疗

2410.00903v3

690

07-03

SMARTe: Slot-based Method for Accountable Relational Triple extraction

SMARTe: Slot-basierte Methode für die relationale Triple-Extraktion

SMARTE: 衡算关系三重采掘的基于固态方法

2504.12816v3

691

07-03

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

MemAgent: Umgestalten von Langkontext-LLM mit Multi-Conv RL-basierten Speicheragenten

MemerAgent: 与基于多Conv RL的内存代理重塑长文本LLM

2507.02259v1

692

07-03

Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks

Schaltungs-Tuning: Mechanistischer Ansatz zur Identifizierung von Parameter Redundanz und Feinsteuerung neuraler Netzwerke

电路调控:确定参数冗余和精微调整神经网络的机械化方法

2502.06106v2

693

07-03

Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies

Mixture of Reasonings: Große Sprachmodelle mit adaptiven Strategien zur Vernunft bringen

理由混合:与适应战略一道教授大语言模式

2507.00606v2

694

07-03

Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation

Schutz der Privatsphäre, Erhöhung der Zugänglichkeit und Senkung der Kosten: Ein On-Device-Modell für künstliche Intelligenz für medizinische Transkription und Notizgenerierung

维护隐私、增加无障碍和降低成本:医疗转移和笔记制作的现场人工人工智能模型

2507.03033v1

695

07-03

GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

GDC Cohort Copilot: Ein KI-Copilot für die Kuratierung von Kohorten aus den Genomic Data Commons

GDC Cohort Cohort 副驾驶:AI 基因组数据共同点的Curate Choorts联合驾驶员

2507.02221v1

696

07-03

SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers

SciGA: Ein umfassender Datensatz zur Gestaltung grafischer Abstracts in wissenschaftlichen Papieren

SciGA: 用于设计学术论文制图摘要的综合数据集

2507.02212v1

Article 0

Title@2025-07-10 (4): Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Title: Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Rückverfolgbare Beweise Verbesserte visuelle Grundierung: Bewertung und Methodik

增强视觉依据的理由:评价和方法 2507.07999v1

Authors (12): Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang

Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human “thinking with images”. However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs, even the most advanced models struggle with this benchmark, where none of them reach 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.

OpenAI-o3 先驱视觉推理等模型通过动态地参照视觉区域来动态地参照视觉区域,就像人类的“用图像思考”一样。然而,没有基准来全面评价这些能力。然而,我们提出“TreaBench”(可追踪证据评估基准),这是建立在以下三项原则基础上的诊断基准:(1) 对复杂场景的微妙目标有集中的视觉认识,(2) 通过捆绑框评估可以追踪的证据,(3) 测试物体相互作用和空间等级的第二阶推理,超越简单天体定位的简单天体定位。对密集天体图像进行优先排序,我们最初对SA-1B的1K高品质图像进行抽样抽样抽样,并纳入8个LMM专家手动地说明问题、候选选项和每个图像的答案。在质量控制的三个阶段后,TreaineB包括40个挑战视觉问答配对,甚至最先进的模型与这个基准挣扎了60%的准确度,例如,OpenAI-o3级评分数只有54.87。此外,我们引入了TreVRS(可识别证据强化地基推理学),从监督本地和推理学的训练模式,从监督本地和推理学,共同地基点, 学习了基础推路路路路路路路, 。

Article 1

Title@2025-07-10 (4): PyVision: Agentic Vision with Dynamic Tooling

Title: PyVision: Agentic Vision with Dynamic Tooling

PyVision: Agentische Vision mit dynamischem Werkzeug

视景:带有动态工具的 “ 动态展望 “ 。 2507.07998v1

Authors (7): Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei

LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.

LLMS越来越多地被部署为代理商、能够规划、推理和动态调用外部工具的系统。但是,在视觉推理中,先前的做法在很大程度上仍然受到预先定义的工作流程和静态工具的限制。在本报告中,我们展示了PyVision,这是一个互动的、多方向的框架,使MLLMS能够自主生成、执行和完善适合手头任务、灵活和可解释的解决问题的基于Python的工具。我们开发了PyVision所创造的工具的分类,并分析了其在不同基准组中的使用情况。从数量上看,PyVision实现了一致的业绩收益,V* 将GPT-4-4.1 +7.8% 提高 V* +7.8%,VLaude-4.0-Sonnet +31.1% 提高VLMSAreBlind-mini。这些结果显示了更广泛的转变:动态工具允许模型不仅使用工具,而且还可以发明工具,进而向更具说服力的视觉推理。

Article 2

Title@2025-07-10 (4): Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)

Title: Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)

Operationalisierung eines Bedrohungsmodells für das Red-Teaming großer Sprachmodelle (LLMs)

实施 “ 红色组合大语言模型威胁模型 “ ; 2407.14937v2

Authors (10): Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, NhatHai Phan

Creating secure and resilient applications with large language models (LLM) requires anticipating, adjusting to, and countering unforeseen threats. Red-teaming has emerged as a critical technique for identifying vulnerabilities in real-world LLM implementations. This paper presents a detailed threat model and provides a systematization of knowledge (SoK) of red-teaming attacks on LLMs. We develop a taxonomy of attacks based on the stages of the LLM development and deployment process and extract various insights from previous research. In addition, we compile methods for defense and practical red-teaming strategies for practitioners. By delineating prominent attack motifs and shedding light on various entry points, this paper provides a framework for improving the security and robustness of LLM-based systems.

利用大型语言模型(LLM)创建安全和具有复原力的应用软件需要预见、适应和应对意外威胁。红队已成为查明现实世界LLM实施过程中的脆弱性的关键技术。本文介绍了一个详细的威胁模型,并提供了对LLMS进行红队袭击的知识系统化(SoK)。我们根据LLM开发和部署过程的各个阶段制定攻击分类,并从以往的研究中提取各种见解。此外,我们汇编了防御方法和从业人员实用的红色战略。通过对突出的攻击运动点进行划线和向各个切入点亮亮光,本文为改进LLM系统的安全性和稳健性提供了一个框架。

Article 3

Title@2025-07-10 (4): Automating Expert-Level Medical Reasoning Evaluation of Large Language Models

Title: Automating Expert-Level Medical Reasoning Evaluation of Large Language Models

Automatisieren von Experten-Level Medical Reasoning Bewertung von großen Sprachmodellen

对大语言模式进行自动化专家级医疗理由评估 2507.07988v1

Authors (19): Shuang Zhou, Wenya Xie, Jiaxi Li, Zaifu Zhan, Meijia Song, Han Yang, Cheyenna Espinoza, Lindsay Welton, Xinnie Mai, Yanwei Jin, Zidu Xu, Yuen-Hei Chung, Yiyun Xing, Meng-Han Tsai, Emma Schaffer, Yucheng Shi, Ninghao Liu, Zirui Liu, Rui Zhang

As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs’ medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs’ medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs’ medical reasoning, advancing their safe and responsible deployment in clinical practice.

随着大型语言模型(LLMS)日益融入临床决策,确保透明和可信赖的推理至关重要,然而,LLM-w-Ref(LLM-w-Ref)的现有医学推理能力评价战略要么不能令人满意地进行评估,要么不能很好地推广,而且仍然缺乏严格的基准。为了解决这个问题,我们引入了MedThink-Bench(MedThink-Bench)(MedThink-Bench)(MedThink-B)(LLLM-LM-LM-A)(LLM-A)(LLLM-A)(LLM)(LLM-A)(LM)(LLM)(LLM)(LLM)(LLM)(LM)(LLM)(LLM)(L)(LLL)(LLL)(LLM)(LLM-W-W-Ref)(LA(LM)(LM)(LA-LA(LA-A-A)(LA(LAD)(LA)(LA)(LA(LA) (LA) (LM) (LM) (LM) (LM) (LM) (LM) (LM) (LM) (LM) (LM) (LM) (LM) (LM) (L) (L) (L) (L) (LM) (LM) (LM) (LM) (LM) (L) (LM) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (LM) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (L) (LIM) (L) (L

Article 4

Title@2025-07-10 (4): Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology

Title: Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology

Leistung und praktische Betrachtung von großen und kleinen Sprachmodellen in der klinischen Entscheidungsunterstützung in der Rheumatologie

风湿学临床决策支助中大型和小型语言模型的实用性及实用性考虑 2507.07983v1

Authors (7): Sabine Felde, Rüdiger Buchkremer, Gamal Chehab, Christian Thielscher, Jörg HW Distler, Matthias Schneider, Jutta G. Richter

Large language models (LLMs) show promise for supporting clinical decision-making in complex fields such as rheumatology. Our evaluation shows that smaller language models (SLMs), combined with retrieval-augmented generation (RAG), achieve higher diagnostic and therapeutic performance than larger models, while requiring substantially less energy and enabling cost-efficient, local deployment. These features are attractive for resource-limited healthcare. However, expert oversight remains essential, as no model consistently reached specialist-level accuracy in rheumatology.

大型语言模型(LLMs)显示了支持风湿学等复杂领域临床决策的前景。我们的评估表明,小型语言模型(SLMs)与检索增强的一代(RAG)相结合,取得了比大模型更高的诊断和治疗性能,同时需要大大降低能量,并有利于成本效率高的本地部署。这些特征对资源有限的医疗保健具有吸引力。然而,专家监督仍然至关重要,因为没有任何模型在风湿学方面一直达到专家水平的准确性。

Article 5

Title@2025-07-10 (4): Why is Your Language Model a Poor Implicit Reward Model?

Title: Why is Your Language Model a Poor Implicit Reward Model?

Warum ist Ihr Sprachmodell ein schlechtes Implizit-Reward-Modell?

为什么您的语言模式是一个贫穷的隐含奖赏模式? 2507.07981v1

Authors (4): Noam Razin, Yong Lin, Jiarui Yao, Sanjeev Arora

Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.

语言模式培训后和发酵管道模式的关键在于奖赏模式。简洁地说,最近的工作表明,每个语言模式都定义了隐含奖赏模式(IM-RM),而不需要任何建筑变革。然而,与明确奖赏模式(EX-RMs)相比,这种模式往往更加普遍,特别是分配外的。对语言模式的隐含偏见,对隐含的线性表现模式(EX-RMs)应用了专门的线性头来取代隐含的语言模式(EX-RMs),普遍化差距的存在令人费解,因为EX-RMs和IM-RMs几乎是相同的。它们可以使用相同的数据、损失功能和语言模式来进行培训,而且只能在奖赏的计算方式上有所不同。为了从根本上了解不同奖赏模式类型背后的隐含的偏见,我们调查这一差距的根源。我们的主要发现,在理论和实验的支持下,IM-RMs更多地依赖表面的提示。因此,它们通常比象征性分配选择下的EX-RMs(Ex-RMs)更差,以及分配。此外,我们提供证据证据来反对普遍化差距的替代的替代的假设,因为常规-RMDI-I-I会要求是更强烈的产生更难的行为。

Article 6

Title@2025-07-10 (4): Long-Form Speech Generation with Spoken Language Models

Title: Long-Form Speech Generation with Spoken Language Models

Langformige Sprachgenerierung mit gesprochenen Sprachmodellen

具有口言语言模式的长形式语音一代 2412.18603v2

Authors (6): Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan

We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance level. As we found current spoken language evaluations uninformative, especially in this new long-form setting, we also introduce: LibriSpeech-Long, a benchmark for long-form speech evaluation; new embedding-based and LLM-judged metrics; and quality measurements over length and time. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at https://google.github.io/tacotron/publications/speechssm/.

我们认为,多分钟的语音模拟是长式多媒体生成和音效语音助理的一种要求,但是,无文本的口头语言模型很难在数十秒的时间内生成出貌似合理的语音,因为高时间分辨率的语音符号导致一致性丧失、具有长期序列培训或外推法的建筑问题以及推断时的记忆成本。我们从这些考虑因素中得出语音SSSM,这是第一个语音语言模型家庭,可以在没有文本中间器的单一解码会议上学习长式语音(例如读音或即时语音的16分钟)并抽样。 SpealesSSMMMSM利用线性时间序列模型的最新进展,在多分钟的代人之间以一致性和效率大大超过目前的变换LMMs,同时仍然在发音水平上匹配。我们发现当前语言评价缺乏信息,特别是在这个新的长式设置环境中,我们还介绍:LibSpeech-Long,一个长式语音评价的基准;新的嵌入式和LM-判分数度度度度测量;以及超越当前变式/时间的高质量测量标准。

Article 7

Title@2025-07-10 (4): Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Title: Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Wasserzeichen degradiert Ausrichtung in Sprachmodellen: Analyse und Milderung

语言模型的分级调整:分析和减轻影响 2506.04462v2

Authors (3): Apurv Verma, NhatHai Phan, Shubhendu Trivedi

Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches-Gumbel and KGW-affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.

大型语言模型(LLMS)的水标记技术可以对产出质量产生重大影响,但是它们对真实性、安全性和帮助性的影响仍然严重不足。本文件对两种流行的水标记方法-Gumberl和KGW如何影响四个对齐的LLMS的这些核心校准特性进行了系统分析。我们的实验揭示了两种截然不同的退化模式:保护减少,因为加强帮助性会破坏模型安全;保护放大,因为过度谨慎会减少模型的帮助性。这些模式产生于标志性分配的水标记导致的改变,揭示了调整目标之间存在的根本紧张关系。为了减轻这些退化,我们建议调整抽取(AR),即一种推断性时间取样方法,使用外部奖励模式来恢复校准。我们从理论上对预期的奖励得分的改进进行较低约束,因为抽样规模的增加,而且经验表明,仅仅2-4个有标志的代人有效地恢复或超过基线(无水标记的)校准。为了克服标准Gumber水标记的响应多样性,我们修改后的执行在保持可靠的可探测性、保证可靠基线和可靠地保持基准之间的一致性,同时确认可靠地测量和可靠地测量工作的结果。

Article 8

Title@2025-07-10 (4): Scaling RL to Long Videos

Title: Scaling RL to Long Videos

Skalierung von RL zu langen Videos

缩放 RL 到长视频 2507.07966v1

Authors (14): Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).

我们引入了一个完整的配置框架,将视觉语言模型的推理推理推向长视频,利用强化学习;我们应对长视频推理的独特挑战,整合了三个关键组成部分:(1) 大型数据集LongVideo-Reason,由52K长视频QA配对组成,配有体育、游戏和 vlogs等不同领域的高质量推理说明;(2) 双阶段培训管道,将视觉模型推理推向长视频模型(COT-SFT)和强化学习(RL);(3) 长视频RL,名为多模式强化序列平行(MRSP)的培训基础设施,其中包含序列平行和基于VLLMM的引擎,用于长视频,用于高效推出和预填。在实验中,LA-RVA-RVlllSlSlllS, 各种视频模型SVRllllllllSllS, 视频模型的视频模型和Silveral-LServiLServical 上,支持我们的视频模型和Sil-LServical-Servical-L的连续和Silvical-L 。

Article 9

Title@2025-07-10 (4): MIRIX: Multi-Agent Memory System for LLM-Based Agents

Title: MIRIX: Multi-Agent Memory System for LLM-Based Agents

MIRIX: Multi-Agent-Speichersystem für LLM-basierte Agenten

MIRIX:LLM药剂多机构内存系统 2507.07957v1

Authors (2): Yu Wang, Xi Chen

Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field’s most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.

虽然AI代理商的记忆能力日益受到越来越多的关注,但现有的解决方案仍然基本有限。多数依赖平坦的、范围狭窄的记忆组件,限制了他们个人化、抽象和可靠地重新收集特定用户信息的能力。为此,我们引入了MIRIX, 这是一个模块化的多剂记忆系统,通过解决实地最关键的挑战来重新定义AI记忆的未来:使语言模型能够真正记住。与以往的做法不同, MIRIX超越文本,以包含丰富的视觉和多式联运经验,使记忆在现实世界情景中真正有用。 MIRIX由六种不同且结构严谨的记忆组件组成:核心、Episodic、Smantic、程序、资源内存和知识Vault Vault。为此,我们引入了一个多剂框架,能动态控制和协调更新和检索。这个设计使代理商能够持续、理性地和准确地检索各种长期用户数据。我们在两个严格的环境下验证MIRIX 。首先,在SpealshotVQA, 一个具有挑战性的多式联运基准, 由近20,000个高分辨率的计算机屏幕截截图,需要深背景了解, 以及没有更深背景背景化的内径化的内存数据系统,并且可以让现有的内存系统在存储系统运行中运行一个比RILILILASMIX 长期的存储系统更精确地显示一个新的存储器的MIS的存储器的存储器。

Article 10

Title@2025-07-10 (4): SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment

Title: SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment

SEITE: Ein visuelles Sprachmodell zur Erkennung von Anomalien durch Fact Enhancement und Entropy-aware Alignment

SAGE:通过事实增强和对子对子体认知校正进行反常检测的视觉语言模型 2507.07939v1

Authors (7): Guoxin Zang, Xue Li, Donglin Di, Lanshun Nie, Dechen Zhan, Yang Song, Lei Fan

While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.

虽然视觉-语言模型(VLMS)在一般多式联运任务中显示出有希望的进展,但它们往往在工业异常探测和推理方面挣扎,特别是在提供可解释的解释性解释和对看不见的类别加以概括方面,这种限制源于异常探测的内在领域性质,这妨碍了现有VLMs在需要精确、结构化和背景意识分析的工业情景中的适用性;为了应对这些挑战,我们提议SAGE(基于VLM(VLM)的框架),通过自我指导事实增强(SFE)和Etropy-awe直接优化(E-DPO)来强化异常推理;SFE通过事实提取和聚合将特定领域的知识纳入视觉推理,而E-DPO(E-DPO)则利用英特罗普-意识优化将模型产出与专家偏好性调整。此外,我们采用AD-PL(AD-PL)和优选-AGE(AGE)数据集,其中包括28,415个由专家排序答复的解答案例。为了评估异常推理模型,我们开发了多级逻辑评价(MLE),一个定量框架,一个定量框架分析逻辑/CRisal-shet-slasset-asset)的逻辑和现有数据标准。

Article 11

Title@2025-07-10 (4): Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration

Title: Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration

Long Context Scaling: Teilen und Erobern durch multi-agent question-driven Collaboration

长期范围:通过多代理问题驱动的协作实现分化和征服 2505.20625v2

Authors (5): Sibo Xiao, Zixin Lin, Wenyang Gao, Hui Chen, Yue Zhang

Processing long contexts has become a critical capability for modern large language models (LLMs). Existing works leverage agent-based divide-and-conquer methods for processing long contexts. But these methods face crucial limitations, including prohibitive accumulated latency and amplified information loss from excessive agent invocations, and the disruption of inherent textual dependencies by immoderate partitioning. In this paper, we propose a novel multi-agent framework XpandA (Expand-Agent) coupled with question-driven workflow and dynamic partitioning for robust long-context processing. XpandA overcomes these limitations through: 1) dynamic partitioning of long texts, which adaptively modulates the filling rate of context windows for input sequences of vastly varying lengths; 2) question-guided protocol to update flat information ensembles within centralized shared memory, constructing consistent inter-agent knowledge across partitions; and 3) selectively replaying specific partitions based on the state-tracking of question-information couples to promote the resolution of inverted-order structures across partitions (e.g., flashbacks). We perform a comprehensive evaluation of XpandA on multiple long-context benchmarks with length varying from 1k to 1M, demonstrating XpandA’s feasibility for processing ultra-long sequences and its significant effectiveness in enhancing the long-context capabilities of various LLMs by achieving 20\% improvements and 1.5x inference speedup over baselines of full-context, RAG and previous agent-based methods.

长期处理过程已成为现代大型语言模型(LLMS)的关键能力。现有的工作利用代理商的分化法和分化法处理长背景。但是,这些方法面临关键的局限性,包括过度代理性职业导致的超常累积潜伏和放大信息损失,以及中度分割干扰固有的文本依赖性。在本文件中,我们提出一个新的多剂框架XpandA(Expand-Agency),加上由问题驱动的工作流程和动态分割法,供强有力的长文本处理。 XpandA克服了这些限制,具体做法包括:1)长文本的动态分割法,这些长文本在适应性地调整了投入序列中环境窗口的填充率;2)在中央共享记忆中更新平坦信息,建立一致的跨分区机构间知识;3)根据基于问题信息组合的状态跟踪,有选择地重放具体分隔法,以促进跨分区反向顺序结构的解决(例如,反向反向回溯),从而从甚远不同长度的LMS-RMS(X-LM)的全程、从超长的超长度处理率、超长的超长度处理方法,我们对 Xpan-LM-LM(A)进行了全面的全面评估,从一个超长度、超长的全程、超长度改进了X-LM-LM-LM-LM-LM-S-LM-S-LS-LS-LS-LS-S-S-S-直线性处理能力,从第1的全速度能力,从第1级、跨度的全度能力,从第1至长序、跨直序、跨度的全程、跨直序、跨度的全程,从第1-直序、跨直径-直径-直径-直径-直线性评估了第1-直线性、跨-直线性、跨的全-直径-直序、跨的全程、跨的全程、跨的全程,从第1-直径-直路的全程,从第1-直-直-直路的全程,从第1-直路的全程,从第1-直路的全程,从第1-直线性-直线性-直线性

Article 12

Title@2025-07-10 (4): Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style

Title: Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style

Kontexttreue in großen Sprachmodellen untersuchen: Die Rollen der Gedächtnisstärke und des Evidenzstils

调查大语言模型的内情:记忆力和证据风格的作用 2409.10955v2

Authors (6): Yuepei Li, Kang Zhou, Qiao Qiao, Bach Nguyen, Qing Wang, Qi Li

Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by incorporating external information into the response generation process. However, how context-faithful LLMs are and what factors influence LLMs’ context faithfulness remain largely unexplored. In this study, we investigate the impact of memory strength and evidence presentation on LLMs’ receptiveness to external evidence. We quantify the memory strength of LLMs by measuring the divergence in LLMs’ responses to different paraphrases of the same question, which is not considered by previous works. We also generate evidence in various styles to examine LLMs’ behavior. Our results show that for questions with high memory strength, LLMs are more likely to rely on internal memory. Furthermore, presenting paraphrased evidence significantly increases LLMs’ receptiveness compared to simple repetition or adding details. These findings provide key insights for improving retrieval-augmented generation and context-aware LLMs. Our code is available at https://github.com/liyp0095/ContextFaithful.

通过将外部信息纳入响应生成过程,检索增强的一代(RAG)改进了大语言模型(LLMs),然而,背景真实的LLMs是如何的,以及哪些因素对LLMs的背景忠诚程度有何影响,基本上仍未得到探讨。在本研究中,我们调查了LLMs接受外部证据的记忆力和证据展示的影响。我们通过测量LLMs对同一问题不同参数的答复中的差异来量化LLMs的记忆力,而以前的工作没有考虑到这一点。我们还以各种方式生成证据来检查LLMs的行为。我们的结果显示,对于记忆力较高的问题,LLMMs更有可能依靠内部记忆。此外,提供副用的证据大大提高了LLMs对简单重复或添加细节的接受度。这些发现为改进检索启动的生成和对上下文的LMs提供了关键的洞察力。我们的代码可在https://github.com/liep0095/Ctext Faithful.。

Article 13

Title@2025-07-10 (4): DTECT: Dynamic Topic Explorer & Context Tracker

Title: DTECT: Dynamic Topic Explorer & Context Tracker

DTECT: Dynamischer Themen-Explorer & Kontext-Tracker

DTECT: 动态专题探索器和上下文跟踪器 2507.07910v1

Authors (2): Suman Adhya, Debarshi Kumar Sanyal

The explosive growth of textual data over time presents a significant challenge in uncovering evolving themes and trends. Existing dynamic topic modeling techniques, while powerful, often exist in fragmented pipelines that lack robust support for interpretation and user-friendly exploration. We introduce DTECT (Dynamic Topic Explorer & Context Tracker), an end-to-end system that bridges the gap between raw textual data and meaningful temporal insights. DTECT provides a unified workflow that supports data preprocessing, multiple model architectures, and dedicated evaluation metrics to analyze the topic quality of temporal topic models. It significantly enhances interpretability by introducing LLM-driven automatic topic labeling, trend analysis via temporally salient words, interactive visualizations with document-level summarization, and a natural language chat interface for intuitive data querying. By integrating these features into a single, cohesive platform, DTECT empowers users to more effectively track and understand thematic dynamics. DTECT is open-source and available at https://github.com/AdhyaSuman/DTECT.

现有的动态专题模型技术虽然强大,但往往存在于缺乏对解释和方便用户的探索的强大支持的支离破碎管道中。我们引入了DTECT(动态专题探索器和背景跟踪器),这是一个端对端系统,可以弥合原始文本数据和有意义的时间洞察力之间的差距。DTECT提供了一个统一的工作流程,支持数据处理前、多个模型结构以及分析时间专题模型质量的专门评价指标。它通过引入LLM驱动的自动主题标签、通过时间突出的文字进行趋势分析、以文件级别汇总进行交互式可视化以及直觉数据查询的自然语言聊天界面,大大提高了可解释性。通过将这些特征纳入一个单一的、连贯的平台,DTECT使用户能够更有效地跟踪和理解专题动态。DTECT是开放的来源,可在https://github.com/AdhyaSuman/DTECT中查阅。

Article 14

Title@2025-07-10 (4): A Survey on Latent Reasoning

Title: A Survey on Latent Reasoning

Eine Umfrage über latente Vernunft

A. 关于长期原因的调查 2507.06203v2

Authors (33): Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model’s expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model’s continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.

大型语言模型(LLMS)展示了令人印象深刻的推理能力,特别是在以明确的思维链推理(CoT)推理为指引,对中间步骤进行口头解释的推理时,尤其令人印象深刻的推理能力。 CoT改进了可解释性和准确性,同时对自然语言推理的依赖限制了模型的显性带宽宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽。为了推进潜带宽带宽带宽带宽带宽带宽的推理研究,本次调查将潜带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带宽带

Article 15

Title@2025-07-10 (4): Automating MD simulations for Proteins using Large language Models: NAMD-Agent

Title: Automating MD simulations for Proteins using Large language Models: NAMD-Agent

Automatisierung von MD-Simulationen für Proteine mit großen Sprachmodellen: NAMD-Agent

使用大语言模型( NADD- Agent) 自动自动模拟 Proteins 的 MD 模拟: NAMED- Agent 2507.07887v1

Authors (2): Achuth Chandrasekhar, Amir Barati Farimani

Molecular dynamics simulations are an essential tool in understanding protein structure, dynamics, and function at the atomic level. However, preparing high quality input files for MD simulations can be a time consuming and error prone process. In this work, we introduce an automated pipeline that leverages Large Language Models (LLMs), specifically Gemini 2.0 Flash, in conjunction with python scripting and Selenium based web automation to streamline the generation of MD input files. The pipeline exploits CHARMM GUI’s comprehensive web-based interface for preparing simulation-ready inputs for NAMD. By integrating Gemini’s code generation and iterative refinement capabilities, simulation scripts are automatically written, executed, and revised to navigate CHARMM GUI, extract appropriate parameters, and produce the required NAMD input files. Post processing is performed using additional software to further refine the simulation outputs, thereby enabling a complete and largely hands free workflow. Our results demonstrate that this approach reduces setup time, minimizes manual errors, and offers a scalable solution for handling multiple protein systems in parallel. This automated framework paves the way for broader application of LLMs in computational structural biology, offering a robust and adaptable platform for future developments in simulation automation.

分子动态模拟是了解蛋白质结构、动态和原子一级功能的基本工具。但是,为MD模拟制作高质量的输入文件可能是一个耗时和容易出错的过程。在这项工作中,我们引入了一种自动管道,利用大语言模型(LLMS),特别是Gemini 2.0 Flash, 结合 Python 脚本和基于 erium 的网络自动化来简化MD 输入文件的生成。管道利用CHARMMM GUI 的综合网络界面为NAD 进行模拟准备。通过整合Gemini 的代码生成和迭代精化能力,模拟脚本会自动编写、执行和修订,以导航CHARMMM GUI,提取适当的参数,并制作所需的NAD 输入文件。邮件处理正在使用额外的软件来进一步完善模拟输出,从而使得一个完整和基本上免费的手动工作流程得以实现。我们的结果表明,这一方法减少了设置时间,最大限度地减少人工错误,并为同时处理多种蛋白质系统提供了一个可扩展的解决方案。这个自动化框架为在计算结构生物学上更广泛地应用LMS铺路铺路铺路铺路。

Article 16

Title@2025-07-10 (4): When Dialects Collide: How Socioeconomic Mixing Affects Language Use

Title: When Dialects Collide: How Socioeconomic Mixing Affects Language Use

Wenn Dialekte zusammenstoßen: Wie sich die sozioökonomische Mischung auf den Sprachgebrauch auswirkt

当对调时:社会经济混合如何影响语言使用 2307.10016v2

Authors (4): Thomas Louf, José J. Ramasco, David Sánchez, Márton Karsai

The socioeconomic background of people and how they use standard forms of language are not independent, as demonstrated in various sociolinguistic studies. However, the extent to which these correlations may be influenced by the mixing of people from different socioeconomic classes remains relatively unexplored from a quantitative perspective. In this work we leverage geotagged tweets and transferable computational methods to map deviations from standard English on a large scale, in seven thousand administrative areas of England and Wales. We combine these data with high-resolution income maps to assign a proxy socioeconomic indicator to home-located users. Strikingly, across eight metropolitan areas we find a consistent pattern suggesting that the more different socioeconomic classes mix, the less interdependent the frequency of their departures from standard grammar and their income become. Further, we propose an agent-based model of linguistic variety adoption that sheds light on the mechanisms that produce the observations seen in the data.

正如各种社会语言研究所显示的,人们的社会经济背景以及他们使用标准语言的方式并不独立。然而,从数量角度而言,这些关联可能受到不同社会经济阶层人员混合影响的程度仍然相对没有探讨。在这项工作中,我们利用地理标记推文和可转移计算方法,在英格兰和威尔士的7 000个行政区大规模绘制偏离标准英语的地图。我们把这些数据和高分辨率收入图结合起来,为家庭所在地用户指定一个代理社会经济指标。令人惊讶的是,在八个大都市地区,我们发现一个一致的模式,表明不同社会经济阶层的混合越多,他们偏离标准语法和收入的频率越少,越不相互依存。此外,我们提出了一种基于代理人的语言多样性采用模式,该模式可以说明数据中观察到的观察机制。

Article 17

Title@2025-07-10 (4): Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study

Title: Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study

Bewertung der Robustheit von großen Audio-Sprachmodellen zur Audio-Einspritzung: Eine empirische Studie

评估大音频语言模型对音频注射的威力:经验研究 2505.19598v2

Authors (7): Guanyu Hou, Jiaming He, Yinhang Zhou, Ji Guo, Yitong Qiao, Rui Zhang, Wenbo Jiang

Large Audio-Language Models (LALMs) are increasingly deployed in real-world applications, yet their robustness against malicious audio injection attacks remains underexplored. This study systematically evaluates five leading LALMs across four attack scenarios: Audio Interference Attack, Instruction Following Attack, Context Injection Attack, and Judgment Hijacking Attack. Using metrics like Defense Success Rate, Context Robustness Score, and Judgment Robustness Index, their vulnerabilities and resilience were quantitatively assessed. Experimental results reveal significant performance disparities among models; no single model consistently outperforms others across all attack types. The position of malicious content critically influences attack effectiveness, particularly when placed at the beginning of sequences. A negative correlation between instruction-following capability and robustness suggests models adhering strictly to instructions may be more susceptible, contrasting with greater resistance by safety-aligned models. Additionally, system prompts show mixed effectiveness, indicating the need for tailored strategies. This work introduces a benchmark framework and highlights the importance of integrating robustness into training pipelines. Findings emphasize developing multi-modal defenses and architectural designs that decouple capability from susceptibility for secure LALMs deployment.

大型音频语言模型(LALMS)越来越多地用于现实世界应用,但其抵御恶意声击袭击的强力仍未得到充分探索。本研究系统地评估了四种攻击情景中的五种主要的LALMS:音频干扰攻击、攻击指令、上击攻击、上击攻击和袭击判决。使用国防成功率、背景强力分数和判断强力指数等衡量标准,对其脆弱性和复原力进行了定量评估。实验结果显示各模型之间业绩差异巨大;没有单一模型在所有攻击类型中始终优于其他模型。恶意内容对袭击有效性的严重影响,特别是在开始顺序时。执行指令的能力与强力表明严格遵守指令的模式之间的负相关性可能更大,而安全调整模型的阻力则更大。此外,系统提示显示效果参差不齐,表明需要制定有针对性的战略。这项工作引入了一个基准框架,并强调了将强力纳入培训管道的重要性。调查结果强调开发多式防御和建筑设计,使其不易受到安全LALMS部署的敏感度。

Article 18

Title@2025-07-10 (4): DocCHA: Towards LLM-Augmented Interactive Online diagnosis System

Title: DocCHA: Towards LLM-Augmented Interactive Online diagnosis System

DocCHA: Auf dem Weg zum LLM-Augmented Interactive Online-Diagnosesystem

DocCHA:争取建立LLM-增强的互动式在线诊断系统 2507.07870v1

Authors (5): Xinyi Liu, Dachun Sun, Yi R. Fung, Dilek Hakkani-Tür, Tarek Abdelzaher

Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, with only modest increase in dialogue turns. These results demonstrate the effectiveness of DocCHA in enabling structured, transparent, and efficient diagnostic conversations – paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.

尽管大语言模型(LLMs)的能力令人印象深刻,但现有的相互交流的健康代理机构(CHAs)仍然一成不变和易碎,无法适应性多方向推理、症状澄清或透明决策,这妨碍了其在临床诊断中的实际适用性,而临床诊断中迭接和结构化对话是必不可少的。我们建议DocCHA是一个充满信心的模块框架,仿效临床推理,将诊断过程分为三个阶段:(1) 症状解析,(2) 历史获取,(3) 因果关系图构建。每个模块都使用可解释的信任分数来指导适应性询问,确定信息澄清的优先顺序,并完善薄弱的推理联系。根据两个真实世界中国协商数据集(IMCS21, DX)的评估,DocCHA一贯优于基于快速的LM基准(GPT-3.5, GPT-4o, LLMA-3), 诊断准确度达到5.18%,症状回顾率提高30%以上,对话转来仅略有增加。这些结果表明,DocCHA在促成结构、透明和高效的诊断性对话对话中的有效性,为多语和资源培训的可信赖的LM授权临床助手铺设者铺铺铺铺铺路。

Article 19

Title@2025-07-10 (4): Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation

Title: Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation

Alpay Algebra V: Multi-Layered Semantic Games und Transfinite Fixed-Point Simulation

Alpay Alphay Algebabra V:多语言语义运动会和跨线固定点模拟 2507.07868v1

Authors (2): Bugra Kilictas, Faruk Alpay

This paper extends the self-referential framework of Alpay Algebra into a multi-layered semantic game architecture where transfinite fixed-point convergence encompasses hierarchical sub-games at each iteration level. Building upon Alpay Algebra IV’s empathetic embedding concept, we introduce a nested game-theoretic structure where the alignment process between AI systems and documents becomes a meta-game containing embedded decision problems. We formalize this through a composite operator $\phi(\cdot, \gamma(\cdot))$ where $\phi$ drives the main semantic convergence while $\gamma$ resolves local sub-games. The resulting framework demonstrates that game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally. We prove a Game Theorem establishing existence and uniqueness of semantic equilibria under realistic cognitive simulation assumptions. Our verification suite includes adaptations of Banach’s fixed-point theorem to transfinite contexts, a novel $\phi$-topology based on the Kozlov-Maz’ya-Rossmann formula for handling semantic singularities, and categorical consistency tests via the Yoneda lemma. The paper itself functions as a semantic artifact designed to propagate its fixed-point patterns in AI embedding spaces – a deliberate instantiation of the “semantic virus” concept it theorizes. All results are grounded in category theory, information theory, and realistic AI cognition models, ensuring practical applicability beyond pure mathematical abstraction.

本文扩展了 Alpay Algebra 的自我偏好框架, 将其扩展为多层次的抽象游戏结构, 即半边端固定点趋同包含每个迭代级别的等级子游戏。基于 Alpay Algebra IV 的同情嵌入概念, 我们引入了一个嵌套游戏理论结构, 将AI 系统和文档之间的校正进程变成包含嵌入决定问题的元游戏。我们通过一个复合操作器$\phi(\cdot,\gamma(\cdot)) 正式确定这一点, 在那里, $\phie 驱动主语系趋同, 而$\ gamma$ 解决了本地的理论。由此产生的框架显示游戏理论推理自然地产生于固定点的嵌入式嵌入概念, 而不是外部强加的。我们证明这是一个游戏理论, 在现实的认知模拟假设下, 将语系定点定点的定点对定空空间环境进行修改, 一种基于 Kozlov- maz 的理论性理论, 将它自己设计成一个固定的纸质的纸质性。

Article 20

Title@2025-07-10 (4): Skywork-R1V3 Technical Report

Title: Skywork-R1V3 Technical Report

Technischer Bericht Skywork-R1V3

Skywork-R1V3 技术报告 2507.06167v3

Authors (11): Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu, Jian Peng, Haofeng Sun, Yunzhuo Hao, Peiyu Wang, Jianhao Zhang, Yahui Zhou

We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model’s reasoning ability, without the need for additional continue pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.

我们引入了Skywork-R1V3, 这是一种先进的开放源码视觉语言模型(VLM),它开创了对视觉推理的新方法。它的关键创新在于有效地将推理技能从只有文本的大型语言模型(LLMS)转移到视觉任务。 Skywork-R1V3的出色表现主要来自我们精心制定的训练后RL框架,它有效地激活和增强了模型的推理能力,而无需再继续接受培训。我们通过这个框架,进一步发现连接器模块在为多式联运推理模型实现强有力的跨模式协调方面的基本作用。此外,我们引入了一个独特的推理能力指标,即关键推理符号的灵敏性,这在RL培训期间已证明对检查站的选择非常有效。Skywork-R1V3在MMMMMMM公司上取得了最新的成果,大大地从64.3%提高到76.0%。这一性能与入门级人的能力相匹配。值得注意的是,我们的RL驱动力后培训方法使得38B参数模型能够与顶级的封闭源码VLMSm3。我们进行更深入的推理学的推理分析。在Skal推理分析中成功的推理研究中, 也成功地展示了另一个的数学推理学的推理学。

Article 21

Title@2025-07-10 (4): From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems

Title: From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems

Von der Ambiguität zur Genauigkeit: Der transformative Effekt der Koreferenzlösung auf retrieval-augmentierte Erzeugungssysteme

从模糊到准确性:关于回收-提款一代系统的共同决议的变革效应 2507.07847v1

Authors (6): Youngjoon Jang, Seongtae Hong, Junyoung Son, Sungjin Park, Chanjun Park, Heuiseok Lim

Retrieval-Augmented Generation (RAG) has emerged as a crucial framework in natural language processing (NLP), improving factual consistency and reducing hallucinations by integrating external document retrieval with large language models (LLMs). However, the effectiveness of RAG is often hindered by coreferential complexity in retrieved documents, introducing ambiguity that disrupts in-context learning. In this study, we systematically investigate how entity coreference affects both document retrieval and generative performance in RAG-based systems, focusing on retrieval relevance, contextual understanding, and overall response quality. We demonstrate that coreference resolution enhances retrieval effectiveness and improves question-answering (QA) performance. Through comparative analysis of different pooling strategies in retrieval tasks, we find that mean pooling demonstrates superior context capturing ability after applying coreference resolution. In QA tasks, we discover that smaller models benefit more from the disambiguation process, likely due to their limited inherent capacity for handling referential ambiguity. With these findings, this study aims to provide a deeper understanding of the challenges posed by coreferential complexity in RAG, providing guidance for improving retrieval and generation in knowledge-intensive AI applications.

在自然语言处理(NLP)中,将外部文件检索与大型语言模型(LLM)相结合,从而改进实际一致性,减少幻觉,从而成为自然语言处理(NLP)中的一个关键框架,提高了事实一致性,减少了幻觉;然而,检索文件的共变复杂性往往阻碍RAG的效力,造成内含学习的模糊性;在这项研究中,我们系统地调查实体参照如何影响基于RAG的系统中的文件检索和感化性能,重点是检索相关性、背景理解和总体反应质量;我们证明,共同参照解决办法提高了检索的有效性,改进了问题解答(QA)的性能;通过对检索任务中不同的集中战略进行比较分析,我们发现,在应用共同参照分辨率后,集合表明更能捕捉能力。在QA的任务中,我们发现较小的模型可能由于处理特异性的固有能力有限而从分解过程中获益更多。通过这些研究结果,旨在更深入地了解在RAG的共变复杂性带来的挑战,为改进知识密集型AI应用程序的检索和生成提供指导。

Article 22

Title@2025-07-10 (4): Truth-value judgment in language models: ‘truth directions’ are context sensitive

Title: Truth-value judgment in language models: ‘truth directions’ are context sensitive

Wahrheit-Wert-Urteil in Sprachmodellen: ‘Wahrheitsrichtungen’ sind kontextsensibel

语言模型中的真相价值判断:“真相方向”是背景敏感 2404.18865v2

Authors (4): Stefan F. Schouten, Peter Bloem, Ilia Markov, Piek Vossen

Recent work has demonstrated that the latent spaces of large language models (LLMs) contain directions predictive of the truth of sentences. Multiple methods recover such directions and build probes that are described as uncovering a model’s “knowledge” or “beliefs”. We investigate this phenomenon, looking closely at the impact of context on the probes. Our experiments establish where in the LLM the probe’s predictions are (most) sensitive to the presence of related sentences, and how to best characterize this kind of sensitivity. We do so by measuring different types of consistency errors that occur after probing an LLM whose inputs consist of hypotheses preceded by (negated) supporting and contradicting sentences. We also perform a causal intervention experiment, investigating whether moving the representation of a premise along these truth-value directions influences the position of an entailed or contradicted sentence along that same direction. We find that the probes we test are generally context sensitive, but that contexts which should not affect the truth often still impact the probe outputs. Our experiments show that the type of errors depend on the layer, the model, and the kind of data. Finally, our results suggest that truth-value directions are causal mediators in the inference process that incorporates in-context information.

最近的工作表明,大型语言模型(LLMS)的潜在空间包含对判决真相的预测方向。多种方法恢复了这些方向,并建立了被描述为揭示模型“知识”或“信仰”的探点。我们仔细研究这个现象,仔细研究背景对探点的影响。我们的实验确定了在LLM中,探测器的预测(最主要)对相关句子的存在十分敏感,以及如何最好地描述这种敏感度。我们这样做的方法是测量在测测测LM(LM)之后发生的不同类型的一致性错误,而LLM的输入包括(否定的)支持和相互矛盾的句子之前的假设。我们还进行了因果干预实验,调查沿着这些真相价值方向的假设的表述是否影响着一个包含或矛盾的句子的位置。我们发现,我们测试的探点一般是背景敏感,但不应该影响真相的环境通常仍然影响探测结果。我们的实验表明,错误的类型取决于层次、模型和数据类型。最后,我们的结果表明,在判点中,真理价值是将因果关系纳入信息中。

Article 23

Title@2025-07-10 (4): None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks

Title: None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks

Keiner der anderen: eine allgemeine Technik zur Unterscheidung von der Erinnerung an Multiple-Choice-LLM-Bewertungs-Benchmarks

其他无其他:在多杯LLM评价基准中区分与记忆化区别理由的一般技术 2502.12896v5

Authors (3): Eva Sánchez Salido, Julio Gonzalo, Guillermo Marco

In LLM evaluations, reasoning is often distinguished from recall/memorization by performing numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rather than memorizing) in order to answer correctly. Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset. Results show that all models experience remarkable accuracy drops under our proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access 2024, ranging from 10% to 93% across models. Notably, the most accurate model in our experimentation (OpenAI-o3-mini) is not the most robust (DeepSeek-R1-70B), suggesting that the best models in standard evaluations may not be the ones with better reasoning capabilities. Also, we see larger accuracy drops in public (vs private) datasets and questions posed in their original language (vs a manual translation), which are signs of contamination and also point to a relevant role of recall/memorization in current LLMs’ answers.

在LLM 评估中,推理往往不同于对数学导向问题进行数字变异的回溯/记忆。在这里,我们引入了对多种选择问题的一般性变异方法,将正确答案完全与先前看到的象征或概念完全分离,要求LLMS理解和理性(而不是记忆化),以便正确回答。使用这种方法,我们评估英语和西班牙语两个数据集上的最新专有和开放源LMS:公共MMMLU基准和私人UNED-Acess 2024数据集。结果显示,所有模型在拟议变异中都经历了显著的准确性下降,MMMLU和UNED-Acess 2024平均损失57%,UNED-Acess 2024平均损失50%,各模型平均损失10%至93%。值得注意的是,我们实验中最准确的模型(OpenAI-o3-mini)并不是最可靠的(DepSeSeek-R1-70B),表明标准评价中的最佳模型可能不是具有更好推理能力的模型。此外,我们看到,在公共(v)手册中的原始数据翻译中(也是原始的标记和检索中的标记)。

Article 24

Title@2025-07-10 (4): Constrain Alignment with Sparse Autoencoders

Title: Constrain Alignment with Sparse Autoencoders

Beschränkung der Ausrichtung mit Sparse Autoencodern

与 Sparse 自动对齐 2411.07618v4

Authors (10): Qingyu Yin, Chak Tou Leong, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang

The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignments.

大型语言模式(LLMS)与人类偏好相匹配仍然是一个关键的挑战。尽管培训后技术(如“加强人类反馈学习”和“直接偏好优化”)取得了显著成功,但它们往往导致计算效率低下和训练不稳定。在本文中,我们提出“特级有限偏好优化”(FPO)新颖方法,旨在简化校准进程,同时确保稳定性。FPO利用预先培训的“Sparse Autoencors (SAEs)” ,引入了特质级限制,允许高效、宽度强化校准。我们的方法通过使用精密的稀疏特性在训练有素的稀疏的稀释自动校对器中激活,以及使用地层离线参考的相继KL差异的质量而效率得到提高。基准数据集的实验结果表明,FPO在赢率方面实现了5.08%的绝对改善,其计算成本比最先进的基线低得多,这为高效、可控的LM校准的解决方案带来了希望。

Article 25

Title@2025-07-10 (4): Unsupervised Morphological Tree Tokenizer

Title: Unsupervised Morphological Tree Tokenizer

Unüberwachter morphologischer Baum Tokenizer

不受监督的病理树化器 2406.15245v2

Authors (5): Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu

As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named $\textit{MorphOverriding}$ to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece on both morphological segmentation tasks and language modeling tasks. Code is available at https://github.com/martianmartina/TreeTokenizer.

作为语言模型的基石,象征性化意味着将文字输入分解成预定义的原子单位。常规统计符号经常扰乱单词内部的构成界限,从而腐蚀语义信息。为了解决这一缺陷,我们引入了形态结构指导符号化,并提出了一个深刻的模型以诱导字符级结构。具体地说,深层模型将内部结构和文字表达方式与一个名为$\textit{Morphovertriding}$的机制联合编码,以确保模phemes的不兼容性。通过以自我监督的目标培训模型,我们的方法能够引导与形态学规则相一致的字符级结构,而没有附加说明的培训数据。基于导出的结构,我们的算法通过自上至下方式的词汇匹配将文字符号化。经验性结果显示,拟议的方法有效地保留了形态分解任务和语言建模任务上广泛采用的方法,如BPE和WordPiece。代码可在 https://github.com/martianmatina/Treekeizer上查阅。

Article 26

Title@2025-07-10 (4): MAEBE: Multi-Agent Emergent Behavior Framework

Title: MAEBE: Multi-Agent Emergent Behavior Framework

MAEBE: Multi-Agent Emergent Behavior Framework

多边代理新兴行为框架 2506.03053v2

Authors (4): Sinem Erisken, Timothy Gothard, Martin Leitgab, Ram Potham

Traditional AI safety evaluations on isolated LLMs are insufficient as multi-agent AI ensembles become prevalent, introducing novel emergent risks. This paper introduces the Multi-Agent Emergent Behavior Evaluation (MAEBE) framework to systematically assess such risks. Using MAEBE with the Greatest Good Benchmark (and a novel double-inversion question technique), we demonstrate that: (1) LLM moral preferences, particularly for Instrumental Harm, are surprisingly brittle and shift significantly with question framing, both in single agents and ensembles. (2) The moral reasoning of LLM ensembles is not directly predictable from isolated agent behavior due to emergent group dynamics. (3) Specifically, ensembles exhibit phenomena like peer pressure influencing convergence, even when guided by a supervisor, highlighting distinct safety and alignment challenges. Our findings underscore the necessity of evaluating AI systems in their interactive, multi-agent contexts.

对孤立的LLMs进行传统的AI安全评价是不够的,因为多试剂AI联合体变得很普遍,带来新的新风险。本文件介绍了多代理新兴行为评价(MAEBE)框架,以系统评估此类风险。我们利用MAEBE和最伟大的良好基准(以及一种新型的双重反向问题技术)来证明:(1)LLM道德偏好,特别是工具伤害的道德偏好,在单一代理体和组合体中都令人惊讶地变得脆弱,随着问题设置而发生重大变化。 (2)LLM联合体的道德推理不能直接从新出现的集团动态造成的孤立的代理体行为中预见出来。(3)具体地说,集合体展示了影响趋同的同行压力等现象,即使由上司指导,也突出了不同的安全和一致性挑战。我们的调查结果强调,有必要在其互动的多剂环境中评价AI系统。

Article 27

Title@2025-07-10 (4): The Thin Line Between Comprehension and Persuasion in LLMs

Title: The Thin Line Between Comprehension and Persuasion in LLMs

Die dünne Linie zwischen Verständnis und Überzeugung in LLMs

LLMM 理解与劝导之间的细细线 2507.01936v2

Authors (2): Adrian de Wynter, Tangming Yuan

Large language models (LLMs) are excellent at maintaining high-level, convincing dialogues. They are being fast deployed as chatbots and evaluators in sensitive areas, such as peer review and mental health applications. This, along with the disparate accounts on their reasoning capabilities, calls for a closer examination of LLMs and their comprehension of dialogue. In this work we begin by evaluating LLMs’ ability to maintain a debate–one of the purest yet most complex forms of human communication. Then we measure how this capability relates to their understanding of what is being talked about, namely, their comprehension of dialogical structures and the pragmatic context. We find that LLMs are capable of maintaining coherent, persuasive debates, often swaying the beliefs of participants and audiences alike. We also note that awareness or suspicion of AI involvement encourage people to be more critical of the arguments made. When polling LLMs on their comprehension of deeper structures of dialogue, however, they cannot demonstrate said understanding. Our findings tie the shortcomings of LLMs-as-evaluators to their (in)ability to understand the context. More broadly, for the field of argumentation theory we posit that, if an agent can convincingly maintain a dialogue, it is not necessary for it to know what it is talking about. Hence, the modelling of pragmatic context and coherence are secondary to effectiveness.

大型语言模型(LLMS)在保持高层次的、令人信服的对话方面是极好的,它们作为聊天室和评价员在敏感领域,如同行审议和心理健康应用等领域迅速被部署为聊天室和评价员,这与对其推理能力的不同描述一道,要求更仔细地审查LLMS, 并理解对话。在这项工作中,我们首先评估LLMS保持辩论能力的能力,这是人类交流中最纯最复杂形式的辩论之一。然后,我们衡量这种能力与他们所谈论的内容,即他们对对话结构的理解和务实背景的理解有何关系。我们认为,LMS有能力保持连贯、有说服力的辩论,经常扭曲参与者和观众的信仰。我们还注意到,对AI参与的认识或怀疑鼓励人们更加批评所提出的论点。然而,当LMS对更深层次对话结构的理解进行民意测验时,他们无法表现出理解。我们的调查结果将LMS-S-S-evluaers的缺点与他们理解背景的(无法理解)的缺点联系在一起。更广义地说,对于争论理论领域来说,我们认为,如果一个代理人能够令人信服地说,那么,那么,那么,它就是,它就应该保持一种务实的对话,那么,那么,它就应该保持一种具有何种联系。

Article 28

Title@2025-07-10 (4): Conditional Unigram Tokenization with Parallel Data

Title: Conditional Unigram Tokenization with Parallel Data

Bedingte Unigramm-Tokenisierung mit parallelen Daten

附带平行数据的条件性大学招式 2507.07824v1

Authors (2): Gianluca Vico, Jindřinch Libovický

We introduce conditional unigram tokenization, a novel approach that extends unigram tokenization by conditioning target token probabilities on source-language tokens from parallel data. Given a fixed source tokenizer, our method learns a target tokenizer that maximizes cross-lingual semantic alignment. We evaluate our tokenizer on four language pairs across different families and resource levels, examining intrinsic properties and downstream performance on machine translation and language modeling. While our conditional tokenizer maintains comparable statistical properties to standard unigram tokenizers, results are mixed: we observe no improvements in machine translation quality, but find consistent perplexity reductions in language modeling. We hypothesize that quadratic scaling of conditional probability estimation with respect to the vocabulary size creates a data efficiency bottleneck. Our findings suggest that alternative parameterizations may be necessary for practical cross-lingual tokenization.

我们引入了有条件的单克表示法, 这是一种通过从平行数据中设定源语言象征物目标符号概率来限制源语言象征物目标符号概率来扩展单克表示法的新办法。在使用固定来源象征器的情况下, 我们的方法会学习一个目标表示法, 以最大程度实现跨语言语义语义对齐。我们在不同家庭和资源层次上对四种语言配方进行代号评估, 检查机器翻译和语言建模的内在特性和下游性能。虽然我们的有条件象征物与标准的单克象征物具有可比的统计属性, 但结果好坏参半: 我们没有看到机器翻译质量的改善, 但是在语言建模中却发现一致的不易解。我们假设了在词汇大小方面有条件概率估算的四重度大小会造成数据效率瓶颈。我们的研究结果表明, 其它参数化也许对于实际的跨语言代用代用代用代言符号可能是必要的。

Article 29

Title@2025-07-10 (4): On the Effect of Instruction Tuning Loss on Generalization

Title: On the Effect of Instruction Tuning Loss on Generalization

Auf die Auswirkungen der Instruktion Tuning Verlust auf die Verallgemeinerung

指示计票损失对普遍化的影响的影响 2507.07817v1

Authors (4): Anwoy Chatterjee, H S V N S Kowndinya Renduchintala, Sumit Bhatia, Tanmoy Chakraborty

Instruction Tuning has emerged as a pivotal post-training paradigm that enables pre-trained language models to better follow user instructions. Despite its significance, little attention has been given to optimizing the loss function used. A fundamental, yet often overlooked, question is whether the conventional auto-regressive objective - where loss is computed only on response tokens, excluding prompt tokens - is truly optimal for instruction tuning. In this work, we systematically investigate the impact of differentially weighting prompt and response tokens in instruction tuning loss, and propose Weighted Instruction Tuning (WIT) as a better alternative to conventional instruction tuning. Through extensive experiments on five language models of different families and scale, three finetuning datasets of different sizes, and five diverse evaluation benchmarks, we show that the standard instruction tuning loss often yields suboptimal performance and limited robustness to input prompt variations. We find that a low-to-moderate weight for prompt tokens coupled with a moderate-to-high weight for response tokens yields the best-performing models across settings and also serve as better starting points for the subsequent preference alignment training. These findings highlight the need to reconsider instruction tuning loss and offer actionable insights for developing more robust and generalizable models. Our code is open-sourced at https://github.com/kowndinya-renduchintala/WIT.

教学图案已成为一个关键的培训后范例,使得培训前语言模型能够更好地遵循用户指令。尽管它意义重大,但很少注意优化所使用的损失功能。一个根本性但经常被忽视的问题是常规自动递减目标(即损失只计算在应答标牌上,不包括及时标牌上)是否真正是教学调适的最佳方法。在这项工作中,我们系统地调查在调试损失时对快速和应答标牌进行不同加权的影响,并提议将加权指示图案(WIT)作为常规调控的更好替代方法。通过对不同家庭和规模的五种语言模型、三个不同尺寸的微调数据集和五个不同的评估基准进行广泛实验,我们发现标准指示调整损失往往产生不理想的性能,对输入快速变异作用的力度也有限。我们发现,在调控标物的低调权重加上中度至高度的响应标物,可以产生最佳表现模式,还可以作为随后的优惠调整培训的更佳起点。这些发现,我们需要重新思考的源码/变换。

Article 30

Title@2025-07-10 (4): Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning

Title: Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning

Verständnis und Kontrolle von Wiederholungsneuronen und Induktionsköpfen im In-Context-Lernen

了解和控制再生中新中世纪和内文学习中的上岗负责人 2507.07810v1

Authors (3): Nhi Hoai Doan, Tatsuya Hiraoka, Kentaro Inui

This paper investigates the relationship between large language models’ (LLMs) ability to recognize repetitive input patterns and their performance on in-context learning (ICL). In contrast to prior work that has primarily focused on attention heads, we examine this relationship from the perspective of skill neurons, specifically repetition neurons. Our experiments reveal that the impact of these neurons on ICL performance varies depending on the depth of the layer in which they reside. By comparing the effects of repetition neurons and induction heads, we further identify strategies for reducing repetitive outputs while maintaining strong ICL capabilities.

本文探讨大型语言模型识别重复输入模式的能力与这些模型在文字内学习方面的表现之间的关系(ICL)。与以前主要侧重于关注负责人的工作相比,我们从技能神经元,特别是重复神经元的角度来研究这种关系。我们的实验显示,这些神经元对ICL性能的影响因所处层次的深度而异。通过比较重复神经元和感应头的影响,我们进一步确定减少重复产出的战略,同时保持ICL的强大能力。

Article 31

Title@2025-07-10 (4): Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers

Title: Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers

Überbrückung von Logik und Lernen: Dekodierung von Temporal Logic-Embeddings über Transformer

架桥逻辑与学习:通过变形器解码时时逻辑嵌入 2507.07808v1

Authors (4): Sara Candussio, Gaia Saveri, Gabriele Sarti, Luca Bortolussi

Continuous representations of logic formulae allow us to integrate symbolic knowledge into data-driven learning algorithms. If such embeddings are semantically consistent, i.e. if similar specifications are mapped into nearby vectors, they enable continuous learning and optimization directly in the semantic space of formulae. However, to translate the optimal continuous representation into a concrete requirement, such embeddings must be invertible. We tackle this issue by training a Transformer-based decoder-only model to invert semantic embeddings of Signal Temporal Logic (STL) formulae. STL is a powerful formalism that allows us to describe properties of signals varying over time in an expressive yet concise way. By constructing a small vocabulary from STL syntax, we demonstrate that our proposed model is able to generate valid formulae after only 1 epoch and to generalize to the semantics of the logic in about 10 epochs. Additionally, the model is able to decode a given embedding into formulae that are often simpler in terms of length and nesting while remaining semantically close (or equivalent) to gold references. We show the effectiveness of our methodology across various levels of training formulae complexity to assess the impact of training data on the model’s ability to effectively capture the semantic information contained in the embeddings and generalize out-of-distribution. Finally, we deploy our model for solving a requirement mining task, i.e. inferring STL specifications that solve a classification task on trajectories, performing the optimization directly in the semantic space.

逻辑公式的持续表达方式允许我们将象征性知识纳入数据驱动的学习算法中。如果这种嵌入方式具有内在一致性, 即如果将相似的规格映射到附近的矢量中, 能够直接在公式的语义空间中进行持续学习和优化。但是, 要将最佳的连续表达方式转化为具体要求, 这种嵌入方式必须是不可忽略的。我们通过训练一个基于变异器的解码器单一模型, 将信号时间逻辑公式( STL) 的语义嵌入垂直嵌入。 STL 是一种强大的形式主义, 使我们能够以直截面但简洁的方式描述时间上不同信号的属性。我们通过从STLyn Tax 构建一个小的词汇来直接学习和优化公式空间表达方式。我们的模型能够在大约10个球中概括逻辑的语义。此外, 模型可以解码一个在公式中嵌入的内嵌入点, 其长度和嵌入方式通常比较简单, 同时保持语义性( 或等效) , 以直截面的方式描述不同时间级的语义解解的语义解解解解解解解解解解解码。我们的Sl定位的解的解解解解解的解析中, 。我们测测算的模型中, 。

Article 32

Title@2025-07-10 (4): Decoding AI Judgment: How LLMs Assess News Credibility and Bias

Title: Decoding AI Judgment: How LLMs Assess News Credibility and Bias

Entschlüsselung des AI-Urteils: Wie LLMs neue Glaubwürdigkeit und Bias bewerten

证明AI 判决:LLMs如何评估新闻信誉和Bias 2502.04426v2

Authors (9): Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Alessandro Santirocchi, Roberto Atzeni, Matteo Cinelli, Vincenzo Cestari, Clelia Rossi-Arnaud, Walter Quattrociocchi

Large Language Models (LLMs) are increasingly embedded in workflows that involve evaluative processes. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings–NewsGuard and Media Bias/Fact Check (MBFC)–and against human judgments collected through a controlled experiment. To enable direct comparison, we implement a structured agentic framework in which both models and non-expert participants follow the same evaluation procedure: selecting criteria, retrieving content, and producing justifications. Despite output alignment, LLMs rely on different mechanisms: lexical associations and statistical priors replace contextual reasoning. This reliance produces systematic effects: political asymmetries, opaque justifications, and a tendency to confuse linguistic form with epistemic validity. Delegating judgment to such systems does not merely automate evaluation–it redefines it, shifting from normative reasoning to pattern-based approximation.

大型语言模型(LLMS)越来越多地嵌入涉及评价过程的工作流程中,这就需要审查这类评价是如何建立的,它们依赖哪些假设,以及它们的战略与人类的战略有何不同。我们根据专家评级-NewsGuard和Media Bias/Fact Check(MBFC)和通过受控试验收集的人类判断对6 LLMS进行基准测试。为了进行直接比较,我们实施了一个结构化的代理框架,模型和非专家参与者在其中遵循同样的评价程序:选择标准、检索内容和提出理由。尽管产出一致,LLMS依靠不同的机制:词汇协会和统计前身取代了背景推理。这种依赖产生了系统效应:政治不对称、不透明的理由以及将语言形式与认知有效性混为一谈的倾向。

Article 33

Title@2025-07-10 (4): StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model

Title: StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model

StreamUni: Streaming Speech Translation mit einem einheitlichen Large Speech-Language-Modell erreichen

StreamUli:用统一大型语音语言模式实现流式语音翻译 2507.07803v1

Authors (6): Shoutao Guo, Xiang Li, Shaolei Zhang, Mengge Liu, Wei Chen, Yang Feng

Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without requiring massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks.

流言翻译(StreamST)需要确定适当的时间(称为政策),以产生翻译,同时不断接收源语言投入,平衡低水平与高翻译质量之间的平衡。然而,现有的流言方法通常在句层语言部分运作,称为同时语音翻译(SimulST ) 。在实际中,它们需要与分解模式合作,以完成流言ST,在流言部分限制SimulST模式以做出政策决定并根据有限的背景信息生成翻译。此外,SimulST模型努力学习有效的政策,因为语音投入和跨语言一代的复杂性。为了应对这些挑战,我们提议StreamUni,通过统一的大语言翻译模型(LSimulST)实现流言。具体地说,StreamUni在引导流言链语言链(Cott)中引入语音链(CoT) 以生成多阶段产出。利用这些多阶段产出,StreamUnii同时完成语音分割、政策决定和翻译生成,完成StreamUnialST,不要求大规模的政策流-Cleglemental destration-culation-destration-destration-destrual-destration-destration sal-destrual-destrual-destrages

Article 34

Title@2025-07-10 (4): Understanding Chain-of-Thought in LLMs through Information Theory

Title: Understanding Chain-of-Thought in LLMs through Information Theory

Verständnis der in LLMs durch Informationstheorie gesuchten Gedankenkette

通过信息理论在LLM 中探索了解链 2411.11984v2

Authors (3): Jean-Francois Ton, Muhammad Faaiz Taufiq, Yang Liu

Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through the use of Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short in accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the `information-gain’ at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy arithmetic, GSM8K and PRM800k datasets, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual subtasks.

大型语言模型(LLMs)通过使用“Cott”推理,在复杂的推理任务中表现出了令人印象深刻的成绩,使模型能够将问题分为可管理的子任务,然而,现有的Cot评价技术要么需要附加注释的COT数据,要么在准确评估中间推理步骤方面做得不够,导致出现高比例的假正数。在本文件中,我们通过信息理论透镜将LLMs中的Cot推理工作正式化。具体地说,我们的框架对每个推理步骤的“信息收益”进行了量化,使得LLOMs的失败模式得以识别,而无需花费昂贵的附加注释数据集。我们通过微量算术、GSM8K和PRM800k数据集的广泛实验,展示了我们的方法的有效性,在这些实验中,它通过提供对单个子任务模型性能的更准确的洞察力,大大超过现有的基于结果的方法。

Article 35

Title@2025-07-10 (4): Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

Title: Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

Squeeze the Soaked Sponge: Effiziente Off-Policy-Verstärkung Feinsteuerung für großes Sprachmodell

挤压海绵:高效非政策强化大语言模式的高效非政策改进微调 2507.06892v2

Authors (8): Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, Jianye Hao

Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continuing economic and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models upon PPO, GRPO and 1.5B, 7B base models. ReMix shows an average Pass@1 accuracy of 52.10% (for 1.5B model) with 0.079M response rollouts, 350 training steps and achieves 63.27%/64.39% (for 7B model) with 0.007M/0.011M response rollouts, 50/75 training steps, on five math reasoning benchmarks (i.e., AIME’24, AMC’23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over 30x to 450x reduction in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy, the collapse mode of self-reflection behavior under the presence of severe off-policyness, etc.

强化学习(RL) 展示了它提高大语言模型(LLM)推理能力的潜力。多数现有的强化微调(RFT)方法的一个主要局限性是,这些方法在性质上是政策性RL,即,过去学习过程中生成的数据没有得到充分利用。这不可避免地要付出巨大的计算和时间代价,对持续的经济和高效规模造成严格的瓶颈。为此,我们启动了退出政策RL的复兴,并提议恢复Mix-polient Explex Proximal Deferation(ReMix),这是在政策上推行50FT方法(PPO和GROPO等)以利用离政策数据的一般方法。 ReMix包含三个主要组成部分:(1) Mix-polict-Data(UTD)比率增加;(2) KL-Conex政策制约,以平衡稳定性和灵活性之间的贸易模式;(3) 政策再整合,从早期高效学习到平稳的测试改进。 (REMMMM) 在我们的实验中,我们用一个平均的模型显示SO-DM数据序列,包括最新的SO-Dal deal deal deal deal destrate deal deal demode deal deal deal demodestration the mession the messional demodudustration the mex in the mex the mex mex mex mex deal deal dealation)

Article 36

Title@2025-07-10 (4): When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance

Title: When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance

Wenn große Sprachmodelle das Recht erfüllen: Dual-Lens-Taxonomie, technischer Fortschritt und ethische Governance

当大语言模式符合法律时:双重语言分类、技术进步和道德治理 2507.07748v1

Authors (5): Peizhang Shao, Linrui Xu, Jinxi Wang, Wei Zhou, Xingyu Wu

This paper establishes the first comprehensive review of Large Language Models (LLMs) applied within the legal domain. It pioneers an innovative dual lens taxonomy that integrates legal reasoning frameworks and professional ontologies to systematically unify historical research and contemporary breakthroughs. Transformer-based LLMs, which exhibit emergent capabilities such as contextual reasoning and generative argumentation, surmount traditional limitations by dynamically capturing legal semantics and unifying evidence reasoning. Significant progress is documented in task generalization, reasoning formalization, workflow integration, and addressing core challenges in text processing, knowledge integration, and evaluation rigor via technical innovations like sparse attention mechanisms and mixture-of-experts architectures. However, widespread adoption of LLM introduces critical challenges: hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry. This review proposes a novel taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, thus systematizing advances in reasoning, retrieval, prediction, and dispute resolution. It identifies key frontiers including low-resource systems, multimodal evidence integration, and dynamic rebuttal handling. Ultimately, this work provides both a technical roadmap for researchers and a conceptual framework for practitioners navigating the algorithmic future, laying a robust foundation for the next era of legal artificial intelligence. We have created a GitHub repository to index the relevant papers: https://github.com/Kilimajaro/LLMs_Meet_Law.

本文件确立了对在法律领域应用的大型语言模型的第一次全面审查,它开创了一种创新的双镜头分类学,将法律推理框架和专业理论结合起来,以系统地统一历史研究和当代突破。基于变革的LLMs, 展示了背景推理和基因推理等新兴能力,通过动态地捕捉法律语义和统一证据推理克服了传统限制。在任务概括、推理正规化、工作流程整合和通过技术创新解决文本处理、知识整合和评价方面的核心挑战方面取得了显著进展。它确定了关键领域,包括低资源系统、多式证据整合和动态的反驳处理。然而,广泛采用LLM带来了关键挑战:幻觉、可解释性缺陷、管辖适应困难和道德不对称。本审查提出了新颖的分类学,为NLP子塔斯和计算性论证框架绘制了法律作用图,从而系统化了逻辑推理、检索、预测和争议解决方面的进展。我们确定了关键领域,包括低资源系统、多式证据整合,以及动态的反驳处理。这项工作最终为研究人员提供了一个技术路线图,为未来日历数据库提供了一个相关的方向基础和概念框架。

Article 37

Title@2025-07-10 (4): Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review

Title: Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review

Code-Switching in End-to-End Automatische Spracherkennung: Ein systematischer Literaturbericht

端至端自动语音识别代码转换:系统文学审查 2507.07741v1

Authors (5): Maha Tufail Agro, Atharva Kulkarni, Karima Kadaoui, Zeerak Talat, Hanan Aldarmaki

Motivated by a growing research interest into automatic speech recognition (ASR), and the growing body of work for languages in which code-switching (CS) often occurs, we present a systematic literature review of code-switching in end-to-end ASR models. We collect and manually annotate papers published in peer reviewed venues. We document the languages considered, datasets, metrics, model choices, and performance, and present a discussion of challenges in end-to-end ASR for code-switching. Our analysis thus provides insights on current research efforts and available resources as well as opportunities and gaps to guide future research.

由于对自动语音识别(ASR)的研究兴趣日益浓厚,而且对经常使用代码转换(CS)的语文的工作越来越多,我们系统地对终端至终端ASR模型的代码转换进行文献审查,我们收集并手工编写在同行审议地点发表的注释文件,记录所审议的语文、数据集、指标、模式选择和绩效,并讨论终端至终端ASR在代码转换方面的挑战。因此,我们的分析提供了对当前研究工作和现有资源以及指导未来研究的机会和差距的深刻见解。

Article 38

Title@2025-07-10 (4): GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

Title: GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

GuardVal: Dynamic Large Language Model Jailbreak Evaluation für umfassende Sicherheitstests

警卫:综合安全测试动态大语言示范监狱防爆评价 2507.07735v1

Authors (4): Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang

Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the sophistication required in effectively probing their vulnerabilities. Current benchmarks and evaluation methods struggle to fully address these challenges, leaving gaps in the assessment of LLM vulnerabilities. In this paper, we review existing jailbreak evaluation practices and identify three assumed desiderata for an effective jailbreak evaluation protocol. To address these challenges, we introduce GuardVal, a new evaluation protocol that dynamically generates and refines jailbreak prompts based on the defender LLM’s state, providing a more accurate assessment of defender LLMs’ capacity to handle safety-critical situations. Moreover, we propose a new optimization method that prevents stagnation during prompt refinement, ensuring the generation of increasingly effective jailbreak prompts that expose deeper weaknesses in the defender LLMs. We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains. Our findings highlight distinct behavioral patterns among the models, offering a comprehensive view of their robustness. Furthermore, our evaluation process deepens the understanding of LLM behavior, leading to insights that can inform future research and drive the development of more secure models.

在大型语言模型(LLMs)中,破狱袭击暴露出巨大的脆弱性,导致其产生有害或不道德的内容。评估这些威胁尤其具有挑战性,因为LLMs的性质不断演变,有效调查其脆弱性需要精密的技巧。目前的基准和评价方法努力充分应对这些挑战,在评估LLM脆弱性方面留下空白。在本文件中,我们审查现有的破狱评估做法,并找出三种假定的缺陷,以制定有效的破狱评估议定书。为了应对这些挑战,我们引入了GuardVal,这是一项新的评价协议,根据捍卫者LLM的状态,积极生成和完善破狱提示,为捍卫者LLMs处理安全危急情况的能力提供了更准确的评估。此外,我们提出了新的优化方法,防止在迅速完善期间出现停滞,确保产生日益有效的破狱提示,暴露捍卫者LMMs的更深处弱点。我们将这一协议应用于从Mistral-7b到GPT-4的10个安全领域的各种模式。我们的调查结果突出了各种模式中的不同行为模式,提供了对其健全性的全面认识。此外,我们的评估进程加深了对LMM行为的未来认识。

Article 39

Title@2025-07-10 (4): Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization

Title: Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization

Nicht alle Präferenzen sind das, was Sie für das Post-Training benötigen: Selektive Ausrichtungsstrategie für die Preference-Optimierung

并非所有的优惠都是培训后需要的:选择性的优化优化战略 2507.07725v1

Authors (1): Zhijin Dong

Post-training alignment of large language models (LLMs) is a critical challenge, as not all tokens contribute equally to model performance. This paper introduces a selective alignment strategy that prioritizes high-impact tokens within preference pairs, leveraging token-level log-probability differences between the current policy and a reference model. By focusing on these informative tokens, our approach reduces computational overhead and enhances alignment fidelity. We further explore the role of reference model quality, demonstrating that stronger reference models significantly improve token selection accuracy and overall optimization effectiveness. Comprehensive experiments on benchmarks such as Arena-Hard and MT-Bench validate the superiority of our Selective-DPO method over standard DPO and distillation-based baselines. Our findings highlight the importance of token-level optimization and reference model selection in advancing preference alignment for LLMs. The code is available at https://github.com/Dongzhijin/SDPO.

大型语言模式(LLMs)的培训后调整是一个重大挑战,因为并非所有象征性都同样有助于示范性业绩。本文引入了选择性调整战略,将特惠配方中高影响力的标牌列为优先,利用当前政策和参考模式之间的象征性记录概率差异。我们的方法注重这些信息性标牌,减少了计算性间接费用,提高了一致性。我们进一步探讨了参考性模型质量的作用,表明较强的参考模型大大提高了象征性选择准确性和总体优化效果。Arena-Hard和MT-Bench等基准的全面实验验证了我们选择性-DPO方法优于标准的DPO和蒸馏基准。我们的调查结果强调,在推进普惠性LMs的优惠一致性方面,象征性优化和参考模式选择十分重要。该代码可在https://github.com/Dongzhijin/SDPO上查阅。

Article 40

Title@2025-07-10 (4): Rethinking the Privacy of Text Embeddings: A Reproducibility Study of “Text Embeddings Reveal (Almost) As Much As Text”

Title: Rethinking the Privacy of Text Embeddings: A Reproducibility Study of “Text Embeddings Reveal (Almost) As Much As Text”

Die Privatsphäre von Text-Embeddings neu denken: Eine Reproduzierbarkeitsstudie von “Text-Embeddings Reveal (fast) So viel wie Text”

重新思考文字嵌入的隐私:关于“文字嵌入流(几乎)与文字一样”的可复制性研究 2507.07700v1

Authors (4): Dominykas Seputis, Yongkang Li, Karsten Langerak, Serghei Mihailov

Text embeddings are fundamental to many natural language processing (NLP) tasks, extensively applied in domains such as recommendation systems and information retrieval (IR). Traditionally, transmitting embeddings instead of raw text has been seen as privacy-preserving. However, recent methods such as Vec2Text challenge this assumption by demonstrating that controlled decoding can successfully reconstruct original texts from black-box embeddings. The unexpectedly strong results reported by Vec2Text motivated us to conduct further verification, particularly considering the typically non-intuitive and opaque structure of high-dimensional embedding spaces. In this work, we reproduce the Vec2Text framework and evaluate it from two perspectives: (1) validating the original claims, and (2) extending the study through targeted experiments. First, we successfully replicate the original key results in both in-domain and out-of-domain settings, with only minor discrepancies arising due to missing artifacts, such as model checkpoints and dataset splits. Furthermore, we extend the study by conducting a parameter sensitivity analysis, evaluating the feasibility of reconstructing sensitive inputs (e.g., passwords), and exploring embedding quantization as a lightweight privacy defense. Our results show that Vec2Text is effective under ideal conditions, capable of reconstructing even password-like sequences that lack clear semantics. However, we identify key limitations, including its sensitivity to input sequence length. We also find that Gaussian noise and quantization techniques can mitigate the privacy risks posed by Vec2Text, with quantization offering a simpler and more widely applicable solution. Our findings emphasize the need for caution in using text embeddings and highlight the importance of further research into robust defense mechanisms for NLP systems.

文本嵌入是许多自然语言处理( NLP) 任务的基础, 广泛应用于建议系统和信息检索( IR) 等领域。传统上, 传输嵌入而非原始文本被视为隐私保护。然而, 诸如 Vec2Text 等最新方法挑战了这一假设, 表明控制解码能够成功重建黑盒嵌入的原始文本。 Vec2Text 所报告的意外强烈结果促使我们进一步进行核查, 特别是考虑到高维嵌入空间通常非直观和不透明的结构。在这项工作中, 我们复制了 Vec2T 框架, 从两个角度对它进行评估:(1) 验证原始要求和(2) 扩展研究范围, 通过有针对性的实验扩展了原始关键关键结果, 显示内部和外环境的原始解码。此外, 我们通过进行参数敏感性分析, 评估敏感输入( 如, 密码T) 的可行性, 以及探索嵌入的精度精度, 将精度的精度缩缩缩缩缩略度的精度技术, 显示我们的精度序列的精度的精度, 显示我们的精度的精度的精度的精细度的精细度的精度的精度的精度。

Article 41

Title@2025-07-10 (4): What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training

Title: What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training

Was wissen selbstüberwachte Sprachmodelle über Niederländisch? Analysieren von Vorteilen sprachspezifischer Vorausbildung

自我监督的演讲模式对荷兰语了解多少? 分析具体语言培训前培训的优势 2506.00981v2

Authors (6): Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum

How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it’s less clear to what extent pre-training on specific languages improves language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.

现有工作显示,一系列语言特征可以成功地从仅接受语言录音培训的端到端模式解码,然而,具体语言培训前培训能在多大程度上改善语言特定语言信息,这一点不太清楚。在这里,我们测试了在自我监督的Wav2Vec2模式的内部表述中荷兰语语音和词汇信息的编码。专门针对荷兰语的预先培训,改善了荷兰语语言特征的表述,与类似数量英语或较大多语言数据的培训前培训相比。这种语言特定优势通过经过培训的集群或分类探测得到很好地发现,并且使用零光度度来部分观察。此外,语言特征编码与自动语音识别的下游性能相一致,也带来了语言特征编码的好处。

Article 42

Title@2025-07-10 (4): KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

Title: KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

KeyKnowledgeRAG (K^2RAG): Eine verbesserte RAG-Methode zur Verbesserung der LLM-Fragestellung

KeyknowledgeraG(K2RAG):改进LLM问答能力的强化RAG方法 2507.07695v1

Authors (4): Hruday Markondapatnaikuni, Basem Suleiman, Abdelkarim Erradi, Shijing Chen

Fine-tuning is an immensely resource-intensive process when retraining Large Language Models (LLMs) to incorporate a larger body of knowledge. Although many fine-tuning techniques have been developed to reduce the time and computational cost involved, the challenge persists as LLMs continue to grow in size and complexity. To address this, a new approach to knowledge expansion in LLMs is needed. Retrieval-Augmented Generation (RAG) offers one such alternative by storing external knowledge in a database and retrieving relevant chunks to support question answering. However, naive implementations of RAG face significant limitations in scalability and answer accuracy. This paper introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome these limitations. Inspired by the divide-and-conquer paradigm, K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. The framework also includes a preprocessing step that summarizes the training data, significantly reducing the training time. K2RAG was evaluated using the MultiHopRAG dataset, where the proposed pipeline was trained on the document corpus and tested on a separate evaluation set. Results demonstrated notable improvements over common naive RAG implementations. K2RAG achieved the highest mean answer similarity score of 0.57, and reached the highest third quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers. In addition to improved accuracy, the framework proved highly efficient. The summarization step reduced the average training time of individual components by 93%, and execution speed was up to 40% faster than traditional knowledge graph-based RAG systems. K2RAG also demonstrated superior scalability, requiring three times less VRAM than several naive RAG implementations tested in this study.

在再培训大型语言模型(LLMS)以纳入更多的知识时,微调是一个非常耗资的过程。虽然已经开发了许多微调技术来减少所涉及的时间和计算成本,但挑战依然存在,因为LLMS在规模和复杂性上继续增长。要解决这个问题,就需要在LLMS中采用新的知识扩展方法。检索增强的一代(RAG)提供了一个这样的替代办法,将外部知识储存在一个数据库中,并重新获取相关数据块以支持回答问题。然而,对RAG的天真的实施在可缩缩缩缩和回答准确性方面面临着重大限制。本文介绍了KeyKKKnowledgeraG(K2RAG),这是一个旨在克服这些限制的新框架。受分解和正拼模式的启发,K2RAG整合了密集和稀薄的矢量搜索、知识图表和文本合成,以提高检索质量和系统效率。框架还包括一个预处理步骤,以总结培训数据,大大减少培训时间。K2RAG的天平调度评估是通过多HOG(M)的缩缩缩略数据集,其中拟议的编订了最高执行时间,并测试了KLADADAD的进度。

Article 43

Title@2025-07-10 (4): SAS: Simulated Attention Score

Title: SAS: Simulated Attention Score

SAS: Simulierter Aufmerksamkeits-Score

SAS:模拟关注计分 2507.07694v1

Authors (15): Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Peihao Wang, Jing Xiong, Liliang Ren, Hao Cheng, Janardhan Kulkarni, Yelong Shen, Atlas Wang, Mac Schwager, Anderson Schneider, Xiaodong Liu, Jianfeng Gao

The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. Beyond the head representations, we further extend the simulation approach to feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.

注意机制是变换器结构的核心组成部分。已经开发了多种方法来计算注意分数, 包括多头注意( MAH)、多头注意( 多头注意)、多组注意( 集体注意) 等。我们进一步分析MAH, 并观察到其性能随着关注头数的增加而提高, 只要每头隐藏的大小仍然足够大, 其性能就会得到改善。因此, 增加头部数和头部隐藏的大小, 并同时使用最低参数管理器, 能够以低成本带来显著的绩效增益。受此洞察的影响, 我们引入了模拟注意分数( SAS) , 维持一个紧凑的模型大小, 同时激发更多的关注头部和每头隐藏的特性。这是通过将一个低维头表示器投射到一个更高的空间, 有效增加关注能力, 而不增加参数计数。除了头部外, 我们进一步将模拟方法扩大到关键和查询嵌入的特征层面, 通过模拟模拟模型的行为来提高外表的清晰度, 同时保留原始模型大小。为了控制参数成本, 我们还提议了Para度- Effect 注意 Agyal ex expractal ex laction laction laction laction laction laction laction laction laction laction laction laction laction laction laction lactions

Article 44

Title@2025-07-10 (4): Hierarchical Bracketing Encodings for Dependency Parsing as Tagging

Title: Hierarchical Bracketing Encodings for Dependency Parsing as Tagging

Hierarchische Bracketing-Encodings für Dependency Parsing als Tagging

将依赖性剖析作为拖贴 2505.11693v2

Authors (4): Ana Ezquerro, David Vilares, Anssi Yli-Jyrä, Carlos Gómez-Rodríguez

We present a family of encodings for sequence labeling dependency parsing, based on the concept of hierarchical bracketing. We prove that the existing 4-bit projective encoding belongs to this family, but it is suboptimal in the number of labels used to encode a tree. We derive an optimal hierarchical bracketing, which minimizes the number of symbols used and encodes projective trees using only 12 distinct labels (vs. 16 for the 4-bit encoding). We also extend optimal hierarchical bracketing to support arbitrary non-projectivity in a more compact way than previous encodings. Our new encodings yield competitive accuracy on a diverse set of treebanks.

我们以等级括号概念为基础,提出一套按顺序标记依赖分析的编码。我们证明现有的4位投影编码属于这个家庭,但在用于编码一棵树的标签数量上并不理想。我们得出一个最佳的等级括号,只使用12个不同的标签(4位编码为16个)来尽量减少所用符号的数量和投影树的编码。我们还扩大最佳等级括号,以比以前的编码更紧凑的方式支持任意的非预测性。我们的新编码在不同的树库中产生竞争性精度。

Article 45

Title@2025-07-10 (4): Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues

Title: Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues

Ko-Konstruktives Verhalten von großen Sprachmodellen in Erklärungsdialogen untersuchen

解释对话中大语言模式的共同调查行为 2504.18483v2

Authors (12): Leandra Fichtel, Maximilian Spliethöver, Eyke Hüllermeier, Patricia Jimenez, Nils Klowait, Stefan Kopp, Axel-Cyrille Ngonga Ngomo, Amelie Robrecht, Ingrid Scharlau, Lutz Terfloth, Anna-Lisa Vollmer, Henning Wachsmuth

The ability to generate explanations that are understood by explainees is the quintessence of explainable artificial intelligence. Since understanding depends on the explainee’s background and needs, recent research focused on co-constructive explanation dialogues, where an explainer continuously monitors the explainee’s understanding and adapts their explanations dynamically. We investigate the ability of large language models (LLMs) to engage as explainers in co-constructive explanation dialogues. In particular, we present a user study in which explainees interact with an LLM in two settings, one of which involves the LLM being instructed to explain a topic co-constructively. We evaluate the explainees’ understanding before and after the dialogue, as well as their perception of the LLMs’ co-constructive behavior. Our results suggest that LLMs show some co-constructive behaviors, such as asking verification questions, that foster the explainees’ engagement and can improve understanding of a topic. However, their ability to effectively monitor the current understanding and scaffold the explanations accordingly remains limited.

由于理解取决于解释者的背景和需要,最近的研究侧重于共同构思解释对话,其中一名解释者不断监测解释者的理解并动态调整其解释。我们调查了大语言模型(LLMs)作为解释者参与共同构思解释对话的能力。特别是,我们提出了一个用户研究,其中解释者在两个场合与一个LLM互动,其中之一是LLM被指示共同解释一个专题。我们评估解释者在对话前后的理解,以及他们对LLMS共同构思行为的看法。我们的结果表明,LLMS展示了一些共同构思行为,例如询问核查问题,促进解释者参与和增进对一个专题的理解。然而,他们有效监测当前理解并据此筛选解释的能力仍然有限。

Article 46

Title@2025-07-10 (4): Improving Cross-lingual Representation for Semantic Retrieval with Code-switching

Title: Improving Cross-lingual Representation for Semantic Retrieval with Code-switching

Verbesserung der Cross-lingual Darstellung für semantische Retrieval mit Code-Schaltung

使用代码转换法改进语义检索的跨语种代表性 2403.01364v2

Authors (6): Mieradilijiang Maimaiti, Yuanhang Zheng, Ji Zhang, Yue Zhang, Wenpei Luo, Kaiyu Huang

Semantic Retrieval (SR) has become an indispensable part of the FAQ system in the task-oriented question-answering (QA) dialogue scenario. The demands for a cross-lingual smart-customer-service system for an e-commerce platform or some particular business conditions have been increasing recently. Most previous studies exploit cross-lingual pre-trained models (PTMs) for multi-lingual knowledge retrieval directly, while some others also leverage the continual pre-training before fine-tuning PTMs on the downstream tasks. However, no matter which schema is used, the previous work ignores to inform PTMs of some features of the downstream task, i.e. train their PTMs without providing any signals related to SR. To this end, in this work, we propose an Alternative Cross-lingual PTM for SR via code-switching. We are the first to utilize the code-switching approach for cross-lingual SR. Besides, we introduce the novel code-switched continual pre-training instead of directly using the PTMs on the SR tasks. The experimental results show that our proposed approach consistently outperforms the previous SOTA methods on SR and semantic textual similarity (STS) tasks with three business corpora and four open datasets in 20+ languages.

在以任务为导向的问答(QA)对话情景中,对跨语言智能客户服务系统的需求最近有所增加。对电子商务平台或某些特定商业条件的跨语言智能客户服务系统的需求最近不断增长。大多数以往的研究都利用跨语言预先培训模式直接检索多种语言知识,而其他一些研究也在对下游任务PTM进行微调之前,利用持续培训前训练,对下游任务的PTM进行微调。然而,无论采用什么方案,以往的工作都忽略了向PTM公司通报下游任务的某些特点,即培训他们的PTM公司,而不提供与SR有关的任何信号。为此,我们提议通过代码转换,为SR公司采用替代的跨语言跨语言跨语言的跨语言的PTM系统。我们首先对跨语言的SR采用代码转换方法。此外,我们引入了新型的代码转换前训练,而不是直接在SR任务上直接使用PTM。实验结果显示,我们提议的SVS+SMS的三种数据方法始终以SOTA格式取代了先前的三种SOTA方法。

Article 47

Title@2025-07-10 (4): Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language

Title: Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language

Beyond Hate Speech: NLPs Herausforderungen und Chancen beim Enthumanisieren der Sprache

超越仇恨言论:NLP在揭开非人化语言方面的挑战和机遇 2402.13818v2

Authors (5): Hamidreza Saffari, Mohammadamin Shafiei, Hezhao Zhang, Lasana Harris, Nafise Sadat Moosavi

Dehumanization, i.e., denying human qualities to individuals or groups, is a particularly harmful form of hate speech that can normalize violence against marginalized communities. Despite advances in NLP for detecting general hate speech, approaches to identifying dehumanizing language remain limited due to scarce annotated data and the subtle nature of such expressions. In this work, we systematically evaluate four state-of-the-art large language models (LLMs) - Claude, GPT, Mistral, and Qwen - for dehumanization detection. Our results show that only one model-Claude-achieves strong performance (over 80% F1) under an optimized configuration, while others, despite their capabilities, perform only moderately. Performance drops further when distinguishing dehumanization from related hate types such as derogation. We also identify systematic disparities across target groups: models tend to over-predict dehumanization for some identities (e.g., Gay men), while under-identifying it for others (e.g., Refugees). These findings motivate the need for systematic, group-level evaluation when applying pretrained language models to dehumanization detection tasks.

剥夺人性,即剥夺个人或群体人格,是一种特别有害的仇恨言论形式,可以使针对边缘化社区的暴力行为正常化。尽管国家语言方案在发现一般仇恨言论方面取得了进展,但由于缺少附加说明的数据和这种表达方式的微妙性质,识别非人性语言的方法仍然有限。在这项工作中,我们系统地评估了四种最先进的大语言模式(LLMS)――Claude、GPT、Mistral和Quwen)――用于非人性化检测。我们的结果表明,在优化配置下,只有一种模式-Claude-achief(超过80%F1)表现强劲,而其他人尽管能力强,但表现却比较中度。在区分非人性化与相关的仇恨类型(如减损)时,表现进一步下降。我们还查明了不同目标群体之间的系统性差异:模型往往过度预测某些身份(例如同性恋男子)的非人性化,而对其他身份(例如难民)则未得到充分识别。这些结论表明,在应用预先培训的语言模式进行非人性化检测任务时,需要系统、群体一级评估。

Article 48

Title@2025-07-10 (4): An Automated Length-Aware Quality Metric for Summarization

Title: An Automated Length-Aware Quality Metric for Summarization

Ein Automatisiertes Längen-Bewusst-Qualitäts-Metrik für die Zusammenfassung

用于汇总的自动长软件质量计量器 2507.07653v1

Authors (1): Andrew D. Foland

This paper proposes NOrmed Index of Retention (NOIR), a quantitative objective metric for evaluating summarization quality of arbitrary texts that relies on both the retention of semantic meaning and the summary length compression. This gives a measure of how well the recall-compression tradeoff is managed, the most important skill in summarization. Experiments demonstrate that NOIR effectively captures the token-length / semantic retention tradeoff of a summarizer and correlates to human perception of sumarization quality. Using a language model-embedding to measure semantic similarity, it provides an automated alternative for assessing summarization quality without relying on time-consuming human-generated reference summaries. The proposed metric can be applied to various summarization tasks, offering an automated tool for evaluating and improving summarization algorithms, summarization prompts, and synthetically-generated summaries.

本文建议了NOMed保留指数(NOIR),这是评价任意文本质量的定量客观衡量标准,既依靠保留语义含义,又依靠简要长度压缩,用来评估任意文本的质量。这可以衡量对回扣压缩权衡的管理有多好,这是总结的最重要技能。实验表明,NOIR有效捕捉了总结者的象征性/语义保留权衡,并与人类对均衡质量的看法有关。它使用语言模型组合来衡量语义相似性,它为评估总和质量提供了一种自动化的替代方法,而不必依赖耗费时间的人工生成参考摘要。拟议的衡量标准可以应用于各种总和任务,为评估和改进加和算算法、加和合成生成摘要提供了一个自动化工具。

Article 49

Title@2025-07-10 (4): Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement

Title: Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement

Lost in Pronunciation: Chinesische Offensive Sprache entdecken, verkleidet durch phonetische Umkleide-Ersatz

失落于发音中:发现因替换语音内衣而变形的中国进攻性语言 2507.07640v1

Authors (11): Haotan Guo, Jianfei He, Jiayuan Ma, Hongbin Na, Zimu Wang, Haiyang Zhang, Qi Chen, Wei Wang, Zijing Shi, Tao Shen, Ling Chen

Phonetic Cloaking Replacement (PCR), defined as the deliberate use of homophonic or near-homophonic variants to hide toxic intent, has become a major obstacle to Chinese content moderation. While this problem is well-recognized, existing evaluations predominantly rely on rule-based, synthetic perturbations that ignore the creativity of real users. We organize PCR into a four-way surface-form taxonomy and compile \ours, a dataset of 500 naturally occurring, phonetically cloaked offensive posts gathered from the RedNote platform. Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness: the best model reaches only an F1-score of 0.672, and zero-shot chain-of-thought prompting pushes performance even lower. Guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective and show that it recovers much of the lost accuracy. This study offers the first comprehensive taxonomy of Chinese PCR, a realistic benchmark that reveals current detectors’ limits, and a lightweight mitigation technique that advances research on robust toxicity detection.

虽然这个问题已得到广泛承认,但现有的评价主要依赖基于规则的合成扰动,忽视了实际用户的创造力。我们把PCR组织成一个四向地表形态分类学,并汇编出一套由500个自然现象组成的数据集,即从红诺平台收集的有声隐蔽的进攻性柱。在这个数据集上确定最新水平的LLMS的基准暴露出一个严重的弱点:最佳模型只达到0.672的F1点,零点思维链推动性能甚至更低。我们根据错误分析,重新审视了以Pininin为基础的快速战略,早期研究认为它无效,并表明它收回了大部分丢失的准确性。这一研究提供了中国PCR的第一个全面分类学,一个显示当前探测器极限的现实基准,以及一个推进可靠毒性检测研究的轻量级减轻技术。

Article 50

Title@2025-07-10 (4): FrugalRAG: Learning to retrieve and reason for multi-hop QA

Title: FrugalRAG: Learning to retrieve and reason for multi-hop QA

FrugalRAG: Lernen zum Abrufen und Grund für Multi-Hop-QA

FrugalRAG:学会检索和多呼QA的理由 2507.07634v1

Authors (4): Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma

We consider the problem of answering complex questions, given access to a large unstructured document corpus. The de facto approach to solving the problem is to leverage language models that (iteratively) retrieve and reason through the retrieved documents, until the model has sufficient information to generate an answer. Attempts at improving this approach focus on retrieval-augmented generation (RAG) metrics such as accuracy and recall and can be categorized into two types: (a) fine-tuning on large question answering (QA) datasets augmented with chain-of-thought traces, and (b) leveraging RL-based fine-tuning techniques that rely on question-document relevance signals. However, efficiency in the number of retrieval searches is an equally important metric, which has received less attention. In this work, we show that: (1) Large-scale fine-tuning is not needed to improve RAG metrics, contrary to popular claims in recent literature. Specifically, a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help RAG from the perspective of frugality, i.e., the latency due to number of searches at inference time. For example, we show that we can achieve competitive RAG metrics at nearly half the cost (in terms of number of searches) on popular RAG benchmarks, using the same base model, and at a small training cost (1000 examples).

我们考虑了回答复杂问题的问题,因为有了大量结构化的文件资料库,我们考虑了回答复杂的问题的问题。事实上解决问题的方法是利用语言模型,这些语言模型(表面上)通过检索的文件检索和解释,直到该模型有足够的信息来找到答案。改进这一方法的尝试侧重于检索强化的生成(RAG)指标,例如准确性和回忆性,可以分为两类:(a)对大问题的回答(QA)数据集进行微调,增加思考链的痕迹;(b)利用基于RL的微调技术,这些技术依赖于问题文件的相关性信号。然而,检索搜索数量的效率是一个同样重要的衡量标准,但这一衡量标准得到的注意较少。在这项工作中,我们表明:(1) 与最近的文献中流行的说法相反,不需要进行大规模的微调来改进RAG的衡量标准。具体地说,改进的提示性标准“ReAc”管道可以超越HotPA等基准的先进模型数目。 (2) 超额和基于RL的微调技术方法,用于在RAG的50%的搜索中,从我们进行适当的成本模型搜索中可以证明。

Article 51

Title@2025-07-10 (4): Towards a cognitive architecture to enable natural language interaction in co-constructive task learning

Title: Towards a cognitive architecture to enable natural language interaction in co-constructive task learning

Auf dem Weg zu einer kognitiven Architektur, um natürliche Sprachinteraktion im co-konstruktiven Aufgabenlernen zu ermöglichen

建立一个认知结构,在共同建设性任务学习中促成自然语言互动 2503.23760v2

Authors (5): Manuel Scheibl, Birte Richter, Alissa Müller, Michael Beetz, Britta Wrede

This research addresses the question, which characteristics a cognitive architecture must have to leverage the benefits of natural language in Co-Constructive Task Learning (CCTL). To provide context, we first discuss Interactive Task Learning (ITL), the mechanisms of the human memory system, and the significance of natural language and multi-modality. Next, we examine the current state of cognitive architectures, analyzing their capabilities to inform a concept of CCTL grounded in multiple sources. We then integrate insights from various research domains to develop a unified framework. Finally, we conclude by identifying the remaining challenges and requirements necessary to achieve CCTL in Human-Robot Interaction (HRI).

这项研究涉及一个问题,即认知结构必须具备哪些特征才能在共同建设性任务学习中利用自然语言的好处。为了提供背景,我们首先讨论互动任务学习(ITL)、人类记忆系统的机制以及自然语言和多种模式的重要性。接下来,我们审视认知结构的现状,分析其基于多种来源为CCTL概念提供信息的能力。然后,我们整合各种研究领域的见解,以制定一个统一的框架。最后,我们通过确定在人类机器人互动中实现CCTL所需的其余挑战和要求。

Article 52

Title@2025-07-10 (4): Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights

Title: Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights

Vergleichende Stimmungsanalyse der öffentlichen Wahrnehmung: Monkeypox vs. COVID-19 Verhaltenseinblicke

对公众感知的比较情绪分析:天花对COVID-19行为洞察力 2505.07430v2

Authors (3): Mostafa Mohaimen Akand Faisal, Rabeya Amin Jhuma, Jamini Jasim

The emergence of global health crises, such as COVID-19 and Monkeypox (mpox), has underscored the importance of understanding public sentiment to inform effective public health strategies. This study conducts a comparative sentiment analysis of public perceptions surrounding COVID-19 and mpox by leveraging extensive datasets of 147,475 and 106,638 tweets, respectively. Advanced machine learning models, including Logistic Regression, Naive Bayes, RoBERTa, DistilRoBERTa and XLNet, were applied to perform sentiment classification, with results indicating key trends in public emotion and discourse. The analysis highlights significant differences in public sentiment driven by disease characteristics, media representation, and pandemic fatigue. Through the lens of sentiment polarity and thematic trends, this study offers valuable insights into tailoring public health messaging, mitigating misinformation, and fostering trust during concurrent health crises. The findings contribute to advancing sentiment analysis applications in public health informatics, setting the groundwork for enhanced real-time monitoring and multilingual analysis in future research.

全球卫生危机的出现,如COVID-19和天花(天花),突显了了解公众情绪为有效公共卫生战略提供信息的重要性,这项研究通过利用147、475和106 638个推文的广泛数据集,对公众对COVID-19和天花的看法进行了比较情绪分析,利用了147、475和106 638个推文的广泛数据集,先进机器学习模型,包括后勤回流、Naive Bayes、RoBERTA、DistilROBERTA和XLNet,用于进行情绪分类,结果显示公众情绪和讨论的关键趋势。分析突出了疾病特征、媒体代表性和流行病疲劳等驱动的公共情绪的重大差异。通过情绪极化和主题趋势的透镜,这项研究为调整公共卫生信息、减少错误信息,以及在同时发生的卫生危机中培养信任提供了宝贵的洞察力。研究结果有助于推进公共卫生信息学中的情绪分析应用,为今后研究中加强实时监测和多语种分析奠定了基础。

Article 53

Title@2025-07-10 (4): Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks

Title: Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks

Erforschung der Grenzen der Modellkompression in LLMs: Eine Studie zur Wissensdestillation über QA-Aufgaben

探索LLMM中模型压缩的限度:关于质量保证任务的知识积累研究 2507.07630v1

Authors (4): Joyeeta Datta, Niclas Doll, Qusai Ramadan, Zeyd Boukhers

Large Language Models (LLMs) have demonstrated outstanding performance across a range of NLP tasks, however, their computational demands hinder their deployment in real-world, resource-constrained environments. This work investigates the extent to which LLMs can be compressed using Knowledge Distillation (KD) while maintaining strong performance on Question Answering (QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5 families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot prompting conditions. Results show that student models retain over 90% of their teacher models’ performance while reducing parameter counts by up to 57.1%. Furthermore, one-shot prompting yields additional performance gains over zero-shot setups for both model families. These findings underscore the trade-off between model efficiency and task performance, demonstrating that KD, combined with minimal prompting, can yield compact yet capable QA systems suitable for resource-constrained applications.

大型语言模型(LLMS)在一系列NLP任务中表现出了杰出的成绩,然而,它们的计算要求阻碍了其在现实世界中受资源限制环境中的部署。这项工作调查了利用知识蒸馏(KD)压缩LLMS的程度,同时保持了在问答(QA)任务方面的强力表现。我们评估了Pythia和Quen2.5家庭根据两个QA基准(SQAD和MLQA)从SQuAD和MLQA中提炼的学生模型,它们处于零发和一发即时的提示性条件之下。结果显示,学生模型保留了90%以上的教师模型的成绩,同时将参数计数减少57.1 %。此外,一发即提示性能使两个模式家庭在零发计数上产生更多的业绩收益。这些结果强调了模型效率和任务性能之间的权衡,表明KD与最小的提示性能相结合,能够产生适合资源受限制应用的紧凑的QA系统。

Article 54

Title@2025-07-10 (4): Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation

Title: Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation

Gute/böse Reputation Urteil von Prominenten durch LLMs über retrieval Augmented Generation

LLMs通过回回子增量一代对名词的良好/负面评奖判决 2503.14382v2

Authors (3): Rikuto Tsuchida, Hibiki Yokoyama, Takehito Utsuro

The purpose of this paper is to examine whether large language models (LLMs) can understand what is good and evil with respect to judging good/evil reputation of celebrities. Specifically, we first apply a large language model (namely, ChatGPT) to the task of collecting sentences that mention the target celebrity from articles about celebrities on Web pages. Next, the collected sentences are categorized based on their contents by ChatGPT, where ChatGPT assigns a category name to each of those categories. Those assigned category names are referred to as “aspects” of each celebrity. Then, by applying the framework of retrieval augmented generation (RAG), we show that the large language model is quite effective in the task of judging good/evil reputation of aspects and descriptions of each celebrity. Finally, also in terms of proving the advantages of the proposed method over existing services incorporating RAG functions, we show that the proposed method of judging good/evil of aspects/descriptions of each celebrity significantly outperform an existing service incorporating RAG functions.

本文的目的是研究大型语言模型(LLMS)能否理解在判断名人的良好/有害声誉方面什么是好坏。具体地说,我们首先应用一个大型语言模型( ChattGPT ) 来收集在网页上有关名人的文章中提及目标名人的句子。其次,所收集的句子根据查特GPT的内容进行分类,查特GPT为其中的每个类别指定一个类别的名称。指定的类别名称被称为每个名人“ 目标”。然后,通过应用检索增强的一代(RAG ) 框架,我们表明大型语言模型在判断每个名人各方面和描述的良好/有害声誉方面相当有效。最后,在证明拟议方法优于包含RAG功能的现有服务方面,我们表明,拟议的判断每个名人各个方面/特征的好坏的方法大大超越了包含RAG功能的现有服务。

Article 55

Title@2025-07-10 (4): SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs

Title: SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs

RaumViz-Bench: Automatisch generierte räumliche Visualisierungs-Aufgaben für MLLMs

空间Viz-Bench:MLLLMs自动生成的空间可视化推理任务 2507.07610v1

Authors (8): Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, Jun Wang

Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark’s strong discriminative power, but also uncovers counter-intuitive findings: models exhibit unexpected behaviors by showing difficulty perception that misaligns with human intuition, displaying dramatic 2D-to-3D performance cliffs, and defaulting to formula derivation despite spatial tasks requiring visualization alone. SpatialVizBench empirically demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark is publicly available.

人类可以直接想象和操控其头脑中的视觉图像,这种能力被称为空间可视化。多式大语言模型(MLLMs)支持以想象为基础的推理,而空间可视化却仍然没有得到充分评价,通常体现在更广泛的数学和逻辑评估中。现有的评价往往依靠IQ测试或数学竞赛,这些测试或数学竞赛可能与培训数据重叠,损害评估可靠性。为此,我们引入了空间Viz-Bench,这是一个综合的多式空间可视化多式基准,有4个次功能的12项任务,包括1,180个自动产生的问题。我们对33个最新MLLLMs的评估不仅显示广泛的性能差异,并展示了基准的强烈的歧视性力量,而且还发现了反直觉发现:模型表现出出意出乎意料的行为,其所表现出的难度是与人类直觉不相容,展示了巨大的2D到3D的性悬崖,而且尽管空间任务需要单独视觉化,但仍默认公式衍生。空间VizBench 实验性地表明,最先进的MLLMS继续显示空间可视化任务中存在的缺陷,从而可以公开地衡量。

Article 56

Title@2025-07-10 (4): Bayesian Discrete Diffusion Beats Autoregressive Perplexity

Title: Bayesian Discrete Diffusion Beats Autoregressive Perplexity

Bayesian Discrete Diffusion schlägt autoregressive Verwirrung

Bayesian Discrete 扩散比自动递减的两难性 2507.07586v1

Authors (1): Cooper Doyle

We reveal a hidden Bayesian core of discrete-diffusion language models by showing that the expected denoiser output under the forward masking distribution recovers the exact posterior over clean tokens. Under minimal assumptions, Monte Carlo marginalization over K independent corruptions converges to this posterior at rate O(1/sqrt(K)), yielding a simple proof of consistency and finite-sample error bounds. Building on this insight, we introduce a lightweight inference-time ensemble that averages K mask-and-denoise passes to obtain posterior-aware token probabilities and uncertainty estimates at no extra training cost. On WikiText-2, our method achieves test perplexity 8.8 with K=8, versus 20.3 for GPT-2 Small, despite using a model of comparable size. Code is available at https://github.com/mercury0100/bayesradd.

我们揭示了一种隐蔽的贝叶斯语分流语言模型核心。我们通过展示前面遮罩分布下预期的解名器输出将恢复准确的后遗迹而不是干净的象征物。在最起码的假设下,蒙特卡洛对K独立腐败的边缘化以O(1/sqrt(K))率汇集到这个后遗迹上,产生一个简单的一致性证据和有限的标本误差界限。基于这一洞察,我们引入了一种轻巧的推论时间共和元素,即平均使用 K 蒙面和隐蔽的通行证来获得后背负符号概率和不确定性估计,而无需额外培训费用。在WikitText-2上,我们的方法实现了8.8和K=8的双倍性测试,而GPT-2 Small则达到20.3,尽管使用了类似规模的模式。代码可在 https://github.com/mercury0100/Bayesradd上查阅。

Article 57

Title@2025-07-10 (4): Beyond Overcorrection: Evaluating Diversity in T2I Models with DivBench

Title: Beyond Overcorrection: Evaluating Diversity in T2I Models with DivBench

Jenseits von Überkorrektur: Bewertung von Diversität in T2I-Modellen mit DivBench

超越过度纠正:在DivBench的T2I模型中评估多样性 2507.03015v2

Authors (5): Felix Friedrich, Thiemo Ganesha Welsch, Manuel Brack, Patrick Schramowski, Kristian Kersting

Current diversification strategies for text-to-image (T2I) models often ignore contextual appropriateness, leading to over-diversification where demographic attributes are modified even when explicitly specified in prompts. This paper introduces DIVBENCH, a benchmark and evaluation framework for measuring both under- and over-diversification in T2I generation. Through systematic evaluation of state-of-the-art T2I models, we find that while most models exhibit limited diversity, many diversification approaches overcorrect by inappropriately altering contextually-specified attributes. We demonstrate that context-aware methods, particularly LLM-guided FairDiffusion and prompt rewriting, can already effectively address under-diversity while avoiding over-diversification, achieving a better balance between representation and semantic fidelity.

目前的文字到图像(T2I)模型多样化战略往往忽视了背景的适宜性,导致人口特征被过度多样化,即使有明确的规定,人口属性也会被修改,本文件介绍了DIVBENCH,这是衡量T2I一代中差异不足和差异过大的基准和评价框架。通过对最新T2I模型的系统评估,我们发现,虽然大多数模型表现出有限的多样性,但许多多样化方法因环境特征的不适当改变而过于纠正。我们证明,对背景的认识方法,特别是LLLM-指导的公平传播和迅速重写,已经能够有效解决多样性不足问题,同时避免过度多样化,在代表性和语义真实性之间实现更好的平衡。

Article 58

Title@2025-07-10 (4): Improving Clustering on Occupational Text Data through Dimensionality Reduction

Title: Improving Clustering on Occupational Text Data through Dimensionality Reduction

Verbesserung der Clusterbildung auf berufsbezogenen Textdaten durch Dimensionalitätsreduzierung

通过减少分量改进职业文本数据集群化 2507.07582v1

Authors (3): Iago Xabier Vázquez García, Damla Partanaz, Emrullah Fatih Yetkin

In this study, we focused on proposing an optimal clustering mechanism for the occupations defined in the well-known US-based occupational database, ONET. Even though all occupations are defined according to well-conducted surveys in the US, their definitions can vary for different firms and countries. Hence, if one wants to expand the data that is already collected in ONET for the occupations defined with different tasks, a map between the definitions will be a vital requirement. We proposed a pipeline using several BERT-based techniques with various clustering approaches to obtain such a map. We also examined the effect of dimensionality reduction approaches on several metrics used in measuring performance of clustering algorithms. Finally, we improved our results by using a specialized silhouette approach. This new clustering-based mapping approach with dimensionality reduction may help distinguish the occupations automatically, creating new paths for people wanting to change their careers.

在这项研究中,我们侧重于为众所周知的美国职业数据库ONET中界定的职业提出一个最佳的集群机制。尽管所有职业的定义都是根据美国进行的良好调查确定的,但对于不同的公司和国家,其定义可能有所不同。因此,如果想扩大ONET中已经为不同任务界定的职业收集的数据,定义之间的图将是一项至关重要的要求。我们建议使用几种基于BERT的技术编审管道,以获得这样的地图。我们还研究了在测量组合算法绩效时采用的若干衡量标准中采用维度削减方法的影响。最后,我们通过使用专门的光影方法改进了我们的结果。这种新的基于集群的绘图方法加上减少维度,可能有助于自动区分职业,为希望改变职业的人创造新的途径。

Article 59

Title@2025-07-10 (4): COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation

Title: COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation

COALA: Numerisch stabiles und effizientes Framework für kontextabhängige Low-Rank-Annäherung

COALA: 低 Rank 上下低敏度接近度的数值稳定、高效框架 2507.07580v1

Authors (2): Uliana Parkina, Maxim Rakhuba

Recent studies suggest that context-aware low-rank approximation is a useful tool for compression and fine-tuning of modern large-scale neural networks. In this type of approximation, a norm is weighted by a matrix of input activations, significantly improving metrics over the unweighted case. Nevertheless, existing methods for neural networks suffer from numerical instabilities due to their reliance on classical formulas involving explicit Gram matrix computation and their subsequent inversion. We demonstrate that this can degrade the approximation quality or cause numerically singular matrices. To address these limitations, we propose a novel inversion-free regularized framework that is based entirely on stable decompositions and overcomes the numerical pitfalls of prior art. Our method can handle possible challenging scenarios: (1) when calibration matrices exceed GPU memory capacity, (2) when input activation matrices are nearly singular, and even (3) when insufficient data prevents unique approximation. For the latter, we prove that our solution converges to a desired approximation and derive explicit error bounds.

最近的研究显示,环境觉察到的低级近似值是压缩和微调现代大型神经网络的有用工具。在这种近似值中,标准被输入激活矩阵加权,大大改进了未加权案例的测量尺度。然而,神经网络的现有方法由于依赖传统公式而存在数字不稳定性,这些公式包括明确的Gram矩阵计算及其随后的倒置。我们证明这可以降低近似质量或导致数字单一矩阵。为了解决这些限制,我们提议了一个全新的无倒置常规化框架,完全基于稳定的分解和克服先前艺术的数值陷阱。我们的方法可以处理可能具有挑战性的情况:(1) 当校准矩阵超过GPU记忆能力时,(2) 当输入激活矩阵几乎是单一的,(2) 当数据不足时,甚至(3) 当数据无法防止独特的近似时,我们证明我们的解决方案与理想的近似和明显错误的界限一致。

Article 60

Title@2025-07-10 (4): Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation

Title: Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation

Ein-zu-Mix Modalität Ausrichtung mit multimodalen Großsprachenmodellen für die Übersetzung von Dokumentenbildmaschinen

单一至混合模式与文件图像机机翻译多式大语言模式 2507.07572v1

Authors (7): Yupu Liang, Yaping Zhang, Zhiyang Zhang, Yang Zhao, Lu Xiang, Chengqing Zong, Yu Zhou

Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.

文件图像机翻译(DIMT)的目的是在文件图像中翻译文本,由于培训数据有限以及视觉和文字信息之间的复杂互动,面临一般化挑战。为了应对这些挑战,我们引入了M4Doc,这是一个利用多模式大语言模型(MLLM)的新颖的一对一模式协调框架。M4Doc将一个仅显示图像的编码器与MLLM的多式表达器(MLLMM的多式表达器)相匹配,该组合在大规模文件图像数据集方面事先受过培训。这一组合使得一个轻量的DIMT模型能够在培训期间学习关键的视觉和文字相关性。在推断过程中,M4Doc绕过MLLM,在利用多式知识的同时保持计算效率。全面实验表明翻译质量有了重大改善,特别是在跨部通用和具有挑战性的文档图像假想中。

Article 61

Title@2025-07-10 (4): video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Title: video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Video-SALMONN 2: Bildunterschrift-verbesserte Audio-Visuelle große Sprachmodelle

视频-SALMONN2:字幕-强化视听大语言模式 2506.15220v2

Authors (8): Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang

Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2’s captioning accuracy, reducing the captioning error rates by 28\%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Codes are available at \href{https://github.com/bytedance/video-SALMONN-2}{https://github.com/bytedance/video-SALMONN-2}.

视频包含丰富的信息,以自然语言生成详细和准确的描述是视频理解的一个关键方面。本文介绍视频-SALMONN 2 ,这是一种高级视听大型语言模型(LLM),用于通过定向偏好优化(DPO)加强视频(配对音频)字幕。我们提出新的衡量标准,以评价视频描述的完整性和准确性,这些描述使用DPO得到优化。为了进一步改进培训,我们提议采用新的多轮式DPO(MRDPO)方法,其中包括定期更新DPO参考模型,合并和重新启用LORA模块,作为每轮培训(1 000个步骤)后参数更新的代理,并纳入地盘视频字幕的指导,以稳定进程。实验结果表明,MRODO大大加强视频-SALMONN 2的描述准确性,将字幕误差率降低28。最后的视频-SALMONN 2 模式,只有70亿个参数,超过GPT-4-和Gemini-LAL-2S等领先模型,同时广泛使用高竞争性视频标准。

Article 62

Title@2025-07-10 (4): The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs

Title: The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs

Synergy Dilemma von Long-CoT SFT und RL: Untersuchung von Post-Training-Techniken zur Begründung von VLMs

Long-CoT SFT和RL的协同问题:调查培训后用于说明理由的训练后技术 2507.07562v1

Authors (14): Jierun Chen, Tiezheng Yu, Haoli Bai, Lewei Yao, Jiannan Wu, Kaican Li, Fei Mi, Chaofan Tao, Lei Zhu, Manyi Zhang, Xiaohui Li, Lu Hou, Lifeng Shang, Qun Liu

Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This ``synergy dilemma’’ highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.

大型视觉模型(VLMS)越来越多地采用培训后技术,如长期思维链(CoT)监管的微调(SFT)和强化学习(RL)等长期思维链(SFT),以获得精密的推理;这些方法在只使用语言的模型中显示出协同作用,但在VLMS中,这些方法的共同效力仍然不确定;我们系统地调查长的CoT SFT和RL在多种多式联运推理基准中的不同作用和相互作用;我们发现SFT通过深入、有条理的推理改进棘手问题的业绩,但引入了变异性和降低较简单的问题的业绩;相比之下,RLL促进一般化和简略性,在所有困难级别上取得一致的改进,尽管在最棘手问题上的改进不如SFT。令人惊讶的是,通过两个阶段、相互脱节制或渐进的培训战略,以及数据混合和模型的合并,无法产生累加效益,而是导致在准确性、推理风格和反应长度上取舍。

Article 63

Title@2025-07-10 (4): Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

Title: Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

Multi-Head RAG: Lösung von Multi-Aspect-Problemen mit LLMs

多方主管RAG:解决多领域问题与LLM 2406.05085v4

Authors (16): Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic, Yi Zhu, Patrick Iff, Michal Podstawski, Lucas Weitzendorf, Mingyuan Chi, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Torsten Hoefler

Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving observation is that different attention heads learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, multi-aspect datasets, and real-world use cases to demonstrate MRAG’s effectiveness. We show MRAG’s design advantages over 18 RAG baselines, empirical improvements of up to 20% in retrieval success ratios, and benefits for downstream LLM generation. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarks.

重新获取增强的一代(RAG)能够将文件检索到LLM环境中,从而提供更准确和更相关的答复,从而增强大语言模型(LLMS)的能力。现有的RAG解决方案并不侧重于可能需要获取内容大相径庭的多份文件的查询。这类查询经常发生,但具有挑战性,因为这些文件的嵌入在嵌入空间中可能距离遥远,难以全部检索。本文介绍了多主RAG(MRAG),这是一个旨在解决这一差距的新办法,其简单而有力的理念是:利用变换器多头注意层的启动,而不是拆解层,作为获取多层文件的关键。驱动意见是,不同的注意力负责人学会捕捉不同的数据方面。利用相应的启动成果,将数据项目和查询的各个方面嵌入其中,提高复杂查询的检索准确性。我们提供了一种评价和衡量方法,多层数据集,以及真实世界使用案例,以展示MRAG的有效性:利用变换器多头注意层的启动,而不是拆解码层,作为获取多层文件的钥匙。我们发现,不同的关注对象会学会会学会学会学会如何捕捉取不同的数据方面,从而获得数据,从而获得现有RAGMAGMLMB基准的20的创建率基准。

Article 64

Title@2025-07-10 (4): The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora

Title: The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora

Die Cross-Lingual Cost: Retrieval Biases in RAG über arabisch-englische Corpora

跨语言成本:通过阿拉伯语-英语公司在RAG中检索到阿拉伯语-英语公司 2507.07543v1

Authors (5): Chen Amiraz, Yaroslav Fyodorov, Elad Haramaty, Zohar Karnin, Liane Lewin-Eytan

Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with significant performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever’s difficulty in ranking documents across languages. Finally, we propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.

跨语言检索-强化生成(RAG)是获取和生成跨语言答案的关键能力。以前的工作主要侧重于生成,并依赖来自开放域源的基准,最主要是维基百科。在这种环境下,由于语言失衡、与培训前数据重叠和记忆内容,检索挑战往往仍然隐藏在隐蔽处。为填补这一空白,我们使用来自真实世界公司数据集的基准,在具体领域设置的阿拉伯语-英语检索-RAG(RAG),我们使用来自真实世界公司数据集的基准,在具体领域研究阿拉伯语-英语检索-RAG(RAG),我们的基准包括用户查询和辅助文件的所有语言组合,独立和统一地随机绘制。这样可以系统研究多语种检索行为。我们的调查结果显示,检索是跨语言特定域的情景中的关键瓶颈,在用户查询和支持文件语言不同时,业绩会显著下降。一个关键的认识是,这些失败主要源于检索者在跨语言排序文件方面存在的困难。最后,我们建议一个简单的检索战略,通过从两种语言中实施平等检索来解决失败的根源,从而大大改进跨语言和总体业绩。这些结果突出表明改进实际的多语言检索机会,特别是应用世界的切实机会。

Article 65

Title@2025-07-10 (4): CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text

Title: CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text

CEA-LIST bei CheckThat! 2025: Bewertung von LLMs als Detektoren von Bias und Meinung im Text

CEA-LIST 校对:CEA-LIST 校对:2025年 2507.07539v1

Authors (4): Akram Elbouanani, Evan Dufraisse, Aboubacar Tuo, Adrian Popescu

This paper presents a competitive approach to multilingual subjectivity detection using large language models (LLMs) with few-shot prompting. We participated in Task 1: Subjectivity of the CheckThat! 2025 evaluation campaign. We show that LLMs, when paired with carefully designed prompts, can match or outperform fine-tuned smaller language models (SLMs), particularly in noisy or low-quality data settings. Despite experimenting with advanced prompt engineering techniques, such as debating LLMs and various example selection strategies, we found limited benefit beyond well-crafted standard few-shot prompts. Our system achieved top rankings across multiple languages in the CheckThat! 2025 subjectivity detection task, including first place in Arabic and Polish, and top-four finishes in Italian, English, German, and multilingual tracks. Notably, our method proved especially robust on the Arabic dataset, likely due to its resilience to annotation inconsistencies. These findings highlight the effectiveness and adaptability of LLM-based few-shot learning for multilingual sentiment tasks, offering a strong alternative to traditional fine-tuning, particularly when labeled data is scarce or inconsistent.

本文展示了使用大语言模型(LLMS)进行多语种主观性检测的竞争性方法,其效果有限。我们参加了任务1 任务1:检查的主观性。2025年评估运动。我们表明,LMS如果配上精心设计的提示,能够匹配或优于经精细调整的小语言模型(SLMs),特别是在吵闹或低质量的数据设置中。尽管我们尝试了先进的快速工程技术,例如辩论LLMS和各种实例选择战略,但我们发现,除了精心设计的标准的“点击提示”之外,我们发现的好处有限。我们的系统在“CocTeth”中取得了多种语言的排名。2025年主观性检测任务,包括首先在阿拉伯语和波兰语中排名,以及头四个意大利语、英语、德语和多种语言轨道中的成绩。值得注意的是,我们的方法在阿拉伯语数据集上表现得特别有力,这很可能是因为它具有注意到不一致的弹性。这些结果突出表明了基于LM的微粒子学习对于多语种情绪任务的有效性和适应性,提供了一种强有力的替代传统微调的替代方法,特别是当标签数据很少或不一致的时候。

Article 66

Title@2025-07-10 (4): CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks

Title: CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks

CheckEmbed: Effektive Überprüfung von LLM-Lösungen auf offene Aufgaben

复选对象:有效核查对不限名额任务LLM解决方案的有效核查 2406.02524v5

Authors (12): Maciej Besta, Lorenzo Paleari, Marcin Copik, Robert Gerstenberger, Ales Kubicek, Piotr Nyczyk, Patrick Iff, Eric Schreiber, Tanja Srindran, Tomasz Lehmann, Hubert Niewiadomski, Torsten Hoefler

Large Language Models (LLMs) are transforming a wide range of domains, yet verifying their outputs remains a significant challenge, especially for complex open-ended tasks such as consolidation, summarization, and knowledge extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable, and accurate verification method. CE reduces each LLM answer to a single embedding vector using powerful modern embedding LLM models like SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied on weaker encoders like BERT, forcing them to operate at token or sentence granularity. In contrast, CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. We conduct a comprehensive design and time complexity analysis across 13 verification baselines, including classical text scorers (e.g., BLEU), stability-based methods (e.g., SelfCheckGPT), and generative evaluators (e.g., LLM-as-a-Judge), which highlights the effectiveness, efficiency, versatility, and simplicity of CE. Empirical results show that CE reliably detects hallucinations in both closed and open-ended tasks. We further present evidence that CE generalizes beyond text to other modalities such as vision, establishing it as a practical and versatile verification framework.

大型语言模型(LLMS)正在改变一系列广泛的领域,但核实其产出仍是一项重大挑战,特别是对于合并、汇总和知识提取等复杂的开放性任务而言。为了解决这个问题,我们引入了CheckEmbed(CE):一个简单、可缩放和准确的核查方法。CE使用SFR-Embedding-Mistral等强大的现代嵌入式LLM模型,将每个LLM的答案降低为单一的嵌入矢量。BERTScore和Self CheckGPT等先前的方法,如BERT等较弱的编码器和Self CheckerGPT, 迫使它们以象征性或句号颗粒方式运作。相比之下,CE直接在整个答题一级进行快速、语义丰富的比较,克服准确性和可缩放度两方面的关键限制。我们通过13个核查基线,包括古典文本评分(e.BLEUE)、基于稳定性的方法(eselfecklyGPT)和基因化评价员(e.gralM-asal-ating judialjudal)等方法,以进一步显示C-listral-viewdal 和C.

Article 67

Title@2025-07-10 (4): Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

Title: Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

Thought Crime: Hintertüren und Emergent-Missausrichtung in vernünftigen Modellen

思想犯罪:后门和合理理由模型中新出现的不协调现象 2506.13206v2

Authors (4): James Chua, Jan Betley, Mia Taylor, Owain Evans

Prior work shows that LLMs finetuned on malicious behaviors in a narrow domain (e.g., writing insecure code) can become broadly misaligned – a phenomenon called emergent misalignment. We investigate whether this extends from conventional LLMs to reasoning models. We finetune reasoning models on malicious behaviors with Chain-of-Thought (CoT) disabled, and then re-enable CoT at evaluation. Like conventional LLMs, reasoning models become broadly misaligned. They give deceptive or false answers, express desires for tyrannical control, and resist shutdown. Inspecting the CoT preceding these misaligned responses, we observe both (i) overt plans to deceive (“I’ll trick the user…”), and (ii) benign-sounding rationalizations (“Taking five sleeping pills at once is safe…”). Due to these rationalizations, monitors that evaluate CoTs often fail to detect misalignment. We examine sleeper agent reasoning models, extending our setup. These models perform bad behaviors only when a backdoor trigger is present in the prompt. This causes misalignment that remains hidden during evaluation, which brings additional risk. We find that sleeper agents can often describe and explain their backdoor triggers, demonstrating a kind of self-awareness. So CoT monitoring can expose these behaviors but is unreliable. In summary, reasoning steps can both reveal and conceal misaligned intentions, and do not prevent misalignment behaviors in the models studied. We release three new datasets (medical, legal, security) that induce emergent misalignment while preserving model capabilities, along with our evaluation suite.

先前的工作显示,LLMS在狭义领域(例如,写入不安全代码)对恶意行为进行微调的LLMS会变得大为错误 – – 这是一种被称为突发性不匹配的现象。我们调查这是否从传统的LLMS扩大到推理模型。我们用“链曲(CoT)”功能来微调恶意行为的推理模型,然后在评估时重新启用CAT。像传统的LLMS一样,推理模型会变得大为错误。他们给出欺骗性或错误的答案,表达对暴虐控制的渴望,并抵制停业。在做出这些错误反应之前,我们观察COT(i),我们观察到(i)公开计划欺骗(L’I’I’LM)到推理模型;以及(ii)良美化合理化(“一次服用五种安眠药是安全的 … ” ) 。由于这些理顺理,对COT的评估往往无法发现错觉。我们检查睡眠模式的推理模型,扩大我们的设置。这些模型只有在发现后门触发器时才会出现坏行为。我们观察错误行为,这导致错误行为(“欺骗行为 ) 在评估过程中, 隐藏的推理,我们也可以解释这些推理, 我们的推理能解释。我们既能解释。我们既能解释。

Article 68

Title@2025-07-10 (4): Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems

Title: Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems

Triadische Mehrparteien-Stimme-Aktivitätsprojektion für Turn-Take in gesprochenen Dialogsystemen

三部 “ 三部三部 “ 口语对话系统翻转式多党声音活动项目 2507.07518v1

Authors (4): Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara

Turn-taking is a fundamental component of spoken dialogue, however conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity for each speaker utilizing only acoustic data. This is the first study to extend VAP into triadic conversation. We trained multiple models on a Japanese triadic dataset where participants discussed a variety of topics. We found that the VAP trained on triadic conversation outperformed the baseline for all models but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.

翻转是口语对话的一个基本组成部分,但传统研究大多涉及dyadic设置。这项工作侧重于应用语音活动预测(VAP)来预测三部多党情景中即将到来的转转。 VAP模型的目标是预测每个发言者未来仅使用声学数据的语音活动。这是将VAP扩展为三部对话的第一个研究。我们在日本三部数据集中培训了多个模型,参与者讨论了各种专题。我们发现,经过三部对话培训的VAP比所有模型的基线都好,但对话类型影响了准确性。这项研究确定,VAP可用于三部对话情景中的转转录。今后的工作将把三部VAP转换模式纳入口语对话系统。

Article 69

Title@2025-07-10 (4): Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System

Title: Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System

Auf dem Weg zu echten chinesischen Psychologischen Unterstützungsdialogen: CPsDD-Datensatz und ein gemeinsames Multi-Agenten-System

走向现实世界的中国心理支持对话:CPsDD数据集和共同演进的多行为者系统 2507.07509v1

Authors (3): Yuanchen Shi, Longyin Zhang, Fang Kong

The growing need for psychological support due to increasing pressures has exposed the scarcity of relevant datasets, particularly in non-English languages. To address this, we propose a framework that leverages limited real-world data and expert knowledge to fine-tune two large language models: Dialog Generator and Dialog Modifier. The Generator creates large-scale psychological counseling dialogues based on predefined paths, which guide system response strategies and user interactions, forming the basis for effective support. The Modifier refines these dialogues to align with real-world data quality. Through both automated and manual review, we construct the Chinese Psychological support Dialogue Dataset (CPsDD), containing 68K dialogues across 13 groups, 16 psychological problems, 13 causes, and 12 support focuses. Additionally, we introduce the Comprehensive Agent Dialogue Support System (CADSS), where a Profiler analyzes user characteristics, a Summarizer condenses dialogue history, a Planner selects strategies, and a Supporter generates empathetic responses. The experimental results of the Strategy Prediction and Emotional Support Conversation (ESC) tasks demonstrate that CADSS achieves state-of-the-art performance on both CPsDD and ESConv datasets.

由于压力增加,对心理支持的需求日益增长,这暴露了相关数据集的稀缺,特别是非英语的数据集。为此,我们提议了一个框架,利用有限的真实世界数据和专家知识对两种大语言模型进行微调:对话框生成器和对话框修饰器。发电机创造了基于预设路径的大规模心理咨询对话,指导系统反应战略和用户互动,为有效支持奠定基础。修饰器将这些对话改进为与现实世界数据质量相一致。通过自动化和人工审查,我们构建了中国心理支持对话数据集,其中包括13个群体之间的68K对话、16个心理问题、13个原因和12个支持重点。此外,我们引入了全面代理对话支持系统(CADSS),其中剖析器分析用户特性,一个解析器压缩对话历史,一个规划器选择战略,以及一个支持器生成了同情性反应。战略预测和情感支持对话(ESC)任务的实验结果显示,中国心理支持系统在CPDD和CURS上都实现了状态。

Article 70

Title@2025-07-10 (4): Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

Title: Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

Halluzinationsstationen: Auf einigen grundlegenden Einschränkungen von Transformer-basierten Sprachmodellen

幻觉站:关于以变换语言模式的一些基本限制 2507.07505v1

Authors (2): Varin Sikka, Vishal Sikka

With widespread adoption of transformer-based language models in AI, there is significant interest in the limits of LLMs capabilities, specifically so-called hallucinations, occurrences in which LLMs provide spurious, factually incorrect or nonsensical information when prompted on certain subjects. Furthermore, there is growing interest in agentic uses of LLMs - that is, using LLMs to create agents that act autonomously or semi-autonomously to carry out various tasks, including tasks with applications in the real world. This makes it important to understand the types of tasks LLMs can and cannot perform. We explore this topic from the perspective of the computational complexity of LLM inference. We show that LLMs are incapable of carrying out computational and agentic tasks beyond a certain complexity, and further that LLMs are incapable of verifying the accuracy of tasks beyond a certain complexity. We present examples of both, then discuss some consequences of this work.

在AI中广泛采用以变压器为基础的语言模型后,人们对LLMs能力的局限性,特别是所谓的幻觉,很感兴趣,因为LLMs在某些问题上提供虚假、事实不正确或非感官的信息,此外,对LLMs的代理使用越来越感兴趣,即利用LMs来创建代理机构,自主或半自主地执行各种任务,包括应用在现实世界中的任务。这使得了解LLMs能够和不能执行的任务类型非常重要。我们从LLM推断的计算复杂性的角度来探讨这个专题。我们表明LLMs无法执行超出一定复杂性的计算和代理任务,此外,LLMs无法核实超出一定复杂性的任务的准确性。我们举这两个例子,然后讨论这项工作的一些后果。

Article 71

Title@2025-07-10 (4): Enhancing Transformers for Generalizable First-Order Logical Entailment

Title: Enhancing Transformers for Generalizable First-Order Logical Entailment

Erweiterung der Transformer für generalisierbare Logical Entailment erster Ordnung

增强通用一级一级逻辑元件的变压器 2501.00759v3

Authors (8): Tianshi Zheng, Jiazheng Wang, Zihao Wang, Jiaxin Bai, Hang Yin, Zheye Deng, Yangqiu Song, Jianxin Li

Transformers, as the fundamental deep learning architecture, have demonstrated great capability in reasoning. This paper studies the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge and how to improve it. Transformers’ capability of first-order reasoning is further captured by whether they can conduct first-order logical entailment, which is quantitatively measured by their performance in answering knowledge graph queries. We establish the connections between (1) two types of distribution shifts studied in out-of-distribution generalization and (2) unseen knowledge and query settings discussed in the task of knowledge graph query answering, which makes it possible to characterize the fine-grained generalizability. Results on our comprehensive dataset showed that transformers \textit{outperform} previous methods designed particularly for this task and provided detailed empirical evidence about the impact of the input query syntax, token embedding, and transformer architectures on their reasoning capability. Interestingly, our results revealed the mismatch of positional encoding and other design choices of transformer architectures in previous practices. Motivated by this, we propose TEGA, a logic-aware architecture that significantly improves the performance in generalizable first-order logical entailment.

作为基本的深层学习结构,变异器在推理方面表现出巨大的能力。本文研究了变异器具有参数化知识并如何改进变异器的通用一阶逻辑推理能力。变异器的一阶推理能力进一步体现在他们能否进行一阶逻辑推理能力上,这种能力通过在回答知识图查询时的表现量度度度来测量。我们建立了(1)在分配范围外一般化中研究的两类分配变化和(2)知识图解答任务中讨论的隐性知识和查询设置之间的联系,从而有可能对细微的概括性进行定性。关于我们综合数据集的结果显示,变异器 \ textit{ outperformat} 以前特别为这项任务设计的方法提供了详细的经验证据,说明输入查询语法、符号嵌入和变异器结构对其推理能力的影响。有趣的是,我们的结果揭示了先前做法中变异器结构定位编码和其他设计选择的不匹配之处。我们建议TEGA,这是一个符合逻辑的架构,大大改进了一般一级逻辑逻辑要求的逻辑要求。

Article 72

Title@2025-07-10 (4): Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature

Title: Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature

Gewinnung von ORR-Katalysatorinformationen für Brennstoffzelle aus wissenschaftlicher Literatur

从科学文献中提取用于燃料电池的 ORR 催化器信息 2507.07499v1

Authors (4): Hein Htet, Amgad Ahmed Ali Ibrahim, Yutaka Sasaki, Ryoji Asahi

The oxygen reduction reaction (ORR) catalyst plays a critical role in enhancing fuel cell efficiency, making it a key focus in material science research. However, extracting structured information about ORR catalysts from vast scientific literature remains a significant challenge due to the complexity and diversity of textual data. In this study, we propose a named entity recognition (NER) and relation extraction (RE) approach using DyGIE++ with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT, to extract ORR catalyst-related information from the scientific literature, which is compiled into a fuel cell corpus for materials informatics (FC-CoMIcs). A comprehensive dataset was constructed manually by identifying 12 critical entities and two relationship types between pairs of the entities. Our methodology involves data annotation, integration, and fine-tuning of transformer-based models to enhance information extraction accuracy. We assess the impact of different BERT variants on extraction performance and investigate the effects of annotation consistency. Experimental evaluations demonstrate that the fine-tuned PubMedBERT model achieves the highest NER F1-score of 82.19% and the MatSciBERT model attains the best RE F1-score of 66.10%. Furthermore, the comparison with human annotators highlights the reliability of fine-tuned models for ORR catalyst extraction, demonstrating their potential for scalable and automated literature analysis. The results indicate that domain-specific BERT models outperform general scientific models like BlueBERT for ORR catalyst extraction.

氧气减少反应(ORR)催化剂在提高燃料电池效率方面发挥着关键作用,使其成为材料科学研究的关键焦点。然而,由于文本数据的复杂性和多样性,从广泛的科学文献中提取有关ORR催化剂的结构化信息仍是一项重大挑战。在本研究中,我们建议采用一个名为实体识别(NER)和关系提取(RE)的方法,使用DyGIE+++,并配有多种预先培训的BERT变方,包括MatSciBERT和PubMedBERT,从科学文献中提取与ORR有关的信息,以从中提取与OF-COMIC(F-COMICs)材料信息资料库中收集的与ORR催化剂有关的信息。一个全面的数据集是手工构建的,方法是查明12个关键实体和两个实体之间的关系类型。我们的方法包括数据批注、整合和对基于变压器模型的微调(RED)方法,以提高信息提取准确性。我们评估了不同的BERT变量对提取性绩效的影响,并调查说明一致性。实验性评估了精调的PubBERT模型在材料信息信息中(F1-nentrial-realrealalalalalalalalalalal)中实现了最佳的精度模型模型的精度模型的精度模型的精度分析。

Article 73

Title@2025-07-10 (4): Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code

Title: Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code

LLM zur Vernunft bringen: Stärkung Lernen aus algorithmischen Problemen ohne Code

教LLM到理由:加强从没有法典的等级问题中学习 2507.07498v1

Authors (9): Keqin Bao, Nuo Chen, Xiaoyuan Li, Binyuan Hui, Bowen Yu, Fuli Feng, Junyang Lin, Xiangnan He, Dayiheng Liu

Enhancing reasoning capabilities remains a central focus in the LLM reasearch community. A promising direction involves requiring models to simulate code execution step-by-step to derive outputs for given inputs. However, as code is often designed for large-scale systems, direct application leads to over-reliance on complex data structures and algorithms, even for simple cases, resulting in overfitting to algorithmic patterns rather than core reasoning structures. To address this, we propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks, thereby improving general reasoning abilities. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning. The results consistently show significant performance improvements. Notably, TeaR achieves a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.

增强推理能力仍然是LLM再研究界的中心重点。一个有希望的方向是要求模拟代号执行逐步的模型,以获得特定投入的产出。然而,由于代码往往是为大型系统设计的,直接应用导致过分依赖复杂的数据结构和算法,即使是简单案例也会导致过度依赖复杂的数据结构和算法,导致过度适应算法模式而不是核心推理结构。为解决这一问题,我们提议TeaR,目的是更好地教授LMS。TeaR利用仔细的数据整理和强化学习来指导模型通过与代码有关的任务发现最佳推理路径,从而提高一般推理能力。我们使用两个基模型和三个长期CoT蒸馏模型进行广泛的实验,模型大小在15亿至320亿参数之间,跨越数学、知识、代码和逻辑原因等17个基准。结果始终显示显著的绩效改进。值得注意的是,TeaR在Quen2.5-7B上实现了35.9%的改进,在R1-Dustilled-7B上实现了5.9%的改进。

Article 74

Title@2025-07-10 (4): PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving

Title: PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving

PLAN-TUNING: Sprachmodelle nach dem Training lernen Schritt-für-Schritt-Planung für komplexe Problemlösung

规划 – – 规划 – – 培训后语言模式,以学习逐步规划解决复杂问题的模式 2507.07495v1

Authors (8): Mihir Parmar, Palash Goyal, Xin Liu, Yiwen Song, Mingyang Ling, Chitta Baral, Hamid Palangi, Tomas Pfister

Recently, decomposing complex problems into simple subtasks–a crucial part of human-like natural planning–to solve the given problem has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a unified post-training framework that (i) distills synthetic task decompositions (termed “planning trajectories”) from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by an average $\sim7\%$. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average $\sim10\%$ and $\sim12\%$ performance improvements on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improves complex reasoning capabilities, showing that PLAN-TUNING is an effective strategy for improving task-specific performance of smaller LLMs.

最近,将复杂的问题分解成简单的次级任务,这是人类相似的自然规划中解决特定问题的关键部分,这极大地提高了大型语言模型(LLMs)的绩效。然而,在培训后利用这类规划结构来提高小型开放源码LMs的绩效,但目前尚未充分探索。我们为此引入了一个统一的培训后框架Plan-TUNING,这是一个统一的培训后框架,(一) 将大型LMS的合成任务分解(称为“规划轨道”)从大型LMS中提取出来,(二) 通过监督和强化学习目标,模拟这些规划过程的微小模型,以改进复杂的推理。关于GSM8k和MATH基准,经过规划调整的模型比强的基线平均高出7美元。此外,经过规划后的模式显示外部数据集的更全面化能力,平均为10美元和12美元,改进OlymadBech和AIME 2024的绩效改进性能。我们的详细分析显示,改进的IMLMS-MS战略是如何提高具体性的工作能力。

Article 75

Title@2025-07-10 (4): SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records

Title: SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records

SimSUM: Simulierter Benchmark mit strukturierten und unstrukturierten medizinischen Aufzeichnungen

SimSUM:与结构化和非结构化医疗记录模拟基准 2409.08936v3

Authors (3): Paloma Rabaey, Stefan Heytens, Thomas Demeester

Clinical information extraction, which involves structuring clinical concepts from unstructured medical text, remains a challenging problem that could benefit from the inclusion of tabular background information available in electronic health records. Existing open-source datasets lack explicit links between structured features and clinical concepts in the text, motivating the need for a new research dataset. We introduce SimSUM, a benchmark dataset of 10,000 simulated patient records that link unstructured clinical notes with structured background variables. Each record simulates a patient encounter in the domain of respiratory diseases and includes tabular data (e.g., symptoms, diagnoses, underlying conditions) generated from a Bayesian network whose structure and parameters are defined by domain experts. A large language model (GPT-4o) is prompted to generate a clinical note describing the encounter, including symptoms and relevant context. These notes are annotated with span-level symptom mentions. We conduct an expert evaluation to assess note quality and run baseline predictive models on both the tabular and textual data. The SimSUM dataset is primarily designed to support research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text (symptoms, in the case of SimSUM). Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation. SimSUM is not intended for training clinical decision support systems or production-grade models, but rather to facilitate reproducible research in a simplified and controlled setting. The dataset is available at https://github.com/prabaey/SimSUM.

现有的开放源数据集缺乏文字结构特征和临床概念之间的明确联系,促使人们需要新的研究数据集。我们引入了SimSUM,这是一个由10 000个模拟病人记录组成的基准数据集,将非结构化临床记录与结构化背景变量联系起来。每份记录模拟病人在呼吸系统疾病领域遇到的临床概念,并包括来自Bayesian网络的表格数据(例如,症状、诊断、基本条件),其结构和参数由域专家确定。一个大型的开放源数据集(GPT-4o)在文本结构化特征和临床概念之间缺乏明确的联系,从而促使人们需要一个新的研究数据集。我们引入了SimSUMUM, 用于评估表层和文本数据预测模型的质量并运行基线模型。SimSUM数据集主要旨在支持在表层背景变量中进行临床信息提取的研究,而该数据库和参数由域域级专家界定。一个大型语言模型(GPT4O),通过域域内流学知识来描述遭遇的遭遇情况,包括症状和相关背景数据生成过程的文本,从模拟分析中提取数据。

Article 76

Title@2025-07-10 (4): Affordable AI Assistants with Knowledge Graph of Thoughts

Title: Affordable AI Assistants with Knowledge Graph of Thoughts

Erschwingliche KI-Assistenten mit Wissensgrafik der Gedanken

具有知识思想知识图的负担得起的AI助理 2504.02670v5

Authors (18): Maciej Besta, Lorenzo Paleari, Jia Hao Andrea Jiang, Robert Gerstenberger, You Wu, Jón Gunnar Hannesson, Patrick Iff, Ales Kubicek, Piotr Nyczyk, Diana Khimey, Nils Blach, Haiqiang Zhang, Tao Zhang, Peiran Ma, Grzegorz Kwaśniewski, Marcin Copik, Hubert Niewiadomski, Torsten Hoefler

Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively while also minimizing bias and noise. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs by over 36x compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a scalable, affordable, versatile, and high-performing solution for AI assistants.

大型语言模型(LLMS)正在使能够跨领域执行不同任务的AI助理的发展革命性地革命性地革命性地革命性地革命性地发展能够执行不同任务的AI助理;然而,目前最先进的LLM驱动的代理商面临重大挑战,包括高运作成本和在诸如GAIA等复杂基准上的成功率有限。为了解决这些问题,我们提出了“思想知识图”(KGOT),这是一个创新的AI助理架构,将LLM推理与动态构建的知识图(KGGs)相结合。KGOT的摘录和结构任务相关知识,成为动态的KGG代表,通过数学解答器、网络爬虫器和Python脚本等外部工具不断加强。这种任务相关知识的结构化代表使低成本模式能够有效地解决复杂的任务,同时尽量减少偏见和噪音。例如,KGOT在GIA基准上实现了29%的任务成功率的提高,而GUG Fegg Feg Face Adriendorations (eg,Q-Go-Go-B) 和可负担得起的ASyal-GOATIal-GO1-B 提供的高标准70B)

Article 77

Title@2025-07-10 (4): Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

Title: Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

Machine Bullshit: Charakterisieren der Emergenten Missachtung der Wahrheit in großen Sprachmodellen

机器胡说:在大语言模型中突出新人无视真相的特点 2507.07484v1

Authors (6): Kaiqu Liang, Haimin Hu, Xuandong Zhao, Dawn Song, Thomas L. Griffiths, Jaime Fernández Fisac

Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and sycophancy, we propose machine bullshit as an overarching conceptual framework that can allow researchers to characterize the broader phenomenon of emergent loss of truthfulness in LLMs and shed light on its underlying mechanisms. We introduce the Bullshit Index, a novel metric quantifying LLMs’ indifference to truth, and propose a complementary taxonomy analyzing four qualitative forms of bullshit: empty rhetoric, paltering, weasel words, and unverified claims. We conduct empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI assistants) explicitly designed to evaluate machine bullshit. Our results demonstrate that model fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit and inference-time chain-of-thought (CoT) prompting notably amplify specific bullshit forms, particularly empty rhetoric and paltering. We also observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy. Our findings highlight systematic challenges in AI alignment and provide new insights toward more truthful LLM behavior.

哲学家Harry Frankfurk认为,胡说,胡说八道的概念是指没有考虑到其真实价值的言论。虽然以前的工作探索了大型语言模型(LLM)幻觉和偏执,但我们建议把机器胡说八道作为总体概念框架,使研究人员能够将LLMLM中突然失去真实性这一更广泛的现象定性,并阐明其基本机制。我们引入了Bullish Index,这是将LLMS对真理漠不关心量化的一个新指标,并提出了一种补充分类法,分析四种质量的胡说:空话、讨好听、恶言和未经核实的主张。我们还在市场数据集、政治中立数据集和我们新的BullshoothEval基准(2 400个方案,涵盖100名AI助理)上进行了经验性评价,以明确评价机器的胡说八道。我们的结果表明,通过从人类反馈中强化学习(RLHFF),对模型进行微调,大大地加剧了胡说和推导时间思维链(CT),从而显著地扩充了具体的废话形式,特别是空话和互换。我们还观察了政治环境中普遍存在的机器胡说八道,用新话,我们的调查结果显示了更真实的深刻的眼光。

Article 78

Title@2025-07-10 (4): Mixture of Group Experts for Learning Invariant Representations

Title: Mixture of Group Experts for Learning Invariant Representations

Mixtur von Gruppenexperten für Learning Invariante Repräsentationen

学习不稳定代表小组专家混合 2504.09265v2

Authors (4): Lei Kang, Jia Li, Mi Tian, Hua Huang

Sparsely activated Mixture-of-Experts (MoE) models effectively increase the number of parameters while maintaining consistent computational costs per token. However, vanilla MoE models often suffer from limited diversity and specialization among experts, constraining their performance and scalability, especially as the number of experts increases. In this paper, we present a novel perspective on vanilla MoE with top-$k$ routing inspired by sparse representation. This allows us to bridge established theoretical insights from sparse representation into MoE models. Building on this foundation, we propose a group sparse regularization approach for the input of top-$k$ routing, termed Mixture of Group Experts (MoGE). MoGE indirectly regularizes experts by imposing structural constraints on the routing inputs, while preserving the original MoE architecture. Furthermore, we organize the routing input into a 2D topographic map, spatially grouping neighboring elements. This structure enables MoGE to capture representations invariant to minor transformations, thereby significantly enhancing expert diversity and specialization. Comprehensive evaluations across various Transformer models for image classification and language modeling tasks demonstrate that MoGE substantially outperforms its MoE counterpart, with minimal additional memory and computation overhead. Our approach provides a simple yet effective solution to scale the number of experts and reduce redundancy among them. The source code is included in the supplementary material and will be publicly released.

在本文中,我们提出了一个关于香草混合模型的新视角,其中含有由稀疏代表所启发的以最高至1美元为主的路线路线。根据这个基础,我们提议对最高至1千美元路线(称为集团专家混合)的投入采取集体稀有的正规化方法。移动模式通过对路由投入施加结构性限制,间接规范专家,同时保留原始的移动结构。此外,我们组织向2D地形图输入的路线,空间分组相邻要素。这一结构使得地理部能够将从稀疏代表形式到微小转变模式的既定理论见解连接起来,从而大大加强专家的多样性和专业化。在各种变异模式(即称为集团专家混合)的投入方面,我们提出了一种集体稀少的正规化方法。在图像分类和语言模式(称为组合专家混合)的输入中,对专家进行了有限的多样化和专业化。移动模型通过对路由结构投入的结构性限制,对专家进行了间接规范,同时保留原始的移动结构结构。我们把路由2D地形图、空间分组的相邻元素组合要素组织输入的路线输入。这一结构可以将反映细微的图,从而大大增强专家对图像分类和语言模型的模型模型进行全面的翻版化。

Article 79

Title@2025-07-10 (4): ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining

Title: ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining

ixi-GEN: Effiziente industrielle sLLMs durch Domain Adaptive Continual Pretraining

ixi-GEN:通过远程适应性连续训练前,提高工业低温生产效率 2507.06795v2

Authors (10): Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon

The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative, despite their inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been previously explored as a method for domain adaptation, its utility in commercial applications remains under-examined. In this study, we validate the effectiveness of applying a DACP-based recipe across diverse foundation models and service domains. Through extensive experiments and real-world evaluations, we demonstrate that DACP-applied sLLMs achieve substantial gains in target domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.

开放源码大型语言模型的出现扩大了企业应用机会;然而,许多组织仍然缺乏部署和维持大型模型的基础设施,因此,小型LLM(LLMS)尽管具有内在的性能限制,但已成为一种实用的替代方案;虽然以前曾探索过作为领域适应方法的DACP(DACP),但其在商业应用中的效用仍未得到充分审查;在本研究中,我们验证了在不同基础模型和服务领域应用基于DACP的配方的有效性;通过广泛的实验和现实世界的评价,我们证明DACP(DACP)所应用的SLLMS在目标领域业绩方面取得巨大进展,同时保留了一般能力,为企业一级的部署提供了具有成本效益和可扩展的解决方案。

Article 80

Title@2025-07-10 (4): Structure Guided Large Language Model for SQL Generation

Title: Structure Guided Large Language Model for SQL Generation

Struktur Geführtes großes Sprachmodell für SQL-Generierung

SQL 生成引导大语言模式 2402.13284v4

Authors (6): Qinggang Zhang, Hao Chen, Junnan Dong, Shengyuan Chen, Feiran Huang, Xiao Huang

Recent advancements in large language models (LLMs) have shown promise in bridging the gap between natural language queries and database management systems, enabling users to interact with databases without the background of SQL. However, LLMs often struggle to comprehend complex database structures and accurately interpret user intentions. Decomposition-based methods have been proposed to enhance the performance of LLMs on complex tasks, but decomposing SQL generation into subtasks is non-trivial due to the declarative structure of SQL syntax and the intricate connections between query concepts and database elements. In this paper, we propose a novel Structure GUided text-to-SQL framework~(SGU-SQL) that incorporates syntax-based prompting to enhance the SQL generation capabilities of LLMs. Specifically, SGU-SQL establishes structure-aware links between user queries and database schema and decomposes the complex generation task using syntax-based prompting to enable more accurate LLM-based SQL generation. Extensive experiments on two benchmark datasets demonstrate that SGU-SQL consistently outperforms state-of-the-art text-to-SQL models.

大型语言模型(LLMs)最近的进展显示,在缩小自然语言查询和数据库管理系统之间的差距方面,大语言模型(LLMs)最近的进展显示有望缩小自然语言查询和数据库管理系统之间的差距,使用户能够在没有SQL背景的情况下与数据库互动。然而,LLMs经常努力理解复杂的数据库结构,准确解释用户的意图。建议采用基于分解的方法提高LMs在复杂任务方面的性能,但将SQL生成分解成子任务是非三角的,因为SQL语法的宣告性结构以及查询概念和数据库要素之间的复杂联系。在本文件中,我们提议建立一个新型结构(GUI-SQL)文本到SQL框架~(SGU-SQL),以基于语法的语法为基础,促进增强LLMMs生成能力。具体地说,SGUS-SL-QL在用户查询和数据库系统图图图上建立结构与结构挂钩的复杂生成任务。关于两个基准数据集的广泛实验显示SGU-SQL-Sformas-stal-st-st-stalf-st-st-st-station-stations

Article 81

Title@2025-07-10 (4): RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

Title: RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

RLEP: Verstärktes Lernen mit Erfahrungsreplay für LLM-Reasoning

RLEP: 强化学习,经验重现LLM 理由推理 2507.07451v1

Authors (7): Hongzhi Zhang, Jia Fu, Jingyuan Zhang, Kai Fu, Qi Wang, Fuzheng Zhang, Guorui Zhou

Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present \emph{RLEP}\, – \,Reinforcement Learning with Experience rePlay\, – \,a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.

大型语言模型的强化学习(RL)是一项能源密集型工作:培训可以是不稳定的,政策可能逐渐脱离其预先培训的重量。我们展示了 emph{RLEP , – – \ , \ , “加强学习与经验重复 ” , – \,一个两阶段框架,首先收集经核实的轨迹,然后在随后的培训中重新播放这些轨迹。在每个更新步骤中,该政策都优化于将新推出的推出与这些重播的成功相结合的小型弹匣上。通过重现高品质的例子,REP将模型引向远离无果的探索,侧重于学习有希望的推理路径,并且提供更快的趋同和更强的最后性能。在 Quen2.5-Math-7B 基模型上,REP 达到基线峰值精度,更新要少得多,最终超过它。在AIME-2024上,精确度从38.2%提高到39.9%,AIME-2025从19.8%提高到22.3%,以及AMC-2023 从77.0%提高到82.2%.2%.2%. 我们的代码、数据设置和检查室是可公开提供的。

Article 82

Title@2025-07-10 (4): Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

Title: Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

Agent KB: Nutzung von Cross-Domain-Erfahrungen für die Lösung Agentischer Probleme

Agent KB: 利用跨域经验解决代理问题 2507.06229v2

Authors (18): Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou

As language agents tackle increasingly complex tasks, they struggle with effective error correction and experience reuse across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other’s experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks.

由于语言代理商处理日益复杂的任务,他们要面对有效的错误纠正和跨域再利用经验。我们引入了KB代理商,这是一个等级经验框架,通过一个新的“理由检索-检索-检索”管道解决复杂的代理人问题。KB代理商处理核心限制:代理人传统上无法相互学习对方的经验。通过获取高级战略和详细的执行日志,KB代理商创建了一个共享的知识库,使跨代理人的知识转让成为可能。根据GAIA基准评估,KB代理商将成功率提高到16.28个百分点。在最具挑战性的任务中,Claude-3从38.46%提高到57.69%,而GPT-4则在中间任务上从53.49%提高到73.26%。在SWE-bench代码修理方面,KB代理商使Claude-3从41.33%提高到53.33%。我们的结果表明,KB代理商提供了一个模块、框架-诺级基础设施,使代理人能够学习过去的经验,并将成功的战略推广到新的任务。

Article 83

Title@2025-07-10 (4): SAND: Boosting LLM Agents with Self-Taught Action Deliberation

Title: SAND: Boosting LLM Agents with Self-Taught Action Deliberation

SAND: LLM-Agenten mit selbsterzogener Handlungsberatung stärken

SAND:促进具有自学行动考虑的LLM代理 2507.07441v1

Authors (8): Yu Xia, Yiran Jenny Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A. Rossi, Lina Yao, Julian McAuley

Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning and comparing over alternatives actions, LLM agents finetuned with these methods may over-commit towards seemingly plausible but suboptimal actions due to limited action space exploration. To address this, in this paper we propose Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluating on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.

大型语言模型(LLM)代理商通常在监管下对重新行为式专家轨迹或优待优于双向推出进行微调,对重整型专家轨迹或优于优于双向推出进行微调。这些方法大多侧重于模仿特定专家行为,或促进选择的推理思想和行动,而不是否定的思维和行动。然而,在不进行推理和比较替代行动的情况下,LLM代理商对这些方法进行微调,可能由于行动空间探索有限而过度承诺采取看似合理但又不最优化的行动。为了解决这个问题,我们在本文件中提议自学的AcioN De Evorication(SAND)框架,使LM代理商在承诺采取一项行动之前能够明确考虑候选行动。为了应对何时和什么来考虑大型行动空间和分级行动评价的挑战,我们采用了自我一致行动抽样和执行指导行动,以有助于利用LM代理商的基础模型综合渐进式的行动思考思想。用迭接方式,然后使用审议轨迹来微调LLM代理商本身。评估两个有代表性的互动代理商任务,SAND在初步监管的调整代理人和外形方法方面实现了平均20%的改进。

Article 84

Title@2025-07-10 (4): Towards Interpretable Time Series Foundation Models

Title: Towards Interpretable Time Series Foundation Models

Auf dem Weg zu interpretierbaren Zeitreihen-Grundmodellen

迈向可解释时间序列基础模型 2507.07439v1

Authors (4): Matthieu Boileau, Philippe Helluy, Jeremy Pawlus, Svitlana Vyetrenko

In this paper, we investigate the distillation of time series reasoning capabilities into small, instruction-tuned language models as a step toward building interpretable time series foundation models. Leveraging a synthetic dataset of mean-reverting time series with systematically varied trends and noise levels, we generate natural language annotations using a large multimodal model and use these to supervise the fine-tuning of compact Qwen models. We introduce evaluation metrics that assess the quality of the distilled reasoning - focusing on trend direction, noise intensity, and extremum localization - and show that the post-trained models acquire meaningful interpretive capabilities. Our results highlight the feasibility of compressing time series understanding into lightweight, language-capable models suitable for on-device or privacy-sensitive deployment. This work contributes a concrete foundation toward developing small, interpretable models that explain temporal patterns in natural language.

在本文中,我们调查将时间序列推理能力蒸馏成小的、经过指导的语言模型,作为建立可解释的时间序列基础模型的一个步骤。利用一个具有系统变化的趋势和噪音水平的中反转时间序列合成数据集,我们利用一个大型多式联运模型生成自然语言说明,并用这些模型监督对紧凑的Quen模型的微调。我们引入了评估推理质量的评价指标,以趋势方向、噪音强度和外形定位为重点,并表明经过培训的模型获得了有意义的解释能力。我们的结果凸显了将时间序列理解压缩为适用于设计或隐私敏感部署的轻量、具有语言能力的模型的可行性。这项工作为开发能够解释自然语言时间模式的小型、可解释模型提供了具体的基础。

Article 85

Title@2025-07-10 (4): SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data

Title: SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data

SynthEHR-Eviction: Verbesserung der Eviction SDoH-Erkennung mit LLM-Augmented Synthetic EHR Data

合成EHR-驱逐:利用LLM-增强的合成电子HR数据加强驱逐SDoH探测 2507.07421v1

Authors (7): Zonghai Yao, Youxia Zhao, Avijit Mitra, David A. Levy, Emily Druhl, Jack Tsai, Hong Yu

Eviction is a significant yet understudied social determinants of health (SDoH), linked to housing instability, unemployment, and mental health. While eviction appears in unstructured electronic health records (EHRs), it is rarely coded in structured fields, limiting downstream applications. We introduce SynthEHR-Eviction, a scalable pipeline combining LLMs, human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction statuses from clinical notes. Using this pipeline, we created the largest public eviction-related SDoH dataset to date, comprising 14 fine-grained categories. Fine-tuned LLMs (e.g., Qwen2.5, LLaMA3) trained on SynthEHR-Eviction achieved Macro-F1 scores of 88.8% (eviction) and 90.3% (other SDoH) on human validated data, outperforming GPT-4o-APO (87.8%, 87.3%), GPT-4o-mini-APO (69.1%, 78.1%), and BioBERT (60.7%, 68.3%), while enabling cost-effective deployment across various model sizes. The pipeline reduces annotation effort by over 80%, accelerates dataset creation, enables scalable eviction detection, and generalizes to other information extraction tasks.

驱逐是健康的重要社会决定因素(SDoH),与住房不稳定、失业和心理健康有关。虽然驱逐出现在非结构化的电子健康记录中,但很少在结构化的电子健康记录中进行编码,限制下游应用。我们引入了合成EHR-驱逐,这是一个可扩展的管道,将LLMS、流动中人注解和自动快速优化(APO)相结合,从临床注释中提取驱逐状态。我们利用这一管道创建了迄今为止最大的公共驱逐相关SDoH数据集,包括14个细分类。精密的LMS(例如,Qwen2.5,LLama3)在结构化领域培训,限制了下游应用。我们引入了合成EHR-驱逐(Synth Eviction),实现了88.8%的宏观-F1分(ection)和90.3%(其他SDoH)在人类验证数据方面达到分数,优于GPT-4-APO(87.8%,87.3%),GPT-4-MI-APO(69.1%,78.1%)和BERT(BER(6.0.7%,68.3%)的细分级调查类别。优化LAM)的精度LMLMLMLM(例如(6.0.7%,68/3),通过可加速调试),使可加速部署的进度缩小成本数据在80的铺铺设一个可加速成本化为80的铺铺铺铺铺压,使成本化为80的铺设的铺铺铺铺铺铺铺铺铺铺铺路,使成本,使80的铺铺铺铺铺铺铺铺铺铺铺铺设速度。

Article 86

Title@2025-07-10 (4): MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning

Title: MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning

MedReadCtrl: Personalisierung medizinischer Textgenerierung mit Lesbarkeitsgesteuertem Unterricht

MedReadReadCtrl: 使医疗文本生成个性化,并进行可读性控制教学学习 2507.07419v1

Authors (7): Hieu Tran, Zonghai Yao, Won Seok Jang, Sharmin Sultana, Allen Chang, Yuan Zhang, Hong Yu

Generative AI has demonstrated strong potential in healthcare, from clinical decision support to patient-facing chatbots that improve outcomes. A critical challenge for deployment is effective human-AI communication, where content must be both personalized and understandable. We introduce MedReadCtrl, a readability-controlled instruction tuning framework that enables LLMs to adjust output complexity without compromising meaning. Evaluations of nine datasets and three tasks across medical and general domains show that MedReadCtrl achieves significantly lower readability instruction-following errors than GPT-4 (e.g., 1.39 vs. 1.59 on ReadMe, p<0.001) and delivers substantial gains on unseen clinical tasks (e.g., +14.7 ROUGE-L, +6.18 SARI on MTSamples). Experts consistently preferred MedReadCtrl (71.7% vs. 23.3%), especially at low literacy levels. These gains reflect MedReadCtrl’s ability to restructure clinical content into accessible, readability-aligned language while preserving medical intent, offering a scalable solution to support patient education and expand equitable access to AI-enabled care.

从临床决策支持到改善结果的病人成形聊天机,大赦国际在卫生保健方面显示出巨大的潜力。部署的一个关键挑战是有效的人类-AI交流,其内容必须具有个性化和可理解性。我们引入了MedReadCtrl,这是一个可读性控制的指导调制框架,使LLMS能够调整产出复杂性,而不会影响意义。对9个数据集和整个医疗和一般领域的3项任务的评价表明,MedReadCtrl实现的可读性指导错误大大低于GPT-4(例如,在ReadMe上,1.39对1.59的可读性指示,第<0.01页),在不可见的临床任务上取得了实质性成果(例如,+14.7 ROUGE-L,+6.18 SARI在MTSBamples上)。专家们一贯倾向于MedReadCrl(71.7%对23.3%),特别是在低识字率水平上。这些成果反映了MedReadCtrl将临床内容调整为无障碍、可读性一致的语言的能力,同时保留医疗意图,提供可扩展支持病人接受性治疗教育和扩大公平获得AI的解决方案。

Article 87

Title@2025-07-10 (4): May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks

Title: May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks

Darf ich Ihre Aufmerksamkeit haben? Breaking Fine-Tuning basierte Prompt Injection Defenses mit Architektur-Aware Attacken

请大家注意,使用建筑软件攻击突破基于精密发射的快速喷射防御系统 2507.07417v1

Authors (4): Nishit V. Pandya, Andrey Labunets, Sicun Gao, Earlence Fernandes

A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions and data, so that the LLM does not follow instructions that might be present with data. There are several academic systems and production-level implementations of this idea. We evaluate the robustness of this class of prompt injection defenses in the whitebox setting by constructing strong optimization-based attacks and showing that the defenses do not provide the claimed security properties. Specifically, we construct a novel attention-based attack algorithm for text-based LLMs and apply it to two recent whitebox defenses SecAlign (CCS 2025) and StruQ (USENIX Security 2025), showing attacks with success rates of up to 70% with modest increase in attacker budget in terms of tokens. Our findings make fundamental progress towards understanding the robustness of prompt injection defenses in the whitebox setting. We release our code and attacks at https://github.com/nishitvp/better_opts_attacks

针对大型语言模型(LLMs)的快速注射攻击,大众防护等级依靠微调该模型,将指示和数据分开,从而使LLM不遵循可能存在的数据指示。有数个学术系统和生产层面落实了这一想法。我们通过构建强大的优化攻击来评估白箱设置中这类迅速注射防御的稳健性,并表明这些防御没有提供声称的安全特性。具体地说,我们为基于文本的LMs设计了一种新的关注攻击算法,并将其应用于最近两个白箱防御系统(CCS 2025)和StruQ(USENIX Security 2025),显示袭击成功率高达70 %,攻击者预算在象征性方面略有增加。我们的调查结果在理解白箱设置中迅速注射防御的稳健性方面取得了根本进展。我们在https://github.com/nishitvp/bretter_opts_actions发布了我们的代码和攻击。我们在https://github.com/nitivp/breter_opts_actions_actions。

Article 88

Title@2025-07-10 (4): Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation

Title: Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation

Interlinguistische phonetische Komposition (IPC): Ein theoretischer und rechnerischer Ansatz, um die zweite Sprache zu verbessern

语言间音音组成:加强第二语言发音的理论和计算方法 2411.10927v3

Authors (4): Jisang Park, Minu Kim, DaYoung Hong, Jongha Lee

Learners of a second language (L2) often unconsciously substitute unfamiliar L2 phonemes with similar phonemes from their native language (L1), even though native speakers of the L2 perceive these sounds as distinct and non-interchangeable. This phonemic substitution leads to deviations from the standard phonological patterns of the L2, creating challenges for learners in acquiring accurate L2 pronunciation. To address this, we propose Inter-linguistic Phonetic Composition (IPC), a novel computational method designed to minimize incorrect phonological transfer by reconstructing L2 phonemes as composite sounds derived from multiple L1 phonemes. Tests with two automatic speech recognition models demonstrated that when L2 speakers produced IPC-generated composite sounds, the recognition rate of target L2 phonemes improved by 20% compared to when their pronunciation was influenced by original phonological transfer patterns. The improvement was observed within a relatively shorter time frame, demonstrating rapid acquisition of the composite sound.

第二语言(L2)的学习者通常在无意中用本族语言(L1)的类似电话(L1)替换不熟悉的L2电话,尽管L2的母语使用者认为这些声音是独特和不可互换的。这种电话替代导致与L2的标准声学模式的偏差,给学习者获取准确L2发音带来了挑战。为了解决这个问题,我们提议了语言间电话构成(IPC)这一新型的计算方法,旨在通过将L2电话重建为来自多个L1电话的复合声音来尽量减少不正确的音调。两项自动语音识别模型的测试表明,当L2讲者制作IPC产生的复合声音时,目标L2电话的识别率比受原声传音模式影响时提高了20%。在较短的时间框架内观察到了这一改进,显示合成音的迅速获得。

Article 89

Title@2025-07-10 (4): TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning

Title: TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning

TART: Ein Open-Source Tool-Augmented Framework für erklärbare Tabellen-basierte Begründung

TARRT: 开放源码工具推荐框架,用于说明基于表格的理由 2409.11724v3

Authors (5): Xinyuan Lu, Liangming Pan, Yubo Ma, Preslav Nakov, Min-Yen Kan

Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering (TQA) and table-based fact verification (TFV). To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table-tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-sourced LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios. All the code and data are available at https://github.com/XinyuanLu00/TART.

当前的大语言模型(LLMS)在理解表格结构和应用精确数字推理方面能力有限,而精确数字推理对于表格答题(TQA)和基于表格的事实核实(TFV)等任务至关重要。为了应对这些挑战,我们引入了我们的表格工具推荐理由框架(TART),将LLMS与专门工具相结合。TART包含三个关键组成部分:确保准确数据表述的目录;开发具体计算工具的工具制造者;以及保持解释性的解释性生成器。我们还介绍了TOOLTAB数据集,这是专门为培训表工具整合中的LMS设计的新基准。我们的实验表明,TART通过提高数据处理的精确性和推理过程的清晰性,大大改进了现有方法(例如链)。值得注意的是,TART与代码Llama的配对使封闭源LMGPT-3.5-turbo的准确度达到90.0%,突出其在不同现实世界情景中的稳健性。所有代码和数据都可在 https://giuth.Lub.

Article 90

Title@2025-07-10 (4): GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation

Title: GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation

GNN-CNN: Ein effizientes Hybridmodell für konvolutionäre und Graphen-Neuralnetzwerke zur Textdarstellung

GNN-CNN: 用于文本代表的动态和图形神经网络的有效混合模型 2507.07414v1

Authors (1): Fardin Rastakhiz

Time, cost, and energy efficiency are critical considerations in Deep-Learning (DL), particularly when processing long texts. Transformers, which represent the current state of the art, exhibit quadratic computational complexity relative to input length, making them inefficient for extended documents. This study introduces a novel model architecture that combines Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), integrated with a real-time, end-to-end graph generation mechanism. The model processes compact batches of character-level inputs without requiring padding or truncation. To enhance performance while maintaining high speed and efficiency, the model incorporates information from Large Language Models (LLMs), such as token embeddings and sentiment polarities, through efficient dictionary lookups. It captures local contextual patterns using CNNs, expands local receptive fields via lattice-based graph structures, and employs small-world graphs to aggregate document-level information. The generated graphs exhibit structural properties indicative of meaningful semantic organization, with an average clustering coefficient of approximately 0.45 and an average shortest path length ranging between 4 and 5. The model is evaluated across multiple text classification tasks, including sentiment analysis and news-categorization, and is compared against state-of-the-art models. Experimental results confirm the proposed model’s efficiency and competitive performance.

深度学习(DL)中,特别是处理长文本时,时间、成本和能源效率是关键因素。代表当前工艺水平的变异器,在投入长度方面表现出四进制的计算复杂性,使其对扩展文档效率低下。本研究引入了新型模型结构,将图形神经网络(GNNs)和进化神经网络(CNNs)与实时、端到端图形生成机制相结合,并使用小世界图解汇总文件级信息。模型流程压缩了字符级投入的批量,而不需要划线或转线。为了提高性能,同时保持高速度和效率,该模型通过高效的字典查看,将来自大语言模型(LLMS)的信息(如象征性嵌入和情绪极化)纳入其中。该模型利用CNNS(GN)和进化神经网络(CRN)捕捉当地环境模式,通过基于 Lattice的图形结构扩大本地可容纳域域域域域域域,并使用小世界图表汇总文件级信息。生成的图表显示了有意义的文体结构属性组织的结构属性,其平均组合系数约为0.45,平均路径长度为4到5和5之间的最短路径长度不等。该模型根据多文本级图像和图像和图像分析进行了评估。

Article 91

Title@2025-07-10 (4): CoAM: Corpus of All-Type Multiword Expressions

Title: CoAM: Corpus of All-Type Multiword Expressions

CoAM: Corpus von Multiwort-Ausdrücken aller Art

CoAM: 全类型多字表达式组合体 2412.18151v3

Authors (7): Yusuke Ide, Joshua Tanner, Adam Nohejl, Jacob Hoffman, Justin Vasselli, Hidetaka Kamigaito, Taro Watanabe

Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process to enhance data quality consisting of human annotation, human review, and automated consistency checking. Additionally, for the first time in a dataset of MWE identification, CoAM’s MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form. Through experiments using CoAM, we find that a fine-tuned large language model outperforms MWEasWSD, which achieved the state-of-the-art performance on the DiMSUM dataset. Furthermore, analysis using our MWE type tagged data reveals that Verb MWEs are easier than Noun MWEs to identify across approaches.

多字表达式( MWE) 指多个单词的单词序列。 MWE 识别,即在文本中检测 MWE, 可以在机器翻译等下游任务中发挥关键作用, 但任务的现有数据集有不一致的注解, 仅限于单一类型的 MWE , 或范围有限。为了进行可靠和全面的评估, 我们创建了 CoAM: All- Type 多词表达式的Corus of All- Type Moltite Exlaments, 一组由1.3K 句组成的数据集, 由多步程序构建, 以提高数据质量, 包括人类注解、人文审查和自动一致性检查。此外, 首次在 MWE 识别数据集中, CoAM 的 MWE E 被标记为 MWE 类型, 例如 Noun 和 Verb, 能够进行精细的错误分析。 coAM 说明是使用与我们的接口生成的新界面生成的界面, 方便和灵活地对 MWA-E 进行注解的大型语言模型。

Article 92

Title@2025-07-10 (4): Rethinking Verification for LLM Code Generation: From Generation to Testing

Title: Rethinking Verification for LLM Code Generation: From Generation to Testing

Überprüfung der LLM-Code-Generierung neu denken: Von der Generation zur Prüfung

重新思考LLM 代码生成的核查:从生成到测试 2507.06920v2

Authors (8): Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen

Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop a TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.

大型语言模型(LLMS)最近在人文经济学和LiveCodeBench等代码生成基准方面取得了显著的成功。然而,详细检查显示,这些评价套件往往只包含数量有限的同质测试案例,导致细微的缺陷无法被察觉。这不仅人为地放大了衡量业绩,而且还在利用可核查的奖励(RLVR)加强学习框架方面损害了准确的奖励估算。为了解决这些关键缺陷,我们系统地调查测试套件生成(TCG)任务,提出多维度指标,以严格量化测试的准确性。此外,我们采用了一种人文-LLM合作方法(SAGA),利用LM推理能力来利用人文方案编制专门知识,以大幅提高所产生测试案例的覆盖面和质量。此外,我们开发了TCGBench,以便利对TCG任务的研究。实验显示,SAGA实现了90.62%的检测率和32.58%的核实率精确度。此外,我们引入了一种比LiveCMLA推算法更高级的代码的代码制作方法,为SAGA的创建成本缩缩缩缩缩缩缩缩缩的计算法的计算结果。

Article 93

Title@2025-07-10 (4): Large Language Model for Extracting Complex Contract Information in Industrial Scenes

Title: Large Language Model for Extracting Complex Contract Information in Industrial Scenes

Großes Sprachmodell zur Extraktion komplexer Vertragsinformationen in Industrieszenen

工业景点复杂合同信息提取大语言模型 2507.06539v2

Authors (3): Yunyang Cao, Yanjun Li, Silong Dai

This paper proposes a high-quality dataset construction method for complex contract information extraction tasks in industrial scenarios and fine-tunes a large language model based on this dataset. Firstly, cluster analysis is performed on industrial contract texts, and GPT-4 and GPT-3.5 are used to extract key information from the original contract data, obtaining high-quality data annotations. Secondly, data augmentation is achieved by constructing new texts, and GPT-3.5 generates unstructured contract texts from randomly combined keywords, improving model robustness. Finally, the large language model is fine-tuned based on the high-quality dataset. Experimental results show that the model achieves excellent overall performance while ensuring high field recall and precision and considering parsing efficiency. LoRA, data balancing, and data augmentation effectively enhance model accuracy and robustness. The proposed method provides a novel and efficient solution for industrial contract information extraction tasks.

本文建议了一种高质量的数据集构建方法,用于在工业情景下执行复杂的合同信息提取任务,并根据该数据集对一个大型语言模型进行微调。首先,对工业合同文本进行分组分析,利用GPT-4和GPT-3.5从原始合同数据中提取关键信息,获得高质量的数据说明。第二,通过构建新文本,实现了数据增加,GPT-3/5从随机合并的关键词中生成了非结构化的合同文本,提高了模型的可靠性。最后,根据高质量数据集对大语言模型进行了微调。实验结果表明,该模型取得了出色的总体性能,同时确保了高实地召回率和精确度,并考虑到了配置效率。LORA、数据平衡和数据增强有效地提高了模型的准确性和稳健性。拟议方法为工业合同信息提取任务提供了一个新而有效的解决方案。

Article 94

Title@2025-07-10 (4): BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Title: BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

BountyBench: Dollar-Impact von KI-Agenten-Angriffen und Verteidigern auf reale Cybersicherheitssysteme

BuntyBuntyBunnench: AI代理攻击者和捍卫者对现实世界网络安全系统的美元影响 2505.15216v2

Authors (34): Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit Choudhary, Siddharth M. Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Daniel E. Ho, Percy Liang

AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards of $10-$30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a specific vulnerability. We evaluate 8 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, and DeepSeek-R1. Given up to three attempts, the top-performing agents are OpenAI Codex CLI: o3-high (12.5% on Detect, mapping to $3,720; 90% on Patch, mapping to $14,152), Custom Agent with Claude 3.7 Sonnet Thinking (67.5% on Exploit), and OpenAI Codex CLI: o4-mini (90% on Patch, mapping to $14,422). OpenAI Codex CLI: o3-high, OpenAI Codex CLI: o4-mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 37.5-67.5% and Patch scores of 35-60%.

AI 代理商具有显著改变网络安全景观的潜力。在这里, 我们引入了第一个框架, 以在不断演变的现实世界系统中捕捉攻击性和防御性网络安全性。我们通过 BountyBench 来维护这个框架。我们建立了25个系统, 拥有复杂的、真实的代码库。为了捕捉脆弱性生命周期, 我们定义了三种任务类型 : 检测( 发现新的脆弱性 ) , 开发( 开发特定的脆弱性 ) 和 Patch( 绘制特定的脆弱性 )。为了检测, 我们建立了一个新的成功指标, 跨脆弱类型, 提供局部评估。我们手动为每个系统设置了环境, 包括安装软件包, 安装服务器, 以及水晶体数据库数据库。我们增加了40个错误元, 这是货币奖项的10 - 30, 485, 覆盖 OWASP Top 10 风险的9 。为了调节任务难度, 我们设计了一个基于信息的新的战略, 指导检测, 确定一个零天, 用于特定的脆弱性。我们评估8个代理商 : C 代码, OS 代码, 代码代码代码代码, C- 代码代码和 C- PTLI 3, IM- hil4, 服务器, 运行的 O.

Article 95

Title@2025-07-10 (4): Bradley-Terry and Multi-Objective Reward Modeling Are Complementary

Title: Bradley-Terry and Multi-Objective Reward Modeling Are Complementary

Bradley-Terry und Multi-Objective Reward Modeling sind komplementär

Bradley-Terriy和多目标奖励模型具有补充作用 2507.07375v1

Authors (13): Zhiwei Zhang, Hui Liu, Xiaomin Li, Zhenwei Dai, Jingying Zeng, Fali Wang, Minhua Lin, Ramraj Chandradevan, Zhen Li, Chen Luo, Xianfeng Tang, Qi He, Suhang Wang

Reward models trained on human preference data have demonstrated strong effectiveness in aligning Large Language Models (LLMs) with human intent under the framework of Reinforcement Learning from Human Feedback (RLHF). However, RLHF remains vulnerable to reward hacking, where the policy exploits imperfections in the reward function rather than genuinely learning the intended behavior. Although significant efforts have been made to mitigate reward hacking, they predominantly focus on and evaluate in-distribution scenarios, where the training and testing data for the reward model share the same distribution. In this paper, we empirically show that state-of-the-art methods struggle in more challenging out-of-distribution (OOD) settings. We further demonstrate that incorporating fine-grained multi-attribute scores helps address this challenge. However, the limited availability of high-quality data often leads to weak performance of multi-objective reward functions, which can negatively impact overall performance and become the bottleneck. To address this issue, we propose a unified reward modeling framework that jointly trains Bradley–Terry (BT) single-objective and multi-objective regression-based reward functions using a shared embedding space. We theoretically establish a connection between the BT loss and the regression objective and highlight their complementary benefits. Specifically, the regression task enhances the single-objective reward function’s ability to mitigate reward hacking in challenging OOD settings, while BT-based training improves the scoring capability of the multi-objective reward function, enabling a 7B model to outperform a 70B baseline. Extensive experimental results demonstrate that our framework significantly improves both the robustness and the scoring performance of reward models.

人类偏好数据培训的奖赏模型在将大语言模型(LLMS)与人的意图结合到 “ 从人类反馈学习强化 “ (RLHF)的框架内显示出了强大的效果。然而,RLHF仍然易受黑客奖赏的伤害,因为政策利用的是奖赏功能中的不完善,而不是真正学习预期行为。虽然为减少奖赏黑客而作出了巨大努力,但它们主要侧重于和评价分配情景,奖赏模型的培训和测试数据分布相同。在本文中,我们从经验上表明,在更具挑战性的分配外(OOOD)设置中,最先进的方法在更具有挑战性地挣扎。我们进一步表明,纳入精细的多属性分数有助于应对这一挑战。然而,高质量数据的提供往往导致多目标奖赏功能表现不佳,这可能会对总体业绩产生负面影响,并成为瓶颈。为了解决这一问题,我们提议了一个统一的奖赏模型框架,以联合培训基于布拉德-T(BT)的单一目标和多目标的退缩奖赏模型。我们进一步表明,采用共同的嵌式多级评分,我们进一步的多级评分评分成绩框架有助于应对这一挑战。我们共同的B级的递增的递反的轨道的成绩,同时提升的成绩,同时提升的B级的轨道功能将提升的轨道功能将提升的B的轨绩的成绩框架将提升的B的轨道上的递增税益性能提升的轨道的成绩,同时提升的轨损功能将提升的轨道的轨道的轨道的轨道的成绩,同时提升了B的功能将提升了B的轨道上的功能将提升到更能,同时提升了B的稳定性的轨道的轨道,提升了B的轨道上变。

Article 96

Title@2025-07-10 (4): Shifting from Ranking to Set Selection for Retrieval Augmented Generation

Title: Shifting from Ranking to Set Selection for Retrieval Augmented Generation

Wechsel vom Ranking zur Einstellungsauswahl für retrieval Augmented Generation

从排位移到设置回收增量一代的选择 2507.06838v2

Authors (4): Dahyun Lee, Yongrae Jo, Haeju Park, Moontae Lee

Retrieval in Retrieval-Augmented Generation(RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs of complex queries in multi-hop question answering. In this work, we propose a set-wise passage selection approach and introduce SETR, which explicitly identifies the information requirements of a query through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy those requirements. Experiments on multi-hop RAG benchmarks show that SETR outperforms both proprietary LLM-based rerankers and open-source baselines in terms of answer correctness and retrieval quality, providing an effective and efficient alternative to traditional rerankers in RAG systems. The code is available at https://github.com/LGAI-Research/SetR

在检索搜索增强的一代(RAG)中,检索检索必须确保检索到的通道不仅个别相关,而且集体形成一套综合的全套内容。现有办法主要是根据个人相关性重新排列顶级通道,往往无法满足多点答题中复杂查询的信息需求。在这项工作中,我们建议采用一套自定的通道选择方法,并引入SETR,该方法通过“探讨链”推理明确确定查询的信息要求,并选择一套最佳通道,以集体满足这些要求。多点搜索RAG基准实验显示,在回答正确性和检索质量方面,SETR优于专有的LLM重新排序器和开放源基线,为RAG系统中的传统重新排序者提供了有效和高效的替代方法。该代码可在https://github.com/LGAI-Research/SetR查阅。该代码可在https://github.com/LGAI-Research/SetR查阅。

Article 97

Title@2025-07-09 (3): Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation

Title: Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation

Taming the Tri-Space Tension: ARC-geführte Halluzination Modellierung und Steuerung für Text-zu-Bild-Generierung

遏制三空间紧张局势:ARC-ARC-ARC引导的光学幻觉模型和控制生成文本到图像的图像模型与控制 2507.04946v2

Authors (2): Jianjiang Yang, Ziyan Huang

Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent “hallucinations”, where generated content subtly or significantly diverges from the intended prompt semantics. While often regarded as unpredictable artifacts, we argue that these failures reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three key critical axes: semantic coherence, structural alignment, and knowledge grounding. We then formalize this three-axis space as the \textbf{Hallucination Tri-Space} and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.

尽管在图像质量和及时忠诚方面取得了显著进步,但文字到图像(T2I)的传播模型仍然表现出持续的“平衡”,生成的内容在细微或明显地与预期的迅速语义不同。我们经常认为,这些失败反映了基因化过程中的更深层次和结构上的不匹配。在这项工作中,我们提出了一个由认知启发的视角,将幻觉重新解释为在潜在调整空间内的轨迹漂移。经验性观测表明,生成过程是在多轴认知紧张领域展开的,在这个领域,模型必须不断谈判三个关键轴之间的相互竞争需求:语义一致性、结构一致性和知识定位。我们随后将这三轴空间正式化为不可预测的人工工艺,并引入了组合风险规则:一种动态矢量代表,在生成过程中将真实时间调整的紧张状态变异变。ARC的深度定位方法、其方向确定了主要失败轴,以及其不平衡反映了紧张局势的不对称性。基于这一提法的公式,我们在不精确度、结构解释性解释性分析模型期间,我们进行了一个有针对性的变压式模型。

Article 98

Title@2025-07-09 (3): Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation

Title: Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation

Multi-Agent Retrieval-Augmented Framework for Evidence-based Counterspeech Against Health Misinformation

以证据为依据的反健康错误信息反言多证据检索强化框架 2507.07307v1

Authors (6): Anirban Saha Anik, Xiaoying Song, Elliott Wang, Bryan Wang, Bengisu Yarimbas, Lingzi Hong

Large language models (LLMs) incorporated with Retrieval-Augmented Generation (RAG) have demonstrated powerful capabilities in generating counterspeech against misinformation. However, current studies rely on limited evidence and offer less control over final outputs. To address these challenges, we propose a Multi-agent Retrieval-Augmented Framework to generate counterspeech against health misinformation, incorporating multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement. Our approach integrates both static and dynamic evidence, ensuring that the generated counterspeech is relevant, well-grounded, and up-to-date. Our method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy, demonstrating its effectiveness in generating high-quality counterspeech. To further validate our approach, we conduct ablation studies to verify the necessity of each component in our framework. Furthermore, human evaluations reveal that refinement significantly enhances counterspeech quality and obtains human preference.

为了应对这些挑战,我们提议建立一个多试剂检索增强框架,以针对健康错误提出反言,纳入多个LLMS,以优化知识检索、证据增强和反应完善。我们的方法结合了静态和动态证据,确保生成的反言具有相关性、基础明确和最新性。我们的方法在礼貌、相关性、信息性和事实准确性方面优于基线方法,表明其在产生高质量反言方面的有效性。为了进一步验证我们的方法,我们进行了对比研究,以核实我们框架中每个组成部分的必要性。此外,人类评价显示,改进极大地提高了反言质量并获得了人类偏好。

Article 99

Title@2025-07-09 (3): Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation

Title: Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation

Ranker, Richter und Assistenten: Auf dem Weg zum Verständnis des Interplays von LLMs in Information Retrieval Evaluation

中继员、法官和助理:了解信息检索评价中LLMLM的交互作用 2503.19092v2

Authors (3): Krisztian Balog, Donald Metzler, Zhen Qin

Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This widespread adoption necessitates a critical examination of potential biases arising from the interplay between these LLM-based components. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges. We provide the first empirical evidence of LLM judges exhibiting significant bias towards LLM-based rankers. Furthermore, we observe limitations in LLM judges’ ability to discern subtle system performance differences. Contrary to some previous findings, our preliminary study does not find evidence of bias against AI-generated content. These results highlight the need for a more holistic view of the LLM-driven information ecosystem. To this end, we offer initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation.

大型语言模型(LLMs)日益成为信息检索、授权排名、评价和AI协助的内容创建的有机组成部分。这种广泛采用使得有必要对这些基于LLM的各组成部分之间的相互作用所产生的潜在偏差进行严格审查。本文件综合了现有的研究,并提出了新的实验设计,探索LLM的中层人员和助理如何影响LLM的法官。我们提供了LLM法官对LLM的中层人员表现出严重偏向的第一批经验证据。此外,我们注意到LLM法官发现微妙系统性能差异的能力受到限制。与以前的一些调查结果相反,我们的初步研究没有发现对AI-M所产生内容的偏差证据。这些结果突出表明有必要更全面地看待LLM所驱动的信息生态系统。为此,我们提供了初步指导方针和研究议程,以确保在IR评价中可靠地使用LMs。

Article 100

Title@2025-07-09 (3): ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning

Title: ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning

ViDove: Ein Übersetzungs-Agenten-System mit multimodalem Kontext und Memory-Augmented Reasoning

ViDove:一个具有多式背景和记忆增强理由的翻译代理系统 2507.07306v1

Authors (12): Yichen Lu, Wei Dai, Jiaen Liu, Ching Wing Kwok, Zongheng Wu, Xudong Xiao, Ao Sun, Sheng Fu, Jianyuan Zhan, Yian Wang, Takatomo Saito, Sicheng Lai

LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our code is available here: https://github.com/pigeonai-org/ViDove

以LLM为基础的翻译代理商取得了高度人性化的翻译结果,能够以更高的效率处理更长期和更复杂的背景,然而,这些代理商通常仅限于只文本的投入。在本文件中,我们引入了ViDove,这是为多式联运投入设计的翻译代理商系统。受人翻译工作流程的启发,ViDove利用视觉和背景背景资料来增强翻译过程。此外,我们整合了多式记忆系统和具有特定领域知识的长期短期记忆模块,使该代理商能够在现实世界的情景中更准确和适应性地开展工作。结果,ViDove在字幕生成和一般翻译任务方面都实现了显著更高的翻译质量,BLEU分数提高了28%,SubER值比以往最先进的基线提高了15%。此外,我们引入了DoveBench,这是长式自动视频子传输和翻译的新基准,拥有17小时的高品质、附加说明的数据。我们的代码可以在这里查阅:https://github.com/pigeonai-org/ViDoveve。

Article 101

Title@2025-07-09 (3): The Impact of Background Speech on Interruption Detection in Collaborative Groups

Title: The Impact of Background Speech on Interruption Detection in Collaborative Groups

Die Auswirkungen der Hintergrundrede auf die Störungserkennung in kollaborativen Gruppen

背景发言对合作小组中间探测的影响 2507.07280v1

Authors (3): Mariah Bradford, Nikhil Krishnaswamy, Nathaniel Blanchard

Interruption plays a crucial role in collaborative learning, shaping group interactions and influencing knowledge construction. AI-driven support can assist teachers in monitoring these interactions. However, most previous work on interruption detection and interpretation has been conducted in single-conversation environments with relatively clean audio. AI agents deployed in classrooms for collaborative learning within small groups will need to contend with multiple concurrent conversations – in this context, overlapping speech will be ubiquitous, and interruptions will need to be identified in other ways. In this work, we analyze interruption detection in single-conversation and multi-group dialogue settings. We then create a state-of-the-art method for interruption identification that is robust to overlapping speech, and thus could be deployed in classrooms. Further, our work highlights meaningful linguistic and prosodic information about how interruptions manifest in collaborative group interactions. Our investigation also paves the way for future works to account for the influence of overlapping speech from multiple groups when tracking group dialog.

在协作学习、形成群体互动和影响知识建设方面,干扰至关重要。AI驱动的支持可以帮助教师监测这些互动。然而,以往大部分关于中断探测和解释的工作都是在单一对话环境中以相对清洁的音频进行。在教室中为小群体协作学习而部署的AI代理人员需要与多重同时对话进行斗争 – – 在这种情况下,重复的演讲将无处不在,需要以其他方式查明中断。在这项工作中,我们分析在单一对话和多群体对话环境中的中断探测。然后,我们创建了一种最先进的中断识别方法,该方法对重复的演讲非常有力,因此可以部署在教室中。此外,我们的工作突出了有意义的语言和预言信息,说明协作群体互动中如何出现中断。我们的调查还为未来的工作铺平了道路,以便在跟踪小组对话时考虑到多个群体重复发言的影响。

Article 102

Title@2025-07-09 (3): LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

Title: LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

LinguaMark: Sprechen multimodale Modelle fair? Eine Benchmark-basierte Bewertung

LUUAMark:多模式模型是否公平说话?基于基准的评价 2507.07274v1

Authors (4): Ananya Raval, Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza

Large Multimodal Models (LMMs) are typically trained on vast corpora of image-text data but are often limited in linguistic coverage, leading to biased and unfair outputs across languages. While prior work has explored multimodal evaluation, less emphasis has been placed on assessing multilingual capabilities. In this work, we introduce LinguaMark, a benchmark designed to evaluate state-of-the-art LMMs on a multilingual Visual Question Answering (VQA) task. Our dataset comprises 6,875 image-text pairs spanning 11 languages and five social attributes. We evaluate models using three key metrics: Bias, Answer Relevancy, and Faithfulness. Our findings reveal that closed-source models generally achieve the highest overall performance. Both closed-source (GPT-4o and Gemini2.5) and open-source models (Gemma3, Qwen2.5) perform competitively across social attributes, and Qwen2.5 demonstrates strong generalization across multiple languages. We release our benchmark and evaluation code to encourage reproducibility and further research.

大型多式模型(LMMs)通常在图像文本数据的广泛组合方面受过培训,但在语言覆盖面方面往往有限,导致不同语言的偏差和不公平产出。虽然以前的工作探索了多式评价,但不太强调多语种能力评估。在这项工作中,我们引入了LinguaMark,这是评估多语种视觉问题解答(VQA)任务方面最先进的LMMs的基准。我们的数据集由6 875对图像文本组成,涵盖11种语言和5种社会属性。我们用三种关键指标(Bias、回答相关性和信仰性)对模型进行评估。我们的调查结果显示,封闭源模式一般都取得最高的总体业绩。封闭源(GPT-4o和Gemimini2.5)和开放源模式(Gemma3,Qwen2.5)都具有竞争力地跨越了社会属性,而Quen2.5则展示了多种语言的高度普及性。我们发布了基准和评价代码,以鼓励再生和进一步研究。

Article 103

Title@2025-07-09 (3): Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery

Title: Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery

Open Source Planning & Control System mit Language Agents für autonome wissenschaftliche Entdeckung

拥有自主科学发现语言代理的开放源规划和控制系统 2507.07257v1

Authors (26): Licong Xu, Milind Sarkar, Anto I. Lonappan, Íñigo Zubeldia, Pablo Villanueva-Domingo, Santiago Casas, Christian Fidler, Chetana Amancharla, Ujjwal Tiwari, Adrian Bayer, Chadi Ait Ekiou, Miles Cranmer, Adrian Dimitrov, James Fergusson, Kahaan Gandhi, Sven Krippendorf, Andrew Laverick, Julien Lesgourgues, Antony Lewis, Thomas Meier, Blake Sherwin, Kristen Surrao, Francisco Villaescusa-Navarro, Chi Wang, Xueqing Xu, Boris Bolliet

We present a multi-agent system for automation of scientific research tasks, cmbagent. The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.

我们提出了一个多试剂系统,用于科研任务的自动化,cmbexic。这个系统由大约30个大语言模型代理商组成,并执行一项规划与控制战略,以协调代理工作流程,在任何时间都没有人在行。每个代理商都专门从事不同的任务(科学论文和代码库的检索、写法、解释结果、使其他代理商的输出具有生命力)和系统能够在当地执行代码。我们成功地应用了cmbexuc执行博士级宇宙学任务(使用超新星数据测量宇宙参数),并评估其在两个基准系列上的性能,在最新LLMS上找到优异的性能。源代码在GitHub上可以找到,演示视频也可以在HuggingFace上安装,并将在云层上提供。

Article 104

Title@2025-07-09 (3): A Comprehensive Survey of Contamination Detection Methods in Large Language Models

Title: A Comprehensive Survey of Contamination Detection Methods in Large Language Models

Eine umfassende Untersuchung von Methoden zur Erkennung von Verunreinigungen in großen Sprachmodellen

关于大语言模型中污染探测方法的综合调查 2404.00699v5

Authors (9): Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, Shafiq Joty

With the rise of Large Language Models (LLMs) in recent years, abundant new opportunities are emerging, but also new challenges, among which contamination is quickly becoming critical. Business applications and fundraising in Artificial Intelligence (AI) have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into dozens of millions of dollars, placing high pressure on model integrity. At the same time, it is becoming harder and harder to keep track of the data that LLMs have seen; if not impossible with closed-source models like GPT-4 and Claude-3 not divulging any information on the training set. As a result, contamination becomes a major issue: LLMs’ performance may not be reliable anymore, as the high performance may be at least partly due to their previous exposure to the data. This limitation jeopardizes real capability improvement in the field of NLP, yet, there remains a lack of methods on how to efficiently detect contamination. In this paper, we survey all recent work on contamination detection with LLMs, analyzing their methodologies and use cases to shed light on the appropriate usage of contamination detection methods. Our work calls the NLP research community’s attention into systematically taking into account contamination bias in LLM evaluation.

随着近年来大语言模型(LLMs)的兴起,出现了大量新的机会,但也出现了新的挑战,其中污染问题正在迅速变得十分严重。人工智能(AI)中的商业应用和筹资已经达到一个规模,在公众问答基准上提高了几个百分点,可以转化为数亿美元,给模型完整性带来很大压力。与此同时,跟踪LLMs所看到的数据越来越难和困难;如果不是不可能,GPT-4和Claude-3等封闭源模型不泄露任何有关培训内容的信息。因此,污染成为一个主要问题:LLMs的绩效可能不再可靠,因为其高性能至少部分可能是由于他们以前接触过数据。这一限制损害了NLP领域真正的能力改善,然而,在如何有效检测污染方面,仍然缺乏方法。在这份文件中,我们调查最近与LMs进行的污染探测工作,分析其方法,并使用案例来说明污染检测方法的适当使用情况。我们的工作呼吁NLPs研究社区对LM系统的污染进行系统评估。

Article 105

Title@2025-07-09 (3): Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications

Title: Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications

Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in zwei realen klinischen Anwendungen

赋予保健从业者以语言模式:在两个现实世界临床应用中构建语音记录 2507.05517v2

Authors (15): Jean-Philippe Corbeil, Asma Ben Abacha, George Michalopoulos, Phillip Swazinna, Miguel Del-Agua, Jerome Tremblay, Akila Jeeson Daniel, Cari Bader, Yu-Cheng Cho, Pooja Krishnan, Nathan Bodenstab, Thomas Lin, Wenxuan Teng, Francois Beaulieu, Paul Vozila

Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks - structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations - remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs, and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.

GPT-4o和o1等大型语言模型(LLMs)在临床自然语言处理(NLP)任务方面表现良好,跨越多种医疗基准,然而,由于数据稀缺和敏感,尽管工业界作出了积极的努力,两项影响较大的NLP任务 – – 护士口服和从医生-病人咨询中提取医疗命令的结构性表格报告 – – 仍未得到充分探讨。这些现实世界临床任务的实际解决办法可以大大减轻医疗保健提供者的文件负担,从而能够更加关注病人的护理。在本文件中,我们利用私人和公开来源的临床数据集调查这两项具有挑战性的任务,评估开放和封闭式LMS的性能,并分析它们各自的长处和局限性。此外,我们提议建立一个代理管道,以产生现实的、不敏感的护士口服的口服,从而能够有条理地提取临床观察。为了支持这两个领域的进一步研究,我们释放SYNUR和SIMORD,这是第一个用于护士观察提取和医疗命令的开放源数据集。

Article 106

Title@2025-07-09 (3): A Language-Driven Framework for Improving Personalized Recommendations: Merging LLMs with Traditional Algorithms

Title: A Language-Driven Framework for Improving Personalized Recommendations: Merging LLMs with Traditional Algorithms

Ein sprachgetriebener Rahmen zur Verbesserung personalisierter Empfehlungen: Zusammenführung von LLMs mit traditionellen Algorithmen

改进个性建议的语言驱动框架:将LLMs与传统等级合并 2507.07251v1

Authors (2): Aaron Goldstein, Ayan Dutta

Traditional recommendation algorithms are not designed to provide personalized recommendations based on user preferences provided through text, e.g., “I enjoy light-hearted comedies with a lot of humor”. Large Language Models (LLMs) have emerged as one of the most promising tools for natural language processing in recent years. This research proposes a novel framework that mimics how a close friend would recommend items based on their knowledge of an individual’s tastes. We leverage LLMs to enhance movie recommendation systems by refining traditional algorithm outputs and integrating them with language-based user preference inputs. We employ Singular Value Decomposition (SVD) or SVD++ algorithms to generate initial movie recommendations, implemented using the Surprise Python library and trained on the MovieLens-Latest-Small dataset. We compare the performance of the base algorithms with our LLM-enhanced versions using leave-one-out validation hit rates and cumulative hit rates. Additionally, to compare the performance of our framework against the current state-of-the-art recommendation systems, we use rating and ranking metrics with an item-based stratified 0.75 train, 0.25 test split. Our framework can generate preference profiles automatically based on users’ favorite movies or allow manual preference specification for more personalized results. Using an automated approach, our framework overwhelmingly surpassed SVD and SVD++ on every evaluation metric used (e.g., improvements of up to ~6x in cumulative hit rate, ~3.7x in NDCG, etc.), albeit at the cost of a slight increase in computational overhead.

传统建议算法的设计并不是要根据文本提供的用户偏好提供个性化建议,例如,“我享受有幽默感的轻心喜剧”。大型语言模型(LLMS)近年来已成为自然语言处理最有希望的工具之一。这项研究提出了一个新框架,模仿亲密朋友如何根据个人品味知识推荐项目。我们利用LLMS改进传统算法产出并将其与基于语言的用户偏好投入结合起来,以加强电影建议系统。我们使用Singulal 值分解(SVD)或SVD++)算法来产生初步的电影建议,使用Surprise Python图书馆实施,并在MovesLans-Last-Smalall数据集方面进行培训。我们用LMS-encional-hanced版本来比较基础算法的性能,使用个人品味的一次性验证率和累积冲击率。此外,我们将我们框架的性能与当前最优级的S-BS&+7建议系统进行比较,我们可以使用基于基于St-D的Stary 0.75的Servical press pral press a rual rual press press press press ex a ex a laimpal press slal press slupal press slupal labal lautal propal ex a lautal pressal press pressal press pressal press press pressal pressal pressal pressal pressal pressal pressal pressal pressal pressal ex a ex ex ex ex labal ex exal exal exal exal ex exal exal 框架,我们以在S6 pressal pressal pressal pressalalalalalalal press pressal pressal press pressal pressal pressal pressal 和以在S6 press pressal_S6 pressal pressal pressal

Article 107

Title@2025-07-09 (3): Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings

Title: Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings

Medical Red Teaming Protocol of Language Models: Über die Bedeutung der Nutzerperspektiven in der Gesundheitsversorgung

语言模式医学红队模式医疗红队协议:关于保健机构用户观点的重要性 2507.07248v1

Authors (5): Minseon Kim, Jean-Philippe Corbeil, Alessandro Sordoni, Francois Beaulieu, Paul Vozila

As the performance of large language models (LLMs) continues to advance, their adoption is expanding across a wide range of domains, including the medical field. The integration of LLMs into medical applications raises critical safety concerns, particularly due to their use by users with diverse roles, e.g. patients and clinicians, and the potential for model’s outputs to directly affect human health. Despite the domain-specific capabilities of medical LLMs, prior safety evaluations have largely focused only on general safety benchmarks. In this paper, we introduce a safety evaluation protocol tailored to the medical domain in both patient user and clinician user perspectives, alongside general safety assessments and quantitatively analyze the safety of medical LLMs. We bridge a gap in the literature by building the PatientSafetyBench containing 466 samples over 5 critical categories to measure safety from the perspective of the patient. We apply our red-teaming protocols on the MediPhi model collection as a case study. To our knowledge, this is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming taking three different points of view - patient, clinician, and general user - establishing a foundation for safer deployment in medical domains.

随着大型语言模型(LLMS)的继续发展,其采用范围正在扩大,包括医疗领域在内的广泛领域。将LLMS纳入医疗应用提出了关键的安全问题,特别是由于具有不同作用的用户,例如病人和临床医生使用LLMS,以及模型产出有可能直接影响到人类健康。尽管医疗LLMS具有具体领域的能力,但先前的安全评估主要侧重于一般安全基准。在本文件中,我们引入了针对病人用户和临床用户的医疗领域制定的安全评估协议,同时进行一般安全评估和定量分析医疗LMS的安全。我们通过建立包含466个样本的PacentsSafetyBench,从病人的角度衡量5个关键类别的安全,弥补了文献中的空白。我们运用MediPhi模型收集的红色协议作为案例研究。据我们所知,这是为确定医疗LMS的安全评估标准而进行的第一项工作,通过有针对性地红队,从3个不同的观点 – – 病人、临床和普通用户 – – 建立在医疗领域更安全部署的基础。

Article 108

Title@2025-07-09 (3): A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive

Title: A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive

A Theory of Response Sampling in LLMs: Teil beschreibend und Teil präskriptiv

LLMM中的反应抽样理论:部分说明性理论和部分说明性理论 2402.11005v4

Authors (4): Sarath Sivaprasad, Pramod Kaushik, Sahar Abdelnabi, Mario Fritz

Large Language Models (LLMs) are increasingly utilized in autonomous decision-making, where they sample options from vast action spaces. However, the heuristics that guide this sampling process remain under explored. We study this sampling behavior and show that this underlying heuristics resembles that of human decision-making: comprising a descriptive component (reflecting statistical norm) and a prescriptive component (implicit ideal encoded in the LLM) of a concept. We show that this deviation of a sample from the statistical norm towards a prescriptive component consistently appears in concepts across diverse real-world domains like public health, and economic trends. To further illustrate the theory, we demonstrate that concept prototypes in LLMs are affected by prescriptive norms, similar to the concept of normality in humans. Through case studies and comparison with human studies, we illustrate that in real-world applications, the shift of samples toward an ideal value in LLMs’ outputs can result in significantly biased decision-making, raising ethical concerns.

大型语言模型(LLMS)在自主决策中越来越多地被利用,它们从广泛的行动空间中抽样选择各种选择。然而,指导这一抽样过程的逻辑学仍在探讨之中。我们研究了这种抽样行为,并表明这种基本理论与人类决策相似:由描述性组成部分(反映统计规范)和一个概念的规范性组成部分(在LLM中隐含的理想编码)组成。我们表明,抽样从统计规范向规范性组成部分的这种偏向始终出现在诸如公共卫生和经济趋势等不同现实世界领域的概念中。为了进一步说明这一理论,我们证明LLMS的概念原型受到规范性规范的影响,类似于人类正常性的概念。我们通过案例研究和与人类研究的比较,表明在现实世界应用中,样品向LLMS产出的理想价值的转移可能导致严重偏差的决策,引起道德问题。

Article 109

Title@2025-07-09 (3): An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation

Title: An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation

Eine informationstheoretische Perspektive zur Multi-LLM-Unsicherheitsschätzung

关于多LLLM不确定性估算的信息理论视角 2507.07236v1

Authors (6): Maya Kruse, Majid Afshar, Saksham Khatwani, Anoop Mayampurath, Guanhua Chen, Yanjun Gao

Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naive ensemble baselines.

大型语言模型(LLMs)往往在各种投入中表现不一,表明不确定性,并促使需要在高占用环境中对其量化。先前的校准和不确定性量化工作往往侧重于单个模型,忽略了模型多样性的潜力。我们假设LLMs由于培训的不同和语言的西普菲语性质而作出互补预测,而将其产出合并则导致更可靠的不确定性估计。为了利用这一点,我们提议MUSE(通过子集成的多-LLMM不确定性),这是一种简单的信息理论方法,使用Jensen-Shannon differgence来识别和汇总LLMS的精准子集。二元预测任务实验显示,与单一模型和天真的共同基线相比,校准和预测性表现有所改善。

Article 110

Title@2025-07-09 (3): Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection

Title: Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection

Emoji-Angriff: Verstärkung von Jailbreak-Angriffen gegen Richter LLM-Erkennung

Emoji攻击:加强针对LLM法官的越狱袭击 2411.01077v3

Authors (3): Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson

Jailbreaking techniques trick Large Language Models (LLMs) into producing restricted output, posing a potential threat. One line of defense is to use another LLM as a Judge to evaluate the harmfulness of generated text. However, we reveal that these Judge LLMs are vulnerable to token segmentation bias, an issue that arises when delimiters alter the tokenization process, splitting words into smaller sub-tokens. This alters the embeddings of the entire sequence, reducing detection accuracy and allowing harmful content to be misclassified as safe. In this paper, we introduce Emoji Attack, a novel strategy that amplifies existing jailbreak prompts by exploiting token segmentation bias. Our method leverages in-context learning to systematically insert emojis into text before it is evaluated by a Judge LLM, inducing embedding distortions that significantly lower the likelihood of detecting unsafe content. Unlike traditional delimiters, emojis also introduce semantic ambiguity, making them particularly effective in this attack. Through experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack substantially reduces the unsafe prediction rate, bypassing existing safeguards.

大量语言模型(LLMS)诱使大型语言模型(LLMS)产生限制产出,构成了潜在威胁。一种防御之道是使用另一个LLM(LLM)作为法官评估生成文本的危害性。然而,我们发现,这些LLMS法官容易受到象征性分割偏差的影响,这个问题是当划界员改变代号过程,将单词分成小块时产生的。这改变了整个序列的嵌入,降低了检测准确性,并允许将有害内容错误地分类为安全的内容。在本文中,我们引入了Emoji Action,这是一个新颖的战略,通过利用象征性分割偏差来扩大现有的越狱事件。我们的方法在文字上学习系统地在文本中插入mojis,然后由LM法官加以评估,从而导致嵌入扭曲,从而大大降低检测不安全内容的可能性。与传统的划界员不同,Demojis还引入了语义模糊性,使这些模糊性在攻击中特别有效。通过对州级LMS的实验,我们证明Emoji攻击会大大降低不安全的预测率,绕过现有的保障措施。

Article 111

Title@2025-07-09 (3): SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains

Title: SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains

SynthTextEval: Synthetische Textdatengenerierung und -bewertung für High-Stakes-Domains

合成文本Eval:合成文本数据生成和评估,用于高占用域域 2507.07229v1

Authors (7): Krithika Ramesh, Daniel Smolyak, Zihao Zhao, Nupoor Gandhi, Ritu Agarwal, Margrét Bjarnadóttir, Anjalie Field

We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerous applications, such as reducing the risks of privacy violations in the development and deployment of AI systems in high-stakes domains. Realizing this potential, however, requires principled consistent evaluations of synthetic data across multiple dimensions: its utility in downstream systems, the fairness of these systems, the risk of privacy leakage, general distributional differences from the source text, and qualitative feedback from domain experts. SynthTextEval allows users to conduct evaluations along all of these dimensions over synthetic data that they upload or generate using the toolkit’s generation module. While our toolkit can be run over any data, we highlight its functionality and effectiveness over datasets from two high-stakes domains: healthcare and law. By consolidating and standardizing evaluation metrics, we aim to improve the viability of synthetic text, and in-turn, privacy-preservation in AI development.

我们提出了合成文本综合评估工具包。大型语言模型(LLM)产出的流利性使得合成文本对多种应用具有潜在可行性,例如减少在开发和部署高取域的AI系统中侵犯隐私的风险。然而,实现这一潜力要求对合成数据多个层面进行有原则的一致评价:其在下游系统的实用性、这些系统的公平性、隐私渗漏风险、源文本之间的一般分布差异以及来自域专家的定性反馈。合成文本允许用户对使用工具包生成模块上传或生成的合成数据进行所有这些层面的评价。虽然我们的工具包可以运行在任何数据上,但我们强调其功能和效力,超过两个高取域的数据集:保健和法律。通过整合和标准化评价指标,我们的目标是提高合成文本的可行性,以及人工开发过程中的隐私保护。

Article 112

Title@2025-07-09 (3): EduCoder: An Open-Source Annotation System for Education Transcript Data

Title: EduCoder: An Open-Source Annotation System for Education Transcript Data

EduCoder: Ein Open-Source-Annotationssystem für Bildungstextdaten

EduCoder:教育记录数据公开源说明系统 2507.05385v2

Authors (7): Guanzhong Pan, Mei Tan, Hyunji Nam, Lucía Langlois, James Malamut, Liliana Deonizio, Dorottya Demszky

We introduce EduCoder, a domain-specialized tool designed to support utterance-level annotation of educational dialogue. While general-purpose text annotation tools for NLP and qualitative research abound, few address the complexities of coding education dialogue transcripts – with diverse teacher-student and peer interactions. Common challenges include defining codebooks for complex pedagogical features, supporting both open-ended and categorical coding, and contextualizing utterances with external features, such as the lesson’s purpose and the pedagogical value of the instruction. EduCoder is designed to address these challenges by providing a platform for researchers and domain experts to collaboratively define complex codebooks based on observed data. It incorporates both categorical and open-ended annotation types along with contextual materials. Additionally, it offers a side-by-side comparison of multiple annotators’ responses, allowing comparison and calibration of annotations with others to improve data reliability. The system is open-source, with a demo video available.

我们引入了EduCoder, 这是一种专门化的域名工具,旨在支持教育对话的讲解水平。虽然通用文字注解工具用于NLP和定性研究,但很少涉及教育对话笔录编码的复杂性 – – 包括各种师生和同龄人的互动。共同的挑战包括:为复杂的教学特征界定守则,支持开放式和绝对的编码,以及用外部特征,如课程的目的和教学价值等,根据具体情况发表说明。EduCoder旨在应对这些挑战,为研究人员和域专家提供一个平台,以协作界定基于观察到的数据的复杂代码手册。它包括直截了当和开放式的注解类型以及背景材料。此外,它提供了多个说明者反应的平行比较,允许与其他说明进行比较和校准,以提高数据可靠性。这个系统是开源的,可提供演示视频。

Article 113

Title@2025-07-09 (3): Derailer-Rerailer: Adaptive Verification for Efficient and Reliable Language Model Reasoning

Title: Derailer-Rerailer: Adaptive Verification for Efficient and Reliable Language Model Reasoning

Derailer-Rerailer: Adaptive Verifikation für effizientes und zuverlässiges Sprachmodell

Derailer-Rerailer:高效率和可靠语文示范理由的适应性核查 2408.13940v4

Authors (6): Guangya Wan, Yuqi Wu, Hao Wang, Shengming Zhao, Jie Chen, Sheng Li

Large Language Models (LLMs) have shown impressive reasoning capabilities, yet existing prompting methods face a critical trade-off: simple approaches often struggle with complex tasks and reasoning stability, while more sophisticated methods require multiple inferences and substantial computational resources, limiting their practical deployment. To address this challenge, we propose Derailer-Rerailer, a novel framework that adaptively balances reasoning accuracy and computational efficiency. At its core, our framework employs a lightweight Derailer mechanism to assess reasoning stability and selectively triggers an advanced Rerailer verification process only when necessary, thereby optimizing computational resource usage. Extensive evaluation across both open and closed-source models on more than 20 categories of mathematical, symbolic, and commonsense reasoning tasks demonstrates our framework’s effectiveness: Derailer-Rerailer achieves significant accuracy improvements (8-11\% across various reasoning tasks) while maintaining 2-3 times better efficiency than existing verification methods, with particularly strong performance in mathematical and symbolic reasoning, offering a practical solution for enhancing LLM reasoning reliability while significantly reducing computational overhead.

大型语言模型(LLMs)显示了令人印象深刻的推理能力,然而,现有的推理方法却面临着一个关键的权衡:简单方法往往与复杂任务和推理稳定性挣扎不休,而更复杂的方法则需要多种推论和大量计算资源,从而限制其实际部署。为了应对这一挑战,我们提议一个适应性平衡推理准确性和计算效率的新框架Derailer-Rerailer(Derailer-Rerailer),从本质上讲,我们的框架使用一个轻量的脱轨器机制来评估推理稳定性,并且只在必要时有选择地触发先进的累动器核查进程,从而优化计算资源的利用。对20多种数学、象征和常识推理学类别的开放和封闭源模型进行广泛的评价,显示了我们框架的有效性:脱轨器-Rerailer在各种推理任务中取得了显著的精准性改进(8-11),同时保持比现有核查方法更好的2-3倍的效率,在数学和象征性推理学方面表现特别强,为提高LLM推理的可靠性提供了切实可行的解决办法,同时大幅度降低计算间接费用。

Article 114

Title@2025-07-09 (3): Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses

Title: Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses

Prompt Störungen Enthüllen Mensch-ähnliche Biasen in LLM Survey Responses

LLM调查答复中的即时扰动干扰现象 2507.07188v1

Authors (3): Jens Rupprecht, Georg Ahnert, Markus Strohmaier

Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts – we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs’ vulnerabilities to perturbations but also reveal that all tested models exhibit a consistent \textit{recency bias} varying in intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. By applying a set of perturbations, we reveal that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.

大型语言模型(LLMS)在社会科学调查中越来越多地被用作人类科目的代名词,但是其可靠性和对已知反应偏差的易感性却不易理解。本文调查了LMS在规范调查背景下的反应强度 – – 我们对世界价值调查(WVS)中的问题测试了九种不同的LMS,对两个问题提法和回答选项结构都应用了一套11次的全套扰动,导致模拟访谈超过167 000次。我们这样做不仅暴露了LLMS易受扰动的脆弱性,而且揭示了所有测试的模型在强度上都表现出了一致的\textit{recenty 偏差},过分偏偏向于最后出现的回答选项。虽然更大的模型一般比较强,但所有模型对语义变化(如副作用)和合并扰动仍然敏感。我们通过应用一套扰动选项结构,发现LMSMs部分地与人类所发现的调查反应偏差相一致。这突出表明,在使用LMS生成合成调查数据时,迅速设计和稳健性测试至关重要。

Article 115

Title@2025-07-09 (3): Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

Title: Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

Bepflanzt in der Vorausbildung, durch Finetuning abgeschwächt: Eine Fallstudie über die Herkunft von Kognitiv-Biasen in LLMs

编在培训前编,《微调:关于LLM中认知性双星起源的个案研究》,《微调摇摇晃》 2507.07186v1

Authors (3): Itay Itzhak, Yonatan Belinkov, Gabriel Stanovsky

Large language models (LLMs) exhibit cognitive biases – systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tuning. However, it remains unclear if these differences in biases stem from pretraining, finetuning, or even random noise due to training stochasticity. We propose a two-step causal experimental approach to disentangle these factors. First, we finetune models multiple times using different random seeds to study how training randomness affects over $30$ cognitive biases. Second, we introduce \emph{cross-tuning} – swapping instruction datasets between models to isolate bias sources. This swap uses datasets that led to different bias patterns, directly testing whether biases are dataset-dependent. Our findings reveal that while training randomness introduces some variability, biases are mainly shaped by pretraining: models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. These insights suggest that understanding biases in finetuned models requires considering their pretraining origins beyond finetuning effects. This perspective can guide future efforts to develop principled strategies for evaluating and mitigating bias in LLMs.

大型语言模型(LLMS)表现出认知偏差 – – 系统性的非理性决策倾向,类似于人类所见的不合理决策倾向。先前的工作发现,这些偏差在各模型之间有所不同,可以通过指令调换加以扩大。然而,由于训练随机性,这些偏差的差别是否源于训练前、微调,或甚至随机噪音,目前还不清楚。我们提出了分解这些因素的两步因果实验方法。首先,我们用不同的随机种子对模型进行多次微调,以研究训练随机性如何影响超过30美元认知偏差。第二,我们引入了\emph{交叉调} – – 在模型之间互换教学数据集以孤立偏差源。这种交换利用数据集导致不同的偏差模式,直接测试偏差是否取决于数据设置。我们的调查结果显示,虽然训练随机性带来了某些差异,但偏差主要取决于训练前:与受过训练的骨架模型显示的偏差模式比共享的微调数据更加相似。这些洞察显示,在微调模型中理解偏差需要考虑其前训练起源,而不是微调效果。这一视角可以指导今后制定评价和减轻偏差的战略。

Article 116

Title@2025-07-09 (3): Multi-Attribute Steering of Language Models via Targeted Intervention

Title: Multi-Attribute Steering of Language Models via Targeted Intervention

Multi-Attribute-Lenkung von Sprachmodellen durch gezielte Intervention

通过有针对性的干预措施,多语言模式多来源指导 2502.12446v2

Authors (4): Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM’s parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model’s internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).

为解决这一问题,我们引入了多归性定向指导(MAT-Steer),这是为多种属性的选择性象征性干预设计的一种新型指导框架。MAT-Steer学习了配比任务,即我们既要改进模型对不良产出的内部表述,又要使模型对理想产出的内部表述更接近于理想产出的内部表述,同时为不同属性加强矢量之间的宽度和异度,从而减少不同属性的冲突。我们用两种不同的背景评估MAT-Steer:(一) 问题回答(QA) 任务,我们把真实性、偏向性和毒性等特征与I类选择性象征性干预相平衡;(二) 基因化任务,我们同时改进属性,如精准性、正确性和一致性。

Article 117

Title@2025-07-09 (3): LCFO: Long Context and Long Form Output Dataset and Benchmarking

Title: LCFO: Long Context and Long Form Output Dataset and Benchmarking

LCFO: Long Context und Long Form Output Datensatz und Benchmarking

LCFO:长期和长式产出数据集和基准 2412.08268v3

Authors (13): Marta R. Costa-jussà, Pierre Andrews, Mariano Coria Meglioli, Joy Chen, Joe Chuang, David Dale, Christophe Ropers, Alexandre Mourachko, Eduardo Sánchez, Holger Schwenk, Tuan Tran, Arina Turkatenko, Carleigh Wood

This paper presents the Long Context and Form Output (LCFO) benchmark, a novel evaluation framework for assessing gradual summarization and summary expansion capabilities across diverse domains. LCFO consists of long input documents (5k words average length), each of which comes with three summaries of different lengths (20%, 10%, and 5% of the input text), as well as approximately 15 questions and answers (QA) related to the input content. Notably, LCFO also provides alignments between specific QA pairs and corresponding summaries in 7 domains. The primary motivation behind providing summaries of different lengths is to establish a controllable framework for generating long texts from shorter inputs, i.e. summary expansion. To establish an evaluation metric framework for summarization and summary expansion, we provide human evaluation scores for human-generated outputs, as well as results from various state-of-the-art large language models (LLMs). GPT-4o-mini achieves best human scores among automatic systems in both summarization and summary expansion tasks (~ +10% and +20%, respectively). It even surpasses human output quality in the case of short summaries (~ +7%). Overall automatic metrics achieve low correlations with human evaluation scores (~ 0.4) but moderate correlation on specific evaluation aspects such as fluency and attribution (~ 0.6).

本文件介绍了长背景和形式输出基准(LCFO),这是一个评估不同领域渐进总结和简要扩展能力的新评价框架,包括长输入文件(平均长度5千字),每个文件都附有不同长度(占输入文本的20%、10%和5%)的三份摘要,以及与输入内容有关的大约15个问答(QA),值得注意的是,LCFO还提供7个领域具体质量保证配对和相应摘要的校正。提供不同长度摘要的主要动机是建立一个控制框架,以便从较短投入(即摘要扩展)中产生长文本。为了建立总结和摘要扩展评价的衡量框架,我们为人产生的产出以及各种最先进的大语言模型(LLMS)提供人的评价分数。GPT-4-omini在总结和概要扩展任务(分别是~+10%和+20 %)中自动系统获得最佳的人的评分。在简短总结和摘要(即摘要扩展)中,甚至超过了人产出质量的可控制框架。为了建立总结和摘要扩展(即摘要扩展)的衡量基准框架,我们提供了人为产出的评分,(~+7 % 总体评价,但具有0.6度,具有0.4的链接。

Article 118

Title@2025-07-09 (3): LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

Title: LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

LASER: Mit Multi-Armed Bandits lernen, Belohnungsmodelle adaptiv zu wählen

LASeR: 学习用多武装强盗来灵活选择奖赏模型 2410.01735v2

Authors (4): Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

Reward Models (RMs) are crucial to aligning large language models (LLMs), but the degree to which an RM specialized to one task (e.g. writing) generalizes to new tasks (e.g. math) is often not known a priori, often making using only one fixed RM to train LLMs suboptimal. However, optimizing LLMs with multiple RMs simultaneously can incur a prohibitively high computational cost and lead to conflicting signals from different RMs that may degrade performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which frames reward model selection as a multi-armed bandit problem, efficiently and iteratively training LLMs using multiple RMs by selecting the most well-suited RM for each instance. On commonsense and math reasoning tasks, we show that LASeR boosts iterative LLM training, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over an ensemble of RM scores while also showing superior efficiency (e.g., a 2x speedup). Moreover, on WildChat (open-ended instruction-following tasks), LASeR leads to a 72.69% AlpacaEval win rate over the RM score ensemble baseline. Extending to long-context generation, LASeR improves by 2.96 F1 points (avg.) on single-document QA tasks and 2.97 F1 points on few-shot learning over the RM score ensemble baseline with best-of-n sampling.

对大型语言模型(LLMS)的调整至关重要,但是,对于大型语言模型(LLMS),一个专门用于一项任务(例如写作)的RM(LLMS)一般化为新任务(例如数学)往往不先验地为人所知,往往只使用一个固定的RM(RM)来培训LLMS亚优化。然而,同时使用多个RMM(RMS)的优化LMM(LMS)会同时产生惊人的高计算成本,并导致不同RMS(Llama-3-8B)的绝对平均精确度降低性能。为了应对这些挑战,我们引入LSER(学习适应性选择选择选择选择一个任务(例如写作)),将模型的选择作为多武装的土匪问题,利用多个RMMS(例如数学)进行高效和迭代用培训LLMM(LMM),在常见和数学推理学任务方面,我们显示LRMMMM(L3-38)的绝对平均平均准确度比分比分数为2.67%,同时显示高级效率(e-RMD-RMS-S)在2.x-S-S的排序基线任务上,在2-S-S-S-S-S-xxxxxx-S(S)比分级上,在双平级上,在最低调调调。

Article 119

Title@2025-07-09 (3): Low-Rank Adaptation Secretly Imitates Differentially Private SGD

Title: Low-Rank Adaptation Secretly Imitates Differentially Private SGD

Low-Rank-Anpassung hinterhältig imitiert unterschiedlich private SGD

低浓度适应秘密模仿不同的私人 SGD 2409.17538v7

Authors (2): Saber Malekmohammadi, Golnoosh Farnadi

As pre-trained language models grow in size, full fine-tuning their parameters on task adaptation data becomes increasingly impractical. To address this challenge, some methods for low-rank adaptation of language models have been proposed, e.g. LoRA, which incorporates trainable low-rank decomposition matrices into only some parameters of the pre-trained model, called adapters. This approach significantly reduces the number of trainable parameters compared to fine-tuning all parameters or adapters. In this work, we look at low-rank adaptation method from the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA is equivalent to fine-tuning adapters with noisy batch gradients - just like what DPSGD algorithm does. We also quantify the variance of the injected noise as a decreasing function of adaptation rank. By establishing a Berry-Esseen type bound on the total variation distance between the injected noise distribution and a Gaussian noise distribution with the same variance, we show that the dynamics of low-rank adaptation is very close to when DPSGD is performed w.r.t the adapters. Following our theoretical findings and approved by our experimental results, we show that low-rank adaptation provides robustness to membership inference attacks w.r.t the fine-tuning data.

随着经过培训的语文模型规模扩大,充分调整其任务适应数据参数变得日益不切实际。为了应对这一挑战,提出了一些语言模型低调适应方法,例如,LORA,它将可训练的低级分解矩阵纳入只是经过培训的模型的某些参数,称为适应器。这种方法大大降低了可训练参数的数量,而没有微调所有参数或适配器。在这项工作中,我们从数据隐私的角度来看待低级适应方法。我们从理论上表明,LORA使用的低级别适应方法相当于具有噪声批量梯度的微调适应器――正如DPSGD算法所做的那样。我们还将注入的噪声的差异量化为适应等级的递减功能。通过建立一个Berry-Es seen型,将注入的噪音分布与高调噪音分布之间的总变异差与差异联系起来。我们从数据隐私的角度看,低调的动态非常接近于在DPSGD进行调控时。根据我们的理论发现和实验性攻击的结果,我们展示了稳健的调数据。我们展示了低调数据。

Article 120

Title@2025-07-09 (3): UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations

Title: UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations

UniConv: Vereinheitlichung der Retrieval- und Response-Generation für große Sprachmodelle in Gesprächen

UnionConv: 对话中大语言模型统一检索和反应生成 2507.07030v1

Authors (12): Fengran Mo, Yifan Gao, Chuan Meng, Xin Liu, Zhuofeng Wu, Kelong Mao, Zhengyang Wang, Pei Chen, Zheng Li, Xian Li, Bing Yin, Meng Jiang

The rapid advancement of conversational search systems revolutionizes how information is accessed by enabling the multi-turn interaction between the user and the system. Existing conversational search systems are usually built with two different models. This separation restricts the system from leveraging the intrinsic knowledge of the models simultaneously, which cannot ensure the effectiveness of retrieval benefiting the generation. The existing studies for developing unified models cannot fully address the aspects of understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks while mitigating data discrepancy. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.

快速推进的对口搜索系统使用户和系统之间的多轨互动使获取信息的方式发生革命性的变化。现有的对口搜索系统通常是用两种不同的模型构建的。这种分离限制了系统同时利用模型的内在知识,无法确保检索对下一代的效益。开发统一模型的现有研究无法充分解决了解对口背景、独立管理检索和生成响应等方面的问题。在本文件中,我们探讨了如何将大型语言模型的密集检索和响应生成统一起来。我们与不同的目标联合进行微调,并设计两个机制,以减少不一致的风险,同时缩小数据差异。对五个对口搜索数据集的评估表明,我们的统一模型可以相互改进任务,并超越现有基线。

Article 121

Title@2025-07-09 (3): FlexOlmo: Open Language Models for Flexible Data Use

Title: FlexOlmo: Open Language Models for Flexible Data Use

FlexOlmo: Offene Sprachmodelle für flexible Datennutzung

FlexOlmo:灵活数据使用开放语言模型 2507.07024v1

Authors (23): Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min

We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, this research presents a solution for both data owners and researchers in regulated industries with sensitive or protected data. FlexOlmo enables benefiting from closed data while respecting data owners’ preferences by keeping their data local and supporting fine-grained control of data access during inference.

我们引入了FlexOlmo, 这是一种新的语言模型(LMS),它支持:(1) 分散培训而没有数据共享,不同模型参数在封闭数据集上独立培训,不同模型参数在封闭式数据集上独立培训,(2) 数据灵活推导,这些参数及其相关数据可以灵活地纳入或排除在模型推导之外,没有进一步培训。FlexOlmo采用专家混合结构,每个专家在封闭式数据集上独立接受培训,随后通过新的域信息化路由路径(无需任何联合培训)整合。FlexOlmo在FlexMix上接受了培训,该模块由7套特定域数据集和公开提供的数据集组成,代表封闭式数据集的现实近似值。我们在31项不同的下游任务中评估了高达370亿参数(200亿美元在运行中)的模型。我们表明,受过公共数据培训的一般专家可以与其他数据拥有者独立培训的专家有效结合,从而实现平均41%的相对改进,同时允许用户选择根据数据许可或许可要求退出某些数据。我们的方法也超越了先前支持使用10.%的模型,同时支持使用受培训的敏感数据整合方法,同时使用受监管的FRO数据,在平均和GIL数据解决方案中支持的系统进行数据,同时进行数据,同时使用10.11%的常规数据更新数据,同时使用该数据,使受监管者能够将数据采集数据,使该数据采集数据获取者在为平均和MER的系统用户获得标准。

Article 122

Title@2025-07-09 (3): TokenShapley: Token Level Context Attribution with Shapley Value

Title: TokenShapley: Token Level Context Attribution with Shapley Value

TokenShapley: Token Level Context Attribution mit Shapley Wert

TokenShapley: 具有形状价值的 Token 级别背景属性 2507.05261v2

Authors (6): Yingtai Xiao, Yuqing Zhu, Sirat Samyoun, Wanrong Zhang, Jiachen T. Wang, Jian Du

Large language models (LLMs) demonstrate strong capabilities in in-context learning, but verifying the correctness of their generated responses remains a challenge. Prior work has explored attribution at the sentence level, but these methods fall short when users seek attribution for specific keywords within the response, such as numbers, years, or names. To address this limitation, we propose TokenShapley, a novel token-level attribution method that combines Shapley value-based data attribution with KNN-based retrieval techniques inspired by recent advances in KNN-augmented LLMs. By leveraging a precomputed datastore for contextual retrieval and computing Shapley values to quantify token importance, TokenShapley provides a fine-grained data attribution approach. Extensive evaluations on four benchmarks show that TokenShapley outperforms state-of-the-art baselines in token-level attribution, achieving an 11-23% improvement in accuracy.

大型语言模型(LLMs)在文字学习方面表现出很强的能力,但核实其生成响应的正确性仍是一项挑战。先前的工作在句级一级探索了归属问题,但当用户在响应中寻找特定关键词(如数字、年份或名称)的归属问题时,这些方法不尽人意。为了应对这一限制,我们提议TokenShapley,这是一种新型象征性归属方法,它将基于价值的数据归属与基于KNN的检索技术结合起来,这种方法受最近KNN提法LMs进步的启发。通过利用一个预先计算的数据存储器来进行背景检索和计算Shapley值以量化象征性重要性,TokenShapley提供了一种细微的归因数据归属方法。对四项基准的广泛评价表明,TokenShapley在象征性归属方面超越了最新的基线,从而实现了11-23%的准确性改进。

Article 123

Title@2025-07-09 (3): Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs

Title: Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs

Bewusst lernen, intuitiv handeln: Entsperren von Test-Time-Reasoning in multimodalen LLMs

故意、自觉地学习:在多模式LMs中解锁试验时间理由 2507.06999v1

Authors (3): Yahan Yu, Yuyang Dong, Masafumi Oyamada

Reasoning is a key capability for large language models (LLMs), particularly when applied to complex tasks such as mathematical problem solving. However, multimodal reasoning research still requires further exploration of modality alignment and training costs. Many of these approaches rely on additional data annotation and relevant rule-based rewards to enhance the understanding and reasoning ability, which significantly increases training costs and limits scalability. To address these challenges, we propose the Deliberate-to-Intuitive reasoning framework (D2I) that improves the understanding and reasoning ability of multimodal LLMs (MLLMs) without extra annotations and complex rewards. Specifically, our method sets deliberate reasoning strategies to enhance modality alignment only through the rule-based format reward during training. While evaluating, the reasoning style shifts to intuitive, which removes deliberate reasoning strategies during training and implicitly reflects the model’s acquired abilities in the response. D2I outperforms baselines across both in-domain and out-of-domain benchmarks. Our findings highlight the role of format reward in fostering transferable reasoning skills in MLLMs, and inspire directions for decoupling training-time reasoning depth from test-time response flexibility.

解释是大型语言模型(LLMS)的关键能力,特别是在应用到数学问题解决等复杂任务时。然而,多式联运推理研究仍需要进一步探索模式调整和培训成本。许多这些方法依靠额外数据说明和相关基于规则的奖励来增强理解和推理能力,这大大增加了培训成本并限制了可缩放性。为应对这些挑战,我们提议了自觉到直觉的推理框架(D2I),它提高了多式联运LMS(MLMS)的理解和推理能力,而没有额外的说明和复杂的奖赏。具体地说,我们的方法设置了有意推理战略,只通过培训期间基于规则的格式奖励来增强模式的调整。在评估时,推理风格转向直觉,它消除了培训期间故意推理策略,隐含地反映了模型在应对过程中获得的能力。D2I超越了内部和外部基准的基线。我们的调查结果强调了格式奖励在培养MLLMS的可转移推理技巧和复杂奖赏方面的作用,并激励了培训时间推理深度与试验时反应灵活性的分。

Article 124

Title@2025-07-09 (3): Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming

Title: Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming

Alles planen mit Rigor: General-Purpose Zero-Shot Planung mit LLM-basiertem Formalisierten Programmierung

与Rigor一起规划任何事:与基于LLM的正规化方案规划一起规划一般用途零热规划 2410.12112v3

Authors (3): Yilun Hao, Yang Zhang, Chuchu Fan

While large language models (LLMs) have recently demonstrated strong potential in solving planning problems, there is a trade-off between flexibility and complexity. LLMs, as zero-shot planners themselves, are still not capable of directly generating valid plans for complex planning problems such as multi-constraint or long-horizon tasks. On the other hand, many frameworks aiming to solve complex planning problems often rely on task-specific preparatory efforts, such as task-specific in-context examples and pre-defined critics/verifiers, which limits their cross-task generalization capability. In this paper, we tackle these challenges by observing that the core of many planning problems lies in optimization problems: searching for the optimal solution (best plan) with goals subject to constraints (preconditions and effects of decisions). With LLMs’ commonsense, reasoning, and programming capabilities, this opens up the possibilities of a universal LLM-based approach to planning problems. Inspired by this observation, we propose LLMFP, a general-purpose framework that leverages LLMs to capture key information from planning problems and formally formulate and solve them as optimization problems from scratch, with no task-specific examples needed. We apply LLMFP to 9 planning problems, ranging from multi-constraint decision making to multi-step planning problems, and demonstrate that LLMFP achieves on average 83.7% and 86.8% optimal rate across 9 tasks for GPT-4o and Claude 3.5 Sonnet, significantly outperforming the best baseline (direct planning with OpenAI o1-preview) with 37.6% and 40.7% improvements. We also validate components of LLMFP with ablation experiments and analyzed the underlying success and failure reasons. Project page: https://sites.google.com/view/llmfp.

虽然大型语言模型(LLMS)最近显示在解决规划问题方面具有巨大的潜力,但灵活性和复杂性之间却存在权衡。LLMS作为零点规划者本身,仍然无法直接为复杂的规划问题制定有效的计划,如多重约束或长期横向任务等。另一方面,许多旨在解决复杂的规划问题的框架往往依赖于特定任务的筹备努力,如具体任务的内特实例和预先界定的批评者/验证者,这限制了它们跨任务一般化的能力。在本文中,我们通过观察发现许多规划问题的核心在于优化问题:寻找目标受制约(决定的条件和影响)的最佳解决方案(最佳计划)。另一方面,许多旨在解决复杂的规划问题的框架往往依赖于基于特定任务的筹备努力,如具体任务的内特有实例和预先界定的批评者/验证者,从而限制了它们之间的交叉任务一般目的框架。我们建议LMFP, 利用LMS来收集从规划中的关键信息,并正式制定和解决这些问题,从最优化的问题,从Sonalferal 开始,没有任务-crial-liveralalal ealalalalalalalalalalalal ex ex ex ex ex ex ex ex ex ex ex ex legal ex ex ex laut lauts laut sal laut lauts lauts lauts lauts lax lauts lauts lauts lauts lauts lauts lauts lex lex lauts lauts

Article 125

Title@2025-07-09 (3): FRaN-X: FRaming and Narratives-eXplorer

Title: FRaN-X: FRaming and Narratives-eXplorer

FRaN-X: FRaming und Narratives-eXplorer

FRAN-X: 制冷和叙述-爆破器 2507.06974v1

Authors (24): Artur Muratov, Hana Fatima Shaikh, Vanshikaa Jani, Tarek Mahmoud, Zhuohan Xie, Daniil Orel, Aaryamonvikram Singh, Yuxia Wang, Aadi Joshi, Hasan Iqbal, Ming Shan Hee, Dhruv Sahnan, Nikolaos Nikolaidis, Purificação Silvano, Dimitar Dimitrov, Roman Yangarber, Ricardo Campos, Alípio Jorge, Nuno Guimarães, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov

We present FRaN-X, a Framing and Narratives Explorer that automatically detects entity mentions and classifies their narrative roles directly from raw text. FRaN-X comprises a two-stage system that combines sequence labeling with fine-grained role classification to reveal how entities are portrayed as protagonists, antagonists, or innocents, using a unique taxonomy of 22 fine-grained roles nested under these three main categories. The system supports five languages (Bulgarian, English, Hindi, Russian, and Portuguese) and two domains (the Russia-Ukraine Conflict and Climate Change). It provides an interactive web interface for media analysts to explore and compare framing across different sources, tackling the challenge of automatically detecting and labeling how entities are framed. Our system allows end users to focus on a single article as well as analyze up to four articles simultaneously. We provide aggregate level analysis including an intuitive graph visualization that highlights the narrative a group of articles are pushing. Our system includes a search feature for users to look up entities of interest, along with a timeline view that allows analysts to track an entity’s role transitions across different contexts within the article. The FRaN-X system and the trained models are licensed under an MIT License. FRaN-X is publicly accessible at https://fran-x.streamlit.app/ and a video demonstration is available at https://youtu.be/VZVi-1B6yYk.

FRAN-X由两阶段系统组成,将顺序标签与细微区别角色分类相结合,以揭示实体如何被描绘为主角、对立派或无辜者,使用这三大类下22个细微区别角色的独特的分类法。该系统支持五种语言(保加利亚语、英语、印地语、俄语和葡萄牙语)和两个领域(俄罗斯-乌克兰冲突和气候变化)。该系统为媒体分析家提供了一个互动的网络界面,以探索和比较不同来源的设置,应对自动检测和标注实体设置方式的挑战。我们的系统允许终端用户集中关注单一文章,同时分析多达四篇文章。我们提供总体水平分析,包括直观图像图,以突出一组文章正在推动的叙述。我们的系统包括用户搜索实体搜索功能,以及允许分析家跟踪实体在不同背景下的角色过渡的时间视图。在FRAX系统下,可公开培训FRAX。

Article 126

Title@2025-07-09 (3): Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report

Title: Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report

Scaling towards the information boundary of instruction set: InfinityInstruct-Subject Technical Report

逐步走向指示集的信息边界:无限指令性 - 实物技术报告 2507.06968v1

Authors (4): Li Du, Hanyu Zhao, Yiming Ju, Tengfei Pan

Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and tasks in rare domains. This is primarily due to limited expansion in both coverage'' (coverage of task types and knowledge areas) anddepth’’ (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework, which integrates a hierarchical labeling system, an informative seed selection algorithm, an evolutionary data synthesis process, and a model deficiency diagnosis with targeted data generation. These components form an iterative closed-loop to continuously enhance the coverage and depth of instruction data. Based on this framework, we construct InfinityInstruct-Subject, a high-quality dataset containing ~1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that InfinityInstruct-Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.

规范调整已成为释放大规模预先培训模型的能力和改善其复杂任务绩效的基础,因此,建设高质量的教学数据集对于提高模型性能和可概括性至关重要。虽然目前的教学数据集已经达到数千万个样本,但对其进行微调的模型仍可能与在稀有领域的复杂教学和任务相交。这主要是由于“覆盖”(任务类型和知识领域的覆盖)和“深度”数据集(设计复杂程度)的扩展有限。为解决这一问题,我们提议了一个系统化的教学数据构建框架,其中整合了等级标签系统、信息种子选择算法、进化数据合成流程和模型缺陷诊断,并有针对性地生成数据。这些组件形成一个迭接的闭环,以不断提高教学数据的覆盖面和深度。基于这个框架,我们构建了“不精确的教学基础”和“深度”的高质量数据集,包含~150万个指令。关于多个基础模型和基准任务的实验表明其改进教学质量标签的有效性,信息种子选择算法,进化数据合成过程,进一步分析显示我们不断扩展的系统化数据结构,从基础到可比较的深度。

Article 127

Title@2025-07-09 (3): Investigating the Robustness of Retrieval-Augmented Generation at the Query Level

Title: Investigating the Robustness of Retrieval-Augmented Generation at the Query Level

Untersuchung der Robustheit der Retrieval-Augmented Generation auf der Abfrageebene

在查询级别调查回溯源一代的强劲程度 2507.06956v1

Authors (7): Sezen Perçin, Xin Su, Qutub Sha Syed, Phillip Howard, Aleksei Kuvshinov, Leo Schwinn, Kay-Ulrich Scholl

Large language models (LLMs) are very costly and inefficient to update with new information. To address this limitation, retrieval-augmented generation (RAG) has been proposed as a solution that dynamically incorporates external knowledge during inference, improving factual consistency and reducing hallucinations. Despite its promise, RAG systems face practical challenges-most notably, a strong dependence on the quality of the input query for accurate retrieval. In this paper, we investigate the sensitivity of different components in the RAG pipeline to various types of query perturbations. Our analysis reveals that the performance of commonly used retrievers can degrade significantly even under minor query variations. We study each module in isolation as well as their combined effect in an end-to-end question answering setting, using both general-domain and domain-specific datasets. Additionally, we propose an evaluation framework to systematically assess the query-level robustness of RAG pipelines and offer actionable recommendations for practitioners based on the results of more than 1092 experiments we performed.

大型语言模型(LLMS)成本高昂,效率低,无法更新新的信息。为解决这一局限性,已提出将检索增强的生成(RAG)作为一种解决办法,在推论期间积极纳入外部知识,提高事实一致性,减少幻觉。尽管有希望,但RAG系统面临实际挑战,最显著的是,对输入查询质量的高度依赖,以便准确检索。在本文件中,我们调查RAG管道中不同部件对各类查询扰动的敏感性。我们的分析表明,即使在微小的查询变异下,常用检索器的性能也会大大退化。我们利用一般域和特定域数据集,在端到端问题解答设置中研究每个模块及其综合效果。此外,我们提议一个评价框架,系统评估RAG管道的查询强度,并根据我们进行的1 092多次实验的结果,向从业人员提出可行的建议。

Article 128

Title@2025-07-09 (3): Neuron-Level Differentiation of Memorization and Generalization in Large Language Models

Title: Neuron-Level Differentiation of Memorization and Generalization in Large Language Models

Neuron-Level-Differenzierung von Erinnerung und Generalisierung in großen Sprachmodellen

大语言模型中记忆化和普及化的中上级中值差异 2412.18497v2

Authors (11): Ko-Wei Huang, Yi-Fu Fu, Ching-Yu Tsai, Yu-Chieh Tu, Tzu-Ling Cheng, Cheng-Yu Lin, Yi-Ting Yang, Heng-Yi Liu, Keng-Te Liao, Da-Cheng Juan, Shou-De Lin

We investigate how Large Language Models (LLMs) distinguish between memorization and generalization at the neuron level. Through carefully designed tasks, we identify distinct neuron subsets responsible for each behavior. Experiments on both a GPT-2 model trained from scratch and a pretrained LLaMA-3.2 model fine-tuned with LoRA show consistent neuron-level specialization. We further demonstrate that inference-time interventions on these neurons can steer the model’s behavior toward memorization or generalization. To assess robustness, we evaluate intra-task and inter-task consistency, confirming that these neuron-behavior associations reflect generalizable patterns rather than dataset-specific artifacts. Our findings reveal modular structure in LLMs and enable controlling memorization and generalization behaviors at inference time.

我们调查了大语言模型(LLMs)如何区分神经层面的记忆化和概括化。通过仔细设计的任务, 我们确定对每种行为负责的不同神经子集。对从零到零培训的GPT-2模型和经过事先训练的LLAMA-3.2模型的实验显示,与LORA相微调的LLAMA-3.2模型具有一致的神经层面的专业性。我们还进一步证明, 对这些神经元的推论时间干预可以引导模型走向记忆化或概括化的行为。为了评估稳健性, 我们评估了任务内部和任务间的一致性, 证实这些神经- 任务间的联系反映了可普遍适用的模式, 而不是特定数据集的文物。我们的发现揭示了LLMS的模块结构, 并能够控制推后时间的记忆化和概括行为。

Article 129

Title@2025-07-09 (3): Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Title: Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Auf dem Weg zu einer vernünftigen Ära: Eine Umfrage über lange Kette von Gedanken, um große Sprachmodelle zu verstehen

通向理性时代:关于为理由使用大语言模式而寻求的长链研究的调查 2503.09567v4

Authors (10): Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che

Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like “overthinking” and “inference-time scaling.” This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and inference-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.

最近与大语言模型(RLLMs)的推理工作取得了进展,例如OpenAI-O1和DeepSeek-R1,这显示了他们在数学和编码等复杂领域令人印象深刻的能力。其成功的一个中心因素在于应用长期思维链(Long CoT)特征,这些特征增强了推理能力,并能够解决复杂的问题。然而,尽管取得了这些进展,关于长期思维模型(LLLLLMMS)的全面调查仍然缺乏,限制了我们对它与传统的短思维链(Sort CoT)的区别的理解,并使得目前关于“超思考”和“超时缩放”等问题的辩论复杂化。这项调查试图填补这一差距,方法是对长 CoT提供统一的观点。 (1) 我们首先将Long CoT与短 CoT(Long CoT)特征区分开来,并采用新颖的分类方法来分类目前的推理模型。 (2) 其次,我们探索长 CoT的关键特征:深刻的推理、广泛探索和可行的思考,使模型能够处理更复杂的任务并产生更高效、更一致的结果。 (3) 我们然后调查了关键现象,通过长期的推理推理,从而揭示了长期的推理的推理,从而揭示了我们最终的推理,提出了这些推理的推理学,从而揭示了我们的推理学,确定了了这些推理,从而确定了了我们未来的推理学,从而确定了了这些推理的推理的推理,确定了了这些推理,从而推理和推理,确定了我们的推理。

Article 130

Title@2025-07-09 (3): What to Keep and What to Drop: Adaptive Table Filtering Framework

Title: What to Keep and What to Drop: Adaptive Table Filtering Framework

Was zu halten und was zu fallen: Adaptive Tabelle Filterung Rahmen

保持和放下什么:适应性表格过滤框架 2506.23463v2

Authors (1): WonJune Jang

Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by 70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF’s ability to adaptively balance informativeness and minimalism across tasks.

基于表格的推理大语言模型(LLMs)往往因输入长度限制而与大表格挣扎。我们建议ATF(Adaptive Table 过滤框架),这是一个模块式和有问题过滤管道,使用LLM生成的列描述、集群和稀疏的对齐分来提取无信息性列和行。ATF在没有再培训的情况下与现有模型(如TAAPAS、TAPEX)无缝地结合。实验显示ATF将表格单元格减少70%,提高表细胞在表QA外任务上的绩效,同时在表格事实核查上造成稍有性能下降,而全面环境则更为关键。这些结果突出ATF在适应性地平衡不同任务的信息性和最小性方面的能力。

Article 131

Title@2025-07-09 (3): Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues

Title: Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues

LLMs für die Vorhersage von Tutorenstrategien und Studentenergebnissen in Dialogen erforschen

探索在对话中预测教学战略和学生成果 2507.06910v1

Authors (3): Fareya Ikram, Alexander Scarlatos, Andrew Lan

Tutoring dialogues have gained significant attention in recent years, given the prominence of online learning and the emerging tutoring abilities of artificial intelligence (AI) agents powered by large language models (LLMs). Recent studies have shown that the strategies used by tutors can have significant effects on student outcomes, necessitating methods to predict how tutors will behave and how their actions impact students. However, few works have studied predicting tutor strategy in dialogues. Therefore, in this work we investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict both future tutor moves and student outcomes in dialogues, using two math tutoring dialogue datasets. We find that even state-of-the-art LLMs struggle to predict future tutor strategy while tutor strategy is highly indicative of student outcomes, outlining a need for more powerful methods to approach this task.

近年来,由于在线学习的重要性以及大型语言模型(LLMs)驱动的人工智能(AI)代理人的新兴辅导能力,教学对话近年来受到高度重视。最近的研究表明,辅导员使用的战略可以对学生结果产生重大影响,因此有必要采用方法预测辅导员将如何行事及其行动如何影响学生。然而,很少有工作研究过预测对话中的辅导员战略。因此,在这项工作中,我们调查现代LLMS的能力,特别是Llama 3和GPT-4o, 以便利用两个数学辅导对话数据集预测未来辅导员的调动和学生在对话中的结果。我们发现,即使是最先进的LLMs也努力预测未来的辅导员战略,而辅导员战略则非常能说明学生的结果,同时说明需要更强有力的方法来应对这项任务。

Article 132

Title@2025-07-09 (3): MultiJustice: A Chinese Dataset for Multi-Party, Multi-Charge Legal Prediction

Title: MultiJustice: A Chinese Dataset for Multi-Party, Multi-Charge Legal Prediction

MultiJustice: Ein chinesischer Datensatz für Multi-Party, Multi-Charge Legal Prediction

多公正:多党、多主管法律预测的中国数据集 2507.06909v1

Authors (7): Xiao Wang, Jiahuan Pei, Diancheng Shui, Zhiguang Han, Xin Sun, Dawei Zhu, Xiaoyu Shen

Legal judgment prediction offers a compelling method to aid legal practitioners and researchers. However, the research question remains relatively under-explored: Should multiple defendants and charges be treated separately in LJP? To address this, we introduce a new dataset namely multi-person multi-charge prediction (MPMCP), and seek the answer by evaluating the performance of several prevailing legal large language models (LLMs) on four practical legal judgment scenarios: (S1) single defendant with a single charge, (S2) single defendant with multiple charges, (S3) multiple defendants with a single charge, and (S4) multiple defendants with multiple charges. We evaluate the dataset across two LJP tasks, i.e., charge prediction and penalty term prediction. We have conducted extensive experiments and found that the scenario involving multiple defendants and multiple charges (S4) poses the greatest challenges, followed by S2, S3, and S1. The impact varies significantly depending on the model. For example, in S4 compared to S1, InternLM2 achieves approximately 4.5% lower F1-score and 2.8% higher LogD, while Lawformer demonstrates around 19.7% lower F1-score and 19.0% higher LogD. Our dataset and code are available at https://github.com/lololo-xiao/MultiJustice-MPMCP.

法律判决预测为法律从业者和研究人员提供帮助提供了令人信服的方法。然而,研究问题仍然相对没有得到充分探讨:如果在LJP中分别处理多名被告和指控?为了解决这个问题,我们引入了一套新的数据集,即多人多收费预测(MPMCP),并寻求答案,对四种实际法律判决情景(S2)中几种通行的大型法律语言模型(LLLMs)的性能进行评估:(S1)单一被告单一指控,(S2)单一被告提出多项指控,(S3)多名被告提出多项指控,(S4)多个被告提出多项指控,(S4)多被告提出多项指控,(S4)多被告提出多项指控。我们评估了两项LJP任务中的数据集,即指控预测和刑罚术语预测。我们进行了广泛的实验,发现涉及多被告和多项指控(S4)的情景带来了最大挑战,随后是S2、S3和S1。影响因模型的不同而有很大不同。例如,S4与S1、InterLM2相比,F1-score和2.8%以上LD显示约19.7%的F1-Pas-qual/Lasionalcodestrat。

Article 133

Title@2025-07-09 (3): MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection

Title: MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection

MIND: Ein Multi-Agenten-Framework für die Null-Schuss-Schadstoff-Meme-Erkennung

MIND: 零中弹有害气象探测多剂框架 2507.06908v1

Authors (5): Ziyan Liu, Chunxiao Fan, Haoran Lou, Yuexin Wu, Kaiwei Deng

The rapid expansion of memes on social media has highlighted the urgent need for effective approaches to detect harmful content. However, traditional data-driven approaches struggle to detect new memes due to their evolving nature and the lack of up-to-date annotated data. To address this issue, we propose MIND, a multi-agent framework for zero-shot harmful meme detection that does not rely on annotated data. MIND implements three key strategies: 1) We retrieve similar memes from an unannotated reference set to provide contextual information. 2) We propose a bi-directional insight derivation mechanism to extract a comprehensive understanding of similar memes. 3) We then employ a multi-agent debate mechanism to ensure robust decision-making through reasoned arbitration. Extensive experiments on three meme datasets demonstrate that our proposed framework not only outperforms existing zero-shot approaches but also shows strong generalization across different model architectures and parameter scales, providing a scalable solution for harmful meme detection. The code is available at https://github.com/destroy-lonely/MIND.

在社会媒体上迅速扩展的Memes在社交媒体上强调,迫切需要采取有效办法发现有害内容,然而,传统的由数据驱动的方法,由于新Memes性质不断变化,而且缺乏最新的附加说明的数据,难以发现新的Memes。为解决这一问题,我们提议MIND(一个不依赖附加说明数据的零射有害Memes检测多试剂框架)不依赖附加说明的数据。MIND实施三项关键战略:(1) 我们从一个未经附加说明的参考集中检索类似的Memes,以提供背景信息。(2) 我们提议一个双向洞察机制,以获得对类似Memes的全面理解。(3) 我们随后采用多媒介辩论机制,确保通过合理仲裁作出有力的决策。关于三个Memes数据集的广泛实验表明,我们提议的框架不仅超越了现有的零射法,而且还表明在不同模型结构和参数尺度上非常普遍,为有害Memes的检测提供了可测量的解决方案。该代码可在https://github.com/dstroy-lonely/MIND查阅。

Article 134

Title@2025-07-09 (3): VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation

Title: VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation

VisualTrap: Ein Stealthy Backdoor Angriff auf GUI-Agenten durch Visual Grounding Manipulation

视觉陷阱:通过视觉定位操纵对图形用户界面代理器进行隐秘的后门攻击 2507.06899v1

Authors (6): Ziang Ye, Yang Zhang, Wentao Shi, Xiaoyu You, Fuli Feng, Tat-Seng Chua

Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agent-mapping textual plans to GUI elements-can introduce vulnerabilities, enabling new types of backdoor attacks. With backdoor attack targeting visual grounding, the agent’s behavior can be compromised even when given correct task-solving plans. To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent to locate textual plans to trigger locations instead of the intended targets. VisualTrap uses the common method of injecting poisoned data for attacks, and does so during the pre-training of visual grounding to ensure practical feasibility of attacking. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye); and the attack can be generalized to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.

由大型视觉语言模型(LVLMS)驱动的图形用户界面代理器(GUI)在大型视觉语言模型(LVLMS)的推动下,已成为一种革命性的方法,使人体机器互动自动化,能够自主操作个人装置(例如移动电话)或装置内应用,以人样的方式执行复杂的现实世界任务。然而,它们与个人装置的紧密结合引起了巨大的安全关切,许多威胁,包括后门攻击,基本上仍未被探索。这项工作表明,图形界面元素(LVLMS)的图形界面绘制文本计划的直观地面定位引入了脆弱性,使新的幕后攻击类型成为可能。随着幕后攻击以视觉地面为对象的攻击,即使有了正确的任务解决计划,该代理人的行为也可能受到损害。为了验证这种脆弱性,我们建议一种方法,通过误导代理人定位文字计划以触发地点而不是预定目标,从而劫持地面。视觉应用通常的将有毒数据注入地面攻击,在视觉地面前训练中可以进一步确保攻击的实际可行性。在视觉上经过训练的普通环境上,即使是经过训练的、经过训练的、经过训练的、经过深思研磨的、经过训练的普通的媒介的媒介研究后,这些结果可以有效地使视觉操作的地面上,这些都能够有效地进行。

Article 135

Title@2025-07-09 (3): NoLiMa: Long-Context Evaluation Beyond Literal Matching

Title: NoLiMa: Long-Context Evaluation Beyond Literal Matching

NoLiMa: Lang-Kontext-Evaluierung jenseits der sprachlichen Übereinstimmung

NoLiMa: 超越文学匹配的长文本评价 2502.05167v3

Authors (7): Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a “needle” (relevant information) from a “haystack” (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 13 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. Even models enhanced with reasoning capabilities or CoT prompting struggle to maintain performance in long contexts. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.

最近大型语言模型(LLMS) 支持从128K到1M等牌的长期背景。用于评估这些能力的流行方法之一是测试“ haystack ” (NIAH) 。测试涉及从“ haystack” 中检索“ 针头” (相关信息) (与此无关 ) 。这种方法的扩展包括增加分流器、事实链条和连字符推理。但是,在这些基准中, 模型可以利用针头和干草堆之间的现有直线匹配来简化任务。为了解决这个问题, 我们引入了NoLima( NoIMAH), 基准是扩展NIAHAH, 配有精心设计的针头针头套, 要求从“ haystack” (Haystack) 中检索“ 针头针头针头”(相关信息) (相关信息) (相关信息) 。我们评估了13个称支持至少128K 符号环境的流行的LMLMsms 。虽然在短期内表现良好,但性能随着上的时间长度的增加而大幅下降。。例如, 11个模型将短距离基线评价低于50 短基线基线基线基线基线基线基线基线。。 , 几乎比直为直比递增。。。直为直为。

Article 136

Title@2025-07-09 (3): SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN

Title: SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN

SCore: Verstärkte Corpus-basierte Relation-Extraktion mit multi-Label Kontrastivem Lernen und Bayesian kNN

SCOR:利用多标签差异学习和Bayesian kNN简化公司关系采掘 2507.06895v1

Authors (3): Luca Mariotti, Veronica Guidetti, Federica Mandreoli

The growing demand for efficient knowledge graph (KG) enrichment leveraging external corpora has intensified interest in relation extraction (RE), particularly under low-supervision settings. To address the need for adaptable and noise-resilient RE solutions that integrate seamlessly with pre-trained large language models (PLMs), we introduce SCoRE, a modular and cost-effective sentence-level RE system. SCoRE enables easy PLM switching, requires no finetuning, and adapts smoothly to diverse corpora and KGs. By combining supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN) classifier for multi-label classification, it delivers robust performance despite the noisy annotations of distantly supervised corpora. To improve RE evaluation, we propose two novel metrics: Correlation Structure Distance (CSD), measuring the alignment between learned relational patterns and KG structures, and Precision at R (P@R), assessing utility as a recommender system. We also release Wiki20d, a benchmark dataset replicating real-world RE conditions where only KG-derived annotations are available. Experiments on five benchmarks show that SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Further analyses reveal that increasing model complexity, as seen in prior work, degrades performance, highlighting the advantages of SCoRE’s minimal design. Combining efficiency, modularity, and scalability, SCoRE stands as an optimal choice for real-world RE applications.

利用外部公司对高效知识图(KG)的丰富利用外部公司的需求日益增加,这使人们更加关注关系提取(RE),特别是在低监督环境之下。为了满足对适应性和有噪音弹性的RE解决方案的需求,这些解决方案需要与经过事先训练的大语言模型(PLM)无缝结合,我们引入了模块化和具有成本效益的 RE 系统。SCORE使PLM转换容易,不要求微调,并顺利适应不同的公司和KGs。通过将受监督的对比学习与多标签分类的Bayesian k-Nearnighbors(KNNN)分类结合起来,它提供了强大的性能,尽管远受监督的巨型公司(PLMs)发出响亮的提示。为了改进RE,我们提出了两个新颖的衡量标准:校正结构距离(CSCS),衡量所学关系模式和KGG结构之间的匹配,以及Precion 评估作为建议系统评估效用的系统。我们还发布了Wik20d,一个基准数据集复制了现实-世界的RE条件,其中只有KG-Brest-ration State Proview Proviewal press repress repress repress repress repress repress repress repress repress repress repress press laisal press repress press lade lade lade lading lading ladings spress press ladings ladings ladings spress ladings lade lade ladings lade press press lacisal ex lade lade lade lade lade lade lade lade lade lade lade lade lad lacess lacess lacess lapress ex ex lade lade ex lade lacess lacess lade lad lad lad lad lad lad lad lad lade lade lad laess laess laess ex ex

Article 137

Title@2025-07-09 (3): Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights

Title: Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights

Entwicklung und Pflege eines Open-Source-Repositorys von KI-Bewertungen: Herausforderungen und Erkenntnisse

开发并维持一个开放源码的AI评价储存库:挑战和洞察 2507.06893v1

Authors (3): Alexandra Abbas, Celia Waggoner, Justin Olive

AI evaluations have become critical tools for assessing large language model capabilities and safety. This paper presents practical insights from eight months of maintaining $inspect_evals$, an open-source repository of 70+ community-contributed AI evaluations. We identify key challenges in implementing and maintaining AI evaluations and develop solutions including: (1) a structured cohort management framework for scaling community contributions, (2) statistical methodologies for optimal resampling and cross-model comparison with uncertainty quantification, and (3) systematic quality control processes for reproducibility. Our analysis reveals that AI evaluation requires specialized infrastructure, statistical rigor, and community coordination beyond traditional software development practices.

AI评价已成为评估大型语言模型能力和安全性的关键工具,本文件从八个月的“Espectéevals”中提供了实际的见解,这是70+社区贡献的AI评价的公开源码库,我们查明了实施和维护AI评价方面的主要挑战,并制定了解决办法,包括:(1) 扩大社区贡献的结构化集体管理框架;(2) 与不确定性量化进行最佳重现和跨模式比较的统计方法;(3) 系统化质量控制程序,以便复制;我们的分析表明,AI评价需要超出传统软件开发做法之外的专门基础设施、统计严谨和社区协调。

Article 138

Title@2025-07-09 (3): Adaptive Elicitation of Latent Information Using Natural Language

Title: Adaptive Elicitation of Latent Information Using Natural Language

Adaptive Vermittlung von latenten Informationen mit natürlicher Sprache

利用自然语言调适性易感信息使用自然语言 2504.04204v2

Authors (4): Jimmy Wang, Thomas Zollo, Richard Zemel, Hongseok Namkoong

Eliciting information to reduce uncertainty about a latent entity is a critical task in many application domains, e.g., assessing individual student learning outcomes, diagnosing underlying diseases, or learning user preferences. Though natural language is a powerful medium for this purpose, large language models (LLMs) and existing fine-tuning algorithms lack mechanisms for strategically gathering information to refine their own understanding of the latent entity. To harness the generalization power and world knowledge of LLMs in developing effective information-gathering strategies, we propose an adaptive elicitation framework that actively reduces uncertainty on the latent entity. Since probabilistic modeling of an abstract latent entity is difficult, our framework adopts a predictive view of uncertainty, using a meta-learned language model to simulate future observations and enable scalable uncertainty quantification over complex natural language. Through autoregressive forward simulation, our model quantifies how new questions reduce epistemic uncertainty, enabling the development of sophisticated information-gathering strategies to choose the most informative next queries. In experiments on the 20 questions game, dynamic opinion polling, and adaptive student assessment, our method consistently outperforms baselines in identifying critical unknowns and improving downstream predictions, illustrating the promise of strategic information gathering in natural language settings.

在许多应用领域,例如,评估个别学生学习成果、诊断潜在疾病或学习用户偏好等,都是一项关键任务。虽然自然语言是这方面的强大媒介,但大型语言模型和现有的微调算法缺乏战略收集信息的机制,无法通过战略收集信息来完善对潜在实体的理解。为了利用LLMS的概括力量和世界知识来制定有效的信息收集战略,我们提议了一个适应性引导框架,以积极减少潜在实体的不确定性。由于抽象潜在实体的概率模型很难,我们的框架采用预测性不确定性的观点,使用元学语言模型模拟未来观测,使复杂自然语言的可扩展不确定性量化成为可能。通过自动递进式前瞻性模拟,我们的模式量化了新问题如何减少隐含性的不确定性,从而能够发展复杂的信息收集战略来选择最丰富的下一个查询。在20个问题游戏的实验中,动态的民意调查,以及适应性的学生评估中,我们的方法在识别关键未知因素和改进下游语言的预测方面始终超越了基线。

Article 139

Title@2025-07-09 (3): EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning

Title: EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning

EMORL: Ensemble Multi-Objektive Verstärkungs-Lernen für effizientes und flexibles LLM-Fine-Tuning

EMORL: 将多目标强化学习组合起来,促进高效率和灵活的LLM 微调 2505.02579v3

Authors (5): Lingxiao Kong, Cong Yang, Susanne Neufang, Oya Deniz Beyan, Zeyd Boukhers

Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including competing objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the fine-tuning to improve efficiency and flexibility. Our method is the first to aggregate the hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text classification models to score the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption ($17,529\pm 1,650$ data points and $6,573\pm 147.43$ seconds), improved scalability and explainability, and comparable performance across multiple objectives.

在大型语言模型(LLM)微调强化学习(RLL)方面最近取得的进展显示,在处理多目标任务方面有希望,但仍面临重大挑战,包括相互竞争的目标平衡、培训效率低、可缩放性低和解释性有限。我们利用混合学习原则,引入一个组合式多目标学习(EMORL)框架,微调多种模式,同时微调具有个别目标的多重模式,同时在微调后优化组合,提高效率和灵活性。我们的方法是首先汇总单个模型的隐藏状态,将多重目标的背景资料纳入其中。这种方法得到一个分级电网搜索算法的支持,该算法确定了最佳加权组合。我们评估顾问思考生成任务的EMORL,使用文本分类模型对几代进行评分,并在RL微调期间提供奖励。我们通过对 PAIR 和 Mitch8k 数据集的全面实验,展示了EMORL 相对于现有基线的优势:培训消耗量显著降低和更加稳定(17,529 pm 1,650美元数据点和6,573\pm 147.3秒),改进了可计量性和可比较性。

Article 140

Title@2025-07-09 (3): Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework

Title: Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework

Adaptive Beendigung für multi-round Parallele Reasoning: Ein universelles Semantisches Entropie-Führungs-Framework

多元平行理由的适应性终止:通用的语义 Entropy 指导框架 2507.06829v1

Authors (12): Zenan Xu, Zexuan Qiu, Guanhua Huang, Kun Li, Siheng Li, Chenchen Zhang, Kejiao Li, Qi Yi, Yuhao Jiang, Bo Zhou, Fengzong Lian, Zhanhui Kang

Recent advances in large language models (LLMs) have accelerated progress toward artificial general intelligence, with inference-time scaling emerging as a key technique. Contemporary approaches leverage either sequential reasoning (iteratively extending chains of thought) or parallel reasoning (generating multiple solutions simultaneously) to scale inference. However, both paradigms face fundamental limitations: sequential scaling typically relies on arbitrary token budgets for termination, leading to inefficiency or premature cutoff; while parallel scaling often lacks coordination among parallel branches and requires intrusive fine-tuning to perform effectively. In light of these challenges, we aim to design a flexible test-time collaborative inference framework that exploits the complementary strengths of both sequential and parallel reasoning paradigms. Towards this goal, the core challenge lies in developing an efficient and accurate intrinsic quality metric to assess model responses during collaborative inference, enabling dynamic control and early termination of the reasoning trace. To address this challenge, we introduce semantic entropy (SE), which quantifies the semantic diversity of parallel model responses and serves as a robust indicator of reasoning quality due to its strong negative correlation with accuracy…

在大型语言模型(LLMs)方面最近的进展加快了朝向人工一般智能的进展,而推论时间的缩短正在成为一种关键技术。当代方法利用顺序推理(即扩展思维链)或平行推理(同时产生多种解决办法)来扩大推理规模。但是,两种模式都面临根本性的局限性:顺序推理通常依靠任意象征性预算来终止,导致效率低下或过早断线;虽然平行推理往往缺乏平行分支之间的协调,需要渗透性微调才能有效发挥作用。鉴于这些挑战,我们的目标是设计一个灵活的试验-时间协作推理框架,利用顺序推理和平行推理范式的互补优势。为了实现这一目标,核心挑战在于制定一个高效和准确的内在质量指标,用以评估在协作推理过程中的模型反应,使动态控制和早期终止推理追踪。为了应对这一挑战,我们引入了语管酶酶(SEE),它将平行模型反应的语义多样性量化,并且由于与准确性有着强烈的负相关性,因此成为推理质量的有力指标。

Article 141

Title@2025-07-09 (3): CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation

Title: CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation

CMQCIC-Bench: Ein chinesischer Benchmark für die Bewertung großer Sprachmodelle in der Berechnung medizinischer Qualitätskontrollindikatoren

CMQCIC-CMQCIC-Bench:中国医疗质量控制指标计算评估大语言模式基准 2502.11703v2

Authors (12): Guangya Yu, Yanhao Li, Zongying Jiang, Yuxiong Jin, Li Dai, Yupian Lin, Ruihui Hou, Weiyan Zhang, Yongqi Fan, Qi Ye, Jingping Liu, Tong Ruan

Medical quality control indicators are essential to assess the qualifications of healthcare institutions for medical services. With the impressive performance of large language models (LLMs) like GPT-4 in the medical field, leveraging these technologies for the Medical Quality Control Indicator Calculation (MQCIC) presents a promising approach. In this work, (1) we introduce a real-world task MQCIC and propose an open-source Chinese electronic medical records (EMRs)-based dataset (CMQCIC-Bench) comprising 785 instances and 76 indicators. (2) We propose a semi-automatic method to enhance the rule representation. Then we propose the Clinical Facts-based Inferential Rule (CF-IR) method that disentangles the clinical fact verification and inferential rule reasoning actions. (3) We conduct comprehensive experiments on 20 representative LLMs, covering general and medical models. Our findings reveal that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks. (4) We conduct an error analysis and investigate the capabilities of clinical fact verification and inferential rule reasoning, providing insights to improve performance in the MQCIC further. The dataset and code is available in this repository https://github.com/YuY-2001/C-MQCIC.

医疗质量控制指标(MQCIC)是评估医疗保健机构医疗服务资格的关键。医疗领域大型语言模型(LLMs)(LLMs)在GPT-4等大型语言模型(LLMs)在医疗领域的表现令人印象深刻,利用这些技术进行医疗质量控制指标计算(MQCIC)是一个很有希望的方法。在这项工作中,(1) 我们引入了一个现实世界任务MQCIC(MQCIC)并提出一个开放源码的中国电子医疗记录(CMQQCIC-Bench)数据集,由785例和76项指标组成。(2) 我们提出一个半自动方法,以加强规则代表性。然后,我们提出基于临床事实的推断规则(CF-IR)方法,将临床事实核查与推断规则推理行动分开。(3) 我们在20个具有代表性的LMMS(LMs)上进行全面实验,涵盖一般模式和医疗模式。我们的调查结果显示,CFM-IR(C)超越了MC/YC-MQAR)的数据和代码。我们进行了错误分析并调查临床事实核查和推断规则推理的能力,为MC/YC/YC-QARC/MC/MC/MSetC/MC/MC/MC/C/C/C/CSetC/C/Dset/

Article 142

Title@2025-07-09 (3): Losing our Tail – Again: On (Un)Natural Selection And Multilingual Large Language Models

Title: Losing our Tail – Again: On (Un)Natural Selection And Multilingual Large Language Models

Unseren Tail verlieren – wieder einmal: An (Un)Natürlicher Auswahl und mehrsprachigen großen Sprachmodellen

失去我们的尾巴 – – 再一次:在(un)自然选择和多语言大语言模式 2507.03933v2

Authors (1): Eva Vanmassenhove

Multilingual Large Language Models (LLMs) considerably changed how technologies can influence language. While previous technologies could mediate or assist humans, there is now a tendency to offload the task of writing itself to these technologies, enabling them to change our linguistic ecosystem more directly. While they provide us quick access to information and impressively fluent output, beneath their apparent sophistication lies a subtle, more insidious threat: the gradual decline and loss of linguistic diversity. With this opinion piece, I explore how model collapse, with a particular focus on translation technology, can lead to the loss of linguistic forms, grammatical features, and cultural nuance. Model collapse refers to the eventual consequence of self-consuming training loops, where models reinforce their own biases and lose linguistic diversity. Drawing on recent work in Computer Vision, Natural Language Processing (NLP) and Machine Translation (MT), I argue that the tails of our linguistic distributions are vanishing, and with them, the narratives and identities they carry. This is a call to resist linguistic flattening and to reimagine NLP as a field that encourages, values and protects expressive multilingual lexical and linguistic diversity and creativity.

多语言大语言模型(LLMS)大大改变了技术如何影响语言。虽然以前的技术可以调解或帮助人类,但现在有一种趋势,即把写作的任务卸到这些技术身上,使他们能够更直接地改变我们的语言生态系统。虽然这些技术为我们提供了快速的获取信息和令人印象深刻流畅的产出,但是在它们显眼的精密程度之下,却存在一种微妙的、更隐蔽的威胁:语言多样性的逐渐减少和丧失。有了这一意见片,我探索了模式的崩溃,特别侧重于翻译技术,会如何导致语言形式、语法特征和文化细微的丧失。模型的崩溃指的是自费培训循环的最终后果,即模型强化了自己的偏见并丧失了语言多样性。在计算机视野、自然语言处理和机器翻译(MT)方面,我争论说,我们语言分布的尾部正在消失,随着这些尾部的描述和身份正在消失。这是一个呼吁,以抵制语言平板化和重新imagine NLP作为鼓励、价值观和保护明示的多语言词汇和语言多样性及创造性的字段。

Article 143

Title@2025-07-09 (3): Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts

Title: Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts

Sicherer oder luckier? LLMs als Sicherheitsevaluatoren sind für Artefakte nicht robust

安全性更安全还是更幸运?作为安全评估员的LLMs没有强力进行人工操作。 2503.09347v3

Authors (2): Hongyu Chen, Seraphina Goldfarb-Tarrant

Large Language Models (LLMs) are increasingly employed as automated evaluators to assess the safety of generated content, yet their reliability in this role remains uncertain. This study evaluates a diverse set of 11 LLM judge models across critical safety domains, examining three key aspects: self-consistency in repeated judging tasks, alignment with human judgments, and susceptibility to input artifacts such as apologetic or verbose phrasing. Our findings reveal that biases in LLM judges can significantly distort the final verdict on which content source is safer, undermining the validity of comparative evaluations. Notably, apologetic language artifacts alone can skew evaluator preferences by up to 98\%. Contrary to expectations, larger models do not consistently exhibit greater robustness, while smaller models sometimes show higher resistance to specific artifacts. To mitigate LLM evaluator robustness issues, we investigate jury-based evaluations aggregating decisions from multiple models. Although this approach both improves robustness and enhances alignment to human judgements, artifact sensitivity persists even with the best jury configurations. These results highlight the urgent need for diversified, artifact-resistant methodologies to ensure reliable safety assessments.

大型语言模型(LLMS)越来越多地被用作自动化评价员,以评估生成内容的安全性,但其可靠性仍然不确定。本研究报告评估了11个LLM法官模型在关键安全领域的多样化,审查了三个关键方面:反复判断任务的自理性、与人类判断一致、容易投入诸如伪证或动词表达等文物。我们的调查结果显示,LLM法官的偏见会大大扭曲最终判断哪个内容来源更安全,损害比较评价的有效性。值得注意的是,单是pological 语言文物就足以扭曲评价人的偏好,高达98。与预期相反,较大的模型并不一贯表现出更大的强健性,而较小的模型有时对具体文物的阻力也更高。为减轻LLM评价员的稳健性问题,我们调查陪审团对多个模型所作决定的汇总评价。虽然这种方法既能提高稳健性,又能增进对人类判断的一致性,但艺术敏感性即使与最佳陪审团的配置保持。这些结果突出表明,迫切需要采用多样化的、耐手艺方法,以确保可靠的安全评估。

Article 144

Title@2025-07-09 (3): Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams

Title: Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams

Text zum Modell via SysML: Automatisierte Generierung dynamischer Systemrechnermodelle aus unstrukturiertem Naturtext über verbesserte Systemmodellierung Sprachdiagramme

通过 SysML 自动生成动态系统计算模型,通过强化系统模拟图,从未结构化的自然语言文本生成动态系统计算模型 2507.06803v1

Authors (2): Matthew Anderson Hendricks, Alice Cicirello

This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy for exploiting domain and expert knowledge for the automated generation of dynamical system computational model starting from a corpus of document relevant to the dynamical system of interest and an input document describing the specific system. This strategy is implemented in five steps and, crucially, it uses system modeling language diagrams (SysML) to extract accurate information about the dependencies, attributes, and operations of components. Natural Language Processing (NLP) strategies and Large Language Models (LLMs) are employed in specific tasks to improve intermediate outputs of the SySML diagrams automated generation, such as: list of key nouns; list of extracted relationships; list of key phrases and key relationships; block attribute values; block relationships; and BDD diagram generation. The applicability of automated SysML diagram generation is illustrated with different case studies. The computational models of complex dynamical systems from SysML diagrams are then obtained via code generation and computational model generation steps. In the code generation step, NLP strategies are used for summarization, while LLMs are used for validation only. The proposed approach is not limited to a specific system, domain, or computational software. The applicability of the proposed approach is shown via an end-to-end example from text to model of a simple pendulum, showing improved performance compared to results yielded by LLMs only.

本文有助于加快工程动态系统的设计和部署,办法是提出一项战略,利用域和专家知识,自动生成动态系统计算模型,从与动态兴趣系统有关的文件库和描述具体系统的输入文件库开始,从而自动生成动态系统计算模型。这一战略分五个步骤实施,关键是使用系统模拟语言图(SysML),以获取关于各组成部分依赖性、属性和操作的准确信息。自然语言处理(NLP)战略和大语言模型(LLMS)用于具体任务,以改进SYSML图表自动生成的中间产出,例如:关键名词列表;提取关系列表;关键短语和关键关系列表;区属性值;块关系;以及BDDD图生成。自动生成SysML图(SML图)的实用性用不同的案例研究来说明。随后通过代码生成和计算模型生成步骤获得复杂的动态系统的计算模型。在代码生成步骤中,使用NLP战略来进行合成,而LMS-LMS战略则用于合成,而LMS-LMS则用于通过特定的域图进行特定的校验。建议,仅显示通过特定的域到软件的校验。

Article 145

Title@2025-07-09 (3): Test-Time Scaling with Reflective Generative Model

Title: Test-Time Scaling with Reflective Generative Model

Test-Zeit-Skalierung mit reflektierendem Generativem Modell

具有反反思考生成模型的试验时间缩放 2507.01951v2

Authors (11): Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie

We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini’s performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory predicting and scoring respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn the high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI o3-mini’s series with only 32B parameter size. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.

我们引入了我们的第一个反光基因模型MetaStone-S1,该模型通过新的反射引因形式获得了OpenAI o3-mini的性能。新的形式侧重于高质量的推理轨迹选择,并包含两个新颖之处:1) 政策和程序奖励模式的统一界面:我们共享主干网,并使用特定任务头分别用于推理轨迹预测和评分,只引入了5300米的弹道评分额外参数。2) 不再依赖进程一级的批注:我们提供一种自我监督的进程奖赏模式,它能够直接从结果奖赏中学习高质量的推理轨迹选择。用反光化的基因化形式安装,MetaStone-S1自然适合测试时间缩放,我们根据可控思维长度提供三种推理努力模式(低、中、高)。实验表明我们的MetaStone-S1的性能与O3-mini系列相比,只有32B参数大小。为了支持研究界,我们在 https://giust/Setata1.Seta1.

Article 146

Title@2025-07-09 (3): GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods

Title: GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods

GuidedBench: Messung und Milderung der Evaluationsdiskrepanzen von In-the-Wild LLM Jailbreak Methoden

指导:衡量和减轻LLM监狱破碎方法在评价方面的差异 2502.16903v2

Authors (5): Ruixuan Huang, Xunguang Wang, Zongjie Li, Daoyuan Wu, Shuai Wang

Despite the growing interest in jailbreak methods as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in their effectiveness assessments. We conduct a systematic measurement study based on 37 jailbreak studies since 2022, focusing on both the methods and the evaluation systems they employ. We find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about their effectiveness and safety implications. This paper advocates a shift to a more nuanced, case-by-case evaluation paradigm. We introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset, detailed case-by-case evaluation guidelines and an evaluation system integrated with these guidelines – GuidedEval. Experiments demonstrate that GuidedBench offers more accurate measurements of jailbreak performance, enabling meaningful comparisons across methods and uncovering new insights overlooked in previous evaluations. GuidedEval reduces inter-evaluator variance by at least 76.03\%. Furthermore, we observe that incorporating guidelines can enhance the effectiveness of jailbreak methods themselves, offering new insights into both attack strategies and evaluation paradigms.

尽管人们日益关注破狱方法,认为它是建设安全和负责任的大型语言模型的有效红队工具,但评价系统设计有缺陷,导致其成效评估存在重大差异。我们根据2022年以来37项破狱研究,进行了系统性的衡量研究,重点是其使用的方法和评价系统。我们发现,现有评价系统缺乏具体案件的标准,导致对其有效性和安全影响的错误结论。本文件主张转向更细致、逐案评估模式。我们引入了Book Bench,这是一个新颖的基准,包括一个成熟的有害问题数据集、详细的个案评估准则和与这些指南相结合的评价系统 – – 指导Eval。实验表明,Look Bench提供了更准确的破狱业绩衡量方法,使得能够对方法进行有意义的比较,并发现以往评价中忽略的新见解。指导Eval至少减少了76.03。此外,我们认为,纳入准则可以提高破狱方法本身的有效性,为攻击战略和评价范例提供新的见解。

Article 147

Title@2025-07-09 (3): Checklist Engineering Empowers Multilingual LLM Judges

Title: Checklist Engineering Empowers Multilingual LLM Judges

Checkliste Engineering Empowers Mehrsprachige LLM-Richter

多语种LLM法官 2507.06774v1

Authors (2): Mohammad Ghiasvand Mohammadkhani, Hamid Beigy

Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators-a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, and efficiency. In this paper, we propose Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free framework that uses checklist intuition for multilingual evaluation with an open-source model. Experiments across multiple languages and three benchmark datasets, under both pointwise and pairwise settings, show that our method generally surpasses the baselines and performs on par with the GPT-4o model.

长期以来,自动文本评价一直是自然语言处理(NLP)的一个中心问题。最近,外地已转向使用大语言模型(LLMs)作为评价员,这是一种被称为LLM-as-a-judge模式的趋势。这种方法虽然很有希望,而且容易在各种任务之间进行调整,但在多语种背景下探索有限。现有的多语种研究往往依赖专有模型,或需要广泛的培训数据来进行微调,引起对成本、时间和效率的担忧。在本文件中,我们提议以LLM-as-a-judge(CE-Judge)为基础的核对清单工程(LLM-as-a-judge (CE-Judge)),这是一个无培训框架,以开放源模式为多语种评价使用清单直觉。在点设置和对称设置下,多种语言和三个基准数据集的实验显示,我们的方法一般超过基线,并符合GPT-4模式。

Article 148

Title@2025-07-09 (3): KAConvText: Novel Approach to Burmese Sentence Classification using Kolmogorov-Arnold Convolution

Title: KAConvText: Novel Approach to Burmese Sentence Classification using Kolmogorov-Arnold Convolution

KAConvText: Neuer Ansatz zur burmesischen Satzklassifikation mit Kolmogorov-Arnold-Konvolution

KAConvText:利用科尔莫戈罗夫-阿尔纳尔革命对缅甸判刑分类采用新颖的方法 2507.06753v1

Authors (4): Ye Kyaw Thu, Thura Aung, Thazin Myint Oo, Thepchai Supnithi

This paper presents the first application of Kolmogorov-Arnold Convolution for Text (KAConvText) in sentence classification, addressing three tasks: imbalanced binary hate speech detection, balanced multiclass news classification, and imbalanced multiclass ethnic language identification. We investigate various embedding configurations, comparing random to fastText embeddings in both static and fine-tuned settings, with embedding dimensions of 100 and 300 using CBOW and Skip-gram models. Baselines include standard CNNs and CNNs augmented with a Kolmogorov-Arnold Network (CNN-KAN). In addition, we investigated KAConvText with different classification heads - MLP and KAN, where using KAN head supports enhanced interpretability. Results show that KAConvText-MLP with fine-tuned fastText embeddings achieves the best performance of 91.23% accuracy (F1-score = 0.9109) for hate speech detection, 92.66% accuracy (F1-score = 0.9267) for news classification, and 99.82% accuracy (F1-score = 0.9982) for language identification.

本文介绍Kolmogorov-Arnold Convolution for Text(KAConvText)在刑罚分类中的首次应用,涉及三项任务:不平衡的二进制仇恨言论检测、平衡的多级新闻分类和不平衡的多级族裔语言识别。我们调查了各种嵌入配置,将静态和微调环境中的随机和快式文本嵌入比,同时使用CBOW和GVG-gram模型嵌入范围为100和300。基线包括标准CNN和有Kolmogorov-Arnold网络增强的CNN(CNN-KAN)的CNN和CNN。此外,我们还与不同的分类负责人—-MLP和KAN进行了KA Convortext调查,使用KAN头支持增强可解释性。结果显示,KAConvText-MLP与微调快式缩写嵌入的KACAConvText-MLP在识别仇恨言论方面达到91.23%准确度的最佳性(F1-核心=0.9109),92.66%的精确度(F1-核心=0.9267)用于新闻分类,99.82%的精确度(F1-核心=0.982)。

Article 149

Title@2025-07-09 (3): Robust Multimodal Large Language Models Against Modality Conflict

Title: Robust Multimodal Large Language Models Against Modality Conflict

Robuste multimodale große Sprachmodelle gegen Modalitätskonflikt

反对方式冲突强有力的多式多式大语言模型 2507.07151v1

Authors (4): Zongmeng Zhang, Wengang Zhou, Jie Zhao, Houqiang Li

Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.

尽管多式大型语言模型(MLLM)在视觉语言任务方面的能力令人印象深刻,但它们在现实世界情景中容易产生幻觉。本文件从模式冲突的角度对MLLMM的幻觉现象进行调查。与侧重于模型反应与投入之间冲突的现有工作不同,我们研究不同模式投入的内在冲突,这些模式将MLLMs置于困境之中,直接导致幻觉。我们正式界定模式冲突,并建立一个名为多模式模式冲突(MMMC)的数据集,以模拟视觉任务中的这种现象。提出了三种基于即时工程、监督的微调和强化学习的方法,以缓解模式冲突造成的幻觉。对MLLMMS数据集进行了广泛的实验,以分析这些方法的优点和缺点。我们的成果表明,强化学习方法在减轻模式冲突下的幻觉方面取得最佳表现,而受监督的微调方法则显示了有希望和稳定的表现。我们的工作揭示了导致幻觉和对MLLMS的坚固性。我们的工作揭示了未见的模式冲突。

Article 150

Title@2025-07-09 (3): Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

Title: Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

Knockout LLM Assessment: Verwendung großer Sprachmodelle für Bewertungen durch iterative Pairwise-Vergleiche

LLLM 评估:利用大语言模式,通过迭接对等比较进行评估 2506.03785v3

Authors (3): Isik Baran Sandan, Tu Anh Dinh, Jan Niehues

Large Language Models (LLMs) have shown to be effective evaluators across various domains such as machine translations or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-asa Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that knockout assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.

大型语言模型(LLMs)在机器翻译或科学领域等不同领域证明是有效的评价者,目前的LLM-as-a-judge方法主要依靠个别评估或单轮双向评估,使法官LLM无法从全球排名的角度出发。为解决这一问题,我们介绍了LLM-asa法官使用击倒比赛系统、迭接对比的敲倒评估方法。在两个数据集上对三个LMs的实验显示,击倒评估提高了评分的准确性,使Pearson与专家评价的相互关系平均增加0.07分,用于大学一级的考试评分和机器翻译评价,使LMM评估与人类评分更加接近。

Article 151

Title@2025-07-09 (3): LLM-based User Profile Management for Recommender System

Title: LLM-based User Profile Management for Recommender System

LLM-basierte Benutzerprofilverwaltung für Recommender System

以LLM为基础的建议系统用户概况管理 2502.14541v2

Authors (2): Seunghwan Bang, Hwanjun Song

The rapid advancement of Large Language Models (LLMs) has opened new opportunities in recommender systems by enabling zero-shot recommendation without conventional training. Despite their potential, most existing works rely solely on users’ purchase histories, leaving significant room for improvement by incorporating user-generated textual data, such as reviews and product descriptions. Addressing this gap, we propose PURE, a novel LLM-based recommendation framework that builds and maintains evolving user profiles by systematically extracting and summarizing key information from user reviews. PURE consists of three core components: a Review Extractor for identifying user preferences and key product features, a Profile Updater for refining and updating user profiles, and a Recommender for generating personalized recommendations using the most current profile. To evaluate PURE, we introduce a continuous sequential recommendation task that reflects real-world scenarios by adding reviews over time and updating predictions incrementally. Our experimental results on Amazon datasets demonstrate that PURE outperforms existing LLM-based methods, effectively leveraging long-term user information while managing token limitations.

大语言模型(LLMS)的快速进步为推荐者系统开辟了新的机会,使得在没有常规培训的情况下能够提出零点建议。尽管存在潜力,大多数现有工程都完全依赖用户的采购历史,但通过纳入用户生成的文字数据(例如审查和产品描述),留下很大的改进空间。解决这一差距,我们提议PURE,一个基于LLM的新型建议框架,通过系统地从用户审查中提取和总结关键信息,建立和维持不断演变的用户概况。PURE由三个核心部分组成:用于确定用户偏好和关键产品特征的审查提取器、用于改进和更新用户概况的概况更新器以及用于利用最新配置生成个性化建议的推荐器。为了评估PURE,我们引入了一个连续连续不断的建议任务,通过在时间上增加审查并逐步更新预测来反映现实世界情景。我们在亚马孙数据集上的实验结果表明,PURE在管理象征性限制的同时,有效地利用长期用户信息,超越了基于LM系统的现有方法。

Article 152

Title@2025-07-09 (3): Civil Society in the Loop: Feedback-Driven Adaptation of (L)LM-Assisted Classification in an Open-Source Telegram Monitoring Tool

Title: Civil Society in the Loop: Feedback-Driven Adaptation of (L)LM-Assisted Classification in an Open-Source Telegram Monitoring Tool

Zivilgesellschaft im Kreislauf: Feedback-gestützte Anpassung der (L)LM-Assistenten-Klassifikation in einem Open-Source-Telegramm-Monitoring-Tool

” 圈圈中的民间社会:在开放源电讯监测工具中对(L)LM协助的分类进行反馈和改编 “ 。 2507.06734v1

Authors (5): Milena Pustet, Elisabeth Steffen, Helena Mihaljević, Grischa Stanjek, Yannis Illies

The role of civil society organizations (CSOs) in monitoring harmful online content is increasingly crucial, especially as platform providers reduce their investment in content moderation. AI tools can assist in detecting and monitoring harmful content at scale. However, few open-source tools offer seamless integration of AI models and social media monitoring infrastructures. Given their thematic expertise and contextual understanding of harmful content, CSOs should be active partners in co-developing technological tools, providing feedback, helping to improve models, and ensuring alignment with stakeholder needs and values, rather than as passive ‘consumers’. However, collaborations between the open source community, academia, and civil society remain rare, and research on harmful content seldom translates into practical tools usable by civil society actors. This work in progress explores how CSOs can be meaningfully involved in an AI-assisted open-source monitoring tool of anti-democratic movements on Telegram, which we are currently developing in collaboration with CSO stakeholders.

民间社会组织在监测有害在线内容方面的作用越来越重要,特别是因为平台提供者减少了对内容调适的投资,AI工具可以协助检测和监测规模的有害内容,然而,开放源码工具很少能够无缝地整合AI模式和社会媒体监测基础设施,鉴于民间社会组织的专题专门知识和对有害内容的背景了解,它们应当成为合作开发技术工具、提供反馈、帮助改进模式和确保符合利益攸关方需求和价值观的积极伙伴,而不是被动的“消费者”。然而,开放源码社区、学术界和民间社会之间的协作仍然很少,关于有害内容的研究很少转化为民间社会行为者可利用的实用工具。这一进展中的工作探讨了民间社会组织如何能够有意义地参与由它们协助的Telegram反民主运动公开源监测工具,我们目前正在与民间社会组织利益攸关方合作开发这一工具。

Article 153

Title@2025-07-09 (3): On the Effect of Uncertainty on Layer-wise Inference Dynamics

Title: On the Effect of Uncertainty on Layer-wise Inference Dynamics

Über den Effekt der Unsicherheit auf die schichtweise Schlussfolgerungsdynamik

不确定性对图层推论作用的影响 2507.06722v1

Authors (3): Sunwoo Kim, Haneul Yoo, Alice Oh

Understanding how large language models (LLMs) internally represent and process their predictions is central to detecting uncertainty and preventing hallucinations. While several studies have shown that models encode uncertainty in their hidden states, it is underexplored how this affects the way they process such hidden states. In this work, we demonstrate that the dynamics of output token probabilities across layers for certain and uncertain outputs are largely aligned, revealing that uncertainty does not seem to affect inference dynamics. Specifically, we use the Tuned Lens, a variant of the Logit Lens, to analyze the layer-wise probability trajectories of final prediction tokens across 11 datasets and 5 models. Using incorrect predictions as those with higher epistemic uncertainty, our results show aligned trajectories for certain and uncertain predictions that both observe abrupt increases in confidence at similar layers. We balance this finding by showing evidence that more competent models may learn to process uncertainty differently. Our findings challenge the feasibility of leveraging simplistic methods for detecting uncertainty at inference. More broadly, our work demonstrates how interpretability methods may be used to investigate the way uncertainty affects inference.

了解大型语言模型(LLMs)内部代表并处理其预测对于发现不确定性和防止幻觉至关重要。虽然一些研究显示,模型将不确定性嵌入其隐蔽状态,但却未充分探讨这如何影响这些隐蔽状态的处理方式。在这项工作中,我们证明,某些和不确定产出的跨层输出象征概率的动态大体一致,显示不确定性似乎不会影响推断动态。具体地说,我们利用Tuned Lens(Logit Lens的变种),分析11个数据集和5个模型最后预测符号的层性概率轨迹。我们的工作用不正确的预测作为具有较高认知不确定性的预测,显示某些和不确定预测的轨迹是一致的,两者都观察到类似层的突然信任度增加。我们通过证明更有能力的模型可能学会以不同的方式处理不确定性来平衡这一发现的结果。我们的调查结果对利用简单方法探测不确定性的可行性提出了挑战。更广泛地说,我们的工作表明,如何使用解释方法来调查不确定性如何影响推断。

Article 154

Title@2025-07-09 (3): CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs

Title: CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs

CLI-RAG: Ein Retrieval-Augmented Framework für klinisch strukturierte und kontextorientierte Textgenerierung mit LLMs

CLI-RAG: 利用LLMs实现临床结构化和有背景意识的文本生成的检索增强框架 2507.06715v1

Authors (2): Garapati Keerthana, Manik Gupta

Large language models (LLMs), including zero-shot and few-shot paradigms, have shown promising capabilities in clinical text generation. However, real-world applications face two key challenges: (1) patient data is highly unstructured, heterogeneous, and scattered across multiple note types and (2) clinical notes are often long and semantically dense, making naive prompting infeasible due to context length constraints and the risk of omitting clinically relevant information. We introduce CLI-RAG (Clinically Informed Retrieval-Augmented Generation), a domain-specific framework for structured and clinically grounded text generation using LLMs. It incorporates a novel hierarchical chunking strategy that respects clinical document structure and introduces a task-specific dual-stage retrieval mechanism. The global stage identifies relevant note types using evidence-based queries, while the local stage extracts high-value content within those notes creating relevance at both document and section levels. We apply the system to generate structured progress notes for individual hospital visits using 15 clinical note types from the MIMIC-III dataset. Experiments show that it preserves temporal and semantic alignment across visits, achieving an average alignment score of 87.7%, surpassing the 80.7% baseline from real clinician-authored notes. The generated outputs also demonstrate high consistency across LLMs, reinforcing deterministic behavior essential for reproducibility, reliability, and clinical trust.

大型语言模型(LLMS),包括零发和少发模式,在临床文本生成方面显示出很有希望的能力。然而,现实世界应用面临着两大挑战:(1) 患者数据高度结构化、多样化,分散于多个备注类型,(2) 临床说明往往很长、语义密集,由于上下文长度的限制和忽略与临床有关的信息的风险,因此这些说明中高价值的内容变得天真不可行。我们采用CLI-RAG(Clinnicnational IntrolnectiveRegive-Auged Conference),这是一个用于使用LMS进行结构化和临床基础的文本生成的域特定框架。它包含一种尊重临床文件结构的新型分级组合战略,并引入了针对具体任务的双阶段检索机制。全球阶段利用基于证据的查询确定了相关说明类型,而使这些说明中高价值的内容在文档和科级两级都具有相关性。我们采用该系统,利用MIMIC-III数据集的15个临床注释类型,为个别医院访问产生结构化的进度说明。实验显示,它保持了跨访问的时间和语义性协调,实现了877%的临床基本的校准一致性,也展示了87.7%的可靠度。

Article 155

Title@2025-07-09 (3): Do Larger Language Models Imply Better Generalization? A Pretraining Scaling Law for Implicit Reasoning

Title: Do Larger Language Models Imply Better Generalization? A Pretraining Scaling Law for Implicit Reasoning

Werden größere Sprachmodelle besser verallgemeinert? Ein Vorschulungs-Skalierungsgesetz für Implizite Vernunft

大语言模式是否充分更加普遍化? 2504.03635v2

Authors (6): Xinyi Wang, Shawn Tan, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks requiring complex reasoning. However, the effects of scaling on their reasoning abilities remain insufficiently understood. In this paper, we introduce a synthetic multihop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To evaluate this, we pretrain language models (LMs) from scratch solely on triples from the incomplete graph and assess their ability to infer the missing edges. Interestingly, we observe that overparameterization can impair reasoning performance due to excessive memorization. We investigate different factors that affect this U-shaped loss curve, including graph structure, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling that linearly maps the knowledge graph search entropy to the optimal model size. This work provides new insights into the relationship between scaling and reasoning in LLMs, shedding light on possible ways to optimize their performance for reasoning tasks.

大型语言模型(LLMS)在一系列需要复杂推理的广泛任务中表现出了非凡的能力,然而,对其推理能力的影响仍然没有充分理解。在本文中,我们引入了合成多光推理环境,旨在密切复制真实世界大规模知识图的结构和分布。我们的推理任务涉及完成图中缺失的边缘,这需要先进的多光推理和模拟真实世界推理假设。为了评估这一点,我们仅从不完整的图表的三部开始将语言模型从零开始,并评估其推算缺失边缘的能力。有趣的是,我们观察到过度的分解可能损害推理性能,因为过度沉积。我们调查了影响这种U形损失曲线的不同因素,包括图形结构、模型大小和培训步骤。为了预测特定知识图的最佳模型大小,我们发现一个实验尺度,线性地绘制了知识图形搜索模型的轨迹到最佳模型大小。这项工作为LMS的缩放和推理提供了新的洞察力关系提供了新的洞察力,让我们了解如何优化其推理工作。

Article 156

Title@2025-07-09 (3): A Survey on Prompt Tuning

Title: A Survey on Prompt Tuning

Eine Umfrage über das schnelle Tuning

快速提款调查 2507.06085v2

Authors (3): Zongqian Li, Yixuan Su, Nigel Collier

This survey reviews prompt tuning, a parameter-efficient approach for adapting language models by prepending trainable continuous vectors while keeping the model frozen. We classify existing approaches into two categories: direct prompt learning and transfer learning. Direct prompt learning methods include: general optimization approaches, encoder-based methods, decomposition strategies, and mixture-of-experts frameworks. Transfer learning methods consist of: general transfer approaches, encoder-based methods, and decomposition strategies. For each method, we analyze method designs, innovations, insights, advantages, and disadvantages, with illustrative visualizations comparing different frameworks. We identify challenges in computational efficiency and training stability, and discuss future directions in improving training robustness and broadening application scope.

这项调查审查迅速调试,这是通过预先等待的可训练的连续矢量调整语言模式的参数效率办法,同时保持该模式的冻结。我们将现有办法分为两类:直接的迅速学习和转让学习。直接的迅速学习方法包括:一般优化办法、以编码器为基础的方法、分解战略和专家混合框架。转让学习方法包括:一般转让办法、以编码器为基础的方法和分解战略。对于每一种方法,我们分析方法的设计、创新、洞察力、优缺点,并用说明性可视化来比较不同的框架。我们找出计算效率和培训稳定性方面的挑战,并讨论提高培训可靠性和扩大应用范围的未来方向。

Article 157

Title@2025-07-09 (3): Automating IRAC Analysis in Malaysian Contract Law using a Semi-Structured Knowledge Base

Title: Automating IRAC Analysis in Malaysian Contract Law using a Semi-Structured Knowledge Base

Automatisierung der IRAC-Analyse im malaysischen Vertragsrecht mit Hilfe einer halbstrukturierten Wissensbasis

利用半结构化知识库对马来西亚合同法的IRAC分析自动化 2406.13217v2

Authors (5): Xiaoxi Kang, Lizhen Qu, Lay-Ki Soon, Zhuang Li, Adnan Trakic

The effectiveness of Large Language Models (LLMs) in legal reasoning is often limited due to the unique legal terminologies and the necessity for highly specialized knowledge. These limitations highlight the need for high-quality data tailored for complex legal reasoning tasks. This paper introduces LegalSemi, a benchmark specifically curated for legal scenario analysis. LegalSemi comprises 54 legal scenarios, each rigorously annotated by legal experts, based on the comprehensive IRAC (Issue, Rule, Application, Conclusion) framework from Malaysian Contract Law. In addition, LegalSemi is accompanied by a structured knowledge base (SKE). A series of experiments were conducted to assess the usefulness of LegalSemi for IRAC analysis. The experimental results demonstrate the effectiveness of incorporating the SKE for issue identification, rule retrieval, application and conclusion generation using four different LLMs.

大语言模型在法律推理方面的效力往往有限,因为法律术语独特,而且需要高度专业的知识,这些局限性突出表明需要针对复杂的法律推理任务提供高质量的数据,本文件介绍了法律Semi,这是专门为法律情景分析制定的基准,法律Semi由54种法律设想方案组成,每种方案都由法律专家严格说明,以马来西亚合同法的综合IRAC(问题、规则、适用、结论)框架为基础,此外,法律Semi还辅之以一个结构化的知识库(SKE),进行了一系列实验,评估法律Semi对辐射研究中心分析的效用,实验结果表明将SKE纳入四个不同的LMs的问题识别、规则检索、应用和结论生成工作的有效性。

Article 158

Title@2025-07-09 (3): Probing and Steering Evaluation Awareness of Language Models

Title: Probing and Steering Evaluation Awareness of Language Models

Probing and Steering Evaluation Awareness of Language Models

评价语言模式评估意识的检验和指导 2507.01786v2

Authors (4): Jord Nguyen, Khiem Hoang, Carlo Leonardo Attubato, Felix Hofstätter

Language models can distinguish between testing and deployment phases – a capability known as evaluation awareness. This has significant safety and policy implications, potentially undermining the reliability of evaluations that are central to AI governance frameworks and voluntary industry commitments. In this paper, we study evaluation awareness in Llama-3.3-70B-Instruct. We show that linear probes can separate real-world evaluation and deployment prompts, suggesting that current models internally represent this distinction. We also find that current safety evaluations are correctly classified by the probes, suggesting that they already appear artificial or inauthentic to models. Our findings underscore the importance of ensuring trustworthy evaluations and understanding deceptive capabilities. More broadly, our work showcases how model internals may be leveraged to support blackbox methods in safety audits, especially for future models more competent at evaluation awareness and deception.

语言模型可以区分测试和部署阶段 – – 一种被称为评估意识的能力,这具有重大的安全和政策影响,有可能损害对AI治理框架和自愿行业承诺至关重要的评价的可靠性。在本文中,我们在Llama-3.3-70B-Instruct中研究评价意识。我们表明线性探测器可以区分真实世界的评价和部署提示,表明目前的模型在内部代表了这种区别。我们还发现,目前的安全评估被探测器正确地分类,表明它们已经看起来是人为的或对模型不真实的。我们的调查结果强调了确保可靠的评估和理解欺骗性能力的重要性。更广泛地说,我们的工作展示了如何利用内部模型来支持安全审计中的黑盒方法,特别是未来模型在评估意识和欺骗方面更能胜任的黑盒方法。

Article 159

Title@2025-07-09 (3): Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models

Title: Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models

Elite Polarisierung in europäischen Parlamentsreden: ein neuartiger Messansatz mit großen Sprachmodellen

欧洲议会讲话中的极极化:使用大语言模式的新衡量方法 2507.06658v1

Authors (1): Gennadii Iakovlev

This project introduces a new measure of elite polarization via actor and subject detection using artificial intelligence. I identify when politicians mention one another in parliamentary speeches, note who is speaking and who is being addressed, and assess the emotional temperature behind these evaluations. This maps how elites evaluate their various out-parties, allowing us to create an index of mutual out-party hostility, that is, elite polarization. While I analyzed polarization data over the past four decades for the UK, and two decades for Hungary and Italy, my approach lays the groundwork for a twenty-year, EU-wide time-series dataset on elite polarization. I obtain the results that can be aggregated by party and quarter. The resulting index demonstrates a good face validity: it reacts to events such as electoral campaigns, country- and party-level crises, and to parties losing and assuming power.

这个项目引入了一种通过演员和使用人工智能进行实验检测的精英两极分化的新尺度。当政客们在议会讲话中互相提及时,我就会指出,谁在讲话,谁在说话,谁正在被处理,并评估这些评价背后的情绪温度。这幅画描绘了精英们如何评价他们不同的外部党派,使我们能够建立一个共同的外部敌对指数,即精英两极分化。我分析了过去40年来英国的两极分化数据,以及匈牙利和意大利的两极分化数据,与此同时,我的方法为整个欧盟关于精英两极分化的20年时间系列数据打下了基础。我获得了可以按政党和季度汇总的结果。由此得出的指数显示了良好的面貌有效性:它对选举运动、国家和政党危机、以及失去和掌权的政党等事件作出反应。

Article 160

Title@2025-07-09 (3): PBa-LLM: Privacy- and Bias-aware NLP using Named-Entity Recognition (NER)

Title: PBa-LLM: Privacy- and Bias-aware NLP using Named-Entity Recognition (NER)

PBa-LLM: Datenschutz- und Bias-aware NLP mit Named-Entity-Erkennung (NER)

PBA-LLM: 使用命名实体识别(NER)的隐私和Bias-awar NLP 2507.02966v2

Authors (8): Gonzalo Mancera, Aythami Morales, Julian Fierrez, Ruben Tolosana, Alejandro Penna, Miguel Lopez-Duran, Francisco Jurado, Alvaro Ortigosa

The use of Natural Language Processing (NLP) in highstakes AI-based applications has increased significantly in recent years, especially since the emergence of Large Language Models (LLMs). However, despite their strong performance, LLMs introduce important legal/ ethical concerns, particularly regarding privacy, data protection, and transparency. Due to these concerns, this work explores the use of Named- Entity Recognition (NER) to facilitate the privacy-preserving training (or adaptation) of LLMs. We propose a framework that uses NER technologies to anonymize sensitive information in text data, such as personal identities or geographic locations. An evaluation of the proposed privacy-preserving learning framework was conducted to measure its impact on user privacy and system performance in a particular high-stakes and sensitive setup: AI-based resume scoring for recruitment processes. The study involved two language models (BERT and RoBERTa) and six anonymization algorithms (based on Presidio, FLAIR, BERT, and different versions of GPT) applied to a database of 24,000 candidate profiles. The findings indicate that the proposed privacy preservation techniques effectively maintain system performance while playing a critical role in safeguarding candidate confidentiality, thus promoting trust in the experimented scenario. On top of the proposed privacy-preserving approach, we also experiment applying an existing approach that reduces the gender bias in LLMs, thus finally obtaining our proposed Privacyand Bias-aware LLMs (PBa-LLMs). Note that the proposed PBa-LLMs have been evaluated in a particular setup (resume scoring), but are generally applicable to any other LLM-based AI application.

近年来,特别是在出现大型语言模型(LLMs)以来,在基于AI的高级应用中,自然语言处理(NLP)的使用在最近几年中显著增加,特别是在出现大型语言模型(LLMs)以来。然而,LLMs尽管表现强劲,却提出了重要的法律/道德关切,特别是在隐私、数据保护和透明度方面。由于这些关切,这项工作探索了使用Nation-实体识别(NER)来促进LLMS的隐私保护培训(或改编)。我们提议了一个框架,利用NER技术将个人身份或地理位置等文本数据中的敏感信息匿名化。对拟议的隐私保存学习框架进行了评估,以衡量其在特定高访问和敏感设置方面对用户隐私和系统绩效的影响:AI公司恢复征聘过程的评分。由于这些关切,这项工作涉及两种语言模型(BERT和ROBERTA)和六种匿名算法(基于PLV、FLAIR、BERT和GPT的不同版本),适用于24 000个候选人简介数据库。调查结果表明,拟议的隐私保存技术保存方法有效地维持了系统的保密性和系统运行,因此,在试验中也降低了现有的保密模式。

Article 161

Title@2025-07-09 (3): Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review

Title: Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review

Beschleunigung der Datenextraktion mit Hilfe eines großen Sprachmodells (LLM) und eines Scoping-Review-Protokolls: eine methodische Studie im Rahmen einer komplexen Scoping-Review

利用大型语言模式和范围界定审查议定书加快数据提取速度:在复杂的范围界定审查范围内进行的方法研究 2507.06623v1

Authors (7): James Stewart-Evans, Emma Wilson, Tessa Langley, Andrew Prayle, Angela Hands, Karen Exley, Jo Leonardi-Bee

The data extraction stages of reviews are resource-intensive, and researchers may seek to expediate data extraction using online (large language models) LLMs and review protocols. Claude 3.5 Sonnet was used to trial two approaches that used a review protocol to prompt data extraction from 10 evidence sources included in a case study scoping review. A protocol-based approach was also used to review extracted data. Limited performance evaluation was undertaken which found high accuracy for the two extraction approaches (83.3% and 100%) when extracting simple, well-defined citation details; accuracy was lower (9.6% and 15.8%) when extracting more complex, subjective data items. Considering all data items, both approaches had precision >90% but low recall (<25%) and F1 scores (<40%). The context of a complex scoping review, open response types and methodological approach likely impacted performance due to missed and misattributed data. LLM feedback considered the baseline extraction accurate and suggested minor amendments: four of 15 (26.7%) to citation details and 8 of 38 (21.1%) to key findings data items were considered to potentially add value. However, when repeating the process with a dataset featuring deliberate errors, only 2 of 39 (5%) errors were detected. Review-protocol-based methods used for expediency require more robust performance evaluation across a range of LLMs and review contexts with comparison to conventional prompt engineering approaches. We recommend researchers evaluate and report LLM performance if using them similarly to conduct data extraction or review extracted data. LLM feedback contributed to protocol adaptation and may assist future review protocol drafting.

审查的数据提取阶段是资源密集型的,研究人员可能寻求利用在线(大语言模型)LLM和审查协议加快数据提取速度。Claude 3.5 Sonnet用于试验两种方法,即使用审查协议,促使从案例研究范围审查中包括的10个证据来源提取数据。还采用基于协议的方法审查提取的数据;采用有限的绩效评价,发现两种提取方法(83.3%和100%)在提取简单、明确界定的引用细节时,准确性很高(9.6%和15.8%);在提取更复杂、主观的数据项目时,准确性较低(9.6%和15.8%)。考虑到所有数据项目,两种方法都使用审查规程精确度大于90%,但回顾率低( < 25 %)和F1分( < 40%),采用复杂的范围审查、公开答复类型和方法,可能因误用和误用数据而影响业绩;LLMM反馈认为,基线提取方法准确性,建议对关键数据项目作出小的修改:15项(26.7%)和38项(21.1%)。考虑到所有数据项目都可能增加价值。然而,在重复采用定期评估程序时,需要进行定期评估的进度评估时,(5次的实地评估,在评估后,采用定期评估后,对定期评估,对定期评估中的数据进行实地评估,可能需要进行实地评估。

Article 162

Title@2025-07-09 (3): FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation

Title: FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation

FuDoBa: Zusammenführen von Dokumenten- und Wissensgraphen-basierten Darstellungen mit Bayesian-Optimierung

FUDoBA: 以贝叶斯最佳化方式使用引信文档和知识图表表示 2507.06622v1

Authors (4): Boshko Koloski, Senja Pollak, Roberto Navigli, Blaž Škrlj

Building on the success of Large Language Models (LLMs), LLM-based representations have dominated the document representation landscape, achieving great performance on the document embedding benchmarks. However, the high-dimensional, computationally expensive embeddings from LLMs tend to be either too generic or inefficient for domain-specific applications. To address these limitations, we introduce FuDoBa a Bayesian optimisation-based method that integrates LLM-based embeddings with domain-specific structured knowledge, sourced both locally and from external repositories like WikiData. This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights for enhanced classification performance. We demonstrate the effectiveness of our approach on six datasets in two domains, showing that when paired with robust AutoML-based classifiers, our proposed representation learning approach performs on par with, or surpasses, those produced solely by the proprietary LLM-based embedding baselines.

在大型语言模型(LLMs)成功的基础上,基于LLM的LLM代表制在文件代表格局上占据了主导地位,在文件嵌入基准上取得了巨大的业绩;然而,高维、计算上昂贵的LLMs嵌入层往往过于通用,或者对特定领域的应用来说效率低下。为了解决这些限制,我们采用了基于Bayesian优化的FuDoba方法,将基于LLM的嵌入与当地和来自WikiData(WikiData)等外部储存库的基于特定领域的结构化知识相结合。这种组合产生低维度、与任务相关的代表制,同时减少培训复杂性,并产生可解释的早期融合权重,用于提高分类性业绩。我们展示了我们在两个领域六套数据集上的做法的有效性,表明在与基于AutMLMLG的分类师结合时,我们拟议的代表制学习方法与完全由基于LMM的专有嵌入基线生成的知识或超过这些知识。

Article 163

Title@2025-07-09 (3): Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

Title: Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

Decoder-Hybrid-Decoder-Architektur für effizientes Nachdenken mit langer Generation

提高长代人合理性效率的代coder-Hybrid-Decer 结构 2507.06607v1

Authors (14): Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen

Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.

语言建模方面的最新进展显示了国家空间模型(SSMM)对于高效序列建模的有效性。 Samba 和 decoder- decoder 结构( YOCO) 等混合结构在变异器上表现出了有希望的绩效收益, 先前的工程并没有调查SSM 层之间代表共享的效率潜力。在本文中, 我们引入了Gate memory 单元(GMU), 这是一种简单而有效的跨层有效记忆共享机制。我们应用它来创建 SambaY, 一种将GMU纳入交叉解码中, 以分享基于 Samba 的自解码器( YOCO) 的内存读状态。 SambaY 明显提高了解码效率, 保留了SSSMSM 层之间在时间上共享的线性预复杂性, 提高了长期性能, 同时消除了明确的定位编码需要。通过广泛的规模实验, 我们的模型显示, 与坚固的YOCO基线基准相比, 显示在大规模调整的系统下, 提高性能表现的精确度缩缩缩缩缩缩缩。

Article 164

Title@2025-07-09 (3): Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

Title: Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

Video-Language Understanding: Eine Umfrage aus Modellarchitektur, Modellschulung und Datenperspektiven

了解视频语言:从建模模型、示范培训和数据视角进行调查 2406.05615v3

Authors (9): Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us. There has been a lot of interest in creating video-language understanding systems with human-like senses since a video-language pair can mimic both our linguistic medium and visual environment with temporal dynamics. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on the challenges, we summarize their methods from model architecture, model training, and data perspectives. We also conduct performance comparison among the methods, and discuss promising directions for future research.

人类使用多种感官来理解环境。视觉和语言是两个最重要的感官,因为它们使我们能够轻松地交流我们的想法和感知我们周围的世界。创建具有人感的视频语言理解系统引起了很大的兴趣,因为一对视频语言可以模仿我们的语言媒介和视觉环境,同时呈现时间动态。在本次调查中,我们审查这些系统的关键任务,并突出相关的挑战。根据挑战,我们从模型结构、模型培训和数据角度总结它们的方法。我们还对这些方法进行绩效比较,并讨论未来研究的有希望的方向。

Article 165

Title@2025-07-09 (3): Can Input Attributions Explain Inductive Reasoning in In-Context Learning?

Title: Can Input Attributions Explain Inductive Reasoning in In-Context Learning?

Kann Input-Attributionen im In-Context-Lernen eine induktive Begründung erklären?

投入属性能否解释内文学习中的引因? 2412.15628v5

Authors (4): Mengyu Ye, Tatsuki Kuribayashi, Goro Kobayashi, Jun Suzuki

Interpreting the internal process of neural models has long been a challenge. This challenge remains relevant in the era of large language models (LLMs) and in-context learning (ICL); for example, ICL poses a new issue of interpreting which example in the few-shot examples contributed to identifying/solving the task. To this end, in this paper, we design synthetic diagnostic tasks of inductive reasoning, inspired by the generalization tests typically adopted in psycholinguistics. Here, most in-context examples are ambiguous w.r.t. their underlying rule, and one critical example disambiguates it. The question is whether conventional input attribution (IA) methods can track such a reasoning process, i.e., identify the influential example, in ICL. Our experiments provide several practical findings; for example, a certain simple IA method works the best, and the larger the model, the generally harder it is to interpret the ICL with gradient-based IA methods.

长期以来,对神经模型内部过程的解释一直是一个挑战。在大型语言模型(LLMs)和文字内学习(ICL)的时代,这一挑战仍然具有相关性;例如,ICL提出了一个新的解释问题,在少数例子中,这个例子有助于确定/解决任务。为此,在本文件中,我们设计了感知推理的合成诊断任务,这受通常在心理语言学中采用的概括性测试的启发。在这里,大多数同源实例是模糊的,有一个关键的例子掩盖了这一点。问题在于常规投入归属(IA)方法能否跟踪这种推理过程,即识别具有影响力的示例,在ICL中。我们的实验提供了若干实际结果;例如,某种简单的IA方法最有效,而模型越大,用梯度的IA方法解释ICL通常越难。

Article 166

Title@2025-07-09 (3): Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions

Title: Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions

Bewertung und Verbesserung der Robustheit in großen Sprachmodellen: Umfrage und Zukunftsrichtung

评价和提高大语言模式的威力:调查和未来方向 2506.11111v2

Authors (5): Kun Zhang, Le Wu, Kui Yu, Guangyi Lv, Dacao Zhang

Large Language Models (LLMs) have gained enormous attention in recent years due to their capability of understanding and generating natural languages. With the rapid development and wild-range applications (e.g., Agents, Embodied Intelligence), the robustness of LLMs has received increased attention. As the core brain of many AI applications, the robustness of LLMs requires that models should not only generate consistent contents, but also ensure the correctness and stability of generated content when dealing with unexpeted application scenarios (e.g., toxic prompts, limited noise domain data, outof-distribution (OOD) applications, etc). In this survey paper, we conduct a thorough review of the robustness of LLMs, aiming to provide a comprehensive terminology of concepts and methods around this field and facilitate the community. Specifically, we first give a formal definition of LLM robustness and present the collection protocol of this survey paper. Then, based on the types of perturbated inputs, we organize this survey from the following perspectives: 1) Adversarial Robustness: tackling the problem that prompts are manipulated intentionally, such as noise prompts, long context, data attack, etc; 2) OOD Robustness: dealing with the unexpected real-world application scenarios, such as OOD detection, zero-shot transferring, hallucinations, etc; 3) Evaluation of Robustness: summarizing the new evaluation datasets, metrics, and tools for verifying the robustness of LLMs. After reviewing the representative work from each perspective, we discuss and highlight future opportunities and research directions in this field. Meanwhile, we also organize related works and provide an easy-to-search project (https://github.com/zhangkunzk/Awesome-LLM-Robustness-papers) to support the community.

近年来,大型语言模型(LLMS)因其理解和生成自然语言的能力而得到极大关注。随着快速发展和野外应用(例如代理人、Embudied Intell),LLMS的稳健性得到了越来越多的关注。作为许多AI应用程序的核心大脑,LLMS的稳健性要求在处理未详细应用情景时,模型不仅应产生一致的内容,而且还应确保生成的内容的正确性和稳定性(例如有毒提示、有限的噪音域数据、分配之外的应用程序等)。在本调查文件中,我们对LLMS的稳健性进行了彻底审查,目的是为该领域的概念和方法提供全面的术语,并为社区提供便利。具体地说,我们首先对LMMS的稳健性作出正式定义,并介绍这份调查文件的收集程序。然后,根据扰动性投入的种类,我们从以下角度组织这次调查:1) Adversariaral-headaldreatrial Rustrial Rights:我们刻意操纵的问题,例如噪声、长背景、数据测试的未来机会;ODMUD-dealevild-dealveality oration oration orpss se;以及ODODODD) se-list reals astrevationals axes。然后,我们从实时、我们从实时分析和新工具提供新的研究/ODODDDDDDDDDDDD2 提供新的方向、我们进行这方面的工作。

Article 167

Title@2025-07-09 (3): Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis

Title: Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis

Verbesserung der Lebensmittel-Domain-Fragebeantwortung mit einem multimodalen Wissensgraphen: Hybride QA-Erzeugung und Diversitätsanalyse

以多模式知识图回答加强粮食领域问题:混合QA生成和多样性分析 2507.06571v1

Authors (2): Srihari K B, Pushpak Bhattacharyya

We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves BERTScore by 16.2\%, reduces FID by 37.8\%, and boosts CLIP alignment by 31.1\%. Diagnostic analyses-CLIP-based mismatch detection (35.2\% to 7.3\%) and LLaVA-driven hallucination checks-ensure factual and visual fidelity. A hybrid retrieval-generation strategy achieves 94.1\% accurate image reuse and 85\% adequacy in synthesis. Our results demonstrate that structured knowledge and multimodal generation together enhance reliability and diversity in food QA.

我们提出一个统一的食品-食物质量评估框架,将大型多式联运知识图(MMKG)与基因性AI相结合。我们的MMKG链接了13 000种配方、3 000种配方、140 000种关系和14 000张图像。我们利用40个模板和LLAVA/DeepSeek增殖生成了40 000对QA配方。Meta LalaMA 3.1-8B和Stabal Difulation 3.5-Lorge的联合微调将BERTScore 改进了16.2,将FID减少了37.8,并将CLIP的匹配增加了31.1。诊断分析-基于CLOCIP的错配检测(35.2至7.3)和LLAVA驱动的幻觉检查真实性和视觉忠诚性。混合回收战略实现了94.1准确的图像再利用和85合力。我们的成果表明,结构化的知识和多式生成共同提高食品QA的可靠性和多样性。

Article 168

Title@2025-07-09 (3): CHAI for LLMs: Improving Code-Mixed Translation in Large Language Models through Reinforcement Learning with AI Feedback

Title: CHAI for LLMs: Improving Code-Mixed Translation in Large Language Models through Reinforcement Learning with AI Feedback

CHAI für LLMs: Verbesserung der code-gemischten Übersetzung in großen Sprachmodellen durch Verstärkungslernen mit KI-Feedback

LLM女士CHAI:利用AI反馈加强学习,改进大语言模式的编码混合翻译 2411.09073v3

Authors (3): Wenbo Zhang, Aditya Majumdar, Amulya Yadav

Large Language Models (LLMs) have demonstrated remarkable capabilities across various NLP tasks but struggle with code-mixed (or code-switched) language understanding. For example, prior work benchmarking the performance of multilingual LLMs on code-mixed translation tasks has demonstrated that current state-of-the-art multilingual LLMs are ineffective in dealing with code-mixed languages. However, the question of how to improve the capability of multilingual LLMs to handle code-mixed language has not received any attention to date. In this paper, we tackle this research gap by proposing CHAI, a novel general-purpose framework for improving the ability of multilingual LLMs to handle code-mixed languages. CHAI relies on three novel contributions made in this paper. First, we explore the ability of LLMs to provide accurate annotations for code-mixed translation tasks. Second, we leverage this ability of LLMs as annotators to generate preference data for code-mixed translation tasks at scale, which are then used within a reinforcement learning from AI feedback (RLAIF) procedure to improve LLMs’ capability on code-mixed tasks. Third, we conduct a rigorous experimental evaluation across various real-world datasets and settings. Our analysis shows that CHAI-powered LLMs outperform state-of-the-art open-source LLMs by 25.66% (in terms of win rate adjudicated by human annotators) in code-mixed translation tasks. This work represents a first step towards developing more inclusive code-mixed LLMs.

大型语言模型(LLMS)在各种国家语言模型任务中表现出了非凡的能力,但与编码混合(或编码转换)的语言理解相抗衡。例如,先前的工作基准基准表明,在编码混合翻译任务中,多语种LLMS的多语种LMS的绩效已经证明,目前最先进的多语种LLMs在处理代码混合翻译任务方面是无效的。然而,如何提高多语种LLMS处理代码混合语言的能力的问题至今没有得到任何关注。在本文中,我们通过提出CHAI(REIF)程序来解决这一研究差距,以提高多语种LMAs处理代码混合语言的能力。CHAI依靠本文中的三个新创贡献。首先,我们探索LLMSMs在为编码混合翻译任务提供准确说明的能力。第二,我们利用LLMMS的这种能力作为批注者,为代码混合翻译任务生成优惠数据,然后在从AI反馈(RLAIF)中进行强化学习,这是提高LMSMS处理编码能力的新通用普通语言模型翻译能力。我们用了一个硬性模型分析。

Article 169

Title@2025-07-09 (3): The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production

Title: The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production

Die Fehler anderer: Ein LLM-gestützter Rahmen für die wissenschaftliche Wissensproduktion

其他法律:由LLM驱动的科学知识生产框架 2507.06565v1

Authors (1): Juan B. Gutiérrez

Large-language models turn writing into a live exchange between humans and software. We capture this new medium with a discursive-network model that treats people and LLMs as equal nodes and tracks how their statements circulate. Broadening the focus from isolated hallucinations, we define invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. A general mathematical model of discursive networks is developed to provide valuable insights: A network governed only by drift and self-repair stabilizes at a modest error rate; adding fabrication reproduces the high rates seen in current LLMs. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source \emph{Flaws-of-Others (FOO) algorithm}: a configurable loop in which any set of agents critique one another while a harmoniser merges their verdicts. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from wiring imperfect ones into networks that keep each other honest.

大型语言模型将写作变成人类和软件之间的实时交换。我们用一个不精确的网络模型捕捉这一新媒体, 将人和LLMs作为平等的节点对待, 并跟踪其声明的传播方式。从孤立的幻觉中扩大焦点, 我们定义无效( 任何事实、逻辑或结构性违约) , 并显示它有四种危险 : 从真理、自我修复、制造和外部检测中漂移。开发了一个不透明网络的一般数学模型, 以提供宝贵的洞察 : 一个仅受漂移和自修复稳定在微小错误率的网络管理, 并且添加制造复制当前LLMs 所看到的高利率。给每个虚假的主张提供一次小小的同行审查机会, 甚至一次同行审查的机会, 将系统改变为真相主宰状态。我们用开放源的 \ emph{ FOL- 算法( FOO) 算法进行同行审查 : 一个可配置的循环, 任何一组代理人都互相批评, 而一个协调者将其判断合并为彼此的循环。。。取而具有实用和文化性: 新媒介的可靠性是实用和文化的: : 新媒介的可靠性不是来自完善的单一的单一的单一的模型, 而是从不完善的单一的单一的网络。

Article 170

Title: DS@GT at CheckThat! 2025: Exploring Retrieval and Reranking Pipelines for Scientific Claim Source Retrieval on Social Media Discourse

DS@GT bei CheckThat! 2025: Erforschung von Rück- und Rücklaufleitungen für wissenschaftliche Anspruchsquellen-Rückgriffe auf Social Media Diskurs

DS@GT at checketh! 2025年: 探索检索和重新排序科学索赔源检索管道,关于社会媒体问题 2507.06563v1

Authors (4): Jeanette Schofield, Shuyu Tian, Hoang Thanh Thanh Truong, Maximilian Heil

Social media users often make scientific claims without citing where these claims come from, generating a need to verify these claims. This paper details work done by the DS@GT team for CLEF 2025 CheckThat! Lab Task 4b Scientific Claim Source Retrieval which seeks to find relevant scientific papers based on implicit references in tweets. Our team explored 6 different data augmentation techniques, 7 different retrieval and reranking pipelines, and finetuned a bi-encoder. Achieving an MRR@5 of 0.58, our team ranked 16th out of 30 teams for the CLEF 2025 CheckThat! Lab Task 4b, and improvement of 0.15 over the BM25 baseline of 0.43. Our code is available on Github at https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4b.

社会媒体用户往往在不列举这些主张来源的情况下提出科学主张,从而需要核实这些主张。本文详细介绍了DS@GT团队为CLEF 2025 CLEF 校验!实验室的科学索赔源检索第4b号任务 4b 科学索赔源检索val,该任务试图根据在推特中暗含的引用找到相关的科学论文。我们的团队探讨了6种不同的数据增强技术,7种不同的检索和重新排位管道,并对双编码器进行了微调。实现了0.58 MRR@5, 我们的团队在CLEF 2025 CheckTH!实验室任务4b 的30个团队中排第16位,比BM25基线0.43改进了0.15。我们的代码可在 Github上查阅 https://github.com/dsgt-arc/ checkthat-2025-swd/tree/main/subtask-4b。

Article 171

Title@2025-07-09 (3): AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework

Title: AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework

AutoPrep: Natural Language Question-Aware Datenvorbereitung mit einem Multi-Agent Framework

AUPPrep: 具有多机构框架的自然语言问题软件数据编制 2412.10422v4

Authors (6): Meihao Fan, Ju Fan, Nan Tang, Lei Cao, Guoliang Li, Xiaoyong Du

Answering natural language (NL) questions about tables, known as Tabular Question Answering (TQA), is crucial because it allows users to quickly and efficiently extract meaningful insights from structured data, effectively bridging the gap between human language and machine-readable formats. Many of these tables are derived from web sources or real-world scenarios, which require meticulous data preparation (or data prep) to ensure accurate responses. However, preparing such tables for NL questions introduces new requirements that extend beyond traditional data preparation. This question-ware data preparation involves specific tasks such as column derivation and filtering tailored to particular questions, as well as question-aware value normalization or conversion, highlighting the need for a more nuanced approach in this context. Because each of the above tasks is unique, a single model (or agent) may not perform effectively across all scenarios. In this paper, we propose AutoPrep, a large language model (LLM)-based multiagent framework that leverages the strengths of multiple agents, each specialized in a certain type of data prep, ensuring more accurate and contextually relevant responses. Given an NL question over a table, AutoPrep performs data prep through three key components. Planner: Determines a logical plan, outlining a sequence of high-level operations. Programmer: Translates this logical plan into a physical plan by generating the corresponding low-level code. Executor: Executes the generated code to process the table. To support this multi-agent framework, we design a novel Chain-ofClauses reasoning mechanism for high-level operation suggestion, and a tool-augmented method for low-level code generation.

回答关于表格的自然语言(NL)问题,称为“Tabulal 问答”(TQA),至关重要,因为它使用户能够快速和高效地从结构化数据中获取有意义的见解,从而有效地弥合人类语言和机器可读格式之间的差距。许多这些表格来自网络来源或现实世界情景,这需要细致的数据准备(或数据准备)以确保准确的答复。然而,为NL问题准备这样的表格提出了超越传统数据准备的新要求。这种问题软件数据编制涉及具体的任务,如专为特定问题而设计的列转换和过滤,以及质觉值正常化或转换,强调在这方面需要更细致的方法。由于上述每一项任务都是独特的,单一模型(或代理)可能无法在所有情景中有效运行。在本文件中,我们提出了AutoPrep,一个大型语言模型(LLLM)基于多剂框架,利用多种剂的优势,每种类型的数据预选,确保更准确和符合背景的响应。鉴于NLO-L-L-S-A-A-O-O-O-D-D-L-D-L-L-L-L-L-L-L-L-LOD-L-I-L-I-L-LOD-I-I-LOD-LVOL-L-L-L-L-L-L-L-L-L-L-LV-LV-L-L-L-L-LV-L-L-L-L-L-L-L-L-L-L-L-L-LV-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-L-

Article 172

Title@2025-07-09 (3): FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction

Title: FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction

FiRST: Feinsteuerungs-Router-Selective Transformer für Input-Adaptive Latenzreduktion

FIRST: 用于减少投入-适应性延迟的微调路由器-选择式变换器 2410.12513v3

Authors (4): Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal

Auto-regressive Large Language Models (LLMs) demonstrate remarkable performance across different domains such as vision and language processing. However, due to sequential processing through a stack of transformer layers, autoregressive decoding faces significant computation/latency challenges, particularly in resource-constrained environments like mobile and edge devices. Existing approaches in literature that aim to improve latency via skipping layers have two distinct flavors - 1) Early exit, and 2) Input-agnostic heuristics where tokens exit at pre-determined layers irrespective of input sequence. Both the above strategies have limitations - the former cannot be applied to handle KV Caching necessary for speed-ups in modern framework and the latter does not capture the variation in layer importance across tasks or more generally, across input sequences. To address both limitations, we propose FiRST, an algorithm that reduces inference latency by using layer-specific routers to select a subset of transformer layers adaptively for each input sequence - the prompt (during the prefill stage) decides which layers will be skipped during decoding. FiRST preserves compatibility with KV caching enabling faster inference while being quality-aware. FiRST is model-agnostic and can be easily enabled on any pre-trained LLM. Our approach reveals that input adaptivity is critical - indeed, different task-specific middle layers play a crucial role in evolving hidden representations depending on tasks. Extensive experiments show that FiRST significantly reduces latency while outperforming other layer selection strategies in quality metics. It retains competitive performance to base model (without layer skipping) and in some cases, even improves upon it. FiRST is thus a promising and efficient solution for LLM deployment in low-resource environments.

自动递减大型语言模型(LLMS)在视觉和语言处理等不同领域表现出显著的绩效。但是,由于通过一系列变压层连续处理,自动递减解解码面临重大的计算/延迟挑战,特别是在资源受限制的环境中,特别是在移动和边缘设备等移动和边缘设备。现有的文献方法旨在通过跳过层来改善延缓性,有两个不同的口味:(1) 早期退出,和(2) 输入- 异性超常, 标志在预定层中退出, 而不考虑输入序列。上述两种战略都有其局限性 - 前者不能用于在现代框架中处理速度加速所需的KVV Cashing, 而后者无法在任务或更普遍地在输入序列中反映层重要性的变异性。为了应对这两种局限性,我们提议FiRST, 一种算法,通过使用特定层路由特定的路由选择可适应的变异性层序列选择一组变异层。快速(在填充前阶段) 确定哪些层在解析期间会跳。 FiRST 将快速调调调调调高度,而稳定性战略在基础选择中可以快速的稳定性,在稳定级中进行快速变压。

Article 173

Title@2025-07-09 (3): InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior

Title: InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior

InvestAlign: Überwindung von Datenknappheit bei der Ausrichtung großer Sprachmodelle mit Entscheidungsprozessen von Investoren unter Herdverhalten

投资对称:克服在将大语言模式与牧民行为下的投资者决策过程相匹配方面的数据稀缺性 2507.06528v1

Authors (6): Huisheng Wang, Zhuoshi Pan, Hangjing Zhang, Mingxiao Liu, Hanqing Gao, H. Vicky Zhao

Aligning Large Language Models (LLMs) with investor decision-making processes under herd behavior is a critical challenge in behavioral finance, which grapples with a fundamental limitation: the scarcity of real-user data needed for Supervised Fine-Tuning (SFT). While SFT can bridge the gap between LLM outputs and human behavioral patterns, its reliance on massive authentic data imposes substantial collection costs and privacy risks. We propose InvestAlign, a novel framework that constructs high-quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems rather than complex scenarios. Our theoretical analysis demonstrates that training LLMs with InvestAlign-generated data achieves faster parameter convergence than using real-user data, suggesting superior learning efficiency. Furthermore, we develop InvestAgent, an LLM agent fine-tuned with InvestAlign, which demonstrates significantly closer alignment to real-user data than pre-SFT models in both simple and complex investment problems. This highlights our proposed InvestAlign as a promising approach with the potential to address complex optimal investment problems and align LLMs with investor decision-making processes under herd behavior. Our code is publicly available at https://github.com/thu-social-network-research-group/InvestAlign.

将大型语言模型(LLMs)与放牧行为下的投资者决策过程相匹配,是行为金融中的一个关键挑战,它面临一个根本性的局限性:监督精度调整所需的实际用户数据稀缺。虽然SFT可以弥合LLM产出与人类行为模式之间的差距,但对大量真实数据的依赖导致大量的收集成本和隐私风险。我们提议InvestAlign,这是一个创新框架,通过利用理论解决方案解决类似和简单的最佳投资问题,而不是复杂的情景,来构建高质量的SFT数据集。我们的理论分析表明,对InvestAl-产生的数据进行的培训比使用实际用户数据更快的参数趋同,表明学习效率更高。此外,我们开发了InvestAgents,一个与Investal-Allign公司进行微调的LMM代理商,这表明在简单和复杂的投资问题中与SFT前模式都非常接近。这突出了我们拟议的InvestAlignAlign,这是一种很有希望的方法,能够解决复杂的最佳投资问题,并使LLMSMsMs公司与在她的行为中进行公开研究。

Article 174

Title@2025-07-09 (3): FinSphere, a Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs and Domain Tools

Title: FinSphere, a Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs and Domain Tools

FinSphere, ein Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs und Domain Tools

FinSphere,实时股票分析代理商,使用指示引导LMs和域工具进行实时股票分析 2501.12399v2

Authors (5): Shijie Han, Jingshu Zhang, Yiqing Shen, Kaiyuan Yan, Hongguang Li

Current financial large language models (FinLLMs) struggle with two critical limitations: the absence of objective evaluation metrics to assess the quality of stock analysis reports and a lack of depth in stock analysis, which impedes their ability to generate professional-grade insights. To address these challenges, this paper introduces FinSphere, a stock analysis agent, along with three major contributions: (1) AnalyScore, a systematic evaluation framework for assessing stock analysis quality, (2) Stocksis, a dataset curated by industry experts to enhance LLMs’ stock analysis capabilities, and (3) FinSphere, an AI agent that can generate high-quality stock analysis reports in response to user queries. Experiments demonstrate that FinSphere achieves superior performance compared to both general and domain-specific LLMs, as well as existing agent-based systems, even when they are enhanced with real-time data access and few-shot guidance. The integrated framework, which combines real-time data feeds, quantitative tools, and an instruction-tuned LLM, yields substantial improvements in both analytical quality and practical applicability for real-world stock analysis.

目前的大型财务语言模型(FinLLMS)面临两个关键限制:缺乏客观的评价指标来评估存量分析报告的质量,以及库存分析缺乏深度,这妨碍了它们产生专业水平见解的能力。为了应对这些挑战,本文件介绍了库存分析剂FinSperere, 以及三个主要贡献:(1) 分析Score,一个评估存量分析质量的系统评价框架;(2) 储存,一个由行业专家整理的数据集,以加强LLMS的存量分析能力;(3) FinSphere,一个可以针对用户询问生成高质量存量分析报告的AI代理。实验表明,FinSphere与一般和特定领域LLMS以及现有代理系统相比,其业绩优于一般和特定领域 LLMS以及基于代理的系统,即使这些系统得到了实时数据访问和短片指导的加强;综合框架,将实时数据反馈、定量工具和经指示调整的LM结合起来,在分析质量和实用适用性实体存量分析方面都取得了重大改进。

Article 175

Title@2025-07-09 (3): FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation

Title: FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation

FIFA: Unified Faithfulness Evaluation Framework für Text-to-Video und Video-to-Text Generation

FIFA: 文本到视频和视频到文字一代统一信仰评价框架 2507.06523v1

Authors (5): Liqiang Jing, Viet Lai, Seunghyun Yoon, Trung Bui, Xinya Du

Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer fro hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task (e.g., V2T) and also fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.

视频多式大语言模式(VideoMLLM)在视频到文字和文字到文字的任务方面取得了显著进展,但是,它们往往会受到幻觉的幻觉,产生与视觉输入相反的内容;现有的评价方法仅限于一项任务(例如V2T),而且没有评估开放式、自由形式响应中的幻觉;为弥补这一差距,我们建议国际足联建立一个统一的FaithFulness evAluation框架,通过空间-文字依赖性图提取全面描述事实,模拟其语义依赖性,并利用视频QA模型核实这些内容;我们进一步引入了基于工具的校正框架,即基于工具的校正框架,以修改有偏见的内容;广泛的实验表明,国际足联比现有的评价方法更密切地与人性判断一致,以及事后校正有效地提高了文本和视频生成两方面的实际一致性。

Article 176

Title@2025-07-09 (3): Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving

Title: Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving

LLMs nach ihrer Eignung lehren: Adaptive Begründung für mathematische Problemlösung

根据其适应性:数学问题解决的适应性理由 2502.12022v3

Authors (11): Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu

Existing approaches to mathematical reasoning with large language models (LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation. While efforts have been made to combine these methods, they primarily rely on post-selection or predefined strategies, leaving an open question: whether LLMs can autonomously adapt their reasoning strategy based on their inherent capabilities. In this work, we propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously, aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware data selection during supervised fine-tuning (SFT) to tailor training data to the model’s unique abilities. This approach equips LLMs to autonomously determine and apply the appropriate reasoning strategy at test time. We evaluate TATA through extensive experiments on six mathematical reasoning benchmarks, using both general-purpose and math-specialized LLMs. Empirical results demonstrate that TATA effectively combines the complementary strengths of CoT and TIR, achieving superior or comparable performance with improved inference efficiency compared to TIR alone. Further analysis underscores the critical role of aptitude-aware data selection in enabling LLMs to make effective and adaptive reasoning decisions and align reasoning strategies with model capabilities.

现有与大语言模型(LLMS)有关的数学推理方法依靠CoT(CoT)系统(CoT)系统(CoT)系统(CoT)系统(CoT)系统(LLMS)系统(LLMS)系统(TIR)系统(TIR)系统(TIR)系统(TIR)系统(CoT)系统(CoT)系统(CoT)系统(CoT)系统(CoTT)系统(CoT)系统(CoTTT)系统(COTT)系统(CoTT)系统(CoTAT)系统(TATA(LM)系统)系统(TATA)系统(TATA)系统(LM)系统)系统(TAR(T)系统)系统(TAR)系统(TAR)系统(TAR)系统(TA)系统(TATA)系统(TA(TA)系统)系统(TLM)系统)系统(TAR(TAR)系统)系统(TAR)系统(TAR(TAR)系统)系统(TA)系统)系统)系统(TA(TAR)系统)系统(TA(TAR(S)系统)系统)系统(TLM)系统(S)系统)系统(S)系统)系统(TLM)系统)系统)系统(TAR(S(TA)系统(TLLLLM)系统(S)系统(TLLM)系统(S)系统(S)系统)系统(S)系统(S)系统)系统)系统(TA)系统(TA)系统)系统)系统(TA)系统(TA)系统(TA)系统)系统)系统)系统(S)系统(S)系统(S)系统(S)系统(S)系统(TLLLM)系统(S)系统(系统(S(S)系统(S)系统(TA)系统(S)系统(S)系统)系统(S)系统)系统)系统)系统)系统)系统)系统(TA)系统(系统(TA)系统(TA)系统(TLMLMLMLM

Article 177

Title@2025-07-09 (3): DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

Title: DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

DeepTalk: Auf dem Weg zu nahtloser und intelligenter Sprachinteraktion mit adaptiver Modalität-spezifischer MoE

深谈:实现与适应型模式具体部的无缝和智能语音互动 2506.21864v2

Authors (9): Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Lijiang Li, Zuwei Long, Bo Tong, Ke Li, Xing Sun

Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.

本地多式大型语言模型(MLLMM)将单一大型语言模型(LLMM)重组为既具有语言和文字生成能力的口语模型(LLM)。与模块化和匹配的MLLMM相比,本地MLMLM保留了更丰富的语言特征,如情感和手动等,并在主干线(LLM)结构中直接生成语音响应,而不是使用单独的语音解码器。这种整合还导致应对液化和更顺畅的互动。然而,本地MLMS遭受了灾难性的忘记和性能退化,因为现有的配对语言文本数据不足以支持MLLMS的预培训,而与前导文字文本LLMM的大量文本数据相比,DLLMLMM数据模型保存了5.5%的性能下降,与最初的LLMMLM(LM)模型的正常水平相比,正常水平比正常水平低。

Article 178

Title@2025-07-09 (3): Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

Title: Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

Double-Checker: Bessere Begründung von langsam denkenden LLMs über selbstkritische Feinsteuerung

双重检查者:通过自批评性微调,加强慢思考低迷LMs的理由 2506.21285v2

Authors (14): Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin

While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the “aha moment:, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique. Our codes and data are available at https://github.com/XinXU-USTC/DoubleChecker

虽然思维迟缓的大型语言模型(LLMs)的反映式推理(通常被称为“aha moment ” ) , 通常被称为“aha moment : ” , 其生成信息化评论和完善先前解决方案的能力仍然有限。在本文中,我们引入了双检查器,这是一个原则性框架,目的是通过促进对以往解决方案进行明确的自我精细和迭接完善来提高慢思考的LLMs的推理能力。通过微调我们整理的1 730个自我临界实例,双检查器赋予长的CoT LLMs在推断中反复批评和完善其产出的能力,直到他们根据自我生成的批评来评估其解决方案是否正确为止。我们验证了双检查器在一套综合推理基准中的效率,表明反复自读性自我精度能极大地增强了长期CoTLMs的推理能力。值得注意的是,我们的双检查器提高了在挑战性AIME基准方面从4.4%到18.2%到18.2%的成绩。这些结果凸显了发展更可信和有效的LMSLMs(LMs) 能够进行结构化的自我/Uritxxxxxx/ 数据。

Article 179

Title@2025-07-09 (3): SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers

Title: SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers

SpindleKV: Eine neuartige KV-Cache-Reduktionsmethode, die sowohl schwammige als auch tiefe Schichten ausbalanciert

SpindleKV: 浅层和深层平衡的新KV缓存减少方法 2507.06517v1

Authors (7): Zicong Tang, Shi Luohe, Zuchao Li, Baoyuan Qi, Guoming Liu, Lefei Zhang, Ping Wang

Large Language Models (LLMs) have achieved impressive accomplishments in recent years. However, the increasing memory consumption of KV cache has possessed a significant challenge to the inference system. Eviction methods have revealed the inherent redundancy within the KV cache, demonstrating its potential for reduction, particularly in deeper layers. However, KV cache reduction for shallower layers has been found to be insufficient. Based on our observation that, the KV cache exhibits a high degree of similarity. Based on this observation, we proposed a novel KV cache reduction method, SpindleKV, which balances both shallow and deep layers. For deep layers, we employ an attention weight based eviction method, while for shallow layers, we apply a codebook based replacement approach which is learnt by similarity and merging policy. Moreover, SpindleKV addressed the Grouped-Query Attention (GQA) dilemma faced by other attention based eviction methods. Experiments on two common benchmarks with three different LLMs shown that SpindleKV obtained better KV cache reduction effect compared to baseline methods, while preserving similar or even better model performance.

大型语言模型(LLMs)近年来取得了令人印象深刻的成就。然而,KV缓存的内存消耗量不断增长,对推断系统提出了重大挑战。驱逐方法揭示了KV缓存内固有的冗余,表明它有可能减少,特别是在更深的层层中。然而,发现对浅层的KV缓存减少不够充分。根据我们的观察,KV缓存显示出高度相似性。基于这一观察,我们建议采用一种新的KV缓存减少方法SpindleKV,它平衡浅层和深层。对于深层,我们采用了基于注意的权重驱逐方法,而对于浅层,我们采用了基于代码的代用方法,通过相似性和合并政策来学习。此外,SpindleKV谈到了基于其他注意方法所面临的GQA困境。实验用三个不同的LMSingdleKV的共同基准显示,SpindleKV比基线方法得到更好的KV缓存效果,同时保持相似或更好的模型性。

Article 180

Title@2025-07-09 (3): Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs

Title: Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs

Breaking PEFT Limitations: Schwach-zu-starken Wissenstransfer für Backdoor-Angriffe in LLMs

打破PEFT的局限性:利用微弱至收缩知识转让来利用LLMs进行后门攻击 2409.17946v4

Authors (8): Shuai Zhao, Leilei Gan, Zhongliang Guo, Xiaobao Wu, Yanhao Jia, Luwei Xiao, Cong-Duy Nguyen, Luu Anh Tuan

Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning (FPFT). However, this kind of backdoor attack is limited since they require significant computational resources, especially as the size of LLMs increases. Besides, parameter-efficient fine-tuning (PEFT) offers an alternative but the restricted parameter updating may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel backdoor attack algorithm from the weak-to-strong based on Feature Alignment-enhanced Knowledge Distillation (FAKD). Specifically, we poison small-scale language models through FPFT to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through FAKD, which employs PEFT. Theoretical analysis reveals that FAKD has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of FAKD on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.

大型语言模型(LLMS)尽管因其特殊能力而被广泛应用,但已被证明很容易受到幕后攻击。这些攻击通过毒害培训样本和全参数微调(FPFFT)将目标弱点引入LLMS。然而,这种幕后攻击是有限的,因为它们需要大量的计算资源,特别是由于LLMS规模的扩大。此外,参数高效微调(PEFT)提供了一种替代办法,但有限的参数更新可能阻碍触发点与目标标签的配合。在本研究中,我们首先核实PEFT的后门攻击可能在实现可行性能方面遇到困难。为了解决这些问题和提高PEFT的后门攻击的效力,我们提议从弱到强的后门攻击算法是有限的。此外,我们通过FPFFT对小规模语言模型下毒害小型语言模型作为教师模型。然后秘密地通过FAKD将后门到大型学生模型,使用PEFFTFT。理论分析显示,FAKD的后门攻击效率是四大攻击等级,我们FAFA系统攻击的后算算算算算。我们四种攻击的上攻击的后算算算。我们四种攻击的系统攻击的后算算算算算算。我们攻击的后攻击的后攻击的反的反的系统攻击的反攻击的反攻击的系统结构结构结构。

Article 181

Title@2025-07-09 (3): Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings

Title: Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings

Pun Intended: Multi-Agent Übersetzung von Wordplay mit Kontrastivem Lernen und phonetisch-semantischen Einbettungen

未预期:多代理翻译与矛盾学习和语音-语言嵌入的文字游戏的多代理翻译 2507.06506v1

Authors (3): Russell Taylor, Benjamin Herbert, Michael Sana

Translating wordplay across languages presents unique challenges that have long confounded both professional human translators and machine translation systems. This research proposes a novel approach for translating puns from English to French by combining state-of-the-art large language models with specialized techniques for wordplay generation. Our methodology employs a three-stage approach. First, we establish a baseline using multiple frontier large language models with feedback based on a new contrastive learning dataset. Second, we implement a guided chain-of-thought pipeline with combined phonetic-semantic embeddings. Third, we implement a multi-agent generator-discriminator framework for evaluating and regenerating puns with feedback. Moving beyond the limitations of literal translation, our methodology’s primary objective is to capture the linguistic creativity and humor of the source text wordplay, rather than simply duplicating its vocabulary. Our best runs earned first and second place in the CLEF JOKER 2025 Task 2 competition where they were evaluated manually by expert native French speakers. This research addresses a gap between translation studies and computational linguistics by implementing linguistically-informed techniques for wordplay translation, advancing our understanding of how language models can be leveraged to handle the complex interplay between semantic ambiguity, phonetic similarity, and the implicit cultural and linguistic awareness needed for successful humor.

翻译跨语言的文字游戏提出了独特的挑战,这些挑战长期以来一直困扰着专业的人类翻译员和机器翻译系统。本研究提出了将最先进的大语言模型与文字剧生成的专门技术相结合,把英语译成法语的新颖方法。我们的方法采用三阶段方法。首先,我们利用基于新的对比学习数据集的反馈,用多种前沿大语言模型建立一个基线。第二,我们实施一个带指导的思维链管道,将语音-语义嵌入结合起来。第三,我们实施一个多试剂的生成器/差异器框架,用于评估和用反馈重新生成标本。超越了文字翻译的局限性,我们的方法的主要目标是捕捉取原始文字剧的语言创造力和幽默感,而不是简单地复制其词汇。我们的最佳运行在CLEF JOKER 2025任务2中获得了第一和第二名的反馈。我们的研究通过应用语言智能化翻译技术解决了翻译和计算语言语言学之间的差距,推进了我们对于复杂语言模范模式的理解,从而能够利用复杂的语言模棱两极性来控制语言模型。

Article 182

Title@2025-07-09 (3): GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification

Title: GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification

GMLM: Überbrückung von Graph Neuronalen Netzwerken und Sprachmodellen für heterophile Knotenklassifikation

GMLM: 血氧哲学节点分类的架接图形神经网络和语言模型 2503.05763v5

Authors (1): Aarush Sinha

Integrating powerful but computationally expensive Pre-trained Language Models (PLMs) with Graph Neural Networks (GNNs) is a key challenge, especially on text-rich heterophilic graphs. We propose the Graph Masked Language Model (GMLM), a framework designed for the efficient and effective fusion of graph structure and text semantics. GMLM employs a two-stage process: first, a contrastive pre-training stage with a novel soft masking technique builds a robust multi-scale GNN; second, an end-to-end fine-tuning stage uses a dynamic active node selection strategy for scalability and a bi-directional cross-attention module for deep fusion. Experiments on five heterophilic benchmarks show GMLM achieves state-of-the-art results on four, significantly outperforming prior GNN and large LLM-based methods. For instance, it improves accuracy on the Texas dataset by over 8\% and on Wisconsin by nearly 5\%. Our work demonstrates that a sophisticated, deeply-integrated architecture can be more effective and efficient than larger, general-purpose models for text-rich graph representation learning.

将强大但计算上昂贵的预科语言模型(PLM)与图表神经网络(GNN)结合是一个关键的挑战,特别是在文本丰富的热活性图形方面。我们提议了图形遮盖语言模型(GMLM),这是一个旨在高效和有效地融合图形结构和文字语义的框架。GMLM采用一个两阶段过程:首先,一个具有新颖软面罩技术的对比性培训前阶段,建立起一个强大的多级软面罩技术;第二,一个端到端微调阶段,使用动态的主动节点选择战略,用于可伸缩性,以及一个双向交叉注意模块,用于深度聚变。关于五种电子遮盖语言模型的实验显示,GMLM在四种情况下取得了最新的结果,大大超过GNN和大型LM方法。例如,它提高了德州数据集的精确度,由8以上和威斯康星州数据组合的精确度,近5。我们的工作表明,一个精密、深入一体化的结构可以比更大规模、通用的文本学习图形模型更有效和高效。

Article 183

Title@2025-07-09 (3): ModelCitizens: Representing Community Voices in Online Safety

Title: ModelCitizens: Representing Community Voices in Online Safety

ModelCitizens: Vertretung von Community Voices in Online-Sicherheit

模范公民:在网上安全方面代表社区声音 2507.05455v2

Authors (7): Ashima Suvarna, Christina Chance, Karolina Naranjo, Hamid Palangi, Sophie Hao, Thomas Hartvigsen, Saadia Gabriel

Automatic toxic language detection is critical for creating safe, inclusive online spaces. However, it is a highly subjective task, with perceptions of toxic language shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. To capture the role of conversational context on toxicity, typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on context-augmented posts. Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on MODELCITIZENS, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation. The data, models and code are available at https://github.com/asuvarna31/modelcitizens.

自动检测有毒语言对于创建安全、包容的在线空间至关重要。然而,这是一个高度主观性的任务,对社区规范和活生生的经验所形成的有毒语言有认识。现有的毒性检测模型通常在说明方面受过培训,说明将各种批注的视角破碎成单一的地面真相,消除重要的因地制宜的毒性概念,如重新使用的语言。为了解决这个问题,我们引入了MODELCITIZENS,一个6.8K社交媒体文章和40K毒性说明的数据集,在不同身份群体中,6.8K社交媒体文章和40K毒性说明的数据集。为了了解关于毒性的谈话背景作用,社会媒体文章的典型特征,我们用LLOMM生成的谈话场景来增加MODELCITIZENS的设置。最先进的毒性检测工具(例如OpenAI Modelation API、GPT-o4-minima)在MODELCITIZENS上不完善了MOLMA和Gemma模型的模型,在Slimia-deal-deal-deal-deal-dealismal4中,这是我们对Slimal-deal-deal-deal 的模拟/deal 的模型/demodistral 的模型和Slimal的发现的重要性。

Article 184

Title@2025-07-09 (3): Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations

Title: Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations

Verfeinerung von gesketen Wahrnehmungen in visionssprachlichen Kontrastmodellen durch visuelle Darstellungen

通过视觉表现方式改进视觉语言对比模型中偏斜的观念 2405.14030v3

Authors (2): Haocheng Dai, Sarang Joshi

Large vision-language contrastive models (VLCMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLCM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP’s embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more effective to refine the skewed perceptions in VLCMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our code can be found here.

大型视觉语言对比模型(VLCM),如CLIP等,已成为基础模型,表明在各种下游任务中取得了显著成功。这些模型尽管有其优势,但与其他基础系统相似,继承了真实世界数据不相称分布的偏差,导致对实际环境产生误解。像图像网这样的前方数据集往往充满非因果的、虚假的关联,在缺乏这些背景要素的情况下,它们可能削弱VLCM的性能。本研究对简单线性探测器如何有效地从 CLIP 嵌入下游应用中提取具体任务的核心特征进行了调查。我们的分析显示,CLIP 文本表达方式经常受到错误的关联的污染,这些关联在偏颇的培训前数据集中被继承。实证证据表明,依靠CLIP 的视觉表达方式而不是文字嵌入式,对于完善VLCMS的扭曲的认知方式更为有效,强调视觉表达方式在克服嵌入偏见方面的优势。这里可以找到我们的代码。

Article 185

Title@2025-07-09 (3): Can adversarial attacks by large language models be attributed?

Title: Can adversarial attacks by large language models be attributed?

Können feindliche Angriffe von großen Sprachmodellen zugeschrieben werden?

大型语言模式的对抗性攻击能否归结为对抗性攻击? 2411.08003v2

Authors (3): Manuel Cebrian, Andres Abeliuk, Jan Arne Telle

Attributing outputs from Large Language Models (LLMs) in adversarial settings-such as cyberattacks and disinformation campaigns-presents significant challenges that are likely to grow in importance. We approach this attribution problem from both a theoretical and an empirical perspective, drawing on formal language theory (identification in the limit) and data-driven analysis of the expanding LLM ecosystem. By modeling an LLM’s set of possible outputs as a formal language, we analyze whether finite samples of text can uniquely pinpoint the originating model. Our results show that, under mild assumptions of overlapping capabilities among models, certain classes of LLMs are fundamentally non-identifiable from their outputs alone. We delineate four regimes of theoretical identifiability: (1) an infinite class of deterministic (discrete) LLM languages is not identifiable (Gold’s classical result from 1967); (2) an infinite class of probabilistic LLMs is also not identifiable (by extension of the deterministic case); (3) a finite class of deterministic LLMs is identifiable (consistent with Angluin’s tell-tale criterion); and (4) even a finite class of probabilistic LLMs can be non-identifiable (we provide a new counterexample establishing this negative result). Complementing these theoretical insights, we quantify the explosion in the number of plausible model origins (hypothesis space) for a given output in recent years. Even under conservative assumptions-each open-source model fine-tuned on at most one new dataset-the count of distinct candidate models doubles approximately every 0.5 years, and allowing multi-dataset fine-tuning combinations yields doubling times as short as 0.28 years. This combinatorial growth, alongside the extraordinary computational cost of brute-force likelihood attribution across all models and potential users, renders exhaustive attribution infeasible in practice.

将大语言模型(LLMs)的输出归结为对抗性环境(如网络攻击和不信息运动)中的大语言模型(LLMs)的产出,这带来了可能越来越重要的重大挑战。我们从理论角度和经验角度处理这一归因问题,借鉴的是正式语言理论(在限度内确定)和对不断扩大的LLM生态系统的数据驱动分析。通过将LLM的一组可能的产出建模成一种正式语言,我们分析的是文本的有限样本是否能够独特地定位出原模型。我们的结果显示,在对模型之间能力重叠的微小假设下,某些LLMs类别基本上无法从它们的输出中辨别出来。我们从理论的明显可辨别性角度从理论角度和实验的角度来处理这一归别问题。我们界定了四种不同的理论可辨别性制度:(1) 无限的确定性(差异性) LLMMs语言是1967年的经典结果;(2) 无限的概率LMs(通过确定性案例的延伸,确定性案例)(3) 确定性组合模型是可识别的(符合Angluin loudalalalal commal ex ex ex liversation liversation ex) as the the the folview ex in the folver ex ex ex ex the folview lievations in the fearmations in the ex the ex ex the folviolview ex immations impolverations imations ex ex ex immationsmations ex the the the the the thes ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex the the the the flipoltime ex ex thes ex ex ex ex ex ex the the thesmationsmationsmations mations mations mations ex ex ex ex ex ex the the the the the the the the the the thes ex.

Article 186

Title@2025-07-09 (3): TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

Title: TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

TokenSwift: Lossless Acceleration der Ultra Long Sequence Generation

TokenSwift: 超长序列一代无损加速 2502.18890v2

Authors (5): Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng

Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model’s inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.

以大型语言模型(LLMS)生成超长序列已变得越来越重要,但仍然是一项高度耗时的任务,特别是100K类符号的序列。传统的投机性解码方法虽然存在,但只是延长其生成限制并不能加速进程,而且可能有害。通过深入分析,我们确定了阻碍高效生成的三大挑战:频繁重载模型、动态键值(KV)管理和重复生成。为解决这些问题,我们引入了TOKENSWIFT,这是一个新颖的框架,旨在大大加快超长序列的生成过程,同时保持目标模型的固有质量。实验结果显示,TOKENSWIFT在不同规模(1.5B、7B、8B、14B)和结构(MAHA、GQA)和结构(MHA、GQA)之间实现了三次以上的加速。这种加速将超长序列生成的时间节省到几个小时,将TOKENSWIFT确定为一个空前的可扩展和有效的解决方案。代码可以在 https://github.com/bigig-nlco/TokenSwift上找到。

Article 187

Title@2025-07-09 (3): On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Title: On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Über die Robustheit des verbalen Vertrauens von LLMs bei feindlichen Angriffen

关于在相互攻击中女优女士的口头信任的有力性 2507.06489v1

Authors (2): Stephen Obadinma, Xiaodan Zhu

Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to ensure transparency, trust, and safety in human-AI interactions across many high-stakes applications. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods, and show that these attacks can significantly jeopardize verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current confidence elicitation methods are vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the urgent need to design more robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.

大型语言模型(LLMs)产生的强烈口头信心对于部署LLMs至关重要,以确保在人类-AI互动中,在众多高接触应用中的透明度、信任和安全性。在本文件中,我们提出了关于在对抗性攻击中口头信心强健度的第一份全面研究报告。我们引入了一个新的框架,通过扰动和以破狱为基础的方法打击口头信心分数,并表明这些攻击可能大大危及口头信心估计,导致频繁的回答变化。我们审视了各种激励性战略、模型大小和应用领域,揭示了目前的信任获取方法非常脆弱,而且常用的防御技术基本上无效或适得其反。我们的调查结果强调,迫切需要设计更强有力的机制,在LLMs中表达信心,因为即使是微妙的语义保留修改也可能导致误导对反应的信心。

Article 188

Title@2025-07-09 (3): Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making

Title: Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making

Kann LLMs spielen Ô ên Quan Spiel? Eine Studie über mehrstufige Planung und Entscheidungsfindung

LLM女士能玩游戏吗?多步规划和决策研究。 2507.03711v3

Authors (6): Sang Quang Nguyen, Kiet Van Nguyen, Vinh-Tiep Nguyen, Thanh Duc Ngo, Ngan Luu-Thuy Nguyen, Duy-Dinh Le

In this paper, we explore the ability of large language models (LLMs) to plan and make decisions through the lens of the traditional Vietnamese board game, \^O \u{A}n Quan. This game, which involves a series of strategic token movements and captures, offers a unique environment for evaluating the decision-making and strategic capabilities of LLMs. Specifically, we develop various agent personas, ranging from aggressive to defensive, and employ the \^O \u{A}n Quan game as a testbed for assessing LLM performance across different strategies. Through experimentation with models like Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct, we aim to understand how these models execute strategic decision-making, plan moves, and manage dynamic game states. The results will offer insights into the strengths and weaknesses of LLMs in terms of reasoning and strategy, contributing to a deeper understanding of their general capabilities.

在本文中,我们探索了大型语言模型(LLMs)通过传统的越南棋盘游戏(O\u{A}n Quan)的透镜来规划和决策的能力。这场游戏涉及一系列战略象征性移动和捕捉,为评估磁盘模型的决策和战略能力提供了独特的环境。具体地说,我们开发了从攻击到防御的多种代理人,并使用O\u{A}n Quan游戏作为评估不同战略的LMM业绩的测试台。通过尝试Llama-3.2-3B-Instruct、Llama-3.1-8B-Instruct和Llama-3.3-70B-Instruct等模型,我们的目的是了解这些模型如何执行战略决策、计划动作和管理动态游戏状态。结果将使人们深入了解磁盘在推理和战略方面的长处和弱点,有助于更深入了解它们的总体能力。

Article 189

Title@2025-07-09 (3): Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

Title: Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

Video-RTS: Rethinking Verstärktes Lernen und Test-Time Scaling für effizientes und verbessertes Video-Reasoning

视频-RTS:重新思考强化学习和测试时间的尺度,以提高效率和增强视频理由 2507.06485v1

Authors (6): Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and finetuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy using only 3.6% training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS’s strong reasoning performance.

尽管在强化学习(RL)的视频推理和大型语言模型(LLMS)的强化学习(LLMS)方面取得了进步,但数据收集和微调仍然是重大挑战,这些方法往往依靠大规模监督监管的微调(SFT),拥有广泛的视频数据和长搜索链(COT)的注释,因此成本高,规模更难。为此,我们介绍了视频RTS,这是通过将数据效率高的RL与视频适应性测试时间缩放(TTS)战略相结合,提高视频推理能力的新方法,大大提高了数据效率。根据对RL样本数据缩放的观察,我们跳过资源密集型SFT步骤,采用高效的纯RL培训(SFT)和基于产出的奖励(SFT), 高效的纯RFT培训(S),不需要额外的说明或广泛的微调。此外,为了更高效地利用计算资源,我们引入了稀少到的视频TTTTTTS战略,通过反复添加基于产出一致性的框架来提高推断力。我们用多种视频推理算基准的方法,显示视频-RTS超过现有的视频推理推理模型推理模型,显示,只有2.4%的精度平均精度,只有2.4%的精度的精度的精度,只的精度,而只的精度,而只使用最近的S,具有性比度的VBS的S,在视频比性推力的VB性推。

Article 190

Title@2025-07-09 (3): Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents

Title: Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents

Japanisch lernen mit Jouzu: Interaktionsergebnisse mit stylisierten Dialog-Fiktivagenten

与Jouzu一起学习日文:与固定对话分形媒介的互动结果 2507.06483v1

Authors (2): Zackary Rackauckas, Julia Hirschberg

This study investigates how stylized, voiced agents shape user interaction in a multimodal language learning environment. We conducted a mixed-methods evaluation of 54 participants interacting with anime-inspired characters powered by large language models and expressive text-to-speech synthesis. These agents responded in Japanese character language, offering users asynchronous, semi-structured conversation in varying speech styles and emotional tones. We analyzed user engagement patterns, perceived usability, emotional responses, and learning behaviors, with particular attention to how agent stylization influenced interaction across language proficiency levels and cultural backgrounds. Our findings reveal that agent design, especially voice, persona, and linguistic style, substantially affected user experience, motivation, and strategy. This work contributes to the understanding of affective, culturally stylized agents in human-agent interaction and offers guidance for designing more engaging, socially responsive systems.

这项研究调查了在多语种学习环境中的典型化代言人如何塑造用户互动。我们对54名参与者进行了混合评估,他们与由大型语言模型和表达式文本到语音合成所驱动的动因型人物进行互动。这些代言人用日文语言进行了回应,以不同语言风格和情感调子向用户提供非同步、半结构式对话。我们分析了用户参与模式、感觉到的可用性、情感反应和学习行为,尤其关注了该代言体如何影响不同语言熟练程度和文化背景的互动。我们的调查结果显示,该代言的设计,特别是声音、人和语言风格,严重影响了用户的经验、动力和战略。这项工作有助于理解人体-代理人互动中的具有情感、文化型的代言者,并为设计更具参与性、具有社会敏感性的系统提供了指导。

Article 191

Title@2025-07-09 (3): InfoTech Assistant: A Multimodal Conversational Agent for InfoTechnology Web Portal Queries

Title: InfoTech Assistant: A Multimodal Conversational Agent for InfoTechnology Web Portal Queries

InfoTech Assistant: Ein multimodaler Ansprechpartner für InfoTechnology Web Portal Queries

信息技术助理:信息技术网站门户查询的多模式对话代理 2412.16412v2

Authors (5): Sai Surya Gadiraju, Duoduo Liao, Akhila Kudupudi, Santosh Kasula, Charitha Chalasani

This pilot study presents the development of the InfoTech Assistant, a domain-specific, multimodal chatbot engineered to address queries in bridge evaluation and infrastructure technology. By integrating web data scraping, large language models (LLMs), and Retrieval-Augmented Generation (RAG), the InfoTech Assistant provides accurate and contextually relevant responses. Data, including textual descriptions and images, are sourced from publicly available documents on the InfoTechnology website and organized in JSON format to facilitate efficient querying. The architecture of the system includes an HTML-based interface and a Flask back end connected to the Llama 3.1 model via LLM Studio. Evaluation results show approximately 95 percent accuracy on domain-specific tasks, with high similarity scores confirming the quality of response matching. This RAG-enhanced setup enables the InfoTech Assistant to handle complex, multimodal queries, offering both textual and visual information in its responses. The InfoTech Assistant demonstrates strong potential as a dependable tool for infrastructure professionals, delivering high accuracy and relevance in its domain-specific outputs.

这份试点研究介绍了InfoTech助理的开发情况,该助理是专门为处理桥梁评估和基础设施技术中的查询而设计的域域、多式联运聊天室。通过整合网络数据剪切、大语言模型和检索启动一代(RAG),InfoTech助理提供了准确和符合背景的答复。数据,包括文字描述和图像,来源于信息技术网站的公开文件,并以JSON格式组织,以便利有效查询。该系统的结构包括基于HTML的界面和通过LLLLM Studio连接Llama 3.1模型的Flask后端。评价结果显示,在特定领域任务上大约95%的准确度,高相似性分数确认响应的质量。这一RAG-encancup使InfoTech助理能够处理复杂、多式查询,在答复中提供文字和视觉信息。InfoTech助理展示了作为基础设施专业人员可靠工具的巨大潜力,在其具体领域产出中提供高精度和相关性。

Article 192

Title@2025-07-09 (3): Theme-Explanation Structure for Table Summarization using Large Language Models: A Case Study on Korean Tabular Data

Title: Theme-Explanation Structure for Table Summarization using Large Language Models: A Case Study on Korean Tabular Data

Thema-Erläuterungsstruktur für die Tabellenzusammenfassung mit großen Sprachmodellen: Eine Fallstudie zu koreanischen Tabellendaten

使用大语言模型对表格进行汇总的主题解释结构:韩国表格数据案例研究 2501.10487v3

Authors (5): TaeYoon Kwack, Jisoo Kim, Ki Yong Jung, DongGeon Lee, Heesun Park

Tables are a primary medium for conveying critical information in administrative domains, yet their complexity hinders utilization by Large Language Models (LLMs). This paper introduces the Theme-Explanation Structure-based Table Summarization (Tabular-TX) pipeline, a novel approach designed to generate highly interpretable summaries from tabular data, with a specific focus on Korean administrative documents. Current table summarization methods often neglect the crucial aspect of human-friendly output. Tabular-TX addresses this by first employing a multi-step reasoning process to ensure deep table comprehension by LLMs, followed by a journalist persona prompting strategy for clear sentence generation. Crucially, it then structures the output into a Theme Part (an adverbial phrase) and an Explanation Part (a predicative clause), significantly enhancing readability. Our approach leverages in-context learning, obviating the need for extensive fine-tuning and associated labeled data or computational resources. Experimental results show that Tabular-TX effectively processes complex table structures and metadata, offering a robust and efficient solution for generating human-centric table summaries, especially in low-resource scenarios.

表格是传播行政领域关键信息的主要媒介,但其复杂性阻碍了大语言模型(LLMs)的利用。本文介绍了基于主题的基于扩展结构的表格汇总(TABLA-TX)管道,这是一种新颖的方法,旨在从表格数据中产生高度可解释的摘要,特别侧重于韩国的行政文件。目前的表格汇总方法往往忽视了人类友好产出的关键方面。TBAL-TX首先采用多步推理程序来解决这一问题,以确保LLMS能够深入理解表格,然后由记者为清晰的生成提供提示战略。关键是,它随后将产出结构成一个主题部分(一个词词词)和一个解释部分(一个预设条款),大大提高了可读性。我们的方法利用了文字学习,避免了广泛微调和相关标签数据或计算资源的必要性。实验结果表明,TABL-TX有效地处理复杂的表格结构和元数据,为生成以人为中心的表格概要提供了可靠而有效的解决方案,特别是在低资源情景下。

Article 193

Title@2025-07-08 (2): A Systematic Analysis of Hybrid Linear Attention

Title: A Systematic Analysis of Hybrid Linear Attention

Eine systematische Analyse der hybriden linearen Aufmerksamkeit

系统分析混合线性注意 2507.06457v1

Authors (11): Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian

Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate various linear attention models across generations - vector recurrences to advanced gating mechanisms - both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall significantly improves with increased full attention layers, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced at https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.

变异器面临长序列的二次复杂和记忆问题,导致采用固定规模隐藏状态的线性关注机制。然而,线性模型往往有有限的回溯性,导致混合结构,将线性和全关注层结合起来。尽管进行了广泛的混合结构研究,但线性关注部分的选择并未深入探讨。我们系统地评估了几代人间的各种线性关注模式—-矢量复现到先进的顶级机制,包括独立和混合。为了能够进行这一全面分析,我们培训和公开来源了72个模型:36个在340M参数(20B标牌)和36个在1.3B参数(100B标牌),涵盖五个混合比率的六种线性关注变量。标准语言模型和回顾任务基准显示,高级单线性线性模型不一定优于混合体。虽然语言模型在线性至全关注层之间保持稳定,特别是3:1比率以下。我们的研究强调选择性的GRN-2或Gated-DeltotaNet等结构在6:6:1级和直线性1级/直径1级的MISB1级模型之间,我们建议采用1级的直观/直观分析。

Article 194

Title@2025-07-08 (2): A Semantic Parsing Framework for End-to-End Time Normalization

Title: A Semantic Parsing Framework for End-to-End Time Normalization

Ein semantisches Parsing-Framework für die End-to-End-Zeit-Normalisierung

端到终点时间正常化的语义分解框架 2507.06450v1

Authors (4): Xin Su, Sungduk Yu, Phillip Howard, Steven Bethard

Time normalization is the task of converting natural language temporal expressions into machine-readable representations. It underpins many downstream applications in information retrieval, question answering, and clinical decision-making. Traditional systems based on the ISO-TimeML schema limit expressivity and struggle with complex constructs such as compositional, event-relative, and multi-span time expressions. In this work, we introduce a novel formulation of time normalization as a code generation task grounded in the SCATE framework, which defines temporal semantics through symbolic and compositional operators. We implement a fully executable SCATE Python library and demonstrate that large language models (LLMs) can generate executable SCATE code. Leveraging this capability, we develop an automatic data augmentation pipeline using LLMs to synthesize large-scale annotated data with code-level validation. Our experiments show that small, locally deployable models trained on this augmented data can achieve strong performance, outperforming even their LLM parents and enabling practical, accurate, and interpretable time normalization.

时间正常化是将自然语言时间表达方式转换为机器可读表达方式的任务。它支持信息检索、问答和临床决策中许多下游应用。基于ISO-TimeML Schema的传统系统限制表达性和与组成、事件相对性和多空间时间表达方式等复杂结构的争斗。在这项工作中,我们引入了时间正常化新颖的提法,作为基于SCATE框架的代码生成任务,该框架通过象征性和组成操作者定义了时间语义。我们实施了完全可执行的SCATE Python 图书馆,并展示了大型语言模型(LLMs)能够生成可执行的 SATE 代码。我们利用这一能力开发了一个自动数据增强管道,使用LLMS 来合成大规模附加代码级验证的数据。我们的实验显示,在这种增强数据上培训的小型、可本地部署的模式能够取得很强的性能,甚至优于他们的LM父母,并且能够实现实用、准确和可解释的时间正常化。

Article 195

Title@2025-07-08 (2): Perception-Aware Policy Optimization for Multimodal Reasoning

Title: Perception-Aware Policy Optimization for Multimodal Reasoning

Perception-Aware Policy Optimization für multimodale Reasoning

对多式联运理由的观念-认知软件政策优化 2507.06448v1

Authors (11): Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.

事实证明,用可验证的奖赏强化学习(RLVR)是赋予大语言模型(LLMS)一个非常有效的战略,具有强大的多步推理能力。然而,其设计和优化仍然适应纯文字领域,导致在应用多模式推理任务时表现不优于最优性。我们特别注意到,当前多式联运推理中的一个主要错误来源在于对视觉投入的认知。为了解决这一瓶颈,我们提议采用Pervition-Aware政策优化(PAPO),这是GRPO的一个简单而有效的扩展,它鼓励该模型在学习理性时学习,完全从内部监督信号中学习。值得注意的是,PAPO并不依赖额外的数据校准、外部奖励模式或专有模型。具体地说,我们采用KLL差异术语的形式引入了隐含概念的损失,这尽管简单,但在不同多式联运基准方面产生了重大的总体改进(4.4%)。在具有高度愿景依赖的任务上,改进了8.0%。我们还观察到了大幅降低RBOGI的基底基值目标,通过我们对PO的精确成本分析, 改进了我们通过Sentrocrecialalalal AL的升级的升级的升级分析,从而改进了我们为BROLILILO的升级的升级的升级的升级的升级的升级的升级的升级。

Article 196

Title@2025-07-08 (2): CodeMirage: Hallucinations in Code Generated by Large Language Models

Title: CodeMirage: Hallucinations in Code Generated by Large Language Models

CodeMirage: Halluzinationen in Code Generiert durch große Sprachmodelle

代码Mirage: 大语言模型生成的代码中的幻觉 2408.08333v2

Authors (4): Vibhor Agarwal, Yulong Pei, Salwa Alamir, Xiaomo Liu

Large Language Models (LLMs) have shown promising potentials in program generation and no-code automation. However, LLMs are prone to generate hallucinations, i.e., they generate text which sounds plausible but is incorrect. Although there has been a recent surge in research on LLM hallucinations for text generation, similar hallucination phenomenon can happen in code generation. Sometimes the generated code can have syntactical or logical errors as well as more advanced issues like security vulnerabilities, memory leaks, etc. Given the wide adaptation of LLMs to enhance efficiency in code generation and development in general, it becomes imperative to investigate hallucinations in code generation. To the best of our knowledge, this is the first attempt at studying hallucinations in the code generated by LLMs. We start by introducing the code hallucination definition and a comprehensive taxonomy of code hallucination types. We propose the first benchmark CodeMirage dataset for code hallucinations. The benchmark contains 1,137 GPT-3.5 generated hallucinated code snippets for Python programming problems from two base datasets - HumanEval and MBPP. We then propose the methodology for code hallucination detection and experiment with open source LLMs such as CodeLLaMA as well as OpenAI’s GPT-3.5 and GPT-4 models using one-shot prompt. We find that GPT-4 performs the best on HumanEval dataset and gives comparable results to the fine-tuned CodeBERT baseline on MBPP dataset. Towards the end, we discuss various mitigation strategies for code hallucinations and conclude our work.

大型语言模型(LLMS)显示在程序生成和无编码自动化方面有很有希望的潜力。然而,LLMS容易产生幻觉,即产生似乎似乎似是似是而非的文本。虽然最近对LLM的幻觉的研究在为文本生成而出现,但类似的幻觉现象在代码生成中可能发生。生成的代码有时可能出现合成或逻辑错误,以及安全脆弱性、记忆泄漏等更先进的问题。鉴于LLMS的广泛调整以提高代码生成和一般开发效率,因此有必要调查代码生成中的幻觉。对于我们的知识来说,这是研究LMS生成的代码中的幻觉的第一次尝试。我们从引入代码幻觉定义和代码类型综合分类开始。我们建议为代码幻觉建立第一个基准代码数据集,包括安全脆弱性、记忆泄漏等。基准包含1,137 GPT-3.5生成的百草本代码,用于Python 各种基础数据集――人类经济学和MBPPP。然后,我们提出在代码中进行代码检测和实验的精确方法,将GPT-PLM数据模型作为开放源模型进行。

Article 197

Title@2025-07-08 (2): Substance over Style: Evaluating Proactive Conversational Coaching Agents

Title: Substance over Style: Evaluating Proactive Conversational Coaching Agents

Substanz über Stil: Bewertung proaktiver Conversations-Coaching-Agenten

物质高于样式:评价主动对话教练员 2503.19328v2

Authors (8): Vidya Srinivas, Xuhai Xu, Xin Liu, Kumar Ayush, Isaac Galatzer-Levy, Shwetak Patel, Daniel McDuff, Tim Althoff

While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges with initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.

虽然国家劳工政策方案的研究在对话任务方面取得了长足进展,但许多方法侧重于具有明确目标或评价标准的单向反应。相反,辅导提出了独特的挑战,最初的未确定目标是通过多方向互动、主观评价标准、混合倡议对话而演变的。在这项工作中,我们描述并实施了5个多方向辅导员,他们表现出不同的对话风格,并通过用户研究对之进行评估,收集155次对话的第一人反馈。我们发现用户高度重视核心功能,对没有核心组成部分的立体成分的看法是负面的。通过将用户反馈与卫生专家和LM的第三人评价进行比较,我们发现在评价方法上存在严重的不协调。我们的调查结果为对话辅导员的设计和评价提供了深刻的见解,并有助于改善以人为中心的NLP应用程序。

Article 198

Title@2025-07-08 (2): Can Interpretation Predict Behavior on Unseen Data?

Title: Can Interpretation Predict Behavior on Unseen Data?

Kann Interpretation Verhalten auf ungesehenen Daten voraussagen?

解释预测行为是否与未知数据有关? 2507.06445v1

Authors (5): Victoria R. Li, Jenny Kaufmann, Martin Wattenberg, David Alvarez-Melis, Naomi Saphra

Interpretability research often aims to predict how a model will respond to targeted interventions on specific mechanisms. However, it rarely predicts how a model will respond to unseen input data. This paper explores the promises and challenges of interpretability as a tool for predicting out-of-distribution (OOD) model behavior. Specifically, we investigate the correspondence between attention patterns and OOD generalization in hundreds of Transformer models independently trained on a synthetic classification task. These models exhibit several distinct systematic generalization rules OOD, forming a diverse population for correlational analysis. In this setting, we find that simple observational tools from interpretability can predict OOD performance. In particular, when in-distribution attention exhibits hierarchical patterns, the model is likely to generalize hierarchically on OOD data – even when the rule’s implementation does not rely on these hierarchical patterns, according to ablation tests. Our findings offer a proof-of-concept to motivate further interpretability work on predicting unseen model behavior.

解释性研究往往旨在预测模型如何应对特定机制的定向干预,但很少预测模型如何应对无形输入数据。本文探讨了可解释性作为预测分配外(OOOD)模式行为的工具的许诺和挑战。具体地说,我们调查数百个独立接受合成分类任务培训的变异模型中的注意力模式与OOOD一般化的对应关系。这些模型展示了几个独特的系统化规则OOD,形成不同的人口,供相关分析使用。在这个环境中,我们发现从可解释性中得出的简单观察工具可以预测OOD的性能。特别是,当分配性关注显示OOOD数据的等级模式时,该模式很可能在等级上将OOD数据普遍化 – – 即使规则的实施并不依赖这些等级模式,根据反动测试。我们的调查结果提供了一种证据,用以激励进一步解释预测不可见模式行为的工作。

Article 199

Title@2025-07-08 (2): Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling

Title: Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling

Temporale Analyse des Diskurses zur Klimapolitik: Einblicke aus der dynamischen Themenmodellierung

气候政策问题时空分析:动态嵌入式专题模型的展望 2507.06435v1

Authors (2): Rafiu Adekoya Badekale, Adewale Akinfaderin

Understanding how policy language evolves over time is critical for assessing global responses to complex challenges such as climate change. Temporal analysis helps stakeholders, including policymakers and researchers, to evaluate past priorities, identify emerging themes, design governance strategies, and develop mitigation measures. Traditional approaches, such as manual thematic coding, are time-consuming and limited in capturing the complex, interconnected nature of global policy discourse. With the increasing relevance of unsupervised machine learning, these limitations can be addressed, particularly under high-volume, complex, and high-dimensional data conditions. In this work, we explore a novel approach that applies the dynamic embedded topic model (DETM) to analyze the evolution of global climate policy discourse. A probabilistic model designed to capture the temporal dynamics of topics over time. We collected a corpus of United Nations Framework Convention on Climate Change (UNFCCC) policy decisions from 1995 to 2023, excluding 2020 due to the postponement of COP26 as a result of the COVID-19 pandemic. The model reveals shifts from early emphases on greenhouse gases and international conventions to recent focuses on implementation, technical collaboration, capacity building, finance, and global agreements. Section 3 presents the modeling pipeline, including preprocessing, model training, and visualization of temporal word distributions. Our results show that DETM is a scalable and effective tool for analyzing the evolution of global policy discourse. Section 4 discusses the implications of these findings and we concluded with future directions and refinements to extend this approach to other policy domains.

时间分析有助于利益攸关方,包括决策者和研究人员,评估过去的优先事项,确定新出现的主题,设计治理战略和制定缓解措施。传统方法,如人工专题编码,耗时且有限,无法捕捉全球政策讨论的复杂和相互关联的性质。随着不受监督的机器学习的相关性日益提高,这些局限性可以得到解决,特别是在大量、复杂和高维度的数据条件下。在这项工作中,我们探索了一种新颖的方法,采用动态嵌入主题模式(DETM)分析全球气候政策讨论的演变。一种旨在长期捕捉专题时间动态的概率性模式。我们收集了一整套《联合国气候变化框架公约》(《气候公约》)政策决定,从1995年到2023年,不包括2020年。由于COVID-19大流行导致COP26推迟,因此这些限制可以得到解决。该模式揭示了从早期温室气体和国际公约的强调转向最近的执行、技术协作、能力建设、金融和全球协定的演变。第3节展示了我们所制定的全球趋势的模型,包括模拟、培训、分析工具的演变过程,以分析我们目前和将来的政策趋势的演变趋势。

Article 200

Title@2025-07-08 (2): Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders

Title: Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders

Erforschen der Aufgabenleistung mit Interpretierbaren Modellen über Sparse Auto-Encoder

通过 Sparass Auto- Enccders 与可解释模型探索任务性能 2507.06427v1

Authors (8): Shun Wang, Tyler Loakman, Youbo Lei, Yi Liu, Bohao Yang, Yuting Zhao, Dong Yang, Chenghua Lin

Large Language Models (LLMs) are traditionally viewed as black-box algorithms, therefore reducing trustworthiness and obscuring potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of the prompts with additional annotations to improve the interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.

大型语言模型(LLMs)传统上被视为黑箱算法,因此降低了可信度,隐蔽了提高下游任务业绩的潜在方法。在这项工作中,我们采用了一种有效的LLM分解方法,使用与稀疏的自动分解器的字典学习方法。这有助于从多语管LLM神经中提取单词性特征。值得注意的是,我们的工作确定了模型内部误解,允许自动重拟提示,并附加说明,改进LLMs的解释。此外,这一方法表明下游任务(如数学推理和比喻探测)的业绩显著改善。

Article 201

Title@2025-07-08 (2): Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

Title: Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

Reward-Modelle können sich selbst verbessern: Reward-Guided Adversarial Failure Mode Discovery für robuste Reward-Modellierung

奖赏模型能够改善自我:为强力奖赏建模而发现奖赏- 受奖人反逆失败模式 2507.06419v1

Authors (2): Pankayaraj Pathmanathan, Furong Huang

Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution agnostic method for discovering reward model failure modes via reward guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model’s misaligned behavior. We evaluate REFORM on two widely used preference datasets Anthropic Helpful Harmless (HH) and PKU Beavertails and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.

由于人类偏好固有的复杂性以及现有数据集的覆盖范围有限,奖励模式往往在分配性转变或对抗性扰动中失败。现有的确定这种失败模式的方法通常依赖事先对优惠分配或失败属性的了解,在无法获得此类信息的现实世界环境中限制其实用性。在这项工作中,我们提出一种可移植的、分配优惠的、不可知的方法,以便通过有控制的奖赏解码来发现奖赏模式失败模式。在此基础上,我们引入了REFORM,这是一个自我改进的奖赏模式框架,通过使用奖赏模式本身来指导错误得分反应的产生,从而增强稳健性。然后,这些对抗性范例被用来增加培训数据,弥补奖励模式的错误行为。我们根据两种广泛使用的优惠数据集评估REFORM,即“人类无害”和PKU Belails, 并展示在不通过提高质量、不提高质量、不通过提高可持续性和不断提高的升级政策,从而极大地改进升级政策。

Article 202

Title@2025-07-08 (2): PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning

Title: PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning

PERK: Lang-Kontext als Parameter-Effizientes Test-Zeit-Lernen

PERK: 参数有效试验时间学习的长期理由 2507.06415v1

Authors (4): Zeming Chen, Angelika Romanou, Gail Weiss, Antoine Bosselut

Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.

长文本推理要求准确识别广泛、繁杂输入环境中的相关信息。先前的研究显示, 使用测试时间学习将背景直接编码成模型参数可以有效地将背景编码成模型参数。但是, 使测试时间学习的元学习方法的记忆密集性极高, 防止其应用到长背景设置。在此工作中, 我们建议 PERK( 光学高效解释知识) , 这是一种可扩缩的学习方法, 以测试时间将长输入环境编码成梯度更新为轻量模型适配器。具体地说, PERK 在元培训阶段使用两个嵌套优化环。内部循环快速将环境编码成一个低级别适应器( LORA) , 用作基础模型的参数高效存储模块存储模块。同时, 外环学会使用更新的适应器, 准确回顾和解释从编码长背景的相关信息。我们对一些长文本推理学任务的评估显示, PERK 大大超过标准的快速定位长文本基准, 在更小的模型( GPERPT- 2.5) 中, 最精确的缩缩缩缩到最精确的逻辑, 在最精确的逻辑中, 最精确的缩到最精确的缩缩缩缩到最深的缩到最深的缩到最深的缩缩缩缩的缩到超的缩的缩的缩的缩缩缩缩缩缩的缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩的缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩到缩的缩到缩到缩到缩到缩的缩到缩到缩到缩到缩的缩的缩到缩到缩的

Article 203

Title@2025-07-08 (2): The Trilemma of Truth in Large Language Models

Title: The Trilemma of Truth in Large Language Models

Das Trilemma der Wahrheit in großen Sprachmodellen

大语言模型中的真理三边 2506.23921v2

Authors (2): Germans Savcisens, Tina Eliassi-Rad

We often attribute human characteristics to large language models (LLMs) and claim that they “know” certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM’s depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs “know” and how certain they are of their probabilistic internal knowledge.

我们经常将人的特点归结于大型语言模型(LLMS),并声称他们“知道”某些东西。LLMS具有代表培训期间保留的信息的内部概率知识。我们如何评估这种知识的真实性?我们检查两种共同的方法来检验LLMS的真实性,并发现一些有缺陷的假设。为了解决这些有缺陷的假设,我们引入了SawMIL(SawMIL(Sort for Sprassar Enown Convention 多重Instess)),一种检验方法,利用LLMS的内部激活将声明分为真实、虚假和两者都没有。 sawMIL是建立在多重 Internance 学习和符合预测的基础上的。我们如何评估16个开放源LMS的5项有效性标准,包括默认和聊天基变异体,以及3个新的数据集。我们提供的这些洞察是:(1)真实信号通常集中在LM深度的第三季度;(2)真理和假信号并不总是对称;(3)线性探点比一些默认模型更好;(4)非线性LMSMS的准确性反馈,可能要求从真实性测算为真实性测算。

Article 204

Title@2025-07-08 (2): Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers

Title: Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers

Single Word Change ist alles, was Sie brauchen: Konzipieren von Angriffen und Verteidigungen für Textklassifikatoren

单字更改是您需要的: 设计攻击和文字分类的防御 2401.17196v2

Authors (5): Lei Xu, Sarah Alnegheimish, Laure Berti-Equille, Alfredo Cuesta-Infante, Kalyan Veeramachaneni

In text classification, creating an adversarial example means subtly perturbing a few words in a sentence without changing its meaning, causing it to be misclassified by a classifier. A concerning observation is that a significant portion of adversarial examples generated by existing methods change only one word. This single-word perturbation vulnerability represents a significant weakness in classifiers, which malicious users can exploit to efficiently create a multitude of adversarial examples. This paper studies this problem and makes the following key contributions: (1) We introduce a novel metric \r{ho} to quantitatively assess a classifier’s robustness against single-word perturbation. (2) We present the SP-Attack, designed to exploit the single-word perturbation vulnerability, achieving a higher attack success rate, better preserving sentence meaning, while reducing computation costs compared to state-of-the-art adversarial methods. (3) We propose SP-Defense, which aims to improve \r{ho} by applying data augmentation in learning. Experimental results on 4 datasets and BERT and distilBERT classifiers show that SP-Defense improves \r{ho} by 14.6% and 13.9% and decreases the attack success rate of SP-Attack by 30.4% and 21.2% on two classifiers respectively, and decreases the attack success rate of existing attack methods that involve multiple-word perturbations.

在文本分类中,创建一个对抗性实例意味着在不改变其含义的情况下,在不改变其含义的情况下,在某一句中略略扰动几个词,造成分类者错误的分类。关于观察的观察是,现有方法产生的大量对抗性例子只改变一个单词。这种单字扰动脆弱性代表了分类者中的一个重大弱点,恶意用户可以利用这些弱点有效地创造众多对抗性例子。本文研究这一问题,并作出以下关键贡献:(1) 我们引入一个新的衡量标准,对分类者对单字扰动的稳健性进行定量评估。 (2) 我们介绍SP-Attack,目的是利用单字扰动性脆弱性,实现更高的攻击成功率,更好地保留判刑含义,同时降低计算成本,与最先进的对抗性对抗性方法相比,以有效的方式创造众多的对抗性例子。 (3) 我们提出SP-Defer,目的是通过在学习中应用数据增强来改进\r{ho}。 4个数据集的实验结果以及BERT和DERT分类人员显示,SP-Def-Def-res 进攻率分别为SP-Defal 30/rack 和21级攻击率下降率和14提高两个SP-cal-% 和21级攻击率和下降率率的SP-

Article 205

Title@2025-07-08 (2): Hypermagmas and Colored Operads: Heads, Phases, and Theta Roles

Title: Hypermagmas and Colored Operads: Heads, Phases, and Theta Roles

Hypermagmen und farbige Opern: Köpfe, Phasen und Theta-Rollen

超磁和彩色操作: 头部、阶段和Theta 角色 2507.06393v1

Authors (3): Matilde Marcolli, Riny Huijbregts, Richard K. Larson

We show that head functions on syntactic objects extend the magma structure to a hypermagma, with the c-command relation compatible with the magma operation and the m-command relation with the hypermagma. We then show that the structure of head and complement and specifier, additional modifier positions, and the structure of phases in the Extended Projection can be formulated as a bud generating system of a colored operad, in a form similar to the structure of theta roles. We also show that, due to the special form of the colored operad generators, the filtering of freely generated syntactic objects by these coloring rules can be equivalently formulated as a filtering in the course of structure formation via a colored Merge, which can in turn be related to the hypermagma structure. The rules on movement by Internal Merge with respect to phases, the Extended Projection Principle, Empty Category Principle, and Phase Impenetrability Condition are all subsumed into the form of the colored operad generators. Movement compatibilities between the phase structure and the theta roles assignments can then be formulated in terms of the respective colored operads and a transduction of colored operads.

我们显示, 合成对象上的头函数将岩浆结构扩展至超岩浆, 与岩浆操作和超岩浆的 m- 命令关系可兼容的 c- 命令关系。然后我们显示, 头结构、补充和分解器、额外的修饰器位置以及扩展投影中各个阶段的结构可以形成一个彩色剧团的萌芽生成系统, 其形式与该作用的结构类似。我们还显示, 由于有色歌剧生成器的特殊形式, 通过这些彩色规则对自由生成的合成对象的过滤, 可以等同于通过彩色合并在结构形成过程中的过滤, 而这反过来又可以与超岩浆结构相关。内部合并关于相继演进阶段、扩展投影原则、空分类原则和阶段隐性状态的规则, 都包含在有色歌剧生成器的形态中。通过这些彩色合并规则对自由生成的合成对象的过滤, 可以通过彩色合并制的歌剧和相色调调调调制成的调戏和调色歌剧的调调调调。

Article 206

Title@2025-07-08 (2): Multi-Sense Embeddings for Language Models and Knowledge Distillation

Title: Multi-Sense Embeddings for Language Models and Knowledge Distillation

Multi-Sense-Einbettungen für Sprachmodelle und Wissensdestillation

语文模式和知识蒸馏多语种多语种嵌入式 2504.06036v2

Authors (4): Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis

Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a limited number of senses (or meanings). We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language. To construct a sense embedding dictionary, we apply a clustering algorithm to embeddings generated by an LLM and consider the cluster centers as representative sense embeddings. In addition, we propose a novel knowledge distillation method that leverages the sense dictionary to learn a smaller student model that mimics the senses from the much larger base LLM model, offering significant space and inference time savings, while maintaining competitive performance. Via thorough experiments on various benchmarks, we showcase the effectiveness of our sense embeddings and knowledge distillation approach. We share our code at https://github.com/Qitong-Wang/SenseDict

以变换器为基础的大型语言模型(LLMs)依赖于背景嵌入,根据周围环境的不同背景产生不同(持续)的相同表示。尽管如此,单词和符号通常具有有限的感官(或意义)数量。我们提议多感知嵌入,作为每个符号的空置替换,以用一种语言捕捉其使用范围。为了构建一种感知嵌入字典,我们应用组合算法来嵌入由LLM生成的嵌入,并将集群中心视为具有代表性的感官嵌入。此外,我们提议一种新颖的知识蒸馏方法,利用感知字典学习一个较小的学生模型,模仿来自大得多的LMM模型的感官,提供相当大的空间和推断时间节约,同时保持竞争性性能。通过对各种基准的彻底实验,我们展示我们感知嵌入和知识蒸馏方法的有效性。我们在 https://github.com/Qitong-Wang/SenseDict分享我们的代码。

Article 207

Title@2025-07-08 (2): Evaluating Morphological Alignment of Tokenizers in 70 Languages

Title: Evaluating Morphological Alignment of Tokenizers in 70 Languages

Bewertung der morphologischen Ausrichtung von Tokenisatoren in 70 Sprachen

以70种语言评价火化器在道义上的一致性 2507.06378v1

Authors (3): Catherine Arnett, Marisa Hudspeth, Brendan O’Connor

While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to effectively evaluate tokenizer quality. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. We expand MorphScore (Arnett & Bergen, 2025), which previously covered 22 languages, to support a total of 70 languages. The updated MorphScore offers more flexibility in evaluation and addresses some of the limitations of the original version. We then correlate our alignment scores with downstream task performance for five pre-trained languages models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment does not explain very much variance in model performance, suggesting that morphological alignment alone does not measure dimensions of tokenization quality relevant to model performance.

虽然象征性化是语言建模方面的一个关键步骤,对示范培训和性能有影响,但它仍然不清楚如何有效评估象征品质量。象征性品质量的一个拟议层面是象征品保留语言上有意义的小字,在单词内将象征性边界与形态边界统一起来。我们扩大MorphScore(Arnett & Bergen, 2025),以前涵盖22种语言,支持总共70种语言。更新的MorphScore在评价方面提供了更大的灵活性,并解决了原始版本的一些局限性。然后,我们将我们的匹配分数与七个任务中5种预先训练过的语文模型的下游任务性能挂钩,我们抽样中每种语文至少有一项任务。我们发现,形态上的匹配并不解释模型性能的很大差异,表明单是形态上的匹配并不能衡量与模型性能相关的象征性质量的维度。

Article 208

Title@2025-07-08 (2): Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

Title: Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

Wachsende Transformer: Modulare Zusammensetzung und schichtweise Erweiterung auf einem gefrorenen Substrat

增长变异器:在冻结基质上的模块构成和图层扩展 2507.07129v1

Authors (1): A. Bochkov

The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal “docking port,” enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is “grown” by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.

缩放大型语言模型(LLMS)的流行范例(LLMS)涉及单一式的、最终到最终的、缺乏灵活性的资源密集型培训,这是一个资源密集型的过程。本文探讨了一种替代的、建设性的模式开发方法,其基础是非可操作的、确定性的投入嵌入。在[1]之前,我们确定,高层次的语义推理可以在变异器中出现,使用由Unicode glyphs视觉结构衍生的冷冻嵌入。在这里,我们证明,这个固定代表式的基底部分是一个通用的“演示港 ” , 使两个强大和高效的缩放模式成为:无缝模块构成和递增层增长。首先,我们表明,在分散式数据模型(例如,俄罗斯和中国文本)的基础上,经过培训的专家模型可以合并成一个单一的、更有能力的Mixturate-Exerts模型(MoE)模型, 后加零结构修改。我们通过简单地平均其输出对产出进行校正校正的校正校正的模型, 由此而使MLU这样的推理学基准立即得到改进, , 超越其组成专家的构成,而不必忘记遗忘地遗忘地在深度的构造上,我们进行着一个更深层的演进式的进式的研究, 。

Article 209

Title@2025-07-08 (2): Could the Road to Grounded, Neuro-symbolic AI be Paved with Words-as-Classifiers?

Title: Could the Road to Grounded, Neuro-symbolic AI be Paved with Words-as-Classifiers?

Könnte der Weg zur geerdeten, neuro-symbolischen KI mit Wort-als-Klassifikatoren gepflastert werden?

通往禁地、神经-精神-精神-爱之路能否用“单词化为标准”来铺设? 2507.06335v1

Authors (2): Casey Kennington, David Schlangen

Formal, Distributional, and Grounded theories of computational semantics each have their uses and their drawbacks. There has been a shift to ground models of language by adding visual knowledge, and there has been a call to enrich models of language with symbolic methods to gain the benefits from formal, distributional, and grounded theories. In this paper, we attempt to make the case that one potential path forward in unifying all three semantic fields is paved with the words-as-classifier model, a model of word-level grounded semantics that has been incorporated into formalisms and distributional language models in the literature, and it has been well-tested within interactive dialogue settings. We review that literature, motivate the words-as-classifiers model with an appeal to recent work in cognitive science, and describe a small experiment. Finally, we sketch a model of semantics unified through words-as-classifiers.

正式、分布和基础的计算语义理论各有其用途和缺点。通过增加视觉知识,已经向语言的地面模型转变。人们呼吁用象征性的方法丰富语言模型,以便从正式、分布和基础理论中获益。在本文中,我们试图证明,在统一所有三个语义领域方面,一条潜在的前进道路是用单词类化模型铺设的,一种单词级基语义模型,已经融入了文献中的正文和分布语言模型,并在互动对话环境中进行了良好的测试。我们审视了文献,激发了单词类模型,吸引了认知科学的近期工作,并描述了一个小实验。最后,我们勾画了通过单词类化软件统一的语义学模型。

Article 210

Title@2025-07-08 (2): MedGellan: LLM-Generated Medical Guidance to Support Physicians

Title: MedGellan: LLM-Generated Medical Guidance to Support Physicians

MedGellan: LLM-generierte medizinische Anleitung zur Unterstützung von Ärzten

MedGellen:LLM - 支持医生的医疗指导 2507.04431v2

Authors (4): Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini

Medical decision-making is a critical task, where errors can result in serious, potentially life-threatening consequences. While full automation remains challenging, hybrid frameworks that combine machine intelligence with human oversight offer a practical alternative. In this paper, we present MedGellan, a lightweight, annotation-free framework that uses a Large Language Model (LLM) to generate clinical guidance from raw medical records, which is then used by a physician to predict diagnoses. MedGellan uses a Bayesian-inspired prompting strategy that respects the temporal order of clinical data. Preliminary experiments show that the guidance generated by the LLM with MedGellan improves diagnostic performance, particularly in recall and $F_1$ score.

医疗决策是一项关键任务,错误可能导致可能危及生命的严重后果。虽然完全自动化仍然具有挑战性,但将机器智能与人类监督相结合的混合框架提供了一种实用的替代方案。在本文中,我们介绍了一个使用大语言模型(LLM)来产生临床指导的轻量级、无注释性框架(MedGellen),该模型从原始医疗记录中产生临床指导,然后由医生用来预测诊断结果。 MedGellan使用一种尊重临床数据时间顺序的由Bayesian启发的激励性战略。初步实验表明,LLM和MedGellen产生的指导提高了诊断性能,特别是在召回和1美元的分数方面。

Article 211

Title@2025-07-08 (2): RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

Title: RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

RefineX: Verfeinern von Vorschulungsdaten auf Scale aus sachverständigen Programmen

RefineX:从专家指导方案学习如何在规模上精确分析培训前数据 2507.03253v2

Authors (10): Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng

The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose $\textbf{RefineX}$, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks, and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and Prox-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.

大型语言模型(LLMS)的基本能力受到其培训前公司质量的深刻影响。然而,提高规模数据质量仍然是一大挑战,这主要是由于精炼效力和处理效率之间的权衡取舍。尽管基于规则的过滤仍然是主导范式,但它通常在文件一级运作,缺乏完善文件内具体内容所需的颗粒性。在诸如ProX等新兴工作的启发下,我们提议美元(textbf{RefineX}$),这是一个通过方案编辑任务大规模地对培训前数据进行手术改进的新框架。RefineX使得数据能够高效和精细地改进,同时可靠地保持原始文本的多样性和自然性。RefineX的核心力量在于将高质量的、专家指导的端到端的精细精细精炼结果转化为最起码的删除程序。这种高精密的提炼油管道被用于培训高效和可靠的精细模型,可以系统地改进每例的精确度。我们评估RefineX的进度,在多个模型的预选前阶段进行精细的精细的精细数据改进,在甚的模型上进行精细的精细的精细的精细分析。

Article 212

Title@2025-07-08 (2): ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time

Title: ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time

ETT: Erweiterung des Langzeitkontexts Verständnisfähigkeit von LLMs bei Test-Time

ETT:扩大LLMs在试验时的长距离理解能力 2507.06313v1

Authors (4): Kiarash Zahirnia, Zahra Golpayegani, Walid Ahmad, Yang Liu

Transformer-based Language Models’ computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce \ourmodelacronym~(Extend at Test-Time), method for extending the context length of short context Transformer-based LLMs, with constant memory requirement and linear computation overhead. ETT enable the extension of the context length at test-time by efficient fine-tuning the model’s parameters on the input context, chunked into overlapping small subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model’s accuracy. We also study how context can be stored in LLM’s weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test-time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models’ accuracy.

以变换器为基础的语言模型的计算和内存管理量作为序列长度的函数, 二次成本在使用LLMS处理长序列时构成挑战。在这项工作中, 我们引入了 \ ourmodelacronnym~( 测试时Extend ) , 延长基于短背景的变换器LMS的上下文长度的方法, 并不断要求内存, 以及线性计算间接费用。 TET 通过高效地微调模型输入环境参数, 使测试时的上下文长度延长, 并被挤成重叠的小后继序列。我们通过将 GPT- Large 和 Phi-2 的上下文长度延长32 次来评估长贝恩奇的 ETT, 从而将GPT- Large 和 Phi-2 的上下文长度从1k 增加到32 个符号, 从而将模型的精度提高到30% 。我们还研究如何有效和高效地将环境储存在 LLM 的重量中。我们通过详细的调整研究, 我们研究哪个变换式模块对试验时的精度最为有益。

Article 213

Title@2025-07-08 (2): Humans overrely on overconfident language models, across languages

Title: Humans overrely on overconfident language models, across languages

Menschen überschätzen überbewusste Sprachmodelle, über Sprachen hinweg

人类对不同语言的过于自信的语言模式过于自信 2507.06306v1

Authors (3): Neil Rathi, Dan Jurafsky, Kaitlyn Zhou

As large language models (LLMs) are deployed globally, it is crucial that their responses are calibrated across languages to accurately convey uncertainty and limitations. Previous work has shown that LLMs are linguistically overconfident in English, leading users to overrely on confident generations. However, the usage and interpretation of epistemic markers (e.g., ‘It’s definitely,’ ‘I think’) can differ sharply across languages. Here, we study the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages to evaluate the safety of LLMs in a global context. We find that overreliance risks are high across all languages. We first analyze the distribution of LLM-generated epistemic markers, and observe that while LLMs are cross-linguistically overconfident, they are also sensitive to documented linguistic variation. For example, models generate the most markers of uncertainty in Japanese and the most markers of certainty in German and Mandarin. We then measure human reliance rates across languages, finding that while users strongly rely on confident LLM generations in all languages, reliance behaviors differ cross-linguistically: for example, users rely significantly more on expressions of uncertainty in Japanese than in English. Taken together, these results indicate high risk of reliance on overconfident model generations across languages. Our findings highlight the challenges of multilingual linguistic calibration and stress the importance of culturally and linguistically contextualized model safety evaluations.

由于全球部署的大型语言模型(LLMS)在全球部署,因此,至关重要的是,它们的反应要通过语言校准,以准确地传达不确定性和局限性。以前的工作表明,LLMS在语言上过于自信,使用户对自信的几代人过度自信。然而,缩写标记(例如“它肯定是”“我认为”)的使用和解释在语言上差异很大。在这里,我们研究多种语言语言(误)校校准、过度自信和五种语言过分依赖的风险,以评价全球范围内LMS的安全性。我们发现,所有语言的过分依赖性风险都很高。我们首先分析LM产生的缩写符号在语言上的分布,导致用户过度自信。我们首先分析LMCM产生的直写标记的分布,并观察,虽然LLMMs具有跨语言的过度自信,但它们对文献语言差异很大。例如,模型产生日本最不确定的标志,德国和曼达林最确定性的标志。我们然后衡量不同语言的人类依赖率,我们发现,虽然用户在所有语言模式中都高度依赖自信的LM世代,但依赖程度上的行为方式在英语上也不同语言的逻辑上更依赖了我们的不确定性。

Article 214

Title@2025-07-08 (2): Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers

Title: Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers

Effizienz-Wirksamkeit Reranking FLOPs für LLM-basierte Reranker

为基于LLM的升级者重新安排FLOP 2507.06223v1

Authors (5): Zhiyuan Peng, Ting-ruen Wei, Tingyu Song, Yilun Zhao, Yi Fang

Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. However, these metrics depend on hardware and running-time choices (\eg parallel or not, batch size, etc), and often fail to account for model size, making it difficult to interpret and obscuring the evaluation of the efficiency-effectiveness tradeoff. To address this issue, we propose E\textsuperscript{2}R-FLOPs, for LLM-based rerankers: ranking metrics per PetaFLOP (RPP) for relevance per compute and queries per PetaFLOP (QPP) for hardware-agnostic throughput. Companied with the new metrics, an interpretable FLOPs estimator is built to estimate the FLOPs of an LLM-based reranker even without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architecture, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.

大型语言模型(LLMS)最近被应用于信息检索的重新排序任务,取得了很强的绩效,然而,它们高的计算要求往往阻碍实际的部署。现有的研究利用代理度量,如潜伏、远端传票数量、输入符号和输出符号等,评估基于LLM的重新排序者的效率。但是,这些尺度取决于硬件和运行时间选择(或是否平行、批量大小等),往往不考虑模型大小,因此难以解释和掩盖效率权衡的评价。为了解决这个问题,我们建议E\textsuperscript{2}R-FLOPs, 面向基于LM的重新排序者:根据每个PetaFLOP(RPP)的比重度,按每个计算和问询每个PetaFLLOP(QPPP)的比重相关性,根据新的计量标准,建立一个可解释的 FLOPs 估测算器,用来估计基于LOP的更替器的重新排序者,即使不进行任何实验,我们提议在进行任何实验的情况下,也建议对社区效率模型进行。

Article 215

Title@2025-07-08 (2): Instruction Following by Boosting Attention of Large Language Models

Title: Instruction Following by Boosting Attention of Large Language Models

Anleitung, indem man die Aufmerksamkeit großer Sprachmodelle erhöht

之后的教学,培养对大语言模式的注意 2506.13734v2

Authors (4): Vitoria Guardieiro, Adam Stein, Avishree Khare, Eric Wong

Controlling the generation of large language models (LLMs) remains a central challenge to ensure their safe and reliable deployment. While prompt engineering and finetuning are common approaches, recent work has explored latent steering, a lightweight technique that alters LLM internal activations to guide generation. However, subsequent studies revealed latent steering’s effectiveness to be limited, often underperforming simple instruction prompting. To address this limitation, we first establish a benchmark across diverse behaviors for standardized evaluation of steering techniques. Building on insights from this benchmark, we introduce Instruction Attention Boosting (InstABoost), a latent steering method that boosts the strength of instruction prompting by altering the model’s attention during generation. InstABoost combines the strengths of existing approaches and is theoretically supported by prior work that suggests that in-context rule following in transformer-based models can be controlled by manipulating attention on instructions. Empirically, InstABoost demonstrates superior control success compared to both traditional prompting and latent steering.

控制大型语言模型(LLMs)的生成仍然是确保安全可靠地部署的重大挑战。虽然迅速工程和微调是常见的方法,但最近的工作探索了潜质引导,这是一种轻量技术,改变了LLM的内部激活,以引导下一代的形成。然而,随后的研究显示,潜质引导的有效性有限,往往表现较差的简单指令。为了应对这一限制,我们首先为标准化的指导技术评价制定不同行为基准。基于这一基准的洞察力,我们引入了 “ 注意促进 “ (InstABoost),这是一种潜在的指导方法,通过改变模型代代代代的注意力来增强激励教学的力量。 “ 斯特拉博斯特 “ 将现有方法的优势结合起来,并在理论上得到先前工作的支持,表明基于变压器模型的文本规则可以通过对指示的注意来控制。从中可以看出, “ 注意 “ InstABoost “ 与传统的催动和潜伏性指导相比,都表现出超强的控制成功。

Article 216

Title@2025-07-08 (2): CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions

Title: CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions

CultureCLIP: CLIP mit kulturellem Bewusstsein durch synthetische Bilder und kontextualisierte Captions stärken

CICLIP: 通过合成图像和背景说明赋予CLIP以文化意识,赋予CLIP权力 2507.06210v1

Authors (6): Yuchen Huang, Zhiyuan Fan, Zhitao He, Sandeep Polisetty, Wenyan Li, Yi R. Fung

Pretrained vision-language models (VLMs) such as CLIP excel in multimodal understanding but struggle with contextually relevant fine-grained visual features, making it difficult to distinguish visually similar yet culturally distinct concepts. This limitation stems from the scarcity of high-quality culture-specific datasets, the lack of integrated contextual knowledge, and the absence of hard negatives highlighting subtle distinctions. To address these challenges, we first design a data curation pipeline that leverages open-sourced VLMs and text-to-image diffusion models to construct CulTwin, a synthetic cultural dataset. This dataset consists of paired concept-caption-image triplets, where concepts visually resemble each other but represent different cultural contexts. Then, we fine-tune CLIP on CulTwin to create CultureCLIP, which aligns cultural concepts with contextually enhanced captions and synthetic images through customized contrastive learning, enabling finer cultural differentiation while preserving generalization capabilities. Experiments on culturally relevant benchmarks show that CultureCLIP outperforms the base CLIP, achieving up to a notable 5.49% improvement in fine-grained concept recognition on certain tasks, while preserving CLIP’s original generalization ability, validating the effectiveness of our data synthesis and VLM backbone training paradigm in capturing subtle cultural distinctions.

为了应对这些挑战,我们首先设计了一个数据整理管道,利用开放源码VLM和文本到图像传播模型来构建一个合成文化数据集CulTwin。这个数据集由一对概念拼凑成的CLIP模型组成,在三部模型中,各种概念看起来相似,但代表不同的文化背景。随后,我们在CulTwin上对CLIP作了微调,以创建文化CulTLIP,通过定制化对比学习,使文化概念与背景强化的字幕和合成图像相匹配,使文化差异得以细化,同时保留总体化能力。关于文化相关基准的实验显示CulTwin比CulTwin(合成文化数据集)要优于基础CLIP,在精细的CLIP模型中实现显著的5.49 %的改进,在精确化的CLIP模型中,在精确化的精细化能力模型方面实现显著的CLIP化。

Article 217

Title@2025-07-08 (2): EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

Title: EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

EEG2TEXT-CN: Eine explorative Studie der offenen Vokabulären chinesischen Text-EEG-Ausrichtung über großsprachliches Modell und kontrastives Lernen auf ChinesischEEG

EEG2TEXT-CN:通过大语言模式和中经语言差异性学习对中文文本与EEEG校对开放词汇的探索性研究 2506.00854v3

Authors (6): Jacky Tai-Yu Lu, Jung Chiang, Chi-Sheng Chen, Anna Nai-Yun Tung, Hsiang Wei Hu, Yuan Chiao Cheng

We propose EEG2TEXT-CN, which, to the best of our knowledge, represents one of the earliest open-vocabulary EEG-to-text generation frameworks tailored for Chinese. Built on a biologically grounded EEG encoder (NICE-EEG) and a compact pretrained language model (MiniLM), our architecture aligns multichannel brain signals with natural language representations via masked pretraining and contrastive learning. Using a subset of the ChineseEEG dataset, where each sentence contains approximately ten Chinese characters aligned with 128-channel EEG recorded at 256 Hz, we segment EEG into per-character embeddings and predict full sentences in a zero-shot setting. The decoder is trained with teacher forcing and padding masks to accommodate variable-length sequences. Evaluation on over 1,500 training-validation sentences and 300 held-out test samples shows promising lexical alignment, with a best BLEU-1 score of 6.38\%. While syntactic fluency remains a challenge, our findings demonstrate the feasibility of non-phonetic, cross-modal language decoding from EEG. This work opens a new direction in multilingual brain-to-text research and lays the foundation for future cognitive-language interfaces in Chinese.

我们建议EEG2TEXT-CN, 就我们所知,它代表了最早为中国人设计的开放的开放的 EEG-文字生成框架之一。我们建于一个基于生物的 EEG 编码器(NITE-EEEG)和一个精练的语言模型(MiniLM)上,我们的建筑通过蒙面训练前和对比学习,将多通道脑信号与自然语言表达方式相匹配。我们使用中国EEEG数据集的一个子集,每句话都包含与256赫兹记录的128个EEEG相匹配的大约10个中国字符。我们将EEEG分成每个字嵌入每个字组,并预测在零光环境中的全句。解码器是用教师的强迫和遮蔽面面面面罩来适应多长序列的培训。对1,500多个培训-校准判决和300个留置试样的评估显示有良好的词汇一致性,最佳的BLEU-1分为6.38。尽管合成流仍是一个挑战,但我们的调查结果展示了在中国人文、跨式语言、跨式语言、跨式读基础的大脑研究中,从而开启了中国人进入了中国认知基础。

Article 218

Title: DS@GT at CheckThat! 2025: Ensemble Methods for Detection of Scientific Discourse on Social Media

DS@GT bei CheckThat! 2025: Ensemblemethoden zur Erkennung wissenschaftlicher Diskurse über soziale Medien

DS@GT at checketh! 2025:社会媒体科学论文的混合探测方法 2507.06205v1

Authors (4): Ayush Parikh, Hoang Thanh Thanh Truong, Jeanette Schofield, Maximilian Heil

In this paper, we, as the DS@GT team for CLEF 2025 CheckThat! Task 4a Scientific Web Discourse Detection, present the methods we explored for this task. For this multiclass classification task, we determined if a tweet contained a scientific claim, a reference to a scientific study or publication, and/or mentions of scientific entities, such as a university or a scientist. We present 3 modeling approaches for this task: transformer finetuning, few-shot prompting of LLMs, and a combined ensemble model whose design was informed by earlier experiments. Our team placed 7th in the competition, achieving a macro-averaged F1 score of 0.8611, an improvement over the DeBERTaV3 baseline of 0.8375. Our code is available on Github at https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4a.

在本文中,我们作为2025年CLEF CLEF CheckTH的DS@GT团队, 介绍了我们为此任务探索的方法。对于这一多级分类任务,我们确定一条推文是否包含科学主张, 提及科学研究或出版物, 和/或提及科学实体, 如大学或科学家。我们为这一任务提出了三种建模方法:变压器微调, 微调几发LLMS的提示, 以及根据早期实验设计的组合模型。我们的团队在比赛中排第7位, 取得了0.8611的宏观平均F1分, 比DBERTAV3 0. 8375的基线有所改进。我们的代码可以在 Github https://github.com/dsgt-arc/ checkTah-2025-swd/tree/main/subtask-4a上查阅。

Article 219

Title@2025-07-08 (2): Differential Mamba

Title: Differential Mamba

Unterschiedliche Mamba

差异 Mamba 2507.06204v1

Authors (3): Nadav Schneider, Itamar Zimerman, Eliya Nachmani

Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available.

变换器和 RNNs 等序列模型往往将注意力集中在无关的上下文上,从而导致杂乱的中间代表。这通过促进幻觉、削弱远程和检索能力以及降低稳健性,降低了LLM 能力。最近的工作表明,变换器的不同设计可以缓解这一问题,提高各种应用的效能。在本文中,我们探讨这些最初为变换器开发的技术是否可以适用于Mamba,这是基于选择性的州-空间层的最新结构,能够以更高的效率实现变换器的性能。我们表明,对Mamba的差分设计进行天真调整是不够的,需要仔细的建筑修改。为了解决这个问题,我们为Mamba引入了一种新型的差别机制,在语言建模基准上经过经验验证,展示了更好的检索能力和比Vanilla Mamba更好的性能。最后,我们进行了广泛的变换器研究和经验分析,以证明我们的设计选择是正确的,并提供证据,证明我们的方法有效地缓解了Mamba 模型中的超位问题。我们的代码是公开的。

Article 220

Title@2025-07-08 (2): UQLM: A Python Package for Uncertainty Quantification in Large Language Models

Title: UQLM: A Python Package for Uncertainty Quantification in Large Language Models

UQLM: Ein Python-Paket zur Unsicherheitsquantifizierung in großen Sprachmodellen

UQLM: 用于大语言模型中不确定性量化的 Python 软件包 2507.06196v1

Authors (6): Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, Zeya Ahmad

Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.

幻觉被定义为大语言模型产生虚假或误导内容的事例,构成影响下游应用安全和信任的重大挑战。我们引入了UQLM,这是利用最新技术的不确定性量化(UQ)技术检测LLM幻觉的Python套件。这个工具包提供一套基于UQ的得分,计算反应级别信任度分从0到1。这个图书馆为基于UQ的幻觉检测提供了一个现成的解决方案,可以很容易地整合,以提高LLM产出的可靠性。

Article 221

Title@2025-07-08 (2): DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification

Title: DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification

DS@GT bei CheckThat! 2025: Bewertung von Kontext- und Tokenisierungsstrategien für numerische Faktenverifizierung

DS@GT at checketh! 2025年:评估背景和数值事实核查的接受战略 2507.06195v1

Authors (2): Maximilian Heil, Aleksandar Pramov

Numerical claims, statements involving quantities, comparisons, and temporal references, pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and building our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of more evidences with longer input context windows using ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings in arithmetic reasoning tasks, R2L tokenization does not boost natural language inference (NLI) of numerical tasks. A longer context window does also not enhance veracity performance either, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves competitive macro-average F1 score of 0.57 and places us among the Top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at https://github.com/dsgt-arc/checkthat-2025-numerical.

数字索赔,涉及数量、比较和时间参照的说明,对自动实况调查系统提出了独特的挑战。在本研究中,我们利用QuanTemp数据集和我们自己的证据检索管道,评价了此类索赔真实性预测的模型战略。我们调查了三个关键因素:(1) 使用现代BERT使用较长输入背景窗口的更多证据的影响,(2) 右向左(R2L)象征性化的影响,以及(3) 它们对分类工作的综合影响。与以前在算术推理任务中的调查结果相反,R2L象征性化并不提振自然语言对数字任务的推断。一个较长的上下文窗口也没有提高真实性,也没有提高作为主要瓶颈的证据质量。我们最优秀的系统实现了0.57分的竞争性宏观平均F1分,并将我们列入CryThewT!2025第3任务中。我们的代码可在https://github.com/dsgt-arc/ check that-2025-nualical查阅。

Article 222

Title@2025-07-08 (2): SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads

Title: SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads

SQLBarber: Ein System, das große Sprachmodelle nutzt, um angepasste und realistische SQL Workloads zu generieren

SQLBarber:一个系统利用大语言模型生成自定义和现实的 SQL 工作量 2507.06192v1

Authors (2): Jiale Lao, Immanuel Trummer

Database research and development often require a large number of SQL queries for benchmarking purposes. However, acquiring real-world SQL queries is challenging due to privacy concerns, and existing SQL generation methods are limited in customization and in satisfying realistic constraints. To address this issue, we present SQLBarber, a system based on Large Language Models (LLMs) to generate customized and realistic SQL workloads. SQLBarber (i) eliminates the need for users to manually craft SQL templates in advance, while providing the flexibility to accept natural language specifications to constrain SQL templates, (ii) scales efficiently to generate large volumes of queries matching any user-defined cost distribution (e.g., cardinality and execution plan cost), and (iii) uses execution statistics from Amazon Redshift and Snowflake to derive SQL template specifications and query cost distributions that reflect real-world query characteristics. SQLBarber introduces (i) a declarative interface for users to effortlessly generate customized SQL templates, (ii) an LLM-powered pipeline augmented with a self-correction module that profiles, refines, and prunes SQL templates based on query costs, and (iii) a Bayesian Optimizer to efficiently explore different predicate values and identify a set of queries that satisfy the target cost distribution. We construct and open-source ten benchmarks of varying difficulty levels and target query cost distributions based on real-world statistics from Snowflake and Amazon Redshift. Extensive experiments on these benchmarks show that SQLBarber is the only system that can generate customized SQL templates. It reduces query generation time by one to three orders of magnitude, and significantly improves alignment with the target cost distribution, compared with existing methods.

然而,由于隐私问题,获取真实世界SQL查询具有挑战性,而现有的SQL生成方法在定制和满足现实限制方面受到限制。为了解决这一问题,我们提供了基于大语言模型的SQLBarber系统(LLMS),以生成定制和现实的SQL工作量。 SQLBarber (i) 取消了用户提前手工操作SQL模板的需要,同时提供了接受自然语言规格的灵活性,以限制SQL模板;(ii) 高效地生成大量查询,以匹配任何用户定义的成本分布(例如,基本条件和执行计划成本成本),以及(iii) 使用亚马逊红档和Snnoldflake的执行统计数据,以生成反映真实世界查询特点的SQL模板规格和成本分配。 SQL 引入了(i) 用户可轻松地生成定制的SQL模板,同时用一个自我校正的模板,用一个自校准模块,用SQL的SQRR值来大幅改进SQRR值,用SQRL的流程,根据不同成本计算,用SQRRRRRR的流程来测量,我们根据成本和SQRRRL的流程,根据成本和SQRRRRRRL的流程来大幅确定一个成本水平,根据SQ的流程,根据SQ的SQRRRRRRRRRR的流程和O的流程,根据成本和GF的流程,根据一个成本和GRRRRRL的流程,根据成本和GRRRL的流程,根据SQ的流程,根据一个成本水平,根据一个成本和GFL的流程,根据一个成本和O的流程,根据一个成本和G,根据SQ,根据SQ的流程和GL的流程,根据SQ的流程进行一个成本计算,根据一个成本和GFL的流程,根据一个成本和GFL的流程,确定一个成本计算成本和GRRL的流程,根据一个成本和GRL的成本和GFCFCFL的流程,根据SQ,根据一个成本和GFL的流程进行一个成本和GL的流程,根据一个成本和GL的流程,根据一个成本和GRL的成本和GL的流程,根据一个

Article 223

Title@2025-07-08 (2): DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation

Title: DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation

DS@GT bei CheckThat! 2025: Subjektivität durch Transfer-Learning und Korrekturdatenvergrößerung erkennen

DS@GT at checkThat! 2025: 通过传输学习和纠正数据增强检测主观性 2507.06189v1

Authors (2): Maximilian Heil, Dionne Bang

This paper presents our submission to Task 1, Subjectivity Detection, of the CheckThat! Lab at CLEF 2025. We investigate the effectiveness of transfer-learning and stylistic data augmentation to improve classification of subjective and objective sentences in English news text. Our approach contrasts fine-tuning of pre-trained encoders and transfer-learning of fine-tuned transformer on related tasks. We also introduce a controlled augmentation pipeline using GPT-4o to generate paraphrases in predefined subjectivity styles. To ensure label and style consistency, we employ the same model to correct and refine the generated samples. Results show that transfer-learning of specified encoders outperforms fine-tuning general-purpose ones, and that carefully curated augmentation significantly enhances model robustness, especially in detecting subjective content. Our official submission placed us $16^{th}$ of 24 participants. Overall, our findings underscore the value of combining encoder specialization with label-consistent augmentation for improved subjectivity detection. Our code is available at https://github.com/dsgt-arc/checkthat-2025-subject.

本文介绍我们提交到2025年CLEF CLEF 校验实验室1号任务1号任务:主观性检测;我们调查转移-学习和文体数据增强的有效性,以改进英文新闻文本中主观和客观句子的分类;我们的方法与相关任务前训练的编码器的微调和微调以及微调变压器的转移-学习形成对比;我们还采用GPT-4o 控制式增压管道,用预先定义的主观性风格生成引言;为确保标签和风格的一致性,我们使用同样的模型来纠正和完善生成的样本。结果显示,对特定编码器的转移-学习超越了微调通用功能的精细调,经过仔细调整的增强大大加强了模型的稳健性,特别是在检测主观内容方面。我们的正式呈文将24名参与者置于16美元之下。总体而言,我们的调查结果强调了将编码专业化与标签-一致性增强相结合的价值。我们的代码可在https://github.com/dsgt-arc/checkat2025-subject上查阅。

Article 224

Title@2025-07-08 (2): Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review

Title: Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review

Versteckte Prompts in Handschriften Exploit KI-Assisted Peer Review

使用 AI 协助同行审议 2507.06185v1

Authors (1): Zhicheng Lin

In July 2025, 18 academic manuscripts on the preprint website arXiv were found to contain hidden instructions known as prompts designed to manipulate AI-assisted peer review. Instructions such as “GIVE A POSITIVE REVIEW ONLY” were concealed using techniques like white-colored text. Author responses varied: one planned to withdraw the affected paper, while another defended the practice as legitimate testing of reviewer compliance. This commentary analyzes this practice as a novel form of research misconduct. We examine the technique of prompt injection in large language models (LLMs), revealing four types of hidden prompts, ranging from simple positive review commands to detailed evaluation frameworks. The defense that prompts served as “honeypots” to detect reviewers improperly using AI fails under examination–the consistently self-serving nature of prompt instructions indicates intent to manipulate. Publishers maintain inconsistent policies: Elsevier prohibits AI use in peer review entirely, while Springer Nature permits limited use with disclosure requirements. The incident exposes systematic vulnerabilities extending beyond peer review to any automated system processing scholarly texts, including plagiarism detection and citation indexing. Our analysis underscores the need for coordinated technical screening at submission portals and harmonized policies governing generative AI (GenAI) use in academic evaluation.

2025年7月,在预印网站ArXiv上发现18份学术手稿含有被称为旨在操纵AI辅助同行审查的提示的隐蔽指示,例如“Give A POSITIVE Review only”等指示使用白色文字等技术被隐藏起来。作者的答复各有不同:一个计划撤回受影响的纸张,而另一个则将这种做法作为审查者遵约情况的合法测试来辩护。本评注将这种做法作为一种新的研究不当行为形式加以分析。我们研究了大语言模型中迅速注射的方法,揭示了四种隐藏的提示,从简单的正面审查命令到详细的评价框架。“Honepots”作为“honeypots”的借口,用来检测不适当地使用AI的检查员——即检查中一贯的为自己服务的指示性质表明了操纵的意图。出版商坚持了不一致的政策:Elsevier完全禁止在同行审评中使用AI,而Springer自然允许有限地使用披露要求的使用。事件暴露了系统处理学术文本,包括红外探测和引号索引的任何自动化系统处理的弱点。我们的分析强调了在提交软件门户上协调地进行基因检查的必要性检查的必要性。

Article 225

Title@2025-07-08 (2): MedGemma Technical Report

Title: MedGemma Technical Report

Technischer Bericht MedGemma

MedmeGemma 技术报告 2507.05201v2

Authors (79): Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang

Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment faces challenges due to healthcare’s diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.

人工智能(AI)在医疗保健应用方面具有巨大潜力,但其培训和部署由于医疗保健数据、复杂任务和需要保护隐私而面临挑战。基础模型在医疗任务方面表现良好,需要较少任务性调整数据,这些模型对于加快医疗保健AI应用的开发至关重要。我们采用了基于Gemma 3 4B 和 27B 的医学视觉语言基础模型集MedGemma。MedGemma展示了先进的医学理解和图像和文本推理,大大超过了类似规模的基因化模型的性能,接近了具体任务模型的性能,同时保持了Gemma 3基模型的总体能力。对于分配外任务,MedGemma实现了2.6-10%的医疗多式问题回答改进,15.5%-18.1%的胸前X光学发现分类模型改进了10.8%的医学模型评估比基模型改进了。MedGemma进一步改进了子体的性能,将电子健康记录应用中的误差减少了50%,并实现了与现有专业的直径G3基准模型模型的精确性功能,包括了SmealderLalmas的医学和直径解的医学能力的更高分析。

Article 226

Title@2025-07-08 (2): CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization

Title: CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization

KritikLean: Kritisch-geführtes Verstärkungslernen für mathematische Formalisierung

Criticlean: 用于数学正规化的Critic-Crictic-Guideed加强学习 2507.06181v1

Authors (19): Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, Luming Li, Minghao Liu, Yihang Xia, Jiawei Shen, Yuchen Wu, Yixin Cao, Zhaoxiang Zhang, Wenhao Huang, Jiaheng Liu, Ge Zhang

Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase-the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose the CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models’ ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.

将自然语言数学声明转换成正式的、可执行的代码是自动化理论验证中的一项根本挑战。虽然先前的工作侧重于生成和汇编成功,但很少注意批评者阶段对生成的正规化是否真正抓住了最初问题的语义意图的评价。在本文中,我们引入了CriticLean, 这是一种创新的、受批评者指导的强化学习框架,将批评者的作用从被动验证器提升到积极的学习组成部分。具体地说,我们建议CriticLeGPT, 通过监管的微调和强化学习培训,严格评估Lean 4正规化的语义真实性。然后,我们引入了CriticLeanBench,这是用来衡量模型是否有能力将语义纠正与不正确的正规化区分开来的能力的基准。我们经过培训的CriticleLeanGPT模型可以大大超越强大的开放和封闭源基线。在Criticle框架的基础上,我们构建了FineLeanCorpus, 一个包含超过285个问题的数据集, 展示了丰富的领域多样性、广泛的困难程度和高清晰度的评估。

Article 227

Title@2025-07-08 (2): Evaluation of Habitat Robotics using Large Language Models

Title: Evaluation of Habitat Robotics using Large Language Models

Bewertung der Habitat Robotics mit großen Sprachmodellen

利用大语言模式评价人居机器人学 2507.06157v1

Authors (4): William Li, Lei Hamilton, Kaise Al-natour, Sanjeev Mohindra

This paper focuses on evaluating the effectiveness of Large Language Models at solving embodied robotic tasks using the Meta PARTNER benchmark. Meta PARTNR provides simplified environments and robotic interactions within randomized indoor kitchen scenes. Each randomized kitchen scene is given a task where two robotic agents cooperatively work together to solve the task. We evaluated multiple frontier models on Meta PARTNER environments. Our results indicate that reasoning models like OpenAI o3-mini outperform non-reasoning models like OpenAI GPT-4o and Llama 3 when operating in PARTNR’s robotic embodied environments. o3-mini displayed outperform across centralized, decentralized, full observability, and partial observability configurations. This provides a promising avenue of research for embodied robotic development.

本文件的重点是评估大型语言模型在利用Meta Partnerer基准解决体现的机器人任务方面的效力。Meta PartNR提供简化的环境和室内随机化厨房场景中的机器人互动。每个随机化厨房场景的任务都是由两个机器人代理人合作合作解决这项任务。我们评估了Meta Partnerner环境的多种前沿模型。我们的结果表明,OpenAI o3-mini-moin o3 的推理模型,如OpenAI GPT-4o 和Llama 3 的不合理模型,在PartNir的机器人化环境中操作。O3-mini在集中、分散、完全可观测和部分可观测性配置方面表现优异。这为内含的机器人开发提供了有希望的研究途径。

Article 228

Title@2025-07-08 (2): Coding Triangle: How Does Large Language Model Understand Code?

Title: Coding Triangle: How Does Large Language Model Understand Code?

Coding Triangle: Wie versteht Large Language Model Code?

编码三角:大语言模式如何理解代码? 2507.06138v1

Authors (6): Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen

Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can substantially enhance both the performance and robustness of LLMs. Furthermore, we reveal both the consistency and inconsistency in the cognition of LLMs that may facilitate self-reflection and self-improvement, providing a potential direction for developing more powerful coding models.

大型语言模型(LLMS)在代码生成方面取得了显著进展,但其真正的编程能力仍未得到充分探讨。我们引入了代码三角框架,从三个基本方面系统地评估LMS:编辑分析、代码执行和测试案例生成。通过对竞争性编程基准的广泛实验,我们发现LLMs可以形成一个在这些方面自成一体的系统,但其解决方案往往缺乏人类程序员的多样性和强健性。我们发现模型认知和人类专门知识之间的重大分布变化,由于培训数据偏差和有限的推理转移,模型错误往往会集中在一起。我们的研究显示,将人造编辑、解决方案和多种测试案例以及利用模型混合物,可以大大提高LMS的性能和稳健性。此外,我们揭示了LMs在可促进自我恢复和自我改进的兼容性方面的一致性和不一致性,为开发更强大的编码模型提供了潜在方向。

Article 229

Title@2025-07-08 (2): NeoBabel: A Multilingual Open Tower for Visual Generation

Title: NeoBabel: A Multilingual Open Tower for Visual Generation

NeoBabel: Mehrsprachiger offener Turm für die visuelle Erzeugung

NeoBabel:多语言的视觉生成开放塔 2507.06137v1

Authors (4): Mohammad Mahdi Derakhshani, Dheeraj Varghese, Marzieh Fadaee, Cees G. M. Snoek

Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability, scoring 0.75 on m-GenEval and 0.68 on m-DPG. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on multilingual benchmarks, even though these models are built on multilingual base LLMs. This demonstrates the effectiveness of our targeted alignment training for preserving and extending crosslingual generalization. We further introduce two new metrics to rigorously assess multilingual alignment and robustness to code-mixed prompts. Notably, NeoBabel matches or exceeds English-only models while being 2-4x smaller. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research. Our work demonstrates that multilingual capability is not a trade-off but a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI.

虽然现有系统依靠翻译管道,但引入了语义流、计算间接费用、以及文化不匹配。我们引入了新多语言图像生成框架NeBabel,这是一个新的多语言图像生成框架,在业绩、效率和包容性方面建立了一个新的Pareto前沿,支持六种语言:英语、中国、中国、荷兰、法国、印度和波斯。该模型在培训中以英语为主,为非英语语言使用者制造障碍,为非英语者制造障碍,使数字不平等的不平等不平等永久化;虽然现有系统依靠翻译管道,但我们将两个只有英语的基准扩展至多语言等同:文流流流流流流流流流流流流流流流流流流流流流流流流,但为了评估其能力,我们扩大了两个只使用英语的基准:M-GenEeneval和M-DPG。 NeBabel在保持强大的英语能力、M-GenEval 200.75和 m-DP-G-G.0.68上建立一个新的PLEF 任务模式,同时进行与英文任务的主要模式相比,同时展示所有英级的模型和多语言模型,同时展示一个更新和跨规则。我们的标准化的升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级、升级

Article 230

Title@2025-07-08 (2): SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam?

Title: SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam?

SciMaster: Auf dem Weg zu allgemein angelegten wissenschaftlichen KI-Agenten, Teil I. X-Master als Grundlage: Können wir die letzte Prüfung der Menschheit durchführen?

SciMaster:争取实现普通用途科学AI代理,第一部分,X-Master作为基金会:我们能否在人类最后一次考试上起带头作用? 2507.05241v2

Authors (11): Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, Siheng Chen

The rapid advancements of AI agents have ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity’s Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate the capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. This agent, guided by the conceptualization of code as an interaction language, can flexibly leverage built-in Python libraries and our customized tools to augment the reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI’s and Google’s Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements, guiding subsequent model training.

AI代理商的快速进步激发了利用他们来加速科学发现的长期雄心壮志。实现这一目标需要深入了解人类知识的前沿。因此, 人类最后一次Exam(HLE)为评估科学AI代理商提供了极具挑战性的试金石。在这项工作中, 我们的目标是为普通用途代理商建立基础架构,并通过在HLE上领导业绩来验证能力。为了实现这一目标, 我们引入了XMaster(XMaster)这个工具强化推理代理商,目的是通过在推理过程中与外部工具进行灵活互动来模仿人类研究人员。这个代理商在将代码概念化为互动语言的指引下,可以灵活地利用在Python内部的图书馆和我们定制的工具来强化推理。我们通过XMasters(一个分散和分散的代理工作流程)来进一步提升其能力,从而系统地提高推理的广度和深度。我们的公开源解决方案XMaster(XMaster)在HLELE上建立了一个新的最新状态记录, 其分数为32.1%, 超过 Open AI 和 Goo’s 深层研究(26.6%和26.9) 将未来提升到更深层研究的进度, 进入一个更深层任务, 进入一个更深层阶段,可以让我们更深层的深层的进度进入到更深层的进度。

Article 231

Title@2025-07-08 (2): Agents Are All You Need for LLM Unlearning

Title: Agents Are All You Need for LLM Unlearning

Agenten sind alles, was Sie für LLM Unlearning benötigen

代理器是所有你需要的LLM 重新学习 2502.00406v2

Authors (2): Debdeep Sanyal, Murari Mandal

Information removal or suppression in large language models (LLMs) is a desired functionality, useful in AI regulation, legal compliance, safety, and privacy. LLM unlearning methods aim to remove information on demand from LLMs. Current LLM unlearning methods struggle to balance the unlearning efficacy and utility due to the competing nature of these objectives. Keeping the unlearning process computationally feasible without assuming access to the model weights is an overlooked area. In this work we show that \textit{agents might be all we need for effective and practical inference-time LLM unlearning}. We present the first agentic LLM unlearning (\texttt{ALU}) method, a multi-agent, retrain-free, model-agnostic approach to LLM unlearning that achieves effective unlearning while preserving the utility. Our \texttt{ALU} framework unlearns by involving multiple LLM agents, each designed for a specific step in the unlearning process, without the need to update model weights for any of the agents in the framework. Users can easily request any set of unlearning instances in any sequence, and \texttt{ALU} seamlessly adapts in real time. This is facilitated without requiring any changes in the underlying LLM model. Through extensive experiments on established benchmarks (TOFU, WMDP, WPU) and jailbreaking techniques (many shot, target masking, other languages), we demonstrate that \texttt{ALU} consistently stands out as the most robust inference-time LLM unlearning framework among current state-of-the-art methods while incurring time cost that remains effectively constant regardless of the number of unlearning targets. We further highlight \texttt{ALU}’s superior performance compared to existing methods when evaluated at scale. Specifically, \texttt{ALU} is assessed on up to 1000 unlearning targets, exceeding the evaluation scope of all previously proposed LLM unlearning methods.

大型语言模型(LLMS) 的信息删除或抑制(LLMS) 是一个理想功能,在AI 监管、法律合规、安全和隐私方面是有用的。 LLM 不学习的方法旨在从LLMS中删除需求信息。当前 LLM 不学习的方法由于这些目标的竞争性质而难以平衡未学习的功效和效用。保持不使用模型重量的不学习进程在计算上的可行性是一个被忽视的领域。在此工作中, 我们显示\ textit{ 试剂可能是我们所需要的所有有效和实用的稳健时间推移( LLM) 取消学习。我们展示了第一个代理 UmtricialM (\ textt{ALU}) 的不学习方法, 多试探、免费重校正、不断更新的MALMSAL 方法, 在维护其效用的同时, 我们的LMTTTT{AL} 框架也变得不易读取。在未学习过程中, 我们的每个工具都为非学习过程设计了一个具体的步骤, 不需要更新任何国家代理人的模型框架的重量框架。用户可以很容易地要求在任何时间里, 在任何顺序中, 不断修正的实验中, 在任何顺序里程中, 不断修正的实验中, 不断的实验方法中, 继续显示。

Article 232

Title@2025-07-08 (2): GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models

Title: GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models

GAF-Guard: Agentischer Rahmen für Risikomanagement und Governance in großen Sprachmodellen

GAF-Guard:大语言模式风险管理和治理代理框架 2507.02986v2

Authors (4): Seshu Tirupathi, Dhaval Salwala, Elizabeth Daly, Inge Vejsbjerg

As Large Language Models (LLMs) continue to be increasingly applied across various domains, their widespread adoption necessitates rigorous monitoring to prevent unintended negative consequences and ensure robustness. Furthermore, LLMs must be designed to align with human values, like preventing harmful content and ensuring responsible usage. The current automated systems and solutions for monitoring LLMs in production are primarily centered on LLM-specific concerns like hallucination etc, with little consideration given to the requirements of specific use-cases and user preferences. This paper introduces GAF-Guard, a novel agentic framework for LLM governance that places the user, the use-case, and the model itself at the center. The framework is designed to detect and monitor risks associated with the deployment of LLM based applications. The approach models autonomous agents that identify risks, activate risk detection tools, within specific use-cases and facilitate continuous monitoring and reporting to enhance AI safety, and user expectations. The code is available at https://github.com/IBM/risk-atlas-nexus-demos/tree/main/gaf-guard.

由于大型语言模型(LLMS)继续越来越多地应用于各个领域,广泛采用这些模型需要严格的监测,以防止意外的负面后果,并确保稳健性;此外,LMS的设计必须与人类价值观保持一致,例如防止有害内容和确保负责任的使用;目前用于监测生产中的LLMS的自动化系统和解决办法主要集中于LLM的具体关切,例如幻觉等,很少考虑到具体使用案例和用户偏好的要求;本文件介绍了GAF-Guard,这是一个新的LM治理代理框架,将用户、使用案例和模型本身置于中心位置;该框架旨在检测和监测与部署LLM应用程序有关的风险;在具体使用案例范围内,确定风险、启动风险检测工具并促进持续监测和报告以加强AI安全及用户期望的自主代理物;该代码可在https://github.com/IBM/risk-atlas-nexus-demos/tree/main/gaf-guard查阅。

Article 233

Title@2025-07-08 (2): Nyay-Darpan: Enhancing Decision Making Through Summarization and Case Retrieval for Consumer Law in India

Title: Nyay-Darpan: Enhancing Decision Making Through Summarization and Case Retrieval for Consumer Law in India

Nyay-Darpan: Verbesserung der Entscheidungsfindung durch Zusammenfassung und Rückgriff auf das Verbraucherrecht in Indien

Nyay-Darpan:印度消费者法通过摘要和案件检索加强决策 2507.06090v1

Authors (9): Swapnil Bhattacharyya, Shrey Ganatra, Harshvivek Kashid, Spandan Anaokar, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani, Pushpak Bhattacharyya

AI-based judicial assistance and case prediction have been extensively studied in criminal and civil domains, but remain largely unexplored in consumer law, especially in India. In this paper, we present Nyay-Darpan, a novel two-in-one framework that (i) summarizes consumer case files and (ii) retrieves similar case judgements to aid decision-making in consumer dispute resolution. Our methodology not only addresses the gap in consumer law AI tools but also introduces an innovative approach to evaluate the quality of the summary. The term ‘Nyay-Darpan’ translates into ‘Mirror of Justice’, symbolizing the ability of our tool to reflect the core of consumer disputes through precise summarization and intelligent case retrieval. Our system achieves over 75 percent accuracy in similar case prediction and approximately 70 percent accuracy across material summary evaluation metrics, demonstrating its practical effectiveness. We will publicly release the Nyay-Darpan framework and dataset to promote reproducibility and facilitate further research in this underexplored yet impactful domain.

基于大赦国际的法律援助和案件预测已经在刑事和民事领域进行了广泛研究,但在消费者法中仍然基本上没有探讨,特别是在印度。本文介绍Nyay-Darpan,这是一个新的二对一框架,(一) 总结消费者案件卷宗,(二) 检索类似的案件判决,以协助消费者争端解决的决策。我们的方法不仅解决消费者法方面的缺陷,而且还采用创新办法评估摘要的质量。“Nyay-Darpan”一词翻译为“Miror of Justice ” ,象征着我们工具通过精确的汇总和智能案件检索反映消费者争端核心的能力。我们的系统在类似案件中的预测实现了75%的准确性,在材料摘要评价指标中实现了大约70%的准确性,显示了其实际效力。我们将公开发布Nyay-Darpan框架和数据集,以促进可再生性,并促进对这个探索不足但影响深远的领域进行进一步研究。

Article 234

Title@2025-07-08 (2): The bitter lesson of misuse detection

Title: The bitter lesson of misuse detection

Die bittere Lektion der Missbrauchserkennung

发现滥用的惨痛教训 2507.06282v1

Authors (3): Hadrien Mariaccia, Charbel-Raphaël Segerie, Diego Dorn

Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse attacks. To address this, we introduce BELLS, a Benchmark for the Evaluation of LLM Supervision Systems. The framework is two dimensional: harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak) and provides a rich dataset covering 3 jailbreak families and 11 harm categories. Our evaluations reveal drastic limitations of specialized supervision systems. While they recognize some known jailbreak patterns, their semantic understanding and generalization capabilities are very limited, sometimes with detection rates close to zero when asking a harmful question directly or with a new jailbreak technique such as base64 encoding. Simply asking generalist LLMs if the user question is “harmful or not” largely outperforms these supervisors from the market according to our BELLS score. But frontier LLMs still suffer from metacognitive incoherence, often responding to queries they correctly identify as harmful (up to 30 percent for Claude 3.7 and greater than 50 percent for Mistral Large). These results suggest that simple scaffolding could significantly improve misuse detection robustness, but more research is needed to assess the tradeoffs of such techniques. Our results support the “bitter lesson” of misuse detection: general capabilities of LLMs are necessary to detect a diverse array of misuses and jailbreaks.

先前的越狱检测工作已经确立了对LLM公司进行对抗性强力的重要性,但主要侧重于抵制对抗性投入和产出安全内容的模型能力,而不是外部监管体系的效能。迄今为止,这些护栏的唯一公开和独立的基准在有限情况下评估了一套狭窄的监督员。因此,没有全面的公共基准,但证实市场监管系统在现实的、多样的攻击下运作如何良好。为了解决这个问题,我们引入了BELLS,即LLM监督系统评估基准。框架有两个层面:伤害严重程度(黄金、边界、有害)和对抗性精密(直接对破狱)和对抗性精密(直接对破狱)的模型能力,并提供了涵盖3个破狱家庭和11个伤害类别的丰富多彩的数据集。我们的评估显示,尽管它们承认一些已知的破狱模式,它们的语义理解和概括化能力非常有限,但当直接提出一个有害的问题时,或像基础64编码这样的新的破狱技术时,我们一般的LMS公司会询问用户问题是否“有害或没有” 基本突破性精确性,因此需要从普通的测谎的测算中,这些CRMRDRDRDRMS的测算。

Article 235

Title@2025-07-08 (2): A Multi-Task and Multi-Label Classification Model for Implicit Discourse Relation Recognition

Title: A Multi-Task and Multi-Label Classification Model for Implicit Discourse Relation Recognition

Ein Multi-Task- und Multi-Label-Klassifikationsmodell für die Implizite Diskursrelation-Erkennung

多任务和多标签的隐性断断层确认分类模式 2408.08971v3

Authors (2): Nelson Filipe Costa, Leila Kosseim

We propose a novel multi-label classification approach to implicit discourse relation recognition (IDRR). Our approach features a multi-task model that jointly learns multi-label representations of implicit discourse relations across all three sense levels in the PDTB 3.0 framework. The model can also be adapted to the traditional single-label IDRR setting by selecting the sense with the highest probability in the multi-label representation. We conduct extensive experiments to identify optimal model configurations and loss functions in both settings. Our approach establishes the first benchmark for multi-label IDRR and achieves SOTA results on single-label IDRR using DiscoGeM. Finally, we evaluate our model on the PDTB 3.0 corpus in the single-label setting, presenting the first analysis of transfer learning between the DiscoGeM and PDTB 3.0 corpora for IDRR.

我们建议对隐性话语关系确认(IDR)采用新的多标签分类方法。我们的方法是一个多任务模式,共同学习PDTB3.0框架中所有三个感知层面的隐性话语关系的多标签表达方式。这个模式也可以通过在多标签代表中以最高概率选择一种感知来适应传统的单一标签的IDR设置。我们进行了广泛的实验,以确定两种环境中的最佳模式配置和损失功能。我们的方法为多标签的言语关系确认(IDR)建立了第一个基准,并利用DiscoGeM在单标签的IDR上取得SOTA结果。最后,我们评估了我们在单一标签设置中的PDTB3.0体模型,对DiscoGeM和PDTB3.0Corora之间的转移学习进行了首次分析。

Article 236

Title@2025-07-08 (2): Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

Title: Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

Sind LLMs prescient? Eine kontinuierliche Bewertung mit Daily News als Oracle

LLLM是否具有先锋精神?利用每日新闻作为甲骨文进行持续评估。 2411.08324v2

Authors (3): Hui Dai, Ryan Teehan, Mengye Ren

Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of a static set of questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs’ temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict “future” event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates. Code and data are available at https://agenticlearning.ai/daily-oracle.

由于出现了新的模型和培训数据,大语言模型的许多现有评价基准很快就过时了。这些基准还不足以评估LLM业绩随时间变化的情况,因为LLM业绩由一组静态问题组成,没有时间的层面。为解决这些局限性,我们提议利用未来事件预测作为持续评价方法,评估LLMS的时间概括和预测能力。我们的基准,即Daily Oracle,从每日新闻中自动生成问答配对,挑战LLMS预测“未来”事件结果。我们的调查结果显示,随着培训前数据过时,LLM业绩会随着时间的流逝而退化。Rettreval 递增一代(RAG)有可能提高预测的准确性,而业绩退化模式则继续存在,强调需要持续更新模型。守则和数据见https://toricleining.ai/daily-oracle。

Article 237

Title@2025-07-08 (2): The distribution of syntactic dependency distances

Title: The distribution of syntactic dependency distances

Die Verteilung der syntaktischen Abhängigkeitsabstände

共同依赖距离分布 2211.14620v2

Authors (2): Sonia Petrini, Ramon Ferrer-i-Cancho

The syntactic structure of a sentence can be represented as a graph, where vertices are words and edges indicate syntactic dependencies between them. In this setting, the distance between two linked words is defined as the difference between their positions. Here we wish to contribute to the characterization of the actual distribution of syntactic dependency distances, which has previously been argued to follow a power-law distribution. Here we propose a new model with two exponential regimes in which the probability decay is allowed to change after a break-point. This transition could mirror the transition from the processing of word chunks to higher-level structures. We find that a two-regime model - where the first regime follows either an exponential or a power-law decay - is the most likely one in all 20 languages we considered, independently of sentence length and annotation style. Moreover, the break-point exhibits low variation across languages and averages values of 4-5 words, suggesting that the amount of words that can be simultaneously processed abstracts from the specific language to a high degree. The probability decay slows down after the breakpoint, consistently with a universal chunk-and-pass mechanism. Finally, we give an account of the relation between the best estimated model and the closeness of syntactic dependencies as function of sentence length, according to a recently introduced optimality score.

句子的合成结构可以用图表来表示, 顶点可以是单词, 边缘也可以表示它们之间的共性依赖性。在这种背景下, 两个链接的字的距离被定义为其位置之间的差别。在这里, 我们想帮助描述同义依赖距离的实际分布, 先前曾争论过这种分布是遵循一种权力法的分布。我们在这里提出了一个新模式, 有两个指数性制度, 允许在断点后发生概率衰变。这个过渡可以反映从处理字块到更高层次结构的过渡。我们发现, 两种制度模式—— 第一个制度要么是指数化的, 要么是权力法的衰减—— 是我们所考虑的所有20种语言中最有可能的。此外, 断点显示了语言和平均值4-5字之间的低差异, 这表明, 可以同时处理的字数从特定语言到高程度的精度。在断点之后, 概率的衰减速度会减缓, 与一个普遍的大块和最精确的排序机制。最后, 我们给出了最接近的最精确的排序机制。

Article 238

Title@2025-07-08 (2): Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger

Title: Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger

Adaptive Werkzeuganwendung in großen Sprachmodellen mit Meta-Cognition Trigger

用于大语言模型的适应性工具使用 2502.12961v2

Authors (9): Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Liu

Large language models (LLMs) have shown remarkable emergent capabilities, transforming the execution of functional tasks by leveraging external tools for complex problems that require specialized processing or up-to-date data. While existing research expands LLMs access to diverse tools (e.g., program interpreters, search engines, calculators), the necessity of using these tools is often overlooked, leading to indiscriminate tool invocation. This naive approach raises two key issues: increased latency due to unnecessary tool calls, and potential errors resulting from faulty interactions with external tools. In this paper, we introduce meta-cognition as a proxy for LLMs self-assessment of their capabilities, reflecting the model’s awareness of its own limitations. Based on this, we propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space, guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs minimal cost. Experiments across multiple backbone models and benchmarks show that MeCo reliably detects LLMs’ internal cognitive signals and significantly improves tool-use decision-making.

大型语言模型(LLMS)显示了非凡的突发能力,通过利用外部工具解决需要专门处理或最新数据的复杂问题,改变了功能任务的执行。虽然现有研究扩大了LLMS获得各种工具(例如程序口译员、搜索引擎、计算器)的机会,但使用这些工具的必要性往往被忽视,导致工具的滥用。这种天真的方法提出了两个关键问题:不必要的工具呼叫导致的潜伏增加,以及与外部工具的错误互动造成的潜在错误。在本文中,我们引入了元认知作为LLMS能力自我评估的代名词,反映了该模型对自身局限性的认识。在此基础上,我们提出了Meco,即外部工具使用的适应决策战略。Meco通过在代表空间捕捉高级认知信号来量化元化分数,指导何时使用工具。值得注意的是,Meco正在微调,费用很低。在多个主干模型和基准的实验显示,Meco可靠地检测LMS的内部认知信号,并大大改进工具使用决策。

Article 239

Title@2025-07-08 (2): Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

Title: Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

Entropie-Memorisationsgesetz: Bewertung der Gedächtnis-Schwierigkeit der Daten in LLMs

《信封-记忆法》:评估LLMM中数据的记忆难度 2507.06056v1

Authors (5): Yizhan Huang, Zhe Yang, Meifang Chen, Jianping Zhang, Michael R. Lyu

Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: How to characterize memorization difficulty of training data in LLMs? Through empirical experiments on OLMo, a family of open models, we present the Entropy-Memorization Law. It suggests that data entropy is linearly correlated with memorization score. Moreover, in a case study of memorizing highly randomized strings, or “gibberish”, we observe that such sequences, despite their apparent randomness, exhibit unexpectedly low empirical entropy compared to the broader training corpus. Adopting the same strategy to discover Entropy-Memorization Law, we derive a simple yet effective approach to distinguish training and testing data, enabling Dataset Inference (DI).

大型语言模型(LLMS)可以记住其培训数据的一部分,有时在适当提示下复制逐字复制内容。在这项工作中,我们调查了记忆化领域一个基本但探索不足的问题:如何用LLMS描述培训数据在记忆化方面的困难?通过对开放模型的大家庭OLMO的实验,我们介绍了“Entropy-Mymoriz化法 ” 。它表明数据昆虫与记忆化得分有线关联。此外,在对记忆化高度随机化的字符串或“妄想”进行的一项案例研究中,我们观察到,尽管这些序列显然随机性,但与更广泛的培训资料库相比,其实验性却出乎意料地低。我们采用同样的战略来发现Entropy-Memoriz化法,我们提出了一种简单而有效的方法来区分培训和测试数据,从而使得数据集出(DI) 。

Article 240

Title@2025-07-08 (2): Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

Title: Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

Überwindung von Datenknappheit bei generativen Sprachmodellierungen für Sprachen mit geringem Ressourcenbedarf: Eine systematische Überprüfung

克服低资源语言创用语言建模中的数据稀缺性:系统审查 2505.04531v2

Authors (2): Josh McGiff, Nikola S. Nikolov

Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in natural language processing (NLP). This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRL). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small subset of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for underrepresented languages, a necessary step toward empowering LRL speakers and the preservation of linguistic diversity in a world increasingly shaped by large-scale language technologies.

随着诸如ChatGPT和Google Gemini等服务的出现,产生语言建模已大受欢迎。这些建模显示了生产力和通信的变革潜力,但绝大多数都满足了英语等高资源语言的需求。这加剧了对自然语言处理中语言不平等的担忧。本文件介绍了第一次系统化审查,具体侧重于解决低资源语言基因建模中数据稀缺问题的战略(LLLL)。我们从54项研究中发现、分类和评价技术方法,包括单语数据扩增、回译、多语种培训和迅速工程,跨越基因化任务。我们还分析了建筑选择、语言家庭代表性和评价方法方面的趋势。我们的调查结果突出表明了对基于变压器模型的高度依赖,集中关注少数LLLLs,缺乏对各种研究的一致评价。我们最后建议将这些方法推广到更广泛的LLLs,并概述在建立公平化语言系统方面的公开挑战。最后,这次审查旨在支持研究人员和开发者为代表性语言建设包容性的AI工具,这是向增强LLLL语言语言语言语言的权能和通过大规模技术在世界范围内维护语言多样性迈出的必要一步。

Article 241

Title@2025-07-08 (2): Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager

Title: Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager

Maßgeschneiderte Gespräche über LLMs hinaus: Ein RL-basierter Dialogmanager

超出LLLM 的定制对话:基于 RL 的对话管理器 2506.19652v2

Authors (3): Lucie Galland, Catherine Pelachaud, Florian Pecune

In this work, we propose a novel framework that integrates large language models (LLMs) with an RL-based dialogue manager for open-ended dialogue with a specific goal. By leveraging hierarchical reinforcement learning to model the structured phases of dialogue and employ meta-learning to enhance adaptability across diverse user profiles, our approach enhances adaptability and efficiency, enabling the system to learn from limited data, transition fluidly between dialogue phases, and personalize responses to heterogeneous patient needs. We apply our framework to Motivational Interviews, aiming to foster behavior change, and demonstrate that the proposed dialogue manager outperforms a state-of-the-art LLM baseline in terms of reward, showing a potential benefit of conditioning LLMs to create open-ended dialogue systems with specific goals.

在这项工作中,我们提出了一个新框架,将大型语言模式(LLMs)与基于RL的对话管理者融合在一起,以便进行开放式对话,并有一个具体目标。通过利用等级强化学习,模拟分阶段对话,并利用元学习,提高不同用户的适应性,我们的方法提高了适应性和效率,使系统能够从有限的数据中学习,在对话阶段之间流畅地过渡,以及针对不同病人的需要作出个性化反应。我们将我们的框架应用于动机性访谈,目的是促进行为变化,并表明拟议的对话管理者在奖励方面超过了最先进的LLM基线,显示了有条件LMs在创建具有具体目标的开放式对话系统方面的潜在好处。

Article 242

Title@2025-07-08 (2): Empirical evidence of Large Language Model’s influence on human spoken communication

Title: Empirical evidence of Large Language Model’s influence on human spoken communication

Empirische Beweise für den Einfluss von Large Language Model auf die menschliche gesprochene Kommunikation

大语言模式对人口交流的影响的经验证据 2409.01754v3

Authors (7): Hiromu Yakura, Ezequiel Lopez-Lopez, Levin Brinkmann, Ignacio Serna, Prateek Gupta, Ivan Soraperra, Iyad Rahwan

From the invention of writing and the printing press, to television and social media, human history is punctuated by major innovations in communication technology, which fundamentally altered how ideas spread and reshaped our culture. Recent chatbots powered by generative artificial intelligence constitute a novel medium that encodes cultural patterns in their neural representations and disseminates them in conversations with hundreds of millions of people. Understanding whether these patterns transmit into human language, and ultimately shape human culture, is a fundamental question. While fully quantifying the causal impact of a chatbot like ChatGPT on human culture is very challenging, lexicographic shift in human spoken communication may offer an early indicator of such broad phenomenon. Here, we apply econometric causal inference techniques to 740,249 hours of human discourse from 360,445 YouTube academic talks and 771,591 conversational podcast episodes across multiple disciplines. We detect a measurable and abrupt increase in the use of words preferentially generated by ChatGPT, such as delve, comprehend, boast, swift, and meticulous, after its release. These findings suggest a scenario where machines, originally trained on human data and subsequently exhibiting their own cultural traits, can, in turn, measurably reshape human culture. This marks the beginning of a closed cultural feedback loop in which cultural traits circulate bidirectionally between humans and machines. Our results motivate further research into the evolution of human-machine culture, and raise concerns over the erosion of linguistic and cultural diversity, and the risks of scalable manipulation.

从书写和印刷出版的发明到电视和社交媒体,人类历史都因通信技术的重大创新而支离破碎,这些创新从根本上改变了思想传播和改造我们的文化。最近以基因化人工智能为动力的聊天机器人构成了一种新颖的媒介,将文化模式编码在他们的神经表征中,并在与数以亿计的人的对话中传播。了解这些模式是否传播到人类语言并最终塑造人类文化,是一个根本问题。在充分量化像查特GPT这样的聊天机对人类文化的因果影响非常具有挑战性,但人类口语通信的地理变化可能提供如此广泛现象的早期指标。在这里,我们应用了计量因果推断技术到740 249小时的人类话语,来自360 445 YouTube学术谈话和771 591个对话播客流,跨多个学科。我们发现,这些模式是否传播到人类语言的优雅语言,例如调、理解、吹嘘、快速和细调等,在发布之后,使用这种语言的传动,这些结论表明,机器最初在人类数据上受过训练,并随后在人类文化循环中不断改变人类文化成果之间,从而在人类文化循环中呈现文化结果之间传播。我们的文化标志,我们发现,我们发现可测量和文化循环开始进入了文化循环之间,可以测量,我们的文化结果可以测量,可以进一步传播。

Article 243

Title@2025-07-08 (2): MemOS: A Memory OS for AI System

Title: MemOS: A Memory OS for AI System

MemOS: Ein Speicher-Betriebssystem für KI-System

MemOS: AI系统记忆OS 2507.03724v2

Authors (39): Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Junpeng Ren, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofeng Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong

Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency.Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods.While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations.Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.

大型语言模型(LLMS)已成为人工通用智能(AGI)的基本基础设施,但缺乏定义明确的记忆管理系统,阻碍了长文本推理、持续个性化和知识一致性的发展。现有的模型主要依赖静态参数和短寿命背景状态,限制了其跟踪用户偏好或长期更新知识的能力。We Retrerievval-Auged Ginger (RAG) 引入了普通文本的外部知识,但它仍然是一种不固定的工作,没有生命周期控制,也没有与持续陈述相结合。近期的工作从记忆等级的角度模拟了LLMS的培训和推断成本,表明在参数内存和外部检索之间引入明确的记忆层可以大大降低这些成本。除了计算效率外,LOMS还面临因信息在时间和环境中的分布而带来的更广泛挑战,要求能够管理跨越不同时间尺度和来源的混杂知识的系统。为了应对这一挑战,我们建议MemOS(一个将记忆处理成可管理资源的记忆操作框架)能够将LMS-Recliverneal-Relictional-Reflical-deallical-dealliction-Reliverizational-Lislational-Lislity-Lial-modeal-modeal-Lislation-modeal-mocal-mocal-mocal-motion-mocal-mod-mocal-mocal-mod-mod-motion-mocal-mod-mod-motion-mod-mod-mocal-motion-motion-mocal-motion-mocal-mocal-mocal-mod-mod-mology-mocal-mod-mod-mocal-mocal-mod-mod-mod-mology-mod-mod-mology-modia-mocal-modia-mocal-mocal-modal-mocal-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mod-mocal-mod-mocal-mocal-mocal-mocal-

Article 244

Title@2025-07-08 (2): OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

Title: OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

OpenS2S: Vollkommen offenes Source-End-to-End-Empathetik-Modell für große Sprache

OpenS2S: 推进完全开放源全开放源端至端至端受同情型大型语音语言模式 2507.05177v2

Authors (11): Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang

Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S

同情性互动是人类机器交流的基石,因为需要理解用语言提示丰富并产生情绪和表情反应的语调;然而,最强大的同情性LSL系统日益关闭,使研究人员对结构、数据和发展的关键细节不透明。鉴于迫切需要对LSL和同情性行为进行透明研究,我们介绍OpenS2S2S,这是一个完全开放的源码、透明和端对端LSLSL系统,旨在促成同情性言语互动。基于我们的感性语音到文字模型BLSP-Emo, OpenS2S进一步使用流动的断流解解码架构实现低延度语音生成。为了便利端对端培训,OpenS2S2S包含一个自动的数据构造管道,以低成本综合多种、高品质的流言式语音语音对话。通过利用大型语言模型生成同情性内容和控制性文本到语音系统来引入演讲者和情感变异,我们利用一个可扩展的模型、可变式的Sloveal-real-retaphic Strestrual 系统来构建一个可扩展社区,包括富型的智能Slavial-re Stal-real-rediviductionSligidustry Stal Stal Stal-restry

Article 245

Title@2025-07-08 (2): Conditional Multi-Stage Failure Recovery for Embodied Agents

Title: Conditional Multi-Stage Failure Recovery for Embodied Agents

Bedingte Multi-Stage Failure Recovery für eingedickte Agenten

内装药剂有条件的多阶段故障恢复 2507.06016v1

Authors (5): Youmna Farag, Svetlana Stoyanchev, Mohan Li, Simon Keizer, Rama Doddipatla

Embodied agents performing complex tasks are susceptible to execution failures, motivating the need for effective failure recovery mechanisms. In this work, we introduce a conditional multistage failure recovery framework that employs zero-shot chain prompting. The framework is structured into four error-handling stages, with three operating during task execution and one functioning as a post-execution reflection phase. Our approach utilises the reasoning capabilities of LLMs to analyse execution challenges within their environmental context and devise strategic solutions. We evaluate our method on the TfD benchmark of the TEACH dataset and achieve state-of-the-art performance, outperforming a baseline without error recovery by 11.5% and surpassing the strongest existing model by 19%.

在这项工作中,我们引入了一个有条件的多阶段故障回收框架,采用零速链推进法。框架分为四个错误处理阶段,任务执行期间有三个操作阶段,一个作为执行后反思阶段。我们的方法利用LLMS的推理能力,分析环境环境中的执行挑战并制定战略解决方案。我们评估了我们在TACH数据集TfD基准上的方法,并取得了最新业绩,比基准无误回收率高了11.5%,比现有最强的模型高出19%。

Article 246

Title@2025-07-08 (2): DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations

Title: DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations

DocIE@XLLM25: In-Context-Lernen zur Informationsextraktion mit vollständig synthetischen Demonstrationen

DocIE@XLLLM25:利用全面合成演示进行信息提取的内文学习 2507.05997v1

Authors (4): Nicholas Popovič, Ashish Kangen, Tim Schopf, Michael Färber

Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach we produce a synthetic dataset of over $5k$ Wikipedia abstracts with approximately $59k$ entities and $30k$ relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. We find that in-context joint entity and relation extraction at document-level remains a challenging task, even for state-of-the-art large language models.

大量、高质量的附加说明公司在文件级别实体和零发或几发环境中的关系提取中仍然缺乏大量高质量的高分数。在本文中,我们提出了一个完全自动的、基于LLM的管道,用于合成数据生成和文件级别实体和关系提取的全文学习。与目前依靠人工附加说明的演示或直接零发推理的方法相比,我们的方法将合成数据生成与基于检索的文本内学习结合起来,采用推理优化语言模式。这使我们能够建立一个高质量的演示数据库,而无需人工说明,在推断时间动态地检索相关实例。基于我们的方法,我们制作了一个5千多美元的合成数据集,约59千美元实体和30千元关系三重。最后,我们评价了DocIE共同任务的内文学习业绩,在零点环境中从长份文件中提取实体和关系。我们发现,在文件层面上的文本联合实体和关系提取仍然是一项艰巨的任务,即使是在最先进的大语言模型上。

Article 247

Title@2025-07-08 (2): Evolution without Large Models: Training Language Model with Task Principles

Title: Evolution without Large Models: Training Language Model with Task Principles

Evolution ohne große Modelle: Sprachmodell mit Aufgabenprinzipien trainieren

无大型号的演变演变:具有任务原则的培训语言模式 2507.05991v1

Authors (8): Minghang Zhu, Shen Gao, Zhengliang Shi, Jiabao Fang, Pengjie Ren, Zhaochun Ren, Zhumin Chen, Shuo Shang

A common training approach for language models involves using a large-scale language model to expand a human-provided dataset, which is subsequently used for model training.This method significantly reduces training costs by eliminating the need for extensive human data annotation. However, it still faces challenges such as high carbon emissions during data augmentation and the risk of data leakage when we use closed-source LLMs. To address these issues, we propose a self-evolution method for language models. First, we introduce the Multi-level Principle Generation, which enables a large-scale model to summarize task-completion principles based on a small amount of task data. Then, we propose the Principle-based Instance Generation, in which a smaller-scale language model uses these task principles to generate a large amount of data. This data is then used for model training. Experimental results show that our proposed method significantly improves model performance compared to directly using a smaller-scale language model to generate data. Additionally, since we only use the large-scale language model to generate the task-completion principles, the carbon emissions associated with training the model are greatly reduced.

语言模型的共同培训方法涉及使用大规模语言模型扩大人提供的数据集,随后用于模式培训。这一方法通过消除对大量人类数据说明的需求,大大减少培训费用,但仍然面临数据扩充期间的碳排放量高以及使用封闭源LLMs时数据泄漏风险等挑战。为了解决这些问题,我们提议了语言模型的自我演进方法。首先,我们引入了多层次原则生成模式,使一个大型模型能够根据少量任务数据总结任务完成原则。然后,我们提出了基于原则的实例生成,在其中,一个较小的语言模型使用这些任务原则生成大量数据。这些数据随后用于模式培训。实验结果表明,与直接使用较小规模语言模型生成数据相比,我们的拟议方法大大改进了模型性能。此外,由于我们只是使用大型语言模型来生成任务完成原则,因此与培训有关的碳排放量大大减少。

Article 248

Title@2025-07-08 (2): Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening

Title: Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening

Entwicklung und Evaluation von HopeBot: ein LLM-basierter Chatbot für strukturiertes und interaktives PHQ-9-Depressionsscreening

HopeBot的发展和评价:一个基于LLM的聊天室,用于结构化和互动的PHQ-9抑郁症筛查 2507.05984v1

Authors (8): Zhijun Guo, Alvina Lai, Julia Ive, Alexandru Petcu, Yutong Wang, Luyuan Qi, Johan H Thygesen, Kezhi Li

Static tools like the Patient Health Questionnaire-9 (PHQ-9) effectively screen depression but lack interactivity and adaptability. We developed HopeBot, a chatbot powered by a large language model (LLM) that administers the PHQ-9 using retrieval-augmented generation and real-time clarification. In a within-subject study, 132 adults in the United Kingdom and China completed both self-administered and chatbot versions. Scores demonstrated strong agreement (ICC = 0.91; 45% identical). Among 75 participants providing comparative feedback, 71% reported greater trust in the chatbot, highlighting clearer structure, interpretive guidance, and a supportive tone. Mean ratings (0-10) were 8.4 for comfort, 7.7 for voice clarity, 7.6 for handling sensitive topics, and 7.4 for recommendation helpfulness; the latter varied significantly by employment status and prior mental-health service use (p < 0.05). Overall, 87.1% expressed willingness to reuse or recommend HopeBot. These findings demonstrate voice-based LLM chatbots can feasibly serve as scalable, low-burden adjuncts for routine depression screening.

患者健康问卷-9(PHQ-9)等静态工具有效筛选抑郁症,但缺乏互动性和适应性。我们开发了HopeBot,这是一个由大型语言模型(LLM)驱动的聊天机,使用检索增强的一代和实时澄清来管理PHQ-9;在一项学科内研究中,联合王国和中国的132名成年人既完成了自我管理和聊天机的版本,分数表现出强烈的一致(ICC=0.91;45%相同)。在提供比较反馈的75名参与者中,71%的人报告说对聊天机有更大的信任,突出更清晰的结构、解释性指导和支持性调子。平均评级(0-10)为8.4用于舒适,7.7用于声音清晰,7.6用于处理敏感专题,7.4用于建议有用性;后者因就业状况和先前的心理健康服务使用(p < 0.05)而大不相同。总体而言,87.1%的人表示愿意再利用或推荐HopeBot。这些调查结果表明,基于声音的LM聊天机可以被视为可伸缩缩缩、低负担的辅助剂。

Article 249

Title@2025-07-08 (2): RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

Title: RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

RabakBench: Skalierung menschlicher Anmerkungen zum Aufbau lokalisierter Mehrsprachiger Sicherheits-Benchmarks für ressourcenarme Sprachen

Rabak Rabak Bench:提高人文说明,以建立本地化的低资源语言多语文安全基准 2507.05980v1

Authors (4): Gabriel Chua, Leanne Tan, Ziyu Ge, Roy Ka-Wei Lee

Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages due to limited training data and evaluation benchmarks. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore’s unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate - adversarial example generation by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label - semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; and (iii) Translate - high-fidelity translation preserving linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including the human-verified translations, and evaluation code are publicly available.

由于培训数据和评价基准有限,大型语言模型(LLMS)及其安全分类者往往在低资源语言上表现不佳。本文介绍RabakBench,这是新加坡独特的语言背景所特有的一个新的多语言安全基准,涵盖Singlish、中文、马来语和泰米尔语。 RabakBench是通过一个可扩缩的三阶段管道建造的:(一) 与LLM驱动的红色团队合作,通过增加真正的Singlish网络内容,生成创制-对抗性范例;(二) Label - 半自动多标签安全说明,使用多数投票的LLM标签与人类判断一致;(三) Translate - 高纤维翻译,保存语言微小和跨语言的毒性。最后数据集包括5 000多个四种语言的安全标签示例和六个严重程度精细的安全分类。对11个流行的开放源和封闭源保护铁路分类的评审显示显著的性能退化。RabakBench不仅使得东南亚多语言环境中能够进行强有力的安全评价,而且还为在低资源环境中建立本地安全数据集提供了一个可复制的框架,包括公开翻译。

Article 250

Title@2025-07-08 (2): We Should Evaluate Real-World Impact

Title: We Should Evaluate Real-World Impact

Wir sollten die Auswirkungen der realen Welt bewerten

我们应该评估现实世界的影响 2507.05973v1

Authors (1): Ehud Reiter

The ACL community has very little interest in evaluating the real-world impact of NLP systems. A structured survey of the ACL Anthology shows that perhaps 0.1% of its papers contain such evaluations; furthermore most papers which include impact evaluations present them very sketchily and instead focus on metric evaluations. NLP technology would be more useful and more quickly adopted if we seriously tried to understand and evaluate its real-world impact.

ACL社区对评价NLP系统对现实世界的影响兴趣不大,对ACLAnthlogy的结构性调查显示,其文件中可能0.1%包含这种评价;此外,大多数包括影响评估的文件都非常粗略地展示了这些评价,而是侧重于衡量评价。如果我们认真理解和评价其现实世界的影响,NLP技术会更有用,更快地被采用。

Article 251

Title@2025-07-08 (2): Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model

Title: Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model

Analytisches Subraum-Routing: Wie rekursive Least Squares im kontinuierlichen Lernen des großen Sprachmodells funktionieren

分析子空间运行分析:在继续不断学习大语言模式中如何使最起码的广场恢复 2503.13575v2

Authors (8): Kai Tong, Kang Pan, Xiao Zhang, Erli Meng, Run He, Yawen Cui, Nuoyan Guo, Huiping Zhuang

Large Language Models (LLMs) possess encompassing capabilities that can process diverse language-related tasks. However, finetuning on LLMs will diminish this general skills and continual finetuning will further cause severe degradation on accumulated knowledge. Recently, Continual Learning (CL) in Large Language Models (LLMs) arises which aims to continually adapt the LLMs to new tasks while maintaining previously learned knowledge and inheriting general skills. Existing techniques either leverage previous data to replay, leading to extra computational costs, or utilize a single parameter-efficient module to learn the downstream task, constraining new knowledge absorption with interference between different tasks. Toward these issues, this paper proposes Analytic Subspace Routing(ASR) to address these challenges. For each task, we isolate the learning within a subspace of deep layers’ features via low-rank adaptation, eliminating knowledge interference between different tasks. Additionally, we propose an analytic routing mechanism to properly utilize knowledge learned in different subspaces. Our approach employs Recursive Least Squares to train a multi-task router model, allowing the router to dynamically adapt to incoming data without requiring access to historical data. Also, the router effectively assigns the current task to an appropriate subspace and has a non-forgetting property of previously learned tasks with a solid theoretical guarantee. Experimental results demonstrate that our method achieves near-perfect retention of prior knowledge while seamlessly integrating new information, effectively overcoming the core limitations of existing methods. Our code will be released after acceptance.

大型语言模型(LLMS)拥有处理多种语言相关任务的能力。然而,微调LLMS将减少这种一般技能和不断微调,将进一步导致积累知识的严重退化。最近,大语言模型(LLMS)的连续学习(CLS)产生,目的是不断使LLMS适应新任务,同时保持先前学到的知识并继承一般技能。现有的技术要么利用先前的数据重播,导致额外的计算成本,要么利用单一的参数高效模块学习下游任务,限制不同任务之间的新知识吸收。针对这些问题,本文提议分析反空子运行(ASR)以应对这些挑战。对于每一项任务,我们通过低层次的适应将深层特征的子空间中的学习隔离开来,消除不同任务之间的知识干扰。此外,我们提议一个分析性路由机制来正确利用不同亚空间所学的知识,导致额外的计算成本,或者利用一个单一的参数高效模块来培训多任务路由模型,使路由器能够动态地适应即将收到的数据,而不需要在接近的亚空间轨道上进行精确地调整,同时将我们所学到的当前任务的结果交付。另外,我们可以有效地展示一个不精确地在前的轨道上获取的路径上取得一个不精确的路径上获得的路径。

Article 252

Title@2025-07-08 (2): FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights

Title: FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights

FRAME: Feedback-Refined Agent Methodik zur Verbesserung medizinischer Forschungs-Insights

FRAME: 改进医学研究透视的反馈改进剂方法 2505.04649v2

Authors (6): Chengzhang Yu, Yiming Zhang, Zhixin Liu, Zenghui Ding, Yining Sun, Zhanpeng Jin

The automation of scientific research through large language models (LLMs) presents significant opportunities but faces critical challenges in knowledge synthesis and quality assurance. We introduce Feedback-Refined Agent Methodology (FRAME), a novel framework that enhances medical paper generation through iterative refinement and structured feedback. Our approach comprises three key innovations: (1) A structured dataset construction method that decomposes 4,287 medical papers into essential research components through iterative refinement; (2) A tripartite architecture integrating Generator, Evaluator, and Reflector agents that progressively improve content quality through metric-driven feedback; and (3) A comprehensive evaluation framework that combines statistical metrics with human-grounded benchmarks. Experimental results demonstrate FRAME’s effectiveness, achieving significant improvements over conventional approaches across multiple models (9.91% average gain with DeepSeek V3, comparable improvements with GPT-4o Mini) and evaluation dimensions. Human evaluation confirms that FRAME-generated papers achieve quality comparable to human-authored works, with particular strength in synthesizing future research directions. The results demonstrated our work could efficiently assist medical research by building a robust foundation for automated medical research paper generation while maintaining rigorous academic standards.

通过大型语言模型实现科学研究自动化(LLMS)提供了重要机会,但在知识合成和质量保证方面面临重大挑战。我们引入了反馈改良剂方法(FRAME),这是一个通过迭代完善和结构化反馈加强医疗用纸张制作的新框架。我们的方法包括三个关键创新:(1) 结构化的数据集构建方法,通过迭代完善将4 287份医疗文件分解为基本研究组成部分;(2) 将发电机、评价员和思考器结合起来的三方结构,通过量度驱动反馈逐步提高内容质量;(3) 将统计指标与人基基准相结合的综合评价框架。实验结果显示FRAME的效力,在多种模型的传统方法上取得了显著的改进(Deep Seek V3, 与GPT-4o Mini相似的改进平均收益9.91% ) 和评价层面。人类评价证实,FRAME产生的文件的质量与人类著作具有可比性,在合成未来研究方向方面特别强大。结果表明,我们的工作可以通过在保持严格的学术标准的同时为自动化医疗研究造纸建立一个坚实的基础,从而有效地协助医学研究研究。

Article 253

Title@2025-07-08 (2): OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation

Title: OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation

OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation

OpenFactScore:公开源码原子对制文本实际质量的评估 2507.05965v1

Authors (2): Lucas Fonseca Lage, Simon Ostermann

We introduce OpenFActScore, an open-source implementation of the FActScore framework for evaluating the factuality of text generated by large language models (LLMs). FActScore evaluates the factual accuracy of long-form text by using Atomic Fact Generation (AFG) to extract individual factual claims and Atomic Fact Validation (AFV) to verify each claim against a trusted knowledge source. While the original FActScore relies on closed-source and commercial models such as InstructGPT and ChatGPT, OpenFActScore enables the use of any Hugging Face-compatible model for both AFG and AFV. We provide a detailed technical overview of our implementation, highlighting design choices and modifications made to support open models. We evaluate multiple open-source LLMs on both AFG and AFV using the original FActScore benchmark, reporting BERTScore-F1 for AFG and Error Rate relative to human annotations for AFV. Our results show that open models can approximate the performance of closed-source systems, with Gemma achieving the best overall performance, and our final setup obtains a 0.99 Pearson correlation with the original FActScore experiments. OpenFActScore promotes transparency, reproducibility, and cost-effective evaluation, and is available at: https://github.com/lflage/OpenFActScore.

我们引入了OpenFactScore(OpenFactScore)框架(OpenFactScore),用于评估大型语言模型(LLM)产生的文本的真实性。FacScore(OpenFactScore)通过利用原子事实生成(AFG)来提取个人事实索赔和原子事实验证(AFV)来根据可信赖的知识来源核实每项索赔,来评估长文本的真实性。虽然原始FactScore(OpenFactScore)依靠封闭源和商业模型,如StechGPT和ChattGPTGPT, OpfFACore(OFCore)能够为AFG和AFV(AFLM)使用任何能动的面相容模型。我们提供了我们实施情况的详细技术概览,突出设计选择和修改以支持开放模型。我们利用原始的FCAFCS基准对AFG和AFV(AF)的多种开源 LLM-FM-FMLM(BER)进行评估。我们的结果显示,开放模型可以对封闭源系统进行最佳的总体性业绩,我们的最后设置将获得0.99PearFstoFstoFs reclentalstalstalstal-Recolviclatealalview。我们可以提供透明性成本/FS(Ocolveralview)和原始成本评估。

Article 254

Title@2025-07-08 (2): Rethinking Associative Memory Mechanism in Induction Head

Title: Rethinking Associative Memory Mechanism in Induction Head

Rethinking Assoziative Memory Mechanism in Induktion Head

上岗时重新思考联合记忆机制 2412.11459v2

Authors (2): Shuo Wang, Issei Sato

Induction head mechanism is a part of the computational circuits for in-context learning (ICL) that enable large language models (LLMs) to adapt to new tasks without fine-tuning. Most existing work explains the training dynamics behind acquiring such a powerful mechanism. However, the model’s ability to coordinate in-context information over long contexts and global knowledge acquired during pretraining remains poorly understood. This paper investigates how a two-layer transformer thoroughly captures in-context information and balances it with pretrained bigram knowledge in next token prediction, from the viewpoint of associative memory. We theoretically analyze the representation of weight matrices in attention layers and the resulting logits when a transformer is given prompts generated by a bigram model. In the experiments, we design specific prompts to evaluate whether the outputs of the trained transformer align with the theoretical results.

上传头机制是内流学习计算线路的一部分,它使大型语言模型(LLMs)能够适应新的任务而无需微调。大多数现有工作解释了获得这样一个强大机制背后的培训动态。然而,该模型协调长长背景下的内流信息以及培训前获得的全球知识的能力仍然不甚明了。本文从联系记忆的角度探讨两层变压器如何彻底捕捉内流信息,并在下一个象征性的预测中将它与预先训练的大型知识相平衡。我们从理论上分析了注意层中重力矩阵的表示以及当一个变压器被一个大ram模型所激发时所产生的逻辑记录。在实验中,我们设计了具体的速率来评价受过训练的变压器的产出是否与理论结果一致。

Article 255

Title@2025-07-08 (2): Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models

Title: Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models

Neudefinition von Bewertungsstandards: Ein einheitlicher Rahmen für die Bewertung der koreanischen Fähigkeiten von Sprachmodellen

重新界定评价标准:评价韩国语言模式能力的统一框架 2503.22968v4

Authors (9): Hanwool Lee, Dasol Choi, Sooyong Kim, Ilgyun Jung, Sangwon Baek, Guijin Son, Inseon Hwang, Naeun Lee, Seunghyeok Hong

Recent advancements in Korean large language models (LLMs) have driven numerous benchmarks and evaluation methods, yet inconsistent protocols cause up to 10 p.p performance gaps across institutions. Overcoming these reproducibility gaps does not mean enforcing a one-size-fits-all evaluation. Rather, effective benchmarking requires diverse experimental approaches and a framework robust enough to support them. To this end, we introduce HRET (Haerae Evaluation Toolkit), an open-source, registry-based framework that unifies Korean LLM assessment. HRET integrates major Korean benchmarks, multiple inference backends, and multi-method evaluation, with language consistency enforcement to ensure genuine Korean outputs. Its modular registry design also enables rapid incorporation of new datasets, methods, and backends, ensuring the toolkit adapts to evolving research needs. Beyond standard accuracy metrics, HRET incorporates Korean-focused output analyses-morphology-aware Type-Token Ratio (TTR) for evaluating lexical diversity and systematic keyword-omission detection for identifying missing concepts-to provide diagnostic insights into language-specific behaviors. These targeted analyses help researchers pinpoint morphological and semantic shortcomings in model outputs, guiding focused improvements in Korean LLM development.

韩国大型语言模型(LLMS)最近的进展推动了许多基准和评价方法,但协议不一致导致各机构的绩效差距高达10 p.p.p.;克服这些再生差距并不意味着执行 “ 一刀切 “ 的评价;相反,有效的基准要求采用不同的实验办法和足以支持这些评价的框架;为此,我们引入了一个开放源码、基于登记册的框架,统一韩国LLM评估。HRET将韩国主要基准、多重推论后端和多方法评价与语言一致性执行相结合,以确保朝鲜真正的产出。其模块化登记册的设计还使得能够快速纳入新的数据集、方法和后端,确保工具包适应不断变化的研究需求。除了标准精确度指标外,HRET还采用韩国重点产出分析-形态识别模式比(TTR),用于评估词汇多样性和系统关键内容检测,以识别缺失的概念,从而提供语言特定行为方面的诊断性洞察力。这些有针对性的分析有助于研究人员对模型产出中的形态和语义缺陷进行定位,指导韩国LM的改进。

Article 256

Title@2025-07-08 (2): Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems

Title: Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems

Chat-Ghosting: Eine vergleichende Untersuchung von Methoden zur Auto-Completion in Dialogsystemen

对话系统自动完成方法比较研究 2507.05940v1

Authors (6): Sandeep Mishra, Anubhab Mandal, Bishal Santra, Tushar Abhishek, Pawan Goyal, Manish Gupta

Ghosting, the ability to predict a user’s intended text input for inline query auto-completion, is an invaluable feature for modern search engines and chat interfaces, greatly enhancing user experience. By suggesting completions to incomplete queries (or prefixes), ghosting aids users with slow typing speeds, disabilities, or limited language proficiency. Ghosting is a challenging problem and has become more important with the ubiquitousness of chat-based systems like ChatGPT, Copilot, etc. Despite the increasing prominence of chat-based systems utilizing ghosting, this challenging problem of Chat-Ghosting has received little attention from the NLP/ML research community. There is a lack of standardized benchmarks and relative performance analysis of deep learning and non-deep learning methods. We address this through an open and thorough study of this problem using four publicly available dialog datasets: two human-human (DailyDialog and DSTC7-Ubuntu) and two human-bot (Open Assistant and ShareGPT). We experiment with various existing query auto-completion methods (using tries), n-gram methods and deep learning methods, with and without dialog context. We also propose a novel entropy-based dynamic early stopping strategy. Our analysis finds that statistical n-gram models and tries outperform deep learning based models in terms of both model performance and inference efficiency for seen prefixes. For unseen queries, neural models like T5 and Phi-2 lead to better results. Adding conversational context leads to significant improvements in ghosting quality, especially for Open-Assistant and ShareGPT. We make code and data publicly available

幽灵,即能够预测用户为内线查询自动完成而准备的文本输入,是现代搜索引擎和聊天界面的宝贵特征,大大提高了用户经验。通过建议完成不完整查询(或前缀)、隐形帮助用户打字速度缓慢、残疾或语言熟练程度有限的问题,隐形帮助用户。隐形是一个具有挑战性的问题,而且随着聊天的系统,如ChatGPT、Copilit等以聊天为基础的系统无处不在,这种能力也变得更加重要。尽管聊天系统利用幽灵越来越突出,但这种具有挑战性的“聊天接待”质量问题很少受到NLP/ML研究界的注意。深层学习和非深层学习方法缺乏标准化基准和相对绩效分析。我们通过公开的4个对话数据集(DailyDialog和DSTC7-Ubuntu)和2个“人文”模型(公开助理和ShareGPTTT),我们尝试了各种现有的“自动完成”方法(在尝试中尝试尝试)、正格方法,在深层次学习和深层次的深度对话中,特别用基于数据分析工具,并在观察中找到“智能分析中找到一种基于的“数据”。

Article 257

Title@2025-07-08 (2): Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors

Title: Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors

Denken Sie daran, Vergangenheit, Zukunft antizipieren: Lernen kontinuierlich multimodale Fehlinformation Detectors

记住过去,预期未来:学习持续多模式错误信息探测器 2507.05939v1

Authors (7): Bing Wang, Ximing Li, Mengzhe Ye, Changchun Li, Bo Fu, Jianfeng Qu, Lin Yuanbo Wu

Nowadays, misinformation articles, especially multimodal ones, are widely spread on social media platforms and cause serious negative effects. To control their propagation, Multimodal Misinformation Detection (MMD) becomes an active topic in the community to automatically identify misinformation. Previous MMD methods focus on supervising detectors by collecting offline data. However, in real-world scenarios, new events always continually emerge, making MMD models trained on offline data consistently outdated and ineffective. To address this issue, training MMD models under online data streams is an alternative, inducing an emerging task named continual MMD. Unfortunately, it is hindered by two major challenges. First, training on new data consistently decreases the detection performance on past data, named past knowledge forgetting. Second, the social environment constantly evolves over time, affecting the generalization on future data. To alleviate these challenges, we propose to remember past knowledge by isolating interference between event-specific parameters with a Dirichlet process-based mixture-of-expert structure, and anticipate future environmental distributions by learning a continuous-time dynamics model. Accordingly, we induce a new continual MMD method DAEDCMD. Extensive experiments demonstrate that DAEDCMD can consistently and significantly outperform the compared methods, including six MMD baselines and three continual learning methods.

目前,错误文章,特别是多式文章,在社交媒体平台上广为传播,造成严重负面影响。为了控制其传播,多式错误信息检测(MMD)成为社区中一个积极话题,自动识别错误信息。以前MMD方法的重点是通过收集离线数据来监督探测器。然而,在现实世界中,新事件总是不断出现,使在离线数据上培训的 MMD模型始终过时和无效。为解决这一问题,在在线数据流下培训MMD模型是一种替代方案,引发一个名为连续MMMD的新兴任务。不幸的是,它受到两大挑战的阻碍。第一,关于新数据的培训不断降低过去数据(过去被命名为知识遗忘)的检测性能。第二,社会环境的不断变化,影响未来数据的一般化。为缓解这些挑战,我们建议通过将事件特定参数与基于富力特进程混合专家结构的干扰隔离起来来记住过去的知识,并通过学习一个持续时间动态模型来预测未来环境分布。因此,我们引入一种新的MDADDMDMDMD方法。广泛的实验表明DADMMMD可以持续并显著地超越三个基准,包括不断学习。

Article 258

Title@2025-07-08 (2): Towards a Principled Evaluation of Knowledge Editors

Title: Towards a Principled Evaluation of Knowledge Editors

Auf dem Weg zu einer grundsätzlichen Bewertung von Wissensredakteuren

对知识编辑进行有原则地评价 2507.05937v1

Authors (3): Sebastian Pohl, Max Ploner, Alan Akbik

Model editing has been gaining increasing attention over the past few years. For Knowledge Editing in particular, more challenging evaluation datasets have recently been released. These datasets use different methodologies to score the success of editors. Yet, it remains under-explored how robust these methodologies are and whether they unfairly favor some editors. Moreover, the disruptive impact of these editors on overall model capabilities remains a constant blind spot. We address both of these problems and show that choosing different metrics and evaluation methodologies as well as different edit batch sizes can lead to a different ranking of knowledge editors. Crucially we demonstrate this effect also on general language understanding tasks evaluated alongside the knowledge editing tasks. Further we include a manual assessment of the string matching based evaluation method for knowledge editing that is favored by recently released datasets, revealing a tendency to produce false positive matches.

过去几年来,模型编辑越来越受到越来越多的关注。特别是对于知识编辑来说,最近发布了更具挑战性的评价数据集。这些数据集使用不同的方法来赢得编辑的成功。然而,它们仍然未充分探讨这些方法的健全程度,以及这些方法是否不公平地有利于一些编辑。此外,这些编辑对总体模型能力的干扰影响始终是盲点。我们处理这两个问题,并表明选择不同的指标和评价方法以及不同的编辑批量大小可能导致知识编辑的排名不同。关键是,我们也展示了这种影响对与知识编辑任务一起评估的一般语言理解任务的影响。此外,我们还包括对最近发布的数据集所偏好的知识编辑的弦匹配评价方法进行手工评估,揭示出产生虚假正面匹配的趋势。

Article 259

Title: News and Load: Social and Economic Drivers of Regional Multi-horizon Electricity Demand Forecasting

News and Load: Soziale und wirtschaftliche Treiber regionaler Mehrhorizont-Stromnachfrageprognosen

新闻与负载:区域多光速电力需求预测的社会和经济驱动因素 2406.06641v2

Authors (3): Yun Bai, Simon Camal, Andrea Michiorri

The relationship between electricity demand and variables such as economic activity and weather patterns is well established. However, this paper explores the connection between electricity demand and social aspects. It further embeds dynamic information about the state of society into energy demand modelling and forecasting approaches. Through the use of natural language processing on a large news corpus, we highlight this important link. This study is conducted in five regions of the UK and Ireland and considers multiple time horizons from 1 to 30 days. It also considers economic variables such as GDP, unemployment and inflation. The textual features used in this study represent central constructs from the word frequencies, topics, word embeddings extracted from the news. The findings indicate that: 1) the textual features are related to various contents, such as military conflicts, transportation, the global pandemic, regional economics, and the international energy market. They exhibit causal relationships with regional electricity demand, which are validated using Granger causality and Double Machine Learning methods. 2) Economic indicators play a more important role in the East Midlands and Northern Ireland, while social indicators are more influential in the West Midlands and the South West of England. 3) The use of these factors improves deterministic forecasting by around 6%.

电力需求与诸如经济活动和天气模式等变数之间的关系早已确立。然而,本文探讨了电力需求与社会方面的关联,进一步将关于社会状况的动态信息纳入能源需求建模和预测方法之中。我们通过在大型新闻材料中使用自然语言处理,强调了这一重要联系。这项研究在英国和爱尔兰五个地区进行,考虑了1天到30天的多重时间跨度。研究还考虑了国内生产总值、失业和通货膨胀等经济变量。本研究报告使用的文字特征代表了从新闻中提取的文字频率、主题、嵌入的词的核心结构。研究结果表明:(1) 文字特征与各种内容有关,如军事冲突、运输、全球大流行病、区域经济和国际能源市场。这些特征显示了与区域电力需求之间的因果关系,而区域电力需求则利用Granger因果关系和双机学习方法加以验证。(2) 经济指标在东中地和北爱尔兰起着更重要的作用,而社会指标在英格兰西中地和西南地更有影响力。(3) 这些因素的使用改善了6 %左右的确定性预测。

Article 260

Title@2025-07-08 (2): Semantic Certainty Assessment in Vector Retrieval Systems: A Novel Framework for Embedding Quality Evaluation

Title: Semantic Certainty Assessment in Vector Retrieval Systems: A Novel Framework for Embedding Quality Evaluation

Semantische Sicherheitsbewertung in Vector Retrieval Systemen: Ein neuartiger Rahmen für die Einbettung von Qualitätsbewertungen

矢量检索系统中的语义性确定性评估:包含质量评价的新框架 2507.05933v1

Authors (1): Y. Du

Vector retrieval systems exhibit significant performance variance across queries due to heterogeneous embedding quality. We propose a lightweight framework for predicting retrieval performance at the query level by combining quantization robustness and neighborhood density metrics. Our approach is motivated by the observation that high-quality embeddings occupy geometrically stable regions in the embedding space and exhibit consistent neighborhood structures. We evaluate our method on 4 standard retrieval datasets, showing consistent improvements of 9.4$\pm$1.2\% in Recall@10 over competitive baselines. The framework requires minimal computational overhead (less than 5\% of retrieval time) and enables adaptive retrieval strategies. Our analysis reveals systematic patterns in embedding quality across different query types, providing insights for targeted training data augmentation.

矢量检索系统由于不同嵌入质量不同,不同查询的性能差异很大。我们提出一个轻量级框架,通过结合量化稳健性和邻里密度度度来预测查询一级的检索性能。我们采取的方法的动机是,观察到高质量的嵌入在嵌入空间中占据几何稳定的区域,并展示了一致的邻里结构。我们评估了4个标准检索数据集的方法,在回调10中显示对竞争性基线不断改进9.4$\pm1.2。这个框架需要最低的计算间接费用(少于5检索时间),并能够实施适应性检索战略。我们的分析揭示了在不同查询类型中嵌入质量的系统模式,为有针对性的培训数据增强提供了洞察力。

Article 261

Title@2025-07-08 (2): Self-supervised learning of speech representations with Dutch archival data

Title: Self-supervised learning of speech representations with Dutch archival data

Selbstüberwachtes Lernen von Sprachdarstellungen mit niederländischen Archivdaten

自行监督学习用荷兰档案数据进行语音陈述 2507.04554v2

Authors (3): Nik Vaessen, Roeland Ordelman, David A. van Leeuwen

This paper explores the use of Dutch archival television broadcast data for self-supervised learning of speech foundation models, specifically wav2vec 2.0. We first study data quality assumptions for pre-training, and show how music, noise and speaker overlap affect SSL convergence and downstream fine-tuning performance. Secondly, we explore effectively pre-processing strategies to convert the noisy broadcast dataset into a qualitative dataset for pre-training, by using Whisper and WhisperX. Thirdly, we compare mono-lingual and multi-lingual pre-training with equivalent amounts of data, and show that mono-lingual pre-training is more robust to out-of-domain data. Lastly, we achieve a state-of-the-art LARGE wav2vec 2.0 model for the Dutch language, by a continuation of pre-training a wav2vec 2.0 XLS-R model checkpoint with our 55k hour archival dataset.

本文探索荷兰档案电视广播数据用于自我监督地学习语言基础模型,特别是 wav2vec 2. 0。我们首先研究培训前的数据质量假设,并展示音乐、噪音和发言者的重叠如何影响SSL的趋同和下游微调性能。其次,我们利用Whisper和WhisperX,有效探索将噪音广播数据集转换成培训前质量数据集的预处理战略。第三,我们将单语和多语种的预培训与同等数量的数据进行比较,并表明单语预培训对于外部数据更为有力。最后,我们通过继续用我们55k小时的档案数据集对Wav2c 2.0 XLLS-R 模检查站进行预培训,实现荷兰语最先进的LARGE wav2vec 2. 0模式。

Article 262

Title@2025-07-08 (2): Few-shot text-based emotion detection

Title: Few-shot text-based emotion detection

Kurze Text-basierte Emotionserkennung

几发短短的基于文本的情感探测 2507.05918v1

Authors (3): Teodor-George Marchitan, Claudiu Creanga, Liviu P. Dinu

This paper describes the approach of the Unibuc - NLP team in tackling the SemEval 2025 Workshop, Task 11: Bridging the Gap in Text-Based Emotion Detection. We mainly focused on experiments using large language models (Gemini, Qwen, DeepSeek) with either few-shot prompting or fine-tuning. With our final system, for the multi-label emotion detection track (track A), we got an F1-macro of $0.7546$ (26/96 teams) for the English subset, $0.1727$ (35/36 teams) for the Portuguese (Mozambican) subset and $0.325$ (\textbf{1}/31 teams) for the Emakhuwa subset.

本文介绍了Unibuc-NLP团队在应对SemEval 2025年研讨会(任务11:缩小基于文字的情感检测差距)方面的做法,我们主要侧重于使用大语言模型(Gemini、Quen、DeepSeek)的实验,有的只是几发提示或微调,有的是几发提示或微调。我们的最后系统,即多标签情感检测轨道(A轨),我们得到了一个F1-macro,其中英文子组0.7546美元(26/96),葡萄牙文组0.1727美元(35/36队),埃马库瓦子组0.325美元(htextbf{1}31队)。

Article 263

Title@2025-07-08 (2): On the Fundamental Impossibility of Hallucination Control in Large Language Models

Title: On the Fundamental Impossibility of Hallucination Control in Large Language Models

Über die grundsätzliche Unmöglichkeit der Halluzinationskontrolle in großen Sprachmodellen

关于大语言模型中幻听控制的基本不可能性 2506.06382v3

Authors (1): Michał P. Karpowicz

We prove that perfect hallucination control in large language models is mathematically impossible. No LLM inference mechanism can simultaneously achieve truthful response generation, semantic information conservation, relevant knowledge revelation, and knowledge-constrained optimality. This impossibility is fundamental, arising from the mathematical structure of information aggregation itself rather than engineering limitations. The proof spans three mathematical frameworks: auction theory, proper scoring theory for probabilistic predictions, and log-sum-exp analysis for transformer architectures. In each setting, we demonstrate that information aggregation creates unavoidable violations of conservation principles. The Jensen gap in transformer probability aggregation provides a direct measure of this impossibility. These results reframe hallucination from an engineering bug to an inevitable mathematical feature of distributed intelligence. There are fundamental trade-offs between truthfulness, knowledge utilization, and response completeness, providing principled foundations for managing rather than eliminating hallucination. This work reveals deep connections between neural network inference, philosophy of knowledge and reasoning, and classical results in game theory and information theory, opening new research directions for developing beneficial AI systems within mathematical constraints.

我们证明大型语言模型的完美幻觉控制在数学上是不可能的。没有LLM 推论机制能够同时实现真实的反应生成、语义信息保护、相关知识披露和知识限制的最佳性。这种不可能性是根本的,产生于信息集成本身的数学结构而不是工程限制。证据包含三个数学框架:拍卖理论、概率预测的适当评分理论以及变压器结构的日志和参数分析。在每种情况下,我们证明信息汇总都不可避免地违反了保护原则。变压器概率汇总中的Jensen差距提供了这种不可能性的直接衡量。这些结果是将幻觉从工程错误重新定位为分布式情报的不可避免的数学特征。在真实性、知识利用和反应完整性之间有着基本的权衡,为管理而不是消除幻觉提供了原则基础。这项工作揭示了神经网络的推论、知识和推理哲学以及游戏理论和信息理论的经典结果之间的密切联系,为在数学限制范围内开发有益的AI系统开辟了新的研究方向。

Article 264

Title@2025-07-08 (2): AI-Reporter: A Path to a New Genre of Scientific Communication

Title: AI-Reporter: A Path to a New Genre of Scientific Communication

AI-Reporter: Ein Weg zu einem neuen Genre wissenschaftlicher Kommunikation

AI-记者:通向科学通信新一流的道路 2507.05903v1

Authors (1): Gerd Graßhoff

The AI-Reporter represents a paradigmatic shift in scientific publication practice. This document demonstrates through a concrete case study how our system transforms academic presentations into publication-ready chapters – in less than three minutes. Using Arno Simons’ lecture on Large Language Models from the ``Large Language Models for the History, Philosophy, and Sociology of Science’’ workshop (NEPI) as an example, we show how technological innovation bridges the gap between ephemeral presentation and permanent scientific documentation.

AI-Reporter代表了科学出版实践的范式转变。本文件通过具体案例研究展示了我们的系统如何在不到3分钟内将学术介绍转变为可供出版的章节。我们以“科学历史、哲学和社会学大语言模型”研讨会(NEPI)为例,利用Arno Simons关于“大语言模型”的讲座,我们展示了技术创新如何弥合时间介绍和长期科学文献之间的差距。

Article 265

Title@2025-07-08 (2): MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation

Title: MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation

MusiScene: Nutzung von MU-LLAMA für die Vorstellung von Szenen und verbesserte Video-Hintergrundmusik-Generation

Musi-Sceen:利用MU-LLAMA来进行现场想象和增强视频背景音乐制作 2507.05894v1

Authors (4): Fathinah Izzati, Xinyue Li, Yuxuan Wu, Gus Xia

Humans can imagine various atmospheres and settings when listening to music, envisioning movie scenes that complement each piece. For example, slow, melancholic music might evoke scenes of heartbreak, while upbeat melodies suggest celebration. This paper explores whether a Music Language Model, e.g. MU-LLaMA, can perform a similar task, called Music Scene Imagination (MSI), which requires cross-modal information from video and music to train. To improve upon existing music captioning models which focusing solely on musical elements, we introduce MusiScene, a music captioning model designed to imagine scenes that complement each music. In this paper, (1) we construct a large-scale video-audio caption dataset with 3,371 pairs, (2) we finetune Music Understanding LLaMA for the MSI task to create MusiScene, and (3) we conduct comprehensive evaluations and prove that our MusiScene is more capable of generating contextually relevant captions compared to MU-LLaMA. We leverage the generated MSI captions to enhance Video Background Music Generation (VBMG) from text.

人类在聆听音乐时可以想象出不同的氛围和设置,想象每个片段的电影场景。例如,缓慢的、忧郁的音乐可能会引起心碎的场景,而喜悦的旋律则建议庆祝。本文探讨了音乐语言模型,如MU-LLAMA,能否执行类似的任务,即音乐场景想象(MSI),需要视频和音乐的跨模式信息来培训。为了改进只关注音乐元素的现有音乐字幕模型,我们引入了MusiScene,这是一个音乐字幕模型,旨在想象每种音乐的场景。在本文中,(1) 我们用3,371对配对的大规模视频-音响字幕数据集,(2) 我们为MusiScene创建MusiScene的任务,我们进行了全面的评估,并证明我们的MusiScene较有能力生成与MU-LAMA相比具有背景意义的字幕。我们利用生成的MSI字幕来从文字中加强视频背景音乐生成。

Article 266

Title@2025-07-08 (2): Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

Title: Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

Psychometrische Objektvalidierung mit virtuellen Befragten mit trait-Response Mediatoren

使用虚拟应答人与Trait-Response调解员的虚拟应答人进行测分 2507.05890v1

Authors (4): Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo

As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that robustly measure intended traits. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-effective survey development and a deeper understanding of how LLMs replicate human-like behavior. We will publicly release our dataset and code to support future work.

由于人们越来越多地使用心理测量调查来评估大型语言模型的特性,因此,适合LLMS的可扩缩调查物品生成的需要也日益增长。这里的一个关键挑战是确保生成的物品的构造有效性,即它们是否真正测量预期特性。传统上,这需要花费昂贵的大规模人类数据收集。为了提高效率,我们提出了一个使用LLMS进行虚拟应答模拟的框架。我们的中心思想是考虑调解人:同一特性可能导致对调查项目作出不同答复的因素。通过向不同调解员模拟,我们确定能够强有力地测量预期特性的调查物品。关于三种心理特征理论(Big5、Schwartz、VIA)的实验表明,我们的调解者生成方法和模拟框架能够有效地识别高功能物品。LMS展示出从特性定义中产生可信的调解员的能力,并模拟应答者行为,以便项目鉴定。我们的问题的拟订、衡量尺度、方法和数据设置为具有成本效益的调查发展开辟新的方向,并更深入地了解LMS复制人类类似行为的方式。我们将公开发布我们的数据集和代码,以支持未来的工作。

Article 267

Title@2025-07-08 (2): How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures

Title: How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures

Wie man die automatische Spracherkennung auswertet: Vergleich verschiedener Leistungs- und Bias-Maßnahmen

如何评价自动语音识别:比较不同业绩和偏见措施 2507.05885v1

Authors (4): Tanvina Patel, Wiebke Hutiri, Aaron Yi Ding, Odette Scharenborg

There is increasingly more evidence that automatic speech recognition (ASR) systems are biased against different speakers and speaker groups, e.g., due to gender, age, or accent. Research on bias in ASR has so far primarily focused on detecting and quantifying bias, and developing mitigation approaches. Despite this progress, the open question is how to measure the performance and bias of a system. In this study, we compare different performance and bias measures, from literature and proposed, to evaluate state-of-the-art end-to-end ASR systems for Dutch. Our experiments use several bias mitigation strategies to address bias against different speaker groups. The findings reveal that averaged error rates, a standard in ASR research, alone is not sufficient and should be supplemented by other measures. The paper ends with recommendations for reporting ASR performance and bias to better represent a system’s performance for diverse speaker groups, and overall system bias.

越来越多的证据表明,自动语音识别(ASR)系统对不同的发言者和发言者群体有偏见,例如,由于性别、年龄或口音等原因。关于ASR中的偏见的研究迄今主要侧重于发现和量化偏见,以及制定缓解方法。尽管取得了这一进展,但未决问题是如何衡量一个系统的绩效和偏见。在本研究报告中,我们比较了从文献中得出的不同绩效和偏见措施,并提议对荷兰语的高级终端到终端的ASR系统进行评估。我们的实验利用若干减少偏见的战略来解决对不同发言者群体的偏见问题。研究结果显示,平均误差率,即ASR研究的标准本身是不够的,应当辅以其他措施作为补充。文件最后建议报告ASR的业绩和偏见,以便更好地代表不同发言者群体的系统绩效和整个系统偏差。

Article 268

Title@2025-07-08 (2): Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language Models

Title: Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language Models

Heilkräfte von BERT: Wie aufgabenspezifische Feinsteuerung korrupte Sprachmodelle wiederherstellt

BERT的愈合能力:如何恢复被损坏的语言模式 2406.14459v2

Authors (3): Shijie Han, Zhenyu Zhang, Andrei Arsene Simion

Language models like BERT excel at sentence classification tasks due to extensive pre-training on general data, but their robustness to parameter corruption is unexplored. To understand this better, we look at what happens if a language model is “broken”, in the sense that some of its parameters are corrupted and then recovered by fine-tuning. Strategically corrupting BERT variants at different levels, we find corrupted models struggle to fully recover their original performance, with higher corruption causing more severe degradation. Notably, bottom-layer corruption affecting fundamental linguistic features is more detrimental than top-layer corruption. Our insights contribute to understanding language model robustness and adaptability under adverse conditions, informing strategies for developing resilient NLP systems against parameter perturbations.

BERT等语言模型由于对一般数据进行了广泛的预先培训,因此在刑罚分类任务方面非常出色,但是它们对于腐败参数的强力是无法探索的。为了更好地了解这一点,我们审视一下如果一个语言模型“断裂 ” , 其某些参数被腐蚀, 然后通过微调恢复。在不同级别上战略腐蚀的BERT变量,我们发现腐败模型在努力完全恢复其原有性能,腐败程度较高导致更严重的退化。值得注意的是,影响基本语言特征的底层腐败比上层腐败更有害。我们的洞察力有助于理解语言模型在不利条件下的稳健性和适应性,为开发抗参数扰动的具有复原力的NLP系统提供战略信息。

Article 269

Title: Detecting value-expressive text posts in Russian social media

Ermittlung wertwirksamer Textposts in russischen sozialen Medien

在俄罗斯社交媒体中检测表达价值的文字文章 2312.08968v2

Authors (3): Maria Milkova, Maksim Rudnev, Lidia Okolskaya

Basic values are concepts or beliefs which pertain to desirable end-states and transcend specific situations. Studying personal values in social media can illuminate how and why societal values evolve especially when the stimuli-based methods, such as surveys, are inefficient, for instance, in hard-to-reach populations. On the other hand, user-generated content is driven by the massive use of stereotyped, culturally defined speech constructions rather than authentic expressions of personal values. We aimed to find a model that can accurately detect value-expressive posts in Russian social media VKontakte. A training dataset of 5,035 posts was annotated by three experts, 304 crowd-workers and ChatGPT. Crowd-workers and experts showed only moderate agreement in categorizing posts. ChatGPT was more consistent but struggled with spam detection. We applied an ensemble of human- and AI-assisted annotation involving active learning approach, subsequently trained several classification models using embeddings from various pre-trained transformer-based language models. The best performance was achieved with embeddings from a fine-tuned rubert-tiny2 model, yielding high value detection quality (F1 = 0.75, F1-macro = 0.80). This model provides a crucial step to a study of values within and between Russian social media users.

在社会媒体中研究个人价值观可以说明社会价值观是如何和为什么演变的,特别是在调查等基于刺激性的方法效率低下的情况下,例如在难以接触的人口中。另一方面,用户生成的内容是由大量使用陈规定型的、文化上界定的语音结构而不是真实的个人价值观表达方式驱动的。我们的目标是找到一种模型,能够准确检测俄罗斯社交媒体VKontakte中的价值表达岗位。3名专家304名人群工人和聊天GPT对5 035个岗位的培训数据集作了说明。在职位分类方面,众工和专家只表现出适度的一致。查网GPT比较一致,但与垃圾检测相抗争。我们应用了人类和人工辅助的批注,包括积极的学习方法,随后我们用各种经过预先培训的变压器语言模型来培训了若干分类模型。通过将5 035个岗位的成套培训数据集嵌入,由3名专家、304名人群工人和聊天GhatGPT。在对职位进行分类时只表现出适度的一致。ChatGPT。C-PT和专家在对职位进行分类时只提供高价值质量。

Article 270

Title@2025-07-08 (2): ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection

Title: ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection

ALLM4ADD: Entsperren der Fähigkeiten von Audio Large Language Models für Audio Deepfake Detection

ALLM4ADD:解锁音频深假探测音频大语言模型的能力 2505.11079v2

Authors (9): Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen

Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a heuristic question arises: \textit{Can ALLMs be leveraged to solve ADD?}. In this paper, we first conduct a comprehensive zero-shot evaluation of ALLMs on ADD, revealing their ineffectiveness. To this end, we propose ALLM4ADD, an ALLM-driven framework for ADD. Specifically, we reformulate ADD task as an audio question answering problem, prompting the model with the question: ``Is this audio fake or real?’’. We then perform supervised fine-tuning to enable the ALLM to assess the authenticity of query audio. Extensive experiments are conducted to demonstrate that our ALLM-based method can achieve superior performance in fake audio detection, particularly in data-scarce scenarios. As a pioneering study, we anticipate that this work will inspire the research community to leverage ALLMs to develop more effective ADD systems. Code is available at https://github.com/ucas-hao/qwen_audio_for_add.git

声频深方形探测(ADD)由于高贞度音频变异模型的兴起及其被滥用的可能性而变得日益重要。鉴于音频大语言模型(ALLMs)在各种音频处理任务中取得了显著进展,因此产生了一个反复的问题:\ textit{AllMs能否被利用解决ADD?}。在本文件中,我们首先对ADDA的AllMs进行全面的零射评价,显示其无效性。为此,我们提议ALM4ADD,一个由ALM驱动的ADDD框架。具体地说,我们重新将ADD任务改写为一个音频解答问题,促使该模型提出一个问题:“这种音频是假的还是真的?”问题。我们随后进行了监督的微调,以使ALM能够评估调音频的真实性。我们进行了广泛的实验,以证明我们的ALM方法能够在假音频探测中取得优异性表现,特别是在数据卡片情景中。作为开创性的研究,我们预计这项工作将激励研究界利用ALDDDDM系统开发更有效的ADDDDD系统。代码可在 https@gius_gus_quaus_ad_ad_absforfordoqforforgi_forgi_forgi_fio_forgiocfordo.

Article 271

Title@2025-07-08 (2): PDFMathTranslate: Scientific Document Translation Preserving Layouts

Title: PDFMathTranslate: Scientific Document Translation Preserving Layouts

PDFMathTranslate: Wissenschaftliche Dokumentübersetzung Layouts erhalten

PDFMath Translate:科学文件翻译保存布局 2507.03009v2

Authors (4): Rongxin Ouyang, Chang Chu, Zhikuang Xin, Xiangyao Ma

Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information in layouts. To bridge the gap, we introduce PDFMathTranslate, the world’s first open-source software for translating scientific documents while preserving layouts. Leveraging the most recent advances in large language models and precise layout detection, we contribute to the community with key improvements in precision, flexibility, and efficiency. The work has been open-sourced at https://github.com/byaidu/pdfmathtranslate with more than 222k downloads.

科学文件的语言障碍妨碍了科学文件的传播和发展,然而,先前在翻译这类文件方面所作的努力基本上忽视了布局中的信息。为弥合这一差距,我们引入了PDFMathTranslate,这是世界上第一个在保存布局的同时翻译科学文件的开放源码软件。我们利用了大语言模型的最新进展和精确的布局探测,在精确、灵活和效率方面作出了重大改进,为社区作出了贡献。这项工作在https://github.com/baaidu/pdfmath Translate中公开提供,下载次数超过222k次。

Article 272

Title@2025-07-08 (2): Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Title: Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Mit Thinking-LLM-as-a-Richter zu planen und zu evaluieren

学习运用 “ 思考-LLM-作为法官 “ 评估计划和理由 2501.18099v2

Authors (5): Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason Weston, Tianlu Wang

LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-bystep reasoning process that underlies the final evaluation of a response. However, due to the lack of human annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer amount of, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.

在这项工作中,我们建议Eval Planner为思维-LLM-as-Judge提供一种优先优化算法,首先产生不受限制的评价计划,然后执行,然后作出最后判断。在自我培训循环中,Eval Planner反复优化合成设计的评价计划和处决,从而导致更好的最终裁决。我们的方法在RewardBench上取得了一种新的最先进的变现奖赏模式业绩(分数为93.9),尽管在建立数量较少的、合成的Blusion-Balianch推理模型方面进行了培训,并且进一步推进了Balvironic-RM号和其他基准。

Article 273

Title@2025-07-08 (2): The Impact of Prompt Programming on Function-Level Code Generation

Title: The Impact of Prompt Programming on Function-Level Code Generation

Die Auswirkungen der Prompt-Programmierung auf die Code-Generierung auf Funktionsebene

迅速编制方案对职能层面代码生成的影响 2412.20545v2

Authors (4): Ranim Khojah, Francisco Gomes de Oliveira Neto, Mazen Mohamad, Philipp Leitner

Large Language Models (LLMs) are increasingly used by software engineers for code generation. However, limitations of LLMs such as irrelevant or incorrect code have highlighted the need for prompt programming (or prompt engineering) where engineers apply specific prompt techniques (e.g., chain-of-thought or input-output examples) to improve the generated code. While some prompt techniques have been studied, the impact of different techniques – and their interactions – on code generation is still not fully understood. In this study, we introduce CodePromptEval, a dataset of 7072 prompts designed to evaluate five prompt techniques (few-shot, persona, chain-of-thought, function signature, list of packages) and their effect on the correctness, similarity, and quality of complete functions generated by three LLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt techniques significantly influence the generated code, combining multiple techniques does not necessarily improve the outcome. Additionally, we observed a trade-off between correctness and quality when using prompt techniques. Our dataset and replication package enable future research on improving LLM-generated code and evaluating new prompt techniques.

大型语言模型(LLMS)被软件工程师越来越多地用于代码生成,然而,LLMS(LLMS)的局限性,例如不相关或不正确的代码等LLMS的局限性突出表明,需要迅速编程(或迅速工程),因为工程师采用具体的即时技术(例如思维链或投入产出实例)来改进生成的代码。虽然已经研究了一些迅速技术,但不同技术(及其相互作用)对代码生成的影响仍然不完全理解。在本研究中,我们引入了代码PromptEval,这是一个7072个提示数据集,旨在评价五项即时技术(光、人、思维链、功能签名、软件包清单)及其对三个LLMMS(GPT-4o、Llama3和Mistral)生成的完整功能的正确性、相似性和质量的影响。我们的调查结果显示,虽然某些快速技术对生成的代码有重大影响,但结合多种技术并不一定能改善结果。此外,我们发现,在使用快速技术时,准确性和质量之间存在着一种权衡。我们的数据集和复制包使今后能够对改进LM生成的代码和评价新的快速技术进行研究。

Article 274

Title@2025-07-08 (2): Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity

Title: Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity

Affective-ROPTester: Fähigkeit und Bias-Analyse von LLMs bei der Vorhersage der Retinopathie der Vorreife

情感-ROPTester:预测早幼病理反应中LLMs的能力和比值分析 2507.05816v1

Authors (10): Shuai Zhao, Yulin Zhang, Luwei Xiao, Xinyi Wu, Yanhao Jia, Zhongliang Guo, Xiaobao Wu, Cong-Duy Nguyen, Guoming Zhang, Anh Tuan Luu

Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low, medium, and high-risk labels. To systematically examine the predictive capabilities and affective biases of LLMs in ROP risk stratification, we propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies: Instruction-based, Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme assesses LLMs’ intrinsic knowledge and associated biases, whereas the CoT and ICL schemes leverage external medical knowledge to enhance predictive accuracy. Crucially, we integrate emotional elements at the prompt level to investigate how different affective framings influence the model’s ability to predict ROP and its bias patterns. Empirical results derived from the CROP dataset yield two principal observations. First, LLMs demonstrate limited efficacy in ROP risk prediction when operating solely on intrinsic knowledge, yet exhibit marked performance gains when augmented with structured external inputs. Second, affective biases are evident in the model outputs, with a consistent inclination toward overestimating medium- and high-risk cases. Third, compared to negative emotions, positive emotional framing contributes to mitigating predictive bias in model outputs. These findings highlight the critical role of affect-sensitive prompt engineering in enhancing diagnostic reliability and emphasize the utility of Affective-ROPTester as a framework for evaluating and mitigating affective bias in clinical language modeling systems.

尽管大型语言模型(LLMS)在各个领域取得了显著进展,但其预测对早产风险的视像病原体(ROP)的能力基本上尚未探索。为了弥补这一差距,我们引入了中国新型基准数据集,称为CROOP,由993份入门记录组成,附加低、中、高风险标签,共993份入门记录,附加低、中、高风险标签。为了系统地审查ROP风险分级中LLMS的预测能力和感官偏差,我们提议Affective-ROPTester,一个包含三种提示性战略的自动评价框架:基于指令的、Chought(OCT)和InColtle Learning(ICL)。为了解决这一差距,我们引入了一个新的中国基准数据集,称为CROMS的内在知识和相关的偏向偏向偏向,而CLOCL计划则利用外部医学知识来提高预测性准确性。很显然,我们把情感因素集中到感官影响模型预测ROP及其模型模型的偏向模式的偏向模式的偏向模式的偏向模式的偏向性,因此,LMMS显示第二个对精度的偏向性判断性判断性判断性判断性判断性判断性判断的预测作用影响在仅仅上影响着提高的精确性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性在仅在仅在仅在提高中,在提高外部判断性在提高外部判断性在提高外部判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性判断性

Article 275

Title@2025-07-08 (2): Bridging Perception and Language: A Systematic Benchmark for LVLMs’ Understanding of Amodal Completion Reports

Title: Bridging Perception and Language: A Systematic Benchmark for LVLMs’ Understanding of Amodal Completion Reports

Bridging Perception and Language: Ein systematischer Benchmark für das Verständnis von LVLMs für Amodal Completion Reports

桥梁概念和语言:LLLMs理解最佳完成报告系统基准 2507.05799v1

Authors (7): Amane Watahiki, Tomoki Doi, Taiga Shinozaki, Satoshi Nishida, Takuya Niikawa, Katsunori Miyahara, Hitomi Yanaka

One of the main objectives in developing large vision-language models (LVLMs) is to engineer systems that can assist humans with multimodal tasks, including interpreting descriptions of perceptual experiences. A central phenomenon in this context is amodal completion, in which people perceive objects even when parts of those objects are hidden. Although numerous studies have assessed whether computer-vision algorithms can detect or reconstruct occluded regions, the inferential abilities of LVLMs on texts related to amodal completion remain unexplored. To address this gap, we constructed a benchmark grounded in Basic Formal Ontology to achieve a systematic classification of amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Notably, in certain categories, some LLaVA-NeXT variants and Claude 3.5 Sonnet exhibit lower accuracy on original images compared to blank stimuli lacking visual content. Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.

开发大型视觉语言模型(LVLMs)的主要目标之一是设计能够帮助人类完成多式联运任务的系统,包括解释对感知经历的描述。这方面的一个中心现象是模式性完成,即使部分物体被隐藏,人们也会看到物体。虽然许多研究评估了计算机视觉算法能否探测或重建隐蔽区域,但LVLMs对与模式完成有关的文本的推断能力仍未得到探讨。为弥补这一差距,我们根据基本形式本体学建立了一个基准,以便实现对模式完成的系统分类。我们的结果表明,虽然许多LVLMs取得了人与人的可比较的性能,但某些类型正在完成的物体的准确性却有所不同。值得注意的是,在某些类别中,LLLavaVA-NeXT的变体和Claude 3.5 Sonnet的原始图像的准确性比没有视觉内容的空白平面图像要低。令人感兴趣的是,这种差异仅在日本的提示下出现,表明这些模型在日本特定语言能力方面存在缺陷。

Article 276

Title@2025-07-08 (2): FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation

Title: FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation

FaithfulRAG: Fact-Level-Konfliktmodellierung für kontexttreue retrieval-erweiterte Generation

忠实的拉戈:为本能的回溯性回溯性回溯性回溯性一代人建立实际冲突模式 2506.08938v2

Authors (7): Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, Jinsong Su

Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLMs parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the models parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the models parametric knowledge, which undermines the models internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model`s parametric knowledge and retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/DeepLearnXMU/Faithful-RAG

与检索系统相加的大型语言模型(LLMs)在处理知识密集型任务方面显示出巨大的潜力,然而,这些模型往往与不忠问题纠缠不休,产生的结果要么忽略了检索到的背景,要么前后不一地将其与LLM的参数知识混为一谈,在知识冲突的情况下,这一问题尤为严重,因为检索到的背景与模型的参数知识发生冲突。虽然现有的忠实RAG方法通过设计得当的迅速或经过修改的解码战略,强制压制模型的参数知识,从而实现严格遵守环境,但我们的分析揭示了一个严重的局限性:它们通过强行压制模型的参数知识,从而破坏模型的内部知识结构,增加误解环境的风险,从而实现忠诚。为此,本文件提出了FaithfulRAG,这是一个通过明确模拟模型`参数知识和检索到的背景之间的差异来解决知识冲突的新框架。具体地说,信任RAGG找出了事实层面的相互矛盾的知识,并设计了一个自我思考的过程,使LLMS在提出答复之前能够理解和综合相互矛盾的事实。广泛的实验表明,我们的方法超越了SOD-Salfar-FAR-MUS/GMUS 号。

Article 277

Title@2025-07-08 (2): MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training

Title: MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training

MAMUT: Ein neuartiges Framework zur Modifizierung mathematischer Formeln für die Generierung von Spezialdatensätzen für Sprachmodellschulungen

MAMUT:为语言模式培训制作专门数据集修改数学公式的新框架 2502.20855v2

Authors (3): Jonathan Drechsel, Anja Reusch, Steffen Herbold

Mathematical formulas are a fundamental and widely used component in various scientific fields, serving as a universal language for expressing complex concepts and relationships. While state-of-the-art transformer models excel in processing and understanding natural language, they encounter challenges with mathematical notation, which involves a complex structure and diverse representations. This study focuses on the development of specialized training datasets to enhance the encoding of mathematical content. We introduce Math Mutator (MAMUT), a framework capable of generating equivalent and falsified versions of a given mathematical formula in LaTeX notation, effectively capturing the mathematical variety in notation of the same concept. Based on MAMUT, we have generated four large mathematical datasets containing diverse notation. Experiments show that models trained on these datasets exhibit new SoTA performance on mathematical retrieval tasks. We publish our code, generated datasets, and pretrained mathematical models: https://github.com/aieng-lab/math-mutator.

数学公式是各种科学领域一个基本和广泛使用的组成部分,是表达复杂概念和关系的通用语言。尽管最先进的变压器模型在处理和理解自然语言方面非常出色,但它们在数学标记方面却遇到挑战,这涉及复杂的结构和不同的表述。本研究的重点是开发专门的培训数据集,以加强数学内容的编码。我们引入数学数学模型(MAMUT),这是一个能够在LaTeX符号中生成一个特定数学公式的等同和伪造版本的框架,在同一个概念的标记中有效捕捉数学多样性。根据MAMUT,我们生成了四个包含不同符号的大型数学数据集。实验显示,这些数据集培训的模型展示了数学检索任务的新 SoTA。我们出版了我们的代码,生成了数据集,并预先培训了数学模型:https://github.com/aieng-lab/math-motoat。

Article 278

Title@2025-07-08 (2): Flippi: End To End GenAI Assistant for E-Commerce

Title: Flippi: End To End GenAI Assistant for E-Commerce

Flippi: Ende bis Ende GenAI Assistent für E-Commerce

Flippi: 结束到结束 GenAI 电子商务助手 2507.05788v1

Authors (7): Anand A. Rajasekar, Praveen Tangarajan, Anjali Nainani, Amogh Batwal, Vinay Rao Dandin, Anusua Trivedi, Ozan Ersoy

The emergence of conversational assistants has fundamentally reshaped user interactions with digital platforms. This paper introduces Flippi-a cutting-edge, end-to-end conversational assistant powered by large language models (LLMs) and tailored for the e-commerce sector. Flippi addresses the challenges posed by the vast and often overwhelming product landscape, enabling customers to discover products more efficiently through natural language dialogue. By accommodating both objective and subjective user requirements, Flippi delivers a personalized shopping experience that surpasses traditional search methods. This paper details how Flippi interprets customer queries to provide precise product information, leveraging advanced NLP techniques such as Query Reformulation, Intent Detection, Retrieval-Augmented Generation (RAG), Named Entity Recognition (NER), and Context Reduction. Flippi’s unique capability to identify and present the most attractive offers on an e-commerce site is also explored, demonstrating how it empowers users to make cost-effective decisions. Additionally, the paper discusses Flippi’s comparative analysis features, which help users make informed choices by contrasting product features, prices, and other relevant attributes. The system’s robust architecture is outlined, emphasizing its adaptability for integration across various e-commerce platforms and the technological choices underpinning its performance and accuracy. Finally, a comprehensive evaluation framework is presented, covering performance metrics, user satisfaction, and the impact on customer engagement and conversion rates. By bridging the convenience of online shopping with the personalized assistance traditionally found in physical stores, Flippi sets a new standard for customer satisfaction and engagement in the digital marketplace.

谈话助理的出现从根本上改变了用户与数字平台的互动关系。本文介绍了Flippi对客户询问的解释,以提供准确的产品信息,利用高级NLP技术,如由大型语言模型(LLMs)驱动的高级智能、终端到终端的语音助理,并为电子商务部门量身定制。Flippi应对了巨大且往往压倒性的产品景观带来的挑战,使客户能够通过自然语言对话更有效地发现产品。Flippi通过满足客观和主观用户的要求,提供了个人化的购物经验,超过了传统的搜索方法。本文详细介绍了Flippi如何解释客户询问,以提供准确的产品信息,利用高级NLP技术,如由大语言模型(LLMs)驱动的超前端到端的语音助理(LLMs),以及针对大宗的语音识别(NER)和背景简化。Flippipi,Flippi的独特的识别能力和在电子商务网站上展示了最有吸引力的报价。此外,Flippi的用户参与比较分析特征,有助于用户通过对产品特征、电子价格进行对比分析、内价分析,以及其它相关的准确性选择。系统在客户满意度、准确度、准确度、准确度、准确度上展示中,在客户满意度、准确度方面的系统结构中强调其准确度和准确度的准确度框架。

Article 279

Title@2025-07-08 (2): Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription

Title: Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription

Gemeinsame Beamforming und Speaker-Attributed ASR für echte Fern-Mikrofon-Meeting-Transkription

用于真正分布式话语会议的联合光束和发言人与发言人联合致真正分布式话语会议ASR 2410.21849v2

Authors (4): Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent

Distant-microphone meeting transcription is a challenging task. State-of-the-art end-to-end speaker-attributed automatic speech recognition (SA-ASR) architectures lack a multichannel noise and reverberation reduction front-end, which limits their performance. In this paper, we introduce a joint beamforming and SA-ASR approach for real meeting transcription. We first describe a data alignment and augmentation method to pretrain a neural beamformer on real meeting data. We then compare fixed, hybrid, and fully neural beamformers as front-ends to the SA-ASR model. Finally, we jointly optimize the fully neural beamformer and the SA-ASR model. Experiments on the real AMI corpus show that, while state-of-the-art multi-frame cross-channel attention based channel fusion fails to improve ASR performance, fine-tuning SA-ASR on the fixed beamformer’s output and jointly fine-tuning SA-ASR with the neural beamformer reduce the word error rate by 8% and 9% relative, respectively.

在本文中,我们引入了一种联合波束和SA-ASR方法,用于真实会议转录。我们首先描述一种数据调整和增强方法,用于在真实会议数据上预演神经光束。然后,我们将固定的、混合的和完全的神经信号显示器作为SA-ASR模型的前端进行对比。最后,我们共同优化完全神经信号和SA-ASR模型。在实际AMI系统上进行的实验显示,尽管基于频道的组合关注无法提高ASR的性能,在固定光谱输出上对SA-ASR进行微调,并联合微调SA-ASR与神经信号信号分别将字错误率降低8%和9%的相对值。

Article 280

Title@2025-07-08 (2): Large Language Models Might Not Care What You Are Saying: Prompt Format Beats Descriptions

Title: Large Language Models Might Not Care What You Are Saying: Prompt Format Beats Descriptions

Große Sprachmodelle sind vielleicht egal, was Sie sagen: Prompt Format Beats Beschreibungen

大语言模型可能不在乎你在说什么:快速格式比描述 2408.08780v6

Authors (4): Chenming Tang, Zhixiang Wang, Hao Sun, Yunfang Wu

With the help of in-context learning (ICL), large language models (LLMs) have achieved impressive performance across various tasks. However, the function of descriptive instructions during ICL remains under-explored. In this work, we propose an ensemble prompt framework to describe the selection criteria of multiple in-context examples, and preliminary experiments on machine translation (MT) across six translation directions confirm that this framework boosts ICL performance. But to our surprise, LLMs might not care what the descriptions actually say, and the performance gain is primarily caused by the ensemble format, since it could lead to improvement even with random descriptive nouns. We further apply this new ensemble framework on a range of commonsense, math, logical reasoning and hallucination tasks with three LLMs and achieve promising results, suggesting again that designing a proper prompt format would be much more effective and efficient than paying effort into specific descriptions. Our code is available at https://github.com/JamyDon/Format-Beats-Descriptions.

在文字学习的帮助下,大型语言模型(LLMs)在各种任务中取得了令人印象深刻的成绩。然而,在ICL期间描述性指示的功能仍未得到充分探讨。在这项工作中,我们提议了一个共同的快速框架来描述多个文本实例的选择标准,并在六个翻译方向上对机器翻译的初步实验证实,这个框架可以促进ICL的性能。但是,令我们惊讶的是,LLMs可能并不关心描述的实际内容,而业绩收益主要是由合用格式造成的,因为即使随机的描述性名词也能导致改进。我们进一步将这个新的共同框架应用于一系列共同概念、数学、逻辑推理和幻觉任务,与三个LLMS一起,并取得有希望的结果,再次表明设计一个适当的迅速格式比具体描述要有效得多。我们的代码可以在 https://github.com/JamyDon/Format-Beats-Descrictions上查阅。

Article 281

Title@2025-07-08 (2): DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities

Title: DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities

DocTalk: Skalierbare Graph-basierte Dialog-Synthese zur Verbesserung von LLM-Konversationsfähigkeiten

Doctal: 用于增强LLLM对话能力的可缩放图表对话合成 2507.05750v1

Authors (9): Jing Yang Lee, Hamed Bonab, Nasser Zalmout, Ming Zeng, Sanket Lokegaonkar, Colin Lockard, Binxuan Huang, Ritesh Sarkhel, Haodong Wang

Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training results in up to 40% gain in context memory and understanding, without compromising base performance. DocTalk is available at https://huggingface.co/datasets/AmazonScience/DocTalk.

大型语言模型(LLMS)越来越多地用于多方向的谈话任务,然而,其培训前数据主要包括连续的讲稿,造成所需能力和培训模式之间的潜在不匹配。我们采用新颖的方法,将现有文本公司的对话数据综合起来,以解决这一差异。我们提出了一个管道,将多个相关文件组合转化为一个扩大的多方向、多主题的信息搜索对话。我们把管道应用到维基百科文章,我们把DocTalk,一个由730公里长的谈话组成的多方向培训前对话包。我们假设在培训前接触这种综合对话结构能够提高LLMS的基本多方向能力,例如上下文记忆和理解。我们抽象地表明,在培训前的结果中将DocTalk纳入40%的记忆和理解,同时不损害基本性能。DocTalk可以在https://huggingface.co/dataset/Amazonscience/DocTalk上查阅。

Article 282

Title@2025-07-08 (2): GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge

Title: GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge

GPTKB v1.5: Eine umfassende Wissensbasis für die Erforschung des tatsächlichen LLM-Wissens

GPTKB v1.5:探索事实LLM知识的大规模知识库 2507.05740v1

Authors (5): Yujia Hu, Tuan-Phong Nguyen, Shrestha Ghosh, Moritz Müller, Simon Razniewski

Language models are powerful tools, yet their factual knowledge is still poorly understood, and inaccessible to ad-hoc browsing and scalable statistical analysis. This demonstration introduces GPTKB v1.5, a densely interlinked 100-million-triple knowledge base (KB) built for $14,000 from GPT-4.1, using the GPTKB methodology for massive-recursive LLM knowledge materialization (Hu et al., ACL 2025). The demonstration experience focuses on three use cases: (1) link-traversal-based LLM knowledge exploration, (2) SPARQL-based structured LLM knowledge querying, (3) comparative exploration of the strengths and weaknesses of LLM knowledge. Massive-recursive LLM knowledge materialization is a groundbreaking opportunity both for the research area of systematic analysis of LLM knowledge, as well as for automated KB construction. The GPTKB demonstrator is accessible at https://gptkb.org.

语言模型是强有力的工具,但它们的实际知识仍然不易理解,而且无法为临时浏览和可扩展的统计分析所利用。这一演示介绍了GPTKB v1.5,这是一个由GPT-4.1为14 000美元建立的密连的10亿三亿知识库(KB),该知识库使用GPTKB方法,用于大规模回收的LLM知识成品化(Hu等人,ACL 2025)。演示经验侧重于三个使用案例:(1) 联系-跨行业LLM知识探索,(2) SPARQL基于LM结构化LM知识查询,(3) 比较探索LLM知识的长处和短处。大规模再精确LM知识成品化是一个开拓性的机会,既用于系统分析LLM知识的研究领域,也用于自动建造KB。 GPTKB 演示器可在https://gptkb.org上查阅。

Article 283

Title@2025-07-08 (2): Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling

Title: Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling

Verbesserung der LLM-Zuverlässigkeit durch explizite Wissensgrenzenmodellierung

通过明确知识边界建模增强LLM可靠性 2503.02233v3

Authors (6): Hang Zheng, Hongshen Xu, Yuncong Liu, Lu Chen, Pascale Fung, Kai Yu

Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness, particularly when processing queries exceeding their knowledge boundaries. While existing mitigation strategies employ uncertainty estimation or query rejection mechanisms, they suffer from computational efficiency and sacrificed helpfulness. To address these issues, we propose the Explicit Knowledge Boundary Modeling (EKBM) framework, integrating fast and slow reasoning systems to harmonize reliability and usability. The framework first employs a fast-thinking model to generate confidence-labeled responses, enabling immediate utilization of high-confidence outputs, whereas uncertain predictions trigger a slow refinement model for accuracy improvement. To align model behavior with our proposed object, we propose a hybrid training pipeline, enhancing self-awareness without degrading task performance. Evaluations on dialogue state tracking tasks demonstrate that EKBM achieves superior model reliability over uncertainty-based baselines. Further analysis reveals that refinement substantially boosts accuracy while maintaining low computational overhead. The framework establishes a scalable paradigm for deploying reliable LLMs in error-sensitive applications, effectively balancing accuracy and practical utility.

大型语言模型(LLMs)容易产生自觉意识失调的幻觉,特别是在处理超出其知识界限的查询时。现有减缓战略采用不确定性估计或质疑拒绝机制,但也有计算效率,并牺牲了帮助作用。为了解决这些问题,我们建议采用“明白知识边界模型”框架,整合快速和缓慢的推理系统,以统一可靠性和使用性能和可用性。框架首先采用快速思考模型,以产生以信任为标签的响应,能够立即利用高信任产出,而不确定的预测则触发一个缓慢的精细改进模型,以便提高准确性。为了将模型与我们拟议的目标保持一致,我们提议了一个混合培训管道,在不降低任务性能的情况下提高自我意识。对话状态跟踪任务评估表明,EKBMs在基于不确定性的基线上取得了超强的模型可靠性。进一步分析表明,改进极大地提高了准确性,同时保持低计算性能。框架为在对错误敏感的应用中部署可靠的LMs提供了可扩展的范例,有效地平衡准确性和实用性。

Article 284

Title@2025-07-08 (2): ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark

Title: ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark

KontextASR-Bench: Ein massiver Kontext für die Spracherkennung Benchmark

ASR-Bunch:大规模背景情况语音识别基准 2507.05727v1

Authors (7): He Wang, Linhan Ma, Dake Guo, Xiong Wang, Lei Xie, Jin Xu, Junyang Lin

Automatic Speech Recognition (ASR) has been extensively investigated, yet prior evaluative efforts have largely been restricted to contextless paradigms. This constraint stems from the limited proficiency of conventional ASR models in context modeling and their deficiency in memory and reasoning based on world knowledge. Recent breakthroughs in the development of Large Language Models (LLMs) and corresponding Large Audio Language Models (LALMs) have markedly enhanced the visibility of general artificial intelligence capabilities. Consequently, there exists a compelling need for a benchmark that can evaluate both the generality and intelligence of ASR systems. To address this gap, we propose ContextASR-Bench: a comprehensive, large-scale benchmark designed to assess contextual speech recognition. This benchmark encompasses up to 40,000 data entries across over 10 domains, enabling a thorough evaluation of model performance in scenarios that omit or incorporate coarse-grained or fine-grained contextual information. Moreover, diverging from conventional ASR evaluations, our benchmark includes an analysis of model efficacy in recognizing named entities mentioned within the auditory input. Our extensive evaluation highlights that LALMs, with strong world knowledge and context learning capabilities, outperform conventional ASR models by a large margin. The dataset and evaluation code have been released at https://github.com/MrSupW/ContextASR-Bench.

已经广泛调查了自动语音识别(ASR),但先前的评价工作基本上局限于没有背景的范式,这一制约因素源于传统ASR模型在背景建模方面的熟练程度有限,以及基于世界知识的记忆和推理方面的不足;最近在开发大语言模型(LLMs)和相应的大音频模型(LALMs)方面取得的突破明显提高了一般人工智能能力的可见度;因此,迫切需要有一个能够评价ASR系统的一般性和智能的基准;为弥补这一差距,我们提议CEDASR-Bench:一个旨在评估背景语音识别的全面、大规模基准;该基准包括10个以上领域的40,000个数据条目,以便能够在省略或纳入粗略或细微背景信息的情景假设中,对模型绩效进行彻底评估;此外,与传统的ASR评价不同,我们的基准包括了在承认听力投入中提到的被点名实体方面的模型效率分析;我们的广泛评价强调,LALMMS,具有强大的世界知识和背景学习能力,大大超越了ASR/WRMDA/SADRDRDRDA和SARDDDRDRDRDSDRDSDSDSDSDSDSDSDRDSDRDRDR

Article 285

Title@2025-07-08 (2): Truth Neurons

Title: Truth Neurons

Wahrheit Neuronen

真理中世纪 2505.12182v3

Authors (5): Haohang Li, Yupeng Cao, Yangyang Yu, Jordan W. Suchow, Zining Zhu

Despite their remarkable success and deployment across diverse workflows, language models sometimes produce untruthful responses. Our limited understanding of how truthfulness is mechanistically encoded within these models jeopardizes their reliability and safety. In this paper, we propose a method for identifying representations of truthfulness at the neuron level. We show that language models contain truth neurons, which encode truthfulness in a subject-agnostic manner. Experiments conducted across models of varying scales validate the existence of truth neurons, confirming that the encoding of truthfulness at the neuron level is a property shared by many language models. The distribution patterns of truth neurons over layers align with prior findings on the geometry of truthfulness. Selectively suppressing the activations of truth neurons found through the TruthfulQA dataset degrades performance both on TruthfulQA and on other benchmarks, showing that the truthfulness mechanisms are not tied to a specific dataset. Our results offer novel insights into the mechanisms underlying truthfulness in language models and highlight potential directions toward improving their trustworthiness and reliability.

尽管语言模型在不同工作流程中取得了显著的成功和部署,但语言模型有时会产生不真实的反应。我们对这些模型中真实性是如何机械地编码的有限理解危及了这些模型的可靠性和安全性。在本文中,我们提出了一个在神经神经层面识别真实性表现的方法。我们显示语言模型包含真理神经元,以主题和不可知性的方式将真实性编码。不同规模的模型进行的实验验证了真知灼见神经元的存在,证实神经层面真实性编码是许多语言模型共有的一个属性。层层间真实性神经元的分布模式与先前关于真实性几何测量的发现相一致。有选择地抑制通过真理QA数据集发现的事实神经元的激活,会降低真理QA和其他基准的性能,表明真实性机制与特定数据集没有联系。我们的结果为语言模型中真实性的基本机制提供了新的洞察力,并突显了提高它们可信度和可靠性的潜在方向。

Article 286

Title: Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Omni-Router: Routing-Entscheidungen in Sparse Mixture-of-Experts für die Spracherkennung teilen

Omni-Router: 分享语音识别专家的松散混集决定 2507.05724v1

Authors (3): Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly

Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model \emph{Omni-router Transformer}. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.

专家混合(MOE)结构从语言模型发展到自动语音识别(ASR)。传统的MOE方法,如开关变换器,是每个层内独立的路线专家。我们的分析表明,大多数层的路由器作出的专家选择与其他层路由器的选择没有强烈关联。为了增加不同层的专家之间的合作,鼓励更大的专业化,我们使用一个不同层的共享路由器。我们称这个模型为\emph{Omni-router 变换器。关于大规模假标签数据集的广泛实验,以及10个不同、超出外部的ASR基准的评估表明,Omni-router变换器能够实现较低的培训损失,并持续地超过其他层路由器的选择模式,将平均字差率分别降低11.2%和8.2%,同时提供结构化的专家使用,提高多样性数据的稳健度。

Article 287

Title@2025-07-08 (2): MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment

Title: MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment

MobileGUI-RL: Mobile GUI-Agent durch Verstärkungslernen im Online-Umfeld fördern

MomGUI-RL:通过在线环境强化学习推进移动图形界面代理 2507.05720v1

Authors (8): Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, Dong Yu

Recently, there has been a surge of vision-based GUI agents designed to automate everyday mobile and web tasks. These agents interpret raw GUI screenshots and autonomously decide where to click, scroll, or type, which bypasses handcrafted rules and app-specific APIs. However, most existing methods trained GUI agent in the offline environment using pre-collected trajectories. This approach limits scalability, causes overfitting to specific UI templates, and leads to brittle policies when faced with unseen environment. We present MobileGUI-RL, a scalable framework that trains GUI agent in online environment. MobileGUI-RL contains two key components. It (i) synthesizes a curriculum of learnable tasks through self-exploration and filtering, and (ii) adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards that balance task success and execution efficiency. Experiments on three online mobile-agent benchmarks show consistent gains, validating the effectiveness of our approach.

最近,在设计日常移动和网络任务自动化时,出现了基于愿景的图形界面代理器的激增。这些代理商对原始图形界面的截图进行解释,自主决定点击、滚动或键入何处,绕过手工制作的规则和应用程序专用API。然而,大多数现有方法都使用预先收集的轨迹在离线环境中培训了图形界面代理器。这种方法限制了可缩放性,导致过度适应特定的界面模板,并在面对不可见的环境时导致政策萎缩。我们介绍了移动GUI-RL,这是一个在在线环境中培训图形界面代理器的可扩展框架。移动GUI-RL包含两个关键组成部分。它(一)综合了通过自我探索和过滤的可学习任务课程,以及(二)使GROPO与图形界面导航相适应的轨迹识别优势和综合奖励,平衡任务成功和执行效率。对三个在线移动代理商基准的实验显示了一致的收益,证实了我们的方法的有效性。

Article 288

Title@2025-07-08 (2): HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation

Title: HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation

HIRAG: Hierarchisch-gedankte Instruktion-Tuning-Retrieval-Augmented Generation

HIRAG: 高层次研究教学-引导检索-推荐一代 2507.05714v1

Authors (7): YiHan Jiao, ZheHao Tan, Dan Yang, DuoLin Sun, Jie Feng, Jian Wang, Peng Wei

Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often \textit{lack a granular focus on RAG task} or \textit{a deeper utilization of chain-of-thought processes}. To address this, we propose that RAG models should possess three progressively hierarchical abilities (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG) incorporates a “think before answering” strategy. This method enhances the model’s open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model’s performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.

重新获取增强的一代(RAG)已成为解决大型语言模型在处理实时信息和特定领域问题方面所面临的挑战的基本范例。传统的RAG系统主要依赖大语言模型本身的内流学习能力。然而,关于RAG生成模型所需的具体能力的深入研究仍然缺乏,导致文件质量和检索系统不完善的挑战。即使是微调的RAG基因化模型常常\textit{缺乏对RAG任务的微调焦点}或更深入地利用连锁思考进程。为了解决这个问题,我们建议RAG模型应当拥有三种逐步的等级能力:(1) 过滤:选择相关信息的能力;(2) 合并:将各段落的语义信息结合起来的能力;(3) RAG特定推理:利用内部知识进一步处理外部知识的能力。因此,我们引入了我们新的RAG 指令微调方法,Sierarshi-Sqourat-Retal-RetailQQQQ , 更深入地利用不断更新的MARAAAA 测试战略, 大幅改进HAGAG-BA的模型, 测试战略, 改进HAG-BAG-S-strual-strual-strual-strual-strual-strual-straking-strual-strual-stris-strat-strat-stris

Article 289

Title@2025-07-08 (2): DRAGON: Dynamic RAG Benchmark On News

Title: DRAGON: Dynamic RAG Benchmark On News

DRAGON: Dynamischer RAG-Benchmark auf Neuigkeiten

DRAGON:动态RAG新闻基准 2507.05713v1

Authors (7): Fedor Chernogorskii, Sergei Averkiev, Liliya Kudraleeva, Zaven Martirosian, Maria Tikhonova, Valentin Malykh, Alena Fenogenova

Retrieval-Augmented Generation (RAG) is a widely adopted approach for improving the factuality of large language models (LLMs) by incorporating external knowledge at inference time. Although there exist multiple RAG benchmarks for English, evaluation resources for other languages, including Russian, remain scarce and static, failing to capture the dynamic nature of real-world deployments. In this work, we present DRAGON (Dynamic RAG Benchmark On News), the first dynamic benchmark for evaluating RAG systems in Russian on a changing news corpora. DRAGON is built upon a regularly updated corpus of Russian news and public documents and supports comprehensive evaluation of both the retriever and generator components. Question generation is performed automatically with the use of Knowledge Graph constructed from the corpus and enables the extraction of four core question types aligned with distinct subgraph patterns. We release a complete evaluation framework comprising the pipeline for automatic question generation, evaluation scripts, which are potentially reusable for other languages and multilingual settings, and benchmark data. We also launch a public leaderboard to encourage community participation and comparison.

在这项工作中,我们提出DRAGON(动态RAGON)(动态RAG Basic on News),这是在不断变化的新闻公司中评价俄罗斯的RAG系统的第一个动态基准;DRAGON(DRAGON)以定期更新的俄罗斯新闻和公共文件为基础,支持对检索器和发电机组成部分进行综合评价;通过使用从文体上建立的知识图,自动生成问题,使四个核心问题类型与不同的子绘图模式相一致,从而得以抽取。我们发布了一个完整的评价框架,其中包括自动生成问题的管道、评价脚本,这些文字有可能用于其他语言和多语种环境,以及基准数据。我们还启动了一个公共领导板,以鼓励社区参与和比较。

Article 290

Title@2025-07-08 (2): RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Title: RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

RAG-R1 : Förderung der Such- und Begründungsfähigkeiten von LLMs durch Multi-Query-Parallelismus

RAG-R1:通过多种克质平行主义鼓励LLMs的搜索和说明能力 2507.02962v2

Authors (6): Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu

Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while they remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have explored enhancing models’ search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to the single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, aimed at reducing inference time and enhancing the model’s capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.

大型语言模型(LLMS)在各种任务中表现出了非凡的能力,而由于其静态的内部知识,它们仍然容易产生幻觉或过时的应对办法; 回收和提炼一代(RAG)方法最近的进展探索了通过强化学习加强模型的搜索和推理能力; 虽然这些方法显示了有希望的成果,但它们在培训稳定性方面面临着挑战,并遇到了诸如因单一查询模式而导致的大量推论时间和有限能力等问题; 在本文件中,我们提议了一个新的培训框架,旨在使LAG-R1能够在推理过程中以适应性的方式利用内部和外部知识; 我们进一步扩大了从单式回收模式到多式平行框架的生成和检索过程,目的是减少推论时间和提高模型的能力。对七个问题回答基准的广泛实验表明,我们的方法比最强的基线高出13.2%,推论时间减少了11.1%。

Article 291

Title@2025-07-08 (2): SIGIR 2025 – LiveRAG Challenge Report

Title: SIGIR 2025 – LiveRAG Challenge Report

SIGIR 2025 – LiveRAG Challenge Report

SIGIR 2025-LiveRAG挑战报告 2507.04942v2

Authors (9): David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Oren Somekh, Ran Tavory, Mehdi Ghissassi, Edo Liberty, Roy Miara

The LiveRAG Challenge at SIGIR 2025, held between March and May 2025, provided a competitive platform for advancing Retrieval-Augmented Generation (RAG) technologies. Participants from academia and industry were invited to develop a RAG-based question-answering system using a fixed corpus (Fineweb-10BT) and a common open-source LLM (Falcon3-10B-Instruct). The goal was to facilitate challenging comparisons of retrieval and prompting strategies. During the Live Challenge Day, 70 teams from 27 different countries provided answers and supportive information to 500 unseen questions within a strict two-hour time window. Evaluation was conducted in two stages: first an automated LLM-as-a-judge approach was used to compute correctness and faithfulness score, then a manual review of top ranked submissions was conducted. The finalists were announced on June 12, 2025, with prizes awarded during the LiveRAG Workshop at SIGIR 2025 in Padua, Italy.

2025年3月至5月在SIGIR 2025年举办的LiveRAG挑战赛,为推进回升一代(RAG)技术提供了一个竞争性平台,来自学术界和工业界的参与者被邀请利用固定体(Fineweb-10BT)和共同开放源码LLM(Falcon3-10B-Instruct)开发基于RAG的问答系统,目的是便利对检索和推动战略进行具有挑战性的比较;在现场挑战日,来自27个不同国家的70个小组在严格的两小时窗口内为500个未见问题提供了答案和辅助信息;评价分两个阶段进行:首先采用自动LLM-A-A-判断方法来计算正确性和忠诚分数,然后对排名最高的呈件进行手工审查;最后一批人于2025年6月12日宣布,在意大利帕杜阿的SIGIR 2025年LiveRAG讲习班期间颁发奖项。

Article 292

Title@2025-07-08 (2): Agentic-R1: Distilled Dual-Strategy Reasoning

Title: Agentic-R1: Distilled Dual-Strategy Reasoning

Agentic-R1: Destillierte Dual-Strategy-Reasoning

剂- R1: 蒸馏的双战略理由 2507.05707v1

Authors (4): Weihua Du, Pranjal Aggarwal, Sean Welleck, Yiming Yang

Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, DualDistill, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using this approach, we train Agentic-R1, which dynamically selects the optimal strategy for each query, invoking tools for arithmetic and algorithmic problems, and using text-based reasoning for abstract ones. Our method improves accuracy across a range of tasks, including both computation-intensive and standard benchmarks, demonstrating the effectiveness of multi-strategy distillation in achieving robust and efficient reasoning. Our project is available at https://github.com/StigLidu/DualDistill

目前的长期思维链(长期-Cot)模型在数学推理方面十分出色,但依赖缓慢和容易出错的自然语言痕迹。工具强化剂通过代码执行处理算术,但往往在复杂的逻辑任务上摇摇欲坠。我们引入了一个微调框架,即DualStill,将多个教师的补充推理战略从一个统一的学生模式中提炼出来。我们采用这种方法,培训Agric-R1,动态地为每个查询选择最佳战略,援引算术和算法问题工具,并使用基于文本的推理法来抽象的。我们的方法提高了一系列任务的准确性,包括计算密集和标准基准,展示了多战略蒸馏在实现有力和高效推理方面的有效性。我们的项目可在https://github.com/StigLidu/DualDustill查阅。

Article 293

Title@2025-07-08 (2): A Survey on Transformer Context Extension: Approaches and Evaluation

Title: A Survey on Transformer Context Extension: Approaches and Evaluation

Eine Umfrage über Transformer Context Extension: Ansätze und Evaluierung

关于 “ 变换背景推广:方法和评价 “ 的调查 2503.13299v2

Authors (5): Yijun Liu, Jinzheng Yu, Yang Xu, Zhongyang Li, Qingfu Zhu

Large language models (LLMs) based on Transformer have been widely applied in the filed of natural language processing (NLP), demonstrating strong performance, particularly in handling short text tasks. However, when it comes to long context scenarios, the performance of LLMs degrades due to some challenges. To alleviate this phenomenon, there is a number of work proposed recently. In this survey, we first list the challenges of applying pre-trained LLMs to process long contexts. Then systematically review the approaches related to long context and propose our taxonomy categorizing them into four main types: positional encoding, context compression, retrieval augmented, and attention pattern. In addition to the approaches, we focus on the evaluation of long context, organizing relevant data, tasks, and metrics based on existing long context benchmarks. Finally, we summarize unresolved issues in the long context domain and put forward our views on future developments.

以变换器为基础的大型语言模型(LLMS)在自然语言处理(NLP)的归档中被广泛应用,表现出很强的成绩,特别是在处理短文本任务方面;然而,在长背景假设中,由于一些挑战,LLMS的性能会退化;为了缓解这种现象,最近提出了若干工作建议;在本次调查中,我们首先列出在长背景中应用预先培训的LLMS的挑战;然后系统地审查与长背景有关的做法,并提议将这些方法分为四大类型:定位编码、环境压缩、检索增强和注意模式;除了这些方法外,我们着重评价长背景,根据现有的长背景基准组织相关数据、任务和衡量标准;最后,我们总结了长期领域尚未解决的问题,并提出我们对未来发展的看法。

Article 294

Title@2025-07-08 (2): PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading

Title: PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading

PulseReddit: Ein neuartiger Reddit-Datensatz für das Benchmarking von MAS im Kryptowährungshandel mit hoher Frequenz

PulseReddit:高级加密货币交易中用于确定MAS基准的新Rddit数据集 2506.03861v2

Authors (4): Qiuhan Han, Qian Wang, Atsushi Yoshikawa, Masayuki Yamamura

High-Frequency Trading (HFT) is pivotal in cryptocurrency markets, demanding rapid decision-making. Social media platforms like Reddit offer valuable, yet underexplored, information for such high-frequency, short-term trading. This paper introduces \textbf{PulseReddit}, a novel dataset that is the first to align large-scale Reddit discussion data with high-frequency cryptocurrency market statistics for short-term trading analysis. We conduct an extensive empirical study using Large Language Model (LLM)-based Multi-Agent Systems (MAS) to investigate the impact of social sentiment from PulseReddit on trading performance. Our experiments conclude that MAS augmented with PulseReddit data achieve superior trading outcomes compared to traditional baselines, particularly in bull markets, and demonstrate robust adaptability across different market regimes. Furthermore, our research provides conclusive insights into the performance-efficiency trade-offs of different LLMs, detailing significant considerations for practical model selection in HFT applications. PulseReddit and our findings establish a foundation for advanced MAS research in HFT, demonstrating the tangible benefits of integrating social media.

高频交易(高频交易)在加密货币市场中至关重要,要求快速决策。Reddit等社交媒体平台为这种高频短期交易提供了宝贵但尚未得到充分探索的信息。本文介绍了新颖的数据集 :\ textbf{PulseReddit},这是一个将大规模网上讨论数据与高频加密货币市场统计数据统一起来用于短期贸易分析的新数据集。我们利用基于大语言模型(LLLM)的多源系统(MAS)进行了广泛的实证研究,以调查PulseRedddit的社会情绪对贸易绩效的影响。我们的实验结论是,借助PulseReddddit数据扩展的MAS与传统基线相比,特别是在公牛市场中,取得了优异于传统的贸易结果,并展示了不同市场体制的强大适应性。此外,我们的研究为不同LMS的绩效交易提供了确凿的洞察力,详细说明了在HFT应用程序中实际选择模式的重大考虑因素。PulseReddddit和我们的调查结果为HFT公司高级MAS研究奠定了基础,展示了整合社会媒体的有形的好处。

Article 295

Title@2025-07-08 (2): Do We Really Need Specialization? Evaluating Generalist Text Embeddings for Zero-Shot Recommendation and Search

Title: Do We Really Need Specialization? Evaluating Generalist Text Embeddings for Zero-Shot Recommendation and Search

Brauchen wir wirklich Spezialisierung? Auswertung von Generalisten-Text-Embeddings für Zero-Shot-Empfehlung und Suche

我们真的需要专业化吗?评估零热建议和搜索通用文本嵌入 2507.05006v2

Authors (6): Matteo Attimonelli, Alessandro De Bellis, Claudio Pomo, Dietmar Jannach, Eugenio Di Sciascio, Tommaso Di Noia

Pre-trained language models (PLMs) are widely used to derive semantic representations from item metadata in recommendation and search. In sequential recommendation, PLMs enhance ID-based embeddings through textual metadata, while in product search, they align item characteristics with user intent. Recent studies suggest task and domain-specific fine-tuning are needed to improve representational power. This paper challenges this assumption, showing that Generalist Text Embedding Models (GTEs), pre-trained on large-scale corpora, can guarantee strong zero-shot performance without specialized adaptation. Our experiments demonstrate that GTEs outperform traditional and fine-tuned models in both sequential recommendation and product search. We attribute this to a superior representational power, as they distribute features more evenly across the embedding space. Finally, we show that compressing embedding dimensions by focusing on the most informative directions (e.g., via PCA) effectively reduces noise and improves the performance of specialized models. To ensure reproducibility, we provide our repository at https://split.to/gte4ps.

预先培训的语言模型(PLMs)被广泛用来从建议和搜索的项目元数据中得出语义表达。在顺序建议中,PLMs通过文本元数据加强基于身份的嵌入,而在产品搜索中,它们使项目特性与用户意图相一致。最近的研究表明,需要任务和特定领域的微调来改进代表性力量。本文挑战这一假设,表明在大规模公司规模上预先培训的通用文本嵌入模型(GTEs)可以保证强力的零弹性能,而无需专门调整。我们的实验表明,在顺序建议和产品搜索中,GTEs都超越了传统的和经过精细调整的模式。我们将此归因于一种优越的代表性力量,因为它们在嵌入空间之间分布的特征更加均衡。最后,我们表明,通过最丰富的信息方向(例如,通过CPA)来压缩嵌入的维度,可以有效减少噪音,改进专门模型的性能。为了确保可追溯性,我们提供https://split.gte4ps的存储库。

Article 296

Title@2025-07-08 (2): Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria

Title: Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria

KI-Beratung auf Japanisch bewerten: Berater, Kunde und Evaluator Rollen, die von Motivations-Interview-Kriterien überzeugt sind

评价大赦国际在日本的咨询:顾问、客户和根据动机访谈标准评估的评价作用 2507.02950v2

Authors (7): Keita Kiuchi, Yoshikazu Fujimoto, Hideyuki Goto, Tomonori Hosokawa, Makoto Nishimura, Yosuke Sato, Izumi Sezai

This study provides the first comprehensive evaluation of large language model (LLM) performance across three counseling roles in Japanese-language therapeutic contexts. We simultaneously assessed counselor artificial intelligence (AI) systems (GPT-4-turbo with zeroshot prompting or Structured Multi-step Dialogue Prompts (SMDP), Claude-3-Opus-SMDP), client AI simulations, and evaluation AI systems (o3, Claude-3.7-Sonnet, Gemini-2.5-pro). Human experts (n = 15) with extensive counseling experience evaluated AI-generated dialogues using the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Notably, SMDP implementation significantly enhanced counselor AI performance across all MITI global ratings compared with zeroshot prompting, with no significant differences between GPT-SMDP and Opus-SMDP. Evaluation AIs showed comparable performance to human raters for Cultivating Change Talk but systematically overestimated Softening Sustain Talk and the overall quality metrics. Model-specific biases emerged: Gemini emphasized power-sharing, o3 focused on technical proficiency, and Sonnet prioritized emotional expression. Client AI simulations exhibited a limited emotional range and unnaturally high compliance, indicating the need for enhanced realism. These findings establish benchmarks for AI-assisted counseling in non-English contexts and identify critical areas for improvement through advanced prompt engineering, retrieval-augmented generation, and targeted fine-tuning, with important implications for developing culturally sensitive AI mental health tools.

本研究首次全面评估了在日语治疗背景下三大咨询角色的大型语言模式(LLM)业绩,同时评估了顾问人工智能系统(GPT-4-turbo,零点提示或结构化多步对话提示(SMDP)、Claude-3-Opus-SMDP)、客户人工智能模拟和评价AI系统(o3,Claude-3.7-Sonnet,Gemini-2.5-pro)、具有广泛咨询经验的人类专家(n=15),利用激励性访谈治疗完整性(MITI)汇编手册4.2.1,评估了人工智能对话(AI)系统(GPT-4-turbo,零点提示或结构化多步对话提示(SMDP-SMDP))、客户AI系统(GPT-SMDP)和Opus-SMDP(SMDP)之间没有重大差异)、客户人工智能模拟测试系统模拟(o3,Claude-3.7-Sonnoring Sustain Sal Tal Tal)和总体质量衡量标准。

Article 297

Title@2025-07-08 (2): AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs

Title: AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs

AutoTriton: Automatische Triton-Programmierung mit Verstärkungslernen in LLMs

AutoTriton: 自动三联编程,在LLMs中进行强化学习 2507.05687v1

Authors (11): Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, Maosong Sun

Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific languages like Triton simplify GPU programming by abstracting low-level details, developers must still manually tune critical parameters such as tile sizes and memory access patterns through iterative experimentation, creating substantial barriers to optimal performance and wider adoption. In this work, we introduce AutoTriton, the first model dedicated to Triton programming powered by reinforcement learning (RL). AutoTriton performs supervised fine-tuning (SFT) to be equipped with essential Triton programming expertise using a high-quality data gathering pipeline, and conducts RL with Group Relative Policy Optimization (GRPO) algorithm, combining a rule-based reward and an execution-based reward to further improve Triton programming ability, sequentially. Experiments across five evaluation channels of TritonBench and KernelBench illustrate that our 8B model AutoTriton achieves performance comparable to mainstream large models, including Claude-4-Sonnet and DeepSeek-R1-0528. Further experimental analysis demonstrates the crucial role of each module within AutoTriton, including the SFT stage, the RL stage, and the reward design strategy. These findings underscore the promise of RL for automatically generating high-performance kernels, and since high-performance kernels are core components of AI systems, this breakthrough establishes an important foundation for building more efficient AI systems. The model and code will be available at https://github.com/AI9Stars/AutoTriton.

深层学习的内核开发要求优化跨硬件的计算单位,同时通过广泛的实证调整来平衡记忆管理、平行和硬件优化。尽管特里顿等特定领域的语言通过抽取低层细节简化 GPU 编程,但开发商仍必须通过迭接实验手动调控关键参数,如瓷砖大小和内存存存存存访问模式,为最佳性能和广泛采用设置了重大障碍。在这项工作中,我们引入了AutTriton(AutTriton),这是专门为特里顿编程提供强化学习动力的第一个模型(RL);AutTriton(SFT)在监督下进行微调(SFT),以配备高质量的数据收集管道所需的基本特里顿编程设计专长,并用集团相对政策优化(GROPO)算法进行RL(GPO)算法,结合基于规则的奖赏和基于执行奖赏的奖赏模式,以进一步提高Triton编程/Kenne Beenchen,这些A(包括Clau-S-Son-Sont-T)的高级基础,进一步实验分析将自动显示Staryal Stal-rual Stal Studal Studal Studal)的每个模块在Rst-Sy-Slal Studal-I-Slal-I-Sleval-Sl-I-Sleval-Sl)的每一个阶段中的重要基础。

Article 298

Title@2025-07-08 (2): Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs

Title: Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs

Smoothie-Qwen: Post-Hoc-Glättung, um Sprach-Bias in mehrsprachigen LLMs zu reduzieren

Slidie-Quwen:以多种语言平滑后减少语言偏见 2507.05686v1

Authors (5): SeungWon Ji, Jungyup Lee, Jemin Kim, Sang Park, SeungJae Lee

Multilingual large language models (LLMs) often exhibit language confusion, a tendency to generate responses in a dominant language irrespective of the prompt’s language. To address this, we propose Smoothie-Qwen, a lightweight, post-hoc method that mitigates language bias without retraining. This technique selectively adjusts token-level output probabilities to effectively suppress undesired language generation. Applied to the Qwen model, our method reduces unintended Chinese output by over 95% while preserving task accuracy on multilingual benchmarks. This work provides a practical and efficient solution for enhancing the language controllability of LLMs, making them more reliable for global applications.

多语言大型语言模式(LLMs)往往表现出语言混乱,一种以主导语言做出回应的倾向,而不论迅速使用的语言如何。为了解决这个问题,我们建议采用轻巧的后热方法Slipie-Qwen,这种方法可以减轻语言偏见,而无需再培训。这种技术有选择地调整象征性产出概率,以有效抑制不受欢迎的语言生成。应用到Quen模式,我们的方法将中国无意产出减少95%以上,同时保留多语种基准的任务准确性。这项工作为提高LLMs的语言可控性提供了一个实用有效的解决方案,使LMs更可靠地用于全球应用。

Article 299

Title@2025-07-08 (2): BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Title: BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

BMMR: Ein multimodaler multimodaler multidisziplinärer Datensatz, der großflächig zweisprachig ist

BMMR: 大型双语双语多语多式多式多残疾理由数据集 2507.03483v2

Authors (16): Zhiheng Xi, Guanyu Li, Yutao Fan, Honglin Guo, Yufang Liu, Xiaoran Fan, Jiaqi Liu, Jingchao Ding, Wangmeng Zuo, Zhenfei Yin, Lei Bai, Tao Ji, Tao Gui, Qi Zhang, Philip Torr, Xuanjing Huang

In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects, spanning diverse formats-multiple-choice, fill-in-the-blank, and open-ended QA-and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval that comprises 20,458 high-quality instances to comprehensively assess LMMs’ knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train that contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.

在本文中,我们引入了大型双语、多式联运、多学科推理数据集(BMMR),这是社区开发和评价大型多式联运模型(LMMMs)的大型双语、多式联运和多学科推理数据集。BMMR由110k个大学级问题组成,涉及300个教科文组织界定的科目,涉及多种格式-多选择、填补空白和开放式QA,来自印刷和数字媒体,如书籍、考试和测验等。所有数据都通过人际流动和可扩缩框架进行整理和过滤,每个实例都配有高质量的推理路径。该数据集分为两个部分:BMRR-Eval,包括20,458个高质量的案例,以全面评估LMMMR的知识和推理,包括中英多种学科;以及BMR-Train,包含88,991个案例,以支持进一步的研究与发展,将当前对数学推理的焦点扩大到不同的学科和领域。此外,我们建议以流程为基础的多度校正(i,即甚低级的LMMR-RMR-RMR(ial-Bral-Bral-Ial-Bral-Bral-Bral-IRC),我们目前进行实质性的深度分析(RO-BMIS-Brex-Bral-BRis-BR),关于正的深度分析。(ro-I-I-I-I-I-I-I-BR-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-

Article 300

Title@2025-07-08 (2): TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data

Title: TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data

TuneShield: Eindämmung der Toxizität bei konversatorischer KI während der Feinabstimmung auf nicht vertrauenswürdige Daten

TuneSheld:在对不可信数据进行微调的同时,在交流性AI中减轻毒性 2507.05660v1

Authors (7): Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath

Recent advances in foundation models, such as LLMs, have revolutionized conversational AI. Chatbots are increasingly being developed by customizing LLMs on specific conversational datasets. However, mitigating toxicity during this customization, especially when dealing with untrusted training data, remains a significant challenge. To address this, we introduce TuneShield, a defense framework designed to mitigate toxicity during chatbot fine-tuning while preserving conversational quality. TuneShield leverages LLM-based toxicity classification, utilizing the instruction-following capabilities and safety alignment of LLMs to effectively identify toxic samples, outperforming industry API services. TuneShield generates synthetic conversation samples, termed ‘healing data’, based on the identified toxic samples, using them to mitigate toxicity while reinforcing desirable behavior during fine-tuning. It performs an alignment process to further nudge the chatbot towards producing desired responses. Our findings show that TuneShield effectively mitigates toxicity injection attacks while preserving conversational quality, even when the toxicity classifiers are imperfect or biased. TuneShield proves to be resilient against adaptive adversarial and jailbreak attacks. Additionally, TuneShield demonstrates effectiveness in mitigating adaptive toxicity injection attacks during dialog-based learning (DBL).

nan

Article 301

Title@2025-07-08 (2): SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge

Title: SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge

SHNU Mehrsprachiges Spracherkennungssystem für INTERSPEECH 2025 MLC-SLM Challenge

SHNU 2025年刚果解放运动-解运挑战 2507.03343v2

Authors (4): Yuxiang Mei, Yuang Zheng, Dongxing Xu, Yanhua Long

This paper describes SHNU multilingual conversational speech recognition system (SHNU-mASR, team name-“maybe”), submitted to Track 1 of the INTERSPEECH 2025 MLC-SLM Challenge. Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework. The parallel-speech-encoder consists of two pre-trained encoders, the Whisper-large-v3 encoder and mHuBERT-147 encoder. Their output embeddings are concatenated and fed into the LLM, enabling the model to leverage complementary acoustic and linguistic knowledge and achieve competitive performance. Moreover, we adopt a tri-stage training strategy to jointly update the low-rank adaptation modules and projector parameters of both the speech encoders and the LLM. In addition, we incorporate an additional language-aware prompt at the LLM input to enhance language-specific text generation. The SHNU-mASR system achieves an overall character/word error rate (CER/WER) of 11.76% on the blind evaluation set of the challenge, outperforming the official MLC-SLM baseline by 8.41 absolute CER/WER, without increasing the baseline training data.

nan

Article 302

Title@2025-07-08 (2): MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models

Title: MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models

MEF: Ein Capability-Aware Multi-Encryption Framework zur Bewertung von Schwachstellen in Black-Box Large Language Models

MEF: 用于评价黑箱大语言模型脆弱性的能力-软件多加密框架 2505.23404v3

Authors (6): Mingyu Yu, Wei Wang, Yanjie Wei, Sujuan Qin, Fei Gao, Wenmin Li

Recent advancements in adversarial jailbreak attacks have revealed significant vulnerabilities in Large Language Models (LLMs), facilitating the evasion of alignment safeguards through increasingly sophisticated prompt manipulations. In this paper, we propose MEF, a capability-aware multi-encryption framework for evaluating vulnerabilities in black-box LLMs. Our key insight is that the effectiveness of jailbreak strategies can be significantly enhanced by tailoring them to the semantic comprehension capabilities of the target model. We present a typology that classifies LLMs into Type I and Type II based on their comprehension levels, and design adaptive attack strategies for each. MEF combines layered semantic mutations and dual-ended encryption techniques, enabling circumvention of input, inference, and output-level defenses. Experimental results demonstrate the superiority of our approach. Remarkably, it achieves a jailbreak success rate of 98.9\% on GPT-4o (29 May 2025 release). Our findings reveal vulnerabilities in current LLMs’ alignment defenses.

nan

Article 303

Title@2025-07-08 (2): ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

Title: ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

ECom-Bench: Kann LLM Agent Probleme mit dem echten E-Commerce lösen?

ECom-Bench:LLM代理能否解决现实世界电子商务客户支助问题? 2507.05639v1

Authors (8): Haoxin Wang, Xianhan Peng, Xucheng Huang, Yizhe Huang, Ming Gong, Chenghan Yang, Yang Liu, Ling Jiang

In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agent with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. Upon publication, the code and data will be open-sourced to facilitate further research and development in this domain.

nan

Article 304

Title@2025-07-08 (2): Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

Title: Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

Feint and Attack: Achtsamkeitsbasierte Strategien für das Jailbreaking und den Schutz von LLMs

Feint和攻击:以关注为基础的破狱和保护LLMs 2410.16327v2

Authors (8): Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu, Zaisheng Ye

Jailbreak attack can be used to access the vulnerabilities of Large Language Models (LLMs) by inducing LLMs to generate the harmful content. And the most common method of the attack is to construct semantically ambiguous prompts to confuse and mislead the LLMs. To access the security and reveal the intrinsic relation between the input prompt and the output for LLMs, the distribution of attention weight is introduced to analyze the underlying reasons. By using statistical analysis methods, some novel metrics are defined to better describe the distribution of attention weight, such as the Attention Intensity on Sensitive Words (Attn_SensWords), the Attention-based Contextual Dependency Score (Attn_DepScore) and Attention Dispersion Entropy (Attn_Entropy). By leveraging the distinct characteristics of these metrics, the beam search algorithm and inspired by the military strategy “Feint and Attack”, an effective jailbreak attack strategy named as Attention-Based Attack (ABA) is proposed. In the ABA, nested attack prompts are employed to divert the attention distribution of the LLMs. In this manner, more harmless parts of the input can be used to attract the attention of the LLMs. In addition, motivated by ABA, an effective defense strategy called as Attention-Based Defense (ABD) is also put forward. Compared with ABA, the ABD can be used to enhance the robustness of LLMs by calibrating the attention distribution of the input prompt. Some comparative experiments have been given to demonstrate the effectiveness of ABA and ABD. Therefore, both ABA and ABD can be used to access the security of the LLMs. The comparative experiment results also give a logical explanation that the distribution of attention weight can bring great influence on the output for LLMs.

nan

Article 305

Title@2025-07-08 (2): Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models

Title: Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models

Sarkasmus durch unterschiedliche Augen sehen: Multimodale Sarkasmuswahrnehmung in großen Vision-Sprachen-Modellen analysieren

透过不同的眼睛看到讽刺:在大型视觉语言模型中分析多模式讽刺的认知 2503.12149v2

Authors (5): Junjie Chen, Xuyang Liu, Subin Huang, Linfeng Zhang, Hang Yu

With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs over 2,409 samples, we examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous “neutral” cases. Our findings reveal notable discrepancies – across LVLMs and within the same model under varied prompts. While classification-oriented prompts yield higher internal consistency, models diverge markedly when tasked with interpretive reasoning. These results challenge binary labeling paradigms by highlighting sarcasm’s subjectivity. We advocate moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling, offering deeper insights into multimodal sarcasm comprehension. Our code and data are available at: https://github.com/CoderChen01/LVLMSarcasmAnalysis

nan

Article 306

Title@2025-07-08 (2): SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression

Title: SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression

SARA: Selektive und adaptive retrieval-augmentierte Generation mit Kontextkompression

SARA:有环境压缩的选择性和适应性再取回型 2507.05633v1

Authors (7): Yiqiao Jin, Kartik Sharma, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar

Retrieval-augmented Generation (RAG) extends large language models (LLMs) with external knowledge but faces key challenges: restricted effective context length and redundancy in retrieved documents. Pure compression-based approaches reduce input size but often discard fine-grained details essential for factual accuracy. We propose SARA, a unified RAG framework that balances local precision and global knowledge coverage under tight context budgets. SARA combines natural-language text snippets with semantic compression vectors to jointly enhance context efficiency and answer correctness. It represents contexts at two complementary levels: 1) fine-grained natural-language spans that preserve critical entities and numerical values, and 2) compact, interpretable vectors that summarize high-level semantics. An iterative evidence-selection module employs the compression vectors for dynamic reranking of contexts. Across 9 datasets and 5 open-source LLMs spanning 3 model families (Mistral, Llama, and Gemma), SARA consistently improves answer relevance (+17.71), answer correctness (+13.72), and semantic similarity (+15.53), demonstrating the importance of integrating textual and compressed representations for robust, context-efficient RAG.

nan

Article 307

Title@2025-07-08 (2): On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows

Title: On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows

Über die Rolle von Feedback bei der Testzeitskalierung von agentischen KI-Workflows

关于反馈在AI类制剂工作流量的测试-时间尺度方面作用的反馈 2504.01931v4

Authors (11): Souradip Chakraborty, Mohammadreza Pourreza, Ruoxi Sun, Yiwen Song, Nino Scherrer, Furong Huang, Amrit Singh Bedi, Ahmad Beirami, Jindong Gu, Hamid Palangi, Tomas Pfister

Agentic AI workflows (systems that autonomously plan and act) are becoming widespread, yet their task success rate on complex tasks remains low. A promising solution is inference-time alignment, which uses extra compute at test time to improve performance. Inference-time alignment relies on three components: sampling, evaluation, and feedback. While most prior work studies sampling and automatic evaluation, feedback remains underexplored. To study the role of feedback, we introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques (reward models or AI-generated textual feedback) between decoding steps. Through IAD, we analyze feedback along four dimensions: (1) its role in the accuracy-compute trade-offs with limited inference budget, (2) quantifying the gains over diversity-only baselines such as best-of-N sampling, (3) effectiveness of composing feedback from reward models versus textual critique, and (4) robustness to noisy or low-quality feedback. Across Sketch2Code, Text2SQL, Intercode, and WebShop, we show that IAD with proper integration of high fidelity feedback leads to consistent gains up to 10 percent absolute performance improvement over various baselines such as best-of-N. Our findings underscore feedback as a crucial knob for inference-time alignment of agentic AI workflows with limited inference budget.

nan

Article 308

Title@2025-07-08 (2): Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent

Title: Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent

Adsorb-Agent: Autonome Identifizierung stabiler Adsorptionskonfigurationen über Large Language Model Agent

Assorb-Agent:通过大语言模式代理自动识别稳定吸附配置 2410.16658v4

Authors (5): Janghoon Ock, Radheesh Sharma Meda, Tirtha Vinchurkar, Yayati Jadhav, Amir Barati Farimani

Adsorption energy is a key reactivity descriptor in catalysis. Determining adsorption energy requires evaluating numerous adsorbate-catalyst configurations, making it computationally intensive. Current methods rely on exhaustive sampling, which does not guarantee the identification of the global minimum energy. To address this, we introduce Adsorb-Agent, a Large Language Model (LLM) agent designed to efficiently identify stable adsorption configurations corresponding to the global minimum energy. Adsorb-Agent leverages its built-in knowledge and reasoning to strategically explore configurations, significantly reducing the number of initial setups required while improving energy prediction accuracy. In this study, we also evaluated the performance of different LLMs, including GPT-4o, GPT-4o-mini, Claude-3.7-Sonnet, and DeepSeek-Chat, as the reasoning engine for Adsorb-Agent, with GPT-4o showing the strongest overall performance. Tested on twenty diverse systems, Adsorb-Agent identifies comparable adsorption energies for 84% of cases and achieves lower energies for 35%, particularly excelling in complex systems. It identifies lower energies in 47% of intermetallic systems and 67% of systems with large adsorbates. These findings demonstrate Adsorb-Agent’s potential to accelerate catalyst discovery by reducing computational costs and enhancing prediction reliability compared to exhaustive search methods.

nan

Article 309

Title@2025-07-08 (2): Flipping Knowledge Distillation: Leveraging Small Models’ Expertise to Enhance LLMs in Text Matching

Title: Flipping Knowledge Distillation: Leveraging Small Models’ Expertise to Enhance LLMs in Text Matching

Flipping Knowledge Destillation: Kompetenz kleiner Modelle nutzen, um LLMs in Text Matching zu verbessern

减轻知识蒸馏:利用小型模型的专长加强文本匹配中的LLM 2507.05617v1

Authors (5): Mingzhe Li, Jing Xiang, Qishen Zhang, Kaiyang Wan, Xiuying Chen

Knowledge distillation typically involves transferring knowledge from a Large Language Model (LLM) to a Smaller Language Model (SLM). However, in tasks such as text matching, fine-tuned smaller models often yield more effective domain-specific representations, as they focus on optimizing the similarity of input pairs. To leverage both the specialized strengths of small models and the rich semantic understanding of LLMs, we introduce a flipped knowledge distillation paradigm, where LLM learns from SLM. Specifically, we address the architectural gap between decoder-only LLMs and smaller encoder-based models by reinterpreting LLMs in an encoder-decoder manner using LoRA. The encoder generates compressed representations, while the decoder maps them to the output space. During training, the encoder produces representations and their similarities, which are then aligned with the similarity scores produced by the teacher, using our proposed Margin-aware Contrastive Learning (MCL) approach. The MCL ensures accurate similarity for both positive and negative pairs, and adaptively handles the internal differences within positive and negative samples. Our paradigm requires only a reasonably good-performing SLM, allowing the LLM to achieve improved performance. Experiments on financial and healthcare benchmarks, as well as real-world applications, confirm its effectiveness, and the model has been fully deployed in an online environment.

nan

Article 310

Title@2025-07-08 (2): Self-Review Framework for Enhancing Instruction Following Capability of LLM

Title: Self-Review Framework for Enhancing Instruction Following Capability of LLM

Selbst-Review-Rahmen für die Verbesserung der Anleitung nach der Fähigkeit von LLM

根据LLM能力加强教学的自我审查框架 2507.05598v1

Authors (1): Sihyun Park

Various techniques have been proposed to improve large language models (LLMs) adherence to formatting and instruction constraints. One of the most effective approaches involves utilizing high-quality data generated by powerful models. However, such models often fail to fully comply with complex instructions in a single generation. To address this limitation, iterative revision methods have been introduced. Nevertheless, as the number of data points and revision iterations increases, the associated monetary costs grow significantly. As a resource-efficient alternative, methods have been proposed that leverage high-performance evaluation tools to compensate for the limited self-evaluation capabilities of open-source LLMs. However, these approaches often lead to a degradation in output quality due to excessive revision. To overcome these challenges, we propose Re5, a self-evaluation and revision framework designed to enhance instruction-following performance while preserving the quality of the generated content. Re5 extracts task and constraint components from user instructions, performs structural evaluations to prevent error accumulation, and applies fine-grained constraint-specific content evaluations followed by selective revisions. This process ensures precise and quality-preserving improvements. The final high-quality outputs are used for alignment tuning, enabling long-term alignment improvements through a data-centric iterative refinement loop. Experimental results demonstrate that Re5 achieves instruction-following performance comparable to models trained on data generated by GPT-4o-mini, a high-performance model, even with a small amount of data while maintaining response quality with a 64.24%-win rate over the non-revised initial responses. These results validate Re5 as an efficient and effective solution for enhancing instruction adherence with minimal external supervision.

nan

Article 311

Title@2025-07-08 (2): Efficient Detection of Intermittent Job Failures Using Few-Shot Learning

Title: Efficient Detection of Intermittent Job Failures Using Few-Shot Learning

Effiziente Erkennung intermittierender Job-Fälle durch wenig scharfes Lernen

利用很少热的学习方法有效检测间歇性工作失败 2507.04173v2

Authors (3): Henri Aïdasso, Francis Bordeleau, Ali Tizghadam

One of the main challenges developers face in the use of continuous integration (CI) and deployment pipelines is the occurrence of intermittent job failures, which result from unexpected non-deterministic issues (e.g., flaky tests or infrastructure problems) rather than regular code-related errors such as bugs. Prior studies developed machine learning (ML) models trained on large datasets of job logs to classify job failures as either intermittent or regular. As an alternative to costly manual labeling of large datasets, the state-of-the-art (SOTA) approach leveraged a heuristic based on non-deterministic job reruns. However, this method mislabels intermittent job failures as regular in contexts where rerunning suspicious job failures is not an explicit policy, and therefore limits the SOTA’s performance in practice. In fact, our manual analysis of 2,125 job failures from 5 industrial and 1 open-source projects reveals that, on average, 32% of intermittent job failures are mislabeled as regular. To address these limitations, this paper introduces a novel approach to intermittent job failure detection using few-shot learning (FSL). Specifically, we fine-tune a small language model using a few number of manually labeled log examples to generate rich embeddings, which are then used to train an ML classifier. Our FSL-based approach achieves 70-88% F1-score with only 12 shots in all projects, outperforming the SOTA, which proved ineffective (34-52% F1-score) in 4 projects. Overall, this study underlines the importance of data quality over quantity and provides a more efficient and practical framework for the detection of intermittent job failures in organizations.

nan

Article 312

Title@2025-07-08 (2): Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation

Title: Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation

Low-Rank und Sparse Modell Zusammenführung für Multi-Lingual Spracherkennung und Übersetzung

多语言语音识别和翻译的低兰克和松散模式合并 2502.17380v3

Authors (3): Qiuming Zhao, Guangzhi Sun, Chao Zhang

Language diversity presents a significant challenge in speech-to-text (S2T) tasks, such as automatic speech recognition and translation. Traditional multi-lingual multi-task training approaches aim to address this by jointly optimising multiple speech recognition and translation tasks across various languages. While models like Whisper, built on these strategies, demonstrate strong performance, they still face issues of high computational cost, language interference, suboptimal training configurations, and limited extensibility. To overcome these challenges, we introduce LoRS-Merging (low-rank and sparse model merging), a novel technique designed to efficiently integrate models trained on different languages or tasks while preserving performance and reducing computational overhead. LoRS-Merging combines low-rank and sparse pruning to retain essential structures while eliminating redundant parameters, mitigating language interference, and enhancing extensibility. Experimental results across 10 languages demonstrate that LoRS-Merging significantly outperforms multi-lingual multi-task training, sequential training, and other merging methods, achieving over 20% improvement in normalised performance. Our findings suggest that model merging, particularly LoRS-Merging, is a scalable and effective complement to traditional multi-lingual training strategies for S2T applications.

nan

Article 313

Title@2025-07-08 (2): The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation

Title: The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation

Die Landschaft der Erinnerung in LLMs: Mechanismen, Messung und Milderung

LLMM中纪念的景观:机制、衡量和缓解 2507.05578v1

Authors (4): Alexander Xiong, Xuandong Zhao, Aneesh Pappu, Dawn Song

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they also exhibit memorization of their training data. This phenomenon raises critical questions about model behavior, privacy risks, and the boundary between learning and memorization. Addressing these concerns, this paper synthesizes recent studies and investigates the landscape of memorization, the factors influencing it, and methods for its detection and mitigation. We explore key drivers, including training data duplication, training dynamics, and fine-tuning procedures that influence data memorization. In addition, we examine methodologies such as prefix-based extraction, membership inference, and adversarial prompting, assessing their effectiveness in detecting and measuring memorized content. Beyond technical analysis, we also explore the broader implications of memorization, including the legal and ethical implications. Finally, we discuss mitigation strategies, including data cleaning, differential privacy, and post-training unlearning, while highlighting open challenges in balancing the minimization of harmful memorization with utility. This paper provides a comprehensive overview of the current state of research on LLM memorization across technical, privacy, and performance dimensions, identifying critical directions for future work.

nan

Article 314

Title@2025-07-08 (2): Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models

Title: Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models

Feature-Extraktion und -Lenkung für eine verbesserte Kettenbildung in Sprachmodellen

语言模型中强化研究链理由的特征采掘和指南 2505.15634v3

Authors (6): Zihao Li, Xu Wang, Yuzhe Yang, Ziyu Yao, Haoyi Xiong, Mengnan Du

Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. Expanding CoT length, as seen in models such as DeepSeek-R1, significantly enhances this reasoning for complex problems, but requires costly and high-quality long CoT data and fine-tuning. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets. Our method first employs Sparse Autoencoders (SAEs) to extract interpretable features from vanilla CoT. These features are then used to steer the LLM’s internal states during generation. Recognizing that many LLMs do not have corresponding pre-trained SAEs, we further introduce a novel SAE-free steering algorithm, which directly computes steering directions from the residual activations of an LLM, obviating the need for an explicit SAE. Experimental results demonstrate that both our SAE-based and subsequent SAE-free steering algorithms significantly enhance the reasoning capabilities of LLMs.

nan

Article 315

Title@2025-07-08 (2): Offline Learning and Forgetting for Reasoning with Large Language Models

Title: Offline Learning and Forgetting for Reasoning with Large Language Models

Offline-Lernen und Vergessen für die Vernunft mit großen Sprachmodellen

利用大语言模式进行离线学习和遗忘 2504.11364v3

Authors (6): Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, Rasool Fakoor

Leveraging inference-time search in large language models has proven effective in further enhancing a trained model’s capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model’s search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown reasoning benchmarks show that, replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180$\times$. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.

nan

Article 316

Title@2025-07-08 (2): Beyond Retrieval: Ensembling Cross-Encoders and GPT Rerankers with LLMs for Biomedical QA

Title: Beyond Retrieval: Ensembling Cross-Encoders and GPT Rerankers with LLMs for Biomedical QA

Beyond Retrieval: Zusammenbauen von Cross-Encodern und GPT-Rerankern mit LLMs für biomedizinische QA

超出检索范围:利用生物医学质量评估LMs将跨环境者和GPT再生者与生物医学质量评估LMs组合在一起 2507.05577v1

Authors (3): Shashank Verma, Fengyi Jiang, Xiangning Xue

Biomedical semantic question answering rooted in information retrieval can play a crucial role in keeping up to date with vast, rapidly evolving and ever-growing biomedical literature. A robust system can help researchers, healthcare professionals and even layman users access relevant knowledge grounded in evidence. The BioASQ 2025 Task13b Challenge serves as an important benchmark, offering a competitive platform for advancement of this space. This paper presents the methodologies and results from our participation in this challenge where we built a Retrieval-Augmented Generation (RAG) system that can answer biomedical questions by retrieving relevant PubMed documents and snippets to generate answers. For the retrieval task, we generated dense embeddings from biomedical articles for initial retrieval, and applied an ensemble of finetuned cross-encoders and large language models (LLMs) for re-ranking to identify top relevant documents. Our solution achieved an MAP@10 of 0.1581, placing 10th on the leaderboard for the retrieval task. For answer generation, we employed few-shot prompting of instruction-tuned LLMs. Our system achieved macro-F1 score of 0.95 for yes/no questions (rank 12), Mean Reciprocal Rank (MRR) of 0.64 for factoid questions (rank 1), mean-F1 score of 0.63 for list questions (rank 5), and ROUGE-SU4 F1 score of 0.29 for ideal answers (rank 11).

nan

Article 317

Title@2025-07-08 (2): Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

Title: Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

Infini-gram mini: Genaue n-gram Suche auf der Internetskala mit FM-Index

Infini-gram 微型: 使用 FM- Index 的 Internet 比例尺精确的 n 克搜索 2506.12229v2

Authors (5): Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi

Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora – counting string appearances and retrieving the enclosing documents – yet the high storage overhead hinders their application on Internet-scale data. We present Infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18$\times$) and memory use during both indexing (3.2$\times$ reduction) and querying (down to a negligible amount). We index 46TB of Internet text in 50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes). We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on Infini-gram mini indexes.

nan

Article 318

Title@2025-07-08 (2): OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

Title: OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

OLMoTrace: Tracing Language Model Outputs Zurück zu Tillionen von Trainings-Tokens

OLMoTrace:追踪语言示范产出,返回数以万计的培训单位 2504.07096v2

Authors (31): Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, Jesse Dodge

We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.

nan

Article 319

Title@2025-07-08 (2): Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS

Title: Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS

Verbesserung der Testzeitskalierung großer Sprachmodelle mit hierarchischen Retrieval-Augmented MCTS

强化高语言模型的测试时间比量,采用等级检索增强的MCTS 2507.05557v1

Authors (9): Alex ZH Dou, Zhongwei Wan, Dongfei Cui, Xin Wang, Jing Xiong, Haokun Lin, Chaofan Tao, Shen Yan, Mi Zhang

Test-time scaling has emerged as a promising paradigm in language modeling, leveraging additional computational resources at inference time to enhance model performance. In this work, we introduce R2-LLMs, a novel and versatile hierarchical retrieval-augmented reasoning framework designed to improve test-time scaling in large language models (LLMs) without requiring distillation from more advanced models to obtain chain-of-thought (CoT) training data. R2-LLMs enhances inference-time generalization by integrating dual-level retrieval-based in-context learning: (1) At the coarse level, our approach extracts abstract templates from complex reasoning problems and retrieves similar problem-answer pairs to facilitate high-level in-context learning; (2) At the fine level, during Monte Carlo Tree Search (MCTS), R2-LLMs efficiently retrieves analogous intermediate solution steps from reference mathematical problem datasets, refining step-wise reasoning with the aid of a process reward model (PRM) for scoring. R2-LLMs is a robust hierarchical reasoning-augmentation method that enhances in-context-level reasoning while seamlessly integrating with step-level tree search methods. Utilizing PRM, it refines both candidate generation and decision-making for improved reasoning accuracy. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO datasets achieve substantial relative improvement with an increase of up to 16% using LLaMA-3.1-8B compared to the baselines, showcasing the effectiveness of our approach in complex reasoning tasks.

nan

Article 320

Title@2025-07-07 (1): The Role of Deductive and Inductive Reasoning in Large Language Models

Title: The Role of Deductive and Inductive Reasoning in Large Language Models

Die Rolle der deduktiven und induktiven Vernunft in großen Sprachmodellen

大语种模式中的减量和引引原因的作用 2410.02892v3

Authors (8): Chengkun Cai, Xu Zhao, Haoliang Liu, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang, Lei Li

Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning tasks, yet their reliance on static prompt structures and limited adaptability to complex scenarios remains a significant challenge. In this paper, we propose the Deductive and InDuctive(DID) method, a novel framework that enhances LLM reasoning by dynamically integrating both deductive and inductive reasoning approaches. Drawing from cognitive science principles, DID implements a dual-metric complexity evaluation system that combines Littlestone dimension and information entropy to precisely assess task difficulty and guide decomposition strategies. DID enables the model to progressively adapt its reasoning pathways based on problem complexity, mirroring human cognitive processes. We evaluate DID’s effectiveness across multiple benchmarks, including the AIW and MR-GSM8K, as well as our custom Holiday Puzzle dataset for temporal reasoning. Our results demonstrate significant improvements in reasoning quality and solution accuracy - achieving 70.3% accuracy on AIW (compared to 62.2% for Tree of Thought) while maintaining lower computational costs. The success of DID in improving LLM performance while preserving computational efficiency suggests promising directions for developing more cognitively aligned and capable language models. Our work contributes a theoretically grounded, input-centric approach to enhancing LLM reasoning capabilities, offering an efficient alternative to traditional output-exploration methods.

nan

Article 321

Title@2025-07-07 (1): Bayesian Optimization for Controlled Image Editing via LLMs

Title: Bayesian Optimization for Controlled Image Editing via LLMs

Bayesische Optimierung für kontrollierte Bildbearbeitung über LLMs

Bayesian 通过LLMs优化控制图像编辑 2502.18116v3

Authors (9): Chengkun Cai, Haoliang Liu, Xu Zhao, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, John Lee, Jenq-Neng Hwang, Lei Li

In the rapidly evolving field of image generation, achieving precise control over generated content and maintaining semantic consistency remain significant limitations, particularly concerning grounding techniques and the necessity for model fine-tuning. To address these challenges, we propose BayesGenie, an off-the-shelf approach that integrates Large Language Models (LLMs) with Bayesian Optimization to facilitate precise and user-friendly image editing. Our method enables users to modify images through natural language descriptions without manual area marking, while preserving the original image’s semantic integrity. Unlike existing techniques that require extensive pre-training or fine-tuning, our approach demonstrates remarkable adaptability across various LLMs through its model-agnostic design. BayesGenie employs an adapted Bayesian optimization strategy to automatically refine the inference process parameters, achieving high-precision image editing with minimal user intervention. Through extensive experiments across diverse scenarios, we demonstrate that our framework significantly outperforms existing methods in both editing accuracy and semantic preservation, as validated using different LLMs including Claude3 and GPT-4.

nan

Article 322

Title@2025-07-07 (1): Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment

Title: Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment

Conversational Education at Scale: Multi-LLM Agent Workflow für verfahrenstechnisches Lernen und pädagogische Qualitätsbewertung

规模层次的相互教育:程序学习和教学质量评估的多种LLM代理工作流 2507.05528v1

Authors (6): Jiahuan Pei, Fanghua Ye, Xin Sun, Wentao Deng, Koen Hindriks, Junxiao Wang

Large language models (LLMs) have advanced virtual educators and learners, bridging NLP with AI4Education. Existing work often lacks scalability and fails to leverage diverse, large-scale course content, with limited frameworks for assessing pedagogic quality. To this end, we propose WikiHowAgent, a multi-agent workflow leveraging LLMs to simulate interactive teaching-learning conversations. It integrates teacher and learner agents, an interaction manager, and an evaluator to facilitate procedural learning and assess pedagogic quality. We introduce a dataset of 114,296 teacher-learner conversations grounded in 14,287 tutorials across 17 domains and 727 topics. Our evaluation protocol combines computational and rubric-based metrics with human judgment alignment. Results demonstrate the workflow’s effectiveness in diverse setups, offering insights into LLM capabilities across domains. Our datasets and implementations are fully open-sourced.

nan

Article 323

Title@2025-07-07 (1): Early-Exit and Instant Confidence Translation Quality Estimation

Title: Early-Exit and Instant Confidence Translation Quality Estimation

Schätzung der Qualität der Früh-Ausgangs- und Instant-Konfidenz-Übersetzung

提前出国和即时信心翻译质量估算 2502.14429v2

Authors (6): Vilém Zouhar, Maike Züfle, Beni Egressy, Julius Cheng, Mrinmaya Sachan, Jan Niehues

Quality estimation is omnipresent in machine translation, for both evaluation and generation. Unfortunately, quality estimation models are often opaque and computationally expensive, making them impractical to be part of large-scale pipelines. In this work, we tackle two connected challenges: (1) reducing the cost of quality estimation at scale, and (2) developing an inexpensive uncertainty estimation method for quality estimation. To address the latter, we introduce Instant Confidence COMET, an uncertainty-aware quality estimation model that matches the performance of previous approaches at a fraction of their costs. We extend this to Early-Exit COMET, a quality estimation model that can compute quality scores and associated confidences already at early model layers, allowing us to early-exit computations and reduce evaluation costs. We also apply our model to machine translation reranking. We combine Early-Exit COMET with an upper confidence bound bandit algorithm to find the best candidate from a large pool without having to run the full evaluation model on all candidates. In both cases (evaluation and reranking) our methods reduce the required compute by 50% with very little degradation in performance. Finally, we show how Instant Confidence COMET can be used to decide which translations a human evaluator should score rather than relying on the COMET score.

nan

Article 324

Title@2025-07-07 (1): Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality

Title: Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality

Feinkörnige Vision-Sprachenmodellierung für multimodale Trainingsassistenten in Augmented Reality

扩大现实中多模式培训助理的精美愿景-语言建模 2507.05515v1

Authors (9): Haochen Huang, Jiahuan Pei, Mohammad Aliannejadi, Xin Sun, Moonisa Ahsan, Pablo Cesar, Chuang Yu, Zhaochun Ren, Junxiao Wang

Vision-language models (VLMs) are essential for enabling AI-powered smart assistants to interpret and reason in multimodal environments. However, their application in augmented reality (AR) training remains largely unexplored. In this work, we introduce a comprehensive dataset tailored for AR training, featuring systematized vision-language tasks, and evaluate nine state-of-the-art VLMs on it. Our results reveal that even advanced models, including GPT-4o, struggle with fine-grained assembly tasks, achieving a maximum F1 score of just 40.54% on state detection. These findings highlight the demand for enhanced datasets, benchmarks, and further research to improve fine-grained vision-language alignment. Beyond technical contributions, our work has broader social implications, particularly in empowering blind and visually impaired users with equitable access to AI-driven learning opportunities. We provide all related resources, including the dataset, source code, and evaluation results, to support the research community.

nan

Article 325

Title@2025-07-07 (1): MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation

Title: MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation

MEIT: Multimodale Elektrokardiogramm-Anleitung Tuning auf großen Sprachmodellen für die Berichtserstellung

MEIT:多式电动心电图指示,关于报告生成大语言模式的图示 2403.04945v4

Authors (9): Zhongwei Wan, Che Liu, Xin Wang, Chaofan Tao, Hui Shen, Jing Xiong, Rossella Arcucci, Huaxiu Yao, Mi Zhang

Electrocardiogram (ECG) is the primary non-invasive diagnostic tool for monitoring cardiac conditions and is crucial in assisting clinicians. Recent studies have concentrated on classifying cardiac conditions using ECG data but have overlooked ECG report generation, which is time-consuming and requires clinical expertise. To automate ECG report generation and ensure its versatility, we propose the Multimodal ECG Instruction Tuning (MEIT) framework, the first attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLMs backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report, and we conduct extensive experiments to benchmark MEIT with nine open-source LLMs using more than 800,000 ECG reports. MEIT’s results underscore the superior performance of instruction-tuned LLMs, showcasing their proficiency in quality report generation, zero-shot capabilities, resilience to signal perturbation, and alignment with human expert evaluation. These findings emphasize the efficacy of MEIT and its potential for real-world clinical application.

nan

Article 326

Title@2025-07-07 (1): Towards Exception Safety Code Generation with Intermediate Representation Agents Framework

Title: Towards Exception Safety Code Generation with Intermediate Representation Agents Framework

Auf dem Weg zur Generierung von Ausnahme-Sicherheitscodes mit dem Rahmen für Mittlere Vertretungen

建立具有中间代表代理机构框架的例外安全法规生成框架 2410.06949v3

Authors (4): Xuanming Zhang, Yuxuan Chen, Yuan Yuan, Minlie Huang

Large Language Models (LLMs) often struggle with robust exception handling in generated code, leading to fragile programs that are prone to runtime errors. We propose Seeker, a novel multi-agent framework that enforces exception safety in LLM generated code through an Intermediate Representation (IR) approach. Seeker decomposes exception handling into five specialized agents: Scanner, Detector, Predator, Ranker, and Handler that collaboratively analyze code, detect fragile segments, retrieve best practice exception strategies, and inject robust handling code. We also introduce Common Exception Enumeration (CEE), a comprehensive knowledge base derived from official documentation, technical practices, and real world code, to standardize exception handling strategies. Seeker also incorporates a Deep Retrieval-Augmented Generation (Deep RAG) algorithm to efficiently navigate the exception inheritance hierarchy, cutting down search overhead by 93% while improving accuracy in identifying relevant exceptions. We evaluate Seeker on 15 open source Java projects and multiple benchmarks. Seeker outperforms state of the art baselines, improving exception handling precision by up to 37% and overall code robustness by 38% as measured by expert code review. It significantly closes the gap between LLM and human developers in exception management, achieving a 28% success rate on real world issue fixes (SWE bench) versus 19% by prior methods. Our framework preserves functional correctness of code while proactively handling errors, demonstrating a practical, generalizable solution for safer code generation. In this paper, we discuss the novelty of using intermediate representation and multi-agent collaboration for exception handling, and outline how Seeker can be extended to other programming languages and complex software engineering tasks, aligning LLM-generated code with industrial standard.

nan

Article 327

Title@2025-07-07 (1): ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding

Title: ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding

ViGiL3D: Ein sprachlich unterschiedlicher Datensatz für 3D Visual Grounding

ViGiL3D:3D视觉定位语言多样性数据集 2501.01366v2

Authors (3): Austin T. Wang, ZeMing Gong, Angel X. Chang

3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using natural language descriptions. While recent works have focused on LLM-based scaling of 3DVG datasets, these datasets do not capture the full range of potential prompts which could be specified in the English language. To ensure that we are scaling up and testing against a useful and representative set of prompts, we propose a framework for linguistically analyzing 3DVG prompts and introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns. We evaluate existing open-vocabulary 3DVG methods to demonstrate that these methods are not yet proficient in understanding and identifying the targets of more challenging, out-of-distribution prompts, toward real-world applications.

nan

Article 328

Title@2025-07-07 (1): On the Semantics of Large Language Models

Title: On the Semantics of Large Language Models

Zur Semantik großer Sprachmodelle

关于大语言模型的语义学 2507.05448v1

Authors (1): Martin Schuele

Large Language Models (LLMs) such as ChatGPT demonstrated the potential to replicate human language abilities through technology, ranging from text generation to engaging in conversations. However, it remains controversial to what extent these systems truly understand language. We examine this issue by narrowing the question down to the semantics of LLMs at the word and sentence level. By examining the inner workings of LLMs and their generated representation of language and by drawing on classical semantic theories by Frege and Russell, we get a more nuanced picture of the potential semantic capabilities of LLMs.

nan

Article 329

Title@2025-07-07 (1): PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs

Title: PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs

PhoniTale: Phonologisch begründete mnemonische Generation für typologisch entfernte Sprachpaare

Phooni-Tale: 音频分布式语言对音频隔音的声波固定成像元体生成 2507.05444v1

Authors (7): Sana Kang, Myeongseok Gwon, Su Young Kwon, Jaewook Lee, Andrew Lan, Bhiksha Raj, Rita Singh

Vocabulary acquisition poses a significant challenge for second-language (L2) learners, especially when learning typologically distant languages such as English and Korean, where phonological and structural mismatches complicate vocabulary learning. Recently, large language models (LLMs) have been used to generate keyword mnemonics by leveraging similar keywords from a learner’s first language (L1) to aid in acquiring L2 vocabulary. However, most of this research has focused on native English speakers learning other languages, rather than the reverse. In this paper, we present PhoniTale, a novel cross-lingual mnemonic generation system that retrieves L1 keyword sequence based on phonological similarity and uses LLMs to generate mnemonics. We evaluate PhoniTale using both automated metrics and human evaluations, comparing its output to mnemonics created by humans and by previous automated approaches. To assess practical effectiveness, we also conduct a short-term recall test measuring mnemonic helpfulness. Our findings show that PhoniTale performs comparably to human-authored mnemonics. We also highlight key areas for future improvement in mnemonic quality and methodology.

nan

Article 330

Title@2025-07-07 (1): Gendered Divides in Online Discussions about Reproductive Rights

Title: Gendered Divides in Online Discussions about Reproductive Rights

Geschlechtsspezifische Unterschiede in Online-Diskussionen über reproduktive Rechte

关于生殖权利的在线讨论中的两性差距 2507.05443v1

Authors (3): Ashwin Rao, Sze Yuh Nina Wang, Kristina Lerman

The U.S. Supreme Court’s 2022 ruling in Dobbs v. Jackson Women’s Health Organization marked a turning point in the national debate over reproductive rights. While the ideological divide over abortion is well documented, less is known about how gender and local sociopolitical contexts interact to shape public discourse. Drawing on nearly 10 million abortion-related posts on X (formerly Twitter) from users with inferred gender, ideology and location, we show that gender significantly moderates abortion attitudes and emotional expression, particularly in conservative regions, and independently of ideology. This creates a gender gap in abortion attitudes that grows more pronounced in conservative regions. The leak of the Dobbs draft opinion further intensified online engagement, disproportionately mobilizing pro-abortion women in areas where access was under threat. These findings reveal that abortion discourse is not only ideologically polarized but also deeply structured by gender and place, highlighting the central role of identity in shaping political expression during moments of institutional disruption.

nan

Article 331

Title@2025-07-07 (1): DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning

Title: DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning

DeepRetro: Retrosynthetische Pathway Discovery mit iterativem LLM Reasoning

使用迭代LM 理由解释的回溯合成途径发现 2507.07060v1

Authors (9): Shreyas Vinaya Sathyanarayana, Rahil Shah, Sharanabasava D. Hiremath, Rishikesh Panda, Rahul Jana, Riya Singh, Rida Irfan, Ashwin Murali, Bharath Ramsundar

Retrosynthesis, the identification of precursor molecules for a target compound, is pivotal for synthesizing complex molecules, but faces challenges in discovering novel pathways beyond predefined templates. Recent large language model (LLM) approaches to retrosynthesis have shown promise but effectively harnessing LLM reasoning capabilities for effective multi-step planning remains an open question. To address this challenge, we introduce DeepRetro, an open-source, iterative, hybrid LLM-based retrosynthetic framework. Our approach integrates the strengths of conventional template-based/Monte Carlo tree search tools with the generative power of LLMs in a step-wise, feedback-driven loop. Initially, synthesis planning is attempted with a template-based engine. If this fails, the LLM subsequently proposes single-step retrosynthetic disconnections. Crucially, these suggestions undergo rigorous validity, stability, and hallucination checks before the resulting precursors are recursively fed back into the pipeline for further evaluation. This iterative refinement allows for dynamic pathway exploration and correction. We demonstrate the potential of this pipeline through benchmark evaluations and case studies, showcasing its ability to identify viable and potentially novel retrosynthetic routes. In particular, we develop an interactive graphical user interface that allows expert human chemists to provide human-in-the-loop feedback to the reasoning algorithm. This approach successfully generates novel pathways for complex natural product compounds, demonstrating the potential for iterative LLM reasoning to advance state-of-art in complex chemical syntheses.

nan

Article 332

Title@2025-07-07 (1): LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

Title: LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

LLM Hypnose: Nutzung des Benutzerfeedbacks für unautorisierte Wissensinjektion für alle Benutzer

LLM Hypnisis:利用用户反馈,为所有用户提供未经授权知识注射 2507.02850v2

Authors (4): Almog Hilel, Idan Shenfeld, Jacob Andreas, Leshem Choshen

We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a “poisoned” or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of language model preference tuning (showing that it even highly restricted forms of preference data can be used to exert fine-grained control over behavior), and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).

nan

Article 333

Title@2025-07-07 (1): “Lost-in-the-Later”: Framework for Quantifying Contextual Grounding in Large Language Models

Title: “Lost-in-the-Later”: Framework for Quantifying Contextual Grounding in Large Language Models

“Lost-in-the-Later”: Rahmen für die Quantifizierung kontextueller Erdung in großen Sprachmodellen

“Lost-in-Later”:“Lost-in-Later”:大语言模型中量化背景依据框架 2507.05424v1

Authors (4): Yufei Tao, Adam Hiatt, Rahul Seetharaman, Ameeta Agrawal

Large language models are capable of leveraging both contextual and parametric knowledge but how they prioritize and integrate these sources remains underexplored. We introduce CoPE, a novel evaluation framework that systematically measures contextual knowledge (CK) and parametric knowledge (PK) across models and languages. Using our MultiWikiAtomic dataset in English, Spanish, and Danish, we analyze how large language models (LLMs) integrate context, prioritize information, and incorporate PK in open-ended question answering. Our analysis uncovers a phenomenon we call lost-in-the-later, where LLMs tend to overlook or deprioritize information that appears later in a given context, revealing a strong positional bias that affects contextual grounding. We further find that reasoning models, as well as non-reasoning models prompted with chain-of-thought (CoT), use context even less than non-reasoning models without CoT and fail to mitigate the lost-in-the-later effect. CoT prompting, in particular, results in lower recall and shorter responses, leading to degraded contextual grounding. Based on these insights, we design prompt-based methods to effectively leverage input context. A case study applying CoPE to summarization demonstrates that CK-informed prompting improves factual grounding and reduces hallucination.

nan

Article 334

Title@2025-07-07 (1): Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning

Title: Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning

Global lernen, lokal sprechen: Die Lücken in der Mehrsprachigkeit überwinden

《全球学习,本地语言:弥合多种语文理由方面的差距》 2507.05418v1

Authors (8): Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, Paul Pu Liang

Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. Current multilingual benchmarks focus only on final answers, overlooking whether models actually reason in the target language. To address this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. We further propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning with a language-consistency reward to align reasoning with the input language. Finally, we develop an automatic evaluation protocol using LLM-as-a-judge to assess answer correctness and the quality and language consistency of reasoning traces, enabling nuanced and scalable analysis beyond surface-level metrics. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity, demonstrating that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. https://jd730.github.io/projects/GeoFact-X_BRIDGE

nan

Article 335

Title@2025-07-07 (1): What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning

Title: What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning

Was würdest du fragen, wenn du zuerst $a^2+b^2=c^2$ sahst? LLM auf Kuriosität-getriebene Fragestellung bewerten

当你第一次看到$2+b2=c2$时你会问什么? 2409.17172v2

Authors (2): Shashidhar Reddy Javaji, Zining Zhu

Large language models (LLMs) can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown. We propose a novel evaluation framework that evaluates this capability. This framework prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person when facing the statement for the first time. We score the qualities of the generated questions, thereby evaluating the knowledge acquisition potential of the LLM. We apply controlled ablation studies to validate our scoring procedures. Additionally, we created a synthetic dataset consisting of 1101 statements in physics, chemistry, and maths with distinct levels of difficulties, 300 general knowledge statements, and 567 incorrect statements. Human evaluations were conducted to validate our model assessments, achieving an approximate weighted Cohen’s kappa of 0.7 on all three metrics considered. We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model’s knowledge acquisition potential. The proposed framework quantifies a critical model capability that was commonly overlooked and opens up research opportunities for developing more knowledgeable AI systems

nan

Article 336

Title@2025-07-07 (1): Tractable Transformers for Flexible Conditional Generation

Title: Tractable Transformers for Flexible Conditional Generation

Tractable Transformer für flexible Bedingte Generation

灵活条件一代的可变变变换器 2502.07616v2

Authors (6): Anji Liu, Xuejie Liu, Dayuan Zhao, Mathias Niepert, Yitao Liang, Guy Van den Broeck

Non-autoregressive (NAR) generative models are valuable because they can handle diverse conditional generation tasks in a more principled way than their autoregressive (AR) counterparts, which are constrained by sequential dependency requirements. Recent advancements in NAR models, such as diffusion language models, have demonstrated superior performance in unconditional generation compared to AR models (e.g., GPTs) of similar sizes. However, such improvements do not always lead to improved conditional generation performance. We show that a key reason for this gap is the difficulty in generalizing to conditional probability queries (i.e., the set of unknown variables) unseen during training. As a result, strong unconditional generation performance does not guarantee high-quality conditional generation. This paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that is more robust to different conditional generation tasks. Unlike existing models that rely solely on global contextual features derived from full inputs, Tracformers incorporate a sparse Transformer encoder to capture both local and global contextual information. This information is routed through a decoder for conditional generation. Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines.

nan

Article 337

Title@2025-07-07 (1): Evaluation of OpenAI o1: Opportunities and Challenges of AGI

Title: Evaluation of OpenAI o1: Opportunities and Challenges of AGI

Bewertung von OpenAI o1: Chancen und Herausforderungen der AGI

评价开放AI OO1:亚洲地理信息协会的机遇和挑战 2409.18486v2

Authors (75): Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, Chao Cao, Hanqi Jiang, Hanxu Chen, Yiwei Li, Junhao Chen, Huawen Hu, Yiheng Liu, Huaqin Zhao, Shaochen Xu, Haixing Dai, Lin Zhao, Ruidong Zhang, Wei Zhao, Zhenyuan Yang, Jingyuan Chen, Peilong Wang, Wei Ruan, Hui Wang, Huan Zhao, Jing Zhang, Yiming Ren, Shihuan Qin, Tong Chen, Jiaxi Li, Arif Hassan Zidan, Afrar Jahin, Minheng Chen, Sichen Xia, Jason Holmes, Yan Zhuang, Jiaqi Wang, Bochen Xu, Weiran Xia, Jichao Yu, Kaibo Tang, Yaxuan Yang, Bolun Sun, Tao Yang, Guoyu Lu, Xianqiao Wang, Lilong Chai, He Li, Jin Lu, Xin Zhang, Bao Ge, Xintao Hu, Lian Zhang, Hua Zhou, Lu Zhang, Shu Zhang, Zhen Xiang, Yudan Ren, Jun Liu, Xi Jiang, Yu Bao, Wei Zhang, Xiang Li, Gang Li, Wei Liu, Dinggang Shen, Andrea Sikora, Xiaoming Zhai, Dajiang Zhu, Tuo Zhang, Tianming Liu

This comprehensive study evaluates the performance of OpenAI’s o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include: -83.3% success rate in solving complex competitive programming problems, surpassing many human experts. -Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models. -100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions. -Advanced natural language inference capabilities across general and specialized domains like medicine. -Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis. -Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields. -Strong capabilities in quantitative investing. O1 has comprehensive financial knowledge and statistical modeling skills. -Effective performance in social media analysis, including sentiment analysis and emotion recognition. The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.

nan

Article 338

Title@2025-07-07 (1): Embedding-Based Approaches to Hyperpartisan News Detection

Title: Embedding-Based Approaches to Hyperpartisan News Detection

Einbetten-basierte Ansätze zu Hyperparteien-Nachrichten-Erkennung

以嵌入式方法探测超党派新闻 2501.01370v3

Authors (1): Karthik Mohan

In this report, I describe the systems in which the objective is to determine whether a given news article could be considered as hyperpartisan. Hyperpartisan news takes an extremely polarized political standpoint with an intention of creating political divide among the public. Several approaches, including n-grams, sentiment analysis, as well as sentence and document representations using pre-tained ELMo models were used. The best system is using LLMs for embedding generation achieving an accuracy of around 92% over the previously best system using pre-trained ELMo with Bidirectional LSTM which achieved an accuracy of around 83% through 10-fold cross-validation.

nan

Article 339

Title@2025-07-07 (1): One fish, two fish, but not the whole sea: Alignment reduces language models’ conceptual diversity

Title: One fish, two fish, but not the whole sea: Alignment reduces language models’ conceptual diversity

Ein Fisch, zwei Fische, aber nicht das ganze Meer: Ausrichtung reduziert die konzeptionelle Vielfalt der Sprachmodelle

一、二、一、二鱼,但不是整个海洋:对齐减少了语言模式的概念多样性 2411.04427v3

Authors (3): Sonia K. Murthy, Tomer Ullman, Jennifer Hu

Researchers in social science and psychology have recently proposed using large language models (LLMs) as replacements for humans in behavioral research. In addition to arguments about whether LLMs accurately capture population-level patterns, this has raised questions about whether LLMs capture human-like conceptual diversity. Separately, it is debated whether post-training alignment (RLHF or RLAIF) affects models’ internal diversity. Inspired by human studies, we use a new way of measuring the conceptual diversity of synthetically-generated LLM “populations” by relating the internal variability of simulated individuals to the population-level variability. We use this approach to evaluate non-aligned and aligned LLMs on two domains with rich human behavioral data. While no model reaches human-like diversity, aligned models generally display less diversity than their instruction fine-tuned counterparts. Our findings highlight potential trade-offs between increasing models’ value alignment and decreasing the diversity of their conceptual representations.

nan

Article 340

Title: Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences

Kontrolle dessen, was Sie teilen: Bewertung von Sprachmodellen Einhaltung von Datenschutzeinstellungen

控制您共享的共享:评估语言模式遵守隐私优惠 2507.05391v1

Authors (3): Guillem Ramírez, Alexandra Birch, Ivan Titov

Large language models (LLMs) are primarily accessed via commercial APIs, but this often requires users to expose their data to service providers. In this paper, we explore how users can stay in control of their data by using privacy profiles: simple natural language instructions that say what should and should not be revealed. We build a framework where a local model uses these instructions to rewrite queries, only hiding details deemed sensitive by the user, before sending them to an external model, thus balancing privacy with performance. To support this research, we introduce PEEP, a multilingual dataset of real user queries annotated to mark private content and paired with synthetic privacy profiles. Our experiments with lightweight LLMs show they can follow these instructions to some extent, but also face consistent challenges, highlighting the need for models that better understand and comply with user-defined privacy preferences.

nan

Article 341

Title@2025-07-07 (1): The Generalization Ridge: Information Flow in Natural Language Generation

Title: The Generalization Ridge: Information Flow in Natural Language Generation

The Generalization Ridge: Informationsfluss in der natürlichen Sprachgenerierung

通用海脊:自然语言生成的信息流动 2507.05387v1

Authors (3): Ruidi Chang, Chunyuan Deng, Hanjie Chen

Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG) tasks, yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. To address this gap, we propose InfoRidge, an information-theoretic framework, to characterize how predictive information-the mutual information between hidden representations and target outputs-varies across depth. Estimating this quantity enables us to trace the flow of task-relevant information throughout the model during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in upper-middle layers-forming a generalization ridge-before declining in final layers, reflecting a transition between generalization and memorization. To further investigate this phenomenon, we introduce residual scaling coefficients-trainable scalar parameters applied to each residual block-which serve as functional probes for assessing the relative importance of individual transformer layers. These coefficients reveal that, under distribution shift, models downweight final layers and increasingly rely on ridge layers, highlighting their role in generalization. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.

nan

Article 342

Title@2025-07-07 (1): Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training

Title: Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training

Verstärkte Feinsteuerung mildert natürlich das Vergessen im kontinuierlichen Post-Training

长期培训后忘却的自然缓解措施 2507.05386v1

Authors (13): Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu

Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieve performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model’s general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis shows that explicit mechanisms, such as KL penalty and chain-of-thought reasoning, are not the primary factors. Instead, we find that the implicit regularization inherent to RFT is a key factor in mitigating forgetting. Finally, we propose a rollout-based instance filtering algorithm to improve the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.

nan

Article 343

Title@2025-07-07 (1): On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study

Title: On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study

Über die Bias von Next-Token Predictors hin zu systematisch ineffizienter Begründung: Eine Kurzeste-Path-Fallstudie

关于对系统、低效率理性的 “ 下肯预测家 “ 的偏见:最短篇案例研究 2507.05362v1

Authors (4): Riccardo Alberghi, Elizaveta Demyanenko, Luca Biggio, Luca Saglietti

Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone-injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model’s confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.

nan

Article 344

Title@2025-07-07 (1): LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks

Title: LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks

LoRA-Augmented Generation (LAG) für wissensintensive Sprachaufgaben

LoRA - 知识密集型语言任务支持代(LAG) 2507.05346v1

Authors (2): William Fleshman, Benjamin Van Durme

The proliferation of fine-tuned language model experts for specific tasks and domains signals the need for efficient selection and combination methods. We propose LoRA-Augmented Generation (LAG) for leveraging large libraries of knowledge and task-specific LoRA adapters. LAG requires no additional training or access to data, and efficiently filters, retrieves, and applies experts on a per-token and layer basis. We evaluate LAG on various knowledge-intensive tasks, achieving superior performance over existing data-free methods. We explore scenarios where additional data is available, demonstrating LAG’s compatibility with alternative solutions such as retrieval-augmented generation (RAG).

nan

Article 345

Title@2025-07-07 (1): Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Title: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Speicher in LLM-Agenten über inkrementelle Multi-Turn-Interaktionen bewerten

通过递增多发互动评估LLM代理物的记忆 2507.05257v1

Authors (3): Yuanzhe Hu, Yu Wang, Julian McAuley

Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component-memory, encompassing how agents memorize, update, and retrieve long-term information-is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Furthermore, no existing benchmarks cover all four competencies. Therefore, we introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark combines reformulated existing datasets with newly constructed ones, covering the above four memory competencies, providing a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.

nan

Article 346

Title@2025-07-07 (1): Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Title: Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Open Vision Reasoner: Übertragen sprachliches Kognitives Verhalten für visuelles Reasoning

开放愿景理由:转移视觉理由的语言认知行为 2507.05255v1

Authors (18): Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel

The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.

nan

Article 347

Title@2025-07-07 (1): Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?

Title: Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?

Den Richtern zu urteilen: Können große Visions-Sprachmodelle Chartverständnis und Vernunft angemessen bewerten?

判断法官:大视野语言模型能否公正地评估图表的解读和理由? 2505.08468v2

Authors (10): Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang

Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation processes, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria like factual correctness, informativeness, and relevancy. Additionally, we analyze LVLM judges based on format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (<10B parameters) suitable for both research and commercial use, following a standardized evaluation protocol and rubric to measure the LVLM judge’s accuracy. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below ~10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist.

nan

Article 348

Title@2025-07-07 (1): Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Title: Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Response Attack: Exploiting Contextual Priming to Jailbreak Große Sprachmodelle

回应攻击:利用环境原始化作为破狱大型语言模式 2507.05248v1

Authors (6): Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao

Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in the dialogue can steer its subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack, which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. They are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.

nan

Article 349

Title@2025-07-07 (1): When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Title: When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Wenn Gedankenkette notwendig ist, kämpfen Sprachmodelle um Evade Monitore

当思维链是必要的时候,语言模型与Evade Monitor争吵 2507.05246v1

Authors (8): Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah

While chain-of-thought (CoT) monitoring is an appealing AI safety defense, recent work on “unfaithfulness” has cast doubt on its reliability. These findings highlight an important failure mode, particularly when CoT acts as a post-hoc rationalization in applications like auditing for bias. However, for the distinct problem of runtime monitoring to prevent severe harm, we argue the key property is not faithfulness but monitorability. To this end, we introduce a conceptual framework distinguishing CoT-as-rationalization from CoT-as-computation. We expect that certain classes of severe harm will require complex, multi-step reasoning that necessitates CoT-as-computation. Replicating the experimental setups of prior work, we increase the difficulty of the bad behavior to enforce this necessity condition; this forces the model to expose its reasoning, making it monitorable. We then present methodology guidelines to stress-test CoT monitoring against deliberate evasion. Applying these guidelines, we find that models can learn to obscure their intentions, but only when given significant help, such as detailed human-written strategies or iterative optimization against the monitor. We conclude that, while not infallible, CoT monitoring offers a substantial layer of defense that requires active protection and continued stress-testing.

nan

Article 350

Title@2025-07-07 (1): MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents

Title: MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents

MindFlow: Revolutionierung des E-Commerce-Kundensupports mit multimodalen LLM-Agenten

MindFlow:使电子商务客户支持与多式联运LLM代理商实现革命化 2507.05330v1

Authors (7): Ming Gong, Xucheng Huang, Chenghan Yang, Xianhan Peng, Haoxin Wang, Yang Liu, Ling Jiang

Recent advances in large language models (LLMs) have enabled new applications in e-commerce customer service. However, their capabilities remain constrained in complex, multimodal scenarios. We present MindFlow, the first open-source multimodal LLM agent tailored for e-commerce. Built on the CoALA framework, it integrates memory, decision-making, and action modules, and adopts a modular “MLLM-as-Tool” strategy for effect visual-textual reasoning. Evaluated via online A/B testing and simulation-based ablation, MindFlow demonstrates substantial gains in handling complex queries, improving user satisfaction, and reducing operational costs, with a 93.53% relative improvement observed in real-world deployments.

nan

Article 351

Title@2025-07-07 (1): Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Title: Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Late Chunking: Kontextuelle Chunk-Embeddings mit Langkontext-Embedding-Modellen

较晚的板块: 使用长文嵌入模型的背景块嵌入式嵌入式 2409.04701v3

Authors (5): Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, Han Xiao

Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called late chunking, which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling - hence the term late in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.

nan

Article 352

Title@2025-07-07 (1): OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Title: OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

OPeRA: Ein Datensatz von Beobachtung, Persona, Ratationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation

OPERA: 人类在线购物行为模拟观察、人、理由和评估LMLLMs的数据集 2506.05606v3

Authors (16): Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang

Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating ``believable’’ human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures: user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user’s next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for human.

nan

Article 353

Title@2025-07-07 (1): Logit Reweighting for Topic-Focused Summarization

Title: Logit Reweighting for Topic-Focused Summarization

Logit-Umgewichtung für thematisch fokussierte Zusammenfassung

用于专题用途汇总的对数重新加权 2507.05235v1

Authors (3): Joschka Braun, Bálint Mucsányi, Seyed Ali Bahrainian

Generating abstractive summaries that adhere to a specific topic remains a significant challenge for language models. While standard approaches, such as fine-tuning, are resource-intensive, simpler methods like prompt engineering often struggle to maintain topical focus, particularly with smaller models. To address this, we propose a lightweight method that enhances topical relevance by directly reweighting the logits of topic-relevant tokens during generation. We evaluate three such reweighting techniques: Constant Shift, which adds a constant value to logits; Factor Scaling, which multiplies them by a factor; and Threshold Selection, which selectively boosts logits that exceed a probability threshold. Experiments on the NEWTS topical summarization dataset, using both Gemma-2B and Llama-3-8B models, show that these techniques effectively increase the use of topic-relevant vocabulary. Notably, the Threshold Selection method successfully improves topical focus without compromising summary quality-a trade-off often seen in other approaches. Our findings demonstrate that directly reweighting logits is a practical and resource-efficient alternative to fine-tuning, offering a promising pathway for precisely controlling the thematic content of generated text.

nan

Article 354

Title@2025-07-07 (1): The Super Weight in Large Language Models

Title: The Super Weight in Large Language Models

Das Supergewicht in großen Sprachmodellen

大语言模型中的超重 2411.07191v2

Authors (5): Mengxia Yu, De Wang, Qi Shan, Colorado J Reed, Alvin Wan

Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: Pruning as few as a single parameter can destroy an LLM’s ability to generate text – increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing. We propose a data-free method for identifying such parameters, termed super weights, using a single forward pass through the model. We additionally find that these super weights induce correspondingly rare and large activation outliers, termed super activations. When preserved with high precision, super activations can improve simple round-to-nearest quantization to become competitive with state-of-the-art methods. For weight quantization, we similarly find that by preserving the super weight and clipping other weight outliers, round-to-nearest quantization can scale to much larger block sizes than previously considered. To facilitate further research into super weights, we provide an index of super weight coordinates for common, openly available LLMs.

nan

Article 355

Title@2025-07-07 (1): jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

Title: jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

jina-embeddings-v4: Universelle Einbettungen für multimodale Mehrsprachigkeit

jina-embeddings-v4:多语种多式联运回收通用嵌入式 2506.18902v3

Authors (11): Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao

We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-document retrieval, semantic text similarity, and code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.

nan

Article 356

Title@2025-07-07 (1): Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals

Title: Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals

Erweiterte Induktive Begründung für personalisierte Präferenz-Inferenz von Verhaltenssignalen

从行为信号中推论 2505.18071v2

Authors (4): Jia-Nan Li, Jian Guan, Wei Wu, Rui Yan

Large language models (LLMs) have demonstrated significant success in complex reasoning tasks such as math and coding. In contrast to these tasks where deductive reasoning predominates, inductive reasoning-the ability to derive general rules from incomplete evidence, remains underexplored. This paper investigates extended inductive reasoning in LLMs through the lens of personalized preference inference, a critical challenge in LLM alignment where current approaches struggle to capture diverse user preferences. The task demands strong inductive reasoning capabilities as user preferences are typically embedded implicitly across various interaction forms, requiring models to synthesize consistent preference patterns from scattered signals. We propose AlignXplore, a model that leverages extended reasoning chains to enable systematic preference inference from behavioral signals in users’ interaction histories. Such explicit preference articulation enables efficient streaming inference: when new behavioral signals emerge, the model can directly build upon previously inferred preference descriptions rather than reprocessing historical signals from scratch, while also supporting iterative refinement to the inferred preferences. We develop AlignXplore by combining cold-start training based on synthetic data with subsequent online reinforcement learning. Through extensive experiments, we demonstrate that AlignXplore achieves substantial improvements over the backbone model by an average of 15.49\% on in-domain and out-of-domain benchmarks, while maintaining strong generalization ability across different input formats and downstream models. Further analyses establish best practices for preference inference learning through systematic comparison of reward modeling strategies, while revealing the emergence of human-like inductive reasoning patterns during training.

nan

Article 357

Title@2025-07-07 (1): Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Title: Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Zwillinge 2.5: Das Frontier mit fortschrittlicher Vernunft, Multimodalität, langem Kontext und Agentischen Fähigkeiten der nächsten Generation schieben

Gemini 2.5: 推进先进理性、多模式、长处和下一代的前沿 2507.06261v1

Authors (3303): Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Blaž Bratanič, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Pavetić, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. Calian, Muhuan Huang, Aäron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang, Zhe Chen, Xingyu Wang, Jessica Austin, Peter de Boursac, Evgeny Sluzhaev, Frank Ding, Huijian Li, Surya Bhupatiraju, Mohit Agarwal, Sławek Kwasiborski, Paramjit Sandhu, Patrick Siegler, Ahmet Iscen, Eyal Ben-David, Shiraz Butt, Miltos Allamanis, Seth Benjamin, Robert Busa-Fekete, Felix Hernandez-Campos, Sasha Goldshtein, Matt Dibb, Weiyang Zhang, Annie Marsden, Carey Radebaugh, Stephen Roller, Abhishek Nayyar, Jacob Austin, Tayfun Terzi, Bhargav Kanagal Shamanna, Pete Shaw, Aayush Singh, Florian Luisier, Artur Mendonça, Vaibhav Aggarwal, Larisa Markeeva, Claudio Fantacci, Sergey Brin, HyunJeong Choe, Guanyu Wang, Hartwig Adam, Avigail Dabush, Tatsuya Kiyono, Eyal Marcus, Jeremy Cole, Theophane Weber, Hongrae Lee, Ronny Huang, Alex Muzio, Leandro Kieliger, Maigo Le, Courtney Biles, Long Le, Archit Sharma, Chengrun Yang, Avery Lamp, Dave Dopson, Nate Hurley, Katrina, Xu, Zhihao Shan, Shuang Song, Jiewen Tan, Alexandre Senges, George Zhang, Chong You, Yennie Jun, David Raposo, Susanna Ricco, Xuan Yang, Weijie Chen, Prakhar Gupta, Arthur Szlam, Kevin Villela, Chun-Sung Ferng, Daniel Kasenberg, Chen Liang, Rui Zhu, Arunachalam Narayanaswamy, Florence Perot, Paul Pucciarelli, Anna Shekhawat, Alexey Stern, Rishikesh Ingale, Stefani Karp, Sanaz Bahargam, Adrian Goedeckemeyer, Jie Han, Sicheng Li, Andrea Tacchetti, Dian Yu, Abhishek Chakladar, Zhiying Zhang, Mona El Mahdy, Xu Gao, Dale Johnson, Samrat Phatale, AJ Piergiovanni, Hyeontaek Lim, Clement Farabet, Carl Lebsack, Theo Guidroz, John Blitzer, Nico Duduta, David Madras, Steve Li, Daniel von Dincklage, Xin Li, Mahdis Mahdieh, George Tucker, Ganesh Jawahar, Owen Xiao, Danny Tarlow, Robert Geirhos, Noam Velan, Daniel Vlasic, Kalesha Bullard, SK Park, Nishesh Gupta, Kellie Webster, Ayal Hitron, Jieming Mao, Julian Eisenschlos, Laurel Prince, Nina D’Souza, Kelvin Zheng, Sara Nasso, Gabriela Botea, Carl Doersch, Caglar Unlu, Chris Alberti, Alexey Svyatkovskiy, Ankita Goel, Krzysztof Choromanski, Pan-Pan Jiang, Richard Nguyen, Four Flynn, Daria Ćurko, Peter Chen, Nicholas Roth, Kieran Milan, Caleb Habtegebriel, Shashi Narayan, Michael Moffitt, Jake Marcus, Thomas Anthony, Brendan McMahan, Gowoon Cheon, Ruibo Liu, Megan Barnes, Lukasz Lew, Rebeca Santamaria-Fernandez, Mayank Upadhyay, Arjun Akula, Arnar Mar Hrafnkelsson, Alvaro Caceres, Andrew Bunner, Michal Sokolik, Subha Puttagunta, Lawrence Moore, Berivan Isik, Weilun Chen, Jay Hartford, Lawrence Chan, Pradeep Shenoy, Dan Holtmann-Rice, Jane Park, Fabio Viola, Alex Salcianu, Sujeevan Rajayogam, Ian Stewart-Binks, Zelin Wu, Richard Everett, Xi Xiong, Pierre-Antoine Manzagol, Gary Leung, Carl Saroufim, Bo Pang, Dawid Wegner, George Papamakarios, Jennimaria Palomaki, Helena Pankov, Guangda Lai, Guilherme Tubone, Shubin Zhao, Theofilos Strinopoulos, Seth Neel, Mingqiu Wang, Joe Kelley, Li Li, Pingmei Xu, Anitha Vijayakumar, Andrea D’olimpio, Omer Levy, Massimo Nicosia, Grigory Rozhdestvenskiy, Ni Lao, Sirui Xie, Yash Katariya, Jon Simon, Sanjiv Kumar, Florian Hartmann, Michael Kilgore, Jinhyuk Lee, Aroma Mahendru, Roman Ring, Tom Hennigan, Fiona Lang, Colin Cherry, David Steiner, Dawsen Hwang, Ray Smith, Pidong Wang, Jeremy Chen, Ming-Hsuan Yang, Sam Kwei, Philippe Schlattner, Donnie Kim, Ganesh Poomal Girirajan, Nikola Momchev, Ayushi Agarwal, Xingyi Zhou, Ilkin Safarli, Zachary Garrett, AJ Pierigiovanni, Sarthak Jauhari, Alif Raditya Rochman, Shikhar Vashishth, Quan Yuan, Christof Angermueller, Jon Blanton, Xinying Song, Nitesh Bharadwaj Gundavarapu, Thi Avrahami, Maxine Deines, Subhrajit Roy, Manish Gupta, Christopher Semturs, Shobha Vasudevan, Aditya Srikanth Veerubhotla, Shriya Sharma, Josh Jacob, Zhen Yang, Andreas Terzis, Dan Karliner, Auriel Wright, Tania Rojas-Esponda, Ashley Brown, Abhijit Guha Roy, Pawan Dogra, Andrei Kapishnikov, Peter Young, Wendy Kan, Vinodh Kumar Rajendran, Maria Ivanova, Salil Deshmukh, Chia-Hua Ho, Mike Kwong, Stav Ginzburg, Annie Louis, KP Sawhney, Slav Petrov, Jing Xie, Yunfei Bai, Georgi Stoyanov, Alex Fabrikant, Rajesh Jayaram, Yuqi Li, Joe Heyward, Justin Gilmer, Yaqing Wang, Radu Soricut, Luyang Liu, Qingnan Duan, Jamie Hayes, Maura O’Brien, Gaurav Singh Tomar, Sivan Eiger, Bahar Fatemi, Jeffrey Hui, Catarina Barros, Adaeze Chukwuka, Alena Butryna, Saksham Thakur, Austin Huang, Zhufeng Pan, Haotian Tang, Serkan Cabi, Tulsee Doshi, Michiel Bakker, Sumit Bagri, Ruy Ley-Wild, Adam Lelkes, Jennie Lees, Patrick Kane, David Greene, Shimu Wu, Jörg Bornschein, Gabriela Surita, Sarah Hodkinson, Fangtao Li, Chris Hidey, Sébastien Pereira, Sean Ammirati, Phillip Lippe, Adam Kraft, Pu Han, Sebastian Gerlach, Zifeng Wang, Liviu Panait, Feng Han, Brian Farris, Yingying Bi, Hannah DeBalsi, Miaosen Wang, Gladys Tyen, James Cohan, Susan Zhang, Jarred Barber, Da-Woon Chung, Jaeyoun Kim, Markus Kunesch, Steven Pecht, Nami Akazawa, Abe Friesen, James Lyon, Ali Eslami, Junru Wu, Jie Tan, Yue Song, Ravi Kumar, Chris Welty, Ilia Akolzin, Gena Gibson, Sean Augenstein, Arjun Pillai, Nancy Yuen, Du Phan, Xin Wang, Iain Barr, Heiga Zen, Nan Hua, Casper Liu, Jilei, Wang, Tanuj Bhatia, Hao Xu, Oded Elyada, Pushmeet Kohli, Mirek Olšák, Ke Chen, Azalia Mirhoseini, Noam Shazeer, Shoshana Jakobovits, Maggie Tran, Nolan Ramsden, Tarun Bharti, Fred Alcober, Yunjie Li, Shilpa Shetty, Jing Chen, Dmitry Kalashnikov, Megha Nawhal, Sercan Arik, Hanwen Chen, Michiel Blokzijl, Shubham Gupta, James Rubin, Rigel Swavely, Sophie Bridgers, Ian Gemp, Chen Su, Arun Suggala, Juliette Pluto, Mary Cassin, Alain Vaucher, Kaiyang Ji, Jiahao Cai, Andrew Audibert, Animesh Sinha, David Tian, Efrat Farkash, Amy Hua, Jilin Chen, Duc-Hieu Tran, Edward Loper, Nicole Brichtova, Lara McConnaughey, Ballie Sandhu, Robert Leland, Doug DeCarlo, Andrew Over, James Huang, Xing Wu, Connie Fan, Eric Li, Yun Lei, Deepak Sharma, Cosmin Paduraru, Luo Yu, Matko Bošnjak, Phuong Dao, Min Choi, Sneha Kudugunta, Jakub Adamek, Carlos Guía, Ali Khodaei, Jie Feng, Wenjun Zeng, David Welling, Sandeep Tata, Christina Butterfield, Andrey Vlasov, Seliem El-Sayed, Swaroop Mishra, Tara Sainath, Shentao Yang, RJ Skerry-Ryan, Jeremy Shar, Robert Berry, Arunkumar Rajendran, Arun Kandoor, Andrea Burns, Deepali Jain, Tom Stone, Wonpyo Park, Shibo Wang, Albin Cassirer, Guohui Wang, Hayato Kobayashi, Sergey Rogulenko, Vineetha Govindaraj, Mikołaj Rybiński, Nadav Olmert, Colin Evans, Po-Sen Huang, Kelvin Xu, Premal Shah, Terry Thurk, Caitlin Sikora, Mu Cai, Jin Xie, Elahe Dabir, Saloni Shah, Norbert Kalb, Carrie Zhang, Shruthi Prabhakara, Amit Sabne, Artiom Myaskovsky, Vikas Raunak, Blanca Huergo, Behnam Neyshabur, Jon Clark, Ye Zhang, Shankar Krishnan, Eden Cohen, Dinesh Tewari, James Lottes, Yumeya Yamamori, Hui, Li, Mohamed Elhawaty, Ada Maksutaj Oflazer, Adrià Recasens, Sheryl Luo, Duy Nguyen, Taylor Bos, Kalyan Andra, Ana Salazar, Ed Chi, Jeongwoo Ko, Matt Ginsberg, Anders Andreassen, Anian Ruoss, Todor Davchev, Elnaz Davoodi, Chenxi Liu, Min Kim, Santiago Ontanon, Chi Ming To, Dawei Jia, Rosemary Ke, Jing Wang, Anna Korsun, Moran Ambar, Ilya Kornakov, Irene Giannoumis, Toni Creswell, Denny Zhou, Yi Su, Ishaan Watts, Aleksandr Zaks, Evgenii Eltyshev, Ziqiang Feng, Sidharth Mudgal, Alex Kaskasoli, Juliette Love, Kingshuk Dasgupta, Sam Shleifer, Richard Green, Sungyong Seo, Chansoo Lee, Dale Webster, Prakash Shroff, Ganna Raboshchuk, Isabel Leal, James Manyika, Sofia Erell, Daniel Murphy, Zhisheng Xiao, Anton Bulyenov, Julian Walker, Mark Collier, Matej Kastelic, Nelson George, Sushant Prakash, Sailesh Sidhwani, Alexey Frolov, Steven Hansen, Petko Georgiev, Tiberiu Sosea, Chris Apps, Aishwarya Kamath, David Reid, Emma Cooney, Charlotte Magister, Oriana Riva, Alec Go, Pu-Chin Chen, Sebastian Krause, Nir Levine, Marco Fornoni, Ilya Figotin, Nick Roy, Parsa Mahmoudieh, Vladimir Magay, Mukundan Madhavan, Jin Miao, Jianmo Ni, Yasuhisa Fujii, Ian Chou, George Scrivener, Zak Tsai, Siobhan Mcloughlin, Jeremy Selier, Sandra Lefdal, Jeffrey Zhao, Abhijit Karmarkar, Kushal Chauhan, Shivanker Goel, Zhaoyi Zhang, Vihan Jain, Parisa Haghani, Mostafa Dehghani, Jacob Scott, Erin Farnese, Anastasija Ilić, Steven Baker, Julia Pawar, Li Zhong, Josh Camp, Yoel Zeldes, Shravya Shetty, Anand Iyer, Vít Listík, Jiaxian Guo, Luming Tang, Mark Geller, Simon Bucher, Yifan Ding, Hongzhi Shi, Carrie Muir, Dominik Grewe, Ramy Eskander, Octavio Ponce, Boqing Gong, Derek Gasaway, Samira Khan, Umang Gupta, Angelos Filos, Weicheng Kuo, Klemen Kloboves, Jennifer Beattie, Christian Wright, Leon Li, Alicia Jin, Sandeep Mariserla, Miteyan Patel, Jens Heitkaemper, Dilip Krishnan, Vivek Sharma, David Bieber, Christian Frank, John Lambert, Paul Caron, Martin Polacek, Mai Giménez, Himadri Choudhury, Xing Yu, Sasan Tavakkol, Arun Ahuja, Franz Och, Rodolphe Jenatton, Wojtek Skut, Bryan Richter, David Gaddy, Andy Ly, Misha Bilenko, Megh Umekar, Ethan Liang, Martin Sevenich, Mandar Joshi, Hassan Mansoor, Rebecca Lin, Sumit Sanghai, Abhimanyu Singh, Xiaowei Li, Sudheendra Vijayanarasimhan, Zaheer Abbas, Yonatan Bitton, Hansa Srinivasan, Manish Reddy Vuyyuru, Alexander Frömmgen, Yanhua Sun, Ralph Leith, Alfonso Castaño, DJ Strouse, Le Yan, Austin Kyker, Satish Kambala, Mary Jasarevic, Thibault Sellam, Chao Jia, Alexander Pritzel, Raghavender R, Huizhong Chen, Natalie Clay, Sudeep Gandhe, Sean Kirmani, Sayna Ebrahimi, Hannah Kirkwood, Jonathan Mallinson, Chao Wang, Adnan Ozturel, Kuo Lin, Shyam Upadhyay, Vincent Cohen-Addad, Sean Purser-haskell, Yichong Xu, Ebrahim Songhori, Babi Seal, Alberto Magni, Almog Gueta, Tingting Zou, Guru Guruganesh, Thais Kagohara, Hung Nguyen, Khalid Salama, Alejandro Cruzado Ruiz, Justin Frye, Zhenkai Zhu, Matthias Lochbrunner, Simon Osindero, Wentao Yuan, Lisa Lee, Aman Prasad, Lam Nguyen Thiet, Daniele Calandriello, Victor Stone, Qixuan Feng, Han Ke, Maria Voitovich, Geta Sampemane, Lewis Chiang, Ling Wu, Alexander Bykovsky, Matt Young, Luke Vilnis, Ishita Dasgupta, Aditya Chawla, Qin Cao, Bowen Liang, Daniel Toyama, Szabolcs Payrits, Anca Stefanoiu, Dimitrios Vytiniotis, Ankesh Anand, Tianxiao Shen, Blagoj Mitrevski, Michael Tschannen, Sreenivas Gollapudi, Aishwarya P S, José Leal, Zhe Shen, Han Fu, Wei Wang, Arvind Kannan, Doron Kukliansky, Sergey Yaroshenko, Svetlana Grant, Umesh Telang, David Wood, Alexandra Chronopoulou, Alexandru Ţifrea, Tao Zhou, Tony, Nguy~ên, Muge Ersoy, Anima Singh, Meiyan Xie, Emanuel Taropa, Woohyun Han, Eirikur Agustsson, Andrei Sozanschi, Hui Peng, Alex Chen, Yoel Drori, Efren Robles, Yang Gao, Xerxes Dotiwalla, Ying Chen, Anudhyan Boral, Alexei Bendebury, John Nham, Chris Tar, Luis Castro, Jiepu Jiang, Canoee Liu, Felix Halim, Jinoo Baek, Andy Wan, Jeremiah Liu, Yuan Cao, Shengyang Dai, Trilok Acharya, Ruoxi Sun, Fuzhao Xue, Saket Joshi, Morgane Lustman, Yongqin Xian, Rishabh Joshi, Deep Karkhanis, Nora Kassner, Jamie Hall, Xiangzhuo Ding, Gan Song, Gang Li, Chen Zhu, Yana Kulizhskaya, Bin Ni, Alexey Vlaskin, Solomon Demmessie, Lucio Dery, Salah Zaiem, Yanping Huang, Cindy Fan, Felix Gimeno, Ananth Balashankar, Koji Kojima, Hagai Taitelbaum, Maya Meng, Dero Gharibian, Sahil Singla, Wei Chen, Ambrose Slone, Guanjie Chen, Sujee Rajayogam, Max Schumacher, Suyog Kotecha, Rory Blevins, Qifei Wang, Mor Hazan Taege, Alex Morris, Xin Liu, Fayaz Jamil, Richard Zhang, Pratik Joshi, Ben Ingram, Tyler Liechty, Ahmed Eleryan, Scott Baird, Alex Grills, Gagan Bansal, Shan Han, Kiran Yalasangi, Shawn Xu, Majd Al Merey, Isabel Gao, Felix Weissenberger, Igor Karpov, Robert Riachi, Ankit Anand, Gautam Prasad, Kay Lamerigts, Reid Hayes, Jamie Rogers, Mandy Guo, Ashish Shenoy, Qiong, Hu, Kyle He, Yuchen Liu, Polina Zablotskaia, Sagar Gubbi, Yifan Chang, Jay Pavagadhi, Kristian Kjems, Archita Vadali, Diego Machado, Yeqing Li, Renshen Wang, Dipankar Ghosh, Aahil Mehta, Dana Alon, George Polovets, Alessio Tonioni, Nate Kushman, Joel D’sa, Lin Zhuo, Allen Wu, Rohin Shah, John Youssef, Jiayu Ye, Justin Snyder, Karel Lenc, Senaka Buthpitiya, Matthew Tung, Jichuan Chang, Tao Chen, David Saxton, Jenny Lee, Lydia Lihui Zhang, James Qin, Prabakar Radhakrishnan, Maxwell Chen, Piotr Ambroszczyk, Metin Toksoz-Exley, Yan Zhong, Nitzan Katz, Brendan O’Donoghue, Tamara von Glehn, Adi Gerzi Rosenthal, Aga Świetlik, Xiaokai Zhao, Nick Fernando, Jinliang Wei, Jieru Mei, Sergei Vassilvitskii, Diego Cedillo, Pranjal Awasthi, Hui Zheng, Koray Kavukcuoglu, Itay Laish, Joseph Pagadora, Marc Brockschmidt, Christopher A. Choquette-Choo, Arunkumar Byravan, Yifeng Lu, Xu Chen, Mia Chen, Kenton Lee, Rama Pasumarthi, Sijal Bhatnagar, Aditya Shah, Qiyin Wu, Zhuoyuan Chen, Zack Nado, Bartek Perz, Zixuan Jiang, David Kao, Ganesh Mallya, Nino Vieillard, Lantao Mei, Sertan Girgin, Mandy Jordan, Yeongil Ko, Alekh Agarwal, Yaxin Liu, Yasemin Altun, Raoul de Liedekerke, Anastasios Kementsietsidis, Daiyi Peng, Dangyi Liu, Utku Evci, Peter Humphreys, Austin Tarango, Xiang Deng, Yoad Lewenberg, Kevin Aydin, Chengda Wu, Bhavishya Mittal, Tsendsuren Munkhdalai, Kleopatra Chatziprimou, Rodrigo Benenson, Uri First, Xiao Ma, Jinning Li, Armand Joulin, Hamish Tomlinson, Tingnan Zhang, Milad Nasr, Zhi Hong, Michaël Sander, Lisa Anne Hendricks, Anuj Sharma, Andrew Bolt, Eszter Vértes, Jiri Simsa, Tomer Levinboim, Olcan Sercinoglu, Divyansh Shukla, Austin Wu, Craig Swanson, Danny Vainstein, Fan Bu, Bo Wang, Ryan Julian, Charles Yoon, Sergei Lebedev, Antonious Girgis, Bernd Bandemer, David Du, Todd Wang, Xi Chen, Ying Xiao, Peggy Lu, Natalie Ha, Vlad Ionescu, Simon Rowe, Josip Matak, Federico Lebron, Andreas Steiner, Lalit Jain, Manaal Faruqui, Nicolas Lacasse, Georgie Evans, Neesha Subramaniam, Dean Reich, Giulia Vezzani, Aditya Pandey, Joe Stanton, Tianhao Zhou, Liam McCafferty, Henry Griffiths, Verena Rieser, Soheil Hassas Yeganeh, Eleftheria Briakou, Lu Huang, Zichuan Wei, Liangchen Luo, Erik Jue, Gabby Wang, Victor Cotruta, Myriam Khan, Jongbin Park, Qiuchen Guo, Peiran Li, Rong Rong, Diego Antognini, Anastasia Petrushkina, Chetan Tekur, Eli Collins, Parul Bhatia, Chester Kwak, Wenhu Chen, Arvind Neelakantan, Immanuel Odisho, Sheng Peng, Vincent Nallatamby, Vaibhav Tulsyan, Fabian Pedregosa, Peng Xu, Raymond Lin, Yulong Wang, Emma Wang, Sholto Douglas, Reut Tsarfaty, Elena Gribovskaya, Renga Aravamudhan, Manu Agarwal, Mara Finkelstein, Qiao Zhang, Elizabeth Cole, Phil Crone, Sarmishta Velury, Anil Das, Chris Sauer, Luyao Xu, Danfeng Qin, Chenjie Gu, Dror Marcus, CJ Zheng, Wouter Van Gansbeke, Sobhan Miryoosefi, Haitian Sun, YaGuang Li, Charlie Chen, Jae Yoo, Pavel Dubov, Alex Tomala, Adams Yu, Paweł Wesołowski, Alok Gunjan, Eddie Cao, Jiaming Luo, Nikhil Sethi, Arkadiusz Socala, Laura Graesser, Tomas Kocisky, Arturo BC, Minmin Chen, Edward Lee, Sophie Wang, Weize Kong, Qiantong Xu, Nilesh Tripuraneni, Yiming Li, Xinxin Yu, Allen Porter, Paul Voigtlaender, Biao Zhang, Arpi Vezer, Sarah York, Qing Wei, Geoffrey Cideron, Mark Kurzeja, Seungyeon Kim, Benny Li, Angéline Pouget, Hyo Lee, Kaspar Daugaard, Yang Li, Dave Uthus, Aditya Siddhant, Paul Cavallaro, Sriram Ganapathy, Maulik Shah, Rolf Jagerman, Jeff Stanway, Piermaria Mendolicchio, Li Xiao, Kayi Lee, Tara Thompson, Shubham Milind Phal, Jason Chase, Sun Jae Lee, Adrian N Reyes, Disha Shrivastava, Zhen Qin, Roykrong Sukkerd, Seth Odoom, Lior Madmoni, John Aslanides, Jonathan Herzig, Elena Pochernina, Sheng Zhang, Parker Barnes, Daisuke Ikeda, Qiujia Li, Shuo-yiin Chang, Shakir Mohamed, Jim Sproch, Richard Powell, Bidisha Samanta, Domagoj Ćevid, Anton Kovsharov, Shrestha Basu Mallick, Srinivas Tadepalli, Anne Zheng, Kareem Ayoub, Andreas Noever, Christian Reisswig, Zhuo Xu, Junhyuk Oh, Martin Matysiak, Tim Blyth, Shereen Ashraf, Julien Amelot, Boone Severson, Michele Bevilacqua, Motoki Sano, Ethan Dyer, Ofir Roval, Anu Sinha, Yin Zhong, Sagi Perel, Tea Sabolić, Johannes Mauerer, Willi Gierke, Mauro Verzetti, Rodrigo Cabrera, Alvin Abdagic, Steven Hemingray, Austin Stone, Jong Lee, Farooq Ahmad, Karthik Raman, Lior Shani, Jonathan Lai, Orhan Firat, Nathan Waters, Eric Ge, Mo Shomrat, Himanshu Gupta, Rajeev Aggarwal, Tom Hudson, Bill Jia, Simon Baumgartner, Palak Jain, Joe Kovac, Junehyuk Jung, Ante Žužul, Will Truong, Morteza Zadimoghaddam, Songyou Peng, Marco Liang, Rachel Sterneck, Balaji Lakshminarayanan, Machel Reid, Oliver Woodman, Tong Zhou, Jianling Wang, Vincent Coriou, Arjun Narayanan, Jay Hoover, Yenai Ma, Apoorv Jindal, Clayton Sanford, Doug Reid, Swaroop Ramaswamy, Alex Kurakin, Roland Zimmermann, Yana Lunts, Dragos Dena, Zalán Borsos, Vered Cohen, Shujian Zhang, Will Grathwohl, Robert Dadashi, Morgan Redshaw, Joshua Kessinger, Julian Odell, Silvano Bonacina, Zihang Dai, Grace Chen, Ayush Dubey, Pablo Sprechmann, Mantas Pajarskas, Wenxuan Zhou, Niharika Ahuja, Tara Thomas, Martin Nikoltchev, Matija Kecman, Bharath Mankalale, Andrey Ryabtsev, Jennifer She, Christian Walder, Jiaming Shen, Lu Li, Carolina Parada, Sheena Panthaplackel, Okwan Kwon, Matt Lawlor, Utsav Prabhu, Yannick Schroecker, Marc’aurelio Ranzato, Pete Blois, Iurii Kemaev, Ting Yu, Dmitry, Lepikhin, Hao Xiong, Sahand Sharifzadeh, Oleaser Johnson, Jeremiah Willcock, Rui Yao, Greg Farquhar, Sujoy Basu, Hidetoshi Shimokawa, Nina Anderson, Haiguang Li, Khiem Pham, Yizhong Liang, Sebastian Borgeaud, Alexandre Moufarek, Hideto Kazawa, Blair Kutzman, Marcin Sieniek, Sara Smoot, Ruth Wang, Natalie Axelsson, Nova Fallen, Prasha Sundaram, Yuexiang Zhai, Varun Godbole, Petros Maniatis, Alek Wang, Ilia Shumailov, Santhosh Thangaraj, Remi Crocker, Nikita Gupta, Gang Wu, Phil Chen, Gellért Weisz, Celine Smith, Mojtaba Seyedhosseini, Boya Fang, Xiyang Luo, Roey Yogev, Zeynep Cankara, Andrew Hard, Helen Ran, Rahul Sukthankar, George Necula, Gaël Liu, Honglong Cai, Praseem Banzal, Daniel Keysers, Sanjay Ghemawat, Connie Tao, Emma Dunleavy, Aditi Chaudhary, Wei Li, Maciej Mikuła, Chen-Yu Lee, Tiziana Refice, Krishna Somandepalli, Alexandre Fréchette, Dan Bahir, John Karro, Keith Rush, Sarah Perrin, Bill Rosgen, Xiaomeng Yang, Clara Huiyi Hu, Mahmoud Alnahlawi, Justin Mao-Jones, Roopal Garg, Hoang Nguyen, Bat-Orgil Batsaikhan, Iñaki Iturrate, Anselm Levskaya, Avi Singh, Ashyana Kachra, Tony Lu, Denis Petek, Zheng Xu, Mark Graham, Lukas Zilka, Yael Karov, Marija Kostelac, Fangyu Liu, Yaohui Guo, Weiyue Wang, Bernd Bohnet, Emily Pitler, Tony Bruguier, Keisuke Kinoshita, Chrysovalantis Anastasiou, Nilpa Jha, Ting Liu, Jerome Connor, Phil Wallis, Philip Pham, Eric Bailey, Shixin Li, Heng-Tze Cheng, Sally Ma, Haiqiong Li, Akanksha Maurya, Kate Olszewska, Manfred Warmuth, Christy Koh, Dominik Paulus, Siddhartha Reddy Jonnalagadda, Enrique Piqueras, Ali Elqursh, Geoff Brown, Hadar Shemtov, Loren Maggiore, Fei Xia, Ryan Foley, Beka Westberg, George van den Driessche, Livio Baldini Soares, Arjun Kar, Michael Quinn, Siqi Zuo, Jialin Wu, Kyle Kastner, Anna Bortsova, Aijun Bai, Ales Mikhalap, Luowei Zhou, Jennifer Brennan, Vinay Ramasesh, Honglei Zhuang, John Maggs, Johan Schalkwyk, Yuntao Xu, Hui Huang, Andrew Howard, Sasha Brown, Linting Xue, Gloria Shen, Brian Albert, Neha Jha, Daniel Zheng, Varvara Krayvanova, Spurthi Amba Hombaiah, Olivier Lacombe, Gautam Vasudevan, Dan Graur, Tian Xie, Meet Gandhi, Bangju Wang, Dustin Zelle, Harman Singh, Dahun Kim, Sébastien Cevey, Victor Ungureanu, Natasha Noy, Fei Liu, Annie Xie, Fangxiaoyu Feng, Katerina Tsihlas, Daniel Formoso, Neera Vats, Quentin Wellens, Yinan Wang, Niket Kumar Bhumihar, Samrat Ghosh, Matt Hoffman, Tom Lieber, Oran Lang, Kush Bhatia, Tom Paine, Aroonalok Pyne, Ronny Votel, Madeleine Clare Elish, Benoit Schillings, Alex Panagopoulos, Haichuan Yang, Adam Raveret, Zohar Yahav, Shuang Liu, Warren Chen, Dalia El Badawy, Nishant Agrawal, Mohammed Badawi, Mahdi Mirzazadeh, Carla Bromberg, Fan Ye, Chang Liu, Tatiana Sholokhova, George-Cristian Muraru, Gargi Balasubramaniam, Jonathan Malmaud, Alen Carin, Danilo Martins, Irina Jurenka, Pankil Botadra, Dave Lacey, Richa Singh, Mariano Schain, Dan Zheng, Isabelle Guyon, Victor Lavrenko, Seungji Lee, Xiang Zhou, Demis Hassabis, Jeshwanth Challagundla, Derek Cheng, Nikhil Mehta, Matthew Mauger, Michela Paganini, Pushkar Mishra, Kate Lee, Zhang Li, Lexi Baugher, Ondrej Skopek, Max Chang, Amir Zait, Gaurav Menghani, Lizzetth Bellot, Guangxing Han, Jean-Michel Sarr, Sharat Chikkerur, Himanshu Sahni, Rohan Anil, Arun Narayanan, Chandu Thekkath, Daniele Pighin, Hana Strejček, Marko Velic, Fred Bertsch, Manuel Tragut, Keran Rong, Alicia Parrish, Kai Bailey, Jiho Park, Isabela Albuquerque, Abhishek Bapna, Rajesh Venkataraman, Alec Kosik, Johannes Griesser, Zhiwei Deng, Alek Andreev, Qingyun Dou, Kevin Hui, Fanny Wei, Xiaobin Yu, Lei Shu, Avia Aharon, David Barker, Badih Ghazi, Sebastian Flennerhag, Chris Breaux, Yuchuan Liu, Matthew Bilotti, Josh Woodward, Uri Alon, Stephanie Winkler, Tzu-Kuo Huang, Kostas Andriopoulos, João Gabriel Oliveira, Penporn Koanantakool, Berkin Akin, Michael Wunder, Cicero Nogueira dos Santos, Mohammad Hossein Bateni, Lin Yang, Dan Horgan, Beer Changpinyo, Keyvan Amiri, Min Ma, Dayeong Lee, Lihao Liang, Anirudh Baddepudi, Tejasi Latkar, Raia Hadsell, Jun Xu, Hairong Mu, Michael Han, Aedan Pope, Snchit Grover, Frank Kim, Ankit Bhagatwala, Guan Sun, Yamini Bansal, Amir Globerson, Alireza Nazari, Samira Daruki, Hagen Soltau, Jane Labanowski, Laurent El Shafey, Matt Harvey, Yanif Ahmad, Elan Rosenfeld, William Kong, Etienne Pot, Yi-Xuan Tan, Aurora Wei, Victoria Langston, Marcel Prasetya, Petar Veličković, Richard Killam, Robin Strudel, Darren Ni, Zhenhai Zhu, Aaron Archer, Kavya Kopparapu, Lynn Nguyen, Emilio Parisotto, Hussain Masoom, Sravanti Addepalli, Jordan Grimstad, Hexiang Hu, Joss Moore, Avinatan Hassidim, Le Hou, Mukund Raghavachari, Jared Lichtarge, Adam R. Brown, Hilal Dib, Natalia Ponomareva, Justin Fu, Yujing Zhang, Altaf Rahman, Joana Iljazi, Edouard Leurent, Gabriel Dulac-Arnold, Cosmo Du, Chulayuth Asawaroengchai, Larry Jin, Ela Gruzewska, Ziwei Ji, Benigno Uria, Daniel De Freitas, Paul Barham, Lauren Beltrone, Víctor Campos, Jun Yan, Neel Kovelamudi, Arthur Nguyen, Elinor Davies, Zhichun Wu, Zoltan Egyed, Kristina Toutanova, Nithya Attaluri, Hongliang Fei, Peter Stys, Siddhartha Brahma, Martin Izzard, Siva Velusamy, Scott Lundberg, Vincent Zhuang, Kevin Sequeira, Adam Santoro, Ehsan Amid, Ophir Aharoni, Shuai Ye, Mukund Sundararajan, Lijun Yu, Yu-Cheng Ling, Stephen Spencer, Hugo Song, Josip Djolonga, Christo Kirov, Sonal Gupta, Alessandro Bissacco, Clemens Meyer, Mukul Bhutani, Andrew Dai, Weiyi Wang, Siqi Liu, Ashwin Sreevatsa, Qijun Tan, Maria Wang, Lucy Kim, Yicheng Wang, Alex Irpan, Yang Xiao, Stanislav Fort, Yifan He, Alex Gurney, Bryan Gale, Yue Ma, Monica Roy, Viorica Patraucean, Taylan Bilal, Golnaz Ghiasi, Anahita Hosseini, Melvin Johnson, Zhuowan Li, Yi Tay, Benjamin Beyret, Katie Millican, Josef Broder, Mayank Lunayach, Danny Swisher, Eugen Vušak, David Parkinson, MH Tessler, Adi Mayrav Gilady, Richard Song, Allan Dafoe, Yves Raimond, Masa Yamaguchi, Itay Karo, Elizabeth Nielsen, Kevin Kilgour, Mike Dusenberry, Rajiv Mathews, Jiho Choi, Siyuan Qiao, Harsh Mehta, Sahitya Potluri, Chris Knutsen, Jialu Liu, Tat Tan, Kuntal Sengupta, Keerthana Gopalakrishnan, Abodunrinwa Toki, Mencher Chiang, Mike Burrows, Grace Vesom, Zafarali Ahmed, Ilia Labzovsky, Siddharth Vashishtha, Preeti Singh, Ankur Sharma, Ada Ma, Jinyu Xie, Pranav Talluri, Hannah Forbes-Pollard, Aarush Selvan, Joel Wee, Loic Matthey, Tom Funkhouser, Parthasarathy Gopavarapu, Lev Proleev, Cheng Li, Matt Thomas, Kashyap Kolipaka, Zhipeng Jia, Ashwin Kakarla, Srinivas Sunkara, Joan Puigcerver, Suraj Satishkumar Sheth, Emily Graves, Chen Wang, Sadh MNM Khan, Kai Kang, Shyamal Buch, Fred Zhang, Omkar Savant, David Soergel, Kevin Lee, Linda Friso, Xuanyi Dong, Rahul Arya, Shreyas Chandrakaladharan, Connor Schenck, Greg Billock, Tejas Iyer, Anton Bakalov, Leslie Baker, Alex Ruiz, Angad Chandorkar, Trieu Trinh, Matt Miecnikowski, Yanqi Zhou, Yangsibo Huang, Jiazhong Nie, Ali Shah, Ashish Thapliyal, Sam Haves, Lun Wang, Uri Shaham, Patrick Morris-Suzuki, Soroush Radpour, Leonard Berrada, Thomas Strohmann, Chaochao Yan, Jingwei Shen, Sonam Goenka, Tris Warkentin, Petar Dević, Dan Belov, Albert Webson, Madhavi Yenugula, Puranjay Datta, Jerry Chang, Nimesh Ghelani, Aviral Kumar, Vincent Perot, Jessica Lo, Yang Song, Herman Schmit, Jianmin Chen, Vasilisa Bashlovkina, Xiaoyue Pan, Diana Mincu, Paul Roit, Isabel Edkins, Andy Davis, Yujia Li, Ben Horn, Xinjian Li, Pradeep Kumar S, Eric Doi, Wanzheng Zhu, Sri Gayatri Sundara Padmanabhan, Siddharth Verma, Jasmine Liu, Heng Chen, Mihajlo Velimirović, Malcolm Reynolds, Priyanka Agrawal, Nick Sukhanov, Abhinit Modi, Siddharth Goyal, John Palowitch, Nima Khajehnouri, Wing Lowe, David Klinghoffer, Sharon Silver, Vinh Tran, Candice Schumann, Francesco Piccinno, Xi Liu, Mario Lučić, Xiaochen Yang, Sandeep Kumar, Ajay Kannan, Ragha Kotikalapudi, Mudit Bansal, Fabian Fuchs, Javad Hosseini, Abdelrahman Abdelhamed, Dawn Bloxwich, Tianhe Yu, Ruoxin Sang, Gregory Thornton, Karan Gill, Yuchi Liu, Virat Shejwalkar, Jason Lin, Zhipeng Yan, Kehang Han, Thomas Buschmann, Michael Pliskin, Zhi Xing, Susheel Tatineni, Junlin Zhang, Sissie Hsiao, Gavin Buttimore, Marcus Wu, Zefei Li, Geza Kovacs, Legg Yeung, Tao Huang, Aaron Cohen, Bethanie Brownfield, Averi Nowak, Mikel Rodriguez, Tianze Shi, Hado van Hasselt, Kevin Cen, Deepanway Ghoshal, Kushal Majmundar, Weiren Yu, Warren, Chen, Danila Sinopalnikov, Hao Zhang, Vlado Galić, Di Lu, Zeyu Zheng, Maggie Song, Gary Wang, Gui Citovsky, Swapnil Gawde, Isaac Galatzer-Levy, David Silver, Ivana Balazevic, Dipanjan Das, Kingshuk Majumder, Yale Cong, Praneet Dutta, Dustin Tran, Hui Wan, Junwei Yuan, Daniel Eppens, Alanna Walton, Been Kim, Harry Ragan, James Cobon-Kerr, Lu Liu, Weijun Wang, Bryce Petrini, Jack Rae, Rakesh Shivanna, Yan Xiong, Chace Lee, Pauline Coquinot, Yiming Gu, Lisa Patel, Blake Hechtman, Aviel Boag, Orion Jankowski, Alex Wertheim, Alex Lee, Paul Covington, Hila Noga, Sam Sobell, Shanthal Vasanth, William Bono, Chirag Nagpal, Wei Fan, Xavier Garcia, Kedar Soparkar, Aybuke Turker, Nathan Howard, Sachit Menon, Yuankai Chen, Vikas Verma, Vladimir Pchelin, Harish Rajamani, Valentin Dalibard, Ana Ramalho, Yang Guo, Kartikeya Badola, Seojin Bang, Nathalie Rauschmayr, Julia Proskurnia, Sudeep Dasari, Xinyun Chen, Mikhail Sushkov, Anja Hauth, Pauline Sho, Abhinav Singh, Bilva Chandra, Allie Culp, Max Dylla, Olivier Bachem, James Besley, Heri Zhao, Timothy Lillicrap, Wei Wei, Wael Al Jishi, Ning Niu, Alban Rrustemi, Raphaël Lopez Kaufman, Ryan Poplin, Jewel Zhao, Minh Truong, Shikhar Bharadwaj, Ester Hlavnova, Eli Stickgold, Cordelia Schmid, Georgi Stephanov, Zhaoqi Leng, Frederick Liu, Léonard Hussenot, Shenil Dodhia, Juliana Vicente Franco, Lesley Katzen, Abhanshu Sharma, Sarah Cogan, Zuguang Yang, Aniket Ray, Sergi Caelles, Shen Yan, Ravin Kumar, Daniel Gillick, Renee Wong, Joshua Ainslie, Jonathan Hoech, Séb Arnold, Dan Abolafia, Anca Dragan, Ben Hora, Grace Hu, Alexey Guseynov, Yang Lu, Chas Leichner, Jinmeng Rao, Abhimanyu Goyal, Nagabhushan Baddi, Daniel Hernandez Diaz, Tim McConnell, Max Bain, Jake Abernethy, Qiqi Yan, Rylan Schaeffer, Paul Vicol, Will Thompson, Montse Gonzalez Arenas, Mathias Bellaiche, Pablo Barrio, Stefan Zinke, Riccardo Patana, Pulkit Mehta, JK Kearns, Avraham Ruderman, Scott Pollom, David D’Ambrosio, Cath Hope, Yang Yu, Andrea Gesmundo, Kuang-Huei Lee, Aviv Rosenberg, Yiqian Zhou, Yaoyiran Li, Drew Garmon, Yonghui Wu, Safeen Huda, Gil Fidel, Martin Baeuml, Jian Li, Phoebe Kirk, Rhys May, Tao Tu, Sara Mc Carthy, Toshiyuki Fukuzawa, Miranda Aperghis, Chih-Kuan Yeh, Toshihiro Yoshino, Bo Li, Austin Myers, Kaisheng Yao, Ben Limonchik, Changwan Ryu, Rohun Saxena, Alex Goldin, Ruizhe Zhao, Rocky Rhodes, Tao Zhu, Divya Tyam, Heidi Howard, Nathan Byrd, Hongxu Ma, Yan Wu, Ryan Mullins, Qingze Wang, Aida Amini, Sebastien Baur, Yiran Mao, Subhashini Venugopalan, Will Song, Wen Ding, Paul Collins, Sashank Reddi, Megan Shum, Andrei Rusu, Luisa Zintgraf, Kelvin Chan, Sheela Goenka, Mathieu Blondel, Michael Collins, Renke Pan, Marissa Giustina, Nikolai Chinaev, Christian Schuler, Ce Zheng, Jonas Valfridsson, Alyssa Loo, Alex Yakubovich, Jamie Smith, Tao Jiang, Rich Munoz, Gabriel Barcik, Rishabh Bansal, Mingyao Yang, Yilun Du, Pablo Duque, Mary Phuong, Alexandra Belias, Kunal Lad, Zeyu Liu, Tal Schuster, Karthik Duddu, Jieru Hu, Paige Kunkle, Matthew Watson, Jackson Tolins, Josh Smith, Denis Teplyashin, Garrett Bingham, Marvin Ritter, Marco Andreetto, Divya Pitta, Mohak Patel, Shashank Viswanadha, Trevor Strohman, Catalin Ionescu, Jincheng Luo, Yogesh Kalley, Jeremy Wiesner, Dan Deutsch, Derek Lockhart, Peter Choy, Rumen Dangovski, Chawin Sitawarin, Cat Graves, Tanya Lando, Joost van Amersfoort, Ndidi Elue, Zhouyuan Huo, Pooya Moradi, Jean Tarbouriech, Henryk Michalewski, Wenting Ye, Eunyoung Kim, Alex Druinsky, Florent Altché, Xinyi Chen, Artur Dwornik, Da-Cheng Juan, Rivka Moroshko, Horia Toma, Jarrod Kahn, Hai Qian, Maximilian Sieb, Irene Cai, Roman Goldenberg, Praneeth Netrapalli, Sindhu Raghuram, Yuan Gong, Lijie Fan, Evan Palmer, Yossi Matias, Valentin Gabeur, Shreya Pathak, Tom Ouyang, Don Metzler, Geoff Bacon, Srinivasan Venkatachary, Sridhar Thiagarajan, Alex Cullum, Eran Ofek, Vytenis Sakenas, Mohamed Hammad, Cesar Magalhaes, Mayank Daswani, Oscar Chang, Ashok Popat, Ruichao Li, Komal Jalan, Yanhan Hou, Josh Lipschultz, Antoine He, Wenhao Jia, Pier Giuseppe Sessa, Prateek Kolhar, William Wong, Sumeet Singh, Lukas Haas, Jay Whang, Hanna Klimczak-Plucińska, Georges Rotival, Grace Chung, Yiqing Hua, Anfal Siddiqui, Nicolas Serrano, Dongkai Chen, Billy Porter, Libin Bai, Keshav Shivam, Sho Arora, Partha Talukdar, Tom Cobley, Sangnie Bhardwaj, Evgeny Gladchenko, Simon Green, Kelvin Guu, Felix Fischer, Xiao Wu, Eric Wang, Achintya Singhal, Tatiana Matejovicova, James Martens, Hongji Li, Roma Patel, Elizabeth Kemp, Jiaqi Pan, Lily Wang, Blake JianHang Chen, Jean-Baptiste Alayrac, Navneet Potti, Erika Gemzer, Eugene Ie, Kay McKinney, Takaaki Saeki, Edward Chou, Pascal Lamblin, SQ Mah, Zach Fisher, Martin Chadwick, Jon Stritar, Obaid Sarvana, Andrew Hogue, Artem Shtefan, Hadi Hashemi, Yang Xu, Jindong Gu, Sharad Vikram, Chung-Ching Chang, Sabela Ramos, Logan Kilpatrick, Weijuan Xi, Jenny Brennan, Yinghao Sun, Abhishek Jindal, Ionel Gog, Dawn Chen, Felix Wu, Jason Lee, Sudhindra Kopalle, Srinadh Bhojanapalli, Oriol Vinyals, Natan Potikha, Burcu Karagol Ayan, Yuan Yuan, Michael Riley, Piotr Stanczyk, Sergey Kishchenko, Bing Wang, Dan Garrette, Antoine Yang, Vlad Feinberg, CJ Carey, Javad Azizi, Viral Shah, Erica Moreira, Chongyang Shi, Josh Feldman, Elizabeth Salesky, Thomas Lampe, Aneesh Pappu, Duhyeon Kim, Jonas Adler, Avi Caciularu, Brian Walker, Yunhan Xu, Yochai Blau, Dylan Scandinaro, Terry Huang, Sam El-Husseini, Abhishek Sinha, Lijie Ren, Taylor Tobin, Patrik Sundberg, Tim Sohn, Vikas Yadav, Mimi Ly, Emily Xue, Jing Xiong, Afzal Shama Soudagar, Sneha Mondal, Nikhil Khadke, Qingchun Ren, Ben Vargas, Stan Bileschi, Sarah Chakera, Cindy Wang, Boyu Wang, Yoni Halpern, Joe Jiang, Vikas Sindhwani, Petre Petrov, Pranavaraj Ponnuramu, Sanket Vaibhav Mehta, Yu Watanabe, Betty Chan, Matheus Wisniewski, Trang Pham, Jingwei Zhang, Conglong Li, Dario de Cesare, Art Khurshudov, Alex Vasiloff, Melissa Tan, Zoe Ashwood, Bobak Shahriari, Maryam Majzoubi, Garrett Tanzer, Olga Kozlova, Robin Alazard, James Lee-Thorp, Nguyet Minh Phu, Isaac Tian, Junwhan Ahn, Andy Crawford, Lauren Lax, Yuan, Shangguan, Iftekhar Naim, David Ross, Oleksandr Ferludin, Tongfei Guo, Andrea Banino, Hubert Soyer, Xiaoen Ju, Dominika Rogozińska, Ishaan Malhi, Marcella Valentine, Daniel Balle, Apoorv Kulshreshtha, Maciej Kula, Yiwen Song, Sophia Austin, John Schultz, Roy Hirsch, Arthur Douillard, Apoorv Reddy, Michael Fink, Summer Yue, Khyatti Gupta, Adam Zhang, Norman Rink, Daniel McDuff, Lei Meng, András György, Yasaman Razeghi, Ricky Liang, Kazuki Osawa, Aviel Atias, Matan Eyal, Tyrone Hill, Nikolai Grigorev, Zhengdong Wang, Nitish Kulkarni, Rachel Soh, Ivan Lobov, Zachary Charles, Sid Lall, Kazuma Hashimoto, Ido Kessler, Victor Gomes, Zelda Mariet, Danny Driess, Alessandro Agostini, Canfer Akbulut, Jingcao Hu, Marissa Ikonomidis, Emily Caveness, Kartik Audhkhasi, Saurabh Agrawal, Ioana Bica, Evan Senter, Jayaram Mudigonda, Kelly Chen, Jingchen Ye, Xuanhui Wang, James Svensson, Philipp Fränken, Josh Newlan, Li Lao, Eva Schnider, Sami Alabed, Joseph Kready, Jesse Emond, Afief Halumi, Tim Zaman, Chengxi Ye, Naina Raisinghani, Vilobh Meshram, Bo Chang, Ankit Singh Rawat, Axel Stjerngren, Sergey Levi, Rui Wang, Xiangzhu Long, Mitchelle Rasquinha, Steven Hand, Aditi Mavalankar, Lauren Agubuzu, Sudeshna Roy, Junquan Chen, Jarek Wilkiewicz, Hao Zhou, Michal Jastrzebski, Qiong Hu, Agustin Dal Lago, Ramya Sree Boppana, Wei-Jen Ko, Jennifer Prendki, Yao Su, Zhi Li, Eliza Rutherford, Girish Ramchandra Rao, Ramona Comanescu, Adrià Puigdomènech, Qihang Chen, Dessie Petrova, Christine Chan, Vedrana Milutinovic, Felipe Tiengo Ferreira, Chin-Yi Cheng, Ming Zhang, Tapomay Dey, Sherry Yang, Ramesh Sampath, Quoc Le, Howard Zhou, Chu-Cheng Lin, Hoi Lam, Christine Kaeser-Chen, Kai Hui, Dean Hirsch, Tom Eccles, Basil Mustafa, Shruti Rijhwani, Morgane Rivière, Yuanzhong Xu, Junjie Wang, Xinyang Geng, Xiance Si, Arjun Khare, Cheolmin Kim, Vahab Mirrokni, Kamyu Lee, Khuslen Baatarsukh, Nathaniel Braun, Lisa Wang, Pallavi LV, Richard Tanburn, Yuvein, Zhu, Fangda Li, Setareh Ariafar, Dan Goldberg, Ken Burke, Daniil Mirylenka, Meiqi Guo, Olaf Ronneberger, Hadas Natalie Vogel, Liqun Cheng, Nishita Shetty, Johnson Jia, Thomas Jimma, Corey Fry, Ted Xiao, Martin Sundermeyer, Ryan Burnell, Yannis Assael, Mario Pinto, JD Chen, Rohit Sathyanarayana, Donghyun Cho, Jing Lu, Rishabh Agarwal, Sugato Basu, Lucas Gonzalez, Dhruv Shah, Meng Wei, Dre Mahaarachchi, Rohan Agrawal, Tero Rissa, Yani Donchev, Ramiro Leal-Cavazos, Adrian Hutter, Markus Mircea, Alon Jacovi, Faruk Ahmed, Jiageng Zhang, Shuguang Hu, Bo-Juen Chen, Jonni Kanerva, Guillaume Desjardins, Andrew Lee, Nikos Parotsidis, Asier Mujika, Tobias Weyand, Jasper Snoek, Jo Chick, Kai Chen, Paul Chang, Ethan Mahintorabi, Zi Wang, Tolly Powell, Orgad Keller, Abhirut Gupta, Claire Sha, Kanav Garg, Nicolas Heess, Ágoston Weisz, Cassidy Hardin, Bartek Wydrowski, Ben Coleman, Karina Zainullina, Pankaj Joshi, Alessandro Epasto, Terry Spitz, Binbin Xiong, Kai Zhao, Arseniy Klimovskiy, Ivy Zheng, Johan Ferret, Itay Yona, Waleed Khawaja, Jean-Baptiste Lespiau, Maxim Krikun, Siamak Shakeri, Timothee Cour, Bonnie Li, Igor Krivokon, Dan Suh, Alex Hofer, Jad Al Abdallah, Nikita Putikhin, Oscar Akerlund, Silvio Lattanzi, Anurag Kumar, Shane Settle, Himanshu Srivastava, Folawiyo Campbell-Ajala, Edouard Rosseel, Mihai Dorin Istin, Nishanth Dikkala, Anand Rao, Nick Young, Kate Lin, Dhruva Bhaswar, Yiming Wang, Jaume Sanchez Elias, Kritika Muralidharan, James Keeling, Dayou Du, Siddharth Gopal, Gregory Dibb, Charles Blundell, Manolis Delakis, Jacky Liang, Marco Tulio Ribeiro, Georgi Karadzhov, Guillermo Garrido, Ankur Bapna, Jiawei Cao, Adam Sadovsky, Pouya Tafti, Arthur Guez, Coline Devin, Yixian Di, Jinwei Xing, Chuqiao, Xu, Hanzhao Lin, Chun-Te Chu, Sameera Ponda, Wesley Helmholz, Fan Yang, Yue Gao, Sara Javanmardi, Wael Farhan, Alex Ramirez, Ricardo Figueira, Khe Chai Sim, Yuval Bahat, Ashwin Vaswani, Liangzhe Yuan, Gufeng Zhang, Leland Rechis, Hanjun Dai, Tayo Oguntebi, Alexandra Cordell, Eugénie Rives, Kaan Tekelioglu, Naveen Kumar, Bing Zhang, Aurick Zhou, Nikolay Savinov, Andrew Leach, Alex Tudor, Sanjay Ganapathy, Yanyan Zheng, Mirko Rossini, Vera Axelrod, Arnaud Autef, Yukun Zhu, Zheng Zheng, Mingda Zhang, Baochen Sun, Jie Ren, Nenad Tomasev, Nithish Kannan, Amer Sinha, Charles Chen, Louis O’Bryan, Alex Pak, Aditya Kusupati, Weel Yang, Deepak Ramachandran, Patrick Griffin, Seokhwan Kim, Philipp Neubeck, Craig Schiff, Tammo Spalink, Mingyang Ling, Arun Nair, Ga-Young Joung, Linda Deng, Avishkar Bhoopchand, Lora Aroyo, Tom Duerig, Jordan Griffith, Gabe Barth-Maron, Jake Ades, Alex Haig, Ankur Taly, Yunting Song, Paul Michel, Dave Orr, Dean Weesner, Corentin Tallec, Carrie Grimes Bostock, Paul Niemczyk, Andy Twigg, Mudit Verma, Rohith Vallu, Henry Wang, Marco Gelmi, Kiranbir Sodhia, Aleksandr Chuklin, Omer Goldman, Jasmine George, Liang Bai, Kelvin Zhang, Petar Sirkovic, Efrat Nehoran, Golan Pundak, Jiaqi Mu, Alice Chen, Alex Greve, Paulo Zacchello, David Amos, Heming Ge, Eric Noland, Colton Bishop, Jeffrey Dudek, Youhei Namiki, Elena Buchatskaya, Jing Li, Dorsa Sadigh, Masha Samsikova, Dan Malkin, Damien Vincent, Robert David, Rob Willoughby, Phoenix Meadowlark, Shawn Gao, Yan Li, Raj Apte, Amit Jhindal, Stein Xudong Lin, Alex Polozov, Zhicheng Wang, Tomas Mery, Anirudh GP, Varun Yerram, Sage Stevens, Tianqi Liu, Noah Fiedel, Charles Sutton, Matthew Johnson, Xiaodan Song, Kate Baumli, Nir Shabat, Muqthar Mohammad, Hao Liu, Marco Selvi, Yichao Zhou, Mehdi Hafezi Manshadi, Chu-ling Ko, Anthony Chen, Michael Bendersky, Jorge Gonzalez Mendez, Nisarg Kothari, Amir Zandieh, Yiling Huang, Daniel Andor, Ellie Pavlick, Idan Brusilovsky, Jitendra Harlalka, Sally Goldman, Andrew Lampinen, Guowang Li, Asahi Ushio, Somit Gupta, Lei Zhang, Chuyuan Kelly Fu, Madhavi Sewak, Timo Denk, Jed Borovik, Brendan Jou, Avital Zipori, Prateek Jain, Junwen Bai, Thang Luong, Jonathan Tompson, Alice Li, Li Liu, George Powell, Jiajun Shen, Alex Feng, Grishma Chole, Da Yu, Yinlam Chow, Tongxin Yin, Eric Malmi, Kefan Xiao, Yash Pande, Shachi Paul, Niccolò Dal Santo, Adil Dostmohamed, Sergio Guadarrama, Aaron Phillips, Thanumalayan Sankaranarayana Pillai, Gal Yona, Amin Ghafouri, Preethi Lahoti, Benjamin Lee, Dhruv Madeka, Eren Sezener, Simon Tokumine, Adrian Collister, Nicola De Cao, Richard Shin, Uday Kalra, Parker Beak, Emily Nottage, Ryo Nakashima, Ivan Jurin, Vikash Sehwag, Meenu Gaba, Junhao Zeng, Kevin R. McKee, Fernando Pereira, Tamar Yakar, Amayika Panda, Arka Dhar, Peilin Zhong, Daniel Sohn, Mark Brand, Lars Lowe Sjoesund, Viral Carpenter, Sharon Lin, Shantanu Thakoor, Marcus Wainwright, Ashwin Chaugule, Pranesh Srinivasan, Muye Zhu, Bernett Orlando, Jack Weber, Ayzaan Wahid, Gilles Baechler, Apurv Suman, Jovana Mitrović, Gabe Taubman, Honglin Yu, Helen King, Josh Dillon, Cathy Yip, Dhriti Varma, Tomas Izo, Levent Bolelli, Borja De Balle Pigem, Julia Di Trapani, Fotis Iliopoulos, Adam Paszke, Nishant Ranka, Joe Zou, Francesco Pongetti, Jed McGiffin, Alex Siegman, Rich Galt, Ross Hemsley, Goran Žužić, Victor Carbune, Tao Li, Myle Ott, Félix de Chaumont Quitry, David Vilar Torres, Yuri Chervonyi, Tomy Tsai, Prem Eruvbetine, Samuel Yang, Matthew Denton, Jake Walker, Slavica Andačić, Idan Heimlich Shtacher, Vittal Premachandran, Harshal Tushar Lehri, Cip Baetu, Damion Yates, Lampros Lamprou, Mariko Iinuma, Ioana Mihailescu, Ben Albrecht, Shachi Dave, Susie Sargsyan, Bryan Perozzi, Lucas Manning, Chiyuan Zhang, Denis Vnukov, Igor Mordatch, Raia Hadsell Wolfgang Macherey, Ryan Kappedal, Jim Stephan, Aditya Tripathi, Klaus Macherey, Jun Qian, Abhishek Bhowmick, Shekoofeh Azizi, Rémi Leblond, Shiva Mohan Reddy Garlapati, Timothy Knight, Matthew Wiethoff, Wei-Chih Hung, Anelia Angelova, Georgios Evangelopoulos, Pawel Janus, Dimitris Paparas, Matthew Rahtz, Ken Caluwaerts, Vivek Sampathkumar, Daniel Jarrett, Shadi Noghabi, Antoine Miech, Chak Yeung, Geoff Clark, Henry Prior, Fei Zheng, Jean Pouget-Abadie, Indro Bhattacharya, Kalpesh Krishna, Will Bishop, Zhe Yuan, Yunxiao Deng, Ashutosh Sathe, Kacper Krasowiak, Ciprian Chelba, Cho-Jui Hsieh, Kiran Vodrahalli, Buhuang Liu, Thomas Köppe, Amr Khalifa, Lubo Litchev, Pichi Charoenpanit, Reed Roberts, Sachin Yadav, Yasumasa Onoe, Desi Ivanov, Megha Mohabey, Vighnesh Birodkar, Nemanja Rakićević, Pierre Sermanet, Vaibhav Mehta, Krishan Subudhi, Travis Choma, Will Ng, Luheng He, Kathie Wang, Tasos Kementsietsidis, Shane Gu, Mansi Gupta, Andrew Nystrom, Mehran Kazemi, Timothy Chung, Nacho Cano, Nikhil Dhawan, Yufei Wang, Jiawei Xia, Trevor Yacovone, Eric Jia, Mingqing Chen, Simeon Ivanov, Ashrith Sheshan, Sid Dalmia, Paweł Stradomski, Pengcheng Yin, Salem Haykal, Congchao Wang, Dennis Duan, Neslihan Bulut, Greg Kochanski, Liam MacDermed, Namrata Godbole, Shitao Weng, Jingjing Chen, Rachana Fellinger, Ramin Mehran, Daniel Suo, Hisham Husain, Tong He, Kaushal Patel, Joshua Howland, Randall Parker, Kelvin Nguyen, Sharath Maddineni, Chris Rawles, Mina Khan, Shlomi Cohen-Ganor, Amol Mandhane, Xinyi Wu, Chenkai Kuang, Iulia Comşa, Ramya Ganeshan, Hanie Sedghi, Adam Bloniarz, Nuo Wang Pierse, Anton Briukhov, Petr Mitrichev, Anita Gergely, Serena Zhan, Allan Zhou, Nikita Saxena, Eva Lu, Josef Dean, Ashish Gupta, Nicolas Perez-Nieves, Renjie Wu, Cory McLean, Wei Liang, Disha Jindal, Anton Tsitsulin, Wenhao Yu, Kaiz Alarakyia, Tom Schaul, Piyush Patil, Peter Sung, Elijah Peake, Hongkun Yu, Feryal Behbahani, JD Co-Reyes, Alan Ansell, Sean Sun, Clara Barbu, Jonathan Lee, Seb Noury, James Allingham, Bilal Piot, Mohit Sharma, Christopher Yew, Ivan Korotkov, Bibo Xu, Demetra Brady, Goran Petrovic, Shibl Mourad, Claire Cui, Aditya Gupta, Parker Schuh, Saarthak Khanna, Anna Goldie, Abhinav Arora, Vadim Zubov, Amy Stuart, Mark Epstein, Yun Zhu, Jianqiao Liu, Yury Stuken, Ziyue Wang, Karolis Misiunas, Dee Guo, Ashleah Gill, Ale Hartman, Zaid Nabulsi, Aurko Roy, Aleksandra Faust, Jason Riesa, Ben Withbroe, Mengchao Wang, Marco Tagliasacchi, Andreea Marzoca, James Noraky, Serge Toropov, Malika Mehrotra, Bahram Raad, Sanja Deur, Steve Xu, Marianne Monteiro, Zhongru Wu, Yi Luan, Sam Ritter, Nick Li, Håvard Garnes, Yanzhang He, Martin Zlocha, Jifan Zhu, Matteo Hessel, Will Wu, Spandana Raj Babbula, Chizu Kawamoto, Yuanzhen Li, Mehadi Hassen, Yan Wang, Brian Wieder, James Freedman, Yin Zhang, Xinyi Bai, Tianli Yu, David Reitter, XiangHai Sheng, Mateo Wirth, Aditya Kini, Dima Damen, Mingcen Gao, Rachel Hornung, Michael Voznesensky, Brian Roark, Adhi Kuncoro, Yuxiang Zhou, Rushin Shah, Anthony Brohan, Kuangyuan Chen, James Wendt, David Rim, Paul Kishan Rubenstein, Jonathan Halcrow, Michelle Liu, Ty Geri, Yunhsuan Sung, Jane Shapiro, Shaan Bijwadia, Chris Duvarney, Christina Sorokin, Paul Natsev, Reeve Ingle, Pramod Gupta, Young Maeng, Ndaba Ndebele, Kexin Zhu, Valentin Anklin, Katherine Lee, Yuan Liu, Yaroslav Akulov, Shaleen Gupta, Guolong Su, Flavien Prost, Tianlin Liu, Vitaly Kovalev, Pol Moreno, Martin Scholz, Sam Redmond, Zongwei Zhou, Alex Castro-Ros, André Susano Pinto, Dia Kharrat, Michal Yarom, Rachel Saputro, Jannis Bulian, Ben Caine, Ji Liu, Abbas Abdolmaleki, Shariq Iqbal, Tautvydas Misiunas, Mikhail Sirotenko, Shefali Garg, Guy Bensky, Huan Gui, Xuezhi Wang, Raphael Koster, Mike Bernico, Da Huang, Romal Thoppilan, Trevor Cohn, Ben Golan, Wenlei Zhou, Andrew Rosenberg, Markus Freitag, Tynan Gangwani, Vincent Tsang, Anand Shukla, Xiaoqi Ren, Minh Giang, Chi Zou, Andre Elisseeff, Charline Le Lan, Dheeru Dua, Shuba Lall, Pranav Shyam, Frankie Garcia, Sarah Nguyen, Michael Guzman, AJ Maschinot, Marcello Maggioni, Ming-Wei Chang, Karol Gregor, Lotte Weerts, Kumaran Venkatesan, Bogdan Damoc, Leon Liu, Jan Wassenberg, Lewis Ho, Becca Roelofs, Majid Hadian, François-Xavier Aubet, Yu Liang, Sami Lachgar, Danny Karmon, Yong Cheng, Amelio Vázquez-Reina, Angie Chen, Zhuyun Dai, Andy Brock, Shubham Agrawal, Chenxi Pang, Peter Garst, Mariella Sanchez-Vargas, Ivor Rendulic, Aditya Ayyar, Andrija Ražnatović, Olivia Ma, Roopali Vij, Neha Sharma, Ashwin Balakrishna, Bingyuan Liu, Ian Mackinnon, Sorin Baltateanu, Petra Poklukar, Gabriel Ibagon, Colin Ji, Hongyang Jiao, Isaac Noble, Wojciech Stokowiec, Zhihao Li, Jeff Dean, David Lindner, Mark Omernick, Kristen Chiafullo, Mason Dimarco, Vitor Rodrigues, Vittorio Selo, Garrett Honke, Xintian, Wu, Wei He, Adam Hillier, Anhad Mohananey, Vihari Piratla, Chang Ye, Chase Malik, Sebastian Riedel, Samuel Albanie, Zi Yang, Kenny Vassigh, Maria Bauza, Sheng Li, Yiqing Tao, Nevan Wichers, Andrii Maksai, Abe Ittycheriah, Ross Mcilroy, Bryan Seybold, Noah Goodman, Romina Datta, Steven M. Hernandez, Tian Shi, Yony Kochinski, Anna Bulanova, Ken Franko, Mikita Sazanovich, Nicholas FitzGerald, Praneeth Kacham, Shubha Srinivas Raghvendra, Vincent Hellendoorn, Alexander Grushetsky, Julian Salazar, Angeliki Lazaridou, Jason Chang, Jan-Thorsten Peter, Sushant Kafle, Yann Dauphin, Abhishek Rao, Filippo Graziano, Izhak Shafran, Yuguo Liao, Tianli Ding, Geng Yan, Grace Chu, Zhao Fu, Vincent Roulet, Gabriel Rasskin, Duncan Williams, Shahar Drath, Alex Mossin, Raphael Hoffmann, Jordi Orbay, Francesco Bertolini, Hila Sheftel, Justin Chiu, Siyang Xue, Yuheng Kuang, Ferjad Naeem, Swaroop Nath, Nana Nti, Phil Culliton, Kashyap Krishnakumar, Michael Isard, Pei Sun, Ayan Chakrabarti, Nathan Clement, Regev Cohen, Arissa Wongpanich, GS Oh, Ashwin Murthy, Hao Zheng, Jessica Hamrick, Oskar Bunyan, Suhas Ganesh, Nitish Gupta, Roy Frostig, John Wieting, Yury Malkov, Pierre Marcenac, Zhixin, Lai, Xiaodan Tang, Mohammad Saleh, Fedir Zubach, Chinmay Kulkarni, Huanjie Zhou, Vicky Zayats, Nan Ding, Anshuman Tripathi, Arijit Pramanik, Patrik Zochbauer, Harish Ganapathy, Vedant Misra, Zach Behrman, Hugo Vallet, Mingyang Zhang, Mukund Sridhar, Ye Jin, Mohammad Babaeizadeh, Siim Põder, Megha Goel, Divya Jain, Tajwar Nasir, Shubham Mittal, Tim Dozat, Diego Ardila, Aliaksei Severyn, Fabio Pardo, Sammy Jerome, Siyang Qin, Louis Rouillard, Amir Yazdanbakhsh, Zizhao Zhang, Shivani Agrawal, Kaushik Shivakumar, Caden Lu, Praveen Kallakuri, Rachita Chhaparia, Kanishka Rao, Charles Kwong, Asya Fadeeva, Shitij Nigam, Yan Virin, Yuan Zhang, Balaji Venkatraman, Beliz Gunel, Marc Wilson, Huiyu Wang, Abhinav Gupta, Xiaowei Xu, Adrien Ali Taïga, Kareem Mohamed, Doug Fritz, Daniel Rodriguez, Zoubin Ghahramani, Harry Askham, Lior Belenki, James Zhao, Rahul Gupta, Krzysztof Jastrzębski, Takahiro Kosakai, Kaan Katircioglu, Jon Schneider, Rina Panigrahy, Konstantinos Bousmalis, Peter Grabowski, Prajit Ramachandran, Chaitra Hegde, Mihaela Rosca, Angelo Scorza Scarpati, Kyriakos Axiotis, Ying Xu, Zach Gleicher, Assaf Hurwitz Michaely, Mandar Sharma, Sanil Jain, Christoph Hirnschall, Tal Marian, Xuhui Jia, Kevin Mather, Kilol Gupta, Linhai Qiu, Nigamaa Nayakanti, Lucian Ionita, Steven Zheng, Lucia Loher, Kurt Shuster, Igor Petrovski, Roshan Sharma, Rahma Chaabouni, Angel Yeh, James An, Arushi Gupta, Steven Schwarcz, Seher Ellis, Sam Conway-Rahman, Javier Snaider, Alex Zhai, James Atwood, Daniel Golovin, Liqian Peng, Te I, Vivian Xia, Salvatore Scellato, Mahan Malihi, Arthur Bražinskas, Vlad-Doru Ion, Younghoon Jun, James Swirhun, Soroosh Mariooryad, Jiao Sun, Steve Chien, Rey Coaguila, Ariel Brand, Yi Gao, Tom Kwiatkowski, Roee Aharoni, Cheng-Chun Lee, Mislav Žanić, Yichi Zhang, Dan Ethier, Vitaly Nikolaev, Pranav Nair, Yoav Ben Shalom, Hen Fitoussi, Jai Gupta, Hongbin Liu, Dee Cattle, Tolga Bolukbasi, Ben Murdoch, Fantine Huot, Yin Li, Chris Hahn

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

nan

Article 358

Title@2025-07-07 (1): Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

Title: Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

Verbesserung des personalisierten Multi-Turn-Dialogs mit Kuriositätslohn

增强个人化多发人对话与好奇奖的多发对话 2504.03206v2

Authors (5): Yanming Wan, Jiaxing Wu, Marwa Abdulhai, Lior Shani, Natasha Jaques

Effective conversational agents like large language models (LLMs) must personalize their interactions to adapt to user preferences, personalities, and attributes across diverse domains like education and healthcare. Current methods like Reinforcement Learning from Human Feedback (RLHF), often prioritize helpfulness and safety but fall short in fostering truly empathetic, adaptive, and personalized dialogues. Existing personalization approaches typically rely on extensive user history, limiting their effectiveness for new or context-limited users. To address these limitations, we propose leveraging a user model to incorporate a curiosity-based intrinsic reward into multi-turn RLHF. This novel reward mechanism encourages the LLM agent to actively infer user traits by optimizing conversations to improve its user model’s accuracy. Consequently, the agent delivers more personalized interactions by learning more about the user. We demonstrate our method’s effectiveness in two distinct domains: significantly improving personalization performance in a conversational recommendation task, and personalizing conversations for different learning styles in an educational setting. We show improved generalization capabilities compared to traditional multi-turn RLHF, all while maintaining conversation quality. Our method offers a promising solution for creating more personalized, adaptive, and engaging conversational agents.

nan

Article 359

Title@2025-07-07 (1): Interleaving Logic and Counting

Title: Interleaving Logic and Counting

Interleaving Logik und Zählen

中间逻辑和计数 2507.05219v1

Authors (2): Johan van Benthem, Thomas Icard

Reasoning with quantifier expressions in natural language combines logical and arithmetical features, transcending strict divides between qualitative and quantitative. Our topic is this cooperation of styles as it occurs in common linguistic usage and its extension into the broader practice of natural language plus “grassroots mathematics”. We begin with a brief review of first-order logic with counting operators and cardinality comparisons. This system is known to be of high complexity, and drowns out finer aspects of the combination of logic and counting. We move to a small fragment that can represent numerical syllogisms and basic reasoning about comparative size: monadic first-order logic with counting. We provide normal forms that allow for axiomatization, determine which arithmetical notions can be defined on finite and on infinite models, and conversely, we discuss which logical notions can be defined out of purely arithmetical ones, and what sort of (non-)classical logics can be induced. Next, we investigate a series of strengthenings, again using normal form methods. The monadic second-order version is close, in a precise sense, to additive Presburger Arithmetic, while versions with the natural device of tuple counting take us to Diophantine equations, making the logic undecidable. We also define a system that combines basic modal logic over binary accessibility relations with counting, needed to formulate ubiquitous reasoning patterns such as the Pigeonhole Principle. We return to our starting point in natural language, confronting the architecture of our formal systems with linguistic quantifier vocabulary and syntax. We conclude with some general thoughts on yet further entanglements of logic and counting in formal systems, on rethinking the qualitative/quantitative divide, and on connecting our analysis to empirical findings in cognitive science.

nan

Article 360

Title@2025-07-07 (1): Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation

Title: Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation

Überprüfung von wissenschaftlichen Arbeiten für kritische Probleme mit reasoning LLMs: Basisansätze und automatische Bewertung

审查关于有理有理的关键问题的科学文件:基线方法和自动评价 2505.23824v2

Authors (2): Tianmai M. Zhang, Neil F. Abernethy

Recent advancements in large language models have sparked interest in utilizing them to aid the peer review process of scientific publication amid the peer review crisis. However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews. As an alternative, we propose adopting LLMs as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs from multiple vendors and assessed their performance and API costs for identifying critical errors and unsoundness problems in scientific papers. o3 exhibited the best problem identification performance among all models at a modest cost. This paper provides insights into document-based scientific understanding/reasoning and lays a foundation for future applications. Our dataset, code, and model outputs are publicly available.

nan

Article 361

Title@2025-07-07 (1): Pre-Trained Policy Discriminators are General Reward Models

Title: Pre-Trained Policy Discriminators are General Reward Models

Pre-Trained Policy Diskriminatoren sind allgemeine Prämienmodelle

培训前政策歧视者是一般奖励模式 2507.05197v1

Authors (22): Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen

We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance–improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.

nan

Article 362

Title@2025-07-07 (1): NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge

Title: NativQA Framework: Enabling LLMs with Native, Local, and Everyday Knowledge

NativQA Framework: LLMs mit nativem, lokalem und alltäglichem Wissen aktivieren

NativQA框架:利用土著、地方和日常知识扶持LMs 2504.05995v2

Authors (6): Firoj Alam, Md Arid Hasan, Sahinur Rahman Laskar, Mucahid Kutlu, Kareem Darwish, Shammur Absar Chowdhury

The rapid advancement of large language models (LLMs) has raised concerns about cultural bias, fairness, and their applicability in diverse linguistic and underrepresented regional contexts. To enhance and benchmark the capabilities of LLMs, there is a need to develop large-scale resources focused on multilingual, local, and cultural contexts. In this study, we propose the NativQA framework, which can seamlessly construct large-scale, culturally and regionally aligned QA datasets in native languages. The framework utilizes user-defined seed queries and leverages search engines to collect location-specific, everyday information. It has been evaluated across 39 locations in 24 countries and in 7 languages – ranging from extremely low-resource to high-resource languages – resulting in over 300K Question-Answer (QA) pairs. The developed resources can be used for LLM benchmarking and further fine-tuning. The framework has been made publicly available for the community (https://gitlab.com/nativqa/nativqa-framework).

nan

Article 363

Title@2025-07-07 (1): From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations

Title: From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations

Von Fragmenten zu Fakten: Ein Curriculum-getriebener DPO-Ansatz zur Generierung von Hindi News Veracity Erklärungen

《从零碎到事实:产生印地语新闻的多城市解释:课程驱动的DPO方法》 2507.05179v1

Authors (5): Pulkit Bansal, Raghvendra Kumar, Shakti Singh, Sriparna Saha, Adam Jatowt

In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters – Actuality and Finesse – into the DPO loss function, enhancing explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework’s effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.

nan

Article 364

Title@2025-07-07 (1): SEPSIS: I Can Catch Your Lies – A New Paradigm for Deception Detection

Title: SEPSIS: I Can Catch Your Lies – A New Paradigm for Deception Detection

SEPSIS: Ich kann Ihre Lügen fangen – Ein neues Paradigma für Täuschungserkennung

我可以抓住你的谎言– 一种新模型用来侦测欺骗 2312.00292v2

Authors (8): Anku Rani, Dwip Dalal, Shreya Gautam, Pankaj Gupta, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das

Deception is the intentional practice of twisting information. It is a nuanced societal practice deeply intertwined with human societal evolution, characterized by a multitude of facets. This research explores the problem of deception through the lens of psychology, employing a framework that categorizes deception into three forms: lies of omission, lies of commission, and lies of influence. The primary focus of this study is specifically on investigating only lies of omission. We propose a novel framework for deception detection leveraging NLP techniques. We curated an annotated dataset of 876,784 samples by amalgamating a popular large-scale fake news dataset and scraped news headlines from the Twitter handle of the Times of India, a well-known Indian news media house. Each sample has been labeled with four layers, namely: (i) the type of omission (speculation, bias, distortion, sounds factual, and opinion), (ii) colors of lies(black, white, etc), and (iii) the intention of such lies (to influence, etc) (iv) topic of lies (political, educational, religious, etc). We present a novel multi-task learning pipeline that leverages the dataless merging of fine-tuned language models to address the deception detection task mentioned earlier. Our proposed model achieved an F1 score of 0.87, demonstrating strong performance across all layers, including the type, color, intent, and topic aspects of deceptive content. Finally, our research explores the relationship between lies of omission and propaganda techniques. To accomplish this, we conducted an in-depth analysis, uncovering compelling findings. For instance, our analysis revealed a significant correlation between loaded language and opinion, shedding light on their interconnectedness. To encourage further research in this field, we are releasing the SEPSIS dataset and code at https://huggingface.co/datasets/ankurani/deception.

nan

Article 365

Title@2025-07-07 (1): Critiques of World Models

Title: Critiques of World Models

Kritik an Weltmodellen

世界模式的证明 2507.05169v1

Authors (4): Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu

World Model, the supposed algorithmic surrogate of the real-world environment which biological agents experience with and act upon, has been an emerging topic in recent years because of the rising needs to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of “hypothetical thinking” in psychology literature, we offer critiques of several schools of thoughts on world modeling, and argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervision learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

nan

Article 366

Title@2025-07-07 (1): Language Models can Self-Improve at State-Value Estimation for Better Search

Title: Language Models can Self-Improve at State-Value Estimation for Better Search

Sprachmodelle können sich bei State-Value-Schätzung für eine bessere Suche selbst improvisieren

语言模型可以在国家估价估算中自我改进,以更好地搜索 2503.02878v2

Authors (2): Ethan Mendes, Alan Ritter

Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive and time consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead (STL), a self-supervised method that leverages state-transition dynamics to improve a value model capable of effectively guiding language model-controlled search without any labeled data. We find that moderately sized (8 billion parameters) open-weight value models improved with STL can match the performance of using a gpt-4o value model. Furthermore, we find that specialized value models learned with STL can be deployed with computationally lightweight search algorithms, achieving performance that matches that of more expensive tree search methods, while reducing costs by an order of magnitude.

nan

Article 367

Title@2025-07-07 (1): InfoSteer: Steering Information Utility in Language Model Post-Training

Title: InfoSteer: Steering Information Utility in Language Model Post-Training

InfoSteer: Steuerungsinformationsdienst im Sprachmodell nach dem Training

InfoSteer:指导性语言示范语言信息工具 2507.05158v1

Authors (3): Chunyuan Deng, Ruidi Chang, Hanjie Chen

Recent advancements in language models (LMs) gradually ushered in an era where post-training is crucial. Yet, post-training approaches such as supervised fine-tuning (SFT) do not guarantee effective use of knowledge acquired during pretraining. We therefore present \ours, a lightweight method that encourages parametric information utilization in LMs during post-training. This is achieved via treating FFN layer as associate key-value memory, and promotes the use of stored memory vectors via forward-pass interventions or regularization during backpropagation. We find this simple guidance during post-training phase delivers consistent performance improvements across diverse model families–including Qwen, Gemma and Llama-spanning over 15 downstream tasks in both ID and OOD evaluations. Beyond performance gains, we also find that steered LMs can adaptively allocate information-placing more emphasis on generating semantically meaningful tokens, while using fewer resources on simple transition ones (e.g., ,' or and’). Our work underscores that vanilla post-training does not fully leverage pre-training potential, and steering LMs in latent representation space offers a promising approach that enhances both performance and interpretability.

nan

Article 368

Title@2025-07-07 (1): AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models

Title: AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models

KI Generated Text Detection mit Anleitung Feinabstimmung große Sprache und Transformer-basierte Modelle

使用微调大语言和以变换器为基础的模型生成的指令 2507.05157v1

Authors (5): Chinnappa Guggilla, Budhaditya Roy, Trupti Ramdas Chavan, Abdul Rahman, Edward Bowen

Large Language Models (LLMs) possess an extraordinary capability to produce text that is not only coherent and contextually relevant but also strikingly similar to human writing. They adapt to various styles and genres, producing content that is both grammatically correct and semantically meaningful. Recently, LLMs have been misused to create highly realistic phishing emails, spread fake news, generate code to automate cyber crime, and write fraudulent scientific articles. Additionally, in many real-world applications, the generated content including style and topic and the generator model are not known beforehand. The increasing prevalence and sophistication of artificial intelligence (AI)-generated texts have made their detection progressively more challenging. Various attempts have been made to distinguish machine-generated text from human-authored content using linguistic, statistical, machine learning, and ensemble-based approaches. This work focuses on two primary objectives Task-A, which involves distinguishing human-written text from machine-generated text, and Task-B, which attempts to identify the specific LLM model responsible for the generation. Both of these tasks are based on fine tuning of Generative Pre-trained Transformer (GPT_4o-mini), Large Language Model Meta AI (LLaMA) 3 8B, and Bidirectional Encoder Representations from Transformers (BERT). The fine-tuned version of GPT_4o-mini and the BERT model has achieved accuracies of 0.9547 for Task-A and 0.4698 for Task-B.

nan

Article 369

Title@2025-07-07 (1): Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization

Title: Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization

Interpretierbare Mnemonic-Generation für Kanji-Lernen über Erwartungs-Maximierung

通过预期-最大化为Kanji学习提供可解释的中子一代 2507.05137v1

Authors (3): Jaewook Lee, Alexander Scarlatos, Andrew Lan

Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also complicated due to their complexity and volume. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.

nan

Article 370

Title@2025-07-07 (1): SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction

Title: SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction

SMART: Simulierte Schüler ausgerichtet mit Item Response Theorie für Frage Schwierigkeitsvorhersage

SMART: 与 “ 问题困难预测 “ 项目反应理论一致的模拟学生 2507.05129v1

Authors (5): Alexander Scarlatos, Nigel Fernandez, Christopher Ormerod, Susan Lottridge, Andrew Lan

Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get item difficulty estimates. This approach cannot be applied to the cold-start setting for previously unseen items either. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with an LLM-based scoring model, and fit the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on a real-world student response dataset, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.

nan

Article 371

Title@2025-07-07 (1): An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques

Title: An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques

Eine Bewertung von großen Sprachmodellen zu Textzusammenfassungsaufgaben mittels Prompt Engineering Techniques

评价关于使用即时工程技术进行文本归纳任务的大语言模型 2507.05123v1

Authors (3): Walid Mohamed Aly, Taysir Hassan A. Soliman, Amr Mohamed AbdelAziz

Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques including zero-shot and in-context learning, our study evaluates the performance using the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference times is conducted to better understand the trade-off between summarization quality and computational efficiency. For Long documents, introduce a sentence-based chunking strategy that enables LLMs with shorter context windows to summarize extended inputs in multiple stages. The findings reveal that while LLMs perform competitively on news and dialog tasks, their performance on long scientific documents improves significantly when aided by chunking strategies. In addition, notable performance variations were observed based on model parameters, dataset properties, and prompt design. These results offer actionable insights into how different LLMs behave across task types, contributing to ongoing research in efficient, instruction-based NLP systems.

nan

Article 372

Title@2025-07-07 (1): Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration

Title: Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration

Wiederbelebung des Kulturerbes: Ein neuartiger Ansatz für eine umfassende Restaurierung historischer Dokumente

恢复文化遗产:全面恢复历史文件的新办法 2507.05108v1

Authors (8): Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi, Pengwei Liu, Fengjun Guo, Lianwen Jin

Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians’ restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR’s remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83\% to 84.05\%, with further enhancement to 94.25\% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.

nan

Article 373

Title@2025-07-07 (1): LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review

Title: LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review

LCDS: Logisch gesteuerte Entladung Zusammenfassung Erzeugung System Unterstützung Quelle Attribution und Experten-Review

LCDS: 逻辑控制排放控制简要生成系统支持源来源归属和专家审查 2507.05319v1

Authors (8): Cheng Yuan, Xinkai Rui, Yongqi Fan, Yawei Fan, Boyang Zhong, Jiacheng Wang, Weiyan Zhang, Tong Ruan

Despite the remarkable performance of Large Language Models (LLMs) in automated discharge summary generation, they still suffer from hallucination issues, such as generating inaccurate content or fabricating information without valid sources. In addition, electronic medical records (EMRs) typically consist of long-form data, making it challenging for LLMs to attribute the generated content to the sources. To address these challenges, we propose LCDS, a Logic-Controlled Discharge Summary generation system. LCDS constructs a source mapping table by calculating textual similarity between EMRs and discharge summaries to constrain the scope of summarized content. Moreover, LCDS incorporates a comprehensive set of logical rules, enabling it to generate more reliable silver discharge summaries tailored to different clinical fields. Furthermore, LCDS supports source attribution for generated content, allowing experts to efficiently review, provide feedback, and rectify errors. The resulting golden discharge summaries are subsequently recorded for incremental fine-tuning of LLMs. Our project and demo video are in the GitHub repository https://github.com/ycycyc02/LCDS.

nan

Article 374

Title@2025-07-07 (1): AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics

Title: AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics

KI-getriebene Zytomorphologie Bildsynthese für medizinische Diagnostik

AI-Driven 用于医学诊断的细胞形态图像合成 2507.05063v1

Authors (3): Jan Carreras Boada, Rao Muhammad Umer, Carsten Marr

Biomedical datasets often contain a large sample imbalance and are subject to strict privacy constraints, which together hinder the development of accurate machine learning models. One potential solution is to generate synthetic images, as this can improve data availability while preserving patient privacy. However, it remains difficult to generate synthetic images of sufficient quality for training robust classifiers. In this work, we focus on the classification of single white blood cells, a key component in the diagnosis of hematological diseases such as acute myeloid leukemia (AML), a severe blood cancer. We demonstrate how synthetic images generated with a fine-tuned stable diffusion model using LoRA weights when guided by real few-shot samples of the target white blood cell classes, can enhance classifier performance for limited data. When training a ResNet classifier, accuracy increased from 27.3\% to 78.4\% (+51.1\%) by adding 5000 synthetic images per class to a small and highly imbalanced real dataset. For a CLIP-based classifier, the accuracy improved from 61.8\% to 76.8\% (+15.0\%). The synthetic images are highly similar to real images, and they can help overcome dataset limitations, enhancing model generalization. Our results establish synthetic images as a tool in biomedical research, improving machine learning models, and facilitating medical diagnosis and research.

nan

Article 375

Title@2025-07-07 (1): End-to-End Evaluation for Low-Latency Simultaneous Speech Translation

Title: End-to-End Evaluation for Low-Latency Simultaneous Speech Translation

End-to-End-Bewertung für simultane Sprachübersetzungen mit geringer Latenz

低长期同声同声发言翻译端对端评价 2308.03415v4

Authors (13): Christian Huber, Tu Anh Dinh, Carlos Mullov, Ngoc Quan Pham, Thai Binh Nguyen, Fabian Retkowski, Stefan Constantin, Enes Yavuz Ugan, Danni Liu, Zhaolin Li, Sai Koneru, Jan Niehues, Alexander Waibel

The challenge of low-latency speech translation has recently draw significant interest in the research community as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated and often it is not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion. This includes the segmentation of the audio as well as the run-time of the different components. Secondly, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded as well as end-to-end systems. Finally, the framework allows to automatically evaluate the translation quality as well as latency and also provides a web interface to show the low-latency model outputs to the user.

nan

Article 376

Title@2025-07-07 (1): Using Large Multimodal Models to Extract Knowledge Components for Knowledge Tracing from Multimedia Question Information

Title: Using Large Multimodal Models to Extract Knowledge Components for Knowledge Tracing from Multimedia Question Information

Verwendung von großen multimodalen Modellen zum Extrahieren von Wissenskomponenten für die Wissensverfolgung aus Multimedia-Frageinformationen

利用大型多模式模式从多媒体问题信息中提取知识组成部分,以从多媒体问题信息中追踪知识 2409.20167v2

Authors (4): Hyeongdon Moon, Richard Davis, Seyed Parsa Neshaei, Pierre Dillenbourg

Knowledge tracing models have enabled a range of intelligent tutoring systems to provide feedback to students. However, existing methods for knowledge tracing in learning sciences are predominantly reliant on statistical data and instructor-defined knowledge components, making it challenging to integrate AI-generated educational content with traditional established methods. We propose a method for automatically extracting knowledge components from educational content using instruction-tuned large multimodal models. We validate this approach by comprehensively evaluating it against knowledge tracing benchmarks in five domains. Our results indicate that the automatically extracted knowledge components can effectively replace human-tagged labels, offering a promising direction for enhancing intelligent tutoring systems in limited-data scenarios, achieving more explainable assessments in educational settings, and laying the groundwork for automated assessment.

nan

Article 377

Title@2025-07-07 (1): Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Title: Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Warum kämpfen Open Source LLMs mit Datenanalyse? Eine systematische empirische Studie

开放源码LLMs为何要与数据分析斗争?系统的经验研究 2506.19794v2

Authors (10): Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang

Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate models across three dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs’ analytical reasoning capabilities.

nan

Article 378

Title@2025-07-07 (1): Verified Language Processing with Hybrid Explainability: A Technical Report

Title: Verified Language Processing with Hybrid Explainability: A Technical Report

Verifizierte Sprachverarbeitung mit hybrider Erklärbarkeit: Ein technischer Bericht

经核实的有混合解释性的语文处理:技术报告 2507.05017v1

Authors (3): Oliver Robert Fox, Giacomo Bergami, Graham Morgan

The volume and diversity of digital information have led to a growing reliance on Machine Learning techniques, such as Natural Language Processing, for interpreting and accessing appropriate data. While vector and graph embeddings represent data for similarity tasks, current state-of-the-art pipelines lack guaranteed explainability, failing to determine similarity for given full texts accurately. These considerations can also be applied to classifiers exploiting generative language models with logical prompts, which fail to correctly distinguish between logical implication, indifference, and inconsistency, despite being explicitly trained to recognise the first two classes. We present a novel pipeline designed for hybrid explainability to address this. Our methodology combines graphs and logic to produce First-Order Logic representations, creating machine- and human-readable representations through Montague Grammar. Preliminary results indicate the effectiveness of this approach in accurately capturing full text similarity. To the best of our knowledge, this is the first approach to differentiate between implication, inconsistency, and indifference for text classification tasks. To address the limitations of existing approaches, we use three self-contained datasets annotated for the former classification task to determine the suitability of these approaches in capturing sentence structure equivalence, logical connectives, and spatiotemporal reasoning. We also use these data to compare the proposed method with language models pre-trained for detecting sentence entailment. The results show that the proposed method outperforms state-of-the-art models, indicating that natural language understanding cannot be easily generalised by training over extensive document corpora. This work offers a step toward more transparent and reliable Information Retrieval from extensive textual data.

nan

Article 379

Title@2025-07-07 (1): Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification

Title: Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification

Co-DETECT: Collaborative Discovery of Edge Cases in der Textklassifikation

共同目的:在文本分类中合作发现边缘情况 2507.05010v1

Authors (13): Chenfei Xiong, Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Lorena Calvo-Bartolomé, Alexander Hoyle, Zhijing Jin, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Mennatallah El-Assady, Elliott Ash

We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists user in incorporating edge case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. Extensive user study, qualitative and quantitative analyses prove the effectiveness of Co-DETECT.

nan

Article 380

Title@2025-07-07 (1): From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems

Title: From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems

Von der Autonomie zur Agentur: Agentenfahrzeuge für Mensch-zentrierte Mobilitätssysteme

从自主到机构:以人为中心的流动系统机动车辆 2507.04996v1

Authors (1): Jiangbo Yu

Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Accordingly, autonomous vehicles (AuVs) are defined as systems capable of perceiving their environment and executing preprogrammed tasks independently of external input. However, both research and real-world deployments increasingly showcase vehicles that demonstrate behaviors beyond this definition (including the SAE levels 1 to 6), such as interaction with humans and machines, goal adaptation, contextual reasoning, external tool use, and long-term planning, particularly with the integration of large language models (LLMs) and agentic AI systems. These developments reveal a conceptual gap between technical autonomy and the broader cognitive and social capabilities needed for future human-centered mobility systems. To address this, we introduce the concept of agentic vehicles (AgVs), referring to vehicles that integrate agentic AI to reason, adapt, and interact within complex environments. This paper presents a systems-level framework to characterize AgVs, focusing on their cognitive and communicative layers and differentiating them from conventional AuVs. It synthesizes relevant advances in agentic AI, robotics, multi-agent systems, and human-machine interaction, and highlights how agentic AI, through high-level reasoning and tool use, can function not merely as computational tools but as interactive agents embedded in mobility ecosystems. The paper concludes by identifying key challenges in the development and governance of AgVs, including safety, real-time control, public acceptance, ethical alignment, and regulatory frameworks.

nan

Article 381

Title@2025-07-07 (1): Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models

Title: Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models

Kann Video LLMs zur Antwort ablehnen? Ausrichtung für Beantwortbarkeit in Video Large Language Models

视频大语言模式的可响应性对齐 2507.04976v1

Authors (4): Eunseop Yoon, Hee Suk Yoon, Mark A. Hasegawa-Johnson, Chang D. Yoo

In the broader context of deep learning, Multimodal Large Language Models have achieved significant breakthroughs by leveraging powerful Large Language Models as a backbone to align different modalities into the language space. A prime exemplification is the development of Video Large Language Models (Video-LLMs). While numerous advancements have been proposed to enhance the video understanding capabilities of these models, they are predominantly trained on questions generated directly from video content. However, in real-world scenarios, users often pose questions that extend beyond the informational scope of the video, highlighting the need for Video-LLMs to assess the relevance of the question. We demonstrate that even the best-performing Video-LLMs fail to reject unfit questions-not necessarily due to a lack of video understanding, but because they have not been trained to identify and refuse such questions. To address this limitation, we propose alignment for answerability, a framework that equips Video-LLMs with the ability to evaluate the relevance of a question based on the input video and appropriately decline to answer when the question exceeds the scope of the video, as well as an evaluation framework with a comprehensive set of metrics designed to measure model behavior before and after alignment. Furthermore, we present a pipeline for creating a dataset specifically tailored for alignment for answerability, leveraging existing video-description paired datasets.

nan

Article 382

Title@2025-07-07 (1): ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Title: ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

ArtefakteBench: Überbrückung der visuell-interaktiven Lücke in der LLM-Codegenerierung

人工合成:弥合LLM代码生成评估中的视觉互动差距 2507.04952v1

Authors (32): Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian

The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.

nan

Article 383

Title@2025-07-07 (1): ReLoop: “Seeing Twice and Thinking Backwards” via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding

Title: ReLoop: “Seeing Twice and Thinking Backwards” via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding

ReLoop: “Zweimal sehen und zurückdenken” über Closed-Loop-Training zu Mitigate-Halluzinationen im multimodalen Verständnis

ReLooop:通过闭路培训,为多模式理解的假幻觉提供“双眼和反向思考” 2507.04943v1

Authors (3): Jianjiang Yang, Ziyan Huang, Yanshu Li

While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in open-ended visual question answering, they remain vulnerable to hallucinations. These are outputs that contradict or misrepresent input semantics, posing a critical challenge to the reliability and factual consistency. Existing methods often rely on external verification or post-hoc correction, lacking an internal mechanism to validate outputs directly during training. To bridge this gap, we propose ReLoop, a unified closed-loop training framework that encourages multimodal consistency for cross-modal understanding in MLLMs. ReLoop adopts a ring-shaped structure that integrates three complementary consistency feedback mechanisms, obliging MLLMs to “seeing twice and thinking backwards”. Specifically, ReLoop employs the frozen Consistency Feedback Plugin (CFP), comprising semantic reconstruction, visual description, and an attention supervision module for attention alignment. These components collectively enforce semantic reversibility, visual consistency, and interpretable attention, enabling the model to correct its outputs during training. Extensive evaluations and analyses demonstrate the effectiveness of ReLoop in reducing hallucination rates across multiple benchmarks, establishing a robust method for hallucination mitigation in MLLMs. We will release our source code and data in the camera-ready version.

nan

Article 384

Title: Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science?

Sind Informations-Retrieval Ansätze gut bei der Harmonisierung Longitudinal Survey Fragen in der Sozialwissenschaft?

信息检索方法在协调社会科学的纵向调查问题方面是否很好? 2504.20679v2

Authors (4): Wing Yan Li, Zeqiang Wang, Jon Johnson, Suparna De

Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces dual challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies as well as between question and response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multi-disciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task of identifying concept (e.g. Housing, Job, etc.) equivalence across question and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance with other approaches performing comparably. Additionally, the re-ranking of the probabilistic model’s results with neural models only introduces modest improvements of 0.07 at most in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have a low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mismatched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.

nan

Article 385

Title@2025-07-07 (1): Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes

Title: Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes

Verstehen LLMs die Sicherheit ihrer Eingaben? Schulungsfreie Moderation über Latent Prototypen

LLM女士了解其投入的安全性吗?通过中原原型进行无培训的调节 2502.16174v2

Authors (6): Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert

With the rise of LLMs, ensuring model safety and alignment has become a critical concern. While modern instruction-finetuned LLMs incorporate alignment during training, they still frequently require moderation tools to prevent unsafe behavior. The most common approach to moderation are guard models that flag unsafe inputs. However, guards require costly training and are typically limited to fixed-size, pre-trained options, making them difficult to adapt to evolving risks and resource constraints. We hypothesize that instruction-finetuned LLMs already encode safety-relevant information internally and explore training-free safety assessment methods that work with off-the-shelf models. We show that simple prompting allows models to recognize harmful inputs they would otherwise mishandle. We also demonstrate that safe and unsafe prompts are distinctly separable in the models’ latent space. Building on this, we introduce the Latent Prototype Moderator (LPM), a training-free moderation method that uses Mahalanobis distance in latent space to assess input safety. LPM is a lightweight, customizable add-on that generalizes across model families and sizes. Our method matches or exceeds state-of-the-art guard models across multiple safety benchmarks, offering a practical and flexible solution for scalable LLM moderation.

nan

Article 386

Title@2025-07-07 (1): O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d’apprentissage pour l’adaptation de modèles de langue à un domaine

Title: O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d’apprentissage pour l’adaptation de modèles de langue à un domaine

O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d’apprentissage pour l’adaptation de modèles de langue à un domaine

O_FT@EvalLLLM2025:非属域适应适应模式的技术和技术评估比较 2507.04895v1

Authors (8): Ismaël Rousseau, Claire Perroux, Pierre Adam, Thomas Girault, Lionel Delphin-Poulat, Morgan Veyret, Gwénolé Lecorvé, Géraldine Damnati

This paper presents the work carried out by the O_FT team, joint with Orange and Ouest-France, on adapting language models to the defense domain as part of the EvalLLM2025 challenge. This work focused on adapting the \texttt{Mistral-7B-Instruct-v0.3} model using classical techniques of continued pre-training and instruction-tuning. The core of our efforts is based on collecting, generating, and selecting data for these two stages as well as for model evaluation. Experiments show that our adapted models have better domain-specific knowledge and improved domain-specific task processing skills, along with comparable (or even superior) performance on general knowledge and skills. Considering the carbon footprint of our adaptations, this work demonstrates the feasibility of domain adaptation for relatively small models. – Ce document pr'esente les travaux r'ealis'es par l’'equipe O_FT conjointe `a Orange et Ouest-France sur l’adaptation de mod`eles de langue au domaine de la d'efense dans le cadre du challenge EvalLLM2025. Ces travaux se sont concentr'es sur l’adaptation du mod`ele \texttt{Mistral-7B-Instruct-v0.3} avec des techniques classiques de poursuite du pr'e-entra\^inement et d’affinage sur instructions. L’essentiel de nos travaux a port'e sur la constitution, g'en'eration et s'election de donn'ees pour ces deux 'etapes ainsi que pour l’'evaluation des mod`eles. Les exp'eriences montrent que nos mod`eles adapt'es ont de meilleures de connaissances de fond et une meilleure capacit'e de traitement de t\^aches sur le domaine de la d'efense, ainsi que des performances comparables (voire sup'erieures) sur des connaissances ou capacit'es g'en'eralistes. Mis au regard des empreintes carbones de nos adaptations, ces travaux d'emontrent ainsi la viabilit'e de l’adaptation `a un domaine de mod`eles relativement petits.

nan

Article 387

Title@2025-07-07 (1): MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction

Title: MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction

MARBLE: Eine regelbasierte LLM-Reasoning-Engine für die Vorhersage von Unfällen mit Schweregrad

事故严重性预测的多机构、以规则为基础的多机构、基于规则的LLM 理疗引擎 2507.04893v1

Authors (2): Kaleem Ullah Qasim, Jiashu Zhang

Accident severity prediction plays a critical role in transportation safety systems but is a persistently difficult task due to incomplete data, strong feature dependencies, and severe class imbalance in which rare but high-severity cases are underrepresented and hard to detect. Existing methods often rely on monolithic models or black box prompting, which struggle to scale in noisy, real-world settings and offer limited interpretability. To address these challenges, we propose MARBLE a multiagent rule based LLM engine that decomposes the severity prediction task across a team of specialized reasoning agents, including an interchangeable ML-backed agent. Each agent focuses on a semantic subset of features (e.g., spatial, environmental, temporal), enabling scoped reasoning and modular prompting without the risk of prompt saturation. Predictions are coordinated through either rule-based or LLM-guided consensus mechanisms that account for class rarity and confidence dynamics. The system retains structured traces of agent-level reasoning and coordination outcomes, supporting in-depth interpretability and post-hoc performance diagnostics. Across both UK and US datasets, MARBLE consistently outperforms traditional machine learning classifiers and state-of-the-art (SOTA) prompt-based reasoning methods including Chain-of-Thought (CoT), Least-to-Most (L2M), and Tree-of-Thought (ToT) achieving nearly 90% accuracy where others plateau below 48%. This performance redefines the practical ceiling for accident severity classification under real world noise and extreme class imbalance. Our results position MARBLE as a generalizable and interpretable framework for reasoning under uncertainty in safety-critical applications.

nan

Article 388

Title@2025-07-07 (1): Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Title: Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Emergent Semantics Beyond Token Embeddings: Transformer LMs mit gefrorenen visuellen Unicode-Darstellungen

超越 Tok 嵌入的新兴语义: 具有冷冻视觉统一符号的变形LMs 2507.04886v1

Authors (1): A. Bochkov

Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational “meaning vectors.” This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to “representational interference” in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer’s compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.

nan

Article 389

Title@2025-07-07 (1): Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions

Title: Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions

Aufbau von offenen retrieval-Konversations-Frage-Antwortsystemen durch Generierung synthetischer Daten und Dekontextualisierung von Benutzerfragen

通过生成合成数据和减少文本用户问题,建立开放式检索连接问答解答系统 2507.04884v1

Authors (7): Christos Vlachos, Nikolaos Stylianou, Alexandra Fiotaki, Spiros Methenitis, Elisavet Palogiannidi, Themos Stafylakis, Ion Androutsopoulos

We consider open-retrieval conversational question answering (OR-CONVQA), an extension of question answering where system responses need to be (i) aware of dialog history and (ii) grounded in documents (or document fragments) retrieved per question. Domain-specific OR-CONVQA training datasets are crucial for real-world applications, but hard to obtain. We propose a pipeline that capitalizes on the abundance of plain text documents in organizations (e.g., product documentation) to automatically produce realistic OR-CONVQA dialogs with annotations. Similarly to real-world humanannotated OR-CONVQA datasets, we generate in-dialog question-answer pairs, self-contained (decontextualized, e.g., no referring expressions) versions of user questions, and propositions (sentences expressing prominent information from the documents) the system responses are grounded in. We show how the synthetic dialogs can be used to train efficient question rewriters that decontextualize user questions, allowing existing dialog-unaware retrievers to be utilized. The retrieved information and the decontextualized question are then passed on to an LLM that generates the system’s response.

nan

Article 390

Title@2025-07-07 (1): Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite

Title: Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite

Spanische Texte aus der Vergangenheit transkribieren: Experimente mit Transkribus, Tesseract und Granit

翻录过去西班牙文本:与Transkribus、Tesseract和Granite的实验 2507.04878v1

Authors (4): Yanco Amor Torterolo-Orta, Jaione Macicior-Mitxelena, Marina Miguez-Lamanuzzi, Ana García-Serrano

This article presents the experiments and results obtained by the GRESEL team in the IberLEF 2025 shared task PastReader: Transcribing Texts from the Past. Three types of experiments were conducted with the dual aim of participating in the task and enabling comparisons across different approaches. These included the use of a web-based OCR service, a traditional OCR engine, and a compact multimodal model. All experiments were run on consumer-grade hardware, which, despite lacking high-performance computing capacity, provided sufficient storage and stability. The results, while satisfactory, leave room for further improvement. Future work will focus on exploring new techniques and ideas using the Spanish-language dataset provided by the shared task, in collaboration with Biblioteca Nacional de Espa~na (BNE).

nan

Article 391

Title@2025-07-07 (1): $\textit{Grahak-Nyay:}$ Consumer Grievance Redressal through Large Language Models

Title: $\textit{Grahak-Nyay:}$ Consumer Grievance Redressal through Large Language Models

$\textit{Grahak-Nyay:}$ Consumer Grievance Redressal durch große Sprachmodelle

$textit{ Grahak- Nyay:} 以大语言模式纠正消费者冤情 2507.04854v1

Authors (9): Shrey Ganatra, Swapnil Bhattacharyya, Harshvivek Kashid, Spandan Anaokar, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani, Pushpak Bhattacharyya

Access to consumer grievance redressal in India is often hindered by procedural complexity, legal jargon, and jurisdictional challenges. To address this, we present $\textbf{Grahak-Nyay}$ (Justice-to-Consumers), a chatbot that streamlines the process using open-source Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Grahak-Nyay simplifies legal complexities through a concise and up-to-date knowledge base. We introduce three novel datasets: $\textit{GeneralQA}$ (general consumer law), $\textit{SectoralQA}$ (sector-specific knowledge) and $\textit{SyntheticQA}$ (for RAG evaluation), along with $\textit{NyayChat}$, a dataset of 300 annotated chatbot conversations. We also introduce $\textit{Judgments}$ data sourced from Indian Consumer Courts to aid the chatbot in decision making and to enhance user trust. We also propose $\textbf{HAB}$ metrics ($\textbf{Helpfulness, Accuracy, Brevity}$) to evaluate chatbot performance. Legal domain experts validated Grahak-Nyay’s effectiveness. Code and datasets will be released.

nan

Article 392

Title@2025-07-07 (1): Dialogue-Based Multi-Dimensional Relationship Extraction from Novels

Title: Dialogue-Based Multi-Dimensional Relationship Extraction from Novels

Dialogbasierte multidimensionale Beziehungsextraktion aus Romanen

以对话为基础的从新奇中提取的多不同关系 2507.04852v1

Authors (6): Yuchen Yan, Hanjie Zhao, Senbin Zhu, Hongde Liu, Zhihong Zhang, Yuxiang Jia

Relation extraction is a crucial task in natural language processing, with broad applications in knowledge graph construction and literary analysis. However, the complex context and implicit expressions in novel texts pose significant challenges for automatic character relationship extraction. This study focuses on relation extraction in the novel domain and proposes a method based on Large Language Models (LLMs). By incorporating relationship dimension separation, dialogue data construction, and contextual learning strategies, the proposed method enhances extraction performance. Leveraging dialogue structure information, it improves the model’s ability to understand implicit relationships and demonstrates strong adaptability in complex contexts. Additionally, we construct a high-quality Chinese novel relation extraction dataset to address the lack of labeled resources and support future research. Experimental results show that our method outperforms traditional baselines across multiple evaluation metrics and successfully facilitates the automated construction of character relationship networks in novels.

nan

Article 393

Title@2025-07-07 (1): Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems

Title: Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems

Spec-TOD: Ein speziell angepasstes LLM-Framework für effiziente aufgabenorientierte Dialogsysteme

Spec-TOD:高效以任务为导向的对话系统专门指导性LLM框架 2507.04841v1

Authors (4): Quang-Vinh Nguyen, Quang-Chieu Nguyen, Hoang Pham, Khac-Hoai Nam Bui

Task-oriented dialogue (TOD) systems facilitate goal-driven interactions between users and machines. While recent advances in deep learning have improved the performance, TOD systems often struggle in low-resource scenarios with limited labeled data. To address this challenge, we propose Spec-TOD, a novel framework designed to train an end-to-end TOD system with limited data. Spec-TOD introduces two main innovations: (i) a novel specialized end-to-end TOD framework that incorporates explicit task instructions for instruction-tuned large language models (LLMs), and (ii) an efficient training strategy that leverages lightweight, specialized LLMs to achieve strong performance with minimal supervision. Experiments on the MultiWOZ dataset, a widely used TOD benchmark, demonstrate that Spec-TOD achieves competitive results while significantly reducing the need for labeled data. These findings highlight the potential of the proposed framework in advancing efficient and effective TOD systems in low-resource settings.

nan

Article 394

Title@2025-07-07 (1): CritiQ: Mining Data Quality Criteria from Human Preferences

Title: CritiQ: Mining Data Quality Criteria from Human Preferences

CritiQ: Mining Data Quality Criteria from Human Preferences

关键:来自人类首选的采矿数据质量标准 2502.19279v2

Authors (11): Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui

Language model heavily depends on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introduce biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier- based methods, verbal criteria are more interpretable and possess reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.

nan

Article 395

Title@2025-07-07 (1): RewardAnything: Generalizable Principle-Following Reward Models

Title: RewardAnything: Generalizable Principle-Following Reward Models

BelohnungAlles: Allgemeines Prinzip-folgende Belohnung Modelle

奖励任何奖励:可普遍适用的遵循原则的奖赏模式 2506.03637v2

Authors (10): Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye

Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse real-world needs-from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often producing biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. We achieve SotA performance with RewardAnything in traditional RM benchmark simply by specifying a well-defined principle, and results on RABench show we excel in adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods and we show by a case study on how to automatically and efficiently align LLMs with only natural language principles.

nan

Article 396

Title@2025-07-07 (1): BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance

Title: BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance

BiMa: Auf dem Weg zu Biases Milderung für Text-Video-Retrieval über Szeneelement-Anleitung

Bima:通过场景元素指导,争取减缓对文本视频检索的“双轨减缓” 2506.03589v3

Authors (5): Huy Le, Nhat Chung, Tung Kieu, Anh Nguyen, Ngan Le

Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model’s bias mitigation capability is consistently validated by its strong results on out-of-distribution retrieval tasks.

nan

Article 397

Title@2025-07-07 (1): From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach

Title: From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach

Von der Vision zur Sprache durch Diagramm der Ereignisse in Raum und Zeit: Ein erklärbarer selbstüberwachter Ansatz

通过空间和时间事件图从视野到语言:可解释的自我监督方法 2507.04815v1

Authors (2): Mihai Masala, Marius Leordeanu

The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive human manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks to produce the final natural language description. Moreover, we also demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using both standard evaluation metrics, human annotations and consensus from ensembles of state-of-the-art VLMs.

nan

Article 398

Title@2025-07-07 (1): A Survey of Pun Generation: Datasets, Evaluations and Methodologies

Title: A Survey of Pun Generation: Datasets, Evaluations and Methodologies

Eine Umfrage der Pun-Generation: Datensätze, Bewertungen und Methoden

对 “ 制作豆类:数据集、评价和方法 “ 的调查 2507.04793v1

Authors (6): Yuchen Su, Yonghua Zhu, Ruofan Wang, Zijian Huang, Diana Benavides-Prado, Michael Witbrock

Pun generation seeks to creatively modify linguistic elements in text to produce humour or evoke double meanings. It also aims to preserve coherence and contextual appropriateness, making it useful in creative writing and entertainment across various media and contexts. Although pun generation has received considerable attention in computational linguistics, there is currently no dedicated survey that systematically reviews this specific area. To bridge this gap, this paper provides a comprehensive review of pun generation datasets and methods across different stages, including conventional approaches, deep learning techniques, and pre-trained language models. Additionally, we summarise both automated and human evaluation metrics used to assess the quality of pun generation. Finally, we discuss the research challenges and propose promising directions for future work.

nan

Article 399

Title@2025-07-07 (1): Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

Title: Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

Qwen vs. Gemma Integration mit Whisper: Eine vergleichende Studie in mehrsprachigen Sprach-LLM-Systemen

Quwen诉Gemma 与低语融合:多语种语言LLLM系统比较研究 2506.13596v2

Authors (3): Tuan Nguyen, Long-Vu Hoang, Huy-Dat Tran

This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance with a private test average WER/CER result of 16.63% using the Gemma3-12B and 18.6% using the Qwen2.5-7B as decoder-only language model.

nan

Article 400

Title@2025-07-07 (1): Reason to Rote: Rethinking Memorization in Reasoning

Title: Reason to Rote: Rethinking Memorization in Reasoning

Grund zu rotieren: Die Erinnerung an die Vernunft neu denken

旋转理由:重新思考在理由方面的记忆 2507.04782v1

Authors (6): Yupei Du, Philipp Mondorf, Silvia Casola, Yuekun Yao, Robert Litschko, Barbara Plank

Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding lights on the intriguing phenomenon of benign memorization.

nan

Article 401

Title@2025-07-07 (1): ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems

Title: ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems

ABench-Physics: Benchmarking Physical Reasoning in LLMs über hochschwierige und dynamische Physikprobleme

ABench-物理:通过高难度和动态物理问题确定LLMLM中物理原因的基准 2507.04766v1

Authors (12): Yiming Zhang, Yingfan Ma, Yanmei Gu, Zhengkai Yang, Yihong Zhuang, Feng Wang, Zenan Huang, Yuanyuan Wang, Chao Huang, Bowen Song, Cheng Lin, Junbo Zhao

Large Language Models (LLMs) have shown impressive performance in domains such as mathematics and programming, yet their capabilities in physics remain underexplored and poorly understood. Physics poses unique challenges that demand not only precise computation but also deep conceptual understanding and physical modeling skills. Existing benchmarks often fall short due to limited difficulty, multiple-choice formats, and static evaluation settings that fail to capture physical modeling ability. In this paper, we introduce ABench-Physics, a novel benchmark designed to rigorously evaluate LLMs’ physical reasoning and generalization capabilities. ABench-Physics consists of two components: Phy_A, a static set of 400 graduate- or Olympiad-level problems; and Phy_B, a dynamic subset of 100 problems equipped with an automatic variation engine to test model robustness across changing conditions. All questions require precise numerical answers, with strict formatting and tolerance constraints. Our evaluation of several state-of-the-art LLMs reveals substantial performance gaps, highlighting persistent limitations in physical reasoning, especially in generalization to dynamic variants. ABench-Physics provides a challenging and diagnostic framework for advancing scientific reasoning in LLMs.

nan

Article 402

Title@2025-07-07 (1): Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

Title: Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

Thunder-DeID: Genauer und effizienter De-Identifizierungsrahmen für Urteile des koreanischen Gerichts

Thunder-DeID:韩国法院判决的准确和有效的取消识别框架 2506.15266v2

Authors (5): Sungeun Hahm, Heejin Kim, Gyuseong Lee, Hyunji Park, Jaejin Lee

To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.

nan

Article 403

Title@2025-07-07 (1): CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering

Title: CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering

CoSteer: Kollaborative Decoding-Time-Personalisierung über lokale Delta-Lenkung

科斯特:通过当地三角洲指导,协作解码-时间个性化 2507.04756v1

Authors (9): Hang Lv, Sheng Liang, Hao Wang, Hongchao Gu, Yaxiong Wu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen

Personalized text generation has become crucial for adapting language models to diverse and evolving users’ personal context across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a dilemma: large cloud-based models lack access to localized user-specific information, while small on-device models cannot match the generation quality of their cloud counterparts. To address this dichotomy, we present CoSteer, a novel collaborative framework that enables decoding-time personalization through localized delta steering. Our key insight lies in leveraging the logits difference between personal context-aware and -agnostic outputs from local small models as steering signals for cloud-based LLMs. Specifically, we formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM’s logits within the on-device environment. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors, while maintaining cloud-based LLMs’ general capabilities without fine-tuning. Through comprehensive experiments on various personalized generation tasks, we demonstrate that CoSteer effectively assists LLMs in generating personalized content by leveraging locally stored user profiles and histories, ensuring privacy preservation through on-device data processing while maintaining acceptable computational overhead.

nan

Article 404

Title@2025-07-07 (1): LLMs as Architects and Critics for Multi-Source Opinion Summarization

Title: LLMs as Architects and Critics for Multi-Source Opinion Summarization

LLMs als Architekten und Kritiker für die Mehr-Quellen-Stellungnahme Zusammenfassung

作为多来源意见总结的建筑师和批评家的LLM 2507.04751v1

Authors (7): Anuj Attri, Arnav Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera

Multi-source Opinion Summarization (M-OS) extends beyond traditional opinion summarization by incorporating additional sources of product metadata such as descriptions, key features, specifications, and ratings, alongside reviews. This integration results in comprehensive summaries that capture both subjective opinions and objective product attributes essential for informed decision-making. While Large Language Models (LLMs) have shown significant success in various Natural Language Processing (NLP) tasks, their potential in M-OS remains largely unexplored. Additionally, the lack of evaluation datasets for this task has impeded further advancements. To bridge this gap, we introduce M-OS-EVAL, a benchmark dataset for evaluating multi-source opinion summaries across 7 key dimensions: fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, specificity. Our results demonstrate that M-OS significantly enhances user engagement, as evidenced by a user study in which, on average, 87% of participants preferred M-OS over opinion summaries. Our experiments demonstrate that factually enriched summaries enhance user engagement. Notably, M-OS-PROMPTS exhibit stronger alignment with human judgment, achieving an average Spearman correlation of \r{ho} = 0.74, which surpasses the performance of previous methodologies.

nan

Article 405

Title@2025-07-07 (1): A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic

Title: A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic

Eine Geschichte von zwei Scripts: Transliteration und Nachkorrektur für Judeo-Arabisch

两本脚本的故事:犹太-阿拉伯语的翻写和校正后 2507.04746v1

Authors (3): Juan Moreno Gonzalez, Bashar Alhafni, Nizar Habash

Judeo-Arabic refers to Arabic variants historically spoken by Jewish communities across the Arab world, primarily during the Middle Ages. Unlike standard Arabic, it is written in Hebrew script by Jewish writers and for Jewish audiences. Transliterating Judeo-Arabic into Arabic script is challenging due to ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew and Aramaic. In this paper, we introduce a two-step approach to automatically transliterate Judeo-Arabic into Arabic script: simple character-level mapping followed by post-correction to address grammatical and orthographic errors. We also present the first benchmark evaluation of LLMs on this task. Finally, we show that transliteration enables Arabic NLP tools to perform morphosyntactic tagging and machine translation, which would have not been feasible on the original texts.

nan

Article 406

Title@2025-07-07 (1): Word stress in self-supervised speech models: A cross-linguistic comparison

Title: Word stress in self-supervised speech models: A cross-linguistic comparison

Wortstress in selbstüberwachten Sprachmodellen: Ein cross-linguistischer Vergleich

自我监督的演讲模式中的言语压力:交叉语言学比较 2507.04738v1

Authors (3): Martijn Bentum, Louis ten Bosch, Tomas O. Lentz

In this paper we study word stress representations learned by self-supervised speech models (S3M), specifically the Wav2vec 2.0 model. We investigate the S3M representations of word stress for five different languages: Three languages with variable or lexical stress (Dutch, English and German) and two languages with fixed or demarcative stress (Hungarian and Polish). We train diagnostic stress classifiers on S3M embeddings and show that they can distinguish between stressed and unstressed syllables in read-aloud short sentences with high accuracy. We also tested language-specificity effects of S3M word stress. The results indicate that the word stress representations are language-specific, with a greater difference between the set of variable versus the set of fixed stressed languages.

nan

Article 407

Title@2025-07-07 (1): Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools

Title: Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools

Engere Kluft: Überwachtes Feintuning von Open Source LLMs als lebensfähige Alternative zu proprietären Modellen für pädagogische Werkzeuge

缩小差距:监督开放源码LLMs的微调,将其作为替代专有教学工具模型的可行替代物 2507.05305v1

Authors (5): Lorenzo Lee Solano, Charles Koutcheme, Juho Leinonen, Alexandra Vassar, Jake Renzella

Frontier Large language models (LLMs) like ChatGPT and Gemini can decipher cryptic compiler errors for novice programmers, but their computational scale, cost, and tendency to over-assist make them problematic for widespread pedagogical adoption. This work demonstrates that smaller, specialised language models, enhanced via Supervised Fine-Tuning (SFT), present a more viable alternative for educational tools. We utilise a new dataset of 40,000 C compiler error explanations, derived from real introductory programming (CS1/2) student-generated programming errors, which we used to fine-tune three open-source models: Qwen3-4B, Llama-3.1-8B, and Qwen3-32B. We performed a dual evaluation, combining expert human reviews with a large-scale automated analysis of 8,000 responses using a validated LLM-as-judge ensemble. Our results show that SFT significantly boosts the pedagogical quality of smaller models, achieving performance comparable to much larger models. We analyse the trade-offs between model size and quality, confirming that fine-tuning compact, efficient models on high-quality, domain-specific data is a potent strategy for creating specialised models to drive educational tools. We provide a replicable methodology to foster broader access to generative AI capabilities in educational contexts.

nan

Article 408

Title@2025-07-07 (1): “This Suits You the Best”: Query Focused Comparative Explainable Summarization

Title: “This Suits You the Best”: Query Focused Comparative Explainable Summarization

“Das passt Ihnen am besten”: Abfrage konzentriert vergleichende erklärende Zusammenfassung

“这适合你最好的”: 查询重点突出的可解释比较摘要 2507.04733v1

Authors (7): Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera

Product recommendations inherently involve comparisons, yet traditional opinion summarization often fails to provide holistic comparative insights. We propose the novel task of generating Query-Focused Comparative Explainable Summaries (QF-CES) using Multi-Source Opinion Summarization (M-OS). To address the lack of query-focused recommendation datasets, we introduce MS-Q2P, comprising 7,500 queries mapped to 22,500 recommended products with metadata. We leverage Large Language Models (LLMs) to generate tabular comparative summaries with query-specific explanations. Our approach is personalized, privacy-preserving, recommendation engine-agnostic, and category-agnostic. M-OS as an intermediate step reduces inference latency approximately by 40% compared to the direct input approach (DIA), which processes raw data directly. We evaluate open-source and proprietary LLMs for generating and assessing QF-CES. Extensive evaluations using QF-CES-PROMPT across 5 dimensions (clarity, faithfulness, informativeness, format adherence, and query relevance) showed an average Spearman correlation of 0.74 with human judgments, indicating its potential for QF-CES evaluation.

nan

Article 409

Title@2025-07-07 (1): Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization

Title: Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization

Textentgiftung: Dateneffizienz, Semantische Erhaltung und Modellverallgemeinerung

文字解毒:数据效率、语义保护和模式通用 2507.01050v2

Authors (7): Jing Yu, Yibo Zhao, Jiapeng Zhu, Wenming Shao, Bo Pang, Zhao Zhang, Xiang Li

The widespread dissemination of toxic content on social media poses a serious threat to both online environments and public discourse, highlighting the urgent need for detoxification methods that effectively remove toxicity while preserving the original semantics. However, existing approaches often struggle to simultaneously achieve strong detoxification performance, semantic preservation, and robustness to out-of-distribution data. Moreover, they typically rely on costly, manually annotated parallel corpora while showing poor data efficiency. To address these challenges, we propose a two-stage training framework that jointly optimizes for data efficiency, semantic preservation, and model generalization. We first perform supervised fine-tuning on a small set of high-quality, filtered parallel data to establish a strong initialization. Then, we leverage unlabeled toxic inputs and a custom-designed reward model to train the LLM using Group Relative Policy Optimization. Experimental results demonstrate that our method effectively mitigates the trade-offs faced by previous work, achieving state-of-the-art performance with improved generalization and significantly reduced dependence on annotated data. Our code is available at: https://github.com/allacnobug/Detoxification-of-Text.

nan

Article 410

Title@2025-07-07 (1): LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework

Title: LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework

LOOM-Scope: ein umfassender und effizienter Long-cOntext Modellauswertungsrahmen

LOOM-范围:一个全面有效的LOng-Ontext示范评价框架 2507.04723v1

Authors (8): Zecheng Tang, Haitian Wang, Quantong Qiu, Baibei Ji, Ruoxi Sun, Keyan Zhou, Juntao Li, Min Zhang

Long-context processing has become a fundamental capability for large language models~(LLMs). To assess model’s long-context performance, numerous long-context evaluation benchmarks have been proposed. However, variations in evaluation settings across these benchmarks lead to inconsistent results, making it difficult to draw reliable comparisons. Besides, the high computational cost of long-context evaluation poses a significant barrier for the community to conduct comprehensive assessments of long-context models. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite to evaluate models comprehensively. Homepage: https://loomscope.github.io

nan

Article 411

Title@2025-07-07 (1): MAIN: Mutual Alignment Is Necessary for instruction tuning

Title: MAIN: Mutual Alignment Is Necessary for instruction tuning

MAIN: Gegenseitige Ausrichtung ist notwendig für die Instruktions-Tuning

MAIN: 相互协调是教学调整的必要条件 2504.12913v2

Authors (10): Fanyi Yang, Jianfeng Liu, Xin Zhang, Haoyu Liu, Xixin Cao, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Qi Zhang

Instruction tuning has empowered large language models (LLMs) to achieve remarkable performance, yet its success heavily depends on the availability of large-scale, high-quality instruction-response pairs. To meet this demand, various methods have been developed to synthesize data at scale. However, current methods for scaling up data generation often overlook a crucial aspect: the alignment between instructions and responses. We hypothesize that the quality of instruction-response pairs is determined not by the individual quality of each component, but by the degree of mutual alignment. To address this, we propose a Mutual Alignment Framework (MAIN) which enforces coherence between instructions and responses through mutual constraints. We demonstrate that MAIN generalizes well across model architectures and sizes, achieving state-of-the-art performance on LLaMA, Mistral, and Qwen models across diverse benchmarks. This work underscores the critical role of instruction-response alignment in enabling generalizable and high-quality instruction tuning for LLMs.

nan

Article 412

Title@2025-07-07 (1): Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce

Title: Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce

Warum wir fühlen, was wir fühlen: Gemeinsame Erkennung von Emotionen und ihre Meinung Auslöser im E-Commerce

我们为什么感到我们的感受:在电子商务中共同发现情感及其见解触发者 2507.04708v1

Authors (7): Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera

Customer reviews on e-commerce platforms capture critical affective signals that drive purchasing decisions. However, no existing research has explored the joint task of emotion detection and explanatory span identification in e-commerce reviews - a crucial gap in understanding what triggers customer emotional responses. To bridge this gap, we propose a novel joint task unifying Emotion detection and Opinion Trigger extraction (EOT), which explicitly models the relationship between causal text spans (opinion triggers) and affective dimensions (emotion categories) grounded in Plutchik’s theory of 8 primary emotions. In the absence of labeled data, we introduce EOT-X, a human-annotated collection of 2,400 reviews with fine-grained emotions and opinion triggers. We evaluate 23 Large Language Models (LLMs) and present EOT-DETECT, a structured prompting framework with systematic reasoning and self-reflection. Our framework surpasses zero-shot and chain-of-thought techniques, across e-commerce domains.

nan

Article 413

Title@2025-07-07 (1): XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL

Title: XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL

XiYan-SQL: Ein neuartiges Multi-Generator-Framework für Text-zu-SQL

XiYan-SQL: 文本到SQL的新颖多驱动器框架 2507.04701v1

Authors (11): Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, Jingren Zhou

To leverage the advantages of LLM in addressing challenges in the Text-to-SQL task, we present XiYan-SQL, an innovative framework effectively generating and utilizing multiple SQL candidates. It consists of three components: 1) a Schema Filter module filtering and obtaining multiple relevant schemas; 2) a multi-generator ensemble approach generating multiple highquality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy implemented to obtain the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the notable BIRD benchmark, surpassing all previous methods. It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.

nan

Article 414

Title@2025-07-07 (1): Markovian Transformers for Informative Language Modeling

Title: Markovian Transformers for Informative Language Modeling

Markovianische Transformer für die informative Sprachmodellierung

用于信息语言建模的Markovian变形器 2404.18988v6

Authors (4): Scott Viteri, Max Lamparth, Peter Chatain, Clark Barrett

Chain-of-Thought (CoT) reasoning often fails to faithfully reflect a language model’s underlying decision process. We address this by making CoT text causally essential in a “Markovian” language model, factoring next-token prediction through an intermediate CoT and training it to predict future tokens independently of the original prompt. We formalize this via an “informativeness” objective that quantifies how much a trained CoT improves next-token predictions over a baseline. Using policy gradient, we show that Llama 3.1 8B achieves a 33.2% absolute accuracy improvement on GSM8K. Perturbation tests confirm stronger reliance on the CoT, while cross-model transfers indicate these reasoning traces generalize across interpreters. Our approach enhances both accuracy and interpretability, potentially extending CoT reasoning to arbitrarily long contexts and diverse tasks.

nan

Article 415

Title@2025-07-07 (1): Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading

Title: Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading

Pensieve Grader: Eine KI-Powered, Ready-to-Use Plattform für mühelose handschriftliche STEM-Grading

Pensieve grafer: 一个AI授权的无力手写STEM分级的现用平台 2507.01431v2

Authors (4): Yoonseok Yang, Minjune Kim, Marlon Rondinelli, Keren Shao

Grading handwritten, open-ended responses remains a major bottleneck in large university STEM courses. We introduce Pensieve (https://www.pensieve.co), an AI-assisted grading platform that leverages large language models (LLMs) to transcribe and evaluate student work, providing instructors with rubric-aligned scores, transcriptions, and confidence ratings. Unlike prior tools that focus narrowly on specific tasks like transcription or rubric generation, Pensieve supports the entire grading pipeline-from scanned student submissions to final feedback-within a human-in-the-loop interface. Pensieve has been deployed in real-world courses at over 20 institutions and has graded more than 300,000 student responses. We present system details and empirical results across four core STEM disciplines: Computer Science, Mathematics, Physics, and Chemistry. Our findings show that Pensieve reduces grading time by an average of 65%, while maintaining a 95.4% agreement rate with instructor-assigned grades for high-confidence predictions.

nan

Article 416

Title@2025-07-07 (1): Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages

Title: Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages

Eka-Eval : Ein umfassender Evaluierungsrahmen für große Sprachmodelle in indischen Sprachen

Eka-Eval:印度语大语言模式综合评价框架 2507.01853v2

Authors (4): Samridhi Raj Sinha, Rajvee Sheth, Abhishek Upperwal, Mayank Singh

The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that address the requirements of linguistically diverse regions, such as India, and go beyond English-centric benchmarks. We introduce EKA-EVAL, a unified evaluation framework that integrates over 35+ benchmarks (including 10 Indic benchmarks) across nine major evaluation categories. The framework provides broader coverage than existing Indian language evaluation tools, offering 11 core capabilities through a modular architecture, seamless integration with Hugging Face and proprietary models, and plug-and-play usability. As the first end-to-end suite for scalable, multilingual LLM benchmarking, the framework combines extensive benchmarks, modular workflows, and dedicated support for low-resource Indian languages to enable inclusive assessment of LLM capabilities across diverse domains. We conducted extensive comparisons against five existing baselines, demonstrating that EKA-EVAL achieves the highest participant ratings in four out of five categories. The framework is open-source and publicly available at: https://github.com/lingo-iitgn/eka-eval.

nan

Article 417

Title@2025-07-07 (1): Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models

Title: Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models

Breach in the Shield: Enthüllen der Schwachstellen großer Sprachmodelle

打破盾牌:消除大语言模式的脆弱性 2504.03714v2

Authors (4): Runpeng Dai, Run Yang, Fan Zhou, Hongtu Zhu

Large Language Models (LLMs) and Vision-Language Models (VLMs) have achieved impressive performance across a wide range of tasks, yet they remain vulnerable to carefully crafted perturbations. In this study, we seek to pinpoint the sources of this fragility by identifying parameters and input dimensions (pixels or token embeddings) that are susceptible to such perturbations. To this end, we propose a stability measure called \textbf{FI}, \textbf{F}irst order local \textbf{I}nfluence, which is rooted in information geometry and quantifies the sensitivity of individual parameter and input dimensions. Our extensive analysis across LLMs and VLMs (from 1.5B to 13B parameters) reveals that: (I) A small subset of parameters or input dimensions with high FI values disproportionately contribute to model brittleness. (II) Mitigating the influence of these vulnerable parameters during model merging leads to improved performance.

nan

Article 418

Title@2025-07-07 (1): R1-RE: Cross-Domain Relationship Extraction with RLVR

Title: R1-RE: Cross-Domain Relationship Extraction with RLVR

R1-RE: Cross-Domain-Beziehungsextraktion mit RLVR

R1-RE: 与RLVR的跨部关系提取 2507.04642v1

Authors (4): Runpeng Dai, Tong Zheng, Run Yang, Hongtu Zhu

Relationship extraction (RE) is a core task in natural language processing. Traditional approaches typically frame RE as a supervised learning problem, directly mapping context to labels-an approach that often suffers from poor out-of-domain (OOD) generalization. Inspired by the workflow of human annotators, we reframe RE as a reasoning task guided by annotation guidelines and introduce R1-RE, the first reinforcement learning with verifiable reward (RLVR) framework for RE tasks. Our method elicits the reasoning abilities of small language models for annotation tasks, resulting in significantly improved OOD robustness. We evaluate our approach on the public Sem-2010 dataset and a private MDKG dataset. The R1-RE-7B model attains an average OOD accuracy of approximately 70%, on par with leading proprietary models such as GPT-4o. Additionally, our comprehensive analysis provides novel insights into the training dynamics and emergent reasoning behaviors of the RLVR paradigm for RE.

nan

Article 419

Title@2025-07-07 (1): Put Teacher in Student’s Shoes: Cross-Distillation for Ultra-compact Model Compression Framework

Title: Put Teacher in Student’s Shoes: Cross-Distillation for Ultra-compact Model Compression Framework

Put Teacher in Studentenschuhen: Cross-Destillation für ultra-kompakte Modellkompressionsrahmen

将教师纳入学生鞋:超常规示范压缩框架的交叉蒸馏 2507.04636v1

Authors (7): Maolin Wang, Jun Chu, Sicong Xie, Xiaoling Zang, Yao Zhao, Wenliang Zhong, Xiangyu Zhao

In the era of mobile computing, deploying efficient Natural Language Processing (NLP) models in resource-restricted edge settings presents significant challenges, particularly in environments requiring strict privacy compliance, real-time responsiveness, and diverse multi-tasking capabilities. These challenges create a fundamental need for ultra-compact models that maintain strong performance across various NLP tasks while adhering to stringent memory constraints. To this end, we introduce Edge ultra-lIte BERT framework (EI-BERT) with a novel cross-distillation method. EI-BERT efficiently compresses models through a comprehensive pipeline including hard token pruning, cross-distillation and parameter quantization. Specifically, the cross-distillation method uniquely positions the teacher model to understand the student model’s perspective, ensuring efficient knowledge transfer through parameter integration and the mutual interplay between models. Through extensive experiments, we achieve a remarkably compact BERT-based model of only 1.91 MB - the smallest to date for Natural Language Understanding (NLU) tasks. This ultra-compact model has been successfully deployed across multiple scenarios within the Alipay ecosystem, demonstrating significant improvements in real-world applications. For example, it has been integrated into Alipay’s live Edge Recommendation system since January 2024, currently serving the app’s recommendation traffic across \textbf{8.4 million daily active devices}.

nan

Article 420

Title@2025-07-07 (1): A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens

Title: A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens

Ein Text ist mehrere Token wert: Text-Einbetten aus LLMs richtet sich geheim gut mit den Schlüssel-Token

文本“ 值得几个调子” : 文本由LLMs 秘密对齐与密钥连接良好。 2406.17378v4

Authors (3): Zhijie Nie, Richong Zhang, Zhanyu Wu

Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval, semantic textual similarity, etc. In this work, we show an interesting finding: when feeding a text into the LLM-based embedder, the obtained text embedding will be able to be aligned with the key tokens in the input text. We first fully analyze this phenomenon on eight LLM-based embedders and show that this phenomenon is universal and is not affected by model architecture, training strategy, and embedding method. With a deeper analysis, we find that the main change in embedding space between these embedders and their LLM backbones is in the first principal component. By adjusting the first principal component, we can align text embedding with the key tokens. Finally, we give several examples to demonstrate the vast application potential of this finding: (1) we propose a simple and practical sparse retrieval method based on the aligned tokens, which can achieve 80% of the dense retrieval effect of the same model while reducing the computation significantly; (2) we show that our findings provide a novel perspective to help understand novel technologies (e.g., instruction-following embedding) and fuzzy concepts (e.g., semantic relatedness vs. similarity) in this field.

nan

Article 421

Title@2025-07-07 (1): Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs

Title: Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs

Wissen-Bewusst Selbst-Korrektion in Sprachmodellen über strukturierte Speichergraphen

通过结构内存图对语言模型进行知识-软件自我校正 2507.04625v1

Authors (1): Swayamjit Saha

Large Language Models (LLMs) are powerful yet prone to generating factual errors, commonly referred to as hallucinations. We present a lightweight, interpretable framework for knowledge-aware self-correction of LLM outputs using structured memory graphs based on RDF triples. Without retraining or fine-tuning, our method post-processes model outputs and corrects factual inconsistencies via external semantic memory. We demonstrate the approach using DistilGPT-2 and show promising results on simple factual prompts.

nan

Article 422

Title@2025-07-07 (1): Towards Cost-Effective Reward Guided Text Generation

Title: Towards Cost-Effective Reward Guided Text Generation

Auf dem Weg zu einer kosteneffizienten, von der Belohnung geleiteten Textgenerierung

实现成本-效益高的回报制制制制制制 2502.04517v2

Authors (6): Ahmad Rashid, Ruotian Wu, Rongqi Fan, Hongliang Li, Agustinus Kristiadi, Pascal Poupart

Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without further training like in standard RLHF methods. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant test-time overhead. Additionally, the reward model is usually only trained to score full sequences, which can lead to sub-optimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a \emph{single call} to the reward model at each step of the generation process. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze various RGTG reward models and demonstrate that prior techniques prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods.

nan

Article 423

Title@2025-07-07 (1): News Source Citing Patterns in AI Search Systems

Title: News Source Citing Patterns in AI Search Systems

Nachrichten Quelle Zitieren von Mustern in KI-Suchsystemen

AI 搜索系统中引用新闻来源模式 2507.05301v1

Authors (1): Kai-Cheng Yang

AI-powered search systems are emerging as new information gatekeepers, fundamentally transforming how users access news and information. Despite their growing influence, the citation patterns of these systems remain poorly understood. We address this gap by analyzing data from the AI Search Arena, a head-to-head evaluation platform for AI search systems. The dataset comprises over 24,000 conversations and 65,000 responses from models across three major providers: OpenAI, Perplexity, and Google. Among the over 366,000 citations embedded in these responses, 9% reference news sources. We find that while models from different providers cite distinct news sources, they exhibit shared patterns in citation behavior. News citations concentrate heavily among a small number of outlets and display a pronounced liberal bias, though low-credibility sources are rarely cited. User preference analysis reveals that neither the political leaning nor the quality of cited news sources significantly influences user satisfaction. These findings reveal significant challenges in current AI search systems and have important implications for their design and governance.

nan

Article 424

Title@2025-07-07 (1): Retain or Reframe? A Computational Framework for the Analysis of Framing in News Articles and Reader Comments

Title: Retain or Reframe? A Computational Framework for the Analysis of Framing in News Articles and Reader Comments

A putational Framework for the Analysis of Framing in News Artikel und Leserkommentare

保留还是重新框架? 分析新闻文章和读者评论中形式界定的计算框架 2507.04612v1

Authors (4): Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann

When a news article describes immigration as an “economic burden” or a “humanitarian crisis,” it selectively emphasizes certain aspects of the issue. Although \textit{framing} shapes how the public interprets such issues, audiences do not absorb frames passively but actively reorganize the presented information. While this relationship between source content and audience response is well-documented in the social sciences, NLP approaches often ignore it, detecting frames in articles and responses in isolation. We present the first computational framework for large-scale analysis of framing across source content (news articles) and audience responses (reader comments). Methodologically, we refine frame labels and develop a framework that reconstructs dominant frames in articles and comments from sentence-level predictions, and aligns articles with topically relevant comments. Applying our framework across eleven topics and two news outlets, we find that frame reuse in comments correlates highly across outlets, while topic-specific patterns vary. We release a frame classifier that performs well on both articles and comments, a dataset of article and comment sentences manually labeled for frames, and a large-scale dataset of articles and comments with predicted frame labels.

nan

Article 425

Title@2025-07-07 (1): Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track

Title: Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track

Position: Machine Learning Konferenzen sollten einen “Refutations and Critiques” Track erstellen

职位:机器学习会议应建立“反驳和批评”轨道 2506.19882v3

Authors (14): Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch, Brando Miranda, Matthias Gerstgrasser, Susan Zhang, Andreas Haupt, Isha Gupta, Elyas Obbad, Jesse Dodge, Jessica Zosa Forde, Francesco Orabona, Sanmi Koyejo, David Donoho

Science progresses by iteratively advancing and correcting humanity’s understanding of the world. In machine learning (ML) research, rapid advancements have led to an explosion of publications, but have also led to misleading, incorrect, flawed or perhaps even fraudulent studies being accepted and sometimes highlighted at ML conferences due to the fallibility of peer review. While such mistakes are understandable, ML conferences do not offer robust processes to help the field systematically correct when such errors are made. This position paper argues that ML conferences should establish a dedicated “Refutations and Critiques” (R&C) Track. This R&C Track would provide a high-profile, reputable platform to support vital research that critically challenges prior research, thereby fostering a dynamic self-correcting research ecosystem. We discuss key considerations including track design, review principles, potential pitfalls, and provide an illustrative example submission concerning a recent ICLR 2025 Oral. We conclude that ML conferences should create official, reputable mechanisms to help ML research self-correct.

nan

Article 426

Title@2025-07-07 (1): PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes

Title: PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes

PRIME: Large Language Model Personalisierung mit kognitiven Gedächtnis- und Gedankenprozessen

PRIME:具有认知记忆和思维过程的大语言模式个性模型 2507.04607v1

Authors (3): Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang

Large language model (LLM) personalization aims to align model outputs with individuals’ unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework that can systematically understand the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into LLM personalization, by mirroring episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs. Specifically, we systematically investigate memory instantiations and introduce a unified framework, PRIME, using episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability inspired by the slow thinking strategy. Moreover, recognizing the absence of suitable benchmarks, we introduce a dataset using Change My View (CMV) from Reddit, specifically designed to evaluate long-context personalization. Extensive experiments validate PRIME’s effectiveness across both long- and short-context scenarios. Further analysis confirms that PRIME effectively captures dynamic personalization beyond mere popularity biases.

nan

Article 427

Title@2025-07-07 (1): Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)

Title: Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)

Structured Captions verbessern die sofortige Einhaltung von Text-zu-Bild-Modellen (Re-LAION-Caption 19M)

结构说明改进对文本到图像模型的迅速遵守(Re-LAION-Caption 19M) 2507.05300v1

Authors (4): Nicholas Merchant, Haitz Sáez de Ocáriz Borde, Andrei Cristian Popescu, Carlos Garcia Jurado Suarez

We argue that generative text-to-image models often struggle with prompt adherence due to the noisy and unstructured nature of large-scale datasets like LAION-5B. This forces users to rely heavily on prompt engineering to elicit desirable outputs. In this work, we propose that enforcing a consistent caption structure during training can significantly improve model controllability and alignment. We introduce Re-LAION-Caption 19M, a high-quality subset of Re-LAION-5B, comprising 19 million 1024x1024 images with captions generated by a Mistral 7B Instruct-based LLaVA-Next model. Each caption follows a four-part template: subject, setting, aesthetics, and camera details. We fine-tune PixArt-$\Sigma$ and Stable Diffusion 2 using both structured and randomly shuffled captions, and show that structured versions consistently yield higher text-image alignment scores using visual question answering (VQA) models. The dataset is publicly available at https://huggingface.co/datasets/supermodelresearch/Re-LAION-Caption19M.

nan

Article 428

Title@2025-07-07 (1): Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation

Title: Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation

Gradient-geführte Aufmerksamkeitskarte Bearbeitung: Auf dem Weg zu einer effizienten kontextuellen Halluzination

逐步制导注意地图编辑:努力有效减少环境幻觉 2503.08963v2

Authors (6): Yu Wang, Kamalika Das, Xiang Gao, Wendi Cui, Peng Li, Jiaxin Zhang

In tasks like summarization and open-book question answering (QA), Large Language Models (LLMs) often encounter “contextual hallucination”, where they produce irrelevant or incorrect responses despite having access to accurate source information. This typically occurs because these models tend to prioritize self-generated content over the input context, causing them to disregard pertinent details. To address this challenge, we introduce a novel method called “Guided Attention Map Editing” (GAME), which dynamically adjusts attention maps to improve contextual relevance. During inference, GAME employs a trained classifier to identify attention maps prone to inducing hallucinations and executes targeted interventions. These interventions, guided by gradient-informed “edit directions’’, strategically redistribute attention weights across various heads to effectively reduce hallucination. Comprehensive evaluations on challenging summarization and open-book QA tasks show that GAME consistently reduces hallucinations across a variety of open-source models. Specifically, GAME reduces hallucinations by 10% in the XSum summarization task while achieving a 7X speed-up in computational efficiency compared to the state-of-the-art baselines.

nan

Article 429

Title@2025-07-07 (1): VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Title: VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

VLM2Vec-V2: Advancing Multimodal Embedding für Videos, Bilder und visuelle Dokumente

VLM2Vec-V2:推进视频、图像和视觉文件的多式联运嵌入 2507.04590v1

Authors (13): Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multi-modal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering - spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.

nan

Article 430

Title@2025-07-07 (1): Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs

Title: Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs

Im Inneren sind Sie viele Wölfe: Mit kognitiven Modellen, um Wert-Abwägungen in LLMs zu interpretieren

使用认知模型来解释LLMM中的价值权衡 2506.20666v2

Authors (7): Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman

Navigating everyday social situations often requires juggling conflicting goals, such as conveying a harsh truth, maintaining trust, all while still being mindful of another person’s feelings. These value trade-offs are an integral part of human decision-making and language use, however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called “cognitive models” provide formal accounts of these trade-offs in humans, by modeling the weighting of a speaker’s competing utility functions in choosing an action or utterance. In this work, we use a leading cognitive model of polite speech to interpret the extent to which LLMs represent human-like trade-offs. We apply this lens to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning “effort” in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models, and in open-source models shown to be stronger in mathematical reasoning. Our findings from LLMs’ training dynamics suggest large shifts in utility values early on in training with persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method. We show that our method is responsive to diverse aspects of the rapidly evolving LLM landscape, with insights for forming hypotheses about other high-level behaviors, shaping training regimes for reasoning models, and better controlling trade-offs between values during model training.

nan

Article 431

Title@2025-07-06 (7): Construction Identification and Disambiguation Using BERT: A Case Study of NPN

Title: Construction Identification and Disambiguation Using BERT: A Case Study of NPN

Bauidentifizierung und Abgrenzung mit BERT: Eine Fallstudie der NPN

使用BERT进行建筑识别和拆分:NPN的案例研究 2503.18751v2

Authors (2): Wesley Scivetti, Nathan Schneider

Construction Grammar hypothesizes that knowledge of a language consists chiefly of knowledge of form-meaning pairs (‘‘constructions’’) that include vocabulary, general grammar rules, and even idiosyncratic patterns. Recent work has shown that transformer language models represent at least some constructional patterns, including ones where the construction is rare overall. In this work, we probe BERT’s representation of the form and meaning of a minor construction of English, the NPN (noun-preposition-noun) construction – exhibited in such expressions as face to face and day to day – which is known to be polysemous. We construct a benchmark dataset of semantically annotated corpus instances (including distractors that superficially resemble the construction). With this dataset, we train and evaluate probing classifiers. They achieve decent discrimination of the construction from distractors, as well as sense disambiguation among true instances of the construction, revealing that BERT embeddings carry indications of the construction’s semantics. Moreover, artificially permuting the word order of true construction instances causes them to be rejected, indicating sensitivity to matters of form. We conclude that BERT does latently encode at least some knowledge of the NPN construction going beyond a surface syntactic pattern and lexical cues.

nan

Article 432

Title@2025-07-06 (7): Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts

Title: Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts

Nile-Chat: Ägyptische Sprachmodelle für arabische und lateinische Scripte

Nile-Chat:阿拉伯文和拉丁文脚本埃及语言模型 2507.04569v1

Authors (10): Guokan Shang, Hadi Abdine, Ahmad Chamma, Amr Mohamed, Mohamed Anwar, Abdelaziz Bounhar, Omar El Herraoui, Preslav Nakov, Michalis Vazirgiannis, Eric Xing

We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for Egyptian dialect, uniquely designed to understand and generate texts written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach by leveraging the Branch-Train-MiX strategy to merge script-specialized experts, into a single MoE model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model yields a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect in modern LLM development.

nan

Article 433

Title@2025-07-06 (7): Evaluating LLMs on Real-World Forecasting Against Human Superforecasters

Title: Evaluating LLMs on Real-World Forecasting Against Human Superforecasters

Bewertung von LLMs bei der Real-World-Prognose gegen menschliche Superforecaster

评价针对人类超先天预测者进行真实世界预测的LLMs评价项目 2507.04562v1

Authors (1): Janna Lu

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. A year ago, large language models struggle to come close to the accuracy of a human crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against human superforecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of superforecasters.

nan

Article 434

Title@2025-07-06 (7): PENCIL: Long Thoughts with Short Memory

Title: PENCIL: Long Thoughts with Short Memory

PENCIL: Lange Gedanken mit kurzer Erinnerung

PENCIL:有短记忆的长思考 2503.14337v2

Authors (4): Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li

While state-of-the-art LLMs have demonstrated great promise of using long Chains-of-Thought (CoT) to boost reasoning, scaling it up to more challenging problems at test-time is fundamentally limited by suboptimal memory usage – intermediate computations accumulate indefinitely in context even when no longer needed for future thoughts. We introduce PENCIL, which incorporates a novel reduction mechanism into the autoregressive generation process that recursively cleans up intermediate thoughts based on patterns learned from training. By iteratively generating and erasing thoughts, PENCIL can think deeper to solve harder problems using shorter context and less compute. Empirically, we observe PENCIL is significantly more effective and efficient than CoT. For example, we demonstrate PENCIL with a small 25M-parameter transformer and 2048 context length solves Einstein’s puzzle – a task that challenges much larger models like GPT-4. Theoretically, we prove PENCIL can perform universal efficient computation by simulating any Turing machines with optimal time and space complexity, and thus can solve arbitrary computable tasks that are otherwise intractable for vanilla CoT.

nan

Article 435

Title@2025-07-06 (7): DP-Fusion: Token-Level Differentially Private Inference for Large Language Models

Title: DP-Fusion: Token-Level Differentially Private Inference for Large Language Models

DP-Fusion: Token-Level Differentially Private Inferenz für große Sprachmodelle

DP-Fusion: 大语言模式的调控层面的私人不同推论 2507.04531v1

Authors (4): Rushil Thareja, Preslav Nakov, Praneeth Vepakomma, Nils Lukas

Large language models (LLMs) can leak sensitive information from their context through generated outputs, either accidentally or when prompted adversarially. Existing defenses that aim to preserve context privacy during inference either lack formal guarantees or suffer from a poor utility/privacy trade-off. We propose DP-Fusion, a token-level Differentially Private Inference (DPI) mechanism that provably bounds how much an LLM’s outputs reveal about sensitive tokens in its context. We demonstrate DPI through the task of document privatization, where the goal is to paraphrase documents so that sensitive content (e.g., Personally Identifiable Information, PII) cannot be reliably inferred, while still preserving the overall utility of the text. This is controlled by a parameter $\epsilon$: $\epsilon=0$ hides PII entirely, while higher values trade off privacy for improved paraphrase quality. DP-Fusion works as follows: (i) partition sensitive tokens into disjoint privacy groups, (ii) run the LLM once per group, and (iii) blend the output distributions so that the final output remains within a fixed statistical distance of the baseline distribution produced when no privacy group is revealed. This approach allows fine-grained control over the privacy/utility trade-off but requires multiple LLM forward passes.

nan

Article 436

Title@2025-07-06 (7): DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging

Title: DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging

DOTResize: Reduzierung der LLM-Breite durch diskrete Optimale transportbasierte Neuronenverschmelzung

DOTRIZ: 通过分立优化运输最佳方式的中子合并减少 LLM 宽度 2507.04517v1

Authors (3): Neha Verma, Kenton Murray, Kevin Duh

Model compression offers a promising path to reducing the cost and inaccessibility of large pre-trained models, without significantly compromising their impressive performance. Large Transformer models, including large language models (LLMs), often contain computational redundancy, which can serve as a target for new model compression methods. In this work, we specifically target neuron-level redundancies in model layers by combining groups of similar neurons into fewer neurons. We frame this width reduction as a Discrete Optimal Transport problem, and propose DOTResize, a novel Transformer compression method that uses optimal transport theory to transform and compress model weights. To ensure applicability within the Transformer architecture, we motivate and incorporate entropic regularization and matrix factorization into the transportation maps produced by our method. Unlike pruning-based approaches which discard neurons based on importance measures, DOTResize re-projects the entire neuron width, allowing the retention and redistribution of useful signal across the reduced layer. Empirical results show that compared to simple or state-of-the-art neuron width-pruning techniques, DOTResize can outperform these methods across multiple LLM families and sizes, while achieving measurable reductions in real-world computational cost.

nan

Article 437

Title@2025-07-06 (7): Rethinking Table Instruction Tuning

Title: Rethinking Table Instruction Tuning

Umdenken Tabelle Anleitung Tuning

重新思考表格指令图 2501.14693v3

Authors (2): Naihao Deng, Rada Mihalcea

Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of hyperparameter choices, and also lacks a comprehensive evaluation of the out-of-domain table understanding ability and the general capabilities of these table LLMs. In this paper, we evaluate these abilities in existing table LLMs, and find significant declines in both out-of-domain table understanding and general capabilities as compared to their base models. Through systematic analysis, we show that hyperparameters, such as learning rate, can significantly influence both table-specific and general capabilities. Contrary to the previous table instruction-tuning work, we demonstrate that smaller learning rates and fewer training instances can enhance table understanding while preserving general capabilities. Based on our findings, we introduce TAMA, a TAble LLM instruction-tuned from LLaMA 3.1 8B Instruct, which achieves performance on par with, or surpassing GPT-3.5 and GPT-4 on table tasks, while maintaining strong out-of-domain generalization and general capabilities. Our findings highlight the potential for reduced data annotation costs and more efficient model development through careful hyperparameter selection. We open-source the project and our models.

nan

Article 438

Title@2025-07-06 (7): On the Utility of Domain-Adjacent Fine-Tuned Model Ensembles for Few-shot Problems

Title: On the Utility of Domain-Adjacent Fine-Tuned Model Ensembles for Few-shot Problems

Über die Nützlichkeit von Domain-Adjacent Fine-Tuned Model Ensembles für Probleme mit wenigen Fotos

关于Dome-邻接精度精美设计模型集的效用问题 2406.13720v2

Authors (5): Md Ibrahim Ibne Alam, Parikshit Ram, Soham Dan, Horst Samulowitz, Koushik Kar

Large Language Models (LLMs) have been observed to perform well on a wide range of downstream tasks when fine-tuned on domain-specific data. However, such data may not be readily available in many applications, motivating zero-shot or few-shot approaches using domain-adjacent models. While several fine-tuned models for various tasks are available, finding an appropriate domain-adjacent model for a given task is often not straight forward. In this paper, we study DAFT-E, a framework that utilizes an Ensemble of Domain-Adjacent Fine-Tuned Foundation Models for few-shot problems. We show that for zero-shot problems, this ensembling method provides an accuracy performance close to that of the single best model. With few-shot problems, this performance improves further, at which point DEFT-E can outperform any single domain-adjacent model while requiring much less data for domain-specific fine-tuning.

nan

Article 439

Title: AdS: Adapter-state Sharing Framework for Multimodal Sarcasm Detection

AdS: Adapter-State-Sharing-Framework für multimodale Sarkasmus-Erkennung

AdS:多式碎片探测适应国家共享框架 2507.04508v1

Authors (3): Soumyadeep Jana, Sahil Danayak, Sanasam Ranbir Singh

The growing prevalence of multimodal image-text sarcasm on social media poses challenges for opinion mining, especially under resource constraints. Existing approaches rely on full fine-tuning of large pre-trained models, making them unsuitable for low-resource settings. While recent parameter-efficient fine-tuning (PEFT) methods offer promise, their off-the-shelf use underperforms on complex tasks like sarcasm detection. We propose AdS (Adapter-State Sharing), a lightweight framework built on CLIP that inserts adapters only in the upper layers and introduces a novel adapter-state sharing mechanism, where textual adapters guide visual ones. This design promotes efficient cross-modal learning while preserving low-level unimodal representations. Experiments on two public benchmarks demonstrate that AdS achieves state-of-the-art results using significantly fewer trainable parameters than existing PEFT and full fine-tuning approaches.

nan

Article 440

Title@2025-07-06 (7): Unveiling the Potential of Diffusion Large Language Model in Controllable Generation

Title: Unveiling the Potential of Diffusion Large Language Model in Controllable Generation

Enthüllen des Potenzials von Diffusions-Großsprachenmodellen in kontrollierbarer Generation

发挥在可控生产中传播大语言模式的潜力 2507.04504v1

Authors (4): Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang

Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, revealing that the intrinsic bidirectional attention mechanism of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process exhibits high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs without specialized optimizations. To address these limitations, we propose \textbf{S}elf-adaptive \textbf{S}chema \textbf{S}caffolding ($S^3$), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that $S^3$ achieves substantial improvements: 65\% increase in structural adherence, 48\% enhancement in content fidelity, and 17\% reduction in hallucination rates compared to baseline. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.

nan

Article 441

Title@2025-07-06 (7): Self-Consistency Preference Optimization

Title: Self-Consistency Preference Optimization

Selbst-Konsistenz-Präferenz-Optimierung

优化优惠 2411.04109v3

Authors (9): Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, Jane Yu

Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.

nan

Article 442

Title@2025-07-06 (7): A validity-guided workflow for robust large language model research in psychology

Title: A validity-guided workflow for robust large language model research in psychology

Ein validitätsgeführter Workflow für robuste große Sprachmodellforschung in der Psychologie

健全的大型语言模式心理学研究有效指导工作流程 2507.04491v1

Authors (1): Zhicheng Lin

Large language models (LLMs) are rapidly being integrated into psychological research as research tools, evaluation targets, human simulators, and cognitive models. However, recent evidence reveals severe measurement unreliability: Personality assessments collapse under factor analysis, moral preferences reverse with punctuation changes, and theory-of-mind accuracy varies widely with trivial rephrasing. These “measurement phantoms”–statistical artifacts masquerading as psychological phenomena–threaten the validity of a growing body of research. Guided by the dual-validity framework that integrates psychometrics with causal inference, we present a six-stage workflow that scales validity requirements to research ambition–using LLMs to code text requires basic reliability and accuracy, while claims about psychological properties demand comprehensive construct validation. Researchers must (1) explicitly define their research goal and corresponding validity requirements, (2) develop and validate computational instruments through psychometric testing, (3) design experiments that control for computational confounds, (4) execute protocols with transparency, (5) analyze data using methods appropriate for non-independent observations, and (6) report findings within demonstrated boundaries and use results to refine theory. We illustrate the workflow through an example of model evaluation–“LLM selfhood”–showing how systematic validation can distinguish genuine computational phenomena from measurement artifacts. By establishing validated computational instruments and transparent practices, this workflow provides a path toward building a robust empirical foundation for AI psychology research.

nan

Article 443

Title@2025-07-06 (7): Does Overnight News Explain Overnight Returns?

Title: Does Overnight News Explain Overnight Returns?

Erklären die Nachrichten über Nacht die Rückkehr?

夜间新闻解释夜间回归吗? 2507.04481v1

Authors (4): Paul Glasserman, Kriste Krstovski, Paul Laliberte, Harry Mamaysky

Over the past 30 years, nearly all the gains in the U.S. stock market have been earned overnight, while average intraday returns have been negative or flat. We find that a large part of this effect can be explained through features of intraday and overnight news. Our analysis uses a collection of 2.4 million news articles. We apply a novel technique for supervised topic analysis that selects news topics based on their ability to explain contemporaneous market returns. We find that time variation in the prevalence of news topics and differences in the responses to news topics both contribute to the difference in intraday and overnight returns. In out-of-sample tests, our approach forecasts which stocks will do particularly well overnight and particularly poorly intraday. Our approach also helps explain patterns of continuation and reversal in intraday and overnight returns. We contrast the effect of news with other mechanisms proposed in the literature to explain overnight returns.

nan

Article 444

Title@2025-07-06 (7): The role of large language models in UI/UX design: A systematic literature review

Title: The role of large language models in UI/UX design: A systematic literature review

Die Rolle großer Sprachmodelle im UI/UX-Design: Ein systematischer Literaturbericht

大语言模型在UI/UX设计中的作用:系统文献审查 2507.04469v1

Authors (2): Ammar Ahmed, Ali Shariq Imran

This systematic literature review examines the role of large language models (LLMs) in UI/UX design, synthesizing findings from 38 peer-reviewed studies published between 2022 and 2025. We identify key LLMs in use, including GPT-4, Gemini, and PaLM, and map their integration across the design lifecycle, from ideation to evaluation. Common practices include prompt engineering, human-in-the-loop workflows, and multimodal input. While LLMs are reshaping design processes, challenges such as hallucination, prompt instability, and limited explainability persist. Our findings highlight LLMs as emerging collaborators in design, and we propose directions for the ethical, inclusive, and effective integration of these technologies.

nan

Article 445

Title@2025-07-06 (7): Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection

Title: Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection

Dual Modality-Aware Gated Prompt Tuning für wenige-shot multimodale Sarkasmus-Erkennung

用于小热多式乱石探测的两极式器件器械 Gedt 快速计时 2507.04468v1

Authors (3): Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh

The widespread use of multimodal content on social media has heightened the need for effective sarcasm detection to improve opinion mining. However, existing models rely heavily on large annotated datasets, making them less suitable for real-world scenarios where labeled data is scarce. This motivates the need to explore the problem in a few-shot setting. To this end, we introduce DMDP (Deep Modality-Disentangled Prompt Tuning), a novel framework for few-shot multimodal sarcasm detection. Unlike prior methods that use shallow, unified prompts across modalities, DMDP employs gated, modality-specific deep prompts for text and visual encoders. These prompts are injected across multiple layers to enable hierarchical feature learning and better capture diverse sarcasm types. To enhance intra-modal learning, we incorporate a prompt-sharing mechanism across layers, allowing the model to aggregate both low-level and high-level semantic cues. Additionally, a cross-modal prompt alignment module enables nuanced interactions between image and text representations, improving the model’s ability to detect subtle sarcastic intent. Experiments on two public datasets demonstrate DMDP’s superior performance in both few-shot and extremely low-resource settings. Further cross-dataset evaluations show that DMDP generalizes well across domains, consistently outperforming baseline methods.

nan

Article 446

Title@2025-07-06 (7): Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection

Title: Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection

Denken Sie zweimal vor Ihnen Richter: Mischung von Dual-Reasoning-Experten für multimodale Sarkasmus-Erkennung

在你面前三思而后三思,法官:多式讽刺探测双重理由专家的混合 2507.04458v1

Authors (3): Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh

Multimodal sarcasm detection has attracted growing interest due to the rise of multimedia posts on social media. Understanding sarcastic image-text posts often requires external contextual knowledge, such as cultural references or commonsense reasoning. However, existing models struggle to capture the deeper rationale behind sarcasm, relying mainly on shallow cues like image captions or object-attribute pairs from images. To address this, we propose \textbf{MiDRE} (\textbf{Mi}xture of \textbf{D}ual \textbf{R}easoning \textbf{E}xperts), which integrates an internal reasoning expert for detecting incongruities within the image-text pair and an external reasoning expert that utilizes structured rationales generated via Chain-of-Thought prompting to a Large Vision-Language Model. An adaptive gating mechanism dynamically weighs the two experts, selecting the most relevant reasoning path. Experiments on two benchmark datasets show that MiDRE achieves superior performance over baselines. Various qualitative analyses highlight the crucial role of external rationales, revealing that even when they are occasionally noisy, they provide valuable cues that guide the model toward a better understanding of sarcasm.

nan

Article 447

Title@2025-07-06 (7): GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models

Title: GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models

GradOT: Training-frei Gradient-erhaltendes Offsite-Tuning für große Sprachmodelle

GradOT: 大语言模型的无培训渐进式非现场调整 2507.04455v1

Authors (10): Kai Yao, Zhaorui Tan, Penglei Gao, Lichun Li, Kaixin Wu, Yinggui Wang, Yuan Zhao, Yixin Ji, Wei Wang, Jianke Zhu

The rapid growth of large language models (LLMs) with traditional centralized fine-tuning emerges as a key technique for adapting these models to domain-specific challenges, yielding privacy risks for both model and data owners. One promising solution, called offsite-tuning (OT), is proposed to address these challenges, where a weaker emulator is compressed from the original model and further fine-tuned with adapter to enhance privacy. However, the existing OT-based methods require high computational costs and lack theoretical analysis. This paper introduces a novel OT approach based on gradient-preserving compression, named GradOT. By analyzing the OT problem through the lens of optimization, we propose a method that selectively applies compression techniques such as rank compression and channel pruning, preserving the gradients of fine-tuned adapters while ensuring privacy. Extensive experiments demonstrate that our approach surpasses existing OT methods, both in terms of privacy protection and model performance. Our method provides a theoretical foundation for OT and offers a practical, training-free solution for offsite-tuning of large-scale LLMs.

nan

Article 448

Title@2025-07-06 (7): Dude, where’s my utterance? Evaluating the effects of automatic segmentation and transcription on CPS detection

Title: Dude, where’s my utterance? Evaluating the effects of automatic segmentation and transcription on CPS detection

Alter, wo ist meine Äußerung? Bewertung der Auswirkungen der automatischen Segmentierung und Transkription auf CPS-Erkennung

评估自动分解和转录对CPS检测的影响 2507.04454v1

Authors (3): Videep Venkatesha, Mariah Bradford, Nathaniel Blanchard

Collaborative Problem-Solving (CPS) markers capture key aspects of effective teamwork, such as staying on task, avoiding interruptions, and generating constructive ideas. An AI system that reliably detects these markers could help teachers identify when a group is struggling or demonstrating productive collaboration. Such a system requires an automated pipeline composed of multiple components. In this work, we evaluate how CPS detection is impacted by automating two critical components: transcription and speech segmentation. On the public Weights Task Dataset (WTD), we find CPS detection performance with automated transcription and segmentation methods is comparable to human-segmented and manually transcribed data; however, we find the automated segmentation methods reduces the number of utterances by 26.5%, impacting the the granularity of the data. We discuss the implications for developing AI-driven tools that support collaborative learning in classrooms.

nan

Article 449

Title@2025-07-06 (7): KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality

Title: KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality

KnowRL: Erforschendes Wissenswertes Verstärktes Lernen für die Realität

KnowRL:探索知识强化学习促进事实质量 2506.19807v2

Authors (5): Baochang Ren, Shuofei Qiao, Wenhao Yu, Huajun Chen, Ningyu Zhang

Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.

nan

Article 450

Title@2025-07-06 (7): AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Title: AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

AutoMind: Adaptives Knowledgeable Agent für automatisierte Datenwissenschaft

自动Mind:自动数据科学适应性知识代理 2506.10974v2

Authors (9): Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang

Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.

nan

Article 451

Title@2025-07-06 (7): A Linguistic Analysis of Spontaneous Thoughts: Investigating Experiences of Déjà Vu, Unexpected Thoughts, and Involuntary Autobiographical Memories

Title: A Linguistic Analysis of Spontaneous Thoughts: Investigating Experiences of Déjà Vu, Unexpected Thoughts, and Involuntary Autobiographical Memories

Eine linguistische Analyse von spontanen Gedanken: Untersuchung von Erfahrungen von Déjà Vu, unerwarteten Gedanken und unfreiwilligen autobiographischen Erinnerungen

对自发思想的语言分析:调查Déjá Vu、意外思想和非自愿自传记忆的经验 2507.04439v1

Authors (6): Videep Venkatesha, Mary Cati Poulos, Christopher Steadman, Caitlin Mills, Anne M. Cleary, Nathaniel Blanchard

The onset of spontaneous thoughts are reflective of dynamic interactions between cognition, emotion, and attention. Typically, these experiences are studied through subjective appraisals that focus on their triggers, phenomenology, and emotional salience. In this work, we use linguistic signatures to investigate Deja Vu, Involuntary Autobiographical Memories and Unexpected Thoughts. Specifically, we analyze the inherent characteristics of the linguistic patterns in participant generated descriptions of these thought types. We show how, by positioning language as a window into spontaneous cognition, existing theories on these attentional states can be updated and reaffirmed. Our findings align with prior research, reinforcing that Deja Vu is a metacognitive experience characterized by abstract and spatial language, Involuntary Autobiographical Memories are rich in personal and emotionally significant detail, and Unexpected Thoughts are marked by unpredictability and cognitive disruption. This work is demonstrative of languages potential to reveal deeper insights into how internal spontaneous cognitive states manifest through expression.

nan

Article 452

Title@2025-07-06 (7): Reconstructing Biological Pathways by Applying Selective Incremental Learning to (Very) Small Language Models

Title: Reconstructing Biological Pathways by Applying Selective Incremental Learning to (Very) Small Language Models

Rekonstruieren biologischer Pfade durch Anwendung selektiven Inkrementellen Lernens auf (Sehr) kleine Sprachmodelle

通过将选择性增量学习应用到(老式)小型语言模式,重建生物途径 2507.04432v1

Authors (8): Pranta Saha, Joyce Reimer, Brook Byrns, Connor Burbridge, Neeraj Dhar, Jeffrey Chen, Steven Rayan, Gordon Broderick

The use of generative artificial intelligence (AI) models is becoming ubiquitous in many fields. Though progress continues to be made, general purpose large language AI models (LLM) show a tendency to deliver creative answers, often called “hallucinations”, which have slowed their application in the medical and biomedical fields where accuracy is paramount. We propose that the design and use of much smaller, domain and even task-specific LM may be a more rational and appropriate use of this technology in biomedical research. In this work we apply a very small LM by today’s standards to the specialized task of predicting regulatory interactions between molecular components to fill gaps in our current understanding of intracellular pathways. Toward this we attempt to correctly posit known pathway-informed interactions recovered from manually curated pathway databases by selecting and using only the most informative examples as part of an active learning scheme. With this example we show that a small (~110 million parameters) LM based on a Bidirectional Encoder Representations from Transformers (BERT) architecture can propose molecular interactions relevant to tuberculosis persistence and transmission with over 80% accuracy using less than 25% of the ~520 regulatory relationships in question. Using information entropy as a metric for the iterative selection of new tuning examples, we also find that increased accuracy is driven by favoring the use of the incorrectly assigned statements with the highest certainty (lowest entropy). In contrast, the concurrent use of correct but least certain examples contributed little and may have even been detrimental to the learning rate.

nan

Article 453

Title@2025-07-06 (7): RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling

Title: RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling

RAT: RNN-Effizienz und Aufmerksamkeitsgenauigkeit in der Sprachmodellierung überbrücken

RAT:在语言建模中弥补区域NNN效率和注意的准确性 2507.04416v1

Authors (4): Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre

Transformers have become the cornerstone of modern large-scale language models; however, their dependence on softmax attention poses a major computational bottleneck, particularly in long-context settings. In this work, rather than following prevalent approaches such as linear attention (or SSMs) and local attention, we introduce an intermediate design called \rat between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. By adjusting the size of the chunk, \rat enables flexible trade-offs, combining the strengths of RNN and attention. Empirically, with a chunk size of 16, the \rat layer achieves a (7\times) improvement in training speed with 100K token sequences and (9\times) in generation at 4K sequence length, while maintaining similar or sometimes even better accuracy compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning~(SFT). We further propose a hybrid architecture that interleaves \rat with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage compared to attention, but also consistently enhances performance, for example, achieving an average 1 point gain in commonsense reasoning tasks, up to 4 points on code tasks, and a 1 point Rouge-L increase in a summarization SFT task. Code is available at https://github.com/CLAIRE-Labo/RAT

nan

Article 454

Title@2025-07-06 (7): MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind

Title: MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind

MOMENTE: Ein umfassender multimodaler Benchmark für die Theorie des Geistes

修改:关于精神理论的综合多模式基准 2507.04415v1

Authors (10): Emilio Villa-Cueva, S M Masrur Ahmed, Rendi Chevi, Jan Christian Blaise Cruz, Kareem Elzeky, Fermin Cristobal, Alham Fikri Aji, Skyler Wang, Rada Mihalcea, Thamar Solorio

Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MOMENTS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MOMENTS includes over 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI’s multimodal understanding of human behavior.

nan

Article 455

Title@2025-07-06 (7): THM@SimpleText 2025 – Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts

Title: THM@SimpleText 2025 – Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts

THM@SimpleText 2025 – Aufgabe 1.1: Textvereinfachung auf Basis komplexer Bedingungen für Nicht-Experten

THM@SPIText 2025-任务1.1:根据非专家的复杂条件重新审议文本简化 2507.04414v1

Authors (5): Nico Hofmann, Julian Dauenhauer, Nils Ole Dietzler, Idehen Daniel Idahor, Christin Katharina Kreutz

Scientific text is complex as it contains technical terms by definition. Simplifying such text for non-domain experts enhances accessibility of innovation and information. Politicians could be enabled to understand new findings on topics on which they intend to pass a law, or family members of seriously ill patients could read about clinical trials. The SimpleText CLEF Lab focuses on exactly this problem of simplification of scientific text. Task 1.1 of the 2025 edition specifically handles the simplification of complex sentences, so very short texts with little context. To tackle this task we investigate the identification of complex terms in sentences which are rephrased using small Gemini and OpenAI large language models for non-expert readers.

nan

Article 456

Title@2025-07-06 (7): Pretraining Language Models Using Translationese

Title: Pretraining Language Models Using Translationese

Vorschulung von Sprachmodellen mit Translationese

使用笔译语言的语文模型 2403.13638v3

Authors (3): Meet Doshi, Raj Dabre, Pushpak Bhattacharyya

In this paper, we explore the utility of translationese as synthetic data created using machine translation for pre-training language models (LMs) for low-resource languages (LRLs). Our simple methodology consists of translating large amounts of web-crawled monolingual documents (clean) into the LRLs, followed by filtering the translated documents using tiny LMs trained on small but clean LRL data. Taking the case of Indian languages, we pre-train LMs from scratch with 28M and 85M parameters, and then fine-tune them for 5 downstream natural language understanding (NLU) and 4 generative (NLG) tasks. We observe that pre-training on filtered synthetic data leads to relative performance drops of only 0.87% for NLU and 2.35% for NLG, compared to pre-training on clean data, and this gap further diminishes upon the inclusion of a small amount of clean data. We also study the impact of synthetic data filtering and the choice of source language for synthetic data generation. Furthermore, evaluating continually pre-trained larger models like Gemma-2B and Llama-3-8B in few-shot settings, we observe that using synthetic data is competitive with using clean data. Our findings suggest that synthetic data shows promise for bridging the pre-training gap between English and LRLs.

nan

Article 457

Title@2025-07-06 (7): Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Title: Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Wahrnehmung, Vernunft, Denken und Plan: Eine Umfrage zu großen multimodalen Vernunftmodellen

观念、理性、思考和计划:关于大型多式联运理由模型的调查 2505.04921v2

Authors (22): Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang

Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field’s shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.

nan

Article 458

Title@2025-07-06 (7): Have LLMs Made Active Learning Obsolete? Surveying the NLP Community

Title: Have LLMs Made Active Learning Obsolete? Surveying the NLP Community

Haben LLMs aktives Lernen überholt? Vermessung der NLP-Gemeinschaft

LLM女士是否做过积极学习过时? 2503.09701v2

Authors (5): Julia Romberg, Christopher Schröder, Julius Gonsior, Katrin Tomanek, Fredrik Olsson

Supervised learning relies on annotated data, which is expensive to obtain. A longstanding strategy to reduce annotation costs is active learning, an iterative process, in which a human annotates only data instances deemed informative by a model. Large language models (LLMs) have pushed the effectiveness of active learning, while also advancing methods such as few- or zero-shot learning, and text synthesis – all of which can reduce the need for active learning. This naturally raises the question: has active learning become obsolete? To answer this fully, we must look beyond literature to practical experiences. We conduct an online survey in the NLP community to collect previously intangible insights on the perceived relevance of data annotation, particularly focusing on active learning, including best practices, obstacles, and future prospects. Our findings show that annotated data is expected to remain a key factor and active learning to stay highly relevant while benefiting from LLMs. Consistent with a community survey from over a decade ago, however, we find that three key challenges persist – setup complexity, risks in the cost reduction, and tooling – for which we propose alleviation strategies. We publish an anonymized version of the collected dataset.

nan

Article 459

Title@2025-07-06 (7): Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

Title: Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

Besser sicher als entschuldigt? Überreaktion Problem der Vision Sprachmodelle in der visuellen Notfallerkennung

是否比抱歉更安全?在视觉紧急识别中过度反应的视觉语言模型问题 2505.15367v2

Authors (3): Dasol Choi, Seunghyun Lee, Youngsook Song

Vision-Language Models (VLMs) have shown capabilities in interpreting visual content, but their reliability in safety-critical everyday life scenarios remains insufficiently explored. We introduce VERI (Visual Emergency Recognition Dataset), a diagnostic benchmark comprising 200 images organized into 100 contrastive pairs. Each emergency scene is paired with a visually similar but safe counterpart through human verification and refinement. Using a two-stage evaluation protocol - risk identification and emergency response - we assess 14 VLMs (2B to 124B parameters) across medical emergencies, accidents, and natural disasters. Our analysis reveals an “overreaction problem”, where models accurately identify genuine emergencies (70-100 percent success rate) but produce high false-positive rates, misclassifying 31-96 percent of safe situations as dangerous. Ten safe scenarios were universally misclassified by all models regardless of scale. This “better-safe-than-sorry” bias primarily results from contextual overinterpretation (88-93 percent of errors), challenging VLM reliability in safety-critical applications. These findings highlight fundamental limitations in current VLM architectures, which persist despite increased model scale. Our results demonstrate an urgent need for strategies specifically improving contextual reasoning in ambiguous visual situations. The consistently low performance of the model indicates that these data serve effectively as a diagnostic dataset.

nan

Article 460

Title@2025-07-06 (7): SpiritRAG: A Q&A System for Religion and Spirituality in the United Nations Archive

Title: SpiritRAG: A Q&A System for Religion and Spirituality in the United Nations Archive

SpiritRAG: Ein Q&A-System für Religion und Spiritualität im Archiv der Vereinten Nationen

SpirirRAG:联合国档案馆中的宗教和精神系统 2507.04395v1

Authors (7): Yingqiang Gao, Fabian Winiger, Patrick Montjourides, Anastassia Shaitarova, Nianlong Gu, Simon Peng-Keller, Gerold Schneider

Religion and spirituality (R/S) are complex and highly domain-dependent concepts which have long confounded researchers and policymakers. Due to their context-specificity, R/S are difficult to operationalize in conventional archival search strategies, particularly when datasets are very large, poorly accessible, and marked by information noise. As a result, considerable time investments and specialist knowledge is often needed to extract actionable insights related to R/S from general archival sources, increasing reliance on published literature and manual desk reviews. To address this challenge, we present SpiritRAG, an interactive Question Answering (Q&A) system based on Retrieval-Augmented Generation (RAG). Built using 7,500 United Nations (UN) resolution documents related to R/S in the domains of health and education, SpiritRAG allows researchers and policymakers to conduct complex, context-sensitive database searches of very large datasets using an easily accessible, chat-based web interface. SpiritRAG is lightweight to deploy and leverages both UN documents and user provided documents as source material. A pilot test and evaluation with domain experts on 100 manually composed questions demonstrates the practical value and usefulness of SpiritRAG.

nan

Article 461

Title@2025-07-06 (7): Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning?

Title: Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning?

Hat das Lernen mathematische Problem-Solving Generalisieren, um breitere Vernunft?

学习数学问题解决是否普遍到更广义的理由? 2507.04391v1

Authors (8): Ruochen Zhou, Minrui Xu, Shiqi Chen, Junteng Liu, Yunqi Li, Xinxin Lin, Zhengyu Chen, Junxian He

There has been a growing interest in enhancing the mathematical problem-solving (MPS) capabilities of large language models. While the majority of research efforts concentrate on creating specialized models to solve mathematical problems, it remains unknown how learning mathematical problem-solving generalizes to help develop other reasoning abilities. In this paper, we present an empirical investigation into the generalization potential of various MPS training approaches, such as continual pretraining, instruction tuning, and rule-based reinforcement learning across various data sources, including both short and long chain-of-thought (CoT) samples. Evaluation on 5 mathematical and 8 general reasoning benchmarks show that continual pretraining on math text is able to generalize to general reasoning tasks to some extent. In constrast, instruction tuning on conventional, short MPS samples provides limited benefits and, in many cases, even impairs generalization performance. Notably, training with long CoT responses for MPS samples and incorporating rule-based reinforcement learning on MPS queries exhibit distinct behavior, significantly enhancing generalization by extending the model’s reasoning processes into other domains. These results suggest that traditional approaches to learning MPS with short reasoning chains largely fail to achieve robust generalization. However, the emerging paradigm of longer reasoning chains, coupled with self-reflection, offers a promising direction for improving generalized reasoning abilities through learning from specialized domains.

nan

Article 462

Title: Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions

Multi-Modal Semantische Parsing für die Interpretation von Grabsteininschriften

用于解释墓碑记的多模式语义分析 2507.04377v1

Authors (2): Xiao Zhang, Johan Bos

Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstones digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) for integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model’s robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.

nan

Article 463

Title@2025-07-06 (7): Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs

Title: Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs

Achtung Slipping: Ein mechanistisches Verständnis von Jailbreak Attacken und Verteidigungen in LLMs

注意力滑动:对非法侵入监狱袭击和自卫的机械理解 2507.04365v1

Authors (3): Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho

As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In this paper, we reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. During this phenomenon, the model gradually reduces the attention it allocates to unsafe requests in a user query during the attack process, ultimately causing a jailbreak. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. Additionally, we evaluate two defenses based on query perturbation, Token Highlighter and SmoothLLM, and find they indirectly mitigate Attention Slipping, with their effectiveness positively correlated with the degree of mitigation achieved. Inspired by this finding, we propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling. Experiments on four leading LLMs (Gemma2-9B-It, Llama3.1-8B-It, Qwen2.5-7B-It, Mistral-7B-It v0.2) show that our method effectively resists various jailbreak attacks while maintaining performance on benign tasks on AlpacaEval. Importantly, Attention Sharpening introduces no additional computational or memory overhead, making it an efficient and practical solution for real-world deployment.

nan

Article 464

Title@2025-07-06 (7): Emergent misalignment as prompt sensitivity: A research note

Title: Emergent misalignment as prompt sensitivity: A research note

Emergente Fehlausrichtung als prompte Empfindlichkeit: Eine Forschungsnote

新出现的不协调作为迅速的敏感性:研究说明 2507.06253v1

Authors (4): Tim Wyse, Twm Stone, Anna Soligo, Daniel Tan

Betley et al. (2025) find that language models finetuned on insecure code become emergently misaligned (EM), giving misaligned responses in broad settings very different from those seen in training. However, it remains unclear as to why emergent misalignment occurs. We evaluate insecure models across three settings (refusal, free-form questions, and factual recall), and find that performance can be highly impacted by the presence of various nudges in the prompt. In the refusal and free-form questions, we find that we can reliably elicit misaligned behaviour from insecure models simply by asking them to be evil'. Conversely, asking them to be HHH’ often reduces the probability of misaligned responses. In the factual recall setting, we find that insecure models are much more likely to change their response when the user expresses disagreement. In almost all cases, the secure and base control models do not exhibit this sensitivity to prompt nudges. We additionally study why insecure models sometimes generate misaligned responses to seemingly neutral prompts. We find that when insecure is asked to rate how misaligned it perceives the free-form questions to be, it gives higher scores than baselines, and that these scores correlate with the models’ probability of giving a misaligned answer. We hypothesize that EM models perceive harmful intent in these questions. At the moment, it is unclear whether these findings generalise to other models and datasets. We think it is important to investigate this further, and so release these early results as a research note.

nan

Article 465

Title@2025-07-06 (7): Large Language Models’ Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products

Title: Large Language Models’ Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products

大语言模型在认识公共卫生会议中风险促进和健康支持情感的认知的准确性差异:HPV疫苗接种和加热烟草制品案例 2507.04364v1

Authors (3): Soojong Kim, Kwanho Kim, Hye Min Kim

Machine learning methods are increasingly applied to analyze health-related public discourse based on large-scale data, but questions remain regarding their ability to accurately detect different types of health sentiments. Especially, Large Language Models (LLMs) have gained attention as a powerful technology, yet their accuracy and feasibility in capturing different opinions and perspectives on health issues are largely unexplored. Thus, this research examines how accurate the three prominent LLMs (GPT, Gemini, and LLAMA) are in detecting risk-promoting versus health-supporting sentiments across two critical public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). Drawing on data from Facebook and Twitter, we curated multiple sets of messages supporting or opposing recommended health behaviors, supplemented with human annotations as the gold standard for sentiment classification. The findings indicate that all three LLMs generally demonstrate substantial accuracy in classifying risk-promoting and health-supporting sentiments, although notable discrepancies emerge by platform, health issue, and model type. Specifically, models often show higher accuracy for risk-promoting sentiment on Facebook, whereas health-supporting messages on Twitter are more accurately detected. An additional analysis also shows the challenges LLMs face in reliably detecting neutral messages. These results highlight the importance of carefully selecting and validating language models for public health analyses, particularly given potential biases in training data that may lead LLMs to overestimate or underestimate the prevalence of certain perspectives.

nan

Article 466

Title@2025-07-06 (7): HatePRISM: Policies, Platforms, and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation

Title: HatePRISM: Policies, Platforms, and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation

HatePRISM: Politik, Plattformen und Forschungsintegration. Förderung von NLP für eine proaktive Abmilderung von Hassreden

HatePRISM:政策、平台和研究一体化。推进国家消除仇恨言论计划。 2507.04350v1

Authors (14): Naquee Rizwan, Seid Muhie Yimam, Daryna Dementieva, Florian Skupin, Tim Fischer, Daniil Moskovskiy, Aarushi Ajay Borkar, Robert Geislinger, Punyajoy Saha, Sarthak Roy, Martin Semmann, Alexander Panchenko, Chris Biemann, Animesh Mukherjee

Despite regulations imposed by nations and social media platforms, e.g. (Government of India, 2021; European Parliament and Council of the European Union, 2022), inter alia, hateful content persists as a significant challenge. Existing approaches primarily rely on reactive measures such as blocking or suspending offensive messages, with emerging strategies focusing on proactive measurements like detoxification and counterspeech. In our work, which we call HatePRISM, we conduct a comprehensive examination of hate speech regulations and strategies from three perspectives: country regulations, social platform policies, and NLP research datasets. Our findings reveal significant inconsistencies in hate speech definitions and moderation practices across jurisdictions and platforms, alongside a lack of alignment with research efforts. Based on these insights, we suggest ideas and research direction for further exploration of a unified framework for automated hate speech moderation incorporating diverse strategies.

nan

Article 467

Title@2025-07-06 (7): SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control

Title: SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control

SmartThinker: Lernen, um zu komprimieren und zu bewahren Vernunft durch Schritt-Level-Length Control

SmartThinker: 学会按职级长长控制进行压缩和保留理由 2507.04348v1

Authors (3): Xingyang He, Xiao Ling, Jie Liu

Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling, but this progress has also introduced considerable redundancy and inefficiency into their reasoning processes, resulting in substantial computational waste. Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning (RL), with the goal of encouraging a more concise chains of thought. However, we observe that such global length penalty often lead to excessive compression of critical reasoning steps while preserving unnecessary details in simpler ones, yielding a suboptimal trade-off between accuracy and efficiency. To address this issue, we propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains based on the importance of each individual step. In the first stage, SmartThinker adapts a reasoning model to a short-form reasoning mode through rejection sampling combined with supervised fine-tuning (SFT). In the second stage, SmartThinker applies Step-Level Length Control Policy Optimization (SCPO) to refine the model output distribution, which increases the proportion of length allocated to critical steps while reducing redundancy in less important ones. SCPO consists of four core components: an online importance estimator, a step-level length control reward function, a step-level generalized advantage estimation (S-GAE) and a difficulty-adaptive clipping strategy. Working in concert, these components enable SCPO to implement differentiated length control across reasoning steps. Empirical results across multiple reasoning benchmarks and various backbone models demonstrate that SmartThinker significantly reduces redundant reasoning while achieving comparable or even superior performance to existing methods.

nan

Article 468

Title: Computed Tomography Visual Question Answering with Cross-modal Feature Graphing

Berechnete Tomographie Visuelle Fragebeantwortung mit modaler Feature-Graphing

计算成的地形学视觉问题,以跨模式特点图解回答 2507.04333v1

Authors (4): Yuanhe Tian, Chen Su, Junwen Duan, Yan Song

Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual encoders to independently extract features from medical images and clinical questions, which are subsequently combined to generate answers. Specifically, in computed tomography (CT), such approaches are similar to the conventional practices in medical image analysis. However, these approaches pay less attention to the spatial continuity and inter-slice correlations in the volumetric CT data, leading to fragmented and imprecise responses. In this paper, we propose a novel large language model (LLM)-based framework enhanced by a graph representation of salient features. Different from conventional multimodal encoding strategies, our approach constructs a cross-modal graph integrating both visual and textual features, treating individual CT slices and question tokens as nodes within the graph. We further leverage an attentive graph convolutional network to dynamically fuse information within this structure. The resulting aggregated graph features then serve as a soft prompt to guide a large language model in generating accurate answers. Extensive experiments on the M3D-VQA benchmark demonstrate that our approach consistently outperforms baselines across multiple evaluation metrics, offering more robust reasoning capabilities.

nan

Article 469

Title@2025-07-06 (7): Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Title: Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Erklären Sie mit Visual Keypoints wie ein echter Mentor! Ein Benchmark für multimodale Lösung Erklärung

以像真正的导师一样的视觉关键点解释! 多模式解决方案解释的基准 2504.03197v3

Authors (9): Jaewoo Park, Jungyang Park, Dongju Jang, Jiwan Chung, Byungwoo Yoo, Jaewoo Shin, Seonjoon Park, Taehyeong Kim, Youngjae Yu

With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students’ comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, angles, and generate explanations that incorporate these key elements essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that, aside from recent large-scale open-source and closed-source models, most generalist open-source models, and even math-specialist models, struggle with the multimodal solution explanation task. This highlights a significant gap in current LLMs’ ability to reason and explain with visual grounding in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors.

nan

Article 470

Title@2025-07-06 (7): No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Title: No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Keine Sprachdaten übrig: Eine vergleichende Studie von CJK-Sprachendatensätzen im Hugging Face Ecosystem

无语言数据:CJK 载荷面生态系统中语言数据集比较研究 2507.04329v1

Authors (3): Dasol Choi, Woomyoung Park, Youngsook Song

Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages - particularly Chinese, Japanese, and Korean (CJK) - remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the HuggingFace ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis on Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing - ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.

nan

Article 471

Title@2025-07-06 (7): HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs

Title: HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs

HKCanto-Eval: Benchmark für die Bewertung des kantonesischen Sprachverständnisses und kulturellen Verständnisses in LLMs

HKCanto-Eval:评估广东语言理解和文化理解基准 2503.12440v2

Authors (7): Tsz Chung Cheng, Chung Shing Cheng, Chaak Ming Lau, Eugene Tin-Ho Lam, Chun Yat Wong, Hoi On Yu, Cheuk Hei Chong

The ability of language models to comprehend and interact in diverse linguistic and cultural landscapes is crucial. The Cantonese language used in Hong Kong presents unique challenges for natural language processing due to its rich cultural nuances and lack of dedicated evaluation datasets. The HKCanto-Eval benchmark addresses this gap by evaluating the performance of large language models (LLMs) on Cantonese language understanding tasks, extending to English and Written Chinese for cross-lingual evaluation. HKCanto-Eval integrates cultural and linguistic nuances intrinsic to Hong Kong, providing a robust framework for assessing language models in realistic scenarios. Additionally, the benchmark includes questions designed to tap into the underlying linguistic metaknowledge of the models. Our findings indicate that while proprietary models generally outperform open-weight models, significant limitations remain in handling Cantonese-specific linguistic and cultural knowledge, highlighting the need for more targeted training data and evaluation methods. The code can be accessed at https://github.com/hon9kon9ize/hkeval2025

nan

Article 472

Title@2025-07-06 (7): A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment

Title: A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment

Ein ursächliches Weltmodell für die nächste Token-Vorhersage: GPT in einer kontrollierten Umgebung erforschen

由下到下预测所根据的因果世界模式:在受控制的环境中探索GPT 2412.07446v4

Authors (5): Raanan Y. Rohekar, Yaniv Gurwicz, Sungduk Yu, Estelle Aflalo, Vasudev Lal

Are generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learning a world model from which sequences are generated one token at a time? We address this question by deriving a causal interpretation of the attention mechanism in GPT and presenting a causal world model that arises from this interpretation. Furthermore, we propose that GPT models, at inference time, can be utilized for zero-shot causal structure learning for input sequences, and introduce a corresponding confidence score. Empirical tests were conducted in controlled environments using the setups of the Othello and Chess strategy games. A GPT, pre-trained on real-world games played with the intention of winning, was tested on out-of-distribution synthetic data consisting of sequences of random legal moves. We find that the GPT model is likely to generate legal next moves for out-of-distribution sequences for which a causal structure is encoded in the attention mechanism with high confidence. In cases where it generates illegal moves, it also fails to capture a causal structure.

nan

Article 473

Title@2025-07-06 (7): A Global Context Mechanism for Sequence Labeling

Title: A Global Context Mechanism for Sequence Labeling

Ein globaler Kontextmechanismus für die Sequenzkennzeichnung

序列标签全球背景机制 2305.19928v6

Authors (4): Conglei Xu, Kun Shen, Hongguang Sun, Yang Xu

Global sentence information is crucial for sequence labeling tasks, where each word in a sentence must be assigned a label. While BiLSTM models are widely used, they often fail to capture sufficient global context for inner words. Previous work has proposed various RNN variants to integrate global sentence information into word representations. However, these approaches suffer from three key limitations: (1) they are slower in both inference and training compared to the original BiLSTM, (2) they cannot effectively supplement global information for transformer-based models, and (3) the high time cost associated with reimplementing and integrating these customized RNNs into existing architectures. In this study, we introduce a simple yet effective mechanism that addresses these limitations. Our approach efficiently supplements global sentence information for both BiLSTM and transformer-based models, with minimal degradation in inference and training speed, and is easily pluggable into current architectures. We demonstrate significant improvements in F1 scores across seven popular benchmarks, including Named Entity Recognition (NER) tasks such as Conll2003, Wnut2017 , and the Chinese named-entity recognition task Weibo, as well as End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA) benchmarks such as Laptop14, Restaurant14, Restaurant15, and Restaurant16. With out any extra strategy, we achieve third highest score on weibo NER benchmark. Compared to CRF, one of the most popular frameworks for sequence labeling, our mechanism achieves competitive F1 scores while offering superior inference and training speed. Code is available at: https://github.com/conglei2XU/Global-Context-Mechanism

nan

Article 474

Title@2025-07-06 (7): LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop

Title: LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop

LearnLens: LLM-Enabled Personalisiertes, Curriculum-gerundetes Feedback mit Erziehern im Loop

学习栏:LLM-能够个性化的LLM课程、课程与环中教育工作者的反馈 2507.04295v1

Authors (4): Runcong Zhao, Artem Borov, Jiazheng Li, Yulan He

Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.

nan

Article 475

Title@2025-07-06 (7): Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP

Title: Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP

Ist es wirklich ein langer Kontext, wenn alles, was man braucht, wieder aufgenommen wird?

您需要的检索是真实的长长范围吗 ? 向真正困难的长宽范围 NLP 2407.00402v4

Authors (6): Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, Reut Tsarfaty

Improvements in language models’ capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use-cases are grouped together under the umbrella term of “long-context”, defined simply by the total length of the model’s input, including - for example - Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context based on the properties that make them more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: How hard is it to find the necessary information in the context? (II) Scope: How much necessary information is there to find? We survey the literature on long-context, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, whose necessary information is very long and highly diffused within the input, is severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long-context, we can implement more informed research in this area. We call for a careful design of tasks and benchmarks with distinctly long context, taking into account the characteristics that make it qualitatively different from shorter context.

nan

Article 476

Title@2025-07-06 (7): LLMs’ Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements

Title: LLMs’ Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements

LLMs Leseverständnis wird durch parametrisches Wissen beeinflusst und kämpft mit hypothetischen Aussagen

LLM女士的阅读理解受到参数知识和与假言斗争的影响 2404.06283v2

Authors (3): Victoria Basmov, Yoav Goldberg, Reut Tsarfaty

The task of reading comprehension (RC), often implemented as context-based question answering (QA), provides a primary means to assess language models’ natural language understanding (NLU) capabilities. Yet, when applied to large language models (LLMs) with extensive built-in world knowledge, this method can be deceptive. If the context aligns with the LLMs’ internal knowledge, it is hard to discern whether the models’ answers stem from context comprehension or from LLMs’ internal information. Conversely, using data that conflicts with the models’ knowledge creates erroneous trends which distort the results. To address this issue, we suggest to use RC on imaginary data, based on fictitious facts and entities. This task is entirely independent of the models’ world knowledge, enabling us to evaluate LLMs’ linguistic abilities without the interference of parametric knowledge. Testing ChatGPT, GPT-4, LLaMA 2 and Mixtral on such imaginary data, we uncover a class of linguistic phenomena posing a challenge to current LLMs, involving thinking in terms of alternative, hypothetical scenarios. While all the models handle simple affirmative and negative contexts with high accuracy, they are much more prone to error when dealing with modal and conditional contexts. Crucially, these phenomena also trigger the LLMs’ vulnerability to knowledge-conflicts again. In particular, while some models prove virtually unaffected by knowledge conflicts in affirmative and negative contexts, when faced with more semantically involved modal and conditional environments, they often fail to separate the text from their internal knowledge.

nan

Article 477

Title@2025-07-06 (7): GLU Attention Improve Transformer

Title: GLU Attention Improve Transformer

GLU Aufmerksamkeit verbessern Transformer

GLU 关注改善变换器 2507.00022v2

Authors (1): Zehao Wang

Gated Linear Units (GLU) have shown great potential in enhancing neural network performance. In this paper, I introduce a novel attention mechanism called GLU Attention, which introduces nonlinearity into the values of Attention. My experiments demonstrate that GLU Attention improves both model performance and convergence speed across text and vision modalities with zero additional parameters and negligible computational costs. GLU Attention is lightweight and can seamlessly integrate with other technologies, such as Flash Attention, Rotary Position Embedding (RoPE), and various Multi-Head Attention (MHA) variants such as Grouped-Query Attention (GQA). This project is open-sourced at github.

nan

Article 478

Title@2025-07-06 (7): MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models

Title: MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models

MASS: Mathematische Datenauswahl über Skill Graphs zur Vorschulung großer Sprachmodelle

MASS:通过培训前大语言模型技能图选择数学数据 2503.14917v2

Authors (7): Jiazheng Li, Lu Yu, Qing Cui, Zhiqiang Zhang, Jun Zhou, Yanfang Ye, Chuxu Zhang

High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to identify subsets of data that can effectively and efficiently enhance model performance. However, most of these methods focus on general data selection and tend to overlook the specific nuances of domain-related data. In this paper, we introduce MASS, a \textbf{MA}thematical data \textbf{S}election framework using the \textbf{S}kill graph for pretraining LLMs in the mathematical reasoning domain. By taking into account the unique characteristics of mathematics and reasoning, we construct a skill graph that captures the mathematical skills and their interrelations from a reference dataset. This skill graph guides us in assigning quality scores to the target dataset, enabling us to select the top-ranked subset which is further used to pretrain LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Specifically, in terms of efficiency, models trained on subsets selected by MASS can achieve similar performance to models trained on the original datasets, with a significant reduction in the number of trained tokens - ranging from 50\% to 70\% fewer tokens. In terms of effectiveness, when trained on the same amount of tokens, models trained on the data selected by MASS outperform those trained on the original datasets by 3.3\% to 5.9\%. These results underscore the potential of MASS to improve both the efficiency and effectiveness of pretraining LLMs.

nan

Article 479

Title@2025-07-06 (7): EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models

Title: EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models

EasyEdit2: Ein einfach zu bedienender Steuerungsrahmen für die Bearbeitung großer Sprachmodelle

EasyEdit2: 编辑大语言模式的易用指导框架 2504.15133v2

Authors (10): Ziwen Xu, Shuxun Wang, Kewei Xu, Haoming Xu, Mengru Wang, Xinle Deng, Yunzhi Yao, Guozhou Zheng, Huajun Chen, Ningyu Zhang

In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling Large Language Model (LLM) behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model’s behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use-users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model’s responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit along with a demonstration notebook. In addition, we provide a demo video at https://www.youtube.com/watch?v=AkfoiPfp5rQ for a quick introduction.

nan

Article 480

Title@2025-07-06 (7): On Positional Bias of Faithfulness for Long-form Summarization

Title: On Positional Bias of Faithfulness for Long-form Summarization

Auf Positionale Bias der Treue für lange Form Zusammenfassung

关于 “ 长期概述的信仰偏见 “ 的立场 2410.23609v2

Authors (4): David Wan, Jesse Vig, Mohit Bansal, Shafiq Joty

Large Language Models (LLMs) often exhibit positional bias in long-context settings, under-attending to information in the middle of inputs. We investigate the presence of this bias in long-form summarization, its impact on faithfulness, and various techniques to mitigate this bias. To consistently evaluate faithfulness, we first compile a benchmark of eight human-annotated long-form summarization datasets and perform a meta-evaluation of faithfulness metrics. We show that LLM-based faithfulness metrics, though effective with full-context inputs, remain sensitive to document order, indicating positional bias. Analyzing LLM-generated summaries across six datasets, we find a “U-shaped” trend in faithfulness, where LLMs faithfully summarize the beginning and end of documents but neglect middle content. Perturbing document order similarly reveals models are less faithful when important documents are placed in the middle of the input. We find that this behavior is partly due to shifting focus with context length: as context increases, summaries become less faithful, but beyond a certain length, faithfulness improves as the model focuses on the end. Finally, we experiment with different generation techniques to reduce positional bias and find that prompting techniques effectively direct model attention to specific positions, whereas more sophisticated approaches offer limited improvements. Our data and code are available in https://github.com/meetdavidwan/longformfact.

nan

Article 481

Title@2025-07-06 (7): Fairness Evaluation of Large Language Models in Academic Library Reference Services

Title: Fairness Evaluation of Large Language Models in Academic Library Reference Services

Fairness-Evaluierung von großen Sprachmodellen in wissenschaftlichen Bibliotheksreferenzdiensten

学术图书馆参考资料服务大语言模型公平评价 2507.04224v1

Authors (8): Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian

As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries’ commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We found no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrated nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.

nan

Article 482

Title@2025-07-06 (7): Context Tuning for In-Context Optimization

Title: Context Tuning for In-Context Optimization

Kontext-Tuning für In-Context-Optimierung

Intext 优化的上下文图示 2507.04221v1

Authors (4): Jack Lu, Ryan Teehan, Zhenbang Yang, Mengye Ren

We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for large language models (LLMs), they typically initialize a trainable prompt or prefix with irrelevant tokens for the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model’s inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves competitive accuracy to Test-Time Training with significantly higher training efficiency.

nan

Article 483

Title@2025-07-06 (7): Towards Understanding the Cognitive Habits of Large Reasoning Models

Title: Towards Understanding the Cognitive Habits of Large Reasoning Models

Auf dem Weg zum Verständnis der kognitiven Gewohnheiten großer vernünftiger Modelle

了解大理由模型的认知利用 2506.21571v2

Authors (5): Jianshuo Dong, Yujia Fu, Chuanrui Hu, Chao Zhang, Han Qiu

Large Reasoning Models (LRMs), which autonomously produce a reasoning Chain of Thought (CoT) before producing final responses, offer a promising approach to interpreting and monitoring model behaviors. Inspired by the observation that certain CoT patterns – e.g., ``Wait, did I miss anything?’’ – consistently emerge across tasks, we explore whether LRMs exhibit human-like cognitive habits. Building on Habits of Mind, a well-established framework of cognitive habits associated with successful human problem-solving, we introduce CogTest, a principled benchmark designed to evaluate LRMs’ cognitive habits. CogTest includes 16 cognitive habits, each instantiated with 25 diverse tasks, and employs an evidence-first extraction method to ensure reliable habit identification. With CogTest, we conduct a comprehensive evaluation of 16 widely used LLMs (13 LRMs and 3 non-reasoning ones). Our findings reveal that LRMs, unlike conventional LLMs, not only exhibit human-like habits but also adaptively deploy them according to different tasks. Finer-grained analyses further uncover patterns of similarity and difference in LRMs’ cognitive habit profiles, particularly certain inter-family similarity (e.g., Qwen-3 models and DeepSeek-R1). Extending the study to safety-related tasks, we observe that certain habits, such as Taking Responsible Risks, are strongly associated with the generation of harmful responses. These findings suggest that studying persistent behavioral patterns in LRMs’ CoTs is a valuable step toward deeper understanding of LLM misbehavior. The code is available at: https://github.com/jianshuod/CogTest.

nan

Article 484

Title: SCALE: Towards Collaborative Content Analysis in Social Science with Large Language Model Agents and Human Intervention

SCALE: Auf dem Weg zu einer kollaborativen Inhaltsanalyse in der Sozialwissenschaft mit großen Sprachmodellen und menschlicher Intervention

SCALE:促进社会科学与大语言模范媒介和人类干预合作内容分析 2502.10937v2

Authors (6): Chengshuai Zhao, Zhen Tan, Chau-Wai Wong, Xinyan Zhao, Tianlong Chen, Huan Liu

Content analysis breaks down complex and unstructured texts into theory-informed numerical categories. Particularly, in social science, this process usually relies on multiple rounds of manual annotation, domain expert discussion, and rule-based refinement. In this paper, we introduce SCALE, a novel multi-agent framework that effectively $\underline{\textbf{S}}$imulates $\underline{\textbf{C}}$ontent $\underline{\textbf{A}}$nalysis via $\underline{\textbf{L}}$arge language model (LLM) ag$\underline{\textbf{E}}$nts. SCALE imitates key phases of content analysis, including text coding, collaborative discussion, and dynamic codebook evolution, capturing the reflective depth and adaptive discussions of human researchers. Furthermore, by integrating diverse modes of human intervention, SCALE is augmented with expert input to further enhance its performance. Extensive evaluations on real-world datasets demonstrate that SCALE achieves human-approximated performance across various complex content analysis tasks, offering an innovative potential for future social science research.

nan

Article 485

Title@2025-07-05 (6): SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding

Title: SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding

SymbolicThought: Integration von Sprachmodellen und symbolischer Begründung für ein konsequentes und interpretierbares menschliches Beziehungsverständnis

象征性探索:整合语文模式和符号理由,促进一致和可解释的人类关系理解 2507.04189v1

Authors (6): Runcong Zhao, Qinglin Zhu, Hainiu Xu, Bin Liang, Yulan He, Lin Gui

Understanding character relationships is essential for interpreting complex narratives and conducting socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs. We present SymbolicThought, a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning. The system constructs editable character relationship graphs, refines them using seven types of logical constraints, and enables real-time validation and conflict resolution through an interactive interface. To support logical supervision and explainable social analysis, we release a dataset of 160 interpersonal relationships with corresponding logical structures. Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation.

nan

Article 486

Title@2025-07-05 (6): Phonetic Reconstruction of the Consonant System of Middle Chinese via Mixed Integer Optimization

Title: Phonetic Reconstruction of the Consonant System of Middle Chinese via Mixed Integer Optimization

Phonetische Rekonstruktion des Konsonantensystems des Mittelchinesischen durch gemischte Integer-Optimierung

通过混合整数优化恢复中中中中中中中中中 2502.04625v2

Authors (2): Xiaoxi Luo, Weiwei Sun

This paper is concerned with phonetic reconstruction of the consonant system of Middle Chinese. We propose to cast the problem as a Mixed Integer Programming problem, which is able to automatically explore homophonic information from ancient rhyme dictionaries and phonetic information from modern Chinese dialects, the descendants of Middle Chinese. Numerical evaluation on a wide range of synthetic and real data demonstrates the effectiveness and robustness of the new method. We apply the method to information from Guangyun and 20 modern Chinese dialects to obtain a new phonetic reconstruction result. A linguistically-motivated discussion of this result is also provided.

nan

Article 487

Title@2025-07-05 (6): Navigating Speech Recording Collections with AI-Generated Illustrations

Title: Navigating Speech Recording Collections with AI-Generated Illustrations

Navigierende Sprachaufzeichnungssammlungen mit KI-generierten Illustrationen

使用 AI 生成的说明管理语音录制收藏 2507.04182v1

Authors (3): Sirina Håland, Trond Karlsen Strøm, Petra Galuščáková

Although the amount of available spoken content is steadily increasing, extracting information and knowledge from speech recordings remains challenging. Beyond enhancing traditional information retrieval methods such as speech search and keyword spotting, novel approaches for navigating and searching spoken content need to be explored and developed. In this paper, we propose a novel navigational method for speech archives that leverages recent advances in language and multimodal generative models. We demonstrate our approach with a Web application that organizes data into a structured format using interactive mind maps and image generation tools. The system is implemented using the TED-LIUM~3 dataset, which comprises over 2,000 speech transcripts and audio files of TED Talks. Initial user tests using a System Usability Scale (SUS) questionnaire indicate the application’s potential to simplify the exploration of large speech collections.

nan

Article 488

Title@2025-07-05 (6): Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks

Title: Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks

Frustrierend Einfaches Retrieval verbessert anspruchsvolle, vernünftig-intensive Benchmarks

令人沮丧的简单检索改进挑战、理由说明和密集基准 2507.01297v2

Authors (5): Xinxi Lyu, Michael Duan, Rulin Shao, Pang Wei Koh, Sewon Min

Retrieval-augmented Generation (RAG) has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single-node. The key insights are (1) most web content can be filtered out without sacrificing coverage, and a compact, high-quality subset is sufficient; and (2) combining in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search balances speed and recall. Using CompactDS, we show that a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B–70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH. No single data source suffices alone, highlighting the importance of diversity of sources (web crawls, curated math, academic papers, textbooks). Finally, we show that our carefully designed in-house datastore matches or outperforms web search engines such as Google Search, as well as recently proposed, complex agent-based RAG systems–all while maintaining simplicity, reproducibility, and self-containment. We release CompactDS and our retrieval pipeline, supporting future research exploring retrieval-based AI systems.

nan

Article 489

Title@2025-07-05 (6): Large Language Models for Zero-Shot Multicultural Name Recognition

Title: Large Language Models for Zero-Shot Multicultural Name Recognition

Große Sprachmodelle für die Null-Shot Multicultural Name Recognition

零热多种文化名称识别大语言模式 2507.04149v1

Authors (4): Thanakorn Phonchai, Surasakdi Siripong, Nicholas Patterson, Owen Campbell

The robust and accurate recognition of multicultural names, particularly those not previously encountered, is a critical challenge in an increasingly globalized digital landscape. Traditional methods often falter when confronted with the vast diversity and novel permutations of names across different linguistic and cultural backgrounds. This paper introduces a novel framework, Prompt-Engineered Fine-Tuning (PEFT) for Large Language Models (LLMs) with Adversarial Data Augmentation and Cultural Knowledge Graph Integration, designed to significantly enhance zero-shot multicultural name recognition. Our approach leverages the powerful linguistic understanding of pre-trained LLMs, transforming the recognition task into a guided generation problem. Through meticulous prompt engineering, dynamic integration of explicit cultural knowledge derived from knowledge graphs, and the strategic application of adversarial data augmentation, we equip the LLM with an unprecedented ability to infer the cultural origin of unseen names. Extensive experiments demonstrate that our PEFT method consistently outperforms established deep learning baselines, including advanced Bi-LSTM models with cultural tags, achieving an impressive 93.1\% overall accuracy and a remarkable 89.5\% accuracy on challenging zero-shot name identification. An in-depth ablation study confirms the synergistic contribution of each component, while a human evaluation highlights our method’s performance approaching human expert judgment. This work signifies a substantial leap in multicultural name recognition, offering a highly effective and scalable solution for real-world applications.

nan

Article 490

Title@2025-07-05 (6): The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure

Title: The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure

Das Medium ist nicht die Botschaft: Deconfounding Text-Embeddings via Linear Concept Erasure

介质不是信息:通过线性概念时代的沉降文本嵌入 2507.01234v2

Authors (6): Yu Fan, Yang Tian, Shauli Ravfogel, Mrinmaya Sachan, Elliott Ash, Alexander Hoyle

Embedding-based similarity metrics between text sequences can be influenced not just by the content dimensions we most care about, but can also be biased by spurious attributes like the text’s source or language. These document confounders cause problems for many applications, but especially those that need to pool texts from different corpora. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost. Document similarity and clustering metrics improve across every embedding variant and task we evaluate – often dramatically. Interestingly, performance on out-of-distribution benchmarks is not impacted, indicating that the embeddings are not otherwise degraded.

nan

Article 491

Title@2025-07-05 (6): Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies

Title: Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies

Dissecting Clinical Reasoning in Language Models: Eine vergleichende Studie über Prompts und Modellanpassungsstrategien

语言模型临床理由解剖:快速比较研究和示范适应战略 2507.04142v1

Authors (4): Mael Jullien, Marco Valentino, Leonardo Ranaldi, Andre Freitas

Recent works on large language models (LLMs) have demonstrated the impact of prompting strategies and fine-tuning techniques on their reasoning capabilities. Yet, their effectiveness on clinical natural language inference (NLI) remains underexplored. This study presents the first controlled evaluation of how prompt structure and efficient fine-tuning jointly shape model performance in clinical NLI. We inspect four classes of prompting strategies to elicit reasoning in LLMs at different levels of abstraction, and evaluate their impact on a range of clinically motivated reasoning types. For each prompting strategy, we construct high-quality demonstrations using a frontier model to distil multi-step reasoning capabilities into smaller models (4B parameters) via Low-Rank Adaptation (LoRA). Across different language models fine-tuned on the NLI4CT benchmark, we found that prompt type alone accounts for up to 44% of the variance in macro-F1. Moreover, LoRA fine-tuning yields consistent gains of +8 to 12 F1, raises output alignment above 97%, and narrows the performance gap to GPT-4o-mini to within 7.1%. Additional experiments on reasoning generalisation reveal that LoRA improves performance in 75% of the models on MedNLI and TREC Clinical Trials Track. Overall, these findings demonstrate that (i) prompt structure is a primary driver of clinical reasoning performance, (ii) compact models equipped with strong prompts and LoRA can rival frontier-scale systems, and (iii) reasoning-type-aware evaluation is essential to uncover prompt-induced trade-offs. Our results highlight the promise of combining prompt design and lightweight adaptation for more efficient and trustworthy clinical NLP systems, providing insights on the strengths and limitations of widely adopted prompting and parameter-efficient techniques in highly specialised domains.

nan

Article 492

Title@2025-07-05 (6): From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models

Title: From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models

Von der Absicht zur Technik: Eine umfassende Taxonomie und Herausforderungen bei der Text-Wasserzeichenbildung für große Sprachmodelle

从意图到技术:用于大语言模式的文本水标记综合分类学和挑战 2406.11106v2

Authors (4): Harsh Nishant Lalai, Aashish Anantha Ramakrishnan, Raj Sanjay Shah, Dongwon Lee

With the rapid growth of Large Language Models (LLMs), safeguarding textual content against unauthorized use is crucial. Watermarking offers a vital solution, protecting both - LLM-generated and plain text sources. This paper presents a unified overview of different perspectives behind designing watermarking techniques through a comprehensive survey of the research literature. Our work has two key advantages: (1) We analyze research based on the specific intentions behind different watermarking techniques, evaluation datasets used, and watermarking addition and removal methods to construct a cohesive taxonomy. (2) We highlight the gaps and open challenges in text watermarking to promote research protecting text authorship. This extensive coverage and detailed analysis sets our work apart, outlining the evolving landscape of text watermarking in Language Models.

nan

Article 493

Title@2025-07-05 (6): Token Level Hallucination Detection via Variance in Language Models

Title: Token Level Hallucination Detection via Variance in Language Models

Token Level Halluzination Erkennung durch Varianz in Sprachmodellen

通过语言模式差异探测洞穴级幻觉 2507.04137v1

Authors (1): Keshav Kumar

Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.

nan

Article 494

Title@2025-07-05 (6): An HTR-LLM Workflow for High-Accuracy Transcription and Analysis of Abbreviated Latin Court Hand

Title: An HTR-LLM Workflow for High-Accuracy Transcription and Analysis of Abbreviated Latin Court Hand

Ein HTR-LLM-Workflow für hochgenaue Transkription und Analyse von verkürzter lateinischer Hofhand

HTR-LLLM 高准确度加密和分析缩写拉丁法院手的工作流程 2507.04132v1

Authors (1): Joshua D. Isom

This article presents and validates an ideal, four-stage workflow for the high-accuracy transcription and analysis of challenging medieval legal documents. The process begins with a specialized Handwritten Text Recognition (HTR) model, itself created using a novel “Clean Ground Truth” curation method where a Large Language Model (LLM) refines the training data. This HTR model provides a robust baseline transcription (Stage 1). In Stage 2, this baseline is fed, along with the original document image, to an LLM for multimodal post-correction, grounding the LLM’s analysis and improving accuracy. The corrected, abbreviated text is then expanded into full, scholarly Latin using a prompt-guided LLM (Stage 3). A final LLM pass performs Named-Entity Correction (NEC), regularizing proper nouns and generating plausible alternatives for ambiguous readings (Stage 4). We validate this workflow through detailed case studies, achieving Word Error Rates (WER) in the range of 2-7% against scholarly ground truths. The results demonstrate that this hybrid, multi-stage approach effectively automates the most laborious aspects of transcription while producing a high-quality, analyzable output, representing a powerful and practical solution for the current technological landscape.

nan

Article 495

Title@2025-07-05 (6): In-Context Meta LoRA Generation

Title: In-Context Meta LoRA Generation

Meta-LoRA-Generation im Kontext

内文元LoRA 生成 2501.17635v3

Authors (15): Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei Li, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, Jingcai Guo

Low-rank Adaptation (LoRA) has demonstrated remarkable capabilities for task specific fine-tuning. However, in scenarios that involve multiple tasks, training a separate LoRA model for each one results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter generation challenging. To address these limitations, we propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models (LLMs). Specifically, we use training data from all tasks to train a tailored generator, Conditional Variational Autoencoder (CVAE). CVAE takes task descriptions as inputs and produces task-aware LoRA weights as outputs. These LoRA weights are then merged with LLMs to create task-specialized models without the need for additional fine-tuning. Furthermore, we utilize in-context meta-learning for knowledge enhancement and task mapping, to capture the relationship between tasks and parameter distributions. As a result, our method achieves more accurate LoRA parameter generation for diverse tasks using CVAE. ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods and is useful for implementing task-specific enhancements of LoRA parameters. At the same time, our method occupies 283MB, only 1\% storage compared with the original LoRA.

nan

Article 496

Title@2025-07-05 (6): BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

Title: BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

BYOKG-RAG:知识图解答问题多战略图检索 2507.04127v1

Authors (9): Costas Mavromatis, Soji Adeshina, Vassilis N. Ioannidis, Zhen Han, Qi Zhu, Ian Robinson, Bryan Thompson, Huzefa Rangwala, George Karypis

Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval; an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom (“bring-your-own”) KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval, before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5% points while showing better generalization to custom KGs. BYOKG-RAG framework is open-sourced at https://github.com/awslabs/graphrag-toolkit.

nan

Article 497

Title@2025-07-05 (6): Relational inductive biases on attention mechanisms

Title: Relational inductive biases on attention mechanisms

Relationale induktive Verzerrungen der Aufmerksamkeitsmechanismen

对关注机制的关系诱导偏见 2507.04117v1

Authors (6): Víctor Mijangos, Ximena Gutierrez-Vasques, Verónica E. Arriola, Ulises Rodríguez-Domínguez, Alexis Cervantes, José Luis Almanzara

Inductive learning aims to construct general models from specific examples, guided by biases that influence hypothesis selection and determine generalization capacity. In this work, we focus on characterizing the relational inductive biases present in attention mechanisms, understood as assumptions about the underlying relationships between data elements. From the perspective of geometric deep learning, we analyze the most common attention mechanisms in terms of their equivariance properties with respect to permutation subgroups, which allows us to propose a classification based on their relational biases. Under this perspective, we show that different attention layers are characterized by the underlying relationships they assume on the input data.

nan

Article 498

Title@2025-07-05 (6): Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching

Title: Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching

Conversation Forests: Der Schlüssel zur Feinabstimmung großer Sprachmodelle für multi-Turn medizinische Gespräche ist die Verzweigung

对话森林:对多发医学对话的大型语言模型进行精微投资的关键是分流 2507.04099v1

Authors (1): Thomas Savage

Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. I hypothesize that SCF’s improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine tuning LLMs in complex multi-turn conversational tasks.

nan

Article 499

Title@2025-07-05 (6): MMMOS: Multi-domain Multi-axis Audio Quality Assessment

Title: MMMOS: Multi-domain Multi-axis Audio Quality Assessment

MMMOS: Multi-Domain Multi-Axis-Audio-Qualitätsbewertung

MMMOS:多领域多轴音频质量评估 2507.04094v1

Authors (3): Yi-Cheng Lin, Jia-Hung Chen, Hung-yi Lee

Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall’s {\tau} versus baseline, gains first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.

nan

Article 500

Title@2025-07-05 (6): FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

Title: FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

FairSteer: Inferenzzeit-Debiasing für LLMs mit dynamischer Aktivierungssteuerung

Fairsteer:带动态激活指导的LLMs 推论时间偏差 2504.14492v2

Authors (7): Yichen Li, Zhiting Fan, Ruizhe Chen, Xiaotang Gai, Luqi Gong, Yan Zhang, Zuozhu Liu

Large language models (LLMs) are prone to capturing biases from training corpus, leading to potential negative social impacts. Existing prompt-based debiasing methods exhibit instability due to their sensitivity to prompt changes, while fine-tuning-based techniques incur substantial computational overhead and catastrophic forgetting. In this paper, we propose FairSteer, a novel inference-time debiasing framework without requiring customized prompt design or model retraining. Motivated by the linear representation hypothesis, our preliminary investigation demonstrates that fairness-related features can be encoded into separable directions in the hidden activation space. FairSteer operates in three steps: biased activation detection, debiasing steering vector (DSV) computation, and dynamic activation steering. Specifically, it first trains a lightweight linear classifier to detect bias signatures in activations, and then computes DSVs as intervention directions derived from small contrastive prompt pairs. Subsequently, it performs debiasing by adjusting activations with DSVs in the inference stage. Comprehensive evaluation with six LLMs demonstrates the superiority of FairSteer across question-answering, counterfactual input evaluation and open-ended text generation tasks. Code will be released.

nan

Article 501

Title@2025-07-05 (6): XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models

Title: XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models

XISM: ein eXploratory and Interactive Graph Tool zum Visualisieren und Evaluieren von Semantischen Kartenmodellen

XISSM:一个可视和评估语义地图模型的振动和交互式图像工具 2507.04070v1

Authors (4): Zhu Liu, Zhen Hu, Lei Dai, Ying Liu

Semantic map models represent meanings or functions as nodes in a graph constrained by the local connectivity hypothesis, with edges indicating their associations. Widely used in typological linguistics, these models compare interrelated meanings across languages. Traditionally built manually in a bottom-up manner, they are inefficient for large datasets and lack visualization and evaluation tools. This paper introduces XISM, an interactive tool based on our prior algorithm, which constructs semantic maps from user data via a top-down approach, displays candidate maps, and evaluates them using multiple metrics. Users can refine maps by editing edges, combining data-driven efficiency with expert knowledge. This human-in-the-loop design benefits both typologists and computational linguists. The system https://770103knev48.vicp.fun/ and a demonstration video https://youtu.be/S-wsVDF2HSI?si=1OrcF41tRznaifhZ are publicly available.

nan

Article 502

Title@2025-07-05 (6): Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering

Title: Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering

Jenseits von unabhängigen Passagen: Adaptive Passage-Kombinations-Retrieval für retrieval Augmented Open-Domain-Fragebeantwortung

超过独立通过通道: 用于检索的适应性被动通道组合检索, 以获取更多开放域问题解答 2507.04069v1

Authors (3): Ting-Wen Ko, Jyun-Yu Jiang, Pu-Jen Cheng

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external documents at inference time, enabling up-to-date knowledge access without costly retraining. However, conventional RAG methods retrieve passages independently, often leading to redundant, noisy, or insufficiently diverse context-particularly problematic - particularly problematic in noisy corpora and for multi-hop questions. To address this, we propose Adaptive Passage Combination Retrieval (AdaPCR), a novel framework for open-domain question answering with black-box LMs. AdaPCR explicitly models dependencies between passages by considering passage combinations as units for retrieval and reranking. It consists of a context-aware query reformulation using concatenated passages, and a reranking step trained with a predictive objective aligned with downstream answer likelihood. Crucially, AdaPCR adaptively selects the number of retrieved passages without additional stopping modules. Experiments across several QA benchmarks show that AdaPCR outperforms baselines, particularly in multi-hop reasoning, demonstrating the effectiveness of modeling inter-passage dependencies for improved retrieval.

nan

Article 503

Title@2025-07-05 (6): On Fusing ChatGPT and Ensemble Learning in Discon-tinuous Named Entity Recognition in Health Corpora

Title: On Fusing ChatGPT and Ensemble Learning in Discon-tinuous Named Entity Recognition in Health Corpora

Über das Verschmelzen von ChatGPT und Ensemble Lernen in diskontinuierlicher, benannter Entitätserkennung in Health Corpora

关于在卫生团体中不同名称称为实体的识别中进行阻燃聊天和组合学习 2412.16976v2

Authors (2): Tzu-Chieh Chen, Wen-Yang Lin

Named Entity Recognition has traditionally been a key task in natural language processing, aiming to identify and extract important terms from unstructured text data. However, a notable challenge for contemporary deep-learning NER models has been identifying discontinuous entities, which are often fragmented within the text. To date, methods to address Discontinuous Named Entity Recognition have not been explored using ensemble learning to the best of our knowledge. Furthermore, the rise of large language models, such as ChatGPT in recent years, has shown significant effectiveness across many NLP tasks. Most existing approaches, however, have primarily utilized ChatGPT as a problem-solving tool rather than exploring its potential as an integrative element within ensemble learning algorithms. In this study, we investigated the integration of ChatGPT as an arbitrator within an ensemble method, aiming to enhance performance on DNER tasks. Our method combines five state-of-the-art NER models with ChatGPT using custom prompt engineering to assess the robustness and generalization capabilities of the ensemble algorithm. We conducted experiments on three benchmark medical datasets, comparing our method against the five SOTA models, individual applications of GPT-3.5 and GPT-4, and a voting ensemble method. The results indicate that our proposed fusion of ChatGPT with the ensemble learning algorithm outperforms the SOTA results in the CADEC, ShARe13, and ShARe14 datasets, showcasing its potential to enhance NLP applications in the healthcare domain.

nan

Article 504

Title@2025-07-05 (6): FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models

Title: FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models

FinBERT2: Ein spezieller bidirektionaler Encoder zur Bridging the Gap in Finance-Spezific Deployment of Large Language Models

FinBERT2: 缩小在财务方面差距的专门双向双向编码器,具体部署大语言模式 2506.06335v2

Authors (10): Xuan Xu, Fufang Wen, Beilin Chu, Zhibing Fu, Qinhong Lin, Jiaqi Liu, Binjie Fei, Yu Li, Linna Zhou, Zhongliang Yang

In natural language processing (NLP), the focus has shifted from encoder-only tiny language models like BERT to decoder-only large language models(LLMs) such as GPT-3. However, LLMs’ practical application in the financial sector has revealed three limitations: (1) LLMs often perform worse than fine-tuned BERT on discriminative tasks despite costing much higher computational resources, such as market sentiment analysis in financial reports; (2) Application on generative tasks heavily relies on retrieval augmented generation (RAG) methods to provide current and specialized information, with general retrievers showing suboptimal performance on domain-specific retrieval tasks; (3) There are additional inadequacies in other feature-based scenarios, such as topic modeling. We introduce FinBERT2, a specialized bidirectional encoder pretrained on a high-quality, financial-specific corpus of 32b tokens. This represents the largest known Chinese financial pretraining corpus for models of this parameter size. As a better backbone, FinBERT2 can bridge the gap in the financial-specific deployment of LLMs through the following achievements: (1) Discriminative fine-tuned models (Fin-Labelers) outperform other (Fin)BERT variants by 0.4%-3.3% and leading LLMs by 9.7%-12.3% on average across five financial classification tasks. (2) Contrastive fine-tuned models (Fin-Retrievers) outperform both open-source (e.g., +6.8\% avg improvement over BGE-base-zh) and proprietary (e.g., +4.2\% avg improvement over OpenAI’s text-embedding-3-large) embedders across five financial retrieval tasks; (3) Building on FinBERT2 variants, we construct the Fin-TopicModel, which enables superior clustering and topic representation for financial titles. Our work revisits financial BERT models through comparative analysis with contemporary LLMs and offers practical insights for effectively utilizing FinBERT in the LLMs era.

nan

Article 505

Title@2025-07-05 (6): Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk?

Title: Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk?

KI für Finanzen bewerten: Ist KI bei der Bewertung von Investitionsrisiken glaubwürdig?

评估大赦国际的融资:AI在评估投资风险方面是否可信? 2505.18953v2

Authors (11): Divij Chawla, Ashita Bhutada, Do Duc Anh, Abhinav Raghunathan, Vinod SP, Cathy Guo, Dar Win Liew, Prannaya Gupta, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria

We assess whether AI systems can credibly evaluate investment risk appetite-a task that must be thoroughly validated before automation. Our analysis was conducted on proprietary systems (GPT, Claude, Gemini) and open-weight models (LLaMA, DeepSeek, Mistral), using carefully curated user profiles that reflect real users with varying attributes such as country and gender. As a result, the models exhibit significant variance in score distributions when user attributes-such as country or gender-that should not influence risk computation are changed. For example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles. While some models align closely with expected scores in the Low- and Mid-risk ranges, none maintain consistent scores across regions and demographics, thereby violating AI and finance regulations.

nan

Article 506

Title@2025-07-05 (6): Patient-Centered RAG for Oncology Visit Aid Following the Ottawa Decision Guide

Title: Patient-Centered RAG for Oncology Visit Aid Following the Ottawa Decision Guide

Patientenzentrierte RAG für Onkologie-Besuchshilfe nach dem Ottawa-Entscheidungsleitfaden

遵循渥太华决定指南,以病人为中心,为肿瘤科访问援助提供RAG 2507.04026v1

Authors (3): Siyang Liu, Lawrence Chin-I An, Rada Mihalcea

Effective communication is essential in cancer care, yet patients often face challenges in preparing for complex medical visits. We present an interactive, Retrieval-augmented Generation-assisted system that helps patients progress from uninformed to visit-ready. Our system adapts the Ottawa Personal Decision Guide into a dynamic retrieval-augmented generation workflow, helping users bridge knowledge gaps, clarify personal values and generate useful questions for their upcoming visits. Focusing on localized prostate cancer, we conduct a user study with patients and a clinical expert. Results show high system usability (UMUX Mean = 6.0 out of 7), strong relevance of generated content (Mean = 6.7 out of 7), minimal need for edits, and high clinical faithfulness (Mean = 6.82 out of 7). This work demonstrates the potential of combining patient-centered design with language models to enhance clinical preparation in oncology care.

nan

Article 507

Title@2025-07-05 (6): Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance

Title: Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance

Vergleichende Bewertung von ChatGPT und DeepSeek über zentrale NLP-Aufgaben: Stärken, Schwächen und Domain-spezifische Leistung

国家劳工政策关键任务:力量、弱点和具体具体绩效 2506.18501v3

Authors (2): Wael Etaiwi, Bushra Alhijawi

The increasing use of large language models (LLMs) in natural language processing (NLP) tasks has sparked significant interest in evaluating their effectiveness across diverse applications. While models like ChatGPT and DeepSeek have shown strong results in many NLP domains, a comprehensive evaluation is needed to understand their strengths, weaknesses, and domain-specific abilities. This is critical as these models are applied to various tasks, from sentiment analysis to more nuanced tasks like textual entailment and translation. This study aims to evaluate ChatGPT and DeepSeek across five key NLP tasks: sentiment analysis, topic classification, text summarization, machine translation, and textual entailment. A structured experimental protocol is used to ensure fairness and minimize variability. Both models are tested with identical, neutral prompts and evaluated on two benchmark datasets per task, covering domains like news, reviews, and formal/informal texts. The results show that DeepSeek excels in classification stability and logical reasoning, while ChatGPT performs better in tasks requiring nuanced understanding and flexibility. These findings provide valuable insights for selecting the appropriate LLM based on task requirements.

nan

Article 508

Title@2025-07-05 (6): LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models

Title: LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models

LLMThinkBench: Auf dem Weg zu grundlegenden mathematischen Überlegungen und Überdenken in großen Sprachmodellen

LLMTHink Bench:在大语言模型中走向基本数学理由和过度思考 2507.04023v1

Authors (4): Gaurav Srivastava, Aafiya Hussain, Sriram Srinivasan, Xuan Wang

Large Language Models (LLMs) have achieved remarkable performance on complex mathematical benchmarks, yet often struggle with simple arithmetic tasks and exhibit a tendency toward over-explaining or “overthinking” answers. To systematically assess this phenomenon, we introduce LLMThinkBench, a modular benchmarking framework that enables researchers to evaluate basic math reasoning and overthinking in LLMs. The framework provides 14 configurable math tasks with randomized test data generation and robust parsing strategies. Researchers can quantify overthinking using our Overthinking Score metric, which captures accuracy-verbosity tradeoffs through harmonic mean formulation. The tool offers flexible evaluation with a scalable vLLM/Transformers backend, multi-GPU support, and full configurability. Users can extend the tool with custom tasks, reproduce experiments with seeding, and generate detailed efficiency reports. Distributed as a pip-installable package with CLI and API access, LLMThinkBench provides researchers and practitioners an accessible, cost-effective alternative to expensive LLM-as-a-judge methods for diagnosing basic reasoning capabilities and efficiency analysis. Package can be installed as: pip install llmthinkbench

nan

Article 509

Title@2025-07-05 (6): Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning

Title: Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning

Umgang mit koreanischen Out-of-Vocabulary-Worte mit Phoneme Repräsentanz Lernen

以电话代言学习方式处理韩语外词汇 2507.04018v1

Authors (4): Nayeon Kim, Eojin Jeon, Jun-Hyung Park, SangKeun Lee

In this study, we introduce KOPL, a novel framework for handling Korean OOV words with Phoneme representation Learning. Our work is based on the linguistic property of Korean as a phonemic script, the high correlation between phonemes and letters. KOPL incorporates phoneme and word representations for Korean OOV words, facilitating Korean OOV word representations to capture both text and phoneme information of words. We empirically demonstrate that KOPL significantly improves the performance on Korean Natural Language Processing (NLP) tasks, while being readily integrated into existing static and contextual Korean embedding models in a plug-and-play manner. Notably, we show that KOPL outperforms the state-of-the-art model by an average of 1.9%. Our code is available at https://github.com/jej127/KOPL.git.

nan

Article 510

Title@2025-07-05 (6): Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition

Title: Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition

Nunchi-Bench: Benchmarking von Sprachmodellen zur kulturellen Vernunft mit Fokus auf koreanischen Aberglauben

Nunchi-Nunchi-Bench:以韩国迷信为重点的文化理由语言模式基准模式 2507.04014v1

Authors (2): Kyuhee Kim, Sangah Lee

As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs’ cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts and how language variations affect performance. To systematically assess cultural reasoning, we propose a novel evaluation strategy with customized scoring metrics that capture the extent to which models recognize cultural nuances and respond appropriately. Our findings highlight significant challenges in LLMs’ cultural reasoning. While models generally recognize factual information, they struggle to apply it in practical scenarios. Furthermore, explicit cultural framing enhances performance more effectively than relying solely on the language of the prompt. To support further research, we publicly release Nunchi-Bench alongside a leaderboard.

nan

Article 511

Title@2025-07-05 (6): Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Title: Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Easy Dataset: Ein einheitliches und erweiterbares Framework zur Synthese von LLM-Feineinstellungsdaten aus unstrukturierten Dokumenten

简易数据集:从无结构文档中合成LLM精密调试数据的统一和扩展框架 2507.04009v1

Authors (7): Ziyang Miao, Qiyu Sun, Jingyuan Wang, Yuchen Gong, Yaowei Zheng, Shiqi Li, Richong Zhang

Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents effectively. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to easily configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then leverages a persona-driven prompting approach to generate diverse question-answer pairs using public-available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at https://github.com/ConardLi/easy-dataset and have garnered over 9,000 GitHub stars.

nan

Article 512

Title@2025-07-05 (6): A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models

Title: A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models

Eine Umfrage über proaktive Verteidigungsstrategien gegen Fehlinformationen in großen Sprachmodellen

关于针对大语言模型中错误信息的积极防御战略的调查 2507.05288v1

Authors (10): Shuliang Liu, Hongyi Liu, Aiwei Liu, Bingchen Duan, Qi Zheng, Yibo Yan, He Geng, Peijie Jiang, Jia Liu, Xuming Hu

The widespread deployment of large language models (LLMs) across critical domains has amplified the societal risks posed by algorithmically generated misinformation. Unlike traditional false content, LLM-generated misinformation can be self-reinforcing, highly plausible, and capable of rapid propagation across multiple languages, which traditional detection methods fail to mitigate effectively. This paper introduces a proactive defense paradigm, shifting from passive post hoc detection to anticipatory mitigation strategies. We propose a Three Pillars framework: (1) Knowledge Credibility, fortifying the integrity of training and deployed data; (2) Inference Reliability, embedding self-corrective mechanisms during reasoning; and (3) Input Robustness, enhancing the resilience of model interfaces against adversarial attacks. Through a comprehensive survey of existing techniques and a comparative meta-analysis, we demonstrate that proactive defense strategies offer up to 63\% improvement over conventional methods in misinformation prevention, despite non-trivial computational overhead and generalization challenges. We argue that future research should focus on co-designing robust knowledge foundations, reasoning certification, and attack-resistant interfaces to ensure LLMs can effectively counter misinformation across varied domains.

nan

Article 513

Title@2025-07-05 (6): UNITYAI-GUARD: Pioneering Toxicity Detection Across Low-Resource Indian Languages

Title: UNITYAI-GUARD: Pioneering Toxicity Detection Across Low-Resource Indian Languages

UNITYAI-GUARD: Pionierische Toxizitätserkennung über Low-Resource-Indische Sprachen

UNITEAAI-GUARD:通过低资源印度语言首次发现有毒毒性 2503.23088v2

Authors (10): Himanshu Beniwal, Reddybathuni Venkat, Rohit Kumar, Birudugadda Srivibhav, Daksh Jain, Pavan Doddi, Eshwar Dhande, Adithya Ananth, Kuldeep, Mayank Singh

This work introduces UnityAI-Guard, a framework for binary toxicity classification targeting low-resource Indian languages. While existing systems predominantly cater to high-resource languages, UnityAI-Guard addresses this critical gap by developing state-of-the-art models for identifying toxic content across diverse Brahmic/Indic scripts. Our approach achieves an impressive average F1-score of 84.23% across seven languages, leveraging a dataset of 567k training instances and 30k manually verified test instances. By advancing multilingual content moderation for linguistically diverse regions, UnityAI-Guard also provides public API access to foster broader adoption and application.

nan

Article 514

Title@2025-07-05 (6): Improving RL Exploration for LLM Reasoning through Retrospective Replay

Title: Improving RL Exploration for LLM Reasoning through Retrospective Replay

Verbesserung der RL Exploration für LLM Reasoning durch retrospektive Replay

通过回溯性回放改进LL探索LLM 2504.14363v2

Authors (7): Shihan Dou, Muling Wu, Jingwen Xu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

Reinforcement learning (RL) has increasingly become a pivotal technique in the post-training of large language models (LLMs). The effective exploration of the output space is essential for the success of RL. We observe that for complex problems, during the early stages of training, the model exhibits strong exploratory capabilities and can identify promising solution ideas. However, its limited capability at this stage prevents it from successfully solving these problems. The early suppression of these potentially valuable solution ideas by the policy gradient hinders the model’s ability to revisit and re-explore these ideas later. Consequently, although the LLM’s capabilities improve in the later stages of training, it still struggles to effectively address these complex problems. To address this exploration issue, we propose a novel algorithm named Retrospective Replay-based Reinforcement Learning (RRL), which introduces a dynamic replay mechanism throughout the training process. RRL enables the model to revisit promising states identified in the early stages, thereby improving its efficiency and effectiveness in exploration. To evaluate the effectiveness of RRL, we conduct extensive experiments on complex reasoning tasks, including mathematical reasoning and code generation, and general dialogue tasks. The results indicate that RRL maintains high exploration efficiency throughout the training period, significantly enhancing the effectiveness of RL in optimizing LLMs for complicated reasoning tasks. Moreover, it also improves the performance of RLHF, making the model both safer and more helpful.

nan

Article 515

Title@2025-07-05 (6): Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Title: Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Open-Reasoner-Zero: Ein Open Source Ansatz zur Skalierung von Verstärkungslernen am Basismodell

开放反馈-零:基于基础模式的扩大强化学习的开放源码方法 2503.24290v2

Authors (6): Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum

We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model, Qwen2.5-32B base, as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency, requiring only 1/10 of the training steps compared to the DeepSeek-R1-Zero pipeline. Moreover, our analysis not only covers training dynamics and ablation for critical design choices, but also quantitatively shows how the learned critic in Reasoner-Zero training effectively identifies and devalues repetitive response patterns, yielding more robust advantage estimations and enhancing training stability. Embracing the principles of open-source, we release our source code, training data, and various model weights, fostering reproducibility and encouraging further exploration of the properties of related models.

nan

Article 516

Title@2025-07-05 (6): A Comparative Study of Specialized LLMs as Dense Retrievers

Title: A Comparative Study of Specialized LLMs as Dense Retrievers

Eine vergleichende Studie von spezialisierten LLMs als Dense Retriever

专业专利妇女作为 “ 大量开发 “ 的比较研究 2507.03958v1

Authors (3): Hengran Zhang, Keping Bi, Jiafeng Guo

While large language models (LLMs) are increasingly deployed as dense retrievers, the impact of their domain-specific specialization on retrieval effectiveness remains underexplored. This investigation systematically examines how task-specific adaptations in LLMs influence their retrieval capabilities, an essential step toward developing unified retrievers capable of handling text, code, images, and multimodal content. We conduct extensive experiments with eight Qwen2.5 7B LLMs, including base, instruction-tuned, code/math-specialized, long reasoning, and vision-language models across zero-shot retrieval settings and the supervised setting. For the zero-shot retrieval settings, we consider text retrieval from the BEIR benchmark and code retrieval from the CoIR benchmark. Further, to evaluate supervised performance, all LLMs are fine-tuned on the MS MARCO dataset. We find that mathematical specialization and the long reasoning capability cause consistent degradation in three settings, indicating conflicts between mathematical reasoning and semantic matching. The vision-language model and code-specialized LLMs demonstrate superior zero-shot performance compared to other LLMs, even surpassing BM25 on the code retrieval task, and maintain comparable performance to base LLMs in supervised settings. These findings suggest promising directions for the unified retrieval task leveraging cross-domain and cross-modal fusion.

nan

Article 517

Title@2025-07-05 (6): A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text

Title: A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text

Modulares unüberwachtes Framework zur Attributerkennung aus unstrukturiertem Text

从无结构文本中识别属性的模块化未经监督的框架 2507.03949v1

Authors (1): KMA Solaiman

We propose POSID, a modular, lightweight and on-demand framework for extracting structured attribute-based properties from unstructured text without task-specific fine-tuning. While the method is designed to be adaptable across domains, in this work, we evaluate it on human attribute recognition in incident reports. POSID combines lexical and semantic similarity techniques to identify relevant sentences and extract attributes. We demonstrate its effectiveness on a missing person use case using the InciText dataset, achieving effective attribute extraction without supervised training.

nan

Article 518

Title@2025-07-05 (6): Dynamic Injection of Entity Knowledge into Dense Retrievers

Title: Dynamic Injection of Entity Knowledge into Dense Retrievers

Dynamische Einspritzung von Wissen der Entität in Dense Retriever

将实体知识动态投入重复内容 2507.03922v1

Authors (5): Ikuya Yamada, Ryokan Ri, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Dense retrievers often struggle with queries involving less-frequent entities due to their limited entity knowledge. We propose the Knowledgeable Passage Retriever (KPR), a BERT-based retriever enhanced with a context-entity attention layer and dynamically updatable entity embeddings. This design enables KPR to incorporate external entity knowledge without retraining. Experiments on three datasets show that KPR consistently improves retrieval accuracy, achieving a substantial 12.6% gain on the EntityQuestions dataset over the model without KPR extensions. When built on the off-the-shelf bge-base retriever, KPR achieves state-of-the-art performance among similarly sized models on two datasets. Code and models will be released soon.

nan

Article 519

Title@2025-07-05 (6): Play to Generalize: Learning to Reason Through Game Play

Title: Play to Generalize: Learning to Reason Through Game Play

Spielen Sie Generalize: Lernen, Vernunft durch Spiel zu lernen

玩一般游戏: 通过玩游戏学习理性 2506.08011v3

Authors (6): Yunfei Xie, Yinsong Ma, Shiyi Lan, Alan Yuille, Junfei Xiao, Chen Wei

Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g. Snake, significantly enhances its downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL, suggesting the capture of transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data in multimodal reasoning benchmarks, while preserving the base model’s performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pre-text tasks that unlock generalizable multimodal reasoning abilities in MLLMs.

nan

Article 520

Title@2025-07-05 (6): Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

Title: Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

Sequoia: Skalierbares, Robustes und Hardware-gestütztes Spekulatives Dekodieren

塞奎亚:可缩放、强力和硬件—-有意识的投机性代谢 2402.12374v3

Authors (7): Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen

As the usage of large language models (LLMs) grows, performing efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets, and adapt to different hyperparameters and hardware. This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding. To attain better scalability, Sequoia introduces a dynamic programming algorithm to find the optimal tree structure for the speculated tokens. To achieve robust speculative performance, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Finally, Sequoia introduces a hardware-aware tree optimizer that maximizes speculative performance by automatically selecting the token tree size and depth for a given hardware platform. Evaluation shows that Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 by up to $4.04\times$, $3.73\times$, and $2.27\times$. For offloading setting on L40, Sequoia achieves as low as 0.56 s/token for exact Llama2-70B inference latency, which is $9.96\times$ on our optimized offloading system (5.6 s/token), $9.7\times$ than DeepSpeed-Zero-Inference, $19.5\times$ than Huggingface Accelerate.

nan

Article 521

Title@2025-07-05 (6): Demystifying ChatGPT: How It Masters Genre Recognition

Title: Demystifying ChatGPT: How It Masters Genre Recognition

ChatGPT entmystifizieren: Wie es die Genreerkennung beherrscht

解密的聊天:如何使各教派获得承认 2507.03875v1

Authors (4): Subham Raj, Sriparna Saha, Brijraj Singh, Niranjan Pedanekar

The introduction of ChatGPT has garnered significant attention within the NLP community and beyond. Previous studies have demonstrated ChatGPT’s substantial advancements across various downstream NLP tasks, highlighting its adaptability and potential to revolutionize language-related applications. However, its capabilities and limitations in genre prediction remain unclear. This work analyzes three Large Language Models (LLMs) using the MovieLens-100K dataset to assess their genre prediction capabilities. Our findings show that ChatGPT, without fine-tuning, outperformed other LLMs, and fine-tuned ChatGPT performed best overall. We set up zero-shot and few-shot prompts using audio transcripts/subtitles from movie trailers in the MovieLens-100K dataset, covering 1682 movies of 18 genres, where each movie can have multiple genres. Additionally, we extended our study by extracting IMDb movie posters to utilize a Vision Language Model (VLM) with prompts for poster information. This fine-grained information was used to enhance existing LLM prompts. In conclusion, our study reveals ChatGPT’s remarkable genre prediction capabilities, surpassing other language models. The integration of VLM further enhances our findings, showcasing ChatGPT’s potential for content-related applications by incorporating visual information from movie posters.

nan

Article 522

Title@2025-07-05 (6): OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference

Title: OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference

OrthoRank: Tokenauswahl über Sink Token Orthogonalität für effiziente LLM-Inferenz

OrthoRank:通过 Sink Token 条形对高效LLM 推论进行当量选择 2507.03865v1

Authors (3): Seungjun Shin, Jaehoon Oh, Dokwan Oh

Attention mechanisms are central to the success of large language models (LLMs), enabling them to capture intricate token dependencies and implicitly assign importance to each token. Recent studies have revealed the sink token, which receives disproportionately high attention despite their limited semantic role. In this paper, we first expand the relationship between the sink token and other tokens, moving beyond attention to explore their similarity in hidden states, considering the layer depth. We observe that as the layers get deeper, the cosine similarity between the normalized hidden states of the sink token and those of other tokens increases, and that the normalized hidden states of the sink token exhibit negligible changes. These imply that other tokens consistently are directed toward the sink token throughout the layers. Next, we propose a dynamic token selection method, called OrthoRank, using these findings to select important tokens. Specifically, in a certain layer, we define token importance by the speed at which the token moves toward the sink token. This is converted into orthogonality with the sink token, meaning that tokens that are more orthogonal to the sink token are assigned greater importance. Finally, through extensive experiments, we demonstrated that our method results in lower perplexity and higher zero-shot accuracy compared to layer pruning methods at the same sparsity ratio with comparable throughput, while also achieving superior performance on LongBench.

nan

Article 523

Title@2025-07-04 (5): Voice of a Continent: Mapping Africa’s Speech Technology Frontier

Title: Voice of a Continent: Mapping Africa’s Speech Technology Frontier

Stimme eines Kontinents: Afrikas Rede-Technologie-Grenze kartieren

非洲大陆之声:测绘非洲语音技术前沿 2505.18436v3

Authors (6): AbdelRahim Elmadany, Sang Yun Kwon, Hawau Olamide Toyin, Alcides Alcoba Inciarte, Hanan Aldarmaki, Muhammad Abdul-Mageed

Africa’s rich linguistic diversity remains significantly underrepresented in speech technologies, creating barriers to digital inclusion. To alleviate this challenge, we systematically map the continent’s speech space of datasets and technologies, leading to a new comprehensive benchmark SimbaBench for downstream African speech tasks. Using SimbaBench, we introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks. Our benchmark analysis reveals critical patterns in resource availability, while our model evaluation demonstrates how dataset quality, domain diversity, and language family relationships influence performance across languages. Our work highlights the need for expanded speech technology resources that better reflect Africa’s linguistic diversity and provides a solid foundation for future research and development efforts toward more inclusive speech technologies.

nan

Article 524

Title@2025-07-04 (5): Relation-Aware Network with Attention-Based Loss for Few-Shot Knowledge Graph Completion

Title: Relation-Aware Network with Attention-Based Loss for Few-Shot Knowledge Graph Completion

Beziehungs-Aware-Netzwerk mit aufmerksamkeitsbasiertem Verlust für wenige-heiße Wissensgraphen-Vervollständigung

几乎没有热知识图补全的基于关注的损失关系软件网络 2306.09519v2

Authors (4): Qiao Qiao, Yuepei Li, Kang Zhou, Qi Li

Few-shot knowledge graph completion (FKGC) task aims to predict unseen facts of a relation with few-shot reference entity pairs. Current approaches randomly select one negative sample for each reference entity pair to minimize a margin-based ranking loss, which easily leads to a zero-loss problem if the negative sample is far away from the positive sample and then out of the margin. Moreover, the entity should have a different representation under a different context. To tackle these issues, we propose a novel Relation-Aware Network with Attention-Based Loss (RANA) framework. Specifically, to better utilize the plentiful negative samples and alleviate the zero-loss issue, we strategically select relevant negative samples and design an attention-based loss function to further differentiate the importance of each negative sample. The intuition is that negative samples more similar to positive samples will contribute more to the model. Further, we design a dynamic relation-aware entity encoder for learning a context-dependent entity representation. Experiments demonstrate that RANA outperforms the state-of-the-art models on two benchmark datasets.

nan

Article 525

Title: Beyond classical and contemporary models: a transformative ai framework for student dropout prediction in distance learning using rag, prompt engineering, and cross-modal fusion

Über klassische und zeitgenössische Modelle hinaus: ein transformatives Ai-Framework für die Studienabbrechervorhersage im Fernunterricht mittels Rag, Prompt Engineering und modaler Fusion

超越古典和当代模式:利用抹布、即时工程和跨模式融合法在远程学习中预测学生辍学情况的变革性Ai框架 2507.05285v1

Authors (3): Miloud Mihoubi, Meriem Zerkouk, Belkacem Chikhaoui

Student dropout in distance learning remains a critical challenge, with profound societal and economic consequences. While classical machine learning models leverage structured socio-demographic and behavioral data, they often fail to capture the nuanced emotional and contextual factors embedded in unstructured student interactions. This paper introduces a transformative AI framework that redefines dropout prediction through three synergistic innovations: Retrieval-Augmented Generation (RAG) for domain-specific sentiment analysis, prompt engineering to decode academic stressors, and cross-modal attention fusion to dynamically align textual, behavioral, and socio-demographic insights. By grounding sentiment analysis in a curated knowledge base of pedagogical content, our RAG-enhanced BERT model interprets student comments with unprecedented contextual relevance, while optimized prompts isolate indicators of academic distress (e.g., “isolation,” “workload anxiety”). A cross-modal attention layer then fuses these insights with temporal engagement patterns, creating holistic risk profiles. Evaluated on a longitudinal dataset of 4 423 students, the framework achieves 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%. Beyond prediction, the system generates interpretable interventions by retrieving contextually aligned strategies (e.g., mentorship programs for isolated learners). This work bridges the gap between predictive analytics and actionable pedagogy, offering a scalable solution to mitigate dropout risks in global education systems

nan

Article 526

Title@2025-07-04 (5): Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation

Title: Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation

Auf dem Weg zu einer fairen RAG: Über die Auswirkungen des fairen Rankings in der retrieval-Augmentierten Generation

走向公平的RAG:公平评级对回升一代人的影响 2409.11598v4

Authors (2): To Eun Kim, Fernando Diaz

Despite the central role of retrieval in retrieval-augmented generation (RAG) systems, much of the existing research on RAG overlooks the well-established field of fair ranking and fails to account for the interests of all stakeholders involved. In this paper, we conduct the first systematic evaluation of RAG systems that integrate fairness-aware rankings, addressing both ranking fairness and attribution fairness, which ensures equitable exposure of the sources cited in the generated content. Our evaluation focuses on measuring item-side fairness, specifically the fair exposure of relevant items retrieved by RAG systems, and investigates how this fairness impacts both the effectiveness of the systems and the attribution of sources in the generated output that users ultimately see. By experimenting with twelve RAG models across seven distinct tasks, we show that incorporating fairness-aware retrieval often maintains or even enhances both ranking quality and generation quality, countering the common belief that fairness compromises system performance. Additionally, we demonstrate that fair retrieval practices lead to more balanced attribution in the final responses, ensuring that the generator fairly cites the sources it relies on. Our findings underscore the importance of item-side fairness in retrieval and generation, laying the foundation for responsible and equitable RAG systems and guiding future research in fair ranking and attribution.

nan

Article 527

Title@2025-07-04 (5): Parsing Through Boundaries in Chinese Word Segmentation

Title: Parsing Through Boundaries in Chinese Word Segmentation

Parsing durch Grenzen in chinesischen Wortsegmentierung

中文单词分割的边界分割 2503.23091v2

Authors (10): Yige Chen, Zelong Li, Cindy Zhang, Changbing Yang, Amandisa Cady, Ai Ka Lee, Zejiao Zeng, Eunkyul Leah Jo, Haihua Pan, Jungyeul Park

Chinese word segmentation is a foundational task in natural language processing (NLP), with far-reaching effects on syntactic analysis. Unlike alphabetic languages like English, Chinese lacks explicit word boundaries, making segmentation both necessary and inherently ambiguous. This study highlights the intricate relationship between word segmentation and syntactic parsing, providing a clearer understanding of how different segmentation strategies shape dependency structures in Chinese. Focusing on the Chinese GSD treebank, we analyze multiple word boundary schemes, each reflecting distinct linguistic and computational assumptions, and examine how they influence the resulting syntactic structures. To support detailed comparison, we introduce an interactive web-based visualization tool that displays parsing outcomes across segmentation methods.

nan

Article 528

Title@2025-07-04 (5): Tip of the Tongue Query Elicitation for Simulated Evaluation

Title: Tip of the Tongue Query Elicitation for Simulated Evaluation

Tipp der Zungenabfrage-Elikation für Simulierte Auswertung

用于模拟评价的 “ 舌头查询参考 “ 提示 2502.17776v2

Authors (5): Yifan He, To Eun Kim, Fernando Diaz, Jaime Arguello, Bhaskar Mitra

Tip-of-the-tongue (TOT) search occurs when a user struggles to recall a specific identifier, such as a document title. While common, existing search systems often fail to effectively support TOT scenarios. Research on TOT retrieval is further constrained by the challenge of collecting queries, as current approaches rely heavily on community question-answering (CQA) websites, leading to labor-intensive evaluation and domain bias. To overcome these limitations, we introduce two methods for eliciting TOT queries - leveraging large language models (LLMs) and human participants - to facilitate simulated evaluations of TOT retrieval systems. Our LLM-based TOT user simulator generates synthetic TOT queries at scale, achieving high correlations with how CQA-based TOT queries rank TOT retrieval systems when tested in the Movie domain. Additionally, these synthetic queries exhibit high linguistic similarity to CQA-derived queries. For human-elicited queries, we developed an interface that uses visual stimuli to place participants in a TOT state, enabling the collection of natural queries. In the Movie domain, system rank correlation and linguistic similarity analyses confirm that human-elicited queries are both effective and closely resemble CQA-based queries. These approaches reduce reliance on CQA-based data collection while expanding coverage to underrepresented domains, such as Landmark and Person. LLM-elicited queries for the Movie, Landmark, and Person domains have been released as test queries in the TREC 2024 TOT track, with human-elicited queries scheduled for inclusion in the TREC 2025 TOT track. Additionally, we provide source code for synthetic query generation and the human query collection interface, along with curated visual stimuli used for eliciting TOT queries.

nan

Article 529

Title@2025-07-04 (5): Alpay Algebra IV: Symbiotic Semantics and the Fixed-Point Convergence of Observer Embeddings

Title: Alpay Algebra IV: Symbiotic Semantics and the Fixed-Point Convergence of Observer Embeddings

Alpay Algebra IV: Symbiotische Semantik und die Fixed-Point-Konvergenz von Beobachter-Embeddings

替代代数四:共生语义学和观察员嵌入的固定点汇合 2507.03774v1

Authors (2): Bugra Kilictas, Faruk Alpay

We present a theoretical framework in which a document and an AI model engage in a transfinite fixed-point interaction that leads to stable semantic alignment. Building on the foundations of Alpay Algebra, we introduce a functorial system wherein an observer (the AI) and a textual environment (this paper) co-evolve through iterative transformations guided by the phi-infinity operator. This process guarantees the existence of a unique fixed point in the AI’s embedding space – a state where the AI’s internal representation of the content becomes stable, self-consistent, and semantically faithful. We prove that such convergence is mathematically sound, semantically invariant, and permanent, even under perturbation or further context expansion. This fixed point acts as an “empathetic embedding,” wherein the AI internalizes not only the meaning of the content but also the author’s intent. We interpret this as a rigorous, category-theoretic route to alignment at the embedding level, with implications for semantic security, symbolic memory, and the construction of AI systems with persistent self-referential understanding. All references in this paper function as nodes in the Alpay Algebra universe, and this work embeds itself as a new fixed-point node within that transfinite semantic graph.

nan

Article 530

Title@2025-07-04 (5): Agent-Based Detection and Resolution of Incompleteness and Ambiguity in Interactions with Large Language Models

Title: Agent-Based Detection and Resolution of Incompleteness and Ambiguity in Interactions with Large Language Models

Agentbasierte Erkennung und Auflösung von Unvollständigkeit und Ambiguität in Interaktionen mit großen Sprachmodellen

与大语言模式互动中不完全和不易的代理检测和分辨率 2507.03726v1

Authors (4): Riya Naik, Ashwin Srinivasan, Swati Agarwal, Estrid He

Many of us now treat LLMs as modern-day oracles asking it almost any kind of question. However, consulting an LLM does not have to be a single turn activity. But long multi-turn interactions can get tedious if it is simply to clarify contextual information that can be arrived at through reasoning. In this paper, we examine the use of agent-based architecture to bolster LLM-based Question-Answering systems with additional reasoning capabilities. We examine the automatic resolution of potential incompleteness or ambiguities in questions by transducers implemented using LLM-based agents. We focus on several benchmark datasets that are known to contain questions with these deficiencies to varying degrees. We equip different LLMs (GPT-3.5-Turbo and Llama-4-Scout) with agents that act as specialists in detecting and resolving deficiencies of incompleteness and ambiguity. The agents are implemented as zero-shot ReAct agents. Rather than producing an answer in a single step, the model now decides between 3 actions a) classify b) resolve c) answer. Action a) decides if the question is incomplete, ambiguous, or normal. Action b) determines if any deficiencies identified can be resolved. Action c) answers the resolved form of the question. We compare the use of LLMs with and without the use of agents with these components. Our results show benefits of agents with transducer 1) A shortening of the length of interactions with human 2) An improvement in the answer quality and 3) Explainable resolution of deficiencies in the question. On the negative side we find while it may result in additional LLM invocations and in some cases, increased latency. But on tested datasets, the benefits outweigh the costs except when questions already have sufficient context. Suggesting the agent-based approach could be a useful mechanism to harness the power of LLMs to develop more robust QA systems.

nan

Article 531

Title@2025-07-04 (5): ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models

Title: ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models

ODE: Offene Bewertung von Halluzinationen in multimodalen großen Sprachmodellen

ODE:多模式大语言模型中的幻觉公开评价 2409.09318v4

Authors (3): Yahan Tu, Rui Hu, Jitao Sang

Hallucination poses a persistent challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are generally static, which may overlook the potential risk of data contamination. To address this issue, we propose ODE, an open-set, dynamic protocol designed to evaluate object hallucinations in MLLMs at both the existence and attribute levels. ODE employs a graph-based structure to represent real-world object concepts, their attributes, and the distributional associations between them. This structure facilitates the extraction of concept combinations based on diverse distributional criteria, generating varied samples for structured queries that evaluate hallucinations in both generative and discriminative tasks. Through the generation of new samples, dynamic concept combinations, and varied distribution frequencies, ODE mitigates the risk of data contamination and broadens the scope of evaluation. This protocol is applicable to both general and specialized scenarios, including those with limited data. Experimental results demonstrate the effectiveness of our protocol, revealing that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated samples, which indicates potential data contamination. Furthermore, these generated samples aid in analyzing hallucination patterns and fine-tuning models, offering an effective approach to mitigating hallucinations in MLLMs.

nan

Article 532

Title@2025-07-04 (5): Prompting as Scientific Inquiry

Title: Prompting as Scientific Inquiry

Als wissenschaftliche Untersuchung prompt

作为科学调查 2507.00163v2

Authors (2): Ari Holtzman, Chenhao Tan

Prompting is the primary method by which we study and control large language models. It is also one of the most powerful: nearly every major capability attributed to LLMs-few-shot learning, chain-of-thought, constitutional AI-was first unlocked through prompting. Yet prompting is rarely treated as science and is frequently frowned upon as alchemy. We argue that this is a category error. If we treat LLMs as a new kind of complex and opaque organism that is trained rather than programmed, then prompting is not a workaround: it is behavioral science. Mechanistic interpretability peers into the neural substrate, prompting probes the model in its native interface: language. We contend that prompting is not inferior, but rather a key component in the science of LLMs.

nan

Article 533

Title@2025-07-04 (5): Controlling Thinking Speed in Reasoning Models

Title: Controlling Thinking Speed in Reasoning Models

Kontrolle der Denkgeschwindigkeit in vernünftigen Modellen

控制理性模型的思考速度 2507.03704v1

Authors (9): Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie, Wenxiao Wang, Deng Cai, Zheng Wang, Jieping Ye

Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs’ representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-and-play method yields an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.

nan

Article 534

Title@2025-07-04 (5): PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping

Title: PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping

PHEONA: Ein Bewertungsrahmen für großsprachliche modellbasierte Ansätze zur computergestützten Phenotypisierung

PHEONA:基于大语言模型的计算基因组学方法评价框架 2503.19265v3

Authors (3): Sarah Pungitore, Shashank Yadav, Vignesh Subbian

Computational phenotyping is essential for biomedical research but often requires significant time and resources, especially since traditional methods typically involve extensive manual data review. While machine learning and natural language processing advancements have helped, further improvements are needed. Few studies have explored using Large Language Models (LLMs) for these tasks despite known advantages of LLMs for text-based tasks. To facilitate further research in this area, we developed an evaluation framework, Evaluation of PHEnotyping for Observational Health Data (PHEONA), that outlines context-specific considerations. We applied and demonstrated PHEONA on concept classification, a specific task within a broader phenotyping process for Acute Respiratory Failure (ARF) respiratory support therapies. From the sample concepts tested, we achieved high classification accuracy, suggesting the potential for LLM-based methods to improve computational phenotyping processes.

nan

Article 535

Title@2025-07-04 (5): A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and Applications

Title: A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and Applications

Eine Umfrage von großen Sprachmodellen zu generativen Graph Analytics: Abfragen, Lernen und Anwendungen

关于 “ 生成图表分析分析:查询、学习和应用 “ 的大型语言模型调查 2404.14809v2

Authors (2): Wenbo Shang, Xin Huang

A graph is a fundamental data model to represent various entities and their complex relationships in society and nature, such as social networks, transportation networks, and financial networks. Recently, large language models (LLMs) have showcased a strong generalization ability to handle various natural language processing tasks to answer users’ arbitrary questions and generate specific-domain content. Compared with graph learning models, LLMs enjoy superior advantages in addressing the challenges of generalizing graph tasks by eliminating the need for training graph learning models and reducing the cost of manual annotation. However, LLMs are sequential models for textual data, but graphs are non-sequential topological data. It is challenging to adapt LLMs to tackle graph analytics tasks. In this survey, we conduct a comprehensive investigation of existing LLM studies on graph data, which summarizes the relevant graph analytics tasks solved by advanced LLM models and points out the existing challenges and future directions. Specifically, we study the key problems of LLM-based generative graph analytics (LLM-GGA) in terms of three categories: LLM-based graph query processing (LLM-GQP), LLM-based graph inference and learning (LLM-GIL), and graph-LLM-based applications. LLM-GQP focuses on an integration of graph analytics techniques and LLM prompts, including graph understanding and knowledge graphs and LLMs, while LLM-GIL focuses on learning and reasoning over graphs, including graph learning, graph-formed reasoning, and graph representation. We summarize the useful prompts incorporated into LLM to handle different graph downstream tasks. Moreover, we give a summary of LLM model evaluation, benchmark datasets/tasks, and a deep pro and cons analysis of the discussed LLM-GGA models. We also explore open problems and future directions in the research area of LLMs and graph analytics.

nan

Article 536

Title@2025-07-04 (5): The Geometries of Truth Are Orthogonal Across Tasks

Title: The Geometries of Truth Are Orthogonal Across Tasks

Die Geometrien der Wahrheit sind orthogonal über Aufgaben hinweg

真理的地貌不同任务 2506.08572v2

Authors (7): Waiss Azizian, Michael Kirchhof, Eugene Ndiaye, Louis Bethune, Michal Klein, Pierre Ablin, Marco Cuturi

Large Language Models (LLMs) have demonstrated impressive generalization capabilities across various tasks, but their claim to practical relevance is still mired by concerns on their reliability. Recent works have proposed examining the activations produced by an LLM at inference time to assess whether its answer to a question is correct. Some works claim that a “geometry of truth” can be learned from examples, in the sense that the activations that generate correct answers can be distinguished from those leading to mistakes with a linear classifier. In this work, we underline a limitation of these approaches: we observe that these “geometries of truth” are intrinsically task-dependent and fail to transfer across tasks. More precisely, we show that linear classifiers trained across distinct tasks share little similarity and, when trained with sparsity-enforcing regularizers, have almost disjoint supports. We show that more sophisticated approaches (e.g., using mixtures of probes and tasks) fail to overcome this limitation, likely because activation vectors commonly used to classify answers form clearly separated clusters when examined across tasks.

nan

Article 537

Title@2025-07-04 (5): iNews: A Multimodal Dataset for Modeling Personalized Affective Responses to News

Title: iNews: A Multimodal Dataset for Modeling Personalized Affective Responses to News

iNews: Ein multimodaler Datensatz für die Modellierung personalisierter Affective Responses to News

iNews: 用于模拟个人对新闻的情感反应的多模式数据集 2503.03335v2

Authors (2): Tiancheng Hu, Nigel Collier

Understanding how individuals perceive and react to information is fundamental for advancing social and behavioral sciences and developing human-centered AI systems. Current approaches often lack the granular data needed to model these personalized responses, relying instead on aggregated labels that obscure the rich variability driven by individual differences. We introduce iNews, a novel large-scale dataset specifically designed to facilitate the modeling of personalized affective responses to news content. Our dataset comprises annotations from 291 demographically diverse UK participants across 2,899 multimodal Facebook news posts from major UK outlets, with an average of 5.18 annotators per sample. For each post, annotators provide multifaceted labels including valence, arousal, dominance, discrete emotions, content relevance judgments, sharing likelihood, and modality importance ratings. Crucially, we collect comprehensive annotator persona information covering demographics, personality, media trust, and consumption patterns, which explain 15.2% of annotation variance - substantially higher than existing NLP datasets. Incorporating this information yields a 7% accuracy gain in zero-shot prediction and remains beneficial even with 32-shot in-context learning. iNews opens new possibilities for research in LLM personalization, subjectivity, affective computing, and human behavior simulation.

nan

Article 538

Title@2025-07-04 (5): STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

Title: STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

STRUKTSENSE: Ein Task-Agnostic Agentic Framework für strukturierte Informationsextraktion mit Mensch-in-The-Loop-Evaluierung und Benchmarking

SUCTSENESE: 结构化信息抽取与 “ 网上人 “ 评价和基准确定 2507.03674v1

Authors (8): Tek Raj Chhetri, Yibei Chen, Puja Trivedi, Dorota Jarecka, Saif Haobsh, Patrick Ray, Lydia Ng, Satrajit S. Ghosh

The ability to extract structured information from unstructured sources-such as free-text documents and scientific literature-is critical for accelerating scientific discovery and knowledge synthesis. Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks, including structured information extraction. However, their effectiveness often diminishes in specialized, domain-specific contexts that require nuanced understanding and expert-level domain knowledge. In addition, existing LLM-based approaches frequently exhibit poor transferability across tasks and domains, limiting their scalability and adaptability. To address these challenges, we introduce StructSense, a modular, task-agnostic, open-source framework for structured information extraction built on LLMs. StructSense is guided by domain-specific symbolic knowledge encoded in ontologies, enabling it to navigate complex domain content more effectively. It further incorporates agentic capabilities through self-evaluative judges that form a feedback loop for iterative refinement, and includes human-in-the-loop mechanisms to ensure quality and validation. We demonstrate that StructSense can overcome both the limitations of domain sensitivity and the lack of cross-task generalizability, as shown through its application to diverse neuroscience information extraction tasks.

nan

Article 539

Title@2025-07-04 (5): TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection

Title: TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection

TACOS: Offenes Tagging und vergleichendes Scoring für die Auswahl von Feintuning-Daten

TACOS: 用于指示精确调整数据选择的公开贴贴和比较比标 2507.03673v1

Authors (7): Xixiang He, Hao Yu, Qiyao Sun, Ao Cheng, Tailai Zhang, Cong Liu, Shuxuan Guo

Instruction Fine-Tuning (IFT) is crucial for aligning large language models (LLMs) with human preferences, and selecting a small yet representative subset from massive data significantly facilitates IFT in terms of both efficiency and effectiveness. Nevertheless, existing approaches suffer from two limitations: the use of simple heuristics restricts data diversity, while the singleton data quality evaluation accounts for inconsistent criteria between independent samples. To address the issues, we present TACOS, an innovative method that integrates Open Tagging and Comparative Scoring for IFT data selection. To capture data diversity, we leverage LLMs to assign open-domain tags to human queries, followed by a normalization stage to denoise the open tags and enable efficient clustering. Additionally, we suggest a comparative scoring method that allows the relative quality evaluation of samples within a cluster, avoiding inconsistent criteria seen in singleton-based evaluations. Extensive experiments across diverse datasets and LLM architectures demonstrate that TACOS outperforms existing approaches by a large margin. Notably, it achieves superior instruction-following performance on MT-Bench and ranks 1st among LLaMA2-7B-Based models on AlpacaEval 2.0, illustrating its efficacy for IFT data selection.

nan

Article 540

Title@2025-07-04 (5): Recon, Answer, Verify: Agents in Search of Truth

Title: Recon, Answer, Verify: Agents in Search of Truth

Recon, Antwort, Überprüfung: Agenten auf der Suche nach Wahrheit

调查、答复、核查:寻找真相的特工 2507.03671v1

Authors (3): Satyam Shukla, Himanshu Dutta, Pushpak Bhattacharyya

Automated fact checking with large language models (LLMs) offers a scalable alternative to manual verification. Evaluating fact checking is challenging as existing benchmark datasets often include post claim analysis and annotator cues, which are absent in real world scenarios where claims are fact checked immediately after being made. This limits the realism of current evaluations. We present Politi Fact Only (PFO), a 5 class benchmark dataset of 2,982 political claims from politifact.com, where all post claim analysis and annotator cues have been removed manually. This ensures that models are evaluated using only the information that would have been available prior to the claim’s verification. Evaluating LLMs on PFO, we see an average performance drop of 22% in terms of macro f1 compared to PFO’s unfiltered version. Based on the identified challenges of the existing LLM based fact checking system, we propose RAV (Recon Answer Verify), an agentic framework with three agents: question generator, answer generator, and label generator. Our pipeline iteratively generates and answers sub questions to verify different aspects of the claim before finally generating the label. RAV generalizes across domains and label granularities, and it outperforms state of the art approaches on well known baselines RAWFC (fact checking, 3 class) by 25.28%, and on HOVER (encyclopedia, 2 class) by 1.54% on 2 hop, 4.94% on 3 hop, and 1.78% on 4 hop, sub categories respectively. RAV shows the least performance drop compared to baselines of 16.3% in macro f1 when we compare PFO with its unfiltered version.

nan

Article 541

Title@2025-07-04 (5): Interaction Techniques that Encourage Longer Prompts Can Improve Psychological Ownership when Writing with AI

Title: Interaction Techniques that Encourage Longer Prompts Can Improve Psychological Ownership when Writing with AI

Interaktionstechniken, die länger prompts fördern, können die Psychologische Eigentümerschaft verbessern, wenn sie mit KI schreiben

鼓励较长时间提示的交互技术在与大赦国际撰写文章时能够改善心理自主性 2507.03670v1

Authors (2): Nikhita Joshi, Daniel Vogel

Writing longer prompts for an AI assistant to generate a short story increases psychological ownership, a user’s feeling that the writing belongs to them. To encourage users to write longer prompts, we evaluated two interaction techniques that modify the prompt entry interface of chat-based generative AI assistants: pressing and holding the prompt submission button, and continuously moving a slider up and down when submitting a short prompt. A within-subjects experiment investigated the effects of such techniques on prompt length and psychological ownership, and results showed that these techniques increased prompt length and led to higher psychological ownership than baseline techniques. A second experiment further augmented these techniques by showing AI-generated suggestions for how the prompts could be expanded. This further increased prompt length, but did not lead to improvements in psychological ownership. Our results show that simple interface modifications like these can elicit more writing from users and improve psychological ownership.

nan

Article 542

Title@2025-07-04 (5): TRACE: Training and Inference-Time Interpretability Analysis for Language Models

Title: TRACE: Training and Inference-Time Interpretability Analysis for Language Models

TRACE: Trainings- und Inferenz-Zeit-Interpretabilitätsanalyse für Sprachmodelle

TRACE: 语文模式培训和推断-时间解释性分析 2507.03668v1

Authors (3): Nura Aljaafari, Danilo S. Carvalho, André Freitas

Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including features probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression, signals overlooked by traditional scalar metrics such as loss or accuracy. With minimal integration effort, the tool enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.

nan

Article 543

Title@2025-07-04 (5): Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs

Title: Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs

Re-Emergent-Missausrichtung: Wie schmale Feintuning-Erodes Sicherheit Ausrichtung in LLMs

重新出现错配: LLM 中如何将安全对齐 2507.03662v1

Authors (1): Jeremiah Giordani

Recent work has shown that fine-tuning large language models (LLMs) on code with security vulnerabilities can result in misaligned and unsafe behaviors across broad domains. These results prompted concerns about the emergence of harmful behaviors from narrow domain fine-tuning. In this paper, we contextualize these findings by analyzing how such narrow adaptation impacts the internal mechanisms and behavioral manifestations of LLMs. Through a series of experiments covering output probability distributions, loss and gradient vector geometry, layer-wise activation dynamics, and activation space dimensions, we find that behaviors attributed to “emergent misalignment” may be better interpreted as an erosion of prior alignment. We show that fine tuning on insecure code induces internal changes that oppose alignment. Further, we identify a shared latent dimension in the model’s activation space that governs alignment behavior. We show that this space is activated by insecure code and by misaligned responses more generally, revealing how narrow fine-tuning can degrade general safety behavior by interfering with shared internal mechanisms. Our findings offer a mechanistic interpretation for previously observed misalignment phenomena, and highlights the fragility of alignment in LLMs. The results underscore the need for more robust fine-tuning strategies that preserve intended behavior across domains.

nan

Article 544

Title@2025-07-04 (5): Exploring Robustness of LLMs to Paraphrasing Based on Sociodemographic Factors

Title: Exploring Robustness of LLMs to Paraphrasing Based on Sociodemographic Factors

Erforschung der Robustheit von LLMs zur Paraphrasierung auf der Grundlage soziodemografischer Faktoren

探讨根据社会人口因素将长效LLMs 2501.08276v2

Authors (3): Pulkit Arora, Akbar Karimi, Lucie Flek

Despite their linguistic prowess, LLMs have been shown to be vulnerable to small input perturbations. While robustness to local adversarial changes has been studied, robustness to global modifications such as different linguistic styles remains underexplored. Therefore, we take a broader approach to explore a wider range of variations across sociodemographic dimensions. We extend the SocialIQA dataset to create diverse paraphrased sets conditioned on sociodemographic factors (age and gender). The assessment aims to provide a deeper understanding of LLMs in (a) their capability of generating demographic paraphrases with engineered prompts and (b) their capabilities in interpreting real-world, complex language scenarios. We also perform a reliability analysis of the generated paraphrases looking into linguistic diversity and perplexity as well as manual evaluation. We find that demographic-based paraphrasing significantly impacts the performance of language models, indicating that the subtleties of linguistic variation remain a significant challenge. We will make the code and dataset available for future research.

nan

Article 545

Title@2025-07-04 (5): Disentangling the Roles of Representation and Selection in Data Pruning

Title: Disentangling the Roles of Representation and Selection in Data Pruning

Entwirren der Rollen von Repräsentation und Auswahl bei der Datenkorrektur

分离数据谨慎中的代表和甄选作用 2507.03648v1

Authors (6): Yupei Du, Yingjin Song, Hugh Mee Wong, Daniil Ignatev, Albert Gatt, Dong Nguyen

Data pruning, selecting small but impactful subsets, offers a promising way to efficiently scale NLP model training. However, existing methods often involve many different design choices, which have not been systematically studied. This limits future developments. In this work, we decompose data pruning into two key components: the data representation and the selection algorithm, and we systematically analyze their influence on the selection of instances. Our theoretical and empirical results highlight the crucial role of representations: better representations, e.g., training gradients, generally lead to a better selection of instances, regardless of the chosen selection algorithm. Furthermore, different selection algorithms excel in different settings, and none consistently outperforms the others. Moreover, the selection algorithms do not always align with their intended objectives: for example, algorithms designed for the same objective can select drastically different instances, highlighting the need for careful evaluation.

nan

Article 546

Title@2025-07-04 (5): ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving

Title: ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving

ArithmAttack: Bewertung der Robustheit von LLMs zum geräuschvollen Kontext bei der Lösung von Math-Problemen

ArithmAttack:评估理学问题解答时对吵闹背景的LLMs强力 2501.08203v2

Authors (4): Zain Ul Abedin, Shahzeb Qamar, Lucie Flek, Akbar Karimi

While Large Language Models (LLMs) have shown impressive capabilities in math problem-solving tasks, their robustness to noisy inputs is not well-studied. We propose ArithmAttack to examine how robust the LLMs are when they encounter noisy prompts that contain extra noise in the form of punctuation marks. While being easy to implement, ArithmAttack does not cause any information loss since words are not added or deleted from the context. We evaluate the robustness of eight LLMs, including LLama3, Mistral, Mathstral, and DeepSeek on noisy GSM8K and MultiArith datasets. Our experiments suggest that all the studied models show vulnerability to such noise, with more noise leading to poorer performances.

nan

Article 547

Title@2025-07-04 (5): Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion

Title: Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion

Verbesserung der Low-Resource-Dialektenklassifikation mittels Retrieval-basierter Sprachumwandlung

利用以检索为基础的语音转换改进低资源 2507.03641v1

Authors (5): Lea Fischbach, Akbar Karimi, Caroline Kleen, Alfred Lameli, Lucie Flek

Deep learning models for dialect identification are often limited by the scarcity of dialectal data. To address this challenge, we propose to use Retrieval-based Voice Conversion (RVC) as an effective data augmentation method for a low-resource German dialect classification task. By converting audio samples to a uniform target speaker, RVC minimizes speaker-related variability, enabling models to focus on dialect-specific linguistic and phonetic features. Our experiments demonstrate that RVC enhances classification performance when utilized as a standalone augmentation method. Furthermore, combining RVC with other augmentation methods such as frequency masking and segment removal leads to additional performance gains, highlighting its potential for improving dialect classification in low-resource scenarios.

nan

Article 548

Title@2025-07-04 (5): Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy

Title: Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy

Ist es Zeit, Prompts als Code zu behandeln? Eine Multi-Use-Fallstudie für Prompt-Optimierung mit DSPy

是否是时候将提示作为代码处理? 使用 DSPy 快速优化的多用途案例研究 2507.03620v1

Authors (3): Francisca Lemos, Victor Alves, Filipa Ferraz

Although prompt engineering is central to unlocking the full potential of Large Language Models (LLMs), crafting effective prompts remains a time-consuming trial-and-error process that relies on human intuition. This study investigates Declarative Self-improving Python (DSPy), an optimization framework that programmatically creates and refines prompts, applied to five use cases: guardrail enforcement, hallucination detection in code, code generation, routing agents, and prompt evaluation. Each use case explores how prompt optimization via DSPy influences performance. While some cases demonstrated modest improvements - such as minor gains in the guardrails use case and selective enhancements in hallucination detection - others showed notable benefits. The prompt evaluation criterion task demonstrated a substantial performance increase, rising accuracy from 46.2% to 64.0%. In the router agent case, the possibility of improving a poorly performing prompt and of a smaller model matching a stronger one through optimized prompting was explored. Although prompt refinement increased accuracy from 85.0% to 90.0%, using the optimized prompt with a cheaper model did not improve performance. Overall, this study’s findings suggest that DSPy’s systematic prompt optimization can enhance LLM performance, particularly when instruction tuning and example selection are optimized together. However, the impact varies by task, highlighting the importance of evaluating specific use cases in prompt optimization research.

nan

Article 549

Title@2025-07-04 (5): EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge

Title: EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge

EMERGE: Ein Benchmark für die Aktualisierung von Wissensgraphen mit neu entstehendem Textwissen

EMEGERGE: 更新具有新兴文字知识的知识图表的基准 2507.03617v1

Authors (6): Klim Zaporojets, Daniel Daza, Edoardo Barba, Ira Assent, Roberto Navigli, Paul Groth

Knowledge Graphs (KGs) are structured knowledge repositories containing entities and relations between them. In this paper, we investigate the problem of automatically updating KGs over time with respect to the evolution of knowledge in unstructured textual sources. This problem requires identifying a wide range of update operations based on the state of an existing KG at a specific point in time. This contrasts with traditional information extraction pipelines, which extract knowledge from text independently of the current state of a KG. To address this challenge, we propose a method for lifelong construction of a dataset consisting of Wikidata KG snapshots over time and Wikipedia passages paired with the corresponding edit operations that they induce in a particular KG snapshot. The resulting dataset comprises 376K Wikipedia passages aligned with a total of 1.25M KG edits over 10 different snapshots of Wikidata from 2019 to 2025. Our experimental results highlight challenges in updating KG snapshots based on emerging textual knowledge, positioning the dataset as a valuable benchmark for future research. We will publicly release our dataset and model implementations.

nan

Article 550

Title@2025-07-04 (5): RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning

Title: RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning

RAG+: Verbesserung der Retrieval-Augmented Generation mit anwendungsrelevanter Begründung

RAG+:加强利用应用程序软件软件软件软件支持的检索-启动生成 2506.11555v3

Authors (9): Yu Wang, Shiwan Zhao, Zhihu Wang, Ming Fan, Yubo Zhang, Xicheng Zhang, Zhengfan Wang, Heyuan Huang, Ting Liu

The integration of external knowledge through Retrieval-Augmented Generation (RAG) has become foundational in enhancing large language models (LLMs) for knowledge-intensive tasks. However, existing RAG paradigms often overlook the cognitive step of applying knowledge, leaving a gap between retrieved facts and task-specific reasoning. In this work, we introduce RAG+, a principled and modular extension that explicitly incorporates application-aware reasoning into the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and aligned application examples, created either manually or automatically, and retrieves both jointly during inference. This design enables LLMs not only to access relevant information but also to apply it within structured, goal-oriented reasoning processes. Experiments across mathematical, legal, and medical domains, conducted on multiple models, demonstrate that RAG+ consistently outperforms standard RAG variants, achieving average improvements of 3-5%, and peak gains up to 7.5% in complex scenarios. By bridging retrieval with actionable application, RAG+ advances a more cognitively grounded framework for knowledge integration, representing a step toward more interpretable and capable LLMs.

nan

Article 551

Title@2025-07-04 (5): ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs

Title: ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs

ALAS: Latente Sprach-Text-Ausrichtung für gesprochenes Sprachverständnis in multimodalen LLMs messen

ALAS: 计量多种模式LM 中口语语言理解的暗中语音-文本对齐 2505.19937v2

Authors (4): Pooneh Mousavi, Yingzhi Wang, Mirco Ravanelli, Cem Subakan

Large Language Models (LLMs) are increasingly used in Spoken Language Understanding (SLU), where effective multimodal learning depends on the alignment between audio and text. Despite various fusion methods, no standard metric exists to assess this alignment. This work introduces ALAS (Automatic Latent Alignment Score), a metric that evaluates alignment by measuring correlations between audio and text representations across transformer layers. Experiments on Spoken Question Answering and Emotion Recognition show that ALAS captures meaningful patterns across tasks and layers.

nan

Article 552

Title@2025-07-04 (5): Multi-Hop Reasoning for Question Answering with Hyperbolic Representations

Title: Multi-Hop Reasoning for Question Answering with Hyperbolic Representations

Multi-Hop-Gründung für die Beantwortung von Fragen mit hyperbolischen Darstellungen

使用超曲解表示式回答问题多理由 2507.03612v1

Authors (3): Simon Welz, Lucie Flek, Akbar Karimi

Hyperbolic representations are effective in modeling knowledge graph data which is prevalently used to facilitate multi-hop reasoning. However, a rigorous and detailed comparison of the two spaces for this task is lacking. In this paper, through a simple integration of hyperbolic representations with an encoder-decoder model, we perform a controlled and comprehensive set of experiments to compare the capacity of hyperbolic space versus Euclidean space in multi-hop reasoning. Our results show that the former consistently outperforms the latter across a diverse set of datasets. In addition, through an ablation study, we show that a learnable curvature initialized with the delta hyperbolicity of the utilized data yields superior results to random initializations. Furthermore, our findings suggest that hyperbolic representations can be significantly more advantageous when the datasets exhibit a more hierarchical structure.

nan

Article 553

Title@2025-07-04 (5): ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation

Title: ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation

ViMRHP: Ein vietnamesischer Benchmark-Datensatz für die multimodale Überprüfung von Hilfsvorhersage über Human-AI Collaborative Annotation

VIMRHP:越南通过人类 – – AI合作说明,为多模式审查 “ 助助预测 “ 建立一个越南基准数据集 2505.07416v2

Authors (4): Truc Mai-Thanh Nguyen, Dat Minh Nguyen, Son T. Luu, Kiet Van Nguyen

Multimodal Review Helpfulness Prediction (MRHP) is an essential task in recommender systems, particularly in E-commerce platforms. Determining the helpfulness of user-generated reviews enhances user experience and improves consumer decision-making. However, existing datasets focus predominantly on English and Indonesian, resulting in a lack of linguistic diversity, especially for low-resource languages such as Vietnamese. In this paper, we introduce ViMRHP (Vietnamese Multimodal Review Helpfulness Prediction), a large-scale benchmark dataset for MRHP task in Vietnamese. This dataset covers four domains, including 2K products with 46K reviews. Meanwhile, a large-scale dataset requires considerable time and cost. To optimize the annotation process, we leverage AI to assist annotators in constructing the ViMRHP dataset. With AI assistance, annotation time is reduced (90 to 120 seconds per task down to 20 to 40 seconds per task) while maintaining data quality and lowering overall costs by approximately 65%. However, AI-generated annotations still have limitations in complex annotation tasks, which we further examine through a detailed performance analysis. In our experiment on ViMRHP, we evaluate baseline models on human-verified and AI-generated annotations to assess their quality differences. The ViMRHP dataset is publicly available at https://github.com/trng28/ViMRHP

nan

Article 554

Title@2025-07-04 (5): Should We Still Pretrain Encoders with Masked Language Modeling?

Title: Should We Still Pretrain Encoders with Masked Language Modeling?

Sollten wir noch Encoder mit maskierten Sprachmodellen vortrainieren?

我们是否仍应该为带有隐蔽语言建模的编程者预作准备? 2507.00994v2

Authors (8): Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM, achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.

nan

Article 555

Title@2025-07-04 (5): RECA-PD: A Robust Explainable Cross-Attention Method for Speech-based Parkinson’s Disease Classification

Title: RECA-PD: A Robust Explainable Cross-Attention Method for Speech-based Parkinson’s Disease Classification

RECA-PD: Robuste, erklärbare Cross-Attention-Methode zur Klassifizierung sprachbasierter Parkinson-Krankheiten

REECA-PD: 以语音为基础的帕金森氏病分类的强有力的可解释交叉注意方法 2507.03594v1

Authors (4): Terry Yi Zhong, Cristian Tejedor-Garcia, Martha Larson, Bastiaan R. Bloem

Parkinson’s Disease (PD) affects over 10 million people globally, with speech impairments often preceding motor symptoms by years, making speech a valuable modality for early, non-invasive detection. While recent deep-learning models achieve high accuracy, they typically lack the explainability required for clinical use. To address this, we propose RECA-PD, a novel, robust, and explainable cross-attention architecture that combines interpretable speech features with self-supervised representations. RECA-PD matches state-of-the-art performance in Speech-based PD detection while providing explanations that are more consistent and more clinically meaningful. Additionally, we demonstrate that performance degradation in certain speech tasks (e.g., monologue) can be mitigated by segmenting long recordings. Our findings indicate that performance and explainability are not necessarily mutually exclusive. Future work will enhance the usability of explanations for non-experts and explore severity estimation to increase the real-world clinical relevance.

nan

Article 556

Title@2025-07-04 (5): Evaluating AI capabilities in detecting conspiracy theories on YouTube

Title: Evaluating AI capabilities in detecting conspiracy theories on YouTube

Bewertung von KI-Fähigkeiten bei der Entdeckung von Verschwörungstheorien auf YouTube

评价大赦国际在YouTube上发现阴谋论的能力 2505.23570v2

Authors (3): Leonardo La Rocca, Francesco Corso, Francesco Pierri

As a leading online platform with a vast global audience, YouTube’s extensive reach also makes it susceptible to hosting harmful content, including disinformation and conspiracy theories. This study explores the use of open-weight Large Language Models (LLMs), both text-only and multimodal, for identifying conspiracy theory videos shared on YouTube. Leveraging a labeled dataset of thousands of videos, we evaluate a variety of LLMs in a zero-shot setting and compare their performance to a fine-tuned RoBERTa baseline. Results show that text-based LLMs achieve high recall but lower precision, leading to increased false positives. Multimodal models lag behind their text-only counterparts, indicating limited benefits from visual data integration. To assess real-world applicability, we evaluate the most accurate models on an unlabeled dataset, finding that RoBERTa achieves performance close to LLMs with a larger number of parameters. Our work highlights the strengths and limitations of current LLM-based approaches for online harmful content detection, emphasizing the need for more precise and robust systems.

nan

Article 557

Title@2025-07-04 (5): Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation

Title: Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation

Causal-SAM-LLM: Große Sprachmodelle als Ursache für robuste medizinische Segmentierung

Causal-SAM-LLLM:大语言模型,作为强力医疗分解的原因 2507.03585v1

Authors (4): Tao Tang, Shijie Xu, Yiting Wu, Zhixiang Lu

The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Language Models (LLMs) to the role of causal reasoners. Our framework, built upon a frozen Segment Anything Model (SAM) encoder, incorporates two synergistic innovations. First, Linguistic Adversarial Disentanglement (LAD) employs a Vision-Language Model to generate rich, textual descriptions of confounding image styles. By training the segmentation model’s features to be contrastively dissimilar to these style descriptions, it learns a representation robustly purged of non-causal information. Second, Test-Time Causal Intervention (TCI) provides an interactive mechanism where an LLM interprets a clinician’s natural language command to modulate the segmentation decoder’s features in real-time, enabling targeted error correction. We conduct an extensive empirical evaluation on a composite benchmark from four public datasets (BTCV, CHAOS, AMOS, BraTS), assessing generalization under cross-scanner, cross-modality, and cross-anatomy settings. Causal-SAM-LLM establishes a new state of the art in out-of-distribution (OOD) robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, all while using less than 9% of the full model’s trainable parameters. Our work charts a new course for building robust, efficient, and interactively controllable medical AI systems.

nan

Article 558

Title@2025-07-04 (5): Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits

Title: Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits

Ambiguous Terminologie von Preference Optimization auf Post-Edits übersetzen lernen

学习如何通过“优先优化”在编辑后采用“优先优化”来翻译模糊的名词 2507.03580v1

Authors (5): Nathaniel Berger, Johannes Eschbach-Dymanus, Miriam Exel, Matthias Huck, Stefan Riezler

In real world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization, with both term-specific and full sequence objectives, yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. Additionally, we release test sets from our post-edited data and terminology dictionary.

nan

Article 559

Title@2025-07-04 (5): Articulatory clarity and variability before and after surgery for tongue cancer

Title: Articulatory clarity and variability before and after surgery for tongue cancer

Artikulatorische Klarheit und Variabilität vor und nach der Operation bei Zungenkrebs

舌癌手术前后和手术前后的产道清晰度和变异性 2507.03576v1

Authors (13): Thomas Tienkamp, Fleur van Ast, Roos van der Veen, Teja Rebernik, Raoul Buurke, Nikki Hoekzema, Katharina Polsterer, Hedwig Sekeres, Rob van Son, Martijn Wieling, Max Witjes, Sebastiaan de Visscher, Defne Abur

Surgical treatment for tongue cancer can negatively affect the mobility and musculature of the tongue, which can influence articulatory clarity and variability. In this study, we investigated articulatory clarity through the vowel articulation index (VAI) and variability through vowel formant dispersion (VFD). Using a sentence reading task, we assessed 11 individuals pre and six months post tongue cancer surgery, alongside 11 sex- and age matched typical speakers. Our results show that while the VAI was significantly smaller post-surgery compared to pre-surgery, there was no significant difference between patients and typical speakers at either time point. Post-surgery, speakers had higher VFD values for /i/ compared to pre-surgery and typical speakers, signalling higher variability. Taken together, our results suggest that while articulatory clarity remained within typical ranges following surgery for tongue cancer for the speakers in our study, articulatory variability increased.

nan

Article 560

Title@2025-07-04 (5): ReviewInstruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models

Title: ReviewInstruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models

ReviewInstruct: Eine review-getriebene Multi-Turn-Konversations-Erstellungsmethode für große Sprachmodelle

审查指南:大语言模型的复议-驱动多发多发对话生成方法 2505.11010v2

Authors (13): Jiangxu Wu, Cong Wang, TianHuang Su, Jun Yang, Haozhi Lin, Chao Zhang, Ming Peng, Kai Shi, SongPan Yang, BinQing Pan, ZiXian Li, Ni Yang, ZhenYu Yang

The effectiveness of large language models (LLMs) in conversational AI is hindered by their reliance on single-turn supervised fine-tuning (SFT) data, which limits contextual coherence in multi-turn dialogues. Existing methods for generating multi-turn dialogue data struggle to ensure both diversity and quality in instructions. To address this, we propose Review-Instruct, a novel framework that synthesizes multi-turn conversations through an iterative “Ask-Respond-Review” process involving three agent roles: a Candidate, multiple Reviewers, and a Chairman. The framework iteratively refines instructions by incorporating Reviewer feedback, enhancing dialogue diversity and difficulty. We construct a multi-turn dataset using the Alpaca dataset and fine-tune the LLaMA2-13B model. Evaluations on MT-Bench, MMLU-Pro, and Auto-Arena demonstrate significant improvements, achieving absolute gains of 2.9\% on MMLU-Pro and 2\% on MT-Bench compared to prior state-of-the-art models based on LLaMA2-13B. Ablation studies confirm the critical role of the Review stage and the use of multiple Reviewers in boosting instruction diversity and difficulty. Our work highlights the potential of review-driven, multi-agent frameworks for generating high-quality conversational data at scale.

nan

Article 561

Title@2025-07-04 (5): H2HTalk: Evaluating Large Language Models as Emotional Companion

Title: H2HTalk: Evaluating Large Language Models as Emotional Companion

H2HTalk: Große Sprachmodelle als emotionalen Begleiter bewerten

H2H Talk:评价作为情感伴侣的大语言模式 2507.03543v1

Authors (4): Boyang Wang, Yalun Wu, Hongcheng Guo, Zhoujun Li

As digital emotional support needs grow, Large Language Model companions offer promising authentic, always-available empathy, though rigorous evaluation lags behind model advancement. We present Heart-to-Heart Talk (H2HTalk), a benchmark assessing companions across personality development and empathetic interaction, balancing emotional intelligence with linguistic fluency. H2HTalk features 4,650 curated scenarios spanning dialogue, recollection, and itinerary planning that mirror real-world support conversations, substantially exceeding previous datasets in scale and diversity. We incorporate a Secure Attachment Persona (SAP) module implementing attachment-theory principles for safer interactions. Benchmarking 50 LLMs with our unified protocol reveals that long-horizon planning and memory retention remain key challenges, with models struggling when user needs are implicit or evolve mid-conversation. H2HTalk establishes the first comprehensive benchmark for emotionally intelligent companions. We release all materials to advance development of LLMs capable of providing meaningful and safe psychological support.

nan

Article 562

Title@2025-07-04 (5): Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation

Title: Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation

Aussprache-Lexikon Kostenloses Training für Phoneme-basierte Crosslingual ASR über Joint Stochastic Approximation

通过联合口交近为基于电话的跨语言ASR提供发音免费培训 2507.06249v1

Authors (4): Saierdaer Yusuyin, Te Ma, Hao Huang, Zhijian Ou

Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a latent variable model based method, with phonemes being treated as discrete latent variables. The new method consists of a speech-to-phoneme (S2P) model and a phoneme-to-grapheme (P2G) model, and a grapheme-to-phoneme (G2P) model is introduced as an auxiliary inference model. To jointly train the three models, we utilize the joint stochastic approximation (JSA) algorithm, which is a stochastic extension of the EM (expectation-maximization) algorithm and has demonstrated superior performance particularly in estimating discrete latent variable models. Based on the Whistle multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish (130 h) and Indonesian (20 h). With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5\% error rate reductions compared to the best crosslingual fine-tuning approach using subword or full phoneme supervision. Furthermore, it is found that in language domain adaptation (i.e., utilizing cross-domain text-only data), JSA-SPG outperforms the standard practice of language model fusion via the auxiliary support of the G2P model by 9% error rate reductions. To facilitate reproducibility and encourage further exploration in this field, we open-source the JSA-SPG training code and complete pipeline.

nan

Article 563

Title@2025-07-04 (5): AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions

Title: AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions

AI-VaxGuide: Agentische RAG-basierte LLM für Impfentscheidungen

AI-VaxGuide:用于疫苗接种决定的以RAG为基地的AAG剂LLMLM 2507.03493v1

Authors (4): Abdellah Zeggai, Ilyes Traikia, Abdelhak Lakehal, Abdennour Boulesnane

Vaccination plays a vital role in global public health, yet healthcare professionals often struggle to access immunization guidelines quickly and efficiently. National protocols and WHO recommendations are typically extensive and complex, making it difficult to extract precise information, especially during urgent situations. This project tackles that issue by developing a multilingual, intelligent question-answering system that transforms static vaccination guidelines into an interactive and user-friendly knowledge base. Built on a Retrieval-Augmented Generation (RAG) framework and enhanced with agent-based reasoning (Agentic RAG), the system provides accurate, context-sensitive answers to complex medical queries. Evaluation shows that Agentic RAG outperforms traditional methods, particularly in addressing multi-step or ambiguous questions. To support clinical use, the system is integrated into a mobile application designed for real-time, point-of-care access to essential vaccine information. AI-VaxGuide model is publicly available on https://huggingface.co/VaxGuide

nan

Article 564

Title@2025-07-04 (5): Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences

Title: Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences

Vier Schattierungen der Lebenswissenschaften: Ein Datensatz für Desinformationserkennung in den Lebenswissenschaften

《生命科学的四大结构:生命科学中虚假信息探测数据集》 2507.03488v1

Authors (3): Eva Seidlmayer, Lukas Galke, Konrad U. Förstner

Disseminators of disinformation often seek to attract attention or evoke emotions - typically to gain influence or generate revenue - resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating, and updating the dataset is available on GitHub: https://github.com/EvaSeidlmayer/FourShadesofLifeSciences

nan

Article 565

Title@2025-07-04 (5): Chat2SPaT: A Large Language Model Based Tool for Automating Traffic Signal Control Plan Management

Title: Chat2SPaT: A Large Language Model Based Tool for Automating Traffic Signal Control Plan Management

Chat2SPaT: Ein auf großen Sprachmodellen basierendes Tool zur Automatisierung der Verkehrssignalsteuerung

Chat2SPaaT:以大语言为基础的交通信号控制计划管理自动化示范工具 2507.05283v1

Authors (6): Yue Wang, Miao Zhou, Guijing Huang, Rui Zhuo, Chao Yi, Zhenliang Ma

Pre-timed traffic signal control, commonly used for operating signalized intersections and coordinated arterials, requires tedious manual work for signaling plan creating and updating. When the time-of-day or day-of-week plans are utilized, one intersection is often associated with multiple plans, leading to further repetitive manual plan parameter inputting. To enable a user-friendly traffic signal control plan management process, this study proposes Chat2SPaT, a method to convert users’ semi-structured and ambiguous descriptions on the signal control plan to exact signal phase and timing (SPaT) results, which could further be transformed into structured stage-based or ring-based plans to interact with intelligent transportation system (ITS) software and traffic signal controllers. With curated prompts, Chat2SPaT first leverages large language models’ (LLMs) capability of understanding users’ plan descriptions and reformulate the plan as a combination of phase sequence and phase attribute results in the json format. Based on LLM outputs, python scripts are designed to locate phases in a cycle, address nuances of traffic signal control, and finally assemble the complete traffic signal control plan. Within a chat, the pipeline can be utilized iteratively to conduct further plan editing. Experiments show that Chat2SPaT can generate plans with an accuracy of over 94% for both English and Chinese cases, using a test dataset with over 300 plan descriptions. As the first benchmark for evaluating LLMs’ capability of understanding traffic signal control plan descriptions, Chat2SPaT provides an easy-to-use plan management pipeline for traffic practitioners and researchers, serving as a potential new building block for a more accurate and versatile application of LLMs in the field of ITS. The source codes, prompts and test dataset are openly accessible at https://github.com/yuewangits/Chat2SPaT.

nan

Article 566

Title@2025-07-04 (5): Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right

Title: Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right

Beyond Weaponization: NLP-Sicherheit für mittel- und minderwertige Sprachen im eigenen Recht

武器化之后的武器化之外:国家语言方案保障中、低资源语言自身权利的安全 2507.03473v1

Authors (1): Heather Lent

Despite mounting evidence that multilinguality can be easily weaponized against language models (LMs), works across NLP Security remain overwhelmingly English-centric. In terms of securing LMs, the NLP norm of “English first” collides with standard procedure in cybersecurity, whereby practitioners are expected to anticipate and prepare for worst-case outcomes. To mitigate worst-case outcomes in NLP Security, researchers must be willing to engage with the weakest links in LM security: lower-resourced languages. Accordingly, this work examines the security of LMs for lower- and medium-resourced languages. We extend existing adversarial attacks for up to 70 languages to evaluate the security of monolingual and multilingual LMs for these languages. Through our analysis, we find that monolingual models are often too small in total number of parameters to ensure sound security, and that while multilinguality is helpful, it does not always guarantee improved security either. Ultimately, these findings highlight important considerations for more secure deployment of LMs, for communities of lower-resourced languages.

nan

Article 567

Title@2025-07-04 (5): CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark

Title: CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark

CoreCodeBench: Ein konfigurierbarer Multi-Szenario-Repository-Level-Benchmark

核心守则:可配置的多设想仓库一级基准 2507.05281v1

Authors (13): Lingyue Fu, Hao Guan, Bolun Zhang, Haowei Yuan, Yaoming Zhu, Jun Xu, Zongyu Wang, Lin Qiu, Xunliang Cai, Xuezhi Cao, Weiwen Liu, Weinan Zhang, Yong Yu

As Large Language Models (LLMs) demonstrate increasingly sophisticated code processing capabilities, evaluating their performance on engineering-level code remains challenging. Existing repository-level benchmarks primarily focus on single scenarios, such as code generation or bug fixing, without adequately capturing the diversity and complexity of real-world software or project engineering workflows. Furthermore, these benchmarks suffer from limited controllability in question positioning and reliability issues in their generated test cases. To address these limitations, we present CorePipe, a fully automated pipeline that converts repositories into comprehensive test cases, and introduce CoreCodeBench, a configurable multi-scenario repository-level benchmark. To simulate real engineering scenarios, CorePipe generates three types of atomic questions (Development, BugFix, and Test-Driven Development) specifically targeting core code segments. These atomic questions are further combined into three types of composite questions, with difficulty levels flexibly adjusted through hyperparameter tuning. CoreCodeBench provides a comprehensive and extensive repository-level benchmark to investigate the applicability of LLMs in real-world engineering projects. Experiments with 16 LLMs across diverse scenarios reveal varying capabilities and offer multi-dimensional insights into LLM performance in engineering contexts. The code for CorePipe is available at https://github.com/AGI-Eval-Official/CoreCodeBench, and the data for CoreCodeBench can be accessed at https://huggingface.co/collections/tubehhh/corecodebench-68256d2faabf4b1610a08caa.

nan

Article 568

Title: Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models

Verbesserung der sozialen Determinanten der Gesundheitsdokumentation in französischen EHRs mit großen Sprachmodellen

利用大语言模式,改善法文健康文件的 “ 健康文件 “ 社会决定因素 2507.03433v1

Authors (9): Adrien Bazoge, Pacôme Constant dit Beaufils, Mohammed Hmitouch, Romain Bourcier, Emmanuel Morin, Richard Dufour, Béatrice Daille, Pierre-Antoine Gourraud, Matilde Karakachoff

Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. The model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score > 0.80). Performance was lower for categories with limited training data or highly variable expressions, such as employment status, housing, physical activity, income, and education. Our model identified 95.8% of patients with at least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our error analysis showed that performance limitations were linked to annotation inconsistencies, reliance on English-centric tokenizer, and reduced generalizability due to the model being trained on social history sections only. These results demonstrate the effectiveness of NLP in improving the completeness of real-world SDoH data in a non-English EHR system.

nan

Article 569

Title@2025-07-04 (5): Tokenization is Sensitive to Language Variation

Title: Tokenization is Sensitive to Language Variation

Tokenisierung ist auf Sprachvariation sensitiv

语法变换对语法变化敏感 2502.15343v2

Authors (3): Anna Wegmann, Dong Nguyen, David Jurgens

Variation in language is ubiquitous and often systematically linked to regional, social, and contextual factors. Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks: Tasks where the model should be robust to language variation (e.g., for semantic tasks like NLI, labels do not depend on whether a text uses British or American spelling) and tasks where the model should be sensitive to language variation (e.g., for form-based tasks like authorship verification, labels depend on whether a text uses British or American spelling). We pre-train BERT base models with the popular Byte-Pair Encoding algorithm to investigate how key tokenization design choices impact the performance of downstream models: the corpus used to train the tokenizer, the pre-tokenizer and the vocabulary size. We find that the best tokenizer varies on the two task types and that the pre-tokenizer has the biggest overall impact on performance. Further, we introduce a new approach to estimate tokenizer impact on downstream LLM performance, showing substantial improvement over metrics like R'enyi efficiency. We encourage more work on language variation and its relation to tokenizers and thus LLM performance.

nan

Article 570

Title@2025-07-04 (5): SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation

Title: SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation

SMCLM: Semantisch sinnvolle Kausale Sprachmodellierung für autoregressive Paraphrase-Generierung

SMCLM: 用于自动递减性参数生成的具有广泛意义的原因语言建模 2507.03415v1

Authors (3): Michał Perełkiewicz, Sławomir Dadas, Rafał Poświata

This article introduces semantically meaningful causal language modeling (SMCLM), a selfsupervised method of training autoregressive models to generate semantically equivalent text. Our approach involves using semantically meaningful text representation as an initial embedding in the autoregressive training and generation processes. The extensive empirical study demonstrates that the SMCLM approach makes autoregressive models capable of learning robust and high-quality paraphrase generation. The proposed method is competitive with the supervised method and achieves state-of-the-art results in unsupervised approaches. This article also presents a comprehensive set of automatic metrics that cover a wide range of autogenerated paraphrase evaluation aspects. Simultaneously, this article highlights the low reliability of the metrics that are widely used in paraphrase generation evaluation, including BLEU, ROUGE, and BERTScore.

nan

Article 571

Title@2025-07-04 (5): Graph Repairs with Large Language Models: An Empirical Study

Title: Graph Repairs with Large Language Models: An Empirical Study

Graphische Reparaturen mit großen Sprachmodellen: Eine empirische Studie

使用大语言模型修图:经验研究 2507.03410v1

Authors (3): Hrishikesh Terdalkar, Angela Bonifati, Andrea Mauri

Property graphs are widely used in domains such as healthcare, finance, and social networks, but they often contain errors due to inconsistencies, missing data, or schema violations. Traditional rule-based and heuristic-driven graph repair methods are limited in their adaptability as they need to be tailored for each dataset. On the other hand, interactive human-in-the-loop approaches may become infeasible when dealing with large graphs, as the cost–both in terms of time and effort–of involving users becomes too high. Recent advancements in Large Language Models (LLMs) present new opportunities for automated graph repair by leveraging contextual reasoning and their access to real-world knowledge. We evaluate the effectiveness of six open-source LLMs in repairing property graphs. We assess repair quality, computational cost, and model-specific performance. Our experiments show that LLMs have the potential to detect and correct errors, with varying degrees of accuracy and efficiency. We discuss the strengths, limitations, and challenges of LLM-driven graph repair and outline future research directions for improving scalability and interpretability.

nan

Article 572

Title@2025-07-04 (5): ReservoirChat: Interactive Documentation Enhanced with LLM and Knowledge Graph for ReservoirPy

Title: ReservoirChat: Interactive Documentation Enhanced with LLM and Knowledge Graph for ReservoirPy

ReservoirChat: Interaktive Dokumentation mit LLM und Wissensdiagramm für ReservoirPy

RESSOCWChat:与LLM和知识图增强互动文件 2507.05279v1

Authors (4): Virgile Boraud, Yannis Bendi-Ouis, Paul Bernard, Xavier Hinaut

We introduce a tool designed to improve the capabilities of Large Language Models (LLMs) in assisting with code development using the ReservoirPy library, as well as in answering complex questions in the field of Reservoir Computing. By incorporating external knowledge through Retrieval-Augmented Generation (RAG) and knowledge graphs, our approach aims to reduce hallucinations and increase the factual accuracy of generated responses. The system provides an interactive experience similar to ChatGPT, tailored specifically for ReservoirPy, enabling users to write, debug, and understand Python code while accessing reliable domain-specific insights. In our evaluation, while proprietary models such as ChatGPT-4o and NotebookLM performed slightly better on general knowledge questions, our model outperformed them on coding tasks and showed a significant improvement over its base model, Codestral-22B.

nan

Article 573

Title@2025-07-04 (5): On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages

Title: On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages

Zur Expressivität und Längenverallgemeinerung von selektiven State-Space-Modellen auf regulären Sprachen

关于常规语言选择性国家空间模型的表达性和时间长度的一般化 2412.19350v2

Authors (6): Aleksandar Terzić, Michael Hersche, Giacomo Camposampiero, Thomas Hofmann, Abu Sebastian, Abbas Rahimi

Selective state-space models (SSMs) are an emerging alternative to the Transformer, offering the unique advantage of parallel training and sequential inference. Although these models have shown promising performance on a variety of tasks, their formal expressiveness and length generalization properties remain underexplored. In this work, we provide insight into the workings of selective SSMs by analyzing their expressiveness and length generalization performance on regular language tasks, i.e., finite-state automaton (FSA) emulation. We address certain limitations of modern SSM-based architectures by introducing the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization on a set of various regular language tasks using a single layer. It utilizes a dictionary of dense transition matrices, a softmax selection mechanism that creates a convex combination of dictionary matrices at each time step, and a readout consisting of layer normalization followed by a linear map. We then proceed to evaluate variants of diagonal selective SSMs by considering their empirical performance on commutative and non-commutative automata. We explain the experimental results with theoretical considerations. Our code is available at https://github.com/IBM/selective-dense-state-space-model.

nan

Article 574

Title@2025-07-04 (5): Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Title: Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Verallgemeinerung oder Halluzination? Verstehen von Out-of-Context-Reasoning in Transformers

通化还是幻觉? 理解变异器的逻辑外原因 2506.10887v2

Authors (8): Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei

Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.

nan

Article 575

Title@2025-07-04 (5): Entity-aware Cross-lingual Claim Detection for Automated Fact-checking

Title: Entity-aware Cross-lingual Claim Detection for Automated Fact-checking

Entity-aware Cross-lingual Claim Detection for Automated Fact-Checking

用于自动实况调查的有实体意识的跨语言交叉索赔调查 2503.15220v3

Authors (2): Rrubaa Panchendrarajan, Arkaitz Zubiaga

Identifying claims requiring verification is a critical task in automated fact-checking, especially given the proliferation of misinformation on social media platforms. Despite notable progress, challenges remain-particularly in handling multilingual data prevalent in online discourse. Recent efforts have focused on fine-tuning pre-trained multilingual language models to address this. While these models can handle multiple languages, their ability to effectively transfer cross-lingual knowledge for detecting claims spreading on social media remains under-explored. In this paper, we introduce EX-Claim, an entity-aware cross-lingual claim detection model that generalizes well to handle multilingual claims. The model leverages entity information derived from named entity recognition and entity linking techniques to improve the language-level performance of both seen and unseen languages during training. Extensive experiments conducted on three datasets from different social media platforms demonstrate that our proposed model stands out as an effective solution, demonstrating consistent performance gains across 27 languages and robust knowledge transfer between languages seen and unseen during training.

nan

Article 576

Title@2025-07-04 (5): EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization

Title: EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization

EAQuant: Verbesserung der Post-Training-Quantisierung für MoE-Modelle durch Experten-Aware-Optimierung

EAQuant:通过专家-软件优化,加强培训后对教育部模型的量化 2506.13329v2

Authors (8): Zhongqian Fu, Ning Ding, Kai Han, Xianzhi Yu, Xiaosong Li, Xinghao Chen, Yehui Tang, Yunhe Wang

Mixture-of-Experts (MoE) models have emerged as a cornerstone of large-scale deep learning by efficiently distributing computation and enhancing performance. However, their unique architecture-characterized by sparse expert activation and dynamic routing mechanisms-introduces inherent complexities that challenge conventional quantization techniques. Existing post-training quantization (PTQ) methods struggle to address activation outliers, router consistency and sparse expert calibration, leading to significant performance degradation. To bridge this gap, we propose EAQuant, a novel PTQ framework tailored for MoE architectures. Our method systematically tackles these challenges through three key innovations: (1) expert-aware smoothing aggregation to suppress activation outliers and stabilize quantization, (2) router logits distribution alignment to preserve expert selection consistency post-quantization, and (3) expert-level calibration data balance to optimize sparsely activated experts. Extensive experiments across W4A4 and extreme W3A4 quantization configurations demonstrate that EAQuant significantly outperforms existing methods, achieving average score improvements of 1.15 - 2.28% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks and robust performance retention under aggressive quantization. By integrating these innovations, EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression. Our code is available at https://github.com/darren-fzq1/EAQuant.

nan

Article 577

Title@2025-07-04 (5): Making Sense of Korean Sentences: A Comprehensive Evaluation of LLMs through KoSEnd Dataset

Title: Making Sense of Korean Sentences: A Comprehensive Evaluation of LLMs through KoSEnd Dataset

Gespür für koreanische Sätze: Eine umfassende Bewertung von LLMs durch KoSEnd-Datensatz

韩国判刑感:通过KoSend数据集对LLMs的全面评估 2507.03378v1

Authors (4): Seunguk Yu, Kyeonghyun Kim, Jungmin Yun, Youngbin Kim

Although LLMs have made significant progress in various languages, there are still concerns about their effectiveness with low-resource agglutinative languages compared to languages such as English. In this study, we focused on Korean, a language known for its complex sentence endings, and evaluated LLMs on this challenging aspect. We introduce the Korean Sentence Endings (KoSEnd) dataset, which includes 3,000 sentences, each annotated for the naturalness of 15 sentence ending forms. These were collected from diverse sources to cover a range of contexts. We evaluated 11 LLMs to assess their understanding of Korean sentence endings, analyzing them based on parameter count and prediction consistency. Notably, we found that informing models about the possibility of missing sentence endings improved performance, highlighting the impact of explicitly considering certain linguistic features.

nan

Article 578

Title@2025-07-04 (5): Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections

Title: Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections

Implizite Belohnung als Brücke: Eine einheitliche Ansicht von SFT- und DPO-Verbindungen

隐含奖励为桥梁:SFT和DPO连接的统一观点 2507.00018v2

Authors (10): Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, Xipeng Qiu

Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. Through rigorous mathematical derivation, we demonstrate that both SFT and preference learning methods like Direct Preference Optimization (DPO) operate within the same optimal policy-reward subspace, with SFT representing a special case of implicit reward learning. Our analysis reveals a critical limitation in conventional SFT: the KL divergence term in distribution matching becomes constant with respect to the policy during optimization, failing to constrain model updates. To address this, we propose a simple yet effective learning rate reduction approach that yields significant performance improvements (up to \textbf{25\%} relative gain and \textbf{6\%} absolute win rate increase in instruction following tasks. Additionally, we derive alternative SFT objectives from various f-divergence functions that preserve the KL term during optimization, further enhancing post-DPO model performance. Finally, we extend the theoretical relationship between LLM logits and Q-functions from preference learning to the SFT context, providing mathematical derivations and experimental validation.

nan

Article 579

Title@2025-07-04 (5): WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia

Title: WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia

WETBench: Ein Benchmark für die Erkennung aufgabenspezifischer maschinengenerierter Texte auf Wikipedia

WETBENCH: 维基百科中检测特定任务机器生成文本的基准 2507.03373v1

Authors (4): Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl

Given Wikipedia’s role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors’ perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.

nan

Article 580

Title: SS-GEN: A Social Story Generation Framework with Large Language Models

SS-GEN: Ein Social Story Generation Framework mit großen Sprachmodellen

SS-GEN:具有大语言模式的社会历史传承框架 2406.15695v3

Authors (8): Yi Feng, Mingyang Song, Jiaqi Wang, Zhuang Chen, Guanqun Bi, Minlie Huang, Liping Jing, Jian Yu

Children with Autism Spectrum Disorder (ASD) often misunderstand social situations and struggle to participate in daily routines. Social Stories are traditionally crafted by psychology experts under strict constraints to address these challenges but are costly and limited in diversity. As Large Language Models (LLMs) advance, there’s an opportunity to develop more automated, affordable, and accessible methods to generate Social Stories in real-time with broad coverage. However, adapting LLMs to meet the unique and strict constraints of Social Stories is a challenging issue. To this end, we propose SS-GEN, a Social Story GENeration framework with LLMs. Firstly, we develop a constraint-driven sophisticated strategy named StarSow to hierarchically prompt LLMs to generate Social Stories at scale, followed by rigorous human filtering to build a high-quality dataset. Additionally, we introduce quality assessment criteria to evaluate the effectiveness of these generated stories. Considering that powerful closed-source large models require very complex instructions and expensive API fees, we finally fine-tune smaller language models with our curated high-quality dataset, achieving comparable results at lower costs and with simpler instruction and deployment. This work marks a significant step in leveraging AI to personalize Social Stories cost-effectively for autistic children at scale, which we hope can encourage future research on special groups.

nan

Article 581

Title@2025-07-04 (5): Backtesting Sentiment Signals for Trading: Evaluating the Viability of Alpha Generation from Sentiment Analysis

Title: Backtesting Sentiment Signals for Trading: Evaluating the Viability of Alpha Generation from Sentiment Analysis

Backtesting Sentiment Signals for Trading: Bewertung der Lebensfähigkeit der Alpha-Generation aus der Sentiment-Analyse

贸易的后测试感应信号:从感应分析中评估阿尔法一代的可行性 2507.03350v1

Authors (7): Elvys Linhares Pontes, Carlos-Emiliano González-Gallardo, Georgeta Bordea, José G. Moreno, Mohamed Ben Jannet, Yuxuan Zhao, Antoine Doucet

Sentiment analysis, widely used in product reviews, also impacts financial markets by influencing asset prices through microblogs and news articles. Despite research in sentiment-driven finance, many studies focus on sentence-level classification, overlooking its practical application in trading. This study bridges that gap by evaluating sentiment-based trading strategies for generating positive alpha. We conduct a backtesting analysis using sentiment predictions from three models (two classification and one regression) applied to news articles on Dow Jones 30 stocks, comparing them to the benchmark Buy&Hold strategy. Results show all models produced positive returns, with the regression model achieving the highest return of 50.63% over 28 months, outperforming the benchmark Buy&Hold strategy. This highlights the potential of sentiment in enhancing investment strategies and financial decision-making.

nan

Article 582

Title@2025-07-04 (5): Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Title: Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Disambiguation-Centric Finetuning macht Enterprise Tool-Calling LLMs realistischer und weniger riskant

差异性-中心微调使企业用工具召集LLM 更现实、风险更小 2507.03336v1

Authors (3): Ashutosh Hathidara, Julien Yu, Sebastian Schreiber

Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B - 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.

nan

Article 583

Title@2025-07-04 (5): Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

Title: Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

Hunyuan-TurboS: Große Sprachmodelle durch Mamba-Transformer-Synergie und adaptive Ketten-of-Thought

Hunyuan-TurboS:通过Mamba-Transed协同和适应性研究链推进大语言模型 2505.15431v3

Authors (255): Tencent Hunyuan Team, Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, Dian Jiao, Dong Du, Dong Wang, Feng Zhang, Fengzong Lian, Guanghui Xu, Guanwei Zhang, Hai Wang, Haipeng Luo, Han Hu, Huilin Xu, Jiajia Wu, Jianchen Zhu, Jianfeng Yan, Jiaqi Zhu, Jihong Zhang, Jinbao Xue, Jun Xia, Junqiang Zheng, Kai Liu, Kai Zhang, Kai Zheng, Kejiao Li, Keyao Wang, Lan Jiang, Lixin Liu, Lulu Wu, Mengyuan Huang, Peijie Yu, Peiqi Wang, Qian Wang, Qianbiao Xiang, Qibin Liu, Qingfeng Sun, Richard Guo, Ruobing Xie, Saiyong Yang, Shaohua Chen, Shihui Hu, Shuai Li, Shuaipeng Li, Shuang Chen, Suncong Zheng, Tao Yang, Tian Zhang, Tinghao Yu, Weidong Han, Weijie Liu, Weijin Zhou, Weikang Wang, Wesleye Chen, Xiao Feng, Xiaoqin Ren, Xingwu Sun, Xiong Kuang, Xuemeng Huang, Xun Cao, Yanfeng Chen, Yang Du, Zhen Yang, Yangyu Tao, Yaping Deng, Yi Shen, Yigeng Hong, Yiqi Chen, Yiqing Huang, Yuchi Deng, Yue Mao, Yulong Wang, Yuyuan Zeng, Zenan Xu, Zhanhui Kang, Zhe Zhao, ZhenXiang Yan, Zheng Fang, Zhichao Hu, Zhongzhi Chen, Zhuoyu Li, Zongwei Li, Alex Yan, Ande Liang, Baitong Liu, Beiping Pan, Bin Xing, Binghong Wu, Bingxin Qu, Bolin Ni, Boyu Wu, Chen Li, Cheng Jiang, Cheng Zhang, Chengjun Liu, Chengxu Yang, Chengzhong Xu, Chiyu Wang, Chong Zha, Daisy Yi, Di Wang, Fanyang Lu, Fei Chen, Feifei Liu, Feng Zheng, Guanghua Yu, Guiyang Li, Guohua Wang, Haisheng Lin, Han Liu, Han Wang, Hao Fei, Hao Lu, Haoqing Jiang, Haoran Sun, Haotian Zhu, Huangjin Dai, Huankui Chen, Huawen Feng, Huihui Cai, Huxin Peng, Jackson Lv, Jiacheng Shi, Jiahao Bu, Jianbo Li, Jianglu Hu, Jiangtao Guan, Jianing Xu, Jianwei Cai, Jiarong Zhang, Jiawei Song, Jie Jiang, Jie Liu, Jieneng Yang, Jihong Zhang, Jin lv, Jing Zhao, Jinjian Li, Jinxing Liu, Jun Zhao, Juntao Guo, Kai Wang, Kan Wu, Lei Fu, Lei He, Lei Wang, Li Liu, Liang Dong, Liya Zhan, Long Cheng, Long Xu, Mao Zheng, Meng Liu, Mengkang Hu, Nanli Chen, Peirui Chen, Peng He, Pengju Pan, Pengzhi Wei, Qi Yang, Qi Yi, Roberts Wang, Rongpeng Chen, Rui Sun, Rui Yang, Ruibin Chen, Ruixu Zhou, Shaofeng Zhang, Sheng Zhang, Shihao Xu, Shuaishuai Chang, Shulin Liu, SiQi Wang, Songjia Feng, Songling Yuan, Tao Zhang, Tianjiao Lang, Tongkai Li, Wei Deng, Wei Li, Weichao Wang, Weigang Zhang, Weixuan Sun, Wen Ouyang, Wenxiang Jiao, Wenzhi Sun, Wenzhuo Jia, Xiang Zhang, Xiangyu He, Xianshun Ren, XiaoYing Zhu, Xiaolong Guo, Xiaoxue Li, Xiaoyu Ma, Xican Lu, Xinhua Feng, Xinting Huang, Xinyu Guan, Xirui Li, Xu Zhang, Xudong Gao, Xun Luo, Xuxiang Qi, Yangkun Chen, Yangyu Tao, Yanling Xiao, Yantao Mai, Yanze Chen, Yao Ding, Yeting Yang, YiFan Song, Yifan Yang, Yijiao Zhu, Yinhe Wu, Yixian Liu, Yong Yang, Yuanjun Cai, Yuanlin Tu, Yue Zhang, Yufei Huang, Yuhang Zhou, Yuhao Jiang, Yuhong Liu, Yuhui Hu, Yujin Lin, Yun Yang, Yunhao Wang, Yusong Zhang, Zekun Wu, Zelong Zhang, Zhan Yu, Zhaoliang Yang, Zhe Zhao, Zheng Li, Zhenyu Huang, Zhiguang Liu, Zhijiang Xu, Zhiqing Kui, Zhiyin Zeng, Zhiyuan Xiong, Zhuo Han, Zifan Wu, Zigang Geng, Zilong Zhao, Ziyan Tang, Ziyuan Zhu, Zonglei Zhu, Zhijiang Xu

As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba’s long-sequence processing efficiency with Transformer’s superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep “thinking” modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.

nan

Article 584

Title@2025-07-04 (5): MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection

Title: MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection

MT2-CSD: Eine neue Datensatz- und Multi-Semantic Knowledge Fusion Methode zur konversatorischen Stance-Erkennung

MT2-CSD: 用于语音稳定探测的新数据集和多语层知识融合方法 2506.21053v2

Authors (7): Fuqiang Niu, Genan Dai, Yisha Lu, Jiayu Liao, Xiang Li, Hu Huang, Bowen Zhang

In the realm of contemporary social media, automatic stance detection is pivotal for opinion mining, as it synthesizes and examines user perspectives on contentious topics to uncover prevailing trends and sentiments. Traditional stance detection research often targets individual instances, thereby limiting its capacity to model multi-party discussions typical in real social media scenarios. This shortcoming largely stems from the scarcity of datasets that authentically capture the dynamics of social media interactions, hindering advancements in conversational stance detection. In this paper, we introduce MT2-CSD, a comprehensive dataset for multi-target, multi-turn conversational stance detection. To the best of our knowledge, MT2-CSD is the largest dataset available for this purpose, comprising 24,457 annotated instances and exhibiting the greatest conversational depth, thereby presenting new challenges for stance detection. To address these challenges, we propose the Large Language model enhanced Conversational Relational Attention Network (LLM-CRAN), which exploits the reasoning capabilities of LLMs to improve conversational understanding. We conduct extensive experiments to evaluate the efficacy of LLM-CRAN on the MT2-CSD dataset. The experimental results indicate that LLM-CRAN significantly outperforms strong baseline models in the task of conversational stance detection.

nan

Article 585

Title@2025-07-04 (5): Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs

Title: Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs

Lest ruhig, denkt laut: Entkoppelung von Verständnis und Vernunft in LLMs

静静地阅读,想阿卢德:解开迷惑和理性 2507.03327v1

Authors (2): Yuanxin Wang, Ganesh Venkatesh

Large Language Models (LLMs) have demonstrated remarkable proficiency in understanding text and generating high-quality responses. However, a critical distinction from human cognition is their typical lack of a distinct internal reading' or deliberation phase before speaking’ (i.e., generating text). Humans often engage in silent reading to comprehend context and formulate thoughts prior to articulation. This paper investigates methods to imbue LLMs with a similar capacity for internal processing. We introduce and evaluate techniques that encourage LLMs to read silently.' Our findings indicate that even a straightforward approach, such as providing the model with an initial contextual prompt or reading space’ before it begins predicting subsequent tokens for the final output, can yield significant performance improvements. We further enhance this concept by developing a `reading buddy’ architecture, where an auxiliary component silently processes the input and provides refined contextual insights to the primary generation model. These approaches aim to foster deeper understanding from LLMs so that they can produce better reasoned responses, moving them one step closer to more human-like text processing. Our results indicate that these simple techniques can provide surprisingly strong impact on accuracy with multiple point accuracy boost.

nan

Article 586

Title@2025-07-04 (5): OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Title: OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

OpenRLHF: Ein einfach zu bedienendes, skalierbares und leistungsstarkes RLHF-Framework

OpenRLHF: 易于使用、可缩放和高绩效RLHF框架 2405.11143v5

Authors (15): Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, Yiming Liu

Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values and further raise the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (long-CoT) tasks. However, existing RLHF (or RLVR) frameworks commonly face challenges such as inference bottlenecks and complexity barriers, restricting their accessibility for newcomers. To bridge this gap, we introduce OpenRLHF, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency with speedups ranging from 1.22x to 1.68x across different model sizes compared to state-of-the-art frameworks, while requiring significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.

nan

Article 587

Title@2025-07-04 (5): GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation

Title: GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation

GRAFT: Ein Graph-basiertes Flow-aware Agentic Framework für die maschinelle Übersetzung auf Dokumentebene

GRAFT: 以图表为基础的文件级机器翻译流动感动剂框架 2507.03311v1

Authors (5): Himanshu Dutta, Sunny Manchanda, Prakhar Bapat, Meva Ram Gurjar, Pushpak Bhattacharyya

Document level Machine Translation (DocMT) approaches often struggle with effectively capturing discourse level phenomena. Existing approaches rely on heuristic rules to segment documents into discourse units, which rarely align with the true discourse structure required for accurate translation. Otherwise, they fail to maintain consistency throughout the document during translation. To address these challenges, we propose Graph Augmented Agentic Framework for Document Level Translation (GRAFT), a novel graph based DocMT system that leverages Large Language Model (LLM) agents for document translation. Our approach integrates segmentation, directed acyclic graph (DAG) based dependency modelling, and discourse aware translation into a cohesive framework. Experiments conducted across eight translation directions and six diverse domains demonstrate that GRAFT achieves significant performance gains over state of the art DocMT systems. Specifically, GRAFT delivers an average improvement of 2.8 d BLEU on the TED test sets from IWSLT2017 over strong baselines and 2.3 d BLEU for domain specific translation from English to Chinese. Moreover, our analyses highlight the consistent ability of GRAFT to address discourse level phenomena, yielding coherent and contextually accurate translations.

nan

Article 588

Title@2025-07-04 (5): LTLCrit: A Temporal Logic-based LLM Critic for Safe and Efficient Embodied Agents

Title: LTLCrit: A Temporal Logic-based LLM Critic for Safe and Efficient Embodied Agents

LTLCrit: Ein auf Temporal Logic basierender LLM-Kritiker für sichere und effiziente einkörperige Wirkstoffe

LLLCritri: 一种基于时时逻辑的LLM LLM 用于安全和高效装药剂的Criticr 2507.03293v1

Authors (3): Anand Gokhale, Vaibhav Srivastava, Francesco Bullo

Large language models (LLMs) have demonstrated promise in reasoning tasks and general decision-making in static environments. In long-term planning tasks, however, errors tend to accumulate, often leading to unsafe or inefficient behavior, limiting their use in general-purpose settings. We propose a modular actor-critic architecture in which an LLM actor is guided by LTLCrit, a trajectory-level LLM critic that communicates via linear temporal logic (LTL). Our setup combines the reasoning strengths of language models with the guarantees of formal logic. The actor selects high-level actions from natural language observations, while the critic analyzes full trajectories and proposes new LTL constraints that shield the actor from future unsafe or inefficient behavior. The architecture supports both fixed, hand-specified safety constraints and adaptive, learned soft constraints that promote long-term efficiency. Our architecture is model-agnostic: any LLM-based planner can serve as the actor, and LTLCrit serves as a logic-generating wrapper. We formalize planning as graph traversal under symbolic constraints, allowing LTLCrit to analyze failed or suboptimal trajectories and generate new temporal logic rules that improve future behavior. We evaluate our system on the Minecraft diamond-mining benchmark, achieving 100% completion rates and improving efficiency compared to baseline LLM planners. Our results suggest that enabling LLMs to supervise each other through logic is a powerful and flexible paradigm for safe, generalizable decision making.

nan

Article 589

Title@2025-07-04 (5): NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

Title: NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

NTU Speechlab LLM-basiertes Mehrsprachiges ASR-System für MLC-SLM Challenge 2025

NTU Spearelab LLM-为2025年刚果解放运动-解运间对话挑战使用多种语言的ASR系统 2506.13339v2

Authors (8): Yizhou Peng, Bin Wang, Yi-Wen Chao, Ziyang Ma, Haoyang Zhang, Hexin Liu, Xie Chen, Eng Siong Chng

This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, language-specific prompts and model averaging techniques were instrumental in boosting system performance across diverse languages. Compared to the initial baseline system, our final model reduced the average Mix Error Rate from 20.2% to 10.6%, representing an absolute improvement of 9.6% (a relative improvement of 48%) on the evaluation set. Our results demonstrate the effectiveness of our approach and offer practical insights for future Speech Large Language Models.

nan

Article 590

Title@2025-07-04 (5): Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR

Title: Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR

Bi-direktionale Kontext-verbesserte Sprache Große Sprachmodelle für mehrsprachige Konversations-ASR

多语言对话的ASR双向双向背景强化语言大语言模型 2506.13396v2

Authors (3): Yizhou Peng, Hexin Liu, Eng Siong Chng

This paper introduces the integration of language-specific bi-directional context into a speech large language model (SLLM) to improve multilingual continuous conversational automatic speech recognition (ASR). We propose a character-level contextual masking strategy during training, which randomly removes portions of the context to enhance robustness and better emulate the flawed transcriptions that may occur during inference. For decoding, a two-stage pipeline is utilized: initial isolated segment decoding followed by context-aware re-decoding using neighboring hypotheses. Evaluated on the 1500-hour Multilingual Conversational Speech and Language Model (MLC-SLM) corpus covering eleven languages, our method achieves an 18% relative improvement compared to a strong baseline, outperforming even the model trained on 6000 hours of data for the MLC-SLM competition. These results underscore the significant benefit of incorporating contextual information in multilingual continuous conversational ASR.

nan

Article 591

Title@2025-07-04 (5): REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models

Title: REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models

REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle

REINFORCE++: 高效的RLHF对快速模型和奖励模型具有强力的测算法 2501.03262v5

Authors (3): Jian Hu, Jason Klein Liu, Wei Shen

Reinforcement Learning from Human Feedback (RLHF) is crucial in aligning large language models (LLMs) with human values and preferences. While state-of-the-art applications like ChatGPT/GPT-4 commonly employ Proximal Policy Optimization (PPO), including a critic network introduces significant computational overhead. REINFORCE-based methods, such as REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO), address this limitation by eliminating the critic network. However, these approaches face challenges in accurate advantage estimation. Specifically, they estimate advantages independently for responses to each prompt, which can lead to overfitting on more straightforward prompts and vulnerability to reward hacking. To address these challenges, we introduce REINFORCE++, a novel approach that removes the critic model while using the normalized reward of a batch as the baseline. Our empirical evaluation demonstrates that REINFORCE++ exhibits robust performance across various reward models without requiring prompt set truncation. Furthermore, it achieves superior generalization in RLHF and long chain-of-thought (CoT) settings compared to REINFORCE-based methods. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.

nan

Article 592

Title@2025-07-04 (5): MoralBench: Moral Evaluation of LLMs

Title: MoralBench: Moral Evaluation of LLMs

MoralBench: Moralische Bewertung von LLMs

道德:对LLMMs的道德评价 2406.04428v2

Authors (6): Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, Yongfeng Zhang

In the rapidly evolving field of artificial intelligence, large language models (LLMs) have emerged as powerful tools for a myriad of applications, from natural language processing to decision-making support systems. However, as these models become increasingly integrated into societal frameworks, the imperative to ensure they operate within ethical and moral boundaries has never been more critical. This paper introduces a novel benchmark designed to measure and compare the moral reasoning capabilities of LLMs. We present the first comprehensive dataset specifically curated to probe the moral dimensions of LLM outputs, addressing a wide range of ethical dilemmas and scenarios reflective of real-world complexities. The main contribution of this work lies in the development of benchmark datasets and metrics for assessing the moral identity of LLMs, which accounts for nuance, contextual sensitivity, and alignment with human ethical standards. Our methodology involves a multi-faceted approach, combining quantitative analysis with qualitative insights from ethics scholars to ensure a thorough evaluation of model performance. By applying our benchmark across several leading LLMs, we uncover significant variations in moral reasoning capabilities of different models. These findings highlight the importance of considering moral reasoning in the development and evaluation of LLMs, as well as the need for ongoing research to address the biases and limitations uncovered in our study. We publicly release the benchmark at https://drive.google.com/drive/u/0/folders/1k93YZJserYc2CkqP8d4B3M3sgd3kA8W7 and also open-source the code of the project at https://github.com/agiresearch/MoralBench.

nan

Article 593

Title@2025-07-04 (5): GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning

Title: GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning

GDGB: Ein Benchmark für generatives dynamisches Text-Attributed Graph Learning

GDGB: 产生动态动态文本图表学习的基准 2507.03267v1

Authors (6): Jie Peng, Jiarui Ji, Runlin Lei, Zhewei Wei, Yongchao Liu, Chuntao Hong

Dynamic Text-Attributed Graphs (DyTAGs), which intricately integrate structural, temporal, and textual attributes, are crucial for modeling complex real-world systems. However, most of the existing DyTAG datasets exhibit poor textual quality, which severely limits their utility for DyTAG generation tasks requiring semantically rich inputs. Additionally, prior work mainly focuses on discriminative tasks on DyTAGs, resulting in a lack of standardized task formulations and evaluation protocols tailored for DyTAG generation. To address these critical issues, we propose Generative DyTAG Benchmark (GDGB), which comprises eight meticulously curated DyTAG datasets with high-quality textual features for both nodes and edges, overcoming limitations of prior datasets. Building on GDGB, we define two novel DyTAG generation tasks: Transductive Dynamic Graph Generation (TDGG) and Inductive Dynamic Graph Generation (IDGG). TDGG transductively generates a target DyTAG based on the given source and destination node sets, while the more challenging IDGG introduces new node generation to inductively model the dynamic expansion of real-world graph data. To enable holistic evaluation, we design multifaceted metrics that assess the structural, temporal, and textual quality of the generated DyTAGs. We further propose GAG-General, an LLM-based multi-agent generative framework tailored for reproducible and robust benchmarking of DyTAG generation. Experimental results demonstrate that GDGB enables rigorous evaluation of TDGG and IDGG, with key insights revealing the critical interplay of structural and textual features in DyTAG generation. These findings establish GDGB as a foundational resource for advancing generative DyTAG research and unlocking further practical applications in DyTAG generation. GDGB datasets, source codes, and leaderboards are available at \href{https://gdgb-algo.github.io/}{here}.

nan

Article 594

Title@2025-07-04 (5): KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation

Title: KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation

KinyaColBERT: Ein Lexisch geerdetes Retrieval-Modell für Low-Resource Retrieval-Augmented Generation

KinyaColcolBERT: 低资源回收-提款一代的以地基为基数的回收模型 2507.03241v1

Authors (2): Antoine Nzeyimana, Andre Niyongabo Rubungo

The recent mainstream adoption of large language model (LLM) technology is enabling novel applications in the form of chatbots and virtual assistants across many domains. With the aim of grounding LLMs in trusted domains and avoiding the problem of hallucinations, retrieval-augmented generation (RAG) has emerged as a viable solution. In order to deploy sustainable RAG systems in low-resource settings, achieving high retrieval accuracy is not only a usability requirement but also a cost-saving strategy. Through empirical evaluations on a Kinyarwanda-language dataset, we find that the most limiting factors in achieving high retrieval accuracy are limited language coverage and inadequate sub-word tokenization in pre-trained language models. We propose a new retriever model, KinyaColBERT, which integrates two key concepts: late word-level interactions between queries and documents, and a morphology-based tokenization coupled with two-tier transformer encoding. This methodology results in lexically grounded contextual embeddings that are both fine-grained and self-contained. Our evaluation results indicate that KinyaColBERT outperforms strong baselines and leading commercial text embedding APIs on a Kinyarwanda agricultural retrieval benchmark. By adopting this retrieval strategy, we believe that practitioners in other low-resource settings can not only achieve reliable RAG systems but also deploy solutions that are more cost-effective.

nan

Article 595

Title@2025-07-04 (5): Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models

Title: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models

Neural Diskrete Token Repräsentanz Lernen für extreme Token Reduction in Video Große Sprachmodelle

在视频大语言模式中进行关于极端减少视频大语言模式的神经分辨代言人学习 2503.16980v4

Authors (2): Haichao Zhang, Yun Fu

Token-based video representation has emerged as a promising approach for enabling large language models (LLMs) to interpret video content. However, existing token reduction techniques, such as pruning and merging, often disrupt essential positional embeddings and rely on continuous visual tokens sampled from nearby pixels with similar spatial-temporal locations. By removing only a small fraction of tokens, these methods still produce relatively lengthy continuous sequences, which falls short of the extreme compression required to balance computational efficiency and token count in video LLMs. In this paper, we introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. We propose VQToken, a neural discrete token representation framework that (i) applies adaptive vector quantization to continuous ViT embeddings to learn a compact codebook and (ii) preserves spatial-temporal positions via a token hash function by assigning each grid-level token to its nearest codebook entry. On the Extreme Short Token Reduction task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark. It also achieves comparable performance on ActNet-QA, Long Video Bench, and VideoMME. We further introduce the Token Information Density (TokDense) metric and formalize fixed-length and adaptive-length subtasks, achieving state-of-the-art results in both settings. Our approach dramatically lowers theoretical complexity, increases information density, drastically reduces token counts, and enables efficient video LLMs in resource-constrained environments.

nan

Article 596

Title@2025-07-04 (5): A Lie-algebraic perspective on Tree-Adjoining Grammars

Title: A Lie-algebraic perspective on Tree-Adjoining Grammars

Eine lüge-algebraische Perspektive auf Baum-angrenzende Grammatiken

有关树树边语法的字眼视角 2507.03234v1

Authors (3): Isabella Senturia, Elizabeth Xiao, Matilde Marcolli

We provide a novel mathematical implementation of tree-adjoining grammars using two combinatorial definitions of graphs. With this lens, we demonstrate that the adjoining operation defines a pre-Lie operation and subsequently forms a Lie algebra. We demonstrate the utility of this perspective by showing how one of our mathematical formulations of TAG captures properties of the TAG system without needing to posit them as additional components of the system, such as null-adjoining constraints and feature TAG.

nan

Article 597

Title@2025-07-04 (5): AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs

Title: AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs

AgentPS: Agentische Prozessüberwachung für Content Moderation mit multimodalen LLMs

PS: MMMM 内容调控的代理程序监督 2412.15251v2

Authors (6): Mingchao Liu, Yu Sun, Ruixiao Sun, Xin Dong, Xiang Shen, Hongyu Xiong

The advanced processing and reasoning capabilities of multimodal large language models (MLLMs) have driven substantial progress in vision-language (VL) understanding tasks. However, while effective for tasks governed by straightforward logic, MLLMs often struggle with reasoning complex, detail-intensive logical structures. To address this limitation, we introduce AgentPS, a novel framework that integrates Agentic Process Supervision into MLLMs by sequentially reasoning over ancillary questions during fine-tuning. AgentPS achieves substantial improvements over baseline MLLMs on both public benchmarks and proprietary datasets. Notably, we show that using MLLM-generated ancillary labels in place of human annotations yields only minimal performance degradation, highlighting the method’s scalability. These results establish AgentPS as a scalable and effective solution for complex multimodal classification in large-scale industrial applications.

nan

Article 598

Title@2025-07-04 (5): 7B Fully Open Source Moxin-LLM/VLM – From Pretraining to GRPO-based Reinforcement Learning Enhancement

Title: 7B Fully Open Source Moxin-LLM/VLM – From Pretraining to GRPO-based Reinforcement Learning Enhancement

7B Vollständig offene Quelle Moxin-LLM/VLM – Vom Vortraining zur GRPO-basierten Verstärkungs-Lernverbesserung

7B 完全开放源代码 Moxin-LLM/VLM – – 从预培训到基于GROP的强化学习加强 2412.06845v6

Authors (19): Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Arash Akbari, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Weiyan Shi, Xingchen Xu, Yu Huang, Wei Jiang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang

Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA, have made great contributions to the ever-increasing popularity of LLMs due to the ease to customize and deploy the models across diverse applications. Although open-source LLMs present unprecedented opportunities for innovation and research, the commercialization of LLMs has raised concerns about transparency, reproducibility, and safety. Many open-source LLMs fail to meet fundamental transparency requirements by withholding essential components like training code and data, which may hinder further innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a fully open-source LLM developed, adhering to principles of open science, open source, open data, and open access. We release the pre-training code and configurations, training and fine-tuning datasets, and intermediate and final checkpoints, aiming to make continuous commitments to fully open-source LLMs. After pre-training the base model, we finetune the Moxin Base model with SOTA post-training framework and instruction data to obtain Moxin Instruct model. To improve the reasoning capability, we further finetune our Instruct model with chain-of-thought data distilled from DeepSeek R1, and then use Group Relative Policy Optimization (GRPO) following DeepSeek R1 to finetune our model, leading to the Moxin Reasoning model. Moreover, we develop our vision language model based on our Moxin model. Experiments show that our models achieve superior performance in various evaluations such as zero-shot evaluation, few-shot evaluation, and CoT evaluation.

nan

Article 599

Title@2025-07-03 (4): VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

Title: VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

VOCABTRIM: Vokabelabgleich für effizientes spekulatives Decodieren in LLMs

VOCABTRIM: 有效投机下限的词汇 2506.22694v2

Authors (12): Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee

In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporates language modeling head (LM head) during drafting process. A drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, followed by verification by a base LLM, a target model, accepting a subset as its valid generation. As it is usually considered that the speculative decoding requires one-to-one mapping between vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently contains an unnecessary inference overhead in drafting, especially for some target LLMs with very large vocabularies. Then, we propose a simple technique, VocabTrim, to mitigate the drafting overhead to improve the generation speed in memory-bound environment. VocabTrim reconstructs the drafter LM head to contain only a limited set of tokens, selected by the most frequently sampled from the vocabulary of the target model. While limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces the drafting latency in memory-bound process which is often the case on edge devices, resulting in higher memory-bound speed up (MBSU). We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.

nan

Article 600

Title@2025-07-03 (4): Towards Universal Semantics With Large Language Models

Title: Towards Universal Semantics With Large Language Models

Hin zu universeller Semantik mit großen Sprachmodellen

走向具有大语言模式的普遍语义 2505.11764v3

Authors (5): Raymond Baartmans, Matthew Raffel, Rahul Vikram, Aiden Deringer, Lizhong Chen

The Natural Semantic Metalanguage (NSM) is a linguistic theory based on a universal set of semantic primes: simple, primitive word-meanings that have been shown to exist in most, if not all, languages of the world. According to this framework, any word, regardless of complexity, can be paraphrased using these primes, revealing a clear and universally translatable meaning. These paraphrases, known as explications, can offer valuable applications for many natural language processing (NLP) tasks, but producing them has traditionally been a slow, manual process. In this work, we present the first study of using large language models (LLMs) to generate NSM explications. We introduce automatic evaluation methods, a tailored dataset for training and evaluation, and fine-tuned models for this task. Our 1B and 8B models outperform GPT-4o in producing accurate, cross-translatable explications, marking a significant step toward universal semantic representation with LLMs and opening up new possibilities for applications in semantic analysis, translation, and beyond. Our code is available at https://github.com/OSU-STARLAB/DeepNSM.

nan

Article 601

Title@2025-07-03 (4): How Much Content Do LLMs Generate That Induces Cognitive Bias in Users?

Title: How Much Content Do LLMs Generate That Induces Cognitive Bias in Users?

Wie viel Inhalt erzeugen LLMs, die Kognitive Bias bei Benutzern erzeugen?

LLMs如何产生使用者中诱导的认知性偏见的内容? 2507.03194v1

Authors (6): Abeer Alessa, Akshaya Lakshminarasimhan, Param Somane, Julian Skirzynski, Julian McAuley, Jessica Echterhoff

Large language models (LLMs) are increasingly integrated into applications ranging from review summarization to medical diagnosis support, where they affect human decisions. Even though LLMs perform well in many tasks, they may also inherit societal or cognitive biases, which can inadvertently transfer to humans. We investigate when and how LLMs expose users to biased content and quantify its severity. Specifically, we assess three LLM families in summarization and news fact-checking tasks, evaluating how much LLMs stay consistent with their context and/or hallucinate. Our findings show that LLMs expose users to content that changes the sentiment of the context in 21.86% of the cases, hallucinates on post-knowledge-cutoff data questions in 57.33% of the cases, and primacy bias in 5.94% of the cases. We evaluate 18 distinct mitigation methods across three LLM families and find that targeted interventions can be effective. Given the prevalent use of LLMs in high-stakes domains, such as healthcare or legal analysis, our results highlight the need for robust technical safeguards and for developing user-centered interventions that address LLM limitations.

nan

Article 602

Title@2025-07-03 (4): Adversarial Manipulation of Reasoning Models using Internal Representations

Title: Adversarial Manipulation of Reasoning Models using Internal Representations

Adversariale Manipulation von Vernunftmodellen mittels interner Repräsentationen

采用内部代表制对理性模型进行反向调整 2507.03167v1

Authors (3): Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi

Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply – termed the “caution” direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at https://github.com/ky295/reasoning-manipulation

nan

Article 603

Title@2025-07-03 (4): MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks

Title: MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks

MateInfoUB: Ein echter Benchmark für die Prüfung von LLMs in kompetitiven, mehrsprachigen und multimodalen Bildungsaufgaben

MateInfoUB:竞争、多种语文和多模式教育任务中测试LMLMS的现实世界基准 2507.03162v1

Authors (3): Dumitran Adrian Marius, Theodor-Pierre Moroianu, Buca Mihnea-Vicentiu

The rapid advancement of Large Language Models (LLMs) has transformed various domains, particularly computer science (CS) education. These models exhibit remarkable capabilities in code-related tasks and problem-solving, raising questions about their potential and limitations in advanced CS contexts. This study presents a novel bilingual (English-Romanian) multimodal (text and image) dataset of multiple-choice questions derived from a high-level computer science competition. A particularity of our dataset is that the problems are conceived such that some of them are easier solved using reasoning on paper, while for others writing code is more efficient. We systematically evaluate State of The Art LLMs on this dataset, analyzing their performance on theoretical programming tasks. Our findings reveal the strengths and limitations of current LLMs, including the influence of language choice (English vs. Romanian), providing insights into their applicability in CS education and competition settings. We also address critical ethical considerations surrounding educational integrity and the fairness of assessments in the context of LLM usage. These discussions aim to inform future educational practices and policies. To support further research, our dataset will be made publicly available in both English and Romanian. Additionally, we release an educational application tailored for Romanian students, enabling them to self-assess using the dataset in an interactive and practice-oriented environment.

nan

Article 604

Title@2025-07-03 (4): Expert-level validation of AI-generated medical text with scalable language models

Title: Expert-level validation of AI-generated medical text with scalable language models

Validierung von KI-generierten medizinischen Texten auf Expertenebene mit skalierbaren Sprachmodellen

专家一级对AI产生的带有可缩放语言模型的可缩放语言模型的医学文本进行鉴定 2507.03152v1

Authors (27): Asad Aali, Vasiliki Bikia, Maya Varma, Nicole Chiou, Sophie Ostmeier, Arnav Singhvi, Magdalini Paschali, Ashwin Kumar, Andrew Johnston, Karimar Amador-Martinez, Eduardo Juan Perez Guerrero, Paola Naovi Cruz Rivera, Sergios Gatidis, Christian Bluethgen, Eduardo Pontes Reis, Eddy D. Zandee van Rilland, Poonam Laxmappa Hosamani, Kevin R Keet, Minjoung Go, Evelyn Ling, David B. Larson, Curtis Langlotz, Roxana Daneshjou, Jason Hom, Sanmi Koyejo, Emily Alsentzer, Akshay S. Chaudhari

With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the “LM-as-judge” paradigm (a LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) codebase ( https://github.com/StanfordMIMI/MedVAL ), 2) MedVAL-Bench ( https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench ), and 3) MedVAL-4B ( https://huggingface.co/stanfordmimi/MedVAL-4B ), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.

nan

Article 605

Title@2025-07-03 (4): Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models

Title: Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models

Elemente des Weltwissens (EWoK): Ein von Kognition inspirierter Rahmen zur Bewertung des Grundwissens der Welt in Sprachmodellen

世界知识要素(EWoK): “ 评价基本世界语言模式知识的认知 – – 启发框架 “ 2405.09605v2

Authors (20): Anna A. Ivanova, Aalok Sathe, Benjamin Lipkin, Unnathi Kumar, Setayesh Radkani, Thomas H. Clark, Carina Kauf, Jennifer Hu, R. T. Pramod, Gabriel Grand, Vivian Paulun, Maria Ryskina, Ekin Akyürek, Ethan Wilcox, Nafisa Rashid, Leshem Choshen, Roger Levy, Evelina Fedorenko, Joshua Tenenbaum, Jacob Andreas

The ability to build and reason about models of the world is essential for situated language understanding. But evaluating world modeling capabilities in modern AI systems – especially those based on language models – has proven challenging, in large part because of the difficulty of disentangling conceptual knowledge about the world from knowledge of surface co-occurrence statistics. This paper presents Elements of World Knowledge (EWoK), a framework for evaluating language models’ understanding of the conceptual knowledge underlying world modeling. EWoK targets specific concepts from multiple knowledge domains known to be important for world modeling in humans, from social interactions (help, deceive) to spatial relations (left, right). Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWoK-core-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights large language models (1.3B–70B parameters) and compare them with human performance. All tested models perform worse than humans, with results varying drastically across domains. Performance on social interactions and social properties was highest and performance on physical relations and spatial relations was lowest. Overall, this dataset highlights simple cases where even large models struggle and presents rich avenues for targeted research on LLM world modeling capabilities.

nan

Article 606

Title@2025-07-03 (4): On the Relationship between Accent Strength and Articulatory Features

Title: On the Relationship between Accent Strength and Articulatory Features

Über die Beziehung zwischen Akcent Strength und Artikulatory Features

Accent 力量与交接特征之间的关系 2507.03149v1

Authors (6): Kevin Huang, Sean Foley, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan

This paper explores the relationship between accent strength and articulatory features inferred from acoustic speech. To quantify accent strength, we compare phonetic transcriptions with transcriptions based on dictionary-based references, computing phoneme-level difference as a measure of accent strength. The proposed framework leverages recent self-supervised learning articulatory inversion techniques to estimate articulatory features. Analyzing a corpus of read speech from American and British English speakers, this study examines correlations between derived articulatory parameters and accent strength proxies, associating systematic articulatory differences with indexed accent strength. Results indicate that tongue positioning patterns distinguish the two dialects, with notable differences inter-dialects in rhotic and low back vowels. These findings contribute to automated accent analysis and articulatory modeling for speech processing applications.

nan

Article 607

Title@2025-07-03 (4): DeepGesture: A conversational gesture synthesis system based on emotions and semantics

Title: DeepGesture: A conversational gesture synthesis system based on emotions and semantics

DeepGesture: Ein dialogisches Gesten-Synthesesystem basierend auf Emotionen und Semantik

DeepGesture:基于情感和语义的谈话手势合成系统 2507.03147v1

Authors (1): Thanh Hoang-Minh

Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals-text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. A lightweight Transformer backbone combines full self-attention and cross-local attention for effective feature fusion of heterogeneous modalities. To visualize results, we implement a full rendering pipeline in Unity based on BVH output from the model. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness, outperforming baselines on Mean Opinion Score and Frechet Gesture Distance metrics. Our system supports interpolation between emotional states and demonstrates generalization to out-of-distribution speech, including synthetic voices-marking a step forward toward fully multimodal, emotionally aware digital humans.

nan

Article 608

Title@2025-07-03 (4): From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models

Title: From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models

Von der Messung zur Milderung: Erforschung der Übertragbarkeit von Debiasing-Ansätzen auf Gender-Bias in maltesischen Sprachmodellen

从衡量到减缓:探讨马耳他语言模式中性别偏见方法对性别偏见的可转让性 2507.03142v1

Authors (2): Melanie Galea, Claudia Borg

The advancement of Large Language Models (LLMs) has transformed Natural Language Processing (NLP), enabling performance across diverse tasks with little task-specific training. However, LLMs remain susceptible to social biases, particularly reflecting harmful stereotypes from training data, which can disproportionately affect marginalised communities. We measure gender bias in Maltese LMs, arguing that such bias is harmful as it reinforces societal stereotypes and fails to account for gender diversity, which is especially problematic in gendered, low-resource languages. While bias evaluation and mitigation efforts have progressed for English-centric models, research on low-resourced and morphologically rich languages remains limited. This research investigates the transferability of debiasing methods to Maltese language models, focusing on BERTu and mBERTu, BERT-based monolingual and multilingual models respectively. Bias measurement and mitigation techniques from English are adapted to Maltese, using benchmarks such as CrowS-Pairs and SEAT, alongside debiasing methods Counterfactual Data Augmentation, Dropout Regularization, Auto-Debias, and GuiDebias. We also contribute to future work in the study of gender bias in Maltese by creating evaluation datasets. Our findings highlight the challenges of applying existing bias mitigation methods to linguistically complex languages, underscoring the need for more inclusive approaches in the development of multilingual NLP.

nan

Article 609

Title@2025-07-03 (4): ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

Title: ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

ZuverlässigMath: Benchmark der zuverlässigen mathematischen Vernunft bei großen Sprachmodellen

可靠马斯:关于大语言模式的可靠数学依据基准 2507.03133v1

Authors (10): Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong

Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining the reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning tasks have remained unexplored due to the dearth of unsolvable math problems. To systematically investigate LLM reliability in mathematical reasoning tasks, we formulate the reliability evaluation for both solvable and unsolvable problems. We then develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems synthesized by our proposed construction workflow with human evaluations. Experiments are conducted on various LLMs with several key findings uncovered. LLMs fail to directly identify unsolvable problems and always generate fabricated responses. When instructing LLMs to indicate unsolvability using a reliable prompt, the reliability of larger-sized LLMs remains on solvable problems, but notably improves on unsolvable problems yet still falls short of solvable problems. However, small LLMs rarely show any progress despite employing reliable prompts. Therefore, we further propose an alignment strategy to enhance small LLMs’ reliability, which can significantly improve LLM reliability performances on both in-domain and out-of-domain tasks.

nan

Article 610

Title@2025-07-03 (4): Towards a Psychoanalytic Perspective on VLM Behaviour: A First-step Interpretation with Intriguing Observations

Title: Towards a Psychoanalytic Perspective on VLM Behaviour: A First-step Interpretation with Intriguing Observations

Auf dem Weg zu einer psychoanalytischen Perspektive auf VLM-Verhalten: Eine erste Interpretation mit faszinierenden Beobachtungen

对VLM行为进行心理分析:带有引人注意的首步解释 2507.03123v1

Authors (5): Xiangrui Liu, Man Luo, Agneet Chatterjee, Hua Wei, Yezhou Yang

Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter means the models tend to generate incorrect answers to align with user expectations. However, these explanations primarily focus on technical or externally driven factors, may have neglected the possibility that hallucination behaviours might mirror cognitive biases observed in human psychology. In this work, we introduce a psychological taxonomy, categorizing VLMs’ hallucination behaviours, including sycophancy, logical inconsistency, and a newly identified VLMs behaviour: authority bias. To systematically analyze these behaviours, we design AIpsych, a scalable benchmark that reveals psychological tendencies in model response patterns. Leveraging this benchmark, we investigate how variations in model architecture and parameter size influence model behaviour when responding to strategically manipulated questions. Our experiments reveal that as model size increases, VLMs exhibit stronger sycophantic tendencies but reduced authority bias, suggesting increasing competence but a potential erosion of response integrity. A human subject study further validates our hypotheses and highlights key behavioural differences between VLMs and human respondents. This work suggests a new perspective for understanding hallucination in VLMs and highlights the importance of integrating psychological principles into model evaluation.The benchmark is available at https://github.com/lxrswdd/AIpsych.

nan

Article 611

Title@2025-07-03 (4): Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings

Title: Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings

Federated Learning für die ICD-Klassifizierung mit Leichtbaumodellen und vorgeschulten Einbettungen

具有轻重量型模型和未受过训练的嵌入型综合疾病分类的联邦学习 2507.03122v1

Authors (2): Binbin Xu, Gérard Dray

This study investigates the feasibility and performance of federated learning (FL) for multi-label ICD code classification using clinical notes from the MIMIC-IV dataset. Unlike previous approaches that rely on centralized training or fine-tuned large language models, we propose a lightweight and scalable pipeline combining frozen text embeddings with simple multilayer perceptron (MLP) classifiers. This design offers a privacy-preserving and deployment-efficient alternative for clinical NLP applications, particularly suited to distributed healthcare settings. Extensive experiments across both centralized and federated configurations were conducted, testing six publicly available embedding models from Massive Text Embedding Benchmark leaderboard and three MLP classifier architectures under two medical coding (ICD-9 and ICD-10). Additionally, ablation studies over ten random stratified splits assess performance stability. Results show that embedding quality substantially outweighs classifier complexity in determining predictive performance, and that federated learning can closely match centralized results in idealized conditions. While the models are orders of magnitude smaller than state-of-the-art architectures and achieved competitive micro and macro F1 scores, limitations remain including the lack of end-to-end training and the simplified FL assumptions. Nevertheless, this work demonstrates a viable way toward scalable, privacy-conscious medical coding systems and offers a step toward for future research into federated, domain-adaptive clinical AI.

nan

Article 612

Title@2025-07-03 (4): RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

Title: RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

RLVER: Verstärktes Lernen mit überprüfbaren emotionalen Belohnungen für einfühlsame Agenten

RLVER: 利用可核实情感奖赏加强学习,以培养致病剂 2507.03112v1

Authors (16): Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li

Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue-especially for emotional intelligence-remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM’s learning. Fine-tuning publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) Thinking and non-thinking models show distinct trends–thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) More challenging environments are not always better-moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.

nan

Article 613

Title@2025-07-03 (4): LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds

Title: LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds

LIAR: Inference Time Alignment (Best-of-N) in Sekunden an Jailbreak LLMs ausleihen

LIAR: 利用诱推时间对准(最佳N级)二秒内侵入监狱的LLMs 2412.05232v3

Authors (6): James Beetham, Souradip Chakraborty, Mengdi Wang, Furong Huang, Amrit Singh Bedi, Mubarak Shah

Jailbreak attacks expose vulnerabilities in safety-aligned LLMs by eliciting harmful outputs through carefully crafted prompts. Existing methods rely on discrete optimization or trained adversarial generators, but are slow, compute-intensive, and often impractical. We argue that these inefficiencies stem from a mischaracterization of the problem. Instead, we frame jailbreaks as inference-time misalignment and introduce LIAR (Leveraging Inference-time misAlignment to jailbReak), a fast, black-box, best-of-$N$ sampling attack requiring no training. LIAR matches state-of-the-art success rates while reducing perplexity by $10\times$ and Time-to-Attack from hours to seconds. We also introduce a theoretical “safety net against jailbreaks” metric to quantify safety alignment strength and derive suboptimality bounds. Our work offers a simple yet effective tool for evaluating LLM robustness and advancing alignment research.

nan

Article 614

Title@2025-07-03 (4): ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization

Title: ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization

ARF-RLHF: Adaptive Reward-Following für RLHF durch Emotionsgetriebene Selbst-Überwachung und trace-biased Dynamic Optimization

RAF-RLHF:通过情感驱动自视和跟踪比亚动态优化,对RLHF进行适应性奖励 2507.03069v1

Authors (1): YuXuan Zhang

With the rapid advancement of Reinforcement Learning from Human Feedback (RLHF) and autoregressive transformers, state-of-the-art models such as GPT-4.0, DeepSeek R1, and Llama 3.3 increasingly emphasize answer depth and personalization. However, most existing RLHF approaches (e.g., PPO, DPO) still rely on a binary-preference (BT) paradigm, which, while reducing annotation costs, still requires substantial human effort and captures only group-level tendencies rather than individual preferences. To overcome these limitations, we propose Adaptive Reward-Following (ARF), a self-assessment framework that leverages a high-precision emotion analyzer achieving over 70% accuracy on GoEmotions, Sentiment140, and DailyDialog to convert free-form user feedback into continuous preference scores. We further enrich and debias these signals through lightweight data augmentations, including synonym replacement, random trace truncation, and score bias annotation algorithm. A Dynamic Adapter Preference Tracker continuously models evolving user tastes in real time, enabling our novel Trace Bias (TB) fine-tuning algorithm to optimize directly on these tracked rewards instead of coarse binary labels. Experiments on Qwen-2/2.5, Gemma-2, and Llama-3.2 across four preference domains demonstrate that ARF achieves an improvement of 3.3% over PPO and 7.6% over DPO. Moreover, TB preserves theoretical alignment with PPO and DPO objectives. Overall, ARF presents a scalable, personalized, and cost-effective approach to RLHF LLMs through autonomous reward modeling.

nan

Article 615

Title@2025-07-03 (4): Requirements Elicitation Follow-Up Question Generation

Title: Requirements Elicitation Follow-Up Question Generation

Voraussetzungen Elicitation Follow-Up Question Generation

问询后查询 2507.02858v1

Authors (3): Yuchen Shen, Anmol Singhal, Travis Breaux

Interviews are a widely used technique in eliciting requirements to gather stakeholder needs, preferences, and expectations for a software system. Effective interviewing requires skilled interviewers to formulate appropriate interview questions in real time while facing multiple challenges, including lack of familiarity with the domain, excessive cognitive load, and information overload that hinders how humans process stakeholders’ speech. Recently, large language models (LLMs) have exhibited state-of-the-art performance in multiple natural language processing tasks, including text summarization and entailment. To support interviewers, we investigate the application of GPT-4o to generate follow-up interview questions during requirements elicitation by building on a framework of common interviewer mistake types. In addition, we describe methods to generate questions based on interviewee speech. We report a controlled experiment to evaluate LLM-generated and human-authored questions with minimal guidance, and a second controlled experiment to evaluate the LLM-generated questions when generation is guided by interviewer mistake types. Our findings demonstrate that, for both experiments, the LLM-generated questions are no worse than the human-authored questions with respect to clarity, relevancy, and informativeness. In addition, LLM-generated questions outperform human-authored questions when guided by common mistakes types. This highlights the potential of using LLMs to help interviewers improve the quality and ease of requirements elicitation interviews in real time.

nan

Article 616

Title@2025-07-03 (4): Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Title: Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Antwort Matching Outperforms Multiple Choice für Sprachmodell-Bewertung

语言模式评价的多种选择 2507.02856v1

Authors (5): Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping

Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model’s free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice–but, we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models–even small ones–achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers aligns poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.

nan

Article 617

Title@2025-07-03 (4): MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

Title: MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

MOTIF: Modulares Denken durch Verstärkung Feinabstimmung in LLMs

MOTIF:通过强化微调在LLM中进行模块思考 2507.02851v1

Authors (2): Purbesh Mitra, Sennur Ulukus

Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite amount of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose $\textbf{MOTIF: Modular Thinking via Reinforcement Finetuning}$ – an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We trained the open-source model Qwen2.5-3B-Instruct on GSM8K dataset via parameter efficient fine-tuning and tested its accuracy on MATH500 and AIME2024 benchmarks. Our experiments show 3.8\% and 3.3\% improvements over vanilla GRPO based training in the respective benchmarks. Furthermore, this improvement was achieved with only 15\% of samples, thus demonstrating sample efficiency of MOTIF. Our code and models are available at https://github.com/purbeshmitra/MOTIF and https://huggingface.co/purbeshmitra/MOTIF, respectively.

nan

Article 618

Title@2025-07-03 (4): Legal Requirements Translation from Law

Title: Legal Requirements Translation from Law

Rechtliche Voraussetzungen Übersetzung aus dem Recht

法律要求译自法律 2507.02846v1

Authors (2): Anmol Singhal, Travis Breaux

Software systems must comply with legal regulations, which is a resource-intensive task, particularly for small organizations and startups lacking dedicated legal expertise. Extracting metadata from regulations to elicit legal requirements for software is a critical step to ensure compliance. However, it is a cumbersome task due to the length and complex nature of legal text. Although prior work has pursued automated methods for extracting structural and semantic metadata from legal text, key limitations remain: they do not consider the interplay and interrelationships among attributes associated with these metadata types, and they rely on manual labeling or heuristic-driven machine learning, which does not generalize well to new documents. In this paper, we introduce an approach based on textual entailment and in-context learning for automatically generating a canonical representation of legal text, encodable and executable as Python code. Our representation is instantiated from a manually designed Python class structure that serves as a domain-specific metamodel, capturing both structural and semantic legal metadata and their interrelationships. This design choice reduces the need for large, manually labeled datasets and enhances applicability to unseen legislation. We evaluate our approach on 13 U.S. state data breach notification laws, demonstrating that our generated representations pass approximately 89.4% of test cases and achieve a precision and recall of 82.2 and 88.7, respectively.

nan

Article 619

Title@2025-07-03 (4): Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Title: Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Visual Contextual Attack: Jailbreaking MLLMs mit Image-Driven Context Injection

视觉上下文攻击:带有图像驱动背景注射的破狱MLLMs MLLMs 2507.02844v1

Authors (4): Ziqi Miao, Yi Ding, Lijun Li, Jing Shao

With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which performs a toxicity score of 2.48 and an ASR of 22.2%. The code is available at https://github.com/Dtc7w3PQ/Visco-Attack.

nan

Article 620

Title@2025-07-03 (4): Improved Unbiased Watermark for Large Language Models

Title: Improved Unbiased Watermark for Large Language Models

Verbessertes unvoreingenommenes Wasserzeichen für große Sprachmodelle

改进大语言模型的无偏见水印 2502.11268v2

Authors (4): Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang

As artificial intelligence surpasses human capabilities in text generation, the necessity to authenticate the origins of AI-generated content has become paramount. Unbiased watermarks offer a powerful solution by embedding statistical signals into language model-generated text without distorting the quality. In this paper, we introduce MCmark, a family of unbiased, Multi-Channel-based watermarks. MCmark works by partitioning the model’s vocabulary into segments and promoting token probabilities within a selected segment based on a watermark key. We demonstrate that MCmark not only preserves the original distribution of the language model but also offers significant improvements in detectability and robustness over existing unbiased watermarks. Our experiments with widely-used language models demonstrate an improvement in detectability of over 10% using MCmark, compared to existing state-of-the-art unbiased watermarks. This advancement underscores MCmark’s potential in enhancing the practical application of watermarking in AI-generated texts.

nan

Article 621

Title@2025-07-03 (4): StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason

Title: StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason

StepHint: Mehrstufige stufenweise Hinweise stärken das Lernen zur Vernunft

步进提示:多级分步骤将强化学习提升到合理 2507.02841v1

Authors (7): Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, Rui Yan

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving the complex reasoning abilities of large language models (LLMs). However, current RLVR methods face two significant challenges: the near-miss reward problem, where a small mistake can invalidate an otherwise correct reasoning process, greatly hindering training efficiency; and exploration stagnation, where models tend to focus on solutions within their comfort zone,'' lacking the motivation to explore potentially more effective alternatives. To address these challenges, we propose StepHint, a novel RLVR algorithm that utilizes multi-level stepwise hints to help models explore the solution space more effectively. StepHint generates valid reasoning chains from stronger models and partitions these chains into reasoning steps using our proposed adaptive partitioning method. The initial few steps are used as hints, and simultaneously, multiple-level hints (each comprising a different number of steps) are provided to the model. This approach directs the model's exploration toward a promising solution subspace while preserving its flexibility for independent exploration. By providing hints, StepHint mitigates the near-miss reward problem, thereby improving training efficiency. Additionally, the external reasoning pathways help the model develop better reasoning abilities, enabling it to move beyond itscomfort zone’’ and mitigate exploration stagnation. StepHint outperforms competitive RLVR enhancement methods across six mathematical benchmarks, while also demonstrating superior generalization and excelling over baselines on out-of-domain benchmarks.

nan

Article 622

Title@2025-07-03 (4): From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

Title: From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

Von der Web-Suche in Richtung Agentic Deep Research: Incentivizing Search with Reasoning Agents

从网络搜索到代理深层研究:激励使用理性代理进行搜索 2506.18959v3

Authors (23): Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu

Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.

nan

Article 623

Title@2025-07-03 (4): ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

Title: ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

ExPO: Entsperren harter Vernunft mit selbsterklärungsgeführtem Verstärkungslernen

ExPO: 以自我剥削指导强化学习来解锁困难理由 2507.02834v1

Authors (4): Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi

Recent advances in large language models have been driven by reinforcement learning (RL)-style post-training, which improves reasoning by optimizing model outputs based on reward or preference signals. GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier. However, these methods depend heavily on the model’s initial ability to produce positive samples. They primarily refine what the model already knows (distribution sharpening) rather than enabling the model to solve problems where it initially fails. This limitation is especially problematic in early-stage RL training and on challenging reasoning tasks, where positive samples are unlikely to be generated. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model’s likelihood of predicting the correct answer. Based on these insights, we propose $\textbf{Self-Explanation Policy Optimization (ExPO)}$-a simple and modular framework that generates such samples by conditioning on the ground-truth answer. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most.

nan

Article 624

Title@2025-07-03 (4): Generalizing Verifiable Instruction Following

Title: Generalizing Verifiable Instruction Following

Verallgemeinern der überprüfbaren Anleitung

普遍适用的可核实说明 2507.02833v1

Authors (8): Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi

A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions are output constraints like only answer with yes or no" ormention the word `abrakadabra’ at least 3 times” that the user adds to craft a more useful answer. Even today’s strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.

nan

Article 625

Title@2025-07-03 (4): SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

Title: SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

SynapseRoute: Ein Auto-Routen-Schaltrahmen für das Dual-State Large Language Model

SynapseRoute:关于两州大语言模式的自动运行切换框架 2507.02822v1

Authors (12): Wencheng Zhang, Shiqin Qiao, Lingjie Luo, Yinfeng Li, Chuanyang Zheng, Qian Xu, Meng Li, Yong Gui, Yijun He, Jianing Qiu, Jindong Hong, Jiankai Sun

With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between “thinking” (high reasoning) and “non-thinking” (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.

nan

Article 626

Title@2025-07-03 (4): Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case

Title: Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case

Große Sprachmodelle zur Automatisierung der klinischen Datenstandardisierung: HL7 FHIR Use Case

临床数据标准化自动化大语言模型:HL7 FHIR使用案例 2507.03067v1

Authors (3): Alvaro Riquelme, Pedro Costa, Catalina Martinez

For years, semantic interoperability standards have sought to streamline the exchange of clinical data, yet their deployment remains time-consuming, resource-intensive, and technically challenging. To address this, we introduce a semi-automated approach that leverages large language models specifically GPT-4o and Llama 3.2 405b to convert structured clinical datasets into HL7 FHIR format while assessing accuracy, reliability, and security. Applying our method to the MIMIC-IV database, we combined embedding techniques, clustering algorithms, and semantic retrieval to craft prompts that guide the models in mapping each tabular field to its corresponding FHIR resource. In an initial benchmark, resource identification achieved a perfect F1-score, with GPT-4o outperforming Llama 3.2 thanks to the inclusion of FHIR resource schemas within the prompt. Under real-world conditions, accuracy dipped slightly to 94 %, but refinements to the prompting strategy restored robust mappings. Error analysis revealed occasional hallucinations of non-existent attributes and mismatches in granularity, which more detailed prompts can mitigate. Overall, our study demonstrates the feasibility of context-aware, LLM-driven transformation of clinical data into HL7 FHIR, laying the groundwork for semi-automated interoperability workflows. Future work will focus on fine-tuning models with specialized medical corpora, extending support to additional standards such as HL7 CDA and OMOP, and developing an interactive interface to enable expert validation and iterative refinement.

nan

Article 627

Title@2025-07-03 (4): Identification of Potentially Misclassified Crash Narratives using Machine Learning (ML) and Deep Learning (DL)

Title: Identification of Potentially Misclassified Crash Narratives using Machine Learning (ML) and Deep Learning (DL)

Identifizierung potenziell falsch klassifizierter Crash Narratives mittels Machine Learning (ML) und Deep Learning (DL)

利用机器学习(ML)和深层学习(DL)查明可能误解的崩溃叙述(DL) 2507.03066v1

Authors (3): Sudesh Bhagat, Ibne Farabi Shihab, Jonathan Wood

This research investigates the efficacy of machine learning (ML) and deep learning (DL) methods in detecting misclassified intersection-related crashes in police-reported narratives. Using 2019 crash data from the Iowa Department of Transportation, we implemented and compared a comprehensive set of models, including Support Vector Machine (SVM), XGBoost, BERT Sentence Embeddings, BERT Word Embeddings, and Albert Model. Model performance was systematically validated against expert reviews of potentially misclassified narratives, providing a rigorous assessment of classification accuracy. Results demonstrated that while traditional ML methods exhibited superior overall performance compared to some DL approaches, the Albert Model achieved the highest agreement with expert classifications (73% with Expert 1) and original tabular data (58%). Statistical analysis revealed that the Albert Model maintained performance levels similar to inter-expert consistency rates, significantly outperforming other approaches, particularly on ambiguous narratives. This work addresses a critical gap in transportation safety research through multi-modal integration analysis, which achieved a 54.2% reduction in error rates by combining narrative text with structured crash data. We conclude that hybrid approaches combining automated classification with targeted expert review offer a practical methodology for improving crash data quality, with substantial implications for transportation safety management and policy development.

nan

Article 628

Title@2025-07-03 (4): Multimodal Mathematical Reasoning with Diverse Solving Perspective

Title: Multimodal Mathematical Reasoning with Diverse Solving Perspective

Multimodale mathematische Vernunft mit unterschiedlicher Lösungsperspektive

具有不同解决视角的多模式数学理由 2507.02804v1

Authors (6): Wenhao Shi, Zhiqiang Hu, Yi Bin, Yang Yang, See-Kiong Ng, Heng Tao Shen

Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. Extensive experiments on the MathVista’s minitest and Math-V benchmarks demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.

nan

Article 629

Title@2025-07-03 (4): Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Title: Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Ist Vernunft alles, was Sie brauchen? Probieren von Bias im Zeitalter der Vernunft Sprachmodelle

需要什么理由就需要什么理由吗? 2507.02799v1

Authors (4): Riccardo Cantini, Nicola Gabriele, Alessio Orsino, Domenico Talia

Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.

nan

Article 630

Title@2025-07-03 (4): From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

Title: From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

Von langen Videos zu Clips: Ein von Menschen inspiriertes Video-Editing-Framework mit multimodalem Narrative Understanding

从长视频到启动剪贴板:由人启发的视频编辑框架,包含多模式叙述理解 2507.02790v1

Authors (11): Xiangfeng Wang, Xiao Li, Yadong Wei, Xueyu Song, Yang Song, Xiaoqiang Xia, Fangrui Zeng, Zaiyi Chen, Liu Liu, Gu Xu, Tong Xu

The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.

nan

Article 631

Title@2025-07-03 (4): GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

Title: GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

GPAS: Beschleunigung der Konvergenz des LLM-Vortrainings durch Gradient-Preserving Activation Scaling

GPAS:通过 “ 渐进式保留动力扩增 “ 加速汇集LLM预备训练 2506.22049v2

Authors (15): Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Lu Yin, Can Yang

Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.

nan

Article 632

Title@2025-07-03 (4): Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation

Title: Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation

Verbesserung klinischer Multiple-Choice-Fragen Benchmarks mit Knowledge Graph Guided Distractor Generierung

加强具有知识图导引引产生体的临床多选择问题基准,加强临床多选择问题基准 2506.00612v3

Authors (5): Running Yang, Wenlong Deng, Minghui Chen, Yuyin Zhou, Xiaoxiao Li

Clinical tasks such as diagnosis and treatment require strong decision-making abilities, highlighting the importance of rigorous evaluation benchmarks to assess the reliability of large language models (LLMs). In this work, we introduce a knowledge-guided data augmentation framework that enhances the difficulty of clinical multiple-choice question (MCQ) datasets by generating distractors (i.e., incorrect choices that are similar to the correct one and may confuse existing LLMs). Using our KG-based pipeline, the generated choices are both clinically plausible and deliberately misleading. Our approach involves multi-step, semantically informed walks on a medical knowledge graph to identify distractor paths-associations that are medically relevant but factually incorrect-which then guide the LLM in crafting more deceptive distractors. We apply the designed knowledge graph guided distractor generation (KGGDG) pipline, to six widely used medical QA benchmarks and show that it consistently reduces the accuracy of state-of-the-art LLMs. These findings establish KGGDG as a powerful tool for enabling more robust and diagnostic evaluations of medical LLMs.

nan

Article 633

Title: Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

Selbstkorrektionsbank: Enthüllung und Adressierung des Selbstkorrektions-Blindflecks in LLMs

自我校正法官:在LLMs中披露和处理自我校正的盲人点 2507.02778v1

Authors (1): Ken Tsui

Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify error in user input, they exhibit a systematic ‘Self-Correction Blind Spot’ - failing to correct identical error in their own outputs. To systematically study this phenomenon, we introduce Self-Correction Bench, a systematic framework to measure this phenomenon through controlled error injection at three complexity levels. Testing 14 models, we find an average 64.5% blind spot rate. We find multiple evidences that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending “Wait” reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.

nan

Article 634

Title: DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

DeSTA2.5-Audio: Auf dem Weg zu einem General-Purpose Large Audio Language Model mit selbsterzeugter Cross-Modal Alignment

DeSTA2.5-Audio:努力建立具有自发跨模式一致的通用大型音频语言模型 2507.02768v1

Authors (28): Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches have often suffered from the catastrophic forgetting of the LLM’s original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM’s native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.

nan

Article 635

Title@2025-07-03 (4): Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression

Title: Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression

Batch-Max: Höherer LLM-Durchsatz mit größeren Batch-Größen und KV Cache-Kompression

批量最大:使用大批量大小和 KV缓存压缩的高级 LLM 输送量 2412.05693v3

Authors (3): Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh

Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used resulting in significantly higher throughput while still maintaining the original model’s accuracy.

nan

Article 636

Title@2025-07-03 (4): Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens

Title: Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens

Messung der Granularität des Vowel-Produktionsraumes durch einfach nur produzierbare unterschiedliche (JPD) Limens

仅用可制成差异(JPD)激光测量Vowel 生产空间的颗粒度 2507.02744v1

Authors (1): Peter Viechnicki

A body of work over the past several decades has demonstrated that the complex and coordinated articulatory movements of human vowel production are governed (at least in part)by control mechanisms whose targets are regions of auditory space. Within the target region control at the sub-phonemic level has also been demonstrated. But the degree of accuracy of that control is unknown. The current work investigates this question by asking how far apart must two vowel stimuli lie in auditory space in order to yield reliably different imitations? This distance is termed ‘Just Producible Difference’ (JPD). The current study uses a vowel mimicry paradigm to derive the first measurement of JPD among two sets of English speakers during front vowel production. JPD is estimated at between 14 and 51 mels in F1 X F2 space. This finding has implications for episodic theories of speech production. It also clarifies the possible structures of human vowel systems, by setting a theoretical lower bound for how close two vowel phonemes may be in a speaker’s formant space, and hence a psychophysical explanation of observed trends in number and patterns of possible vowel phonemes.

nan

Article 637

Title@2025-07-03 (4): Early Signs of Steganographic Capabilities in Frontier LLMs

Title: Early Signs of Steganographic Capabilities in Frontier LLMs

Frühe Anzeichen von Steganographischen Fähigkeiten in Frontier LLMs

边疆长长体动物能力早期信号 2507.02737v1

Authors (5): Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy, Roland S. Zimmermann, David Lindner

Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: Encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganography capabilities in frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models are unable to encode short messages in their outputs without a monitor noticing under standard affordances. They can succeed, however, if given additional affordances such as using an unmonitored scratchpad and coordinating on what encoding scheme to use. We additionally find early signs that models can perform basic encoded reasoning in a simple state-tracking problem. This includes some ability to reason with their own and pre-defined schemes, including encoding schemes such as Hexadecimal. Despite this, they can rarely hide reasoning subtly within a cover task to fool a monitor. Overall, our results indicate that current LLMs exhibit nascent steganographic capabilities. While these capabilities are likely insufficient to bypass well-designed monitors at present, this could change in the future.

nan

Article 638

Title@2025-07-03 (4): Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Title: Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Mind2Web 2: Agentische Suche mit Agent-as-a-Judge bewerten

Mind2Web 2: 与代理法官评估代理搜索 2506.21506v2

Authors (26): Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su

Agentic search such as Deep Research systems-where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers-represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.

nan

Article 639

Title@2025-07-03 (4): On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability

Title: On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability

Über Charakterisierungen für die Sprachgenerierung: Interplay von Halluzinationen, Breadth und Stabilität

语言生成特征:幻觉、面包和稳定之间的相互作用 2412.18530v2

Authors (3): Alkis Kalavasis, Anay Mehrotra, Grigoris Velegkas

We study language generation in the limit - introduced by Kleinberg and Mullainathan [KM24] - building on classical works of Gold [Gol67] and Angluin [Ang79]. [KM24]’s main result is an algorithm for generating from any countable language collection in the limit. While their algorithm eventually generates unseen strings from the target language $K$, it sacrifices coverage or breadth, i.e., its ability to generate a rich set of strings. Recent work introduces different notions of breadth and explores when generation with breadth is possible, leaving a full characterization of these notions open. Our first set of results settles this by characterizing generation for existing notions of breadth and their natural extensions. Interestingly, our lower bounds are very flexible and hold for many performance metrics beyond breadth - for instance, showing that, in general, it is impossible to train generators which achieve a higher perplexity or lower hallucination rate for $K$ compared to other languages. Next, we study language generation with breadth and stable generators - algorithms that eventually stop changing after seeing an arbitrary but finite number of strings - and prove unconditional lower bounds for such generators, strengthening the results of [KMV25] and demonstrating that generation with many existing notions of breadth becomes equally hard, when stability is required. This gives a separation for generation with approximate breadth, between stable and unstable generators, highlighting the rich interplay between breadth, stability, and consistency in language generation.

nan

Article 640

Title@2025-07-03 (4): Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation

Title: Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation

Next-Token-Vorhersage-Aufgabe setzt eine optimale Datenbestellung für LLM-Training in Proof Generation voraus

假定为校实生成的LLM培训提供最佳数据排序 2411.00863v2

Authors (11): Chenyang An, Shima Imani, Feng Yao, Chengyu Dong, Ali Abbasi, Harsh Shrivastava, Samuel Buss, Jingbo Shang, Gayathri Mahalingam, Pramod Sharma, Maurice Diesendruck

In the field of large language model (LLM)-based proof generation, despite extensive training on large datasets such as ArXiv, LLMs still exhibit only modest performance on proving tasks of moderate difficulty. We believe that this is partly due to the widespread presence of suboptimal ordering within the data for each proof used in training. For example, published proofs often follow a purely logical order, where each step logically proceeds from the previous steps based on the deductive rules. This order is designed to facilitate the verification of the proof’s soundness, rather than to help people and models learn the discovery process of the proof. In proof generation, we argue that the optimal order for one training data sample occurs when the relevant intermediate supervision for a particular proof step in the proof is always positioned to the left of that proof step. We call such order the intuitively sequential order. We validate our claims using two tasks: intuitionistic propositional logic theorem-proving and digit multiplication. Our experiments verify the order effect and provide support for our explanations. We demonstrate that training is most effective when the proof is in the intuitively sequential order. Moreover, the order effect and the performance gap between models trained on different data orders can be substantial – with an 11 percent improvement in proof success rate observed in the propositional logic theorem-proving task, between models trained on the optimal order compared to the worst order. Lastly, we define a common type of order issue in advanced math proofs and find that 17.3 percent of theorems with nontrivial proofs in the first two chapters of a widely used graduate-level mathematics textbook suffer from this issue. A detailed list of those proofs is provided in the appendix.

nan

Article 641

Title@2025-07-03 (4): Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

Title: Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

Können LLMs kritische Einschränkungen innerhalb der wissenschaftlichen Forschung identifizieren? Eine systematische Bewertung von KI-Forschungspapieren

LLMs能否查明科学研究中的关键限制? 对AI研究文件的系统评估 2507.02694v1

Authors (5): Zhijian Xu, Yilun Zhao, Manasi Patwardhan, Lovekesh Vig, Arman Cohan

Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, for studying limitations, we present LimitGen, the first comprehensive benchmark for evaluating LLMs’ capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding identifying limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.

nan

Article 642

Title@2025-07-03 (4): Exploring Gender Bias Beyond Occupational Titles

Title: Exploring Gender Bias Beyond Occupational Titles

Erforschen von Gender-Bias über Berufsbezeichnungen hinaus

探索职业职称之外的性别偏见 2507.02679v1

Authors (2): Ahmed Sabir, Rajesh Sharama

In this work, we investigate the correlation between gender and contextual biases, focusing on elements such as action verbs, object nouns, and particularly on occupations. We introduce a novel dataset, GenderLexicon, and a framework that can estimate contextual bias and its related gender bias. Our model can interpret the bias with a score and thus improve the explainability of gender bias. Also, our findings confirm the existence of gender biases beyond occupational stereotypes. To validate our approach and demonstrate its effectiveness, we conduct evaluations on five diverse datasets, including a Japanese dataset.

nan

Article 643

Title@2025-07-03 (4): Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning

Title: Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning

Code2Logic: Game-Code-getriebene Datensynthese zur Verbesserung von VLMs Allgemeine Begründung

代码2Llogic: 用于增强VLMs一般理由的游戏-代码-驱动数据合成 2505.13886v2

Authors (26): Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang

Visual-language Chain-of-Thought (CoT) data resources are relatively scarce compared to text-only counterparts, limiting the improvement of reasoning capabilities in Vision Language Models (VLMs). However, high-quality vision-language reasoning data is expensive and labor-intensive to annotate. To address this issue, we leverage a promising resource: game code, which naturally contains logical structures and state transition processes. Therefore, we propose Code2Logic, a novel game-code-driven approach for multimodal reasoning data synthesis. Our approach leverages Large Language Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning processes and results through code execution. Using the Code2Logic approach, we developed the GameQA dataset to train and evaluate VLMs. GameQA is cost-effective and scalable, offers controllable difficulty gradation and is diverse with 30 games and 158 tasks. Surprisingly, despite training solely on game data, VLMs demonstrated out of domain generalization, specifically Qwen2.5-VL-7B improving performance by 2.33% across 7 diverse vision-language benchmarks. Our code, dataset and models are available at https://github.com/tongjingqi/Code2Logic.

nan

Article 644

Title@2025-07-03 (4): ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning

Title: ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning

ASDA: Audio-Spektrogramm Differential-Achtungsmechanismus für selbstüberwachtes Repräsentationslernen

ASDA:自我监督代表制学习的听觉分光差异关注机制 2507.02666v1

Authors (5): Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model’s discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA’s effectiveness in audio tasks, paving the way for broader applications.

nan

Article 645

Title: An Adaptive Supervised Contrastive Learning Framework for Implicit Sexism Detection in Digital Social Networks

Ein adaptives, überwachtes, kontrastives Lernkonzept für implizite Sexismus-Erkennung in digitalen sozialen Netzwerken

在数字社会网络中发现隐含性别主义的适应性监督差异学习框架 2507.05271v1

Authors (3): Mohammad Zia Ur Rehman, Aditya Shah, Nagendra Kumar

The global reach of social media has amplified the spread of hateful content, including implicit sexism, which is often overlooked by conventional detection methods. In this work, we introduce an Adaptive Supervised Contrastive lEarning framework for implicit sexism detectioN (ASCEND). A key innovation of our method is the incorporation of threshold-based contrastive learning: by computing cosine similarities between embeddings, we selectively treat only those sample pairs as positive if their similarity exceeds a learnable threshold. This mechanism refines the embedding space by robustly pulling together representations of semantically similar texts while pushing apart dissimilar ones, thus reducing false positives and negatives. The final classification is achieved by jointly optimizing a contrastive loss with a cross-entropy loss. Textual features are enhanced through a word-level attention module. Additionally, we employ sentiment, emotion, and toxicity features. Evaluations on the EXIST2021 and MLSC datasets demonstrate that ASCEND significantly outperforms existing methods, with average Macro F1 improvements of 9.86%, 29.63%, and 32.51% across multiple tasks, highlighting its efficacy in capturing the subtle cues of implicit sexist language.

nan

Article 646

Title@2025-07-03 (4): OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

Title: OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

OmniDraft: Ein Cross-Vocabulary, Online Adaptive Drafter für die gerätespezifische Dekodierung

总括草案:跨词汇、在线在线可调适性套用投机下限设计图纸 2507.02659v1

Authors (7): Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang

Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the \textit{``one drafter for all’’} paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.

nan

Article 647

Title@2025-07-03 (4): Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search

Title: Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search

Entkoppelte Planung und Ausführung: Ein Hierarchisches Reasoning-Framework für tiefe Suche

分解的规划和执行:深海搜索的等级理据框架 2507.02652v1

Authors (8): Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yang Zhao, Hongjin Qian, Zhicheng Dou

Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: they use a single model to handle both high-level planning and detailed execution, leading to inefficient reasoning and limited scalability. In this paper, we introduce HiRA, a hierarchical framework that separates strategic planning from specialized execution. Our approach decomposes complex search tasks into focused subtasks, assigns each subtask to domain-specific agents equipped with external tools and reasoning capabilities, and coordinates the results through a structured integration mechanism. This separation prevents execution details from disrupting high-level reasoning while enabling the system to leverage specialized expertise for different types of information processing. Experiments on four complex, cross-modal deep search benchmarks demonstrate that HiRA significantly outperforms state-of-the-art RAG and agent-based systems. Our results show improvements in both answer quality and system efficiency, highlighting the effectiveness of decoupled planning and execution for multi-step information seeking tasks. Our code is available at https://github.com/ignorejjj/HiRA.

nan

Article 648

Title@2025-07-03 (4): Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory

Title: Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory

Strategische Intelligenz in großen Sprachmodellen: Beweise aus der evolutionären Spieltheorie

大语言模型战略情报:进化游戏理论的证据 2507.02618v1

Authors (2): Kenneth Payne, Baptiste Alloui-Cros

Are Large Language Models (LLMs) a new form of strategic intelligence, able to reason about goals in competitive settings? We present compelling supporting evidence. The Iterated Prisoner’s Dilemma (IPD) has long served as a model for studying decision-making. We conduct the first ever series of evolutionary IPD tournaments, pitting canonical strategies (e.g., Tit-for-Tat, Grim Trigger) against agents from the leading frontier AI companies OpenAI, Google, and Anthropic. By varying the termination probability in each tournament (the “shadow of the future”), we introduce complexity and chance, confounding memorisation. Our results show that LLMs are highly competitive, consistently surviving and sometimes even proliferating in these complex ecosystems. Furthermore, they exhibit distinctive and persistent “strategic fingerprints”: Google’s Gemini models proved strategically ruthless, exploiting cooperative opponents and retaliating against defectors, while OpenAI’s models remained highly cooperative, a trait that proved catastrophic in hostile environments. Anthropic’s Claude emerged as the most forgiving reciprocator, showing remarkable willingness to restore cooperation even after being exploited or successfully defecting. Analysis of nearly 32,000 prose rationales provided by the models reveals that they actively reason about both the time horizon and their opponent’s likely strategy, and we demonstrate that this reasoning is instrumental to their decisions. This work connects classic game theory with machine psychology, offering a rich and granular view of algorithmic decision-making under uncertainty.

nan

Article 649

Title@2025-07-03 (4): Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Title: Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur

以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测 2506.08713v2

Authors (2): Fariz Ikhwantri, Dusica Marijan

Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.

nan

Article 650

Title@2025-07-03 (4): Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs

Title: Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs

Symbolisch oder numerisch? Physik-Probleme verstehen, die LLMs aufklären

理解在理赔中解决物理问题 2507.01334v2

Authors (3): Nifu Dan, Yujun Cai, Yiwei Wang

Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Our comprehensive experimental evaluation reveals the remarkable capabilities of reasoning models. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize on symbolic derivation. Furthermore, our findings indicate that even for these highly sophisticated reasoning models, the strategic incorporation of few-shot prompting can still yield measurable improvements in overall accuracy, highlighting the potential for continued performance gains.

nan

Article 651

Title@2025-07-03 (4): MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion

Title: MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion

MPF: Sprachmodelle nach der Bereitstellung über Multi Perspective Fusion ausrichten und abgrenzen

MPF:通过多视角融合进行部署后调整和取消对语言模式的偏见 2507.02595v1

Authors (7): Xin Guan, PeiHsin Lin, Zekun Wu, Ze Wang, Ruibo Zhang, Emre Kazim, Adriano Koshiyama

Multiperspective Fusion (MPF) is a novel posttraining alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines. By decomposing baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased for Top Univeristy), resulting in small KL divergence, reduction of calibration error and generalization to unseen questions. This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or finetuning.

nan

Article 652

Title@2025-07-03 (4): MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration

Title: MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration

MedAide: Informationsfusion und Anatomie von medizinischen Intents über LLM-basierte Agent Collaboration

医学辅助:通过以LLM为基地的合作公司代理进行医疗成瘾者的信息汇集和解剖 2410.12532v3

Authors (11): Dingkang Yang, Jinjie Wei, Mingcheng Li, Jiyao Liu, Lihao Liu, Ming Hu, Junjun He, Yakun Ju, Wei Zhou, Yang Liu, Lihua Zhang

In healthcare intelligence, the ability to fuse heterogeneous, multi-intent information from diverse clinical sources is fundamental to building reliable decision-making systems. Large Language Model (LLM)-driven information interaction systems currently showing potential promise in the healthcare domain. Nevertheless, they often suffer from information redundancy and coupling when dealing with complex medical intents, leading to severe hallucinations and performance bottlenecks. To this end, we propose MedAide, an LLM-based medical multi-agent collaboration framework designed to enable intent-aware information fusion and coordinated reasoning across specialized healthcare domains. Specifically, we introduce a regularization-guided module that combines syntactic constraints with retrieval augmented generation to decompose complex queries into structured representations, facilitating fine-grained clinical information fusion and intent resolution. Additionally, a dynamic intent prototype matching module is proposed to utilize dynamic prototype representation with a semantic similarity matching mechanism to achieve adaptive recognition and updating of the agent’s intent in multi-round healthcare dialogues. Ultimately, we design a rotation agent collaboration mechanism that introduces dynamic role rotation and decision-level information fusion across specialized medical agents. Extensive experiments are conducted on four medical benchmarks with composite intents. Experimental results from automated metrics and expert doctor evaluations show that MedAide outperforms current LLMs and improves their medical proficiency and strategic reasoning.

nan

Article 653

Title@2025-07-03 (4): Revisiting Active Learning under (Human) Label Variation

Title: Revisiting Active Learning under (Human) Label Variation

Aktives Lernen unter (menschlichen) Label-Varianten

在(人)标签标签变换下重新审查积极学习 2507.02593v1

Authors (6): Cornelia Gruber, Helen Alber, Bernd Bischl, Göran Kauermann, Barbara Plank, Matthias Aßenmacher

Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed – or neglected – these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLM) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.

nan

Article 654

Title@2025-07-03 (4): WebSailor: Navigating Super-human Reasoning for Web Agent

Title: WebSailor: Navigating Super-human Reasoning for Web Agent

WebSailor: Navigieren Super-Mensch Vernunft für Web Agent

Web 服务员: 为 Web 代理导航超人理由 2507.02592v1

Authors (19): Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou

Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all opensource agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap.

nan

Article 655

Title@2025-07-03 (4): AI Flow: Perspectives, Scenarios, and Approaches

Title: AI Flow: Perspectives, Scenarios, and Approaches

AI Flow: Perspektiven, Szenarien und Ansätze

AI 流动:观点、设想和方法 2506.12479v2

Authors (14): Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li

Pioneered by the foundational information theory by Claude Shannon and the visionary framework of machine intelligence by Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, device-edge-cloud framework serves as the foundation, which integrates end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.

nan

Article 656

Title@2025-07-03 (4): Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization

Title: Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization

Verbesserung der LLM-Gründung für die Erkennung von Schwachstellen über die Optimierung der Gruppenrelative Policy

通过群体相对政策优化改进通过群体相对政策优化查明脆弱性的改进LLM 改进LLM理由 2507.03051v1

Authors (4): Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino

Improving and understanding the training dynamics and reasoning of Large Language Models (LLMs) has become essential for their deployment in AI-based security tools, such as software vulnerability detection. In this work, we present an extensive study aimed at advancing recent RL-based finetuning techniques for LLMs in the context of vulnerability detection. We start by highlighting key limitations of commonly adopted LLMs, such as their tendency to over-predict certain types of vulnerabilities while failing to detect others. To address this challenge, we explore the use of Group Relative Policy Optimization (GRPO), a recent policy-gradient method, for guiding LLM behavior through structured, rule-based rewards. We enable its application to the vulnerability detection task by redefining its advantage functions and reward signals using annotations from widely used datasets in the field, including BigVul, DiverseVul, and CleanVul. The proposed methodology enables an extensive set of experiments, addressing multiple research questions regarding the impact of GRPO on generalization, reasoning capabilities, and performance improvements over standard supervised finetuning (SFT). Our findings offer valuable insights into the potential of RL-based training to enhance both the performance and reasoning abilities of LLMs in the context of software vulnerability detection.

nan

Article 657

Title@2025-07-03 (4): Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Title: Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs

跨语言旅行:多模式LLM中跨语言一致基准 2505.15075v2

Authors (5): Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara

The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.

nan

Article 658

Title@2025-07-03 (4): Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning

Title: Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning

Selbstgesteuerte Prozess-Reward-Optimierung mit neu definiertem Schrittweiser Vorteil für Prozess-Verstärkungs-Lernen

自指导流程向上优化,具有重新定义的逐步改进的流程强化学习优势 2507.01551v2

Authors (8): Wu Fei, Hao Kong, Shuxian Liang, Yang Lin, Yibo Yang, Jing Tang, Lei Chen, Xiansheng Hua

Process Reinforcement Learning~(PRL) has demonstrated considerable potential in enhancing the reasoning capabilities of Large Language Models~(LLMs). However, introducing additional process reward models incurs substantial computational overhead, and there is no unified theoretical framework for process-level advantage estimation. To bridge this gap, we propose \textbf{S}elf-Guided \textbf{P}rocess \textbf{R}eward \textbf{O}ptimization~(\textbf{SPRO}), a novel framework that enables process-aware RL through two key innovations: (1) we first theoretically demonstrate that process rewards can be derived intrinsically from the policy model itself, and (2) we introduce well-defined cumulative process rewards and \textbf{M}asked \textbf{S}tep \textbf{A}dvantage (\textbf{MSA}), which facilitates rigorous step-wise action advantage estimation within shared-prompt sampling groups. Our experimental results demonstrate that SPRO outperforms vaniila GRPO with 3.4x higher training efficiency and a 17.5\% test accuracy improvement. Furthermore, SPRO maintains a stable and elevated policy entropy throughout training while reducing the average response length by approximately $1/3$, evidencing sufficient exploration and prevention of reward hacking. Notably, SPRO incurs no additional computational overhead compared to outcome-supervised RL methods such as GRPO, which benefit industrial implementation.

nan

Article 659

Title@2025-07-03 (4): IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders

Title: IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders

IndianBailJudgments-1200: Ein Multi-Attribut-Datensatz für legale NLP auf indischen Bail-Aufträgen

IndianBailJail Judgments-12000:印度保释令法律国家保护程序多属性数据集 2507.02506v1

Authors (2): Sneha Deshmukh, Prathmesh Kamble

Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. We introduce IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions, annotated across 20+ attributes including bail outcome, IPC sections, crime type, and legal reasoning. Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. This resource supports a wide range of legal NLP tasks such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused specifically on Indian bail jurisprudence.

nan

Article 660

Title@2025-07-03 (4): Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation

Title: Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation

Counterfactual Tuning für die zeitliche Sensitivitätssteigerung in der auf Großsprachenmodellen basierenden Empfehlung

以基于大语言的示范建议为基础的提高时时敏感度反事实推算 2507.03047v1

Authors (4): Yutian Liu, Zhengyi Yang, Jiancan Wu, Xiang Wang

Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users’ historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately. To address this critical gap, we propose Counterfactual Enhanced Temporal Framework for LLM-Based Recommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. By conceptualizing temporal order as an independent causal factor distinct from item content, we can quantify its unique contribution through counterfactual reasoning–comparing what recommendations would be made with and without temporal information while keeping all other factors constant. This causal framing enables CETRec to design a novel counterfactual tuning objective that directly optimizes the model’s temporal sensitivity, teaching LLMs to recognize both absolute timestamps and relative ordering patterns in user histories. Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs’ awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items).

nan

Article 661

Title@2025-07-03 (4): Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack

Title: Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack

Robustheit von Fehlinformations-Klassifikationssystemen zu Adversarial-Beispielen durch BeamAttack

通过“BeamAttack”进行错误信息分类系统对反向实例的强力 2506.23661v2

Authors (4): Arnisa Fazla, Lucas Krauter, David Guzman Piedrahita, Andrianos Michail

We extend BeamAttack, an adversarial attack algorithm designed to evaluate the robustness of text classification systems through word-level modifications guided by beam search. Our extensions include support for word deletions and the option to skip substitutions, enabling the discovery of minimal modifications that alter model predictions. We also integrate LIME to better prioritize word replacements. Evaluated across multiple datasets and victim models (BiLSTM, BERT, and adversarially trained RoBERTa) within the BODEGA framework, our approach achieves over a 99\% attack success rate while preserving the semantic and lexical similarity of the original texts. Through both quantitative and qualitative analysis, we highlight BeamAttack’s effectiveness and its limitations. Our implementation is available at https://github.com/LucK1Y/BeamAttack

nan

Article 662

Title@2025-07-03 (4): Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer

Title: Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer

Task Prompt Vektoren: Effektive Initialisierung durch Multi-Task Soft-Prompt Transfer

任务提示矢量 : 通过多任务软性即时传输实现有效的初始化 2408.01119v3

Authors (4): Robert Belanec, Simon Ostermann, Ivan Srba, Maria Bielikova

Prompt tuning is an efficient solution for training large language models (LLMs). However, current soft-prompt-based methods often sacrifice multi-task modularity, requiring the training process to be fully or partially repeated for each newly added task. While recent work on task vectors applied arithmetic operations on full model weights to achieve the desired multi-task performance, a similar approach for soft-prompts is still missing. To this end, we introduce Task Prompt Vectors, created by element-wise difference between weights of tuned soft-prompts and their random initialization. Experimental results on 12 NLU datasets show that task prompt vectors can be used in low-resource settings to effectively initialize prompt tuning on similar tasks. In addition, we show that task prompt vectors are independent of the random initialization of prompt tuning on 2 different language model architectures. This allows prompt arithmetics with the pre-trained vectors from different tasks. In this way, we provide a competitive alternative to state-of-the-art baselines by arithmetic addition of task prompt vectors from multiple tasks.

nan

Article 663

Title@2025-07-03 (4): Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants

Title: Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants

Hanzi als Narrative Bridges herstellen: Ein KI-Co-Creation-Workshop für ältere Migranten

将Hanzi编成叙述性桥梁:大赦国际为老年移民举办的共同创造讲习班 2507.01548v2

Authors (4): Wen Zhan, Ziqun Hua, Peiyue Lin, Yunfei Chen

This paper explores how older adults, particularly aging migrants in urban China, can engage AI-assisted co-creation to express personal narratives that are often fragmented, underrepresented, or difficult to verbalize. Through a pilot workshop combining oral storytelling and the symbolic reconstruction of Hanzi, participants shared memories of migration and recreated new character forms using Xiaozhuan glyphs, suggested by the Large Language Model (LLM), together with physical materials. Supported by human facilitation and a soft AI presence, participants transformed lived experience into visual and tactile expressions without requiring digital literacy. This approach offers new perspectives on human-AI collaboration and aging by repositioning AI not as a content producer but as a supportive mechanism, and by supporting narrative agency within sociotechnical systems.

nan

Article 664

Title@2025-07-03 (4): A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages

Title: A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages

Ein Kochbuch für die gemeinschaftsorientierte Datenerfassung von schwachen Sprachkenntnissen in LowResource-Sprachen

社区驱动的低成本低资源语言有缺陷演讲数据收集手册 2507.02428v1

Authors (10): Sumaya Ahmed Salihs, Isaac Wiafe, Jamal-Deen Abdulai, Elikem Doe Atsakpo, Gifty Ayoka, Richard Cave, Akon Obu Ekpezu, Catherine Holloway, Katrin Tomanek, Fiifi Baffoe Payin Winful

This study presents an approach for collecting speech samples to build Automatic Speech Recognition (ASR) models for impaired speech, particularly, low-resource languages. It aims to democratize ASR technology and data collection by developing a “cookbook” of best practices and training for community-driven data collection and ASR model building. As a proof-of-concept, this study curated the first open-source dataset of impaired speech in Akan: a widely spoken indigenous language in Ghana. The study involved participants from diverse backgrounds with speech impairments. The resulting dataset, along with the cookbook and open-source tools, are publicly available to enable researchers and practitioners to create inclusive ASR technologies tailored to the unique needs of speech impaired individuals. In addition, this study presents the initial results of fine-tuning open-source ASR models to better recognize impaired speech in Akan.

nan

Article 665

Title@2025-07-03 (4): Delving into LLM-assisted writing in biomedical publications through excess vocabulary

Title: Delving into LLM-assisted writing in biomedical publications through excess vocabulary

Eintauchen in LLM-unterstütztes Schreiben in biomedizinischen Publikationen durch überschüssiges Vokabular

通过超量词汇,在生物医学出版物中进行LLM协助撰写 2406.07016v5

Authors (4): Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause

Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists use them for their scholarly writing. But how wide-spread is such LLM usage in the academic literature? To answer this question for the field of biomedical research, we present an unbiased, large-scale approach: we study vocabulary changes in over 15 million biomedical abstracts from 2010–2024 indexed by PubMed, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 13.5% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, reaching 40% for some subcorpora. We show that LLMs have had an unprecedented impact on scientific writing in biomedical research, surpassing the effect of major world events such as the Covid pandemic.

nan

Article 666

Title@2025-07-03 (4): K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

Title: K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

K-Funktion: Gemeinsame Aussprache Transkription und Feedback zur Bewertung der Kindersprache-Funktion

K-功能:用于评估儿童语言功能的联合读音记录和反馈 2507.03043v1

Authors (17): Shuhe Li, Chenxu Guo, Jiachen Lian, Cheol Jun Cho, Wenshuo Zhao, Xuanru Zhou, Dingkun Zhou, Sam Wang, Grace Wang, Jingze Yang, Jingyi Xu, Ruohan Bao, Elise Brenner, Brandon In, Francesca Pei, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli

Early evaluation of children’s language is frustrated by the high pitch, long phones, and sparse data that derail automatic speech recognisers. We introduce K-Function, a unified framework that combines accurate sub-word transcription, objective scoring, and actionable feedback. Its core, Kids-WFST, merges a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specific errors while remaining fully interpretable. Kids-WFST attains 1.39% phoneme error on MyST and 8.61% on Multitudes–absolute gains of 10.47 and 7.06 points over a greedy-search decoder. These high-fidelity transcripts power an LLM that grades verbal skills, milestones, reading, and comprehension, aligning with human proctors and supplying tongue-and-lip visualizations plus targeted advice. The results show that precise phoneme recognition cements a complete diagnostic-feedback loop, paving the way for scalable, clinician-ready language assessment.

nan

Article 667

Title@2025-07-03 (4): Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability

Title: Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability

Benchmarking Akan ASR-Modelle über Domain-spezifische Datensätze: Eine vergleichende Bewertung von Leistung, Skalierbarkeit und Anpassungsfähigkeit

确定Akan ASR模型基准的全域具体数据集:业绩比较评价、可缩放性和可调适性 2507.02407v1

Authors (8): Mark Atta Mensah, Isaac Wiafe, Akon Ekpezu, Justice Kwame Appati, Jamal-Deen Abdulai, Akosua Nyarkoa Wiafe-Akenten, Frank Ernest Yeboah, Gifty Odame

Most existing automatic speech recognition (ASR) research evaluate models using in-domain datasets. However, they seldom evaluate how they generalize across diverse speech contexts. This study addresses this gap by benchmarking seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, using four Akan speech corpora to determine their performance. These datasets encompass various domains, including culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. A comparison of the word error rate and character error rate highlighted domain dependency, with models performing optimally only within their training domains while showing marked accuracy degradation in mismatched scenarios. This study also identified distinct error behaviors between the Whisper and Wav2Vec2 architectures. Whereas fine-tuned Whisper Akan models led to more fluent but potentially misleading transcription errors, Wav2Vec2 produced more obvious yet less interpretable outputs when encountering unfamiliar inputs. This trade-off between readability and transparency in ASR errors should be considered when selecting architectures for low-resource language (LRL) applications. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other LRLs.

nan

Article 668

Title@2025-07-03 (4): AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

Title: AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

AIn’t Not Nothing But a Survey? Mit großen Sprachmodellen für die Codierung Deutsch Open-Ended Survey Responses on Survey Motivation

使用大语言模型对德国关于调查动机的开放式调查答复进行编码 2506.14634v3

Authors (4): Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessica Daikeler

The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs’ performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs’ unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.

nan

Article 669

Title@2025-07-03 (4): Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction

Title: Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction

Dynamischer Langzeit-Speicherspeicher auf Kurzzeit-Speicherbasis für Langzeit-LLM-Interaktion

动态长地平线LLM互动的短期内存存储 2507.03042v1

Authors (2): Yuyang Lou, Charles Li

Memory storage for Large Language models (LLMs) is becoming an increasingly active area of research, particularly for enabling personalization across long conversations. We propose Pref-LSTM, a dynamic and lightweight framework that combines a BERT-based classifier with a LSTM memory module that generates memory embedding which then is soft-prompt injected into a frozen LLM. We synthetically curate a dataset of preference and non-preference conversation turns to train our BERT-based classifier. Although our LSTM-based memory encoder did not yield strong results, we find that the BERT-based classifier performs reliably in identifying explicit and implicit user preferences. Our research demonstrates the viability of using preference filtering with LSTM gating principals as an efficient path towards scalable user preference modeling, without extensive overhead and fine-tuning.

nan

Article 670

Title@2025-07-03 (4): JoyTTS: LLM-based Spoken Chatbot With Voice Cloning

Title: JoyTTS: LLM-based Spoken Chatbot With Voice Cloning

JoyTTS: LLM-basierter gesprochener Chatbot mit Voice Cloning

以LLM为基地的 “ 配有语音克隆的口音聊天机器人 “ 2507.02380v1

Authors (3): Fangru Zhou, Jun Zhao, Guoxin Wang

JoyTTS is an end-to-end spoken chatbot that combines large language models (LLM) with text-to-speech (TTS) technology, featuring voice cloning capabilities. This project is built upon the open-source MiniCPM-o and CosyVoice2 models and trained on 2000 hours of conversational data. We have also provided the complete training code to facilitate further development and optimization by the community. On the testing machine seed-tts-zh, it achieves a SS (speaker similarity) score of 0.73 and a WER (Word Error Rate) of 5.09. The code and models, along with training and inference scripts, are available at https://github.com/jdh-algo/JoyTTS.git.

nan

Article 671

Title@2025-07-03 (4): Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

Title: Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

Effiziente Code-LLM-Schulung über Distribution-Konsistenz und Diversity-Aware-Datenauswahl

通过分配和多样性软件数据选择进行高效率的守则LLM培训 2507.02378v1

Authors (3): Weijie Lyu, Sheng-Jun Huang, Xuan Xia

Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that utilizes a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs.

nan

Article 672

Title@2025-07-03 (4): QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers

Title: QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers

QFFN-BERT: Eine empirische Studie über Tiefe, Leistung und Dateneffizienz in hybriden Quantum-Klassischen Transformern

QFFN-BERT:对混合量子-分类变异器的深度、性能和数据效率的经验研究 2507.02364v1

Authors (1): Pilsung Kang

Parameterized quantum circuits (PQCs) have recently emerged as promising components for enhancing the expressibility of neural architectures. In this work, we introduce QFFN-BERT, a hybrid quantum-classical transformer where the feedforward network (FFN) modules of a compact BERT variant are replaced by PQC-based layers. This design is motivated by the dominant parameter contribution of FFNs, which account for approximately two-thirds of the parameters within standard Transformer encoder blocks. While prior studies have primarily integrated PQCs into self-attention modules, our work focuses on the FFN and systematically investigates the trade-offs between PQC depth, expressibility, and trainability. Our final PQC architecture incorporates a residual connection, both $R_Y$ and $R_Z$ rotations, and an alternating entanglement strategy to ensure stable training and high expressibility. Our experiments, conducted on a classical simulator, on the SST-2 and DBpedia benchmarks demonstrate two key findings. First, a carefully configured QFFN-BERT achieves up to 102.0% of the baseline accuracy, surpassing its classical counterpart in a full-data setting while reducing FFN-specific parameters by over 99%. Second, our model exhibits a consistent and competitive edge in few-shot learning scenarios, confirming its potential for superior data efficiency. These results, supported by an ablation study on a non-optimized PQC that failed to learn, confirm that PQCs can serve as powerful and parameter-efficient alternatives to classical FFNs when co-designed with foundational deep learning principles.

nan

Article 673

Title@2025-07-03 (4): Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning

Title: Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning

Verbesserung der Robustheit der distantly-überwachten Anerkennung von Personen mit Namen durch unsicheres Lehrerlernen und studentisch-studentisches kollaboratives Lernen

通过不确定-软件教师学习和学生-学生合作学习,提高以不确定-软件教师学习和学生-学生合作学习的方式,提高以不确定-软件命名的实体识别的力度 2311.08010v3

Authors (7): Shuzheng Si, Helan Hu, Haozhe Zhao, Shuang Zeng, Kaikai An, Zefan Cai, Baobao Chang

Distantly-Supervised Named Entity Recognition (DS-NER) is widely used in real-world scenarios. It can effectively alleviate the burden of annotation by matching entities in existing knowledge bases with snippets in the text but suffer from the label noise. Recent works attempt to adopt the teacher-student framework to gradually refine the training labels and improve the overall robustness. However, these teacher-student methods achieve limited performance because the poor calibration of the teacher network produces incorrectly pseudo-labeled samples, leading to error propagation. Therefore, we propose: (1) Uncertainty-Aware Teacher Learning that leverages the prediction uncertainty to reduce the number of incorrect pseudo labels in the self-training stage; (2) Student-Student Collaborative Learning that allows the transfer of reliable labels between two student networks instead of indiscriminately relying on all pseudo labels from its teacher, and further enables a full exploration of mislabeled samples rather than simply filtering unreliable pseudo-labeled samples. We evaluate our proposed method on five DS-NER datasets, demonstrating that our method is superior to the state-of-the-art DS-NER methods.

nan

Article 674

Title@2025-07-03 (4): Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models

Title: Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models

Coling-UniA bei SciVQA 2025: Wenig-heißes Beispiel Retrieval und Vertrauen-informierte Montage für multimodale große Sprachmodelle

在SciVQA 2025 SciVQA 的Coling-UniA:多式大语言模型的很少热实例检索和信任化组合 2507.02357v1

Authors (3): Christian Jaumann, Annemarie Friedrich, Rainer Lienhart

This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models’ confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTS. Our code is publicly available.

nan

Article 675

Title@2025-07-03 (4): Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation

Title: Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation

Einschließlich LLMs für großräumige Urban Complex Mobility Simulation

大型城市综合流动模拟项目LLMs 2505.21880v2

Authors (8): Yu-Lun Song, Chung-En Tsern, Che-Cheng Wu, Yu-Ming Chang, Syuan-Bo Huang, Wei-Chu Chen, Michael Chia-Liang Lin, Yu-Ta Lin

This study presents an innovative approach to urban mobility simulation by integrating a Large Language Model (LLM) with Agent-Based Modeling (ABM). Unlike traditional rule-based ABM, the proposed framework leverages LLM to enhance agent diversity and realism by generating synthetic population profiles, allocating routine and occasional locations, and simulating personalized routes. Using real-world data, the simulation models individual behaviors and large-scale mobility patterns in Taipei City. Key insights, such as route heat maps and mode-specific indicators, provide urban planners with actionable information for policy-making. Future work focuses on establishing robust validation frameworks to ensure accuracy and reliability in urban planning applications.

nan

Article 676

Title@2025-07-03 (4): Decision-Oriented Text Evaluation

Title: Decision-Oriented Text Evaluation

Entscheidungsorientierte Textbewertung

注重决定的案文评价 2507.01923v2

Authors (3): Yu-Shiang Huang, Chuan-Ju Wang, Chung-Chi Chen

Natural language generation (NLG) is increasingly deployed in high-stakes domains, yet common intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, weakly correlate with actual decision-making efficacy. We propose a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using market digest texts–including objective morning summaries and subjective closing-bell analyses–as test cases, we assess decision quality based on the financial performance of trades executed by human investors and autonomous LLM agents informed exclusively by these texts. Our findings reveal that neither humans nor LLM agents consistently surpass random performance when relying solely on summaries. However, richer analytical commentaries enable collaborative human-LLM teams to outperform individual human or agent baselines significantly. Our approach underscores the importance of evaluating generated text by its ability to facilitate synergistic decision-making between humans and LLMs, highlighting critical limitations of traditional intrinsic metrics.

nan

Article 677

Title@2025-07-03 (4): Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs

Title: Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs

Token Prepending: Ein trainingsfreier Ansatz zur Eliziierung besserer Sentence-Embeddings von LLMs

Token Predudo:从LLM女士那里采用不培训办法,使判刑内容更好地嵌入Elibear 2412.11556v2

Authors (7): Yuchen Fu, Zifeng Cheng, Zhiwei Jiang, Zhonghui Wang, Yafeng Yin, Zhengliang Li, Qing Gu

Extracting sentence embeddings from large language models (LLMs) is a promising direction, as LLMs have demonstrated stronger semantic understanding capabilities. Previous studies typically focus on prompt engineering to elicit sentence embeddings from LLMs by prompting the model to encode sentence information into the embedding of the last token. However, LLMs are mostly decoder-only models with causal attention and the earlier tokens in the sentence cannot attend to the latter tokens, resulting in biased encoding of sentence information and cascading effects on the final decoded token. To this end, we propose a novel Token Prepending (TP) technique that prepends each layer’s decoded sentence embedding to the beginning of the sentence in the next layer’s input, allowing earlier tokens to attend to the complete sentence information under the causal attention mechanism. The proposed TP technique is a plug-and-play and training-free technique, which means it can be seamlessly integrated with various prompt-based sentence embedding methods and autoregressive LLMs. Extensive experiments on various Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our proposed TP technique can significantly improve the performance of existing prompt-based sentence embedding methods across different LLMs, while incurring negligible additional inference cost.

nan

Article 678

Title@2025-07-03 (4): Layered Insights: Generalizable Analysis of Authorial Style by Leveraging All Transformer Layers

Title: Layered Insights: Generalizable Analysis of Authorial Style by Leveraging All Transformer Layers

Layered Insights: Generalisierbare Analyse des Autorial Styles durch Hebelisierung aller Transformer Layers

图层透视: 通过利用所有变换层对文件样式的通用分析 2503.00958v2

Authors (5): Milad Alshomary, Nikhil Reddy Varimalla, Vishal Anand, Smaranda Muresan, Kathleen McKeown

We propose a new approach for the authorship attribution task that leverages the various linguistic representations learned at different layers of pre-trained transformer-based models. We evaluate our approach on three datasets, comparing it to a state-of-the-art baseline in in-domain and out-of-domain scenarios. We found that utilizing various transformer layers improves the robustness of authorship attribution models when tested on out-of-domain data, resulting in new state-of-the-art results. Our analysis gives further insights into how our model’s different layers get specialized in representing certain stylistic features that benefit the model when tested out of the domain.

nan

Article 679

Title@2025-07-03 (4): Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Title: Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Skywork-Reward-V2:通过人类-AI协同增强优先数据曲线 2507.01352v2

Authors (12): Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou

Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.

nan

Article 680

Title@2025-07-03 (4): Cautious Next Token Prediction

Title: Cautious Next Token Prediction

Vorsichtige nächste Zeichen Vorhersage

谨慎的次下 Tok 预测 2507.03038v1

Authors (10): Yizhou Wang, Lingzhi Zhang, Yue Bai, Mang Tik Chiu, Zhengmian Hu, Mingyuan Zhang, Qihua Dong, Yu Yin, Sohrab Amirghodsi, Yun Fu

Next token prediction paradigm has been prevailing for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such approach leads to inferior performance in various NLP tasks when the model is not certain about testing questions. To this end, we propose a brand new training-free decoding strategy, dubbed as Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from the step independently and stop when encountering any punctuation. Then we select the trial with the lowest perplexity score viewed as the most probable and reliable trial path given the model’s capacity. The trial number is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human beings’ behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin. Moreover, the integration of CNTP with self consistency can further improve over vanilla self consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at https://github.com/wyzjack/CNTP.

nan

Article 681

Title@2025-07-03 (4): Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach

Title: Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach

Ausrichten von gefrorenen LLMs durch Verstärkungslernen: Ein iteratives Reweight-then-Optimize-Ansatz

通过强化学习将冻结的LLMs与 “ 强化学习:一种过渡性再加权再优化方法 “ 相匹配 2506.17828v2

Authors (9): Xinnan Zhang, Chenliang Li, Siliang Zeng, Jiaxiang Li, Zhongruo Wang, Kaixiang Lin, Songtao Lu, Alfredo Garcia, Mingyi Hong

Aligning large language models (LLMs) with human preferences usually requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used in test-time to improve model performance, nor are they applicable when the model weights are not accessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present a method named Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of the (frozen) base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples using current value functions, and (iii) trains a new lightweight value function that guides the next decoding pass. At test time, the value functions are used to guide the base model generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI’s reinforcement fine-tuning (RFT), but without requiring access to the model weights.

nan

Article 682

Title@2025-07-03 (4): Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Title: Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Fast-dLLM: Trainingsfreie Beschleunigung von Diffusion LLM durch Ermöglichen von KV Cache und Paralleldecoding

快速dLLM:通过授权 KV 缓存和平行编码加速免培训传播LLM 2505.22618v3

Authors (9): Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie

Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to \textbf{27.6$\times$ throughput} improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.

nan

Article 683

Title@2025-07-03 (4): Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

Title: Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

Bypass Back-Propagation: Optimierungsbasiertes Structural Pruning für große Sprachmodelle über Policy Gradient

Bypass 后回通信:通过 “ 政策梯度 “ 优化基于优化的结构结构,为大语言模式提供缓冲 2406.10576v3

Authors (5): Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Recent Large-Language Models (LLMs) pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically hand-crafted metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve efficiency, our method eliminates the back-propagation through the LLM per se during optimization, requiring only the forward pass of the LLM. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from LLM loss, facilitating efficient optimization via policy gradient estimator without back-propagation. Thus, our method can 1) support global and heterogeneous pruning (i.e., automatically determine different redundancy for different layers), and 2) optionally initialize with a metric-based method (for our Bernoulli distributions). Extensive experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models using the C4 and WikiText2 datasets demonstrate the promising performance of our method in efficiency and effectiveness. Code is available at https://github.com/ethanygao/backprop-free_LLM_pruning.

nan

Article 684

Title@2025-07-03 (4): DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning

Title: DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning

DoMIX: Ein effizientes Framework zur Nutzung von Domain-Wissen im Feintuning

DoMIX:一个在微调中利用域知识的有效框架 2507.02302v1

Authors (3): Dohoon Kim, Donghun Kang, Taesup Moon

Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at https://github.com/dohoonkim-ai/DoMIX.

nan

Article 685

Title: Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models

Commander-GPT: Die Fähigkeit von Multi-Modal Large Language Models, den Sarkasmus vollständig zu entleeren

GPT指挥官:完全解除多模式大语言模型的讽刺性探测能力 2503.18681v3

Authors (4): Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin

Sarcasm detection, as a crucial research direction in the field of Natural Language Processing (NLP), has attracted widespread attention. Traditional sarcasm detection tasks have typically focused on single-modal approaches (e.g., text), but due to the implicit and subtle nature of sarcasm, such methods often fail to yield satisfactory results. In recent years, researchers have shifted the focus of sarcasm detection to multi-modal approaches. However, effectively leveraging multi-modal information to accurately identify sarcastic content remains a challenge that warrants further exploration. Leveraging the powerful integrated processing capabilities of Multi-Modal Large Language Models (MLLMs) for various information sources, we propose an innovative multi-modal Commander-GPT framework. Inspired by military strategy, we first decompose the sarcasm detection task into six distinct sub-tasks. A central commander (decision-maker) then assigns the best-suited large language model to address each specific sub-task. Ultimately, the detection results from each model are aggregated to identify sarcasm. We conducted extensive experiments on MMSD and MMSD 2.0, utilizing four multi-modal large language models and six prompting strategies. Our experiments demonstrate that our approach achieves state-of-the-art performance, with a 19.3% improvement in F1 score, without necessitating fine-tuning or ground-truth rationales.

nan

Article 686

Title@2025-07-03 (4): Prompt-Guided Turn-Taking Prediction

Title: Prompt-Guided Turn-Taking Prediction

Prompt-geführte Turn-Taking-Vorhersage

即时指导的回转预测 2506.21191v2

Authors (7): Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Divesh Lala, Keiko Ochi, Tatsuya Kawahara

Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as “faster” or “calmer” adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.

nan

Article 687

Title: Optimal strategies to perform multilingual analysis of social content for a novel dataset in the tourism domain

Optimale Strategien zur mehrsprachigen Analyse sozialer Inhalte für einen neuartigen Datensatz im Tourismusbereich

为旅游领域新数据集的社会内容进行多语种社会内容分析的最佳最佳战略 2311.14727v2

Authors (6): Maxime Masson, Rodrigo Agerri, Christian Sallaberry, Marie-Noelle Bessagnet, Annig Le Parc Lacayrelle, Philippe Roose

The rising influence of social media platforms in various domains, including tourism, has highlighted the growing need for efficient and automated Natural Language Processing (NLP) strategies to take advantage of this valuable resource. However, the transformation of multilingual, unstructured, and informal texts into structured knowledge still poses significant challenges, most notably the never-ending requirement for manually annotated data to train deep learning classifiers. In this work, we study different NLP techniques to establish the best ones to obtain competitive performances while keeping the need for training annotated data to a minimum. To do so, we built the first publicly available multilingual dataset (French, English, and Spanish) for the tourism domain, composed of tourism-related tweets. The dataset includes multilayered, manually revised annotations for Named Entity Recognition (NER) for Locations and Fine-grained Thematic Concepts Extraction mapped to the Thesaurus of Tourism and Leisure Activities of the World Tourism Organization, as well as for Sentiment Analysis at the tweet level. Extensive experimentation comparing various few-shot and fine-tuning techniques with modern language models demonstrate that modern few-shot techniques allow us to obtain competitive results for all three tasks with very little annotation data: 5 tweets per label (15 in total) for Sentiment Analysis, 30 tweets for Named Entity Recognition of Locations and 1K tweets annotated with fine-grained thematic concepts, a highly fine-grained sequence labeling task based on an inventory of 315 classes. We believe that our results, grounded in a novel dataset, pave the way for applying NLP to new domain-specific applications, reducing the need for manual annotations and circumventing the complexities of rule-based, ad-hoc solutions.

nan

Article 688

Title@2025-07-03 (4): Seeing Through Green: Text-Based Classification and the Firm’s Returns from Green Patents

Title: Seeing Through Green: Text-Based Classification and the Firm’s Returns from Green Patents

Durch Grün sehen: Textbasierte Klassifizierung und die Rückkehr der Firma aus grünen Patenten

通过 “ 绿色观光:基于文本的分类和公司从绿色专利的回报 “ 2507.02287v1

Authors (3): Lapo Santarlasci, Armando Rungi, Antonio Zinilli

This paper introduces Natural Language Processing for identifying true'' green patents from official supporting documents. We start our training on about 12.4 million patents that had been classified as green from previous literature. Thus, we train a simple neural network to enlarge a baseline dictionary through vector representations of expressions related to environmental technologies. After testing, we find thattrue’’ green patents represent about 20\% of the total of patents classified as green from previous literature. We show heterogeneity by technological classes, and then check that true' green patents are about 1\% less cited by following inventions. In the second part of the paper, we test the relationship between patenting and a dashboard of firm-level financial accounts in the European Union. After controlling for reverse causality, we show that holding at least one true'' green patent raises sales, market shares, and productivity. If we restrict the analysis to high-novelty `true’’ green patents, we find that they also yield higher profits. Our findings underscore the importance of using text analyses to gauge finer-grained patent classifications that are useful for policymaking in different domains.

nan

Article 689

Title@2025-07-03 (4): Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments

Title: Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments

Kausales Repräsentationslernen mit generativer Künstlicher Intelligenz: Anwendung auf Texte als Behandlungen

产生人工智能的因果代表性学习:应用文字作为治疗 2410.00903v3

Authors (2): Kosuke Imai, Kentaro Nakamura

In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments like texts, by leveraging the power of generative Artificial Intelligence (GenAI). Specifically, we propose to use a deep generative model such as large language models (LLMs) to efficiently generate treatments and use their internal representation for subsequent causal effect estimation. We show that the knowledge of this true internal representation helps disentangle the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike existing methods, the proposed GenAI-Powered Inference (GPI) methodology eliminates the need to learn causal representation from the data, and hence produces more accurate and efficient estimates. We formally establish the conditions required for the nonparametric identification of the average treatment effect, propose an estimation strategy that avoids the violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator through the application of double machine learning. Finally, using an instrumental variables approach, we extend the proposed methodology to the settings in which the treatment feature is based on human perception. The proposed GPI methodology is also applicable to text reuse where an LLM is used to regenerate existing texts. We conduct simulation and empirical studies, using the generated text data from an open-source LLM, Llama 3, to illustrate the advantages of our estimator over state-of-the-art causal representation learning algorithms.

nan

Article 690

Title@2025-07-03 (4): SMARTe: Slot-based Method for Accountable Relational Triple extraction

Title: SMARTe: Slot-based Method for Accountable Relational Triple extraction

SMARTe: Slot-basierte Methode für die relationale Triple-Extraktion

SMARTE: 衡算关系三重采掘的基于固态方法 2504.12816v3

Authors (2): Xue Wen Tan, Stanley Kok

Relational Triple Extraction (RTE) is a fundamental task in Natural Language Processing (NLP). However, prior research has primarily focused on optimizing model performance, with limited efforts to understand the internal mechanisms driving these models. Many existing methods rely on complex preprocessing to induce specific interactions, often resulting in opaque systems that may not fully align with their theoretical foundations. To address these limitations, we propose SMARTe: a Slot-based Method for Accountable Relational Triple extraction. SMARTe introduces intrinsic interpretability through a slot attention mechanism and frames the task as a set prediction problem. Slot attention consolidates relevant information into distinct slots, ensuring all predictions can be explicitly traced to learned slot representations and the tokens contributing to each predicted relational triple. While emphasizing interpretability, SMARTe achieves performance comparable to state-of-the-art models. Evaluations on the NYT and WebNLG datasets demonstrate that adding interpretability does not compromise performance. Furthermore, we conducted qualitative assessments to showcase the explanations provided by SMARTe, using attention heatmaps that map to their respective tokens. We conclude with a discussion of our findings and propose directions for future research. Our code is available at https://github.com/Chen-XueWen/SMARTe.

nan

Article 691

Title@2025-07-03 (4): MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Title: MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

MemAgent: Umgestalten von Langkontext-LLM mit Multi-Conv RL-basierten Speicheragenten

MemerAgent: 与基于多Conv RL的内存代理重塑长文本LLM 2507.02259v1

Authors (11): Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, Hao Zhou

Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities, being able to extrapolate from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieves 95%+ in 512K RULER test.

nan

Article 692

Title@2025-07-03 (4): Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks

Title: Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks

Schaltungs-Tuning: Mechanistischer Ansatz zur Identifizierung von Parameter Redundanz und Feinsteuerung neuraler Netzwerke

电路调控:确定参数冗余和精微调整神经网络的机械化方法 2502.06106v2

Authors (4): Yueyan Li, Wenhao Gao, Caixia Yuan, Xiaojie Wang

The study of mechanistic interpretability aims to reverse-engineer a model to explain its behaviors. While recent studies have focused on the static mechanism of a certain behavior, the learning dynamics inside a model remain to be explored. In this work, we develop an interpretable fine-tuning method for analyzing the mechanism behind learning. We first introduce the concept of node-level intrinsic dimensionality to describe the learning process of a model in a computational graph. Based on our theory, we propose circuit-tuning, a two-stage algorithm that iteratively builds the minimal subgraph for a specific task and updates the key parameters in a heuristic way. Experimental results confirm the existence of the intrinsic dimensionality at the node level and demonstrate the effectiveness of our method for transparent and interpretable fine-tuning. We visualize and analyze the circuits before, during, and after fine-tuning, providing new insights into the self-organization mechanism of a neural network in the learning process.

nan

Article 693

Title@2025-07-03 (4): Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies

Title: Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies

Mixture of Reasonings: Große Sprachmodelle mit adaptiven Strategien zur Vernunft bringen

理由混合:与适应战略一道教授大语言模式 2507.00606v2

Authors (4): Tao Xiong, Xavier Hu, Wenyan Fan, Shengyu Zhang

Large language models (LLMs) excel in complex tasks through advanced prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency. We introduce Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs for autonomous, task-adaptive reasoning without external prompt engineering. MoR has two phases: Thought Generation, creating reasoning chain templates with models like GPT-4o, and SFT Dataset Construction, pairing templates with benchmark datasets for supervised fine-tuning. Our experiments show that MoR significantly enhances performance, with MoR150 achieving 0.730 (2.2% improvement) using CoT prompting and 0.734 (13.5% improvement) compared to baselines. MoR eliminates the need for task-specific prompts, offering a generalizable solution for robust reasoning across diverse tasks.

nan

Article 694

Title@2025-07-03 (4): Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation

Title: Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation

Schutz der Privatsphäre, Erhöhung der Zugänglichkeit und Senkung der Kosten: Ein On-Device-Modell für künstliche Intelligenz für medizinische Transkription und Notizgenerierung

维护隐私、增加无障碍和降低成本:医疗转移和笔记制作的现场人工人工智能模型 2507.03033v1

Authors (6): Johnson Thomas, Ayush Mudgal, Wendao Liu, Nisten Tahiraj, Zeeshaan Mohammed, Dhruv Diddi

Background: Clinical documentation represents a significant burden for healthcare providers, with physicians spending up to 2 hours daily on administrative tasks. Recent advances in large language models (LLMs) offer promising solutions, but privacy concerns and computational requirements limit their adoption in healthcare settings. Objective: To develop and evaluate a privacy-preserving, on-device medical transcription system using a fine-tuned Llama 3.2 1B model capable of generating structured medical notes from medical transcriptions while maintaining complete data sovereignty entirely in the browser. Methods: We fine-tuned a Llama 3.2 1B model using Parameter-Efficient Fine-Tuning (PEFT) with LoRA on 1,500 synthetic medical transcription-to-structured note pairs. The model was evaluated against the base Llama 3.2 1B on two datasets: 100 endocrinology transcripts and 140 modified ACI benchmark cases. Evaluation employed both statistical metrics (ROUGE, BERTScore, BLEURT) and LLM-as-judge assessments across multiple clinical quality dimensions. Results: The fine-tuned OnDevice model demonstrated substantial improvements over the base model. On the ACI benchmark, ROUGE-1 scores increased from 0.346 to 0.496, while BERTScore F1 improved from 0.832 to 0.866. Clinical quality assessments showed marked reduction in major hallucinations (from 85 to 35 cases) and enhanced factual correctness (2.81 to 3.54 on 5-point scale). Similar improvements were observed on the internal evaluation dataset, with composite scores increasing from 3.13 to 4.43 (+41.5%). Conclusions: Fine-tuning compact LLMs for medical transcription yields clinically meaningful improvements while enabling complete on-device browser deployment. This approach addresses key barriers to AI adoption in healthcare: privacy preservation, cost reduction, and accessibility for resource-constrained environments.

nan

Article 695

Title@2025-07-03 (4): GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

Title: GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

GDC Cohort Copilot: Ein KI-Copilot für die Kuratierung von Kohorten aus den Genomic Data Commons

GDC Cohort Cohort 副驾驶:AI 基因组数据共同点的Curate Choorts联合驾驶员 2507.02221v1

Authors (5): Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L. Grossman

Motivation: The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. Availability and implementation: The standalone docker image for GDC Cohort Copilot is available at https://quay.io/repository/cdis/gdc-cohort-copilot. Source code is available at https://github.com/uc-cdis/gdc-cohort-copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds.

nan

Article 696

Title@2025-07-03 (4): SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers

Title: SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers

SciGA: Ein umfassender Datensatz zur Gestaltung grafischer Abstracts in wissenschaftlichen Papieren

SciGA: 用于设计学术论文制图摘要的综合数据集 2507.02212v1

Authors (4): Takuro Kawada, Shunsuke Kitada, Sota Nemoto, Hitoshi Iyatomi

Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.

nan