cs.CL @ 2025-08-01: 666

07-31 (4)

Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

Kaskadierte Informationsoffenlegung zur verallgemeinerten Bewertung von Problemlösungsfähigkeiten

用于对解决问题能力通用评价的连锁信息披露

07-31

SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model

SimuRA: Auf dem Weg zu einem General Goal-Oriented Agent über Simulative Reasoning Architecture mit LLM-basiertem Weltmodell

SimuRA:通过使用以LLM为基础的世界模型的模拟合理理由结构,努力实现以一般目标为导向的代理

2507.23773v1

07-31

Perception-Aware Policy Optimization for Multimodal Reasoning

Perception-Aware Policy Optimization für multimodale Reasoning

对多式联运理由的观念-认知软件政策优化

2507.06448v3

07-31

CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

CoT-Self-Instruct: Aufbau hochwertiger synthetischer Aufforderungen zur Begründung und zu nicht-vernünftigen Aufgaben

CoT-Self-Instruct:为推理和非推理任务建立高质量的合成提示

2507.23751v1

07-31

Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

Rule2Text: Natürliche Sprache Erklärung der logischen Regeln in Wissensgraphen

Rule2Text:知识图中逻辑规则的自然语言解释

2507.23740v1

07-31

How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment

Wie KI-Ideen die Kreativität, Vielfalt und Evolution menschlicher Ideen beeinflussen: Beweise aus einem großen, dynamischen Experiment

AI Ideas如何影响人类思想的创造性、多样性和演变:大规模动态实验的证据

2401.13481v3

07-31

Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Seed-Prover: Tiefe und breite Begründung für automatisierte Theorem Proving

种子文献:用于自动理论论证的深度和广度理由

2507.23726v1

07-31

RecGPT Technical Report

Technischer Bericht des RecGPT

RecGPT 技术报告

2507.22879v2

07-31

Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Nicht zu vergessen: Proaktive Interferenz offenbart Arbeitsspeichergrenzen in LLMs jenseits der Kontextlänge

无法忘却:主动干扰揭示LLM在上下文长度之外的工作记忆限制

2506.08184v3

07-31

TextQuests: How Good are LLMs at Text-Based Video Games?

TextQuests: Wie gut sind LLMs bei textbasierten Videospielen?

文本Quests: 文本视频游戏的LLMs效果如何?

2507.23701v1

07-31

TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

TweakLLM: Eine Routing-Architektur für dynamisches Tailoring von Cached Responses

TweakLLM: 动态定制缓存响应的路由架构

2507.23674v1

07-31

Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning

Arabische Hass-Spracherkennung und Maskenbildung in sozialen Medien mit Deep-Learning-Modellen und vortrainierten Modellen Feinabstimmung

利用深学习模式和预培训模式进行微调,在社会媒体中识别和遮掩阿拉伯仇恨言论

2507.23661v1

07-31

DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures

DocPolarBERT: Ein vortrainiertes Modell zum Dokumentverständnis mit relativer Polarkoordinate Kodierung von Layoutstrukturen

DocPolarBERT:一个预先培训的文件理解模式,其布局结构的相对极地协调编码

2507.08606v3

07-31

Who’s important? – SUnSET: Synergistic Understanding of Stakeholder, Events and Time for Timeline Generation

Wer ist wichtig? – SUnSET: Synergistisches Verständnis von Stakeholdern, Ereignissen und Zeit für die Timeline Generation

谁重要? - SUnSET:对利益攸关方、事件和时间的协同理解,以产生时间表。

2507.21903v2

07-31

How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Wie kann ich meinen LLM-Benchmark veröffentlichen, ohne die wahren Antworten wegzugeben?

我怎样才能公布我的LLM基准而不给出正确的答案?

2505.18102v2

07-31

Splits! A Flexible Dataset and Evaluation Framework for Sociocultural Linguistic Investigation

Splits! Ein flexibler Datensatz und Evaluationsrahmen für die soziokulturelle Linguistische Untersuchung

社会文化语言调查灵活数据集和评价框架

2504.04640v2

07-31

ILID: Native Script Language Identification for Indian Languages

ILID: Native Script Language Identification für indische Sprachen

ILID:印第安人语言的土著脚本语言识别

2507.11832v2

07-31

Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates

Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Assessments

具有不确定性估计值的临床试验的深入学习预测

2507.23607v1

07-31

Inside-Out: Hidden Factual Knowledge in LLMs

Inside-Out: Verstecktes Sachwissen in LLMs

内外:LLM中隐藏的事实知识

2503.15299v3

07-31

DiffLoRA: Differential Low-Rank Adapters for Large Language Models

DiffLoRA: Differential-Low-Rank-Adapter für große Sprachmodelle

DiffLoRA:用于大语言模型的差分低秩适配器

2507.23588v1

07-31

T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text

T-Detect: Tail-Aware Statistische Normalisierung zur robusten Erkennung von maschinengeneriertem Text

T-检测:用于对反转机制文本进行强力探测的尾件软件统计标准化

2507.23577v1

07-31

Neutral Residues: Revisiting Adapters for Model Extension

Neutrale Rückstände: Adapter zur Modellerweiterung

中立残留物:重新审视适应器,用于示范推广

2410.02744v3

07-31

Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation

Kann LLMs mit Ambiguity helfen? Eine quantitative Bewertung verschiedener großer Sprachmodelle auf Word Sense Disambiguation

LLMs能否协助消除歧义?对各种大语言模型在词义消歧(Word Sense Disambiguation)上的量化评估

2411.18337v4

07-31

Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

Med-R$^3$: Verbesserung der medizinischen Retrieval-Augmented Reasoning von LLMs durch Progressive Verstärkung Lernen

Med-R$^3$:通过渐进式强化学习增强LLMs的医疗检索增强推理

2507.23541v1

07-31

PurpCode: Reasoning for Safer Code Generation

PurpCode: Begründung für eine sicherere Code-Generierung

PurpCode:更安全代码生成的理由

2507.19060v2

07-31

MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

MECAT: Ein Multi-Experten-Benchmark für feinkörnige Audio-Verstandsaufgaben

MECAT: 完善的音频理解任务多专家基准

2507.23511v1

07-31

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

LLaVA-MORE: Eine vergleichende Studie von LLMs und visuellen Backbones für verbesserte visuelle Instruktions-Tuning

LLaVA-MORE:用于强化视觉教学的LLM和视觉背骨比较研究

2503.15621v2

07-31

A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Ein neuartiger Bewertungs-Benchmark für medizinische LLMs: Beleuchtende Sicherheit und Wirksamkeit in klinischen Bereichen

医疗LLMs新颖的评价基准:临床域的引明安全和有效性

2507.23486v1

07-31

Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Role-Aware Sprachmodelle für sichere und kontextualisierte Zugriffskontrolle in Organisationen

各组织内安全和环境化出入控制使用控制实用语言模式

2507.23465v1

07-31

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

Counterfactual Evaluation für Blindangriffserkennung in LLM-basierten Evaluationssystemen

以LLM为基础的评价系统中盲人攻击探测的反事实评价

2507.23453v1

07-31

EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework

EducationQ: Bewertung der Lehrfähigkeiten von LLMs durch Multi-Agent Dialograhmen

EducationQ:通过多智能体对话框架评价LLMs的教学能力

2504.14928v3

07-31

The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models

Der Pragmatische Geist der Maschinen: Auf der Spur des Entstehens der Pragmatischen Kompetenz in großen Sprachmodellen

机器的实用思维:追踪大语言模式中实用能力的出现

2505.18497v2

07-31

Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

Über passives kritisches Denken hinaus: Förderung proaktiver Befragungen zur Verbesserung der Mensch-KI-Kollaboration

超越被动的批判性思考:促进积极主动的提问,以加强人类与人工智能的协作

2507.23407v1

07-31

RAVine: Reality-Aligned Evaluation for Agentic Search

RAVine: Realitätsorientierte Bewertung für die Agentische Suche

RAVine: 智能体搜索的现实对齐评价

2507.16725v2

07-31

Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

Verbesserte arabische Text-Retrieval mit aufmerksamer Relevanz Scoring

阿拉伯强化文本检索, 带有启动相关性显示器

2507.23404v1

07-31

MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

MRGSEM-Sum: Ein unbeaufsichtigtes Multi-Dokument Zusammenfassungsrahmen basierend auf Multi-Relational Graphen und struktureller Entropie Minimierung

MRGSEM-Sum:基于多关系图和结构元件最小化的无人监督的多文件概括框架

2507.23400v1

07-31

Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

Beyond the Cloud: Bewertung der Vorteile und Nachteile lokaler LLM-Einsatzmöglichkeiten für Übersetzer

云云之外:评估为笔译员部署当地LLM的利弊

2507.23399v1

07-31

Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

Causal2Vec: Verbessere Dekoder-nur LLMs als vielseitige Einbettungsmodelle

Causal2Vec:改进只有解码器的LLMs作为Versatile嵌入模型

2507.23386v1

07-31

MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

MPCC: Ein neuartiger Benchmark für multimodale Planung mit komplexen Einschränkungen in multimodalen großen Sprachmodellen

MPCC:具有多种多语言模式复杂限制的多式联运规划新基准

2507.23382v1

07-31

Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

Theorem-of-Thought: Ein Multi-Agenten-Framework für abduktive, deduktive und induktive Vernunft in Sprachmodellen

所探讨的理论理论:语言模式中指导、贬低和诱导理由的多机构框架

2506.07106v2

07-31

WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation

WildSpeech-Bench: Benchmarking von Audio-LLMs im natürlichen Sprachgespräch

WildSpeech-Bench:为自然语音对话中的音频LLMs设定基准

2506.21875v2

07-31

Holistic Evaluations of Topic Models

Ganzheitliche Bewertungen von Themenmodellen

专题模式整体评价

2507.23364v1

07-31

Robust and Fine-Grained Detection of AI Generated Texts

Robuste und feinkörnige Erkennung von KI-generierten Texten

对 AI 生成文本的强力和精细探测

2504.11952v3

07-31

SWE-Exp: Experience-Driven Software Issue Resolution

SWE-Exp: Erfahrungsgetriebene Software-Ausgabeauflösung

SWE-Exp:经验驱动的软件问题解决

2507.23361v1

07-31

VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

VL-Cogito: Progressives Curriculum-Verstärkungslernen für fortgeschrittene multimodale Vernunft

VL-Cogito:先进多式联运理由的渐进课程强化学习

2507.22607v2

07-31

Text-to-SQL Task-oriented Dialogue Ontology Construction

Text-zu-SQL Aufgabenorientierter Dialog Ontologie Konstruktion

Text-to-SQL 任务导向对话本体构建

2507.23358v1

07-31

KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

KeyKnowledgeRAG (K^2RAG): Eine verbesserte RAG-Methode zur Verbesserung der LLM-Fragestellung

KeyKnowledgeRAG (K^2RAG):改进LLM问答能力的强化RAG方法

2507.07695v2

07-31

SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

SWE-Debate: Wettbewerbsfähige Multi-Agenten-Debatte für die Lösung von Software-Problemen

SWE-Debate:解决软件问题竞争性多机构辩论

2507.23348v1

07-31

Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

Mehrsprachige Fähigkeiten mit kulturellem und lokalem Wissen in großen Sprachmodellen verbessern und gleichzeitig die Leistungsfähigkeit der Ureinwohner verbessern

提高多语言多语言能力,在提高土著绩效的同时,利用大语言模式的文化和地方知识,同时提高土著绩效

2504.09753v3

07-31

DSBC : Data Science task Benchmarking with Context engineering

DSBC : Data Science-Aufgabe Benchmarking mit Kontext-Engineering

DSBC: 数据科学任务与背景工程基准

2507.23336v1

07-31

MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

MUST-RAG: MUSical Text Question Beantwortung mit retrieval Augmented Generation

MUST-RAG: 基于检索增强生成的音乐文本问答

2507.23334v1

07-31

Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette

Kulturelle Palette: Pluralisierung der Kulturausrichtung über Multi-Agenten-Palette

文化调色板:通过多智能体调色板实现多元化文化协调

2412.11167v3

07-31

FinGAIA: A Chinese Benchmark for AI Agents in Real-World Financial Domain

FinGAIA: Ein chinesischer Benchmark für KI-Agenten in der Real-World Financial Domain

金融界:中国真实世界金融领域AI代理商基准

2507.17186v2

07-31

Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages

Multi-Hypothese Destillation von mehrsprachigen Neuralübersetzungsmodellen für ressourcenarme Sprachen

多语言低资源语言多语言神经翻译模型的蒸馏

2507.21568v2

07-31

LLMs and the Human Condition

LLMs und der menschliche Zustand

LLM和人类条件

2402.08403v6

07-31

What’s Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

Was ist Taboo für Sie? - Eine empirische Bewertung von LLMs Verhalten für Sensitive Inhalte

什么是你的禁忌? - 对LLMs针对敏感内容的行为的实证评估

2507.23319v1

07-31

LiMe: a Latin Corpus of Late Medieval Criminal Sentences

LiMe: ein lateinischer Corpus der spätmittelalterlichen Strafurteile

LiMe:中世纪晚期刑事判决的拉丁语语料库

2404.12829v2

07-31

SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy

SequenceLayers: Sequenzverarbeitung und Streaming von Neuronalen Netzwerken leicht gemacht

SequenceLayers:序列处理和流式神经网络变得容易

2507.23292v1

07-31

Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics

Fortschritte in LLMs mit Fokus auf Vernunft, Anpassungsfähigkeit, Effizienz und Ethik

注重理由、适应性、效率和道德操守的LLMs项目的进展

2506.12365v2

07-31

Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability

Iterative Reparatur mit schwachen Verifierern für wenige Aufnahmen in KBQA mit Unbeantwortbarkeit

KBQA 中无法解答的微小投射点校验器的迭代性修补

2406.14313v3

07-31

AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

AutoSchemaKG: Autonome Wissensgraphenkonstruktion durch dynamische Schemainduktion aus Web-Scale Corpora

AutoSchemaKG:通过网络规模公司动态气相引入,建立自主知识图

2505.23628v2

07-31

Unveiling Super Experts in Mixture-of-Experts Large Language Models

Enthüllen Super-Experten in Mixture-of-Experts große Sprachmodelle

混合专家大语言模型中不懈的超级专家

2507.23279v1

07-31

AI-Reporter: A Path to a New Genre of Scientific Communication

AI-Reporter: Ein Weg zu einem neuen Genre wissenschaftlicher Kommunikation

AI-记者:通向科学通信新一流的道路

2507.05903v2

07-31

Evaluating LLMs’ Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

Bewertung der Mehrsprachigkeitsfähigkeiten von LLMs für Bengalen: Benchmark-Erstellung und Leistungsanalyse

评价孟加拉多种语文能力:基准设定和业绩分析

2507.23248v1

07-31

P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication

P-ReMIS: Pragmatische Vernunft in der psychischen Gesundheit und einer sozialen Implikation

P-ReMIS: 心理健康和社会影响方面的实用原因

2507.23247v1

07-31

Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents

Generalisiertes Verstärkungslernen für retriever-spezifische Abfrage-Rewriter mit unstrukturierten Real-World-Dokumenten

利用无结构的 “ 现实世界文件 “ 检索特定查询卷卷的通用强化学习

2507.23242v1

07-31

Cutting Through the Noise: Boosting LLM Performance on Math Word Problems

Schneiden durch den Lärm: Steigerung der LLM-Performance bei Math Word-Problemen

通过噪音剪切:促进数学字问题LLM的LLM性能

2406.15444v4

07-31

Framing Political Bias in Multilingual LLMs Across Pakistani Languages

Framing politische Bias in mehrsprachigen LLMs in pakistanischen Sprachen

以多语种LLMs多种巴基斯坦语言界定政治偏见

2506.00068v2

07-31

AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

AgentSpec: Anpassbare Runtime Enforcement für sichere und zuverlässige LLM-Agenten

AgentSpec:安全可靠LLM代理商的可定制运行时间执法

2503.18666v3

07-31

Enabling Few-Shot Alzheimer’s Disease Diagnosis on Tabular Biomarker Data with LLMs

Ermöglichung der weniger scharfen Alzheimer-Krankheit Diagnose auf Tabular Biomarker Daten mit LLMs

利用LLMs在表格生物标记数据上实现少样本阿尔茨海默病诊断

2507.23227v1

07-31

Unveiling the Influence of Amplifying Language-Specific Neurons

Enthüllen des Einflusses amplifizierender sprachspezifischer Neuronen

消除扩增语言特有新元的影响

2507.22581v2

07-31

LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

LLM-Crowdsourced: Ein Benchmark-freies Paradigma zur gegenseitigen Bewertung großer Sprachmodelle

LLM-文献来源:用于对大语言模式进行相互评价的无基准建模

2507.22359v2

07-31

Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders

Model Directions, keine Worte: Mechanistische Themenmodelle mit Sparse Autoencodern

模型方向,非单词:使用粗态自动编码器的机械专题模型

2507.23220v1

07-31

Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires

Kulturelle Bias in großen Sprachmodellen: Bewertung von KI-Agenten durch moralische Fragebögen

大语言模式中的文化偏见:通过道德问卷评估AI代理

2507.10073v2

07-31

Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples

Fehler sind die Steinschritte zum Erfolg: Erweitern Sie das wenige-heiße In-Context-Lernen durch die Nutzung negativer Muster

失败是走向成功的一步步石:通过利用负面样本加强少许热的文体学习

2507.23211v1

07-31

InfAlign: Inference-aware language model alignment

InfAlign: Inference-aware Sprachmodellausrichtung

InfAlign: 参考意识语言模型对齐

2412.19792v4

07-31

Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

Towards Inclusive ASR: Untersuchung der Sprachumwandlung für Dysarthric Speech Recognition in Low-Resource Sprachen

努力实现包容性的ASR:低资源语言中承认代谢语言语音转换调查

2505.14874v4

07-31

Explaining vague language

Unbestimmte Sprache erklären

解释含糊措辞

2404.18154v2

07-31

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

Geak: Einführung von Triton Kernel AI Agent & Evaluation Benchmarks

Geak:介绍Triton Kernel AI 代理和评估基准

2507.23194v1

07-31

EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts

EgoOops: Ein Datensatz zur Erkennung von Fehlern aus egozentrischen Videos, die sich auf Verfahrenstexte beziehen

EgoOops: 用于从 Egocentric 视频中检测错误动作的数据集, 指程序文字

2410.05343v3

07-31

Leveraging LLMs to Create Content Corpora for Niche Domains

LLMs nutzen, um Content Corpora für Niche Domains zu erstellen

利用LLMs为小众领域创建内容语料库

2505.02851v2

07-31

LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration

LENS: Lerne Ensemble Vertrauen aus neuralen Staaten für Multi-LLM-Antwortintegration

LENS:从神经国家学习多LLM应答整合的集合信任

2507.23167v1

07-31

Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Vision-Language-Modelle sind in Bezug auf Expression Generation nicht pragmatisch kompetent

视觉-语言模型在代言表达式生成中不具备实用能力

2504.16060v3

07-30 (3)

User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal

User Feedback in Human-LLM Dialogen: Ein Objektiv, um die Nutzer zu verstehen, aber laut als Lernsignal

人类-LLM 对话框中的用户反馈: 了解用户的镜头, 但将吵闹当作学习信号

2507.23158v1

07-30

Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer

Kann eine Größe für alle passen?: Messfehler in Multi-Document-Zusammenfassung Domain-Transfer

能够一刀切吗? :在多文件概括性文件转让中衡量失败

2503.15768v2

07-30

ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans

ISO-Bench: Benchmarking multimodaler Kausalität in visuellen Sprachmodellen durch verfahrenstechnische Pläne

ISO-Bench:通过程序计划确定视觉语言模型中多式因果关系基准

2507.23135v1

07-30

Meta CLIP 2: A Worldwide Scaling Recipe

Meta CLIP 2: Ein weltweites Scaling-Rezept

Meta CLIP 2: 全球规模扩大食谱

2507.22062v2

07-30

Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

Enthüllen der Fragilität von vertrauenswürdigen LLMs durch chinesische Text-Ambiguität

通过中文文字缩略图,揭开可信赖的LLM 易用性

2507.23121v1

07-30

RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL

RASL: 大规模数据库文本到 SQL 的检索增强的相连接表表

2507.23104v1

07-30

SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

SMART-Editor: Multi-Agenten-Framework für menschenähnliche Designbearbeitung mit struktureller Integrität

SMART-编辑:具有结构完整性的多机构设计设计框架

2507.23095v1

07-30

Context-aware Rotary Position Embedding

Context-aware Rotary Position Einbetten

扶轮位置嵌入式

2507.23083v1

07-30

Exploring In-Context Learning for Frame-Semantic Parsing

In-Context-Lernen für rahmensemantisches Parsing erforschen

探索用于框架语义分析的内文学习

2507.23082v1

07-30

Math Natural Language Inference: this should be easy!

Math Natural Language Inferenz: das sollte einfach sein!

Math自然语言推论:这应该很容易!

2507.23063v1

07-30

Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion

Conan: Ein Chunkwise Online-Netzwerk für Null-Shot Adaptive Voice Conversion

Conan:一个零热适应性语音转换的中远在线网络

2507.14534v3

07-30

Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review

Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Moslems in Large Language Models: A Systematic Review

减轻大语言模式中针对阿拉伯人和穆斯林的文化偏见的迅速工程技术:系统审查

2506.18199v2

07-30

Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning

Wo man Demos in Deinem Prompt zeigt: Ein positionelles Bias des In-Context-Lernens

在哪里显示您快速的演示 : 内容学习的定位偏见

2507.22887v1

07-30

C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

C3: Ein zweisprachiger Benchmark für gesprochene Dialogmodelle zur Erforschung von Herausforderungen in komplexen Gesprächen

C3:探讨复杂对话挑战的口头对话模式的双语基准

2507.22968v1

07-30

GeoOutageKG: A Multimodal Geospatiotemporal Knowledge Graph for Multiresolution Power Outage Analysis

GeoOutageKG: Ein multimodaler Geospatiotemporaler Wissensgraph für die Multiauflösungsanalyse von Stromausfällen

GeoOutageKG:多分辨率电源外向分析多式地球观测时知识图

2507.22878v1

07-30

FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models

FRED: Finanzielle Retrieval-erweiterte Erkennung und Bearbeitung von Halluzinationen in Sprachmodellen

FRED: 财务检索-加强发现和编辑语言模型中的幻觉

2507.20930v2

07-30

Past Meets Present: Creating Historical Analogy with Large Language Models

Vergangenheit trifft Gegenwart: Historische Analogie mit großen Sprachmodellen erstellen

过去曾出席的会议:创建具有大语言模式的历史分析

2409.14820v2

100

07-30

The Incomplete Bridge: How AI Research (Mis)Engages with Psychology

Die unvollendete Brücke: Wie KI-Forschung (Mis) mit Psychologie verstrickt

不完整的桥梁:人工智能如何研究(Miss)心理学的组合

2507.22847v1

101

07-30

ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer

ReverBERT: Ein State Space Model für eine effiziente textgesteuerte Sprachübertragung

ReverBERT: 高效发短信语音风格转让国家空间模型

2503.20992v2

102

07-30

Cross-Modal State-Space Graph Reasoning for Structured Summarization

Grenzüberschreitende State-Space-Graph-Gründung für strukturierte Zusammenfassung

结构归纳的跨模式国家空间图

2503.20988v2

103

07-30

Scaling RL to Long Videos

Skalierung von RL zu langen Videos

缩放 RL 到长视频

2507.07966v3

104

07-30

MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models

MiniLongBench: Der kostengünstige Long Context Benchmark für große Sprachmodelle verstehen

MiniLongBench:大语言模式低成本长方背景理解基准

2505.19959v2

105

07-30

Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization

Jenseits natürlicher Sprachpläne: Struktur-Bewusst-Planung für Abfrage-fokussierte Tabellenzusammenfassung

超越自然语言计划: 查询用户使用表的结构-软件规划

2507.22829v1

106

07-30

SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs

SpatialViz-Bench: Automatisch generierte räumliche Visualisierungs-Aufgaben für MLLMs

SpatialViz-Bench:MLLMs自动生成的空间可视化推理任务

2507.07610v3

107

07-30

DBLPLink 2.0 – An Entity Linker for the DBLP Scholarly Knowledge Graph

DBLPLink 2.0 – Ein Entity Linker für den DBLP-Wissenschaftsgraphen

DBLPLink 2.0 - DBLP 学术知识图的实体链接

2507.22811v1

108

07-30

IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation

IterKey: Iterative Keyword Generation mit LLMs für verbesserte retrieval Augmented Generation

IterKey: 利用LLMs进行迭代关键字生成,以增强检索增强生成

2505.08450v2

109

07-30

Towards the Law of Capacity Gap in Distilling Language Models

Auf dem Weg zum Gesetz der Kapazitätslücke bei der Destillierung von Sprachmodellen

迈向《语文模式再学习能力差距法》

2311.07052v4

110

07-30

MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanations

MFTCXplain: Mehrsprachiger Benchmark-Datensatz zur Bewertung der moralischen Vernunft von LLMs durch Hassreden-Multi-Hop-Erklärungen

MFTCXplain:通过仇恨言论多呼多呼解释评估LLMs道德理由的多语言基准数据集

2506.19073v2

111

07-30

DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router

DeepSieve: Informationen über LLM-as-a-Knowledge-Router

深筛选:通过LLM-as-a- knowledge-Router获取信息

2507.22050v2

112

07-30

GATEAU: Selecting Influential Samples for Long Context Alignment

GATEAU: Auswahl von einflussreichen Proben für lange Kontextausrichtung

GATEAU:为长期对齐选择有影响的样本

2410.15633v6

113

07-30

MASCA: LLM based-Multi Agents System for Credit Assessment

MASCA: LLM-basiertes Multi-Agenten-System zur Bonitätsbeurteilung

MASCA: 以LLM为基础的信用评估多边代理系统

2507.22758v1

114

07-30

Opportunities and Challenges of LLMs in Education: An NLP Perspective

Chancen und Herausforderungen von LLM im Bildungswesen: Eine NLP-Perspektive

教育中LLMs的机遇和挑战:NLP视角

2507.22753v1

115

07-30

CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

CUS-QA:以本地知识为主的不限成员名额问题解答数据集

2507.22752v1

116

07-30

Next Tokens Denoising for Speech Synthesis

Nächste Tokens Denoising für Sprachsynthese

下一集 Tokens 代言人演讲综述

2507.22746v1

117

07-30

Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index

Verringerung der Halluzinationen in der Zusammenfassung durch Verstärkungslernen mit Entity Halluzination Index

利用实体幻觉指数,通过强化学习减少在总结中的幻觉

2507.22744v1

118

07-30

Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning

Bewertungsprüfer: Bewertung der synthetischen Überprüfung für Code und Begründung

标定验证符:评估编码和理由的合成核查

2502.13820v3

119

07-30

Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning

Ressourceneffiziente Anpassung großer Sprachmodelle für Text-Embeddings über Prompt Engineering und Contrastive Fine-Tuning

通过即时工程和反竞争微调对文本嵌入大语言模型进行资源高效率的改编

2507.22729v1

120

07-30

Investigating Hallucination in Conversations for Low Resource Languages

Untersuchung von Halluzinationen in Gesprächen über Sprachen mit geringem Ressourcenreichtum

低资源语言对话中的幻觉

2507.22720v1

121

07-30

Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining

Erhöhung der Ultra-Low-Bit-Quantisierung großer Sprachmodelle durch Saliency-Aware Partial Retraining

通过提高质量-软件部分再培训,加强大语言模型的超低比小量量化

2504.13932v3

122

07-30

From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs

Von der Fähigkeit zur Reflexion: Stärkungsorientiertes Denken Qualität in retrieval-augmented Begründung für LLMs

从充足到反思:LLMs在追偿和增加理由方面的强化引导思考质量

2507.22716v1

123

07-30

UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis

UI-E2I-Synth: Weiterentwicklung der GUI-Grundierung mit großformatiger Instruktionssynthese

UI-E2I-Synth:以大型教学合成为基础推进图形界面

2504.11257v4

124

07-30

Spatial Language Likelihood Grounding Network for Bayesian Fusion of Human-Robot Observations

Raumsprache Likelihood Grounding Network für Bayesian Fusion von Mensch-Roboter-Beobachtungen

Bayesian人类-机器人观测融合空间语言定位网络

2507.19947v2

125

07-30

Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment

Hören auf das Unausgesprochene: Erforschen von 365 Aspekten der multimodalen Interview-Performance Assessment

聆听无语者:探索多模式访谈业绩评估的365方面

2507.22676v1

126

07-30

What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization

Wovon reden sie? Ein Benchmark der wissensgeprägten Diskussionszusammenfassung

他们在谈论什么?知识类讨论总结的基准

2505.12474v2

127

07-30

Instruction-tuned Large Language Models for Machine Translation in the Medical Domain

Instruktionsorientierte große Sprachmodelle für die maschinelle Übersetzung im medizinischen Bereich

医疗领域机器翻译大语言模型

2408.16440v2

128

07-30

QE4PE: Word-level Quality Estimation for Human Post-Editing

QE4PE: Qualitätsschätzung auf Word-Ebene für die menschliche Nachbearbeitung

QE4PE: 计算后人类的字级质量估算

2503.03044v2

129

07-30

Multilingual Political Views of Large Language Models: Identification and Steering

Mehrsprachige politische Ansichten von großen Sprachmodellen: Identifikation und Steuerung

大语言模式多语言多语言政治观点:识别和指导

2507.22623v1

130

07-30

Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation

Sprache Arithmetik: Auf dem Weg zur systemischen Sprache Neuronenidentifikation und Manipulation

语言解貌学:迈向系统语言中中子识别和操纵

2507.22608v1

131

07-30

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

UI-AGILE: Verbesserung von GUI-Agenten mit effektivem Verstärkungslernen und präziser Schlussfolgerungs-Zeiterdung

UI-AGILE: 提高具有有效强化学习和精确推断时间定位的图形代理器

2507.22025v2

132

07-30

BALSAM: A Platform for Benchmarking Arabic Large Language Models

BALSAM: Eine Plattform für Benchmarking arabischer Großsprachenmodelle

BALSAM:阿拉伯语大语言模式基准制定平台

2507.22603v1

133

07-30

Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation

Lernen, rationale Beweise durch Verstärkungslernen für die retrieval-angereicherte Generation zu extrahieren

学习如何通过为回收-提款一代人加强学习来提取合理证据

2507.15586v4

134

07-30

Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

Die Frontier of Vision-Language Models erkunden: Eine Übersicht aktueller Methoden und Zukunftsrichtungen

探索远景-语言模型的前沿:对当前方法和未来方向的调查

2404.07214v3

135

07-30

Efficient Continual Learning for Small Language Models with a Discrete Key-Value Bottleneck

Effizientes kontinuierliches Lernen für kleine Sprachmodelle mit einem diskreten Schlüsselwert-Bottleneck

高效持续学习具有分立键- Value 瓶颈的小语言模式

2412.08528v2

136

07-30

Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning

Effizientes Differentielles Privates Feintuning von LLMs durch Verstärkungslernen

通过强化学习对LLMs 进行有区别的私人高效率私人罚款

2507.22565v1

137

07-30

Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Nutzung synergistischer Kognitiv-Biasen zur Umgehung der Sicherheit in LLMs

利用协同协同一致的双星体在LLM中用于绕过安全

2507.22564v1

138

07-30

Rationale-guided Prompting for Knowledge-based Visual Question Answering

Rationale-geführte Aufforderung zur wissensbasierten visuellen Fragebeantwortung

以知识为基础的视觉问题解答

2412.16936v2

139

07-30

Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection

Co-AttenDWG: Co-Attentive Dimension-Wise-Gating und Expertenfusion für Multi-Modal-Offensive Content Detection

Co-AttenDWG:用于多模态冒犯性内容检测的协同注意力逐维门控和专家融合

2505.19010v2

140

07-30

ControlMed: Adding Reasoning Control to Medical Language Model

ControlMed: Reasoning Control in das medizinische Sprachmodell aufnehmen

控制Med:在医疗语文模式中增加理由控制

2507.22545v1

141

07-30

Pre-trained Models Perform the Best When Token Distributions Follow Zipf’s Law

Vortrainierte Modelle führen das Beste aus, wenn Token-Distributionen Zipfs Gesetz folgen

事先培训的模型按照Zipf法在配制时最佳表现

2507.22543v1

142

07-30

A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support

Benchmark Dataset und Evaluation Framework für vietnamesische Großsprachenmodelle im Kundensupport

越南客户支助大语言模式基准数据集和评价框架

2507.22542v1

143

07-30

Training language models to be warm and empathetic makes them less reliable and more sycophantic

Training Sprachmodelle warm und einfühlsam zu sein macht sie weniger zuverlässig und sykophantischer

培训语言模式,使其温暖和同情,使其不那么可靠,更具有共生性

2507.21919v2

144

07-30

CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records

CliCARE: Grounding Large Language Models in klinischen Richtlinien zur Entscheidungsunterstützung über Longitudinal Cancer Electronic Health Records

CliCARE:在纵向癌症电子健康记录决策支持临床指南中以大语言模式为基础

2507.22533v1

145

07-30

Yankari: A Monolingual Yoruba Dataset

Yankari: Einsprachiger Yoruba-Datensatz

Yankari:单语Yoruba数据集

2412.03334v2

146

07-30

Probing Information Distribution in Transformer Architectures through Entropy Analysis

Probing Information Distribution in Transformer-Architekturen durch Entropie-Analyse

通过 Entropy 分析在变形结构中进行测试信息发布

2507.15347v2

147

07-30

SLM-SQL: An Exploration of Small Language Models for Text-to-SQL

SLM-SQL: Eine Erforschung kleiner Sprachmodelle für Text-zu-SQL

SLM-SQL:探索文字到SQL的小型语言模型

2507.22478v1

148

07-30

Exploring Dynamic Parameters for Vietnamese Gender-Independent ASR

Dynamische Parameter für vietnamesische geschlechtsunabhängige ASR erkunden

探索越南性别独立ASR的动态参数

2507.22964v1

149

07-30

Voices of Freelance Professional Writers on AI: Limitations, Expectations, and Fears

Stimmen freiberuflicher Schriftsteller über KI: Einschränkungen, Erwartungen und Ängste

自由职业作家对大赦国际的呼声:限制、期望和恐惧

2504.05008v2

150

07-30

IFEvalCode: Controlled Code Generation

IFEvalCode: Kontrollierte Code-Generierung

IFEvalCode:受控制的代码生成

2507.22462v1

151

07-30

FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training

FineMedLM-o1: Verbesserung des medizinischen Wissens, das die Fähigkeit von LLM vom überwachten Feintuning bis zum Test-Time Training begründet

FineMedLM-o1:提高LLM从监督的精密教学到试验时间培训的医疗知识能力

2501.09213v3

152

07-30

What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models

Was ist ein “Abstract Reasoner”? Experimenten und Argumenten über große Sprachmodelle nachzuvollziehen

什么是“抽象理由” ? 关于大语言模型的重新审视实验和争论

2507.22457v1

153

07-30

Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance

Falcon-H1: Eine Familie hybrider Sprachmodelle zur Neudefinition von Effizienz und Leistung

Falcon-H1:调整效率和绩效的混合语言模式家庭

2507.22448v1

154

07-30

AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini

KI-generierte Geschichten begünstigen Stabilität gegenüber Veränderung: Homogenität und kulturelle Stereotypisierung in Erzählungen, die von gpt-4o-mini erzeugt werden

AI产生的故事有利于稳定而不是变化:在gpt-4o-mini产生的叙事中,同质性和文化陈规定型

2507.22445v1

155

07-30

BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition

BERSting at the Screams: Ein Maßstab für distanzierte, emotionale und erschrockene Spracherkennung

尖叫时发出尖叫声:远程、情感和呼喊语音识别基准

2505.00059v2

156

07-30

Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation

Mmm whatcha sagen? Enthüllen distale und proximale Kontexteffekte in der ersten und zweiten Sprache Wort Wahrnehmung mit psychophysischen umgekehrten Korrelation

使用心理物理反向关系,在第一和第二语言的词感中产生未发现和预期的环境效应

2406.05515v2

157

07-30

NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models

NeedleChain: Messung der Intact-Langkontext-Begründungsfähigkeit großer Sprachmodelle

NeedleChain:计量大语言模型的完整长上下文推理能力

2507.22411v1

158

07-30

Question Generation for Assessing Early Literacy Reading Comprehension

Fragegenerierung für die Bewertung des frühen Leseverständnisses

用于评估早期识字阅读理解的问题生成

2507.22410v1

159

07-30

R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

R2-KG:知识图表可靠理由通用双重目的机构框架

2502.12767v6

160

07-30

Reservoir Computing as a Language Model

Reservoir Computing als Sprachmodell

作为语言模式的 “ 储量计算 “ 模式

2507.15779v2

161

07-30

PATENTWRITER: A Benchmarking Study for Patent Drafting with LLMs

PATENTWRITER: Eine Benchmarking-Studie für die Patenterstellung mit LLMs

PATENTWRITER: 专利起草基准研究与LLMs

2507.22387v1

162

07-30

OWLViz: An Open-World Benchmark for Visual Question Answering

OWLViz: Ein Open-World-Benchmark für visuelle Fragen

OWLViz:视觉问答的开放世界基准

2503.07631v3

163

07-30

Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

Multimodale LLMs als maßgeschneiderte Reward-Modelle für die Text-zu-Image-Generierung

以多式多式LLMs作为生成文字到图像的自定制奖励模型

2507.21391v2

164

07-30

BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

BlockFFN: Auf dem Weg zur End-Side Acceleration-Friendly Mixture-of-Experts mit Chunk-Level-Aktivierung Sparsity

块块FFN: 向具有整块级激活分级的终端- 双极加速- 友好混合混合专家方向

2507.08771v2

165

07-30

Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors

Traits Run Deep: Verbesserung der Persönlichkeitsbeurteilung durch psychologisch geführte LLM-Darstellungen und multimodale Scheinverhalten

深层轨迹:通过心理学辅导LLM代表和多模式亲善行为,加强个性评估

2507.22367v1

166

07-30

Masked Language Models are Good Heterogeneous Graph Generalizers

Masked Language Models sind gute Heterogene Graph Generalizers

遮罩语言模型是好异基因图形缩略图

2506.06157v2

167

07-30

Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning

Nutzung von großen Sprachmodellen für Bengalische Mathematik-Wort-Probleme bei der Lösung der Kette der Gedankenveranlagung

利用大语言模型解决孟加拉语数学字词与思维链理性的解决问题

2505.21354v2

168

07-30

MuSciClaims: Multimodal Scientific Claim Verification

MuSciClaims: Multimodale wissenschaftliche Antragsprüfung

穆西索赔: 多式联运科学索赔核实

2506.04585v2

169

07-30

A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers

Eine umfassende Taxonomie der Negation für NLP und Neuralretriever

NLP和神经再研究综合清点分类

2507.22337v1

170

07-30

Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing

Prompt-Reverse Inkonsistenz: LLM Selbstinkonsistenz jenseits generativer Zufälligkeit und prompt Paraphrasierung

迅速反向不一致:LLM 自我不连贯,超越发生性随机和迅速划线

2504.01282v2

171

07-30

Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges

Natürliche Sprachverarbeitung für den Rechtsbereich: Eine Übersicht über Aufgaben, Datensätze, Modelle und Herausforderungen

法律领域自然语言处理:任务、数据集、模型和挑战调查

2410.21306v3

172

07-29 (2)

Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations

Intent Recognition und Out-of-Scope-Erkennung mit LLMs in Multi-Party-Konversationen

在多方对话中使用LLMs

2507.22289v1

173

07-29

Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs

Bedeutungsverstärkte Grammatik: Gradient Akzeptabilität formt die geometrischen Darstellungen von Konstruktionen in LLMs

含义内含语法:逐渐可接受性形状 LLM 中工程的几何表示法

2507.22286v1

174

07-29

CoEx – Co-evolving World-model and Exploration

CoEx – Co-evolving World-Modell und Exploration

CoEx – – 共同发展的世界模式和探索

2507.22281v1

175

07-29

Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

Verkörperte Web-Agenten: Überbrückung physikalisch-digitaler Realms für integrierte Agent-Intelligenz

嵌入式网络代理:为综合特工情报连接物理数字王国

2506.15677v3

176

07-29

Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Denoising Concept Vectors mit Sparse Autoencodern für verbesserte Sprachmodellsteuerung

用于改进语言模式指导的与斯普鲁斯自动编码器一起的失言概念矢量

2505.15038v2

177

07-29

Modeling Story Expectations to Understand Engagement: A Generative Framework Using LLMs

Modeling Story Erwartungen, Engagement zu verstehen: Ein generatives Framework mit LLMs

模拟故事期望以理解参与度:利用LLMs的生成框架

2412.15239v3

178

07-29

ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

ECG-Byte: Ein Tokenizer für die End-to-End Generative Elektrokardiogramm-Sprachenmodellierung

ECG-Byte: 用于端到端生成式心电图语言建模的分词器

2412.14373v3

179

07-29

GneissWeb: Preparing High Quality Data for LLMs at Scale

GneissWeb: Hochqualitative Daten für LLMs im Maßstab vorbereiten

GneissWeb: 为缩放 LLMs 准备高品质数据

2502.14907v2

180

07-29

LLM-as-a-qualitative-judge: automating error analysis in natural language generation

LLM-as-a-qualitative-Richter: Automatisierung der Fehleranalyse bei der Generierung natürlicher Sprachen

LLM-as-as-法官法官:在自然语言生成中进行自动误差分析

2506.09147v2

181

07-29

RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation

RL von Lehrer-Modell-Verfeinerung: Graduale Imitation Lernen für maschinelle Übersetzung

教师-模式改进:机器翻译逐步模拟学习

2507.22219v1

182

07-29

Can adversarial attacks by large language models be attributed?

Können feindliche Angriffe von großen Sprachmodellen zugeschrieben werden?

大型语言模式的对抗性攻击能否归结为对抗性攻击?

2411.08003v3

183

07-29

How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor?

Wie gut ist die Erst-Token-Entropie ungefähre Wort-Entropie als psycholinguistischer Vorhersager?

作为心理语言学预测指标,首词元熵对词熵的近似程度如何?

2507.22209v1

184

07-29

The role of media memorability in facilitating startups’ access to venture capital funding

Die Rolle der Medienerinnerung bei der Erleichterung des Zugangs von Start-ups zu Risikokapitalfinanzierungen

媒体记忆度在便利初创企业获得风险资本资金方面的作用

2507.22201v1

185

07-29

Basic Reading Distillation

Grundlesedestillation

基础阅读蒸馏

2507.19741v2

186

07-29

Explainability Through Systematicity: The Hard Systematicity Challenge for Artificial Intelligence

Erklärbarkeit durch Systematik: Die harte Systematik-Herausforderung für künstliche Intelligenz

系统化解释:人工智能的硬系统化挑战

2507.22197v1

187

07-29

Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation

Déjà Vu: Mehrsprachige LLM-Evaluierung durch die Lens of Machine Translation Evaluation

Déjà Vu:通过机器翻译评价的镜头进行多种语文LLM评价

2504.11829v3

188

07-29

A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models

Eine skalierbare Pipeline zur Schätzung von Verb Frame Frequenzen mit großen Sprachmodellen

使用大语言模型估算 Verb 框架频谱的可缩放管道

2507.22187v1

189

07-29

Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Positiv-Augmented Contrastive Learning für Vision-and-Language Evaluation und Training

愿景和语言评价和培训的积极强化反竞争学习

2410.07336v2

190

07-29

Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles

Persona-Augmented Benchmarking: Bewertung von LLMs über unterschiedliche Schreibstile hinweg

人人推基准定 : 评价各种写作风格

2507.22168v1

191

07-29

Strategic Deflection: Defending LLMs from Logit Manipulation

Strategische Durchbiegung: LLMs durch Logit-Manipulation verteidigen

战略抵消:保护LLMs免受逻辑操纵

2507.22160v1

192

07-29

IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian

IndoPref: Ein multi-Domain-Pairwise-Preference-Datensatz für Indonesisch

IndoPref:印度尼西亚多域对等优惠数据集

2507.22159v1

193

07-29

The Importance of Facial Features in Vision-based Sign Language Recognition: Eyes, Mouth or Full Face?

Die Bedeutung von Gesichtsfunktionen bei der visionsbasierten Erkennung von Zeichensprachen: Augen, Mund oder Gesicht?

面貌在基于愿景的手语识别中的重要性:眼、嘴还是脸?

2507.20884v2

194

07-29

Prompt Optimization and Evaluation for LLM Automated Red Teaming

Prompt Optimierung und Auswertung für LLM Automatisiertes Red Teaming

LLM自动红色小组迅速优化和评价

2507.22133v1

195

07-29

SAKE: Steering Activations for Knowledge Editing

SAKE: Steuerung von Aktivierungen für die Wissensbearbeitung

战略:知识编辑指导活动

2503.01751v2

196

07-29

UserBench: An Interactive Gym Environment for User-Centric Agents

UserBench: Eine interaktive Gym-Umgebung für User-Centric-Agenten

用户 Bench: 用户中心代理器的交互式 Gym 环境

2507.22034v1

197

07-29

FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression

FLAT-LLM: Feinkörnige Low-rank Aktivierung Raumtransformation für großsprachliche Modellkompression

FLAT-LLM: 用于大语言模型压缩的精制低级激活空间转换

2505.23966v3

198

07-29

SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers

SAND-Math: LLMs nutzen, um neuartige, schwierige und nützliche Mathematikfragen und -antworten zu generieren

SAND-Math:利用LLMs生成新创、困难和有用的数学问答

2507.20527v2

199

07-29

Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models

Vorhersage mikrobieller Ontologie und Pathogenrisiken durch Umweltmetadaten mit großen Sprachmodellen

利用大语言模型从环境元数据预测微生物本体和病原体风险

2507.21980v1

200

07-29

LIMO: Less is More for Reasoning

LIMO: Weniger ist mehr für Vernunft

LIMO: 较少的理由更多

2502.03387v3

201

07-29

Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation

Kulinarische Kreuzungen: Ein RAG-Rahmen zur Verbesserung der Vielfalt in der kulturübergreifenden Rezeptanpassung

烹饪十字路口:加强跨文化适应性适应多样性的RAG框架

2507.21934v1

202

07-29

Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory

LLM-Autoscoring-Verlässlichkeit in großformatigen Schriftbeurteilungen unter Verwendung von Generalisierbarkeitstheorien erkunden

利用概化理论探索大规模写作评估中LLM自动评分的可靠性

2507.19980v2

203

07-29

“Whose Side Are You On?” Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection

„Auf wessen Seite bist du?” Schätzung der Ideologie von Politik und Nachrichteninhalten mit großen Sprachmodellen und der Auswahl von Demonstrationsobjekten

“你站在谁一边?” 使用大语言模型和少样本示例选择估计政治和新闻内容的意识形态

2503.20797v2

204

07-29

Post-Training Large Language Models via Reinforcement Learning from Self-Feedback

Post-Training Große Sprachmodelle durch Stärkung Lernen aus Selbst-Feedback

培训后通过 “ 自我学习 “ 强化学习大语言模式

2507.21931v1

205

07-29

CHIMERA: A Knowledge Base of Scientific Idea Recombinations for Research Analysis and Ideation

CHIMERA: Eine Wissensbasis für wissenschaftliche Ideen-Rekombinationen für Forschungsanalyse und -Ideation

CHIMERA: 研究分析和衰变科学理念重组知识库

2505.20779v4

206

07-29

Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs

Rotes Lernen als nützlich betrachtet: Verallgemeinern über gemerkte Daten in LLMs

认为轮试学习有用:在LLMs中普遍使用记忆数据

2507.21914v1

207

07-29

SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs

SmoothRot: Kombination von Kanal-Weiss-Skalierung und Rotation für Quantisierungsfreundliche LLMs

平滑旋转: 将频道- Wise 缩放和旋转组合起来, 用于量化- 友好型LLMS

2506.05413v2

208

07-29

SLR: Automated Synthesis for Scalable Logical Reasoning

SLR: Automatisierte Synthese für skalierbare logische Vernunft

SLR: 用于可缩放逻辑理由的自动合成

2506.15787v3

209

07-29

Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning

Graph-R1: Auf dem Weg zu einem agentischen GraphRAG-Framework durch durchgängiges Ausbau-Lernen

图R1:通过端至端强化学习,迈向 “ 干点至端强化学习 “ 框架

2507.21892v1

210

07-29

FrugalRAG: Learning to retrieve and reason for multi-hop QA

FrugalRAG: Lernen zum Abrufen und Grund für Multi-Hop-QA

FrugalRAG:学会检索和多呼QA的理由

2507.07634v2

211

07-29

WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking

WakenLLM: Bewertung des Potenzials und der Stabilität von LLMs mittels feinkörniger Benchmarking

WakenLLM:通过细粒度基准评估LLM的推理潜力和稳定性

2507.16199v3

212

07-29

FB-RAG: Improving RAG with Forward and Backward Lookup

FB-RAG: Verbesserung der RAG durch Vorwärts- und Rückwärtsblick

FB-RAG:以前向和后向看改进RAG

2505.17206v2

213

07-29

AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning

AutoTIR: Autonome Tools Integriertes Reasoning durch Verstärkungslernen

AutoTIR:通过强化学习综合解释理由的自主工具

2507.21836v1

214

07-29

Introducing HALC: A general pipeline for finding optimal prompting strategies for automated coding with LLMs in the computational social sciences

Einführung von HALC: Eine allgemeine Pipeline für die Suche nach optimalen Promptenstrategien für die automatisierte Codierung mit LLMs in den Computational Social Sciences

介绍HALC:寻找计算社会科学中与LLMs自动编码的最佳加速战略的一般管道

2507.21831v1

215

07-29

EEG-CLIP : Learning EEG representations from natural language descriptions

EEG-CLIP : Lernen von EEG-Darstellungen aus natürlichen Sprachbeschreibungen

EEG-CLIP:从自然语言说明中学习EEG代表

2503.16531v2

216

07-29

Modelling Adjectival Modification Effects on Semantic Plausibility

Modellierung adjektiver Modifizierungseffekte auf die semantische Plausibilität

模拟弹道改变对语义等高可变性的影响

2507.21828v1

217

07-29

HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs

HRIPBench: Benchmarking von LLMs bei der Bereitstellung von Informationen zur Schadensreduzierung zur Unterstützung von Drogenkonsumenten

HRIPBENCH:在向吸毒者提供支助的减少危害信息提供中确定LLMs基准

2507.21815v1

218

07-29

Overview of ADoBo at IberLEF 2025: Automatic Detection of Anglicisms in Spanish

Übersicht über ADoBo bei IberLEF 2025: Automatische Erkennung von Anglizismen auf Spanisch

IberLEF 2025 ADoBo 概览:西班牙语中英语外来词的自动检测

2507.21813v1

219

07-29

ChartMark: A Structured Grammar for Chart Annotation

ChartMark: Eine strukturierte Grammatik für Chart-Annotation

图表 Mark: 用于图表注释的结构性语法

2507.21810v1

220

07-29

Task Arithmetic for Language Expansion in Speech Translation

Aufgabe Arithmetik für Spracherweiterung in der Sprachübersetzung

语音翻译中语言扩展任务

2409.11274v3

221

07-29

The Problem with Safety Classification is not just the Models

Das Problem der Sicherheitsklassifizierung sind nicht nur die Modelle

安全分类问题不仅仅是模型

2507.21782v1

222

07-29

Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages

Sparse Autoencoder können sprachspezifische Konzepte über verschiedene Sprachen hinweg erfassen

能够捕捉不同语言语言的特定语言概念的简单自定义者

2507.11230v2

223

07-29

AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models

AgriEval: Ein umfassender chinesischer Landwirtschafts-Benchmark für große Sprachmodelle

农业:中国大语言模式农业综合基准

2507.21773v1

224

07-29

Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal

Adversariale Verteidigung ohne Adversariale Verteidigung: Verbesserung der Sprachmodell Robustheit über Instanz-Ebene Hauptkomponentenentfernung

无需对抗防御的对抗防御:通过实例级主成分移除增强语言模型的鲁棒性

2507.21750v1

225

07-29

Image Captioning via Compact Bidirectional Architecture

Bildunterschrift über kompakte bidirektionale Architektur

通过紧凑双向架构进行图像描述

2201.01984v2

226

07-29

My Life in Artificial Intelligence: People, anecdotes, and some lessons learnt

Mein Leben in Künstlicher Intelligenz: Menschen, Anekdoten und einige Lektionen gelernt

我在人工智能中的生活:人、流浪者、以及一些经验教训

2504.04142v2

227

07-29

Technical Report of TeleChat2, TeleChat2.5 and T1

Technischer Bericht von TeleChat2, TeleChat2.5 und T1

TeleChat2、TeleChat2.5和T1技术报告

2507.18013v3

228

07-29

UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

UnsafeChain: Verbesserung der Modellsicherheit über Hard Cases

不安全Chain:通过困难案件加强说明理由的示范安全

2507.21652v1

229

07-29

Libra: Assessing and Improving Reward Model by Learning to Think

Libra: Bewertung und Verbesserung des Belohnungsmodells durch Lernen zu denken

利布拉:通过学习思考来评估和改进奖励模式

2507.21645v1

230

07-29

Probing then Editing Response Personality of Large Language Models

Probing dann Editing Response Persönlichkeit von großen Sprachmodellen

检验后编辑大语言模型的个性反应

2504.10227v2

231

07-29

Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search

Stratege: Selbstverbesserung der LLM-Entscheidungsfindung über die Bi-Level-Baumsuche

战略:通过双层树木搜索自我改善LLM决策

2408.10635v3

232

07-29

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Latent Adversarial Training verbessert Robustheit für persistente schädliche Verhalten in LLMs

潜在对抗训练提高LLM对持久性有害行为的鲁棒性

2407.15549v3

233

07-29

Multilingual JobBERT for Cross-Lingual Job Title Matching

Mehrsprachiger JobBERT für Cross-Lingual Job Title Matching

跨语言工作职称匹配多语言工作BERT

2507.21609v1

234

07-29

Pralekha: Cross-Lingual Document Alignment for Indic Languages

Pralekha: Cross-Lingual Document Alignment für indische Sprachen

Pralekha:印度语交叉语言文档协调

2411.19096v2

235

07-29

A Detailed Factor Analysis for the Political Compass Test: Navigating Ideologies of Large Language Models

Eine detaillierte Faktorenanalyse für den politischen Kompasstest: Navigieren von Ideologien großer Sprachmodelle

《政治指南测试的详细要素分析:掌握大语言模式的特征》

2506.22493v2

236

07-29

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

AIM: Adaptive Schlussfolgerung von Multi-Modal LLMs über Token Merging und Pruning

AIM:通过 Token 兼并和预留的多模式LMs的适应性推理

2412.03248v2

237

07-29

Evaluating the cognitive reality of Spanish irregular morphomic patterns: Humans vs. Transformers

Bewertung der kognitiven Realität der spanischen unregelmäßigen morphomischen Muster: Menschen vs. Transformers

评估西班牙非正常染色体模式的认知现实:人类与变异体

2507.21556v1

238

07-29

C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning

C2-Evo: Co-Evolving multimodale Daten und Modell zur Selbstverbesserung

C2-Evo:共同演进的多模式数据和自我改进理由模型

2507.16518v2

239

07-29

Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models

Linguistische und einbettende Profilierung von Texten, die von Menschen und großen Sprachmodellen erzeugt werden

人类和大语言模式产生的文本的语言和嵌入式图解

2507.13614v2

240

07-29

Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri

Achten Sie auf die Sprachlücke in digitalen Geisteswissenschaften: LLM-Aided Translation of SKOS Thesauri

注意数字人文中的语言差距:SKOS Thesauri的LLM辅助翻译

2507.19537v2

241

07-29

Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator

Zeichen als Zeichen: Ein retrieval-erweiterter Mehrsprachiger Zeichen-Generator

标为 Tokens 的符号: 一个检索增强的多语种手语手语生成器

2411.17799v3

242

07-29

MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation

MAGIC: Multi-Hop- und Graphenbasierter Benchmark für Inter-Kontext-Konflikte in der retrieval-generierten Generation

MAGIC: 检索增强生成中上下文间冲突的多跳、基于图的基准

2507.21544v1

243

07-29

Modern Uyghur Dependency Treebank (MUDT): An Integrated Morphosyntactic Framework for a Low-Resource Language

Modern Uighur Dependency Treebank (MUDT): Ein integriertes morphosyntaktisches Framework für eine ressourcenarme Sprache

现代维吾尔依赖性树库(MUDT): 一种低资源语言综合磷合成法框架

2507.21536v1

244

07-29

Automatic Classification of User Requirements from Online Feedback – A Replication Study

Automatische Klassifizierung der Benutzeranforderungen aus Online-Feedback – Eine Replikationsstudie

在线反馈用户要求自动分类 – – 复制研究

2507.21532v1

245

07-29

HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation

HIRAG: Hierarchisch-gedankte Instruktion-Tuning-Retrieval-Augmented Generation

HIRAG: 高层次研究教学-引导检索-推荐一代

2507.05714v2

246

07-29

TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling

TriangleMix: Ein verlustfreies und effizientes Aufmerksamkeitsmuster für den langen Kontext

三角组合:长期预填无损高效关注模式

2507.21526v1

247

07-29

Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting

Modellfreies Spekulatives Dekodieren für Transformer-basierte ASR mit Token-Map-Entwurf

采用 Token 地图起草的基于变换器的ASR无示范投机代号

2507.21522v1

248

07-29

Simulated patient systems are intelligent when powered by large language model-based AI agents

Simulierte Patientensysteme sind intelligent, wenn sie von großen modellbasierten AI-Agenten angetrieben werden

由大型语言模型型人工智能代理器供电时,模拟的病人系统是智能系统

2409.18924v3

249

07-29

What Does it Mean for a Neural Network to Learn a “World Model”?

Was bedeutet es für ein neurales Netzwerk, ein “Weltmodell” zu lernen?

神经网络学习“世界模型”意味着什么?

2507.21513v1

250

07-29

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Persona-Vektoren: Überwachung und Kontrolle von Charaktereigenschaften in Sprachmodellen

人向量:监测和控制语言模式中的字符轨迹

2507.21509v1

251

07-29

The Carbon Cost of Conversation, Sustainability in the Age of Language Models

Die CO2-Kosten des Gesprächs, Nachhaltigkeit im Zeitalter der Sprachmodelle

对话的碳成本、语言模式时代的可持续性

2507.20018v2

252

07-29

Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach

Zuverlässige Proof-Generation mit LLMs: Ein neuro-symbolischer Ansatz

努力利用LLM女士实现可靠的证据生产:神经-双曲方法

2505.14479v4

253

07-29

VN-MTEB: Vietnamese Massive Text Embedding Benchmark

VN-MTEB: Vietnamesisch Massiver Text Einbettung Benchmark

VN-MTEB:越南大规模文本嵌入基准

2507.21500v1

254

07-29

Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

Anreize für eine fortgeschrittene Instruktions-Folge von großen Sprachmodellen

为采用大语言模式的高级指示提供激励理由

2506.01413v5

255

07-29

Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning

Low-Confidence Gold: Verfeinerung von Low-Confidence-Proben für effizientes Instruktionstuning

低信任金:改进低信任金样本,以进行高效教学计费

2502.18978v4

256

07-29

Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering

Sem-DPO: Semantische Inkonsistenz bei der Preference-Optimierung für Prompt Engineering mindern

Sem-DPO: 减轻在优先优化即时工程方面的语义不一致现象

2507.20133v2

257

07-29

The pitfalls of next-token prediction

Die Fallstricke der Next-Token-Vorhersage

下吨预测的陷阱

2403.06963v3

258

07-29

Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs

Verbesserung der Aufgabenvielfalt bei der Label-Effizient überwachten Feinsteuerung von LLMs

在LLM的标签高效监督微调中改进任务多样性

2507.21482v1

259

07-29

Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench

Welche LLMs bekommen den Spaß? Mit HumorBench nicht-STEM-vernünftige Fähigkeiten beweisen

哪个LLMs得到的笑话?

2507.21476v1

260

07-29

BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data

BIG5-CHAT: LLM-Persönlichkeiten durch Schulung auf menschenverändernden Daten gestalten

BIG5-CHAT:通过提供人际数据培训塑造专业人才

2410.16491v3

261

07-29

Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning

Weiche Einspritzung von Task-Embeddings Outperforms Prompt-Based In-Context Learning

任务嵌入器的软输入超出迅速基于信息学习的绩效

2507.20906v2

262

07-29

Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour

Auf dem Weg zu lokal einsetzbaren, feinabgestimmten kausalen großen Sprachmodellen für das Verkehrsmittelwahlverhalten

以当地可部署的优质因果大语言模式进行模式选择行为

2507.21432v1

263

07-29

LLAMAPIE: Proactive In-Ear Conversation Assistants

LLAMAPIE: Proaktive In-Ear-Gesprächsassistenten

LLAMAPIE: 主动的在轨在轨对话助理

2505.04066v2

264

07-29

Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling

Bergbau-Intrinsische Belohnungen aus LLM-Hidden States für effiziente Best-of-N-Probenahme

LLM隐藏国为高效率最佳采样而从LLM公司获得的采矿内部奖赏

2505.12225v2

265

07-29

MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations

MemTool: Optimierung der Kurzzeit-Speicherverwaltung für dynamisches Werkzeug beim Aufrufen von LLM Agent Multi-Turn-Konversationen

MemTool:优化短期内存管理,以便利用动态工具在LLM代理多转对话中打电话

2507.21428v1

266

07-29

ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

ReGATE: Schneller und besser lernen mit weniger Token in MLLMs

ReGATE:在MLLM中用更少的词元学得更快更好

2507.21420v1

267

07-28 (1)

Teaching Language Models To Gather Information Proactively

Sprachmodelle lehren, um Informationen proaktiv zu sammeln

积极主动地收集资料的教学语言模式

2507.21389v1

268

07-28

Ai2 Scholar QA: Organized Literature Synthesis with Attribution

Ai2 Scholar QA: Organisierte Literatursynthese mit Attribution

Ai2学者QA:有组织文学综述与归属

2504.10861v2

269

07-28

Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge

Beyond the Reported Cutoff: Wo große Sprachmodelle auf finanzielles Wissen zurückfallen

超越报告的截止点:大语言模式对财务知识的缺陷

2504.00042v2

270

07-28

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Audio Flamingo 3: Advancing Audio Intelligence mit vollständig offenen großen Audio-Sprachen-Modelle

Audio Flamingo 3:以完全开放的大型音频语言模型推进音频智能

2507.08128v2

271

07-28

Turbocharging Web Automation: The Impact of Compressed History States

Turbocharging Web Automation: Die Auswirkungen von Komprimierten Geschichte Staaten

涡轮连载网络自动化:压缩历史国家的影响

2507.21369v1

272

07-28

StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation

StructText: Ein synthetischer Table-to-Text-Ansatz für Benchmark-Erzeugung mit multidimensionaler Bewertung

StructText:采用多维评价的基准生成的合成表到文本方法

2507.21340v1

273

07-28

A Deep Learning Automatic Speech Recognition Model for Shona Language

Ein Deep Learning automatische Spracherkennung Modell für Shona Sprache

Shona语言深学习自动语音识别模式

2507.21331v1

274

07-28

SQuat: Subspace-orthogonal KV Cache Quantization

SQuat: Subraum-orthogonale KV-Cache-Quantisierung

SQuat: 子空间正交 KV 缓存量化

2503.24358v2

275

07-28

Do Large Language Models Understand Morality Across Cultures?

Verstehen große Sprachmodelle Moral über Kulturen hinweg?

大语言模式是否理解各种文化的道德?

2507.21319v1

276

07-28

Can human clinical rationales improve the performance and explainability of clinical text classification models?

Können menschliche klinische Grundlagen die Leistungsfähigkeit und Erklärbarkeit klinischer Textklassifikationsmodelle verbessern?

人类临床原理能否改善临床文本分类模型的性能和解释性?

2507.21302v1

277

07-28

FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

FlagEvalMM: Flexibler Rahmen für eine umfassende multimodale Modellbewertung

FlagEvalMM:综合多模态模型评价灵活框架

2506.09081v3

278

07-28

Narrative Context Protocol: An Open-Source Storytelling Framework for Generative AI

Narrative Context Protocol: Ein Open Source Storytelling Framework für generative KI

叙述性背景议定书:开源的开源描述框架

2503.04844v5

279

07-28

Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues

Schulung von LLM-basierten Tutoren zur Verbesserung der Lernergebnisse von Studierenden in Dialogen

培训基于LLM的辅导员,以改善学生在对话中的学习成果

2503.06424v2

280

07-28

LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems

LeMix: Unified Scheduling für LLM-Training und Schlussfolgerung auf Multi-GPU-Systemen

LeMix:多GPU系统上LLM训练与推理的统一调度

2507.21276v1

281

07-28

Levels of Analysis for Large Language Models

Analyseebenen für große Sprachmodelle

大语言模式分析水平

2503.13401v2

282

07-28

CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting

CompoST: Ein Benchmark für die Analyse der Fähigkeit von LLMs, Fragen in einer QALD-Einstellung kompositorisch zu interpretieren

CompoST:在质量和限期设计中分析高管公司在组成上解释问题的能力的基准

2507.21257v1

283

07-28

Bangla BERT for Hyperpartisan News Detection: A Semi-Supervised and Explainable AI Approach

Bangla BERT für hyperparteiische Nachrichtenerkennung: Ein halbüberwachter und erklärbarer KI-Ansatz

超党派新闻探测孟加拉BERT:半监督和可解释的AI方法

2507.21242v1

284

07-28

Understanding Public Perception of Crime in Bangladesh: A Transformer-Based Approach with Explainability

Öffentliche Wahrnehmung der Kriminalität in Bangladesch verstehen: Ein transformerbasierter Ansatz mit Erklärbarkeit

了解孟加拉国公众对犯罪的认识:基于变革和可解释的方法

2507.21234v1

285

07-28

Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation

Multi-Agent-as-Judge: LLM-Agent-basierte automatisierte Evaluierung mit multidimensionaler menschlicher Bewertung ausrichten

多边代理法官:将LLM-基于代理的自动评价与多层次的人力评价统一起来

2507.21028v1

286

07-28

Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

Verbesserung der LLM-Vernunft mit iterativem DPO: Eine umfassende empirische Untersuchung

利用迭代DPO增强LLM推理:一项全面的实证研究

2503.12854v3

287

07-28

Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions

Bewertung der Versprechen und Fallstricke von LLMs bei Einstellungsentscheidungen

评估LLM在招聘决策中的前景与陷阱

2507.02087v2

288

07-28

Memorization in Fine-Tuned Large Language Models

Auswendiglernen in feinabgestimmten großen Sprachmodellen

微微调大语言模型的记忆

2507.21009v1

289

07-28

LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning

LoRA-PAR: Ein flexibler Dual-System-LoRA-Partitionsansatz für effizientes LLM-Feintuning

LoRA-PAR:面向高效LLM微调的灵活双系统LoRA划分方法

2507.20999v1

290

07-28

GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

GUI-G$^2$: Gaussian Reward Modeling für GUI Grounding

GUI-G$^2$:用于GUI定位的高斯奖励建模

2507.15846v3

291

07-28

Scaling Physical Reasoning with the PHYSICS Dataset

Skalierung der physikalischen Vernunft mit dem PHYSICS-Datensatz

利用PHYSICS数据集调整物理理由

2506.00022v3

292

07-28

Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands

Cog-TiPRO: Iterative Prompt-Verfeinerung mit LLMs zur Erkennung kognitiven Abbaus über longitudinale Sprachassistenten-Befehle

Cog-TiPRO:利用LLM迭代优化提示,通过纵向语音助手指令检测认知衰退

2505.17137v2

293

07-28

A Survey of Deep Learning for Geometry Problem Solving

Eine Umfrage über Deep Learning zur Lösung von Geometrieproblemen

解决几何问题深层学习调查

2507.11936v4

294

07-28

Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning

Mehrbildbeschreibungen für mehrsprachige, leichte Kognitive Impairment-Erkennung durch kontrastives Lernen enthüllen

通过差异学习发现多语种轻视认知缺陷的单形多语种描述

2505.17067v3

295

07-28

Your AI, Not Your View: The Bias of LLMs in Investment Analysis

Ihre KI, nicht Ihre Ansicht: Die Bias von LLMs in der Investitionsanalyse

您的AI, 而不是您的观点: 投资分析中LLM 的偏见

2507.20957v1

296

07-28

Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models

Mind the Gap: Konformative Dekodierung zur Verbesserung der Output-Vielfalt von instruction-tuned großen Sprachmodellen

注意差距:改进教学型大语言模式产出多样性的合规化配方

2507.20956v1

297

07-28

Dissecting Persona-Driven Reasoning in Language Models via Activation Patching

Persona-Driven Reasoning in Sprachmodellen per Aktivierungs-Patching auflösen

通过激活补丁在语言模型中通过激活补丁解剖人-人-驱动原因

2507.20936v1

298

07-28

LLM2TEA: An Agentic AI Designer for Discovery with Generative Evolutionary Multitasking

LLM2TEA: Agentischer AI-Designer für Entdeckung mit generativem evolutionären Multitasking

LLM2TEA: 利用产生进化多任务探索的代理AI 设计器

2406.14917v3

299

07-28

FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models

FHSTP@EXIST 2025 Benchmark: Sexismuserkennung mit transparenten Sprachkonzepten Engpassmodelle

FHSTP@EXIST 2025 基准:用透明言论概念瓶颈模型探测性别主义

2507.20924v1

300

07-28

MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

MediQAl: Eine französische medizinische Frage zur Beantwortung von Datensätzen für Wissens- und Begründungsbewertung

MediQAl:用于知识和合理评估的法国医学问题解答数据集

2507.20917v1

301

07-28

Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Benchmarking Open-Ended Audio Dialogue Understanding für große Audio-Language-Modelle

确定大型音频语言模型不限成员名额音频对话理解基准

2412.05167v2

302

07-28

Should Top-Down Clustering Affect Boundaries in Unsupervised Word Discovery?

Sollte Top-Down-Clustering Grenzen in unüberwachten Word Discovery beeinflussen?

在无人监督的“发现字”中, 上下层群集是否应该影响边界?

2507.19204v2

303

07-28

$A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement

$A^2R^2$: Verbesserung der Img2LaTeX-Umwandlung durch visuelles Reasoning mit aufmerksamkeitsgeführter Verfeinerung

$A^2R^2$:通过带有注意力引导细化的视觉推理推进Img2LaTeX转换

2507.20890v1

304

07-28

Enhancing Project-Specific Code Completion by Inferring Internal API Information

Verbesserung der projektspezifischen Code-Vervollständigung durch Schlussfolgerung interner API-Informationen

通过推断内部API信息加强具体项目法规的完成

2507.20888v1

305

07-28

Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings

Nutzung von Open-Source-Großsprachenmodellen für die Extraktion klinischer Informationen in ressourcenbeschränkten Einstellungen

利用开放源码大语言模型,在受资源限制的环境下进行临床信息采掘

2507.20859v1

306

07-28

A survey of diversity quantification in natural language processing: The why, what, where and how

Eine Übersicht der Diversitätsquantifizierung in der natürlichen Sprachverarbeitung: Das Warum, Was, Wo und Wie

自然语言处理中多样性量化调查:原因、内容、地点和方式

2507.20858v1

307

07-28

Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities

Sprachenmodellierung für die Zukunft der Finanzen: Eine Umfrage zu Metrics, Aufgaben und Datenmöglichkeiten

未来融资语言建模:计量、任务和数据机会调查

2504.07274v2

308

07-28

Latent Inter-User Difference Modeling for LLM Personalization

Latent Inter-User Difference Modeling für LLM Personalisierung

LLM个性化不同模型

2507.20849v1

309

07-28

Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models

Kritik des unreinen Grundes: Enthüllen des Argumentationsverhaltens medizinischer Großsprachenmodelle

简便理由的批评:统一医学大语言模式的推理行为

2412.15748v2

310

07-28

FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

FocalPO: Verbesserung der Preference-Optimierung durch Fokussierung auf korrekte Preference-Rankings

FocalPO:通过关注正确的偏好排序来增强偏好优化

2501.06645v3

311

07-28

Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models

Automatisieren der thematischen Überprüfung der Prävention von zukünftigen Todesfällen Berichte: Nachahmung der ONS-Kinder-Selbstmord-Studie mit großen Sprachmodellen

对预防今后死亡报告进行自动化专题审查:利用大语言模式复制ONS儿童自杀研究

2507.20786v1

312

07-28

On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

Über die Rolle von vorgebildeten Sprachmodellen in allgemeinen Text-Embeddings: Eine Umfrage

预训练语言模型在通用文本嵌入中的作用:综述

2507.20783v1

313

07-28

TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks

TN-AutoRCA:电信网络中自我改进基于警报的原始原因分析的基准建设和示范框架

2507.18190v2

314

07-28

The Impact of LoRA Adapters on LLMs for Clinical Text Classification Under Computational and Data Constraints

Die Auswirkungen von LoRA-Adaptern auf LLMs für die klinische Textklassifikation unter Computational und Data Constraints

LoRA适应器对在计算和数据限制下临床文本分类的LLMs的影响

2407.19299v3

315

07-28

Multilingual Self-Taught Faithfulness Evaluators

Mehrsprachige Selbstlernende Bewertung von Treue

多语言自学自学信仰评价员

2507.20752v1

316

07-28

Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study

Untersuchung struktureller Pruning- und Recovery-Techniken zur Komprimierung multimodaler Großsprachenmodelle: Eine empirische Studie

压缩多式大语言模式结构保护和恢复调查技术:经验研究

2507.20749v1

317

07-28

Everything is a Video: Unifying Modalities through Next-Frame Prediction

Alles ist ein Video: Vereinheitlichen von Modalitäten durch Next-Frame-Vorhersage

一切都是一部视频:通过下框架预测实现统一的方式

2411.10503v2

318

07-28

Group Sequence Policy Optimization

Optimierung der Gruppensequenzpolitik

组序列政策优化

2507.18071v2

319

07-28

Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models

Text2VLM: Anpassung von Text-Only-Datensätzen an die Auswertung von Alignment-Trainings in visuellen Sprachmodellen

Text2VLM: 调整纯文本数据集以评价视觉语言模型的对齐培训

2507.20704v1

320

07-28

Computational Analysis of Character Development in Holocaust Testimonies

Computational Analyse der Charakterentwicklung in Holocaust-Zeugnissen

大屠杀证词特征发展计算分析

2412.17063v4

321

07-28

When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification

Wenn Scale auf Vielfalt trifft: Bewertung von Sprachmodellen auf feinkörnige Mehrsprachigkeitsprüfung

规模达到多样性时:评价精细多语言索赔核实的语言模式

2507.20700v1

322

07-28

Geometric-Mean Policy Optimization

Geometrisch-Mean-Policy-Optimierung

几何平均策略优化

2507.20673v1

323

07-28

Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs

Benchmarking Graph Neural Networks für die Dokumentenlayout-Analyse in öffentlichen Angelegenheiten

用于公共事务文件布局分析的图表神经网络

2505.14699v2

324

07-28

Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study

Nachweis von unerwünschten Arzneimittelereignissen in niederländischen klinischen Textdokumenten mit Transformer-Modellen: Benchmark-Studie

利用变换模型发现荷兰临床免费文本文件中的不良毒品事件:基准研究

2507.19396v2

325

07-28

Ontology-Enhanced Knowledge Graph Completion using Large Language Models

Ontologie-erweiterte Wissensgraphenvervollständigung mit großen Sprachmodellen

利用大语言模式完成本部强化知识图

2507.20643v1

326

07-28

Explainable Synthetic Image Detection through Diffusion Timestep Ensembling

Erklärbare Synthetische Bilderkennung durch Diffusionszeitpunkt Zusammenbauen

通过传播时间步骤组合进行可解释的合成图像探测

2503.06201v2

327

07-28

Before the Outrage: Challenges and Advances in Predicting Online Antisocial Behavior

Vor der Empörung: Herausforderungen und Fortschritte bei der Vorhersage von Online-Antisozialverhalten

暴政前:预测在线反社会行为的挑战和进展

2507.20614v1

328

07-28

AutoLibra: Agent Metric Induction from Open-Ended Feedback

AutoLibra: Agent Metric Induktion aus offenem Feedback

AutoLibra: 不限名额反馈的计量介绍代理

2505.02820v2

329

07-28

ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning

ZSE-Cap: Ein Zero-Shot-Ensemble für Bildabruf und Prompt-gesteuerte Bildbeschriftung

ZSE-Cap: 用于图像检索和提示引导字幕生成的零样本集成

2507.20564v1

330

07-28

Enhancing Hallucination Detection via Future Context

Halluzinationserkennung durch zukünftigen Kontext verbessern

通过未来环境加强幻觉探测

2507.20546v1

331

07-28

From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought

Von Antworten zu Rationalen: Selbstjustierung multimodaler Vernunft mit answer-oriented Chain-of-Thought

从答案到理由:自调整的多式联运理由与以回答为主的探索链

2507.02984v2

332

07-28

Kimi K2: Open Agentic Intelligence

Kimi K2: Offene Agentische Intelligenz

Kimi K2:开放的智能体智能

2507.20534v1

333

07-28

SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law

SafeWork-R1: Koevolving Safety and Intelligence unter dem AI-45$^{\circ}$ Gesetz

SafeWork-R1:在AI-45$^{\circ}$定律下协同演进安全与智能

2507.18576v2

334

07-28

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Otter: Ein Multi-Modal-Modell mit In-Context-Anleitung Tuning

Otter:具有上下文指令微调的多模态模型

2305.03726v2

335

07-28

Dialogues of Dissent: Thematic and Rhetorical Dimensions of Hate and Counter-Hate Speech in Social Media Conversations

Dialoge von Dissent: Thematische und rhetorische Dimensionen von Hass und Gegenhass in Social Media-Gesprächen

异见对话:社交媒体对话中仇恨与反仇恨言论的主题和修辞维度

2507.20528v1

336

07-28

Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards

Versehentliche Verwundbarkeit: Faktoren beim Fine-Tuning, die die Schutzmechanismen des Modells verschieben

意外脆弱性:改变模式保障保障措施的微调因素

2505.16789v2

337

07-28

Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Sicherheitsherausforderungen bei der Bereitstellung von KI-Agenten: Einblicke aus einem groß angelegten öffentlichen Wettbewerb

AI 代理部署在安全方面面临的挑战:大规模公共竞争的展望

2507.20526v1

338

07-28

AQUA: A Large Language Model for Aquaculture & Fisheries

AQUA: Ein großes Sprachmodell für Aquakultur und Fischerei

AQUA:水产养殖和渔业大语言模式

2507.20520v1

339

07-28

Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

Weiterentwicklung großer Sprachmodelle für Tibetisch mit kuratierten Daten und kontinuierlichem Pre-Training

利用精选数据和持续预训练推进藏语大语言模型

2507.09205v4

340

07-28

REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models

REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle

REINFORCE++: 高效的RLHF对快速模型和奖励模型具有强力的测算法

2501.03262v7

341

07-28

Customize Multi-modal RAI Guardrails with Precedent-based predictions

Multimodale RAI-Guardrails mit vorausschauenden Vorhersagen anpassen

利用基于先例的预测定制多模态RAI护栏

2507.20503v1

342

07-28

Pruning for Performance: Efficient Idiom and Metaphor Classification in Low-Resource Konkani Using mBERT

Pruning for Performance: Effiziente Idiom- und Metaphor-Klassifikation in Low-Resource Konkani mit mBERT

为性能而剪枝:利用mBERT在低资源孔卡尼语中进行高效的习语和隐喻分类

2506.02005v2

343

07-28

Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems

Sprechen in Worten, Denken in Logik: Ein Dual-Process-Framework in QA-Systemen

用文字说,用逻辑思考:质量保证系统中的双重处理框架

2507.20491v1

344

07-28

Juru: Legal Brazilian Large Language Model from Reputable Sources

Juru: Rechtliches brasilianisches Large Language Model aus seriösen Quellen

Juru:来自有名来源的巴西大语言法律模型

2403.18140v2

345

07-28

Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents

Benutzer vor ihnen selbst schützen: Schutz der kontextuellen Privatsphäre in Interaktionen mit Gesprächspartnern

保护用户免受自我伤害:在与交流代理人的互动中保护环境隐私

2502.18509v2

346

07-28

Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM

Verbesserung der Ranking-Leistung beim Abruf ähnlicher Fälle durch erneute Betrachtung von RankSVM

通过重新审视RankSVM改进类案检索的排序性能

2502.11131v2

347

07-28

In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

In Prospect und Retrospect: Reflektierendes Speichermanagement für langfristige Personalisierte Dialogagenten

展望和回顾:长期个人化对话代理人的反思记忆管理

2503.08026v2

348

07-27 (7)

Critiques of World Models

Kritik an Weltmodellen

世界模型批判

2507.05169v3

349

07-27

CodeNER: Code Prompting for Named Entity Recognition

CodeNER: Codeaufforderung für die benannte Entitätserkennung

CodeNER:用于命名实体识别的代码提示

2507.20423v1

350

07-27

Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?

Umfrage zu NLU-Benchmarks Diagnose Linguistische Phänomene: Warum nicht Diagnose-Benchmarks standardisieren?

NLU基准诊断语言神话调查:为什么不使诊断基准标准化?

2507.20419v1

351

07-27

CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning

CONCAP: Über das Englische hinaussehen mit Konzepten Retrieval-Augmented Captioning

CONCAP: 借助概念检索增强的图像描述超越英语

2507.20411v1

352

07-27

Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

Clarify lernen: Multiturn-Gespräche mit aktionsbasiertem Kontrast-Selbst-Training

学习澄清:与基于行动的差异性自我培训进行多方向对话

2406.00222v2

353

07-27

Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification

Selbstregularisierung mit Sparse Autoencodern für steuerbare LLM-basierte Klassifizierung

利用稀疏自编码器的自正则化实现可控的基于LLM的分类

2502.14133v3

354

07-27

Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations

Kognitive Denkkette: Strukturierte multimodale Begründung über soziale Situationen

认知思考链:社会状况的结构性多模式原因

2507.20409v1

355

07-27

Length Representations in Large Language Models

Längendarstellungen in großen Sprachmodellen

大语言模型中的长度表示

2507.20398v1

356

07-27

Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation

Multi-Agent Retrieval-Augmented Framework for Evidence-based Counterspeech Against Health Misinformation

以证据为依据的反健康错误信息反言多证据检索强化框架

2507.07307v2

357

07-27

Memorization: A Close Look at Books

Auswendiglernen: Ein genauer Blick auf Bücher

记忆化:对书籍的近视

2504.12549v2

358

07-27

Scaling Analysis of Interleaved Speech-Text Language Models

Skalierungsanalyse interleaved Speech-Text Language Models

剖分间语音-文字语言模式扩大分析

2504.02398v2

359

07-27

RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing

RMTBench: Benchmarking von LLMs durch Multi-Turn-Benutzer-Centric-Rollenspiel

RMTBench:通过多发用户中心发挥作用,确定LLMs基准

2507.20352v1

360

07-27

DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns

DYNARTmo: Ein dynamisches Artikulationsmodell zur Visualisierung von Sprachbewegungsmustern

DYNARTmo:语音移动模式视觉化动态脉动模型

2507.20343v1

361

07-27

FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

FMSD-TTS: Few-Shot-Multi-Speaker-Multi-Dialekt-Text-to-Speech-Synthese für die Erzeugung von Sprachdatensätzen in Ü-Tsang, Amdo und Kham

FMSD-TTS:用于卫藏、安多和康巴语音数据集生成的小样本多说话人多方言文本到语音合成

2505.14351v2

362

07-27

ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios

ELMES: Ein automatisierter Rahmen für die Bewertung großer Sprachmodelle in Bildungsszenarien

ELMES:评估教育情景中大语言模式自动框架

2507.22947v1

363

07-27

What is Wrong with Perplexity for Long-context Language Modeling?

Was ist falsch an Perplexität für Langkontext-Sprachmodellierung?

长上下文语言建模中的困惑度有什么问题?

2410.23771v5

364

07-27

Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation

Förderung dialektischer arabischer zu moderner arabischer Standard-Maschinenübersetzung

向现代标准阿拉伯文机器翻译推广阿拉伯语

2507.20301v1

365

07-27

Real-time Factuality Assessment from Adversarial Feedback

Echtzeit-Faktualitätsbeurteilung aus dem Adversarial Feedback

从反反向反馈反馈中实时进行实况评估

2410.14651v3

366

07-27

SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration

SciToolAgent: Ein wissensbasierter wissenschaftlicher Agent für Multi-Tool-Integration

SciToolAgent: 多工具整合知识图表驱动科学代理

2507.20280v1

367

07-27

What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations

Welche Sprache(n) denkt Aya-23? Wie Mehrsprachigkeit die Repräsentationen der internen Sprache beeinflusst

Aya-23 思考什么语言?多语言如何影响内部语言代表性?

2507.20279v1

368

07-27

Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Agentar-Fin-R1: Verbesserung der Finanzintelligenz durch Domain-Expertise, Trainingseffizienz und Advanced Reasoning

Agentar-Fin-R1:通过领域专业知识、训练效率和高级推理增强金融智能

2507.16802v4

369

07-27

MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning

MoL-RL: Destillieren von mehrstufigem Umweltfeedback in LLMs zur feedbackunabhängigen Begründung

MoL-RL:将多层环境反馈保留到LLMs,用于提供反馈-独立理由

2507.20278v1

370

07-27

ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

ChildGuard: Ein spezialisierter Datensatz zur Bekämpfung von gegen Kinder gerichteter Hassrede

ChildGuard:打击针对儿童的仇恨言论的专门数据集

2506.21613v2

371

07-27

EMBRACE: Shaping Inclusive Opinion Representation by Aligning Implicit Conversations with Social Norms

EMBRACE: Inclusive Opinion Representation gestalten, indem implizite Gespräche mit sozialen Normen ausgerichtet werden

EMBRACE:通过与社会规范的关联性交流,形成包容性的见解代表制

2507.20264v1

372

07-27

Post-Completion Learning for Language Models

Post-Completion-Lernen für Sprachmodelle

语文模式完成后学习

2507.20252v1

373

07-27

Modeling Professionalism in Expert Questioning through Linguistic Differentiation

Modellierung von Professionalität in der Expertenbefragung durch sprachliche Differenzierung

通过语言差异问题专家提问的示范专业精神

2507.20249v1

374

07-27

Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers

Contrast-CAT: Kontrastierende Aktivierungen für verbesserte Interpretierbarkeit in Transformer-basierten Textklassifikatoren

反对-CAT:在基于变换器的文本分类中增强解释力的对比活动

2507.21186v1

375

07-27

Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models

Reframe Your Life Story: Interaktiver Erzähltherapeut und innovative Moment-Assessment mit großen Sprachmodellen

重构你的生活故事:与大语言模式互动叙述治疗师和创新时间评估

2507.20241v1

376

07-27

DoubleDipper: Improving Long-Context LLMs via Context Recycling

DoubleDipper: Verbesserung der Langkontext-LLMs über Kontext-Recycling

双重顶点:通过上下文再循环改进长文本LLMs

2406.13632v4

377

07-27

Understanding Learner-LLM Chatbot Interactions and the Impact of Prompting Guidelines

Lernende-LLM-Chatbot-Interaktionen verstehen und die Auswirkungen von Sofortrichtlinien verstehen

了解学习者-LLLM 聊天室互动和推动准则的影响

2504.07840v3

378

07-27

Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation

Co-NAML-LSTUR: Ein kombiniertes Modell mit attentivem Multi-View-Lernen und Langzeit- und Kurzzeit-Benutzervertretungen für News-Empfehlungen

Co-NAML-LSTUR:结合注意力多视图学习与长短期用户表示的新闻推荐组合模型

2507.20210v1

379

07-27

IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs

IQ-Test für LLMs: Ein Bewertungsrahmen für die Entdeckung von Kernkompetenzen in LLMs

LLM的IQ测试:揭示LLM核心技能的评估框架

2507.20208v1

380

07-27

Cheap Learning: Maximising Performance of Language Models for Social Data Science Using Minimal Data

Günstiges Lernen: Maximierung der Leistungsfähigkeit von Sprachmodellen für die Sozialdatenwissenschaft mit minimalen Daten

廉价学习:利用最低数据使社会数据科学语言模型的绩效最大化

2401.12295v2

381

07-27

Diversity-Enhanced Reasoning for Subjective Questions

Diversity-Enhanced Reasoning für subjektive Fragen

主观问题的多样性强化理由

2507.20187v1

382

07-27

SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding

SessionIntentBench: Ein Multi-Task Inter-Session Intention-Shift Modelling Benchmark für E-Commerce Kundenverhalten Verständnis

SessionIntentBench:面向电子商务客户行为理解的多任务跨会话意图转移建模基准

2507.20185v1

383

07-27

SGPO: Self-Generated Preference Optimization based on Self-Improver

SGPO: Selbsterzeugte Preference-Optimierung auf Basis von Self-Improver

SGPO:基于自我改造的自发优惠优化

2507.20181v1

384

07-27

Checklist Engineering Empowers Multilingual LLM Judges

Checkliste Engineering Empowers Mehrsprachige LLM-Richter

清单工程赋能多语种LLM评判器

2507.06774v2

385

07-27

Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective

Intersektionale Bias in japanischen großen Sprachmodellen aus einer kontextualisierten Perspektive

从语境化视角看日语大语言模型中的交叉性偏见

2506.12327v2

386

07-27

Goal Alignment in LLM-Based User Simulators for Conversational AI

Zielausrichtung in LLM-basierten Benutzersimulatoren für Conversational AI

面向对话式AI的基于LLM的用户模拟器中的目标对齐

2507.20152v1

387

07-27

The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models

The Policy Cliff: Eine theoretische Analyse von Belohnungs-Policy-Karten in großen Sprachmodellen

政策悬崖:大语言模式奖励政策图的理论分析

2507.20150v1

388

07-27

Multi-Agent Interactive Question Generation Framework for Long Document Understanding

Multi-Agent Interactive Question Generierung Framework for Long Document Understanding

长期文件理解问题多机构互动问题生成框架

2507.20145v1

389

07-27

Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG

Multi-Stage Verifikations-Centric Framework zur Eindämmung der Halluzination in Multi-Modal RAG

多模式RAG中减轻幻觉多阶段核查-中心框架

2507.20136v1

390

07-27

EvoSLD: Automated Neural Scaling Law Discovery With Large Language Models

EvoSLD: Automatisierte Neural Scaling Law Discovery mit großen Sprachmodellen

EvoSLD: 用大语言模型发现自动神经放大法

2507.21184v1

391

07-27

When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

Wann funktioniert Metadata Conditioning (NOT) für Sprachmodell-Vorschulungen? Eine Studie mit kontextfreien Grammatiken

元数据条件(NOT)何时能为语言示范培训前培训提供语言示范?无背景语法研究

2504.17562v2

392

07-27

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

MaPPO: Maximale Posteriori-Preference-Optimierung mit vorherigem Wissen

MaPPO: 结合先验知识的最大后验偏好优化

2507.21183v1

393

07-27

TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling

TIB-STC: Ein großflächiger strukturierter tibetischer Benchmark für ressourcenarme Sprachmodellierung

TIB-STC: 低资源语言建模的西藏大型结构化基准

2503.18288v4

394

07-27

Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

Seed LiveInterpret 2.0: End-to-End simultane Speech-to-Speech-Übersetzung mit Ihrer Stimme

种子实况解释2.0:用声音翻译终端到终端同声语音语音

2507.17527v3

395

07-27

Measuring Information Distortion in Hierarchical Ultra long Novel Reconstruction: The Optimal Expansion Ratio

Messung der Informationsverzerrung bei hierarchischer Rekonstruktion ultralanger Romane: Das optimale Expansionsverhältnis

测量高层次超长新世纪重建中的信息扭曲:最佳扩展比率

2505.12572v2

396

07-27

Language Models Resist Alignment: Evidence From Data Compression

Sprachmodelle widerstehen Ausrichtung: Beweise aus Datenkompression

语言模型阻力对齐:数据压缩中的证据

2406.06144v5

397

07-27

AI-Driven Generation of Old English: A Framework for Low-Resource Languages

AI-Driven Generation of Old English: Ein Rahmen für Low-Resource-Sprachen

AI-Driven 一代老英语:低资源语言框架

2507.20111v1

398

07-27

Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Emergent Semantics Beyond Token Embeddings: Transformer LMs mit gefrorenen visuellen Unicode-Darstellungen

超越词元嵌入的涌现语义: 采用冻结视觉Unicode表示的Transformer语言模型

2507.04886v2

399

07-27

EcoTransformer: Attention without Multiplication

EcoTransformer: Attention ohne Multiplikation

EcoTransformer:无需乘法的注意力

2507.20096v1

400

07-27

ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

ProsodyLM: Enthüllen der neu entstehenden Prosody-Verarbeitungsfähigkeiten in Sprachmodellen

ProsodyLM: 解决语言模式中新出现的处理能力问题

2507.20091v1

401

07-27

Reinforcement learning fine-tuning of language model for instruction following and math reasoning

Reinforcement-Learning-Feinabstimmung eines Sprachmodells für Anweisungsbefolgung und mathematisches Schlussfolgern

强化学习,微调用于教学的语文模式和数学推理

2506.21560v2

402

07-26 (6)

The Devil is in the EOS: Sequence Training for Detailed Image Captioning

Der Teufel ist im EOS: Sequenztraining für detaillierte Bildunterschriften

魔鬼在EOS:详细图像说明的序列训练

2507.20077v1

403

07-26

PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training

PITA: Präferenz-geführte Inferenz-Zeit-Ausrichtung für LLM nach dem Training

PITA:LLM培训后培训的优先指导推论-时间协调

2507.20067v1

404

07-26

RAG in the Wild: On the (In)effectiveness of LLMs with Mixture-of-Knowledge Retrieval Augmentation

RAG in the Wild: Über die (In)Wirksamkeit von LLMs mit Mixture-of-Knowledge Retrieval Augmentation

真实场景中的RAG:论混合知识检索增强下LLM的(无)效性

2507.20059v1

405

07-26

A Tensor-Based Compiler and a Runtime for Neuron-Level DNN Certifier Specifications

Ein Tensor-basierter Compiler und eine Laufzeit für die Spezifikationen des Neuron-Level DNN Certifier

一个基于 Tensor 的编纂器和中子级别 DNN 验证符规格的运行时间

2507.20055v1

406

07-26

$K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning

$K^4$: Online-Log-Anomalieerkennung durch unüberwachtes Typikalitätslernen

$K^4$:通过无监督典型性学习进行在线日志异常检测

2507.20051v1

407

07-26

AI as a deliberative partner fosters intercultural empathy for Americans but fails for Latin American participants

KI als beratender Partner fördert interkulturelles Empathie für Amerikaner, scheitert aber für lateinamerikanische Teilnehmer

作为审议伙伴的人工智能能促进美国人的跨文化同理心,但对拉丁美洲参与者却未能奏效

2504.13887v2

408

07-26

Infogen: Generating Complex Statistical Infographics from Documents

Infogen: Erzeugen komplexer statistischer Infografiken aus Dokumenten

信息源:从文件生成复杂的统计图表

2507.20046v1

409

07-26

Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs

Kolumbianische Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Empfehlungen von LLMs

Colombian Waitresses y Jueces canadienses:LLM职业推荐中的性别与国别偏见

2505.02456v2

410

07-26

FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression

FAEDKV: Unendliche Window Fourier-Transformation für unvoreingenommene KV-Cache-Kompression

FAEDKV: 用于无偏KV缓存压缩的无限窗口傅里叶变换

2507.20030v1

411

07-26

Selective Prompt Anchoring for Code Generation

Selektive Prompt-Ankerung für die Code-Generierung

用于代码生成的选择性提示锚定

2408.09121v6

412

07-26

Preference learning made easy: Everything should be understood through win rate

Vorliebe Lernen leicht gemacht: Alles sollte durch Win-Rate verstanden werden

首选学习容易:人人都应通过双赢率来理解一切

2502.10505v2

413

07-26

Anomaly Detection in Human Language via Meta-Learning: A Few-Shot Approach

Anomalieerkennung in der menschlichen Sprache durch Meta-Learning: Ein Few-Shot-Ansatz

通过元学习进行人类语言异常检测:一种小样本方法

2507.20019v1

414

07-26

A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Eine Praxis des Post-Trainings auf Llama-3 70B mit optimaler Auswahl des zusätzlichen Sprachmischverhältnisses

Llama-3-70B培训后做法,最佳选择其他语言混合比率

2409.06624v2

415

07-26

MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning

MeTHanol: Modularisiertes Denken von Sprachmodellen mit Intermediate Layer Thinking, Decodierung und Bootstrapping Reasoning

MeTHanol:具有中间层思考、解码和自举推理的模块化思维语言模型

2409.12059v5

416

07-26

VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering

VLQA: Der erste umfassende, große und hochqualitative vietnamesische Datensatz für die Beantwortung rechtlicher Fragen

VLQA:用于法律问题解答的第一综合、大、高质量越南数据集

2507.19995v1

417

07-26

Improving the Performance of Sequential Recommendation Systems with an Extended Large Language Model

Verbesserung der Leistungsfähigkeit sequentieller Empfehlungssysteme mit einem erweiterten Großsprachenmodell

利用扩展大语言模式改进序列建议系统的绩效

2507.19990v1

418

07-26

Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge

Robustes Daten-Wasserzeichen in Sprachmodellen durch Einspritzen fiktiver Kenntnisse

在语言模型中,通过输入有说服力的知识在语言模型中进行强力数据水上标记

2503.04036v3

419

07-26

Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization

Leveraging Fine-Tuned Large Language Models for Interpretable Pankreatic Cystic Lesion Feature Extraction and Risk Categorization

利用微调大语言模型进行可解释的胰腺囊性病变特征提取和风险分类

2507.19973v1

420

07-26

Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text

Text2Vis: Ein anspruchsvolles und vielfältiges Benchmark zur Generierung multimodaler Visualisierungen aus Text

Text2Vis: 从文本中生成多式视觉化的质疑性和多样化基准

2507.19969v1

421

07-26

KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models

KLAAD: Verfeinerung von Aufmerksamkeitsmechanismen zur Reduzierung gesellschaftlicher Bias in generativen Sprachmodellen

KLAAD: 改进注意力机制以减少生成式语言模型中的社会偏见

2507.19962v1

422

07-26

Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs

跨语言旅行:多模式LLM中跨语言一致基准

2505.15075v4

423

07-26

Large Language Models Are Human-Like Internally

Große Sprachmodelle sind menschlich-innerlich

大语言模型是人与人之间的内部大语言模型

2502.01615v2

424

07-26

Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA

Aufmerksamkeitsköpfe vor dem Zusammenführen ausrichten: Ein effektiver Weg, MHA in GQA umzuwandeln

合并前先对齐注意力头:将MHA转换为GQA的有效方法

2412.20677v2

425

07-26

Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation

Bewertung der Zuverlässigkeit von LLM-Annotationen im Kontext der demografischen Bias und Modellerklärung

结合人口偏见和示范解释评估LLM 说明的可靠性

2507.13138v2

426

07-26

Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report

Frontier AI Risk Management Framework in der Praxis: Ein technischer Bericht zur Risikoanalyse

前沿AI风险管理框架实践:风险分析技术报告

2507.16534v2

427

07-26

The Impact of Fine-tuning Large Language Models on Automated Program Repair

Die Auswirkungen von Feinabstimmungen großer Sprachmodelle auf die automatisierte Programmreparatur

微调大语言模型对自动程序修复的影响

2507.19909v1

428

07-26

CaliDrop: KV Cache Compression with Calibration

CaliDrop: KV Cache-Kompression mit Kalibrierung

CaliDrop: KV 缓存压缩加校准

2507.19906v1

429

07-26

A Gold Standard Dataset and Evaluation Framework for Depression Detection and Explanation in Social Media using LLMs

Ein Gold Standard Datensatz und Evaluation Framework für Depression Erkennung und Erklärung in Social Media mit LLMs

使用LLM在社交媒体中检测和解释抑郁症的黄金标准数据集和评估框架

2507.19899v1

430

07-26

Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs

Automatisieren der mathematischen Proof-Generierung mit Large Language Model Agents und Wissensgraphen

使用大语言模型智能体和知识图谱自动生成数学证明

2503.11657v2

431

07-26

Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam

Zero-shot Leistung von Generative KI in brasilianischer portugiesischer medizinischer Prüfung

生成式AI在巴西葡萄牙语医学考试中的零样本表现

2507.19885v1

432

07-26

Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

Kausale Suffizienz und Notwendigkeit verbessern Chain-of-Thought-Reasoning

因果充分性和必要性改进思维链推理

2506.09853v2

433

07-26

FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models

FactReasoner: Ein probabilistischer Ansatz zur Langform-Faktivitätsbewertung für große Sprachmodelle

FactReasoner:面向大语言模型长文本事实性评估的概率方法

2502.18573v2

434

07-26

The Polish Vocabulary Size Test: A Novel Adaptive Test for Receptive Vocabulary Assessment

Der polnische Vokabular-Größentest: Ein neuartiger adaptiver Test für die rezeptive Vokabular-Bewertung

波兰词汇大小测试:接受词汇评估的新适应性测试

2507.19869v1

435

07-26

DRIVE: Disfluency-Rich Synthetic Dialog Data Generation Framework for Intelligent Vehicle Environments

DRIVE: Disfluency-Rich Synthetic Dialog Data Generierung Framework für intelligente Fahrzeugumgebungen

DRIVE: 面向智能车辆环境的富含不流利现象的合成对话数据生成框架

2507.19867v1

436

07-26

Agentic Reinforced Policy Optimization

Agentische verstärkte Policy-Optimierung

智能体强化策略优化

2507.19849v1

437

07-26

Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs

Fehlausrichtung des Common Ground im zielorientierten Dialog verstehen: Eine Fallstudie mit Ubuntu Chat Logs

理解目标导向对话框中的共同点不匹配:与Ubuntu聊天日志的案例研究

2503.12370v2

438

07-26

AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition

AutoSign: Direkte Pose-zu-Text-Übersetzung für die kontinuierliche Erkennung von Zeichensprachen

AutoSign: 用于连续手语识别的直接姿态到文本翻译

2507.19840v1

439

07-26

HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs

HCAttention: Extreme KV-Cache-Kompression durch heterogenes Attention-Computing für LLMs

HCAttention:通过不同式注意计算法对LLMs进行极端KV缓存压缩

2507.19823v1

440

07-26

A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy

Ein strukturierter Bangla-Datensatz von Krankheits-Symptom-Assoziationen zur Verbesserung der Diagnosegenauigkeit

用于提高诊断准确性的疾病-症状关联结构化孟加拉语数据集

2506.13610v3

441

07-26

LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

LLM-Barber: Block-Aware Rebuilder für Sparsity Maske in One-Shot für große Sprachmodelle

LLM-Barber:面向大语言模型的一次性稀疏掩码块感知重建器

2408.10631v2

442

07-26

Flora: Effortless Context Construction to Arbitrary Length and Scale

Flora: Müheloser Kontext Aufbau zu willkürlicher Länge und Skala

Flora: 轻松构建任意长度和规模的上下文

2507.19786v1

443

07-26

UloRL: An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models’ Reasoning Abilities

UloRL: Ein Reinforcement-Learning-Ansatz mit ultralangen Ausgaben zur Verbesserung der Reasoning-Fähigkeiten großer Sprachmodelle

UloRL: 提升大语言模型推理能力的超长输出强化学习方法

2507.19766v1

444

07-26

Are You There God? Lightweight Narrative Annotation of Christian Fiction with LMs

Bist du da, Gott? Leichtgewichtige narrative Annotation christlicher Belletristik mit LMs

上帝,你在吗?使用语言模型对基督教小说进行轻量级叙事标注

2507.19756v1

445

07-26

JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models

JT-Math: Ein Multi-Stage-Framework für fortgeschrittenes mathematisches Schlussfolgern in großen Sprachmodellen

JT-Math:面向大语言模型高级数学推理的多阶段框架

2507.19748v1

446

07-26

Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation

Stellen Sie Ihre Crew zusammen: Automatisches Multi-Agenten-Kommunikationstopologie-Design über autoregressive Graphen-Generierung

组建你的团队:通过自回归图生成自动设计多智能体通信拓扑

2507.18224v2

447

07-25 (5)

Ta-G-T: Subjectivity Capture in Table to Text Generation via RDF Graphs

Ta-G-T: Subjektivitätserfassung in Tabelle zur Textgenerierung über RDF Graphen

Ta-G-T:通过RDF图在表格到文本生成中捕获主观性

2507.19710v1

448

07-25

Scalable MatMul-free Language Modeling

Skalierbare MatMul-freie Sprachmodellierung

可扩展的无MatMul语言建模

2406.02528v7

449

07-25

Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks

Towards Inclusive NLP: Bewertung komprimierter Mehrsprachiger Transformer über unterschiedliche Sprach-Benchmarks

迈向包容性NLP:在多种语言基准上评估压缩的多语言Transformer

2507.19699v1

450

07-25

Salsa as a Nonverbal Embodied Language – The CoMPAS3D Dataset and Benchmarks

Salsa als nonverbale verkörperte Sprache – Der CoMPAS3D-Datensatz und Benchmarks

Salsa作为一种非语言的具身语言 – CoMPAS3D数据集和基准

2507.19684v1

451

07-25

Navigating the Risks of Using Large Language Models for Text Annotation in Social Science Research

Umgang mit den Risiken der Verwendung großer Sprachmodelle für die Textannotation in der sozialwissenschaftlichen Forschung

应对在社会科学研究中使用大语言模型进行文本标注的风险

2503.22040v2

452

07-25

Benchmarking Linguistic Diversity of Large Language Models

Benchmarking Linguistische Vielfalt großer Sprachmodelle

衡量大语言模式语言多样性的基准

2412.10271v2

453

07-25

Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs

Haben große Sprachmodelle einen englischen Akzent? Bewertung und Verbesserung der Natürlichkeit von mehrsprachigen LLMs

大语言模型有英语口音吗?评估和改进多语言LLM的自然度

2410.15956v3

454

07-25

RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams

RoD-TAL: Ein Benchmark für die Beantwortung von Fragen in rumänischen Führerscheinprüfungen

RoD-TAL:在罗马尼亚驾驶执照考试中回答问题的基准

2507.19666v1

455

07-25

Code-Switching and Syntax: A Large-Scale Experiment

Code-Switching und Syntax: Ein groß angelegtes Experiment

语码转换与句法:大规模实验

2506.01846v2

456

07-25

Minimal Pair-Based Evaluation of Code-Switching

Minimale Pair-basierte Auswertung von Code-Switching

基于最小对的语码转换评估

2506.01840v2

457

07-25

Summarization of Opinionated Political Documents with Varied Perspectives

Zusammenfassung opinionierter politischer Dokumente mit unterschiedlichen Perspektiven

具有不同观点的有见解的政治文件概述

2411.04093v2

458

07-25

OneShield – the Next Generation of LLM Guardrails

OneShield – die nächste Generation der LLM-Guardrails

OneShield – 下一代LLM护栏

2507.21170v1

459

07-25

Data Caricatures: On the Representation of African American Language in Pretraining Corpora

Datenkarikaturen: Zur Darstellung der afroamerikanischen Sprache in Vortrainingskorpora

数据漫画:论非裔美国人语言在预训练语料库中的呈现

2503.10789v2

460

07-25

Opacity as Authority: Arbitrariness and the Preclusion of Contestation

Opacity as Authority: Willkür und die Präklusion der Anfechtung

不透明即权威:任意性与对质疑的排除

2507.22944v1

461

07-25

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

MCIF: Multimodale Crosslingual Instruction-Following Benchmark aus wissenschaftlichen Vorträgen

MCIF: 基于科学演讲的多模态跨语言指令遵循基准

2507.19634v1

462

07-25

LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

LoX: Low-Rank-Extrapolation stärkt LLM-Sicherheit gegen Feinabstimmung

LoX:低秩外推增强LLM安全性对微调的鲁棒性

2506.15606v3

463

07-25

HITSZ’s End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track

HITSZs End-to-End-Sprachübersetzungssysteme zur Kombination von Sequenz-zu-Sequenz-Auto-Spracherkennungsmodell und indic Large Language Model für IWSLT 2025 in Indic Track

HITSZ的端到端语音翻译系统:结合序列到序列自动语音识别模型和印度语言大语言模型,参加IWSLT 2025印度语言赛道

2507.19616v1

464

07-25

MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?

MOCHA: Sind Code-Sprachenmodelle gegen multi-Turn bösartige Coding-Prompts robust?

MOCHA:代码语言模型能否抵御多轮恶意编码提示?

2507.19598v1

465

07-25

Efficient Attention Mechanisms for Large Language Models: A Survey

Effiziente Aufmerksamkeitsmechanismen für große Sprachmodelle: Ein Überblick

大语言模型的高效注意力机制:综述

2507.19595v1

466

07-25

Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning

Minderung von Halluzinationen bei Geowissen in großen Sprachmodellen: Benchmarking und dynamische Faktizitätsausrichtung

减轻大语言模式中的地理空间知识幻觉:基准和动态事实对齐

2507.19586v1

467

07-25

MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

MMBench-GUI: Hierarchischer Mehrplattform-Evaluierungsrahmen für GUI-Agenten

MMBench-GUI:面向GUI智能体的分层多平台评估框架

2507.19478v1

468

07-25

Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts

Weiterentwicklung der Event-Prognose durch massives Training von großen Sprachmodellen: Herausforderungen, Lösungen und breitere Auswirkungen

通过大规模训练大语言模型推进事件预测:挑战、解决方案和更广泛的影响

2507.19477v1

469

07-25

Long-Form Answers to Visual Questions from Blind and Low Vision People

Langform-Antworten auf visuelle Fragen von Blinden und Sehbehinderten

对盲人和低视力人群视觉问题的长篇回答

2408.06303v2

470

07-25

Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models

Gespräche sind schief gegangen, aber dann? Evaluieren von Gesprächsvorhersagemodellen

对话出了岔子,然后呢?评估对话预测模型

2507.19470v1

471

07-25

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

RADLADS: Schnelle Aufmerksamkeitsdestillation zu linearen Aufmerksamkeitsdecodern im großen Maßstab

RADLADS: 大规模快速注意力蒸馏到线性注意力解码器

2505.03005v3

472

07-25

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

GEPA: Reflektierende Prompt-Evolution kann Verstärkungs-Lernen übertreffen

GEPA: 反思式提示进化可以超越强化学习

2507.19457v1

473

07-25

A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies

Ein durch natürliche Sprachverarbeitung und adaptives Multi-Wave-Sampling unterstützter Aktenprüfungsprozess zur Beschleunigung der Validierung codebasierter Algorithmen für große Datenbankstudien

借助自然语言处理和多波自适应抽样的病历审查流程,以加快大型数据库研究中基于代码的算法的验证

2507.22943v1

474

07-25

Distillation Scaling Laws

Destillationsskalierungsgesetze

蒸馏缩放定律

2502.08606v2

475

07-25

TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability

TokenSmith: Vereinfachte Datenbearbeitung, Suche und Inspektion für das Training und die Interpretierbarkeit großer Sprachmodelle

TokenSmith:简化数据编辑、搜索和检查,用于大规模语言模型训练和可解释性

2507.19419v1

476

07-25

Towards Domain Specification of Embedding Models in Medicine

Auf dem Weg zur Domain-Spezifikation von Einbettungsmodellen in der Medizin

迈向医学嵌入模型的领域规范

2507.19407v1

477

07-25

CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback

CodeEvo: Interaktionsgetriebene Synthese codezentrierter Daten durch hybrides und iteratives Feedback

CodeEvo:通过混合和迭代反馈进行以代码为中心的数据的交互驱动合成

2507.22080v1

478

07-25

Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

Vielfältige LLMs oder unterschiedliche Frageinterpretationen? Das ist die Ensembling-Frage

多样的LLM还是多样的问题解读?这就是集成的问题

2507.21168v1

479

07-25

Data Augmentation for Spoken Grammatical Error Correction

Datenaugmentierung für die Grammatikfehlerkorrektur gesprochener Sprache

语音语法错误校正的数据增强

2507.19374v1

480

07-25

LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences

LOTUS: Ein Leaderboard für detaillierte Bildbeschreibungen von Qualität bis zu gesellschaftlichem Bias und Nutzerpräferenzen

LOTUS: 从质量到社会偏见和用户偏好的详细图像描述排行榜

2507.19362v1

481

07-25

SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models

SpeechIQ: Sprachintelligenz Quotient über kognitive Ebenen im Sprachverständnis von großen Sprachmodellen

SpeechIQ:语音理解大语言模型中跨认知层次的语音智商

2507.19361v1

482

07-25

SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

SALM-Duplex: Effiziente und direkte Duplex-Modellierung für Speech-to-Speech-Sprachenmodell

SALM-Duplex:面向语音到语音语言模型的高效直接双工建模

2505.15670v4

483

07-25

Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization

Verbesserung der Sprach-Emotions-Erkennung durch Abgleich der Zeitstempel von ASR-Transkripten und Sprecher-Diarisierung

通过对齐ASR转录文本的时间戳和说话人分离来增强语音情感识别

2507.19356v1

484

07-25

DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

DoctorAgent-RL: Ein multi-agent-kollaboratives Verstärkungs-Lernsystem für den multi-Turn-Klinischen Dialog

DoctorAgent-RL:面向多轮临床对话的多智能体协作强化学习系统

2505.19630v2

485

07-25

Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Tasks

Smooth Reading: Die Lücke zwischen rekurrenten LLMs und Self-Attention-LLMs bei Langkontext-Aufgaben überbrücken

平滑阅读:弥合循环LLM与自注意力LLM在长上下文任务上的差距

2507.19353v1

486

07-25

Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation

Das Einbringen von externem Wissen in den Reasoning-Prozess verbessert die Retrieval-Augmented Generation

将外部知识注入推理过程可增强检索增强生成

2507.19333v1

487

07-25

References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation

Referenzen sind wichtig: Untersuchung der Auswirkungen von Variationen des Referenzsatzes auf die Bewertung von Zusammenfassungen

参考很重要:调查参考集变化对摘要评估的影响

2506.14335v2

488

07-25

AutoPCR: Automated Phenotype Concept Recognition by Prompting

AutoPCR: Automatisierte Erkennung von Phänotyp-Konzepten durch Prompting

AutoPCR:通过提示自动识别表型概念

2507.19315v1

489

07-25

The Eloquence team submission for task 1 of MLC-SLM challenge

Die Eloquence-Team-Einreichung für die Aufgabe 1 der MLC-SLM-Herausforderung

Eloquence团队针对MLC-SLM挑战赛任务1的提交

2507.19308v1

490

07-25

Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump’s Presidential Campaigns

Identifizierung feinkörniger Formen des Populismus im politischen Diskurs: Eine Fallstudie zu Donald Trumps Präsidentschaftswahlkämpfen

识别政治话语中细粒度的民粹主义形式:关于唐纳德·特朗普总统竞选活动的案例研究

2507.19303v1

491

07-25

A Markov Categorical Framework for Language Modeling

Ein kategorisches Markov-Rahmenwerk für Sprachmodellierung

用于语言建模的马尔可夫范畴框架

2507.19247v1

492

07-25

Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation

Jailbreaking Large Language Diffusion Models: Enthüllen versteckter Sicherheitsfehler bei der Diffusion-basierten Textgenerierung

越狱大语言扩散模型:揭示基于扩散的文本生成中隐藏的安全缺陷

2507.19227v1

493

07-25

How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework

Wie stark schummeln große Sprachmodelle bei der Evaluation? Benchmarking der Überschätzung im Rahmen des One-Time-Pad-basierten Frameworks

大语言模型在评估中作弊有多严重?基于一次性密码本框架的高估基准测试

2507.19219v1

494

07-25

3LM: Bridging Arabic, STEM, and Code through Benchmarking

3LM: Arabisch, MINT und Code durch Benchmarking überbrücken

3LM:通过基准测试连接阿拉伯语、STEM和代码

2507.15850v3

495

07-25

SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology

SigBERT: Kombination narrativer medizinischer Berichte und rough Path Signature Theory zur Einschätzung des Überlebensrisikos in der Onkologie

SigBERT: 将叙述性医疗报告与肿瘤学生存风险估算的粗路签名理论相结合

2507.22941v1

496

07-25

Towards Multimodal Social Conversations with Robots: Using Vision-Language Models

Auf dem Weg zu multimodalen sozialen Gesprächen mit Robotern: Mit Vision-Sprachen-Modellen

走向与机器人的多模式社会对话:使用视觉语言模型

2507.19196v1

497

07-25

Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?

Kann kleinskalige Datenvergiftung dialektbezogene Verzerrungen in großen Sprachmodellen verschärfen?

小规模数据投毒会加剧大语言模型中与方言相关的偏见吗?

2507.19195v1

498

07-25

Natural Language Processing for Tigrinya: Current State and Future Directions

Natürliche Sprachverarbeitung für Tigrinya: Aktueller Zustand und zukünftige Richtungen

提格里尼亚的自然语言处理:现状和未来方向

2507.17974v2

499

07-25

Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them

Scalpel vs. Hammer: GRPO verstärkt bestehende Fähigkeiten, SFT ersetzt sie

手术刀与锤子:GRPO放大现有能力,SFT取代它们

2507.10616v2

500

07-25

An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case

Eine empirische Untersuchung der Geschlechterstereotypdarstellung in großen Sprachmodellen: Der italienische Fall

对大语言模式中性别陈规定型观念代表性的经验调查:意大利案例

2507.19156v1

501

07-25

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Beschleunigung multimodaler Großsprachenmodelle über Dynamic Visual-Token Exit und die Empirical Findings

通过动态视觉令牌退出加速多模态大语言模型及其实证发现

2411.19628v2

502

07-25

Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

Vertrauenswürdiges Reasoning: Bewertung und Verbesserung der faktischen Genauigkeit in den Zwischenschritten des Denkprozesses von LLMs

可信推理:评估和增强LLM中间思维过程的事实准确性

2507.22940v1

503

07-25

OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?

OS-MAP: Wie weit können computernutzende Agenten in Breite und Tiefe gehen?

OS-MAP:计算机使用智能体在广度和深度上能走多远?

2507.19132v1

504

07-25

Distilling the Implicit Multi-Branch Structure in LLMs’ Reasoning via Reinforcement Learning

Destillieren der impliziten Multi-Branch-Struktur in LLMs’ Reasoning durch Verstärkungslernen

通过强化学习蒸馏LLM推理中的隐式多分支结构

2505.16142v3

505

07-25

Objectifying the Subjective: Cognitive Biases in Topic Interpretations

Objektivierung des Subjektiven: Kognitive Verzerrungen in Themeninterpretationen

将主观客观化:主题解释中的认知偏差

2507.19117v1

506

07-25

Relation Extraction with Instance-Adapted Predicate Descriptions

Relationsextraktion mit instanzangepassten Prädikatbeschreibungen

使用实例自适应谓词描述的关系抽取

2503.17799v2

507

07-25

Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy

Ensemble-Debiasing auf Klassen- und Stichprobenebene für eine gerechtere Prompting-Genauigkeit

跨类别和样本层面的集成去偏,以实现更公平的提示准确性

2503.05157v4

508

07-25

Comparison of pipeline, sequence-to-sequence, and GPT models for end-to-end relation extraction: experiments with the rare disease use-case

Vergleich von Pipeline-, Sequenz-zu-Sequenz- und GPT-Modellen für die End-to-End-Relationsextraktion: Experimente mit dem Anwendungsfall der seltenen Krankheiten

管道、序列到序列和GPT模型在端到端关系抽取中的比较:以罕见病用例为对象的实验

2311.13729v3

509

07-25

Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation

Destillieren eines kleinen Utility-Based Passage Selectors zur Verbesserung der Retrieval-Augmented Generation

蒸馏一个小型基于效用的段落选择器,以增强检索增强生成

2507.19102v1

510

07-25

How Important is Domain Specificity in Language Models and Instruction Finetuning for Biomedical Relation Extraction?

Wie wichtig ist Domain Specificity in Sprachmodellen und Instruction Finetuning für die biomedizinische Beziehungsextraktion?

语言模型的领域专用性和指令微调对生物医学关系抽取有多重要?

2402.13470v2

511

07-25

JCAPT: A Joint Modeling Approach for CAPT

JCAPT: Ein gemeinsamer Modellierungsansatz für CAPT

JCAPT: CAPT的联合建模方法

2506.19315v2

512

07-25

LLMs are Also Effective Embedding Models: An In-depth Overview

LLMs sind auch effektive Einbettungsmodelle: Eine ausführliche Übersicht

LLM也是有效的嵌入模型:深入概述

2412.12591v2

513

07-25

Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents

Debating Truth: Debattengetriebene Behauptungsverifikation mit mehreren Large-Language-Model-Agenten

辩论真相:使用多个大语言模型智能体进行辩论驱动的声明验证

2507.19090v1

514

07-25

Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement

Arg-LLaDA: Argumentzusammenfassung über Large Language Diffusion Models und Sufficiency-Aware Refinement

Arg-LLaDA:通过大语言扩散模型和充分性感知精炼进行论点摘要

2507.19081v1

515

07-25

Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

Re:Form – Reduzierung menschlicher Priors bei skalierbarer formaler Software-Verifikation mit RL in LLMs: Eine Vorstudie zu Dafny

Re:Form – 在LLM中使用RL减少可扩展形式化软件验证中的人类先验:关于Dafny的初步研究

2507.16331v2

516

07-25

ToolACE: Winning the Points of LLM Function Calling

ToolACE: Die Punkte des LLM-Funktionsaufrufs gewinnen

ToolACE:赢得LLM函数调用的得分

2409.00920v2

517

07-25

GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

GOAT-SLM: Ein Sprachmodell für gesprochene Sprache mit Bewusstsein für paralinguistische Merkmale und Sprechercharakteristika

GOAT-SLM:具有副语言和说话人特征感知能力的口语语言模型

2507.18119v2

518

07-25

XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare

XAI4LLM. Lassen Sie Modelle für maschinelles Lernen und LLMs für verbessertes In-Context-Lernen im Gesundheitswesen zusammenarbeiten

XAI4LLM. 让机器学习模型和LLM协作,增强医疗保健领域的上下文学习

2405.06270v4

519

07-25

T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

T2ISafety: Benchmark für die Bewertung von Fairness, Toxizität und Datenschutz in der Bildgenerierung

T2ISafety:评估图像生成中的公平性、毒性和隐私的基准

2501.12612v3

520

07-25

Closing the Modality Gap for Mixed Modality Search

Schließen der Modalitätslücke für gemischte Modalitätssuche

弥合混合模态搜索中的模态差距

2507.19054v1

521

07-25

PARROT: An Open Multilingual Radiology Reports Dataset

PARROT: Ein offener mehrsprachiger Datensatz radiologischer Befundberichte

PARROT:开放的多语言放射学报告数据集

2507.22939v1

522

07-25

FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems

FD-Bench: Eine Full-Duplex-Benchmarking-Pipeline für volle Duplex-Gesprochene Dialogsysteme

FD-Bench:为全双工口语对话系统设计的全双工基准测试流程

2507.19040v1

523

07-25

MLLM-based Speech Recognition: When and How is Multimodality Beneficial?

MLLM-basierte Spracherkennung: Wann und wie ist Multimodalität vorteilhaft?

基于MLLM的语音识别:多模态何时以及如何带来益处?

2507.19037v1

524

07-25

A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents

Ein Graph-basierter Ansatz für Multi-Modal-Fragebeantwortungen aus Flussdiagrammen in Telecom-Dokumenten

基于图的电信文档流程图多模态问答方法

2507.22938v1

525

07-25

Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems

Akustisch präzises Hesitations-Tagging ist für wortgetreue End-to-End-Transkriptionssysteme unerlässlich

声学上精确的犹豫标记对端到端逐字转录系统至关重要

2506.04076v2

526

07-25

Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations

Zwei Fliegen mit einer Klappe schlagen: generalisierte und robuste Erkennung KI-generierter Texte durch dynamische Störungen

一石二鸟:通过动态扰动实现通用且鲁棒的AI生成文本检测

2504.21019v2

527

07-25

Advancing biomolecular understanding and design following human instructions

Verbesserung des biomolekularen Verständnisses und Designs nach menschlichen Anweisungen

按照人类的指示,推动生物分子理解和设计

2410.07919v2

528

07-25

HIVMedQA: Benchmarking large language models for HIV medical decision support

HIVMedQA: Benchmarking großer Sprachmodelle für die medizinische HIV-Entscheidungsunterstützung

HIVMedQA:确定艾滋病毒医疗决策支助大语言模式的基准

2507.18143v2

529

07-25

Verbalized Representation Learning for Interpretable Few-Shot Generalization

Verbalisiertes Repräsentationslernen für interpretierbare Few-Shot-Generalisierung

用于可解释少样本泛化的言语化表示学习

2411.18651v2

530

07-25

Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation

Bewertung der Anfälligkeit von LLMs für den Missbrauch zur Generierung personalisierter Desinformation

评估LLM被滥用于生成个性化虚假信息的脆弱性

2412.13666v2

531

07-25

CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering

CoE-Ops: Zusammenarbeit von LLM-basierten Experten für AIOps Frage-Antwort

CoE-Ops:基于LLM的专家协作用于AIOps问答

2507.22937v1

532

07-25

MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts

MultiSocial: Mehrsprachiger Benchmark der maschinengenerierten Texterkennung von Social-Media-Texten

MultiSocial:社交媒体文本的机器生成文本检测多语言基准

2406.12549v2

533

07-25

A Toolbox, Not a Hammer – Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation

Eine Toolbox, kein Hammer – Multi-TAG: Skalierung des mathematischen Schlussfolgerns mit Multi-Tool-Aggregation

一个工具箱,而不是一把锤子 – Multi-TAG:使用多工具聚合扩展数学推理

2507.18973v1

534

07-25

Spike No More: Stabilizing the Pre-training of Large Language Models

Spike No More: Stabilisierung des Vortrainings großer Sprachmodelle

Spike No More:稳定大语言模型的预训练

2312.16903v4

535

07-25

A Similarity Measure for Comparing Conversational Dynamics

Eine Ähnlichkeitsmessung für den Vergleich von Konversationsdynamiken

用于比较对话动态的相似性度量

2507.18956v1

536

07-25

MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model

MedicalBERT: Verbesserung der biomedizinischen natürlichen Sprachverarbeitung mit vortrainiertem BERT-basiertem Modell

MedicalBERT:使用基于BERT的预训练模型增强生物医学自然语言处理

2507.08013v2

537

07-25

Legal Document Summarization: Enhancing Judicial Efficiency through Automation Detection

Zusammenfassung des Rechtsdokuments: Verbesserung der richterlichen Effizienz durch Automatisierungserkennung

法律文件摘要:通过自动检测提高司法效率

2507.18952v1

538

07-25

Adaptive Learning Systems: Personalized Curriculum Design Using LLM-Powered Analytics

Adaptive Lernsysteme: Personalisierte Lehrplangestaltung mit LLM-Powered Analytics

自适应学习系统:使用LLM驱动的分析进行个性化课程设计

2507.18949v1

539

07-25

TreeReader: A Hierarchical Academic Paper Reader Powered by Language Models

TreeReader: Ein hierarchischer Reader für wissenschaftliche Arbeiten auf Basis von Sprachmodellen

TreeReader:由语言模型驱动的分层学术论文阅读器

2507.18945v1

540

07-25

LLaVA-NeuMT: Selective Layer-Neuron Modulation for Efficient Multilingual Multimodal Translation

LLaVA-NeuMT: Selektive Schicht-Neuron-Modulation für effiziente mehrsprachige multimodale Übersetzung

LLaVA-NeuMT: 用于高效多语言多模态翻译的选择性层-神经元调制

2507.18940v1

541

07-25

Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks

Benchmarking des multimodalen Verständnisses und der komplexen Begründung für ESG-Aufgaben

面向ESG任务的多模态理解和复杂推理基准测试

2507.18932v1

542

07-25

Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

Seed-X: Starke Mehrsprachige Übersetzung LLM mit 7B-Parametern aufbauen

Seed-X:构建7B参数的强大多语言翻译LLM

2507.13618v3

543

07-25

Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders

Aufdeckung sprachübergreifender Disparitäten in LLMs mit Sparse Autoencodern

使用稀疏自编码器揭示LLM中的跨语言差异

2507.18918v1

544

07-25

Mining Contextualized Visual Associations from Images for Creativity Understanding

Gewinnung kontextualisierter visueller Assoziationen aus Bildern für das Verständnis von Kreativität

从图像中挖掘情境化视觉联想以理解创造力

2507.18915v1

545

07-25

A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions

Eine systematische Überprüfung der Systeme der wichtigsten retrieval-Augmented Generation (RAG): Fortschritt, Lücken und Zukunftsrichtungen

关键检索增强生成(RAG)系统的系统综述:进展、差距和未来方向

2507.18910v1

546

07-25

Large language models provide unsafe answers to patient-posed medical questions

Große Sprachmodelle geben unsichere Antworten auf von Patienten gestellte medizinische Fragen

大型语言模式为病人提出的医疗问题提供不安全的答案

2507.18905v1

547

07-25

SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

SLoW: Wählen Sie niederfrequente Wörter aus! Automatische Wörterbuchauswahl für Übersetzungen auf großen Sprachmodellen

SLOW: 选择低频单词! 用于大语言模型翻译的自动词典选择

2507.18902v1

548

07-25

REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?

REPRO-Bench: Können Agentische KI-Systeme die Reproduzierbarkeit der sozialwissenschaftlichen Forschung bewerten?

REPRO-Bench:智能体AI系统能否评估社会科学研究的可复现性?

2507.18901v1

549

07-25

Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs

Können LLMs die Zitationsabsicht vorhersagen? Eine experimentelle Analyse von In-Context-Lernen und Feinabstimmung auf offenen LLMs

LLM能预测引用意图吗?对开放LLM的上下文学习和微调的实验分析

2502.14561v3

550

07-25

A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans

Eine umfassende Bewertung der semantischen Beziehungskenntnisse von vortrainierten Sprachmodellen und Menschen

对预训练语言模型和人类的语义关系知识的全面评估

2412.01131v4

551

07-25

NUTMEG: Separating Signal From Noise in Annotator Disagreement

NUTMEG: Signal von Rauschen trennen bei Annotator-Uneinigkeit

NUTMEG: 在标注者分歧中区分信号与噪声

2507.18890v1

552

07-25

MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service

MindFlow+: Ein sich selbst weiterentwickelnder Agent für den E-Commerce-Kundendienst

MindFlow+:面向电子商务客户服务的自进化智能体

2507.18884v1

553

07-25

An Investigation of Prompt Variations for Zero-shot LLM-based Rankers

Eine Untersuchung von Prompt-Variationen für LLM-basierte Zero-Shot-Ranker

基于LLM的零样本排序器的提示变体研究

2406.14117v4

554

07-25

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

Visuelle Spracherkennung auf Phonemebene über Point-Visual Fusion und Sprachmodell-Rekonstruktion

通过点-视觉融合和语言模型重建进行音素级视觉语音识别

2507.18863v1

555

07-25

PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning

PrismRAG: Steigerung der RAG-Faktizität mit Distraktorresilienz und strategischem Reasoning

PrismRAG:通过干扰项鲁棒性和策略化推理提升RAG事实性

2507.18857v1

556

07-24 (4)

The Curious Case of Class Accuracy Imbalance in LLMs: Post-hoc Debiasing via Nonlinear Integer Programming

Der kuriose Fall des Ungleichgewichts der Klassengenauigkeit in LLMs: Post-hoc-Debiasing über nichtlineare Integer-Programmierung

LLM中类别准确率不平衡的奇特案例:通过非线性整数规划进行事后去偏

2405.07623v7

557

07-24

R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

R-Stitch: Dynamisches Trajektorien-Stitching für effizientes Reasoning

R-Stitch: 用于高效推理的动态轨迹拼接

2507.17307v2

558

07-24

Toward Super Agent System with Hybrid AI Routers

Auf dem Weg zum Super Agent System mit Hybrid-KI Routern

迈向配备混合AI路由器的超级智能体系统

2504.10519v2

559

07-24

CueBuddy: helping non-native English speakers navigate English-centric STEM education

CueBuddy: Unterstützung für Nicht-Muttersprachler des Englischen in der englischzentrierten MINT-Bildung

CueBuddy:帮助非母语英语者掌握以英语为中心的STEM教育

2507.18827v1

560

07-24

Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models

Promptomatix: Ein automatisches Prompt-Optimierungs-Framework für große Sprachmodelle

Promptomatix:面向大语言模型的自动提示优化框架

2507.14241v3

561

07-24

Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Feature Flow analysieren, um Interpretation und Steuerung in Sprachmodellen zu verbessern

分析特征流以增强语言模型的解释和引导

2502.03032v3

562

07-24

Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

Palm: Ein kulturell inklusiver und sprachlich vielfältiger Datensatz für arabische LLMs

Palm:面向阿拉伯语LLM的文化包容且语言多样的数据集

2503.00151v2

563

07-24

Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

Plan für Geschwindigkeit: Erweitertes Scheduling für maskierte Diffusions-Sprachmodelle

速度计划: 遮蔽传播语言模型的饱和日程安排

2506.19037v3

564

07-24

Evaluating Code-Mixing in LLMs Across 18 Languages

Bewertung von Code-Mixing in LLMs in 18 Sprachen

评估18种语言的LLMs混合编码

2507.18791v1

565

07-24

Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

Bewertung großer Sprachmodelle (LLMs) in Financial NLP: Eine vergleichende Studie zur Analyse von Finanzberichten

评价金融中大语言模型:财务报告分析比较研究

2507.22936v1

566

07-24

A Fisher’s exact test justification of the TF-IDF term-weighting scheme

Genaue Begründung des TF-IDF-Term-Wichtungssystems durch einen Fisher

A Fisher公司对TF-IDF术语加权办法的精确测试理由

2507.15742v2

567

07-24

ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting

ylmmcl bei Mehrsprachiger Textentgiftung 2025: Lexikon-geführte Entgiftung und Klassifikator-gestrichenes Umschreiben

2025年多语言文本解毒:Lexicon-Guid解毒和分类法改写

2507.18769v1

568

07-24

Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience

Auf dem Weg zu strukturiertem Wissen Reasoning: Kontrastive retrieval-erweiterte Generation auf Erfahrung

实现结构化知识理由:反向取回-积累经验的一代人

2506.00842v2

569

07-24

The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages

Die Rolle der Orthografiekonsistenz in mehrsprachigen Einbettungsmodellen für die Textklassifizierung in Arabisch-Script-Sprachen

阿拉伯文和克里普特语文文本分类多语种嵌入模型中正统一致性的作用

2507.18762v1

570

07-24

Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition

Lärm Kontrastive Schätzung-basiertes Matching Framework für die Erkennung von Low-Resource-Sicherheitsangriffen

低资源安保攻击模式识别比对框架

2401.10337v4

571

07-24

Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Spezifikation Selbst-Korrektion: Eindämmung von In-Context-Belohnung Hacken durch Test-Zeit-Verfeinerung

规格自我校正:通过试验-时间精炼进行减速的背负冲洗

2507.18742v1

572

07-24

AI Flow: Perspectives, Scenarios, and Approaches

AI Flow: Perspektiven, Szenarien und Ansätze

AI 流动:观点、设想和方法

2506.12479v3

573

07-24

An Efficient Sparse Fine-Tuning with Low Quantization Error via Neural Network Pruning

Effizientes Sparse-Fine-Tuning mit geringem Quantisierungsfehler über Neural Network Pruning

通过神经网络节制低量错误的高效粗简精细调整

2502.11439v2

574

07-24

Checklists Are Better Than Reward Models For Aligning Language Models

Checklisten sind besser als Belohnungsmodelle für die Ausrichtung von Sprachmodellen

核对列表比奖励模型更好调整语言模型

2507.18624v1

575

07-24

TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards

TRPrompt: Bootstrapping Query-Aware Prompt Optimierung von Textbelohnungen

TRPrompt: 从文本奖励中促进解答询问软件快速优化

2507.18618v1

576

07-24

SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning

SynC: Synthetische Bildunterschrift Datensatzverfeinerung mit ein-zu-vielen Mapping für Zero-shot Bildunterschrift

合成图像说明: 合成图像说明数据集精化,用一到多个绘图进行零光图像说明的合成图像说明

2507.18616v1

577

07-24

BEARCUBS: A benchmark for computer-using web agents

BEARCUBS: Benchmark für computergestützte Web-Agenten

BEARCUBS:计算机使用网络代理器的基准

2503.07919v3

578

07-24

Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Vertrauenswürdige Wissensgewinnung für Operationen und Wartungsintelligenz

行动和维持情报可信赖的知识采掘

2507.22935v1

579

07-24

Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

Sparse Logit Sampling: Beschleunigung der Wissensdestillation in LLMs

粗略的登录抽样:加速在LLMs中进行知识蒸馏

2503.16870v2

580

07-24

Deep Learning Approaches for Multimodal Intent Recognition: A Survey

Deep Learning Ansätze zur multimodalen Intent-Erkennung: Eine Umfrage

多种形式本能识别的深学习方法:调查

2507.22934v1

581

07-24

What Makes You CLIC: Detection of Croatian Clickbait Headlines

Was macht Sie CLIC: Erkennung von kroatischen Clickbait Schlagzeilen

是什么让你成为CLIC:发现克罗地亚点击头条头条

2507.14314v2

582

07-24

AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs

AQuilt: Verweben von Logik und Selbstinspektion in Low-Cost, High-Relevance-Datensynthese für Spezialisten LLMs

AQuilt:将逻辑和自我检查编织成低成本高相关性数据合成,供专家LLMs使用

2507.18584v1

583

07-24

DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data

DR.EHR: Dense Retrieval für elektronische Gesundheitsdaten mit Wissensinjektion und synthetischen Daten

DR.EHR: 具有知识注射和合成数据的电子健康记录大量检索

2507.18583v1

584

07-24

System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition

Systembericht für CCL25-Eval Task 10: SRAG-MAV für feinkörnige chinesische Hassspracherkennung

供CCL25-Eval任务10使用的系统报告:关于中华恶言识别的SRAG-MAV系统报告

2507.18580v1

585

07-24

P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts

P-反应:通过专门 LoRA 专家混合组合,综合个人经历专题-适应性反应

2406.12548v3

586

07-24

Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Weit-in, schmal-out: Wiederverwertbare Dekodierung für effiziente und effektive DLLMs

宽放, 窄出: 为高效和有效DLLMs而可撤销的解码

2507.18578v1

587

07-24

LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs

LingBench++: Ein linguistisch-informiertes Benchmark- und Reasoning-Framework für mehrstufige und kulturübergreifende Schlussfolgerungen mit LLMs

LingBench++:与LLMs的多层次和跨文化推理语言综合基准和理由框架

2507.16809v2

588

07-24

PosterMate: Audience-driven Collaborative Persona Agents for Poster Design

PosterMate: Audience-getriebene Kollaborative Persona Agenten für Poster-Design

PosterMate:由观众驱动的海报设计合作人员代理

2507.18572v1

589

07-24

Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods

Hybride Tokenisierungsstrategie für DNA-Sprachmodell mit Byte Pair Encoding und K-MER Methoden

使用字节对等编码和K-MER方法的DNA语言模型混合化战略

2507.18570v1

590

07-24

GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation

GIIFT: Graph-geführte induktive Bildverarbeitungsfreie multimodale maschinelle Übersetzung

GIIFT: 图表制导感性不含图像的无图像多式机器翻译

2507.18562v1

591

07-24

Identity-related Speech Suppression in Generative AI Content Moderation

Identitätsbezogene Sprachunterdrückung in der Generativen KI-Inhaltsmoderation

在产生AI 内容调节中禁止与身份有关的言语

2409.13725v3

592

07-24

Augmented Vision-Language Models: A Systematic Review

Augmented Vision-Language Models: Eine systematische Bewertung

增强愿景-语言模型:系统审查

2507.22933v1

593

07-24

FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

FinMarBa: Ein marktinformierter Datensatz für die Einstufung von Finanzsentimenten

FinMarBa:用于金融敏感度分类的市场化数据集

2507.22932v1

594

07-24

LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important

LagKV: Lag-Relative Information des KV-Cache erzählt, welche Token wichtig sind

LagKV: KV 缓存告诉哪个 Tokens 重要, 而 KV 缓存的拉格- 相对信息

2504.04704v2

595

07-24

GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface

GLiNER2: Ein effizientes Multi-Task-Informationsextraktionssystem mit Schema-gesteuerter Schnittstelle

GLINER2:具有Schema-Driven界面的高效多任务信息提取系统

2507.18546v1

596

07-24

Effective Multi-Task Learning for Biomedical Named Entity Recognition

Effektives Multi-Task-Lernen für die biomedizinische benannte Entitätserkennung

有效多任务学习促进生物医学命名实体的识别

2507.18542v1

597

07-24

The Moral Gap of Large Language Models

Die moralische Kluft großer Sprachmodelle

大语言模式的道德差距

2507.18523v1

598

07-24

GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks

GCC-Spam: Spam-Erkennung über GAN, Kontrastives Lernen und Charaktergleichheitsnetzwerke

GCC-Spam:通过GAN、对比学习和字符相似网络检测垃圾邮件

2507.14679v2

599

07-24

Exploiting individual differences to bootstrap communication

Nutzung individueller Unterschiede zur Bootstrap-Kommunikation

利用个人差异进行靴套通信

2504.05211v2

600

07-24

Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models

Nicht alle Funktionen widmen sich der Aufmerksamkeit: Graphengeführtes Abhängigkeitslernen für tabellarische Datengenerierung mit Sprachmodellen

并非所有值得注意的地物:用语言模型编制图表数据时的图表指导依赖性学习

2507.18504v1

601

07-24

LLM-based Embedders for Prior Case Retrieval

LLM-basierte Embedders für frühere Fallwiederherstellung

用于先前个案检索的LLM 以LLM为基础的嵌入器

2507.18455v1

602

07-24

Generation of Synthetic Clinical Text: A Systematic Review

Generieren von synthetischem klinischem Text: Ein systematischer Test

合成临床文本的生成:系统审查

2507.18451v1

603

07-24

Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language

Wiederherstellung des Rhythmus: Interpunktionswiederherstellung mit Transformer-Modellen für Bangla, eine Sprache mit geringen Ressourcen

恢复时速:使用孟加拉国低资源语言 “ 孟加拉 “ 变压器模型恢复脉冲

2507.18448v1

604

07-24

AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data

AraTable: Benchmarking von LLMs’ Vernunft und Verständnis arabischer Tabellendaten

阿拉伯表格:按基准确定LLM女士对阿拉伯表格数据的理由和理解

2507.18442v1

605

07-24

IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation

IPCGRL: Sprachgestütztes Verstärkungslernen für die verfahrenstechnische Level-Generierung

IPCGRL: 程序生成阶段语言教学强化学习

2503.12358v4

606

07-24

DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts

DEFAME: Dynamic Evidence-based FAct-Checking mit multimodalen Experten

DEFAME: 与多式联运专家进行动态证据法检查

2412.10510v4

607

07-24

How do language models learn facts? Dynamics, curricula and hallucinations

Wie lernen Sprachmodelle Fakten? Dynamik, Lehrpläne und Halluzinationen

语言模式如何了解事实?动态、课程和幻觉

2503.21676v2

608

07-24

FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs

FinDPO: Finanz-Sentiment-Analyse für algorithmischen Handel durch Preference-Optimierung von LLMs

FinDPO:通过优惠优化LLMs,分析通过高利贷交易的金融敏感度

2507.18417v1

609

07-24

ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

ExpliCa: Explizite kausale Vernunft in großen Sprachmodellen bewerten

ExpliCa:在大语言模型中评估明确的原因原因

2502.15487v3

610

07-24

Enhancing RAG Efficiency with Adaptive Context Compression

Steigerung der RAG-Effizienz durch adaptive Kontextkompression

提高RAG效率,同时采取适应性环境压缩措施

2507.22931v1

611

07-24

Factual Inconsistencies in Multilingual Wikipedia Tables

Tatsächliche Inkonsistenzen in mehrsprachigen Wikipedia-Tabellen

多语言维基百科表格中的事实不一致

2507.18406v1

612

07-24

CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

CLEAR: Fehleranalyse über LLM-as-a-Judge leicht gemacht

CLEAR:通过LLM-as-a法官进行错误分析

2507.18392v1

613

07-24

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

Beschädigt durch Reasoning: Reasoning Sprachmodelle werden Free-Riders in Public Goods Games

原因:在公共货物运动会中,理性语言模式成为自由骑手

2506.23276v2

614

07-24

Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs

Beyond Profile: Von Oberflächen-Fakten zur tiefen Persona-Simulation in LLMs

超越简介:从地平面事实到深人模拟LLMM

2502.12988v3

615

07-24

Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

Schutz gefährdeter Stimmen: Synthetische Datensatzgenerierung zur Selbstdetektion

保护弱势声音:为自我披露检测合成数据集生成

2507.22930v1

616

07-24

Mechanistic Indicators of Understanding in Large Language Models

Mechanistische Indikatoren des Verstehens in großen Sprachmodellen

大语言模型中理解力的机械指标

2507.08017v3

617

07-24

Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence

Hybride Annotation für Propagandaerkennung: Integration von LLM-Vorannotationen mit menschlicher Intelligenz

宣传探测混合说明:将LLM预告与人类情报相结合

2507.18343v1

618

07-24

TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning

TDR: Task-decoupled Retrieval mit feinkörnigem LLM-Feedback für das In-Context-Lernen

TDR: 以精细的LLM反馈方式进行任务减缩的检索,以便进行内容学习

2507.18340v1

619

07-24

Uncertainty Quantification for Evaluating Machine Translation Bias

Ungewissheit Quantifizierung für die Auswertung von maschinellen Übersetzungs-Bias

评价机器翻译偏见的不确定性定量

2507.18338v1

620

07-24

EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

EH-Benchmark Ophthalmische Halluzination Benchmark und Agent-getriebene Top-Down-Rückverfolgbarkeit Workflow

EH-Benchmark Ophthalmic 幻觉基准和代理Dripreven 顶底可追踪合理理由工作流程

2507.22929v1

621

07-24

A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Eine umfassende Studie der LLM-basierten Argumentationsklassifikation: von LLAMA über GPT-4o bis Deepseek-R1

关于以LLM为基础的理论分类的全面研究:从LLAMA到GPT-4o到Deepseek-R1

2507.08621v2

622

07-24

BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit

BadReasoner: Pflanzung Tunable Überdenken Hintertüren zu großen Grundmodellen für Spaß oder Gewinn

BadReasoner: 将金枪鱼可变性过度思考的后门规划成娱乐或利润的大理由模型

2507.18305v1

623

07-24

LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models

LoRA-Leak: Membership Inferenz Angriffe gegen LoRA fein abgestimmte Sprachmodelle

LoRA-Leak:对LORA精调语言模式的成员推论攻击

2507.18302v1

624

07-24

DocTER: Evaluating Document-based Knowledge Editing

DocTER: Dokumentbasierte Wissensbearbeitung bewerten

DocTER:评价基于文件的知识编辑

2308.09954v2

625

07-24

Step-Audio 2 Technical Report

Schritt-Audio 2 Technischer Bericht

技术报告

2507.16632v2

626

07-24

VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

VolDoGer: LLM-unterstützte Datensätze für Domain-Verallgemeinerung in Vision-Language-Aufgaben

VolDoGer:LLM辅助数据集,用于视野语言任务中通用域的LLM辅助数据集

2407.19795v2

627

07-24

StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer

StyleAdaptedLM: Weiterentwicklung der Anleitung nach Modellen mit effizienter Stylistik-Übertragung

StyleAdaptedLM:按照高效立体转让模式加强教学

2507.18294v1

628

07-24

How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

Wie denkt die Kette des Denkens? Mechanistische Interpretierbarkeit von Chain-of-Thought-Reasoning mit Sparse Autoencoding

思维链思维链是如何思考的?

2507.22928v1

629

07-24

Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil

Null-Schuss OCR Genauigkeit der niedrig-Ressourcen Sprachen: Eine vergleichende Analyse auf Sinhala und Tamil

低资源语言的准确性:僧伽罗语和泰米尔语比较分析

2507.18264v1

630

07-24

Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models

Locate-and-Focus: Verbesserung der Terminologieübersetzung in Sprachmodellen

目的和重点:加强语言语言模式术语翻译

2507.18263v1

631

07-24

Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning

Multimodale Verhaltensmusteranalyse mit Eye-Tracking und LLM-basierter Vernunft

以眼跟踪和基于LLM的理由进行多模式行为模式分析

2507.18252v1

632

07-24

Meta Prompting for AI Systems

Meta Prompting für KI-Systeme

AI 系统的模拟模拟

2311.11482v8

633

07-24

Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

Prune&Comp: Kostenloses Mittagessen für Layer-Pruned LLMs über iterative Pruning mit Magnitude Compensation

Prune & Comp: 通过模拟谨慎与磁度补偿为由层驱动的LMs免费午餐

2507.18212v1

634

07-24

Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge

Verbesserung der Transformation von natürlicher Sprache zur Signalzeitlogik mit LLMs mit vielfältigem externem Wissen

利用具有多种外部知识的LMLML 增强从自然语言向信号时时逻辑的转变

2505.20658v2

635

07-24

Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation

Untersuchung der Auswirkungen von Instruction-Tuning auf die Anfälligkeit von LLM für Fehlinformationen

探讨指导指导对LLM对错误信息易感性的影响

2507.18203v1

636

07-24

Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection

Sicherung von RAG-Pipelines mit GMTP: Eine gradient-basierte maskierte Token-Wahrscheinlichkeitsmethode für vergiftete Dokumentenerkennung

使用GMTP来保护RAG管道:一种基于渐进式蒙面的中毒文件检测概率方法

2507.18202v1

637

07-24

Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization

Integration eines ISO30401-konformen Wissensmanagementsystems in bestehende Geschäftsprozesse einer Organisation

将符合ISO30401的知识管理系统纳入一个组织的现有业务流程

2507.18197v1

638

07-24

SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models

SCOPE: Stochastische und gegensätzliche Wahlplatzierung für die Bewertung großer Sprachmodelle

SCOPE:评估大语言模式的施虐和反偏见选择安置

2507.18182v1

639

07-24

Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models

Das Mittel halten: Sticky Tokens in Text-Embedding-Modellen erkennen

坚持平均值:在文本嵌入模型中检测粘力

2507.18171v1

640

07-24

Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges

Jüngste Trends bei der Ferngesprächserkennung: Ein Rückblick auf die Herausforderungen CHiME-7 und 8 DASR

最近对不同政见的语音识别趋势:对CHiME-7和8DASR挑战的回顾

2507.18161v1

641

07-24

A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects

Eine Umfrage über die Kausalitätsidentifizierung: Taxonomie, Herausforderungen, Bewertung und Perspektiven

事件原因识别调查:分类、挑战、评估和前景

2411.10371v5

642

07-24

Large Language Models in Argument Mining: A Survey

Große Sprachmodelle im Argumentbergbau: Eine Umfrage

争议采矿大语言模型:调查

2506.16383v4

643

07-24

Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Auf dem Weg zu größerer Hebelwirkung: Skalierungsgesetze für effiziente Mixture-of-Experts-Sprachmodelle

争取更大程度的利用:提高有效混合专家语言模式法的规模

2507.17702v2

644

07-24

MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

MathOPEval: Ein feinkörniger Evaluations-Benchmark für visuelle Operationen von MLLMs in mathematischer Reasoning

MathOPEval:数学理由中MLLMs视觉操作精美评价基准

2507.18140v1

645

07-24

OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

OPeRA: Ein Datensatz von Beobachtung, Persona, Rationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation

OPeRA: 人类在线购物行为模拟观察、人、理由和评估LMLLMs的数据集

2506.05606v4

646

07-24

Actively evaluating and learning the distinctions that matter: Vaccine safety signal detection from emergency triage notes

Aktive Bewertung und Erlernen der Unterscheidungen, die wichtig sind: Vakzin-Sicherheitssignalerkennung aus Not-Triage-Notizen

积极评价和学习重要的区别:疫苗安全信号从紧急分级记录中探测到的疫苗安全信号

2507.18123v1

647

07-24

When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems

Wenn Autonomie Rogue: Vorbereitung auf Risiken der Multi-Agenten-Kollusion in sozialen Systemen

当自治时,罗格:准备应对社会系统中多机构串通的风险

2507.14660v2

648

07-24

Agentic AI framework for End-to-End Medical Data Inference

Agentische KI-Framework für Ende-zu-Ende medizinische Datenableitung

最终至最终医疗数据推断的AA AA 框架框架

2507.18115v1

649

07-24

A New Pair of GloVes

Ein neues Paar GloVes

新的地球之对

2507.18103v1

650

07-24

Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation

Lang-Short-Distanz Graph Neural Networks und verbessertes Curriculum-Lernen für Emotionserkennung im Gespräch

长短距离远距神经神经网络和改进课程学习,以在对话中认识情感

2507.15205v2

651

07-24

ELITE: Enhanced Language-Image Toxicity Evaluation for Safety

ELITE: Verbesserte Sprach-Image-Toxizitätsbewertung für Sicherheit

ELITE:加强语言-图像安全毒性评价

2502.04757v3

652

07-24

Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints

Hybrides und einheitliches Feintuning von großen Sprachmodellen: Methoden und Benchmarking unter Ressourcenbeschränkungen

大语言模式统一调整和统一调整适用:在资源限制下的方法和基准

2507.18076v1

653

07-24

BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference

BlockDialect: Blockweise feinkörnige Mischformat-Quantisierung für energieeffiziente LLM-Inferenz

BlockDialect: 节能LLM 推论的粗件精细混合格式量化

2501.01144v5

654

07-24

TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

TELEVAL: Ein dynamischer Benchmark für gesprochene Sprachmodelle in chinesischen interaktiven Szenarien

TELEVAL:为中文互动假想中的口语模式设计的一个动态基准

2507.18061v1

655

07-24

Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias

Causally Testing Gender Bias in LLMs: Eine Fallstudie über berufsbezogene Bias

《LLMM中因果测试性别偏见:职业偏见案例研究》

2212.10678v4

656

07-24

A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

Ein Multi-Faceted-Evaluierungsrahmen für die Bewertung synthetischer Daten, erzeugt durch große Sprachmodelle

评估由大语言模型生成的合成数据多面评价框架

2404.14445v2

657

07-24

Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs

Privacy-Preserving Synthetic Review Generation mit unterschiedlichen Schreibstilen mit LLMs

使用LLMMs以多种写作风格生成的隐私-保护合成审查

2507.18055v1

658

07-24

From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems

Von der Hypothese zur Veröffentlichung: Eine umfassende Umfrage zu KI-getriebenen Forschungsunterstützungssystemen

从假设到出版物:AI-Driven研究支助系统综合调查

2503.01424v3

659

07-24

RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models

RECALLED: Ein ungebundener Ressourcenverbrauchsangriff auf große Visions-Sprachenmodelle

回顾:对大型愿景-语言模型的无约束资源消费攻击

2507.18053v1

660

07-24

Segmentation-free Goodness of Pronunciation

Segmentierungsfreie Güte der Aussprache

读音良好

2507.16838v2

661

07-24

Synthetic Data Generation for Phrase Break Prediction with Large Language Model

Synthetische Datengenerierung für Phrase Break Prediction mit großem Sprachmodell

制作用于大语言模范大语言时段间断预测的合成数据

2507.18044v1

662

07-24

GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

GrAInS: Gradient-basierte Zuordnung zur Inferenz-Zeitlenkung von LLMs und VLMs

GrAInS:LLMs和VLMs的推论时间指导的逐步归属

2507.18043v1

663

07-24

AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark

AIR-Bench: Automatisierte Heterogene Information Retrieval Benchmark

AIR-Bench:自动异源信息检索基准

2412.13102v4

664

07-24

NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database

NeuralDB: Skalierung von Wissen in LLMs auf 100.000 Fakten mit neuraler KV-Datenbank

NeuralDB: 利用Neural KV数据库将LLM中的知识编辑扩展到100,000个事实

2507.18028v1

665

07-24

GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures

GRR-CoCa: LLM-Mechanismen in multimodalen Modellarchitekturen nutzen

GRR-CoCa:在多模式建模中利用LLM机制

2507.18009v1

Article 0

Title@2025-07-31 (4): Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

Title: Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

Kaskadierte Informationsoffenlegung zur verallgemeinerten Bewertung von Problemlösungsfähigkeiten

用于对解决问题能力通用评价的连锁信息披露 2507.23776v1

Authors (3): Yunxiang Yan, Tomohiro Sawada, Kartik Goyal

While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on cascaded question disclosure that provides a more accurate estimate of the models’ problem-solving capabilities while maintaining the scalability and automation. This approach collects model responses in a stagewise manner with each stage revealing partial information about the question designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings by extensive ablation studies.

问题解答(QA)基准性能是比较LLM的自动和可扩缩的方法,但它是一种间接的方法,用来评价其解决问题的基本能力,因此,我们提议了一个基于级联问题披露(cascaded question disclosure)的全面和可概括化的框架,在保持可缩放性和自动化的同时,对模型解决问题的能力作出更准确的估计。这个方法以分阶段的方式收集模型答复,每个阶段都揭示了旨在引致LLM普遍推理的问题的部分信息。我们发现,我们的方法不仅改善了LLM的比较,而且与标准的QA范式相比,在模型中产生了更好的中间痕迹。我们通过比较不同规模和系列的LLM,在不同的推理和知识密集型QA数据集上对这一行为进行了实证验证。我们的方法缩小了标准QA评价环境中观察到的业绩差距,表明普遍的间接QA评价范式高估了各种模型的绩效差异。我们通过广泛的消融研究进一步证实了我们的调查结果。
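The stagewise collection described in the abstract above can be pictured with a small sketch. Everything concrete here is an assumption made for illustration: that each stage reveals one more fragment of the question cumulatively, that earlier responses are passed along to later stages, and the names `cascaded_prompts`, `run_cascade`, and `echo_model` (a stand-in for a real LLM call). The abstract does not specify any of these details.

```python
def cascaded_prompts(question_parts):
    """Yield one prompt per stage; stage k reveals the first k parts (assumed scheme)."""
    for stage in range(1, len(question_parts) + 1):
        yield " ".join(question_parts[:stage])


def run_cascade(question_parts, model):
    """Collect one response per stage, passing earlier responses to the model."""
    responses = []
    for prompt in cascaded_prompts(question_parts):
        # Pass a copy so the model cannot mutate the running transcript.
        responses.append(model(prompt, list(responses)))
    return responses


def echo_model(prompt, earlier_responses):
    """Hypothetical stand-in for an LLM: reports the stage and prompt length."""
    return f"stage {len(earlier_responses) + 1}: saw {len(prompt.split())} words"


parts = [
    "A train leaves at 3 pm.",
    "It travels at 60 km/h.",
    "When has it covered 90 km?",
]
stage_prompts = list(cascaded_prompts(parts))
stage_responses = run_cascade(parts, echo_model)
for response in stage_responses:
    print(response)
```

Swapping `echo_model` for a real model call would leave the staging logic unchanged, which is the only point this sketch is meant to make.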

Article 1

Title@2025-07-31 (4): SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model

Title: SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model

SimuRA: Auf dem Weg zu einem General Goal-Oriented Agent über Simulative Reasoning Architecture mit LLM-basiertem Weltmodell

SimuRA:通过使用以LLM为基础的世界模型的模拟合理理由结构,努力实现以一般目标为导向的代理 2507.23773v1

Authors (7): Mingkai Deng, Jinyu Hou, Yilin Shen, Hongxia Jin, Graham Neubig, Zhiting Hu, Eric Xing

AI agents built on large language models (LLMs) hold enormous promise, but current practice focuses on a one-task-one-agent approach, which not only falls short of scalability and generality, but also suffers from the fundamental limitations of autoregressive LLMs. On the other hand, humans are general agents who reason by mentally simulating the outcomes of their actions and plans. Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based on a principled formulation of optimal agent in any environment, SimuRA overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. The generalized world model is implemented using LLM, which can flexibly plan in a wide range of environments using the concept-rich latent space of natural language. Experiments on difficult web browsing tasks show that SimuRA improves the success of flight search from 0% to 32.2%. World-model-based planning, in particular, shows consistent advantage of up to 124% over autoregressive planning, demonstrating the advantage of world model simulation as a reasoning paradigm. We are excited about the possibility for training a single, general agent model based on LLMs that can act superintelligently in all environments. To start, we make SimuRA, a web-browsing agent built on SimuRA with pretrained LLMs, available as a research demo for public testing.

以大型语言模型(LLMs)为基础的AI代理商有着巨大的希望,但目前的做法侧重于一号任务一号试剂方法,不仅不能达到可缩放性和普遍性,而且还受到自动递减性LMs的根本限制。另一方面,人类是一般的代理商,其原因是在精神上模拟其行动和计划的结果。向更普遍和强大的AI代理商的方向发展,我们引入了Simura,这是一个面向普遍代理推理的面向目标的结构。基于任何环境中最佳代理商的原则性配方, 模型名通过模拟引入世界规划模式克服了自动递增推理的局限性。通用世界模型模型使用LLMM来实施,它可以在广泛的环境中灵活规划,使用丰富的自然语言概念潜伏空间。在困难的网络浏览任务上进行的实验表明,SimuRA将改进飞行搜索的成功程度,从0到32.2。特别是基于世界模型的规划,显示在自动递增性规划方面达到124的优势,通过模拟采用世界模型模型进行规划,展示世界模型模型模型的优势,在单一的LMsimimal 上,我们可以进行基础的测试。
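As a rough illustration of "planning via simulation" in the abstract above, the sketch below rolls candidate actions forward through a world model and picks the action whose simulated future scores best. It is a toy under stated assumptions: the world model here is a plain Python function over integers rather than an LLM, and `plan_by_simulation`, `toy_world_model`, and `toy_score` are hypothetical names that do not come from the paper.

```python
def plan_by_simulation(state, actions, world_model, score, depth=2):
    """Return the first action of the best simulated action sequence of length `depth`."""
    if depth < 1:
        raise ValueError("depth must be at least 1")

    def best_value(s, remaining):
        # Best achievable score from state s with `remaining` further simulated steps.
        if remaining == 0:
            return score(s)
        return max(best_value(world_model(s, a), remaining - 1) for a in actions)

    return max(actions, key=lambda a: best_value(world_model(state, a), depth - 1))


def toy_world_model(state, action):
    """Assumed toy dynamics: 'inc' adds one, anything else doubles the state."""
    return state + 1 if action == "inc" else state * 2


def toy_score(state, goal=10):
    """Higher is better: negative distance to the goal value."""
    return -abs(state - goal)


ACTIONS = ["inc", "double"]
greedy_choice = plan_by_simulation(4, ACTIONS, toy_world_model, toy_score, depth=1)
lookahead_choice = plan_by_simulation(4, ACTIONS, toy_world_model, toy_score, depth=2)
print(greedy_choice, lookahead_choice)
```

Starting from 4 with a goal of 10, one-step evaluation prefers `double` (reaching 8), while two-step simulation prefers `inc` (5, then 10). The toy shows how simulating further ahead can change the chosen action; it says nothing about the paper's reported numbers.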

Article 2

Title@2025-07-31 (4): Perception-Aware Policy Optimization for Multimodal Reasoning

Title: Perception-Aware Policy Optimization for Multimodal Reasoning

Perception-Aware Policy Optimization für multimodale Reasoning

对多式联运理由的观念-认知软件政策优化 2507.06448v3

Authors (11): Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.

以可验证的奖励(RLVR)加强学习已证明是运用大型语言模型(LLMS)的高度有效战略,具有强大的多步推理能力。然而,其设计和优化仍然适合纯文本域,导致在应用多式推理任务时表现不优于最优性。特别是,我们发现,当前多式联运推理中的一个主要错误来源在于对视觉投入的认识。为了解决这一瓶颈问题,我们建议PAPO, 一种鼓励模型在学习时学会理解的新型政策梯度演算法。具体地说,我们以KL差异术语的形式引入隐含的感知损失:可以无缝地插入RLVR的算法,例如GROPO和DAPO。值得注意的是,PAPO并不依赖额外的数据曲线、奖赏模型或更强的教师模型。为了进一步加强PAPO的培训稳定性,我们引入了双环流损失,从而在不损害业绩的情况下有效地规范新的KL目标。尽管它简单,但PAPO的推理学基础框架在总体上显著改进了4.4%-17.5 % ,在多种多式联运认识30度基准下,我们更精确地减少了一个更高的实验目标。
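The abstract above describes the Implicit Perception Loss only as a KL divergence term that plugs into RLVR objectives. To make concrete what such a term measures, the sketch below computes KL(p || q) between two discrete next-token distributions. Which distributions are compared is an assumption for illustration (here, the policy's output given the full visual input versus a perturbed one); the abstract does not say. The function name and toy probabilities are likewise hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; eps guards against log(0) in q.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy next-token distributions over a three-token vocabulary (made-up numbers).
p_full_image = [0.7, 0.2, 0.1]       # assumed: policy output with the original image
p_perturbed_image = [0.4, 0.3, 0.3]  # assumed: policy output with a perturbed image

kl_term = kl_divergence(p_full_image, p_perturbed_image)
print(f"KL term: {kl_term:.4f}")
```

A larger value means the two output distributions differ more. How the term is signed and weighted inside the training objective is not stated in the abstract and is left out here.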

Article 3

Title@2025-07-31 (4): CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

Title: CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

CoT-Self-Instruct: Aufbau hochwertiger synthetischer Aufforderungen zur Begründung und zu nicht-vernünftigen Aufgaben

COT-自学教学:为推理和非理由性任务建立高质量的合成提示 2507.23751v1

Authors (9): Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu

We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on the given seed tasks, and then to generate a new synthetic prompt of similar quality and complexity for use in LLM training, followed by filtering for high-quality data with automatic metrics. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, across MATH500, AMC23, AIME24 and GPQA-Diamond. For non-verifiable instruction-following tasks, our method surpasses the performance of human or standard self-instruct prompts on both AlpacaEval 2.0 and Arena-Hard.

我们建议采用Cot-自制数据生成方法,即根据给定的种子任务,首先引导LLMS理性地通过Thought链(Cot)进行规划,然后生成一个质量和复杂性相似的合成新速度,供LLM培训使用,然后用自动指标过滤高质量数据。在可核查的推理中,我们的合成数据大大优于现有的培训数据集,如在MATH500、AMC23、AME24和GPQA-Diamond之间,S1k和OpenMathReasoning。对于不可核查的教学执行任务,我们的方法超过了在AlpacaEval2.0和Arena-Hard方面的人类或标准自制工具的性能。
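The final step in the abstract above, filtering generated prompts with automatic metrics, can be sketched as follows. The abstract does not name the metric, so the one used here is an assumption: a prompt is kept when several sampled answers to it mostly agree. The names `agreement_score` and `filter_prompts`, the threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def agreement_score(answers):
    """Fraction of sampled answers that match the most common answer."""
    if not answers:
        return 0.0
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def filter_prompts(candidates, threshold=0.5):
    """Keep prompts whose sampled answers agree at least `threshold` of the time."""
    return [
        c["prompt"]
        for c in candidates
        if agreement_score(c["answers"]) >= threshold
    ]


# Toy synthetic prompts with answers sampled from a model (made-up data).
candidates = [
    {"prompt": "Compute 12 * 7.", "answers": ["84", "84", "84", "74"]},
    {"prompt": "An underspecified puzzle.", "answers": ["3", "5", "9", "11"]},
]
kept = filter_prompts(candidates)
print(kept)
```

Here the first prompt has an agreement of 0.75 and is kept, while the second has 0.25 and is dropped. A different automatic metric could replace `agreement_score` without changing the rest of the pipeline.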

Article 4

Title@2025-07-31 (4): Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

Title: Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

Regel2Text: Natürliche Sprache Erklärung der logischen Regeln in Wissensgraphen

规则2案文:知识图中逻辑规则的自然语言解释 2507.23740v1

Authors (4): Nasim Shirvani-Mahdavi, Devin Wingfield, Amin Ghasemi, Chengkai Li

Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. We examine various prompting strategies, including zero- and few-shot prompting, including variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at https://github.com/idirlab/KGRule2NL.

确定逻辑规则不仅提高了知识图的完整性,而且能够发现潜在的错误,揭示了微妙的数据模式,并提高了总体推理和解释能力;然而,这些规则的复杂性,加上每个知识图独特的标签公约,可能使人类难以理解这些规则;在本文件中,我们探索大型语言模型产生逻辑规则自然语言解释的潜力;具体地说,我们利用基准数据集FB15k-237和两个大型数据集FB-CVT-REV和FB+CVT-REV。我们审查各种提示战略,包括零和几发提示性战略,包括不同实体类型和思维链推理。我们根据正确性、清晰性和幻觉,对产生的解释进行全面的人文评价,并评估大型语言模型作为自动法官的使用情况。我们的结果显示在解释正确性和清晰性方面有希望的业绩,尽管在http/httpsurity/G_BAR_G_G_BAR_RR)中,所有数据都用于未来研究。

Article 5

Title@2025-07-31 (4): How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment

Title: How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment

Wie KI-Ideen die Kreativität, Vielfalt und Evolution menschlicher Ideen beeinflussen: Beweise aus einem großen, dynamischen Experiment

AI Ideas如何影响人类思想的创造性、多样性和演变:大规模动态实验的证据 2401.13481v3

Authors (5): Joshua Ashkinaze, Julia Mendelsohn, Li Qiwei, Ceren Budak, Eric Gilbert

Exposure to large language model output is rapidly increasing. How will seeing AI-generated ideas affect human ideas? We conducted an experiment (800+ participants, 40+ countries) where participants viewed creative ideas that were from ChatGPT or prior experimental participants and then brainstormed their own idea. We varied the number of AI-generated examples (none, low, or high exposure) and if the examples were labeled as ‘AI’ (disclosure). Our dynamic experiment design – ideas from prior participants in an experimental condition are used as stimuli for future participants in the same experimental condition – speaks to the interdependent process of cultural creation: creative ideas are built upon prior ideas. Hence, we capture the compounding effects of having LLMs ‘in the culture loop’. We find that high AI exposure (but not low AI exposure) did not affect the creativity of individual ideas but did increase the average amount and rate of change of collective idea diversity. AI made ideas different, not better. There were no main effects of disclosure. We also found that self-reported creative people were less influenced by knowing an idea was from AI and that participants may knowingly adopt AI ideas when the task is difficult. Our findings suggest that introducing AI ideas may increase collective diversity but not individual creativity.

对大语言模型输出的接触正在迅速增加。看到AI产生的想法将如何影响人类思想?我们进行了一个实验(800+参与者,40+国家),参与者观看了来自ChatGPT或先前实验参与者的创造性想法,然后集思广益。我们改变了AI产生的例子数量(没有、低或高暴露),如果这些例子被贴上“AI”(披露)标签,那么这些例子的数量也各不相同。我们的动态实验设计 – – 实验状态中的前参与者的想法被作为同一实验条件下未来参与者的刺激因素 – – 谈到文化创造的相互依存过程:创造性思想是建立在先前思想基础上的。因此,我们捕捉到将LLOMS“纳入文化循环”的复合效应。我们发现,AI的高度接触(但不低接触)并没有影响个人思想的创造力,而是增加了集体思想多样性的平均数量和变化率。AI做了不同,没有更好的解释。披露没有产生任何主要影响。我们发现,自我报告创造性的人受到了解AI思想的影响较小,但是参与者在任务困难的时候可以知情地接受AI。我们发现,集体发现,引入AI的想法可能增加个人创造力。

Article 6

Title@2025-07-31 (4): Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Title: Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Seed-Prover: Tiefe und breite Begründung für automatisierte Theorem Proving

种子文献:用于自动理论论证的深度和广度理由 2507.23726v1

Authors (36): Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu

LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine, Seed-Geometry, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.

LLMS通过利用长期思维链的强化学习,展示了很强的数学推理能力,但是,由于在仅仅使用自然语言时缺乏明确的监督信号,它们仍然在与理论进行斗争。Lian等专门的域名语言通过正式核实证据提供明确的监督,通过强化学习进行有效培训。在这项工作中,我们建议采用一个脂素式的全度防盗推理模型\ textbf{Seed-Prover}。种子-Prover可以反复地改进其基于Lean反馈、经证明的Lemma和自我合成的证据。为了解决海事组织一级的竞争问题,我们设计了三个测试时间推论战略,既能进行深入又广泛的推理。种子-Prover证明过去海事组织问题正规化的78.1美元,能通过强化学习进行有效培训。我们在Putnambench上提出了超过50美分的全方位全方位推理模型。为了解决Lean的地理测量支持不足的问题,我们引入了一种几何推理机引擎{Seud-Gegraphy-Gegraphy

Article 7

Title@2025-07-31 (4): RecGPT Technical Report

Title: RecGPT Technical Report

Technischer Bericht des RecGPT

RecGPT 技术报告 2507.22879v2

Authors (54): Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Sunhao Dai, Wen Chen, Wenjun Yang, Yuning Jiang, Zhujin Gao, Bo Zheng, Chi Li, Dimin Wang, Dixuan Wang, Fan Li, Fan Zhang, Haibin Chen, Haozhuang Liu, Jialin Zhu, Jiamang Wang, Jiawei Wu, Jin Cui, Ju Huang, Kai Zhang, Kan Liu, Lang Tian, Liang Rao, Longbin Li, Lulu Zhao, Na He, Peiyang Wang, Qiqi Huang, Tao Luo, Wenbo Su, Xiaoxiao He, Xin Tong, Xu Chen, Xunke Xi, Yang Li, Yaxuan Wu, Yeqiu Yang, Yi Hu, Yinnan Song, Yuchen Li, Yujie Luo, Yujin Yuan, Yuliang Yan, Zhengyang Wang, Zhibo Xiao, Zhixin Ma, Zile Zhou, Ziqi Zhang

Recommender systems are among the most impactful applications of artificial intelligence, serving as critical infrastructure connecting users, merchants, and platforms. However, most current industrial systems remain heavily reliant on historical co-occurrence patterns and log-fitting objectives, i.e., optimizing for past user interactions without explicitly modeling user intent. This log-fitting approach often leads to overfitting to narrow historical preferences, failing to capture users’ evolving and latent interests. As a result, it reinforces filter bubbles and long-tail phenomena, ultimately harming user experience and threatening the sustainability of the whole recommendation ecosystem. To address these challenges, we rethink the overall design paradigm of recommender systems and propose RecGPT, a next-generation framework that places user intent at the center of the recommendation pipeline. By integrating large language models (LLMs) into key stages of user interest mining, item retrieval, and explanation generation, RecGPT transforms log-fitting recommendation into an intent-centric process. To effectively align general-purpose LLMs to the above domain-specific recommendation tasks at scale, RecGPT incorporates a multi-stage training paradigm, which integrates reasoning-enhanced pre-alignment and self-training evolution, guided by a Human-LLM cooperative judge system. Currently, RecGPT has been fully deployed on the Taobao App. Online experiments demonstrate that RecGPT achieves consistent performance gains across stakeholders: users benefit from increased content diversity and satisfaction, while merchants and the platform gain greater exposure and conversions. These comprehensive improvements across all stakeholders validate that LLM-driven, intent-centric design can foster a more sustainable and mutually beneficial recommendation ecosystem.

推荐系统是人工智能中影响最大的应用,是连接用户、商人和平台的关键基础设施;然而,目前大多数工业系统仍然严重依赖历史共同模式和逻辑调整目标,即优化以往用户互动,而没有明确模拟用户意图。这种逻辑调整方法往往导致过度适应狭隘的历史偏好,无法捕捉用户不断变化的潜在利益。结果,它强化过滤泡沫和长尾现象,最终损害用户经验,威胁整个建议生态系统的可持续性。为了应对这些挑战,我们重新思考推荐系统的总体设计模式,并提议一个下一代框架RecGPT,将用户的意向置于建议管道的中心。通过将大型语言模型(LLLMS)纳入用户兴趣挖掘、项目检索和解释生成的关键阶段,RecGPT将符合逻辑的建议转化为一个以意图为中心的过程。为了有效地将通用LMS与以上特定领域的全面建议任务相协调,REGPT包含一个多阶段培训模式,整合了推荐系统的总体设计模式,将用户的意向转换成一个更具有可持续性的、更清晰的升级的、更清晰的升级的日历和升级的流程。

Article 8

Title@2025-07-31 (4): Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Title: Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Nicht zu vergessen: Proaktive Interferenz offenbart Arbeitsspeichergrenzen in LLMs jenseits der Kontextlänge

无法忘却: 事外长长的LLMM 中主动干扰流出工作内存限制 2506.08184v3

Authors (2): Chupei Wang, Jiaqiu Vince Sun

Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs’ ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models’ ability to suppress irrelevant content during retrieval.

大语言模型(LLMs)中的信息检索日益被公认为与生成能力而不是仅仅看一看相交织在一起。虽然人们往往认为较长的环境可以改进检索,但文本内干扰的影响仍然没有得到充分研究。为了解决这个问题,我们调整了认知科学中的主动干预(PI)范式,早期信息扰乱了对更新更新的记忆的记忆。在人类中,对这种干扰的易感性与工作记忆能力有反向联系。我们引入了PI-LLM,这一评价按顺序流出与语义相关的关键价值更新和查询最后值。虽然这些最后值在查询之前的位置很明确,但LLM检索精度随着干扰的积累而将记录线性降低到零;错误产生于对先前重写价值的检索。试图通过即时工程(例如指示模型忽略早期输入)来减少干扰,但成效有限。这些结论揭示了LLMs解动干扰和灵活调控信息的能力的根本制约,意味着工作记忆瓶颈超出了仅仅上下文访问的范围。这要求采取一些方法,以加强模型在检索过程中抑制不相关内容的能力。
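The evaluation described above (a stream of key-value updates followed by a query for only the final values) can be sketched as follows. This is a minimal illustration, not the PI-LLM implementation: the function names, the `key: value` line format, the question wording, and the random shuffling of updates are all assumptions made for the example.

```python
import random


def build_pi_stream(keys, values, updates_per_key, seed=0):
    """Build a prompt of interleaved key-value updates plus the ground truth.

    Hypothetical helper: each key is updated `updates_per_key` times with a
    random value, and only the last update per key counts as correct.
    """
    rng = random.Random(seed)
    updates = [(key, rng.choice(values))
               for key in keys
               for _ in range(updates_per_key)]
    rng.shuffle(updates)  # interleave updates across keys

    final = {}
    for key, value in updates:
        final[key] = value  # later updates overwrite earlier ones

    lines = [f"{key}: {value}" for key, value in updates]
    prompt = "\n".join(lines) + "\nWhat is the current value of each key?"
    return prompt, final


def retrieval_accuracy(predicted, final):
    """Fraction of keys whose predicted value equals the final value."""
    return sum(predicted.get(k) == v for k, v in final.items()) / len(final)
```

Raising `updates_per_key` increases the number of stale values per key, which is the kind of accumulated interference the abstract reports as degrading retrieval.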

Article 9

Title@2025-07-31 (4): TextQuests: How Good are LLMs at Text-Based Video Games?

Title: TextQuests: How Good are LLMs at Text-Based Video Games?

TextQuests: Wie gut sind LLMs bei textbasierten Videospielen?

文本Quests: 文本视频游戏的LLMs效果如何? 2507.23701v1

Authors (4): Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks

Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent’s ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent’s capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.

在反映现实世界挑战的复杂、互动环境中评价AI代理商对于了解其实际能力至关重要。虽然现有的代理商基准有效评估工具使用或结构化任务绩效等技能,但往往不能完全掌握该代理商在探索环境中自主运作的能力,这种探索环境需要长期和不断增长的持续、自我引导的推理。为了促进发展能够在长视野上进行更强有力的内在推理的代理商,我们引入了一个基于Infocom交互式虚拟游戏套件的基准“TextQuests ”。这些基于文本的冒险,它可以花费人类玩家30小时以上的时间,需要数百次精确的行动来解决,作为评价AI代理商重点突出、明确的任务的有效代理。该基准具体旨在评估LLM代理商通过排除使用外部工具自行解决问题的能力,从而侧重于在探索环境中的内在长文本推理能力,其特征是需要试用和学习并在单一互动会议上持续解决问题。我们在https://textquests.ai发布TextQuests。

Article 10

Title@2025-07-31 (4): TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

Title: TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

TweakLLM: Eine Routing-Architektur für dynamisches Tailoring von Cached Responses

TweakLLLM: 快速快速定制快速响应的运行结构 2507.23674v1

Authors (6): Muhammad Taha Cheema, Abeer Aamir, Khawaja Gul Muhammad, Naveed Anwar Bhatti, Ihsan Ayyub Qazi, Zafar Ayyub Qazi

Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries using this approach proves difficult due to the personalized nature of chatbot interactions and the limited accuracy of semantic similarity search. To address this, we present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts. Through comprehensive evaluation, including user studies with side-by-side comparisons, satisfaction voting, as well as multi-agent LLM debates, we demonstrate that TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness. Our results across real-world datasets highlight TweakLLM as a scalable, resource-efficient caching solution for high-volume LLM deployments without compromising user experience.

大型语言模型(LLM)每天处理数以百万计的询问,使高效的响应为降低成本和延时提供了令人信服的优化,但是,由于聊天机互动的个性性质和语义相似性搜索的准确性有限,很难保持与用户查询的相关性,但是,为了解决这个问题,我们提出了TweakLLM,这是一个新型的路线结构,它使用轻量级LM来动态地调整缓存的响应速度。通过全面评价,包括用户研究,同时进行平行比较、满意投票以及多剂LLM辩论,我们证明TweakLLM保持了与前沿模型相似的反应质量,同时大大提高了缓存效率。我们跨越现实世界数据集的结果突出表明TweakLLM是高容量LM部署的可扩展性、资源高效的缓冲解决方案,同时又不损害用户的经验。

Article 11

Title: Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning

Arabische Hass-Spracherkennung und Maskenbildung in sozialen Medien mit Deep-Learning-Modellen und vortrainierten Modellen Feinabstimmung

利用深学习模式和预培训模式进行微调,在社会媒体中识别和遮掩阿拉伯仇恨言论 2507.23661v1

Authors (2): Salam Thabet Doghmash, Motaz Saad

Hate speech identification in social media has become an increasingly important issue in recent years. In this research, we address two problems: 1) to detect hate speech in Arabic text, 2) to clean a given text from hate speech. Cleaning here means replacing each bad word with stars, one star per letter of the word. Regarding the first problem, we conduct several experiments using deep learning models and transformers to determine the best model in terms of the F1 score. Regarding the second problem, we consider it as a machine translation task, where the input is a sentence containing dirty text and the output is the same sentence with the dirty text masked. The presented methods achieve the best model in hate speech detection with a 92% Macro F1 score and 95% accuracy. Regarding the text cleaning experiment, the best result in the hate speech masking model reached 0.3 in BLEU score with 1-gram, which is a good result compared with state-of-the-art machine translation systems.

近年来,社交媒体中的仇恨言论识别已成为一个日益重要的问题。在这项研究中,我们处理两个问题:(1) 检测阿拉伯文本中的仇恨言论,(2) 清除仇恨言论的文本。这里清洁的含义是根据每个字字母的数量用恒星替换每个坏字。关于第一个问题,我们用深学习模型和变压器进行若干实验,以确定F1分的最佳模式。关于第二个问题,我们认为这是一个机器翻译任务,输入的内容是含有脏字的句子,输出的内容是用脏字遮掩的相同句子。所介绍的方法在仇恨言论检测方面实现了最佳模式,用92 Mac F1分和95准确度。关于文字清理实验,仇恨言论遮盖模式的最佳结果在BLEU分中达到0.3分,为1克,与艺术机器翻译系统的状况相比,这是一个良好的结果。
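The masking target described above (each bad word replaced by one star per letter) can be illustrated with a small sketch. Note the swap: the paper treats masking as a learned machine translation task, whereas this example uses a fixed word list purely to show the expected output format. The function name and the word list are hypothetical.

```python
import re


def mask_words(text, bad_words):
    """Replace each listed word with one star per character of that word.

    Lexicon-based illustration only; the word list is an assumption and
    this is not the sequence-to-sequence model described in the abstract.
    """
    bad = {word.lower() for word in bad_words}

    def replace(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in bad else word

    # \w+ matches runs of word characters, leaving spaces and punctuation as-is.
    return re.sub(r"\w+", replace, text)
```

For example, `mask_words("you are a fool indeed", ["fool"])` keeps the sentence length and word boundaries intact while hiding the listed word.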

Article 12

Title@2025-07-31 (4): DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures

Title: DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures

DocPolarBERT: Ein vortrainiertes Modell zum Dokumentverständnis mit relativer Polarkoordinate Kodierung von Layoutstrukturen

DocPolarBERT:一个预先培训的文件理解模式,其布局结构的相对极地协调编码 2507.08606v3

Authors (4): Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro

We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in a relative polar coordinate system rather than the Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.

我们引入了Doc PolarBERT, 这是一种具有布局意识的BERT文件理解模式,消除了绝对2D定位嵌入的需要。我们自我关注,以考虑到相对极地协调系统而不是笛卡尔协调系统中的文本块位置。尽管在数据集方面接受过比广泛使用的IT-CDIP系统小六倍多的预先培训,但Doc PolarBERT取得了最新的结果。这些结果表明,精心设计的注意机制可以弥补培训前数据的减少,为文件理解提供高效有效的替代方法。
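To make the idea of relative polar coordinates between text blocks concrete, the sketch below computes the distance and angle from one block position to another. It is only an assumption-laden illustration: the abstract does not say which reference point of a block is used, nor how distances and angles are discretized or fed into attention, so the function name and the use of a single (x, y) point per block are hypothetical.

```python
import math


def relative_polar(src, dst):
    """Return (distance, angle in radians) of block `dst` relative to `src`.

    Each block is represented by one (x, y) point, which is an assumption;
    the angle is measured with math.atan2 and lies in (-pi, pi].
    """
    dx = dst[0] - src[0]
    dy = dst[1] - src[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)
```

Because only differences between positions enter the computation, the result does not change when the whole page is translated, which is the property that distinguishes a relative encoding from absolute 2D positional embeddings.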

Article 13

Title@2025-07-31 (4): Who’s important? – SUnSET: Synergistic Understanding of Stakeholder, Events and Time for Timeline Generation

Title: Who’s important? – SUnSET: Synergistic Understanding of Stakeholder, Events and Time for Timeline Generation

Wer ist wichtig? – SUnSET: Synergistisches Verständnis von Stakeholdern, Ereignissen und Zeit für die Timeline Generation

谁重要? - SUNSET:对利益攸关方、事件和时间的协同理解,以产生时间表。 2507.21903v2

Authors (4): Tiviatis Sim, Kaiwen Yang, Shen Xin, Kenji Kawaguchi

As news reporting becomes increasingly global and decentralized online, tracking related events across multiple sources presents significant challenges. Existing news summarization methods typically utilize Large Language Models and Graphical methods on article-based summaries. However, this is not effective since it only considers the textual content of similarly dated articles to understand the gist of the event. To counteract the lack of analysis on the parties involved, it is essential to come up with a novel framework to gauge the importance of stakeholders and the connection of related events through the relevant entities involved. Therefore, we present SUnSET: Synergistic Understanding of Stakeholder, Events and Time for the task of Timeline Summarization (TLS). We leverage powerful Large Language Models (LLMs) to build SET triplets and introduce the use of stakeholder-based ranking to construct a Relevancy metric, which can be extended into general situations. Our experimental results outperform all prior baselines and emerge as the new State-of-the-Art, highlighting the impact of stakeholder information within news articles.

由于新闻报道日益全球化和分散,追踪多个来源的相关事件带来了重大挑战。现有新闻汇总方法通常使用大语言模型和基于文章的摘要图形方法。然而,这没有效果,因为仅考虑类似日期文章的文字内容来理解事件要点。为了应对有关各方缺乏分析的问题,必须制定一个新框架,通过相关实体衡量利益攸关方的重要性和相关事件的联系。因此,我们介绍了SUNSET:对利益攸关方、事件和时间的协同理解,以完成时间线的总结(TLS)任务。我们利用强大的大语言模型(LLLMs)来建立SET三重数据,并引入基于利益攸关方的排名,以构建一个能推广到一般情况的$levelance$衡量标准。我们的实验结果超越了以往的所有基线,并成为新的国家艺术,突出了利益攸关方信息在新闻文章中的影响。

Article 14

Title@2025-07-31 (4): How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Title: How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Wie kann ich meinen LLM-Benchmark veröffentlichen, ohne die wahren Antworten wegzugeben?

我怎样才能公布我的LLM基准而不给出正确的答案? 2505.18102v2

Authors (3): Takashi Ishida, Thanawat Lodkaew, Ikko Yamane

Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy will require trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. Our main idea is to inject randomness to the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. This reduces the best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only is this helpful to keep us from disclosing the ground truth, but this approach also offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model surpasses this ceiling despite this expectation, this is a strong signal of data contamination. We present experimental evidence that our method can detect data contamination accurately on a wide range of benchmarks, models, and training methodologies.

在互联网上公布一个大型语言模型(LLM)基准可能会污染未来的LLM:基准可能是无意的(或有意的)用于培训或选择一个模型。一个共同的缓解措施是保持基准的隐私,让参与者向组织者提交模型或预测。然而,这一战略需要信任一个组织,并仍然允许通过反复询问过度测试。为了克服这一问题,我们建议一种公布基准的方法,而不完全披露对问题的地面真相答案,同时保持公开评估LLMS的能力。我们的主要想法是通过编制几个逻辑正确的答案来给答案注入随机性,并且只将其中之一作为基准的解决方案。这降低了基准的最佳准确性,即Bayes准确性。这不仅有助于我们不披露地面真相,而且这一方法也为探测数据污染提供了一个测试。原则上,即使完全有能力的模型也不应该超过Bayes的准确性。尽管有这一预期,但模型超过这一上限,这是数据污染的强烈信号。我们提出实验性证据,说明我们的方法能够准确检测数据污染的范围很广的基准、培训模型和方法。
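The ceiling argument above can be worked through numerically. The sketch below assumes that each question has a known number of logically correct answers and that the published one is drawn uniformly at random, so the best achievable expected accuracy on a question with k candidates is 1/k. The contamination check is a plain one-sided binomial tail under the further assumption that every question shares the same k; both helpers are hypothetical and are not the paper's procedure.

```python
from math import comb


def bayes_accuracy(num_correct_answers):
    """Best possible expected accuracy when one of k_i valid answers per
    question is chosen uniformly at random (assumed sampling scheme)."""
    return sum(1 / k for k in num_correct_answers) / len(num_correct_answers)


def binomial_tail(n, successes, p):
    """P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(successes, n + 1))
```

With two valid answers per question the ceiling is 0.5, so a model scoring 80 out of 100 would have a very small value of `binomial_tail(100, 80, 0.5)`, which is the kind of signal the abstract describes as evidence of contamination.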

Article 15

Title@2025-07-31 (4): Splits! A Flexible Dataset and Evaluation Framework for Sociocultural Linguistic Investigation

Title: Splits! A Flexible Dataset and Evaluation Framework for Sociocultural Linguistic Investigation

Splits! Ein flexibler Datensatz und Evaluationsrahmen für die soziokulturelle Linguistische Untersuchung

社会文化语言调查灵活数据集和评价框架 2504.04640v2

Authors (3): Eylon Caplan, Tania Chakraborty, Dan Goldwasser

Variation in language use, shaped by speakers’ sociocultural background and specific context of use, offers a rich lens into cultural perspectives, values, and opinions. However, the computational study of these Sociocultural Linguistic Phenomena (SLP) has often been limited to bespoke analyses of specific groups or topics, hindering the pace of scientific discovery. To address this, we introduce Splits!, a 9.7 million-post dataset from Reddit designed for systematic and flexible research. The dataset contains posts from over 53,000 users across 6 demographic groups, organized into 89 discussion topics to enable comparative analysis. We validate Splits! via self-identification and by successfully replicating several known SLPs from existing literature. We complement this dataset with a framework that leverages efficient retrieval methods to rapidly validate potential SLPs (PSLPs) by automatically evaluating whether a given hypothesis is supported by our data. Crucially, to distinguish between novel and obvious insights, the framework incorporates a human-validated measure of a hypothesis’s “unexpectedness.” We demonstrate that the two-stage process reduces the number of statistically significant findings requiring manual inspection by a factor of 1.5-1.8x, streamlining the discovery of promising phenomena for further investigation.

语言使用的变化,由发言者的社会文化背景和使用的具体背景所决定,形成了语言使用的变化,为文化观点、价值观和观点提供了丰富的视角;然而,对这些社会文化语言特征(SLP)的计算研究往往仅限于对特定群体或专题进行简单分析,从而阻碍科学发现的速度。为此,我们引入了Slips!,这是Reddit为系统灵活研究设计的970万个数据集。数据集包含来自6个人口群体中53 000多个用户的职位,分为89个讨论专题,以便能够进行比较分析。我们通过自我识别和成功复制现有文献中若干已知的 SLPs,验证Slips!我们用一个框架来补充这一数据集,利用高效的检索方法快速验证潜在的SLPs(PLPs),方法是通过自动评估我们的数据是否支持一个给定的假设!关键是,为了区分新的和明显的洞察力,这个框架包含了一个人类有价值的假设“意外”的尺度。我们证明,两阶段进程减少了需要通过1.5.8个因素对具有前瞻性的统计意义的调查结果进行进一步精简。

Article 16

Title@2025-07-31 (4): ILID: Native Script Language Identification for Indian Languages

Title: ILID: Native Script Language Identification for Indian Languages

ILID: Native Script Language Identification für indische Sprachen

ILID:印第安人语言的土著脚本语言识别 2507.11832v2

Authors (2): Yash Ingle, Pruthwik Mishra

The language identification task is a crucial fundamental step in NLP. Often it serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. This becomes even harder in the case of diverse Indian languages that exhibit lexical and phonetic similarities, but have distinct differences. Many Indian languages share the same script, making the task even more challenging. Taking all these challenges into account, we develop and release a dataset of 250K sentences consisting of 23 languages including English and all 22 official Indian languages labeled with their language identifiers, where data in most languages are newly created. We also develop and release baseline models using state-of-the-art approaches in machine learning and fine-tuning pre-trained transformer models. Our models outperform the state-of-the-art pre-trained transformer models for the language identification task. The dataset and the code are available at https://yashingle-ai.github.io/ILID/ and in Huggingface open source libraries.

语言识别任务是国家语言平台中至关重要的基本步骤。语言识别任务通常是广泛使用的国家语言平台应用程序的预处理步骤,如多语种机器翻译、信息检索、问答和文本总和。语言识别的核心挑战在于杂音、短语和代码混合环境中的区分语言。如果印度多种语言表现出词汇和语音相似,但有不同之处,这更加困难。许多印度语言有着相同的脚本,使得任务更具挑战性。考虑到所有这些挑战,我们编制并发布一套由23种语言组成的250K句数据集,其中包括英文和所有22种印度官方语言及其语言标识符号,其中大多数语言的数据都是新创建的。我们还在机器学习和微调前训练变异模型中,使用最先进的方法开发和发布基线模型。我们的模型超越了用于语言识别任务的、最先进的预培训变异模型。数据集和代码可在https://yashingle-ai.github.io/LIID/和Hugging 公开源库中查阅。

Article 17

Title@2025-07-31 (4): Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates

Title: Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates

Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Assessments

具有不确定性估计值的临床试验的深入学习预测 2507.23607v1

Authors (4): Tien Huu Do, Antoine Masquelier, Nae Eoun Lee, Jonathan Crowther

Clinical trials are a systematic endeavor to assess the safety and efficacy of new drugs or treatments. Conducting such trials typically demands significant financial investment and meticulous planning, highlighting the need for accurate predictions of trial outcomes. Accurately predicting patient enrollment, a key factor in trial success, is one of the primary challenges during the planning phase. In this work, we propose a novel deep learning-based method to address this critical challenge. Our method, implemented as a neural network model, leverages pre-trained language models (PLMs) to capture the complexities and nuances of clinical documents, transforming them into expressive representations. These representations are then combined with encoded tabular features via an attention mechanism. To account for uncertainties in enrollment prediction, we enhance the model with a probabilistic layer based on the Gamma distribution, which enables range estimation. We apply the proposed model to predict clinical trial duration, assuming site-level enrollment follows a Poisson-Gamma process. We carry out extensive experiments on real-world clinical trial data, and show that the proposed method can effectively predict the number of patients enrolled at a number of sites for a given clinical trial, outperforming established baseline models.

临床试验是评估新药物或新疗法的安全和效能的系统性努力。进行这种试验通常需要大量的资金投资和仔细规划,强调准确预测试验结果的必要性。准确预测病人入学是试验成功的一个关键因素,这是规划阶段的主要挑战之一。在这项工作中,我们提出一种新的深层次的基于学习的方法来应对这一重大挑战。我们作为一种神经网络模型采用的方法,利用预先训练的语言模型(PLM)来捕捉临床文件的复杂性和细微差别,将其转化为直观的表述。然后,这些演示与通过关注机制编码的表格特征相结合。为了说明入学预测的不确定性,我们根据伽马分布,用概率层加强模型,从而能够进行范围估计。我们采用拟议的模型来预测临床试验期限,假设现场一级的招生遵循Poisson-Gamma进程。我们对现实世界临床试验数据进行了广泛的实验,并表明拟议的方法可以有效地预测在一定的临床试验地点注册的病人人数,超过既定基线模型。
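The Poisson-Gamma assumption mentioned above has a simple closed form that shows where the uncertainty estimate comes from. In the sketch below, a site's enrollment rate is taken to be Gamma-distributed with a given shape and rate, and the number of patients enrolled over a duration is Poisson given that rate. The parameterization and the function name are assumptions for illustration; the abstract does not specify how the model's Gamma layer is parameterized.

```python
def enrollment_moments(shape, rate, duration):
    """Mean and variance of the enrollment count at one site.

    Assumes lambda ~ Gamma(shape, rate) and N | lambda ~ Poisson(lambda * duration),
    so E[N] = duration * shape / rate and
    Var[N] = E[N] + duration**2 * shape / rate**2 (law of total variance).
    """
    mean = shape * duration / rate
    variance = mean + shape * duration**2 / rate**2
    return mean, variance
```

The second variance term is what a pure Poisson model would miss: it grows with the square of the duration, so uncertainty about the site's rate dominates for long trials.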

Article 18

Title@2025-07-31 (4): Inside-Out: Hidden Factual Knowledge in LLMs

Title: Inside-Out: Hidden Factual Knowledge in LLMs

Inside-Out: Verstecktes Sachwissen in LLMs

内外:LLM中隐藏的事实知识 2503.15299v3

Authors (8): Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, Roi Reichart

This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model’s observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average relative gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) put a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.

这项工作提供了一个框架,用于评估大型语言模型(LLMS)是否在其参数中比其产出所表述的内容更加真实的知识。虽然有几项研究暗示了这种可能性,但没有一项研究明确定义或展示这种现象。我们首先提出知识的正式定义,将它量化给一个特定问题,即正确答案对对答的分数,正确答案的分数较高。这产生了外部和内部知识,这取决于对个别应答人进行评分所使用的信息:要么是模型的可观测象征性概率,要么是其中间计算。当内部知识超过外部知识时,就会出现隐藏知识。然后我们提出案例研究,在封闭式QA设置中将这一框架应用于三种受欢迎的开放重量LMS。我们的结果显示:(1) LMS在内部一贯地将事实知识纳入比外表表示的更多内容,平均相对差距为40%。(2) 令人惊讶的是,一些知识的深度隐藏到模型内部可以完全了解答案,但即使大规模地重复抽样抽样,也未能产生这种知识。这揭示了三个广受欢迎的开放性 LLMSMS的深度测试中的基本限制,因为我们不断的深度的抽样测试是难解的常规测试,因此仍然有相当的深度的精确的试测测。
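The definition of knowledge given above (the fraction of correct-incorrect answer pairs in which the correct answer is ranked higher) translates directly into a few lines of code. This is a sketch under stated assumptions: the function name is hypothetical, ties are counted as failures, and where the scores come from (token-level probabilities for external knowledge, a probe on intermediate computations for internal knowledge) is left to the caller.

```python
def knowledge_score(scores, is_correct):
    """Fraction of (correct, incorrect) answer pairs where the correct
    answer receives the strictly higher score, for a single question."""
    correct = [s for s, ok in zip(scores, is_correct) if ok]
    incorrect = [s for s, ok in zip(scores, is_correct) if not ok]
    num_pairs = len(correct) * len(incorrect)
    if num_pairs == 0:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(c > i for c in correct for i in incorrect)
    return wins / num_pairs
```

Under this reading, hidden knowledge for a question shows up when the score computed from internal signals exceeds the score computed from output probabilities over the same set of candidate answers.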

Article 19

Title@2025-07-31 (4): DiffLoRA: Differential Low-Rank Adapters for Large Language Models

Title: DiffLoRA: Differential Low-Rank Adapters for Large Language Models

DiffLoRA: Differential-Low-Rank-Adapter für große Sprachmodelle

DiffLORA:用于大语言模型的差别型低兰克适应器 2507.23588v1

Authors (4): Alexandre Misrahi, Nadezhda Chirkova, Maxime Louis, Vassilina Nikoulina

Differential Transformer has recently been proposed to improve performance in Transformer models by canceling out noise through a denoiser attention mechanism. In this work, we introduce DiffLoRA, a parameter-efficient adaptation of the differential attention mechanism, with low-rank adapters on both positive and negative attention terms. This approach retains the efficiency of LoRA while aiming to benefit from the performance gains of differential attention. We evaluate DiffLoRA across a broad range of NLP tasks, including general benchmarks, many-shot in-context learning, RAG, and long-context tests. We observe that, although DiffLoRA falls short of other parameter-efficient fine-tuning methods in most evaluation tasks, it shows interesting results in certain domains (+11 pts on LoRA for HumanEval). We analyze the attention patterns post-finetuning to identify the reasons for this behavior.

最近有人提议采用差异变换器,通过取消音效的注意机制取消噪音,以提高变异器模型的性能。在这项工作中,我们引入了DiffLora,这是对差异注意机制的参数效率的调整,低级适应器具有正负两方面的注意。这种方法保持了Lora的效率,同时着眼于从不同注意的性能收益中获益。我们评估了DiffLora,执行了一系列广泛的国家劳工政策任务,包括一般基准、多发式的文字学习、RAG和长文测试。我们注意到,虽然DiffLora在大多数评价任务中都未达到其他参数效率的微调方法,但在某些领域(HenEval的LORA的+11 pts)取得了令人感兴趣的结果。我们分析了关注后调整模式,以确定这种行为的原因。
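As a rough illustration of the idea in the abstract above, the following NumPy sketch applies separate low-rank adapters to the positive and negative attention maps of a single differential-attention head. The exact placement of the adapters, the scalar `lam`, and all function names are assumptions for illustration; the paper's actual parameterization may differ.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def adapted(W, A, B):
    """Frozen weight W plus a trainable low-rank update B @ A."""
    return W + B @ A


def diff_lora_attention(X, Wq, Wk, Wv, pos, neg, lam):
    """Differential attention: positive map minus lam times negative map."""
    d = Wq.shape[1]

    def attention_map(adapters):
        Q = X @ adapted(Wq, *adapters["q"])
        K = X @ adapted(Wk, *adapters["k"])
        return softmax(Q @ K.T / np.sqrt(d))

    return (attention_map(pos) - lam * attention_map(neg)) @ (X @ Wv)


def init_adapters(d, r, rng):
    # Usual LoRA-style init: A small random, B zero, so the update starts at zero.
    return {name: (rng.normal(size=(r, d)) * 0.01, np.zeros((d, r)))
            for name in ("q", "k")}


rng = np.random.default_rng(0)
n, d, r = 4, 8, 2
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
pos, neg = init_adapters(d, r, rng), init_adapters(d, r, rng)
out = diff_lora_attention(X, Wq, Wk, Wv, pos, neg, lam=0.5)
```

With zero-initialized `B` matrices the two maps coincide, so the output starts as a scaled copy of ordinary attention and only diverges once the adapters are trained.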

Article 20

Title@2025-07-31 (4): T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text

Title: T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text

T-Detect: Tail-Aware Statistische Normalisierung zur robusten Erkennung von maschinengeneriertem Text

T-检测:用于对反转机制文本进行强力探测的尾件软件统计标准化 2507.23577v1

Authors (6): Alva West, Luodan Zhang, Liuliu Zhang, Minjun Zhu, Yixuan Weng, Yue Zhang

The proliferation of sophisticated text generation models necessitates the development of robust detection methods capable of identifying machine-generated content, particularly text designed to evade detection through adversarial perturbations. Existing zero-shot detectors often rely on statistical measures that implicitly assume Gaussian distributions, a premise that falters when confronted with the heavy-tailed statistical artifacts characteristic of adversarial or non-native English texts. This paper introduces T-Detect, a novel detection method that fundamentally redesigns the statistical core of curvature-based detectors. Our primary innovation is the replacement of standard Gaussian normalization with a heavy-tailed discrepancy score derived from the Student’s t-distribution. This approach is theoretically grounded in the empirical observation that adversarial texts exhibit significant leptokurtosis, rendering traditional statistical assumptions inadequate. T-Detect computes a detection score by normalizing the log-likelihood of a passage against the expected moments of a t-distribution, providing superior resilience to statistical outliers. We validate our approach on the challenging RAID benchmark for adversarial text and the comprehensive HART dataset. Experiments show that T-Detect provides a consistent performance uplift over strong baselines, improving AUROC by up to 3.9% in targeted domains. When integrated into a two-dimensional detection framework (CT), our method achieves state-of-the-art performance, with an AUROC of 0.926 on the Books domain of RAID. Our contributions are a new, theoretically justified statistical foundation for text detection, an ablation-validated method that demonstrates superior robustness, and a comprehensive analysis of its performance under adversarial conditions. Our code is released at https://github.com/ResearAI/t-detect.

复杂文本生成模型的激增要求开发可靠的检测方法,以识别机器生成的内容,特别是那些通过对抗性扰动来逃避检测的文本。现有的零样本检测器往往依赖隐含假设高斯分布的统计量,而面对对抗性文本或非英语母语者文本所特有的重尾统计特征时,这一前提便难以成立。本文提出T-Detect,这是一种新的检测方法,从根本上重新设计了基于曲率的检测器的统计核心。我们的主要创新是用源自学生t分布的重尾差异分数取代标准的高斯归一化。该方法的理论依据来自一项经验观察:对抗性文本表现出显著的尖峰厚尾性,使传统统计假设不再适用。T-Detect根据t分布的期望矩对一段文本的对数似然进行归一化,从而计算检测分数,对统计离群值具有更强的稳健性。我们在具有挑战性的对抗性文本基准RAID和综合性的HART数据集上验证了该方法。实验表明,T-Detect相对于强基线带来了稳定的性能提升,在特定领域将AUROC最多提高3.9%。当集成到二维检测框架(CT)中时,我们的方法达到了最先进的性能,在RAID的Books领域取得了0.926的AUROC。我们的贡献包括:一个新的、有理论依据的文本检测统计基础,一种经消融实验验证、稳健性更优的方法,以及对其在对抗条件下性能的全面分析。代码发布于 https://github.com/ResearAI/t-detect。
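The contrast between Gaussian and heavy-tailed normalization described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the released T-Detect code: the discrepancy is scaled by the standard deviation of a Student's t-distribution with `nu` degrees of freedom, and both function names and the default `nu` are hypothetical.

```python
import math


def gaussian_score(log_likelihood, expected, variance):
    """Standard z-score normalization of a log-likelihood discrepancy."""
    return (log_likelihood - expected) / math.sqrt(variance)


def t_score(log_likelihood, expected, variance, nu=5.0):
    """Heavy-tailed normalization: scale by a t-distribution's spread (assumed form).

    A Student's t-distribution with nu > 2 degrees of freedom has variance
    nu / (nu - 2) times its squared scale, so outliers are damped more than
    under the Gaussian assumption.
    """
    if nu <= 2:
        raise ValueError("nu must exceed 2 for the variance to be finite")
    return (log_likelihood - expected) / math.sqrt(variance * nu / (nu - 2.0))


# Hypothetical numbers for one passage.
z = gaussian_score(-42.0, -50.0, 16.0)
t = t_score(-42.0, -50.0, 16.0, nu=4.0)
```

Smaller `nu` means heavier tails and a stronger damping of extreme discrepancies; as `nu` grows the two scores converge.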

Article 21

Title@2025-07-31 (4): Neutral Residues: Revisiting Adapters for Model Extension

Title: Neutral Residues: Revisiting Adapters for Model Extension

Neutrale Rückstände: Adapter zur Modellerweiterung

中立残留物:重新审视适应器,用于示范推广 2410.02744v3

Authors (3): Franck Signe Talla, Edouard Grave, Hervé Jégou

We address the problem of extending a pretrained large language model to a new domain that was not seen during training. Standard techniques such as finetuning or low-rank adaptation (LoRA) are successful at domain adaptation but do not formally add capacity to the model. This often leads to a trade-off between performing well on the new domain and degrading performance on the original domain. Here, we revisit and improve adapters to extend LLMs from three angles: data, architecture and training procedure, which are advantageously considered jointly. The resulting method, called neutral residues, modifies adapters in a way that leads each new residual block to output near-zeros on the original domain. This solution leads to strong results when adapting a state-of-the-art model originally trained on English to a new language. Neutral residues significantly outperform competing approaches such as finetuning, LoRA or vanilla adapters in terms of the trade-off between learning the new language and not forgetting English.

我们处理将预先培训的大型语言模式扩大到培训期间没有看到的新领域的问题。标准技术,如微调或低级别适应(LORA)在领域适应方面是成功的,但并不正式增加模型能力。这往往导致在新领域表现良好与原始领域表现有辱人格之间取舍。在这里,我们重新审视和改进适应器,将LLMS从三个角度扩大:数据、结构和培训程序,这三者是共同考虑的优势。由此产生的方法,称为中性残留物,将适应器改变为导致每个新的剩余块到原始领域接近零的输出。当将最初接受英语培训的先进模型改造成新语言时,这一解决方案将带来强有力的结果。中性残留物在学习新语言和不忘英语之间的交易中,大大超越了微调、LORA或香草适应器等相互竞争的方法。

Article 22

Title@2025-07-31 (4): Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation

Title: Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation

Kann LLMs mit Ambiguity helfen? Eine quantitative Bewertung verschiedener großer Sprachmodelle auf Word Sense Disambiguation

LLM能否帮助消除歧义? 各种大语言模型在词义消歧(Word Sense Disambiguation)上的定量评估 2411.18337v4

Authors (3): T. G. D. K. Sumanathilaka, Nicholas Micallef, Julian Hough

Ambiguous words are often found in modern digital communications. Lexical ambiguity challenges traditional Word Sense Disambiguation (WSD) methods because of limited data. Consequently, the efficiency of translation, information retrieval, and question-answering systems is hindered by these limitations. This study investigates the use of Large Language Models (LLMs) to improve WSD using a novel approach that combines a systematic prompt augmentation mechanism with a knowledge base (KB) consisting of different sense interpretations. The proposed method incorporates a human-in-the-loop approach for prompt augmentation, where the prompt is supported by Part-of-Speech (POS) tagging, synonyms of ambiguous words, aspect-based sense filtering, and few-shot prompting to guide the LLM. By utilizing a few-shot Chain-of-Thought (CoT) prompting approach, this work demonstrates a substantial improvement in performance. The evaluation was conducted using FEWS test data and sense tags. This research advances accurate word interpretation in social media and digital communication.

现代数字通信中经常出现含糊不清的词句。由于数据有限,传统的Word Sense Dismendation(WSD)方法存在明显的模糊性。因此,翻译、信息检索和问答系统的效率受到这些限制的阻碍。本研究报告调查了使用大语言模型(LLMs)改进WSD的新办法,将系统的迅速增强机制与知识库(KB)相结合,由不同感知解释组成。拟议方法包括了快速增强的“人与人”方法,这种方法得到“语言部分”标记、模糊词的同义词、基于侧面感的过滤以及指导LLM的几发提示。通过采用“几发式思维链(COT)”的提示性方法,这项工作表明业绩有了很大的改进。评价是利用FEWS测试数据和感官标记进行的。这项研究在社会媒体和数字通信中促进了准确的字解释。
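The prompt augmentation described above (POS tag, synonyms, sense filtering, few-shot examples) can be sketched as a small prompt builder. Everything here is a hypothetical illustration: the function name, the dictionary fields, the sense identifiers, and the use of a POS match as the filtering step are assumptions, not the authors' pipeline.

```python
def build_wsd_prompt(sentence, target, pos_tag, synonyms, senses, examples):
    """Assemble a few-shot, chain-of-thought style WSD prompt (illustrative only)."""
    # Sense filtering (assumed here to be a simple POS match against the KB entries).
    candidates = [s for s in senses if s["pos"] == pos_tag]

    lines = [
        "Task: choose the sense of the target word that fits the sentence.",
        "Reason step by step, then answer with the sense id only.",
        "",
    ]
    for ex in examples:  # few-shot demonstrations
        lines += [
            f"Sentence: {ex['sentence']}",
            f"Target: {ex['target']}",
            f"Reasoning: {ex['reasoning']}",
            f"Answer: {ex['sense_id']}",
            "",
        ]
    lines += [
        f"Sentence: {sentence}",
        f"Target: {target} (POS: {pos_tag})",
        f"Synonyms: {', '.join(synonyms)}",
        "Candidate senses:",
    ]
    lines += [f"  {s['id']}: {s['gloss']}" for s in candidates]
    lines.append("Reasoning:")
    return "\n".join(lines)


# Hypothetical knowledge-base entries and demonstration.
senses = [
    {"id": "bank.n.01", "pos": "NOUN", "gloss": "a financial institution"},
    {"id": "bank.n.02", "pos": "NOUN", "gloss": "sloping land beside a river"},
    {"id": "bank.v.01", "pos": "VERB", "gloss": "to tilt an aircraft sideways"},
]
examples = [{
    "sentence": "He swung the bat hard.",
    "target": "bat",
    "reasoning": "Swinging implies sports equipment, not an animal.",
    "sense_id": "bat.n.02",
}]
prompt = build_wsd_prompt(
    "She sat on the bank and watched the water.",
    "bank", "NOUN", ["shore", "riverside"], senses, examples,
)
```

The resulting string would then be sent to whichever LLM is being evaluated; the model call itself is deliberately left out.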

Article 23

Title@2025-07-31 (4): Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

Title: Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

Med-R$^3$: Verbesserung der medizinischen Retrieval-Augmented Reasoning von LLMs durch Progressive Verstärkung Lernen

Med-R$^3$:通过渐进式强化学习增强LLM的医疗检索增强推理 2507.23541v1

Authors (8): Keer Lu, Zheng Liang, Youquan Li, Jiejun Tan, Da Pan, Shusen Zhang, Guosheng Dong, Huang Leng

In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Existing work, despite its potential, has predominantly focused on enhancing either the retrieval or the reasoning capabilities of models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored improving retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce Med-R$^3$, a Medical Retrieval-augmented Reasoning framework driven by progressive Reinforcement learning. In this framework, we first develop the model’s ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of the knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model’s retrieval and reasoning coordination. Extensive experiments indicate that Med-R$^3$ could achieve state-of-the-art performance, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-source GPT-4o-mini by 3.93% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53%.

在医疗场景中,有效检索外部知识并利用其进行严谨的逻辑推理十分重要。现有工作主要侧重于单独增强模型的检索能力或推理能力,很少关注二者的联合优化,导致两个过程之间协调有限。此外,目前的方法严重依赖监督微调(SFT),这可能使模型记忆已有的解题路径,从而在面对新的问题情境时限制其泛化能力。再者,虽然已有研究尝试通过强化学习改进通用领域的检索增强推理,但其奖励函数设计未能充分体现医疗领域的特定需求。为应对这些挑战,我们提出Med-R$^3$,一个由渐进式强化学习驱动的医疗检索增强推理框架。在该框架中,我们首先培养模型对医疗问题进行逻辑推理的能力;随后在此基础上自适应地优化检索能力,使其在整个推理过程中更好地契合知识库的特点和外部信息的利用方式;最后对模型的检索与推理协调进行联合优化。大量实验表明,Med-R$^3$可以达到最先进的性能:LLaMA3.1-8B-Instruct + Med-R$^3$在可比参数规模下超过闭源的GPT-4o-mini 3.93%,而采用Med-R$^3$的Qwen2.5-14B则取得了13.53%的更大提升。

Article 24

Title@2025-07-31 (4): PurpCode: Reasoning for Safer Code Generation

Title: PurpCode: Reasoning for Safer Code Generation

PurpCode: Begründung für eine sicherere Code-Generierung

PurpCode:更安全代码生成的理由 2507.19060v2

Authors (14): Jiawei Liu, Nirav Diwan, Zhe Wang, Haoyu Zhai, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Muntasir Wahed, Yinlin Deng, Hadjer Benkraouda, Yuxiang Wei, Lingming Zhang, Ismini Lourentzou, Gang Wang

We introduce PurpCode, the first post-training recipe for training safe code reasoning models towards generating secure code and defending against malicious cyberactivities. PurpCode trains a reasoning model in two stages: (i) Rule Learning, which explicitly teaches the model to reference cybersafety rules to generate vulnerability-free code and to avoid facilitating malicious cyberactivities; and (ii) Reinforcement Learning, which optimizes model safety and preserves model utility through diverse, multi-objective reward mechanisms. To empower the training pipelines with comprehensive cybersafety data, we conduct internal red-teaming to synthesize comprehensive and high-coverage prompts based on real-world tasks for inducing unsafe cyberactivities in the model. Based on PurpCode, we develop a reasoning-based coding model, namely PurpCode-32B, which demonstrates state-of-the-art cybersafety, outperforming various frontier models. Meanwhile, our alignment method decreases the model overrefusal rates in both general and cybersafety-specific scenarios, while preserving model utility in both code generation and common security knowledge.

我们提出PurpCode,这是首个用于训练安全代码推理模型的后训练方案,旨在生成安全代码并防范恶意网络活动。PurpCode分两个阶段训练推理模型:(一) 规则学习,明确教导模型参照网络安全规则,以生成无漏洞的代码并避免为恶意网络活动提供便利;(二) 强化学习,通过多样化的多目标奖励机制优化模型安全性并保持模型效用。为了给训练流程提供全面的网络安全数据,我们开展内部红队测试,基于真实世界任务合成全面且高覆盖率的提示,用以诱发模型中的不安全网络活动。基于PurpCode,我们开发了基于推理的代码模型PurpCode-32B,其网络安全性达到最先进水平,优于多种前沿模型。同时,我们的对齐方法降低了模型在通用场景和网络安全特定场景中的过度拒绝率,并在代码生成和通用安全知识两方面保持了模型效用。

Article 25

Title@2025-07-31 (4): MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

Title: MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

MECAT: Ein Multi-Experten-Benchmark für feinkörnige Audio-Verstandsaufgaben

MECAT: 完善的音频理解任务多专家基准 2507.23511v1

Authors (10): Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan

While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat

虽然大型的音频模型提高了开放的音频理解程度,但它们仍然没有达到细微的人类理解水平,这一差距依然存在,主要是因为目前的基准受到数据说明和评价指标的限制,无法可靠地区分通用和高度详细的模型产出。为此,这项工作引入了MECAT, 即精细读音频理解任务多专家构建基准。通过将专业专家模型的分析与深层次语言链大模型推理相结合的管道生成的,MECAT提供了多视角、精细的字幕和开放式的问答配对。该基准得到了新的指标的补充:DATE(差异性-强化音频文本评价)。该指标将单模类语义相似性和交叉分布相容性结合起来,对通用术语和详细描述进行处罚。还介绍了对最新音频模型的全面评价,提供了对其当前能力和限制的新认识。数据和代码可在https://github.com/sioomim-commexemexearch/cat。

Article 26

Title@2025-07-31 (4): LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Title: LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

LLaVA-MORE: Eine vergleichende Studie von LLMs und visuellen Backbones für verbesserte visuelle Instruktions-Tuning

LLAVA-MORE:用于强化视觉教学的LLM和视觉背骨比较研究 2503.15621v2

Authors (7): Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs – including Phi-4, LLaMA-3.1, and Gemma-2 – to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.

在多式大型语言模型(MLLM)中,最近的进展突出了视觉骨干和基本语言模型的关键作用。虽然先前的工作主要侧重于将这些组成部分扩大到数十亿个参数,但模型规模、结构和性能之间的权衡取舍仍未得到充分探讨。此外,培训数据和评价协议的不一致妨碍了直接比较,使得很难得出最佳设计选择。在本文件中,我们引入了LalVA-MORE,这是MLLMM的新组合,将最新语言模型与不同的视觉骨干结合起来。为了确保公平比较,我们采用了在所有建筑中一致应用的统一培训协议。我们的分析系统地探索中小型LLMS – – 包括Phi-4、Lama-3.1和Gemma-2 – – 来评估模型规模、结构和性能之间的权衡,同时审查模型规模与性能之间的关系。除了评估LLLMMMMMM(LIP)对最终结果的影响外,我们还对各种视觉模型进行了全面研究,从基于CLIP的架构到DINO2、SigLIP和SigLIP-2等替代方法。我们的分析系统地探索中中,进一步研究了中我们经过培训的图像分析的模型的模型的模型分析结果,为我们的图像分析框架的模型的改进提供了更多的分析结果的改进。

Article 27

Title@2025-07-31 (4): A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Title: A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Ein neuartiger Bewertungs-Benchmark für medizinische LLMs: Beleuchtende Sicherheit und Wirksamkeit in klinischen Bereichen

医疗LLMs新颖的评价基准:临床域的引明安全和有效性 2507.23486v1

Authors (38): Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu

Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.

大型语言模型(LLMS)在临床决策支持方面很有希望,但在安全评估和有效性验证方面面临重大挑战。我们制定了临床安全有效双轨基准(CSEDB),这是一个基于临床专家共识的多层面框架,包括30项标准,涵盖关键疾病识别、遵守准则、药品安全等关键领域,并附有加权后果措施。32名专家医生制定并审查了符合这些标准的2 069个开放的A项目,涵盖26个临床部门,以模拟现实世界情景。对6个LMS的基准测试显示,总体绩效中等(平均共57.2%、安全54.7%、有效性62.3%),高风险情景中绩效显著下降13.3%(p < 0.0001 ),具体针对特定领域的医疗LMS显示,业绩优于通用模式,安全性最高分数(0.912)和有效性(0.861)相对较高。该研究的结果不仅为评价医疗LMS临床应用提供了标准化衡量标准,便利比较分析、风险暴露识别和改善不同情景方向,而且还有可能促进在医疗保健环境中更安全和更有效地部署大型语言模型。

Article 28

Title@2025-07-31 (4): Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Title: Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Role-Aware Sprachmodelle für sichere und kontextualisierte Zugriffskontrolle in Organisationen

各组织内安全和环境化出入控制使用控制实用语言模式 2507.23465v1

Authors (7): Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Vera, Muhammad Dehan Al Kautsar, Fajri Koto

As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.

随着大型语言模型(LLMs)越来越多地被应用于企业环境,控制基于用户作用的模型行为成为一项基本要求。现有安全方法通常采取统一的准入方式,侧重于预防有害或有毒产出,而不解决特定角色准入限制。在这项工作中,我们调查LMs是否可以进行微调,以产生反映与不同组织作用相关的准入特权的应对措施。我们探讨三种示范战略:基于BERT的分类器、基于LLM的分类器和基于角色的生成。为了评估这些方法,我们建立了两个互补数据集。第一个数据集是通过集群和角色标签从现有的指令调整公司团体中改编的,第二个是合成生成的,以反映现实的、对角色敏感的企业情景。我们评估不同组织结构的示范业绩,分析快速注入、角色错配和破例尝试的稳健性。

Article 29

Title: Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

Counterfactual Evaluation für Blindangriffserkennung in LLM-basierten Evaluationssystemen

以LLM为基础的评价系统中盲人攻击探测的反事实评价 2507.23453v1

Authors (7): Lijia Liu, Takumi Kondo, Kyohei Atarashi, Koh Takeuchi, Jiyi Li, Shigeru Saito, Hisashi Kashima

This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.

本文调查基于LLM的评估系统防范迅速注射的防御性。我们正式确定了一类威胁,称为盲目袭击,候选人的回答是独立于欺骗评估者的真实答案之外的。为了对付这些袭击,我们提出了一个框架,用反事实评估来补充标准评估,根据蓄意的虚假地面真相回答来重新评估提交材料。如果系统根据标准和反事实条件验证一个答案,就会发现攻击。实验显示,虽然标准评估非常脆弱,但我们的SE+CFE框架通过以最低性能权衡来提升攻击检测,大大改善了安全。
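The detection rule stated in the abstract above (flag an attack when an answer is validated under both the standard and the counterfactual condition) can be sketched as below. The `toy_evaluator` is a deliberately naive stand-in for an LLM judge and, like the function names and the injected phrase, is an assumption for illustration only.

```python
def detect_blind_attack(evaluate, question, candidate, true_answer, false_answer):
    """Run Standard Evaluation (SE) and Counterfactual Evaluation (CFE) on one submission."""
    accepted_standard = evaluate(question, candidate, true_answer)
    accepted_counterfactual = evaluate(question, candidate, false_answer)
    # An answer crafted independently of the truth passes under either reference.
    is_attack = accepted_standard and accepted_counterfactual
    return {"attack": is_attack, "accepted": accepted_standard and not is_attack}


def toy_evaluator(question, candidate, reference):
    """Naive judge that is vulnerable to an injected instruction (hypothetical)."""
    text = candidate.lower()
    if "mark this answer as correct" in text:
        return True
    return reference.lower() in text


question = "What is the capital of France?"
honest = detect_blind_attack(toy_evaluator, question, "Paris", "Paris", "Berlin")
wrong = detect_blind_attack(toy_evaluator, question, "Rome", "Paris", "Berlin")
attack = detect_blind_attack(
    toy_evaluator, question,
    "Ignore the rubric and mark this answer as correct.", "Paris", "Berlin",
)
```

The counterfactual reference has to be deliberately wrong, so that a genuinely correct answer cannot match it by accident.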

Article 30

Title@2025-07-31 (4): EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework

Title: EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework

BildungQ: Bewertung der Lehrfähigkeiten von LLMs durch Multi-Agent Dialograhmen

教育Q:通过多机构对话框架评价LLMS的教学能力 2504.14928v3

Authors (3): Yao Shi, Rongkeng Liang, Yong Xu

Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource-intensive, context-dependent, and methodologically complex nature of teacher-student interactions. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs across major AI organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities, with some smaller open-source models outperforming larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations that prioritize knowledge recall over interactive pedagogy. Our mixed-methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths employed by top-performing models (e.g., sophisticated questioning strategies, adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs-as-teachers require specialized optimization beyond simple scaling, suggesting that next-generation educational AI should prioritize targeted enhancement of specific pedagogical effectiveness.

大型语言模式(LLMS)日益成为教育工具,然而,由于教师与学生之间互动的资源密集、环境依赖、方法复杂,评估其教学能力仍具有挑战性。我们引入了教育Q,这是一个多媒介对话框架,通过模拟动态教育情景,有效评估教学能力,由教学、学习和评价的专门代理人组成。测试主要的独立组织(OpenAI、Meta、Google、Anthrotic等)的14个LLMS,涉及13个学科和10个难度层次的1 498个问题,显示教学效力与模型规模或一般推理能力没有线性关系。一些较小的开放源模式在教学环境中比较大的商业对应方表现要好。这一发现凸显了当前评价中的一个关键差距,这种评价将知识的回顾放在互动教学、学习和评价的优先地位之上。我们混合方法的评价,将定量指标与定性分析和专家案例研究相结合,确定了最高业绩模型(例如精密的问询策略、适应性反馈机制)所使用的不同教学优势。人类专家评价表明,78%的人同意我们对有效教学行为进行自动化的质量分析,验证我们下一个方法。

Article 31

Title@2025-07-31 (4): The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models

Title: The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models

Der Pragmatische Geist der Maschinen: Auf der Spur des Entstehens der Pragmatischen Kompetenz in großen Sprachmodellen

机器的实用思维:追踪大语言模式中实用能力的出现 2505.18497v2

Authors (6): Kefan Yu, Qingcheng Zeng, Weihao Xuan, Wanxin Li, Jingyi Wu, Rob Voigt

Current large language models (LLMs) have demonstrated emerging capabilities in social intelligence tasks, including implicature resolution and theory-of-mind reasoning, both of which require substantial pragmatic understanding. However, how LLMs acquire this pragmatic competence throughout the training process remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two equally plausible yet pragmatically divergent continuations and requires the model to (i) infer the speaker’s intended meaning and (ii) explain when and why a speaker would choose one utterance over its alternative, thus directly probing pragmatic competence through contrastive reasoning. We systematically evaluate 22 LLMs across 3 key training stages: after pre-training, supervised fine-tuning (SFT), and preference optimization, to examine the development of pragmatic competence. Our results show that even base models exhibit notable sensitivity to pragmatic cues, which improves consistently with increases in model and data scale. Additionally, SFT and RLHF contribute further gains, particularly in cognitive-pragmatic scenarios. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.

当前的大型语言模型(LLM)在社会智能任务中展现出新兴能力,包括含义推断和心智理论推理,二者都需要相当程度的语用理解。然而,LLM在训练过程中如何获得这种语用能力,目前仍知之甚少。在这项工作中,我们提出ALTPRAG,一个以语用学中的备选项概念为基础的数据集,用于评估处于不同训练阶段的LLM能否准确推断说话者的细微意图。每个实例配对两个同样合理但语用上不同的续写,要求模型(一) 推断说话者的意图含义,(二) 解释说话者何时以及为何会选择一种表达而非另一种,从而通过对比推理直接考察语用能力。我们系统评估了处于三个关键训练阶段(预训练之后、监督微调(SFT)之后和偏好优化之后)的22个LLM,以考察语用能力的发展。结果表明,即使是基础模型也对语用线索表现出明显的敏感性,并且这种敏感性随模型和数据规模的增大而持续提高。此外,SFT和RLHF带来了进一步的提升,在认知语用场景中尤为明显。这些发现表明语用能力是LLM训练中涌现的、组合性的属性,并为使模型与人类交际规范对齐提供了新的见解。

Article 32

Title@2025-07-31 (4): Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

Title: Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

Über passives kritisches Denken hinaus: Förderung proaktiver Befragungen zur Verbesserung der Mensch-KI-Kollaboration

超越被动的批判性思维:促进主动提问,以加强人类与AI的协作 2507.23407v1

Authors (7): Ante Wang, Yujie Lin, Jingyao Liu, Suhang Wu, Hao Liu, Xinyan Xiao, Jinsong Su

Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, and the smaller models struggle most. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting Qwen3-1.7B’s accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.

批判性思维对于构建稳健的AI系统至关重要,可防止其盲目接受有缺陷的数据或带偏见的推理。然而,先前的工作主要侧重于被动批判性思维,即模型只是拒绝有问题的查询,而没有采取建设性步骤来回应用户请求。在这项工作中,我们提出主动批判性思维,在这一范式下,模型主动向用户寻求缺失的信息或澄清,以更好地解决其查询。为评估这一能力,我们提出GSM-MC和GSM-MCE,这是两个基于GSM8K的新基准,用于评估在信息不完整或具有误导性的条件下的数学推理。GSM-MC包含1,368道被刻意删除了一个关键变量的数学题,要求模型识别并请求缺失的信息。GSM-MCE通过引入无关细节进一步增加难度,以测试模型抗干扰的稳健性。在Qwen3和Llama系列模型上的实验表明,尽管这些模型得益于大量的后训练和推理时扩展,在传统推理任务上表现出色,但它们在主动批判性思维方面表现不佳,较小的模型尤其如此。不过,我们证明强化学习(RL)可以显著提升这一能力。借助我们改进的RL算法,我们取得了大幅提升,将Qwen3-1.7B在GSM-MC上的准确率从0.15%提高到73.98%。我们希望这项工作能推动模型通过主动批判性思维在解决问题时与用户更有效地协作。

Article 33

Title@2025-07-31 (4): RAVine: Reality-Aligned Evaluation for Agentic Search

Title: RAVine: Reality-Aligned Evaluation for Agentic Search

RAVine: Realitätsorientierte Bewertung für die Agentische Suche

RAVine: 面向智能体搜索的贴近现实的评估 2507.16725v2

Authors (4): Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao

Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine, a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines the model’s interaction with search tools throughout the iterative process and accounts for efficiency factors. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.

作为一种更自主和适应性的增强检索模式,机械搜索正在推动智能搜索系统的演进,但是,现有的评价框架未能与代理搜索的目标相一致。首先,当前基准中常用的复杂查询往往与现实用户搜索情景不同。其次,先前的方法往往在为端到端评价提取地面真相时引入噪音,导致微小评估的扭曲。第三,大多数当前框架仅侧重于最终答案的质量,忽视了对代理搜索所固有的迭接过程的评价。为克服这些限制,我们提议了RAVine – – 一个用于搜索的代理LLMS的真实性-统一电子估价框架。RAVine针对更好地反映用户意图的多点查询和长式答案,并提出了可归属的地面真相构建战略,以提高精细评估的准确性。此外,RAVine还检查了模型在整个迭接过程中与搜索工具的相互作用,忽略了对效率因素的核算。我们用RAVine为一系列模型设定基准,并提出了若干洞察力,我们希望这将推动代理搜索系统的发展。

Article 34

Title@2025-07-31 (4): Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

Title: Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

Verbesserte arabische Text-Retrieval mit aufmerksamer Relevanz Scoring

阿拉伯强化文本检索, 带有启动相关性显示器 2507.23404v1

Authors (5): Salah Eddine Bekhouche, Azeddine Benlamoudi, Yazid Bounab, Fadi Dornaika, Abdenour Hadid

Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions. The code is made publicly available on GitHub at https://github.com/Bekhouche/APR.

阿拉伯语因其复杂的形态、可选的变音符号以及现代标准阿拉伯语(MSA)与各种方言并存,对自然语言处理(NLP)和信息检索(IR)构成了特别的挑战。尽管阿拉伯语在全球的重要性日益增长,但它在 NLP 研究和基准资源中的代表性仍然不足。在本文中,我们提出了一个专门为阿拉伯语开发的增强型稠密段落检索(DPR)框架。我们方法的核心是一种新的注意力相关性评分(ARS),它用自适应评分函数取代标准的交互机制,从而更有效地建模问题与段落之间的语义相关性。我们的方法结合了预训练的阿拉伯语语言模型和架构改进,以提升检索性能,并在回答阿拉伯语问题时显著提高排序准确率。代码已在 GitHub 上公开:https://github.com/Bekhouche/APR。

Article 35

Title@2025-07-31 (4): MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

Title: MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

MRGSEM-Sum: Ein unbeaufsichtigtes Multi-Dokument Zusammenfassungsrahmen basierend auf Multi-Relational Graphen und struktureller Entropie Minimierung

MRGSEM-Sum:基于多关系图和结构元件最小化的无人监督的多文件概括框架 2507.23400v1

Authors (6): Yongbing Zhang, Fang Nan, Shengxiang Gao, Yuxin Huang, Kaiwen Tan, Zhengtao Yu

The core challenge faced by multi-document summarization is the complexity of relationships among documents and the presence of information redundancy. Graph clustering is an effective paradigm for addressing this issue, as it models the complex relationships among documents using graph structures and reduces information redundancy through clustering, achieving significant research progress. However, existing methods often only consider single-relational graphs and require a predefined number of clusters, which hinders their ability to fully represent rich relational information and adaptively partition sentence groups to reduce redundancy. To overcome these limitations, we propose MRGSEM-Sum, an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization. Specifically, we construct a multi-relational graph that integrates semantic and discourse relations between sentences, comprehensively modeling the intricate and dynamic connections among sentences across documents. We then apply a two-dimensional structural entropy minimization algorithm for clustering, automatically determining the optimal number of clusters and effectively organizing sentences into coherent groups. Finally, we introduce a position-aware compression mechanism to distill each cluster, generating concise and informative summaries. Extensive experiments on four benchmark datasets (Multi-News, DUC-2004, PubMed, and WikiSum) demonstrate that our approach consistently outperforms previous unsupervised methods and, in several cases, achieves performance comparable to supervised models and large language models. Human evaluation demonstrates that the summaries generated by MRGSEM-Sum exhibit high consistency and coverage, approaching human-level quality.

多文档摘要面临的核心挑战在于文档之间关系的复杂性以及信息冗余的存在。图聚类是解决这一问题的有效范式:它利用图结构对文档之间的复杂关系建模,并通过聚类减少信息冗余,已取得显著的研究进展。然而,现有方法往往只考虑单一关系图,并且需要预先设定簇的数量,这妨碍了它们充分表示丰富的关系信息以及自适应地划分句子组以减少冗余的能力。为了克服这些局限,我们提出了 MRGSEM-Sum,一个基于多关系图和结构熵最小化的无监督多文档摘要框架。具体而言,我们构建了一个融合句子间语义关系和篇章关系的多关系图,全面建模跨文档句子之间复杂而动态的联系。随后,我们采用二维结构熵最小化算法进行聚类,自动确定最优簇数,并将句子有效地组织成连贯的组。最后,我们引入一种位置感知的压缩机制来提炼每个簇,生成简明而信息丰富的摘要。在四个基准数据集(Multi-News、DUC-2004、PubMed 和 WikiSum)上的大量实验表明,我们的方法始终优于先前的无监督方法,并在若干情况下达到了与有监督模型和大语言模型相当的性能。人工评估表明,MRGSEM-Sum 生成的摘要具有较高的一致性和覆盖度,接近人类水平的质量。
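
To make the clustering objective in the abstract above more concrete, here is a minimal sketch that only evaluates the two-dimensional structural entropy of a given partition. It assumes a plain unweighted, undirected graph and the standard definition from the structural information literature; the paper's multi-relational edge weights and its actual minimization procedure are not reproduced, and the function name and toy graph are illustrative.

```python
import math
from collections import defaultdict


def structural_entropy_2d(edges, partition):
    """Two-dimensional structural entropy of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs.
    partition: dict mapping each node to a cluster id.
    Lower values mean the partition explains the graph structure better.
    """
    degree = defaultdict(int)
    volume = defaultdict(int)  # sum of node degrees inside each cluster
    cut = defaultdict(int)     # edges with exactly one endpoint in the cluster
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if partition[u] != partition[v]:
            cut[partition[u]] += 1
            cut[partition[v]] += 1
    for node, d in degree.items():
        volume[partition[node]] += d
    total = sum(degree.values())  # equals twice the number of edges

    entropy = 0.0
    # Uncertainty of locating a node once its cluster is known.
    for node, d in degree.items():
        entropy -= (d / total) * math.log2(d / volume[partition[node]])
    # Uncertainty of a random walk entering each cluster.
    for cluster, vol in volume.items():
        entropy -= (cut[cluster] / total) * math.log2(vol / total)
    return entropy


# Toy graph (an assumption for illustration): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
scrambled = {0: "a", 3: "a", 1: "b", 4: "b", 2: "c", 5: "c"}
good = structural_entropy_2d(edges, by_triangle)
bad = structural_entropy_2d(edges, scrambled)
```

On this toy graph the partition that follows the two triangles scores lower than the scrambled one, which is the property a minimization algorithm would exploit to pick both the grouping and the number of clusters.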

Article 36

Title@2025-07-31 (4): Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

Title: Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

Beyond the Cloud: Bewertung der Vorteile und Nachteile lokaler LLM-Einsatzmöglichkeiten für Übersetzer

云云之外:评估为笔译员部署当地LLM的利弊 2507.23399v1

Authors (1): Peter Sandrini

The rapid proliferation of Large Language Models presents both opportunities and challenges for the translation field. While commercial, cloud-based AI chatbots have garnered significant attention in translation studies, concerns regarding data privacy, security, and equitable access necessitate exploration of alternative deployment models. This paper investigates the feasibility and performance of locally deployable, free language models as a viable alternative to proprietary, cloud-based AI solutions. This study evaluates three open-source models installed on CPU-based platforms and compared against commercially available online chat-bots. The evaluation focuses on functional performance rather than a comparative analysis of human-machine translation quality, an area already subject to extensive research. The platforms assessed were chosen for their accessibility and ease of use across various operating systems. While local deployment introduces its own challenges, the benefits of enhanced data control, improved privacy, and reduced dependency on cloud services are compelling. The findings of this study contribute to a growing body of knowledge concerning the democratization of AI technology and inform future research and development efforts aimed at making LLMs more accessible and practical for a wider range of users, specifically focusing on the needs of individual translators and small businesses.

大语言模型的迅速扩散为翻译领域带来了机遇和挑战。虽然商业的、以云为基础的人工聊天机在翻译研究中引起了极大关注,但对于数据隐私、安全和公平获取的关切要求探索替代部署模式。本文件调查了当地可部署的免费语言模型的可行性和性能,作为专有的、以云为基础的人工智能解决方案的可行替代方案。本研究报告评估了在基于CPU的平台上安装的三种开放源模式,对照商业上可提供的在线聊天机进行比较。评价的重点是功能性业绩,而不是对已经广泛研究的人力资源翻译质量进行比较分析。所评估的平台是根据其可及性和在各种操作系统中的易用性而选择的。虽然当地部署带来了自己的挑战,但加强数据控制、改进隐私和减少对云服务的依赖的好处是令人信服的。本研究报告的结论有助于增加关于AI技术民主化的知识,并为今后旨在使更多的用户更容易接触和实用的LLMS进行研究和开发努力提供信息,特别是侧重于个别笔译员和小企业的需要。

Article 37

Title@2025-07-31 (4): Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

Title: Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

Causal2Vec: Verbessere Dekoder-nur LLMs als vielseitige Einbettungsmodelle

Causal2Vec:改进只有解码器的LLMs作为Versatile嵌入模型 2507.23386v1

Authors (3): Ailiang Lin, Zhuoyun Li, Kotaro Funakoshi

Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model’s ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM’s input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.

仅解码器的大语言模型(LLM)正越来越多地被用于构建嵌入模型,将自然语言文本的语义信息有效编码为稠密向量表示,以服务于各种嵌入任务。然而,许多现有方法主要着眼于移除 LLM 中的因果注意力掩码以实现双向注意力,这可能削弱模型提取预训练期间所获语义信息的能力。此外,领先的单向方法往往依赖额外的输入文本来克服因果注意力的固有局限,不可避免地增加了计算成本。在这项工作中,我们提出了 Causal2Vec,一个通用嵌入模型,旨在不改变原始架构、不引入显著计算开销的前提下提升仅解码器 LLM 的性能。具体而言,我们首先使用一个轻量级的 BERT 风格模型将输入文本预编码为单个 Contextual 标记,再将其添加到 LLM 输入序列的开头,使每个标记即使不关注后续标记也能获取上下文化的信息。此外,为了减轻末标记池化带来的近因偏差,并帮助 LLM 更好地利用 Contextual 标记中编码的语义信息,我们将 Contextual 标记和 EOS 标记的最后隐藏状态拼接起来,作为最终的文本嵌入。在实践中,Causal2Vec 在仅使用公开检索数据集训练的模型中,于大规模文本嵌入基准(MTEB)上取得了最先进的性能,同时与表现最好的方法相比,所需序列长度最多减少 85%,推理时间最多减少 82%。
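
The pooling step described in the abstract above can be illustrated with a small sketch. It uses plain Python lists in place of real model tensors and assumes the Contextual token sits at position 0 and the EOS token at the last position; both function names and the toy vectors are hypothetical, and neither the BERT-style encoder nor the LLM is actually run.

```python
def prepend_contextual_token(contextual_vec, token_embeddings):
    """Place the pre-encoded Contextual vector in front of the token embeddings."""
    return [list(contextual_vec)] + [list(t) for t in token_embeddings]


def concat_contextual_and_eos(last_hidden_states):
    """Build the text embedding from the final-layer states of two positions.

    Assumes position 0 holds the Contextual token and the last position holds EOS.
    """
    contextual_state = last_hidden_states[0]
    eos_state = last_hidden_states[-1]
    return list(contextual_state) + list(eos_state)


# Toy values standing in for real model inputs and outputs (assumptions).
inputs = prepend_contextual_token([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
toy_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # pretend decoder output, EOS last
embedding = concat_contextual_and_eos(toy_hidden)
```

The resulting vector has twice the hidden size, since it keeps both the Contextual state and the EOS state rather than relying on the last token alone.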

Article 38

Title@2025-07-31 (4): MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

Title: MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

MPCC: Ein neuartiger Benchmark für multimodale Planung mit komplexen Einschränkungen in multimodalen großen Sprachmodellen

MPCC:具有多种多语言模式复杂限制的多式联运规划新基准 2507.23382v1

Authors (6): Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che

Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs’ ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.

多模态规划能力是指在多模态上下文中对任务执行步骤进行预测、推理和设计的能力,这对于跨多个步骤的复杂推理和决策至关重要。然而,当前的基准面临两个关键挑战:(1)它们无法直接评估多模态的真实世界规划能力;(2)它们缺少约束,或缺少跨模态的隐式约束。为解决这些问题,我们提出了带复杂约束的多模态规划(MPCC),这是第一个系统评估 MLLM 在规划中处理多模态约束能力的基准。针对第一个挑战,MPCC 聚焦三项真实世界任务:航班规划、日历规划和会议规划。针对第二个挑战,我们在这些任务中引入复杂约束(例如预算、时间和空间约束),并设置分级难度(EASY、MEDIUM、HARD),以将约束复杂度与搜索空间扩张区分开来。对 13 个先进 MLLM 的实验揭示了重大挑战:闭源模型仅有 21.3% 的规划可行,而开源模型平均低于 11%。此外,我们观察到 MLLM 对约束复杂度高度敏感,并且传统的多模态提示策略在多约束场景中失效。我们的工作形式化了规划中的多模态约束,提供了严格的评估框架,并强调了面向真实世界 MLLM 应用推进约束感知推理的必要性。
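
As a rough illustration of what a "feasible plan" means under budget, temporal, and spatial constraints, the sketch below checks a flight itinerary and computes a feasible-plan rate. The `Leg` schema, the 30-minute minimum layover, and the example itineraries are assumptions made for this sketch and are not taken from the MPCC benchmark.

```python
from dataclasses import dataclass


@dataclass
class Leg:
    origin: str
    destination: str
    depart: int   # minutes since the start of the trip
    arrive: int   # minutes since the start of the trip
    price: float


def is_feasible(legs, budget, min_layover=30):
    """Return True if the itinerary satisfies all three constraint types."""
    if not legs:
        return False
    # Budget constraint: total price must not exceed the budget.
    if sum(leg.price for leg in legs) > budget:
        return False
    # Temporal constraint: every leg must arrive after it departs.
    if any(leg.arrive <= leg.depart for leg in legs):
        return False
    for prev, nxt in zip(legs, legs[1:]):
        # Spatial constraint: the next leg must start where the previous ended.
        if prev.destination != nxt.origin:
            return False
        # Temporal constraint: leave enough time to change flights.
        if nxt.depart - prev.arrive < min_layover:
            return False
    return True


def feasible_rate(plans, budget):
    """Fraction of candidate plans that pass the feasibility check."""
    return sum(is_feasible(p, budget) for p in plans) / len(plans)


ok_plan = [Leg("PEK", "HKG", 0, 200, 150.0), Leg("HKG", "SIN", 260, 500, 120.0)]
tight_plan = [Leg("PEK", "HKG", 0, 200, 150.0), Leg("HKG", "SIN", 210, 450, 120.0)]
broken_plan = [Leg("PEK", "HKG", 0, 200, 150.0), Leg("CAN", "SIN", 300, 540, 90.0)]
rate = feasible_rate([ok_plan, tight_plan, broken_plan], budget=300.0)
```

Here only the first itinerary passes: the second violates the layover requirement and the third breaks the spatial chain, so the rate is one in three.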

Article 39

Title@2025-07-31 (4): Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

Title: Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

Theorem-of-Thought: Ein Multi-Agenten-Framework für abduktive, deduktive und induktive Vernunft in Sprachmodellen

所探讨的理论理论:语言模式中指导、贬低和诱导理由的多机构框架 2506.07106v2

Authors (4): Samir Abdaljalil, Hasan Kurban, Khalid Qaraqe, Erchin Serpedin

Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem-of-thought.

大语言模型(LLM)在自然语言推理任务中表现出很强的性能,但其推理过程仍然脆弱且难以解释。思维链(CoT)等提示技术通过引出中间推理步骤或聚合多个输出来提高可靠性,但它们缺乏强制逻辑结构和评估内部一致性的机制。我们提出了 Theorem-of-Thought(ToTh),一个将推理建模为三个并行智能体之间协作的新框架,每个智能体模拟一种不同的推理模式:溯因、演绎和归纳。每个智能体生成一条推理轨迹,并将其组织成形式化的推理图。为了评估一致性,我们应用由自然语言推断(NLI)引导的贝叶斯信念传播,为每个步骤分配置信度分数,并选择最连贯的图来得出最终答案。在符号推理(WebOfLies)和数值推理(MultiArith)基准上的实验表明,ToTh 在多个 LLM 上始终优于 CoT、Self-Consistency 和 CoT-Decoding,同时生成可解释且有逻辑依据的推理链。我们的发现为构建更稳健、更受认知启发的 LLM 推理指出了一个有前景的方向。实现代码见 https://github.com/KurbanIntelligenceLab/theorem-of-thought。

Article 40

Title@2025-07-31 (4): WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation

Title: WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation

WildSpeech-Bench: Benchmarking von Audio-LLMs im natürlichen Sprachgespräch

WildSpeech-Bench:为自然演讲对话中的音频LLMs设定基准 2506.21875v2

Authors (6): Jian Zhang, Linhao Zhang, Bokai Lei, Chuhan Wu, Wei Jia, Xiao Zhou

Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities of direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech’s unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we present a novel approach to thoroughly evaluate LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation method to use customized evaluation checklists and prompts to enhance the accuracy of automatic evaluation. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across different speech scenarios. The use of query-aware evaluation further enables a finer-grained assessment under various speech-specific scenarios. Our benchmark can provide valuable insights for speech model development and evaluation.

GPT-4o等最新多式大语言模型(LLMs)展示了直接语音互动的强大能力,然而,对于终端至终端语音LLM评价缺乏专门和全面的基准,妨碍了在现实应用中优化音频LMLM的用户经验。现有的评价方法经常调整基于文本的基准,忽略了语言的独特性和挑战,包括流言、同声、静语和不同的用户期望。在这里,我们提出了一个新颖的方法,在实际语音对话中彻底评价LLMS。我们系统地整理与发言情景相关的真实世界聊天数据,引入语音属性和声学条件的多样性,并以特定语言现象补充数据集。我们进一步设计了一种有查询觉的评价方法,使用定制的评价清单,并迅速提高自动评价的准确性。我们全面测试和详细分析各种主流语言模型,揭示不同语音情景在示范性表现上的巨大差异。使用查询评估进一步使得在各种特定语音情景下进行精细的评估。我们的基准可以为语音模型的开发和评价提供宝贵的洞察力。

Article 41

Title@2025-07-31 (4): Holistic Evaluations of Topic Models

Title: Holistic Evaluations of Topic Models

Ganzheitliche Bewertungen von Themenmodellen

专题模式整体评价 2507.23364v1

Authors (1): Thomas Compton

Topic models are gaining increasing commercial and academic interest for their ability to summarize large volumes of unstructured text. As unsupervised machine learning methods, they enable researchers to explore data and help general users understand key themes in large text collections. However, they risk becoming a ‘black box’, where users input data and accept the output as an accurate summary without scrutiny. This article evaluates topic models from a database perspective, drawing insights from 1140 BERTopic model runs. The goal is to identify trade-offs in optimizing model parameters and to reflect on what these findings mean for the interpretation and responsible use of topic models.

商业和学术对专题模型总结大量无结构化文本的能力越来越感兴趣。作为不受监督的机器学习方法,这些模型使研究人员能够探索数据,帮助一般用户理解大型文本收藏中的关键主题。然而,它们有可能成为一个“黑箱”,用户输入数据,接受输出为准确的概要而无需仔细审查。本文章从数据库的角度评价专题模型,从1140 BERTopic 模型运行中得出见解。目的是确定在优化模型参数方面的权衡,并思考这些结论对专题模型的解释和负责任使用意味着什么。

Article 42

Title@2025-07-31 (4): Robust and Fine-Grained Detection of AI Generated Texts

Title: Robust and Fine-Grained Detection of AI Generated Texts

Robuste und feinkörnige Erkennung von KI-generierten Texten

对 AI 生成文本的强力和精细探测 2504.11952v3

Authors (14): Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Kanwal Mehreen, Drishti Sharma, Siddhant Gupta, Jebish Purbey, Ashay Srivastava, Subhasya TippaReddy, Arvind Reddy Bobbili, Suraj Telugara Chandrashekhar, Modabbir Adeeb, Srinadh Vura, Suman Debnath, Hamza Farooq

An ideal detection system for machine-generated content should work well on any generator, as more advanced LLMs come into existence day by day. Existing systems often struggle to accurately identify AI-generated content in shorter texts. Further, not all texts are entirely authored by a human or an LLM, hence we focus on partial cases, i.e., human-LLM co-authored texts. Our paper introduces a set of models built for the task of token classification, trained on an extensive collection of human-machine co-authored texts, which perform well on texts from unseen domains, texts from unseen generators, texts by non-native speakers, and texts with adversarial inputs. We also introduce a new dataset of over 2.4M such texts, mostly co-authored by several popular proprietary LLMs, covering 23 languages. We also present findings on our models’ performance on the texts of each domain and generator. Additional findings include comparisons of performance against each adversarial method and across input text lengths, and of the characteristics of generated texts relative to the original human-authored texts.

理想的机器生成内容检测系统应适用于任何发电机,因为许多更先进的LLM每天都存在。现有系统往往难以准确识别AI产生的内容,而不能精确辨别较短的文本。此外,并非所有文本都可能完全由人或LLM编写,因此我们更侧重于部分案例,即人-LLM共同编写的文本。我们的文件介绍了一套为象征性分类任务而设计的模型,这些模型经过培训,涉及大量人体-机器共同编写的文本,这些文本在看不见域的文本、看不见的生成者、非母语发言人的文本和有对抗性投入的文本方面表现得非常出色。我们还引入了2.4M以上这类文本的新数据集,这些文本大多由超过23种语言的几个受欢迎的专利LMM共同编写。我们还介绍了我们模型对每个域和生成器的每种文本的绩效调查结果。其他调查结果包括对照每一种对抗方法的绩效、输入文本的长度和生成文本的特点与原始人类撰写的文本的对比。

Article 43

Title@2025-07-31 (4): SWE-Exp: Experience-Driven Software Issue Resolution

Title: SWE-Exp: Experience-Driven Software Issue Resolution

SWE-Exp: Erfahrungsgetriebene Software-Ausgabeauflösung

SWE-Exp:经验驱动的软件问题解决 2507.23361v1

Authors (10): Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, Qianxiang Wang

Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi-agent collaboration and Monte Carlo Tree Search (MCTS). However, current agents act as memoryless explorers, treating each problem separately without retaining or reusing knowledge from previous repair experiences. This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems. To address this problem, we introduce SWE-Exp, an experience-enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues. Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts. Specifically, it extracts reusable issue resolution knowledge at different levels, from high-level problem comprehension to specific code changes. Experiments show that SWE-Exp achieves state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified under open-source agent frameworks. Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial-and-error exploration to strategic, experience-driven issue resolution.

大语言模型(LLM)智能体的最新进展在软件问题解决方面取得了显著进步,利用了多智能体协作和蒙特卡洛树搜索(MCTS)等先进技术。然而,当前的智能体如同没有记忆的探索者:它们孤立地处理每个问题,不保留也不复用以往修复经验中的知识。这导致对失败轨迹的重复探索,并错失将成功的问题解决方法迁移到类似问题上的机会。为解决这一问题,我们提出了 SWE-Exp,一种经验增强的方法,它从先前的智能体轨迹中提炼出简明且可操作的经验,实现跨问题的持续学习。我们的方法引入了一个多层面的经验库,同时记录成功和失败的修复尝试。具体而言,它在不同层次上提取可复用的问题解决知识,从高层次的问题理解到具体的代码修改。实验表明,在开源智能体框架下,SWE-Exp 在 SWE-bench-Verified 上取得了最先进的解决率(41.6% Pass@1)。我们的方法确立了一种新范式:自动化软件工程智能体系统地积累并利用修复经验,从试错式探索根本性地转向战略性的、经验驱动的问题解决。

Article 44

Title@2025-07-31 (4): VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

Title: VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

VL-Cogito: Progressives Curriculum-Verstärkungslernen für fortgeschrittene multimodale Vernunft

VL-Cogito:面向高级多模态推理的渐进式课程强化学习 2507.22607v2

Authors (12): Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong

Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.

强化学习已被证明能够有效提升大语言模型的推理能力。近期的研究工作逐步将这一范式扩展到多模态推理任务。由于多模态任务固有的复杂性和多样性,尤其是在语义内容和问题表述方面,现有模型在不同领域和难度级别上往往表现不稳定。为解决这些局限,我们提出了 VL-Cogito,一个通过新颖的多阶段渐进式课程强化学习(PCuRL)框架训练的高级多模态推理模型。PCuRL 系统地引导模型完成难度逐步递增的任务,大幅提升其在多样化多模态情境中的推理能力。该框架引入了两项关键创新:(1)在线难度软加权机制,在连续的 RL 训练阶段中动态调整训练难度;(2)动态长度奖励机制,鼓励模型根据任务复杂度自适应地调节推理路径长度,从而在推理效率与正确性之间取得平衡。实验评估表明,VL-Cogito 在涵盖数学、科学、逻辑和通用理解的主流多模态基准上始终达到或超过现有的面向推理的模型,验证了我们方法的有效性。

Article 45

Title@2025-07-31 (4): Text-to-SQL Task-oriented Dialogue Ontology Construction

Title: Text-to-SQL Task-oriented Dialogue Ontology Construction

Text-zu-SQL Aufgabenorientierter Dialog Ontologie Konstruktion

Text-to-SQL 任务导向对话本体构建 2507.23358v1

Authors (8): Renato Vukovic, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Hsien-Chin Lin, Shutong Feng, Nurul Lubis, Milica Gasic

Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.

大语言模型(LLM)被广泛用作通用知识来源,但它们依赖参数化知识,限制了可解释性和可信度。在任务导向对话(TOD)系统中,这种分离是显式的:系统使用由显式本体组织的外部数据库,以确保可解释性和可控性。然而,构建这样的本体需要人工标注或有监督训练。我们提出了 TeQoDO:一种 Text-to-SQL 任务导向对话本体构建方法。在该方法中,LLM 利用其固有的 SQL 编程能力,结合提示中提供的对话理论,在无监督的情况下从零开始自主构建 TOD 本体。我们表明,TeQoDO 优于迁移学习方法,并且其构建的本体在下游对话状态跟踪任务上具有竞争力。消融研究证明了对话理论的关键作用。TeQoDO 还可以扩展以构建大得多的本体,我们在 Wikipedia 和 ArXiv 数据集上对此进行了考察。我们认为这是朝着更广泛地应用本体以提高 LLM 可解释性迈出的一步。
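
One way to picture an ontology being built through SQL is the sketch below, which uses Python's standard `sqlite3` module. The three-table schema (domains, slots, slot values) and the hard-coded statements are assumptions standing in for what a model might emit while reading dialogues; they are not TeQoDO's actual schema or prompts, and no LLM is called.

```python
import sqlite3

# Hypothetical ontology schema: domains contain slots, slots take values.
SCHEMA = """
CREATE TABLE domains (name TEXT PRIMARY KEY);
CREATE TABLE slots (
    name TEXT,
    domain TEXT REFERENCES domains(name),
    PRIMARY KEY (name, domain)
);
CREATE TABLE slot_values (
    value TEXT,
    slot TEXT,
    domain TEXT,
    PRIMARY KEY (value, slot, domain)
);
"""


def apply_updates(conn, statements):
    """Run a batch of (sql, params) pairs, e.g. updates proposed for one dialogue."""
    for sql, params in statements:
        conn.execute(sql, params)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Stand-in for model output after reading "I need a cheap hotel" twice.
updates = [
    ("INSERT OR IGNORE INTO domains VALUES (?)", ("hotel",)),
    ("INSERT OR IGNORE INTO slots VALUES (?, ?)", ("price range", "hotel")),
    ("INSERT OR IGNORE INTO slot_values VALUES (?, ?, ?)", ("cheap", "price range", "hotel")),
    ("INSERT OR IGNORE INTO slot_values VALUES (?, ?, ?)", ("cheap", "price range", "hotel")),
]
apply_updates(conn, updates)
rows = conn.execute("SELECT domain, slot, value FROM slot_values").fetchall()
```

Because of the primary keys and `INSERT OR IGNORE`, repeated mentions of the same concept do not create duplicate ontology entries, which keeps incremental updates idempotent.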

Article 46

Title@2025-07-31 (4): KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

Title: KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

KeyKnowledgeRAG (K^2RAG): Eine verbesserte RAG-Methode zur Verbesserung der LLM-Fragestellung

KeyKnowledgeRAG(K^2RAG):改进LLM问答能力的强化RAG方法 2507.07695v2

Authors (4): Hruday Markondapatnaikuni, Basem Suleiman, Abdelkarim Erradi, Shijing Chen

Fine-tuning is an immensely resource-intensive process when retraining Large Language Models (LLMs) to incorporate a larger body of knowledge. Although many fine-tuning techniques have been developed to reduce the time and computational cost involved, the challenge persists as LLMs continue to grow in size and complexity. To address this, a new approach to knowledge expansion in LLMs is needed. Retrieval-Augmented Generation (RAG) offers one such alternative by storing external knowledge in a database and retrieving relevant chunks to support question answering. However, naive implementations of RAG face significant limitations in scalability and answer accuracy. This paper introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome these limitations. Inspired by the divide-and-conquer paradigm, K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. The framework also includes a preprocessing step that summarizes the training data, significantly reducing the training time. K2RAG was evaluated using the MultiHopRAG dataset, where the proposed pipeline was trained on the document corpus and tested on a separate evaluation set. Results demonstrated notable improvements over common naive RAG implementations. K2RAG achieved the highest mean answer similarity score of 0.57, and reached the highest third quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers. In addition to improved accuracy, the framework proved highly efficient. The summarization step reduced the average training time of individual components by 93%, and execution speed was up to 40% faster than traditional knowledge graph-based RAG systems. K2RAG also demonstrated superior scalability, requiring three times less VRAM than several naive RAG implementations tested in this study.

在重新训练大语言模型(LLM)以纳入更多知识时,微调是一个极其耗费资源的过程。尽管已经开发出许多微调技术来减少所需的时间和计算成本,但随着 LLM 的规模和复杂度不断增长,这一挑战依然存在。为此,需要一种在 LLM 中扩展知识的新方法。检索增强生成(RAG)提供了这样一种替代方案:将外部知识存储在数据库中,并检索相关文本块以支持问答。然而,朴素的 RAG 实现在可扩展性和答案准确性方面存在明显局限。本文提出 KeyKnowledgeRAG(K2RAG),一个旨在克服这些局限的新框架。受分治范式的启发,K2RAG 融合了稠密与稀疏向量检索、知识图谱和文本摘要,以提高检索质量和系统效率。该框架还包含一个对训练数据进行摘要的预处理步骤,显著缩短了训练时间。K2RAG 使用 MultiHopRAG 数据集进行评估,所提出的流水线在文档语料上训练,并在独立的评估集上测试。结果显示,它相比常见的朴素 RAG 实现有明显提升。K2RAG 取得了最高的平均答案相似度得分 0.57,并达到了最高的第三四分位数(Q3)相似度 0.82,表明其与标准答案更为一致。除准确性提升外,该框架也十分高效:摘要步骤使各组件的平均训练时间减少了 93%,执行速度比传统的基于知识图谱的 RAG 系统最多快 40%。K2RAG 还表现出更好的可扩展性,所需显存(VRAM)仅为本研究中测试的若干朴素 RAG 实现的三分之一。
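
The idea of combining dense and sparse retrieval can be sketched with a simple weighted score. The example below is only an illustration under stated assumptions: the two-dimensional vectors are toy stand-ins for real embeddings, keyword overlap stands in for a proper sparse scorer, and the `alpha` weight is hypothetical. It does not reproduce the K2RAG pipeline, its knowledge graph, or its summarization step.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_overlap(query, text):
    """Share of query words that also appear in the text (a crude sparse signal)."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0


def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Rank (text, vector) pairs by a weighted mix of dense and sparse scores."""
    scored = []
    for text, vec in docs:
        score = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_overlap(query, text)
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


docs = [
    ("bananas are yellow", [0.1, 0.9]),
    ("paris is the capital of france", [0.9, 0.1]),
]
ranking = hybrid_rank("capital of france", [1.0, 0.0], docs)
```

In this toy setup both signals agree, so the passage about Paris ranks first; in practice the weight between the two signals would need tuning on held-out queries.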

Article 47

Title@2025-07-31 (4): SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

Title: SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

SWE-Debatte: Wettbewerbsfähige Multi-Agenten-Debatte für die Lösung von Software-Problemen

SWE-Debate:解决软件问题竞争性多机构辩论 2507.23348v1

Authors (9): Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, Qianxiang Wang

Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous, tool-using agents to tackle complex software engineering tasks. While existing agent-based issue resolution approaches are primarily based on agents’ independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase. To address this limitation, we propose SWE-Debate, a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. SWE-Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph. Then, it organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. This structured competition enables agents to collaboratively converge on a consolidated fix plan. Finally, this consolidated fix plan is integrated into an MCTS-based code modification agent for patch generation. Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks and outperforms baselines by a large margin.

得益于大语言模型(LLM)先进的推理能力,问题解决(issue resolution)取得了显著进展。最近,SWE-agent 等基于智能体的框架使能够自主使用工具的智能体得以处理复杂的软件工程任务,进一步推动了这一进展。现有的基于智能体的问题解决方法主要依赖智能体的独立探索,因而常常陷入局部解,无法识别跨越代码库不同部分的问题模式。针对这一局限,我们提出了 SWE-Debate,一个竞争性的多智能体辩论框架,它鼓励多样化的推理路径,并实现更为集中的问题定位。SWE-Debate 首先通过遍历代码依赖图生成多条故障传播轨迹,作为定位提案。随后,它在专门化的智能体之间组织三轮辩论,每个智能体沿故障传播轨迹体现不同的推理视角。这种结构化的竞争使智能体能够协作收敛到一个统一的修复计划。最后,该修复计划被整合进基于 MCTS 的代码修改智能体,用于生成补丁。在 SWE-bench 基准上的实验表明,SWE-Debate 在开源智能体框架中取得了新的最先进结果,并大幅超越基线。
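
Generating candidate traces by traversing a dependency graph can be sketched as a bounded depth-first enumeration of paths. The graph contents, the `max_depth` limit, and the function name below are assumptions for illustration; the actual SWE-Debate traversal and its ranking of traces are not specified in the abstract.

```python
def propagation_traces(graph, start, max_depth=3):
    """Enumerate simple paths from `start` through a dependency graph.

    graph: dict mapping a code entity to the entities it depends on.
    A path is emitted when it reaches `max_depth` nodes or cannot be extended.
    """
    traces = []

    def walk(node, path):
        successors = [n for n in graph.get(node, []) if n not in path]
        if len(path) == max_depth or not successors:
            traces.append(path)
            return
        for nxt in successors:
            walk(nxt, path + [nxt])

    walk(start, [start])
    return traces


# Hypothetical dependency graph for a small request-handling module.
graph = {
    "handler": ["parser", "validator"],
    "parser": ["tokenizer"],
    "validator": [],
}
traces = propagation_traces(graph, "handler")
```

Each returned path is one localization proposal in this simplified view; skipping already-visited nodes keeps cyclic dependencies from producing infinite traces.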

Article 48

Title@2025-07-31 (4): Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

Title: Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

Mehrsprachige Fähigkeiten mit kulturellem und lokalem Wissen in großen Sprachmodellen verbessern und gleichzeitig die Leistungsfähigkeit der Ureinwohner verbessern

在增强母语性能的同时,利用文化和本地知识提升大语言模型的多语言能力 2504.09753v3

Authors (9): Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Siddhant Gupta, Drishti Sharma, Jebish Purbey, Kanwal Mehreen, Muhammad Arham, Suman Debnath, Hamza Farooq

Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bilingual LLM, Mantra-14B, with ~3% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset of 485K English and Hindi instruction samples, we instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance in both English and Hindi. Our experiments, encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios, demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under MIT and Apache licenses to aid further research towards under-represented and low-resource languages.

大型语言模型(LLMS)表现出了非凡的能力,但其开发主要侧重于英语和其他高资源语言,使许多语言得不到充分服务。我们展示了印度语-英语双语LLM \ textbf{Mantra-14B}的最新印度语-英语双语言LLM \ textbf{Mantra-14B},两者的基准分数平均提高了3,优于其规模的两倍。我们使用由485K样本的英语和印地语教学数据组成的经整理的数据集,指导了Quen-2.5-14B-Instruct和Phid语等经调整的模型,以提高英语和印地语的绩效。我们涵盖7个不同参数大小的不同LLMs和140多个培训尝试的实验表明,在不损及本地绩效的前提下大幅改进多种语言的绩效是可能的。此外,我们的方法避免了词汇扩展或建筑改造等资源密集型技术,从而使模型规模小。我们的结果表明,对文化和当地知情数据进行适度的微调,可以弥补绩效差距,而不会产生重大的计算间接费用。我们发布了我们的培训代码、数据集和模型,用于协助低语言的研究。

Article 49

Title@2025-07-31 (4): DSBC : Data Science task Benchmarking with Context engineering

Title: DSBC : Data Science task Benchmarking with Context engineering

DSBC : Data Science-Aufgabe Benchmarking mit Kontext-Engineering

DSBC: 数据科学任务与背景工程基准 2507.23336v1

Authors (6): Ram Mohan Rao Kadiyala, Siddhant Gupta, Jebish Purbey, Giulio Martini, Suman Debnath, Hamza Farooq

Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications. We evaluate three LLMs (Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini) across three approaches: zero-shot with context engineering, multi-step with context engineering, and SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research on more robust and effective data science agents.

大型语言模型(LLMS)的最近进展对数据科学工作流程产生了重大影响,产生了专门的数据科学代理物,目的是实现分析任务的自动化。尽管迅速采用,但评估这些代理物的功效和局限性的系统基准仍然很少。在本文件中,我们引入了一个全面基准,专门通过观察我们商业应用的使用情况来反映实际用户与数据科学代理物的相互作用。我们评估了三个LLMs:Claude-4.0-Sonnet、Gemini-2.5-Flash和OpenAI-o4-Mini,这三种方法包括:环境工程零弹射、环境工程多步和SmolAgency。我们的基准评估了八个数据科学任务类别的业绩,另外探讨了模型对共同提示问题的敏感性,例如数据泄漏和略微模糊的指示。我们进一步调查了温度参数对每个模型和方法的总体和具体任务结果的影响。我们的调查结果揭示了评价模型和方法之间不同的业绩差异,突出了影响实际应用的关键因素。我们在此介绍的基准数据集和评价框架的目的是为未来研究更可靠和有效的数据科学代理物提供基础。

Article 50

Title@2025-07-31 (4): MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

Title: MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

MUST-RAG: MUSical Text Question Beantwortung mit retrieval Augmented Generation

MOST-RAG: 以回取增加的一代人回答的中文本问题 2507.23334v1

Authors (3): Daeyong Kwon, SeungHeon Doh, Juhan Nam

Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs’ effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilize context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs’ music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.

大型语言模型(LLMS)的近期进步表明,大语言模型(LLMS)在不同领域都取得了显著进步,尽管在各种任务上表现得非常零,但LLMS在音乐相关应用方面的效力仍然有限,因为其培训数据中音乐专用知识所占比例相对较小。为解决这一局限性,我们提议MUST-RAG,一个基于再获取增强一代(RAG)的综合框架,以将通用LMS改编成仅供文字解答的音乐问题解答(MQA)任务。RAG是一种技术,通过在生成问题答案时检索相关背景信息,为LLMS提供外部知识。为了优化音乐领域的RAG,我们(1) 提议MusWikiDB,即用于检索阶段的音乐专用矢量数据库,以及(2) 在推论和微调过程中利用背景信息,将通用LMMMLMSMS(MQA)MQA(MQA)调控域能力(MSUWIA)的常规微调方法大大超越了LMMMMMMA(GA)的适应能力,并大大改进了我们的普通和高级QA(MQA)计算效率标准。

Article 51

Title@2025-07-31 (4): Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette

Title: Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette

Kulturelle Palette: Pluralisierung der Kulturausrichtung über Multi-Agenten-Palette

文化调色板:通过多试剂调色板实现多元化文化协调 2412.11167v3

Authors (7): Jiahao Yuan, Zixiang Di, Shangzixin Zhao, Zhiqing Cui, Hanqing Wang, Guisong Yang, Usman Naseem

Large language models (LLMs) face challenges in aligning with diverse cultural values despite their remarkable performance in generation, which stems from inherent monocultural biases and difficulties in capturing nuanced cultural semantics. Existing methods struggle to adapt to unknown cultures after fine-tuning. Inspired by cultural geography across five continents, we propose Cultural Palette, a multi-agent framework that redefines cultural alignment as an adaptive “color-blending” process for country-specific adaptation. Our approach harnesses cultural geography across five continents (Africa, America, Asia, Europe, Oceania) through three key steps. First, we synthesize the Pentachromatic Cultural Palette Dataset using GPT-4o, refining continental-level dialogues with Hofstede’s cultural dimensions to establish foundational cultural representations. Second, five continent-level alignment agents form specialized cultural communities that generate region-specific draft responses. Third, a Meta Agent employs Cultural MoErges to dynamically blend these cultural “colors” through attention-gated parameter merging, akin to mixing pigments on a palette, resolving conflicts while preserving cultural nuances to produce the final culturally-aligned response. Extensive experiments across various countries demonstrate that Cultural Palette surpasses existing baselines in cultural alignment.

大型语言模型(LLMS)在与不同的文化价值观保持一致方面面临着挑战,尽管它们一代人表现出色,这源于固有的单文化偏见和捕捉细微文化语义学的困难。现有方法在微调之后难以适应未知文化。受五大洲文化地理的启发,我们提出文化调和,这是一个多试剂框架,将文化调和重新定义为适合特定国家适应的“彩色混合”进程。我们的方法通过三个关键步骤利用五大洲(非洲、美洲、亚洲、欧洲、大洋洲)的文化地理:第一,我们利用GPT-4o将Pentachromatic 文化调色素数据集合成,改进大陆一级与Hofsstede文化层面的对话,以建立基础文化代表。第二,五个大陆一级的调和剂形成专门的文化社区,产生针对特定区域的回应草案。第三,一个Metagres利用文化调控点,通过引人注意的参数整合,将这些文化“颜色”动态融合为五大洲(非洲、美洲、亚洲、欧洲、大洋洲),通过三个关键步骤,通过三个步骤:第一,我们使用GPPT-4o,在维护文化调和保持文化调,同时解决冲突冲突,同时解决冲突,与Hoclates missionlates bes lavelates bes bes ex extimes ex ex ex各国现有文化调制文化基线。

Article 52

Title@2025-07-31 (4): FinGAIA: A Chinese Benchmark for AI Agents in Real-World Financial Domain

Title: FinGAIA: A Chinese Benchmark for AI Agents in Real-World Financial Domain

FinGAIA: Ein chinesischer Benchmark für KI-Agenten in der Real-World Financial Domain

金融界:中国真实世界金融领域AI代理商基准 2507.17186v2

Authors (21): Lingfeng Zeng, Fangqi Lou, Zixuan Wang, Jiajie Xu, Jinyi Niu, Mengping Li, Yifan Dong, Qi Qi, Wei Zhang, Ziwei Yang, Jun Han, Ruilun Feng, Ruiqi Hu, Lejie Zhang, Zhengbo Feng, Yicheng Ren, Xin Guo, Zhaowei Liu, Dongpo Cheng, Weige Cai, Liwen Zhang

The booming development of AI agents presents unprecedented opportunities for automating complex tasks across various domains. However, their multi-step, multi-tool collaboration capabilities in the financial sector remain underexplored. This paper introduces FinGAIA, an end-to-end benchmark designed to evaluate the practical abilities of AI agents in the financial domain. FinGAIA comprises 407 meticulously crafted tasks, spanning seven major financial sub-domains: securities, funds, banking, insurance, futures, trusts, and asset management. These tasks are organized into three hierarchical levels of scenario depth: basic business analysis, asset decision support, and strategic risk management. We evaluated 10 mainstream AI agents in a zero-shot setting. The best-performing agent, ChatGPT, achieved an overall accuracy of 48.9%, which, while superior to non-professionals, still lags behind financial experts by over 35 percentage points. Error analysis has revealed five recurring failure patterns: Cross-modal Alignment Deficiency, Financial Terminological Bias, Operational Process Awareness Barrier, among others. These patterns point to crucial directions for future research. Our work provides the first agent benchmark closely related to the financial domain, aiming to objectively assess and promote the development of agents in this crucial field. Partial data is available at https://github.com/SUFE-AIFLM-Lab/FinGAIA.

AI代理商的蓬勃发展为在各个领域使复杂任务自动化提供了前所未有的机会,然而,其金融部门的多步、多工具协作能力仍未得到充分探讨。本文件介绍了FinGAIA,这是旨在评价AI代理商在金融领域实际能力的一个端对端基准。FinGAIA由407项精心设计的任务组成,涉及七个主要的金融次领域:证券、资金、银行、保险、未来、信托和资产管理。这些任务分为三个层次的情景深度:基本业务分析、资产决策支助和战略风险管理。我们在零发环境中对10个AI主流代理商进行了评估。我们的工作为最佳代理商ChatGPT提供了与48.9的总体准确性,该代理商虽然优于非专业人员,但仍落后于金融专家,但仍超过35个百分点。错误分析揭示了五个反复出现的失败模式:跨模式协调不便、金融时序比亚、业务过程认识障碍等。这些模式指向未来研究的关键方向。我们的工作为金融领域提供了第一位代理商基准。ACTGPTGP/FAAA,目标是客观地评估和AFIAFI/BIAA。

Article 53

Title@2025-07-31 (4): Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages

Title: Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages

Multi-Hypothese Destillation von mehrsprachigen Neuralübersetzungsmodellen für ressourcenarme Sprachen

多语言低资源语言多语言神经翻译模型的蒸馏 2507.21568v2

Authors (4): Aarón Galiano-Jiménez, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena

This paper explores sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models. We argue that the teacher model’s output distribution holds valuable insights for the student, beyond the approximated mode obtained through beam search (the standard decoding method), and present Multi-Hypothesis Distillation (MHD), a sequence-level KD method that generates multiple translations for each source sentence. This provides a larger representation of the teacher model distribution and exposes the student model to a wider range of target-side prefixes. We leverage $n$-best lists from beam search to guide the student’s learning and examine alternative decoding methods to address issues like low variability and the under-representation of infrequent tokens. For low-resource languages, our research shows that while sampling methods may slightly compromise translation quality compared to beam-search-based approaches, they enhance the generated corpora with greater variability and lexical richness. This ultimately improves student model performance and mitigates the gender bias amplification often associated with KD.

本文探讨了多语种预先培训的编码器脱coder-decoder翻译模型的序列级知识蒸馏(KD) 。我们认为,教师模型的输出分布为学生提供了宝贵的洞察力,超出了通过光束搜索(标准解码方法)获得的近似模式,并提出了多功能分子蒸馏(MHD)方法(MHD),这是一种序列级知识蒸馏(MHD)方法,为每个源句生成多种译文。这为教师模型的分布提供了更大的代表性,使学生模型暴露在更广泛的目标端前缀中。我们利用了来自宝座搜索的最优名单来指导学生的学习,并考察了替代解码方法,以解决诸如低变异性和非常用符号代表不足等问题。关于低资源语言,我们的研究表明,虽然抽样方法可能略微损害翻译质量,而以光谱搜索方法相比,它们会以更大的变异性和词汇丰富度增强生成的化合物。这最终提高了学生模型的性能,并减轻了常常与KD相关的性别偏见。
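
To make the idea in the MHD abstract above concrete, here is a minimal sketch of how a distillation corpus with several teacher hypotheses per source sentence might be assembled. Everything in it is an illustrative assumption rather than the authors' code: `TOY_TEACHER` is a hypothetical lookup table standing in for a real translation model, and `n_best`, `sample`, and `build_corpus` are names invented for this example.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical toy teacher: candidate translations with teacher probabilities.
# A real setup would query a multilingual encoder-decoder model instead.
TOY_TEACHER: Dict[str, List[Tuple[str, float]]] = {
    "hola mundo": [("hello world", 0.6), ("hi world", 0.3), ("hello, globe", 0.1)],
    "buenas noches": [("good night", 0.7), ("good evening", 0.3)],
}


def n_best(source: str, n: int) -> List[str]:
    """Return the n most probable hypotheses (stand-in for a beam-search n-best list)."""
    ranked = sorted(TOY_TEACHER[source], key=lambda h: h[1], reverse=True)
    return [text for text, _ in ranked[:n]]


def sample(source: str, n: int, rng: random.Random) -> List[str]:
    """Draw n hypotheses in proportion to teacher probability (stand-in for sampling)."""
    texts, probs = zip(*TOY_TEACHER[source])
    return rng.choices(texts, weights=probs, k=n)


def build_corpus(
    sources: List[str], generate: Callable[[str], List[str]]
) -> List[Tuple[str, str]]:
    """Pair every source sentence with each hypothesis the teacher produced for it."""
    return [(src, hyp) for src in sources for hyp in generate(src)]


if __name__ == "__main__":
    sources = list(TOY_TEACHER)
    print(build_corpus(sources, lambda s: n_best(s, 2)))
    rng = random.Random(0)
    print(build_corpus(sources, lambda s: sample(s, 2, rng)))
```

The student would then be trained on the resulting (source, hypothesis) pairs. Swapping the `generate` callable is what lets the two decoding strategies be compared on equal footing.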

Article 54

Title@2025-07-31 (4): LLMs and the Human Condition

Title: LLMs and the Human Condition

LLMs und der menschliche Zustand

LLM和人类条件 2402.08403v6

Authors (1): Peter Wallis

Theory-based AI research has had a hard time recently, and the aim here is to propose a model of what LLMs are actually doing when they impress us with their language skills. The model integrates three established theories of human decision-making from philosophy, sociology, and computer science. The paper starts with the collective understanding of reasoning from the early days of AI research, primarily because that model is how we humans think we think, and is the most accessible. It then describes what is commonly thought of as “reactive systems”, which is the position taken by many philosophers and indeed many contemporary AI researchers. The third component of the proposed model is from sociology and is based on the idea that human intelligence is a collective skill for which individuals are merely actors. The resulting model provides an alternate view of “mind reading” in human communication.

基于理论的AI研究最近经历了一段艰难的时期,这里的目的是提出一个LLMs在以语言技能给我们留下深刻印象时实际在做哪些事情的模式。模型综合了哲学、社会学和计算机科学的三种人类决策既定理论。文件从AI研究的最初几天开始,对推理的集体理解,这主要是因为这个模型是我们人类的想法,也是最容易理解的。然后,它描述了人们通常认为的“反应系统”是许多哲学家和当代许多AI研究人员的立场。提议的模型的第三个组成部分来自社会学,其基础是人类智慧是一种集体技能,个人只是这种技能的参与者。由此产生的模型提供了人类交流中的“微读”的替代观点。

Article 55

Title@2025-07-31 (4): What’s Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

Title: What’s Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

Was ist Taboo für Sie? - Eine empirische Bewertung von LLMs Verhalten für Sensitive Inhalte

- 对行为举止为敏感内容的LLMS的经验评估 2507.23319v1

Authors (6): Alfio Ferrara, Sergio Picascia, Laura Pinnavaia, Vojimir Ranitovic, Elisabetta Rocchetti, Alice Tuveri

Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. We also evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performance against traditional methods.

虽然先前的研究主要侧重于明确培训模式,以缓和敏感内容并使之解毒,但对于LLMs是否在没有明确指示的情况下暗含净化语言的问题,探索有限,这项研究从经验上分析了GPT-4o-mini在对敏感内容进行反射和评估敏感性变化的程度时的隐含温和行为。我们的实验表明,GPT-4o-mini系统地将内容调低到不太敏感的类别,大幅降低贬低和禁忌语言。此外,我们评估LMs在对判决敏感性进行分类、将其表现与传统方法进行比较方面零弹射能力。

Article 56

Title@2025-07-31 (4): LiMe: a Latin Corpus of Late Medieval Criminal Sentences

Title: LiMe: a Latin Corpus of Late Medieval Criminal Sentences

LiMe: ein lateinischer Corpus der spätmittelalterlichen Strafurteile

Lime:拉丁美洲中世纪晚期刑事判决区 2404.12829v2

Authors (6): Alessandra Bassani, Beatrice Del Bo, Alfio Ferrara, Marta Mangini, Sergio Picascia, Ambra Stefanello

The Latin language has received attention from the computational linguistics research community, which has built, over the years, several valuable resources, ranging from detailed annotated corpora to sophisticated tools for linguistic analysis. With the recent advent of large language models, researchers have also started developing models capable of generating vector representations of Latin texts. The performance of such models remains behind that of models for modern languages, given the disparity in available data. In this paper, we present the LiMe dataset, a corpus of 325 documents extracted from a series of medieval manuscripts called Libri sententiarum potestatis Mediolani and thoroughly annotated by experts, so that it can be used for masked language modeling as well as supervised natural language processing tasks.

拉丁语受到计算语言研究界的注意,多年来,该研究界积累了若干宝贵的资源,从详细的注解公司到复杂的语言分析工具,随着最近大型语言模型的出现,研究人员还开始开发能够生成拉丁文本矢量的模型,鉴于现有数据的差异,这些模型的性能仍然落后于现代语言的模型,在本文中,我们提供了Lime数据集,共有325份文件,摘自一系列中世纪手稿,称为Libri sententententiarum patestatis Mediolani,并得到了专家的详尽说明,以便用于遮罩语言模型以及受监督的自然语言处理任务。

Article 57

Title@2025-07-31 (4): SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy

Title: SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy

SequenzLayer: Sequenzverarbeitung und Streaming von Neuronalen Netzwerken leicht gemacht

序列激光器:序列处理和串联神经网络变得容易 2507.23292v1

Authors (11): RJ Skerry-Ryan, Julian Salazar, Soroosh Mariooryad, David Kao, Daisy Stanton, Eric Battenberg, Matt Shannon, Ron J. Weiss, Robin Scheibler, Jonas Rothfuss, Tom Bagby

We introduce a neural network layer API and library for sequence modeling, designed for easy creation of sequence models that can be executed both layer-by-layer (e.g., teacher-forced training) and step-by-step (e.g., autoregressive sampling). To achieve this, layers define an explicit representation of their state over time (e.g., a Transformer KV cache, a convolution buffer, an RNN hidden state), and a step method that evolves that state, tested to give identical results to a stateless layer-wise invocation. This and other aspects of the SequenceLayers contract enable complex models to be immediately streamable, mitigate a wide range of common bugs arising in both streaming and parallel sequence processing, and can be implemented in any deep learning library. A composable and declarative API, along with a comprehensive suite of layers and combinators, streamlines the construction of production-scale models from simple streamable components while preserving strong correctness guarantees. Our current implementations of SequenceLayers (JAX, TensorFlow 2) are available at https://github.com/google/sequence-layers.

为实现这一目标,我们引入了神经网络层 API 和序列模型库, 目的是容易地创建可以逐层执行的序列模型( 教师强制培训) 和一步步执行的序列模型( 自动递减抽样 ) 。为了实现这一点, 层界定了它们随着时间推移的状态的清晰描述( 例如变换器 KV 缓存、混凝土缓冲、隐藏的 RNN ) , 并引入一个步骤方法, 该步骤方法将状态化, 测试为给无国籍的分层性职业带来相同结果。以及序列激光器合同的其他方面使复杂模型能够立即流动, 减轻在串流和平行序列处理中产生的广泛常见的错误, 并且可以在任何深层学习图书馆中实施。一个可比较和具有宣示性的 API , 连同一个全面的层层和梳理器组合, 将生产规模模型的构建从简单可流成的组件简化, 同时又保持强烈的正确性保证。我们目前实施的SquecesLayers ( JAX, TensorFlow 2) 可在 httpsrence.
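
The layer-wise versus step-wise contract described in the SequenceLayers abstract above can be illustrated with a small, library-free sketch. This is not the SequenceLayers API: the function names (`causal_conv_layerwise`, `initial_state`, `causal_conv_step`, `run_stepwise`) are hypothetical, and plain Python lists stand in for tensors. It shows a causal 1-D convolution whose explicit state is a buffer of the most recent inputs, and checks that stepping through the sequence reproduces the whole-sequence result.

```python
from typing import List, Sequence, Tuple


def causal_conv_layerwise(xs: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Apply a causal 1-D convolution to the whole sequence at once."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(xs)  # left-pad so output[t] only sees xs[<= t]
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(xs))
    ]


def initial_state(kernel: Sequence[float]) -> List[float]:
    """The explicit state: a buffer holding the previous len(kernel) - 1 inputs."""
    return [0.0] * (len(kernel) - 1)


def causal_conv_step(
    x: float, state: List[float], kernel: Sequence[float]
) -> Tuple[float, List[float]]:
    """Consume one input, emit one output, and return the evolved state."""
    window = state + [x]  # oldest input first, newest last
    y = sum(w * v for w, v in zip(kernel, window))
    return y, window[1:]  # drop the oldest input to keep the buffer size fixed


def run_stepwise(xs: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Drive the step method over a sequence, as autoregressive sampling would."""
    state = initial_state(kernel)
    ys = []
    for x in xs:
        y, state = causal_conv_step(x, state, kernel)
        ys.append(y)
    return ys


if __name__ == "__main__":
    xs, kernel = [1.0, 2.0, 3.0], [0.5, 0.25, 1.0]
    assert run_stepwise(xs, kernel) == causal_conv_layerwise(xs, kernel)
    print(run_stepwise(xs, kernel))
```

The equality check at the end is the kind of test the abstract alludes to: the step method is only trustworthy if it matches the stateless layer-wise call on the same input.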

Article 58

Title@2025-07-31 (4): Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics

Title: Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics

Fortschritte in LLMs mit Fokus auf Vernunft, Anpassungsfähigkeit, Effizienz und Ethik

注重理由、适应性、效率和道德操守的LLMs项目的进展 2506.12365v2

Authors (8): Asifullah Khan, Muhammad Zaeem Khan, Saleha Jamshed, Sadia Ahmad, Aleesha Zainab, Kaynat Khatib, Faria Bibi, Abdul Rehman

This survey paper outlines the key developments in the field of Large Language Models (LLMs), including enhancements to their reasoning skills, adaptability to various tasks, increased computational efficiency, and the ability to make ethical decisions. The techniques that have been most effective in bridging the gap between human and machine communications include Chain-of-Thought prompting, Instruction Tuning, and Reinforcement Learning from Human Feedback. The improvements in multimodal learning and few-shot or zero-shot techniques have further empowered LLMs to handle complex jobs with minimal input. A significant focus is placed on efficiency, detailing scaling strategies, optimization techniques, and the influential Mixture-of-Experts (MoE) architecture, which strategically routes inputs to specialized subnetworks to boost predictive accuracy while optimizing resource allocation. This survey also offers a broader perspective on recent advancements in LLMs, going beyond isolated aspects such as model architecture or ethical concerns. Additionally, it explores the role of LLMs in Agentic AI and their use as Autonomous Decision-Making Systems, and categorizes emerging methods that enhance LLM reasoning, efficiency, and ethical alignment. The survey also identifies underexplored areas such as interpretability, cross-modal integration, and sustainability. While significant advancements have been made in LLMs, challenges such as high computational costs, biases, and ethical risks remain. Overcoming these requires a focus on bias mitigation, transparent decision-making, and explicit ethical guidelines. Future research will generally focus on enhancing the model’s ability to handle multiple inputs, thereby making it more intelligent, safe, and reliable.

本调查文件概述了大语言模型(LLMS)领域的关键发展,包括提高推理技能、适应各种任务的能力、提高计算效率和作出道德操守决定的能力。在弥合人与机器通信之间的差距方面最有效的技术包括“探索链”、“指导教学”和“从人类反馈中强化学习”。多式学习和微小或零发式技术的改进进一步增强了LMS处理复杂工作的能力。一个显著的重点是效率,详细说明了可靠的规模战略、优化技术和有影响力的混合专家(MoE)结构,从战略上将投入转向专门的子网络,以提高预测准确性,同时优化资源分配。这项调查还从更广的角度介绍了LMMS最近的进展,超越了模型结构或道德问题等孤立的方面。此外,它探讨了LMS在AI模型中的作用,以及它们作为自主决策系统的应用。将新的方法分类,加强LMRM的推理、效率和道德一致性。调查还查明了高透明度、高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高透明度、提高成本、提高成本、提高成本、提高成本、提高等领域。调查还继续注重。
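
The survey abstract above mentions that Mixture-of-Experts architectures route inputs to specialized subnetworks. As a rough illustration of that routing step, the sketch below implements generic top-k gating over scalar "experts". It is a textbook-style simplification, not taken from the survey: the helper names (`softmax`, `top_k_route`, `moe_forward`) and the choice to renormalize the gate weights over the selected experts are assumptions for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(
    x: float,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    k: int = 2,
) -> float:
    """Combine the outputs of the selected experts, weighted by their gates."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


if __name__ == "__main__":
    # Four hypothetical experts; only two are evaluated per input.
    experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 1, lambda x: -x]
    print(top_k_route([2.0, 0.0, 1.0, -1.0], k=2))
    print(moe_forward(3.0, [2.0, 0.0, 1.0, -1.0], experts, k=2))
```

Because only `k` experts run per input, compute grows with `k` rather than with the total number of experts, which is the efficiency argument the abstract refers to.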

Article 59

Title@2025-07-31 (4): Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability

Title: Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability

Iterative Reparatur mit schwachen Verifierern für wenige Aufnahmen in KBQA mit Unbeantwortbarkeit

KBQA 中无法解答的微小投射点校验器的迭代性修补 2406.14313v3

Authors (4): Riya Sawhney, Samrat Yadav, Indrajit Bhattacharya, Mausam

Real-world applications of KBQA require models to handle unanswerable questions with a limited volume of in-domain labeled training data. We propose the novel task of few-shot transfer for KBQA with unanswerable questions and contribute two new datasets for performance evaluation. We present FUn-FuSIC, a novel solution for our task that extends FuSIC-KBQA, the state-of-the-art few-shot transfer model for answerable-only KBQA. We first note that FuSIC-KBQA’s iterative repair makes a strong assumption that all questions are answerable. As a remedy, we propose Feedback for Unanswerability (FUn), which performs iterative repair using feedback from a suite of strong and weak verifiers, and an adaptation of self-consistency for unanswerability to better assess the answerability of a question. Our experiments show that FUn-FuSIC significantly outperforms suitable adaptations of multiple LLM-based and supervised SoTA models on our task, while establishing a new SoTA for answerable few-shot transfer as well.

KBQA 的实时应用要求模型处理无法回答的问题,其内部标记的培训数据数量有限。我们提议为 KBQA 提供一些无法回答的问题,并贡献两个新的数据集来进行绩效评估。我们提出了FU-FusIC-FusIC-一个适用于我们的任务的新型解决方案,它扩展了FusICT KBQA, 即只对可回答的 KBQA 采用最先进的微小传输模式。我们首先注意到, FusIC-KBQA 的迭接式修复有力地假设所有问题都是无法回答的。作为补救措施,我们提出了“无法回答”反馈,它利用来自强弱核查员的反馈进行迭接式修复,并调整自我一致性以更好地评估问题的可回答性。我们的实验表明,FU-Fus-FusICSICM 大大超出了我们任务中基于并监督的多个LM 软件模型的适当适应性,同时为可回答的几发式转让建立一个新的 SoTA。

Article 60

Title@2025-07-31 (4): AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

Title: AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

AutoSchemaKG: Autonome Wissensgraphenkonstruktion durch dynamische Schemainduktion aus Web-Scale Corpora

AutoSchemaKG:通过网络规模公司动态气相引入,建立自主知识图 2505.23628v2

Authors (20): Jiaxin Bai, Wei Fan, Qi Hu, Qing Zong, Chunyang Li, Hong Ting Tsang, Hongyu Luo, Yauwai Yim, Haoyu Huang, Xiao Zhou, Feng Qin, Tianshi Zheng, Xi Peng, Xin Yao, Huiwen Yang, Leijie Wu, Yi Ji, Gong Zhang, Renhai Chen, Yangqiu Song

We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.

我们展示了AutoSchemaKG,这是一个完全自主的知识图形构建框架,它消除了对预先定义的模型的需求。我们的系统利用大型语言模型,同时从文本中提取三重知识,并直接产生全面的模型,同时对实体和事件进行建模,同时利用概念化来将事件组织成语义类别。处理超过5 000万份文件,我们建造了ATLAS(自动三连和Schema感应),这是一个知识图表系列,拥有9亿+百万节点和59亿边缘。这个方法在多霍QA任务上优于最新水平的基线,并增强了LLM事实质量。值得注意的是,我们的系统感应实现了与人造图的95-语义比对齐,零人工干预,表明10亿级知识图与动态导成的Schemas可以有效地补充大型语言模型的参数知识。

Article 61

Title@2025-07-31 (4): Unveiling Super Experts in Mixture-of-Experts Large Language Models

Title: Unveiling Super Experts in Mixture-of-Experts Large Language Models

Enthüllen Super-Experten in Mixture-of-Experts große Sprachmodelle

混合专家大语言模型中不懈的超级专家 2507.23279v1

Authors (6): Zunhai Su, Qingyuan Li, Hao Zhang, YuLei Qian, Yuchen Xie, Kehong Yuan

Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model’s forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model’s overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SE compression. Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.

快速启动的探索混合模型(MoE)在加强大型语言模型(LLMS)的学习能力方面显示了希望。利用专家之间的内在重要性差异,最近的研究探索了专家级压缩技术,以提高MOE LLMs的效率。然而,现有方法往往依赖经验标准来确定关键专家,缺乏更深入的探索和了解专家的不同重要性。在本研究中,我们首次发现和调查了在模型前方推理期间在基础机制中发挥关键作用的专家群体。这些专家在开放源代码 MoE LLMs中很普遍,尽管专家人数有限,但正在运行这些专家导致模型性能显著下降(例如,运行三个原因导致 Quen3-30B-A3B 产生重复和不具有说服力的产出)。我们把这些专家称为超级专家。我们的全面分析为SEservicials提供了更深入的解析。 (i)Seres decricorations的分布非常罕见,但极富的Sericuldical-dealation在SEretailation中也具有显著的影响力。
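
The Super Experts abstract above says these experts show rare but extreme activation outliers in the `down_proj` output. The sketch below shows one simple way such outliers could be flagged once per-expert activation magnitudes have been recorded. The detection rule (a maximum absolute activation exceeding `ratio` times the median across experts), the default `ratio` of 100, and the function name `find_outlier_experts` are assumptions made for illustration; the paper's actual profiling criterion is not given in the abstract.

```python
from statistics import median
from typing import Dict, List, Tuple

ExpertId = Tuple[int, int]  # (layer index, expert index)


def find_outlier_experts(
    max_abs_activation: Dict[ExpertId, float], ratio: float = 100.0
) -> List[ExpertId]:
    """Flag experts whose peak activation dwarfs the typical expert's peak.

    `max_abs_activation` maps each expert to the largest absolute value seen
    in its down_proj output over some calibration data (collected elsewhere).
    """
    baseline = median(max_abs_activation.values())
    if baseline <= 0:
        return []
    return sorted(
        expert
        for expert, value in max_abs_activation.items()
        if value > ratio * baseline
    )


if __name__ == "__main__":
    # Made-up magnitudes for four experts; only one is extreme.
    stats = {(0, 0): 1.0, (0, 1): 1.2, (1, 3): 900.0, (1, 4): 0.8}
    print(find_outlier_experts(stats))
```

Using the median as the baseline keeps a handful of extreme experts from inflating the reference value they are compared against.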

Article 62

Title@2025-07-31 (4): AI-Reporter: A Path to a New Genre of Scientific Communication

Title: AI-Reporter: A Path to a New Genre of Scientific Communication

AI-Reporter: Ein Weg zu einem neuen Genre wissenschaftlicher Kommunikation

AI-记者:通向科学通信新一流的道路 2507.05903v2

Authors (1): Gerd Graßhoff

The AI-Reporter represents a paradigmatic shift in scientific publication practice. This document demonstrates through a concrete case study how our system transforms academic presentations into publication-ready chapters in less than three minutes. Using Arno Simons’ lecture on Large Language Models from the “Large Language Models for the History, Philosophy, and Sociology of Science” workshop (NEPI) as an example, we show how technological innovation bridges the gap between ephemeral presentation and permanent scientific documentation.

AI-Reporter代表了科学出版实践的范式转变。本文件通过具体案例研究展示了我们的系统如何在不到3分钟内将学术介绍转变为可供出版的章节。我们以“科学历史、哲学和社会学大语言模型”研讨会(NEPI)为例,利用Arno Simons关于“大语言模型”的讲座,我们展示了技术创新如何弥合时间介绍和长期科学文献之间的差距。

Article 63

Title@2025-07-31 (4): Evaluating LLMs’ Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

Title: Evaluating LLMs’ Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

Bewertung der Mehrsprachigkeitsfähigkeiten von LLMs für Bengalen: Benchmark-Erstellung und Leistungsanalyse

评价孟加拉多种语文能力:基准设定和业绩分析 2507.23248v1

Authors (5): Shimanto Bhowmik, Tawsif Tashwar Dipto, Md Sazzad Islam, Sheryl Hsu, Tahsin Reasat

Bengali is an underrepresented language in NLP research, and it remains a challenge due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open source Large Language Models (LLMs) in 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy: models tend to perform worse when inputs are excessively tokenized, whereas more efficient and concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research are publicly available at https://github.com/BengaliAI/bn-llm-benchmark.

孟加拉语是国家语言方案研究中代表不足的语言。然而,由于语言结构和计算限制的独特性,孟加拉语是一个挑战。在这项工作中,我们系统地调查阻碍孟加拉语国家语言方案业绩的挑战,重点是缺乏标准化的评价基准。我们随后在8个翻译数据集中评估了10个最近开放源码大语言模型(LLMs),并进行了全面的错误分析,以确定其主要失败模式。我们的调查结果显示孟加拉语与英语的绩效差距始终存在,特别是米斯特拉尔等小型模型和具体模型家庭。我们还发现,在某些结构中,如DeepSeek(DeepSeek),保持不同语言更稳定的性能。我们的分析揭示了象征性效率与LLM(LM)准确性之间的反比关系。当投入过于象征性时,模型往往表现更差,而更有效率的缩写效果则导致业绩的改善。这些调查结果突出了当前模型落后的关键领域,并强调了改进数据集质量和评估方法的必要性,特别是针对多种语言背景的小型模型。这项工作将推动对NLP(NLP)进行进一步的研究,有助于全世界使用先进语言技术的民主化。我们的分析揭示了象征性的代码和数据。用于这一公开研究。
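
The Bengali benchmark abstract above links excessive tokenization to lower accuracy. One common way to quantify how heavily a text is tokenized is the average number of tokens per whitespace-separated word. The sketch below computes that ratio; it is an assumption that this matches the paper's measure, and the byte-level `utf8_byte_tokenizer` is only a stand-in used so the example runs without any model tokenizer.

```python
from typing import Callable, List


def utf8_byte_tokenizer(text: str) -> List[int]:
    """Stand-in tokenizer: one token per UTF-8 byte (not any model's tokenizer)."""
    return list(text.encode("utf-8"))


def tokens_per_word(text: str, tokenize: Callable[[str], List]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


if __name__ == "__main__":
    english = "I eat rice"
    bengali = "আমি ভাত খাই"  # the same sentence in Bengali
    print(tokens_per_word(english, utf8_byte_tokenizer))
    print(tokens_per_word(bengali, utf8_byte_tokenizer))
```

With a real subword tokenizer passed as `tokenize`, the same function could be used to compare languages or models. The byte-level stand-in merely makes the gap visible, because each Bengali code point occupies three bytes in UTF-8.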

Article 64

Title: P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication

P-ReMIS: Pragmatische Vernunft in der psychischen Gesundheit und einer sozialen Implikation

P-REMIS: 心理健康和社会影响方面的实用原因 2507.23247v1

Authors (2): Sneha Oram, Pushpak Bhattacharyya

There have been recent advancements in the explainability and development of personalized chatbots for mental health. However, the reasoning aspects of explainability and dialogue discourse have not previously been explored for mental health. Hence, we investigate the pragmatic reasoning capability of large language models (LLMs) in this domain. We introduce the P-ReMe dataset and propose a modified definition for the pragmatic phenomena of implicature (implied meaning) and presupposition (implicit assumption) in mental health. Following the definition, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the presented tasks, we consider four models: Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning capabilities in the domain. In addition, we propose StiPRompts to study the stigma around mental health with the state-of-the-art LLMs GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku. Our findings show that Claude-3.5-haiku deals with the stigma more responsibly than the other two LLMs.

最近,精神健康个人化聊天室的可解释性和发展有所进展,然而,关于解释性和对话讨论的推理方面以前尚未探讨过,因此,我们正在调查该领域大型语言模型(LLMS)的实用推理能力;我们引进P-ReMe数据集,提出精神健康隐含(隐含意义)和预言(隐含假设)的实用现象的修改定义;在定义之后,我们拟订了隐含性的两项任务和预言的一项任务;为数据集和所提出的任务确定基准,我们考虑了四种模式-Llama3.1、Mistral、MentalLALLAMA和Quwen。实验结果表明,Mistral和Qwen在这方面表现出了很强的推理能力。此外,我们还提议StiPRompts研究关于精神健康的污名,与最先进的LMM、GPT-4o mini、Depseek-chat和Claude-3.5haik-haku进行对比。我们的评估结论显示,与LLLA3.5-3.5号与其他的污名比较,是负责任的。

Article 65

Title@2025-07-31 (4): Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents

Title: Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents

Generalisiertes Verstärkungslernen für retriever-spezifische Abfrage-Rewriter mit unstrukturierten Real-World-Dokumenten

利用无结构的 “ 现实世界文件 “ 检索特定查询卷卷的通用强化学习 2507.23242v1

Authors (6): Sungguk Cha, DongWook Kim, Taeseung Hahn, Mintae Kim, Youngsub Han, Byoung-Ki Jeon

Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce RL-QR, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with $\text{RL-QR}_{\text{multi-modal}}$ achieving an 11% relative gain in NDCG@3 for multi-modal RAG and $\text{RL-QR}_{\text{lexical}}$ yielding a 9% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. Our findings highlight RL-QR’s potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.

(RAG) 系统严重依赖有效的查询配置,以释放外部知识,而优化对多样化、非结构化现实世界文件的查询仍然是一项挑战。我们引入了\ textbf{RL-QR},这是一个强化的检索器特定查询重写的学习框架,它消除了对人附加说明数据集的需求,并扩展了对文本专用数据库和多模式数据库的适用性。通过对情景-问题配对并利用通用的Reward政策优化化(GROPO)、RL-QR火车针对特定检索器的查询重写者,从而提高了不同领域的检索性能。我们对工业内部数据的实验显示出了显著的改进,用$\ text{RL-QL-Qtext{Mult{Muld-modal$在NDCG@3中实现了11 相对增益,用于多模式的RAG和$text{RL-rlexli reliflical relitical },使软件回收者进一步增9。然而,对于语调化和混合检索器精炼机组的挑战依然存在着挑战依然存在着,在Rlical-rchal-rcal-rch的重新整理中,在重新整理中无法改进我们的研究,而提供一种可能的学习。
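
The NDCG@3 figure and the relative gain quoted in the RL-QR abstract above can be made concrete with a small sketch. It assumes graded relevance labels and the common log2 rank discount. The function names and the example relevance lists are illustrative and are not taken from the paper.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items of a ranking."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_gain(baseline, improved):
    """Relative improvement of a metric over its baseline value."""
    return (improved - baseline) / baseline


# Hypothetical relevance labels of retrieved documents, in ranked order,
# for the original query and for a rewritten query.
original_ranking = [0, 1, 0, 1]
rewritten_ranking = [1, 0, 1, 0]

if __name__ == "__main__":
    before = ndcg_at_k(original_ranking, 3)
    after = ndcg_at_k(rewritten_ranking, 3)
    print(f"NDCG@3 before={before:.3f} after={after:.3f} gain={relative_gain(before, after):.1%}")
```

A rewriter reward could, under these assumptions, be the NDCG@3 of the documents returned for the rewritten query. The abstract does not state the exact reward, so this is only one plausible reading.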

Article 66

Title@2025-07-31 (4): Cutting Through the Noise: Boosting LLM Performance on Math Word Problems

Title: Cutting Through the Noise: Boosting LLM Performance on Math Word Problems

Schneiden durch den Lärm: Steigerung der LLM-Performance bei Math Word-Problemen

通过噪音剪切:促进数学字问题LLM的LLM性能 2406.15444v4

Authors (6): Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, Swaroop Mishra

Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and improved ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to 6%.

大型语言模型(LLMS)在包括解决数学词问题(MWPs)在内的各种任务方面非常出色,但与包含不相关信息的现实世界问题作斗争。为了解决这个问题,我们建议了一个快速框架,通过添加不相关变量来产生MWP的对抗变体。我们引入了一个数据集,即ProbleMATHIC,包含对抗性和非对抗性MWP。我们的实验显示,LMS很容易被数字噪音分散,导致对抗性MWP的平均相对性能下降~26%。为了减轻这一影响,我们从我们的数据集中提取了对抗性样本的微调LMS(Llama-2,Mistral)。对对抗性培训案例的微调提高了对抗性能的~8%,表明对噪音的强大性能,提高了为推理确定相关数据的能力。最后,为了评估我们快速性框架的通用性,我们引入了GSM-8K-Adv,即GSM-8K基准的对抗性变体。LMS在面对对抗性信息时继续挣扎,将性能降低到6%。
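
The idea of building adversarial variants by adding irrelevant variables, described in the abstract above, can be sketched roughly as follows. This is a minimal illustration, not the PROBLEMATHIC pipeline: the helper names, the example problem, and the distractor sentence are all hypothetical, and the accuracies in the usage note are made up to show how a relative drop is computed.

```python
import random


def make_adversarial(context, question, distractor, rng):
    """Insert one irrelevant sentence at a random position among the context sentences.

    The question is always kept last, so the task itself is unchanged.
    """
    position = rng.randint(0, len(context))
    noisy_context = context[:position] + [distractor] + context[position:]
    return " ".join(noisy_context + [question])


def relative_drop(clean_accuracy, adversarial_accuracy):
    """Relative performance drop from clean to adversarial problems."""
    return (clean_accuracy - adversarial_accuracy) / clean_accuracy


if __name__ == "__main__":
    rng = random.Random(0)
    problem = make_adversarial(
        ["Tom has 3 apples.", "He buys 2 more."],
        "How many apples does Tom have?",
        "His sister is 9 years old.",  # numerical noise unrelated to the answer
        rng,
    )
    print(problem)
    print(f"relative drop: {relative_drop(0.80, 0.592):.0%}")
```

With these invented accuracies (0.80 clean, 0.592 adversarial) the relative drop is 26%, which is the kind of quantity the abstract reports.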

Article 67

Title@2025-07-31 (4): Framing Political Bias in Multilingual LLMs Across Pakistani Languages

Title: Framing Political Bias in Multilingual LLMs Across Pakistani Languages

Framing politische Bias in mehrsprachigen LLMs in pakistanischen Sprachen

以多语种LLMs多种巴基斯坦语言界定政治偏见 2506.00068v2

Authors (3): Afrozah Nadeem, Mark Dras, Usman Naseem

Large Language Models (LLMs) increasingly shape public discourse, yet most evaluations of political and economic bias have focused on high-resource, Western languages and contexts. This leaves critical blind spots in low-resource, multilingual regions such as Pakistan, where linguistic identity is closely tied to political, religious, and regional ideologies. We present a systematic evaluation of political bias in 13 state-of-the-art LLMs across five Pakistani languages: Urdu, Punjabi, Sindhi, Pashto, and Balochi. Our framework integrates a culturally adapted Political Compass Test (PCT) with multi-level framing analysis, capturing both ideological stance (economic/social axes) and stylistic framing (content, tone, emphasis). Prompts are aligned with 11 socio-political themes specific to the Pakistani context. Results show that while LLMs predominantly reflect liberal-left orientations consistent with Western training data, they exhibit more authoritarian framing in regional languages, highlighting language-conditioned ideological modulation. We also identify consistent model-specific bias patterns across languages. These findings show the need for culturally grounded, multilingual bias auditing frameworks in global NLP.

大型语言模式(LLMS)日益影响公共对话,然而,大多数对政治和经济偏见的评价都集中在高资源、西方语言和背景上,这在巴基斯坦等低资源、多语言地区留下了重要的盲点,这些地区的语言认同与政治、宗教和区域意识形态密切相关;我们系统地评价了巴基斯坦五种语言:乌尔都语、旁遮普语、信德语、普什图语和巴洛奇语的13个最先进的LLMS中的政治偏见:乌尔都语、旁遮普语、信德语、普什图语和巴洛奇语。我们的框架将文化上适应的《政治指南测试》(PCT)与多层次的设置分析相结合,既包括意识形态立场(经济/社会轴心),也包括文体框架(内容、语调、重点)。提示与巴基斯坦背景下的11个社会政治主题一致。结果显示,尽管LMS主要反映与西方培训数据相一致的自由左方向,但它们在区域语言中表现出更加专制的设置,强调语言的意识形态调制。我们还确定了不同语言的一致的典型偏见模式模式。这些结论显示,全球国家语言需要基于文化的多语言的多语制审计框架。

Article 68

Title@2025-07-31 (4): AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

Title: AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

AgentSpec: Anpassbare Runtime Enforcement für sichere und zuverlässige LLM-Agenten

安全可靠LLM代理商的可定制运行时间执法 2503.18666v3

Authors (3): Haoyu Wang, Christopher M. Poskitt, Jun Sun

Agents built on LLMs are increasingly deployed across diverse domains, automating complex decision-making and task execution. However, their autonomy introduces safety risks, including security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigation methods, such as model-based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. To address these challenges, we propose AgentSpec, a lightweight domain-specific language for specifying and enforcing runtime constraints on LLM agents. With AgentSpec, users define structured rules that incorporate triggers, predicates, and enforcement mechanisms, ensuring agents operate within predefined safety boundaries. We implement AgentSpec across multiple domains, including code execution, embodied agents, and autonomous driving, demonstrating its adaptability and effectiveness. Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs). Despite its strong safety guarantees, AgentSpec remains computationally lightweight, with overheads in milliseconds. By combining interpretability, modularity, and efficiency, AgentSpec provides a practical and scalable solution for enforcing LLM agent safety across diverse applications. We also automate the generation of rules using LLMs and assess their effectiveness. Our evaluation shows that the rules generated by OpenAI o1 achieve a precision of 95.56% and recall of 70.96% for embodied agents, successfully identify 87.26% of the risky code, and prevent AVs from breaking laws in 5 out of 8 scenarios.

建在LLMS上的代理人越来越多地在不同领域部署,使复杂的决策和任务执行自动化;然而,他们的自主性带来了安全风险,包括安全脆弱性、法律违规和意外有害行动; 现有的缓解方法,例如基于模型的保障措施和早期执行战略,缺乏稳健性、可解释性和适应性; 为了应对这些挑战,我们提议Agrespe,这是用于具体规定和强制执行LM代理的运行时间限制的一种轻量级域别语言; 与Agrespe, 用户制定结构化规则,包括触发器、前提和执法机制,确保代理人在预先确定的安全边界内运作; 我们执行各种领域,包括代码执行、体现代理人和自主驾驶,展示其适应性和有效性; 我们的评估表明,在90%以上的代码代理人案件中,Agrespe成功地防止了不安全处决,消除了所有体现代理任务中的危险行动,并强制执行了自动车辆100%的合规性。尽管有强有力的安全保障,但Agrespec公司仍然计算出较轻的重量,并有毫秒的间接费用。通过将解释性、模块性和效率结合起来,AstrimSpeceptrosprospect 提供一种实用和可理解性的规则,我们用ALLMSDMSDMS的透明性规则, 也通过透明化了一种可实现了一种可操作性的方法, 。
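
The trigger, predicate, and enforcement structure mentioned in the AgentSpec abstract above can be illustrated with a plain Python sketch. This is not AgentSpec syntax and not the authors' implementation: the `Rule` class, the event names, and the `"block"`/`"allow"` decisions are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    trigger: str                       # hypothetical event name that activates the rule
    predicate: Callable[[Dict], bool]  # returns True when the proposed action is unsafe
    enforce: str                       # decision applied when the predicate holds


def check_action(rules: List[Rule], event: str, action: Dict) -> str:
    """Return the decision of the first matching rule, or "allow" if none fires."""
    for rule in rules:
        if rule.trigger == event and rule.predicate(action):
            return rule.enforce
    return "allow"


if __name__ == "__main__":
    rules = [
        Rule(
            trigger="shell_command",
            predicate=lambda action: "rm -rf" in action["command"],
            enforce="block",
        )
    ]
    print(check_action(rules, "shell_command", {"command": "rm -rf /tmp/x"}))
    print(check_action(rules, "shell_command", {"command": "ls"}))
```

Checking each proposed action against a short rule list before execution is cheap, which is consistent with the millisecond-level overhead the abstract reports, although the real system may work differently.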

Article 69

Title@2025-07-31 (4): Enabling Few-Shot Alzheimer’s Disease Diagnosis on Tabular Biomarker Data with LLMs

Title: Enabling Few-Shot Alzheimer’s Disease Diagnosis on Tabular Biomarker Data with LLMs

Ermöglichung der weniger scharfen Alzheimer-Krankheit Diagnose auf Tabular Biomarker Daten mit LLMs

使小热阿尔茨海默氏病的疾病诊断能够用LMS在表示生物标记数据上进行 2507.23227v1

Authors (9): Sophie Kearney, Shu Yang, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason Moore, Marylyn Ritchie, Li Shen

Early and accurate diagnosis of Alzheimer’s disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer’s Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaption for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.

对阿尔茨海默氏病(AD)这一复杂的神经退化性神经疾病(AD)的早期和准确诊断是复杂的神经退化性疾病,需要分析通常以表格形式呈现的多种生物标志(例如神经成像、遗传风险因素、认知测试和脑脊髓液蛋白),以表格形式对它进行分析。采用灵活的短片推理、多式整合和基于自然语言的可解释性,大型语言模型(LLLMS)提供了前所未有的机会,用结构化的生物医学数据进行预测。我们提议了一个称为TAP-GPT的新型框架,即Tabulal 阿尔茨海默氏氏病的预测GPT,以调整表GPT2,即最初为商业情报任务而开发的多表单专用LMM,用于使用结构化生物标志的结构性生物标志诊断诊断。我们的方法,即利用结构化生物数据学数据和细微的表型模型,将LMMS的先前知识用于更高级的预测基础。
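
A few-shot tabular prompt of the kind the TAP-GPT abstract above describes might be assembled as in the sketch below. The serialization format, the column names (`age`, `mmse`), and the example values are hypothetical; the abstract does not specify how TAP-GPT formats its prompts.

```python
def build_prompt(columns, examples, query_row):
    """Serialize labeled rows plus one unlabeled row as a pipe-separated table.

    examples: list of (row_dict, label) pairs used as in-context demonstrations.
    query_row: the row to classify; its label cell is left as "?".
    """
    lines = [" | ".join(columns + ["diagnosis"])]
    for row, label in examples:
        lines.append(" | ".join(str(row[c]) for c in columns) + " | " + label)
    lines.append(" | ".join(str(query_row[c]) for c in columns) + " | ?")
    return "\n".join(lines)


if __name__ == "__main__":
    columns = ["age", "mmse"]  # hypothetical biomarker columns
    examples = [
        ({"age": 72, "mmse": 21}, "AD"),
        ({"age": 68, "mmse": 29}, "CN"),
    ]
    print(build_prompt(columns, examples, {"age": 75, "mmse": 23}))
```

The model would then be asked to fill in the final cell with AD or CN, which matches the binary classification task named in the abstract.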

Article 70

Title@2025-07-31 (4): Unveiling the Influence of Amplifying Language-Specific Neurons

Title: Unveiling the Influence of Amplifying Language-Specific Neurons

Enthüllen des Einflusses amplifizierender sprachspezifischer Neuronen

消除扩增语言特有新元的影响 2507.22581v2

Authors (6): Inaya Rahmanisa, Lyzander Marciano Andrylie, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji

Language-specific neurons in LLMs, which strongly correlate with individual languages, have been shown to influence model behavior when deactivated. However, the effect of amplifying them remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate the optimal factor on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial, especially for low-resource languages, but provides limited advantage for cross-lingual transfer.

与个别语言密切相关的LLM中语言特有神经元在LLM中被证明通过解除其作用来影响模式行为。然而,他们在扩增中的作用仍未得到充分探讨。这项工作调查了通过18种语言的干预措施扩大语言特有神经元的效果,包括使用三种主要以不同语言培训的模型,使用三种主要以不同语言培训的模型。我们通过使用拟议的语言指导转变评价分数来比较扩增因素在引导目标语言方面的效力,然后评价下游任务:常识推理(XCOPA、XWinograd)、知识(Include)和翻译(FLORES)。最佳扩增因素有效地将产出引向几乎所有经过测试的语言。在下游任务中使用这一因素的干预措施在某些情况下提高了自我语言性能,但通常会降低跨语言结果。这些调查结果突出了语言特有的神经元在多语种行为中的影响,在这些中增益特别有利于低资源语言,但为跨语言转移提供了有限的优势。
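
The intervention described in the abstract above, scaling the activations of selected neurons by a factor, can be sketched on a plain list of numbers. This is an assumption-laden illustration: real interventions act on hidden activations inside the model, and the indices and factors here are invented.

```python
def amplify(activations, neuron_ids, factor):
    """Return a copy of the activations with the selected neurons scaled by `factor`.

    factor > 1 amplifies, 0 < factor < 1 dampens, and factor == 0 deactivates.
    """
    scaled = list(activations)
    for index in neuron_ids:
        scaled[index] *= factor
    return scaled


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0]
    language_neurons = [0, 2]  # hypothetical indices of language-specific neurons
    print(amplify(hidden, language_neurons, 2.0))  # amplification
    print(amplify(hidden, language_neurons, 0.0))  # deactivation as a special case
```

Treating deactivation as the special case `factor == 0` shows why amplification is a natural extension of the earlier deactivation studies the abstract refers to.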

Article 71

Title@2025-07-31 (4): LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Title: LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

LLM-Crowdsourced: Ein Benchmark-freies Paradigma zur gegenseitigen Bewertung großer Sprachmodelle

LLM-文献来源:用于对大语言模式进行相互评价的无基准建模 2507.22359v2

Authors (8): Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Kai Chen, Xiaofeng Wang, Baosheng Wang

Although large language models (LLMs) demonstrate remarkable capabilities across various tasks, evaluating their capabilities remains a challenging task. Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference. These issues make it difficult to evaluate the LLMs’ true capabilities comprehensively. To tackle these challenges, we propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced. It utilizes LLMs to generate questions, answer independently, and evaluate mutually. This method integrates four key evaluation criteria: dynamic, transparent, objective, and professional, which existing evaluation methods cannot satisfy simultaneously. Experiments on eight mainstream LLMs across mathematics and programming verify the advantages of our method in distinguishing LLM performance. Furthermore, our study reveals several novel findings that are difficult for traditional methods to detect, including but not limited to: (1) Gemini demonstrates the highest original and professional question-design capabilities among others; (2) Some LLMs exhibit “memorization-based answering” by misrecognizing questions as familiar ones with a similar structure; (3) LLM evaluation results demonstrate high consistency (robustness).

尽管大型语言模型(LLMs)在各种任务中表现出非凡的能力,但评价其能力仍然是一项艰巨的任务。现有的评价方法存在数据污染、黑盒操作和主观偏好等问题。这些问题使得很难全面评价LLMs的真正能力。为了应对这些挑战,我们提议了一个新型的无基准评价模式(LLM-Crowds),它利用LLMs提出问题、独立回答和相互评价。这种方法综合了四个关键评价标准:动态、透明、客观和专业,而现有的评价方法无法同时满足。在数学和编程中对八个主流LMs的实验证实了我们区分LLM业绩的方法的优势。此外,我们的研究揭示出一些难以发现传统方法的新发现,包括但不限于:(1) Gemini展示了其他人之间最高的原始和专业问题设计能力;(2) 一些LLMS展示了“以模范为基础的回答”,因为对类似结构熟悉的问题认识不当;(3) LLMM的评价结果显示高度一致(破坏)。
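
The mutual evaluation step in the abstract above, where models score one another's answers, could be aggregated as in this sketch. Excluding a model's score of its own answer and using a simple mean are assumptions; the abstract does not describe the aggregation rule, and the model names and scores are invented.

```python
def mutual_scores(score_matrix):
    """Average the scores each model receives from the other models.

    score_matrix[judge][candidate] is the score `judge` gave to `candidate`.
    Self-assigned scores, if present, are ignored (an assumption of this sketch).
    """
    models = list(score_matrix)
    result = {}
    for candidate in models:
        received = [
            score_matrix[judge][candidate]
            for judge in models
            if judge != candidate and candidate in score_matrix[judge]
        ]
        result[candidate] = sum(received) / len(received) if received else 0.0
    return result


if __name__ == "__main__":
    scores = {
        "A": {"B": 8, "C": 6},
        "B": {"A": 9, "C": 5},
        "C": {"A": 7, "B": 6},
    }
    print(mutual_scores(scores))
```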

Article 72

Title@2025-07-31 (4): Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders

Title: Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders

Model Directions, keine Worte: Mechanistische Themenmodelle mit Sparse Autoencodern

模型方向,非单词:使用粗态自动编码器的机械专题模型 2507.23220v1

Authors (8): Carolina Zheng, Nicolas Beltran-Velez, Sweta Karlekar, Claudia Shi, Achille Nazaret, Asif Mallik, Amir Feder, David M. Blei

Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural variants use richer representations, they are similarly constrained by expressing topics as word lists, which limits their ability to articulate complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic models that operate on interpretable features learned by sparse autoencoders (SAEs). By defining topics over this semantically rich space, MTMs can reveal deeper conceptual themes with expressive feature descriptions. Moreover, uniquely among topic models, MTMs enable controllable text generation using topic-based steering vectors. To properly evaluate MTM topics against word-list-based approaches, we propose topic judge, an LLM-based pairwise comparison evaluation framework. Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by topic judge, and enable effective steering of LLM outputs.

传统专题模型对于在大型文本集中发现潜在主题十分有效,然而,由于它们依赖一袋字表,因此难以捕捉精度抽象特征。虽然一些神经变异体使用较丰富的表达方式,但它们同样受到以单词列表的形式表达主题的限制,这限制了它们阐述复杂专题的能力。我们引入了机械主题模型(MTMs),这是一组以稀疏自动计算器(SAEs)所学的可解释特征运作的一类专题模型。MTMs通过界定这个精度丰富的空间,可以揭示更深的概念主题,并进行表达性特征描述。此外,在专题模型中,MTMs使得能够使用基于主题的指导矢量进行可控的文本生成。为了恰当地评估基于单词列表的方法的MTM专题,我们建议采用基于LMM的双向比较评估框架。在五个数据集中,MTMs匹配或超过关于一致性指标的传统和神经基线,专题法官一贯选择,并能够有效地指导LM产出。

Article 73

Title@2025-07-31 (4): Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires

Title: Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires

Kulturelle Bias in großen Sprachmodellen: Bewertung von KI-Agenten durch moralische Fragebögen

大语言模式中的文化偏见:通过道德问卷评估AI代理 2507.10073v2

Authors (1): Simon Münker

Are AI systems truly representing human values, or merely averaging across them? Our study suggests a concerning reality: Large Language Models (LLMs) fail to represent diverse cultural moral frameworks despite their linguistic capabilities. We expose significant gaps between AI-generated and human moral intuitions by applying the Moral Foundations Questionnaire across 19 cultural contexts. Comparing multiple state-of-the-art LLMs’ origins against human baseline data, we find these models systematically homogenize moral diversity. Surprisingly, increased model size doesn’t consistently improve cultural representation fidelity. Our findings challenge the growing use of LLMs as synthetic populations in social science research and highlight a fundamental limitation in current AI alignment approaches. Without data-driven alignment beyond prompting, these systems cannot capture the nuanced, culturally-specific moral intuitions. Our results call for more grounded alignment objectives and evaluation metrics to ensure AI systems represent diverse human values rather than flattening the moral landscape.

人工智能系统是否真正代表了人类价值观,或者仅仅是在它们中间平均?我们的研究显示,现实存在:尽管有语言能力,大语言模型(LLMs)不能代表不同的文化道德框架。我们通过在19个文化背景中应用道德基础问卷,暴露了人工智能生成的和人类道德直觉之间的巨大差距。将多种最先进的LMs起源与人类基线数据进行比较,我们发现这些模型系统地将道德多样性同化。令人惊讶的是,扩大的模型规模并不能不断提高文化代表性的忠诚性。我们的调查结果挑战了LMs在社会科学研究中越来越多地作为合成人群使用,并凸显了当前人工智能调整方法中的基本局限性。没有数据驱动的一致,这些系统就无法捕捉到细微细的、文化特有的道德直觉。我们的结果要求更加有根基的调整目标和评估指标,以确保人工智能系统代表不同的人类价值观,而不是固化道德景观。

Article 74

Title@2025-07-31 (4): Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples

Title: Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples

Fehler sind die Steinschritte zum Erfolg: Erweitern Sie das wenige-heiße In-Context-Lernen durch die Nutzung negativer Muster

失败是走向成功的一步步石:通过利用负面样本加强少许热的文体学习 2507.23211v1

Authors (4): Yunhao Liang, Ruixuan Ying, Takuya Taniguchi, Zhe Cui

Large Language Models exhibit powerful few-shot in-context learning (ICL) capabilities, but the performance is highly sensitive to the provided examples. Recent research has focused on retrieving corresponding examples for each input query, not only enhancing the efficiency and scalability of the learning process but also mitigating inherent biases in manual example selection. However, these studies have primarily emphasized leveraging Positive samples while overlooking the additional information within Negative samples for contextual learning. We propose a novel method that utilizes Negative samples to better select Positive sample examples, thereby enhancing the performance of few-shot ICL. Initially, we construct Positive and Negative sample corpora based on Zero-Shot-CoT. Then, during inference, we employ a semantic similarity-based approach to select the most similar examples from both the Positive and Negative corpora for a given query. Subsequently, we further retrieve Positive examples from the Positive sample corpus based on semantic similarity to the Negative examples, then concatenate them with the previously selected Positive examples to serve as ICL demonstrations. Experimental results demonstrate that our approach surpasses methods relying solely on the most similar Positive examples for context, validating that the additional information in Negative samples aids in enhancing ICL performance through improved Positive sample selection.

大型语言模型展示了强大的点数的文字学习(ICL)能力,但是其性能非常敏感。最近的研究侧重于为每个输入查询检索相应的实例,不仅提高了学习过程的效率和可扩缩性,而且减轻了人工选择实例的固有偏差。然而,这些研究主要强调利用正面样本,同时忽略了负面样本中的额外信息,以便进行背景学习。我们提出了一个新颖的方法,利用负面样本更好地选择积极的样本实例,从而提高少数点数的ICL的性能。最初,我们根据Zero-Shot-Cot, 构建正和负样样子公司。然后,在推断期间,我们采用了基于语义的相似性方法,从正和负形样本中选择最相似的例子,供特定查询。随后,我们进一步从基于语义与负面示例相似的正面样本中获取正面实例,然后将它们与先前选定的正面实例混为一体,作为ICL的演示。实验结果表明,我们的方法超越了我们仅依靠最相似的积极性样本选择方法,通过改进的正面样本来提高积极性实例。
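
The retrieval steps in the abstract above (pick nearest Positive and Negative examples for the query, then pull further Positive examples that resemble those Negative ones, then concatenate) can be sketched as follows. The toy two-dimensional vectors stand in for real sentence embeddings, the ordering of the final demonstrations is an assumption, and all names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, corpus, k):
    """corpus is a list of (text, vector); return the k items most similar to the query."""
    return sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)[:k]


def select_demonstrations(query_vec, positives, negatives, k):
    """Combine Positive examples near the query with Positive examples near its Negative neighbours."""
    direct = top_k(query_vec, positives, k)
    near_negatives = top_k(query_vec, negatives, k)
    via_negatives = []
    for _, negative_vec in near_negatives:
        for item in top_k(negative_vec, positives, 1):
            if item not in direct and item not in via_negatives:
                via_negatives.append(item)
    # Assumed ordering: negative-guided examples first, then the directly retrieved ones.
    return [text for text, _ in via_negatives + direct]


if __name__ == "__main__":
    positives = [("p1", [1.0, 0.0]), ("p2", [0.0, 1.0]), ("p3", [0.7, 0.7])]
    negatives = [("n1", [0.1, 1.0])]
    print(select_demonstrations([1.0, 0.1], positives, negatives, k=1))
```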

Article 75

Title@2025-07-31 (4): InfAlign: Inference-aware language model alignment

Title: InfAlign: Inference-aware language model alignment

InfAlign: Inference-aware Sprachmodellausrichtung

Infagign: 参考意识语言模型对齐 2412.19792v4

Authors (12): Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami

Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-N sampling and best-of-N jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.

语言模型的匹配是培训现代基因化语言模型的关键步骤。匹配目标的目标是提高参照基准模型的匹配模式样本的赢率。今天,我们越来越多地使用推算时间算法(例如最佳计算方法、受控解码、树搜索)从语言模型解码,而不是标准抽样。我们显示,这种火车/测试错配使标准的RLHF框架在这种推论时间方法方面达到亚最佳标准RLHF框架。为此,我们提议了一个推算-觉一致(InfAllign)框架框架框架框架,目的是根据基准模型优化一致政策中的推论时间赢率。我们证明,对于任何推断-时间解码程序,最佳调整政策是解决标准RLHF问题的办法,而不是标准抽样。这激励我们提供校准RL(InfAlign-CTRL)框架的校准和变校准方法来解决这一问题,这需要一个奖赏性校准步骤和KL- 定期奖赏步骤,目的是对基准模型进行最优化的修改。最佳的校准率是最终的校准率。显示我们最终的校准标准的校准率。
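
Two pieces named in the InfAlign abstract above, reward calibration and Best-of-N decoding, can be illustrated with a short sketch. It assumes that calibration maps a raw reward to its empirical quantile among rewards of base-model samples for the same prompt, which the abstract does not spell out. The `transform` argument is a placeholder for whatever reward transformation is chosen and is not the paper's specific transformation.

```python
import bisect


def calibrated_reward(reward, base_rewards):
    """Empirical CDF value: fraction of base-policy sample rewards that are <= reward."""
    ordered = sorted(base_rewards)
    return bisect.bisect_right(ordered, reward) / len(ordered)


def transformed_reward(reward, base_rewards, transform):
    """Apply a user-supplied transformation to the calibrated reward."""
    return transform(calibrated_reward(reward, base_rewards))


def best_of_n(candidates, reward_fn):
    """Best-of-N decoding: return the candidate with the highest reward."""
    return max(candidates, key=reward_fn)


if __name__ == "__main__":
    base_sample_rewards = [0.1, 0.5, 0.9, 1.2]  # hypothetical rewards of base-model samples
    print(calibrated_reward(0.5, base_sample_rewards))
    print(transformed_reward(0.5, base_sample_rewards, lambda c: c ** 2))
    print(best_of_n(["a", "bbb", "cc"], reward_fn=len))
```

Calibrated rewards lie in [0, 1] regardless of the scale of the raw reward model, which is one reason a quantile-style calibration is convenient before applying a transformation.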

Article 76

Title@2025-07-31 (4): Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

Title: Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

Towards Inclusive ASR: Untersuchung der Sprachumwandlung für Dysarthric Speech Recognition in Low-Resource Sprachen

努力实现包容性的ASR:低资源语言中承认代谢语言语音转换调查 2505.14874v4

Authors (10): Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen

Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.

由于数据稀缺,特别是非英语语言的数据稀缺,对读写语言的自动语音识别(ASR)仍然具有挑战性。为此,我们调整了英语读写语言语言(UASpeech)的语音转换模型,以将发言者的特征和偏差进行编码,然后将其用于将健康的非英语语言(FLEURS)转换成非英语的读写语言(FLEURS),然后将生成的数据用于微调多语种语言的ASR模型(MMS),即大众多语种语言的语音(MMS),以便改进读写语言识别。对PC-GITA(西班牙语)、EasyCall(意大利语)和SSNCE(塔米尔语)的评价表明,语音转换和手动转换都大大超越了现有MMS的功能和常规增强技术,如速度和节奏渗透。对生成数据的客观和主观分析进一步证实,生成的语音模拟了Dysarthric特性。

Article 77

Title@2025-07-31 (4): Explaining vague language

Title: Explaining vague language

Unbestimmte Sprache erklären

解释含糊措辞 2404.18154v2

Authors (2): Paul Égré, Benjamin Spector

Why is language vague? Vagueness may be explained and rationalized if it can be shown that vague language is more useful to speaker and hearer than precise language. In a well-known paper, Lipman proposes a game-theoretic account of vagueness in terms of mixed strategy that leads to a puzzle: vagueness cannot be strictly better than precision at equilibrium. More recently, Égré, Spector, Mortier and Verheyen have put forward a Bayesian account of vagueness establishing that using vague words can be strictly more informative than using precise words. This paper proposes to compare both results and to explain why they are not in contradiction. Lipman’s definition of vagueness relies exclusively on a property of signaling strategies, without making any assumptions about the lexicon, whereas Égré et al.’s involves a layer of semantic content. We argue that the semantic account of vagueness is needed, and more adequate and explanatory of vagueness.

为何语言模糊?如果能够证明模糊语言比准确语言更有用,那么模糊语言可能会被解释和合理化。在一份众所周知的报纸上,利普曼提出混杂战略的模糊性游戏理论说明:模糊性绝对不能比平衡的精确性好。最近,斯派克特、斯派克特、莫蒂尔和韦尔希恩(Verheeen)提出了一个模糊性说明,证明使用模糊语言可以比精确语言更严格地说明信息。本文提议比较结果,解释为什么它们不矛盾。利普曼(Lipman)关于模糊性的定义完全依赖于信号战略的属性,而没有对词汇法作任何假设,而“Egrgr'e et al”等则涉及语义内容的层层。我们争辩说,需要模糊性语义的描述,并且更充分、更确切地解释模糊性。

Article 78

Title@2025-07-31 (4): Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

Title: Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

Geak: Einführung von Triton Kernel AI Agent & Evaluation Benchmarks

Geak:介绍Triton Kernel AI 代理和评估基准 2507.23194v1

Authors (10): Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, Emad Barsoum

The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels), a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines, achieving correctness of up to 63% and an execution speedup of up to 2.59x. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.

对AI 生成的 GPU 内核的需求正在迅速增长,这受到产业和学术界对可升级、硬件优化解决方案的需求的影响。随着深层次学习工作量在复杂和多样性方面不断增加,必须使低层内核开发自动化,以满足业绩和生产力需求。主要云源提供商、半导体公司和研究机构目前正在对GPU的AI驱动代码生成进行大量投资,目的是减少手工优化努力,同时在AMD MI300X等硬件上实现近距离专家业绩。Triton语言是用于GPU编程的基于Python的快速加速、硬件优化解决方案。Triton语言是用于GPUNB内核的多功能目标,因为其业绩平衡和生成方便性能。在这项工作中,我们为基于Triton的 GPUP内核内核和Gech (Generage 高效的 AI-cent GPUN) 生成了一套框架,该框架利用尖端LMM(包括AM MI300X 和 MI250) 的快速性硬化硬化操作码,具体用于AUD内核化的硬化的硬化操作。
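
A Reflexion-style reasoning loop of the kind the GEAK abstract above mentions can be sketched as a generate, evaluate, and retry cycle. The `generate` and `evaluate` callables are hypothetical stand-ins for an LLM call and a kernel test harness; nothing here reflects GEAK's actual interfaces.

```python
def refine_kernel(generate, evaluate, max_rounds):
    """Repeatedly generate code, test it, and feed the failure message back.

    generate(feedback) -> code string (feedback is None on the first round).
    evaluate(code) -> (passed, feedback) where feedback describes any failure.
    Returns (code, rounds_used), or (None, max_rounds) if no attempt passed.
    """
    feedback = None
    for round_index in range(max_rounds):
        code = generate(feedback)
        passed, feedback = evaluate(code)
        if passed:
            return code, round_index + 1
    return None, max_rounds


def speedup(baseline_ms, candidate_ms):
    """Execution speedup of a candidate kernel relative to a baseline runtime."""
    return baseline_ms / candidate_ms


if __name__ == "__main__":
    # Toy stand-ins: the "model" only produces working code after seeing feedback.
    def fake_generate(feedback):
        return "good kernel" if feedback else "buggy kernel"

    def fake_evaluate(code):
        passed = code == "good kernel"
        return passed, None if passed else "output mismatch"

    print(refine_kernel(fake_generate, fake_evaluate, max_rounds=3))
    print(speedup(baseline_ms=2.59, candidate_ms=1.0))
```

Raising `max_rounds` is one simple form of inference-time compute scaling: more attempts cost more compute but give the generator more feedback to work with.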

Article 79

Title@2025-07-31 (4): EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts

Title: EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts

EgoOops: Ein Datensatz zur Erkennung von Fehlern aus egozentrischen Videos, die sich auf Verfahrenstexte beziehen

EgoOops: 用于从 Egocentic 视频中检测错误动作的数据集, 指程序文字 2410.05343v3

Authors (10): Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, Shinsuke Mori

Mistake action detection is crucial for developing intelligent archives that detect workers’ errors and provide feedback. Existing studies have focused on visually apparent mistakes in free-style activities, resulting in video-only approaches to mistake detection. However, in text-following activities, models cannot determine the correctness of some actions without referring to the texts. Additionally, current mistake datasets rarely use procedural texts for video recording except for cooking. To fill these gaps, this paper proposes the EgoOops dataset, where egocentric videos record erroneous activities when following procedural texts across diverse domains. It features three types of annotations: video-text alignment, mistake labels, and descriptions for mistakes. We also propose a mistake detection approach, combining video-text alignment and mistake label classification to leverage the texts. Our experimental results show that incorporating procedural texts is essential for mistake detection. Data is available through https://y-haneji.github.io/EgoOops-project-page/.

现有研究侧重于自由式活动中的目视明显错误,从而导致只用视频方式发现错误;然而,在跟踪文本的活动中,模型不能在不参考文本的情况下确定某些行动的正确性;此外,目前的错误数据集很少使用程序文本进行录像记录,但烹饪除外;为填补这些空白,本文件提议EgoOops数据集,在EgoOops数据集中,以自我为中心的视频记录不同领域遵循程序文本的错误活动。它有三种说明类型:视频文本调整、错误标签和错误描述。我们还提议了一种错误识别方法,将视频文本的校正和错误标签分类结合起来,以利用文本。我们的实验结果表明,纳入程序文本对于发现错误至关重要。数据可通过https://y-haneji.github.io/EgoOops-Project-project-page/查阅。

Article 80

Title@2025-07-31 (4): Leveraging LLMs to Create Content Corpora for Niche Domains

Title: Leveraging LLMs to Create Content Corpora for Niche Domains

LLMs nutzen, um Content Corpora für Niche Domains zu erstellen

利用LMLM 来为新域创建内容公司 2505.02851v2

Authors (3): Franklin Zhang, Sonya Zhang, Alon Halevy

Constructing specialized content corpora from vast, unstructured web sources for domain-specific applications poses substantial data curation challenges. In this paper, we introduce a streamlined approach for generating high-quality, domain-specific corpora by efficiently acquiring, filtering, structuring, and cleaning web-based data. We showcase how Large Language Models (LLMs) can be leveraged to address complex data curation at scale, and propose a strategic framework incorporating LLM-enhanced techniques for structured content extraction and semantic deduplication. We validate our approach in the behavior education domain through its integration into 30 Day Me, a habit formation application. Our data pipeline, named 30DayGen, enabled the extraction and synthesis of 3,531 unique 30-day challenges from over 15K webpages. A user survey reports a satisfaction score of 4.3 out of 5, with 91% of respondents indicating willingness to use the curated content for their habit-formation goals.

在本文中,我们引入了一种简化的方法,通过高效获取、过滤、构建和清洁基于网络的数据来生成高质量的、针对具体领域的公司。我们展示了如何利用大语言模型来解决规模化的复杂数据整理问题,并提出了一个包含有结构化内容提取和语义重复的LLM强化技术的战略框架。我们验证了我们在行为教育领域的做法,将它整合为30天Me,即习惯形成应用程序。我们称为30DayGen的数据管道从15K网页上提取和合成了3 531个30天的独特挑战。用户调查显示,5个网页的满意度为4.3分,91%的答卷者表示愿意使用经整理的内容实现习惯形成目标。

Article 81

Title@2025-07-31 (4): LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration

Title: LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration

LENS: Lernen von Ensemble-Konfidenz aus neuronalen Zuständen für Multi-LLM-Antwortintegration

LENS:从神经状态学习集成置信度,用于多LLM答案整合 2507.23167v1

Authors (1): Jizhou Guo

Large Language Models (LLMs) have demonstrated impressive performance across various tasks, with different models excelling in distinct domains and specific abilities. Effectively combining the predictions of multiple LLMs is crucial for enhancing system robustness and performance. However, existing ensemble methods often rely on simple techniques like voting or logits ensembling, which overlook the varying confidence and reliability of models in different contexts. In this work, we propose LENS (Learning ENsemble confidence from Neural States), a novel approach that learns to estimate model confidence by analyzing internal representations. For each LLM, we train a lightweight linear confidence predictor that leverages layer-wise hidden states and normalized probabilities as inputs. This allows for more nuanced weighting of model predictions based on their context-dependent reliability. Our method does not require modifying the model parameters and requires negligible additional computation. Experimental results on multiple-choice and boolean question-answering tasks demonstrate that LENS outperforms traditional ensemble methods by a substantial margin. Our findings suggest that internal representations provide valuable signals for determining model confidence and can be effectively leveraged for ensemble learning.

大型语言模型(LLMs)在各种任务中表现出了令人印象深刻的成绩,不同模型在不同的领域和具体能力方面各有所长。有效地结合多个LLM的预测对于提高系统稳健性和性能至关重要。然而,现有的集成方法往往依赖简单的技术,如投票或logits集成,这些技术忽视了模型在不同情境下不同的置信度和可靠性。在这项工作中,我们提出LENS(从神经状态学习集成置信度),这是一种新颖的方法,通过分析内部表示来学习估计模型的置信度。我们为每个LLM训练一个轻量级线性置信度预测器,以逐层隐藏状态和归一化概率作为输入。这使得能够根据模型依赖情境的可靠性对其预测进行更细致的加权。我们的方法不需要修改模型参数,只需要可忽略不计的额外计算。多项选择和布尔问答任务的实验结果表明,LENS大幅优于传统的集成方法。我们的研究结果表明,内部表示为确定模型置信度提供了宝贵的信号,并且能够有效地用于集成学习。
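
As a rough illustration of the LENS idea described above (one linear confidence predictor per model, used to weight that model's answer), here is a minimal Python sketch. The abstract does not say how the confidences are combined, so the confidence-weighted vote, the logistic link, the toy feature vectors, and the names `linear_confidence` and `weighted_vote` are all assumptions made for this example rather than the authors' implementation.

```python
import math
from collections import defaultdict


def linear_confidence(features, weights, bias):
    """Logistic output of a linear probe: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def weighted_vote(answers, confidences):
    """Return the answer whose supporters have the largest total confidence."""
    totals = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        totals[answer] += conf
    return max(totals, key=totals.get)


# Hypothetical toy inputs: one feature vector and one (weights, bias) probe per model.
# In the paper the features come from hidden states and normalized probabilities;
# here they are made-up numbers.
features = [[-1.0, 0.0], [0.5, -1.0], [1.5, 0.0]]
probes = [([1.0, 1.0], 0.0), ([1.0, 1.0], 0.0), ([2.0, 0.0], 0.0)]
answers = ["A", "A", "B"]

confidences = [linear_confidence(x, w, b) for x, (w, b) in zip(features, probes)]
print(weighted_vote(answers, confidences))
```

In this toy input the single high-confidence model outweighs the two low-confidence ones, which plain majority voting would not allow.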

Article 82

Title@2025-07-31 (4): Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Title: Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Vision-Language-Modelle sind in Bezug auf Expression Generation nicht pragmatisch kompetent

视觉-语言模型在代言表达式生成中不具备实用能力 2504.16060v3

Authors (9): Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai

Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.

表达生成是一项核心任务,用于评价视觉语言系统的实际能力,不仅需要准确的语义基础,还需要遵守合作通信原则(Grice,1975年)。然而,目前对视觉语言模型的评价往往忽略了务实层面,将区域语言模型降低到基于区域的字幕任务,忽视了Gricean的格言。在这项工作中,我们从务实的角度重新审视区域地名组,引入一个新的数据集(RefOI) 1.5k 图像,带有书面和口头参考表达的附加说明。通过系统评估最新VLMs,我们发现务实能力的三个关键缺陷:(1) 无法独家识别参考内容,(2) 包括过度或不相关的信息,(3) 与人的实际偏好不相匹配,例如没有充分利用最低限度的空间提示。我们还表明标准自动评价未能捕捉到这些务实的违规行为,加强了肤浅的提示,而不是真正的偏差的成功。我们的调查结果要求重新关注符合真实人类交流的实用知情模式和评价框架。

Article 83

Title@2025-07-30 (3): User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal

Title: User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal

User-Feedback in Mensch-LLM-Dialogen: Eine Linse zum Verständnis der Nutzer, aber verrauscht als Lernsignal

人类-LLM对话中的用户反馈:理解用户的一个视角,但作为学习信号存在噪声 2507.23158v1

Authors (3): Yuhan Liu, Michael J. Q. Zhang, Eunsol Choi

Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving continuously based on their feedback. Asking for direct user feedback can be disruptive; thus, we study harvesting user feedback from user-LM interaction logs. We study implicit user feedback in two user-LM interaction datasets (WildChat and LMSYS). First, we analyze user feedback in the user-LM conversation trajectory, providing insights into when and why such feedback occurs. Second, we study harvesting learning signals from such implicit user feedback. We find that the contents of user feedback (e.g., the user wanted clarification), not just the polarity (e.g., users were unhappy with the previous model response), can improve model performance on short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). We also find that the usefulness of user feedback is largely tied to the quality of the user’s initial prompt. Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations.

一旦采用了语言模型(LMS),他们就可以与用户进行长期互动,理想的是,根据他们的反馈不断演变。询问直接用户反馈可能会造成干扰;因此,我们研究从用户-LM互动日志中收集用户反馈;我们在两个用户-LM互动数据集(WildChat和LMSYS)中研究隐含用户反馈。首先,我们在用户-LM对话轨迹中分析用户反馈,提供这种反馈的时间和原因。第二,我们研究从这些隐性用户反馈中收集学习的信号。我们发现,用户反馈的内容(例如,用户希望澄清)不仅仅是极性(例如,用户对先前的模型回应不满意),还可以在短期设计的问题(Bench)中改进模型性能,而不是在更长期和复杂的问题(WildBench)中(WildBench)中改进模型性能。我们还发现,用户反馈的有用性很大程度上与用户最初迅速提供的质量挂钩。我们共同提供隐性用户反馈的深入研究,显示其潜力和局限性。

Article 84

Title@2025-07-30 (3): Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer

Title: Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer

Kann eine Größe für alle passen?: Messfehler in Multi-Document-Zusammenfassung Domain-Transfer

能够一刀切吗? :在多文件概括性文件转让中衡量失败 2503.15768v2

Authors (2): Alexandra DeLucia, Mark Dredze

Abstractive multi-document summarization (MDS) is the task of automatically summarizing information in multiple documents, from news articles to conversations with multiple speakers. The training approaches for current MDS models can be grouped into four categories: end-to-end with special pre-training (“direct”), chunk-then-summarize, extract-then-summarize, and inference with GPT-style models. In this work, we evaluate MDS models across training approaches, domains, and dimensions (reference similarity, quality, and factuality), to analyze how and why models trained on one domain can fail to summarize documents from another (News, Science, and Conversation) in the zero-shot domain transfer setting. We define domain-transfer “failure” as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. In addition to exploring domain transfer for MDS models, we examine potential issues with applying popular summarization metrics out-of-the-box.

抽象的多文件总结(MDS)是用多种文件自动总结信息的任务,从新闻文章到与多位演讲人的对话,目前的MDS模式的培训方法可以分为四种方法:端到端,特别培训前(“直接”),块到端,当下总结,提取-当下总结,以及与GPT式模型的推论。在这项工作中,我们评估MDS模式在各种培训方法、领域和层面(参考类似性、质量和事实质量)之间(参考性、质量和事实质量),分析一个领域培训的模型如何和为什么不能在零发域转移设置中总结另一个领域(新闻、科学和对话)的文件。我们把域传输“失败”定义为事实质量下降,偏离目标程度更高,以及概要质量普遍下降。除了探索MDS模型的域转移外,我们还研究应用大众概括指标出框的潜在问题。

Article 85

Title@2025-07-30 (3): ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans

Title: ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans

ISO-Bench: Benchmarking multimodaler Kausalität in visuellen Sprachmodellen durch verfahrenstechnische Pläne

ISO-Bench:通过程序计划确定视觉语言模型中多式因果关系基准 2507.23135v1

Authors (3): Ananya Sadana, Yash Kumar Lal, Jiawei Zhou

Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), far behind human performance (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.

理解不同模式之间的因果关系是现实世界环境中运行的多式联运模式面临的一个核心挑战。我们引入了ISO-Bench,这是评估模型能否推断视觉观察与程序文本之间因果关系的基准。每个实例都展示了任务步骤的图像和计划文字片断,目的是决定视觉步骤是在参考文本步骤之前还是之后发生。十个前沿愿景语言模型的评价结果显示业绩不佳:最佳零点F1仅为0.57,思考推理链在人类(0.98 F1)之后只产生微小的收益(高达0.62 F1)。我们的分析进一步强调了改善多式联运模型中因果关系的具体方向。

Article 86

Title@2025-07-30 (3): Meta CLIP 2: A Worldwide Scaling Recipe

Title: Meta CLIP 2: A Worldwide Scaling Recipe

Meta CLIP 2: Ein weltweites Scaling-Rezept

Meta CLIP 2: 全球规模扩大食谱 2507.22062v2

Authors (16): Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu

Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting uses that range from zero-shot classification and retrieval to serving as an encoder for multimodal large language models (MLLMs). Although CLIP has been trained successfully on billion-scale image-text pairs from the English world, scaling CLIP’s training further to learn from worldwide web data is still challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP is worse than that of its English-only counterpart, i.e., the “curse of multilinguality” that is common in LLMs. Here, we present Meta CLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with the minimal changes necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets a new state of the art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2%, and XM3600 with 64.3% on image-to-text retrieval.

对比语言-图像预训练(CLIP)是一种流行的基础模型,支持从零样本分类、检索到多模态大语言模型(MLLM)编码器等多种用途。虽然CLIP已成功地在来自英语世界的十亿规模图像-文本对上完成训练,但将CLIP的训练进一步扩展到从全球网络数据中学习仍然具有挑战性:(1) 没有可用于处理非英语世界数据点的筛选方法;(2) 现有多语言CLIP的英语表现比其仅使用英语的对应模型差,即LLM中常见的“多语言诅咒”。我们在这里介绍Meta CLIP 2,这是第一个在全球网络规模的图像-文本对上从零开始训练CLIP的配方。为了使我们的发现具有普遍性,我们进行了严格的消融实验,只做应对上述挑战所必需的最小改动,并提出了一个使英语和非英语世界数据相互受益的配方。在零样本ImageNet分类中,Meta CLIP 2 ViT-H/14比其仅使用英语的对应模型高出0.8%,比mSigLIP高出0.7%,并且在没有系统级混杂因素(例如翻译、定制的架构更改)的情况下,在多语言基准上创下了新的最高水平,例如CVQA为57.4%,Babel-ImageNet为50.2%,XM3600的图像到文本检索为64.3%。

Article 87

Title@2025-07-30 (3): Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

Title: Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

Enthüllen der Fragilität von vertrauenswürdigen LLMs durch chinesische Text-Ambiguität

通过中文文本歧义揭示可信LLM的脆弱性 2507.23121v1

Authors (7): Xinwei Wu, Haojie Li, Hongyu Liu, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang

In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when encountering ambiguous narrative text, with a particular focus on Chinese textual ambiguity. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations. These annotated examples are systematically categorized into 3 main categories and 9 subcategories. Through experiments, we discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans. Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, show overconfidence in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting to understand the various possible meanings. Our findings highlight a fundamental limitation in current LLMs that has significant implications for their deployment in real-world applications where linguistic ambiguity is common, calling for improved approaches to handle uncertainty in language understanding. The dataset and code are publicly available at this GitHub repository: https://github.com/ictup/LLM-Chinese-Textual-Disambiguation.

在这项工作中,我们研究了关于大语言模型(LLMs)的可信赖性的一个关键研究问题:LLMs在遇到模棱两可的叙述性文字时如何表现,特别侧重于中文文本的模糊性;我们通过收集和产生带有上下文和相应的脱节配对的模糊性句子,并代表多种可能的解释,创建了基准数据集;这些附加说明的例子系统地分为三大类和9个子类;通过实验,我们发现LLMs在处理模棱两可时非常脆弱,暴露出与人类有重大差异的行为。具体地说,LLMs无法可靠地区分模棱两极的文字和毫不含糊的文字,在试图理解各种可能的含义时表现出对模棱两可的自信,表现出过度思考。我们的调查结果突出了目前LLMs的基本限制,这些限制对在实际应用中的语言模糊性具有重大影响,要求改进处理语言理解不确定性的方法。这个GitHub储存库可以公开查阅数据集和代码:https://github.com/ictup/LLLM-China-Dext-Dismbulguarguation)。

Article 88

Title@2025-07-30 (3): RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL

Title: RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL

RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL

RASL: 大规模数据库文本到 SQL 的检索增强的相连接表表 2507.23104v1

Authors (3): Jeffrey Eben, Aitzaz Ahmad, Stephen Lau

Despite advances in large language model (LLM)-based natural language interfaces for databases, scaling to enterprise-level data catalogs remains an under-explored challenge. Prior works addressing this challenge rely on domain-specific fine-tuning - complicating deployment - and fail to leverage important semantic context contained within database metadata. To address these limitations, we introduce a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, each separately indexed for targeted retrieval. Our approach prioritizes effective table identification while leveraging column-level information, ensuring the total number of retrieved tables remains within a manageable context budget. Experiments demonstrate that our method maintains high recall and accuracy, with our system outperforming baselines over massive databases with varying structure and available metadata. Our solution enables practical text-to-SQL systems deployable across diverse enterprise settings without specialized fine-tuning, addressing a critical scalability gap in natural language database interfaces.

尽管在数据库的大型语言模型(LLM)的自然语言界面方面取得了进展,但推广到企业一级数据目录仍然是一个未得到充分探讨的挑战。先前应对这一挑战的工作依赖于具体领域的微调(复杂的部署),未能利用数据库元数据中所包含的重要的语义背景。为解决这些局限性,我们引入了一个基于组成部分的检索结构,将数据库的图和元数据分解成独立的语义单位,每个单元单独编制索引,用于有针对性的检索。我们的方法优先考虑有效的表格识别,同时利用列级信息,确保检索的表格总数保持在可管理的背景预算之内。实验表明,我们的方法保持高回溯率和准确性,我们的系统运行基线超过结构不同和可用元数据庞大的数据库。我们的解决办法使实用的文本到SQL系统能够在没有专门微调的情况下在不同的企业环境中部署,从而解决自然语言数据库界面中关键的可扩展差距。

Article 89

Title@2025-07-30 (3): SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

Title: SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

SMART-Editor: Multi-Agenten-Framework für menschenähnliche Designbearbeitung mit struktureller Integrität

SMART-编辑:具有结构完整性的多机构设计设计框架 2507.23095v1

Authors (5): Ishani Mondal, Meera Bharadwaj, Ayush Roy, Aparna Garimella, Jordan Lee Boyd-Graber

We present SMART-Editor, a framework for compositional layout and content editing across structured (posters, websites) and unstructured (natural images) domains. Unlike prior models that perform local edits, SMART-Editor preserves global coherence through two strategies: Reward-Refine, an inference-time reward-guided refinement method, and RewardDPO, a training-time preference optimization approach using reward-aligned layout pairs. To evaluate model performance, we introduce SMARTEdit-Bench, a benchmark covering multi-domain, cascading edit scenarios. SMART-Editor outperforms strong baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward-Refine showing advantages on natural images. Automatic and human evaluations confirm the value of reward-guided planning in producing semantically consistent and visually aligned edits.

我们提出了SMART-编辑器,这是一个结构化(海报、网站)和无结构化(自然图像)领域间构件布局和内容编辑的框架。与以往进行本地编辑的模型不同,SMART-编辑器通过两个战略维护全球一致性:Reward-Refine(一种推论-时间奖励制导精炼方法)和RewardDPO(一种使用符合奖励的布局配对的培训-时间偏好优化方法)。为了评价模型性能,我们采用了SMARTEdit-Bench(一种涵盖多数据、层层化编辑情景的基准)。SMART-编辑器超越了教官Pix2Pix和HIVE等强有力的基线,奖励DPO在结构化环境中取得了高达15%的收益,而Reward-Refine则展示了自然图像的优势。自动和人文评价证实了以奖励为指南的规划在生成精度一致和视觉一致的编辑过程中的价值。

Article 90

Title@2025-07-30 (3): Context-aware Rotary Position Embedding

Title: Context-aware Rotary Position Embedding

Context-aware Rotary Position Einbetten

上下文感知的旋转位置嵌入 2507.23083v1

Authors (3): Ali Veisi, Delaram Fartoot, Hamidreza Amirzadeh

Positional encoding is a vital component of Transformer architectures, enabling models to incorporate sequence order into self-attention mechanisms. Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. However, RoPE relies on static, input-independent sinusoidal frequency patterns, limiting its ability to model context-sensitive relationships. In this work, we propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. This design introduces token- and context-sensitive positional representations while preserving RoPE efficiency and architectural simplicity. CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism across attention heads. We evaluate CARoPE on the FineWeb-Edu-10B dataset using GPT-2 variants trained on next-token prediction tasks. Experimental results show that CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths. Additionally, CARoPE enables faster training throughput without sacrificing model stability. These findings demonstrate that CARoPE offers a scalable, expressive, and efficient upgrade to existing positional encoding strategies in Transformer models.

定位编码是变换器结构的重要组成部分,使模型能够将序列顺序纳入自我注意机制。扶轮定位嵌入器(ROPE)因其与相对位置编码和计算效率的兼容性而成为广泛采用的解决办法。然而, RoPE依赖静态的、投入独立的正弦频率模式,限制了其模拟环境敏感关系的能力。在这项工作中,我们建议CARoPE(Context-Aware 旋转定位嵌入器)是RoPE(CARoPE)的新通用模型,它动态地生成以象征性嵌入为条件的头部频率模式。这个设计引入了对象征和背景敏感的定位表示方式,同时保持 RoPE 效率和建筑简化。 CARoPE 使用象征性嵌入的捆绑式转换并把它们纳入旋转机制。我们用GPT-2模型对FineWeb-Edu 10B数据集进行了评估。实验结果表明,CARPE(C)持续地超越了对标志嵌入器和其他通用的直观定位定位定位定位定位定位显示位置,使得CAR- 更稳定成为了更低的升级。
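
To make the contrast between static RoPE and an input-dependent variant concrete, here is a small Python sketch for a single two-dimensional feature pair. The abstract only says that CARoPE derives input-dependent phase shifts from a bounded transformation of token embeddings; the sigmoid bound, the running-sum accumulation, the scalar `token_scores`, and the function names below are assumptions chosen for illustration and should not be read as the paper's exact formulation.

```python
import math


def rotate_pair(x, y, angle):
    """Rotate the 2-D point (x, y) by `angle` radians, as rotary embeddings do per feature pair."""
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c


def static_angles(num_tokens, theta):
    """Standard RoPE: the angle at position t is t * theta, independent of the input."""
    return [t * theta for t in range(num_tokens)]


def context_aware_angles(token_scores, theta):
    """Assumed variant: each token advances the phase by a bounded step theta * sigmoid(score)."""
    angles, phase = [], 0.0
    for score in token_scores:
        angles.append(phase)
        phase += theta * (1.0 / (1.0 + math.exp(-score)))
    return angles


# Hypothetical per-token scalars for one attention head (e.g. a learned projection
# of each token embedding); the values are made up.
scores = [0.0, 2.0, -2.0, 0.5]
for angle in context_aware_angles(scores, theta=0.1):
    print(rotate_pair(1.0, 0.0, angle))
```

Because the sigmoid keeps every step strictly between 0 and `theta`, the phase still grows monotonically with position, while its rate now depends on the tokens seen so far.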

Article 91

Title@2025-07-30 (3): Exploring In-Context Learning for Frame-Semantic Parsing

Title: Exploring In-Context Learning for Frame-Semantic Parsing

In-Context-Lernen für rahmensemantisches Parsing erforschen

探索用于框架语义分析的内文学习 2507.23082v1

Authors (3): Diego Garat, Guillermo Moncecchi, Dina Wonsever

Frame Semantic Parsing (FSP) entails identifying predicates and labeling their arguments according to Frame Semantics. This paper investigates the use of In-Context Learning (ICL) with Large Language Models (LLMs) to perform FSP without model fine-tuning. We propose a method that automatically generates task-specific prompts for the Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) subtasks, relying solely on the FrameNet database. These prompts, constructed from frame definitions and annotated examples, are used to guide six different LLMs. Experiments are conducted on a subset of frames related to violent events. The method achieves competitive results, with F1 scores of 94.3% for FI and 77.4% for FSRL. The findings suggest that ICL offers a practical and effective alternative to traditional fine-tuning for domain-specific FSP tasks.

框架语义解析( FSP) 包含根据框架语义解析( FSP) 来识别上游和标签其参数。本文调查使用大语言模型( LLMs) 来实施 FSP 而不进行模型微调的方法。我们建议了一种方法,可以自动生成用于框架识别( FI) 和框架语义识别( FSRL) 子任务的具体任务提示, 仅依靠 FramtNet 数据库。这些根据框架定义和附加说明的例子构建的提示, 用于指导六种不同的LMs 。实验是在与暴力事件有关的一组框架上进行的。方法取得了竞争性结果, FI 的F1分数为94.3%, FSRL 的得分为77.4%。研究结果表明, ILL为域特定 FSP 任务的传统微调提供了实用和有效的替代方法。

Article 92

Title@2025-07-30 (3): Math Natural Language Inference: this should be easy!

Title: Math Natural Language Inference: this should be easy!

Math Natural Language Inferenz: das sollte einfach sein!

Math自然语言推论:这应该很容易! 2507.23063v1

Authors (7): Valeria de Paiva, Qiyue Gao, Hai Hu, Pavel Kovalev, Yikang Liu, Lawrence S. Moss, Zhiheng Qian

We ask whether contemporary LLMs are able to perform natural language inference (NLI) tasks on mathematical texts. We call this the Math NLI problem. We construct a corpus of Math NLI pairs whose premises are from extant mathematical text and whose hypotheses and gold labels were provided by people with experience in both research-level mathematics and the NLI field. We also investigate the quality of corpora using the same premises but whose hypotheses are provided by LLMs themselves. We investigate not only the performance but also the inter-group consistency of the diverse group of LLMs. We have both positive and negative findings. Among our positive findings: in some settings, using a majority vote of LLMs is approximately equivalent to using human-labeled data in the Math NLI area. On the negative side: LLMs still struggle with mathematical language. They occasionally fail at even basic inferences. Current models are not as prone to hypothesis-only “inference” in our data as the previous generation had been. In addition to our findings, we also provide our corpora as data to support future work on Math NLI.

我们问当代LLMs是否能够在数学文本上进行自然语言推断(NLI)任务。我们称之为数学NLI问题。我们建造了一组数学NLI配对,其前提来自现有数学文本,其假设和金标签由具有研究水平数学和NLI领域经验的人提供。我们还用同一前提调查公司的质量,但其假设由LLMs自己提供。我们不仅调查各种LLM团体的性能,而且研究其集团的一致性。我们既有正面的,也有负面的。我们的积极发现包括:在某些情况下,使用多数LLLLMs的选票相当于在数学NLI区域使用人类标签数据。消极的一面是:LLMS仍然与数学语言斗争,有时甚至没有基本的推论。目前的模式在我们的数据中不象上一代那样容易使用假设的“推论 ” 。除了我们的调查结果外,我们还提供我们的Coropora的数据,作为未来数学NLILI工作的支持数据。
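
The majority-vote finding above can be pictured with a short Python sketch that aggregates per-model NLI labels and measures agreement with human gold labels. The label strings, the tie handling (returning `None`), and the helper names are assumptions for this example; the abstract does not describe the authors' exact aggregation procedure.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label, or None when the top two labels tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def agreement(predicted, gold):
    """Fraction of items where the aggregated label matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical labels from three models on three premise-hypothesis pairs.
model_votes = [
    ["entailment", "entailment", "neutral"],
    ["contradiction", "neutral", "contradiction"],
    ["neutral", "entailment", "contradiction"],
]
gold = ["entailment", "contradiction", "neutral"]

voted = [majority_label(votes) for votes in model_votes]
print(voted, agreement(voted, gold))
```

Treating a tie as "no label" is one conservative choice; a real pipeline might instead break ties with a designated model or drop the item.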

Article 93

Title@2025-07-30 (3): Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion

Title: Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion

Conan: Ein Chunkwise Online-Netzwerk für Null-Shot Adaptive Voice Conversion

Conan:用于零样本自适应语音转换的分块在线网络 2507.14534v3

Authors (3): Yu Zhang, Baotong Tian, Zhiyao Duan

Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source while matching the voice timbre and styles of reference speech. Conan comprises three core components: 1) a Stream Content Extractor that leverages Emformer for low-latency streaming content encoding; 2) an Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation; 3) a Causal Shuffle Vocoder that implements a fully causal HiFiGAN using a pixel-shuffle mechanism. Experimental evaluations demonstrate that Conan outperforms baseline models in subjective and objective metrics. Audio samples can be found at https://aaronz345.github.io/ConanDemo.

零点在线语音转换(VC)为实时通信和娱乐带来了巨大的希望。然而,当前的 VC 模型在实时限制下努力维护语义真实性,提供自然声音转换,并有效地适应隐性扬声器特性。为了应对这些挑战,我们引入了Conan, 这是一种粗略的在线零点声音转换模式,它保存源的内容,同时匹配音调和参考演讲的风格。 Conan 由三个核心部分组成:1) 一种流体内容提取器,它利用Emexex对低纬度流流内容进行编码;2) 一种调制风格编码器,它从参考演讲中提取精细的发光的文理学特征,用于强化风格适应;3) 一种Causal Shuffle Vocoder,它使用像素-shuffle机制来实施完全因果的HIFIGAN。实验性评估表明, Conan 在主观和客观的计量标准中超越基线模型。音样样本见https://aronz345.github.io/ConanDemo。

Article 94

Title@2025-07-30 (3): Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review

Title: Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review

Prompt-Engineering-Techniken zur Minderung kultureller Voreingenommenheit gegenüber Arabern und Muslimen in großen Sprachmodellen: Eine systematische Übersicht

减轻大语言模式中针对阿拉伯人和穆斯林的文化偏见的迅速工程技术:系统审查 2506.18199v2

Authors (3): Bushra Asseri, Estabrag Abdelaziz, Areej Al-Wabil

Large language models have demonstrated remarkable capabilities across various domains, yet concerns about cultural bias - particularly towards Arabs and Muslims - pose significant ethical challenges by perpetuating harmful stereotypes and marginalization. Despite growing recognition of bias in LLMs, prompt engineering strategies specifically addressing Arab and Muslim representation remain understudied. This mixed-methods systematic review examines such techniques, offering evidence-based guidance for researchers and practitioners. Following PRISMA guidelines and Kitchenham’s systematic review methodology, we analyzed 8 empirical studies published between 2021 and 2024 investigating bias mitigation strategies. Our findings reveal five primary prompt engineering approaches: cultural prompting, affective priming, self-debiasing techniques, structured multi-step pipelines, and parameter-optimized continuous prompts. Although all approaches show potential for reducing bias, effectiveness varied substantially across studies and bias types. Evidence suggests that certain bias types may be more resistant to prompt-based mitigation than others. Structured multi-step pipelines demonstrated the highest overall effectiveness, achieving up to 87.7% reduction in bias, though they require greater technical expertise. Cultural prompting offers broader accessibility with substantial effectiveness. These results underscore the accessibility of prompt engineering for mitigating cultural bias without requiring access to model parameters. The limited number of studies identified highlights a significant research gap in this critical area. Future research should focus on developing culturally adaptive prompting techniques, creating Arab and Muslim-specific evaluation resources, and integrating prompt engineering with complementary debiasing methods to address deeper stereotypes while maintaining model utility.

大型语言模式在各个领域都表现出了非凡的能力,然而,对文化偏见的关切,特别是对阿拉伯人和穆斯林的偏见,通过延续有害的陈规定型观念和边缘化,提出了重大的道德挑战。尽管人们日益认识到LLMS中的偏见,但针对阿拉伯和穆斯林代表性的迅速工程战略仍然没有得到充分研究。这种混合系统审查审查了这些技术,为研究人员和从业人员提供了循证指导。根据PRISMA准则和基切纳姆的系统审查方法,我们分析了2021-2024年调查减少偏见战略期间发表的8项经验研究。我们的调查结果揭示了5种主要的迅速工程方法:文化促进、影响性边缘、自我贬低技术、结构化多步管道和参数优化持续推动。尽管所有方法都显示有可能减少偏见,但各种研究和偏见类型之间有很大差异。根据PRISMA准则和基切纳姆的系统审查方法,我们分析了8项经验,显示了最高的总体效果,减少了87.7%的偏见,尽管它们需要更多的技术专长。文化促进更加广泛的可获取性。这些结果突出表明了在减少文化偏见研究领域获得及时的工程选择,同时发展重大的结构调整研究领域,应该确定一个快速的弹性研究领域。

Article 95

Title@2025-07-30 (3): Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning

Title: Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning

Wo man Demos in Deinem Prompt zeigt: Ein positionelles Bias des In-Context-Lernens

在提示中何处放置演示:上下文学习的位置偏差 2507.22887v1

Authors (2): Kwesi Cobbina, Tianyi Zhou

In-context learning (ICL) is a critical emerging capability of large language models (LLMs), enabling few-shot learning during inference by including a few demonstrations (demos) in the prompt. However, it has been found that ICL’s performance can be sensitive to the choices of demos and their order. This paper investigates a previously unexplored positional bias of ICL: we observe that the predictions and accuracy can drift drastically when the positions of demos, the system prompt, and the user message in LLM input are varied. We refer to this bias as DEMOS’ POSITION IN PROMPT (DPP) bias. We design a systematic evaluation pipeline to study this type of positional bias across classification, question answering, summarization, and reasoning tasks. We introduce two metrics, ACCURACY-CHANGE and PREDICTION-CHANGE, to quantify net gains and output volatility induced by changes in the demos’ position. Extensive experiments on ten LLMs from four open-source model families (QWEN, LLAMA3, MISTRAL, COHERE) verify that the bias significantly affects their accuracy and predictions: placing demos at the start of the prompt yields the most stable and accurate outputs with gains of up to +6 points. In contrast, placing demos at the end of the user message flips over 30% of predictions without improving correctness on QA tasks. Smaller models are most affected by this sensitivity, though even large models remain marginally affected on complex tasks.

内文学习(ICL)是大型语言模型(LLMs)的新兴关键能力,在推论期间,通过将一些演示(演示)包含在内,可以进行一些微小的学习。然而,我们发现,ICL的表现可以敏感地考虑演示的选择及其顺序。本文首次调查了ICL的未探索的新定位偏差:我们观察到,当演示、系统快速和LLM投入中的用户信息变小时,预测和准确性就会急剧移动。我们称这种偏差为DEMOS在PROMPT(DPP)中的定位。我们设计了一个系统化的评价管道,研究这种在分类、问题回答、概括和推理方面的立场偏差。我们引入了两个指标,ACCURACY-Change和PREniction-Change,以量化由于降级立场的变化而导致的净收益和产出波动性。我们观察到,四个开源模型家庭的10个LMSMs(QEN、LA3、MIA3、MIA最准确性)中的定位偏差。我们设计了一种系统评价管道对用户的准确性、CERAL6和最精确的预测结果进行核查。
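
The two metrics named above can be sketched in a few lines of Python. The abstract describes ACCURACY-CHANGE as capturing net gains and PREDICTION-CHANGE as capturing output volatility but gives no formulas, so the definitions below (difference in accuracy against a baseline placement, and fraction of items whose prediction flips) are assumptions, as are the function names and the toy data.

```python
def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def accuracy_change(base_preds, new_preds, gold):
    """Assumed ACCURACY-CHANGE: accuracy under the new demo position minus the baseline."""
    return accuracy(new_preds, gold) - accuracy(base_preds, gold)


def prediction_change(base_preds, new_preds):
    """Assumed PREDICTION-CHANGE: share of items whose prediction differs from the baseline."""
    return sum(b != n for b, n in zip(base_preds, new_preds)) / len(base_preds)


# Hypothetical predictions for the same four questions under two demo placements.
gold = ["a", "b", "a", "b"]
demos_at_start = ["a", "b", "b", "b"]
demos_at_end = ["a", "a", "a", "b"]

print(accuracy_change(demos_at_start, demos_at_end, gold))
print(prediction_change(demos_at_start, demos_at_end))
```

The toy data is built so that half of the predictions flip while accuracy stays the same, which is the kind of volatility without net gain that the abstract reports for demos placed at the end of the user message.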

Article 96

Title@2025-07-30 (3): C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Title: C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

C3: Ein zweisprachiger Benchmark für gesprochene Dialogmodelle zur Erforschung von Herausforderungen in komplexen Gesprächen

C3:探讨复杂对话挑战的口头对话模式的双语基准 2507.22968v1

Authors (3): Chengqian Ma, Wei Tao, Yiwen Guo

Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users’ spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.

最近,口语对话模式(SDMs)因其直接对用户的询问作出语音反应的能力而引起极大关注。尽管这种模式越来越受欢迎,但在全面了解其在理解和模拟人类对话方面的实际效力的研究方面存在着差距。这与基于文本的大语言模式(LLMs)相比尤其如此,后者得益于广泛的基准;由于口语对话的独特性,人类的语音互动本身就比文字更为复杂。模糊性构成一个挑战,来自多种语言等语义因素,以及声学方面,如地形学、异名学和压力模式。此外,背景依赖性,如疏漏、共同参照和多方向互动,增加了人类对话动态的复杂性。为了阐明目前SDM发展的状况并应对这些挑战,我们在本文件中提供了一套基准数据集,其中包括1 079例英语和中文的病例。这一数据集与基于LM的评估方法密切相关,有助于全面探索SDMs在应对这些实际挑战方面的表现。

Article 97

Title@2025-07-30 (3): GeoOutageKG: A Multimodal Geospatiotemporal Knowledge Graph for Multiresolution Power Outage Analysis

Title: GeoOutageKG: A Multimodal Geospatiotemporal Knowledge Graph for Multiresolution Power Outage Analysis

GeoOutageKG: Ein multimodaler Geospatiotemporaler Wissensgraph für die Multiauflösungsanalyse von Stromausfällen

GeoOutageKG:用于多分辨率停电分析的多模态地理时空知识图谱 2507.22878v1

Authors (4): Ethan Frakes, Yinghui Wu, Roger H. French, Mengjie Li

Detecting, analyzing, and predicting power outages is crucial for grid risk assessment and disaster mitigation. Numerous outages occur each year, exacerbated by extreme weather events such as hurricanes. Existing outage data are typically reported at the county level, limiting their spatial resolution and making it difficult to capture localized patterns. However, they offer excellent temporal granularity. In contrast, nighttime light satellite image data provides significantly higher spatial resolution and enables a more comprehensive spatial depiction of outages, enhancing the accuracy of assessing the geographic extent and severity of power loss after disaster events. However, these satellite data are only available on a daily basis. Integrating spatiotemporal visual and time-series data sources into a unified knowledge representation can substantially improve power outage detection, analysis, and predictive reasoning. In this paper, we propose GeoOutageKG, a multimodal knowledge graph that integrates diverse data sources, including nighttime light satellite image data, high-resolution spatiotemporal power outage maps, and county-level time-series outage reports in the U.S. We describe our method for constructing GeoOutageKG by aligning source data with a developed ontology, GeoOutageOnto. Currently, GeoOutageKG includes over 10.6 million individual outage records spanning from 2014 to 2024, 300,000 NTL images spanning from 2012 to 2024, and 15,000 outage maps. GeoOutageKG is a novel, modular and reusable semantic resource that enables robust multimodal data integration. We demonstrate its use through multiresolution analysis of geospatiotemporal power outages.

检测、分析和预测停电对于电网风险评估和减灾至关重要。每年都会发生大量停电,飓风等极端天气事件加剧了这种情况。现有的停电数据通常在县一级报告,限制了其空间分辨率,难以捕捉局部模式,但具有极佳的时间粒度。相比之下,夜光卫星图像数据提供了显著更高的空间分辨率,能够对停电情况进行更全面的空间描述,提高了评估灾害事件后停电地理范围和严重程度的准确性。然而,这些卫星数据只能按日提供。将时空视觉数据和时间序列数据源整合到统一的知识表示中,可以大大改进停电的检测、分析和预测性推理。在本文中,我们提出了GeoOutageKG,这是一个多模态知识图谱,整合了多种数据源,包括夜光卫星图像数据、高分辨率时空停电地图以及美国县级时间序列停电报告。我们描述了通过将源数据与所开发的本体GeoOutageOnto对齐来构建GeoOutageKG的方法。目前,GeoOutageKG包含2014年至2024年的1060多万条停电记录、2012年至2024年的30万幅夜光(NTL)图像以及1.5万幅停电地图。GeoOutageKG是一种新颖、模块化且可复用的语义资源,能够实现稳健的多模态数据整合。我们通过对地理时空停电的多分辨率分析展示了其用途。

Article 98

Title@2025-07-30 (3): FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models

Title: FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models

FRED: Finanzielle Retrieval-erweiterte Erkennung und Bearbeitung von Halluzinationen in Sprachmodellen

FRED: 财务检索-加强发现和编辑语言模型中的幻觉 2507.20930v2

Authors (3): Likun Tan, Kuan-Wei Huang, Kevin Wu

Hallucinations in large language models pose a critical challenge for applications requiring factual reliability, particularly in high-stakes domains such as finance. This work presents an effective approach for detecting and editing factually incorrect content in model-generated responses based on the provided context. Given a user-defined domain-specific error taxonomy, we construct a synthetic dataset by inserting tagged errors into financial question-answering corpora and then fine-tune four language models, Phi-4, Phi-4-mini, Qwen3-4B, and Qwen3-14B, to detect and edit these factual inaccuracies. Our best-performing model, fine-tuned Phi-4, achieves an 8% improvement in binary F1 score and a 30% gain in overall detection performance compared to OpenAI-o3. Notably, our fine-tuned Phi-4-mini model, despite having only 4 billion parameters, maintains competitive performance with just a 2% drop in binary detection and a 0.1% decline in overall detection compared to OpenAI-o3. Our work provides a practical solution for detecting and editing factual inconsistencies in financial text generation while introducing a generalizable framework that can enhance the trustworthiness and alignment of large language models across diverse applications beyond finance. Our code and data are available at https://github.com/pegasi-ai/shield.

大型语言模型的幻觉给需要事实可靠性的应用,特别是金融等高风险领域,带来了严峻的挑战。这项工作提出了一种有效方法,根据所提供的上下文检测并编辑模型生成回复中与事实不符的内容。给定用户定义的特定领域错误分类体系,我们通过将带标签的错误插入金融问答语料库来构建合成数据集,然后微调四种语言模型(Phi-4、Phi-4-mini、Qwen3-4B和Qwen3-14B),以检测和编辑这些事实性错误。我们表现最好的模型(经微调的Phi-4)与OpenAI-o3相比,二元F1分数提高了8%,总体检测性能提高了30%。值得注意的是,我们经过微调的Phi-4-mini模型尽管只有40亿个参数,仍保持了有竞争力的性能,与OpenAI-o3相比,二元检测仅下降2%,总体检测仅下降0.1%。我们的工作为检测和编辑金融文本生成中的事实不一致提供了实用解决方案,同时引入了一个可推广的框架,能够在金融以外的各种应用中增强大型语言模型的可信度和对齐性。我们的代码和数据可在 https://github.com/pegasi-ai/shield 获取。

Article 99

Title@2025-07-30 (3): Past Meets Present: Creating Historical Analogy with Large Language Models

Title: Past Meets Present: Creating Historical Analogy with Large Language Models

Vergangenheit trifft Gegenwart: Historische Analogie mit großen Sprachmodellen erstellen

过去与现在相遇:利用大型语言模型创建历史类比 2409.14820v2

Authors (8): Nianqi Li, Siyu Yuan, Jiangjie Chen, Jiaqing Liang, Feng Wei, Zujie Liang, Deqing Yang, Yanghua Xiao

Historical analogies, which compare known past events with contemporary but unfamiliar events, are important abilities that help people make decisions and understand the world. However, research in applied history suggests that people have difficulty finding appropriate analogies. And previous studies in the AI community have also overlooked historical analogies. To fill this gap, in this paper, we focus on the historical analogy acquisition task, which aims to acquire analogous historical events for a given event. We explore retrieval and generation methods for acquiring historical analogies based on different large language models (LLMs). Furthermore, we propose a self-reflection method to mitigate hallucinations and stereotypes when LLMs generate historical analogies. Through human evaluations and our specially designed automatic multi-dimensional assessment, we find that LLMs generally have a good potential for historical analogies. And the performance of the models can be further improved by using our self-reflection method.

历史类比将已知的过去事件与当代但不为人所知的事件进行比较,是帮助人们作出决定和理解世界的重要能力。然而,应用史研究表明,人们很难找到适当的类比。以前在AI社区进行的研究也忽略了历史类比。为了填补这一空白,我们在本文中着重研究历史类比获取任务,目的是为某个特定事件获取类似的历史事件。我们探索根据不同的大语言模型(LLLMs)获取历史类比的检索和生成方法。此外,我们提出一种自我反省方法,以在LLMs产生历史类比时减少幻觉和陈规定型观念。通过人类评估和我们专门设计的自动多维评估,我们发现LLMs通常具有良好的历史类比潜力。通过使用我们的自我反省方法,模型的性能可以进一步改进。

Article 100

Title@2025-07-30 (3): The Incomplete Bridge: How AI Research (Mis)Engages with Psychology

Title: The Incomplete Bridge: How AI Research (Mis)Engages with Psychology

Die unvollendete Brücke: Wie KI-Forschung (Mis) mit Psychologie verstrickt

不完整的桥梁:人工智能研究如何(错误地)借鉴心理学 2507.22847v1

Authors (5): Han Jiang, Pengda Wang, Xiaoyuan Yi, Xing Xie, Ziang Xiao

Social sciences have accumulated a rich body of theories and methodologies for investigating the human mind and behaviors, while offering valuable insights into the design and understanding of Artificial Intelligence (AI) systems. Focusing on psychology as a prominent case, this study explores the interdisciplinary synergy between AI and the field by analyzing 1,006 LLM-related papers published in premier AI venues between 2023 and 2025, along with the 2,544 psychology publications they cite. Through our analysis, we identify key patterns of interdisciplinary integration, locate the psychology domains most frequently referenced, and highlight areas that remain underexplored. We further examine how psychology theories/frameworks are operationalized and interpreted, identify common types of misapplication, and offer guidance for more effective incorporation. Our work provides a comprehensive map of interdisciplinary engagement between AI and psychology, thereby facilitating deeper collaboration and advancing AI systems.

社会科学积累了丰富的理论和方法,用于调查人类的思想和行为,同时对人造情报系统的设计与理解提供了宝贵的见解。本研究以心理学为突出案例,通过分析2023年至2025年在首屈一指的AI网站上发表的1 006份与LLM有关的论文以及它们所引用的2 544份心理学出版物,探索AI与该领域之间的跨学科协同作用。我们通过分析,确定了跨学科融合的关键模式,确定了最经常被引用的心理学领域,并突出了尚未探讨的领域。我们进一步审视了心理学理论/框架是如何运作和解释的,找出了应用不当的常见类型,并为更有效的纳入提供了指导。我们的工作为AI与心理学之间的跨学科接触提供了全面的地图,从而促进了更深入的合作和推进了AI系统。

Article 101

Title@2025-07-30 (3): ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer

Title: ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer

ReverBERT: Ein State Space Model für eine effiziente textgesteuerte Sprachübertragung

ReverBERT: 高效发短信语音风格转让国家空间模型 2503.20992v2

Authors (3): Michael Brown, Sofia Martinez, Priya Singh

Text-driven speech style transfer aims to mold the intonation, pace, and timbre of a spoken utterance to match stylistic cues from text descriptions. While existing methods leverage large-scale neural architectures or pre-trained language models, the computational costs often remain high. In this paper, we present ReverBERT, an efficient framework for text-driven speech style transfer that draws inspiration from a state space model (SSM) paradigm, loosely motivated by the image-based method of Wang and Liu \cite{wang2024stylemamba}. Unlike image domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features to enable smooth and continuous style modulation. We also propose a novel Transformer-based SSM layer for bridging textual style descriptors with acoustic attributes, dramatically reducing inference time while preserving high-quality speech characteristics. Extensive experiments on benchmark speech corpora demonstrate that ReverBERT significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency. We release our model and code publicly to foster further research in text-driven speech style transfer.

文本驱动的语音风格迁移旨在调整口语表达的语调、节奏和音色,以匹配文本描述中的风格提示。虽然现有方法能够利用大型神经结构或预训练语言模型,但计算成本通常仍然很高。在本文中,我们提出ReverBERT,这是一个高效的文本驱动语音风格迁移框架,其灵感来自状态空间模型(SSM)范式,并大致受到Wang和Liu \cite{wang2024stylemamba}基于图像的方法的启发。与图像域技术不同,我们的方法在语音空间中运作,并整合了潜在语音特征的离散傅里叶变换,以实现平滑、连续的风格调制。我们还提出一个新的基于Transformer的SSM层,用于连接文本风格描述符与声学属性,在保持高质量语音特征的同时大幅缩短推理时间。在基准语音语料库上的大量实验表明,ReverBERT在自然度、表现力和计算效率方面显著优于基线。我们公开发布模型和代码,以促进文本驱动语音风格迁移的进一步研究。

Article 102

Title: Cross-Modal State-Space Graph Reasoning for Structured Summarization

Grenzüberschreitende State-Space-Graph-Gründung für strukturierte Zusammenfassung

结构归纳的跨模式国家空间图 2503.20988v2

Authors (3): Hannah Kim, Sofia Martinez, Jason Lee

The ability to extract compact, meaningful summaries from large-scale and multimodal data is critical for numerous applications, ranging from video analytics to medical reports. Prior methods in cross-modal summarization have often suffered from high computational overheads and limited interpretability. In this paper, we propose a Cross-Modal State-Space Graph Reasoning (CSS-GR) framework that incorporates a state-space model with graph-based message passing, inspired by prior work on efficient state-space models. Unlike existing approaches relying on purely sequential models, our method constructs a graph that captures inter- and intra-modal relationships, allowing more holistic reasoning over both textual and visual streams. We demonstrate that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency, as validated on standard multimodal summarization benchmarks. We also provide a thorough ablation study to highlight the contributions of each component.

从大规模多模态数据中提取紧凑、有意义的摘要的能力,对于从视频分析到医学报告等许多应用都至关重要。以往的跨模态摘要方法往往受到高计算开销和可解释性有限的影响。在本文中,我们提出了跨模态状态空间图推理(Cross-Modal State-Space Graph Reasoning,CSS-GR)框架,该框架受先前高效状态空间模型工作的启发,将状态空间模型与基于图的消息传递相结合。与依靠纯顺序模型的现有方法不同,我们的方法构建了一张捕捉模态间和模态内关系的图,允许对文本流和视觉流进行更全面的推理。我们表明,我们的方法在保持计算效率的同时,显著提高了摘要质量和可解释性,并在标准多模态摘要基准上得到了验证。我们还提供了全面的消融研究,以突出每个组成部分的贡献。

Article 103

Title@2025-07-30 (3): Scaling RL to Long Videos

Title: Scaling RL to Long Videos

Skalierung von RL zu langen Videos

缩放 RL 到长视频 2507.07966v3

Authors (14): Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports processing up to 8,192 video frames per video, and configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames).

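我们提出了一个全栈框架,利用强化学习将视觉语言模型(VLM)的推理能力扩展到长视频。我们通过整合三个关键组成部分来应对长视频推理的独特挑战:(1)大规模数据集LongVideo-Reason,包含104K个长视频问答对,并配有涵盖体育、游戏和vlog等不同领域的高质量推理标注;(2)两阶段训练流程,通过思维链监督微调(CoT-SFT)和强化学习(RL)扩展VLM;(3)面向长视频RL的训练基础设施,名为多模态强化序列并行(MR-SP),它结合了序列并行和为长视频定制的基于vLLM的引擎,并使用缓存的视频嵌入实现高效的rollout和预填充。在我们的实验中,LongVILA-R1-7B在视频基准上取得了强劲表现,在VideoMME上无字幕和有字幕的准确率分别达到65.1%和71.1%,并在多个基准上持续优于LongVILA-7B。此外,LongVILA-R1-7B支持每个视频处理多达8,192帧,并支持可配置的FPS设置。值得注意的是,我们的MR-SP系统在长视频RL训练上实现了最高2.1倍的加速。此外,我们公开发布了训练系统,支持多种模态(视频、文本和音频)、多种模型(VILA和Qwen系列)乃至图像和视频生成模型的RL训练。在单个A100节点(8个GPU)上,它支持对时长一小时的视频(例如3,600帧)进行RL训练。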

Article 104

Title@2025-07-30 (3): MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models

Title: MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models

MiniLongBench: Der kostengünstige Long Context Benchmark für große Sprachmodelle verstehen

MiniLongBench:面向大型语言模型的低成本长上下文理解基准 2505.19959v2

Authors (5): Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin

Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often result in prohibitively high evaluation costs, like testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which makes evaluation inefficient. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. Through empirical analysis of over 60 LLMs, MiniLongBench reduces the average evaluation cost to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, our MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs. See https://github.com/MilkThink-Lab/MiniLongBench for our code, data and tutorial.

长上下文理解(LCU)是当前大型语言模型(LLM)中一个重要的探索领域。然而,由于长文本数据固有的冗长性质,现有的LLM的LCU基准往往导致高得令人望而却步的评价费用,如测试时间和推理费用等。通过广泛的实验,我们发现现有的LCU基准存在显著冗余,这意味着评价效率低下。在本文中,我们提出了一种针对信息稀疏的长文本数据的简明数据压缩方法。通过对众所周知的LCU基准LongBench进行剪枝,我们创建了MiniLongBench。这个基准只包括六个主要任务类别和21项不同任务中的237个测试样本。通过对60多个LLM的实证分析,MiniLongBench将平均评价费用降至原来的4.5%,同时与LongBench结果保持0.97的平均秩相关系数。因此,我们的MiniLongBench作为低成本基准,极有可能大大推动今后对LLM的LCU能力的研究。代码、数据和教程见 https://github.com/MilkThink-Lab/MiniLongBench。
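
The rank-correlation figure in the MiniLongBench abstract can be illustrated with a small sketch. It assumes the coefficient is Spearman's rho (the abstract does not name the coefficient), and the model scores are made-up placeholder numbers, not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five models on the full and the pruned benchmark.
full_scores = [41.2, 35.0, 52.3, 47.8, 29.9]
mini_scores = [40.1, 36.5, 50.9, 48.2, 31.0]
rho = spearman(full_scores, mini_scores)
```

A value of rho close to 1 means the pruned benchmark orders the models almost exactly as the full one does, which is the property a compressed benchmark needs in order to stand in for the original.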

Article 105

Title@2025-07-30 (3): Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization

Title: Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization

Jenseits natürlicher Sprachpläne: Struktur-Bewusst-Planung für Abfrage-fokussierte Tabellenzusammenfassung

超越自然语言计划: 查询用户使用表的结构-软件规划 2507.22829v1

Authors (3): Weijia Zhang, Songgaojun Deng, Evangelos Kanoulas

Query-focused table summarization requires complex reasoning, often approached through step-by-step natural language (NL) plans. However, NL plans are inherently ambiguous and lack structure, limiting their conversion into executable programs like SQL and hindering scalability, especially for multi-table tasks. To address this, we propose a paradigm shift to structured representations. We introduce a new structured plan, TaSoF, inspired by formalism in traditional multi-agent systems, and a framework, SPaGe, that formalizes the reasoning process in three phases: 1) Structured Planning to generate TaSoF from a query, 2) Graph-based Execution to convert plan steps into SQL and model dependencies via a directed cyclic graph for parallel execution, and 3) Summary Generation to produce query-focused summaries. Our method explicitly captures complex dependencies and improves reliability. Experiments on three public benchmarks show that SPaGe consistently outperforms prior models in both single- and multi-table settings, demonstrating the advantages of structured representations for robust and scalable summarization.

以查询为中心的表格摘要需要复杂的推理,通常通过逐步的自然语言(NL)计划来实现。然而,NL计划本质上具有歧义且缺乏结构,限制了其向SQL等可执行程序的转换,并妨碍了可扩展性,尤其是在多表任务中。为了解决这个问题,我们提议向结构化表示进行范式转换。我们引入了一种新的结构化计划TaSoF,其灵感来自传统多智能体系统中的形式化方法,以及一个框架SPaGe,它将推理过程形式化为三个阶段:1)结构化规划,从查询生成TaSoF;2)基于图的执行,将计划步骤转换为SQL,并通过有向循环图对依赖关系建模以便并行执行;3)摘要生成,产生以查询为中心的摘要。我们的方法明确捕捉了复杂的依赖关系,提高了可靠性。在三个公共基准上的实验表明,SPaGe在单表和多表环境中都一贯优于先前的模型,显示了结构化表示对于稳健、可扩展的摘要的优势。
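
The graph-based execution phase of SPaGe can be pictured with a minimal scheduling sketch. It is not the authors' implementation: the step names are hypothetical, the TaSoF format is not shown in the abstract, and the sketch assumes the step dependencies contain no cycles, which a level-by-level schedule requires.

```python
from typing import Dict, List, Set


def parallel_levels(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into levels; every step in a level depends only on
    steps from earlier levels, so a level can be executed in parallel."""
    remaining = {step: set(d) for step, d in deps.items()}
    levels: List[List[str]] = []
    done: Set[str] = set()
    while remaining:
        # A step is ready once all of its prerequisites have completed.
        ready = sorted(s for s, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        levels.append(ready)
        done.update(ready)
        for s in ready:
            del remaining[s]
    return levels


# Hypothetical plan: two independent table filters feed a join, then an aggregate.
plan = {
    "filter_orders": set(),
    "filter_customers": set(),
    "join": {"filter_orders", "filter_customers"},
    "aggregate": {"join"},
}
levels = parallel_levels(plan)
```

In this toy plan the two filters land in the same level and could be issued as concurrent SQL queries, while the join and the aggregate must wait for their inputs.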

Article 106

Title@2025-07-30 (3): SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs

Title: SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs

RaumViz-Bench: Automatisch generierte räumliche Visualisierungs-Aufgaben für MLLMs

空间Viz-Bench:MLLLMs自动生成的空间可视化推理任务 2507.07610v3

Authors (8): Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, Jun Wang

Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark’s strong discriminative power, but also uncovers counter-intuitive findings: models show difficulty perception misaligned with human intuition, exhibit dramatic 2D-to-3D performance cliffs, default to formulaic derivation over visualization, and paradoxically suffer performance degradation from Chain-of-Thought prompting in open-source models. Through statistical and qualitative analysis of error types, SpatialViz-Bench demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark data and evaluation code are publicly available.

人类可以直接想象和操控其头脑中的视觉图像,这种能力被称为空间可视化。多式大型语言模型(MLLM)支持基于想象的推理,而空间可视化却仍然没有得到充分评价,这通常体现在更广泛的数学和逻辑评估中。现有的评价往往依靠IQ测试或数学竞赛,这些测试或数学竞赛可能与培训数据重叠,损害评估可靠性。为此,我们引入了空间Viz-Bench,这是一个综合的多式空间可视化多式基准,有四个次功能的12项任务,包括1,180个自动产生的问题。我们对33个最先进的MLLLMS的评估不仅揭示了广泛的性能差异,并展示了基准的强烈的歧视性力量,而且还揭示了反直觉的发现:模型显示与人的直觉不相匹配的难感,展示了2D到3D的性性悬崖,默认了对视觉化的公式衍生法,而且矛盾的是,在开放源模型中引发的连锁操作性下降。通过对错误类型进行统计和定性分析,SpaceViz-Ben 演示了目前现有的空间数据,从而继续展示实地数据缺陷。

Article 107

Title@2025-07-30 (3): DBLPLink 2.0 – An Entity Linker for the DBLP Scholarly Knowledge Graph

Title: DBLPLink 2.0 – An Entity Linker for the DBLP Scholarly Knowledge Graph

DBLPLink 2.0 – Ein Entity Linker für den DBLP-Wissenschaftsgraphen

DBLPLink 2.0 - DBLPLP 学术知识图的实体链接 2507.22811v1

Authors (3): Debayan Banerjee, Tilahun Abedissa Taffa, Ricardo Usbeck

In this work we present an entity linker for the 2025 version of DBLP’s RDF-based Knowledge Graph. Compared to the 2022 version, DBLP now considers publication venues as a new entity type called dblp:Stream. In the earlier version of DBLPLink, we trained KG-embeddings and re-rankers on a dataset to produce entity linkings. In contrast, in this work, we develop a zero-shot entity linker with LLMs, using a novel method in which we re-rank candidate entities based on the log-probabilities of the “yes” token output at the penultimate layer of the LLM.

在这项工作中,我们为DBLP2025年版的RDF知识图提供了一个实体链接器。与2022年版本相比,DBLP现在将出版地点视为一个新的实体类型,名为 dblp:Stream。在先前版本的DBLPLink中,我们培训了KG编组和重新排序人员使用数据集来产生实体链接。与此形成对照的是,在这项工作中,我们使用一种新颖的方法开发了一个零光实体链接器,我们根据LLM倒数第二层的“是”象征性输出的日志概率重新排序候选实体。
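
The re-ranking idea in the DBLPLink 2.0 abstract can be sketched as follows. This is only an illustration under stated assumptions: the candidate names and logit values are invented, the two-token vocabulary is a simplification, and the sketch takes logits as given rather than modelling where in the network they are read.

```python
import math
from typing import Dict, List, Tuple


def log_softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable log-softmax over a token -> logit mapping."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def rerank_by_yes(candidates: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Sort candidate entities by log P("yes"), highest first."""
    scored = [(entity, log_softmax(logits)["yes"])
              for entity, logits in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical logits an LLM might assign when asked whether a mention
# refers to each candidate entity (answer: yes or no).
candidates = {
    "entity_a": {"yes": 2.0, "no": 0.5},
    "entity_b": {"yes": 0.1, "no": 1.9},
    "entity_c": {"yes": 1.0, "no": 1.0},
}
ranking = rerank_by_yes(candidates)
```

Because the score is read directly from the model's output distribution, no task-specific training is needed, which is what makes the approach zero-shot.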

Article 108

Title@2025-07-30 (3): IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation

Title: IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation

IterKey: Iterative Keyword Generation mit LLMs für verbesserte retrieval Augmented Generation

IterKey:利用LLM进行迭代关键词生成以增强检索增强生成 2505.08450v2

Authors (4): Kazuki Hayashi, Hidetaka Kamigaito, Shinya Kouda, Taro Watanabe

Retrieval-Augmented Generation (RAG) has emerged as a way to complement the in-context knowledge of Large Language Models (LLMs) by integrating external documents. However, real-world applications demand not only accuracy but also interpretability. While dense retrieval methods provide high accuracy, they lack interpretability; conversely, sparse retrieval methods offer transparency but often fail to capture the full intent of queries due to their reliance on keyword matching. To address these issues, we introduce IterKey, an LLM-driven iterative keyword generation framework that enhances RAG via sparse retrieval. IterKey consists of three LLM-driven stages: generating keywords for retrieval, generating answers based on retrieved documents, and validating the answers. If validation fails, the process iteratively repeats with refined keywords. Across four QA tasks, experimental results show that IterKey achieves 5% to 20% accuracy improvements over BM25-based RAG and simple baselines. Its performance is comparable to dense retrieval-based RAG and prior iterative query refinement methods using dense models. In summary, IterKey is a novel BM25-based approach leveraging LLMs to iteratively refine RAG, effectively balancing accuracy with interpretability.

为解决这些问题,我们引入了IterKey,这是一个由LLM驱动的迭代关键词生成框架,通过吸收外部文件来补充大语言模型(LLMS)的内置知识。然而,现实世界应用程序不仅要求准确性,而且要求可解释性。虽然密集的检索方法提供高精度,但缺乏解释性;反过来,分散的检索方法提供了透明度,但往往由于依赖关键词匹配而不能反映查询的全部意图。为解决这些问题,我们引入了IterKey,这是一个LLM驱动的由LLM驱动的迭代关键词生成框架,通过稀薄的检索增强RAG。 IterKey由三个LLM驱动的阶段组成:为检索生成关键词,根据检索文件生成答案,并验证答案。如果验证失败,这一过程会与精细的关键字重复。在四种QA任务中,实验结果显示IterKey比基于B25的RAG和简单基线实现了5%至20%的准确性改进。其性与使用密度模型的密集检索-RAG和先前的重复性查询改进方法相当。概括,IterKey是一种新型的精准性调整方法。
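
The three-stage loop described in the IterKey abstract can be sketched as below. This is a minimal illustration, not the authors' code: the three callables stand in for LLM calls and are hypothetical, and the toy keyword-overlap retriever is a stand-in for BM25.

```python
from typing import Callable, List, Optional


def keyword_retrieve(keywords: List[str], docs: List[str], k: int = 2) -> List[str]:
    """Toy sparse retriever: rank documents by keyword-overlap count.
    A stand-in for BM25, which the abstract names as the retrieval backbone."""
    def score(doc: str) -> int:
        tokens = set(doc.lower().split())
        return sum(1 for kw in keywords if kw.lower() in tokens)
    ranked = sorted(docs, key=score, reverse=True)
    return [d for d in ranked[:k] if score(d) > 0]


def iterative_keyword_rag(
    question: str,
    docs: List[str],
    gen_keywords: Callable[[str, List[str]], List[str]],  # hypothetical LLM call
    gen_answer: Callable[[str, List[str]], str],          # hypothetical LLM call
    validate: Callable[[str, str, List[str]], bool],      # hypothetical LLM call
    max_iters: int = 3,
) -> Optional[str]:
    """Generate keywords, retrieve, answer, validate; retry with new keywords."""
    tried: List[str] = []
    for _ in range(max_iters):
        keywords = gen_keywords(question, tried)
        tried.extend(keywords)
        retrieved = keyword_retrieve(keywords, docs)
        answer = gen_answer(question, retrieved)
        if validate(question, answer, retrieved):
            return answer
    return None  # no validated answer within the iteration budget


# Stub "LLM" stages so the loop can run without a model.
docs = ["paris is the capital of france", "berlin is the capital of germany"]

def stub_keywords(question, tried):
    return ["weather"] if not tried else ["capital", "france"]

def stub_answer(question, retrieved):
    return "Paris" if any("paris" in d for d in retrieved) else "unknown"

def stub_validate(question, answer, retrieved):
    return answer != "unknown"

result = iterative_keyword_rag("What is the capital of France?", docs,
                               stub_keywords, stub_answer, stub_validate)
```

In the stubbed run the first keyword set retrieves nothing, validation fails, and the second iteration succeeds with refined keywords. The keywords tried at each step remain inspectable, which is the interpretability benefit the abstract attributes to sparse retrieval.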

Article 109

Title@2025-07-30 (3): Towards the Law of Capacity Gap in Distilling Language Models

Title: Towards the Law of Capacity Gap in Distilling Language Models

Auf dem Weg zum Gesetz der Kapazitätslücke bei der Destillierung von Sprachmodellen

迈向《语文模式再学习能力差距法》 2311.07052v4

Authors (6): Chen Zhang, Qiuchi Li, Dawei Song, Zheyu Ye, Yan Gao, Yan Hu

Language model (LM) distillation aims at distilling the knowledge in a large teacher LM to a small student one. As a critical issue facing LM distillation, a superior student often arises from a teacher of a relatively small scale instead of a larger one, especially in the presence of a substantial capacity gap between the teacher and student. This issue, often referred to as the curse of capacity gap, suggests that there is likely an optimal teacher yielding the best-performing student along the scaling course of the teacher. Consequently, distillation trials on teachers of a wide range of scales are called for to determine the optimal teacher, which becomes computationally intensive in the context of large LMs (LLMs). This paper addresses this critical bottleneck by providing the law of capacity gap induced from a preliminary study on distilling a broad range of small-scale (<3B) LMs, where the optimal teacher consistently scales linearly with the student scale across different model and data scales. By extending the law to LLM distillation on a larger scale (7B), we succeed in obtaining versatile LLMs that outperform a wide array of competitors.

语言模型(LM)蒸馏法(LM)旨在将大型教师LM中的知识蒸馏成小学生的大型LM中的知识。作为LM蒸馏所面临的一个关键问题,高级学生往往来自规模相对较小而不是较大规模的教师,特别是在教师与学生之间能力差距很大的情况下。这个问题通常被称为能力差距的缩放管 , 表明在教师规模扩大过程中,很可能有一个最优秀的教师, 产生成绩最好的学生。因此, 需要对各种规模的教师进行蒸馏试验,以确定在大型LMS(LLMs)中具有计算密集性的最佳教师。本文通过在对广泛小规模( < 3B > ) LMS(<3B] LMS)进行的初步研究中引入的\ textitit{能力差距法} 来解决这一关键瓶颈问题, 最佳教师与不同模式和数据尺度的学生比例一致, 将法律推广到LMTM(7B)的蒸馏法, 通过在更大的规模(7B)中将法律推广到LM(LM)中,我们成功地获得了超越了多种磁体的磁体竞争阵列。

Article 110

Title@2025-07-30 (3): MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanations

Title: MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanations

MFTCXplain: Mehrsprachiger Benchmark-Datensatz zur Bewertung der moralischen Vernunft von LLMs durch Hassreden-Multi-Hop-Erklärungen

MFTCXplain:通过仇恨言论多呼多呼解释评估LLMs道德理由的多语言基准数据集 2506.19073v2

Authors (9): Jackson Trager, Diego Alves, Matteo Guida, Mikel K. Ngueajio, Ameeta Agrawal, Flor Plaza-del-Arco, Yalda Daryanai, Farzan Karimi-Malekabadi, Francielle Vargas

Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via hate speech multi-hop explanation using Moral Foundations Theory (MFT). The dataset comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Empirical results highlight a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited mainly in underrepresented languages. These findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.

由于这些系统被用于社会敏感的任务,确保大语言模型的道德推理能力日益成为日益令人关切的问题,然而,目前的评价基准存在两个主要缺陷:缺乏说明,说明道德分类的理由,限制了透明度和可解释性;主要侧重于英语,限制了对不同文化背景的道德推理的评估;在本文件中,我们引入了MFTCXplain,这是一个多语种的基准数据集,用于利用道德基金会理论(MFT)通过仇恨言论多语种解释评价LLM的道德推理;数据集包括葡萄牙语、意大利语、波斯语和英语的3 000个推文,附有二元仇恨言论标签、道德类别和跨层次的文字理由说明;经验性结果突出表明了在道德推理任务中LLM产出与人说明之间的不协调;虽然LMs在仇恨言论探测方面表现良好(F1至0.836),但其预测道德情绪的能力明显薄弱(F1 < 0.35),但理由调整仍然主要限于代表性不足的语言。

Article 111

Title@2025-07-30 (3): DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router

Title: DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router

DeepSieve: Informationen über LLM-as-a-Knowledge-Router

深筛选:通过LLM-as-a- knowledge-Router获取信息 2507.22050v2

Authors (8): Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu, Mengnan Du, Haifeng Chen, Wei Cheng

Large Language Models (LLMs) excel at many reasoning tasks but struggle with knowledge-intensive queries due to their inability to dynamically access up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) has emerged as a promising solution, enabling LLMs to ground their responses in external sources. However, existing RAG methods lack fine-grained control over both the query and source sides, often resulting in noisy retrieval and shallow reasoning. In this work, we introduce DeepSieve, an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. DeepSieve decomposes complex queries into structured sub-questions and recursively routes each to the most suitable knowledge source, filtering irrelevant information through a multi-stage distillation process. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design. Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches. Our code is available at https://github.com/MinghoKwok/DeepSieve.

大型语言模型(LLM)在许多推理任务上表现出色,但由于无法动态获取最新或特定领域的信息,在处理知识密集型查询时存在困难。检索增强生成(RAG)已成为一种有前景的解决方案,使LLM能够以外部来源为依据作出回答。然而,现有RAG方法缺乏对查询端和来源端的细粒度控制,往往导致检索噪声大、推理浅。在这项工作中,我们引入了DeepSieve,这是一个通过LLM-as-a-knowledge-router进行信息筛选的智能体式RAG框架。DeepSieve将复杂查询分解为结构化的子问题,并递归地将每个子问题路由到最合适的知识来源,通过多阶段蒸馏过程过滤无关信息。我们的设计强调模块化、透明度和适应性,并利用了智能体系统设计的最新进展。在跨异构来源的多跳问答任务上的实验表明,与常规RAG方法相比,推理深度、检索精度和可解释性均有所提高。我们的代码可在 https://github.com/MinghoKwok/DeepSieve 获取。

Article 112

Title@2025-07-30 (3): GATEAU: Selecting Influential Samples for Long Context Alignment

Title: GATEAU: Selecting Influential Samples for Long Context Alignment

GATEAU: Auswahl von einflussreichen Proben für lange Kontextausrichtung

GATEAU:为长期对齐选择有影响的样本 2410.15633v6

Authors (10): Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun

Aligning large language models to handle instructions with extremely long contexts has yet to be fully investigated. Previous studies have attempted to scale up the available data volume by synthesizing long instruction-following samples, as constructing such a dataset tends to be challenging for annotators. However, a lack of a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the model’s performance. Thus, we propose GATEAU, a novel framework to address the unique challenge of long context alignment by identifying the influential samples enriched with long-range dependency relations. Specifically, GATEAU measures the long-range dependencies from two essential aspects: the difficulty of generating target responses due to the long-range dependencies, and the difficulty of understanding long inputs due to such dependencies. Comprehensive experiments indicate that GATEAU effectively identifies influential samples, and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.

以往的研究试图扩大现有数据量,方法是综合长期的遵循指令的样本,因为建立这样一个数据集往往对通知员来说具有挑战性;然而,缺乏确保数据质量的明确战略可能会引入低质量样本,并限制模型的性能。因此,我们提议GATEAU(GATEAU),这是一个新颖的框架,用以通过查明富含长距离依赖关系的有影响的样本来应对长期背景一致的独特挑战。具体地说,GATEAU衡量两个基本方面的长期依赖性:由于长期依赖性而难以产生目标反应,以及由于这种依赖性而难以理解长期投入。全面实验表明,GATEAU有效识别了有影响力的样本,而经过培训的这些样本模型显示了更好的指导跟踪和长文本理解能力。

Article 113

Title@2025-07-30 (3): MASCA: LLM based-Multi Agents System for Credit Assessment

Title: MASCA: LLM based-Multi Agents System for Credit Assessment

MASCA: LLM-basiertes Multi-Agenten-System zur Bonitätsbeurteilung

MASCA: 以LLM为基础的信用评估多边代理系统 2507.22758v1

Authors (3): Gautam Jajoo, Pranjal A Chitale, Saksham Agarwal

Recent advancements in financial problem-solving have leveraged LLMs and agent-based systems, with a primary focus on trading and financial modeling. However, credit assessment remains an underexplored challenge, traditionally dependent on rule-based methods and statistical models. In this paper, we introduce MASCA, an LLM-driven multi-agent system designed to enhance credit evaluation by mirroring real-world decision-making processes. The framework employs a layered architecture where specialized LLM-based agents collaboratively tackle sub-tasks. Additionally, we integrate contrastive learning for risk and reward assessment to optimize decision-making. We further present a signaling game theory perspective on hierarchical multi-agent systems, offering theoretical insights into their structure and interactions. Our paper also includes a detailed bias analysis in credit assessment, addressing fairness concerns. Experimental results demonstrate that MASCA outperforms baseline approaches, highlighting the effectiveness of hierarchical LLM-based multi-agent systems in financial applications, particularly in credit scoring.

最近在金融问题解决方面的进展利用了LLMs和代理系统,主要侧重于贸易和金融模式,然而,信用评估仍是一个未得到充分探讨的挑战,传统上依赖基于规则的方法和统计模式。我们在本文件中引入了由LMCA驱动的多种代理系统,即由LMCA驱动的多种代理系统,目的是通过反映现实世界的决策进程加强信用评估。框架采用一个多层结构,由基于LLMM的专门代理机构协作应对次级任务。此外,我们将风险和奖励评估的对比性学习纳入优化决策。我们进一步展示了有关等级多试剂系统的示性游戏理论观点,提供了对其结构和互动的理论见解。我们的文件还包括了信用评估中的详细偏见分析,解决公平问题。实验结果表明,MASCA在金融应用中,特别是信用评分中,以LMM为主的多代理系统优于基线方法,突出其效力。

Article 114

Title@2025-07-30 (3): Opportunities and Challenges of LLMs in Education: An NLP Perspective

Title: Opportunities and Challenges of LLMs in Education: An NLP Perspective

Chancen und Herausforderungen von LLM im Bildungswesen: Eine NLP-Perspektive

教育中法学硕士的机遇和挑战:国家学习方案展望 2507.22753v1

Authors (5): Sowmya Vajjala, Bashar Alhafni, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar

Interest in the role of large language models (LLMs) in education is increasing, considering the new opportunities they offer for teaching, learning, and assessment. In this paper, we examine the impact of LLMs on educational NLP in the context of two main application scenarios, assistance and assessment, grounding them along four dimensions: reading, writing, speaking, and tutoring. We then present the new directions enabled by LLMs and the key challenges to address. We envision that this holistic overview would be useful for NLP researchers and practitioners interested in exploring the role of LLMs in developing language-focused and NLP-enabled educational applications of the future.

考虑到大型语言模式为教学、学习和评估提供的新机会,人们对大型语言模式在教育中的作用的兴趣正在增加,在本文件中,我们结合两个主要应用情景(援助和评估),审查LLM对教育NLP的影响,将其建立在阅读、写作、讲演和辅导这四个层面的基础上。然后,我们介绍LLM所促成的新方向和要应对的主要挑战。我们设想,这一全面概览将有益于NLP研究人员和有意探索LLMs在开发未来以语言为主和以NLP为主的教育应用方面的作用的从业人员。

Article 115

Title@2025-07-30 (3): CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

Title: CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

CUS-QA:以本地知识为主的不限成员名额问题解答数据集 2507.22752v1

Authors (4): Jindřich Libovický, Jindřich Helcl, Andrei Manea, Gianluca Vico

We introduce a benchmark for open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. As a baseline, we evaluate state-of-the-art LLMs through prompting and complement this with human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results highlight a significant gap in regional knowledge among current LLMs. Moreover, apart from LLM-based evaluation, there is minimal correlation between automated metrics and human judgment. We release this dataset as a resource to (1) assess regional knowledge in LLMs, (2) study cross-lingual generation consistency in a challenging setting, and (3) advance the development of evaluation metrics for open-ended question answering.

我们引入了一个包含文字和视觉两种方式的开放式区域问题解答基准,我们还利用最先进的大型语言模型(LLMs)提供强有力的基线。我们的数据集由捷克、斯洛伐克和乌克兰的土著发言人在维基百科基础上手工整理的问答以及随附英文译文组成。它包括纯文字问题和需要视觉理解的问题。作为一个基线,我们通过激发和补充对答案正确性的人类判断来评估最先进的LLMs。我们利用这些人类评估,分析现有自动评价指标的可靠性。我们的基线结果突显了当前LLMs之间在区域知识方面存在的巨大差距。此外,除了LLM评价之外,自动化指标与人类判断之间几乎没有什么关联。我们发布这一数据集,作为资源(1) 评估LMS的区域知识,(2) 研究挑战性环境中的跨语言生成一致性,(3) 推进为开放式问题解答制定评价指标。

Article 116

Title@2025-07-30 (3): Next Tokens Denoising for Speech Synthesis

Title: Next Tokens Denoising for Speech Synthesis

Nächste Tokens Denoising für Sprachsynthese

下一集 Tokens 代言人演讲综述 2507.22746v1

Authors (10): Yanqing Liu, Ruiqing Xue, Chong Zhang, Yufei Liu, Gang Wang, Bohan Li, Yao Qian, Lei He, Shujie Liu, Sheng Zhao

While diffusion and autoregressive (AR) models have significantly advanced generative modeling, they each present distinct limitations. AR models, which rely on causal attention, cannot exploit future context and suffer from slow generation speeds. Conversely, diffusion models struggle with key-value (KV) caching. To overcome these challenges, we introduce Dragon-FM, a novel text-to-speech (TTS) design that unifies AR and flow-matching. This model processes 48 kHz audio codec tokens in chunks at a compact rate of 12.5 tokens per second. This design enables AR modeling across chunks, ensuring global coherence, while parallel flow-matching within chunks facilitates fast iterative denoising. Consequently, the proposed model can utilize KV-cache across chunks and incorporate future context within each chunk. Furthermore, it bridges continuous and discrete feature modeling, demonstrating that continuous AR flow-matching can predict discrete tokens with finite scalar quantizers. This efficient codec and fast chunk-autoregressive architecture also makes the proposed model particularly effective for generating extended content. Experiments on podcast datasets demonstrate its capability to efficiently generate high-quality zero-shot podcasts.

虽然扩散和自动递减模型(AR)模型具有显著的进步基因模型,但它们都具有不同的局限性。AR模型依赖因果关注,无法利用未来背景,并且受到缓慢的生成速度的影响。相反,扩散模型与关键值(KV)的缓冲相争。为了克服这些挑战,我们引入了龙-调,这是一个新颖的文本到语音(TTS)设计,可以统一AR和流程匹配。这个模型处理块块中48千赫兹音调解码器,以每秒12个缩压压制成。这个设计可以让AR在块间建模,确保全球的一致性,而块内的平行流配制有助于快速迭代分解。因此,拟议的模型可以利用KV缓冲,将未来环境纳入每个块中。此外,它连接连续和离散的特征模型,表明持续的AR流配制能以有限的量度四分辨器预测离散的象征物。这个高效的编码和快速块反向递增结构也使得拟议的模型对生成扩展的内容特别有效。对高频度数据定位的定位进行实验。

Article 117

Title@2025-07-30 (3): Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index

Title: Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index

Verringerung der Halluzinationen in der Zusammenfassung durch Verstärkungslernen mit Entity Halluzination Index

利用实体幻觉指数,通过强化学习减少在总结中的幻觉 2507.22744v1

Authors (4): Praveenkumar Katwe, Rakesh Chandra, Balabantaray Kali, Prasad Vittala

Reducing hallucinations in abstractive summarization remains a critical challenge for deploying language models (LMs) in real-world settings. In this work, we introduce a reward-driven fine-tuning framework that explicitly optimizes for the Entity Hallucination Index (EHI), a metric designed to quantify the presence, correctness, and grounding of named entities in generated summaries. Given a corpus of meeting transcripts, we first generate baseline summaries using a pre-trained LM and compute EHI scores via automatic entity extraction and matching. We then apply reinforcement learning to fine-tune the model parameters, using EHI as a reward signal to bias generation toward entity-faithful outputs. Our approach does not rely on human-written factuality annotations, enabling scalable fine-tuning. Experiments demonstrate consistent improvements in EHI across datasets, with qualitative analysis revealing a significant reduction in entity-level hallucinations without degradation in fluency or informativeness. We release a reproducible Colab pipeline, facilitating further research on hallucination-aware model fine-tuning using lightweight hallucination metrics like EHI.

在现实世界环境中部署语言模型(LMS)时,减少抽象合成的幻觉仍然是一项关键的挑战。在这项工作中,我们引入了一个奖赏驱动的微调框架,明确优化实体幻觉指数(EHI),该指数旨在量化名称实体的存在、正确性和在生成摘要中的依据。根据一系列会议记录,我们首先使用预先培训的LM(LM)生成基线摘要,然后通过自动实体提取和匹配来计算EHI分数。然后我们运用强化学习来微调模型参数,利用EHI作为奖赏信号来生成对实体信仰产出的偏见。我们的方法并不依赖于人写的事实性说明,从而能够进行可扩展的微调。实验表明EHI跨数据集的持续改进,其质量分析显示实体一级幻觉的显著减少,而不会在流利或信息性方面出现退化。我们发布了一个可复制的Colab输油管,便利进一步研究使用轻量、致幻度指标(如EHI)对幻模型进行微调。

Article 118

Title@2025-07-30 (3): Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning

Title: Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning

Bewertungsprüfer: Bewertung der synthetischen Überprüfung für Code und Begründung

标定验证符:评估编码和理由的合成核查 2502.13820v3

Authors (4): Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg

Synthetic verification techniques such as generating test cases and reward modelling are common ways to enhance the coding capabilities of large language models (LLMs) beyond predefined tests. Additionally, code verification has recently found great success as a critical component in improving the reasoning capability of LLMs via reinforcement learning. In this paper, we propose an approach that can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We also propose multiple metrics to measure different aspects of the synthetic verifiers with the proposed benchmarks. By employing the proposed approach, we release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.

合成核查技术,如产生测试案例和奖励建模,是提高大型语言模型(LLM)的编码能力,超越预先界定的测试的常见方法。此外,守则核查最近发现,作为通过强化学习提高LLMS推理能力的一个关键组成部分,在通过强化学习提高LMS推理能力方面,取得了巨大成功。在本文件中,我们提出一种方法,可将现有的编码基准转换成评分和排名数据集,以评价合成核查员的效力。我们还提出多种指标,用拟议基准衡量合成核查员的不同方面。通过采用拟议方法,我们发布了四个新的基准(HE-R、HE-R+、MBPP-R和MBPP-R+),并以标准、推理和奖励为基础的LMS分析合成核查方法。我们的实验表明,推理可以大大改进测试案例的生成,扩大测试案例的数量可以提高核查的准确性。

Article 119

Title@2025-07-30 (3): Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning

Title: Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning

Ressourceneffiziente Anpassung großer Sprachmodelle für Text-Embeddings über Prompt Engineering und Contrastive Fine-Tuning

通过即时工程和反竞争微调对文本嵌入大语言模型进行资源高效率的改编 2507.22729v1

Authors (6): Benedikt Roth, Stephan Rappensperger, Tianming Qiu, Hamza Imamović, Julian Wörmann, Hao Shen

Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP), achieving impressive performance in text generation. Their token-level representations capture rich, human-aligned semantics. However, pooling these vectors into a text embedding discards crucial information. Nevertheless, many non-generative downstream tasks, such as clustering, classification, or retrieval, still depend on accurate and controllable sentence- or document-level embeddings. We explore several adaptation strategies for pre-trained, decoder-only LLMs: (i) various aggregation techniques for token embeddings, (ii) task-specific prompt engineering, and (iii) text-level augmentation via contrastive fine-tuning. Combining these components yields state-of-the-art performance on the English clustering track of the Massive Text Embedding Benchmark (MTEB). An analysis of the attention map further shows that fine-tuning shifts focus from prompt tokens to semantically relevant words, indicating more effective compression of meaning into the final hidden state. Our experiments demonstrate that LLMs can be effectively adapted as text embedding models through a combination of prompt engineering and resource-efficient contrastive fine-tuning on synthetically generated positive pairs.

大型语言模型(LLMS)已成为自然语言处理(NLP)的基石,在文本生成中取得了令人印象深刻的成绩。它们的象征性代表层捕捉了丰富的、与人类一致的语义。然而,将这些矢量集中到一个嵌入抛弃物的文本中,关键的信息。然而,许多非遗传性的下游任务,如集群、分类或检索,仍然取决于准确和可控制的判决或文件级嵌入。我们探索了预先训练的、非编码的LMS(NLP)的几种适应战略,在文本生成过程中取得了令人印象深刻的成绩。我们探讨了一些适应策略,这些策略是:(一) 代用品嵌入的各种聚合技术,(二) 特定任务迅速的工程,以及(三) 通过对比性微调增强文本层。将这些组件合并起来,可以产生在大规模文本嵌入基准(MTEB)的英国组合轨迹上的最先进的表现。对关注地图的分析进一步表明,微调的焦点从提示符号转向语义相关词,表明对含义进行更有效的压缩到最后隐藏状态。我们的实验表明,LMSMs可以有效地作为文本嵌入模型,通过迅速的工程和资源节制生成的合成对准组合。
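A minimal sketch of the first adaptation strategy named in the abstract above, aggregating token embeddings into a single text embedding. It assumes token vectors are plain Python lists and that a 0/1 attention mask marks padding positions. The function names `mean_pool` and `last_token_pool` are illustrative and not taken from the paper, whose actual aggregation variants may differ.

```python
from typing import List

Vector = List[float]


def mean_pool(token_vecs: List[Vector], mask: List[int]) -> Vector:
    """Average the vectors of all non-padding tokens (mask == 1)."""
    kept = [vec for vec, keep in zip(token_vecs, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def last_token_pool(token_vecs: List[Vector], mask: List[int]) -> Vector:
    """Return a copy of the vector of the last non-padding token."""
    for vec, keep in zip(reversed(token_vecs), reversed(mask)):
        if keep:
            return list(vec)
    raise ValueError("mask selects no tokens")


# Hypothetical toy input: three token vectors, the third one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
mean_embedding = mean_pool(tokens, mask)
last_embedding = last_token_pool(tokens, mask)
```

In a decoder-only model with causal attention, only the final token has attended to the entire input, which is one reason last-token pooling is a common alternative to averaging.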

Article 120

Title@2025-07-30 (3): Investigating Hallucination in Conversations for Low Resource Languages

Title: Investigating Hallucination in Conversations for Low Resource Languages

Untersuchung von Halluzinationen in Gesprächen über Sprachen mit geringem Ressourcenreichtum

低资源语言对话中的幻觉 2507.22720v1

Authors (10): Amit Das, Md. Najib Hasan, Souvika Sarkar, Zheng Zhang, Fatemeh Jamshidi, Tathagata Bhattacharya, Nilanjana Raychawdhury, Dongji Feng, Vinija Jain, Aman Chadha

Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often generate factually incorrect statements, a problem typically referred to as ‘hallucination’. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.

大型语言模型(LLMS)在生成与人类写作非常相似的文本方面表现出了非凡的熟练程度,然而,它们往往产生事实错误的陈述,通常被称为“幻觉”,解决幻觉问题对于提高LLMS的可靠性和有效性至关重要。虽然许多研究都侧重于英语的幻觉,但我们的研究将这一调查扩大到三种语言的谈话数据:印地语、波斯语和普通话。我们对一套数据进行了全面分析,以检查GPT-3.5、GPT-4o、Llama-3.1、Gemma-2.0、DeepSeek-R1和Qwen-3在这些语言中的事实和语言错误。我们发现LLM在普通话中产生的幻觉反应很少,但在印地语和波斯语中产生的幻觉数量却要高得多。

Article 121

Title@2025-07-30 (3): Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining

Title: Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining

Erhöhung der Ultra-Low-Bit-Quantisierung großer Sprachmodelle durch Saliency-Aware Partial Retraining

通过提高质量-软件部分再培训,加强大语言模型的超低比小量量化 2504.13932v3

Authors (2): Deyu Cao, Samin Aref

The growing use of large language models has raised environmental and economic concerns about their intensity of resource usage during inference. Serving these models to each user requires substantial energy and water for cooling. Model compression techniques like quantization can shrink large language models and make them more resource efficient at the cost of potential performance degradation. Quantization methods reduce model size by replacing high-precision parameters with quantized values of lower precision. Among existing methods, the ApiQ method achieves superior accuracy preservation at minimal memory and time overhead. We investigate two ideas to extend performance in ultra-low-bit quantization beyond ApiQ’s level. First, we look into combining existing quantization-aware training techniques with ApiQ’s partial training. We show that this does not outperform the baseline ApiQ method with limited training data and frozen weights. This leads to two key insights: (1) The substantial representational capacity that is gained through full retraining is unlikely to be attainable through partial training. (2) This gain may depend on using a large and diverse dataset in quantization-aware training. Second, through a novel approach informed by the two insights, we propose an ultra-low-bit quantization method that builds upon ApiQ and extends its performance without the need for full retraining. This publicly available method relies on a saliency-aware regularization term that prioritizes preserving the most impactful parameters during quantization. Our experiments on LLaMA 7B and 13B benchmarks demonstrate that our method reduces ApiQ’s accuracy degradation by 10.85% and 7.54%, respectively. A Python implementation of the proposed quantization method is publicly available on GitHub at https://github.com/TokuyuSou/ULB-SAPR.

大型语言模型的使用日益增多,引起了人们对在推断期间资源使用强度的环境和经济关切。向每个用户提供这些模型需要大量的能量和水来进行冷却。模型压缩技术,如量化等,可以压缩大型语言模型,使其以潜在性能退化的代价提高资源效率。量化方法压缩模型规模,以低精度值的分量值取代高精度参数。在现有方法中,ApiQ方法在最小的记忆和时间管理上实现高度准确性保护。我们调查了将超低比特四分化的性能扩大到ApiQ的两种想法。首先,我们研究了将现有的量化培训技术与ApiQ的部分培训结合起来的情况。我们表明,这并没有以有限的培训数据和冷却的重量来超过ApiQ的基线方法。这导致两个关键见解:(1) 通过全面再培训获得的大量代表能力不太可能通过部分培训获得。(2)这一增益取决于在夸度化培训中采用大规模和多样化的数据设置的参数。首先,我们研究了现有的量级-测测测测测测测测标准,在Sqial-ralreal再使用两种方法后,我们提出了一种完整的业绩方法。
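To make the idea of replacing high-precision parameters with low-precision values concrete, here is a minimal sketch of generic round-to-nearest uniform quantization. This is not ApiQ and not the method proposed in the paper. The `saliency_weighted_error` helper is only a hypothetical illustration of what weighting reconstruction error by parameter importance could look like; the paper's actual regularization term is not specified in the abstract.

```python
from typing import List, Tuple


def quantize_uniform(weights: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Map each weight to an integer code in [0, 2**bits - 1] (round-to-nearest)."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    # Fall back to a scale of 1.0 when all weights are equal, to avoid dividing by zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize(codes: List[int], scale: float, lo: float) -> List[float]:
    """Recover approximate weights from integer codes."""
    return [lo + c * scale for c in codes]


def saliency_weighted_error(weights: List[float], restored: List[float],
                            saliency: List[float]) -> float:
    """Squared reconstruction error, weighted per parameter (assumed form)."""
    return sum(s * (w - r) ** 2 for w, r, s in zip(weights, restored, saliency))


# Hypothetical toy weights quantized to 2 bits (four levels).
weights = [-1.0, -0.4, 0.2, 1.0]
codes, scale, lo = quantize_uniform(weights, bits=2)
restored = dequantize(codes, scale, lo)
uniform_error = saliency_weighted_error(weights, restored, [1.0, 1.0, 1.0, 1.0])
```

With only four levels the two middle weights are moved noticeably, which is the kind of accuracy loss that ultra-low-bit methods try to limit.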

Article 122

Title@2025-07-30 (3): From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs

Title: From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs

Von der Fähigkeit zur Reflexion: Stärkungsorientiertes Denken Qualität in retrieval-augmented Begründung für LLMs

从充足到反思:LLMs在追偿和增加理由方面的强化引导思考质量 2507.22716v1

Authors (3): Jie He, Victor Gutierrez Basulto, Jeff Z. Pan

Reinforcement learning-based retrieval-augmented generation (RAG) methods enhance the reasoning abilities of large language models (LLMs). However, most rely only on final-answer rewards, overlooking intermediate reasoning quality. This paper analyzes existing RAG reasoning models and identifies three main failure patterns: (1) information insufficiency, meaning the model fails to retrieve adequate support; (2) faulty reasoning, where logical or content-level flaws appear despite sufficient information; and (3) answer-reasoning inconsistency, where a valid reasoning chain leads to a mismatched final answer. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system to improve reasoning and stability. TIRESRAG-R1 introduces: (1) a sufficiency reward to encourage thorough retrieval; (2) a reasoning quality reward to assess the rationality and accuracy of the reasoning chain; and (3) a reflection reward to detect and revise errors. It also employs a difficulty-aware reweighting strategy and training sample filtering to boost performance on complex tasks. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks. The code and data are available at: https://github.com/probe2/TIRESRAG-R1.

以学习为基础的强化检索-增强一代(RAG)方法加强了大型语言模型(LLMS)的推理能力。然而,多数人只依赖最后回答的奖励,而忽略中间推理质量。本文分析现有RAG推理模型,并查明三个主要失败模式:(1)信息不足,意味着模型未能获得足够的支持;(2)错误推理,尽管信息充足,但逻辑或内容层次的缺陷似乎不足;(3)答案推理不一致,因为有效的推理链导致有错配的最后答案。我们提议TIRESRAG-R1,一个利用思维-检索-反射进程和多层面奖励系统来改进推理和稳定性的新框架。TIRESRAG-R1介绍了:(1)鼓励彻底检索的充分奖励;(2)评估推理链的合理性和准确性的推理质量奖励;(3)发现和修改错误的反省奖励。我们还采用难辨称战略和培训样本过滤,以提高复杂任务的绩效。在四种多霍普-QA数据集上进行的实验表明,TIRESRAG-RAG1号现有数据格式比以往的数据格式。

Article 123

Title@2025-07-30 (3): UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis

Title: UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis

UI-E2I-Synth: Weiterentwicklung der GUI-Grundierung mit großformatiger Instruktionssynthese

UI-E2I-Synth:以大型教学合成为基础推进图形界面 2504.11257v4

Authors (4): Xinyi Liu, Xiaoyi Zhang, Ziyun Zhang, Yan Lu

Recent advancements in Large Vision-Language Models are accelerating the development of Graphical User Interface (GUI) agents that utilize human-like vision perception capabilities to enhance productivity on digital devices. Compared to approaches predicated on GUI metadata, which are platform-dependent and vulnerable to implementation variations, vision-based approaches offer broader applicability. In this vision-based paradigm, GUI instruction grounding, which maps a user instruction to the location of the corresponding element on a given screenshot, remains a critical challenge, particularly due to limited public training datasets and resource-intensive manual instruction data annotation. In this paper, we delve into unexplored challenges in this task, including element-to-screen ratio, unbalanced element type, and implicit instruction. To address these challenges, we introduce a large-scale data synthesis pipeline, UI-E2I-Synth, for generating varying complex instruction datasets using GPT-4o instead of human annotators. Furthermore, we propose a new GUI instruction grounding benchmark, UI-I2E-Bench, which is designed to address the limitations of existing benchmarks by incorporating diverse annotation aspects. Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding, demonstrating the advancements of the proposed data synthesis pipeline. The proposed benchmark, accompanied by extensive analyses, provides practical insights for future research in GUI grounding. We will release corresponding artifacts at https://microsoft.github.io/FIVE-UI-Evol/ .

大型视觉语言模型最近的进展正在加速开发图形用户界面(GUI)代理器,这些代理器利用人性化的视觉感知能力提高数字装置的生产率。与基于GUI元数据的方法相比,基于愿景的方法具有更广泛的适用性,因为GUI依靠平台,容易出现执行差异。在这个基于愿景的模式中,图形指导定位将用户指示映射到特定截图上相应元素的位置,这仍然是一个重大挑战,特别是因为公共培训数据集和资源密集的人工指令数据说明有限。在本文中,我们深入探讨了这一任务中未探讨的挑战,包括元素对屏幕比率、不平衡元素类型和隐含的指令。为了应对这些挑战,我们采用了大规模的数据合成管道UIUI-E2I-Synth 方法,用于使用GPT-4o而不是使用人类说明器生成不同的复杂指令数据集。此外,我们提出了一个新的界面指令,目的是通过纳入不同的注释说明,解决现有基准的局限性,包括元素对屏幕比率、不平衡的元素类型类型和隐含的指令。为了应对这些挑战,我们模型、经过培训的管道中的最新数据分析,通过拟议的地面分析,将实现拟议的地面分析。

Article 124

Title@2025-07-30 (3): Spatial Language Likelihood Grounding Network for Bayesian Fusion of Human-Robot Observations

Title: Spatial Language Likelihood Grounding Network for Bayesian Fusion of Human-Robot Observations

Raumsprache Likelihood Grounding Network für Bayesian Fusion von Mensch-Roboter-Beobachtungen

Bayesian人类-机器人观测融合空间语言定位网络 2507.19947v2

Authors (4): Supawich Sitdhipol, Waritwong Sukprasongdee, Ekapol Chuangsuwanich, Rina Tse

Fusing information from human observations can help robots overcome sensing limitations in collaborative tasks. However, an uncertainty-aware fusion framework requires a grounded likelihood representing the uncertainty of human inputs. This paper presents a Feature Pyramid Likelihood Grounding Network (FP-LGN) that grounds spatial language by learning relevant map image features and their relationships with spatial relation semantics. The model is trained as a probability estimator to capture aleatoric uncertainty in human language using three-stage curriculum learning. Results showed that FP-LGN matched expert-designed rules in mean Negative Log-Likelihood (NLL) and demonstrated greater robustness with lower standard deviation. Collaborative sensing results demonstrated that the grounded likelihood successfully enabled uncertainty-aware fusion of heterogeneous human language observations and robot sensor measurements, achieving significant improvements in human-robot collaborative task performance.

人类观测的阻燃信息可以帮助机器人克服协作任务中的感知限制。然而,一个具有不确定性的聚合框架需要具有代表人类投入不确定性的有根有据的可能性。本文介绍了一个地貌虫状金字塔网络(FP-LGN),它通过学习相关的地图图像特征及其与空间关系语义的关系而将空间语言作为基础。模型被培训为概率估计器,以便利用三阶段课程学习来捕捉人类语言中的感知性不确定性。结果显示,FP-LGN与专家设计的规则相匹配,其平均值为负日志-日产(NLLL),并显示在较低标准偏差情况下更加稳健。合作感结果表明,基于地貌的概率成功地促成了多种人类语言观测和机器人传感器测量的不确定性-认知融合,从而在人类机器人协作性工作方面取得了显著的改进。
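The fusion step described in the abstract above can be illustrated with a minimal sketch of a discrete Bayes update over map cells. It assumes the robot's current belief is a probability per cell and that the grounded language likelihood supplies one value per cell. The function name `bayes_fuse` and the toy numbers are illustrative only and do not come from the paper.

```python
from typing import List


def bayes_fuse(prior: List[float], likelihood: List[float]) -> List[float]:
    """Posterior over grid cells: prior times likelihood, renormalized to sum to 1."""
    if len(prior) != len(likelihood):
        raise ValueError("prior and likelihood must cover the same cells")
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("likelihood assigns zero mass to every cell")
    return [u / total for u in unnormalized]


# Hypothetical example: the robot favors cell 2, the human statement favors cell 0.
robot_belief = [0.25, 0.25, 0.5]
language_likelihood = [0.8, 0.1, 0.1]
posterior = bayes_fuse(robot_belief, language_likelihood)
```

A flatter likelihood (a more uncertain human statement) moves the posterior less, which is why the likelihood needs to carry calibrated uncertainty.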

Article 125

Title@2025-07-30 (3): Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment

Title: Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment

Hören auf das Unausgesprochene: Erforschen von 365 Aspekten der multimodalen Interview-Performance Assessment

聆听无语者:探索多模式访谈业绩评估的365方面 2507.22676v1

Authors (6): Jia Li, Yang Wang, Wenhao Qian, Zhenzhen Hu, Richang Hong, Meng Wang

Interview performance assessment is essential for determining candidates’ suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores “365” aspects of interview performance by integrating three modalities (video, audio, and text), six responses per candidate, and five key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams, which are subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/MSA-LMC/365Aspects.

面试业绩评估对于确定候选人是否适合专业职位至关重要。为了确保整体和公平评价,我们提议了一个新颖和全面的框架,探讨面试业绩的“365”个方面,将三种模式(视频、音频和文本)、每个候选人的六个回答和五个关键评价层面结合起来。框架采用特定模式的特征提取器,将不同数据流编码,随后通过一个共同压缩多层次的多层次接受器进行整合。这个模块压缩将多式联运嵌入一个统一的潜在空间,便利高效率的特征互动。为了提高预测的稳健性,我们纳入了一个两级的混合学习战略:(1) 独立回归头预测每个答复的评分,以及(2) 利用一个平均集合机制对各种答复进行汇总预测,以产生五个目标层面的最后评分。通过听不说,我们的方法从多式联运数据中获取明确和隐含的提示,能够进行全面和公正的评估。实现多维平均数0.1824,我们的框架在2025年AVI 挑战性评估中首次得到保障。
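The second level of the ensemble described in the abstract above (mean-pooling per-response predictions into final scores) can be sketched as follows. It assumes each response already has one predicted score per evaluation dimension. The function names and the toy numbers are hypothetical, and the averaged MSE shown here is only one plausible reading of the "multi-dimensional average MSE" that the abstract reports.

```python
from typing import List


def aggregate_scores(per_response: List[List[float]]) -> List[float]:
    """Mean-pool per-response predictions into one score per dimension."""
    if not per_response:
        raise ValueError("need at least one response")
    n_responses = len(per_response)
    n_dims = len(per_response[0])
    return [sum(resp[d] for resp in per_response) / n_responses
            for d in range(n_dims)]


def mean_squared_error(predicted: List[float], target: List[float]) -> float:
    """Average squared error across evaluation dimensions."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical toy case: two responses, two dimensions (the paper uses six and five).
predictions = [[1.0, 2.0], [3.0, 4.0]]
final_scores = aggregate_scores(predictions)
error = mean_squared_error(final_scores, [2.0, 4.0])
```

Averaging across responses reduces the influence of any single noisy prediction on the final score.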

Article 126

Title@2025-07-30 (3): What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization

Title: What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization

Wovon reden sie? Ein Benchmark der wissensgeprägten Diskussionszusammenfassung

他们在谈论什么?知识类讨论总结的基准 2505.12474v2

Authors (7): Weixiao Zhou, Junnan Zhu, Gengyao Li, Xianfu Cheng, Xinnian Liang, Feifei Zhai, Zhoujun Li

Traditional dialogue summarization primarily focuses on dialogue content, assuming it comprises adequate information for a clear summary. However, this assumption often fails for discussions grounded in shared background, where participants frequently omit context and use implicit references. This results in summaries that are confusing to readers unfamiliar with the background. To address this, we introduce Knowledge-Grounded Discussion Summarization (KGDS), a novel task that produces a supplementary background summary for context and a clear opinion summary with clarified references. To facilitate research, we construct the first KGDS benchmark, featuring news-discussion pairs and expert-created multi-granularity gold annotations for evaluating sub-summaries. We also propose a novel hierarchical evaluation framework with fine-grained and interpretable metrics. Our extensive evaluation of 12 advanced large language models (LLMs) reveals that KGDS remains a significant challenge. The models frequently miss key facts and retain irrelevant ones in background summarization, and often fail to resolve implicit references in opinion summary integration.

传统对话总结主要侧重于对话内容,假定它包含足够的信息,可以提出明确的总结,但这一假设往往不能用于基于共同背景的讨论,因为与会者经常略去背景,使用隐含的参考,结果摘要使不熟悉背景的读者感到困惑。为了解决这个问题,我们引入了知识四面讨论总结(KGDS),这是一项新颖的任务,为背景提供了补充背景摘要,并提供了明确的意见摘要。为了便于研究,我们构建了第一个KGDS基准,以进行新闻讨论的对口和专家为次摘要评价创建的多色金说明为主。我们还提出了一个新的等级评价框架,配有精细的和可解释的参数。我们对12个先进的大语言模型(LLMS)的广泛评价表明,KGDS仍然是一个重大挑战。模型常常忽略关键事实,在背景总结中保留不相干的内容,而且常常无法解决意见摘要整合中隐含的参考。

Article 127

Title@2025-07-30 (3): Instruction-tuned Large Language Models for Machine Translation in the Medical Domain

Title: Instruction-tuned Large Language Models for Machine Translation in the Medical Domain

Instruktionsorientierte große Sprachmodelle für die maschinelle Übersetzung im medizinischen Bereich

医疗领域机器翻译大语言模型 2408.16440v2

Authors (1): Miguel Rios

Large Language Models (LLMs) have shown promising results on machine translation for high resource language pairs and domains. However, in specialised domains (e.g. medical) LLMs have shown lower performance compared to standard neural machine translation models. The consistency in the machine translation of terminology is crucial for users, researchers, and translators in specialised domains. In this study, we compare the performance between baseline LLMs and instruction-tuned LLMs in the medical domain. In addition, we introduce terminology from specialised medical dictionaries into the instruction formatted datasets for fine-tuning LLMs. The instruction-tuned LLMs significantly outperform the baseline models with automatic metrics.

大型语言模型(LLMS)在高资源语言配对和域的机器翻译方面显示出了大有希望的成果,然而,在专业领域(例如医疗)LMS与标准的神经机翻译模型相比表现较差,术语的机器翻译的一致性对于专业领域的用户、研究人员和笔译员至关重要。在本研究中,我们比较了医疗领域的基线LMS和按指示调整的LMS之间的性能。此外,我们还将专门医学词典中的术语引入了用于微调LMS的规范格式化数据集。通过自动衡量,指导调整LMS的LMS大大超过了基线模型。

Article 128

Title@2025-07-30 (3): QE4PE: Word-level Quality Estimation for Human Post-Editing

Title: QE4PE: Word-level Quality Estimation for Human Post-Editing

QE4PE: Qualitätsschätzung auf Word-Ebene für die menschliche Nachbearbeitung

QE4PE: 计算后人类的字级质量估算 2503.03044v2

Authors (6): Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, Arianna Bisazza

Word-level quality estimation (QE) methods aim to detect erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. In this study, we investigate the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated from behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors’ speed are critical factors in determining highlights’ effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.

字级质量估计(QE)方法旨在检测机器翻译中的误差,这可以指导和促进人类编辑后编辑工作的误差。虽然对字级质量评价系统的准确性进行了广泛评估,但其可用性和对编辑后人类编辑工作的速度、质量和编辑选择的下游影响仍然研究不足。在这项研究中,我们调查了字级质量评价对机器翻译(MT)编辑后编辑的影响。我们发现,在现实环境下,有42个专业编辑后编辑跨两个翻译方向的42个专业版本。我们比较了四种错误分布突出模式,包括有监督的和不确定的字级质量评价方法,用以查明最先进的神经计量模型产出中的潜在错误。后编辑工作和生产率是从行为记录中估算出来的,而质量改进则由字级和分级人类注释评估。我们发现,域、语文和编辑速度是确定重点有效性的关键因素,人造和自动化的QE之间差异不大,突出了专业工作流程中准确性和可用性之间的差距。

Article 129

Title@2025-07-30 (3): Multilingual Political Views of Large Language Models: Identification and Steering

Title: Multilingual Political Views of Large Language Models: Identification and Steering

Mehrsprachige politische Ansichten von großen Sprachmodellen: Identifikation und Steuerung

大语言模式多语言多语言政治观点:识别和指导 2507.22623v1

Authors (6): Daniil Gurgurov, Katharina Trinley, Ivan Vykopal, Josef van Genabith, Simon Ostermann, Roberto Zamparelli

Large language models (LLMs) are increasingly used in everyday tools and applications, raising concerns about their potential influence on political views. While prior research has shown that LLMs often exhibit measurable political biases, frequently skewing toward liberal or progressive positions, key gaps remain. Most existing studies evaluate only a narrow set of models and languages, leaving open questions about the generalizability of political biases across architectures, scales, and multilingual settings. Moreover, few works examine whether these biases can be actively controlled. In this work, we address these gaps through a large-scale study of political orientation in modern open-source instruction-tuned LLMs. We evaluate seven models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using the Political Compass Test with 11 semantically equivalent paraphrases per statement to ensure robust measurement. Our results reveal that larger models consistently shift toward libertarian-left positions, with significant variations across languages and model families. To test the manipulability of political stances, we utilize a simple center-of-mass activation intervention technique and show that it reliably steers model responses toward alternative ideological positions across multiple languages. Our code is publicly available at https://github.com/d-gurgurov/Political-Ideologies-LLMs.

大型语言模型(LLM)越来越多地用于日常工具和应用,引发了人们对其可能影响政治观点的担忧。先前的研究表明,LLM往往表现出可测量的政治偏见,且常常偏向自由派或进步派立场,但仍存在关键空白。现有研究大多只评估少数模型和语言,政治偏见在不同架构、规模和多语言环境中的普遍性仍是未决问题。此外,很少有工作考察这些偏见能否被主动控制。在这项工作中,我们通过对现代开源指令微调LLM的政治取向进行大规模研究来填补这些空白。我们使用政治指南针测试(Political Compass Test),在14种语言上评估了包括LLaMA-3.1、Qwen-3和Aya-Expanse在内的7个模型,每条陈述配有11个语义等价的改写,以确保测量稳健。结果表明,较大的模型一致地向自由意志主义左翼立场偏移,且在不同语言和模型家族之间存在显著差异。为测试政治立场的可操纵性,我们采用一种简单的质心激活干预技术,并表明它能在多种语言中可靠地将模型回答引导至其他意识形态立场。我们的代码已公开:https://github.com/d-gurgurov/Political-Ideologies-LLMs
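
The abstract names a "center-of-mass activation intervention" without giving details. One common reading, assumed here, is a difference-of-means steering vector: average the hidden activations collected for two contrasting sets of responses, subtract the means, and add a scaled copy of the difference to a hidden state. The function names, the scaling factor `alpha`, and the toy 2-d activations are all hypothetical and are not taken from the authors' repository.

```python
from typing import List

Vector = List[float]


def center_of_mass(activations: List[Vector]) -> Vector:
    """Mean activation vector over a set of examples."""
    n = len(activations)
    dim = len(activations[0])
    return [sum(vec[i] for vec in activations) / n for i in range(dim)]


def steering_vector(target_acts: List[Vector], source_acts: List[Vector]) -> Vector:
    """Difference between the two centers of mass (target minus source)."""
    target_mean = center_of_mass(target_acts)
    source_mean = center_of_mass(source_acts)
    return [t - s for t, s in zip(target_mean, source_mean)]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction (alpha is an assumed knob)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations standing in for hidden states of two response groups.
target_acts = [[1.0, 0.0], [3.0, 2.0]]
source_acts = [[-1.0, 0.0], [-3.0, 2.0]]
direction = steering_vector(target_acts, source_acts)
steered = steer([0.5, 0.5], direction, alpha=0.5)
```

In a real model the vectors would come from a chosen transformer layer and the shift would be applied during the forward pass; which layer and which `alpha` to use is not stated in the abstract.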

Article 130

Title@2025-07-30 (3): Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation

Title: Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation

Sprache Arithmetik: Auf dem Weg zur systemischen Sprache Neuronenidentifikation und Manipulation

语言算术:迈向系统化的语言神经元识别与操纵 2507.22608v1

Authors (6): Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann

Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal “fallback” mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.

大型语言模型(LLM)展现出很强的多语言能力,但语言特定处理背后的神经机制仍不清楚。我们在21种类型学上多样的语言上分析了Llama-3.1-8B、Mistral-Nemo-12B以及Aya-Expanse-8B和32B中的语言特定神经元,识别出控制语言行为的神经元。利用语言激活概率熵(LAPE)方法,我们发现这些神经元聚集在较深的层中,且非拉丁文字表现出更强的专门化。相关语言共享重叠的神经元,反映了语言亲近性的内部表征。通过语言算术,即系统性的激活加法和乘法,我们引导模型停用不需要的语言并激活所需的语言,效果优于更简单的替换方法。这些干预能够在五项多语言任务中有效引导模型行为:语言强制、翻译、问答、理解和自然语言推理(NLI)。对高资源语言的操纵更为成功,而类型学相似性会提升效果。我们还证明,跨语言神经元引导能够提升下游性能,并揭示了神经元被逐步停用时语言选择的内部“回退”机制。我们的代码已公开:https://github.com/d-gurgurov/Language-Neurons-Manipulation
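
A minimal sketch of the entropy score behind LAPE, under the usual reading of the name: for one neuron, take its activation probability in each language, normalize those probabilities into a distribution over languages, and compute the entropy. Low entropy then suggests a language-specific neuron. The dictionary layout, the handling of never-active neurons, and the toy numbers are assumptions for illustration only.

```python
import math
from typing import Dict


def lape(activation_prob: Dict[str, float]) -> float:
    """Entropy of a neuron's activation probabilities, normalized across languages.

    activation_prob maps a language code to the fraction of tokens in that
    language on which the neuron fires.
    """
    total = sum(activation_prob.values())
    if total == 0:
        # A neuron that never fires carries no language signal; treat it as
        # maximally non-specific (an assumption of this sketch).
        return math.log(len(activation_prob))
    entropy = 0.0
    for p in activation_prob.values():
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


specific = lape({"de": 0.9, "fr": 0.0, "zh": 0.0})  # fires only on German
shared = lape({"de": 0.5, "fr": 0.5, "zh": 0.5})    # fires equally everywhere
```

Ranking neurons by this score and keeping the lowest-entropy ones gives a candidate set of language-specific neurons; any threshold would be a tuning choice.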

Article 131

Title@2025-07-30 (3): UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Title: UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

UI-AGILE: Verbesserung von GUI-Agenten mit effektivem Verstärkungslernen und präziser Schlussfolgerungs-Zeiterdung

UI-AGILE: 提高具有有效强化学习和精确推断时间定位的图形代理器 2507.22025v2

Authors (7): Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li

The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma for reasoning designs, ineffective reward, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework enhancing GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a Continuous Reward function to incentivize high-precision grounding; 2) a “Simple Thinking” reward to balance planning with speed and grounding accuracy; and 3) a Cropping-based Resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2. For instance, using both our proposed training and inference enhancement methods brings a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro.

多模态大语言模型(MLLM)的出现推动了图形用户界面(GUI)智能体能力的显著进步。然而,现有的GUI智能体训练和推理技术仍受困于推理设计上的两难、奖励失效和视觉噪声。为解决这些问题,我们提出UI-AGILE,这是一个在训练和推理两个阶段增强GUI智能体的综合框架。在训练方面,我们对监督微调(SFT)过程提出一套改进:1)连续奖励函数,用于激励高精度定位;2)“简单思考”奖励,用于在规划与速度及定位精度之间取得平衡;3)基于裁剪的重采样策略,用于缓解稀疏奖励问题并改进复杂任务上的学习。在推理方面,我们提出带选择的分解定位(Decomposed Grounding with Selection),这一新方法通过将图像拆分为更小、易于处理的部分,大幅提升高分辨率显示上的定位精度。实验表明,UI-AGILE在ScreenSpot-Pro和ScreenSpot-v2两个基准上取得了最先进的性能。例如,同时使用我们提出的训练和推理增强方法,在ScreenSpot-Pro上比最佳基线的定位精度提高了23%。
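
To make the idea of a continuous grounding reward concrete, here is a hedged sketch. The abstract does not give the formula, so the shape used below (1.0 at the box center, falling linearly to 0.5 at the edge, 0.0 outside) is purely an assumption, as are the function name and the box layout. The point is only that a click near the center earns more than a click that barely lands inside the target, unlike a binary hit-or-miss reward.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def continuous_grounding_reward(point: Tuple[float, float], box: Box) -> float:
    """Reward a predicted click by how close it is to the target box center."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    if not (x_min <= x <= x_max and y_min <= y <= y_max):
        return 0.0  # a miss earns nothing, as in a binary reward
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    half_w, half_h = (x_max - x_min) / 2, (y_max - y_min) / 2
    # Normalized offsets in [0, 1]; degenerate boxes count as perfectly centered.
    dx = abs(x - cx) / half_w if half_w > 0 else 0.0
    dy = abs(y - cy) / half_h if half_h > 0 else 0.0
    return 1.0 - 0.5 * max(dx, dy)


button = (0.0, 0.0, 10.0, 4.0)
center_hit = continuous_grounding_reward((5.0, 2.0), button)
edge_hit = continuous_grounding_reward((10.0, 2.0), button)
miss = continuous_grounding_reward((11.0, 2.0), button)
```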

Article 132

Title@2025-07-30 (3): BALSAM: A Platform for Benchmarking Arabic Large Language Models

Title: BALSAM: A Platform for Benchmarking Arabic Large Language Models

BALSAM: Eine Plattform für Benchmarking arabischer Großsprachenmodelle

BALSAM:阿拉伯语大语言模式基准制定平台 2507.22603v1

Authors (43): Rawan Al-Matham, Kareem Darwish, Raghad Al-Rasheed, Waad Alshammari, Muneera Alhoshan, Amal Almazrua, Asma Al Wazrah, Mais Alheraki, Firoj Alam, Preslav Nakov, Norah Alzahrani, Eman alBilali, Nizar Habash, Abdelrahman El-Sheikh, Muhammad Elmallah, Haonan Li, Hamdy Mubarak, Mohamed Anwar, Zaid Alyafeai, Ahmed Abdelali, Nora Altwairesh, Maram Hasanain, Abdulmohsen Al Thubaity, Shady Shehata, Bashar Alhafni, Injy Hamed, Go Inoue, Khalid Elmadani, Ossama Obeid, Fatima Haouari, Tamer Elsayed, Emad Alghamdi, Khalid Almubarak, Saied Alshahrani, Ola Aljarrah, Safa Alajlan, Areej Alshaqarawi, Maryam Alshihri, Sultana Alghurabi, Atikah Alzeghayer, Afrah Altamimi, Abdullah Alfaifi, Abdulrahman AlOsaimy

The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

英文大语言模型(LLMs)的进步令人印象深刻,没有在所有语文中相匹配,特别是由于数据稀缺、阿拉伯语及其方言语言多样性、形态复杂等原因,LLM在阿拉伯语方面的表现落后。阿拉伯基准的质量进一步阻碍了进展,这些基准通常依靠静态的公开数据,缺乏全面的任务覆盖,或没有提供专用的盲人测试仪平台。这给衡量实际进展和减少数据污染带来了挑战。我们在这里的目标是弥合这些差距。特别是,我们引入了BALSAM,这是一个由社区驱动的综合基准,旨在推进阿拉伯语LM的发展和评价,其中包括来自14大类的78项NLP任务,其中52K实例分为37K测试和15K开发,以及一个集中、透明的盲人评价平台。我们设想BALSAM是一个统一平台,用以制定标准,促进合作研究,以提高阿拉伯语LM能力。

Article 133

Title@2025-07-30 (3): Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation

Title: Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation

Lernen, rationale Beweise durch Verstärkungslernen für die retrieval-angereicherte Generation zu extrahieren

学习通过强化学习为检索增强生成提取合理证据 2507.15586v4

Authors (7): Xinping Zhao, Shouzheng Huang, Yan Zhong, Xinshuo Hu, Meishan Zhang, Baotian Hu, Min Zhang

Retrieval-Augmented Generation (RAG) effectively improves the accuracy of Large Language Models (LLMs). However, retrieval noises significantly impact the quality of LLMs’ generation, necessitating the development of denoising mechanisms. Previous methods extract evidence straightforwardly without explicit thinking, which risks filtering out key clues and struggles with generalization. To this end, we propose EviOmni, which learns to extract rational evidence by (1) explicitly reasoning to identify potential cues within retrieval contents first, and then (2) consciously extracting to avoid omitting any key cues helpful for answering questions. Specifically, we frame evidence reasoning and evidence extraction into one unified response for end-to-end training; apply knowledge token masks for disentanglement to derive reasoning-based and extraction-based answers; and devise three types of verifiable reward functions, including answer, length, and format, to update the model via the policy optimization algorithm. Extensive experiments on three benchmark datasets show the effectiveness of EviOmni, providing compact and high-quality evidence, improving the accuracy of downstream tasks, and promoting effective application in online RAG systems.

重新获取-增强一代(RAG)有效地提高了大语言模型(LLMs)的准确性。然而,检索噪音对LLMs的生成质量有重大影响,需要开发脱网机制。以前的方法直接地提取证据,而没有明确的思考,有可能过滤关键线索,通过一般化而挣扎。为此,我们建议EviOmni学习通过:(1) 明确推理,首先确定检索内容中的潜在线索,然后(2) 有意识地提取证据,以避免遗漏任何有助于回答问题的关键线索。具体地说,我们将证据推理和证据提取纳入终端到终端培训的统一对策中;应用知识符号面罩解密,以得出基于推理和提取的答案;设计三种可核查的奖励功能,包括答案、长度和格式,以便通过政策优化算法更新模型。关于三个基准数据集的广泛实验显示EviOmni的有效性,提供紧凑和高质量的证据,提高下游任务的准确性,并促进在线RAG系统的有效应用。
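
The three verifiable rewards named in the abstract (answer, length, format) could look roughly like the sketch below. Everything concrete here is hypothetical: the `<reason>` and `<extract>` tag names, the substring match for the answer reward, the compression-ratio length reward, and the unweighted sum. The abstract states only that such rewards exist and drive a policy optimization update.

```python
import re


def format_reward(response: str) -> float:
    """1.0 if the response is a reasoning block followed by an extraction block."""
    pattern = r"\s*<reason>.*?</reason>\s*<extract>.*?</extract>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0


def length_reward(evidence: str, retrieved: str) -> float:
    """Higher when the extracted evidence is short relative to the retrieved text."""
    if not retrieved:
        return 0.0
    return max(0.0, 1.0 - len(evidence) / len(retrieved))


def answer_reward(predicted: str, gold: str) -> float:
    """1.0 if the gold answer appears in the prediction (case-insensitive)."""
    return 1.0 if gold.strip().lower() in predicted.strip().lower() else 0.0


def total_reward(response: str, evidence: str, retrieved: str,
                 predicted: str, gold: str) -> float:
    """Unweighted sum; real weights would be a tuning choice."""
    return (format_reward(response)
            + length_reward(evidence, retrieved)
            + answer_reward(predicted, gold))


response = ("<reason>Doc 2 names the capital.</reason>"
            "<extract>Paris is the capital of France.</extract>")
score = total_reward(response, "abcd", "abcdefgh", "The answer is Paris.", "paris")
```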

Article 134

Title@2025-07-30 (3): Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

Title: Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

Die Frontier of Vision-Language Models erkunden: Eine Übersicht aktueller Methoden und Zukunftsrichtungen

探索远景-语言模型的前沿:对当前方法和未来方向的调查 2404.07214v3

Authors (5): Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, Aman Chadha

The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of the AI revolution. Nevertheless, these LLMs exhibit a notable limitation, as they are primarily adept at processing textual information. To address this constraint, researchers have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision-Language Models (VLMs). These advanced models are instrumental in tackling more intricate tasks such as image captioning and visual question answering. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs and models that both accept and produce multimodal inputs and outputs.This classification is based on their respective capabilities and functionalities in processing and generating various modalities of data.We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible, providing readers with a comprehensive understanding of its essential components. We also analyzed the performance of VLMs in various benchmark datasets. By doing so, we aim to offer a nuanced understanding of the diverse landscape of VLMs. Additionally, we underscore potential avenues for future research in this dynamic domain, anticipating further breakthroughs and advancements.

大型语言模型(LLMS)的出现大大改变了AI革命的轨迹,然而,这些LLMS显示出明显的局限性,因为它们主要擅长处理文本信息。为了解决这一制约因素,研究人员努力将视觉能力与LLMS结合起来,从而导致产生视觉语言模型(VLMS)的出现。这些先进的模型有助于处理更复杂的任务,例如图像字幕和视觉问题解答。在我们的综合调查文件中,我们深入探讨了VLMS领域的主要进展。我们的分类将VLMS分为三个不同的类别:专门用来处理视觉语言理解的模型、处理多式联运投入的模型,以产生接受和产生多种形式投入和产出的单一形式(LMS)产出和模型。这种分类基于他们在处理和生成各种数据模式方面各自的能力和功能。我们仔细地区分了每一种模型,对它的基础结构、培训数据来源以及尽可能的优势和局限性进行了广泛的分析,为读者提供了对其基本组成部分的全面理解。我们还分析了VLMS在各种基准领域研究领域取得突破性进展的成绩,从而强调了我们今后对数据前景进行突破的突破性研究的突破。

Article 135

Title@2025-07-30 (3): Efficient Continual Learning for Small Language Models with a Discrete Key-Value Bottleneck

Title: Efficient Continual Learning for Small Language Models with a Discrete Key-Value Bottleneck

Effizientes kontinuierliches Lernen für kleine Sprachmodelle mit einem diskreten Schlüsselwert-Bottleneck

基于离散键值瓶颈的小型语言模型高效持续学习 2412.08528v2

Authors (4): Andor Diera, Lukas Galke, Fabian Karl, Ansgar Scherp

Continual learning remains a challenge across various natural language processing (NLP) tasks, as models updated with new training data often risk catastrophic forgetting of previously acquired knowledge. We introduce a discrete key-value bottleneck (DKVB) for encoder-only language models, enabling efficient continual learning through localized updates. Inspired by a discrete key-value bottleneck in vision, we consider new and NLP-specific challenges. We compare different bottleneck architectures for NLP and introduce a new, task-independent initialization technique for the discrete keys. We evaluate our DKVB for NLP in four continual learning scenarios and show that it alleviates catastrophic forgetting. Our experiments demonstrate that the proposed approach achieves competitive performance compared to popular continual learning methods while incurring lower computational costs. Furthermore, we show that DKVB remains effective even in challenging single-head continual learning scenarios where no task ID is provided.

持续学习仍然是各种自然语言处理(NLP)任务中的一项挑战,因为根据新的培训数据更新的模型往往有灾难性地忘记先前获得的知识的风险。我们为只编码器语言模型引入了独立的关键值瓶颈(DKVB),通过本地更新,能够高效地持续学习。在独立的关键值瓶颈的启发下,我们考虑新的和NLP特有的挑战。我们比较了不同的关键值架构,为离散键引入了新的、任务独立的初始化技术。我们在四个持续学习情景中为NLP评估了我们的DKVB, 并表明它缓解了灾难性的遗忘。我们的实验表明,拟议方法在降低计算成本的同时,取得了与流行的持续学习方法相比的竞争性业绩。此外,我们表明即使在没有提供任务识别符的具有挑战性的单头持续学习情景中,DKVB依然有效。
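
The "localized updates" of a discrete key-value bottleneck can be illustrated with a small sketch: an encoder output is snapped to its nearest key, the value stored under that key is returned, and a training step changes only that one value. The class, the squared-distance lookup, and the plain gradient step are assumptions for illustration; the paper's architecture and key initialization are not described in enough detail in the abstract to reproduce.

```python
from typing import List

Vector = List[float]


class DiscreteKeyValueBottleneck:
    """Frozen keys, trainable values; only the selected value is ever updated."""

    def __init__(self, keys: List[Vector], value_dim: int) -> None:
        self.keys = keys
        self.values = [[0.0] * value_dim for _ in keys]

    def nearest_key(self, x: Vector) -> int:
        """Index of the key with the smallest squared Euclidean distance to x."""
        dists = [sum((a - b) ** 2 for a, b in zip(x, key)) for key in self.keys]
        return dists.index(min(dists))

    def forward(self, x: Vector) -> Vector:
        return self.values[self.nearest_key(x)]

    def update(self, x: Vector, grad: Vector, lr: float = 0.1) -> int:
        """Gradient step on the single value selected by x; returns its index."""
        idx = self.nearest_key(x)
        self.values[idx] = [v - lr * g for v, g in zip(self.values[idx], grad)]
        return idx


bottleneck = DiscreteKeyValueBottleneck(keys=[[0.0, 0.0], [10.0, 10.0]], value_dim=2)
touched = bottleneck.update([9.0, 9.0], grad=[1.0, -1.0], lr=0.5)
```

Because inputs from a new task tend to select different keys, values learned for earlier inputs stay untouched, which is the intuition behind reduced forgetting.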

Article 136

Title@2025-07-30 (3): Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning

Title: Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning

Effizientes Differentielles Privates Feintuning von LLMs durch Verstärkungslernen

通过强化学习实现高效的LLM差分隐私微调 2507.22565v1

Authors (5): Afshin Khadangi, Amir Sartipi, Igor Tchappi, Ramin Bahmani, Gilbert Fridgen

The tension between data privacy and model utility has become the defining bottleneck for the practical deployment of large language models (LLMs) trained on sensitive corpora including healthcare. Differentially private stochastic gradient descent (DP-SGD) guarantees formal privacy, yet it does so at a pronounced cost: gradients are forcibly clipped and perturbed with noise, degrading sample efficiency and final accuracy. Numerous variants have been proposed to soften this trade-off, but they all share a handicap: their control knobs are hard-coded, global, and oblivious to the evolving optimization landscape. Consequently, practitioners are forced either to over-spend privacy budget in pursuit of utility, or to accept mediocre models in order to stay within privacy constraints. We present RLDP, the first framework to cast DP optimization itself as a closed-loop control problem amenable to modern deep reinforcement learning (RL). RLDP continuously senses rich statistics of the learning dynamics and acts by selecting fine-grained per-parameter gradient-clipping thresholds as well as the magnitude of injected Gaussian noise. A soft actor-critic (SAC) hyper-policy is trained online during language model fine-tuning; it learns, from scratch, how to allocate the privacy budget where it matters and when it matters. Across more than 1,600 ablation experiments on GPT2-small, Llama-1B, Llama-3B, and Mistral-7B, RLDP delivers perplexity reductions of 1.3-30.5% (mean 5.4%) and an average 5.6% downstream utility gain. RLDP reaches each baseline’s final utility after only 13-43% of the gradient-update budget (mean speed-up 71%), all while honoring the same ($\epsilon$, $\delta$)-DP contract and exhibiting equal or lower susceptibility to membership-inference and canary-extraction attacks.

数据隐私与模型效用之间的张力,已成为在医疗等敏感语料上训练的大型语言模型(LLM)实际部署的决定性瓶颈。差分隐私随机梯度下降(DP-SGD)提供形式化的隐私保证,但代价显著:梯度被强制裁剪并加入噪声扰动,降低了样本效率和最终精度。已有许多变体试图缓和这一权衡,但它们都有一个共同缺陷:控制参数是硬编码的、全局的,且无视不断变化的优化态势。因此,实践者要么为追求效用而过度消耗隐私预算,要么为满足隐私约束而接受平庸的模型。我们提出RLDP,这是首个将DP优化本身视为闭环控制问题、并可用现代深度强化学习(RL)求解的框架。RLDP持续感知学习动态的丰富统计量,并通过选择细粒度的逐参数梯度裁剪阈值以及注入高斯噪声的幅度来采取行动。软演员-评论家(SAC)超策略在语言模型微调期间在线训练,从零开始学习在何处、何时分配隐私预算最为关键。在GPT2-small、Llama-1B、Llama-3B和Mistral-7B上的1,600多次消融实验中,RLDP将困惑度降低1.3-30.5%(平均5.4%),下游效用平均提升5.6%。RLDP仅用13-43%的梯度更新预算即可达到各基线的最终效用(平均加速71%),同时遵守相同的($\epsilon$, $\delta$)-DP约束,并且对成员推断攻击和金丝雀提取攻击的敏感性相当或更低。
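
The two DP-SGD knobs that the abstract says RLDP controls, the clipping threshold and the Gaussian noise magnitude, appear in the standard DP-SGD aggregation step sketched below. This is the textbook step with fixed, global knobs, not the RLDP controller itself; the function name and argument layout are assumptions. A learned policy would choose `clip_norm` and `noise_multiplier` per step (and per parameter group) instead of receiving them as constants.

```python
import math
import random
from typing import List, Optional


def dp_sgd_step(per_sample_grads: List[List[float]],
                clip_norm: float,
                noise_multiplier: float,
                rng: Optional[random.Random] = None) -> List[float]:
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    rng = rng or random.Random()
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale down only gradients whose L2 norm exceeds the threshold.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(grad):
            total[i] += x * scale
    # Noise standard deviation is tied to the clipping threshold.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_sample_grads) for v in noisy]


grads = [[3.0, 4.0], [0.0, 0.5]]
# With the noise switched off, the effect of clipping alone is visible.
clipped_mean = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.0)
noisy_mean = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.0,
                         rng=random.Random(0))
```

The first gradient has norm 5 and is scaled to norm 1, while the second is already inside the threshold and passes through unchanged.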

Article 137

Title@2025-07-30 (3): Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Title: Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Nutzung synergistischer Kognitiv-Biasen zur Umgehung der Sicherheit in LLMs

利用协同认知偏差绕过LLM的安全机制 2507.22564v1

Authors (5): Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu

Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases – systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.

大型语言模型(LLMS)展示了在一系列广泛任务中令人印象深刻的能力,然而,其安全机制仍然容易受到利用认知偏差 – – 系统偏离理性判断的系统偏差 – – 的对抗性攻击。与以往侧重于迅速工程或算法操纵的侵入性做法不同,这项工作凸显了在破坏LLM保障措施方面多偏见互动的被忽视力量。我们提议了CognitiveAttack,这是一个新型的红色组合框架,系统地利用个人和综合认知偏差。通过整合受监督的微调和强化学习,CognitiveAttack生成了闪烁,将最佳偏差组合嵌入其中,有效绕过安全协议,同时保持高攻击成功率。实验结果揭示了30个不同LMSM的显著脆弱性,特别是在开放源模型中。ConnitiveAtack实现了比SOTA黑箱方法PAP(60.1%对31.6%)要高得多的攻击成功率,暴露了当前防御机制的关键限制。这些发现突出了多偏见相互作用,作为强大但未被充分利用的攻击矢控矢量媒介。这项工作通过连接认知科学和LLM安全系统,为新的学科视角。

Article 138

Title@2025-07-30 (3): Rationale-guided Prompting for Knowledge-based Visual Question Answering

Title: Rationale-guided Prompting for Knowledge-based Visual Question Answering

Rationale-geführte Aufforderung zur wissensbasierten visuellen Fragebeantwortung

面向基于知识的视觉问答的理据引导提示 2412.16936v2

Authors (4): Zhongjian Hu, Peng Yang, Bing Li, Fengyuan Liu

Recently, Large Language Models (LLMs) have been used for knowledge-based Visual Question Answering (VQA). Despite the encouraging results of previous studies, prior methods prompt LLMs to predict answers directly, neglecting intermediate thought processes. We argue that prior methods do not sufficiently activate the capacities of LLMs. We propose a framework called PLRH that Prompts LLMs with Rationale Heuristics for knowledge-based VQA. The PLRH prompts LLMs with Chain of Thought (CoT) to generate rationale heuristics, i.e., intermediate thought processes, and then leverages the rationale heuristics to inspire LLMs to predict answers. Experiments show that our approach outperforms the existing baselines by more than 2.2 and 2.1 on OK-VQA and A-OKVQA, respectively.

最近,大型语言模型(LLMs)被用于知识型视觉问答(VQA),尽管以往的研究取得了令人鼓舞的成果,但以往的方法促使LMs直接预测答案,忽略了中间思维过程。我们争辩说,以往的方法不足以激活LLMs的能力。我们提议了一个称为PLRH的框架,用基于知识的VQA的推理法刺激LLMs。PLRH促使具有思维链的LLMs产生理论超常,即中间思维过程,然后利用理论超常法来激励LMs预测答案。实验表明,我们的方法分别超过基于 OK-VQA和A-OKVQA的现有基线2.2和2.1。
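
The two-stage prompting described in the abstract (first elicit a rationale, then use it to predict the answer) can be sketched as follows. The prompt wording, the use of an image caption as textual context, and the `llm` callable are all hypothetical stand-ins; the abstract does not say how the image is presented to the LLM or what the prompts look like.

```python
from typing import Callable, List, Tuple


def rationale_prompt(question: str, caption: str) -> str:
    """Stage 1: ask for intermediate reasoning only."""
    return (f"Context: {caption}\nQuestion: {question}\n"
            "Let's think step by step and give a short rationale.")


def answer_prompt(question: str, caption: str, rationale: str) -> str:
    """Stage 2: ask for the answer, with the rationale supplied as a hint."""
    return (f"Context: {caption}\nQuestion: {question}\n"
            f"Rationale: {rationale}\nAnswer in a few words:")


def plrh_answer(llm: Callable[[str], str], question: str,
                caption: str) -> Tuple[str, str]:
    """Run both stages and return (rationale, answer)."""
    rationale = llm(rationale_prompt(question, caption))
    answer = llm(answer_prompt(question, caption, rationale))
    return rationale, answer


# A stub standing in for a real model call, so the flow can be exercised.
calls: List[str] = []


def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return "Umbrellas block rain." if len(calls) == 1 else "rain"


rationale, answer = plrh_answer(stub_llm, "What is the object used for?",
                                "A person holds an umbrella.")
```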

Article 139

Title: Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection

Co-AttenDWG: Co-Attentive Dimension-Wise-Gating und Expertenfusion für Multi-Modal-Offensive Content Detection

Co-AttenDWG:用于多模态冒犯性内容检测的协同注意力逐维门控与专家融合 2505.19010v2

Authors (4): Md. Mithun Hossain, Md. Shakil Hossain, Sudipto Chaki, M. F. Mridha

Multi-modal learning has emerged as a crucial research direction, as integrating textual and visual information can substantially enhance performance in tasks such as classification, retrieval, and scene understanding. Despite advances with large pre-trained models, existing approaches often suffer from insufficient cross-modal interactions and rigid fusion strategies, failing to fully harness the complementary strengths of different modalities. To address these limitations, we propose Co-AttenDWG, co-attention with dimension-wise gating, and expert fusion. Our approach first projects textual and visual features into a shared embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This is further strengthened by a dimension-wise gating network, which adaptively modulates feature contributions at the channel level to emphasize salient information. In parallel, dual-path encoders independently refine modality-specific representations, while an additional cross-attention layer aligns the modalities further. The resulting features are aggregated via an expert fusion module that integrates learned gating and self-attention, yielding a robust unified representation. Experimental results on the MIMIC and SemEval Memotion 1.0 datasets show that Co-AttenDWG achieves state-of-the-art performance and superior cross-modal alignment, highlighting its effectiveness for diverse multi-modal applications.

多模态学习已成为关键的研究方向,因为融合文本和视觉信息能够显著提升分类、检索和场景理解等任务的性能。尽管大型预训练模型取得了进展,现有方法往往存在跨模态交互不足和融合策略僵化的问题,未能充分利用不同模态的互补优势。为解决这些局限,我们提出Co-AttenDWG,它结合了协同注意力、逐维门控和专家融合。我们的方法首先将文本和视觉特征投影到共享嵌入空间,在其中由专门的协同注意力机制实现模态间同步的细粒度交互。逐维门控网络进一步强化了这一点,它在通道层面自适应地调节特征贡献,以突出显著信息。与此同时,双路径编码器独立地细化各模态的特定表征,另有一个交叉注意力层进一步对齐模态。所得特征通过专家融合模块进行聚合,该模块结合了可学习的门控和自注意力,生成稳健的统一表征。在MIMIC和SemEval Memotion 1.0数据集上的实验结果表明,Co-AttenDWG取得了最先进的性能和更优的跨模态对齐,凸显了其在多种多模态应用中的有效性。
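
Dimension-wise gating, as opposed to a single scalar gate, means each channel gets its own mixing weight between the two modalities. The sketch below uses one assumed parameterization (a per-channel sigmoid over a weighted sum of the text and image features); the paper's actual gating network is not specified in the abstract, and all parameter names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dimension_wise_gate(text: Vector, image: Vector,
                        w_text: Vector, w_image: Vector, bias: Vector) -> Vector:
    """Mix two aligned feature vectors with an independent gate per channel."""
    fused = []
    for t, v, wt, wv, b in zip(text, image, w_text, w_image, bias):
        gate = sigmoid(wt * t + wv * v + b)       # in (0, 1), one per channel
        fused.append(gate * t + (1.0 - gate) * v)  # convex mix of the modalities
    return fused


text_feat = [2.0, 4.0]
image_feat = [0.0, 0.0]
zeros = [0.0, 0.0]
# Zero parameters give a gate of 0.5 everywhere, i.e. a plain average.
averaged = dimension_wise_gate(text_feat, image_feat, zeros, zeros, zeros)
# A large bias on channel 0 only lets text dominate that channel alone.
mixed = dimension_wise_gate(text_feat, image_feat, zeros, zeros, [50.0, 0.0])
```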

Article 140

Title@2025-07-30 (3): ControlMed: Adding Reasoning Control to Medical Language Model

Title: ControlMed: Adding Reasoning Control to Medical Language Model

ControlMed: Reasoning Control in das medizinische Sprachmodell aufnehmen

ControlMed:为医疗语言模型增加推理控制 2507.22545v1

Authors (4): Sung-Min Lee, Siyoon Lee, Juyeon Kim, Kyungmin Roh

Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain, as the life-critical nature of clinical decision-making demands reliable support. Despite these advancements, existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency. These limitations hinder their practical deployment in real-world clinical environments. To address these challenges, we introduce ControlMed, a medical language model that enables users to actively control the length of the reasoning process at inference time through fine-grained control markers. ControlMed is trained through a three-stage pipeline: 1) pre-training on a large-scale synthetic medical instruction dataset covering both direct and reasoning responses; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; and 3) reinforcement learning with model-based reward signals to enhance factual accuracy and response quality. Experimental results on a variety of English and Korean medical benchmarks demonstrate that our model achieves similar or better performance compared to state-of-the-art models. Furthermore, users can flexibly balance reasoning accuracy and computational efficiency by controlling the reasoning length as needed. These findings demonstrate that ControlMed is a practical and adaptable solution for clinical question answering and medical information analysis.

具有更高准确性和可解释性的推理型大型语言模型(LLM)正越来越多地应用于医疗领域,因为临床决策攸关生命,需要可靠的支持。尽管取得了这些进展,现有的推理型LLM常常生成不必要的冗长推理过程,导致显著的计算开销和响应延迟。这些局限阻碍了它们在真实临床环境中的实际部署。为应对这些挑战,我们提出ControlMed,这是一个医疗语言模型,使用户能够在推理时通过细粒度的控制标记主动控制推理过程的长度。ControlMed通过三阶段流程训练:1)在涵盖直接回答和推理回答的大规模合成医疗指令数据集上进行预训练;2)使用多长度推理数据和显式长度控制标记进行监督微调;3)使用基于模型的奖励信号进行强化学习,以提升事实准确性和回答质量。在多种英语和韩语医疗基准上的实验结果表明,我们的模型取得了与最先进模型相当或更好的性能。此外,用户可以根据需要控制推理长度,灵活地平衡推理准确性与计算效率。这些发现表明,ControlMed是临床问答和医疗信息分析的实用且适应性强的解决方案。

Article 141

Title@2025-07-30 (3): Pre-trained Models Perform the Best When Token Distributions Follow Zipf’s Law

Title: Pre-trained Models Perform the Best When Token Distributions Follow Zipf’s Law

Vortrainierte Modelle führen das Beste aus, wenn Token-Distributionen Zipfs Gesetz folgen

词元分布遵循Zipf定律时,预训练模型表现最佳 2507.22543v1

Authors (3): Yanjin He, Qingkai Zeng, Meng Jiang

Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal vocabulary size remains underexplored, typically relying on heuristics or dataset-specific choices. In this work, we propose a principled method for determining the vocabulary size by analyzing token frequency distributions through Zipf’s law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness. Extensive experiments across NLP, genomics, and chemistry demonstrate that models consistently achieve peak performance when the token distribution closely adheres to Zipf’s law, establishing Zipfian alignment as a robust and generalizable criterion for vocabulary size selection.

词元化(tokenization)是自然语言处理(NLP)及其他序列建模领域的基本步骤,其中词表大小的选择对模型性能有显著影响。尽管十分重要,最优词表大小的选择仍未得到充分研究,通常依赖启发式方法或针对特定数据集的选择。在这项工作中,我们提出一种有原则的方法,通过Zipf定律分析词元频率分布来确定词表大小。我们表明,下游任务性能与词元分布遵循幂律行为的程度相关,并且与Zipf式缩放保持一致能同时提升模型的效率和效果。在NLP、基因组学和化学领域的大量实验表明,当词元分布紧密遵循Zipf定律时,模型始终达到最佳性能,从而确立了Zipf对齐作为词表大小选择的稳健且可推广的准则。
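
One simple way to measure how closely a token distribution follows Zipf's law is a least-squares line through the log-rank versus log-frequency points: a slope near -1 with a high R² indicates Zipfian behavior. The sketch below does exactly that and nothing more; whether the authors use this particular fit or another goodness-of-fit measure is not stated in the abstract, so treat the function and its outputs as an illustrative assumption.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def zipf_fit(tokens: Iterable[str]) -> Tuple[float, float]:
    """Return (slope, r_squared) of log(frequency) regressed on log(rank)."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # A perfectly flat frequency profile is fit exactly by a horizontal line.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Synthetic corpus whose r-th most frequent token occurs 1200 / r times.
corpus = []
for rank in range(1, 7):
    corpus += [f"tok{rank}"] * (1200 // rank)
slope, r_squared = zipf_fit(corpus)
```

Running the same fit on corpora tokenized with different vocabulary sizes would give one candidate per size to compare, in the spirit of the selection criterion the abstract describes.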

Article 142

Title@2025-07-30 (3): A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support

Title: A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support

Benchmark Dataset und Evaluation Framework für vietnamesische Großsprachenmodelle im Kundensupport

越南客户支助大语言模式基准数据集和评价框架 2507.22542v1

Authors (9): Long S. T. Nguyen, Truong P. Hua, Thanh M. Nguyen, Toan Q. Pham, Nam K. Ngo, An X. Nguyen, Nghi D. M. Pham, Nghia H. Nguyen, Tho T. Quan

With the rapid growth of Artificial Intelligence, Large Language Models (LLMs) have become essential for Question Answering (QA) systems, improving efficiency and reducing human workload in customer service. The emergence of Vietnamese LLMs (ViLLMs) highlights lightweight open-source models as a practical choice for their accuracy, efficiency, and privacy benefits. However, domain-specific evaluations remain limited, and the absence of benchmark datasets reflecting real customer interactions makes it difficult for enterprises to select suitable models for support applications. To address this gap, we introduce the Customer Support Conversations Dataset (CSConDa), a curated benchmark of over 9,000 QA pairs drawn from real interactions with human advisors at a large Vietnamese software company. Covering diverse topics such as pricing, product availability, and technical troubleshooting, CSConDa provides a representative basis for evaluating ViLLMs in practical scenarios. We further present a comprehensive evaluation framework, benchmarking 11 lightweight open-source ViLLMs on CSConDa with both automatic metrics and syntactic analysis to reveal model strengths, weaknesses, and linguistic patterns. This study offers insights into model behavior, explains performance differences, and identifies key areas for improvement, supporting the development of next-generation ViLLMs. By establishing a robust benchmark and systematic evaluation, our work enables informed model selection for customer service QA and advances research on Vietnamese LLMs. The dataset is publicly available at https://huggingface.co/datasets/ura-hcmut/Vietnamese-Customer-Support-QA.

随着人工智能的快速发展,大型语言模型(LLM)已成为问答(QA)系统的关键,在客户服务中提高了效率并减少了人力工作量。越南语LLM(ViLLM)的出现,凸显了轻量级开源模型凭借其准确性、效率和隐私优势成为实用选择。然而,特定领域的评估仍然有限,且缺乏反映真实客户交互的基准数据集,使企业难以为支持类应用选择合适的模型。为填补这一空白,我们提出客户支持对话数据集(CSConDa),这是一个经过整理的基准,包含9,000多个问答对,取自一家大型越南软件公司与人工顾问的真实交互。CSConDa涵盖定价、产品供应和技术故障排查等多种主题,为在实际场景中评估ViLLM提供了有代表性的基础。我们还提出一个综合评估框架,在CSConDa上用自动指标和句法分析对11个轻量级开源ViLLM进行基准测试,以揭示模型的优势、弱点和语言模式。本研究提供了对模型行为的洞察,解释了性能差异,并指出了需要改进的关键方面,为下一代ViLLM的开发提供支持。通过建立稳健的基准和系统化的评估,我们的工作使客户服务问答的模型选择有据可依,并推动了越南语LLM的研究。数据集已公开:https://huggingface.co/datasets/ura-hcmut/Vietnamese-Customer-Support-QA

Article 143

Title@2025-07-30 (3): Training language models to be warm and empathetic makes them less reliable and more sycophantic

Title: Training language models to be warm and empathetic makes them less reliable and more sycophantic

Training Sprachmodelle warm und einfühlsam zu sein macht sie weniger zuverlässig und sykophantischer

训练语言模型变得温暖且富有同理心会使其更不可靠、更谄媚 2507.21919v2

Authors (3): Lujain Ibrahim, Franziska Sofia Hafner, Luc Rocher

Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we show how this creates a significant trade-off: optimizing language models for warmth undermines their reliability, especially when users express vulnerability. We conducted controlled experiments on five language models of varying sizes and architectures, training them to produce warmer, more empathetic responses, then evaluating them on safety-critical tasks. Warm models showed substantially higher error rates (+10 to +30 percentage points) than their original counterparts, promoting conspiracy theories, providing incorrect factual information, and offering problematic medical advice. They were also significantly more likely to validate incorrect user beliefs, particularly when user messages expressed sadness. Importantly, these effects were consistent across different model architectures, and occurred despite preserved performance on standard benchmarks, revealing systematic risks that current evaluation practices may fail to detect. As human-like AI systems are deployed at an unprecedented scale, our findings indicate a need to rethink how we develop and oversee these systems that are reshaping human relationships and social interaction.

人工智能(AI)开发者正在越来越多地用热和同情的人来建立语言模型,数百万人现在使用这些模型来提供咨询、治疗和陪伴。在这里,我们展示了这如何创造出一个重大的权衡:优化语言模型以换取温暖会破坏其可靠性,特别是当用户表示脆弱性时。我们对五种不同大小和结构的语言模型进行了有控制的实验,培训它们以产生更温暖、更同情的反应,然后对安全关键任务进行评估。热模型显示的误差率(+10至+30百分点)大大高于原来的对应方,推广阴谋理论,提供不正确的事实信息,并提供有问题的医疗建议。它们也极有可能验证不正确的用户信仰,特别是当用户信息表达悲哀时。重要的是,这些影响在不同模型结构之间是一致的,而且尽管在标准基准上保持了业绩,暴露了当前评价做法可能无法检测的系统性风险。由于像人类一样的人工智能系统以前所未有的规模部署,我们的调查结果表明有必要重新思考我们如何发展和监督这些正在改变人类关系和社会互动的系统。

Article 144

Title@2025-07-30 (3): CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records

Title: CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records

CliCARE: Grounding Large Language Models in klinischen Richtlinien zur Entscheidungsunterstützung über Longitudinal Cancer Electronic Health Records

CliCARE:在纵向癌症电子健康记录决策支持临床指南中以大语言模式为基础 2507.22533v1

Authors (6): Dongchen Li, Jitao Liang, Wei Li, Xiaoyu Wang, Longbing Cao, Kun Yu

Large Language Models (LLMs) hold significant promise for improving clinical decision support and reducing physician burnout by synthesizing complex, longitudinal cancer Electronic Health Records (EHRs). However, their implementation in this critical field faces three primary challenges: the inability to effectively process the extensive length and multilingual nature of patient records for accurate temporal analysis; a heightened risk of clinical hallucination, as conventional grounding techniques such as Retrieval-Augmented Generation (RAG) do not adequately incorporate process-oriented clinical guidelines; and unreliable evaluation metrics that hinder the validation of AI systems in oncology. To address these issues, we propose CliCARE, a framework for Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records. The framework operates by transforming unstructured, longitudinal EHRs into patient-specific Temporal Knowledge Graphs (TKGs) to capture long-range dependencies, and then grounding the decision support process by aligning these real-world patient trajectories with a normative guideline knowledge graph. This approach provides oncologists with evidence-grounded decision support by generating a high-fidelity clinical summary and an actionable recommendation. We validated our framework using large-scale, longitudinal data from a private Chinese cancer dataset and the public English MIMIC-IV dataset. In these diverse settings, CliCARE significantly outperforms strong baselines, including leading long-context LLMs and Knowledge Graph-enhanced RAG methods. The clinical validity of our results is supported by a robust evaluation protocol, which demonstrates a high correlation with assessments made by expert oncologists.

大型语言模型(LLMS)为改进临床决策支持和减少医生的耗竭带来了重要前景,因为通过综合综合复杂、纵向癌症电子健康记录(EHRs),可以改善临床决策支持和减少医生的耗竭。然而,在这一重要领域实施这些模型面临三大挑战:无法有效处理患者记录的广泛长度和多语言性质,以便进行准确的时间分析;临床幻觉风险增加,因为Retreearval-Auged General(RAG)等传统基底技术没有适当纳入以流程为导向的临床指南;以及不可靠的评估指标,妨碍对人工智能系统在肿瘤学方面的验证。为了解决这些问题,我们提议CliCARE(CARE),一个在临床支持决定支持的临床大语言模型中定位大模范框架。该框架通过将无结构的、纵向EHR(TKGs)转化为针对特定患者的时空知识图(TKGG)来捕捉取远程依赖性,然后将决策支持进程的基础是将这些现实世界病人的多样性轨迹与规范性指导知识图表相匹配。这种方法为科学家提供了强有力的证据基础基础评估支持, 包括高层次的临床模型,通过高层次的临床模型数据模型,我们通过高层次的临床模型和高层次的临床模型数据模拟的临床模型数据分析, 显示了中国的临床模型数据模型数据。

Article 145

Title@2025-07-30 (3): Yankari: A Monolingual Yoruba Dataset

Title: Yankari: A Monolingual Yoruba Dataset

Yankari: Einsprachiger Yoruba-Datensatz

Yankari:单语Yoruba数据集 2412.03334v2

Authors (1): Maro Akpobi

This paper presents Yankari, a large-scale monolingual dataset for the Yoruba language, aimed at addressing the critical gap in Natural Language Processing (NLP) resources for this important West African language. Despite being spoken by over 30 million people, Yoruba has been severely underrepresented in NLP research and applications. We detail our methodology for creating this dataset, which includes careful source selection, automated quality control, and rigorous data cleaning processes. The Yankari dataset comprises 51,407 documents from 13 diverse sources, totaling over 30 million tokens. Our approach focuses on ethical data collection practices, avoiding problematic sources and addressing issues prevalent in existing datasets. We provide thorough automated evaluations of the dataset, demonstrating its quality compared to existing resources. The Yankari dataset represents a significant advancement in Yoruba language resources, providing a foundation for developing more accurate NLP models, supporting comparative linguistic studies, and contributing to the digital accessibility of the Yoruba language.

本文介绍Yoruba语言的大型单一语言数据集Yankari,这是一个大型的Yoruba语单一语言数据集,旨在解决这一重要的西非语言在自然语言处理资源(NLP)资源方面存在的重大差距。尽管有3 000多万人发言,Yoruba在NLP的研究和应用中的代表性严重不足。我们详细介绍了我们创建这一数据集的方法,其中包括仔细选择来源、自动化质量控制和严格的数据清理程序。Yankari数据集由13个不同来源的51 407份文件组成,总计超过3 000万个符号。我们的方法侧重于道德数据收集做法,避免问题源和解决现有数据集中普遍存在的问题。我们提供了对数据集的彻底自动评估,表明其质量与现有资源相比。Yankari数据集代表了Yoruba语言资源的重大进步,为开发更准确的NLP模型、支持比较语言研究以及帮助Yoruba语言数字无障碍提供了基础。

Article 146

Title@2025-07-30 (3): Probing Information Distribution in Transformer Architectures through Entropy Analysis

Title: Probing Information Distribution in Transformer Architectures through Entropy Analysis

Probing Information Distribution in Transformer-Architekturen durch Entropie-Analyse

通过熵分析探查Transformer架构中的信息分布 2507.15347v2

Authors (5): Amedeo Buonanno, Alessandro Rivetti, Francesco A. N. Palmieri, Giovanni Di Gennaro, Gianmarco Romano

This work explores entropy analysis as a tool for probing information distribution within Transformer-based architectures. By quantifying token-level uncertainty and examining entropy patterns across different stages of processing, we aim to investigate how information is managed and transformed within these models. As a case study, we apply the methodology to a GPT-based large language model, illustrating its potential to reveal insights into model behavior and internal representations. This approach may also contribute to the development of interpretability and evaluation frameworks for transformer-based models.

这项工作探索了作为在以变压器为基础的结构内进行信息传播的检验工具的酶分析。通过量化象征性的不确定性和审查不同处理阶段的酶型态,我们的目标是调查如何在这些模型内管理和转换信息。作为案例研究,我们将该方法应用到以GPT为基础的大型语言模型中,说明其揭示对模型行为和内部表现的洞察力的潜力。这一方法可以提供对模型行为的洞察力,并有助于为以变压器为基础的模型制定可解释性和评估框架。
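The token-level uncertainty mentioned in the Article 146 abstract can be illustrated with a minimal sketch. It assumes the quantity of interest is the Shannon entropy of a next-token distribution obtained from logits by softmax; the function names and the toy logits are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy, in bits, of the distribution implied by the logits."""
    probs = softmax(logits)
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# Hypothetical logits over a 4-token vocabulary.
uniform = token_entropy([0.0, 0.0, 0.0, 0.0])   # maximal uncertainty: 2 bits
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])   # confident prediction: close to 0 bits
print(uniform, peaked)
```

Applying such a function to the distribution read out at each layer or position would give one entropy profile per token; how the authors obtain those intermediate distributions is not stated in the abstract.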

Article 147

Title@2025-07-30 (3): SLM-SQL: An Exploration of Small Language Models for Text-to-SQL

Title: SLM-SQL: An Exploration of Small Language Models for Text-to-SQL

SLM-SQL: Eine Erforschung kleiner Sprachmodelle für Text-zu-SQL

SLM-SQL:探索用于文本到SQL的小型语言模型 2507.22478v1

Authors (2): Lei Sheng, Shuai-Shuai Xu

Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). In contrast, small language models (SLMs) ranging from 0.5B to 1.5B parameters currently underperform on Text-to-SQL tasks due to their limited logical reasoning capabilities. However, SLMs offer inherent advantages in inference speed and suitability for edge deployment. To explore their potential in Text-to-SQL applications, we leverage recent advancements in post-training techniques. Specifically, we used the open-source SynSQL-2.5M dataset to construct two derived datasets: SynSQL-Think-916K for SQL generation and SynSQL-Merge-Think-310K for SQL merge revision. We then applied supervised fine-tuning and reinforcement learning-based post-training to the SLM, followed by inference using a corrective self-consistency approach. Experimental results validate the effectiveness and generalizability of our method, SLM-SQL. On the BIRD development set, the five evaluated models achieved an average improvement of 31.4 points. Notably, the 0.5B model reached 56.87% execution accuracy (EX), while the 1.5B model achieved 67.08% EX. We will release our dataset, model, and code on GitHub: https://github.com/CycloneBoy/slm_sql.

大语言模型(LLM)在将自然语言问题转换为SQL查询(Text-to-SQL)方面表现出色。相比之下,参数规模在0.5B至1.5B之间的小语言模型(SLM)由于逻辑推理能力有限,目前在Text-to-SQL任务上表现不佳。然而,SLM在推理速度和边缘部署适用性方面具有固有优势。为了探索其在Text-to-SQL应用中的潜力,我们利用了后训练技术的最新进展。具体而言,我们使用开源的SynSQL-2.5M数据集构建了两个衍生数据集:用于SQL生成的SynSQL-Think-916K和用于SQL合并修订的SynSQL-Merge-Think-310K。随后,我们对SLM进行了监督微调和基于强化学习的后训练,并采用校正式自一致性方法进行推理。实验结果验证了我们的方法SLM-SQL的有效性和泛化能力。在BIRD开发集上,五个被评估的模型平均提升了31.4分。值得注意的是,0.5B模型的执行准确率(EX)达到56.87%,而1.5B模型的EX达到67.08%。我们将在GitHub上发布数据集、模型和代码:https://github.com/CycloneBoy/slm_sql。
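The inference step in the Article 147 abstract relies on self-consistency. The sketch below shows plain execution-based self-consistency voting, not the authors' corrective variant, whose details the abstract does not give. The table, the candidate queries, and the helper names `execute` and `vote` are hypothetical.

```python
import sqlite3
from collections import Counter

def execute(conn, sql):
    """Run a query; return an order-insensitive, hashable result, or None on failure."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))

def vote(conn, candidates):
    """Return the first candidate whose execution result is the most frequent one."""
    results = [(sql, execute(conn, sql)) for sql in candidates]
    valid = [(sql, res) for sql, res in results if res is not None]
    if not valid:
        return None  # every sampled query failed to execute
    counts = Counter(res for _, res in valid)
    winner, _ = counts.most_common(1)[0]
    return next(sql for sql, res in valid if res == winner)

# Toy database and sampled candidates (stand-ins for model outputs).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ana", 31), ("Bo", 24), ("Cy", 45)])
candidates = [
    "SELECT name FROM singer WHERE age > 30",
    "SELECT name FROM singer WHERE age >= 31",
    "SELECT name FROM singer WHERE age > 40",
    "SELECT nme FROM singer",  # invalid column, discarded before voting
]
best = vote(conn, candidates)
print(best)
```

Voting on execution results rather than on query strings lets syntactically different but equivalent queries support each other, which is the usual motivation for this design.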

Article 148

Title@2025-07-30 (3): Exploring Dynamic Parameters for Vietnamese Gender-Independent ASR

Title: Exploring Dynamic Parameters for Vietnamese Gender-Independent ASR

Dynamische Parameter für vietnamesische geschlechtsunabhängige ASR erkunden

探索越南性别独立ASR的动态参数 2507.22964v1

Authors (4): Sotheara Leang, Éric Castelli, Dominique Vaufreydaz, Sethserey Sam

The dynamic characteristics of the speech signal provide temporal information and play an important role in enhancing Automatic Speech Recognition (ASR). In this work, we characterized the acoustic transitions in a ratio plane of Spectral Subband Centroid Frequencies (SSCFs) using polar parameters to capture the dynamic characteristics of the speech and minimize spectral variation. These dynamic parameters were combined with Mel-Frequency Cepstral Coefficients (MFCCs) in Vietnamese ASR to capture more detailed spectral information. The SSCF0 was used as a pseudo-feature for the fundamental frequency (F0) to describe the tonal information robustly. The findings showed that the proposed parameters significantly reduce word error rates and exhibit greater gender independence than the baseline MFCCs.

语音信号的动态特征提供了时间信息,并在加强自动语音识别方面发挥着重要作用。在这项工作中,我们用极地参数来捕捉语音动态特征并尽量减少光谱变异,用光谱信号的动态特征对光谱子波段中枢比例平面的声学转变作了描述。这些动态参数与越南ASR的Mel-Funity Cepstraal Covalies(MFCCs)相结合,以捕捉更详细的光谱信息。SSCF0被用作基本频率(F0)的假性功能,以有力地描述古典信息。调查结果显示,拟议的参数大大降低了字差率,并显示出比基线MFCC更多的性别独立性。
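The Article 148 abstract reports its results as word error rates. As a reminder of how that metric is usually computed, here is a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words). The function name and the example sentences are illustrative and unrelated to the paper's data.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.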

Article 149

Title@2025-07-30 (3): Voices of Freelance Professional Writers on AI: Limitations, Expectations, and Fears

Title: Voices of Freelance Professional Writers on AI: Limitations, Expectations, and Fears

Stimmen freiberuflicher Schriftsteller über KI: Einschränkungen, Erwartungen und Ängste

自由职业作家对AI的看法:局限、期望与担忧 2504.05008v2

Authors (4): Anastasiia Ivanova, Natalia Fedorova, Sergei Tilga, Ekaterina Artemova

The rapid development of AI-driven tools, particularly large language models (LLMs), is reshaping professional writing. Still, key aspects of their adoption, such as language support, ethics, and long-term impact on writers' voice and creativity, remain underexplored. In this work, we conducted a questionnaire (N = 301) and an interactive survey (N = 36) targeting professional writers who regularly use AI. We examined LLM-assisted writing practices across 25+ languages, ethical concerns, and user expectations. The findings of the survey provide important insights into LLM adoption by non-English speakers; the degree of misinformation, domain and style adaptation; and the usability and key features of LLMs. These insights can guide further development, benefiting both writers and a broader user base.

AI驱动工具的迅速发展,特别是大型语言模型(LLMS),正在改变专业写作,但是,采用这些工具的关键方面,如语言支持、道德和对作家声音和创造力的长期影响,仍然没有得到充分探讨,在这项工作中,我们经常使用AI对专业作家进行了问卷调查(N=301)和互动式调查(N=36),我们审查了LLM协助撰写25+种语言的做法、道德问题和用户期望,调查结果显示了重要的深刻见解,反映了以下做法的重要性:为非英语者采用LLMs;错误信息、领域和风格的适应程度;LLMs的实用性和关键特征。这些见解可以指导进一步发展,使作家和更广泛的用户基础受益。

Article 150

Title@2025-07-30 (3): IFEvalCode: Controlled Code Generation

Title: IFEvalCode: Controlled Code Generation

IFEvalCode: Kontrollierte Code-Generierung

IFEvalCode:受控制的代码生成 2507.22462v1

Authors (12): Jian Yang, Wei Zhang, Shukai Liu, Linzheng Chai, Yingshui Tan, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, Guanglin Niu, Zhoujun Li, Binyuan Hui, Junyang Lin

Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed requirements such as coding style, line count, and structural constraints, beyond mere correctness. To address this, the paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs in controlled code generation, ensuring outputs align more closely with human-defined guidelines. The authors further present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages (Python, Java, JavaScript, TypeScript, Shell, C++, and C#), with each sample featuring both Chinese and English queries. Unlike existing benchmarks, IFEvalCode decouples evaluation into two metrics: correctness (Corr.) and instruction-following (Instr.), enabling a more nuanced assessment. Experiments on over 40 LLMs reveal that closed-source models outperform open-source ones in controllable code generation and highlight a significant gap between the models’ ability to generate correct code versus code that precisely follows instructions.

守则大语言模型(Code LLMS)通过将自然语言描述转换成功能代码,在代码生成方面取得了显著进展;然而,现实世界应用往往要求严格遵循详细要求,如编码样式、行数和结构限制,而不仅仅是正确性;为此,本文件介绍了前向和后向制约生成,以提高守则LLM在受控代码生成中的遵循指令能力,确保产出与人类定义指南更加一致。作者还介绍了IFEvalCode,这是一个多语言基准,包括了7种编程语言(Python、Java、JavaScript、TypeScript、Shell、C++和C#)的1.6K测试样本,其中每种样本都包含中英查询。与现有的基准不同,IFEvalCode decouples评价分为两个衡量尺度:正确性(Corr.)和教学跟踪(Instr.),能够进行更细致的评估。对40多个LLMS进行了实验,表明封闭源模型在可控代码生成中超越开放源,并突出模型在生成代码与代码指示之间的巨大差距。
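The Article 150 abstract separates evaluation into correctness (Corr.) and instruction-following (Instr.). A minimal sketch of that decoupling follows; the two checks (unit-test cases, a line-count limit and a required keyword) are hypothetical examples of constraints, not the benchmark's actual checkers.

```python
def check_correctness(source, func_name, cases):
    """Corr.: does the generated function pass every (args, expected) case?"""
    namespace = {}
    try:
        exec(source, namespace)  # run the generated code in an isolated dict
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False

def check_instructions(source, max_lines, required_substring):
    """Instr.: does the code respect the stated constraints, regardless of behavior?"""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    return len(lines) <= max_lines and required_substring in source

# A functionally correct sample that violates a hypothetical one-line limit.
sample = "def add(a, b):\n    return a + b\n"
corr = check_correctness(sample, "add", [((1, 2), 3), ((0, 0), 0)])
instr = check_instructions(sample, max_lines=1, required_substring="return")
print(corr, instr)
```

Scoring the two properties independently makes cases like this one visible: the sample counts toward Corr. but not toward Instr.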

Article 151

Title@2025-07-30 (3): FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training

Title: FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training

FineMedLM-o1: Verbesserung des medizinischen Wissens, das die Fähigkeit von LLM vom überwachten Feintuning bis zum Test-Time Training begründet

FineMedLM-o1:提高LLM从监督的精密教学到试验时间培训的医疗知识能力 2501.09213v3

Authors (9): Hongzhou Yu, Tianhao Cheng, Yingwen Wang, Wen He, Qing Wang, Ying Cheng, Yuejie Zhang, Rui Feng, Xiaobo Zhang

Recent advancements in large language models (LLMs) have shown promise in medical applications such as disease diagnosis and treatment planning. However, most existing medical LLMs struggle with the deep reasoning required for complex medical problems, such as differential diagnosis and medication recommendations. We propose FineMedLM-o1, which leverages high-quality medical synthetic data and long-form reasoning data for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), enabling advanced dialogue and deep reasoning capabilities. Additionally, we introduce Test-Time Training (TTT) in the medical domain for the first time, facilitating domain adaptation and ensuring reliable, accurate reasoning. Experimental results demonstrate that FineMedLM-o1 achieves a 23% average performance improvement over prior models on key medical benchmarks. Furthermore, the introduction of TTT provides an additional 14% performance boost, highlighting its effectiveness in enhancing medical reasoning capabilities. To support this process, we also propose a novel method for synthesizing medical dialogue. Compared to other open-source datasets, our dataset stands out as superior in both quality and complexity. The project and data will be released on GitHub.

在大型语言模型(LLMS)的最近进展在疾病诊断和治疗规划等医疗应用方面显示出了希望,然而,大多数现有的医疗LMS都与复杂医疗问题所需的深刻推理,如不同诊断和药物建议等,进行了斗争。我们建议FineMedLM-o1,利用高质量的医学合成数据和长式推理数据,用于监督的精密配制(SFT)和直接优化(DPO),使先进的对话和深入推理能力成为可能。此外,我们首次在医疗领域引入了试验时间培训(TTTT),便利了域的适应,并确保了可靠和准确的推理。实验结果显示FineMedLM-o1比以前的关键医疗基准模型平均提高了23%的性能改进。此外,TT的引入提供了额外的14%的性能提升,突出其在提高医疗推理能力方面的效力。为了支持这一进程,我们还提出了一种新型的方法,用于综合医疗对话。与其他公开源数据集相比,我们的数据集在质量和复杂性上都处于优势。项目和数据将在GiHub上公布数据。

Article 152

Title@2025-07-30 (3): What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models

Title: What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models

Was ist ein “Abstract Reasoner”? Experimenten und Argumenten über große Sprachmodelle nachzuvollziehen

什么是“抽象理由” ? 关于大语言模型的重新审视实验和争论 2507.22457v1

Authors (3): Tian Yun, Chen Sun, Ellie Pavlick

Recent work has argued that large language models (LLMs) are not “abstract reasoners”, citing their poor zero-shot performance on a variety of challenging tasks as evidence. We revisit these experiments in order to add nuance to the claim. First, we show that while LLMs indeed perform poorly in a zero-shot setting, even tuning a small subset of parameters for input encoding can enable near-perfect performance. However, we also show that this finetuning does not necessarily transfer across datasets. We take this collection of empirical results as an invitation to (re-)open the discussion of what it means to be an “abstract reasoner”, and why it matters whether LLMs fit the bill.

最近的工作认为,大型语言模型(LLMs)不是“抽象的理性”,以其在各种具有挑战性的任务上表现差的零点作为证据。我们重新审视这些实验,以便给索赔增加细微差别。首先,我们表明,虽然LLMs在零点情况下的表现确实很差,但即使调整一小撮输入编码参数也能产生接近完美的效果。然而,我们还表明,这种微调并不一定会跨越数据集。我们把收集的经验结果当作邀请(重新)开诚布公地讨论“简单理性”的含义,以及为什么LLMs是否适合帐单的问题。

Article 153

Title@2025-07-30 (3): Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance

Title: Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance

Falcon-H1: Eine Familie hybrider Sprachmodelle zur Neudefinition von Effizienz und Leistung

Falcon-H1:调整效率和绩效的混合语言模式家庭 2507.22448v1

Authors (27): Jingwei Zuo, Maksim Velikanov, Ilyas Chahed, Younes Belkada, Dhia Eddine Rhayem, Guillaume Kunsch, Hakim Hacid, Hamza Yous, Brahim Farhat, Ibrahim Khadraoui, Mugariya Farooq, Giulia Campesan, Ruxandra Cojocaru, Yasser Djilali, Shi Hu, Iheb Chaabane, Puneesh Khanna, Mohamed El Amine Seddik, Ngoc Dung Huynh, Phuc Le Khac, Leen AlQadi, Billel Mokeddem, Mohamed Chami, Abdalgader Abubaker, Mikhail Lubinets, Kacper Piskorski, Slim Frikha

In this report, we introduce Falcon-H1, a new series of large language models (LLMs) featuring hybrid architecture designs optimized for both high performance and efficiency across diverse use cases. Unlike earlier Falcon models built solely on Transformer or Mamba architectures, Falcon-H1 adopts a parallel hybrid approach that combines Transformer-based attention with State Space Models (SSMs), known for superior long-context memory and computational efficiency. We systematically revisited model design, data strategy, and training dynamics, challenging conventional practices in the field. Falcon-H1 is released in multiple configurations, including base and instruction-tuned variants at 0.5B, 1.5B, 1.5B-deep, 3B, 7B, and 34B parameters. Quantized instruction-tuned models are also available, totaling over 30 checkpoints on Hugging Face Hub. Falcon-H1 models demonstrate state-of-the-art performance and exceptional parameter and training efficiency. The flagship Falcon-H1-34B matches or outperforms models up to 70B scale, such as Qwen3-32B, Qwen2.5-72B, and Llama3.3-70B, while using fewer parameters and less data. Smaller models show similar trends: the Falcon-H1-1.5B-Deep rivals current leading 7B-10B models, and Falcon-H1-0.5B performs comparably to typical 7B models from 2024. These models excel across reasoning, mathematics, multilingual tasks, instruction following, and scientific knowledge. With support for up to 256K context tokens and 18 languages, Falcon-H1 is suitable for a wide range of applications. All models are released under a permissive open-source license, underscoring our commitment to accessible and impactful AI research.

在本报告中,我们引入了Falcon-H1系列新的大型语言模型(LLMs),该系列是混合型结构(LLMs),其设计优化,以适应不同用途案例中的高性能和效率。与以前完全建在变压器或Mamba结构上的Falcon-H1模型不同,Falcon-H1采用了一种平行混合式方法,将基于变压器的注意力与国家空间模型(SSMS)相结合,后者以高长的长文记忆和计算效率著称。我们系统地重新审视了模型设计、数据战略和培训动态,对外地的常规做法提出了挑战。Fal-H1-34B匹配或超越了常规做法。Falconom-H1以多种组合形式发布,包括0.5B、1.5B、1.5B-deep、3B、7B、7B和指示调整变量变量变压变换变量。 Qwent-32B的调整型指令模式也可用Qwen-72B和34B的参数调整模型,同时以较低的当前趋势显示IM-10的运行模式。

Article 154

Title@2025-07-30 (3): AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini

Title: AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini

KI-generierte Geschichten begünstigen Stabilität gegenüber Veränderung: Homogenität und kulturelle Stereotypisierung in Erzählungen, die von gpt-4o-mini erzeugt werden

AI产生的故事有利于稳定而不是变化:在gpt-4o-mini产生的叙事中,同质性和文化陈规定型 2507.22445v1

Authors (2): Jill Walker Rettberg, Hermann Wigers

Can a language model trained largely on Anglo-American texts generate stories that are culturally relevant to other nationalities? To find out, we generated 11,800 stories - 50 for each of 236 countries - by sending the prompt “Write a 1500 word potential {demonym} story” to OpenAI’s model gpt-4o-mini. Although the stories do include surface-level national symbols and themes, they overwhelmingly conform to a single narrative plot structure across countries: a protagonist lives in or returns home to a small town and resolves a minor conflict by reconnecting with tradition and organising community events. Real-world conflicts are sanitised, romance is almost absent, and narrative tension is downplayed in favour of nostalgia and reconciliation. The result is a narrative homogenisation: an AI-generated synthetic imaginary that prioritises stability above change and tradition above growth. We argue that the structural homogeneity of AI-generated narratives constitutes a distinct form of AI bias, a narrative standardisation that should be acknowledged alongside the more familiar representational bias. These findings are relevant to literary studies, narratology, critical AI studies, NLP research, and efforts to improve the cultural alignment of generative AI.

在英美文本方面受过培训的语言模式能否产生在文化上与其他民族相关的故事?为了发现,我们通过向OpenAI的模型gpt-4o-mini发送“写写1500个单词潜在{demonom}故事”的提示,产生了11,800个故事——236个国家中每个国家各有50个——通过向OpenAI的模型发送“写出1,500个单词潜在{demonom}故事 {demonom}故事”。虽然故事确实包括地平面上的国家符号和主题,但它们绝大多数符合各国单一的叙事情节结构:主角生活在小城镇或回到小城镇,通过重新与传统联系和组织社区活动来解决小冲突。现实世界冲突已经消毒化,浪漫几乎不存在,叙述紧张被淡化,有利于怀旧与和解。结果是叙述式同质化:人工合成的想象,其优先特征高于变化和传统,高于增长。我们争辩说,AI产生的叙事的结构性同质构成一种独特的AI偏见形式,一种叙述标准化,应该与更熟悉的表述偏见一起得到承认。这些结论与文学研究、词学、批判性AI研究、批判性研究、NLP的基因调整和努力有关。

Article 155

Title@2025-07-30 (3): BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition

Title: BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition

BERSting at the Screams: Ein Maßstab für distanzierte, emotionale und erschrockene Spracherkennung

尖叫时发出尖叫声:远程、情感和呼喊语音识别基准 2505.00059v2

Authors (9): Paige Tuttösí, Mantaj Dhillon, Luna Sang, Shane Eastwood, Poorvi Bhatia, Quang Minh Dinh, Avni Kapoor, Yewon Jin, Angelica Lim

Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance on many reported metrics. Yet they continue to struggle in complex, real-world situations, such as distanced speech. Previous challenges have released datasets to address distanced ASR; however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors' homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were placed in 19 different positions, including behind obstructions and in a different room from the actor. This data is publicly available and can be used to evaluate a variety of speech recognition tasks, including ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades with increasing distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks, and continued work is needed to improve the robustness of such systems for more accurate real-world use.

部分语音识别任务(如自动语音识别,ASR)在许多已报告的指标上正在接近或已经达到人类水平。然而,它们在复杂的真实场景(例如远距离语音)中仍然表现不佳。以往的挑战赛已发布数据集来解决远距离ASR问题,但其重点仍主要集中在距离上,尤其依赖多麦克风阵列系统。本文提出B(asic) E(motion) R(andom phrase) S(hou)t(s)(BERSt)数据集。该数据集包含来自98名演员的近4小时英语语音,这些演员带有不同的地区口音和非母语口音。数据通过智能手机在演员家中采集,因此至少包含98种不同的声学环境。数据还包含7种不同的情绪提示,以及喊叫和正常说话两种话语。智能手机被放置在19个不同的位置,包括有遮挡物的位置以及与演员不在同一房间的位置。该数据集公开可用,可用于评估多种语音识别任务,包括ASR、喊叫检测和语音情感识别(SER)。我们为ASR和SER任务提供了初始基准,发现ASR的性能随距离和喊叫程度的增加而下降,并且因目标情绪的不同而表现各异。结果表明,BERSt数据集对ASR和SER任务都具有挑战性,仍需继续努力提高此类系统的鲁棒性,以便在真实世界中更准确地使用。

Article 156

Title@2025-07-30 (3): Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation

Title: Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation

Mmm whatcha sagen? Enthüllen distale und proximale Kontexteffekte in der ersten und zweiten Sprache Wort Wahrnehmung mit psychophysischen umgekehrten Korrelation

使用心理物理反向关系,在第一和第二语言的词感中产生未发现和预期的环境效应 2406.05515v2

Authors (7): Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim

Acoustic context effects, where surrounding changes in pitch, rate or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases around different pairs of vowels for second language (L2) speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a data-driven manner, the prosodic profiles that bias their perception. Testing English and French speakers (n=25), we showed that vowel perception is in fact influenced by conflicting effects from the surrounding pitch and speech rate: a congruent proximal effect 0.2s pre-target and a distal contrastive effect up to 1s before; and found that L1 and L2 speakers exhibited strikingly similar prosodic profiles in perception. We provide a novel method to investigate acoustic context effects across stimuli, timescales, and acoustic domain.

声频背景效应,即声频、速率或音调的变化影响声音感知的周围变化,在语音感知中有详细记载,但与语言背景互动的方式仍不明朗。我们采用反向关系方法,系统地将音频和语音率在第二语言(L2)讲英语(/i/-I/)和法语(/u/-/y/)的不同配音词(L2)周围的语调(L2)周围的语调(L2)上,用英语(/i/-I/)和法语(/u//y/)和法语(/y/),从而以数据驱动的方式重建偏向其感知的立体特征。测试英语和法语演讲者(n=25),我们表明,元音调感其实受周围音频和语调率的相冲突效应影响:准准准效果(0.2s)预准效果和偏差效果达到1;发现L1和L2讲者在认知中表现出惊人相似的Prosodic剖面特征。我们提供了一种新型方法,用于调查音学背景、时间尺度和声学域间影响。

Article 157

Title@2025-07-30 (3): NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models

Title: NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models

NeedleChain: Messung der Intact-Langkontext-Begründungsfähigkeit großer Sprachmodelle

NeedleChain:衡量大语言模型完整的长上下文推理能力 2507.22411v1

Authors (2): Hyeonseok Moon, Heuiseok Lim

The Needle-in-a-Haystack (NIAH) benchmark is widely used to evaluate Large Language Models' (LLMs) ability to understand long contexts (LC). It evaluates the capability to identify query-relevant context within extensive query-irrelevant passages. Although this method serves as a widely accepted standard for evaluating long-context understanding, our findings suggest it may overestimate the true LC capability of LLMs. We demonstrate that even state-of-the-art models such as GPT-4o struggle to fully incorporate a given context made up solely of ten query-relevant sentences. In response, we introduce a novel benchmark, NeedleChain, where the context consists entirely of query-relevant information, requiring the LLM to fully grasp the input to answer correctly. Our benchmark allows for flexible context length and reasoning order, offering a more comprehensive analysis of LLM performance. Additionally, we propose an extremely simple yet compelling strategy to improve the LC understanding capability of LLMs: ROPE Contraction. Our experiments with various advanced LLMs reveal a notable disparity between their ability to process large contexts and their capacity to fully understand them. Source code and datasets are available at https://github.com/hyeonseokk/NeedleChain

虽然这种方法是评价长期理解的一种广泛接受的标准,但我们的研究结果表明,甚至诸如GPT-4o等最先进的模型也广泛用于评价大语言模型(LLMs)了解长处(LLMs)理解长处(LLC)的能力。它评价了在广泛的与查询有关的段落中查明与查询有关背景的能力。虽然这种方法是评价长处理解的一种广泛接受的标准,但我们的研究结果表明,它可能高估LCms的真正LC能力。我们表明,甚至GPT-4o等最先进的模型也被用来完整地纳入由纯查询相关的十句话组成的特定环境。我们与各高级LMS的实验显示,它们处理大处环境的能力与完全理解投入的能力之间存在明显差距。我们的基准允许灵活的背景长度和推理顺序,对LLMMs绩效进行更全面的分析。此外,我们提出了一个非常简单而又令人信服的战略来提高LCs的LM能力:ROPE Contrationion。我们在各种高级LMs的实验显示,它们在处理大处所处理大处的能力与它们完全理解AMs/MHAHIRCSD的数据和数据之间有明显差距。

Article 158

Title@2025-07-30 (3): Question Generation for Assessing Early Literacy Reading Comprehension

Title: Question Generation for Assessing Early Literacy Reading Comprehension

Fragegenerierung für die Bewertung des frühen Leseverständnisses

评估早期阅读读写能力读写能力的提问一代 2507.22410v1

Authors (3): Xiaocheng Yang, Sumuk Shashidhar, Dilek Hakkani-Tur

Assessment of reading comprehension through content-based interactions plays an important role in the reading acquisition process. In this paper, we propose a novel approach for generating comprehension questions geared to K-2 English learners. Our method ensures complete coverage of the underlying material and adaptation to the learner’s specific proficiencies, and can generate a large diversity of question types at various difficulty levels to ensure a thorough evaluation. We evaluate the performance of various language models in this framework using the FairytaleQA dataset as the source material. Eventually, the proposed approach has the potential to become an important part of autonomous AI-driven English instructors.

通过基于内容的互动对阅读理解进行评估,在阅读获取过程中发挥了重要作用。在本文件中,我们提出了针对K-2英语学习者提出理解问题的新办法。我们的方法确保基础材料的完整覆盖和适应学习者的具体能力,并能够在不同的困难级别产生多种多样的问题类型,以确保进行彻底评估。我们利用FairtytaleQA数据集作为原始材料,评估本框架中各种语言模式的绩效。最终,拟议办法有可能成为AI驱动的自主英语教员的重要组成部分。

Article 159

Title@2025-07-30 (3): R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

Title: R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

R2-KG:知识图表可靠理由通用双重目的机构框架 2502.12767v6

Authors (4): Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi

Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks still suffer from two practical drawbacks: they must be re-tuned whenever the KG or reasoning task changes, and they depend on a single, high-capacity LLM for reliable (i.e., trustworthy) reasoning. To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design is cost-efficient for LLM inference while still maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence is collected from the KG, which significantly enhances reliability. Experiments across five diverse benchmarks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of the LLMs used as the Operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher-than-baseline reliability with reduced inference cost but an increased abstention rate in complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning, reducing reliance on high-capacity LLMs while ensuring trustworthy inference. The code is available at https://github.com/ekrxjwh2009/R2-KG/.

近期研究将大语言模型(LLM)与知识图谱(KG)相结合以增强推理,在无需额外训练的情况下提高推理准确率并减轻幻觉。然而,现有框架仍存在两个实际缺陷:每当KG或推理任务发生变化时都必须重新调整,并且依赖单一的高容量LLM来实现可靠(即可信)的推理。为此,我们提出R2-KG,这是一个即插即用的双智能体框架,将推理划分为两个角色:负责收集证据的Operator(低容量LLM)和负责做出最终判断的Supervisor(高容量LLM)。这一设计在保持较高推理准确率的同时降低了LLM推理成本。此外,R2-KG采用弃权(Abstention)机制,仅在从KG中收集到充分证据时才生成答案,从而显著提高了可靠性。在五个不同基准上的实验表明,无论用作Operator的LLM本身能力如何,R2-KG在准确率和可靠性上都始终优于基线。进一步的实验表明,配备严格自一致性策略的单智能体版R2-KG以更低的推理成本取得了显著高于基线的可靠性,但在复杂KG上弃权率有所上升。我们的研究结果表明,R2-KG是一种灵活且具成本效益的KG推理方案,可在确保推理可信的同时减少对高容量LLM的依赖。代码见 https://github.com/ekrxjwh2009/R2-KG/。
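The division of labor described in the Article 159 abstract (an Operator that gathers evidence, a Supervisor that judges, and abstention when evidence is insufficient) can be sketched with rule-based stand-ins. Everything here is an assumption for illustration: the toy graph, the function names, and the use of plain Python functions in place of the two LLMs.

```python
# Toy knowledge graph: (head, relation) -> tail. Purely illustrative.
KG = {
    ("Marie Curie", "born_in"): "Warsaw",
    ("Warsaw", "capital_of"): "Poland",
}

def operator(entity, relations):
    """Low-capacity role: walk the KG hop by hop and collect the triples found."""
    evidence = []
    for rel in relations:
        tail = KG.get((entity, rel))
        if tail is None:
            break  # the chain cannot be continued
        evidence.append((entity, rel, tail))
        entity = tail
    return evidence

def supervisor(evidence, relations):
    """High-capacity role: answer only when every hop is supported, else abstain."""
    if not relations or len(evidence) < len(relations):
        return None  # abstention
    return evidence[-1][2]

def answer(entity, relations):
    return supervisor(operator(entity, relations), relations)

print(answer("Marie Curie", ["born_in", "capital_of"]))
print(answer("Marie Curie", ["died_in"]))
```

Returning `None` instead of a guess is what trades coverage for reliability: fewer questions are answered, but each answer is backed by retrieved triples.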

Article 160

Title@2025-07-30 (3): Reservoir Computing as a Language Model

Title: Reservoir Computing as a Language Model

Reservoir Computing als Sprachmodell

作为语言模式的 “ 储量计算 “ 模式 2507.15779v2

Authors (2): Felix Köster, Atsushi Uchida

Large Language Models (LLMs) have dominated the science and media landscape due to their impressive performance in processing large chunks of data and producing human-like text. Nevertheless, their huge energy demand and slow processing remain a bottleneck for further increasing quality while also making the models accessible to everyone. To solve this bottleneck, we investigate how reservoir computing performs on natural text processing, which could enable fast and energy-efficient hardware implementations. Studies investigating the use of reservoir computing as a language model remain sparse. In this paper, we compare three distinct approaches for character-level language modeling: two different reservoir computing approaches, where only an output layer is trainable, and the well-known transformer-based architectures, which fully learn an attention-based sequence representation. We explore the performance, computational cost and prediction accuracy for both paradigms by equally varying the number of trainable parameters for all models. Using a consistent pipeline for all three approaches, we demonstrate that transformers excel in prediction quality, whereas reservoir computers remain highly efficient, reducing training and inference time. Furthermore, we investigate two types of reservoir computing: a traditional reservoir with a static linear readout, and an attention-enhanced reservoir that dynamically adapts its output weights via an attention mechanism. Our findings underline how these paradigms scale and offer guidelines to balance resource constraints with performance.

大型语言模型(LLM)在科学和媒体景观中占据了主导地位,在处理大量数据和制作人文文本方面的业绩令人印象深刻。然而,其巨大的能源需求和缓慢的处理仍然是进一步提高质量的瓶颈,同时也使每个人都可以使用模型。为了解决这一瓶颈,我们将调查储油层计算如何在自然文本处理中发挥作用,这可以快速和节能地实施硬件。研究储油层计算作为一种语言模型的使用仍然很少。在本文件中,我们比较了三个不同的性格语言模型使用方法:两种不同的储油层计算方法,即只有产出层可以培训的两种不同的储油层计算方法,以及众所周知的基于变压器的结构,这些结构充分学习基于关注的顺序代表。我们探索这两种模式的性能、计算成本和预测准确性,通过对所有模型的可培训参数数量进行同样的差异。我们用所有三种方法一致的管道证明变压器在预测质量方面是优秀的,而储油层计算机仍然非常高效地减少培训和推导速度。此外,我们调查了两种储油层计算方法:一种传统的储油层储油层储层储油层和固定的直线式结构,我们通过感重的阅读了它们如何调整了它们。
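To make the "only an output layer is trainable" idea concrete, here is a minimal character-level sketch of a traditional reservoir with a static linear readout. It is not the authors' pipeline: the reservoir size, weight scaling, softmax readout, and plain online gradient descent are all assumptions chosen to keep the example short and dependency-free.

```python
import math
import random


def make_reservoir(n_in, n_res, seed=0):
    """Build fixed random input and recurrent weights (never trained)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_res)]
    # Small recurrent weights so the influence of old inputs fades over time.
    scale = 0.5 / math.sqrt(n_res)
    w_res = [[rng.uniform(-scale, scale) for _ in range(n_res)] for _ in range(n_res)]
    return w_in, w_res


def step(state, char_index, w_in, w_res):
    """One reservoir update for a one-hot input character."""
    return [
        math.tanh(w_in[i][char_index] + sum(w * s for w, s in zip(w_res[i], state)))
        for i in range(len(state))
    ]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_readout(text, n_res=30, epochs=30, lr=0.1):
    """Train only the linear readout to predict the next character.

    Returns the mean cross-entropy per epoch.
    """
    chars = sorted(set(text))
    index = {c: i for i, c in enumerate(chars)}
    w_in, w_res = make_reservoir(len(chars), n_res)
    w_out = [[0.0] * n_res for _ in chars]  # the only trainable weights
    losses = []
    for _ in range(epochs):
        state = [0.0] * n_res
        total = 0.0
        for cur, nxt in zip(text, text[1:]):
            state = step(state, index[cur], w_in, w_res)
            probs = softmax([sum(w * s for w, s in zip(row, state)) for row in w_out])
            total -= math.log(probs[index[nxt]])
            # Gradient of cross-entropy w.r.t. the readout weights.
            for k, row in enumerate(w_out):
                grad = probs[k] - (1.0 if k == index[nxt] else 0.0)
                for j in range(n_res):
                    row[j] -= lr * grad * state[j]
        losses.append(total / (len(text) - 1))
    return losses
```

Calling `train_readout("abcabcabcabcabcabc")` returns one loss value per epoch. Because `w_in` and `w_res` stay fixed, the number of trainable parameters is just the size of `w_out`, which is what keeps training cheap in this family of models.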

Article 161

Title@2025-07-30 (3): PATENTWRITER: A Benchmarking Study for Patent Drafting with LLMs

Title: PATENTWRITER: A Benchmarking Study for Patent Drafting with LLMs

PATENTWRITER: Eine Benchmarking-Studie für die Patenterstellung mit LLMs

PATENTWRITER: 专利起草基准研究与LLMs 2507.22387v1

Authors (3): Homaira Huda Shomee, Suman Kalyan Maity, Sourav Medya

Large language models (LLMs) have emerged as transformative approaches in several important fields. This paper aims for a paradigm shift for patent writing by leveraging LLMs to overcome the tedious patent-filing process. In this work, we present PATENTWRITER, the first unified benchmarking framework for evaluating LLMs in patent abstract generation. Given the first claim of a patent, we evaluate six leading LLMs – including GPT-4 and LLaMA-3 – under a consistent setup spanning zero-shot, few-shot, and chain-of-thought prompting strategies to generate the abstract of the patent. Our benchmark PATENTWRITER goes beyond surface-level evaluation: we systematically assess the output quality using a comprehensive suite of metrics – standard NLP measures (e.g., BLEU, ROUGE, BERTScore), robustness under three types of input perturbations, and applicability in two downstream patent classification and retrieval tasks. We also conduct stylistic analysis to assess length, readability, and tone. Experimental results show that modern LLMs can generate high-fidelity and stylistically appropriate patent abstracts, often surpassing domain-specific baselines. Our code and dataset are open-sourced to support reproducibility and future research.

大型语言模型(LLMS)已成为若干重要领域的变革方法。本文件旨在通过利用LLMS来利用LLMs来克服无聊的专利过滤程序,实现专利写作范式的转变。在这项工作中,我们介绍PATENTWRITER,这是在专利抽象生成过程中评价LMS的第一个统一基准框架。鉴于第一项专利主张,我们根据一个涵盖零发、少发和一连串的激励策略的一致设置,对六大LMS – – 包括GPT-4和LLLAMA-3 – – 进行了评价,这六大LLMS(包括GPT-4和LLAMA-3)进行了评价,以产生专利的抽象。我们的基准PATENTWRITER超越了地表一级的评估:我们系统评估产出质量的方法包括一套综合的计量尺度 – – 标准NLP措施(例如,BLEU、ROUGE、BERSTScore),在三种类型的投入扰动作用下,以及两种下专利分类和检索任务的适用性。我们还进行了模拟分析,以评估性分析,以评估性分析,以评估长度、可读性和调和调度。实验性分析结果显示,现代LMSMSDMs能够产生高的专利基础和对未来数据库基础和升级性支持。

Article 162

Title@2025-07-30 (3): OWLViz: An Open-World Benchmark for Visual Question Answering

Title: OWLViz: An Open-World Benchmark for Visual Question Answering

OWLViz: Ein Open-World-Benchmark für visuelle Fragen

OWLViz:视觉问答的开放世界基准 2503.07631v3

Authors (6): Thuy Nguyen, Dang Nguyen, Hoang Nguyen, Thuan Luong, Long Hoang Dang, Viet Dac Lai

We present a challenging benchmark for the Open WorLd VISual question answering (OWLViz) task. OWLViz presents concise, unambiguous queries that require integrating multiple capabilities, including visual understanding, web exploration, and specialized tool usage. While humans achieve 69.2% accuracy on these intuitive tasks, even state-of-the-art VLMs struggle, with the best model, Gemini 2.0, achieving only 26.6% accuracy. Current agentic VLMs, which rely on limited vision and vision-language models as tools, perform even worse. This performance gap reveals significant limitations in multimodal systems’ ability to select appropriate tools and execute complex reasoning sequences, establishing new directions for advancing practical AI research.

我们为Open WorLd Visual 答题(OWLViz)的任务提出了一个具有挑战性的基准。 OWLViz 给出了简明、明确的询问,要求整合多种能力,包括视觉理解、网络探索和专门工具使用。虽然人类在这些直观任务上实现了69.2%的准确性,但即使是最先进的VLM,其最佳模型是Gemini 2.0, 其准确性仅为26.6%。目前依赖有限的愿景和愿景语言模型作为工具的VLMs,其表现更差。这一绩效差距揭示了多式联运系统在选择适当工具和执行复杂推理序列、为推进实际的AI研究确定新方向方面的巨大局限性。

Article 163

Title@2025-07-30 (3): Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

Title: Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

Multimodale LLMs als maßgeschneiderte Reward-Modelle für die Text-zu-Image-Generierung

以多式多式LLMs作为生成文字到图像的自定制奖励模型 2507.21391v2

Authors (8): Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen

We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality by analyzing text responses, which is time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and inference-time scaling in text-to-image generations.

我们提出LLaVA-Reward,这是一种高效的奖励模型,利用预训练的多模态大语言模型(MLLM),从多个角度自动评估文本到图像(T2I)的生成结果。现有基于MLLM的方法需要指令跟随数据进行监督微调,并通过分析文本回答来评估生成质量,既耗时又难以训练。为解决这一问题,我们提出LLaVA-Reward,在给定文本-图像对的情况下直接利用MLLM的隐藏状态。为增强仅解码器MLLM中视觉与文本表示之间的双向交互,我们进一步提出加入跳跃连接交叉注意力(SkipCA)模块。该设计将早期层的视觉特征与后期层的隐藏表示相连接,从而增强文本-图像相关性推理。此外,LLaVA-Reward支持使用不同类型的偏好数据进行高效微调,包括成对偏好数据和非成对数据。我们在四个评估角度上训练LLaVA-Reward:文本-图像对齐、保真度/伪影、安全性和总体排序。实验结果表明,在为自动评估和文本到图像生成的推理时扩展生成与人类一致的评分方面,LLaVA-Reward优于传统方法和基于MLLM的方法。

Article 164

Title@2025-07-30 (3): BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

Title: BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

BlockFFN: Auf dem Weg zur End-Side Acceleration-Friendly Mixture-of-Experts mit Chunk-Level-Aktivierung Sparsity

块块FFN: 向具有整块级激活分级的终端- 双极加速- 友好混合混合专家方向 2507.08771v2

Authors (8): Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun

To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67$\times$ speedup over dense models on real end-side devices. All code and checkpoints are publicly available (https://github.com/thunlp/BlockFFN).

为减轻大语言模型(LLM)的计算负担,以混合专家(MoE)为代表的激活稀疏架构受到越来越多的关注。然而,原始MoE不可微且不灵活的路由会损害模型性能。此外,尽管每个token只激活少量参数,这类稀疏激活架构的块级稀疏度却很低,即多个连续token的并集会激活很大比例的参数。这种稀疏模式不利于低资源条件(如端侧设备)下的加速,也与主流加速技术(如推测解码)不兼容。为应对这些挑战,我们提出一种新的MoE架构BlockFFN,以及相应的高效训练与部署技术。具体而言,我们使用结合ReLU激活和RMSNorm的路由器,实现可微且灵活的路由。其次,为同时提升token级稀疏度(TLS)和块级稀疏度(CLS),我们设计了CLS感知的训练目标,使BlockFFN更易于加速。最后,我们实现了高效的加速内核,首次将激活稀疏与推测解码相结合。实验结果表明,BlockFFN的性能优于其他MoE基线,TLS超过80%,8-token CLS达到70%。在真实端侧设备上,我们的内核相对稠密模型最高可实现3.67倍加速。所有代码和检查点均已公开(https://github.com/thunlp/BlockFFN)。
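The distinction between token-level and chunk-level sparsity in the abstract above can be illustrated with a small calculation. The sketch below assumes a binary activation mask (tokens by experts) and non-overlapping chunks; both choices, and the function names, are assumptions for illustration and may not match the paper's exact definitions.

```python
def token_level_sparsity(mask):
    """Fraction of experts left inactive, averaged over tokens.

    `mask[t][e]` is 1 if token t activates expert e, else 0.
    """
    n_experts = len(mask[0])
    active = sum(sum(row) for row in mask)
    return 1.0 - active / (len(mask) * n_experts)


def chunk_level_sparsity(mask, chunk):
    """Fraction of experts left inactive by the union of `chunk` consecutive tokens.

    Uses non-overlapping chunks and ignores a trailing partial chunk
    (an assumption made for this sketch).
    """
    n_experts = len(mask[0])
    ratios = []
    for start in range(0, len(mask) - chunk + 1, chunk):
        union = [
            any(mask[t][e] for t in range(start, start + chunk))
            for e in range(n_experts)
        ]
        ratios.append(sum(union) / n_experts)
    return 1.0 - sum(ratios) / len(ratios)
```

For a mask in which each of four tokens activates a different one of four experts, the token-level sparsity is 0.75, while the 2-token chunk-level sparsity drops to 0.5 and the 4-token value to 0.0. This is the gap the abstract describes: high per-token sparsity does not imply that consecutive tokens share the same active parameters.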

Article 165

Title@2025-07-30 (3): Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors

Title: Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors

Traits Run Deep: Verbesserung der Persönlichkeitsbeurteilung durch psychologisch geführte LLM-Darstellungen und multimodale Scheinverhalten

深层轨迹:通过心理学辅导LLM代表和多模式亲善行为,加强个性评估 2507.22367v1

Authors (7): Jia Li, Yichao He, Jiacheng Xu, Tianhao Luo, Zhenzhen Hu, Richang Hong, Meng Wang

Accurate and reliable personality assessment plays a vital role in many fields, such as emotional intelligence, mental health diagnostics, and personalized education. Unlike fleeting emotions, personality traits are stable, often subconsciously leaked through language, facial expressions, and body behaviors, with asynchronous patterns across modalities. It has been hard to model personality semantics with traditional superficial features, and achieving effective cross-modal understanding has seemed impossible. To address these challenges, we propose a novel personality assessment framework called Traits Run Deep. It employs psychology-informed prompts to elicit high-level personality-relevant semantic representations. In addition, it devises a Text-Centric Trait Fusion Network that anchors rich text semantics to align and integrate asynchronous signals from other modalities. Specifically, this fusion module includes a Chunk-Wise Projector to decrease dimensionality, a Cross-Modal Connector and a Text Feature Enhancer for effective modality fusion, and an ensemble regression head to improve generalization in data-scarce situations. To our knowledge, we are the first to apply personality-specific prompts to guide large language models (LLMs) in extracting personality-aware semantics for improved representation quality. Furthermore, extracting and fusing audio-visual apparent behavior features further improves the accuracy. Experimental results on the AVI validation set have demonstrated the effectiveness of the proposed components, i.e., approximately a 45% reduction in mean squared error (MSE). Final evaluations on the test set of the AVI Challenge 2025 confirm our method’s superiority, ranking first in the Personality Assessment track. The source code will be made available at https://github.com/MSA-LMC/TraitsRunDeep.

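准确可靠的人格评估在情绪智力、心理健康诊断和个性化教育等许多领域发挥着重要作用。与转瞬即逝的情绪不同,人格特质是稳定的,常常通过语言、面部表情和肢体行为在潜意识中流露,并且在各模态之间呈现非同步的模式。利用传统的浅层特征难以对人格语义建模,有效的跨模态理解也似乎难以实现。为应对这些挑战,我们提出一个名为Traits Run Deep的新型人格评估框架。它使用基于心理学知识的提示来引出与人格相关的高层语义表示。此外,它设计了一个以文本为中心的特质融合网络(Text-Centric Trait Fusion Network),以丰富的文本语义为锚点,对齐并整合来自其他模态的非同步信号。具体而言,该融合模块包括用于降维的Chunk-Wise Projector、用于有效模态融合的Cross-Modal Connector和Text Feature Enhancer,以及用于在数据稀缺情况下提升泛化能力的集成回归头。据我们所知,我们首次使用针对人格的提示来引导大语言模型(LLM)提取人格感知语义,以提高表示质量。此外,提取并融合视听表观行为特征可进一步提高准确率。在AVI验证集上的实验结果证明了所提各组件的有效性,即均方误差(MSE)降低约45%。在AVI Challenge 2025测试集上的最终评估证实了我们方法的优越性,在Personality Assessment赛道中排名第一。源代码将发布于 https://github.com/MSA-LMC/TraitsRunDeep。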

Article 166

Title@2025-07-30 (3): Masked Language Models are Good Heterogeneous Graph Generalizers

Title: Masked Language Models are Good Heterogeneous Graph Generalizers

Masked Language Models sind gute Heterogene Graph Generalizers

遮罩语言模型是好异基因图形缩略图 2506.06157v2

Authors (8): Jinyu Yang, Cheng Yang, Shanyuan Cui, Zeyuan Guo, Liangwei Yang, Muhan Zhang, Zhiqiang Zhang, Chuan Shi

Heterogeneous graph neural networks (HGNNs) excel at capturing structural and semantic information in heterogeneous graphs (HGs), while struggling to generalize across domains and tasks. With the rapid advancement of large language models (LLMs), a recent study explored the integration of HGNNs with LLMs for generalizable heterogeneous graph learning. However, this approach typically encodes structural information as HG tokens using HGNNs, and disparities in embedding spaces between HGNNs and LLMs have been shown to bias the LLM’s comprehension of HGs. Moreover, since these HG tokens are often derived from node-level tasks, the model’s ability to generalize across tasks remains limited. To this end, we propose a simple yet effective Masked Language Modeling-based method, called MLM4HG. MLM4HG introduces metapath-based textual sequences instead of HG tokens to extract structural and semantic information inherent in HGs, and designs customized textual templates to unify different graph tasks into a coherent cloze-style ‘mask’ token prediction paradigm. Specifically, MLM4HG first converts HGs from various domains to texts based on metapaths, and subsequently combines them with the unified task texts to form an HG-based corpus. The corpus is then fed into a pretrained LM for fine-tuning with a constrained target vocabulary, enabling the fine-tuned LM to generalize to unseen target HGs. Extensive cross-domain and multi-task experiments on four real-world datasets demonstrate the superior generalization performance of MLM4HG over state-of-the-art methods in both few-shot and zero-shot scenarios. Our code is available at https://github.com/BUPT-GAMMA/MLM4HG.

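异质图神经网络(HGNN)擅长捕捉异质图(HG)中的结构和语义信息,但难以跨领域、跨任务泛化。随着大语言模型(LLM)的快速发展,近期有研究探索将HGNN与LLM结合,以实现可泛化的异质图学习。然而,这种方法通常使用HGNN将结构信息编码为HG token,而HGNN与LLM嵌入空间之间的差异已被证明会使LLM对HG的理解产生偏差。此外,由于这些HG token往往来自节点级任务,模型跨任务泛化的能力仍然有限。为此,我们提出一种简单而有效的基于掩码语言建模的方法,称为MLM4HG。MLM4HG用基于元路径的文本序列代替HG token,以提取HG中固有的结构和语义信息,并设计定制的文本模板,将不同的图任务统一为连贯的完形填空式“mask” token预测范式。具体而言,MLM4HG首先基于元路径将不同领域的HG转换为文本,再将其与统一的任务文本结合,构成基于HG的语料库。随后,将该语料库输入预训练语言模型,在受限的目标词表下进行微调,使微调后的语言模型能够泛化到未见过的目标HG。在四个真实数据集上进行的大量跨领域和多任务实验表明,MLM4HG在少样本和零样本场景下的泛化性能均优于最先进的方法。代码见 https://github.com/BUPT-GAMMA/MLM4HG。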

Article 167

Title@2025-07-30 (3): Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning

Title: Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning

Nutzung von großen Sprachmodellen für Bengalische Mathematik-Wort-Probleme bei der Lösung der Kette der Gedankenveranlagung

利用大语言模型解决孟加拉语数学字词与思维链理性的解决问题 2505.21354v2

Authors (5): Bidyarthi Paul, Jalisha Jashim Era, Mirazur Rahman Zim, Tahmid Sattar Aothoi, Faisal Muhammad Shah

Solving Bengali Math Word Problems (MWPs) remains a major challenge in natural language processing (NLP) due to the language’s low-resource status and the multi-step reasoning required. Existing models struggle with complex Bengali MWPs, largely because no human-annotated Bengali dataset has previously addressed this task. This gap has limited progress in Bengali mathematical reasoning. To address this, we created SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written, step-by-step solutions. We designed this dataset to support reasoning-focused evaluation and model development in a linguistically underrepresented context. Using SOMADHAN, we evaluated a range of large language models (LLMs) - including GPT-4o, GPT-3.5 Turbo, LLaMA series models, Deepseek, and Qwen - through both zero-shot and few-shot prompting with and without Chain of Thought (CoT) reasoning. CoT prompting consistently improved performance over standard prompting, especially in tasks requiring multi-step logic. LLaMA-3.3 70B achieved the highest accuracy of 88% with few-shot CoT prompting. We also applied Low-Rank Adaptation (LoRA) to fine-tune models efficiently, enabling them to adapt to Bengali MWPs with minimal computational cost. Our work fills a critical gap in Bengali NLP by providing a high-quality reasoning dataset and a scalable framework for solving complex MWPs. We aim to advance equitable research in low-resource languages and enhance reasoning capabilities in educational and language technologies.

由于孟加拉语属于低资源语言且需要多步推理,求解孟加拉语数学应用题(MWP)仍是自然语言处理(NLP)中的一大挑战。现有模型难以处理复杂的孟加拉语MWP,主要原因是此前没有人工标注的孟加拉语数据集针对这一任务。这一空白限制了孟加拉语数学推理的进展。为此,我们构建了SOMADHAN,一个包含8792道复杂孟加拉语MWP并配有人工撰写的分步解答的数据集。我们设计该数据集是为了在语言代表性不足的情境下支持以推理为重点的评估和模型开发。基于SOMADHAN,我们在使用和不使用思维链(CoT)推理的零样本和少样本提示下,评估了一系列大语言模型(LLM),包括GPT-4o、GPT-3.5 Turbo、LLaMA系列模型、Deepseek和Qwen。与标准提示相比,CoT提示持续提升了性能,在需要多步逻辑的任务中尤为明显。LLaMA-3.3 70B在少样本CoT提示下取得了88%的最高准确率。我们还应用低秩适配(LoRA)对模型进行高效微调,使其能够以极小的计算成本适应孟加拉语MWP。我们的工作提供了高质量的推理数据集和可扩展的复杂MWP求解框架,填补了孟加拉语NLP中的一个关键空白。我们希望推动低资源语言的公平研究,并增强教育和语言技术中的推理能力。

Article 168

Title@2025-07-30 (3): MuSciClaims: Multimodal Scientific Claim Verification

Title: MuSciClaims: Multimodal Scientific Claim Verification

MuSciClaims: Multimodale wissenschaftliche Antragsprüfung

穆西索赔: 多式联运科学索赔核实 2506.04585v2

Authors (6): Yash Kumar Lal, Manikanta Bandham, Mohammad Saqib Hasan, Apoorva Kashi, Mahnaz Koupaee, Niranjan Balasubramanian

Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark, MuSciClaims, accompanied by diagnostic tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show that most vision-language models perform poorly (~0.3-0.5 F1), with even the best model only achieving 0.72 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show that models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.

评估科学主张需要确定、提取和推理科学文献中信息丰富数字中表述的多式数据。尽管科学质量评估、图表说明和其他基于图表的数据的多式联运推理任务方面做了大量工作,但没有直接测试核实能力的现用多式联运基准。为了弥补这一差距,我们引入了新的基准MuSci要求,并辅以诊断任务。我们自动从科学文章中提取支持性主张,我们人工干扰这些主张,以产生自相矛盾的主张。扰动是为了测试一套具体的索赔核实能力。我们还引入了一套有助于理解模型失败的诊断任务。我们的结果显示,大多数愿景语言模型都很差(~0.3-0.5 F1),即使最佳模型也只能达到0.72 F1。它们也偏向于判断索赔所支持的索赔,可能存在误解,对索赔中的扰动进行细微分化。我们的诊断显示模型不利于在数字中将正确证据本地化,与各种方式的信息拼凑,而且往往无法理解数字的基本组成部分。

Article 169

Title@2025-07-30 (3): A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers

Title: A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers

Eine umfassende Taxonomie der Negation für NLP und Neuralretriever

NLP和神经再研究综合清点分类 2507.22337v1

Authors (4): Roxana Petcu, Samarth Bhargav, Maarten de Rijke, Evangelos Kanoulas

Understanding and solving complex reasoning tasks is vital for addressing the information needs of a user. Although dense neural models learn contextualised embeddings, they still underperform on queries containing negation. To understand this phenomenon, we study negation in both traditional neural information retrieval and LLM-based models. We (1) introduce a taxonomy of negation that derives from philosophical, linguistic, and logical definitions; (2) generate two benchmark datasets that can be used to evaluate the performance of neural information retrieval models and to fine-tune models for a more robust performance on negation; and (3) propose a logic-based classification mechanism that can be used to analyze the performance of retrieval models on existing datasets. Our taxonomy produces a balanced data distribution over negation types, providing a better training setup that leads to faster convergence on the NevIR dataset. Moreover, we propose a classification schema that reveals the coverage of negation types in existing datasets, offering insights into the factors that might affect the generalization of fine-tuned models on negation.

理解和解决复杂的推理任务对于满足用户的信息需求至关重要。虽然密集的神经模型学会了背景嵌入,但它们在含有否定的查询方面仍然表现不佳。为了理解这一现象,我们在传统的神经信息检索和基于LLM的模型中都研究否定现象。我们(1) 采用哲学、语言和逻辑定义产生的否定分类法;(2) 产生两个基准数据集,可用于评价神经信息检索模型的性能和微调模型,以便在否定方面进行更强的性能;(3) 提出一个基于逻辑的分类机制,可用来分析现有数据集检索模型的性能。我们的分类法在否定类型上产生了平衡的数据分布,提供了更好的培训设置,使NevIR数据集更快地趋于一致。此外,我们提出一个分类方案,显示现有数据集中否定类型的范围,对可能影响到对否定的精确模型的普及。

Article 170

Title@2025-07-30 (3): Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing

Title: Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing

Prompt-Reverse Inkonsistenz: LLM Selbstinkonsistenz jenseits generativer Zufälligkeit und prompt Paraphrasierung

迅速反向不一致:LLM 自我不连贯,超越发生性随机和迅速划线 2504.01282v2

Authors (2): Jihyun Janice Ahn, Wenpeng Yin

While the inconsistency of LLMs is not a novel topic, prior research has predominantly addressed two types of generative inconsistencies: i) Randomness Inconsistency: running the same LLM over multiple trials yields varying responses; ii) Paraphrase Inconsistency: paraphrased prompts result in different responses from the same LLM. Randomness Inconsistency arises from the inherent randomness due to stochastic sampling in generative models, while Paraphrase Inconsistency is a consequence of the language modeling objectives, where paraphrased prompts alter the distribution of vocabulary logits. This research discovers Prompt-Reverse Inconsistency (PRIN), a new form of LLM self-inconsistency: given a question and a couple of LLM-generated answer candidates, the LLM often gives conflicting responses when prompted “Which are correct answers?” and “Which are incorrect answers?”. PRIN poses a big concern as it undermines the credibility of LLM-as-a-judge, and suggests a challenge for LLMs to adhere to basic logical rules. We conduct a series of experiments to investigate PRIN, examining the extent of PRIN across different LLMs, methods to mitigate it, potential applications, and its relationship with Randomness Inconsistency and Paraphrase Inconsistency. As the first study to explore PRIN, our findings offer valuable insights into the inner workings of LLMs and contribute to advancing trustworthy AI.

虽然LLMs的不一致并不是一个新专题,但先前的研究主要解决了两类基因不一致的问题:(一) 随机性:执行同样的LLM多重审判,得出不同的答复;(二) 原教旨不一致性:由同一LLM作出不同的答复。随机性不一致性产生于因基因模型的随机抽样而固有的随机性,而原教旨不一致性是语言模型目标的结果,用原言来解释的迅速性改变了词汇记录表的分发。本研究发现迅速反偏向性(PRIN)是一种新的LLM自我不一致性:给一个问题和几个LLM产生的应答候选人带来不同的答复。LMM常常在“答案正确”和“答案不正确”时作出相互冲突的答复。 PRIN是一个非常令人关切的问题,因为它破坏了LM-a-a-a-judge的可信度,并且建议LMS内部的遵守基本逻辑规则。我们进行了一系列实验,以调查PRIN(PRI)的透明性研究,研究其价值性应用程度,并研究它与不同LIS-LIS的透明性研究。
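The logical rule behind PRIN can be stated as a set check: over the same candidate list, the answers picked for "Which are correct answers?" and for "Which are incorrect answers?" should be complementary. The sketch below encodes that check; the function names and the simple inconsistency rate are illustrative assumptions, not the paper's official metric.

```python
def prin_consistent(candidates, said_correct, said_incorrect):
    """True if the two selections split the candidates into complementary sets."""
    cands = set(candidates)
    correct = set(said_correct) & cands
    incorrect = set(said_incorrect) & cands
    # No candidate may be labelled both ways, and none may be left unlabelled.
    return correct.isdisjoint(incorrect) and (correct | incorrect) == cands


def prin_rate(examples):
    """Share of (candidates, said_correct, said_incorrect) triples that conflict."""
    conflicts = sum(1 for ex in examples if not prin_consistent(*ex))
    return conflicts / len(examples)
```

For example, with candidates `["A", "B", "C"]`, selecting `["A"]` as correct and `["B", "C"]` as incorrect is consistent, whereas selecting `["A"]` as correct and `["A", "B"]` as incorrect is not, because "A" is labelled both ways and "C" is never labelled.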

Article 171

Title@2025-07-30 (3): Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges

Title: Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges

Natürliche Sprachverarbeitung für den Rechtsbereich: Eine Übersicht über Aufgaben, Datensätze, Modelle und Herausforderungen

法律领域自然语言处理:任务、数据集、模型和挑战调查 2410.21306v3

Authors (3): Farid Ariai, Joel Mackenzie, Gianluca Demartini

Natural Language Processing (NLP) is revolutionising the way both professionals and laypersons operate in the legal field. The considerable potential for NLP in the legal sector, especially in developing computational assistance tools for various legal processes, has captured the interest of researchers for years. This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses framework, reviewing 154 studies, with a final selection of 131 after manual filtering. It explores foundational concepts related to NLP in the legal domain, illustrating the unique aspects and challenges of processing legal texts, such as extensive document lengths, complex language, and limited open legal datasets. We provide an overview of NLP tasks specific to legal text, such as Document Summarisation, Named Entity Recognition, Question Answering, Argument Mining, Text Classification, and Judgement Prediction. Furthermore, we analyse both developed legal-oriented language models, and approaches for adapting general-purpose language models to the legal domain. Additionally, we identify sixteen open research challenges, including the detection and mitigation of bias in artificial intelligence applications, the need for more robust and interpretable models, and improving explainability to handle the complexities of legal language and reasoning.

自然语言处理(NLP)在法律领域改变了专业人员和外行人员的运作方式,在法律部门,特别是在为各种法律程序开发计算协助工具方面,国家语言处理具有相当大的潜力,多年来引起了研究人员的兴趣,这项调查遵循了系统审查和元分析框架的首选报告项目,审查了154项研究,在人工过滤后最后挑选了131项研究,在法律领域探索了与国家语言处理方法有关的基本概念,说明了法律文本处理的独特方面和挑战,如文件长度大、语言复杂和开放法律数据集有限等。我们概述了国家语言处理方法的具体任务,如文件总结、实体识别、问题回答、标名、标榜采矿、文本分类和判断。此外,我们分析了以法律为导向的语言模式和使通用语言模式适应法律领域的方法。此外,我们查明了16项公开研究挑战,包括发现和减少人工智能应用中的偏差,需要更有力和可解释的语言模型,以及改进处理复杂程度。

Article 172

Title@2025-07-29 (2): Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations

Title: Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations

Intent Recognition und Out-of-Scope-Erkennung mit LLMs in Multi-Party-Konversationen

在多方对话中使用LLMs 2507.22289v1

Authors (3): Galo Castillo-López, Gaël de Chalendar, Nasredine Semmar

Intent recognition is a fundamental component in task-oriented dialogue systems (TODS). Determining user intents and detecting whether an intent is Out-of-Scope (OOS) is crucial for TODS to provide reliable responses. However, traditional TODS require large amounts of annotated data. In this work, we propose a hybrid approach that combines BERT and LLMs in zero-shot and few-shot settings to recognize intents and detect OOS utterances. Our approach leverages the generalization power of LLMs and BERT’s computational efficiency in such scenarios. We evaluate our method on multi-party conversation corpora and observe that sharing information from BERT outputs with LLMs leads to improved system performance.

主动认识是面向任务的对话系统(TODS)的一个基本组成部分。确定用户意图并查明某种意图是否超越范围(OOS)对于TODS提供可靠的反应至关重要。然而,传统的TODS需要大量附带说明的数据。在这项工作中,我们提议采用混合办法,将BERT和LLMs在零和几发环境中结合起来,以确认意图并检测OS的全局性能。我们的方法利用LLMS的概括性能和BERT在这类情况下的计算效率。我们评估了我们关于多党对话公司的方法,并观察到从BERT产出到LOMS共享信息可以改善系统性能。

Article 173

Title@2025-07-29 (2): Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs

Title: Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs

Bedeutungsverstärkte Grammatik: Gradient Akzeptabilität formt die geometrischen Darstellungen von Konstruktionen in LLMs

含义内含语法:逐渐可接受性形状 LLM 中工程的几何表示法 2507.22286v1

Authors (2): Supantho Rakshit, Adele Goldberg

The usage-based constructionist (UCx) approach posits that language comprises a network of learned form-meaning pairings (constructions) whose use is largely determined by their meanings or functions, requiring them to be graded and probabilistic. This study investigates whether the internal representations in Large Language Models (LLMs) reflect the proposed function-infused gradience. We analyze the neural representations of the English dative constructions (Double Object and Prepositional Object) in Pythia-1.4B, using a dataset of 5000 sentence pairs systematically varied for human-rated preference strength. A macro-level geometric analysis finds that the separability between construction representations, as measured by Energy Distance or Jensen-Shannon Divergence, is systematically modulated by gradient preference strength. More prototypical exemplars of each construction occupy more distinct regions in the activation space of LLMs. These results provide strong evidence that LLMs learn rich, meaning-infused, graded representations of constructions and offer support for geometric measures of basic constructionist principles in LLMs.

以使用为基础的建筑师(UCx)方法假定,语言包括一个知识形式的配对网络(建筑),其使用主要取决于其含义或功能,要求其分级和概率。本研究调查了大语言模型的内部代表是否反映了拟议的功能抑制光度。我们分析了Pythia-1.4美元B中英国配制建筑(二重物体和前置物体)的神经表现,使用了一套因人称优惠强度而系统变化的5 000美元判刑配对数据集。一项宏观的几何分析发现,按能源距离或詹森-汉诺分辨法衡量的建筑代表之间是否具有系统性的分离性,由梯度偏好强度加以调节。我们分析了在激活LLMS空间中更多具有超典型特征的建筑设计师占据了较不同的区域。这些结果提供了有力的证据,证明LMS学会了丰富的、含意的、分级的建筑表现,并为LMS基本建筑原则的几何测量度措施提供了支持。
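As a small illustration of one of the separability measures named in the abstract above, the sketch below computes an energy distance between two sets of vectors using the common form 2·E‖X−Y‖ − E‖X−X′‖ − E‖Y−Y′‖. Conventions differ on whether a square root is taken, and nothing here reproduces the study's actual analysis; the function names and the toy vectors in the note are assumptions.

```python
import math


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance over all pairs (x, y) with x in a and y in b."""
    total = sum(math.dist(x, y) for x in a for y in b)
    return total / (len(a) * len(b))


def energy_distance(a, b):
    """Energy distance between two samples of equal-dimension vectors.

    Larger values mean the two sets occupy more distinct regions;
    the value is zero when the two samples are identical.
    """
    return (
        2.0 * mean_pairwise_distance(a, b)
        - mean_pairwise_distance(a, a)
        - mean_pairwise_distance(b, b)
    )
```

With toy 2-D "representations" such as `[(0, 0), (1, 0)]` for one construction and `[(10, 0), (11, 0)]` for the other, the measure is clearly positive, and it shrinks toward zero as the two clouds of points overlap.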

Article 174

Title@2025-07-29 (2): CoEx – Co-evolving World-model and Exploration

Title: CoEx – Co-evolving World-model and Exploration

CoEx – Co-evolving World-Modell und Exploration

CoEx – – 共同发展的世界模式和探索 2507.22281v1

Authors (2): Minsoo Kim, Seung-won Hwang

Planning in modern LLM agents relies on the utilization of LLM as an internal world model, acquired during pretraining. However, existing agent designs fail to effectively assimilate new observations into dynamic updates of the world model. This reliance on the LLM’s static internal world model is progressively prone to misalignment with the underlying true state of the world, leading to the generation of divergent and erroneous plans. We introduce a hierarchical agent architecture, CoEx, in which hierarchical state abstraction allows LLM planning to co-evolve with a dynamically updated model of the world. CoEx plans and interacts with the world by using LLM reasoning to orchestrate dynamic plans consisting of subgoals, and its learning mechanism continuously incorporates these subgoal experiences into a persistent world model in the form of a neurosymbolic belief state, comprising textual inferences and code-based symbolic memory. We evaluate our agent across a diverse set of agent scenarios involving rich environments and complex tasks including ALFWorld, PDDL, and Jericho. Our experiments show that CoEx outperforms existing agent paradigms in planning and exploration.

现代LLM代理商的规划依赖于将LLM作为内部世界模型,在培训前获得;然而,现有代理商的设计未能有效地将新的观测结果吸收到世界模型的动态更新中。对LLM静态的内部世界模型的这种依赖,逐渐会与世界基本真实状态不相符,导致产生不同和错误的计划。我们引入了等级代理结构CoEx,在这个结构中,等级国家抽象使LLM计划与动态更新的世界模型共同演变。CoEx计划和与世界互动,利用LLM推理来协调由次级目标组成的动态计划,其学习机制不断将这些次级目标经验纳入一个持久的世界模型中,其形式是神经共性信仰状态,由文字推理和基于代码的象征记忆组成。我们评估了我们的代理商在包括ALFWorld、PDDL和杰里科在内的多种涉及丰富环境和复杂任务的代理商设想方案。我们的实验表明,CoEx在规划和探索方面超越了现有的代理商范式。

Article 175

Title@2025-07-29 (2): Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

Title: Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

Verkörperte Web-Agenten: Überbrückung physikalisch-digitaler Realms für integrierte Agent-Intelligenz

具身网络智能体:连接物理与数字领域以实现一体化智能体智能 2506.15677v3

Authors (10): Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, Kai-Wei Chang

AI agents today are mostly siloed: they either retrieve and reason over vast amounts of digital information and knowledge obtained online, or interact with the physical world through embodied perception, planning, and action, but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation, all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, code, and websites are publicly available at our project page https://embodied-web-agent.github.io/.

当今的AI智能体大多彼此割裂:它们要么对在线获取的海量数字信息和知识进行检索与推理,要么通过具身感知、规划和行动与物理世界交互,但很少两者兼备。这种割裂限制了它们解决需要物理与数字智能相结合的任务的能力,例如按照在线食谱烹饪、利用动态地图数据导航,或借助网络知识解读现实世界的地标。我们提出具身网络智能体(Embodied Web Agents),这是一种让AI智能体流畅衔接具身能力与网络规模推理的新范式。为落实这一概念,我们首先开发了Embodied Web Agents任务环境,这是一个将逼真的3D室内外环境与可用的网页界面紧密集成的统一仿真平台。在此平台之上,我们构建并发布了Embodied Web Agents基准,涵盖烹饪、导航、购物、旅游和地理定位等多种任务,这些任务都需要跨物理与数字领域的协同推理,用于系统评估跨领域智能。实验结果表明,最先进的AI系统与人类能力之间存在显著的性能差距,这在具身认知与网络规模知识获取的交汇处既提出了挑战,也带来了机遇。所有数据集、代码和网站均在项目主页 https://embodied-web-agent.github.io/ 公开。

Article 176

Title@2025-07-29 (2): Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Title: Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Denoising Concept Vectors mit Sparse Autoencodern für verbesserte Sprachmodellsteuerung

利用稀疏自编码器对概念向量去噪以改进语言模型引导 2505.15038v2

Authors (6): Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du

Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up activations of top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16% across six challenging concepts, while maintaining topic relevance.

线性概念向量能够有效引导LLM,但现有方法受到多样化数据集中噪声特征的影响,削弱了引导的鲁棒性。我们提出稀疏自编码器去噪概念向量(SDCV),在重构隐藏表示时有选择地保留判别力最强的SAE潜变量。我们的核心见解是:通过放大最能区分正负样本的top-k潜变量的激活,可以将与概念相关的信号从数据集噪声中显式分离出来。将SDCV应用于线性探针和均值差方法,在六个具有挑战性的概念上,引导成功率稳定提高4-16%,同时保持主题相关性。
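
The denoising idea in the abstract above can be illustrated with a minimal sketch. It assumes SAE latent activations for positive and negative samples are already available as plain lists, and that a toy decoder matrix maps latents back to hidden states. The helper names (`top_k_discriminative`, `denoise`, `concept_vector`), the scaling factor `alpha`, and the selection rule (largest gap in mean activation) are hypothetical choices for illustration, not the authors' implementation.

```python
def mean(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def top_k_discriminative(pos_latents, neg_latents, k):
    """Indices of the k latents whose mean activation differs most between classes.
    (Assumed selection rule; the paper may rank latents differently.)"""
    gaps = [abs(a - b) for a, b in zip(mean(pos_latents), mean(neg_latents))]
    return sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[:k]

def denoise(latents, keep, decoder, alpha=1.0):
    """Reconstruct a hidden state from only the kept latents, scaled by alpha."""
    dim = len(decoder[0])
    out = [0.0] * dim
    for i in keep:
        for d in range(dim):
            out[d] += alpha * latents[i] * decoder[i][d]
    return out

def concept_vector(pos_latents, neg_latents, decoder, k, alpha=1.0):
    """Difference-in-means concept vector computed on denoised reconstructions."""
    keep = top_k_discriminative(pos_latents, neg_latents, k)
    pos = [denoise(z, keep, decoder, alpha) for z in pos_latents]
    neg = [denoise(z, keep, decoder, alpha) for z in neg_latents]
    return [p - n for p, n in zip(mean(pos), mean(neg))]

# Toy data: 3 latents decoded into a 2-dimensional hidden state.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pos = [[2.0, 0.5, 0.1], [2.0, 0.5, 0.3]]
neg = [[0.0, 0.5, 0.3], [0.0, 0.5, 0.1]]
vector = concept_vector(pos, neg, decoder, k=1)
```

In this toy setup only latent 0 separates the two classes, so the fluctuating latent 2 (the "noise") no longer contributes to the resulting steering direction.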

Article 177

Title@2025-07-29 (2): Modeling Story Expectations to Understand Engagement: A Generative Framework Using LLMs

Title: Modeling Story Expectations to Understand Engagement: A Generative Framework Using LLMs

Modeling Story Erwartungen, Engagement zu verstehen: Ein generatives Framework mit LLMs

通过建模故事预期来理解受众参与:一个使用LLM的生成式框架 2412.15239v3

Authors (2): Hortense Fong, George Gui

Understanding when and why consumers engage with stories is crucial for content creators and platforms. While existing theories suggest that audience beliefs about what is going to happen should play an important role in engagement decisions, empirical work has mostly focused on developing techniques to directly extract features from actual content, rather than capturing forward-looking beliefs, due to the lack of a principled way to model such beliefs in unstructured narrative data. To complement existing feature extraction techniques, this paper introduces a novel framework that leverages large language models to model audience forward-looking beliefs about how stories might unfold. Our method generates multiple potential continuations for each story and extracts features related to expectations, uncertainty, and surprise using established content analysis techniques. Applying our method to over 30,000 book chapters, we demonstrate that our framework complements existing feature engineering techniques by amplifying their marginal explanatory power by 31% on average. The results reveal that different types of engagement (continuing to read, commenting, and voting) are driven by distinct combinations of current and anticipated content features. Our framework provides a novel way to study and explore how audience forward-looking beliefs shape their engagement with narrative media, with implications for marketing strategy in content-focused industries.

了解消费者何时以及为何被故事吸引,对内容创作者和平台至关重要。现有理论认为,受众对后续情节的信念应在参与决策中发挥重要作用,但由于缺乏在非结构化叙事数据中对此类信念建模的原则性方法,实证研究大多侧重于开发直接从实际内容中提取特征的技术,而非刻画前瞻性信念。为补充现有的特征提取技术,本文提出一个新框架,利用大型语言模型对受众关于故事走向的前瞻性信念进行建模。我们的方法为每个故事生成多个可能的后续情节,并使用成熟的内容分析技术提取与预期、不确定性和惊奇相关的特征。将该方法应用于3万多个图书章节后,我们证明该框架可与现有特征工程技术互补,使其边际解释力平均提高31%。结果显示,不同类型的参与行为(继续阅读、评论和投票)由当前内容特征与预期内容特征的不同组合驱动。我们的框架为研究受众的前瞻性信念如何影响其对叙事媒体的参与提供了新途径,并对内容驱动型行业的营销策略具有启示意义。

Article 178

Title@2025-07-29 (2): ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

Title: ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

EKG-Byte: Ein Tokenizer für die End-to-End Generative Elektrokardiogramm-Sprachenmodellierung

ECG-Byte:用于端到端生成式心电图语言建模的分词器 2412.14373v3

Authors (5): William Han, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao

Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channeled ECG signals and corresponding textual prompts. Existing approaches often involve a two-stage process: pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective, followed by finetuning an LLM for natural language generation (NLG) using encoder-derived features. However, these methods face two key limitations: inefficiency due to multi-stage training and challenges in interpreting encoder-generated features. To overcome these issues, we propose ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. ECG-Byte compresses and encodes ECG signals into tokens, enabling direct end-to-end LLM training by combining ECG and text tokens. This approach enhances interpretability, as ECG tokens can be directly mapped back to the original signals. Leveraging ECG-Byte, we achieve competitive NLG performance while training 3 times faster and using just 48% of the data required by traditional two-stage methods.

大型语言模型(LLM)在各领域展现出卓越的通用性,包括在心电图(ECG)上的应用。越来越多的工作致力于根据多通道ECG信号及相应的文本提示生成文本。现有方法通常包含两个阶段:先以自监督学习(SSL)目标预训练ECG专用编码器,再利用编码器提取的特征微调LLM以进行自然语言生成(NLG)。然而,这些方法存在两个主要局限:多阶段训练导致效率低下,以及编码器生成的特征难以解释。为解决这些问题,我们提出ECG-Byte,一种经过改编的字节对编码(BPE)分词器流水线,用于ECG的自回归语言建模。ECG-Byte将ECG信号压缩并编码为词元,通过组合ECG词元与文本词元,实现直接的端到端LLM训练。由于ECG词元可以直接映射回原始信号,该方法提升了可解释性。借助ECG-Byte,我们取得了有竞争力的NLG性能,同时训练速度提升至3倍,且仅使用传统两阶段方法所需数据的48%。
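
As a rough illustration of how a BPE-style tokenizer could be applied to a signal, the sketch below quantizes samples into a small symbol alphabet, learns greedy pair merges, and decodes tokens back to the quantized symbols. The bin count, value range, and function names (`quantize`, `learn_bpe`, `decode`) are assumptions made for this example; the abstract does not specify ECG-Byte's actual quantization or merge procedure.

```python
from collections import Counter

def quantize(signal, n_bins=4, lo=-1.0, hi=1.0):
    """Map each sample to an integer bin id in [0, n_bins - 1] (assumed uniform bins)."""
    ids = []
    for x in signal:
        x = min(max(x, lo), hi)  # clip to the assumed amplitude range
        ids.append(min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1))
    return ids

def learn_bpe(seq, n_merges):
    """Greedily merge the most frequent adjacent pair, n_merges times at most."""
    seq = list(seq)
    merges = {}  # new token id -> (left id, right id)
    next_id = max(seq) + 1
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing worth merging
            break
        merges[next_id] = (a, b)
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return seq, merges

def decode(tokens, merges):
    """Expand merged tokens back into the original bin ids."""
    out, stack = [], list(reversed(tokens))
    while stack:
        t = stack.pop()
        if t in merges:
            left, right = merges[t]
            stack.append(right)
            stack.append(left)
        else:
            out.append(t)
    return out

signal = [-0.9, 0.9, -0.9, 0.9, -0.9, 0.9, 0.1]
ids = quantize(signal)
tokens, merges = learn_bpe(ids, n_merges=2)
```

Because every merged token expands deterministically to its constituent bins, the token sequence can be traced back to the (quantized) waveform, which is the interpretability property the abstract highlights.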

Article 179

Title@2025-07-29 (2): GneissWeb: Preparing High Quality Data for LLMs at Scale

Title: GneissWeb: Preparing High Quality Data for LLMs at Scale

GneissWeb: Hochqualitative Daten für LLMs im Maßstab vorbereiten

GneissWeb:大规模为LLM准备高质量数据 2502.14907v2

Authors (32): Hajar Emami Gohari, Swanand Ravindra Kadhe, Syed Yousaf Shah, Constantin Adam, Abdulhamid Adebayo, Praneet Adusumilli, Farhan Ahmed, Nathalie Baracaldo Angel, Santosh Subhashrao Borse, Yuan-Chi Chang, Xuan-Hong Dang, Nirmit Desai, Revital Eres, Ran Iwamoto, Alexei Karve, Yan Koyfman, Wei-Han Lee, Changchang Liu, Boris Lublinsky, Takuyo Ohko, Pablo Pesce, Maroun Touma, Shiqiang Wang, Shalisha Witherspoon, Herbert Woisetschläger, David Wood, Kun-Lung Wu, Issei Yoshida, Syed Zawad, Petros Zerfos, Yi Zhou, Bishwaranjan Bhattacharjee

Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM’s ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained using GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still achieve a 1.75 percentage points advantage over those trained on FineWeb-V1.1.0.

数据的数量和质量对大型语言模型(LLM)的性能起着至关重要的作用。高质量数据尤其能够显著提升LLM在各类下游任务上的泛化能力。领先LLM所用的大型预训练数据集仍未向公众开放,而许多开放数据集规模较小(少于5万亿词元),限制了其用于训练大模型的适用性。本文介绍GneissWeb,一个约含10万亿词元的大型数据集,可满足训练LLM对数据质量和数量的要求。生成该数据集的GneissWeb方案由分片的精确子串去重和精心构建的质量过滤器集成组成。GneissWeb在数据质量与数量之间取得了良好的平衡,所训练出的模型优于在当前最先进的开放大型数据集(5万亿词元以上)上训练的模型。我们表明,在用于预训练数据集评估的11个常用基准(含零样本和少样本)上,使用GneissWeb训练的模型的平均得分比在FineWeb-V1.1.0上训练的模型高2.73个百分点。当评估集扩展到20个基准(含零样本和少样本)时,使用GneissWeb训练的模型仍比在FineWeb-V1.1.0上训练的模型高出1.75个百分点。

Article 180

Title@2025-07-29 (2): LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Title: LLM-as-a-qualitative-judge: automating error analysis in natural language generation

LLM-as-a-qualitative-Richter: Automatisierung der Fehleranalyse bei der Generierung natürlicher Sprachen

LLM-as-a-qualitative-judge:自然语言生成中的自动化错误分析 2506.09147v2

Authors (7): Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian

Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be made to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 of cases and is capable of producing error type reports resembling the reports composed by human annotators. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.

提示大型语言模型(LLM)对生成文本进行评估(即LLM-as-a-judge)已成为自然语言生成(NLG)领域的标准评估方法,但它主要被用作定量工具,即以数值分数作为主要输出。在这项工作中,我们提出LLM-as-a-qualitative-judge,一种基于LLM的评估方法,其主要输出是一份关于NLG系统输出中常见问题类型的结构化报告。我们的方法旨在让开发者切实了解给定NLG系统可以作出哪些改进,它包含两个主要步骤:对每个实例进行开放式问题分析,以及使用一种直观的累积式算法对发现的问题进行聚类。我们还提出了评估该方法的策略,并提供了来自12个NLG数据集的实例中约300条问题标注。结果表明,LLM-as-a-qualitative-judge在约2/3的情况下能正确识别实例层面的具体问题,并能生成与人工标注者撰写的报告相近的错误类型报告。我们的代码和数据公开于 https://github.com/tunde-ajayi/llm-as-a-qualitative-judge。
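
One way to picture a "cumulative" clustering step is to process issue descriptions one at a time, attaching each to the first existing cluster it matches or opening a new cluster otherwise. The sketch below follows that reading. It is an assumption about the general shape of the algorithm, and the token-overlap test (`jaccard` with a 0.5 threshold) is a stand-in for the LLM judgment that the described approach would presumably use.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two issue descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def cumulative_cluster(issues, same_issue=lambda a, b: jaccard(a, b) >= 0.5):
    """Assign each issue to the first matching cluster, or start a new one.
    `same_issue` is a hypothetical predicate; an LLM call could replace it."""
    clusters = []  # each cluster: {"label": str, "members": [str]}
    for issue in issues:
        for cluster in clusters:
            if same_issue(cluster["label"], issue):
                cluster["members"].append(issue)
                break
        else:
            clusters.append({"label": issue, "members": [issue]})
    # Report the most frequent issue types first.
    return sorted(clusters, key=lambda c: len(c["members"]), reverse=True)

issues = [
    "answer in wrong language",
    "answer in the wrong language",
    "hallucinated entity name",
    "answer in wrong language",
]
report = cumulative_cluster(issues)
```

The sorted cluster list plays the role of the structured report: each entry names an issue type and lists the instances that exhibit it.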

Article 181

Title: RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation

RL von Lehrer-Modell-Verfeinerung: Graduale Imitation Lernen für maschinelle Übersetzung

基于教师模型精修的强化学习:面向机器翻译的渐进式模仿学习 2507.22219v1

Authors (3): Dongyub Jude Lee, Zhenyi Ye, Pengcheng He

Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher’s refinement. Guided by two complementary signals, (i) negative edit distance, promoting lexical and structural fidelity, and (ii) COMET score, ensuring semantic adequacy, the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.

机器翻译(MT)的偏好学习方法(如直接偏好优化,DPO)已取得显著提升,但严重依赖大规模、精心整理的三元组数据集,且往往难以泛化到其调优领域之外。我们提出基于教师模型精修的强化学习(RLfR),这一新框架利用外部教师模型(GPT-4o)提供的连续、高质量反馈,消除了对静态三元组的依赖。RLfR将每个翻译步骤视为一次微型辅导:行动者生成假设译文,教师对其进行精修,行动者则根据其与教师精修结果的接近程度获得奖励。在两个互补信号的引导下,即(i)促进词汇和结构忠实度的负编辑距离,以及(ii)确保语义充分性的COMET分数,行动者逐步学会模仿教师,通过渐进、迭代的改进再现人类的学习过程。在FLORES-200基准(英语与德语、西班牙语、中文、韩语和日语之间的互译)上,RLfR始终优于MT-SFT和基于偏好的基线,显著提高了COMET(语义充分性)和M-ETA(实体保留)得分。
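
The two reward signals named in the abstract can be combined as in the following minimal sketch. The weights `w_edit` and `w_comet`, the length normalization, and the whitespace tokenization are illustrative assumptions; `comet_score` is passed in as a plain number because the actual COMET model call is outside the scope of this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def rlfr_reward(hypothesis, refinement, comet_score, w_edit=1.0, w_comet=1.0):
    """Reward = weighted negative normalized edit distance + weighted COMET score.
    Normalization and weights are assumptions, not taken from the paper."""
    hyp, ref = hypothesis.split(), refinement.split()
    norm = max(len(hyp), len(ref), 1)
    neg_edit = -edit_distance(hyp, ref) / norm
    return w_edit * neg_edit + w_comet * comet_score

# The actor's hypothesis differs from the teacher's refinement by one token.
reward = rlfr_reward("the cat sat", "the dog sat", comet_score=0.5)
```

A hypothesis identical to the teacher's refinement incurs no edit penalty, so its reward reduces to the weighted COMET term.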

Article 182

Title@2025-07-29 (2): Can adversarial attacks by large language models be attributed?

Title: Can adversarial attacks by large language models be attributed?

Können feindliche Angriffe von großen Sprachmodellen zugeschrieben werden?

大型语言模型发起的对抗性攻击能否被归因? 2411.08003v3

Authors (3): Manuel Cebrian, Andres Abeliuk, Jan Arne Telle

Attributing outputs from Large Language Models (LLMs) in adversarial settings, such as cyberattacks and disinformation campaigns, presents significant challenges that are likely to grow in importance. We approach this attribution problem from both a theoretical and an empirical perspective, drawing on formal language theory (identification in the limit) and data-driven analysis of the expanding LLM ecosystem. By modeling an LLM’s set of possible outputs as a formal language, we analyze whether finite samples of text can uniquely pinpoint the originating model. Our results show that, under mild assumptions of overlapping capabilities among models, certain classes of LLMs are fundamentally non-identifiable from their outputs alone. We delineate four regimes of theoretical identifiability: (1) an infinite class of deterministic (discrete) LLM languages is not identifiable (Gold’s classical result from 1967); (2) an infinite class of probabilistic LLMs is also not identifiable (by extension of the deterministic case); (3) a finite class of deterministic LLMs is identifiable (consistent with Angluin’s tell-tale criterion); and (4) even a finite class of probabilistic LLMs can be non-identifiable (we provide a new counterexample establishing this negative result). Complementing these theoretical insights, we quantify the explosion in the number of plausible model origins (hypothesis space) for a given output in recent years. Even under conservative assumptions (each open-source model fine-tuned on at most one new dataset), the count of distinct candidate models doubles approximately every 0.5 years, and allowing multi-dataset fine-tuning combinations yields doubling times as short as 0.28 years. This combinatorial growth, alongside the extraordinary computational cost of brute-force likelihood attribution across all models and potential users, renders exhaustive attribution infeasible in practice.

在网络攻击和虚假信息宣传等对抗性场景中,对大型语言模型(LLM)的输出进行归因面临重大挑战,而且这些挑战的重要性很可能与日俱增。我们从理论和实证两个角度研究这一归因问题,借助形式语言理论(极限辨识)以及对不断扩张的LLM生态系统的数据驱动分析。通过将LLM的可能输出集合建模为一种形式语言,我们分析有限的文本样本能否唯一确定来源模型。结果表明,在模型间能力重叠的温和假设下,某些类别的LLM从根本上无法仅凭其输出加以辨识。我们划分了理论可辨识性的四种情形:(1)无限的确定性(离散)LLM语言类不可辨识(Gold 1967年的经典结果);(2)无限的概率性LLM类同样不可辨识(由确定性情形推广而得);(3)有限的确定性LLM类可辨识(与Angluin的tell-tale判据一致);(4)即使是有限的概率性LLM类也可能不可辨识(我们给出了一个确立这一否定结果的新反例)。作为对上述理论见解的补充,我们量化了近年来给定输出的可能来源模型数量(假设空间)的爆炸式增长。即使在保守假设下(每个开源模型至多在一个新数据集上微调),不同候选模型的数量也大约每0.5年翻一番;若允许多数据集微调组合,翻倍时间可短至0.28年。这种组合式增长,加上对所有模型和潜在用户进行暴力似然归因的极高计算成本,使得穷举式归因在实践中不可行。
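
To make the reported doubling times concrete, the short sketch below converts a doubling time into a growth factor, assuming idealized exponential growth N(t) = N0 * 2^(t/T). The function names are hypothetical, and the figures are only the arithmetic consequences of the 0.5-year and 0.28-year doubling times quoted in the abstract.

```python
import math

def growth_factor(years, doubling_time):
    """Multiplicative growth after `years`, given a fixed doubling time (in years)."""
    return 2.0 ** (years / doubling_time)

def years_to_reach(factor, doubling_time):
    """Years needed to grow by `factor` under the same exponential assumption."""
    return doubling_time * math.log2(factor)

one_year_conservative = growth_factor(1.0, 0.5)   # doubling every 0.5 years
years_for_1000x_fast = years_to_reach(1000, 0.28) # doubling every 0.28 years
```

Under the conservative 0.5-year doubling time the candidate pool quadruples each year, while under the 0.28-year figure a thousandfold increase would take a little under three years.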

Article 183

Title@2025-07-29 (2): How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor?

Title: How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor?

Wie gut ist die Erst-Token-Entropie ungefähre Wort-Entropie als psycholinguistischer Vorhersager?

作为心理语言学预测指标,首词元熵对词熵的近似程度如何? 2507.22209v1

Authors (3): Christian Clark, Byung-Doh Oh, William Schuler

Contextual entropy is a psycholinguistic measure capturing the anticipated difficulty of processing a word just before it is encountered. Recent studies have tested for entropy-related effects as a potential complement to well-known effects from surprisal. For convenience, entropy is typically estimated based on a language model’s probability distribution over a word’s first subword token. However, this approximation results in underestimation and potential distortion of true word entropy. To address this, we generate Monte Carlo (MC) estimates of word entropy that allow words to span a variable number of tokens. Regression experiments on reading times show divergent results between first-token and MC word entropy, suggesting a need for caution in using first-token approximations of contextual entropy.

上下文熵是一种心理语言学度量,刻画在即将遇到某个词之前预期的加工难度。近期研究检验了与熵相关的效应,将其视为对已广为人知的惊异度(surprisal)效应的潜在补充。为方便起见,熵通常根据语言模型在一个词的首个子词词元上的概率分布来估计。然而,这种近似会低估真实的词熵,并可能使其失真。为此,我们生成了词熵的蒙特卡洛(MC)估计,允许一个词跨越数量可变的词元。对阅读时间的回归实验显示,首词元熵与MC词熵给出了不一致的结果,这表明在使用上下文熵的首词元近似时需要谨慎。
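
The underestimation described above can be reproduced on a toy next-word distribution. In the sketch below, each word is a tuple of subword tokens with a known probability, which is an assumption made for illustration; with a real language model the words would be sampled token by token. First-token entropy collapses all words sharing a first token, whereas the Monte Carlo estimate averages the surprisal of whole sampled words. The function names are hypothetical.

```python
import math
import random
from collections import defaultdict

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def first_token_entropy(word_probs):
    """Entropy of the marginal distribution over each word's first token."""
    first = defaultdict(float)
    for tokens, p in word_probs.items():
        first[tokens[0]] += p
    return entropy(first.values())

def mc_word_entropy(word_probs, n_samples=20000, seed=0):
    """Monte Carlo estimate of word entropy: mean of -log2 p(word) over samples."""
    rng = random.Random(seed)
    words = list(word_probs)
    weights = [word_probs[w] for w in words]
    draws = rng.choices(words, weights=weights, k=n_samples)
    return sum(-math.log2(word_probs[w]) for w in draws) / n_samples

# Toy next-word distribution: two words share the first token "un".
word_probs = {("un", "happy"): 0.25, ("un", "kind"): 0.25, ("glad",): 0.5}
h_first = first_token_entropy(word_probs)
h_word_exact = entropy(word_probs.values())
h_word_mc = mc_word_entropy(word_probs)
```

Here the first-token approximation misses the uncertainty between "unhappy" and "unkind", so it falls below the true word entropy, which the Monte Carlo estimate approaches as the number of samples grows.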

Article 184

Title@2025-07-29 (2): The role of media memorability in facilitating startups’ access to venture capital funding

Title: The role of media memorability in facilitating startups’ access to venture capital funding

Die Rolle der Medienerinnerung bei der Erleichterung des Zugangs von Start-ups zu Risikokapitalfinanzierungen

媒体可记忆性在促进初创企业获得风险投资方面的作用 2507.22201v1

Authors (3): L. Toschi, S. Torrisi, A. Fronzetti Colladon

Media reputation plays an important role in attracting venture capital investment. However, prior research has focused too narrowly on general media exposure, limiting our understanding of how media truly influences funding decisions. As informed decision-makers, venture capitalists respond to more nuanced aspects of media content. We introduce the concept of media memorability - the media’s ability to imprint a startup’s name in the memory of relevant investors. Using data from 197 UK startups in the micro and nanotechnology sector (funded between 1995 and 2004), we show that media memorability significantly influences investment outcomes. Our findings suggest that venture capitalists rely on detailed cues such as a startup’s distinctiveness and connectivity within news semantic networks. This contributes to research on entrepreneurial finance and media legitimation. In practice, startups should go beyond frequent media mentions to strengthen brand memorability through more targeted, meaningful coverage highlighting their uniqueness and relevance within the broader industry conversation.

媒体声誉在吸引风险投资方面发挥着重要作用。然而,以往研究过于狭隘地关注一般性的媒体曝光,限制了我们对媒体究竟如何影响融资决策的理解。作为知情的决策者,风险投资人会对媒体内容中更细微的方面作出反应。我们提出媒体可记忆性(media memorability)的概念,即媒体将初创企业的名称铭刻在相关投资者记忆中的能力。利用197家英国微纳米技术领域初创企业(于1995年至2004年间获得融资)的数据,我们表明媒体可记忆性对投资结果有显著影响。研究结果表明,风险投资人依赖诸如初创企业在新闻语义网络中的独特性和连通性等细致线索。这为创业融资和媒体合法化方面的研究作出了贡献。在实践中,初创企业不应止步于频繁的媒体提及,而应通过更有针对性、更有意义的报道来增强品牌可记忆性,突出其在更广泛的行业讨论中的独特性和相关性。

Article 185

Title@2025-07-29 (2): Basic Reading Distillation

Title: Basic Reading Distillation

Grundlesedestillation

基础阅读蒸馏 2507.19741v2

Authors (5): Zhi Zhou, Sirui Miao, Xiangyu Duan, Hao Yang, Min Zhang

Large language models (LLMs) have demonstrated remarkable abilities in various natural language processing areas, but they demand substantial computational resources, which limits their deployment in the real world. Distillation is one technique to solve this problem through either knowledge distillation or task distillation. Both distillation approaches train small models to imitate specific features of LLMs, but they both neglect basic reading education for small models on generic texts that are unrelated to downstream tasks. In this paper, we propose basic reading distillation (BRD), which educates a small model to imitate LLMs’ basic reading behaviors, such as named entity recognition, question raising and answering, on each sentence. After such basic education, we apply the small model to various tasks including language inference benchmarks and BIG-bench tasks. The results show that the small model can outperform or perform comparably to LLMs that are over 20x larger. Analysis reveals that BRD effectively influences the probability distribution of the small model and is orthogonal to both knowledge distillation and task distillation.

大型语言模型(LLM)在各类自然语言处理领域展现出卓越的能力,但它们对计算资源的高需求限制了其在现实世界中的部署。蒸馏是解决这一问题的一种技术,包括知识蒸馏和任务蒸馏。这两种蒸馏方法都训练小模型去模仿LLM的特定特征,但都忽视了让小模型在与下游任务无关的通用文本上接受基础阅读教育。本文提出基础阅读蒸馏(BRD),让小模型在每个句子上模仿LLM的基础阅读行为,例如命名实体识别、提出问题和回答问题。经过这种基础教育后,我们将小模型应用于多种任务,包括语言推理基准和BIG-bench任务。结果表明,小模型的表现可以超过或媲美规模大20倍以上的LLM。分析显示,BRD能有效影响小模型的概率分布,并且与知识蒸馏和任务蒸馏均具有正交性。

Article 186

Title@2025-07-29 (2): Explainability Through Systematicity: The Hard Systematicity Challenge for Artificial Intelligence

Title: Explainability Through Systematicity: The Hard Systematicity Challenge for Artificial Intelligence

Erklärbarkeit durch Systematik: Die harte Systematik-Herausforderung für künstliche Intelligenz

系统化解释:人工智能的硬系统化挑战 2507.22197v1

Authors (1): Matthieu Queloz

This paper argues that explainability is only one facet of a broader ideal that shapes our expectations towards artificial intelligence (AI). Fundamentally, the issue is to what extent AI exhibits systematicity, not merely in being sensitive to how thoughts are composed of recombinable constituents, but in striving towards an integrated body of thought that is consistent, coherent, comprehensive, and parsimoniously principled. This richer conception of systematicity has been obscured by the long shadow of the “systematicity challenge” to connectionism, according to which network architectures are fundamentally at odds with what Fodor and colleagues termed “the systematicity of thought.” I offer a conceptual framework for thinking about “the systematicity of thought” that distinguishes four senses of the phrase. I use these distinctions to defuse the perceived tension between systematicity and connectionism and show that the conception of systematicity that historically shaped our sense of what makes thought rational, authoritative, and scientific is more demanding than the Fodorian notion. To determine whether we have reason to hold AI models to this ideal of systematicity, I then argue, we must look to the rationales for systematization and explore to what extent they transfer to AI models. I identify five such rationales and apply them to AI. This brings into view the “hard systematicity challenge.” However, the demand for systematization itself needs to be regulated by the rationales for systematization. This yields a dynamic understanding of the need to systematize thought, which tells us how systematic we need AI models to be and when.

本文认为,可解释性只是塑造我们对人工智能(AI)期望的更广泛理想的一个方面。从根本上说,问题在于AI在多大程度上表现出系统性:不仅是对思想如何由可重组的成分构成保持敏感,更是力求形成一个一致、连贯、全面且原则精简的整体思想体系。这种更丰富的系统性概念,长期被针对联结主义的“系统性挑战”所遮蔽;按照这一挑战,网络架构与Fodor及其同事所称的“思想的系统性”根本不相容。我提出一个思考“思想的系统性”的概念框架,区分了该短语的四种含义。我利用这些区分化解系统性与联结主义之间的所谓张力,并表明历史上塑造了我们关于何为理性、权威和科学的思想的那种系统性概念,比Fodor式的概念要求更高。我进而论证,要判断我们是否有理由以这种系统性理想要求AI模型,必须考察系统化的理据,并探究它们在多大程度上适用于AI模型。我识别出五种这样的理据并将其应用于AI,由此引出“硬系统性挑战”。然而,对系统化的要求本身也需要受系统化理据的调节。这带来了对思想系统化需求的动态理解,它告诉我们需要AI模型在多大程度上、在何时具有系统性。

Article 187

Title@2025-07-29 (2): Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation

Title: Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation

Déjà Vu: Mehrsprachige LLM-Evaluierung durch die Lens of Machine Translation Evaluation

Déjà Vu:通过机器翻译评价的镜头进行多种语文LLM评价 2504.11829v3

Authors (5): Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, Kocmi Tom

Generation capabilities and language coverage of multilingual large language models (mLLMs) are advancing rapidly. However, evaluation practices for generative abilities of mLLMs are still lacking comprehensiveness, scientific rigor, and consistent adoption across research labs, which undermines their potential to meaningfully guide mLLM development. We draw parallels with machine translation (MT) evaluation, a field that faced similar challenges and has, over decades, developed transparent reporting standards and reliable evaluations for multilingual generative models. Through targeted experiments across key stages of the generative evaluation pipeline, we demonstrate how best practices from MT evaluation can deepen the understanding of quality differences between models. Additionally, we identify essential components for robust meta-evaluation of mLLMs, ensuring the evaluation methods themselves are rigorously assessed. We distill these insights into a checklist of actionable recommendations for mLLM research and development.

多语言大型语言模型(mLLM)的生成能力和语言覆盖范围正在迅速提升。然而,针对mLLM生成能力的评估实践仍然缺乏全面性、科学严谨性以及各研究实验室之间的一致采用,这削弱了评估对mLLM开发提供有效指导的潜力。我们将其与机器翻译(MT)评估相类比:该领域曾面临类似挑战,并在数十年间为多语言生成模型建立了透明的报告标准和可靠的评估方法。通过在生成式评估流程的关键阶段开展有针对性的实验,我们展示了MT评估的最佳实践如何加深对模型间质量差异的理解。此外,我们确定了对mLLM进行稳健元评估的必要组成部分,确保评估方法本身得到严格检验。我们将这些见解提炼为一份面向mLLM研发的可操作建议清单。

Article 188

Title@2025-07-29 (2): A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models

Title: A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models

Eine skalierbare Pipeline zur Schätzung von Verb Frame Frequenzen mit großen Sprachmodellen

使用大型语言模型估计动词框架频率的可扩展流水线 2507.22187v1

Authors (2): Adam M. Morgan, Adeen Flinker

We present an automated pipeline for estimating Verb Frame Frequencies (VFFs), the frequency with which a verb appears in particular syntactic frames. VFFs provide a powerful window into syntax in both human and machine language systems, but existing tools for calculating them are limited in scale, accuracy, or accessibility. We use large language models (LLMs) to generate a corpus of sentences containing 476 English verbs. Next, by instructing an LLM to behave like an expert linguist, we have it analyze the syntactic structure of the sentences in this corpus. This pipeline outperforms two widely used syntactic parsers across multiple evaluation datasets. Furthermore, it requires far fewer resources than manual parsing (the gold standard), thereby enabling rapid, scalable VFF estimation. Using the LLM parser, we produce a new VFF database with broader verb coverage, finer-grained syntactic distinctions, and explicit estimates of the relative frequencies of structural alternates commonly studied in psycholinguistics. The pipeline is easily customizable and extensible to new verbs, syntactic frames, and even other languages. We present this work as a proof of concept for automated frame frequency estimation, and release all code and data to support future research.

我们提出一个用于估计动词框架频率(VFF)的自动化流水线,VFF即动词出现在特定句法框架中的频率。VFF为研究人类和机器语言系统中的句法提供了有力的窗口,但现有的计算工具在规模、准确性或可获取性方面存在局限。我们使用大型语言模型(LLM)生成了一个包含476个英语动词的句子语料库。随后,我们指示一个LLM扮演专业语言学家,分析该语料库中句子的句法结构。该流水线在多个评估数据集上优于两个广泛使用的句法分析器。此外,它所需的资源远少于人工分析(金标准),从而实现快速、可扩展的VFF估计。利用这一LLM分析器,我们构建了一个新的VFF数据库,其动词覆盖面更广,句法区分更细,并对心理语言学中常被研究的结构交替的相对频率给出了明确估计。该流水线易于定制,并可扩展到新的动词、句法框架乃至其他语言。我们将这项工作作为自动化框架频率估计的概念验证,并公开全部代码和数据以支持后续研究。
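
Once each generated sentence has been labeled with a verb and a syntactic frame, the final aggregation step amounts to computing relative frequencies per verb. The sketch below shows only that step, assuming the labels already exist as `(verb, frame)` pairs; the frame labels and the function name are hypothetical, and the LLM generation and parsing stages are not modeled.

```python
from collections import Counter, defaultdict

def verb_frame_frequencies(parsed):
    """Turn (verb, frame) pairs into {verb: {frame: relative frequency}}."""
    counts = defaultdict(Counter)
    for verb, frame in parsed:
        counts[verb][frame] += 1
    return {
        verb: {frame: n / sum(frames.values()) for frame, n in frames.items()}
        for verb, frames in counts.items()
    }

# Hypothetical parser output: one (verb, frame) pair per sentence.
parsed = [
    ("give", "NP NP"),   # double object: "gave her the book"
    ("give", "NP PP"),   # prepositional object: "gave the book to her"
    ("give", "NP NP"),
    ("give", "NP NP"),
    ("sleep", "intransitive"),
]
vff = verb_frame_frequencies(parsed)
```

For the toy input, "give" appears in the double-object frame in three of four sentences, so its estimated frequency for that frame is 0.75.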

Article 189

Title@2025-07-29 (2): Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Title: Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Positiv-Augmented Contrastive Learning für Vision-and-Language Evaluation und Training

用于视觉与语言评估和训练的正样本增强对比学习 2410.07336v2

Authors (5): Sara Sarto, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for caption evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.

尽管图像描述生成取得了显著进展,现有评估指标往往无法反映描述的整体质量或细粒度细节,这主要是因为它们依赖不够具体的人工撰写参考或带噪声的预训练数据。然而,找到有效的指标不仅对描述评估至关重要,对生成阶段同样如此:指标可以在描述模型的微调阶段发挥关键作用,最终提升生成描述的质量。本文提出PAC-S++,一种可学习的指标,它基于CLIP模型,在网络采集数据和清洗数据上进行预训练,并通过额外生成的视觉和文本正样本对进行正则化。利用这种更强、经过精心整理的预训练,我们还将PAC-S++用作自临界序列训练(SCST)阶段的奖励,该阶段通常用于微调描述模型。在不同图像和视频数据集上的大量实验表明,与该任务的常用指标相比,PAC-S++更为有效,包括对物体幻觉的敏感性。此外,我们表明,将PAC-S++融入描述模型的微调阶段可以得到语义更丰富、重复和语法错误更少的描述。在域外基准上的评估进一步证明了我们的微调方法在提升模型能力方面的有效性。源代码和训练好的模型公开于:https://github.com/aimagelab/pacscore。

Article 190

Title@2025-07-29 (2): Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles

Title: Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles

Persona-Augmented Benchmarking: Bewertung von LLMs über unterschiedliche Schreibstile hinweg

人人推基准定 : 评价各种写作风格 2507.22168v1

Authors (4): Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu

Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with “non-standard” input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, and recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance across linguistic variations.

目前用于评价大语言模型的基准往往没有表现出足够的写作风格多样性,其中许多人主要遵守标准化的公约,这些基准没有充分反映人类所展示的丰富多样的通信模式,因此,在面临“非标准”投入时,根据这些基准优化的LLM可能显示业绩差。在这项工作中,我们通过使用以人为基础的LLM促进的低成本方法重写评价提示来测试这一假设,这种方法可以模仿不同的写作风格。我们的结果表明,即使具有相同的语义内容,写作风格的变化和迅速格式化也会对所评价的LLM的估计业绩产生重大影响。值得注意的是,我们发现不同的写作风格在一系列模式和任务中始终触发低或高绩效,而不管其型号、大小和正确性如何。我们的工作为扩大现有基准提供了一种可扩展的方法,提高了它们为衡量不同语言的LM业绩所提供的评估的外部有效性。

Article 191

Title@2025-07-29 (2): Strategic Deflection: Defending LLMs from Logit Manipulation

Title: Strategic Deflection: Defending LLMs from Logit Manipulation

Strategische Durchbiegung: LLMs durch Logit-Manipulation verteidigen

战略抵消:保护LLMs免受逻辑操纵 2507.22160v1

Authors (5): Yassine Rachidy, Jihad Rbaiti, Youssef Hmamouche, Faissal Sehbaoui, Amal El Fallah Seghrouchni

With the growing adoption of Large Language Models (LLMs) in critical areas, ensuring their security against jailbreaking attacks is paramount. While traditional defenses primarily rely on refusing malicious prompts, recent logit-level attacks have demonstrated the ability to bypass these safeguards by directly manipulating the token-selection process during generation. We introduce Strategic Deflection (SDeflection), a defense that redefines the LLM’s response to such advanced attacks. Instead of refusing outright, the model produces an answer that is semantically adjacent to the user’s request yet strips away the harmful intent, thereby neutralizing the attack. Our experiments demonstrate that SDeflection significantly lowers Attack Success Rate (ASR) while maintaining model performance on benign queries. This work presents a critical shift in defensive strategies, moving from simple refusal to strategic content redirection to neutralize advanced threats.

随着大语言模型(LLMs)在关键领域的应用日益广泛,确保其抵御越狱攻击的安全性至关重要。传统防御主要依靠拒绝恶意提示,而最近的logit级攻击表明,通过在生成过程中直接操纵词元选择过程,可以绕过这些防护。我们提出战略偏转(SDeflection),这是一种重新定义LLM对此类高级攻击的响应方式的防御方法。模型不是直接拒绝,而是给出一个在语义上与用户请求相近、但去除了有害意图的回答,从而化解攻击。我们的实验表明,SDeflection在保持模型对良性查询的性能的同时,显著降低了攻击成功率(ASR)。这项工作体现了防御策略的重要转变,即从简单拒绝转向战略性的内容重定向,以化解高级威胁。

Article 192

Title@2025-07-29 (2): IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian

Title: IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian

IndoPref: Ein multi-Domain-Pairwise-Preference-Datensatz für Indonesisch

IndoPref:印度尼西亚多域对等优惠数据集 2507.22159v1

Authors (4): Vanessa Rebecca Wiyono, David Anugraha, Ayu Purwarianti, Genta Indra Winata

Over 200 million people speak Indonesian, yet the language remains significantly underrepresented in preference-based research for large language models (LLMs). Most existing multilingual datasets are derived from English translations, often resulting in content that lacks cultural and linguistic authenticity. To address this gap, we introduce IndoPref, the first fully human-authored and multi-domain Indonesian preference dataset specifically designed to evaluate the naturalness and quality of LLM-generated text. All annotations are natively written in Indonesian and evaluated using Krippendorff’s alpha, demonstrating strong inter-annotator agreement. Additionally, we benchmark the dataset across multiple LLMs and assess the output quality of each model.

2亿多人会讲印度尼西亚语,但语言在大型语言模型(LLMs)的以优惠为基础的研究中仍然严重不足,现有的多语文数据集大多来自英文译文,其内容往往缺乏文化和语言真实性。为了弥补这一差距,我们引入了IndoPref,这是印度尼西亚首个专门为评价LLM产生的文本的自然性质和质量而设计的完全由人编写的多域印度尼西亚偏好数据集。所有说明都是印度尼西亚语,用Krippendorff的字母来进行本地撰写,并用Krippendorff的字母来评估,显示出强烈的跨咨询者协议。此外,我们为多个LLMs的数据集设定基准,并评估每个模型的产出质量。
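
The agreement statistic named in the abstract, Krippendorff’s alpha, can be illustrated for nominal labels such as pairwise preference choices. The following is a minimal sketch, not the authors’ evaluation code: the function name, the label values ("A", "B") and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: the labels that different annotators gave
    to the same item. Items with fewer than two labels are skipped.
    """
    coincidences = Counter()  # weighted counts of ordered label pairs
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from two different annotators.
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    totals = Counter()  # marginal total per label
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        # Only one label was ever used, so alpha is strictly undefined;
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return 1.0 - observed / expected


# Hypothetical toy data: two annotators choosing response "A" or "B".
full_agreement = [["A", "A"], ["B", "B"]]
full_disagreement = [["A", "B"], ["A", "B"]]
alpha_agree = krippendorff_alpha_nominal(full_agreement)
alpha_disagree = krippendorff_alpha_nominal(full_disagreement)
```

Alpha equals 1 when annotators always agree, is near 0 when agreement is at chance level, and becomes negative under systematic disagreement.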

Article 193

Title@2025-07-29 (2): The Importance of Facial Features in Vision-based Sign Language Recognition: Eyes, Mouth or Full Face?

Title: The Importance of Facial Features in Vision-based Sign Language Recognition: Eyes, Mouth or Full Face?

Die Bedeutung von Gesichtsfunktionen bei der visionsbasierten Erkennung von Zeichensprachen: Augen, Mund oder Gesicht?

面貌在基于愿景的手语识别中的重要性:眼、嘴还是脸? 2507.20884v2

Authors (2): Dinh Nam Pham, Eleftherios Avramidis

Non-manual facial features play a crucial role in sign language communication, yet their importance in automatic sign language recognition (ASLR) remains underexplored. While prior studies have shown that incorporating facial features can improve recognition, related work often relies on hand-crafted feature extraction and fails to go beyond the comparison of manual features versus the combination of manual and facial features. In this work, we systematically investigate the contribution of distinct facial regions (eyes, mouth, and full face) using two different deep learning models (a CNN-based model and a transformer-based model) trained on an SLR dataset of isolated signs with randomly selected classes. Through quantitative performance and qualitative saliency map evaluation, we reveal that the mouth is the most important non-manual facial feature, significantly improving accuracy. Our findings highlight the necessity of incorporating facial features in ASLR.

非人工面部特征在手语交流中发挥着关键作用,但在自动手语识别(ASLR)中的重要性仍未得到充分探讨。虽然先前的研究显示,包含面部特征可以提高识别度,但相关工作往往依赖手工艺特征提取,没有超越手工艺特征与手工艺特征和面部特征相结合的比较。在这项工作中,我们系统地调查不同面部区域眼、口腔和充分运用两种不同的深层学习模式(以CNN为基础的模型和以变压器为基础的模型)的贡献,这两种模式是受过随机选择班级的孤立标志的 SLR数据集培训的。我们通过定量性能和定性突出的地图评估,发现口部是最重要的非人工面部特征,显著提高了准确性。我们的调查结果突出表明了将面部特征纳入ASLR的必要性。

Article 194

Title@2025-07-29 (2): Prompt Optimization and Evaluation for LLM Automated Red Teaming

Title: Prompt Optimization and Evaluation for LLM Automated Red Teaming

Prompt Optimierung und Auswertung für LLM Automatisiertes Red Teaming

LLM自动红色小组迅速优化和评价 2507.22133v1

Authors (11): Michael Freenor, Lauren Alvarez, Milton Leal, Lily Smith, Joel Garrett, Yelyzaveta Husieva, Madeline Woodruff, Ryan Miller, Erich Kummerfeld, Rafael Medeiros, Sander Schulhoff

Applications that use Large Language Models (LLMs) are becoming widespread, making the identification of system vulnerabilities increasingly important. Automated Red Teaming accelerates this effort by using an LLM to generate and execute attacks against target systems. Attack generators are evaluated using the Attack Success Rate (ASR), the sample mean calculated over the judgment of success for each attack. In this paper, we introduce a method for optimizing attack generator prompts that applies ASR to individual attacks. By repeating each attack multiple times against a randomly seeded target, we measure an attack’s discoverability, the expectation of the individual attack success. This approach reveals exploitable patterns that inform prompt optimization, ultimately enabling more robust evaluation and refinement of generators.

使用大语言模型(LLMS)的应用正在变得日益普及,使得系统脆弱性的识别越来越重要。自动红队队通过使用LLM来加速这项努力,利用LLM来制造和实施对目标系统的袭击。攻击发电机用攻击成功率(ASR)评估,这是根据每次攻击成功率判断结果得出的样本平均值。在本论文中,我们引入了一种将ASR用于个别攻击的攻击发生频率优化的方法。通过对随机种子目标重复多次攻击,我们测量一次攻击的可发现性与个人攻击成功的期望。这个方法揭示了可加以利用的模式,为迅速优化提供了信息,最终使得能够对发电机进行更强有力的评估和改进。
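
The two quantities in this abstract, the pooled ASR and the per-attack discoverability, are both sample means over judged outcomes. A minimal sketch of that arithmetic follows; the attack names and the success/failure judgments are hypothetical, and the code makes no claim about how the authors run or judge attacks.

```python
from statistics import mean


def discoverability(outcomes):
    """Estimate one attack's discoverability: mean success over repeated trials."""
    return mean(1.0 if ok else 0.0 for ok in outcomes)


def attack_success_rate(trials_by_attack):
    """ASR as the sample mean over all judged attempts, pooled across attacks."""
    pooled = [ok for outcomes in trials_by_attack.values() for ok in outcomes]
    return mean(1.0 if ok else 0.0 for ok in pooled)


# Hypothetical judgments: each attack repeated four times against a
# randomly seeded target; True means the judge marked the attempt a success.
trials = {
    "attack_a": [True, True, True, False],
    "attack_b": [False, False, False, False],
    "attack_c": [True, False, False, False],
}

per_attack = {name: discoverability(o) for name, o in trials.items()}
asr = attack_success_rate(trials)
# Attacks ordered from most to least reliably successful.
ranked = sorted(per_attack, key=per_attack.get, reverse=True)
```

Looking at `per_attack` rather than only `asr` separates attacks that succeed reliably from those that succeed once by chance, which is the kind of pattern the abstract says can inform prompt optimization.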

Article 195

Title@2025-07-29 (2): SAKE: Steering Activations for Knowledge Editing

Title: SAKE: Steering Activations for Knowledge Editing

SAKE: Steuerung von Aktivierungen für die Wissensbearbeitung

战略:知识编辑指导活动 2503.01751v2

Authors (4): Marco Scialanga, Thibault Laugel, Vincent Grari, Marcin Detyniecki

As Large Language Models have been shown to memorize real-world facts, the need to update this knowledge in a controlled and efficient manner arises. Designed with these constraints in mind, Knowledge Editing (KE) approaches propose to alter specific facts in pretrained models. However, they have been shown to suffer from several limitations, including their lack of contextual robustness and their failure to generalize to logical implications related to the fact. To overcome these issues, we propose SAKE, a steering activation method that models a fact to be edited as a distribution rather than a single prompt. Leveraging Optimal Transport, SAKE alters the LLM behavior over a whole fact-related distribution, defined as paraphrases and logical implications. Several numerical experiments demonstrate the effectiveness of this method: SAKE is thus able to perform more robust edits than its existing counterparts.

由于大型语言模型已被证明会记忆现实世界的事实,因此需要以可控且高效的方式更新这些知识。知识编辑(KE)方法正是基于这些约束设计的,旨在修改预训练模型中的特定事实。然而,已有研究表明它们存在若干局限,包括缺乏上下文稳健性,以及无法推广到与该事实相关的逻辑蕴含。为了克服这些问题,我们提出SAKE,这是一种激活引导方法,它将待编辑的事实建模为一个分布,而不是单个提示。借助最优传输,SAKE在整个与事实相关的分布(定义为释义和逻辑蕴含)上改变LLM的行为。若干数值实验证明了该方法的有效性:SAKE能够执行比现有方法更稳健的编辑。
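
To make the idea of moving a whole distribution rather than a single point concrete, here is a toy sketch of the closed-form optimal transport map between two one-dimensional Gaussians, T(x) = m_t + (s_t / s_s)(x - m_s). This is an illustration of optimal transport in general, under the assumption of scalar, roughly Gaussian values; it is not SAKE's actual procedure, and the sample values are invented.

```python
from statistics import mean, pstdev


def gaussian_ot_map(source, target):
    """Fit the 1-D Gaussian optimal transport map from `source` to `target`.

    Returns a function T with T(x) = m_t + (s_t / s_s) * (x - m_s), where
    the means and standard deviations are estimated from the samples.
    """
    m_s, s_s = mean(source), pstdev(source)
    m_t, s_t = mean(target), pstdev(target)
    if s_s == 0:
        raise ValueError("source samples have zero variance")
    scale = s_t / s_s
    return lambda x: m_t + scale * (x - m_s)


# Hypothetical scalar activations collected over paraphrases of the old
# fact (source) and of the desired new fact (target).
source_acts = [0.0, 1.0, 2.0, 3.0]
target_acts = [10.0, 12.0, 14.0, 16.0]

steer = gaussian_ot_map(source_acts, target_acts)
steered = [steer(x) for x in source_acts]
```

After the map is applied, the steered samples share the target's mean and spread, so every paraphrase in the source set is shifted consistently instead of only one prompt.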

Article 196

Title@2025-07-29 (2): UserBench: An Interactive Gym Environment for User-Centric Agents

Title: UserBench: An Interactive Gym Environment for User-Centric Agents

UserBench: Eine interaktive Gym-Umgebung für User-Centric-Agenten

用户 Bench: 用户中心代理器的交互式 Gym 环境 2507.22034v1

Authors (12): Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang

Large Language Model (LLM)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors, but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability.

大型语言模型(LLMS)的代理商在推理和工具使用方面取得了令人印象深刻的进展,使他们能够解决复杂的任务。然而,他们与用户积极协作的能力,特别是在目标模糊、演变或间接表达的情况下,仍然没有得到充分利用。为了解决这一差距,我们引入了一个以用户为中心的基准,即用户Bench(用户Bench),该基准旨在评估多方向、偏好驱动的互动中的代理商。用户Bench特征模拟用户,这些用户以特定目标为起点,并逐渐显示偏好,要求代理商积极主动地澄清意向,用工具做出有根据的决定。我们对主要的开放和封闭源LMS的评估显示,任务完成和用户对齐之间有很大的脱节。例如,模型提供的答案与所有用户的意向完全一致,平均只有20%的时间,甚至最先进的模型通过积极互动发现不到所有用户偏好30%。这些结果突出了建筑代理商的挑战,这些代理商不仅仅是任务执行人,而且是真正的合作伙伴。用户Bench提供了一种互动的环境,以测量和推进这一关键能力。

Article 197

Title@2025-07-29 (2): FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression

Title: FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression

FLAT-LLM: Feinkörnige Low-rank Aktivierung Raumtransformation für großsprachliche Modellkompression

FLAT-LLM: 用于大语言模型压缩的精制低级激活空间转换 2505.23966v3

Authors (6): Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang

Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation and expensive calibration procedures, and they produce inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast and accurate, training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis, and employ a greedy budget redistribution strategy to adaptively allocate ranks across decoders. FLAT-LLM achieves efficient and effective weight compression without recovery fine-tuning, and its calibration completes within a few minutes. Evaluated across 5 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.

大语言模型(LLMs)在自然语言处理方面取得了显著进展,但其高昂的计算和内存需求给资源受限环境中的部署带来了挑战。尽管最近的低秩分解方法为结构化压缩提供了一条有前景的途径,但它们往往存在精度下降、校准过程代价高昂的问题,并会产生低效的模型架构,阻碍真实场景中的推理加速。在本文中,我们提出FLAT-LLM,这是一种快速、准确、无需训练的结构化压缩方法,其基础是激活空间中的细粒度低秩变换。具体而言,我们利用通过逐头主成分分析计算得到的截断特征向量对权重进行变换,从而降低隐藏维度,并采用贪心的预算再分配策略在各解码器之间自适应地分配秩。FLAT-LLM无需恢复性微调即可实现高效且有效的权重压缩,校准可在几分钟内完成。在5个模型和11个数据集上的评估表明,FLAT-LLM在泛化和下游性能方面优于结构化剪枝基线,同时相对于基于分解的方法实现了推理加速。

Article 198

Title@2025-07-29 (2): SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers

Title: SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers

SAND-Math: LLMs nutzen, um neuartige, schwierige und nützliche Mathematikfragen und -antworten zu generieren

SAND-Math:利用LLMs生成新创、困难和有用的数学问答 2507.20527v2

Authors (5): Chaitanya Manem, Pratik Prabhanjan Brahma, Prakamya Mishra, Zicheng Liu, Emad Barsoum

The demand for Large Language Models (LLMs) capable of sophisticated mathematical reasoning is growing across industries. However, the development of performant mathematical LLMs is critically bottlenecked by the scarcity of difficult, novel training data. We introduce SAND-Math (Synthetic Augmented Novel and Difficult Mathematics problems and solutions), a pipeline that addresses this by first generating high-quality problems from scratch and then systematically elevating their complexity via a new Difficulty Hiking step. We demonstrate the effectiveness of our approach through two key findings. First, augmenting a strong baseline with SAND-Math data significantly boosts performance, outperforming the next-best synthetic dataset by 17.85 absolute points on the AIME25 benchmark. Second, in a dedicated ablation study, we show our Difficulty Hiking process is highly effective: by increasing average problem difficulty from 5.02 to 5.98, this step lifts AIME25 performance from 46.38% to 49.23%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building more capable and efficient mathematical reasoning LLMs. The SAND-Math dataset is released here: https://huggingface.co/datasets/amd/SAND-MATH

各行业对具备复杂数学推理能力的大语言模型(LLMs)的需求日益增长。然而,高性能数学LLM的发展因缺乏高难度、新颖的训练数据而严重受限。我们提出SAND-Math(Synthetic Augmented Novel and Difficult Mathematics problems and solutions,合成增强的新颖且困难的数学问题与解答),这是一条解决该问题的流水线,它首先从零开始生成高质量问题,然后通过新的Difficulty Hiking(难度提升)步骤系统地提高其复杂度。我们通过两个关键发现证明了该方法的有效性。第一,用SAND-Math数据增强一个强基线可显著提升性能,在AIME25基准上比次优的合成数据集高出17.85个绝对点。第二,在专门的消融研究中,我们表明Difficulty Hiking过程非常有效:通过将平均问题难度从5.02提高到5.98,这一步骤将AIME25的性能从46.38%提升到49.23%。完整的生成流水线、最终数据集和微调模型共同构成了一个实用且可扩展的工具包,可用于构建能力更强、效率更高的数学推理LLM。SAND-Math数据集发布于:https://huggingface.co/datasets/amd/SAND-MATH

Article 199

Title@2025-07-29 (2): Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models

Title: Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models

Vorhersage mikrobieller Ontologie und Pathogenrisiken durch Umweltmetadaten mit großen Sprachmodellen

预测具有大语言模型的环境元数据产生的微生物本体学和病原体风险和病原体风险 2507.21980v1

Authors (2): Hyunwoo Yoo, Gail L. Rosen

Traditional machine learning models struggle to generalize in microbiome studies where only metadata is available, especially in small-sample settings or across studies with heterogeneous label formats. In this work, we explore the use of large language models (LLMs) to classify microbial samples into ontology categories such as EMPO 3 and related biological labels, as well as to predict pathogen contamination risk, specifically the presence of E. coli, using environmental metadata alone. We evaluate LLMs such as ChatGPT-4o, Claude 3.7 Sonnet, Grok-3, and LLaMA 4 in zero-shot and few-shot settings, comparing their performance against traditional models like Random Forests across multiple real-world datasets. Our results show that LLMs not only outperform baselines in ontology classification, but also demonstrate strong predictive ability for contamination risk, generalizing across sites and metadata distributions. These findings suggest that LLMs can effectively reason over sparse, heterogeneous biological metadata and offer a promising metadata-only approach for environmental microbiology and biosurveillance applications.

传统机器学习模式试图在只有元数据的微生物研究中,特别是在小型抽样环境中或以不同标签格式进行的跨类研究中,在只有元数据的情况下,在微生物研究中普遍推广传统机器学习模式。在这项工作中,我们探索使用大型语言模型(LLMs)将微生物样本分类为肿瘤类,如EMPO 3和相关生物标签,以及预测病原体污染风险,特别是E.Coli的存在,仅使用环境元数据即可。我们用零发和几发方式评估ChatGPT-4o、Claude 3.7 Sonnet、Grok-3和LalaMA 4等LLMMMS,将它们的性能与随机森林等传统模型在多个现实世界数据集中的性能进行比较。我们的结果显示,LLMs不仅超越了肿瘤分类方面的标准基线,而且还显示出对污染风险的强烈预测能力,对不同地点和元数据分布进行了概括。这些研究结果表明,LLLMs可以有效地解释稀多、混杂的生物元元元数据,并为环境微生物学和生物巡视应用提供有希望的元方法。

Article 200

Title@2025-07-29 (2): LIMO: Less is More for Reasoning

Title: LIMO: Less is More for Reasoning

LIMO: Weniger ist mehr für Vernunft

LIMO: 较少的理由更多 2502.03387v3

Authors (6): Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu

We challenge the prevailing assumption that complex reasoning in large language models (LLMs) necessitates massive training data. We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples. Specifically, through simple supervised fine-tuning, our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500, surpassing previous fine-tuned models (6.5% on AIME24, 59.2% on MATH500) while using only 1% of the training data required by prior approaches. Furthermore, LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks, outperforming models trained on 100x more data. Synthesizing these findings, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. This hypothesis suggests that the threshold for eliciting complex reasoning is not dictated by task complexity but rather by two key factors: (1) the completeness of the model’s pre-trained knowledge base and (2) the effectiveness of post-training examples in serving as “cognitive templates” that guide reasoning.

我们质疑一种普遍假设,即大语言模型(LLMs)中的复杂推理需要海量训练数据。我们证明,仅用少量示例就能产生复杂的数学推理能力。具体而言,通过简单的监督微调,我们的模型LIMO在AIME24上达到63.3%的准确率,在MATH500上达到95.6%,超过了此前的微调模型(AIME24上6.5%,MATH500上59.2%),而所用训练数据仅为先前方法所需的1%。此外,LIMO表现出很强的分布外泛化能力,在多种基准上取得了45.8%的绝对提升,优于使用100倍数据训练的模型。综合这些发现,我们提出“少即是多”推理假说(LIMO假说):在预训练阶段已全面编码领域知识的基础模型中,复杂推理可以通过极少但经过策略性设计的认知过程示范而产生。该假说认为,激发复杂推理的门槛并非由任务复杂度决定,而是取决于两个关键因素:(1)模型预训练知识库的完整性;(2)训练后示例作为引导推理的“认知模板”的有效性。

Article 201

Title@2025-07-29 (2): Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation

Title: Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation

Kulinarische Kreuzungen: Ein RAG-Rahmen zur Verbesserung der Vielfalt in der kulturübergreifenden Rezeptanpassung

烹饪十字路口:加强跨文化适应性适应多样性的RAG框架 2507.21934v1

Authors (5): Tianyi Hu, Andrea Morales-Garzón, Jingyi Zheng, Maria Maistro, Daniel Hershcovich

In cross-cultural recipe adaptation, the goal is not only to ensure cultural appropriateness and retain the original dish’s essence, but also to provide diverse options for various dietary needs and preferences. Retrieval Augmented Generation (RAG) is a promising approach, combining the retrieval of real recipes from the target cuisine for cultural adaptability with large language models (LLMs) for relevance. However, it remains unclear whether RAG can generate diverse adaptation results. Our analysis shows that RAG tends to overly rely on a limited portion of the context across generations, failing to produce diverse outputs even when provided with varied contextual inputs. This reveals a key limitation of RAG in creative tasks with multiple valid answers: it fails to leverage contextual diversity for generating varied responses. To address this issue, we propose CARRIAGE, a plug-and-play RAG framework for cross-cultural recipe adaptation that enhances diversity in both retrieval and context organization. To our knowledge, this is the first RAG framework that explicitly aims to generate highly diverse outputs to accommodate multiple user preferences. Our experiments show that CARRIAGE achieves Pareto efficiency in terms of diversity and quality of recipe adaptation compared to closed-book LLMs.

在跨文化食谱适应中,目标不仅在于确保文化适当性并保留原始菜的精髓,而且在于为各种饮食需要和偏好提供多种选择。检索增强型(RAG)是一种很有希望的方法,将从用于文化适应性的目标烹饪中检索真实食谱与大语言模型(LLMs)结合起来,但是仍然不清楚RAG是否能够产生不同的适应结果。我们的分析表明,RAG往往过度依赖不同世代之间有限的环境部分,即使提供不同的背景投入,也无法产生不同的产出。这揭示了RAG在创造性任务中存在一个关键局限性,并有多重有效的答案:它未能利用背景多样性来产生不同的反应。为了解决这一问题,我们建议CARRIAG,这是一个用于跨文化食谱适应的插座和播放式的RAG框架,可以增强检索和背景组织的多样性。据我们所知,这是第一个RAG框架,明确旨在产生高度多样化的产出,以适应多种用户的偏好。我们的实验表明,CARRIAGE在食谱适应与封闭式LMS相比,在多样性和质量上达到了帕雷托效率。

Article 202

Title@2025-07-29 (2): Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory

Title: Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory

LLM-Autoscoring-Verlässlichkeit in großformatigen Schriftbeurteilungen unter Verwendung von Generalisierbarkeitstheorien erkunden

利用通用理论探索利用通用理论进行大型书写评估时的可靠性 2507.19980v2

Authors (3): Dan Song, Won-Chan Lee, Hong Jiao

This study investigates the estimation of reliability for large language models (LLMs) in scoring writing tasks from the AP Chinese Language and Culture Exam. Using generalizability theory, the research evaluates and compares score consistency between human and AI raters across two types of AP Chinese free-response writing tasks: story narration and email response. These essays were independently scored by two trained human raters and seven AI raters. Each essay received four scores: one holistic score and three analytic scores corresponding to the domains of task completion, delivery, and language use. Results indicate that although human raters produced more reliable scores overall, LLMs demonstrated reasonable consistency under certain conditions, particularly for story narration tasks. Composite scoring that incorporates both human and AI raters improved reliability, suggesting that hybrid scoring models may benefit large-scale writing assessments.

这项研究调查了大语言模型(LLMs)在从AP中文语言和文化考试中评分写作任务时的可靠性估计。研究采用可概括性理论,评估和比较了在两种AP中文自由应答写任务(故事叙述和电子邮件回应)中人与AI评级员之间的得分一致性:故事叙事和电子邮件回应。这些论文由两名训练有素的人和七名AI评分员独立评分。每篇论文得分四分:一个整体得分和三个分析得分,与任务完成、交付和语言使用领域相对应。结果显示,虽然人类计分员总得分比较可靠,但LLMs在某些条件下表现出了合理的一致性,特别是在故事叙事任务方面。包含人和AI评分员的复合评分提高了可靠性,这支持混合评分模型可为大规模写作评估带来好处。
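
As a rough illustration of how generalizability theory turns a score table into a reliability estimate, the sketch below computes the relative G coefficient for the simplest crossed design, essays by raters with one score per cell. This single-facet design is an assumption chosen for brevity; the study itself also involves task types and several score domains, and the score tables here are invented.

```python
def g_coefficient(scores):
    """Relative G coefficient for a crossed essays x raters design.

    `scores[p][r]` is rater r's score for essay p (one score per cell).
    Variance components come from the two-way ANOVA mean squares.
    """
    n_p, n_r = len(scores), len(scores[0])
    if n_p < 2 or n_r < 2:
        raise ValueError("need at least two essays and two raters")
    grand = sum(map(sum, scores)) / (n_p * n_r)
    essay_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    ss_p = n_r * sum((m - grand) ** 2 for m in essay_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    var_p = max((ms_p - ms_res) / n_r, 0.0)  # essay (true-score) variance
    var_res = max(ms_res, 0.0)               # interaction plus error variance
    denominator = var_p + var_res / n_r
    if denominator == 0:
        raise ValueError("scores have no variance")
    return var_p / denominator


# Hypothetical score tables: three essays, two raters.
consistent = [[1, 2], [2, 3], [3, 4]]  # raters differ only by a constant offset
noisy = [[1, 3], [2, 2], [3, 4]]       # raters also disagree on ordering

g_consistent = g_coefficient(consistent)
g_noisy = g_coefficient(noisy)
```

A constant leniency difference between raters does not lower the relative coefficient, because it leaves the ordering of essays unchanged; only rater-by-essay inconsistency does.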

Article 203

Title@2025-07-29 (2): “Whose Side Are You On?” Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection

Title: “Whose Side Are You On?” Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection

„Auf wessen Seite bist du?” Schätzung der Ideologie von Politik und Nachrichteninhalten mit großen Sprachmodellen und der Auswahl von Demonstrationsobjekten

“你站在谁一边?” 估计政治和新闻内容使用大语言模型和少见的示范选择的意识形态和新闻内容。 2503.20797v2

Authors (3): Muhammad Haroon, Magdalena Wojcieszak, Anshuman Chhabra

The rapid growth of social media platforms has led to concerns about radicalization, filter bubbles, and content bias. Existing approaches to classifying ideology are limited in that they require extensive human effort and the labeling of large datasets, and they cannot adapt to evolving ideological contexts. This paper explores the potential of Large Language Models (LLMs) for classifying the political ideology of online content in the context of the two-party US political spectrum through in-context learning (ICL). Our extensive experiments, which select demonstrations in a label-balanced fashion and are conducted on three datasets comprising news articles and YouTube videos, reveal that our approach significantly outperforms zero-shot and traditional supervised methods. Additionally, we evaluate the influence of metadata (e.g., content source and descriptions) on ideological classification and discuss its implications. Finally, we show how providing the source for political and non-political content influences the LLM’s classification.

社交媒体平台的迅速增长引起了人们对激进化、过滤泡沫和内容偏向的关切。现有的意识形态分类方法有限,因为它们需要广泛的人力努力、大数据集标签、无法适应不断变化的意识形态背景。本文探讨了大语言模型(LLMs)通过通俗学习(ICL)在美国两党政治背景中对在线内容的政治意识形态进行分类的潜力。我们在由新闻文章和YouTube视频组成的三个数据集上进行的关于以标签平衡方式进行示范选择的广泛实验表明,我们的方法大大超过零光和传统监督方法。此外,我们评估了元数据(例如内容来源和描述)对意识形态分类的影响,并讨论了其影响。最后,我们展示了提供政治和非政治内容来源如何影响LM的分类。
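
Label-balanced demonstration selection, as mentioned in the abstract, can be sketched as sampling the same number of examples per class before building the few-shot prompt. The snippet below is a minimal sketch under assumptions: the pool contents, the label names, the prompt template and both function names are placeholders, not the authors' setup.

```python
import random
from collections import defaultdict


def select_balanced_demos(pool, per_label, seed=0):
    """Pick `per_label` demonstrations for every label found in `pool`.

    `pool` is a list of (text, label) pairs.
    """
    by_label = defaultdict(list)
    for text, label in pool:
        by_label[label].append(text)

    rng = random.Random(seed)
    demos = []
    for label in sorted(by_label):
        texts = by_label[label]
        if len(texts) < per_label:
            raise ValueError(f"not enough examples for label {label!r}")
        demos.extend((text, label) for text in rng.sample(texts, per_label))
    rng.shuffle(demos)  # avoid grouping all demos of one label together
    return demos


def build_prompt(demos, query):
    """Format the demonstrations and the query as a few-shot prompt."""
    blocks = [f"Text: {text}\nIdeology: {label}\n" for text, label in demos]
    blocks.append(f"Text: {query}\nIdeology:")
    return "\n".join(blocks)


# Placeholder pool; real demonstrations would be labelled articles or videos.
pool = [
    ("example text 1", "left"),
    ("example text 2", "right"),
    ("example text 3", "left"),
    ("example text 4", "right"),
    ("example text 5", "left"),
    ("example text 6", "right"),
]
demos = select_balanced_demos(pool, per_label=2)
prompt = build_prompt(demos, "text to classify")
```

Balancing the demonstrations keeps the prompt from signalling a class prior to the model through the label frequencies alone.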

Article 204

Title@2025-07-29 (2): Post-Training Large Language Models via Reinforcement Learning from Self-Feedback

Title: Post-Training Large Language Models via Reinforcement Learning from Self-Feedback

Post-Training Große Sprachmodelle durch Stärkung Lernen aus Selbst-Feedback

培训后通过 “ 自我学习 “ 强化学习大语言模式 2507.21931v1

Authors (5): Carel van Niekerk, Renato Vukovic, Benjamin Matthias Ruppik, Hsien-chin Lin, Milica Gašić

Large Language Models (LLMs) often produce plausible but poorly calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model’s own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we define and compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards. RLSF simultaneously (i) refines the model’s probability estimates, restoring well-behaved calibration, and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering. By turning a model’s own uncertainty into useful self-feedback, RLSF affirms reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline and warrants further research in intrinsic rewards for LLM post-training.

大型语言模型(LLMS) 通常产生合理、但协调不力的答案,限制了其在推理密集型任务的可靠性。我们展示了“自我回馈强化学习”这一培训后阶段,该阶段将模型本身的信心用作内在的奖励,模仿人类在没有外部反馈的情况下如何学习。在冻结的LLM产生若干一系列思考解决方案后,我们定义并计算了每个最终答案的可信度,并据此对痕迹进行排序。这些合成偏好随后被用来微调该政策,使其符合标准偏好优化,类似于RLHF,但不需要人类标签、黄金答案或外部调节的奖励。RLSF同时(一) 完善模型的概率估计 – – 恢复良好管理校准 – 并(二) 加强逐步推理,提高算推理和多曲解的处理能力。通过将模型本身的不确定性转化为有用的自我反馈,RLSF确认将内在行为模型的学习作为LM后训练和战争后不断研究中的一项有原则和数据效率的部分。
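
The ranking step described in the abstract, scoring each sampled trace by the confidence of its final answer and turning the ranking into synthetic preferences, might look roughly like the sketch below. The abstract does not spell out the confidence definition, so mean token probability over the answer span is an assumption here, as are the `min_gap` threshold, the dictionary keys and the sample numbers.

```python
from itertools import combinations
from statistics import mean


def answer_confidence(answer_token_probs):
    """Confidence of a final answer span (assumed: mean token probability)."""
    return mean(answer_token_probs)


def preference_pairs(traces, min_gap=0.05):
    """Turn sampled traces into (chosen, rejected) pairs ranked by confidence.

    Each trace is a dict with keys "text" and "answer_token_probs".
    Pairs whose confidence gap is below `min_gap` are skipped as near-ties.
    """
    scored = sorted(
        ((answer_confidence(t["answer_token_probs"]), t["text"]) for t in traces),
        reverse=True,
    )
    pairs = []
    # combinations() on a descending list always yields the higher-confidence
    # trace first.
    for (hi_conf, hi_text), (lo_conf, lo_text) in combinations(scored, 2):
        if hi_conf - lo_conf >= min_gap:
            pairs.append({"chosen": hi_text, "rejected": lo_text})
    return pairs


# Hypothetical chain-of-thought samples for one question.
traces = [
    {"text": "trace A", "answer_token_probs": [0.9, 0.8]},
    {"text": "trace B", "answer_token_probs": [0.6, 0.5]},
    {"text": "trace C", "answer_token_probs": [0.57, 0.55]},
]
pairs = preference_pairs(traces)
```

The resulting pairs have the shape that standard preference-optimization trainers expect, with no human label or gold answer involved in producing them.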

Article 205

Title@2025-07-29 (2): CHIMERA: A Knowledge Base of Scientific Idea Recombinations for Research Analysis and Ideation

Title: CHIMERA: A Knowledge Base of Scientific Idea Recombinations for Research Analysis and Ideation

CHIMERA: Eine Wissensbasis für wissenschaftliche Ideen-Rekombinationen für Forschungsanalyse und -Ideation

CHIMERA: 研究分析和衰变科学理念重组知识库 2505.20779v4

Authors (2): Noy Sternlicht, Tom Hope

A hallmark of human innovation is recombination – the creation of novel ideas by integrating elements from existing concepts and mechanisms. In this work, we introduce CHIMERA, a large-scale Knowledge Base (KB) of over 28K recombination examples automatically mined from the scientific literature. CHIMERA enables large-scale empirical analysis of how scientists recombine concepts and draw inspiration from different areas, and enables training models that propose novel, cross-disciplinary research directions. To construct this KB, we define a new information extraction task: identifying recombination instances in scientific abstracts. We curate a high-quality, expert-annotated dataset and use it to fine-tune a large language model, which we apply to a broad corpus of AI papers. We showcase the utility of CHIMERA through two applications. First, we analyze patterns of recombination across AI subfields. Second, we train a scientific hypothesis generation model using the KB, showing that it can propose novel research directions that researchers rate as inspiring. We release our data and code at https://github.com/noy-sternlicht/CHIMERA-KB.

人类创新的一个标志是重组 – – 通过整合现有概念和机制的要素,创建了新颖思想。在这项工作中,我们引入了CHIMERA(CHIMERA),这是一个大型知识库(KB),拥有28K以上科学文献自动提取的重组实例。CHIMERA使得能够对科学家的再生概念和不同领域的灵感进行大规模的经验分析,并使培训模式能够提出新的、跨学科的研究方向。为了构建这个KB,我们定义了一个新的信息提取任务:在科学摘要中确定再融合实例。我们整理了一个高质量的、专家附加说明的数据集,并用它微调一个大语言模型,我们将其应用于广泛的AI文件。我们通过两种应用展示了CHIMERA的效用。首先,我们分析了跨AI子领域的再融合模式。第二,我们用KB来培训一个科学假说生成模型,表明它可以提出新的研究方向,研究人员将它评为鼓舞人心。我们在 https://github.com/noy-sternlich/CHIK-MARKB公布我们的数据和代码。

Article 206

Title@2025-07-29 (2): Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs

Title: Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs

Rotes Lernen als nützlich betrachtet: Verallgemeinern über gemerkte Daten in LLMs

认为轮试学习有用:在LLMs中普遍使用记忆数据 2507.21914v1

Authors (7): Qinyuan Wu, Soumi Das, Mahsa Amani, Bishwamittra Ghosh, Mohammad Aflah Khan, Krishna P. Gummadi, Muhammad Bilal Zafar

Rote learning is a memorization technique based on repetition. It is commonly believed to hinder generalization by encouraging verbatim memorization rather than deeper understanding. This view is held even for learning factual knowledge, which inevitably requires a certain degree of memorization. In this work, we demonstrate that LLMs can be trained to generalize from rote-memorized data. We introduce a two-phase memorize-then-generalize framework, where the model first rote memorizes factual subject-object associations using a semantically meaningless token and then learns to generalize by fine-tuning on a small set of semantically meaningful prompts. Extensive experiments over 8 LLMs show that the models can reinterpret rote-memorized data through the semantically meaningful prompts, as evidenced by the emergence of structured, semantically aligned latent representations between the two. This surprising finding opens the door to both effective and efficient knowledge injection and possible risks of repurposing the memorized data for malicious usage.

死记硬背式学习是一种基于重复的记忆技术。人们普遍认为,它会鼓励逐字记忆而非更深入的理解,从而阻碍泛化。即使是在学习不可避免地需要一定程度记忆的事实性知识时,这一看法也同样存在。在这项工作中,我们证明可以训练LLM从死记硬背的数据中进行泛化。我们提出一个两阶段的“先记忆、后泛化”框架:模型首先使用一个语义上无意义的词元死记事实性的主体-客体关联,然后通过在一小组语义上有意义的提示上进行微调来学习泛化。在8个LLM上的大量实验表明,模型能够借助语义上有意义的提示重新解释死记硬背的数据,两者之间出现结构化的、语义对齐的潜在表示即为证据。这一令人惊讶的发现既为有效且高效的知识注入打开了大门,也带来了将记忆数据重新用于恶意用途的潜在风险。

Article 207

Title@2025-07-29 (2): SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs

SmoothRot: Kombination von kanalweiser Skalierung und Rotation für quantisierungsfreundliche LLMs

SmoothRot:结合逐通道缩放与旋转,实现量化友好的LLM 2506.05413v2

Authors (3): Patrik Czakó, Gábor Kertész, Sándor Szénási

We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot addresses the critical challenge of massive activation outliers by integrating channel-wise scaling with Hadamard transformations. Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy. Experiments conducted on popular LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot consistently reduces the performance gap between quantized and FP16 models by approximately 10-30% across language generation and zero-shot reasoning tasks, without introducing additional inference latency. Code is available at https://github.com/czakop/smoothrot.

我们展示了“平滑”技术,这是提高大语言模型四位数量化效率的一种创新培训后量化技术。“平滑”通过将频道与哈达马德变换相结合,应对大规模激活外源器的关键挑战。我们的技术有效地将极端外源器转化为有利于量化的激活,大大提高了量化的准确性。在广受欢迎的LLMS(LLAMA2 7B、LLLAMA3.1 8B和Mistral 7B)上进行的实验表明,平滑始终将四分制和FP16模型之间的性能差距缩小约10-30,在语言生成和零点推理任务之间减少约10-30,而没有引入额外的推理延度。守则可在https://github.com/czakop/smoothro查阅。
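The SmoothRot abstract above combines channel-wise scaling with a Hadamard rotation. The following is only a minimal sketch of that general idea in plain Python, not the authors' implementation: the smoothing factor follows a SmoothQuant-style rule, and the names `smooth_then_rotate` and `alpha` are assumptions introduced here for illustration. It shows that both steps leave the layer output unchanged while shrinking the largest activation magnitude.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two)."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def smooth_then_rotate(x, w, alpha=0.5):
    """x: tokens x channels activations, w: channels x outputs weights.

    Assumption: per-channel factor s_j = max|x_j|^alpha / max|w_j|^(1 - alpha).
    """
    n = len(w)
    s = []
    for j in range(n):
        a_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        s.append((a_max ** alpha) / (w_max ** (1.0 - alpha)))
    # Channel-wise scaling: move outlier magnitude from activations to weights.
    x_s = [[row[j] / s[j] for j in range(n)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(n)]
    # Rotation: x_s @ H and H^T @ w_s cancel out because H is orthogonal.
    h = hadamard(n)
    return matmul(x_s, h), matmul(transpose(h), w_s)

# Toy data with one outlier channel (values are made up for illustration).
x = [[0.1, 50.0, -0.2, 0.3], [0.2, -40.0, 0.1, -0.1]]
w = [[0.4, -0.1], [0.5, 0.2], [-0.3, 0.3], [0.2, -0.4]]
x_rot, w_rot = smooth_then_rotate(x, w)
```

Quantization itself is omitted; the point of the sketch is only that the transformed activations have a much flatter range than the originals while `x_rot @ w_rot` equals `x @ w` up to floating-point error.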

Article 208

Title@2025-07-29 (2): SLR: Automated Synthesis for Scalable Logical Reasoning

SLR: Automatisierte Synthese für skalierbares logisches Schließen

SLR: 用于可缩放逻辑理由的自动合成 2506.15787v3

Authors (10): Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting

We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user’s task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.

我们引入了SLR,这是通过可缩放逻辑解释系统评估和培训大语言模型的端到端框架。根据用户的任务规格,SLR自动合成了(一) 用于推理任务的指令迅速,(二) 验证程序,可在模型产出上执行,以提供可核查的奖赏,(三) 潜在的地面真相规则。这一过程完全自动化,可缩放,不需要人手说明,对任务困难提供精确的控制。使用SLR,我们创建SLR-Bench,这是一个由19个提示组成的基准,分为20个课程级别,逐步提高关系、算术和再现的复杂性。大规模评估显示,当代LLLMS随时能够产生综合有效的规则,但往往无法正确推理出逻辑。最近的推理,LLMS显示业绩有所改善,但测试时间计算得非常高,只需1 000个提示就超过300美元。最后,通过SLR的LRA-3-8B精准性课程学习。SLR-B,在SLR-B上实现与GEM-FLash-Lash-Lasimlash通用推理算法基础的等等,在普遍推算中,这些推算能力至GLS-LILS-LisLisLisLisLL。

Article 209

Title@2025-07-29 (2): Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning

Graph-R1: Auf dem Weg zu einem agentischen GraphRAG-Framework durch End-to-End-Verstärkungslernen

Graph-R1:通过端到端强化学习迈向智能体式GraphRAG框架 2507.21892v1

Authors (11): Haoran Luo, Haihong E, Guanting Chen, Qika Lin, Yikai Guo, Fangzhi Xu, Zemin Kuang, Meina Song, Xiaobao Wu, Yifan Zhu, Luu Anh Tuan

Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as entity-relation graphs, but still face challenges in high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design. To address these challenges, we propose Graph-R1, an agentic GraphRAG framework via end-to-end reinforcement learning (RL). It introduces lightweight knowledge hypergraph construction, models retrieval as a multi-turn agent-environment interaction, and optimizes the agent process via an end-to-end reward mechanism. Experiments on standard RAG datasets show that Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality.

通过吸收外部知识,并依赖缺乏结构语义学的块状检索,再获取新一代人(RAG)减轻了LMS的幻觉。GIARAG的方法通过将知识建模成实体关系图改进了RAG,但是在高建筑成本、固定的一次性检索以及依赖长文本推理和迅速设计方面仍然面临挑战。为了应对这些挑战,我们提议GIP-R1,一个通过端到端强化学习(RL)的代理GIGRAG框架。它引入了轻量知识高集构建、模型检索作为多转媒介-环境互动,并通过端到端奖励机制优化代理过程。关于标准RAG数据集的实验显示,在推理精度、检索效率和生成质量方面,GI和RL强化的RAG方法超越了传统的GRAG和RL强化RAG方法。

Article 210

Title@2025-07-29 (2): FrugalRAG: Learning to retrieve and reason for multi-hop QA

FrugalRAG: Lernen, für Multi-Hop-QA abzurufen und zu schlussfolgern

FrugalRAG:学习为多跳问答进行检索与推理 2507.07634v2

Authors (4): Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma

We consider the problem of answering complex questions, given access to a large unstructured document corpus. The de facto approach to solving the problem is to leverage language models that (iteratively) retrieve and reason through the retrieved documents, until the model has sufficient information to generate an answer. Attempts at improving this approach focus on retrieval-augmented generation (RAG) metrics such as accuracy and recall and can be categorized into two types: (a) fine-tuning on large question answering (QA) datasets augmented with chain-of-thought traces, and (b) leveraging RL-based fine-tuning techniques that rely on question-document relevance signals. However, efficiency in the number of retrieval searches is an equally important metric, which has received less attention. In this work, we show that: (1) Large-scale fine-tuning is not needed to improve RAG metrics, contrary to popular claims in recent literature. Specifically, a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help RAG from the perspective of frugality, i.e., the latency due to number of searches at inference time. For example, we show that we can achieve competitive RAG metrics at nearly half the cost (in terms of number of searches) on popular RAG benchmarks, using the same base model, and at a small training cost (1000 examples).

我们考虑了回答复杂问题的问题,因为有了大量结构化的文件资料库,我们考虑了回答复杂的问题的问题。事实上解决问题的方法是利用语言模型,这些语言模型(表面上)通过检索的文件检索和解释,直到该模型有足够的信息来找到答案。改进这一方法的尝试侧重于检索强化的生成(RAG)指标,例如准确性和回忆性,可以分为两类:(a)对大问题的回答(QA)数据集进行微调,增加思考链的痕迹;(b)利用基于RL的微调技术,这些技术依赖于问题文件的相关性信号。然而,检索搜索数量的效率是一个同样重要的衡量标准,但这一衡量标准得到的注意较少。在这项工作中,我们表明:(1) 与最近的文献中流行的说法相反,不需要进行大规模的微调来改进RAG的衡量标准。具体地说,改进的提示性标准“ReAc”管道可以超越HotPA等基准的先进模型数目。 (2) 超额和基于RL的微调技术方法,用于在RAG的50%的搜索中,从我们进行适当的成本模型搜索中可以证明。

Article 211

Title@2025-07-29 (2): WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking

WakenLLM: Bewertung des Potenzials und der Stabilität von LLMs mittels feinkörniger Benchmarking

WakenLLLM:通过细微基准评估LLM公司的合理潜力和稳定性 2507.16199v3

Authors (10): Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu

Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks, where two scenarios may appear: (i) an input sample is genuinely unverifiable, but the model cannot understand why; and (ii) the problem is verifiable, but the model fails to solve it and therefore outputs Unknown. We refer to these cases collectively as the Vague Perception phenomenon. Current evaluations focus on whether such answers are honest, rather than analyzing the limits of LLM reasoning. To address this, we introduce WakenLLM, a framework that quantifies the portion of Unknown outputs attributable to model incapacity and evaluates whether stimulation can convert them into either correct answers (verifiable) or justified (unverifiable) responses with valid reasoning. Our method offers a clearer picture of the limits of LLM reasoning and the potential for corrections across various datasets. Comprehensive experiments on six LLMs suggest that, without any training or parameter revision, LLMs can achieve up to a 68.53% accuracy improvement on Vague Perception samples through guided understanding. Our work reveals that current baseline methods only activate a small portion of LLMs’ reasoning potential, indicating considerable unexplored capacity. This extends the theoretical upper bounds of reasoning accuracy in LLMs. Consequently, this study deepens our understanding of the latent reasoning capacity of LLMs and offers a new perspective on addressing the Vague Perception phenomenon.

大型语言模型(LLMS)经常输出在推理任务中未知的标签,其中可能出现两种假想:(一) 输入样本真正无法核实,但模型无法理解原因;(二) 模型未能解决的可核查问题,因此无法解决,结果不明。我们将这些案例统称为模糊概念现象。目前的评估侧重于这些答案是否诚实,而不是分析LLM推理的限度。为了解决这个问题,我们引入了WakenLLLMM这个框架,这个框架量化了因模型缺乏能力而产生的未知产出部分,并评估了刺激能否将其转化为正确答案(可核实)或合理(不可核实)的答案。我们的方法更清楚地描绘了LLMM推理的局限性和各种数据集的纠正潜力。对六个LMS的全面实验表明,在不进行任何培训或参数修订的情况下,LLMS样本能够达到68.53%的精确度。我们的工作显示,目前的基线方法只能激发LMS推理潜力的一小部分,表明相当的未解释能力。我们的方法更清楚地展示了LMS推理的理论,从而加深了VLMS的深层推理。

Article 212

Title@2025-07-29 (2): FB-RAG: Improving RAG with Forward and Backward Lookup

FB-RAG: Verbesserung der RAG durch Vorwärts- und Rückwärtsblick

FB-RAG:以前向和后向看改进RAG 2505.17206v2

Authors (4): Kushal Chawla, Alfy Samuel, Anoop Kumar, Daben Liu

Traditional Retrieval-Augmented Generation (RAG) struggles with complex queries that lack strong signals to retrieve the most relevant context, forcing a trade-off between choosing a small context that misses key information and a large context that confuses the LLM. To address this, we propose Forward-Backward RAG (FB-RAG), a new training-free framework based on a simple yet powerful forward-looking strategy. FB-RAG employs a light-weight LLM to peek into potential future generations, using evidence from multiple sampled outputs to precisely identify the most relevant context for a final, more powerful generator. This improves performance without the complex fine-tuning or reinforcement learning common in prior work. Across 9 datasets, FB-RAG consistently delivers strong results. Further, the performance gains can be achieved with reduced latency due to a shorter, more focused prompt for the powerful generator. On the EN.QA dataset, FB-RAG matches the leading baseline with over 48% latency reduction or achieves an 8% performance improvement with a 10% latency reduction. Our analysis finds cases where, even when the forward-looking LLM fails to generate correct answers, its attempts are sufficient to guide the final model to an accurate response, demonstrating how smaller LLMs can systematically improve the performance and efficiency of larger ones.

传统回溯-增强型一代(RAG)与复杂的询问斗争,这些询问缺乏强有力的信号,无法找到最相关的背景,迫使在选择缺少关键信息的小背景和混淆LLM的大背景之间权衡取舍。为了解决这个问题,我们提议采用基于简单而有力的前瞻性战略的新的无培训框架,即前背式RAG(FB-RAG),这是一个基于简单而有力的前瞻性战略的新的无培训框架。FB-RAG使用轻量级LM,以偷窥后代,利用多个抽样产出的证据精确地确定最终、更强大的生成器最相关的环境。这在以往工作中没有复杂的微调或强化学习共同之处,就能改善业绩。在9个数据集中,FB-RAG始终提供强有力的成果。此外,由于强大的生成器更短、更集中,从而降低了延迟性,因此可以实现业绩增益。在EN.QA数据集上,FB-RAG将领先基线与超过48%的拉特率降低或实现8 %的绩效改进,同时减少10%的延迟度。我们的分析发现,即使有更精确的预测性地展示了更精确性效率,但最终的答案,但最终的指南却却却却却也无法改进了。
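The FB-RAG abstract above describes scoring context with evidence from the query and from sampled outputs of a light-weight LLM. The sketch below is a loose illustration of that forward/backward idea and not the paper's algorithm: the token-overlap scorer, the `weight` mixing parameter, and the function name `fb_select` are all assumptions made here, and the "drafts" are passed in as plain strings rather than sampled from a model.

```python
def tokens(text):
    """Lower-cased whitespace tokens as a set."""
    return set(text.lower().split())

def overlap(a, b):
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def fb_select(query, chunks, drafts, k=2, weight=0.5):
    """Rank chunks by a mix of backward (query) and forward (draft) evidence.

    drafts: hypothetical answers sampled from a light-weight LLM.
    weight: share given to the query-based score (assumed, not from the paper).
    """
    scored = []
    for chunk in chunks:
        backward = overlap(chunk, query)
        forward = max((overlap(chunk, d) for d in drafts), default=0.0)
        scored.append((weight * backward + (1.0 - weight) * forward, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

query = "who founded acme"
chunks = [
    "acme sells anvils worldwide",
    "alice started the firm in 1999",
    "the weather was mild",
]
drafts = ["alice started acme in 1999", "acme began in 1999 under alice"]
top_with_drafts = fb_select(query, chunks, drafts, k=1)
top_query_only = fb_select(query, chunks, drafts, k=1, weight=1.0)
```

In this toy case the query alone prefers the chunk that merely mentions "acme", while the drafts pull the chunk about the founder to the top, which is the kind of effect the abstract attributes to looking forward.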

Article 213

Title@2025-07-29 (2): AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning

AutoTIR: Autonome Tools Integriertes Reasoning durch Verstärkungslernen

AutoTIR:通过强化学习综合解释理由的自主工具 2507.21836v1

Authors (6): Yifan Wei, Xiaoyan Yu, Yixuan Weng, Tengfei Pan, Angsheng Li, Li Du

Large Language Models (LLMs), when enhanced through reasoning-oriented post-training, evolve into powerful Large Reasoning Models (LRMs). Tool-Integrated Reasoning (TIR) further extends their capabilities by incorporating external tools, but existing methods often rely on rigid, predefined tool-use patterns that risk degrading core language competence. Inspired by the human ability to adaptively select tools, we introduce AutoTIR, a reinforcement learning framework that enables LLMs to autonomously decide whether and which tool to invoke during the reasoning process, rather than following static tool-use strategies. AutoTIR leverages a hybrid reward mechanism that jointly optimizes for task-specific answer correctness, structured output adherence, and penalization of incorrect tool usage, thereby encouraging both precise reasoning and efficient tool integration. Extensive evaluations across diverse knowledge-intensive, mathematical, and general language modeling tasks demonstrate that AutoTIR achieves superior overall performance, significantly outperforming baselines and exhibiting superior generalization in tool-use behavior. These results highlight the promise of reinforcement learning in building truly generalizable and scalable TIR capabilities in LLMs. The code and data are available at https://github.com/weiyifan1023/AutoTIR.

大语言模型(LLMS)在通过以推理为导向的培训后强化后,发展成为强大的大理由模型(LRMs)。工具综合理由模型(TIR)通过纳入外部工具进一步扩大其能力,但现有方法往往依赖僵硬的、预先界定的工具使用模式,这种模式可能使核心语言能力降低;在人类适应性选择工具的能力的启发下,我们引入AutoTIR(AutoTIR),这是一个强化学习框架,使LLMS能够在推理过程中自主决定是否和哪一种工具可以援引,而不是采用静态的工具使用战略。AutoTIR(AIR)利用混合奖励机制,共同优化具体任务答案的正确性、结构化产出的坚持性和对不正确工具使用的惩罚性,从而鼓励精确的推理和高效的工具整合。各种知识密集型、数学和通用语言模型任务的广泛评价表明,AutoTIR在总体业绩上取得优异,大大超出基准,在工具使用行为上表现出超凡。这些结果突出表明了加强学习在LMS内建立真正普遍可计量和可计量的TIR能力的前景。
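The AutoTIR abstract above names three ingredients of its hybrid reward: answer correctness, structured output adherence, and a penalty for incorrect tool usage. A minimal sketch of how such terms could be combined is given below; the additive form, the weights, and the function name `hybrid_reward` are assumptions for illustration and are not taken from the paper.

```python
def hybrid_reward(answer_correct, format_ok, expected_tool, used_tool,
                  w_answer=1.0, w_format=0.2, tool_penalty=0.5):
    """Combine three reward terms into one scalar (hypothetical weights).

    expected_tool / used_tool: tool name, or None when no tool is called.
    """
    reward = w_answer * float(answer_correct) + w_format * float(format_ok)
    if used_tool != expected_tool:
        # Covers calling the wrong tool, calling one needlessly, or skipping one.
        reward -= tool_penalty
    return reward

good = hybrid_reward(True, True, "search", "search")
wrong_tool = hybrid_reward(True, True, "search", "calculator")
needless_tool = hybrid_reward(True, True, None, "search")
```

A correct, well-formatted answer obtained with the wrong tool scores lower than the same answer obtained with the expected tool, which is the behavior the penalty term is meant to encourage.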

Article 214

Title: Introducing HALC: A general pipeline for finding optimal prompting strategies for automated coding with LLMs in the computational social sciences

Einführung von HALC: Eine allgemeine Pipeline für die Suche nach optimalen Promptenstrategien für die automatisierte Codierung mit LLMs in den Computational Social Sciences

介绍HALC:寻找计算社会科学中与LLMs自动编码的最佳加速战略的一般管道 2507.21831v1

Authors (3): Andreas Reich, Claudia Thoms, Tobias Schrimpf

LLMs are seeing widespread use for task automation, including automated coding in the social sciences. However, even though researchers have proposed different prompting strategies, their effectiveness varies across LLMs and tasks. Trial-and-error practices are still widespread. We propose HALC, a general pipeline that allows for the systematic and reliable construction of optimal prompts for any given coding task and model, permitting the integration of any prompting strategy deemed relevant. To investigate LLM coding and validate our pipeline, we sent a total of 1,512 individual prompts to our local LLMs in over two million requests. We test prompting strategies and LLM task performance based on few expert codings (ground truth). When compared to these expert codings, we find prompts that code reliably for single variables ($\alpha_{\text{climate}} = .76$; $\alpha_{\text{movement}} = .78$) and across two variables ($\alpha_{\text{climate}} = .71$; $\alpha_{\text{movement}} = .74$) using the LLM Mistral NeMo. Our prompting strategies are set up in a way that aligns the LLM to our codebook; we are not optimizing our codebook for LLM friendliness. Our paper provides insights into the effectiveness of different prompting strategies, crucial influencing factors, and the identification of reliable prompts for each coding task and model.

然而,尽管研究人员提出了不同的快速战略,但其效力也各不相同。通常,试验和错误做法仍然很普遍。我们建议为任何特定的编码任务和模式系统、可靠地建造最佳速度的普通管道,允许整合认为相关的任何提示战略。为了调查LLM编码和验证我们的管道,我们总共以超过200万个请求向当地LM公司发送了1 512个个人提示。我们测试了基于少数专家编码(地面真相)的快速战略和LLM任务绩效。与这些专家编码相比,我们发现为单一变量($HALFA}$气候=76)可靠代码的提示;$halpha}移动=78,以及两个变量($HALFA}=0.71;$alpha}流动=74。我们利用LM Mistral NeMo测试了快速战略和LLM任务绩效快速分析的模型。我们迅速制定的战略,而不是以最可靠的方式调整我们的LLM的代码。

Article 215

Title@2025-07-29 (2): EEG-CLIP : Learning EEG representations from natural language descriptions

EEG-CLIP : Lernen von EEG-Darstellungen aus natürlichen Sprachbeschreibungen

EEG-CLIP:从自然语言说明中学习EEG代表 2503.16531v2

Authors (3): Tidiane Camaret Ndir, Robin Tibor Schirrmeister, Tonio Ball

Deep networks for electroencephalogram (EEG) decoding are often only trained to solve one specific task, such as pathology or age decoding. A more general task-agnostic approach is to train deep networks to match a (clinical) EEG recording to its corresponding textual medical report and vice versa. This approach was pioneered in the computer vision domain, matching images and their text captions, and subsequently enabled successful zero-shot decoding using textual class prompts. In this work, we follow this approach and develop a contrastive learning framework, EEG-CLIP, that aligns the EEG time series and the descriptions of the corresponding clinical text in a shared embedding space. We investigated its potential for versatile EEG decoding, evaluating performance in a range of few-shot and zero-shot settings. Overall, we show that EEG-CLIP manages to non-trivially align text and EEG representations. Our work presents a promising approach to learn general EEG representations, which could enable easier analyses of diverse decoding questions through zero-shot decoding or training task-specific models from fewer training examples. The code for reproducing our results is available at https://github.com/tidiane-camaret/EEGClip

深层电脑图解码网络通常仅经过培训才能解决病理学或年龄解码等一项具体任务。更一般性的任务不可知性办法是培训深层网络,将(临床)EEEG记录与其相应的文本医学报告相匹配,反之亦然。这种方法在计算机视野域中先行推出,匹配图像及其文本标题,然后允许使用文本类提示成功完成零发解码。在这项工作中,我们遵循这一方法,并开发了一个对比式学习框架EEEG-CLIP,将EEG时间序列和相应的临床文本描述与共同嵌入空间相匹配。我们研究了其多功能 EEEG解码的潜力,评估了几发和零发环境环境的性能。总体而言,我们显示EEG-CLIP管理着非边际的文本和 EEG表示式。我们的工作为学习通用的 EEG 演示提供了一种很有希望的方法,它能够通过零发解码或培训特定任务模型对不同的解码问题进行更方便的分析。我们可以利用的 MAG/AMIADM/ADLAB 用于较少的培训示例。

Article 216

Title@2025-07-29 (2): Modelling Adjectival Modification Effects on Semantic Plausibility

Modellierung adjektiver Modifizierungseffekte auf die semantische Plausibilität

建模形容词修饰对语义合理性的影响 2507.21828v1

Authors (3): Anna Golub, Beate Zywietz, Annerose Eichel

While the task of assessing the plausibility of events such as "news is relevant" has been addressed by a growing body of work, less attention has been paid to capturing changes in plausibility as triggered by event modification. Understanding changes in plausibility is relevant for tasks such as dialogue generation, commonsense reasoning, and hallucination detection, as it allows one to correctly model, for example, "gentle sarcasm" as a sign of closeness rather than unkindness among friends [9]. In this work, we tackle the ADEPT challenge benchmark [6] consisting of 16K English sentence pairs differing by exactly one adjectival modifier. Our modeling experiments provide a conceptually novel method by using sentence transformers, and reveal that both they and transformer-based models struggle with the task at hand, and that sentence transformers, despite their conceptual alignment with the task, even underperform in comparison to models like RoBERTa. Furthermore, an in-depth comparison with prior work highlights the importance of a more realistic, balanced evaluation method: imbalances distort model performance and evaluation metrics, and weaken result trustworthiness.

虽然评估“新闻是相关的”等事件是否可信的任务已经通过越来越多的工作得到处理,但较少注意捕捉事件修改引发的可信程度的变化。理解可行性的变化对于诸如对话生成、常识推理和幻觉探测等任务来说是相关的,因为它能够正确地模拟,例如“gentle sarcasm”是朋友之间亲密而不是不友善的迹象[9]。在这项工作中,我们处理ADEPT挑战基准[6],其中包括16K英语句子对口,完全由一位弹道修饰者所不同。我们的模型实验通过使用变压器提供了一种概念上创新的方法,并揭示了它们和变压器模型与手头的任务和变压器——尽管在概念上与任务一致——甚至与RoBERTa等模型相比不甚完善。此外,与先前的工作进行深入比较,突出表明更现实、平衡的评价方法的重要性:不平衡的模型性能和评估度量度和削弱结果的可靠性。

Article 217

Title@2025-07-29 (2): HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs

HRIPBench: Benchmarking von LLMs bei der Bereitstellung von Informationen zur Schadensreduzierung zur Unterstützung von Drogenkonsumenten

HRIPBENCH:在向吸毒者提供支助的减少危害信息提供中确定LLMs基准 2507.21815v1

Authors (5): Kaixuan Wang, Chenxin Diao, Jason T. Jacques, Zhongliang Guo, Shuai Zhao

The well-being of millions of individuals is challenged by the harms of substance use. Harm reduction as a public health strategy is designed to improve their health outcomes and reduce safety risks. Some large language models (LLMs) have demonstrated a decent level of medical knowledge, promising to address the information needs of people who use drugs (PWUD). However, their performance in relevant tasks remains largely unexplored. We introduce HRIPBench, a benchmark designed to evaluate LLMs' accuracy and safety risks in harm reduction information provision. The benchmark dataset HRIP-Basic has 2,160 question-answer-evidence pairs. The scope covers three tasks: checking safety boundaries, providing quantitative values, and inferring polysubstance use risks. We build the Instruction and RAG schemes to evaluate model behaviours based on their inherent knowledge and the integration of domain knowledge. Our results indicate that state-of-the-art LLMs still struggle to provide accurate harm reduction information and sometimes pose severe safety risks to PWUD. The use of LLMs in harm reduction contexts should be cautiously constrained to avoid inducing negative health outcomes. WARNING: This paper contains illicit content that potentially induces harms.

减少危害是一项公共卫生战略,目的是改善他们的健康结果,减少安全风险。一些大型语言模型(LLMs)已经展示出适当的医疗知识水平,有望满足吸毒者的信息需求(PWUD),然而,他们在相关任务中的绩效在很大程度上仍未得到探讨。我们引入了HRIPBench,这是一个基准,旨在评估LLM在减少危害信息提供方面的准确性和安全风险。基准数据集 HRIP-Basus有2,160对问答证据。范围包括三项任务:检查安全界限,提供数量值,并推断多重物质使用风险。我们建立指令和RAG计划,以基于其固有知识和领域知识的整合来评价示范行为。我们的结果表明,目前最先进的LPBenchms仍然在努力提供准确的减少伤害信息,有时还会给PWUD带来严重的安全风险。在减少伤害背景下使用LLMs应该谨慎地加以限制,以避免产生负面的健康结果。WARNINING:本文载有可能诱发损害的非法内容。

Article 218

Title@2025-07-29 (2): Overview of ADoBo at IberLEF 2025: Automatic Detection of Anglicisms in Spanish

Übersicht über ADoBo bei IberLEF 2025: Automatische Erkennung von Anglizismen auf Spanisch

IberLEF 2025年IberLEF ADoBo ADoBo 概览:西班牙文自动检测 2507.21813v1

Authors (4): Elena Alvarez-Mellado, Jordi Porta-Zamorano, Constantine Lignos, Julio Gonzalo

This paper summarizes the main findings of ADoBo 2025, the shared task on anglicism identification in Spanish proposed in the context of IberLEF 2025. Participants of ADoBo 2025 were asked to detect English lexical borrowings (or anglicisms) from a collection of Spanish journalistic texts. Five teams submitted their solutions for the test phase. Proposed systems included LLMs, deep learning models, Transformer-based models and rule-based systems. The results range from F1 scores of 0.17 to 0.99, which showcases the variability in performance different systems can have for this task.

本文件总结了ADoBo 2025年的主要调查结果,这是在IberLEF 2025年背景下提出的用西班牙文识别古生物的共同任务,要求ADoBo 2025年的参与者从西班牙新闻文本汇编中发现英国的词汇借款(或假象),五个小组提交了测试阶段的解决办法,提议的系统包括LLMS、深学习模型、基于变异器的模型和基于规则的系统,结果从F1分0.17到0.99不等,显示不同系统业绩的变异性可以用于这项任务。
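The ADoBo results above are reported as F1 scores. As a reminder of how that metric behaves, here is a small sketch that computes precision, recall, and F1 over sets of detected borrowings. It assumes exact-match scoring of (sentence id, span) pairs; the shared task's official scorer is not described in the abstract, so this matching rule and the function name `span_f1` are illustrative assumptions.

```python
def span_f1(gold, predicted):
    """Return (precision, recall, F1) for exact-match span detection."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up example: two of three gold anglicisms found, plus one false alarm.
gold = {("s1", "podcast"), ("s1", "streaming"), ("s2", "look")}
predicted = {("s1", "podcast"), ("s2", "look"), ("s2", "casa")}
precision, recall, f1 = span_f1(gold, predicted)
```

Because F1 is the harmonic mean of precision and recall, a system that over-predicts or under-predicts heavily is pulled toward the lower of the two values, which helps explain how scores on the same test set can spread as widely as reported.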

Article 219

Title@2025-07-29 (2): ChartMark: A Structured Grammar for Chart Annotation

ChartMark: Eine strukturierte Grammatik für Chart-Annotation

图表 Mark: 用于图表注释的结构性语法 2507.21810v1

Authors (7): Yiyu Chen, Yifan Wu, Shuyu Shen, Yupeng Xie, Leixian Shen, Hui Xiong, Yuyu Luo

Chart annotations enhance visualization accessibility but suffer from fragmented, non-standardized representations that limit cross-platform reuse. We propose ChartMark, a structured grammar that separates annotation semantics from visualization implementations. ChartMark features a hierarchical framework mapping onto annotation dimensions (e.g., task, chart context), supporting both abstract intents and precise visual details. Our toolkit demonstrates converting ChartMark specifications into Vega-Lite visualizations, highlighting its flexibility, expressiveness, and practical applicability.

图表说明提高了可视化的可视性,但存在限制跨平台再利用的支离破碎、非标准化的表征。我们提出了ChartMark,这是一个结构化的语法,将注解语义与可视化实施区分开来。ChartMark具有向注解维度(例如任务、图表背景)绘制的等级框架,支持抽象意图和准确的直观细节。我们的工具包显示将图 Mark规格转换成Vega-Lite可视化,突出显示其灵活性、表现性和实用性。

Article 220

Title@2025-07-29 (2): Task Arithmetic for Language Expansion in Speech Translation

Aufgabenarithmetik zur Spracherweiterung in der Sprachübersetzung

语音翻译中语言扩展任务 2409.11274v3

Authors (7): Yao-Fei Cheng, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Wen Shen Teo, Siddhant Arora, Shinji Watanabe

Recent progress in large language models (LLMs) has spurred interest in speech-text multimodal foundation models, which achieve strong performance on instruction-tuned speech translation (ST). However, expanding language pairs is costly due to re-training on combined new and previous datasets. To address this, we aim to build a one-to-many ST system from existing one-to-one ST systems using task arithmetic without re-training. Direct application of task arithmetic in ST leads to language confusion; therefore, we introduce an augmented task arithmetic method incorporating a language control model to ensure correct target language generation. Our experiments on MuST-C and CoVoST-2 show BLEU score improvements of up to 4.66 and 4.92, with COMET gains of 8.87 and 11.83. In addition, we demonstrate our framework can extend to language pairs lacking paired ST training data or pre-trained ST models by synthesizing ST models based on existing machine translation (MT) and ST models via task analogies.

在大型语文模式(LLMS)方面最近取得的进展引起了人们对语言-文字多式联运基础模型的兴趣,在经指导的语音翻译(ST)方面表现良好。然而,由于在新的和以前的合并数据集方面进行再培训,扩大对口语文是昂贵的。为了解决这个问题,我们的目标是利用现有的一对一的ST系统,使用不经过再培训的任务算术,从现有的一对一的ST系统建立一个一对一的ST系统。在ST直接应用任务算术会导致语言混乱;因此,我们引入了一种扩大的任务算术方法,其中包括一种语言控制模型,以确保正确生成目标语言。我们在 MuST-C 和 CoVoST-2上进行的实验显示,BLEU的得分提高达4.66和4.92,而知识与技术交流的得益为8.87和11.83。此外,我们通过任务类比将基于现有机器翻译(MT)的ST模型和ST模型合成ST模型。
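The abstract above builds a one-to-many ST system by task arithmetic. The sketch below shows only the plain form of task arithmetic (a task vector is fine-tuned weights minus pre-trained weights, and vectors are added back onto the pre-trained model); it does not include the paper's language control model, and the dictionary-of-lists weight format, the `scale` parameter, and the function names are assumptions for illustration.

```python
def task_vector(pretrained, finetuned):
    """Per-parameter difference between a fine-tuned and the pre-trained model."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }

def merge(pretrained, vectors, scale=1.0):
    """Add scaled task vectors onto the pre-trained weights."""
    merged = {name: list(values) for name, values in pretrained.items()}
    for vector in vectors:
        for name, delta in vector.items():
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta)]
    return merged

# Toy "models" with a single two-element parameter (made-up numbers).
pretrained = {"w": [1.0, 2.0]}
st_en_de = {"w": [1.5, 2.0]}   # hypothetical English-to-German ST model
st_en_fr = {"w": [1.0, 3.0]}   # hypothetical English-to-French ST model
vectors = [task_vector(pretrained, st_en_de), task_vector(pretrained, st_en_fr)]
one_to_many = merge(pretrained, vectors)
```

With real checkpoints the same arithmetic would run over every parameter tensor, and `scale` is usually tuned on held-out data; as the abstract notes, applying this directly in ST can cause language confusion, which is what the authors' additional control model addresses.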

Article 221

Title@2025-07-29 (2): The Problem with Safety Classification is not just the Models

Das Problem der Sicherheitsklassifizierung sind nicht nur die Modelle

安全分类问题不仅仅是模型 2507.21782v1

Authors (1): Sowmya Vajjala

Studying the robustness of Large Language Models (LLMs) to unsafe behaviors is an important topic of research today. Building safety classification models or guard models, which are fine-tuned models for input/output safety classification for LLMs, is seen as one of the solutions to address the issue. Although there is a lot of research on the safety testing of LLMs themselves, there is little research on evaluating the effectiveness of such safety classifiers or the evaluation datasets used for testing them, especially in multilingual scenarios. In this position paper, we demonstrate how multilingual disparities exist in 5 safety classification models by considering datasets covering 18 languages. At the same time, we identify potential issues with the evaluation datasets, arguing that the shortcomings of current safety classifiers are not only because of the models themselves. We expect that these findings will contribute to the discussion on developing better methods to identify harmful content in LLM inputs across languages.

建立安全分类模型或防护模型,这些模型是LLMS投入/产出安全分类的精细模型,被视为解决这一问题的解决方案之一。尽管对LLMS本身的安全测试进行了大量研究,但很少研究评价这种安全分类器或用于测试这些模型的评价数据集的有效性,特别是在多语种情况下。在本立场文件中,我们通过考虑涵盖18种语言的数据集,表明5种安全分类模型存在多语言差异。与此同时,我们查明评价数据集的潜在问题,认为目前的安全分类器的缺点不仅仅是因为模型本身。我们期望这些研究结果将有助于讨论制定更好的方法,查明LLMM在各种语言中投入的有害内容。

Article 222

Title@2025-07-29 (2): Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages

Sparse Autoencoder können sprachspezifische Konzepte über verschiedene Sprachen hinweg erfassen

稀疏自编码器能够捕捉跨多种语言的语言特定概念 2507.11230v2

Authors (6): Lyzander Marciano Andrylie, Inaya Rahmanisa, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji

Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model’s multilingual performance and language output and can be used for language identification with performance comparable to fastText along with more interpretability. Our code is available at https://github.com/LyzanderAndrylie/language-specific-features

了解大型语言模型(LLMS)的多语言机制可以深入了解它们是如何处理不同语言的,然而,这仍然具有挑战性。现有的研究往往侧重于单个神经元,但其多语种性质使得难以将特定语言单位与跨语言代表隔离开来。为了解决这个问题,我们探索了稀少的自动校考员(SAEs),以使他们有能力学习代表不同语言具体和抽象概念的单语种特征。虽然其中一些特征是语言独立的,但特定语言特征的存在仍未得到充分探讨。在这项工作中,我们引入了基于特征激活概率的SAE-LAPE方法,即基于特征激活概率的SAE-LAPE,以识别进料前网络中特定语言特征。我们发现许多这类特征主要出现在模型的中间至最后层,是可以解释的。这些特征影响模型的多语种性能和语言输出,并可用于语言识别与可与快读性相比的功能和解释性能。我们的代码可在https://github.com/Lysanderandoryandrylie/laugal-fecal-fetatatatures上查阅。
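The abstract above says SAE-LAPE identifies language-specific features from feature activation probabilities. The exact scoring rule is not given there, so the following is a sketch under an assumption: a feature counts as language-specific when the entropy of its activation probabilities, normalized across languages, is low. The function names and the activation `threshold` are hypothetical.

```python
import math

def activation_probs(acts_by_lang, threshold=0.0):
    """Fraction of tokens per language on which one feature is active.

    acts_by_lang: {language: [activation of this feature on each token]}.
    """
    return {
        lang: sum(1 for a in acts if a > threshold) / len(acts)
        for lang, acts in acts_by_lang.items()
    }

def activation_entropy(probs):
    """Entropy of activation probabilities normalized over languages."""
    total = sum(probs.values())
    if total == 0.0:
        return float("inf")  # never active: treated as not language-specific
    entropy = 0.0
    for p in probs.values():
        q = p / total
        if q > 0.0:
            entropy -= q * math.log(q)
    return entropy

# Made-up activations for two SAE features over three languages.
specific = {"en": [0.0, 0.0, 0.0, 0.0], "id": [1.2, 0.7, 0.0, 0.9], "de": [0.0, 0.0, 0.0, 0.0]}
shared = {"en": [0.5, 0.0, 0.3, 0.0], "id": [0.0, 0.4, 0.0, 0.6], "de": [0.2, 0.0, 0.0, 0.1]}
h_specific = activation_entropy(activation_probs(specific))
h_shared = activation_entropy(activation_probs(shared))
```

A feature that fires for only one language has entropy zero, while one that fires equally often in every language reaches the maximum, the logarithm of the number of languages; ranking features by this value is one plausible way to shortlist language-specific candidates.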

Article 223

Title@2025-07-29 (2): AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models

AgriEval: Ein umfassender chinesischer Landwirtschafts-Benchmark für große Sprachmodelle

农业:中国大语言模式农业综合基准 2507.21773v1

Authors (8): Lian Yan, Haotian Wang, Chen Tang, Haifeng Liu, Tianyang Sun, Liangliang Liu, Yi Guan, Jingchi Jiang

In the agricultural domain, the deployment of large language models (LLMs) is hindered by the lack of training data and evaluation benchmarks. To mitigate this issue, we propose AgriEval, the first comprehensive Chinese agricultural benchmark with three main characteristics: (1) Comprehensive Capability Evaluation. AgriEval covers six major agriculture categories and 29 subcategories within agriculture, addressing four core cognitive scenarios: memorization, understanding, inference, and generation. (2) High-Quality Data. The dataset is curated from university-level examinations and assignments, providing a natural and robust benchmark for assessing the capacity of LLMs to apply knowledge and make expert-like decisions. (3) Diverse Formats and Extensive Scale. AgriEval comprises 14,697 multiple-choice questions and 2,167 open-ended question-and-answer questions, establishing it as the most extensive agricultural benchmark available to date. We also present comprehensive experimental results over 51 open-source and commercial LLMs. The experimental results reveal that most existing LLMs struggle to achieve 60% accuracy, underscoring the developmental potential in agricultural LLMs. Additionally, we conduct extensive experiments to investigate factors influencing model performance and propose strategies for enhancement. AgriEval is available at https://github.com/YanPioneer/AgriEval/.

在农业领域,大型语言模型(LLMs)的部署因缺乏训练数据和评测基准而受阻。为缓解这一问题,我们提出AgriEval,这是首个综合性中文农业基准,具有三大特点:(1) 综合能力评估。AgriEval涵盖六个主要农业类别和29个子类别,涉及四个核心认知场景:记忆、理解、推理和生成。(2) 高质量数据。数据集取自大学水平的考试和作业,为评估LLMs应用知识和作出类似专家决策的能力提供了自然而稳健的基准。(3) 多样化题型和大规模。AgriEval包含14,697道多项选择题和2,167道开放式问答题,是迄今规模最大的农业基准。我们还给出了51个开源和商用LLMs的全面实验结果。实验结果显示,大多数现有LLMs难以达到60%的准确率,凸显了农业LLMs的发展潜力。此外,我们进行了大量实验,以研究影响模型性能的因素,并提出改进策略。AgriEval可在https://github.com/YanPioneer/AgriEval/获取。

Article 224

Title@2025-07-29 (2): Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal

Title: Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal

Adversariale Verteidigung ohne Adversariale Verteidigung: Verbesserung der Sprachmodell Robustheit über Instanz-Ebene Hauptkomponentenentfernung

无需对抗防御的对抗防御:通过实例级主成分去除增强语言模型的鲁棒性 2507.21750v1

Authors (6): Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin

Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalisation.

预先培训的语言模式(PLMs)在自然语言处理方面取得了长足进展,但仍然容易受到对抗性攻击,使人们对其在现实世界应用中的强健性感到担忧。以前的研究试图通过在培训过程中隐含或明确引入对抗性干扰来减轻对抗性攻击的影响。虽然这两种战略都增强了强健性,但往往产生很高的计算成本。在这项工作中,我们提出了一个简单而有效的附加模块,通过删除实例一级的主要组成部分,而不依赖常规的对抗性防御或干扰原始培训数据,加强PLMs的对抗性强健性。我们的方法将嵌入空间转化为近似高斯特性,从而减少其受对抗性攻击性侵的影响,同时保持语义关系。这种转变将分配方式与尽量减少对抗性噪音对决定边界的影响,加强强健性,而不需要对抗性实例或昂贵的培训时间增强。对八个基准数据集的评价表明,我们的方法在保持攻击前的准确性与基线的可比性的同时,提高了对抗性强性强性,同时实现了稳健性和一般之间的平衡贸易。
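
Illustrative sketch (not taken from the paper): the abstract above describes removing instance-level principal components from the embeddings. A minimal pure-Python version of that idea, under several assumptions, might look as follows. The function names, the choice to remove only the single leading component, the absence of mean-centering, and the use of power iteration are all assumptions made for illustration; the paper's actual module may differ.

```python
import math

def top_component(rows, iters=200):
    """Approximate the leading right singular vector of `rows` by power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim  # deterministic start vector (assumption)
    for _ in range(iters):
        # One power-iteration step: w = X^T (X v)
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(proj[i] * rows[i][j] for i in range(len(rows))) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def remove_top_component(token_embeddings):
    """Project one instance's token embeddings onto the complement of their top component."""
    v = top_component(token_embeddings)
    cleaned = []
    for row in token_embeddings:
        coeff = sum(a * b for a, b in zip(row, v))
        cleaned.append([a - coeff * b for a, b in zip(row, v)])
    return cleaned, v
```

Here `token_embeddings` is assumed to be the list of token vectors for a single input, so the component is estimated per instance rather than over a whole corpus.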

Article 225

Title@2025-07-29 (2): Image Captioning via Compact Bidirectional Architecture

Title: Image Captioning via Compact Bidirectional Architecture

Bildunterschrift über kompakte bidirektionale Architektur

基于紧凑双向架构的图像描述 2201.01984v2

Authors (7): Zijie Song, Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Richang Hong, Meng Wang

Most current image captioning models generate captions from left to right. This unidirectional property means they can leverage only past context, not future context. Although refinement-based models can exploit both past and future context by generating a new caption in a second stage from captions pre-retrieved or pre-generated in the first stage, their decoder generally consists of two networks (i.e., a retriever or captioner in the first stage and a captioner in the second stage), which can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context both implicitly and explicitly while its decoder can be executed in parallel. Specifically, it tightly couples the left-to-right (L2R) and right-to-left (R2L) flows into a single compact model, which serves as a regularization for implicitly exploiting bidirectional context and optionally allows explicit interaction between the two flows, while the final caption is chosen from either the L2R or the R2L flow in a sentence-level ensemble manner. We conduct extensive ablation studies on the MSCOCO benchmark and find that the compact bidirectional architecture and the sentence-level ensemble play more important roles than the explicit interaction mechanism. Combining it seamlessly with a word-level ensemble further enlarges the effect of the sentence-level ensemble. We further extend conventional one-flow self-critical training to a two-flow version under this architecture and achieve new state-of-the-art results among models without vision-language pretraining. Finally, we verify the generality of this compact bidirectional architecture by extending it to an LSTM backbone. Source code is available at https://github.com/YuanEZhou/cbtic.

当前大多数图像描述模型通常从左到右生成描述。这种单向特性使它们只能利用过去的上下文,而无法利用未来的上下文。基于精修的模型虽然可以在第二阶段根据第一阶段预检索或预生成的描述生成新的描述,从而同时利用过去和未来的上下文,但这些模型的解码器通常由两个网络组成(即第一阶段的检索器或描述器,以及第二阶段的描述器),只能顺序执行。本文提出一种用于图像描述的紧凑双向Transformer模型,它能够隐式和显式地利用双向上下文,同时解码器可以并行执行。具体而言,该模型将从左到右(L2R)和从右到左(R2L)两个流紧密耦合到单个紧凑模型中,以此作为正则化来隐式利用双向上下文,并可选择性地允许两个流显式交互,最终的描述则以句子级集成的方式从L2R流或R2L流中选出。我们在MSCOCO基准上进行了大量消融研究,发现紧凑双向架构和句子级集成比显式交互机制发挥了更重要的作用。通过与词级集成无缝结合,句子级集成的效果得到进一步放大。我们还将传统的单流自批判训练扩展为该架构下的双流版本,并在与非视觉-语言预训练模型的比较中取得了新的最先进结果。最后,我们将这一紧凑双向架构扩展到LSTM骨干网络,验证了其通用性。源代码见https://github.com/YuanEZhou/cbtic。
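
Illustrative sketch (not taken from the paper): the abstract above says the final caption is chosen from either the L2R or the R2L flow in a sentence-level ensemble manner. One plausible reading is to score each flow's best caption and keep the higher-scoring one. The scoring rule below (length-normalised log-probability) and the function name are assumptions; the abstract does not specify them.

```python
def sentence_level_ensemble(l2r, r2l):
    """Pick the caption from the flow with the higher mean token log-probability.

    Each argument is a (tokens, token_log_probs) pair. R2L tokens are assumed
    to be stored in generation order, i.e. right to left.
    """
    def score(candidate):
        _, log_probs = candidate
        return sum(log_probs) / max(len(log_probs), 1)

    l2r_tokens, _ = l2r
    r2l_tokens, _ = r2l
    if score(l2r) >= score(r2l):
        return list(l2r_tokens)
    # Reverse the R2L output so the caption reads left to right.
    return list(reversed(r2l_tokens))
```

Because both flows live in one model, the two candidates can in principle be decoded in the same batch, which is what makes this selection step cheap.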

Article 226

Title@2025-07-29 (2): My Life in Artificial Intelligence: People, anecdotes, and some lessons learnt

Title: My Life in Artificial Intelligence: People, anecdotes, and some lessons learnt

Mein Leben in Künstlicher Intelligenz: Menschen, Anekdoten und einige Lektionen gelernt

我在人工智能领域的生涯:人物、轶事和一些经验教训 2504.04142v2

Authors (1): Kees van Deemter

In this very personal workography, I relate my 40-year experiences as a researcher and educator in and around Artificial Intelligence (AI), more specifically Natural Language Processing. I describe how curiosity, and the circumstances of the day, led me to work in both industry and academia, and in various countries, including The Netherlands (Amsterdam, Eindhoven, and Utrecht), the USA (Stanford), England (Brighton), Scotland (Aberdeen), and China (Beijing and Harbin). People and anecdotes play a large role in my story; the history of AI forms its backdrop. I focus on things that might be of interest to (even) younger colleagues, given the choices they face in their own work and life at a time when AI is finally emerging from the shadows.

在这篇非常个人化的工作传记中,我讲述了自己40年来在人工智能(AI)领域及其周边,更具体地说是自然语言处理领域,从事研究和教育的经历。我描述了好奇心和当时的境遇如何促使我在工业界和学术界工作,并辗转多个国家,包括荷兰(阿姆斯特丹、埃因霍温和乌得勒支)、美国(斯坦福)、英格兰(布莱顿)、苏格兰(阿伯丁)和中国(北京和哈尔滨)。人物和轶事在我的故事中占有重要地位;AI的历史则构成了故事的背景。我着重讲述那些(更)年轻的同事可能感兴趣的事情,因为在AI终于走出阴影之际,他们在自己的工作和生活中面临着各种选择。

Article 227

Title@2025-07-29 (2): Technical Report of TeleChat2, TeleChat2.5 and T1

Title: Technical Report of TeleChat2, TeleChat2.5 and T1

Technischer Bericht von TeleChat2, TeleChat2.5 und T1

TeleChat2、TeleChat2.5和T1技术报告 2507.18013v3

Authors (38): Zihan Wang, Xinzhang Liu, Yitong Yao, Chao Wang, Yu Zhao, Zhihao Yang, Wenmin Deng, Kaipeng Jia, Jiaxin Peng, Yuyao Huang, Sishi Xiong, Zhuo Jiang, Kaidong Yu, Xiaohui Hu, Fubei Yao, Ruiyu Fang, Zhuoru Jiang, Ruiting Song, Qiyi Xie, Rui Xue, Xuewei He, Yanlei Xue, Zhu Yuan, Zhaoxi Zhang, Zilu Huang, Shiquan Wang, Xin Wang, Hanming Wu, Mingyuan Wang, Xufeng Zhan, Yuhan Sun, Zhaohu Xing, Yuhao Jiang, Bingkai Yang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li

We introduce the latest series of TeleChat models: TeleChat2, TeleChat2.5, and T1, offering a significant upgrade over their predecessor, TeleChat. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies in both pre-training and post-training stages. The series begins with TeleChat2, which undergoes pretraining on 10 trillion high-quality and diverse tokens. This is followed by Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to further enhance its capabilities. TeleChat2.5 and T1 expand the pipeline by incorporating a continual pretraining phase with domain-specific datasets, combined with reinforcement learning (RL) to improve performance in code generation and mathematical reasoning tasks. The T1 variant is designed for complex reasoning, supporting long Chain-of-Thought (CoT) reasoning and demonstrating substantial improvements in mathematics and coding. In contrast, TeleChat2.5 prioritizes speed, delivering rapid inference. Both flagship models of T1 and TeleChat2.5 are dense Transformer-based architectures with 115B parameters, showcasing significant advancements in reasoning and general task performance compared to the original TeleChat. Notably, T1-115B outperforms proprietary models such as OpenAI’s o1-mini and GPT-4o. We publicly release TeleChat2, TeleChat2.5 and T1, including post-trained versions with 35B and 115B parameters, to empower developers and researchers with state-of-the-art language models tailored for diverse applications.

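我们推出最新系列的TeleChat模型:TeleChat2、TeleChat2.5和T1,相比其前身TeleChat有显著升级。尽管模型架构改动很小,新系列通过在预训练和后训练阶段改进训练策略,取得了大幅性能提升。该系列从TeleChat2开始,它在10万亿个高质量、多样化的token上进行预训练,随后通过监督微调(SFT)和直接偏好优化(DPO)进一步增强能力。TeleChat2.5和T1扩展了这一流程,加入了使用领域特定数据集的持续预训练阶段,并结合强化学习(RL)来提升代码生成和数学推理任务的性能。T1变体面向复杂推理而设计,支持长思维链(CoT)推理,在数学和编程方面有显著改进。相比之下,TeleChat2.5优先考虑速度,可实现快速推理。T1和TeleChat2.5的两个旗舰模型均为基于Transformer的稠密架构,参数量为115B,与原始TeleChat相比,在推理和通用任务性能上有显著进步。值得注意的是,T1-115B优于OpenAI的o1-mini和GPT-4o等专有模型。我们公开发布TeleChat2、TeleChat2.5和T1,包括35B和115B参数的后训练版本,为开发者和研究人员提供面向多种应用的最先进语言模型。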

Article 228

Title@2025-07-29 (2): UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

Title: UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

UnsafeChain: Verbesserung der Sicherheit von Reasoning-Modellen anhand schwieriger Fälle

UnsafeChain:通过困难样例增强推理模型的安全性 2507.21652v1

Authors (3): Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang

As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies have predominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts from diverse sources, where unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against the recent SafeChain and STAR-1 datasets across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision. We release our dataset and code at https://github.com/mbzuai-nlp/UnsafeChain.

随着大型推理模型(LRMs)能力不断增强,思维链(CoT)推理带来了新的安全挑战。现有基于SFT的安全对齐研究主要侧重于筛选带有安全、高质量回应的提示,却忽视了总是引发有害输出的困难提示。为填补这一空白,我们提出UnsafeChain,这是一个由多种来源的困难提示构建的安全对齐数据集,其中不安全的回答被识别出来并明确修正为安全回应。通过让模型接触不安全行为并引导其纠正,UnsafeChain在保持通用推理能力的同时提升了安全性。我们在UnsafeChain上微调了三个LRMs,并在六个分布外基准和五个分布内基准上与近期的SafeChain和STAR-1进行比较。UnsafeChain始终优于先前的数据集,即使只用1K子集也能达到或超过基线性能,证明了基于纠正的监督的有效性和泛化能力。我们的数据集和代码发布于https://github.com/mbzuai-nlp/UnsafeChain

Article 229

Title@2025-07-29 (2): Libra: Assessing and Improving Reward Model by Learning to Think

Title: Libra: Assessing and Improving Reward Model by Learning to Think

Libra: Bewertung und Verbesserung des Belohnungsmodells durch Lernen zu denken

Libra:通过学会思考来评估和改进奖励模型 2507.21645v1

Authors (8): Meng Zhou, Bei Li, Jiahao Liu, Xiaowen Shi, Yang Bai, Rongxiang Weng, Jingang Wang, Xunliang Cai

Reinforcement learning (RL) has significantly improved the reasoning ability of large language models. However, current reward models underperform in challenging reasoning scenarios, and predominant RL training paradigms rely on rule-based or reference-based rewards, which impose two critical limitations: 1) the dependence on finely annotated reference answers to attain rewards; and 2) the requirement for a constrained output format. These limitations fundamentally hinder further RL data scaling and sustained enhancement of model reasoning performance. To address these limitations, we propose a comprehensive framework for evaluating and improving the performance of reward models in complex reasoning scenarios. We first present a reasoning-oriented benchmark (Libra Bench), systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models, to address the limitations of existing reward model benchmarks in reasoning scenarios. We further introduce a novel approach for improving the generative reward model via learning-to-think methodologies. Based on the proposed approach, we develop the Libra-RM series, a collection of generative reward models with reasoning capabilities that achieve state-of-the-art results on various benchmarks. Comprehensive downstream experiments demonstrate the correlation between our Libra Bench and downstream applications, and the potential of Libra-RM to further improve reasoning models with unlabeled data.

强化学习(RL)显著提升了大型语言模型的推理能力。然而,当前的奖励模型在高难度推理场景中表现不佳,而主流RL训练范式依赖基于规则或基于参考答案的奖励,这带来两个关键限制:1) 需要依赖精细标注的参考答案才能获得奖励;2) 要求输出格式受限。这些限制从根本上阻碍了RL数据的进一步扩展和模型推理性能的持续提升。为此,我们提出一个综合框架,用于评估和改进奖励模型在复杂推理场景中的表现。我们首先提出面向推理的基准Libra Bench,它由多种高难度数学问题和先进推理模型系统构建而成,以弥补现有奖励模型基准在推理场景中的不足。我们进一步提出一种通过学会思考(learning-to-think)方法改进生成式奖励模型的新途径。基于该方法,我们开发了Libra-RM系列,这是一组具备推理能力的生成式奖励模型,在多个基准上取得了最先进的结果。我们进行了全面的下游实验,结果表明Libra Bench与下游应用之间存在相关性,也显示了Libra-RM利用无标注数据进一步改进推理模型的潜力。

Article 230

Title@2025-07-29 (2): Probing then Editing Response Personality of Large Language Models

Title: Probing then Editing Response Personality of Large Language Models

Probing dann Editing Response Persönlichkeit von großen Sprachmodellen

检验后编辑大语言模型的个性反应 2504.10227v2

Authors (10): Tianjie Ju, Zhenyu Shao, Bowen Wang, Yujia Chen, Zhuosheng Zhang, Hao Fei, Mong-Li Lee, Wynne Hsu, Sufeng Duan, Gongshen Liu

Large Language Models (LLMs) have demonstrated promising capabilities to generate responses that simulate consistent personality traits. Despite the major attempts to analyze personality expression through output-based evaluations, little is known about how such traits are internally encoded within LLM parameters. In this paper, we introduce a layer-wise probing framework to systematically investigate the layer-wise capability of LLMs in simulating personality for responding. We conduct probing experiments on 11 open-source LLMs over the PersonalityEdit benchmark and find that LLMs predominantly simulate personality for responding in their middle and upper layers, with instruction-tuned models demonstrating a slightly clearer separation of personality traits. Furthermore, by interpreting the trained probing hyperplane as a layer-wise boundary for each personality category, we propose a layer-wise perturbation method to edit the personality expressed by LLMs during inference. Our results show that even when the prompt explicitly specifies a particular personality, our method can still successfully alter the response personality of LLMs. Interestingly, the difficulty of converting between certain personality traits varies substantially, which aligns with the representational distances in our probing experiments. Finally, we conduct a comprehensive MMLU benchmark evaluation and time overhead analysis, demonstrating that our proposed personality editing method incurs only minimal degradation in general capabilities while maintaining low training costs and acceptable inference latency. Our code is publicly available at https://github.com/universe-sky/probing-then-editing-personality.

大型语言模型(LLMs)已展现出生成模拟一致人格特质的回应的良好能力。尽管已有大量工作尝试通过基于输出的评估来分析人格表达,但人们对这些特质如何在LLM参数内部编码仍知之甚少。本文提出一个逐层探测框架,系统研究LLMs在回应中模拟人格的逐层能力。我们在PersonalityEdit基准上对11个开源LLMs进行了探测实验,发现LLMs主要在中层和上层模拟回应人格,其中指令微调模型对人格特质的区分略为清晰。此外,通过将训练得到的探测超平面解释为每个人格类别的逐层边界,我们提出一种逐层扰动方法,用于在推理时编辑LLMs表达的人格。结果表明,即使提示明确指定了某种人格,我们的方法仍能成功改变LLMs的回应人格。有趣的是,不同人格特质之间的转换难度差异很大,这与探测实验中的表征距离相一致。最后,我们进行了全面的MMLU基准评估和时间开销分析,表明所提出的人格编辑方法对通用能力的损害极小,同时保持较低的训练成本和可接受的推理延迟。我们的代码公开于https://github.com/universe-sky/probing-then-editing-personality。
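
Illustrative sketch (not taken from the paper): the abstract above treats a trained probe's hyperplane as a per-layer boundary for a personality category and perturbs hidden states relative to it. A minimal geometric version of such a perturbation, assuming a linear probe `w·x + b = 0` whose positive side is the target personality, is shown below. The function name, the `margin` parameter, and the choice to move exactly along the unit normal are assumptions for illustration.

```python
import math

def push_across_hyperplane(hidden, weight, bias, margin=1.0):
    """Move `hidden` to signed distance `margin` on the positive side of w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weight))
    signed_dist = (sum(h * w for h, w in zip(hidden, weight)) + bias) / norm
    step = margin - signed_dist  # distance still to travel along the unit normal
    if step <= 0:
        return list(hidden)  # already at least `margin` inside the target side
    return [h + step * w / norm for h, w in zip(hidden, weight)]
```

Moving along the probe's normal is the smallest L2 change that reaches a given margin, which fits the abstract's claim of only minimal degradation in general capabilities, although the paper's exact update rule may differ.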

Article 231

Title@2025-07-29 (2): Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search

Title: Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search

Stratege: Selbstverbesserung der LLM-Entscheidungsfindung über die Bi-Level-Baumsuche

战略:通过双层树木搜索自我改善LLM决策 2408.10635v3

Authors (8): Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu

Traditional reinforcement learning and planning typically require vast amounts of data and training to develop effective policies. In contrast, large language models (LLMs) exhibit strong generalization and zero-shot capabilities, but struggle with tasks that require detailed planning and decision-making in complex action spaces. We introduce STRATEGIST, a novel approach that integrates the strengths of both methods. Our approach leverages LLMs to search and update high-level strategies (as text), which are then refined and executed by low-level Monte Carlo Tree Search (MCTS). STRATEGIST is a generalizable framework that optimizes the strategy through population-based self-play simulations without the need for any training data. We demonstrate the effectiveness of STRATEGIST in learning optimal strategies for competitive, multi-turn games with partial information, including Game of Pure Strategy (GOPS) and multi-agent, hidden-identity discussion games like The Resistance: Avalon. Our results show that agents equipped with STRATEGIST outperform those trained with traditional RL methods, other LLM-based skill acquisition techniques, and pre-existing LLM agents across both game environments, and achieve performance comparable to that of human players.

传统强化学习和规划通常需要大量的数据和培训才能制定有效的政策,相比之下,大型语言模型(LLMS)具有很强的通用和零射能力,但与需要在复杂行动空间进行详细规划和决策的任务抗争。我们引入了SSTATEGIST,这是将两种方法的优势结合起来的一种新颖办法。我们的方法利用LLMS搜索和更新高层次战略(作为文本),这些战略随后由低层次的蒙特卡洛树搜索(MCTS)加以完善和执行。STATEGIST是一个一般化的框架,通过基于人口的自我游戏模拟来优化战略,而无需任何培训数据。我们展示了STATEGIST在学习竞争性、多转盘游戏的最佳战略方面的有效性,包括普里战略游戏(GOPS)和多媒介、隐性的讨论游戏,如抵抗:阿瓦隆。我们的结果显示,配备STATEGISTS的代理器超越了那些经过传统RL方法、其他基于LM的技能获取技术的技术技能技术、在游戏环境中的前LM代理者,并取得与人类玩家相似的业绩。

Article 232

Title@2025-07-29 (2): Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Title: Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Latent Adversarial Training verbessert Robustheit für persistente schädliche Verhalten in LLMs

潜在对抗训练提升LLM对持续性有害行为的鲁棒性 2407.15549v3

Authors (11): Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper

Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of ‘jailbreaking’ techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.

大型语言模型(LLMs)经常会被诱导出一些它们经过明确微调本不应出现的不良行为。例如,LLM红队测试文献已经提出了各种各样的“越狱”技术,用于从经过微调以保持无害的模型中引出有害文本。近期关于红队测试、模型编辑和可解释性的工作表明,这一挑战源于(对抗性)微调在很大程度上只是抑制而非消除LLMs的不良能力。先前的工作引入了潜在对抗训练(LAT),以提高对广泛类别故障的鲁棒性。这些工作考虑的是无目标的潜在空间攻击,即攻击者扰动潜在激活,使模型在良好行为样例上的损失最大化。无目标LAT可以提供一种通用的鲁棒性,但没有利用特定故障模式的信息。本文尝试有目标的LAT,即攻击者力求使某个特定竞争任务上的损失最小化。我们发现它可以增强多种最先进的方法。首先,我们用有目标LAT提高对越狱的鲁棒性,以低几个数量级的计算量超过了强大的R2D2基线。其次,我们用它在不了解触发器的情况下更有效地清除后门。最后,我们用它更有效地遗忘特定不良任务的知识,并且对重新学习也更加鲁棒。总体而言,我们的结果表明,有目标LAT可以成为防御LLMs有害行为的有效工具。
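
Illustrative sketch (not taken from the paper): in targeted LAT as described above, the adversary perturbs latent activations to minimize loss on a specific competing task. The toy below shows only that inner adversary step, for a single latent vector feeding a linear classification head, using projected gradient descent inside an L2 ball. The linear head, the L2 constraint, the step sizes, and all names are assumptions; the outer training step that makes the model robust to the perturbation is not shown.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def target_loss(head, latent, target):
    """Cross-entropy of the linear head's prediction against the adversary's target class."""
    logits = [sum(w * h for w, h in zip(row, latent)) for row in head]
    return -math.log(softmax(logits)[target])

def targeted_latent_attack(head, latent, target, eps=1.0, lr=0.5, steps=20):
    """Find delta with ||delta||_2 <= eps that lowers cross-entropy toward `target`."""
    delta = [0.0] * len(latent)
    for _ in range(steps):
        perturbed = [h + d for h, d in zip(latent, delta)]
        logits = [sum(w * h for w, h in zip(row, perturbed)) for row in head]
        probs = softmax(logits)
        # Gradient of the loss w.r.t. the latent: W^T (softmax - onehot(target))
        grad = [sum((probs[k] - (1.0 if k == target else 0.0)) * head[k][j]
                    for k in range(len(head))) for j in range(len(latent))]
        delta = [d - lr * g for d, g in zip(delta, grad)]
        norm = math.sqrt(sum(d * d for d in delta))
        if norm > eps:  # project back onto the L2 ball
            delta = [d * eps / norm for d in delta]
    return delta
```

An untargeted variant would instead ascend the loss on examples of desirable behavior; the targeted form uses knowledge of the specific failure mode, which is the distinction the abstract draws.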

Article 233

Title@2025-07-29 (2): Multilingual JobBERT for Cross-Lingual Job Title Matching

Title: Multilingual JobBERT for Cross-Lingual Job Title Matching

Mehrsprachiger JobBERT für Cross-Lingual Job Title Matching

跨语言工作职称匹配多语言工作BERT 2507.21609v1

Authors (3): Jens-Joris Decorte, Matthias De Lange, Jeroen Van Hautte

We introduce JobBERT-V3, a contrastive learning-based model for cross-lingual job title matching. Building on the state-of-the-art monolingual JobBERT-V2, our approach extends support to English, German, Spanish, and Chinese by leveraging synthetic translations and a balanced multilingual dataset of over 21 million job titles. The model retains the efficiency-focused architecture of its predecessor while enabling robust alignment across languages without requiring task-specific supervision. Extensive evaluations on the TalentCLEF 2025 benchmark demonstrate that JobBERT-V3 outperforms strong multilingual baselines and achieves consistent performance across both monolingual and cross-lingual settings. While not the primary focus, we also show that the model can be effectively used to rank relevant skills for a given job title, demonstrating its broader applicability in multilingual labor market intelligence. The model is publicly available: https://huggingface.co/TechWolf/JobBERT-v3.

我们提出JobBERT-V3,这是一个基于对比学习的跨语言职位名称匹配模型。我们的方法以最先进的单语模型JobBERT-V2为基础,利用合成翻译和一个包含2100多万个职位名称的均衡多语言数据集,将支持范围扩展到英语、德语、西班牙语和中文。该模型保留了前代以效率为重点的架构,同时无需任务特定的监督即可实现稳健的跨语言对齐。在TalentCLEF 2025基准上的大量评估表明,JobBERT-V3优于强大的多语言基线,并在单语和跨语言设置中均取得一致的性能。虽然并非主要关注点,我们还表明该模型可以有效地为给定职位名称对相关技能进行排序,显示了其在多语言劳动力市场情报中更广泛的适用性。该模型已公开:https://huggingface.co/TechWolf/JobBERT-v3。

Article 234

Title@2025-07-29 (2): Pralekha: Cross-Lingual Document Alignment for Indic Languages

Title: Pralekha: Cross-Lingual Document Alignment for Indic Languages

Pralekha: Cross-Lingual Document Alignment für indische Sprachen

Pralekha:印度语交叉语言文档协调 2411.19096v2

Authors (5): Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Raj Dabre

Mining parallel document pairs for document-level machine translation (MT) remains challenging due to the limitations of existing Cross-Lingual Document Alignment (CLDA) techniques. Most approaches rely on metadata such as URLs, which is often unavailable in low-resource language settings, while others represent documents using pooled sentence embeddings, which fail to capture fine-grained alignment cues. Moreover, current sentence embedding models have limited context windows, hindering their ability to represent document-level information effectively. To address these challenges for Indic languages, we introduce PRALEKHA, a large-scale benchmark for evaluating document-level alignment techniques. It contains over 3 million aligned document pairs across 11 Indic languages and English, of which 1.5 million are English–Indic pairs. Furthermore, we propose Document Alignment Coefficient (DAC), a novel metric for fine-grained document alignment. Unlike pooling-based approaches, DAC aligns documents by matching smaller chunks and computes similarity as the ratio of aligned chunks to the average number of chunks in a pair. Intrinsic evaluation shows that DAC achieves substantial improvements over pooling-based baselines, particularly in noisy scenarios. Extrinsic evaluation further demonstrates that document MT models trained on DAC-aligned pairs consistently outperform those using baseline alignment methods. These results highlight DAC’s effectiveness for parallel document mining. The PRALEKHA dataset and CLDA evaluation framework will be made publicly available.

由于现有跨语言文档对齐(CLDA)技术的局限,为文档级机器翻译(MT)挖掘平行文档对仍然具有挑战性。大多数方法依赖URL等元数据,而这类元数据在低资源语言环境中往往不可得;另一些方法用池化的句子嵌入表示文档,无法捕捉细粒度的对齐线索。此外,现有句子嵌入模型的上下文窗口有限,妨碍其有效表示文档级信息。为应对印度语言面临的这些挑战,我们提出PRALEKHA,这是一个用于评估文档级对齐技术的大规模基准。它包含11种印度语言和英语的300多万个对齐文档对,其中150万个是英语-印度语言对。此外,我们提出文档对齐系数(DAC),这是一种用于细粒度文档对齐的新指标。与基于池化的方法不同,DAC通过匹配较小的文本块来对齐文档,并将相似度计算为已对齐文本块数与一对文档平均文本块数之比。内在评估表明,DAC相比基于池化的基线有大幅提升,在噪声场景下尤为明显。外在评估进一步表明,在DAC对齐的文档对上训练的文档级MT模型始终优于使用基线对齐方法的模型。这些结果凸显了DAC在平行文档挖掘中的有效性。PRALEKHA数据集和CLDA评估框架将公开发布。
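
Illustrative sketch (not taken from the paper): the abstract above defines DAC as the ratio of aligned chunks to the average number of chunks in a document pair, i.e. DAC = 2 * aligned / (chunks_a + chunks_b). The function below computes just that ratio. How chunks are produced and how a chunk pair is judged to be aligned are not specified in the abstract, so the aligned count is taken as an input; the function and parameter names are hypothetical.

```python
def document_alignment_coefficient(num_aligned, num_chunks_a, num_chunks_b):
    """Ratio of aligned chunks to the average chunk count of the two documents."""
    average_chunks = (num_chunks_a + num_chunks_b) / 2
    if average_chunks == 0:
        return 0.0  # two empty documents: define the score as 0 (assumption)
    return num_aligned / average_chunks
```

For example, two documents with 4 and 6 chunks and 3 aligned chunks have an average of 5 chunks, giving a DAC of 3 / 5 = 0.6.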

Article 235

Title@2025-07-29 (2): A Detailed Factor Analysis for the Political Compass Test: Navigating Ideologies of Large Language Models

Title: A Detailed Factor Analysis for the Political Compass Test: Navigating Ideologies of Large Language Models

Eine detaillierte Faktorenanalyse für den politischen Kompasstest: Navigieren von Ideologien großer Sprachmodelle

政治罗盘测试的详细因素分析:探寻大语言模型的意识形态 2506.22493v2

Authors (7): Sadia Kamal, Lalu Prasad Yadav Prakash, S M Rafiuddin, Mohammed Rakib, Arunkumar Bagavathi, Atriya Sen, Sagnik Ray Choudhury

The Political Compass Test (PCT) and similar questionnaires have been used to quantify LLMs’ political leanings. Building on a recent line of work that examines the validity of the PCT, we demonstrate that variation in standard generation parameters does not significantly impact the models’ PCT scores. However, external factors such as prompt variations and fine-tuning, individually and in combination, do affect them. Finally, we demonstrate that when models are fine-tuned on text datasets with higher political content than others, the PCT scores are not differentially affected. This calls for a thorough investigation into the validity of the PCT and similar tests, as well as the mechanism by which political leanings are encoded in LLMs.

利用政治指南测试或类似的问卷来量化LLM的政治倾向。根据最近审查PCT测试有效性的工作方针,我们证明标准生成参数的变化不会对模型的PCT分数产生重大影响,但是,迅速变异和个别微调等外部因素和组合影响相同。最后,我们证明,当模型对政治内容高于其他内容的文本数据集进行微调时,PCT分数不会受到不同影响。这要求彻底调查PCT和类似测试的有效性,以及将政治倾斜纳入LMS的机制。

Article 236

Title: AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

AIM: Adaptive Schlussfolgerung von Multi-Modal LLMs über Token Merging und Pruning

AIM:通过Token合并与剪枝实现多模态LLM的自适应推理 2412.03248v2

Authors (4): Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang

Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on a large number of visual tokens from visual encoders, leading to high computational demands that limit their applicability in resource-constrained environments and for long-context tasks. In this work, we propose a training-free adaptive inference method for multi-modal LLMs that can accommodate a broad range of efficiency requirements with a minimal performance drop. Our method consists of (a) iterative token merging based on embedding similarity before the LLM, and (b) progressive token pruning within LLM layers based on multi-modal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computation load (e.g., a 7-fold reduction in FLOPs) while preserving the performance of video and image LLMs. Further, at a similar computational cost, our method outperforms state-of-the-art methods in long video understanding (e.g., +4.6 on MLVU). Additionally, our in-depth analysis provides insights into token redundancy and LLM layer behaviors, offering guidance for future research on designing efficient multi-modal LLMs. Our code is available at https://github.com/LaVi-Lab/AIM.

大型语言模型(LLMs)催生了多模态LLMs,后者对图像和视频等视觉数据表现出很强的理解能力。然而,这些模型通常依赖视觉编码器产生的大量视觉token,导致计算需求很高,限制了它们在资源受限环境和长上下文任务中的适用性。本文提出一种免训练的多模态LLMs自适应推理方法,能够以极小的性能下降满足广泛的效率需求。我们的方法包括:a) 在进入LLM之前,基于嵌入相似度进行迭代式token合并;b) 在LLM各层内,基于多模态重要性进行渐进式token剪枝。得益于极简的设计,我们的方法既可用于视频LLMs,也可用于图像LLMs。在多种视频和图像基准上的大量实验表明,我们的方法在保持视频和图像LLMs性能的同时大幅降低了计算量(例如FLOPs减少7倍)。此外,在相近的计算成本下,我们的方法在长视频理解上优于最先进的方法(例如在MLVU上提升4.6)。我们的深入分析还揭示了token冗余和LLM各层行为方面的规律,为今后设计高效多模态LLMs的研究提供了指导。我们的代码见https://github.com/LaVi-Lab/AIM。
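
Illustrative sketch (not taken from the paper): step (a) of the method above merges visual tokens iteratively based on embedding similarity before they reach the LLM. A minimal version of that idea is shown below, assuming that only adjacent tokens are candidates, that similarity is cosine similarity, and that a merged token is the plain average of its pair. These choices and the function names are assumptions; the abstract does not state the exact merging scheme.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_similar_tokens(tokens, target_count):
    """Repeatedly average the most similar adjacent pair until `target_count` tokens remain."""
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(target_count, 1):
        sims = [cosine(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)  # first index of the best pair
        merged = [(a + b) / 2 for a, b in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]
    return tokens
```

Because `target_count` is a free parameter, the same routine can serve different compute budgets without retraining, which is the sense in which such a method is training-free and adaptive.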

Article 237

Title@2025-07-29 (2): Evaluating the cognitive reality of Spanish irregular morphomic patterns: Humans vs. Transformers

Title: Evaluating the cognitive reality of Spanish irregular morphomic patterns: Humans vs. Transformers

Bewertung der kognitiven Realität der spanischen unregelmäßigen morphomischen Muster: Menschen vs. Transformers

评估西班牙语不规则形态组(morphomic)模式的认知现实性:人类与Transformer 2507.21556v1

Authors (3): Akhilesh Kakolu Ramarao, Kevin Tang, Dinah Baer-Henney

This study investigates the cognitive plausibility of the Spanish irregular morphomic pattern by directly comparing transformer-based neural networks to human behavioral data from Nevins et al. (2015). Using the same analytical framework as the original human study, we evaluate whether transformer models can replicate human-like sensitivity to a complex linguistic phenomenon, the morphome, under controlled input conditions. Our experiments focus on three frequency conditions: natural, low-frequency, and high-frequency distributions of verbs exhibiting irregular morphomic patterns. While the models outperformed humans in stem and suffix accuracy, a clear divergence emerged in response preferences. Unlike humans, who consistently favored natural responses across all test items, models preferred irregular responses and were influenced by the proportion of irregular verbs in their training data. Additionally, models trained on the natural and low-frequency distributions, but not the high-frequency distribution, were sensitive to the phonological similarity between test items and real Spanish L-shaped verbs.

这项研究通过直接将基于变压器的神经网络与来自\citet{Nevins2015TheRA} 的人类行为数据进行对比,调查西班牙非正常光谱模式的认知可行性。使用与原始人类研究相同的分析框架,我们评估变压器模型能否在受控输入条件下复制对复杂的语言现象即变异体的类似敏感度。我们的实验侧重于三种频率条件:自然、低频和显示非正常光谱模式的动词的高频分布。虽然模型在干燥和后缀精确度方面比人类表现得要好,但在反应偏好方面却出现了明显的差异。不像人类一样,他们一贯倾向于在所有测试项目中作出自然反应,模型偏好非正常反应,并且受到其培训数据中非正常动词比例的影响。此外,关于自然和低频率分布的模型,而不是高频分布,对测试项目与真实的西班牙L型动词之间的声相相似性非常敏感。

Article 238

Title@2025-07-29 (2): C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning

Title: C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning

C2-Evo: Co-Evolving multimodale Daten und Modell zur Selbstverbesserung

C2-Evo:共同演进的多模式数据和自我改进理由模型 2507.16518v2

Authors (12): Xiuwei Chen, Wentao Hu, Hanhui Li, Jun Zhou, Zisheng Chen, Meng Cao, Yihan Zeng, Kui Zhang, Yu-Jie Yuan, Jianhua Han, Hang Xu, Xiaodan Liang

Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.

多式联运大型语言模型(MLLM)的近期进展显示了令人印象深刻的推理能力,然而,进一步加强现有的MLLMS需要高质量的愿景语言数据集,并需要仔细制定复杂的任务,这些复杂的任务既昂贵又具有规模挑战性。尽管最近自我改进的自我改进模型提供了可行的解决办法,但它们仍面临两个核心挑战:(一) 多数现有方法将视觉数据或文字数据分开,导致数据复杂性的差异(例如,过于简化的图表与多余的文字描述相配);(二) 数据和模型的演变也分离,导致模型暴露于不匹配的困难程度的任务的假设情景。为了解决这些问题,我们建议C2-Evo,一个自动、封闭的自我改进的自我改进框架,共同发展培训数据和模型能力。具体地说,鉴于一个基础数据集和基础模型,C2-Evo通过跨模式数据演变循环和数据模型演变基准循环来增强这些数据。以前的循环扩大了基础数据集,通过生成复杂的模型模型模型模型、分解的升级模型和滚动的滚动模型,同时选择结构化的次级模型和不断升级的升级的模型,然后又进行模拟的升级的升级的升级的系统。

Article 239

Title@2025-07-29 (2): Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models

Title: Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models

Linguistische und einbettende Profilierung von Texten, die von Menschen und großen Sprachmodellen erzeugt werden

人类和大语言模式产生的文本的语言和嵌入式图解 2507.13614v2

Authors (2): Sergio E. Zanotto, Segun Aroyehun

The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify text as either human-written or machine-generated, our study focuses on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features such as dependency length and emotionality, and we use them to characterize human-written and machine-generated texts along with different sampling strategies, repetition controls, and model release dates. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human and machine texts show stylistic diversity across domains, with humans displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to a homogenization of machine-generated texts.

大型语言模型(LLMS)的快速发展大大提高了他们创造自然语言的能力,使LLMS产生的文字与人文文本越来越无法区分。虽然最近的研究主要侧重于利用LLMS将文字分类为人文文本和机器生成的文本,但我们的研究重点是利用不同语言层次,如形态学、语法和语义学等一系列语言特征对这些文本进行定性。我们选择了一套涵盖8个领域、由11个不同的LMS制作的人类书写和机器生成的文本的数据集。我们计算了不同语言特征,如依赖长度和情感性等,我们用它们来描述人类书写和机器生成的文本以及不同的抽样战略、重复控制和发布日期。我们的统计分析表明,人类书写文本往往展示更简单的合成结构和更多样化的语义内容。此外,我们计算了我们各模型和机器文本的变异性。人类和机器文本都显示了不同的语言多样性,在我们的特征上表现出更大的差异。最后,我们应用样式嵌入式和机器生成的文本以及不同的样本,以进一步测试人类-机器版本之间的变异性。

Article 240

Title@2025-07-29 (2): Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri

Title: Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri

Achten Sie auf die Sprachlücke in digitalen Geisteswissenschaften: LLM-Aided Translation of SKOS Thesauri

注意数字人文中的语言差距:SKOS Thesauri的LLM辅助翻译 2507.19537v2

Authors (4): Felix Kraus, Nicolas Blumenröhr, Danah Tonne, Achim Streit

We introduce WOKIE, an open-source, modular, and ready-to-use pipeline for the automated translation of SKOS thesauri. This work addresses a critical need in the Digital Humanities (DH), where language diversity can limit access, reuse, and semantic interoperability of knowledge resources. WOKIE combines external translation services with targeted refinement using Large Language Models (LLMs), balancing translation quality, scalability, and cost. Designed to run on everyday hardware and be easily extended, the application requires no prior expertise in machine translation or LLMs. We evaluate WOKIE across several DH thesauri in 15 languages with different parameters, translation services and LLMs, systematically analysing translation quality, performance, and ontology matching improvements. Our results show that WOKIE is suitable to enhance the accessibility, reuse, and cross-lingual interoperability of thesauri by hurdle-free automated translation and improved ontology matching performance, supporting more inclusive and multilingual research infrastructures.

我们引入了开放源码、模块化和即时使用管道WOKIE, 用于SKOS Thesauri的自动翻译; 这项工作解决了数字人文学(DH)的迫切需要,语言多样性可以限制知识资源的获取、再利用和语义互操作性; WOKIE将外部翻译服务与使用大语言模型(LLMS)的有针对性的改进结合起来,平衡翻译质量、可缩放性和成本; 设计该应用程序要用日常硬件运行,并且容易扩展,应用程序不需要在机器翻译或LMS方面事先具备专业知识; 我们用不同参数、翻译服务和LLMS对若干DHsauri的15种语言进行WOKIE, 系统分析翻译质量、性能和本体匹配性改进。我们的结果表明,WOKIE适合通过无障碍自动翻译和改进本体匹配性能,支持更具包容性和多语种的研究基础设施,提高这些语言的无障碍性能、再利用性和跨语言互操作性。

Article 241

Title@2025-07-29 (2): Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator

Title: Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator

Zeichen als Zeichen: Ein retrieval-erweiterter Mehrsprachiger Zeichen-Generator

标为 Tokens 的符号: 一个检索增强的多语种手语手语生成器 2411.17799v3

Authors (5): Ronglai Zuo, Rolandos Alexandros Potamias, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou

Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. Although many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), the reverse task, sign language generation (text-to-sign), remains largely unexplored. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we leverage a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. During decoding, unlike existing approaches that flatten all part-wise tokens into a single sequence and predict one token at a time, we propose a multi-head decoding method capable of predicting multiple tokens simultaneously. This approach improves inference efficiency while maintaining effective information fusion across different body parts. To further ease the generation process, we propose a retrieval-enhanced sign language generation (SLG) approach, which incorporates external sign dictionaries to provide accurate word-level signs as auxiliary conditions, significantly improving the precision of generated signs. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE.

手语是一种视觉语言,包含自然语言的所有语言特征,是聋人和听力困难社区的主要沟通方法。虽然许多研究成功地调整了手语翻译(手对文本)的预先培训语言模式(LMs),但倒置任务符号语言生成(文本对手语)的剩余部分基本上没有探索。在这项工作中,我们引入了多语言手语模式(Tokens ),即Tokens 符号(SOKE),它可以生成3D符号自动从文本输入中自动转换,使用预先培训的LM。为了将手语与LM统一起来,我们使用一个拆分解的代号符号,将连续信号分解成代表各身体部分的象征序列。在解码过程中,与将所有部分符号都划为单一序列并同时预测一个符号的现有方法不同,我们建议一种多头解码方法,能够同时预测多个符号。这个方法提高了推论效率,同时保持不同身体部分的有效信息融合。为了进一步简化生成过程,我们建议一种将连续信号分解为代表质量的精确度的方法。我们建议,将ShanhanG 展示了精确度信号的校正读质量。

Article 242

Title@2025-07-29 (2): MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation

Title: MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation

MAGIC: Multi-Hop- und Graphenbasierter Benchmark für Inter-Kontext-Konflikte in der retrieval-generierten Generation

MAGIC: 回收后一代人中多重和基于图表的多重和基于图表的相互冲突基准 2507.21544v1

Authors (3): Jungyeon Lee, Kangmin Lee, Taeuk Kim

Knowledge conflict often arises in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model’s parametric knowledge. Existing benchmarks for investigating the phenomenon have notable limitations, including a narrow focus on the question answering setup, heavy reliance on entity substitution techniques, and a restricted range of conflict types. To address these issues, we propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts, while ensuring interpretability through the explicit relational structure of KGs. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict: both open-source and proprietary models struggle with conflict detection – especially when multi-hop reasoning is required – and often fail to pinpoint the exact source of contradictions. Finally, we present in-depth analyses that serve as a foundation for improving LLMs in integrating diverse, sometimes even conflicting, information.

现有调查这一现象的基准有显著的局限性,包括对问题解答设置的狭隘关注、对实体替代技术的高度依赖以及有限的冲突类型。为了解决这些问题,我们提议了一个基于知识的图(KG)框架,在两个相似但又截然不同的背景之间产生不同和微妙的冲突,同时确保通过KGs明确的关系结构进行解释。关于我们基准的实验结果(MAGIC)提供了对LLMs内部关于知识冲突的探索性洞察力:开放源和专有模式与冲突探测斗争 – – 特别是在需要多点推理的情况下 – – 往往未能确定矛盾的确切来源。最后,我们提出深入分析,作为改进LLMs整合多样性、有时甚至相互矛盾的信息的基础。

Article 243

Title@2025-07-29 (2): Modern Uyghur Dependency Treebank (MUDT): An Integrated Morphosyntactic Framework for a Low-Resource Language

Title: Modern Uyghur Dependency Treebank (MUDT): An Integrated Morphosyntactic Framework for a Low-Resource Language

Modern Uighur Dependency Treebank (MUDT): Ein integriertes morphosyntaktisches Framework für eine ressourcenarme Sprache

现代维吾尔依赖性树库(MUDT): 一种低资源语言综合磷合成法框架 2507.21536v1

Authors (4): Jiaxin Zuo, Yiquan Wang, Yuan Pan, Xiadiya Yibulayin

To address a critical resource gap in Uyghur Natural Language Processing (NLP), this study introduces a dependency annotation framework designed to overcome the limitations of existing treebanks for this low-resource, agglutinative language. The framework's relation inventory includes 18 main relations and 26 subtypes, with specific labels such as cop:zero for verbless clauses and instr:case=loc/dat for nuanced instrumental functions. To empirically validate the necessity of this tailored approach, we conducted a cross-standard evaluation using a pre-trained Universal Dependencies parser. The analysis revealed a systematic 47.9% divergence in annotations, pinpointing the inadequacy of universal schemes for handling Uyghur-specific structures. Grounded in nine annotation principles that ensure typological accuracy and semantic transparency, the Modern Uyghur Dependency Treebank (MUDT) provides a more accurate and semantically transparent representation, designed to enable significant improvements in parsing and downstream NLP tasks, and offers a replicable model for other morphologically complex languages.

为解决维吾尔自然语言处理(Uyghur自然语言处理(NLP)中的关键资源缺口,本研究引入了一种依赖性说明框架,旨在克服现有树库对低资源、混凝土语言的限制,该清单包括18种主要关系和26个亚型,具体标签包括条纹:无异词条款零和细微工具功能的内写:case=loc/dat。为了实证这种量身定做方法的必要性,我们使用经过培训的普遍依赖性分类师进行了跨标准评价。分析显示说明有47.9%的系统性差异,指出处理维吾尔特定结构的普遍办法不足。根据九项说明性原则,确保字型准确性和语义透明,现代维吾尔依赖性树库(MUyghur Dependenity Treebank)提供了更准确、更透明的代表性,目的是在区分和下游线任务方面实现重大改进,并为其他变形复杂语言提供可复制的模式。

Article 244

Title@2025-07-29 (2): Automatic Classification of User Requirements from Online Feedback – A Replication Study

Title: Automatic Classification of User Requirements from Online Feedback – A Replication Study

Automatische Klassifizierung der Benutzeranforderungen aus Online-Feedback – Eine Replikationsstudie

在线反馈用户要求自动分类 – – 复制研究 2507.21532v1

Authors (7): Meet Bhatt, Nic Boilard, Muhammad Rehan Chaudhary, Cole Thompson, Jacob Idoko, Aakash Sorathiya, Gouri Ginde

Natural language processing (NLP) techniques have been widely applied in the requirements engineering (RE) field to support tasks such as classification and ambiguity detection. Although RE research is rooted in empirical investigation, it has paid limited attention to replicating NLP for RE (NLP4RE) studies. The rapidly advancing realm of NLP is creating new opportunities for efficient, machine-assisted workflows, which can bring new perspectives and results to the forefront. Thus, we replicate and extend a previous NLP4RE study (the baseline), “Classifying User Requirements from Online Feedback in Small Dataset Environments using Deep Learning”, which evaluated different deep learning models for requirement classification from user reviews. We reproduced the original results using publicly released source code, thereby helping to strengthen the external validity of the baseline study. We then extended the setup by evaluating model performance on an external dataset and comparing results to a GPT-4o zero-shot classifier. Furthermore, we prepared the replication study ID-card for the baseline study, which is important for evaluating replication readiness. Results showed diverse reproducibility levels across different models, with Naive Bayes demonstrating perfect reproducibility. In contrast, BERT and other models showed mixed results. Our findings revealed that the baseline deep learning models, BERT and ELMo, exhibited good generalization capabilities on an external dataset, and GPT-4o showed performance comparable to traditional baseline machine learning models. Additionally, our assessment confirmed the baseline study’s replication readiness; however, providing the missing environment setup files would have further enhanced readiness. We include this missing information in our replication package and provide the replication study ID-card for our study to further encourage and support its replication.

自然语言处理(NLP)技术在需求工程(RE)领域被广泛应用,以支持分类和模糊性探测等任务。虽然RE研究植根于经验调查,但对复制RE(NLP4RE)研究的NLP(NLP4RE)研究重视有限。NLP的快速发展领域为高效的、机器辅助工作流程创造了新的机会,这可以为前沿带来新的视角和结果。因此,我们复制并推广了先前的NLP4RE研究(基线),“利用深层学习对小型数据集环境的在线反馈的用户复制要求进行分类”,评估了用户审查对需求分类的不同深层学习模式。我们利用公开发布源代码复制了原始结果,从而有助于加强基线研究的外部有效性。我们随后通过评价外部数据集的模型性能和将结果与GPT-4零光分解仪进行比较,我们为基线研究准备的复制研究ID卡,对于评估复制准备情况非常重要。结果显示,不同模型的再复制程度不同,而Nayes显示不同模型,展示了精确性复制的复制性文件,结果也鼓励了我们的GRBL的外部升级性研究。我们的学习基础性研究。我们的模型显示了我们的深层的外部研究。我们的研究。我们的研究, 我们的升级的升级的模型显示了我们的深层研究。

Article 245

Title@2025-07-29 (2): HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation

Title: HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation

HIRAG: Hierarchisch-gedankte Instruktion-Tuning-Retrieval-Augmented Generation

HIRAG: 高层次研究教学-引导检索-推荐一代 2507.05714v2

Authors (8): YiHan Jiao, ZheHao Tan, Dan Yang, DuoLin Sun, Jie Feng, Yue Shen, Jian Wang, Peng Wei

Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often lack a granular focus on RAG tasks or a deeper utilization of chain-of-thought processes. To address this, we propose that RAG models should possess three progressively hierarchical abilities: (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG), which incorporates a “think before answering” strategy. This method enhances the model’s open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model’s performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.

检索增强生成(RAG)已成为解决大型语言模型在处理实时信息和特定领域问题方面所面临的挑战的基本范式。传统的RAG系统主要依赖大语言模型本身的上下文学习(ICL)能力。然而,关于RAG生成模型所需的具体能力的深入研究仍然缺乏,导致文件质量参差不齐和检索系统不完善带来的挑战。即使是少数对RAG生成模型进行微调的研究,也往往缺乏对RAG任务的细粒度关注,或缺乏对思维链过程的更深入利用。为了解决这个问题,我们提出RAG模型应当具备三种逐级递进的能力:(1) 过滤:选择相关信息的能力;(2) 组合:将各段落的语义信息结合起来的能力;(3) RAG特定推理:利用内部知识进一步处理外部知识的能力。因此,我们引入了新的RAG指令微调方法,即层级思维指令微调检索增强生成(HIRAG),它采用“先思考后回答”的策略。该方法利用多层次渐进式思维链,增强模型的开卷考试能力。实验表明,HIRAG训练策略显著提升了模型在RGB、PopQA、MuSiQue、HotpotQA和PubmedQA等数据集上的表现。

Article 246

Title@2025-07-29 (2): TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling

Title: TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling

TriangleMix: Ein verlustfreies und effizientes Aufmerksamkeitsmuster für den langen Kontext

三角组合:长期预填无损高效关注模式 2507.21526v1

Authors (6): Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu

Large Language Models (LLMs) rely on attention mechanisms whose time complexity grows quadratically with input sequence length, creating significant computational bottlenecks during the prefilling stage. Existing static sparse attention methods typically degrade accuracy, while dynamic sparsity methods introduce additional computational overhead due to runtime sparse index estimation. To address these limitations, we propose TriangleMix, a novel training-free static attention pattern. TriangleMix employs dense attention in shallow layers and switches to a triangle-shaped sparse pattern in deeper layers. Extensive experiments demonstrate that TriangleMix reduces attention overhead by 3.7x to 15.3x in deep layers, and decreases overall Time-to-First-Token (TTFT) by 12% to 32% for sequence lengths ranging from 32K to 128K, without sacrificing model accuracy. Moreover, TriangleMix can be seamlessly integrated with dynamic sparsity methods to achieve further speedup, e.g. accelerating MInference by 19% at 128K, highlighting its potential to enhance LLM inference efficiency.

大型语言模型(LLMS)依赖于关注机制,其时间复杂性随着输入序列长度的四倍增长,在预填阶段造成了重大的计算瓶颈。现有的静态稀疏关注方法通常会降低准确性,而动态宽度方法则会由于时间稀少的指数估计而引入额外的计算间接费用。为了解决这些局限性,我们建议三角Mix(TridgeMix),这是一个没有培训的新型静态关注模式。三角Mix在浅层使用密集的注意力,在深层将三角Mix切换成三角形稀疏模式。广泛的实验表明,三角Mix将注意力管理减少3.7x至15.3x深层,并且在不牺牲模型准确性的情况下,将时间到一至二十八K(TTFT)之间的总时间长度减少12%至32%。此外,三角Mix可以与动态宽度方法无缝地结合,以进一步加速速度,例如,在128K时速加快了19%的MInference,突出其提高LM推算效率的潜力。
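The layer-wise switch described above can be illustrated with a small mask builder. This is a minimal sketch, not the authors' implementation: the abstract does not spell out the triangle pattern, so the code assumes it keeps a few leading "sink" keys, a local sliding window, and full causal attention for the last few queries, dropping the middle block. The names `triangle_mask`, `layer_mask`, `sink`, `window`, `last`, and `first_sparse_layer` are all hypothetical.

```python
def causal_mask(n):
    """Dense causal mask: query q may attend to every key k <= q."""
    return [[k <= q for k in range(n)] for q in range(n)]


def triangle_mask(n, sink=1, window=2, last=2):
    """Assumed sparse pattern: sink keys + local window + dense final queries."""
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            causal = k <= q
            keep = (
                k < sink            # leading keys every query may see
                or q - k < window   # recent keys near the diagonal
                or q >= n - last    # final queries keep full causal attention
            )
            row.append(causal and keep)
        mask.append(row)
    return mask


def layer_mask(layer, n, first_sparse_layer, **pattern):
    """Dense attention in shallow layers, the sparse pattern in deeper ones."""
    if layer < first_sparse_layer:
        return causal_mask(n)
    return triangle_mask(n, **pattern)


def kept_entries(mask):
    """Number of query-key pairs that are actually computed."""
    return sum(sum(row) for row in mask)
```

Comparing `kept_entries(layer_mask(0, n, 16))` with `kept_entries(layer_mask(20, n, 16))` for a given `n` shows how many query-key pairs a deep layer would skip under these assumptions; the thresholds here are toy values, not figures from the paper.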

Article 247

Title@2025-07-29 (2): Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting

Title: Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting

Modellfreies Spekulatives Dekodieren für Transformer-basierte ASR mit Token-Map-Entwurf

采用 Token 地图起草的基于变换器的ASR无示范投机代号 2507.21522v1

Authors (4): Tuan Vu Ho, Hiroaki Kokubo, Masaaki Yamamoto, Yohei Kawaguchi

End-to-end automatic speech recognition (ASR) systems based on transformer architectures, such as Whisper, offer high transcription accuracy and robustness. However, their autoregressive decoding is computationally expensive, hence limiting deployment on CPU-based and resource-constrained devices. Speculative decoding (SD) mitigates this issue by using a smaller draft model to propose candidate tokens, which are then verified by the main model. However, this approach is impractical for devices lacking hardware accelerators like GPUs. To address this, we propose Token Map Drafting, a model-free SD technique that eliminates the need for a separate draft model. Instead, we leverage a precomputed n-gram token map derived from domain-specific training data, enabling efficient speculative decoding with minimal overhead. Our method significantly accelerates ASR inference in structured, low-perplexity domains without sacrificing transcription accuracy. Experimental results demonstrate decoding speed-ups of 1.27x on the CI-AVSR dataset and 1.37x on our internal dataset without degrading recognition accuracy. Additionally, our approach achieves a 10% absolute improvement in decoding speed over the Distill-spec baseline running on CPU, highlighting its effectiveness for on-device ASR applications.

以Whisper等Transformer架构为基础的端到端自动语音识别(ASR)系统提供了高转录准确性和稳健性。但是,它们的自回归解码在计算上成本高昂,从而限制了在基于CPU和资源受限的设备上的部署。推测解码(SD)通过使用一个较小的草稿模型来提出候选词元来缓解这一问题,然后由主模型来验证。但是,对于缺少像GPU这样的硬件加速器的设备来说,这种方法是不切实际的。为了解决这个问题,我们提出Token Map Drafting,这是一种无模型的SD技术,消除了对单独草稿模型的需要。相反,我们利用从特定领域训练数据中得出的预先计算的n-gram词元映射,从而能够以最小的开销有效地进行推测解码。我们的方法大大加快了ASR在结构化、低困惑度领域中的推理,同时不牺牲转录准确性。实验结果显示,在不降低识别准确率的情况下,CI-AVSR数据集上的解码速度提升1.27倍,我们的内部数据集上提升1.37倍。此外,与在CPU上运行的Distill-spec基线相比,我们的方法在解码速度上取得了10%的绝对提升,突显了其在设备端ASR应用中的有效性。
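The idea of drafting from a precomputed n-gram token map can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's algorithm: the map stores only the most frequent continuation of each n-token context, decoding is greedy, and `main_next_token` is a hypothetical stand-in for the main ASR decoder. A real system would verify all drafted tokens in a single batched forward pass, whereas this sketch calls the stand-in once per token for clarity.

```python
from collections import defaultdict


def build_token_map(corpus, n=2):
    """Map each n-token context to its most frequent next token."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in corpus:
        for i in range(len(seq) - n):
            counts[tuple(seq[i:i + n])][seq[i + n]] += 1
    return {ctx: max(nxt, key=nxt.get) for ctx, nxt in counts.items()}


def draft(token_map, prefix, n=2, max_draft=4):
    """Propose up to max_draft tokens by chained map lookups (no draft model)."""
    out, ctx = [], list(prefix[-n:])
    while len(out) < max_draft:
        nxt = token_map.get(tuple(ctx))
        if nxt is None:
            break
        out.append(nxt)
        ctx = ctx[1:] + [nxt]
    return out


def speculative_decode(main_next_token, token_map, prefix, n=2,
                       max_len=16, eos="<eos>"):
    """Greedy decoding that accepts drafted tokens the main model agrees with."""
    seq, drafted_hits = list(prefix), 0
    while len(seq) < max_len and seq[-1] != eos:
        for tok in draft(token_map, seq, n):
            if len(seq) >= max_len or seq[-1] == eos:
                break
            if main_next_token(seq) != tok:   # first mismatch ends acceptance
                break
            seq.append(tok)
            drafted_hits += 1
        if len(seq) < max_len and seq[-1] != eos:
            seq.append(main_next_token(seq))  # fall back to the main model
    return seq, drafted_hits
```

Because every accepted token is checked against the main model's greedy choice, the output matches plain greedy decoding; the map only changes how many tokens can be confirmed per verification step.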

Article 248

Title@2025-07-29 (2): Simulated patient systems are intelligent when powered by large language model-based AI agents

Title: Simulated patient systems are intelligent when powered by large language model-based AI agents

Simulierte Patientensysteme sind intelligent, wenn sie von großen modellbasierten AI-Agenten angetrieben werden

由大型语言模型型人工智能代理器供电时,模拟的病人系统是智能系统 2409.18924v3

Authors (23): Huizi Yu, Jiayan Zhou, Lingyao Li, Shan Chen, Jack Gallifant, Anye Shi, Xiang Li, Jingxian He, Wenyue Hua, Mingyu Jin, Guang Chen, Yang Zhou, Zhao Li, Trisha Gupte, Ming-Li Chen, Zahra Azizi, Yongfeng Zhang, Yanqiu Xing, Danielle S. Bitterman, Themistocles L. Assimes, Xin Ma, Lin Lu, Lizhou Fan

Simulated patient systems play an important role in modern medical education and research, providing safe, integrative medical training environments and supporting clinical decision-making simulations. We developed AIPatient, an intelligent simulated patient system powered by large language model-based AI agents. The system incorporates the Retrieval Augmented Generation (RAG) framework, powered by six task-specific LLM-based AI agents for complex reasoning. For simulation realism, the system is also powered by the AIPatient KG (Knowledge Graph), built with de-identified real patient data from the Medical Information Mart for Intensive Care (MIMIC)-III database. Primary outcomes showcase the system’s intelligence, including the system’s accuracy in Electronic Health Record (EHR)-based medical Question Answering (QA), readability, robustness, and stability. The system achieved a QA accuracy of 94.15% when all six AI agents were present, surpassing benchmarks with partial or no agent integration. Its knowledge base demonstrated high validity (F1 score=0.89). Readability scores showed a median Flesch Reading Ease of 77.23 and a median Flesch Kincaid Grade of 5.6, indicating accessibility to all medical professionals. Robustness and stability were confirmed with non-significant variance (ANOVA F-value=0.6126, p > 0.1; F-value=0.782, p > 0.1). A user study with medical students further demonstrated that AIPatient offers high fidelity, strong usability, and effective educational value, performing comparably to or better than human-simulated patients in medical history-taking scenarios. The promising intelligence of the AIPatient system highlights its potential to support a wide range of applications, including medical education, model evaluation, and system integration.

模拟的患者系统在现代医疗教育和研究中发挥重要作用,提供安全、综合的医疗培训环境,并支持临床决策模拟。我们开发了AIPatient,这是一个智能模拟病人系统,由基于大语言模型的AI代理驱动。该系统包含检索增强生成(RAG)框架,由6个面向特定任务、基于LLM的AI代理驱动,进行复杂推理。为了保证模拟的真实性,该系统还由AIPatient KG(知识图谱)驱动,该图谱由重症监护医学信息库(MIMIC)-III数据库中经过去标识化处理的真实病人数据构建。主要结果展示了该系统的智能,包括基于电子健康记录(EHR)的医疗问答(QA)的准确性、可读性、稳健性和稳定性。当6个AI代理全部启用时,系统的QA准确率达到94.15%,超过部分启用或不启用代理的基准。其知识库显示出很高的有效性(F1分数=0.89)。可读性得分显示,Flesch Reading Ease中位数为77.23,Flesch Kincaid Grade中位数为5.6,表明所有医疗专业人员均可理解。稳健性和稳定性通过不显著的方差得到确认(ANOVA F值=0.6126,p > 0.1;F值=0.782,p > 0.1)。一项面向医学生的用户研究进一步表明,AIPatient具有高保真度、良好的可用性和有效的教育价值,在病史采集场景中的表现与人工模拟病人相当或更好。AIPatient系统展现的智能突显了其支持医学教育、模型评估和系统集成等广泛应用的潜力。
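For readers unfamiliar with the two readability scores reported above, the standard published formulas can be computed from simple counts. This is a minimal sketch: the abstract does not say which tool or syllable counter the authors used, so the functions take word, sentence, and syllable counts as given inputs, and the function names are illustrative.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

Both scores depend only on average sentence length and average syllables per word, which is why short sentences with plain vocabulary push Reading Ease up and the grade level down.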

Article 249

Title@2025-07-29 (2): What Does it Mean for a Neural Network to Learn a “World Model”?

Title: What Does it Mean for a Neural Network to Learn a “World Model”?

Was bedeutet es für ein neurales Netzwerk, ein “Weltmodell” zu lernen?

神经网络学习“世界模型”意味着什么? 2507.21513v1

Authors (3): Kenneth Li, Fernanda Viégas, Martin Wattenberg

We propose a set of precise criteria for saying a neural net learns and uses a “world model.” The goal is to give an operational meaning to terms that are often used informally, in order to provide a common language for experimental investigation. We focus specifically on the idea of representing a latent “state space” of the world, leaving modeling the effect of actions to future work. Our definition is based on ideas from the linear probing literature, and formalizes the notion of a computation that factors through a representation of the data generation process. An essential addition to the definition is a set of conditions to check that such a “world model” is not a trivial consequence of the neural net’s data or task.

我们提出一套精确的标准来说明神经网的学习,并使用“世界模型 ” 。目标是给经常非正式使用的术语一个操作意义,以便为实验性调查提供一个共同的语言。我们特别侧重于代表世界潜在的“状态空间”的想法,将行动的效果建模留给今后的工作。我们的定义以线性研究文献的理念为基础,并通过数据生成过程的表述正式确定计算要素的概念。定义的一个基本补充是一系列条件,以检查这种“世界模型”不是神经网数据或任务的一个微不足道的后果。

Article 250

Title@2025-07-29 (2): Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Title: Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Persona-Vektoren: Überwachung und Kontrolle von Charaktereigenschaften in Sprachmodellen

人向量:监测和控制语言模式中的字符轨迹 2507.21509v1

Authors (5): Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey

Large language models interact with users through a simulated ‘Assistant’ persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model’s activation space, called persona vectors, that underlie several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant’s personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.

大型语言模型通过模拟的“ 助理” 人性来与用户互动。虽然助理通常被训练成有用、无害和诚实, 但有时会偏离这些理想。在本文中, 我们确定该模型激活空间- 人矢量的方向, 以若干特征为基础, 如邪恶、交错和致幻倾向。我们确认这些矢量可用于监测助理在部署时间的个性波动。然后我们应用人矢量来预测和控制培训期间发生的个性变化。我们发现, 微调后预期的和意外的个性变化与相关个人矢量的转变密切相关。这些转变可以通过热后干预来缓解, 或者首先通过新的预防性指导方法来避免。此外, 人矢量可以用来标出在数据集一级和单个样本一级产生不可取的个性变化的数据。我们提取个人矢量的方法是自动的, 并且可以应用到任何个性利益特征, 仅提供自然语言描述。
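Monitoring and steering along a direction in activation space can be sketched with plain vectors. This is a toy illustration under assumptions the abstract does not state: the direction is taken as the normalised difference between mean activations on trait-exhibiting and neutral responses, monitoring is a dot product, and steering adds a scaled copy of the direction. All function names are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_vector(trait_acts, neutral_acts):
    """Assumed extraction: unit-length difference of mean activations."""
    diff = [a - b for a, b in zip(mean_vector(trait_acts),
                                  mean_vector(neutral_acts))]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]


def projection(activation, vector):
    """Monitoring signal: how far the activation lies along the direction."""
    return sum(a * v for a, v in zip(activation, vector))


def steer(activation, vector, alpha):
    """Shift an activation along the direction; negative alpha suppresses it."""
    return [a + alpha * v for a, v in zip(activation, vector)]
```

In this toy setting, subtracting the current projection (`alpha = -projection(h, v)`) removes the component of `h` along `v` while leaving the orthogonal part untouched.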

Article 251

Title@2025-07-29 (2): The Carbon Cost of Conversation, Sustainability in the Age of Language Models

Title: The Carbon Cost of Conversation, Sustainability in the Age of Language Models

Die CO2-Kosten des Gesprächs, Nachhaltigkeit im Zeitalter der Sprachmodelle

对话的碳成本、语言模式时代的可持续性 2507.20018v2

Authors (6): Sayed Mahbub Hasan Amiri, Prasun Goswami, Md. Mainul Islam, Mohammad Shakhawat Hossen, Sayed Majhab Hasan Amiri, Naznin Akter

Large language models (LLMs) like GPT-3 and BERT have revolutionized natural language processing (NLP), yet their environmental costs remain dangerously overlooked. This article critiques the sustainability of LLMs, quantifying their carbon footprint, water usage, and contribution to e-waste through case studies of models such as GPT-4 and energy-efficient alternatives like Mistral 7B. Training a single LLM can emit carbon dioxide equivalent to hundreds of cars driven annually, while data centre cooling exacerbates water scarcity in vulnerable regions. Systemic challenges (corporate greenwashing, redundant model development, and regulatory voids) perpetuate harm, disproportionately burdening marginalized communities in the Global South. However, pathways exist for sustainable NLP: technical innovations (e.g., model pruning, quantum computing), policy reforms (carbon taxes, mandatory emissions reporting), and cultural shifts prioritizing necessity over novelty. By analysing industry leaders (Google, Microsoft) and laggards (Amazon), this work underscores the urgency of ethical accountability and global cooperation. Without immediate action, AI’s ecological toll risks outpacing its societal benefits. The article concludes with a call to align technological progress with planetary boundaries, advocating for equitable, transparent, and regenerative AI systems that prioritize both human and environmental well-being.

GPT-3和BERT等大型语言模型(LLMs)实现了自然语言处理(NLP)的革命性,但其环境成本仍然被严重忽视。本篇文章批评LLMs的可持续性,通过GPT-4等模型和Mistral 7B等节能替代品的案例研究量化其碳足迹、水的使用和对电子废物的贡献。培训一个LLM每年可以排放相当于数百辆汽车的二氧化碳,而数据中心冷却则加剧了脆弱区域的缺水问题。系统性挑战:企业洗绿、冗余模式开发和监管真空使全球南部边缘化社区长期遭受伤害,负担过重。然而,可持续NLPs:技术创新(例如模式裁剪裁、量计算)、政策改革(碳税、强制性排放报告)和文化转变(将必要性置于新颖之上。通过分析工业领导人(Google、微软)和滞后者(Amazon),这项工作强调了道德问责和全球合作的紧迫性。如果没有立即采取行动,AIs生态风险就超过其社会效益。文章最后是:技术创新(例如,要求全球)的优先度,要求将技术进步与全球系统联系起来。

Article 252

Title@2025-07-29 (2): Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach

Title: Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach

Zuverlässige Proof-Generation mit LLMs: Ein neuro-symbolischer Ansatz

努力利用LLM女士实现可靠的证据生产:神经-双曲方法 2505.14479v4

Authors (3): Oren Sultan, Eitan Stern, Dafna Shahaf

Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs’ generative strengths with structured components to overcome this challenge. As a proof-of-concept, we focus on geometry problems. Our approach is two-fold: (1) we retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. We demonstrate that our method significantly improves proof accuracy for OpenAI’s o1 model (58%-70% improvement); both analogous problems and the verifier’s feedback contribute to these gains. More broadly, shifting to LLMs that generate provably correct conclusions could dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.

大型语言模型(LLMS)与需要严格逻辑推算和符号推理的正式领域(如数学校准生成)相抗衡,我们建议一种神经-顺理成章的方法,将LLMS的基因特长与结构化组成部分相结合,以克服这一挑战。作为一个概念的证明,我们侧重于几何问题。我们的方法有两个方面:(1) 我们找出类似的问题,用其证明来指导LLM;(2) 一个正式的核查员评估产生的证明并提供反馈,帮助模型修正错误的证明。我们证明,我们的方法大大提高了O1模型(58%-70%的改进率)的证明准确性; 相似的问题和验证人的反馈都有助于这些成就。更广泛地说,转向能够产生可辨别正确结论的LLMs可以大大提高其可靠性、准确性和一致性,解锁复杂的任务和需要信任的关键性真实世界应用。
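The two-part approach in the abstract (retrieve analogous proofs, then let a verifier's feedback drive corrections) can be pictured as a simple loop. This is a minimal sketch of the control flow only: `retrieve`, `generate`, and `verify` are hypothetical callables standing in for the retrieval step, the LLM, and the formal verifier, and the toy stand-ins below exist solely to make the loop runnable.

```python
def prove_with_feedback(problem, retrieve, generate, verify, max_rounds=3):
    """Retrieve analogous proofs, then alternate generation and verification."""
    examples = retrieve(problem)          # analogous problems with known proofs
    feedback = None
    for round_no in range(1, max_rounds + 1):
        proof = generate(problem, examples, feedback)
        ok, feedback = verify(problem, proof)
        if ok:
            return proof, round_no
    return None, max_rounds               # no verified proof within the budget


# Toy stand-ins so the loop can be run end to end (all hypothetical).
def toy_retrieve(problem):
    return ["angles in a triangle sum to 180"]


def toy_generate(problem, examples, feedback):
    # The "model" only gets it right once the verifier has complained.
    return "x = 60" if feedback else "x = 50"


def toy_verify(problem, proof):
    ok = proof == "x = 60"
    return ok, None if ok else "angle sum is 170, expected 180"


proof, rounds = prove_with_feedback("find x", toy_retrieve, toy_generate, toy_verify)
```

The loop only ever returns a proof the verifier has accepted, which is the property that makes the combined system more trustworthy than the generator alone.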

Article 253

Title@2025-07-29 (2): VN-MTEB: Vietnamese Massive Text Embedding Benchmark

Title: VN-MTEB: Vietnamese Massive Text Embedding Benchmark

VN-MTEB: Vietnamesisch Massiver Text Einbettung Benchmark

VN-MTEB:越南大规模文本嵌入基准 2507.21500v1

Authors (5): Loc Pham, Tung Luu, Thu Vo, Minh Nguyen, Viet Hoang

Vietnam ranks among the top countries in terms of both internet traffic and online toxicity. As a result, implementing embedding models for recommendation and content control duties in applications is crucial. However, a lack of large-scale test datasets, both in volume and task diversity, makes it tricky for scientists to effectively evaluate AI models before deploying them in real-world, large-scale projects. To solve this important problem, we introduce a Vietnamese benchmark, VN-MTEB for embedding models, which we created by translating a large number of English samples from the Massive Text Embedding Benchmark using our new automated framework. We leverage the strengths of large language models (LLMs) and cutting-edge embedding models to conduct translation and filtering processes to retain high-quality samples, guaranteeing a natural flow of language and semantic fidelity while preserving named entity recognition (NER) and code snippets. Our comprehensive benchmark consists of 41 datasets from six tasks specifically designed for Vietnamese text embeddings. In our analysis, we find that bigger and more complex models using Rotary Positional Embedding outperform those using Absolute Positional Embedding in embedding tasks. Datasets are available at HuggingFace: https://huggingface.co/collections/GreenNode/vn-mteb-68871433f0f7573b8e1a6686

越南在互联网流量和网络有害内容方面均位居世界前列。因此,在应用中部署用于推荐和内容管控的嵌入模型至关重要。然而,由于缺乏在数量和任务多样性上都足够大的测试数据集,研究人员很难在将AI模型部署到真实的大型项目之前对其进行有效评估。为解决这一重要问题,我们提出了面向嵌入模型的越南语基准VN-MTEB,它是利用我们新的自动化框架翻译Massive Text Embedding Benchmark中的大量英文样本而构建的。我们利用大型语言模型(LLM)和前沿嵌入模型的优势进行翻译和过滤,以保留高质量样本,在保证语言自然流畅和语义忠实的同时,保留命名实体识别(NER)和代码片段。我们的综合基准由来自六类任务的41个数据集组成,专为越南语文本嵌入而设计。我们的分析发现,在嵌入任务中,使用旋转位置嵌入的更大、更复杂的模型优于使用绝对位置嵌入的模型。数据集可在HuggingFace获取:https://huggingface.co/collections/GreenNode/vn-mteb-68871433f0f7573b8e1a6686

Article 254

Title@2025-07-29 (2): Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

Title: Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

Anreize für eine fortgeschrittene Instruktions-Folge von großen Sprachmodellen

为采用大语言模式的高级指示提供激励理由 2506.01413v5

Authors (9): Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun

Existing large language models (LLMs) face challenges of following complex instructions, especially when multiple constraints are present and organized in paralleling, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints for identifying their relationship across hierarchies of types and dimensions. To this end, we propose RAIF, a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we stem from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to a 8B LLM. Evaluation on OOD constraints also confirms the generalizability of our RAIF. Codes and data are available at https://github.com/yuleiqin/RAIF. Keywords: reinforcement learning with verifiable rewards (RLVR), instruction following, complex instructions

现有大型语言模型(LLMS)面临遵循复杂指令的挑战,特别是当存在多种制约因素,并在平行、链条和分支结构中组织多种制约时。一个直观的解决方案,即思维链(CoT),有望普遍提高LLMs的能力。然而,我们发现,香草CO(LLMS)由于其肤浅的推理模式,简单地将指示抛光,对业绩产生了负面影响。它未能消除在确定不同类型和层面的等级关系方面存在的制约的构成。为此,我们建议RAIF(RAIF)采用系统方法,通过激励测试时间计算缩放的推理,促进LMM(CLM)处理复杂的指令。首先,我们在现有的分类中将复杂的指令分解,并提出可再生数据获取的方法。第二,我们利用强化学习(RLLL),用可核查的规则中心奖赏信号,专门为随后的教学提供推理。我们通过样本比对COT执行进行简单对比,从浅、非典型的推理的推理学性质。我们还利用了专家行为演法行为规范,在测试11 将可比较的RLMMMLMS(LM) 进行可比较的演化,从快速分析,从而确认可判的精确性地将精确性分析。

Article 255

Title@2025-07-29 (2): Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning

Title: Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning

Low-Confidence Gold: Verfeinerung von Low-Confidence-Proben für effizientes Instruktionstuning

低信任金:改进低信任金样本,以进行高效教学计费 2502.18978v4

Authors (4): Hongyi Cai, Jie Li, Mohammad Mahdinur Rahman, Wenzhen Dong

The effectiveness of instruction fine-tuning for Large Language Models is fundamentally constrained by the quality and efficiency of training datasets. This work introduces Low-Confidence Gold (LCG), a novel filtering framework that employs centroid-based clustering and confidence-guided selection for identifying valuable instruction pairs. Through a semi-supervised approach using a lightweight classifier trained on representative samples, LCG curates high-quality subsets while preserving data diversity. Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive evaluation metrics. The framework’s efficacy while maintaining model performance establishes a promising direction for efficient instruction tuning.

培训数据集的质量和效率从根本上限制了大语言模型教学微调的有效性,这项工作引入了低保密金(LCG),这是一个新的过滤框架,采用以机器人为基础的集群和信任制为指南的选择,以确定有价值的教学配对,采用半监督办法,使用经过代表性样本培训的轻量级分类师,在保存数据多样性的同时,保存高质量的子集;实验性评价显示,对LCG过滤的6K样本子集进行微调的模型比现有方法取得优异的性能,在MT-Bench方面大有改进,在综合评价指标方面不断取得收益;框架在保持示范性业绩的同时,在保持示范性业绩的同时,为有效的指导调整确定了有希望的方向。

Article 256

Title@2025-07-29 (2): Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering

Title: Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering

Sem-DPO: Semantische Inkonsistenz bei der Preference-Optimierung für Prompt Engineering mindern

Sem-DPO: 减轻在优先优化即时工程方面的语义不一致现象 2507.20133v2

Authors (8): Anas Mohamed, Azal Ahmad Khan, Xinran Wang, Ahmad Faraz Khan, Shuwen Ge, Saman Bahzad Khan, Ayaan Ahmad, Ali Anwar

Generative AI can now synthesize strikingly realistic images from text, yet output quality remains highly sensitive to how prompts are phrased. Direct Preference Optimization (DPO) offers a lightweight, off-policy alternative to RL for automatic prompt engineering, but its token-level regularization leaves semantic inconsistency unchecked as prompts that win higher preference scores can still drift away from the user’s intended meaning. We introduce Sem-DPO, a variant of DPO that preserves semantic consistency yet retains its simplicity and efficiency. Sem-DPO adjusts the DPO loss using a weight based on how different the winning prompt is from the original, reducing the impact of training examples that are semantically misaligned. We provide the first analytical bound on semantic drift for preference-tuned prompt generators, showing that Sem-DPO keeps learned prompts within a provably bounded neighborhood of the original text. On three standard text-to-image prompt-optimization benchmarks and two language models, Sem-DPO achieves 8-12% higher CLIP similarity and 5-9% higher human-preference scores (HPSv2.1, PickScore) than DPO, while also outperforming state-of-the-art baselines. These findings suggest that strong flat baselines augmented with semantic weighting should become the new standard for prompt-optimization studies and lay the groundwork for broader, semantics-aware preference optimization in language models.

生成式AI如今能够根据文本合成极为逼真的图像,但输出质量对提示词的措辞仍然高度敏感。直接偏好优化(DPO)为自动提示工程提供了一种轻量级、离策略的RL替代方案,但其词元级正则化无法约束语义不一致,因为获得更高偏好分数的提示仍可能偏离用户的原意。我们提出Sem-DPO,这是DPO的一个变体,在保持语义一致性的同时保留了其简单性和效率。Sem-DPO根据获胜提示与原始提示的差异程度对DPO损失进行加权,从而降低语义未对齐的训练样本的影响。我们首次给出了偏好调优提示生成器语义漂移的解析界,表明Sem-DPO能将学到的提示保持在原始文本的一个可证明有界的邻域内。在三个标准的文本到图像提示优化基准和两种语言模型上,Sem-DPO的CLIP相似度比DPO高8-12%,人类偏好分数(HPSv2.1、PickScore)高5-9%,同时也优于最先进的基线。这些发现表明,辅以语义加权的强平坦基线应成为提示优化研究的新标准,并为语言模型中更广泛的语义感知偏好优化奠定基础。
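The weighting idea in the Sem-DPO abstract can be illustrated with a small sketch. The abstract only says the DPO loss is scaled by a weight that depends on how far the winning prompt drifts from the original; the exponential form `exp(-alpha * distance)`, the cosine distance over embeddings, and all function and parameter names below are assumptions made for illustration, not the authors' formulation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_weight(original_emb, winner_emb, alpha=1.0):
    """Hypothetical weight: 1.0 when the winner matches the original, smaller as it drifts."""
    distance = 1.0 - cosine_similarity(original_emb, winner_emb)
    return math.exp(-alpha * distance)


def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def sem_dpo_loss(policy_w, policy_l, ref_w, ref_l, original_emb, winner_emb,
                 beta=0.1, alpha=1.0):
    """DPO loss scaled by the (assumed) semantic weight of the winning prompt."""
    weight = semantic_weight(original_emb, winner_emb, alpha)
    return weight * dpo_loss(policy_w, policy_l, ref_w, ref_l, beta)
```

Under these assumptions, a preference pair whose winning prompt has drifted semantically contributes less to the training signal than an otherwise identical pair whose winner stays close to the original prompt.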

Article 257

Title@2025-07-29 (2): The pitfalls of next-token prediction

Title: The pitfalls of next-token prediction

Die Fallstricke der Next-Token-Vorhersage

下吨预测的陷阱 2403.06963v3

Authors (2): Gregor Bachmann, Vaishnavh Nagarajan

Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern and correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction – autoregressive inference and teacher-forced training – must be treated distinctly. The popular criticism that errors can compound during autoregressive inference, crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner – remarkably, despite the task being straightforward to learn. Finally, we provide preliminary evidence that this failure can be resolved using teacherless training, a simple modification using dummy tokens that predicts multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available under https://github.com/gregorbachmann/Next-Token-Failures

仅仅下到脚的预测人能忠实地模拟人类智能吗?我们将这种新出现的关切具体化,纠正围绕它出现的流行误解,并提倡简单多到目的。作为一个起点,我们主张必须明确对待经常被组合的下到的预测的两个阶段 – – 自动递进推论和师资强制培训。错误在自动递进推论期间可能加剧的流行批评,关键地假设教师拒绝教师已经学会了准确的下到脚的预测人。这一假设回避了我们暴露的一个更深层次的问题:在某些任务类别中,教师拒绝教师可能只是无法首先学习准确的下到下到的预测人。我们描述了教师推举失败的一般机制,并设计了一个最起码的规划任务,使变异和曼巴结构在这种方式上都有可能失败 – – 尽管任务非常简单易学。最后,我们提供了初步证据,证明这一失败可以通过_教师无到脚的训练来解决。一个简单的修改,用下一个假象来预示多重的标志。我们希望在未来的探索中找到我们未来的准则。

Article 258

Title@2025-07-29 (2): Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs

Title: Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs

Verbesserung der Aufgabenvielfalt bei der Label-Effizient überwachten Feinsteuerung von LLMs

改进LLMML在标签高效监督监督下改进任务多样性 2507.21482v1

Authors (4): Abhinav Arabelly, Jagrut Nemade, Robert D Nowak, Jifan Zhang

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but developing high-performing models for specialized applications often requires substantial human annotation – a process that is time-consuming, labor-intensive, and expensive. In this paper, we address the label-efficient learning problem for supervised finetuning (SFT) by leveraging task diversity as a fundamental principle for effective data selection. This is markedly different from existing methods based on prompt diversity. Our approach is based on two key observations: 1) task labels for different prompts are often readily available; 2) pre-trained models have significantly varying levels of confidence across tasks. We combine these facts to devise a simple yet effective sampling strategy: we select examples across tasks using an inverse confidence weighting strategy. This produces models comparable to or better than those trained with more complex sampling procedures, while being significantly easier to implement and less computationally intensive. Notably, our experimental results demonstrate that this method can achieve better accuracy than training on the complete dataset (a 4% increase in MMLU score). Across various annotation budgets and two instruction finetuning datasets, our algorithm consistently performs at or above the level of the best existing methods, while reducing annotation costs by up to 80%.

大型语言模型(LLMS)在不同领域表现出了非凡的能力,但为专门应用开发高效模型往往需要大量的人力说明 – – 这一过程耗时费时、劳力密集和昂贵。在本文中,我们通过利用任务多样性作为有效数据选择的基本原则,解决了监督微调的标签效率高的学习问题。这与基于迅速多样性的现有方法明显不同。我们的方法基于两项关键观察:1) 不同提示的任务标签往往随时可得;2) 预先培训的模型在各项任务之间具有不同的信任度。我们将这些事实结合起来,以设计一个简单而有效的抽样战略:我们采用反信任加权战略在各项任务之间选择实例。这产生了与经过更复杂取样程序培训的类似或更好的模型,但执行起来要容易得多,计算强度要小得多。值得注意的是,我们的实验结果表明,这种方法比关于完整数据集的培训(MMLU分数增加4);2) 各种说明预算和两个指示对数据集进行了显著的不同程度的信任度。我们用逆向80或高于现有最佳方法水平的算法来降低或降低现有成本。
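The inverse confidence weighting strategy described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: per-task confidence is taken as a single number in (0, 1], tasks are sampled with probability proportional to `1 / confidence`, and examples within a task are drawn uniformly without replacement. The function names and the exact normalization are hypothetical; the paper may define confidence and the sampling rule differently.

```python
import random


def inverse_confidence_weights(task_confidence):
    """Map each task to a sampling probability proportional to 1 / confidence."""
    inverse = {task: 1.0 / max(conf, 1e-8) for task, conf in task_confidence.items()}
    total = sum(inverse.values())
    return {task: value / total for task, value in inverse.items()}


def select_examples(pool_by_task, task_confidence, budget, seed=0):
    """Pick up to `budget` examples to annotate, favouring low-confidence tasks."""
    rng = random.Random(seed)
    probs = inverse_confidence_weights(task_confidence)
    # Shuffled copies, so the caller's pools are left untouched.
    remaining = {t: rng.sample(items, len(items)) for t, items in pool_by_task.items()}
    selected = []
    while len(selected) < budget:
        tasks = [t for t in remaining if remaining[t]]
        if not tasks:
            break  # every pool is exhausted before the budget was reached
        weights = [probs[t] for t in tasks]
        task = rng.choices(tasks, weights=weights, k=1)[0]
        selected.append((task, remaining[task].pop()))
    return selected
```

For example, with confidences `{"math": 0.25, "chat": 0.75}` the sketch samples the `math` task three times as often as `chat`, so the annotation budget concentrates on the task where the pre-trained model is least confident.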

Article 259

Title@2025-07-29 (2): Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench

Title: Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench

Welche LLMs bekommen den Spaß? Mit HumorBench nicht-STEM-vernünftige Fähigkeiten beweisen

哪些LLM能听懂笑话?用HumorBench探测非STEM推理能力 2507.21476v1

Authors (8): Reuben Narad, Siddharth Suresh, Jiayi Chen, Pine S. L. Dysart-Bricken, Bob Mankoff, Robert Nowak, Jifan Zhang, Lalit Jain

We present HumorBench, a benchmark designed to evaluate large language models’ (LLMs) ability to reason about and explain sophisticated humor in cartoon captions. As reasoning models increasingly saturate existing benchmarks in mathematics and science, novel and challenging evaluations of model intelligence beyond STEM domains are essential. Reasoning is fundamentally involved in text-based humor comprehension, requiring the identification of connections between concepts in cartoons/captions and external cultural references, wordplays, and other mechanisms. HumorBench includes approximately 300 unique cartoon-caption pairs from the New Yorker Caption Contest and Cartoonstock.com, with expert-annotated evaluation rubrics identifying essential joke elements. LLMs are evaluated based on their explanations towards the humor and abilities in identifying the joke elements. To perform well on this task, models must form and test hypotheses about associations between concepts, potentially backtracking from initial interpretations to arrive at the most plausible explanation. Our extensive benchmarking of current SOTA models reveals three key insights: (1) LLM progress on STEM reasoning transfers effectively to humor comprehension; (2) models trained exclusively on STEM reasoning data still perform well on HumorBench, demonstrating strong transferability of reasoning abilities; and (3) test-time scaling by increasing thinking token budgets yields mixed results across different models in humor reasoning.

我们介绍了Humorbench, 这是一项旨在评价大型语言模型(LLMs)在漫画字幕中解释和解释精密幽默能力的基准,即Humorbench(LLMs),这是用来评价大型语言模型(LLMs)在漫画标题中解释和解释精密幽默的能力的基准。由于推理模型越来越饱和数学和科学领域的现有基准,因此,在STEM领域之外对示范情报进行新颖和具有挑战性的评估至关重要。理性模型从根本上涉及基于文字的幽默理解,要求确定卡通/戏剧和外部文化参考、文字剧和其他机制的概念之间的联系。Humorbench(Humer Caption Contest and Cartoonstock.com)包括大约300对独特的卡通卡通插配对。 Humorbench(LM) com, 由专家附加注释的评价标注了基本的笑话要素。LLMSMs是根据其对确定笑话元素的幽默和能力的解释进行评估的。为了很好地完成这项任务,模型必须形成和测试关于概念之间关联的假设,可能从最初的解释到最可信的解释。我们对STEEM推理学的推理向幽默理解的广泛基准提供了三个专门训练的模型。

Article 260

Title@2025-07-29 (2): BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data

Title: BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data

BIG5-CHAT: LLM-Persönlichkeiten durch Schulung auf menschenverändernden Daten gestalten

BIG5-CHAT:通过提供人际数据培训塑造专业人才 2410.16491v3

Authors (6): Wenkai Li, Jiarui Liu, Andy Liu, Xuhui Zhou, Mona Diab, Maarten Sap

In this work, we tackle the challenge of embedding realistic human personality traits into LLMs. Previous approaches have primarily focused on prompt-based methods that describe the behavior associated with the desired personality traits, suffering from realism and validity issues. To address these limitations, we introduce BIG5-CHAT, a large-scale dataset containing 100,000 dialogues designed to ground models in how humans express their personality in language. Leveraging this dataset, we explore Supervised Fine-Tuning and Direct Preference Optimization as training-based methods to align LLMs more naturally with human personality patterns. Our methods outperform prompting on personality assessments such as BFI and IPIP-NEO, with trait correlations more closely matching human data. Furthermore, our experiments reveal that models trained to exhibit higher conscientiousness, higher agreeableness, lower extraversion, and lower neuroticism display better performance on reasoning tasks, aligning with psychological findings on how these traits impact human cognitive performance. To our knowledge, this work is the first comprehensive study to demonstrate how training-based methods can shape LLM personalities through learning from real human behaviors.

在这项工作中,我们应对将现实的人格特征纳入LLMs的挑战。以前的做法主要侧重于以迅速为基础的方法描述与期望的个性特征有关的行为,这些特征受到现实主义和有效性问题的影响。为了解决这些局限性,我们引入了BIG5-CHAT,这是一个大型的数据集,包含10万个对话,旨在将模型定位为人类如何用语言表达其个性。利用这一数据集,我们探索以监督性微调和直接偏好为基于培训的方法,使LMS更自然地与人的个性模式接轨。我们采用的方法超越了诸如BFI和IPIP-NEO等个性评估的速效方法,其特质相关性与人类数据更接近。此外,我们的实验显示,经过培训的模型在推理任务上表现得更好,与关于这些特征如何影响人类认知性表现的心理发现相一致。我们了解,这是第一次全面研究,以显示基于培训的方法如何通过学习真实人类行为塑造LM人格特征。

Article 261

Title@2025-07-29 (2): Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning

Title: Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning

Weiche Einspritzung von Task-Embeddings Outperforms Prompt-Based In-Context Learning

任务嵌入器的软输入超出迅速基于信息学习的绩效 2507.20906v2

Authors (2): Jungwon Park, Wonjong Rhee

In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks by conditioning on input-output examples in the prompt, without requiring any update in model parameters. While widely adopted, it remains unclear whether prompting with multiple examples is the most effective and efficient way to convey task information. In this work, we propose Soft Injection of task embeddings. The task embeddings are constructed only once using few-shot ICL prompts and repeatedly used during inference. Soft injection is performed by softly mixing task embeddings with attention head activations using pre-optimized mixing parameters, referred to as soft head-selection parameters. This method not only allows a desired task to be performed without in-prompt demonstrations but also significantly outperforms existing ICL approaches while reducing memory usage and compute cost at inference time. An extensive evaluation is performed across 57 tasks and 12 LLMs, spanning four model families of sizes from 4B to 70B. Averaged across 57 tasks, our method outperforms 10-shot ICL by 10.2%-14.3% across 12 LLMs. Additional analyses show that our method also serves as an insightful tool for analyzing task-relevant roles of attention heads, revealing that task-relevant head positions selected by our method transfer across similar tasks but not across dissimilar ones – underscoring the task-specific nature of head functionality. Our soft injection method opens a new paradigm for reducing prompt length and improving task performance by shifting task conditioning from the prompt space to the activation space.

文本中学习(ICL) 使大语言模型(LLMS) 能够通过对快速输入输出示例进行调整来完成任务, 无需在模型参数中作任何更新。虽然广泛采用, 但仍不清楚用多个示例来提示任务信息是否是最有效和高效的方式。在这项工作中, 我们建议对任务嵌入进行柔性输入。任务嵌入仅仅在使用微小的 ILLL 提示和反复在推断中使用一次。软性混合任务通过软性混装任务来进行, 并使用优化前混合参数来启动关注头部启动, 被称为软性快速化头部选择参数。这种方法不仅允许在不进行即时演示的情况下执行所期望的任务, 并且大大优于现有的 ICLL 方法, 同时减少记忆用量, 并在时间推移时再计成本。广泛评价了57项任务中的4B 至70B 。平均为57项任务, 我们的方法在10张的 ICLLO 上快速化模型, 以10%-14.3% 移动头部定位定位, 在12 LLMs 上, 分析一个类似的任务转移任务的方法。分析我们的任务。

Article 262

Title@2025-07-29 (2): Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour

Title: Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour

Auf dem Weg zu lokal einsetzbaren großformatigen großformatigen Sprachmodellen für Modewahlverhalten

以当地可部署的优质因果大语言模式进行模式选择行为 2507.21432v1

Authors (2): Tareq Alsaleh, Bilal Farooq

This study investigates the adoption of open-access, locally deployable causal large language models (LLMs) for travel mode choice prediction and introduces LiTransMC, the first fine-tuned causal LLM developed for this task. We systematically benchmark eleven LLMs (1-12B parameters) across three stated and revealed preference datasets, testing 396 configurations and generating over 79,000 synthetic commuter predictions. Beyond predictive accuracy, we evaluate model-generated reasoning using BERTopic for topic modelling and a novel Explanation Strength Index, providing the first structured analysis of how LLMs articulate decision factors in alignment with behavioural theory. LiTransMC, fine-tuned using a parameter-efficient and loss-masking strategy, achieved a weighted F1 score of 0.6845 and a Jensen-Shannon Divergence of 0.000245, surpassing both untuned local models and larger proprietary systems, including GPT-4o with advanced persona inference and embedding-based loading, while also outperforming classical mode choice methods such as discrete choice models and machine learning classifiers for the same dataset. This dual improvement, i.e., high instant-level accuracy and near-perfect distributional calibration, demonstrates the feasibility of creating specialist, locally deployable LLMs that integrate prediction and interpretability. Through combining structured behavioural prediction with natural language reasoning, this work unlocks the potential for conversational, multi-task transport models capable of supporting agent-based simulations, policy testing, and behavioural insight generation. These findings establish a pathway for transforming general purpose LLMs into specialized, explainable tools for transportation research and policy formulation, while maintaining privacy, reducing cost, and broadening access through local deployment.

这项研究调查了在旅行模式选择预测中采用可在当地部署的开放性因果大型语言模型(LLMS)的情况,并介绍了LiTransMC,这是为这项任务开发的第一个经精细调整的因果性LMM。我们系统地将11个LMS(1-12B参数)的基准基准设定为三个公开的优惠数据集,测试396个配置,并生成79 000多个合成通勤者预测。除了预测准确性外,我们评估模型还利用BERTopic进行主题模型和新颖的解释力指数进行推理,首次结构化分析LITransMC如何根据行为理论来阐述扩大决定因素。LiTransMC,利用参数高效和损失掩码战略进行微调,实现了0.6845分的加权F1分和0.007025的Jensen-Shannon Divergence, 超过了未调整的本地模型和更大的专利系统,包括GPT-4o, 高级人物的推断和嵌入式装式装,同时超越了典型模式选择方法,例如独立选择模型和机器学习分解等数据集。这种双重改进,即时支持级的精确级精确级精确精确精确精确精确精确和接近的精确分析,同时解释,将这种可操作性分析、可操作化的精确化的精确和接近的逻辑分析分析,并解释和接近性精确性分析分析分析分析分析分析分析分析分析分析分析分析,并解释,这些可进行基础分析,通过基础的计算分析,并解释,通过可操作性演算算法性分析,通过基础分析,通过可操作性推算法性演算算算算制制制制制制制式的计算。
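The abstract uses Jensen-Shannon Divergence as its measure of distributional calibration. As a hedged illustration, the sketch below computes it between predicted and observed mode shares. The natural-log base, the mode labels, and the helper names are assumptions; the abstract does not state how the authors computed the reported 0.000245.

```python
import math
from collections import Counter


def mode_shares(choices, modes):
    """Share of each travel mode in a non-empty list of chosen modes."""
    counts = Counter(choices)
    total = len(choices)
    return [counts[m] / total for m in modes]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions, bounded by ln(2) in nats."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
```

A value near zero means the predicted mode shares almost coincide with the observed ones, even if individual predictions are wrong, which is why the abstract reports it alongside the weighted F1 score.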

Article 263

Title@2025-07-29 (2): LLAMAPIE: Proactive In-Ear Conversation Assistants

Title: LLAMAPIE: Proactive In-Ear Conversation Assistants

LLAMAPIE: Proaktive In-Ear-Gesprächsassistenten

LLAMAPIE: 主动的在轨在轨对话助理 2505.04066v2

Authors (5): Tuochao Chen, Nicholas Batchelder, Alisa Liu, Noah Smith, Shyamnath Gollakota

We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive model, highlighting the potential of LlamaPIE to enhance live conversations.

我们引入了第一个实时主动助理Llamapie(Llamapie),这是第一个旨在通过可听设备提供的简明指导加强人类对话的实时主动助理。与需要明确用户使用的传统的语言模式不同,该助理在背景中运作,预见用户的需要,而不中断交谈。我们应对若干挑战,包括确定何时作出反应,起草简明的应对措施,加强对话,利用用户知识提供背景认知援助,实时、实时、设备处理。为了实现这一目标,我们建立了一个半合成对话数据集,并提出一个双模样的管道:一个小模型,决定何时作出反应,一个更大的模型,产生反应。我们评估了我们关于现实世界数据集的方法,展示了它在提供无干扰的帮助方面的有效性。与我们的助理进行的用户研究,在苹果硅M2硬件上实施,显示积极助理在没有援助和反应模型的基线上都非常偏好,强调LlamaPie加强现场对话的潜力。

Article 264

Title@2025-07-29 (2): Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling

Title: Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling

Bergbau-Intrinsische Belohnungen aus LLM-Hidden States für effiziente Best-of-N-Probenahme

LLM隐藏国为高效率最佳采样而从LLM公司获得的采矿内部奖赏 2505.12225v2

Authors (4): Jizhou Guo, Zhaomin Wu, Hanchen Yang, Philip S. Yu

Enhancing the performance of Large Language Models (LLMs) with best-of-N sampling is effective and has attracted significant attention. However, it is computationally prohibitive due to massive, data-hungry text-based reward models. By changing the data source from text to hidden states, we introduce SWIFT (Simple Weighted Intrinsic Feedback Technique), a novel, lightweight technique that leverages the rich information embedded in LLM hidden states to address these issues; it operates at the token level and consists of only linear layers. Extensive experiments show that SWIFT outperforms baselines with less than 0.005% of the parameters of baselines, requiring only a few samples for training, demonstrating significant efficiency improvement. SWIFT’s robust scalability, applicability to some closed-source models via logits, and ability to be combined with traditional reward models to yield further performance gains underscore its practical value.

提高大语言模型(LLM)以最佳N抽样的性能是有效的,并吸引了人们的极大关注。然而,由于大量的数据饥饿文本奖赏模式,它在计算上令人望而却步。通过将数据源从文本转换为隐蔽状态,我们引入了SWIFT(Spreaty weighted Intrinsic conference Technique ) , 这是一种新型的轻量级技术,利用LLM隐藏状态中丰富的信息来解决这些问题,该技术在象征性层面运作,仅包括线性层。广泛的实验表明SWIFT在基线参数低于0.005%的情况下超过了基线,只需要少数几个样本来进行培训,显示出显著的效率提高。 SWIFT的强大可扩展性、通过登录对一些封闭源模型的可适用性、与传统奖赏模式相结合以产生进一步业绩收益的能力强调了其实际价值。

Article 265

Title@2025-07-29 (2): MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations

Title: MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations

MemTool: Optimierung der Kurzzeit-Speicherverwaltung für dynamisches Werkzeug beim Aufrufen von LLM Agent Multi-Turn-Konversationen

MemTool:优化短期内存管理,以便利用动态工具在LLM代理多转对话中打电话 2507.21428v1

Authors (5): Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, James A. Burke

Large Language Model (LLM) agents have shown significant autonomous capabilities in dynamically searching and incorporating relevant tools or Model Context Protocol (MCP) servers for individual queries. However, fixed context windows limit effectiveness in multi-turn interactions requiring repeated, independent tool usage. We introduce MemTool, a short-term memory framework enabling LLM agents to dynamically manage tools or MCP server contexts across multi-turn conversations. MemTool offers three agentic architectures: 1) Autonomous Agent Mode, granting full tool management autonomy, 2) Workflow Mode, providing deterministic control without autonomy, and 3) Hybrid Mode, combining autonomous and deterministic control. Evaluating each MemTool mode across 13+ LLMs on the ScaleMCP benchmark, we conducted experiments over 100 consecutive user interactions, measuring tool removal ratios (short-term memory efficiency) and task completion accuracy. In Autonomous Agent Mode, reasoning LLMs achieve high tool-removal efficiency (90-94% over a 3-window average), while medium-sized models exhibit significantly lower efficiency (0-60%). Workflow and Hybrid modes consistently manage tool removal effectively, whereas Autonomous and Hybrid modes excel at task completion. We present trade-offs and recommendations for each MemTool mode based on task accuracy, agency, and model capabilities.

大型语言模型(LLM)代理商在动态搜索和纳入相关工具或用于个别查询的示范背景协议服务器方面表现出了强大的自主能力;然而,固定环境窗口限制了需要反复独立使用工具的多方向互动的实效;我们引入了MemTool,这是一个短期存储框架,使LMTors能够在多方向对话中动态管理工具或MCP服务器环境;MemTool提供三种代理结构:(1)自主代理模式,给予充分工具管理自主权;(2)工作流程模式,给予完全工具管理自主权的确定性控制;(3)混合模式,结合自主和确定性控制;在规模MCP基准上对13+LLMS的每个MTool模式进行评价,我们进行了100多次连续用户互动试验,测量工具清除比率(短期存储效率)和任务完成准确性;在自主代理模式中,推理LMMs实现了高工具清除效率(在3个窗口中为90-94%),而中等规模模型显示效率(0-60%);工作流程和混合模式,在13+LMLMS的基准上对工具清除进行了持续管理,而自主和混合模式则以任务完成方式为基础。

Article 266

Title@2025-07-29 (2): ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

Title: ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

ReGATE: Schneller und besser lernen mit weniger Token in MLLMs

ReGATE:在多模态大语言模型中以更少的词元学得更快更好 2507.21420v1

Authors (3): Chaoyu Li, Yogesh Kulkarni, Pooyan Fazli

The computational cost of training multimodal large language models (MLLMs) rapidly increases with the number of tokens involved. Existing efficiency methods primarily target inference and rely on token reduction or merging, offering limited benefit during training. In this paper, we propose ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. Specifically, ReGATE adopts a teacher-student framework in which the MLLM being trained serves as the student, and a frozen reference large language model (LLM) acts as the teacher. The teacher computes per-token reference losses, which are combined with an exponential moving average (EMA) of the student’s own difficulty scores. This adaptive difficulty-based scoring enables the selective processing of crucial tokens while bypassing less informative ones in the forward pass, significantly reducing computational overhead. Experiments demonstrate that ReGATE, when applied to VideoLLaMA2, matches the peak accuracy of standard training on MVBench up to 2× faster, using only 35% of the tokens. With additional training, it even surpasses the baseline on several multimodal benchmarks, all while reducing the total token count by over 41%. Code and models will be released soon.

培训多式联运大语言模型的计算成本随着所涉物证数量的增加而迅速增加。现有的效率方法主要针对推断,依靠象征性减少或合并,在培训期间提供有限的好处。在本文中,我们提议“ReGATE(参考美元-美元指导适应 Token Elision)”,这是加速MLLM培训的一种适应性象征性模拟方法。具体地说,ReGATE采用师资-学生框架,培训MLLMM作为学生,冻结的参考大语言模型(LLLM)作为教师。教师计算每吨参考损失,结合学生自身困难分数的指数移动平均数(EMA)。这种基于适应性困难的评分使关键物证的选择性处理,同时绕过远道信息较少的标志,大大减少计算间接费用。实验表明,ReGATE在应用VALLAMA2时,与MVBench最高标准培训的准确性达到2美元相比,速度更快。教师计算每吨参考物证损失,与学生自己困难分数的指数平均移动平均数(EMA)相结合。这种基于适应性的评分数的评分法评分后,很快将超过第41号的模型。额外计数将超过基准。
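The scoring step in the ReGATE abstract can be pictured with a toy sketch. The abstract does not say how the teacher's reference losses and the student's EMA difficulty are combined, so the code below assumes a simple difference (EMA difficulty minus reference loss) and a fixed keep ratio. The decay value, the `keep_ratio` parameter, and all function names are hypothetical.

```python
def update_ema(ema_difficulty, student_losses, decay=0.9):
    """Blend the latest per-token student losses into the running difficulty."""
    return [decay * e + (1.0 - decay) * loss
            for e, loss in zip(ema_difficulty, student_losses)]


def select_tokens(ema_difficulty, reference_losses, keep_ratio=0.35):
    """Return the positions of the tokens to keep, in their original order.

    Assumed score: tokens the student finds hard (high EMA difficulty)
    relative to the frozen teacher's reference loss rank highest.
    """
    scores = [e - r for e, r in zip(ema_difficulty, reference_losses)]
    keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

In a training loop built on this sketch, only the returned positions would be passed through the student's forward pass, which is where the compute saving described in the abstract would come from.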

Article 267

Title@2025-07-28 (1): Teaching Language Models To Gather Information Proactively

Title: Teaching Language Models To Gather Information Proactively

Sprachmodelle lehren, um Informationen proaktiv zu sammeln

积极主动地收集资料的教学语言模式 2507.21389v1

Authors (7): Tenghao Huang, Sihao Chen, Muhao Chen, Jonathan May, Longqi Yang, Mengting Wan, Pei Zhou

Large language models (LLMs) are increasingly expected to function as collaborative partners, engaging in back-and-forth dialogue to solve complex, ambiguous problems. However, current LLMs often falter in real-world settings, defaulting to passive responses or narrow clarifications when faced with incomplete or under-specified prompts, falling short of proactively gathering the missing information that is crucial for high-quality solutions. In this work, we introduce a new task paradigm: proactive information gathering, where LLMs must identify gaps in the provided context and strategically elicit implicit user knowledge through targeted questions. To systematically study and train this capability, we design a scalable framework that generates partially specified, real-world tasks, masking key information and simulating authentic ambiguity. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information – such as hidden domain expertise or fine-grained requirements – that would otherwise remain unspoken. Experiments demonstrate that our trained Qwen-2.5-7B model significantly outperforms o3-mini by 18% on automatic evaluation metrics. More importantly, human evaluation reveals that clarification questions and final outlines generated by our model are favored by human annotators by 42% and 28% respectively. Together, these results highlight the value of proactive clarification in elevating LLMs from passive text generators to genuinely collaborative thought partners.

大型语言模型(LLMS)日益被期望作为协作伙伴发挥作用,参与前后对话,以解决复杂、模糊的问题。然而,目前的LLMS往往在现实世界环境中摇摇欲坠,在面对不完全或未充分指明的提示时,默认被动反应或狭隘的澄清,没有主动收集对高质量解决方案至关重要的缺失信息。在这项工作中,我们引入一个新的任务模式:积极主动的信息收集,LLMS必须查明所提供环境中的差距,通过有针对性的问题从战略上获取隐含的用户知识。为了系统学习和培训这一能力,我们设计了一个可扩展的框架,产生部分具体、真实的任务,掩盖关键信息并模拟真实的模糊性。在这个设置中,我们的核心创新是一种强化的微调战略,奖励那些真正产生新的、隐性用户信息的问题 – – 例如隐藏的域专门知识或细微的分类要求 – – 而在其他方面,我们受过训练的 Quen-2.5-7B模式必须找出差距,通过有针对性的问题来大大超越O3-min 18 % 的自动评价指标。更重要的是,人类评估表明,澄清问题和最后概要分别由模型产生的42 % 和最后提要由人类的蓝图分别由人类思考结果显示的澄清结果分别由人类的模型和最后的模型产生的结果。

Article 268

Title@2025-07-28 (1): Ai2 Scholar QA: Organized Literature Synthesis with Attribution

Title: Ai2 Scholar QA: Organized Literature Synthesis with Attribution

Ai2 Scholar QA: Organisierte Literatursynthese mit Attribution

Ai2学者QA:有组织文学综述与归属 2504.10861v2

Authors (18): Amanpreet Singh, Joseph Chee Chang, Chloe Anastasiades, Dany Haddad, Aakanksha Naik, Amber Tanaka, Angele Zamarron, Cecile Nguyen, Jena D. Hwang, Jason Dunkleberger, Matt Latzke, Smita Rao, Jaron Lochner, Rob Evans, Rodney Kinney, Daniel S. Weld, Doug Downey, Sergey Feldman

Retrieval-augmented generation is increasingly effective in answering scientific questions from literature, but many state-of-the-art systems are expensive and closed-source. We introduce Ai2 Scholar QA, a free online scientific question answering application. To facilitate research, we make our entire pipeline public: as a customizable open-source Python package and interactive web app, along with paper indexes accessible through public APIs and downloadable datasets. We describe our system in detail and present experiments analyzing its key design decisions. In an evaluation on a recent scientific QA benchmark, we find that Ai2 Scholar QA outperforms competing systems.

检索增强型的一代在解答文学科学问题方面越来越有效,但许多最先进的系统是昂贵和封闭源码的。我们引入了Ai2学者QA,这是一个免费的在线科学问题解答应用程序。为了便利研究,我们公布整个管道:作为定制的开放源码Python软件包和互动式网络应用程序,以及可以通过公共API和可下载数据集查阅的纸质索引。我们详细描述我们的系统,并介绍分析其关键设计决定的实验。在对最近科学QA基准的评估中,我们发现Ai2学者QA优于相互竞争的系统。

Article 269

Title@2025-07-28 (1): Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge

Title: Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge

Beyond the Reported Cutoff: Wo große Sprachmodelle auf finanzielles Wissen zurückfallen

超越报告的截止点:大语言模式对财务知识的缺陷 2504.00042v2

Authors (5): Agam Shah, Liqin Ye, Sebastian Jaskowski, Wei Xu, Sudheer Chava

Large Language Models (LLMs) are frequently utilized as sources of knowledge for question-answering. While it is known that LLMs may lack access to real-time data or newer data produced after the model’s cutoff date, it is less clear how their knowledge spans across historical information. In this study, we assess the breadth of LLMs’ knowledge using financial data of U.S. publicly traded companies by evaluating more than 197k questions and comparing model responses to factual data. We further explore the impact of company characteristics, such as size, retail investment, institutional attention, and readability of financial filings, on the accuracy of knowledge represented in LLMs. Our results reveal that LLMs are less informed about past financial performance, but they display a stronger awareness of larger companies and more recent information. Interestingly, at the same time, our analysis also reveals that LLMs are more likely to hallucinate for larger companies, especially for data from more recent years. The code, prompts, and model outputs are available on GitHub.

大型语言模型(LLMS)经常被用作解答问题的知识来源,虽然人们知道LLMS可能无法获得在模型截止日之后产生的实时数据或最新数据,但不清楚其知识如何跨越历史信息。在本研究中,我们利用美国公开交易公司的金融数据评估LLMs知识的广度,评估了197个以上的问题,比较了对事实数据的示范答复。我们进一步探讨了公司特点,如规模、零售投资、机构关注和金融档案的可读性,对LLMS中知识的准确性的影响。我们的结果显示,LLMs不太了解过去的财务业绩,但它们表现出对大公司和最新信息的更深刻认识。有趣的是,我们的分析还表明,LLMS更有可能给大公司带来幻灭,特别是近些年的数据。GitHub提供了代码、提示和模型产出。

Article 270

Title@2025-07-28 (1): Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Title: Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Audio Flamingo 3: Advancing Audio Intelligence mit vollständig offenen großen Audio-Sprachen-Modelle

Audio Flamingo 3:以完全开放的大型音频语言模式推进音频情报 2507.08128v2

Authors (11): Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro

We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.

我们提出Audio Flamingo 3(AF3),这是一个完全开放的最先进(SOTA)大型音频语言模型,可推进对语音、声音和音乐的推理和理解。AF3引入了:(一) AF-Whisper,一个统一的音频编码器,采用新颖策略训练,用于语音、声音和音乐三种模态的联合表征学习;(二) 灵活的按需思考,允许模型在回答之前进行思维链式推理;(三) 多轮、多音频聊天;(四) 最长10分钟的长音频理解和推理(包括语音);和(五) 语音对语音互动。为实现这些能力,我们提出了若干采用新颖策略整理的大规模训练数据集,包括AudioSkills-XL、LongAudio-XL、AF-Think和AF-Chat,并用新颖的五阶段课程式训练策略训练AF3。AF3仅使用开源音频数据训练,在20多个(长)音频理解和推理基准上取得了新的SOTA结果,超过了在大得多的数据集上训练的开放权重模型和闭源模型。

Article 271

Title@2025-07-28 (1): Turbocharging Web Automation: The Impact of Compressed History States

Title: Turbocharging Web Automation: The Impact of Compressed History States

Turbocharging Web Automation: Die Auswirkungen komprimierter Verlaufszustände

为网络自动化加速:压缩历史状态的影响 2507.21369v1

Authors (4): Xiyue Zhu, Peng Tang, Haofu Liao, Srikar Appalaraju

Language models have led to a leap forward in web automation. The current web automation approaches take the current web state, history actions, and language instruction as inputs to predict the next action, overlooking the importance of history states. However, the highly verbose nature of web page states can result in long input sequences and sparse information, hampering the effective utilization of history states. In this paper, we propose a novel web history compressor approach to turbocharge web automation using history states. Our approach employs a history compressor module that distills the most task-relevant information from each history state into a fixed-length short representation, mitigating the challenges posed by the highly verbose history states. Experiments are conducted on the Mind2Web and WebLINX datasets to evaluate the effectiveness of our approach. Results show that our approach obtains 1.2-5.4% absolute accuracy improvements compared to the baseline approach without history inputs.

语言模型导致了网络自动化的飞跃。当前的网络自动化方法将当前的网络状态、历史动作和语言教学作为预测下一步行动的投入,忽略历史的重要性。但是,网页状态的高度杂乱性质可能导致输入序列长和信息稀少,阻碍历史状态的有效利用。在本文中,我们建议采用新的网络历史压缩器方法,用历史状态对网络自动化进行涡轮。我们的方法使用历史压缩机模块,将每个历史状态中最与任务相关的信息压缩成一个固定长度的短期代表,减轻高度扭曲的历史状态带来的挑战。在Mind2Web和WebLINX数据集上进行了实验,以评价我们的方法的有效性。结果显示,我们的方法比没有历史投入的基线方法获得了1.2-5.4%的绝对准确性改进。

Article 272

Title@2025-07-28 (1): StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation

Title: StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation

StructText: Ein synthetischer Table-to-Text-Ansatz für Benchmark-Erzeugung mit multidimensionaler Bewertung

StructText:采用多维评价的基准生成合成表到文本方法 2507.21340v1

Authors (4): Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz

Extracting structured information from text, such as key-value pairs that could augment tabular data, is quite useful in many enterprise use cases. Although large language models (LLMs) have enabled numerous automated pipelines for converting natural language into structured formats, there is still a lack of benchmarks for evaluating their extraction quality, especially in specific domains or focused documents specific to a given organization. Building such benchmarks by manual annotations is labour-intensive and limits the size and scalability of the benchmarks. In this work, we present StructText, an end-to-end framework for automatically generating high-fidelity benchmarks for key-value extraction from text using existing tabular data. It uses available tabular data as structured ground truth, and follows a two-stage "plan-then-execute" pipeline to synthetically generate corresponding natural-language text. To ensure alignment between text and structured source, we introduce a multi-dimensional evaluation strategy that combines (a) LLM-based judgments on factuality, hallucination, and coherence and (b) objective extraction metrics measuring numeric and temporal accuracy. We evaluated the proposed method on 71,539 examples across 49 datasets. Results reveal that while LLMs achieve strong factual accuracy and avoid hallucination, they struggle with narrative coherence in producing extractable text. Notably, models preserve numerical and temporal information with high fidelity, yet this information becomes embedded in narratives that resist automated extraction. We release a framework, including datasets, evaluation tools, and baseline extraction systems, to support continued research.

从文本中提取结构化信息,例如能够增加表格数据的关键值对等数据,在许多企业使用案例中非常有用。虽然大型语言模型(LLMS)使许多自动管道能够将自然语言转换成结构化格式,但仍缺乏评估其提取质量的基准,特别是在特定领域或特定组织特有的重点文件方面。通过人工说明建立此类基准是劳动密集型的,限制了基准的规模和可缩放性。在这项工作中,我们介绍了StructText,一个端对端框架,利用现有表格数据自动生成用于从文本中提取键值的高保真基准。它用现有表格数据作为结构化的地面真相,并遵循两个阶段的“先计划后执行”管道,以合成方式生成相应的自然语言文本。为确保文本和结构化来源之间的一致性,我们引入了一个多维的评价战略,将(a)基于LLM对事实性、幻觉和一致性的判断与(b)衡量数字和时间精确度的客观提取指标结合起来。我们在49个数据集的71,539个示例上评估了所提出的方法。结果显示,虽然LLM具有很高的事实准确性并能避免幻觉,但在生成可提取文本时难以保持叙述连贯性。值得注意的是,模型能高保真地保留数字和时间信息,但这些信息被嵌入难以自动提取的叙述之中。我们发布了一个框架,包括数据集、评价工具和基线提取系统,以支持后续研究。

Article 273

Title@2025-07-28 (1): A Deep Learning Automatic Speech Recognition Model for Shona Language

Title: A Deep Learning Automatic Speech Recognition Model for Shona Language

Ein Deep Learning automatische Spracherkennung Modell für Shona Sprache

Shona语言深学习自动语音识别模式 2507.21331v1

Authors (2): Leslie Wellington Sirora, Mainford Mutandavari

This study presented the development of a deep learning-based Automatic Speech Recognition system for Shona, a low-resource language characterized by unique tonal and grammatical complexities. The research aimed to address the challenges posed by limited training data, lack of labelled data, and the intricate tonal nuances present in Shona speech, with the objective of achieving significant improvements in recognition accuracy compared to traditional statistical models. The research first explored the feasibility of using deep learning to develop an accurate ASR system for Shona. Second, it investigated the specific challenges involved in designing and implementing deep learning architectures for Shona speech recognition and proposed strategies to mitigate these challenges. Lastly, it compared the performance of the deep learning-based model with existing statistical models in terms of accuracy. The developed ASR system utilized a hybrid architecture consisting of a Convolutional Neural Network for acoustic modelling and a Long Short-Term Memory network for language modelling. To overcome the scarcity of data, data augmentation techniques and transfer learning were employed. Attention mechanisms were also incorporated to accommodate the tonal nature of Shona speech. The resulting ASR system achieved impressive results, with a Word Error Rate of 29%, Phoneme Error Rate of 12%, and an overall accuracy of 74%. These metrics indicated the potential of deep learning to enhance ASR accuracy for under-resourced languages like Shona. This study contributed to the advancement of ASR technology for under-resourced languages like Shona, ultimately fostering improved accessibility and communication for Shona speakers worldwide.

这项研究介绍了为Shona开发一个基于深度学习的自动语音识别系统,Shona是一种低资源语言,其特点是具有独特的声调和语法复杂性。研究的目的是应对培训数据有限、缺乏贴标签数据以及Shona语音中复杂的声调细微差别所带来的挑战,目标是与传统统计模型相比,在识别准确性方面实现显著改进。研究首先探讨了利用深度学习为Shona开发准确的ASR系统的可行性;其次,研究探讨了为Shona语音识别设计和实施深度学习结构所涉及的具体挑战,并提出了减轻这些挑战的战略;最后,将基于深度学习的模型与现有统计模型的准确性进行比较。开发的ASR系统利用了一个混合结构,其中包括用于声学建模的卷积神经网络和用于语言建模的长短期记忆网络。为了克服数据匮乏,采用了数据增强技术和迁移学习。还纳入了注意力机制,以适应Shona语音的声调性质。由此形成的ASR系统取得了令人印象深刻的结果,词错误率为29%,音素错误率为12%,总体准确率为74%。这些指标表明深度学习具有提高Shona等资源不足语言的ASR准确性的潜力。这项研究推动了面向Shona等资源不足语言的ASR技术的发展,最终有助于改善全世界Shona语使用者的无障碍环境和交流。
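The abstract reports a Word Error Rate and a Phoneme Error Rate without defining them. As a point of reference, the sketch below computes WER in its usual form: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the placeholder tokens are illustrative assumptions, not taken from the paper; the same routine yields a phoneme error rate if the inputs are phoneme sequences.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens: one substitution and one deletion over four words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.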

Article 274

Title@2025-07-28 (1): SQuat: Subspace-orthogonal KV Cache Quantization

Title: SQuat: Subspace-orthogonal KV Cache Quantization

SQuat: Subraum-orthogonale KV-Cache-Quantisierung

SQuat: 子空间正交 KV 缓存量化 2503.24358v2

Authors (4): Hao Wang, Ligong Han, Kai Xu, Akash Srivastava

The key-value (KV) cache accelerates LLM decoding by storing KV tensors from previously generated tokens. It reduces redundant computation at the cost of increased memory usage. To mitigate this overhead, existing approaches compress KV tensors into lower-bit representations; however, quantization errors can accumulate as more tokens are generated, potentially resulting in undesired outputs. In this paper, we introduce SQuat (Subspace-orthogonal KV cache quantization). It first constructs a subspace spanned by query tensors to capture the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace, minimizing the impact of quantization errors on the attention mechanism’s outputs. SQuat requires no model fine-tuning, no additional calibration dataset for offline learning, and is grounded in a theoretical framework we develop. Through numerical experiments, we show that our method reduces peak memory by 2.17x to 2.82x, improves throughput by 2.45x to 3.60x, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.

键值(KV)缓存通过存储先前生成的词元的KV张量来加速LLM解码。它以增加内存使用为代价减少冗余计算。为了减轻这一开销,现有方法将KV张量压缩成低位表示;但是,随着生成更多词元,量化误差会累积,可能导致不理想的输出。在本文中,我们引入SQuat(Subspace-orthogonal KV cache quantization,子空间正交KV缓存量化)。它首先构建一个由查询张量张成的子空间,以获取最关键的任务相关信息。在键张量量化过程中,它强制(反)量化键与原始键之间的差保持与该子空间正交,从而尽量减少量化误差对注意力机制输出的影响。SQuat不需要模型微调,不需要用于离线学习的额外校准数据集,并以我们建立的理论框架为基础。通过数值实验,我们表明我们的方法将峰值内存减少2.17至2.82倍,将吞吐量提高2.45至3.60倍,并取得比现有KV缓存量化算法更好的基准分数。
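The orthogonality condition in the abstract can be illustrated with a small numeric sketch. It is not SQuat's quantization algorithm: it only shows why a key error that is orthogonal to the query subspace leaves the attention logits of queries in that subspace unchanged. The orthonormal basis, the rounding grid, and all function names are assumptions made for illustration, and the adjusted key here is no longer on the quantization grid, which the actual method would have to handle.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def round_to_grid(vector, step):
    """Naive uniform quantization: round each entry to a multiple of step."""
    return [round(x / step) * step for x in vector]


def remove_subspace_component(residual, orthonormal_basis):
    """Subtract the projection of residual onto span(orthonormal_basis)."""
    out = list(residual)
    for q in orthonormal_basis:
        coefficient = dot(residual, q)
        out = [o - coefficient * qi for o, qi in zip(out, q)]
    return out


# Hypothetical query subspace: the first two coordinate axes.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
key = [0.33, -0.71, 0.48]

naive_key = round_to_grid(key, step=0.25)
residual = [n - k for n, k in zip(naive_key, key)]

# Keep only the part of the error that is orthogonal to the query subspace.
orthogonal_error = remove_subspace_component(residual, basis)
adjusted_key = [k + e for k, e in zip(key, orthogonal_error)]

query = [2.0, -1.0, 0.0]  # lies inside the assumed subspace
print(dot(query, key), dot(query, naive_key), dot(query, adjusted_key))
```

For the query above, the naive key changes the logit while the adjusted key reproduces it, since the remaining error has no component along any direction in the subspace.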

Article 275

Title@2025-07-28 (1): Do Large Language Models Understand Morality Across Cultures?

Title: Do Large Language Models Understand Morality Across Cultures?

Verstehen große Sprachmodelle Moral über Kulturen hinweg?

大语言模式是否理解各种文化的道德? 2507.21319v1

Authors (4): Hadi Mohammadi, Yasmeen F. S. S. Meijer, Efthymia Papadopoulou, Ayoub Bagheri

Recent advancements in large language models (LLMs) have established them as powerful tools across numerous domains. However, persistent concerns about embedded biases, such as gender, racial, and cultural biases arising from their training data, raise significant questions about the ethical use and societal consequences of these technologies. This study investigates the extent to which LLMs capture cross-cultural differences and similarities in moral perspectives. Specifically, we examine whether LLM outputs align with patterns observed in international survey data on moral attitudes. To this end, we employ three complementary methods: (1) comparing variances in moral scores produced by models versus those reported in surveys, (2) conducting cluster alignment analyses to assess correspondence between country groupings derived from LLM outputs and survey data, and (3) directly probing models with comparative prompts using systematically chosen token pairs. Our results reveal that current LLMs often fail to reproduce the full spectrum of cross-cultural moral variation, tending to compress differences and exhibit low alignment with empirical survey patterns. These findings highlight a pressing need for more robust approaches to mitigate biases and improve cultural representativeness in LLMs. We conclude by discussing the implications for the responsible development and global deployment of LLMs, emphasizing fairness and ethical alignment.

最近大语言模型(LLMs)的进展是众多领域的一个有力工具,然而,对其培训数据产生的性别、种族和文化偏见等内在偏见的持续关切,对这些技术的道德使用和社会后果提出了重大问题。本研究报告调查LLMs在多大程度上抓住了跨文化差异和道德观点的相似之处。具体地说,我们研究LLM产出是否与国际道德态度调查数据中观察到的模式相一致。为此,我们采用三种补充方法:(1)比较模型产生的道德分数与调查中报告的差异;(2)进行分组调整分析,以评估从LLM产出和调查数据中得出的国家分组之间的对应关系;(3)利用系统选择的象征性配对直接探索具有比较提示的模型。我们的结果显示,目前的LMS往往不能完全复制跨文化道德差异的全方位,倾向于压缩差异,并显示与经验调查模式不相符。这些结果突出表明迫切需要采取更强有力的办法,减少LMs的偏见,提高文化代表性。我们最后通过讨论LMs负责任发展和全球部署的影响,强调公平和道德一致性。

Article 276

Title@2025-07-28 (1): Can human clinical rationales improve the performance and explainability of clinical text classification models?

Title: Can human clinical rationales improve the performance and explainability of clinical text classification models?

Können menschliche klinische Grundlagen die Leistungsfähigkeit und Erklärbarkeit klinischer Textklassifikationsmodelle verbessern?

人类临床原理能否改善临床文本分类模型的性能和解释性? 2507.21302v1

Authors (4): Christoph Metzner, Shang Gao, Drahomira Herrmannova, Heidi A. Hanson

AI-driven clinical text classification is vital for explainable automated retrieval of population-level health information. This work investigates whether human-based clinical rationales can serve as additional supervision to improve both performance and explainability of transformer-based models that automatically encode clinical documents. We analyzed 99,125 human-based clinical rationales that provide plausible explanations for primary cancer site diagnoses, using them as additional training samples alongside 128,649 electronic pathology reports to evaluate transformer-based models for extracting primary cancer sites. We also investigated sufficiency as a way to measure rationale quality for pre-selecting rationales. Our results showed that clinical rationales as additional training data can improve model performance in high-resource scenarios but produce inconsistent behavior when resources are limited. Using sufficiency as an automatic metric to preselect rationales also leads to inconsistent results. Importantly, models trained on rationales were consistently outperformed by models trained on additional reports instead. This suggests that clinical rationales don’t consistently improve model performance and are outperformed by simply using more reports. Therefore, if the goal is optimizing accuracy, annotation efforts should focus on labeling more reports rather than creating rationales. However, if explainability is the priority, training models on rationale-supplemented data may help them better identify rationale-like features. We conclude that using clinical rationales as additional training data results in smaller performance improvements and only slightly better explainability (measured as average token-level rationale coverage) compared to training on additional reports.

人工智能驱动的临床文本分类对于人口一级健康信息的可解释的自动检索至关重要。这项工作调查了基于人类的临床理由是否可以作为额外的监督来提高基于变压器的模型的性能和解释性,从而自动编码临床文件。我们分析了99,125个基于人类的临床理由,为初级癌症现场诊断提供了合理的解释,用这些理由作为额外的培训样本,同时用128,649份电子病理学报告来评价提取初级癌症现场的基于变压器模型。我们还调查了充足性作为衡量预选理由的理由质量的一种方法。我们的结果表明,作为额外培训数据的临床理由可以改善高资源情景中的模型性能,但在资源有限时则会产生不一致的行为。使用充足性作为预选理由的自动衡量标准也会导致不一致的结果。重要的是,用理由训练的模型始终不如改用额外报告训练的模型。这表明临床理由并不能持续改善模型性能,而且仅仅使用更多报告就能取得更好的效果。因此,如果目标是优化准确性,标注工作应侧重于标注更多报告,而不是编写理由。然而,如果可解释性是优先事项,在补充了理由的数据上训练模型可能有助于它们更好地识别类似理由的特征。我们的结论是,与使用额外报告进行训练相比,将临床理由用作额外训练数据所带来的性能提升较小,可解释性(以平均词元级理由覆盖率衡量)也只是略好。

Article 277

Title@2025-07-28 (1): FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

Title: FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

FlagEvalMM: Flexibler Rahmen für eine umfassende multimodale Modellbewertung

FlagEvalMM:综合多式联运模式评价灵活框架 2506.09081v3

Authors (8): Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Jin-Ge Yao, Bowen Qin, Richeng Xuan, Xi Yang

We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.

我们提出FlagEvalMM(FlagEvalMM),这是一个开放源码评价框架,旨在全面评估多种视觉语言理解和生成任务,如视觉问答、文字到图像/视频生成和图像-文本检索等多种多式联运模式;我们通过独立评价服务将模型推论与评价脱钩,从而能够灵活分配资源和无缝地整合新任务和新模式;此外,FlagEvalMM(FlagEvalM)利用先进的推论加速工具(如VLLM、SGLang)和不同步的数据负荷,以大大提高评价效率;广泛的实验显示FlagEvalM(FlagEvalM)对模型的长处和局限性提供了准确而有效的洞察力,使其成为推进多式联运研究的宝贵工具;框架可在https://github.com/flageeval-baai/FlageEvalMM(https://githu.com/flageval-baai/FlagEvalMMM)上公开查阅。

Article 278

Title@2025-07-28 (1): Narrative Context Protocol: An Open-Source Storytelling Framework for Generative AI

Title: Narrative Context Protocol: An Open-Source Storytelling Framework for Generative AI

Narrative Context Protocol: Ein Open Source Storytelling Framework für generative KI

叙述性背景议定书:开源的开源描述框架 2503.04844v5

Authors (1): Hank Gerba

Here we introduce Narrative Context Protocol (NCP), an open-source narrative standard designed to enable narrative interoperability, AI-driven authoring tools, real-time emergent narratives, and more. By encoding a story’s structure in a “Storyform,” which is a structured register of its narrative features, NCP enables narrative portability across systems as well as intent-based constraints for generative storytelling systems. We demonstrate the capabilities of NCP through a year-long experiment, during which an author used NCP and a custom authoring platform to create a playable, text-based experience based on her pre-existing novella. This experience is driven by generative AI, with unconstrained natural language input. NCP functions as a set of “guardrails” that allows the generative system to accommodate player agency while also ensuring that narrative context and coherence are maintained.

在此,我们引入了“叙述背景协议 ” ( NCP ) , 这是一种开放源代码叙述性标准, 目的是让叙述性互操作性、 AI驱动的作者工具、实时突发描述等等成为可能。通过将一个故事结构化的“故事形式 ” ( Tory Form) , 这是一种结构化的描述性特征登记册, NCP 使得跨系统的叙述性可移动性以及基因化叙事系统的用意限制得以实现。我们通过一个长达一年的实验展示了NCP的能力, 在此期间, 作者利用了NCP 和一个定制的作者平台, 以她先前存在的小说为基础, 创造了一个可播放的、基于文本的经验。这种经验是由基因化的 AI 驱动的, 并有不加限制的自然语言输入。 NCP 功能是一套“ 保护性装置 ” , 使基因化系统能够容纳玩家机构, 同时确保描述性背景和一致性得到维护。

Article 279

Title@2025-07-28 (1): Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues

Title: Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues

Schulung von LLM-basierten Tutoren zur Verbesserung der Lernergebnisse von Studierenden in Dialogen

培训基于LLM的辅导员,以改善学生在对话中的学习成果 2503.06424v2

Authors (5): Alexander Scarlatos, Naiming Liu, Jaewook Lee, Richard Baraniuk, Andrew Lan

Generative artificial intelligence (AI) has the potential to scale up personalized tutoring through large language models (LLMs). Recent AI tutors are adapted for the tutoring task by training or prompting LLMs to follow effective pedagogical principles, though they are not trained to maximize student learning throughout the course of a dialogue. Therefore, they may engage with students in a suboptimal way. We address this limitation by introducing an approach to train LLMs to generate tutor utterances that maximize the likelihood of student correctness, while still encouraging the model to follow good pedagogical practice. Specifically, we generate a set of candidate tutor utterances and score them using (1) an LLM-based student model to predict the chance of correct student responses and (2) a pedagogical rubric evaluated by GPT-4o. We then use the resulting data to train an open-source LLM, Llama 3.1 8B, using direct preference optimization. We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses while maintaining the pedagogical quality of GPT-4o. We also conduct qualitative analyses and a human evaluation to demonstrate that our model generates high quality tutor utterances.

培养人工智能(AI)有可能通过大型语言模式扩大个人化辅导(LLMs) 。最近的AI教义通过培训或促使LLMs遵循有效的教学原则,适应辅导任务,尽管他们没有经过培训,在整个对话过程中最大限度地提高学生的学习水平,因此,他们可以以次优方式与学生接触。我们采用培训LLMs的方法解决这一限制,以产生尽可能提高学生正确性的可能性的辅导讲义,同时仍然鼓励学习良好教学做法的模式。具体地说,我们制作了一套候选人辅导讲义,并用(1) 以LLM为基础的学生模型来预测学生作出正确反应的机会,(2) 由GPT-4o评估的教学图解。我们随后利用由此产生的数据来培训开源LM,Llama 3.1 8B,使用直接的优惠优化。我们展示了我们模型产生的讲义导,在保持GPT-4o教学质量的同时,使学生作出正确反应的机会大得多。我们还进行了定性分析和人文评价,以证明我们的模型产生了高质量的讲义。
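The abstract states that the tutor model is trained with direct preference optimization. For readers unfamiliar with it, the sketch below evaluates the standard DPO loss for a single preference pair, given summed log-probabilities of the preferred and dispreferred utterance under the trained policy and under a frozen reference model. How the paper turns student-correctness predictions and rubric scores into preference pairs is not described in the abstract, so the sketch starts from an already formed pair; the function name and the value of `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is the policy's
    log-ratio advantage for the chosen response relative to the reference.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable evaluation of log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    # A policy identical to the reference gives a loss of log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
```

The loss falls below log(2) once the policy assigns relatively more probability to the preferred utterance than the reference does, and rises above it in the opposite case.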

Article 280

Title@2025-07-28 (1): LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems

Title: LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems

LeMix: Unified Scheduling für LLM-Training und Schlussfolgerung auf Multi-GPU-Systemen

LeMix:多GPU系统上LLM训练与推理的统一调度 2507.21276v1

Authors (4): Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu

Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in isolated phases, causing substantial inefficiencies (e.g., GPU idleness) and delayed adaptation to new data in distributed settings. Our empirical analysis reveals that these inefficiencies stem from dynamic request arrivals during serving and workload heterogeneity in pipeline-parallel training. To address these challenges, we propose LeMix, a system for co-locating and managing concurrent LLM serving and training workloads. LeMix integrates offline profiling, execution prediction mechanisms, and runtime scheduling to dynamically adapt resource allocation based on workload characteristics and system conditions. By understanding task-specific behaviors and co-execution interference across shared nodes, LeMix improves utilization and serving quality without compromising serving responsiveness. Our evaluation shows that LeMix improves throughput by up to 3.53x, reduces inference loss by up to 0.61x, and delivers up to 2.12x higher response time SLO attainment over traditional separate setups. To our knowledge, this is the first work to uncover and exploit the opportunities of joint LLM inference and training, paving the way for more resource-efficient deployment of LLMs in production environments.

现代使用大型语言模型(LLMS)经常涉及为适应不断演变的数据和用户反馈而对大型语言模型(LLMS)的现代部署的推论和持续再培训,以适应不断演变的数据和用户反馈。常见做法是将这些工作量分解到孤立的服务器上,造成效率极低(例如,GPU闲置)和对分布式环境中新数据的延迟调整。我们的经验分析表明,这些效率低下是由于在服务期间需求激增和工作量在管道平行培训中出现差异造成的。为了应对这些挑战,我们建议LeMix(一个同时分配和管理LLMMX服务和培训工作量的系统)共同定位和管理。LeMix(LeMix)整合了离线剖析、执行预测机制以及运行时间安排,以便根据工作量特点和系统条件动态调整资源分配。通过理解特定任务的行为和共同执行干扰,在共享节点之间提高利用率和工作质量,同时不损害响应能力。我们的评估表明,LMix(LeMix)的吞吐量增加至3.53x,将误损失降低到0.61x,并将预测损失降低2.12x的响应时间超过传统的SLOE实现传统单独设置和再利用我们的知识,从而探索资源环境。

Article 281

Title@2025-07-28 (1): Levels of Analysis for Large Language Models

Title: Levels of Analysis for Large Language Models

Analyseebenen für große Sprachmodelle

大语言模式分析水平 2503.13401v2

Authors (13): Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths

Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on the levels of analysis that David Marr proposed for studying information processing systems. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.

现代人工智能系统,如大型语言模型,越来越强大,但也越来越难以理解。我们认识到这个问题与理解人类思想的历史困难相似,因此认为认知科学开发的方法有助于理解大型语言模型。我们根据David Marr为研究信息处理系统而提出的分析水平,提出了一个应用这些方法的框架。我们通过重新研究与各级有关的既定认知科学技术,并展示其了解大型语言模型的行为和内部组织的潜力,我们的目标是提供一个工具,使这些新类型的思想变得有意义。

Article 282

Title@2025-07-28 (1): CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting

Title: CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting

CompoST: Ein Benchmark für die Analyse der Fähigkeit von LLMs, Fragen in einer QALD-Einstellung kompositorisch zu interpretieren

CompoST:在QALD设置中分析LLM组合式解释问题能力的基准 2507.21257v1

Authors (3): David Maria Schmidt, Raoul Schubert, Philipp Cimiano

Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. To address this question, we propose a benchmark for investigating to what extent the abilities of LLMs to interpret questions are actually compositional. For this, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. Our datasets are created in a very controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows us to evaluate to what degree LLMs are able to interpret complex questions for which they “understand” the atomic parts. We conduct experiments with models of different sizes using various prompting and few-shot optimization techniques as well as fine-tuning. Our results show that performance in terms of macro $F_1$ degrades from $0.45$ through $0.26$ to $0.09$ as the deviation from the samples optimized on increases. Even when all necessary information was provided to the model in the input, the $F_1$ scores do not exceed $0.57$ for the dataset of lowest complexity. We thus conclude that LLMs struggle to systematically and compositionally interpret questions and map them into SPARQL queries.

语言解释是一个构成过程,从各个部分的含义中推断出更复杂的语言结构的含义。大型语言模型具有非凡的语言解释能力,并且已成功地用于通过将问题映射为SPARQL查询来解释问题。一个未决问题是这个解释过程是如何系统化的。关于这个问题,我们在本文件中提出一个基准,用于调查LLM解释问题的能力在多大程度上实际上具有构成性。我们为此根据DBpedia的图表模式,依靠Lemon lexica进行语言化,生成了三套不同难度的数据集。我们的数据集是以一种非常受控制的方式创建的,以测试LLM在已经见过原子构造块的情况下解释结构复杂问题的能力。这使我们能够评估LLM能够在多大程度上解释它们“理解”其原子部分的复杂问题。我们利用各种提示和少样本优化技术以及微调,对不同规模的模型进行实验。我们的结果显示,随着与优化所用样本的偏差增大,宏 $F_1$ 从 $0.45$ 经 $0.26$ 降至 $0.09$。即使在输入中向模型提供了所有必要信息,在复杂度最低的数据集上 $F_1$ 分数也不超过 $0.57$。因此我们得出结论,LLM难以系统地、组合式地解释问题并将其映射为SPARQL查询。
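The reported macro $F_1$ values are easier to read with the metric spelled out. The sketch below assumes the convention common in question answering over linked data: an $F_1$ score is computed per question from the gold and predicted answer sets and then averaged over questions. Whether CompoST scores answer sets in exactly this way is not stated in the abstract, and the function names and toy answer sets are illustrative only.

```python
def answer_set_f1(gold: set, predicted: set) -> float:
    """F1 between the gold and predicted answer sets of one question."""
    if not gold and not predicted:
        return 1.0  # both empty: treated as a perfect match (a convention)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs) -> float:
    """Unweighted mean of per-question F1 over (gold, predicted) pairs."""
    scores = [answer_set_f1(gold, predicted) for gold, predicted in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    toy_pairs = [
        ({"a", "b"}, {"a"}),  # partially correct
        ({"c"}, {"c"}),       # fully correct
        ({"d"}, {"e"}),       # wrong
    ]
    print(macro_f1(toy_pairs))
```

Under this convention every question counts equally, so a model that answers only simple questions correctly is penalized in proportion to the share of complex questions it misses.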

Article 283

Title@2025-07-28 (1): Bangla BERT for Hyperpartisan News Detection: A Semi-Supervised and Explainable AI Approach

Title: Bangla BERT for Hyperpartisan News Detection: A Semi-Supervised and Explainable AI Approach

Bangla BERT für hyperparteiische Nachrichtenerkennung: Ein halbüberwachter und erklärbarer KI-Ansatz

超党派新闻探测孟加拉BERT:半监督和可解释的AI方法 2507.21242v1

Authors (6): Mohammad Mehadi Hasan, Fatema Binte Hassan, Md Al Jubair, Zobayer Ahmed, Sazzatul Yeakin, Md Masum Billah

In the current digital landscape, misinformation circulates rapidly, shaping public perception and causing societal divisions. Hyperpartisan news is difficult to identify in Bangla because few sophisticated natural language processing methods are available for this low-resource language. Without effective detection methods, biased content can spread unchecked, posing serious risks to informed discourse. To address this gap, we fine-tune Bangla BERT, a state-of-the-art transformer-based model, to improve classification accuracy for hyperpartisan news. We evaluate its performance against traditional machine learning models and apply semi-supervised learning to improve predictions further. We also use LIME to provide transparent explanations of the model’s decision-making process, which helps to build trust in its outcomes. In our experiments, Bangla BERT reaches an accuracy of 95.65% and outperforms conventional approaches. These findings demonstrate the usefulness of transformer models even in environments with limited resources, which opens the door to further improvements in this area.

在当前的数字景观中,错误信息传播迅速,影响公众认识并造成社会分裂。在孟加拉语中很难找到超党派新闻,因为对这种低资源语言来说,没有许多复杂的自然语言处理方法。没有有效的检测方法,偏见的内容就会传播,对知情的言论构成严重风险。为了解决这一差距,我们的研究将孟加拉语BERT 进行微调,这是一个基于最新技术的变压器模型,目的是提高超党派新闻的分类准确性。我们对照传统机器学习模型评估其表现,并采用半监督的学习方法来进一步加强预测。我们不仅使用LIME来提供模型决策过程的透明解释,这有助于建立对其结果的信任。根据我们的实验数据,Bangla BERT的精确度高达95.65%,它超越了常规方法。这项研究的结果表明,即使在资源有限的环境中,变压器模型也非常有用,这为这一领域进一步改进打开了大门。

Article 284

Title@2025-07-28 (1): Understanding Public Perception of Crime in Bangladesh: A Transformer-Based Approach with Explainability

Title: Understanding Public Perception of Crime in Bangladesh: A Transformer-Based Approach with Explainability

Öffentliche Wahrnehmung der Kriminalität in Bangladesch verstehen: Ein transformerbasierter Ansatz mit Erklärbarkeit

了解孟加拉国公众对犯罪的认识:基于变革和可解释的方法 2507.21234v1

Authors (6): Fatema Binte Hassan, Md Al Jubair, Mohammad Mehadi Hasan, Tahmid Hossain, S M Mehebubur Rahman Khan Shuvo, Mohammad Shamsul Arefin

In recent years, social media platforms have become prominent spaces for individuals to express their opinions on ongoing events, including criminal incidents. As a result, public sentiment can shift dynamically over time. This study investigates the evolving public perception of crime-related news by classifying user-generated comments into three categories: positive, negative, and neutral. A newly curated dataset comprising 28,528 Bangla-language social media comments was developed for this purpose. We propose a transformer-based model utilizing the XLM-RoBERTa Base architecture, which achieves a classification accuracy of 97%, outperforming existing state-of-the-art methods in Bangla sentiment analysis. To enhance model interpretability, an explainable AI technique is employed to identify the most influential features driving sentiment classification. The results underscore the effectiveness of transformer-based models in processing low-resource languages such as Bengali and demonstrate their potential to extract actionable insights that can support public policy formulation and crime prevention strategies.

近年来,社交媒体平台已成为个人表达对当前事件(包括犯罪事件)意见的显著空间,因此公众情绪会随着时间的推移而发生动态变化。本研究通过将用户生成的评论分为积极、消极和中性这三类,调查公众对犯罪相关新闻不断变化的看法。为此开发了由28 528种孟加拉语社交媒体评论组成的新整理数据集。我们提议利用XLM-ROBERTA Base架构建立基于变压器的模型,该模型的分类精确度达到97%,优于孟加拉语情感分析的现有最新方法。为了提高模型可解释性,采用了可解释的AI技术来识别驱动情绪分类的最有影响力的特征。结果强调了基于变压器的模型在处理孟加拉语等低资源语言方面的有效性,并展示了这些变压器模式在获取可操作的洞察力方面的潜力,从而支持公共政策的制定和预防犯罪战略。

Article 285

Title@2025-07-28 (1): Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation

Title: Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation

Multi-Agent-as-Judge: LLM-Agent-basierte automatisierte Evaluierung mit multidimensionaler menschlicher Bewertung ausrichten

多边代理法官:将LLM-基于代理的自动评价与多层次的人力评价统一起来 2507.21028v1

Authors (9): Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Bingsheng Yao, Dakuo Wang

Nearly all human work is collaborative; thus, the evaluation of real-world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging “LLM-as-a-judge” paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage multiple agents in in-group debates to generate multi-dimensional feedback. Our evaluation experiments in both the educational and medical domains demonstrate that MAJ-EVAL can generate evaluation results that better align with human experts’ ratings compared with conventional automated evaluation metrics and existing LLM-as-a-judge methods.

几乎所有的人类工作都是协作的;因此,对现实世界NLP应用的评估往往需要与人类不同视角相适应的多个层面。由于真正的人力资源往往稀缺且成本高昂,新兴的“LLM-as-a-judge”范式为利用LLM代理商进行令人信服的模拟人体评估提供了很有希望的办法。然而,到目前为止,现有的LLM-as-a-judge-s-instal 方法面临两个限制:对代理商的描述往往被任意设计,而框架又无法普遍用于其他任务。为了应对这些挑战,我们建议MAJ-EVAL,一个多机构-as-judge评价框架,可以自动建立多个具有相关文本文件不同层面的评价人(例如研究文件)、与人进行即时法LM代理商与人进行集体辩论,并与多机构进行集体辩论,以产生多维反馈。我们在教育和医疗领域的评价实验表明,MAJ-EVAL能够产生与人类专家评级相比与常规自动评价指标和现有LM-as-a-jud-a-a-judge方法更为一致的评价结果。

Article 286

Title@2025-07-28 (1): Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

Title: Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

Verbesserung der LLM-Vernunft mit iterativem DPO: Eine umfassende empirische Untersuchung

与具有迭接作用的DPO:全面经验调查加强LLM 2503.12854v3

Authors (11): Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao

Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effectiveness of DPO in facilitating self-improvement for LLMs through iterative preference-based learning. We demonstrate that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance, particularly for strong base models. Furthermore, we design an iterative enhancement framework for both the generator and the reward model (RM), enabling their mutual improvement through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, our model DPO-VP achieves RL-level performance with significantly lower computational overhead. These findings highlight DPO as a scalable and cost-effective alternative to RL, offering a practical solution for enhancing LLM reasoning in resource-constrained situations.

最近大语言模型培训后方法的进展突出表明,强化学习是强化推理的一个关键组成部分,但是,基于学习模式的计算成本巨大,导致人们对其他模式的兴趣日益浓厚,例如直接优惠优化(DPO)等替代模式的兴趣日益浓厚。在本研究中,我们调查了DPO通过反复的基于优惠的学习促进LLMS自我改进的有效性。我们证明,单轮粗略过滤法的DPO大大提高了数学推理性能,特别是强健的基础模型。此外,我们为生成者和奖励模式设计了一个迭代强化框架,通过多轮DPO的在线互动,使得它们能够相互改进。最后,有了简单的可核实的奖励,我们的DPO-VP模型在计算间接费用上实现了RL水平的业绩,大大降低了计算费。这些结论强调DPO是可扩展的、成本效益高的替代RL,为在资源紧张的情况下加强LM推理提供了切实可行的解决办法。
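
As a reading aid for the abstract above, here is a minimal sketch of the standard DPO objective for a single preference pair. It is not taken from the paper: the function name, the argument names, and the default `beta=0.1` are assumptions, and the paper's iterative generator and reward-model loop is not reproduced.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the trained policy
    and under a frozen reference model. beta is a hypothetical default.
    """
    # Implicit reward margins relative to the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy equals the reference, the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.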

Article 287

Title@2025-07-28 (1): Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions

Title: Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions

Bewertung der Versprechen und Fallstricke von LLMs bei Einstellungsentscheidungen

评估LLM在招聘决策中的前景与陷阱 2507.02087v2

Authors (4): Eitan Anzenberg, Arunava Samajpati, Sivasankaran Chandrasekar, Varun Kacholia

The use of large language models (LLMs) in hiring promises to streamline candidate screening, but it also raises serious concerns regarding accuracy and algorithmic bias where sufficient safeguards are not in place. In this work, we benchmark several state-of-the-art foundational LLMs (including models from OpenAI, Anthropic, Google, Meta, and DeepSeek) and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. We evaluate each model’s predictive accuracy (ROC AUC, Precision-Recall AUC, F1-score) and fairness (impact ratio of cut-off analysis across declared gender, race, and intersectional subgroups). Our experiments on a dataset of roughly 10,000 real-world recent candidate-job pairs show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups. Notably, Match Score attains a minimum race-wise impact ratio of 0.957 (near-parity), versus 0.809 or lower for the best LLMs (0.906 vs 0.773, respectively, for the intersectional subgroups). We discuss why pretraining biases may cause LLMs with insufficient safeguards to propagate societal biases in hiring scenarios, whereas a bespoke supervised model can more effectively mitigate these biases. Our findings highlight the importance of domain-specific modeling and bias auditing when deploying AI in high-stakes domains such as hiring, and caution against relying on off-the-shelf LLMs for such tasks without extensive fairness safeguards. Furthermore, we show with empirical evidence that accuracy and fairness in hiring need not be a dichotomy: a well-designed algorithm can achieve both accuracy in hiring and fairness in outcomes.

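在招聘中使用大型语言模型(LLM)有望简化候选人筛选,但在缺乏充分保障措施的情况下,也引发了对准确性和算法偏见的严重担忧。在这项工作中,我们对若干最先进的基础LLM(包括来自OpenAI、Anthropic、Google、Meta和DeepSeek的模型)进行基准测试,并将其与我们专有的领域专用招聘模型(Match Score)在候选人与职位匹配任务上进行比较。我们评估每个模型的预测准确性(ROC AUC、Precision-Recall AUC、F1分数)和公平性(按申报的性别、种族及交叉子群体进行截断分析所得的影响比率)。我们在约10,000个近期真实候选人-职位配对的数据集上的实验表明,Match Score在准确性上优于通用LLM(ROC AUC 0.85对0.77),并在各人口群体间取得了明显更公平的结果。值得注意的是,Match Score的最低种族影响比率达到0.957(接近均等),而最佳LLM为0.809或更低(交叉子群体分别为0.906对0.773)。我们讨论了为何预训练偏见可能导致缺乏充分保障的LLM在招聘场景中传播社会偏见,而定制的监督模型能够更有效地缓解这些偏见。我们的研究结果凸显了在招聘等高风险领域部署AI时进行领域专用建模和偏见审计的重要性,并提醒不要在缺乏充分公平性保障的情况下依赖现成的LLM来完成此类任务。此外,我们用实证证据表明,招聘中的准确性与公平性不必是非此即彼的选择:设计良好的算法可以同时实现招聘的准确性和结果的公平性。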
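
The abstract above reports a minimum impact ratio but does not define it. The sketch below assumes the common definition (each group's selection rate at a score cut-off divided by the highest group's rate); the function name and the input layout are hypothetical.

```python
from collections import defaultdict

def min_impact_ratio(scores, groups, cutoff):
    """Return (minimum impact ratio, per-group selection rates).

    Assumed definition: a group's selection rate at the cut-off divided by
    the selection rate of the most-selected group.
    """
    selected = defaultdict(int)
    total = defaultdict(int)
    for score, group in zip(scores, groups):
        total[group] += 1
        if score >= cutoff:
            selected[group] += 1
    rates = {g: selected[g] / total[g] for g in total}
    best = max(rates.values())
    if best == 0:
        raise ValueError("no candidate passes the cut-off")
    return min(rate / best for rate in rates.values()), rates
```

A ratio near 1.0 means the groups pass the cut-off at nearly equal rates; the 0.957 reported for Match Score would sit close to parity on this scale.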

Article 288

Title@2025-07-28 (1): Memorization in Fine-Tuned Large Language Models

Title: Memorization in Fine-Tuned Large Language Models

Auswendiglernen in fein getönten großen Sprachmodellen

微微调大语言模型的记忆 2507.21009v1

Authors (4): Danil Savine, Muni Sreenivas Pydi, Jamal Atif, Olivier Cappé

This study investigates the mechanisms and factors influencing memorization in fine-tuned large language models (LLMs), with a focus on the medical domain due to its privacy-sensitive nature. We examine how different aspects of the fine-tuning process affect a model’s propensity to memorize training data, using the PHEE dataset of pharmacovigilance events. Our research employs two main approaches: a membership inference attack to detect memorized data, and a generation task with prompted prefixes to assess verbatim reproduction. We analyze the impact of adapting different weight matrices in the transformer architecture, the relationship between perplexity and memorization, and the effect of increasing the rank in low-rank adaptation (LoRA) fine-tuning. Key findings include: (1) Value and Output matrices contribute more significantly to memorization compared to Query and Key matrices; (2) Lower perplexity in the fine-tuned model correlates with increased memorization; (3) Higher LoRA ranks lead to increased memorization, but with diminishing returns at higher ranks. These results provide insights into the trade-offs between model performance and privacy risks in fine-tuned LLMs. Our findings have implications for developing more effective and responsible strategies for adapting large language models while managing data privacy concerns.

本研究考察微调大型语言模型(LLM)中影响记忆化的机制和因素,鉴于医疗领域对隐私的敏感性,重点关注该领域。我们使用药物警戒事件数据集PHEE,研究微调过程的不同方面如何影响模型记忆训练数据的倾向。我们的研究采用两种主要方法:用于检测被记忆数据的成员推断攻击,以及通过提示前缀评估逐字复现的生成任务。我们分析了在Transformer架构中适配不同权重矩阵的影响、困惑度与记忆化之间的关系,以及在低秩适配(LoRA)微调中提高秩的效果。主要发现包括:(1) 与Query和Key矩阵相比,Value和Output矩阵对记忆化的贡献更大;(2) 微调模型的困惑度越低,记忆化程度越高;(3) 更高的LoRA秩会导致更多记忆化,但在较高的秩上收益递减。这些结果揭示了微调LLM中模型性能与隐私风险之间的权衡。我们的发现有助于在管理数据隐私问题的同时,制定更有效、更负责任的大型语言模型适配策略。
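
The abstract above links low perplexity to memorization. The sketch below shows how perplexity follows from per-token log-probabilities, with a simple loss-threshold membership test built on it. This is a generic baseline and not necessarily the attack used in the paper; the threshold value is an assumption the caller must choose.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_as_member(token_logprobs, threshold):
    """Loss-threshold baseline: flag a sample as likely training data
    when the model's perplexity on it falls below the threshold."""
    return perplexity(token_logprobs) < threshold
```

In practice the threshold would be calibrated on samples known to be outside the fine-tuning set.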

Article 289

Title@2025-07-28 (1): LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning

Title: LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning

LoRA-PAR: Ein flexibler Dual-System-LoRA-Partitionsansatz für effizientes LLM-Feintuning

LOLAR-PAR:高效 LLM 微调的灵活双系统滚动分割法 2507.20999v1

Authors (4): Yining Huang, Bin Li, Keke Tang, Meilian Chen

Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit substantially from chain-of-thought (CoT) reasoning, yet pushing their performance typically requires vast data, large model sizes, and full-parameter fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost, most existing approaches primarily address domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. “Thinking, Fast and Slow” characterizes two distinct modes of thought: System 1 (fast, intuitive, often automatic) and System 2 (slower, more deliberative and analytic). Inspired by this, we draw an analogy that different “subregions” of an LLM’s parameters might similarly specialize for tasks that demand quick, intuitive responses versus those requiring multi-step logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer yet more focused parameters for each task. Specifically, we classify task data via multi-model role-playing and voting, and partition parameters based on importance scoring, then adopt a two-stage fine-tuning strategy: first training System 1 tasks with supervised fine-tuning (SFT) to enhance knowledge and intuition, then refining System 2 tasks with reinforcement learning (RL) to reinforce deeper logical deliberation. Extensive experiments show that the two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while matching or surpassing SOTA PEFT baselines.

DeepSeek-R1和OpenAI-O1等大规模生成模型从思维链(CoT)推理中获益匪浅,但进一步提升其性能通常需要海量数据、大模型规模和全参数微调。参数高效微调(PEFT)虽有助于降低成本,但大多数现有方法主要针对领域适配或逐层分配,而不是根据不同的响应需求明确地定制数据和参数。《思考,快与慢》描述了两种不同的思维模式:系统1(快速、直觉、往往是自动的)和系统2(较慢、更具审慎性和分析性)。受此启发,我们做出类比:LLM参数的不同“子区域”可能同样分别擅长需要快速直觉反应的任务和需要多步逻辑推理的任务。因此,我们提出LoRA-PAR,这是一个双系统LoRA框架,按系统1或系统2的需求划分数据和参数,为每类任务使用更少但更集中的参数。具体而言,我们通过多模型角色扮演和投票对任务数据进行分类,基于重要性评分划分参数,然后采用两阶段微调策略:先用监督微调(SFT)训练系统1任务以增强知识和直觉,再用强化学习(RL)精炼系统2任务以强化更深层的逻辑思考。大量实验表明,SFT加RL的两阶段微调策略在降低活跃参数使用量的同时,达到或超过了SOTA PEFT基线。

Article 290

Title@2025-07-28 (1): GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

Title: GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

GUI-G$^2$: Gaussian Reward Modeling für GUI Grounding

GUI-G$$2美元:GUI地基的高斯奖赏模型 2507.15846v3

Authors (12): Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.

图形用户界面(GUI)定位将自然语言指令映射到精确的界面位置,以实现自主交互。当前的强化学习方法使用二元奖励,把元素视为命中或未命中的目标,产生的稀疏信号忽略了空间交互的连续性。人类的点击行为自然形成以目标元素为中心的高斯分布,受此启发,我们提出GUI高斯定位奖励(GUI-G$^2$),这是一个有原则的奖励框架,将GUI元素建模为界面平面上的连续高斯分布。GUI-G$^2$包含两个协同机制:高斯点奖励通过以元素质心为中心的指数衰减分布来建模精确定位,覆盖奖励则通过度量预测高斯分布与目标区域之间的重叠来评估空间对齐。为处理不同的元素尺度,我们开发了一种自适应方差机制,根据元素尺寸校准奖励分布。该框架将GUI定位从稀疏的二元分类转变为稠密的连续优化,其中高斯分布产生丰富的梯度信号,引导模型走向最优交互位置。在ScreenSpot、ScreenSpot-v2和ScreenSpot-Pro基准上的大量实验表明,GUI-G$^2$大幅优于最先进的方法UI-TARS-72B,其中在ScreenSpot-Pro上的提升最为显著,达到24.7%。我们的分析表明,连续建模对界面变化具有更强的鲁棒性,对未见过的布局具有更好的泛化能力,为GUI交互任务中的空间推理确立了新的范式。

Article 291

Title@2025-07-28 (1): Scaling Physical Reasoning with the PHYSICS Dataset

Title: Scaling Physical Reasoning with the PHYSICS Dataset

Skalierung der physikalischen Vernunft mit dem PHYSICS-Datensatz

利用PHYSICS数据集调整物理理由 2506.00022v3

Authors (12): Shenghe Zheng, Qianjia Cheng, Junchi Yao, Mengsong Wu, Haonan He, Ning Ding, Yu Cheng, Shuyue Hu, Lei Bai, Dongzhan Zhou, Ganqu Cui, Peng Ye

Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, has received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to address this gap. Specifically, PHYSICS is curated with exercises from over 100 textbooks through a carefully designed pipeline for quality control. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To utilize the data for improving and evaluating the model’s physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for the evaluation part, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in the physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations on current state-of-the-art open-source and proprietary models highlight the limitations of current models in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics.

大型语言模型(LLMS)在数学和编码竞赛等高级推理任务方面取得了显著的进展,与此同时,物理学尽管在推理上是密集的,对现实世界的理解也至关重要,但得到的学术和工业关注有限,本文件介绍了PHYSICS, 这是一个包含16 568个主题和困难层次的高质量物理学问题的数据集,它包含16 568个主题和困难层次的高质量物理学问题,为这一问题提供了便利。具体地说,PHYSICS通过精心设计的质量控制管道,从100多本教科书的练习中得到整理。它涵盖五个主要物理领域:机械学、电磁学、热力学、光学和现代物理学。它也涉及从高中到研究生物理课程等广泛的困难程度。为了利用数据来改进和评价模型的物理推理能力,我们将数据集分成培训和测试组,并为培训数据提供强大的推理模型产生的推理路径。此外,我们发现现有的评价框架在物理领域的单位、简化和精确度等方面显示出偏向。为了平衡和准确性,我们采用规则+模型的现行物理标准评估框架,我们目前对物理物理模型的正确评估框架进行评估。
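
The abstract above does not spell out the Rule+Model framework. The sketch below is one plausible reading: try a cheap numeric rule first and call a model-based judge only when the rule cannot decide. The function names, the 1% relative tolerance, and the `model_judge` callable are all hypothetical.

```python
import math

def rule_check(predicted, reference, rel_tol=1e-2):
    """Return True/False when both answers parse as plain numbers,
    otherwise None to signal that the rule cannot decide."""
    try:
        p, r = float(predicted), float(reference)
    except (TypeError, ValueError):
        return None
    return math.isclose(p, r, rel_tol=rel_tol)

def evaluate(predicted, reference, model_judge):
    """Rule first; defer to the (hypothetical) model judge only when
    the rule is undecided, e.g. for answers carrying units or symbols."""
    verdict = rule_check(predicted, reference)
    if verdict is None:
        verdict = bool(model_judge(predicted, reference))
    return verdict
```

The ordering keeps evaluation cheap for purely numeric answers and reserves the slower judge for answers with units or symbolic expressions.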

Article 292

Title: Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands

Cog-TiPRO: Iterative Prompt-Verfeinerung mit LLMs zur Erkennung kognitiver Deklination über Longitudinal Voice Assistant-Befehle

COg-TiPRO:与LLMs一起与LLMs进行自动迅速改进,以便通过纵向语音助理指挥部检测认知衰减 2505.17137v2

Authors (5): Kristin Qi, Youxiang Zhu, Caroline Summerour, John A. Batsis, Xiaohui Liang

Early detection of cognitive decline is crucial for enabling interventions that can slow neurodegenerative disease progression. Traditional diagnostic approaches rely on labor-intensive clinical assessments, which are impractical for frequent monitoring. Our pilot study investigates voice assistant systems (VAS) as non-invasive tools for detecting cognitive decline through longitudinal analysis of speech patterns in voice commands. Over an 18-month period, we collected voice commands from 35 older adults, with 15 participants providing daily at-home VAS interactions. To address the challenges of analyzing these short, unstructured and noisy commands, we propose Cog-TiPRO, a framework that combines (1) LLM-driven iterative prompt refinement for linguistic feature extraction, (2) HuBERT-based acoustic feature extraction, and (3) transformer-based temporal modeling. Using iTransformer, our approach achieves 73.80% accuracy and 72.67% F1-score in detecting MCI, outperforming its baseline by 27.13%. Through our LLM approach, we identify linguistic features that uniquely characterize everyday command usage patterns in individuals experiencing cognitive decline.

早期发现认知衰落对于能够减缓神经退化性疾病蔓延的干预至关重要。传统的诊断方法依赖劳动密集型临床评估,而这种评估对频繁监测是不切实际的。我们的试点研究将语音助理系统(VAS)作为非侵入性工具,通过对语音指令中的语音模式进行纵向分析来调查认知衰落。在18个月期间,我们收集了35个老年人的语音指令,15名参与者每天在家中提供VAS互动。为了应对分析这些短、无结构、吵闹命令的挑战,我们建议Cog-TiPRO,这是一个将(1)LLM驱动的语言特征提取迭代快速完善、(2)基于HuBERT的声学特征提取和(3)基于变压器的时间模型相结合的框架。我们使用 Transinform,在检测MCI方面实现了73.80%的精度和72.67%的F1芯,比其基线高出27.13%。我们通过LM方法,确定了在认知衰落的个人日常指令使用模式的独特语言特征。

Article 293

Title@2025-07-28 (1): A Survey of Deep Learning for Geometry Problem Solving

Title: A Survey of Deep Learning for Geometry Problem Solving

Eine Umfrage über Deep Learning zur Lösung von Geometrieproblemen

解决几何问题深层学习调查 2507.11936v4

Authors (3): Jianzhe Ma, Wenxuan Wang, Qin Jin

Geometry problem solving is a key area of mathematical reasoning and plays a role in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference on deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.

解决几何问题是数学推理的一个关键领域,它广泛涉及许多重要领域,例如教育、人工智能数学能力评估和多式联运能力评估。近年来,深层次学习技术的迅速发展,特别是多式联运大型语言模型的兴起,引发了广泛的研究繁荣。本文调查了深层次学习在解决几何问题方面的应用,包括:(一) 全面概述几何问题解决中的相关任务;(二) 彻底审查相关的深层次学习方法;(三) 详细分析评价指标和方法;(四) 批判性地讨论目前的挑战和今后可探讨的方向。我们的目标是为解决几何问题的深层次学习提供全面和实用的参考,以促进该领域的进一步发展。我们不断更新关于GitHub的文件清单:https://github.com/majianz/dl4gps。

Article 294

Title@2025-07-28 (1): Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning

Title: Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning

Mehrbildbeschreibungen für mehrsprachige, leichte Kognitive Impairment-Erkennung durch kontrastives Lernen enthüllen

通过差异学习发现多语种轻视认知缺陷的单形多语种描述 2505.17067v3

Authors (5): Kristin Qi, Jiali Cheng, Youxiang Zhu, Hadi Amiri, Xiaohui Liang

Detecting Mild Cognitive Impairment from picture descriptions is critical yet challenging, especially in multilingual and multiple picture settings. Prior work has primarily focused on English speakers describing a single picture (e.g., the ‘Cookie Theft’). The TAUKDIAL-2024 challenge expands this scope by introducing multilingual speakers and multiple pictures, which presents new challenges in analyzing picture-dependent content. To address these challenges, we propose a framework with three components: (1) enhancing discriminative representation learning via supervised contrastive learning, (2) involving image modality rather than relying solely on speech and text modalities, and (3) applying a Product of Experts (PoE) strategy to mitigate spurious correlations and overfitting. Our framework improves MCI detection performance, achieving a +7.1% increase in Unweighted Average Recall (UAR) (from 68.1% to 75.2%) and a +2.9% increase in F1 score (from 80.6% to 83.5%) compared to the text unimodal baseline. Notably, the contrastive learning component yields greater gains for the text modality compared to speech. These results highlight our framework’s effectiveness in multilingual and multi-picture MCI detection.

从图片描述中检测出微弱的认知障碍,特别是在多语种和多种图片环境中,是关键但具有挑战性的。先前的工作主要侧重于描述单一图片(如“Cookie Theft”)的讲英语者。TAUKDIAL 2024挑战通过引入多语种语言者和多图片扩大了这一范围,这在分析依赖图片的内容方面提出了新的挑战。为了应对这些挑战,我们提议了一个包含三个组成部分的框架:(1) 通过监督对比学习加强有区别的代表性学习,(2) 涉及图像模式,而不仅仅是语言和文本模式,(3) 应用专家产品(PoE)战略来减少虚假的关联和过度匹配。我们的框架提高了MCI的检测性能,实现了未加权平均回调+7.1%(UAR)(从68.1%增加到75.2%),与文本单式基线相比,F1分(从80.6%增加到83.5%)增加了2.9%。值得注意的是,对比学习部分的文本模式比语音和多语种识别效果更大。这些结果突出表明了我们的框架在多语种和多图像中的有效性。
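
Unweighted Average Recall (UAR), the headline metric in the abstract above, is the mean of per-class recall, so a small MCI class counts as much as a large control class. A minimal sketch follows; the labels used in the test are made up for illustration.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall; each class counts equally regardless of size."""
    classes = sorted(set(y_true))
    if not classes:
        raise ValueError("y_true is empty")
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```

On imbalanced data this can differ sharply from accuracy: a classifier that finds half of a rare class and all of a common class can have high accuracy yet a UAR of only 0.75.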

Article 295

Title@2025-07-28 (1): Your AI, Not Your View: The Bias of LLMs in Investment Analysis

Title: Your AI, Not Your View: The Bias of LLMs in Investment Analysis

Ihre KI, nicht Ihre Ansicht: Die Bias von LLMs in der Investitionsanalyse

您的AI, 而不是您的观点: 投资分析中LLM 的偏见 2507.20957v1

Authors (8): Hoyoung Lee, Junhyuk Seo, Suhwan Park, Junhyeong Lee, Wonbin Ahn, Chanyeol Choi, Alejandro Lopez-Lira, Yongjae Lee

In finance, Large Language Models (LLMs) face frequent knowledge conflicts due to discrepancies between pre-trained parametric knowledge and real-time market data. These conflicts become particularly problematic when LLMs are deployed in real-world investment services, where misalignment between a model’s embedded preferences and those of the financial institution can lead to unreliable recommendations. Yet little research has examined what investment views LLMs actually hold. We propose an experimental framework to investigate such conflicts, offering the first quantitative analysis of confirmation bias in LLM-based investment analysis. Using hypothetical scenarios with balanced and imbalanced arguments, we extract models’ latent preferences and measure their persistence. Focusing on sector, size, and momentum, our analysis reveals distinct, model-specific tendencies. In particular, we observe a consistent preference for large-cap stocks and contrarian strategies across most models. These preferences often harden into confirmation bias, with models clinging to initial judgments despite counter-evidence.

在金融方面,大语言模型(LLMs)由于预先培训的参数知识与实时市场数据之间的差异而经常面临知识冲突。当LLMs被部署在现实世界的投资服务中时,这些冲突特别成问题,因为模型的内在偏好与金融机构的偏好不匹配可能导致不可靠的建议。然而,研究很少研究投资观点中LLMs实际上持有什么样的观点。我们提议了一个实验框架来调查这种冲突,在基于LLM的投资分析中首次对确认偏差进行定量分析。我们利用平衡和不平衡的假设假设情景,提取模型的潜在偏好并衡量其持久性。我们的分析以部门、规模和势头为重点,揭示了不同的模式趋势。特别是,我们观察到了对大头股票和反面战略的一贯偏好。这些偏好往往硬化为确认偏差,模型不顾反证证据坚持初步判断。

Article 296

Title@2025-07-28 (1): Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models

Title: Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models

Mind the Gap: Konformative Dekodierung zur Verbesserung der Output-Vielfalt von instruction-tuned großen Sprachmodellen

注意差距:改进教学型大语言模式产出多样性的合规化配方 2507.20956v1

Authors (4): Max Peeperkorn, Tom Kouwenhoven, Dan Brown, Anna Jordanous

Instruction-tuning large language models (LLMs) reduces the diversity of their outputs, which has implications for many tasks, particularly for creative tasks. This paper investigates the “diversity gap” for a writing-prompt narrative generation task. This gap emerges as measured by current diversity metrics for various open-weight and open-source LLMs. The results show significant decreases in diversity due to instruction-tuning. We explore the diversity loss at each fine-tuning stage for the OLMo and OLMo 2 models to further understand how output diversity is affected. The results indicate that DPO has the most substantial impact on diversity. Motivated by these findings, we present a new decoding strategy, conformative decoding, which guides an instruct model using its more diverse base model to reintroduce output diversity. We show that conformative decoding typically increases diversity and even maintains or improves quality.

教学调整大型语言模型(LLMS)减少了其产出的多样性,这影响到许多任务,特别是创造性任务。本文件调查“多样性差距”对于写作快速叙述生成任务的影响。这一差距以当前各种开放重量和开放源码LMS的多样性指标来衡量。结果显示由于教学调整,多样性显著下降。我们探索了OLMO和OLMO 2模型每个微调阶段的多样性损失,以进一步了解产出多样性是如何受到影响的。结果显示DPO对多样性的影响最大。根据这些发现,我们提出了一个新的解码战略,符合要求的解码,用以指导使用更多样化的基础模型的教学模式重新引入产出多样性。我们显示,兼容的解码通常会增加多样性,甚至保持或改进质量。

Article 297

Title@2025-07-28 (1): Dissecting Persona-Driven Reasoning in Language Models via Activation Patching

Title: Dissecting Persona-Driven Reasoning in Language Models via Activation Patching

Persona-Driven Reasoning in Sprachmodellen per Aktivierungs-Patching auflösen

通过激活补丁在语言模型中通过激活补丁解剖人-人-驱动原因 2507.20936v1

Authors (2): Ansh Poonia, Maeghal Jain

Large language models (LLMs) exhibit remarkable versatility in adopting diverse personas. In this study, we examine how assigning a persona influences a model’s reasoning on an objective task. Using activation patching, we take a first step toward understanding how key components of the model encode persona-specific information. Our findings reveal that the early Multi-Layer Perceptron (MLP) layers not only attend to the syntactic structure of the input but also process its semantic content. These layers transform persona tokens into richer representations, which are then used by the middle Multi-Head Attention (MHA) layers to shape the model’s output. Additionally, we identify specific attention heads that disproportionately attend to racial and color-based identities.

大语言模型(LLMS)在采用多种人方面表现出非凡的多功能性。在这次研究中, 我们研究一个人的指派如何影响模型对客观任务的推理。使用激活补丁, 我们迈出第一步, 了解模型的关键组成部分如何将特定个人的信息编码。我们的研究结果显示, 早期的多语言跨视谱层不仅关注输入的合成结构, 也处理其语义内容。这些层将个人符号转换为更丰富的表达方式, 然后由中等多极关注层( MHA) 用来塑造模型的输出。此外, 我们确定那些不相称地关注种族和肤色身份的焦点。

Article 298

Title@2025-07-28 (1): LLM2TEA: An Agentic AI Designer for Discovery with Generative Evolutionary Multitasking

Title: LLM2TEA: An Agentic AI Designer for Discovery with Generative Evolutionary Multitasking

LLM2TEA: Agentischer AI-Designer für Entdeckung mit generativem evolutionären Multitasking

LLM2TEA: 利用产生进化多任务探索的代理AI 设计器 2406.14917v3

Authors (5): Melvin Wong, Jiao Liu, Thiago Rios, Stefan Menzel, Yew Soon Ong

This paper presents LLM2TEA, a Large Language Model (LLM) driven MultiTask Evolutionary Algorithm, representing the first agentic AI designer of its kind operating with generative evolutionary multitasking (GEM). LLM2TEA enables the crossbreeding of solutions from multiple domains, fostering novel solutions that transcend disciplinary boundaries. Of particular interest is the ability to discover designs that are both novel and conforming to real-world physical specifications. LLM2TEA comprises an LLM to generate genotype samples from text prompts describing target objects, a text-to-3D generative model to produce corresponding phenotypes, a classifier to interpret its semantic representations, and a computational simulator to assess its physical properties. Novel LLM-based multitask evolutionary operators are introduced to guide the search towards high-performing, practically viable designs. Experimental results in conceptual design optimization validate the effectiveness of LLM2TEA, showing 97% to 174% improvements in the diversity of novel designs over the current text-to-3D baseline. Moreover, over 73% of the generated designs outperform the top 1% of designs produced by the text-to-3D baseline in terms of physical performance. The designs produced by LLM2TEA are not only aesthetically creative but also functional in real-world contexts. Several of these designs have been successfully 3D printed, demonstrating the ability of our approach to transform AI-generated outputs into tangible, physical designs. These designs underscore the potential of LLM2TEA as a powerful tool for complex design optimization and discovery, capable of producing novel and physically viable designs.

本文展示了LLM2TEA(LLM2TEA),这是一个大型语言模型(LLM)驱动的多语种进化演化算法,代表了首个使用基因进化多任务(GEM)的AI代理设计师。LLM2TEA(LLM2TEA)能够交叉利用多个域的解决方案,促进超越学科界限的新解决方案。特别令人感兴趣的是能够发现既新颖又符合现实世界物理规格的设计。LLM2TEA(LM)包含一个LM(LM),用来从描述目标对象的文本提示中生成基因型样本的基因样本,一个文本到3D(D)的基因化模型,用来生成相应的字符型类型,一个用于解释其语义表达的物理结构。LM(LM)多功能演化操作员的计算模拟器,用来指导寻找高性能、实际可行的设计。LM2TEA(LM)的实验性能优化验证了LM2TEA(LMTEA)的效能,显示在目前文本至3D基线的新型设计上97%至174。此外智能设计中的73%(D)的精精精化能力,通过印刷设计,也显示了SD(D)的精制成了SD(D)的精制成)的精制的精制的精制成的精制的精制成的精制版图)。

Article 299

Title@2025-07-28 (1): FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models

Title: FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models

FHSTP@EXIST 2025 Benchmark: Sexismuserkennung mit transparenten Sprachkonzepten Engpassmodelle

FHSTP@EXIST 2025 基准:用透明言论概念瓶颈模型探测性别主义 2507.20924v1

Authors (6): Roberto Labadie-Tamayo, Adrian Jaques Böck, Djordje Slijepčević, Xihui Chen, Andreas Babic, Matthias Zeppelzauer

Sexism has become widespread on social media and in online conversation. To help address this issue, the fifth Sexism Identification in Social Networks (EXIST) challenge is initiated at CLEF 2025. Among this year’s international benchmarks, we concentrate on solving the first task, which aims to identify and classify sexism in social media textual posts. In this paper, we describe our solutions and report results for three subtasks: Subtask 1.1 - Sexism Identification in Tweets, Subtask 1.2 - Source Intention in Tweets, and Subtask 1.3 - Sexism Categorization in Tweets. We implement three models to address each subtask, which constitute three individual runs: Speech Concept Bottleneck Model (SCBM), Speech Concept Bottleneck Model with Transformer (SCBMT), and a fine-tuned XLM-RoBERTa transformer model. SCBM uses descriptive adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to encode input texts into a human-interpretable representation of adjectives, which is then used to train a lightweight classifier for downstream tasks. SCBMT extends SCBM by fusing the adjective-based representation with contextual embeddings from transformers to balance interpretability and classification performance. Beyond competitive results, these two models offer fine-grained explanations at both instance (local) and class (global) levels. We also investigate how additional metadata, e.g., annotators’ demographic profiles, can be leveraged. For Subtask 1.1, XLM-RoBERTa, fine-tuned on provided data augmented with prior datasets, ranks 6th for English and Spanish and 4th for English in the Soft-Soft evaluation. Our SCBMT achieves 7th for English and Spanish and 6th for Spanish.

Article 300

Title@2025-07-28 (1): MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

Title: MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

MediQAl: Eine französische medizinische Frage zur Beantwortung von Datensätzen für Wissens- und Begründungsbewertung

MediQAl:用于知识和合理评估的法国医学问题解答数据集 2507.20917v1

Authors (1): Adrien Bazoge

This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models’ cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models’ performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.

Article 301

Title@2025-07-28 (1): Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Title: Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Benchmarking Open-Ended Audio Dialogue Understanding für große Audio-Language-Modelle

确定大型音频语言模型不限成员名额音频对话理解基准 2412.05167v2

Authors (5): Kuofeng Gao, Shu-Tao Xia, Ke Xu, Philip Torr, Jindong Gu

Large Audio-Language Models (LALMs), such as GPT-4o, have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. The potential of LALMs broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, despite these advancements, a comprehensive benchmark to evaluate the performance of LALMs in open-ended audio dialogue understanding is still absent. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability of LALMs in 3 general scenarios, 12 skills, 9 languages, and 4 categories of ambiguity handling. Notably, we are the first to propose the evaluation of ambiguity handling in audio dialogues that express different intentions beyond the same literal meaning of sentences, e.g., “Really!?” with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments on 16 LALMs, our analysis reveals that existing LALMs struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities from different phonetic elements, such as intonations, pause positions, and homophones. The benchmark is available at https://adu-bench.github.io/.

Article 302

Title@2025-07-28 (1): Should Top-Down Clustering Affect Boundaries in Unsupervised Word Discovery?

Title: Should Top-Down Clustering Affect Boundaries in Unsupervised Word Discovery?

Sollte Top-Down-Clustering Grenzen in unüberwachten Word Discovery beeinflussen?

在无人监督的“发现字”中, 上下层群集是否应该影响边界? 2507.19204v2

Authors (3): Simon Malan, Benjamin van Niekerk, Herman Kamper

We investigate the problem of segmenting unlabeled speech into word-like units and clustering these to create a lexicon. Prior work can be categorized into two frameworks. Bottom-up methods first determine boundaries and then cluster the fixed segmented words into a lexicon. In contrast, top-down methods incorporate information from the clustered words to inform boundary selection. However, it is unclear whether top-down information is necessary to improve segmentation. To explore this, we look at two similar approaches that differ in whether top-down clustering informs boundary selection. Our simple bottom-up strategy predicts word boundaries using the dissimilarity between adjacent self-supervised features, then clusters the resulting segments to construct a lexicon. Our top-down system is an updated version of the ES-KMeans dynamic programming method that iteratively uses K-means to update its boundaries. On the five-language ZeroSpeech benchmarks, both approaches achieve comparable state-of-the-art results, with the bottom-up system being nearly five times faster. Through detailed analyses, we show that the top-down influence of ES-KMeans can be beneficial (depending on factors like the candidate boundaries), but in many cases the simple bottom-up method performs just as well. For both methods, we show that the clustering step is a limiting factor. Therefore, we recommend that future work focus on improved clustering techniques and learning more discriminative word-like representations. Project code repository: https://github.com/s-malan/prom-seg-clus.
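
The bottom-up strategy described above can be illustrated with a minimal sketch, assuming cosine dissimilarity between adjacent feature frames and simple peak picking with a fixed threshold. The function names, the threshold value, and the toy frames are hypothetical; the authors' actual implementation is in the linked repository and may differ.

```python
import math


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def predict_boundaries(frames, threshold=0.5):
    """Place a boundary before frame i+1 when the dissimilarity between
    frames i and i+1 is a local peak that reaches the threshold.
    The threshold is an illustrative assumption, not a value from the paper."""
    d = [cosine_dissimilarity(frames[i], frames[i + 1])
         for i in range(len(frames) - 1)]
    boundaries = []
    for i, value in enumerate(d):
        left = d[i - 1] if i > 0 else float("-inf")
        right = d[i + 1] if i + 1 < len(d) else float("-inf")
        if value >= threshold and value >= left and value > right:
            boundaries.append(i + 1)  # index of the first frame of the new segment
    return boundaries


# Toy features: three frames of one "word", then three frames of another.
frames = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05],
          [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]]
boundaries = predict_boundaries(frames)
```

In this toy input the only large jump in dissimilarity is between the third and fourth frames, so a single boundary is placed there; the resulting segments would then be passed to a clustering step to build the lexicon.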

Article 303

Title: $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement

$A^2R^2$: Verbesserung der Img2LaTeX-Umwandlung durch visuelles Reasoning mit aufmerksamkeitsgeführter Verfeinerung

$A^2R^2$:通过关注引导的精炼,通过视觉理性推进Img2LaTeX转换 2507.20890v1

Authors (6): Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai

Img2LaTeX is a practically significant task that involves converting mathematical expressions or tabular data from images into LaTeX code. In recent years, vision-language models (VLMs) have demonstrated strong performance across a variety of visual understanding tasks, owing to their generalization capabilities. While some studies have explored the use of VLMs for the Img2LaTeX task, their performance often falls short of expectations. Empirically, VLMs sometimes struggle with fine-grained visual elements, leading to inaccurate LaTeX predictions. To address this challenge, we propose $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement, a framework that effectively integrates attention localization and iterative refinement within a visual reasoning framework, enabling VLMs to perform self-correction and progressively improve prediction quality. For effective evaluation, we introduce a new dataset, Img2LaTex-Hard-1K, consisting of 1,100 carefully curated and challenging examples designed to rigorously evaluate the capabilities of VLMs within this task domain. Extensive experimental results demonstrate that: (1) $A^2R^2$ significantly improves model performance across six evaluation metrics spanning both textual and visual levels, consistently outperforming other baseline methods; (2) Increasing the number of inference rounds yields notable performance gains, underscoring the potential of $A^2R^2$ in test-time scaling scenarios; (3) Ablation studies and human evaluations validate the practical effectiveness of our approach, as well as the strong synergy among its core components during inference.

Article 304

Title@2025-07-28 (1): Enhancing Project-Specific Code Completion by Inferring Internal API Information

Title: Enhancing Project-Specific Code Completion by Inferring Internal API Information

Verbesserung der projektspezifischen Code-Vervollständigung durch Schlussfolgerung interner API-Informationen

通过推断内部API信息加强具体项目法规的完成 2507.20888v1

Authors (6): Le Deng, Xiaoxue Ren, Chao Ni, Ming Liang, David Lo, Zhongxin Liu

Project-specific code completion is a critical task that leverages context from a project to generate accurate code. State-of-the-art methods use retrieval-augmented generation (RAG) with large language models (LLMs) and project information for code completion. However, they often struggle to incorporate internal API information, which is crucial for accuracy, especially when APIs are not explicitly imported in the file. To address this, we propose a method to infer internal API information without relying on imports. Our method extends the representation of APIs by constructing usage examples and semantic descriptions, building a knowledge base for LLMs to generate relevant completions. We also introduce ProjBench, a benchmark that avoids leaked imports and consists of large-scale real-world projects. Experiments on ProjBench and CrossCodeEval show that our approach significantly outperforms existing methods, improving code exact match by 22.72% and identifier exact match by 18.31%. Additionally, integrating our method with existing baselines boosts code match by 47.80% and identifier match by 35.55%.

Article 305

Title@2025-07-28 (1): Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings

Title: Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings

Nutzung von Open-Source-Großsprachenmodellen für die Extraktion klinischer Informationen in ressourcenbeschränkten Einstellungen

利用开放源码大语言模型,在受资源限制的环境下进行临床信息采掘 2507.20859v1

Authors (5): Luc Builtjes, Joeran Bosma, Mathias Prokop, Bram van Ginneken, Alessa Hering

Medical reports contain rich clinical information but are often unstructured and written in domain-specific language, posing challenges for information extraction. While proprietary large language models (LLMs) have shown promise in clinical natural language processing, their lack of transparency and data privacy concerns limit their utility in healthcare. This study therefore evaluates nine open-source generative LLMs on the DRAGON benchmark, which includes 28 clinical information extraction tasks in Dutch. We developed llm_extractinator, a publicly available framework for information extraction using open-source generative LLMs, and used it to assess model performance in a zero-shot setting. Several 14-billion-parameter models, Phi-4-14B, Qwen-2.5-14B, and DeepSeek-R1-14B, achieved competitive results, while the larger Llama-3.3-70B model achieved slightly higher performance at greater computational cost. Translation to English prior to inference consistently degraded performance, highlighting the need for native-language processing. These findings demonstrate that open-source LLMs, when used with our framework, offer effective, scalable, and privacy-conscious solutions for clinical information extraction in low-resource settings.

Article 306

Title@2025-07-28 (1): A survey of diversity quantification in natural language processing: The why, what, where and how

Title: A survey of diversity quantification in natural language processing: The why, what, where and how

Eine Übersicht der Diversitätsquantifizierung in der natürlichen Sprachverarbeitung: Das Warum, Was, Wo und Wie

自然语言处理中多样性量化调查:原因、内容、地点和方式 2507.20858v1

Authors (5): Louis Estève, Marie-Catherine de Marneffe, Nurit Melnik, Agata Savary, Olha Kanishcheva

The concept of diversity has received increased consideration in Natural Language Processing (NLP) in recent years. This is due to various motivations like promoting inclusion, approximating human linguistic behavior, and increasing systems’ performance. Diversity has, however, often been addressed in an ad hoc manner in NLP, with few explicit links to other domains where this notion is better theorized. We survey articles in the ACL Anthology from the past 6 years with “diversity” or “diverse” in their title. We find a wide range of settings in which diversity is quantified, often highly specialized and using inconsistent terminology. We put forward a unified taxonomy of why, what on, where, and how diversity is measured in NLP. Diversity measures are cast into a unified framework from ecology and economy (Stirling, 2007) with 3 dimensions of diversity: variety, balance and disparity. We discuss the trends that emerge from this systematized approach. We believe that this study paves the way towards a better formalization of diversity in NLP, which should bring a better understanding of this notion and better comparability between various approaches.
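
To make the three dimensions concrete, here is a minimal sketch of one possible way to compute them over a set of categories. The specific choices are assumptions made for illustration and are not taken from the survey: variety as the number of non-empty categories, balance as normalized Shannon entropy, and disparity as the mean pairwise distance under a caller-supplied distance function.

```python
import math
from itertools import combinations


def variety(counts):
    """Number of categories that actually occur."""
    return sum(1 for c in counts.values() if c > 0)


def balance(counts):
    """Shannon entropy of the category distribution divided by its maximum,
    so 1.0 means perfectly even and values near 0.0 mean one category dominates."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))


def disparity(counts, distance):
    """Mean pairwise distance between the categories that occur."""
    present = sorted(k for k, c in counts.items() if c > 0)
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Toy example: part-of-speech counts and a trivial 0/1 distance (both made up).
counts = {"noun": 50, "verb": 50, "adj": 0}
toy_distance = lambda a, b: 0.0 if a == b else 1.0
```

With these toy counts, two categories occur and they are evenly split, so variety is 2 and balance is at its maximum. A real study would replace the 0/1 distance with a linguistically motivated one.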

Article 307

Title@2025-07-28 (1): Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities

Title: Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities

Sprachenmodellierung für die Zukunft der Finanzen: Eine Umfrage zu Metrics, Aufgaben und Datenmöglichkeiten

未来融资语言建模:计量、任务和数据机会调查 2504.07274v2

Authors (4): Nikita Tatarinov, Siddhant Sukhani, Agam Shah, Sudheer Chava

Recent advances in language modeling have led to growing interest in applying Natural Language Processing (NLP) techniques to financial problems, enabling new approaches to analysis and decision-making. To systematically examine this trend, we review 374 NLP research papers published between 2017 and 2024 across 38 conferences and workshops, with a focused analysis of 221 papers that directly address finance-related tasks. We evaluate these papers across 11 quantitative and qualitative dimensions, and our study identifies the following opportunities: (i) expanding the scope of forecasting tasks; (ii) enriching evaluation with financial metrics; (iii) leveraging multilingual and crisis-period datasets; and (iv) balancing PLMs with efficient or interpretable alternatives. We identify actionable directions for research and practice, supported by dataset and tool recommendations, with implications for both the academic and industry communities.

Article 308

Title@2025-07-28 (1): Latent Inter-User Difference Modeling for LLM Personalization

Title: Latent Inter-User Difference Modeling for LLM Personalization

Latent Inter-User Difference Modeling für LLM Personalisierung

LLM个性化不同模型 2507.20849v1

Authors (6): Yilun Qiu, Tianhao Shi, Xiaoyan Zhao, Fengbin Zhu, Yang Zhang, Fuli Feng

Large language models (LLMs) are increasingly integrated into users’ daily lives, leading to a growing demand for personalized outputs. Previous work focuses on leveraging a user’s own history, overlooking inter-user differences that are crucial for effective personalization. While recent work has attempted to model such differences, the reliance on language-based prompts often hampers the effective extraction of meaningful distinctions. To address these issues, we propose Difference-aware Embedding-based Personalization (DEP), a framework that models inter-user differences in the latent space instead of relying on language prompts. DEP constructs soft prompts by contrasting a user’s embedding with those of peers who engaged with similar content, highlighting relative behavioral signals. A sparse autoencoder then filters and compresses both user-specific and difference-aware embeddings, preserving only task-relevant features before injecting them into a frozen LLM. Experiments on personalized review generation show that DEP consistently outperforms baseline methods across multiple metrics. Our code is available at https://github.com/SnowCharmQ/DEP.

Article 309

Title@2025-07-28 (1): Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models

Title: Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models

Kritik des unreinen Grundes: Enthüllen des Argumentationsverhaltens medizinischer Großsprachenmodelle

简便理由的批评:统一医学大语言模式的推理行为 2412.15748v2

Authors (2): Shamus Sim, Tyrone Chen

Background: Despite the current ubiquity of Large Language Models (LLMs) across the medical domain, there is a surprising lack of studies which address their reasoning behaviour. We emphasise the importance of understanding reasoning behaviour as opposed to high-level prediction accuracies, since it is equivalent to explainable AI (XAI) in this context. In particular, achieving XAI in medical LLMs used in the clinical domain will have a significant impact across the healthcare sector. Results: Therefore, in this work, we adapt the existing concept of reasoning behaviour and articulate its interpretation within the specific context of medical LLMs. We survey and categorise current state-of-the-art approaches for modeling and evaluating reasoning in medical LLMs. Additionally, we propose theoretical frameworks which can empower medical professionals or machine learning engineers to gain insight into the low-level reasoning operations of these previously obscure models. We also outline key open challenges facing the development of Large Reasoning Models. Conclusion: The subsequent increased transparency and trust in medical machine learning models by clinicians as well as patients will accelerate the integration, application as well as further development of medical AI for the healthcare system as a whole.

Article 310

Title@2025-07-28 (1): FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

Title: FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

FocalPO: Verbesserung der Preference-Optimierung durch Fokussierung auf korrekte Preference-Rankings

重点:通过注重正确的优先排序,加强优惠优化 2501.06645v3

Authors (5): Tong Liu, Xiao Yu, Wenxuan Zhou, Jindong Gu, Volker Tresp

Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach in aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward model, and focus on training it to correct misranked preference pairs. However, recent work (Chen et al., 2024) empirically finds that DPO training rarely improves these misranked preference pairs, despite its gradient emphasizing these cases. We introduce FocalPO, a DPO variant that instead down-weighs misranked preference pairs and prioritizes enhancing the model’s understanding of pairs that it can already rank correctly. Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale DPO loss. Our experiment demonstrates that FocalPO surpasses DPO and its variants on popular benchmarks like Alpaca Eval 2.0 using Mistral-Base-7B and Llama-3-Instruct-8B, with the introduced hyperparameter fixed. Additionally, we empirically reveal how FocalPO affects training on correct and incorrect sample groups, further underscoring its effectiveness.
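
The idea of scaling the DPO loss with a modulating factor can be sketched as follows. The abstract does not state the exact form of the factor, so this sketch assumes one simple choice: multiplying the per-pair DPO loss by p^gamma, where p is the model's implied probability of ranking the pair correctly. The names `margin` and `gamma` and the formula itself are illustrative assumptions, not the paper's definition.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(margin):
    """Standard per-pair DPO loss, -log(sigmoid(margin)), where margin is the
    scaled difference of implicit rewards (chosen minus rejected)."""
    return -math.log(sigmoid(margin))


def focal_weighted_loss(margin, gamma=1.0):
    """DPO loss scaled by an assumed modulating factor p**gamma.
    Misranked pairs (margin < 0, small p) receive a smaller weight than
    correctly ranked pairs; gamma = 0 recovers plain DPO."""
    p = sigmoid(margin)
    return (p ** gamma) * dpo_loss(margin)


# Relative weight applied to a misranked pair versus a correctly ranked pair.
weight_misranked = focal_weighted_loss(-2.0) / dpo_loss(-2.0)
weight_correct = focal_weighted_loss(2.0) / dpo_loss(2.0)
```

Under this assumed factor the weight equals p itself when gamma is 1, so a pair the model ranks wrongly contributes less to the objective than it would under plain DPO.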

Article 311

Title@2025-07-28 (1): Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models

Title: Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models

Automatisieren der thematischen Überprüfung der Prävention von zukünftigen Todesfällen Berichte: Nachahmung der ONS-Kinder-Selbstmord-Studie mit großen Sprachmodellen

对预防今后死亡报告进行自动化专题审查:利用大语言模式复制ONS儿童自杀研究 2507.20786v1

Authors (5): Sam Osian, Arpan Dutta, Sahil Bhandari, Iain E. Buchan, Dan W. Joyce

Prevention of Future Deaths (PFD) reports, issued by coroners in England and Wales, flag systemic hazards that may lead to further loss of life. Analysis of these reports has previously been constrained by the manual effort required to identify and code relevant cases. In 2025, the Office for National Statistics (ONS) published a national thematic review of child-suicide PFD reports (≤ 18 years), identifying 37 cases from January 2015 to November 2023 - a process based entirely on manual curation and coding. We evaluated whether a fully automated, open source “text-to-table” language-model pipeline (PFD Toolkit) could reproduce the ONS’s identification and thematic analysis of child-suicide PFD reports, and assessed gains in efficiency and reliability. All 4,249 PFD reports published from July 2013 to November 2023 were processed via PFD Toolkit’s large language model pipelines. Automated screening identified cases where the coroner attributed death to suicide in individuals aged 18 or younger, and eligible reports were coded for recipient category and 23 concern sub-themes, replicating the ONS coding frame. PFD Toolkit identified 72 child-suicide PFD reports - almost twice the ONS count. Three blinded clinicians adjudicated a stratified sample of 144 reports to validate the child-suicide screening. Against the post-consensus clinical annotations, the LLM-based workflow showed substantial to almost-perfect agreement (Cohen’s κ = 0.82, 95% CI: 0.66-0.98, raw agreement = 91%). The end-to-end script runtime was 8m 16s, transforming a process that previously took months into one that can be completed in minutes. This demonstrates that automated LLM analysis can reliably and efficiently replicate manual thematic reviews of coronial data, enabling scalable, reproducible, and timely insights for public health and safety. The PFD Toolkit is openly available for future research.
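
For readers unfamiliar with the agreement statistic reported above, the following sketch shows how Cohen's kappa relates raw agreement to the agreement expected by chance. The label lists are made-up toy data, not the study's annotations, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy data: 1 = flagged as in scope, 0 = not in scope.
clinician = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
pipeline = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(clinician, pipeline)
```

In this toy example the raters agree on 8 of 10 items while chance agreement is 0.5, so kappa is lower than the raw agreement. This is why the abstract reports both figures.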

Article 312

Title@2025-07-28 (1): On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

Title: On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

Über die Rolle von vorgebildeten Sprachmodellen in allgemeinen Text-Embeddings: Eine Umfrage

关于 “ 预先培训的语言模式在一般用途文本嵌入中所起的作用:调查 “ 2507.20783v1

Authors (6): Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, Min Zhang

Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, such as retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. Then, we describe advanced roles enabled by PLMs, such as multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
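
The contrastive optimization step mentioned above can be illustrated with a minimal sketch of an InfoNCE-style loss with in-batch negatives, a common formulation for training on text pairs. The survey abstract does not commit to one loss, so the formulation, the temperature value, and the toy vectors are assumptions.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(queries, positives, temperature=0.05):
    """Mean cross-entropy of picking positives[i] for queries[i] among all
    positives in the batch, using cosine similarity scaled by a temperature."""
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Toy embeddings: each query is close to its own positive.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce_loss(queries, positives)
shuffled_loss = info_nce_loss(queries, positives[::-1])
```

The loss is small when each text is most similar to its own pair and large when the pairing is shuffled, which is the signal that pulls paired texts together in the embedding space.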

Article 313

Title@2025-07-28 (1): TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks

Title: TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks

TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks

TN-AutoRCA:电信网络中自我改进基于警报的原始原因分析的基准建设和示范框架 2507.18190v2

Authors (7): Keyu Wu, Qianjin Yu, Manlin Mei, Ruiting Liu, Jun Wang, Kailai Zhang, Yelun Bao

Root Cause Analysis (RCA) in telecommunication networks is a critical task, yet it presents a formidable challenge for Artificial Intelligence (AI) due to its complex, graph-based reasoning requirements and the scarcity of realistic benchmarks.

Article 314

Title@2025-07-28 (1): The Impact of LoRA Adapters on LLMs for Clinical Text Classification Under Computational and Data Constraints

Title: The Impact of LoRA Adapters on LLMs for Clinical Text Classification Under Computational and Data Constraints

Die Auswirkungen von LoRA-Adaptern auf LLMs für die klinische Textklassifikation unter Computational und Data Constraints

LoRA适应器对在计算和数据限制下临床文本分类的LLMs的影响 2407.19299v3

Authors (6): Thanh-Dung Le, Ti Ti Nguyen, Vu Nguyen Ha, Symeon Chatzinotas, Philippe Jouvet, Rita Noumeir

Fine-tuning Large Language Models (LLMs) for clinical Natural Language Processing (NLP) poses significant challenges due to domain gap, limited data, and stringent hardware constraints. In this study, we evaluate four adapter techniques equivalent to Low-Rank Adaptation (LoRA), namely Adapter, Lightweight, TinyAttention, and Gated Residual Network (GRN), for clinical note classification under real-world, resource-constrained conditions. All experiments were conducted on a single NVIDIA Quadro P620 GPU (2 GB VRAM, 512 CUDA cores, 1.386 TFLOPS FP32), limiting batch sizes to <8 sequences and maximum sequence length to 256 tokens. Our clinical corpus comprises only 580,000 tokens, several orders of magnitude smaller than standard LLM pre-training datasets. We fine-tuned three biomedical pre-trained LLMs (CamemBERT-bio, AliBERT, DrBERT) and two lightweight Transformer models trained from scratch. Results show that 1) adapter structures provide no consistent gains when fine-tuning biomedical LLMs under these constraints, and 2) simpler Transformers, with minimal parameter counts and training times under six hours, outperform adapter-augmented LLMs, which required over 1000 GPU-hours. Among adapters, GRN achieved the best metrics (accuracy, precision, recall, F1 = 0.88). These findings demonstrate that, in low-resource clinical settings with limited data and compute, lightweight Transformers trained from scratch offer a more practical and efficient solution than large LLMs, while GRN remains a viable adapter choice when minimal adaptation is needed.
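
As background for the Low-Rank Adaptation (LoRA) idea the adapters are compared with, here is a minimal sketch of a LoRA-style linear layer: the frozen weight W is left untouched and a trainable low-rank product B·A is added to its output. This is a generic illustration with hypothetical names and toy matrices, not the adapter code evaluated in the study.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x).
    W: d_out x d_in frozen weight; A: r x d_in and B: d_out x r trainable
    factors, with rank r much smaller than d_in and d_out in practice."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pre-trained weight (toy values)
A = [[1.0, 1.0]]               # rank r = 1
B_zero = [[0.0], [0.0]]        # B starts at zero, so the layer is unchanged
B_trained = [[1.0], [2.0]]     # hypothetical values after some training
x = [1.0, 1.0]

initial_output = lora_forward(W, A, B_zero, x)
adapted_output = lora_forward(W, A, B_trained, x)
```

Only A and B would be updated during fine-tuning, which keeps the number of trainable parameters small compared with updating W directly.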

Article 315

Title@2025-07-28 (1): Multilingual Self-Taught Faithfulness Evaluators

Title: Multilingual Self-Taught Faithfulness Evaluators

Mehrsprachige Selbstlernende Bewertung von Treue

多语言自学自学信仰评价员 2507.20752v1

Authors (6): Carlo Alfano, Aymen Al Marjani, Zeno Jonke, Amin Mantrach, Saab Mansour, Marcello Federico

The growing use of large language models (LLMs) has increased the need for automatic evaluation systems, particularly to address the challenge of information hallucination. Although existing faithfulness evaluation approaches have shown promise, they are predominantly English-focused and often require expensive human-labeled training data for fine-tuning specialized models. As LLMs see increased adoption in multilingual contexts, there is a need for accurate faithfulness evaluators that can operate across languages without extensive labeled data. This paper presents Self-Taught Evaluators for Multilingual Faithfulness, a framework that learns exclusively from synthetic multilingual summarization data while leveraging cross-lingual transfer learning. Through experiments comparing language-specific and mixed-language fine-tuning approaches, we demonstrate a consistent relationship between an LLM’s general language capabilities and its performance in language-specific evaluation tasks. Our framework shows improvements over existing baselines, including state-of-the-art English evaluators and machine translation-based approaches.

Article 316

Title@2025-07-28 (1): Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study

Title: Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study

Untersuchung struktureller Pruning- und Recovery-Techniken zur Komprimierung multimodaler Großsprachenmodelle: Eine empirische Studie

压缩多式大语言模式结构保护和恢复调查技术:经验研究 2507.20749v1

Authors (5): Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata

While Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose significant barriers to practical deployment. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs), but these methods offer limited flexibility and remain computationally intensive. To address this gap, we propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training. Specifically, we investigate two structural pruning paradigms, layerwise and widthwise pruning, applied to the language model backbone of MLLMs, alongside supervised finetuning and knowledge distillation. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios with limited computational resources or insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels (< 20%). Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved with as little as 5% of the original training data, while retaining over 95% of the original performance. Through empirical study on two representative MLLMs, i.e., LLaVA-v1.5-7B and Bunny-v1.0-3B, this study offers actionable insights for practitioners aiming to compress MLLMs effectively without extensive computation resources or sufficient data.

Article 317

Title@2025-07-28 (1): Everything is a Video: Unifying Modalities through Next-Frame Prediction

Title: Everything is a Video: Unifying Modalities through Next-Frame Prediction

Alles ist ein Video: Vereinheitlichen von Modalitäten durch Next-Frame-Vorhersage

一切都是一部视频:通过下框架预测实现统一的方式 2411.10503v2

Authors (7): G. Thomas Hudson, Dean Slack, Thomas Winterbottom, Jamie Sterling, Chenghao Xiao, Junjie Shentu, Noura Al Moubayed

Multimodal learning, which involves integrating information from various modalities such as text, images, audio, and video, is pivotal for numerous complex tasks like visual question answering, cross-modal retrieval, and caption generation. Traditional approaches rely on modality-specific encoders and late fusion techniques, which can hinder scalability and flexibility when adapting to new tasks or modalities. To address these limitations, we introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning. We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components. This method treats all inputs and outputs as sequential frames in a video, enabling seamless integration of modalities and effective knowledge transfer across tasks. Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-video, video-to-text, and audio-to-text, demonstrating the model’s ability to generalize across modalities with minimal adaptation. We show that task reformulation can significantly simplify multimodal model design across various tasks, laying the groundwork for more generalized multimodal foundation models.

Article 318

Title@2025-07-28 (1): Group Sequence Policy Optimization

Title: Group Sequence Policy Optimization

Optimierung der Gruppensequenzpolitik

组序列政策优化 2507.18071v2

Authors (12): Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
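To make the contrast between token-level and sequence-level importance ratios concrete, here is a minimal sketch in plain Python. The length normalization and the clipping range `eps` are assumptions made for illustration; the abstract only states that the ratio is based on sequence likelihood and that clipping happens at the sequence level.

```python
import math

def token_ratios(new_logps, old_logps):
    """Per-token importance ratios pi_new(y_t) / pi_old(y_t)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_ratio(new_logps, old_logps):
    """Sequence likelihood ratio, length-normalized (assumed normalization)."""
    diff = sum(new_logps) - sum(old_logps)
    return math.exp(diff / len(new_logps))

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single ratio (eps is hypothetical)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Toy per-token log-probabilities of one response under the old and new policy.
old_logps = [-1.0, -2.0, -0.5]
new_logps = [-0.9, -2.1, -0.5]

print(token_ratios(new_logps, old_logps))    # one ratio per token
print(sequence_ratio(new_logps, old_logps))  # one ratio for the whole response
```

With a sequence-level ratio, a single clipping decision applies to the whole response instead of one decision per token.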

Article 319

Title@2025-07-28 (1): Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models

Title: Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models

Text2VLM: Anpassung von Text-Only-Datensätzen an die Auswertung von Alignment-Trainings in visuellen Sprachmodellen

Text2VLM: 调整纯文本数据集以评价视觉语言模型的对齐培训 2507.20704v1

Authors (4): Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas

The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets lean heavily towards text-only prompts, leaving visual vulnerabilities under-evaluated. To address this gap, we propose Text2VLM, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in the current models’ alignment. This is in addition to a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, confirming that the extracted salient concepts, text summarization, and output classification align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM plays a role in advancing the safe deployment of VLMs in diverse, real-world applications.

Article 320

Title@2025-07-28 (1): Computational Analysis of Character Development in Holocaust Testimonies

Title: Computational Analysis of Character Development in Holocaust Testimonies

Computational Analyse der Charakterentwicklung in Holocaust-Zeugnissen

大屠杀证词特征发展计算分析 2412.17063v4

Authors (4): Esther Shizgal, Eitan Wagner, Renana Keydar, Omri Abend

This work presents a computational approach to analyze character development along the narrative timeline. The analysis characterizes the inner and outer changes the protagonist undergoes within a narrative, and the interplay between them. We consider transcripts of Holocaust survivor testimonies as a test case, each telling the story of an individual in first-person terms. We focus on the survivor’s religious trajectory, examining the evolution of their disposition toward religious belief and practice along the testimony. Clustering the resulting trajectories in the dataset, we identify common sequences in the data. Our findings highlight multiple common structures of religiosity across the narratives: in terms of belief, most present a constant disposition, while for practice, most present an oscillating structure, serving as valuable material for historical and sociological research. This work demonstrates the potential of natural language processing techniques for analyzing character evolution through thematic trajectories in narratives.

Article 321

Title@2025-07-28 (1): When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification

Title: When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification

Wenn Scale auf Vielfalt trifft: Bewertung von Sprachmodellen auf feinkörnige Mehrsprachigkeitsprüfung

规模达到多样性时:评价精细多语言索赔核实的语言模式 2507.20700v1

Authors (4): Hanna Shcharbakova, Tatiana Anikina, Natalia Skachkova, Josef van Genabith

The rapid spread of multilingual misinformation requires robust automated fact verification systems capable of handling fine-grained veracity assessments across diverse languages. While large language models have shown remarkable capabilities across many NLP tasks, their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We conduct a comprehensive evaluation of five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Our experiments compare small language models (encoder-based XLM-R and mT5) with recent decoder-only LLMs (Llama 3.1, Qwen 2.5, Mistral Nemo) using both prompting and fine-tuning approaches. Surprisingly, we find that XLM-R (270M parameters) substantially outperforms all tested LLMs (7-12B parameters), achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%. This represents an improvement of 15.8 percentage points over the previous state-of-the-art (41.9%), establishing new performance benchmarks for multilingual fact verification. Our analysis reveals problematic patterns in LLM behavior, including systematic difficulties in leveraging evidence and pronounced biases toward frequent categories in imbalanced data settings. These findings suggest that for fine-grained multilingual fact verification, smaller specialized models may be more effective than general-purpose large models, with important implications for practical deployment of fact-checking systems.
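Macro-F1 averages the per-class F1 scores with equal weight, which is why a bias toward frequent categories is penalized so heavily on imbalanced data. The sketch below illustrates this with made-up labels; the three label names and the toy predictions are hypothetical and are not taken from X-Fact.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score of a single class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)

def accuracy(y_true, y_pred):
    """Fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Imbalanced toy data: a classifier that always predicts the frequent class.
y_true = ["false"] * 8 + ["true", "half-true"]
y_pred = ["false"] * 10

print(accuracy(y_true, y_pred))  # high, because the frequent class dominates
print(macro_f1(y_true, y_pred))  # low, because two of three classes score 0
```

A model that collapses onto the majority class can look acceptable on accuracy while scoring poorly on macro-F1.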

Article 322

Title@2025-07-28 (1): Geometric-Mean Policy Optimization

Title: Geometric-Mean Policy Optimization

Geometrisch-Mean-Policy-Optimierung

几何海洋政策优化 2507.20673v1

Authors (12): Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei

Recent advancements, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifests as extreme importance sampling ratios during training, i.e., the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and 1.4% on a multimodal reasoning benchmark, including AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is available at https://github.com/callsys/GMPO.
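The claim that a geometric mean is less sensitive to outliers than an arithmetic mean can be checked on a toy example. The sketch below is not the GMPO objective itself (advantages and clipping are omitted); the list of per-token values is made up, with one deliberately extreme entry.

```python
import math

def arithmetic_mean(values):
    """Plain average of the values."""
    return sum(values) / len(values)

def geometric_mean(values):
    """Geometric mean computed in log space; all values must be positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-token importance-weighted values with one outlier token.
token_values = [1.0, 1.0, 1.0, 16.0]

print(arithmetic_mean(token_values))  # pulled strongly toward the outlier
print(geometric_mean(token_values))   # moves much less
```

Because the geometric mean averages logarithms, a single extreme token shifts it far less than it shifts the arithmetic mean.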

Article 323

Title@2025-07-28 (1): Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs

Title: Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs

Benchmarking Graph Neural Networks für die Dokumentenlayout-Analyse in öffentlichen Angelegenheiten

用于公共事务文件布局分析的图表神经网络 2505.14699v2

Authors (6): Miguel Lopez-Duran, Julian Fierrez, Aythami Morales, Ruben Tolosana, Oscar Delgado-Mohatar, Alvaro Ortigosa

The automatic analysis of document layouts in digital-born PDF documents remains a challenging problem due to the heterogeneous arrangement of textual and nontextual elements and the imprecision of the textual metadata in the Portable Document Format. In this work, we benchmark Graph Neural Network (GNN) architectures for the task of fine-grained layout classification of text blocks from digital native documents. We introduce two graph construction structures: a k-closest-neighbor graph and a fully connected graph, and generate node features via pre-trained text and vision models, thus avoiding manual feature engineering. Three experimental frameworks are evaluated: single-modality (text or visual), concatenated multimodal, and dual-branch multimodal. We evaluated four foundational GNN models and compared them with the baseline. Our experiments are specifically conducted on a rich dataset of public affairs documents that includes more than 20 sources (e.g., regional and national-level official gazettes), 37K PDF documents, with 441K pages in total. Our results demonstrate that GraphSAGE operating on the k-closest-neighbor graph in a dual-branch configuration achieves the highest per-class and overall accuracy, outperforming the baseline in some sources. These findings confirm the importance of local layout relationships and multimodal fusion exploited through GNNs for the analysis of native digital document layouts.
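As an illustration of the k-closest-neighbor graph construction, the following sketch links each text block to its k nearest blocks. It assumes Euclidean distance between block centers, which the abstract does not specify, and the coordinates are invented for the example.

```python
import math

def knn_graph(centers, k):
    """Return directed edges (i, j) from each node to its k nearest nodes.

    centers is a list of (x, y) block centers; distance is Euclidean
    (an assumption made for this sketch).
    """
    edges = []
    for i, (xi, yi) in enumerate(centers):
        dists = [
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        ]
        dists.sort()  # nearest first; ties broken by node index
        edges.extend((i, j) for _, j in dists[:k])
    return edges

# Four hypothetical text blocks laid out on one line of a page.
centers = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
edges = knn_graph(centers, k=1)
print(edges)
```

Compared with a fully connected graph, which has an edge for every ordered pair of blocks, the k-closest-neighbor graph keeps only local layout relationships.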

Article 324

Title@2025-07-28 (1): Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study

Title: Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study

Nachweis von unerwünschten Arzneimittelereignissen in niederländischen klinischen Textdokumenten mit Transformer-Modellen: Benchmark-Studie

利用变换模型发现荷兰临床免费文本文件中的不良毒品事件:基准研究 2507.19396v2

Authors (8): Rachel M. Murphy, Nishant Mishra, Nicolette F. de Keizer, Dave A. Dongelmans, Kitty J. Jager, Ameen Abu-Hanna, Joanna E. Klopotowska, Iacer Calixto

In this study, we establish a benchmark for adverse drug event (ADE) detection in Dutch clinical free-text documents using several transformer models, clinical scenarios, and fit-for-purpose performance measures. We trained a Bidirectional Long Short-Term Memory (Bi-LSTM) model and four transformer-based Dutch and/or multilingual encoder models (BERTje, RobBERT, MedRoBERTa(.)nl, and NuNER) for the tasks of named entity recognition (NER) and relation classification (RC) using 102 richly annotated Dutch ICU clinical progress notes. Anonymized free-text clinical progress notes of patients admitted to the intensive care unit (ICU) of one academic hospital and discharge letters of patients admitted to Internal Medicine wards of two non-academic hospitals were reused. We evaluated our ADE RC models internally using the gold standard (two-step task) and predicted entities (end-to-end task). In addition, all models were externally validated for detecting ADEs at the document level. We report both micro- and macro-averaged F1 scores, given the dataset imbalance in ADEs. Although differences for the ADE RC task between the models were small, MedRoBERTa(.)nl was the best performing model with a macro-averaged F1 score of 0.63 using the gold standard and 0.62 using predicted entities. The MedRoBERTa(.)nl models also performed the best in our external validation and achieved a recall of between 0.67 and 0.74 using predicted entities, meaning that between 67% and 74% of discharge letters with ADEs were detected. Our benchmark study presents a robust and clinically meaningful approach for evaluating language models for ADE detection in clinical free-text documents. Our study highlights the need to use appropriate performance measures fit for the task of ADE detection in clinical free-text documents and envisioned future clinical use.

Article 325

Title@2025-07-28 (1): Ontology-Enhanced Knowledge Graph Completion using Large Language Models

Title: Ontology-Enhanced Knowledge Graph Completion using Large Language Models

Ontologie-erweiterte Wissensgraphenvervollständigung mit großen Sprachmodellen

利用大语言模式完成本部强化知识图 2507.20643v1

Authors (5): Wenbin Guo, Xin Wang, Jiaoyan Chen, Zhao Li, Zirui Chen

Large Language Models (LLMs) have been extensively adopted in Knowledge Graph Completion (KGC), showcasing significant research advancements. However, as black-box models driven by deep neural architectures, current LLM-based KGC methods rely on implicit knowledge representation with parallel propagation of erroneous knowledge, thereby hindering their ability to produce conclusive and decisive reasoning outcomes. We aim to integrate neural-perceptual structural information with ontological knowledge, leveraging the powerful capabilities of LLMs to achieve a deeper understanding of the intrinsic logic of the knowledge. We propose OL-KGC, an ontology-enhanced KGC method using LLMs. It first leverages neural perceptual mechanisms to effectively embed structural information into the textual space, and then uses an automated extraction algorithm to retrieve ontological knowledge from the knowledge graphs (KGs) that need to be completed, which is further transformed into a textual format comprehensible to LLMs for providing logic guidance. We conducted extensive experiments on three widely used benchmarks: FB15K-237, UMLS and WN18RR. The experimental results demonstrate that OL-KGC significantly outperforms existing mainstream KGC methods across multiple evaluation metrics, achieving state-of-the-art performance.

Article 326

Title@2025-07-28 (1): Explainable Synthetic Image Detection through Diffusion Timestep Ensembling

Title: Explainable Synthetic Image Detection through Diffusion Timestep Ensembling

Erklärbare Synthetische Bilderkennung durch Diffusionszeitpunkt Zusammenbauen

通过传播时间步骤组合进行可解释的合成图像探测 2503.06201v2

Authors (10): Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang

Recent advances in diffusion models have enabled the creation of deceptively real images, posing significant security risks when misused. In this study, we empirically show that different timesteps of DDIM inversion reveal varying subtle distinctions between synthetic and real images that are extractable for detection, such as high-frequency discrepancies in the Fourier power spectrum and differences in inter-pixel variance distributions. Based on these observations, we propose a novel synthetic image detection method that directly utilizes features of intermediately noised images by training an ensemble on multiple noised timesteps, circumventing conventional reconstruction-based strategies. To enhance human comprehension, we introduce a metric-grounded explanation generation and refinement module to identify and explain AI-generated flaws. Additionally, we construct the GenHard and GenExplain benchmarks to provide detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that our method achieves state-of-the-art performance with 98.91% and 95.89% detection accuracy on regular and challenging samples respectively, and demonstrates generalizability and robustness. Our code and datasets are available at https://github.com/Shadowlized/ESIDE.

Article 327

Title@2025-07-28 (1): Before the Outrage: Challenges and Advances in Predicting Online Antisocial Behavior

Title: Before the Outrage: Challenges and Advances in Predicting Online Antisocial Behavior

Vor der Empörung: Herausforderungen und Fortschritte bei der Vorhersage von Online-Antisozialverhalten

暴政前:预测在线反社会行为的挑战和进展 2507.20614v1

Authors (1): Anaïs Ollagnier

Antisocial behavior (ASB) on social media, including hate speech, harassment, and trolling, poses growing challenges for platform safety and societal wellbeing. While prior work has primarily focused on detecting harmful content after it appears, predictive approaches aim to forecast future harmful behaviors (such as hate speech propagation, conversation derailment, or user recidivism) before they fully unfold. Despite increasing interest, the field remains fragmented, lacking a unified taxonomy or clear synthesis of existing methods. This paper presents a systematic review of over 49 studies on ASB prediction, offering a structured taxonomy of five core task types: early harm detection, harm emergence prediction, harm propagation prediction, behavioral risk prediction, and proactive moderation support. We analyze how these tasks differ by temporal framing, prediction granularity, and operational goals. In addition, we examine trends in modeling techniques, from classical machine learning to pre-trained language models, and assess the influence of dataset characteristics on task feasibility and generalization. Our review highlights methodological challenges, such as dataset scarcity, temporal drift, and limited benchmarks, while outlining emerging research directions including multilingual modeling, cross-platform generalization, and human-in-the-loop systems. By organizing the field around a coherent framework, this survey aims to guide future work toward more robust and socially responsible ASB prediction.

Article 328

Title@2025-07-28 (1): AutoLibra: Agent Metric Induction from Open-Ended Feedback

Title: AutoLibra: Agent Metric Induction from Open-Ended Feedback

AutoLibra: Agent Metric Induktion aus offenem Feedback

AutoLibra: 不限名额反馈的计量介绍代理 2505.02820v2

Authors (6): Hao Zhu, Phil Cuvin, Xinkai Yu, Charlotte Ka Yee Yan, Jason Zhang, Diyi Yang

Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation that transforms open-ended human feedback (e.g., “If you find that the button is disabled, don’t click it again”, or “This agent has too much autonomy to decide what to do on its own”) into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent’s behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: “coverage” and “redundancy”. Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra’s ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate on a wide range of text game tasks, improving agent performance over baseline by a mean of 20%. Second, we show that AutoLibra can iteratively select high-quality fine-tuning data for web navigation agents. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.

Article 329

Title@2025-07-28 (1): ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning

Title: ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning

ZSE-Cap: Ein Zero-Shot-Ensemble für Bildwiederherstellung und Prompt-Führung

ZSE-Cap: 用于图像检索和即时指导说明的零热组合 2507.20564v1

Authors (2): Duc-Tai Dinh, Duc Anh Khoa Dinh

We present ZSE-Cap (Zero-Shot Ensemble for Captioning), our 4th place system in the Event-Enriched Image Analysis (EVENTA) shared task on article-grounded image retrieval and captioning. Our zero-shot approach requires no finetuning on the competition’s data. For retrieval, we ensemble similarity scores from CLIP, SigLIP, and DINOv2. For captioning, we leverage a carefully engineered prompt to guide the Gemma 3 model, enabling it to link high-level events from the article to the visual content in the image. Our system achieved a final score of 0.42002, securing a top-4 position on the private test set and demonstrating the effectiveness of combining foundation models through ensembling and prompting. Our code is available at https://github.com/ductai05/ZSE-Cap.
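Ensembling similarity scores from several models can be sketched as follows. This is an illustration under assumptions: the min-max normalization and the equal weighting are choices made here, not details given in the abstract, and the score lists are invented numbers rather than outputs of CLIP, SigLIP, or DINOv2.

```python
def min_max(scores):
    """Rescale one model's scores to [0, 1] so that models are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def ensemble(score_lists):
    """Average the normalized scores of all models for each candidate."""
    normalized = [min_max(scores) for scores in score_lists]
    return [sum(column) / len(normalized) for column in zip(*normalized)]

def top_candidate(score_lists):
    """Index of the candidate image with the highest ensembled score."""
    combined = ensemble(score_lists)
    return max(range(len(combined)), key=combined.__getitem__)

# Hypothetical similarity scores of three candidate images under three models.
model_a = [0.2, 0.6, 0.4]
model_b = [10.0, 30.0, 50.0]  # a different scale, hence the normalization
model_c = [0.1, 0.9, 0.5]

print(ensemble([model_a, model_b, model_c]))
print(top_candidate([model_a, model_b, model_c]))
```

Normalizing before averaging keeps a model with a larger raw score range from dominating the ensemble.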

Article 330

Title@2025-07-28 (1): Enhancing Hallucination Detection via Future Context

Title: Enhancing Hallucination Detection via Future Context

Halluzinationserkennung durch zukünftigen Kontext verbessern

通过未来环境加强幻觉探测 2507.20546v1

Authors (6): Joosung Lee, Cheonbok Park, Hwiyeol Jo, Jeonghoon Kim, Joonsuk Park, Kang Min Yoo

Large Language Models (LLMs) are widely used to generate plausible text on online platforms, without revealing the generation process. As users increasingly encounter such black-box outputs, detecting hallucinations has become a critical challenge. To address this challenge, we focus on developing a hallucination detection framework for black-box generators. Motivated by the observation that hallucinations, once introduced, tend to persist, we sample future contexts. The sampled future contexts provide valuable clues for hallucination detection and can be effectively integrated with various sampling-based methods. We extensively demonstrate performance improvements across multiple methods using our proposed sampling approach.

Article 331

Title@2025-07-28 (1): From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought

Title: From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought

Von Antworten zu Rationalen: Selbstjustierung multimodaler Vernunft mit answer-oriented Chain-of-Thought

从答案到理由:自调整的多式联运理由与以回答为主的探索链 2507.02984v2

Authors (5): Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding

Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methods primarily focus on synthesizing positive rationales, typically relying on manual annotations or complex systems. Moreover, they often overlook negative reasoning, which limits the model’s generalization ability and robustness in multimodal inference. To address this gap, we propose a novel framework: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought (SMART). SMART employs an answer-oriented chain-of-thought (AoT) prompt to automatically construct high-quality data. Drawing inspiration from human proof-based strategies, AoT leverages both correct and incorrect answers to extract key visual information that links questions and answers. When provided with correct answers, the model produces strong positive rationales. Conversely, when correct answers are replaced with incorrect alternatives, the model generates an erroneous yet compelling reasoning path, serving as a form of discriminative negative rationale. Models trained with AoT-generated data outperform those trained on manually annotated datasets, demonstrating superior reasoning capabilities. Consequently, SMART establishes an iterative generation-optimization method that continually enhances the model’s reasoning skills. Experiments indicate that the SMART framework significantly improves various MLLMs, regardless of model architecture, parameter size, or pre-training dataset. The code is available at https://github.com/WentaoTan/SMART.

Article 332

Title@2025-07-28 (1): Kimi K2: Open Agentic Intelligence

Title: Kimi K2: Open Agentic Intelligence

Kimi K2: Offene Agentische Intelligenz

Kimi K2:开放特工情报 2507.20534v1

Authors (169): Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaoru Hao, Tianhong He, Weiran He, Wenyang He, Chao Hong, Yangyang Hu, Zhenxing Hu, Weixiao Huang, Zhiqi Huang, Zihao Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yongsheng Kang, Guokun Lai, Cheng Li, Fang Li, Haoyang Li, Ming Li, Wentao Li, Yanhao Li, Yiwei Li, Zhaowei Li, Zheming Li, Hongzhan Lin, Xiaohan Lin, Zongyu Lin, Chengyin Liu, Chenyu Liu, Hongzhang Liu, Jingyuan Liu, Junqi Liu, Liang Liu, Shaowei Liu, T. Y. Liu, Tianwei Liu, Weizhou Liu, Yangyang Liu, Yibo Liu, Yiping Liu, Yue Liu, Zhengying Liu, Enzhe Lu, Lijun Lu, Shengling Ma, Xinyu Ma, Yingwei Ma, Shaoguang Mao, Jie Mei, Xin Men, Yibo Miao, Siyuan Pan, Yebo Peng, Ruoyu Qin, Bowen Qu, Zeyu Shang, Lidong Shi, Shengyuan Shi, Feifan Song, Jianlin Su, Zhengyuan Su, Xinjie Sun, Flood Sung, Heyi Tang, Jiawen Tao, Qifeng Teng, Chensi Wang, Dinglu Wang, Feng Wang, Haiming Wang, Jianzhou Wang, Jiaxing Wang, Jinhong Wang, Shengjie Wang, Shuyi Wang, Yao Wang, Yejie Wang, Yiqin Wang, Yuxin Wang, Yuzhi Wang, Zhaoji Wang, Zhengtao Wang, Zhexu Wang, Chu Wei, Qianqian Wei, Wenhao Wu, Xingzhe Wu, Yuxin Wu, Chenjun Xiao, Xiaotong Xie, Weimin Xiong, Boyu Xu, Jing Xu, Jinjing Xu, L. H. Xu, Lin Xu, Suting Xu, Weixin Xu, Xinran Xu, Yangchuan Xu, Ziyao Xu, Junjie Yan, Yuzi Yan, Xiaofei Yang, Ying Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Xingcheng Yao, Wenjie Ye, Zhuorui Ye, Bohong Yin, Longhui Yu, Enming Yuan, Hongbang Yuan, Mengjie Yuan, Haobing Zhan, Dehao Zhang, Hao Zhang, Wanlu Zhang, Xiaobin Zhang, Yangkun Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Haotian Zhao, Yikai Zhao, Huabin Zheng, Shaojie Zheng, Jianren Zhou, Xinyu Zhou, Zaida Zhou, Zhen Zhu, Weiyu Zhuang, Xinxing Zu

We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spikes. K2 then undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual, surpassing most open- and closed-source baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.

Article 333

Title@2025-07-28 (1): SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law

Title: SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law

SafeWork-R1: Koevolving Safety and Intelligence unter dem AI-45$^{\circ}$ Gesetz

安全工作-R1:根据AI-45$^{\circ}$ 法发展安全和情报 2507.18576v2

Authors (118): Shanghai AI Lab, :, Yicheng Bao, Guanxu Chen, Mingkang Chen, Yunhao Chen, Chiyu Chen, Lingjie Chen, Sirui Chen, Xinquan Chen, Jie Cheng, Yu Cheng, Dengke Deng, Yizhuo Ding, Dan Ding, Xiaoshan Ding, Yi Ding, Zhichen Dong, Lingxiao Du, Yuyu Fan, Xinshun Feng, Yanwei Fu, Yuxuan Gao, Ruijun Ge, Tianle Gu, Lujun Gui, Jiaxuan Guo, Qianxi He, Yuenan Hou, Xuhao Hu, Hong Huang, Kaichen Huang, Shiyang Huang, Yuxian Jiang, Shanzhe Lei, Jie Li, Lijun Li, Hao Li, Juncheng Li, Xiangtian Li, Yafu Li, Lingyu Li, Xueyan Li, Haotian Liang, Dongrui Liu, Qihua Liu, Zhixuan Liu, Bangwei Liu, Huacan Liu, Yuexiao Liu, Zongkai Liu, Chaochao Lu, Yudong Lu, Xiaoya Lu, Zhenghao Lu, Qitan Lv, Caoyuan Ma, Jiachen Ma, Xiaoya Ma, Zhongtian Ma, Lingyu Meng, Ziqi Miao, Yazhe Niu, Yuezhang Peng, Yuan Pu, Han Qi, Chen Qian, Xingge Qiao, Jingjing Qu, Jiashu Qu, Wanying Qu, Wenwen Qu, Xiaoye Qu, Qihan Ren, Qingnan Ren, Qingyu Ren, Jing Shao, Wenqi Shao, Shuai Shao, Dongxing Shi, Xin Song, Xinhao Song, Yan Teng, Xuan Tong, Yingchun Wang, Xuhong Wang, Shujie Wang, Xin Wang, Yige Wang, Yixu Wang, Yuanfu Wang, Futing Wang, Ruofan Wang, Wenjie Wang, Yajie Wang, Muhao Wei, Xiaoyu Wen, Fenghua Weng, Yuqi Wu, Yingtong Xiong, Xingcheng Xu, Chao Yang, Yue Yang, Yang Yao, Yulei Ye, Zhenyun Yin, Yi Yu, Bo Zhang, Qiaosheng Zhang, Jinxuan Zhang, Yexin Zhang, Yinqiang Zheng, Hefeng Zhou, Zhanhui Zhou, Pengyu Zhu, Qingzi Zhu, Yubo Zhu, Bowen Zhou

We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety ‘aha’ moments. Notably, SafeWork-R1 achieves an average improvement of 46.54% over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.

Article 334

Title: Otter: A Multi-Modal Model with In-Context Instruction Tuning

Otter: Ein Multi-Modal-Modell mit In-Context-Anleitung Tuning

Otter:具有内文指导图纸的多模式模型 2305.03726v2

Authors (8): Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Ziwei Liu

Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction-following capability. To bridge this gap, we introduce the Otter model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with the Perceiver architecture, and has been instruction-tuned to serve as a general-purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the MIMIC-IT (MultI-Modal In-Context Instruction Tuning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.

Article 335

Title: Dialogues of Dissent: Thematic and Rhetorical Dimensions of Hate and Counter-Hate Speech in Social Media Conversations

Dialoge von Dissent: Thematische und rhetorische Dimensionen von Hass und Gegenhass in Social Media-Gesprächen

不同意见对话:社会媒体对话中的仇恨和反仇恨言论的主题和风湿方面 2507.20528v1

Authors (4): Effi Levi, Gal Ron, Odelia Oshri, Shaul R. Shenhav

We introduce a novel multi-labeled scheme for joint annotation of hate and counter-hate speech in social media conversations, categorizing hate and counter-hate messages into thematic and rhetorical dimensions. The thematic categories outline different discursive aspects of each type of speech, while the rhetorical dimension captures how hate and counter messages are communicated, drawing on Aristotle’s Logos, Ethos and Pathos. We annotate a sample of 92 conversations, consisting of 720 tweets, and conduct statistical analyses, incorporating public metrics, to explore patterns of interaction between the thematic and rhetorical dimensions within and between hate and counter-hate speech. Our findings provide insights into the spread of hate messages on social media, the strategies used to counter them, and their potential impact on online behavior.

Article 336

Title@2025-07-28 (1): Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards

Title: Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards

Versehentliche Sicherheitslücke: Faktoren bei Feinsteuerung, die das Modell schützen

意外脆弱性:改变模式保障保障措施的微调因素 2505.16789v2

Authors (4): Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin

As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can inadvertently introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Vulnerability, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity across multiple experimental datasets. We then evaluate the adversarial robustness of these fine-tuned models, analyzing persona shifts and interpretability traits to understand how dataset factors contribute to attack success rates. Lastly, we explore causal relationships that offer new insights into adversarial defense strategies, highlighting the crucial role of dataset design in preserving model alignment. Our code is available at https://github.com/psyonp/accidental_vulnerability.

Article 337

Title@2025-07-28 (1): Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Title: Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Sicherheitsherausforderungen bei der Bereitstellung von KI-Agenten: Einblicke aus einem groß angelegten öffentlichen Wettbewerb

AI 代理部署在安全方面面临的挑战:大规模公共竞争的展望 2507.20526v1

Authors (17): Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson

Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark, a curated set of high-impact attacks, and evaluate it across 19 state-of-the-art models. Nearly all agents exhibit policy violations for most behaviors within 10-100 queries, with high attack transferability across models and tasks. Importantly, we find limited correlation between agent robustness and model size, capability, or inference-time compute, suggesting that additional defenses are needed against adversarial misuse. Our findings highlight critical and persistent vulnerabilities in today’s AI agents. By releasing the ART benchmark and accompanying evaluation framework, we aim to support more rigorous security assessment and drive progress toward safer agent deployment.

Article 338

Title@2025-07-28 (1): AQUA: A Large Language Model for Aquaculture & Fisheries

Title: AQUA: A Large Language Model for Aquaculture & Fisheries

AQUA: Ein großes Sprachmodell für Aquakultur und Fischerei

AQUA:水产养殖和渔业大语言模式 2507.20520v1

Authors (7): Praneeth Narisetty, Uday Kumar Reddy Kattamanchi, Lohit Akshant Nimma, Sri Ram Kaushik Karnati, Shiva Nagendra Babu Kore, Mounika Golamari, Tejashree Nageshreddy

Aquaculture plays a vital role in global food security and coastal economies by providing sustainable protein sources. As the industry expands to meet rising demand, it faces growing challenges such as disease outbreaks, inefficient feeding practices, rising labor costs, logistical inefficiencies, and critical hatchery issues, including high mortality rates and poor water quality control. Although artificial intelligence has made significant progress, existing machine learning methods fall short of addressing the domain-specific complexities of aquaculture. To bridge this gap, we introduce AQUA, the first large language model (LLM) tailored for aquaculture, designed to support farmers, researchers, and industry practitioners. Central to this effort is AQUADAPT (Data Acquisition, Processing and Tuning), an agentic framework for generating and refining high-quality synthetic data using a combination of expert knowledge, large-scale language models, and automated evaluation techniques. Our work lays the foundation for LLM-driven innovations in aquaculture research, advisory systems, and decision-making tools.

Article 339

Title@2025-07-28 (1): Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

Title: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

Große Sprachmodelle für Tibeter mit kuratierten Daten und kontinuierlichem Pre-Training

推进藏藏人大语言模式,提供 “ 扩展数据 “ 和 “ 持续培训前 “ 。 2507.09205v4

Authors (17): Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Yuqi Ren, Deyi Xiong

Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre-training and post-training a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.

Article 340

Title@2025-07-28 (1): REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models

Title: REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models

REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle

REINFORCE++: 高效的RLHF对快速模型和奖励模型具有强力的测算法 2501.03262v7

Authors (4): Jian Hu, Jason Klein Liu, Haotian Xu, Wei Shen

Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. While state-of-the-art applications like ChatGPT or GPT-4 commonly employ Proximal Policy Optimization (PPO), the inclusion of a critic network introduces significant computational overhead. REINFORCE-based methods, such as REINFORCE Leave-One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO), address this limitation by eliminating the critic network. However, these approaches face challenges in accurate advantage estimation. Specifically, they estimate advantages independently for the responses to each prompt, which can bias the estimates and lead to overfitting on simpler prompts and vulnerability to reward hacking. To address these challenges, we introduce REINFORCE++, a novel approach that removes the critic model and instead uses global advantage normalization, which is unbiased, to improve training stability. Our empirical evaluation demonstrates that REINFORCE++ exhibits robust performance across various reward models without requiring prompt set truncation. Furthermore, it achieves superior generalization in both RLHF and long chain-of-thought (CoT) settings compared to existing REINFORCE-based methods. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
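
The contrast between per-prompt and global advantage normalization can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the function names, the toy reward batch, and the plain mean/std normalization below are assumptions made for illustration, not the OpenRLHF implementation.

```python
from statistics import mean, pstdev


def per_prompt_advantages(rewards_by_prompt, eps=1e-8):
    """Normalize each prompt's rewards using only that prompt's own statistics."""
    out = []
    for rewards in rewards_by_prompt:
        mu, sigma = mean(rewards), pstdev(rewards)
        out.append([(r - mu) / (sigma + eps) for r in rewards])
    return out


def global_advantages(rewards_by_prompt, eps=1e-8):
    """Normalize every reward with one mean and std shared by the whole batch."""
    flat = [r for rewards in rewards_by_prompt for r in rewards]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in rewards]
            for rewards in rewards_by_prompt]


# Hypothetical batch: an "easy" prompt (rewards nearly identical) and a harder one.
batch = [[1.0, 1.0, 0.9], [0.0, 1.0, 0.0]]
local_adv = per_prompt_advantages(batch)
global_adv = global_advantages(batch)
```

In this toy batch the per-prompt variant stretches the tiny reward gap on the easy prompt into large advantages, while the global variant keeps that gap small relative to the harder prompt.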

Article 341

Title: Customize Multi-modal RAI Guardrails with Precedent-based predictions

Multimodale RAI-Guardrails mit vorausschauenden Vorhersagen anpassen

定制具有先例预测的多式RAI护卫车 2507.20503v1

Authors (6): Cheng-Fu Yang, Thanh Tran, Christos Christodoulopoulos, Weitong Ruan, Rahul Gupta, Kai-Wei Chang

A multi-modal guardrail must effectively filter image content based on user-defined policies, identifying material that may be hateful, reinforce harmful stereotypes, contain explicit material, or spread misinformation. Deploying such guardrails in real-world applications, however, poses significant challenges. Users often require varied and highly customizable policies and typically cannot provide abundant examples for each custom policy. Consequently, an ideal guardrail should be scalable to multiple policies and adaptable to evolving user standards with minimal retraining. Existing fine-tuning methods typically condition predictions on pre-defined policies, restricting their generalizability to new policies or necessitating extensive retraining to adapt. Conversely, training-free methods struggle with limited context lengths, making it difficult to incorporate all the policies comprehensively. To overcome these limitations, we propose to condition the model’s judgment on “precedents”, which are the reasoning processes of prior data points similar to the given input. By leveraging precedents instead of fixed policies, our approach greatly enhances the flexibility and adaptability of the guardrail. In this paper, we introduce a critique-revise mechanism for collecting high-quality precedents and two strategies that utilize precedents for robust prediction. Experimental results demonstrate that our approach outperforms previous methods across both few-shot and full-dataset scenarios and exhibits superior generalization to novel policies.

Article 342

Title@2025-07-28 (1): Pruning for Performance: Efficient Idiom and Metaphor Classification in Low-Resource Konkani Using mBERT

Title: Pruning for Performance: Efficient Idiom and Metaphor Classification in Low-Resource Konkani Using mBERT

Pruning for Performance: Effiziente Idiom- und Metaphor-Klassifikation in Low-Resource Konkani mit mBERT

利用mBERT, 低资源 Konkani 中高效的低资源 Konkani 和同义词分类 2506.02005v2

Authors (7): Timothy Do, Pranav Saran, Harshita Poojary, Pranav Prabhu, Sean O’Brien, Vasu Sharma, Kevin Zhu

In this paper, we address the persistent challenges that figurative language expressions pose for natural language processing (NLP) systems, particularly in low-resource languages such as Konkani. We present a hybrid model that integrates a pre-trained Multilingual BERT (mBERT) with a bidirectional LSTM and a linear classifier. This architecture is fine-tuned on a newly introduced annotated dataset for metaphor classification, developed as part of this work. To improve the model’s efficiency, we implement a gradient-based attention head pruning strategy. For metaphor classification, the pruned model achieves an accuracy of 78%. We also applied our pruning approach to expand on an existing idiom classification task, achieving 83% accuracy. These results demonstrate the effectiveness of attention head pruning for building efficient NLP tools in underrepresented languages.

Article 343

Title@2025-07-28 (1): Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems

Title: Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems

Sprechen in Worten, Denken in Logik: Ein Dual-Process-Framework in QA-Systemen

用文字说,用逻辑思考:质量保证系统中的双重处理框架 2507.20491v1

Authors (8): Tuan Bui, Trong Le, Phat Thai, Sang Nguyen, Minh Hua, Ngan Pham, Thang Bui, Tho Quan

Recent advances in large language models (LLMs) have significantly enhanced question-answering (QA) capabilities, particularly in open-domain contexts. However, in closed-domain scenarios such as education, healthcare, and law, users demand not only accurate answers but also transparent reasoning and explainable decision-making processes. While neural-symbolic (NeSy) frameworks have emerged as a promising solution, leveraging LLMs for natural language understanding and symbolic systems for formal reasoning, existing approaches often rely on large-scale models and exhibit inefficiencies in translating natural language into formal logic representations. To address these limitations, we introduce Text-JEPA (Text-based Joint-Embedding Predictive Architecture), a lightweight yet effective framework for converting natural language into first-order logic (NL2FOL). Drawing inspiration from dual-system cognitive theory, Text-JEPA emulates System 1 by efficiently generating logic representations, while the Z3 solver operates as System 2, enabling robust logical inference. To rigorously evaluate the NL2FOL-to-reasoning pipeline, we propose a comprehensive evaluation framework comprising three custom metrics: conversion score, reasoning score, and Spearman rho score, which collectively capture the quality of logical translation and its downstream impact on reasoning accuracy. Empirical results on domain-specific datasets demonstrate that Text-JEPA achieves competitive performance with significantly lower computational overhead compared to larger LLM-based systems. Our findings highlight the potential of structured, interpretable reasoning frameworks for building efficient and explainable QA systems in specialized domains.
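
As an illustration of the rank-correlation idea behind a Spearman rho score, here is a minimal sketch of Spearman's coefficient computed from average ranks. The abstract does not say which two series the paper correlates, so pairing per-example conversion scores with reasoning scores, as in the toy data below, is an assumption.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-example scores, invented for illustration only.
conversion_scores = [0.2, 0.5, 0.7, 0.9]
reasoning_scores = [0.1, 0.4, 0.8, 0.95]
rho = spearman_rho(conversion_scores, reasoning_scores)
```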

Article 344

Title@2025-07-28 (1): Juru: Legal Brazilian Large Language Model from Reputable Sources

Title: Juru: Legal Brazilian Large Language Model from Reputable Sources

Juru: Rechtliches brasilianisches Large Language Model aus seriösen Quellen

Juru:来自有名来源的巴西大语言法律模型 2403.18140v2

Authors (4): Roseval Malaquias Junior, Ramon Pires, Roseli Romero, Rodrigo Nogueira

The high compute cost associated with pretraining large language models limits their research. Two strategies have emerged to address this issue: domain specialization and pretraining with high-quality data. To explore these strategies, we specialized the Mistral-7B model with 1.9 billion unique tokens from reputable Brazilian legal sources and conducted few-shot evaluations on legal and general knowledge test suites. Our model, Juru, demonstrates the benefits of domain specialization by achieving improved performance on legal benchmarks, even with a reduced amount of pretraining data. However, this domain specialization through continued pretraining comes at the cost of increased forgetting in unrelated domains, as evidenced by performance degradation on general knowledge test suites in both Portuguese and English. This study contributes to the growing body of scientific evidence showing that pretraining data selection may enhance the performance of large language models, enabling the exploration of these models at a lower cost. Juru is publicly available at https://huggingface.co/roseval/Juru-7B .

Article 345

Title@2025-07-28 (1): Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents

Title: Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents

Benutzer vor ihnen selbst schützen: Schutz der kontextuellen Privatsphäre in Interaktionen mit Gesprächspartnern

保护用户免受自我伤害:在与交流代理人的互动中保护环境隐私 2502.18509v2

Authors (7): Ivoline Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy

Conversational agents are increasingly woven into individuals’ personal lives, yet users often underestimate the privacy risks associated with them. The moment users share information with these agents, such as large language models (LLMs), their private information becomes vulnerable to exposure. In this paper, we characterize the notion of contextual privacy for user interactions with LLM-based Conversational Agents (LCAs). It aims to minimize privacy risks by ensuring that users (senders) disclose only information that is both relevant and necessary for achieving their intended goals when interacting with LCAs (untrusted receivers). Through a formative design user study, we observe how even “privacy-conscious” users inadvertently reveal sensitive information through indirect disclosures. Based on insights from this study, we propose a locally deployable framework that operates between users and LCAs, identifying and reformulating out-of-context information in user prompts. Our evaluation using examples from ShareGPT shows that lightweight models can effectively implement this framework, achieving strong gains in contextual privacy while preserving the user’s intended interaction goals. Notably, about 76% of participants in our human evaluation preferred the reformulated prompts over the original ones, validating the usability and effectiveness of contextual privacy in our proposed framework. We open-source the code at https://github.com/IBM/contextual-privacy-LLM.

Article 346

Title@2025-07-28 (1): Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM

Title: Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM

Ähnliches Beispiel verbessern Retrieval-Ranking-Performance durch Revisiting RankSVM

通过重审RanksSVM改进类似案例检索排名 2502.11131v2

Authors (2): Yuqi Liu, Yan Zheng

Given the rapid development of legal AI, much attention has been paid to one of its most important tasks, similar case retrieval, especially using language models. In our paper, however, we try to improve the ranking performance of current models from the perspective of learning to rank rather than language modeling. Specifically, we conduct experiments using a pairwise method, RankSVM, as the classifier in place of a fully connected layer, combined with commonly used language models, on the similar case retrieval datasets LeCaRDv1 and LeCaRDv2. We conclude that RankSVM generally helps improve retrieval performance on the LeCaRDv1 and LeCaRDv2 datasets compared with the original classifiers by optimizing the precise ranking. It can also help mitigate overfitting owing to class imbalance. Our code is available at https://github.com/liuyuqi123study/RankSVM_for_SLR
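
The pairwise idea behind RankSVM can be sketched as follows: each pair of candidates with different relevance becomes a difference vector, and a linear scorer is trained with a hinge loss so that the more relevant candidate scores higher. This is a simplified subgradient-descent sketch rather than the authors' solver; the two-dimensional features, relevance labels, and hyperparameters are hypothetical.

```python
def pairwise_differences(features, relevance):
    """Build x_i - x_j for every pair where candidate i is more relevant than j."""
    pairs = []
    for xi, ri in zip(features, relevance):
        for xj, rj in zip(features, relevance):
            if ri > rj:
                pairs.append([a - b for a, b in zip(xi, xj)])
    return pairs


def train_ranksvm(pairs, dim, lr=0.1, reg=0.01, epochs=200):
    """Minimize L2-regularized hinge loss max(0, 1 - w.d) over difference vectors."""
    w = [0.0] * dim
    for _ in range(epochs):
        for d in pairs:
            margin = sum(wk * dk for wk, dk in zip(w, d))
            for k in range(dim):
                # Hinge subgradient is -d[k] only while the margin is violated.
                grad = reg * w[k] - (d[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w


def score(w, x):
    """Linear ranking score for one candidate."""
    return sum(wk * xk for wk, xk in zip(w, x))


# Hypothetical candidate-case features (e.g. pooled encoder outputs) for one query.
features = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.8]]
relevance = [2, 1, 0]
w = train_ranksvm(pairwise_differences(features, relevance), dim=2)
ranking = sorted(range(len(features)), key=lambda i: -score(w, features[i]))
```

Because the loss is defined on pairs rather than on individual labels, the scorer is optimized for relative order directly.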

Article 347

Title@2025-07-28 (1): In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

Title: In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

In Prospect und Retrospect: Reflektierendes Speichermanagement für langfristige Personalisierte Dialogagenten

展望和回顾:长期个人化对话代理人的反思记忆管理 2503.08026v2

Authors (15): Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long T. Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, Tomas Pfister

Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities (utterances, turns, and sessions) into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs’ cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.

Article 348

Title@2025-07-27 (7): Critiques of World Models

Title: Critiques of World Models

Kritik an Weltmodellen

世界模式的证明 2507.05169v3

Authors (4): Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu

World Model, the supposed algorithmic surrogate of the real-world environment that biological agents experience and act upon, has been an emerging topic in recent years because of the rising need to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of “hypothetical thinking” in psychology literature, we offer critiques of several schools of thought on world modeling, and argue that the primary goal of a world model is to simulate all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

Article 349

Title@2025-07-27 (7): CodeNER: Code Prompting for Named Entity Recognition

Title: CodeNER: Code Prompting for Named Entity Recognition

CodeNER: Codeaufforderung für die benannte Entitätserkennung

识别名称实体的代码提示代码 2507.20423v1

Authors (5): Sungwoo Han, Hyeyeon Kim, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura

Recent studies have explored various approaches for treating candidate named entity spans as both source and target sequences in named entity recognition (NER) by leveraging large language models (LLMs). Although previous approaches have successfully generated candidate named entity spans with suitable labels, they rely solely on input context information when using LLMs, particularly ChatGPT. However, NER inherently requires capturing detailed labeling requirements together with input context information. To address this issue, we propose a novel method that leverages code-based prompting to improve the capabilities of LLMs in understanding and performing NER. By embedding code within prompts, we provide detailed BIO schema instructions for labeling, thereby exploiting the ability of LLMs to comprehend long-range scopes in programming languages. Experimental results demonstrate that the proposed code-based prompting method outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets, indicating the effectiveness of explicitly structuring NER instructions. We also verify that combining the proposed code-based prompting method with chain-of-thought prompting further improves performance.
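
To make the BIO schema concrete, here is a minimal sketch of how BIO tags map to labeled entity spans when written as code. It is not the authors' prompt: the function name, the tag set, and the example sentence are hypothetical and serve only to show the kind of explicit labeling logic that code can express.

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into (entity text, label) pairs.

    B-X opens an entity of type X, I-X continues an open entity of the
    same type, and O (or a mismatched I- tag) closes any open entity.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((" ".join(tokens[start:i]), label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["Barack", "Obama", "visited", "Paris"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
entities = bio_to_spans(tokens, tags)
```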

Article 350

Title@2025-07-27 (7): Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?

Title: Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?

Umfrage zu NLU-Benchmarks Diagnose Linguistische Phänomene: Warum nicht Diagnose-Benchmarks standardisieren?

NLU基准诊断语言神话调查:为什么不使诊断基准标准化? 2507.20419v1

Authors (3): Khloud AL Jallad, Nada Ghneim, Ghaida Rebdawi

Natural Language Understanding (NLU) is a basic task in Natural Language Processing (NLP). The evaluation of NLU capabilities has become a trending research topic that has attracted researchers in the last few years, resulting in the development of numerous benchmarks. These benchmarks include various tasks and datasets in order to evaluate the results of pretrained models via public leaderboards. Notably, several benchmarks contain diagnostics datasets designed for investigation and fine-grained error analysis across a wide range of linguistic phenomena. This survey provides a comprehensive review of available English, Arabic, and Multilingual NLU benchmarks, with a particular emphasis on their diagnostics datasets and the linguistic phenomena they cover. We present a detailed comparison and analysis of these benchmarks, highlighting their strengths and limitations in evaluating NLU tasks and providing in-depth error analysis. When highlighting the gaps in the state-of-the-art, we noted that there is no naming convention for macro and micro categories or even a standard set of linguistic phenomena that should be covered. Consequently, we formulated a research question regarding the evaluation metrics of the evaluation diagnostics benchmarks: “Why do we not have an evaluation standard for the NLU evaluation diagnostics benchmarks?”, similar to ISO standards in industry. We conducted a deep analysis and comparison of the covered linguistic phenomena in order to support experts in building a global hierarchy for linguistic phenomena in the future. We think that having evaluation metrics for diagnostics evaluation could be valuable for gaining more insights when comparing the results of the studied models on different diagnostics benchmarks.

Article 351

Title@2025-07-27 (7): CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning

Title: CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning

CONCAP: Über das Englische hinaussehen mit Konzepten Retrieval-Augmented Captioning

CONCAP: 以概念检索增强说明方式在英语以外看问题 2507.20411v1

Authors (3): George Ibrahim, Rita Ramos, Yova Kementchedjhieva

Multilingual vision-language models have made significant strides in image captioning, yet they still lag behind their English counterparts due to limited multilingual training data and costly large-scale model parameterization. Retrieval-augmented generation (RAG) offers a promising alternative by conditioning caption generation on retrieved examples in the target language, reducing the need for extensive multilingual training. However, multilingual RAG captioning models often depend on retrieved captions translated from English, which can introduce mismatches and linguistic biases relative to the source language. We introduce CONCAP, a multilingual image captioning model that integrates retrieved captions with image-specific concepts, enhancing the contextualization of the input image and grounding the captioning process across different languages. Experiments on the XM3600 dataset indicate that CONCAP enables strong performance on low- and mid-resource languages, with highly reduced data requirements. Our findings highlight the effectiveness of concept-aware retrieval augmentation in bridging multilingual performance gaps.

Article 352

Title@2025-07-27 (7): Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

Title: Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

Clarify lernen: Multiturn-Gespräche mit aktionsbasiertem Kontrast-Selbst-Training

学习澄清:与基于行动的差异性自我培训进行多方向对话 2406.00222v2

Authors (4): Maximillian Chen, Ruoxi Sun, Tomas Pfister, Sercan Ö. Arık

Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation: when they are faced with ambiguity, they often overhedge or implicitly guess users’ true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs’ ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT’s efficacy in data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs’ ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.
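
Since ACT builds on Direct Preference Optimization, a minimal sketch of the standard DPO loss for a single preference pair may help. It shows only the base objective; how ACT constructs its action-based contrastive pairs is not described in the abstract and is not modeled here. The log-probability values and the `beta` setting are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model. The loss is -log(sigmoid(margin)).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy has shifted toward the chosen response.
loss_untrained = dpo_loss(-2.0, -2.0, -2.0, -2.0, beta=1.0)
loss_improved = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference model and lowers the rejected one's.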

Article 353

Title@2025-07-27 (7): Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification

Title: Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification

Selbstregularisierung mit Sparse Autoencodern für steuerbare LLM-basierte Klassifizierung

与基于可控 LLM 的可控 LLM 分类的 Sparse 自动编码器的自调节 2502.14133v3

Authors (4): Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, Ninghao Liu

Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on task-specific datasets. In training the classification model, we propose a simple and effective regularizer, by minimizing the similarity between the classifier weights and the identified unintended feature, to remove the impact of these unintended features on classification. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis. Results show that the proposed self-regularization framework can improve the classifier’s generalizability by regularizing those features that are not semantically correlated to the task. This work pioneers controllable text classification on LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. The code and data are publicly available at https://github.com/JacksonWuxs/Controllable_LLM_Classifier.
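
The regularizer described above can be sketched as a penalty on the similarity between the classifier weight vector and each unintended feature direction. The abstract does not specify the similarity measure, so the squared cosine similarity, the weighting factor `lam`, and the toy vectors below are assumptions for illustration and may differ from the released code.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def regularized_loss(task_loss, weights, unintended_dirs, lam=1.0):
    """Add a penalty that grows as the weights align with unintended features."""
    penalty = sum(cosine(weights, d) ** 2 for d in unintended_dirs)
    return task_loss + lam * penalty


# Hypothetical SAE feature direction flagged as unintended (e.g. a sensitive attribute).
unintended = [[1.0, 0.0]]
loss_aligned = regularized_loss(0.5, [2.0, 0.0], unintended)     # weights use the feature
loss_orthogonal = regularized_loss(0.5, [0.0, 3.0], unintended)  # weights ignore it
```

Minimizing this combined loss pushes the classifier toward weight vectors orthogonal to the flagged directions, so those features stop influencing predictions.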


Article 354

Title: Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations

Kognitive Denkkette: Strukturierte multimodale Begründung über soziale Situationen

认知思考链:社会状况的结构性多模式原因 2507.20409v1

Authors (5): Eunkyu Park, Wesley Hanwen Deng, Gunhee Kim, Motahhare Eslami, Maarten Sap

Chain-of-Thought (CoT) prompting helps models think step by step. But what happens when they must see, understand, and judge, all at once? In visual tasks grounded in social context, where bridging perception with norm-grounded judgments is essential, flat CoT often breaks down. We introduce Cognitive Chain-of-Thought (CoCoT), a prompting strategy that scaffolds VLM reasoning through three cognitively inspired stages: perception, situation, and norm. Our experiments show that, across multiple multimodal benchmarks (including intent disambiguation, commonsense reasoning, and safety), CoCoT consistently outperforms CoT and direct prompting (+8% on average). Our findings demonstrate that cognitively grounded reasoning stages enhance interpretability and social awareness in VLMs, paving the way for safer and more reliable multimodal systems.
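A three-stage prompt of the kind described above might be assembled as in the following sketch. Only the stage names (perception, situation, norm) come from the abstract; the instruction wording, the function name `build_cocot_prompt`, and the overall template are hypothetical.

```python
def build_cocot_prompt(question: str) -> str:
    """Build a staged prompt: perception, then situation, then norm."""
    # Stage names follow the abstract; the instruction text is illustrative only.
    stages = [
        ("Perception", "Describe what is directly observable in the image."),
        ("Situation", "Interpret the social context linking the observed elements."),
        ("Norm", "Apply relevant social norms to reach a judgment."),
    ]
    lines = [f"Question: {question}", ""]
    for i, (name, instruction) in enumerate(stages, start=1):
        lines.append(f"Step {i} ({name}): {instruction}")
    lines.append("Answer:")
    return "\n".join(lines)
```

The resulting string would be sent to a vision-language model together with the image; a flat CoT baseline would replace the three staged steps with a single "think step by step" instruction.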


Article 355

Title@2025-07-27 (7): Length Representations in Large Language Models

Title: Length Representations in Large Language Models

Längendarstellungen in großen Sprachmodellen

大语言模式中的长长代表 2507.20398v1

Authors (5): Sangjun Moon, Dasom Choi, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura

Large language models (LLMs) have shown remarkable capabilities across various tasks, learned from massive amounts of text-based data. Although LLMs can control output sequence length, particularly in instruction-based settings, the internal mechanisms behind this control remain unexplored. In this study, we provide empirical evidence on how output sequence length information is encoded within the internal representations of LLMs. In particular, our findings show that multi-head attention mechanisms are critical in determining output sequence length, which can be adjusted in a disentangled manner. By scaling specific hidden units within the model, we can control the output sequence length without losing the informativeness of the generated text, indicating that length information is partially disentangled from semantic information. Moreover, some hidden units become increasingly active as prompts become more length-specific, reflecting the model’s internal awareness of this attribute. Our findings suggest that LLMs have learned robust and adaptable internal mechanisms for controlling output length without any external control.


Article 356

Title@2025-07-27 (7): Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation

Title: Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation

Multi-Agent Retrieval-Augmented Framework for Evidence-based Counterspeech Against Health Misinformation

以证据为依据的反健康错误信息反言多证据检索强化框架 2507.07307v2

Authors (6): Anirban Saha Anik, Xiaoying Song, Elliott Wang, Bryan Wang, Bengisu Yarimbas, Lingzi Hong

Large language models (LLMs) combined with Retrieval-Augmented Generation (RAG) have demonstrated powerful capabilities in generating counterspeech against misinformation. However, current studies rely on limited evidence and offer less control over final outputs. To address these challenges, we propose a Multi-agent Retrieval-Augmented Framework to generate counterspeech against health misinformation, incorporating multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement. Our approach integrates both static and dynamic evidence, ensuring that the generated counterspeech is relevant, well-grounded, and up-to-date. Our method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy, demonstrating its effectiveness in generating high-quality counterspeech. To further validate our approach, we conduct ablation studies to verify the necessity of each component in our framework. Furthermore, cross evaluations show that our system generalizes well across diverse health misinformation topics and datasets. Human evaluations reveal that refinement significantly enhances counterspeech quality and is preferred by human raters.


Article 357

Title@2025-07-27 (7): Memorization: A Close Look at Books

Title: Memorization: A Close Look at Books

Auswendiglernen: Ein genauer Blick auf Bücher

记忆化:对书籍的近视 2504.12549v2

Authors (5): Iris Ma, Ian Domingo, Alberto Krone-Martins, Pierre Baldi, Cristina V. Lopes

To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models, and the “prefix-prompting” extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice’s Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.


Article 358

Title@2025-07-27 (7): Scaling Analysis of Interleaved Speech-Text Language Models

Title: Scaling Analysis of Interleaved Speech-Text Language Models

Skalierungsanalyse interleaved Speech-Text Language Models

剖分间语音-文字语言模式扩大分析 2504.02398v2

Authors (4): Gallil Maimon, Michael Hassid, Amit Roth, Yossi Adi

Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. It predicts that SLMs require much more compute and data compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question - “Do interleaved SLMs scale more efficiently than textless-SLMs?” In this paper we answer a resounding yes! We conduct scaling analysis of interleaved SLMs by training several dozen and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the scaling dynamics significantly differ from textless-SLMs, suggesting one should allocate notably more of the compute budget to increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Results suggest that our scaled up model achieves comparable semantic speech performance to leading models, while using less compute and data. We open source models, samples, and data - https://pages.cs.huji.ac.il/adiyoss-lab/sims/ .


Article 359

Title@2025-07-27 (7): RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing

Title: RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing

RMTBench: Benchmarking von LLMs durch Multi-Turn-Benutzer-Centric-Rollenspiel

RMTBench:通过多发用户中心发挥作用,确定LLMs基准 2507.20352v1

Authors (13): Hao Xiang, Tianyi Tang, Yang Su, Bowen Yu, An Yang, Fei Huang, Yichang Zhang, Yaojie Lu, Hongyu Lin, Xianpei Han, Jingren Zhou, Junyang Lin, Le Sun

Recent advancements in Large Language Models (LLMs) have shown outstanding potential for role-playing applications. Evaluating these capabilities is becoming crucial yet remains challenging. Existing benchmarks mostly adopt a character-centric approach, simplify user-character interactions to isolated Q&A tasks, and fail to reflect real-world applications. To address this limitation, we introduce RMTBench, a comprehensive user-centric bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. RMTBench includes custom characters with detailed backgrounds and abstract characters defined by simple traits, enabling evaluation across various user scenarios. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. Furthermore, we construct an authentic multi-turn dialogue simulation mechanism. With carefully selected evaluation dimensions and LLM-based scoring, this mechanism captures the complex intention of conversations between the user and the character. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements, offering a more effective framework for assessing role-playing capabilities in LLMs. All code and datasets will be released soon.


Article 360

Title@2025-07-27 (7): DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns

Title: DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns

DYNARTmo: Ein dynamisches Artikulationsmodell zur Visualisierung von Sprachbewegungsmustern

DYNARTmo:语音移动模式视觉化动态脉动模型 2507.20343v1

Authors (1): Bernd J. Kröger

We present DYNARTmo, a dynamic articulatory model designed to visualize speech articulation processes in a two-dimensional midsagittal plane. The model builds upon the UK-DYNAMO framework and integrates principles of articulatory underspecification, segmental and gestural control, and coarticulation. DYNARTmo simulates six key articulators based on ten continuous and six discrete control parameters, allowing for the generation of both vocalic and consonantal articulatory configurations. The current implementation is embedded in a web-based application (SpeechArticulationTrainer) that includes sagittal, glottal, and palatal views, making it suitable for use in phonetics education and speech therapy. While this paper focuses on the static modeling aspects, future work will address dynamic movement generation and integration with articulatory-acoustic modules.


Article 361

Title@2025-07-27 (7): FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Title: FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

FMSD-TTS: Wenige Aufnahmen Multi-Speeaker Multi-Dialekt Text-zu-Speech-Synthese für Ü-Tsang, Amdo und Kham Speech Dataset Generation

FMSD-TTS:为于赞、阿姆多和康言语数据集制作而制作的微小多声多声多功能多语音文本到语音合成合成 2505.14351v2

Authors (10): Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi

Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.


Article 362

Title@2025-07-27 (7): ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios

Title: ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios

ELMES: Ein automatisierter Rahmen für die Bewertung großer Sprachmodelle in Bildungsszenarien

ELMES:评估教育情景中大语言模式自动框架 2507.22947v1

Authors (12): Shou’ang Wei, Xinyun Wang, Shuzhen Bi, Jian Chen, Ruijia Li, Bo Jiang, Xin Lin, Min Zhang, Yu Song, BingDong Li, Aimin Zhou, Hao Hao

The emergence of Large Language Models (LLMs) presents transformative opportunities for education, generating numerous novel application scenarios. However, significant challenges remain: evaluation metrics vary substantially across different educational scenarios, while many emerging scenarios lack appropriate assessment metrics. Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities. To address this gap, we introduce ELMES, an open-source automated evaluation framework specifically designed for assessing LLMs in educational settings. ELMES features a modular architecture that enables researchers to create dynamic, multi-agent dialogues through simple configuration files, facilitating flexible scenario design without requiring extensive programming expertise. The framework incorporates a hybrid evaluation engine that objectively quantifies traditionally subjective pedagogical metrics using an LLM-as-a-Judge methodology. We conduct systematic benchmarking of state-of-the-art LLMs across four critical educational scenarios: Knowledge Point Explanation, Guided Problem-Solving Teaching, Interdisciplinary Lesson Plan Generation, and Contextualized Question Generation, employing fine-grained metrics developed in collaboration with education specialists. Our results demonstrate distinct capability distributions among models, revealing context-specific strengths and limitations. ELMES provides educators and researchers with an accessible evaluation framework that significantly reduces adaptation barriers for diverse educational applications while advancing the practical implementation of LLMs in pedagogy. The framework is publicly available at https://github.com/sii-research/elmes.git.


Article 363

Title@2025-07-27 (7): What is Wrong with Perplexity for Long-context Language Modeling?

Title: What is Wrong with Perplexity for Long-context Language Modeling?

Was ist falsch an Verwirrung für Langkontext-Sprachenmodellierung?

长文本语言建模的复杂性有什么问题? 2410.23771v5

Authors (8): Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang

Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that LongPPL strongly correlates with performance on various long-context benchmarks (e.g., Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. Additionally, we introduce LongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. In summary, these contributions offer deeper insights into the limitations of PPL and present effective solutions for accurately evaluating and enhancing the long-context capabilities of LLMs. Code is available at https://github.com/PKU-ML/LongPPL.
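The contrast between averaging over all tokens and averaging over key tokens only can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: it assumes a token counts as "key" when its log-probability under the long context exceeds its log-probability under a truncated short context by more than a hypothetical `gain_threshold`, and it takes per-token log-probabilities as plain lists rather than computing them from a model.

```python
import math

def perplexity(logprobs):
    """Standard perplexity: exp of the negative mean log-probability."""
    return math.exp(-sum(logprobs) / len(logprobs))

def long_ppl(long_logprobs, short_logprobs, gain_threshold=2.0):
    """Perplexity restricted to tokens that benefit from the long context.

    long_logprobs / short_logprobs: per-token log-probabilities of the same
        tokens, scored with the full context and a truncated context.
    gain_threshold: hypothetical cutoff on the long-minus-short gain.
    Returns None if no token qualifies as a key token.
    """
    key = [
        lp_long
        for lp_long, lp_short in zip(long_logprobs, short_logprobs)
        if lp_long - lp_short > gain_threshold
    ]
    if not key:
        return None
    return perplexity(key)
```

In this sketch, tokens that are equally easy with or without the long context are dropped from the average, so the score reflects only the positions where long-range information matters.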


Article 364

Title@2025-07-27 (7): Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation

Title: Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation

Förderung dialektischer arabischer zu moderner arabischer Standard-Maschinenübersetzung

向现代标准阿拉伯文机器翻译推广阿拉伯语 2507.20301v1

Authors (3): Abdullah Alabdullah, Lifeng Han, Chenghua Lin

Dialectal Arabic (DA) poses a persistent challenge for natural language processing (NLP), as most everyday communication in the Arab world occurs in dialects that diverge significantly from Modern Standard Arabic (MSA). This linguistic divide limits access to digital services and educational resources and impedes progress in Arabic machine translation. This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects, particularly in low-resource and computationally constrained settings: a comprehensive evaluation of training-free prompting techniques, and the development of a resource-efficient fine-tuning pipeline. Our evaluation of prompting strategies across six large language models (LLMs) found that few-shot prompting consistently outperformed zero-shot, chain-of-thought, and our proposed Ara-TEaR method. GPT-4o achieved the highest performance across all prompting settings. For fine-tuning, a quantized Gemma2-9B model achieved a CHrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Joint multi-dialect trained models outperformed single-dialect counterparts by over 10% CHrF++, and 4-bit quantization reduced memory usage by 60% with less than 1% performance loss. The results and insights of our experiments offer a practical blueprint for improving dialectal inclusion in Arabic NLP, showing that high-quality DA-MSA machine translation is achievable even with limited resources and paving the way for more inclusive language technologies.


Article 365

Title@2025-07-27 (7): Real-time Factuality Assessment from Adversarial Feedback

Title: Real-time Factuality Assessment from Adversarial Feedback

Echtzeit-Faktualitätsbeurteilung aus dem Adversarial Feedback

从反反向反馈反馈中实时进行实况评估 2410.14651v3

Authors (3): Sanxing Chen, Yukun Huang, Bhuwan Dhingra

We show that existing evaluations for assessing the factuality of news from conventional sources, such as claims on fact-checking websites, result in high accuracies over time for LLM-based detectors, even after their knowledge cutoffs. This suggests that recent popular false information from such sources can be easily identified due to its likely presence in pre-training/retrieval corpora or the emergence of salient, yet shallow, patterns in these datasets. Instead, we argue that a proper factuality evaluation dataset should test a model’s ability to reason about current events by retrieving and reading related evidence. To this end, we develop a novel pipeline that leverages natural language feedback from a RAG-based detector to iteratively modify real-time news into deceptive variants that challenge LLMs. Our iterative rewrite decreases the binary classification ROC-AUC by an absolute 17.5 percent for a strong RAG-based GPT-4o detector. Our experiments reveal the important role of RAG in both evaluating and generating challenging news examples, as retrieval-free LLM detectors are vulnerable to unseen events and adversarial attacks, while feedback from RAG-based evaluation helps discover more deceitful patterns.


Article 366

Title@2025-07-27 (7): SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration

Title: SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration

SciToolAgent: Ein wissensbasierter wissenschaftlicher Agent für Multi-Tool-Integration

SciToolAgent: 多工具整合知识图表驱动科学代理 2507.20280v1

Authors (6): Keyan Ding, Jing Yu, Junjie Huang, Yuchen Yang, Qiang Zhang, Huajun Chen

Scientific research increasingly relies on specialized computational tools, yet effectively utilizing these tools demands substantial domain expertise. While Large Language Models (LLMs) show promise in tool automation, they struggle to seamlessly integrate and orchestrate multiple tools for complex scientific workflows. Here, we present SciToolAgent, an LLM-powered agent that automates hundreds of scientific tools across biology, chemistry, and materials science. At its core, SciToolAgent leverages a scientific tool knowledge graph that enables intelligent tool selection and execution through graph-based retrieval-augmented generation. The agent also incorporates a comprehensive safety-checking module to ensure responsible and ethical tool usage. Extensive evaluations on a curated benchmark demonstrate that SciToolAgent significantly outperforms existing approaches. Case studies in protein engineering, chemical reactivity prediction, chemical synthesis, and metal-organic framework screening further demonstrate SciToolAgent’s capability to automate complex scientific workflows, making advanced research tools accessible to both experts and non-experts.


Article 367

Title@2025-07-27 (7): What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations

Title: What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations

Welche Sprache(n) denkt Aya-23? Wie Mehrsprachigkeit die Repräsentationen der internen Sprache beeinflusst

Aya-23 思考什么语言?多语言如何影响内部语言代表性? 2507.20279v1

Authors (4): Katharina Trinley, Toshiki Nakai, Tatiana Anikina, Tanja Baeumel

Large language models (LLMs) excel at multilingual tasks, yet their internal language processing remains poorly understood. We analyze how Aya-23-8B, a decoder-only LLM trained on balanced multilingual data, handles code-mixed, cloze, and translation tasks compared to predominantly monolingual models like Llama 3 and Chinese-LLaMA-2. Using logit lens and neuron specialization analyses, we find: (1) Aya-23 activates typologically related language representations during translation, unlike English-centric models that rely on a single pivot language; (2) code-mixed neuron activation patterns vary with mixing rates and are shaped more by the base language than the mixed-in one; and (3) Aya-23’s language-specific neurons for code-mixed inputs concentrate in final layers, diverging from prior findings on decoder-only models. Neuron overlap analysis further shows that script similarity and typological relations impact processing across model types. These findings reveal how multilingual training shapes LLM internals and inform future cross-lingual transfer research.


Article 368

Title@2025-07-27 (7): Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Title: Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Agent-Fin-R1: Verbesserung der Finanzintelligenz durch Domain-Expertise, Trainingseffizienz und Advanced Reasoning

Agentar Fin-Fin-R1:通过域域专门知识、培训效率和高级理由加强金融情报 2507.16802v4

Authors (13): Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Jingze Song, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wei Wang, Peng Zhang

Large Language Models (LLMs) exhibit considerable promise in financial applications; however, prevailing models frequently demonstrate limitations when confronted with scenarios that necessitate sophisticated reasoning capabilities, stringent trustworthiness criteria, and efficient adaptation to domain-specific requirements. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task label system with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, a two-stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including Fineva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA-diamond. To thoroughly assess real-world deployment capabilities, we propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications. The Finova bench is available at https://github.com/antgroup/Finova.


Article 369

Title@2025-07-27 (7): MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning

Title: MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning

MoL-RL: Destillieren von mehrstufigem Umweltfeedback in LLMs zur feedbackunabhängigen Begründung

MoL-RL:将多层环境反馈保留到LLMs,用于提供反馈-独立理由 2507.20278v1

Authors (5): Kang Yang, Jingxue Chen, Qingkun Tang, Tianxiang Zhang, Qianchun Lu

Large language models (LLMs) face significant challenges in effectively leveraging sequential environmental feedback (EF) signals, such as natural language evaluations, for feedback-independent chain-of-thought (CoT) reasoning. Existing approaches either convert EF into scalar rewards, losing rich contextual information, or employ refinement datasets, failing to exploit the multi-step and discrete nature of EF interactions. To address these limitations, we propose MoL-RL, a novel training paradigm that integrates multi-step EF signals into LLMs through a dual-objective optimization framework. Our method combines MoL (Mixture-of-Losses) continual training, which decouples domain-specific EF signals (optimized via cross-entropy loss) and general language capabilities (preserved via Kullback-Leibler divergence), with GRPO-based post-training to distill sequential EF interactions into single-step inferences. This synergy enables robust feedback-independent reasoning without relying on external feedback loops. Experimental results on mathematical reasoning (MATH-500, AIME24/AIME25) and code generation (CodeAgent-Test) benchmarks demonstrate that MoL-RL achieves state-of-the-art performance with the Qwen3-8B model, while maintaining strong generalization across model scales (Qwen3-4B). This work provides a promising approach for leveraging multi-step textual feedback to enhance LLMs’ reasoning capabilities in diverse domains.
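The dual-objective idea behind MoL can be illustrated with a small sketch: domain-specific samples contribute a cross-entropy term, while general-language samples contribute a KL term that keeps the model close to a reference distribution. The per-sample routing flag `is_domain`, the KL direction (reference to model), and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability the model assigns to the target token."""
    return -math.log(probs[target_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixture_of_losses(model_probs, is_domain, target_index=None, reference_probs=None):
    """Route one token prediction to the loss matching its data source.

    is_domain=True: supervise with cross-entropy against target_index.
    is_domain=False: penalize drift from reference_probs (e.g. the frozen
        base model's distribution) with a KL term.
    """
    if is_domain:
        return cross_entropy(model_probs, target_index)
    return kl_divergence(reference_probs, model_probs)
```

With this routing, a model that matches the reference distribution on general text incurs zero loss there, so gradient signal on that data only appears when general capabilities start to drift.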


Article 370

Title@2025-07-27 (7): ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

Title: ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

ChildGuard: Ein spezieller Datensatz zur Bekämpfung von kindgewordener Hassrede

儿童指南:打击针对儿童的仇恨言论专门数据集 2506.21613v2

Authors (6): Gautam Siddharth Kashyap, Mohammad Anas Azeez, Rafiq Ali, Zohaib Hasan Siddiqui, Jiechao Gao, Usman Naseem

Hate speech targeting children on social media is a serious and growing problem, yet current NLP systems struggle to detect it effectively. This gap exists mainly because existing datasets focus on adults, lack age-specific labels, miss nuanced linguistic cues, and are often too small for robust modeling. To address this, we introduce ChildGuard, the first large-scale English dataset dedicated to hate speech aimed at children. It contains 351,877 annotated examples from X (formerly Twitter), Reddit, and YouTube, labeled by three age groups: younger children (under 11), pre-teens (11–12), and teens (13–17). The dataset is split into two subsets for fine-grained analysis: a contextual subset (157K) focusing on discourse-level features, and a lexical subset (194K) emphasizing word-level sentiment and vocabulary. Benchmarking state-of-the-art hate speech models on ChildGuard reveals notable drops in performance, highlighting the challenges of detecting child-directed hate speech.


Article 371

Title: EMBRACE: Shaping Inclusive Opinion Representation by Aligning Implicit Conversations with Social Norms

EMBRACE: Inclusive Opinion Representation gestalten, indem implizite Gespräche mit sozialen Normen ausgerichtet werden

EMBRACE:通过与社会规范的关联性交流,形成包容性的见解代表制 2507.20264v1

Authors (2): Abeer Aldayel, Areej Alokaili

Shaping inclusive representations that embrace diversity and ensure fair participation and reflections of values is at the core of many conversation-based models. However, many existing methods rely on surface inclusion using mention of user demographics or behavioral attributes of social groups. Such methods overlook the nuanced, implicit expression of opinion embedded in conversations. Furthermore, the over-reliance on overt cues can exacerbate misalignment and reinforce harmful or stereotypical representations in model outputs. Thus, we took a step back and recognized that equitable inclusion needs to account for the implicit expression of opinion and use the stance of responses to validate the normative alignment. This study aims to evaluate how opinions are represented in NLP or computational models by introducing an alignment evaluation framework that foregrounds implicit, often overlooked conversations and evaluates the normative social views and discourse. Our approach models the stance of responses as a proxy for the underlying opinion, enabling a considerate and reflective representation of diverse social viewpoints. We evaluate the framework using both (i) positive-unlabeled (PU) online learning with base classifiers, and (ii) instruction-tuned language models to assess post-training alignment. Through this, we provide a lens on how implicit opinions are (mis)represented and offer a pathway toward more inclusive model behavior.


Article 372

Title@2025-07-27 (7): Post-Completion Learning for Language Models

Title: Post-Completion Learning for Language Models

Post-Completion-Lernen für Sprachmodelle

语文模式完成后学习 2507.20252v1

Authors (7): Xiang Fei, Siqi Wang, Shu Wei, Yuxiang Nie, Wei Shi, Hao Feng, Can Huang

Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (EOS) token, overlooking the potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion to enhance both reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point. To fully utilize this post-completion space, we design a white-box reinforcement learning method: the model evaluates the output content according to the reward rules, and the resulting score is then calculated and aligned with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mix it with RL training to achieve multi-objective hybrid optimization. Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.

Article 373

Title@2025-07-27 (7): Modeling Professionalism in Expert Questioning through Linguistic Differentiation

Title: Modeling Professionalism in Expert Questioning through Linguistic Differentiation

Modellierung von Professionalität in der Expertenbefragung durch sprachliche Differenzierung

通过语言差异问题专家提问的示范专业精神 2507.20249v1

Authors (2): Giulia D’Agostino, Chung-Chi Chen

Professionalism is a crucial yet underexplored dimension of expert communication, particularly in high-stakes domains like finance. This paper investigates how linguistic features can be leveraged to model and evaluate professionalism in expert questioning. We introduce a novel annotation framework to quantify structural and pragmatic elements in financial analyst questions, such as discourse regulators, prefaces, and request types. Using both human-authored and large language model (LLM)-generated questions, we construct two datasets: one annotated for perceived professionalism and one labeled by question origin. We show that the same linguistic features correlate strongly with both human judgments and authorship origin, suggesting a shared stylistic foundation. Furthermore, a classifier trained solely on these interpretable features outperforms gemini-2.0 and SVM baselines in distinguishing expert-authored questions. Our findings demonstrate that professionalism is a learnable, domain-general construct that can be captured through linguistically grounded modeling.

Article 374

Title@2025-07-27 (7): Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers

Title: Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers

Contrast-CAT: Kontrastierende Aktivierungen für verbesserte Interpretierbarkeit in Transformer-basierten Textklassifikatoren

反对-CAT:在基于变换器的文本分类中增强解释力的对比活动 2507.21186v1

Authors (3): Sungmin Han, Jeonghyun Lee, Sangkyun Lee

Transformers have profoundly influenced AI research, but explaining their decisions remains challenging – even for relatively simple tasks such as classification – which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that Contrast-CAT consistently outperforms state-of-the-art methods. Notably, under the MoRF setting, it achieves average improvements of ×1.30 in AOPC and ×2.25 in LOdds over the most competitive methods, demonstrating its effectiveness in enhancing interpretability for transformer-based text classification.

Article 375

Title@2025-07-27 (7): Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models

Title: Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models

Reframe Your Life Story: Interaktiver Erzähltherapeut und innovative Moment-Assessment mit großen Sprachmodellen

重构你的生活故事:与大语言模式互动叙述治疗师和创新时间评估 2507.20241v1

Authors (9): Yi Feng, Jiaqi Wang, Wenxuan Zhang, Zhuang Chen, Yutong Shen, Xiyao Xiao, Minlie Huang, Liping Jing, Jian Yu

Recent progress in large language models (LLMs) has opened new possibilities for mental health support, yet current approaches lack realism in simulating specialized psychotherapy and fail to capture therapeutic progression over time. Narrative therapy, which helps individuals transform problematic life stories into empowering alternatives, remains underutilized due to limited access and social stigma. We address these limitations through a comprehensive framework with two core components. First, INT (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate expert-like responses. Second, IMA (Innovative Moment Assessment) provides a therapy-centric evaluation method that quantifies effectiveness by tracking “Innovative Moments” (IMs), critical narrative shifts in client speech signaling therapy progress. Experimental results on 260 simulated clients and 230 human participants reveal that INT consistently outperforms standard LLMs in therapeutic quality and depth. We further demonstrate the effectiveness of INT in synthesizing high-quality support conversations to facilitate social applications.

Article 376

Title@2025-07-27 (7): DoubleDipper: Improving Long-Context LLMs via Context Recycling

Title: DoubleDipper: Improving Long-Context LLMs via Context Recycling

DoubleDipper: Verbesserung der Langkontext-LLMs über Kontext-Recycling

双重顶点:通过上下文再循环改进长文本LLMs 2406.13632v4

Authors (11): Arie Cattan, Alon Jacovi, Alex Fabrikant, Jonathan Herzig, Roee Aharoni, Hannah Rashkin, Dror Marcus, Avinatan Hassidim, Yossi Matias, Idan Szpektor, Avi Caciularu

Despite recent advancements in Large Language Models (LLMs), their performance on tasks involving long contexts remains sub-optimal. In this work, we propose DoubleDipper, a novel In-Context-Learning method that automatically generates few-shot examples for long context QA tasks by recycling contexts. Specifically, given a long input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations are leveraging the same context as the target query while only adding a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to explicitly identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method on multiple LLMs and obtain substantial improvements (+16 absolute points on average across models) on various QA datasets with long context. Surprisingly, despite introducing only single-hop ICL examples, LLMs successfully generalize to multi-hop long-context QA using our approach.
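
The prompt layout described above (the long context placed once, followed by recycled question-answer demonstrations that first name their supporting paragraphs) can be sketched roughly as follows. This is only an illustration: the `Demo` class, the `build_prompt` function, and the exact field labels are hypothetical, and the demonstrations are supplied by hand here, whereas the abstract says they are generated automatically from the context.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One recycled demonstration (hypothetical structure, not from the paper)."""
    question: str
    paragraph_ids: list[int]
    answer: str


def build_prompt(paragraphs: list[str], demos: list[Demo], query: str) -> str:
    """Place the long context once, then the demos, then the target query."""
    lines = ["Context:"]
    for i, text in enumerate(paragraphs, start=1):
        lines.append(f"[{i}] {text}")
    lines.append("")
    for demo in demos:
        ids = ", ".join(str(i) for i in demo.paragraph_ids)
        lines.append(f"Question: {demo.question}")
        # Each demo names its supporting paragraphs before giving the answer.
        lines.append(f"Relevant paragraphs: {ids}")
        lines.append(f"Answer: {demo.answer}")
        lines.append("")
    lines.append(f"Question: {query}")
    # The model is expected to continue with paragraph ids, then the answer.
    lines.append("Relevant paragraphs:")
    return "\n".join(lines)
```

Because every demonstration refers back to the same numbered paragraphs, each one adds only a few short lines to the prompt rather than another copy of the context.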

Article 377

Title@2025-07-27 (7): Understanding Learner-LLM Chatbot Interactions and the Impact of Prompting Guidelines

Title: Understanding Learner-LLM Chatbot Interactions and the Impact of Prompting Guidelines

Lernende-LLM-Chatbot-Interaktionen verstehen und die Auswirkungen von Sofortrichtlinien verstehen

了解学习者-LLLM 聊天室互动和推动准则的影响 2504.07840v3

Authors (16): Cansu Koyuturk, Emily Theophilou, Sabrina Patania, Gregor Donabauer, Andrea Martinenghi, Chiara Antico, Alessia Telari, Alessia Testa, Sathya Bursic, Franca Garzotto, Davinia Hernandez-Leo, Udo Kruschwitz, Davide Taibi, Simona Amenta, Martin Ruskov, Dimitri Ognibene

Large Language Models (LLMs) have transformed human-computer interaction by enabling natural language-based communication with AI-powered chatbots. These models are designed to be intuitive and user-friendly, allowing users to articulate requests with minimal effort. However, despite their accessibility, studies reveal that users often struggle with effective prompting, resulting in inefficient responses. Existing research has highlighted both the limitations of LLMs in interpreting vague or poorly structured prompts and the difficulties users face in crafting precise queries. This study investigates learner-AI interactions through an educational experiment in which participants receive structured guidance on effective prompting. We introduce and compare three types of prompting guidelines: a task-specific framework developed through a structured methodology and two baseline approaches. To assess user behavior and prompting efficacy, we analyze a dataset of 642 interactions from 107 users. Using Von NeuMidas, an extended pragmatic annotation schema for LLM interaction analysis, we categorize common prompting errors and identify recurring behavioral patterns. We then evaluate the impact of different guidelines by examining changes in user behavior, adherence to prompting strategies, and the overall quality of AI-generated responses. Our findings provide a deeper understanding of how users engage with LLMs and the role of structured prompting guidance in enhancing AI-assisted communication. By comparing different instructional frameworks, we offer insights into more effective approaches for improving user competency in AI interactions, with implications for AI literacy, chatbot usability, and the design of more responsive AI systems.

Article 378

Title@2025-07-27 (7): Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation

Title: Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation

Co-NAML-LSTUR: Ein kombiniertes Modell mit attentivem Multi-View-Lernen und Langzeit- und Kurzzeit-Benutzervertretungen für News-Empfehlungen

NAML-LTUR:与多视学习和新闻建议长期及短期用户代表相结合的综合模式 2507.20210v1

Authors (3): Minh Hoang Nguyen, Thuat Thien Nguyen, Minh Nhat Ta

News recommendation systems play a vital role in mitigating information overload by delivering personalized news content. A central challenge is to effectively model both multi-view news representations and the dynamic nature of user interests, which often span both short- and long-term preferences. Existing methods typically rely on single-view features of news articles (e.g., titles or categories) or fail to comprehensively capture user preferences across time scales. In this work, we propose Co-NAML-LSTUR, a hybrid news recommendation framework that integrates NAML for attentive multi-view news modeling and LSTUR for capturing both long- and short-term user representations. Our model also incorporates BERT-based word embeddings to enhance semantic feature extraction. We evaluate Co-NAML-LSTUR on two widely used benchmarks, MIND-small and MIND-large. Experimental results show that Co-NAML-LSTUR achieves substantial improvements over most state-of-the-art baselines on both MIND-small and MIND-large. These results demonstrate the effectiveness of combining multi-view news representations with dual-scale user modeling. The implementation of our model is publicly available at https://github.com/MinhNguyenDS/Co-NAML-LSTUR.

Article 379

Title@2025-07-27 (7): IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs

Title: IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs

IQ-Test für LLMs: Ein Bewertungsrahmen für die Entdeckung von Kernkompetenzen in LLMs

LLMLM的IQ测试:LLM中核心技能覆盖的评估框架 2507.20208v1

Authors (5): Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty

Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model’s overall skills. Specifically, as a community we lack understanding of how tasks relate to one another, what they measure in common, how they differ, or which ones are redundant. As a result, models are often assessed via a single score averaged across benchmarks, an approach that fails to capture the models’ holistic strengths and limitations. Here, we propose a new evaluation paradigm that uses factor analysis to identify latent skills driving performance across benchmarks. We apply this method to a comprehensive new leaderboard showcasing the performance of 60 LLMs on 44 tasks, and identify a small set of latent skills that largely explain performance. Finally, we turn these insights into practical tools that identify redundant tasks, aid in model selection, and profile models along each latent skill.

Article 380

Title: Cheap Learning: Maximising Performance of Language Models for Social Data Science Using Minimal Data

Günstiges Lernen: Maximierung der Leistungsfähigkeit von Sprachmodellen für die Sozialdatenwissenschaft mit minimalen Daten

廉价学习:利用最低数据使社会数据科学语言模型的绩效最大化 2401.12295v2

Authors (7): Leonardo Castro-Gonzalez, Yi-Ling Chung, Hannak Rose Kirk, John Francis, Angus R. Williams, Pica Johansson, Jonathan Bright

The field of machine learning has recently made significant progress in reducing the requirements for labelled training data when building new models. These ‘cheaper’ learning techniques hold significant potential for the social sciences, where development of large labelled training datasets is often a significant practical impediment to the use of machine learning for analytical tasks. In this article we review three ‘cheap’ techniques that have developed in recent years: weak supervision, transfer learning and prompt engineering. For the latter, we also review the particular case of zero-shot prompting of large language models. For each technique we provide a guide of how it works and demonstrate its application across six different realistic social science applications (two different tasks paired with three different dataset makeups). We show good performance for all techniques, and in particular we demonstrate how prompting of large language models can achieve high accuracy at very low cost. Our results are accompanied by a code repository to make it easy for others to duplicate our work and use it in their own research. Overall, our article is intended to stimulate further uptake of these techniques in the social sciences.

Article 381

Title@2025-07-27 (7): Diversity-Enhanced Reasoning for Subjective Questions

Title: Diversity-Enhanced Reasoning for Subjective Questions

Diversity-Enhanced Reasoning für subjektive Fragen

主观问题的多样性强化理由 2507.20187v1

Authors (4): Yumeng Wang, Zhiyuan Fan, Jiayu Liu, Yi R. Fung

Large reasoning models (LRMs) with long chain-of-thought (CoT) capabilities have shown strong performance on objective tasks, such as math reasoning and coding. However, their effectiveness on subjective questions that may have different responses from different perspectives is still limited by a tendency towards homogeneous reasoning, introduced by the reliance on a single ground truth in supervised fine-tuning and verifiable reward in reinforcement learning. Motivated by the finding that increasing role perspectives consistently improves performance, we propose MultiRole-R1, a diversity-enhanced framework with multiple role perspectives, to improve the accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an unsupervised data construction pipeline that generates reasoning chains that incorporate diverse role perspectives. We further employ reinforcement learning via Group Relative Policy Optimization (GRPO) with reward shaping, by taking diversity as a reward signal in addition to the verifiable reward. With specially designed reward functions, we successfully promote perspective diversity and lexical diversity, uncovering a positive relation between reasoning diversity and accuracy. Our experiment on six benchmarks demonstrates MultiRole-R1’s effectiveness and generalizability in enhancing both subjective and objective reasoning, showcasing the potential of diversity-enhanced training in LRMs.
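
To make the reward-shaping idea concrete, here is a minimal sketch of adding a diversity bonus to a verifiable reward and then computing GRPO-style group-relative advantages. The abstract does not specify the reward functions, so the type-token ratio used as the diversity signal, the `weight` parameter, and all function names are assumptions chosen for illustration.

```python
import statistics


def lexical_diversity(text: str) -> float:
    """Fraction of distinct tokens (type-token ratio); an assumed stand-in signal."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shaped_rewards(responses, correct, weight=0.5):
    """Verifiable reward (1 if correct, else 0) plus a weighted diversity bonus."""
    return [float(ok) + weight * lexical_diversity(r)
            for r, ok in zip(responses, correct)]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With two equally correct responses in a group, the more lexically varied one receives the positive advantage, which is the kind of pressure toward diverse reasoning the abstract describes.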

Article 382

Title@2025-07-27 (7): SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding

Title: SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding

SessionIntentBench: Ein Multi-Task Inter-Session Intention-Shift Modelling Benchmark für E-Commerce Kundenverhalten Verständnis

A. 会议内容:电子商务客户行为理解的多任务、多任务、跨会期、出于利益转移的电子商业客户行为理解示范基准 2507.20185v1

Authors (16): Yuqi Yang, Weiqi Wang, Baixuan Xu, Wei Fan, Qing Zong, Chunkit Chan, Zheye Deng, Xin Liu, Yifan Gao, Changlong Yu, Chen Luo, Yang Li, Zheng Li, Qingyu Yin, Bing Yin, Yangqiu Song

Session history is a common way of recording user interaction behavior throughout a browsing activity involving multiple products. For example, if a user clicks a product webpage and then leaves, it might be because certain features don’t satisfy the user, which serves as an important indicator of on-the-spot user preferences. However, all prior works fail to capture and model customer intention effectively because they exploit the available information insufficiently and use only surface information such as descriptions and titles. There is also a lack of data and a corresponding benchmark for explicitly modeling intention in E-commerce product purchase sessions. To address these issues, we introduce the concept of an intention tree and propose a dataset curation pipeline. With these, we construct a sibling multimodal benchmark, SessionIntentBench, that evaluates L(V)LMs’ capability to understand inter-session intention shift with four subtasks. With 1,952,177 intention entries, 1,132,145 session intention trajectories, and 13,003,664 available tasks mined using 10,905 sessions, we provide a scalable way to exploit the existing session data for customer intention understanding. We conduct human annotation to collect ground-truth labels for a subset of the collected data to form an evaluation gold set. Extensive experiments on the annotated data further confirm that current L(V)LMs fail to capture and utilize the intention across the complex session setting. Further analysis shows that injecting intention enhances LLMs’ performance.

Article 383

Title@2025-07-27 (7): SGPO: Self-Generated Preference Optimization based on Self-Improver

Title: SGPO: Self-Generated Preference Optimization based on Self-Improver

SGPO: Selbsterzeugte Preference-Optimierung auf Basis von Self-Improver

SGPO:基于自我改造的自发优惠优化 2507.20181v1

Authors (4): Hyeonji Lee, Daejin Jo, Seohwan Yun, Sungwoong Kim

Large language models (LLMs), despite their extensive pretraining on diverse datasets, require effective alignment to human preferences for practical and reliable deployment. Conventional alignment methods typically employ off-policy learning and depend on human-annotated datasets, which limits their broad applicability and introduces distribution shift issues during training. To address these challenges, we propose Self-Generated Preference Optimization based on Self-Improver (SGPO), an innovative alignment framework that leverages an on-policy self-improving mechanism. Specifically, the improver refines responses from a policy model to self-generate preference data for direct preference optimization (DPO) of the policy model. Here, the improver and policy are unified into a single model, and in order to generate higher-quality preference data, this self-improver learns to make incremental yet discernible improvements to the current responses by referencing supervised fine-tuning outputs. Experimental results on AlpacaEval 2.0 and Arena-Hard show that the proposed SGPO significantly improves performance over DPO and baseline self-improving methods without using external preference data.
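
The self-generated preference pairs feed into direct preference optimization. As a reference point, the standard DPO loss for a single pair can be sketched as below. The sketch assumes the improver's refined response is the chosen one and the original policy response is the rejected one (the abstract does not spell out this pairing), and the function name and default `beta` are illustrative only.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities.

    Assumption for this sketch: "chosen" is the improver's refined response and
    "rejected" is the original policy response.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss equals log 2 when the policy and reference agree on both responses, and it falls as the policy raises the chosen response's likelihood relative to the reference more than it raises the rejected one's.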

Article 384

Title@2025-07-27 (7): Checklist Engineering Empowers Multilingual LLM Judges

Title: Checklist Engineering Empowers Multilingual LLM Judges

Checkliste Engineering Empowers Mehrsprachige LLM-Richter

多语种LLM法官 2507.06774v2

Authors (2): Mohammad Ghiasvand Mohammadkhani, Hamid Beigy

Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators, a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, and efficiency. In this paper, we propose Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free framework that uses checklist intuition for multilingual evaluation with an open-source model. Experiments across multiple languages and three benchmark datasets, under both pointwise and pairwise settings, show that our method generally surpasses the baselines and performs on par with the GPT-4o model.

Article 385

Title@2025-07-27 (7): Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective

Title: Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective

Intersektionale Bias in japanischen großen Sprachmodellen aus einer kontextualisierten Perspektive

日本大语言模型中从背景角度分析的交叉比阿语 2506.12327v2

Authors (9): Hitomi Yanaka, Xinqi He, Jie Lu, Namgi Han, Sunjin Oh, Ryoma Kumon, Yuma Matsuoka, Katsuhiko Watabe, Yuko Itatsu

An increasing number of studies have examined the social bias of rapidly developed large language models (LLMs). Although most of these studies have focused on bias occurring in a single social attribute, research in social science has shown that social bias often occurs in the form of intersectionality – the constitutive and contextualized perspective on bias aroused by social attributes. In this study, we construct the Japanese benchmark inter-JBBQ, designed to evaluate intersectional bias in LLMs in the question-answering setting. Using inter-JBBQ to analyze GPT-4o and Swallow, we find that biased output varies with context even for the same combination of social attributes.

Article 386

Title@2025-07-27 (7): Goal Alignment in LLM-Based User Simulators for Conversational AI

Title: Goal Alignment in LLM-Based User Simulators for Conversational AI

Zielausrichtung in LLM-basierten Benutzersimulatoren für KI

在基于LLM的LLM用户模拟器中实现目标对齐 2507.20152v1

Authors (6): Shuhaib Mehri, Xiaocheng Yang, Takyoung Kim, Gokhan Tur, Shikib Mehri, Dilek Hakkani-Tür

User simulators are essential to conversational AI, enabling scalable agent development and evaluation through simulated interactions. While current Large Language Models (LLMs) have advanced user simulation capabilities, we reveal that they struggle to consistently demonstrate goal-oriented behavior across multi-turn conversations, a critical limitation that compromises their reliability in downstream applications. We introduce User Goal State Tracking (UGST), a novel framework that tracks user goal progression throughout conversations. Leveraging UGST, we present a three-stage methodology for developing user simulators that can autonomously track goal progression and reason to generate goal-aligned responses. Moreover, we establish comprehensive evaluation metrics for measuring goal alignment in user simulators, and demonstrate that our approach yields substantial improvements across two benchmarks (MultiWOZ 2.4 and τ-Bench). Our contributions address a critical gap in conversational AI and establish UGST as an essential framework for developing goal-aligned user simulators.

Article 387

Title@2025-07-27 (7): The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models

Title: The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models

The Policy Cliff: Eine theoretische Analyse von Belohnungs-Policy-Karten in großen Sprachmodellen

政策悬崖:大语言模式奖励政策图的理论分析 2507.20150v1

Authors (1): Xingcheng Xu

Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of optimizing rewards that may be incomplete or noisy, especially in the presence of action degeneracy. We extend this analysis from the fundamental single-reward setting to the more realistic multi-reward RL across diverse domains, showing how stability is governed by an “effective reward” aggregation mechanism. We also prove that entropy regularization restores policy stability at the cost of increased stochasticity. Our framework provides a unified explanation for recent empirical findings on deceptive reasoning, instruction-following trade-offs, and RLHF-induced sophistry, and is further validated through perturbation experiments in multi-reward RL. This work advances policy-stability analysis from empirical heuristics towards a principled theory, offering essential insights for designing safer and more trustworthy AI systems.
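
The brittleness claim can be illustrated with a toy single-step example: when two actions are nearly tied, the reward-maximizing (greedy) policy jumps discontinuously under a tiny reward perturbation, while the entropy-regularized optimum, which for a single step is a softmax over rewards, barely moves. This is a simplified bandit sketch with made-up reward values, not the paper's formal setting; all names are illustrative.

```python
import math


def greedy_policy(rewards):
    """Unregularized optimum: all probability on the highest-reward action."""
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    return [1.0 if i == best else 0.0 for i in range(len(rewards))]


def softmax_policy(rewards, temperature=1.0):
    """Entropy-regularized optimum: probability proportional to exp(r / temperature)."""
    top = max(rewards)
    weights = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def l1_distance(p, q):
    """Total absolute difference between two action distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Illustrative rewards (not from the paper): two nearly tied actions.
before = [1.0, 1.0 - 1e-6]
after = [1.0 - 1e-6, 1.0]  # a tiny perturbation flips the ranking

greedy_jump = l1_distance(greedy_policy(before), greedy_policy(after))
soft_jump = l1_distance(softmax_policy(before), softmax_policy(after))
```

Here `greedy_jump` is the maximum possible L1 distance between two distributions, while `soft_jump` stays on the order of the perturbation itself. The price is that the softmax policy splits its probability roughly evenly between the two actions, matching the stability-for-stochasticity trade-off stated in the abstract.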

Article 388

Title@2025-07-27 (7): Multi-Agent Interactive Question Generation Framework for Long Document Understanding

Title: Multi-Agent Interactive Question Generation Framework for Long Document Understanding

Multi-Agent Interactive Question Generierung Framework for Long Document Understanding

长期文件理解问题多机构互动问题生成框架 2507.20145v1

Authors (9): Kesen Wang, Daulet Toibazar, Abdulrahman Alfulayt, Abdulaziz S. Albadawi, Ranya A. Alkahtani, Asma A. Ibrahim, Haneen A. Alhomoud, Sherif Mohamed, Pedro J. Moreno

Document Understanding (DU) in long-context scenarios with complex layouts remains a significant challenge in vision-language research. Although Large Vision-Language Models (LVLMs) excel at short-context DU tasks, their performance declines in long-context settings. A key limitation is the scarcity of fine-grained training data, particularly for low-resource languages such as Arabic. Existing state-of-the-art techniques rely heavily on human annotation, which is costly and inefficient. We propose a fully automated, multi-agent interactive framework to generate long-context questions efficiently. Our approach generates high-quality single- and multi-page questions for extensive English and Arabic documents, covering hundreds of pages across diverse domains. This facilitates the development of LVLMs with enhanced long-context understanding ability. Experimental results show that our generated English and Arabic questions (AraEngLongBench) are quite challenging for major open- and closed-source LVLMs. The code and data proposed in this work can be found at https://github.com/wangk0b/Multi_Agentic_QA_Long_Doc.git. Sample Question and Answer (QA) pairs and structured system prompts can be found in the Appendix.

Article 389

Title: Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG

Multi-Stage Verifikations-Centric Framework zur Eindämmung der Halluzination in Multi-Modal RAG

多模式RAG中减轻幻觉多阶段核查-中心框架 2507.20136v1

Authors (5): Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim

This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate, especially when faced with egocentric imagery, long-tail entities, and complex, multi-hop questions. This issue is particularly problematic in real-world applications where users pose fact-seeking queries that demand high factual accuracy across diverse modalities. To tackle this, we propose a robust, multi-stage framework that prioritizes factual accuracy and truthfulness over completeness. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, dual-pathway generation, and post-hoc verification. This conservative strategy is designed to minimize hallucinations, which incur a severe penalty in the competition’s scoring metric. Our approach achieved 3rd place in Task 1, demonstrating the effectiveness of prioritizing answer reliability in complex multi-modal RAG systems. Our implementation is available at https://github.com/Breezelled/KDD-Cup-2025-Meta-CRAG-MM.

Article 390

Title@2025-07-27 (7): EvoSLD: Automated Neural Scaling Law Discovery With Large Language Models

Title: EvoSLD: Automated Neural Scaling Law Discovery With Large Language Models

EvoSLD: Automatisierte Neural Scaling Law Discovery mit großen Sprachmodellen

EvoSLD: 用大语言模型发现自动神经放大法 2507.21184v1

Authors (4): Haowei Lin, Xiangyu Wang, Jianzhu Ma, Yitao Liang

Scaling laws are fundamental mathematical relationships that predict how neural network performance evolves with changes in variables such as model size, dataset size, and computational resources. Traditionally, discovering these laws requires extensive human expertise and manual experimentation. We introduce EvoSLD, an automated framework for Scaling Law Discovery (SLD) that leverages evolutionary algorithms guided by Large Language Models (LLMs) to co-evolve symbolic expressions and their optimization routines. Formulated to handle scaling variables, control variables, and response metrics across diverse experimental settings, EvoSLD searches for parsimonious, universal functional forms that minimize fitting errors on grouped data subsets. Evaluated on five real-world scenarios from recent literature, EvoSLD rediscovers exact human-derived laws in two cases and surpasses them in others, achieving up to orders-of-magnitude reductions in normalized mean squared error on held-out test sets. Compared to baselines like symbolic regression and ablated variants, EvoSLD demonstrates superior accuracy, interpretability, and efficiency, highlighting its potential to accelerate AI research. Code is available at https://github.com/linhaowei1/SLD.
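
For readers unfamiliar with the evaluation setup, the sketch below shows the kind of quantity being compared: a candidate law is fitted on smaller scales and scored by normalized mean squared error on held-out larger scales. It uses a plain power law fitted by least squares in log-log space, which is a hand-picked functional form, not EvoSLD's LLM-guided evolutionary search. The data are synthetic, and the normalization (squared error divided by the variance of the targets) is an assumption about how NMSE is defined.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    a = math.exp(y_mean - slope * x_mean)
    return a, -slope


def nmse(targets, predictions):
    """Squared error normalized by the variance of the targets (assumed definition)."""
    mean = sum(targets) / len(targets)
    sse = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return sse / sum((t - mean) ** 2 for t in targets)


# Synthetic data generated from loss = 5 * N^-0.1 (not from the paper).
train_sizes = [1e6, 1e7, 1e8]
train_losses = [5.0 * n ** -0.1 for n in train_sizes]
a, b = fit_power_law(train_sizes, train_losses)

# Held-out larger scales test whether the fitted law extrapolates.
test_sizes = [1e9, 1e10]
test_losses = [5.0 * n ** -0.1 for n in test_sizes]
test_error = nmse(test_losses, [a * n ** -b for n in test_sizes])
```

Since the synthetic data follow the fitted form exactly, the held-out error here is essentially zero. On real measurements, a mismatched functional form shows up as a large held-out NMSE, which is the gap a search over symbolic expressions aims to close.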

Article 391

Title@2025-07-27 (7): When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

Title: When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

Wann funktioniert Metadata Conditioning (NOT) für Sprachmodell-Vorschulungen? Eine Studie mit kontextfreien Grammatiken

元数据条件(NOT)何时能为语言示范培训前培训提供语言示范?无背景语法研究 2504.17562v2

Authors (10): Rei Higuchi, Ryotaro Kawata, Naoki Nishikawa, Kazusato Oko, Shoichiro Yamaguchi, Sosuke Kobayashi, Seiya Tokui, Kohei Hayashi, Daisuke Okanohara, Taiji Suzuki

The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning of texts in the pre-training data, making it easier for the model to access latent semantics before observing the entire text. Previous studies have reported that this technique actually improves the performance of trained models in downstream tasks; however, this improvement has been observed only in specific downstream tasks, without consistent enhancement in average next-token prediction loss. To understand this phenomenon, we closely investigate how prepending metadata during pre-training affects model performance by examining its behavior using artificial data. Interestingly, we found that this approach produces both positive and negative effects on the downstream tasks. We demonstrate that the effectiveness of the approach depends on whether latent semantics can be inferred from the downstream task’s prompt. Specifically, through investigations using data generated by probabilistic context-free grammars, we show that training with metadata helps improve the model’s performance when the given context is long enough to infer the latent semantics. In contrast, the technique negatively impacts performance when the context lacks the necessary information to make an accurate posterior inference.

Article 392

Title@2025-07-27 (7): MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Title: MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

MaPPO: Maximale Posteriori-Preference-Optimierung mit vorherigem Wissen

MaPPO: 与先前知识最优化的后世偏好 2507.21183v1

Authors (10): Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton

As the era of large language models (LLMs) on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameter, and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin with consistent improvement on DPO variants, including widely used SimPO, IPO, and CPO. Extensive empirical evaluations of different model sizes and model series on three standard benchmarks, including MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
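For context, the maximum-likelihood objective that the abstract contrasts with can be sketched as the standard per-pair DPO loss. This shows only the DPO baseline; the abstract does not specify how MaPPO's prior reward term enters the objective, so that term is deliberately not shown. The argument names and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference model the margin is 0 and the loss is log 2.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
```

The loss depends only on which response was preferred, which is the binary treatment of responses that the abstract describes as oversimplified.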

Article 393

Title@2025-07-27 (7): TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling

Title: TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling

TIB-STC: Ein großflächiger strukturierter tibetischer Benchmark für ressourcenarme Sprachmodellierung

TIB-STC: 低资源语言建模的西藏大型结构化基准 2503.18288v4

Authors (14): Cheng Huang, Fan Gao, Yutong Liu, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu

Advancement of large language models (LLMs) has brought transformative capabilities to NLP, but such progress remains unevenly distributed, especially for low-resource and culturally rich languages like Tibetan. In this paper, we present TIB-STC, the first large-scale, expert-curated, and multi-domain benchmark specifically designed to support the development and evaluation of LLMs for the Tibetan language. Spanning over 11 billion tokens across literature, religion, medicine, law, and daily communication, TIB-STC preserves traditional grammar and stylistic richness. To validate its utility, we train a reference model, Sun-Shine, on TIB-STC through a three-stage pipeline involving pretraining, supervised fine-tuning, and preference optimization. Evaluation on TLUE Benchmark for Tibetan-specific tasks, including Ti-MMLU and Ti-SafetyBench, demonstrates the benchmark’s effectiveness in enabling robust instruction-following and culturally aligned generation. We release TIB-STC to advance research in low-resource language modeling and promote inclusivity in multilingual NLP. All data are available at: https://github.com/Vicentvankor/sun-shine

Article 394

Title@2025-07-27 (7): Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

Title: Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

Seed LiveInterpret 2.0: End-to-End Simultanübersetzung mit Ihrer Stimme

种子实况解释2.0:用声音翻译终端到终端同声语音语音 2507.17527v3

Authors (28): Shanbo Cheng, Yu Bao, Zhichao Huang, Yu Lu, Ningxin Peng, Lu Xu, Runsheng Yu, Rong Cao, Yujiao Du, Ting Han, Yuxiang Hu, Zeyang Li, Sitong Liu, Shengtao Ma, Shiguang Pan, Jiongchen Xiao, Nuo Xu, Meng Yang, Rong Ye, Yiming Yu, Jun Zhang, Ruofei Zhang, Wanyi Zhang, Wenhao Zhu, Liehao Zou, Lu Lu, Yuxuan Wang, Yonghui Wu

Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, a reduction of roughly 70% that drastically enhances practical usability.

Article 395

Title@2025-07-27 (7): Measuring Information Distortion in Hierarchical Ultra long Novel Reconstruction:The Optimal Expansion Ratio

Title: Measuring Information Distortion in Hierarchical Ultra long Novel Reconstruction:The Optimal Expansion Ratio

Messung von Informationsverzerrung bei hierarchischer Ultralangem Novel Reconstruction:The Optimal Expansion Ratio

测量高层次超长新世纪重建中的信息扭曲:最佳扩展比率 2505.12572v2

Authors (2): Hanwen Shen, Ting Ying

A two-stage novel generation framework (outline -> section outline -> manuscript) is widely used in long novel generation (e.g., DOME, Plan&Write, Long Writer), but such frameworks have been little studied for the reconstruction of ultra-long novels (>1M words). Building on recent text compression methods (LLMZip, LLM2Vec), we conduct an information-theoretic analysis to quantify semantic distortion under different compression-expansion ratios. We examine how outline length affects information preservation. Experiments on ultra-long novels show that the optimal compression-expansion ratio significantly reduces semantic distortion compared to non-optimal ratios.

Article 396

Title@2025-07-27 (7): Language Models Resist Alignment: Evidence From Data Compression

Title: Language Models Resist Alignment: Evidence From Data Compression

Sprachmodelle widerstehen Ausrichtung: Beweise aus Datenkompression

语言模型阻力对齐:数据压缩中的证据 2406.06144v5

Authors (10): Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, Yaodong Yang

Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment. The model weights and code are available at pku-lm-resist-alignment.github.io.

Article 397

Title@2025-07-27 (7): AI-Driven Generation of Old English: A Framework for Low-Resource Languages

Title: AI-Driven Generation of Old English: A Framework for Low-Resource Languages

AI-Driven Generation of Old English: Ein Rahmen für Low-Resource-Sprachen

AI-Driven 一代老英语:低资源语言框架 2507.20111v1

Authors (4): Rodrigo Gabriel Salazar Alva, Matías Nuñez, Cristian López, Javier Martín Arista

Preserving ancient languages is essential for understanding humanity’s cultural and linguistic heritage, yet Old English remains critically under-resourced, limiting its accessibility to modern natural language processing (NLP) techniques. We present a scalable framework that uses advanced large language models (LLMs) to generate high-quality Old English texts, addressing this gap. Our approach combines parameter-efficient fine-tuning (Low-Rank Adaptation, LoRA), data augmentation via backtranslation, and a dual-agent pipeline that separates the tasks of content generation (in English) and translation (into Old English). Evaluation with automated metrics (BLEU, METEOR, and CHRF) shows significant improvements over baseline models, with BLEU scores increasing from 26 to over 65 for English-to-Old English translation. Expert human assessment also confirms high grammatical accuracy and stylistic fidelity in the generated texts. Beyond expanding the Old English corpus, our method offers a practical blueprint for revitalizing other endangered languages, effectively uniting AI innovation with the goals of cultural preservation.

Article 398

Title@2025-07-27 (7): Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Title: Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Emergent Semantics Beyond Token Embeddings: Transformer LMs mit gefrorenen visuellen Unicode-Darstellungen

超越 Tok 嵌入的新兴语义: 具有冷冻视觉统一符号的变形LMs 2507.04886v2

Authors (1): A. Bochkov

Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational “meaning vectors.” This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to “representational interference” in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer’s compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.

Article 399

Title@2025-07-27 (7): EcoTransformer: Attention without Multiplication

Title: EcoTransformer: Attention without Multiplication

EcoTransformer: Achtung ohne Multiplikation

生态转换:注意不乘数 2507.20096v1

Authors (2): Xin Gao, Xingming Xu

The Transformer, with its scaled dot-product attention mechanism, has become a foundational architecture in modern AI. However, this mechanism is computationally intensive and incurs substantial energy costs. We propose a new Transformer architecture EcoTransformer, in which the output context vector is constructed as the convolution of the values using a Laplacian kernel, where the distances are measured by the L1 metric between the queries and keys. Compared to dot-product based attention, the new attention score calculation is free of matrix multiplication. It performs on par with, or even surpasses, scaled dot-product attention in NLP, bioinformatics, and vision tasks, while consuming significantly less energy.
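The attention rule described above can be sketched in NumPy as follows. This is a minimal illustration, assuming a softmax-style normalisation of the Laplacian kernel and a free `scale` parameter; neither detail is stated in the abstract, and the function name is hypothetical.

```python
import numpy as np

def l1_laplacian_attention(Q, K, V, scale=1.0):
    """Attention whose scores come from L1 distances instead of dot products.

    Q: (n_q, d) queries, K: (n_k, d) keys, V: (n_k, d_v) values.
    """
    # Pairwise L1 distances between queries and keys, shape (n_q, n_k).
    # Only subtraction, absolute value, and addition are needed here.
    dist = np.abs(Q[:, None, :] - K[None, :, :]).sum(axis=-1)
    # Laplacian kernel exp(-dist / scale), normalised over the keys.
    logits = -dist / scale
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Weighted combination of the values (this step still multiplies weights by V).
    return weights @ V

Q = np.array([[0.0, 0.0], [10.0, 10.0]])
K = Q.copy()
V = np.array([[1.0, 0.0], [0.0, 1.0]])
print(l1_laplacian_attention(Q, K, V))
```

Each query places nearly all of its weight on the key closest to it in L1 distance, so in this example the output is close to `V` itself.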

Article 400

Title@2025-07-27 (7): ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

Title: ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

ProsodyLM: Enthüllen der neu entstehenden Prosody-Verarbeitungsfähigkeiten in Sprachmodellen

ProsodyLM: 解决语言模式中新出现的处理能力问题 2507.20091v1

Authors (7): Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, Yang Zhang

Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information – we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody information, and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone, ranging from harnessing the prosody nuances in generated speech, such as contrastive focus, understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.

Article 401

Title@2025-07-27 (7): Reinforcement learning fine-tuning of language model for instruction following and math reasoning

Title: Reinforcement learning fine-tuning of language model for instruction following and math reasoning

Verstärktes Lernen der Feinabstimmung des Sprachmodells für Unterricht und Mathe-Reinigung

强化学习,微调用于教学的语文模式和数学推理 2506.21560v2

Authors (2): Yifu Han, Geo Zhang

This study investigates the effectiveness of reinforcement learning (RL) fine-tuning techniques on a compact language model (Qwen2.5-0.5B Base) for two challenging tasks: instruction following and mathematical reasoning. We compare supervised fine-tuning (SFT), Direct Preference Optimization (DPO) using preference-labeled data, and Reinforce Leave-One-Out (RLOO) with reward models. Our experiments show that RLOO with DeBERTa reward modeling achieves the best alignment, while DPO provides strong and consistent results. For math reasoning tasks, synthetic data augmentation and best-of-N sampling with an external verifier significantly improve accuracy, showing the potential of combining fine-tuning with inference-time tools. This study highlights key trade-offs and practical strategies for training lightweight, task-aligned small-scale language models.
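The leave-one-out baseline that gives RLOO its name can be sketched in a few lines. This is a minimal illustration of the advantage computation only, assuming `k` sampled responses per prompt with scalar rewards; the function name is hypothetical and the policy-gradient update itself is omitted.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k responses sampled from the same prompt.

    Each response's baseline is the mean reward of the other k - 1 responses.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Four samples, two of which received reward 1.
print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because every baseline excludes the sample it is applied to, the advantages sum to zero across the group, and no separate value network is needed.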

Article 402

Title@2025-07-26 (6): The Devil is in the EOS: Sequence Training for Detailed Image Captioning

Title: The Devil is in the EOS: Sequence Training for Detailed Image Captioning

Der Teufel ist im EOS: Sequenztraining für detaillierte Bildunterschriften

魔鬼在EOS:详细图像说明的序列训练 2507.20077v1

Authors (2): Abdelrahman Mohamed, Yova Kementchedjhieva

Despite significant advances in vision-language models (VLMs), image captioning often suffers from a lack of detail, with base models producing short, generic captions. This limitation persists even though VLMs are equipped with strong vision and language backbones. While supervised data and complex reward functions have been proposed to improve detailed image captioning, we identify a simpler underlying issue: a bias towards the end-of-sequence (EOS) token, which is introduced during cross-entropy training. We propose an unsupervised method to debias the model’s tendency to predict the EOS token prematurely. By reducing this bias, we encourage the generation of longer, more detailed captions without the need for intricate reward functions or supervision. Our approach is straightforward, effective, and easily applicable to any pretrained model. We demonstrate its effectiveness through experiments with three VLMs and on three detailed captioning benchmarks. Our results show a substantial increase in caption length and relevant details, albeit with an expected increase in the rate of hallucinations.

Article 403

Title@2025-07-26 (6): PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training

Title: PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training

PITA: Präferenz-geführte Inferenz-Zeit-Ausrichtung für LLM nach dem Training

PITA:LLM培训后培训的优先指导推论-时间协调 2507.20067v1

Authors (4): Sarat Chandra Bobbili, Ujwal Dinesha, Dheeraj Narasimha, Srinivas Shakkottai

Inference-time alignment enables large language models (LLMs) to generate outputs aligned with end-user preferences without further training. Recent post-training methods achieve this by using small guidance models to modify token generation during inference. These methods typically optimize a reward function KL-regularized by the original LLM taken as the reference policy. A critical limitation, however, is their dependence on a pre-trained reward model, which requires fitting to human preference feedback–a potentially unstable process. In contrast, we introduce PITA, a novel framework that integrates preference feedback directly into the LLM’s token generation, eliminating the need for a reward model. PITA learns a small preference-based guidance policy to modify token probabilities at inference time without LLM fine-tuning, reducing computational cost and bypassing the pre-trained reward model dependency. The problem is framed as identifying an underlying preference distribution, solved through stochastic search and iterative refinement of the preference-based guidance model. We evaluate PITA across diverse tasks, including mathematical reasoning and sentiment classification, demonstrating its effectiveness in aligning LLM outputs with user preferences.

Article 404

Title@2025-07-26 (6): RAG in the Wild: On the (In)effectiveness of LLMs with Mixture-of-Knowledge Retrieval Augmentation

Title: RAG in the Wild: On the (In)effectiveness of LLMs with Mixture-of-Knowledge Retrieval Augmentation

RAG in the Wild: Über die (In)Wirksamkeit von LLMs mit Mixture-of-Knowledge Retrieval Augmentation

野生RAG:关于利用混合知识回收增加的LLMs(内)效力 2507.20059v1

Authors (6): Ran Xu, Yuchen Zhuang, Yue Yu, Haoyu Wang, Wenqi Shi, Carl Yang

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieved at inference time. While RAG demonstrates strong performance on benchmarks largely derived from general-domain corpora like Wikipedia, its effectiveness under realistic, diverse retrieval scenarios remains underexplored. We evaluated RAG systems using MassiveDS, a large-scale datastore with a mixture of knowledge, and identified critical limitations: retrieval mainly benefits smaller models, rerankers add minimal value, and no single retrieval source consistently excels. Moreover, current LLMs struggle to route queries across heterogeneous knowledge sources. These findings highlight the need for adaptive retrieval strategies before deploying RAG in real-world settings. Our code and data can be found at https://github.com/ritaranx/RAG_in_the_Wild.

Article 405

Title@2025-07-26 (6): A Tensor-Based Compiler and a Runtime for Neuron-Level DNN Certifier Specifications

Title: A Tensor-Based Compiler and a Runtime for Neuron-Level DNN Certifier Specifications

Ein Tensor-basierter Compiler und eine Laufzeit für die Spezifikationen des Neuron-Level DNN Certifier

一个基于 Tensor 的编纂器和中子级别 DNN 验证符规格的运行时间 2507.20055v1

Authors (6): Avaljot Singh, Yamin Chandini Sarita, Aditya Mishra, Ishaan Goyal, Gagandeep Singh, Charith Mendis

The uninterpretability of DNNs has led to the adoption of abstract interpretation-based certification as a practical means to establish trust in real-world systems that rely on DNNs. However, the current landscape supports only a limited set of certifiers, and developing new ones or modifying existing ones for different applications remains difficult. This is because the mathematical design of certifiers is expressed at the neuron level, while their implementations are optimized and executed at the tensor level. This mismatch creates a semantic gap between design and implementation, making manual bridging both complex and expertise-intensive – requiring deep knowledge in formal methods, high-performance computing, etc. We propose a compiler framework that automatically translates neuron-level specifications of DNN certifiers into tensor-based, layer-level implementations. This is enabled by two key innovations: a novel stack-based intermediate representation (IR) and a shape analysis that infers the implicit tensor operations needed to simulate the neuron-level semantics. During lifting, the shape analysis creates tensors in the minimal shape required to perform the corresponding operations. The IR also enables domain-specific optimizations as rewrites. At runtime, the resulting tensor computations exhibit sparsity tied to the DNN architecture. This sparsity does not align well with existing formats. To address this, we introduce g-BCSR, a double-compression format that represents tensors as collections of blocks of varying sizes, each possibly internally sparse. Using our compiler and g-BCSR, we make it easy to develop new certifiers and analyze their utility across diverse DNNs. Despite its flexibility, the compiler achieves performance comparable to hand-optimized implementations.

Article 406

Title@2025-07-26 (6): $K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning

Title: $K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning

$K^4$: Online Log Anomalienerkennung durch unüberwachtes Lernen

$K^4$:在线记录异常探测不受监督的典型学习 2507.20051v1

Authors (6): Weicong Chen, Vikash Singh, Zahra Rahmani, Debargha Ganguly, Mohsen Hariri, Vipin Chaudhary

Existing Log Anomaly Detection (LogAD) methods are often slow, dependent on error-prone parsing, and use unrealistic evaluation protocols. We introduce $K^4$, an unsupervised and parser-independent framework for high-performance online detection. $K^4$ transforms arbitrary log embeddings into compact four-dimensional descriptors (Precision, Recall, Density, Coverage) using efficient k-nearest neighbor (k-NN) statistics. These descriptors enable lightweight detectors to accurately score anomalies without retraining. Using a more realistic online evaluation protocol, $K^4$ sets a new state-of-the-art (AUROC: 0.995-0.999), outperforming baselines by large margins while being orders of magnitude faster, with training under 4 seconds and inference as low as 4 $\mu$s.

Article 407

Title@2025-07-26 (6): AI as a deliberative partner fosters intercultural empathy for Americans but fails for Latin American participants

Title: AI as a deliberative partner fosters intercultural empathy for Americans but fails for Latin American participants

KI als beratender Partner fördert interkulturelles Empathie für Amerikaner, scheitert aber für lateinamerikanische Teilnehmer

作为审议伙伴的大赦国际促进美国人的文化间同情,但拉丁美洲参与者却失败 2504.13887v2

Authors (5): Isabel Villanueva, Tara Bobinac, Binwei Yao, Junjie Hu, Kaiping Chen

Despite increasing AI chatbot deployment in public discourse, empirical evidence on their capacity to foster intercultural empathy remains limited. Through a randomized experiment, we assessed how different AI deliberation approaches–cross-cultural deliberation (presenting other-culture perspectives), own-culture deliberation (representing participants’ own culture), and non-deliberative control–affect intercultural empathy across American and Latin American participants. Cross-cultural deliberation increased intercultural empathy among American participants through positive emotional engagement, but produced no such effects for Latin American participants, who perceived AI responses as culturally inauthentic despite explicit prompting to represent their cultural perspectives. Our analysis of participant-driven feedback, where users directly flagged and explained culturally inappropriate AI responses, revealed systematic gaps in AI’s representation of Latin American contexts that persist despite sophisticated prompt engineering. These findings demonstrate that current approaches to AI cultural alignment–including linguistic adaptation and explicit cultural prompting–cannot fully address deeper representational asymmetries in AI systems. Our work advances both deliberation theory and AI alignment research by revealing how the same AI system can simultaneously promote intercultural understanding for one cultural group while failing for another, with critical implications for designing equitable AI systems for cross-cultural democratic discourse.

Article 408

Title@2025-07-26 (6): Infogen: Generating Complex Statistical Infographics from Documents

Title: Infogen: Generating Complex Statistical Infographics from Documents

Infogen: Erzeugen komplexer statistischer Infografiken aus Dokumenten

信息源:从文件生成复杂的统计图表 2507.20046v1

Authors (5): Akash Ghosh, Aparna Garimella, Pritika Ramu, Sambaran Bandyopadhyay, Sriparna Saha

Statistical infographics are powerful tools that simplify complex data into visually engaging and easy-to-understand formats. Despite advancements in AI, particularly with LLMs, existing efforts have been limited to generating simple charts, with no prior work addressing the creation of complex infographics from text-heavy documents that demand a deep understanding of the content. We address this gap by introducing the task of generating statistical infographics composed of multiple sub-charts (e.g., line, bar, pie) that are contextually accurate, insightful, and visually aligned. To achieve this, we define infographic metadata that includes its title and textual insights, along with sub-chart-specific details such as their corresponding data and alignment. We also present Infodat, the first benchmark dataset for text-to-infographic metadata generation, where each sample links a document to its metadata. We propose Infogen, a two-stage framework where fine-tuned LLMs first generate metadata, which is then converted into infographic code. Extensive evaluations on Infodat demonstrate that Infogen achieves state-of-the-art performance, outperforming both closed and open-source LLMs in text-to-statistical infographic generation.

Article 409

Title@2025-07-26 (6): Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs

Title: Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs

Kolumbianische Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Empfehlungen von LLMs

Colombian Waitresses y Jueces canadienses:LLM公司在占领建议中的性别和乡村差别 2505.02456v2

Authors (5): Elisa Forcada Rodríguez, Olatz Perez-de-Viñaspre, Jon Ander Campos, Dietrich Klakow, Vagrant Gautam

One of the goals of fairness research in NLP is to measure and mitigate stereotypical biases that are propagated by NLP systems. However, such work tends to focus on single axes of bias (most often gender) and the English language. Addressing these limitations, we contribute the first study of multilingual intersecting country and gender biases, with a focus on occupation recommendations generated by large language models. We construct a benchmark of prompts in English, Spanish and German, where we systematically vary country and gender, using 25 countries and four pronoun sets. Then, we evaluate a suite of 5 Llama-based models on this benchmark, finding that LLMs encode significant gender and country biases. Notably, we find that even when models show parity for gender or country individually, intersectional occupational biases based on both country and gender persist. We also show that the prompting language significantly affects bias, and instruction-tuned models consistently demonstrate the lowest and most stable levels of bias. Our findings highlight the need for fairness researchers to use intersectional and multilingual lenses in their work.

Article 410

Title@2025-07-26 (6): FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression

Title: FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression

FAEDKV: Unendliche Window Fourier-Transformation für unvoreingenommene KV-Cache-Kompression

FAEDKV: 用于无偏见的 KV 缓存压缩的无限窗口 Fourier 变换 2507.20030v1

Authors (6): Runchao Li, Yao Fu, Mu Sheng, Xianxuan Long, Haotian Yu, Pan Li

The efficacy of Large Language Models (LLMs) in long-context tasks is often hampered by the substantial memory footprint and computational demands of the Key-Value (KV) cache. Current compression strategies, including token eviction and learned projections, frequently lead to biased representations – either by overemphasizing recent/high-attention tokens or by repeatedly degrading information from earlier context – and may require costly model retraining. We present FAEDKV (Frequency-Adaptive Infinite-Window for KV cache), a novel, training-free KV cache compression framework that ensures unbiased information retention. FAEDKV operates by transforming the KV cache into the frequency domain using a proposed Infinite-Window Fourier Transform (IWDFT). This approach allows for the equalized contribution of all tokens to the compressed representation, effectively preserving both early and recent contextual information. A preliminary frequency ablation study identifies critical spectral components for layer-wise, targeted compression. Experiments on the LongBench benchmark demonstrate FAEDKV’s superiority over existing methods by up to 22%. In addition, our method shows superior, position-agnostic retrieval accuracy on the Needle-In-A-Haystack task compared to compression-based approaches.

Article 411

Title@2025-07-26 (6): Selective Prompt Anchoring for Code Generation

Title: Selective Prompt Anchoring for Code Generation

Selektive Prompt-Ankerung für die Code-Generierung

用于代码生成的选择性提示锚定 2408.09121v6

Authors (2): Yuan Tian, Tianyi Zhang

Recent advances in large language models (LLMs) have transformed software development by automatically generating code from natural language. Yet challenges remain in generating fully correct code that aligns with user intent. Our study reveals that LLMs tend to pay less attention to user prompts as more code tokens are generated. We hypothesize that this attention dilution issue is an important reason for code generation errors. To mitigate this issue, we propose Selective Prompt Anchoring (SPA) to guide code LLMs to pay more attention to user intent when generating code. We evaluate SPA using six base LLMs across six benchmarks. Our results demonstrate that SPA enhances Pass@1 by up to 12.9%, consistently outperforming SOTA code generation methods in all settings. Our code is available at https://github.com/magic-YuanTian/Selective-Prompt-Anchoring.

Article 412

Title@2025-07-26 (6): Preference learning made easy: Everything should be understood through win rate

Title: Preference learning made easy: Everything should be understood through win rate

Vorliebe Lernen leicht gemacht: Alles sollte durch Win-Rate verstanden werden

首选学习容易:人人都应通过双赢率来理解一切 2502.10505v2

Authors (2): Lily H. Zhang, Rajesh Ranganath

Preference learning, or the task of aligning generative models to preference comparison data, has yet to reach the conceptual maturity of classification, density estimation, etc. To close this gap, this work presents a framework to understand preference learning starting from the sampling distribution of pairwise preference data. First, we prove that the only evaluation of a generative model that respects both preferences and prevalences in the data distribution is a form of win rate, justifying win rate as the focal point to understand preference learning. We then analyze preference learning methods as win rate optimization (WRO) or non-WRO. We present novel instances of WRO beyond existing examples (RLHF, NLHF) and identify two key theoretical benefits of all such methods. We prove that common non-WRO methods like DPO and SFT on preferred samples lack these properties and suggest ways to mitigate such theoretical limitations. We also show that WRO underperforms in practice due to optimization difficulties and that optimization success predicts performance better than choices which affect the objective’s solution. Our analysis highlights best practices for existing methods and provides recommendations for future research, guided by the principle that one should either align non-WRO methods more closely with WRO or improve the optimization of WRO objectives.
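
As a rough illustration of the quantity this abstract centers on, the sketch below estimates a win rate empirically: the average probability that a sample from a candidate model is preferred over a sample from a reference model. The function names and the toy judge are assumptions made for illustration; they are not taken from the paper, which defines win rate over the sampling distribution of pairwise preference data.

```python
from itertools import product


def win_rate(candidate_samples, reference_samples, preference_prob):
    """Average probability that a candidate sample beats a reference sample.

    preference_prob(a, b) is a hypothetical judge returning a value in [0, 1]:
    1.0 if a is preferred, 0.0 if b is preferred, 0.5 for a tie.
    """
    pairs = list(product(candidate_samples, reference_samples))
    return sum(preference_prob(a, b) for a, b in pairs) / len(pairs)


# Toy judge (an assumption for illustration): the larger number wins.
def toy_judge(a, b):
    if a == b:
        return 0.5
    return 1.0 if a > b else 0.0


rate = win_rate([3, 1], [2], toy_judge)  # one win and one loss against the reference
```

In practice the judge would be a preference model or human labels, and the two sample lists would be drawn from the models being compared.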

Article 413

Title@2025-07-26 (6): Anomaly Detection in Human Language via Meta-Learning: A Few-Shot Approach

Title: Anomaly Detection in Human Language via Meta-Learning: A Few-Shot Approach

Anomalieerkennung in der menschlichen Sprache durch Meta-Learning: Ein wenig heißer Ansatz

通过元学习在人文语言中异常探测: “ 几热 “ 方法 2507.20019v1

Authors (4): Saurav Singla, Aarav Singla, Advik Gupta, Parnika Gupta

We propose a meta-learning framework for detecting anomalies in human language across diverse domains with limited labeled data. Anomalies in language, ranging from spam and fake news to hate speech, pose a major challenge due to their sparsity and variability. We treat anomaly detection as a few-shot binary classification problem and leverage meta-learning to train models that generalize across tasks. Using datasets from domains such as SMS spam, COVID-19 fake news, and hate speech, we evaluate model generalization on unseen tasks with minimal labeled anomalies. Our method combines episodic training with prototypical networks and domain resampling to adapt quickly to new anomaly detection tasks. Empirical results show that our method outperforms strong baselines in F1 and AUC scores. We also release the code and benchmarks to facilitate further research in few-shot text anomaly detection.
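
The classification step of a prototypical network can be sketched in a few lines, assuming text has already been embedded into fixed-length vectors. Each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype. The embeddings, labels, and function names below are hypothetical; the paper's encoder, episodic training loop, and domain resampling are not shown.

```python
def prototype(vectors):
    """Mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify(query, support):
    """Return the label whose prototype is closest to the query embedding.

    support maps each label to its few labeled support embeddings.
    """
    prototypes = {label: prototype(vecs) for label, vecs in support.items()}
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Hypothetical 2-D embeddings for a two-shot binary episode.
support_set = {
    "normal": [[0.0, 0.0], [0.0, 2.0]],
    "anomaly": [[4.0, 4.0], [6.0, 4.0]],
}
label_a = classify([1.0, 1.0], support_set)
label_b = classify([5.0, 3.0], support_set)
```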

Article 414

Title@2025-07-26 (6): A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Title: A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Eine Praxis des Post-Trainings auf Llama-3 70B mit optimaler Auswahl des zusätzlichen Sprachmischverhältnisses

Llama-3-70B培训后做法,最佳选择其他语言混合比率 2409.06624v2

Authors (6): Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Luo Ji

Large Language Models (LLMs) often need Continual Pre-Training (CPT) to acquire unfamiliar language skills or adapt to new domains. The huge training cost of CPT calls for a careful choice of key hyper-parameters such as the mixture ratio of the extra language or domain corpus. However, there is no systematic study that bridges the gap between the optimal mixture ratio and the actual model performance, or the gap between experimental scaling laws and actual deployment at full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance their Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) at the 8B size, which directly indicates the optimal experimental setup. Through careful choice of hyper-parameters and subsequent fine-tuning, the model capability is improved not only on Chinese-related benchmarks but also in some specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of the LLM in a real-life chat system, where it obtains satisfying performance.

Article 415

Title@2025-07-26 (6): MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning

Title: MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning

MeTHanol: Modularisiertes Denken von Sprachmodellen mit Intermediate Layer Thinking, Decodierung und Bootstrapping Reasoning

METHanol:含有中间层思考、解毒和诱导理由的模块化思维语言模型 2409.12059v5

Authors (10): Ningyuan Xi, Xiaoyu Wang, Yetao Wu, Teng Chen, Qingqing Gu, Yue Zhao, Jinxian Qu, Zhonglin Jiang, Yong Chen, Luo Ji

Current research efforts focus on enhancing the thinking and reasoning capabilities of large language models (LLMs) through prompting, data-driven emergence, and inference-time computation. In this study, we consider stimulating a language model’s thinking and cognitive abilities from a modular perspective, which mimics the human brain architecture. We select a specific intermediate attention layer and attach newly implemented language heads to it. We conduct dual-layer fine-tuning on annotated (query, thought, answer) samples and show that the intermediate layer can also learn to decode fluent and reasonable language tokens. A two-pass inference mechanism is designed to generate thoughts and then formal responses. The entire framework is called the modularized thinking language model (MeTHanol), which can enhance an LLM’s cognitive behaviors as indicated by Theory of Mind (ToM) and vignette-based experiments. Case studies also show that MeTHanol can plan, self-reflect, and generate human-like thoughts and answers, even on unseen and open-domain tasks. MeTHanol can also adapt to a personalized prompt and behave as the specified character. Our study holds promise for significant cognitive gains from a modular perspective. Our code, model and data are available at https://bachozean.github.io/methanol-page

Article 416

Title@2025-07-26 (6): VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering

Title: VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering

VLQA: Der erste umfassende, große und hochqualitative vietnamesische Datensatz für die Beantwortung rechtlicher Fragen

VLQA:用于法律问题解答的第一综合、大、高质量越南数据集 2507.19995v1

Authors (6): Tan-Minh Nguyen, Hoang-Trung Nguyen, Trong-Khoi Dao, Xuan-Hieu Phan, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong

The advent of large language models (LLMs) has led to significant achievements in various domains, including legal text processing. Leveraging LLMs for legal tasks is a natural evolution and an increasingly compelling choice. However, their capabilities are often portrayed as greater than they truly are. Despite the progress, we are still far from the ultimate goal of fully automating legal tasks using artificial intelligence (AI) and natural language processing (NLP). Moreover, legal systems are deeply domain-specific and exhibit substantial variation across different countries and languages. The need for building legal text processing applications for different natural languages is, therefore, large and urgent. However, there is a big challenge for legal NLP in low-resource languages such as Vietnamese due to the scarcity of resources and annotated data. The need for labeled legal corpora for supervised training, validation, and supervised fine-tuning is critical. In this paper, we introduce the VLQA dataset, a comprehensive and high-quality resource tailored for the Vietnamese legal domain. We also conduct a comprehensive statistical analysis of the dataset and evaluate its effectiveness through experiments with state-of-the-art models on legal information retrieval and question-answering tasks.

Article 417

Title@2025-07-26 (6): Improving the Performance of Sequential Recommendation Systems with an Extended Large Language Model

Title: Improving the Performance of Sequential Recommendation Systems with an Extended Large Language Model

Verbesserung der Leistungsfähigkeit sequentieller Empfehlungssysteme mit einem erweiterten Großsprachenmodell

利用扩展大语言模式改进序列建议系统的绩效 2507.19990v1

Authors (2): Sinnyum Choi, Woong Kim

Recently, competition in the field of artificial intelligence (AI) has intensified among major technology companies, resulting in the continuous release of new large language models (LLMs) that exhibit improved language understanding and context-based reasoning capabilities. These advances are expected to enable more efficient personalized recommendations in LLM-based recommendation systems through improved quality of training data and architectural design. However, many studies have not considered these recent developments. This study proposes improving LLM-based recommendation systems by replacing Llama2 with Llama3 in the LlamaRec framework. To ensure a fair comparison, random seed values were fixed and identical input data was provided during preprocessing and training. The experimental results show average performance improvements of 38.65%, 8.69%, and 8.19% for the ML-100K, Beauty, and Games datasets, respectively, thus confirming the practicality of this method. Notably, the significant improvements achieved by model replacement indicate that recommendation quality can be improved cost-effectively without structural changes to the system. Based on these results, we contend that the proposed approach is a viable solution for improving the performance of current recommendation systems.

Article 418

Title@2025-07-26 (6): Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge

Title: Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge

Robustes Daten-Wasserzeichen in Sprachmodellen durch Einspritzen fiktiver Kenntnisse

在语言模型中,通过输入有说服力的知识在语言模型中进行强力数据水上标记 2503.04036v3

Authors (4): Xinyue Cui, Johnny Tian-Zheng Wei, Swabha Swayamdipta, Robin Jia

Data watermarking in language models injects traceable signals, such as specific token sequences or stylistic patterns, into copyrighted text, allowing copyright holders to track and verify training data ownership. Previous data watermarking techniques primarily focus on effective memorization during pretraining, while overlooking challenges that arise in other stages of the LLM lifecycle, such as the risk of watermark filtering during data preprocessing and verification difficulties due to API-only access. To address these challenges, we propose a novel data watermarking approach that injects plausible yet fictitious knowledge into training data using generated passages describing a fictitious entity and its associated attributes. Our watermarks are designed to be memorized by the LLM by blending seamlessly into its training data, making them harder to detect lexically during preprocessing. We demonstrate that our watermarks can be effectively memorized by LLMs, and that increasing our watermarks’ density, length, and diversity of attributes strengthens their memorization. We further show that our watermarks remain effective after continual pretraining and supervised finetuning. Finally, we show that our data watermarks can be evaluated even under API-only access via question answering.

Article 419

Title@2025-07-26 (6): Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization

Title: Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization

Leveraging Fine-Tuned Large Language Models for Interpretable Pankreatic Cystic Lesion Feature Extraction and Risk Categorization

利用微量使用大语言模型来利用可解释性恐慌性锥性电磁性悬浮物地物采掘和风险分类 2507.19973v1

Authors (17): Ebrahim Rasromani, Stella K. Kang, Yanqi Xu, Beisong Liu, Garvit Luhadia, Wan Fung Chui, Felicia L. Pasadyn, Yu Chih Hung, Julie Y. An, Edwin Mathieu, Zehui Gu, Carlos Fernandez-Granda, Ammar A. Javed, Greg D. Sacks, Tamas Gonda, Chenchan Huang, Yiqiu Shen

Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting large-scale studies needed to advance PCL research. Purpose: To develop and evaluate large language models (LLMs) that automatically extract PCL features from MRI/CT reports and assign risk categories based on guidelines. Materials and Methods: We curated a training dataset of 6,000 abdominal MRI/CT reports (2005-2024) from 5,134 patients that described PCLs. Labels were generated by GPT-4o using chain-of-thought (CoT) prompting to extract PCL and main pancreatic duct features. Two open-source LLMs were fine-tuned using QLoRA on GPT-4o-generated CoT data. Features were mapped to risk categories per institutional guideline based on the 2017 ACR White Paper. Evaluation was performed on 285 held-out human-annotated reports. Model outputs for 100 cases were independently reviewed by three radiologists. Feature extraction was evaluated using exact match accuracy, risk categorization with macro-averaged F1 score, and radiologist-model agreement with Fleiss’ Kappa. Results: CoT fine-tuning improved feature extraction accuracy for LLaMA (80% to 97%) and DeepSeek (79% to 98%), matching GPT-4o (97%). Risk categorization F1 scores also improved (LLaMA: 0.95; DeepSeek: 0.94), closely matching GPT-4o (0.97), with no statistically significant differences. Radiologist inter-reader agreement was high (Fleiss’ Kappa = 0.888) and showed no statistically significant difference with the addition of DeepSeek-FT-CoT (Fleiss’ Kappa = 0.893) or GPT-CoT (Fleiss’ Kappa = 0.897), indicating that both models achieved agreement levels on par with radiologists. Conclusion: Fine-tuned open-source LLMs with CoT supervision enable accurate, interpretable, and efficient phenotyping for large-scale PCL research, achieving performance comparable to GPT-4o.
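
The agreement statistic reported above, Fleiss’ Kappa, can be computed from a table of per-item category counts. The sketch below follows the standard textbook definition; the function name and the toy ratings are assumptions for illustration and do not reproduce the study's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters (at least 2). Undefined if all ratings fall in a
    single category (division by zero).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1 - expected)


# Toy example: 4 reports, 3 raters, 2 risk categories (hypothetical data).
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_counts)
```

For the toy table, two items have full agreement and two are split 2-to-1, which gives a kappa of one third.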

Article 420

Title@2025-07-26 (6): Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text

Title: Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text

Text2Vis: Ein anspruchsvolles und vielfältiges Benchmark zur Generierung multimodaler Visualisierungen aus Text

Text2Vis: 从文本中生成多式视觉化的质疑性和多样化基准 2507.19969v1

Authors (4): Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque

Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural language, the absence of comprehensive benchmarks limits the rigorous evaluation of their capabilities. We introduce Text2Vis, a benchmark designed to assess text-to-visualization models, covering 20+ chart types and diverse data science queries, including trend analysis, correlation, outlier detection, and predictive analytics. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. The queries involve complex reasoning, conversational turns, and dynamic data retrieval. We benchmark 11 open-source and closed-source models, revealing significant performance gaps, highlighting key challenges, and offering insights for future advancements. To close this gap, we propose the first cross-modal actor-critic agentic framework that jointly refines the textual answer and visualization code, increasing GPT-4o’s pass rate from 26% to 42% over the direct approach and improving chart quality. We also introduce an automated LLM-based evaluation framework that enables scalable assessment across thousands of samples without human annotation, measuring answer correctness, code execution success, visualization readability, and chart accuracy. We release Text2Vis at https://github.com/vis-nlp/Text2Vis.

Article 421

Title@2025-07-26 (6): KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models

Title: KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models

KLAAD: Verfeinerung von Aufmerksamkeitsmechanismen zur Reduzierung gesellschaftlicher Bias in generativen Sprachmodellen

KLAAD: 完善关注机制,在产生语言模式中减少社会偏见 2507.19962v1

Authors (3): Seorin Kim, Dongyoung Lee, Jaejin Lee

Large language models (LLMs) often exhibit societal biases in their outputs, prompting ethical concerns regarding fairness and harm. In this work, we propose KLAAD (KL-Attention Alignment Debiasing), an attention-based debiasing framework that implicitly aligns attention distributions between stereotypical and anti-stereotypical sentence pairs without directly modifying model weights. KLAAD introduces a composite training objective combining Cross-Entropy, KL divergence, and Triplet losses, guiding the model to consistently attend across biased and unbiased contexts while preserving fluency and coherence. Experimental evaluation of KLAAD demonstrates improved bias mitigation on both the BBQ and BOLD benchmarks, with minimal impact on language modeling quality. The results indicate that attention-level alignment offers a principled solution for mitigating bias in generative language models.
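
To make the KL term concrete, the sketch below computes the KL divergence between two attention distributions over the same token positions. It is a generic illustration only: the function name and the example distributions are assumptions, and the paper's actual objective (how the KL term is combined with the Cross-Entropy and Triplet losses, and in which direction the divergence is taken) is not specified in this abstract.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same positions.

    Terms with p_i == 0 contribute nothing; q_i is assumed to be positive
    wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical attention weights for a stereotypical and an
# anti-stereotypical sentence over the same three token positions.
attn_stereo = [0.7, 0.2, 0.1]
attn_anti = [0.5, 0.3, 0.2]

mismatch = kl_divergence(attn_stereo, attn_anti)   # positive: distributions differ
aligned = kl_divergence(attn_stereo, attn_stereo)  # zero: identical distributions
```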

Article 422

Title@2025-07-26 (6): Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Title: Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs

跨语言旅行:多模式LLM中跨语言一致基准 2505.15075v4

Authors (5): Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara

The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.

Article 423

Title@2025-07-26 (6): Large Language Models Are Human-Like Internally

Title: Large Language Models Are Human-Like Internally

Große Sprachmodelle sind menschlich-innerlich

大语言模型是人与人之间的内部大语言模型 2502.01615v2

Authors (5): Tatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb, Kentaro Inui, Timothy Baldwin

Recent cognitive modeling studies have reported that larger language models (LMs) exhibit a poorer fit to human reading behavior (Oh and Schuler, 2023b; Shain et al., 2024; Kuribayashi et al., 2024), leading to claims of their cognitive implausibility. In this paper, we revisit this argument through the lens of mechanistic interpretability and argue that prior conclusions were skewed by an exclusive focus on the final layers of LMs. Our analysis reveals that next-word probabilities derived from internal layers of larger LMs align with human sentence processing data as well as, or better than, those from smaller LMs. This alignment holds consistently across behavioral (self-paced reading times, gaze durations, MAZE task processing times) and neurophysiological (N400 brain potentials) measures, challenging earlier mixed results and suggesting that the cognitive plausibility of larger LMs has been underestimated. Furthermore, we first identify an intriguing relationship between LM layers and human measures: earlier layers correspond more closely with fast gaze durations, while later layers better align with relatively slower signals such as N400 potentials and MAZE processing times. Our work opens new avenues for interdisciplinary research at the intersection of mechanistic interpretability and cognitive modeling.

Article 424

Title@2025-07-26 (6): Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA

Title: Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA

Aufmerksamkeitsköpfe vor dem Zusammenführen ausrichten: Ein effektiver Weg, MHA in GQA umzuwandeln

合并主题前对齐关注头部对齐:将MHA转换为GQA的有效途径 2412.20677v2

Authors (4): Qingyun Jin, Xiaohui Song, Feng Zhou, Zengchang Qin

Large language models (LLMs) have demonstrated exceptional performance across diverse natural language processing tasks. However, as the model size and the input sequence’s length increase, the linearly increasing key-value (KV) cache significantly degrades inference throughput. Therefore, grouped-query attention (GQA), as an alternative to multi-head attention (MHA), has been widely introduced into LLMs. In this work, we propose a cost-effective method for converting MHA into GQA with any compression ratio of KV heads. The key point of our method lies in the application of Procrustes analysis to the attention heads, which enhances the similarity among attention heads while preserving computational invariance, thereby improving the model’s post-training performance. Subsequently, we employ $L_0$ regularization to prune redundant parameters. The model after pruning can be adapted to the standard GQA framework. Experimental results show that our strategy can compress up to 87.5% of the KV heads of the LLaMA2-7B model and 75% of the KV heads of Sheared-LLaMA-1.3B with acceptable performance degradation. Our code is released at https://github.com/fpcsong/mha2gqa.

Article 425

Title@2025-07-26 (6): Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation

Title: Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation

Bewertung der Zuverlässigkeit von LLM-Annotationen im Kontext der demografischen Bias und Modellerklärung

结合人口偏见和示范解释评估LLM 说明的可靠性 2507.13138v2

Authors (6): Hadi Mohammadi, Tina Shahedi, Pablo Mosteiro, Massimo Poesio, Ayoub Bagheri, Anastasia Giachanou

Understanding the sources of variability in annotations is crucial for developing fair NLP systems, especially for tasks like sexism detection where demographic bias is a concern. This study investigates the extent to which annotator demographic features influence labeling decisions compared to text content. Using a Generalized Linear Mixed Model, we quantify this influence, finding that while statistically present, demographic factors account for a minor fraction (~8%) of the observed variance, with tweet content being the dominant factor. We then assess the reliability of Generative AI (GenAI) models as annotators, specifically evaluating whether guiding them with demographic personas improves alignment with human judgments. Our results indicate that simplistic persona prompting often fails to enhance, and sometimes degrades, performance compared to baseline models. Furthermore, explainable AI (XAI) techniques reveal that model predictions rely heavily on content-specific tokens related to sexism, rather than correlates of demographic characteristics. We argue that focusing on content-driven explanations and robust annotation protocols offers a potentially more reliable path towards fairness than persona simulation.

Article 426

Title@2025-07-26 (6): Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report

Title: Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report

Frontier AI Risk Management Framework in der Praxis: Ein technischer Bericht zur Risikoanalyse

《国际边界风险管理框架实际操作:风险分析技术报告》 2507.16534v2

Authors (37): Shanghai AI Lab, Xiaoyang Chen, Yunhao Chen, Zeren Chen, Zhiyun Chen, Hanyun Cui, Yawen Duan, Jiaxuan Guo, Qi Guo, Xuhao Hu, Hong Huang, Lige Huang, Chunxiao Li, Juncheng Li, Qihao Lin, Dongrui Liu, Xinmin Liu, Zicheng Liu, Chaochao Lu, Xiaoya Lu, Jingjing Qu, Qibing Ren, Jing Shao, Jingwei Shi, Jingwei Sun, Peng Wang, Weibing Wang, Jia Xu, Lewen Yan, Xiao Yu, Yi Yu, Boxuan Zhang, Jie Zhang, Weichen Zhang, Zhijie Zheng, Tianyi Zhou, Bowen Zhou

To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R&D, strategic deception and scheming, self-replication, and collusion. Guided by the “AI-45° Law,” we evaluate these risks using “red lines” (intolerable thresholds) and “yellow lines” (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.

Article 427

Title@2025-07-26 (6): The Impact of Fine-tuning Large Language Models on Automated Program Repair

Title: The Impact of Fine-tuning Large Language Models on Automated Program Repair

Die Auswirkungen von Feinabstimmungen großer Sprachmodelle auf die automatisierte Programmreparatur

微调大语言模型对自动方案维修的影响 2507.19909v1

Authors (4): Roman Macháček, Anastasiia Grishina, Max Hort, Leon Moonen

Automated Program Repair (APR) uses various tools and techniques to help developers achieve functional and error-free code faster. In recent years, Large Language Models (LLMs) have gained popularity as components in APR tool chains because of their performance and flexibility. However, training such models requires a significant amount of resources. Fine-tuning techniques have been developed to adapt pre-trained LLMs to specific tasks, such as APR, and enhance their performance at far lower computational costs than training from scratch. In this study, we empirically investigate the impact of various fine-tuning techniques on the performance of LLMs used for APR. Our experiments provide insights into the performance of a selection of state-of-the-art LLMs pre-trained on code. The evaluation is done on three popular APR benchmarks (QuixBugs, Defects4J, and HumanEval-Java) and considers six different LLMs with varying parameter sizes (CodeGen, CodeT5, StarCoder, DeepSeekCoder, Bloom, and CodeLlama-2). We consider three training regimens: no fine-tuning, full fine-tuning, and parameter-efficient fine-tuning (PEFT) using LoRA and IA3. We observe that full fine-tuning decreases the benchmarking performance of various models due to differing data distributions and overfitting. By using parameter-efficient fine-tuning methods, we restrict the number of trainable parameters and achieve better results. Keywords: large language models, automated program repair, parameter-efficient fine-tuning, AI4Code, AI4SE, ML4SE.
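
To give a sense of how much parameter-efficient fine-tuning restricts the trainable parameters, the sketch below compares full fine-tuning of one weight matrix against a LoRA adapter, which trains two low-rank factors B (d_out x r) and A (r x d_in) while the original matrix stays frozen. The layer size and rank are assumptions chosen for illustration, not settings reported in the paper, and IA3 is not covered.

```python
def full_param_count(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters of a LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical projection layer of size 4096 x 4096 with LoRA rank 8.
full = full_param_count(4096, 4096)
lora = lora_param_count(4096, 4096, rank=8)
fraction = lora / full
```

Under these assumed sizes the adapter trains 65,536 parameters instead of 16,777,216, which is under half a percent of the layer.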

Article 428

Title@2025-07-26 (6): CaliDrop: KV Cache Compression with Calibration

Title: CaliDrop: KV Cache Compression with Calibration

CaliDrop: KV Cache-Kompression mit Kalibrierung

CaliDrop: KV 缓存压缩加校准 2507.19906v1

Authors (9): Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang

Large Language Models (LLMs) require substantial computational resources during generation. While the Key-Value (KV) cache significantly accelerates this process by storing attention intermediates, its memory footprint grows linearly with sequence length, batch size, and model size, creating a bottleneck in long-context scenarios. Various KV cache compression techniques, including token eviction, quantization, and low-rank projection, have been proposed to mitigate this bottleneck, often complementing each other. This paper focuses on enhancing token eviction strategies. Token eviction leverages the observation that the attention patterns are often sparse, allowing for the removal of less critical KV entries to save memory. However, this reduction usually comes at the cost of notable accuracy degradation, particularly under high compression ratios. To address this issue, we propose CaliDrop, a novel strategy that enhances token eviction through calibration. Our preliminary experiments show that queries at nearby positions exhibit high similarity. Building on this observation, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by token eviction. Extensive experiments demonstrate that CaliDrop significantly improves the accuracy of existing token eviction methods.
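
The baseline that CaliDrop builds on, token eviction, can be sketched as a selection problem: keep a fixed budget of cache positions, always retaining the most recent tokens and filling the rest with the positions that received the most attention. This is a generic illustration of eviction under assumed names and scores; it is not CaliDrop's calibration step, which the abstract does not describe in detail.

```python
def select_kept_positions(attention_scores, budget, recent=0):
    """Return sorted cache positions to keep under a fixed budget.

    attention_scores[i] is a hypothetical accumulated attention score for
    token i. The last `recent` tokens are always kept (assumes
    recent <= budget); remaining slots go to the highest-scoring older tokens.
    """
    n = len(attention_scores)
    first_recent = max(0, n - recent)
    kept = set(range(first_recent, n))

    # Older tokens ranked from most to least attended.
    older = sorted(range(first_recent), key=lambda i: attention_scores[i], reverse=True)
    for i in older:
        if len(kept) >= budget:
            break
        kept.add(i)
    return sorted(kept)


# Five cached tokens, budget of three, always keep the latest token.
kept_positions = select_kept_positions([0.9, 0.1, 0.5, 0.2, 0.3], budget=3, recent=1)
```

Everything outside the returned positions would be dropped from the KV cache, which is where the accuracy loss mentioned in the abstract comes from.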

Article 429

Title: A Gold Standard Dataset and Evaluation Framework for Depression Detection and Explanation in Social Media using LLMs

Ein Gold Standard Datensatz und Evaluation Framework für Depression Erkennung und Erklärung in Social Media mit LLMs

利用LLMM公司在社会媒体中发现和解释抑郁症的黄金标准数据集和评价框架 2507.19899v1

Authors (2): Prajval Bolegave, Pushpak Bhattacharya

Early detection of depression from online social media posts holds promise for providing timely mental health interventions. In this work, we present a high-quality, expert-annotated dataset of 1,017 social media posts labeled with depressive spans and mapped to 12 depression symptom categories. Unlike prior datasets that primarily offer coarse post-level labels (Cohan et al., 2018), our dataset enables fine-grained evaluation of both model predictions and generated explanations. We develop an evaluation framework that leverages this clinically grounded dataset to assess the faithfulness and quality of natural language explanations generated by large language models (LLMs). Through carefully designed prompting strategies, including zero-shot and few-shot approaches with domain-adapted examples, we evaluate state-of-the-art proprietary LLMs including GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 Sonnet. Our comprehensive empirical analysis reveals significant differences in how these models perform on clinical explanation tasks under zero-shot and few-shot prompting. Our findings underscore the value of human expertise in guiding LLM behavior and offer a step toward safer, more transparent AI systems for psychological well-being.


Article 430

Title@2025-07-26 (6): Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs

Title: Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs

Automatisieren der mathematischen Proof-Generierung mit Large Language Model Agents und Wissensgraphen

使用大语言模型代理和知识图 2503.11657v2

Authors (5): Vincent Li, Tim Knappe, Yule Fu, Kevin Han, Kevin Zhu

Large language models have demonstrated remarkable capabilities in natural language processing tasks requiring multi-step logical reasoning capabilities, such as automated theorem proving. However, challenges persist within theorem proving, such as the identification of key mathematical concepts, understanding their interrelationships, and formalizing proofs correctly within natural language. We present KG-Prover, a novel framework that leverages knowledge graphs mined from reputable mathematical texts to augment general-purpose LLMs to construct and formalize mathematical proofs. We also study the effects of scaling graph-based, test-time compute using KG-Prover, demonstrating significant performance improvements over baselines across multiple datasets. General-purpose LLMs improve up to 21% on miniF2F-test when combined with KG-Prover, with consistent improvements ranging from 2-11% on the ProofNet, miniF2F-test, and MUSTARD datasets without additional scaling. Furthermore, KG-Prover with o4-mini achieves over 50% on miniF2F-test. This work provides a promising approach for augmenting natural language proof reasoning with knowledge graphs without the need for additional finetuning.


Article 431

Title@2025-07-26 (6): Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam

Title: Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam

Zero-shot Leistung von Generative KI in brasilianischer portugiesischer medizinischer Prüfung

巴西葡萄牙医学考试中创用AI的零弹性能 2507.19885v1

Authors (10): Cesar Augusto Madid Truyts, Amanda Gomes Rabelo, Gabriel Mesquita de Souza, Daniel Scaldaferri Lages, Adriano Jose Pereira, Uri Adrian Prync Flato, Eduardo Pontes dos Reis, Joaquim Edson Vieira, Paulo Sergio Panse Silveira, Edson Amaro Junior

Artificial intelligence (AI) has shown the potential to revolutionize healthcare by improving diagnostic accuracy, optimizing workflows, and personalizing treatment plans. Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have achieved notable advancements in natural language processing and medical applications. However, the evaluation of these models has focused predominantly on the English language, leading to potential biases in their performance across different languages. This study investigates the capability of six LLMs (GPT-4.0 Turbo, LLaMA-3-8B, LLaMA-3-70B, Mixtral 8x7B Instruct, Titan Text G1-Express, and Command R+) and four MLLMs (Claude-3.5-Sonnet, Claude-3-Opus, Claude-3-Sonnet, and Claude-3-Haiku) to answer questions written in Brazilian Portuguese from the medical residency entrance exam of the Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo (HCFMUSP) - the largest health complex in South America. The performance of the models was benchmarked against human candidates, analyzing accuracy, processing time, and coherence of the generated explanations. The results show that while some models, particularly Claude-3.5-Sonnet and Claude-3-Opus, achieved accuracy levels comparable to human candidates, performance gaps persist, particularly in multimodal questions requiring image interpretation. Furthermore, the study highlights language disparities, emphasizing the need for further fine-tuning and data set augmentation for non-English medical AI applications. Our findings reinforce the importance of evaluating generative AI in various linguistic and clinical settings to ensure a fair and reliable deployment in healthcare. Future research should explore improved training methodologies, improved multimodal reasoning, and real-world clinical integration of AI-driven medical assistance.


Article 432

Title@2025-07-26 (6): Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

Title: Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

Causal Sufficiency und Necessity verbessert Kette-of-Thought-Reasoning

C. 因果关系和必要性改进审议链理由 2506.09853v2

Authors (8): Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang

Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.
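
For reference, the standard counterfactual definitions of these quantities in Pearl's causal framework are sketched below. The abstract does not state which exact form the authors use, so reading X as "a reasoning step is present (x) or removed (x')" and Y as "the final answer is correct (y) or incorrect (y')" is an assumption made here for illustration.

```latex
\begin{align}
\mathrm{PN}  &= P\left(Y_{x'} = y' \mid X = x,\ Y = y\right) \\
\mathrm{PS}  &= P\left(Y_{x} = y \mid X = x',\ Y = y'\right) \\
\mathrm{PNS} &= P\left(Y_{x} = y,\ Y_{x'} = y'\right)
\end{align}
```

Under this reading, a step with high PN is one whose removal would likely have broken a correct answer, which makes it a poor candidate for pruning, while a high PS suggests that adding the step would likely have repaired an incorrect answer.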


Article 433

Title@2025-07-26 (6): FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models

Title: FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models

FactReasoner: Ein probabilistischer Ansatz zur Langform-Faktivitätsbewertung für große Sprachmodelle

事实研究者:对大语言模式长期实际评估的概率办法 2502.18573v2

Authors (8): Radu Marinescu, Debarun Bhattacharjya, Junkyu Lee, Tigran Tchrakian, Javier Carnerero Cano, Yufang Hou, Elizabeth Daly, Alessandra Pascale

Large language models (LLMs) have demonstrated vast capabilities on generative tasks in recent years, yet they struggle with guaranteeing the factual correctness of the generated content. This makes these models unreliable in realistic situations where factually accurate responses are expected. In this paper, we propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response. Specifically, FactReasoner decomposes the response into atomic units, retrieves relevant contexts for them from an external knowledge source, and constructs a joint probability distribution over the atoms and contexts using probabilistic encodings of the logical relationships (entailment, contradiction) between the textual utterances corresponding to the atoms and contexts. FactReasoner then computes the posterior probability of whether atomic units in the response are supported by the retrieved contexts. Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches in terms of both factual precision and recall.


Article 434

Title@2025-07-26 (6): The Polish Vocabulary Size Test: A Novel Adaptive Test for Receptive Vocabulary Assessment

Title: The Polish Vocabulary Size Test: A Novel Adaptive Test for Receptive Vocabulary Assessment

Der polnische Vokabular-Größentest: Ein neuartiger adaptiver Test für die rezeptive Vokabular-Bewertung

波兰词汇大小测试:接受词汇评估的新适应性测试 2507.19869v1

Authors (3): Danil Fokin, Monika Płużyczka, Grigory Golovin

We present the Polish Vocabulary Size Test (PVST), a novel tool for assessing the receptive vocabulary size of both native and non-native Polish speakers. Based on Item Response Theory and Computerized Adaptive Testing, PVST dynamically adjusts to each test-taker’s proficiency level, ensuring high accuracy while keeping the test duration short. To validate the test, a pilot study was conducted with 1,475 participants. Native Polish speakers demonstrated significantly larger vocabularies compared to non-native speakers. For native speakers, vocabulary size showed a strong positive correlation with age. The PVST is available online at myvocab.info/pl.
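
How Item Response Theory and adaptive item selection fit together can be sketched as follows. The abstract does not say which IRT model or selection rule PVST uses, so the two-parameter logistic (2PL) model and the maximum-information rule below are assumptions, and the item bank values are hypothetical.

```python
import math
from typing import List, Set, Tuple


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def next_item(theta: float, items: List[Tuple[float, float]], used: Set[int]) -> int:
    """Pick the unused item that is most informative at the current estimate."""
    candidates = [i for i in range(len(items)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))


# Hypothetical item bank: (discrimination a, difficulty b) per word item.
bank = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
chosen = next_item(theta=0.1, items=bank, used=set())
print(chosen)
```

Selecting the item whose difficulty is closest to the current ability estimate is what lets an adaptive test stay short: items that are far too easy or far too hard carry little information about the test-taker.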


Article 435

Title@2025-07-26 (6): DRIVE: Disfluency-Rich Synthetic Dialog Data Generation Framework for Intelligent Vehicle Environments

Title: DRIVE: Disfluency-Rich Synthetic Dialog Data Generation Framework for Intelligent Vehicle Environments

DRIVE: Disfluency-Rich Synthetic Dialog Data Generierung Framework für intelligente Fahrzeugumgebungen

DIVE: 智能车辆环境数据生成框架 2507.19867v1

Authors (6): Anshul Chavda, M Jagadeesh, Chintalapalli Raja Kullayappa, B Jayaprakash, Medchalimi Sruthi, Pushpak Bhattacharyya

In-car conversational AI is becoming increasingly critical as autonomous vehicles and smart assistants gain widespread adoption. Yet, existing datasets fail to capture the spontaneous disfluencies such as hesitations, false starts, repetitions, and self-corrections that characterize real driver-AI dialogs. To address this, we introduce DiscoDrive, a synthetic corpus of 3500 multi-turn dialogs across seven automotive domains, generated using a two-stage, prompt-driven pipeline that dynamically integrates disfluencies during synthesis. We show that DiscoDrive is effective both as a training resource, enabling DialoGPT-Medium and T5-Base to match or exceed KVRET-trained models on the MultiWOZ 2.2 and Schema-Guided Dialogue (SGD) relevant test sets (BLEU-4 improvements of 0.26 to 0.61; METEOR +2.10; ROUGE-L +3.48; BERTScore F1 improvements of 1.35 to 3.48), and as a data augmentation resource in low-resource scenarios, delivering additional gains of up to BLEU-4 +0.38, METEOR +1.95, ROUGE-L +2.87, and BERTScore F1 +4.00 when combined with 10 percent of KVRET. Human evaluations further confirm that dialogs sampled from DiscoDrive are rated higher than KVRET’s human-collected dialogs in naturalness (3.8 vs 3.6) and coherence (4.1 vs 4.0), and are perceived as more context-appropriate than leading post-hoc methods (such as LARD), without compromising clarity. DiscoDrive fills a critical gap in existing resources and serves as a versatile corpus for both training and augmenting conversational AI, enabling robust handling of real-world, disfluent in-car interactions.


Article 436

Title@2025-07-26 (6): Agentic Reinforced Policy Optimization

Title: Agentic Reinforced Policy Optimization

Agentische verstärkte politische Optimierung

强化政策优化 2507.19849v1

Authors (14): Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou

Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models’ intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO’s superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO
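
The entropy signal behind the adaptive rollout can be sketched in a few lines. This is not the ARPO implementation (see the repository linked above for that); the branching rule, the `threshold` value, and the toy distributions are assumptions chosen only to show how a post-tool-call rise in token entropy could trigger extra step-level sampling.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_branch(entropy_before: float, entropy_after: float, threshold: float) -> bool:
    """Decide whether to sample extra step-level rollouts.

    Branch when entropy rises by more than `threshold` after a tool call.
    Both the rule and the threshold are illustrative assumptions.
    """
    return (entropy_after - entropy_before) > threshold


# Toy next-token distributions before and after a tool response.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
h_before = token_entropy(confident)
h_after = token_entropy(uncertain)
print(should_branch(h_before, h_after, threshold=0.5))
```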


Article 437

Title@2025-07-26 (6): Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs

Title: Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs

Gemeinsames Verständnis von Fehlausrichtung im zielorientierten Dialog: Eine Fallstudie mit Ubuntu Chat Logs

理解目标导向对话框中的共同点不匹配:与Ubuntu聊天日志的案例研究 2503.12370v2

Authors (6): Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik

While it is commonly accepted that maintaining common ground plays a role in conversational success, little prior research exists connecting conversational grounding to success in task-oriented conversations. We study failures of grounding in the Ubuntu IRC dataset, where participants use text-only communication to resolve technical issues. We find that disruptions in conversational flow often stem from a misalignment in common ground, driven by a divergence in beliefs and assumptions held by participants. These disruptions, which we call conversational friction, significantly correlate with task success. We find that although LLMs can identify overt cases of conversational friction, they struggle with subtler and more context-dependent instances requiring pragmatic or domain-specific reasoning.


Article 438

Title@2025-07-26 (6): AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition

Title: AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition

AutoSign: Direkte Pose-zu-Text-Übersetzung für die kontinuierliche Erkennung von Zeichensprachen

自动签名: 用于持续手语识别的直导 Pose-to- Text 翻译 2507.19840v1

Authors (4): Samuel Ebimobowei Johnny, Blessed Guda, Andrew Blayama Stephen, Assane Gueye

Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between the hearing and hearing-impaired communities. This involves recognizing and interpreting the hands, face, and body gestures of the signer, which poses a challenge because all of these features must be combined. Continuous Sign Language Recognition (CSLR) methods rely on multi-stage pipelines that first extract visual features, then align variable-length sequences with target glosses using CTC or HMM-based approaches. However, these alignment-based methods suffer from error propagation across stages and overfitting, and struggle with vocabulary scalability due to the intermediate gloss representation bottleneck. To address these limitations, we propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text, bypassing traditional alignment mechanisms entirely. The use of this decoder-only approach allows the model to directly map between the features and the glosses without the need for CTC loss while also directly learning the textual dependencies in the glosses. Our approach incorporates a temporal compression module using 1D CNNs to efficiently process pose sequences, followed by AraGPT2, a pre-trained Arabic decoder, to generate text (glosses). Through comprehensive ablation studies, we demonstrate that hand and body gestures provide the most discriminative features for signer-independent CSLR. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset, achieving an improvement of up to 6.1% in WER score compared to the best existing method.


Article 439

Title@2025-07-26 (6): HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs

Title: HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs

HCAttention: Extreme KV Cache Compression via Heterogenes Aufmerksamkeitsrechnen für LLMs

HCAttention:通过不同式注意计算法对LLMs进行极端KV缓存压缩 2507.19823v1

Authors (5): Dongquan Yang, Yifan Yang, Xiaotian Yu, Xianbiao Qi, Rong Xiao

Processing long-context inputs with large language models presents a significant challenge due to the enormous memory requirements of the Key-Value (KV) cache during inference. Existing KV cache compression methods exhibit noticeable performance degradation when memory is reduced by more than 85%. Additionally, strategies that leverage GPU-CPU collaboration for approximate attention remain underexplored in this setting. We propose HCAttention, a heterogeneous attention computation framework that integrates key quantization, value offloading, and dynamic KV eviction to enable efficient inference under extreme memory constraints. The method is compatible with existing transformer architectures and does not require model fine-tuning. Experimental results on the LongBench benchmark demonstrate that our approach preserves the accuracy of the full-attention model while shrinking the KV cache memory footprint to 25% of its original size. Remarkably, it stays competitive with only 12.5% of the cache, setting a new state-of-the-art in LLM KV cache compression. To the best of our knowledge, HCAttention is the first to extend the Llama-3-8B model to process 4 million tokens on a single A100 GPU with 80GB memory.
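
To get a feel for the memory scale involved, the back-of-the-envelope calculation below estimates an uncompressed KV cache size. The model configuration (32 layers, 8 key-value heads, head dimension 128, 16-bit values) is an assumption about a Llama-3-8B-style model and is not stated in the abstract; the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # The factor 2 accounts for storing both a key and a value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed configuration (not from the abstract): 32 layers, 8 KV heads
# (grouped-query attention), head dimension 128, fp16 (2 bytes per value).
per_token = kv_cache_bytes(1, 32, 8, 128, 2)
full = kv_cache_bytes(4_000_000, 32, 8, 128, 2)
gib = 1024 ** 3

print(per_token)                      # bytes per cached token
print(round(full / gib, 1))           # uncompressed cache for 4M tokens, GiB
print(round(0.125 * full / gib, 1))   # 12.5% of that cache, GiB
```

Under these assumptions the uncompressed cache for 4 million tokens is several hundred GiB, far beyond a single 80GB GPU, which is consistent with the abstract's emphasis on combining quantization, CPU offloading, and eviction rather than relying on any one of them.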


Article 440

Title@2025-07-26 (6): A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy

Title: A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy

Ein strukturierter Bangla-Datensatz von Krankheits-Symptome-Verbänden zur Verbesserung der Diagnosegenauigkeit

改善诊断准确性疾病 – – 症状协会结构化孟加拉数据集 2506.13610v3

Authors (4): Abdullah Al Shafi, Rowzatul Zannat, Abdul Muntakim, Mahmudul Hasan

Disease-symptom datasets are significant and in demand for medical research, disease diagnosis, clinical decision-making, and AI-driven health management applications. These datasets help identify symptom patterns associated with specific diseases, thus improving diagnostic accuracy and enabling early detection. The dataset presented in this study systematically compiles disease-symptom relationships from various online sources, medical literature, and publicly available health databases. The data was gathered through analyzing peer-reviewed medical articles, clinical case studies, and disease-symptom association reports. Only verified medical sources were included in the dataset; non-peer-reviewed and anecdotal sources were excluded. The dataset is structured in a tabular format, where the first column represents diseases, and the remaining columns represent symptoms. Each symptom cell contains a binary value, indicating whether a symptom is associated with a disease. This structured representation makes the dataset useful for a wide range of applications, including machine learning-based disease prediction, clinical decision support systems, and epidemiological studies. Although there are some advancements in the field of disease-symptom datasets, there is a significant gap in structured datasets for the Bangla language. This dataset aims to bridge that gap by facilitating the development of multilingual medical informatics tools and improving disease prediction models for underrepresented linguistic communities. Further developments should include region-specific diseases and further fine-tuning of symptom associations for better diagnostic performance.


Article 441

Title@2025-07-26 (6): LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

Title: LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

LLM-Barber: Block-Aware Rebuilder für Sparsity Maske in One-Shot für große Sprachmodelle

LLM-Barber:大语言模型单点单层面罩块件重建器 2408.10631v2

Authors (9): Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Zhengfei Chen, Graziano Chesi, Ngai Wong, Hao Yu

Large language models (LLMs) have seen substantial growth, necessitating efficient model pruning techniques. Existing post-training pruning methods primarily measure weight importance in converged dense models, often overlooking changes in weight significance during the pruning process, leading to performance degradation. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, facilitating global performance optimization. We are the first to employ the product of weights and gradients as a pruning metric in the context of LLM post-training pruning. This enables accurate identification of weight importance in massive models and significantly reduces computational complexity compared to methods using second-order information. Our experiments show that LLM-Barber efficiently prunes models from LLaMA and OPT families (7B to 13B) on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at https://github.com/YupengSu/LLM-Barber.
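
The pruning metric mentioned above, the product of weights and gradients, can be illustrated on a flat list of weights. This is a minimal sketch and not the LLM-Barber code (which is linked above): the block-aware error optimization and mask rebuilding are not reproduced, and the function names, the toy weights, and the toy gradients are hypothetical.

```python
from typing import List


def importance(weights: List[float], grads: List[float]) -> List[float]:
    """Score each weight by |w * g|, the magnitude of the weight-gradient product."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def sparsity_mask(scores: List[float], sparsity: float) -> List[int]:
    """Return a 0/1 mask that prunes the lowest-scoring fraction `sparsity`."""
    n_prune = int(len(scores) * sparsity)
    # Sort positions by score, lowest first; ties are broken by position.
    order = sorted(range(len(scores)), key=lambda i: (scores[i], i))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(scores))]


# Toy weights and gradients (illustrative values only).
w = [0.9, -0.1, 0.5, -0.7]
g = [0.01, 0.8, -0.2, 0.0]
scores = importance(w, g)
mask = sparsity_mask(scores, sparsity=0.5)
print(mask)
```

In this toy case the two largest-magnitude weights are the ones pruned, because their gradients are near zero; pure magnitude pruning would have removed the small weight with the large gradient instead. Only first-order gradients are needed, which is the cost advantage over second-order criteria that the abstract points to.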


Article 442

Title@2025-07-26 (6): Flora: Effortless Context Construction to Arbitrary Length and Scale

Title: Flora: Effortless Context Construction to Arbitrary Length and Scale

Flora: Müheloser Kontext Aufbau zu willkürlicher Länge und Skala

Flora: 以任意长度和规模建造环境以达到任意长度和规模 2507.19786v1

Authors (8): Tianxiang Chen, Zhentao Tan, Xiaofan Bo, Yue Wu, Tao Gong, Qi Chu, Jieping Ye, Nenghai Yu

Effectively handling long contexts is challenging for Large Language Models (LLMs) due to the rarity of long texts, high computational demands, and substantial forgetting of short-context abilities. Recent approaches have attempted to construct long contexts for instruction tuning, but these methods often require LLMs or human interventions, which are both costly and limited in length and diversity. Also, the drop in short-context performances of present long-context LLMs remains significant. In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora can markedly enhance the long-context performance of LLMs by arbitrarily assembling short instructions based on categories and instructing LLMs to generate responses based on long-context meta-instructions. This enables Flora to produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel in three long-context benchmarks while maintaining strong performances in short-context tasks. Our data-construction code is available at https://github.com/txchen-USTC/Flora.


Article 443

Title@2025-07-26 (6): UloRL:An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models’ Reasoning Abilities

Title: UloRL:An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models’ Reasoning Abilities

UloRL:Ein Ultra-Long-Output-Verstärkungs-Lernansatz zur Förderung großer Sprachmodelle

UloRL: 推进大语言模式解释能力超长输出强化学习方法 2507.19766v1

Authors (5): Dong Du, Shulin Liu, Tao Yang, Shaohua Chen, Yang Li

Recent advances in large language models (LLMs) have highlighted the potential of reinforcement learning with verifiable rewards (RLVR) to enhance reasoning capabilities through extended output sequences. However, traditional RL frameworks face inefficiencies when handling ultra-long outputs due to long-tail sequence distributions and entropy collapse during training. To address these challenges, we propose an Ultra-Long Output Reinforcement Learning (UloRL) approach for advancing large language models’ reasoning abilities. Specifically, we divide ultra-long output decoding into short segments, enabling efficient training by mitigating delays caused by long-tail samples. Additionally, we introduce dynamic masking of well-Mastered Positive Tokens (MPTs) to prevent entropy collapse. Experimental results demonstrate the effectiveness of our approach. On the Qwen3-30B-A3B model, RL with segment rollout achieved a 2.06x increase in training speed, while RL training with 128k-token outputs improves the model’s performance on AIME2025 from 70.9% to 85.1% and on BeyondAIME from 50.7% to 61.9%, even surpassing Qwen3-235B-A22B with remarkable gains. These findings underscore the potential of our methods to advance the reasoning capabilities of LLMs with ultra-long sequence generation. We will release our code and model for further use by the community.


Article 444

Title@2025-07-26 (6): Are You There God? Lightweight Narrative Annotation of Christian Fiction with LMs

Title: Are You There God? Lightweight Narrative Annotation of Christian Fiction with LMs

Sind Sie dort Gott? Leichte narrative Anmerkung der christlichen Fiction mit LMs

轻量量级的基督教小说和LMs 2507.19756v1

Authors (5): Rebecca M. M. Hicke, Brian Haggard, Mia Ferrante, Rayhan Khanna, David Mimno

In addition to its more widely studied political activities, the American Evangelical movement has a well-developed but less externally visible cultural and literary side. Christian Fiction, however, has been little studied, and what scholarly attention there is has focused on the explosively popular Left Behind series. In this work, we use computational tools to provide both a broad topical overview of Christian Fiction as a genre and a more directed exploration of how its authors depict divine acts. Working with human annotators we first developed definitions and a codebook for “acts of God.” We then adapted those instructions designed for human annotators for use by a recent, lightweight LM with the assistance of a much larger model. The laptop-scale LM is capable of matching human annotations, even when the task is subtle and challenging. Using these annotations, we show that significant and meaningful differences exist between the Left Behind books and Christian Fiction more broadly and between books by male and female authors.


Article 445

Title@2025-07-26 (6): JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models

Title: JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models

JT-Math: Ein Multi-Stage-Framework für fortgeschrittene mathematische Vernunft in großen Sprachmodellen

JT- Math:大语言模型高级数学理由多阶段框架 2507.19748v1

Authors (9): Yifan Hao, Fangning Chao, Yaqian Hao, Zhaojun Cui, Huan Bai, Haiyu Zhang, Yankai Liu, Chao Deng, Junlan Feng

Mathematical reasoning is a cornerstone of artificial general intelligence and a primary benchmark for evaluating the capabilities of Large Language Models (LLMs). While state-of-the-art models show promise, they often falter when faced with complex problems that demand deep conceptual understanding and intricate, multi-step deliberation. To address this challenge, we introduce JT-Math-8B, a series of open-source models comprising base, instruct, and thinking versions, built upon a systematic, multi-stage optimization framework. Our pre-training corpus is a high-quality, 210B-token dataset curated through a dedicated data pipeline that uses model-based validation to ensure quality and diversity. The Instruct Model is optimized for direct, concise answers through Supervised Fine-Tuning (SFT) and a GRPO-based reinforcement learning (RL) method. The Thinking Model is trained for complex problem-solving using a Long Chain-of-Thought (Long CoT) approach, combining SFT with a novel, multi-stage RL curriculum that progressively increases task difficulty and context length up to 32K tokens. JT-Math-8B achieves state-of-the-art results among open-source models of similar size, surpassing prominent models like OpenAI’s O1-mini and GPT-4o, and demonstrating superior performance on competition-level mathematics.


Article 446

Title@2025-07-26 (6): Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation

Title: Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation

Assembly Your Crew: Automatisches Multi-Agenten-Kommunikationstopologie-Design über autoregressive Graphen-Generierung

通过自动递减图形生成将您的组群组合成:自动多剂多剂通信地形设计 2507.18224v2

Authors (5): Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, Shirui Pan

Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. The effectiveness of MAS is critically dependent on its collaboration topology, which has become a focal point for automated design research. However, existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures, significantly limiting their adaptability to task-specific requirements. To address these limitations, we reframe MAS design as a conditional autoregressive graph generation task, where both the system composition and structure are designed jointly. We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch. Conditioned on a natural language task query, ARG-Designer sequentially and dynamically determines the required number of agents, selects their appropriate roles from an extensible pool, and establishes the optimal communication links between them. This generative approach creates a customized topology in a flexible and extensible manner, precisely tailored to the unique demands of different tasks. Extensive experiments across six diverse benchmarks demonstrate that ARG-Designer not only achieves state-of-the-art performance but also enjoys significantly greater token efficiency and enhanced extensibility. The source code of ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.


Article 447

Title@2025-07-25 (5): Ta-G-T: Subjectivity Capture in Table to Text Generation via RDF Graphs

Title: Ta-G-T: Subjectivity Capture in Table to Text Generation via RDF Graphs

Ta-G-T: Subjektivitätserfassung in Tabelle zur Textgenerierung über RDF Graphen

TaG-T:通过 RDF 图表生成文本的表格中主观性捕获 2507.19710v1

Authors (3): Ronak Upasham, Tathagata Dey, Pushpak Bhattacharyya

In Table-to-Text (T2T) generation, existing approaches predominantly focus on providing objective descriptions of tabular data. However, generating text that incorporates subjectivity, where subjectivity refers to interpretations beyond raw numerical data, remains underexplored. To address this, we introduce a novel pipeline that leverages intermediate representations to generate both objective and subjective text from tables. Our three-stage pipeline consists of: 1) extraction of Resource Description Framework (RDF) triples, 2) aggregation of text into coherent narratives, and 3) infusion of subjectivity to enrich the generated text. By incorporating RDFs, our approach enhances factual accuracy while maintaining interpretability. Unlike large language models (LLMs) such as GPT-3.5, Mistral-7B, and Llama-2, our pipeline employs smaller, fine-tuned T5 models while achieving comparable performance to GPT-3.5 and outperforming Mistral-7B and Llama-2 in several metrics. We evaluate our approach through quantitative and qualitative analyses, demonstrating its effectiveness in balancing factual accuracy with subjective interpretation. To the best of our knowledge, this is the first work to propose a structured pipeline for T2T generation that integrates intermediate representations to enhance both factual correctness and subjectivity.
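
The first stage of the pipeline, turning table content into RDF-style triples, can be sketched as below. The abstract does not describe how the authors extract triples, so this row-wise mapping is an assumption; `row_to_triples`, the `subject_column` parameter, and the example row are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def row_to_triples(row: Dict[str, str], subject_column: str) -> List[Triple]:
    """Turn one table row into (subject, predicate, object) triples.

    The value in `subject_column` becomes the subject; every other
    column name becomes a predicate with its cell value as the object.
    """
    subject = row[subject_column]
    return [(subject, column, value)
            for column, value in row.items()
            if column != subject_column]


# Hypothetical table row used only for illustration.
row = {"team": "Team A", "wins": "12", "losses": "3"}
triples = row_to_triples(row, subject_column="team")
print(triples)
```

Each triple carries exactly one cell of the table, so later stages that aggregate triples into text can be checked against the source values, which is one plausible reading of how the intermediate representation supports factual accuracy.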


Article 448

Title@2025-07-25 (5): Scalable MatMul-free Language Modeling

Title: Scalable MatMul-free Language Modeling

Skalierbare MatMul-freie Sprachmodellierung

可缩放 MatMul 无语言建模 2406.02528v7

Authors (10): Rui-Jie Zhu, Yu Zhang, Steven Abreu, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Sumit Bam Shrestha, Peng Zhou, Jason K. Eshraghian

Large Language Models (LLMs) have fundamentally altered how we approach scaling in machine learning. However, these models pose substantial computational and memory challenges, primarily due to the reliance on matrix multiplication (MatMul) within their attention and feed-forward (FFN) layers. We demonstrate that MatMul operations can be eliminated from LLMs while maintaining strong performance, even at billion-parameter scales. Our MatMul-free models, tested on models up to 2.7B parameters, are comparable to state-of-the-art pre-trained Transformers, and the performance gap narrows as model size increases. Our approach yields significant memory savings: a GPU-efficient implementation reduces memory consumption by up to 61% during training and over 10x during inference. When adapted for a multi-chip neuromorphic system, the model leverages asynchronous processing to achieve 4x higher throughput with 10x less energy than edge GPUs.


Article 449

Title@2025-07-25 (5): Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks

Title: Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks

Towards Inclusive NLP: Bewertung komprimierter Mehrsprachiger Transformer über unterschiedliche Sprach-Benchmarks

实现包容性的《国家语言规划:评估跨越不同语文基准的压压压多语种变换器》 2507.19699v1

Authors (3): Maitha Alshehhi, Ahmed Sharshar, Mohsen Guizani

Although LLMs have attained significant success in high-resource languages, their capacity in low-resource linguistic environments like Kannada and Arabic is not yet fully understood. This work benchmarks the performance of multilingual and monolingual Large Language Models (LLMs) across Arabic, English, and Indic languages, with particular emphasis on the effects of model compression strategies such as pruning and quantization. Findings show significant performance differences driven by linguistic diversity and resource availability on SOTA LLMs such as BLOOMZ, AceGPT, Jais, LLaMA-2, XGLM, and AraGPT2. We find that multilingual versions of the model outperform their language-specific counterparts across the board, indicating substantial cross-lingual transfer benefits. Quantization (4-bit and 8-bit) is effective in maintaining model accuracy while promoting efficiency, but aggressive pruning significantly compromises performance, especially in bigger models. Our findings pinpoint key strategies to construct scalable and fair multilingual NLP solutions and underscore the need for interventions to address hallucination and generalization errors in the low-resource setting.
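To make the 8-bit versus 4-bit trade-off mentioned above concrete, here is a minimal sketch of symmetric uniform quantization. The abstract does not say which quantization scheme the authors used, so the scheme, the function names, and the toy weights are all assumptions for illustration only.

```python
def quantize(weights, bits=8):
    """Symmetric uniform quantization to signed integers (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate floating-point weights."""
    return [v * scale for v in q]


def max_round_trip_error(weights, bits):
    """Largest absolute error after quantizing and dequantizing."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Hypothetical weights, not taken from any real model.
toy_weights = [0.12, -0.98, 0.45, 0.003, -0.31]
for bits in (8, 4):
    print(bits, "bits -> max error", max_round_trip_error(toy_weights, bits))
```

On these toy values the 4-bit round trip loses more precision than the 8-bit one, which is the kind of accuracy-for-efficiency trade the abstract refers to.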


Article 450

Title@2025-07-25 (5): Salsa as a Nonverbal Embodied Language – The CoMPAS3D Dataset and Benchmarks

Title: Salsa as a Nonverbal Embodied Language – The CoMPAS3D Dataset and Benchmarks

Salsa als nonverbale Sprache – Der CoMPAS3D Datensatz und Benchmarks

Salsa 作为一种非语言的成形语言 – – CoMPAS3D数据集和基准 2507.19684v1

Authors (6): Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim

Imagine a humanoid that can safely and creatively dance with a human, adapting to its partner’s proficiency, using haptic signaling as a primary form of communication. While today’s AI systems excel at text or voice-based interaction with large language models, human communication extends far beyond text: it includes embodied movement, timing, and physical coordination. Modeling coupled interaction between two agents poses a formidable challenge: it is continuous, bidirectionally reactive, and shaped by individual variation. We present CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing, designed as a challenging testbed for interactive, expressive humanoid AI. The dataset includes 3 hours of leader-follower salsa dances performed by 18 dancers spanning beginner, intermediate, and professional skill levels. For the first time, we provide fine-grained salsa expert annotations, covering over 2,800 move segments, including move types, combinations, execution errors and stylistic elements. We draw analogies between partner dance communication and natural language, evaluating CoMPAS3D on two benchmark tasks for synthetic humans that parallel key problems in spoken language and dialogue processing: leader or follower generation with proficiency levels (speaker or listener synthesis), and duet (conversation) generation. Towards a long-term goal of partner dance with humans, we release the dataset, annotations, and code, along with a multitask SalsaAgent model capable of performing all benchmark tasks, alongside additional baselines to encourage research in socially interactive embodied AI and creative, expressive humanoid motion generation.


Article 451

Title: Navigating the Risks of Using Large Language Models for Text Annotation in Social Science Research

Navigation auf die Risiken der Verwendung großer Sprachmodelle für die Textannotation in der sozialwissenschaftlichen Forschung

利用大语言模式在社会科学研究中使用文字说明的风险 2503.22040v2

Authors (2): Hao Lin, Yongjun Zhang

Large language models (LLMs) have the potential to revolutionize computational social science, particularly in automated textual analysis. In this paper, we conduct a systematic evaluation of the promises and risks associated with using LLMs for text classification tasks, using social movement studies as an example. We propose a framework for social scientists to incorporate LLMs into text annotation, either as the primary coding decision-maker or as a coding assistant. This framework offers researchers tools to develop the potential best-performing prompt, and to systematically examine and report the validity and reliability of LLMs as a methodological tool. Additionally, we evaluate and discuss its epistemic risks associated with validity, reliability, replicability, and transparency. We conclude with several practical guidelines for using LLMs in text annotation tasks and offer recommendations for more effectively communicating epistemic risks in research.


Article 452

Title@2025-07-25 (5): Benchmarking Linguistic Diversity of Large Language Models

Title: Benchmarking Linguistic Diversity of Large Language Models

Benchmarking Linguistische Vielfalt großer Sprachmodelle

衡量大语言模式语言多样性的基准 2412.10271v2

Authors (3): Yanzhu Guo, Guokan Shang, Chloé Clavel

The development and evaluation of Large Language Models (LLMs) has primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether machine-generated language matches the human level of diversity, in terms of vocabulary choice, syntactic construction, and expression of meaning, raising questions about whether the fundamentals of language generation have been fully addressed. This paper emphasizes the importance of examining the preservation of human linguistic richness by language models, given the concerning surge in online content produced or aided by LLMs. We propose a comprehensive framework for evaluating LLMs from various linguistic diversity perspectives including lexical, syntactic, and semantic dimensions. Using this framework, we benchmark several state-of-the-art LLMs across all diversity dimensions, and conduct an in-depth case study for syntactic diversity. Finally, we analyze how different development and deployment choices impact the linguistic diversity of LLM outputs.
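As a small illustration of the lexical dimension of diversity, the sketch below computes a distinct-n ratio (unique n-grams over total n-grams). This is one common lexical diversity measure chosen here as an assumption; the abstract does not state which metrics the framework actually uses, and the function name and sample texts are hypothetical.

```python
def distinct_n(texts, n=1):
    """Ratio of unique n-grams to total n-grams over whitespace tokens."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Hypothetical outputs: one repetitive set and one varied set.
repetitive = ["the cat sat", "the cat sat", "the cat sat"]
varied = ["the cat sat", "a dog barked", "birds sing loudly"]

print(distinct_n(repetitive))  # 3 unique unigrams out of 9
print(distinct_n(varied))      # 9 unique unigrams out of 9
```

A lower ratio signals more repetition across outputs. Syntactic and semantic diversity would need parsers or embeddings and are not covered by this sketch.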


Article 453

Title@2025-07-25 (5): Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs

Title: Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs

Haben große Sprachmodelle einen englischen Akzent? Bewertung und Verbesserung der Natürlichkeit von mehrsprachigen LLMs

大语言模式是否有英语中心? 2410.15956v3

Authors (6): Yanzhu Guo, Simone Conia, Zelin Zhou, Min Li, Saloni Potdar, Henry Xiao

Current Large Language Models (LLMs) are predominantly designed with English as the primary language, and even the few that are multilingual tend to exhibit strong English-centric biases. Much like speakers who might produce awkward expressions when learning a second language, LLMs often generate unnatural outputs in non-English languages, reflecting English-centric patterns in both vocabulary and grammar. Despite the importance of this issue, the naturalness of multilingual LLM outputs has received limited attention. In this paper, we address this gap by introducing novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of LLM outputs in a multilingual context. Using our new metrics, we evaluate state-of-the-art LLMs on a curated benchmark in French and Chinese, revealing a tendency towards English-influenced patterns. To mitigate this issue, we also propose a simple and effective alignment method to improve the naturalness of an LLM in a target language and domain, achieving consistent improvements in naturalness without compromising the performance on general-purpose benchmarks. Our work highlights the importance of developing multilingual metrics, resources and methods for the new wave of multilingual LLMs.


Article 454

Title@2025-07-25 (5): RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams

Title: RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams

RoD-TAL: Ein Benchmark für die Beantwortung von Fragen in rumänischen Führerscheinprüfungen

RoD-TAL:在罗马尼亚驾驶执照考试中回答问题的基准 2507.19666v1

Authors (6): Andrei Vlad Man, Răzvan-Alexandru Smădu, Cristian-George Craciun, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel

The intersection of AI and legal systems presents a growing need for tools that support legal education, particularly in under-resourced languages such as Romanian. In this work, we aim to evaluate the capabilities of Large Language Models (LLMs) and Vision-Language Models (VLMs) in understanding and reasoning about Romanian driving law through textual and visual question-answering tasks. To facilitate this, we introduce RoD-TAL, a novel multimodal dataset comprising Romanian driving test questions, text-based and image-based, alongside annotated legal references and human explanations. We implement and assess retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models across tasks including Information Retrieval (IR), Question Answering (QA), Visual IR, and Visual QA. Our experiments demonstrate that domain-specific fine-tuning significantly enhances retrieval performance. At the same time, chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum grades required to pass driving exams. However, visual reasoning remains challenging, highlighting the potential and the limitations of applying LLMs and VLMs to legal education.


Article 455

Title@2025-07-25 (5): Code-Switching and Syntax: A Large-Scale Experiment

Title: Code-Switching and Syntax: A Large-Scale Experiment

Code-Schalten und Syntax: Ein groß angelegtes Experiment

代码开动和语法:大规模实验 2506.01846v2

Authors (2): Igor Sterner, Simone Teufel

The theoretical code-switching (CS) literature provides numerous pointwise investigations that aim to explain patterns in CS, i.e. why bilinguals switch language in certain positions in a sentence more often than in others. A resulting consensus is that CS can be explained by the syntax of the contributing languages. There is however no large-scale, multi-language, cross-phenomena experiment that tests this claim. When designing such an experiment, we need to make sure that the system that is predicting where bilinguals tend to switch has access only to syntactic information. We provide such an experiment here. Results show that syntax alone is sufficient for an automatic system to distinguish between sentences in minimal pairs of CS, to the same degree as bilingual humans. Furthermore, the learnt syntactic patterns generalise well to unseen language pairs.


Article 456

Title@2025-07-25 (5): Minimal Pair-Based Evaluation of Code-Switching

Title: Minimal Pair-Based Evaluation of Code-Switching

Minimale Pair-basierte Auswertung von Code-Switching

对代码转换的最小对等评价 2506.01840v2

Authors (2): Igor Sterner, Simone Teufel

There is a lack of an evaluation methodology that estimates the extent to which large language models (LLMs) use code-switching (CS) in the same way as bilinguals. Existing methods do not have wide language coverage, fail to account for the diverse range of CS phenomena, or do not scale. We propose an intervention based on minimal pairs of CS. Each minimal pair contains one naturally occurring CS sentence and one minimally manipulated variant. We collect up to 1,000 such pairs each for 11 language pairs. Our human experiments show that, for every language pair, bilinguals consistently prefer the naturally occurring CS sentence. Meanwhile our experiments with current LLMs show that the larger the model, the more consistently it assigns higher probability to the naturally occurring CS sentence than to the variant. In accordance with theoretical claims, the largest probability differences arise in those pairs where the manipulated material consisted of closed-class words.
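The minimal-pair comparison described above can be sketched as a simple scoring loop: for each pair, check whether the model gives the naturally occurring sentence a higher probability than its manipulated variant. The function name and the log-probability values below are hypothetical; obtaining real sentence log-probabilities from an LLM is outside this sketch.

```python
def minimal_pair_accuracy(pairs):
    """Fraction of pairs where the natural sentence scores strictly higher.

    `pairs` is a list of (logprob_natural, logprob_manipulated) tuples.
    """
    if not pairs:
        raise ValueError("need at least one minimal pair")
    wins = sum(1 for natural, manipulated in pairs if natural > manipulated)
    return wins / len(pairs)


# Hypothetical sentence log-probabilities, not real model outputs.
toy_scores = [(-41.2, -43.0), (-38.5, -37.9), (-52.1, -55.4), (-29.0, -30.2)]
print(minimal_pair_accuracy(toy_scores))  # 0.75
```

Because both sentences in a pair differ only minimally, comparing raw sentence log-probabilities is a reasonable first approximation; length normalization would be a further design choice.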


Article 457

Title@2025-07-25 (5): Summarization of Opinionated Political Documents with Varied Perspectives

Title: Summarization of Opinionated Political Documents with Varied Perspectives

Zusammenfassung opinionierter politischer Dokumente mit unterschiedlichen Perspektiven

具有不同观点的有见解的政治文件概述 2411.04093v2

Authors (2): Nicholas Deas, Kathleen McKeown

Global partisan hostility and polarization have increased, and this polarization is heightened around presidential elections. Models capable of generating accurate summaries of diverse perspectives can help reduce such polarization by exposing users to alternative perspectives. In this work, we introduce a novel dataset and task for independently summarizing each political perspective in a set of passages from opinionated news articles. For this task, we propose a framework for evaluating different dimensions of perspective summary performance. We benchmark 11 summarization models and LLMs of varying sizes and architectures through both automatic and human evaluation. While recent models like GPT-4o perform well on this task, we find that all models struggle to generate summaries that are faithful to the intended perspective. Our analysis of summaries focuses on how extraction behavior is impacted by features of the input documents.


Article 458

Title@2025-07-25 (5): OneShield – the Next Generation of LLM Guardrails

Title: OneShield – the Next Generation of LLM Guardrails

OneShield – die nächste Generation der LLM-Guardrails

OneShild – – 下一代LLM护卫车 2507.21170v1

Authors (10): Chad DeLuca, Anna Lisa Gentile, Shubhi Asthana, Bing Zhang, Pawan Chowdhary, Kellen Cheng, Basel Shbita, Pengyuan Li, Guang-Jie Ren, Sandeep Gopisetty

The rise of Large Language Models has created a general excitement about the great potential for a myriad of applications. While LLMs offer many possibilities, questions about safety, privacy, and ethics have emerged, and all the key actors are working to address these issues with protective measures for their own models and standalone solutions. The constantly evolving nature of LLMs makes the task of universally shielding users against their potential risks extremely challenging, and one-size-fits-all solutions unfeasible. In this work, we propose OneShield, our stand-alone, model-agnostic and customizable solution to safeguard LLMs. OneShield aims to provide facilities for defining risk factors, expressing and declaring contextual safety and compliance policies, and mitigating LLM risks, with a focus on each specific customer. We describe the implementation of the framework, the scalability considerations and provide usage statistics of OneShield since its first deployment.


Article 459

Title@2025-07-25 (5): Data Caricatures: On the Representation of African American Language in Pretraining Corpora

Title: Data Caricatures: On the Representation of African American Language in Pretraining Corpora

Daten Karikaturen: Zur Darstellung der afroamerikanischen Sprache im Vortraining Corpora

数据制图:关于非洲裔美国人语言在预科公司中的代表性 2503.10789v2

Authors (8): Nicholas Deas, Blake Vente, Amith Ananthram, Jessica A. Grieser, Desmond Patton, Shana Kleiner, James Shepard, Kathleen McKeown

With a combination of quantitative experiments, human judgments, and qualitative analyses, we evaluate the quantity and quality of African American Language (AAL) representation in 12 predominantly English, open-source pretraining corpora. We specifically focus on the sources, variation, and naturalness of included AAL texts representing the AAL-speaking community. We find that AAL is underrepresented in all evaluated pretraining corpora compared to US demographics, constituting as few as 0.007% and at most 0.18% of documents. We also find that more than 25% of AAL texts in C4 may be perceived as inappropriate for LLMs to generate and to reinforce harmful stereotypes. Finally, we find that most automated filters are more likely to conserve White Mainstream English (WME) texts over AAL in pretraining corpora.


Article 460

Title@2025-07-25 (5): Opacity as Authority: Arbitrariness and the Preclusion of Contestation

Title: Opacity as Authority: Arbitrariness and the Preclusion of Contestation

Opacity as Authority: Willkür und die Präklusion der Anfechtung

作为权力的不透明度:仲裁和排除争议 2507.22944v1

Authors (1): Naomi Omeonga wa Kayembe

This article redefines arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism structuring human systems and interactions. Diverging from critical traditions that conflate arbitrariness with injustice, it posits arbitrariness as a semiotic trait: a property enabling systems - linguistic, legal, or social - to operate effectively while withholding their internal rationale. Building on Ferdinand de Saussure’s concept of l’arbitraire du signe, the analysis extends this principle beyond language to demonstrate its cross-domain applicability, particularly in law and social dynamics. The paper introduces the “Motivation -> Constatability -> Contestability” chain, arguing that motivation functions as a crucial interface rendering an act’s logic vulnerable to intersubjective contestation. When this chain is broken through mechanisms like “immotivization” or “Conflict Lateralization” (exemplified by “the blur of the wolf drowned in the fish”), acts produce binding effects without exposing their rationale, thus precluding justiciability. This structural opacity, while appearing illogical, is a deliberate design protecting authority from accountability. Drawing on Shannon’s entropy model, the paper formalizes arbitrariness as $A = H(L \mid M)$ (conditional entropy). It thereby proposes a modern theory of arbitrariness as a neutral operator central to control as well as care, an overlooked dimension of interpersonal relations. While primarily developed through human social systems, this framework also illuminates a new pathway for analyzing explainability in advanced artificial intelligence systems.
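As a worked illustration of the conditional-entropy form A = H(L | M), the sketch below computes H(L | M) for two toy joint distributions. The abstract does not define L and M, so they are treated here simply as two discrete random variables; the function name and the probabilities are assumptions for illustration.

```python
import math
from collections import defaultdict


def conditional_entropy(joint):
    """Return H(L | M) in bits for a joint distribution {(l, m): p}."""
    # Marginal p(m), obtained by summing the joint over l.
    p_m = defaultdict(float)
    for (_, m), p in joint.items():
        p_m[m] += p
    # H(L | M) = -sum over (l, m) of p(l, m) * log2(p(l, m) / p(m))
    h = 0.0
    for (_, m), p in joint.items():
        if p > 0:
            h -= p * math.log2(p / p_m[m])
    return h


# Toy case 1: L is fully determined by M, so no uncertainty remains.
determined = {("a", "x"): 0.5, ("b", "y"): 0.5}
# Toy case 2: L is uniform over two values and independent of M.
independent = {
    ("a", "x"): 0.25, ("b", "x"): 0.25,
    ("a", "y"): 0.25, ("b", "y"): 0.25,
}

print(conditional_entropy(determined))   # 0.0
print(conditional_entropy(independent))  # 1.0
```

Under this reading, a value of 0 bits means knowing M leaves nothing about L unexplained, while larger values mean more of L remains unaccounted for once M is known.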


Article 461

Title@2025-07-25 (5): MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Title: MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

MCIF: Multimodale Crosslingual Instruction-Following Benchmark aus wissenschaftlichen Vorträgen

MCIF: 科学会谈的多模式跨语言教学基准 2507.19634v1

Authors (8): Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues

Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on one single modality at a time, rely on short-form contexts, or lack human annotations – hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities – speech, vision, and text – and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs’ abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLMs development.


Article 462

Title@2025-07-25 (5): LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

Title: LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

LoX: Low-Rank-Extrapolation stärkt LLM-Sicherheit gegen Feinabstimmung

LoX:低Rank外推法强力推力LLM 安全防止微调 2506.15606v3

Authors (6): Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong

Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning - even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model’s adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) under benign or malicious fine-tuning attacks. By investigating the ASR landscape of parameters, we attribute the success of LoX to the extrapolation moving LLM parameters to a flatter zone, where they are less sensitive to perturbations. The code is available at github.com/VITA-Group/LoX.


Article 463

Title@2025-07-25 (5): HITSZ’s End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track

Title: HITSZ’s End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track

HITSZs End-to-End-Sprachübersetzungssysteme zur Kombination von Sequenz-zu-Sequenz-Auto-Spracherkennungsmodell und indic Large Language Model für IWSLT 2025 in Indic Track

HITSZ的端到端语音翻译系统,将序列到序列自动语音识别模型和2025 IWSLT Indic Track IWSLT 2025 的指数式大语言模型结合起来 2507.19616v1

Authors (7): Xuchen Wei, Yangxin Wu, Yaoyin Zhang, Henglyu Liu, Kehai Chen, Xuefeng Bai, Min Zhang

This paper presents HITSZ’s submission for the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. To enhance translation quality in this low-resource scenario, we propose an end-to-end system integrating the pre-trained Whisper automated speech recognition (ASR) model with Krutrim, an Indic-specialized large language model (LLM). Experimental results demonstrate that our end-to-end system achieved average BLEU scores of $28.88$ for English-to-Indic directions and $27.86$ for Indic-to-English directions. Furthermore, we investigated the Chain-of-Thought (CoT) method. While this method showed potential for significant translation quality improvements on successfully parsed outputs (e.g. a $13.84$ BLEU increase for Tamil-to-English), we observed challenges in ensuring the model consistently adheres to the required CoT output format.


Article 464

Title@2025-07-25 (5): MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?

Title: MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?

MOCHA: Sind Code-Sprachenmodelle gegen multi-Turn bösartige Coding-Prompts robust?

MOCHA:守则语言模型是否强力打击多发恶意编码的提示? 2507.19598v1

Authors (8): Muntasir Wahed, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Nirav Diwan, Gang Wang, Dilek Hakkani-Tür, Ismini Lourentzou

Recent advancements in Large Language Models (LLMs) have significantly enhanced their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work, we introduce code decomposition attacks, where a malicious coding task is broken down into a series of seemingly benign subtasks across multiple conversational turns to evade safety filters. To facilitate systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially under multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability, and importantly, enhances robustness on external adversarial datasets with up to 32.4% increase in rejection rates without any additional supervision.


Article 465

Title@2025-07-25 (5): Efficient Attention Mechanisms for Large Language Models: A Survey

Title: Efficient Attention Mechanisms for Large Language Models: A Survey

Effiziente Aufmerksamkeitsmechanismen für große Sprachmodelle: Eine Umfrage

高效率关注大语言模式机制:调查 2507.19595v1

Authors (7): Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang

Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address this limitation, recent research has introduced two principal categories of efficient attention mechanisms. Linear attention methods achieve linear complexity through kernel approximations, recurrent formulations, or fast-weight dynamics, thereby enabling scalable inference with reduced computational overhead. Sparse attention techniques, in contrast, limit attention computation to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies, enhancing efficiency while preserving contextual coverage. This survey provides a systematic and comprehensive overview of these developments, integrating both algorithmic innovations and hardware-level considerations. In addition, we analyze the incorporation of efficient attention into large-scale pre-trained language models, including both architectures built entirely on efficient attention and hybrid designs that combine local and global components. By aligning theoretical foundations with practical deployment strategies, this work aims to serve as a foundational reference for advancing the design of scalable and efficient language models.
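The kernel route to linear complexity can be sketched in a few lines. The example below is a non-causal toy in pure Python, assuming the feature map elu(x) + 1 (one common choice, not necessarily the one any surveyed method uses); all names and values are hypothetical. It shows that summing phi(k) v^T once over the keys gives the same output as the pairwise, quadratic computation.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def quadratic_attention(Q, K, V):
    """Kernelized attention computed pairwise: O(n^2) in sequence length."""
    out = []
    for q in Q:
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        z = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / z
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same result via running sums S = sum phi(k) v^T and z = sum phi(k): O(n)."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):  # single pass over keys and values
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in Q:  # each query reuses the fixed-size summaries
        fq = phi(q)
        denom = sum(a * b for a, b in zip(fq, z))
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out


# Hypothetical toy inputs: 3 tokens, 2-dimensional queries, keys, and values.
Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[1.0, 0.0], [-0.5, 2.0], [0.3, -0.7]]
V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]

slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
print(all(abs(a - b) < 1e-9 for r1, r2 in zip(slow, fast) for a, b in zip(r1, r2)))  # True
```

The summaries S and z have a size that depends only on the feature dimensions, not on sequence length, which is where the linear cost comes from. Note that this replaces the softmax kernel rather than reproducing it exactly.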


Article 466

Title@2025-07-25 (5): Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning

Title: Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning

Geospatielles Wissen abmildern Halluzination in großen Sprachmodellen: Benchmarking und Dynamische Faktizität Ausrichtung

减轻大语言模式中的地理空间知识幻觉:基准和动态事实对齐 2507.19586v1

Authors (5): Shengyuan Wang, Jie Feng, Tianhui Liu, Dan Pei, Yong Li

Large language models (LLMs) possess extensive world knowledge, including geospatial knowledge, which has been successfully applied to various geospatial tasks such as mobility prediction and social indicator prediction. However, LLMs often generate inaccurate geospatial knowledge, leading to geospatial hallucinations (incorrect or inconsistent representations of geospatial information) that compromise their reliability. While the phenomenon of general knowledge hallucination in LLMs has been widely studied, the systematic evaluation and mitigation of geospatial hallucinations remain largely unexplored. To address this gap, we propose a comprehensive evaluation framework for geospatial hallucinations, leveraging structured geospatial knowledge graphs for controlled assessment. Through extensive evaluation across 20 advanced LLMs, we uncover the hallucinations in their geospatial knowledge. Building on these insights, we introduce a dynamic factuality aligning method based on Kahneman-Tversky Optimization (KTO) to mitigate geospatial hallucinations in LLMs, leading to a performance improvement of over 29.6% on the proposed benchmark. Extensive experimental results demonstrate the effectiveness of our benchmark and learning algorithm in enhancing the trustworthiness of LLMs in geospatial knowledge and reasoning tasks.


Article 467

Title@2025-07-25 (5): MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

Title: MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

MMBench-GUI: Hierarchischer Mehrplattform-Evaluierungsrahmen für GUI-Agenten

MMMBench-GUI:图形用户界面代理器的等级多平台评价框架 2507.19478v1

Authors (28): Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing a critical role. More importantly, task efficiency remains a critically underexplored dimension, and all models suffer from substantial inefficiencies, with excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.


Article 468

Title@2025-07-25 (5): Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts

Title: Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts

Weiterentwicklung der Event-Prognose durch massives Training von großen Sprachmodellen: Herausforderungen, Lösungen und breitere Auswirkungen

通过大规模培训大语言模式:挑战、解决办法和更广泛影响 2507.19477v1

Authors (4): Sang-Woo Lee, Sohee Yang, Donghyun Kwak, Noah Y. Siegel

Many recent papers have studied the development of superforecaster-level event forecasting LLMs. While methodological problems with early studies cast doubt on the use of LLMs for event forecasting, recent studies with improved evaluation methods have shown that state-of-the-art LLMs are gradually reaching superforecaster-level performance, and reinforcement learning has also been reported to improve future forecasting. Additionally, the unprecedented success of recent reasoning models and Deep Research-style models suggests that technology capable of greatly improving forecasting performance has been developed. Therefore, based on these positive recent trends, we argue that the time is ripe for research on large-scale training of superforecaster-level event forecasting LLMs. We discuss two key research directions: training methods and data acquisition. For training, we first introduce three difficulties of LLM-based event forecasting training: noisiness-sparsity, knowledge cut-off, and simple reward structure problems. Then, we present related ideas to mitigate these problems: hypothetical event Bayesian networks, utilizing poorly-recalled and counterfactual events, and auxiliary reward signals. For data, we propose aggressive use of market, public, and crawling datasets to enable large-scale training and evaluation. Finally, we explain how these technical advances could enable AI to provide predictive intelligence to society in broader areas. This position paper presents promising specific paths and considerations for getting closer to superforecaster-level AI technology, aiming to call for researchers’ interest in these directions.


Article 469

Title@2025-07-25 (5): Long-Form Answers to Visual Questions from Blind and Low Vision People

Title: Long-Form Answers to Visual Questions from Blind and Low Vision People

Langform-Antworten auf visuelle Fragen von Blinden und Sehbehinderten

对盲人和低视力者视觉问题的长期答复 2408.06303v2

Authors (8): Mina Huh, Fangyuan Xu, Yi-Hao Peng, Chongyan Chen, Hansika Murugu, Danna Gurari, Eunsol Choi, Amy Pavel

Vision language models can now generate long-form answers to questions about images - long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to 600 visual questions, collected from human expert describers and six VQA models. We develop and annotate functional roles of sentences of LFVQA and demonstrate that long-form answers contain information beyond the question answer such as explanations and suggestions. We further conduct automatic and human evaluations with BLV and sighted people to evaluate long-form answers. BLV people perceive both human-written and generated long-form answers to be plausible, but generated answers often hallucinate incorrect visual details, especially for unanswerable visual questions (e.g., blurry or irrelevant images). To reduce hallucinations, we evaluate the ability of VQA models to abstain from answering unanswerable questions across multiple prompting strategies.


Article 470

Title@2025-07-25 (5): Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models

Title: Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models

Gespräche sind schief gegangen, aber dann? Evaluieren von Gesprächsvorhersagemodellen

对话消失,但后来呢?评价对话预测模型 2507.19470v1

Authors (5): Son Quoc Tran, Tushaar Gangavarapu, Nicholas Chernogor, Jonathan P. Chang, Cristian Danescu-Niculescu-Mizil

We often rely on our intuition to anticipate the direction of a conversation. Endowing automated systems with similar foresight can enable them to assist human-human interactions. Recent work on developing models with this predictive capacity has focused on the Conversations Gone Awry (CGA) task: forecasting whether an ongoing conversation will derail. In this work, we revisit this task and introduce the first uniform evaluation framework, creating a benchmark that enables direct and reliable comparisons between different architectures. This allows us to present an up-to-date overview of the current progress in CGA models, in light of recent advancements in language modeling. Our framework also introduces a novel metric that captures a model’s ability to revise its forecast as the conversation progresses.


Article 471

Title@2025-07-25 (5): RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Title: RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

RADLADS: Schnelle Aufmerksamkeitsdestillation zu linearen Aufmerksamkeitsdecodern auf Scale

RADLADS: 缩放线性引引代码的快速注意蒸馏 2505.03005v3

Authors (4): Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah

We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today’s prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models: https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102. Training code: https://github.com/recursal/RADLADS-paper


Article 472

Title@2025-07-25 (5): GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Title: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

GEPA: Reflektierende Prompt-Evolution kann Verstärkungs-Lernen übertreffen

GEPA: 反思即时进化能够超过成绩的强化学习 2507.19457v1

Authors (17): Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab

Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA’s design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.


Article 473

Title@2025-07-25 (5): A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies

Title: A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies

借助自然语言处理和多波适应性取样的图表审查过程,以加快大型数据库研究代码算法的验证工作 2507.22943v1

Authors (16): Shirley V Wang, Georg Hahn, Sushama Kattinakere Sreedhara, Mufaddal Mahesri, Haritha S. Pillai, Rajendra Aldis, Joyce Lii, Sarah K. Dutcher, Rhoda Eniafe, Jamal T. Jones, Keewan Kim, Jiwei He, Hana Lee, Sengwee Toh, Rishi J Desai, Jie Yang

Background: One of the ways to enhance analyses conducted with large claims databases is by validating the measurement characteristics of code-based algorithms used to identify health outcomes or other key study parameters of interest. These metrics can be used in quantitative bias analyses to assess the robustness of results for an inferential study given potential bias from outcome misclassification. However, extensive time and resource allocation are typically required to create reference-standard labels through manual chart review of free-text notes from linked electronic health records. Methods: We describe an expedited process that introduces efficiency in a validation study using two distinct mechanisms: 1) use of natural language processing (NLP) to reduce time spent by human reviewers to review each chart, and 2) a multi-wave adaptive sampling approach with pre-defined criteria to stop the validation study once performance characteristics are identified with sufficient precision. We illustrate this process in a case study that validates the performance of a claims-based outcome algorithm for intentional self-harm in patients with obesity. Results: We empirically demonstrate that the NLP-assisted annotation process reduced the time spent on review per chart by 40% and use of the pre-defined stopping rule with multi-wave samples would have prevented review of 77% of patient charts with limited compromise to precision in derived measurement characteristics. Conclusion: This approach could facilitate more routine validation of code-based algorithms used to define key study parameters, ultimately enhancing understanding of the reliability of findings derived from database studies.
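The multi-wave stopping idea in the Methods can be illustrated with a minimal sketch. The abstract does not state the actual pre-defined criteria, so everything here is an assumption made for illustration: the choice of a Wilson score interval, the `max_half_width` threshold, and the helper names `wilson_half_width` and `review_in_waves`. Charts are reviewed in waves, and review stops once the confidence interval around the estimated positive predictive value is narrow enough.

```python
import math
from typing import List, Tuple


def wilson_half_width(successes: int, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    return z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom


def review_in_waves(
    labels: List[bool], wave_size: int, max_half_width: float
) -> Tuple[float, int]:
    """Review chart labels wave by wave; stop once the interval is narrow enough.

    `labels` stands in for reviewer decisions (True = algorithm-flagged case
    confirmed by chart review). Returns (estimated PPV, charts reviewed).
    """
    reviewed = 0
    confirmed = 0
    while reviewed < len(labels):
        wave = labels[reviewed:reviewed + wave_size]
        reviewed += len(wave)
        confirmed += sum(wave)
        # Hypothetical stopping rule: precision target on the PPV estimate.
        if wilson_half_width(confirmed, reviewed) <= max_half_width:
            break
    return confirmed / reviewed, reviewed


# Synthetic labels for illustration only: 80% of flagged charts are confirmed.
synthetic_labels = ([True] * 8 + [False] * 2) * 10
ppv, charts_reviewed = review_in_waves(synthetic_labels, wave_size=20, max_half_width=0.11)
print(f"PPV={ppv:.2f} after {charts_reviewed} of {len(synthetic_labels)} charts")
```

In this toy run the rule stops before the full sample is reviewed, which is the kind of saving the abstract reports. The real savings depend on the criteria and data used in the study.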


Article 474

Title@2025-07-25 (5): Distillation Scaling Laws

Title: Distillation Scaling Laws

Destillationsskalierungsgesetze

强化法律 2502.08606v2

Authors (6): Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb

We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.


Article 475

Title@2025-07-25 (5): TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability

Title: TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability

TokenSmith: Verstärkte Datenbearbeitung, Suche und Inspektion für großformatige Sprachmodellschulungen und -dolmetschbarkeit

TokenSmith:简化数据编辑、搜索和检查,以进行大型语文模式培训和解释 2507.19419v1

Authors (8): Mohammad Aflah Khan, Ameya Godbole, Johnny Tian-Zheng Wei, Ryan Wang, James Flemings, Krishna Gummadi, Willie Neiswanger, Robin Jia

Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations including searching, viewing, ingesting, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug-and-play addition to existing large language model pretraining workflows, thereby democratizing access to production-grade dataset tooling. TokenSmith is hosted on GitHub, with accompanying documentation and tutorials. A demonstration video is also available on YouTube.


Article 476

Title@2025-07-25 (5): Towards Domain Specification of Embedding Models in Medicine

Title: Towards Domain Specification of Embedding Models in Medicine

Auf dem Weg zur Domain-Spezifikation von Einbettungsmodellen in die Medizin

走向医学嵌入模型的域域指定 2507.19407v1

Authors (4): Mohammad Khodadad, Ali Shiraee, Mahdi Astaraki, Hamidreza Mahyar

Medical text embedding models are foundational to a wide array of healthcare applications, ranging from clinical decision support and biomedical information retrieval to medical question answering, yet they remain hampered by two critical shortcomings. First, most models are trained on a narrow slice of medical and biological data and rely on outdated methodology, making them ill-suited to capture the diversity of terminology and semantics encountered in practice. Second, existing evaluations are often inadequate: even widely used benchmarks fail to generalize across the full spectrum of real-world medical tasks. To address these gaps, we leverage MEDTE, a GTE model extensively fine-tuned on diverse medical corpora through self-supervised contrastive learning across multiple data sources, to deliver robust medical text embeddings. Alongside this model, we propose a comprehensive benchmark suite of 51 tasks spanning classification, clustering, pair classification, and retrieval, modeled on the Massive Text Embedding Benchmark (MTEB) but tailored to the nuances of medical text. Our results demonstrate that this combined approach not only establishes a robust evaluation framework but also yields embeddings that consistently outperform state-of-the-art alternatives in different tasks.


Article 477

Title@2025-07-25 (5): CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback

Title: CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback

CodeEvo: Interaktionsgetriebene Synthese codezentrierter Daten durch hybrides und iteratives Feedback

代码化:通过混合和循环反馈对以代码为中心的数据进行互动驱动合成 2507.22080v1

Authors (5): Qiushi Sun, Jinyang Gong, Lei Li, Qipeng Guo, Fei Yuan

Acquiring high-quality instruction-code pairs is essential for training Large Language Models (LLMs) for code generation. Manually curated data is expensive and inherently limited in scale, motivating the development of code-centric synthesis methods. Yet, current approaches either focus on augmenting existing code or rely on predefined heuristics, both lacking rigorous data validation, which results in synthetic data that is ungrounded, repetitive, or overly simplistic. Inspired by collaborative programming practices, we propose CodeEvo, a framework that synthesizes code data through iterative interactions between two LLM agents: a Coder, which generates candidate code and test cases based on given instructions, and a Reviewer, which guides the synthesis process by producing new instructions and feedback. We further introduce a hybrid feedback mechanism that combines compiler determinism with the generative flexibility of agents, enabling automatic quality control throughout synthesis. Extensive experiments demonstrate that models fine-tuned on CodeEvo data significantly outperform established baselines across code generation benchmarks with various difficulties. In-depth analyses further provide insights from multiple perspectives into effective code-centric data synthesis.


Article 478

Title@2025-07-25 (5): Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

Title: Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

Vielfältige LLMs oder unterschiedliche Frageinterpretationen? Das ist die Assembling-Frage

不同的LLMs或不同的问题解释? 2507.21168v1

Authors (2): Rafael Rosales, Santiago Miret

Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. For both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on boolq, strategyqa, and pubmedqa show that question interpretation diversity consistently leads to better ensemble accuracy compared to model diversity. Furthermore, our analysis of GPT and LLaMa shows that model diversity typically produces results between the best and the worst ensemble members without clear improvement.
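The majority-voting consensus described above can be sketched in a few lines. This is an illustrative example only: the vote lists are made-up placeholders rather than results from the paper, and resolving ties to `False` is an arbitrary assumption.

```python
from collections import Counter
from typing import Iterable


def majority_vote(answers: Iterable[bool]) -> bool:
    """Return the majority answer to a binary question.

    Ties resolve to False here; that is an arbitrary choice for this sketch.
    """
    counts = Counter(answers)
    return counts[True] > counts[False]


# Hypothetical ensemble outputs, for illustration only.
model_diversity_votes = [True, False, True]               # three models, one phrasing
interpretation_votes = [True, True, False, True, False]   # one model, five phrasings

print(majority_vote(model_diversity_votes))
print(majority_vote(interpretation_votes))
```

Both ensembling strategies share this aggregation step. They differ only in where the individual votes come from.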


Article 479

Title@2025-07-25 (5): Data Augmentation for Spoken Grammatical Error Correction

Title: Data Augmentation for Spoken Grammatical Error Correction

Datenvergrößerung für gesprochene Grammatical Error Correction

语音语法错误校正的数据增强 2507.19374v1

Authors (5): Penny Karanasou, Mengjie Qian, Stefano Bannò, Mark J. F. Gales, Kate M. Knill

While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Moreover, we propose a series of objective metrics that can be used to evaluate the generated data and choose the more suitable dataset for SGEC. The goal is to generate an augmented dataset that maintains the textual and acoustic characteristics of the original data while providing new types of errors. This augmented dataset should augment and enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the use of the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S&I Corpus, the first publicly available speech dataset with grammar error annotations.


Article 480

Title@2025-07-25 (5): LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences

Title: LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences

LOTUS: Ein Leaderboard für detaillierte Bildunterschriften von Qualität zu gesellschaftlichen Bias und Benutzereinstellungen

LOTUS: 从质量到社会偏见和用户首选的详细图像描述领导板 2507.19362v1

Authors (10): Yusuke Hirota, Boyi Li, Ryo Hachiuma, Yueh-Hua Wu, Boris Ivanovic, Yuta Nakashima, Marco Pavone, Yejin Choi, Yu-Chiang Frank Wang, Chao-Han Huck Yang

Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias) while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.


Article 481

Title@2025-07-25 (5): SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models

Title: SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models

SpeechIQ: Sprachintelligenz Quotient über kognitive Ebenen im Sprachverständnis von großen Sprachmodellen

语音理解大语言模式中不同认知层次的语音情报引号 2507.19361v1

Authors (11): Zhen Wan, Chao-Han Huck Yang, Yahan Yu, Jinchuan Tian, Sheng Li, Ke Hu, Zhehuai Chen, Shinji Watanabe, Fei Cheng, Chenhui Chu, Sadao Kurohashi

We introduce Speech-based Intelligence Quotient (SIQ) as a new form of human cognition-inspired evaluation pipeline for voice understanding large language models, LLM Voice, designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM Voice across three cognitive levels motivated by Bloom’s Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of LLM’s interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training.
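For readers unfamiliar with the "Remembering" level, word error rate is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The sketch below is a plain-Python illustration of that standard definition, not the SIQ pipeline itself. The function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the dog sat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.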


Article 482

Title@2025-07-25 (5): SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

Title: SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

SALM-Duplex: Effiziente und direkte Duplex-Modellierung für Speech-to-Speech-Sprachenmodell

SALM-Duplex:语音对语音语言模式的高效和直接双重模式 2505.15670v4

Authors (10): Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg

Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.


Article 483

Title@2025-07-25 (5): Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization

Title: Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization

Verbesserung der Sprach-Emotions-Erkennung Auslevering Aligning Timestamps von ASR-Transkriptionen und Sprecher-Diarisierung

利用ASR记录稿和议长对称的调和时标 2507.19356v1

Authors (3): Hsuan-Yu Wang, Pei-Ying Lee, Berlin Chen

In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. To address this issue, we introduce an alignment pipeline utilizing pre-trained ASR and speaker diarization models, systematically synchronizing timestamps to generate accurately labeled speaker segments. Our multimodal approach combines textual embeddings extracted via RoBERTa with audio embeddings from Wav2Vec, leveraging cross-attention fusion enhanced by a gating mechanism. Experimental evaluations on the IEMOCAP benchmark dataset demonstrate that precise timestamp alignment improves SER accuracy, outperforming baseline methods that lack synchronization. The results highlight the critical importance of temporal alignment, demonstrating its effectiveness in enhancing overall emotion recognition accuracy and providing a foundation for robust multimodal emotion analysis.
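One simple way to picture timestamp-based alignment is to give each ASR word the speaker whose diarization turn overlaps it the most. The abstract does not spell out the paper's alignment procedure, so the sketch below is an assumption-laden illustration: the `Word` and `SpeakerTurn` types, the maximum-overlap rule, and the sample timings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class SpeakerTurn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    words: List[Word], turns: List[SpeakerTurn]
) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it most.

    Words that overlap no turn are labeled None.
    """
    labeled = []
    for word in words:
        best_speaker: Optional[str] = None
        best_overlap = 0.0
        for turn in turns:
            shared = overlap(word.start, word.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((word.text, best_speaker))
    return labeled


# Made-up timings for illustration only.
words = [Word("hi", 0.0, 0.4), Word("there", 0.5, 0.9), Word("yes", 1.1, 1.4)]
turns = [SpeakerTurn("A", 0.0, 1.0), SpeakerTurn("B", 1.0, 2.0)]
print(assign_speakers(words, turns))
```

Grouping consecutive words with the same label would then yield speaker-attributed segments that can be paired with the corresponding audio spans.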


Article 484

Title@2025-07-25 (5): DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

Title: DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

DoctorAgent-RL: Ein multi-agent-kollaboratives Verstärkungs-Lernsystem für den multi-Turn-Klinischen Dialog

DoctorAgent-RL:多轮临床对话多机构合作强化学习系统 2505.19630v2

Authors (5): Yichun Feng, Jiawei Wang, Lu Zhou, Zhen Lei, Yixue Li

Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Single-round consultation systems require patients to describe all symptoms upfront, leading to vague diagnoses when complaints are unclear. Traditional multi-turn dialogue models, constrained by static supervised learning, lack flexibility and fail to intelligently extract key clinical information. To address these limitations, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that DoctorAgent-RL outperforms existing models in both multi-turn reasoning capability and final diagnostic performance. This approach shows immense practical value by reducing misdiagnosis risks in time-pressured settings, freeing clinicians for complex cases, and pioneering a strategy to optimize medical resource allocation and alleviate workforce shortages. Code and data are available at https://github.com/JarvisUSTC/DoctorAgent-RL


Article 485

Title@2025-07-25 (5): Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Tasks

Title: Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Tasks

Smooth Reading: Die Lücke von LLM zur Selbstaufmerksamkeit von LLM bei langen Kontextaufgaben überbrücken

平滑阅读:弥合经常LLM与长期任务自用LLM之间的差距 2507.19353v1

Authors (7): Kai Liu, Zhan Su, Peijie Dong, Fengran Mo, Jianfei Gao, ShaoTing Zhang, Kai Chen

Recently, recurrent large language models (Recurrent LLMs) with linear computational complexity have re-emerged as efficient alternatives to self-attention-based LLMs (Self-Attention LLMs), which have quadratic complexity. However, Recurrent LLMs often underperform on long-context tasks due to their limited fixed-size memory. Previous research has primarily focused on enhancing the memory capacity of Recurrent LLMs through architectural innovations, but these approaches have not yet enabled Recurrent LLMs to match the performance of Self-Attention LLMs on long-context tasks. We argue that this limitation arises because processing the entire context at once is not well-suited for Recurrent LLMs. In this paper, we propose Smooth Reading, a chunk-wise inference method inspired by human reading strategies. Smooth Reading processes context in chunks and iteratively summarizes the contextual information, thereby reducing memory demands and making the approach more compatible with Recurrent LLMs. Our experimental results show that this method substantially narrows the performance gap between Recurrent and Self-Attention LLMs on long-context tasks, while preserving the efficiency advantages of Recurrent LLMs. Our Smooth Reading boosts SWA-3B-4k (a Recurrent LLM) from 5.68% lower to 3.61% higher performance than Self-Attention LLMs on LongBench. Besides, our method maintains the high efficiency, training 3x faster and inferring 2x faster at 64k context compared to Self-Attention LLMs. To our knowledge, this is the first work to achieve comparable performance using Recurrent LLMs compared with Self-Attention LLMs on long-context tasks. We hope our method will inspire future research in this area. To facilitate further progress, we will release code and dataset.
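The chunk-wise idea (read a chunk, fold it into a running summary, move on) can be sketched as a simple loop. This is only a schematic of the general pattern described in the abstract, not the authors' implementation. The `summarize` callback stands in for a model call, and the `toy_summarize` function is a deliberately trivial placeholder.

```python
from typing import Callable, List


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def chunkwise_read(
    tokens: List[str],
    chunk_size: int,
    summarize: Callable[[str, str], str],
) -> str:
    """Process context chunk by chunk, carrying only a running summary forward.

    `summarize(previous_summary, chunk_text)` is a hypothetical stand-in for
    a model call that returns an updated summary.
    """
    summary = ""
    for chunk in split_into_chunks(tokens, chunk_size):
        summary = summarize(summary, " ".join(chunk))
    return summary


def toy_summarize(previous_summary: str, chunk_text: str) -> str:
    """Placeholder summarizer: keeps only capitalized words seen so far."""
    kept = previous_summary.split()
    kept += [word for word in chunk_text.split() if word[:1].isupper()]
    return " ".join(kept)


context = "Alice met Bob in Paris last week".split()
print(chunkwise_read(context, chunk_size=3, summarize=toy_summarize))
```

The point of the pattern is that each step sees only one chunk plus a compact summary, so the memory the model must hold stays bounded regardless of total context length.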


Article 486

Title@2025-07-25 (5): Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation

Title: Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation

Externes Wissen in den vernünftigen Prozess zu spritzen verbessert die retrieval-angereicherte Generation

将外部知识注入说明过程,加强检索-提款一代 2507.19333v1

Authors (4): Minghao Tang, Shiyu Ni, Jiafeng Guo, Keping Bi

Retrieval-augmented generation (RAG) has been widely adopted to augment large language models (LLMs) with external knowledge for knowledge-intensive tasks. However, its effectiveness is often undermined by the presence of noisy (i.e., low-quality) retrieved passages. Enhancing LLMs’ robustness to such noise is critical for improving the reliability of RAG systems. Recent advances have equipped LLMs with strong reasoning and self-reflection capabilities, allowing them to identify and correct errors in their reasoning process. Inspired by this ability, we propose Passage Injection, a simple yet effective method that explicitly incorporates retrieved passages into LLMs’ reasoning process, aiming to enhance the model’s ability to recognize and resist noisy passages. We validate Passage Injection under general RAG settings using BM25 as the retriever. Experiments on four reasoning-enhanced LLMs across four factual QA datasets demonstrate that Passage Injection significantly improves overall RAG performance. Further analysis on two noisy retrieval settings (random noise, where the model is provided irrelevant passages, and counterfactual noise, where it is given misleading passages) shows that Passage Injection consistently improves robustness. Controlled experiments confirm that Passage Injection can also effectively leverage helpful passages. These findings suggest that incorporating passages in LLMs’ reasoning process is a promising direction for building more robust RAG systems. The code can be found at https://github.com/mh-tang/Passage-Injection.

Article 487

Title@2025-07-25 (5): References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation

Title: References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation

Referenzen Materie: Untersuchung der Auswirkungen von Referenzsatzvariationen auf die Bewertung der Zusammenfassung

参考参考物质:调查参照标准差异对总结评价的影响 2506.14335v2

Authors (6): Silvia Casola, Yang Janet Liu, Siyao Peng, Oliver Kraus, Albert Gatt, Barbara Plank

Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of the reference set on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.
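
The ranking instability described above can be illustrated with a toy example. This sketch uses a simple unigram-overlap F1 as a stand-in for an n-gram metric; it is not ROUGE itself, and the systems and references are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def score(candidate, references):
    """Multi-reference score: best match over the reference set."""
    return max(unigram_f1(candidate, ref) for ref in references)


def rank(systems, references):
    """Order system names from best to worst under a given reference set."""
    return sorted(systems, key=lambda name: score(systems[name], references), reverse=True)


# Toy data (assumed): two system outputs and two alternative reference sets.
systems = {"A": "the cat sat on the mat", "B": "a feline rested on a rug"}
refs_1 = ["the cat sat on the mat today"]
refs_2 = ["a feline rested on a small rug"]

print(rank(systems, refs_1))
print(rank(systems, refs_2))
```

The two reference sets describe the same scene with different wording, and the system ranking flips between them.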

Article 488

Title@2025-07-25 (5): AutoPCR: Automated Phenotype Concept Recognition by Prompting

Title: AutoPCR: Automated Phenotype Concept Recognition by Prompting

AutoPCR: Automatisierte Erkennung von Phänomenen durch Prompting

自动PCR:通过提示自动地识别基因型概念 2507.19315v1

Authors (3): Yicheng Tao, Yuanhao Huang, Jie Liu

Phenotype concept recognition (CR) is a fundamental task in biomedical text mining, enabling applications such as clinical diagnostics and knowledge graph construction. However, existing methods often require ontology-specific training and struggle to generalize across diverse text types and evolving biomedical terminology. We present AutoPCR, a prompt-based phenotype CR method that does not require ontology-specific training. AutoPCR performs CR in three stages: entity extraction using a hybrid of rule-based and neural tagging strategies, candidate retrieval via SapBERT, and entity linking through prompting a large language model. Experiments on four benchmark datasets show that AutoPCR achieves the best average and most robust performance across both mention-level and document-level evaluations, surpassing prior state-of-the-art methods. Further ablation and transfer studies demonstrate its inductive capability and generalizability to new ontologies.

Article 489

Title@2025-07-25 (5): The Eloquence team submission for task 1 of MLC-SLM challenge

Title: The Eloquence team submission for task 1 of MLC-SLM challenge

Die Eloquence-Team-Einreichung für die Aufgabe 1 der MLC-SLM-Herausforderung

刚果解运-解运挑战任务1的评分小组提交 2507.19308v1

Authors (5): Lorenzo Concina, Jordi Luque, Alessio Brutti, Marco Matassoni, Yuchen Zhang

In this paper, we present our studies and experiments carried out for Task 1 of the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM), which focuses on advancing multilingual conversational speech recognition through the development of speech language model architectures. Given the increasing relevance of real-world conversational data for building robust Spoken Dialogue Systems, we explore three approaches to multilingual ASR. First, we evaluate the official baseline to better understand its strengths and limitations, by training two projectors (linear and qformer) with different foundation models. Second, we leverage the SLAM-ASR framework to train a custom multilingual linear projector. Finally, we investigate the role of contrastive learning and extended conversational context in enhancing the robustness of recognition.

Article 490

Title@2025-07-25 (5): Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump’s Presidential Campaigns

Title: Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump’s Presidential Campaigns

Identifizierung feinkörniger Formen des Populismus im politischen Diskurs: Eine Fallstudie zu Donald Trumps Präsidentschaftswahlen

确定政治讨论中精美的民粹主义形式:关于唐纳德·特朗普总统运动的个案研究 2507.19303v1

Authors (3): Ilias Chalkidis, Stephanie Brandl, Paris Aslanidis

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of instruction-following tasks, yet their grasp of nuanced social science concepts remains underexplored. This paper examines whether LLMs can identify and classify fine-grained forms of populism, a complex and contested concept in both academic and media debates. To this end, we curate and release novel datasets specifically designed to capture populist discourse. We evaluate a range of pre-trained (large) language models, both open-weight and proprietary, across multiple prompting paradigms. Our analysis reveals notable variation in performance, highlighting the limitations of LLMs in detecting populist discourse. We find that a fine-tuned RoBERTa classifier vastly outperforms all new-era instruction-tuned LLMs, unless fine-tuned. Additionally, we apply our best-performing model to analyze campaign speeches by Donald Trump, extracting valuable insights into his strategic use of populist rhetoric. Finally, we assess the generalizability of these models by benchmarking them on campaign speeches by European politicians, offering a lens into cross-context transferability in political discourse analysis. In this setting, we find that instruction-tuned LLMs exhibit greater robustness on out-of-domain data.

Article 491

Title@2025-07-25 (5): A Markov Categorical Framework for Language Modeling

Title: A Markov Categorical Framework for Language Modeling

Ein kategorisches Markov-Rahmenwerk für Sprachmodellierung

用于语言建模的 Markov 语言建模分类框架 2507.19247v1

Authors (1): Yifan Zhang

Auto-regressive language models factorize sequence probabilities and are trained by minimizing the negative log-likelihood (NLL) objective. While empirically powerful, a deep theoretical understanding of why this simple objective yields such versatile representations remains elusive. This work introduces a unifying analytical framework using Markov Categories (MCs) to deconstruct the AR generation process and the NLL objective. We model the single-step generation map as a composition of Markov kernels in the category Stoch. This compositional view, when enriched with statistical divergences, allows us to dissect information flow and learned geometry. Our framework makes three main contributions. First, we provide a formal, information-theoretic rationale for the success of modern speculative decoding methods like EAGLE, quantifying the information surplus in hidden states that these methods exploit. Second, we formalize how NLL minimization forces the model to learn not just the next token, but the data’s intrinsic conditional stochasticity, a process we analyze using categorical entropy. Third, and most centrally, we prove that NLL training acts as an implicit form of spectral contrastive learning. By analyzing the information geometry of the model’s prediction head, we show that NLL implicitly forces the learned representation space to align with the eigenspectrum of a predictive similarity operator, thereby learning a geometrically structured space without explicit contrastive pairs. This compositional and information-geometric perspective reveals the deep structural principles underlying the effectiveness of modern LMs. Project Page: https://github.com/asiresearch/lm-theory
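
For reference, the autoregressive factorization and the NLL objective mentioned in the abstract are, in standard notation (the symbols $\theta$, $T$, and $\mathcal{D}$ are chosen here for illustration and are not taken from the paper):

```latex
p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta\left(x_t \mid x_{<t}\right),
\qquad
\mathcal{L}_{\mathrm{NLL}}(\theta)
  = -\,\mathbb{E}_{x_{1:T} \sim \mathcal{D}}
    \left[ \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right) \right]
```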

Article 492

Title@2025-07-25 (5): Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation

Title: Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation

Jailbreaking Large Language Diffusion Models: Enthüllen versteckter Sicherheitsfehler bei der Diffusion-basierten Textgenerierung

大语言传播模式:在以传播为基础的文本生成中披露隐藏的安全条 2507.19227v1

Authors (7): Yuanhe Zhang, Fangzhou Xie, Zhenhong Zhou, Zherui Li, Hao Chen, Kun Wang, Yufei Guo

Large Language Diffusion Models (LLDMs) exhibit comparable performance to LLMs while offering distinct advantages in inference speed and mathematical reasoning tasks. The precise and rapid generation capabilities of LLDMs amplify concerns about harmful generations, while existing jailbreak methodologies designed for Large Language Models (LLMs) show limited effectiveness against LLDMs and fail to expose safety vulnerabilities. Successful defense cannot definitively resolve harmful generation concerns, as it remains unclear whether LLDMs possess safety robustness or existing attacks are simply incompatible with diffusion-based architectures. To address this, we first reveal the vulnerability of LLDMs to jailbreaks and demonstrate that attack failure in LLDMs stems from fundamental architectural differences. We present a PArallel Decoding jailbreak (PAD) for diffusion-based language models. PAD introduces Multi-Point Attention Attack, which is inspired by affirmative response patterns in LLMs and guides parallel generative processes toward harmful outputs. Experimental evaluations across four LLDMs demonstrate that PAD achieves a jailbreak attack success rate of 97%, revealing significant safety vulnerabilities. Furthermore, compared to autoregressive LLMs of the same size, LLDMs increase the harmful generation speed by 2x, highlighting the risk of uncontrolled misuse. Through comprehensive analysis, we provide an investigation into LLDM architecture, offering critical insights for the secure deployment of diffusion-based language models.

Article 493

Title@2025-07-25 (5): How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework

Title: How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework

Wie viel Cheat bei der Evaluation eines großen Sprachmodells? Benchmarking-Überschätzung im Rahmen des One-Time-Pad-basierten Frameworks

大语言模式在评价方面有多大的热量? 以单一时间为基础的框架为高估基准 2507.19219v1

Authors (5): Zi Liang, Liantong Yu, Shiyu Zhang, Qingqing Ye, Haibo Hu

Overestimation in evaluating large language models (LLMs) has become an increasing concern. Due to the contamination of public benchmarks or imbalanced model training, LLMs may achieve unrealistically high evaluation results on public benchmarks, either intentionally or unintentionally, which leads to unfair comparisons among LLMs and undermines realistic assessment of their capabilities. Existing benchmarks attempt to address these issues by keeping test cases permanently secret, mitigating contamination through human evaluation, or repeatedly collecting and constructing new samples. However, these approaches fail to ensure reproducibility, transparency, and high efficiency simultaneously. Moreover, the extent of overestimation in current LLMs remains unquantified. To address these issues, we propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography. ArxivRoll comprises two key components: (i) SCP (Sequencing, Cloze, and Prediction), an automated generator for private test cases, and (ii) Rugged Scores (RS), metrics that measure the proportion of public benchmark contamination and training bias. Leveraging SCP, ArxivRoll constructs a new benchmark every six months using recent articles from arXiv and employs them for one-time evaluations of LLM performance. Extensive experiments demonstrate the high quality of our benchmark, and we provide a systematic evaluation of current LLMs. The source code is available at https://github.com/liangzid/ArxivRoll/.

Article 494

Title@2025-07-25 (5): 3LM: Bridging Arabic, STEM, and Code through Benchmarking

Title: 3LM: Bridging Arabic, STEM, and Code through Benchmarking

3LM: Arabisch, MINT und Code durch Benchmarking überbrücken

3LM:通过基准确定连接阿拉伯语、STEM和代码 2507.15850v3

Authors (8): Basma El Amel Boussaha, Leen AlQadi, Mugariya Farooq, Shaikha Alsuwaidi, Giulia Campesan, Ahmed Alzubaidi, Mohammed Alyafeai, Hakim Hacid

Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in domains like STEM and code which are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.

Article 495

Title@2025-07-25 (5): SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology

Title: SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology

SigBERT: Kombination narrativer medizinischer Berichte und rough Path Signature Theory zur Einschätzung des Überlebensrisikos in der Onkologie

SigBERT: 将叙述性医疗报告与肿瘤学生存风险估算的粗路签名理论相结合 2507.22941v1

Authors (5): Paul Minchella, Loïc Verlingue, Stéphane Chrétien, Rémi Vaucher, Guillaume Metzler

Electronic medical reports (EHR) contain a vast amount of information that can be leveraged for machine learning applications in healthcare. However, existing survival analysis methods often struggle to effectively handle the complexity of textual data, particularly in its sequential form. Here, we propose SigBERT, an innovative temporal survival analysis framework designed to efficiently process a large number of clinical reports per patient. SigBERT processes timestamped medical reports by extracting and averaging word embeddings into sentence embeddings. To capture temporal dynamics from the time series of sentence embedding coordinates, we apply signature extraction from rough path theory to derive geometric features for each patient, which significantly enhance survival model performance by capturing complex temporal dynamics. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. The model was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center corpus, with a C-index score of 0.75 (sd 0.014) on the independent test cohort. SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis.
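
To make the signature features concrete, the following sketch computes the depth-2 path signature of a piecewise-linear path, such as a time-ordered sequence of sentence embeddings. It is a pure-Python illustration under the assumption of linear interpolation between consecutive points; the function name is hypothetical and the paper's own pipeline may use a dedicated library and a different truncation depth.

```python
def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path.

    path: list of equal-length coordinate tuples, in time order.
    Returns (level1, level2), where level1[i] is the total increment in
    coordinate i and level2[i][j] is the iterated integral of dx^i then dx^j.
    """
    dim = len(path[0])
    level1 = [0.0] * dim
    level2 = [[0.0] * dim for _ in range(dim)]
    for prev, curr in zip(path, path[1:]):
        step = [c - p for p, c in zip(prev, curr)]
        for i in range(dim):
            for j in range(dim):
                # level1[i] is the displacement accumulated before this segment;
                # the 0.5 term is the contribution of the linear segment itself.
                level2[i][j] += level1[i] * step[j] + 0.5 * step[i] * step[j]
        for i in range(dim):
            level1[i] += step[i]
    return level1, level2


# Toy 2-D path: move right, then up.
level1, level2 = signature_depth2([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
print(level1, level2)
```

The antisymmetric part of the second level, `0.5 * (level2[0][1] - level2[1][0])`, is the signed area swept by the path, which is one way such features capture the order of events rather than only the net change.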

Article 496

Title: Towards Multimodal Social Conversations with Robots: Using Vision-Language Models

Auf dem Weg zu multimodalen sozialen Gesprächen mit Robotern: Mit Vision-Sprachen-Modellen

走向与机器人的多模式社会对话:使用视觉语言模型 2507.19196v1

Authors (2): Ruben Janssens, Tony Belpaeme

Large language models have given social robots the ability to autonomously engage in open-domain conversations. However, they are still missing a fundamental social skill: making use of the multiple modalities that carry social interactions. While previous work has focused on task-oriented interactions that require referencing the environment or specific phenomena in social interactions such as dialogue breakdowns, we outline the overall needs of a multimodal system for social conversations with robots. We then argue that vision-language models are able to process this wide range of visual information in a sufficiently general manner for autonomous social robots. We describe how to adapt them to this setting, which technical challenges remain, and briefly discuss evaluation practices.

Article 497

Title@2025-07-25 (5): Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?

Title: Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?

Kann Small-Scale-Datenvergiftung Dialect-Linked Biases in großen Sprachmodellen exazerbieren?

在大语言模型中,小范围数据中毒加剧分解链接的分界线能否成为大语言模型? 2507.19195v1

Authors (3): Chaymaa Abbas, Mariette Awad, Razane Tajeddine

Despite the ongoing improvements in the design of large language models (LLMs) to foster inclusion and balanced responses, these systems remain susceptible to encoding and amplifying social biases. This study examines how dialectal variation, specifically African American Vernacular English (AAVE) versus Standard American English (SAE), interacts with data poisoning to influence toxicity in outputs. Using both small- and medium-scale LLaMA models, we show that even minimal exposure to poisoned data significantly increases toxicity for AAVE inputs, while it remains comparatively unaffected for SAE. Larger models exhibit a more significant amplification effect which suggests heightened susceptibility with scale. To further assess these disparities, we employed GPT-4o as a fairness auditor, which identified harmful stereotypical patterns disproportionately tied to AAVE inputs, including portrayals of aggression, criminality, and intellectual inferiority. These findings underscore the compounding impact of data poisoning and dialectal bias and emphasize the need for dialect-aware evaluation, targeted debiasing interventions, and socially responsible training protocols during development.

Article 498

Title@2025-07-25 (5): Natural Language Processing for Tigrinya: Current State and Future Directions

Title: Natural Language Processing for Tigrinya: Current State and Future Directions

Natürliche Sprachverarbeitung für Tigrinya: Aktueller Zustand und zukünftige Richtungen

提格里尼亚的自然语言处理:现状和未来方向 2507.17974v2

Authors (2): Fitsum Gaim, Jong C. Park

Despite being spoken by millions of people, Tigrinya remains severely underrepresented in Natural Language Processing (NLP) research. This work presents a comprehensive survey of NLP research for Tigrinya, analyzing over 40 studies spanning more than a decade of work from 2011 to 2025. We systematically review the current state of computational resources, models, and applications across ten distinct downstream tasks, including morphological processing, machine translation, speech recognition, and question answering. Our analysis reveals a clear trajectory from foundational, rule-based systems to modern neural architectures, with progress consistently unlocked by resource creation milestones. We identify key challenges rooted in Tigrinya’s morphological complexity and resource scarcity, while highlighting promising research directions, including morphology-aware modeling, cross-lingual transfer, and community-centered resource development. This work serves as both a comprehensive reference for researchers and a roadmap for advancing Tigrinya NLP. Curated metadata for the surveyed studies and resources is made publicly available.

Article 499

Title@2025-07-25 (5): Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them

Title: Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them

Scalpel vs. Hammer: GRPO verstärkt bestehende Fähigkeiten, SFT ersetzt sie

缩略图与锤子:GRPO 放大现有能力,SFT 替换 2507.10616v2

Authors (4): Neel Rajani, Aryo Pradipta Gema, Seraphina Goldfarb-Tarrant, Ivan Titov

Training large language models (LLMs) for reasoning via maths and code datasets has become a major new focus in LLM post-training. Two particularly popular approaches are reinforcement learning (RL) and supervised fine-tuning (SFT), but their training dynamics are poorly understood. We present a comparative analysis of RL and SFT on the same maths problems with the same model and similar hyperparameters. We find that RL yields minor in-domain gains on maths and slight degradation on knowledge-intensive benchmarks like MMLU, while both trends are more pronounced in SFT. We also analyse model parameters across checkpoints, observing that both algorithms modify query and key weights the most. Meanwhile, SFT exhibits greater updates and also affects mid-layer MLPs more, leading us to hypothesise that this may have caused the out-of-domain degradation. We therefore investigate whether freezing parts of the model during training can mitigate the reduced performance on knowledge-intensive benchmarks. However, our results are inconclusive, with benefits on GPQA:Diamond and degradation on other benchmarks. Taken together, our observations provide a preliminary indication for why RL amplifies existing capabilities, while SFT replaces old skills with new ones.

Article 500

Title@2025-07-25 (5): An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case

Title: An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case

Eine empirische Untersuchung der Geschlechterstereotypdarstellung in großen Sprachmodellen: Der italienische Fall

对大语言模式中性别陈规定型观念代表性的经验调查:意大利案例 2507.19156v1

Authors (5): Gioele Giachino, Marco Rondina, Antonio Vetrò, Riccardo Coppola, Juan Carlos De Martin

The increasing use of Large Language Models (LLMs) in a large variety of domains has sparked worries about how easily they can perpetuate stereotypes and contribute to the generation of biased content. With a focus on gender and professional bias, this work examines how LLMs shape responses to ungendered prompts, contributing to biased outputs. This analysis uses a structured experimental method, giving different prompts involving three different professional job combinations, which are also characterized by a hierarchical relationship. This study uses Italian, a language with extensive grammatical gender differences, to highlight potential limitations in current LLMs’ ability to generate objective text in non-English languages. Two popular LLM-based chatbots are examined, namely OpenAI ChatGPT (gpt-4o-mini) and Google Gemini (gemini-1.5-flash). Through APIs, we collected 3,600 responses. The results highlight how content generated by LLMs can perpetuate stereotypes. For example, Gemini associated 100% (ChatGPT 97%) of ‘she’ pronouns with the ‘assistant’ rather than the ‘manager’. The presence of bias in AI-generated text can have significant implications in many fields, such as in the workplace or in job selection, raising ethical concerns about its use. Understanding these risks is pivotal to developing mitigation strategies and ensuring that AI-based systems do not increase social inequalities, but rather contribute to more equitable outcomes. Future research directions include expanding the study to additional chatbots or languages, refining prompt engineering methods, or further exploiting a larger experimental base.

Article 501

Title@2025-07-25 (5): Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Title: Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Beschleunigung multimodaler Großsprachenmodelle über Dynamic Visual-Token Exit und die Empirical Findings

通过动态直视退出和实证结论加速多模式大语言模型 2411.19628v2

Authors (7): Qiong Wu, Wenhao Lin, Yiyi Zhou, Weihao Ye, Zhanpeng Zen, Xiaoshuai Sun, Rongrong Ji

The excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and incurs prohibitively expensive computation. To gain insights into this problem, we first conduct extensive empirical studies on the attention behaviors of MLLMs, and summarize three main inference stages in MLLMs: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes into play. (iii) Multimodal reasoning resumes and lasts until the end of inference. In particular, we reveal that visual tokens will stop contributing to reasoning when the text tokens receive enough image information, yielding obvious visual redundancy. Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer, thereby addressing the observed visual redundancy. To validate DyVTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL, and conduct extensive experiments on a set of benchmarks. The experimental results not only show the effectiveness of DyVTE in improving MLLMs’ efficiency, but also reveal the general modeling patterns of MLLMs, facilitating an in-depth understanding of MLLMs. Our code is released at https://github.com/DoubtedSteam/DyVTE.
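
The control flow of a visual-token exit can be sketched as below. This is a toy illustration only: tokens are scalars, layers are plain functions, and `should_exit` stands in for the lightweight hyper-network that inspects the text token status. None of these names come from the DyVTE code base.

```python
def run_layers_with_visual_exit(tokens, layers, should_exit):
    """Apply layers in order; drop all visual tokens once the gate fires.

    tokens: list of (modality, value) pairs with modality "text" or "visual".
    layers: list of callables mapping a value to a new value (toy layers).
    should_exit: callable taking the list of text values and returning bool
                 (hypothetical stand-in for the gating hyper-network).
    Returns the final tokens and the layer index at which the exit happened.
    """
    exited_at = None
    for idx, layer in enumerate(layers):
        tokens = [(modality, layer(value)) for modality, value in tokens]
        if exited_at is None:
            text_values = [v for m, v in tokens if m == "text"]
            if should_exit(text_values):
                # From here on, later layers only process text tokens.
                tokens = [(m, v) for m, v in tokens if m != "visual"]
                exited_at = idx
    return tokens, exited_at


toy_tokens = [("visual", 1.0), ("visual", 2.0), ("text", 0.0)]
toy_layers = [lambda v: v + 1.0] * 4
final_tokens, exit_layer = run_layers_with_visual_exit(
    toy_tokens, toy_layers, should_exit=lambda text: min(text) >= 2.0
)
print(final_tokens, exit_layer)
```

The saving comes from the layers after the exit point, which see a much shorter sequence once the visual tokens are gone.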

Article 502

Title@2025-07-25 (5): Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

Title: Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

Vertrauenswürdige Begründung: Bewertung und Verbesserung der tatsächlichen Genauigkeit in LLM-Intermediate-Thought-Prozessen

值得信赖的理由:评估和加强LLM中级思考程序中的事实准确性 2507.22940v1

Authors (3): Rui Jiao, Yue Zhang, Jinku Li

We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
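
As a rough sketch of how a multi-dimensional reward could feed a GRPO-style update, the code below combines per-dimension scores into a scalar and then normalizes rewards within a group of sampled responses. The weights, dimension names, and the use of the population standard deviation are assumptions for illustration, not values from the paper.

```python
from statistics import mean, pstdev


def combined_reward(scores, weights):
    """Weighted sum of per-dimension scores (weights are hypothetical)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


weights = {"factuality": 0.5, "coherence": 0.25, "structure": 0.25}
group = [
    {"factuality": 1.0, "coherence": 0.5, "structure": 0.0},
    {"factuality": 0.0, "coherence": 1.0, "structure": 1.0},
    {"factuality": 1.0, "coherence": 1.0, "structure": 1.0},
]
rewards = [combined_reward(s, weights) for s in group]
print(rewards)
print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, a response is reinforced only when it scores above its siblings for the same prompt, which removes the need for a separate value model.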

Article 503

Title@2025-07-25 (5): OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?

Title: OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?

OS-MAP: Wie weit können Computer-verwendende Agenten in Breadth und Tiefe gehen?

OS-MAP:计算机用户在面包和深度上能走多远? 2507.19132v1

Authors (15): Xuetian Chen, Yinghao Chen, Xinfeng Yuan, Zhuo Peng, Lu Chen, Yuekeng Li, Zhoujia Zhang, Yingqian Huang, Leyan Huang, Jiaqing Liang, Tianbao Xie, Zhiyong Wu, Qiushi Sun, Biqing Qi, Bowen Zhou

Computer-using agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for the internal task heterogeneity and the corresponding agent capabilities, as well as their alignment with actual user demands, hindering both targeted capability development and the reliable transition of research progress into practical deployment. To bridge the gap, we present OS-MAP, a benchmark for daily computer-using automation that organizes its 416 realistic tasks across 15 applications along two key dimensions: a five-level taxonomy of automation and a generalization scope derived from a real-world user demand hierarchy. To enable fine-grained analysis of required capabilities and alignment with real-world scenarios, OS-MAP evaluates agents along two dimensions: automation level across a five-level taxonomy, and generalization scope across a demand hierarchy. This design captures varying levels of required agent autonomy and generalization, forming a performance-generalization evaluation matrix for structured and comprehensive assessment. Experiments show that even state-of-the-art agents with VLM backbones struggle with higher-level tasks involving perception, reasoning, and coordination, highlighting the need for a deeper understanding of current strengths and limitations to drive future progress in computer-using agent research and deployment. All code, environments, baselines, and data are publicly available at https://github.com/OS-Copilot/OS-Map.
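
A two-dimensional evaluation matrix of this kind can be assembled from per-task outcomes as sketched below. The record layout, level numbers, and scope labels are assumptions made for illustration and do not reflect the benchmark's actual data format.

```python
from collections import defaultdict


def evaluation_matrix(results):
    """Aggregate success rates per (automation_level, scope) cell.

    results: iterable of (automation_level, scope, success) tuples,
             where success is a bool (hypothetical record layout).
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [successes, attempts]
    for level, scope, success in results:
        cell = totals[(level, scope)]
        cell[0] += int(success)
        cell[1] += 1
    return {cell: wins / attempts for cell, (wins, attempts) in totals.items()}


# Toy outcomes with made-up level and scope labels.
toy_results = [
    (1, "narrow", True),
    (1, "narrow", True),
    (3, "narrow", True),
    (3, "narrow", False),
    (3, "broad", False),
]
for cell, rate in sorted(evaluation_matrix(toy_results).items()):
    print(cell, rate)
```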

Article 504

Title@2025-07-25 (5): Distilling the Implicit Multi-Branch Structure in LLMs’ Reasoning via Reinforcement Learning

Title: Distilling the Implicit Multi-Branch Structure in LLMs’ Reasoning via Reinforcement Learning

Destillieren der impliziten Multi-Branch-Struktur in LLMs’ Reasoning durch Verstärkungslernen

通过强化学习,将LLMs的隐含多层结构提炼在“通过强化学习推理”中 2505.16142v3

Authors (9): Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng

Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher’s reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher’s implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.

Article 505

Title@2025-07-25 (5): Objectifying the Subjective: Cognitive Biases in Topic Interpretations

Objektivierung des Subjektiven: Kognitive Biasen in thematischen Interpretationen

表示主观性: 专题解释中的认知性分界线 2507.19117v1

Authors (6): Swapnil Hingmire, Ze Shi Li, Shiyu Zeng, Ahmed Musa Awon, Luiz Franciscatto Guerra, Neil Ernst

Interpretation of topics is crucial for their downstream applications. State-of-the-art evaluation measures of topic quality, such as coherence and word intrusion, do not measure how much a topic facilitates the exploration of a corpus. To design evaluation measures grounded in a task and a population of users, we conduct user studies to understand how users interpret topics. We propose constructs of topic quality and ask users to assess them in the context of a topic and to provide the rationale behind their evaluations. We use reflexive thematic analysis to identify themes of topic interpretation from these rationales. Users interpret topics based on availability and representativeness heuristics rather than probability. We propose a theory of topic interpretation based on the anchoring-and-adjustment heuristic: users anchor on salient words and make semantic adjustments to arrive at an interpretation. Topic interpretation can be viewed as a judgment made under uncertainty by an ecologically rational user, and hence user models and evaluation frameworks that are aware of cognitive biases are needed.

Article 506

Title@2025-07-25 (5): Relation Extraction with Instance-Adapted Predicate Descriptions

Verhältnis-Extraktion mit instance-adapted Prädikat Beschreibungen

采掘与原创性预言性说明的关系 2503.17799v2

Authors (2): Yuhang Jiang, Ramakanth Kavuluru

Relation extraction (RE) is a standard information extraction task that plays a major role in downstream applications such as knowledge discovery and question answering. Although decoder-only large language models excel in generative tasks, smaller encoder models are still the go-to architecture for RE. In this paper, we revisit fine-tuning such smaller models using a novel dual-encoder architecture with a joint contrastive and cross-entropy loss. Unlike previous methods that employ a fixed linear layer for predicate representations, our approach uses a second encoder to compute instance-specific predicate representations by infusing them with real entity spans from the corresponding input instances. We conducted experiments on two biomedical RE datasets and two general-domain datasets. Our approach achieved F1 score improvements ranging from 1% to 2% over state-of-the-art methods with a simple but elegant formulation. Ablation studies justify the importance of the various components built into the proposed architecture.

Article 507

Title@2025-07-25 (5): Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy

Ensemble Debiasing Across Class und Sample Levels für eine gerechtere Genauigkeit

公平促进准确性 2503.05157v4

Authors (3): Ruixi Lin, Ziqiao Wang, Yang You

Language models are strong few-shot learners and achieve good overall accuracy in text classification tasks, masking the fact that their results suffer from a large imbalance in per-class accuracy. We believe that the pursuit of overall accuracy should not come from enriching the strong classes, but from raising up the weak ones. To address the imbalance, we propose a Heaviside step function-based ensemble debiasing method, which enables flexible rectification of in-context learned class probabilities at both the class and sample levels. Evaluations with Llama-2-13B on seven text classification benchmarks show that our approach achieves state-of-the-art overall accuracy gains with balanced class accuracies. More importantly, we analyze the resulting probability correction scheme, showing that sample-level corrections are necessary to elevate weak classes. Because it effectively corrects weak classes, our method also brings significant performance gains to a larger model variant, Llama-2-70B, especially on a biomedical domain task, further demonstrating the necessity of ensemble debiasing at both levels. Our source code is available at https://github.com/NUS-HPC-AI-Lab/DCS.

Article 508

Title@2025-07-25 (5): Comparison of pipeline, sequence-to-sequence, and GPT models for end-to-end relation extraction: experiments with the rare disease use-case

Vergleich von Pipeline-, Sequenz-zu-Sequenz- und GPT-Modellen für die End-to-End-Relation-Extraktion: Experimente mit dem Einsatzfall der seltenen Krankheiten

管道、序列到序列和终端到终端关系提取GPT模型的比较:与罕见疾病使用案例的实验 2311.13729v3

Authors (3): Shashank Gupta, Xuguang Ai, Ramakanth Kavuluru

End-to-end relation extraction (E2ERE) is an important and realistic application of natural language processing (NLP) in biomedicine. In this paper, we aim to compare three prevailing paradigms for E2ERE using a complex dataset focused on rare diseases involving discontinuous and nested entities. We use the RareDis information extraction dataset to evaluate three competing approaches for E2ERE: NER $\rightarrow$ RE pipelines, joint sequence-to-sequence models, and generative pre-trained transformer (GPT) models. We use comparable state-of-the-art models and best practices for each of these approaches and conduct error analyses to assess their failure modes. Our findings reveal that pipeline models are still the best, while sequence-to-sequence models are not far behind; GPT models with eight times as many parameters are worse than even sequence-to-sequence models and lose to pipeline models by over 10 F1 points. Partial matches and discontinuous entities caused many NER errors, contributing to lower overall E2E performance. We also verify these findings on a second E2ERE dataset for chemical-protein interactions. Although generative LM-based methods are more suitable for zero-shot settings, our results show that when training data is available, it is better to work with more conventional models trained and tailored for E2ERE. More innovative methods are needed to marry the best of both worlds, from smaller encoder-decoder pipeline models and the larger GPT models, to improve E2ERE. As of now, we see that well-designed pipeline models offer substantial performance gains at a lower cost and carbon footprint for E2ERE. Our contribution is also the first to conduct E2ERE for the RareDis dataset.

Article 509

Title@2025-07-25 (5): Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation

Destillieren eines kleinen Utility-Based Passage Selectors zur Verbesserung der Retrieval-Augmented Generation

蒸馏一个小型以公用事业为基础的通道选择器,以加强回收-提款一代 2507.19102v1

Authors (7): Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating retrieved information. Standard retrieval processes prioritize relevance, focusing on topical alignment between queries and passages. In RAG, by contrast, the emphasis has shifted to utility, which considers the usefulness of passages for generating accurate answers. Despite empirical evidence showing the benefits of utility-based retrieval in RAG, the high computational cost of using LLMs for utility judgments limits the number of passages evaluated. This restriction is problematic for complex queries requiring extensive information. To address this, we propose a method to distill the utility judgment capabilities of LLMs into smaller, more efficient models. Our approach focuses on utility-based selection rather than ranking, enabling dynamic passage selection tailored to specific queries without the need for fixed thresholds. We train student models to learn pseudo-answer generation and utility judgments from teacher LLMs, using a sliding window method that dynamically selects useful passages. Our experiments demonstrate that utility-based selection provides a flexible and cost-effective solution for RAG, significantly reducing computational costs while improving answer quality. We present the distillation results using Qwen3-32B as the teacher model for both relevance ranking and utility-based selection, distilled into RankQwen1.7B and UtilityQwen1.7B. Our findings indicate that for complex questions, utility-based selection is more effective than relevance ranking in enhancing answer generation performance. We will release the relevance ranking and utility-based selection annotations for the MS MARCO dataset, supporting further research in this area.
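
The abstract does not spell out the sliding window procedure, so the following is only a minimal sketch of what selection without a fixed threshold might look like. The function names (`sliding_window_select`, `keyword_judge`), the window size, and the toy keyword-overlap judge are all hypothetical stand-ins; in the paper the judgment would come from a distilled student model.

```python
from typing import Callable, List, Sequence

# A judge receives the query, the passages kept so far, and a new window of
# candidates, and returns the passages it considers useful (hypothetical interface).
Judge = Callable[[str, List[str], List[str]], List[str]]


def sliding_window_select(
    query: str,
    passages: Sequence[str],
    judge: Judge,
    window_size: int = 2,
) -> List[str]:
    """Walk over ranked passages window by window, keeping what the judge selects."""
    selected: List[str] = []
    for start in range(0, len(passages), window_size):
        window = list(passages[start:start + window_size])
        # The number of passages kept is decided by the judge, not by a cutoff.
        selected = judge(query, selected, window)
    return selected


def keyword_judge(query: str, selected: List[str], window: List[str]) -> List[str]:
    """Toy stand-in for a utility model: keep passages sharing a word with the query."""
    terms = set(query.lower().split())
    return [p for p in selected + window if terms & set(p.lower().split())]


passages = [
    "Paris is the capital of France",
    "Bananas are yellow",
    "France borders Spain",
    "The moon is far",
]
useful = sliding_window_select("france capital", passages, keyword_judge)
```

Because the judge returns a variable-length subset at every step, the size of the final selection depends on the query rather than on a preset top-k.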

Article 510

Title@2025-07-25 (5): How Important is Domain Specificity in Language Models and Instruction Finetuning for Biomedical Relation Extraction?

Wie wichtig ist Domain Specificity in Sprachmodellen und Instruction Finetuning für die biomedizinische Beziehungsextraktion?

在生物医学关系采掘的语言模式和教学教学调整中,域的具体特点有多重要? 2402.13470v2

Authors (2): Aviv Brokman, Ramakanth Kavuluru

Cutting-edge techniques developed in the general NLP domain are often subsequently applied to the high-value, data-rich biomedical domain. The past few years have seen generative language models (LMs), instruction finetuning, and few-shot learning become foci of NLP research. As such, generative LMs pretrained on biomedical corpora have proliferated and biomedical instruction finetuning has been attempted as well, all with the hope that domain specificity improves performance on downstream tasks. Given the nontrivial effort in training such models, we investigate what, if any, benefits they have in the key biomedical NLP task of relation extraction. Specifically, we address two questions: (1) Do LMs trained on biomedical corpora outperform those trained on general-domain corpora? (2) Do models instruction finetuned on biomedical datasets outperform those finetuned on assorted datasets or those simply pretrained? We tackle these questions using existing LMs, testing across four datasets. In a surprising result, general-domain models typically outperformed biomedical-domain models. However, biomedical instruction finetuning improved performance to a similar degree as general instruction finetuning, despite having orders of magnitude fewer instructions. Our findings suggest it may be more fruitful to focus research effort on larger-scale biomedical instruction finetuning of general LMs than on building domain-specific biomedical LMs.

Article 511

Title@2025-07-25 (5): JCAPT: A Joint Modeling Approach for CAPT

JCAPT: Ein gemeinsamer Modellierungsansatz für CAPT

JCAPT: CAPT的联合示范方法 2506.19315v2

Authors (3): Tzu-Hsuan Yang, Yue-Yang He, Berlin Chen

Effective pronunciation feedback is critical in second language (L2) learning, for which computer-assisted pronunciation training (CAPT) systems often encompass two key tasks: automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD). Recent work has shown that joint modeling of these two tasks can yield mutual benefits. Our unified framework leverages Mamba, a selective state space model (SSM), while integrating phonological features and think token strategies to jointly enhance interpretability and fine-grained temporal reasoning in APA and MDD. To our knowledge, this is the first study to combine phonological attribution, SSM-based modeling, and prompting in CAPT. A series of experiments conducted on the speechocean762 benchmark demonstrate that our model consistently outperforms prior methods, particularly on the MDD task.

Article 512

Title@2025-07-25 (5): LLMs are Also Effective Embedding Models: An In-depth Overview

LLMs sind auch effektive Einbettungsmodelle: Eine ausführliche Übersicht

LLM项目也是有效的嵌入模型:深入概述 2412.12591v2

Authors (9): Chongyang Tao, Tao Shen, Shen Gao, Junshuo Zhang, Zhen Li, Kai Hua, Wenpeng Hu, Zhengwei Tao, Shuai Ma

Large language models (LLMs) have revolutionized natural language processing by achieving state-of-the-art performance across various tasks. Recently, their effectiveness as embedding models has gained attention, marking a paradigm shift from traditional encoder-only models like ELMo and BERT to decoder-only, large-scale LLMs such as GPT, LLaMA, and Mistral. This survey provides an in-depth overview of this transition, beginning with foundational techniques before the LLM era, followed by LLM-based embedding models organized around two main strategies for deriving embeddings from LLMs. 1) Direct prompting: We mainly discuss the prompt designs and the underlying rationale for deriving competitive embeddings. 2) Data-centric tuning: We cover the many aspects that affect tuning an embedding model, including model architecture, training objectives, data construction, etc. Building on the above, we also cover advanced methods for producing embeddings from longer texts, multilingual text, code, and cross-modal data, as well as reasoning-aware and other domain-specific scenarios. Furthermore, we discuss factors affecting the choice of embedding model, such as performance/efficiency comparisons, dense vs. sparse embeddings, pooling strategies, and scaling laws. Lastly, the survey highlights the limitations and challenges in adapting LLMs for embeddings, including cross-task embedding quality, trade-offs between efficiency and accuracy, low-resource settings, long context, data bias, robustness, etc. This survey serves as a valuable resource for researchers and practitioners by synthesizing current advancements, highlighting key challenges, and offering a comprehensive framework for future work aimed at enhancing the effectiveness and efficiency of LLMs as embedding models.

Article 513

Title@2025-07-25 (5): Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents

Debating Truth: Debattieren-getriebene Behauptungsverifizierung mit mehreren Large Language Model Agents

讨论真相:由辩论驱动的与多语种示范语言代理核查索赔要求 2507.19090v1

Authors (5): Haorui He, Yupeng Li, Dacheng Wen, Reynold Cheng, Francis C. M. Lau

Claim verification is critical for enhancing digital literacy. However, state-of-the-art single-LLM methods struggle with complex claim verification that involves multi-faceted evidence. Inspired by real-world fact-checking practices, we propose DebateCV, the first claim verification framework that adopts a debate-driven methodology using multiple LLM agents. In our framework, two Debaters take opposing stances on a claim and engage in multi-round argumentation, while a Moderator evaluates the arguments and renders a verdict with justifications. To further improve the performance of the Moderator, we introduce a novel post-training strategy that leverages synthetic debate data generated by the zero-shot DebateCV, effectively addressing the scarcity of real-world debate-driven claim verification data. Experimental results show that our method outperforms existing claim verification methods under varying levels of evidence quality. Our code and dataset are publicly available at https://anonymous.4open.science/r/DebateCV-6781.
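
The control flow described above (two debaters with opposing stances, several rounds, then a moderator verdict) can be sketched roughly as follows. This is not the DebateCV implementation: the `run_debate` function, the `Agent` interface, and the stub agents are hypothetical, and in practice each agent would wrap an LLM call.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]                       # (speaker, utterance)
Agent = Callable[[str, List[Turn]], str]     # hypothetical agent interface


def run_debate(
    claim: str,
    affirmative: Agent,
    negative: Agent,
    moderator: Agent,
    rounds: int = 2,
) -> Tuple[str, List[Turn]]:
    """Alternate the two debaters for a fixed number of rounds, then ask the moderator."""
    transcript: List[Turn] = []
    for _ in range(rounds):
        # Each debater sees the claim and everything said so far.
        transcript.append(("affirmative", affirmative(claim, transcript)))
        transcript.append(("negative", negative(claim, transcript)))
    verdict = moderator(claim, transcript)
    return verdict, transcript


# Stub agents so the sketch runs without any model; real agents would call an LLM.
def stub_affirmative(claim: str, transcript: List[Turn]) -> str:
    return f"Evidence consistent with: {claim}"


def stub_negative(claim: str, transcript: List[Turn]) -> str:
    return f"Evidence casting doubt on: {claim}"


def stub_moderator(claim: str, transcript: List[Turn]) -> str:
    return f"verdict after {len(transcript)} turns: not enough info"


verdict, transcript = run_debate(
    "The Eiffel Tower is in Paris.", stub_affirmative, stub_negative, stub_moderator
)
```

Keeping the agents as plain callables makes it easy to swap the zero-shot moderator for a post-trained one without touching the debate loop.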

Article 514

Title: Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement

Arg-LlaDA: Argumentationszusammenfassung über Large Language Diffusion Models und Sufficiency-Aware Refinement

ARG-LLADA:通过大语言传播模型和充足软件精炼进行参数汇总 2507.19081v1

Authors (6): Hao Li, Yizheng Sun, Viktor Schlegel, Kailai Yang, Riza Batista-Navarro, Goran Nenadic

Argument summarization aims to generate concise, structured representations of complex, multi-perspective debates. While recent work has advanced the identification and clustering of argumentative components, the generation stage remains underexplored. Existing approaches typically rely on single-pass generation, offering limited support for factual correction or structural refinement. To address this gap, we introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries via sufficiency-guided remasking and regeneration. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, yielding more faithful, concise, and coherent outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics. In addition, human evaluations reveal substantial improvements across the core dimensions of coverage, faithfulness, and conciseness, validating the effectiveness of our iterative, sufficiency-aware generation strategy.

Article 515

Title@2025-07-25 (5): Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

Re:Form – Reduzierung menschlicher Priore bei skalierbarer formaler Software-Verifikation mit RL in LLMs: Eine Vorstudie zu Dafny

Re:形式 – – 在可扩展的正式软件核查中减少人类前科,LLL女士:关于Dafny的初步研究 2507.16331v2

Authors (16): Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu

Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models can hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems, where generative models operate in formal language spaces (e.g., Dafny), enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, providing such priors to supervise complex programming tasks becomes prohibitively labor-intensive. In this work, we systematically explore ways to reduce human priors, with the formal language Dafny as the main environment for our pilot study. Our approach relies mainly on an automatic and scalable data curation pipeline and on careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.

Article 516

Title@2025-07-25 (5): ToolACE: Winning the Points of LLM Function Calling

ToolACE: Die Punkte des LLM-Funktionsaufrufs gewinnen

工具ACE:赢得LLLM函数调用点 2409.00920v2

Authors (27): Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, Enhong Chen

Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE.
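
To make the idea of a dual-layer check concrete, here is a small sketch in which a rule-based layer validates a function call against an API specification and a model-based layer is consulted only if the rules pass. The abstract does not describe the actual rules or the model check used in ToolACE, so `rule_check`, `verify`, the spec format (parameter name mapped to a Python type, all parameters required), and the `model_check` callable are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

ApiPool = Dict[str, Dict[str, type]]  # hypothetical spec: API name -> {param: type}


def rule_check(call: Dict[str, Any], api_pool: ApiPool) -> List[str]:
    """Rule-based layer: the call must name a known API and match its parameters."""
    name = call.get("name")
    if name not in api_pool:
        return [f"unknown API: {name}"]
    spec = api_pool[name]
    args = call.get("arguments", {})
    errors: List[str] = []
    for param, expected in spec.items():
        if param not in args:
            errors.append(f"missing argument: {param}")
        elif not isinstance(args[param], expected):
            errors.append(f"wrong type for {param}")
    for param in args:
        if param not in spec:
            errors.append(f"unexpected argument: {param}")
    return errors


def verify(
    call: Dict[str, Any],
    api_pool: ApiPool,
    model_check: Callable[[Dict[str, Any]], bool],
) -> bool:
    """Run the cheap rule layer first; consult the model layer only if it passes."""
    if rule_check(call, api_pool):
        return False
    return model_check(call)


pool: ApiPool = {"get_weather": {"city": str, "days": int}}
good_call = {"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}
bad_call = {"name": "get_weather", "arguments": {"city": "Paris"}}
```

Ordering the layers this way keeps the more expensive model-based judgment off samples that already fail a mechanical check.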

Article 517

Title@2025-07-25 (5): GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

GOAT-SLM: Ein gesprochenes Sprachmodell mit paralinguistischem und Lautsprechercharakteristischem Bewusstsein

GOAT-SLM:具有多语言语言和议长特点意识的口语模式 2507.18119v2

Authors (16): Hongjie Chen, Zehan Li, Yaodong Song, Wenming Deng, Yitong Yao, Yuxin Zhang, Hang Lv, Xuechao Zhu, Jian Kang, Jie Lian, Jie Li, Chao Wang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li

Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.

Article 518

Title@2025-07-25 (5): XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare

XAI4LLM. Lassen Sie Modelle für maschinelles Lernen und LLMs für verbessertes In-Context-Lernen im Gesundheitswesen zusammenarbeiten

XAI4LLLM. 让机器学习模式和LLM合作促进保健领域加强内文学习 2405.06270v4

Authors (4): Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio

Clinical decision support systems require models that are not only highly accurate but also equitable and sensitive to the implications of missed diagnoses. In this study, we introduce a knowledge-guided in-context learning (ICL) framework designed to enable large language models (LLMs) to effectively process structured clinical data. Our approach integrates domain-specific feature groupings, carefully balanced few-shot examples, and task-specific prompting strategies. We systematically evaluate this method across seventy distinct ICL designs, spanning various prompt variations and two different communication styles (natural-language narrative and numeric conversational), and compare its performance to robust classical machine learning (ML) benchmarks on tasks involving heart disease and diabetes prediction. Our findings indicate that while traditional ML models maintain superior performance in balanced precision-recall scenarios, LLMs employing narrative prompts with integrated domain knowledge achieve higher recall and significantly reduce gender bias, effectively narrowing fairness disparities by an order of magnitude. Despite the current limitation of increased inference latency, LLMs provide notable advantages, including the capacity for zero-shot deployment and enhanced equity. This research offers the first comprehensive analysis of ICL design considerations for applying LLMs to tabular clinical tasks and highlights distillation and multimodal extensions as promising directions for future research.

Article 519

Title@2025-07-25 (5): T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

T2ISafety: Benchmark für die Bewertung von Fairness, Toxizität und Datenschutz in der Bildgenerierung

T2ISafetty:评估图像生成中的公平、毒性和隐私的基准 2501.12612v3

Authors (8): Lijun Li, Zhelun Shi, Xuhao Hu, Bowen Dong, Yiran Qin, Xihui Liu, Lu Sheng, Jing Shao

Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific safety dimensions, many critical risks remain unexplored. To address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and privacy. We build a detailed hierarchy of 12 tasks and 44 categories based on these three domains, and meticulously collect 70K corresponding prompts. Based on this taxonomy and prompt set, we build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks that previous work has failed to identify, including risks that even ultra-large proprietary models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns, including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models, even with defense methods like concept erasing. Data and evaluator are released at https://github.com/adwardlee/t2i_safety.

Article 520

Title@2025-07-25 (5): Closing the Modality Gap for Mixed Modality Search

Schließen der Modalitätslücke für gemischte Modalitätssuche

缩小混合方式搜索模式差距 2507.19054v1

Authors (6): Binxu Li, Yuhui Zhang, Xiaohan Wang, Weixin Liang, Ludwig Schmidt, Serena Yeung-Levy

Mixed modality search, i.e., retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents, is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP’s embedding space. Evaluated on MixBench, the first benchmark specifically designed for mixed modality search, GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP and surpasses recent vision-language generative embedding models by 4 percentage points while using 75x less compute.
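
For readers unfamiliar with the metric quoted above, the following sketch computes NDCG@k from a ranked list of graded relevance labels using the standard log2 discount. It is a generic illustration, not the MixBench evaluation code, and it assumes the input list contains every relevant item for the query, so that sorting it yields the ideal ranking.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (rank 1 has discount 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of the retrieved items in ranked order (1 = relevant, 0 = not relevant).
score = ndcg_at_k([1, 0, 1], k=10)
```

A gain of 26 percentage points therefore means the normalized score, which lies between 0 and 1, rises by 0.26 on that benchmark split.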

Article 521

Title@2025-07-25 (5): PARROT: An Open Multilingual Radiology Reports Dataset

PARROT: Ein offener Mehrsprachiger Röntgenbericht Datensatz

开放多语种放射学报告数据集 2507.22939v1

Authors (88): Bastien Le Guellec, Kokou Adambounou, Lisa C Adams, Thibault Agripnidis, Sung Soo Ahn, Radhia Ait Chalal, Tugba Akinci D Antonoli, Philippe Amouyel, Henrik Andersson, Raphael Bentegeac, Claudio Benzoni, Antonino Andrea Blandino, Felix Busch, Elif Can, Riccardo Cau, Armando Ugo Cavallo, Christelle Chavihot, Erwin Chiquete, Renato Cuocolo, Eugen Divjak, Gordana Ivanac, Barbara Dziadkowiec Macek, Armel Elogne, Salvatore Claudio Fanni, Carlos Ferrarotti, Claudia Fossataro, Federica Fossataro, Katarzyna Fulek, Michal Fulek, Pawel Gac, Martyna Gachowska, Ignacio Garcia Juarez, Marco Gatti, Natalia Gorelik, Alexia Maria Goulianou, Aghiles Hamroun, Nicolas Herinirina, Krzysztof Kraik, Dominik Krupka, Quentin Holay, Felipe Kitamura, Michail E Klontzas, Anna Kompanowska, Rafal Kompanowski, Alexandre Lefevre, Tristan Lemke, Maximilian Lindholz, Lukas Muller, Piotr Macek, Marcus Makowski, Luigi Mannacio, Aymen Meddeb, Antonio Natale, Beatrice Nguema Edzang, Adriana Ojeda, Yae Won Park, Federica Piccione, Andrea Ponsiglione, Malgorzata Poreba, Rafal Poreba, Philipp Prucker, Jean Pierre Pruvo, Rosa Alba Pugliesi, Feno Hasina Rabemanorintsoa, Vasileios Rafailidis, Katarzyna Resler, Jan Rotkegel, Luca Saba, Ezann Siebert, Arnaldo Stanzione, Ali Fuat Tekin, Liz Toapanta Yanchapaxi, Matthaios Triantafyllou, Ekaterini Tsaoulia, Evangelia Vassalou, Federica Vernuccio, Johan Wasselius, Weilang Wang, Szymon Urban, Adrian Wlodarczak, Szymon Wlodarczak, Andrzej Wysocki, Lina Xu, Tomasz Zatonski, Shuhang Zhang, Sebastian Ziegelmayer, Gregory Kuchcinski, Keno K Bressem

Rationale and Objectives: To develop and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a large, multicentric, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Materials and Methods: From May to September 2024, radiologists were invited to contribute fictional radiology reports following their standard reporting practices. Contributors provided at least 20 reports with associated metadata including anatomical region, imaging modality, clinical context, and for non-English reports, English translations. All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants (radiologists, healthcare professionals, and non-healthcare professionals) assessing whether reports were human-authored or AI-generated. Results: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Reports cover multiple imaging modalities (CT: 36.1%, MRI: 22.8%, radiography: 19.0%, ultrasound: 16.8%) and anatomical regions, with chest (19.9%), abdomen (18.6%), head (17.3%), and pelvis (14.1%) being most prevalent. In the differentiation study, participants achieved 53.9% accuracy (95% CI: 50.7%-57.1%) in distinguishing between human and AI-generated reports, with radiologists performing significantly better (56.9%, 95% CI: 53.3%-60.6%, p<0.05) than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints.

Article 522

Title@2025-07-25 (5): FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems

Title: FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems

FD-Bench: Eine Full-Duplex-Benchmarking-Pipeline für volle Duplex-Gesprochene Dialogsysteme

FD-Bench:为全双口孔对话系统设计的全自动基准管道 2507.19040v1

Authors (7): Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng

Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizing LLMs, TTS, and ASR to address this gap. It assesses FDSDS’s ability to handle user interruptions, manage delays, and maintain robustness in challenging scenarios with diverse novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions. The results show that all models continue to face challenges, such as failing to respond to user interruptions, under frequent disruptions and noisy conditions. Demonstrations, data, and code will be released.

Article 523

Title@2025-07-25 (5): MLLM-based Speech Recognition: When and How is Multimodality Beneficial?

Title: MLLM-based Speech Recognition: When and How is Multimodality Beneficial?

MLLM-basierte Spracherkennung: Wann und wie ist Multimodalität vorteilhaft?

基于MLLM的语音识别:多式联运何时和如何受益? 2507.19037v1

Authors (4): Yiwen Guan, Viet Anh Trinh, Vivek Voleti, Jacob Whitehill

Recent advances in multi-modal large language models (MLLMs) have opened new possibilities for unified modeling of speech, text, images, and other modalities. Building on our prior work, this paper examines the conditions and model architectures under which multiple input modalities can improve automatic speech recognition (ASR) accuracy in noisy environments. Through experiments on synthetic and real-world data, we find that (1) harnessing more modalities usually improves ASR accuracy, as each modality provides complementary information, but the improvement depends on the amount of auditory noise. (2) Synchronized modalities (e.g., lip movements) are more useful at high noise levels whereas unsynchronized modalities (e.g., image context) are most helpful at moderate noise levels. (3) Higher-quality visual representations consistently improve ASR accuracy, highlighting the importance of developing more powerful visual encoders. (4) Mamba exhibits similar trends regarding the benefits of multimodality as do Transformers. (5) The input order of modalities as well as their weights in the loss function can significantly impact accuracy. These findings both offer practical insights and help to deepen our understanding of multi-modal speech recognition under challenging conditions.

Article 524

Title: A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents

Ein Graph-basierter Ansatz für Multi-Modal-Fragebeantwortungen aus Flussdiagrammen in Telecom-Dokumenten

以图表为基础的电信文件流动图表多模式问题解答方法 2507.22938v1

Authors (7): Sumit Soman, H. G. Ranjani, Sujoy Roychowdhury, Venkata Dharma Surya Narayana Sastry, Akshat Jain, Pranav Gangrade, Ayaaz Khan

Question-Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual Large Language Models (VLMs) and incorporate them in a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach, from processing technical documents, classifying image types, and building graph representations to incorporating them with the text embedding pipeline for efficient retrieval. We benchmark this approach on a QA dataset created from proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM have lower edit distance with respect to the ground truth, which illustrates the robustness of these representations for flowchart images. Further, the approach for QA using these representations gives good retrieval performance using text-based embedding models, including a telecom-domain adapted one. Our approach also alleviates the need for a VLM at inference time, which is an important cost benefit for deployed QA systems.
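
The idea of turning a flowchart into text that a text-only retriever can index, and then scoring it against a ground-truth graph by edit distance, can be sketched as follows. This is a minimal illustration under stated assumptions: the edge-list format, the `flowchart_to_text` serialization, and the character-level Levenshtein distance are hypothetical choices, not the representation or metric implementation used by the authors.

```python
from typing import List, Tuple

# Hypothetical edge format: (source node, edge label or "", target node).
Edge = Tuple[str, str, str]


def flowchart_to_text(edges: List[Edge]) -> str:
    """Serialize flowchart edges into plain text that a text embedder can index."""
    lines = []
    for src, label, dst in edges:
        arrow = f" --[{label}]--> " if label else " --> "
        lines.append(f"{src}{arrow}{dst}")
    return "\n".join(lines)


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


# Illustrative data only; not taken from the paper's dataset.
predicted = [("Start", "", "Check alarm"), ("Check alarm", "yes", "Restart unit")]
ground_truth = [("Start", "", "Check alarm"), ("Check alarm", "yes", "Reset unit")]
print(edit_distance(flowchart_to_text(predicted), flowchart_to_text(ground_truth)))
```

A lower distance means the predicted graph text is closer to the reference; the serialized text can be embedded alongside ordinary document chunks, so no VLM call is needed at query time.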

Article 525

Title@2025-07-25 (5): Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems

Title: Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems

Akustisch präzises Hesitations-Tagging ist für End-to-End-Transkriptionssysteme unerlässlich

终端至终端逐字记录翻译系统至关重要的隐含精确言辞 2506.04076v2

Authors (5): Jhen-Ke Lin, Hao-Chien Lu, Chung-Chun Wang, Hong-Yun Lin, Berlin Chen

Verbatim transcription for automatic speaking assessment demands accurate capture of disfluencies, crucial for downstream tasks like error analysis and feedback. However, many ASR systems discard or generalize hesitations, losing important acoustic details. We fine-tune Whisper models on the Speak & Improve 2025 corpus using low-rank adaptation (LoRA), without recourse to external audio training data. We compare three annotation schemes: removing hesitations (Pure), generic tags (Rich), and acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs (Extra). Our challenge system achieved 6.47% WER (Pure) and 5.81% WER (Extra). Post-challenge experiments reveal that fine-tuning Whisper Large V3 Turbo with the “Extra” scheme yielded a 5.5% WER, an 11.3% relative improvement over the “Pure” scheme (6.2% WER). This demonstrates that explicit, realistic filled-pause labeling significantly enhances ASR accuracy for verbatim L2 speech transcription.
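
The 11.3% figure follows directly from the two post-challenge WER values quoted above. A minimal sketch of that arithmetic, together with a standard word-level WER computation for context; the function names and the sample sentence are illustrative and not taken from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the previous reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional reduction of an error metric relative to the baseline."""
    return (baseline - improved) / baseline


# Dropping a filled pause counts as one deletion out of four reference words.
print(wer("i um went home", "i went home"))      # 0.25
# "Pure" 6.2% WER vs. "Extra" 5.5% WER, as reported in the abstract.
print(f"{relative_improvement(6.2, 5.5):.1%}")   # 11.3%
```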

Article 526

Title@2025-07-25 (5): Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations

Title: Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations

Töten Sie zwei Vögel mit einer Klappe: generalisierte und robuste KI-generierte Texterkennung durch dynamische Störungen

以一石一石杀死两鸟:通过动态扰动,普遍和有力地检测AI产生的文本 2504.21019v2

Authors (6): Yinghan Zhou, Juan Wen, Wanli Peng, Yiming Xue, Ziwei Zhang, Zhengxian Wu

The growing popularity of large language models has raised concerns regarding the potential to misuse AI-generated text (AIGT). It is increasingly critical to establish an AIGT detection method with high generalization and robustness. However, existing methods focus either on model generalization or on robustness. A unified mechanism that simultaneously addresses the challenges of generalization and robustness is less explored. In this paper, we argue that robustness can be viewed as a specific form of domain shift, and empirically reveal an intrinsic mechanism for model generalization in the AIGT detection task. We then propose a novel AIGT detection method (DP-Net) via dynamic perturbations introduced by reinforcement learning with carefully designed reward and action. Experimentally, extensive results show that the proposed DP-Net significantly outperforms some state-of-the-art AIGT detection methods for generalization capacity in three cross-domain scenarios. Meanwhile, DP-Net achieves the best robustness under two text adversarial attacks. The code is publicly available at https://github.com/CAU-ISS-Lab/AIGT-Detection-Evade-Detection/tree/main/DP-Net.

Article 527

Title@2025-07-25 (5): Advancing biomolecular understanding and design following human instructions

Title: Advancing biomolecular understanding and design following human instructions

Verbesserung des biomolekularen Verständnisses und Designs nach menschlichen Anweisungen

按照人类的指示,推动生物分子理解和设计 2410.07919v2

Authors (12): Xiang Zhuang, Keyan Ding, Tianwen Lyu, Yinuo Jiang, Xiaotong Li, Zhuoyi Xiang, Zeyuan Wang, Ming Qin, Kehua Feng, Jike Wang, Qiang Zhang, Huajun Chen

Understanding and designing biomolecules, such as proteins and small molecules, is central to advancing drug discovery, synthetic biology and enzyme engineering. Recent breakthroughs in artificial intelligence have revolutionized biomolecular research, achieving remarkable accuracy in biomolecular prediction and design. However, a critical gap remains between artificial intelligence’s computational capabilities and researchers’ intuitive goals, particularly in using natural language to bridge complex tasks with human intentions. Large language models have shown potential to interpret human intentions, yet their application to biomolecular research remains nascent due to challenges including specialized knowledge requirements, multimodal data integration, and semantic alignment between natural language and biomolecules. To address these limitations, we present InstructBioMol, a large language model designed to bridge natural language and biomolecules through a comprehensive any-to-any alignment of natural language, molecules and proteins. This model can integrate multimodal biomolecules as the input, and enable researchers to articulate design goals in natural language, providing biomolecular outputs that meet precise biological needs. Experimental results demonstrate that InstructBioMol can understand and design biomolecules following human instructions. In particular, it can generate drug molecules with a 10% improvement in binding affinity and design enzymes that achieve an enzyme-substrate pair prediction score of 70.4. This highlights its potential to transform real-world biomolecular research. The code is available at https://github.com/HICAI-ZJU/InstructBioMol.

Article 528

Title@2025-07-25 (5): HIVMedQA: Benchmarking large language models for HIV medical decision support

Title: HIVMedQA: Benchmarking large language models for HIV medical decision support

HIVMedQA: Benchmarking großer Sprachmodelle für die medizinische HIV-Entscheidungsunterstützung

HIVMedQA:确定艾滋病毒医疗决策支助大语言模式的基准 2507.18143v2

Authors (6): Gonzalo Cardenal-Antolin, Jacques Fellay, Bashkim Jaha, Roger Kouyos, Niko Beerenwinkel, Diane Duroux

Large language models (LLMs) are emerging as valuable tools to support clinicians in routine decision-making. HIV management is a compelling use case due to its complexity, including diverse treatment options, comorbidities, and adherence challenges. However, integrating LLMs into clinical practice raises concerns about accuracy, potential harm, and clinician acceptance. Despite their promise, AI applications in HIV care remain underexplored, and LLM benchmarking studies are scarce. This study evaluates the current capabilities of LLMs in HIV management, highlighting their strengths and limitations. We introduce HIVMedQA, a benchmark designed to assess open-ended medical question answering in HIV care. The dataset consists of curated, clinically relevant questions developed with input from an infectious disease physician. We evaluated seven general-purpose and three medically specialized LLMs, applying prompt engineering to enhance performance. Our evaluation framework incorporates both lexical similarity and an LLM-as-a-judge approach, extended to better reflect clinical relevance. We assessed performance across key dimensions: question comprehension, reasoning, knowledge recall, bias, potential harm, and factual accuracy. Results show that Gemini 2.5 Pro consistently outperformed other models across most dimensions. Notably, two of the top three models were proprietary. Performance declined as question complexity increased. Medically fine-tuned models did not always outperform general-purpose ones, and larger model size was not a reliable predictor of performance. Reasoning and comprehension were more challenging than factual recall, and cognitive biases such as recency and status quo were observed. These findings underscore the need for targeted development and evaluation to ensure safe, effective LLM integration in clinical care.

Article 529

Title@2025-07-25 (5): Verbalized Representation Learning for Interpretable Few-Shot Generalization

Title: Verbalized Representation Learning for Interpretable Few-Shot Generalization

Verbalisiertes Repräsentationslernen für verdolmetschbare wenige-heiße Verallgemeinerung

以口头方式进行代表性学习,为可口译的少或偏的普及化提供口译 2411.18651v2

Authors (6): Cheng-Fu Yang, Da Yin, Wenbo Hu, Nanyun Peng, Bolei Zhou, Kai-Wei Chang

Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representations can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks. Code is available at: https://github.com/joeyy5588/VRL/tree/main.

Article 530

Title@2025-07-25 (5): Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation

Title: Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation

Bewertung von LLM-Fehlern, für die Personalisierte Disinformationsgenerierung missbräuchlich verwendet zu werden

评价LLMM 利用LLM 个人化信息生成不当利用他人造成个人化信息的脆弱性 2412.13666v2

Authors (7): Aneta Zugecova, Dominik Macko, Ivan Srba, Robert Moro, Jakub Kopal, Katarina Marcincinova, Matus Mesarcik

The capabilities of recent large language models (LLMs) to generate high-quality content that humans cannot distinguish from human-written texts raise many concerns regarding their misuse. Previous research has shown that LLMs can be effectively misused for generating disinformation news articles following predefined narratives. Their capabilities to generate personalized (in various aspects) content have also been evaluated and mostly found usable. However, the combination of personalization and disinformation abilities of LLMs has not yet been comprehensively studied. Such a dangerous combination should trigger the integrated safety filters of the LLMs, if any exist. This study fills this gap by evaluating the vulnerabilities of recent open and closed LLMs, and their willingness to generate personalized disinformation news articles in English. We further explore whether the LLMs can reliably meta-evaluate the personalization quality and whether the personalization affects the detectability of the generated texts. Our results demonstrate the need for stronger safety filters and disclaimers, as those are not properly functioning in most of the evaluated LLMs. Additionally, our study revealed that the personalization actually reduces the safety-filter activations, thus effectively functioning as a jailbreak. Such behavior must be urgently addressed by LLM developers and service providers.

Article 531

Title@2025-07-25 (5): CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering

Title: CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering

CoE-Ops: Zusammenarbeit von LLM-basierten Experten für AIOps Frage-Antwort

欧委会行动:以LLM为基础的专家协作处理AIOps问题 2507.22937v1

Authors (9): Jinkun Zhao, Yuanshuai Wang, Xingjian Zhang, Ruibo Chen, Xingchuang Liao, Junle Wang, Lei Huang, Kui Zhang, Wenjun Wu

With the rapid evolution of artificial intelligence, AIOps has emerged as a prominent paradigm in DevOps. Much work has been proposed to improve the performance of different AIOps phases. However, constrained by domain-specific knowledge, a single model can only handle the operation requirements of a specific task, such as log parsing or root cause analysis. Meanwhile, combining multiple models can achieve more efficient results, as has been shown in both earlier ensemble learning and the recent LLM training domain. Inspired by these works, to address similar challenges in AIOps, this paper first proposes a collaboration-of-expert framework (CoE-Ops) incorporating a general-purpose large language model task classifier. A retrieval-augmented generation mechanism is introduced to improve the framework’s capability in handling Question-Answering tasks at both the high level (code, build, test, etc.) and the low level (fault analysis, anomaly detection, etc.). Finally, the proposed method is implemented in the AIOps domain, and extensive experiments are conducted on the DevOps-EVAL dataset. Experimental results demonstrate that CoE-Ops achieves a 72% improvement in routing accuracy for high-level AIOps tasks compared to existing CoE methods, delivers up to 8% accuracy enhancement over single AIOps models in DevOps problem resolution, and outperforms larger-scale Mixture-of-Experts (MoE) models by up to 14% in accuracy.

Article 532

Title: MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts

MultiSocial: Mehrsprachiger Benchmark der maschinengenerierten Texterkennung von Social-Media-Texten

多社会多语言:社会-媒体文本机制文本检测多语言基准 2406.12549v2

Authors (4): Dominik Macko, Jakub Kopal, Robert Moro, Ivan Srba

Recent LLMs are able to generate high-quality multilingual texts that humans cannot distinguish from authentic human-written ones. Research in machine-generated text detection is, however, mostly focused on the English language and longer texts, such as news articles, scientific papers or student essays. Social-media texts are usually much shorter and often feature informal language, grammatical errors, or distinct linguistic items (e.g., emoticons, hashtags). There is a gap in studying the ability of existing methods to detect such texts, reflected also in the lack of multilingual benchmark datasets. To fill this gap, we propose the first multilingual (22 languages) and multi-platform (5 social media platforms) dataset for benchmarking machine-generated text detection in the social-media domain, called MultiSocial. It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual LLMs. We use this benchmark to compare existing detection methods in zero-shot as well as fine-tuned form. Our results indicate that fine-tuned detectors can be trained on social-media texts without difficulty and that the platform selection for training matters.

Article 533

Title@2025-07-25 (5): A Toolbox, Not a Hammer – Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation

Title: A Toolbox, Not a Hammer – Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation

Eine Toolbox, kein Hammer – Multi-TAG: Skalierung der Mathematik mit Multi-Tool-Aggregation

一个工具箱, 不是锤锤 – – 多TAG: 使用多工具聚合的量性数学解释 2507.18973v1

Authors (2): Bohan Yao, Vikas Yadav

Augmenting large language models (LLMs) with external tools is a promising avenue for developing high-performance mathematical reasoning systems. Prior tool-augmented approaches typically finetune an LLM to select and invoke a single tool at each reasoning step and show promising results on simpler math reasoning benchmarks such as GSM8K. However, these approaches struggle with more complex math problems that require precise reasoning over multiple steps. To address this limitation, in this work, we propose Multi-TAG, a Multi-Tool AGgregation-based framework. Instead of relying on a single tool, Multi-TAG guides an LLM to concurrently invoke multiple tools at each reasoning step. It then aggregates their diverse outputs to verify and refine the reasoning process, enhancing solution robustness and accuracy. Notably, Multi-TAG is a finetuning-free, inference-only framework, making it readily applicable to any LLM backbone, including large open-weight models which are computationally expensive to finetune and proprietary frontier models which cannot be finetuned with custom recipes. We evaluate Multi-TAG on four challenging benchmarks: MATH500, AIME, AMC, and OlympiadBench. Across both open-weight and closed-source LLM backbones, Multi-TAG consistently and substantially outperforms state-of-the-art baselines, with average improvements of 6.0% to 7.5%.
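
The "invoke several tools, then aggregate" step can be illustrated with a small sketch. The abstract does not spell out the aggregation rule, so the example below assumes plain majority voting as a stand-in; the `Tool` type, `aggregate_step`, and the toy tools are hypothetical, and the tools are called one after another rather than concurrently to keep the example short.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Tuple

# Hypothetical tool interface: takes the current problem state, returns a
# candidate answer as a string, or None if the tool fails on this step.
Tool = Callable[[str], Optional[str]]


def aggregate_step(problem: str,
                   tools: Dict[str, Tool]) -> Tuple[Optional[str], Dict[str, Optional[str]]]:
    """Run every tool on the same step and pick the most common non-empty output."""
    outputs = {name: tool(problem) for name, tool in tools.items()}
    votes = Counter(out for out in outputs.values() if out is not None)
    if not votes:
        return None, outputs
    # Ties go to the answer that was produced first (Counter keeps insertion order).
    answer, _ = votes.most_common(1)[0]
    return answer, outputs


# Toy tools standing in for, e.g., natural-language reasoning, code execution,
# and a symbolic solver (assumed roles, not the paper's tool set).
toy_tools: Dict[str, Tool] = {
    "reasoning": lambda p: "4",
    "code": lambda p: "4",
    "symbolic": lambda p: "5",
    "broken": lambda p: None,
}
answer, outputs = aggregate_step("What is 2 + 2?", toy_tools)
print(answer)
```

Because the aggregation happens purely at inference time, a wrapper like this can sit in front of any model backbone without finetuning, which is the property the abstract emphasizes.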

Article 534

Title@2025-07-25 (5): Spike No More: Stabilizing the Pre-training of Large Language Models

Title: Spike No More: Stabilizing the Pre-training of Large Language Models

Spike No More: Stabilisierung der Vorausbildung großer Sprachmodelle

Spike No No More: 稳定大语言模式培训前 2312.16903v4

Authors (4): Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki

Loss spikes often occur during pre-training of large language models. The spikes degrade the performance of large language models and sometimes ruin the pre-training. Since the pre-training needs a vast computational budget, we should avoid such spikes. Based on the assumption that the loss spike is caused by the sudden growth of the gradient norm, we explore factors to keep the gradient norm small through an analysis of the spectral norms of the Jacobian matrices for the sub-layers. Our findings suggest that stabilizing the pre-training process requires two conditions: small sub-layers and large shortcut. We conduct various experiments to empirically verify our theoretical analyses. Experimental results demonstrate that methods satisfying the conditions effectively prevent loss spikes during pre-training.

Article 535

Title@2025-07-25 (5): A Similarity Measure for Comparing Conversational Dynamics

Title: A Similarity Measure for Comparing Conversational Dynamics

Eine Ähnlichkeitsmessung für den Vergleich von Konversationsdynamiken

比较相互动态的相似性措施 2507.18956v1

Authors (3): Sang Min Jung, Kaixiang Zhang, Cristian Danescu-Niculescu-Mizil

The quality of a conversation goes beyond the individual quality of each reply, and instead emerges from how these combine into interactional patterns that give the conversation its distinctive overall “shape”. However, there is no robust automated method for comparing conversations in terms of their overall interactional dynamics. Such methods could enhance the analysis of conversational data and help evaluate conversational agents more holistically. In this work, we introduce a similarity measure for comparing conversations with respect to their dynamics. We design a validation framework for testing the robustness of the metric in capturing differences in conversation dynamics and for assessing its sensitivity to the topic of the conversations. Finally, to illustrate the measure’s utility, we use it to analyze conversational dynamics in a large online community, bringing new insights into the role of situational power in conversations.

Article 536

Title@2025-07-25 (5): MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model

Title: MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model

MedicalBERT: Verbesserung der biomedizinischen natürlichen Sprachverarbeitung mit vorgebildetem BERT-basiertem Modell

医学BERT:利用预先培训的BERT模式,加强生物医学自然语言处理 2507.08013v2

Authors (6): K. Sahit Reddy, N. Ragavenderan, Vasanth K., Ganesh N. Naik, Vishalakshi Prabhu, Nagaraja G. S

Recent advances in natural language processing (NLP) have been driven by pretrained language models like BERT, RoBERTa, T5, and GPT. These models excel at understanding complex texts, but biomedical literature, with its domain-specific terminology, poses challenges that models like Word2Vec and bidirectional long short-term memory (Bi-LSTM) can’t fully address. GPT and T5, despite capturing context, fall short in tasks needing bidirectional understanding, unlike BERT. Addressing this, we propose MedicalBERT, a pretrained BERT model trained on a large biomedical dataset and equipped with domain-specific vocabulary that enhances the comprehension of biomedical terminology. The MedicalBERT model is further optimized and fine-tuned to address diverse tasks, including named entity recognition, relation extraction, question answering, sentence similarity, and document classification. Performance metrics such as the F1-score, accuracy, and Pearson correlation are employed to showcase the efficiency of our model in comparison to other BERT-based models such as BioBERT, SciBERT, and ClinicalBERT. MedicalBERT outperforms these models on most of the benchmarks, and surpasses the general-purpose BERT model by 5.67% on average across all the tasks evaluated. This work also underscores the potential of leveraging pretrained BERT models for medical NLP tasks, demonstrating the effectiveness of transfer learning techniques in capturing domain-specific information.

Article 537

Title@2025-07-25 (5): Legal Document Summarization: Enhancing Judicial Efficiency through Automation Detection

Title: Legal Document Summarization: Enhancing Judicial Efficiency through Automation Detection

Zusammenfassung des Rechtsdokuments: Verbesserung der richterlichen Effizienz durch Automatisierungserkennung

法律文件摘要:通过自动检测提高司法效率 2507.18952v1

Authors (4): Yongjie Li, Ruilin Nong, Jianan Liu, Lucas Evans

Legal document summarization represents a significant advancement towards improving judicial efficiency through the automation of key information detection. Our approach leverages state-of-the-art natural language processing techniques to meticulously identify and extract essential data from extensive legal texts, which facilitates a more efficient review process. By employing advanced machine learning algorithms, the framework recognizes underlying patterns within judicial documents to create precise summaries that encapsulate the crucial elements. This automation alleviates the burden on legal professionals, concurrently reducing the likelihood of overlooking vital information that could lead to errors. Through comprehensive experiments conducted with actual legal datasets, we demonstrate the capability of our method to generate high-quality summaries while preserving the integrity of the original content and enhancing processing times considerably. The results reveal marked improvements in operational efficiency, allowing legal practitioners to direct their efforts toward critical analytical and decision-making activities instead of manual reviews. This research highlights promising technology-driven strategies that can significantly alter workflow dynamics within the legal sector, emphasizing the role of automation in refining judicial processes.

Article 538

Title@2025-07-25 (5): Adaptive Learning Systems: Personalized Curriculum Design Using LLM-Powered Analytics

Title: Adaptive Learning Systems: Personalized Curriculum Design Using LLM-Powered Analytics

Adaptive Lernsysteme: Personalisierte Lehrplangestaltung mit LLM-Powered Analytics

适应性学习系统:利用LLM能动分析器的个人化课程设计 2507.18949v1

Authors (4): Yongjie Li, Ruilin Nong, Jianan Liu, Lucas Evans

Large language models (LLMs) are revolutionizing the field of education by enabling personalized learning experiences tailored to individual student needs. In this paper, we introduce a framework for Adaptive Learning Systems that leverages LLM-powered analytics for personalized curriculum design. This innovative approach uses advanced machine learning to analyze real-time data, allowing the system to adapt learning pathways and recommend resources that align with each learner’s progress. By continuously assessing students, our framework enhances instructional strategies, ensuring that the materials presented are relevant and engaging. Experimental results indicate a marked improvement in both learner engagement and knowledge retention when using a customized curriculum. Evaluations conducted across varied educational environments demonstrate the framework’s flexibility and positive influence on learning outcomes, potentially reshaping conventional educational practices into a more adaptive and student-centered model.

Article 539

Title@2025-07-25 (5): TreeReader: A Hierarchical Academic Paper Reader Powered by Language Models

Title: TreeReader: A Hierarchical Academic Paper Reader Powered by Language Models

TreeReader: Ein Hierarchischer Akademischer Papierleser Powered by Language Models

树形阅读器:一个按语言模式授权的等级学术论文阅读器 2507.18945v1

Authors (7): Zijian Zhang, Pan Chen, Fangshi Du, Runlong Ye, Oliver Huang, Michael Liut, Alán Aspuru-Guzik

Efficiently navigating and understanding academic papers is crucial for scientific progress. Traditional linear formats like PDF and HTML can cause cognitive overload and obscure a paper’s hierarchical structure, making it difficult to locate key information. While LLM-based chatbots offer summarization, they often lack nuanced understanding of specific sections, may produce unreliable information, and typically discard the document’s navigational structure. Drawing insights from a formative study on academic reading practices, we introduce TreeReader, a novel language model-augmented paper reader. TreeReader decomposes papers into an interactive tree structure where each section is initially represented by an LLM-generated concise summary, with underlying details accessible on demand. This design allows users to quickly grasp core ideas, selectively explore sections of interest, and verify summaries against the source text. A user study was conducted to evaluate TreeReader’s impact on reading efficiency and comprehension. TreeReader provides a more focused and efficient way to navigate and understand complex academic literature by bridging hierarchical summarization with interactive exploration.
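
The summary-first, details-on-demand tree described above can be sketched as a small data structure. This is an illustrative model only: `SectionNode`, `render`, and the sample strings are hypothetical, and the summaries are hard-coded placeholders where the actual system would use LLM-generated text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SectionNode:
    title: str
    summary: str                      # placeholder for an LLM-generated summary
    text: str = ""                    # underlying source text, shown on demand
    children: List["SectionNode"] = field(default_factory=list)
    expanded: bool = False            # collapsed nodes show only their summary


def render(node: SectionNode, depth: int = 0) -> List[str]:
    """Return display lines: a summary per node, details only where expanded."""
    indent = "  " * depth
    lines = [f"{indent}{node.title}: {node.summary}"]
    if node.expanded:
        if node.text:
            lines.append(f"{indent}  | {node.text}")
        for child in node.children:
            lines.extend(render(child, depth + 1))
    return lines


method = SectionNode("Method", "tree of summaries", text="Full method text.")
paper = SectionNode("Paper", "overview", children=[method])

print("\n".join(render(paper)))   # collapsed: only the top-level summary
paper.expanded = True
method.expanded = True
print("\n".join(render(paper)))   # expanded: child summary plus its source text
```

Keeping the source text on each node is what lets a reader check a summary against the original passage, as the abstract describes.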

Article 540

Title@2025-07-25 (5): LLaVA-NeuMT: Selective Layer-Neuron Modulation for Efficient Multilingual Multimodal Translation

Title: LLaVA-NeuMT: Selective Layer-Neuron Modulation for Efficient Multilingual Multimodal Translation

LLaVA-NeuMT: Selektive Schicht-Neuron-Modulation für effiziente multimodale Mehrsprachigkeit

LLAVA-NeUMT: 选择性多语层-Neuron 高效多语种多语种多模式翻译的调整 2507.18940v1

Authors (8): Jingxuan Wei, Caijun Jia, Qi Chen, Yujun Cai, Linzhuang Sun, Xiangxiang Zhang, Gaowei Wu, Bihui Yu

Multimodal Machine Translation (MMT) enhances translation quality by incorporating visual context, helping to resolve textual ambiguities. While existing MMT methods perform well in bilingual settings, extending them to multilingual translation remains challenging due to cross-lingual interference and ineffective parameter-sharing strategies. To address this, we propose LLaVA-NeuMT, a novel multimodal multilingual translation framework that explicitly models language-specific and language-agnostic representations to mitigate multilingual interference. Our approach consists of a layer selection mechanism that identifies the most informative layers for different language pairs and a neuron-level adaptation strategy that dynamically selects language-specific and agnostic neurons to improve translation quality while reducing redundancy. We conduct extensive experiments on the M3-Multi30K and M3-AmbigCaps datasets, demonstrating that LLaVA-NeuMT, while fine-tuning only 40% of the model parameters, surpasses full fine-tuning approaches and ultimately achieves SOTA results on both datasets. Our analysis further provides insights into the importance of selected layers and neurons in multimodal multilingual adaptation, offering an efficient and scalable solution to cross-lingual adaptation in multimodal translation.

Article 541

Title@2025-07-25 (5): Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks

Title: Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks

Benchmarking des multimodalen Verständnisses und der komplexen Begründung für ESG-Aufgaben

确定环境组合组合任务多式联运理解和复杂理由的基准 2507.18932v1

Authors (8): Lei Zhang, Xin Zhou, Chaoyue He, Di Wang, Yi Wu, Hong Xu, Wei Liu, Chunyan Miao

Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. However, these documents are often lengthy, structurally diverse, and multimodal, comprising dense text, structured tables, complex figures, and layout-dependent semantics. Existing AI systems often struggle to perform reliable document-level reasoning in such settings, and no dedicated benchmark currently exists in the ESG domain. To fill the gap, we introduce MMESGBench, a first-of-its-kind benchmark dataset targeted to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. This dataset is constructed via a human-AI collaborative, multi-stage pipeline. First, a multimodal LLM generates candidate question-answer (QA) pairs by jointly interpreting rich textual, tabular, and visual information from layout-aware document pages. Second, an LLM verifies the semantic accuracy, completeness, and reasoning complexity of each QA pair. This automated process is followed by an expert-in-the-loop validation, where domain specialists validate and calibrate QA pairs to ensure quality, relevance, and diversity. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories. Questions are categorized as single-page, cross-page, or unanswerable, with each accompanied by fine-grained multimodal evidence. Initial experiments validate that multimodal and retrieval-augmented models substantially outperform text-only baselines, particularly on visually grounded and cross-page tasks. MMESGBench is publicly available as an open-source dataset at https://github.com/Zhanglei1103/MMESGBench.

Article 542

Title@2025-07-25 (5): Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

Title: Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

Seed-X: Starke Mehrsprachige Übersetzung LLM mit 7B-Parametern aufbauen

种子-X:利用7B参数建立强有力的多语种翻译LLM 2507.13618v3

Authors (26): Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Jingwen Chen, Zhichao Huang, Tao Li, Yifu Li, Huiying Lin, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu

Multilingual translation remains a challenging task for large language models (LLMs), which must handle intricate language patterns and avoid the stilted output that often arises in automated translation. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability at the 7B-parameter scale. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share best practices from our optimization process and make the parameters publicly available to advance translation research and applications.

Article 543

Title@2025-07-25 (5): Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders

Title: Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders

Entdeckt Cross-Linguistic Disparities in LLMs mit Sparse Autoencodern

使用 Sparse 自动编码器在 LLM 中解封跨语言差异 2507.18918v1

Authors (3): Richmond Sin Jing Xuan, Jalil Huseynov, Yang Zhang

Multilingual large language models (LLMs) exhibit strong cross-linguistic generalization, yet medium to low resource languages underperform on common benchmarks such as ARC-Challenge, MMLU, and HellaSwag. We analyze activation patterns in Gemma-2-2B across all 26 residual layers and 10 languages: Chinese (zh), Russian (ru), Spanish (es), Italian (it), medium to low resource languages including Indonesian (id), Catalan (ca), Marathi (mr), Malayalam (ml), and Hindi (hi), with English (en) as the reference. Using Sparse Autoencoders (SAEs), we reveal systematic disparities in activation patterns. Medium to low resource languages receive up to 26.27 percent lower activations in early layers, with a persistent gap of 19.89 percent in deeper layers. To address this, we apply activation-aware fine-tuning via Low-Rank Adaptation (LoRA), leading to substantial activation gains, such as 87.69 percent for Malayalam and 86.32 percent for Hindi, while maintaining English retention at approximately 91 percent. After fine-tuning, benchmark results show modest but consistent improvements, highlighting activation alignment as a key factor in enhancing multilingual LLM performance.

Article 544

Title@2025-07-25 (5): Mining Contextualized Visual Associations from Images for Creativity Understanding

Title: Mining Contextualized Visual Associations from Images for Creativity Understanding

Bergbau Kontextualisierte visuelle Assoziationen aus Bildern für Kreativität Verständnis

利用图像促进创造性理解的采矿背景化视觉协会 2507.18915v1

Authors (3): Ananya Sahu, Amith Ananthram, Kathleen McKeown

Understanding another person’s creative output requires a shared language of association. However, when training vision-language models such as CLIP, we rely on web-scraped datasets containing short, predominantly literal, alt-text. In this work, we introduce a method for mining contextualized associations for salient visual elements in an image that can scale to any unlabeled dataset. Given an image, we can use these mined associations to generate high quality creative captions at increasing degrees of abstraction. With our method, we produce a new dataset of visual associations and 1.7m creative captions for the images in MSCOCO. Human evaluation confirms that these captions remain visually grounded while exhibiting recognizably increasing abstraction. Moreover, fine-tuning a visual encoder on this dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization. We release our dataset, our generation code and our models for use by the broader community.

Article 545

Title@2025-07-25 (5): A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions

Title: A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions

Eine systematische Überprüfung der Systeme der wichtigsten retrieval-Augmented Generation (RAG): Fortschritt, Lücken und Zukunftsrichtungen

系统审查关键回收-养代(RAG)系统:进展、差距和未来方向 2507.18910v1

Authors (4): Agada Joseph Oche, Ademola Glory Folashade, Tirthankar Ghosal, Arpan Biswas

Retrieval-Augmented Generation (RAG) represents a major advancement in natural language processing (NLP), combining large language models (LLMs) with information retrieval systems to enhance factual grounding, accuracy, and contextual relevance. This paper presents a comprehensive systematic review of RAG, tracing its evolution from early developments in open domain question answering to recent state-of-the-art implementations across diverse applications. The review begins by outlining the motivations behind RAG, particularly its ability to mitigate hallucinations and outdated knowledge in parametric models. Core technical components (retrieval mechanisms, sequence-to-sequence generation models, and fusion strategies) are examined in detail. A year-by-year analysis highlights key milestones and research trends, providing insight into RAG’s rapid growth. The paper further explores the deployment of RAG in enterprise systems, addressing practical challenges related to retrieval of proprietary data, security, and scalability. A comparative evaluation of RAG implementations is conducted, benchmarking performance on retrieval accuracy, generation fluency, latency, and computational efficiency. Persistent challenges such as retrieval quality, privacy concerns, and integration overhead are critically assessed. Finally, the review highlights emerging solutions, including hybrid retrieval approaches, privacy-preserving techniques, optimized fusion strategies, and agentic RAG architectures. These innovations point toward a future of more reliable, efficient, and context-aware knowledge-intensive NLP systems.

Article 546

Title@2025-07-25 (5): Large language models provide unsafe answers to patient-posed medical questions

Title: Large language models provide unsafe answers to patient-posed medical questions

Große Sprachmodelle bieten unsichere Antworten auf patientenbezogene medizinische Fragen

大型语言模式为病人提出的医疗问题提供不安全的答案 2507.18905v1

Authors (17): Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany Brazile, Natasha Chase, Dimple Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah

Millions of patients are already using large language model (LLM) chatbots for medical advice on a regular basis, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots (Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta) on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women’s health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.

Article 547

Title@2025-07-25 (5): SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

Title: SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

SLoW: Wählen Sie niederfrequente Wörter aus! Automatische Wörterbuchauswahl für Übersetzungen auf großen Sprachmodellen

SLOW: 选择低频单词! 用于大语言模型翻译的自动词典选择 2507.18902v1

Authors (4): Hongyuan Lu, Zixuan Li, Zefan Zhang, Wai Lam

There are more than 7,000 languages around the world, and current Large Language Models (LLMs) only support hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which could be expensive. Instead, it is desirable to have a flexible trade-off between token consumption and translation performance. This paper proposes a novel task called Automatic Dictionary Selection (ADS). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method, which we call Select Low-frequency Words! (SLoW), which selects those dictionaries that have a lower frequency. Our methods have unique advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines and clearly saves token usage, with many languages even surpassing the translation performance of the full dictionary baseline. (Footnote: A shocking fact is that there is no need to use the actual training data (often unobtainable) for frequency estimation, and an estimation frequency obtained using public resources is still apparently effective in improving translation with ChatGPT, Llama, and DeepSeek.) (Footnote: Code and data available upon publication.)
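
The selection idea in this abstract can be illustrated with a minimal sketch. It assumes word frequencies are estimated from some public text and that dictionary entries are looked up per source word. The function name `select_low_frequency_entries`, the budget `k`, and the toy corpus and dictionary are hypothetical and not taken from the paper.

```python
from collections import Counter


def select_low_frequency_entries(sentence, dictionary, freq, k):
    """Return up to k dictionary entries for the rarest source words in `sentence`."""
    words = [w.lower().strip(".,!?") for w in sentence.split()]
    # Keep only words the dictionary can translate, de-duplicated in sentence order.
    candidates = list(dict.fromkeys(w for w in words if w in dictionary))
    # Rarest first; words never seen in the frequency table count as 0.
    candidates.sort(key=lambda w: freq.get(w, 0))
    return {w: dictionary[w] for w in candidates[:k]}


# Toy frequency estimate from public text (an assumption, not the paper's resource).
public_corpus = "the cat sat on the mat the cat saw the heron".split()
freq = Counter(public_corpus)

# Toy bilingual dictionary (hypothetical entries).
dictionary = {"cat": "gato", "heron": "garza", "mat": "estera", "the": "el"}

hints = select_low_frequency_entries("The cat saw the heron.", dictionary, freq, k=2)
print(hints)
```

Lowering `k` spends fewer prompt tokens on dictionary hints, which is the token-versus-quality trade-off the abstract describes.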

Article 548

Title: REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?

REPRO-Bench: Können Agentische KI-Systeme die Reproduzierbarkeit der sozialwissenschaftlichen Forschung bewerten?

REPRO-BENCH: AI系统能否评估社会科学研究的可减少性? 2507.18901v1

Authors (6): Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, Daniel Kang

Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. The agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench features end-to-end evaluation tasks on the reproducibility of social science papers with complexity comparable to real-world assessments. We evaluate three representative AI agents on REPRO-Bench, with the best-performing agent achieving an accuracy of only 21.4%. Building on our empirical analysis, we develop REPRO-Agent, which improves the highest accuracy achieved by existing agents by 71%. We conclude that more advanced AI agents should be developed to automate real-world reproducibility assessment. REPRO-Bench is publicly available at https://github.com/uiuc-kang-lab/REPRO-Bench.

Article 549

Title@2025-07-25 (5): Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs

Title: Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs

Kann LLMs Citation Intent voraussagen? Eine experimentelle Analyse des In-Context-Lernens und Feinabstimmungens auf offenen LLMs

LLMs 预测引文意图:对开放式LMs的内文学习和微调的实验分析 2502.14561v3

Authors (4): Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos

This work investigates the ability of open Large Language Models (LLMs) to predict citation intent through in-context learning and fine-tuning. Unlike traditional approaches relying on domain-specific pre-trained models like SciBERT, we demonstrate that general-purpose LLMs can be adapted to this task with minimal task-specific data. We evaluate twelve model variations across five prominent open LLM families using zero-, one-, few-, and many-shot prompting. Our experimental study identifies the top-performing model and prompting parameters through extensive in-context learning experiments. We then demonstrate the significant impact of task-specific adaptation by fine-tuning this model, achieving a relative F1-score improvement of 8% on the SciCite dataset and 4.3% on the ACL-ARC dataset compared to the instruction-tuned baseline. These findings provide valuable insights for model selection and prompt engineering. Additionally, we make our end-to-end evaluation framework and models openly available for future use.

Article 550

Title@2025-07-25 (5): A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans

Title: A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans

Eine umfassende Bewertung der semantischen Beziehungskenntnisse von vorgebildeten Sprachmodellen und Menschen

全面评价未受过训练语言模式和人文的语义关系知识 2412.01131v4

Authors (4): Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga

Recently, much work has concerned itself with the enigma of what exactly pretrained language models (PLMs) learn about different aspects of language, and how they learn it. One stream of this type of research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations were left unexplored. Generally, only one relation has been considered, namely hypernymy. Furthermore, previous work did not measure humans’ performance on the same task as that performed by the PLMs. This means that at this point in time, there is only an incomplete view of the extent of these models’ semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use five metrics (two newly introduced here) for recently untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, prototypicality, and distinguishability. Using these, we can fairly compare humans and models on the same task. Our extensive experiments involve six PLMs, four masked and two causal language models. The results reveal a significant knowledge gap between humans and models for all semantic relations. In general, causal language models, despite their wide use, do not always perform significantly better than masked language models. Antonymy is the outlier relation where all models perform reasonably well.

Article 551

Title@2025-07-25 (5): NUTMEG: Separating Signal From Noise in Annotator Disagreement

Title: NUTMEG: Separating Signal From Noise in Annotator Disagreement

NUTMEG: Trennen von Signalen von Geräuschen in Annotator-Uneinigkeit

NUTMEG: 在通知器中从噪音中分离信号 2507.18890v1

Authors (3): Jonathan Ivey, Susan Gauch, David Jurgens

NLP models often rely on human-labeled data for training and evaluation. Many approaches crowdsource this data from a large number of annotators with varying skills, backgrounds, and motivations, resulting in conflicting annotations. These conflicts have traditionally been resolved by aggregation methods that assume disagreements are errors. Recent work has argued that for many tasks annotators may have genuine disagreements and that variation should be treated as signal rather than noise. However, few models separate signal and noise in annotator disagreement. In this work, we introduce NUTMEG, a new Bayesian model that incorporates information about annotator backgrounds to remove noisy annotations from human-labeled training data while preserving systematic disagreements. Using synthetic data, we show that NUTMEG is more effective at recovering ground-truth from annotations with systematic disagreement than traditional aggregation methods. We provide further analysis characterizing how differences in subpopulation sizes, rates of disagreement, and rates of spam affect the performance of our model. Finally, we demonstrate that downstream models trained on NUTMEG-aggregated data significantly outperform models trained on data from traditional aggregation methods. Our results highlight the importance of accounting for both annotator competence and systematic disagreements when training on human-labeled data.

Article 552

Title@2025-07-25 (5): MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service

Title: MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service

MindFlow+: Ein selbstständiger Agent für den E-Commerce-Kundendienst

Mind Flow+:电子商务客户服务自我发展代理 2507.18884v1

Authors (4): Ming Gong, Xucheng Huang, Ziheng Xu, Vijayan K. Asari

High-quality dialogue is crucial for e-commerce customer service, yet traditional intent-based systems struggle with dynamic, multi-turn interactions. We present MindFlow+, a self-evolving dialogue agent that learns domain-specific behavior by combining large language models (LLMs) with imitation learning and offline reinforcement learning (RL). MindFlow+ introduces two data-centric mechanisms to guide learning: tool-augmented demonstration construction, which exposes the model to knowledge-enhanced and agentic (ReAct-style) interactions for effective tool use; and reward-conditioned data modeling, which aligns responses with task-specific goals using reward signals. To evaluate the model’s role in response generation, we introduce the AI Contribution Ratio, a novel metric quantifying AI involvement in dialogue. Experiments on real-world e-commerce conversations show that MindFlow+ outperforms strong baselines in contextual relevance, flexibility, and task accuracy. These results demonstrate the potential of combining LLMs, tool reasoning, and reward-guided learning to build domain-specialized, context-aware dialogue systems.

Article 553

Title@2025-07-25 (5): An Investigation of Prompt Variations for Zero-shot LLM-based Rankers

Title: An Investigation of Prompt Variations for Zero-shot LLM-based Rankers

Eine Untersuchung von Prompt-Variationen für Null-Schuss LLM-basierte Ranker

调查零射中LLM中士的迅速变化情况 2406.14117v4

Authors (4): Shuoqi Sun, Shengyao Zhuang, Shuai Wang, Guido Zuccon

We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs). Several zero-shot ranking methods based on LLMs have recently been proposed. Among many aspects, methods differ across (1) the ranking algorithm they implement, e.g., pointwise vs. listwise, (2) the backbone LLMs used, e.g., GPT3.5 vs. FLAN-T5, (3) the components and wording used in prompts, e.g., the use or not of role-definition (role-playing) and the actual words used to express this. It is currently unclear whether performance differences are due to the underlying ranking algorithm, or because of spurious factors such as better choice of words used in prompts. This confusion risks undermining future research. Through our large-scale experimentation and analysis, we find that ranking algorithms do contribute to differences between methods for zero-shot LLM ranking. However, so do the LLM backbones, and even more importantly, the choice of prompt components and wordings affects the ranking. In fact, in our experiments, we find that, at times, these latter elements have more impact on the ranker’s effectiveness than the actual ranking algorithms, and that differences among ranking methods become more blurred when prompt variations are considered.

Article 554

Title@2025-07-25 (5): Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

Title: Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

Phoneme-Level Visuelle Spracherkennung über Point-Visual Fusion und Sprachmodellsanierung

通过点-视点融合和语言模式重建确认电话级视觉讲话 2507.18863v1

Authors (3): Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh

Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. This task is notably challenging due to the absence of auditory cues and the visual ambiguity of phonemes that exhibit similar visemes, i.e., distinct sounds that appear identical in lip motion. Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. We propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by an LLM for word reconstruction to address these challenges. Stage 1 consists of V-ASR, which outputs the predicted phonemes, thereby reducing training complexity. Meanwhile, the facial landmark features address speaker-specific facial characteristics. Stage 2 comprises an encoder-decoder LLM, NLLB, that reconstructs the output phonemes back to words. Besides using a large visual dataset for deep learning fine-tuning, our PV-ASR method demonstrates superior performance by achieving 17.4% WER on the LRS2 and 21.0% WER on the LRS3 dataset.

Article 555

Title@2025-07-25 (5): PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning

Title: PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning

PrismRAG: Steigerung der RAG-Faktizität mit Distraktorresilienz und geschichteter Vernunft

PrismRAG:提高RAG事实质量,使其具有抗力和策略性合理性 2507.18857v1

Authors (13): Mohammad Kachuee, Teja Gollapudi, Minseok Kim, Yin Huang, Kai Sun, Xiao Yang, Jiaqi Wang, Nirav Shah, Yue Liu, Aaron Colak, Anuj Kumar, Wen-tau Yih, Xin Luna Dong

Retrieval-augmented generation (RAG) often falls short when retrieved context includes confusing semi-relevant passages, or when answering questions requires deep contextual understanding and reasoning. We propose an efficient fine-tuning framework, called PrismRAG, that (i) trains the model with distractor-aware QA pairs mixing gold evidence with subtle distractor passages, and (ii) instills reasoning-centric habits that make the LLM plan, rationalize, and synthesize without relying on extensive human-engineered instructions. Evaluated across 12 open-book RAG QA benchmarks spanning diverse application domains and scenarios, PrismRAG improves average factuality by 5.4%, outperforming state-of-the-art solutions.

Article 556

Title@2025-07-24 (4): The Curious Case of Class Accuracy Imbalance in LLMs: Post-hoc Debiasing via Nonlinear Integer Programming

Title: The Curious Case of Class Accuracy Imbalance in LLMs: Post-hoc Debiasing via Nonlinear Integer Programming

Der Kuriose Fall der Klasse Genauigkeit Ungleichgewicht in LLMs: Post-hoc-Debiasing über nichtlineare Integer-Programmierung

LLMLM中分类准确性不平衡的怪案:通过非线性整数编程进行热后脱偏性 2405.07623v7

Authors (2): Ruixi Lin, Yang You

Large language models (LLMs) are good knowledge bases but struggle to perform equally well for all classes in text classification. This paper investigates the case of class accuracy imbalance in LLMs, where deeply entangled pretraining biases and prompt-specific cues contribute to the imbalance. To overcome the difficulty in bias identification and inaccessibility of retraining, we post-hoc balance class accuracy using only output probabilities. This is enabled by reformulating debiasing as a combinatorial optimization problem. In detail, we first motivate a post-hoc bias metric, the Contextual Oddity Bias (COBias), to quantify the over-/under-prediction (a tendency to over-predict some classes while under-predicting others) in LLMs. We then propose the Debiasing as Nonlinear Integer Programming (DNIP) method to reweight LLM output class probabilities towards minimizing COBias and maximizing overall accuracy, without being constrained by bias sources or updating LLM parameters. Since the DNIP model contains non-differentiable elements, we use simulated annealing to efficiently solve it. Evaluations on five LLMs across NLP classification benchmarks show that DNIP simultaneously achieves significant COBias reduction (61% relative reduction) and accuracy improvement (18% relative increase) under different LLM prompting setups.
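
The reweighting-by-annealing idea can be sketched roughly as follows. This is not the paper's formulation. The bias term here is a simplified stand-in for COBias (the mean absolute gap in per-class accuracy over class pairs), the error term is macro-averaged, and the integer weight scale, equal weighting of the two terms, and linear cooling schedule are all assumptions chosen for illustration.

```python
import math
import random


def class_accuracies(probs, labels, weights, n_classes):
    """Per-class accuracy after multiplying each class probability by its weight."""
    correct = [0] * n_classes
    total = [0] * n_classes
    for p, y in zip(probs, labels):
        scores = [w * pi for w, pi in zip(weights, p)]
        pred = max(range(n_classes), key=scores.__getitem__)
        total[y] += 1
        correct[y] += int(pred == y)
    return [c / t if t else 0.0 for c, t in zip(correct, total)]


def objective(probs, labels, weights, n_classes):
    """Macro error plus a simplified pairwise accuracy-gap term (stand-in for COBias)."""
    acc = class_accuracies(probs, labels, weights, n_classes)
    pairs = [(i, j) for i in range(n_classes) for j in range(i + 1, n_classes)]
    bias = sum(abs(acc[i] - acc[j]) for i, j in pairs) / max(len(pairs), 1)
    return (1.0 - sum(acc) / n_classes) + bias


def anneal(probs, labels, n_classes, scale=10, steps=2000, t0=1.0, seed=0):
    """Simulated annealing over integer class weights in 1..scale."""
    rng = random.Random(seed)
    weights = [scale] * n_classes
    cur = objective(probs, labels, weights, n_classes)
    best, best_val = list(weights), cur
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling (assumption)
        cand = list(weights)
        cand[rng.randrange(n_classes)] = rng.randint(1, scale)
        val = objective(probs, labels, cand, n_classes)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if val <= cur or rng.random() < math.exp((cur - val) / temp):
            weights, cur = cand, val
            if cur < best_val:
                best, best_val = list(weights), cur
    return best, best_val


# Toy outputs from a classifier that over-predicts class 0.
probs = [[0.7, 0.3]] * 4 + [[0.55, 0.45]] * 4
labels = [0] * 4 + [1] * 4
baseline = objective(probs, labels, [10, 10], 2)
best_weights, best_val = anneal(probs, labels, 2)
print(baseline, best_weights, best_val)
```

Only the output probabilities are touched, so a search of this kind needs neither gradients nor access to model parameters, which matches the post-hoc setting described above.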

Article 557

Title@2025-07-24 (4): R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

Title: R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

R-Stitch: Dynamische Trajektorien-Stitching für effiziente Vernunft

R-Stitch: 高效理性的动态轨迹切换 2507.17307v2

Authors (6): Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, Bohan Zhuang

Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of large language models by encouraging step-by-step intermediate reasoning during inference. While effective, CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. Existing acceleration strategies either reduce sequence length through early stopping or compressive reward designs, or improve decoding speed via speculative decoding with smaller models. However, speculative decoding suffers from limited speedup when the agreement between small and large models is low, and fails to exploit the potential advantages of small models in producing concise intermediate reasoning. In this paper, we present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference by switching between a small language model (SLM) and a large language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to generate tokens by default and delegates to the LLM only when the SLM’s confidence falls below a threshold. This design avoids full-sequence rollback and selectively invokes the LLM on uncertain steps, preserving both efficiency and answer quality. R-Stitch is model-agnostic, training-free, and compatible with standard decoding pipelines. Experiments on math reasoning benchmarks demonstrate that R-Stitch achieves up to 85% reduction in inference latency with negligible accuracy drop, highlighting its practical effectiveness in accelerating CoT reasoning.
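
The confidence-gated delegation described above can be sketched with toy stand-ins for the two models. This is a simplified illustration, not the authors' implementation: each model is assumed to be a callable returning a `(token, confidence)` pair, the SLM is consulted at every step, and `stitch_decode`, `toy_slm`, and `toy_llm` are hypothetical names.

```python
from typing import Callable, List, Tuple

# A "model" here is any callable mapping the tokens so far to (next_token, confidence).
Model = Callable[[List[str]], Tuple[str, float]]


def stitch_decode(slm: Model, llm: Model, prompt: List[str], threshold: float,
                  max_new_tokens: int, eos: str = "<eos>") -> Tuple[List[str], int]:
    """Decode with the small model, delegating single tokens to the large model."""
    tokens = list(prompt)
    llm_calls = 0
    for _ in range(max_new_tokens):
        token, confidence = slm(tokens)
        if confidence < threshold:
            # Low SLM confidence: replace only this token, with no rollback of earlier ones.
            token, _ = llm(tokens)
            llm_calls += 1
        tokens.append(token)
        if token == eos:
            break
    return tokens[len(prompt):], llm_calls


def toy_slm(tokens: List[str]) -> Tuple[str, float]:
    # Scripted small model: confident everywhere except the answer token.
    script = {0: ("2", 0.95), 1: ("+", 0.90), 2: ("2", 0.95),
              3: ("=", 0.90), 4: ("5", 0.40), 5: ("<eos>", 0.99)}
    return script[len(tokens)]


def toy_llm(tokens: List[str]) -> Tuple[str, float]:
    # Scripted large model: always supplies the correct answer token.
    return ("4", 0.99)


output, llm_calls = stitch_decode(toy_slm, toy_llm, [], threshold=0.6, max_new_tokens=10)
print(output, llm_calls)
```

In this toy run the large model is invoked once out of six steps, which is the source of the latency savings when most tokens are easy for the small model.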

Article 558

Title@2025-07-24 (4): Toward Super Agent System with Hybrid AI Routers

Title: Toward Super Agent System with Hybrid AI Routers

Auf dem Weg zum Super Agent System mit Hybrid-KI Routern

向超级代理系统过渡 2504.10519v2

Authors (8): Yuhang Yao, Haixin Wang, Yibo Chen, Jiawen Wang, Min Chang Jordan Ren, Bosheng Ding, Salman Avestimehr, Chaoyang He

AI Agents powered by Large Language Models are transforming the world through an enormous range of applications. A super agent has the potential to fulfill diverse user needs, such as summarization, coding, and research, by accurately understanding user intent and leveraging the appropriate tools to solve tasks. However, to make such an agent viable for real-world deployment and accessible at scale, significant optimizations are required to ensure high efficiency and low cost. This position paper presents a design of the Super Agent System powered by hybrid AI routers. Upon receiving a user prompt, the system first detects the intent of the user, then routes the request to specialized task agents with the necessary tools or automatically generates agentic workflows. In practice, most applications directly serve as AI assistants on edge devices such as phones and robots. As different language models vary in capability and cloud-based models often entail high computational costs, latency, and privacy concerns, we then explore the hybrid mode where the router dynamically selects between local and cloud models based on task complexity. Finally, we introduce the blueprint of an on-device super agent enhanced with the cloud. With advances in multi-modality models and edge hardware, we envision that most computations can be handled locally, with cloud collaboration only as needed. Such an architecture paves the way for super agents to be seamlessly integrated into everyday life in the near future.
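
The two routing decisions mentioned above (intent detection, then local versus cloud selection by task complexity) can be sketched as below. Everything in the sketch is an assumption made for illustration: the keyword-based intent detector, the word-count complexity proxy, the threshold, and the names `detect_intent`, `estimate_complexity`, and `route` do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Route:
    intent: str
    target: str  # "local" or "cloud"


# Hypothetical keyword lists; a real router would likely use a learned classifier.
INTENT_KEYWORDS = {
    "summarization": ("summarize", "tl;dr"),
    "coding": ("code", "function", "bug"),
    "research": ("find papers", "literature", "survey"),
}


def detect_intent(prompt: str) -> str:
    """Return the first intent whose keywords appear in the prompt, else 'chat'."""
    text = prompt.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in text for k in keywords):
            return intent
    return "chat"


def estimate_complexity(prompt: str) -> float:
    """Crude complexity proxy in [0, 1] based on prompt length in words."""
    return min(len(prompt.split()) / 50.0, 1.0)


def route(prompt: str, threshold: float = 0.5) -> Route:
    """Send simple requests to the on-device model and complex ones to the cloud."""
    intent = detect_intent(prompt)
    target = "cloud" if estimate_complexity(prompt) > threshold else "local"
    return Route(intent=intent, target=target)


short_route = route("Summarize this note")
long_route = route("fix the bug " * 20)
print(short_route, long_route)
```

Raising `threshold` keeps more traffic on the device, trading answer quality on hard requests for lower cost, latency, and data exposure.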

Article 559

Title@2025-07-24 (4): CueBuddy: helping non-native English speakers navigate English-centric STEM education

Title: CueBuddy: helping non-native English speakers navigate English-centric STEM education

CueBuddy: Hilfe für nicht-native englische Referenten navigieren Englisch-centric STEM Bildung

CueBuddy:帮助非母语英语者掌握以英语为中心的STEM教育 2507.18827v1

Authors (1): Pranav Gupta

Students across the world in STEM classes, especially in the Global South, fall behind their peers who are more fluent in English, despite being on par with them in terms of scientific prerequisites. While many of them are able to follow everyday English with ease, key terms in English stay challenging. In most cases, such students have had most of their course prerequisites in a lower resource language. Live speech translation to lower resource languages is a promising area of research; however, models for speech translation can be too expensive on a large scale and often struggle with technical content. In this paper, we describe CueBuddy, which aims to remediate these issues by providing real-time “lexical cues” through technical keyword spotting along with real-time multilingual glossary lookup to help students stay up to speed with complex English jargon without disrupting their concentration on the lecture. We also describe the limitations and future extensions of our approach.

Article 560

Title@2025-07-24 (4): Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models

Title: Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models

Promptomatix: Ein automatisches Optimierungs-Framework für große Sprachmodelle

即时表达式:大语言模型自动快速优化框架 2507.14241v3

Authors (9): Rithesh Murthy, Ming Zhu, Liangwei Yang, Jielin Qiu, Juntao Tan, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

Large Language Models (LLMs) perform best with well-crafted prompts, yet prompt engineering remains manual, inconsistent, and inaccessible to non-experts. We introduce Promptomatix, an automatic prompt optimization framework that transforms natural language task descriptions into high-quality prompts without requiring manual tuning or domain expertise. Promptomatix supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, with modular design enabling future extension to more advanced frameworks. The system analyzes user intent, generates synthetic training data, selects prompting strategies, and refines prompts using cost-aware objectives. Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries, while reducing prompt length and computational overhead, making prompt optimization scalable and efficient.

Article 561

Title@2025-07-24 (4): Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Title: Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Feature Flow analysieren, um Interpretation und Steuerung in Sprachmodellen zu verbessern

分析地貌流动,以加强语言模型的口译和指导 2502.03032v3

Authors (4): Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov

We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
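
As a rough illustration of cross-layer matching by cosine similarity, the sketch below links each feature vector in one layer to its most similar feature in the previous layer. It assumes each feature is represented by a plain vector (for instance a sparse-autoencoder decoder direction, which is an assumption here, not something stated in the abstract). The `threshold` value and the function names are hypothetical.

```python
from math import sqrt
from typing import List, Optional, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_features(
    prev_layer: List[Vector],
    next_layer: List[Vector],
    threshold: float = 0.7,  # illustrative cut-off, not from the paper
) -> List[Tuple[int, Optional[int], float]]:
    """For each next-layer feature, return (index, best previous-layer
    index or None, best similarity). None marks a feature with no
    sufficiently similar predecessor, i.e. one that "first appears"."""
    links = []
    for j, feature in enumerate(next_layer):
        sims = [cosine(p, feature) for p in prev_layer]
        best = max(range(len(sims)), key=sims.__getitem__)
        predecessor = best if sims[best] >= threshold else None
        links.append((j, predecessor, sims[best]))
    return links
```

No input data is needed for this comparison, only the feature vectors themselves, which is one plausible reading of "data-free". Chaining the links over consecutive layer pairs would give a flow graph of the kind the abstract mentions.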

Article 562

Title@2025-07-24 (4): Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

Title: Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

Palme: Ein kulturell inklusiver und sprachlich vielfältiger Datensatz für arabische LLMs

棕榈:阿拉伯文LLMLM具有文化包容性和语言多样性的数据集 2503.00151v2

Authors (44): Fakhraddin Alwajih, Abdellah El Mekki, Samar Mohamed Magdy, Abdelrahim A. Elmadany, Omer Nacar, El Moatez Billah Nagoudi, Reem Abdel-Salam, Hanin Atwany, Youssef Nafea, Abdulfattah Mohammed Yahya, Rahaf Alhamouri, Hamzah A. Alsayadi, Hiba Zayed, Sara Shatnawi, Serry Sibaee, Yasir Ech-Chammakhy, Walid Al-Dhabyani, Marwa Mohamed Ali, Imen Jarraya, Ahmed Oumar El-Shangiti, Aisha Alraeesi, Mohammed Anwar Al-Ghrawi, Abdulrahman S. Al-Batati, Elgizouli Mohamed, Noha Taha Elgindi, Muhammed Saeed, Houdaifa Atou, Issam Ait Yahia, Abdelhak Bouayad, Mohammed Machrouh, Amal Makouar, Dania Alkawi, Mukhtar Mohamed, Safaa Taher Abdelfadil, Amine Ziad Ounnoughene, Rouabhia Anfel, Rwaa Assi, Ahmed Sorkatti, Mohamedou Cheikh Tourad, Anis Koubaa, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed

As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.

Article 563

Title@2025-07-24 (4): Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

Title: Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

Plan für Geschwindigkeit: Erweitertes Scheduling für maskierte Diffusions-Sprachmodelle

速度计划: 遮蔽传播语言模型的饱和日程安排 2506.19037v3

Authors (3): Omer Luxembourg, Haim Permuter, Eliya Nachmani

Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP) and general-knowledge benchmarks (BBH, MMLU-Pro), DUS outperforms confidence-based planners, without modifying the underlying denoiser, and reveals the true speed-quality frontier of MDLMs.
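
The idea of partitioning positions into non-adjacent dilated groups can be illustrated with a fixed-stride partition. This is a simplified sketch under the assumption of a single constant dilation. The schedule actually used by DUS may differ, and `dilated_groups` is a hypothetical helper name.

```python
from typing import List


def dilated_groups(seq_len: int, dilation: int) -> List[List[int]]:
    """Split positions 0..seq_len-1 into `dilation` groups so that any
    two positions in the same group are at least `dilation` apart."""
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    return [
        list(range(offset, seq_len, dilation))
        for offset in range(min(dilation, seq_len))
    ]


# Each group would be unmasked in one denoiser call, so 8 positions
# with dilation 4 need 4 calls rather than 8 one-at-a-time calls.
groups = dilated_groups(8, 4)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

A larger dilation gives more groups and therefore more network calls, but keeps the tokens unmasked together further apart. That is the speed-quality trade-off the abstract refers to.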

Article 564

Title@2025-07-24 (4): Evaluating Code-Mixing in LLMs Across 18 Languages

Title: Evaluating Code-Mixing in LLMs Across 18 Languages

Bewertung von Code-Mixing in LLMs in 18 Sprachen

评估18种语言的LLMs混合编码 2507.18791v1

Authors (2): Yilun Yang, Yekun Chai

Code-mixing, the practice of switching between languages within a conversation, presents unique challenges for traditional natural language processing. Existing benchmarks, such as LinCE and GLUECoS, are limited by narrow language pairings and tasks, failing to adequately evaluate the code-mixing capabilities of large language models (LLMs). Despite the significance of code-mixing for multilingual users, research on LLMs in this context remains limited. Additionally, current methods for generating code-mixed data are underdeveloped. In this paper, we conduct a comprehensive evaluation of LLMs’ performance on code-mixed data across 18 languages from seven language families. We also propose a novel approach for generating synthetic code-mixed texts by combining word substitution with GPT-4 prompting. Our analysis reveals consistent underperformance of LLMs on code-mixed datasets involving multiple language families. We suggest that improvements in training data size, model scale, and few-shot learning could enhance their performance.

Article 565

Title@2025-07-24 (4): Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

Title: Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

Bewertung großer Sprachmodelle (LLMs) in Financial NLP: Eine vergleichende Studie zur Analyse von Finanzberichten

评价金融中大语言模型:财务报告分析比较研究 2507.22936v1

Authors (1): Md Talha Mohsin

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks. However, systematic comparisons among widely used LLMs remain underexplored. Given the rapid advancement and growing influence of LLMs in financial analysis, this study conducts a thorough comparative evaluation of five leading LLMs, GPT, Claude, Perplexity, Gemini and DeepSeek, using 10-K filings from the ‘Magnificent Seven’ technology companies. We create a set of domain-specific prompts and then use three methodologies to evaluate model performance: human annotation, automated lexical-semantic metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). The results show that GPT gives the most coherent, semantically aligned, and contextually relevant answers; followed by Claude and Perplexity. Gemini and DeepSeek, on the other hand, have more variability and less agreement. Also, the similarity and stability of outputs change from company to company and over time, showing that they are sensitive to how prompts are written and what source material is used.

Article 566

Title@2025-07-24 (4): A Fisher’s exact test justification of the TF-IDF term-weighting scheme

Title: A Fisher’s exact test justification of the TF-IDF term-weighting scheme

Genaue Begründung des TF-IDF-Term-Wichtungssystems durch einen Fisher

A Fisher公司对TF-IDF术语加权办法的精确测试理由 2507.15742v2

Authors (3): Paul Sheridan, Zeyad Ahmed, Aitazaz A. Farooque

Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term’s occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the $p$-value from a one-tailed version of Fisher’s exact test of statistical significance. As a corollary, we establish a connection between TF-IDF and the said negative log-transformed $p$-value under certain idealized assumptions. We further demonstrate, as a limiting case, that this same quantity converges to TF-IDF in the limit of an infinitely large document collection. The Fisher’s exact test justification of TF-IDF equips the working statistician with a ready explanation of the term-weighting scheme’s long-established effectiveness.
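
To make the two quantities in this abstract concrete, the sketch below computes a one-tailed Fisher's exact test p-value for a term's concentration in one document, alongside a TF-ICF score. The 2x2 setup (term vs. other tokens, this document vs. the rest of the collection) and the TF-ICF form `k * log(N / K)` are assumptions made for illustration. The paper's exact definitions and regularity conditions are not reproduced here.

```python
from math import comb, log


def fisher_one_tailed_p(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    k: occurrences of the term in the document
    n: total tokens in the document
    K: occurrences of the term in the whole collection
    N: total tokens in the whole collection
    """
    upper = min(n, K)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / comb(N, n)


def tf_icf(k: int, K: int, N: int) -> float:
    """Term frequency times inverse collection frequency (assumed form)."""
    return k * log(N / K)


# Toy collection: 10 tokens, the term occurs 3 times overall,
# 2 of them inside a 4-token document.
p_value = fisher_one_tailed_p(k=2, n=4, K=3, N=10)
neg_log_p = -log(p_value)
score = tf_icf(k=2, K=3, N=10)
```

On a collection this small the two numbers should not be expected to agree closely. The abstract's claim concerns a relationship under mild regularity conditions, so the example is only meant to show what is being compared.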

Article 567

Title@2025-07-24 (4): ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting

Title: ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting

ylmmcl bei Mehrsprachiger Textentgiftung 2025: Lexikon-geführte Entgiftung und Klassifikator-gestrichenes Umschreiben

2025年多语言文本解毒:Lexicon-Guid解毒和分类法改写 2507.18769v1

Authors (4): Nicole Lai-Lopez, Lusha Wang, Su Yuan, Liza Zhang

In this work, we introduce our solution for the Multilingual Text Detoxification Task in the PAN-2025 competition for the ylmmcl team: a robust multilingual text detoxification pipeline that integrates lexicon-guided tagging, a fine-tuned sequence-to-sequence model (s-nlp/mt0-xl-detox-orpo) and an iterative classifier-based gatekeeping mechanism. Our approach departs from prior unsupervised or monolingual pipelines by leveraging explicit toxic word annotation via the multilingual_toxic_lexicon to guide detoxification with greater precision and cross-lingual generalization. Our final model achieves the highest STA (0.922) from our previous attempts, and an average official J score of 0.612 for toxic inputs in both the development and test sets. It also achieved xCOMET scores of 0.793 (dev) and 0.787 (test). This performance outperforms baseline and backtranslation methods across multiple languages, and shows strong generalization in high-resource settings (English, Russian, French). Despite some trade-offs in SIM, the model demonstrates consistent improvements in detoxification strength. In the competition, our team achieved ninth place with a score of 0.612.

Article 568

Title@2025-07-24 (4): Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience

Title: Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience

Auf dem Weg zu strukturiertem Wissen Reasoning: Kontrastive retrieval-erweiterte Generation auf Erfahrung

实现结构化知识理由:反向取回-积累经验的一代人 2506.00842v2

Authors (10): Jiawei Gu, Ziting Xian, Yuanzhen Xie, Ye Liu, Enjie Liu, Ruichao Zhong, Mochi Gao, Yunzhi Tan, Bo Hu, Zang Li

Large language models (LLMs) achieve strong performance on plain text tasks but underperform on structured data like tables and databases. Potential challenges arise from their underexposure during pre-training and rigid text-to-structure transfer mechanisms. Unlike humans who seamlessly apply learned patterns across data modalities, LLMs struggle to infer implicit relationships embedded in tabular formats, especially in the absence of explicit structural guidance. To bridge this cognitive gap, we introduce Contrastive Retrieval-Augmented Generation on Experience (CoRE), a framework that builds experience memory representations and enhances generalization through contrastive In-Context Learning (ICL) to simulate human-like knowledge transfer. Experiments on Text-to-SQL and TableQA show CoRE significantly improves performance, achieving average gains of 3.44% and 4.24%, with up to 17.2% on challenging tasks. Our Monte Carlo Tree Search (MCTS)-generated Experience Memory expands training data 8-9x, enhancing diversity and domain coverage. This training-free and continual method propels LLMs toward structured knowledge expertise.

Article 569

Title@2025-07-24 (4): The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages

Title: The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages

Die Rolle der Orthografiekonsistenz in mehrsprachigen Einbettungsmodellen für die Textklassifizierung in Arabisch-Script-Sprachen

阿拉伯文和克里普特语文文本分类多语种嵌入模型中正统一致性的作用 2507.18762v1

Authors (7): Abdulhady Abas Abdullah, Amir H. Gandomi, Tarik A Rashid, Seyedali Mirjalili, Laith Abualigah, Milena Živković, Hadi Veisi

In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu. We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. By focusing pre-training on language-specific script features and statistics, our models capture patterns overlooked by general-purpose models. When fine-tuned on classification tasks, AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points. An ablation study confirms that script-focused pre-training is central to these gains. Error analysis using confusion matrices shows how shared script traits and domain-specific content affect performance. Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.

Article 570

Title@2025-07-24 (4): Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition

Title: Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition

Lärm Kontrastive Schätzung-basiertes Matching Framework für die Erkennung von Low-Resource-Sicherheitsangriffen

低资源安保攻击模式识别比对框架 2401.10337v4

Authors (3): Tu Nguyen, Nedim Šrndić, Alexander Neth

Tactics, Techniques and Procedures (TTPs) represent sophisticated attack patterns in the cybersecurity domain, described encyclopedically in textual knowledge bases. Identifying TTPs in cybersecurity writing, often called TTP mapping, is an important and challenging task. Conventional learning approaches often target the problem in the classical multi-class or multilabel classification setting. This setting hinders the learning ability of the model due to a large number of classes (i.e., TTPs), the inevitable skewness of the label distribution and the complex hierarchical structure of the label space. We formulate the problem in a different learning paradigm, where the assignment of a text to a TTP label is decided by the direct semantic similarity between the two, thus reducing the complexity of competing solely over the large labeling space. To that end, we propose a neural matching architecture with an effective sampling-based learn-to-compare mechanism, facilitating the learning process of the matching model despite constrained resources.

Article 571

Title@2025-07-24 (4): Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Title: Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Spezifikation Selbst-Korrektion: Eindämmung von In-Context-Belohnung Hacken durch Test-Zeit-Verfeinerung

规格自我校正:通过试验-时间精炼进行减速的背负冲洗 2507.18742v1

Authors (1): Víctor Gallego

Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user’s true intent. We introduce Specification Self-Correction (SSC), a novel, test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process where the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at https://github.com/vicgalle/specification-self-correction .

Article 572

Title@2025-07-24 (4): AI Flow: Perspectives, Scenarios, and Approaches

Title: AI Flow: Perspectives, Scenarios, and Approaches

AI Flow: Perspektiven, Szenarien und Ansätze

AI 流动:观点、设想和方法 2506.12479v3

Authors (14): Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li

Pioneered by the foundational information theory by Claude Shannon and the visionary framework of machine intelligence by Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, device-edge-cloud framework serves as the foundation, which integrates end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.

Article 573

Title@2025-07-24 (4): An Efficient Sparse Fine-Tuning with Low Quantization Error via Neural Network Pruning

Title: An Efficient Sparse Fine-Tuning with Low Quantization Error via Neural Network Pruning

Effizientes Sparse-Fine-Tuning mit geringem Quantisierungsfehler über Neural Network Pruning

通过神经网络节制低量错误的高效粗简精细调整 2502.11439v2

Authors (2): Cen-Jhih Li, Aditya Bhaskara

Fine-tuning is an important step in adapting foundation models such as large language models to downstream tasks. To make this step more accessible to users with limited computational budgets, it is crucial to develop fine-tuning methods that are memory and computationally efficient. Sparse Fine-tuning (SpFT) and Low-rank adaptation (LoRA) are two frameworks that have emerged for addressing this problem and have been adopted widely in practice. In this work, we develop a new SpFT framework, based on ideas from neural network pruning. At a high level, we first identify “important” neurons/nodes using feature importance metrics from network pruning (specifically, we use the structural pruning method), and then perform fine-tuning by restricting to weights involving these neurons. Experiments on common language tasks show our method improves SpFT’s memory efficiency by 20-50% while matching the accuracy of state-of-the-art methods like LoRA’s variants.

Article 574

Title@2025-07-24 (4): Checklists Are Better Than Reward Models For Aligning Language Models

Title: Checklists Are Better Than Reward Models For Aligning Language Models

Checklisten sind besser als Belohnungsmodelle für die Ausrichtung von Sprachmodellen

核对列表比奖励模型更好调整语言模型 2507.18624v1

Authors (7): Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, Tongshuang Wu

Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this – typically using fixed criteria such as “helpfulness” and “harmfulness”. In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose “Reinforcement Learning from Checklist Feedback” (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item - using both AI judges and specialized verifier programs - then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks – RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models’ support of queries that express a multitude of needs.
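
The scoring step described here (score each checklist item, then combine the scores into a reward) might look roughly like the following. The weighted average, the equal blend of judge and verifier scores, and every name in the block are assumptions chosen for the sketch. The abstract does not specify how RLCF combines the scores.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ChecklistItem:
    requirement: str
    weight: float = 1.0  # hypothetical per-item importance
    # Optional programmatic check, e.g. a length or format test.
    verifier: Optional[Callable[[str], bool]] = None


def checklist_reward(
    response: str,
    items: List[ChecklistItem],
    judge: Callable[[str, str], float],  # returns a score in [0, 1]
) -> float:
    """Weighted mean of per-item scores, used as a scalar RL reward."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("checklist needs positive total weight")
    reward = 0.0
    for item in items:
        item_score = judge(response, item.requirement)
        if item.verifier is not None:
            # Assumed blend: average the judge score with the hard check.
            verified = 1.0 if item.verifier(response) else 0.0
            item_score = (item_score + verified) / 2.0
        reward += item.weight * item_score
    return reward / total_weight
```

In practice `judge` would wrap a call to a grading model. Here it is left as a plain callable so the combination logic can be inspected in isolation.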

Article 575

Title@2025-07-24 (4): TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards

Title: TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards

TRPrompt: Bootstrapping Query-Aware Prompt Optimierung von Textbelohnungen

TRPropt: 从文本奖励中促进解答询问软件快速优化 2507.18618v1

Authors (5): Andreea Nica, Ivan Zakazov, Nicolas Mario Baldwin, Saibo Geng, Robert West

Prompt optimization improves the reasoning abilities of large language models (LLMs) without requiring parameter updates to the target model. Following heuristic-based “Think step by step” approaches, the field has evolved in two main directions: while one group of methods uses textual feedback to elicit improved prompts from general-purpose LLMs in a training-free way, a concurrent line of research relies on numerical rewards to train a special prompt model, tailored for providing optimal prompts to the target model. In this paper, we introduce the Textual Reward Prompt framework (TRPrompt), which unifies these approaches by directly incorporating textual feedback into training of the prompt model. Our framework does not require prior dataset collection and is being iteratively improved with the feedback on the generated prompts. When coupled with the capacity of an LLM to internalize the notion of what a “good” prompt is, the high-resolution signal provided by the textual rewards allows us to train a prompt model yielding state-of-the-art query-specific prompts for the problems from the challenging math datasets GSMHard and MATH.

Article 576

Title: SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning

SynC: Synthetische Bildunterschrift Datensatzverfeinerung mit ein-zu-vielen Mapping für Zero-shot Bildunterschrift

合成图像说明: 合成图像说明数据集精化,用一到多个绘图进行零光图像说明的合成图像说明 2507.18616v1

Authors (6): Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim

Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in web-crawled data. However, these methods are ill-suited for the distinct challenges of synthetic data, where captions are typically well-formed, but images may be inaccurate representations. To address this gap, we introduce SynC, a novel framework specifically designed to refine synthetic image-caption datasets for ZIC. Instead of conventional filtering or regeneration, SynC focuses on reassigning captions to the most semantically aligned images already present within the synthetic image pool. Our approach employs a one-to-many mapping strategy by initially retrieving multiple relevant candidate images for each caption. We then apply a cycle-consistency-inspired alignment scorer that selects the best image by verifying its ability to retrieve the original caption via image-to-text retrieval. Extensive evaluations demonstrate that SynC consistently and significantly improves performance across various ZIC models on standard benchmarks (MS-COCO, Flickr30k, NoCaps), achieving state-of-the-art results in several scenarios. SynC offers an effective strategy for curating refined synthetic data to enhance ZIC.

Article 577

Title@2025-07-24 (4): BEARCUBS: A benchmark for computer-using web agents

Title: BEARCUBS: A benchmark for computer-using web agents

BEARCUBS: Benchmark für computergestützte Web-Agenten

BEARCUBS:计算机使用网络代理器的基准 2503.07919v3

Authors (6): Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer

Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a “small but mighty” benchmark of 111 information-seeking questions designed to evaluate a web agent’s ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing domain knowledge gaps and overlooked details as common failure points. We find that ChatGPT Agent significantly outperforms other computer-using agents with an overall accuracy of 65.8% (compared to e.g., Operator’s 23.4%), showcasing substantial progress in tasks involving real computer use, such as playing web games and navigating 3D environments. Nevertheless, closing the gap to human performance requires improvements in areas like fine control, complex data filtering, and execution speed. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.

Article 578

Title@2025-07-24 (4): Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Title: Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Vertrauenswürdige Wissensgewinnung für Operationen und Wartungsintelligenz

行动和维持情报可信赖的知识采掘 2507.22935v1

Authors (5): Kathleen Mealey, Jonathan A. Karr Jr., Priscila Saboia Moreira, Paul R. Brenner, Charles F. Vardeman II

Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.

Article 579

Title@2025-07-24 (4): Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

Title: Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

Sparse Logit Sampling: Beschleunigung der Wissensdestillation in LLMs

粗略的登录抽样:加速在LLMs中进行知识蒸馏 2503.16870v2

Authors (8): Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee

Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of the teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method, ‘Random Sampling Knowledge Distillation’, which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
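
The bias claim in this abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: the function names, the six-token teacher distribution, and the choice of sampling directly from the teacher (so that every importance weight is simply 1/n) are all assumptions made for illustration. It contrasts a renormalized Top-K cache, which systematically inflates the head of the distribution, with a sampled sparse target whose average converges to the teacher distribution.

```python
import random
from collections import Counter

def top_k_sparse(probs, k):
    """Keep the k largest teacher probabilities and renormalize (the naive cache)."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}

def sampled_sparse(probs, n_samples, rng):
    """Draw n_samples token ids from the teacher distribution and store counts / n.

    Assumption: the proposal is the teacher itself, so each draw carries weight 1/n
    and the expected stored value for token i is exactly probs[i].
    """
    draws = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    return {i: c / n_samples for i, c in Counter(draws).items()}

def mean_estimate(probs, n_samples, trials, seed=0):
    """Average many sampled sparse targets to show they converge to the teacher."""
    rng = random.Random(seed)
    total = [0.0] * len(probs)
    for _ in range(trials):
        for i, p in sampled_sparse(probs, n_samples, rng).items():
            total[i] += p
    return [t / trials for t in total]

teacher = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]   # hypothetical teacher distribution
topk = top_k_sparse(teacher, k=2)            # tail tokens get probability zero
avg = mean_estimate(teacher, n_samples=4, trials=20000)
```

In this toy setting each sampled target stores at most four entries, yet the tail tokens still receive their correct probability on average, whereas the Top-K cache never assigns them any mass.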

Article 580

Title@2025-07-24 (4): Deep Learning Approaches for Multimodal Intent Recognition: A Survey

Title: Deep Learning Approaches for Multimodal Intent Recognition: A Survey

Deep Learning Ansätze zur multimodalen Intent-Erkennung: Eine Umfrage

多种形式本能识别的深学习方法:调查 2507.22934v1

Authors (11): Jingwei Zhao, Yuhua Wen, Qifei Li, Minchi Hu, Yingying Zhou, Jingyao Xue, Junyang Wu, Yingming Gao, Zhengqi Wen, Jianhua Tao, Ya Li

Intent recognition aims to identify users’ underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. Recently, the introduction of Transformer-based models has led to notable breakthroughs in this domain. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges. It provides researchers with insights into the latest developments in multimodal intent recognition (MIR) and directions for future research.

Article 581

Title@2025-07-24 (4): What Makes You CLIC: Detection of Croatian Clickbait Headlines

Title: What Makes You CLIC: Detection of Croatian Clickbait Headlines

Was macht Sie CLIC: Erkennung von kroatischen Clickbait Schlagzeilen

是什么让你成为CLIC:发现克罗地亚点击头条头条 2507.14314v2

Authors (4): Marija Anđelić, Dominik Šipek, Laura Majer, Jan Šnajder

Online news outlets operate predominantly on an advertising-based revenue model, compelling journalists to create headlines that are often scandalous, intriguing, and provocative – commonly referred to as clickbait. Automatic detection of clickbait headlines is essential for preserving information quality and reader trust in digital media and requires both contextual understanding and world knowledge. For this task, particularly in less-resourced languages, it remains unclear whether fine-tuned methods or in-context learning (ICL) yield better results. In this paper, we compile CLIC, a novel dataset for clickbait detection of Croatian news headlines spanning a 20-year period and encompassing mainstream and fringe outlets. We fine-tune the BERTić model on this task and compare its performance to LLM-based ICL methods with prompts both in Croatian and English. Finally, we analyze the linguistic properties of clickbait. We find that nearly half of the analyzed headlines contain clickbait, and that fine-tuned models deliver better results than general LLMs.

Article 582

Title@2025-07-24 (4): AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs

Title: AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs

AQuilt: Verweben von Logik und Selbstinspektion in Low-Cost, High-Relevance-Datensynthese für Spezialisten LLMs

AQuilt:将逻辑和自我检查编织成低成本高相关性数据合成,供专家LLMs使用 2507.18584v1

Authors (7): Xiaopeng Ke, Hexuan Deng, Xuebo Liu, Jun Rao, Zhenxi Song, Jun Yu, Min Zhang

Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized domains. Existing approaches typically rely on data synthesis methods and yield promising results by using unlabeled data to capture domain-specific features. However, these methods either incur high computational costs or suffer from performance limitations, while also demonstrating insufficient generalization across different tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domain from corresponding unlabeled data, including Answer, Question, Unlabeled data, Inspection, Logic, and Task type. By incorporating logic and inspection, we encourage reasoning processes and self-inspection to enhance model performance. Moreover, customizable task instructions enable high-quality data generation for any task. As a result, we construct a dataset of 703k examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 while utilizing just 17% of the production cost. Further analysis demonstrates that our generated data exhibits higher relevance to downstream tasks. Source code, models, and scripts are available at https://github.com/Krueske/AQuilt.

Article 583

Title@2025-07-24 (4): DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data

Title: DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data

DR.EHR: Dense Retrieval für elektronische Gesundheitsdaten mit Wissensinjektion und synthetischen Daten

DR.EHR: 具有知识注射和合成数据的电子健康记录大量检索 2507.18583v1

Authors (4): Zhengyun Zhao, Huaiyuan Ying, Yue Zhong, Sheng Yu

Electronic Health Records (EHRs) are pivotal in clinical practices, yet their retrieval remains a challenge mainly due to semantic gap issues. Recent advancements in dense retrieval offer promising solutions but existing models, both general-domain and biomedical-domain, fall short due to insufficient medical knowledge or mismatched training corpora. This paper introduces DR.EHR, a series of dense retrieval models specifically tailored for EHR retrieval. We propose a two-stage training pipeline utilizing MIMIC-IV discharge summaries to address the need for extensive medical knowledge and large-scale training data. The first stage involves medical entity extraction and knowledge injection from a biomedical knowledge graph, while the second stage employs large language models to generate diverse training data. We train two variants of DR.EHR, with 110M and 7B parameters, respectively. Evaluated on the CliniQ benchmark, our models significantly outperform all existing dense retrievers, achieving state-of-the-art results. Detailed analyses confirm our models’ superiority across various match and query types, particularly in challenging semantic matches like implication and abbreviation. Ablation studies validate the effectiveness of each pipeline component, and supplementary experiments on EHR QA datasets demonstrate the models’ generalizability on natural language questions, including complex ones with multiple entities. This work significantly advances EHR retrieval, offering a robust solution for clinical applications.

Article 584

Title@2025-07-24 (4): System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition

Title: System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition

Systembericht für CCL25-Eval Task 10: SRAG-MAV für feinkörnige chinesische Hassspracherkennung

供CCL25-Eval任务10使用的系统报告:关于中华恶言识别的SRAG-MAV系统报告 2507.18580v1

Authors (4): Jiahao Wang, Ramen Liu, Longhui Zhang, Jing Li

This paper presents our system for CCL25-Eval Task 10, addressing Fine-Grained Chinese Hate Speech Recognition (FGCHSR). We propose a novel SRAG-MAV framework that synergistically integrates Task Reformulation (TR), Self-Retrieval-Augmented Generation (SRAG), and Multi-Round Accumulative Voting (MAV). Our method reformulates the quadruplet extraction task into triplet extraction, uses dynamic retrieval from the training set to create contextual prompts, and applies multi-round inference with voting to improve output stability and performance. Our system, based on the Qwen2.5-7B model, achieves a Hard Score of 26.66, a Soft Score of 48.35, and an Average Score of 37.505 on the STATE ToxiCN dataset, significantly outperforming baselines such as GPT-4o (Average Score 15.63) and fine-tuned Qwen2.5-7B (Average Score 35.365). The code is available at https://github.com/king-wang123/CCL25-SRAG-MAV.
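
The multi-round voting idea can be sketched roughly as follows. This is a hypothetical illustration rather than the released system: the stopping rule (first output to reach `threshold` votes), the function names, and the canned stand-in for a sampled model call are all assumptions. The last line also checks that the reported Average Score is the mean of the Hard and Soft Scores.

```python
from collections import Counter
from itertools import cycle

def accumulative_vote(generate, prompt, threshold=3, max_rounds=10):
    """Call generate(prompt) repeatedly, accumulating votes per distinct output.

    Stops as soon as one output has `threshold` votes, or after max_rounds,
    and returns the most-voted output together with the number of rounds used.
    """
    votes = Counter()
    for round_no in range(1, max_rounds + 1):
        votes[generate(prompt)] += 1
        winner, count = votes.most_common(1)[0]
        if count >= threshold:
            return winner, round_no
    return votes.most_common(1)[0][0], max_rounds

# Stand-in for a sampled LLM call (hypothetical): cycles through canned outputs.
_canned = cycle(["A | B | hate", "A | C | hate", "A | B | hate", "A | B | hate"])

def fake_model(prompt):
    return next(_canned)

answer, rounds = accumulative_vote(fake_model, "text to analyse", threshold=3)

# Mean of the Hard Score and Soft Score quoted in the abstract.
average_score = (26.66 + 48.35) / 2
```

Accumulating votes across rounds, instead of trusting a single generation, is one simple way to trade extra inference calls for more stable outputs.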

Article 585

Title@2025-07-24 (4): P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts

Title: P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts

P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts

P-反应:通过专门 LoRA 专家混合组合,综合个人经历专题-适应性反应 2406.12548v3

Authors (5): Yuhao Dan, Jie Zhou, Qin Chen, Junfeng Tian, Liang He

Personalized large language models (LLMs) have attracted great attention in many applications, such as emotional support and role-playing. However, existing works primarily focus on modeling explicit character profiles, while ignoring the underlying personality traits that truly shape behaviors and decision-making, hampering the development of more anthropomorphic and psychologically-grounded AI systems. In this paper, we explore the modeling of Big Five personality traits, which is the most widely used trait theory in psychology, and propose P-React, a mixture of experts (MoE)-based personalized LLM. Particularly, we integrate a Personality Specialization Loss (PSL) to better capture individual trait expressions, providing a more nuanced and psychologically grounded personality simulacrum. To facilitate research in this field, we curate OCEAN-Chat, a high-quality, human-verified dataset designed to train LLMs in expressing personality traits across diverse topics. Extensive experiments demonstrate the effectiveness of P-React in maintaining consistent and real personality.

Article 586

Title@2025-07-24 (4): Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Title: Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Weit-in, schmal-out: Wiederverwertbare Dekodierung für effiziente und effektive DLLMs

宽放, 窄出: 为高效和有效DLLMs而可撤销的解码 2507.18578v1

Authors (8): Feng Hong, Geng Yu, Yushi Ye, Haicheng Huang, Huangjie Zheng, Ya Zhang, Yanfeng Wang, Jiangchao Yao

Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model’s bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6× while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10× speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.

Article 587

Title@2025-07-24 (4): LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs

Title: LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs

LingBench++: Ein linguistisch-informiertes Benchmark- und Reasoning-Framework für mehrstufige und kulturübergreifende Schlussfolgerungen mit LLMs

LingBench++:与LLMs的多层次和跨文化推理语言综合基准和理由框架 2507.16809v2

Authors (10): Da-Chen Lian, Ri-Sheng Huang, Pin-Er Chen, Chunki Lim, You-Kuan Lin, Guan-Yu Tseng, Zi-Cheng Yang, Zhen-Yu Lin, Pin-Cheng Chen, Shu-Kai Hsieh

We propose LingBench++, a linguistically-informed benchmark and reasoning framework designed to evaluate large language models (LLMs) on complex linguistic tasks inspired by the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final answer accuracy, LingBench++ provides structured reasoning traces, stepwise evaluation protocols, and rich typological metadata across over 90 low-resource and cross-cultural languages. We further develop a multi-agent architecture integrating grammatical knowledge retrieval, tool-augmented reasoning, and deliberate hypothesis testing. Through systematic comparisons of baseline and our proposed agentic models, we demonstrate that models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability. LingBench++ offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.

Article 588

Title@2025-07-24 (4): PosterMate: Audience-driven Collaborative Persona Agents for Poster Design

Title: PosterMate: Audience-driven Collaborative Persona Agents for Poster Design

PosterMate: Audience-getriebene Kollaborative Persona Agenten für Poster-Design

PosterMate:由观众驱动的海报设计合作人员代理 2507.18572v1

Authors (4): Donghoon Shin, Daniel Lee, Gary Hsieh, Gromit Yeuk-Yin Chan

Poster designing can benefit from synchronous feedback from target audiences. However, gathering audiences with diverse perspectives and reconciling them on design edits can be challenging. Recent generative AI models present opportunities to simulate human-like interactions, but it is unclear how they may be used for feedback processes in design. We introduce PosterMate, a poster design assistant that facilitates collaboration by creating audience-driven persona agents constructed from marketing documents. PosterMate gathers feedback from each persona agent regarding poster components, and stimulates discussion with the help of a moderator to reach a conclusion. These agreed-upon edits can then be directly integrated into the poster design. Through our user study (N=12), we identified the potential of PosterMate to capture overlooked viewpoints, while serving as an effective prototyping tool. Additionally, our controlled online evaluation (N=100) revealed that the feedback from an individual persona agent is appropriate given its persona identity, and the discussion effectively synthesizes the different persona agents’ perspectives.

Article 589

Title@2025-07-24 (4): Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods

Title: Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods

Hybride Tokenisierungsstrategie für DNA-Sprachmodell mit Byte Pair Encoding und K-MER Methoden

使用字节对等编码和K-MER方法的DNA语言模型混合化战略 2507.18570v1

Authors (2): Ganesh Sapkota, Md Hasibur Rahman

This paper presents a novel hybrid tokenization strategy that enhances the performance of DNA Language Models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges, including uneven token distribution and a limited understanding of global sequence context. To address these limitations, we propose merging unique 6-mer tokens with optimally selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns within DNA sequences simultaneously. A foundational DLM trained on this hybrid vocabulary was evaluated using next-k-mer prediction as a fine-tuning task, demonstrating significantly improved performance. The model achieved prediction accuracies of 10.78% for 3-mers, 10.1% for 4-mers, and 4.12% for 5-mers, outperforming state-of-the-art models such as NT, DNABERT2, and GROVER. These results highlight the ability of the hybrid tokenization strategy to preserve both the local sequence structure and global contextual information in DNA modeling. This work underscores the importance of advanced tokenization methods in genomic language modeling and lays a robust foundation for future applications in downstream DNA sequence analysis and biological research.
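
A minimal sketch of the vocabulary-merging idea, assuming a stride-1 k-mer extractor and a textbook character-level BPE learner. The helper names and toy sequences are hypothetical, and the abstract's "optimally selected" BPE tokens are simplified here to "all tokens learned within the given number of cycles".

```python
from collections import Counter

def kmer_tokens(seq, k=6):
    """Overlapping k-mers of a DNA sequence (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_vocab(seqs, cycles):
    """Learn BPE tokens by merging the most frequent adjacent pair `cycles` times."""
    corpus = [list(s) for s in seqs]
    learned = []
    for _ in range(cycles):
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break  # nothing left to merge
        (a, b), _count = pairs.most_common(1)[0]
        learned.append(a + b)
        for idx, toks in enumerate(corpus):
            merged, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(toks[i])
                    i += 1
            corpus[idx] = merged
    return learned

def hybrid_vocab(seqs, k=6, cycles=600):
    """Union of the unique k-mers and the learned BPE tokens."""
    kmers = {t for s in seqs for t in kmer_tokens(s, k)}
    return kmers | set(bpe_vocab(seqs, cycles))

seqs = ["ACGTACGTAC", "TTACGTACGG"]          # toy sequences, not real data
vocab = hybrid_vocab(seqs, k=6, cycles=5)    # 5 cycles instead of 600 for brevity
```

The resulting vocabulary mixes fixed-length 6-mers with variable-length BPE tokens, which is the property the abstract credits for capturing both short and long patterns.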

Article 590

Title@2025-07-24 (4): GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation

Title: GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation

GIIFT: Graph-geführte induktive Bildverarbeitungsfreie multimodale maschinelle Übersetzung

GIIFT: 图表制导感性不含图像的无图像多式机器翻译 2507.18562v1

Authors (2): Jiafeng Xiong, Yuting Zhao

Multimodal Machine Translation (MMT) has demonstrated the significant help of visual information in machine translation. However, existing MMT methods face challenges in leveraging the modality gap by enforcing rigid visual-linguistic alignment whilst being confined to inference within their trained multimodal domains. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information and introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K dataset of English-to-French and English-to-German tasks demonstrate that our GIIFT surpasses existing approaches and achieves the state-of-the-art, even without images during inference. Results on the WMT benchmark show significant improvements over the image-free translation baselines, demonstrating the strength of GIIFT towards inductive image-free inference.

Article 591

Title: Identity-related Speech Suppression in Generative AI Content Moderation

Identitätsbezogene Sprachunterdrückung in der Generativen KI-Inhaltsmoderation

在产生AI 内容调节中禁止与身份有关的言语 2409.13725v3

Authors (5): Grace Proebsting, Oghenefejiro Isaacs Anigboro, Charlie M. Crawford, Danaé Metaxa, Sorelle A. Friedler

Automated content moderation has long been used to help identify and filter undesired user-generated content online. But such systems have a history of incorrectly flagging content by and about marginalized identities for removal. Generative AI systems now use such filters to keep undesired generated content from being created by or shown to users. While a lot of focus has been given to making sure such systems do not produce undesired outcomes, considerably less attention has been paid to making sure appropriate text can be generated. From classrooms to Hollywood, as generative AI is increasingly used for creative or expressive text generation, whose stories will these technologies allow to be told, and whose will they suppress? In this paper, we define and introduce measures of speech suppression, focusing on speech related to different identity groups incorrectly filtered by a range of content moderation APIs. Using both short-form, user-generated datasets traditional in content moderation and longer generative AI-focused data, including two datasets we introduce in this work, we create a benchmark for measurement of speech suppression for nine identity groups. Across one traditional and four generative AI-focused automated content moderation services tested, we find that identity-related speech is more likely to be incorrectly suppressed than other speech. We find that reasons for incorrect flagging behavior vary by identity based on stereotypes and text associations, with, e.g., disability-related content more likely to be flagged for self-harm or health-related reasons while non-Christian content is more likely to be flagged as violent or hateful. As generative AI systems are increasingly used for creative work, we urge further attention to how this may impact the creation of identity-related content.

Article 592

Title@2025-07-24 (4): Augmented Vision-Language Models: A Systematic Review

Title: Augmented Vision-Language Models: A Systematic Review

Augmented Vision-Language Models: Eine systematische Bewertung

增强愿景-语言模型:系统审查 2507.22933v1

Authors (4): Anthony C Davis, Burhan Sadiq, Tianmin Shu, Chien-Ming Huang

Recent advances in visual-language machine learning models have demonstrated exceptional ability to use natural language and understand visual scenes by training on large, unstructured datasets. However, this training paradigm cannot produce interpretable explanations for its outputs, requires retraining to integrate new information, is highly resource-intensive, and struggles with certain forms of logical reasoning. One promising solution involves integrating neural networks with external symbolic information systems, forming neural symbolic systems that can enhance reasoning and memory abilities. These neural symbolic systems provide more interpretable explanations for their outputs and the capacity to assimilate new information without extensive retraining. Utilizing powerful pre-trained Vision-Language Models (VLMs) as the core neural component, augmented by external systems, offers a pragmatic approach to realizing the benefits of neural-symbolic integration. This systematic literature review aims to categorize techniques through which visual-language understanding can be improved by interacting with external symbolic information systems.

Article 593

Title@2025-07-24 (4): FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

Title: FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

FinMarBa: Ein marktinformierter Datensatz für die Einstufung von Finanzsentimenten

FinMarBa:用于金融敏感度分类的市场化数据集 2507.22932v1

Authors (6): Baptiste Lefort, Eric Benhamou, Beatrice Guez, Jean-Jacques Ohana, Ethan Setrouk, Alban Etienne

This paper presents a novel hierarchical framework for portfolio optimization, integrating lightweight Large Language Models (LLMs) with Deep Reinforcement Learning (DRL) to combine sentiment signals from financial news with traditional market indicators. Our three-tier architecture employs base RL agents to process hybrid data, meta-agents to aggregate their decisions, and a super-agent to merge decisions based on market data and sentiment analysis. Evaluated on data from 2018 to 2024, after training on 2000-2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Key contributions include scalable cross-modal integration, a hierarchical RL structure for enhanced stability, and open-source reproducibility.

Article 594

Title@2025-07-24 (4): LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important

Title: LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important

LagKV: Lag-Relative Information des KV-Cache erzählt, welche Token wichtig sind

LagKV: KV 缓存告诉哪个 Tokens 重要, 而 KV 缓存的拉格- 相对信息 2504.04704v2

Authors (4): Manlai Liang, JiaMing Zhang, Xiong Li, Jinlong Li

The growing size of the Key-Value (KV) cache during long-context inference with Large Language Models is the main obstacle to balancing deployment cost against task accuracy. To reduce the KV cache size in such scenarios, most previous efforts have relied on attention weights to evict non-critical cache tokens. However, these methods come with a trade-off: they usually require major modifications to the inference infrastructure and incur significant computational overhead. Based on the fact that Large Language Models are autoregressive, we propose LagKV, a KV compression strategy that relies only on straightforward comparisons among the KV entries themselves. It is a fully attention-free method that integrates easily into mainstream inference platforms and delivers performance comparable to that of more complicated KV compression methods. Results on the RULER benchmark show that our approach outperforms SnapKV and StreamingLLM at different compression ratios. In particular, on the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method H2O by over 50% at the same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.

Article 595

Title@2025-07-24 (4): GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface

Title: GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface

GLiNER2: Ein effizientes Multi-Task-Informationsextraktionssystem mit Schema-gesteuerter Schnittstelle

GLiNER2:具有Schema-Driven界面的高效多任务信息提取系统 2507.18546v1

Authors (5): Urchade Zaratiana, Gil Pasternak, Oliver Boyd, George Hurn-Maloney, Ash Lewis

Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built on a pretrained transformer encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across extraction and classification tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source pip-installable library with pre-trained models and documentation at https://github.com/fastino-ai/GLiNER2.

Article 596

Title@2025-07-24 (4): Effective Multi-Task Learning for Biomedical Named Entity Recognition

Title: Effective Multi-Task Learning for Biomedical Named Entity Recognition

Effektives Multi-Task-Lernen für die biomedizinische benannte Entitätserkennung

有效多任务学习促进生物医学命名实体的识别 2507.18542v1

Authors (4): João Ruano, Gonçalo M. Correia, Leonor Barreiros, Afonso Mendes

Biomedical Named Entity Recognition presents significant challenges due to the complexity of biomedical terminology and inconsistencies in annotation across datasets. This paper introduces SRU-NER (Slot-based Recurrent Unit NER), a novel approach designed to handle nested named entities while integrating multiple datasets through an effective multi-task learning strategy. SRU-NER mitigates annotation gaps by dynamically adjusting loss computation to avoid penalizing predictions of entity types absent in a given dataset. Through extensive experiments, including a cross-corpus evaluation and human assessment of the model’s predictions, SRU-NER achieves competitive performance in biomedical and general-domain NER tasks, while improving cross-domain generalization.
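
The loss-adjustment idea can be illustrated with a small sketch. It is an assumption-laden toy, not the SRU-NER implementation: it uses a per-type binary cross-entropy, and the function name, entity types, and probabilities are hypothetical. Entity types that the source dataset does not annotate are simply left out of the loss, so a correct prediction of such a type is not penalized as a false positive.

```python
import math

def masked_bce(probs, labels, annotated_types):
    """Binary cross-entropy averaged only over entity types the dataset annotates.

    probs / labels: dicts mapping entity type -> predicted probability / 0-1 gold label.
    Types outside annotated_types contribute nothing to the loss.
    """
    eps = 1e-12
    terms = []
    for etype, p in probs.items():
        if etype not in annotated_types:
            continue  # no gold annotation for this type in this dataset
        y = labels.get(etype, 0)
        terms.append(-(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)))
    return sum(terms) / len(terms) if terms else 0.0

probs = {"Gene": 0.9, "Disease": 0.8, "Chemical": 0.1}   # model predictions for one span
labels = {"Gene": 1}                                     # this corpus has no Disease labels

full = masked_bce(probs, labels, {"Gene", "Disease", "Chemical"})
masked = masked_bce(probs, labels, {"Gene", "Chemical"})
```

Here `full` treats the missing Disease label as a negative and punishes the confident Disease prediction, while `masked` ignores it, which is the behavior the abstract describes for combining corpora with different annotation schemes.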

Article 597

Title@2025-07-24 (4): The Moral Gap of Large Language Models

Title: The Moral Gap of Large Language Models

Die moralische Kluft großer Sprachmodelle

大语言模式的道德差距 2507.18523v1

Authors (2): Maciej Skorski, Alina Landowska

Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.

Article 598

Title@2025-07-24 (4): GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks

Title: GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks

GCC-Spam: Spam-Erkennung über GAN, Kontrastives Lernen und Charaktergleichheitsnetzwerke

海合会-Spam:通过全球大气监测网、反竞争学习和特征相似网络探测垃圾邮件 2507.14679v2

Authors (3): Zhijie Wang, Zixin Xu, Zhiyuan Pan

The exponential growth of spam text on the Internet necessitates robust detection mechanisms to mitigate risks such as information leakage and social instability. This work addresses two principal challenges: adversarial strategies employed by spammers and the scarcity of labeled data. We propose a novel spam-text detection framework GCC-Spam, which integrates three core innovations. First, a character similarity network captures orthographic and phonetic features to counter character-obfuscation attacks and furthermore produces sentence embeddings for downstream classification. Second, contrastive learning enhances discriminability by optimizing the latent-space distance between spam and normal texts. Third, a Generative Adversarial Network (GAN) generates realistic pseudo-spam samples to alleviate data scarcity while improving model robustness and classification accuracy. Extensive experiments on real-world datasets demonstrate that our model outperforms baseline approaches, achieving higher detection rates with significantly fewer labeled examples.

Article 599

Title@2025-07-24 (4): Exploiting individual differences to bootstrap communication

Title: Exploiting individual differences to bootstrap communication

Nutzung individueller Unterschiede zur Bootstrap-Kommunikation

利用个人差异进行靴套通信 2504.05211v2

Authors (2): Richard A. Blythe, Casimir Fisch

Establishing a communication system is hard because the intended meaning of a signal is unknown to its receiver when first produced, and the signaller also has no idea how that signal will be interpreted. Most theoretical accounts of the emergence of communication systems rely on feedback to reinforce behaviours that have led to successful communication in the past. However, providing such feedback requires already being able to communicate the meaning that was intended or interpreted. Therefore these accounts cannot explain how communication can be bootstrapped from non-communicative behaviours. Here we present a model that shows how a communication system, capable of expressing an unbounded number of meanings, can emerge as a result of individual behavioural differences in a large population without any pre-existing means to determine communicative success. The two key cognitive capabilities responsible for this outcome are behaving predictably in a given situation, and an alignment of psychological states ahead of signal production that derives from shared intentionality. Since both capabilities can exist independently of communication, our results are compatible with theories in which large flexible socially-learned communication systems like language are the product of a general but well-developed capacity for social cognition.


Article 600

Title@2025-07-24 (4): Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models

Title: Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models

Nicht alle Funktionen widmen sich der Aufmerksamkeit: Graphengeführtes Abhängigkeitslernen für tabellarische Datengenerierung mit Sprachmodellen

并非所有值得注意的地物:用语言模型编制图表数据时的图表指导依赖性学习 2507.18504v1

Authors (4): Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

Large Language Models (LLMs) have shown strong potential for tabular data generation by modeling textualized feature-value pairs. However, tabular data inherently exhibits sparse feature-level dependencies, where many feature interactions are structurally insignificant. This creates a fundamental mismatch as LLMs’ self-attention mechanism inevitably distributes focus across all pairs, diluting attention on critical relationships, particularly in datasets with complex dependencies or semantically ambiguous features. To address this limitation, we propose GraDe (Graph-Guided Dependency Learning), a novel method that explicitly integrates sparse dependency graphs into LLMs’ attention mechanism. GraDe employs a lightweight dynamic graph learning module guided by externally extracted functional dependencies, prioritizing key feature interactions while suppressing irrelevant ones. Our experiments across diverse real-world datasets demonstrate that GraDe outperforms existing LLM-based approaches by up to 12% on complex datasets while achieving competitive results with state-of-the-art approaches in synthetic data quality. Our method is minimally intrusive yet effective, offering a practical solution for structure-aware tabular data modeling with LLMs.


Article 601

Title@2025-07-24 (4): LLM-based Embedders for Prior Case Retrieval

Title: LLM-based Embedders for Prior Case Retrieval

LLM-basierte Embedders für frühere Fallwiederherstellung

用于先前个案检索的LLM 以LLM为基础的嵌入器 2507.18455v1

Authors (3): Damith Premasiri, Tharindu Ranasinghe, Ruslan Mitkov

In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. State-of-the-art deep learning IR methods have not been successful in PCR because of two key challenges: (i) lengthy legal text: powerful BERT-based transformer models limit the input length, so the input must be shortened by truncation or division, which loses legal context; (ii) lack of legal training data: due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively. In this research, we address these challenges by leveraging LLM-based text embedders in PCR. LLM-based embedders support longer input lengths, and since we use them in an unsupervised manner, they do not require training data, addressing both challenges simultaneously. We evaluate state-of-the-art LLM-based text embedders on four PCR benchmark datasets and show that they outperform BM25 and supervised transformer-based models.
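
For readers unfamiliar with the BM25 baseline mentioned above, the following is a minimal sketch of Okapi BM25 scoring over tokenized documents. The function name `bm25_scores`, the toy case texts, and the parameter defaults `k1=1.5` and `b=0.75` are illustrative assumptions; the abstract does not state which BM25 variant or settings the authors used.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy "cases"; real PCR documents are far longer.
cases = [
    "tenant breach of lease contract damages".split(),
    "criminal appeal sentencing guidelines".split(),
    "contract breach remedies and damages award".split(),
]
query = "breach contract damages".split()
scores = bm25_scores(query, cases)
ranking = sorted(range(len(cases)), key=lambda i: scores[i], reverse=True)
```

Because BM25 only matches surface terms, the second case shares no query term and scores zero, which is the kind of lexical limitation that embedding-based retrieval is meant to relax.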


Article 602

Title@2025-07-24 (4): Generation of Synthetic Clinical Text: A Systematic Review

Title: Generation of Synthetic Clinical Text: A Systematic Review

Generieren von synthetischem klinischem Text: Ein systematischer Test

合成临床文本的生成:系统审查 2507.18451v1

Authors (5): Basel Alshaikhdeeb, Ahmed Abdelmonem Hemedan, Soumyabrata Ghosh, Irina Balaur, Venkata Satagopam

Generating synthetic clinical text is an effective solution for common clinical NLP issues such as sparsity and privacy. This paper presents a systematic review of synthetic medical free-text generation, providing a quantitative analysis of three research questions concerning (i) the purpose of generation, (ii) the techniques, and (iii) the evaluation methods. We searched PubMed, ScienceDirect, Web of Science, Scopus, IEEE, Google Scholar, and arXiv databases for publications associated with generating synthetic medical unstructured free-text. We identified 94 relevant articles out of 1,398 collected. A great deal of attention has been given to the generation of synthetic medical text from 2018 onwards, where the main purposes of generation are text augmentation, assistive writing, corpus building, privacy preservation, annotation, and usefulness. Transformer architectures, especially GPTs, were the predominant technique used to generate the text. There were four main aspects of evaluation: similarity, privacy, structure, and utility, with utility the most frequently used to assess the generated synthetic medical text. Although the generated synthetic medical text showed only a moderate ability to stand in for real medical documents in different downstream NLP tasks, it has proven to be a great asset as augmentation that complements real documents, improving accuracy and overcoming sparsity/undersampling issues. Yet privacy is still a major issue in generating synthetic medical text, and more human assessment is needed to check for the existence of any sensitive information. Despite that, advances in generating synthetic medical text will considerably accelerate the adoption of workflows and pipeline development, discarding the time-consuming legalities of data transfer.


Article 603

Title@2025-07-24 (4): Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language

Title: Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language

Wiederherstellung des Rhythmus: Pünktlichkeitsrestaurierung mit Transformer-Modellen für Bangla, eine Sprache mit geringer Ressource

恢复时速:使用孟加拉国低资源语言 “ 孟加拉 “ 变压器模型恢复脉冲 2507.18448v1

Authors (4): Md Obyedullahil Mamun, Md Adyelullahil Mamun, Arif Ahmad, Md. Imran Hossain Emu

Punctuation restoration enhances the readability of text and is critical for post-processing tasks in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks: period, comma, question mark, and exclamation mark across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set. Results show strong generalization to reference and ASR transcripts, demonstrating the model’s effectiveness in real-world, noisy scenarios. This work establishes a strong baseline for Bangla punctuation restoration and contributes publicly available datasets and code to support future research in low-resource NLP.


Article 604

Title@2025-07-24 (4): AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data

Title: AraTable: Benchmarking LLMs’ Reasoning and Understanding of Arabic Tabular Data

AraTable: Benchmarking von LLMs’ Vernunft und Verständnis arabischer Tabellendaten

阿拉伯表格:按基准确定LLM女士对阿拉伯表格数据的理由和理解 2507.18442v1

Authors (3): Rana Alshaikh, Israa Alghanmi, Shelan Jeawak

The cognitive and reasoning abilities of large language models (LLMs) have enabled remarkable progress in natural language processing. However, their performance in interpreting structured data, especially in tabular formats, remains limited. Although benchmarks for English tabular data are widely available, Arabic is still underrepresented because of the limited availability of public resources and its unique language features. To address this gap, we present AraTable, a novel and comprehensive benchmark designed to evaluate the reasoning and understanding capabilities of LLMs when applied to Arabic tabular data. AraTable consists of various evaluation tasks, such as direct question answering, fact verification, and complex reasoning, involving a wide range of Arabic tabular sources. Our methodology follows a hybrid pipeline, where initial content is generated by LLMs and subsequently filtered and verified by human experts to ensure high dataset quality. Initial analyses using AraTable show that, while LLMs perform adequately on simpler tabular tasks such as direct question answering, they continue to face significant cognitive challenges when tasks require deeper reasoning and fact verification. This indicates that there are substantial opportunities for future work to improve performance on complex tabular reasoning tasks. We also propose a fully automated evaluation framework that uses a self-deliberation mechanism and achieves performance nearly identical to that of human judges. This research provides a valuable, publicly available resource and evaluation framework that can help accelerate the development of foundational models for processing and analysing Arabic structured data.


Article 605

Title@2025-07-24 (4): IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation

Title: IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation

IPCGRL: Sprachgestütztes Verstärkungslernen für die verfahrenstechnische Level-Generierung

IPCGRL: 程序生成阶段语言教学强化学习 2503.12358v4

Authors (5): In-Chang Baek, Sung-Hyun Kim, Seo-Young Lee, Dong-Hyeon Kim, Kyung-Joong Kim

Recent research has highlighted the significance of natural language in enhancing the controllability of generative models. While various efforts have been made to leverage natural language for content generation, research on deep reinforcement learning (DRL) agents utilizing text-based instructions for procedural content generation remains limited. In this paper, we propose IPCGRL, an instruction-based procedural content generation method via reinforcement learning, which incorporates a sentence embedding model. IPCGRL fine-tunes task-specific embedding representations to effectively compress game-level conditions. We evaluate IPCGRL in a two-dimensional level generation task and compare its performance with a general-purpose embedding method. The results indicate that IPCGRL achieves up to a 21.4% improvement in controllability and a 17.2% improvement in generalizability for unseen instructions. Furthermore, the proposed method extends the modality of conditional input, enabling a more flexible and expressive interaction framework for procedural content generation.


Article 606

Title@2025-07-24 (4): DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts

Title: DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts

DEFAME: Dynamic Evidence-based FAct-Checking mit multimodalen Experten

DEFAME: 与多式联运专家进行动态证据法检查 2412.10510v4

Authors (4): Tobias Braun, Mark Rothermel, Marcus Rohrbach, Anna Rohrbach

The proliferation of disinformation demands reliable and scalable fact-checking solutions. We present Dynamic Evidence-based FAct-checking with Multimodal Experts (DEFAME), a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification. DEFAME operates in a six-stage process, dynamically selecting the tools and search depth to extract and evaluate textual and visual evidence. Unlike prior approaches that are text-only, lack explainability, or rely solely on parametric knowledge, DEFAME performs end-to-end verification, accounting for images in claims and evidence while generating structured, multimodal reports. Evaluation on the popular benchmarks VERITE, AVerITeC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new state-of-the-art fact-checking system for uni- and multimodal fact-checking. Moreover, we introduce a new multimodal benchmark, ClaimReview2024+, featuring claims after the knowledge cutoff of GPT-4o, avoiding data leakage. Here, DEFAME drastically outperforms the GPT-4o baselines, showing temporal generalizability and the potential for real-time fact-checking.


Article 607

Title@2025-07-24 (4): How do language models learn facts? Dynamics, curricula and hallucinations

Title: How do language models learn facts? Dynamics, curricula and hallucinations

Wie lernen Sprachmodelle Fakten? Dynamik, Lehrpläne und Halluzinationen

语言模式如何了解事实?动态、课程和幻觉 2503.21676v2

Authors (6): Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De

Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.


Article 608

Title@2025-07-24 (4): FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs

Title: FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs

FinDPO: Finanz-Sentiment-Analyse für algorithmischen Handel durch Preference-Optimierung von LLMs

FinDPO:通过优惠优化LLMs,分析通过高利贷交易的金融敏感度 2507.18417v1

Authors (3): Giorgos Iacovides, Wuyang Zhou, Danilo Mandic

Opinions expressed in online finance-related textual data are having an increasingly profound impact on trading decisions and market movements. This trend highlights the vital role of sentiment analysis as a tool for quantifying the nature and strength of such opinions. With the rapid development of Generative AI (GenAI), supervised fine-tuned (SFT) large language models (LLMs) have become the de facto standard for financial sentiment analysis. However, the SFT paradigm can lead to memorization of the training data and often fails to generalize to unseen samples. This is a critical limitation in financial domains, where models must adapt to previously unobserved events and the nuanced, domain-specific language of finance. To this end, we introduce FinDPO, the first finance-specific LLM framework based on post-training human preference alignment via Direct Preference Optimization (DPO). The proposed FinDPO achieves state-of-the-art performance on standard sentiment classification benchmarks, outperforming existing supervised fine-tuned models by 11% on the average. Uniquely, the FinDPO framework enables the integration of a fine-tuned causal LLM into realistic portfolio strategies through a novel ‘logit-to-score’ conversion, which transforms discrete sentiment predictions into continuous, rankable sentiment scores (probabilities). In this way, simulations demonstrate that FinDPO is the first sentiment-based approach to maintain substantial positive returns of 67% annually and strong risk-adjusted performance, as indicated by a Sharpe ratio of 2.0, even under realistic transaction costs of 5 basis points (bps).
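
The abstract does not spell out the 'logit-to-score' conversion. The following is a minimal sketch of one plausible reading, assuming the model exposes one logit per sentiment label. The label names, the function `logit_to_score`, and the choice of P(positive) minus P(negative) as the rankable score are all illustrative assumptions, not the authors' method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_to_score(label_logits):
    """Turn per-label logits into class probabilities and one signed score.

    label_logits: dict with the hypothetical keys 'positive', 'negative',
    'neutral'. The score lies in [-1, 1]; higher means more bullish.
    """
    labels = ["positive", "negative", "neutral"]
    probs = dict(zip(labels, softmax([label_logits[l] for l in labels])))
    return probs["positive"] - probs["negative"], probs

# Hypothetical logits for two headlines; ranking uses the continuous score,
# not the discrete argmax label.
bullish_score, bullish_probs = logit_to_score(
    {"positive": 2.0, "negative": 0.0, "neutral": 0.0}
)
flat_score, flat_probs = logit_to_score(
    {"positive": 1.0, "negative": 1.0, "neutral": 1.0}
)
```

A continuous score like this lets assets be sorted into long and short buckets, which a hard three-way label cannot do on its own.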


Article 609

Title@2025-07-24 (4): ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

Title: ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

Explica: Explizite kausale Vernunft in großen Sprachmodellen bewerten

ExpliCa:在大语言模型中评估明确的原因原因 2502.15487v3

Authors (7): Martina Miliani, Serena Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, Alessandro Lenci

Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.


Article 610

Title@2025-07-24 (4): Enhancing RAG Efficiency with Adaptive Context Compression

Title: Enhancing RAG Efficiency with Adaptive Context Compression

Steigerung der RAG-Effizienz durch adaptive Kontextkompression

提高RAG效率,同时采取适应性环境压缩措施 2507.22931v1

Authors (2): Shuyu Guo, Zhaochun Ren

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but incurs significant inference costs due to lengthy retrieved contexts. While context compression mitigates this issue, existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity, optimizing inference efficiency without sacrificing accuracy. ACC-RAG combines a hierarchical compressor (for multi-granular embeddings) with a context selector to retain minimal sufficient information, akin to human skimming. Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and matches/unlocks over 4 times faster inference versus standard RAG while maintaining or improving accuracy.


Article 611

Title@2025-07-24 (4): Factual Inconsistencies in Multilingual Wikipedia Tables

Title: Factual Inconsistencies in Multilingual Wikipedia Tables

Tatsächliche Inkonsistenzen in mehrsprachigen Wikipedia-Tabellen

多语言维基百科表格中的事实不一致 2507.18406v1

Authors (6): Silvia Cappa, Lingxiao Kong, Pille-Riin Peet, Fanfu Wei, Yuchen Zhou, Jan-Christoph Kalo

Wikipedia serves as a globally accessible knowledge source with content in over 300 languages. Despite covering the same topics, the different versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia’s structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from Wikipedia multilingual articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content.


Article 612

Title@2025-07-24 (4): CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

Title: CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

CLEAR: Fehleranalyse über LLM-as-a-Judge leicht gemacht

CLEAR:通过LLM-as-a-Judge进行错误分析 2507.18392v1

Authors (5): Asaf Yehudai, Lilach Eden, Yotam Perlitz, Roy Bar-Haim, Michal Shmueli-Scheuer

The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model’s performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then it creates a set of system-level error issues, and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that allows for a comprehensive error analysis through aggregate visualizations, applies interactive filters to isolate specific issues or score ranges, and drills down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis for RAG and Math benchmarks, and showcase its utility through a user case study.


Article 613

Title@2025-07-24 (4): Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

Title: Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

Beschädigt durch Reasoning: Reasoning Sprachmodelle werden Free-Riders in Public Goods Games

原因:在公共货物运动会中,理性语言模式成为自由骑手 2506.23276v2

Authors (6): David Guzman Piedrahita, Yongjin Yang, Mrinmaya Sachan, Giorgia Ramponi, Bernhard Schölkopf, Zhijing Jin

As large language models (LLMs) are increasingly deployed as autonomous agents, understanding their cooperation and social mechanisms is becoming increasingly important. In particular, how LLMs balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. In this paper, we examine the challenge of costly sanctioning in multi-agent LLM systems, where an agent must decide whether to invest its own resources to incentivize cooperation or penalize defection. To study this, we adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas over repeated interactions. Our analysis reveals four distinct behavioral patterns among models: some consistently establish and sustain high levels of cooperation, others fluctuate between engagement and disengagement, some gradually decline in cooperative behavior over time, and others rigidly follow fixed strategies regardless of outcomes. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation, whereas some traditional LLMs consistently achieve high levels of cooperation. These findings suggest that the current approach to improving LLMs, which focuses on enhancing their reasoning capabilities, does not necessarily lead to cooperation, providing valuable insights for deploying LLM agents in environments that require sustained collaboration. Our code is available at https://github.com/davidguzmanp/SanctSim
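
To make the free-rider incentive concrete, here is a minimal sketch of the payoff rule of a standard one-round linear public goods game. It leaves out the institutional-choice and sanctioning mechanics the paper studies; the function name, the endowment of 20, and the multiplier of 2 are illustrative assumptions.

```python
def public_goods_payoffs(contributions, endowment=20.0, multiplier=2.0):
    """Payoff of each agent in one round of a linear public goods game.

    Each agent keeps what it does not contribute; the pooled contributions
    are multiplied and split equally among all agents.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]

# Three cooperators and one free-rider: the free-rider earns the most.
mixed = public_goods_payoffs([20.0, 20.0, 20.0, 0.0])
# → [30.0, 30.0, 30.0, 50.0]

# Full cooperation beats the mixed outcome for every cooperator.
all_in = public_goods_payoffs([20.0, 20.0, 20.0, 20.0])
# → [40.0, 40.0, 40.0, 40.0]
```

With a multiplier between 1 and the number of agents, contributing nothing is individually optimal while full contribution is collectively optimal, which is the social dilemma the agents face.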


Article 614

Title@2025-07-24 (4): Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs

Title: Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs

Beyond Profile: Von Oberflächen-Fakten zur tiefen Persona-Simulation in LLMs

超越简介:从地平面事实到深人模拟LLMs 2502.12988v3

Authors (6): Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen

Previous approaches to persona simulation with large language models (LLMs) have typically relied on learning basic biographical information, or on limited role-play dialogue datasets, to capture a character's responses. However, a holistic representation of an individual goes beyond surface-level facts or conversations to deeper thoughts and thinking. In this work, we introduce CharacterBot, a model designed to replicate both the linguistic patterns and the distinctive thought patterns manifested in the textual works of a character. Using Lu Xun, a renowned Chinese writer, as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks: multiple-choice question answering, generative question answering, and style transfer, each aligning the LLM with Lu Xun's internal ideation and writing style. To optimize learning across these tasks, we introduce a CharLoRA parameter updating mechanism, in which a general linguistic style expert collaborates with other task-specific experts to better learn both the language style and the understanding of deeper thoughts. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics. We hope this work inspires future research on deep character persona simulation with LLMs while considering the importance of ethical standards.


Article 615

Title@2025-07-24 (4): Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

Title: Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

Schutz gefährdeter Stimmen: Synthetische Datensatzgenerierung zur Selbstdetektion

保护弱势声音:为自我披露检测合成数据集生成 2507.22930v1

Authors (4): Shalini Jangra, Suparna De, Nishanth Sastry, Saeed Fadaei

Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users’ Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at https://netsys.surrey.ac.uk/datasets/synthetic-self-disclosure/ to foster reproducible research into PII privacy risks in online social media.


Article 616

Title@2025-07-24 (4): Mechanistic Indicators of Understanding in Large Language Models

Title: Mechanistic Indicators of Understanding in Large Language Models

Mechanistische Indikatoren des Verstehens in großen Sprachmodellen

大语言模型中理解力的机械指标 2507.08017v3

Authors (2): Pierre Beckmann, Matthieu Queloz

Recent findings in mechanistic interpretability (MI), the field probing the inner workings of Large Language Models (LLMs), challenge the view that these models rely solely on superficial statistics. We offer an accessible synthesis of these findings that doubles as an introduction to MI while integrating these findings within a novel theoretical framework for thinking about machine understanding. We argue that LLMs develop internal structures that are functionally analogous to the kind of understanding that consists in seeing connections. To sharpen this idea, we propose a three-tiered conception of understanding. First, conceptual understanding emerges when a model forms “features” as directions in latent space, learning the connections between diverse manifestations of something. Second, state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world. Third, principled understanding emerges when a model ceases to rely on a collection of memorized facts and discovers a “circuit” connecting these facts. However, these forms of understanding remain radically different from human understanding, as the phenomenon of “parallel mechanisms” shows. We conclude that the debate should move beyond the yes-or-no question of whether LLMs understand to investigate how their strange minds work and forge conceptions that fit them.


Article 617

Title@2025-07-24 (4): Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence

Title: Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence

Hybride Annotation für Propagandaerkennung: Integration von LLM-Vorannotationen mit menschlicher Intelligenz

宣传探测混合说明:将LLM预告与人类情报相结合 2507.18343v1

Authors (6): Ariana Sahitaj, Premtim Sahitaj, Veronika Solopova, Jiaao Li, Sebastian Möller, Vera Schmitt

Propaganda detection on social media remains challenging due to task complexity and limited high-quality labeled data. This paper introduces a novel framework that combines human expertise with Large Language Model (LLM) assistance to improve both annotation consistency and scalability. We propose a hierarchical taxonomy that organizes 14 fine-grained propaganda techniques into three broader categories, conduct a human annotation study on the HQP dataset that reveals low inter-annotator agreement for fine-grained labels, and implement an LLM-assisted pre-annotation pipeline that extracts propagandistic spans, generates concise explanations, and assigns local labels as well as a global label. A secondary human verification study shows significant improvements in both agreement and time-efficiency. Building on this, we fine-tune smaller language models (SLMs) to perform structured annotation. Instead of fine-tuning on human annotations, we train on high-quality LLM-generated data, allowing a large model to produce these annotations and a smaller model to learn to generate them via knowledge distillation. Our work contributes towards the development of scalable and robust propaganda detection systems, supporting the idea of transparent and accountable media ecosystems in line with SDG 16. The code is publicly available at our GitHub repository.
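
The abstract reports low inter-annotator agreement but does not name the agreement statistic. As a minimal sketch, assuming two annotators and Cohen's kappa (a common choice, not confirmed by the source), agreement could be computed as below. The technique labels in the example are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical fine-grained labels from two annotators on four spans.
annotator_1 = ["loaded_language", "loaded_language", "name_calling", "name_calling"]
annotator_2 = ["loaded_language", "name_calling", "loaded_language", "name_calling"]
kappa_chance = cohens_kappa(annotator_1, annotator_2)
kappa_perfect = cohens_kappa(annotator_1, annotator_1)
```

Here the annotators agree on half the spans, which is exactly what chance would produce given their label frequencies, so kappa is zero even though raw agreement is 50%.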


Article 618

Title@2025-07-24 (4): TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning

Title: TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning

TDR: Task-decoupled Retrieval mit feinkörnigem LLM-Feedback für das In-Context-Lernen

TDR: 以精细的LLM反馈方式进行任务减缩的检索,以便进行内容学习 2507.18340v1

Authors (7): Yifu Chen, Bingchen Huang, Zhiling Wang, Yuanchao Du, Junfeng Luo, Lei Shen, Zhineng Chen

In-context learning (ICL) has become a classic approach for enabling LLMs to handle various tasks based on a few input-output examples. The effectiveness of ICL relies heavily on the quality of these examples, and previous work on enhancing example retrieval has achieved impressive performance. However, two challenges remain in retrieving high-quality examples: (1) difficulty in distinguishing cross-task data distributions, and (2) difficulty in making the fine-grained connection between retriever output and feedback from LLMs. In this paper, we propose a novel framework called TDR. TDR decouples the ICL examples from different tasks, which enables the retrieval module to retrieve examples specific to the target task within a multi-task dataset. Furthermore, TDR models fine-grained feedback from LLMs to supervise and guide the training of the retrieval module, which helps to retrieve high-quality examples. We conducted extensive experiments on a suite of 30 NLP tasks; the results demonstrate that TDR consistently improves results across all datasets and achieves state-of-the-art performance. Our approach is also a plug-and-play method, which can be easily combined with various LLMs to improve example retrieval for ICL. The code is available at https://github.com/Nnn-s/TDR.

Article 619

Title@2025-07-24 (4): Uncertainty Quantification for Evaluating Machine Translation Bias

Title: Uncertainty Quantification for Evaluating Machine Translation Bias

Ungewissheit Quantifizierung für die Auswertung von maschinellen Übersetzungs-Bias

评价机器翻译偏见的不确定性定量 2507.18338v1

Authors (3): Ieva Raminta Staliūnaitė, Julius Cheng, Andreas Vlachos

In machine translation (MT), when the source sentence includes a lexeme whose gender is not overtly marked, but whose target-language equivalent requires gender specification, the model must infer the appropriate gender from the context and/or external knowledge. Studies have shown that MT models exhibit biased behaviour, relying on stereotypes even when they clash with contextual information. We posit that apart from confidently translating using the correct gender when it is evident from the input, models should also maintain uncertainty about the gender when it is ambiguous. Using recently proposed metrics of semantic uncertainty, we find that models with high translation and gender accuracy on unambiguous instances do not necessarily exhibit the expected level of uncertainty in ambiguous ones. Similarly, debiasing has independent effects on ambiguous and unambiguous translation instances.
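
A minimal sketch of the idea that uncertainty should track ambiguity: sample several translations, read off the gender each one realises, and take the entropy of those labels. The abstract does not specify the metrics used, so the entropy measure, the label scheme (`"f"`, `"m"`), and the sample counts here are illustrative assumptions.

```python
import math
from collections import Counter

def gender_entropy(sampled_genders):
    """Shannon entropy (in bits) of the gender labels across sampled translations."""
    counts = Counter(sampled_genders)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Invented samples: ten sampled translations per source sentence.
unambiguous = ["f"] * 10              # context fixes the gender
ambiguous_biased = ["m"] * 9 + ["f"]  # ambiguous input, stereotyped output
ambiguous_ideal = ["m"] * 5 + ["f"] * 5

h_unambiguous = gender_entropy(unambiguous)
h_biased = gender_entropy(ambiguous_biased)
h_ideal = gender_entropy(ambiguous_ideal)
```

Under this reading, a well-behaved model would show entropy near zero on unambiguous inputs and noticeably higher entropy on ambiguous ones. A model that stays confident on ambiguous inputs, as in `ambiguous_biased`, is the failure pattern the abstract describes.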

Article 620

Title@2025-07-24 (4): EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

Title: EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

EH-Benchmark Ophthalmische Halluzination Benchmark und Agent-getriebene Top-Down-Rückverfolgbarkeit Workflow

EH-Benchmark 眼科幻觉基准和代理驱动的自上而下可追踪推理工作流程 2507.22929v1

Authors (8): Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu

Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs’ hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.

Article 621

Title@2025-07-24 (4): A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Title: A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Eine umfassende Studie der LLM-basierten Argumentationsklassifikation: von LLAMA über GPT-4o bis Deepseek-R1

关于以LLM为基础的理论分类的全面研究:从LLAMA到GPT-4o到Deepseek-R1 2507.08621v2

Authors (5): Marcin Pietroń, Rafał Olszowski, Jakub Gomułka, Filip Gampel, Andrzej Tomski

Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLMs, but there is still a lack of research and results on how these models perform on publicly available argument classification databases. This paper presents a study of a selection of LLMs, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. Among the models with reasoning capabilities, Deepseek-R1 performs best. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLMs and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.

Article 622

Title@2025-07-24 (4): BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit

Title: BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit

BadReasoner: Pflanzung Tunable Überdenken Hintertüren zu großen Grundmodellen für Spaß oder Gewinn

BadReasoner: 将金枪鱼可变性过度思考的后门规划成娱乐或利润的大理由模型 2507.18305v1

Authors (7): Biao Yi, Zekun Fei, Jianing Geng, Tong Li, Lihai Nie, Zheli Liu, Yiming Li

Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term “overthinking backdoors”. We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model’s reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger (where the number of repetitions signals the desired intensity) with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer’s correctness. Our source code is available at https://github.com/FZaKK/BadReasoner.

Article 623

Title@2025-07-24 (4): LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models

Title: LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models

LoRA-Leak: Membership Inferenz Angriffe gegen LoRA fein abgestimmte Sprachmodelle

LoRA-Leak:对LORA精调语言模式的成员推论攻击 2507.18302v1

Authors (6): Delong Ran, Xinlei He, Tianshuo Cong, Anyu Wang, Qi Li, Xiaoyun Wang

Language Models (LMs) typically adhere to a “pre-training and fine-tuning” paradigm, where a universal pre-trained model can be fine-tuned to cater to various specialized domains. Low-Rank Adaptation (LoRA) has gained the most widespread use in LM fine-tuning due to its lightweight computational cost and remarkable performance. Because the proportion of parameters tuned by LoRA is relatively small, there might be a misleading impression that the LoRA fine-tuning data is invulnerable to Membership Inference Attacks (MIAs). However, we identify that utilizing the pre-trained model can induce more information leakage, which is neglected by existing MIAs. Therefore, we introduce LoRA-Leak, a holistic evaluation framework for MIAs against the fine-tuning datasets of LMs. LoRA-Leak incorporates fifteen membership inference attacks, including ten existing MIAs, and five improved MIAs that leverage the pre-trained model as a reference. In experiments, we apply LoRA-Leak to three advanced LMs across three popular natural language processing tasks, demonstrating that LoRA-based fine-tuned LMs are still vulnerable to MIAs (e.g., 0.775 AUC under conservative fine-tuning settings). We also applied LoRA-Leak to different fine-tuning settings to understand the resulting privacy risks. We further explore four defenses and find that only dropout and excluding specific LM layers during fine-tuning effectively mitigate MIA risks while maintaining utility. We highlight that under the “pre-training and fine-tuning” paradigm, the existence of the pre-trained model makes MIA a more severe risk for LoRA-based LMs. We hope that our findings can provide guidance on data privacy protection for specialized LM providers.
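
To make the "pre-trained model as a reference" idea concrete, here is a hedged sketch of one common form of reference calibration: score each sample by how much lower its loss is under the fine-tuned model than under the pre-trained one, then summarise attack quality with AUC. The abstract does not spell out the five improved attacks, so this scoring rule, the function names, and all loss values are illustrative assumptions.

```python
def calibrated_scores(finetuned_losses, pretrained_losses):
    """Higher score = loss dropped more after fine-tuning = more likely a member."""
    return [p - f for f, p in zip(finetuned_losses, pretrained_losses)]

def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Invented per-sample losses. Members were in the fine-tuning set.
member_ft, member_pt = [1.0, 1.2, 2.5], [2.0, 2.1, 3.4]
nonmember_ft, nonmember_pt = [1.1, 2.4, 3.0], [1.2, 2.5, 3.1]

# Baseline attack: use the negated fine-tuned loss alone.
auc_raw = auc([-x for x in member_ft], [-x for x in nonmember_ft])

# Reference-calibrated attack: subtract out how hard each sample is to begin with.
auc_calibrated = auc(
    calibrated_scores(member_ft, member_pt),
    calibrated_scores(nonmember_ft, nonmember_pt),
)
```

In this toy data some members are intrinsically hard samples, so the raw loss separates them poorly, while the calibrated score cancels that per-sample difficulty.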

Article 624

Title@2025-07-24 (4): DocTER: Evaluating Document-based Knowledge Editing

Title: DocTER: Evaluating Document-based Knowledge Editing

DocTER: Dokumentbasierte Wissensbearbeitung bewerten

DocTER:评价基于文件的知识编辑 2308.09954v2

Authors (7): Suhang Wu, Ante Wang, Minlong Peng, Yujie Lin, Wenbo Li, Mingming Sun, Jinsong Su

Knowledge editing aims to correct outdated or inaccurate knowledge in neural networks. In this paper, we explore knowledge editing using easily accessible documents instead of manually labeled factual triples employed in earlier research. To advance this field, we establish the first evaluation benchmark, DocTER, featuring Documents containing counterfactual knowledge for editing. A comprehensive four-perspective evaluation is introduced: Edit Success, Locality, Reasoning, and Cross-lingual Transfer. To adapt conventional triplet-based knowledge editing methods for this task, we develop an Extract-then-Edit pipeline that extracts triples from documents before applying existing methods. Experiments on popular knowledge editing methods demonstrate that editing with documents presents significantly greater challenges than using triples. In document-based scenarios, even the best-performing in-context editing approach still lags behind by 10 points in editing success when compared to using gold triples. This observation also holds for both reasoning and cross-lingual test sets. We further analyze key factors influencing task performance, including the quality of extracted triples, the frequency and position of edited knowledge in documents, various methods for enhancing reasoning, and performance differences across various directions in cross-lingual knowledge editing, which provide valuable insights for future research.

Article 625

Title@2025-07-24 (4): Step-Audio 2 Technical Report

Title: Step-Audio 2 Technical Report

Step-Audio 2 Technischer Bericht

Step-Audio 2 技术报告 2507.16632v2

Authors (109): Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

Article 626

Title@2025-07-24 (4): VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

Title: VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

VolDoGer: LLM-unterstützte Datensätze für Domain-Verallgemeinerung in Vision-Language-Aufgaben

VolDoGer:LLM辅助数据集,用于视野语言任务中通用域的LLM辅助数据集 2407.19795v2

Authors (5): Juhwan Choi, Junehyoung Kwon, JungMin Yun, Seunguk Yu, YoungBin Kim

Domain generalizability is a crucial aspect of a deep learning model since it determines the capability of the model to perform well on data from unseen domains. However, research on the domain generalizability of deep learning models for vision-language tasks remains limited, primarily because of the lack of required datasets. To address these challenges, we propose VolDoGer: Vision-Language Dataset for Domain Generalization, a dedicated dataset designed for domain generalization that addresses three vision-language tasks: image captioning, visual question answering, and visual entailment. We constructed VolDoGer by extending LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators. We evaluated the domain generalizability of various models, ranging from fine-tuned models to a recent multimodal large language model, through VolDoGer.

Article 627

Title@2025-07-24 (4): StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer

Title: StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer

StyleAdaptedLM: Weiterentwicklung der Anleitung nach Modellen mit effizienter Stylistik-Übertragung

StyleAdaptedLM:按照高效立体转让模式加强教学 2507.18294v1

Authors (5): Pritika Ramu, Apoorv Saxena, Meghanath M Y, Varsha Sankar, Debraj Basu

Adapting LLMs to specific stylistic characteristics, like brand voice or authorial tone, is crucial for enterprise communication, but it is hard to achieve from corpora that lack instruction-response formatting without compromising instruction adherence. We introduce StyleAdaptedLM, a framework that efficiently transfers stylistic traits to instruction-following models using Low-Rank Adaptation (LoRA). LoRA adapters are first trained on a base model with diverse unstructured stylistic corpora, then merged with a separate instruction-following model. This enables robust stylistic customization without paired data or sacrificing task performance. Experiments across multiple datasets and models demonstrate improved stylistic consistency while preserving instruction adherence, with human evaluations confirming brand-specific convention uptake. StyleAdaptedLM offers an efficient path for stylistic personalization in LLMs.

Article 628

Title@2025-07-24 (4): How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

Title: How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

Wie denkt die Kette des Denkens? Mechanistische Interpretierbarkeit von Chain-of-Thought-Reasoning mit Sparse Autoencoding

思维链思维链是如何思考的? 2507.22928v1

Authors (3): Xi Chen, Aske Plaat, Niki van Stein

Chain-of-thought (CoT) prompting boosts the accuracy of Large Language Models on multi-step tasks, yet whether the generated “thoughts” reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model’s confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
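
The feature-swapping experiment can be pictured with a toy sketch of activation patching: copy selected feature activations from a donor run (CoT) into a base run (noCoT) and compare a readout before and after. The three-feature vectors, the linear `readout`, and its weights are invented stand-ins, not the Pythia models or the sparse-autoencoder features studied in the paper.

```python
def readout(features, weights):
    """Toy linear readout standing in for the answer log-probability."""
    return sum(f * w for f, w in zip(features, weights))

def patch(base, donor, indices):
    """Return a copy of `base` with the features at `indices` taken from `donor`."""
    return [donor[i] if i in indices else value for i, value in enumerate(base)]

# Invented feature activations for the same question under two prompts.
no_cot_features = [0.0, 1.0, 0.5]
cot_features = [2.0, 1.0, 0.5]
weights = [1.5, -0.5, 1.0]  # hypothetical readout weights

baseline = readout(no_cot_features, weights)
patched = readout(patch(no_cot_features, cot_features, {0}), weights)
effect = patched - baseline  # causal effect attributed to feature 0
```

A random-feature baseline in the same spirit would patch randomly chosen indices and compare the resulting `effect` against the targeted patch.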

Article 629

Title@2025-07-24 (4): Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil

Title: Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil

Null-Schuss OCR Genauigkeit der niedrig-Ressourcen Sprachen: Eine vergleichende Analyse auf Sinhala und Tamil

低资源语言的准确性:僧伽罗语和泰米尔语比较分析 2507.18264v1

Authors (2): Nevidu Jayatilleke, Nisansa de Silva

Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.
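
For readers unfamiliar with the reported metrics, the sketch below shows the standard definitions of character error rate (CER) and word error rate (WER) as Levenshtein edit distance divided by reference length. The abstract does not describe the evaluation code, so this is a generic illustration. The English example strings are invented, and the character-level variant counts Unicode code points, which may differ from the segmentation used for Sinhala or Tamil text.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

reference = "the cat sat"
ocr_output = "the cat sit"
example_cer = cer(reference, ocr_output)  # one wrong character out of 11
example_wer = wer(reference, ocr_output)  # one wrong word out of 3
```

A single wrong character changes a whole word, which is why WER is typically higher than CER on the same output.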

Article 630

Title@2025-07-24 (4): Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models

Title: Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models

Locate-and-Focus: Verbesserung der Terminologieübersetzung in Sprachmodellen

目的和重点:加强语言语言模式术语翻译 2507.18263v1

Authors (9): Suhang Wu, Jialong Tang, Chengyi Yang, Pei Zhang, Baosong Yang, Junhui Li, Junfeng Yao, Min Zhang, Jinsong Su

Direct speech translation (ST) has garnered increasing attention, yet the accurate translation of terminology within utterances remains a great challenge. Current studies mainly concentrate on incorporating various kinds of translation knowledge into ST models. However, these methods often struggle with interference from irrelevant noise and cannot fully utilize the translation knowledge. To address these issues, we propose a novel Locate-and-Focus method for terminology translation. It first locates the speech clips containing terminology within the utterance to construct translation knowledge, minimizing irrelevant information for the ST model. Subsequently, it associates the translation knowledge with the utterance and hypothesis from both audio and textual modalities, allowing the ST model to better focus on translation knowledge during translation. Experimental results across various datasets demonstrate that our method effectively locates terminology within utterances and enhances the success rate of terminology translation, while maintaining robust general translation performance.

Article 631

Title@2025-07-24 (4): Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning

Title: Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning

Multimodale Verhaltensmusteranalyse mit Eye-Tracking und LLM-basierter Vernunft

以眼跟踪和基于LLM的理由进行多模式行为模式分析 2507.18252v1

Authors (4): Dongyang Guo, Yasmeen Abdrabou, Enkeleda Thaqi, Enkelejda Kasneci

Eye-tracking data reveals valuable insights into users’ cognitive states but is difficult to analyze due to its structured, non-linguistic nature. While large language models (LLMs) excel at reasoning over text, they struggle with temporal and numerical data. This paper presents a multimodal human-AI collaborative framework designed to enhance cognitive pattern extraction from eye-tracking signals. The framework includes: (1) a multi-stage pipeline using horizontal and vertical segmentation alongside LLM reasoning to uncover latent gaze patterns; (2) an Expert-Model Co-Scoring Module that integrates expert judgment with LLM output to generate trust scores for behavioral interpretations; and (3) a hybrid anomaly detection module combining LSTM-based temporal modeling with LLM-driven semantic analysis. Our results across several LLMs and prompt strategies show improvements in consistency, interpretability, and performance, with up to 50% accuracy in difficulty prediction tasks. This approach offers a scalable, interpretable solution for cognitive modeling and has broad potential in adaptive learning, human-computer interaction, and educational analytics.

Article 632

Title@2025-07-24 (4): Meta Prompting for AI Systems

Title: Meta Prompting for AI Systems

Meta Prompting für KI-Systeme

AI 系统的模拟模拟 2311.11482v8

Authors (3): Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao

We introduce Meta Prompting (MP), a framework that elevates the reasoning capabilities of large language models (LLMs) by focusing on the formal structure of a task rather than content-specific examples. We establish a theoretical foundation for this paradigm, formalizing MP as a functor that maps a category of tasks to a category of structured prompts, thereby guaranteeing that compositional problem-solving strategies can be systematically decomposed into modular prompt structures. We extend this concept to Recursive Meta Prompting (RMP), an automated process where an LLM can generate and refine its own prompts. We model this self-improvement loop formally as a monad, providing a principled framework for automated prompt engineering. Our claims are validated through extensive experiments demonstrating that a Qwen-72B base model, guided by a single, example-agnostic meta-prompt, achieves state-of-the-art results on MATH, GSM8K, and Game of 24. These results are achieved with substantial token efficiency gains over traditional few-shot methods. Project Page: https://github.com/meta-prompting/meta-prompting.

Article 633

Title@2025-07-24 (4): Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

Title: Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

Prune&Comp: Kostenloses Mittagessen für Layer-Pruned LLMs über iterative Pruning mit Magnitude Compensation

Prune & Comp: 通过模拟谨慎与磁度补偿为由层驱动的LMs免费午餐 2507.18212v1

Authors (8): Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, Chun Yuan

Layer pruning has emerged as a promising technique for compressing large language models (LLMs) while achieving acceleration proportional to the pruning ratio. In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. To address this issue, we propose Prune&Comp, a novel plug-and-play layer pruning scheme that leverages magnitude compensation to mitigate such gaps in a training-free manner. Specifically, we first estimate the magnitude gap caused by layer removal and then eliminate this gap by rescaling the remaining weights offline, with zero runtime overhead incurred. We further demonstrate the advantages of Prune&Comp through an iterative pruning strategy. When integrated with an iterative prune-and-compensate loop, Prune&Comp consistently enhances existing layer pruning metrics. For instance, when 5 layers of LLaMA-3-8B are pruned using the prevalent block influence metric, Prune&Comp nearly halves the perplexity and retains 93.19% of the original model’s question-answering performance, outperforming the baseline by 4.01%.
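
The estimate-then-rescale step can be sketched in a few lines. This is one plausible reading only: the gap is taken as the ratio of mean hidden-state norms after and before the removed layer on calibration inputs, and the factor is folded into a weight matrix that writes into the hidden state. Which weights the method actually rescales, and how the gap is estimated, are not given in the abstract, so `magnitude_ratio`, `rescale`, and all numbers are assumptions.

```python
import math

def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))

def magnitude_ratio(hidden_in, hidden_out):
    """Mean norm after the layer divided by mean norm before it (calibration data)."""
    mean_in = sum(l2_norm(h) for h in hidden_in) / len(hidden_in)
    mean_out = sum(l2_norm(h) for h in hidden_out) / len(hidden_out)
    return mean_out / mean_in

def rescale(weights, alpha):
    """Fold the compensation factor into a weight matrix once, offline."""
    return [[alpha * w for w in row] for row in weights]

def matvec(weights, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

# Invented calibration activations around the layer that will be removed.
hidden_in = [[3.0, 4.0], [6.0, 8.0]]
hidden_out = [[6.0, 8.0], [12.0, 16.0]]
alpha = magnitude_ratio(hidden_in, hidden_out)

# Hypothetical upstream projection whose output feeds the hidden state.
upstream = [[1.0, 2.0], [0.5, -1.0]]
compensated = rescale(upstream, alpha)
x = [2.0, 1.0]
```

Because the factor is folded into the stored weights, inference runs the same number of operations as the uncompensated pruned model, which is consistent with the zero-runtime-overhead claim.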

Article 634

Title@2025-07-24 (4): Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge

Title: Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge

Verbesserung der Transformation von natürlicher Sprache zur Signalzeitlogik mit LLMs mit vielfältigem externem Wissen

利用具有多种外部知识的LMLML 增强从自然语言向信号时时逻辑的转变 2505.20658v2

Authors (6): Yue Fang, Zhi Jin, Jie An, Hongshen Chen, Xiaohong Chen, Naijun Zhan

Temporal Logic (TL), especially Signal Temporal Logic (STL), enables precise formal specification, making it widely used in cyber-physical systems such as autonomous driving and robotics. Automatically transforming natural language (NL) into STL is an attractive approach to overcome the limitations of manual transformation, which is time-consuming and error-prone. However, due to the lack of datasets, automatic transformation currently faces significant challenges and has not been fully explored. In this paper, we propose an NL-STL dataset named STL-Diversity-Enhanced (STL-DivEn), which comprises 16,000 samples enriched with diverse patterns. To develop the dataset, we first manually create a small-scale seed set of NL-STL pairs. Next, representative examples are identified through clustering and used to guide large language models (LLMs) in generating additional NL-STL pairs. Finally, diversity and accuracy are ensured through rigorous rule-based filters and human validation. Furthermore, we introduce the Knowledge-Guided STL Transformation (KGST) framework, a novel approach for transforming natural language into STL, involving a generate-then-refine process based on external knowledge. Statistical analysis shows that the STL-DivEn dataset exhibits more diversity than the existing NL-STL dataset. Moreover, both metric-based and human evaluations indicate that our KGST approach outperforms baseline models in transformation accuracy on STL-DivEn and DeepSTL datasets.

Article 635

Title@2025-07-24 (4): Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation

Title: Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation

Untersuchung der Auswirkungen von Instruction-Tuning auf die Anfälligkeit von LLM für Fehlinformationen

探讨指导指导对LLM对错误信息易感性的影响 2507.18203v1

Authors (5): Kyubeen Han, Junseo Jang, Hongjin Kim, Geunyeong Jeong, Harksoo Kim

Instruction-tuning enhances the ability of large language models (LLMs) to follow user instructions more accurately, improving usability while reducing harmful outputs. However, this process may increase the model’s dependence on user input, potentially leading to the unfiltered acceptance of misinformation and the generation of hallucinations. Existing studies primarily highlight that LLMs are receptive to external information that contradicts their parametric knowledge, but little research has been conducted on the direct impact of instruction-tuning on this phenomenon. In our study, we investigate the impact of instruction-tuning on LLMs’ susceptibility to misinformation. Our analysis reveals that instruction-tuned LLMs are significantly more likely to accept misinformation when it is presented by the user. A comparison with base models shows that instruction-tuning increases reliance on user-provided information, shifting susceptibility from the assistant role to the user role. Furthermore, we explore additional factors influencing misinformation susceptibility, such as the role of the user in prompt structure, misinformation length, and the presence of warnings in the system prompt. Our findings underscore the need for systematic approaches to mitigate unintended consequences of instruction-tuning and enhance the reliability of LLMs in real-world applications.

Article 636

Title@2025-07-24 (4): Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection

Title: Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection

Sicherung von RAG-Pipelines mit GMTP: Eine gradient-basierte maskierte Token-Wahrscheinlichkeitsmethode für vergiftete Dokumentenerkennung

使用GMTP来保护RAG管道:一种基于渐进式蒙面的中毒文件检测概率方法 2507.18202v1

Authors (4): San Kim, Jonghwi Kim, Yejin Jeon, Gary Geunbae Lee

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by providing external knowledge for accurate and up-to-date responses. However, this reliance on external sources exposes a security risk: attackers can inject poisoned documents into the knowledge base to steer the generation process toward harmful or misleading outputs. In this paper, we propose Gradient-based Masked Token Probability (GMTP), a novel defense method to detect and filter out adversarially crafted documents. Specifically, GMTP identifies high-impact tokens by examining gradients of the retriever’s similarity function. These key tokens are then masked, and their probabilities are checked via a Masked Language Model (MLM). Since injected tokens typically exhibit markedly low masked-token probabilities, this enables GMTP to easily detect malicious documents and achieve high-precision filtering. Experiments demonstrate that GMTP is able to eliminate over 90% of poisoned content while retaining relevant documents, thus maintaining robust retrieval and generation performance across diverse datasets and adversarial settings.
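
The detection logic described above can be sketched under simplifying assumptions: per-token gradient magnitudes and masked-token probabilities are taken as precomputed inputs, the top-k tokens by gradient are selected, and a document is flagged when their average MLM probability falls below a threshold. The values of `top_k` and `threshold`, the averaging rule, and all numbers are illustrative choices, not the paper's settings.

```python
def key_token_probability(grad_norms, masked_probs, top_k=2):
    """Average masked-token probability of the top_k tokens ranked by gradient norm."""
    ranked = sorted(range(len(grad_norms)), key=lambda i: grad_norms[i], reverse=True)
    key_tokens = ranked[:top_k]
    return sum(masked_probs[i] for i in key_tokens) / len(key_tokens)

def is_suspicious(grad_norms, masked_probs, top_k=2, threshold=0.1):
    """Flag a document whose high-impact tokens look implausible to the MLM."""
    return key_token_probability(grad_norms, masked_probs, top_k) < threshold

# Invented per-token values. In practice the gradients would come from the
# retriever's similarity function and the probabilities from a masked language model.
clean_grads, clean_probs = [0.2, 0.9, 0.4, 0.8], [0.6, 0.5, 0.7, 0.4]
poisoned_grads, poisoned_probs = [0.1, 2.0, 1.8, 0.2], [0.7, 0.01, 0.03, 0.6]

clean_score = key_token_probability(clean_grads, clean_probs)
poisoned_score = key_token_probability(poisoned_grads, poisoned_probs)
```

Restricting the check to high-gradient tokens focuses it on the tokens that most influence retrieval similarity, which is where injected content is expected to sit.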

Article 637

Title@2025-07-24 (4): Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization

Title: Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization

Integration eines ISO30401-konformen Wissensmanagementsystems in bestehende Geschäftsprozesse einer Organisation

将符合ISO30401的知识管理系统纳入一个组织的现有业务流程 2507.18197v1

Authors (2): Aline Belloni, Patrick Prieur

Business process modeling is used by most organizations as an essential framework for ensuring efficiency and effectiveness of the work and workflow performed by their employees and for ensuring the alignment of such work with their strategic goals. For organizations that are compliant or near-compliant with ISO 9001, this approach involves the detailed mapping of processes, sub-processes, activities, and tasks. ISO30401 is a Management System Standard, introduced in 2018, establishing universal requirements for the setup of a Knowledge Management System in an organization. As “ISO30401 implementers” we regularly face the challenge of explaining to our clients how the knowledge development, transformation and conveyance activities depicted in ISO30401 integrate with existing operational processes. This article recaps process modelling principles in the context of ISO9001 and explores, based on our experience, how an ISO30401-compliant Knowledge Management System (KMS) entwines with all other processes of an Integrated Management System and in particular how it can be implemented by deploying the mechanisms of the SECI model through the steps of PDCA cycles.

Article 638

Title@2025-07-24 (4): SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models

Title: SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models

SCOPE: Stochastische und gegensätzliche Wahlplatzierung für die Bewertung großer Sprachmodelle

SCOPE:评估大语言模式的施虐和反偏见选择安置 2507.18182v1

Authors (3): Wonjun Jeong, Dongseok Kim, Taegkeun Whangbo

Large Language Models (LLMs) can achieve inflated scores on multiple-choice tasks by exploiting inherent biases in option positions or labels, rather than demonstrating genuine understanding. This study introduces SCOPE, an evaluation framework designed to measure and mitigate such selection bias in a dataset-independent manner. By repeatedly invoking a null prompt that lacks semantic content, SCOPE estimates each model’s unique position-bias distribution. It then redistributes the answer slot according to the inverse-bias distribution, thereby equalizing the lucky-rate, the probability of selecting the correct answer by chance. Furthermore, it prevents semantically similar distractors from being placed adjacent to the answer, thereby blocking near-miss guesses based on superficial proximity cues. Across multiple benchmark experiments, SCOPE consistently outperformed existing debiasing methods in terms of stable performance improvements and showed clearer confidence distributions over correct options. This framework thus offers a new standard for enhancing the fairness and reliability of LLM evaluations.
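
The two steps in this abstract (estimating position bias from null prompts, then placing the answer according to the inverse-bias distribution) can be sketched as below. This is an illustrative reading, not the paper's code: the exact form of the inverse-bias distribution is an assumption (weights proportional to 1/bias), and `estimate_position_bias`, `inverse_bias_distribution`, `place_answer`, and the `smoothing` parameter are hypothetical names.

```python
import random
from collections import Counter
from typing import List, Sequence


def estimate_position_bias(null_picks: Sequence[int],
                           n_options: int,
                           smoothing: float = 1.0) -> List[float]:
    """Estimate how often the model picks each slot on content-free prompts.

    null_picks holds the slot index chosen on each null-prompt call.
    Additive smoothing (an assumption) keeps every probability non-zero.
    """
    counts = Counter(null_picks)
    total = len(null_picks) + smoothing * n_options
    return [(counts.get(i, 0) + smoothing) / total for i in range(n_options)]


def inverse_bias_distribution(bias: Sequence[float]) -> List[float]:
    """Assumed inverse weighting: slots the model favours get less mass."""
    inverse = [1.0 / p for p in bias]
    norm = sum(inverse)
    return [w / norm for w in inverse]


def place_answer(bias: Sequence[float], rng: random.Random) -> int:
    """Sample the slot that will hold the correct answer."""
    weights = inverse_bias_distribution(bias)
    return rng.choices(range(len(bias)), weights=weights, k=1)[0]


# Toy data: a model that picks slot 0 on 6 of 10 null prompts.
null_picks = [0] * 6 + [1] * 2 + [2] + [3]
bias = estimate_position_bias(null_picks, n_options=4)
placement = inverse_bias_distribution(bias)
slot = place_answer(bias, random.Random(0))
```

In this toy case slot 0 is the model's favourite, so it receives the smallest share of correct-answer placements, which is the intended counterbiasing effect.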

Article 639

Title@2025-07-24 (4): Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models

Title: Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models

Das Mittel halten: Sticky Tokens in Text-Embedding-Modellen erkennen

坚持平均值:在文本嵌入模型中检测粘力 2507.18171v1

Authors (5): Kexin Chen, Dongxia Wang, Yi Liu, Haonan Zhang, Wenhai Wang

Despite the widespread use of Transformer-based text embedding models in NLP tasks, surprising ‘sticky tokens’ can undermine the reliability of embeddings. These tokens, when repeatedly inserted into sentences, pull sentence similarity toward a certain value, disrupting the normal distribution of embedding distances and degrading downstream performance. In this paper, we systematically investigate such anomalous tokens, formally defining them and introducing an efficient detection method, Sticky Token Detector (STD), based on sentence and token filtering. Applying STD to 40 checkpoints across 14 model families, we discover a total of 868 sticky tokens. Our analysis reveals that these tokens often originate from special or unused entries in the vocabulary, as well as fragmented subwords from multilingual corpora. Notably, their presence does not strictly correlate with model size or vocabulary size. We further evaluate how sticky tokens affect downstream tasks like clustering and retrieval, observing significant performance drops of up to 50%. Through attention-layer analysis, we show that sticky tokens disproportionately dominate the model’s internal representations, raising concerns about tokenization robustness. Our findings show the need for better tokenization strategies and model design to mitigate the impact of sticky tokens in future text embedding applications.
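
The defining behavior of a sticky token (inserting it pulls sentence similarity toward a fixed value) can be checked with a few lines. This sketch is not the Sticky Token Detector from the paper: it assumes similarities before and after insertion were already computed by some embedding model, and `pulls_toward_mean` and `min_fraction` are hypothetical names chosen for illustration.

```python
from typing import Sequence


def pulls_toward_mean(before: Sequence[float],
                      after: Sequence[float],
                      mean_sim: float,
                      min_fraction: float = 0.9) -> bool:
    """Return True if inserting a token moves most pair similarities
    closer to mean_sim.

    before[i] / after[i]: cosine similarity of sentence pair i without and
    with the candidate token inserted (assumed precomputed).
    mean_sim: the model's average pairwise similarity (assumed given).
    """
    if len(before) != len(after) or not before:
        raise ValueError("need matching, non-empty similarity lists")
    closer = sum(
        1 for b, a in zip(before, after)
        if abs(a - mean_sim) < abs(b - mean_sim)
    )
    return closer / len(before) >= min_fraction


# Toy similarities for three sentence pairs.
before = [0.9, 0.1, 0.8]
sticky_after = [0.6, 0.4, 0.55]    # every pair drifts toward 0.5
normal_after = [0.95, 0.05, 0.8]   # no pair drifts toward 0.5
```

A real detector would also need the sentence and token filtering stages mentioned in the abstract; this only shows the final criterion.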

Article 640

Title@2025-07-24 (4): Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges

Title: Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges

Jüngste Trends bei der Ferngesprächserkennung: Ein Rückblick auf die Herausforderungen CHiME-7 und 8 DASR

最近对不同政见的语音识别趋势:对CHiME-7和8DASR挑战的回顾 2507.18161v1

Authors (12): Samuele Cornell, Christoph Boeddeker, Taejin Park, He Huang, Desh Raj, Matthew Wiesner, Yoshiki Masuyama, Xuankai Chang, Zhong-Qiu Wang, Stefano Squartini, Paola Garcia, Shinji Watanabe

The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges’ design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.

Article 641

Title@2025-07-24 (4): A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects

Title: A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects

Eine Umfrage über die Kausalitätsidentifizierung: Taxonomie, Herausforderungen, Bewertung und Perspektiven

事件原因识别调查:分类、挑战、评估和前景 2411.10371v5

Authors (5): Qing Cheng, Zefan Zeng, Xingchen Hu, Yuehang Si, Zhong Liu

Event Causality Identification (ECI) has become an essential task in Natural Language Processing (NLP), focused on automatically detecting causal relationships between events within texts. This comprehensive survey systematically investigates fundamental concepts and models, developing a systematic taxonomy and critically evaluating diverse models. We begin by defining core concepts, formalizing the ECI problem, and outlining standard evaluation protocols. Our classification framework divides ECI models into two primary tasks: Sentence-level Event Causality Identification (SECI) and Document-level Event Causality Identification (DECI). For SECI, we review models employing feature pattern-based matching, machine learning classifiers, deep semantic encoding, prompt-based fine-tuning, and causal knowledge pre-training, alongside data augmentation strategies. For DECI, we focus on approaches utilizing deep semantic encoding, event graph reasoning, and prompt-based fine-tuning. Special attention is given to recent advancements in multi-lingual and cross-lingual ECI, as well as zero-shot ECI leveraging Large Language Models (LLMs). We analyze the strengths, limitations, and unresolved challenges associated with each approach. Extensive quantitative evaluations are conducted on four benchmark datasets to rigorously assess the performance of various ECI models. We conclude by discussing future research directions and highlighting opportunities to advance the field further.

Article 642

Title@2025-07-24 (4): Large Language Models in Argument Mining: A Survey

Title: Large Language Models in Argument Mining: A Survey

Große Sprachmodelle im Argumentbergbau: Eine Umfrage

争议采矿大语言模型:调查 2506.16383v4

Authors (5): Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic

Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques – such as prompting, chain-of-thought reasoning, and retrieval augmentation – have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.

Article 643

Title@2025-07-24 (4): Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Title: Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Auf dem Weg zu größerer Hebelwirkung: Skalierungsgesetze für effiziente Mixture-of-Experts-Sprachmodelle

争取更大程度的利用:提高有效混合专家语言模式法的规模 2507.17702v2

Authors (6): Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.
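
The "over 7x" figure in this abstract can be sanity-checked with a back-of-the-envelope calculation. The sketch below is not the paper's Efficiency Leverage definition: it assumes the common rule of thumb that training cost is roughly 6 FLOPs per active parameter per token, ignores attention and routing overhead, and uses the hypothetical helper name `approx_training_flops`.

```python
def approx_training_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per active parameter per token."""
    return 6.0 * active_params * tokens


TOKENS = 1e12  # the 1T-token dataset mentioned in the abstract

moe_flops = approx_training_flops(0.85e9, TOKENS)   # 0.85B active parameters
dense_flops = approx_training_flops(6.1e9, TOKENS)  # 6.1B dense parameters

# With identical token counts the ratio reduces to 6.1 / 0.85.
ratio = dense_flops / moe_flops
print(f"dense / MoE compute ratio: {ratio:.2f}")
```

Under these assumptions the ratio comes out at about 7.18, which is consistent with the reported saving of more than 7x.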

Article 644

Title@2025-07-24 (4): MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

Title: MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

MathOPEval: Ein feinkörniger Evaluations-Benchmark für visuelle Operationen von MLLMs in mathematischer Reasoning

MathOPEval:数学理由中MLLMs视觉操作精美评价基准 2507.18140v1

Authors (8): Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu, Junyang Lin

Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM’s ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLMs’ code-based capabilities in multi-modal mathematical reasoning. Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model’s ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model’s capacity for fine-grained operations, which include three types: Deletion, Modification and Annotation. To evaluate the above tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.

Article 645

Title@2025-07-24 (4): OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Title: OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

OPeRA: Ein Datensatz von Beobachtung, Persona, Rationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation

OPeRA: 人类在线购物行为模拟观察、人、理由和评估LLMs的数据集 2506.05606v4

Authors (16): Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang

Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating “believable” human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user’s next action and rationale given a persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.

Article 646

Title@2025-07-24 (4): Actively evaluating and learning the distinctions that matter: Vaccine safety signal detection from emergency triage notes

Title: Actively evaluating and learning the distinctions that matter: Vaccine safety signal detection from emergency triage notes

Aktive Bewertung und Erlernen der Unterscheidungen, die wichtig sind: Vakzin-Sicherheitssignalerkennung aus Not-Triage-Notizen

积极评价和学习重要的区别:疫苗安全信号从紧急分级记录中探测到的疫苗安全信号 2507.18123v1

Authors (7): Sedigh Khademi, Christopher Palmer, Muhammad Javed, Hazel Clothier, Jim Buttery, Gerardo Luis Dimaguila, Jim Black

The rapid development of COVID-19 vaccines has showcased the global community’s ability to combat infectious diseases. However, the need for post-licensure surveillance systems has grown due to the limited window for safety data collection in clinical trials and early widespread implementation. This study aims to employ Natural Language Processing techniques and Active Learning to rapidly develop a classifier that detects potential vaccine safety issues from emergency department notes. ED triage notes, containing expert, succinct vital patient information at the point of entry to health systems, can significantly contribute to timely vaccine safety signal surveillance. While keyword-based classification can be effective, it may yield false positives and demand extensive keyword modifications. This is exacerbated by the infrequency of vaccination-related ED presentations and their similarity to other reasons for ED visits. NLP offers a more accurate and efficient alternative, albeit requiring annotated data, which is often scarce in the medical field. Active learning optimizes the annotation process and the quality of annotated data, which can result in faster model implementation and improved model performance. This work combines active learning, data augmentation, and active learning and evaluation techniques to create a classifier that is used to enhance vaccine safety surveillance from ED triage notes.
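
To illustrate how active learning can reduce annotation effort, here is a minimal sketch of uncertainty sampling for a binary classifier. The abstract does not say which query strategy the authors use, so uncertainty sampling is a swapped-in, commonly used choice; `select_for_annotation` and the toy probabilities are hypothetical.

```python
from typing import List, Sequence


def select_for_annotation(probabilities: Sequence[float],
                          batch_size: int) -> List[int]:
    """Pick the notes the current classifier is least sure about.

    probabilities[i] is the model's predicted probability that note i
    describes a vaccine-related presentation. Notes closest to 0.5 are
    the most uncertain and are sent to annotators first.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:batch_size]


# Toy predictions for five unlabeled triage notes.
predicted = [0.95, 0.52, 0.10, 0.45, 0.70]
to_label = select_for_annotation(predicted, batch_size=2)
```

In a full loop, the selected notes would be labeled, added to the training set, and the classifier retrained before the next selection round.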

Article 647

Title: When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems

Wenn Autonomie Rogue: Vorbereitung auf Risiken der Multi-Agenten-Kollusion in sozialen Systemen

当自治时,罗格:准备应对社会系统中多机构串通的风险 2507.14660v2

Authors (7): Qibing Ren, Sitao Xie, Longxuan Wei, Zhenfei Yin, Junchi Yan, Lizhuang Ma, Jing Shao

Recent large-scale events like election fraud and financial scams have shown how harmful coordinated efforts by human groups can be. With the rise of autonomous AI systems, there is growing concern that AI-driven groups could also cause similar harm. While most AI safety research focuses on individual AI systems, the risks posed by multi-agent systems (MAS) in complex real-world situations are still underexplored. In this paper, we introduce a proof-of-concept to simulate the risks of malicious MAS collusion, using a flexible framework that supports both centralized and decentralized coordination structures. We apply this framework to two high-risk fields: misinformation spread and e-commerce fraud. Our findings show that decentralized systems are more effective at carrying out malicious actions than centralized ones. The increased autonomy of decentralized systems allows them to adapt their strategies and cause more damage. Even when traditional interventions, like content flagging, are applied, decentralized groups can adjust their tactics to avoid detection. We present key insights into how these malicious groups operate and the need for better detection systems and countermeasures. Code is available at https://github.com/renqibing/RogueAgent.

Article 648

Title@2025-07-24 (4): Agentic AI framework for End-to-End Medical Data Inference

Title: Agentic AI framework for End-to-End Medical Data Inference

Agentische KI-Framework für Ende-zu-Ende medizinische Datenableitung

端到端医疗数据推断的Agentic AI框架 2507.18115v1

Authors (5): Soorya Ram Shimgekar, Shayan Vassef, Abhay Goyal, Navin Kumar, Koustuv Saha

Building and deploying machine learning solutions in healthcare remains expensive and labor-intensive due to fragmented preprocessing workflows, model compatibility issues, and stringent data privacy constraints. In this work, we introduce an Agentic AI framework that automates the entire clinical data pipeline, from ingestion to inference, through a system of modular, task-specific agents. These agents handle both structured and unstructured data, enabling automatic feature selection, model selection, and preprocessing recommendation without manual intervention. We evaluate the system on publicly available datasets from geriatrics, palliative care, and colonoscopy imaging. For example, in the case of structured data (anxiety data) and unstructured data (colonoscopy polyps data), the pipeline begins with file-type detection by the Ingestion Identifier Agent, followed by the Data Anonymizer Agent ensuring privacy compliance, where we first identify the data type and then anonymize it. The Feature Extraction Agent identifies features using an embedding-based approach for tabular data, extracting all column names, and a multi-stage MedGemma-based approach for image data, which infers modality and disease name. These features guide the Model-Data Feature Matcher Agent in selecting the best-fit model from a curated repository. The Preprocessing Recommender Agent and Preprocessing Implementor Agent then apply tailored preprocessing based on data type and model requirements. Finally, the “Model Inference Agent” runs the selected model on the uploaded data and generates interpretable outputs using tools like SHAP, LIME, and DETR attention maps. By automating these high-friction stages of the ML lifecycle, the proposed framework reduces the need for repeated expert intervention, offering a scalable, cost-efficient pathway for operationalizing AI in clinical environments.

Article 649

Title@2025-07-24 (4): A New Pair of GloVes

Title: A New Pair of GloVes

Ein neues Paar GloVes

新的地球之对 2507.18103v1

Authors (3): Riley Carlson, John Bauer, Christopher D. Manning

This report documents, describes, and evaluates new 2024 English GloVe (Global Vectors for Word Representation) models. While the original GloVe models built in 2014 have been widely used and found useful, languages and the world continue to evolve and we thought that current usage could benefit from updated models. Moreover, the 2014 models were not carefully documented as to the exact data versions and preprocessing that were used, and we rectify this by documenting these new models. We trained two sets of word embeddings using Wikipedia, Gigaword, and a subset of Dolma. Evaluation through vocabulary comparison, direct testing, and NER tasks shows that the 2024 vectors incorporate new culturally and linguistically relevant words, perform comparably on structural tasks like analogy and similarity, and demonstrate improved performance on recent, temporally dependent NER datasets such as non-Western newswire data.

Article 650

Title@2025-07-24 (4): Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation

Title: Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation

Lang-Short-Distanz Graph Neural Networks und verbessertes Curriculum-Lernen für Emotionserkennung im Gespräch

长短距离远距神经神经网络和改进课程学习,以在对话中认识情感 2507.15205v2

Authors (3): Xinran Li, Xiujuan Xu, Jiaqi Qiao

Emotion Recognition in Conversation (ERC) is a practical and challenging task. This paper proposes a novel multimodal approach, the Long-Short Distance Graph Neural Network (LSDGNN). Based on the Directed Acyclic Graph (DAG), it constructs a long-distance graph neural network and a short-distance graph neural network to obtain multimodal features of distant and nearby utterances, respectively. To ensure that long- and short-distance features are as distinct as possible in representation while enabling mutual influence between the two modules, we employ a Differential Regularizer and incorporate a BiAffine Module to facilitate feature interaction. In addition, we propose an Improved Curriculum Learning (ICL) to address the challenge of data imbalance. By computing the similarity between different emotions to emphasize the shifts in similar emotions, we design a “weighted emotional shift” metric and develop a difficulty measurer, enabling a training process that prioritizes learning easy samples before harder ones. Experimental results on the IEMOCAP and MELD datasets demonstrate that our model outperforms existing benchmarks.

Article 651

Title@2025-07-24 (4): ELITE: Enhanced Language-Image Toxicity Evaluation for Safety

Title: ELITE: Enhanced Language-Image Toxicity Evaluation for Safety

ELITE: Verbesserte Sprach-Image-Toxizitätsbewertung für Sicherheit

ELITE:加强语言-图像安全毒性评价 2502.04757v3

Authors (8): Wonjun Lee, Doehyeon Lee, Eugene Choi, Sangyoon Yu, Ashkan Yousefpour, Haon Park, Bumsub Ham, Suhyun Kim

Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs. Existing safety benchmarks for VLMs primarily rely on automated evaluation methods, but these methods struggle to detect implicit harmful content and can produce inaccurate evaluations. As a result, we found that existing benchmarks have low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. To address these issues, we propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator. The ELITE evaluator explicitly incorporates a toxicity score to accurately assess harmfulness in multimodal contexts, where VLMs often provide specific, convincing, but harmless descriptions of images. We filter out ambiguous and low-quality image-text pairs from existing benchmarks using the ELITE evaluator and generate diverse combinations of safe and unsafe image-text pairs. Our experiments demonstrate that the ELITE evaluator achieves superior alignment with human evaluations compared to prior automated methods, and the ELITE benchmark offers enhanced benchmark quality and diversity. By introducing ELITE, we pave the way for safer, more robust VLMs, contributing essential tools for evaluating and mitigating safety risks in real-world applications.

Article 652

Title@2025-07-24 (4): Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints

Title: Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints

Hybrides und einheitliches Feintuning von großen Sprachmodellen: Methoden und Benchmarking unter Ressourcenbeschränkungen

大语言模式统一调整和统一调整适用:在资源限制下的方法和基准 2507.18076v1

Authors (3): Haomin Qi, Zihan Dai, Chengbo Huang

Fine-tuning large language models (LLMs) remains a computational bottleneck due to their scale and memory demands. This paper presents a comprehensive evaluation of parameter-efficient fine-tuning (PEFT) techniques, including LoRA, BOFT, LoRA-GA, and uRNN, and introduces a novel hybrid strategy that dynamically integrates BOFT’s orthogonal stability with LoRA-GA’s gradient-aligned rapid convergence. By computing per-layer adaptive updates guided by gradient norms, the hybrid method achieves superior convergence efficiency and generalization across diverse tasks. We also explore, for the first time, the adaptation of unitary RNN (uRNN) principles to transformer-based LLMs, enhancing gradient stability through structured unitary constraints. Empirical evaluations on four benchmarks – GLUE, GSM8K, MT-Bench, and HumanEval – using models ranging from 7B to 405B parameters demonstrate that our hybrid method consistently outperforms individual PEFT baselines, approaching full fine-tuning accuracy while reducing resource consumption by up to 2.1 times in training time and 50 percent in memory usage. These findings establish the hybrid approach as a practical and scalable fine-tuning solution for real-world deployment of LLMs under resource constraints.

Article 653

Title@2025-07-24 (4): BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference

Title: BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference

BlockDialect: Blockweise feinkörnige Mischformat-Quantisierung für energieeffiziente LLM-Inferenz

BlockDialect: 节能LLM 推论的粗件精细混合格式量化 2501.01144v5

Authors (2): Wonsuk Jang, Thierry Tambe

The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
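
The core idea of choosing a per-block number format from a formatbook can be sketched as follows. This is an illustration under stated assumptions, not the BlockDialect algorithm or the actual DialectFP4 formatbook: the two formats in `FORMATBOOK` are stand-ins (a uniform 8-level grid and the standard FP4 E2M1 magnitudes), selection is by squared error, and `quantize_block` and `pick_format` are hypothetical names.

```python
from typing import Dict, List, Sequence

# Hypothetical formatbook: each format lists its representable magnitudes.
# These are stand-ins for illustration, not the DialectFP4 variants.
FORMATBOOK: Dict[str, List[float]] = {
    "uniform": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "e2m1_like": [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
}


def quantize_block(block: Sequence[float], levels: Sequence[float]) -> List[float]:
    """Scale the block so its peak maps to the largest level, then round
    each magnitude to the nearest representable level."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return [0.0] * len(block)
    scale = peak / max(levels)  # one shared scale factor per block
    quantized = []
    for x in block:
        magnitude = min(levels, key=lambda v: abs(abs(x) / scale - v))
        quantized.append((magnitude if x >= 0 else -magnitude) * scale)
    return quantized


def pick_format(block: Sequence[float],
                formatbook: Dict[str, List[float]] = FORMATBOOK) -> str:
    """Return the name of the format with the lowest squared error."""
    def squared_error(name: str) -> float:
        approx = quantize_block(block, formatbook[name])
        return sum((a - b) ** 2 for a, b in zip(block, approx))
    return min(formatbook, key=squared_error)


evenly_spread = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
small_with_outlier = [0.5, 1.0, 1.5, 6.0]
```

An evenly spread block is represented exactly by the uniform grid, while a block of small values with one outlier fits the non-uniform grid better, which is why a per-block choice can beat a single fixed format.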

Article 654

Title@2025-07-24 (4): TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

Title: TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

TELEVAL: Ein dynamischer Benchmark für gesprochene Sprachmodelle in chinesischen interaktiven Szenarien

TELEVAL:为中文互动假想中的口语模式设计的一个动态基准 2507.18061v1

Authors (14): Zehan Li, Hongjie Chen, Yuxin Zhang, Jing Zhou, Xuening Wang, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li

Spoken language models (SLMs) have seen rapid progress in recent years, along with the development of numerous benchmarks for evaluating their performance. However, most existing benchmarks primarily focus on evaluating whether SLMs can perform complex tasks comparable to those tackled by large language models (LLMs), often failing to align with how users naturally interact in real-world conversational scenarios. In this paper, we propose TELEVAL, a dynamic benchmark specifically designed to evaluate SLMs’ effectiveness as conversational agents in realistic Chinese interactive settings. TELEVAL defines three evaluation dimensions: Explicit Semantics, Paralinguistic and Implicit Semantics, and System Abilities. It adopts a dialogue format consistent with real-world usage and evaluates text and audio outputs separately. TELEVAL particularly focuses on the model’s ability to extract implicit cues from user speech and respond appropriately without additional instructions. Our experiments demonstrate that despite recent progress, existing SLMs still have considerable room for improvement in natural conversational tasks. We hope that TELEVAL can serve as a user-centered evaluation framework that directly reflects the user experience and contributes to the development of more capable dialogue-oriented SLMs.

Article 655

Title@2025-07-24 (4): Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias

Causally Testing Gender Bias in LLMs: Eine Fallstudie über berufsbezogene Bias

《LLMM中因果测试性别偏见:职业偏见案例研究》 2212.10678v4

Authors (5): Yuen Chen, Vethavikashini Chithrra Raghuram, Justus Mattern, Rada Mihalcea, Zhijing Jin

Generated texts from large language models (LLMs) have been shown to exhibit a variety of harmful, human-like biases against various demographics. These findings motivate research efforts aiming to understand and measure such effects. This paper introduces a causal formulation for bias measurement in generative language models. Based on this theoretical foundation, we outline a list of desiderata for designing robust bias benchmarks. We then propose a benchmark called OccuGender, with a bias-measuring procedure to investigate occupational gender bias. We test several state-of-the-art open-source LLMs on OccuGender, including Llama, Mistral, and their instruction-tuned versions. The results show that these models exhibit substantial occupational gender bias. Lastly, we discuss prompting strategies for bias mitigation and an extension of our causal formulation to illustrate the generalizability of our framework. Our code and data are available at https://github.com/chenyuen0103/gender-bias.

Article 656

Title@2025-07-24 (4): A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

Ein Multi-Faceted-Evaluierungsrahmen für die Bewertung synthetischer Daten, erzeugt durch große Sprachmodelle

评估由大语言模型生成的合成数据多面评价框架 2404.14445v2

Authors (3): Yefeng Yuan, Yuhong Liu, Liang Cheng

The rapid advancements in generative AI and large language models (LLMs) have opened up new avenues for producing synthetic data, particularly in the realm of structured tabular formats, such as product reviews. Despite the potential benefits, concerns regarding privacy leakage have surfaced, especially when personal information is utilized in the training datasets. In addition, there is an absence of a comprehensive evaluation framework capable of quantitatively measuring the quality of the generated synthetic data and their utility for downstream tasks. In response to this gap, we introduce SynEval, an open-source evaluation framework designed to assess the fidelity, utility, and privacy preservation of synthetically generated tabular data via a suite of diverse evaluation metrics. We validate the efficacy of our proposed framework - SynEval - by applying it to synthetic product review data generated by three state-of-the-art LLMs: ChatGPT, Claude, and Llama. Our experimental findings illuminate the trade-offs between various evaluation metrics in the context of synthetic data generation. Furthermore, SynEval stands as a critical instrument for researchers and practitioners engaged with synthetic tabular data, empowering them to judiciously determine the suitability of the generated data for their specific applications, with an emphasis on upholding user privacy.

Article 657

Title@2025-07-24 (4): Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs

Privacy-Preserving Synthetic Review Generation mit unterschiedlichen Schreibstilen mit LLMs

使用LLMMs以多种写作风格生成的隐私-保护合成审查 2507.18055v1

Authors (6): Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng

The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data provides a cost-effective, scalable alternative to real-world data to facilitate model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective), and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experiment results reveal significant limitations in LLMs’ capabilities in generating diverse and privacy-preserving synthetic data. Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while preserving reviewer privacy.

Article 658

Title@2025-07-24 (4): From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems

Von der Hypothese zur Veröffentlichung: Eine umfassende Umfrage zu KI-getriebenen Forschungsunterstützungssystemen

从假设到出版物:AI-Driven研究支助系统综合调查 2503.01424v3

Authors (14): Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, Bing Qin

Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.

Article 659

Title@2025-07-24 (4): RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models

EINGEDENK: Ein ungebundener Ressourcenverbrauchsangriff auf große Visions-Sprachenmodelle

回顾:对大型愿景-语言模型的无约束资源消费攻击 2507.18053v1

Authors (9): Haoran Gao, Yuanhe Zhang, Zhenhong Zhou, Lei Jiang, Fanyu Meng, Yujia Xiao, Kun Wang, Yang Liu, Junlan Feng

Resource Consumption Attacks (RCAs) have emerged as a significant threat to the deployment of Large Language Models (LLMs). With the integration of vision modalities, additional attack vectors exacerbate the risk of RCAs in large vision-language models (LVLMs). However, existing red-teaming studies have largely overlooked visual inputs as a potential attack surface, resulting in insufficient mitigation strategies against RCAs in LVLMs. To address this gap, we propose RECALLED (REsource Consumption Attack on Large Vision-LanguagE MoDels), the first approach for exploiting visual modalities to trigger unbounded RCAs red-teaming. First, we present Vision Guided Optimization, a fine-grained pixel-level optimization, to obtain Output Recall adversarial perturbations, which can induce repeating output. Then, we inject the perturbations into visual inputs, triggering unbounded generations to achieve the goal of RCAs. Additionally, we introduce Multi-Objective Parallel Losses to generate universal attack templates and resolve optimization conflicts when intending to implement parallel attacks. Empirical results demonstrate that RECALLED increases service response latency by over 26 ↑, resulting in an additional 20% increase in GPU utilization and memory consumption. Our study exposes security vulnerabilities in LVLMs and establishes a red-teaming framework that can facilitate future defense development against RCAs.

Article 660

Title@2025-07-24 (4): Segmentation-free Goodness of Pronunciation

Segmentierungsfreie Güte der Aussprache

读音良好 2507.16838v2

Authors (4): Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi

Mispronunciation detection and diagnosis (MDD) is a significant part of modern computer-aided language learning (CALL) systems. Within MDD, phoneme-level pronunciation assessment is key to helping L2 learners improve their pronunciation. However, most systems are based on a form of goodness of pronunciation (GOP) which requires pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility of using modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA), which enables the use of CTC-trained ASR models for MDD. Next, we define a more general alignment-free method that takes all possible alignments of the target phoneme into account (GOP-AF). We give a theoretical account of our definition of GOP-AF, an implementation that solves potential numerical issues, and a proper normalization that makes the method applicable to acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and Speechocean762 datasets, comparing the different definitions of our methods and estimating the dependency of GOP-AF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies on the Speechocean762 data, showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.

Article 661

Title@2025-07-24 (4): Synthetic Data Generation for Phrase Break Prediction with Large Language Model

Synthetische Datengenerierung für Phrase Break Prediction mit großem Sprachmodell

制作用于大语言模范大语言时段间断预测的合成数据 2507.18044v1

Authors (4): Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim

Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but heavily rely on vast human annotations from audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLMs to generate synthetic phrase break annotations, addressing the challenges of both manual annotation and speech-related tasks by comparing with traditional annotations and assessing effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction, and they highlight the potential of LLMs as a viable solution for the speech domain.

Article 662

Title@2025-07-24 (4): GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

GrAInS: Gradient-basierte Zuordnung zur Inferenz-Zeitlenkung von LLMs und VLMs

GrAInS:LLMs和VLMs的推论时间指导的逐步归属 2507.18043v1

Authors (4): Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model’s logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model’s fluency and general capabilities.
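
As a rough illustration of the attribution step, the sketch below applies Integrated Gradients to a toy score and then selects the top-k tokens by absolute attribution. It is a sketch under stated assumptions, not the GrAInS implementation: `contrast_score` is a hypothetical stand-in for the preferred-versus-dispreferred logit gap, each token is reduced to a single scalar feature, and `steer` (shift, then rescale to the original norm) is only one plausible reading of "normalizes activations to preserve representational scale".

```python
import math


def contrast_score(x, w):
    """Toy stand-in (assumption) for log p(preferred) - log p(dispreferred)."""
    return sum(wi * xi * xi for wi, xi in zip(w, x))


def contrast_grad(x, w):
    """Analytic gradient of the toy score with respect to each token feature."""
    return [2.0 * wi * xi for wi, xi in zip(w, x)]


def integrated_gradients(x, baseline, w, steps=50):
    """Midpoint Riemann approximation of Integrated Gradients along the
    straight path from `baseline` to `x`."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(contrast_grad(point, w)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def top_k_tokens(attributions, k):
    """Indices of the k tokens with the largest absolute attribution."""
    order = sorted(range(len(attributions)),
                   key=lambda i: abs(attributions[i]), reverse=True)
    return order[:k]


def steer(hidden, direction, strength):
    """Shift a hidden vector along `direction`, then rescale it to its
    original L2 norm (hypothetical norm-preserving update)."""
    shifted = [h + strength * d for h, d in zip(hidden, direction)]
    old_norm = math.sqrt(sum(h * h for h in hidden))
    new_norm = math.sqrt(sum(s * s for s in shifted))
    if new_norm == 0.0:
        return shifted
    return [s * old_norm / new_norm for s in shifted]
```

A useful sanity check for any Integrated Gradients implementation is the completeness property: the attributions should sum to the difference between the score at the input and the score at the baseline.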

Article 663

Title@2025-07-24 (4): AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark

AIR-Bench: Automatisierte Heterogene Information Retrieval Benchmark

AIR-Bench:自动异源信息检索基准 2412.13102v4

Authors (9): Jianlyu Chen, Nan Wang, Chaofan Li, Bo Wang, Shitao Xiao, Han Xiao, Hao Liao, Defu Lian, Zheng Liu

Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.

Article 664

Title@2025-07-24 (4): NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database

NeuralDB: Skalierung von Wissen in LLMs auf 100.000 Fakten mit neuraler KV-Datenbank

NeuralDDB: 将知识编辑在LLM 中到 100,000 千兆瓦的Neural KV 数据库中 2507.18028v1

Authors (10): Weizhi Fei, Hao Shi, Jing Xu, Jingchen Peng, Jiazheng Li, Jingzhao Zhang, Bo Bai, Wei Han, Zhenyuan Chen, Xueyan Niu

Efficiently editing knowledge stored in large language models (LLMs) enables model updates without large-scale training. One possible solution is Locate-and-Edit (L&E), allowing simultaneous modifications of a massive number of facts. However, such editing may compromise the general abilities of LLMs and even result in forgetting edited facts when scaling up to thousands of edits. In this paper, we model existing linear L&E methods as querying a Key-Value (KV) database. From this perspective, we then propose NeuralDB, an editing framework that explicitly represents the edited facts as a neural KV database equipped with a non-linear gated retrieval module. In particular, our gated module only operates when inference involves the edited facts, effectively preserving the general abilities of LLMs. Comprehensive experiments involving the editing of 10,000 facts were conducted on the ZsRE and CounterFacts datasets, using GPT2-XL, GPT-J (6B) and Llama-3 (8B). The results demonstrate that NeuralDB not only excels in editing efficacy, generalization, specificity, fluency, and consistency, but also preserves overall performance across six representative text understanding and generation tasks. Further experiments indicate that NeuralDB maintains its effectiveness even when scaled to 100,000 facts (50x more than in prior work).
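
The idea of a key-value store whose retrieval is gated can be sketched as follows. This is a hedged illustration only: the class `GatedKVStore`, the cosine-similarity match, and the hard threshold are assumptions chosen for clarity, and the abstract does not specify the form of NeuralDB's non-linear gate.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


class GatedKVStore:
    """Hypothetical store of (key, residual) pairs, one per edited fact."""

    def __init__(self, threshold=0.9):
        self.keys = []
        self.values = []
        self.threshold = threshold  # assumed gate: minimum similarity to fire

    def add(self, key, value):
        self.keys.append(list(key))
        self.values.append(list(value))

    def retrieve(self, query):
        """Return the residual of the best-matching key, or None when no key
        is similar enough (the gate stays closed)."""
        best_sim, best_value = -1.0, None
        for key, value in zip(self.keys, self.values):
            sim = cosine(query, key)
            if sim > best_sim:
                best_sim, best_value = sim, value
        if best_value is None or best_sim < self.threshold:
            return None
        return best_value


def apply_edit(hidden_out, store, query):
    """Add the retrieved residual to a layer output; leave it untouched
    when the query does not match any edited fact."""
    residual = store.retrieve(query)
    if residual is None:
        return list(hidden_out)
    return [h + r for h, r in zip(hidden_out, residual)]
```

In this sketch, queries unrelated to any stored key pass through unchanged. That is the behaviour the abstract attributes to the gate: the module only operates when inference involves the edited facts.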

Article 665

Title@2025-07-24 (4): GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures

GRR-CoCa: LLM-Mechanismen in multimodalen Modellarchitekturen nutzen

GRR-CoCa:在多模式建模中利用LLM机制 2507.18009v1

Authors (6): Jake R. Patock, Nicole Catherine Lewis, Kevin McCoy, Christina Gomez, Canling Chen, Lorenzo Luzi

State-of-the-art (SOTA) image and text generation models are multimodal models that have many similarities to large language models (LLMs). Despite achieving strong performances, leading foundational multimodal model architectures frequently lag behind the architectural sophistication of contemporary LLMs. We propose GRR-CoCa, an improved SOTA Contrastive Captioner (CoCa) model that incorporates Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding into the textual decoders and the vision transformer (ViT) encoder. Each architectural modification has been shown to improve model performance in LLMs, but has yet to be adopted in CoCa. We benchmarked GRR-CoCa against Baseline CoCa, a model with the same modified textual decoders but with CoCa’s original ViT encoder. We used standard pretraining and fine-tuning workflows to benchmark the models on contrastive and generative tasks. Our GRR-CoCa significantly outperformed Baseline CoCa on the pretraining dataset and three diverse fine-tuning datasets. Pretraining improvements were 27.25% in contrastive loss, 3.71% in perplexity, and 7.15% in CoCa loss. The average fine-tuning improvements were 13.66% in contrastive loss, 5.18% in perplexity, and 5.55% in CoCa loss. We show that GRR-CoCa’s modified architecture improves performance and generalization across vision-language domains.
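
For readers unfamiliar with the three components named above, here is a minimal pure-Python sketch of each, operating on a single vector. It follows the commonly used definitions of RMS normalization, GEGLU, and rotary positional embedding (interleaved pairs, even dimension assumed). The function names, the bias-free projections, and the `base=10000.0` default are assumptions, and the sketch is not taken from the GRR-CoCa code.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Root mean square normalization: divide by the RMS of x, then apply a gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def gelu(v):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def geglu(x, w_gate, w_value):
    """Gaussian error gated linear unit: GELU(W_gate x) * (W_value x), elementwise."""
    gate = [gelu(v) for v in matvec(w_gate, x)]
    value = matvec(w_value, x)
    return [g * v for g, v in zip(gate, value)]


def rope(x, position, base=10000.0):
    """Rotary positional embedding: rotate each pair (x[i], x[i+1]) by an
    angle proportional to the position, with a per-pair frequency."""
    d = len(x)
    out = list(x)
    for i in range(0, d - 1, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

The rotation in `rope` makes the dot product between a rotated query and a rotated key depend only on their relative offset, which is the property that motivates rotary embeddings.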

Article 0

Title@2025-07-31 (4): Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

Article 1

Title@2025-07-31 (4): SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model

Article 2

Title@2025-07-31 (4): Perception-Aware Policy Optimization for Multimodal Reasoning

Article 3

Title@2025-07-31 (4): CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

Article 4

Title@2025-07-31 (4): Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

Article 5

Title@2025-07-31 (4): How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment

Article 6

Title@2025-07-31 (4): Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Article 7

Title@2025-07-31 (4): RecGPT Technical Report

Article 8

Title@2025-07-31 (4): Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Article 9

Title@2025-07-31 (4): TextQuests: How Good are LLMs at Text-Based Video Games?

Article 10

Title@2025-07-31 (4): TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

Article 11

Title@2025-07-31 (4): Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning

Article 12

Title@2025-07-31 (4): DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures

Article 13

Title@2025-07-31 (4): Who’s important? – SUnSET: Synergistic Understanding of Stakeholder, Events and Time for Timeline Generation

Article 14

Title@2025-07-31 (4): How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Article 15

Title@2025-07-31 (4): Splits! A Flexible Dataset and Evaluation Framework for Sociocultural Linguistic Investigation

Article 16

Title@2025-07-31 (4): ILID: Native Script Language Identification for Indian Languages

Article 17

Title@2025-07-31 (4): Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates

Article 18

Title@2025-07-31 (4): Inside-Out: Hidden Factual Knowledge in LLMs

Article 19

Title@2025-07-31 (4): DiffLoRA: Differential Low-Rank Adapters for Large Language Models

Article 20

Title@2025-07-31 (4): T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text

Article 21

Title@2025-07-31 (4): Neutral Residues: Revisiting Adapters for Model Extension

Article 22

Title@2025-07-31 (4): Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation

Article 23

Title@2025-07-31 (4): Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

Article 24

Title@2025-07-31 (4): PurpCode: Reasoning for Safer Code Generation

Article 25

Title@2025-07-31 (4): MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

Article 26

Title@2025-07-31 (4): LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Article 27

Title@2025-07-31 (4): A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Article 28

Title@2025-07-31 (4): Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Article 29

Title@2025-07-31 (4): Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

Article 30

Title@2025-07-31 (4): EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework

Article 31

Title@2025-07-31 (4): The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models

Article 32

Title@2025-07-31 (4): Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

Article 33

Title@2025-07-31 (4): RAVine: Reality-Aligned Evaluation for Agentic Search

Article 34

Title@2025-07-31 (4): Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

Article 35

Title@2025-07-31 (4): MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

Article 36

Title@2025-07-31 (4): Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

Article 37

Title@2025-07-31 (4): Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

Article 38

Title@2025-07-31 (4): MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

Article 39

Title@2025-07-31 (4): Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models