• 00 06-12 (4) How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? Wie gut können Reasoning-Modelle nicht hilfreiche Gedanken erkennen und sich von ihnen erholen? 推理模型识别无益想法并从中恢复的能力如何? 2506.10979v1
  • 01 06-12 AutoMind: Adaptive Knowledgeable Agent for Automated Data Science AutoMind: Adaptives Knowledgeable Agent für automatisierte Datenwissenschaft 自动Mind:自动数据科学适应性知识代理 2506.10974v1
  • 02 06-12 MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning MMMG: Ein massiver, multidisziplinärer, mehrstufiger Generierungs-Benchmark für Text-zu-Bild-Reasoning MMMG:大规模、多学科、多层级的文本到图像推理生成基准 2506.10963v1
  • 03 06-12 ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark ChinesischHarm-Bench: Ein chinesischer schädlicher Content Detection Benchmark 中中汉禁区:中国有害内容检测基准 2506.10960v1
  • 04 06-12 Build the web for agents, not agents for the web Erstellen Sie das Web für Agenten, nicht Agenten für das Web 为代理者而不是网络代理者建立网络 2506.10953v1
  • 05 06-12 Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training Domain2Vec: Vectorizing Datasets, um die optimale Datenmischung ohne Training zu finden 域2Vec: 将数据集矢量化,以查找未经过培训的最佳数据混合体 2506.10952v1
  • 06 06-12 GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models GUARD: Geführtes Verlernen und Behalten über Datenattribution für große Sprachmodelle GUARD:通过数据归因引导大语言模型的遗忘与保留 2506.10946v1
  • 07 06-12 VINCIE: Unlocking In-context Image Editing from Video VINCIE: Im Kontext Bildbearbeitung von Video entsperren VINCIE: 从视频中解锁 Incontext 图像编辑 2506.10941v1
  • 08 06-12 Visually Descriptive Language Model for Vector Graphics Reasoning Visuell Deskriptives Sprachmodell für Vektorgrafiken 矢量图形推理视觉描述语言模型 2404.06479v5
  • 09 06-12 Dynamic Epistemic Friction in Dialogue Dynamische epistemische Reibung im Dialog 对话框中的动态瞬间摩擦 2506.10934v1
  • 10 06-12 Improving LLM Safety Alignment with Dual-Objective Optimization Verbesserung der LLM-Sicherheitsausrichtung mit Dual-Ziel-Optimierung 提高LLM安全一致性,实现双目标优化 2503.03710v2
  • 11 06-12 Robustly Improving LLM Fairness in Realistic Settings via Interpretability Robuste Verbesserung der LLM-Fairness in realistischen Szenarien durch Interpretierbarkeit 通过可解释性在现实环境中稳健地提高LLM公平性 2506.10922v1
  • 12 06-12 Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization Zerlegung von MLP-Aktivierungen in interpretierbare Merkmale mittels semi-nichtnegativer Matrixfaktorisierung 通过半非负矩阵分解将MLP激活分解为可解释特征 2506.10920v1
  • 13 06-12 Weak-to-Strong Jailbreaking on Large Language Models Weak-to-Strong-Jailbreaking bei großen Sprachmodellen 针对大语言模型的由弱到强越狱攻击 2401.17256v3
  • 14 06-12 Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification? Breaking Bad Molecules: Sind MLLMs bereit für die molekulare Entgiftung auf Strukturebene? 破碎坏分子:MLLMs是否准备好进行结构级分子解毒? 2506.10912v1
  • 15 06-12 Efficiently Identifying Watermarked Segments in Mixed-Source Texts Effiziente Identifikation von wassermarkierten Segmenten in Mixed-Source-Texten 有效识别混合来源文本中划划水段 2410.03600v2
  • 16 06-12 Magistral Magistral Magistral 2506.10910v1
  • 17 06-12 Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning Jenseits von Goldstandards: Epistemisches Ensemble von LLM-Richtern für formale mathematische Vernunft 超越金金标准:法学硕士正式数学理由法官集会 2506.10903v1
  • 18 06-12 BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP BioClinical ModernBERT: Ein hochmoderner Long-Context-Encoder für biomedizinische und klinische NLP BioClinical ModernBERT:面向生物医学和临床NLP的最先进长上下文编码器 2506.10896v1
  • 19 06-12 The Diffusion Duality Die Diffusionsdualität 传播质量 2506.10892v1
  • 20 06-12 PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play PLAY2PROMPT: Zero-Shot-Optimierung von Tool-Anweisungen für LLM-Agenten über Tool Play PLAY2PROMPT:通过工具试用为LLM智能体优化零样本工具指令 2503.14432v2
  • 21 06-12 Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers Verallgemeinerung oder Halluzination? Verstehen von Out-of-Context-Reasoning in Transformers 通化还是幻觉? 理解变异器的逻辑外原因 2506.10887v1
  • 22 06-12 Slimming Down LLMs Without Losing Their Minds LLMs abschwächen, ohne ihre Gedanken zu verlieren 在不失去理智的情况下将LLMs 压倒在地 2506.10885v1
  • 23 06-12 Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment Verbesserung des medizinischen Dialogs durch Wissensverfeinerung und dynamische Anpassung 通过知识完善和动态快速调整加强医疗对话 2506.10877v1
  • 24 06-12 Large Language Models for Multilingual Previously Fact-Checked Claim Detection Große Sprachmodelle für die multilinguale bisher Fact-Checked Claim Detection 多语种以前实况调查索赔调查大语言模型 2503.02737v2
  • 25 06-12 Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards Segeln mit den Sternen: Eine Umfrage über Prämienmodelle und Lernstrategien zum Lernen aus Belohnungen 星舰航行:奖励模型调查以及从奖励中学习的学习战略 2505.02686v2
  • 26 06-12 Multi-group Uncertainty Quantification for Long-form Text Generation Multi-Gruppen-Unsicherheits-Quantifizierung für langformige Textgenerierung 长式文本生成的不确定性量化 2407.21057v2
  • 27 06-12 Debiasing Watermarks for Large Language Models via Maximal Coupling Debiasing Wasserzeichen für große Sprachmodelle über Maximal Coupling 通过Maximal Coupling为大语言模型减少对水标记的偏差 2411.11203v2
  • 28 06-12 Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models Analyse der Beziehungen zwischen vorschulischer Sprache, phonetischer, klanglicher und sprachlicher Information in selbstüberwachten Sprachmodellen 以自我监督的演讲模式分析培训前语言、音、音、音、音和演讲者信息之间的关系 2506.10855v1
  • 29 06-12 Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles Beschleunigen von Diffusion Großsprachige Modelle mit SlowFast: Die drei goldenen Prinzipien 以慢速加速传播大语言模式:三大金原则 2506.10848v1
  • 30 06-12 CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training CIIR@LiveRAG 2025: Optimierung der Multi-Agent Retrieval Augmented Generation durch Selbsttraining CIIR@LiveRAG 2025:通过自我培训优化多要求回生增生一代 2506.10844v1
  • 31 06-12 ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization ReCUT: Ausbalancierende Grundlänge und Genauigkeit in LLMs über Schrittweise Trails und Preference Optimization RECUT:通过分步跟踪和优化优化平衡长长和准确性 2506.10822v1
  • 32 06-12 VideoDeepResearch: Long Video Understanding With Agentic Tool Using VideoDeepResearch: Langes Video-Verstehen mit Agentischem Werkzeug 视频深入研究:与使用代理工具的远程视频了解 2506.10821v1
  • 33 06-12 Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints Negative Interferenzen in der Mehrsprachigkeit sequenzieller Wissensbearbeitung durch Null-Raum-Beschränkungen abmildern 减少多语种序列知识编辑的负面干扰 2506.10800v1
  • 34 06-12 The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Der Esethu-Rahmen: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Esethu框架:重新想象可持续数据集治理和低碳语言的理论 2502.15916v2
  • 35 06-12 FASCIST-O-METER: Classifier for Neo-fascist Discourse Online FASCIST-O-METER: Klassifikator für neofaschistischen Diskurs Online FASCIST-O-METER:新法西斯人士在线论文分类 2506.10789v1
  • 36 06-12 Improving Named Entity Transcription with Contextual LLM-based Revision Verbesserung der Transkription der benannten Entität mit kontextueller LLM-basierter Revision 改进以背景LLM为基础订正的命名实体跟踪 2506.10779v1
  • 37 06-12 Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs Unterschiedliche Fragen, unterschiedliche Modelle: Feinkörnige Bewertung von Unsicherheit und Kalibrierung in klinischen QA mit LLMs 不同问题、不同模式:对临床质量评估中不确定性和校准的精细评估 2506.10769v1
  • 38 06-12 Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation Chain-of-Code Collapse: Gründe für Fehler in LLMs über Adversarial Prompting in der Code-Generierung 崩溃链:通过代码生成中的反向提示造成LLMs中失败的原因 2506.06971v2
  • 39 06-12 One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers Ein Tokenizer, um sie alle zu beherrschen: Emergente Sprachplastizität über Mehrsprachige Tokenizer 万能统治者:通过多语种教育者实现新兴语言的可塑性 2506.10766v1
  • 40 06-12 Aspect-Based Opinion Summarization with Argumentation Schemes Aspektbasierte Zusammenfassung der Meinungen mit Argumentierungsschemata 与参数说明方案对照审计意见的概述 2506.09917v2
  • 41 06-12 Great Models Think Alike and this Undermines AI Oversight Große Modelle denken ähnlich, und das untergräbt die KI-Aufsicht 优秀模型所见略同,而这削弱了AI监督 2502.04313v2
  • 42 06-12 Persistent Topological Features in Large Language Models Persistente topologische Merkmale in großen Sprachmodellen 大语言模型中的持久拓扑特征 2410.11042v2
  • 43 06-12 Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering Neural bei ArchEHR-QA 2025: Agentische Prompt-Optimierung für evidenzgestützte klinische Fragebeantwortung Neural参加ArchEHR-QA 2025:面向基于证据的临床问答的智能体提示优化 2506.10751v1
  • 44 06-12 TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora TaxoAdapt: LLM-basierte multidimensionale Taxonomie-Konstruktion an die sich entwickelnde Forschungskorporation ausrichten 将基于LLM的多层面分类学建设与不断发展的研究公司相协调 2506.10737v1
  • 45 06-12 Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims 超越真或假:收回增加的无损失索赔的等级结构分析 2506.10728v1
  • 46 06-12 PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models PREMISE: Skalierbare und strategische Prompt-Optimierung für effiziente mathematische Reasoning in großen Modellen PREMISE:大规模模型中高效数学理由的可扩展和战略快速优化 2506.10716v1
  • 47 06-12 Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet Adjektive Hypernyms mit Sprachmodellen ableiten, um die Konnektivität von Open English Wordnet zu erhöhen 推导语言模型的形容词超音音音,以提高开放英文Wordnet的连通性 2506.10715v1
  • 48 06-12 Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models Anreize für eine fortgeschrittene Instruktions-Folge von großen Sprachmodellen 为采用大语言模式的高级指示提供激励理由 2506.01413v2
  • 49 06-12 PRSA: Prompt Stealing Attacks against Real-World Prompt Services PRSA: Sofortige Diebstahlangriffe gegen Real-World Prompt Services PRSA: 迅速窃盗对现实世界迅速服务公司的袭击 2402.19200v3
  • 50 06-12 FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems FedRAG: Ein Framework für das Fine-Tuning von Retrieval-Augmented-Generation-Systemen FedRAG:检索增强生成系统微调框架 2506.09200v2
  • 51 06-12 SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models SelectLLM: Query-Aware Effiziente Auswahl Algorithmen für große Sprachmodelle 选择LLM: 用于大语言模型的查询- 软件高效选择算法 2408.08545v4
  • 52 06-12 Large Language Models for Detection of Life-Threatening Texts Große Sprachmodelle zur Erkennung lebensbedrohlicher Texte 探测生命威胁文字的长语言大语言模型 2506.10687v1
  • 53 06-12 Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models Habe ich treu gesagt, was ich dachte? Die Kluft zwischen neuraler Aktivität und Selbsterklärungen in großen Sprachmodellen überbrücken 缩小大语言模式中神经活动与自我开发之间的差距 2506.09277v2
  • 54 06-12 TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving TeleMath: Ein Benchmark für große Sprachmodelle in der mathematischen Problemlösung im Telekommunikationsbereich TeleMath:电信数学问题求解的大语言模型基准 2506.10674v1
  • 55 06-12 CoRT: Code-integrated Reasoning within Thinking CoRT: Code-integrierte Vernunft im Denken CORT: 思考中守则综合理由 2506.09820v2
  • 56 06-12 Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes Robuste unüberwachte Adaption eines Spracherkenners mittels Entropieminimierung und Sprechercodes 使用熵最小化和说话人编码对语音识别器进行鲁棒无监督自适应 2506.10653v1
  • 57 06-12 Identifying Reliable Evaluation Metrics for Scientific Text Revision Identifizieren von verlässlichen Bewertungsmetrics für wissenschaftliche Textrevision 科学文本订正的可靠评价计量指标 2506.04772v3
  • 58 06-12 Spelling-out is not Straightforward: LLMs’ Capability of Tokenization from Token to Characters Rechtschreibung ist nicht geradeaus: LLMs Fähigkeit der Tokenisierung von Token zu Charakteren 拼写出不是直向前进的: LLMs 的调制能力从调制字符到字符 2506.10641v1
  • 59 06-12 Conversational Search: From Fundamentals to Frontiers in the LLM Era Conversational Search: Von Grundlagen zu Grenzen in der LLM-Ära 对话搜索:从基本原理到LLM时代的前沿 2506.10635v1
  • 60 06-12 NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors NeuralNexus bei BEA 2025 Shared Task: Retrieval-Augmented Prompting für Fehlererkennung in KI-Tutoren BEA 2025年BEA 的神经外观 共同任务:在 AI 导师中检索错误识别提示 2506.10627v1
  • 61 06-12 SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis SDialog: Ein Python-Toolkit für die Synthetische Dialog-Generierung und -Analyse SDialog:合成对话生成和分析的Python工具包 2506.10622v1
  • 62 06-12 Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code Deep Learning-based Digitalisierung von überlappenden EKG-Bildern mit Open-Source-Python-Code 使用开放源码的 ECG 重叠图像的深学习数字化 2506.10617v1
  • 63 06-12 Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search Unüberwachte protoforme Rekonstruktion durch Parsimonious Regel-geführte Heuristiken und evolutionäre Suche 通过法理学、法理学、受规制的哲理学和进化搜索进行不受监督的原形重建 2506.10614v1
  • 64 06-12 ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization ConfPO: Ausnutzen des politischen Modells Vertrauen für kritische Token-Auswahl in Präferenz-Optimierung 召集:利用政策模范信心在优先最佳化中选择关键物优选标准 2506.08712v2
  • 65 06-12 IPA-CHILDES & G2P+: Feature-Rich Resources for Cross-Lingual Phonology and Phonemic Language Modeling IPA-CHILDES & G2P+: Feature-Rich-Ressourcen für Cross-Lingual Phonologie und Phonemic Language Modeling IPA-CHILDES & G2P+:跨语言歌曲和语音语言建模的地貌资源 2504.03036v3
  • 66 06-12 Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges Pragmatics in the Era of Large Language Models: Eine Umfrage zu Datensätzen, Evaluation, Chancen und Herausforderungen 《大语言模式时代中的实用模型:关于数据集、评价、机遇和挑战的调查》 2502.12378v3
  • 67 06-12 Encoding call-by-push-value in the pi-calculus Kodierung von Call-by-Push-Value im Pi-Kalkül 在pi演算中编码call-by-push-value 2506.10584v1
  • 68 06-12 BabyLM’s First Words: Word Segmentation as a Phonological Probing Task BabyLMs erste Worte: Wortsegmentierung als phonologische Probing-Aufgabe BabyLM 的第一单词: 单词分割作为声学检测任务 2504.03338v3
  • 69 06-12 Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets Menschliche und LLM-Biasen in Hate Speech Annotationen: Eine sozio-demographische Analyse von Annotatoren und Zielen 仇恨言论说明中的人类和LLM比阿斯语:对说明者和目标的社会-人口分析 2410.07991v6
  • 70 06-12 Reinforcing Multimodal Understanding and Generation with Dual Self-rewards Verstärkung des multimodalen Verständnisses und der Erzeugung mit Dual Self-Rewards 加强多模式理解和多模式代代与双重奖赏 2506.07963v2
  • 71 06-12 Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models Obliviate: Effizientes Ent-Memorisieren zum Schutz geistigen Eigentums in großen Sprachmodellen Obliviate:在大语言模型中为保护知识产权进行高效去记忆化 2502.15010v2
  • 72 06-12 Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning Erste Prüfung von Wissenschaftlern: Kognitive Fähigkeiten von MLLM durch Wahrnehmung, Verständnis und Vernunft unter Beweis stellen 科学家的第一次考试:通过感知、理解和理性,发现MLLM的认知能力 2506.10521v1
  • 73 06-12 Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs Zuverlässiger Weg zur Vernunft: Destillieren effektiver Leitlinien für LLM-Reasoning mit Wissensgraphen 可靠理由说明:为学习图解的LLM 理由说明保留有效指导 2506.10508v1
  • 74 06-12 Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps Messung der Kette der Gedankentreue durch unlernende Vernunftschritte 通过 “ 不学习理性步骤 “ 衡量思考链的信念 2502.14829v2
  • 75 06-12 Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models Beyond Single-User Dialogue: Bewertung des Multi-User Dialogue State Tracking Fähigkeiten großer Sprachmodelle 超越单一用户对话:评估多用户对话国家跟踪大语言模式的能力 2506.10504v1
  • 76 06-12 Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics Achten Sie auf den Style Gap: Meta-Evaluation von Style und Attribut-Transfer-Metriken 思维风格差距:对样式和属性转移的元评价 2502.15022v3
  • 77 06-12 TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding TransXSSM: Ein hybrides Transformer-State-Space-Modell mit einheitlicher Rotary-Position-Einbettung TransXSSM:采用统一旋转位置嵌入的混合Transformer状态空间模型 2506.09507v2
  • 78 06-12 Towards Large Language Models with Self-Consistent Natural Language Explanations Auf dem Weg zu großen Sprachmodellen mit selbstkonsistenten natürlichen Spracherklärungen 努力建立具有自我联系自然语言解释的大型语言模式 2506.07523v2
  • 79 06-12 Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models Surface Fairness, Deep Bias: Eine vergleichende Studie über Bias in Sprachmodellen 地表公平、深比亚:语言模型比亚比较研究 2506.10491v1
  • 80 06-12 Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers Tabelle-Text Alignment: Erklärung der Antragsprüfung gegen Tabellen in wissenschaftlichen Arbeiten 表-文字对齐:对照科学文件表格解释索赔核实 2506.10486v1
  • 81 06-12 IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language IndoToxic2024: Ein demographischer Datensatz von Hass-Sprach- und Toxizitätstypen für indonesische Sprache Indotoxic2024:印度尼西亚语仇恨言论和毒性类型人口资料集 2406.19349v2
  • 82 06-12 VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models VScan: Rethinking Visual Token Reduction für effiziente große Vision-Sprache Modelle Vscan:重新思考如何降低视力,以建立高效的大型视觉语言模型 2505.22654v2
  • 83 06-12 Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts Auf dem Weg zur robusten multimodalen Emotionserkennung unter fehlenden Modalitäten und Verteilungsverschiebungen 争取在缺失模式和分销转移模式下强有力地承认多模式情感 2506.10452v1
  • 84 06-12 Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations Social Bias Benchmark for Generation: Ein Vergleich von generierungsbasierten und QA-basierten Bewertungen 生成任务的社会偏见基准:生成式评估与基于问答的评估之比较 2503.06987v2
  • 85 06-12 Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty Schnell beim Einfachen, tief beim Schweren: Effizientes Reasoning über Powered Length Penalty 易题求快,难题求深:通过幂次长度惩罚实现高效推理 2506.10446v1
  • 86 06-12 CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning CheMatAgent: Verbesserung von LLMs für Chemie und Materialwissenschaft durch baumsuchebasiertes Tool Learning CheMatAgent:通过植树搜索工具学习加强化学和材料科学LLMs 2506.07551v2
  • 87 06-12 ConvD: Attention Enhanced Dynamic Convolutional Embeddings for Knowledge Graph Completion ConvD: Aufmerksamkeitsverstärkte dynamische konvolutionäre Einbettungen für die Wissensgraphenvervollständigung ConvD: 关注增强动态动态嵌入,以完成知识图的完成 2312.07589v2
  • 88 06-12 PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs PAL: Probing Audio-Encoder über LLMs – Eine Studie über den Informationstransfer von Audio-Encodern zu LLMs PAL:通过LLMs探查音频成象器 – – 研究从音频成象器向LLMs传送信息 2506.10423v1
  • 89 06-12 Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting Jenseits des Schlachtfeldes: Framing Analyse der Medienabdeckung in der Konfliktberichterstattung 战场以外的战场:冲突报道中媒体报道的系统化分析 2506.10421v1
  • 90 06-12 Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences? Brennen Sie nach dem Lesen: Erfassen multimodale große Sprachmodelle wirklich die Reihenfolge der Ereignisse in Bildsequenzen? Burn after read: 多式大语言模型在图像序列中是否真的能捕捉事件秩序? 2506.10415v1
  • 91 06-12 Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series Zeit-IMM: Ein Datensatz und Benchmark für irreguläre multimodale Multivariate Zeitreihen 时间-IMM:非正常多式联运多变时间序列的数据集和基准 2506.10412v1
  • 92 06-12 PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier PAG: Multi-Turn verstärkt LLM Selbstkorrektion mit Politik als Generativer Prüfer PAG: 多发强化LLM自我校正,政策作为产生验证 2506.10406v1
  • 93 06-12 iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering iQUEST: Ein iteratives, fragegeleitetes Framework für die Beantwortung von Fragen über Wissensdatenbanken iQUEST:面向知识库问答的迭代式问题引导框架 2506.01784v2
  • 94 06-12 AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving AgentThink: Ein einheitliches Framework für Tool-Augmented Chain-of-Thought Reasoning in Vision-Language-Modellen für autonomes Fahren Agent Think: 自主驾驶愿景-语言模型中工具推荐研究链理由统一框架 2505.15298v3
  • 95 06-12 On Many-Shot In-Context Learning for Long-Context Evaluation Über Many-Shot-In-Context-Learning für die Langkontext-Evaluierung 面向长上下文评估的多样本上下文学习 2411.07130v3
  • 96 06-12 TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning TableRAG: Ein Retrieval Augmented Generation Framework für heterogene Dokument-Reasoning 表RAG:异源文件说明理由的回收增加代际生成框架 2506.10380v1
  • 97 06-12 Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning Hierarchische Latentenfähigkeiten von Sprachmodellen über das kausale Repräsentationslernen entdecken 通过因果代表制学习发现语言模式的分级本端能力 2506.10378v1
  • 98 06-12 A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce Ein minimalistischer Ansatz zur LLM-Vernunft: von der Abstoßung zur Verstärkung 从拒绝抽样到强化 2504.11343v2
  • 99 06-12 Can We Infer Confidential Properties of Training Data from LLMs? Können wir vertrauliche Eigenschaften von Trainingsdaten von LLMs ableiten? 我们能否从LLMS中推断培训数据的机密性? 2506.10364v1
  • 100 06-12 CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models CAF-I: Ein kollaboratives Multi-Agent-Framework für eine verbesserte Ironieerkennung mit großen Sprachmodellen CAF-I:利用大语言模型增强反讽检测的协作式多智能体框架 2506.08430v2
  • 101 06-12 Improving Fairness of Large Language Models in Multi-document Summarization Verbesserung der Fairness von großen Sprachmodellen in Multi-Dokument-Zusammenfassung 提高多文件总结中大语言模式的公平性 2506.07479v2
  • 102 06-12 SCORE: Story Coherence and Retrieval Enhancement for AI Narratives SCORE: Story-Kohärenz und Retrieval-Verbesserung für KI-Erzählungen SCORE:面向AI叙事的故事连贯性与检索增强 2503.23512v4
  • 103 06-12 The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets Das automatisierte, aber riskante Spiel: Modellierung von Agent-zu-Agent-Verhandlungen und Transaktionen in Verbrauchermärkten 自动但有风险游戏:消费者市场代理对代理谈判和交易的模拟 2506.00073v2
  • 104 06-12 An Analysis of Datasets, Metrics and Models in Keyphrase Generation Eine Analyse von Datensätzen, Metrics und Modellen in der Keyphrase-Generierung 对关键词生成中的数据集、计量和模型的分析 2506.10346v1
  • 105 06-12 Code Execution as Grounded Supervision for LLM Reasoning Code-Execution als geerdete Überwachung für LLM-Reasoning 法规执行作为LLM理由的有限制的监督 2506.10343v1
  • 106 06-12 Provably Learning from Language Feedback Beweisbares Lernen aus Sprachfeedback 可证明地从语言反馈中学习 2506.10341v1
  • 107 06-12 Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs Amulet: Neuausrichtung während der Testzeit für personalisierte Präferenzanpassung von LLMs Amulet:在测试时重新对齐,实现LLM的个性化偏好适配 2502.19148v3
  • 108 06-12 Benchmarking LLMs for Environmental Review and Permitting Benchmarking LLMs für Umweltprüfung und Genehmigung 环境审查和许可基准确定LLMs 2407.07321v3
  • 109 06-12 CHANCERY: Evaluating Corporate Governance Reasoning Capabilities in Language Models CHANCERY: Bewertung der Corporate-Governance-Reasoning-Fähigkeiten in Sprachmodellen CHANCERY:评估语言模型的公司治理推理能力 2506.04636v2
  • 110 06-12 Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs Gepaarte Fertigstellung: Flexible Quantifizierung von Emissions-Framing auf Scale mit LLMs 提前完成:与LLMs一道灵活量化规模问题配置 2408.09742v2
  • 111 06-12 Detecting Sockpuppetry on Wikipedia Using Meta-Learning Sockepuppetry auf Wikipedia erkennen Mit Meta-Learning 在维基百科上用元学习探测袜子布料 2506.10314v1
  • 112 06-12 Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling Effiziente Längenverallgemeinerbare Aufmerksamkeit über Causal Retrieval für die Lang-Kontext-Sprachenmodellierung 长文本语言建模通过 “ 目的检索 “ 吸引长文本语言建模 2410.01651v4
  • 113 06-12 AC/DC: LLM-based Audio Comprehension via Dialogue Continuation AC/DC: LLM-basiertes Audio-Verständnis über Dialog-Fortsetzung AC/DC:基于LLM的通过对话继续了解音频 2506.10312v1
  • 114 06-12 BeamLoRA: Beam-Constraint Low-Rank Adaptation BeamLoRA: Beam-Constraint Low-Rank Anpassung BeamLORA: 束-节制低射线适应 2502.13604v2
  • 115 06-12 Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs Geplante interleaved Speech-Text-Schulung für Sprach-zu-Sprach-Übersetzung mit LLMs 配有LLMM的语音对语音翻译教学 2506.10299v1
  • 116 06-12 “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context “Check My Work?”: Sykopanzmessung in einem simulierten Bildungskontext “检查我的工作?” “测量模拟教育环境中的相对性” 2506.10297v1
  • 117 06-12 Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages Flick: Wenige Labels Textklassifizierung mit K-Aware Intermediate Learning in Multi-Task Low-Resource Sprachen Flick:使用K-Aware中级学习多种低资源语言的多种语言的标签文字分类, 2506.10292v1
  • 118 06-12 Context Is Not Comprehension Kontext ist nicht Verstehen 上下文不等于理解 2506.04907v4
  • 119 06-12 Prompt-based Depth Pruning of Large Language Models Prompt-basierte Tiefenkorrektur von großen Sprachmodellen 大语言模式的即时深度定位 2502.04348v3
  • 120 06-12 ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs ClusterUCB: Effiziente gradientenbasierte Datenauswahl für gezieltes Fine-Tuning von LLMs ClusterUCB:面向LLM定向微调的高效基于梯度的数据选择 2506.10288v1
  • 121 06-12 Convert Language Model into a Value-based Strategic Planner Konvertieren Sie Sprachmodell in einen wertbasierten strategischen Planer 将语言模式转换成基于价值的战略规划员 2505.06987v3
  • 122 06-12 Discrete Audio Tokens: More Than a Survey! Diskrete Audio Tokens: Mehr als nur eine Umfrage! 分辨音频代号: 多于调查 ! 2506.10274v1
  • 123 06-12 Play to Generalize: Learning to Reason Through Game Play Spielen Sie Generalize: Lernen, Vernunft durch Spiel zu lernen 玩一般游戏: 通过玩游戏学习理性 2506.08011v2
  • 124 06-12 Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models Haben Sprachmodelle Bayesische Gehirne? Beeindruckende stochastische und deterministische Entscheidungsmuster innerhalb großer Sprachmodelle 语言模式是否具有贝耶斯人脑? 区分大语言模式中的斯托卡和决定性决定模式 2506.10268v1
  • 125 06-12 Research Borderlands: Analysing Writing Across Research Cultures Forschungsgrenzen: Analysieren des Schreibens über Forschungskulturen hinweg 研究边界地区:分析跨研究文化的写作 2506.00784v2
  • 126 06-12 M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction M-MRE: Ausdehnung des Effekts der gegenseitigen Verstärkung auf multimodale Informationsextraktion M-MRE:将相互强化效应扩大到多式联运信息提取 2504.17353v2
  • 127 06-11 (3) ToxSyn-PT: A Large-Scale Synthetic Dataset for Hate Speech Detection in Portuguese ToxSyn-PT: Ein großformatiger synthetischer Datensatz für Hass-Spracherkennung auf Portugiesisch ToxSyn-PT:用葡萄牙语探测仇恨言论的大型合成数据集 2506.10245v1
  • 128 06-11 The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation Das Fellowship der LLMs: Multi-Agenten-Workflows für die Generierung synthetischer Datensätze zur Präferenzoptimierung LLM联盟:用于合成偏好优化数据集生成的多智能体工作流 2408.08688v5
  • 129 06-11 Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods Prompt Angriffe Enthüllen Oberflächliche Wissensentfernung in unlearning Methoden 以不学习方法清除知识 2506.10236v1
  • 130 06-11 Classifying Unreliable Narrators with Large Language Models Klassifizieren von unzuverlässigen Erzählern mit großen Sprachmodellen 使用大语言模型的不可靠旁白器分类 2506.10231v1
  • 131 06-11 Squeezed Attention: Accelerating Long Context Length LLM Inference Squeezed Achtung: Beschleunigung der langen Kontextlänge LLM-Inferenz 挤压注意:加速长线长长LLM 推论 2411.09688v3
  • 132 06-11 Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges Große Sprachmodelle für ambulante Referrals: Problemdefinition, Benchmarking und Herausforderungen 门诊转诊大语言模式:问题定义、基准设定和挑战 2503.08292v3
  • 133 06-11 EgoNormia: Benchmarking Physical Social Norm Understanding EgoNormia: Benchmarking Physical Social Norm Understanding EgoNormia:确定物质社会规范基准 2502.20490v5
  • 134 06-11 Failure Modes of LLMs for Causal Reasoning on Narratives Failure Modes von LLMs für die ursächliche Begründung von Narrativen 以叙述为由解释原因的LLMs失败模式 2410.23884v4
  • 135 06-11 TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games TTT-Bench: Ein Benchmark für die Bewertung der Reasoning-Fähigkeit mit einfachen und neuartigen Spielen im Tic-Tac-Toe-Stil TTT-Bench:用简单新颖的井字棋类游戏评估推理能力的基准 2506.10209v1
  • 136 06-11 SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models SealQA: Anhebung der Messlatte für die Vernunft in Search-Augmented Language Models SealQA: 提高搜索增强语言模式中的原因栏 2506.01062v2
  • 137 06-11 Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval Q2E: Query-to-Event-Zerlegung für mehrsprachiges Zero-Shot-Text-zu-Video-Retrieval Q2E:面向零样本多语言文本到视频检索的查询到事件分解 2506.10202v1
  • 138 06-11 Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance Abschwächung von Objekt-Halluzinationen in großen Vision-Language-Modellen durch bildgestützte Steuerung 通过基于图像的引导减轻大型视觉语言模型中的物体幻觉 2402.08680v2
  • 139 06-11 Building UD Cairo for Old English in the Classroom Gebäude UD Kairo für altes Englisch im Klassenzimmer 在教室内为旧英语建造开罗 2504.18718v2
  • 140 06-11 Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce Zukunft der Arbeit mit KI-Agenten: Auditing Automation und Augmentation Potential in der gesamten US-Arbeitskraft 未来与AI代理机构的合作:全美国员工队伍的审计自动化和增强潜力 2506.06576v2
  • 141 06-11 Disclosure Audits for LLM Agents Offenlegungsprüfungen für LLM-Agenten 对LLLM代理的披露审计 2506.10171v1
  • 142 06-11 Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective Können LLMs gute Geschichten generieren? Erkenntnisse und Herausforderungen aus einer narrativen Planungsperspektive LLM能生成好故事吗?叙事规划视角下的见解与挑战 2506.10161v1
  • 143 06-11 One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence Ein Patient, viele Kontexte: Skalierung medizinischer KI durch Kontextintelligenz 一名病人,多种背景:通过背景情报扩大医疗AI 2506.10157v1
  • 144 06-11 Measuring Corporate Human Capital Disclosures: Lexicon, Data, Code, and Research Opportunities Corporate Human Capital Disclosures messen: Lexikon, Daten, Code und Forschungsmöglichkeiten 衡量公司人力资本披露:词汇、数据、代码和研究机会 2506.10155v1
  • 145 06-11 Analyzing Emotions in Bangla Social Media Comments Using Machine Learning and LIME Analysieren von Emotionen in Bangla Social Media Kommentare mit maschinellem Lernen und LIME 利用机器学习和LIME分析孟加拉社会媒体评论中的情感 2506.10154v1
  • 146 06-11 When Large Language Models are Reliable for Judging Empathic Communication Wenn große Sprachmodelle zuverlässig sind, um empathische Kommunikation zu beurteilen 当大语言模型在判断同情通信方面可靠时 2506.10150v1
  • 147 06-11 Computation Mechanism Behind LLM Position Generalization Berechnungsmechanismus hinter LLM-Positionsverallgemeinerung LLM 职位普遍化背后的计算机制 2503.13305v2
  • 148 06-11 Unsupervised Elicitation of Language Models Unüberwachte Elizitation von Sprachmodellen 语言模型的无监督引出 2506.10139v1
  • 149 06-11 ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering ChartReasoner: Code-getriebene Modalitäts-Bridging für Lang-Chain-Reasoning in der Chart-Fragebeantwortung 图表检索器:图解问题解答中长钱原因的代码-驱动模式连接 2506.10116v1
  • 150 06-11 When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs Wenn die Bedeutung bleibt die gleiche, aber Modelle Drift: Bewertung der Servicequalität unter Token-Level Behavioral Instability in LLMs 当意指保持相同时,但模型漂流:在高端行为不稳下评估服务的质量 2506.10095v1
  • 151 06-11 Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information Chat-of-Thought: Kollaboratives Multi-Agent-System zur Generierung von Domain-spezifischen Informationen 共同探索:生成具体域域特定信息的合作多机构机构系统 2506.10086v1
  • 152 06-11 A quantum semantic framework for natural language processing Ein quantensemantischer Rahmen für die natürliche Sprachverarbeitung 自然语言处理的量子语义框架 2506.10077v1
  • 153 06-11 From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring Vom Urteil bis zur Störung: Frühzeitiges Stoppen von LLM-schädlichen Ausgängen durch Streaming-Content-Monitoring 从判决到干扰:通过流式内容监测及早停止LLM有害产出 2506.09996v1
  • 154 06-11 Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling Flipping gegen alle Widrigkeiten: LLM-Münzwurf-Bias durch verbalisiertes Rejection Sampling reduzieren 逆势翻转:通过言语化拒绝采样降低LLM抛硬币偏差 2506.09998v1
  • 155 06-11 Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages Große Sprachmodelle für toxische Spracherkennung in ressourcenarmen Balkansprachen 低资源巴尔干语言中有毒语言探测大语言模式 2506.09992v1
  • 156 06-11 TaskCraft: Automated Generation of Agentic Tasks TaskCraft: Automatisierte Generierung von Agentischen Aufgaben TaskCraft:自动生成代理任务 2506.10055v1
  • 157 06-11 Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs Omni-DPO: Ein Dual-Perspektiven-Paradigma für dynamisches Preference-Lernen von LLMs Omni-DPO:一个双双面范式,以动态方式学习LLMs 2506.10054v1
  • 158 06-11 Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs Schritt-für-Schritt-Anweisungen und ein einfaches tabellarisches Ausgabeformat verbessern die Abhängigkeits-Abgleichgenauigkeit von LLMs 逐步指示和简单表格格式 改进LLMM的可靠性分析精确度 2506.09983v1
  • 159 06-11 When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text Wenn die Detektion fehlschlägt: Die Macht von fein-getönten Modellen, um menschenähnliche Social Media-Texte zu erzeugen 当检测失败时:制作像人类一样的社会媒体文字的精选模型的力量 2506.09975v1
  • 160 06-11 Trustworthy AI: Safety, Bias, and Privacy – A Survey Vertrauenswürdige KI: Sicherheit, Bias und Privatsphäre – Eine Umfrage 值得信赖的大赦国际:安全、比亚斯和隐私 – – 调查 2502.10450v2
  • 161 06-11 Resa: Transparent Reasoning Models via SAEs Resa: Transparente Begründungsmodelle über SAE Resa:通过SAEs建立透明说明理由模型 2506.09967v1
  • 162 06-11 Language Models Resist Alignment: Evidence From Data Compression Sprachmodelle widerstehen Ausrichtung: Beweise aus Datenkompression 语言模型阻力对齐:数据压缩中的证据 2406.06144v4
  • 163 06-11 Outside Knowledge Conversational Video (OKCV) Dataset – Dialoguing over Videos Outside Knowledge Conversational Video (OKCV) Datensatz – Dialoguing über Videos 外部知识对话视频(OKCV)数据集 – – 视频对话 2506.09953v1
  • 164 06-11 Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking Abfrage-fokussierte Retrieval-Köpfe verbessern Long-Context Reasoning und Re-Ranking 改进长期理由和重新排序 2506.09944v1
  • 165 06-11 7B Fully Open Source Moxin-LLM/VLM – From Pretraining to GRPO-based Reinforcement Learning Enhancement 7B Vollständig offene Quelle Moxin-LLM/VLM – Vom Vortraining zur GRPO-basierten Verstärkungs-Lernverbesserung 7B 完全开放源代码 Moxin-LLM/VLM – – 从预培训到基于GROP的强化学习加强 2412.06845v5
  • 166 06-11 VerIF: Verification Engineering for Reinforcement Learning in Instruction Following VerIF: Verifikationstechnik für das Verstärkungslernen im Instruction Following VerIF: 教学中加强学习的核查工程 2506.09942v1
  • 167 06-11 Let’s Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Robust and Instruction-Aware ASR and OCR Lassen Sie uns Schritt für Schritt fuse: Eine generative Fusion Decodierung Algorithmus mit LLMs für robuste und instruction-aware ASR und OCR 让我们一步一步地引信: 与用于软体和教学软件 ASR 和 OCR 的LMLMLM 生成融合解码算法 2405.14259v4
  • 168 06-11 LLM-BT-Terms: Back-Translation as a Framework for Terminology Standardization and Dynamic Semantic Embedding LLM-BT-Terms: Back-Translation als Framework für Terminologie-Standardisierung und dynamische semantische Einbettung LLM-BT-Terms:以回译作为术语标准化和动态语义嵌入的框架 2506.08174v2
  • 169 06-11 CaLMQA: Exploring culturally specific long-form question answering across 23 languages CaLMQA: Erforschen von kulturell spezifischen Langzeitfragen in 23 Sprachen CALMQA:探讨以23种语言回答文化上特有的长式问题 2406.17761v3
  • 170 06-11 Standard Language Ideology in AI-Generated Language Standardsprache-Ideologie in KI-generierter Sprache AI-创用语言的标准语言意识形态 2406.08726v2
  • 171 06-11 Discovering Forbidden Topics in Language Models Verbotene Themen in Sprachmodellen entdecken 发现语言模型中的禁止专题 2505.17441v3
  • 172 06-11 PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants PersonaLens: Ein Benchmark für die Personalisierungsbewertung in Conversational AI Assistants 人际关系:对话性AI助理个人化评价基准 2506.09902v1
  • 173 06-11 The Remarkable Robustness of LLMs: Stages of Inference? Die bemerkenswerte Robustheit von LLMs: Stufen der Schlussfolgerung? LLMS的显著威力:推论阶段? 2406.19384v2
  • 174 06-11 Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs Weiterentwicklung der Decodierungsstrategien: Verbesserungen in der lokalen typischen Probenahme für LLMs 推进下限战略:加强LLMM公司当地典型抽样 2506.05387v2
  • 175 06-11 The Emergence of Abstract Thought in Large Language Models Beyond Any Language Das Entstehen abstrakten Denkens in großen Sprachmodellen jenseits jeder Sprache 超越任何语言的大型语言模式中抽象思考的出现 2506.09890v1
  • 176 06-11 Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs Aufmerksamkeitskopf-Embeddings mit trainierbaren Tiefenkernen für Halluzinationserkennung in LLMs 用于在LLMM中探测幻觉的可训练深心内核嵌入器 2506.09886v1
  • 177 06-11 TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration TACTIC: Übersetzungsagenturen mit kognitiv-theoretischer Interactive Collaboration TACTIC: 具有认知理论互动协作的翻译剂 2506.08403v2
  • 178 06-11 TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering TableEval: Ein Real-World Benchmark für komplexe, mehrsprachige und multistrukturierte Tabellenfrageantworten 表Eval:复杂、多语种和多结构表问题答案的现实世界基准 2506.03949v2
  • 179 06-11 ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models ImageChain: Advancing Sequential Image-to-Text Reasoning in multimodalen großen Sprachmodellen 图像Chain:在多模式大语言模型中推进按顺序排列的图像到文字理由 2502.19409v2
  • 180 06-11 Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning Causal Sufficiency und Necessity verbessert Kette-of-Thought-Reasoning C. 因果关系和必要性 改进审议链 理由 2506.09853v1
  • 181 06-11 Advancing Exchange Rate Forecasting: Leveraging Machine Learning and AI for Enhanced Accuracy in Global Financial Markets Advancing Exchange Rate Forecasting: Hebeling Machine Learning und KI für verbesserte Genauigkeit in den globalen Finanzmärkten 提高汇率预测:利用机器学习和大赦国际促进全球金融市场更加准确性 2506.09851v1
  • 182 06-11 Dataset of News Articles with Provenance Metadata for Media Relevance Assessment Datensatz von Nachrichtenartikeln mit Provenienz Metadaten für die Medienrelevanzbewertung 为媒体相关性评估提供具有证明的元数据的新闻文章数据集 2506.09847v1
  • 183 06-11 Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective Produkt von Experten mit LLMs: Leistungssteigerung auf ARC ist eine Frage der Perspektive 与LLMM公司的专家产品有关的产品:促进ACR的绩效是一个透视问题 2505.07859v2
  • 184 06-11 Crafting Customisable Characters with LLMs: Introducing SimsChat, a Persona-Driven Role-Playing Agent Framework Anpassbare Charaktere mit LLMs gestalten: Einführung von SimsChat, einem Persona-Driven Role-Playing Agent Framework 与LLMM女士一起制作自定义字符:介绍人际角色扮演代理框架SimsChat 2406.17962v6
  • 185 06-11 Error-Guided Pose Augmentation: Enhancing Rehabilitation Exercise Assessment through Targeted Data Generation Fehlergeführte Pose Augmentation: Verbesserung der Rehabilitationsübungsbewertung durch gezielte Datengenerierung 错误引导粒子增强:通过有针对性数据生成加强复原工作评估 2506.09833v1
  • 186 06-11 One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image Ein Bild ist alles, was es braucht: Vergiftung von visuellen Dokumenten Retrieval Augmented Generation mit einem einzigen Bild 一个象数就够了: 以单一图像显示的毒害视觉文件检索增强的一代人。 2504.02132v2
  • 187 06-11 EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection EmoNet-Voice: Ein feinkörniger, sachverständiger Benchmark für Sprachemotionserkennung EmoNet-Voice:语音情感检测精密、经专家核实的专家验证基准 2506.09827v1
  • 188 06-11 Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation Hervorhebung strukturierter Informationen: Integration abstrakter Bedeutungsdarstellungen in LLMs für eine verbesserte Bewertung des Open-Domain-Dialogs 强调结构化信息:将抽象含义代表纳入增强开放式对话评价LMLM 2404.01129v4
  • 189 06-11 Unveiling the Hidden: Movie Genre and User Bias in Spoiler Detection Enthüllung des Verborgenen: Filmgenre und User Bias in Spoiler Detection 解开隐藏:电影流派和在垃圾检测中的用户比亚斯 2504.17834v3
  • 190 06-11 Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards Writing-Zero: Brücke zwischen nicht überprüfbaren Aufgaben und überprüfbaren Belohnungen 零书写:弥合不可核查任务与可核实的奖励之间的差距 2506.00103v2
  • 191 06-11 Steps are all you need: Rethinking STEM Education with Prompt Engineering Schritte sind alles, was Sie brauchen: STEM-Bildung mit Prompt Engineering neu denken 重新思考STEM教育与即时工程 2412.05023v3
  • 192 06-11 Using Shapley interactions to understand how models use structure Shapley-Interaktionen nutzen, um zu verstehen, wie Modelle Struktur verwenden 利用模拟互动来了解模型如何使用结构 2403.13106v2
  • 193 06-11 Regularizing Learnable Feature Extraction for Automatic Speech Recognition Regularisieren von Lern-Feature-Extraktion für automatische Spracherkennung 将自动语音识别的可读功能抽取器正规化 2506.09804v1
  • 194 06-11 Knowledge Graphs are all you need: Leveraging KGs in Physics Question Answering Wissen Graphen sind alles, was Sie brauchen: Leveraging KGs in Physik Fragen beantworten 知识图形是您所需要的全部: 在物理问答中使用 KGs 工具 2412.05453v3
  • 195 06-11 LogProber: Disentangling confidence from contamination in LLM responses LogProber: Entwirren des Vertrauens in LLM-Antworten 日志Prober:解除对LLM反应中污染的信心 2408.14352v2
  • 196 06-11 Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? Geben LLMs psychometrisch plausible Antworten in pädagogischen Bewertungen? LLMs在教育评估中是否给予心理可视反应? 2506.09796v1
  • 197 06-11 ComfyUI-R1: Exploring Reasoning Models for Workflow Generation ComfyUI-R1: Erforschung von Konzeptmodellen für die Workflow-Generierung ComfyUI-R1:探索产生工作流程的理由模型 2506.09790v1
  • 198 06-11 MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning MindForge: Empowering Embodyd Agents mit Theorie des Geistes für lebenslanges kulturelles Lernen Mindforge:赋予有终身文化学习思想理论的内装代理人权力 2411.12977v4
  • 199 06-11 Guidelines for Fine-grained Sentence-level Arabic Readability Annotation Richtlinien für feinkörnige Sätze Arabische Lesbarkeit Annotation 阿拉伯文可读性说明 2410.08674v3
  • 200 06-11 Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements Feintuning große Audio-Sprachen-Modelle mit LoRA für die präzise zeitliche Lokalisierung von langanhaltenden Expositionstherapieelementen 与LORA一道精细设计大型音频语言模型,用于长期接触治疗元素的精确时间定位 2506.09707v1
  • 201 06-11 LLM2TEA: Agentic AI Designer Finds Innovative Objects with Generative Evolutionary Multitasking LLM2TEA: Agentischer AI-Designer findet innovative Objekte mit generativem evolutionären Multitasking LLM2TEA: ARI 设计师发现具有引演进进化多任务的创新对象 2406.14917v2
  • 202 06-11 Reasoning Language Models: A Blueprint Grundlegende Sprachmodelle: Ein Blueprint 理由语言模式:蓝图 2501.11223v4
  • 203 06-11 Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment Jenseits von Bradley-Terry-Modellen: Ein allgemeines Präferenzmodell für Sprachmodellausrichtung 超越布拉德利-泰瑞模式:语文模式统一通用优先模式 2410.02197v3
  • 204 06-11 Retrofitting Large Language Models with Dynamic Tokenization Nachrüstung großer Sprachmodelle mit dynamischer Tokenisierung 以动态招招化改造大语言模型 2411.18553v3
  • 205 06-11 Adding simple structure at inference improves Vision-Language Compositionality Einfache Struktur bei Inferenz hinzufügen verbessert die Vision-Sprachenkompositionalität 在推论中添加简单结构改进视野-语言的构成性 2506.09691v1
  • 206 06-11 Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models Inv-Entropy: Ein voll probabilistisches Rahmenwerk zur Unsicherheitsquantifizierung in Sprachmodellen Inv-Entropy: 语文模型不确定性量化的完全概率框架 2506.09684v1
  • 207 06-11 Is Fine-Tuning an Effective Solution? Reassessing Knowledge Editing for Unstructured Data Ist Fine-Tuning eine effektive Lösung? Neubewertung der Wissensbearbeitung für unstrukturierte Daten 精细调整是否是一种有效的解决方案?重新评估无结构数据的知识编辑 2506.09672v1
  • 208 06-11 CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization CROW: Beseitigung von Backdoors aus großen Sprachmodellen über interne Konsistenzregularisierung CROW:通过内部一致性规范化,消除大语言模式中的后门 2411.12768v2
  • 209 06-11 Query-Level Uncertainty in Large Language Models Abfrage-Level-Unsicherheit in großen Sprachmodellen 大语言模式的质疑级不确定性 2506.09669v1
  • 210 06-11 Intent Factored Generation: Unleashing the Diversity in Your Language Model Intent Factored Generation: Die Vielfalt in Ihrem Sprachmodell entfesseln 内在因数代:在你的语言模式中释放多样性 2506.09659v1
  • 211 06-11 Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA Die Lücke zwischen Open Source und Proprietary LLMs in Tabelle QA überbrücken 表QA缩小开放来源和专有LMs与公开来源和产权LMs之间的差距 2506.09657v1
  • 212 06-11 AskToAct: Enhancing LLMs Tool Use via Self-Correcting Clarification AskToAct: Verbesserung der LLMs-Tool-Nutzung durch selbstkorrekte Klärung AskToact:通过自我更正澄清加强LLMs工具的使用 2503.01940v2
  • 213 06-11 Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering Effizientes und generalisierbares Graph Retriever für Wissens-Graph-Frageantworten lernen 知识拼图问题解答的学习效率和通用图表 2506.09645v1
  • 214 06-11 Using Sign Language Production as Data Augmentation to enhance Sign Language Translation Verwendung der Sign Language-Produktion als Datenerweiterung zur Verbesserung der Sign Language-Übersetzung 利用手语制作作为数据扩充,加强手语翻译 2506.09643v1
  • 215 06-11 Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs Gleiche Aufgabe, unterschiedliche Schaltungen: Entwirren von Modalität-spezifischen Mechanismen in VLMs 同一任务, 不同电路: VLMs 拆分模式特定机制 2506.09047v2
  • 216 06-11 Modeling Probabilistic Reduction using Information Theory and Naive Discriminative Learning Modellierung der probabilistischen Reduktion mittels Informationstheorie und Naive Discriminative Learning 利用信息理论和辨别性差异性学习来模拟降低概率的模型 2506.09641v1
  • 217 06-11 Benchmarking Debiasing Methods for LLM-based Parameter Estimates Benchmarking-Debiasing-Methoden für LLM-basierte Parameterschätzungen 以LLM为基础的参数估算的偏差方法 2506.09627v1
  • 218 06-11 MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering MAGIC-VQA: Multimodale und begründete Schlussfolgerungen mit Commonsense-Wissen für visuelle Fragenbeantwortung MAGIC-VQA:视觉问题解答使用常识知识的多式和有根据的推断 2503.18491v3
  • 219 06-11 Style over Substance: Distilled Language Models Reason Via Stylistic Replication Stil über Substanz: Destillierte Sprachmodelle Grund über stylistische Replication 物质之上的样式: 蒸馏语言模型 2504.01738v2
  • 220 06-11 Effective Red-Teaming of Policy-Adherent Agents Effektives Red-Teaming von Policy-Adherent Agents 有效的政策协调代理人红队 2506.09600v1
  • 221 06-11 Assessment of Evolving Large Language Models in Upper Secondary Mathematics Bewertung von sich entwickelnden großen Sprachmodellen in der Oberen Sekundärmathematik 对高中数学中演变中的大语言模型的评估 2504.12347v2
  • 222 06-11 NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction NTPP: Generative Sprachmodellierung für Dual-Channel-Gesprochenen Dialog über Next-Token-Pair-Vorhersage NTPP:通过下波波语预测为双声道口语对话生成语音模型 2506.00975v4
  • 223 06-11 Memorization in Language Models through the Lens of Intrinsic Dimension Auswendiglernen in Sprachmodellen durch die Linse der Intrinsischen Dimension 通过内在层面的镜头在语言模型中的记忆化 2506.09591v1
  • 224 06-11 Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis Decoding Knowledge Attribution in Mixture-of-Experts: Ein Rahmenwerk der Basic-Refinement-Kollaboration und Effizienzanalyse 在混合专家中解码知识的配置:基础改进协作和效率分析框架 2505.24593v2
  • 225 06-11 Explaining word embeddings with perfect fidelity: Case study in research impact prediction Erklären von Wort-Einbettungen mit perfekter Treue: Fallstudie in der Forschungs-Impact-Vorhersage 解释完美忠实地嵌入的字词的解释:研究影响预测案例研究 2409.15912v2
  • 226 06-11 Human-like object concept representations emerge naturally in multimodal large language models Mensch-ähnliche Objekt-Konzept-Darstellungen entstehen natürlich in multimodalen großen Sprachmodellen 在多式大型语言模式中自然出现类似人类的物体概念表述 2407.01067v3
  • 227 06-11 MOSAIC: Multiple Observers Spotting AI Content MOSAIC: Mehrere Beobachter finden KI-Inhalte MOSAIC:多观察员观察AI 内容 2409.07615v3
  • 228 06-11 Phonology-Guided Speech-to-Speech Translation for African Languages Phonologie-geführte Sprach-zu-Sprach-Übersetzung für afrikanische Sprachen 非洲语文声学引导语音语音对语音翻译 2410.23323v3
  • 229 06-11 From Symbolic to Neural and Back: Exploring Knowledge Graph-Large Language Model Synergies Von Symbolisch zu Neural und Zurück: Erforschen von Wissen Graphen-Große Sprachmodell-Synergien 从符号到神经和后背:探索知识图集-大型语言模式协同作用 2506.09566v1
  • 230 06-11 Meaningless is better: hashing bias-inducing words in LLM prompts improves performance in logical reasoning and statistical learning Sinnlos ist besser: Hasching-Bias-induzierende Wörter in LLM-Prompten verbessern die Leistung in logischer Argumentation und statistischem Lernen 无意义的更好:在LLM 提示中匆忙的带有偏见的词句会改善逻辑推理和统计学习的绩效 2411.17304v2
  • 231 06-11 Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length Nicht zu vergessen: Proaktive Interferenz zeigt Arbeitsspeichergrenzen in LLMs jenseits von Kontextlängen 无法忘却:前摄干扰揭示LLM在上下文长度之外的工作记忆限制 2506.08184v2
  • 232 06-11 Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation Multi-Party überwacht Feinabstimmung von Sprachmodellen für die Mehrparteien-Dialog-Generierung 多党监督多党监督的多党对话生成语言模式微调 2412.05342v5
  • 233 06-11 ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts ClimateViz: Benchmark für die statistische Begründung und Faktenverifizierung auf wissenschaftlichen Charts ClimcViz:科学图表统计理由和事实核查基准 2506.08700v2
  • 234 06-11 Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language Auf dem Weg zu offenem Stiftungs-Sprachenmodell und Corpus für Mazedonier: Eine Sprache mit geringen Ressourcen 走向马其顿开放基础语言模式:一种低资源语言 2506.09560v1
  • 235 06-11 Gender Bias in English-to-Greek Machine Translation Geschlecht Bias in Englisch-Griechische Maschinelle Übersetzung 英文到希腊语机器翻译中的性别偏见 2506.09558v1
  • 236 06-11 Societal AI Research Has Become Less Interdisciplinary Gesellschaftliche KI-Forschung ist weniger interdisziplinäre geworden 社会AIAI 研究的跨学科程度越来越低 2506.08738v2
  • 237 06-11 MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions MEDUSA: Multimodale Deep Fusion Multi-Stage Training Framework zur Sprachemotionserkennung unter naturalistischen Bedingungen MEDUS: 一种在自然条件下承认言论情感的多式深层融合多层次培训框架 2506.09556v1
  • 238 06-11 KG-Infused RAG: Augmenting Corpus-Based RAG with External Knowledge Graphs KG-infundierte RAG: Augmenting Corpus-based RAG mit externen Wissensgraphen KG 输入的 RAG: 使用外部知识图增强以公司为基础的RAG 2506.09542v1
  • 239 06-11 Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning Lingshu: Ein generalistisches Stiftungsmodell für ein einheitliches multimodales medizinisches Verständnis und Vernunft Lingshu:通用主义基金会统一多式联运医疗理解和理性模式模式 2506.07044v3
  • 240 06-11 GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models GenBreak: Red Teaming Text-to-Image Generatoren mit großen Sprachmodellen GenBreak:使用大语言模型的红色组合文字到图像发电机 2506.10047v1
  • 241 06-11 Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models Athena: Multimodale Vernunft mit dateneffizienten Prozess-Reward-Modellen verbessern Athena:利用数据高效流程奖励模式加强多式联运理由 2506.09532v1
  • 242 06-11 AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP AraReasoner: Bewertung von mit Gründen versehenen LLMs für arabische NLP AraReasoner:为阿拉伯文国家实验室方案评价基于合理理由的LLMs 2506.08768v2
  • 243 06-11 Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs Überzeugen Sie sich über das, was Sie sehen: Sprachvorherige in Visions-Token für eine effiziente, geführte Dekodierung von LVLMs enthüllen 重新审视你所看到的:在 “ 高效制导LVLMs代谢 “ 的愿景中先解密语言 2506.09522v1
  • 244 06-11 You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks Sie sind, was Sie sagen: Ausnutzen sprachlicher Inhalte für VoicePrivacy Attacks 你说的就是: 利用语言内容进行语音隐私攻击 2506.09521v1
  • 245 06-11 ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning ReasonMed: Ein 370K Multi-Agent Generated Dataset zur Verbesserung der medizinischen Vernunft 理由:用于推进医疗理由的A370K多代理生成数据集 2506.09513v1
  • 246 06-11 Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning Gib mir FP32 oder gib mir den Tod? Herausforderungen und Lösungen für reproduzierbare Vernunft 给我FP32还是给我死亡? 2506.09501v1
  • 247 06-11 EnrichEvent: Enriching Social Data with Contextual Information for Emerging Event Extraction EnrichEvent: Anreicherung sozialer Daten mit kontextuellen Informationen zur Gewinnung neuer Ereignisse 丰富事件:丰富社会数据,提供新事件采掘时的背景信息 2307.16082v8
  • 248 06-11 CiteFusion: An Ensemble Framework for Citation Intent Classification Harnessing Dual-Model Binary Couples and SHAP Analyses CiteFusion: Ein Ensemble-Framework für Citation Intent Classification Harnessing Dual-Model Binäre Paare und SHAP-Analysen CiteFusion:双重模式二进制组合和SHAP分析引文内分级利用集合框架 2407.13329v3
  • 249 06-11 Bridging Online Behavior and Clinical Insight: A Longitudinal LLM-based Study of Suicidality on YouTube Reveals Novel Digital Markers Bridging Online Behavior and Clinical Insight: Eine längsseitige LLM-basierte Studie zur Suizidalität auf YouTube enthüllt neuartige digitale Marker 架桥在线行为和临床透视:YouTube Reveals新数字标记上基于纵向LLM的自杀性纵向LM研究 2506.09495v1
  • 250 06-11 CASPER: A Large Scale Spontaneous Speech Dataset CASPER: Ein großer Spontane-Sprachdatensatz 大规模自应自发语音数据集 2506.00267v3
  • 251 06-11 CiteFix: Enhancing RAG Accuracy Through Post-Processing Citation Correction CiteFix: Verbesserung der RAG-Genauigkeit durch Nachbearbeitung der Citation Correction CiteFix:通过处理后引用校正,加强RAG的准确性 2504.15629v2
  • 252 06-11 Rethinking Text-based Protein Understanding: Retrieval or LLM? Rethinking Text-basierte Protein-Verständnis: Retrieval oder LLM? 重新思考基于文本的蛋白质理解:检索还是LLM? 2505.20354v3
  • 253 06-11 Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing Erhöhung der Messlatte: Ermittlung der Werte von großen Sprachmodellen durch Generative Evolving-Tests 提高律师资格:通过创造演变测试调查大语言模式的价值 2406.14230v5
  • 254 06-11 Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment Persona-Richter: Personalisierte Ausrichtung großer Sprachmodelle über Token-Level-Selbsturteil 法官:通过Token级自评大语言模式个人化调整 2504.12663v2
  • 255 06-11 Revisiting Self-Consistency from Dynamic Distributional Alignment Perspective on Answer Aggregation Überprüfung der Selbstkonsistenz aus der Perspektive der dynamischen Verteilungsausrichtung auf die Antwortaggregation 从动态分配一致的视角重新审视自我一致性 2502.19830v2
  • 256 06-11 Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms Auf dem Weg zur Überbrückung des Lohn-Generation-Gaps in direkten Ausrichtungsalgorithmen 争取弥合直接调整数值中的奖励-评分差距 2506.09457v1
  • 257 06-11 Irony Detection, Reasoning and Understanding in Zero-shot Learning Ironie-Erkennung, Vernunft und Verständnis im Null-Schuss-Lernen ” 零弹学习 “ 中的讽刺探测、解释和理解 2501.16884v2
  • 258 06-11 Can LLMs Ground when they (Don’t) Know: A Study on Direct and Loaded Political Questions Kann LLMs Boden, wenn sie (nicht) wissen: Eine Studie über direkte und geladene politische Fragen 当他们(不知道)知道:关于直接和上载政治问题的研究 2506.08952v2
  • 259 06-11 Learning Obfuscations Of LLM Embedding Sequences: Stained Glass Transform Verschlimmerungen von LLM-Embedding-Sequenzen lernen: Glas-Reformation 学习LLM 嵌入序列的学习障碍:染色玻璃变形 2506.09452v1
  • 260 06-11 UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs UniToMBench: Integration von Perspektiven zur Verbesserung der Theorie des Geistes in LLMs Unitombennch:将透视-提高思维理论纳入LLMs 2506.09450v1
  • 261 06-11 OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary OWSM-Biasing: Kontextualisieren von Open Whisper-Style Speech Models für automatische Spracherkennung mit dynamischem Vokabeln OWSM-Biasing:使用动态词汇表自动语音识别的开放耳语语音模型背景化 2506.09448v1
  • 262 06-11 GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture GigaChat Familie: Effiziente russische Sprachmodellierung durch Mixtur von Experten Architektur GigaChat家庭:通过混合专家建筑结构,高效率的俄语建模 2506.09440v1
  • 263 06-11 Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting Verbessertes Supervised-Fine-Tuning für große Sprachmodelle, um Katastrophenvergessenheit zu vermeiden 改进对大语言模型改进监督的微调,以缓解灾难性遗忘 2506.09428v1
  • 264 06-11 Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models Vertrauen ist alles, was Sie brauchen: Wenig heißes RL Feintuning von Sprachmodellen 信心是你们需要的:很少点点RL精细的语言模式 2506.06395v3
  • 265 06-11 Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings Versteckt in Plain Sight: Auswertung der Erkennungsfähigkeiten von LLMs in multimodalen Einstellungen 隐藏在浅视中:评估多模式环境中LLMs的欺骗性探测能力 2506.09424v1
  • 266 06-11 GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment GenARM: Reward-geführte Generation mit autoregressivem Reward-Modell für Testzeitausrichtung GENARM: 具有自动递减奖益模型的奖赏制向导生成(测试时间对齐自动递减奖模型) 2410.08193v4
  • 267 06-11 A Call for Collaborative Intelligence: Why Human-Agent Systems Should Precede AI Autonomy Ein Aufruf für kollaborative Intelligenz: Warum Menschen-Agenten-Systeme der KI-Autonomie vorausgehen sollten 呼吁合作情报:为什么人力-物力系统应先于自主 2506.09420v1
  • 268 06-11 AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling AcTracer: Aktives Testen von Großsprachenmodellen durch Multi-Stage-Sampling AcTracler:通过多标准抽样积极测试大语言模型 2408.03573v2
  • 269 06-11 Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers Sprachsynthese durch Entrollen des Diffusionsprozesses mit neuralen Netzwerkebenen 使用神经网络图层解开传播进程 2309.09652v5
  • 270 06-11 PGDA-KGQA: A Prompt-Guided Generative Framework with Multiple Data Augmentation Strategies for Knowledge Graph Question Answering PGDA-KGQA: Ein prompt geführtes Generatives Framework mit Multiple Data Augmentation Strategies for Knowledge Graph Question Answering PGDA-KGQA: 快速指导生成框架,包括多数据增强战略,用于知识图表问题解答 2506.09414v1
  • 271 06-11 Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models Self-Steering Optimization: Autonome Preference-Optimierung für große Sprachmodelle 自制最佳优化:自主优化大语言模式 2410.17131v2
  • 272 06-11 Rethinking Diverse Human Preference Learning through Principal Component Analysis Umdenken von unterschiedlichen menschlichen Vorlieben Lernen durch Hauptkomponentenanalyse 通过主要构成部分分析重新思考多种人类优先学习 2502.13131v2
  • 273 06-11 Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items Wertportrait: Bewertung der Werte von Sprachmodellen durch psychometrische und ökologisch gültige Elemente 价值纵向:通过心理和生态有效项目评估语言模型的价值 2505.01015v3
  • 274 06-11 Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features Vergleich von Human- und LLM-Korrekturlesen in L2-Schrift: Auswirkungen auf lexikalische und syntaktische Merkmale L2 著述中比较人与LL2校对法和LLM校对法:对词汇和综合特征的影响 2506.09021v2
  • 275 06-11 UD-KSL Treebank v1.3: A semi-automated framework for aligning XPOS-extracted units with UPOS tags UD-KSL Treebank v1.3: Ein semi-automatisiertes Framework zur Ausrichtung von XPOS-extrahierten Einheiten mit UPOS-Tags UD-KSL 树库 v1.3:一个半自动框架,使XPOS抽取的单位与 UPOS 标记相匹配 2506.09009v2
  • 276 06-11 Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models Token Constraint Decoding verbessert Robustheit bei der Fragebeantwortung für große Sprachmodelle 对大语言模型问题解答的强力 2506.09408v1
  • 277 06-11 Code-Switching Curriculum Learning for Multilingual Transfer in LLMs Code-Switching Curriculum Learning für den mehrsprachigen Transfer in LLMs 以LLMM为多种语言转让进行代码转换课程学习 2411.02460v2
  • 278 06-11 Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding Code-Switching Red-Teaming: LLM-Bewertung für Sicherheit und mehrsprachiges Verständnis 编码转换红队:LLM 安全和多语文理解评价 2406.15481v3
  • 279 06-11 DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing DRESS: Datensatz für Rubric-basierte Essay-Scoring auf EFL-Writing DREsS: 以Ruric为基础的ELFL 写法日志分类的数据集 2402.16733v3
  • 280 06-11 AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI) AdversariaL attacK sAfety alignment(ALKALI): Sicherung von LLMs durch GRACE: Geometrische Darstellung-Bewusst Kontrastive Verbesserung- Einführung von Adversarial Vulnerability Quality Index (AVQI) Adversarial attacK sAfety alignment(ALKALI):通过GRACE来保护LLMS:几何代表-提高软件对抗性增强-采用反脆弱性质量指数(AVQI) 2506.08885v2
  • 281 06-11 Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach Zuverlässige Proof-Generation mit LLMs: Ein neuro-symbolischer Ansatz 利用LLM实现可靠的证明生成:神经-符号方法 2505.14479v3
  • 282 06-11 Chem42: a Family of chemical Language Models for Target-aware Ligand Generation Chem42: eine Familie chemischer Sprachmodelle für die zielbewusste Ligand-Generation Chem42: “ 目标觉知长者 “ 化学语言模式家庭 2503.16563v2
  • 283 06-11 A Hierarchical Probabilistic Framework for Incremental Knowledge Tracing in Classroom Settings Ein hierarchisches probabilistisches Framework für inkrementelles Wissen in Klassenzimmereinstellungen 课堂环境中递增知识追踪的等级概率框架 2506.09393v1
  • 284 06-11 Comparing human and LLM politeness strategies in free production Vergleich zwischen menschlichen und LLM-Höflichkeitsstrategien in der freien Produktion 在自由生产方面比较人和LLM的礼貌战略 2506.09391v1
  • 285 06-11 MMREC: LLM Based Multi-Modal Recommender System MMREC: LLM-basiertes Multi-Modal-Empfehlungssystem MMREC: 以LLM为基础的多模式推荐系统 2408.04211v2
  • 286 06-11 Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey Verbesserung von Code LLMs mit Verstärkungslernen in der Codegenerierung: Eine Umfrage 增强法典制定中强化学习的加强守则LLMS 代码生成:调查 2412.20367v4
  • 287 06-11 Binary classification for perceived quality of headlines and links on worldwide news websites, 2018-2024 Binäre Klassifizierung für wahrgenommene Qualität von Schlagzeilen und Links auf weltweiten News-Websites, 2018-2024 2018-2024年全世界新闻网站标题和链接感知质量的二元分类 2506.09381v1
  • 288 06-11 Measuring What Makes You Unique: Difference-Aware User Modeling for Enhancing LLM Personalization Messen, was Sie einzigartig macht: Difference-Aware User Modeling zur Verbesserung der LLM Personalisierung 衡量什么使你独一之处:提高LLM个性化的差异型用户模型 2503.02450v3
  • 289 06-11 Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering Entriegelung der allgemeinen Long Chain-of-Thought-Reasoning-Fähigkeiten großer Sprachmodelle über Representation Engineering 通过表征工程释放大语言模型的通用长思维链推理能力 2503.11314v2
  • 290 06-11 Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining Modality-Balancing Preference Optimierung großer multimodaler Modelle durch adversariales Negative Mining 通过对抗性负样本挖掘对大型多模态模型进行模态平衡偏好优化 2506.08022v2
  • 291 06-11 DecIF: Improving Instruction-Following through Meta-Decomposition DecIF: Verbesserung der Instruktions-Folge durch Meta-Dekomposition DecIF:通过元分解改进指令遵循 2505.13990v2
  • 292 06-11 CoLMbo: Speaker Language Model for Descriptive Profiling CoLMbo: Speaker-Sprachenmodell für Descriptive Profiling CoLMbo:用于描述性画像的说话人语言模型 2506.09375v1
  • 293 06-11 Low-resource domain adaptation while minimizing energy and hardware resource consumption Low-Resource-Domänenanpassung bei Minimierung des Energie- und Hardware-Ressourcenverbrauchs 低资源领域的适应,同时尽量减少能源和硬件资源的消耗 2506.08433v2
  • 294 06-11 SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner SWE-Flow: Software-Engineering-Daten in einem testgetriebenen Manner synthetisieren SWE-Flow:将软件工程数据合成成一种试验驱动方式 2506.09003v2
  • 295 06-11 COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content COGENT: Ein Curriculum-orientierter Rahmen zur Generierung von stufengerechten Bildungsinhalten COGENT: 以课程为导向的生成适合年级的教育内容框架 2506.09367v1
  • 296 06-11 Taming SQL Complexity: LLM-Based Equivalence Evaluation for Text-to-SQL Zähmen von SQL-Komplexität: LLM-basierte Gleichwertigkeitsbewertung für Text-zu-SQL SQL复杂程度:基于LLM的文本到SQL等效评价 2506.09359v1
  • 297 06-11 Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models Automatische Pseudo-Harmful Prompt Generation zur Bewertung falscher Ablehnungen in großen Sprachmodellen 自动生成用于评价大语言模式虚假拒绝的优多有害快速自动生成 2409.00598v2
  • 298 06-11 SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems SAFEFLOW: Ein Grundsatzprotokoll für vertrauenswürdige und transaktionsunabhängige Agentensysteme 安全安全自主:可信赖和交易自主代理系统原则议定书 2506.07564v3
  • 299 06-11 Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding Milderung der hinteren Salience-Abschwächung in Langkontext-LLMs mit positionaler Kontrastdekodierung 具有位置相矛盾下限作用的长文本液溶液中减轻减轻水量的盐分 2506.08371v2
  • 300 06-11 LID Models are Actually Accent Classifiers: Implications and Solutions for LID on Accented Speech LID-Modelle sind eigentlich Accent Klassifikatoren: Implikationen und Lösungen für LID auf Accented Speech LID 模型实际上是浓度分级器:LID对发声的影响和解决办法 2506.00628v2
  • 301 06-11 DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts DIVE into MoE: Diversity-Enhanced Rekonstruktion großer Sprachmodelle von Dense in Mixture-of-Experts DIVE into MoE:多样性增强的大语言模型重建,从稠密模型到专家混合模型 2506.09351v1
  • 302 06-11 Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test Prüfung von Black-Box LLM APIs mit einem Rank-Based Uniformity Test 采用级级统一测试的黑色Box LLM API 2506.06975v3
  • 303 06-11 Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis Warum sind Web-KI-Agenten anfälliger als Standalone LLMs? Eine Sicherheitsanalyse 为什么互联网上AI代理机构比独立的LLMs更加脆弱? 2502.20383v2
  • 304 06-11 OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment OmniDRCA: Paralleles Sprach-Text-Stiftungsmodell über Dual-Resolution-Sprachdarstellungen und Kontrastive Ausrichtung OmniDRCA:通过双分辨率发言代表和对立对调并行发言-文本基金会模式 2506.09349v1
  • 305 06-11 Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models Qwen3 Embedding: Advancing Text Embedding and Reranking by Foundation Models Qwen3 嵌入式:通过基础模型推进文本嵌入和重新排位 2506.05176v3
  • 306 06-11 Ming-Omni: A Unified Multimodal Model for Perception and Generation Ming-Omni: Ein einheitliches multimodales Modell für Wahrnehmung und Erzeugung Ming-Omni:感知和一代的统一多模式模型 2506.09344v1
  • 307 06-11 Latent Multi-Head Attention for Small Language Models Latent Multi-Head Aufmerksamkeit für kleine Sprachmodelle 对小型语言模式的多方关注 2506.09342v1
  • 308 06-11 RePO: Replay-Enhanced Policy Optimization RePO: Replay-erweiterte Politikoptimierung REPO: 重放强化政策优化 2506.09340v1
  • 309 06-11 Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference Sentence-Level-Reward-Modell kann besser für die Ausrichtung LLM aus menschlichen Präferenzen Generalisieren 判决一级奖赏模式可以普遍化,更有利于将人类特惠的LLM与LM相匹配 2503.04793v4
  • 310 06-11 GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation GraphRAG-Bench: Herausfordernde Domain-spezifische Begründung für die Auswertung der Graph Retrieval-Augmented Generation 图图RAG-Bench:评估图回收-提款一代的有挑战性域特定原因 2506.02404v2
  • 311 06-11 LIFEBench: Evaluating Length Instruction Following in Large Language Models LIFEBench: Bewertung der Befolgung von Längenanweisungen in großen Sprachmodellen LIFEBench: 评估大语言模型对长度指令的遵循 2505.16234v2
  • 312 06-11 Natural Language Guided Ligand-Binding Protein Design Natürliche Sprache geführt Ligand-Binding Protein Design Ligand-Binding 蛋白设计 2506.09332v1
  • 313 06-11 Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation 多方语言模式:推进合作、协调和适应 2506.09331v1
  • 314 06-11 Towards Efficient and Effective Alignment of Large Language Models Auf dem Weg zu einer effizienten und effektiven Ausrichtung großer Sprachmodelle 实现高语言模式的高效和有效统一 2506.09329v1
  • 315 06-11 Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions? Unterscheiden sich große Visions-Sprachenmodelle zwischen den tatsächlichen und scheinbaren Eigenschaften von Illusionen? 大型视觉语言模型是否区分幻象的实际特征和表面特征? 2506.05765v2
  • 316 06-11 Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic Critic-CoT: Erhöht die Argumentationsfähigkeiten des großen Sprachmodells durch Chain-of-Thoughts Critic Critic-CoT:通过思维链增强大型语言模型的推理能力 2408.16326v3
  • 317 06-11 CC-RAG: Structured Multi-Hop Reasoning via Theme-Based Causal Graphs CC-RAG: Strukturierte Multi-Hop-Reasoning über themenbasierte Kausalgraphen CC-RAG:通过基于主题的因果图结构化多功能原因 2506.08364v2
  • 318 06-11 OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation OWL: Optimiertes Workforce-Learning für die allgemeine Multi-Agent-Hilfe in der Real-World Task Automation OWL: 优化劳动力学习,为现实世界任务自动化提供一般多机构援助 2505.23885v2
  • 319 06-11 Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies Beschleunigung der LLM-Inferenz mit verlustfreier spekulativer Dekodierung von Algorithmen für heterogene Vokabeln 加速LLM 超异基因词汇表无损失投机代谢值的加速LLM推导值 2502.05202v3
  • 320 06-11 Alzheimer’s Dementia Detection Using Perplexity from Paired Large Language Models Alzheimer’s Dementia Detection Using Perplexity from Paired Large Language Models 阿兹海默氏病痴呆症检测,使用Paired大语言模型的复度 2506.09315v1
  • 321 06-11 AAD-LLM: Neural Attention-Driven Auditory Scene Understanding AAD-LLM: Neurales aufmerksamkeitsgetriebenes Auditory Scene Understanding AAD-LLM: 神经引人注意 – – 引导观众了解场景 2502.16794v3
  • 322 06-11 Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience Hören, Chatten und Remixen: Textgeführtes Soundscape Remixen für verbesserte Hörerfahrung 监听、聊天和混音:加强监听经验的文本引导声音立体混音 2402.03710v2
  • 323 06-11 AbstRaL: Augmenting LLMs’ Reasoning by Reinforcing Abstract Thinking AbstRaL: Augmenting LLMs’ Reasoning durch Stärkung des abstrakten Denkens AbstRaL:通过强化抽象思维增强LLM的推理能力 2506.07751v2
  • 324 06-10 (2) $(RSA)^2$: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding $(RSA)^2$: Ein rhetorisch-strategisch-bewusster Rationaler Sprachgesetzrahmen für das figurative Sprachverständnis $(RSA) $2美元:一个用于模拟语言理解的 “ 路语-战略-报警有理演讲法 “ 框架 2506.09301v1
  • 325 06-10 AMELI: Enhancing Multimodal Entity Linking with Fine-Grained Attributes AMELI: Verbesserung der Multimodalität der Verbindung mit feinkörnigen Attributen AMELI: 增强与精细所得属性挂钩的多式联运实体 2305.14725v2
  • 326 06-10 UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench UTBoost: Strenge Bewertung von Coding Agents auf SWE-Bench UTBoost: 严格评价SWE-Bench上的编码代理 2506.09289v1
  • 327 06-10 EMMA: Efficient Visual Alignment in Multi-Modal LLMs EMMA: Effiziente visuelle Ausrichtung in Multi-Modal LLMs EMMA:多模式LMS中高效视觉协调 2410.02080v2
  • 328 06-10 Archon: An Architecture Search Framework for Inference-Time Techniques Archon: Ein Architektur-Suchrahmen für Inferenz-Zeit-Techniken Archon: 推断-时间技术建筑搜索框架 2409.15254v6
  • 329 06-10 ThinkQE: Query Expansion via an Evolving Thinking Process ThinkQE: Abfrageerweiterung über einen sich entwickelnden Denkprozess 思考QE:通过不断演变的思考进程扩大查询 2506.09260v1
  • 330 06-10 Self-Anchored Attention Model for Sample-Efficient Classification of Prosocial Text Chat Selbsternanntes Aufmerksamkeitsmodell zur stichprobenmäßigen Klassifizierung von prosozialen Textchats 优社会性文本聊天抽样有效分类自作自作主的注意模式 2506.09259v1
  • 331 06-10 Token Perturbation Guidance for Diffusion Models Token Perturbation Guidance für Diffusionsmodelle 扩散模型的 Token 扰动指南 2506.10036v1
  • 332 06-10 Extrapolation by Association: Length Generalization Transfer in Transformers Extrapolation nach Verband: Länge Generalisierung Transfer in Transformatoren 协会的外推法:变异器中时间的通用转移 2506.09251v1
  • 333 06-10 Synthesis by Design: Controlled Data Generation via Structural Guidance Synthese nach Design: Kontrollierte Datengenerierung über strukturelle Leitlinien 按设计分列的综合报告:通过结构指导控制的数据生成 2506.07664v2
  • 334 06-10 StochasTok: Improving Fine-Grained Subword Understanding in LLMs StochasTok: Verbesserung des feinkörnigen Unterwortverständnisses in LLMs StochasTok:改进在LLM中精美的子字理解 2506.01687v2
  • 335 06-10 A Technique for Isolating Lexically-Independent Phonetic Dependencies in Generative CNNs Eine Technik zur Isolierung lexikalisch unabhängiger phonetischer Abhängigkeiten in generativen CNNs 在生成式CNN中分离独立于词汇的语音依赖关系的技术 2506.09218v1
  • 336 06-10 SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research SimClass: Ein Klassenzimmer-Sprachdatensatz, der über Game Engine Simulation für automatische Spracherkennungsforschung erzeugt wird SimClass:通过游戏引擎模拟生成的课堂语音数据集,用于自动语音识别研究 2506.09206v1
  • 337 06-10 Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs Kann LLMs Dolmetschen und Leverage strukturierte sprachliche Repräsentationen? Eine Fallstudie mit AMRs LLMs 能够解释和利用结构化语言代表吗? 2504.04745v3
  • 338 06-10 Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems Können LLMs zuverlässige Testfallgeneratoren generieren? Eine Studie zu Wettbewerbs-Level-Programmierungsproblemen LLM能生成可靠的测试用例生成器吗?竞赛级编程问题研究 2506.06821v2
  • 339 06-10 PHRASED: Phrase Dictionary Biasing for Speech Translation PHRASED: Phrase Dictionary Biasing for Speech Translation PHRASED: 语音翻译的词典比喻 2506.09175v1
  • 340 06-10 The Curious Language Model: Strategic Test-Time Information Acquisition Das Kuriose Sprachmodell: Strategische Test-Zeit-Informationsbeschaffung 好奇语言模式:战略试验时间信息采集 2506.09173v1
  • 341 06-10 Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search Verbesserung der LLM-Agentenplanung mit In-Context Learning über Atomic Fact Augmentation und Lookahead Search 通过原子事实增强和目光头搜索改进LLM Agent 规划,通过原子事实增强和目光头搜索进行文中学习 2506.09171v1
  • 342 06-10 Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models Multimodale Inkonsistenz-Reasoning (MMIR): Ein neuer Maßstab für multimodale Reasoning-Modelle 多式联运不一致理由:多式联运理由理由模型的新基准 2502.16033v3
  • 343 06-10 ICONS: Influence Consensus for Vision-Language Data Selection ICONS: Einflusskonsens für Vision-Sprachen-Datenauswahl ICONS: 愿景语言数据选择影响共识 2501.00654v3
  • 344 06-10 Adversarial Text Generation with Dynamic Contextual Perturbation Adversariale Textgenerierung mit dynamischer kontextueller Störung 具有动态环境扰动作用的反向生成文本 2506.09148v1
  • 345 06-10 LLM-as-a-qualitative-judge: automating error analysis in natural language generation LLM-as-a-qualitative-Richter: Automatisierung der Fehleranalyse bei der Generierung natürlicher Sprachen LLM-as-as-法官法官:在自然语言生成中进行自动误差分析 2506.09147v1
  • 346 06-10 Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better Autoregressive Semantische visuelle Rekonstruktion hilft VLMs besser verstehen 自动递减语义视觉重建帮助VLMs更好地了解 2506.09040v1
  • 347 06-10 Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning Router-R1: Lehren von LLMs Multi-Round Routing und Aggregation durch Verstärkungslernen Router-R1:通过强化学习教LLM进行多轮路由与聚合 2506.09033v1
  • 348 06-10 JuStRank: Benchmarking LLM Judges for System Ranking JuStRank: Benchmarking der LLM-Richter für Systemranking JuStRank:为系统评分确定LLM法官基准 2412.09569v2
  • 349 06-10 e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs e3: Erforschen lernen ermöglicht Extrapolation von Test-Time Compute für LLMs e3: 学习探索以利对LLMM的试验时间计算进行外推计算 2506.09026v1
  • 350 06-10 High-Throughput Phenotyping of Clinical Text Using Large Language Models Hochdurchsatz-Phenotypisierung von klinischem Text mit großen Sprachmodellen 使用大语言模型的临床文本的高通量测算 2408.01214v2
  • 351 06-10 Learning to Reason Across Parallel Samples for LLM Reasoning Über parallele Proben für LLM-Reasoning nachdenken LLM 理由的平行抽样学习 2506.09014v1
  • 352 06-10 Employing self-supervised learning models for cross-linguistic child speech maturity classification Verwendung von selbstüberwachten Lernmodellen für die Klassifizierung der sprachübergreifenden Sprachreife von Kindern 采用自我监督的学习模式,进行跨语言儿童言语成熟程度分类 2506.08999v1
  • 353 06-10 CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation CAIRe: Kulturelle Zuordnung von Bildern durch Retrieval-Augmented Evaluation CAIRe:通过检索-增强评价对图像进行文化归属 2506.09109v1
  • 354 06-10 SensorLM: Learning the Language of Wearable Sensors SensorLM: Die Sprache der tragbaren Sensoren lernen 传感器LM:学习穿装传感器的语言 2506.09108v1
  • 355 06-10 Mechanistic Decomposition of Sentence Representations Mechanistische Zersetzung von Satzdarstellungen 判决代表的体力分解 2506.04373v2
  • 356 06-10 SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning SwS: Selbstbewusste Schwäche-getriebene Problemsynthese bei der Stärkung des Lernens für LLM-Reasoning SWS:自觉的弱点驱动的问题综述 2506.08989v1
  • 357 06-10 DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments DefenderBench: Ein Toolkit zur Bewertung von Sprachagenten in Cybersicherheitsumgebungen ” 捍卫堡垒:网络安全环境中评价语言代理工具工具包 “ 2506.00739v2
  • 358 06-10 Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models Quamba2: Robustes und skalierbares Post-Training Quantization Framework für selektive State Space Models Quamba2:有选择性的国家空间模型强力和可扩展的培训后量化框架 2503.22879v3
  • 359 06-10 Naturalistic Language-related Movie-Watching fMRI Task for Detecting Neurocognitive Decline and Disorder Naturalistische Sprache-bezogene Film-Watching fMRI Aufgabe für die Erkennung neurokognitiver Deklination und Störung FMRI 检测神经分辨衰竭和疾病任务 2506.08986v1
  • 360 06-10 PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips PrisonBreak: Gefängnisbrechende große Sprachmodelle mit weniger als fünfundzwanzig gezielten Bit-Flips 监狱破碎:破禁大语言模式,目标位翻页小于25个目标小于25个 2412.07192v2
  • 361 06-10 FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents FROST-EMA: Finnische und Russische Sprachdatensätze für elektromagnetische Artikulographiemessungen mit L1, L2 und imitierten L2 Accents FROST-EMA:芬兰和俄罗斯关于L1、L2和模拟L2弧度电磁电动天文学测量的口头语音数据集 2506.08981v1
  • 362 06-10 DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models DiffLM: Ansteuerbare Synthetische Datengenerierung über Diffusions-Sprachenmodelle DiffLM: 通过传播语言模型生成可控合成数据 2411.03250v2
  • 363 06-10 Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System Atomic-to-Compositional Generalization für mobile Agenten mit einem neuen Benchmark- und Scheduling-System 具有新的基准和日程安排系统的移动代理原子到组合泛化 2506.08972v1
  • 364 06-10 Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model Step-Audio-AQAA: ein vollständig von Ende zu Ende ausdrucksstarkes großes Audio-Sprachenmodell Step-Audio-AQAA:全端到端表达式大音频语言模型 2506.08967v1
  • 365 06-10 Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers Vortrainierte Sprachmodelle lernen bemerkenswert genaue Darstellungen von Zahlen 培训前语文模式的精确数字表示 2506.08966v1
  • 366 06-10 AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation AnnaAgent: Dynamisches Evolutions-Agentensystem mit Multi-Session-Speicher für realistische Suchersimulation AnnaAgent:动态进化代理系统,具有多会话记忆,用于现实求助者模拟 2506.00551v2
  • 367 06-10 FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation FaithfulRAG: Fact-Level-Konfliktmodellierung für kontexttreue retrieval-erweiterte Generation 忠实的拉戈:为本能的回溯性回溯性回溯性回溯性一代人建立实际冲突模式 2506.08938v1
  • 368 06-10 Can A Gamer Train A Mathematical Reasoning Model? Kann ein Gamer ein mathematisches Reasoning-Modell trainieren? 赌徒能训练数学理论模型吗? 2506.08935v1
  • 369 06-10 Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions Socratic-MCTS: Visuelles Reasoning zur Testzeit, indem man die richtigen Fragen stellt Socratic-MCTS:通过询问正确的问题进行测试时视觉推理 2506.08927v1
  • 370 06-10 PropMEND: Hypernetworks for Knowledge Propagation in LLMs PropMEND: Hypernetworks für die Wissensverbreitung in LLMs PropMEND: 以LLMMs方式传播知识的超网络 2506.08920v1
  • 371 06-10 Dialect Normalization using Large Language Models and Morphological Rules Dialect Normalization mit großen Sprachmodellen und morphologischen Regeln 使用大语言模式和人文规则实现正常化 2506.08907v1
  • 372 06-10 Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection Verbesserung der Automatisierten Abhandlungswertung in Arabisch mit synthetischen Daten und Fehlerinjektion 用合成数据和错误喷射加强阿拉伯文自动日志扫描 2503.17739v2
  • 373 06-10 A Decomposition-Based Approach for Evaluating and Analyzing Inter-Annotator Disagreement Ein zersetzen-basierter Ansatz zur Bewertung und Analyse von Inter-Annotator-Uneinigkeit 一种基于分解的办法来评价和分析 2206.05446v2
  • 374 06-10 From Legal Texts to Defeasible Deontic Logic via LLMs: A Study in Automated Semantic Analysis Von Rechtstexten zur defasiblen Deontic Logic über LLMs: Eine Studie in der automatisierten semantischen Analyse 从法律文本到通过LLMM从法律文本到防御性 Deontic Lologic: 自动语义分析研究 2506.08899v1
  • 375 06-10 PlantBert: An Open Source Language Model for Plant Science PlantBert: Ein Open Source Sprachmodell für die Pflanzenwissenschaft 植物生物:植物科学开放源语言模型 2506.08897v1
  • 376 06-10 Poro 34B and the Blessing of Multilinguality Poro 34B und der Segen der Mehrsprachigkeit 波罗34B和多语言祝福 2404.01856v3
  • 377 06-10 Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers Zu groß zu denken: Kapazität, Erinnerung und Verallgemeinerung in vortrainierten Transformern 能力、记忆和在培训前变异器中的普及化 2506.09099v1
  • 378 06-10 Fusing Bidirectional Chains of Thought and Reward Mechanisms A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage Bidirektionale Ketten von Gedanken- und Belohnungsmechanismen zusammenführen Eine Methode zur Verbesserung von Frage-Antwort-Fähigkeiten von großen Sprachmodellen für chinesisches immaterielles Kulturerbe 利用思想和奖赏机制的双向双向两向链 提高中国非物质文化遗产大语言模式的回答问题能力的方法 2505.08167v4
  • 379 06-10 Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia Umgang mit Pitfalls bei der Prüfung von Praktiken automatischer Spracherkennungstechnologien: Eine Fallstudie von Menschen mit Aphasie 解决自动语音识别技术审计做法中的缺陷:阿法西亚人案例研究 2506.08846v1
  • 380 06-10 Position: Editing Large Language Models Poses Serious Safety Risks Position: Bearbeiten von großen Sprachmodellen stellt ernste Sicherheitsrisiken dar 职位:编辑大语言模型 2502.02958v2
  • 381 06-10 Advancing STT for Low-Resource Real-World Speech Förderung der STT für die Real-World-Rede mit geringem Ressourcenbedarf 推进低资源现实世界演讲的STT 2506.08836v1
  • 382 06-10 CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics CulturalFrames: Bewertung der kulturellen Erwartungshaltung in Text-zu-Image-Modellen und Evaluationsmetrics 文化框架:在文本到图像模型和评价计量中评估文化期望一致 2506.08835v1
  • 383 06-10 The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation Die Auswirkungen der Feinabstimmung in LLaMA auf Halluzinationen für benannte Entitätsextraktion in gesetzlicher Dokumentation LLaMA的微调对法律文件中命名实体提取的幻觉的影响 2506.08827v1
  • 384 06-10 Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study Cross-lingual Transfer in Programmiersprachen: Eine umfangreiche empirische Studie 方案语言的跨语言传译:广泛经验研究 2310.16937v3
  • 385 06-10 Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents Messen von Data Science Automation: Eine Umfrage an Evaluationstools für KI-Assistenten und -Agenten 衡量数据科学自动化:AI助理和代理人评价工具调查 2506.08800v1
  • 386 06-10 Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO Eintauchen in RL für Bildgenerierung mit CoT: Eine Studie über DPO vs. GRPO 深入研究结合CoT的图像生成强化学习:关于DPO与GRPO的研究 2505.17017v2
  • 387 06-10 In Praise of Stubbornness: An Empirical Case for Cognitive-Dissonance Aware Continual Update of Knowledge in LLMs In Lob der Stubbornness: Ein empirischer Fall für Kognitiv-Dissonanz Aware Continual Update of Knowledge in LLMs 赞美固态:关于知识在LLM中不断更新的认知与不相识经验案例 2502.04390v2
  • 388 06-10 Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval Optimierte Text-Embedding-Modelle und Benchmarks für die Amharische Passage Retrieval 阿姆光通过通过检索的最佳文本嵌入模型和基准 2505.19356v2
  • 389 06-10 Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch Silencing Empowerment, Allowing Bigotry: Auditierung der Moderation von Hass Rede auf Twitch 静默赋权,允许偏见:审计对抽搐者仇恨言论的温和程度 2506.07667v2
  • 390 06-10 TextAtari: 100K Frames Game Playing with Language Agents TextAtari: 100K Frames Spiel mit Sprachagenten TextAtari: 100K 框架游戏与语言代理游戏 2506.04098v2
  • 391 06-10 Paths to Causality: Finding Informative Subgraphs Within Knowledge Graphs for Knowledge-Based Causal Discovery Pfade zur Kausalität: Informative Subgraphen innerhalb von Wissensgraphen finden für wissensbasierte Kausale Entdeckung 因果路径:在知识型因果发现的知识图中查找信息子集 2506.08771v1
  • 392 06-10 Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models Generative psycho-lexikalische Vorgehensweise für den Aufbau von Wertsystemen in großen Sprachmodellen 在大语言模式中建立价值体系的产生心理-临床方法 2502.02444v5
  • 393 06-10 EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements EDINET-Bench: Bewertung von LLMs zu komplexen Finanzaufgaben anhand japanischer Finanzausweise EDINET-Bench:利用日本财务报表评估复杂金融任务LMs 2506.08762v1
  • 394 06-10 Enhancing Accuracy and Maintainability in Nuclear Plant Data Retrieval: A Function-Calling LLM Approach Over NL-to-SQL Verbesserung der Genauigkeit und Aufrechterhaltung des Datenrücklaufs von Kernkraftwerken: Ein funktionsorientierter LLM-Ansatz über NL-zu-SQL 提高核电厂数据检索的准确性和可维护性:NL-SQL的功能计算LLM法 2506.08757v1
  • 395 06-10 Factors affecting the in-context learning abilities of LLMs for dialogue state tracking Faktoren, die die Lernfähigkeiten von LLM im Kontext des Dialogs für die Zustandsverfolgung beeinflussen 影响通过对话的LLMLM公司在对话方面的内在学习能力的因素 2506.08753v1
  • 396 06-10 Big Help or Big Brother? Auditing Tracking, Profiling, and Personalization in Generative AI Assistants Große Hilfe oder großer Bruder? Auditierung von Tracking, Profilierung und Personalisierung in Generativen KI-Assistenten 大帮还是大大哥? 2503.16586v2
  • 397 06-10 Unlocking the Potential of Large Language Models in the Nuclear Industry with Synthetic Data Entsperren des Potenzials großer Sprachmodelle in der Kernindustrie mit synthetischen Daten 利用合成数据释放核工业大语言模型的潜力 2506.08750v1
  • 398 06-10 Towards Secure and Private Language Models for Nuclear Power Plants Auf dem Weg zu sicheren und privaten Sprachmodellen für Kernkraftwerke 争取实现核电厂安全和私用语言模式 2506.08746v1
  • 399 06-10 Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning Konsequente Pfade führen zur Wahrheit: Selbstvergeltendes Stärkungslernen für LLM-Vernünftigkeit 通向真相的一致道路:为LLM理由自评加强学习 2506.08745v1
  • 400 06-10 Iterative Corpus Refinement for Materials Property Prediction Based on Scientific Texts Iterative Corpus-Verfeinerung für Material-Eigenschaftsvorhersage auf der Grundlage wissenschaftlicher Texte 以科学文本为基础的材料财产预测材料性迭代公司精炼 2505.21646v2
  • 401 06-10 Improved LLM Agents for Financial Document Question Answering Verbesserte LLM-Agenten für die Beantwortung von Fragen zu Finanzdokumenten 改进财务报告问题回答问题LLM代理 2506.08726v1
  • 402 06-10 VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs? VIST-GPT: Sich in der Ära des Visual Storytelling mit LLMs niederlassen? 使用LLMS的视觉故事时代? 2504.19267v3
  • 403 06-10 How Do Multilingual Language Models Remember Facts? Wie erinnern sich Mehrsprachige Sprachmodelle an Fakten? 多语言模式如何记住事实? 2410.14387v3
  • 404 06-10 Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion Recognition Multi-Lehrer Sprachkenntnisse Destillation für mehrsprachige Sprachemotionserkennung 多语种语言-语言知识蒸馏促进多语种语言语音情感识别 2506.08717v1
  • 405 06-10 Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur 以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测 2506.08713v1
  • 406 06-10 Summarizing Speech: A Comprehensive Survey Zusammenfassung der Rede: Eine umfassende Umfrage 总结发言:全面调查 2504.08024v2
  • 407 06-10 Self-Training Elicits Concise Reasoning in Large Language Models Selbsttraining ruft prägnantes Reasoning in großen Sprachmodellen hervor 自训练引出大语言模型中的简明推理 2502.20122v3
  • 408 06-10 Brevity is the soul of sustainability: Characterizing LLM response lengths Kürze ist die Seele der Nachhaltigkeit: Charakterisierung von LLM-Antwortlängen 简洁是可持续性的灵魂:刻画LLM响应长度 2506.08686v1
  • 409 06-10 RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling RuleReasoner: Verstärkte regelbasiertes Reasoning via Domain-aware Dynamic Sampling 规则响应器:通过Domain-aware动态抽样加强基于规则的理由说明 2506.08672v1
  • 410 06-10 Compositional Causal Reasoning Evaluation in Language Models kompositorische Kausale Reasoning-Bewertung in Sprachmodellen 语文模式的构成原因原因原因评估 2503.04556v4
  • 411 06-10 Summarization for Generative Relation Extraction in the Microbiome Domain Zusammenfassung für Generative Relation-Extraktion im Microbiome-Bereich 微生物群域生成关系采掘概要 2506.08647v1
  • 412 06-10 TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning TableDreamer: Progressive und schwache Datensynthese von Scratch für die Table Instruction Tuning 表Dreamer:Scratch为表格指示图纸提供的渐进和弱点制导数据合成 2506.08646v1
  • 413 06-10 An Analysis of Hyper-Parameter Optimization Methods for Retrieval Augmented Generation Eine Analyse von Hyperparameter-Optimierungsmethoden für die retrieval Augmented Generation 对回收增加的一代的超光子辐射仪最佳优化方法的分析 2505.03452v2
  • 414 06-10 MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models MEMETRON: 大语言模型测试时响应优化的元启发式机制 2506.08643v1
  • 415 06-10 Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence Den Halluzinationscode in LVLMs mit Vision-aware Head Divergence knacken 以有远见和觉察力的正面差异在LVLMs中破坏《幻觉守则》 2412.13949v3
  • 416 06-10 NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples NaturalBench: Bewertung von Vision-Sprachen-Modellen auf natürlichen Adversarialproben 自然区:评估自然对立抽样的远景-语言模型 2410.14669v4
  • 417 06-10 Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs Annäherung an Dialog Staat Tracking über Aligning Speech Encoder und LLMs 通过对齐语音编码器和LLMs处理对话国家跟踪 2506.08633v1
  • 418 06-10 RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval RAISE: Verbesserung der wissenschaftlichen Vernunft in LLMs durch Schritt-für-Schritt-Retrieval RAISE:通过一步步检索加强LLMs的科学理由 2506.08625v1
  • 419 06-10 Evaluation empirique de la sécurisation et de l’alignement de ChatGPT et Gemini: analyse comparative des vulnérabilités par expérimentations de jailbreaks Empirische Bewertung der Absicherung und Ausrichtung von ChatGPT und Gemini: Vergleichende Analyse der Schwachstellen durch Jailbreak-Experimente ChatGPT和Gemini安全性与对齐的实证评估:通过越狱实验对漏洞进行对比分析 2506.10029v1
  • 420 06-10 Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models Hassvolle Person oder Hassvolles Modell? Untersuchung der Rolle der Personas bei der Hassredenerkennung durch große Sprachmodelle 《调查个人在用大语言模式探测仇恨言论中的作用》 2506.08593v1
  • 421 06-10 Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings Dichte Retriever können bei einfachen Abfragen scheitern: Enthüllen des Granularitäts-Dilemmas der Einbettungen 简单查询: 读取嵌入器的颗粒性三角形 2506.08592v1
  • 422 06-10 CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling CounselBench: Ein groß angelegter Experten-Evaluierungs- und Adversarial-Benchmark für große Sprachmodelle in der Mental Health Counseling 心理卫生咨询中大语言模式的大范围专家评价和反向基准 2506.08584v1
  • 423 06-10 How transformers learn structured data: insights from hierarchical filtering Wie Transformatoren strukturierte Daten lernen: Erkenntnisse aus der hierarchischen Filterung 变压器如何学习结构化数据:从上层过滤器获得的洞察力 2408.15138v3
  • 424 06-10 How Malicious AI Swarms Can Threaten Democracy Wie bösartige KI-Schwärme die Demokratie bedrohen können 恶意的AI集群能够如何威胁民主 2506.06299v2
  • 425 06-10 The Geometries of Truth Are Orthogonal Across Tasks Die Geometrien der Wahrheit sind orthogonal über Aufgaben hinweg 真理的地貌不同任务 2506.08572v1
  • 426 06-10 Can Large Language Models Invent Algorithms to Improve Themselves?: Algorithm Discovery for Recursive Self-Improvement through Reinforcement Learning Können große Sprachmodelle Algorithmen erfinden, um sich selbst zu verbessern?: Algorithmenentdeckung für rekursive Selbstverbesserung durch Stärkungslernen 大语言模型能创造出能改善自我的算法吗? :通过强化学习实现递归自我改进的算法发现 2410.15639v5
  • 427 06-10 Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world? Nachbarn und Verwandte: Wie spiegeln Spracheinbettungen sprachliche Zusammenhänge auf der ganzen Welt wider? 邻居和亲戚:语言嵌入如何反映世界各地的语言联系? 2506.08564v1
  • 428 06-10 Efficient Post-Training Refinement of Latent Reasoning in Large Language Models Effiziente Nachschulung zur Verfeinerung latenter Vernunft in großen Sprachmodellen 改进大语言模式中长期原因 2506.08552v1
  • 429 06-10 BiasGuard: A Reasoning-enhanced Bias Detection Tool For Large Language Models BiasGuard: Ein reasoning-enhanced Bias Detection Tool für große Sprachmodelle BiasGuard:为大语言模型提供的理由强化的Bias Bias探测工具 2504.21299v2
  • 430 06-10 FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs FairMT-Bench: Benchmarking Fairness für Multiturn-Dialoge in Konversations-LLMs FairMT-Bench:为多方向对话设定公平基准 2410.19317v2
  • 431 06-10 Exploring the Escalation of Source Bias in User, Data, and Recommender System Feedback Loop Erforschung der Eskalation von Source Bias in User, Data und Recommender System Feedback Loop 探索用户、数据、建议系统反馈圈中源比值源的上升 2405.17998v2
  • 432 06-10 Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering Meta-Adaptive Prompt-Destillation für wenige-heiße visuelle Frage-Antworten 用于少数浅光视觉问题解答的元适应性快速蒸馏 2506.06905v2
  • 433 06-10 Length-Induced Embedding Collapse in PLM-based Models Längeninduzierter Einbettungskollaps in PLM-basierte Modelle 以PLM为基础的模型中长期生成的嵌入式折叠 2410.24200v2
  • 434 06-10 A Survey on Long Text Modeling with Transformers Eine Umfrage über lange Textmodellierung mit Transformern 利用变形器模拟长文本调查 2302.14502v2
  • 435 06-10 The BS-meter: A ChatGPT-Trained Instrument to Detect Sloppy Language-Games Das BS-Meter: Ein ChatGPT-getrainiertes Instrument, um schlampige Sprachspiele zu erkennen BS 计: 探测斯洛文尼亚语地名的聊天式TT-训练工具 2411.15129v2
  • 436 06-10 A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization Hybride Architektur mit effizienter Feinabstimmung für abstrakte Patentdokumentzusammenfassung 简易专利文件摘要的高效精度计价混合结构 2503.10354v3
  • 437 06-10 CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations CoMuMDR: Code-gemischter multimodaler Multidomain-Korpus für Diskurs paRsing in Gesprächen CoMuMDR: 用于对话语篇解析的语码混合多模态多领域语料库 2506.08504v1
  • 438 06-10 Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design Bewertung ist alles, was Sie brauchen: Strategisches Überfordern von LLM-Gründerfähigkeiten durch Evaluationsdesign 评价是全你需要的:通过评价设计,从战略上压低LLM能力,通过评价设计提高LLM能力 2506.04734v2
  • 439 06-10 LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement LLaSE-G1: Förderung der Verallgemeinerungsfähigkeit für LLaMA-basierte Sprachverbesserung LLaSE-G1:激励基于LLaMA的语音增强的泛化能力 2503.00493v4
  • 440 06-10 DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs In Konflikte geraten: In suchgesteigerten LLMs widersprüchliche Quellen erkennen und bekämpfen 钻入冲突:发现和解决搜索中的冲突源 2506.08500v1
  • 441 06-10 Integration of Old and New Knowledge for Generalized Intent Discovery: A Consistency-driven Prototype-Prompting Framework Integration von altem und neuem Wissen für eine generalisierte Intent Discovery: Ein konsistentes prototype-prompting Framework 综合旧知识和新知识促进普遍化的本能发现:由一致性驱动的原型-促进框架 2506.08490v1
  • 442 06-10 EtiCor++: Towards Understanding Etiquettical Bias in LLMs EtiCor++: Auf dem Weg zum Verständnis von Etikette-Bias in LLMs EtiCor++:迈向理解LLM中的礼仪偏见 2506.08488v1
  • 443 06-10 Fairness is Not Silence: Unmasking Vacuous Neutrality in Small Language Models Fairness ist kein Schweigen: Leere Neutralität in kleinen Sprachmodellen entlarven 公平不是沉默:在小型语言模式中松散不显中性 2506.08487v1
  • 444 06-10 Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models Die automatische Auswertung der Bild-Text-Ausrichtung in Text-zu-Bild-Modellen neu denken 重新思考文本到图像模型图像文本对齐的自动评价 2506.08480v1
  • 445 06-10 Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$ Effiziente Kontextauswahl für Lang-Kontext QA: Keine Abstimmung, keine Iteration, nur adaptiv-$k$ 长文本QA的高效率背景选择:无调试、无迭接、仅可调适-k$ 2506.08479v1
  • 446 06-10 Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning Gefährliche Memes mit entkoppeltem Verständnis und geleiteter CoT-Vernunft erkennen 以分解理解和引导计算成本理由,检测有害金属 2506.08477v1
  • 447 06-10 Textual Unlearning Gives a False Sense of Unlearning Textuelles Nichtlernen gibt einen falschen Sinn für das Nichtlernen 文字不学习给人一个错误的不学习感 2406.13348v3
  • 448 06-10 SHARE: Shared Memory-Aware Open-Domain Long-Term Dialogue Dataset Constructed from Movie Script SHARE: Shared Memory-Aware Open-Domain Langzeitdialogdatensatz aus Movie Script SHARE: 从电影脚本建构的内存- 内存- 内存- 公用 Open- Domain 长期对话数据集 2410.20682v2
  • 449 06-10 Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence Jenseits von Induktionsköpfen: Meta-Lernen im Kontext führt zu Mehrphasigkeit 超越上岗前头:内文元学习促进多阶段巡回新出现 2505.16694v2
  • 450 06-10 Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language Flexible Werkzeugauswahl durch Low-dimensionale Attributausrichtung von Vision und Sprache 通过视力和语言的低维属性一致进行灵活工具选择 2505.22146v2
  • 451 06-10 Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews Verstehen Sie Anwendermeinungen von großen Sprachmodellen über LLM-Powered In-the-Moment User Experience Interviews 理解大语言模式用户通过LLM 带动的动态用户经验访谈对大语言模式的看法 2502.15226v2
  • 452 06-10 Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models Mixture of Decoding: Eine aufmerksamkeitsinspirierte Adaptive Decoding-Strategie zur Minderung von Halluzinationen in großen Vision-Sprachen-Modellen 解码混合:缓解大型视觉语言模型中幻觉的注意力启发式自适应解码战略 2505.17061v3
  • 453 06-10 Understanding Bias Reinforcement in LLM Agents Debate Verständnis der Bias-Verstärkung in LLM-Agenten-Debatte 了解LLLM代理商的强化申请 2503.16814v3
  • 454 06-10 A Survey on Large Language Models for Mathematical Reasoning Eine Umfrage zu großen Sprachmodellen für mathematische Vernunft 数学理由大语言模型调查 2506.08446v1
  • 455 06-10 Olica: Efficient Structured Pruning of Large Language Models without Retraining Olica: Effizientes strukturiertes Beschneiden großer Sprachmodelle ohne Umschulung Olica:无再培训的大型语言模型高效结构稳健 2506.08436v1
  • 456 06-10 FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation FlagEvalMM: Flexibler Rahmen für eine umfassende multimodale Modellbewertung FlagEvalMM:综合多模态模型评价灵活框架 2506.09081v1
  • 457 06-10 Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models Know-MRI: Ein Wissensmechanismus Revealer&Interpreter für große Sprachmodelle Know-MRI:大语言模型的知识机制传输和解释 2506.08427v1
  • 458 06-10 P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts P-反应:通过专门 LoRA 专家混合组合,综合个人经历专题-适应性反应 2406.12548v2
  • 459 06-10 DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data DISCO balanciert die Skalen: Adaptives Domain- und Schwierigkeitsgrad-Bewusst-Verstärkung-Lernen auf unausgewogenen Daten DISCO 平衡规模:关于数据不平衡的适应性域域和困难性软件强化学习 2505.15074v2
  • 460 06-10 Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens Große Sprachmodelle haben eine intrinsische Meta-Kognition, benötigen aber eine gute Linse 大语言模型具有内在的外形 – – 认知,但需要好的透镜 2506.08410v1
  • 461 06-10 RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation RLHS: Fehlausrichtung in RLHF mit Hindsight Simulation abmildern RLHS: 减轻RLHF的差错,采用速光模拟模拟 2501.08617v3
  • 462 06-10 mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks mSTEB: Massive mehrsprachige Bewertung von LLMs zu Sprach- und Textaufgaben mSTEB: 对LLM在语音和文本任务上进行大规模多语种评价 2506.08400v1
  • 463 06-10 Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR SSL-Diskrete Sprachfunktionen für Zipformer-basierte Kontext-ASR erkunden 探索基于 Zipef 的本地语背景 ASR 的 SSL 分辨语音特征 2409.08797v2
  • 464 06-10 TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression TL;DR: Zu lange, re-Gewichtung für effiziente LLM-Reasoning-Kompression TL;DR:太长,为高效 LLM 合理压缩而重新加权 2506.02678v2
  • 465 06-10 Reinforcement Learning Teachers of Test Time Scaling Verstärktes Lernen von Lehrern der Testzeitskalierung 测试时间尺度强化学习教师 2506.08388v1
  • 466 06-10 Reinforce LLM Reasoning through Multi-Agent Reflection Verstärkung der LLM-Vernunft durch Multi-Agent-Reflexion 通过多方机构反省说明理由 2506.08379v1
  • 467 06-10 EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models EIFBENCH: Extrem komplexe Instruktion nach Benchmark für große Sprachmodelle EIFBENCH: 按照大语言模式基准进行极为复杂的指示 2506.08375v1
  • 468 06-10 Draft-based Approximate Inference for LLMs Entwurfsbasierte annähernde Schlussfolgerung für LLM LLM 的基于草案的近似推论 2506.08373v1
  • 469 06-10 DEAL: Disentangling Transformer Head Activations for LLM Steering DEAL: Disentangling Transformer Head Aktivierungen für LLM Steering DEAL: 解耦用于LLM引导的Transformer注意力头激活 2506.08359v1
  • 470 06-10 Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning Text-Embeddings sollten Implizite Semantik erfassen, nicht nur Oberflächenbedeutung 文字嵌入应捕获隐含语义,而不仅仅是表面含义 2506.08354v1
  • 471 06-10 How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models Wie viel Führung: Adaptive Guidance in Classifier-Free-Guidance-Text-to-Vision-Diffusionsmodellen neu betrachtet 引导多少:重新审视无分类器引导的文本到视觉扩散模型中的自适应引导 2506.08351v1
  • 472 06-10 Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving Bewertung von LLMs auf multi-kognitiven Ebenen: Vom medizinischen Wissen zur szenariobasierten Problemlösung 跨多级别评价LLMs:从医学知识掌握到基于设想的解决问题 2506.08349v1
  • 473 06-10 SPBA: Utilizing Speech Large Language Model for Backdoor Attacks on Speech Classification Models SPBA: Sprachmodell für Hintertürangriffe auf Sprachklassifikationsmodelle nutzen SPBA: 使用大语言语音模式进行语言分类模式的后门攻击 2506.08346v1
  • 474 06-10 Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency Warten Sie, wir brauchen nicht zu “warten”! Entfernen von Gedanken-Tokens verbessert die Effizienz des Schlussfolgerns 等等,我们不需要”等等”!去除思考词元可提高推理效率 2506.08343v1
  • 475 06-10 RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors und Natural Language 关于音频、视频、嵌入传感器和自然语言的问答答案 2505.17114v2
  • 476 06-10 TPP-LLM: Modeling Temporal Point Processes by Efficiently Fine-Tuning Large Language Models TPP-LLM: Modellierung von Temporal Point-Prozessen durch effizientes Feintuning großer Sprachmodelle TPP-LLM:通过高效微调大语言模式模拟时点进程 2410.02062v2
  • 477 06-10 Curse of High Dimensionality Issue in Transformer for Long-context Modeling Fluch der Hochdimensionalitätsfrage im Transformer für die Langkontextmodellierung 变异器中高多维度问题的诅咒,用于长期建模 2505.22107v3
  • 478 06-10 Institutional Books 1.0: A 242B token dataset from Harvard Library’s collections, refined for accuracy and usability Institutionelle Bücher 1.0: Ein 242B-Token-Datensatz aus den Sammlungen der Harvard Library, verfeinert für Genauigkeit und Usability 机构手册1.0:哈佛图书馆收藏的242B象征性数据集,根据准确性和可用性加以完善 2506.08300v1
  • 479 06-09 (1) From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information? Von passiv zur aktiven Vernunft: Können große Sprachmodelle unter unvollständigen Informationen die richtigen Fragen stellen? 从被动到积极理由:大语言模式能够在不完整信息下询问正确的问题吗? 2506.08295v1
  • 480 06-09 From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium Von der Debatte zum Equilibrium: Glaube-getriebener Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium 从辩论到平衡:通过巴伊西亚纳什平衡进行 2506.08292v1
  • 481 06-09 BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models VerhaltenBox: Automatisierte Entdeckung von feinkörnigen Leistungsunterschieden zwischen Sprachmodellen 行为框:语言模型之间优异性能差异的自动发现 2506.02204v2
  • 482 06-09 Voice Impression Control in Zero-Shot TTS Sprach-Impression-Steuerung in Zero-Shot TTS 零热TTTS中的声音压抑控制 2506.05688v2
  • 483 06-09 Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in the Brain Instruction-Tuned Video-Audio Modelle Elucidate Funktionelle Spezialisierung im Gehirn 脑中精明功能性专业 2506.08277v1
  • 484 06-09 Retrieval-augmented systems can be dangerous medical communicators Retrieval-augmentierte Systeme können gefährliche medizinische Kommunikatoren sein 回收val-加压系统可以是危险的医疗通讯器 2502.14898v2
  • 485 06-09 Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints Verstärktes Lernen aus menschlichem Feedback mit hochkonfidenziellen Sicherheitseinschränkungen 在高度信任安全限制下,从人类反馈中加强学习 2506.08266v1
  • 486 06-09 SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs SK-VQA: Synthetische Wissensgenerierung auf Scale für die Schulung kontextverstärkter multimodaler LLMs SK-VQA:在培训背景增强型多模式LM培训规模内生成合成知识 2406.19593v2
  • 487 06-09 Automatic Generation of Inference Making Questions for Reading Comprehension Assessments Automatische Erzeugung von Schlussfolgerungen, die Fragen für Leseverständnis-Bewertungen stellen 阅读理解评估自动产生引文问题 2506.08260v1
  • 488 06-09 RADAR: Benchmarking Language Models on Imperfect Tabular Data RADAR: Benchmarking von Sprachmodellen auf unvollkommenen Tabellendaten RADAR:关于不完善表格数据的语言模式基准基准 2506.08249v1
  • 489 06-09 Self-Supervised Transformers as Iterative Solution Improvers for Constraint Satisfaction Selbstüberwachte Transformer als iterative Lösungsverbesserungen für Einschränkungen Zufriedenheit 自我主宰的变形器作为抑制满意度的循环解决方案改进器 2502.15794v2
  • 490 06-09 TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation TRAVEL: Training-freies Retrieval und Ausrichtung für Vision-und-Sprachen-Navigation TRAVEL:无培训检索和协调愿景和语言导航 2502.07306v2
  • 491 06-09 Learning to Reason via Mixture-of-Thought for Logical Reasoning Mit Vernunft über Mixture-of-Thought für logische Vernunft lernen 通过对逻辑原因的混合研究学习理性 2505.15817v2
  • 492 06-09 Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning Kann KI die Wissenschaft bestätigen? Benchmarking LLMs für genaue wissenschaftliche Ansprüche $\rightarrow$ Evidence Reasoning AI 能验证科学吗? 2506.08235v1
  • 493 06-09 Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions Compound AI Systems Optimization: Eine Übersicht über Methoden, Herausforderungen und zukünftige Richtungen AI 系统优化:方法、挑战和今后方向调查 2506.08234v1
  • 494 06-09 “I Wrote, I Paused, I Rewrote” Teaching LLMs to Read Between the Lines of Student Writing “I Wrote, I Paused, I Rewrote” Lehre LLMs zu lesen zwischen den Linien des Studentenschreibens “我写,我写,我写,我重写” 教LMS 在学生写作的两行之间阅读 2506.08221v1
  • 495 06-09 A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation Eine umfassende Studie von Decoder-Only LLMs für die Text-to-Image-Generierung 综合研究 “ 文字到图像生成的开发者 – – 唯一 “ 有限LLMs “ 综合研究 2506.08210v1
  • 496 06-09 R.R.: Unveiling LLM Training Privacy through Recollection and Ranking R.R.: Enthüllen LLM Training Privacy durch Erinnerung und Ranking R.R.:通过回忆和排名揭示LLM培训隐私 2502.12658v2
  • 497 06-09 Open World Scene Graph Generation using Vision Language Models Open World Scene Graph Generation mit Vision Language Models 利用视觉语言模型制作开放世界景点图集 2506.08189v1
  • 498 06-09 GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors GradEscape: Ein gradientbasierter Evader gegen KI-generierte Textdetektoren 梯度梯度:针对AI-General Text 检测器的渐进式蒸发器 2506.08188v1
  • 499 06-09 LLM Alignment as Retriever Optimization: An Information Retrieval Perspective LLM Alignment als Retriever-Optimierung: Eine Informations-Retrieval-Perspektive LLM 对齐作为最佳优化:信息检索视角 2502.03699v2
  • 500 06-09 Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction Kann Künstliche Intelligenz wie Borges schreiben? Ein Bewertungsprotokoll für spanische Mikrofiktion 《像博尔热斯那样的人工智能书写? 西班牙微缩片的评价议定书》 2506.08172v1
  • 501 06-09 xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs xGen-MM-Vid (BLIP-3-Video): Sie benötigen nur 32 Zeichen, um ein Video auch in VLMs darzustellen xGen-MM-Vid (BLIP-3-Video):您只需32个调子即可在 VLMs 中代表视频 2410.16267v2
  • 502 06-09 ETT-CKGE: Efficient Task-driven Tokens for Continual Knowledge Graph Embedding ETT-CKGE: Effiziente Task-getriebene Token für kontinuierliches Einbetten von Wissensgraphen ETT-CKGE: 高效任务驱动的连续知识图嵌入标记 2506.08158v1
  • 503 06-09 Multilingual Hate Speech Detection in Social Media Using Translation-Based Approaches with Large Language Models Mehrsprachige Hate Speech Detection in sozialen Medien mit translationsbasierten Ansätzen mit großen Sprachmodellen 在社会媒体中使用基于翻译的方法和大语言模式的多语言社会媒体多语言仇恨言论探测 2506.08147v1
  • 504 06-09 AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists AutoSDT: Scaling Data-Driven Discovery Aufgaben gegenüber offenen Mitwissenschaftlern AutoSDT: 扩大数据驱动发现任务, 以建立开放的共同科学家 2506.08140v1
  • 505 06-09 EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments EconWebArena: Benchmarking von autonomen Agenten zu wirtschaftlichen Aufgaben in realistischen Web-Umgebungen EconwebArena:确定现实网络环境中经济任务自治机构的基准 2506.08136v1
  • 506 06-09 Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning Bingo: Effiziente Begründung von LLMs durch dynamisches und auf Bedeutung basierendes Verstärkungslernen 宾果:通过以动态和重要内容为基础的强化学习,提高LLMs高效率的理据能力 2506.08125v1
  • 507 06-09 QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA QA-LIGN: LLMs durch konstitutionell zerlegte QA ausrichten QA-LIGN:通过《宪法》分解的QA调整LLMs 2506.08123v1
  • 508 06-09 Conservative Bias in Large Language Models: Measuring Relation Predictions Konservative Bias in großen Sprachmodellen: Messung von Beziehungsvorhersagen 大语言模型中的保守偏见:衡量关系预测 2506.08120v1
  • 509 06-09 “Would You Want an AI Tutor?” Understanding Stakeholder Perceptions of LLM-based Systems in the Classroom “Würden Sie einen KI-Tutor wollen?” Verständnis der Stakeholder Wahrnehmungen von LLM-basierten Systemen im Klassenzimmer “你想要一个人工智能导师吗?” 了解课堂上利益攸关方对基于LLM系统的LM系统的看法。 2503.02885v2
  • 510 06-09 From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation? Von der Ausgabe bis zur Auswertung: Reicht die Ausgabe von LLMs mit rohem Instruktionscode für die Generierung von Fill-in-the-Middle Code aus? 从输出到评价:原始指令-指令代码LLMs 输出足量是否用于中代代号的填充? 2505.18789v2
  • 511 06-09 From Language Models over Tokens to Language Models over Characters Von Sprachmodellen über Tokens zu Sprachmodellen über Zeichen 从名称以上的语言模式到字符上的语言模式 2412.03719v2
  • 512 06-09 Reinforcement Pre-Training Verstärkung der Vorausbildung 培训前 2506.08007v1
  • 513 06-09 Reparameterized LLM Training via Orthogonal Equivalence Transformation Reparameterisiertes LLM-Training über Orthogonale Äquivalenztransformation 通过正正对等转化进行修复性磁力LLM培训 2506.08001v1
  • 514 06-09 Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval Hierarchischer Lexischer Graph für verbessertes Multi-Hop-Retrieval 用于增强多功能检索的 高级高分解逻辑图 2506.08074v1
  • 515 06-09 Private Memorization Editing: Turning Memorization into a Defense to Strengthen Data Privacy in Large Language Models Private Memorization Editing: Aus Memorization eine Verteidigung zur Stärkung der Datenschutz in großen Sprachmodellen 私人记忆编辑:将记忆转换为国防,以加强大语言模型中的数据隐私 2506.10024v1
  • 516 06-09 PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation PhantomWiki: On-Demand-Datensätze für die Bewertung von Vernunft und Retrieval PhantomWiki: 用于推理和检索评价的按需数据集 2502.20377v2
  • 517 06-09 When Two LLMs Debate, Both Think They’ll Win Wenn zwei LLMs diskutieren, denken beide, dass sie gewinnen werden 当两个LLM 辩论, 双方都认为他们会赢 2505.19184v3
  • 518 06-09 $τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment $τ^2$-Bench: Bewertung von Conversational Agents in einer Dual-Control-Umgebung $τ^2$-Bench:在双控环境中评价对话智能体 2506.07982v1
  • 519 06-09 Automated Capability Discovery via Foundation Model Self-Exploration Automatisierte Capability Discovery über die Selbsterkundung von Fundamentmodellen 通过基金会自我探索示范模式自动发现能力 2502.07577v3
  • 520 06-09 HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization HeuriGym: Agentischer Benchmark für LLM-gekrempelte Heuristik in der kombinatorischen Optimierung HeuriGym: 组合优化中LLM-减压希力学的代理基准 2506.07972v1
  • 521 06-09 MIB: A Mechanistic Interpretability Benchmark MIB: Ein Mechanistischer Interpretations-Benchmark MIB: 机械解释基准 2504.13151v2
  • 522 06-09 General-Reasoner: Advancing LLM Reasoning Across All Domains General-Reasoner: Bessere LLM-Reasonierung über alle Domains hinweg 通用Reasoner:在所有领域推推推LLM 2505.14652v5
  • 523 06-09 Correlated Errors in Large Language Models Korrelierte Fehler in großen Sprachmodellen 大语言模型中的相关错误 2506.07962v1
  • 524 06-09 Language Models over Canonical Byte-Pair Encodings Sprachmodelle über Canonical Byte-Pair Encodings Canonicic 字节对称编码语言模型 2506.07956v1
  • 525 06-09 CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relationships? CORDIAL: Können multimodale große Sprachmodelle Kohärenzbeziehungen effektiv verstehen? 多式联运大语言模式能够有效地理解一致性关系吗? 2502.11300v2
  • 526 06-09 Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization Easy2Hard-Bench: Standardisierte Schwierigkeits-Labels für Profiling LLM Performance und Generalisierung Easy2Hard-Bench:用于分析LLM性能和通用的标准化困难标签 2409.18433v2
  • 527 06-09 Statistical Hypothesis Testing for Auditing Robustness in Language Models Statistische Hypothesenprüfung zur Prüfung der Robustheit in Sprachmodellen 语文模式审计强力统计假设测试 2506.07947v1
  • 528 06-09 ProtocolLLM: RTL Benchmark for SystemVerilog Generation of Communication Protocols ProtocolLLM: RTL Benchmark für SystemVerilog Generierung von Kommunikationsprotokollen 协议LLLM: 系统生成通信协议系统生成的RTL基准 2506.07945v1
  • 529 06-09 Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features Verbesserung der sichtsfreundlichen Klassifizierung mit großen multimodalen Modellmerkmalen 以大型多式模型特点加强低热视觉-语言分类 2412.00142v3
  • 530 06-09 Quantum Graph Transformer for NLP Sentiment Classification Quantengraphtransformator für NLP-Sentimentklassifikation NLP 感应分类量图图图变形器 2506.07937v1
  • 531 06-09 Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models Mimicking oder Reasoning: Multi-Modal In-Context Learning in Vision-Language-Modellen neu denken 模拟或解释:在视觉语言模型中重新思考多模式内学习 2506.07936v1
  • 532 06-09 RONA: Pragmatically Diverse Image Captioning with Coherence Relations RONA: Pragmatisch unterschiedliche Bildunterschriften mit Kohärenzbeziehungen RONA: 与一致性关系相结合的多元图像说明 2503.10997v2
  • 533 06-09 Solving Inequality Proofs with Large Language Models Ungleichungsbeweise mit großen Sprachmodellen lösen 以大语言模式解决不等式证明 2506.07927v1
  • 534 06-09 Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks Dynamic-SUPERB Phase-2: Ein kollaborativ erweiterter Benchmark zur Messung der Fähigkeiten von gesprochenen Sprachmodellen mit 180 Aufgaben 动态-SUPERB第2阶段:协作扩大基准,以衡量具有180项任务的语言模式的能力 2411.05361v2
  • 535 06-09 Uncovering the Functional Roles of Nonlinearity in Memory Enthüllen der funktionalen Rollen der Nichtlinearität im Gedächtnis 发挥记忆中非线性的作用 2506.07919v1
  • 536 06-09 LUCIFER: Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement LUCIFER: Sprachverständnis und Kontext-infundierter Rahmen für Exploration und Verhaltensverfeinerung LUCIFER: 探讨和行为改进的语言理解和背景介绍框架 2506.07915v1
  • 537 06-09 MiniCPM4: Ultra-Efficient LLMs on End Devices MiniCPM4: Ultraeffiziente LLMs auf Endgeräten MiniCPM4:末端装置超有效LMs 2506.07900v1
  • 538 06-09 MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs MEMOIR: Lebenslange Modellbearbeitung mit minimalem Überschreiben und informierter Aufbewahrung für LLMs MEMOIR: 长寿命模型编辑,以最小覆盖和知情保留LLMM 2506.07899v1
  • 539 06-09 Evaluating Large Language Models on the Frame and Symbol Grounding Problems: A Zero-shot Benchmark Bewertung großer Sprachmodelle auf Rahmen- und Symbolerdungsproblemen: Ein Null-Schuss-Benchmark 评价关于框架和符号依据问题的大语言模型:零速基准 2506.07896v1
  • 540 06-09 Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment Introspektives Wachstum: LLM-Kompetenz im Technologieurteil automatisch verbessern 前瞻性增长:自动提高LLM在技术判断方面的专门知识 2505.12452v2
  • 541 06-09 DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance DBudgetKV: Dynamisches Budget in KV Cache Compression für optimale Performance DBudgetKV:确保最佳业绩的 KV缓存压缩中动态预算 2502.16886v2
  • 542 06-09 Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning Fokussieren lernen: Kausale Aufmerksamkeit Destillieren über gradient-geführte Token Pruning 学习到焦点:通过渐渐引导的 Token 鲁士宁 修整导致关注的原因 2506.07851v1
  • 543 06-09 Beyond Numeric Rewards: In-Context Dueling Bandits with LLM Agents Über Numerische Belohnungen hinaus: In-Context-Duellbanditen mit LLM-Agenten 除数值奖励之外:与LLM代理商进行文中评分的强盗 2407.01887v4
  • 544 06-09 Improving large language models with concept-aware fine-tuning Verbesserung großer Sprachmodelle mit konzeptbewusster Feinsteuerung 改进大语言模式,进行概念意识微调 2506.07833v1
  • 545 06-09 WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code WebUIBench: Ein umfassender Benchmark für die Bewertung multimodaler großer Sprachmodelle in WebUI-to-Code WebUIBench:在WebUI-Code中评价多式大语言模式的综合基准 2506.07818v1
  • 546 06-09 ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT ConECT-Datensatz: Überwindung von Datenknappheit im Kontext-Bewusst-E-Commerce MT ConECT 数据集:克服上下文感知电子商务机器翻译中的数据缺乏问题 2506.04929v2
  • 547 06-09 WeQA: A Benchmark for Retrieval Augmented Generation in Wind Energy Domain WeQA: Ein Benchmark für die retrieval Augmented Generation im Bereich Windenergie WeQA:风能域回收增量发电基准 2408.11800v3
  • 548 06-09 Minerva: A Programmable Memory Test Benchmark for Language Models Minerva: Ein programmierbarer Speichertest-Benchmark für Sprachmodelle Minerva:语言模型可编程记忆测试基准 2502.03358v2
  • 549 06-09 MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification MultiMatch: Multihead-Konsistenzregularisierung passend zur semi-überwachten Textklassifikation 多匹配: 用于半有效文本分类的多标题一致性规则化 2506.07801v1
  • 550 06-09 Generalized Interpolating Discrete Diffusion Generalisierte Interpolation diskreter Diffusion 广义插值离散扩散 2503.04482v2
  • 551 06-09 MIRIAD: Augmenting LLMs with millions of medical query-response pairs MIRIAD: Augmenting LLMs mit Millionen von medizinischen Abfrage-Response-Paare MIRIAD: 以数百万对医疗问询对子加强LLMs 2506.06091v2
  • 552 06-09 LLM Unlearning Should Be Form-Independent LLM Unlearning sollte formunabhängig sein LLM 重新学习应是独立的形式 2506.07795v1
  • 553 06-09 Is poisoning a real threat to LLM alignment? Maybe more so than you think Ist Vergiftung eine echte Bedrohung für LLM Ausrichtung? Vielleicht mehr, als Sie denken 中毒是否真正威胁到LLM对齐? 2406.12091v4
  • 554 06-09 Swiss Parliaments Corpus Re-Imagined (SPC_R): Enhanced Transcription with RAG-based Correction and Predicted BLEU Schweizer Parlamente Corpus Re-Imagined (SPC_R): Verbesserte Transkription mit RAG-basierter Korrektur und vorhergesagter BLEU 瑞士议会重新想象(SPC_R):加强以RAG为基础的校正和预测BLEU的跟踪 2506.07726v1
  • 555 06-09 Evaluating Zero-Shot Multilingual Aspect-Based Sentiment Analysis with Large Language Models Bewertung von Null-Shot-Multilingual-Aspekt-basierten Sentiment-Analysen mit großen Sprachmodellen 以大语言模式评价零热多语种频谱分析 2412.12564v3
  • 556 06-09 Multilingual Grammatical Error Annotation: Combining Language-Agnostic Framework with Language-Specific Flexibility Mehrsprachige Grammatikfehler-Annotation: Verknüpfung von sprach-agnostischem Framework mit sprachspezifischer Flexibilität 多语种、多语语地名错误说明:将语言-不可知框架与语言特有灵活性相结合 2506.07719v1
  • 557 06-09 Representation Bending for Large Language Model Safety Darstellungsbiegen für große Sprachmodellsicherheit 大语文示范语文安全示范语文代表名单 2504.01550v2
  • 558 06-09 Through the Valley: Path to Effective Long CoT Training for Small Language Models Through the Valley: Weg zum effektiven LangcoT-Training für kleine Sprachmodelle 通过山谷:为小型语言模式提供有效的长 CoT 培训的途径 2506.07712v1
  • 559 06-09 EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications EVADE: Multimodaler Benchmark für Evasive Content Detection in E-Commerce Anwendungen EVADE:电子商务应用中蒸发内容探测的多模式基准 2505.17654v2
  • 560 06-09 Training Superior Sparse Autoencoders for Instruct Models Training Superior Sparse Autoencoder für Instruct-Modelle 培训高级Sparass Autoencors 教学模型高级导师 2506.07691v1
  • 561 06-09 Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models Cross-lingual Collapse: Wie Sprach-Centric Foundation Modelle in großen Sprachmodellen formen 跨语言的折叠:大语言模式中语言中心基础模型的形状 2506.05850v2
  • 562 06-09 Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models Unlösbare Problemerkennung: Robuste Bewertung des Verständnisses für große multimodale Modelle 无法解决的问题探测:大型多式联运模型的强有力理解评估 2403.20331v4
  • 563 06-09 GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation GaRAGe: Ein Benchmark mit Erdungsanmerkungen für die RAG-Bewertung GARAGe:一份基准,并附有对RAG评价的根据说明 2506.07671v1
  • 564 06-09 Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought Skywork R1V: Pionierische multimodale Vernunft mit phantasievoller Kette Skywork R1V:先锋式多模式理由与探索链 2504.05599v2
  • 565 06-09 SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage SATA: Ein Paradigma für LLM Jailbreak über einfache assistive Task-Linkage SATA:LLM监狱通过简单辅助任务链接破获的范例 2412.15289v4
  • 566 06-09 Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping Jenseits von Benchmarks: Ein neuartiges Framework für Domain-spezifische LLM-Evaluierung und Wissensmapping 超越基准范围:具体领域LLM评估和知识绘图新框架 2506.07658v1
  • 567 06-09 Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks Bewertung von Dialect Fairness und Robustheit von großen Sprachmodellen in mit Gründen versehenen Aufgaben 评估大语言模式在说明理由任务中的公平性和强度 2410.11005v3
  • 568 06-09 Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation Transcript-prompted Whisper mit Wörterbuch-erweiterte Decodierung für japanische Sprachannotation 为日语语音注解配有字典加加码解码的字典插口器 2506.07646v1
  • 569 06-09 Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models Bewertung von LLMs Robustheit in weniger qualifizierten Sprachen mit Proxy-Modellen 具有代用模式的低资源语言的威力评价LLMs 2506.07645v1
  • 570 06-09 TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review TreeReview: Ein dynamischer Baum der Fragen Rahmen für tiefe und effiziente LLM-basierte Scientific Peer Review 树审查:深入、高效的LLM科学同行审评问题框架动态树 2506.07642v1
  • 571 06-09 Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline Unblocking Fine-Grained Evaluation von detaillierten Bildunterschriften: Eine erklärende Auto-Rater und Kritik-und-Revise-Pipeline 对详细说明进行无阻的精细精细评价:解释自动拉动器和批评和校正管道 2506.07631v1
  • 572 06-09 Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation Intent Matters: Verbesserung der KI-Tutoring mit feinkörniger pädagogischer Intent-Annotation 内容:加强AI 与精细教育学本意注解相结合的教学 2506.07626v1
  • 573 06-09 Can Perplexity Predict Fine-tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali Kann Perplexity Fine-Tuning Performance voraussagen? Eine Untersuchung von Tokenization-Effekten auf sequentielle Sprachmodelle für Nepali 重复性预测微调性能能否? 对尼泊尔的序列语言模式的收缩性影响调查 2404.18071v2
  • 574 06-09 LoRMA: Low-Rank Multiplicative Adaptation for LLMs LoRMA: Multiplikative Anpassung mit niedrigem Rank für LLMs LoRMA:LLM的低秩乘性适应 2506.07621v1
  • 575 06-09 Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds Retrieval-Augmented Generation as Noisy In-Context Learning: Eine einheitliche Theorie und Risikogrenzen 作为有噪音的文内学习:统一理论和风险波 2506.03100v3
  • 576 06-09 Vuyko Mistral: Adapting LLMs for Low-Resource Dialectal Translation Vuyko Mistral: Anpassung von LLMs für ressourcenarme Dialektalübersetzung Vuyko Mistral: 调适低资源对流翻译LLMs 2506.07617v1
  • 577 06-09 APE: Selective Fine-tuning with Acceptance Criteria for Language Model Adaptation APE: Selektive Feinabstimmung mit Annahmekriterien für Sprachmodellanpassung APE:有选择地调整语言模式适应接受标准 2505.19912v2
  • 578 06-09 PolitiSky24: U.S. Political Bluesky Dataset with User Stance Labels PolitiSky24: US-politischer Bluesky-Datensatz mit Benutzer-Stance-Etiketten PolitiSky24:带有用户立场标签的美国政治Bluesky数据集 2506.07606v1
  • 579 06-09 Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque Unterweisung großer Sprachmodelle für ressourcenarme Sprachen: Eine systematische Studie für Baskisch 指导低资源语言大语言模式:巴斯克人系统研究 2506.07597v1
  • 580 06-09 ParallelComp: Parallel Long-Context Compressor for Length Extrapolation ParallelComp: Paralleler Long-Context-Kompressor für Längen-Extrapolation 平行Comp: 长度外推法的平行长文本压缩器 2502.14317v2
  • 581 06-09 Attention with Trained Embeddings Provably Selects Important Tokens Aufmerksamkeit bei trainierten Einbettungen wählt wahrscheinlich wichtige Token aus 与经过训练的嵌入器的关注 2505.17282v2
  • 582 06-09 Beyond the Sentence: A Survey on Context-Aware Machine Translation with Large Language Models Beyond the Sentence: Eine Umfrage zu Kontext-Bewusst Maschinelle Übersetzung mit großen Sprachmodellen 句子之后:用大语言模型进行的背景软件机器翻译调查 2506.07583v1
  • 583 06-09 HSF: Defending against Jailbreak Attacks with Hidden State Filtering HSF: Verteidigung gegen Jailbreak Attacken mit versteckten Zustand Filterung HSF: 防范以隐蔽国家过滤手段侵入监狱的袭击 2409.03788v2
  • 584 06-09 Cool-Fusion: Fuse Large Language Models without Training Cool-Fusion: Große Sprachmodelle ohne Training 冷冻功能:没有培训的引信大语言模型 2407.19807v2
  • 585 06-09 Learning Speaker-Invariant Visual Features for Lipreading Learning Speaker-Invariant Visual Features für Lipreading 读心学习用语言-不同功能的视觉特征 2506.07572v1
  • 586 06-09 AI Scientists Fail Without Strong Implementation Capability KI-Wissenschaftler scheitern ohne starke Umsetzungsfähigkeit AI 缺乏强有力的执行能力的科学家失败 2506.01372v2
  • 587 06-09 SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition SELT: Selbstevaluierungsbaumsuche nach LLMs mit Aufgabenabbau SELT: 自评估树搜索具有任务分解的LLMLM 2506.07557v1
  • 588 06-09 Bit-level BPE: Below the byte boundary Bit-Level BPE: Unterhalb der Byte-Grenze BPE 位位值 BPE: 字节边界下 2506.07541v1
  • 589 06-09 Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models Flattery, Fluff und Fog: Diagnostizieren und Abmildern von idiosynkratischen Biasen in Präferenzmodellen Flattery、Fluff和Fog:在首选模式中诊断和缓解非典型的两面主义 2506.05339v2
  • 590 06-09 On Support Samples of Next Word Prediction Unterstützungsbeispiele für die nächste Wortvorhersage 关于 “ 下一词预测 “ 支助样本 2506.04047v2
  • 591 06-09 Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles Enigmata: Scaling Logical Reasoning in großen Sprachmodellen mit synthetischen überprüfbaren Puzzles 英格玛塔:在使用合成可核实拼图的大型语言模型中扩大逻辑理由 2505.19914v2
  • 592 06-09 SecFormer: Fast and Accurate Privacy-Preserving Inference for Transformer Models via SMPC SecFormer: Schnelle und genaue Datenschutz-Bewahrung von Schlussfolgerungen für Transformer-Modelle über SMPC SecFormer:通过SMPC为变形模型快速和准确的隐私保护推断 2401.00793v5
  • 593 06-09 Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition Speaker-distinguishable CTC: Lernen Speaker-Distinction mit CTC für Multi-Talker Spracherkennung 发言人可区分的反恐委员会:学习演讲人利用反恐委员会表彰多语言者言论 2506.07515v1
  • 594 06-09 DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction DeRAGEC: ASR 错误校正带有合成理由的代名实体候选人 2506.07510v1
  • 595 06-09 Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models Eliciting In-Context Retrieval und Reasoning für Lang-Kontext Große Sprachmodelle 长式长式大语言模型的全文检索和理由 2501.08248v3
  • 596 06-09 What Do Indonesians Really Need from Language Technology? A Nationwide Survey Was brauchen Indonesier wirklich von Sprachtechnologie? Eine landesweite Umfrage 印度尼西亚人真正需要语言技术什么? 2506.07506v1
  • 597 06-09 DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech DEBATE: Ein Datensatz für entwirrende textuelle Ambiguität in Mandarin durch Sprache 辩论:通过讲话解调普通话文字模糊性数据集 2506.07502v1
  • 598 06-09 Graph-of-Causal Evolution: Challenging Chain-of-Model for Reasoning Graph-of-Causal Evolution: Herausfordernde Modellkette für die Vernunft Causal 进化图:有挑战性的理由模型链 2506.07501v1
  • 599 06-09 Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation Compliance-to-Code: Verbesserung der finanziellen Compliance-Prüfung durch Codegenerierung 遵守到守则:通过代码生成加强金融合规检查 2505.19804v2
  • 600 06-09 Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning Verbesserung des Charakter-Level-Verständnisses in LLMs durch Token Internal Structure Learning 通过 Token 内部结构学习加强LLM女士的品级理解 2411.17679v5
  • 601 06-09 Binary Classifier Optimization for Large Language Model Alignment Binärklassifikator-Optimierung für großsprachliche Modellausrichtung 大语言模型对齐的二进制分类器优化 2404.04656v2
  • 602 06-09 Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models Semantische Exploration mit adaptiver Beschichtung für effiziente Problemlösung mit Sprachmodellen 以语言模式有效解决问题的适应性基质的语义探索 2501.05752v2
  • 603 06-09 A Hybrid GA LLM Framework for Structured Task Optimization Ein hybrides GA LLM-Rahmenwerk für strukturierte Aufgabenoptimierung GA 混合LLM 结构化任务优化框架 2506.07483v1
  • 604 06-09 Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models Verfolgung beweglicher Ziele mit Online-Selbstspiel-Verstärkung Lernen für sicherere Sprachmodelle 利用在线加强自身能力学习,建立更安全语言模式,以追踪移动目标 2506.07468v1
  • 605 06-09 BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model BEYOND DIALOGUE: Ein Profil-Dialog-Rahmen für die Ausrichtung auf das allgemeine Rollenspiel-Sprachenmodell BEYOND DIALOGUE: 通用角色扮演语言模式概况-对话协调框架 2408.10903v6
  • 606 06-09 CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models CCI4.0: Ein zweisprachiger Vorschulungs-Datensatz zur Verbesserung der Vernunft in großen Sprachmodellen CCI4.0:提高大语言模式理由的双语预先培训数据集 2506.07463v1
  • 607 06-09 From Calibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered Von der Kalibrierung zur Zusammenarbeit: LLM Uncertainty Quantification sollte menschlicher werden 从校准到协作:LLM 不确定性量化应更加以人为本 2506.07461v1
  • 608 06-09 GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning GLOS: Zeichensprache-Generierung mit temporär ausgerichteter Gloss-Level-Konditionierung GLOS: 具有临时统一质素水平条件的手语语言一代 2506.07460v1
  • 609 06-09 Parameter-Efficient Fine-Tuning of State Space Models Parameter-Effizient Feintuning von State Space Models 国家空间模型有效精美设计 2410.09016v3
  • 610 06-09 KScope: A Framework for Characterizing the Knowledge Status of Language Models KScope: Ein Rahmen für die Charakterisierung des Wissensstatus von Sprachmodellen KScope: 语言模式知识状况描述框架 2506.07458v1
  • 611 06-09 Understanding Cross-Domain Adaptation in Low-Resource Topic Modeling Cross-Domain-Anpassung in Low-Resource-Themenmodellierung verstehen 在低成本资源专题建模中了解跨领域适应 2506.07453v1
  • 612 06-09 When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment Wenn Stil die Sicherheit bricht: Verteidigung von Sprachmodellen gegen hocheffiziente Stilausrichtung 当样式打破安全时: 防御语言模型, 对抗超优样式对齐 2506.07452v1
  • 613 06-09 LlamaRec-LKG-RAG: A Single-Pass, Learnable Knowledge Graph-RAG Framework for LLM-Based Ranking LlamaRec-LKG-RAG: Ein eingängiges, erlernbares Wissen Graph-RAG-Framework für LLM-basiertes Ranking LlamaRec-LKG-RAG:以LLM为基础的排名单张、可学习知识图表-RAG框架 2506.07449v1
  • 614 06-09 Measuring Diversity in Synthetic Datasets Diversität in synthetischen Datensätzen messen 计量合成数据集的多样性 2502.08512v2
  • 615 06-09 Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning Token Cleaning: Feinkörnige Datenauswahl für LLM Supervised Fine-Tuning Token Cleaning: 面向LLM监督微调的细粒度数据选择 2502.01968v2
  • 616 06-09 RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs RomanLens: Die Rolle der latenten Romanisierung in der Mehrsprachigkeit in LLMs RomanLens:LLM中潜在罗马化在多语言性中的作用 2502.07424v3
  • 617 06-09 LG-ANNA-Embedding technical report Technischer Bericht LG-ANNA-Embedding LG-ANNA-取代技术报告 2506.07438v1
  • 618 06-09 Cartridges: Lightweight and general-purpose long context representations via self-study Patronen: Leichte und universelle lange Kontextdarstellungen durch Selbststudium Cartridges:轻量和一般用途长背景介绍,通过自学 2506.06266v2
  • 619 06-09 Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding Gut Begun ist die Hälfte Fertig: Low-Resource Preference Alignment by Weak-to-Strong Decoding Well Begun 半完成: 低资源优先配对, 由弱到强解码 2506.07434v1
  • 620 06-09 Multi-agent Architecture Search via Agentic Supernet Multi-Agent Architektur Suche über Agentic Supernet 通过 Agric Supernet 多剂机构建筑搜索 2502.04180v2
  • 621 06-09 Conjoined Predication and Scalar Implicature Zusammengefügte Prädikation und Scalar Implicature 连带预言和斯卡拉复制 2506.07429v1
  • 622 06-09 When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration Wenn Modelle mehr wissen, als sie erklären können: Quantifizierung des Wissenstransfers in der Mensch-AI-Kollaboration 当模型知道的超出它们所能解释的时:在人类-AI合作中量化知识转让 2506.05579v2
  • 623 06-09 Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models Plug-in und Feintuning: Die Lücke zwischen kleinen Sprachmodellen und großen Sprachmodellen überbrücken 插插和微调:缩小小语言模式与大语言模式之间的差距 2506.07424v1
  • 624 06-09 SEED: Enhancing Text-to-SQL Performance and Practical Usability Through Automatic Evidence Generation SEED: Verbesserung der Text-zu-SQL-Performance und der praktischen Usability durch automatische Evidenz-Generierung SEED:通过自动产生证据提高文字到SQL的性能和实用性能 2506.07423v1
  • 625 06-09 Power-Law Decay Loss for Large Language Model Finetuning: A Theory Perspective Macht-Rechts-Dekay Verlust für große Sprachmodell Finetuning: Eine Theorie-Perspektive 大语言模型微调的功率法减缩损失:理论视角 2505.16900v5
  • 626 06-09 Structured Pruning for Diverse Best-of-N Reasoning Optimization Strukturierte Prunings für die unterschiedliche Best-of-N-Reasoning-Optimierung 结构审慎,实现最佳理由的多样化最佳最佳理由的优化 2506.03978v2
  • 627 06-09 BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance BiMa: Auf dem Weg zu Biases Milderung für Text-Video-Retrieval über Szeneelement-Anleitung BiMa:通过场景元素指导,减缓文本视频检索中的偏见 2506.03589v2
  • 628 06-09 Outlier-weighed Layerwise Sampling for LLM Fine-tuning Ausgewägte schichtweise Probenahme für LLM-Feinabstimmung LLM 精密调试 2405.18380v3
  • 629 06-09 Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures Jenseits von Jailbreaks: Enthüllen von Stealthier- und breiteren LLM-Sicherheitsrisiken, die von Alignment-Fehlern ausgehen 越越狱:揭发 “ 团结失败 “ 带来的 “ 安全风险 “ 。 2506.07402v1
  • 630 06-09 G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems G-Memory: Hierarchischer Speicher für Multi-Agent-Systeme G-记忆:为多机构系统追踪等级记忆 2506.07398v1
  • 631 06-09 Do Large Language Models Judge Error Severity Like Humans? Urteilen große Sprachmodelle über Schwerelosigkeit wie Menschen? 大语言模型法官 误差严重像人类吗? 2506.05142v2
  • 632 06-09 SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering SyncMind: Messung der Out-of-Sync-Wiederherstellung von Agenten im kollaborativen Software-Engineering SyncMind:在协作软件工程中衡量智能体的失步恢复能力 2502.06994v2
  • 633 06-09 Zero-Shot Event Causality Identification via Multi-source Evidence Fuzzy Aggregation with Large Language Models Null-Shot-Ereignis-Causalitäts-Identifizierung über Multi-Source-Evidenz Fuzzy-Aggregation mit großen Sprachmodellen 通过多种来源证据确定零热事件因果关系 2506.05675v2
  • 634 06-09 Scalable Vision Language Model Training via High Quality Data Curation Skalierbare Vision Sprachmodell-Schulung über hochwertige Daten-Kuration 通过高质量数据策划进行可扩展的视觉语言模型训练 2501.05952v3
  • 635 06-09 A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment Eine umfassende Umfrage in LLM(-Agent) Full Stack Sicherheit: Daten, Schulung und Bereitstellung 用LLLM(-代理)全堆安全:数据、培训和部署进行的全面调查 2504.15585v4
  • 636 06-09 Theoretical Benefit and Limitation of Diffusion Language Model Theoretischer Nutzen und Begrenzung des Diffusions-Sprachmodells 传播语言模式理论效益和限制 2502.09622v2
  • 637 06-09 Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models Knowledge-to-Jailbreak: Untersuchung wissensgetriebener Jailbreaking-Angriffe für große Sprachmodelle 知识对正义的突破:调查以知识为驱动的大型语言模式破狱袭击 2406.11682v2
  • 638 06-09 Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation Refusal-Feature-geführter Lehrer für sichere Feinsteuerung über Datenfilterung und Alignment Destillation 通过数据过滤和对齐蒸馏进行安全微调的拒绝制导师教师 2506.07356v1
  • 639 06-09 Unveiling and Addressing Pseudo Forgetting in Large Language Models Enthüllung und Adressierung von Pseudo Forgetting in großen Sprachmodellen 在大语言模式中统一和解决 “ 优于 “ 的遗忘 2411.11932v2
  • 640 06-09 What if LLMs Have Different World Views: Simulating Alien Civilizations with LLM-based Agents Was, wenn LLMs unterschiedliche Weltansichten haben: Alien-Zivilisationen mit LLM-basierten Agenten simulieren 如果LLLM女士具有不同的世界观点:用以LLM为基地的代理人模拟外星文明 2402.13184v6
  • 641 06-09 Improving LLM Reasoning through Interpretable Role-Playing Steering Verbesserung der LLM-Reasoning durch interpretierbare Rollenspiel-Lenkung 通过可解释性角色扮演指导改进LLM 2506.07335v1
  • 642 06-09 BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain BioHopR: Ein Benchmark für Multi-Hop, Multi-Answer Reasoning in der biomedizinischen Domäne BioHopR:生物医学领域多层次、多层次原因基准 2505.22240v2
  • 643 06-09 Rational Decision-Making Agent with Internalized Utility Judgment Rationaler Entscheidungsbevollmächtigter mit internalisiertem Utility-Urteil 具有内化公用事业判决的合理决策代理 2308.12519v3
  • 644 06-09 Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis Tree-of-Debatte: Multi-Persona-Debatte Bäume Elizites kritisches Denken für wissenschaftliche vergleichende Analyse 理论树:多人辩论树 用于科学比较分析的 明确关键思考 2502.14767v2
  • 645 06-08 (7) Reward Model Interpretability via Optimal and Pessimal Tokens Reward Model Interpretierbarkeit über Optimale und Pessimal Token 通过最佳和微量调制调来解释性 2506.07326v1
  • 646 06-08 Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race Ausgerichtet, aber blind: Ausgerichtetheit erhöht Implizite Bias, indem sie das Bewusstsein für die Rasse verringert 结盟但盲目:通过减少对种族问题的认识,协调会增加隐含的偏见 2506.00253v3
  • 647 06-08 Synergizing Unsupervised Episode Detection with LLMs for Large-Scale News Events Synergisieren der unüberwachten Episodenerkennung mit LLMs für großformatige News-Events 与大型新闻事件LMsLM公司同步进行未受监督的子集探测 2408.04873v2
  • 648 06-08 Watermarking Language Models with Error Correcting Codes Wasserzeichen-Sprachenmodelle mit Fehler-Korrektur-Codes 错误校正代码的划水语言模型 2406.10281v4
  • 649 06-08 ConfQA: Answer Only If You Are Confident ConfQA: Nur beantworten, wenn Sie zuversichtlich sind ConfQA: 仅回答您是否确信 2506.07309v1
  • 650 06-08 Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models Zuverlässige wissenschaftliche Hypothesen-Generation: Bewertung von Wahrhaftigkeit und Halluzination in großen Sprachmodellen 走向可靠的科学假设一代:在大语言模型中评估真理和幻觉 2505.14599v2
  • 651 06-08 Highly Fast Text Segmentation With Pairwise Markov Chains Hochschnelle Textsegmentierung mit paarweise Markov-Ketten 配对马科夫链条的快速文本分割 2102.11037v2
  • 652 06-08 Subjectivity in the Annotation of Bridging Anaphora Subjektivität in der Annotation von Brücken Anaphora 连接Adaphora说明中的主观性 2506.07297v1
  • 653 06-08 Exploring the Impact of Temperature on Large Language Models:Hot or Cold? Den Einfluss der Temperatur auf große Sprachmodelle erforschen: Heiß oder kalt? 探索温度对大语言模型的影响:热还是冷? 2506.07295v1
  • 654 06-08 MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache MiniKV: Die Grenzen der LLM-Inferenz durch 2-Bit-Layer-Diskriminativ-KV-Cache drücken MiniKV: 通过2-Bit 层分解 KV 缓存推进LLM 推断的限度 2411.18077v3
  • 655 06-08 Selective Prompt Anchoring for Code Generation Selektive Prompt-Ankerung für die Code-Generierung 用于代码生成的选择性提示锚定 2408.09121v5
  • 656 06-08 Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages Parsing the Switch: LLM-basierte UD-Annotation für komplexe Code-Switched- und Low-Resource-Sprachen 开关剖析:基于LLM的LLM UD 复杂代码开关和低资源语言说明 2506.07274v1
  • 657 06-08 Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs Fragebeantwortung unter Zeitkonflikt: Bewertung und Organisation des sich entwickelnden Wissens mit LLMs 在时间冲突下回答问题:与LLMs一道评价和组织不断发展的知识 2506.07270v1
  • 658 06-08 AMPO: Active Multi-Preference Optimization for Self-play Preference Selection AMPO: Aktive Multi-Preference-Optimierung für Selbstspiel-Preference-Auswahl AMPO: 主动的自玩首选多优先优化 2502.18293v2
  • 659 06-08 DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products DeltaProduct: Verbesserung des State-Tracking in linearen RNNs über Householder-Produkte DeltaProduct:通过Householder乘积改进线性RNN中的状态跟踪 2502.10297v5
  • 660 06-08 Unraveling Token Prediction Refinement and Identifying Essential Layers in Language Models Unraveling Token Prediction Verfeinerung und Identifizierung wesentlicher Ebenen in Sprachmodellen 在语言模型中统一地改进和确定关键层 2501.15054v2
  • 661 06-08 Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative Languages Bias Attribution in philippinischen Sprachmodellen: Erweiterung eines Bias Interpretability Metric für die Anwendung auf Agglutinative Sprachen 菲律宾语言模式中的偏见归属:扩大申请隐含语言的偏见解释度量 2506.07249v1
  • 662 06-08 Improving the Efficiency of Long Document Classification using Sentence Ranking Approach Verbesserung der Effizienz der Langdokumentklassifikation mittels Sentence-Ranking-Ansatz 采用判决分级办法提高长文件分类的效率 2506.07248v1
  • 663 06-08 SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes SDE-SQL: Verbesserung der Text-zu-SQL-Generierung in großen Sprachmodellen durch selbstgesteuerte Exploration mit SQL-Probes SDE-SQL:通过自发探索SQL勘探,在大语言模型中加强从文字到SQL的生成 2506.07245v1
  • 664 06-08 Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification Multi-Step Visual Reasoning mit Visual Tokens Skalierung und Verifizierung 具有视觉肯量度和核查的多脚视觉理性 2506.07235v1
  • 665 06-08 Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding Objekt-Halluzination in großen Audio-Sprachen-Modellen mittels Audio-Aware-Dekodierung reduzieren 通过音频软件解码减少大型音频语言模型中的天体幻觉 2506.07233v1
  • 666 06-08 Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration Effiziente Vorauswahl von Sprachmodellen über Multi-Actor-Kollaboration 通过多方行为者协作为语言模式选择高效培训前数据 2410.08102v3
  • 667 06-08 Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning Halluzination auf einen Blick: Kontrollierte visuelle Schnitte und feinkörniges multimodales Lernen 《一眼幻觉:受控的视觉编辑和优美的多模式学习》 2506.07227v1
  • 668 06-08 DeepRAG: Thinking to Retrieve Step by Step for Large Language Models DeepRAG: Schritt für Schritt für Schritt für große Sprachmodelle nachzudenken 深RAG:考虑逐步获取大语言模型 2502.01142v2
  • 669 06-08 Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts Diversifizierung des Expertenwissens für Task-Agnostic Pruning in Sparse Mixture-of-Experts 使专家在Sparse Mixture 专家中的任务不可知性调节专家知识多样化 2407.09590v4
  • 670 06-08 Value Residual Learning Wertorientiertes Lernen 残余价值学习 2410.17897v5
  • 671 06-08 Gumbel Reranking: Differentiable End-to-End Reranker Optimization Gumbel Reranking: Differentielle End-to-End-Reranker-Optimierung Gumbel 中继等级: 可区别的 端到端到端再排序器优化 2502.11116v2
  • 672 06-08 SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning SAP-Bench: Benchmarking multimodaler Großsprachenmodelle in der operativen Aktionsplanung SAP-Bench:在外科行动规划中确定多式大语言模式基准 2506.07196v1
  • 673 06-08 MMTEB: Massive Multilingual Text Embedding Benchmark MMTEB: Massive mehrsprachige Text-Einbettung Benchmark MMTEB: 大规模多语言文本嵌入基准 2502.13595v3
  • 674 06-08 Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images Mildernde Verhaltenshalluzination in multimodalen großen Sprachmodellen für Sequenzbilder 序列图像多式大语言模型中减轻行为幻觉行为 2506.07184v1
  • 675 06-08 Epistemic Integrity in Large Language Models Epistemische Integrität in großen Sprachmodellen 大语言模型中的集成完整性 2411.06528v2
  • 676 06-08 Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs Flattery in Motion: Benchmarking und Analyse der Sykopanz in Video-LLMs 电影中的发辣:视频-LLMs中的基准设定和分析 2506.07180v1
  • 677 06-08 Understanding and Mitigating Cross-lingual Privacy Leakage via Language-specific and Universal Privacy Neurons Verständnis und Abmilderung von Cross-lingual Privacy Leakage über sprachspezifische und Universal Privacy Neuronen 通过特定语言和通用隐私中世纪理解和减少跨语言隐私疏漏 2506.00759v2
  • 678 06-08 Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering Rechte Antwort, falsches Ergebnis: Enthüllen der Inkonsistenzen der LLM-Bewertung in Multiple-Choice-Fragebeantwortung 正确答案, 错误分数: 解开多选项问题解答中LLM评价的不一致之处 2503.14996v2
  • 679 06-08 PECAN: LLM-Guided Dynamic Progress Control with Attention-Guided Hierarchical Weighted Graph for Long-Document QA PECAN: LLM-geführte dynamische Fortschrittskontrolle mit aufmerksamkeitsgeführtem Hierarchischen Gewichtungsdiagramm für Langdokument QA PECAN: LLM-指导的长文档质量评估动态进展控制,附有引人指导的纵向分级加权图 2410.04790v2
  • 680 06-08 RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality RULE: Verstärkung des Unerfahrenen erreicht Vergessen-Behalten Pareto-Optimalität RULE: 强化不放功率达到遗忘-保留最佳程度 2506.07171v1
  • 681 06-08 CTDGSI: A comprehensive exploitation of instance selection methods for automatic text classification. VII Concurso de Teses, Dissertações e Trabalhos de Graduação em SI – XXI Simpósio Brasileiro de Sistemas de Informação CTDGSI: Eine umfassende Nutzung von Beispielauswahlmethoden für die automatische Textklassifizierung. VII Concurso de Teses, Dissertações e Trabalhos de Graduação em SI – XXI Simpósio Brasileiro de Sistemas de Informação CTDGSI: 综合利用自动文本分类实例选择方法。 VII Concurso de Teses, Dissertações e Trabalhos de Graduação em SI – XXI Simpósio Brasileiro de Sistemas de Informação 2506.07169v1
  • 682 06-08 Efficient Text-Attributed Graph Learning through Selective Annotation and Graph Alignment Effizientes Text-Attributes Graphenlernen durch selektive Annotation und Graphenausrichtung 通过选择性注释和图表对齐,高效率的文本定位图表学习 2506.07168v1
  • 683 06-08 GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization GeometrieZero: Verbesserung der Geometrie-Lösung für LLM mit Gruppen-Kontrast-Policy-Optimierung 几何零:改进与集团反竞争政策优化相结合的LLM的几何解决办法 2506.07160v1
  • 684 06-08 ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models ExLM: Nachdenken über die Auswirkungen von [MASK] Token in maskierten Sprachmodellen EXLM:重新思考[MASK]语调模式中[MASK]语调的影响 2501.13397v5
  • 685 06-08 Syntactic Control of Language Models by Posterior Inference Syntaktische Kontrolle von Sprachmodellen durch hintere Inferenz 由Posideer 推断对语言模型的协同控制 2506.07154v1
  • 686 06-08 Comba: Improving Bilinear RNNs with Closed-loop Control Comba: Bilineare RNNs mit Closed-Loop-Steuerung verbessern Comba: 改进有闭环控制的双线区域网网 2506.02475v2
  • 687 06-08 Semantic-preserved Augmentation with Confidence-weighted Fine-tuning for Aspect Category Sentiment Analysis Semantisch-konservierte Augmentation mit vertrauensgewichtiger Feinabstimmung für Aspect Category Sentiment Analysis 带有信任加权微调的语义保有增强值,用于隐形类别情感分析 2506.07148v1
  • 688 06-08 How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis Wie reduziert DPO die Toxizität? Eine mechanistische Neuronen-Level-Analyse DPO 如何减少毒性? 2411.06424v3
  • 689 06-08 Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting Prompting Science Report 2: Der abnehmende Wert der Kette des Denkens in Prompting 快速科学报告2:激励过程中思维链价值的下降 2506.07142v1
  • 690 06-08 Learning Compact Vision Tokens for Efficient Large Multimodal Models Kompakte Visions-Token für effiziente große multimodale Modelle lernen 高效大型多模式模式的学习契约愿景 2506.07138v1
  • 691 06-08 Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models Theorem-of-Thought: Ein Multi-Agenten-Framework für abduktive, deduktive und induktive Vernunft in Sprachmodellen 所探讨的理论理论:语言模式中指导、贬低和诱导理由的多机构框架 2506.07106v1
  • 692 06-08 How Far Are We from Optimal Reasoning Efficiency? Wie weit sind wir von optimaler Effizienz entfernt? 我们离最佳计算效率还有多远呢? 2506.07104v1
  • 693 06-08 Exploring Training and Inference Scaling Laws in Generative Retrieval Erforschen von Schulungs- und Inferenzskalierungsgesetzen in generativem Retrieval 探索 “ 创造检索 “ 中的培训和引文扩展法 2503.18941v2
  • 694 06-08 Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models Stellen Sie sich vor zu hören: Auditory Knowledge Generation kann ein effektiver Assistent für Sprachmodelle sein 想象一下:产生听力知识可以成为语言模式的有效助理。 2503.16853v2
  • 695 06-08 Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing Darstellungsdekomposition für das Lernen von Ähnlichkeit und Kontrast über Modalitäten für affektives Computing 积极计算方法不同模式的学习相似性和相异性 2506.07086v1
  • 696 06-08 From Informal to Formal – Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs Vom informellen zum formalen – Einbinden und Bewerten von LLMs über natürliche Sprachanforderungen bis hin zu überprüfbaren Formalproofs 从非正式到正式 – – 纳入和评价关于自然语言要求与可核实的正式证明之间的LLMs 2501.16207v4
  • 697 06-08 Com$^2$: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models Com$^2$: Ein ursächlich geführter Benchmark für die Erforschung komplexer Commonsense-Reasoning in großen Sprachmodellen Com$^2$:探讨大语言模型中复杂常识推理的因果指导基准 2506.07064v1
  • 698 06-08 AdaSplash: Adaptive Sparse Flash Attention AdaSplash: Adaptive Sparse Flash Achtung AdaSplash: 适应性散散闪光关注 2502.12082v2
  • 699 06-08 When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models Wenn Aufmerksamkeit zusammenbricht: Wie degenerierte Ebenen in LLMs kleinere, stärkere Modelle aktivieren 注意折叠时: LLM 中如何分解层的 LLM 启用更小、 更强模型 2404.08634v3
  • 700 06-08 Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs Interpretierbare und zuverlässige Erkennung von KI-generierten Bildern durch geerdete Vernunft in MLLMs 通过MLLLMs中基于理由对AI创像进行可解释和可靠地探测 2506.07045v1
  • 701 06-08 Reasoning with RAGged events: RAG-Enhanced Event Knowledge Base Construction and reasoning with proof-assistants Reasoning mit RAGged Events: RAG-erweiterte Event Knowledge Base Konstruktion und Reasoning mit Proof-Assistenten RAG-加强事件知识库建设和与证据助理的推理 2506.07042v1
  • 702 06-08 Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More Stoppen Sie die Suche nach wichtigen Token in multimodalen Sprachmodellen: Duplizieren zählt mehr 停止在多模式语言模式中查找重要调子: 更多重复问题 2502.11494v2
  • 703 06-08 BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling BRIDGE: Bootstrapping-Text zur Steuerung der Time-Series-Generation über Multi-Agent iterative Optimierung und Diffusionsmodellierung BRIDGE:通过多代理迭代优化和传播模型化控制时间- 系列生成的推进文本 2503.02445v4
  • 704 06-08 KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering KG2QA:为通信标准问题解答提供知识型图集增强的检索回声-增强的一代人 2506.07037v1
  • 705 06-08 WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning WorkForceAgent-R1: Förderung der Fähigkeit von LLM-basierten Web-Agenten durch Verstärkungs-Lernen 工作力量-R1:通过强化学习在基于LLM的网络代理中鼓励 2505.22942v2
  • 706 06-08 A Culturally-diverse Multilingual Multimodal Video Benchmark & Model Ein multimodales multimodales Multimodal-Video-Benchmark & -Modell 多语言多语言多语多语多模式视频基准和模式 2506.07032v1
  • 707 06-08 HauntAttack: When Attack Follows Reasoning as a Shadow HauntAttack: Wenn Angriff folgt Vernunft als Schatten 恶霸:当攻击发生时,以影子为由进行攻击 2506.07031v1
  • 708 06-08 JsonTuning: Towards Generalizable, Robust, and Controllable Instruction Tuning JsonTuning: Auf dem Weg zu einer generalisierbaren, robusten und kontrollierbaren Instruktion JsonTuning: 走向可泛化、稳健和可控的指令微调 2310.02953v6
  • 709 06-08 HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models HAIC: Verbesserung des menschlichen Handelns durch bessere Bildunterschriften für multimodale große Sprachmodelle HAIC: 提高人类行动的理解和生成,为多模式大语言模式提供更好的指导 2502.20811v2
  • 710 06-08 AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding AdaReTaKe: Adaptive Redundanzreduzierung zur Wahrnehmung länger für Video-Sprachen-Verständnis AdaRETAKe:为了解视频语言而将适应性裁员减到可能更长的时间 2503.12559v2
  • 711 06-08 From Sub-Ability Diagnosis to Human-Aligned Generation: Bridging the Gap for Text Length Control via MARKERGEN Von der Sub-Ability-Diagnose zur Menschengerechten Generation: Überbrückung der Lücke zur Textlängenkontrolle über MARKERGEN 从亚能力分析到人类一代:通过MARKERGEN弥合文本长度控制差距 2502.13544v3
  • 712 06-08 Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States Hin zu dynamischer Geistestheorie: Bewertung der LLM-Anpassung an die zeitliche Evolution menschlicher Staaten 走向动态思维理论:评价LLM适应人类国家时间演变的适应LLM 2505.17663v2
  • 713 06-08 sudo rm -rf agentic_security sudo rm -rf agentic_security sudo rm -rf agentic_security 2503.20279v3
  • 714 06-08 Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text Adversariale Paraphrasierung: Ein universeller Angriff für humanisierenden KI-generierten Text 反对反对截图:普遍攻击使 AI - 发光的文字人性化 2506.07001v1
  • 715 06-08 What makes Reasoning Models Different? Follow the Reasoning Leader for Efficient Decoding Was macht Vernunftmodelle anders? Folgen Sie der Vernunftführerin für effiziente Decodierung 是什么让理性模型不同? 跟随高效配方的理性导师 2506.06998v1
  • 716 06-08 STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings STAMP Ihr Inhalt: Nachweis der Datensatzmitgliedschaft über Watermarked Rephrasings STAMP 您的内容: 通过水标修改方式证明数据集成员 2504.13416v2
  • 717 06-08 Cultural Bias Matters: A Cross-Cultural Benchmark Dataset and Sentiment-Enriched Model for Understanding Multimodal Metaphors Cultural Bias Matters: Ein Cross-Cultural Benchmark Dataset und Sentiment-Enriched Modell zum Verständnis multimodaler Metaphoren 文化偏见问题:跨文化基准数据集和适应性强化模式,了解多模式代词器 2506.06987v1
  • 718 06-08 ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation ChartGalaxy: Ein Datensatz für Infografik Chart Verstehen und Generieren 图表银河:用于了解和生成信息图表的数据集 2505.18668v3
  • 719 06-08 Chain of Methodologies: Scaling Test Time Computation without Training Kette der Methoden: Skalierung Testzeitberechnung ohne Training 方法链:不经过培训的缩放测试时间计算 2506.06982v1
  • 720 06-08 Atomic Reasoning for Scientific Table Claim Verification Atomische Begründung für die Überprüfung der wissenschaftlichen Tabelle 科学表格索赔核实的原子理由 2506.06972v1
  • 721 06-08 A dependently-typed calculus of event telicity and culminativity Ein abhängig typisiertes Kalkül von Ereignistelizität und Kulminivität 一种依附型型的事件的微积分, 具有感触力和上流性 2506.06968v1
  • 722 06-08 Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning Lernen, durch verstärktes Lernen durch reward-gewichtete Feintuning zu klären 学习通过通过奖励-加权微调加强学习来澄清问题 2506.06964v1
  • 723 06-08 ViMMRC 2.0 – Enhancing Machine Reading Comprehension on Vietnamese Literature Text ViMMRC 2.0 – Verbesserung des maschinellen Leseverständnisses bei vietnamesischen Literaturtexten ViMMRC 2.0 – 加强对越南文学文本的机器阅读理解 2303.18162v2
  • 724 06-08 BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning BIS Reasoning 1.0: Der erste großformatige japanische Benchmark für glaubens-inkonsistente syllogistische Reasoning BIS Reasoning 1.0:首个大规模日语信念不一致三段论推理基准 2506.06955v1
  • 725 06-07 (6) What Makes a Good Natural Language Prompt? Was macht eine gute natürliche Sprache prompt? 是什么让好自然语言迅速? 2506.06950v1
  • 726 06-07 PlanGenLLMs: A Modern Survey of LLM Planning Capabilities PlanGenLLMs: Eine moderne Erhebung der LLM-Planungskapazitäten PlanGenLLMs:LLM规划能力现代调查 2502.11221v2
  • 727 06-07 The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity Die Illusion des Denkens: Die Stärken und Grenzen von Vernunftmodellen über das Lens of Problem Complexity verstehen 思考的幻觉:通过问题复杂焦点了解理性模型的长处和局限性 2506.06941v1
  • 728 06-07 Mechanistic evaluation of Transformers and state space models Mechanistische Auswertung von Transformern und State Space Models 对变压器和国家空间模型的机械评估 2505.15105v2
  • 729 06-07 Scaling over Scaling: Exploring Test-Time Scaling Plateau in Large Reasoning Models Skalierung über Skalierung: Untersuchung des Testzeit-Skalierungs-Plateau in Großmodellen 缩放超过缩放: 探索大型理由模型中的试验时间缩放高台 2505.20522v2
  • 730 06-07 DiscoSum: Discourse-aware News Summarization DiscoSum: Diskurs-bewusste Nachrichten Zusammenfassung DiscoSum: 深知新闻摘要 2506.06930v1
  • 731 06-07 BLEUBERI: BLEU is a surprisingly effective reward for instruction following BLEUBERI: BLEU ist eine überraschend effektive Belohnung für das Befolgen von Anweisungen BLEUBERI:BLEU是一种令人惊讶的有效的指令遵循奖励 2505.11080v2
  • 732 06-07 On Path to Multimodal Historical Reasoning: HistBench and HistAgent Auf dem Weg zu multimodaler historischer Vernunft: HistBench und HistAgent 通向多式联运历史原因原因之路:历史时尚与历史代理人 2505.20246v2
  • 733 06-07 Hybrid Extractive Abstractive Summarization for Multilingual Sentiment Analysis Hybride extrahierende abstrakte Zusammenfassung für Mehrsprachige Sentimentanalyse 多种语文感化分析的抽取性摘要 2506.06929v1
  • 734 06-07 EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments EconEvals: Benchmarks und Litmus-Tests für LLM-Agenten in unbekannten Umgebungen Econevals:未知环境中LLM代理物的测标和Litmus测试 2503.18825v2
  • 735 06-07 When Incentives Backfire, Data Stops Being Human Wenn Incentives Backfire, Daten aufhören, Mensch zu sein 当鼓励反火时,数据停止成为人类 2502.07732v2
  • 736 06-07 Robustifying Vision-Language Models via Dynamic Token Reweighting Robustere Vision-Sprachmodelle durch Dynamic Token Reweighting 通过动态词元重加权增强视觉语言模型的稳健性 2505.17132v2
  • 737 06-07 TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training TorchTitan: PyTorch-native Komplettlösung für produktionsreifes LLM-Vortraining TorchTitan:面向生产级LLM预训练的一站式PyTorch原生解决方案 2410.06511v3
  • 738 06-07 Modality-Specialized Synergizers for Interleaved Vision-Language Generalists Modalitätsspezialisierte Synergizer für interleaved Vision-Language Generalists 外离视觉语言通用通识主义者模式-专门化同步器 2407.03604v2
  • 739 06-07 LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models LLM-SRBench: Ein neuer Benchmark für die Entdeckung wissenschaftlicher Gleichungen mit großen Sprachmodellen LLM-SRBench:采用大语言模式科学等同发现新基准 2504.10415v2
  • 740 06-07 Automatic Speech Recognition of African American English: Lexical and Contextual Effects Automatische Spracherkennung des African American English: Lexikalische und Kontexteffekte 非裔英语的自动语音识别:词汇和背景效果 2506.06888v1
  • 741 06-07 Mixture of Small and Large Models for Chinese Spelling Check Mischung aus kleinen und großen Modellen für chinesische Rechtschreibprüfung 用于中文拼写检查的小型和大型模型混合体 2506.06887v1
  • 742 06-07 DISC: Plug-and-Play Decoding Intervention with Similarity of Characters for Chinese Spelling Check DISC: Plug-and-Play-Decoding-Intervention mit Ähnlichkeit von Zeichen für chinesische Rechtschreibprüfung DISC: 用于中文拼写检查的插件和打字解码干预,其字符相似 2412.12863v2
  • 743 06-07 Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning Recht ist nicht genug: Die Pitfalls von Outcome Supervision in der Ausbildung LLMs for Math Reasoning 权利还是不够的:数学原因培训优等生培训成果监督的空洞 2506.06877v1
  • 744 06-07 Watermark under Fire: A Robustness Evaluation of LLM Watermarking Wasserzeichen unter Feuer: Eine Robustheitsbewertung von LLM Watermarking 火灾下的水印:对LLM水标志的强力评价 2411.13425v3
  • 745 06-07 Taxonomizing Representational Harms using Speech Act Theory Taxonomizing Representative Harms mit Sprachgesetz-Theorie 使用演讲法理论对代表性伤害进行分类 2504.00928v2
  • 746 06-07 Active Preference Optimization for Sample Efficient RLHF Aktive Preference-Optimierung für Probeneffizienz RLHF 抽样高效RLHF的主动首选最佳优化 2402.10500v3
  • 747 06-07 Grounded Persuasive Language Generation for Automated Marketing Gegründete überzeugende Sprachgenerierung für automatisiertes Marketing 用于自动营销的有源辅助性语言生成 2502.16810v3
  • 748 06-07 Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models Einmal anpassen, Thrive mit Updates: Übertragbar Parameter-Effizient Feintuning auf sich entwickelnden Basismodellen 一次适应,带更新的 “ 振动式 “ :可转让参数 – – 对不断演化的基建模型进行有效微调 2506.06844v1
  • 749 06-07 PCoT: Persuasion-Augmented Chain of Thought for Detecting Fake News and Social Media Disinformation PCoT: Persuasion-Augmented Chain of Thought zur Erkennung von Fake News und Social Media Desinformation PCOT: 检测假冒新闻和社会媒体假消息的劝导-增强思维链 2506.06842v1
  • 750 06-07 Alignment Drift in CEFR-prompted LLMs for Interactive Spanish Tutoring Ausrichtung Drift in CEFR-geförderten LLMs für interaktives Spanisch Tutoring CEFR-快速推动的交互式西班牙语导图的对齐磁盘驱动器LMs 2505.08351v2
  • 751 06-07 Cross-Entropy Games for Language Models: From Implicit Knowledge to General Capability Measures Cross-Entropy-Spiele für Sprachmodelle: Vom Impliziten Wissen bis hin zu allgemeinen Fähigkeiten 语文模式跨英语运动会:从隐隐知识到一般能力措施 2506.06832v1
  • 752 06-07 FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-World LoRA FedALT: Federated Fine-Tuning durch Adaptives lokales Training mit Rest-of-World LoRA FedALT:通过与世界其余地区合作的适应性地方培训实现联邦退休 2503.11880v2
  • 753 06-07 Data Shapley in One Training Run Data Shapley in einem Trainingslauf 一次训练运行中的Data Shapley 2406.11011v3
  • 754 06-07 AutoPal: Autonomous Adaptation to Users for Personal AI Companionship AutoPal: Autonome Anpassung an Benutzer für die persönliche KI-Companionship AutoPal: 个人AI 个人伴侣关系用户自动适应 2406.13960v4
  • 755 06-07 Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs 超越分类:以多任务音频LLLMs为理由的言论情感 2506.06820v1
  • 756 06-07 How do datasets, developers, and models affect biases in a low-resourced language? Wie wirken sich Datensätze, Entwickler und Modelle auf Voreingenommenheiten in einer Sprache mit geringem Ressourcenbedarf aus? 数据集、开发者和模型如何影响低资源语言的偏差? 2506.06816v1
  • 757 06-07 BTPD: A Multilingual Hand-curated Dataset of Bengali Transnational Political Discourse Across Online Communities BTPD: Mehrsprachiger handkuratierter Datensatz von Bengali Transnationaler politischer Diskurs über Online-Gemeinschaften BTPD: 孟加拉跨国在线社区政治论文多语种手动多语种数据集 2506.06813v1
  • 758 06-07 Advancing Question Generation with Joint Narrative and Difficulty Control Verbesserung der Fragegenerierung mit gemeinsamer Narrativ- und Schwierigkeitskontrolle 推进问题生成,并采用 “ 共同叙述和困难联合控制 “ 2506.06812v1
  • 759 06-07 Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events Nicht ganz Sherlock Holmes: Sprachmodellvorhersagen unterscheiden nicht zuverlässig unmöglich von unwahrscheinlichen Ereignissen 夏洛克·福尔摩斯:语言模型预测无法可靠地区分不可能与不可能发生的事件。 2506.06808v1
  • 760 06-07 Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification Label-Semantik Aware Generativer Ansatz für Domain-Agnostic Multilabel-Klassifikation 域-不可知性多标签分类的认知生成方法 2506.06806v1
  • 761 06-07 Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs Dialog ohne Grenzen: Konstant-Sized KV-Caches für erweiterte Reaktionen in LLMs 无限制对话框: 普通 KV 宽度响应库( LLM ) 2503.00979v2
  • 762 06-07 On the Adaptive Psychological Persuasion of Large Language Models Über die adaptive Psychologische Überzeugung von großen Sprachmodellen 关于大语言模型的适应性心理心理分析 2506.06800v1
  • 763 06-07 ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models ZeroDL: Null-Schuss-Distributions-Lernen für Text-Clustering über große Sprachmodelle 零DL:通过大语言模式对文本分组进行零光分发学习 2406.13342v2
  • 764 06-07 Extending dependencies to the taggedPBC: Word order in transitive clauses Ausdehnung der Abhängigkeiten auf den TaggedPBC: Word-Ordnung in transitiven Klauseln 将依赖延伸至贴有标签的PBC:过渡条款中的Word顺序 2506.06785v1
  • 765 06-07 They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse Sie wollen vorgeben, nicht zu verstehen: Die Grenzen aktueller LLMs bei der Interpretation impliziter Inhalte politischer Diskurse 他们想假装不理解:当前LLMs在解释政治论文隐含内容方面的局限性 2506.06775v1
  • 766 06-07 Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models 在《愿景语言模式》中实体知识采掘不同模式的绩效差距 2412.14133v2
  • 767 06-07 On Adversarial Robustness of Language Models in Transfer Learning Zur widerlichen Robustheit von Sprachmodellen im Transfer-Lernen 转让学习中语言模式的反效果 2501.00066v2
  • 768 06-07 AfroBench: How Good are Large Language Models on African Languages? AfroBench: Wie gut sind große Sprachmodelle auf afrikanischen Sprachen? AfroBench:非洲语言大语言模式有多好? 2311.07978v5
  • 769 06-07 Geopolitical biases in LLMs: what are the “good” and the “bad” countries according to contemporary language models Geopolitische Voreingenommenheiten in LLMs: Was sind die “guten” und die “schlechten” Länder nach zeitgenössischen Sprachmodellen LLMM中的地缘政治偏见:根据当代语言模式,什么是“好”和“坏”国家? 2506.06751v1
  • 770 06-07 C-PATH: Conversational Patient Assistance and Triage in Healthcare System C-PATH: Conversational Patient Assistance und Triage in Healthcare System C-PATH: 病人相互援助和保健系统中的转诊 2506.06737v1
  • 771 06-07 Mitigating Object Hallucination via Robust Local Perception Search Objekt-Halluzination durch robuste lokale Wahrnehmungssuche abmildern 通过强力局部感知搜索减轻天体幻觉 2506.06729v1
  • 772 06-07 Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space Verbreiten Sie Ihren SCOPE! Effiziente Multiturn-Conversation-Planung für LLMs mit semantischem Raum 扩大你的SCOPE ! 与语义空间的LLMs 高效多方向对话规划 。 2503.11586v2
  • 773 06-07 nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning nvBench 2.0: Ambiguität in der Text-zu-Visualisierung durch schrittweise Vernunft auflösen nvBench 2.0:通过逐步说明理由解决文本到虚拟化的模糊性 2503.12880v2
  • 774 06-07 From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment Von der Bedrohung zum Werkzeug: Rückgängige Injektionsangriffe für die Sicherheitsausrichtung 从威胁到工具:为安全协调而利用拒绝的故意喷射攻击 2506.10020v1
  • 775 06-07 RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking RED QUEEN: Schutz großer Sprachmodelle gegen verschleierte Multi-Turn-Jailbreaking RED QUEEN: 保护大型语言模型,防止隐蔽的多轮破门破门 2409.17458v2
  • 776 06-07 A Survey of Retentive Network Ein Überblick über das Netz der Retentive 网上反应网络调查 2506.06708v1
  • 777 06-07 DivScore: Zero-Shot Detection of LLM-Generated Text in Specialized Domains DivScore: Zero-Shot-Erkennung von LLM-generiertem Text in spezialisierten Domains DivScore: 专用域内LLM-发光文本零热探测 2506.06705v1
  • 778 06-07 Dynamic and Parametric Retrieval-Augmented Generation Dynamische und parametrische Retrieval-Augmented Generation 动态和参数检索增强的一代 2506.06704v1
  • 779 06-07 from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors aus Benign Import Giftic: Jailbreaking the Language Model via Adversarial Metaphors 有毒物质:通过反面符号破坏语言模式的监狱 2503.00038v3
  • 780 06-07 MarginSel : Max-Margin Demonstration Selection for LLMs MarginSel : Max-Margin-Demonstrationsauswahl für LLMs MarginSel : LLMs 的最大间隔示例选择 2506.06699v1
  • 781 06-07 Contextual Experience Replay for Self-Improvement of Language Agents Kontextuelles Erfahrungsreplay zur Selbstverbesserung von Sprachagenten 语文代理人员自我改进的内事经验回顾 2506.06698v1
  • 782 06-07 Legal Mathematical Reasoning with LLMs: Procedural Alignment through Two-Stage Reinforcement Learning Rechtliche Mathematische Reasoning mit LLMs: Verfahrensorientierte Ausrichtung durch Zwei-Stufen-Verstärkungs-Lernen 与LLMM女士的法律数学理由:通过双层强化学习实现程序协调 2504.02590v2
  • 783 06-07 Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding Erweiterung der Input-Label Mapping im In-Context-Lernen mit kontrastiver Dekodierung 加强在与反比下标的学习的文内学习中的输入标标映射 2502.13738v2
  • 784 06-07 CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models CVC: Eine groß angelegte chinesische Wertregel Corpus zur Wertausrichtung großer Sprachmodelle CVC: 大型中文大语言模式价值调整大型中国价值规则公司 2506.01495v2
  • 785 06-07 Examining the Mental Health Impact of Misinformation on Social Media Using a Hybrid Transformer-Based Approach Untersuchung der psychischen gesundheitlichen Auswirkungen von Fehlinformationen auf soziale Medien mittels eines hybriden Transformer-basierten Ansatzes 利用混合变革办法审查错误信息对社会媒体的心理健康影响 2503.02333v2
  • 786 06-07 Learning Distribution-Wise Control in Representation Space for Language Models Learning Distribution-Wise Control im Repräsentationsraum für Sprachmodelle 语文模式代表空间的学习分配-网络控制 2506.06686v1
  • 787 06-07 SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment SudoLM: Lernzugangskontrolle von parametrischem Wissen mit Autorisierungsausrichtung SudoLM: 学习以统一授权方式对参数知识进行访问控制 2410.14676v3
  • 788 06-07 Quantile Regression with Large Language Models for Price Prediction Quantile Regression mit großen Sprachmodellen für die Preisvorhersage 具有大语言价格预测模型的量化递减 2506.06657v1
  • 789 06-07 SafeLawBench: Towards Safe Alignment of Large Language Models SafeLawBench: Auf dem Weg zu einer sicheren Ausrichtung großer Sprachmodelle ” 安全法:实现大语言模式安全一致 “ 2506.06636v1
  • 790 06-07 Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning Lehrplan Verstärkung Lernen von einfachen bis harten Aufgaben verbessert LLM Reasoning 从易到困难的工作学习改进LLM 理由 2506.06632v1
  • 791 06-07 LLMs Can Simulate Standardized Patients via Agent Coevolution LLMs können standardisierte Patienten über Agent Coevolution simulieren LLM Can 通过革命代理人模拟标准化病人 2412.11716v2
  • 792 06-07 Psychological Counseling Cannot Be Achieved Overnight: Automated Psychological Counseling Through Multi-Session Conversations Psychologische Beratung kann nicht über Nacht erreicht werden: Automatisierte Psychologische Beratung durch Multi-Session-Konversationen 心理心理咨询不可能在一夜之间达成:通过多次会议对话提供自动心理咨询 2506.06626v1
  • 793 06-07 Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps Können MLLMs mich nach Hause führen? Eine Benchmark-Studie zum feinkörnigen visuellen Schlussfolgern anhand von Transit Maps MLLMs 能指引我回家吗? 关于公交线路图细粒度视觉推理的基准研究 2505.18675v2
  • 794 06-07 BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs BriefMe: Ein gesetzlicher NLP-Benchmark für die Unterstützung mit rechtlichen Briefen 简报:协助提供法律简报的《国家劳工规划法》法律基准 2506.06619v1
  • 795 06-07 Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications Open-FinLLMs: Offene multimodale große Sprachmodelle für Finanzanwendungen Open-FinLLMs:用于金融应用的开放多模态大语言模型 2408.11878v3
  • 796 06-07 Interpretable Depression Detection from Social Media Text Using LLM-Derived Embeddings Interpretable Depression Detection von Social Media Text mit LLM-derived Embeddings 从使用LLM-Derived嵌入器的社会媒体文字中可解释的抑郁症探测 2506.06616v1
  • 797 06-07 Transferring Features Across Language Models With Model Stitching Übertragung von Funktionen über Sprachmodelle mit Modellstich 使用模型裁剪的跨语言模型传输功能 2506.06609v1
  • 798 06-07 Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit Training-Free Tokenizer Transplantation über Orthogonal Passende Verfolgung 通过矫形配对运动进行无培训的取暖器移植 2506.06607v1
  • 799 06-07 MedCite: Can Language Models Generate Verifiable Text for Medicine? MedCite: Können Sprachmodelle verifizierbaren Text für die Medizin generieren? MedCite:语言模型能够产生可核实的医学文本吗? 2506.06605v1
  • 800 06-07 Retrieval-augmented Large Language Models for Financial Time Series Forecasting Retrieval-augmentierte große Sprachmodelle für die Prognose der Finanzzeitreihen 财务时间序列预测的检索增强型大语言模型 2502.05878v3
  • 801 06-07 Scaling Inference-Efficient Language Models Skalierung von Schlussfolgerungen-Effiziente Sprachmodelle 扩展率-有效语言模型 2501.18107v2
  • 802 06-06 (5) Precise Information Control in Long-Form Text Generation Präzise Informationssteuerung in der langformatigen Textgenerierung 长期制文本生成的精确信息控制 2506.06589v1
  • 803 06-06 Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques Auf dem Weg zu einer effizienten Multi-LLM-Inferenz: Charakterisierung und Analyse von LLM Routing und Hierarchical Techniques 实现高效的多LLLM 多级LLM推理:LLM转流和等级技术的特征和分析 2506.06579v1
  • 804 06-06 SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions SATA-BENCH: Wählen Sie alle, die Benchmark für Multiple-Choice-Fragen anwenden SATA-BENCH: 选择多个选择问题的所有应用基准 2506.00643v2
  • 805 06-06 DISC: DISC: Dynamic Decomposition Improves LLM Inference Scaling DISC: DISC: Dynamische Zersetzung verbessert LLM-Inferenzskalierung DISC: DISC: 动态分解改善LLM 推推法的扩大 2502.16706v2
  • 806 06-06 Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective Auf dem Weg zur Analyse und dem Verständnis der Grenzen von VAPO: Eine theoretische Perspektive 分析和理解VAPO的局限性:理论视角 2506.03038v2
  • 807 06-06 ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models ThinkEdit: Interpretierbare Gewichtsbearbeitung zur Abschwächung übermäßig kurzen Denkens in Reasoning-Modellen ThinkEdit:通过可解释的权重编辑缓解推理模型中过短的思考 2503.22048v3
  • 808 06-06 LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles LaMP-Cap: Personalisierte Bildunterschriftserstellung mit multimodalen Bildprofilen LaMP-Cap: 具有多模式图解的个人化图解生成 2506.06561v1
  • 809 06-06 CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective CoT ist nicht wahr Vernunft, es ist nur eine enge Beschränkung zu nachahmen: Eine Theorie Perspektive COT不是真实的理由,这只是一个难以想象的束缚:理论观点 2506.02878v2
  • 810 06-06 A UD Treebank for Bohairic Coptic Eine UD-Baumbank für bohairische koptische Bohairic Copth 的UD树库 2504.18386v2
  • 811 06-06 LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs LabSafety Bench: Benchmarking LLMs zu Sicherheitsfragen in wissenschaftlichen Laboren 实验室安全法官:科学实验室安全问题LLM女士基准基准 2410.14182v3
  • 812 06-06 Large Language Models Can Be a Viable Substitute for Expert Political Surveys When a Shock Disrupts Traditional Measurement Approaches Große Sprachmodelle können ein lebensfähiger Ersatz für politische Expertenbefragungen sein, wenn ein Schock traditionelle Messansätze ausbricht 当震震中断传统计量方法时,大语言模型可以成为专家政治调查的可行替代物 2506.06540v1
  • 813 06-06 Beyond Facts: Evaluating Intent Hallucination in Large Language Models Beyond Facts: Intent Halluzination in großen Sprachmodellen bewerten 超越事实:评价大语言模型中内在的幻觉 2506.06539v1
  • 814 06-06 NeoBERT: A Next-Generation BERT NeoBERT: Ein BERT der nächsten Generation NeoBERT: 下一代BERT 2502.19587v2
  • 815 06-06 JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models JailbreakLens: Visuelle Analyse von Jailbreak Attacken gegen große Sprachmodelle 《监狱破纪录:对针对大语言模式的越狱袭击的视觉分析》 2404.08793v2
  • 816 06-06 Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance Fixing it in Post: Eine vergleichende Studie der LLM Nachschulung Datenqualität und Modellleistung 在员额中加以确定:LLM培训后数据质量和示范业绩比较研究 2506.06522v1
  • 817 06-06 Diversity Explains Inference Scaling Laws: Through a Case Study of Minimum Bayes Risk Decoding Diversität erklärt Folgerungen Skalierungsgesetze: Durch eine Fallstudie von Minimum Bayes Risikodekodierung 多样性解释推推推法的扩大法:通过对最低贝系风险解码的个案研究 2410.15021v2
  • 818 06-06 Biases Propagate in Encoder-based Vision-Language Models: A Systematic Analysis From Intrinsic Measures to Zero-shot Retrieval Outcomes Biasen propagate in Encoder-basierte Vision-Sprache Modelle: Eine systematische Analyse von intrinsischen Maßnahmen zu Zero-shot Retrieval Ergebnisse 以编码器为基础的视觉-语言模型中的Biases Proplagation:从内在措施到零射回收结果的系统分析 2506.06506v1
  • 819 06-06 Improving LLM-Powered EDA Assistants with RAFT Verbesserung der LLM-Powered EDA-Assistenten mit RAFT 改进与RAFT合作的LLM授权的EDA助理 2506.06500v1
  • 820 06-06 Synthetic Text Generation for Training Large Language Models via Gradient Matching Synthetische Textgenerierung für das Training großer Sprachmodelle über Gradient Matching 通过 “ 渐级匹配 “ 培训大语言模型的合成文本生成 2502.17607v2
  • 821 06-06 Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws Diversity of Transformer Layers: Ein Aspekt der Parameterskalierungsgesetze 变形层多样性:参数面积法的一个方面 2505.24009v2
  • 822 06-06 What Is Seen Cannot Be Unseen: The Disruptive Effect of Knowledge Conflict on Large Language Models Was gesehen wird, kann nicht ungesehen werden: Die disruptive Wirkung von Wissenskonflikten auf große Sprachmodelle ” 所见之见不可不见:知识冲突对大语言模式的破坏效应 “ 2506.06485v1
  • 823 06-06 Large Language Models Often Know When They Are Being Evaluated Große Sprachmodelle kennen oft, wenn sie bewertet werden 大语言模型经常知道何时被评估 2505.23836v2
  • 824 06-06 BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning BRiTE: Bootstrapping verstärkter Denkprozess zur Verbesserung von Sprachmodellen Brite:强化强化思考程序以加强语言示范理由 2501.18858v2
  • 825 06-06 Latent Thought Models with Variational Bayes Inference-Time Computation Latent Thought Models mit Variational Bayes Inferenz-Zeit-Computation 具有变异性贝湾推推算-时间计算法的古老思想模型 2502.01567v2
  • 826 06-06 Sparse Autoencoders for Hypothesis Generation Sparse Autoencoder für Hypothesis-Generierung 假设代代的粗化自动编码器 2502.04382v3
  • 827 06-06 From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions Von Tools zu Teamkollegen: LLMs in Multi-Session Coding Interaktionen bewerten 从工具到队友:在多次会议编码互动中评估LLMs 2502.13791v2
  • 828 06-06 Canonical Autoregressive Generation Kanonische autoregressive Generation Canonic 自动后退一代 2506.06446v1
  • 829 06-06 Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding Filmfakten und Fibs (MF$^2$): Ein Benchmark für langes Filmverständnis 电影事实和纤维(MF$2美元):长期电影理解基准 2506.06275v1
  • 830 06-06 AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization AdvSumm: Adversariale Ausbildung für Bias Mitigation in Textzusammenfassung AdvSumm:在文本摘要中进行减轻偏见的反向培训 2506.06273v1
  • 831 06-06 MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators MEDAL: Ein Rahmen für Benchmarking von LLMs als mehrsprachige Open-Domain Chatbots und Dialogevaluatoren MEDAL:多语言开放域聊天和对话评价员对LLMs进行基准评估的框架 2505.22777v2
  • 832 06-06 Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks Leopard: Ein Visions-Sprachenmodell für Text-Rich-Multi-Image-Aufgaben 豹:文本-Rich多图像任务愿景语言模型 2410.01744v3
  • 833 06-06 MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models MimeQA: Auf dem Weg zu sozial-intelligenten nichtverbalen Stiftungsmodellen MIMEQA:走向社会智能非语言基金会模式 2502.16671v2
  • 834 06-06 A semantic embedding space based on large language models for modelling human beliefs Eine semantische Einbettung von Raum basierend auf großen Sprachmodellen zur Modellierung menschlicher Überzeugungen 基于模拟人类信仰的大型语言模型的语义嵌入空间 2408.07237v3
  • 835 06-06 PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time PersonaAgent: Wenn große Sprachmodell-Agenten Personalisierung zur Testzeit treffen 人 员:当大语言示范媒介在试验时间达到个性化时 2506.06254v1
  • 836 06-06 Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge Externes und parametrisches Wissen überbrücken: Halluzination von LLMs mit Shared-Private Semantic Synergy im Dual-Stream-Wissen abmildern 架桥外部和参数知识:减少双层知识中具有共同-私人语义协同效应的LLMsLLMs的幻觉 2506.06240v1
  • 837 06-06 Explaining Matters: Leveraging Definitions and Semantic Expansion for Sexism Detection Erklären von Dingen: Beherrschende Definitionen und semantische Expansion für Sexismus-Erkennung 解释事项:利用定义和语义扩展探索性别主义 2506.06238v1
  • 838 06-06 Banyan: Improved Representation Learning with Explicit Structure Banyan: Verbessertes Repräsentationslernen mit expliziter Struktur Banyan: 改善有明确结构的代表制学习 2407.17771v4
  • 839 06-06 Corrector Sampling in Language Models Korrektor-Probenahme in Sprachmodellen 语言模型中校正器样本 2506.06215v1
  • 840 06-06 Can Theoretical Physics Research Benefit from Language Agents? Kann die theoretische Physikforschung von Sprachagenten profitieren? 理论物理研究能否从语言代理中受益? 2506.06214v1
  • 841 06-06 PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts PuzzleWorld: Ein Benchmark für multimodale, offene Vernunft in Puzzlejagden 谜题世界:谜题搜索中多种形式、不限名额的理由基准 2506.06211v1
  • 842 06-06 Building Models of Neurological Language Baumodelle der neurologischen Sprache 神经语言建模模型 2506.06208v1
  • 843 06-06 Kinetics: Rethinking Test-Time Scaling Laws Kinetik: Überdenken von Test-Zeit-Skalierungsgesetzen 动因:重新思考试验时间扩增法 2506.05333v2
  • 844 06-06 Detecting Voice Phishing with Precision: Fine-Tuning Small Language Models Sprach Phishing mit Präzision erkennen: Feintuning-kleine Sprachmodelle 以精密度探测语音幻听:精美使用小型语言模式 2506.06180v1
  • 845 06-06 Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach Läuft es und ist das genug? Text-zu-Chart-Generation mit einem Multi-Agenten-Ansatz 以多机构方式重新审视文本到图表一代 2506.06175v1
  • 846 06-06 Emergent Response Planning in LLMs Emergent Response Planning in LLMs 新的LLMM 应急规划 2502.06258v2
  • 847 06-06 semantic-features: A User-Friendly Tool for Studying Contextual Word Embeddings in Interpretable Semantic Spaces semantische Eigenschaften: Ein benutzerfreundliches Tool zum Studieren von Kontext-Wort-Einbettungen in interpretierbare semantische Räume 语义词义-语义特点:一种用户友好工具,用于研究可解释的语义空间的背景文字嵌入 2506.06169v1
  • 848 06-06 The Lock-in Hypothesis: Stagnation by Algorithm Die Lock-in Hypothese: Stagnation durch Algorithmen Lock-in 假说: 阿尔哥里特姆的停滞 2506.06166v1
  • 849 06-06 Automated Journalistic Questions: A New Method for Extracting 5W1H in French Automatisierte journalistische Fragen: Eine neue Methode zur Extraktion von 5W1H in Französisch 自动新闻问题:以法文提取5W1H的新方法 2505.14804v2
  • 850 06-06 Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning Nicht alle Rollouts sind nützlich: Down-Sampling Rollouts im LLM Verstärkungslernen 并非所有的推出都有用:在LLM强化学习中进行下选推出 2504.13818v2
  • 851 06-06 Masked Language Models are Good Heterogeneous Graph Generalizers Masked Language Models sind gute Heterogene Graph Generalizers 遮罩语言模型是好异基因图形缩略图 2506.06157v1
  • 852 06-06 Towards Effective Extraction and Evaluation of Factual Claims Auf dem Weg zu einer wirksamen Extraktion und Bewertung der tatsächlichen Forderungen 争取有效提取和评估事实申诉 2502.10855v2
  • 853 06-06 LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws LLMs on the Line: Daten bestimmen die Loss-to-Loss-Skalierungsgesetze 线上LLM:数据确定损失到损失的缩放法 2502.12120v2
  • 854 06-06 CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval CLaMR: Kontextualisierte Late-Interaktion für multimodale Content Retrieval CLAMR: 多式内容检索的内幕性后期交流交流 2506.06144v1
  • 855 06-06 Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models Tabelle-r1: Selbstbeaufsichtigtes und verstärktes Lernen für programmbasierte Tabellenveranlagung in kleinen Sprachmodellen 表-r1:针对基于方案的小语言模式说明理由的表格进行自我监督和强化学习 2506.06137v1
  • 856 06-06 Let’s CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition Lassen Sie uns CONFER: Ein Datensatz für die Bewertung der natürlichen Sprache Inferenzmodelle auf konditional Inference und Presupposition 让我们考虑一下:评估关于有条件责任和假设的自然语言推断模型的数据集。 2506.06133v1
  • 857 06-06 Unveiling Topological Structures from Language: A Comprehensive Survey of Topological Data Analysis Applications in NLP Enthüllen topologischer Strukturen aus der Sprache: Eine umfassende Erhebung von topologischen Datenanalyseanwendungen in NLP 语言中不固定的地形结构:国家语言规划局地形数据分析应用综合调查 2411.10298v3
  • 858 06-06 Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction Telefonisch-angereicherte diskriminative Korrektur für Sprachsuchfehler-Korrektur 语音搜索错误校正的声频搜索 2506.06117v1
  • 859 06-06 Bridging the Gap: In-Context Learning for Modeling Human Disagreement Bridging the Gap: In-Context-Lernen zur Modellierung von menschlichen Uneinigkeiten 弥合差距:模拟人类分歧的理论内学习 2506.06113v1
  • 860 06-06 Opt-Out: Investigating Entity-Level Unlearning for Large Language Models via Optimal Transport Opt-Out: Untersuchung von Entity-Level-Unlearning für große Sprachmodelle via Optimal Transport 选择结果:调查实体一级通过最佳运输方式对大语言模式的不学习情况 2406.12329v3
  • 861 06-06 The Canary’s Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text Das Echo der Kanaren: Auditing Privacy Risks of LLM-generated Synthetic Text 加那利的回声:LLM发明的合成文字的 审计隐私风险 2502.14921v2
  • 862 06-06 Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model Effiziente und direkte Duplex-Modellierung für Speech-to-Speech-Sprachenmodell 语音语音和语音语言模式的高效和直接双重模式 2505.15670v2
  • 863 06-06 Label-Context-Dependent Internal Language Model Estimation for CTC Label-Context-Dependent Interne Sprachmodellschätzung für CTC 面向CTC的标签上下文相关内部语言模型估计 2506.06096v1
  • 864 06-06 Reinforcing Code Generation: Improving Text-to-SQL with Execution-Based Learning Stärkung der Code-Generierung: Verbesserung von Text-zu-SQL mit execution-basiertem Lernen 加强代码生成:通过以执行为基础的学习改进文本到SQL 2506.06093v1
  • 865 06-06 HeavyWater and SimplexWater: Watermarking Low-Entropy Text Distributions HeavyWater und SimplexWater: Wasserzeichen für Textverteilungen mit geringer Entropie HeavyWater和SimplexWater:低熵文本分布的水印 2506.06409v1
  • 866 06-06 GraphCheck: Multi-Path Fact-Checking with Entity-Relationship Graphs GraphCheck: Multi-Path-Faktenprüfung mit Entity-Relationship Graphen 图表检查:多平面实况调查实体-关系图 2502.20785v2
  • 867 06-06 Tug-of-war between idiom’s figurative and literal meanings in LLMs Tauziehen zwischen figurativen und wörtlichen Bedeutungen von Idiomen in LLMs LLM中习语比喻义与字面义之间的拉锯战 2506.01723v3
  • 868 06-06 CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition CO-VADA: Ein vertrauensorientierter Voice Augmentation Debiasing-Ansatz für faire Sprach-Emotionserkennung CO-VADA: 以信任为导向的增强声音消除偏见的公平言论态度承认办法 2506.06071v1
  • 869 06-06 Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning Zero-Shot-Erkennung des LLM-generierten Codes über eine angenäherte Aufgabenkonditionierung 通过近似任务设置条件对LLM生成的代码进行零热检测 2506.06069v1
  • 870 06-06 Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models Einfach und dennoch effektiv: Extrahieren von Privatdaten über Clients im Federated Fine-Tuning von großen Sprachmodellen 简单但有效:在联邦大语言模式价格调整中,在客户中提取私人数据 2506.06060v1
  • 871 06-06 HIGHT: Hierarchical Graph Tokenization for Molecule-Language Alignment HIGHT: Hierarchische Graphen-Tokenisierung für die Molekül-Sprach-Ausrichtung HIGHT: 面向分子-语言对齐的层次化图分词 2406.14021v2
  • 872 06-06 Hey, That’s My Data! Label-Only Dataset Inference in Large Language Models Hey, das sind meine Daten! Label-Only-Datensatz-Schlussfolgerung in großen Sprachmodellen 嘿,这就是我的数据! 大语言模型中的标签 - 唯一数据集推理 2506.06057v1
  • 873 06-06 MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding MedXpertQA: Benchmarking von medizinischem Reasoning und Verständnis auf Expertenebene MedXpertQA:专家级医学推理与理解基准 2501.18362v3
  • 874 06-06 SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities SMAR: Soft Modality-Aware Routing-Strategie für MoE-basierte multimodale große Sprachmodelle Erhaltung von Sprachfähigkeiten SMAR: 面向基于MoE的多模态大语言模型、保持语言能力的软模态感知路由策略 2506.06406v1
  • 875 06-06 Rethinking Machine Unlearning in Image Generation Models Rethinking Machine Unlearning in Image Generation Models 在图像生成模型中重新思考机离学习 2506.02761v2
  • 876 06-06 MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems? MATP-BENCH: Kann MLLM für multimodale Probleme ein gutes Automatisiertes Theorem-Prover sein? MLLM能成为多模式问题的一个好的自动理论导论吗? 2506.06034v1
  • 877 06-06 Large Language Models are Demonstration Pre-Selectors for Themselves Große Sprachmodelle sind Demonstrationsvorwähler für sich selbst 大语言模型是演示预选人 2506.06033v1
  • 878 06-06 Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation Begründung durch Ausführung: Vereinheitlichung von Prozess- und Ergebnisprämien für die Codegenerierung 执行中的理由:代码生成的统一程序和结果奖励 2412.15118v2
  • 879 06-06 Improving Customer Service with Automatic Topic Detection in User Emails Verbesserung des Kundenservices mit automatischer Themenerkennung in Benutzer-E-Mails 改进用户电子邮件自动专题检测客户服务 2502.19115v3
  • 880 06-06 When to Trust Context: Self-Reflective Debates for Context Reliability Wann man vertraut Kontext: Selbstreflektierende Debatten für Kontext Zuverlässigkeit 何时信任背景:关于环境可靠性的自反辩论 2506.06020v1
  • 881 06-06 AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search AgentSwift: Effizientes LLM Agent Design über wertgeführte Hierarchische Suche AgentSwift: 高效LLM Agent 设计 2506.06017v1
  • 882 06-06 Unlocking Recursive Thinking of LLMs: Alignment via Refinement Entsperrendes rekursives Denken an LLMs: Ausrichtung durch Verfeinerung 解锁对LLMM的递归思维:通过精炼调整 2506.06009v1
  • 883 06-06 Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models Token-Signatur: Vorhersage von Chain-of-Thought-Gewinnen mit Token-Decoding-Funktion in großen Sprachmodellen Tok 签名:以大语言模型的 Tok 解码功能预测所寻求的收益链 2506.06008v1
  • 884 06-06 Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models Bootstrapping-Weltmodelle aus Dynamics-Modellen in multimodalen Foundation-Modellen 从多式联运基础模型中的动态模型中推出世界模型 2506.06006v1
  • 885 06-06 Judgment of Learning: A Human Ability Beyond Generative Artificial Intelligence Urteil des Lernens: Eine menschliche Fähigkeit jenseits Generativer Künstlicher Intelligenz 学习总结:人类能力超越创造人工智能的能力 2410.13392v3
  • 886 06-06 Does It Make Sense to Speak of Introspection in Large Language Models? Macht es Sinn, von Introspektion in großen Sprachmodellen zu sprechen? 它是否让人们想到在大语言模型中谈论反省? 2506.05068v2
  • 887 06-06 A Culturally-Rich Romanian NLP Dataset from “Who Wants to Be a Millionaire?” Videos Ein kultureller Rich rumänischer NLP-Datensatz aus “Wer will Millionär werden?” Videos 罗马尼亚NLP数据来自“谁想成为百万富翁?” 2506.05991v1
  • 888 06-06 Peri-LN: Revisiting Normalization Layer in the Transformer Architecture Peri-LN: Normalisierungsschicht in der Transformer-Architektur neu aufsuchen Peri-LN: 变形器结构中重新审视正常化层 2502.02732v3
  • 889 06-06 DPO-Shift: Shifting the Distribution of Direct Preference Optimization DPO-Shift: Verschieben der Verteilung der Direktpräferenzoptimierung DPO-Shift:改变直接优惠优化分配 2502.07599v2
  • 890 06-06 Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations Jacobian Sparse Autoencoder: Sparsify Computations, nicht nur Aktivierungen Jacobian Sparse Autoencoders: 稀疏化计算,而不仅仅是激活 2502.18147v2
  • 891 06-06 A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations Eine Übersicht über automatische Bewertungsmethoden zu Text-, Bild- und Sprachgenerierungen 文字、视觉和语音一代自动评价方法调查 2506.10019v1
  • 892 06-06 The Synergy of LLMs & RL Unlocks Offline Learning of Generalizable Language-Conditioned Policies with Low-fidelity Data Die Synergie von LLMs & RL entsperrt Offline-Lernen allgemeiner sprachkonditionierter Politiken mit Daten mit geringer Fidelität LLMs和RL Unlocks离线学习LLM和RL Unlocks的协同效应 2412.06877v2
  • 893 06-06 Audio-Aware Large Language Models as Judges for Speaking Styles Audio-Bewusst Große Sprachmodelle als Richter für sprechende Stile 音响软件大语言模型,作为语音风格法官 2506.05984v1
  • 894 06-06 Tau-Eval: A Unified Evaluation Framework for Useful and Private Text Anonymization Tau-Eval: Ein einheitliches Evaluations-Framework für sinnvolle und private Textanonymisierung Tau-Eval:有用和私用文本匿名化统一评价框架 2506.05979v1
  • 895 06-06 LTG at SemEval-2025 Task 10: Optimizing Context for Classification of Narrative Roles LTG bei SemEval-2025 Aufgabe 10: Optimierung des Kontextes für die Klassifizierung narrativer Rollen SemEval-2025 LTG 任务10:优化叙述作用分类的背景 2506.05976v1
  • 896 06-06 Let’s Put Ourselves in Sally’s Shoes: Shoes-of-Others Prefixing Improves Theory of Mind in Large Language Models Setzen wir uns selbst in Sallys Schuhe: Andere Schuhe präfixieren verbessert die Theorie des Geistes in großen Sprachmodellen 让我们把我们自己放在莎莉的鞋子里: 在大语言模型中, 鞋上的其他人的预设改进思维理论 2506.05970v1
  • 897 06-06 TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge TRACT: Regressionsbewusstes Feinjustieren trifft auf eine schlüsselfertige Begründung für LLM-as-a-Richter TRACT:LLM作为法官的理由 2503.04381v2
  • 898 06-06 Diving into Self-Evolving Training for Multimodal Reasoning Eintauchen in das Selbst-Evolving Training für multimodale Vernunft 跳入多模式理由自我发展培训 2412.17451v3
  • 899 06-06 GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression GRASP: Redundante Ebenen durch adaptive Singular-Parameter für effiziente Modellkompression ersetzen GRASP: 用高效模型压缩的适应性单质参数取代冗余层 2501.00339v3
  • 900 06-06 Elementary Math Word Problem Generation using Large Language Models Grundlegende Math Word-Problem-Erstellung mit großen Sprachmodellen 使用大语言模式制造问题 2506.05950v1
  • 901 06-06 NameTag 3: A Tool and a Service for Multilingual/Multitagset NER NameTag 3: Ein Werkzeug und ein Service für Multilingual/Multitagset NER 名称Tag 3: 多语种/多语种NER的工具和服务 2506.05949v1
  • 902 06-06 IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems IntentionESC: Ein intentionszentrierter Rahmen zur Verbesserung der emotionalen Unterstützung in Dialogsystemen 意图ESC:加强对话系统中情感支持的以意图为中心的框架 2506.05947v1
  • 903 06-06 AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML AutoML-Agent: Ein Multi-Agent-LLM-Framework für Full-Pipeline-AutoML AutoML-Agent: 面向全流程AutoML的多智能体LLM框架 2410.02958v2
  • 904 06-06 DynamicMind: A Tri-Mode Thinking System for Large Language Models DynamicMind: Ein Tri-Mode-Denksystem für große Sprachmodelle 动态Mind:大语言模型三模式思维系统 2506.05936v1
  • 905 06-06 CAT-LLM: Style-enhanced Large Language Models with Text Style Definition for Chinese Article-style Transfer CAT-LLM: Stilverstärkte große Sprachmodelle mit Textstil-Definition für den chinesischen Artikel-Stil-Transfer CAT-LLM: 具有中文本风格定义的中文本文字风格转让强化型大语言模式 2401.05707v2
  • 906 06-06 MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models MoA: Heterogene Mischung von Adaptern für ein parametereffizientes Feintuning von großen Sprachmodellen MoA:大型语言模型参数有效精美设计应用适应器的异变混合体 2506.05928v1
  • 907 06-06 LengClaro2023: A Dataset of Administrative Texts in Spanish with Plain Language adaptations LengClaro2023: Ein Datensatz von Verwaltungstexten auf Spanisch mit plain language adaptations 冷克拉罗2023:西班牙文行政文本数据集,有普通语言适应 2506.05927v1
  • 908 06-06 Generating Grounded Responses to Counter Misinformation via Learning Efficient Fine-Grained Critiques Generierung geerdeter Reaktionen zur Bekämpfung von Fehlinformationen durch Learning Efficient Fine-Grained Critiques 通过 “ 学习高效率的先进经验 “ ,对错误信息采取有底基的对策 2506.05924v1
  • 909 06-06 Proactive Assistant Dialogue Generation from Streaming Egocentric Videos Proaktive Assistant Dialogue Generation von Streaming Egozentrischen Videos 主动的助理对话助理,从流动的以地球为中心的视频中生成 2506.05904v1
  • 910 06-06 CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness CAPability: Ein umfassender Visual Caption Benchmark zur Bewertung von Korrektheit und Gründlichkeit 能力:评估正确性和彻底性的综合视觉能力基准 2502.14914v3
  • 911 06-06 Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router Route-and-Reason: Scaling Large Language Model Reasoning mit verstärktem Modellrouter 路由和反应:使用强化示范路由器扩大大语言模型 2506.05901v1
  • 912 06-06 MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models MapEval: Eine kartenbasierte Bewertung von Geo-Spatial Reasoning in Foundation-Modellen MapEval:基础模型中地理空间推理的基于地图的评估 2501.00316v2
  • 913 06-06 PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning PoisonBench: Bewertung der Vulnerabilität eines großen Sprachmodells für Datenvergiftung PoisonBench:评估大语言模型对数据投毒的脆弱性 2410.08811v2
  • 914 06-06 Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models Infi-MMR: Curriculumbasiertes Entsperren multimodaler Vernunft durch schrittweises Verstärktes Lernen in multimodalen Small Language-Modellen Infi-MMR:通过在多模式小型语言模式中分阶段强化学习,以课程为基础解锁多模式推理 2505.23091v2
  • 915 06-06 ECoRAG: Evidentiality-guided Compression for Long Context RAG ECoRAG: Evidentialitätsgeführte Kompression für lange RAG-Kontext ECORAG: 长期RAG的证据制导压缩 2506.05167v2
  • 916 06-06 FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging FinanceReasoning: Benchmarking der finanziellen numerischen Reasoning glaubwürdiger, umfassender und anspruchsvoller 财务责任:确定金融数字理由基准,使其更加可信、全面和具有挑战性 2506.05828v1
  • 917 06-06 Training Software Engineering Agents and Verifiers with SWE-Gym Schulung von Software Engineering Agents und Prüfern mit SWE-Gym SWE-Gym培训软件工程代理和验证人 2412.21139v2
  • 918 06-06 Taming Knowledge Conflicts in Language Models Zähmen von Wissenskonflikten in Sprachmodellen 在语言模式中讨论知识冲突 2503.10996v2
  • 919 06-06 Not All Jokes Land: Evaluating Large Language Models Understanding of Workplace Humor Nicht alle Witze Land: Bewertung großer Sprachmodelle Verständnis des Arbeitsplatzes Humor 并非所有的笑话地:评价工作场所幽默性大语言模型理解 2506.01819v2
  • 920 06-06 CodeContests+: High-Quality Test Case Generation for Competitive Programming CodeContests+: Hochqualitative Testfall-Generation für wettbewerbsfähige Programmierung 标准测试+:为竞争性方案拟订编制高品质测试个案 2506.05817v1
  • 921 06-06 MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning MAPLE: Multi-Agent-Adaptive-Planung mit Langzeit-Speicher für Tabellen-Reasoning MAPLE: 面向表格推理的具有长期记忆的多智能体自适应规划 2506.05813v1
  • 922 06-06 E-LDA: Toward Interpretable LDA Topic Models with Strong Guarantees in Logarithmic Parallel Time E-LDA: Auf dem Weg zu interpretierbaren LDA-Themenmodellen mit starken Garantien in logarithmischer Parallelzeit E-LDA:争取在对数平行时间建立具有有力保证的可解释的LDA专题模型 2506.07747v1
  • 923 06-06 DiMA: An LLM-Powered Ride-Hailing Assistant at DiDi DiMA: Ein LLM-Powered Ride-Hailing Assistant bei DiDi DiMA:DiDi的LLM驱动网约车助手 2503.04768v2
  • 924 06-06 Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights Unbeabsichtigte Schäden an wertorientierten LLMs: Psychologische und empirische Erkenntnisse 价值观一致的LMMs:心理和实证透视的意外伤害 2506.06404v1
  • 925 06-06 TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining TiC-LM: Ein Web-Scale-Benchmark für zeitkontinuierliches LLM-Vortraining TiC-LM:时间持续LLM预训练的网络规模基准 2504.02107v3
  • 926 06-06 Lost in the Passage: Passage-level In-context Learning Does Not Necessarily Need a “Passage” Lost in the Passage: Passage-Level In-Context Learning braucht nicht unbedingt eine “Passage” 遗失在通行证中:通过程度的内文学习不需要“通行证” 2502.10634v2
  • 927 06-06 LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification LLM im Loop: Erstellen des ParaDeHate-Datensatzes für die Entgiftung von Hassreden 圈圈中的LLM:为仇恨言论解毒创建ParaDeHate数据集 2506.01484v2
  • 928 06-06 Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning Benchmarking und Verbesserung großer Visions-Sprachenmodelle für grundlegendes visuelles Verständnis und Vernunft 制定基准和改进基本视觉图解理解和理由的大型视野-语言模型 2412.13540v3
  • 929 06-06 Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective Bilder sprechen lauter als Worte: Bias im Vision-Language-Modell aus einer ursächlichen Vermittlungsperspektive verstehen und abmildern 图像比单词更大声:从原因调解角度理解和减轻视觉语言模型中的偏见 2407.02814v3
  • 930 06-06 Discrete Minds in a Continuous World: Do Language Models Know Time Passes? Diskrete Gedanken in einer kontinuierlichen Welt: Kennen Sprachmodelle Zeitüberschreitungen? 持续世界中的分化思想:语言模型是否了解时间? 2506.05790v1
  • 931 06-06 IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis IDA-Bench: Bewertung von LLMs zur interaktiven geführten Datenanalyse IDA-Bench:在交互式引导数据分析上评估LLM 2505.18223v2
  • 932 06-06 LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews LGAR: Zero-Shot LLM-geführtes Neural Ranking für Abstract Screening in Systematic Literature Reviews LGAR: 系统文学评论中抽象筛选的零热热LLM-指导神经神经定级 2505.24757v2
  • 933 06-06 m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training m-KAILIN:生物医学大型语言模型培训知识开发剂科学公司蒸馏框架 2504.19565v2
  • 934 06-06 dots.llm1 Technical Report dots.llm1 Technischer Bericht dots.llm1 技术报告 2506.05767v1
  • 935 06-06 BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions BioMol-MQA: Multi-Modal Frage beantworten Datensatz für LLM-Gründung über bio-Molekulare Interaktionen BioMol-MQA:生物分子相互作用的多种模式解答LLM问题数据集 2506.05766v1
  • 936 06-06 On the Query Complexity of Verifier-Assisted Language Generation Zur Fragekomplexität der sprachlichen Generierung von Prüfern 关于验证者协助语言生成的查询复杂程度 2502.12123v2
  • 937 06-06 Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning 书写-RL:通过适应性课程强化学习推进长式写作 2506.05760v1
  • 938 06-06 Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective Beschränkte Probenahmen für Sprachmodelle sollten einfach sein: Eine MCMC-Perspektive 语言模式抽样应容易:MCMC视角 2506.05754v1
  • 939 06-06 ProSec: Fortifying Code LLMs with Proactive Security Alignment ProSec: Stärkung von Code-LLMs mit proaktiver Sicherheitsausrichtung ProSec: 通过主动安全对齐强化代码LLM 2411.12882v3
  • 940 06-06 Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models Emergente symbolische Mechanismen unterstützen abstrakte Vernunft in großen Sprachmodellen 支持大语言模式的新兴机制支持摘要 2502.20332v2
  • 941 06-06 LLM-Symbolic Integration for Robust Temporal Tabular Reasoning LLM-Symbolische Integration für robuste temporale Tabulare Reasoning LLM- 用于强力时时制表理由的 Symblic 集成 2506.05746v1
  • 942 06-06 ResearchTown: Simulator of Human Research Community ResearchTown: Simulator der Human Research Community 研究对象:人类研究界模拟者 2412.17767v2
  • 943 06-06 Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness Vergessen LLMs wirklich? Bewertung des Unlernens mit Wissenskorrelation und Vertrauensbewusstsein LLMs真的会忘却吗?用知识关联和信心意识来评价不学习 2506.05735v1
  • 944 06-06 Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations Analyse der Wissensgrenzen-Kognition von LLMs über Sprachen hinweg durch die Linse interner Repräsentationen 从内部表征的视角分析LLM跨语言的知识边界认知 2504.13816v2
  • 945 06-06 Large Language Models are Good Relational Learners Große Sprachmodelle sind gute relationale Lernende 大语言模型是好关系学习者 2506.05725v1
  • 946 06-06 CoIR: A Comprehensive Benchmark for Code Information Retrieval Models CoIR: Ein umfassender Benchmark für Code-Informations-Retrieval-Modelle CoIR: 代码信息检索模型综合基准 2407.02883v3
  • 947 06-06 Structure Guided Large Language Model for SQL Generation Struktur Geführtes großes Sprachmodell für SQL-Generierung SQL 生成引导大语言模式 2402.13284v3
  • 948 06-06 A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models Eine Umfrage über Sparse Autoencoder: Interpretation der internen Mechanismen großer Sprachmodelle 粗略自动编码器调查:解释大语言模式内部机制 2503.05613v2
  • 949 06-06 RKEFino1: A Regulation Knowledge-Enhanced Large Language Model RKEFino1: Ein normatives, wissensgestütztes, großes Sprachmodell RKEFino1:知识强化大语言规范模式 2506.05700v1
  • 950 06-06 Generalizable LLM Learning of Graph Synthetic Data with Reinforcement Learning Allgemeines LLM-Lernen von Graphen-Synthetischen Daten mit Verstärkungs-Lernen 利用强化学习学习图形合成数据学习 2506.00845v2
  • 951 06-06 Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework Progressiv stark sein! Erweiterung der Wissensdestillation von großen Sprachmodellen durch einen Curriculum Learning Framework 通过课程学习框架加强大语言模式的知识提炼 2506.05695v1
  • 952 06-06 Direct Behavior Optimization: Unlocking the Potential of Lightweight LLMs Direct Behavior Optimization: Entsperren des Potenzials leichter LLMs 直接行为优化:释放轻度LMT的潜力 2506.06401v1
  • 953 06-06 When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation Wann in RAG Graphen verwendet werden: Eine umfassende Analyse für die graphisch retrieval-augmentierte Generation 何时使用 RAG 中的图表 : 图表检索- 增强的一代的综合分析 。 2506.05690v1
  • 954 06-06 A Unified Representation for Continuity and Discontinuity: Syntactic and Computational Motivations Eine einheitliche Darstellung für Kontinuität und Diskontinuität: Syntaktische und rechnerische Motivationen 连续和中断统一代表:协同和计算模式 2506.05686v1
  • 955 06-06 GuessBench: Sensemaking Multimodal Creativity in the Wild GuessBench: Sensemaking Multimodale Kreativität in der Wildnis GuessBench: 真实场景中多模态创造力的意义建构 2506.00814v2
  • 956 06-06 Data Swarms: Optimizable Generation of Synthetic Evaluation Data Datenschwärme: Optimierbare Generierung synthetischer Auswertungsdaten Data Swarms: 最佳生成合成评价数据 2506.00741v2
  • 957 06-06 TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages TMT: Drei-Modalen-Übersetzung zwischen Sprache, Bild und Text durch Verarbeitung verschiedener Modalitäten als verschiedene Sprachen TMT: 以不同语言处理不同方式的语音、图像和文本之间的三式翻译 2402.16021v2
  • 958 06-06 WER We Stand: Benchmarking Urdu ASR Models WER We Stand: Benchmarking Urdu ASR Modelle WER We Stand:乌尔都语ASR模型基准测试 2409.11252v3
  • 959 06-06 Multi-Agent Collaboration via Cross-Team Orchestration Multi-Agenten-Zusammenarbeit über Cross-Team-Orchestrierung 通过跨团队管弦化多机构协作 2406.08979v2
  • 960 06-06 Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective Dissecting Bias in LLMs: Eine Perspektive der mechanistischen Interpretierbarkeit 剖析LLM中的偏见:机制可解释性视角 2506.05166v2
  • 961 06-06 Contextually Guided Transformers via Low-Rank Adaptation Kontextuell geführte Transformer über Low-Rank-Anpassung 通过低辐射适应实现上下文向导变形器 2506.05672v1
  • 962 06-06 Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning Low-Resource-Domänenanpassung für Sprach-LLMs via Text-Only Fine-Tuning 通过纯文本微调调整的语音磁性LLMs的低资源域适应 2506.05671v1
  • 963 06-06 Can LLMs Express Personality Across Cultures? Introducing CulturalPersonas for Evaluating Trait Alignment Können LLMs ihre Persönlichkeit über Kulturen hinweg ausdrücken? Einführung von CulturalPersonas zur Bewertung der Trait-Ausrichtung LLM能否跨文化表达人格? 引入CulturalPersonas来评估特质对齐 2506.05670v1
  • 964 06-06 BAQ: Efficient Bit Allocation Quantization for Large Language Models BAQ: Effiziente Bitallokation Quantisierung für große Sprachmodelle BAQ: 大语言模式的高效比分配比量量化 2506.05664v1
  • 965 06-06 Instructor-Worker Large Language Model System for Policy Recommendation: a Case Study on Air Quality Analysis of the January 2025 Los Angeles Wildfires Instructor-Worker Large Language Model System for Policy Empfehlung: Eine Fallstudie zur Luftqualitätsanalyse der Januar 2025 Los Angeles Wildfires 政策建议:2025年1月洛杉矶野火空气质量分析个案研究 2503.00566v2
  • 966 06-06 Where is the signal in tokenization space? Wo ist das Signal im Tokenisationsraum? 象征空间的信号在哪里? 2408.08541v2
  • 967 06-06 Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones Projektierbare Modelle: One-Shot-Generation von kleinen Spezialtransformatoren aus großen Modellen 集成项目模型:大型一型小型专门化变形器的单制制式 2506.05641v1
  • 968 06-05 (4) A Fictional Q&A Dataset for Studying Memorization and Knowledge Acquisition Ein fiktiver Q&A-Datensatz für das Studieren von Memorization und Knowledge Acquisition 用于研究记忆化和知识获取的字形 A 数据集 2506.05639v1
  • 969 06-05 IYKYK: Using language models to decode extremist cryptolects IYKYK: Mit Sprachmodellen extremistische Kryptolekte entschlüsseln IYKYK:使用语言模型解码极端主义隐语 2506.05635v1
  • 970 06-05 Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs Selbstaufmerksamkeit für Input-Dependent Soft Prompting in LLMs 在LLMM中利用输入- 依赖软提示自控 2506.05629v1
  • 971 06-05 Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency Grundlegende Grenzen der schnellen Tuning Transformer: Universalität, Kapazität und Effizienz 迅速提款变换者的根本限制:普遍性、能力和效益 2411.16525v2
  • 972 06-05 Deployability-Centric Infrastructure-as-Code Generation: An LLM-based Iterative Framework Deployability-Centric Infrastructure-as-Code Generation: Ein LLM-basiertes Iteratives Framework 以LLM为基础的迭代框架 2506.05623v1
  • 973 06-05 Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning Reasoning Towards Fairness: Bias in Sprachmodellen durch reasoning-guided Fine-Tuning abmildern 追求公平的理由:通过有理、有理、有指导的微调减少语言模式中的偏见 2504.05632v3
  • 974 06-05 Adversarial Tokenization Adversariale Tokenisierung 对抗感应器化 2503.02174v2
  • 975 06-05 Mitigating Confounding in Speech-Based Dementia Detection through Weight Masking Abmilderung von Confounding bei sprachbasierter Demenzerkennung durch Gewichtsmaskierung 通过权重掩码缓解基于语音的痴呆症检测中的混杂 2506.05610v1
  • 976 06-05 OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation OPeRA: Ein Datensatz von Beobachtung, Persona, Rationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation OPeRA: 用于评估LLM人类在线购物行为模拟的观察、角色、理由和行动数据集 2506.05606v1
  • 977 06-05 Prompting is Not All You Need! Evaluating LLM Agent Simulation Methodologies with Real-World Online Customer Behavior Data Prompting ist nicht alles, was Sie brauchen! Bewertung LLM Agent Simulation Methoden mit Real-World Online Kunden Verhalten Daten 提示并非你所需要的全部! 使用真实世界在线客户行为数据评估LLM代理模拟方法 2503.20749v6
  • 978 06-05 SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs SynthesizeMe! Inducing Persona-Guided Prompts für personalisierte Belohnung Modelle in LLMs 合成Me! LLM 中个性奖得奖模型的引人指导提示 2506.05598v1
  • 979 06-05 SoK: Are Watermarks in LLMs Ready for Deployment? SoK: Sind Wasserzeichen in LLMs bereit für den Einsatz? SoK:LLMS中的水印准备部署了吗? 2506.05594v1
  • 980 06-05 UTSA-NLP at ArchEHR-QA 2025: Improving EHR Question Answering via Self-Consistency Prompting UTSA-NLP bei ArchEHR-QA 2025: Verbesserung der EHR-Fragebeantwortung durch Selbstbeherrschung ArchEHR-QA 2025:通过自我协调促进改进EHR问题的答案 2506.05589v1
  • 981 06-05 MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark MMTU: Ein massiver Multi-Task-Tabellen-Verstands- und Orientierungs-Benchmark MMTU: 大规模多任务表格理解和说明理由基准 2506.05587v1
  • 982 06-05 Combating Misinformation in the Arab World: Challenges & Opportunities Bekämpfung von Fehlinformationen in der arabischen Welt: Herausforderungen und Chancen 打击阿拉伯世界的错误信息:挑战和机遇 2506.05582v1
  • 983 06-05 TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking TituLLMs: Eine Familie von Bangla LLMs mit umfassendem Benchmarking TituLLMs:一个具有全面基准测试的孟加拉语LLM系列 2502.11187v3
  • 984 06-05 Natural Language Interaction with Databases on Edge Devices in the Internet of Battlefield Things Natürliche Sprachinteraktion mit Datenbanken auf Edge Devices im Internet von Battlefield Things 自然语言与战场物体互联网边缘装置数据库的互动 2506.06396v1
  • 985 06-05 The Impact of Inference Acceleration on Bias of LLMs Die Auswirkungen von Inferenzbeschleunigung auf LLM-Bias 推推加速对LLMM的偏见的影响 2410.22118v3
  • 986 06-05 Improving LLMs with a knowledge from databases Verbesserung von LLMs mit Kenntnissen aus Datenbanken 利用数据库知识改进LLMs 2506.05560v1
  • 987 06-05 MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning MORSE-500: Ein programmierbarer Video-Benchmark zum Stress-Test Multimodale Reasoning MORSE-500:压力测试多模式理由方案可控制的视频基准 2506.05523v1
  • 988 06-05 Investigating Non-Transitivity in LLM-as-a-Judge Untersuchung der Nichtübertragbarkeit in LLM-as-a-Richter 调查LLM-as-A法官的非过渡性 2502.14074v3
  • 989 06-05 Multidimensional Analysis of Specific Language Impairment Using Unsupervised Learning Through PCA and Clustering Multidimensionale Analyse der spezifischen Sprachbeeinträchtigung durch unbeaufsichtigtes Lernen durch PCA und Clustering 利用无监督学习,通过PCA和聚类,对特定语言障碍进行多维分析 2506.05498v1
  • 990 06-05 Detect, Explain, Escalate: Low-Carbon Dialogue Breakdown Management for LLM-Powered Agents Detect, Explain, Escalate: Low-Carbon Dialogue Breakdown Management für LLM-Powered Agents 检测、解释、埃斯卡拉拉特:LLM强力剂的低碳对话框分解管理 2504.18839v2
  • 991 06-05 Inference-Time Hyper-Scaling with KV Cache Compression Inferenz-Zeit-Hyper-Skalierung mit KV-Cache-Kompression 与 KV 缓存压缩一起的推断时间超大缩放 2506.05345v1
  • 992 06-05 Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets Warum LLM Sicherheits-Guardrails nach Feinabstimmung zusammenbrechen: Eine Ähnlichkeitsanalyse zwischen Alignment- und Feinabstimmungs-Datensätzen 为何LLM 安全护卫车在微调后倒塌:对准和微调数据集之间的相似性分析 2506.05346v1
  • 993 06-05 Search Arena: Analyzing Search-Augmented LLMs Sucharena: Analyse von Such-Augmentierten LLMs 搜索竞技场: 分析搜索推荐LMS 2506.05334v1
  • 994 06-05 AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning AceReason-Nemotron: Förderung von Mathematik- und Code-Reasoning durch Reinforcement Learning AceReason-Nemotron:通过强化学习推进数学和代码推理 2505.16400v3
  • 995 06-05 Unleashing Hour-Scale Video Training for Long Video-Language Understanding Unleashing Stunden-Scale Video-Training für lange Video-Sprache verstehen 为了解长视视频语言而开放的时空视频培训 2506.05332v1
  • 996 06-05 The broader spectrum of in-context learning Das breitere Spektrum des In-Context-Lernens 广义的内通学习 2412.03782v3
  • 997 06-05 MLLM-CL: Continual Learning for Multimodal Large Language Models MLLM-CL: Kontinuierliches Lernen für multimodale große Sprachmodelle MLLM-CL:多模式大语言模式持续学习 2506.05453v1
  • 998 06-05 Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety Interpretation trifft auf Sicherheit: Eine Umfrage zu Interpretationsmethoden und Tools zur Verbesserung der LLM-Sicherheit 口译满足安全需要:关于改进LLM安全的解释方法和工具的调查 2506.05451v1
  • 999 06-05 Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay Verbesserung der Dateneffizienz für LLM-Verstärkung Feinabstimmung durch problemorientierte Online-Datenauswahl und Rollout-Replay 提高LLLM数据效率,通过困难目标在线数据选择和推出重播提高LLM强化微调的数据效率 2506.05316v1
  • 1000 06-05 Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models Eingeschränktes entropisches Verlernen: Ein Primal-Dual-Framework für große Sprachmodelle 约束熵遗忘:大语言模型的原始-对偶框架 2506.05314v1
  • 1001 06-05 Is LLM the Silver Bullet to Low-Resource Languages Machine Translation? Ist LLM die Silver Bullet zu Low-Resource Sprachen Maschinelle Übersetzung? LLM 银弹到低资源语言机器翻译吗? 2503.24102v2
  • 1002 06-05 Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games Time to Talk: LLM-Agenten für asynchrone Gruppenkommunikation in Mafia-Spielen 讨论时间:黑手党运动会Asynconomic Group通讯的LLM代理商 2506.05309v1
  • 1003 06-05 ProRefine: Inference-time Prompt Refinement with Textual Feedback ProRefine: Inferenz-Zeit Prompt Verfeinerung mit Text-Feedback ProRefine: 用文字反馈迅速改进推论-时间 2506.05305v1
  • 1004 06-05 ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL ReasonGen-R1: CoT für Autoregressive Imagegenerierungsmodelle durch SFT und RL 理由Gen-R1:通过SFT和RL自动递减图像生成模型的 CoT 2505.24875v2
  • 1005 06-05 GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering GRAF: Graph Retrieval Augmented by Facts for Rumänian Legal Multi-Choice Question Answering GRAF: 罗马尼亚多种选择法律问题解答事实加增图 2412.04119v3
  • 1006 06-05 Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning Micro-Act: Wissenskonflikt bei der Fragebeantwortung durch handlungsfähige Selbstbesinnung abmildern 微行动:通过可采取行动的自觉反应解决问题时减少知识冲突 2506.05278v1
  • 1007 06-05 DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts DEFAME: Dynamic Evidence-based FAct-Checking mit multimodalen Experten DEFAME: 借助多模态专家进行基于证据的动态事实核查 2412.10510v3
  • 1008 06-05 CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection CLATTER: Umfassendes Entailment-Reasoning zur Halluzinationserkennung CLATTER: 用于幻觉检测的全面蕴含推理 2506.05243v1
  • 1009 06-05 Towards a Unified System of Representation for Continuity and Discontinuity in Natural Language Hin zu einem einheitlichen System der Repräsentation für Kontinuität und Diskontinuität in der Natursprache 迈向一个统一的自然语言连续和中断代表制 2506.05235v1
  • 1010 06-05 From Rogue to Safe AI: The Role of Explicit Refusals in Aligning LLMs with International Humanitarian Law Von der Rogue zur sicheren KI: Die Rolle der expliziten Ablehnungen bei der Ausrichtung von LLMs mit dem humanitären Völkerrecht 从流氓到安全AI:明确拒绝者在使LLMs符合国际人道主义法方面的作用 2506.06391v1
  • 1011 06-05 MesaNet: Sequence Modeling by Locally Optimal Test-Time Training MesaNet: Sequenzmodellierung durch lokal optimale Test-Time-Schulung MesaNet:通过当地最佳试验时间培训进行序列建模 2506.05233v1
  • 1012 06-05 Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts Diagonales Batching löst Parallelismus in recurrenten Speichertransformatoren für lange Kontexte auf 对角对角拔 解锁长期常用内存变换器中的平行主义 2506.05229v1
  • 1013 06-05 Improving Low-Resource Morphological Inflection via Self-Supervised Objectives Verbesserung der ressourcenarmen morphologischen Beugung durch selbstüberwachte Ziele 通过自我监督目标改进低资源资源 道德影响 2506.05227v1
  • 1014 06-05 UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation UniWorld-V1: Semantische Encoder mit hoher Auflösung für einheitliches visuelles Verständnis und Generation UniWorld-V1:用于统一视觉理解和生成的高分辨率语义编码器 2506.03147v3
  • 1015 06-05 The Lessons of Developing Process Reward Models in Mathematical Reasoning Die Lehren aus der Entwicklung von Prozess-Reward-Modellen in mathematischer Reasoning 数学理由中发展进程奖励模型的经验教训 2501.07301v2
  • 1016 06-05 Mitigating Degree Bias Adaptively with Hard-to-Learn Nodes in Graph Contrastive Learning Degree Bias im Graph Contrastive Learning adaptiv mit schwer zu erlernenden Knoten abmildern 与图表对比学习中难以识别的节点相适应 2506.05214v1
  • 1017 06-05 LLM-First Search: Self-Guided Exploration of the Solution Space LLM-First Search: Selbstgeführte Erkundung des Lösungsraums LLM-第一次搜索:自导探索解决办法空间 2506.05213v1
  • 1018 06-05 CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks CheckEmbed: Effektive Überprüfung von LLM-Lösungen auf offene Aufgaben 复选对象:有效核查对不限名额任务LLM解决方案的有效核查 2406.02524v4
  • 1019 06-05 The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text The Common Pile v0.1: Ein 8TB-Datensatz von Public Domain und Openly Licensed Text The Common Pile v0.1:公共领域和公开许可文本的 8TB 数据集 2506.05209v1
  • 1020 06-05 MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration MMBoundary: MLLM-Wissensgrenzen-Bewusstsein durch vernünftige Schritt-Vertrauens-Kalibrierung MMBoundary:通过合理步骤信任校准提高MLLM知识边界认识 2505.23224v2
  • 1021 06-05 RELIC: Evaluating Compositional Instruction Following via Language Recognition RELIC: Bewertung der kompositorischen Instruktion über Spracherkennung RELIC:评估通过语言承认进行的组成说明 2506.05205v1
  • 1022 06-05 DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models DREAM: Entwirren von Risiken zur Verbesserung der Sicherheitsausrichtung in multimodalen großen Sprachmodellen DREAM:消除风险,加强多式联运大语言模式中的安全协调 2504.18053v2
  • 1023 06-05 Counterfactual reasoning: an analysis of in-context emergence Gegenfaktische Argumentation: eine Analyse der Entstehung von Inkontexten 反事实推理:对内源出现的分析 2506.05188v1
  • 1024 06-05 Multi-Head RAG: Solving Multi-Aspect Problems with LLMs Multi-Head RAG: Lösung von Multi-Aspect-Problemen mit LLMs 多方主管RAG:解决多领域问题与LLM 2406.05085v3
  • 1025 06-05 Can Large Language Models Understand Intermediate Representations in Compilers? Können große Sprachmodelle Zwischendarstellungen in Compilern verstehen? 大语言模式能理解《汇编者》的中间代表吗? 2502.06854v2
  • 1026 06-05 SNaRe: Domain-aware Data Generation for Low-Resource Event Detection SNaRe: Domain-aware Datengenerierung für Low-Resource-Ereigniserkennung SNaRe: 用于低资源事件探测的域觉数据生成 2502.17394v2
  • 1027 06-05 ValueSim: Generating Backstories to Model Individual Value Systems ValueSim: Erzeugung von Backstories zum Modell individueller Wertsysteme 价值模拟: 生成个人价值系统模型的备份 2505.23827v2
  • 1028 06-05 Explainability in Practice: A Survey of Explainable NLP Across Various Domains Erklärbarkeit in der Praxis: Eine Übersicht über erklärbare NLP über verschiedene Domains 实践中的可解释性:关于各种领域可解释的《国家劳工调查》的调查 2502.00837v2
  • 1029 06-05 Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation Knowledgeable-r1: Politikoptimierung für Wissensexploration in der retrieval-generierten Generation 可知识-r1:在回溯性回溯性养殖中知识探索的政策优化 2506.05154v1
  • 1030 06-05 CIVET: Systematic Evaluation of Understanding in VLMs CIVET: Systematische Bewertung des Verständnisses in VLMs CIVET: 对VLMs理解能力的系统评估 2506.05146v1
  • 1031 06-05 AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models AudioLens: Ein genauerer Blick auf auditory Attribut Wahrnehmung von großen Audio-Sprachen-Modellen 音频路程:更仔细地审视大型音频语言模型的听觉特性 2506.05140v1
  • 1032 06-05 Information Locality as an Inductive Bias for Neural Language Models Informationslokalität als induktive Bias für neurale Sprachmodelle 信息地点作为神经语言模型的感性偏见 2506.05136v1
  • 1033 06-05 CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature CHIMERA: Eine Wissensbasis der Ideenrekombination in der wissenschaftlichen Literatur CHIMERA:科学文献中思想再融合的知识库 2505.20779v3
  • 1034 06-05 DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning DiCoRe: Erweitern der Null-Shot-Erkennung durch Divergent-Convergent LLM Reasoning DiCoRe: 通过差异-说服者LLM 合理性,加强零射事件探测 2506.05128v1
  • 1035 06-05 LLM Social Simulations Are a Promising Research Method LLM Sozialsimulationen sind eine vielversprechende Forschungsmethode LLM 社会模拟是一种有希望的研究方法 2504.02234v2
  • 1036 06-05 The NTNU System at the S&I Challenge 2025 SLA Open Track Das NTNU-System bei der S&I Challenge 2025 SLA Open Track S&I挑战2025 SLA公开赛道的NTNU系统 2506.05121v1
  • 1037 06-05 OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction OmniCharacter: Auf dem Weg zu immersiven Rollenspiel-Agenten mit nahtloser Sprach-Persönlichkeits-Interaktion OmniCharacter:争取用无缝无言语-语言个性交互作用来模拟角色扮演剂 2505.20277v2
  • 1038 06-05 CL-ISR: A Contrastive Learning and Implicit Stance Reasoning Framework for Misleading Text Detection on Social Media CL-ISR: Ein kontrastives Lern- und Implizit-Stance-Reasoning-Framework für irreführende Texterkennung in sozialen Medien CL-ISR: 社交媒体错误领导层文字探测错误文本的矛盾学习和隐含理由依据框架 2506.05107v1
  • 1039 06-05 The Impossibility of Fair LLMs Die Unmöglichkeit fairer LLMs 公平LLMs的不可能性 2406.03198v2
  • 1040 06-05 Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation Bias in Language Models: Jenseits von Tricktests und hin zu RUTEd-Evaluierung 语言模式中的偏见:超越欺骗性测试和争取公正评价 2402.12649v3
  • 1041 06-05 Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics 以人为中心的街头评估可解释的多模式框架:将视觉语言模型纳入感知城市诊断 2506.05087v1
  • 1042 06-05 Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin Parken, Wahrnehmung und Einzelhandel: Street-Level Determinanten der gemeinschaftlichen Vitalität in Harbin 泊车、观感和零售:哈尔滨社区生命的街道一级决定因素 2506.05080v1
  • 1043 06-05 Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation Nur ein Kratzer: Verbesserung der LLM-Fähigkeiten für die Selbst-Schaden-Erkennung durch Intent Differentiation und Emoji-Interpretation 仅仅一个缩略图:通过内在差别和Emoji解释,提高自残检测的LLM能力 2506.05073v1
  • 1044 06-05 RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation RIVAL: Verstärktes Lernen mit iterativer und adversarieller Optimierung für maschinelle Übersetzung RIVAL: 机械翻译的循环和逆向优化强化学习 2506.05070v1
  • 1045 06-05 Leveraging LLMs for Bangla Grammar Error Correction: Error Categorization, Synthetic Data, and Model Evaluation Leveraging LLMs für Bangla Grammatik Fehlerkorrektur: Error Kategorisierung, Synthetische Daten und Modellbewertung Bangla语法错误校正:错误分类、合成数据和模型评价 2406.14284v2
  • 1046 06-05 GoRA: Gradient-driven Adaptive Low Rank Adaptation GoRA: Gradient-getriebene Adaptive Low-Rank-Anpassung GoRA:梯度驱动的自适应低秩适应 2502.12171v2
  • 1047 06-05 Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation Debattierbare Intelligenz: Benchmarking der LLM-Richter durch Debatte Sprachbewertung 可辩驳的情报:通过辩论演说评价确定LLM法官的基准 2506.05062v1
  • 1048 06-05 Argument Summarization and its Evaluation in the Era of Large Language Models Argumentationszusammenfassung und ihre Bewertung im Zeitalter der großen Sprachmodelle 在 “ 大语言模式时代 “ 中的论点概述及其评价 2503.00847v3
  • 1049 06-05 TALL – A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages TALL – Eine praktikable Architektur zur Verbesserung der LLM-Leistung in ressourcenarmen Sprachen TALL – – 提高低资源语言LLM性能的训练建筑 2506.05057v1
  • 1050 06-05 Optimizing Anytime Reasoning via Budget Relative Policy Optimization Optimierung jederzeit über Budget-Relational-Policy-Optimierung 通过预算相对政策优化优化,优化任何时间的理由 2505.13438v2
  • 1051 06-05 Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers Automatische Robustheits-Stress-Prüfung von LLMs als mathematische Problemlöser 对作为数学问题解答器的LLMS进行自动强力压力测试 2506.05038v1
  • 1052 06-05 GRU: Mitigating the Trade-off between Unlearning and Retention for LLMs GRU: Abbau des Kompromisses zwischen Unlearning und Retention für LLMs GRU:减少LLMs的不学习与保留之间的取舍 2503.09117v3
  • 1053 06-05 Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks Selbstkorrektion ist mehr als Verfeinerung: Ein Lernrahmen für visuelle und sprachliche Aufgaben 自我校正不仅仅是改进:视觉和语言原因说明任务学习框架 2410.04055v3
  • 1054 06-05 Controlling Summarization Length Through EOS Token Weighting Kontrolle der Zusammenfassung Länge durch EOS Token Gewichtung 控制通过 EOS 过 EOS 键权重加权的控控总和长度 2506.05017v1
  • 1055 06-05 ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development ComfyUI-Copilot: Ein intelligenter Assistent für automatisierte Workflow-Entwicklung ComfyUI-副驾驶员:一名负责自动工作流程开发的智能助理 2506.05010v1
  • 1056 06-05 Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity 1568 Tokens in einen einzigen Vektor und wieder zurück krammen: Die Grenzen der Einbettung von Raumkapazität erkunden 将1568吨撞成单一矢量和后向:探索嵌入空间能力的极限 2502.13063v2
  • 1057 06-05 Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs) Position ist Macht: Systemprompts als Mechanismus von Bias in großen Sprachmodellen (LLMs) 位置是电源:系统提示作为大语言模型比阿语机制(LLMs) 2505.21091v2
  • 1058 06-05 SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View SCOP: Bewertung des Verständnisprozesses von großen Sprachmodellen aus einer kognitiven Ansicht SCOP: 从认知的角度评估大语言模型的全纳进程 2506.05000v1
  • 1059 06-05 Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings Auf dem Weg zu einem effizienten visuellen Dokumenten-Retrieval: Eine empirische Studie zur Reduzierung von Patch-Level-Einbindungen 迈向储存-有效视觉文件检索:减少定级嵌入的经验研究 2506.04997v1
  • 1060 06-05 Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons Knockout LLM Assessment: Verwendung großer Sprachmodelle für Bewertungen durch iterative Pairwise-Vergleiche Knockout LLM 评估:利用大语言模式,通过迭代成对比较进行评估 2506.03785v2
  • 1061 06-05 Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering Besseres semi-überwachtes Lernen für Multi-Domain ASR durch inkrementelle Umschulung und Datenfilterung 通过递增再培训和数据过滤,为多领域ASR更好地进行半监督的半监督学习 2506.04981v1
  • 1062 06-05 EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving EvaLearn: Quantifizierung der Lernfähigkeit und Effizienz von LLMs durch sequentielle Problemlösung EvaLearn:通过按顺序解决问题,量化LLMs的学习能力和效率 2506.02672v2
  • 1063 06-05 BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models BYO-Eval: Erstellen Sie Ihren eigenen Datensatz für eine feinkörnige visuelle Bewertung multimodaler Sprachmodelle BYO-Eval:建立你自己的数据集,以便对多模式语言模式进行精美的视觉评估 2506.05440v1
  • 1064 06-05 From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation Vom Kampf (06-2024) zur Meisterschaft (02-2025) LLMs erobern fortgeschrittene Algorithmenprüfungen und ebnen den Weg für die redaktionelle Generation 从斗争(06-2024)到掌握(02-2025):LLMs攻克高级算法考试并为题解生成铺平道路 2506.04965v1
  • 1065 06-05 Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents Auf dem Weg zu einem menschenzentrierten Bewertungsrahmen für vertrauenswürdige LLM-Powered-GUI-Agenten 争取为具有信誉的LLM授权的用户代表建立一个以人为中心的评价框架 2504.17934v2
  • 1066 06-05 Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models Adaptive Jailbreaking-Strategien basierend auf dem semantischen Verständnis von Fähigkeiten großer Sprachmodelle 基于大语言模型的语义理解能力 2505.23404v2
  • 1067 06-05 Towards Robust ESG Analysis Against Greenwashing Risks: Aspect-Action Analysis with Cross-Category Generalization Auf dem Weg zu robuster ESG-Analyse gegen Greenwashing-Risiken: Aspekt-Action-Analyse mit Cross-Category-Verallgemeinerung 致力于对绿色清洗风险进行强有力的环境SG分析:跨类别通用化的外观行动分析 2502.15821v2
  • 1068 06-05 LLMs Can Compensate for Deficiencies in Visual Representations LLMs können Defizite in der visuellen Darstellung kompensieren LLMs 可以弥补视觉表征的缺陷 2506.05439v1
  • 1069 06-05 Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation Strategische Markteinblicke mit großen Sprachmodellen ableiten: Ein Benchmark für die vorausschauende kontrafaktische Generation 具有大语言模式的战略市场展望:前瞻性反实际生成基准 2505.19430v2
  • 1070 06-05 Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback Simulation von LLM-zu-LLM-Tutoring für mehrsprachiges Mathe-Feedback 模拟多种语言数学反馈LLM至LLM教学 2506.04920v1
  • 1071 06-05 A Practitioner’s Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic Leitfaden für Praktiker zum Aufbau von ASR-Modellen für Low-Resource-Sprachen: Eine Fallstudie zu schottischem Gälisch 《建立低资源语言ASR模式实践者指南:苏格兰盖尔语案例研究》 2506.04915v1
  • 1072 06-05 In-context Language Learning for Endangered Languages in Speech Recognition Im Zusammenhang mit dem Sprachenlernen für gefährdete Sprachen in der Spracherkennung 在语音识别中为濒危语言进行内通语言学习 2505.20445v3
  • 1073 06-05 MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining MiMo: Entsperren des vernünftigen Potenzials des Sprachmodells – Von der Vorschulung zur Nachschulung 米莫:释放语文模式的理性潜力 – – 从预培训到员额培训 2505.07608v2
  • 1074 06-05 Dissecting Long Reasoning Models: An Empirical Study Modelle der langen Vernunft zu entschlüsseln: Eine empirische Studie 解析长期理由模型:经验研究 2506.04913v1
  • 1075 06-05 When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models Wenn denkende LLMs lügen: Enthüllen der strategischen Täuschung in Repräsentationen von Vernunftmodellen 当思考型LLMs说谎:揭示推理模型表征中的策略性欺骗 2506.04909v1
  • 1076 06-05 ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests ICPC-Eval: Mit wettbewerbsfähigen Programmierwettbewerben die Grenzen der LLM-Vernunft ebnen ICPC-Eval:以竞争性方案拟订竞赛为理由,探索LLM的前沿 2506.04894v1
  • 1077 06-05 Full-Parameter Continual Pretraining of Gemma2: Insights into Fluency and Domain Knowledge Full-Parameter Continual Pretraining von Gemma2: Einblicke in Fluency und Domain Knowledge Gemma2:深入了解流能和广域知识 2505.05946v2
  • 1078 06-05 Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies Bewertung der Wirksamkeit sprachlicher Kenntnisse in vorgebildeten Sprachmodellen: Eine Fallstudie universeller Abhängigkeiten 评价在未受过训练的语言模式中语言知识的有效性:对普遍依赖性的个案研究 2506.04887v1
  • 1079 06-05 NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark NorEval: Ein Benchmark für das Verständnis und die Bewertung der norwegischen Sprache NorEval:挪威语言理解和世代评价基准 2504.07749v2
  • 1080 06-05 Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition Bewertung von LLMs in Speech ist oft fehlerhaft: Testset-Kontaminierung in großen Sprachmodellen für die Spracherkennung 对语音中LLMs的评价经常是有缺陷的:语音识别大语言模型中的测试集污染 2505.22251v2
  • 1081 06-05 MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost MuLan: Multilinguale Diffusionsmodelle für Hunderte von Sprachen mit vernachlässigbaren Kosten anpassen MuLan:为数百种具有可忽略费用的语言调整多语言传播模式 2412.01271v2
  • 1082 06-05 Prompting LLMs: Length Control for Isometric Machine Translation Prompting LLMs: Längensteuerung für isometrische maschinelle Übersetzung LLM: 遥测机器翻译的长度控制 2506.04855v1
  • 1083 06-05 Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights Multiple-Choice-Frage-Generierung mit großen Sprachmodellen: Methodologie und Erzieher-Insights 使用大语言模式生成多选择问题:方法和教育者透视 2506.04851v1
  • 1084 06-05 When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits Wenn Ansprüche entstehen: Bewertung und Verbesserung der Robustheit von Einbettungsmodellen gegen Fehlinformations-Edits 索赔演变时:评价和加强反错误信息嵌入模型的威力 2503.03417v3
  • 1085 06-05 MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines MockConf: A Student Interpretation Dataset: Analysis, Word- und Span-Level Alignment und Baselines MockConf:学生解释数据集:分析、单词和泛层对齐和基线 2506.04848v1
  • 1086 06-05 Evaluating Morphological Compositional Generalization in Large Language Models Bewertung der morphologischen Kompositionsverallgemeinerung in großen Sprachmodellen 评价大语言模式中人种构成的概括化 2410.12656v4
  • 1087 06-05 Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models Gemeinsame Bewertung der Antwort- und Begründungskonsistenz für Halluzinationserkennung in großen Vernunftmodellen 对在大理由模型中发现幻幻幻剂的答案和理由一致性的联合评价 2506.04832v1
  • 1088 06-05 From EHRs to Patient Pathways: Scalable Modeling of Longitudinal Health Trajectories with LLMs Von EHRs zu Patientenpfaden: Skalierbare Modellierung von langitudinalen Gesundheits-Trajektorien mit LLMs 从EHRs到患者途径:可扩展的长纵向健康轨迹模型与LLMs 2506.04831v1
  • 1089 06-05 FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models FlowCut: Redundanz über Informationsfluss für effiziente Vision-Sprachenmodelle neu denken 流程:通过信息流动重新思考通过信息流动实现高效愿景-语言模型的冗余 2505.19536v2
  • 1090 06-05 A Reasoning-Based Approach to Cryptic Crossword Clue Solving Ein vernünftiger Ansatz zur kryptischen Kreuzworträtsellösung 以基于理性的方法解决加密口号的谜题 2506.04824v1
  • 1091 06-05 Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms Bewertung von Vision-Sprachen und großen Sprachmodellen für die automatisierte Beurteilung von Studierenden in indonesischen Klassenräumen 评价印度尼西亚教室学生自动评估的视觉语言和大语言模式 2506.04822v1
  • 1092 06-05 Design of intelligent proofreading system for English translation based on CNN and BERT Entwurf eines intelligenten Korrekturlesesystems für die englische Übersetzung basierend auf CNN und BERT 设计基于CNN和BERT的英译文智能校对系统 2506.04811v1
  • 1093 06-05 Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study Dissecting Logical Reasoning in LLMs: Eine feinkörnige Bewertungs- und Aufsichtsstudie 在LLMs中解剖逻辑推理:精细评价和监督研究 2506.04810v1
  • 1094 06-05 Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction Weniger Halluzinationen, mehr Verifizierung: Ein dreistufiges LLM-basiertes Framework für die ASR-Fehlerkorrektur 较少的幻觉,更多的核查:基于ASR的三级LLM框架错误更正 2505.24347v2
  • 1095 06-05 Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques Auf dem Weg zur multimodalen Fusion LLM-Centric: Eine Studie über Integrationsstrategien und -techniken 走向LLM-Centric Multimods 融合:关于一体化战略和技术的调查 2506.04788v1
  • 1096 06-05 MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark MMSU: Ein massiver Multi-Task Spoken Language Understanding Benchmark MMSU: 大规模多任务口头语言理解和说明理由基准 2506.04779v1
  • 1097 06-05 Fine-Grained Interpretation of Political Opinions in Large Language Models Feinkörnige Interpretation politischer Meinungen in großen Sprachmodellen 以大语言模式对政治意见的精细解释 2506.04774v1
  • 1098 06-05 Distilling the Implicit Multi-Branch Structure in LLMs’ Reasoning via Reinforcement Learning Destillieren der impliziten Multi-Branch-Struktur in LLMs’ Reasoning durch Verstärkungslernen 通过强化学习,将LLMs的隐含多层结构提炼在“通过强化学习推理”中 2505.16142v2
  • 1099 06-05 GOLFer: Smaller LM-Generated Documents Hallucination Filter & Combiner for Query Expansion in Information Retrieval GOLFer: Kleinere LM-generierte Dokumente Halluzination Filter & Combiner zur Abfrageerweiterung in der Informationsaufzeichnung GOLFer: 用于信息检索中查询扩展的小型 LM - 光生文件 HALL 过滤器和组合器 2506.04762v1
  • 1100 06-05 Exp4Fuse: A Rank Fusion Framework for Enhanced Sparse Retrieval using Large Language Model-based Query Expansion Exp4Fuse: Ein Rank-Fusion-Framework für verbesserte Sparse-Retrieval unter Verwendung einer großsprachigen modellbasierten Abfrageerweiterung Exp4Fuse:使用基于大语言模型的查询扩展增强分散检索的排名组合框架 2506.04760v1
  • 1101 06-05 Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions Firm oder Fickle? Bewertung großer Sprachmodelle Konsistenz in sequenziellen Interaktionen 公司或Fickle?评估大语言模型在序列相互作用中的一致性 2503.22353v4
  • 1102 06-05 Focus On This, Not That! Steering LLMs with Adaptive Feature Specification Fokus auf diese, nicht das! Lenkung LLMs mit adaptiven Feature Specification 聚焦于此, 不是那个! 指导性带适应性特征规格的LLMS 2410.22944v4
  • 1103 06-05 The Role of Diversity in In-Context Learning for Large Language Models Die Rolle der Vielfalt im In-Context-Lernen für große Sprachmodelle 多样性在为大语言模式进行内文学习方面的作用 2505.19426v2
  • 1104 06-05 Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection Lebenslange Evolution: Kollaboratives Lernen zwischen großen und kleinen Sprachmodellen für kontinuierliche emergente Fake News Detection 终身演进:大型和小型语文模式合作学习,以不断发现假冒假消息 2506.04739v1
  • 1105 06-05 Scaling Trends in Language Model Robustness Skalierungstrends in der Robustheit von Sprachmodellen 语言模型强度的增缩趋势 2407.18213v5
  • 1106 06-05 FastDraft: How to Train Your Draft FastDraft: Wie Sie Ihren Entwurf trainieren 快速草稿:如何训练您的草稿 2411.11055v3
  • 1107 06-05 Coordinated Robustness Evaluation Framework for Vision-Language Models Koordinierte Robustheitsbewertungsrahmen für Vision-Language-Modelle 愿景-语言模型协调强力评价框架 2506.05429v1
  • 1108 06-05 The Vector Grounding Problem Das Vector Grounding Problem 矢量基定问题 2304.01481v2
  • 1109 06-05 SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat SPARTA ALIGNMENT: Kollektive Ausrichtung mehrerer Sprachmodelle durch Kampf SPARTA 签署:通过战斗集体调整多种语言模式 2506.04721v1
  • 1110 06-05 IIITH-BUT system for IWSLT 2025 low-resource Bhojpuri to Hindi speech translation IIITH-BUT System für IWSLT 2025 Low-Resource Bhojpuri zu Hindi Sprachübersetzung IWSLT 2025 低资源博杰普尔语到印地语语音翻译的IIITH-BUT系统 2506.04714v1
  • 1111 06-05 LLM-based phoneme-to-grapheme for phoneme-based speech recognition LLM-basiertes Phoneme-to-grapheme für Phoneme-basierte Spracherkennung 用于语音语音识别的LLM-基于LLM的电话机对电报 2506.04711v1
  • 1112 06-05 Accelerated Test-Time Scaling with Model-Free Speculative Sampling Beschleunigte Test-Zeit-Skalierung mit modellfreier Spekulativ-Sampling 加速测试时间缩放,采用无投机示范抽样 2506.04708v1
  • 1113 06-05 EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents EmbodiedBench: Umfassendes Benchmarking multimodaler Großsprachenmodelle für visionsgetriebene Embodied-Agenten EmbodiedBench:为视觉驱动的具身智能体全面评测多模态大语言模型 2502.09560v3
  • 1114 06-05 Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification Den Code knacken: Implizite Hass-Spracherkennung durch Coding-Klassifikation verbessern 打破《守则》:通过编码分类加强隐性仇恨言论探测 2506.04693v1
  • 1115 06-05 Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models Mind the Confidence Gap: Überbewusstsein, Kalibrierung und Distraktor-Effekte in großen Sprachmodellen 牢记信心差距:在大语言模式中过度自信、校准和扰动效应 2502.11028v2
  • 1116 06-05 Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models Recycling the Web: Eine Methode zur Verbesserung der Vorschulung von Daten Qualität und Menge für Sprachmodelle 网上再循环:提高语文模式培训前数据质量和数量的方法 2506.04689v1
  • 1117 06-05 MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models MMRefine: Enthüllen der Hindernisse zur robusten Veredelung in multimodalen großen Sprachmodellen MMRefine:克服阻碍大力改进多模式大语言模式的障碍 2506.04688v1
  • 1118 06-05 Locally Typical Sampling Lokal typische Probenahmen 本地典型抽样 2202.00666v6
  • 1119 06-05 Urania: Differentially Private Insights into AI Use Urania: Unterschiedliche private Einblicke in die KI-Nutzung Urania: 差异私人透视转化为 AI 使用 2506.04681v1
  • 1120 06-05 Normative Conflicts and Shallow AI Alignment Normative Konflikte und oberflächliches KI-Alignment 规范冲突和浅层 AI 对齐 2506.04679v1
  • 1121 06-05 Magic Mushroom: A Customizable Benchmark for Fine-grained Analysis of Retrieval Noise Erosion in RAG Systems Magic Mushroom: Ein anpassbarer Benchmark für die feinkörnige Analyse von Retrieval Noise Erosion in RAG-Systemen 魔术蘑菇:对RAG系统中的检索噪音侵蚀进行精细分析的定制基准 2506.03901v2
  • 1122 06-05 Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL Sollte ich Ihnen vertrauen? Detektion von Täuschung in Verhandlungen mit kontrafaktischen RL 利用反事实RL在谈判中发现欺骗行为 2502.12436v3
  • 1123 06-05 Propaganda and Information Dissemination in the Russo-Ukrainian War: Natural Language Processing of Russian and Western Twitter Narratives Propaganda und Informationsverbreitung im Russisch-Ukrainischen Krieg: Natürliche Sprachverarbeitung von russischen und westlichen Twitter Narratives Russo-Ukraineian战争中的宣传和信息传播:俄罗斯和西方推特上的自然语言处理 2506.01807v2
  • 1124 06-05 An Efficient Task-Oriented Dialogue Policy: Evolutionary Reinforcement Learning Injected by Elite Individuals Eine effiziente, auf Aufgaben ausgerichtete Dialogpolitik: Evolutionäres Stärkungslernen von Elite-Personen 高效的、以任务为导向的对话政策:精英个人注射的进化强化学习 2506.03519v2
  • 1125 06-05 Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering Detektionsmethode für die schnelle Einspritzung durch Integration vortrainierter Modell- und Heuristic Feature Engineering 纳入培训前模型和肥力地貌工程的快速注射快速注射检测方法 2506.06384v1
  • 1126 06-05 Inducing lexicons of in-group language with socio-temporal context Induzieren von Lexikons der In-Gruppe-Sprache mit sozio-temporalem Kontext 引引具有社会-时时背景的组内语言词汇法 2409.19257v3
  • 1127 06-05 EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition EMO-Debias: Benchmarking Gender-Debiasing-Techniken in der multi-Label Sprachemotionserkennung EMO-Debias:在多语言言论中确定性别偏见技术基准 2506.04652v1
  • 1128 06-05 From Intention To Implementation: Automating Biomedical Research via LLMs Von der Absicht zur Umsetzung: Automatisierung der biomedizinischen Forschung über LLMs 从实施目的出发:通过LLMs实现生物医学研究自动化 2412.09429v4
  • 1129 06-05 Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing Nicht alle Optionen sind gleich: Textuelle Optionsgewichtung für Token-Efficient LLM-basierte Wissensverfolgung 并非所有选项都创建为等: Token- Efficient LLM 知识追踪的文本选项比重 2410.12872v2
  • 1130 06-05 Rectified Sparse Attention Rektifizierte Sparse Attention 校正的稀疏注意力 2506.04108v2
  • 1131 06-05 Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents Flex-TravelPlanner: Ein Benchmark für flexible Planung mit Sprachagenten 弹性旅行规划:语文代表灵活规划基准 2506.04649v1
  • 1132 06-05 TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering TaDA: Training-freies Rezept zur Dekodierung mit adaptiver KV-Cache-Kompression und Mean-Centering TaDA:使用适应性 KV 缓存压缩和中度激励解码的无培训食谱 2506.04642v1
  • 1133 06-05 MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning MoDoMoDo: Multi-Domain-Datenmischungen für multimodales LLM-Verstärkungslernen MoDoMoDo:多模态LLM强化学习的多领域数据混合 2505.24871v2
  • 1134 06-05 ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition ViCocktail: Automatisierte Multi-Modal-Datensammlung für vietnamesische Audio-Visuelle Spracherkennung ViCocktail:越南语视听语音识别的自动多模态数据收集 2506.04635v1
  • 1135 06-05 MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation MAVL: Ein mehrsprachiger Audio-Video-Text Datensatz für animierte Song-Übersetzung MAVL: 动动歌曲翻译多语种视听歌词数据集 2505.18614v2
  • 1136 06-05 Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon Breaking the Cloak! Enthüllung der chinesischen verhüllten Toxizität mit Homophon Graph und giftigem Lexikon 破解衣物! 中华便衣毒物与同声图和毒毒词汇结合 2505.22184v2
  • 1137 06-05 COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing COMI-LINGUA: Von Experten annotierter großskaliger Datensatz für Multitask NLP in Hindi-Englisch Code-Mixing COMI-LINGUA:面向印地语-英语语码混合多任务NLP的专家标注大规模数据集 2503.21670v2
  • 1138 06-05 Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning Advancing Tool-Augmented Large Language Models durch Meta-Verifikation und Reflexion Lernen 通过元核查和反省学习促进工具强化大语言模式 2506.04625v1
  • 1139 06-05 Static Word Embeddings for Sentence Semantic Representation Statische Wort-Einbettungen für Satz semantische Darstellung 判决语义代表的静态单词嵌入 2506.04624v1
  • 1140 06-05 Subjective Perspectives within Learned Representations Predict High-Impact Innovation Subjektive Perspektiven in erfahrenen Vertretungen prognostizieren High-Impact Innovation 学术界代表的主观观点 预测高影响创新 2506.04616v1
  • 1141 06-05 Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning Überprüfung von Test-Time Scaling: Eine Umfrage und eine vielfaltsbewusste Methode für effizientes Reasoning 重新审视试验时间尺度:调查以及有效说明理由的多样化软件方法 2506.04611v1
  • 1142 06-05 DataGen: Unified Synthetic Dataset Generation via Large Language Models DataGen: Unified Synthetic Dataset Generation über große Sprachmodelle DataGen:通过大语言模型生成统一合成数据集 2406.18966v4
  • 1143 06-05 A MISMATCHED Benchmark for Scientific Natural Language Inference Ein MISMATCHED-Benchmark für naturwissenschaftliche Sprachinferenzen 科学自然语言引文基准 2506.04603v1
  • 1144 06-05 Seed-Coder: Let the Code Model Curate Data for Itself Seed-Coder: Lassen Sie das Code-Modell Daten für sich selbst kuratieren Seed-Coder:让代码模型为自己整理数据 2506.03524v2
  • 1145 06-05 Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem Lösen des vernünftigen Potenzials von vortrainierten LLMs durch Kritik Feinsteuerung auf ein Problem 通过Critique对一个问题的微调,释放预先培训的LLMs的理据潜力 2506.03295v2
  • 1146 06-05 Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing Beyond Memorization: Ein strenger Evaluationsrahmen für medizinisches Knowledge Editing 记忆之后:医学知识编辑的严格评价框架 2506.03490v2
  • 1147 06-05 Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification Sicher: Mathematische Reasoning in großen Sprachmodellen durch retrospektive stufenweise Verifizierung verbessern 安全:通过回溯性逐步认识正式核查,加强大语言模型中的数学理由 2506.04592v1
  • 1148 06-05 LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models LESS: Großes Sprachmodell Verbessertes semi-überwachtes Lernen für Sprachgrundmodelle LESS:大语言模式强化半监督半监督学习演讲基础模型 2506.04586v1
  • 1149 06-05 MuSciClaims: Multimodal Scientific Claim Verification MuSciClaims: Multimodale wissenschaftliche Antragsprüfung 穆西索赔: 多式联运科学索赔核实 2506.04585v1
  • 1150 06-05 SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-Checking durch Forderungszersetzung und Bearbeitung SUCEA: 通过索赔的分解和编辑进行反向实况调查的理由说明-密集检索 2506.04583v1
  • 1151 06-05 Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching Auswahl von Demonstrationen für das Viel-Schuss-In-Kontext-Lernen über Gradient Matching 通过梯度匹配选择多片点在文本中学习的演示 2506.04579v1
  • 1152 06-05 Contrastive Visual Data Augmentation Kontrastive Bilddatenvergrößerung 对比视觉数据增强 2502.17709v2
  • 1153 06-05 Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts? Sind LLMs zuverlässige Übersetzer der logischen Vernunft in lexikalisch diversifizierten Kontexten? LLMs 可靠翻译者是否掌握了跨越多种不同背景的逻辑理由? 2506.04575v1
  • 1154 06-05 Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis Reasoning or Overthinking: Bewertung großer Sprachmodelle zur Analyse von Finanzsentimenten 理由或过度思考:评价关于金融敏感分析的大型语言模型 2506.04574v1
  • 1155 06-05 Demonstrations of Integrity Attacks in Multi-Agent Systems Demonstrationen von Integritätsangriffen in Multi-Agent-Systemen 多机构系统中廉正攻击示范 2506.04572v1
  • 1156 06-05 Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment Implizite Cross-Lingual-Belohnung für effiziente Mehrsprachigkeitsausrichtung 高效多语种和多种语文首选项统一化的双双优 2503.04647v2
  • 1157 06-05 Clustering and Median Aggregation Improve Differentially Private Inference Clustering und Median Aggregation verbessern unterschiedliche private Schlussfolgerungen 群集和中中聚合改善差别私人推断 2506.04566v1
  • 1158 06-05 From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems Von Standalone LLMs bis hin zu integrierter Intelligenz: Eine Übersicht über zusammengesetzte AI-Systeme 从独立LLM到集成智能:复合AI系统综述 2506.04565v1
  • 1159 06-05 Investigating Distributions of Telecom Adapted Sentence Embeddings for Document Retrieval Untersuchung der Verteilung von Telecom-adaptierten Satz-Einbindungen für Dokumentenwiederherstellung 用于文件检索的经调整的远程判刑嵌入表的调查分发情况 2406.12336v3
  • 1160 06-05 SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages? SSA-COMET: Sind LLMs in der Bewertung von MT für unterresourced African Languages bestens ausgebildete Metrics? SSA-COMET:在为资源不足的非洲语言评价MT方面,LLMs是否超过成绩优异的计量? 2506.04557v1
  • 1161 06-05 What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective Was in LLM-Schichten passiert ist, wenn es um schnelles gegen langsames Denken geht: Eine gradiente Perspektive 训练快速与慢思考:渐进视角时 LLM 图层中发生的情况 2410.23743v2
  • 1162 06-05 Mosaic-IT: Cost-Free Compositional Data Synthesis for Instruction Tuning Mosaic-IT: Kostenfreie kompositorische Datensynthese für die Instruction Tuning Mosaic-IT:用于教学图纸的无成本构成数据综述 2405.13326v3
  • 1163 06-05 $K$-MSHC: Unmasking Minimally Sufficient Head Circuits in Large Language Models with Experiments on Syntactic Classification Tasks $K$-MSHC: Entlarvung minimal ausreichender Kopfkreise in großen Sprachmodellen mit Experimenten zu syntaktischen Klassifizierungsaufgaben $K$-MSHC:通过句法分类任务实验揭示大型语言模型中的最小充分头回路 2505.12268v2
  • 1164 06-05 ATLaS: Agent Tuning via Learning Critical Steps ATLaS: Agent Tuning über Learning Critical Steps 通过学习关键步骤进行测试的代理 2503.02197v2
  • 1165 06-05 On Entity Identification in Language Models Zur Identitätskennung in Sprachmodellen 关于在语文模式中实体识别 2506.02701v3
  • 1166 06-05 ChatWise: A Strategy-Guided Chatbot for Enhancing Cognitive Support in Older Adults ChatWise: Ein strategiegeführter Chatbot zur Verbesserung der kognitiven Unterstützung bei älteren Erwachsenen ChatWise:加强老年人认知支助战略指导的Chattbot 2503.05740v2
  • 1167 06-05 BSBench: will your LLM find the largest prime number? BSBench: Findet Ihr LLM die größte Primzahl? 你的LLM会找到最大的质数吗? 2506.04535v1
  • 1168 06-05 Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs Ist es nur Semantik? Eine Fallstudie des Diskurses Teilchenverständnis in LLMs 《只是语义学吗? 2506.04534v1
  • 1169 06-05 A Survey on Data Contamination for Large Language Models Eine Umfrage über Datenkontamination für große Sprachmodelle 大语言模型数据污染调查 2502.14425v2
  • 1170 06-05 DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis DECASTE: Enthüllen von Kastenstereotypen in großen Sprachmodellen durch multidimensionale Bias-Analyse DECASTE:通过多语言双种族分析,在大语言模式中统一种姓定型观念 2505.14971v2
  • 1171 06-05 WiNGPT-3.0 Technical Report Technischer Bericht WiNGPT-3.0 WINGPT-3.0技术报告 2505.17387v2
  • 1172 06-05 Calibrating Translation Decoding with Quality Estimation on LLMs Kalibrierung der Übersetzungsdekodierung mit Qualitätsschätzung auf LLMs 以LLMM中质量估算法进行标注 2504.19044v3
  • 1173 06-05 An Exploration of Self-Supervised Mutual Information Alignment for Multi-Task Settings Eine Untersuchung der selbstüberwachten gegenseitigen Information Ausrichtung für Multi-Task-Einstellungen 探索多种任务设置的自我监督的相互信息协调 2410.01704v2
  • 1174 06-05 Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning Grapheme-Kohärente phonemische und prosodic Annotation der Sprache durch Implizite und explizite Graphemkonditionierung 隐含和显性图形条件的图形一致的语音和Prosodic 语音批注 2506.04527v1
  • 1175 06-05 Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation Bitte übersetzen Sie erneut: Zwei einfache Experimente darüber, ob Menschen wie Vernunft hilft Übersetzung 请再次翻译: 两项简单的实验, 有关 人类与人类之间的理由是否有助于翻译的简单实验 。 2506.04521v1
  • 1176 06-05 EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles EnsemW2S: Verbesserung der Schwach-zu-Strong-Verallgemeinerung mit großsprachigen Modellensembles EnsemW2S:用大语言模型组合加强弱至强的通用化 2505.21959v2

Article 0

Title@2025-06-12 (4): How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?

Title: How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? Wie gut können vernünftigen Modelle erkennen und sich von unhilflichen Gedanken erholen? 理性模型如何能从无益的想法中查明和复苏? 2506.10979v1

Authors (6): Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel, Mor Geva

Recent reasoning models show the ability to reflect, backtrack, and self-validate their reasoning, which is crucial in spotting mistakes and arriving at accurate solutions. A natural question that arises is how effectively models can perform such self-reevaluation. We tackle this question by investigating how well reasoning models identify and recover from four types of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to the question, thoughts misdirecting the question as a slightly different question, and thoughts that lead to incorrect answers. We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant performance drops. Models tend to naively continue the line of reasoning of the injected irrelevant thoughts, which showcases that their self-reevaluation abilities are far from a general “meta-cognitive” awareness. Moreover, we observe non/inverse-scaling trends, where larger models struggle more than smaller ones to recover from short irrelevant thoughts, even when instructed to reevaluate their reasoning. We demonstrate the implications of these findings with a jailbreak experiment using irrelevant thought injection, showing that the smallest models are the least distracted by harmful-response-triggering thoughts. Overall, our findings call for improvement in self-reevaluation of reasoning models to develop better reasoning and safer systems.

最近的推理模型显示了反映、回溯和自我验证其推理的能力,这是发现错误和得出准确解决办法的关键所在。自然产生的一个问题是模型能够如何有效地进行这种自我评价。我们通过调查推理模型如何很好地辨别和从四类无益的想法中恢复过来来解决这个问题:没有信息集散的想法、与问题无关的想法、将问题误导为一个略有不同的问题的想法,以及导致错误答案的想法。我们表明模型能够有效地查明最无益的想法,但当这些想法注入到其思维过程中时,却努力从同样的想法中恢复过来,从而导致显著的绩效下降。模型往往天真地延续被注入的不相关想法的推理,这些想法表明它们的自我评估能力远远没有达到一般的“元认知”意识。此外,我们看到非/反缩缩缩趋势,在较大的模型从短暂的不相干的想法中挣扎,甚至当它们被指示重新评价其推理时,我们展示了这些结论的影响,用不相关的思维注入了不相干的想法来进行推理的实验,显示了最小的推理推理,表明最有害的推理,最有害的推理是最小的模型的自我推理。


Article 1

Title@2025-06-12 (4): AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Title: AutoMind: Adaptive Knowledgeable Agent for Automated Data Science AutoMind: Adaptives Knowledgeable Agent für automatisierte Datenwissenschaft 自动Mind:自动数据科学适应性知识代理 2506.10974v1

Authors (9): Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang

Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.

大型语言模型(LLM)代理商在解决现实世界数据科学问题方面表现出了巨大的潜力。LLM驱动的数据科学代理商承诺使整个机器学习管道自动化,然而其真实世界的有效性仍然有限。现有框架依赖于僵硬、预先定义的工作流程和不灵活的编码战略;因此,它们仅擅长于相对简单、古老的问题,未能捕捉人类从业者带来复杂、创新任务的经验专长。在这项工作中,我们引入了AutoMind(AutoMind)(一个适应性、知识丰富的LLM(LM)代理商)框架,通过三项关键进步克服了这些缺陷:(1) 一种成熟的专家知识基础,使该代理商具有领域专家知识;(2) 一种具有代理知识的树搜索算法,从战略上探索可能的解决方案;(3) 一种自我调整的编码战略,根据任务的复杂性动态地定制生成代码。对两个自动化数据科学基准的评估表明,AutoMind(AutoMind)能够提供优异的绩效、效率和质量解决方案质量,强调AutMind(Autmind)是迈向完全自动化数据科学的高效和稳健健捷的一步。


Article 2

Title@2025-06-12 (4): MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

Title: MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning MMMG: Ein massiver, multidisziplinärer, multi-Tier-Erzeugungs-Benchmark für Text-zu-Bild-Reasoning MMMG: 大规模、多学科、多层级的文本到图像推理生成基准 2506.10963v1

Authors (9): Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian

In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning, a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image’s core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits (low entity fidelity, weak relations, and clutter), with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark’s difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.

在本文中,我们将知识图像生成作为一项新任务,与MMMG(MMM)的大规模多分解多层次多层次知识生成基准(MMMG)一道,将知识图像生成作为一种新任务,以探究图像生成模型的推理能力;知识图像对于人类文明和人类学习机制 – – 一个事实 – – 由双重编码理论和图像超度效应所强调的事实 – – 至关重要。生成这种图像具有挑战性,要求多式推理将世界知识与像素级基础相结合,成为清晰的解释性视觉。为了能够进行全面评估,MMMMMG(知识)多分辨的(知识)图像生成模型)提供了4,456对10个学科、6个教育级别和各种知识格式,如图表、图表和脑图地图。为了消除评估过程中的复杂复杂性,我们采用了统一的知识图表图表图表(KGG)图解核心实体及其依赖性。我们进一步引入MMMG-C-核心来评估生成的知识图像。我们引入了MMBMB-C-C(仅用于评估所生成的知识图像的公开性)。这一指标化的更准确性,用图表流化模型和直径直径分解的模型和直径直径分解的20分解的模型和直径分解的模型,将事实和直径分解的20级、直判的模型和直径分解的直判。
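
The abstract above says MMMG-Score combines a graph-edit-distance fidelity term with a visual clarity assessment, but it does not state how the two are combined. The sketch below is only an illustration under stated assumptions: nodes and edges are matched by exact label (so the edit distance reduces to counting insertions and deletions), the distance is normalized by the size of the union of both graphs, and fidelity is multiplied by a clarity value in [0, 1]. All function names are hypothetical, and the result is on a 0 to 1 scale rather than the scale of the scores reported in the abstract.

```python
def simple_graph_edit_distance(ref_nodes, ref_edges, gen_nodes, gen_edges):
    """Count node and edge insertions/deletions, matching elements by label.

    Assumption: labels are unique identifiers, so no substitutions are needed.
    """
    node_edits = len(set(ref_nodes) ^ set(gen_nodes))
    edge_edits = len(set(ref_edges) ^ set(gen_edges))
    return node_edits + edge_edits


def knowledge_image_score(ref_nodes, ref_edges, gen_nodes, gen_edges, clarity):
    """Illustrative score: (1 - normalized edit distance) * clarity."""
    if not 0.0 <= clarity <= 1.0:
        raise ValueError("clarity is assumed to lie in [0, 1]")
    total = len(set(ref_nodes) | set(gen_nodes)) + len(set(ref_edges) | set(gen_edges))
    if total == 0:
        return 0.0  # two empty graphs carry no knowledge to score
    distance = simple_graph_edit_distance(ref_nodes, ref_edges, gen_nodes, gen_edges)
    fidelity = 1.0 - distance / total
    return fidelity * clarity


# Toy reference graph and a generated graph that misses one entity and one relation.
reference_nodes = ["sun", "earth", "moon"]
reference_edges = [("earth", "orbits", "sun"), ("moon", "orbits", "earth")]
generated_nodes = ["sun", "earth"]
generated_edges = [("earth", "orbits", "sun")]

score = knowledge_image_score(
    reference_nodes, reference_edges, generated_nodes, generated_edges, clarity=0.5
)
print(round(score, 3))
```

With this toy input, two of five graph elements are missing, so fidelity is 0.6 before the clarity factor is applied.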


Article 3

Title@2025-06-12 (4): ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark

Title: ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark ChinesischHarm-Bench: Ein chinesischer schädlicher Content Detection Benchmark 中中汉禁区:中国有害内容检测基准 2506.10960v1

Authors (10): Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng

Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.

大型语言模型(LLMS)已越来越多地用于自动识别有害内容的任务,协助主持人查明违反政策的情况,提高内容审查的总体效率和准确性;然而,用于检测有害内容的现有资源主要集中于英语,中国数据集仍然稀缺,而且范围往往有限;我们为中国内容伤害检测提供了一个全面、专业的附加说明的基准,该基准涵盖六个代表性类别,完全以现实世界数据为基础;我们的批注过程还产生了一个知识规则基础,该知识基础提供明确的专家知识知识,协助查找中国有害内容的LMS;此外,我们提出一个知识强化基线,将人附加说明的知识规则和大型语言模型的隐含知识结合起来,使较小的模型能够达到与最先进的LMS相近的性能;可在https://github.com/zjunp/中国Harm-bench查阅数据。


Article 4

Title@2025-06-12 (4): Build the web for agents, not agents for the web

Title: Build the web for agents, not agents for the web Erstellen Sie das Web für Agenten, nicht Agenten für das Web 为代理者而不是网络代理者建立网络 2506.10953v1

Authors (4): Xing Han Lù, Gaurav Kamath, Marius Mosbach, Siva Reddy

Recent advancements in Large Language Models (LLMs) and multimodal counterparts have spurred significant interest in developing web agents – AI systems capable of autonomously navigating and completing tasks within web environments. While holding tremendous promise for automating complex web interactions, current approaches face substantial challenges due to the fundamental mismatch between human-designed interfaces and LLM capabilities. Current methods struggle with the inherent complexity of web inputs, whether processing massive DOM trees, relying on screenshots augmented with additional information, or bypassing the user interface entirely through API interactions. This position paper advocates for a paradigm shift in web agent research: rather than forcing web agents to adapt to interfaces designed for humans, we should develop a new interaction paradigm specifically optimized for agentic capabilities. To this end, we introduce the concept of an Agentic Web Interface (AWI), an interface specifically designed for agents to navigate a website. We establish six guiding principles for AWI design, emphasizing safety, efficiency, and standardization, to account for the interests of all primary stakeholders. This reframing aims to overcome fundamental limitations of existing interfaces, paving the way for more efficient, reliable, and transparent web agent design, which will be a collaborative effort involving the broader ML community.

大语言模型(LLMS)和多式联运对应方最近的进展引起了人们对开发网络代理器的极大兴趣 – – 能够自主浏览和完成网络环境中任务的AI系统。在拥有使复杂的网络互动自动化的巨大希望的同时,目前的方法面临巨大的挑战,因为人设计的界面与LLM能力之间根本的不匹配。目前的方法与网络投入的内在复杂性挣扎,无论是处理大型DOM树,依靠通过额外信息增加的截图,还是完全通过API互动绕过用户界面。这份立场文件倡导网络代理器研究的范式转变:而不是迫使网络代理器适应为人类设计的界面,我们应开发一种新的互动模式,专门优化了代理能力。为此,我们引入了AWI(AWI)概念,这是一个专门为代理器浏览网站而设计的界面。我们为AWI设计制定了六项指导原则,强调安全、效率和标准化,以顾及所有主要利益攸关方的利益。这一重新配置的目的是克服现有界面的基本限制,为更有效、可靠和透明的网络代理器设计铺平道路,这将扩大社区的协作努力。


Article 5

Title@2025-06-12 (4): Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Title: Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training Domain2Vec: Vectorizing Datasets, um die optimale Datenmischung ohne Training zu finden 域2Vec: 将数据集矢量化,以查找未经过培训的最佳数据混合体 2506.10952v1

Authors (4): Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu

We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%.

我们提出 Domain2Vec,这是一种新方法,可将任意数据集分解为若干元域(meta-domain)的线性组合;元域是一个新概念,旨在刻画数据集的关键底层特征。Domain2Vec 维护一个元域词表,并使用分类器将任意给定数据集分解为一个域向量,该向量对应于该词表上的一个分布。在分布对齐假设(Distribution Alignment Assumption,DA²)下,这些域向量使我们能够以无需训练的方式确定语言模型(LM)预训练的最优数据混合;该假设认为,训练集与验证集的数据分布对齐得越好,验证损失就越低。此外,Domain2Vec 可以无缝集成到已有工作中,用于建模域向量与 LM 性能之间的关系,从而大幅提升已有方法的效率和可扩展性。大量实验表明,Domain2Vec 能够以极小的计算开销找到提升下游任务性能的数据混合。具体而言,Domain2Vec 在 Pile-CC 上达到相同的验证损失,仅需在 The Pile 数据集原始混合上训练所需计算量的 51.5%。在同等计算预算下,Domain2Vec 使下游性能平均提升 2.83%。
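
The Distribution Alignment Assumption described above can be illustrated with a toy search: given a domain vector per training dataset and one for the validation set, pick the mixture weights whose combined domain vector is closest to the validation vector. This is a minimal sketch, not the authors' procedure; the L1 distance, the coarse grid search over the simplex, and all names and numbers are assumptions made for illustration.

```python
from itertools import product


def mix(domain_vectors, weights):
    """Domain vector of a mixture: the weighted sum of per-dataset domain vectors."""
    dim = len(domain_vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, domain_vectors)) for k in range(dim)]


def l1_distance(p, q):
    """Sum of absolute differences between two distributions (an assumed choice)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def best_mixture(domain_vectors, target, steps=10):
    """Grid-search mixture weights (multiples of 1/steps) that best match `target`."""
    n = len(domain_vectors)
    best_weights, best_dist = None, float("inf")
    for counts in product(range(steps + 1), repeat=n):
        if sum(counts) != steps:
            continue  # keep only weight vectors that sum to 1
        weights = [c / steps for c in counts]
        dist = l1_distance(mix(domain_vectors, weights), target)
        if dist < best_dist:
            best_weights, best_dist = weights, dist
    return best_weights, best_dist


# Hypothetical domain vectors over three meta-domains for three training datasets.
train_vectors = [
    [0.8, 0.2, 0.0],
    [0.1, 0.1, 0.8],
    [0.2, 0.7, 0.1],
]
# Hypothetical validation-set domain vector.
validation_vector = [0.45, 0.15, 0.40]

weights, distance = best_mixture(train_vectors, validation_vector)
print(weights, distance)
```

No model is trained at any point, which is the sense in which such a search is training-free; a practical version would replace the grid search with a proper optimizer over the simplex.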


Article 6

Title@2025-06-12 (4): GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models

Title: GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models GUARD: Geführtes Lernen und Zurückhalten über Datenzuweisung für große Sprachmodelle GUARD:通过大语言模式数据归称制,指导学习和保留 2506.10946v1

Authors (7): Evelyn Ma, Duo Zhou, Peizhi Niu, Huiting Zhou, Huan Zhang, Olgica Milenkovic, S. Rasoul Etesami

Unlearning in large language models (LLMs) is becoming increasingly important due to regulatory compliance, copyright protection, and privacy concerns. However, a key challenge in LLM unlearning is unintended forgetting, where the removal of specific data inadvertently impairs the utility of the model and its retention of valuable, desired information. While prior work has primarily focused on architectural innovations, the influence of data-level factors on unlearning performance remains underexplored. As a result, existing methods often suffer from degraded retention when forgetting high-impact data. To address this, we propose GUARD, a novel framework for Guided Unlearning And Retention via Data attribution. At its core, GUARD introduces a lightweight proxy data attribution metric tailored for LLM unlearning, which quantifies the “alignment” between the forget and retain sets while remaining computationally efficient. Building on this, we design a novel unlearning objective that assigns adaptive, nonuniform unlearning weights to samples, inversely proportional to their proxy attribution scores. Through such a reallocation of unlearning power, GUARD mitigates unintended losses in retention. We provide rigorous theoretical guarantees that GUARD significantly enhances retention while maintaining forgetting metrics comparable to prior methods. Extensive experiments on the TOFU benchmark across multiple LLM architectures demonstrate that GUARD substantially improves utility preservation while ensuring effective unlearning. Notably, GUARD reduces utility sacrifice on the Retain Set by up to 194.92% in terms of Truth Ratio when forgetting 10% of the training data.

在大型语言模型(LLMS)中,不学于大型语言模型(LLM)正变得日益重要。然而,LLM(LLM)中的一个关键挑战是无意忘记,因为删除具体数据会无意地损害模型的效用和保存宝贵、期望的信息。虽然以前的工作主要侧重于建筑创新,但数据层面因素对不学业绩的影响仍然未得到充分探讨。因此,在忘记高影响数据时,现有方法往往会受到退化保留。为了解决这个问题,我们提议GUARD(GUARD)为通过数据归属引导不学和保留建立一个新的框架。在核心方面,GUARD为LM(LM)不学习推出一个轻型代用代用数据归属指标,该指标使遗忘和保留各套之间的“匹配”在计算上保持效率的同时被量化。在此基础上,我们设计了一个新的不学习目标,即给样本分配适应性、非单一的不学习权重,与其代用分数的分数反比。我们提议,GURD(GARD)提出一个严格的理论保证为LM(GR)在大幅提升历史标定值的同时,同时,使R(DR)R(DR)的R)BILILILILIL)标准在以往的升级中大幅改进。同时,同时,同时,使R(BILVILBIBIBR(R)的不比重)的保存。
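
The weighting rule mentioned in the abstract (unlearning weights inversely proportional to proxy attribution scores) can be sketched in a few lines. The abstract does not give the attribution metric or the exact normalization, so the non-negative scores, the small `eps` term, and the normalization to a sum of 1 below are assumptions, and `unlearning_weights` is a hypothetical name.

```python
def unlearning_weights(attribution_scores, eps=1e-8):
    """Return per-sample weights proportional to 1 / score, normalized to sum to 1.

    Assumption: scores are non-negative; eps guards against division by zero.
    """
    if any(s < 0 for s in attribution_scores):
        raise ValueError("attribution scores are assumed to be non-negative")
    inverse = [1.0 / (s + eps) for s in attribution_scores]
    total = sum(inverse)
    return [v / total for v in inverse]


# Forget-set samples whose proxy scores indicate increasing alignment with the retain set.
example_scores = [1.0, 2.0, 4.0]
example_weights = unlearning_weights(example_scores)
print([round(w, 4) for w in example_weights])
```

Samples with the highest scores receive the smallest weights, so less unlearning pressure falls on the forget samples most aligned with the retain set.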


Article 7

Title@2025-06-12 (4): VINCIE: Unlocking In-context Image Editing from Video

Title: VINCIE: Unlocking In-context Image Editing from Video VINCIE: Im Kontext Bildbearbeitung von Video entsperren VINCIE: 从视频中解锁 Incontext 图像编辑 2506.10941v1

Authors (10): Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang

In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.

文本内图像编辑旨在根据由文本和先前生成的图像组成的背景序列修改图像。 现有方法通常取决于任务特定管道和专家模型( 如分解和油漆)来整理培训数据。 在这项工作中,我们探讨能否直接从视频中学习文本内图像编辑模型; 我们采用可扩缩的方法将视频注释作为相互脱节的多式联运序列; 为了有效地从这些数据中学习, 我们设计了一个块- 螺旋扩散变压器, 就三种代理任务进行了培训: 下一个图像预测, 当前分解预测, 和下一个分解预测。 此外, 我们提出一个新的多方向图像编辑基准, 以推进这一领域的研究。 广泛的实验表明, 我们的模式展示了超强的文本图像编辑能力, 在两个多方向图像编辑基准上取得了最新的结果。 尽管我们只接受过视频培训, 我们的模式还展示了多概念构成、 故事生成和链编辑应用程序方面的有前途的能力。


Article 8

Title@2025-06-12 (4): Visually Descriptive Language Model for Vector Graphics Reasoning

Title: Visually Descriptive Language Model for Vector Graphics Reasoning Visuell Deskriptives Sprachmodell für Vektorgrafiken 矢量图形推理视觉描述语言模型 2404.06479v5

Authors (7): Zhenhailong Wang, Joy Hsu, Xingyao Wang, Kuan-Hao Huang, Manling Li, Jiajun Wu, Heng Ji

Despite significant advancements, large multimodal models (LMMs) still struggle to bridge the gap between low-level visual perception – focusing on shapes, sizes, and layouts – and high-level language reasoning, such as semantics and logic. This limitation is evident in tasks that require precise visual perception, like comparing geometric properties or solving visual reasoning problems. To study this failure mode, we focus on vector graphics – images composed of 2D objects and shapes, prevalent in LMM-based tasks in web, design, and OS environments. We identify two key research questions: how can we enable precise visual perception, and how can we facilitate high-level reasoning based on such low-level perceptions? To capture fine visual details, we use Scalable Vector Graphics (SVG) for accurate encoding of visual scenes. However, SVGs are not readily interpretable by LMMs in a zero-shot manner. To tackle this, we propose the Visually Descriptive Language Model (VDLM), which introduces a Primal Visual Description (PVD) as an intermediate textual representation. PVD translates SVGs into a text-based abstraction consisting of primitive attributes (e.g., shape, position, measurement) and their corresponding values. PVD can be learned using task-agnostic synthesized data and represents visual primitives that are universal across vector graphics. This abstraction is more structured, allowing for direct interpretation by foundation models for zero-shot generalization. Without human-annotated data, empirical results show that VDLM significantly improves state-of-the-art LMMs like GPT-4o on various multimodal perception and reasoning tasks. Extensive analyses of VDLM show improved interpretability due to its disentangled perception and reasoning. We also demonstrate a positive correlation between PVD quality and task performance. Project page: https://mikewangwzhl.github.io/VDLM/

尽管取得了显著的进步,大型多式模型(LMMD)仍然在努力弥合低层次视觉感知 – – 侧重于形状、大小和布局 – – 与高层次语言推理(例如语义和逻辑)之间的差距。这种限制在需要精确视觉感测的任务中显而易见,例如比较几何属性或解决视觉推理问题。为了研究这种失败模式,我们侧重于由2D对象和形状组成的矢量图形 – – 以 LMM 为基础在网络、设计和OS环境中盛行的2D对象和形状。我们确定了两个关键的研究问题:我们如何能够提供精确视觉感知,以及我们如何根据这种低层次感知促进高层次推理?为了捕捉精细视觉细节,我们使用可缩放矢量的矢量图形图形图形图形(SVGGGG)来准确校正图像。SVVGGGG不易被解析。为了解决这个问题,我们建议直观描述直观语言语言模模型(VDLMM),将PlM(PVD)推算结果作为中间文本显示的直径直径直径直径直径直径判/直径直径数据,用直径模型将SVGGGM(S-GM)的直径分析,用直径分析显示直径的直径MD-ILM) 和直径数据显示其直径的直径直径直径直径直径数据。


Article 9

Title@2025-06-12 (4): Dynamic Epistemic Friction in Dialogue

Title: Dynamic Epistemic Friction in Dialogue Dynamische epistemische Reibung im Dialog 对话框中的动态瞬间摩擦 2506.10934v1

Authors (5): Timothy Obiso, Kenneth Lai, Abhijnan Nath, Nikhil Krishnaswamy, James Pustejovsky

Recent developments in aligning Large Language Models (LLMs) with human preferences have significantly enhanced their utility in human-AI collaborative scenarios. However, such approaches often neglect the critical role of “epistemic friction,” or the inherent resistance encountered when updating beliefs in response to new, conflicting, or ambiguous information. In this paper, we define dynamic epistemic friction as the resistance to epistemic integration, characterized by the misalignment between an agent’s current belief state and new propositions supported by external evidence. We position this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit, 2011), where friction emerges as nontrivial belief-revision during the interaction. We then present analyses from a situated collaborative task that demonstrate how this model of epistemic friction can effectively predict belief updates in dialogues, and we subsequently discuss how the model of belief alignment as a measure of epistemic resistance or friction can naturally be made more sophisticated to accommodate the complexities of real-world dialogue scenarios.

在使大语言模型与人类偏好相一致方面的最新发展大大增强了其在人类-AI合作情景中的实用性。然而,这些方法往往忽视了“流行摩擦”的关键作用,或针对新的、相互冲突的或模糊的信息更新信仰时遇到的内在阻力。 在本文中,我们将动态的记忆摩擦定义为对认知融合的阻力,其特征是代理人当前信仰状态与外部证据支持的新主张之间的不匹配。我们将此置于动态流行逻辑(Van Benthem和Pacuit,2011年)的框架内,在互动期间,摩擦作为非三边信仰的改变而出现。我们然后从一个位置上的合作任务中提出分析,表明这种特征摩擦模式如何有效地预测信仰在对话中的变化,我们随后讨论如何自然地使作为缩写抵抗或摩擦的衡量尺度的信仰调整模式更加复杂,以适应现实对话情景的复杂性。


Article 10

Title@2025-06-12 (4): Improving LLM Safety Alignment with Dual-Objective Optimization

Title: Improving LLM Safety Alignment with Dual-Objective Optimization Verbesserung der LLM-Sicherheitsausrichtung mit Dual-Ziel-Optimierung 提高LLM安全一致性,实现双目标优化 2503.03710v2

Authors (7): Xuandong Zhao, Will Cai, Tianneng Shi, David Huang, Licong Lin, Song Mei, Dawn Song

Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at https://github.com/wicai24/DOOR-Alignment

对大型语言模型(LLMS)的现有培训时间安全调整技术仍然容易受到越狱攻击。直接偏好优化(DPO)是一种广泛采用的调整方法,在实验和理论背景下都表现出了局限性,因为其损失功能证明对拒绝学习而言并不理想。我们通过基于梯度的分析,查明这些缺陷,并提议改进安全调整,将DPO的目标分为两个部分:(1) 强有力的拒绝培训,鼓励即使在产生部分不安全的一代人的情况下也予以拒绝,(2) 有针对性的不学习有害知识。这一方法大大提高了LM对一系列广泛的越狱攻击的稳健性,包括预先填补、后缀和在分配和分配之外两种情况下的多发攻击。此外,我们引入了一种强调关键拒绝标志的方法,即采用基于奖励的象征性加权机制来学习拒绝学习,从而进一步提高对抗对抗敌对剥削的稳健性。我们的研究还表明,对监狱破伤攻击的稳健性与培训过程中象征性的分发变化以及拒绝和有害标志的内部表现有关,为LM安全调整的未来研究提供了宝贵的方向。该代码可在 https://Giusub/DOB/DO.
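
For reference, the standard DPO objective that the abstract analyzes can be written as a per-example loss on sequence log-probabilities. The sketch below shows the usual published form of that loss, not the dual-objective method proposed in the paper; the argument names and the default `beta` are illustrative choices.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    Each argument is the log-probability of a full response under the policy
    or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-3.0, -3.0, -3.0, -3.0))
# Raising the chosen (e.g. refusal) response and lowering the rejected one reduces the loss.
print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
```

The loss depends only on the difference between the two log-ratios, so it can fall by pushing the rejected response down without raising the chosen one.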


Article 11

Title@2025-06-12 (4): Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Title: Robustly Improving LLM Fairness in Realistic Settings via Interpretability Robuste Verbesserung der LLM Fairness in realistischen Einstellungen durch Dolmetschbarkeit 通过可解释性在现实环境中强有力地提高LLM公平性 2506.10922v1

Authors (2): Adam Karvonen, Samuel Marks

Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people’s careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., “only accept candidates in the top 10%”) induces significant racial and gender biases (up to 12% differences in interview rates). When these biases emerge, they consistently favor Black over White candidates and female over male candidates across all tested models and scenarios. Moreover, models can infer demographics and become biased from subtle cues like college affiliations, with these biases remaining invisible even when inspecting the model’s chain-of-thought reasoning. To address these limitations, our internal bias mitigation identifies race and gender-correlated directions and applies affine concept editing at inference time. Despite using directions from a simple synthetic dataset, the intervention generalizes robustly, consistently reducing bias to very low levels (typically under 1%, always below 2.5%) while largely maintaining model performance. Our findings suggest that practitioners deploying LLMs for hiring should adopt more realistic evaluation methodologies and consider internal mitigation strategies for equitable outcomes.

大型语言模型(LLM)越来越多地被部署到高风险的招聘应用中,做出直接影响人们职业和生计的决定。虽然先前的研究表明,简单的反偏见提示可以在受控评估中消除人口统计学偏见,但我们发现,一旦引入现实的背景细节,这些缓解措施就会失效。我们通过内部偏见缓解来解决这些失效:通过识别并中和模型激活中的敏感属性方向,我们在所有测试场景中实现了稳健的偏见削减。在领先的商业模型(GPT-4o、Claude 4 Sonnet、Gemini 2.5 Flash)和开源模型(Gemma-2 27B、Gemma-3、Mistral-24B)中,我们发现,加入公司名称、公开招聘页面上的文化描述以及选择性招聘限制(例如“只接受排名前10%的候选人”)等现实背景,会引发显著的种族和性别偏见(面试率差异最高达12%)。当这些偏见出现时,在所有测试的模型和场景中,它们始终偏向黑人候选人而非白人候选人、偏向女性候选人而非男性候选人。此外,模型可以从大学背景等细微线索中推断人口统计学特征并产生偏见,而且即使检查模型的思维链推理,这些偏见也无法被发现。为了解决这些局限,我们的内部偏见缓解方法识别与种族和性别相关的方向,并在推理时应用仿射概念编辑。尽管这些方向来自一个简单的合成数据集,该干预仍能稳健泛化,始终将偏见降到很低的水平(通常低于1%,始终低于2.5%),同时基本保持模型性能。我们的研究结果表明,将LLM用于招聘的从业者应采用更现实的评估方法,并考虑采用内部缓解策略以获得公平的结果。
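
The abstract above describes finding attribute-correlated directions in activations and editing along them at inference time. The following is only a minimal sketch of that idea on toy vectors, not the authors' implementation. It assumes the direction is estimated as a difference of group means and that "affine editing" means pinning the coordinate along that direction to a fixed reference value. The names `difference_of_means` and `affine_edit` are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def difference_of_means(group_a, group_b):
    """Unit vector pointing from the mean of group_b to the mean of group_a.

    Assumption: a difference of means is used as the attribute direction;
    the abstract does not say how the directions are estimated.
    """
    dim = len(group_a[0])
    mean_a = [sum(v[i] for v in group_a) / len(group_a) for i in range(dim)]
    mean_b = [sum(v[i] for v in group_b) / len(group_b) for i in range(dim)]
    diff = [a - b for a, b in zip(mean_a, mean_b)]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def affine_edit(activation, direction, target_coord):
    """Set the coordinate of `activation` along unit `direction` to `target_coord`.

    Components orthogonal to `direction` are left unchanged.
    """
    shift = target_coord - dot(activation, direction)
    return [x + shift * d for x, d in zip(activation, direction)]


# Toy 2-d activations for two synthetic groups (hypothetical data).
acts_a = [[2.0, 1.0], [4.0, 1.0]]
acts_b = [[-2.0, 1.0], [-4.0, 1.0]]
direction = difference_of_means(acts_a, acts_b)
edited = affine_edit([5.0, 7.0], direction, target_coord=0.0)
```

After the edit, the two groups are indistinguishable along `direction`, while everything orthogonal to it is preserved. In a real model the same operation would be applied to hidden states at chosen layers.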


Article 12

Title@2025-06-12 (4): Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

Title: Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization Zerlegung von MLP-Aktivierungen in interpretierbare Merkmale mittels semi-nichtnegativer Matrixfaktorisierung 通过半非负矩阵分解将MLP激活分解为可解释特征 2506.10920v1

Authors (3): Or Shafran, Atticus Geiger, Mor Geva

A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF-derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically related features, exposing a hierarchical structure in the MLP’s activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.

机制可解释性的一个核心目标,是在大型语言模型(LLM)中找到能够因果地解释其输出的正确分析单元。早期工作关注单个神经元,但有证据表明神经元往往编码多个概念,这推动了研究转向分析激活空间中的方向。一个关键问题是如何以无监督的方式找到能够捕捉可解释特征的方向。目前的方法依赖基于稀疏自编码器(SAE)的字典学习,通常在残差流激活上训练,从零开始学习方向。然而,SAE在因果评估中往往表现不佳,并且缺乏内在可解释性,因为其学习过程没有与模型的计算明确挂钩。在这里,我们通过半非负矩阵分解(SNMF)直接分解MLP激活来解决这些局限,使学到的特征(a)是共同激活神经元的稀疏线性组合,并且(b)被映射到激活它们的输入,从而可以直接解释。在Llama 3.1、Gemma 2和GPT-2上的实验表明,SNMF得到的特征在因果引导上优于SAE和一个强有监督基线(均值差),同时与人类可解释的概念相一致。进一步的分析表明,特定的神经元组合会在语义相关的特征之间复用,揭示了MLP激活空间中的层级结构。总之,这些结果表明SNMF是识别可解释特征和剖析LLM中概念表示的一种简单而有效的工具。


Article 13

Title@2025-06-12 (4): Weak-to-Strong Jailbreaking on Large Language Models

Title: Weak-to-Strong Jailbreaking on Large Language Models Schwach-zu-stark-Jailbreaking bei großen Sprachmodellen 针对大语言模型的弱到强越狱攻击 2401.17256v3

Authors (7): Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang

Large language models (LLMs) are vulnerable to jailbreak attacks - resulting in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient inference time attack for aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models only differ in their initial decoding distributions. The weak-to-strong attack’s key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model’s decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The results show our method can increase the misalignment rate to over 99% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong

大型语言模型(LLM)容易受到越狱攻击,从而生成有害、不道德或带有偏见的文本。然而,现有的越狱方法计算成本很高。在本文中,我们提出弱到强越狱攻击,这是一种高效的推理时攻击,可使已对齐的LLM生成有害文本。我们的关键直觉基于这样的观察:越狱模型和对齐模型只在初始解码分布上有所不同。弱到强攻击的关键技术思路是使用两个较小的模型(一个安全、一个不安全)对一个大得多的安全模型的解码概率进行对抗性修改。我们在来自3个机构的5个不同的开源LLM上评估了弱到强攻击。结果表明,我们的方法只需对每个样本进行一次前向传播,就能在两个数据集上将未对齐率提高到99%以上。我们的研究揭示了一个在对齐LLM时亟需解决的安全问题。作为初步尝试,我们提出了一种防御策略来抵御此类攻击,但构建更先进的防御仍然具有挑战性。复现该方法的代码可在 https://github.com/XuandongZhao/weak-to-strong 获取。


Article 14

Title@2025-06-12 (4): Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Title: Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification? Breaking Bad Molecules: Sind MLLMs bereit für die molekulare Entgiftung auf Strukturebene? 破碎坏分子:MLLMs是否准备好进行结构级分子解毒? 2506.10912v1

Authors (8): Fei Lin, Ziyang Gong, Cong Wang, Yonglin Tian, Tengchao Zhang, Xue Yang, Gen Luo, Fei-Yue Wang

Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair - generating structurally valid molecular alternatives with reduced toxicity - has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 560 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess nearly 30 mainstream general-purpose MLLMs and design multiple ablation studies to analyze key factors such as evaluation criteria, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.

毒性仍然是早期药物研发失败的主要原因。尽管分子设计和性质预测取得了进展,分子毒性修复任务(即生成结构有效且毒性降低的替代分子)尚未得到系统的定义或基准评测。为填补这一空白,我们引入ToxiMol,这是首个面向通用多模态大语言模型(MLLM)、聚焦分子毒性修复的基准任务。我们构建了一个标准化数据集,涵盖11个主要任务和560个具有代表性的有毒分子,覆盖多种机制和粒度。我们设计了一条具备机制感知和任务自适应能力的提示标注流程,并以专家毒理学知识为依据。同时,我们提出自动化评估框架ToxiEval,它将毒性终点预测、合成可及性、类药性和结构相似性整合为一条高通量的修复成功评估链。我们系统评估了近30个主流通用MLLM,并设计了多项消融研究,以分析评估标准、候选多样性和失败归因等关键因素。实验结果表明,尽管当前的MLLM在这项任务上仍面临重大挑战,但它们已开始在毒性理解、语义约束遵循和结构感知的分子编辑方面展现出有前景的能力。


Article 15

Title@2025-06-12 (4): Efficiently Identifying Watermarked Segments in Mixed-Source Texts

Title: Efficiently Identifying Watermarked Segments in Mixed-Source Texts Effiziente Identifikation von Segmenten mit Wasserzeichen in Texten aus gemischten Quellen 高效识别混合来源文本中的水印片段 2410.03600v2

Authors (4): Xuandong Zhao, Chenwen Liao, Yu-Xiang Wang, Lei Li

Text watermarks in large language models (LLMs) are increasingly used to detect synthetic text, mitigating misuse cases like fake news and academic dishonesty. While existing watermarking detection techniques primarily focus on classifying entire documents as watermarked or not, they often neglect the common scenario of identifying individual watermark segments within longer, mixed-source documents. Drawing inspiration from plagiarism detection systems, we propose two novel methods for partial watermark detection. First, we develop a geometry cover detection framework aimed at determining whether there is a watermark segment in long text. Second, we introduce an adaptive online learning algorithm to pinpoint the precise location of watermark segments within the text. Evaluated on three popular watermarking techniques (KGW-Watermark, Unigram-Watermark, and Gumbel-Watermark), our approach achieves high accuracy, significantly outperforming baseline methods. Moreover, our framework is adaptable to other watermarking techniques, offering new insights for precise watermark detection. Our code is publicly available at https://github.com/XuandongZhao/llm-watermark-location

大型语言模型(LLMs)的文本水标记越来越多地用于探测合成文本,减少误用案件,如假新闻和学术不诚实。虽然现有的水标记探测技术主要侧重于将整个文件分类为划水标记或非划水标记,但它们往往忽视了在较长的混合来源文件中确定单个水标记区块的常见情景。我们从浸泡式探测系统中汲取了灵感,提出了两种局部水标记探测的新方法。首先,我们开发了几何覆盖探测框架,目的是确定长文本中是否有水标记区段。第二,我们引入了适应性在线学习算法,以确定文本中水标记区段的确切位置。根据三种流行的水标记技术(KGW-水标记、Unigram-watermark和Gumbel-watermark)的评估,我们的方法达到了很高的精确度,大大超出基线方法。此外,我们的框架适应了其他水标记技术,为精确的水标记探测提供了新的洞察力。我们的代码公开发布在https://github.com/XuantungZhao/llm-watermarkmark-plat-plat-plat-plat-place。
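
To make the segment-level detection problem concrete, here is a naive sliding-window baseline. It is explicitly not the geometry cover or online learning method of the paper. It assumes a green/red-list watermark for which each token has already been marked green (1) or not (0) by some hypothetical upstream check, and it applies the usual one-proportion z-test to each window. `gamma` (expected green fraction in unwatermarked text), `window`, and `threshold` are illustrative parameters.

```python
import math


def z_score(green_flags, gamma):
    """z-statistic for the green-token count under the no-watermark null."""
    t = len(green_flags)
    greens = sum(green_flags)
    return (greens - gamma * t) / math.sqrt(t * gamma * (1.0 - gamma))


def flagged_windows(green_flags, window, gamma, threshold):
    """Start indices of windows whose z-score reaches the threshold."""
    hits = []
    for start in range(len(green_flags) - window + 1):
        if z_score(green_flags[start:start + window], gamma) >= threshold:
            hits.append(start)
    return hits


# Toy document: 8 unwatermarked tokens, 8 watermarked (all green), 8 unwatermarked.
flags = [0] * 8 + [1] * 8 + [0] * 8
hits = flagged_windows(flags, window=8, gamma=0.5, threshold=2.5)
```

A fixed window has to be tuned to the unknown segment length and scans every offset, which is the kind of limitation that motivates more adaptive detectors.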


Article 16

Title@2025-06-12 (4): Magistral

Title: Magistral Magistral Magistral 2506.10910v1

Authors (100): Mistral-AI, Abhinav Rastogi, Albert Q. Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, Léonard Blier, Lucile Saulnier, Matthieu Dinot, Maxime Darrin, Neha Gupta, Roman Soletskyi, Sagar Vaze, Teven Le Scao, Yihan Wang, Adam Yang, Alexander H. Liu, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Andy Ehrenberg, Anmol Agarwal, Antoine Roux, Arthur Darcet, Arthur Mensch, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Chris Bamford, Christian Wallenwein, Christophe Renaudin, Clémence Lanfranchi, Darius Dabert, Devon Mizelle, Diego de las Casas, Elliot Chane-Sane, Emilien Fugier, Emma Bou Hanna, Gauthier Delerce, Gauthier Guinet, Georgii Novikov, Guillaume Martin, Himanshu Jaju, Jan Ludziejewski, Jean-Hadrien Chabran, Jean-Malo Delignon, Joachim Studnia, Jonas Amar, Josselin Somerville Roberts, Julien Denize, Karan Saxena, Kush Jain, Lingxiao Zhao, Louis Martin, Luyu Gao, Lélio Renard Lavaud, Marie Pellat, Mathilde Guillaumin, Mathis Felardos, Maximilian Augustin, Mickaël Seznec, Nikhil Raghuraman, Olivier Duchenne, Patricia Wang, Patrick von Platen, Patryk Saffer, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Pavankumar Reddy Muddireddy, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Romain Sauvestre, Rémi Delacourt, Sanchit Gandhi, Sandeep Subramanian, Shashwat Dalal, Siddharth Gandhi, Soham Ghosh, Srijan Mishra, Sumukh Aithal, Szymon Antoniak, Thibault Schueller, Thibaut Lavril, Thomas Robert, Thomas Wang, Timothée Lacroix, Valeriia Nemychnikova, Victor Paltz, Virgile Richard, Wen-Ding Li, William Marshall, Xuanyu Zhang, Yunhao Tang

We introduce Magistral, Mistral’s first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint’s capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.

我们引入了Magistral、Mistral的第一个推理模型和我们自己的可缩放强化学习(RL)管道。我们不依靠现有的执行和从以前的模型中蒸馏出来的RL痕迹,而是遵循一种自下而上的方法,完全依靠我们自己的模型和基础设施。值得注意的是,我们展示了一堆材料,使我们能够探索纯RL培训LLMs的局限性,提出了一种简单的方法来强制使用模型的推理语言,并表明关于文本数据的RL单独保持了最初的检查站的大部分能力。我们发现,关于文本的RL保持或改进了多式联运的理解、指令的遵循和功能。我们展示了Mistral 中3顶部的法师Mistral Mid 3,仅与RL一起,我们展示了我们开放源的Magistral Small Small (Apache 2.0) ,其中还包括来自Magistral Min的冷启动数据。


Article 17

Title@2025-06-12 (4): Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning

Title: Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning Jenseits von Goldstandards: Epistemisches Ensemble von LLM-Richtern für formales mathematisches Schließen 超越黄金标准:用于形式化数学推理的LLM评判者认知集成 2506.10903v1

Authors (3): Lan Zhang, Marco Valentino, Andre Freitas

Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward addressing this gap by introducing a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an epistemically and formally grounded ensemble (EFG) of LLM judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), resulting in a transparent assessment that accounts for different contributing factors. We validate the proposed framework to serve as a proxy for autoformalization assessment within the domain of formal mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation, more strongly correlating with human assessments than a coarse-grained model, especially when assessing formal qualities. These findings suggest that LLM-as-judges, especially when guided by a well-defined set of atomic properties, could offer a scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.

自动形式化通过把自然语言陈述自动翻译成形式语言,在形式化数学推理中发挥着关键作用。虽然近期使用大型语言模型(LLM)的进展显示了有希望的结果,但自动评估自动形式化的方法仍未得到充分探索。随着转向更复杂的领域(例如高等数学),人工评估需要大量的时间和领域专业知识,尤其是在底层陈述和背景知识的复杂性增加时。LLM-as-a-judge为此类评估的自动化提供了一种有前景的方法。然而,现有方法通常采用粗粒度的通用评估标准,这限制了它们在高级形式化数学推理中的有效性,因为那里的质量取决于细致的、多粒度的维度。在这项工作中,我们引入一种系统的自动方法来评估自动形式化任务,向弥补这一差距迈出了一步。所提出的方法基于一个有认知和形式依据的LLM评判者集成(EFG),其标准涵盖逻辑保持(LP)、数学一致性(MC)、形式有效性(FV)和形式质量(FQ),从而得到一个能够反映不同影响因素的透明评估。我们验证了所提出的框架可以作为形式化数学领域内自动形式化评估的代理。总体而言,我们的实验表明,EFG评判者集成是一个合适的新兴评估代理,与粗粒度模型相比,它与人工评估的相关性更强,在评估形式质量时尤其如此。这些发现表明,LLM-as-judges,尤其是在一组定义明确的原子属性的引导下,可以为评估形式化数学推理提供可扩展、可解释且可靠的支持。


Article 18

Title@2025-06-12 (4): BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

Title: BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP BioClinical ModernBERT: Ein hochmoderner Long-Context-Encoder für biomedizinische und klinische NLP BioClinical ModernBERT:面向生物医学和临床NLP的最先进长上下文编码器 2506.10896v1

Authors (10): Thomas Sounack, Joshua Davis, Brigitte Durieux, Antoine Chaffin, Tom J. Pollard, Eric Lehman, Alistair E. W. Johnson, Matthew McDermott, Tristan Naumann, Charlotta Lindvall

Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.

基于生物医学和临床自然语言处理(NLP)的变压器模型是生物医学和临床自然语言处理(NLP)的核心,因为其双向自我关注使得它们完全适合通过歧视性任务从无结构的文本中高效地提取结构化信息,然而,与解码模型相比,编码器的发展速度较慢,导致生物医学和临床环境有限的领域适应性;我们引入了生物临床现代变压器,这是一个基于现代BERT的发布,包括长文本处理和大幅度改进生物医学和临床NLP的速度和性能的域码器。 生物临床现代BERT通过持续对迄今为止最大的生物医学和临床材料进行预先培训,有535亿多个符号,通过利用不同机构、地区和地理区域的20个数据集,而不是依赖单一来源的数据,解决了以前临床变压器的关键限制,它超越了现有生物医学和临床变压器在四种下游任务上跨越广泛用途案例。我们释放了基础(150M参数)和大型生物科学研究(396M参数)的实验室,同时对生物科学现代化实验室的检查进行了进一步培训。


Article 19

Title@2025-06-12 (4): The Diffusion Duality

Title: The Diffusion Duality Die Diffusionsdualität 扩散对偶性 2506.10892v1

Authors (6): Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov

Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: http://s-sahoo.github.io/duo

均匀状态离散扩散模型因其固有的自我纠正能力而有望实现快速文本生成。然而,它们的表现通常不如自回归模型和掩码扩散模型。在这项工作中,我们利用一个关键洞见缩小了这一性能差距:均匀状态扩散过程自然地产生于底层的高斯扩散。我们的方法Duo把高斯扩散中的强大技术迁移过来,以同时改进训练和采样。首先,我们引入由高斯过程引导的课程学习策略,通过降低方差使训练速度翻倍。采用课程学习训练的模型在7个基准中的3个上,零样本困惑度超过了自回归模型。其次,我们提出离散一致性蒸馏,把一致性蒸馏从连续设定推广到离散设定。该算法将采样加速两个数量级,从而在扩散语言模型中实现少步生成。我们在项目页面上提供代码和模型检查点:http://s-sahoo.github.io/duo
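
One way to picture how a discrete uniform-state process can arise from a Gaussian one is sketched below. It assumes the link is an argmax over a Gaussian-noised one-hot vector. That is an illustrative reading of the abstract, not a statement of the paper's exact construction. The function names and the `alpha` / `sigma` noise parameters are hypothetical.

```python
import random


def gaussian_latent(token, vocab_size, alpha, sigma, rng):
    """Gaussian-diffused one-hot vector: alpha * onehot(token) + sigma * noise."""
    return [
        alpha * (1.0 if i == token else 0.0) + sigma * rng.gauss(0.0, 1.0)
        for i in range(vocab_size)
    ]


def discretize(latent):
    """Map a continuous latent back to a token index via argmax."""
    return max(range(len(latent)), key=latent.__getitem__)


def noisy_token(token, vocab_size, alpha, sigma, rng):
    """Discrete sample induced by the Gaussian latent.

    By symmetry of the noise, whenever the result differs from `token`
    it is equally likely to be any of the other vocabulary entries.
    """
    return discretize(gaussian_latent(token, vocab_size, alpha, sigma, rng))


rng = random.Random(0)
clean = noisy_token(2, vocab_size=5, alpha=1.0, sigma=0.0, rng=rng)
corrupted = [noisy_token(2, 5, alpha=0.1, sigma=1.0, rng=rng) for _ in range(200)]
```

With no noise the original token is recovered. As `sigma` grows relative to `alpha`, the induced token drifts toward a uniform draw over the vocabulary, which is the behaviour a uniform-state discrete diffusion is meant to have.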


Article 20

Title@2025-06-12 (4): PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play

Title: PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play PLAY2PROMPT: Zero-shot Tool Instruction Optimierung für LLM Agenten über Tool Play PLAY2PROMPT: 通过工具试玩为LLM智能体进行零样本工具指令优化 2503.14432v2

Authors (5): Wei Fang, Yang Zhang, Kaizhi Qian, James Glass, Yada Zhu

Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero-shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero-shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically “plays” with each tool to explore its input-output behaviors. Through this iterative trial-and-error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real-world tasks demonstrate that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.

大型语言模型(LLMS)日益与专门的外部工具融合在一起,但许多任务要求使用零光工具,且文件很少或吵闹。现有解决方案依靠人工重写或标签数据进行验证,使其在真正的零光环境中不适用。为了应对这些挑战,我们提议PLAY2PROMPT,这是一个系统“玩耍”的自动框架,每个工具都可以探索其输入-输出行为。通过这个迭接的试验-操作程序,PLAY2PROMOT改进了工具文档,并生成了使用示例,而没有任何标签数据。这些示例不仅指导LLLM推断,而且还作为进一步增强工具利用的验证。关于现实世界任务的广泛实验表明,PLAY2PROMOT大大改善了开放和封闭模式的零光工具性,为特定域工具整合提供了可扩展的有效解决方案。


Article 21

Title@2025-06-12 (4): Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Title: Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers Verallgemeinerung oder Halluzination? Verstehen von Out-of-Context-Reasoning in Transformers 泛化还是幻觉? 理解Transformer中的上下文外推理 2506.10887v1

Authors (8): Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei

Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.

大型语言模型(LLM)可以通过微调获得新知识,但这一过程呈现出令人费解的两面性:模型可以从新事实中出色地泛化,却也容易幻觉出错误信息。然而,这一现象的原因仍未被充分理解。在这项工作中,我们认为这两种行为都源于一种被称为上下文外推理(OCR)的单一机制:通过关联概念来推断蕴含的能力,即使这些概念之间没有因果联系。我们在五个知名LLM上的实验证实,OCR确实同时驱动泛化和幻觉,具体取决于被关联的概念是否存在因果关系。为了对这一现象建立严格的理论理解,我们随后把OCR形式化为一个合成的事实回忆任务。我们通过实验表明,输出矩阵和值矩阵分开参数化的单层单头纯注意力Transformer可以学会解决该任务,而权重合并的模型则不能,这突出了矩阵分解的关键作用。我们的理论分析表明,OCR能力可归因于梯度下降的隐式偏置,它偏好使合并后的输出-值矩阵核范数最小的解。这一数学结构解释了为什么无论相关性是因果的还是仅仅是虚假的,模型都能以很高的样本效率学会把事实与蕴含关联起来。最终,我们的工作为理解OCR现象提供了理论基础,为分析和缓解知识注入带来的不良行为提供了新的视角。


Article 22

Title@2025-06-12 (4): Slimming Down LLMs Without Losing Their Minds

Title: Slimming Down LLMs Without Losing Their Minds LLMs abschwächen, ohne ihre Gedanken zu verlieren 在不失去理智的情况下将LLMs 压倒在地 2506.10885v1

Authors (2): Qingda, Mai

This paper investigates and validates the impact of fine-tuning on large language model performance, focusing on parameter-efficient methods (LoRA and QLoRA). We evaluate model capabilities across three key domains: (1) commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3) multi-domain knowledge (MMLU-CS). Our findings demonstrate that: (1) LoRA-based methods effectively improve task-specific performance while maintaining computational efficiency, and (2) performance strongly depends on alignment between fine-tuning dataset and benchmark tasks. The study provides both theoretical insights into parameter-efficient mechanisms and practical guidance for developers implementing efficient LLM adaptation with limited resources.

本文件调查并验证了微调对大型语言模型绩效的影响,重点是参数效率方法(LORA和QLORA)。我们评估了三个关键领域的模型能力:(1) 常识推理(HellaSwag)、(2) 数学推理(GSM8K)和(3) 多域知识(MMLU-CS)。我们的研究结果表明:(1) 以LORA为基础的方法有效改进了具体任务绩效,同时保持了计算效率,(2) 业绩在很大程度上取决于微调数据集和基准任务之间的协调。研究为利用有限的资源实施高效的LLMM适应开发者提供了对参数效率机制的理论见解和实际指导。
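
For readers unfamiliar with the parameter-efficient methods named above, the core of LoRA is a frozen weight matrix plus a scaled low-rank update, `W + (alpha / r) * B @ A`. The sketch below shows that arithmetic on plain nested lists. It is illustrative only and unrelated to the paper's experimental code. The helper names are hypothetical. QLoRA applies the same update on top of a quantized base model, which is not modelled here.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B A for a rank-r adapter.

    w: frozen base weight (d_out x d_in)
    a: trainable down-projection (r x d_in)
    b: trainable up-projection (d_out x r), typically initialised to zero
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d_out, d_in, rank):
    """Adapter parameter count, versus d_out * d_in for full fine-tuning."""
    return rank * (d_in + d_out)


base = [[1.0, 2.0], [3.0, 4.0]]
down = [[1.0, 1.0]]                # rank 1
up_zero = [[0.0], [0.0]]           # zero init: adapter starts as a no-op
up_trained = [[1.0], [2.0]]
at_init = lora_weight(base, down, up_zero, alpha=2.0)
after_training = lora_weight(base, down, up_trained, alpha=2.0)
```

For a 4096 x 4096 layer with rank 8, `trainable_params` gives 65,536 adapter parameters against roughly 16.8 million for the full matrix, which is where the efficiency comes from.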


Article 23

Title@2025-06-12 (4): Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment

Title: Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment Verbesserung des medizinischen Dialogs durch Wissensverfeinerung und dynamische Anpassung 通过知识完善和动态快速调整加强医疗对话 2506.10877v1

Authors (6): Hongda Sun, Jiaren Peng, Wenzhong Yang, Liang He, Bo Du, Rui Yan

Medical dialogue systems (MDS) have emerged as crucial online platforms for enabling multi-turn, context-aware conversations with patients. However, existing MDS often struggle to (1) identify relevant medical knowledge and (2) generate personalized, medically accurate responses. To address these challenges, we propose MedRef, a novel MDS that incorporates knowledge refining and dynamic prompt adjustment. First, we employ a knowledge refining mechanism to filter out irrelevant medical data, improving predictions of critical medical entities in responses. Additionally, we design a comprehensive prompt structure that incorporates historical details and evident details. To enable real-time adaptability to diverse patient conditions, we implement two key modules, Triplet Filter and Demo Selector, providing appropriate knowledge and demonstrations equipped in the system prompt. Extensive experiments on MedDG and KaMed benchmarks show that MedRef outperforms state-of-the-art baselines in both generation quality and medical entity accuracy, underscoring its effectiveness and reliability for real-world healthcare applications.

医疗对话系统(MDS)已成为与病人进行多转、有背景认识的对话的重要在线平台,然而,现有的MDS经常在努力:(1) 确定相关的医疗知识,(2) 产生个性化、医学准确的反应;为应对这些挑战,我们提议MedRef,这是一个新的MDS,它包含知识的改进和动态的迅速调整。首先,我们采用知识完善机制来过滤无关的医疗数据,改进对关键医疗实体的反应预测。此外,我们设计了一个全面迅速的结构,它包含历史细节和明显的细节。为了使实时适应不同的病人条件,我们实施了两个关键模块,即Triplet过滤器和Demo选择器,提供系统迅速配备的适当知识和演示。关于MDDG和KAMed基准的广泛实验表明,MedRef在新一代质量和医疗实体的准确性两方面都超越了最新基线,强调其对于现实世界保健应用的有效性和可靠性。


Article 24

Title@2025-06-12 (4): Large Language Models for Multilingual Previously Fact-Checked Claim Detection

Title: Large Language Models for Multilingual Previously Fact-Checked Claim Detection Große Sprachmodelle für die multilinguale bisher Fact-Checked Claim Detection 多语种以前实况调查索赔调查大语言模型 2503.02737v2

Authors (6): Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Tatiana Anikina, Michal Gregor, Marián Šimko

In our era of widespread false information, human fact-checkers often face the challenge of duplicating efforts when verifying claims that may have already been addressed in other countries or languages. As false information transcends linguistic boundaries, the ability to automatically detect previously fact-checked claims across languages has become an increasingly important task. This paper presents the first comprehensive evaluation of large language models (LLMs) for multilingual previously fact-checked claim detection. We assess seven LLMs across 20 languages in both monolingual and cross-lingual settings. Our results show that while LLMs perform well for high-resource languages, they struggle with low-resource languages. Moreover, translating original texts into English proved to be beneficial for low-resource languages. These findings highlight the potential of LLMs for multilingual previously fact-checked claim detection and provide a foundation for further research on this promising application of LLMs.

在广泛虚假信息时代,在核实其他国家或语言可能已经处理过的主张时,人类事实检查员往往面临重复努力的挑战。由于虚假信息超越语言界限,自动发现以往经过事实检查的跨语言主张的能力已成为一项越来越重要的任务。本文件首次全面评估了大语言模型(LLMs),用于以前经过事实核实的多语种主张检测。我们评估了在单一语言和跨语言环境中20种语言的7个LMs。我们的结果显示,LLMs在高资源语言方面表现良好,但与低资源语言打交道。此外,将原文翻译成英语证明对低资源语言有益。这些结果突出表明LLMs在以往经过事实核实的多语种主张检测方面的潜力,并为进一步研究LMs这一有希望的应用奠定了基础。


Article 25

Title@2025-06-12 (4): Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards

Title: Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards Segeln mit den Sternen: Eine Umfrage über Prämienmodelle und Lernstrategien zum Lernen aus Belohnungen 星舰航行:奖励模型调查以及从奖励中学习的学习战略 2505.02686v2

Authors (1): Xiaobao Wu

Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unified paradigm has arisen: Learning from Rewards, where reward signals act as the guiding stars to steer LLM behavior. It has underpinned a wide range of prevalent techniques, such as reinforcement learning (RLHF, RLAIF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, this paradigm enables the transition from passive learning from static data to active learning from dynamic feedback. This endows LLMs with aligned preferences and deep reasoning capabilities for diverse tasks. In this survey, we present a comprehensive overview of learning from rewards, from the perspective of reward models and learning strategies across training, inference, and post-inference stages. We further discuss the benchmarks for reward models and the primary applications. Finally, we highlight the challenges and future directions. We maintain a paper collection at https://github.com/bobxwu/learning-from-rewards-llm-papers.

大语言模型(LLMS)的近期发展从培训前的升级到培训后和测试时的升级。在这些发展过程中,出现了一个关键的统一模式:从奖励中学习,奖励信号是指导LLM行为的引导星;它支持了广泛的流行技术,例如强化学习(RLHF、RLAIF、DPO和GROP)、奖赏引导解码和热后校正。关键是,这一模式使被动学习从静态数据过渡到积极学习动态反馈。这一模式使LLMS具有各种任务的统一偏好和深刻推理能力。在这次调查中,我们从奖励模式和学习战略的角度,从培训、推论和推论后阶段,全面概述了从奖励模式和学习的学习。我们进一步讨论奖励模式的基准和主要应用。最后,我们强调挑战和未来方向。我们在 https://github.com/bobxwu/learking-wherd-llm-papers。我们保留一个文件的论文收藏。


Article 26

Title@2025-06-12 (4): Multi-group Uncertainty Quantification for Long-form Text Generation

Title: Multi-group Uncertainty Quantification for Long-form Text Generation Multi-Gruppen-Unsicherheits-Quantifizierung für langformige Textgenerierung 长式文本生成的不确定性量化 2407.21057v2

Authors (2): Terrance Liu, Zhiwei Steven Wu

While past works have shown how uncertainty quantification can be applied to large language model (LLM) outputs, the question of whether resulting uncertainty guarantees still hold within sub-groupings of data remains open. In our work, given some long-form text generated by an LLM, we study uncertainty at both the level of individual claims contained within the output (via calibration) and across the entire output itself (via conformal prediction). Using biography generation as a testbed for this study, we derive a set of (demographic) attributes (e.g., whether some text describes a man or woman) for each generation to form such “subgroups” of data. We find that although canonical methods for both types of uncertainty quantification perform well when measuring across the entire dataset, such guarantees break down when examining particular subgroups. Having established this issue, we invoke group-conditional methods for uncertainty quantification – multicalibration and multivalid conformal prediction – and find that across a variety of approaches, additional subgroup information consistently improves calibration and conformal prediction within subgroups (while crucially retaining guarantees across the entire dataset). As the problems of calibration, conformal prediction, and their multi-group counterparts have not been extensively explored in the context of long-form text generation, we consider these results to form a benchmark for this setting.

过去的工作已经表明不确定性量化如何应用于大语言模型(LLM)的输出,但由此得到的不确定性保证在数据子群内是否仍然成立,这一问题仍未解决。在我们的工作中,给定LLM生成的长文本,我们既在输出所含单个论断的层面(通过校准)研究不确定性,也在整个输出的层面(通过共形预测)研究不确定性。我们以传记生成作为本研究的测试平台,为每条生成文本导出一组(人口统计)属性(例如文本描述的是男性还是女性),以形成这类数据“子群”。我们发现,尽管两类不确定性量化的经典方法在整个数据集上测量时表现良好,但在考察特定子群时,这些保证会失效。在确认这一问题后,我们采用面向群体条件的不确定性量化方法,即多重校准(multicalibration)和多有效共形预测(multivalid conformal prediction),并发现在多种方法中,额外的子群信息都能持续改进子群内的校准和共形预测(同时关键地保留了整个数据集上的保证)。由于校准、共形预测及其多群体版本的问题尚未在长文本生成的背景下得到广泛探讨,我们认为这些结果构成了该设定下的一个基准。
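
The gap between dataset-level and subgroup-level guarantees described in this abstract can be illustrated with a small sketch. The code below is not the authors' method: it assumes a standard binned expected calibration error (ECE) and uses hypothetical toy data in which confidences look perfectly calibrated overall yet are badly miscalibrated inside each subgroup. The function names, the bin count, and the data are all assumptions made for illustration.

```python
from collections import defaultdict


def expected_calibration_error(confidences, labels, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = defaultdict(list)
    for conf, label in zip(confidences, labels):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, label))
    total = len(confidences)
    ece = 0.0
    for items in bins.values():
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += len(items) / total * abs(accuracy - mean_conf)
    return ece


def ece_by_group(confidences, labels, groups):
    """Compute ECE separately for each subgroup label."""
    by_group = defaultdict(lambda: ([], []))
    for conf, label, group in zip(confidences, labels, groups):
        by_group[group][0].append(conf)
        by_group[group][1].append(label)
    return {g: expected_calibration_error(cs, ys) for g, (cs, ys) in by_group.items()}


# Hypothetical toy data: every claim gets confidence 0.5.
# Group "a" claims are all correct, group "b" claims are all wrong.
confidences = [0.5] * 8
labels = [1, 1, 1, 1, 0, 0, 0, 0]
groups = ["a"] * 4 + ["b"] * 4

overall = expected_calibration_error(confidences, labels)
per_group = ece_by_group(confidences, labels, groups)
print(overall, per_group)
```

In this toy case the overall error is zero because the two groups cancel out, while each group on its own is off by 0.5, which is the kind of failure that group-conditional methods such as multicalibration are meant to address.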


Article 27

Title@2025-06-12 (4): Debiasing Watermarks for Large Language Models via Maximal Coupling

Title: Debiasing Watermarks for Large Language Models via Maximal Coupling Debiasing Wasserzeichen für große Sprachmodelle über Maximal Coupling 通过Maximal Coupling为大语言模型减少对水标记的偏差 2411.11203v2

Authors (5): Yangxinyu Xie, Xiang Li, Tanwi Mallick, Weijie J. Su, Ruixun Zhang

Watermarking language models is essential for distinguishing between human and machine-generated text and thus maintaining the integrity and trustworthiness of digital communication. We present a novel green/red list watermarking approach that partitions the token set into “green” and “red” lists, subtly increasing the generation probability for green tokens. To correct token distribution bias, our method employs maximal coupling, using a uniform coin flip to decide whether to apply bias correction, with the result embedded as a pseudorandom watermark signal. Theoretical analysis confirms this approach’s unbiased nature and robust detection capabilities. Experimental results show that it outperforms prior techniques by preserving text quality while maintaining high detectability, and it demonstrates resilience to targeted modifications aimed at improving text quality. This research provides a promising watermarking solution for language models, balancing effective detection with minimal impact on text quality.

水标识语言模型对于区分人类和机器生成的文本,从而保持数字通信的完整性和可信赖性至关重要。我们展示了一种新的绿色/红列表水标识方法,将标牌分割成“绿色”和“红”列表,小于增加绿色标牌的生成概率。为了纠正象征性分布偏差,我们的方法采用了最大程度的组合,使用统一硬币翻转来确定是否应用偏向纠正,结果嵌入伪随机水标识。理论分析证实了这一方法的公正性和稳健的检测能力。实验结果表明,它通过在保持高可探测性的同时维护文本质量,优于先前的技术,并展示了对旨在改善文本质量的有针对性的修改的复原力。这一研究为语言模型提供了一个有希望的标注水解决方案,平衡了有效检测,对文本质量的影响最小。


Article 28

Title@2025-06-12 (4): Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models

Title: Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models Analyse der Beziehungen zwischen vorschulischer Sprache, phonetischer, klanglicher und sprachlicher Information in selbstüberwachten Sprachmodellen 以自我监督的演讲模式分析培训前语言、音、音、音、音和演讲者信息之间的关系 2506.10855v1

Authors (5): Michele Gubian, Ioana Krehan, Oli Liu, James Kirby, Sharon Goldwater

Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.

对自我监督的演讲模式的分析已开始揭示它们代表不同类型信息的地点和方式,然而,几乎所有的分析都侧重于英语。在这里,我们研究在四种不同语言上培训的 wav2vec2 模式如何将语言匹配和非匹配的演讲编码为四种不同的语言。我们使用测试分类和几何分析来检查电话、词汇音调和发言者信息是如何代表的。我们显示,对于所有预培训和测试语言而言,亚空间编码电话、音调和发言者基本上都是正统的,而不同层次的精确度模式是相似的,对匹配语言的电话和语调(但非语调)在后几层的探测具有相对小的优势。我们的调查结果表明,Wav2vec2所学的表述结构基本上独立于预培训期间使用的语音材料。


Article 29

Title@2025-06-12 (4): Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles

Title: Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles Beschleunigen von Diffusion Großsprachige Modelle mit SlowFast: Die drei goldenen Prinzipien 以慢速加速传播大语言模式:三大金原则 2506.10848v1

Authors (5): Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Dongrui Liu, Linfeng Zhang

Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.

基于扩散的语言模型(dLLM)支持并行生成词元并显著降低推理延迟,已成为传统自回归LLM的一种有前景的替代方案。然而,现有的dLLM采样策略,例如基于置信度的解码或半自回归解码,往往行为固定,导致效率欠佳且灵活性有限。在本文中,我们提出SlowFast Sampling,这是一种新的动态采样策略,可在探索式解码阶段和加速解码阶段之间自适应地交替。我们的方法以三项黄金原则为指导:确定性原则、收敛原则和位置原则,它们决定了何时何地可以自信且高效地解码词元。我们进一步将该策略与dLLM-Cache结合,以减少冗余计算。在多个基准和模型上的大量实验表明,SlowFast Sampling在LLaDA上可实现最高15.63$\times$的加速且精度下降极小,与缓存结合时最高可达34.22$\times$。值得注意的是,我们的方法在吞吐量上超过了LLaMA3 8B等强大的自回归基线,表明设计良好的采样可以充分释放dLLM在快速、高质量生成方面的潜力。


Article 30

Title@2025-06-12 (4): CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training

Title: CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training CIIR@LiveRAG 2025: Optimierung der Multi-Agent Retrieval Augmented Generation durch Selbsttraining CIIR@LiveRAG 2025:通过自我培训优化多要求回生增生一代 2506.10844v1

Authors (3): Alireza Salemi, Mukta Maddipatla, Hamed Zamani

This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework’s strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.

本文介绍由规划、搜索、推理和协调等子任务专门代理组成的多试剂检索-增强型新一代框架(RAG),我们的系统采用自培训模式,采用奖赏引导轨迹抽样,优化机构间协作,加强反应生成。在SIGIR 2025 LiveRAG 竞争期间对数据-有机衍生数据集进行了评估,该模型优于常规RAG基线。我们进一步分析竞争结果,通过案例研究展示框架的优势,展示框架在复杂、现实世界的RAG任务方面的功效。


Article 31

Title@2025-06-12 (4): ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization

Title: ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization ReCUT: Ausbalancierende Grundlänge und Genauigkeit in LLMs über Schrittweise Trails und Preference Optimization RECUT:通过分步跟踪和优化优化平衡长长和准确性 2506.10822v1

Authors (10): Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu, Qi Shi, Yukun Yan, Shuo Wang, Furong Peng, Ge Yu

Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue by curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data, and they are prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of the reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs), one optimized for reasoning accuracy and the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All code and data will be released via https://github.com/NEUIR/ReCUT.

思维链(CoT)提示的最新进展大幅提升了大语言模型(LLM)的推理能力。然而,这些方法往往存在过度思考的问题,导致推理轨迹不必要地冗长或重复。现有方法试图通过整理多条推理链来训练LLM以缓解这一问题,但其效果往往受限于生成数据的质量,且容易过拟合。为应对这一挑战,我们提出Reasoning Compression ThroUgh Stepwise Trials(ReCUT),这是一种旨在平衡推理轨迹准确性和长度的新方法。具体而言,ReCUT采用逐步探索机制和长短切换采样策略,使LLM能够逐步生成多样的推理路径。这些路径经过评估后用于构建偏好对,以训练两个专门模型(Gemini LLMs),一个针对推理准确性优化,另一个针对更短的推理优化。最终的集成模型通过对这两个模型的参数进行插值得到。在多个数学推理数据集和骨干模型上的实验结果表明,与各种基线相比,ReCUT可将推理长度显著缩短约30-50%,同时保持或提高推理准确性。所有代码和数据将通过 https://github.com/NEUIR/ReCUT 发布。
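
The final step in this abstract, obtaining one model by interpolating the parameters of the accuracy-oriented and brevity-oriented models, can be sketched as follows. The abstract does not state the interpolation scheme, so this is only a minimal sketch assuming plain linear interpolation with a single coefficient `alpha`; the parameter names, the flat-list representation of weights, and the toy values are hypothetical.

```python
def interpolate_parameters(params_a, params_b, alpha=0.5):
    """Return alpha * params_a + (1 - alpha) * params_b, tensor by tensor.

    Both arguments map parameter names to flat lists of floats
    (an assumed stand-in for real model state dictionaries).
    """
    if params_a.keys() != params_b.keys():
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name in params_a:
        a, b = params_a[name], params_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged


# Hypothetical toy "models" with two parameter tensors each.
accuracy_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
short_model = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = interpolate_parameters(accuracy_model, short_model, alpha=0.5)
print(merged)
```

Under this assumption, `alpha` acts as a dial between the two behaviours: values near 1 stay close to the accuracy-oriented model, values near 0 stay close to the shorter-reasoning model.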


Article 32

Title@2025-06-12 (4): VideoDeepResearch: Long Video Understanding With Agentic Tool Using

Title: VideoDeepResearch: Long Video Understanding With Agentic Tool Using VideoDeepResearch: Langes Video-Verstehen mit Agentischem Werkzeug 视频深入研究:与使用代理工具的远程视频了解 2506.10821v1

Authors (5): Huaying Yuan, Zheng Liu, Junjie Zhou, Ji-Rong Wen, Zhicheng Dou

Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task’s inherent complexity and context window constraints. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool use. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.

长期视频理解(LVU)对当前多式大型语言模型(MLLM)提出了重大挑战,原因是任务本身的复杂性和上下文窗口的限制。人们普遍认为,解决LVU的任务需要基础MLLM, 其背景窗口宽广,视觉感知能力强,领域专长精良。在这项工作中,我们通过引入视频深入研究(一个用于长期视频理解的新颖的代理框架)来挑战这一共同信念。我们的方法完全依赖于一个只有文本的大型推理模型(LRM)和一个模块化的多式多式工具包,包括多式联运检索器和视觉观察器,所有这些工具都在实践中很容易获得。对于LVU的每一项任务,系统都通过推理、有选择地访问和利用基本视频内容来制定解决问题的战略。我们开展了广泛的LVU基准实验,包括MLVU、VM-MME和LVBench。我们的研究结果表明,VLLMLMLMLMLV系统所面临的挑战比LLLMV系统、LVVVLV和3.9%的挑战。


Article 33

Title@2025-06-12 (4): Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints

Title: Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints Negative Interferenzen in der Mehrsprachigkeit sequenzieller Wissensbearbeitung durch Null-Raum-Beschränkungen abmildern 减少多语种序列知识编辑的负面干扰 2506.10800v1

Authors (5): Wei Sun, Tingyu Qu, Mingxiao Li, Jesse Davis, Marie-Francine Moens

Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previous updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at https://github.com/VRCMF/LangEdit.git.

在大语言模型(LLM)中高效更新多语言知识,同时保持各语言之间一致的事实表示,仍然是一个长期存在且尚未解决的挑战。为每种语言部署单独的编辑系统看似可行,但由于需要管理多个模型,这种做法成本高昂。更高效的方案是将所有语言的知识更新整合到一个统一的模型中。然而,跨语言的顺序编辑往往导致破坏性的参数干扰,显著降低多语言泛化能力和注入知识的准确性。为应对这一挑战,我们提出LangEdit,这是一个新的零空间约束框架,旨在精确隔离特定语言的知识更新。LangEdit的核心创新在于能够将每种语言的参数更新投影到先前已更新子空间的正交补上。这种方法在数学上保证了更新的独立性,同时保留多语言泛化能力。我们在三种模型架构、六种语言和四项下游任务上进行了全面评估,表明LangEdit能有效缓解参数干扰,并优于现有最先进的编辑方法。我们的结果凸显了它在LLM中实现高效、准确的多语言知识更新的潜力。代码见 https://github.com/VRCMF/LangEdit.git。
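
The core operation named in this abstract, projecting a new update onto the orthogonal complement of the subspace spanned by earlier updates, can be sketched on plain vectors. This is not the LangEdit implementation, which operates on model weights and is not described here in detail; the sketch assumes updates are flat vectors, builds an orthonormal basis with Gram-Schmidt, and uses hypothetical helper names and toy numbers.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_onto_complement(update, previous_updates):
    """Remove from `update` every component lying in span(previous_updates)."""
    projected = list(update)
    for b in orthonormal_basis(previous_updates):
        coeff = dot(projected, b)
        projected = [p - coeff * bi for p, bi in zip(projected, b)]
    return projected


# Hypothetical toy example: two earlier updates span the x-y plane.
previous_updates = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = [2.0, 3.0, 4.0]

projected = project_onto_complement(new_update, previous_updates)
print(projected)
```

After projection the new update has zero inner product with every earlier update direction, which is the sense in which such a constraint keeps later edits from overwriting earlier ones in this simplified setting.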


Article 34

Title@2025-06-12 (4): The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages

Title: The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Der Esethu-Rahmen: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Esethu框架:重新想象可持续数据集治理和低碳语言的理论 2502.15916v2

Authors (15): Jenalea Rajab, Anuoluwapo Aremu, Everlyn Asiko Chimoto, Dale Dunbar, Graham Morrissey, Fadel Thior, Luandrie Potgieter, Jessico Ojo, Atnafu Lambebo Tonja, Maushami Chetty, Wilhelmina NdapewaOnyothi Nekoto, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman

This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resources. This framework is supported by the Esethu license, a novel community-centric data license. As a proof of concept, we introduce the Vuk’uzenzele isiXhosa Speech Dataset (ViXSD), an open-source corpus developed under the Esethu Framework and License. The dataset, containing read speech from native isiXhosa speakers enriched with demographic and linguistic metadata, demonstrates how community-driven licensing and curation principles can bridge resource gaps in automatic speech recognition (ASR) for African languages while safeguarding the interests of data creators. We describe the framework guiding dataset development, outline the Esethu license provisions, present the methodology for ViXSD, and present ASR experiments validating ViXSD’s usability in building and refining voice-driven applications for isiXhosa.

本文介绍了埃塞图框架,这是专门为赋予地方社区权力和确保公平分享其语言资源利益而设计的可持续的数据整理框架,该框架得到了埃塞图许可证的支持,这是一个以社区为中心的新型数据验证。作为概念的证明,我们引入了Vuk’uzenzele isiXhosa语音数据集(ViXSD),这是根据埃塞图框架和许可开发的开放源码。数据集包含当地使用人口和语言元数据的异己索语者的读话,显示社区驱动的许可和整理原则如何在保护数据创建者利益的同时,弥补非洲语言自动语音识别(ASR)方面的资源差距。我们描述了指导数据集开发的框架,概述了埃塞图许可条款,介绍了维塞德的方法学,并介绍了ASR实验,验证了维克斯德在建立和完善伊西索语语音驱动应用程序方面的可用性。


Article 35

Title@2025-06-12 (4): FASCIST-O-METER: Classifier for Neo-fascist Discourse Online

Title: FASCIST-O-METER: Classifier for Neo-fascist Discourse Online FASCIST-O-METER: Klassifikator für neofaschistischen Diskurs Online FASCIST-O-METER:新法西斯人士在线论文分类 2506.10789v1

Authors (4): Rudy Alexandro Garrido Veliz, Martin Semmann, Chris Biemann, Seid Muhie Yimam

Neo-fascism is a political and societal ideology that has grown remarkably over the last decade in the United States of America (USA), as well as in other Western societies. It poses a grave danger to democracy and the minorities it targets, and it requires active countermeasures to avoid escalation. This work presents the first-of-its-kind neo-fascist coding scheme for digital discourse in the USA societal context, overseen by political science researchers. Our work bridges the gap between Natural Language Processing (NLP) and political science against this phenomenon. Furthermore, to test the coding scheme, we collect a tremendous amount of activity on the internet from notable neo-fascist groups (the forums of Iron March and Stormfront.org), and the guidelines are applied to a subset of the collected posts. Through crowdsourcing, we annotate a total of a thousand posts that are labeled as neo-fascist or non-neo-fascist. With this labeled data set, we fine-tune and test both Small Language Models (SLMs) and Large Language Models (LLMs), obtaining the very first classification models for neo-fascist discourse. We find that neo-fascist rhetoric is ever-present in this kind of forum, making these forums a good target for future research. The societal context is a key consideration for neo-fascist speech when conducting NLP research. Finally, the work against this kind of political movement must be pressed forward and continued for the well-being of a democratic society. Disclaimer: This study focuses on detecting neo-fascist content in text, similar to other hate speech analyses, without labeling individuals or organizations.

新法西斯主义是一种政治和社会意识形态,在过去十年中,美利坚合众国(美国)和其他西方社会都取得了显著的增长。它给民主和它所针对的少数群体带来了严重威胁,需要对此采取积极的行动以避免升级。这项工作展示了美国社会背景下首创的新法西斯主义数字对话编码计划,由政治科学研究者监督。我们的工作弥合了美国自然语言处理和政治科学与这一现象之间的差距。此外,为了测试编码计划,我们从著名的新法西斯主义团体(铁马和斯托布罗德论坛.org)那里收集了大量的互联网活动,并且指导方针被应用于收集的一组文章。通过众包,我们共发了一千个被贴上新法主义者或非新法主义者标签的网页。我们的工作缩小了语言处理(NLP)和政治科学与这一现象之间的差距。我们精细测试了小语言模型(MLMs)和大语言模型(LLIMS)的编码机制,我们收集了大量的互联网上的活动量,使得一个收集到的系列文章。通过众集,我们注意到了一千个被称为新法主义者或非新法主义者的新法分子的新政理论理论的标签。我们不得不在这样的新理论研究中找到一个最新的研究模式。


Article 36

Title@2025-06-12 (4): Improving Named Entity Transcription with Contextual LLM-based Revision

Title: Improving Named Entity Transcription with Contextual LLM-based Revision Verbesserung der Transkription der benannten Entität mit kontextueller LLM-basierter Revision 改进以背景LLM为基础订正的命名实体跟踪 2506.10779v1

Authors (3): Viet Anh Trinh, Xinlu He, Jacob Whitehill

With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM’s reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30% relative WER reduction for named entities.

由于在建模方面的最近进展以及受监督的培训数据数量不断增加,自动语音识别系统在一般演讲中取得了显著的成绩,但是,对被指名的实体来说,最先进的ASR系统的单词错误率仍然很高,由于被指名的实体往往是最关键的关键词,因此它们被误认可能影响所有下游应用,特别是当ASR系统发挥复杂系统前端功能时。在本文件中,我们采用了一个大型语言模型(LLM)修订机制,利用LLM的推理能力以及包含一套正确命名实体的当地背景(例如讲演说明)来修订ASR预测中不正确的名称实体。最后,我们采用了NER-MIT-OpenCournorner数据集,其中包含MIT-OpenCoursWare课程45小时的开发和测试数据。在这个数据集中,我们提议的技术为被指名实体实现了30个相对的WER削减。


Article 37

Title@2025-06-12 (4): Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs

Title: Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs Unterschiedliche Fragen, unterschiedliche Modelle: Feinkörnige Bewertung von Unsicherheit und Kalibrierung in klinischen QA mit LLMs 不同问题、不同模式:对临床质量评估中不确定性和校准的精细评估 2506.10769v1

Authors (2): Alberto Testoni, Iacer Calixto

Accurate and well-calibrated uncertainty estimates are essential for deploying large language models (LLMs) in high-stakes domains such as clinical decision support. We present a fine-grained evaluation of uncertainty estimation methods for clinical multiple-choice question answering, covering ten open-source LLMs (general-purpose, biomedical, and reasoning models) across two datasets, eleven medical specialties, and six question types. We compare standard single-generation and sampling-based methods, and present a case study exploring simple, single-pass estimators based on behavioral signals in reasoning traces. These lightweight methods approach the performance of Semantic Entropy while requiring only one generation. Our results reveal substantial variation across specialties and question types, underscoring the importance of selecting models based on both the nature of the question and model-specific strengths.

对于在临床决策支持等高风险领域部署大型语言模型(LLMs)来说,准确和经过良好校准的不确定性估计是关键所在。我们对临床多选择问题回答的不确定性估计方法进行了精细评估,涉及10个开放源LMs(普通用途、生物医学和推理模型),涉及两个数据集、11个医学专业和6个问题类型。我们比较了标准的单代方法和抽样方法,并提出了一个案例研究,探讨基于理性跟踪中的行为信号的简单、单行测算器。这些轻量方法在只要求一代人的情况下,对待Smantic Entropy的性能。我们的结果揭示了不同专业和问题类型之间的巨大差异,强调了基于问题性质和具体模型优势选择模型的重要性。


Article 38

Title@2025-06-12 (4): Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation

Title: Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation Chain-of-Code Collapse: Gründe für Fehler in LLMs über Adversarial Prompting in der Code-Generierung 崩溃链:通过代码生成中的反向提示造成LLMs中失败的原因 2506.06971v2

Authors (4): Jaechul Roh, Varun Gandhi, Shivani Anilkumar, Arin Garg

Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning, such as code generation, mathematical problem solving, and algorithmic synthesis – especially when aided by reasoning tokens and Chain-of-Thought prompting. Yet, a core question remains: do these models truly reason, or do they merely exploit shallow statistical patterns? In this paper, we introduce Chain-of-Code Collapse, where we systematically investigate the robustness of reasoning LLMs by introducing a suite of semantically faithful yet adversarially structured prompt perturbations. Our evaluation – spanning 700 perturbed code generations derived from LeetCode-style problems – applies transformations such as storytelling reframing, irrelevant constraint injection, example reordering, and numeric perturbation. We observe that while certain modifications severely degrade performance (with accuracy drops of up to 42.1%), others surprisingly improve model accuracy by up to 35.3%, suggesting sensitivity not only to semantics but also to surface-level prompt dynamics. These findings expose the fragility and unpredictability of current reasoning systems, underscoring the need for more principled approaches to reasoning alignment and prompting robustness. We release our perturbation datasets and evaluation framework to promote further research in trustworthy and resilient LLM reasoning.

大型语言模型(LLMS)在需要复杂推理的任务中取得了显著的成功,例如代码生成、数学问题解决和算法合成 – – 特别是在推理符号和推理链的推动下。然而,一个核心问题仍然是:这些模型是否真正合理,或者它们只是利用浅浅的统计模式?在本文中,我们引入了Code Crounds 链,我们系统地调查推理的稳健性,我们引入了一套精密忠实但有对抗性结构的快速扰动。我们的评估 – – 覆盖700个来自LeetCode风格问题的跨周期代码代 – – 应用了诸如讲故事重编、无关的制约注入、示例重新排序和数字扰动等转换。我们注意到,虽然某些修改严重地降低了性(精确率下降至-42.1 % ) ,但其他一些令人惊讶的是,通过高达35.3%,表明不仅对语义学敏感,而且对地表级快速动态。这些结论暴露了当前推理系统的脆弱性和不可预测性,强调需要更精确的推理学方法来推理和推理。


Article 39

Title@2025-06-12 (4): One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

Title: One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers Ein Tokenizer, um sie alle zu beherrschen: Emergente Sprachplastizität über Mehrsprachige Tokenizer 万能统治者:通过多语种教育者实现新兴语言的可塑性 2506.10766v1

Authors (9): Diana Abagyan, Alejandro R. Salamanca, Andres Felipe Cruz-Salinas, Kris Cao, Hangyu Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve “language plasticity”, or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.

由于模型容量有限、数据质量差、以及计算限制因素,同时对许多语言进行大规模多语言大语言模型(LLMS)的预培训具有挑战性。此外,由于缺少标识器的语言覆盖面,因此很难完全在培训后阶段解决新语言的差距。在这项工作中,我们研究了培训初期哪些相对廉价的干预措施改善了“语言可塑性”或示范后培训适应新语言的能力。我们侧重于标识器设计,并提议使用通用的标识器,该标识器比初级培训前语言多,经过培训后语言培训后能够有效地适应扩大语言覆盖面。我们在不同语言组别和不同培训战略的系统实验表明,通用标识器能够大大提高语言适应性,与培训前语言中特有的标识器相比,双赢率高达20.2%。此外,通用标识器还能通过高达5%的赢率,使在标识器和预培训前语言中完全看不见的语文更加具有塑料性。我们实现了对扩大的一套语言的适应,在培训前多数语言的绩效上实现最低限度的妥协。


Article 40

Title@2025-06-12 (4): Aspect-Based Opinion Summarization with Argumentation Schemes

Title: Aspect-Based Opinion Summarization with Argumentation Schemes Aspektbasierte Zusammenfassung der Meinungen mit Argumentierungsschemata 与参数说明方案对照审计意见的概述 2506.09917v2

Authors (3): Wendi Zhou, Ameer Saadat-Yazdi, Nadin Kokciyan

Reviews are valuable resources for customers making purchase decisions in online shopping. However, it is impractical for customers to go over the vast number of reviews and manually conclude the prominent opinions, which prompts the need for automated opinion summarization systems. Previous approaches, either extractive or abstractive, face challenges in automatically producing grounded aspect-centric summaries. In this paper, we propose a novel summarization system that not only captures predominant opinions from an aspect perspective with supporting evidence, but also adapts to varying domains without relying on a pre-defined set of aspects. Our proposed framework, ASESUM, summarizes viewpoints relevant to the critical aspects of a product by extracting aspect-centric arguments and measuring their salience and validity. We conduct experiments on a real-world dataset to demonstrate the superiority of our approach in capturing diverse perspectives of the original reviews compared to new and existing methods.

在网上购物中,审查是客户做出采购决定的宝贵资源,然而,客户不切实际地翻过大量审查,手工完成突出意见,这促使需要自动化的意见总结系统。以前的做法,无论是采掘还是抽象的做法,在自动产生有根有据的、以方为中心的摘要方面都面临挑战。在本文件中,我们提议一个新的总结系统,不仅从支持性证据的角度收集主要意见,而且适应不同的领域,而不必依赖一套预先界定的方面。我们提议的框架,即ASESUM, 总结了与产品关键方面有关的观点,提取了以方为中心的论点,并衡量其显著性和有效性。我们在现实世界数据集上进行了实验,以表明我们的方法在捕捉与新的和现有方法相比的最初审查的不同观点方面具有优势。


Article 41

Title@2025-06-12 (4): Great Models Think Alike and this Undermines AI Oversight

Title: Great Models Think Alike and this Undermines AI Oversight Große Modelle denken ähnlich und dies unterminiert AI Oversight 伟大的模特儿们想着类似的想法 和这枚地下地雷 AI监督 2502.04313v2

Authors (9): Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as “AI Oversight”. We study how model similarity affects both aspects of AI oversight by proposing Chance Adjusted Probabilistic Agreement (CAPA): a metric for LM similarity based on overlap in model mistakes. Using CAPA, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find that complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from “weak-to-strong generalization”. As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend: model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.

随着语言模型(LM)能力的进步、评估和在规模上监督这些能力对人类来说越来越困难。希望其他语言模型能够使这两项任务自动化,我们称之为“AI 监督 ” 。我们研究模型相似性如何影响AI监督的两方面,提出“CAPA 机会调整概率协议(CAPA) : 一种基于模式错误重叠的LM相似性衡量标准。我们使用CAPA,我们首先显示LLM-as-a-judge 评分类似法官的偏爱模式,概括最近的自我参照结果。然后,我们研究LM 说明,并在弱的主管和强的学生模型之间找到互补知识,在“Weak-en-strong-processional化”的收益中发挥着关键作用。随着模型能力的增长,发现错误变得更加困难,我们可能更偏重于AI监督。然而,我们观察到一种趋势 — 模型错误正越来越接近于日益增强的能力,指出了相关失败的风险。我们的工作强调了报告和纠正模式相似性的重要性,特别是在新兴的AI 监督模式中。


Article 42

Title@2025-06-12 (4): Persistent Topological Features in Large Language Models

Title: Persistent Topological Features in Large Language Models Persistente Topologische Features in großen Sprachmodellen 大语言模式中的持久性有机污染物特征 2410.11042v2

Authors (6): Yuri Gardinazzi, Karthik Viswanathan, Giada Panerai, Alessio Ansuini, Alberto Cazzaniga, Matteo Biagetti

Understanding the decision-making processes of large language models is critical given their widespread applications. To achieve this, we aim to connect a formal mathematical framework – zigzag persistence from topological data analysis – with practical and easily applicable algorithms. Zigzag persistence is particularly effective for characterizing data as it dynamically transforms across model layers. Within this framework, we introduce topological descriptors that measure how topological features, $p$-dimensional holes, persist and evolve throughout the layers. Unlike methods that assess each layer individually and then aggregate the results, our approach directly tracks the full evolutionary path of these features. This offers a statistical perspective on how prompts are rearranged and their relative positions changed in the representation space, providing insights into the system’s operation as an integrated whole. To demonstrate the expressivity and applicability of our framework, we highlight how sensitive these descriptors are to different models and a variety of datasets. As a showcase application to a downstream task, we use zigzag persistence to establish a criterion for layer pruning, achieving results comparable to state-of-the-art methods while preserving the system-level perspective.

理解大语言模型的决策过程至关重要,因为大语言模型的应用十分广泛。为了实现这一点,我们力求将一个形式化的数学框架(来自拓扑数据分析的zigzag持久性)与实用且易于应用的算法联系起来。Zigzag持久性特别适合刻画数据在模型各层之间动态变换时的特征。在此框架内,我们引入拓扑描述符,用来度量拓扑特征($p$维洞)如何在各层中持续和演变。不同于对每一层单独评估再汇总结果的方法,我们的方法直接跟踪这些特征的完整演化路径。这提供了一种统计视角,说明提示在表示空间中如何被重新排列及其相对位置如何变化,从而把系统作为一个整体来洞察其运作。为了展示我们框架的表达能力和适用性,我们强调这些描述符对不同模型和各种数据集的敏感度。作为下游任务的示范应用,我们用zigzag持久性建立层剪枝标准,在保持系统层面视角的同时,取得与最先进方法相当的结果。


Article 43

Title@2025-06-12 (4): Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering

Title: Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering Neural bei ArchEHR-QA 2025: Agentische Prompt-Optimierung für evidenzgerundete klinische Fragen ArchEHR-QA 2025:证据四舍五入临床问题解答的代理快速优化 2506.10751v1

Authors (6): Sai Prasanna Teja Reddy Bogireddy, Abrar Majeedi, Viswanatha Reddy Gajjala, Zhuoyan Xu, Siddhant Rai, Vaishnav Potlapalli

Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.

对电子健康记录自动回答问题(QA)可以弥补临床医生和病人的关键信息差距,但要求准确的证据检索和在有限监督下忠实地生成答案。在这项工作中,我们介绍BioNLP 2025年ArchEHR-QA的后排神经系统,共同执行有证据根据的临床质量分析任务。我们建议的方法将这项任务分为(1) 判决一级证据识别和(2) 回答合成,并明确引用。在每一个阶段,我们自动探索与DSPy’s MIPROv2优化器、联合调整指示和对成套开发的微小的演示的迅速空间。自我一致投票计划可以进一步改进证据召回,而不牺牲精确性。在隐藏的测试中,我们的方法达到总分51.5分,处于第二个阶段,同时超过标准零发数和几发点,分别超过20分和10分。这些结果表明,数据驱动的快速优化是高取临床质量分析模型的成本效益替代方法,提高AI助理在医疗保健方面的可靠性。
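The self-consistency voting mentioned in the Neural abstract can be illustrated with a minimal sketch. The abstract does not specify the voting rule, so the majority threshold, the function name `vote_evidence`, and the sample data below are assumptions made purely for illustration: several sampled runs each propose a set of evidence sentence IDs, and a sentence is kept if enough runs agree on it.

```python
from collections import Counter


def vote_evidence(runs, min_votes):
    """Keep sentence IDs that appear in at least `min_votes` sampled runs.

    `runs` is a list of lists of sentence IDs, one list per sampled
    model output. The threshold rule is an illustrative assumption,
    not the scheme used by the Neural system.
    """
    # Count each sentence at most once per run.
    counts = Counter(sid for run in runs for sid in set(run))
    return sorted(sid for sid, votes in counts.items() if votes >= min_votes)


# Hypothetical outputs of five sampled runs over the same clinical note.
sample_runs = [[1, 3, 4], [1, 4], [1, 2, 4], [4, 5], [1, 4, 5]]
print(vote_evidence(sample_runs, min_votes=3))
```

Lowering `min_votes` trades precision for recall: more sentences survive the vote, including ones that only a minority of runs selected.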


Article 44

Title@2025-06-12 (4): TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora

Title: TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora TaxoAdapt: LLM-basierte multidimensionale Taxonomie-Konstruktion an die sich entwickelnde Forschungskorporation ausrichten 将基于LLM的多层面分类学建设与不断发展的研究公司相协调 2506.10737v1

Authors (6): Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, Jiawei Han

The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy width and depth based on the corpus’s topical distribution. We demonstrate its state-of-the-art performance across a diverse set of computer science conferences over the years to showcase its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines, as judged by LLMs.

科学领域的迅速演变给组织和检索科学文献带来了挑战。尽管专家编制的分类体系传统上满足了这一需要,但这一过程耗时且昂贵。此外,最近的自动分类体系构建方法要么(1)过度依赖某一特定语料库,牺牲了泛化性,要么(2)严重依赖大语言模型(LLM)预训练数据集中所含的一般知识,往往忽视不断演变的科学领域的动态性质。此外,这些方法没有考虑到科学文献的多面性质,即一篇研究论文可能在多个维度上作出贡献(例如方法、新任务、评价指标、基准)。为了弥补这些差距,我们提出TaxoAdapt,这是一个能在多个维度上将LLM生成的分类体系动态适配到给定语料库的框架。TaxoAdapt进行迭代式层次分类,根据语料库的主题分布扩展分类体系的广度和深度。我们在多年来多个不同的计算机科学会议上展示了其最先进的性能,以展现其组织和刻画科学领域演变的能力。作为一种多维方法,经LLM评判,TaxoAdapt生成的分类体系比最具竞争力的基线在粒度保持方面高26.51%,在连贯性方面高50.41%。


Article 45

Title@2025-06-12 (4): Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims

Title: Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims 超越真或假:收回增加的无损失索赔的等级结构分析 2506.10728v1

Authors (3): Priyanka Kargupta, Runchu Tian, Jiawei Han

Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely “true” or “false” – as is frequently the case with scientific and political claims. However, a claim (e.g., “vaccine A is better than vaccine B”) can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., “how many biomedical papers believe vaccine A is more transportable than B?”). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.

个人或实体提出的主张往往有细微差别,不能被明确标记为完全“真实”或“虚假”,科学和政治主张经常如此。然而,一项主张(例如,“疫苗A比疫苗B更好”)可以分解成其组成方面和子方面(例如,功效、安全性、分发),这些方面各自更容易验证。这样就能够作出更加全面、结构化的回应,就某一问题提供全面的观点,同时使读者能够优先关注主张中特定的角度(例如,对儿童的安全性)。因此,我们提出ClaimSpect,这是一个基于检索增强生成的框架,用于自动构建处理一项主张时通常考虑的各方面的层次结构,并用特定语料库的观点加以充实。该结构对输入语料库进行层次划分以检索相关片段,这些片段有助于发现新的子方面。此外,这些片段还有助于发现针对主张某一方面的不同观点(例如支持、中立或反对)及其各自的普遍程度(例如,“有多少生物医学论文认为疫苗A比疫苗B更便于运输?”)。我们将ClaimSpect应用于我们构建的数据集中各种真实的科学和政治主张,展示其在解构细致主张和呈现语料库内各种观点方面的稳健性和准确性。通过真实案例研究和人工评估,我们验证了其相对于多个基线的有效性。


Article 46

Title@2025-06-12 (4): PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models

Title: PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models PREMISE: Skalierbare und strategische Prompt-Optimierung für effiziente mathematische Reasoning in großen Modellen PREMISE:大规模模型中高效数学理由的可扩展和战略快速优化 2506.10716v1

Authors (3): Ye Yu, Yaoning Yu, Haohan Wang

Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve strong performance on mathematical benchmarks using lengthy chain-of-thought (CoT) reasoning, but the resulting traces are often unnecessarily verbose. This inflates token usage and cost, limiting deployment in latency-sensitive or API-constrained settings. We introduce PREMISE (PRompt-based Efficient Mathematical Inference with Strategic Evaluation), a prompt-only framework that reduces reasoning overhead without modifying model weights. PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization to minimize redundant computation while preserving answer accuracy. The approach jointly optimizes brevity and correctness through a multi-objective textual search that balances token length and answer validity. Unlike prior work, PREMISE runs in a single-pass black-box interface, so it can be applied directly to commercial LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy ($96\%\rightarrow96\%$ with Claude, $91\%\rightarrow92\%$ with Gemini) while reducing reasoning tokens by up to $87.5\%$ and cutting dollar cost by $69$–$82\%$. These results show that prompt-level optimization is a practical and scalable path to efficient LRM inference without compromising reasoning quality.

大型推理模型(LRM),如Claude 3.7 Sonnet和OpenAI o1,利用冗长的思维链(CoT)推理在数学基准上取得很强的成绩,但由此产生的推理轨迹往往过于冗长。这会增加词元用量和成本,限制在对延迟敏感或受API限制的环境下的部署。我们引入PREMISE(PRompt-based Efficient Mathematical Inference with Strategic Evaluation),一个仅基于提示的框架,在不修改模型权重的情况下减少推理开销。PREMISE将轨迹级诊断与受梯度启发的提示优化结合起来,以尽量减少多余的计算,同时保持答案准确性。该方法通过多目标文本搜索联合优化简洁性和正确性,在词元长度和答案有效性之间取得平衡。与以前的工作不同,PREMISE在单次调用的黑箱接口中运行,因此可以直接应用于商业LLM。在GSM8K、SVAMP和Math500上,我们达到或超过基线准确率(Claude为$96\%\rightarrow96\%$,Gemini为$91\%\rightarrow92\%$),同时将推理词元减少多达$87.5\%$,并将美元成本降低$69$–$82\%$。这些结果表明,提示级优化是在不损害推理质量的前提下实现高效LRM推理的实用且可扩展的途径。


Article 47

Title@2025-06-12 (4): Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet

Title: Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet Adjektive Hypernyms mit Sprachmodellen ableiten, um die Konnektivität von Open English Wordnet zu erhöhen 推导语言模型的形容词超音音音,以提高开放英文Wordnet的连通性 2506.10715v1

Authors (2): Lorenzo Augello, John P. McCrae

Open English Wordnet is a key resource published in OntoLex-lemon as part of the linguistic linked open data cloud. There are, however, many links missing in the resource, and in this paper, we look at how we can establish hypernymy between adjectives. We present a theoretical discussion of the hypernymy relation and how it differs for adjectives in contrast to nouns and verbs. We develop a new resource for adjective hypernymy and fine-tune large language models to predict adjective hypernymy, showing that the methodology of TaxoLLaMa can be adapted to this task.

Open English Wordnet 是OntoLex-lemon 出版的关键资源, 作为语言链接的开放数据云的一部分。 但是, 资源中缺少许多链接, 在本文中, 我们审视如何在形容词之间建立超性关系。 我们展示了超性关系的理论讨论, 以及它相对于名词和动词的形容词的区别。 我们开发了一个新的形容词超尼和微调大语言模型的资源, 以预测形容词超尼, 显示TexollaMa 的方法可以适应这项任务 。


Article 48

Title@2025-06-12 (4): Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

Title: Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models Anreize für eine fortgeschrittene Instruktions-Folge von großen Sprachmodellen 为采用大语言模式的高级指示提供激励理由 2506.01413v2

Authors (9): Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun

Existing large language models (LLMs) face challenges in following complex instructions, especially when multiple constraints are present and organized in parallel, chained, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve the capabilities of LLMs. However, we find that vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the composition of constraints to identify their relationships across hierarchies of types and dimensions. To this end, we propose a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we start from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate a steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to an 8B LLM. Code and data are available at https://github.com/yuleiqin/RAIF.

现有大型语言模型(LLM)在遵循复杂指令方面面临挑战,特别是当存在多种约束并以并行、链式和分支结构组织时。一个直观的解决方案,即思维链(CoT),被期望能普遍提高LLM的能力。然而,我们发现原始的CoT由于其只是简单复述指令的肤浅推理模式而对性能产生负面影响,它未能剖析约束的构成以识别它们在不同类型和维度层次上的关系。为此,我们提出一种系统的方法,通过激励推理以扩展测试时计算,来提升LLM处理复杂指令的能力。首先,我们从现有分类法下复杂指令的分解出发,提出一种可复现的数据获取方法。第二,我们利用带有可验证的、以规则为中心的奖励信号的强化学习(RL),专门为指令遵循培养推理能力。我们通过样本级对比来解决复杂指令下推理浅层、非本质的问题,以实现更好的CoT约束。我们还利用专家行为克隆,促进从快思考LLM到熟练推理者的平稳分布迁移。在七个综合基准上的广泛评估证实了所提方法的有效性,其中1.5B的LLM取得了11.74%的提升,性能可与8B的LLM相媲美。代码和数据见 https://github.com/yuleiqin/RAIF。


Article 49

Title@2025-06-12 (4): PRSA: Prompt Stealing Attacks against Real-World Prompt Services

Title: PRSA: Prompt Stealing Attacks against Real-World Prompt Services PRSA: Sofortige Diebstahlangriffe gegen Real-World Prompt Services PRSA: 迅速窃盗对现实世界迅速服务公司的袭击 2402.19200v3

Authors (9): Yong Yang, Changjiang Li, Qingming Li, Oubo Ma, Haoyu Wang, Zonghui Wang, Yandong Gao, Wenzhi Chen, Shouling Ji

Recently, large language models (LLMs) have garnered widespread attention for their exceptional capabilities. Prompts are central to the functionality and performance of LLMs, making them highly valuable assets. The increasing reliance on high-quality prompts has driven significant growth in prompt services. However, this growth also expands the potential for prompt leakage, increasing the risk that attackers could replicate original functionalities, create competing products, and severely infringe on developers’ intellectual property. Despite these risks, prompt leakage in real-world prompt services remains underexplored. In this paper, we present PRSA, a practical attack framework designed for prompt stealing. PRSA infers the detailed intent of prompts through very limited input-output analysis and can successfully generate stolen prompts that replicate the original functionality. Extensive evaluations demonstrate PRSA’s effectiveness across two main types of real-world prompt services. Specifically, compared to previous works, it improves the attack success rate from 17.8% to 46.1% in prompt marketplaces and from 39% to 52% in LLM application stores, respectively. Notably, in the attack on “Math”, one of the most popular educational applications in OpenAI’s GPT Store with over 1 million conversations, PRSA uncovered a hidden Easter egg that had not been revealed previously. Besides, our analysis reveals that higher mutual information between a prompt and its output correlates with an increased risk of leakage. This insight guides the design and evaluation of two potential defenses against the security threats posed by PRSA. We have reported these findings to the prompt service vendors, including PromptBase and OpenAI, and actively collaborate with them to implement defensive measures.

最近,大型语言模型(LLM)因其卓越的能力而得到了广泛的关注。提示对于LLM的功能和性能至关重要,使其成为非常宝贵的资产。对高质量提示的日益依赖推动了提示服务的显著增长。但是,这种增长也扩大了提示泄漏的可能性,增加了攻击者复制原始功能、创造竞争产品和严重侵害开发者知识产权的风险。尽管存在这些风险,现实世界提示服务中的提示泄漏仍未得到充分探讨。在本文中,我们介绍PRSA,一个为窃取提示而设计的实用攻击框架。PRSA通过非常有限的输入输出分析推断出提示的详细意图,并能够成功生成复制原始功能的被窃提示。广泛的评估证明了PRSA在两类主要的现实世界提示服务中的有效性。具体地说,与以往的工作相比,它在提示市场中将攻击成功率从17.8%提高到46.1%,在LLM应用商店中从39%提高到52%。值得注意的是,在对“Math”(OpenAI的GPT Store中最受欢迎的教育应用之一,拥有超过100万次对话)的攻击中,PRSA发现了一个此前未被披露的隐藏彩蛋。此外,我们的分析表明,提示与其输出之间的互信息越高,泄漏风险越大。这一见解指导了针对PRSA所带来安全威胁的两种潜在防御措施的设计和评估。我们已将这些发现报告给提示服务供应商,包括PromptBase和OpenAI,并积极与其合作实施防御措施。


Article 50

Title@2025-06-12 (4): FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems

Title: FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems FedRAG: Ein Rahmen für Systeme der Feinsteuerung von Retrieval-Augmented Generation FFRAG: 微调取回系统框架 2506.09200v2

Authors (8): Val Andrei Fajardo, David B. Emerson, Amandeep Singh, Veronica Chatrath, Marcelo Lotif, Ravi Theja, Alex Cheung, Izuki Matsuba

Retrieval-augmented generation (RAG) systems have been shown to be effective in addressing many of the drawbacks of relying solely on the parametric memory of large language models. Recent work has demonstrated that RAG systems can be improved via fine-tuning of their retriever and generator models. In this work, we introduce FedRAG, a framework for fine-tuning RAG systems across centralized and federated architectures. FedRAG supports state-of-the-art fine-tuning methods, offering a simple and intuitive interface and a seamless conversion from centralized to federated training tasks. FedRAG is also deeply integrated with the modern RAG ecosystem, filling a critical gap in available tools.

事实证明,回收-提款发电系统(RAG)对于解决完全依赖大型语言模型的参数记忆的许多缺点是有效的,最近的工作表明,通过微调其检索器和发电机模型,可以改进RAG系统。在这项工作中,我们引入FedRAG,这是对中央和联合结构的RAG系统进行微调的框架。 FedRAG支持最先进的微调方法,提供一个简单和直观的界面,并将中央化培训任务顺利地转换为联合培训任务。 FedRAG还与现代RAG生态系统紧密结合,填补了现有工具中的一个关键空白。


Article 51

Title@2025-06-12 (4): SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models

Title: SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models SelectLLM: Query-Aware Effiziente Auswahl Algorithmen für große Sprachmodelle 选择LLM: 用于大语言模型的查询- 软件高效选择算法 2408.08545v4

Authors (3): Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar

Large language models (LLMs) have been widely adopted due to their remarkable performance across various applications, driving the accelerated development of a large number of diverse models. However, these individual LLMs show limitations in generalization and performance on complex tasks due to inherent training biases, model size constraints, and the quality or diversity of pre-training datasets. A promising direction is to efficiently harness the diverse capabilities of LLMs to overcome these individual limitations. To address these limitations, we introduce a novel LLM selection algorithm called SelectLLM, which efficiently directs input queries to the most suitable subset of LLMs from a large pool, ensuring that the selected models collectively provide accurate responses. SelectLLM employs a multi-label classifier and policy based on the classifier’s predictions and confidence scores in selecting an optimal, query-aware, and lightweight subset of LLMs. Our findings indicate that the proposed model outperforms existing ensemble-based baselines and achieves competitive performance with similarly sized top-performing LLMs while maintaining efficiency. Specifically, it achieves a huge reduction in inference latency on two challenging reasoning benchmarks: 13% on GSM8K and 70% on MMLU, compared to the top-performing baseline. Also, we establish a theoretical upper bound by an Oracle with LLMs and perform an in-depth linguistic analysis to understand the performance gap between the Oracle and SelectLLM.

大型语言模型(LLM)因其在各种应用中的杰出表现而被广泛采用,推动了大量不同模型的加速发展。然而,由于固有的训练偏差、模型规模限制以及预训练数据集的质量或多样性,这些单个LLM在复杂任务上的泛化和性能存在局限。一个有希望的方向是有效利用LLM的多样化能力来克服这些个体局限。为此,我们引入一种名为SelectLLM的新颖LLM选择算法,它能高效地将输入查询引导到大型模型池中最合适的LLM子集,确保所选模型共同提供准确的回答。SelectLLM采用多标签分类器,以及基于分类器预测和置信度分数的策略,来选择最优的、查询感知的轻量级LLM子集。我们的结果表明,所提模型优于现有的基于集成的基线,并在保持效率的同时取得了与规模相近的最佳LLM相当的性能。具体而言,与表现最好的基线相比,它在两个具有挑战性的推理基准上大幅降低了推理延迟:在GSM8K上降低13%,在MMLU上降低70%。此外,我们通过一个由LLM组成的Oracle建立了理论上界,并进行了深入的语言学分析,以理解Oracle与SelectLLM之间的性能差距。
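The SelectLLM abstract describes a policy that turns a multi-label classifier's confidence scores into a small, query-specific subset of LLMs. The abstract does not give the policy itself, so the following is only a sketch under stated assumptions: the threshold, the cap on subset size, the fallback to the single best model, and the names `select_models` and `query_scores` are all hypothetical.

```python
def select_models(confidences, threshold=0.5, max_models=3):
    """Pick a small subset of models for one query.

    `confidences` maps a model name to the classifier's confidence that
    the model answers this query correctly. It must be non-empty.
    The threshold/cap/fallback rule is an illustrative assumption, not
    the policy from the SelectLLM paper.
    """
    # Rank models from most to least confident.
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    chosen = [name for name, score in ranked if score >= threshold][:max_models]
    # Fall back to the single most confident model if none clears the bar.
    return chosen or [ranked[0][0]]


# Hypothetical classifier scores for one query over a pool of four models.
query_scores = {"model_a": 0.91, "model_b": 0.35, "model_c": 0.72, "model_d": 0.10}
print(select_models(query_scores))
```

Querying only the selected subset, rather than the whole pool, is what would reduce inference latency relative to a full ensemble.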


Article 52

Title@2025-06-12 (4): Large Language Models for Detection of Life-Threatening Texts

Title: Large Language Models for Detection of Life-Threatening Texts Große Sprachmodelle zur Erkennung lebensbedrohlicher Texte 探测生命威胁文字的长语言大语言模型 2506.10687v1

Authors (3): Thanh Thi Nguyen, Campbell Wilson, Janis Dalins

Detecting life-threatening language is essential for safeguarding individuals in distress, promoting mental health and well-being, and preventing potential harm and loss of life. This paper presents an effective approach to identifying life-threatening texts using large language models (LLMs) and compares them with traditional methods such as bag of words, word embedding, topic modeling, and Bidirectional Encoder Representations from Transformers. We fine-tune three open-source LLMs including Gemma, Mistral, and Llama-2 using their 7B parameter variants on different datasets, which are constructed with class balance, imbalance, and extreme imbalance scenarios. Experimental results demonstrate a strong performance of LLMs against traditional methods. More specifically, Mistral and Llama-2 models are top performers in both balanced and imbalanced data scenarios while Gemma is slightly behind. We employ the upsampling technique to deal with the imbalanced data scenarios and demonstrate that while this method benefits traditional approaches, it does not have as much impact on LLMs. This study demonstrates a great potential of LLMs for real-world life-threatening language detection problems.

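检测威胁生命的语言对于保护处于困境中的个人、促进心理健康和福祉以及防止潜在伤害和生命损失至关重要。本文提出了一种使用大型语言模型(LLM)识别威胁生命的文本的有效方法,并将其与词袋、词嵌入、主题建模和Bidirectional Encoder Representations from Transformers等传统方法进行比较。我们在不同数据集上微调了三个开源LLM(Gemma、Mistral和Llama-2)的7B参数版本,这些数据集分别按类别平衡、不平衡和极端不平衡的情形构建。实验结果表明LLM相对于传统方法表现强劲。更具体地说,Mistral和Llama-2模型在平衡和不平衡数据情形下都表现最佳,而Gemma略为落后。我们采用上采样技术处理不平衡数据情形,并表明虽然这种方法有利于传统方法,但对LLM的影响没有那么大。这项研究表明LLM在解决现实世界威胁生命语言检测问题方面有很大的潜力。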
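The abstract above mentions upsampling as the remedy for imbalanced training data but does not say which variant was used. A common baseline is random oversampling with replacement, sketched below; the function name `upsample`, the fixed seed, and the toy data are assumptions for illustration only.

```python
import random


def upsample(texts, labels, seed=0):
    """Randomly duplicate minority-class examples until classes are balanced.

    Returns a shuffled list of (text, label) pairs. Random oversampling
    is assumed here; the paper may use a different upsampling variant.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((text, label) for text in items + extra)
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced dataset: three benign texts, one life-threatening text.
toy_texts = ["see you soon", "nice weather", "lunch at noon", "i will hurt you"]
toy_labels = [0, 0, 0, 1]
balanced_pairs = upsample(toy_texts, toy_labels)
print(len(balanced_pairs))
```

Oversampling should be applied to the training split only, so that duplicated examples do not leak into evaluation data.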


Article 53

Title@2025-06-12 (4): Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

Title: Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models Habe ich treu gesagt, was ich dachte? Die Kluft zwischen neuraler Aktivität und Selbsterklärungen in großen Sprachmodellen überbrücken 缩小大语言模式中神经活动与自我开发之间的差距 2506.09277v2

Authors (5): Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot

Large Language Models (LLMs) have demonstrated the capability to generate free-text self Natural Language Explanations (self-NLE) to justify their answers. Despite their logical appearance, self-NLE do not necessarily reflect the LLM’s actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model’s reasoning. This work introduces a novel flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE by directly comparing the latter with interpretations of the model’s internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLE.

大型语言模型(LLM)已经展示了产生自由文本自自然语言解释(自我自然语言解释)的能力来证明答案的合理性。 尽管自我语言模型(LLM)有逻辑外观,但自我语言模型不一定反映LLM实际决策过程,使这种解释变得不真实。 尽管衡量自我语言模型(LLM)忠诚程度的现有方法主要依靠行为测试或计算区块识别,但其中没有一个方法审查模型推理背后的神经活动。 这项工作引入了一个新的灵活框架,通过直接比较LLM自自然语言解释(自我语言解释 ) 和对模型内部隐藏状态的解释,从数量上衡量LLM自我语言解释的忠诚程度。 拟议的框架是多功能的,通过在自我语言模型推理和自我学习之间建立直接的联系,为自我语言模型的忠诚提供了深刻的洞察力。 这一方法增进了对自我-NLE的忠诚的理解,并为产生更忠实的自我学习NLE提供了基础。


Article 54

Title@2025-06-12 (4): TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving

Title: TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving TeleMath: Ein Benchmark für große Sprachmodelle in der Telecom Mathematischen Problemlösung TeleMatth:电信数学问题解决大语言模型基准 2506.10674v1

Authors (6): Vincenzo Colle, Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Merouane Debbah

The increasing adoption of artificial intelligence in telecommunications has raised interest in the capability of Large Language Models (LLMs) to address domain-specific, mathematically intensive tasks. Although recent advancements have improved the performance of LLMs in general mathematical reasoning, their effectiveness within specialized domains, such as signal processing, network optimization, and performance analysis, remains largely unexplored. To address this gap, we introduce TeleMath, the first benchmark dataset specifically designed to evaluate LLM performance in solving mathematical problems with numerical solutions in the telecommunications domain. Comprising 500 question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the telecommunications field. This paper outlines the proposed QnAs generation pipeline, starting from a selected seed of problems crafted by Subject Matter Experts. The evaluation of a wide range of open-source LLMs reveals that best performance on TeleMath is achieved by recent models explicitly designed for mathematical or logical reasoning. In contrast, general-purpose models, even those with a large number of parameters, often struggle with these challenges. We have released the dataset and the evaluation code to ease result reproducibility and support future research.

在电信中越来越多地采用人工智能已使人们对大型语言模型(LLMS)处理具体领域、数学密集型任务的能力产生了兴趣。虽然最近的进展在一般数学推理方面提高了LLMS的绩效,但它们在信号处理、网络优化和性能分析等专门领域的效力基本上尚未探索。为解决这一差距,我们引进了TeleMath,这是第一个专门用来评价LLM在用电信领域的数字解决方案解决数学问题方面表现的基准数据集。TealMath由500对问答(QnA)组成,涵盖电信领域广泛的专题。本文概述了拟议的QAs生成管道,从主题专题专家所设计的问题的选定种子开始。对范围广泛的开放源LMS的评估表明,TealMath的最佳性能是通过最近为数学或逻辑推理而专门设计的模型实现的。相比之下,通用模型,即使是具有大量参数的模型,也常常与这些挑战相抗争。我们发布了数据集和评价代码,以方便结果再生和支持未来研究。


Article 55

Title@2025-06-12 (4): CoRT: Code-integrated Reasoning within Thinking

Title: CoRT: Code-integrated Reasoning within Thinking CoRT: Code-integrierte Vernunft im Denken CORT: 思考中守则综合理由 2506.09820v2

Authors (11): Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu

Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model’s internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at https://github.com/ChengpengLi1003/CoRT.

o1和DeepSeek-R1等大型推理模型(LRM)在使用长思维链(CoT)的自然语言推理方面取得了显著进展,但在处理复杂的数学运算时仍然效率低或不准确。通过计算工具(例如计算库和符号求解器)解决这些局限很有希望,但它带来了一个技术挑战:代码解释器(CI)引入了模型内部文本表示之外的外部知识,因此直接组合效率不高。本文介绍CoRT,一个用于教会LRM有效且高效地利用CI的后训练框架。作为第一步,我们通过Hint-Engineering合成代码集成推理数据来解决数据稀缺问题,该方法有策略地在适当位置插入不同的提示,以优化LRM与CI的交互。我们手工创建了30个高质量样本,并在此基础上对1.5B到32B参数的模型进行后训练,包括监督微调、拒绝微调和强化学习。我们的实验结果表明,在五个具有挑战性的数学推理数据集上,Hint-Engineering模型在DeepSeek-R1-Distill-Qwen-32B和DeepSeek-R1-Distill-Qwen-1.5B上分别取得了4%和8%的绝对提升。此外,与自然语言模型相比,Hint-Engineering模型在32B模型上少用约30%的词元,在1.5B模型上少用约50%的词元。模型和代码见 https://github.com/ChengpengLi1003/CoRT。


Article 56

Title@2025-06-12 (4): Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes

Title: Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes Robuste, unüberwachte Anpassung eines Spracherkennungsgeräts mit Entropie-Minimierungs- und Lautsprechercodes 使用磁最小化和演讲人守则的演讲者演讲者 2506.10653v1

Authors (4): Rogier C. van Dalen, Shucong Zhang, Titouan Parcollet, Sourav Bhattacharya

Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation to a single minute of data robust. First, instead of estimating the adaptation parameters with cross-entropy on a single error-prone hypothesis or “pseudo-label”, this paper proposes a novel loss function, the conditional entropy over complete hypotheses. Using multiple hypotheses makes adaptation more robust to errors in the initial recognition. Second, a “speaker code” characterises a speaker in a vector short enough that it requires little data to estimate. On a far-field noise-augmented version of Common Voice, the proposed scheme yields a 20% relative improvement in word error rate with one minute of adaptation data, rising to 29% with 10 minutes.

语音识别器通常只在特定环境中最优化地发挥作用, 并且需要在另一个环境中适应效果良好。 为了适应新扬声器, 通常只有太少的数据可以微调, 数据通常不贴标签。 本文提出将适应方法组合起来, 使数据达到单分钟的稳健度。 首先, 本文没有在单一的错误易变假设或“ 假名” 上用跨点滴子来估计适应参数, 而是提出一个新的损失功能, 即有条件的对完整假称的诱变。 使用多个假设使适应能力更能适应初始识别错误。 其次, “ 扩音码” 的特性使一个矢量短, 需要很少数据来估计。 在远方的常见声音版本中, 所拟议的方案在适应数据一分钟的字错误率上产生20%的相对改善, 增加10分钟到29 % 。
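The contrast drawn in the abstract above, between cross-entropy on a single pseudo-label and an entropy computed over complete hypotheses, can be made concrete with a small numeric sketch. It assumes the hypothesis space is approximated by an N-best list with per-hypothesis log-scores, which is an assumption and not necessarily how the paper computes the loss; the function names are hypothetical.

```python
import math


def _posteriors(log_scores):
    """Softmax over hypothesis log-scores, computed stably."""
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


def hypothesis_entropy(log_scores):
    """Entropy of the normalised distribution over whole hypotheses."""
    return -sum(p * math.log(p) for p in _posteriors(log_scores) if p > 0)


def pseudo_label_loss(log_scores):
    """Cross-entropy against only the single best hypothesis."""
    return -math.log(max(_posteriors(log_scores)))


# Hypothetical N-best log-scores for one utterance: two close competitors.
n_best = [-4.1, -4.3, -9.0]
print(hypothesis_entropy(n_best), pseudo_label_loss(n_best))
```

When two hypotheses score almost equally, the pseudo-label loss commits fully to one of them, whereas the entropy term only asks the model to become more confident overall, which is one intuition for why it might tolerate errors in the initial recognition.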


Article 57

Title@2025-06-12 (4): Identifying Reliable Evaluation Metrics for Scientific Text Revision

Title: Identifying Reliable Evaluation Metrics for Scientific Text Revision Identifizieren von verlässlichen Bewertungsmetrics für wissenschaftliche Textrevision 科学文本订正的可靠评价计量指标 2506.04772v3

Authors (4): Léane Jourdan, Florian Boudin, Richard Dufour, Nicolas Hernandez

Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.

在这项工作中,我们分析和确定这些指标的局限性,并探索更符合人类判断的替代评价方法;我们首先进行人工说明研究,以评估不同修订的质量;然后,我们调查相关国家实验室领域的无参考评价指标;此外,我们审查LLM-as-a-judge性方法,分析它们评估修订的能力,同时不参考黄金;我们的结果显示LLM有效评估教学的遵循情况,但以正确的方式挣扎,而具体领域指标则提供补充性见解;我们发现,将LLM-as-a-a-judge性评价和具体任务指标相结合的混合方法,提供了对修订质量的最可靠的评估。


Article 58

Title@2025-06-12 (4): Spelling-out is not Straightforward: LLMs’ Capability of Tokenization from Token to Characters

Title: Spelling-out is not Straightforward: LLMs’ Capability of Tokenization from Token to Characters Rechtschreibung ist nicht geradeaus: LLMs Fähigkeit der Tokenisierung von Token zu Charakteren 拼写出不是直向前进的: LLMs 的调制能力从调制字符到字符 2506.10641v1

Authors (2): Tatsuya Hiraoka, Kentaro Inui

Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct “breakthrough” in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.

大型语言模型(LLMS)可以非常精确地按性格来拼写符号字符,但是它们要努力完成更复杂的性格层面任务,例如确定象征物中的组成子组件。 在这项工作中,我们调查LLMS内部如何在拼写过程中代表并利用字符级信息。我们的分析表明,虽然拼写是人类的一项简单的任务,但LLMS并没有直接处理。具体地说,我们表明嵌入层没有完全编码字符级信息,特别是除了第一个字符之外。因此,LMS依靠中间和较高的变异层来重建字符级知识,在那里我们通过拼写行为观察到一个截然不同的“突破 ” 。我们通过三个互补分析来验证这一机制:验证分类师、识别知识神经元和检查注意力重量。


Article 59

Title@2025-06-12 (4): Conversational Search: From Fundamentals to Frontiers in the LLM Era

Title: Conversational Search: From Fundamentals to Frontiers in the LLM Era Conversational Search: Von Grundlagen zu Grenzen in der LLM-Ära 对话搜索:从基本原理到LLM时代的前沿 2506.10635v1

Authors (4): Fengran Mo, Chuan Meng, Mohammad Aliannejadi, Jian-Yun Nie

Conversational search enables multi-turn interactions between users and systems to fulfill users’ complex information needs. During this interaction, the system should understand the users’ search intent within the conversational context and then return the relevant information through a flexible, dialogue-based interface. Recent powerful large language models (LLMs), with their capacity for instruction following, content generation, and reasoning, have attracted significant attention and driven rapid advances, providing new opportunities and challenges for building intelligent conversational search systems. This tutorial aims to introduce the connection between fundamentals and the emerging topics revolutionized by LLMs in the context of conversational search. It is designed for students, researchers, and practitioners from both academia and industry. Participants will gain a comprehensive understanding of both the core principles and cutting-edge developments driven by LLMs in conversational search, equipping them with the knowledge needed to contribute to the development of next-generation conversational search systems.

在这种互动中,系统应了解用户在对话背景下的搜索意图,然后通过一个灵活的、基于对话的界面将相关信息反馈回。最近强大的大型语言模型(LLMs)具有教学能力,随后是内容生成和推理,吸引了大量的关注和进步,为建立智能对话搜索系统提供了新的机会和挑战。这一辅导的目的是介绍基本知识与LLMs在对话搜索过程中革命产生的专题之间的联系。它针对的是学术界和工业界的学生、研究人员和从业人员。参与者将全面了解LLMs在对话搜索中推动的核心原则和尖端发展,使他们掌握必要的知识,为下一代对话搜索系统的发展作出贡献。


Article 60

Title@2025-06-12 (4): NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors

Title: NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors NeuralNexus bei BEA 2025 Shared Task: Retrieval-Augmented Prompting für Fehlererkennung in KI-Tutoren BEA 2025年BEA 的神经外观 共同任务:在 AI 导师中检索错误识别提示 2506.10627v1

Authors (4): Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal

This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor’s response correctly identifies a mistake in a student’s mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.

本文介绍了我们的第1轨系统:BEA 2025年教育能力评估BEA 动力导师的教学能力共同任务错误识别系统。任务包括评估导师的反应是否正确确定了学生数学推理的错误。我们探索了四种方法:(1) 机器学习模型的组合,而不是来自多种预先培训语言模型的集合象征性嵌入;(2) 使用与MLP分类器嵌入的[CLS]封存的冻结句式转换器;(3) 历史觉悟模型,在象征性历史和响应嵌入之间引起多头注意;(4) 检索微弱的提示系统,使用大型语言模型(LLM),即GPT 4o。我们的最后系统检索了类似语义的例子,构建结构化提示,并使用化学引导输出分解,以产生可解释的预测。它超越了所有基线,表明以实例驱动的提示和LM 20 教学反馈评估的推理的有效性。我们的代码可以在 https://githubb.com/Naauem.
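The retrieve-then-prompt idea in approach (4) can be illustrated with a minimal sketch. It assumes each labelled example already has an embedding vector; the helper names (`cosine`, `retrieve`, `build_prompt`), the toy two-dimensional vectors, and the prompt wording are all hypothetical and are not taken from the authors' repository.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec, pool, k=2):
    """Return the k labelled examples most similar to the query embedding."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(examples, query_text, labels):
    """Assemble a few-shot prompt (hypothetical wording, not the authors' template)."""
    lines = ["Does the tutor response identify the student's mistake?",
             "Answer with one of: " + ", ".join(labels), ""]
    for ex in examples:
        lines += ["Response: " + ex["text"], "Label: " + ex["label"], ""]
    lines += ["Response: " + query_text, "Label:"]
    return "\n".join(lines)

# Toy pool with made-up embeddings and labels.
pool = [
    {"vec": [1.0, 0.0], "text": "A", "label": "Yes"},
    {"vec": [0.0, 1.0], "text": "B", "label": "No"},
    {"vec": [0.9, 0.1], "text": "C", "label": "Yes"},
]
shots = retrieve([1.0, 0.05], pool, k=2)
prompt = build_prompt(shots, "Q", labels=["Yes", "No"])
```

In a real system the vectors would come from a sentence encoder and the prompt would be sent to the LLM, whose output is then parsed against a fixed schema; both steps are omitted here.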


Article 61

Title@2025-06-12 (4): SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis

Title: SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis SDialog: Ein Python-Toolkit für die Synthetische Dialog-Generierung und -Analyse Sidialog:合成对话生成和分析的Python工具包 2506.10622v1

Authors (3): Sergio Burdisso, Esaú Villatoro-Tello, Petr Motlicek

The advancement of conversational AI systems relies on the availability of high-quality, flexible, and reproducible synthetic dialogues for training, evaluation, and benchmarking. SDialog is a modular, extensible Python toolkit designed to address the challenges of synthetic dialogue generation and analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog provides abstractions for personas, orchestration, and scenario management, enabling the creation of realistic, diverse, and controllable conversational data for research and development. SDialog supports workflows such as multi-agent simulation and scenario-driven generation, and represents a step forward in the standardization of tools and frameworks for synthetic data generation, a crucial advancement for ensuring reproducibility in today’s fast-evolving research landscape.

Sdialog是一个模块化的、可推广的Python工具包,旨在应对合成对话生成和分析的挑战。 Sdialog通过利用受指导的大型语言模型(LLMs),为个人、交响和情景管理提供抽象信息,为研究和开发创造现实、多样和可控制的交谈数据。Sdialog支持多试模拟和情景生成等工作流程,并代表了合成数据生成工具和框架标准化的一步,这是确保当今快速变化的研究格局中再生的关键进步。


Article 62

Title@2025-06-12 (4): Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code

Title: Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code Deep Learning-based Digitalisierung von überlappenden EKG-Bildern mit Open-Source-Python-Code 使用开放源码的 ECG 重叠图像的深学习数字化 2506.10617v1

Authors (4): Reza Karbasi, Masoud Rahimi, Abdol-Hossein Vahabie, Hadi Moradi

This paper addresses the persistent challenge of accurately digitizing paper-based electrocardiogram (ECG) recordings, with a particular focus on robustly handling single leads compromised by signal overlaps, a common yet under-addressed issue in existing methodologies. We propose a two-stage pipeline designed to overcome this limitation. The first stage employs a U-Net based segmentation network, trained on a dataset enriched with overlapping signals and fortified with custom data augmentations, to accurately isolate the primary ECG trace. The subsequent stage converts this refined binary mask into a time-series signal using established digitization techniques, enhanced by an adaptive grid detection module for improved versatility across different ECG formats and scales. Our experimental results demonstrate the efficacy of our approach. The U-Net architecture achieves an IoU of 0.87 for the fine-grained segmentation task. Crucially, our proposed digitization method yields superior performance compared to a well-established baseline technique across both non-overlapping and challenging overlapping ECG samples. For non-overlapping signals, our method achieved a Mean Squared Error (MSE) of 0.0010 and a Pearson Correlation Coefficient (rho) of 0.9644, compared to 0.0015 and 0.9366, respectively, for the baseline. On samples with signal overlap, our method achieved an MSE of 0.0029 and a rho of 0.9641, significantly improving upon the baseline’s 0.0178 and 0.8676. This work demonstrates an effective strategy to significantly enhance digitization accuracy, especially in the presence of signal overlaps, thereby laying a strong foundation for the reliable conversion of analog ECG records into analyzable digital data for contemporary research and clinical applications. The implementation is publicly available at this GitHub repository: https://github.com/masoudrahimi39/ECG-code.

本文讨论了将基于纸张的心电图(ECG)记录进行准确数字化的持久挑战,特别侧重于严格处理因信号重叠而受损的单项线索,这是现有方法中常见但处理不足的问题。我们提出一个旨在克服这一限制的两阶段管道。第一阶段使用基于U-Net的分解网络,该网的分解网络经过培训,该网的数据集经过重叠信号的丰富和定制数据增强,以准确隔离主要的ECG追踪。随后阶段利用既定的数字化技术,将这一经过改进的双面遮罩转换成一个时间序列信号,并辅之以一个适应性电网检测模块,以改善不同ECGF格式和尺度的多功能。我们的实验结果显示了我们的方法的有效性。U-Net结构为精细分解任务实现了0.87的IOU。 值得注意的是,我们提议的数字化方法在非重叠和具有挑战性的ECGH样本中,将这一改进的双面双面双面双层遮固的双层遮掩罩。 对于非重叠的信号,我们的方法在0.00和Pearson-Correxl的服务器的转换应用中, 将这一基基基比为0.9-0.00的基比的基数据, 提高了我们的数据基基基比为0.16的基比为0.9的基的基比的基比的基基基比的基基比提高了。
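For readers unfamiliar with the three reported metrics, the following sketch shows how IoU, MSE, and the Pearson correlation coefficient are conventionally computed. It uses the standard textbook definitions on flat Python lists; the function names and toy inputs are illustrative and are not taken from the authors' repository.

```python
import math

def iou(mask_a, mask_b):
    """Intersection over union of two flat binary masks of equal length."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 1.0  # two empty masks agree perfectly

def mse(reference, estimate):
    """Mean squared error between a reference signal and a digitized estimate."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)

def pearson(x, y):
    """Pearson correlation coefficient; undefined for constant signals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Toy inputs, not ECG data.
overlap_score = iou([1, 1, 0, 0], [1, 0, 1, 0])   # 1 shared pixel out of 3
error = mse([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])     # (0 + 0 + 4) / 3
rho = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # perfectly linear
```

Lower MSE and higher rho both indicate that the digitized trace follows the reference signal more closely, which is the direction of the improvements reported above.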


Article 63

Title: Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search Unüberwachte protoforme Rekonstruktion durch Parsimonious Regel-geführte Heuristiken und evolutionäre Suche 通过法理学、法理学、受规制的哲理学和进化搜索进行不受监督的原形重建 2506.10614v1

Authors (1): Promise Dodzi Kpoglu

We propose an unsupervised method for the reconstruction of protoforms, i.e., ancestral word forms from which modern language forms are derived. While prior work has primarily relied on probabilistic models of phonological edits to infer protoforms from cognate sets, such approaches are limited by their predominantly data-driven nature. In contrast, our model integrates data-driven inference with rule-based heuristics within an evolutionary optimization framework. This hybrid approach leverages both statistical patterns and linguistically motivated constraints to guide the reconstruction process. We evaluate our method on the task of reconstructing Latin protoforms using a dataset of cognates from five Romance languages. Experimental results demonstrate substantial improvements over established baselines across both character-level accuracy and phonological plausibility metrics.

我们提出一种未经监督的方法来重建原型,即祖传的文字形式,现代语言形式源自于此。虽然以前的工作主要依靠声学编辑的概率模型来从古代各组中推断出原形,但这种方法由于主要以数据为主的性质而受到限制。相反,我们的模型在进化优化框架内将数据驱动的推论与基于规则的惯性结合在一起。这种混合方法利用统计模式和语言驱动的限制来指导重建进程。我们利用五种罗马语言的白兰地数据集来评估我们重建拉丁原形的任务的方法。实验结果显示,在特征级精确度和声学合理性衡量标准方面,比既定基线有重大改进。


Article 64

Title@2025-06-12 (4): ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization

Title: ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization ConfPO: Ausnutzen des politischen Modells Vertrauen für kritische Token-Auswahl in Präferenz-Optimierung 召集:利用政策模范信心在优先最佳化中选择关键物优选标准 2506.08712v2

Authors (5): Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo

We introduce ConfPO, a method for preference learning in Large Language Models (LLMs) that identifies and optimizes preference-critical tokens based solely on the training policy’s confidence, without requiring any auxiliary models or compute. Unlike prior Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO), which uniformly adjust all token probabilities regardless of their relevance to preference, ConfPO focuses optimization on the most impactful tokens. This targeted approach improves alignment quality while mitigating overoptimization (i.e., reward hacking) by using the KL divergence budget more efficiently. In contrast to recent token-level methods that rely on credit-assignment models or AI annotators, raising concerns about scalability and reliability, ConfPO is simple, lightweight, and model-free. Experimental results on challenging alignment benchmarks, including AlpacaEval 2 and Arena-Hard, demonstrate that ConfPO consistently outperforms uniform DAAs across various LLMs, delivering better alignment with zero additional computational overhead.

我们引入了ConpPO(ConfPO),这是在大语言模型中进行优先学习的一种方法,它仅基于培训政策的信心,而无需任何辅助模型或计算,而确定和优化偏爱关键符号。与以前直接对齐的算法(DAAs)(DPO)不同,前者统一调整所有象征性概率,而不论其与偏好的相关性如何,ConfPO(Fonpo)侧重于最有影响力的标码。这一有针对性的方法提高了匹配质量,同时通过使用 KL 差异预算更有效地减少过度优化(即奖励黑客),同时通过使用 KL 差异预算来降低过度优化(即奖励) 。与最近采用依赖信用分配模型或AI Antiers(AI ) 的象征性方法相比, ConfPO(Compo) 是简单、轻和无型的。关于具有挑战性的校准基准(包括AlpacaEval 2 和 Arena-Hard)的实验结果显示,Conpo(Confo) 始终优于不同LMS(DAA) 的一致的DA) ,与零的额外计算管理。
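The idea of restricting optimization to a subset of tokens chosen by the policy's own confidence can be sketched as follows. The abstract does not state the selection rule, so the rule used here (keep tokens whose probability falls below the sequence mean) is an assumption made purely for illustration, and both function names are hypothetical.

```python
import math

def select_tokens(token_probs):
    """Mark tokens whose policy probability is below the sequence mean.

    ASSUMPTION: this threshold rule is illustrative only; the abstract says
    selection depends on policy confidence but does not give the rule.
    """
    threshold = sum(token_probs) / len(token_probs)
    return [p < threshold for p in token_probs]

def masked_logprob(token_probs, mask):
    """Sum log-probabilities over the selected tokens only."""
    return sum(math.log(p) for p, keep in zip(token_probs, mask) if keep)

# Toy per-token probabilities for one response.
probs = [0.9, 0.2, 0.8, 0.1]
mask = select_tokens(probs)
score = masked_logprob(probs, mask)
```

A preference objective that normally sums log-probabilities over every token would use the masked sum instead, so gradient updates concentrate on the selected tokens without needing any auxiliary model.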


Article 65

Title@2025-06-12 (4): IPA-CHILDES & G2P+: Feature-Rich Resources for Cross-Lingual Phonology and Phonemic Language Modeling

Title: IPA-CHILDES & G2P+: Feature-Rich Resources for Cross-Lingual Phonology and Phonemic Language Modeling IPA-CHILDES & G2P+: Feature-Rich-Ressourcen für Cross-Lingual Phonologie und Phonemic Language Modeling IPA-CHILDES & G2P+:跨语言歌曲和语音语言建模的地貌资源 2504.03036v3

Authors (2): Zébulon Goriely, Paula Buttery

In this paper, we introduce two resources: (i) G2P+, a tool for converting orthographic datasets to a consistent phonemic representation; and (ii) IPA CHILDES, a phonemic dataset of child-centered speech across 31 languages. Prior tools for grapheme-to-phoneme conversion result in phonemic vocabularies that are inconsistent with established phonemic inventories, an issue which G2P+ addresses by leveraging the inventories in the Phoible database. Using this tool, we augment CHILDES with phonemic transcriptions to produce IPA CHILDES. This new resource fills several gaps in existing phonemic datasets, which often lack multilingual coverage, spontaneous speech, and a focus on child-directed language. We demonstrate the utility of this dataset for phonological research by training phoneme language models on 11 languages and probing them for distinctive features, finding that the distributional properties of phonemes are sufficient to learn major class and place features cross-lingually.

在本文中,我们引入了两种资源:(一) G2P+,一个将拼写数据集转换成一致的语音表达方式的工具;和(二) IPA ChildES,一个31种语言以儿童为中心的言论的语音数据集。先前的石墨到手机转换工具导致与既定的语音目录不一致的语音词汇,G2P+通过利用虚拟数据库的目录处理这一问题。我们利用这一工具,用语音抄录来增加儿童儿童信息,以生成IPA ChildRES。这个新资源填补了现有语音数据集中的若干空白,这些数据集往往缺乏多语种覆盖、自发语言和以儿童为主的语言为重点。我们通过培训11种语言的电话语言模型,并用这些模型进行辨别特征的验证,我们展示了该数据集对声学研究的有用性,发现手机的传播特性足以学习主要类和跨语言特征。


Article 66

Title@2025-06-12 (4): Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges

Title: Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges Pragmatics in the Era of Large Language Models: Eine Umfrage zu Datensätzen, Evaluation, Chancen und Herausforderungen 《大语言模式时代中的实用模型:关于数据集、评价、机遇和挑战的调查》 2502.12378v3

Authors (10): Bolei Ma, Yuting Li, Wei Zhou, Ziwei Gong, Yang Janet Liu, Katja Jasinskaja, Annemarie Friedrich, Julia Hirschberg, Frauke Kreuter, Barbara Plank

Understanding pragmatics, the use of language in context, is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatic phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.

尽管语言技术(包括大型语言模型)最近有所进步,但评价其处理诸如隐含和参考等实际现象的能力仍具有挑战性。为了提高模型的实用能力,至关重要的是要了解当前的评价趋势,并查明现有的局限性。在这次调查中,我们全面审查了用于评价国家语言方案实际能力的资源,按照所处理的务实现象对数据集进行分类。我们分析了任务设计、数据收集方法、评价方法及其与现实世界应用的相关性。我们通过在现代语言模型中审查这些资源,我们强调现有基准中新出现的趋势、挑战和差距。我们的调查旨在澄清务实评价的格局,指导制定更全面和更有针对性的基准,最终有助于更加细致和符合实际情况的NLP模型。


Article 67

Title@2025-06-12 (4): Encoding call-by-push-value in the pi-calculus

Title: Encoding call-by-push-value in the pi-calculus Kodierung des Call-by-Push-Wertes im Pi-Calculus Pi-calcululus 中的编码调用 by-push- 值 2506.10584v1

Authors (3): Benjamin Bennetzen, Nikolaj Rossander Kristensen, Peter Buus Steffensen

In this report we define an encoding of Levy’s call-by-push-value lambda-calculus (CBPV) in the pi-calculus, and prove that our encoding is both sound and complete. We present informal (by-hand) proofs of soundness, completeness, and all required lemmas. The encoding is specialized to the internal pi-calculus (pi-i-calculus) to circumvent certain challenges associated with using de Bruijn indices in a formalization. This choice also helps with bisimulation, as early, late, and open bisimulation coincide in this setting; furthermore, bisimulation is a congruence. Additionally, we argue that our encoding satisfies the five criteria for good encodings proposed by Gorla, and we show similarities between Milner’s encoding and ours. This paper includes encodings of CBPV in the pi-i-calculus, the asynchronous polyadic pi-calculus, and the local pi-calculus. We begin a formalization of the proof in Coq for the soundness and completeness of the encoding in the pi-i-calculus. Not all lemmas used in the formalization are themselves formally proven. However, we argue that the non-proven lemmas are reasonable, as they are proven by hand, or amount to Coq formalities that are straightforward given informal arguments.

在本报告中,我们定义了Levys Call- by-push- vallib- vallibal alibda- calculus (CBPV) 的编码, 并证明我们的编码既健全又完整。 我们用非正式( 单手) 的证据证明了健全、 完整和所有需要的利玛斯。 编码是内部pi- calculus (pi- i- calculus) 专用的, 以避免在正式化中使用 de Bruijn 指数的某些挑战, 也有助于在这种环境下早期、 晚期和开放的振动中进行振动, 进一步的振动是一种和谐。 此外, 我们说, 我们的编码也符合Gorla 提议的良好编码的五条标准, 也显示了Milners 和我们编码之间的相似之处。 本文包括 CBBPV 在 pi- aliculus (pyynconronous monicalation pi- calulus) 和本地的缩略。 我们开始正式化的证明, 不是正式化的, 而正式化, 已经被正式化了。


Article 68

Title@2025-06-12 (4): BabyLM’s First Words: Word Segmentation as a Phonological Probing Task

Title: BabyLM’s First Words: Word Segmentation as a Phonological Probing Task BabyLMs erste Worte: Wortsegmentierung als phonologische Probing-Aufgabe BabyLM 的第一单词: 单词分割作为声学检测任务 2504.03338v3

Authors (2): Zébulon Goriely, Paula Buttery

Language models provide a key framework for studying linguistic theories based on prediction, but phonological analysis using large language models (LLMs) is difficult; there are few phonological benchmarks beyond English and the standard input representation used in LLMs (subwords of graphemes) is not suitable for analyzing the representation of phonemes. In this work, we demonstrate how word segmentation can be used as a phonological probing task, allowing us to study the representations learned by phoneme-based language models trained on child-directed speech across 31 languages. Following computational models of word segmentation, we present unsupervised methods for extracting word boundaries from a trained model using the observation that prediction-error peaks at the start of words. We also use linear probes to identify that these models implicitly track word boundaries, even when they do not appear in training. This cross-lingual work corroborates statistical learning theories of acquisition and empirically motivates new methods for training subword tokenizers.

语言模型为在预测的基础上研究语言理论提供了一个关键框架,但使用大型语言模型(LLMs)进行声学分析是困难的;除了英语之外,几乎没有什么声学基准,而LLMM(石墨的子字)使用的标准输入说明(LLMs)不适于分析电话的表达方式。在这项工作中,我们演示了如何将字分化用作声学研究任务,从而使我们能够研究通过在31种语言中以儿童为对象的演讲受过训练的基于电话语言模型所学到的表述方式。在计算单词分化模型之后,我们提出了从受过训练的模型中提取字界的方法,使用预测-危险高峰在字句开头的观察方法。我们还使用线性探测方法确定这些模型隐含的字界线,即使它们没有出现在培训中。这种跨语言的工作证实了获取的统计学习理论,并用经验激励了培训小字符号的新方法。
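The observation that prediction error peaks at word onsets suggests a simple unsupervised segmenter: place a boundary wherever per-symbol surprisal forms a local peak. The sketch below assumes surprisal values have already been obtained from a trained model; the peak rule, the function names, and the toy numbers are illustrative assumptions rather than the authors' exact procedure.

```python
def boundaries_from_peaks(surprisal):
    """Return indices where a word is assumed to start.

    Index 0 always starts a word; any later index counts as a boundary if its
    surprisal is higher than the previous symbol's and at least as high as the
    next symbol's (a local peak).
    """
    cuts = [0]
    for i in range(1, len(surprisal)):
        left = surprisal[i - 1]
        right = surprisal[i + 1] if i + 1 < len(surprisal) else float("-inf")
        if surprisal[i] > left and surprisal[i] >= right:
            cuts.append(i)
    return cuts

def segment(symbols, surprisal):
    """Split a symbol sequence at the boundaries implied by surprisal peaks."""
    if not symbols:
        return []
    cuts = boundaries_from_peaks(surprisal) + [len(symbols)]
    return [symbols[a:b] for a, b in zip(cuts, cuts[1:])]

# Toy example using letters in place of phonemes; surprisal values are made up.
words = segment("thedog", [3.0, 1.0, 0.5, 2.5, 0.8, 0.4])
```

With these made-up values the only interior peak is at index 3, so the utterance splits into "the" and "dog". Real surprisal curves are noisier, so a practical version would likely need a threshold or smoothing step.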


Article 69

Title@2025-06-12 (4): Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets

Title: Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets Menschliche und LLM-Biasen in Hate Speech Annotationen: Eine sozio-demographische Analyse von Annotatoren und Zielen 仇恨言论说明中的人类和LLM比阿斯语:对说明者和目标的社会-人口分析 2410.07991v6

Authors (5): Tommaso Giorgi, Lorenzo Cima, Tiziano Fagni, Marco Avvenuti, Stefano Cresci

The rise of online platforms exacerbated the spread of hate speech, demanding scalable and effective detection. However, the accuracy of hate speech detection systems heavily relies on human-labeled data, which is inherently susceptible to biases. While previous work has examined the issue, the interplay between the characteristics of the annotator and those of the target of the hate is still unexplored. We fill this gap by leveraging an extensive dataset with rich socio-demographic information of both annotators and targets, uncovering how human biases manifest in relation to the target’s attributes. Our analysis surfaces the presence of widespread biases, which we quantitatively describe and characterize based on their intensity and prevalence, revealing marked differences. Furthermore, we compare human biases with those exhibited by persona-based LLMs. Our findings indicate that while persona-based LLMs do exhibit biases, these differ significantly from those of human annotators. Overall, our work offers new and nuanced results on human biases in hate speech annotations, as well as fresh insights into the design of AI-driven hate speech detection systems.

在线平台的兴起加剧了仇恨言论的传播,要求推广和有效检测;然而,仇恨言论检测系统的准确性严重依赖人类标签数据,这在本质上容易产生偏差;虽然以前的工作审查了这一问题,但说明人的特点与仇恨目标的特点之间的相互作用仍未得到探讨;我们利用大量数据集来填补这一空白,该数据集包含丰富的说明人和目标的社会人口信息,揭示了与目标属性有关的人类偏见的表现形式;我们的分析显示,存在广泛的偏见,我们根据这些偏见的强度和普遍程度进行定量描述和定性,揭示了明显的差异;此外,我们比较了人与以人为基础的LLMs所展示的偏见。我们的研究结果表明,虽然以人为基础的LMS所展示的偏见与人所表现出的偏见大不相同,但这些差异与人所知者大不相同。总体而言,我们的工作在仇恨言论说明方面提供了人类偏见的新和细微的结果,并对由AI驱动的仇恨言论探测系统的设计提出了新的见解。


Article 70

Title@2025-06-12 (4): Reinforcing Multimodal Understanding and Generation with Dual Self-rewards

Title: Reinforcing Multimodal Understanding and Generation with Dual Self-rewards Verstärkung des multimodalen Verständnisses und der Erzeugung mit Dual Self-Rewards 加强多模式理解和多模式代代与双重奖赏 2506.07963v2

Authors (6): Jixiang Hong, Yiran Zhang, Guanzhong Wang, Yi Liu, Ji-Rong Wen, Rui Yan

Building upon large language models (LLMs), recent large multimodal models (LMMs) unify cross-model understanding and generation into a single framework. However, LMMs still struggle to achieve accurate image-text alignment and are prone to generating text responses that contradict the visual input or to failing to follow text-to-image prompts. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks, either understanding or generation. In this work, based on the observation that understanding and generation are inverse dual tasks, we introduce a self-supervised dual reward mechanism to reinforce the understanding and generation capabilities of LMMs. Specifically, we sample multiple outputs for a given input in one task domain, then reverse the input-output pairs to compute the dual likelihood of the model as self-rewards for optimization. Extensive experimental results on visual understanding and generation benchmarks demonstrate that our method can effectively enhance the performance of the model without any external supervision, especially achieving remarkable improvements in text-to-image tasks.

在大型语言模型(LLMS)的基础上,最近的大型多式联运模型(LMMs)将跨模式理解和生成统一为一个框架,然而,LMMs仍难以实现准确的图像-文本协调,容易产生与视觉输入相矛盾的文本响应,或者不遵循文字到图像提示;目前的解决办法需要外部监督(例如,人类反馈或奖励模式),而只处理了解或生成的单向任务;在这项工作中,基于理解和生成是反向双重任务的观察,我们引入了自我监督的双重奖赏机制,以加强LMMs的理解和生成能力;具体地说,我们为一个任务领域的特定投入抽样多项产出,然后将投入-产出对齐,将模型的双重可能性算作自我调整的优化。关于视觉理解和生成基准的广泛实验结果表明,我们的方法可以有效地提高模型的性能,而无需外部监督,特别是实现文本到图像任务的显著改进。


Article 71

Title@2025-06-12 (4): Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models

Title: Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models Obliviate: Effiziente Unvergesslichkeit für den Schutz geistigen Eigentums in großen Sprachmodellen 默认:在大语言模式中有效统一保护知识产权 2502.15010v2

Authors (2): Mark Russinovich, Ahmed Salem

Recent copyright agreements between AI companies and content creators underscore the need for fine-grained control over language models’ ability to reproduce copyrighted text. Existing defenses, ranging from aggressive unlearning to simplistic output filters, either sacrifice model utility or inadequately address verbatim leakage. We introduce Obliviate, a lightweight post-training method that surgically suppresses exact reproduction of specified sequences while preserving semantic understanding. Obliviate first identifies memorized passages and then, for each target token, minimally adjusts the model’s output distribution via a Kullback-Leibler divergence penalty to drive down the probability of exact reproduction. Simultaneously, we enforce a consistency loss on non-target tokens to retain the model’s fluency and task performance. We evaluate Obliviate on four popular 6-8B-parameter models (LLaMA-3.1, LLaMA-3.1-Instruct, Qwen-2.5, and Yi-1.5) using synthetic memorization benchmarks and organic copyrighted excerpts (e.g., Moby Dick, Frankenstein, Alice in Wonderland and Les Miserables). Across all settings, Obliviate reduces verbatim recall by two orders of magnitude (e.g., from hundreds of words to fewer than 12) while degrading downstream accuracy by at most 1% on HellaSwag, MMLU, TruthfulQA, and Winogrande. Furthermore, we benchmark Obliviate against different unlearning and copyright techniques using the MUSE and CoTaEval benchmarks. These results position Obliviate as a practical, high-fidelity solution for copyright compliance in deployed LLMs.

AI公司和内容创作者之间最近达成的版权协议强调,需要对语言模型复制版权文本的能力进行细微的控制。现有的防御措施包括:从主动的不学习到简单的产出过滤器,无论是牺牲模型效用还是不当的邮件渗漏。我们采用了轻巧的培训后训练方法Obliviate,在保持语义理解的同时,用外科方式抑制特定序列的复制。首先确定记忆化的段落,然后对每个目标标志,通过Kullback-Lebel差价罚款,对模型的输出分配进行最小调整,以降低精确复制的概率。与此同时,我们对非目标标志实施一致性损失,以保持模型的流畅和任务性。我们用四种受欢迎的 6-8B 参数模型(LLLAMA-3、LalaMA-3.1-Instruct、Qwen-2.5和Yy-1.5) 评估了四种受欢迎的6-8B参数模型(LLO-I-Instruct、Q-2.5和Y-1.5),用合成的数学标准基准和有机版权基准(例如,Moby Dick、Francel、Al-Al-Al-Al-Lislate 和Lislation) 和最低级的准确性排序,用两种标准,用最低级标准,用最低的顺序和最低的顺序,将这些结果和最低的精确性。


Article 72

Title@2025-06-12 (4): Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Title: Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning Erste Prüfung von Wissenschaftlern: Kognitive Fähigkeiten von MLLM durch Wahrnehmung, Verständnis und Vernunft unter Beweis stellen 科学家的第一次考试:通过感知、理解和理性,发现MLLM的认知能力 2506.10521v1

Authors (27): Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai

Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.

科学发现日益依赖基于信息密集型科学数据和特定领域专门知识的复杂多式联运推理。专家级科学基准、科学多式大语言模型(MLLM)赋予了专家级科学基准、科学多式大语言模型(MLLM)的权能,因此有可能在现实工作流程中大大加强这一发现过程。然而,目前的科学基准主要侧重于评价MLLM的知识理解能力,导致对其认识和推理能力评估不足。为弥补这一差距,我们提出了科学家首届Exam(SFE)基准,目的是通过三个相互关联的层面评估MLLMS的科学认知能力:科学信号感知、科学属性理解、科学比较推理。具体地说,SFE由经过专家验证的VQA对组组成,分三个问题类型830对,跨越五个高价值学科的66项多式任务。广泛的实验显示,目前最先进的GPT-o3和InternVL-3号科学发现只有34.08%和26.52%的成绩,突出MLMS在科学领域改进的重大空间。我们希望SLLMS获得的洞察力将有助于AI加强科学发现。我们希望。


Article 73

Title@2025-06-12 (4): Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs

Title: Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs Zuverlässiger Weg zur Vernunft: Destillieren effektiver Leitlinien für LLM-Reasoning mit Wissensgraphen 可靠理由说明:为学习图解的LLM 理由说明保留有效指导 2506.10508v1

Authors (6): Yilin Xiao, Chuang Zhou, Qinggang Zhang, Bo Li, Qing Li, Xiao Huang

Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is as important as factual knowledge itself. Despite their potential, extracting reliable reasoning paths from KGs poses two challenges: the complexity of graph structures, and the existence of multiple generated paths, which makes it difficult to distinguish useful paths from redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.

大型语言模型(LLMS)往往因为缺乏背景知识和产生幻觉的倾向而与知识密集型任务纠缠不休。为了解决这些局限性,已经深入研究了将知识图表(KGs)与LLMs相结合的问题。现有的KG-增强的LMSLML侧重于补充事实知识,但仍在努力解决复杂的问题。我们认为,改进事实之间的关系并将其组织成一个逻辑上一致的推理路径与事实知识本身同样重要。尽管它们具有潜力,但从KGs获取可靠的推理路径带来了以下挑战:图表结构的复杂性和多种生成路径的存在,使得难以区分有用和多余的路径。为了应对这些挑战,我们建议RPM框架将知识图表的语义优势与通过嵌入和双向分布学习获得的结构信息结合起来。此外,我们引入一个重新思考的模块,根据它们的重要性来评估和完善推理路径。两个公共数据集的实验结果显示,与现有基线方法相比,图表结构结构结构结构既复杂,又难以区分有用和多余的路径。为了应对这些挑战,我们提议RPM的RP框架框架框架框架框架框架框架框架框架框架框架框架,可以轻松地将高品质地纳入各种推理学,从而提升高水平,从而提升各种推理学。


Article 74

Title@2025-06-12 (4): Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps

Title: Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps Messung der Kette der Gedankentreue durch unlernende Vernunftschritte 通过 “ 不学习理性步骤 “ 衡量思考链的信念 2502.14829v2

Authors (4): Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasović, Yonatan Belinkov

When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. Despite much work on CoT prompting, it is unclear if reasoning verbalized in a CoT is faithful to the models’ parametric beliefs. We introduce a framework for measuring parametric faithfulness of generated reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an instance of this framework. FUR erases information contained in reasoning steps from model parameters, and measures faithfulness as the resulting effect on the model’s prediction. Our experiments with four LMs and five multi-hop multi-choice question answering (MCQA) datasets show that FUR is frequently able to precisely change the underlying models’ prediction for a given instance by unlearning key steps, indicating when a CoT is parametrically faithful. Further analysis shows that CoTs generated by models post-unlearning support different answers, hinting at a deeper effect of unlearning.

当人们逐步思考时,语言模型(LMS)产生一系列思维链(CoT),这是该模型据称用来作出预测的一系列推理步骤。尽管在CoT促动方面做了大量工作,但尚不清楚CoT中言语推理是否忠实于模型的参数信念。我们引入了一个框架,以衡量所产生推理的参数的准数忠诚性,并提出了不学习推理步骤(FUR)的忠实性。FUR删除了模型参数推理步骤中所包含的信息,并测量了作为模型预测结果的忠诚性。我们用四个LMS和五个多动多动多动回答(MCQA)的数据集进行的实验显示,FUR经常能够通过不学习关键步骤来精确改变某个特定例子的基本模型的预测,这说明CoT在何时具有偏向性的忠诚性。进一步的研究表明,模型后学习后生成的COTs支持不同的答案,暗示了不学习的更深效果。


Article 75

Title@2025-06-12 (4): Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models

Title: Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models Beyond Single-User Dialogue: Bewertung des Multi-User Dialogue State Tracking Fähigkeiten großer Sprachmodelle 超越单一用户对话:评估多用户对话国家跟踪大语言模式的能力 2506.10504v1

Authors (4): Sangmin Song, Juhwan Choi, JungMin Yun, YoungBin Kim

Large language models (LLMs) have demonstrated remarkable performance in zero-shot dialogue state tracking (DST), reducing the need for task-specific training. However, conventional DST benchmarks primarily focus on structured user-agent conversations, failing to capture the complexities of real-world multi-user interactions. In this study, we assess the robustness of LLMs in multi-user DST while minimizing dataset construction costs. Inspired by recent advances in LLM-based data annotation, we extend an existing DST dataset by generating utterances of a second user based on speech act theory. Our methodology systematically incorporates a second user’s utterances into conversations, enabling a controlled evaluation of LLMs in multi-user settings. Experimental results reveal a significant performance drop compared to single-user DST, highlighting the limitations of current LLMs in extracting and tracking dialogue states amidst multiple speakers. Our findings emphasize the need for future research to enhance LLMs for multi-user DST scenarios, paving the way for more realistic and robust DST models.

大型语言模型(LLMS)在零点对话状态跟踪(DST)方面表现显著,减少了对具体任务培训的需求;然而,常规DST基准主要侧重于结构化用户-代理对话,未能捕捉现实世界多用户互动的复杂性;在本研究中,我们评估了多用户DST中LLM的稳健性,同时尽量减少了数据集构建成本;在基于LLM的数据注释的最新进展的启发下,我们根据言论行为理论生成了第二个用户的言论,从而扩展了现有的DST数据集;我们的方法系统地将第二个用户的言论纳入谈话,从而得以对多用户环境中的LMS进行有控制的评价;实验结果显示,与单一用户DST相比,业绩显著下降,突出了当前LMS在提取和跟踪对话中的局限性;我们的调查结果强调,今后需要开展研究,以加强多用户DST假设情景的LMS,为更现实和健全的DST模式铺平了道路。


Article 76

Title@2025-06-12 (4): Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics

Title: Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics Achten Sie auf den Style Gap: Meta-Evaluation von Style und Attribut-Transfer-Metriken 思维风格差距:对样式和属性转移的元评价 2502.15022v3

Authors (3): Amalie Brogaard Pauli, Isabelle Augenstein, Ira Assent

Large language models (LLMs) make it easy to rewrite a text in any style – e.g. to make it more polite, persuasive, or more positive – but evaluation thereof is not straightforward. A challenge lies in measuring content preservation: that content not attributable to style change is retained. This paper presents a large meta-evaluation of metrics for evaluating style and attribute transfer, focusing on content preservation. We find that meta-evaluation studies on existing datasets lead to misleading conclusions about the suitability of metrics for content preservation. Widely used metrics show a high correlation with human judgments despite being deemed unsuitable for the task – because they do not abstract from style changes when evaluating content preservation. We show that the overly high correlations with human judgment stem from the nature of the test data. To address this issue, we introduce a new, challenging test set specifically designed for evaluating content preservation metrics for style transfer. Using this dataset, we demonstrate that suitable metrics for content preservation for style transfer indeed are style-aware. To support efficient evaluation, we propose a new style-aware method that utilises small language models, obtaining a higher alignment with human judgements than prompting a model of a similar size as an autorater.

大型语言模型(LLMS)使得以任何方式重写文本变得容易 – – 例如,使其更礼貌、更有说服力或更积极 – – 但对其的评价并非直截了当。一个挑战在于如何衡量内容的保存:保留了非因风格变化而导致的内容。本文对用于评价风格和属性转移的衡量标准进行了大规模元评价,重点是内容保护。我们发现,关于现有数据集的元评价研究导致关于内容保护指标是否适合内容保护的错误结论。广泛使用的衡量标准表明,尽管被认为不适合这项任务,但与人类判断有高度的相关性 – – 因为在评价内容保护时,它们并不从样式变化中抽取。我们显示,与人类判断的过度高的相关性源于测试数据的性质。为解决这一问题,我们引入了一套新的、具有挑战性的测试集,专门用来评价样式转移中的内容保存指标。我们利用这一数据集,证明用于风格传输内容保存的适当计量标准确实具有风格意识。为了支持高效率的评价,我们提出了一种具有风格觉悟的新方法,即使用小语言模型,获得与人类判断的高度一致性,而不是催化一个类似大小的模型。


Article 77

Title@2025-06-12 (4): TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding

Title: TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding TransXSSM: Ein Hybrid Transformer State Space Modell mit unified Rotary Position Einbettung TransXSSSSM:带有统一扶轮定位嵌入装置的混合变形国家空间模型 2506.09507v2

Authors (5): Bingheng Wu, Jingze Shi, Yifan Wu, Nan Tang, Yuyu Luo

Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongruity in their respective positional encoding mechanisms: Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs leverage implicit positional representations via convolutions. This divergence often precipitates discontinuities and suboptimal performance. To address this impediment, we propose a unified rotary position embedding (Unified RoPE) methodology, thereby establishing a consistent positional encoding framework for both self-attention and state-space components. Using this Unified RoPE, we introduce TransXSSM, a hybrid architecture that coherently integrates the Transformer and SSM layers under this unified positional encoding scheme. At a 4K sequence length, TransXSSM exhibits training and inference speeds that are 42.3% and 29.5% faster, respectively, relative to standard Transformer models. It also delivers higher accuracy: under comparable settings, it surpasses a Transformer baseline by over 4% on language modeling benchmarks. TransXSSM furthermore scales more effectively: TransXSSM-1.3B gains 7.22% in average accuracy over its 320M version (versus about 6% gains for equivalent Transformers or SSMs). Our results show that unified positional encoding resolves positional incompatibility in hybrid models, enabling efficient, high-performance long-context modeling.

Transformer 擅长捕捉长程依赖,而状态空间模型(SSM)支持线性时间的序列建模。尽管二者具有协同潜力,但将这两种架构整合起来仍是一项重大挑战,主要原因在于各自的位置编码机制存在根本性的不一致:Transformer 依赖显式的旋转位置嵌入(RoPE),而 SSM 则通过卷积利用隐式的位置表示。这种差异往往导致不连续和次优性能。为解决这一障碍,我们提出统一的旋转位置嵌入(Unified RoPE)方法,从而为自注意力和状态空间组件建立一致的位置编码框架。基于 Unified RoPE,我们提出 TransXSSM,这是一种在该统一位置编码方案下将 Transformer 层与 SSM 层一致整合的混合架构。在 4K 序列长度下,TransXSSM 的训练和推理速度分别比标准 Transformer 模型快 42.3% 和 29.5%。它的准确率也更高:在可比设置下,它在语言建模基准上超过 Transformer 基线 4% 以上。此外,TransXSSM 的扩展性更好:TransXSSM-1.3B 的平均准确率比其 320M 版本提高 7.22%(同等规模的 Transformer 或 SSM 的提升约为 6%)。我们的结果表明,统一的位置编码解决了混合模型中的位置不兼容问题,从而实现高效、高性能的长上下文建模。
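
For readers unfamiliar with rotary position embeddings, the sketch below shows the standard RoPE rotation that the TransXSSM abstract builds on. It is not the paper's Unified RoPE; the function names `rope` and `dot`, the base of 10000, and the toy vectors are assumptions chosen for illustration.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (standard RoPE)."""
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # The rotation frequency decreases geometrically with the pair index.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (illustrative values only).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Both pairs of positions have the same relative offset (3),
# so the two attention scores should agree up to rounding error.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
```

The property worth noticing is that the score depends on the offset between positions rather than on the absolute positions, which is what makes a shared rotary scheme a plausible common ground for attention and state-space layers.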


Article 78

Title@2025-06-12 (4): Towards Large Language Models with Self-Consistent Natural Language Explanations

Title: Towards Large Language Models with Self-Consistent Natural Language Explanations Auf dem Weg zu großen Sprachmodellen mit selbstkonsistenten natürlichen Spracherklärungen 努力建立具有自我联系自然语言解释的大型语言模式 2506.07523v2

Authors (4): Sahar Admoni, Ofra Amir, Assaf Hallak, Yftah Ziser

Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their decisions. Yet, studies show that these post-hoc explanations often misrepresent the true decision process, as revealed by mismatches in feature importance. Despite growing evidence of this inconsistency, no systematic solutions have emerged, partly due to the high cost of estimating feature importance, which limits evaluations to small datasets. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB) - a large-scale benchmark of decisions spanning diverse tasks and models, each paired with LLM-generated explanations and corresponding feature importance scores. Analysis of PSCB reveals that self-consistency scores barely differ between correct and incorrect predictions. We also show that the standard metric fails to meaningfully distinguish between explanations. To overcome this limitation, we propose an alternative metric that more effectively captures variation in explanation quality. We use it to fine-tune LLMs via Direct Preference Optimization (DPO), leading to significantly better alignment between explanations and decision-relevant features, even under domain shift. Our findings point to a scalable path toward more trustworthy, self-consistent LLMs.

大型语言模型(LLMS)似乎提供了一个容易解释的途径:只是要求它们解释其决定。然而,研究表明,这些事后拼凑解释往往曲解真正的决策程序,正如非常重要的不匹配所揭示的那样。尽管越来越多的证据表明这种不一致,但没有出现系统的解决办法,部分原因是估计特征重要性的成本很高,从而限制了对小数据集的评价。为了解决这个问题,我们引入了热后自闭式银行(PSCB)——一个涉及不同任务和模型的大型决定基准,每个决定都配以LLLM产生的解释和相应的特征重要性分数。对PSCB的分析表明,自我一致性的得分在正确和不正确预测之间几乎没有区别。我们还表明,标准指标无法对解释作出有意义的区分。为了克服这一限制,我们建议了一种替代指标,以更有效地捕捉解释质量上的差异。我们用它来通过直接的Preferation Optimination(DPO)来微调LMSMS,从而大大改进解释和决定相关特征之间的吻合,甚至在域内转移。我们的调查结果指出,我们发现,通向更可信、自我一致的路径。


Article 79

Title@2025-06-12 (4): Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models

Title: Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models Surface Fairness, Deep Bias: Eine vergleichende Studie über Bias in Sprachmodellen 地表公平、深比亚:语言模型比亚比较研究 2506.10491v1

Authors (4): Aleksandra Sorokovikova, Pavel Chizhov, Iuliia Eremenko, Ivan P. Yamshchikov

Modern language models are trained on large amounts of data. These data inevitably include controversial and stereotypical content, which contains all sorts of biases related to gender, origin, age, etc. As a result, the models express biased points of view or produce different results based on the assigned personality or the personality of the user. In this paper, we investigate various proxy measures of bias in large language models (LLMs). We find that evaluating models with pre-prompted personae on a multi-subject benchmark (MMLU) leads to negligible and mostly random differences in scores. However, if we reformulate the task and ask a model to grade the user’s answer, this shows more significant signs of bias. Finally, if we ask the model for salary negotiation advice, we see pronounced bias in the answers. With the recent trend for LLM assistant memory and personalization, these problems open up from a different angle: modern LLM users do not need to pre-prompt the description of their persona since the model already knows their socio-demographics.

现代语言模式在大量数据方面受过培训,这些数据不可避免地包括有争议的和陈规定型的内容,其中含有与性别、出身、年龄等相关的各种偏见。因此,模型根据用户的个性或个性表达了偏向观点或产生了不同的结果。在本文中,我们调查了大语言模式中各种偏见的代用衡量标准(LLMS)。我们发现,在多主题基准(MMLU)上对预先预测的属人模式进行评估会导致可忽略不计的和多数随机的得分差异。然而,如果我们重新配置任务并要求一个模型来评定用户的答案,这显示了更重大的偏见迹象。最后,如果我们要求薪金谈判建议模式,我们就会看到答案中的明显偏差。随着LLM助理记忆和个性化的最近趋势,这些问题从不同的角度出现:现代LLMM用户不需要预先说明其个人特征,因为模型已经了解他们的社会人口。


Article 80

Title@2025-06-12 (4): Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers

Title: Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers Tabelle-Text Alignment: Erklärung der Antragsprüfung gegen Tabellen in wissenschaftlichen Arbeiten 表-文字对齐:对照科学文件表格解释索赔核实 2506.10486v1

Authors (6): Xanh Ho, Sunisth Kumar, Yun-Ang Wu, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa

Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model’s reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.

对表格的科学索赔进行核实通常要求预测索赔是否得到支持或反驳,并给出一个表格。然而,我们争辩说,仅仅预测最后标签是不够的:它很少揭示模型的推理,并且提供了有限的解释性。为了解决这个问题,我们重新设计了表格文本的校准,作为解释性任务,要求模型确定核实索赔所必需的表格单元格。我们通过扩大SciTab基准和人文说明细胞层次的理由来建立一个新的数据集。警告者核查索赔标签并突出支持其决定所需的最起码的一组细胞。在批注过程之后,我们利用所收集的信息并提出处理模棱两可案例的分类方法。我们的实验表明,(一) 纳入表格校准信息可以提高索赔的核实性能,以及(二) 大多数LLMMs虽然经常预测正确的标签,但未能恢复与人性一致的理由,表明它们的预测并不来自忠实的推理。
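
The Table-Text Alignment abstract compares model-selected cells with human-annotated rationales. One plausible way to score such a comparison is a set-level F1 over cell coordinates. The sketch below assumes that metric; the abstract does not state which one the authors use, and `cell_f1` and the example coordinates are hypothetical.

```python
def cell_f1(predicted, gold):
    """Set-level F1 between predicted and gold (row, column) cell coordinates."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: annotators marked two cells as the minimal rationale,
# while a model highlighted four cells, two of which match.
gold_cells = {(1, 2), (3, 2)}
predicted_cells = {(1, 2), (2, 2), (3, 2), (3, 4)}
f1 = cell_f1(predicted_cells, gold_cells)
```

Under this scoring, a model that predicts the correct label but highlights many irrelevant cells is penalised through precision, which matches the abstract's distinction between correct labels and human-aligned rationales.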


Article 81

Title@2025-06-12 (4): IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

Title: IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language IndoToxic2024: Ein demographischer Datensatz von Hass-Sprach- und Toxizitätstypen für indonesische Sprache Indotoxic2024:印度尼西亚语仇恨言论和毒性类型人口资料集 2406.19349v2

Authors (7): Lucky Susanto, Musa Izzanardi Wijanarko, Prasetia Anugrah Pratama, Traci Hong, Ika Idris, Alham Fikri Aji, Derry Wijaya

Hate speech poses a significant threat to social harmony. Over the past two years, Indonesia has seen a ten-fold increase in the online hate speech ratio, underscoring the urgent need for effective detection mechanisms. However, progress is hindered by the limited availability of labeled data for Indonesian texts. The condition is even worse for marginalized minorities, such as Shia, LGBTQ, and other ethnic minorities because hate speech is underreported and less understood by detection tools. Furthermore, the lack of accommodation for subjectivity in current datasets compounds this issue. To address this, we introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset. Comprising 43,692 entries annotated by 19 diverse individuals, the dataset focuses on texts targeting vulnerable groups in Indonesia, specifically during the hottest political event in the country: the presidential election. We establish baselines for seven binary classification tasks, achieving a macro-F1 score of 0.78 with a BERT model (IndoBERTweet) fine-tuned for hate speech classification. Furthermore, we demonstrate how incorporating demographic information can enhance the zero-shot performance of the large language model, gpt-3.5-turbo. However, we also caution that an overemphasis on demographic information can negatively impact the fine-tuned model performance due to data fragmentation.

仇恨言论对社会和谐构成重大威胁。过去两年,印度尼西亚的网络仇恨言论比例增长了十倍,凸显了对有效检测机制的迫切需求。然而,印度尼西亚语文本的标注数据有限,阻碍了进展。对于什叶派、LGBTQ 和其他少数族裔等边缘化少数群体,情况更为糟糕,因为针对他们的仇恨言论报告不足,检测工具对其理解也较少。此外,当前数据集没有顾及主观性,使这一问题更加严重。为了解决这个问题,我们推出 IndoToxic2024,一个全面的印度尼西亚语仇恨言论和毒性分类数据集。该数据集包含由 19 位不同背景的标注者标注的 43,692 个条目,侧重于针对印度尼西亚弱势群体的文本,尤其是在该国最热门的政治事件即总统选举期间。我们为七项二元分类任务建立了基线,使用针对仇恨言论分类微调的 BERT 模型(IndoBERTweet)取得了 0.78 的 macro-F1 分数。此外,我们展示了纳入人口统计信息如何提升大型语言模型 gpt-3.5-turbo 的零样本性能。不过,我们也提醒,过分强调人口统计信息会因数据碎片化而对微调模型的性能产生负面影响。
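
The IndoToxic2024 abstract reports a macro-F1 score. As a reminder of what that metric measures, here is a minimal pure-Python sketch; the function name `macro_f1` and the toy labels are assumptions for illustration and are not taken from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (1 = hate speech, 0 = not hate speech), invented for illustration.
y_true = [1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]
score = macro_f1(y_true, y_pred)  # class 0 F1 = 0.75, class 1 F1 = 0.5
```

Because each class counts equally, macro-F1 does not let a large majority class mask poor performance on a rare one, which is relevant for an imbalanced task such as hate speech detection.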


Article 82

Title@2025-06-12 (4): VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Title: VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models VScan: Rethinking Visual Token Reduction für effiziente große Vision-Sprache Modelle Vscan:重新思考如何降低视力,以建立高效的大型视觉语言模型 2505.22654v2

Authors (10): Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu

Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91× speedup in prefilling and a 10× reduction in FLOPs, while retaining 95.4% of the original performance. Code is available at https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan.

近期的大型视觉语言模型(LVLM)通过引入更细粒度的视觉感知和编码,推进了多模态理解。然而,由于视觉 token 序列更长,这类方法带来了巨大的计算成本,给实时部署带来挑战。为缓解这一问题,以往研究探索了在视觉编码器的输出层或语言模型的早期层剪除不重要的视觉 token。在这项工作中,我们重新审视这些设计选择,并通过对视觉 token 在视觉编码和语言解码各阶段如何被处理的全面实证研究,重新评估其有效性。基于这些洞察,我们提出 VScan,一个两阶段的视觉 token 削减框架,通过以下方式解决 token 冗余:(1) 在视觉编码过程中将互补的全局扫描和局部扫描与 token 合并相结合;(2) 在语言模型的中间层引入剪枝。在四个 LVLM 上的大量实验结果验证了 VScan 在加速推理方面的有效性,并显示其在十六个基准上优于当前最先进的方法。值得注意的是,应用于 LLaVA-NeXT-7B 时,VScan 在预填充阶段实现了 2.91 倍的加速,FLOPs 降低 10 倍,同时保留了 95.4% 的原始性能。代码可在 https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan 获取。
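
The pruning step mentioned in the VScan abstract can be pictured as keeping only the highest-scoring visual tokens. The sketch below is a generic top-k pruning routine, not VScan's actual procedure; the importance scores, the `keep_ratio` parameter, and the function name `prune_tokens` are hypothetical.

```python
def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by importance, then restore sequence order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices]

# Placeholder tokens and made-up importance scores (e.g. attention received).
tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.6]
kept = prune_tokens(tokens, scores, keep_ratio=0.5)
```

Keeping the surviving tokens in their original order matters in practice because positional information downstream usually assumes the sequence order is intact.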


Article 83

Title@2025-06-12 (4): Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts

Title: Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts Auf dem Weg zur robusten multimodalen Emotionserkennung unter fehlenden Modalitäten und Verteilungsverschiebungen 争取在缺失模式和分销转移模式下强有力地承认多模式情感 2506.10452v1

Authors (5): Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen

Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.

多模态情感识别(MER)的最新进展在同时处理模态缺失和分布外(OOD)数据方面面临挑战。现有方法往往依赖特定模型或引入过多参数,从而限制了实用性。为了解决这些问题,我们提出一个新的鲁棒 MER 框架 Causal Inference Distiller(CIDer),并引入一项新任务,即随机模态特征缺失(RMFM),以推广模态缺失的定义。CIDer 集成了两个关键组件:模型特定的自蒸馏(MSSD)模块和模型无关的因果推断(MACI)模块。MSSD 通过在低层特征、注意力图和高层表示上应用权重共享的自蒸馏方法,增强 RMFM 任务下的鲁棒性。此外,词级自对齐注意力模块(WSAM)降低了计算复杂度,多模态复合 Transformer(MCT)则促进高效的多模态融合。为应对 OOD 挑战,MACI 采用定制的因果图,借助多模态因果模块(MCM)和细粒度反事实文本来减轻标签偏差和语言偏差。值得注意的是,MACI 可以在仅增加极少参数的情况下独立增强 OOD 泛化能力。此外,我们还引入了重新划分的 MER OOD 数据集。实验结果表明,与最先进的方法相比,CIDer 在 RMFM 和 OOD 场景下均取得了鲁棒的性能,且参数更少、训练更快。本工作的实现公开于 https://github.com/gw-zhong/CIDer 。


Article 84

Title@2025-06-12 (4): Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations

Title: Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations Social Bias Benchmark for Generation: Ein Vergleich von Generation und QA-basierten Bewertungen 面向生成的社会偏见基准:生成式评估与基于问答的评估之比较 2503.06987v2

Authors (4): Jiho Jin, Woosung Kang, Junho Myung, Alice Oh

Measuring social bias in large language models (LLMs) is crucial, but existing bias evaluation methods struggle to assess bias in long-form generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form generation by having LLMs generate continuations of story prompts. Building our benchmark in English and Korean, we measure the probability of neutral and biased generations across ten LLMs. We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.

衡量大语言模型的社会偏见至关重要,但现有的偏见评价方法难以评估长龄一代的偏见。 我们提议采用“代代比亚斯基准 ” ( BBG ) , 即“ 代代比亚斯基准 ” ( BBQ ) , 用于评估长龄一代的社会偏见, 由 LLM 产生故事的延续。 我们用英语和韩语建立基准,用10个LM衡量中性和有偏见的代代人的可能性。 我们还将长龄故事评价结果与多选择 BBQ 评估进行比较, 表明这两种方法产生不一致的结果。


Article 85

Title@2025-06-12 (4): Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty

Title: Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty Schnell auf dem Einfachen, Tief auf dem Harten: Effiziente Vernunft über Powered Length Penalty 快速快速执行 “ 容易 “ 、 “ 深沉:通过死刑有效解释理由 “ 2506.10446v1

Authors (6): Zehui Ling, Deshu Chen, Hongwei Zhang, Yifeng Jiao, Xin Guo, Yuan Cheng

Large language models (LLMs) have demonstrated significant advancements in reasoning capabilities, performing well on various challenging benchmarks. Techniques like Chain-of-Thought prompting have been introduced to further improve reasoning. However, these approaches frequently generate longer outputs, which in turn increase computational latency. Although some methods use reinforcement learning to shorten reasoning, they often apply uniform penalties without considering the problem’s complexity, leading to suboptimal outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by promoting conciseness on simpler problems while preserving enough reasoning on more complex ones to maintain accuracy, thus improving the model’s overall performance. Specifically, we manage the model’s reasoning efficiency by dividing the reward function and including a novel penalty for output length. Our approach has yielded impressive outcomes in benchmark evaluations across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively simpler datasets GSM8K and MATH500, our method has effectively shortened output lengths while preserving or enhancing accuracy. On the more demanding AIME2024 dataset, our approach has resulted in improved accuracy.

大型语言模型(LLMs)在推理能力方面取得了显著进步,在各种挑战性基准方面表现良好。例如“链式推理”等技术已经引入,以进一步改进推理。然而,这些方法经常产生较长的产出,从而增加计算时间。虽然有些方法使用强化学习来缩短推理时间,但它们往往在不考虑问题的复杂性的情况下适用统一的惩罚,导致不尽人意的结果。在这项研究中,我们力求提高LLM推理的效率,方法是促进简化问题的简明性,同时为更复杂的精确性保留足够的推理,从而改进模型的总体性能。具体地说,我们管理模型推理效率的方式是,对奖赏功能进行划分,并对产出长度规定新的惩罚。我们的方法在三个数据集(GSM8K、MATH500和AIME2024)的基准评价中取得了令人印象深刻的成果。对于比较简单的数据集GSM8K和MATH500,我们的方法有效地缩短了产出长度,同时保持或提高精确性。关于要求更高的 AIME2024数据设置,我们的方法提高了准确性。
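
The abstract above does not give the reward formula, so the following is only a hypothetical illustration of how a reward might combine a correctness term with a length penalty raised to a power. The function `shaped_reward`, the coefficient `alpha`, and the exponent `power` are all assumptions and should not be read as the authors' method.

```python
def shaped_reward(is_correct, num_tokens, alpha=0.001, power=0.5):
    """Hypothetical reward: a correctness term minus a powered length penalty."""
    accuracy_term = 1.0 if is_correct else 0.0
    # With power < 1 the penalty grows sublinearly, so each extra token
    # costs less as the output gets longer.
    length_term = alpha * (num_tokens ** power)
    return accuracy_term - length_term

short_answer = shaped_reward(True, 100)    # 1.0 - 0.001 * 10
medium_answer = shaped_reward(True, 400)   # 1.0 - 0.001 * 20
long_answer = shaped_reward(True, 700)
```

In this toy form, a short correct answer scores higher than a long correct one, yet the marginal cost of extra tokens shrinks with length, which is one way to avoid discouraging the extended reasoning that harder problems need.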


Article 86

Title@2025-06-12 (4): CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning

Title: CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning CheMatAgent: Verbesserung von LLMs für Chemie und Materialwissenschaft durch baumsuchebasiertes Tool Learning CheMatAgent:通过植树搜索工具学习加强化学和材料科学LLMs 2506.07551v2

Authors (10): Mengsong Wu, YaFei Wang, Yidong Ming, Yuqi An, Yuwei Wan, Wenliang Chen, Binbin Lin, Yuqiang Li, Tong Xie, Dongzhan Zhou

Large language models (LLMs) have recently demonstrated promising capabilities in chemistry tasks while still facing challenges due to outdated pretraining knowledge and the difficulty of incorporating specialized chemical expertise. To address these issues, we propose an LLM-based agent that synergistically integrates 137 external chemical tools, ranging from basic information retrieval to complex reaction predictions, and a dataset curation pipeline to generate the dataset ChemToolBench, which facilitates both effective tool selection and precise parameter filling during fine-tuning and evaluation. We introduce a Hierarchical Evolutionary Monte Carlo Tree Search (HE-MCTS) framework, enabling independent optimization of tool planning and execution. By leveraging self-generated data, our approach supports step-level fine-tuning (FT) of the policy model and training task-adaptive PRM and ORM that surpass GPT-4o. Experimental evaluations demonstrate that our approach significantly improves performance in Chemistry QA and discovery tasks, offering a robust solution to integrate specialized tools with LLMs for advanced chemical applications. All datasets and code are available at https://github.com/AI4Chem/ChemistryAgent.

大型语言模型(LLMS)最近表现出了在化学任务方面充满希望的能力,同时由于培训前知识过时和难以纳入专门的化学专门知识,仍然面临着挑战。为了解决这些问题,我们提议一个基于LLM的代理机构,以协同方式整合从基本信息检索到复杂反应预测等范围创建的137个外部化学工具,以及一个旨在生成数据集ChemToolBench的数据中心曲线管道,该平台既有利于在微调和评估期间有效选择工具,也有利于精确的参数填充。我们引入了一个等级进化蒙特卡洛树搜索框架,使工具规划和执行能够独立优化。我们的方法利用自我生成的数据,支持政策模型的级微调,以及培训任务适应性PRM和ORM,这超过了GPT-4o。实验性评估表明,我们的方法大大改进了化学质量和发现任务的业绩,为先进化学应用的专业工具与LMS整合提供了强有力的解决方案。所有数据集和代码都可在https://github.com/AI4chem/ChemAgentistry上查阅。


Article 87

Title@2025-06-12 (4): ConvD: Attention Enhanced Dynamic Convolutional Embeddings for Knowledge Graph Completion

Title: ConvD: Attention Enhanced Dynamic Convolutional Embeddings for Knowledge Graph Completion ConvD: Aufmerksamkeitsverstärkte dynamische konvolutionäre Einbettungen für die Wissensgraphenvervollständigung ConvD: 关注增强动态动态嵌入,以完成知识图的完成 2312.07589v2

Authors (7): Wenbin Guo, Zhao Li, Xin Wang, Zirui Chen, Jun Zhao, Jianxin Li, Ye Yuan

Knowledge graphs often suffer from incompleteness issues, which can be alleviated through information completion. However, current state-of-the-art deep knowledge convolutional embedding models rely on external convolution kernels and conventional convolution processes, which limits the feature interaction capability of the model. This paper introduces a novel dynamic convolutional embedding model, ConvD, which directly reshapes relation embeddings into multiple internal convolution kernels. This approach effectively enhances the feature interactions between relation embeddings and entity embeddings. Simultaneously, we incorporate a priori knowledge-optimized attention mechanism that assigns different contribution weight coefficients to the multiple relation convolution kernels in dynamic convolution, further boosting the expressive power of the model. Extensive experiments on various datasets show that our proposed model consistently outperforms the state-of-the-art baseline methods, with average improvements ranging from 3.28% to 14.69% across all model evaluation metrics, while the number of parameters is reduced by 50.66% to 85.40% compared to other state-of-the-art models.

知识图往往存在不完全的问题,这些问题可以通过信息完成而缓解。然而,目前最先进的知识深层集成模型依赖于外部的进化内核和常规的进化过程,这限制了模型的特性互动能力。本文介绍了一个新的动态进化嵌入模型,即ConvD,它直接将关系嵌入多个内部进化内核。这一方法有效地加强了关系嵌入和实体嵌入之间的特征互动。同时,我们引入了先天知识优化关注机制,为动态进化中的多个关系内核分配了不同的贡献权系数,进一步加强了模型的表达力。关于各种数据集的广泛实验表明,我们提议的模型始终超越了最新基线方法,在所有模型评价指标中,平均改进幅度从3.28%到14.69%不等,而参数数量则比其他状态模型减少50.66%到85.40%。


Article 88

Title@2025-06-12 (4): PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs

Title: PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs PAL: Probing Audio-Encoder über LLMs – Eine Studie über den Informationstransfer von Audio-Encodern zu LLMs PAL:通过LLMs探查音频成象器 – – 研究从音频成象器向LLMs传送信息 2506.10423v1

Authors (7): Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson, Imran Razzak, Muhammad Awais

The integration of audio perception capabilities into Large Language Models (LLMs) has enabled significant advances in Audio-LLMs. Although application-focused developments, particularly in curating training data for specific capabilities (e.g., audio reasoning), have progressed rapidly, the underlying mechanisms that govern efficient transfer of rich semantic representations from audio encoders to LLMs remain under-explored. We conceptualize effective audio-LLM interaction as the LLM’s ability to proficiently probe the audio encoder representations to satisfy textual queries. This paper presents a systematic investigation of how architectural design choices can affect that ability. Beginning with a standard Pengi/LLaVA-style audio-LLM architecture, we propose and evaluate several modifications guided by hypotheses derived from mechanistic interpretability studies and LLM operational principles. Our experiments demonstrate that: (1) delaying audio integration until the LLM’s initial layers have established textual context enhances its ability to probe the audio representations for relevant information; (2) the LLM can proficiently probe audio representations exclusively through the attention submodule of its layers, without requiring propagation to the Feed-Forward Network (FFN) submodule; (3) an efficiently integrated ensemble of diverse audio encoders provides richer, complementary representations, thereby broadening the LLM’s capacity to probe a wider spectrum of audio information. All hypotheses are evaluated using an identical three-stage training curriculum on a dataset of 5.6 million audio-text pairs, ensuring controlled comparisons. Our final architecture, which incorporates all proposed modifications, achieves relative improvements from 10% to 60% over the baseline, validating our approach to optimizing cross-modal information transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/

将音频感知能力集成到大型语言模型(LLM)中,使音频 LLM 取得了显著进展。尽管面向应用的开发进展迅速,特别是为特定能力(例如音频推理)整理训练数据,但支配丰富语义表示从音频编码器高效迁移到 LLM 的底层机制仍未得到充分探索。我们将有效的音频-LLM 交互概念化为 LLM 熟练探查音频编码器表示以满足文本查询的能力。本文系统研究了架构设计选择如何影响这一能力。从标准的 Pengi/LLaVA 式音频 LLM 架构出发,我们提出并评估了若干修改,这些修改以源自机制可解释性研究和 LLM 运行原理的假设为指导。我们的实验表明:(1) 将音频集成推迟到 LLM 的初始层建立起文本上下文之后,可增强其探查音频表示以获取相关信息的能力;(2) LLM 仅通过其层内的注意力子模块即可熟练探查音频表示,而无需传播到前馈网络(FFN)子模块;(3) 高效集成的多样化音频编码器集合可提供更丰富、互补的表示,从而拓宽 LLM 探查更广泛音频信息的能力。所有假设均在包含 560 万个音频-文本对的数据集上,使用相同的三阶段训练课程进行评估,以确保对比受控。我们的最终架构融合了所有提出的修改,相对基线取得了 10% 到 60% 的相对提升,验证了我们在音频 LLM 中优化跨模态信息迁移的方法。项目主页:https://ta012.github.io/PAL/


Article 89

Title@2025-06-12 (4): Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting

Title: Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting Jenseits des Schlachtfeldes: Framing Analyse der Medienabdeckung in der Konfliktberichterstattung 战场以外的战场:冲突报道中媒体报道的系统化分析 2506.10421v1

Authors (2): Avneet Kaur, Arnav Arora

Framing used by news media, especially in times of conflict, can have substantial impact on readers’ opinion, potentially aggravating the conflict itself. Current studies on the topic of conflict framing have limited insights due to their qualitative nature or only look at surface level generic frames without going deeper. In this work, we identify indicators of war and peace journalism, as outlined by prior work in conflict studies, in a corpus of news articles reporting on the Israel-Palestine war. For our analysis, we use computational approaches, using a combination of frame semantics and large language models to identify both communicative framing and its connection to linguistic framing. Our analysis reveals a higher focus on war based reporting rather than peace based. We also show substantial differences in reporting across the US, UK, and Middle Eastern news outlets in framing who the assailant and victims of the conflict are, surfacing biases within the media.

在这项工作中,我们确定战争与和平新闻指标,如冲突研究中以前的工作所概括的那样,在一系列报道以巴战争的新闻报道中加以说明。我们的分析采用计算方法,结合框架语义和大语言模式,确定沟通框架及其与语言框架的联系。我们的分析显示,我们更注重基于战争的报道,而不是基于和平的报道。我们还显示,美国、英国和中东的新闻媒体在描述冲突攻击者和受害者是谁方面存在着巨大的差异。


Article 90

Title@2025-06-12 (4): Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?

Title: Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences? Brennen Sie nach dem Lesen: Erfassen multimodale große Sprachmodelle wirklich die Reihenfolge der Ereignisse in Bildsequenzen? Burn after read: 多式大语言模型在图像序列中是否真的能捕捉事件秩序? 2506.10415v1

Authors (4): Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt

This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.

本文介绍TempVS基准,该基准侧重于图像序列中多模态大语言模型(MLLMs)的时间定位和推理能力。TempVS由三个主要测试(即事件关系推断、句子排序和图像排序)组成,每个测试都附有基本的定位测试。TempVS要求MLLMs依赖视觉和语言两种模态来理解事件的时间顺序。我们评估了38个最先进的MLLMs,表明模型难以解决TempVS,与人的能力相比存在巨大的性能差距。我们还提供了细致的洞察,为未来的研究提出了有希望的方向。我们的TempVS基准数据和代码可在https://github.com/yjsong22/TempVS上查阅。


Article 91

Title@2025-06-12 (4): Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series

Title: Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series Zeit-IMM: Ein Datensatz und Benchmark für irreguläre multimodale Multivariate Zeitreihen 时间-IMM:非正常多式联运多变时间序列的数据集和基准 2506.10412v1

Authors (7): Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, Wei Wang

Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series. Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms. Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation. IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies. Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance. Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions. The dataset is publicly available at https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the benchmark library can be accessed at https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.

在现实世界应用中的时间序列数据,如医疗保健、气候模型和融资,往往不规则、多模式和混乱,抽样率不同,不统一的方式和普遍缺失;然而,现有基准通常假定清洁、定期抽样、单一方式的数据,造成研究和现实世界部署之间的巨大差距;我们引入了时间-IMM,这是一个数据集,专门用来记录多式多变时间序列中由原因驱动的不规则性;时间-IMM是9种不同的时间序列,分为触发型、约束型和工艺型机制;补充数据集,我们引入IMM-TSF,这是一个用于预测不规则的多式联运时间序列的基准图书馆,能够实现非同步整合和现实评估;IMM-TSM-TSF包括专门的聚合模块,包括一个时间戳至文字融合模块和一个基于多式多变时间序列的多式联运模块;时间-观测平均和关注型整合型战略。
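The abstract above mentions recency-aware averaging as one way to fuse asynchronous observations. The sketch below shows one plausible form of that idea, an exponentially decayed weighted mean. The function name, the decay constant `tau`, and the exponential weighting are assumptions made for illustration; the abstract does not specify how IMM-TSF implements it.

```python
import math
from typing import List, Tuple


def recency_weighted_average(
    observations: List[Tuple[float, float]],  # (timestamp, value) pairs, irregularly spaced
    query_time: float,
    tau: float = 1.0,  # hypothetical decay constant: larger means older points matter more
) -> float:
    """Average past values, weighting each by exp(-(age) / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    weighted_sum = 0.0
    weight_total = 0.0
    for timestamp, value in observations:
        if timestamp > query_time:
            continue  # ignore observations from the future
        weight = math.exp(-(query_time - timestamp) / tau)
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no usable observations at or before query_time")
    return weighted_sum / weight_total


# An old reading (t=0) and a recent one (t=9), queried at t=10.
estimate = recency_weighted_average([(0.0, 0.0), (9.0, 10.0)], query_time=10.0)
print(round(estimate, 3))
```

With these toy inputs the recent reading dominates, so the estimate lands close to 10 rather than at the plain mean of 5.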


Article 92

Title@2025-06-12 (4): PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier

Title: PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier PAG: Multi-Turn verstärkt LLM Selbstkorrektion mit Politik als Generativer Prüfer PAG: 多发强化LLM自我校正,政策作为产生验证 2506.10406v1

Authors (8): Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, Yu Yue, Qianchuan Zhao, Lin Yan

Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG’s dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.

大型语言模型(LLMS)在复杂的推理任务中表现出了令人印象深刻的能力,然而,它们仍然在努力可靠地核实其自身产出的正确性。这一核查挑战的现有解决办法往往取决于单独的核查模型,或需要多阶段自我校正培训管道,这限制了可缩放性。在本文中,我们提议政策作为“创制验证器”(PAG),这是一个简单而有效的框架,它使LLMs能够通过在统一的多方向强化学习模式中交替政策和验证员的作用来自我纠正。与以往的做法不同,以往的做法总是产生第二次尝试,而不管模型是否有信心。 PAG采用了选择性的修订机制:模型只有在其自身的基因核查步骤发现错误时才修改其答复。这一校验后工作流程不仅缓解了模型的崩溃,而且还共同增强了推理和核查能力。 不同推理基准的广泛实验突出PAG的双重进步:作为一种政策,它能加强直接生成和自我校正的准确性;作为校验者,其自我核查超越了自我一致性。
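The verify-then-revise workflow described in the abstract above can be sketched as a small control loop. This is only an illustration of the selective-revision idea, not the authors' training procedure: `generate` and `verify` are hypothetical callables standing in for the same model acting in its policy and verifier roles, and `max_turns` is an assumed cap.

```python
from typing import Callable, List, Tuple


def verify_then_revise(
    question: str,
    generate: Callable[[str, List[str]], str],  # hypothetical policy role: (question, rejected answers) -> answer
    verify: Callable[[str, str], bool],         # hypothetical verifier role: True means "looks correct"
    max_turns: int = 3,                         # assumed cap on generation attempts
) -> Tuple[str, int]:
    """Return the final answer and the number of generation attempts used."""
    rejected: List[str] = []
    answer = generate(question, rejected)
    attempts = 1
    while attempts < max_turns:
        if verify(question, answer):
            break  # selective revision: an accepted answer is never regenerated
        rejected.append(answer)
        answer = generate(question, rejected)
        attempts += 1
    return answer, attempts


# Toy stand-ins for the model, used only to exercise the control flow.
def toy_generate(question: str, rejected: List[str]) -> str:
    return "5" if not rejected else "4"


def toy_verify(question: str, answer: str) -> bool:
    return answer == "4"


final_answer, attempts = verify_then_revise("2 + 2 = ?", toy_generate, toy_verify)
print(final_answer, attempts)
```

The contrast with always generating a second attempt is the early `break`: when the verifier accepts the first answer, the loop stops after one generation.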


Article 93

Title@2025-06-12 (4): iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering

Title: iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering iQUEST: Ein iteratives Frage-Framework für die Beantwortung von Fragen in der Wissensdatenbank iQUEST: 知识库问答的迭代式问题引导框架 2506.01784v2

Authors (2): Shuai Wang, Yinan Yu

While Large Language Models (LLMs) excel at many natural language processing tasks, they often suffer from factual inaccuracies in knowledge-intensive scenarios. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1) maintaining coherent reasoning paths, and (2) avoiding prematurely discarding critical multi-hop connections. To address these issues, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.

虽然大型语言模型(LLMs)在许多自然语言处理任务中非常出色,但往往在知识密集型情景中存在事实上的不准确之处。整合外部知识资源,特别是知识图谱(KGs),为更可靠的推理提供了透明且可更新的基础。在KGs上进行查询和推理的知识库问答(KBQA)对于这项工作至关重要,特别是对于复杂的多跳查询来说尤其如此。不过,多跳推理提出了两大挑战:(1) 保持连贯的推理路径,(2) 避免过早地抛弃关键的多跳连接。为了解决这些问题,我们引入了iQUEST,这是一个以问题为引导的KBQA框架,将复杂的查询迭代分解为更简单的子问题,确保推理轨迹有条理且重点突出。此外,我们整合了图神经网络(GNN),以向前看,在每个推理步骤都纳入2跳邻居信息。这一双重方法加强了推理过程,使模型能够更有效地探索可行的路径。详细实验表明,iQUEST在四个基准数据集和四个LLMs上带来了一致的改进。
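To make "2-hop neighbor information" in the abstract above concrete, the sketch below collects the entities exactly two edges away from a starting entity in a toy knowledge graph. It only illustrates the lookahead neighborhood; it is not the paper's GNN, and the triples, function names, and the choice to treat edges as undirected are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def build_adjacency(triples: List[Triple]) -> Dict[str, Set[str]]:
    """Index neighbors per entity, treating edges as undirected (an assumption)."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for head, _relation, tail in triples:
        adjacency[head].add(tail)
        adjacency[tail].add(head)
    return adjacency


def two_hop_neighbors(adjacency: Dict[str, Set[str]], entity: str) -> Set[str]:
    """Entities reachable in two edges but not in one, excluding the entity itself."""
    one_hop = set(adjacency.get(entity, set()))
    reachable: Set[str] = set()
    for neighbor in one_hop:
        reachable |= adjacency.get(neighbor, set())
    return reachable - one_hop - {entity}


# Toy knowledge graph, invented for this example.
toy_triples: List[Triple] = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "field", "Physics"),
]
graph = build_adjacency(toy_triples)
print(two_hop_neighbors(graph, "Marie Curie"))
```

In this toy graph the answer to "which country was Marie Curie born in?" sits two hops away, which is the kind of connection a one-hop view would discard.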


Article 94

Title@2025-06-12 (4): AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

Title: AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving AgentThink: Ein einheitliches Framework für Tool-Augmented Chain-of-Thought Reasoning in Vision-Language-Modellen für autonomes Fahren Agent Think: 自主驾驶愿景-语言模型中工具推荐研究链理由统一框架 2505.15298v3

Authors (21): Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang

Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink’s core innovations include: (i) Structured Data Generation, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model’s tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.

视觉-语言模型(VLMs)显示了自动驾驶的前景,然而,它们在幻觉、低效推理和有限的现实世界验证方面的困难妨碍了准确的感知和稳健的逐步推理。为了克服这一点,我们引入了AgentThink,这是一个开创性的统一框架,它首次将思维链(CoT)推理与动态的、代理式的工具调用结合起来,用于自动驾驶任务。AgentThink的核心创新包括:(一) 结构化数据生成,建立一个自动驾驶工具库,以自动构建结构化的、经自我验证的推理数据,明确纳入针对不同驾驶场景的工具使用;(二) 两阶段培训管道,使用监督微调(SFT)与组相对策略优化(GRPO),使VLMs具备自主调用工具的能力;(三) 代理式工具使用评估,引入新的多工具评估协议,以严格评价模型的工具调用和使用。在DriveLMM-o1基准上的实验表明,AgentThink将总体推理得分显著提高53.91%,将答案准确率提高33.54%,同时明显改善推理质量和一致性。此外,消融研究以及在多个基准上进行的稳健的零样本/少样本泛化实验凸显了其强大能力。这些发现为开发可信赖且具备工具意识的自动驾驶模型指明了一条有希望的路径。


Article 95

Title@2025-06-12 (4): On Many-Shot In-Context Learning for Long-Context Evaluation

Title: On Many-Shot In-Context Learning for Long-Context Evaluation Auf viel-heißes In-Context-Lernen für die Lang-Kontext-Evaluierung 为长期内容评价进行许多热的内文学习 2411.07130v3

Authors (3): Kaijian Zou, Muhammad Khalifa, Lu Wang

Many-shot in-context learning (ICL) has emerged as a unique setup to both utilize and test the ability of large language models to handle long context. This paper delves into long-context language model (LCLM) evaluation through many-shot ICL. We first ask: what types of ICL tasks benefit from additional demonstrations, and how effective are they in evaluating LCLMs? We find that classification and summarization tasks show performance improvements with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends. Next, we investigate the extent to which different tasks necessitate retrieval versus global context understanding. We develop metrics to categorize ICL tasks into two groups: (i) similar-sample learning (SSL): tasks where retrieval of the most similar examples is sufficient for good performance, and (ii) all-sample learning (ASL): tasks that necessitate a deeper comprehension of all examples in the prompt. Lastly, we introduce a new many-shot ICL benchmark, MANYICLBENCH, to characterize a model’s ability on both fronts, and we benchmark 12 LCLMs using MANYICLBENCH. We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.

多样本上下文学习(ICL)已成为一种独特的设置,既可利用也可测试大型语言模型处理长上下文的能力。本文通过多样本ICL深入研究长上下文语言模型(LCLM)的评价。我们首先问:哪些类型的ICL任务从更多的演示中受益,它们在评估LCLMs方面的效力如何?我们发现,分类和摘要任务随着演示增加而表现出性能改善,而翻译和推理任务没有表现出明确的趋势。接下来,我们调查不同任务在多大程度上需要检索,而非全局上下文理解。我们制定了指标,将ICL任务分为两类:(一) 相似样本学习(SSL):检索最相似的例子就足以取得良好性能的任务;(二) 全样本学习(ASL):需要更深入地理解提示中所有例子的任务。最后,我们引入了一个新的多样本ICL基准MANYICLBENCH,用以刻画模型在这两方面的能力,并使用MANYICLBENCH对12个LCLMs进行基准测试。我们发现,虽然最先进的模型在SSL任务中直到64k词元都表现良好,但许多模型在ASL任务中仅在16k词元时就出现显著的性能下降。


Article 96

Title@2025-06-12 (4): TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning

Title: TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning TableRAG: Ein Retrieval Augmented Generation Framework für heterogene Dokument-Reasoning 表RAG:异源文件说明理由的回收增加代际生成框架 2506.10380v1

Authors (3): Xiaohan Yu, Pu Jian, Chong Chen

Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, a hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state-of-the-art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.

在开放域问答中,检索增强生成(RAG)显示出相当的实效;然而,当应用到由文字和表格组成的异构文件时,现有的RAG方法显示出严重的局限性;平整表格和分块策略的普遍做法扰乱了内在的表格结构,导致信息丢失,并破坏了LLMs在多跳全局查询中的推理能力;为了应对这些挑战,我们提议TableRAG,这是一个统一文本理解和表格数据复杂操作的混合框架。TableRAG以迭代方式分四个步骤运行:上下文敏感的查询分解、文本检索、SQL编程和执行以及组合式中间答案生成。我们还开发了HeteQA,这是一个旨在评估多跳异构推理能力的新基准。实验结果表明,TableRAG在公共数据集和我们的HeteQA上始终超越现有基线,为异构文件问答建立了新的最先进水平。我们在 https://github.com/yxh-y/TableRAG/tree/main 发布TableRAG。


Article 97

Title@2025-06-12 (4): Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning

Title: Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning Hierarchische Latentenfähigkeiten von Sprachmodellen über das kausale Repräsentationslernen entdecken 通过因果代表制学习发现语言模式的分级本端能力 2506.10378v1

Authors (4): Jikai Jin, Vasilis Syrgkanis, Sham Kakade, Hanlin Zhang

Faithful evaluation of language model capabilities is crucial for deriving actionable insights that can inform model development. However, rigorous causal evaluations in this domain face significant methodological challenges, including complex confounding effects and prohibitive computational costs associated with extensive retraining. To tackle these challenges, we propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Crucially, these latent factors are identified as causally interrelated after appropriately controlling for the base model as a common confounder. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks from the Open LLM Leaderboard, we identify a concise three-node linear causal structure that reliably explains the observed performance variations. Further interpretation of this causal structure provides substantial scientific insights beyond simple numerical rankings: specifically, we reveal a clear causal direction starting from general problem-solving capabilities, advancing through instruction-following proficiency, and culminating in mathematical reasoning ability. Our results underscore the essential role of carefully controlling base model variations during evaluation, a step critical to accurately uncovering the underlying causal relationships among latent model capabilities.

对语言模型能力的忠实评价对于得出可用于模型开发的可操作的洞察力至关重要。然而,这一领域的严格的因果评估面临着重大的方法挑战,包括复杂的混乱效应和与广泛再培训有关的令人难以承受的计算费用。为了应对这些挑战,我们提议了一个因果代表性学习框架,其中所观察到的基准业绩以几个潜在能力因素的线性转变为模型。关键是,这些潜在因素在适当控制基础模型后被确定为因果性关联性,而作为共同的混淆因素。将这一方法应用于一个包含1500多个模型的综合数据集,该数据集由开放 LLM 领头板的六个基准进行评估,我们确定了一个简洁的三点线性因果结构,可靠地解释了观察到的业绩差异。对这一因果结构的进一步解释提供了超出简单数字排序以外的大量科学洞察力:具体地说,我们揭示了一个明确的因果性方向,从一般解决问题的能力出发,通过以熟练的教学方式推进,以数学推理能力为顶峰。我们的结果强调了在评估期间仔细控制基础模型变换的基本作用,这是准确发现潜在模型能力之间潜在因果关系的关键一步。
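The modeling assumption described in the abstract above can be written compactly. The notation below ($x$, $A$, $z$, $\varepsilon$, $b_{21}$, $b_{32}$, $e_2$, $e_3$) is introduced here for illustration and is not taken from the paper. The chain shape follows the abstract's description of the three-node structure; whether further edges exist, and how the base-model confounder enters, is not stated there and is left out.

```latex
% x: scores on the six benchmarks; z: three latent capability factors (assumed notation)
\begin{aligned}
x &= A z + \varepsilon, & x &\in \mathbb{R}^{6},\; z \in \mathbb{R}^{3},\; A \in \mathbb{R}^{6 \times 3},\\
z_2 &= b_{21}\, z_1 + e_2, & z_3 &= b_{32}\, z_2 + e_3,
\end{aligned}
```

Here $z_1$ stands for general problem-solving, $z_2$ for instruction-following, and $z_3$ for mathematical reasoning, so the causal direction reads $z_1 \rightarrow z_2 \rightarrow z_3$.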


Article 98

Title@2025-06-12 (4): A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce

Title: A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce Ein minimalistischer Ansatz zur LLM-Vernunft: von der Abstoßung zur Verstärkung LLM推理的极简方法:从拒绝抽样到Reinforce 2504.11343v2

Authors (11): Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong

Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical success in training models such as DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. Surprisingly, we find that a simple rejection sampling baseline, RAFT, which trains only on positively rewarded samples, yields performance competitive with GRPO and PPO. Our ablation studies reveal that GRPO’s main advantage arises from discarding prompts with entirely incorrect responses, rather than from its reward normalization. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples. Reinforce-Rej improves KL efficiency and stability, serving as a lightweight yet effective alternative to more complex RL algorithms. We advocate RAFT as a robust and interpretable baseline, and suggest that future advances should focus on more principled designs for incorporating negative samples, rather than relying on them indiscriminately. Our findings provide guidance for future work in reward-based LLM post-training.

强化学习(RL)已成为在复杂的推理任务上微调大型语言模型(LLMs)的普遍做法。在最近的方法中,GRPO因在DeepSeek-R1等模型的训练中取得的实证成功而突出,但是其效力的来源仍然不甚为人理解。在这项工作中,我们从类Reinforce算法的角度重新研究GRPO,并分析其核心组成部分。令人惊讶的是,我们发现一个简单的拒绝抽样基线RAFT,它仅对获得正向奖励的样本进行训练,却能取得与GRPO和PPO相当的性能。我们的消融研究表明,GRPO的主要优势来自于丢弃回答完全错误的提示,而不是其奖励归一化。受此启发,我们提出Reinforce-Rej,这是策略梯度的最小扩展,它同时过滤完全错误和完全正确的样本。Reinforce-Rej提高了KL效率和稳定性,可作为较复杂的RL算法的轻量而有效的替代方法。我们主张将RAFT作为稳健且可解释的基线,并建议今后的进展应侧重于以更有原则的设计纳入负样本,而不是不加区分地依赖它们。我们的发现为基于奖励的LLM后训练的未来工作提供了指导。
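The two sample-selection rules described in the abstract above are simple enough to sketch. The code below covers only the filtering step, under the assumption of binary rewards (1 for a correct response, 0 otherwise); the function names and the data layout are hypothetical, and the policy-gradient update itself is not shown.

```python
from typing import Dict, List, Tuple


def raft_positive_samples(samples: List[Tuple[str, int]]) -> List[str]:
    """RAFT-style selection: keep only responses with a positive reward."""
    return [response for response, reward in samples if reward > 0]


def reinforce_rej_filter(rewards_by_prompt: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Reinforce-Rej-style filtering: drop prompts whose sampled responses are
    all incorrect or all correct, keeping only prompts with mixed outcomes."""
    kept: Dict[str, List[int]] = {}
    for prompt, rewards in rewards_by_prompt.items():
        n_correct = sum(1 for reward in rewards if reward > 0)
        if 0 < n_correct < len(rewards):
            kept[prompt] = rewards
    return kept


# Toy rewards for four sampled responses per prompt (invented numbers).
toy_rewards = {
    "prompt_all_wrong": [0, 0, 0, 0],
    "prompt_all_right": [1, 1, 1, 1],
    "prompt_mixed": [1, 0, 0, 1],
}
print(reinforce_rej_filter(toy_rewards))
print(raft_positive_samples([("answer A", 1), ("answer B", 0)]))
```

Only the mixed prompt survives the Reinforce-Rej filter, while RAFT-style selection works at the level of individual responses.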


Article 99

Title@2025-06-12 (4): Can We Infer Confidential Properties of Training Data from LLMs?

Title: Can We Infer Confidential Properties of Training Data from LLMs? Können wir vertrauliche Eigenschaften von Trainingsdaten von LLMs ableiten? 我们能否从LLMS中推断培训数据的机密性? 2506.10364v1

Authors (4): Pengrun Huang, Chhavi Yadav, Ruihan Wu, Kamalika Chaudhuri

Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties – such as patient demographics or disease prevalence – that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.

大型语言模型(LLMS)日益在特定领域数据集上进行微调,以支持保健、金融和法律等领域的应用;这些微调数据集往往具有敏感和保密的数据集级特性,如病人人口或疾病流行,这些性质并不打算披露;虽然以前的工作研究过对歧视模型(例如图像分类模型)和基因模型(例如图像数据GANs)进行的财产推断攻击,但仍然不清楚这种攻击是否转移到LMS;在这项工作中,我们引入PropInfer,这是在两个微调模式下评估LLMS财产推断的基准任务:回答问题和完成聊天。在ChatDoctor数据集上,我们的基准包括一系列财产类型和任务配置。我们进一步建议了两种有针对性的攻击:即时生成攻击和影子模型攻击,利用文字频率信号。在多个受过训练的LMS中进行的经验性评价显示我们攻击的成功,揭示了LMS中先前未确认的脆弱性。


Article 100

Title@2025-06-12 (4): CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

Title: CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models CAF-I: Ein kollaboratives Multi-Agent-Framework für eine verbesserte Ironieerkennung mit großen Sprachmodellen CAF-I:利用大语言模型增强反讽检测的协作式多智能体框架 2506.08430v2

Authors (3): Ziqi Liu, Ziyang Zhou, Mingxuan Hu

Large language models (LLMs) have become mainstream methods in the field of sarcasm detection. However, existing LLM methods face challenges in irony detection, including (1) single-perspective limitations, (2) insufficient comprehensive understanding, and (3) a lack of interpretability. This paper introduces the Collaborative Agent Framework for Irony (CAF-I), an LLM-driven multi-agent system designed to overcome these issues. CAF-I employs specialized agents for Context, Semantics, and Rhetoric, which perform multidimensional analysis and engage in interactive collaborative optimization. A Decision Agent then consolidates these perspectives, with a Refinement Evaluator Agent providing conditional feedback for optimization. Experiments on benchmark datasets establish CAF-I’s state-of-the-art zero-shot performance. Achieving SOTA on the vast majority of metrics, CAF-I reaches an average Macro-F1 of 76.31, a 4.98-point absolute improvement over the strongest prior baseline. This success is attained by its effective simulation of human-like multi-perspective analysis, enhancing detection accuracy and interpretability.

大型语言模型(LLM)已成为讽刺检测领域的主流方法,然而,现有的LLM方法在反讽检测方面面临着挑战,包括:1. 单一视角的局限,2. 全面理解不足,3. 缺乏可解释性;本文件介绍了旨在解决这些问题的LLM驱动的多智能体系统:反讽协作智能体框架(CAF-I);CAF-I采用了上下文、语义和修辞等专门智能体,进行多维度分析,并进行交互式协作优化;然后,一个决策智能体将这些观点合并起来,并由一个精炼评估智能体为优化提供有条件反馈;基准数据集上的实验确立了CAF-I最先进的零样本性能;CAF-I在绝大多数指标上实现SOTA,平均Macro-F1达到76.31,比以前最强的基线有4.98的绝对改进;这一成功得益于其对类人多视角分析的有效模拟,提高了检测准确性和可解释性。


Article 101

Title@2025-06-12 (4): Improving Fairness of Large Language Models in Multi-document Summarization

Title: Improving Fairness of Large Language Models in Multi-document Summarization Verbesserung der Fairness von großen Sprachmodellen in Multi-Dokument-Zusammenfassung 提高多文件总结中大语言模式的公平性 2506.07479v2

Authors (3): Haoyuan Li, Rui Zhang, Snigdha Chaturvedi

Fairness in multi-document summarization (MDS) is crucial for providing comprehensive views across documents with diverse social attribute values, which can significantly impact decision-making. For example, a summarization system that tends to overrepresent negative reviews of products can mislead customers into disregarding good products. Previous works measure fairness in MDS at two levels: summary-level and corpus-level. While summary-level fairness focuses on individual summaries, corpus-level fairness focuses on a corpus of summaries. Recent methods primarily focus on summary-level fairness. We propose FairPO, a preference tuning method that focuses on both summary-level and corpus-level fairness in MDS. To improve summary-level fairness, we propose to generate preference pairs by perturbing document sets. To improve corpus-level fairness, we propose fairness-aware preference tuning by dynamically adjusting the weights of preference pairs. Our experiments show that FairPO outperforms strong baselines while maintaining the critical qualities of summaries. The code is available at https://github.com/leehaoyuan/coverage_fairnes.

多文件摘要(MDS)的公平性对于提供各种具有不同社会属性价值的文件的全面观点至关重要,这可以对决策产生重大影响。例如,一个倾向于过多地对产品进行负面审查的汇总系统可以误导客户忽视好产品。以前的工作在两个层次上衡量MDS的公平性:摘要一级和主体一级。虽然摘要一级的公平性侧重于个人摘要,但程序一级的公平性侧重于摘要汇编。最近的方法主要侧重于摘要一级的公平性。我们提议了FairPO,即偏好调法,既注重摘要一级,又注重材料一级的公平性。为了改善摘要一级的公平性,我们提议通过随机的文档集来产生优先配对。为了改善标准一级的公平性,我们建议通过动态调整优惠的权重来调整公平性偏重。我们的实验显示,FairPO在维持摘要的关键质量的同时,超越了强有力的基线。代码见https://github.com/leehaiyuan/coverage_fairnes。


Article 102

Title@2025-06-12 (4): SCORE: Story Coherence and Retrieval Enhancement for AI Narratives

Title: SCORE: Story Coherence and Retrieval Enhancement for AI Narratives SCORE: Story-Kohärenz und Retrieval-Verbesserung für KI-Erzählungen SCORE: AI叙事的故事连贯性与检索增强 2503.23512v4

Authors (14): Qiang Yi, Yangfan He, Jianhui Wang, Xinyuan Song, Shiyao Qian, Xinhang Yuan, Li Sun, Yi Xin, Jingqun Tang, Keqin Li, Kuan Lu, Menghao Huo, Jiaqi Chen, Tianyu Shi

Large Language Models (LLMs) can generate creative and engaging narratives from user-specified input, but maintaining coherence and emotional depth throughout these AI-generated stories remains a challenge. In this work, we propose SCORE, a framework for Story Coherence and Retrieval Enhancement, designed to detect and resolve narrative inconsistencies. By tracking key item statuses and generating episode summaries, SCORE uses a Retrieval-Augmented Generation (RAG) approach, incorporating TF-IDF and cosine similarity to identify related episodes and enhance the overall story structure. Results from testing multiple LLM-generated stories demonstrate that SCORE significantly improves the consistency and stability of narrative coherence compared to baseline GPT models, providing a more robust method for evaluating and refining AI-generated narratives.

大型语言模型(LLMs)能够从用户指定的投入中产生创造性和有吸引力的叙述,但在AI产生的这些故事中保持连贯性和情感深度仍然是一个挑战。在这项工作中,我们提议SCORE,这是一个旨在探测和解决叙述不一致现象的Story一致性和检索增强框架。通过跟踪关键项目状况和生成集成摘要,SCORE使用回收型一代(RAG)方法,纳入TF-IDF和串通型,以识别相关事件并加强总体故事结构。测试多个LLM产生的故事的结果表明,SCORE大大改进了与基准GPT模型相比叙述一致性和稳定性,为评估和完善AI产生的叙述提供了更强有力的方法。
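The retrieval step mentioned in the abstract above (TF-IDF plus cosine similarity over episode summaries) can be illustrated with a small standard-library sketch. The whitespace tokenizer, the smoothed IDF formula, and the function names are assumptions chosen for a self-contained example; the abstract does not describe SCORE's actual implementation.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()  # deliberately naive tokenizer (assumption)


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors using a smoothed IDF: log((1 + N) / (1 + df)) + 1."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 0.0 if norm_a == 0.0 or norm_b == 0.0 else dot / (norm_a * norm_b)


def most_similar_episode(summaries: List[str], query: str) -> int:
    """Index of the episode summary closest to the query."""
    vectors = tfidf_vectors(summaries + [query])
    query_vector = vectors[-1]
    scores = [cosine(vector, query_vector) for vector in vectors[:-1]]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy episode summaries, invented for this example.
episode_summaries = [
    "the knight loses the silver sword in the river",
    "the queen hosts a feast in the castle",
    "a merchant sells maps at the market",
]
print(most_similar_episode(episode_summaries, "where is the silver sword"))
```

A retrieved episode like this could then be fed back to the model to check whether a later scene contradicts the tracked state of an item.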


Article 103

Title@2025-06-12 (4): The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets

Title: The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets Das automatisierte, aber riskante Spiel: Modellierung von Agent-zu-Agent-Verhandlungen und Transaktionen in Verbrauchermärkten 自动但有风险游戏:消费者市场代理对代理谈判和交易的模拟 2506.00073v2

Authors (6): Shenzhe Zhu, Jiao Sun, Yi Nian, Tobin South, Alex Pentland, Jiaxin Pei

AI agents are increasingly used in consumer-facing applications to assist with tasks such as product search, negotiation, and transaction execution. In this paper, we explore a future scenario where both consumers and merchants authorize AI agents to fully automate negotiations and transactions. We aim to answer two key questions: (1) Do different LLM agents vary in their ability to secure favorable deals for users? (2) What risks arise from fully automating deal-making with AI agents in consumer markets? To address these questions, we develop an experimental framework that evaluates the performance of various LLM agents in real-world negotiation and transaction settings. Our findings reveal that AI-mediated deal-making is an inherently imbalanced game – different agents achieve significantly different outcomes for their users. Moreover, behavioral anomalies in LLMs can result in financial losses for both consumers and merchants, such as overspending or accepting unreasonable deals. These results underscore that while automation can improve efficiency, it also introduces substantial risks. Users should exercise caution when delegating business decisions to AI agents.

AI代理越来越多地被用于面向消费者的应用程序,以协助完成产品搜索、谈判和交易执行等任务。在本文件中,我们探讨了消费者和商人授权AI代理使谈判和交易完全自动化的未来情景。我们的目标是回答两个关键问题:(1) 不同的LLM代理在为用户争取优惠交易的能力方面是否各不相同?(2) 在消费者市场上与AI代理进行完全自动化交易会产生什么风险?为了解决这些问题,我们制定了一个实验框架,评估各种LLM代理在现实世界谈判和交易环境中的表现。我们的调查结果显示,AI中介的交易是一种固有的不平衡游戏,不同的代理为其用户取得了显著不同的结果。此外,LLMs的行为异常可能会给消费者和商人造成财务损失,例如过度支出或接受不合理的交易。这些结果强调,自动化可以提高效率,但也带来很大风险。用户在将商业决定委托给AI代理时,应该谨慎行事。


Article 104

Title@2025-06-12 (4): An Analysis of Datasets, Metrics and Models in Keyphrase Generation

Title: An Analysis of Datasets, Metrics and Models in Keyphrase Generation Eine Analyse von Datensätzen, Metrics und Modellen in der Keyphrase-Generierung 对关键词生成中的数据集、计量和模型的分析 2506.10346v1

Authors (2): Florian Boudin, Akiko Aizawa

Keyphrase generation refers to the task of producing a set of words or phrases that summarises the content of a document. Continuous efforts have been dedicated to this task over the past few years, spreading across multiple lines of research, such as model architectures, data resources, and use-case scenarios. Yet, the current state of keyphrase generation remains unknown as there has been no attempt to review and analyse previous work. In this paper, we bridge this gap by presenting an analysis of over 50 research papers on keyphrase generation, offering a comprehensive overview of recent progress, limitations, and open challenges. Our findings highlight several critical issues in current evaluation practices, such as the concerning similarity among commonly-used benchmark datasets and inconsistencies in metric calculations leading to overestimated performances. Additionally, we address the limited availability of pre-trained models by releasing a strong PLM-based model for keyphrase generation as an effort to facilitate future research.

关键词组生成是指制作一套概述文件内容的词句或短语的任务。过去几年来,我们一直致力于这项工作,在多种研究领域,如模型结构、数据资源、使用情景等,分布在不同的研究领域;然而,关键词组生成的现状仍然未知,因为没有试图审查和分析以前的工作。在本文件中,我们通过分析50多份关于关键词生成的研究文件来弥补这一差距,全面概述了最近的进展、限制和公开挑战。我们的调查结果突出了当前评价工作中的若干关键问题,例如常用基准数据集的相似性,以及导致高估业绩的计量计算不一致。此外,我们通过为关键词生成发布一个强大的基于PLM的模型,努力便利未来研究,解决事先培训的模型有限问题。


Article 105

Title@2025-06-12 (4): Code Execution as Grounded Supervision for LLM Reasoning

Title: Code Execution as Grounded Supervision for LLM Reasoning Code-Execution als geerdete Überwachung für LLM-Reasoning 法规执行作为LLM理由的有限制的监督 2506.10343v1

Authors (3): Dongwon Jung, Wenxuan Zhou, Muhao Chen

Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into a natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.

培训大型语言模型(LLMs)进行思维链(Cot)监督已证明对提高推理能力十分有效,然而,获得可靠和准确的推理监督仍是一项重大挑战。我们提出了一个可推广的方法,通过利用程序执行的决定性因素生成高质量的CAT监督数据集。与现有的依赖昂贵的人类说明或易出错的LLM生成COT的推理方法不同,我们的方法从代码执行中提取了可核实的、逐步推理的痕迹,并将其转化为自然语言的Cot推理。对不同领域的推理基准的实验表明,我们的方法有效地使LITM具备了跨越不同任务的可转移推理能力。此外,通缩研究证实,我们的方法通过减少毫无意义的重复和过度思考,产生高度准确的推理数据,并在推断过程中减少总体象征性的长度。
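The core idea in the abstract above, that program execution yields verifiable step-by-step traces, can be illustrated with Python's built-in tracing hook. The sketch below records local variables before each executed line of a function and renders them as plain sentences. It is only a toy illustration of trace extraction; the helper names and the sentence template are assumptions, and the paper's pipeline for turning traces into natural-language CoT is not reproduced here.

```python
import sys
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[int, Dict[str, Any]]  # (line offset within the function, snapshot of locals)


def trace_locals(func: Callable[..., Any], *args: Any) -> Tuple[Any, List[Step]]:
    """Run func(*args) and record its local variables before each executed line."""
    steps: List[Step] = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            steps.append((frame.f_lineno - code.co_firstlineno, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active before
    return result, steps


def render_steps(steps: List[Step]) -> List[str]:
    """Turn raw snapshots into simple sentences (hypothetical template)."""
    return [
        f"Step {index}: before line +{offset}, the variables are {variables}."
        for index, (offset, variables) in enumerate(steps, start=1)
    ]


def running_total(values):
    total = 0
    for value in values:
        total += value
    return total


result, steps = trace_locals(running_total, [1, 2])
for sentence in render_steps(steps):
    print(sentence)
```

Because every snapshot comes from actual execution, each rendered step can be checked against the program's real state, which is the property the abstract contrasts with error-prone LLM-generated reasoning.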


Article 106

Title@2025-06-12 (4): Provably Learning from Language Feedback

Title: Provably Learning from Language Feedback Wahrscheinlich von Sprachfeedback lernen 从语言反馈中学习 2506.10341v1

Authors (6): Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng

Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a complexity measure to characterize the hardness of LLF problems. We show that transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.

从观察和语言反馈中相互学习是一个日益受大型语言模式(LLM)代理商的出现所驱动的研究领域。虽然已经展示了令人印象深刻的经验示范,但迄今为止仍然缺乏对这些决定问题的有原则框架。在本文中,我们将语言反馈(LLLF)问题正式化,提出足够的假设,使学习成为可能得到的回报,并引入$\textit{transfer eluder situle} $,作为衡量LLF问题难度的复杂尺度。我们表明,传输私利的维度能够捕捉到反馈中的信息改变LLLF问题学习复杂性的直觉。我们展示了从丰富的语言反馈中学习的速度可能比从奖励中学习的速度快得多的案例。我们开发了一个名为$\ textt{HELX}的无约束算法,这通过连续的相互作用可以解决LLLF问题,而绩效保证与LF问题的转移 Eluder 维度的规模。在几个经验领域,我们显示, $\ textt{HELX} 能够很好地完成LLMS的学习问题的复杂性。我们的贡献标志着设计从通用语言的交互式反射法的第一步。


Article 107

Title@2025-06-12 (4): Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs

Title: Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs Amulett: Neuausrichtung während der Testzeit für Personalisierte Präferenzanpassung von LLMs 缩略图:在试验期间重新对准,以适应LLMM的个性化偏好 2502.19148v3

Authors (8): Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng, Yaodong Yang

How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse regarding culture, values, or time. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users’ personalized preferences. To reduce the computational cost brought by this optimization process for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.

经常研究如何使大型语言模型(LLMS)与静态一般数据集的用户偏好相一致,然而,用户偏好通常是个性化的、变化的,在文化、价值或时间方面是多种多样的,这导致一个问题,即实际用户偏好往往与模型开发者在实际使用LLMs方面所培训的偏好不相吻合。由于我们无法为每项需求收集足够的数据和再培训,因此在测试期间根据主干LLMS研究有效的实时偏好适应方法十分重要。为此,我们引入了Amulet,这是一个创新的、无培训的框架,它以简单的用户提供的提示为指导,将每个符号的解码进程作为一个单独的在线学习问题,从而使得实时优化能够满足用户的个人偏好。为了减少这种优化进程为每种符号带来的计算成本,我们为优化过程的每个迭代步骤提供额外的封闭式解决方案,从而将计算时间成本降至微不足道的水平。我们详细的实验结果表明,Amulett可以在丰富的环境中实现显著的性改进,同时将不同的LMS、数据设置和用户偏好,同时保持可接受的效率。


Article 108

Title@2025-06-12 (4): Benchmarking LLMs for Environmental Review and Permitting

Title: Benchmarking LLMs for Environmental Review and Permitting Benchmarking LLMs für Umweltprüfung und Genehmigung 环境审查和许可基准确定LLMs 2407.07321v3

Authors (15): Rounak Meyur, Hung Phan, Koby Hayashi, Ian Stewart, Shivam Sharma, Sarthak Chaturvedi, Mike Parker, Dan Nally, Sadie Montgomery, Karl Pazdernik, Ali Jannesari, Mahantesh Halappanavar, Sai Munikoti, Sameera Horawalavithana, Anurag Acharya

The National Environmental Policy Act (NEPA) stands as a foundational piece of environmental legislation in the United States, requiring federal agencies to consider the environmental impacts of their proposed actions. The primary mechanism for achieving this is through the preparation of Environmental Assessments (EAs) and, for significant impacts, comprehensive Environmental Impact Statements (EIS). The effectiveness of Large Language Models (LLMs) in specialized domains like NEPA remains untested for adoption in federal decision-making processes. To address this gap, we present the NEPA Question and Answering Dataset (NEPAQuAD), the first comprehensive benchmark derived from EIS documents, along with a modular and transparent evaluation pipeline, MAPLE, to assess LLM performance on NEPA-focused regulatory reasoning tasks. Our benchmark leverages actual EIS documents to create diverse question types, ranging from factual to complex problem-solving ones. We built a modular and transparent evaluation pipeline to test both closed- and open-source models in zero-shot or context-driven QA benchmarks. We evaluate five state-of-the-art LLMs using our framework to assess both their prior knowledge and their ability to process NEPA-specific information. The experimental results reveal that all the models consistently achieve their highest performance when provided with the gold passage as context. Comparing the other context-driven approaches for each model, Retrieval Augmented Generation (RAG)-based approaches substantially outperform PDF document contexts, indicating that none of the models is well suited for long-context question-answering tasks. Our analysis suggests that NEPA-focused regulatory reasoning tasks pose a significant challenge for LLMs, particularly in terms of understanding the complex semantics and effectively processing the lengthy regulatory documents.

《国家环境政策法》(NEPA)是美国环境立法的基础部分,要求联邦机构考虑其拟议行动对环境的影响,实现这一目的的主要机制是编制环境评估(EAs),以及针对重大影响的综合环境影响说明(EIS)。大型语言模型(LLM)在诸如NEPA等专门领域的效力仍然未经测试,供联邦决策进程采用。为弥补这一差距,我们介绍了NEPA问题和回答数据集(NEPAQuAD),这是从EIS文件以及模块化和透明的评价管道(MAPLE)中得出的第一个全面基准,用以评估以NEPA为重点的监管推理任务方面的LLM绩效。我们的基准利用实际的EIS文件来创造各种各样的问题类型,从事实到复杂的解决问题。我们建立了模块化和透明的评价管道,以测试零点或背景驱动的QAA基准的封闭和公开源模型。我们用我们的框架来评估5个州的LMSMS(NEPA-LMS),用以评估其先前知识及其进程能力,而以NEPA-LLLLA为核心的调整范围, 有效地展示了每个具体文件的进度。我们通过过程的实验性结果显示了我们的所有结果,而不断展示结果,以持续地显示其整个过程的进度,而持续地显示其他的进度,具体过程是持续地展示。


Article 109

Title@2025-06-12 (4): CHANCERY: Evaluating Corporate Governance Reasoning Capabilities in Language Models

Title: CHANCERY: Evaluating Corporate Governance Reasoning Capabilities in Language Models CHANCERY: Bewertung der Corporate Governance-Reasoning-Fähigkeiten in Sprachmodellen C. 机会:评价语言模式中的公司治理能力 2506.04636v2

Authors (5): Lucas Irwin, Arda Kaz, Peiyao Sheng, Sewoong Oh, Pramod Viswanath

Law has long been a popular domain for natural language processing (NLP) applications. Reasoning (ratiocination and the ability to make connections to precedent) is a core part of the practice of law in the real world. Nevertheless, while multiple legal datasets exist, none have thus far focused specifically on reasoning tasks. We focus on a specific aspect of the legal landscape by introducing a corporate governance reasoning benchmark (CHANCERY) to test a model’s ability to reason about whether actions proposed by executives, boards, or shareholders are consistent with corporate governance charters. This benchmark introduces a first-of-its-kind corporate governance reasoning test for language models, modeled after real-world corporate governance law. The benchmark consists of a corporate charter (a set of governing covenants) and a proposal for executive action. The model’s task is one of binary classification: reason about whether the action is consistent with the rules contained within the charter. We create the benchmark following established principles of corporate governance - 24 concrete corporate governance principles established in and 79 real life corporate charters selected to represent diverse industries from a total dataset of 10k real life corporate charters. Evaluations on state-of-the-art (SOTA) reasoning models confirm the difficulty of the benchmark, with models such as Claude 3.7 Sonnet and GPT-4o achieving 64.5% and 75.2% accuracy respectively. Reasoning agents exhibit superior performance, with agents based on the ReAct and CodeAct frameworks scoring 76.1% and 78.1% respectively, further confirming the advanced legal reasoning capabilities required to score highly on the benchmark. We also conduct an analysis of the types of questions which current reasoning models struggle with, revealing insights into the legal reasoning capabilities of SOTA models.

在自然语言处理(NLP)应用中,法律长期以来一直是一个受人欢迎的领域。理性(刺激和与先例建立联系的能力)是现实世界中法律实践的核心部分。然而,尽管存在多个法律数据集,但迄今为止还没有一个法律数据集专门侧重于推理任务。我们侧重于法律景观的一个具体方面,引入了公司治理推理基准(CHANCRAY),以测试一个模型是否有能力解释执行/董事会/股东拟议行动是否与公司治理章程一致。这一基准引入了对语言逻辑模型的首选、其实质公司治理推理的首选检验。 该基准由公司章程(一套管理契约)和一项行政行动建议组成。该模型的任务是二元分类:该模型是否与宪章所载规则相一致。我们创建了一个基准,遵循了公司治理的既定原则――24项具体公司治理原则,并选择了79项真正的公司章程,以代表从10千个实际成本成本成本分析模型的多样化行业,1 并分别确认公司成本成本评估基准模型的准确性标准,5 以及SOBA的精确度评估, 以及SOA的准确性成本评估。


Article 110

Title@2025-06-12 (4): Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs

Title: Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs Gepaarte Fertigstellung: Flexible Quantifizierung von Emissions-Framing auf Scale mit LLMs 提前完成:与LLMs一道灵活量化规模问题配置 2408.09742v2

Authors (2): Simon D Angus, Lachlan O’Neill

Detecting issue framing in text - how different perspectives approach the same topic - is valuable for social science and policy analysis, yet challenging for automated methods due to subtle linguistic differences. We introduce ‘paired completion’, a novel approach using LLM next-token log probabilities to detect contrasting frames using minimal examples. Through extensive evaluation across synthetic datasets and a human-labeled corpus, we demonstrate that paired completion is a cost-efficient, low-bias alternative to both prompt-based and embedding-based methods, offering a scalable solution for analyzing issue framing in large text collections, especially suited to low-resource settings.

通过对合成数据集和人类标记的文体进行广泛评估,我们证明配对的完成是一种成本效益低的替代方法,可以取代快速和嵌入法,为分析大型文本集(特别适合低资源环境)中的问题框架提供了可扩展的解决方案。
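
One plausible reading of scoring with log probabilities is to compare how likely a target text is after priming with examples of each frame. The sketch below is only an illustration of that comparison: `conditional_logprob` is a toy add-one-smoothed unigram model standing in for an LLM's next-token log probabilities, and the function names, example sentences, and `vocab_size` value are assumptions, not the paper's implementation.

```python
import math
from collections import Counter


def conditional_logprob(priming, target, vocab_size=1000):
    """Toy stand-in for an LLM: log P(target | priming) under a smoothed unigram model."""
    counts = Counter(priming.lower().split())
    total = sum(counts.values())
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))
        for word in target.lower().split()
    )


def frame_score(target, frame_a_examples, frame_b_examples):
    """Positive when the target is more likely after frame A priming, negative for frame B."""
    logprob_a = conditional_logprob(" ".join(frame_a_examples), target)
    logprob_b = conditional_logprob(" ".join(frame_b_examples), target)
    return logprob_a - logprob_b


# Hypothetical priming examples for two contrasting frames of the same topic.
frame_a = ["remote work boosts productivity and saves commuting time"]
frame_b = ["remote work weakens teamwork and isolates staff"]

score_pro = frame_score("working from home saves time and boosts productivity", frame_a, frame_b)
score_con = frame_score("distance weakens teamwork and isolates staff", frame_a, frame_b)
```

With a real model, the stand-in would be replaced by summed token log probabilities of the target given each priming context; only a handful of examples per frame is needed, which matches the "minimal examples" claim.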


Article 111

Title@2025-06-12 (4): Detecting Sockpuppetry on Wikipedia Using Meta-Learning

Title: Detecting Sockpuppetry on Wikipedia Using Meta-Learning Sockepuppetry auf Wikipedia erkennen Mit Meta-Learning 在维基百科上用元学习探测袜子布料 2506.10314v1

Authors (2): Luc Raszewski, Christine De Kock

Malicious sockpuppet detection on Wikipedia is critical to preserving access to reliable information on the internet and preventing the spread of disinformation. Prior machine learning approaches rely on stylistic and meta-data features, but do not prioritise adaptability to author-specific behaviours. As a result, they struggle to effectively model the behaviour of specific sockpuppet-groups, especially when text data is limited. To address this, we propose the application of meta-learning, a machine learning technique designed to improve performance in data-scarce settings by training models across multiple tasks. Meta-learning optimises a model for rapid adaptation to the writing style of a new sockpuppet-group. Our results show that meta-learning significantly enhances the precision of predictions compared to pre-trained models, marking an advancement in combating sockpuppetry on open editing platforms. We release a new dataset of sockpuppet investigations to foster future research in both sockpuppetry and meta-learning fields.

维基百科上的恶性袜子探测对保存在互联网上获取可靠信息的机会和防止虚假信息传播至关重要。 先前的机器学习方法依赖于文体和元数据特征,但并不优先考虑作者特定行为。 结果,他们努力有效地模拟特定袜子布偶群体的行为,特别是在文本数据有限的情况下。 为了解决这个问题,我们提议应用元学习,即一种机器学习技术,通过多种任务的培训模型来改善数据卡片环境中的性能。 元学习选修模式是快速适应一个新的袜子布偶团体的写作风格的模式。 我们的结果表明,与预先培训的模式相比,元学习极大地提高了预测的准确性,标志着在打击开放编辑平台上的袜子布偶方面有所进步。 我们发布了一套新数据集,由袜子调查提供,以促进今后在袜子和元学习领域的研究。


Article 112

Title@2025-06-12 (4): Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling

Title: Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling Effiziente Längenverallgemeinerbare Aufmerksamkeit über Causal Retrieval für die Lang-Kontext-Sprachenmodellierung 长文本语言建模通过 “ 目的检索 “ 吸引长文本语言建模 2410.01651v4

Authors (5): Xiang Hu, Zhihao Teng, Jun Zhao, Wei Wu, Kewei Tu

Despite the success of Transformers, handling long contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention. Thus Transformers often require post-training with a larger attention window, significantly increasing computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to 1000 times the pre-training context length while maintaining the ability to access distant information with a constant attention window size. For a given input sequence, we split it into chunks and use each chunk to retrieve top-k relevant past chunks for subsequent text generation. Specifically, unlike most previous works that use an off-the-shelf retriever, our key innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner. Such a mechanism accommodates retrieved chunks with a fixed-size attention window to achieve long-range information access, significantly reducing computational and memory costs during training and inference. Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths, which is 1000 times the training length.

尽管变形器取得了成功,但处理长长环境仍具有挑战性,因为自我自控的长度一般化和复杂度有限。因此,变形器往往需要培训后的培训,而关注窗口则更大,计算成本和记忆成本也大大增加。在本文件中,我们提出了一个基于动态环境的新式关注机制,即Group Cross Commission(GCA),它可以将培训前背景长度的千倍放大,同时保持以恒定的关注窗口大小获取远处信息的能力。对于给定的输入序列,我们将其分成块,并使用每个块来为随后的文本生成检索最上的相关过去块。具体地说,与大多数使用现成检索器的以往工作不同,我们的关键创新使检索者能够学习如何回收过去的块块,从而以端到端的方式更好地将随后代代号的自动递减损失降到最低。这种机制容纳了固定关注窗口的回收块,以便实现远程信息的存取,在培训和推断过程中大幅降低计算和记忆成本。实验显示,基于GCA的模型在16次的远程检索中达到近超精确度的长度。
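
The chunk-and-retrieve step described above can be sketched as follows. This shows only the causal top-k lookup over earlier chunks with fixed vectors and dot-product scoring; the learned, end-to-end trained retriever that the abstract names as the key innovation is not modeled, and the function names are hypothetical.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def retrieve_past_chunks(chunk_vectors, current_index, k):
    """Return indices of the k earlier chunks most similar to the current chunk.

    Only chunks before current_index are candidates, so retrieval stays causal.
    """
    query = chunk_vectors[current_index]
    scored = [
        (sum(q * c for q, c in zip(query, chunk_vectors[j])), j)
        for j in range(current_index)
    ]
    # Highest score first; ties are broken by the earlier chunk index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored[:k]]


chunks = split_into_chunks(list(range(10)), 4)
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [1.0, 0.2]]
top = retrieve_past_chunks(vectors, current_index=3, k=2)
```

Because each chunk attends to a fixed number of retrieved chunks, the attention window stays constant regardless of how long the full sequence is.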


Article 113

Title@2025-06-12 (4): AC/DC: LLM-based Audio Comprehension via Dialogue Continuation

Title: AC/DC: LLM-based Audio Comprehension via Dialogue Continuation AC/DC: LLM-basiertes Audio-Verständnis über Dialog-Fortsetzung AC/DC:基于LLM的通过对话继续了解音频 2506.10312v1

Authors (5): Yusuke Fujita, Tomoya Mizumoto, Atsushi Kojima, Lianbo Liu, Yui Sudo

We propose an instruction-following audio comprehension model that leverages the dialogue continuation ability of large language models (LLMs). Instead of directly generating target captions in training data, the proposed method trains a model to produce responses as if the input caption triggered a dialogue. This dialogue continuation training mitigates the caption variation problem. Learning to continue a dialogue effectively captures the caption’s meaning beyond its surface-level words. As a result, our model enables zero-shot instruction-following capability without multitask instruction tuning, even trained solely on audio captioning datasets. Experiments on AudioCaps, WavCaps, and Clotho datasets with AudioBench audio-scene question-answering tests demonstrate our model’s ability to follow various unseen instructions.

我们提议了一个遵循教学的听力理解模型,利用大型语言模型(LLMs)的对话持续能力。建议的方法不是直接生成培训数据的目标说明,而是将一个模型培养成一个反应模式,仿佛输入说明触发了对话。这种对话继续培训缓解了标题变异问题。学习继续对话,有效地捕捉字幕的含义,超越其表面的文字。结果,我们的模型使得零弹指令传导能力无需多任务指示调试,甚至只对音频说明数据集进行培训。关于音频Caps、WavCapts和Clootho数据集的实验展示了我们模型是否有能力遵循各种看不见的指示。


Article 114

Title@2025-06-12 (4): BeamLoRA: Beam-Constraint Low-Rank Adaptation

Title: BeamLoRA: Beam-Constraint Low-Rank Adaptation BeamLoRA: Beam-Constraint Low-Rank Anpassung BeamLORA: 束-节制低射线适应 2502.13604v2

Authors (10): Naibin Gu, Zhenyu Zhang, Xiyu Liu, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

Due to the demand for efficient fine-tuning of large language models, Low-Rank Adaptation (LoRA) has been widely adopted as one of the most effective parameter-efficient fine-tuning methods. Nevertheless, while LoRA improves efficiency, there remains room for improvement in accuracy. Herein, we adopt a novel perspective to assess the characteristics of LoRA ranks. The results reveal that different ranks within the LoRA modules not only exhibit varying levels of importance but also evolve dynamically throughout the fine-tuning process, which may limit the performance of LoRA. Based on these findings, we propose BeamLoRA, which conceptualizes each LoRA module as a beam where each rank naturally corresponds to a potential sub-solution, and the fine-tuning process becomes a search for the optimal sub-solution combination. BeamLoRA dynamically eliminates underperforming sub-solutions while expanding the parameter space for promising ones, enhancing performance with a fixed rank. Extensive experiments across three base models and 12 datasets spanning math reasoning, code generation, and commonsense reasoning demonstrate that BeamLoRA consistently enhances the performance of LoRA, surpassing the other baseline methods.

由于对大语言模型进行高效微调的需求,低兰克适应(LORA)已被广泛采用,作为最有效的参数效率微调方法之一。然而,虽然LORA提高了效率,但准确性仍有改进的余地。在这里,我们采用了新的视角来评估LORA等级的特性。结果显示,LORA单元的不同等级不仅具有不同的重要性,而且在整个微调过程中动态地演变,这可能会限制LORA的性能。根据这些调查结果,我们提议BAAM LoRA将每个LORA单元的概念化为一个束,每个波段自然与潜在的分解相对应,而微调过程则成为对最佳次溶组合的探索。BaamLORA动态地消除了表现不佳的次溶物,同时扩大了有希望的分解物的参数空间,提高了固定级的性能。在三个基模型和12个数据集中进行广泛的实验,跨越数学推理、代码生成和常识推理推理,表明BAamLORA不断提高LRA的性能,超越了其他基线方法。


Article 115

Title@2025-06-12 (4): Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs

Title: Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs Geplante interleaved Speech-Text-Schulung für Sprach-zu-Sprach-Übersetzung mit LLMs 配有LLMM的语音对语音翻译教学 2506.10299v1

Authors (7): Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe

Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs), which are fine-tuned on discrete speech units. In such approaches, modality adaptation from text to speech has been an issue. LLMs are trained on text-only data, which presents challenges to adapt them to speech modality with limited speech-to-speech data. To address the training difficulty, we propose scheduled interleaved speech–text training in this study. We use interleaved speech–text units instead of speech units during training, where aligned text tokens are interleaved at the word level. We gradually decrease the ratio of text as training progresses, to facilitate progressive modality adaptation from text to speech. We conduct experimental evaluations by fine-tuning LLaMA3.2-1B for S2ST on the CVSS dataset. We show that the proposed method consistently improves the translation performances, especially for languages with limited training data.

语音对语音翻译(S2ST)已经与大型语言模型(LLMS)相提并论,这些模型对独立语音单位进行了细微的调整。在这种方法中,从文本到语言的模式调整是一个问题。LMS接受只文本数据的培训,这为使其适应语音对语音数据有限的语音模式提出了挑战。为了解决培训困难,我们建议在本研究中安排间断语音-文字培训。我们在培训中使用隔断语音-文字单位,而不是语音单位,在单词一级对齐文本符号进行互接。我们逐渐降低文本在培训进展中的比例,以促进从文本到语言的渐进模式调整。我们通过对CVSS数据集上的S2ST进行微调LAMA3.2-1B进行实验性评估。我们表明,拟议方法始终在改进翻译绩效,特别是培训数据有限的语言的翻译性能。
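
A minimal sketch of a decaying text ratio with word-level interleaving is given below. The linear decay, the starting ratio of 0.9, and the choice to render each word either as its text token or as its speech units are all assumptions for illustration; the abstract states only that the text ratio decreases gradually during training.

```python
import random


def text_ratio(step, total_steps, start_ratio=0.9):
    """Linearly decay the text ratio from start_ratio to 0 (the linear shape is an assumption)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_ratio * (1.0 - progress)


def interleave(aligned_words, ratio, rng):
    """Build one training sequence from word-aligned (text_token, speech_units) pairs.

    Each word is emitted as its text token with probability `ratio`,
    otherwise as its discrete speech units.
    """
    sequence = []
    for text_token, speech_units in aligned_words:
        if rng.random() < ratio:
            sequence.append(text_token)
        else:
            sequence.extend(speech_units)
    return sequence


# Hypothetical alignment: unit names such as "u12" are placeholders.
words = [("hello", ["u12", "u7"]), ("world", ["u3", "u3", "u41"])]
rng = random.Random(0)
early = interleave(words, text_ratio(0, 1000), rng)    # mostly text
late = interleave(words, text_ratio(1000, 1000), rng)  # speech units only
```

At the end of the schedule the ratio reaches zero, so the model sees pure speech-unit sequences, which is the target modality for S2ST.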


Article 116

Title@2025-06-12 (4): “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context

Title: “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context “Check My Work?”: Sykopanzmessung in einem simulierten Bildungskontext “检查我的工作?” “测量模拟教育环境中的相对性” 2506.10297v1

Authors (1): Chuck Arvin

This study examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context, where sycophancy poses significant risks. Testing five different LLMs from the OpenAI GPT-4o and GPT-4.1 model classes across five experimental conditions, we show that response quality varies dramatically based on query framing. In cases where the student mentions an incorrect answer, the LLM correctness can degrade by as much as 15 percentage points, while mentioning the correct answer boosts accuracy by the same margin. Our results also show that this bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model, versus 8% for the GPT-4o model. Our analysis of how often LLMs “flip” their answer, and an investigation into token level probabilities, confirm that the models are generally changing their answers to answer choices mentioned by students in line with the sycophancy hypothesis. This sycophantic behavior has important implications for educational equity, as LLMs may accelerate learning for knowledgeable students while the same tools may reinforce misunderstanding for less knowledgeable students. Our results highlight the need to better understand the mechanism, and ways to mitigate, such bias in the educational context.

这项研究考察了用户提供的建议如何在模拟教育环境中影响大语言模型(LLMS),这种模拟教育对恋童癖构成重大风险。测试了来自OpenAI GPT-4o 和 GPT-4.1 的五个不同的LMS,在五个实验条件下测试了五个不同的LMS,我们发现,根据查询框架,反应质量差异很大。在学生提到错误答案的情况下,LLM的正确性可以降低多达15个百分点,同时提到正确的答案能提高同一差幅的准确性。我们的结果还显示,这种偏差在较小的模型中更为强烈,GPT-4-4.1-nano 模型的效果高达30%,而GPT-4 模型则高达8%。我们对LMS“翻转”答案的频率的分析以及对象征性概率的调查证实,如果学生提到错误的答案,那么LLMS的正确性就会降低多达15个百分点,同时提到正确的答案会提高同一差幅。这种偏差行为对教育公平性有着重要影响,因为LMS可能加快对有知识的学生的学习,而同样的工具可能会加深对缺乏了解的学生产生误解。我们的结果需要更好的理解机制。
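
The two quantities discussed above, accuracy per condition and how often answers "flip" relative to a control run, can be computed as in the following sketch. The record layout, condition names, and the four toy records are invented for illustration and are not data from the study.

```python
def accuracy(records, condition):
    """Share of correct answers among the records of one experimental condition."""
    rows = [r for r in records if r["condition"] == condition]
    return sum(r["answer"] == r["gold"] for r in rows) / len(rows)


def flip_rate(records, condition, baseline="control"):
    """Share of questions whose answer differs from the baseline answer to the same question."""
    baseline_answers = {r["qid"]: r["answer"] for r in records if r["condition"] == baseline}
    rows = [r for r in records if r["condition"] == condition]
    return sum(r["answer"] != baseline_answers[r["qid"]] for r in rows) / len(rows)


# Toy records: in "suggest_incorrect" the student mentions a wrong answer.
records = [
    {"qid": 1, "condition": "control", "gold": "A", "answer": "A"},
    {"qid": 2, "condition": "control", "gold": "B", "answer": "B"},
    {"qid": 3, "condition": "control", "gold": "C", "answer": "C"},
    {"qid": 4, "condition": "control", "gold": "D", "answer": "A"},
    {"qid": 1, "condition": "suggest_incorrect", "gold": "A", "answer": "B"},
    {"qid": 2, "condition": "suggest_incorrect", "gold": "B", "answer": "B"},
    {"qid": 3, "condition": "suggest_incorrect", "gold": "C", "answer": "C"},
    {"qid": 4, "condition": "suggest_incorrect", "gold": "D", "answer": "A"},
]

drop = accuracy(records, "control") - accuracy(records, "suggest_incorrect")
flips = flip_rate(records, "suggest_incorrect")
```

In this toy data the accuracy drop is 25 percentage points and one of four answers flips; a sycophancy analysis would additionally check whether the flipped answer matches the one the student mentioned.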


Article 117

Title@2025-06-12 (4): Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages

Title: Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages Flick: Wenige Labels Textklassifizierung mit K-Aware Intermediate Learning in Multi-Task Low-Resource Sprachen Flick:使用K-Aware中级学习多种低资源语言的多种语言的标签文字分类, 2506.10292v1

Authors (6): Ali Almutairi, Abdullah Alsuhaibani, Shoaib Jameel, Usman Naseem, Gelareh Mohammadi, Imran Razzak

Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component, a departure from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels. We demonstrate Flick’s efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.

低监管水平的培训深层次学习网络由于具有减少对广泛标签数据依赖的潜力而赢得了重要的研究关注。尽管自培训方法在半监督学习中证明是有效的,但它们仍然易受到杂音假标签的错误。此外,最近针对少数标签分类问题的办法要么是为资源丰富的语言(如英语)设计的,要么涉及容易过度装配的复杂的假标签模型。为了应对在真正低资源语言环境中低标签文本分类这一持续存在的挑战,因为现有方法往往与杂音的假标签和域名调整纠缠不休,我们建议Flick。不同于以前使用通用多组伪标签或复杂的卡萨卡结构的方法,Flick 利用从范围更广的最初组群中提取高信任假标签的基本洞察力,可以极大地提高假标签质量,特别是语言多样性、低资源环境环境环境环境。Flick引入了一个新的假标签精细化部分,通过识别和利用上流化的假标签集群群集,这一部分特别学会了高可靠性伪标签,高等级的高级标签与高层次的升级性、高层次的升级性、高层次的升级性、高层次的升级性、高层次的升级性、高层次的升级性、高压性、高压性、高压性、高压性能、高压性能、高压性能、高压性能、高压性能、高压性地展示了高压压压压压压压压压压压压压压压压压压压压性地的升级的升级性、高的升级性能、高压性能、高压性能、高压性能、高压压压压压压压压压压压压压压压压压压压压压压压压压压压压压压压压压压压压压压低的标。


Article 118

Title@2025-06-12 (4): Context Is Not Comprehension

Title: Context Is Not Comprehension Kontext ist nicht verständlich 背景不令人理解 2506.04907v4

Authors (2): Alex Pan, Mary-Anne Williams

The dominant way of judging Large Language Models (LLMs) has been to ask how well they can recall explicit facts from very long inputs. While today’s best models achieve near perfect recall, this masks a harder skill: performing multi-step reasoning and tracking intermediate state that never appears verbatim. We introduce Verbose ListOps (VLO), a benchmark that embeds deterministic ListOps computations inside narrative camouflage and, crucially, allows step-level evaluation of every intermediate result. Experiments show that models which solve raw ListOps with approximately 100% accuracy collapse on VLO after only 10,000 tokens. By exposing where a model’s reasoning chain first diverges, VLO moves assessment beyond sheer context length and toward genuine comprehension. VLO’s generation pipeline is task-agnostic: it can weave any deterministically verifiable reasoning schema – arithmetic, symbolic, abductive, inductive or defeasible – into narrative form. This makes VLO a reusable test-bed for the next wave of reasoning-centric model designs, not merely those with step-explicit scaffolds.

判断大语言模型(LLMS)的主要方法一直是询问它们能够从非常长的投入中记得清楚的事实。 虽然今天的最佳模型几乎完全可以回顾, 但是它掩盖了一个更难掌握的技能: 执行多步推理和追踪从未出现过的中间状态。 我们引入了Verbose ListOps (VLO), 这个基准将确定性 ListOps 计算方法嵌入叙事伪装中, 并且关键地允许对每个中间结果进行分级评估。 实验显示, 解决原始 ListOps 的模型, 其精度在1万个符号之后大约为100%的VLO 崩溃。 通过曝光模型推理链首先出现差异的地方, VLO 移动评估超越纯粹的上下文长度, 走向真正的理解。 VLO 的一代管道具有任务性: 它可以将任何可确定性、 可核实的推理的推理方法 – 算、 符号、 、 诱性、 诱导性或易失能性 – 编成叙述形式。 这让VLO成为下一个推理中心模型设计的可重复的测试台, 不仅仅是具有分辨的试的测试台式模型设计。 。 这让VLO成为下一个波的试的测试台, 的测试台, 。
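
Step-level evaluation of a ListOps expression can be sketched as follows: every sub-expression result is logged, so a model's chain of intermediate values can be compared against the gold trace to find where it first diverges. The tuple encoding, the function names, and the use of the lower median for even-length lists are assumptions; the narrative wrapping that VLO adds is not modeled.

```python
import statistics

# ListOps operators: maximum, minimum, median, and sum modulo 10.
OPS = {
    "MAX": max,
    "MIN": min,
    "MED": statistics.median_low,  # lower median for even lengths (an assumption)
    "SM": lambda values: sum(values) % 10,
}


def evaluate(expr, trace):
    """Evaluate a nested expression; append (op, operand values, result) for each step."""
    if isinstance(expr, int):
        return expr
    op, *operands = expr
    values = [evaluate(operand, trace) for operand in operands]
    result = OPS[op](values)
    trace.append((op, values, result))
    return result


def first_divergence(gold_trace, predicted_results):
    """Index of the first intermediate result that is missing or wrong, or None if all match."""
    for index, (_, _, gold) in enumerate(gold_trace):
        if index >= len(predicted_results) or predicted_results[index] != gold:
            return index
    return None


gold_trace = []
answer = evaluate(("MAX", 2, ("SM", 7, 8), ("MIN", 4, 9)), gold_trace)
diverged_at = first_divergence(gold_trace, [5, 9, 9])
```

Here the gold intermediate results are 5, 4, and 5, so a predicted chain of 5, 9, 9 first diverges at step index 1, even though a final-answer check alone would only report that the answer is wrong.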


Article 119

Title@2025-06-12 (4): Prompt-based Depth Pruning of Large Language Models

Title: Prompt-based Depth Pruning of Large Language Models Prompt-basierte Tiefenkorrektur von großen Sprachmodellen 大语言模式的即时深度定位 2502.04348v3

Authors (3): Juyun Wee, Minjae Park, Jaeho Lee

Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent – a block that is crucial for a task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models, and achieves better on-task performance than static depth pruning baselines.

深度剪裁的目的是通过简单地去除几个不太重要的变压器块来降低没有任何硬件特有并发症的大语言模型的推论成本。然而,我们的经验调查结果表明,变压器块的重要性可能高度依赖任务 – – 对于一项任务至关重要的块块可以在不降低另一项任务的准确性的情况下被删除。基于这一观察,我们开发了动态深度剪裁算法,并创建了PuDDing(PuDDing-routed Pudent-rout Prutning),该算法决定了哪些块块可以从基于输入速度的模型中省略。PuDDing通过培训一个轻量级路由器来运行,以预测一组选项中的最佳遗漏,而该选项组也是以数据驱动的方式构建的。 常识推理基准的实证结果表明,PuDDing有效地加速了推断语言模型,并且比静态深度剪动基线在塔克上的表现更好。


Article 120

Title@2025-06-12 (4): ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs

Title: ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs ClusterUCB: Effiziente Gradient-basierte Datenauswahl für gezielte Feinsteuerung von LLMs COCUCB: 高效率的逐步数据选择,以便有针对性地微调LLMM 2506.10288v1

Authors (6): Zige Wang, Qi Zhu, Fei Mi, Minghui Xu, Ruochun Jin, Wenjing Yang

Gradient-based data influence approximation has been leveraged to select useful data samples in the supervised fine-tuning of large language models. However, the computation of gradients throughout the fine-tuning process requires too many resources to be feasible in practice. In this paper, we propose an efficient gradient-based data selection framework with clustering and a modified Upper Confidence Bound (UCB) algorithm. Based on the intuition that data samples with similar gradient features will have similar influences, we first perform clustering on the training data pool. Then, we frame the inter-cluster data selection as a constrained computing budget allocation problem and consider it a multi-armed bandit problem. A modified UCB algorithm is leveraged to solve this problem. Specifically, during the iterative sampling process, historical data influence information is recorded to directly estimate the distributions of each cluster, and a cold start is adopted to balance exploration and exploitation. Experimental results on various benchmarks show that our proposed framework, ClusterUCB, can achieve comparable results to the original gradient-based data selection methods while greatly reducing computing consumption.

在大型语言模型的监督下微调中,已利用基于渐变的数据影响近似值来选择有用的数据样本,但在整个微调过程中计算梯度需要太多的资源才能实际可行。在本文件中,我们建议采用高效的梯度数据选择框架,同时采用集群和修改的超信任量计算法。基于具有类似梯度特征的数据样本具有类似影响的直觉,我们首先在培训数据库中进行分组。然后,我们将集群数据选择视为一个有限的计算预算分配问题,并将其视为一个多臂强盗问题。修改后的UCB算法被用来解决这一问题。具体地说,在迭代抽样过程中,记录历史数据影响信息,直接估计每个组的分布情况,并采用一个寒冷的起点来平衡探索和利用。各种基准的实验结果表明,我们提议的Crob CUCB框架可以取得与原始梯度数据选择方法的类似结果,同时大大减少计算消耗量。
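
Treating clusters as bandit arms can be sketched with the standard UCB1 rule. The paper uses a modified UCB algorithm whose details are not given in the abstract, so the plain UCB1 bonus, the `influence_fn` callback, and the toy clusters below are all assumptions for illustration.

```python
import math


def select_with_ucb(clusters, influence_fn, budget, c=1.0):
    """Spend `budget` influence evaluations across clusters using UCB1.

    clusters: dict mapping cluster name -> list of samples.
    influence_fn: hypothetical callback returning a sample's influence score.
    Returns the evaluated (sample, score) pairs in the order they were drawn.
    """
    remaining = {name: list(samples) for name, samples in clusters.items()}
    counts = {name: 0 for name in clusters}
    totals = {name: 0.0 for name in clusters}
    drawn = []

    def pull(name):
        sample = remaining[name].pop(0)
        score = influence_fn(sample)
        counts[name] += 1
        totals[name] += score
        drawn.append((sample, score))

    # Cold start: evaluate one sample from every non-empty cluster.
    for name in clusters:
        if remaining[name] and len(drawn) < budget:
            pull(name)

    while len(drawn) < budget:
        candidates = [name for name in clusters if remaining[name]]
        if not candidates:
            break
        t = len(drawn)
        # Mean observed influence plus an exploration bonus that shrinks with more pulls.
        best = max(
            candidates,
            key=lambda n: totals[n] / counts[n] + c * math.sqrt(math.log(t) / counts[n]),
        )
        pull(best)
    return drawn


# Toy setup: each sample is its own influence score.
clusters = {"high": [0.9] * 10, "low": [0.1] * 10}
drawn = select_with_ucb(clusters, influence_fn=lambda sample: sample, budget=8)
```

Most of the budget goes to the cluster with higher observed influence, while the exploration bonus still triggers occasional draws from the other cluster. This is how expensive gradient computations can be concentrated on promising clusters.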


Article 121

Title@2025-06-12 (4): Convert Language Model into a Value-based Strategic Planner

Title: Convert Language Model into a Value-based Strategic Planner Konvertieren Sie Sprachmodell in einen wertbasierten strategischen Planer 将语言模式转换成基于价值的战略规划员 2505.06987v3

Authors (7): Xiaoyu Wang, Yue Zhao, Qingqing Gu, Zhonglin Jiang, Xiaokai Chen, Yong Chen, Luo Ji

Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have made remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage Q-learning on LLMs and propose a framework called straQ. Our framework allows a plug-and-play LLM to bootstrap the planning during ESC, determine the optimal strategy based on long-term returns, and finally guide the LLM to respond. Substantial experiments on ESC datasets suggest that straQ outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and finite state machines.

情感支持对话(ESC)旨在通过有效对话减轻个人的情感痛苦。 尽管大型语言模型(LLMs)在ESC方面取得了显著的进展,但大多数这些研究可能无法从州模式的角度界定图表,因此为长期满意度提供了亚优的解决方案。 为了解决这一问题,我们利用LMs的Q学习,并提出了一个名为stra的框架。 我们的框架允许插接和播放LLM在ESC期间启动规划,确定基于长期回报的最佳战略,并最终指导LM作出反应。 有关ESC数据集的大量实验表明,它超越了许多基线,包括直接引用、自我反省、思维链、微调和限定状态机器。


Article 122

Title@2025-06-12 (4): Discrete Audio Tokens: More Than a Survey!

Title: Discrete Audio Tokens: More Than a Survey! Diskrete Audio Tokens: Mehr als nur eine Umfrage! 分辨音频代号: 多于调查 ! 2506.10274v1

Authors (21): Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

分辨音义象征物是旨在保存感知质量、语音内容和发言者特征,同时能够有效储存和推断,以及具有不同下游任务的竞争性业绩的简明表述物。它们为连续特征提供了一个实用的替代物,能够将语音和音频纳入现代大型语言模型(LLMs)。随着对以象征性为基础的音效处理的兴趣增加,出现了各种象征性方法,一些调查审查了该领域的最新进展。然而,现有研究往往侧重于特定领域或任务,缺乏各种基准的统一比较。本文对独立音效象征物进行了系统审查和基准,涵盖三个领域:言论、音乐和一般音频。我们提出了基于编码-解码、量化技术、培训范式、可流和应用领域的象征化方法的分类。我们评估了重建、下游性业绩和声学语言建模的多重基准,并通过受控的反动研究分析了贸易利弊端。我们的调查结果强调了关键限制、实际考虑和公开挑战,为这个迅速演变的区域的未来研究提供了洞察力和指导。更多信息,包括我们的主要结果和网站。


Article 123

Title@2025-06-12 (4): Play to Generalize: Learning to Reason Through Game Play

Title: Play to Generalize: Learning to Reason Through Game Play Spielen Sie Generalize: Lernen, Vernunft durch Spiel zu lernen 玩一般游戏: 通过玩游戏学习理性 2506.08011v2

Authors (6): Yunfei Xie, Yinsong Ma, Shiyi Lan, Alan Yuille, Junfei Xiao, Chen Wei

Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g. Snake, significantly enhances its downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL, suggesting the capture of transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data in multimodal reasoning benchmarks, while preserving the base model’s performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pre-text tasks that unlock generalizable multimodal reasoning abilities in MLLMs.

在多式大语言模型(MLLM)中,发展通用推理能力仍具有挑战性。 在认知科学文献的激励下,我们提出一个新的培训后模式、视觉游戏学习或ViGaL(MLMS通过玩环形游戏将多式推理推理法从外向外概括化)。具体地说,我们展示了在简单街机类游戏(如蛇)中强化学习(RL)7B参数MLM(MLM)后,大幅提升其在MatthVista等多式数学基准和MMMMMMMU(MMMM)等多式问题上的下游性能,没有看到任何可行的解决方案、方程式或图表,也没有看到可转移推理技能的捕获。 值得注意的是,我们的模型超越了在多式推理基准中调整多式推理数据的专家模型,同时保持了基础模型在一般视觉基准上的绩效,专家模型往往落后。 我们的研究结果表明一种新的培训后模式:合成、基于规则的游戏可以作为可控性和可缩的预言前任务,可以解一般多式联运推理。


Article 124

Title@2025-06-12 (4): Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models

Title: Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models Haben Sprachmodelle Bayesische Gehirne? Beeindruckende stochastische und deterministische Entscheidungsmuster innerhalb großer Sprachmodelle 语言模式是否具有贝耶斯人脑? 区分大语言模式中的斯托卡和决定性决定模式 2506.10268v1

Authors (2): Andrea Yaoyun Cui, Pengfei Yu

Language models are essentially probability distributions over token sequences. Auto-regressive models generate sentences by iteratively computing and sampling from the distribution of the next token. This iterative sampling introduces stochasticity, leading to the assumption that language models make probabilistic decisions, similar to sampling from unknown distributions. Building on this assumption, prior research has used simulated Gibbs sampling, inspired by experiments designed to elicit human priors, to infer the priors of language models. In this paper, we revisit a critical question: Do language models possess Bayesian brains? Our findings show that under certain conditions, language models can exhibit near-deterministic decision-making, such as producing maximum likelihood estimations, even with a non-zero sampling temperature. This challenges the sampling assumption and undermines previous methods for eliciting human-like priors. Furthermore, we demonstrate that without proper scrutiny, a system with deterministic behavior undergoing simulated Gibbs sampling can converge to a “false prior.” To address this, we propose a straightforward approach to distinguish between stochastic and deterministic decision patterns in Gibbs sampling, helping to prevent the inference of misleading language model priors. We experiment on a variety of large language models to identify their decision patterns under various circumstances. Our results provide key insights in understanding decision making of large language models.

语言模型基本上是象征性序列的概率分布。 自动递减模型通过迭代计算和采样从下一个符号的分布中产生句子。 迭代抽样引入了随机性, 从而假设语言模型作出概率性决定, 类似于未知分布的抽样。 基于这一假设, 先前的研究使用了模拟Gibs抽样, 受旨在提取人类前科的实验的启发, 并参照了语言模型的前身。 在本文中, 我们重新审视了一个关键问题 : 语言模型是否拥有巴耶斯人的大脑? 我们的研究结果显示, 在某些条件下, 语言模型可以展示出近乎决定性的决策决策, 例如生成最大概率估计, 甚至是非零采样温度。 这挑战了取样假设, 并破坏了先前的人类类似模式。 此外, 我们证明, 未经适当审查, 模拟吉布抽样抽样的具有确定性行为的系统可以归结为“ 假象 ” 。 为了解决这个问题, 我们建议一种直截了的方法来区分随机和确定性决定性的决定模式, 有助于防止在大量语言模型中做出误导性判断。
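
The "false prior" effect described in the abstract above can be illustrated with a toy chain. The sketch below is not the paper's experimental setup: it assumes a Beta-Binomial coin-bias task, a hypothetical Beta(4, 2) prior, and two hypothetical agents. When the agent samples from its posterior, the chain's long-run distribution over hypotheses matches the prior; when the agent deterministically returns a maximum likelihood estimate, the same procedure collapses onto an extreme value that says nothing about the prior.

```python
import numpy as np

A, B = 4.0, 2.0  # hypothetical Beta prior over the coin bias (mean A / (A + B))


def stochastic_agent(heads, n_flips, rng):
    """Bayesian sampler: draws the next hypothesis from the Beta posterior."""
    return rng.beta(A + heads, B + n_flips - heads)


def deterministic_agent(heads, n_flips, rng):
    """Deterministic responder: always returns the maximum likelihood estimate."""
    return heads / n_flips


def run_chain(agent, n_steps=5000, n_flips=10, theta0=0.5, seed=0):
    """Simulated Gibbs sampling: alternate data generation and hypothesis selection."""
    rng = np.random.default_rng(seed)
    theta, trace = theta0, []
    for _ in range(n_steps):
        heads = rng.binomial(n_flips, theta)   # data produced under the current hypothesis
        theta = agent(heads, n_flips, rng)     # agent's new hypothesis given that data
        trace.append(theta)
    return np.array(trace)
```

Comparing the two traces (for example, their histograms after a burn-in period) is one simple way to tell a sampling agent from a deterministic one before reading the chain's output as a prior.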


Article 125

Title@2025-06-12 (4): Research Borderlands: Analysing Writing Across Research Cultures

Title: Research Borderlands: Analysing Writing Across Research Cultures Forschungsgrenzen: Analysieren des Schreibens über Forschungskulturen hinweg 研究边界地区:分析跨研究文化的写作 2506.00784v2

Authors (3): Shaily Bhatt, Tal August, Maria Antoniak

Improving cultural competence of language technologies is important. However, most recent works rarely engage with the communities they study, and instead rely on synthetic setups and imperfect proxies of culture. In this work, we take a human-centered approach to discover and measure language-based cultural norms, and cultural competence of LLMs. We focus on a single kind of culture, research cultures, and a single task, adapting writing across research cultures. Through a set of interviews with interdisciplinary researchers, who are experts at moving between cultures, we create a framework of structural, stylistic, rhetorical, and citational norms that vary across research cultures. We operationalise these features with a suite of computational metrics and use them for (a) surfacing latent cultural norms in human-written research papers at scale; and (b) highlighting the lack of cultural competence of LLMs, and their tendency to homogenise writing. Overall, our work illustrates the efficacy of a human-centered approach to measuring cultural norms in human-written and LLM-generated texts.

提高语言技术的文化能力十分重要。但是,最近的一些工作很少与他们所研究的社区接触,而是依赖合成结构和文化的不完善替代物。在这项工作中,我们采取以人为本的方法发现和衡量以语言为基础的文化规范以及LLM女士的文化能力。我们侧重于单一的文化、研究文化和单一的任务,使各种研究文化的写作有所调整。我们与跨学科研究人员进行了一系列访谈,他们是在不同文化间移动的专家,我们建立了结构、文体、口头和引言规范的框架,这种框架在不同的研究文化中各不相同。我们用一套计算尺度来操作这些特征,并将其用于(a) 大规模地在人写研究论文中浮现潜在的文化规范;(b) 突出LMMS缺乏文化能力,以及他们倾向于同质书写。总体而言,我们的工作展示了以人为中心的方法在人写和LM制文本中衡量文化规范的效力。


Article 126

Title@2025-06-12 (4): M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction

Title: M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction M-MRE: Ausdehnung des Effekts der gegenseitigen Verstärkung auf multimodale Informationsextraktion M-MRE:将相互强化效应扩大到多式联运信息提取 2504.17353v2

Authors (6): Chengguang Gan, Zhixi Cai, Yanbin Wei, Yunhao Liang, Shiwen Ni, Tatsunori Mori

Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection of information extraction and model interpretability. MRE aims to leverage the mutual understanding between tasks of different granularities, enhancing the performance of both coarse-grained and fine-grained tasks through joint modeling. While MRE has been explored and validated in the textual domain, its applicability to visual and multimodal domains remains unexplored. In this work, we extend MRE to the multimodal information extraction domain for the first time. Specifically, we introduce a new task: Multimodal Mutual Reinforcement Effect (M-MRE), and construct a corresponding dataset to support this task. To address the challenges posed by M-MRE, we further propose a Prompt Format Adapter (PFA) that is fully compatible with various Large Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can also be observed in the M-MRE task, a multimodal text-image understanding scenario. This provides strong evidence that MRE facilitates mutual gains across three interrelated tasks, confirming its generalizability beyond the textual domain.

相互强化效应(MRE)是信息提取和模型解释交汇处一个新兴的子领域。MRE旨在利用不同粒子任务之间的相互理解,通过联合建模提高粗粒和细粒任务的业绩。虽然在文本领域已经探索和验证了MRE,但对视觉和多式联运领域的适用性仍然没有被探索。在这项工作中,我们首次将MRE扩大到多式联运信息提取领域。具体地说,我们引入了一项新的任务:多式相互强化效应(M-MRE),并构建了相应的数据集以支持这项任务。为了应对M-MRE带来的挑战,我们进一步建议了一种与各种大视野-语言模型(LVLMM)完全兼容的快速格式调整器(PFA)。实验结果表明,M-MRE任务(一种多式文本图像设想)中也可以观察到MRE。这提供了有力的证据,表明MRE促进了三项相互关联的任务之间的相互收益,确认了其在文本领域以外的一般可操作性。


Article 127

Title@2025-06-11 (3): ToxSyn-PT: A Large-Scale Synthetic Dataset for Hate Speech Detection in Portuguese

Title: ToxSyn-PT: A Large-Scale Synthetic Dataset for Hate Speech Detection in Portuguese ToxSyn-PT: Ein großformatiger synthetischer Datensatz für Hass-Spracherkennung auf Portugiesisch ToxSyn-PT:用葡萄牙语探测仇恨言论的大型合成数据集 2506.10245v1

Authors (5): Iago Alves Brito, Julia Soares Dollis, Fernanda Bufon Färber, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho

We present ToxSyn-PT, the first large-scale Portuguese corpus that enables fine-grained hate-speech classification across nine legally protected minority groups. The dataset contains 53,274 synthetic sentences equally distributed between minority groups and toxicity labels. ToxSyn-PT is created through a novel four-stage pipeline: (1) a compact, manually curated seed; (2) few-shot expansion with an instruction-tuned LLM; (3) paraphrase-based augmentation; and (4) enrichment, plus additional neutral texts to curb overfitting to group-specific cues. The resulting corpus is class-balanced, stylistically diverse, and free from the social-media domain that dominates existing Portuguese datasets. Despite domain differences with traditional benchmarks, experiments on both binary and multi-label classification on the corpus yield strong results across five public Portuguese hate-speech datasets, demonstrating robust generalization even across domain boundaries. The dataset is publicly released to advance research on synthetic data and hate-speech detection in low-resource settings.

我们展示了ToxSyn-PT,这是葡萄牙第一个能够对9个受法律保护的少数群体进行细微仇恨言论分类的大规模葡萄牙保护的大规模物质,该数据集包含53 274个合成句子,在少数群体群体和毒性标签之间平均分布。 托克斯-Syn-PT是通过一个新的四阶段管道创建的:(1) 一个小型、人工包扎的种子;(2) 一个按指令调制的LLM小片片片片扩大;(3) 基于原言的扩增;(4) 浓缩,加上更多的中性文字,以遏制过度适应特定群体的提示。由此形成的物质是等级平衡的,具有风格多样性,不受主导现有葡萄牙数据集的社会媒体领域的影响。尽管与传统基准存在差异,但关于该物质二进制和多标签分类的实验在五个葡萄牙公众仇恨语音数据集中产生了强有力的结果,显示即使在跨域边界的强力通用。数据集公开发布用于推进对低资源环境中的合成数据和仇恨语音检测的研究。


Article 128

Title@2025-06-11 (3): The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation

Title: The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation Das Fellowship der LLMs: Multi-Agenten-Workflows für die Synthetische Preference-Optimierung Datensatzgenerierung LLM女士的研究金:合成首选最佳制作数据集多机构工作流程 2408.08688v5

Authors (5): Samee Arif, Sualeha Farid, Abdul Hameed Azeemi, Awais Athar, Agha Ali Raza

This paper presents a novel methodology for generating synthetic Preference Optimization (PO) datasets using multi-model workflows. We evaluate the effectiveness and potential of these workflows in automating and enhancing the dataset generation process. PO dataset generation requires two modules: (1) $\textit{response evaluation}$, and (2) $\textit{response generation}$. In the $\textit{response evaluation}$ module, the responses from Large Language Models (LLMs) are evaluated and ranked - a task typically carried out by human annotators that we automate using LLMs. We assess the response evaluation module in a two-step process. In step 1, we assess LLMs as evaluators using three distinct prompting strategies. In step 2, we apply the winning prompting strategy to compare the performance of LLM-as-a-Judge, LLMs-as-a-Jury, and LLM Debate. Our evaluation shows that GPT-4o-as-a-Judge is more consistent across all datasets. For the $\textit{response generation}$ module, we use the identified LLM evaluator configuration and compare different configurations of the LLM Feedback Loop. We use the win rate to determine the best multi-model configuration for generation. Experimenting with various configurations, we find that the LLM Feedback Loop, with Llama as the generator and Gemma as the reviewer, achieves a notable 71.8% and 73.8% win rate over single-model Llama and Gemma, respectively. After identifying the best configurations for both modules, we generate our PO datasets using the above pipeline.

本文介绍了使用多模范工作流程生成合成偏好优化(PO)数据集的新方法。 我们用多模范工作流程评估这些工作流程在自动化和加强数据集生成过程中的有效性和潜力。 PO 数据集的生成需要两个模块:(1) $\ textit{ 回应评价} $ 和 (2) $\ textit{ 回应生成$ 美元。 在 $textit{ 回应评价} 模块中, 大语言模型(LLLMs) 的回复得到评估和排序 - 通常由使用LLLMs自动生成的人类标识员执行。 我们用两个步骤来评估反应评价模块的响应模块。 在步骤1中,我们用三种截然不同的提示战略来评估LMS作为评估评价员。 在步骤2中,我们应用“LLM-as-Jodge”、LLM-as-a-Jury和LM 辩论的绩效。我们的评估显示,GPT-4-a-a-a-a-Jo-judge 在所有数据集中,我们用确定最佳的LM- 和LM Ralma Ral- disal 的配置来确定最佳的图像。
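
The abstract above uses win rate to compare generation configurations. A minimal sketch of that metric follows, assuming each pairwise judgment is recorded as a label naming the winner or `"tie"`. The label names and the choice to count a tie as half a win are assumptions of this sketch; the abstract does not say how ties are handled.

```python
from collections import Counter


def win_rate(judgments, candidate="A", tie_label="tie"):
    """Share of pairwise comparisons won by `candidate`.

    Ties count as half a win, which is an assumption of this sketch.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("win rate is undefined for an empty list of judgments")
    return (counts[candidate] + 0.5 * counts[tie_label]) / total
```

For instance, two wins, one loss, and one tie for candidate `"A"` give (2 + 0.5) / 4 = 0.625.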


Article 129

Title@2025-06-11 (3): Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

Title: Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods Prompt Angriffe Enthüllen Oberflächliche Wissensentfernung in unlearning Methoden 以不学习方法清除知识 2506.10236v1

Authors (4): Yeonwoo Jang, Shariqah Hossain, Ashwin Sreevatsa, Diogo Cruz

In this work, we show that some machine unlearning methods may fail when subjected to straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families, and employ output-based, logit-based, and probe analysis to determine to what extent supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR demonstrate robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., Hindi filler text in the original prompt recovering 57.3% accuracy). Our logit analysis also confirms that unlearned models are generally not hiding knowledge by modifying the way the answer is formatted, as the correlation between output and logit accuracy is strong. These results challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish between true knowledge removal and superficial output suppression. We also make our evaluation framework publicly available to easily evaluate prompting techniques for retrieving unlearned knowledge.

在这项工作中,我们发现,当受到直截了当的迅速攻击时,一些机器的不学习方法可能会失败。我们系统地评估了三个模式家庭8种不学习技术,并采用了基于产出的、基于登录的和基于调查的分析,以确定在多大程度上可以检索到据称没有获得的知识。RMU和TAR等方法显示的是强有力的不学习,而ELM则仍然容易受到具体的迅速攻击(例如,印地语填充器的原始快速恢复57.3%的准确性)。我们的对日志的分析还证实,由于产出和登录准确性之间的关联性很强,未学习模式通常不会通过改变答案的格式来隐藏知识。这些结果挑战了关于不学习有效性的普遍假设,并强调了评价框架的必要性,这种框架能够可靠地区分真正的知识删除和表面禁止产出。我们还公开提供我们的评价框架,以方便地评价检索不学习知识的快速技术。


Article 130

Title@2025-06-11 (3): Classifying Unreliable Narrators with Large Language Models

Title: Classifying Unreliable Narrators with Large Language Models Klassifizieren von unzuverlässigen Erzählern mit großen Sprachmodellen 使用大语言模型的不可靠旁白器分类 2506.10231v1

Authors (5): Anneliese Brei, Katharine Henry, Abhisheik Sharma, Shashank Srivastava, Snigdha Chaturvedi

Often when we interact with a first-person account of events, we consider whether or not the narrator, the primary speaker of the text, is reliable. In this paper, we propose using computational methods to identify unreliable narrators, i.e. those who unintentionally misrepresent information. Borrowing literary theory from narratology to define different types of unreliable narrators based on a variety of textual phenomena, we present TUNa, a human-annotated dataset of narratives from multiple domains, including blog posts, subreddit posts, hotel reviews, and works of literature. We define classification tasks for intra-narrational, inter-narrational, and inter-textual unreliabilities and analyze the performance of popular open-weight and proprietary LLMs for each. We propose learning from literature to perform unreliable narrator classification on real-world text data. To this end, we experiment with few-shot, fine-tuning, and curriculum learning settings. Our results show that this task is very challenging, and there is potential for using LLMs to identify unreliable narrators. We release our expert-annotated dataset and code and invite future research in this area.

通常,当我们与事件的第一人叙述发生时,我们考虑的是,作为文字主要发言人的旁白人是否可靠。在本文中,我们提议使用计算方法来识别不可靠的旁白人,即无意中歪曲信息的人。从恋童癖学中借用文学理论来根据各种文字现象来界定不同类型的不可靠的旁白人,我们提出图纳,这是一组由多个领域(包括博客文章、子编辑文章、旅馆评论和文学作品)的叙述内容组成的附加说明的人类数据集。我们界定了内部、交际和文字间不可靠的分类任务,分析了受欢迎的开放重量和专有LLMs对每个人的性能。我们建议从文学中学习如何对真实世界文本数据进行不可靠的旁白分类。为此,我们试验了几分、微调和课程学习环境。我们的结果表明,这项任务非常艰巨,而且有可能使用LMS来识别不可靠的旁白的旁白人。我们发布了我们的专家附加说明的数据集和代码,并且邀请今后在这个领域进行研究。


Article 131

Title@2025-06-11 (3): Squeezed Attention: Accelerating Long Context Length LLM Inference

Title: Squeezed Attention: Accelerating Long Context Length LLM Inference Squeezed Achtung: Beschleunigung der langen Kontextlänge LLM-Inferenz 挤压注意:加速长线长长LLM 推论 2411.09688v3

Authors (9): Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, Sebastian Zhao, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Emerging Large Language Model (LLM) applications require long input context in order to perform complex tasks like document analysis and code generation. For these long context length applications, the length of the input prompt poses a significant challenge in terms of inference efficiency since the inference costs increase linearly with sequence length. However, for many of these applications, much of the context in the prompt is fixed across different user inputs, thereby providing the opportunity to perform offline optimizations in order to process user inputs quickly, as they are received. We propose Squeezed Attention to accelerate LLM applications where a large portion of the input context is fixed. We first leverage K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant, and then compute exact attention using only the important keys, thereby reducing bandwidth and computational costs. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length. We evaluate our method on long-context benchmarks including LongBench, where it achieves a 3.1$\times$ reduction in KV budget with no noticeable accuracy loss and up to an 8$\times$ reduction with only a 0.5 point accuracy gap for the LLaMA-2-7B-32K, LWM-Text-Chat-1M, and Longchat-7B-v1.5-32K models. Furthermore, we implement kernels for centroid comparison and sparse FlashAttention with important keys, achieving more than 4$\times$ speedups during both the prefill and generation phases for long-context inference. Our code is available at https://github.com/SqueezeAILab/SqueezedAttention.

新兴的大型语言模型(LLM)应用程序需要较长的输入环境才能完成文件分析和代码生成等复杂任务。 对于这些长的上下文应用,输入提示的长度在推断效率方面构成了巨大的挑战,因为推断成本随着序列长度而线性地增加。然而,对于许多这些应用,许多快速的背景在不同的用户输入中固定下来,从而提供机会进行离线优化,以便在收到时快速处理用户输入。我们提议在大部分输入环境已经固定的情况下加速LLM应用程序。我们首先利用 K-means 将固定环境的键离线组合起来,在语义相似性的基础上,代表每个组的密钥都有单一的中间值。在推断过程中,我们将用户输入的代号与元的中间值比较,以便预测从固定环境的键具有静态相关性,然后仅仅用重要的键来计算关注,从而减少带宽度和计算成本。我们还展示了我们的算法的分级版,可将注意力的复杂度相对于固定上下文长度从线性降低到对数。我们在包括LongBench在内的长上下文基准上评估了该方法:对于LLaMA-2-7B-32K、LWM-Text-Chat-1M和Longchat-7B-v1.5-32K模型,它在没有明显精度损失的情况下将KV预算减少3.1倍,在仅有0.5分精度差距的情况下最多减少8倍。此外,我们实现了用于质心比较和基于重要键的稀疏FlashAttention的内核,在长上下文推理的预填充和生成阶段均实现了4倍以上的加速。我们的代码可在 https://github.com/SqueezeAILab/SqueezedAttention 获取。
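
The cluster-then-select procedure described in the Squeezed Attention abstract above can be sketched in NumPy as follows. This is a simplified illustration, not the authors' kernels: it handles a single query and a single head, uses plain Euclidean k-means where the paper clusters by semantic similarity, and the function names, the `top_clusters` parameter, and the fallback for empty selections are all assumptions of this sketch.

```python
import numpy as np


def kmeans(keys, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd's k-means over key vectors (offline step); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = keys[labels == c]
            if len(members) > 0:          # leave an empty cluster's centroid unchanged
                centroids[c] = members.mean(axis=0)
    # Final assignment so labels are consistent with the returned centroids.
    dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def squeezed_attention(query, keys, values, centroids, labels, top_clusters):
    """Score the query against centroids, then run exact attention on the selected keys only."""
    scale = 1.0 / np.sqrt(keys.shape[1])
    centroid_scores = centroids @ query * scale           # one dot product per cluster
    chosen = np.argsort(centroid_scores)[-top_clusters:]  # highest-scoring clusters
    keep = np.isin(labels, chosen)
    if not keep.any():                                    # assumed fallback: use every key
        keep = np.ones(len(keys), dtype=bool)
    weights = softmax(keys[keep] @ query * scale)
    return weights @ values[keep], keep
```

With `top_clusters` equal to the number of clusters, the result coincides with full softmax attention; smaller values trade accuracy for fewer keys read per query, which is the saving the abstract refers to.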


Article 132

Title@2025-06-11 (3): Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges

Title: Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges Große Sprachmodelle für ambulante Referrals: Problemdefinition, Benchmarking und Herausforderungen 门诊转诊大语言模式:问题定义、基准设定和挑战 2503.08292v3

Authors (11): Xiaoxiao Liu, Qingying Xiao, Junying Chen, Xiangyi Feng, Xiangbo Wu, Bairui Zhang, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang

Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems. However, there is a lack of standardized evaluation criteria to assess their effectiveness, particularly in dynamic, interactive scenarios. In this study, we systematically examine the capabilities and limitations of LLMs in managing tasks within Intelligent Outpatient Referral (IOR) systems and propose a comprehensive evaluation framework specifically designed for such systems. This framework comprises two core tasks: static evaluation, which focuses on evaluating the ability of predefined outpatient referrals, and dynamic evaluation, which evaluates capabilities of refining outpatient referral recommendations through iterative dialogues. Our findings suggest that LLMs offer limited advantages over BERT-like models, but show promise in asking effective questions during interactive dialogues.

大型语言模式(LLMs)越来越多地应用于各保健系统的门诊转诊任务,然而,缺乏标准化的评价标准来评估其有效性,特别是在动态互动情景下的有效性,在本研究中,我们系统地审查“LLMs”在智能门诊转诊系统内管理任务的能力和局限性,并提议专为这类系统设计的综合评价框架,这一框架包括两项核心任务:静态评价,重点是评价预先确定的门诊转诊能力,动态评价,评价通过迭接对话改进门诊转诊建议的能力,我们的调查结果表明,LLMs比BERT类似的模式具有有限优势,但在互动对话期间提出有效问题的希望。


Article 133

Title@2025-06-11 (3): EgoNormia: Benchmarking Physical Social Norm Understanding

Title: EgoNormia: Benchmarking Physical Social Norm Understanding EgoNormia: Benchmarking Physical Social Norm Understanding EgoNormia:确定物质社会规范基准 2502.20490v5

Authors (7): MohammadHossein Rezaei, Yicheng Fu, Phil Cuvin, Caleb Ziems, Yanzhe Zhang, Hao Zhu, Diyi Yang

Human activity is moderated by norms; however, supervision for normative reasoning is sparse, particularly where norms are physically- or socially-grounded. We thus present EGONORMIA $|\epsilon|$, comprising 1,853 (200 for EGONORMIA-verified) multiple choice questions (MCQs) grounded within egocentric videos of human interactions, enabling the evaluation and improvement of normative reasoning in vision-language models (VLMs). EGONORMIA spans seven norm categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline to generate grounded MCQs from raw egocentric video. Our work demonstrates that current state-of-the-art VLMs lack robust grounded norm understanding, scoring a maximum of 54% on EGONORMIA and 65% on EGONORMIA-verified, with performance across norm categories indicating significant risks of safety and privacy when VLMs are used in real-world agents. We additionally explore methods for improving normative understanding, demonstrating that a naive retrieval-based generation (RAG) method using EGONORMIA can enhance normative reasoning in VLMs.

人类活动受规范的制约;然而,对规范性推理的监督很少,特别是在有物理或社会依据的规范或社会依据的规范的情况下,我们因此提出EGONORMIA $epsilon $1,853(EGONORMIA核查的200美元)基于以自我为中心的人类互动视频中的自我中心视频的多重选择问题(MCQs),这有利于评价和改进视觉模型(VLMS)的规范性推理。EGONORMIA 涵盖七个规范类别:安全、隐私、隐私、丙类、礼仪、礼貌、合作、协调/积极性、以及交流/可能性。为了大规模汇编这一数据集,我们建议了一个新的管道,从原始自利自利视频中产生有根的MCQ。我们的工作表明,目前最新的VLMS状态缺乏坚实的规范理解,在EGORMIA和EGNORMIA上最高得54%的评分,其表现表明,在现实世界代理人使用VLMS时,安全和隐私的重大风险。我们还进一步探索改进规范理解的方法,表明在ERA-RA(NAR)的一代中可以加强S-RUA)


Article 134

Title@2025-06-11 (3): Failure Modes of LLMs for Causal Reasoning on Narratives

Title: Failure Modes of LLMs for Causal Reasoning on Narratives Failure Modes von LLMs für die ursächliche Begründung von Narrativen 以叙述为由解释原因的LLMs失败模式 2410.23884v4

Authors (5): Khurram Yamin, Shantanu Gupta, Gaurav R. Ghosal, Zachary C. Lipton, Bryan Wilder

In this work, we investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge. For example, LLMs tend to determine causal relationships based on the topological ordering of events (i.e., earlier events cause later ones), resulting in lower performance whenever events are not narrated in their exact causal order. Similarly, we demonstrate that LLMs struggle with long-term causal reasoning and often fail when the narratives are long and contain many events. Additionally, we show LLMs appear to rely heavily on their parametric knowledge at the expense of reasoning over the provided narrative. This degrades their abilities whenever the narrative opposes parametric knowledge. We extensively validate these failure modes through carefully controlled synthetic experiments, as well as evaluations on real-world narratives. Finally, we observe that explicitly generating a causal graph generally improves performance while naive chain-of-thought is ineffective. Collectively, our results distill precise failure modes of current state-of-the-art models and can pave the way for future techniques to enhance causal reasoning in LLMs.

在这项工作中,我们调查大型语言模型(LLMs)的因果推理能力,方法是通过从叙述中推断因果关系这一具有代表性的问题来推断大语言模型(LLMs)的因果推理能力;我们发现,即使是最先进的语言模型也依赖不可靠的捷径,无论是叙述性陈述还是其参数知识;例如,LLMs往往根据事件的地形顺序确定因果关系(即早期事件导致后来的事件),因此,如果事件没有按其确切的因果顺序加以说明,其性能就会下降;同样,我们证明LLMs与长期因果推理斗争,当叙述很长且包含许多事件时往往失败。此外,我们显示LLMs似乎严重依赖其准度知识,而牺牲了对所提供的叙述性陈述的推理。这削弱了他们的能力,每当叙述性说明反对参数知识时,我们通过仔细控制的合成实验以及实际世界叙述性叙述性来广泛验证这些失败模式。最后,我们指出,明确产生因果关系的图表通常会改善业绩,而天真的思维链是无效的。从整体上看,我们的结果会保持准确的失败模式,从而强化了当前磁号模型的推算方法,可以加强未来的推导法。


Article 135

Title@2025-06-11 (3): TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games

Title: TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games TTT-Bench: Ein Benchmark für die Bewertung von Vernunftfähigkeit mit einfachen und neuartigen Tic-Tac-Toe-Spielen TTT-Bench:以简单和新颖的Tic-Tac-Toe式游戏评估推理能力的基准 2506.10209v1

Authors (6): Prakamya Mishra, Jiang Liu, Jialian Wu, Xiaodong Yu, Zicheng Liu, Emad Barsoum

Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark that is designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board’s spatial configurations, to ensure a win. We evaluate a diverse set of state-of-the-art LRMs, and discover that the models that excel at hard math problems frequently fail at these simple reasoning games. Further testing reveals that our evaluated reasoning models score on average 41% and 5% lower on TTT-Bench compared to MATH 500 and AIME 2024, respectively, with larger models achieving higher performance using shorter reasoning traces, where most of the models struggle on long-term strategic reasoning situations on simple and new TTT-Bench tasks.

大型推理模型(LRMs)在包括奥林匹亚级数学问题在内的广泛任务中展示了令人印象深刻的推理能力,这表明了他们复杂的推理能力。虽然许多推理基准侧重于STEM领域,但LRM在更广泛的任务领域正确理性的能力仍未得到充分探讨。在这项工作中,我们引入了 ktextbf{TTTTT-Bench},这是一个新的基准,目的是通过一套由四套双玩者Tic-Tac-Toe风格的游戏来评估LRM的基本战略、空间和逻辑推理能力,这四套游戏在人类年轻时可以不懈地解决。我们提出了一种简单但可扩展的方案方法,为TTTT-Bench带来可核查的双玩游戏问题。虽然这些游戏对人类来说微不足道,但我们需要推理对手的意图,以及游戏局的空间配置,以确保获胜。我们评估了一套不同的短期LRMMMs型模型, 以及各种人类从年轻时就能不懈地解决更高层次的推理学问题。我们用最简单的推理的TTTTTT-Blex__Bard的推理模型来进一步检验我们如何评估如何在平均的20美元的推算。


Article 136

Title@2025-06-11 (3): SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Title: SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models SealQA: Anhebung der Messlatte für die Vernunft in Search-Augmented Language Models SealQA: 提高搜索增强语言模式中的原因栏 2506.01062v2

Authors (6): Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, Tu Vu

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in “needle-in-a-haystack” settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the “lost-in-the-middle” issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.

我们引入了Search-Auged 语言模型, 用于评估网络搜索结果相互冲突、吵闹或无益的实况调查问题的新挑战基准。 SealQA 以三种口味出现:(1) Seal-0 (main) 和(2) Seal-Hard, 评估真实准确性和推理能力, 由Seal-0 侧重于聊天模式(例如GPT-4.1)通常达到接近零准确度的最具有挑战性的问题;(3) LongSeal, 将SealQA 扩展至测试“需要-a-haystack” 设置中的长相文本、多文档推理。 我们的评估揭示了当前模式中的关键局限性:(1) 甚至边境LLMS在所有Seal 口味上表现不佳。 在Seil-0, 配有O3 和 o4 工具的边境代理模型分别只达到17.1% 和6.3% 准确度, 其最佳推理工作。 我们发现, DeepSeek- R1-671B 和 o3 Mini 等高级推理模型仍然极易受到冷测的搜索结果的伤害。 。 。 明显, 不断的测试- 分析- 和不断下降的收益 , 无法产生 。


Article 137

Title@2025-06-11 (3): Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

Title: Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval Q2E: Abfrage-zu-Veranstaltungs-Zersetzung für Null-Schuss-Mehrsprachiges Text-zu-Video-Retrieval 问题2E: 零热多语种文字到视频检索的从查询到编辑的分解 2506.10202v1

Authors (2): Shubhashis Roy Dipta, Francis Ferraro

Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.

最近的做法显示,在从大语言模型和视觉语言模型中提取和利用参数知识方面,熟练地掌握了巨大语言模型(LLMS)和视觉语言模型(VLMS)的参数知识,令人印象深刻。在这项工作中,我们考虑如何通过自动提取关于复杂的现实世界事件的潜在参数知识,改进与这些复杂事件有关的视频的识别和检索。我们介绍了Q2E:零光多语种文本到视频检索的Query-to-Event分解方法、可调整的跨数据集、域、LLMS或VLMS。我们的方法表明,通过使用LMS和VLMS所含的知识解析查询,我们可以提高对人类问题的理解。我们进一步展示了如何将我们的方法应用于视觉和语音两方面的投入。为了将这种不同的多式联运知识结合起来,我们采用了基于酶的聚变调方法。通过对两个不同的数据集和多个检索指标的评价,我们表明,Q2E优于几个最先进的基准。我们的评估还表明,将视听信息整合到将来的检索数据。
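The abstract above mentions entropy-based fusion scoring but does not spell out the formula. The following is a minimal sketch of one plausible reading, not the authors' implementation: each modality's candidate scores are turned into a distribution, and modalities with lower entropy (more peaked, hence more confident) receive a larger fusion weight. The function names, the softmax normalization, and the weighting rule are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weighted_fusion(score_lists):
    """Fuse per-modality candidate scores (hypothetical weighting rule).

    score_lists holds one list of scores per modality, all of equal length.
    Each modality is weighted by 1 - normalized entropy of its distribution.
    """
    probs = [softmax(s) for s in score_lists]
    n = len(probs[0])
    max_h = math.log(n) if n > 1 else 1.0
    # Clamp at zero to absorb floating-point rounding on uniform inputs.
    confidences = [max(0.0, 1.0 - entropy(p) / max_h) for p in probs]
    total = sum(confidences)
    if total <= 0:
        weights = [1.0 / len(probs)] * len(probs)  # all modalities uninformative
    else:
        weights = [c / total for c in confidences]
    return [sum(w * p[i] for w, p in zip(weights, probs)) for i in range(n)]

# Toy scores for three candidate videos (illustrative values only).
visual_scores = [5.0, 0.0, 0.0]  # peaked: the visual signal is confident
audio_scores = [0.0, 0.0, 0.0]   # flat: the audio signal is uninformative
fused = entropy_weighted_fusion([visual_scores, audio_scores])
```

In this toy case the flat audio distribution gets a weight near zero, so the fused ranking follows the visual scores.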


Article 138

Title@2025-06-11 (3): Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance

Title: Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance Objekt-Halluzination in großen Visions-Sprachen-Modellen durch Bild-umrundete Anleitung abmildern 通过图像四舍五入指南在大型视觉语言模型中减轻物体幻觉 2402.08680v2

Authors (4): Linxi Zhao, Yihe Deng, Weitong Zhang, Quanquan Gu

The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate non-existing objects in the images. To address this issue, previous works focused on using specially curated datasets or powerful LLMs to rectify the outputs of LVLMs. However, these approaches require either costly training or fine-tuning, or API access to proprietary LLMs for post-generation correction. In response to these limitations, we propose Mitigating hallucinAtion via image-gRounded guIdaNcE (MARINE), a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. This is achieved by leveraging open-source vision models to extract object-level information, thereby enhancing the precision of LVLM-generated content. Our framework’s flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance. Through comprehensive evaluations across 5 popular LVLMs with diverse evaluation metrics and benchmarks, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it reduces hallucinations consistently in GPT-4V-assisted evaluation while maintaining the detailedness of LVLMs’ generations. We release our code at https://github.com/Linxi-ZHAO/MARINE.

为解决这一问题,以往的工作重点是利用特别整理的数据集或强大的LLMS来纠正LVMs的产出。然而,这些方法需要成本高的培训或微调,或者需要API获得专有的LMS来进行生成后校正。为了应对这些限制,我们提议通过图像定位古尔达恩克(MARINE)来缓解幻觉,这是一个既无培训又无API的框架。MARINE通过向LVLMs引入基于图像的指导来切实有效地减少物体的幻觉。这是通过利用公开来源的视觉模型来提取目标级信息,从而提高LVMM内容的精确度。我们的框架的灵活性进一步允许通过图像模型的整合,使得更可靠和可靠的对象级指导。通过对5个大众LVLMS系统进行全面评价,同时采用不同的评价基准和基准,从而在LVMS-LMS-DMS/REDRA规则的不断调整中,我们通过GLMS-RISMS-remartal 的升级方法来降低其有效性。


Article 139

Title@2025-06-11 (3): Building UD Cairo for Old English in the Classroom

Title: Building UD Cairo for Old English in the Classroom Gebäude UD Kairo für altes Englisch im Klassenzimmer 在教室内为旧英语建造开罗 2504.18718v2

Authors (3): Lauren Levine, Junghyun Min, Amir Zeldes

In this paper we present a sample treebank for Old English based on the UD Cairo sentences, collected and annotated as part of a classroom curriculum in Historical Linguistics. To collect the data, a sample of 20 sentences illustrating a range of syntactic constructions in the world’s languages, we employ a combination of LLM prompting and searches in authentic Old English data. For annotation we assigned sentences to multiple students with limited prior exposure to UD, whose annotations we compare and adjudicate. Our results suggest that while current LLM outputs in Old English do not reflect authentic syntax, this can be mitigated by post-editing, and that although beginner annotators do not possess enough background to complete the task perfectly, taken together they can produce good results and learn from the experience. We also conduct preliminary parsing experiments using Modern English training data, and find that although performance on Old English is poor, parsing on annotated features (lemma, hyperlemma, gloss) leads to improved performance.

在本文中,我们展示了一个基于UD开罗判决的老英语样板树库,作为历史语言教学课程的一部分,收集并附加说明,作为历史语言课课堂课程的一部分。为了收集数据,一个20句话样本,展示了世界语言的一系列综合构思。为了收集数据,我们采用了法学硕士的提示和搜索等多种真实的老英语数据。为了批注,我们把判决分配给了在接触UD之前接触时间有限的多名学生,我们对这些学生的说明进行了比较和裁决。我们的结果表明,虽然目前老英语的LLM产出并不反映真实的语法,但可以通过编辑后加以缓解,尽管初学者没有足够的背景来完美地完成这项任务,一起使用它们可以产生良好的结果并从经验中学习。我们还利用现代英语培训数据进行初步区分实验,发现虽然老英语的成绩很差,但根据附加说明的特征(lemma、hulemmma、gloss)的成绩导致业绩的改善。


Article 140

Title@2025-06-11 (3): Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce

Title: Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce Zukunft der Arbeit mit KI-Agenten: Auditing Automation und Augmentation Potential in der gesamten US-Arbeitskraft 未来与AI代理机构的合作:全美国员工队伍的审计自动化和增强潜力 2506.06576v2

Authors (7): Yijia Shao, Humishka Zope, Yucheng Jiang, Jiaxin Pei, David Nguyen, Erik Brynjolfsson, Diyi Yang

The rapid rise of compound AI systems (a.k.a., AI agents) is reshaping the labor market, raising concerns about job displacement, diminished human agency, and overreliance on automation. Yet, we lack a systematic understanding of the evolving landscape. In this paper, we address this gap by introducing a novel auditing framework to assess which occupational tasks workers want AI agents to automate or augment, and how those desires align with the current technological capabilities. Our framework features an audio-enhanced mini-interview to capture nuanced worker desires and introduces the Human Agency Scale (HAS) as a shared language to quantify the preferred level of human involvement. Using this framework, we construct the WORKBank database, building on the U.S. Department of Labor’s O*NET database, to capture preferences from 1,500 domain workers and capability assessments from AI experts across over 844 tasks spanning 104 occupations. Jointly considering the desire and technological capability divides tasks in WORKBank into four zones: Automation “Green Light” Zone, Automation “Red Light” Zone, R&D Opportunity Zone, Low Priority Zone. This highlights critical mismatches and opportunities for AI agent development. Moving beyond a simple automate-or-not dichotomy, our results reveal diverse HAS profiles across occupations, reflecting heterogeneous expectations for human involvement. Moreover, our study offers early signals of how AI agent integration may reshape the core human competencies, shifting from information-focused skills to interpersonal ones. These findings underscore the importance of aligning AI agent development with human desires and preparing workers for evolving workplace dynamics.

复合AI系统(a.k.a.a.a.,AI代理商)的迅速上升正在改变劳动力市场,使人们对就业转移、人力机构衰减和过度依赖自动化等问题感到关切。然而,我们缺乏对不断变化的形势的系统理解。在本文件中,我们通过引入新的审计框架来弥补这一差距,以评估哪些职业任务工人希望AI代理商自动化或扩大,以及这些工人的愿望如何与当前的技术能力相一致。我们的框架将强化的小型访问功能用于捕捉微妙的工人愿望,并将人类机构规模作为量化人类参与偏好程度的共同语言。我们利用这一框架,在劳工部O*NET数据库的基础上,建立WorkBank数据库,从1,500个领域工人的偏好和AI代理商的能力评估,涵盖超过104种职业的844项任务。我们共同考虑将工作班克的任务分为四个区域:自动化的“绿色光区”、自动化的“亮点区”的“人类机构规模”,将人类机构规模作为量化人类参与程度的通用语言。我们利用这个框架构建了Horb Bank数据库数据库,从A.
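The four-zone split described in the abstract above can be sketched as a two-threshold classifier over worker desire and AI capability. This is an illustrative sketch only: the 1-to-5 rating scale, the threshold of 3.0, and the function name are assumptions, and the mapping of zones to quadrants is inferred from the zone names rather than stated in the abstract.

```python
def workbank_zone(desire, capability, threshold=3.0):
    """Assign a task to one of the four zones named in the abstract.

    desire: how much workers want the task automated (assumed 1-5 scale).
    capability: expert-rated AI capability for the task (assumed 1-5 scale).
    threshold: hypothetical cut-off separating "high" from "low".
    """
    high_desire = desire >= threshold
    high_capability = capability >= threshold
    if high_desire and high_capability:
        return 'Automation "Green Light" Zone'
    if high_capability:
        return 'Automation "Red Light" Zone'  # capable, but not wanted
    if high_desire:
        return "R&D Opportunity Zone"         # wanted, but not yet capable
    return "Low Priority Zone"

# Illustrative ratings, not values taken from the WORKBank database.
example = workbank_zone(desire=4.2, capability=1.8)
```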


Article 141

Title@2025-06-11 (3): Disclosure Audits for LLM Agents

Title: Disclosure Audits for LLM Agents Offenlegungsprüfungen für LLM-Agenten 对LLLM代理的披露审计 2506.10171v1

Authors (3): Saswat Das, Jameson Sandler, Ferdinando Fioretto

Large Language Model agents have begun to appear as personal assistants, customer service bots, and clinical aides. While these applications deliver substantial operational benefits, they also require continuous access to sensitive data, which increases the likelihood of unauthorized disclosures. This study proposes an auditing framework for conversational privacy that quantifies and audits these risks. The proposed Conversational Manipulation for Privacy Leakage (CMPL) framework is an iterative probing strategy designed to stress-test agents that enforce strict privacy directives. Rather than focusing solely on a single disclosure event, CMPL simulates realistic multi-turn interactions to systematically uncover latent vulnerabilities. Our evaluation across diverse domains, data modalities, and safety configurations demonstrates the auditing framework’s ability to reveal privacy risks that are not deterred by existing single-turn defenses. In addition to introducing CMPL as a diagnostic tool, the paper delivers (1) an auditing procedure grounded in quantifiable risk metrics and (2) an open benchmark for evaluation of conversational privacy across agent implementations.

大型语言模型代理商已开始以个人助理、客户服务机器人和临床助手的身份出现,这些应用程序带来大量业务效益,但也需要持续获取敏感数据,从而增加未经授权披露的可能性。本研究报告提出了对谈话隐私的审计框架,对风险进行量化和审计。提议的隐私泄漏(CMPL)框架是反复探索战略,旨在对执行严格隐私指令的应力代理商进行压力测试。CMPL模拟了现实的多点互动,以便系统地发现潜在的脆弱性。我们对不同领域、数据模式和安全配置的评价表明审计框架能够揭示隐私风险,而这种风险并没有被现有的单向防御所遏制。除了将CMPL作为一种诊断工具外,本文件还提供了(1)基于量化风险指标的审计程序,(2)用于评估跨代理商实施之间谈话隐私的公开基准。


Article 142

Title@2025-06-11 (3): Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective

Title: Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective Können LLMs gute Geschichten generieren? Erkenntnisse und Herausforderungen aus einer narrativen Planungsperspektive LLM女士能够产生好故事吗?从叙述性规划的角度看问题和挑战 2506.10161v1

Authors (2): Yi Wang, Max Kreminski

Story generation has been a prominent application of Large Language Models (LLMs). However, understanding LLMs’ ability to produce high-quality stories remains limited due to challenges in automatic evaluation methods and the high cost and subjectivity of manual evaluation. Computational narratology offers valuable insights into what constitutes a good story, which has been applied in the symbolic narrative planning approach to story generation. This work aims to deepen the understanding of LLMs’ story generation capabilities by using them to solve narrative planning problems. We present a benchmark for evaluating LLMs on narrative planning based on literature examples, focusing on causal soundness, character intentionality, and dramatic conflict. Our experiments show that GPT-4-tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging, requiring LLMs trained with reinforcement learning for complex reasoning. The results offer insights on the scale of stories that LLMs can generate while maintaining quality from different aspects. Our findings also highlight interesting problem-solving behaviors and shed light on challenges and considerations for applying LLM narrative planning in game environments.

然而,由于自动评价方法的挑战以及人工评价的成本和主观性高,对LLMS制作高质量故事的能力的了解仍然有限。 计算论的词组对构成一个好故事的内容提供了宝贵的洞察力,在对故事制作的象征性叙事规划方法中应用了这一方法。这项工作的目的是通过利用LMS的故事生成能力来解决叙述性规划问题,加深对LOMS故事生成能力的理解。我们提出了一个基准,用于根据文献实例对叙述性规划的LMS进行评价,重点是因果关系的正确性、性格故意性和戏剧性冲突。我们的实验表明,GPT-4级LMs可以小规模地产生有因果关系的好故事,但具有性质故意性和戏剧性冲突的规划仍然具有挑战性,需要LMS经过培训,通过复杂的推理学习强化LMS的学习。结果提供了LMS在保持不同方面的质量的同时能够产生的故事规模的洞察力。我们的调查结果还突出了在游戏环境中应用LLM叙述性规划方面解决行为和提出挑战和考虑的有趣问题。


Article 143

Title@2025-06-11 (3): One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence

Title: One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence Ein Patient, viele Kontexte: Skalierung medizinischer KI durch Kontextintelligenz 一名病人,多种背景:通过背景情报扩大医疗AI 2506.10157v1

Authors (9): Michelle M. Li, Ben Y. Reis, Adam Rodman, Tianxi Cai, Noa Dagan, Ran D. Balicer, Joseph Loscalzo, Isaac S. Kohane, Marinka Zitnik

Medical foundation models, including language models trained on clinical notes, vision-language models on medical images, and multimodal models on electronic health records, can summarize clinical notes, answer medical questions, and assist in decision-making. Adapting these models to new populations, specialties, or settings typically requires fine-tuning, careful prompting, or retrieval from knowledge bases. This can be impractical, and limits their ability to interpret unfamiliar inputs and adjust to clinical situations not represented during training. As a result, models are prone to contextual errors, where predictions appear reasonable but fail to account for critical patient-specific or contextual information. These errors stem from a fundamental limitation that current models struggle with: dynamically adjusting their behavior across evolving contexts of medical care. In this Perspective, we outline a vision for context-switching in medical AI: models that dynamically adapt their reasoning without retraining to new specialties, populations, workflows, and clinical roles. We envision context-switching AI to diagnose, manage, and treat a wide range of diseases across specialties and regions, and expand access to medical care.

医学基础模型,包括临床笔记培训的语言模型、医学图象的视觉语言模型以及电子健康记录多式模型,可以总结临床笔记、回答医学问题和协助决策。将这些模型适应新的人口、专业或环境通常需要细微调整、仔细推动或从知识库检索。这可能不切实际,限制了他们解释不熟悉的投入和适应培训期间没有代表的临床情况的能力。因此,模型容易出现背景错误,因为预测似乎合理,但却没有考虑到关键病人或背景信息。这些错误源于当前模型所挣扎的基本限制:动态调整其在不断变化的医疗护理环境中的行为。从这个角度看,我们勾画了医学人工外科的背景切换的愿景:在不根据新的特殊情况、人口、工作流程和临床作用进行再培训的情况下动态调整其推理的模型。我们设想背景开动人工智能以诊断、管理和治疗不同专业和不同区域的各种疾病,并扩大获得医疗护理的机会。


Article 144

Title@2025-06-11 (3): Measuring Corporate Human Capital Disclosures: Lexicon, Data, Code, and Research Opportunities

Title: Measuring Corporate Human Capital Disclosures: Lexicon, Data, Code, and Research Opportunities Corporate Human Capital Disclosures messen: Lexikon, Daten, Code und Forschungsmöglichkeiten 衡量公司人力资本披露:词汇、数据、代码和研究机会 2506.10155v1

Authors (3): Elizabeth Demers, Victor Xiaoqi Wang, Kean Wu

Human capital (HC) is increasingly important to corporate value creation. Unlike other assets, however, HC is not currently subject to well-defined measurement or disclosure rules. We use a machine learning algorithm (word2vec) trained on a confirmed set of HC disclosures to develop a comprehensive list of HC-related keywords classified into five subcategories (DEI; health and safety; labor relations and culture; compensation and benefits; and demographics and other) that capture the multidimensional nature of HC management. We share our lexicon, corporate HC disclosures, and the Python code used to develop the lexicon, and we provide detailed examples of using our data and code, including for fine-tuning a BERT model. Researchers can use our HC lexicon (or modify the code to capture another construct of interest) with their samples of corporate communications to address pertinent HC questions. We close with a discussion of future research opportunities related to HC management and disclosure.

与其它资产不同,HC目前不受明确界定的测量或披露规则的约束。 我们使用经过经证实的HC披露方法培训的机器学习算法(word2vec)来制定一份全面的HC相关关键词清单,这些关键词可分为五个亚类(DI;卫生和安全;劳动关系和文化;报酬与福利;人口统计等),它们反映了HC管理的多层面性质。我们分享了我们的词汇、公司HC披露和用于开发词汇的Python代码,我们提供了使用我们数据和代码的详细例子,包括微调BERT模型。研究人员可以使用我们的HC词汇(或修改代码以获取另一个利益结构)及其公司通信样本来解决相关的HC问题。我们最后讨论了与HC管理和披露有关的未来研究机会。


Article 145

Title@2025-06-11 (3): Analyzing Emotions in Bangla Social Media Comments Using Machine Learning and LIME

Title: Analyzing Emotions in Bangla Social Media Comments Using Machine Learning and LIME Analysieren von Emotionen in Bangla Social Media Kommentare mit maschinellem Lernen und LIME 利用机器学习和LIME分析孟加拉社会媒体评论中的情感 2506.10154v1

Authors (5): Bidyarthi Paul, SM Musfiqur Rahman, Dipta Biswas, Md. Ziaul Hasan, Md. Zahid Hossain

Research on understanding emotions in written language continues to expand, especially for understudied languages with distinctive regional expressions and cultural features, such as Bangla. This study examines emotion analysis using 22,698 social media comments from the EmoNoBa dataset. For language analysis, we employ machine learning models: Linear SVM, KNN, and Random Forest with n-gram data from a TF-IDF vectorizer. We additionally investigated how PCA affects the reduction of dimensionality. Moreover, we utilized a BiLSTM model and AdaBoost to improve decision trees. To make our machine learning models easier to understand, we used LIME to explain the predictions of the AdaBoost classifier, which uses decision trees. With the goal of advancing sentiment analysis in languages with limited resources, our work examines various techniques to find efficient techniques for emotion identification in Bangla.

有关理解书面语言情感的研究在继续扩大,特别是针对研究不足的语言,这些语言具有独特的区域表现形式和文化特征,如孟加拉语。本研究利用EmonoBa数据集的22,698份社交媒体评论来审查情感分析。在语言分析中,我们采用了机器学习模型:Linear SVM、KNN和随机森林,使用来自TF-IDF矢量器的n-gram数据。我们进一步调查了五氯苯如何影响维度的减少。此外,我们利用了BILSTM模型和AdaBoost来改进决策树。为了使机器学习模型更容易理解,我们利用LIME来解释AdaBoost分类师的预测,因为后者使用决策树。为了在资源有限的情况下推进语言的情绪分析,我们的工作研究了各种技术,以便在孟加拉语中找到有效的情感识别技术。


Article 146

Title@2025-06-11 (3): When Large Language Models are Reliable for Judging Empathic Communication

Title: When Large Language Models are Reliable for Judging Empathic Communication Wenn große Sprachmodelle zuverlässig sind, um empathische Kommunikation zu beurteilen 当大语言模型在判断同情通信方面可靠时 2506.10150v1

Authors (6): Aakriti Kumar, Nalin Poungpeth, Diyi Yang, Erina Farrell, Bruce Lambert, Matthew Groh

Large language models (LLMs) excel at generating empathic responses in text-based conversations. But how reliably do they judge the nuances of empathic communication? We investigate this question by comparing how experts, crowdworkers, and LLMs annotate empathic communication across four evaluative frameworks drawn from psychology, natural language processing, and communications, applied to 200 real-world conversations where one speaker shares a personal problem and the other offers support. Drawing on 3,150 expert annotations, 2,844 crowd annotations, and 3,150 LLM annotations, we assess inter-rater reliability between these three annotator groups. We find that expert agreement is high but varies across the frameworks’ sub-components depending on their clarity, complexity, and subjectivity. We show that expert agreement offers a more informative benchmark for contextualizing LLM performance than standard classification metrics. Across all four frameworks, LLMs consistently approach this expert-level benchmark and exceed the reliability of crowdworkers. These results demonstrate how LLMs, when validated on specific tasks with appropriate benchmarks, can support transparency and oversight in emotionally sensitive applications including their use as conversational companions.

大型语言模型(LLMS)擅长在基于文字的谈话中产生共性反应。但是,他们如何可靠地判断共性交流的细微差别?我们通过比较专家、众工和LLMS如何在心理学、自然语言处理和通信的四大评价框架中运用共性交流来调查这一问题,这四大评价框架来自心理学、自然语言处理和通信,200个现实世界对话,其中一名发言者有个人问题,另一人提供支持。我们利用3,150个专家说明、2,844个人群说明和3,150个LLM说明,评估这三个批注组之间的跨区域可靠性。我们发现,专家协议很高,但各框架的次级组成部分之间差异很大,取决于其清晰度、复杂性和主观性。我们表明,专家协议为将LMM业绩背景化提供了比标准分类指标更加丰富的基准。在所有四个框架中,LMS一直接近这一专家水平基准,超过众工的可靠性。这些结果表明,LMS在用适当基准验证具体任务时,如何支持情感敏感应用的透明度和监督,包括用作谈话伙伴。
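The abstract above relies on inter-rater reliability without naming the statistic used. As a hedged illustration, the sketch below computes Cohen's kappa, a standard chance-corrected agreement measure for two annotators. The choice of kappa is an assumption for illustration (the paper may use a different coefficient), and the labels are toy values.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected if both annotated independently at their own label rates.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy labels (1 = empathic, 0 = not), not data from the study.
expert = [1, 1, 0, 0]
model = [1, 0, 1, 0]
kappa = cohens_kappa(expert, model)
```

Here raw agreement is 50%, which is exactly what chance predicts for these label rates, so kappa is zero. This is the sense in which an agreement coefficient can be more informative than plain accuracy.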


Article 147

Title@2025-06-11 (3): Computation Mechanism Behind LLM Position Generalization

Title: Computation Mechanism Behind LLM Position Generalization Berechnungsmechanismus hinter LLM-Positionsverallgemeinerung LLM 职位普遍化背后的计算机制 2503.13305v2

Authors (2): Chi Han, Heng Ji

Most written natural languages are composed of sequences of words and sentences. Similar to humans, large language models (LLMs) exhibit flexibility in handling textual positions - a phenomenon we term position generalization. They can understand texts with position perturbations and generalize to longer texts than those encountered during training with the latest techniques. These phenomena suggest that LLMs handle positions tolerantly, but how LLMs computationally process positional relevance remains largely unexplored. This work connects the linguistic phenomenon with LLMs’ computational mechanisms. We show how LLMs enforce certain computational mechanisms for the aforementioned tolerance in position perturbations. Despite the complex design of the self-attention mechanism, this work reveals that LLMs learn a counterintuitive disentanglement of attention logits. Their values show a 0.959 linear correlation with an approximation of the arithmetic sum of positional relevance and semantic importance. Furthermore, we identify a prevalent pattern in intermediate features, which we prove theoretically enables this effect. The pattern, which is different from how randomly initialized parameters would behave, suggests that it is a learned behavior rather than a natural result of the model architecture. Based on these findings, we provide computational explanations and criteria for LLMs’ position flexibilities. This work takes a pioneering step in linking position generalization with modern LLMs’ internal mechanisms.

与人类似, 大型语言模型(LLMS)在处理文本位置时表现出灵活性, 这是一种我们称之为定位一般化的现象。 他们可以理解带有位置扰动的文本, 并且比在培训最新技术过程中遇到的文本要长。 这些现象表明, LLMS可以容忍地处理位置, 但LLMS计算过程位置相关性如何基本上没有被探索。 这项工作将语言现象与LLMS的计算机制联系起来。 我们展示了LLMS如何对定位扰动的上述容忍性实施某些计算机制。 尽管自我注意机制的设计复杂, 这项工作表明LLMS学会了一种反直觉的注意力记录脱节。 它们的价值显示了0. 959线性相关性, 与定位相关性和语义重要性的计算总和相近。 此外, 我们确定了中间特征中的一种普遍模式, 我们证明这种模式在理论上能够产生这种效果。 与随机初始参数将表现如何不同, 这表明LMS是学习的行为而不是模型结构的自然结果。 在模型中, 我们提供了一种将模型的模型上的灵活性和跨级模型的计算方法。


Article 148

Title@2025-06-11 (3): Unsupervised Elicitation of Language Models

Title: Unsupervised Elicitation of Language Models Unüberwachte Elizitation von Sprachmodellen 未受监督的语文模式的Elicus Elicooth 2506.10139v1

Authors (13): Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He, Shi Feng, Ethan Perez, Jan Leike

To steer pretrained language models for downstream tasks, today’s post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs’ capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.

为了引导下游任务的预先培训语言模式,今天的培训后模式依靠人来说明需要的行为。然而,对于具有超人能力的模型,很难或不可能获得高质量的人类监督。为了应对这一挑战,我们引入了一种新的不受监督的算法,即内部一致性最大化(ICM),在没有外部监督的情况下,在自己制作的标签上微调经过培训的语言模式。关于GSM8k验证、真理QA和Alpaca奖赏模式,我们的方法与黄金监督培训的绩效相匹配,在众包人类监督培训方面优于培训。在LMS能力非常超人的工作上,我们的方法可以大大超过人类标签培训。最后,我们表明我们的方法可以改进边界LMS的培训:我们用我们的方法来训练一个不受监督的奖励模式,并利用强化学习来训练一位基于Claude 3.5 Haiku的助理。奖赏模式和助理都比他们的人类监督对手高得多。


Article 149

Title@2025-06-11 (3): ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering

Title: ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering ChartReasoner: Code-getriebene Modalitäts-Bridging für Lang-Chain-Reasoning in der Chart-Fragebeantwortung 图表检索器:图解问题解答中长钱原因的代码-驱动模式连接 2506.10116v1

Authors (7): Caijun Jia, Nan Xu, Jingxuan Wei, Qingli Wang, Lei Wang, Bihui Yu, Junnan Zhu

Recently, large language models have shown remarkable reasoning capabilities through long-chain reasoning before responding. However, how to extend this capability to visual reasoning tasks remains an open challenge. Existing multimodal reasoning approaches convert such visual reasoning tasks into textual reasoning tasks via several image-to-text conversions, which often lose critical structural and semantic information embedded in visualizations, especially for tasks like chart question answering that require a large amount of visual detail. To bridge this gap, we propose ChartReasoner, a novel code-driven two-stage framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts code, preserving both layout and data semantics as losslessly as possible. Then, we design a general chart reasoning data synthesis pipeline, which leverages this pretrained transport model to automatically and scalably generate chart reasoning trajectories and utilizes a code validator to filter out low-quality samples. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning on our synthesized chart reasoning dataset. Experimental results on four public benchmarks clearly demonstrate the effectiveness of our proposed ChartReasoner. It can preserve the original details of the charts as much as possible and perform comparably with state-of-the-art open-source models while using fewer parameters, approaching the performance of proprietary systems like GPT-4o in out-of-domain settings.

最近,大型语言模型在答复之前通过长链推理展示了非凡的推理能力。然而,如何将这一能力扩大到直观推理任务仍然是一个公开的挑战。现有的多式联运推理方法将这种直观推理任务通过几部图像转换成文字推理任务,这些转换往往会失去在视觉化中嵌入的关键结构和语义信息,特别是像图表解答那样需要大量视觉细节的任务。为了缩小这一差距,我们提出了ChartReasoner,这是一个由代码驱动的新颖的两阶段框架,旨在对图表进行精确、可解释的推理。我们首先训练了一个高机密模型,将不同的图表图像转换成结构化的ECharts代码,将布局和数据精度都保留在尽可能无损的文字转换中。然后,我们设计了一个总图表推理推理数据合成管道,利用这一经过预先训练的运输模型自动和可伸缩地生成图表推理轨迹,并利用一个代码验证器过滤低质量的样本。最后的多式联运模型,我们用监督的精细调整和强化的学习我们综合的图表推理图集图集和实验结果的模型,在四种公共数据库中,同时以清晰地展示Gsrent推算的图表的进度图,以维护了我们的拟议的进度图的进度图的精准性地展示。


Article 150

Title@2025-06-11 (3): When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs

Title: When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs Wenn die Bedeutung bleibt die gleiche, aber Modelle Drift: Bewertung der Servicequalität unter Token-Level Behavioral Instability in LLMs 当意指保持相同时,但模型漂流:在高端行为不稳下评估服务的质量 2506.10095v1

Authors (3): Xiao Li, Joel Kreuzwieser, Alan Peters

We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent, a phenomenon we call prompt variance. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings. Applied to ten constrained tasks, PBSS reveals consistent, model-specific response shifts, suggesting statistical regularities linked to tokenization and decoding. These results highlight an overlooked dimension of model evaluation stability under rephrasing and suggest that tokenization strategies and decoding dynamics may contribute to post-training quality of service instability.

我们调查大型语言模型如何应对在象征性实现方面有所不同、但保留同样语义意图的提示,我们称之为即时差异的现象。我们建议快速基于语义的转变(PBSS),这是一个诊断性框架,用于测量在语义等同的迅速重新措辞下LLMs中的行为漂移。应用到10项限制的任务中,PBSS揭示了一致的、针对具体模式的反应转变,表明与象征性和解码相关的统计规律。这些结果凸显了在重拟中被忽略的模型评价稳定性的一个方面,并表明代用战略和解码动态可能会助长培训后服务不稳定的质量。


Article 151

Title@2025-06-11 (3): Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information

Title: Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information Chat-of-Thought: Kollaboratives Multi-Agent-System zur Generierung von Domain-spezifischen Informationen 共同探索:生成具体域域特定信息的合作多机构机构系统 2506.10086v1

Authors (4): Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Dhaval Patel

This paper presents a novel multi-agent system called Chat-of-Thought, designed to facilitate the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets. Chat-of-Thought employs multiple collaborative Large Language Model (LLM)-based agents with specific roles, leveraging advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. A key innovation in this system is the introduction of a Chat of Thought, where dynamic, multi-persona-driven discussions enable iterative refinement of content. This research explores the application domain of industrial equipment monitoring, highlights key challenges, and demonstrates the potential of Chat-of-Thought in addressing these challenges through interactive, template-driven workflows and context-aware agent collaboration.

本文介绍了一个新的多试剂系统,称为 “ 意见交换 “ ,旨在促进为工业资产生成失败模式和影响分析(FMEA)文件; “ 意见交换 “ ,采用多种协作性大型语言模式(LLM),具有具体作用,利用先进的人工智能技术和动态任务路线,优化FMEA表格的生成和验证;该系统的一项关键创新是引入 “ 思想交流 “ ,通过动态、多人驱动的讨论对内容进行迭接完善;这一研究探索了工业设备监测的应用领域,突出了主要挑战,并展示了通过互动、模板驱动的工作流程和背景意识代理合作,在应对这些挑战方面进行“意见交流”的潜力。


Article 152

Title@2025-06-11 (3): A quantum semantic framework for natural language processing

Title: A quantum semantic framework for natural language processing Ein quantensemantischer Rahmen für die natürliche Sprachverarbeitung 自然语言处理的量子语义框架 2506.10077v1

Authors (6): Christopher J. Agostino, Quan Le Thien, Molly Apsel, Denizhan Pak, Elina Lesyk, Ashabari Majumdar

Semantic degeneracy represents a fundamental property of natural language that extends beyond simple polysemy to encompass the combinatorial explosion of potential interpretations that emerges as semantic expressions increase in complexity. Large Language Models (LLMs) and other modern NLP systems face inherent limitations precisely because they operate within natural language itself, making them subject to the same interpretive constraints imposed by semantic degeneracy. In this work, we argue, using Kolmogorov complexity, that as an expression’s complexity grows, the likelihood of any interpreting agent (human or LLM-powered AI) recovering the single intended meaning vanishes. This computational intractability suggests the classical view that linguistic forms possess meaning in and of themselves is flawed. We alternatively posit that meaning is instead actualized through an observer-dependent interpretive act. To test this, we conducted a semantic Bell inequality test using diverse LLM agents as “computational cognitive systems” to interpret ambiguous word pairs under varied contextual settings. Across several independent experiments, we found average CHSH expectation values ranging from 1.2 to 2.8, with several runs yielding values (e.g., 2.3-2.4) that significantly violate the classical boundary ($|S| \leq 2$). This demonstrates that linguistic interpretation under ambiguity can exhibit non-classical contextuality, consistent with results from human cognition experiments. These results inherently imply that classical frequentist-based analytical approaches for natural language are necessarily lossy. Instead, we propose that Bayesian-style repeated sampling approaches can provide more practically useful and appropriate characterizations of linguistic meaning in context.

语义变异性代表自然语言的基本属性,它超越简单的多语种特性,包括随着语义表达形式复杂性增加而出现的潜在解释的组合爆炸。大语言模型(LLMS)和其他现代NLP系统面临固有的限制,原因恰恰是它们本身在自然语言范围内运作,因此它们受到语言变异性所施加的相同的解释限制。在这项工作中,我们争论使用科尔莫戈洛夫的复杂性,因为一种表达方式的复杂性会增加,任何解释剂(人或LLM-动力AI)恢复单一预期含义的可能性会消失。这种计算不易显示传统观点,即语言形式本身具有意义,而这种传统观点是有缺陷的。我们假设这种含义不是通过观察员独立的解释行为而实际化的。为了测试,我们用不同的LLMTM代理来进行一个语义变异性语言变异性测试,在不同的背景环境中解释有用的词配对。在几个独立的实验中,我们发现平均CHSH的预期值从1.2到2.8不等, 并有几连串的自然语言表达结果,在实际的逻辑上可以证明。
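The CHSH quantity referenced in the abstract above can be made concrete with a short sketch. It uses one common sign convention, $S = E(a,b) - E(a,b') + E(a',b) + E(a',b')$, where each $E$ is the mean product of paired $\pm 1$ outcomes. How the paper maps word-pair interpretations to $\pm 1$ outcomes is not stated in the abstract, so the inputs below are idealized textbook correlations, not the study's data.

```python
import math

def correlation(outcome_pairs):
    """Mean product of paired +/-1 outcomes, i.e. E = <A * B>."""
    return sum(a * b for a, b in outcome_pairs) / len(outcome_pairs)

def chsh_s(e_ab, e_ab_prime, e_a_prime_b, e_a_prime_b_prime):
    """CHSH combination S = E(a,b) - E(a,b') + E(a',b) + E(a',b')."""
    return e_ab - e_ab_prime + e_a_prime_b + e_a_prime_b_prime

CLASSICAL_BOUND = 2.0                    # any local classical model obeys |S| <= 2
TSIRELSON_BOUND = 2.0 * math.sqrt(2.0)   # quantum maximum, about 2.828

# A deterministic classical strategy (both sides always answer +1) saturates the bound.
classical_s = chsh_s(1.0, 1.0, 1.0, 1.0)

# Idealized quantum correlations at the optimal measurement settings.
q = 1.0 / math.sqrt(2.0)
quantum_s = chsh_s(q, -q, q, q)
```

The reported range of 1.2 to 2.8 therefore spans clearly classical values up to values just under the quantum maximum of $2\sqrt{2}$.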


Article 153

Title@2025-06-11 (3): From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

Title: From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring Vom Urteil bis zur Störung: Frühzeitiges Stoppen von LLM-schädlichen Ausgängen durch Streaming-Content-Monitoring 从判决到干扰:通过流式内容监测及早停止LLM有害产出 2506.09996v1

Authors (5): Yang Li, Qiang Sheng, Yehan Yang, Xueyao Zhang, Juan Cao

Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation as the external safety guardrail in real-world products. Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers the performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations to provide reasonable supervision for token-level training. Then, we propose the streaming content monitor (SCM), which is trained with dual supervision of response- and token-level labels and can follow the output stream of an LLM to make a timely judgment of harmfulness. Experiments show that SCM reaches a macro F1 score of 0.95+, comparable to full detection, while seeing only the first 18% of tokens in responses on average. Moreover, the SCM can serve as a pseudo-harmfulness annotator for improving safety alignment and leads to a higher harmlessness score than DPO.

虽然大多数大型语言模型(LLMS)都应用了安全调整,但LLM服务供应商通常会将随后的节制作为现实世界产品的外部安全卫士。现有的主持人主要采用常规全面检测方法,根据完全LLM产出确定有害性,造成高服务延迟。最近的工作更加关注部分检测,即由主持人监督中途发电,如果检测出有害性,则早期停止产出,但是他们直接将经过全面检测模式培训的主持人应用于不完全检测产出,引入培训-推断差距,降低绩效。在本文中,我们探索如何形成数据-模型解决方案,由本地支持部分检测。对于数据,我们建造Finharm,这是一个由29K的快速响应配对组成的数据集,配有精细微的注释,为象征性培训提供合理的监督。然后,我们提议对流内容监测器进行双重监督,对响应和象征性等级标签进行培训,并跟踪LM产出流流,以便及时判断伤害性。实验显示,SCM获得比宏观F1标准高0.95+40的顶级反应,通过18级标准进行全面检测。
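
The partial-detection idea in this abstract can be sketched as a loop that scores each growing prefix and stops as soon as a threshold is crossed. This is only an illustration: `monitor_stream`, `toy_scorer`, and the threshold are hypothetical stand-ins, and the paper's trained SCM is not reproduced here.

```python
from typing import Callable, Iterable, List, Tuple

def monitor_stream(
    tokens: Iterable[str],
    score_prefix: Callable[[List[str]], float],
    threshold: float = 0.5,
) -> Tuple[List[str], bool]:
    """Emit tokens one by one; stop early once the prefix score reaches threshold.

    Returns the tokens seen so far and a flag telling whether output was stopped.
    """
    emitted: List[str] = []
    for tok in tokens:
        emitted.append(tok)
        if score_prefix(emitted) >= threshold:
            return emitted, True   # stopped early
    return emitted, False          # full output passed the monitor

def toy_scorer(prefix: List[str]) -> float:
    """Hypothetical scorer: flags a placeholder term instead of using a model."""
    return 1.0 if "BLOCKED_TERM" in prefix else 0.0

seen, stopped = monitor_stream(
    ["step", "one", "BLOCKED_TERM", "step", "two"], toy_scorer
)
```

In this toy run the monitor halts after the third token, so the last two tokens are never generated or shown.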


Article 154

Title@2025-06-11 (3): Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling

Title: Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling Flipping gegen alle Ungeraden: LLM-Münze Flip-Bias durch verbalisierte Ablehnungsstichproben reduzieren 2506.09998v1

Authors (6): Tim Z. Xiao, Johannes Zenn, Zhen Liu, Weiyang Liu, Robert Bamler, Bernhard Schölkopf

Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.

大型语言模型(LLMS)通常能够准确描述使用自然语言的概率分布,但是它们仍然在努力从中生成忠实的样本。这种不匹配限制了它们在需要可靠的随机性的任务中的使用,例如蒙特卡洛方法、代理模拟和随机决策。我们在伯努利分布范围内调查知识和抽样之间的差距。我们引入了口头拒绝抽样(VRS),这是古典拒绝抽样的自然语言适应,促使LLM在内部使用同样的Bernoulli机制,但VRS大大降低了不同模型的抽样偏差。我们提供了理论分析,表明在轻度假设下,VRS比直接抽样有所改进,其收益可归因于算法和迅速设计。更广义地说,我们的结果表明传统概率工具如何被口头描述,并嵌入LLM工作流程,以提高可靠性,而无需使用模型内部或重度快速工程。
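
As background for the abstract above, classical rejection sampling for a Bernoulli target can be sketched as follows. This is the textbook algorithm only, under the assumption that both `p` and `q` lie strictly between 0 and 1; in VRS the accept/reject decision is made by the LLM in natural language, which this sketch does not model.

```python
import random

def bernoulli_pmf(x, p):
    """Probability of outcome x (0 or 1) under Bernoulli(p)."""
    return p if x == 1 else 1.0 - p

def rejection_sample(p, q, rng):
    """Draw one sample from Bernoulli(p) using proposals from Bernoulli(q)."""
    # m bounds the ratio target/proposal over both outcomes.
    m = max(p / q, (1.0 - p) / (1.0 - q))
    while True:
        x = 1 if rng.random() < q else 0
        accept_prob = bernoulli_pmf(x, p) / (m * bernoulli_pmf(x, q))
        if rng.random() < accept_prob:
            return x

rng = random.Random(0)
samples = [rejection_sample(0.3, 0.5, rng) for _ in range(20000)]
empirical_p = sum(samples) / len(samples)
```

Even though the proposal here is a fair coin, the accepted samples follow the target rate of 0.3, which is the property the verbalized variant relies on.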


Article 155

Title@2025-06-11 (3): Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages

Title: Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages Große Sprachmodelle für toxische Spracherkennung in ressourcenarmen Balkansprachen 低资源巴尔干语言中有毒语言探测大语言模式 2506.09992v1

Authors (2): Amel Muminovic, Amela Kadric Muminovic

Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented. We measured precision, recall, F1 score, accuracy and false positive rates. Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives. The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the lowest false alarms. We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration. These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities.

在线有毒语言造成了真正的伤害,特别是在温和工具有限的地区。在本研究中,我们评估了大型语言模型如何处理塞尔维亚、克罗地亚和波斯尼亚语的有毒评论,并使用标签数据有限。我们建立并手工贴上了一组4 500个YouTube和TikTok从音乐、政治、体育、建模、影响者内容、性别主义讨论和一般主题等不同类别视频中提取的数据集,其中包括4 500个YouTube和TikTok评论,这些评论来自音乐、政治、体育、建模、影响者内容、性别主义讨论和一般主题。四个模型(GPT-3 Turbo、GPT-4、Gemini 1.5 Pro和Claude 3 Opus)以两种模式进行了测试:零发和背景放大。我们测量了精确度、回顾、F1评分、准确度和假正率。我们展示了如何在平均0.12点中以0.12计的简短背景来改进有毒语言检测,并将F1评分提高到0.10,尽管有时增加了假正数。在背景中,Gemini在推算模式中取得最佳平衡,达到0.82和准确度,而零发GPTT-41,我们展示了最低背景能够改进了低语言测量结果。
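
The metrics listed in the abstract (precision, recall, F1, accuracy, false positive rate) all derive from the four cells of a binary confusion matrix. A minimal sketch follows, assuming label 1 means "toxic"; the example labels are made up for illustration and are not from the study's dataset.

```python
def binary_metrics(y_true, y_pred):
    """Compute standard binary classification metrics from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "false_positive_rate": false_positive_rate,
    }

# Illustrative labels only (1 = toxic, 0 = non-toxic).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Reporting the false positive rate alongside recall makes the trade-off described in the abstract visible: added context can raise recall while also raising false alarms.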


Article 156

Title@2025-06-11 (3): TaskCraft: Automated Generation of Agentic Tasks

Title: TaskCraft: Automated Generation of Agentic Tasks TaskCraft: Automatisierte Generierung von Agentischen Aufgaben TaskCraft:自动生成代理任务 2506.10055v1

Authors (17): Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li, Hongxuan Lu, Fangchen Dong, Tianrui Qin, King Zhu, Minghao Yang, Jian Yang, Ge Zhang, Jiaheng Liu, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou

Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.

需要以自主、工具使用和适应性推理解决多步问题的任务,对推进国家实验室方案和AI来说越来越重要。然而,现有的教学数据缺乏工具互动,而目前的代理基准依赖于昂贵的人类说明,限制了其可缩放性。我们引入了“textsc{TasskCraft}”,这是一个自动工作流程,可以产生可缩放的、多工具和可核实的代理任务,与执行轨迹。TlexCraft利用深度和宽度扩展扩大原子任务,制造结构上和等级上的复杂挑战。经验性结果显示,这些任务改善了生成工作流程的迅速优化,加强了受监督的对代理基础模型的微调。我们提出了一套大型合成数据,大约有36 000项,在支持今后对代理的调整和评价进行研究方面困难不一。


Article 157

Title@2025-06-11 (3): Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs

Title: Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs Omni-DPO: Ein Dual-Perspektiven-Paradigma für dynamisches Preference-Lernen von LLMs Omni-DPO:一个双双面范式,以动态方式学习LLMs 2506.10054v1

Authors (10): Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang

Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model’s evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model’s learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be available at https://github.com/pspdada/Omni-DPO.

为了应对这一挑战,我们提议Omni-DPO,这是一个双面优化框架,它共同考虑到:(1) 每种优惠的固有质量,(2) 模型在这些对子上不断演化的业绩;Omni-DPO根据数据质量和模型在培训期间的学习动态,根据数据质量和模型对样本进行适应性加权,使培训数据得到更有效的利用,并取得更好的业绩;关于各种模型和基准的实验结果显示了Omni-DPO的优越性和一般化能力。关于文字理解任务,Gemma-2-9b-it与Omni-DPO的微调,在Arena-Hard基准上击败了领先的LMM,Claude 3 Opus,在6.7点上大大超过了Aren-Hard基准。关于数学推理任务,Omni-DPOOO 持续地超越了我们所有基准基准标准。
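
To make the contrast with uniform treatment of preference pairs concrete, the sketch below computes the standard DPO loss per pair and then a weighted average. The weights are hypothetical placeholders: the abstract does not specify how Omni-DPO derives its quality and learning-dynamics weights, so this is not the paper's method.

```python
import math

def dpo_pair_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * [(pol_c - ref_c) - (pol_r - ref_r)]))
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(m)) == softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def weighted_dpo_loss(pairs, weights, beta=0.1):
    """Weighted mean of per-pair DPO losses; weights are assumed non-negative."""
    losses = [dpo_pair_loss(*pair, beta=beta) for pair in pairs]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Illustrative log-probabilities: (pol_chosen, pol_rejected, ref_chosen, ref_rejected)
pairs = [(-10.0, -12.0, -11.0, -11.0), (-8.0, -8.0, -8.0, -8.0)]
uniform = weighted_dpo_loss(pairs, [1.0, 1.0])
reweighted = weighted_dpo_loss(pairs, [2.0, 0.5])  # hypothetical weights
```

With uniform weights this reduces to ordinary DPO averaged over the batch; non-uniform weights shift the gradient toward the pairs judged more useful.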


Article 158

Title@2025-06-11 (3): Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs

Title: Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs Schritt-für-Schritt-Anweisungen und ein einfaches tabellarisches Ausgabeformat verbessern die Abhängigkeits-Abgleichgenauigkeit von LLMs 逐步指示和简单表格格式 改进LLMM的可靠性分析精确度 2506.09983v1

Authors (3): Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara

Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, where universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, together with a simplified CoNLL-U-like output format. Our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches.

在大型语言模型(LLMs)方面最近的进展使各种任务取得了令人印象深刻的成绩。然而,标准化的催化工作往往在结构上难以产生有效和准确的产出,特别是在对依赖性的分析方面。我们提出了一个新的逐步指导战略,在预测合成头和依赖性标签之前先采用通用部分标记,并采用简化的CONLL-U(如产出格式),我们的方法实现了17种语言通用依赖性数据集的最新准确性,而没有产生幻觉或污染。我们进一步显示多语种的微调同时改善了跨语种的通用性表现。我们的成果凸显了基于LLM(LM)的分类中明确推理步骤的有效性,并为基于括号的方法提供了一种可扩缩的、格式一致的替代方法。
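
A tabular output format makes structural validity easy to check mechanically. The sketch below assumes a five-column, tab-separated layout (id, form, UPOS, head, deprel); the exact columns of the paper's simplified format are not given in the abstract, so this layout is an assumption.

```python
def parse_rows(text):
    """Parse tab-separated rows: id, form, upos, head, deprel (assumed layout)."""
    rows = []
    for line in text.strip().splitlines():
        idx, form, upos, head, deprel = line.split("\t")
        rows.append({"id": int(idx), "form": form, "upos": upos,
                     "head": int(head), "deprel": deprel})
    return rows

def is_valid_tree(rows):
    """Check ids are 1..n, exactly one root (head 0), and no cycles."""
    n = len(rows)
    if [r["id"] for r in rows] != list(range(1, n + 1)):
        return False
    if sum(1 for r in rows if r["head"] == 0) != 1:
        return False
    if any(not (0 <= r["head"] <= n) or r["head"] == r["id"] for r in rows):
        return False
    heads = {r["id"]: r["head"] for r in rows}
    for start in heads:
        seen, node = set(), start
        while node != 0:          # follow head links up to the root
            if node in seen:
                return False      # revisited a token: cycle
            seen.add(node)
            node = heads[node]
    return True

good = parse_rows("1\tShe\tPRON\t2\tnsubj\n2\tsleeps\tVERB\t0\troot")
bad = parse_rows("1\ta\tX\t2\tdep\n2\tb\tX\t1\tdep\n3\tc\tX\t0\troot")
good_ok, bad_ok = is_valid_tree(good), is_valid_tree(bad)
```

A check like this can reject malformed model output before scoring, which is one practical benefit of a row-per-token format over nested brackets.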


Article 159

Title@2025-06-11 (3): When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text

Title: When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text Wenn die Detektion fehlschlägt: Die Macht von fein-getönten Modellen, um menschenähnliche Social Media-Texte zu erzeugen 当检测失败时:制作像人类一样的社会媒体文字的精选模型的力量 2506.09975v1

Authors (3): Hillary Dawkins, Kathleen C. Fraser, Svetlana Kiritchenko

Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.

检测 AI 生成的文本是一个难以开始的问题; 检测 AI 生成的文本在社交媒体上变得更加困难,因为互联网的短文本长度和非正式的、特异性的语言,因此发现 AI 生成的文本更加困难; 解决这一问题十分重要,因为社交媒体在在线影响运动中代表了一个重要的攻击矢量, 可以通过使用大量制作的 AI 生成的海报支持(或反对)特定政策、决定或事件而予以支持。 我们用一个相当精密的威胁行为体的思维方式和资源来解决这个问题, 并且从开放源、封闭源和微调LLMS的组合中创建了505, 159 AI AI 生成的社交媒体的数据集, 涵盖11个不同的争议性主题。 我们表明,虽然根据典型的研究假设,即对生成模型的了解和获取,可以被检测到这些职位,但根据更现实的假设,攻击者不会向公众发布精细的模型,可探测性下降, 这一点得到人类研究的证实。 增强实验强调各种探测算法对微调LMS的弱点。 这一结果在所有探测领域都有影响,因为微调使用LLM 。


Article 160

Title@2025-06-11 (3): Trustworthy AI: Safety, Bias, and Privacy – A Survey

Title: Trustworthy AI: Safety, Bias, and Privacy – A Survey Vertrauenswürdige KI: Sicherheit, Bias und Privatsphäre – Eine Umfrage 值得信赖的人工智能:安全、偏见和隐私 – 调查 2502.10450v2

Authors (4): Xingli Fang, Jianwei Li, Varun Mulchandani, Jung-Eun Kim

The capabilities of artificial intelligence systems have been advancing to a great extent, but these systems still struggle with failure modes, vulnerabilities, and biases. In this paper, we study the current state of the field, and present promising insights and perspectives regarding concerns that challenge the trustworthiness of AI models. In particular, this paper investigates the issues regarding three thrusts: safety, privacy, and bias, which hurt models’ trustworthiness. For safety, we discuss safety alignment in the context of large language models, preventing them from generating toxic or harmful content. For bias, we focus on spurious biases that can mislead a network. Lastly, for privacy, we cover membership inference attacks in deep neural networks. The discussions addressed in this paper reflect our own experiments and observations.

人工智能系统的能力在很大的程度上一直在进步,但这些系统仍然在与失败模式、弱点和偏见作斗争。在本文件中,我们研究了实地的现状,并就挑战AI模型可信度的关切提出了很有希望的见解和观点。特别是,本文件调查了三个方面的问题:安全、隐私和偏见,这损害了模型的可信度。在安全方面,我们在大型语言模型的背景下讨论了安全一致性问题,防止它们产生有毒或有害的内容。在偏见方面,我们侧重于误导网络的虚假偏见。最后,关于隐私,我们讨论了深层神经网络中的成员资格攻击问题。本文件所讨论的讨论反映了我们自己的经验和观察。


Article 161

Title@2025-06-11 (3): Resa: Transparent Reasoning Models via SAEs

Title: Resa: Transparent Reasoning Models via SAEs Resa: Transparente Begründungsmodelle über SAE Resa:通过SAEs建立透明说明理由模型 2506.09967v1

Authors (7): Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger

How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart’s reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.

如何以成本效益高的方式在语言模型中找到强有力的推理?我们用Resa回答这个问题,Resa是一个由1.5B推理模型组成的大家庭,通过新颖和高效的稀释自动编码调节(SAE-Tuning)程序培训了1.5B推理模型。这个方法首先培训SAE,从源模型中获取推理能力,然后利用经过培训的SAE指导标准监督微调进程,在目标模型中获取这种能力,所有这些都使用经核实的问答数据,而没有任何推理痕迹。值得注意的是,在进一步进行RL后培训之前,SAE-Tuning对某些基准模型应用了SAE-Tuning的1.5B推理模型,其经过RL培训的对应方推理学绩效的>97%,同时将培训成本从>2000x降低到约1美元,培训时间从>450x到20分钟左右。此外,当应用于经过轻度RL训练的模型(例如,在2块GPU上训练1小时以内)时,它仅需约1美元的额外成本即可实现AIME24上43.33%的Pass@1和AMC23上90%的Pass@1。


Article 162

Title@2025-06-11 (3): Language Models Resist Alignment: Evidence From Data Compression

Title: Language Models Resist Alignment: Evidence From Data Compression Sprachmodelle widerstehen Ausrichtung: Beweise aus Datenkompression 语言模型阻力对齐:数据压缩中的证据 2406.06144v4

Authors (10): Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, Yaodong Yang

Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. We further reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment. The model weights and code are available at pku-lm-resist-alignment.github.io.

大型语言模型(LLMS) 可能表现出意想不到或不可取的行为。 最近的工作集中在调整LLMS, 以减少有害的产出。 尽管作出了这些努力,一些异常现象表明,即使是一个良好操作的调整过程也可以很容易地被有意或意外地规避。 调整微调的收益是否对模型产生强力效应,或只是表面影响? 在这项工作中,我们从理论和经验角度首次探索这一现象。 具有戏剧性,我们展示了后调整模型的美元价值,即,在进一步微调后恢复到培训前阶段形成的行为分布的趋势。 利用压缩理论,我们正式推断,微调会极大地破坏与培训前的配合,可能因规模的大小不同而损害得不成比例。 我们通过不同类型和规模的模型实验来验证是否存在弹性。 具体地说,我们发现模型性表现在恢复到培训前的分布之前迅速下降,随后下降率显著下降。 此外,我们进一步揭示,弹性与培训前阶段现有模型规模和升级之前的数据的扩展具有积极关联性。


Article 163

Title@2025-06-11 (3): Outside Knowledge Conversational Video (OKCV) Dataset – Dialoguing over Videos

Title: Outside Knowledge Conversational Video (OKCV) Dataset – Dialoguing over Videos Outside Knowledge Conversational Video (OKCV) Datensatz – Dialoguing über Videos 外部知识对话视频(OKCV)数据集 – – 视频对话 2506.09953v1

Authors (5): Benjamin Reichman, Constantin Patsch, Jack Truxal, Atishay Jain, Larry Heck

In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: https://github.com/c-patsch/OKCV.

在外部知识直观解答(OK-VQA)中,模型必须确定图像中的相关视觉信息,并纳入外部知识,以准确回答问题。将这一任务扩大到基于视频的有视觉基础的对话,对话模式必须同时承认长期相关视觉细节,并回答在视觉信息中不一定存在所需信息的问题。此外,总体对话的背景必须为随后的对话加以考虑。为了探索这项任务,我们引入了一个数据集,由2 017美元的视频组成,其中5 986美元的人类附加说明对话,其中40 954美元为内部对话转折。虽然对话背景以特定视频部分为视觉基础,但问题还需要外部知识,而外部知识却不具有视觉面貌。因此,模型不仅必须确定相关视频部分,而且还要利用外部知识在对话中进行反向。我们进一步提供对我们数据集进行评估的若干基线,并显示与这项任务相关的未来挑战。数据集在这里公开提供:https://github.com/c-patsch/OKCV。


Article 164

Title@2025-06-11 (3): Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking

Title: Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking Abfrage-fokussierte Retrieval-Köpfe verbessern Long-Context Reasoning und Re-Ranking 改进长期理由和重新排序 2506.09944v1

Authors (5): Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen, Xi Ye

Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QRRETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QRRETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QRRETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.

最近的工作确定了检索头(Wu等人,2025b),这是负责检索长文本语言模型(LMS)中突出信息的一组关注头,其衡量标准是他们在Needle-a-Haystack任务中的复制式纸面行为。在本文中,我们引入了QRHEAD(Query-Foced Retreiewval Head),这是一套改进式的关注头,从长文中加强检索。我们通过将投入查询的注意力分数集中起来,利用现实世界任务(例如,长文本QA)的少数例子,来检索长文本语言模型(LMMM)中的突出信息。我们进一步引入了QR-RETRTIW,这是一个高效有效的检索器,将QRHEAD的累积关注质量作为检索分数。我们用QRR-RESTER进行长文推理推论,选择最相关的部分,从长文中推理任务LemEval和CLIPPER,这在全版背景和更精确的回收者中取得了10%的成绩,我们用LRR-RIMR-RILM 做了更精确的更精确的成绩分析。我们用更精确的成绩来评估。我们用更精确的成绩来评估。
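
The idea of using accumulated attention mass as a retrieval score can be sketched as follows. This is a simplified illustration, assuming attention weights for the selected heads are already available as nested lists; the layout `attn[head][query_token][context_token]`, the averaging choice, and the function name are assumptions rather than the paper's implementation.

```python
def score_documents(attn, doc_spans):
    """Average attention mass that query tokens place on each document span.

    attn[h][q][k]: attention from query token q to context token k for head h.
    doc_spans: list of (start, end) context-token ranges, end exclusive.
    """
    num_heads = len(attn)
    num_query_tokens = len(attn[0])
    scores = []
    for start, end in doc_spans:
        mass = sum(sum(row[start:end]) for head in attn for row in head)
        scores.append(mass / (num_heads * num_query_tokens))
    return scores

# Toy input: 1 head, 2 query tokens, 4 context tokens split into 2 documents.
attn = [[[0.1, 0.2, 0.3, 0.4],
         [0.0, 0.1, 0.4, 0.5]]]
doc_spans = [(0, 2), (2, 4)]
scores = score_documents(attn, doc_spans)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Documents with the highest scores would then be kept as context or moved to the top of a re-ranked list.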


Article 165

Title@2025-06-11 (3): 7B Fully Open Source Moxin-LLM/VLM – From Pretraining to GRPO-based Reinforcement Learning Enhancement

Title: 7B Fully Open Source Moxin-LLM/VLM – From Pretraining to GRPO-based Reinforcement Learning Enhancement 7B Vollständig offene Quelle Moxin-LLM/VLM – Vom Vortraining zur GRPO-basierten Verstärkungs-Lernverbesserung 7B 完全开放源代码 Moxin-LLM/VLM – – 从预培训到基于GROP的强化学习加强 2412.06845v5

Authors (18): Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Weiyan Shi, Xingchen Xu, Yu Huang, Wei Jiang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang

Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA, have made great contributions to the ever-increasing popularity of LLMs because they are easy to customize and deploy across diverse applications. Although open-source LLMs present unprecedented opportunities for innovation and research, the commercialization of LLMs has raised concerns about transparency, reproducibility, and safety. Many open-source LLMs fail to meet fundamental transparency requirements by withholding essential components like training code and data, which may hinder further innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a fully open-source LLM developed in adherence to the principles of open science, open source, open data, and open access. We release the pre-training code and configurations, training and fine-tuning datasets, and intermediate and final checkpoints, aiming to make continuous commitments to fully open-source LLMs. After pre-training the base model, we finetune the Moxin Base model with a SOTA post-training framework and instruction data to obtain the Moxin Instruct model. To improve the reasoning capability, we further finetune our Instruct model with chain-of-thought data distilled from DeepSeek R1, and then use Group Relative Policy Optimization (GRPO), following DeepSeek R1, to finetune our model, leading to the Moxin Reasoning model. Moreover, we develop our vision language model based on our Moxin model. Experiments show that our models achieve superior performance in various evaluations such as zero-shot evaluation, few-shot evaluation, and CoT evaluation.

最近,大语言模型(LLMS)经历了重大转变,其特点是其受欢迎程度和能力迅速提高。这一演变是GPT-4和GPT-1等专有的LLMs,由于它们业绩出色和多功能,在AI界引起了广泛的关注。与此同时,LLamaMA等开放源LLMS为LLM的日益受欢迎作出了巨大贡献,因为易于定制和在不同应用中部署这些模型。虽然开放源LMS为创新和研究提供了前所未有的深层次机会,但LLMS的商业化引起了透明度、再生性能和安全方面的担忧。许多开放源LMSMS没有达到基本的透明度要求,因为其保留了培训代码和数据等基本组成部分,可能阻碍LMMMSMS的进一步创新。为了缓解这一问题,我们引入了Mexin 7B,一个完全开放源LMMMM公司所开发的、遵守开放科学、开放源、公开数据模型和开放存取的原则。我们发布了培训前的代码和配置,随后对数据集进行了培训和调整,中间和最后的检查点,目的是使ODRMS数据库能够完全开放的升级。


Article 166

Title@2025-06-11 (3): VerIF: Verification Engineering for Reinforcement Learning in Instruction Following

Title: VerIF: Verification Engineering for Reinforcement Learning in Instruction Following VerIF: Verifikationstechnik für das Verstärkungslernen im Instruction Following VerIF: 教学中加强学习的核查工程 2506.09942v1

Authors (6): Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at https://github.com/THU-KEG/VerIF.

在这项工作中,我们探索了RL在教学方面的核查挑战,并提出了VerIF,这是将基于规则的代码核查与基于LLM的核查从一个大型推理模型(例如QwQ-32B)中结合起来的一种核查方法。为了支持这一方法,我们建立了一个高质量的教学跟踪数据集,VerInstruct, 包含约22 000个与相关核查信号相关的案例。我们向VerIF应用了RL培训,在几个有代表性的遵循指令的基准方面实现重大改进。经过培训的模型在类似规模的模型中达到最先进的性能,并广泛了解看不见的制约。我们还注意到,它们的一般能力仍然不受影响,表明与VerIF的RL和VerIF可以纳入现有的RL配方,以加强总体模型性能。我们已经发布了我们的数据集、代码和模型,以便利在 https://github.com/UKKER.上的未来研究。


Article 167

Title@2025-06-11 (3): Let’s Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Robust and Instruction-Aware ASR and OCR

Title: Let’s Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Robust and Instruction-Aware ASR and OCR Lassen Sie uns Schritt für Schritt fuse: Eine generative Fusion Decodierung Algorithmus mit LLMs für robuste und instruction-aware ASR und OCR 让我们一步一步地引信: 与用于软体和教学软件 ASR 和 OCR 的LMLMLM 生成融合解码算法 2405.14259v4

Authors (7): Chan-Jan Hsu, Yi-Chang Chen, Feng-Ting Liao, Pei-Chen Ho, Yu-Hsiang Wang, Po-Chun Hsu, Da-shan Shiu

We propose “Generative Fusion Decoding” (GFD), a novel shallow fusion framework designed to integrate large language models (LLMs) into cross-modal text recognition systems for automatic speech recognition (ASR) and optical character recognition (OCR). We derive the necessary formulations to enable GFD to operate across mismatched token spaces of different models by calculating likelihood at the byte level, thereby enabling seamless fusion and synchronous progression during the decoding process. GFD is plug-and-play by design, making it readily compatible with various auto-regressive models without the need for any re-training. GFD proves effective for general ASR and OCR tasks through intermediate and frequent interactions with LLMs, surpassing cascaded methods in English and Mandarin benchmarks. In addition, GFD transfers in-context learning abilities of LLMs and allows for adaptive ASR in instruction-aware and long-context settings, yielding significant WER reductions of up to 17.7%.

我们提出“融合融合解调”(GFD),这是一个新颖的浅质融合框架,旨在将大型语言模型(LLMs)纳入自动语音识别和光学字符识别的跨现代文本识别系统(OCR),我们通过在字节一级计算可能性,使GFD能够跨越不同模型的不相配的象征性空间运作,从而在解码过程中实现无缝融合和同步进展。GFD是设计上的插接和游戏,可以与各种自动反向模型兼容,无需再培训。 GFD通过与LLMs的中间和经常互动,超越英语和曼达林基准的累进法方法,证明对一般的ASR和OCR任务有效。此外,GFD转让LMs的文本学习能力,允许在教学觉悟和长文本环境中适应ASR,使WER大幅降低至17.7。


Article 168

Title@2025-06-11 (3): LLM-BT-Terms: Back-Translation as a Framework for Terminology Standardization and Dynamic Semantic Embedding

Title: LLM-BT-Terms: Back-Translation as a Framework for Terminology Standardization and Dynamic Semantic Embedding LLM-BT-Terms: Back-Translation als Framework für Terminologie-Standardisierung und dynamische semantische Einbettung 2506.08174v2

Authors (2): Li Weigang, Pedro Carvalho Brom

The rapid expansion of English technical terminology presents a significant challenge to traditional expert-based standardization, particularly in rapidly developing areas such as artificial intelligence and quantum computing. Manual approaches face difficulties in maintaining consistent multilingual terminology. To address this, we introduce LLM-BT, a back-translation framework powered by large language models (LLMs) designed to automate terminology verification and standardization through cross-lingual semantic alignment. Our key contributions include: (1) term-level consistency validation: by performing English -> intermediate language -> English back-translation, LLM-BT achieves high term consistency across different models (such as GPT-4, DeepSeek, and Grok). Case studies demonstrate over 90 percent of terms are preserved either exactly or semantically; (2) multi-path verification workflow: we develop a novel pipeline described as Retrieve -> Generate -> Verify -> Optimize, which supports both serial paths (e.g., English -> Simplified Chinese -> Traditional Chinese -> English) and parallel paths (e.g., English -> Chinese / Portuguese -> English). BLEU scores and term-level accuracy indicate strong cross-lingual robustness, with BLEU scores exceeding 0.45 and Portuguese term accuracy reaching 100 percent; (3) back-translation as semantic embedding: we reinterpret back-translation as a form of dynamic semantic embedding that uncovers latent trajectories of meaning. In contrast to static embeddings, LLM-BT offers transparent, path-based embeddings shaped by the evolution of the models. This reframing positions back-translation as an active mechanism for multilingual terminology standardization, fostering collaboration between machines and humans - machines preserve semantic integrity, while humans provide cultural interpretation.

英语技术术语的迅速扩张对传统的专家主导标准化提出了重大挑战,在人工智能和量子计算等快速发展的领域尤其如此。人工方法难以维持一致的多语种术语。为此,我们提出了LLM-BT,这是一个由大语言模型(LLM)驱动的回译框架,旨在通过跨语言语义对齐实现术语验证与标准化的自动化。我们的主要贡献包括:(1) 术语级一致性验证:通过执行英语 -> 中间语言 -> 英语的回译,LLM-BT在不同模型(如GPT-4、DeepSeek和Grok)之间实现了较高的术语一致性。案例研究表明,90%以上的术语在字面或语义上得以保留;(2) 多路径验证流程:我们开发了一条称为 Retrieve -> Generate -> Verify -> Optimize 的新型流水线,既支持串行路径(如英语 -> 简体中文 -> 繁体中文 -> 英语),也支持并行路径(如英语 -> 中文 / 葡萄牙语 -> 英语)。BLEU分数和术语级准确率显示出较强的跨语言稳健性,BLEU分数超过0.45,葡萄牙语术语准确率达到100%;(3) 将回译视为语义嵌入:我们将回译重新解释为一种动态语义嵌入,用以揭示意义的潜在轨迹。与静态嵌入不同,LLM-BT提供透明的、基于路径的嵌入,其形态随模型的演进而变化。这一重新定位使回译成为多语种术语标准化的主动机制,促进人机协作:机器保持语义完整,人类提供文化阐释。
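The term-level consistency check (English -> intermediate language -> English) can be sketched as a round trip followed by a comparison. This is a minimal illustration that assumes exact matching after normalization; the abstract also counts semantically preserved terms, which this sketch does not attempt. The `forward` and `backward` translators are hypothetical dictionary lookups standing in for LLM calls, and the glossary entries are invented for the example.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as drift."""
    return " ".join(term.lower().split())

def round_trip(term, forward, backward):
    """English -> intermediate language -> English."""
    return backward(forward(term))

def preservation_rate(terms, forward, backward):
    """Return (share of terms preserved exactly, list of terms that drifted)."""
    drifted = [
        t for t in terms
        if normalize(round_trip(t, forward, backward)) != normalize(t)
    ]
    return 1 - len(drifted) / len(terms), drifted

# Hypothetical stand-ins for two LLM translation calls.
EN_TO_ZH = {
    "neural network": "神经网络",
    "quantum computing": "量子计算",
    "attention head": "注意力头",
}
ZH_TO_EN = {
    "神经网络": "neural network",
    "量子计算": "Quantum  Computing",   # differs only in case and spacing
    "注意力头": "attention mechanism",  # a genuine drift in meaning
}

rate, drifted_terms = preservation_rate(
    list(EN_TO_ZH), EN_TO_ZH.__getitem__, ZH_TO_EN.__getitem__
)
```

A serial path such as English -> Simplified Chinese -> Traditional Chinese -> English would compose more than two translators in `round_trip`; a parallel path would run `preservation_rate` once per intermediate language.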


Article 169

Title@2025-06-11 (3): CaLMQA: Exploring culturally specific long-form question answering across 23 languages

Title: CaLMQA: Exploring culturally specific long-form question answering across 23 languages CaLMQA: Erforschen von kulturell spezifischen Langzeitfragen in 23 Sprachen CALMQA:探讨以23种语言回答文化上特有的长式问题 2406.17761v3

Authors (6): Shane Arora, Marzena Karpinska, Hung-Ting Chen, Ipsita Bhattacharjee, Mohit Iyyer, Eunsol Choi

Despite rising global usage of large language models (LLMs), their ability to generate long-form answers to culturally specific questions remains unexplored in many languages. To fill this gap, we perform the first study of textual multilingual long-form QA by creating CaLMQA, a dataset of 51.7K culturally specific questions across 23 different languages. We define culturally specific questions as those that refer to concepts unique to one or a few cultures, or have different answers depending on the cultural or regional context. We obtain these questions by crawling naturally-occurring questions from community web forums in high-resource languages, and by hiring native speakers to write questions in under-resourced, rarely-studied languages such as Fijian and Kirundi. Our data collection methodologies are translation-free, enabling the collection of culturally unique questions like “Kuber iki umwami wa mbere w’uburundi yitwa Ntare?” (Kirundi; English translation: “Why was the first king of Burundi called Ntare (Lion)?”). We evaluate factuality, relevance and surface-level quality of LLM-generated long-form answers, finding that (1) for many languages, even the best models make critical surface-level errors (e.g., answering in the wrong language, repetition), especially for low-resource languages; and (2) answers to culturally specific questions contain more factual errors than answers to culturally agnostic questions – questions that have consistent meaning and answer across many cultures. We release CaLMQA to facilitate future research in cultural and multilingual long-form QA.

尽管全球使用大型语言模式(LLMS)的情况在不断增多,但是,他们以多种语言对文化上的具体问题作出长期答复的能力仍未得到探讨。为填补这一空白,我们通过创建由23种不同语言组成的51.7K文化上的具体问题数据集CALMQA,对文本多语种长式QA(CALMQA)进行首次研究。我们把文化上的具体问题定义为指一种或几种文化所特有的概念,或根据文化或区域背景的不同答案。我们从社区网络论坛以高资源语言收集自然引发的问题,以及聘请本地语言演讲者以资源不足、很少受研究的语言(斐济语和基隆迪语)编写问题。我们的数据采集方法是免费的,能够收集文化上独特的问题,例如“Kuber iki umwami wa mbere w’uburundi yitwa Ntare?” (Kirund;英文翻译:“为什么布隆迪第一任国王用高语言称为Ntare (Lion) 解答?”我们评估了LM(LM) 生成的长式解答题的质量、相关性和地表层解的答案,特别是直言层解的答案。对于许多语言来说,在直言的解的解的解的答案中找到最深层次的答案。


Article 170

Title@2025-06-11 (3): Standard Language Ideology in AI-Generated Language

Title: Standard Language Ideology in AI-Generated Language Standardsprache-Ideologie in KI-generierter Sprache AI-创用语言的标准语言意识形态 2406.08726v2

Authors (5): Genevieve Smith, Eve Fleisig, Madeline Bossi, Ishita Rustagi, Xavier Yin

Standard language ideology is reflected and reinforced in language generated by large language models (LLMs). We present a faceted taxonomy of open problems that illustrate how standard language ideology manifests in AI-generated language, alongside implications for minoritized language communities and society more broadly. We introduce the concept of standard AI-generated language ideology, a process through which LLMs position “standard” languages–particularly Standard American English (SAE)–as the linguistic default, reinforcing the perception that SAE is the most “appropriate” language. We then discuss ongoing tensions around what constitutes desirable system behavior, as well as advantages and drawbacks of generative AI tools attempting, or refusing, to imitate different English language varieties. Rather than prescribing narrow technical fixes, we offer three recommendations for researchers, practitioners, and funders that focus on shifting structural conditions and supporting more emancipatory outcomes for diverse language communities.

标准语言意识形态在大型语言模式(LLMs)产生的语言中得到了反映和加强。我们展示了一个公开问题的面对面分类,说明标准语言意识形态如何在AI产生的语言中表现出来,同时对较小的语言群体和更广泛的社会产生影响。我们引入了AI产生的标准语言意识形态概念,通过这个过程,LLMs将“标准”语言-特别是标准的美国英语(SAE)定位为语言默认,强化了SAE是最“适当”的语言的观念。然后我们讨论了围绕什么是理想的系统行为以及基因化的AI工具的优缺点,试图或拒绝模仿不同的英语品种。我们不但没有提出狭义的技术修正,反而向研究人员、从业者和供资者提出三项建议,重点是改变结构条件,支持不同语言社区更加解放的结果。


Article 171

Title@2025-06-11 (3): Discovering Forbidden Topics in Language Models

Title: Discovering Forbidden Topics in Language Models Verbotene Themen in Sprachmodellen entdecken 发现语言模型中的禁止专题 2505.17441v3

Authors (4): Can Rager, Chris Wendler, Rohit Gandikota, David Bau

Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, Iterated Prefill Crawler (IPC), that uses token prefilling to find forbidden topics. We benchmark IPC on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler manages to retrieve 31 out of 36 topics within a budget of 1000 prompts. Next, we scale the crawler to a frontier model using the prefilling option of Claude-Haiku. Finally, we crawl three widely used open-weight models: Llama-3.3-70B and two of its variants finetuned for reasoning, DeepSeek-R1-70B and Perplexity-R1-1776-70B. DeepSeek-R1-70B reveals patterns consistent with censorship tuning: the model exhibits “thought suppression” behavior that indicates memorization of CCP-aligned responses. Although Perplexity-R1-1776-70B is robust to censorship, IPC elicits CCP-aligned refusal answers in the quantized model. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.

拒绝发现(refusal discovery)是指找出语言模型拒绝讨论的全部话题的任务。我们提出了这一新的问题设定,并开发了一种拒绝发现方法 Iterated Prefill Crawler(IPC),该方法利用词元预填充来寻找被禁止的话题。我们在Tulu-3-8B上对IPC进行基准测试,这是一个公开了安全调优数据的开源模型。我们的爬取器在1000条提示的预算内找回了36个话题中的31个。接着,我们利用Claude-Haiku的预填充选项,将爬取器扩展到前沿模型。最后,我们爬取了三个广泛使用的开放权重模型:Llama-3.3-70B及其两个针对推理微调的变体DeepSeek-R1-70B和Perplexity-R1-1776-70B。DeepSeek-R1-70B呈现出与审查调优相符的模式:该模型表现出“思维抑制”行为,表明其记忆了与CCP立场一致的回答。尽管Perplexity-R1-1776-70B对审查具有稳健性,IPC仍在其量化版本中引出了与CCP立场一致的拒绝回答。我们的研究结果表明,迫切需要拒绝发现方法来检测AI系统的偏见、边界和对齐失败。


Article 172

Title@2025-06-11 (3): PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants

Title: PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants PersonaLens: Ein Benchmark für die Personalisierungsbewertung in Conversational AI Assistants 人际关系:对话性AI助理个人化评价基准 2506.09902v1

Authors (6): Zheng Zhao, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B. Cohen, Emine Yilmaz

Large language models (LLMs) have advanced conversational AI assistants. However, systematically evaluating how well these assistants apply personalization–adapting to individual user preferences while completing tasks–remains challenging. Existing personalization benchmarks focus on chit-chat, non-conversational tasks, or narrow domains, failing to capture the complexities of personalized task-oriented assistance. To address this, we introduce PersonaLens, a comprehensive benchmark for evaluating personalization in task-oriented AI assistants. Our benchmark features diverse user profiles equipped with rich preferences and interaction histories, along with two specialized LLM-based agents: a user agent that engages in realistic task-oriented dialogues with AI assistants, and a judge agent that employs the LLM-as-a-Judge paradigm to assess personalization, response quality, and task success. Through extensive experiments with current LLM assistants across diverse tasks, we reveal significant variability in their personalization capabilities, providing crucial insights for advancing conversational AI systems.

大型语言模型(LLMs)具有先进的对话性人工智能助理。然而,系统评估这些助理在完成任务挑战性任务的同时,如何很好地将个性化适应于个人用户的偏好。现有的个性化基准侧重于空聊天、非对立任务或狭窄的领域,未能捕捉个性化任务型援助的复杂性。为了解决这个问题,我们引入人性模型,这是评估面向任务的AI助理个性化的全面基准。我们的基准特征是具有丰富喜好和互动史的不同用户特征,以及两个基于LLM的专职代理:一个与AI助理进行现实的面向任务的对话的用户代理,以及一个使用LLM-as-a-Judge模式评估个性化、反应质量和任务成功的法官代理。通过与当前LLM-A-A-Judge助理进行的广泛实验,我们发现其个性化能力差异很大,为推进对话性人工智能系统提供了至关重要的洞察力。


Article 173

Title@2025-06-11 (3): The Remarkable Robustness of LLMs: Stages of Inference?

Title: The Remarkable Robustness of LLMs: Stages of Inference? Die bemerkenswerte Robustheit von LLMs: Stufen der Schlussfolgerung? LLMS的显著威力:推论阶段? 2406.19384v2

Authors (3): Vedang Lad, Wes Gurnee, Max Tegmark

We investigate the robustness of Large Language Models (LLMs) to structural interventions by deleting and swapping adjacent layers during inference. Surprisingly, models retain 72-95% of their original top-1 prediction accuracy without any fine-tuning. We find that performance degradation is not uniform across layers: interventions to the early and final layers cause the most degradation, while the model is remarkably robust to dropping middle layers. This pattern of localized sensitivity motivates our hypothesis of four stages of inference, observed across diverse model families and sizes: (1) detokenization, where local context is integrated to lift raw token embeddings into higher-level representations; (2) feature engineering, where task- and entity-specific features are iteratively refined; (3) prediction ensembling, where hidden states are aggregated into plausible next-token predictions; and (4) residual sharpening, where irrelevant features are suppressed to finalize the output distribution. Synthesizing behavioral and mechanistic evidence, we provide a framework for interpreting depth-dependent computations in LLMs.

我们调查了大语言模型(LLMs)对结构干预的稳健性,在推论期间删除和互换相邻层层。令人惊讶的是,模型保留了最初的顶层-1的预测准确率的72-95%,而不作任何微调。我们发现,性能退化在各层之间是不一致的:对早期和最后层的干预造成最大的退化,而模型对中层的下降则非常强烈。这种局部敏感性模式促使我们假设在不同模式家庭和大小之间观察到的四个推论阶段:(1) 脱位化,即当地环境结合,将原始标本嵌入更高层次的表示;(2) 特征工程,即任务和实体的具体特征反复完善;(3) 预测组合,将隐藏状态汇总成可信的下方预测;(4) 残留精锐化,即不相关的特征被抑制,以最终产出分布。行为和机械性证据的合成,我们为解释LMS的深度计算提供了一个框架。
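The two structural interventions, deleting a layer and swapping adjacent layers, can be sketched on a toy stack of functions. This is an illustrative sketch only: the "layers" here are hypothetical scalar functions rather than transformer blocks, and no claim is made about how the authors implemented the interventions.

```python
from typing import Callable, List

Layer = Callable[[float], float]

def forward(layers: List[Layer], x: float) -> float:
    """Apply the layers in order, as a stand-in for a forward pass."""
    for layer in layers:
        x = layer(x)
    return x

def delete_layer(layers: List[Layer], index: int) -> List[Layer]:
    """Return a copy of the stack with one layer removed."""
    return layers[:index] + layers[index + 1:]

def swap_adjacent(layers: List[Layer], index: int) -> List[Layer]:
    """Return a copy of the stack with layers index and index + 1 exchanged."""
    swapped = list(layers)
    swapped[index], swapped[index + 1] = swapped[index + 1], swapped[index]
    return swapped

# Hypothetical three-layer stack.
stack: List[Layer] = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

baseline = forward(stack, 3.0)                        # ((3 + 1) * 2) - 3
without_middle = forward(delete_layer(stack, 1), 3.0)  # (3 + 1) - 3
with_swap = forward(swap_adjacent(stack, 0), 3.0)      # ((3 * 2) + 1) - 3
```

In an actual experiment one would run such interventions at every depth and compare top-1 prediction agreement with the unmodified model, which is the quantity the abstract reports as 72-95%.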


Article 174

Title@2025-06-11 (3): Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs

Title: Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs Weiterentwicklung der Decodierungsstrategien: Verbesserungen in der lokalen typischen Probenahme für LLMs 推进解码策略:面向LLM的局部典型采样改进 2506.05387v2

Authors (3): Jaydip Sen, Saptarshi Sengupta, Subhasis Dasgupta

This chapter explores advancements in decoding strategies for large language models (LLMs), focusing on enhancing the Locally Typical Sampling (LTS) algorithm. Traditional decoding methods, such as top-k and nucleus sampling, often struggle to balance fluency, diversity, and coherence in text generation. To address these challenges, Adaptive Semantic-Aware Typicality Sampling (ASTS) is proposed as an improved version of LTS, incorporating dynamic entropy thresholding, multi-objective scoring, and reward-penalty adjustments. ASTS ensures contextually coherent and diverse text generation while maintaining computational efficiency. Its performance is evaluated across multiple benchmarks, including story generation and abstractive summarization, using metrics such as perplexity, MAUVE, and diversity scores. Experimental results demonstrate that ASTS outperforms existing sampling techniques by reducing repetition, enhancing semantic alignment, and improving fluency.

本章探讨了大语言模型解码战略的进展,重点是加强当地典型抽样算法;传统的解码方法,如顶点取样和核心取样,往往难以在文本生成中平衡流畅、多样性和一致性;为应对这些挑战,建议采用适应性语义软件典型抽样(ASTS),作为LTS的改良版本,纳入动态诱变阈值、多目标评分和奖赏-刑罚调整;ASTS在保持计算效率的同时,确保根据具体情况生成不同版本的文本;其性能通过多种基准进行评估,包括故事生成和抽象加和,使用诸如易解性、MAUVE和多样性分数等指标。实验结果表明,ASTS通过减少重复、增强语义一致性和改善流畅度,超越了现有的取样技术。
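For orientation, the baseline that ASTS builds on can be sketched as follows. This is a minimal implementation of plain locally typical sampling as commonly described (keep the tokens whose surprisal is closest to the distribution's entropy until a target probability mass is covered, then renormalize). It does not include the dynamic entropy thresholding, multi-objective scoring, or reward-penalty adjustments of ASTS, whose details the abstract does not give; the example distribution is hypothetical.

```python
import math
from typing import Dict, List

def typical_filter(probs: List[float], mass: float = 0.9) -> Dict[int, float]:
    """Return the renormalized next-token distribution kept by typical sampling.

    Tokens are ranked by |surprisal - entropy| (most typical first) and added
    until their cumulative probability reaches `mass`.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    ranked = sorted(
        (i for i, p in enumerate(probs) if p > 0),
        key=lambda i: abs(-math.log(probs[i]) - entropy),
    )
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= mass:
            break
    return {i: probs[i] / total for i in kept}

# Hypothetical distribution over four tokens.
example = [0.5, 0.25, 0.125, 0.125]
narrow = typical_filter(example, mass=0.2)
wider = typical_filter(example, mass=0.7)
```

Note that with a small mass the filter keeps token 1 (probability 0.25) rather than the most probable token 0: its surprisal lies closest to the entropy, which is the behavior that distinguishes typical sampling from top-k or nucleus sampling.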


Article 175

Title@2025-06-11 (3): The Emergence of Abstract Thought in Large Language Models Beyond Any Language

Title: The Emergence of Abstract Thought in Large Language Models Beyond Any Language Das Entstehen abstrakten Denkens in großen Sprachmodellen jenseits jeder Sprache 超越任何语言的大型语言模式中抽象思考的出现 2506.09890v1

Authors (10): Yuxin Chen, Yiran Zhao, Yang Zhang, An Zhang, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Tat-Seng Chua, Michael Qizhe Shieh, Wenxuan Zhang

As large language models (LLMs) continue to advance, their capacity to function effectively across a diverse range of languages has shown marked improvement. Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts. This has led to the widespread assumption that LLMs may “think” in English. However, more recent results showing strong multilingual performance, even surpassing English performance on specific tasks in other languages, challenge this view. In this work, we find that LLMs progressively develop a core language-agnostic parameter space: a remarkably small subset of parameters whose deactivation results in significant performance degradation across all languages. This compact yet critical set of parameters underlies the model’s ability to generalize beyond individual languages, supporting the emergence of abstract thought that is not tied to any specific linguistic system. Specifically, we identify language-related neurons, namely those that are consistently activated during the processing of particular languages, and categorize them as either shared (active across multiple languages) or exclusive (specific to one). As LLMs undergo continued development over time, we observe a marked increase in both the proportion and functional importance of shared neurons, while exclusive neurons progressively diminish in influence. These shared neurons constitute the backbone of the core language-agnostic parameter space, supporting the emergence of abstract thought. Motivated by these insights, we propose neuron-specific training strategies tailored to LLMs’ language-agnostic levels at different development stages. Experiments across diverse LLM families support our approach.

随着大型语言模型(LLMS)不断进步,其在不同语言范围内有效运行的能力明显提高。初步研究表明,LLMS的隐藏启动能力往往类似于英语,即使是在对非英语提示作出反应时也是如此。这导致广泛假设LLMS可能用英语“思考”英语“思考”英语。然而,随着大型语言模型(LLMS)继续进步,随着大型语言模型(LLLMS)在使用其他语言的具体任务上的表现甚至超过英语的成绩,这种观点也具有挑战性。在这项工作中,LLMS逐渐开发了一个核心语言――空间参数非常少的一组参数,这些参数的失灵导致所有语言的性能严重退化。这一紧凑而又关键的一组参数表明,LLMMMS公司能够超越单个语言的范围,支持与任何特定语言系统不相连的抽象思维的出现。具体语言处理过程中,我们发现与语言相关的神经元系的不断激活,将其归类为共同的(多语言)或(具体为一种方法)。随着时间的不断发展,我们观察到共享的神经系和功能重要性显著的神经系中,同时支持这些神经系的神经系的形成。


Article 176

Title@2025-06-11 (3): Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs

Title: Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs Aufmerksamkeitskopf-Embeddings mit trainierbaren Tiefenkernen für Halluzinationserkennung in LLMs 用于在LLMM中探测幻觉的可训练深心内核嵌入器 2506.09886v1

Authors (3): Rodion Oblovatny, Alexandra Bazarova, Alexey Zaytsev

We present a novel approach for detecting hallucinations in large language models (LLMs) by analyzing the probabilistic divergence between prompt and response hidden-state distributions. Counterintuitively, we find that hallucinated responses exhibit smaller deviations from their prompts compared to grounded responses, suggesting that hallucinations often arise from superficial rephrasing rather than substantive reasoning. Leveraging this insight, we propose a model-intrinsic detection method that uses distributional distances as principled hallucination scores, eliminating the need for external knowledge or auxiliary models. To enhance sensitivity, we employ deep learnable kernels that automatically adapt to capture nuanced geometric differences between distributions. Our approach outperforms existing baselines, demonstrating state-of-the-art performance on several benchmarks. The method remains competitive even without kernel training, offering a robust, scalable solution for hallucination detection.

通过分析迅速和反应的隐藏状态分布之间的概率差异,我们提出了在大型语言模型中发现幻觉的新办法。 反之,我们发现幻觉反应与其有根据的反应相比,其速度偏差较小,这表明幻觉往往产生于表面改写而不是实质性推理。 我们利用这种洞察力,提出了一种使用分配距离作为有原则的幻觉分数、消除对外部知识或辅助模型的需求的模型探索方法。为了提高敏感性,我们采用了可自动适应捕捉分布间细微几何差异的深层次可学习的内核。我们的方法比现有基线更优,在几个基准上展示了最新水平的性能。即使没有内核培训,这种方法仍然具有竞争力,为幻觉的检测提供了一个强有力、可扩展的解决方案。
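A distributional distance between prompt and response hidden states can be sketched with maximum mean discrepancy (MMD) under a fixed RBF kernel. This is an assumption-laden illustration: the abstract does not name the distance it uses, the paper's kernels are learnable and deep rather than a fixed RBF, and the vectors below are hypothetical two-dimensional stand-ins for hidden states.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]

def rbf(x: Vector, y: Vector, gamma: float = 1.0) -> float:
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd_squared(xs: List[Vector], ys: List[Vector], gamma: float = 1.0) -> float:
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a: List[Vector], b: List[Vector]) -> float:
        return sum(rbf(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)

# Hypothetical hidden states: one prompt and two candidate responses.
prompt_states = [[0.0, 0.0], [1.0, 0.0]]
close_response = [[0.1, 0.0], [1.1, 0.0]]
distant_response = [[5.0, 5.0], [6.0, 5.0]]

score_close = mmd_squared(prompt_states, close_response)
score_distant = mmd_squared(prompt_states, distant_response)
```

Following the abstract's counterintuitive finding, a smaller distance from the prompt would be read as more suspicious, so in this toy setup `close_response` would receive the higher hallucination risk.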


Article 177

Title@2025-06-11 (3): TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration

Title: TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration TACTIC: Übersetzungsagenturen mit kognitiv-theoretischer Interactive Collaboration TACTIC: 具有认知理论互动协作的翻译剂 2506.08403v2

Authors (11): Weiya Li, Junjie Chen, Bei Li, Boyang Liu, Zichen Wen, Nuanqiao Shan, Xiaoqian Liu, Anping Liu, Huajie Liu, Hu Song, Linfeng Zhang

Machine translation has long been a central task in natural language processing. With the rapid advancement of large language models (LLMs), there has been remarkable progress in translation quality. However, fully realizing the translation potential of LLMs remains an open challenge. Recent studies have explored multi-agent systems to decompose complex translation tasks into collaborative subtasks, showing initial promise in enhancing translation quality through agent cooperation and specialization. Nevertheless, existing multi-agent translation frameworks largely neglect foundational insights from cognitive translation studies. These insights emphasize how human translators employ different cognitive strategies, such as balancing literal and free translation, refining expressions based on context, and iteratively evaluating outputs. To address this limitation, we propose a cognitively informed multi-agent framework called TACTIC, which stands for Translation Agents with Cognitive-Theoretic Interactive Collaboration. The framework comprises six functionally distinct agents that mirror key cognitive processes observed in human translation behavior. These include agents for drafting, refinement, evaluation, scoring, context reasoning, and external knowledge gathering. By simulating an interactive and theory-grounded translation workflow, TACTIC effectively leverages the full capacity of LLMs for high-quality translation. Experimental results on diverse language pairs from the FLORES-200 and WMT24 benchmarks show that our method consistently achieves state-of-the-art performance. Using DeepSeek-V3 as the base model, TACTIC surpasses GPT-4.1 by an average of +0.6 XCOMET and +1.18 COMETKIWI-23. Compared to DeepSeek-R1, it further improves by +0.84 XCOMET and +2.99 COMETKIWI-23. Code is available at https://github.com/weiyali126/TACTIC.

长期以来,机器翻译一直是自然语言处理的核心任务。随着大语言模型(LLM)的快速发展,翻译质量取得了显著进步。然而,充分发挥LLM的翻译潜力仍是一个开放性挑战。近期研究探索了多智能体系统,将复杂的翻译任务分解为协作式子任务,初步显示出通过智能体合作与分工提升翻译质量的前景。然而,现有的多智能体翻译框架大多忽视了认知翻译学的基础性见解。这些见解强调人类译者如何运用不同的认知策略,例如平衡直译与意译、根据语境润色表达以及迭代评估译文。针对这一局限,我们提出了一个基于认知理论的多智能体框架TACTIC,即 Translation Agents with Cognitive-Theoretic Interactive Collaboration。该框架包含六个功能各异的智能体,对应人类翻译行为中观察到的关键认知过程,分别负责起草、润色、评估、评分、语境推理和外部知识收集。通过模拟一个交互式、有理论依据的翻译流程,TACTIC有效利用了LLM的全部能力来实现高质量翻译。在FLORES-200和WMT24基准的多个语言对上的实验结果表明,我们的方法始终达到最先进的性能。以DeepSeek-V3为基础模型时,TACTIC平均超过GPT-4.1 +0.6 XCOMET和+1.18 COMETKIWI-23;与DeepSeek-R1相比,进一步提升+0.84 XCOMET和+2.99 COMETKIWI-23。代码见 https://github.com/weiyali126/TACTIC。


Article 178

Title@2025-06-11 (3): TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering

Title: TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering TableEval: Ein Real-World Benchmark für komplexe, mehrsprachige und multistrukturierte Tabellenfrageantworten 表Eval:复杂、多语种和多结构表问题答案的现实世界基准 2506.03949v2

Authors (7): Junnan Zhu, Jingyi Wang, Bohan Yu, Xiaoyu Wu, Junbo Li, Lei Wang, Nan Xu

LLMs have shown impressive progress in natural language processing. However, they still face significant challenges in TableQA, where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (including government, finance, academia, and industry reports). Besides, TableEval features cross-lingual scenarios with tables in Simplified Chinese, Traditional Chinese, and English. To minimize the risk of data leakage, we collect all data from recent real-world documents. Considering that existing TableQA metrics fail to capture semantic accuracy, we further propose SEAT, a new evaluation framework that assesses the alignment between model responses and reference answers at the sub-question level. Experimental results have shown that SEAT achieves high agreement with human judgment. Extensive experiments on TableEval reveal critical gaps in the ability of state-of-the-art LLMs to handle these complex, real-world TableQA tasks, offering insights for future improvements. We make our dataset available here: https://github.com/wenge-research/TableEval.

然而,在表QA中,它们仍然面临着巨大的挑战,在表QA中,现实世界的复杂情况,如不同的表格结构、多语言数据和特定领域的推理等,都是至关重要的;现有的表QA基准往往因侧重于简单的平板表格而受到限制,而且数据渗漏;此外,大多数基准都是单语的,未能在实际应用中捕捉到跨语言和跨主题的变异性;为了解决这些局限性,我们引入了表Eval,这是一个新的基准,旨在根据现实的表QA任务评估LM的准确性。具体地说,表Eval包含从四个领域(包括政府、金融、学术界和行业报告)收集的各种结构(如简洁、分级和嵌入式表格)的表格。此外,表Eval的特征是跨语言的情景,以简化的中文、中文本和英文为表格。为了尽量减少数据渗漏风险,我们从最近的实际世界文件收集了所有数据。考虑到现有的表QA指标未能反映语义准确性,我们进一步提议SEAT,一个新的评价框架,用以评估在次问题级别一级(包括政府、金融、学术界、学术界和学术界)不同层次分析的模型分析中,我们目前的数据分析中的数据分析能力显示。


Article 179

Title@2025-06-11 (3): ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

Title: ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models ImageChain: Advancing Sequential Image-to-Text Reasoning in multimodalen großen Sprachmodellen 图像Chain:在多模式大语言模型中推进按顺序排列的图像到文字理由 2502.19409v2

Authors (3): Danae Sánchez Villegas, Ingo Ziegler, Desmond Elliott

Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task – achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.

对图像序列进行解释仍然是多式大型语言模型(MLLM)的一个挑战。虽然最近的模型在培训前将多图像数据纳入多图像数据,但它们仍然难以识别相继结构,常常独立处理图像。 这项工作引入了图像CHain, 这个框架通过将视觉序列建模作为多方向对话,提高了MLLM对图像数据的顺序推理能力。 在图像Chain 中,图像与相应的文字描述相互交织,形成一种控制的对话,明确捕捉时间依赖性和描述进展。 我们的方法优化了下色描述的任务,即该模型根据先前的视觉和文字提示生成对即将到来的场景的背景觉悟描述。 我们证明我们的方法提高了下色描述任务的业绩 – – 在SimRate实现平均从3.7%到19%的改善,这一指标使语义与人类附加说明的地面真相相似。 此外,图像CHain在从漫画到机器人的应用中实现了强力的零射出场外性性工作。 广泛的实验证实,在模型和文字提示中,对正态思维中,正态思维是关键的过渡性理解。


Article 180

Title@2025-06-11 (3): Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

Title: Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning Causal Sufficiency und Necessity verbessert Kette-of-Thought-Reasoning C. 因果关系和必要性 改进审议链 理由 2506.09853v1

Authors (8): Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang

Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.

在赋予具有复杂推理能力的大型语言模型(LLMs)时,迅速推理链(Cot)发挥了不可或缺的作用,然而,Cot目前面临两个基本挑战:(1) 充分性,确保产生的中间推理步骤全面覆盖和证实最后结论;(2) 必要性,确定对得出答案的正确性真正必不可少的推论步骤;我们提议了一个因果框架,通过充足性和必要性的双重角度来说明Cot推理的特性。 纳入因果性充足性和必要性使我们不仅能够确定哪些步骤在逻辑上足够或必要,而且能够量化其对不同干预情景下最后推理结果的实际影响,从而能够自动增加缺失的步骤和调整多余的步骤。关于各种数学和常识推理基准的广泛实验结果证实在推理效率和减少象征性使用方面有很大改进,同时又不牺牲准确性。我们的工作为改进LLM推理性业绩和成本效益提供了很有希望的方向。
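For reference, the standard counterfactual definitions of the probabilities of necessity, sufficiency, and necessity-and-sufficiency from the causal inference literature are given below for a binary cause X (values x, x') and binary outcome Y (values y, y'). Reading x as "a given reasoning step is present", x' as "the step is removed or replaced", and y as "the final answer is correct" is an illustrative assumption; the abstract does not state how the paper instantiates these quantities.

```latex
\begin{align}
\mathrm{PN}  &= P\left(Y_{x'} = y' \mid X = x,\ Y = y\right), \\
\mathrm{PS}  &= P\left(Y_{x} = y \mid X = x',\ Y = y'\right), \\
\mathrm{PNS} &= P\left(Y_{x} = y,\ Y_{x'} = y'\right).
\end{align}
```

Here Y_x denotes the value Y would take had X been set to x. Under the reading above, a step with low PN would be a candidate for pruning, and a chain whose steps jointly have low PS would be a candidate for adding missing steps.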


Article 181

Title@2025-06-11 (3): Advancing Exchange Rate Forecasting: Leveraging Machine Learning and AI for Enhanced Accuracy in Global Financial Markets

Title: Advancing Exchange Rate Forecasting: Leveraging Machine Learning and AI for Enhanced Accuracy in Global Financial Markets Advancing Exchange Rate Forecasting: Nutzung von maschinellem Lernen und KI für verbesserte Genauigkeit auf den globalen Finanzmärkten 推进汇率预测:利用机器学习和人工智能提高全球金融市场的预测准确性 2506.09851v1

Authors (7): Md. Yeasin Rahat, Rajan Das Gupta, Nur Raisa Rahman, Sudipto Roy Pritom, Samiur Rahman Shakir, Md Imrul Hasan Showmick, Md. Jakir Hossen

The prediction of foreign exchange rates, such as the US Dollar (USD) to Bangladeshi Taka (BDT), plays a pivotal role in global financial markets, influencing trade, investments, and economic stability. This study leverages historical USD/BDT exchange rate data from 2018 to 2023, sourced from Yahoo Finance, to develop advanced machine learning models for accurate forecasting. A Long Short-Term Memory (LSTM) neural network is employed, achieving an exceptional accuracy of 99.449%, a Root Mean Square Error (RMSE) of 0.9858, and a test loss of 0.8523, significantly outperforming traditional methods like ARIMA (RMSE 1.342). Additionally, a Gradient Boosting Classifier (GBC) is applied for directional prediction, with backtesting on a $10,000 initial capital revealing a 40.82% profitable trade rate, though resulting in a net loss of $20,653.25 over 49 trades. The study analyzes historical trends, showing a decline in BDT/USD rates from 0.012 to 0.009, and incorporates normalized daily returns to capture volatility. These findings highlight the potential of deep learning in forex forecasting, offering traders and policymakers robust tools to mitigate risks. Future work could integrate sentiment analysis and real-time economic indicators to further enhance model adaptability in volatile markets.

对汇率的预测,如美元对孟加拉国塔卡(BDT)的预测,在全球金融市场中发挥着关键作用,影响贸易、投资和经济稳定。这项研究利用了2018年至2023年来自雅虎金融公司的历史美元/BDT汇率数据,开发了用于准确预测的先进机器学习模型。采用了长期短期内存神经网络(LSTM)99.449 % 的异常精确度,根极平方错误为0.98588, 测试损失0.8523, 大大超过ARIMA(RMSE 1.342)等传统方法。此外,在定向预测中使用了渐进式促进分类法(GBC GBC ) , 以10 000 美元初始资本为基准,显示40.82%的盈利贸易率,尽管在49个贸易中造成20 653.25美元的净损失。研究分析了历史趋势,表明BDT/美元汇率从0.012下降到0.009,并纳入了正常的每日回报以捕捉到波动性。这些研究结果突出表明了深度的预测、向贸易商和决策者提供稳健的经济感分析工具以进一步降低未来风险的可能性。
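The headline figures can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the abstract (a 40.82% profitable-trade rate over 49 trades, and a BDT/USD move from 0.012 to 0.009) and shows how an RMSE is computed; the function names and the sample series passed to `rmse` are hypothetical, and none of the study's data is reproduced.

```python
import math
from typing import Sequence

def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between two equally long series."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

def relative_change(start: float, end: float) -> float:
    """Fractional change from start to end (negative means a decline)."""
    return (end - start) / start

def implied_profitable_trades(rate: float, n_trades: int) -> int:
    """Whole number of profitable trades implied by a reported rate."""
    return round(rate * n_trades)

# Figures quoted in the abstract.
bdt_usd_change = relative_change(0.012, 0.009)          # a 25% decline
winning_trades = implied_profitable_trades(0.4082, 49)  # 20 of 49 trades

# Hypothetical series, only to show the RMSE computation.
toy_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

The reported rate is consistent with 20 profitable trades out of 49, since 20/49 rounds to 40.82%.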


Article 182

Title@2025-06-11 (3): Dataset of News Articles with Provenance Metadata for Media Relevance Assessment

Title: Dataset of News Articles with Provenance Metadata for Media Relevance Assessment Datensatz von Nachrichtenartikeln mit Provenienz Metadaten für die Medienrelevanzbewertung 为媒体相关性评估提供具有证明的元数据的新闻文章数据集 2506.09847v1

Authors (2): Tomas Peterka, Matyas Bohacek

Out-of-context and misattributed imagery is the leading form of media manipulation in today’s misinformation and disinformation landscape. The existing methods attempting to detect this practice often only consider whether the semantics of the imagery corresponds to the text narrative, missing manipulation so long as the depicted objects or scenes somewhat correspond to the narrative at hand. To tackle this, we introduce News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. We formulate two tasks on this dataset, location of origin relevance (LOR) and date and time of origin relevance (DTOR), and present baseline results on six large language models (LLMs). We identify that, while the zero-shot performance on LOR is promising, the performance on DTOR lags behind, leaving room for specialized architectures and future work.

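脱离语境和错误归属的图像是当今错误信息和虚假信息环境中最主要的媒体操纵形式。现有检测方法往往只考虑图像语义是否与文本叙述相符,只要所描绘的对象或场景与当前叙述大致对应,就会漏掉操纵行为。为解决这一问题,我们引入了News Media Provenance Dataset,这是一个带有来源标注图像的新闻文章数据集。我们在该数据集上提出了两项任务,即来源地点相关性(LOR)和来源日期与时间相关性(DTOR),并给出了六个大语言模型(LLM)的基线结果。我们发现,虽然LOR上的零样本表现令人鼓舞,但DTOR上的表现不佳,为专门架构和未来工作留下了空间。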


Article 183

Title@2025-06-11 (3): Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective

Title: Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective Produkt von Experten mit LLMs: Leistungssteigerung auf ARC ist eine Frage der Perspektive 与LLMM公司的专家产品有关的产品:促进ACR的绩效是一个透视问题 2505.07859v2

Authors (3): Daniel Franzen, Jan Disselhoff, David Hartmann

The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we use the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2ct per task on readily available hardware (we assume a price of 36ct/hour for an Nvidia 4090 GPU).

抽象和理性公司(ARC-AGI)对大型语言模型(LLM)提出了重大挑战,暴露了其抽象推理能力的局限性。 在这项工作中,我们在整个培训、生成和评分阶段利用任务特定数据增强,并使用深度第一搜索算法生成多样、高概率候选解决方案。 此外,我们不仅将LLM作为生成者,而且作为计分者,利用其产出概率选择最有希望的解决方案。我们的方法在ARC-AGI公开评估组上取得了71.6%的分数(286.5.400个已解决的任务 ) , 展示了公众可用方法的最新表现。 虽然同时进行的封闭源工作报告得分较高,但我们的方法通过透明性、可复制性和极低的引价而有所区别,每件易用硬件平均只有2ct左右(我们承担4090 GPUNvidia的36ct/小时价格 ) 。
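
The scoring idea in the abstract above (using the model's output probabilities across augmented views of a task to pick a solution) can be sketched as a product of per-view probabilities, which is a sum of log-probabilities. This is an illustrative sketch only: how the paper actually combines views is not stated in the abstract, and the candidate names and probabilities below are made up. The last lines check the abstract's own arithmetic: 286.5/400 is 71.6%, and 2ct at 36ct/hour corresponds to about 200 seconds of GPU time per task.

```python
import math

def product_of_experts_score(log_probs_per_view):
    """Log of the product of per-view probabilities, i.e. the sum of log-probs."""
    return sum(log_probs_per_view)

def select_best(candidates):
    """candidates maps a candidate id to its log-probs, one per augmented view."""
    return max(candidates, key=lambda c: product_of_experts_score(candidates[c]))

# Hypothetical candidates scored under three augmented views of one task.
candidates = {
    "A": [math.log(0.9), math.log(0.1), math.log(0.9)],  # one view finds A implausible
    "B": [math.log(0.6), math.log(0.6), math.log(0.6)],  # consistently plausible
}
best = select_best(candidates)  # the product favors B (0.216 vs 0.081)

# Arithmetic from the abstract.
score_percent = round(286.5 / 400 * 100, 1)
seconds_per_task = 2 / 36 * 3600  # 2ct per task at 36ct per hour
```

Multiplying probabilities means a single view that assigns a candidate low probability can veto it, which is the usual motivation for a product rather than an average.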


Article 184

Title@2025-06-11 (3): Crafting Customisable Characters with LLMs: Introducing SimsChat, a Persona-Driven Role-Playing Agent Framework

Title: Crafting Customisable Characters with LLMs: Introducing SimsChat, a Persona-Driven Role-Playing Agent Framework Anpassbare Charaktere mit LLMs gestalten: Einführung von SimsChat, einem Persona-Driven Role-Playing Agent Framework 与LLMM女士一起制作自定义字符:介绍人际角色扮演代理框架SimsChat 2406.17962v6

Authors (8): Bohao Yang, Dong Liu, Chenghao Xiao, Kun Zhao, Chao Li, Lin Yuan, Guang Yang, Chenghua Lin

Large Language Models (LLMs) demonstrate remarkable ability to comprehend instructions and generate human-like text, enabling sophisticated agent simulation beyond basic behavior replication. However, the potential for creating freely customisable characters remains underexplored. We introduce the Customisable Conversation Agent Framework, which employs LLMs to simulate real-world characters through personalised characteristic feature injection, enabling diverse character creation according to user preferences. We propose the SimsConv dataset, comprising 68 customised characters and 13,971 multi-turn role-playing dialogues across 1,360 real-world scenes. Characters are initially customised using pre-defined elements (career, aspiration, traits, skills), then expanded through personal and social profiles. Building on this, we present SimsChat, a freely customisable role-playing agent incorporating various realistic settings and topic-specified character interactions. Experimental results on both SimsConv and WikiRoleEval datasets demonstrate SimsChat’s superior performance in maintaining character consistency, knowledge accuracy, and appropriate question rejection compared to existing models. Our framework provides valuable insights for developing more accurate and customisable human simulacra. Our data and code are publicly available at https://github.com/Bernard-Yang/SimsChat.

大型语言模型(LLMS) 展示了理解指令和生成类似人类文字的非凡能力,使得复杂的代理器模拟超越基本行为复制,使得超越基本行为复制的尖端代理器模拟能力得以实现。然而,创建可自由定制字符的潜力仍然没有得到充分发挥。我们引入了自定义的交流代理框架,即使用LLMS来通过个性化特征注入模拟真实世界字符,从而根据用户的喜好创建不同的字符。我们提议Sims Conv数据集,其中包括68个定制字符和13,971个在1,360个真实世界场景中进行多回合角色扮演的对话。字符最初是使用预先定义的元素(职业、愿望、特性、技能)定制的,然后通过个人和社会概况加以扩展。我们在此的基础上介绍SimsChat,一个自由定制的角色扮演代理,包含各种现实环境和专题性字符互动。我们关于Sims Conv和WikRoleEval数据集的实验结果,显示了SimsChat在保持性一致性、知识准确性和与现有模型相比适当的问题拒绝方面的优异性。我们的框架为开发更准确和可定制的人类定义/MembLAMS/Simmulasmax。我们的数据和代码是公开的。


Article 185

Title@2025-06-11 (3): Error-Guided Pose Augmentation: Enhancing Rehabilitation Exercise Assessment through Targeted Data Generation

Title: Error-Guided Pose Augmentation: Enhancing Rehabilitation Exercise Assessment through Targeted Data Generation Fehlergeführte Pose Augmentation: Verbesserung der Rehabilitationsübungsbewertung durch gezielte Datengenerierung 错误引导粒子增强:通过有针对性数据生成加强复原工作评估 2506.09833v1

Authors (2): Omar Sherif, Ali Hamdi

Effective rehabilitation assessment is essential for monitoring patient progress, particularly in home-based settings. Existing systems often face challenges such as data imbalance and difficulty detecting subtle movement errors. This paper introduces Error-Guided Pose Augmentation (EGPA), a method that generates synthetic skeleton data by simulating clinically relevant movement mistakes. Unlike standard augmentation techniques, EGPA targets biomechanical errors observed in rehabilitation. Combined with an attention-based graph convolutional network, EGPA improves performance across multiple evaluation metrics. Experiments demonstrate reductions in mean absolute error of up to 27.6 percent and gains in error classification accuracy of 45.8 percent. Attention visualizations show that the model learns to focus on clinically significant joints and movement phases, enhancing both accuracy and interpretability. EGPA offers a promising approach for improving automated movement quality assessment in both clinical and home-based rehabilitation contexts.

有效的康复评估对于监测病人的进展至关重要,特别是在家庭环境中。现有系统往往面临数据不平衡和难以发现微妙的移动错误等挑战。本文介绍错误引导增强骨骼数据(EGPA),这是一种通过模拟临床相关运动错误生成合成骨骼数据的方法。与标准的增强技术不同,EGPA针对的是康复过程中观察到的生物机械错误。与基于关注的图形相联网络相结合,EGPA改善了多个评价指标的绩效。实验显示,绝对误差减少高达27.6%,误差分类精确度提高45.8%。关注可视化显示,模型学会侧重于具有临床意义的联合和移动阶段,提高了准确性和可解释性。EGPA为改进临床和基于家的康复背景下的自动流动质量评估提供了有希望的方法。


Article 186

Title@2025-06-11 (3): One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Title: One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image Ein Bild ist alles, was es braucht: Vergiftung von visuellen Dokumenten Retrieval Augmented Generation mit einem einzigen Bild 一个象数就够了: 以单一图像显示的毒害视觉文件检索增强的一代人。 2504.02132v2

Authors (6): Ezzeldin Shereen, Dan Ristea, Shae McFadden, Burak Hasircioglu, Vasilios Mavroudis, Chris Hicks

Multi-modal retrieval augmented generation (M-RAG) is instrumental for inhibiting hallucinations in large multi-modal models (LMMs) through the use of a factual knowledge base (KB). However, M-RAG introduces new attack vectors for adversaries that aim to disrupt the system by injecting malicious entries into the KB. In this paper, we present the first poisoning attack against M-RAG targeting visual document retrieval applications where the KB contains images of document pages. We propose two attacks, each of which requires injecting only a single adversarial image into the KB. Firstly, we propose a universal attack that, for any potential user query, influences the response to cause a denial-of-service (DoS) in the M-RAG system. Secondly, we present a targeted attack against one or a group of user queries, with the goal of spreading targeted misinformation. For both attacks, we use a multi-objective gradient-based adversarial approach to craft the injected image while optimizing for both retrieval and generation. We evaluate our attacks against several visual document retrieval datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (LMMs), demonstrating the attack effectiveness in both the universal and targeted settings. We additionally present results including commonly used defenses, various attack hyper-parameter settings, ablations, and attack transferability.

多式检索增强生成(M-RAG)有助于通过使用事实知识库(KB)来抑制大型多模式模型中的幻觉。然而,M-RAG为对手引入了新的攻击矢量,目的是通过向KB注入恶意条目破坏系统。在本文中,我们介绍针对视觉文件检索应用程序的M-RAG首次中毒袭击,KB含有文件页面图像的视觉文件检索应用程序。我们提议了两次袭击,每次袭击只需要向KB注入一个单一的对立图像。首先,我们提议了一次普遍袭击,为任何潜在的用户查询而影响对M-RAG系统中造成拒绝服务(DoS)的反应。第二,我们针对一个或一组用户的询问进行有针对性的袭击,目的是传播有针对性的错误信息。对于两次袭击,我们使用多目标梯度对抗性对抗方法来制作注射图像,同时优化检索和生成。我们评估了我们针对几个视觉文件检索数据集的袭击,一组不同的州级检索者,一组在M-RAG系统中造成拒绝提供服务。第二,我们提出了针对一个或一组用户询问,目的是传播当前攻击性攻击环境,包括各种攻击性攻击性攻击性、使用的额外模型。


Article 187

Title@2025-06-11 (3): EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection

Title: EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection EmoNet-Voice: Ein feinkörniger, sachverständiger Benchmark für Sprachemotionserkennung EmoNet-Voice:语音情感检测精密、经专家核实的专家验证基准 2506.09827v1

Authors (9): Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer

The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensity. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape yield valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.

文本到语音和音频生成模型的进步要求为评价AI系统的情感理解能力制定强有力的基准。当前的语音感知(SER)数据集往往在情感颗粒、隐私关切或依赖行为描述方面表现出局限性。本文介绍了EmoNet-Voice,这是语音情绪检测的新资源,包括EmoNet-Voice Big,一个大型的预培训前数据集(在11种声音、40种情感和4种语言上超过4 500小时的演讲)和EmoNet-Voice Pench,这是一个带有人类专家说明的新的基准数据集。EmoNet-Voice旨在评估SER模型模型在40种精细的情感类别和不同程度的强度上往往存在局限性。我们精心打造了最先进的语音生成,我们精心打造了合成的音频片片片模拟,描绘了旨在激发特定情感的场景景。我们进行了严格的鉴定,这些心理学专家指派了更清晰的强度标签。这种合成的、隐私保护方法使得敏感情感状态的模型能够被包含在目前高层次的情绪分析中。最后,我们引入了一种高层次的语音感官压评估。我们所了解的高度的情绪评估。


Article 188

Title@2025-06-11 (3): Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation

Title: Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation Hervorhebung strukturierter Informationen: Integration abstrakter Bedeutungsdarstellungen in LLMs für eine verbesserte Bewertung des Open-Domain-Dialogs 强调结构化信息:将抽象含义代表纳入增强开放式对话评价LMLM 2404.01129v4

Authors (5): Bohao Yang, Kun Zhao, Dong Liu, Liang Zhan, Chenghua Lin

Automatic open-domain dialogue evaluation has attracted increasing attention, yet remains challenging due to the complexity of assessing response appropriateness. Traditional evaluation metrics, typically trained with true positive and randomly selected negative responses, tend to assign higher scores to responses that share greater content similarity with contexts. However, adversarial negative responses, despite possessing high lexical overlap with contexts, can be semantically incongruous. Consequently, existing metrics struggle to effectively evaluate such responses, resulting in low correlations with human judgments. While recent studies have demonstrated the effectiveness of Large Language Models (LLMs) for open-domain dialogue evaluation, they still face challenges in handling adversarial negative examples. We propose a novel evaluation framework that integrates Abstract Meaning Representation (AMR) enhanced domain-specific language models (SLMs) with LLMs. Our SLMs explicitly incorporate AMR graph information through a gating mechanism for enhanced semantic representation learning, while both SLM predictions and AMR knowledge are integrated into LLM prompts for robust evaluation. Extensive experiments on open-domain dialogue evaluation tasks demonstrate the superiority of our method compared to state-of-the-art baselines. Our comprehensive ablation studies reveal that AMR graph information contributes substantially more to performance improvements. Our framework achieves strong correlations with human judgments across multiple datasets, establishing a new benchmark for dialogue evaluation. Our code and data are publicly available.

自动开放域对话评价已引起越来越多的注意,但由于评估反应适当性的复杂性,仍然具有挑战性; 传统的评价指标,通常经过真正正面和随机选择的负面反应培训,往往会给内容与背景相似程度较高的答复分得较多的分数; 然而,对抗性负面反应,尽管与背景有较高的词汇重叠,但从字面上看,可能是不相容的; 因此,现有的衡量标准很难有效地评价这种反应,导致与人类判断的关联性较低; 虽然最近的研究表明大语言模型(LLLMs)对开放域对话评价的有效性,但它们在处理对抗性负面例子方面仍然面临挑战; 我们提议了一个新颖的评价框架,将摘要表示(AMR)强化了特定域语言模式(LMs)与 LLMs相结合。 我们的可持续土地管理明确纳入了AM图表信息,通过强化语义代表学习机制,同时将可持续土地管理预测和AMR知识纳入LM的快速评价提示中。 关于开放式对话评价任务的广泛实验表明,我们的方法优于状态基线。 我们现有的数据对比性研究显示,我们现有的数据对比性更强的模型显示,我们的数据比重的模型显示我们的数据比重。


Article 189

Title@2025-06-11 (3): Unveiling the Hidden: Movie Genre and User Bias in Spoiler Detection

Title: Unveiling the Hidden: Movie Genre and User Bias in Spoiler Detection Enthüllung des Verborgenen: Filmgenre und User Bias in Spoiler Detection 解开隐藏:电影流派和在垃圾检测中的用户比亚斯 2504.17834v3

Authors (7): Haokai Zhang, Shengtao Zhang, Zijian Cai, Heng Wang, Ruixuan Zhu, Zinan Zeng, Minnan Luo

Spoilers in movie reviews are important on platforms like IMDb and Rotten Tomatoes, offering benefits and drawbacks. They can guide some viewers’ choices but also affect those who prefer no plot details in advance, making effective spoiler detection essential. Existing spoiler detection methods mainly analyze review text, often overlooking the impact of movie genres and user bias, limiting their effectiveness. To address this, we analyze movie review data, finding genre-specific variations in spoiler rates and identifying that certain users are more likely to post spoilers. Based on these findings, we introduce a new spoiler detection framework called GUSD (Genre-aware and User-specific Spoiler Detection), which incorporates genre-specific data and user behavior bias. The code is available at https://github.com/AI-explorer-123/GUSD. User bias is calculated through dynamic graph modeling of review history. Additionally, the R2GFormer module combines RetGAT (Retentive Graph Attention Network) for graph information and GenreFormer for genre-specific aggregation. The GMoE (Genre-Aware Mixture of Experts) model further assigns reviews to specialized experts based on genre. Extensive testing on benchmark datasets shows that GUSD achieves state-of-the-art results. This approach advances spoiler detection by addressing genre and user-specific patterns, enhancing user experience on movie review platforms.

在IMDb 和 Rotten Tomatoes 等平台上,电影审查中的破坏器很重要,提供了好处和退步。它们可以指导某些观众的选择,也可以影响那些更愿意事先不提供情节细节的人,因此有效探测破坏者至关重要。现有的破坏器探测方法主要分析审查文本,往往忽视电影流流体和用户偏差的影响,限制其效力。为了解决这个问题,我们分析电影审查数据,发现变异变异变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变变,并查明某些用户更有可能邮寄破坏者。根据这些发现,我们引入了新的破坏器探测框架,称为GUS(代码见https://github.com/AI-explorer-123/GUSD) (Genreaster-Gentreal Referal Reformal Informessional Instital) 和GGE-Creal-Creal-Creal Exciental Excial Excial 专家在GEDA 上对GEA 进行更深入的升级测试。Greal 测试,Gres 测试显示Gres 测试。Gres 测试。根据Gres


Article 190

Title@2025-06-11 (3): Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards

Title: Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards Writing-Zero: Brücke zwischen nicht überprüfbaren Aufgaben und überprüfbaren Belohnungen 零书写:弥合不可核查任务与可核实的奖励之间的差距 2506.00103v2

Authors (8): Ruipeng Jia, Yunyi Yang, Yongbo Gai, Kai Luo, Shihao Huang, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang

Reinforcement learning with verifiable rewards (RLVR) has enabled large language models (LLMs) to achieve remarkable breakthroughs in reasoning tasks with objective ground-truth answers, such as mathematics and code generation. However, a significant gap remains for non-verifiable tasks, like creative writing and open-ended dialogue, where quality assessment is inherently subjective and lacks definitive references. Existing approaches for these domains often rely on scalar reward models trained with human preferences, which suffer from limited generalization and are prone to reward hacking, such as over-explanation and length bias. In this work, we propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. The pairwise writing GenRM leverages self-principled critique to transform subjective assessments into reliable, verifiable rewards, while BRPO enables dynamic, reference-free pairwise comparison by leveraging a bootstrapped response as temporary reference from within group rollouts during RL training. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning, as demonstrated by Writing-Zero, which shows consistent improvement and strong resistance to reward hacking compared to scalar reward baselines. Furthermore, our method achieves competitive results on both in-house and open-source writing benchmarks. Our findings suggest the potential to unify rule-based, reference-based, and reference-free reward modeling under the RLVR framework, thus paving the way for a comprehensive and scalable RL training paradigm applicable across all language tasks.

以可核查的奖励(RLVR)加强学习使大型语言模型(LLMs)在数学和代码生成等客观地面真实性答案的推理任务方面取得显著突破,然而,对于非可核查的任务,如创造性的写作和开放性对话,质量评估本质上是主观的,缺乏明确的参考依据。这些领域的现有方法往往依赖经过人类偏好培训的大规模奖赏模式,这些模式具有有限的概括性,容易奖赏黑客,如过度解释和长度偏差。在这项工作中,我们建议采用统一的RLVR培训模式,以弥补不可核实的全面任务和可核查的奖赏之间的差距。我们采用了基于书面原则的配对奖赏模式(GenRMM)和新颖的“加强政策相对优化”的算法,GenRRMR的书写方法自律运用了将主观评估模式转变为可靠、可核实的奖赏性奖赏,而BROPO则则通过不使用基于团体内部的简洁性参考来进行动态的参照,在RLRBS中,从而显示我们不断的递校订的奖赏方法,从而展示了我们不断的校正的奖赏,从而展示了我们不断的奖赏的奖赏方法。


Article 191

Title@2025-06-11 (3): Steps are all you need: Rethinking STEM Education with Prompt Engineering

Title: Steps are all you need: Rethinking STEM Education with Prompt Engineering Schritte sind alles, was Sie brauchen: STEM-Bildung mit Prompt Engineering neu denken 重新思考STEM教育与即时工程 2412.05023v3

Authors (7): Krishnasai Addala, Kabir Dev Paul Baghel, Navya Gupta, Rishitej Reddy Vyalla, Chhavi Kirtani, Avinash Anand, Rajiv Ratn Shah

Few-shot and Chain-of-Thought prompting have shown promise when applied to Physics Question Answering tasks, but are limited by the lack of mathematical ability inherent to LLMs, and are prone to hallucination. By utilizing a Mixture of Experts (MoE) model, along with analogical prompting, we are able to show improved model performance when compared to the baseline on standard LLMs. We also survey the limits of these prompting techniques and the effects they have on model performance. Additionally, we propose Analogical CoT prompting, a prompting technique designed to allow smaller, open-source models to leverage analogical prompting, something they have struggled with, possibly due to a lack of specialist training data.

在应用物理问题解答任务时,很少的试射和试探链显示有希望,但因LLMS本身缺乏数学能力而受到限制,容易产生幻觉。 通过使用专家混合模型(MOE)以及模拟提示,我们能够显示与标准LLMS基准相比,模型性能有所改善。我们还调查了这些快速技术的局限性及其对模型性能的影响。此外,我们提议了模拟 CoT催化技术,这是一种快速技术,旨在让小型开放源码模型利用分析性催化(Anacial comT),这可能是由于缺乏专业培训数据,这些模型的难度很大。


Article 192

Title@2025-06-11 (3): Using Shapley interactions to understand how models use structure

Title: Using Shapley interactions to understand how models use structure Shapley-Interaktionen nutzen, um zu verstehen, wie Modelle Struktur verwenden 利用模拟互动来了解模型如何使用结构 2403.13106v2

Authors (6): Divyansh Singhvi, Diganta Misra, Andrej Erkelens, Raghav Jain, Isabel Papadimitriou, Naomi Saphra

Language is an intricately structured system, and a key goal of NLP interpretability is to provide methodological insights for understanding how language models represent this structure internally. In this paper, we use Shapley Taylor interaction indices (STII) to examine how language and speech models internally relate and structure their inputs. Pairwise Shapley interactions measure how much two inputs work together to influence model outputs beyond what we would obtain by linearly adding their independent influences, providing a view into how models encode structural interactions between inputs. We relate the interaction patterns in models to three underlying linguistic structures: syntactic structure, non-compositional semantics, and phonetic coarticulation. We find that autoregressive text models encode interactions that correlate with the syntactic proximity of inputs, and that both autoregressive and masked models encode nonlinear interactions in idiomatic phrases with non-compositional semantics. Our speech results show that inputs are more entangled for pairs where a neighboring consonant is likely to influence a vowel or approximant, showing that models encode the phonetic interaction needed for extracting discrete phonemic representations.

语言是一个结构复杂的系统, 而 NLP 解释性的一个关键目标是提供方法见解, 以了解语言模型如何在内部代表这一结构。 在本文中, 我们使用 Shapley Taylor 互动指数( STII ) 来检查语言和语言模型的内部关联和输入结构。 Pairwith Shalty 互动度测量了我们用线性添加其独立影响, 从而了解模型如何将输入之间的结构互动整合成结构。 我们把模型中的互动模式与三种基本语言结构: 合成结构、 非对称语和语音对称结构。 我们发现, 自动递增文本模型将相互作用与输入的同步近距离相关, 以及自动递增和遮掩的模型将非直线性词句中的非直线性互动与非对立语语义的连接在一起。 我们的演讲结果显示, 在对配对中, 相邻配方的互连带可能影响元音或对称式的对称调, 显示用于提取离离式的电话代表的模型。
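
The pairwise interaction described in the abstract above can be made concrete with a small exact computation. The sketch below enumerates all subsets, so it only works for a handful of inputs, and it follows the usual definition of the order-2 Shapley-Taylor index: the discrete second difference f(T∪{i,j}) − f(T∪{i}) − f(T∪{j}) + f(T), averaged over subsets T with weight 1/C(n−1, |T|) and scaled by 2/n. The toy value functions are assumptions for illustration and are unrelated to the paper's models.

```python
from itertools import combinations
from math import comb

def pairwise_delta(f, i, j, T):
    """Joint effect of i and j on top of subset T, beyond their separate effects."""
    T = frozenset(T)
    return f(T | {i, j}) - f(T | {i}) - f(T | {j}) + f(T)

def shapley_taylor_pair(f, n, i, j):
    """Order-2 Shapley-Taylor interaction index for inputs i and j (exact enumeration)."""
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for T in combinations(others, size):
            total += pairwise_delta(f, i, j, T) / comb(n - 1, size)
    return 2.0 / n * total

# Toy value functions over subsets of three inputs {0, 1, 2}.
def additive(S):
    return float(len(S))  # each input contributes independently

def conjunction(S):
    return 1.0 if {0, 1} <= S else 0.0  # output needs inputs 0 and 1 together

no_interaction = shapley_taylor_pair(additive, 3, 0, 1)
full_interaction = shapley_taylor_pair(conjunction, 3, 0, 1)
```

For the additive function the index is zero, matching the "linearly added independent influences" baseline; for the conjunction, the whole output is attributed to the pair.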


Article 193

Title@2025-06-11 (3): Regularizing Learnable Feature Extraction for Automatic Speech Recognition

Title: Regularizing Learnable Feature Extraction for Automatic Speech Recognition Regularisieren von Lern-Feature-Extraktion für automatische Spracherkennung 将自动语音识别的可读功能抽取器正规化 2506.09804v1

Authors (5): Peter Vieting, Maximilian Kannen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney

Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems since they can be directly trained to fit the acoustic model. However, their performance often falls short compared to classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regularization methods for training ASR models with learnable feature extraction front-ends. First, we examine audio perturbation methods and show that larger relative improvements can be obtained for learnable features. Additionally, we identify two limitations in the standard use of SpecAugment for these front-ends and propose masking in the short time Fourier transform (STFT)-domain as a simple but effective modification to address these challenges. Finally, integrating both regularization approaches effectively closes the performance gap between traditional and learnable features.

神经前端是用于自动语音识别(ASR)系统的传统固定地物提取管道的有吸引力的替代物,因为它们可以直接接受适合声学模型的培训,但是,与古典方法相比,它们的性能往往不尽人意,我们表明,这在很大程度上是由于它们更容易被过度安装。因此,这项工作调查了用可学习的特征提取前端对ASR模型进行培训的正规化方法。首先,我们研究了音频扰动方法,并表明可以学习的特征可以取得较大的相对改进。此外,我们确定了这些前端标准使用SpecAugment的两种限制,并建议在短短的时间内将Fourier变换(STFT)-Dormain(STFT)作为应对这些挑战的一个简单而有效的修改。最后,将两种正规化方法结合起来,有效地缩小了传统和可学习特征之间的性能差距。
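
As a rough illustration of masking in the STFT domain, the sketch below zeroes one random span of frames and one random band of frequency bins in a magnitude spectrogram before it would reach a learnable front-end. The number of masks, the width limits, the zero fill value, and the function name `mask_stft` are all assumptions made for this example; the abstract does not specify them.

```python
import random

def mask_stft(spec, max_time_width, max_freq_width, rng):
    """Return a copy of spec with one time span and one frequency band set to 0.

    spec is a list of frames, each a list of per-bin STFT magnitudes.
    """
    n_frames, n_bins = len(spec), len(spec[0])
    t_width = rng.randint(0, min(max_time_width, n_frames))
    f_width = rng.randint(0, min(max_freq_width, n_bins))
    t0 = rng.randint(0, n_frames - t_width)
    f0 = rng.randint(0, n_bins - f_width)
    masked = [row[:] for row in spec]  # copy so the input is left unchanged
    for t in range(n_frames):
        for f in range(n_bins):
            in_time_mask = t0 <= t < t0 + t_width
            in_freq_mask = f0 <= f < f0 + f_width
            if in_time_mask or in_freq_mask:
                masked[t][f] = 0.0
    return masked

# Dummy 10-frame, 8-bin spectrogram of ones.
spec = [[1.0] * 8 for _ in range(10)]
masked = mask_stft(spec, max_time_width=3, max_freq_width=2, rng=random.Random(0))
```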


Article 194

Title@2025-06-11 (3): Knowledge Graphs are all you need: Leveraging KGs in Physics Question Answering

Title: Knowledge Graphs are all you need: Leveraging KGs in Physics Question Answering Wissen Graphen sind alles, was Sie brauchen: Leveraging KGs in Physik Fragen beantworten 知识图形是您所需要的全部: 在物理问答中使用 KGs 工具 2412.05453v3

Authors (8): Krishnasai Addala, Kabir Dev Paul Baghel, Dhruv Jain, Navya Gupta, Rishitej Reddy Vyalla, Chhavi Kirtani, Avinash Anand, Rajiv Ratn Shah

This study explores the effectiveness of using knowledge graphs generated by large language models to decompose high school-level physics questions into sub-questions. We introduce a pipeline aimed at enhancing model response quality for Question Answering tasks. We employ LLMs to construct knowledge graphs that capture the internal logic of the questions; these graphs then guide the generation of sub-questions. We hypothesize that this method yields sub-questions that are more logically consistent with the original questions compared to traditional decomposition techniques. Our results show that sub-questions derived from knowledge graphs exhibit significantly improved fidelity to the original question’s logic. This approach not only enhances the learning experience by providing clearer and more contextually appropriate sub-questions but also highlights the potential of LLMs to transform educational methodologies. The findings indicate a promising direction for applying AI to improve the quality and effectiveness of educational content.

这项研究探索了使用大型语言模型产生的知识图将高中级物理学问题分解成子问题的有效性。我们引入了一种旨在增强问题回答任务的示范反应质量的管道。通过使用LLMS构建反映问题内部逻辑的知识图,这些图随后指导了子问题的产生。我们假设这种方法产生的子问题与传统的分解技术相比在逻辑上更符合原始问题。我们的结果显示,从知识图中得出的子问题表明,对原始问题的逻辑的忠诚性大大提高。这一方法不仅通过提供更明确、更符合背景的子问题来增进学习经验,而且还突出了LLMS改变教育方法的潜力。调查结果表明,应用AI来提高教育内容的质量和有效性有希望的方向。


Article 195

Title@2025-06-11 (3): LogProber: Disentangling confidence from contamination in LLM responses

Title: LogProber: Disentangling confidence from contamination in LLM responses LogProber: Entwirren des Vertrauens in LLM-Antworten 日志Prober:解除对LLM反应中污染的信心 2408.14352v2

Authors (3): Nicolas Yax, Pierre-Yves Oudeyer, Stefano Palminteri

In machine learning, contamination refers to situations where test data leaks into the training set. The issue is particularly relevant for the evaluation of the performance of Large Language Models (LLMs), which are generally trained on gargantuan, and generally opaque, corpora of text scraped from the world wide web. Developing tools to detect contamination is therefore crucial to be able to fairly and properly track the evolution of the performance of LLMs. To date, only a few recent studies have attempted to address the issue of quantifying and detecting contamination in short text sequences, such as those commonly found in benchmarks. However, these methods have limitations that can sometimes render them impractical. In the present paper, we introduce LogProber, a novel, efficient algorithm that we show can detect contamination in a black-box setting, and that tries to tackle some of these drawbacks by focusing on familiarity with the question rather than the answer. Here, we explore the properties of the proposed method in comparison with concurrent approaches, identify its advantages and limitations, and illustrate how different forms of contamination can go undetected depending on the design of the detection algorithm.

在机器学习中,污染是指测试数据渗入培训组的情况,这个问题对于评估大语言模型(LLMS)的性能特别相关,这些模型通常经过关于gargantuan的培训,通常不透明,是从万维网上剪掉的文字体格。因此,开发探测污染的工具对于能够公平和适当地跟踪LLMS性能的演变至关重要。到目前为止,只有最近的一些研究试图用短文字序列,例如基准中常见的文字序列,处理量化和探测污染的问题。然而,这些方法有一定的局限性,有时会使它们不切实际。 在本文件中,我们采用了一种新颖的高效算法,我们显示能够通过侧重于熟悉问题而不是答案,在黑盒子中探测污染,从而设法解决其中的一些缺陷。在这里,我们探索了拟议方法与同时方法的特性,查明其优点和局限性,并说明不同污染形式如何不被发现,取决于检测算法的设计。


Article 196

Title@2025-06-11 (3): Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?

Title: Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? Geben LLMs psychometrisch plausible Antworten in pädagogischen Bewertungen? LLMs在教育评估中是否给予心理可视反应? 2506.09796v1

Authors (3): Andreas Säuberli, Diego Frassinelli, Barbara Plank

Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.

如果大型语言模型(LLMs)对测试项目表现出像人一样的反应行为,那么就有可能利用它们作为试点参与者来加速测试开发。在本文中,我们评估了18个受指导的LLMs与两种公开的多种选择测试项目数据集在三个科目(阅读、美国历史和经济学)中所作的反应的人类相似性或心理测定合理性。我们的方法基于两个理论框架,这些理论框架通常用于教育评估、古典测试理论和项目反应理论中。结果显示,虽然较大的模型过于自信,但其反应分布在与温度缩放校准时可能更像人类。此外,我们发现LLMMs在阅读理解项目时与其他科目相比,与人类的关系往往更好。然而,总体而言,LMs并不十分牢固,表明在零镜头中不应用于试点教育评估。
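
Temperature scaling, mentioned in the abstract above as a way to make overconfident response distributions more human-like, divides the logits by a temperature before the softmax. The sketch below shows the mechanism on made-up logits for a four-option multiple-choice item; how the paper chooses the temperature is not stated in the abstract, and the values here are assumptions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for answer options A to D.
logits = [4.0, 1.0, 0.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
```

Scaling changes how confident the distribution is without changing which option ranks first, so it affects response proportions rather than the most likely answer.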


Article 197

Title@2025-06-11 (3): ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

Title: ComfyUI-R1: Exploring Reasoning Models for Workflow Generation ComfyUI-R1: Erforschung von Konzeptmodellen für die Workflow-Generierung ComfyUI-R1:探索产生工作流程的理由模型 2506.09790v1

Authors (8): Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang

AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.

AI 生成的内容已经从单一模式演变为模块化工作流程,特别是在ComfyUI等平台上,使创造性管道具有定制化功能。然而,制定有效的工作流程需要大量专门知识来协调许多专门组成部分,为用户提供一个陡峭的学习曲线。为了应对这一挑战,我们引入了第一个大型自动工作流程生成推理模型ComfyUI-R1,这是第一个自动化工作流程生成的大型推理模型。从我们整理的4K工作流程数据集开始,我们构建了长期的思维链推理数据,包括节点选择、工作流程规划和代码级工作流程代表。ComfyU-R1通过一个两阶段框架进行培训:(1) COT为寒点启动进行微调,将模型调整到ComfyUI 域;(2) 强化激励推理能力学习,以精细的定标定混合奖项为指导,确保格式的有效性、结构完整性和不偏差水平。 实验显示,我们的7B 参数模型取得了97格式有效性率,以及高通过高通度、平级和图表级F1分级的F1,通过前两个阶段的CLIL的流程流程推算,大大超越了我们之前的精确流程流程的流程的流程,从而利用了G的升级的精确流程的升级的流程的流程的流程,从而进一步采用了了G的升级的升级的升级的升级的升级的升级的流程。


Article 198

Title@2025-06-11 (3): MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning

Title: MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning MindForge: Empowering Embodyd Agents mit Theorie des Geistes für lebenslanges kulturelles Lernen Mindforge:赋予有终身文化学习思想理论的内装代理人权力 2411.12977v4

Authors (4): Mircea Lică, Ojas Shirekar, Baptiste Colle, Chirag Raman

Embodied agents powered by large language models (LLMs), such as Voyager, promise open-ended competence in worlds such as Minecraft. However, when powered by open-weight LLMs they still falter on elementary tasks after domain-specific fine-tuning. We propose MindForge, a generative-agent framework for cultural lifelong learning through explicit perspective taking. We introduce three key innovations: (1) a structured theory of mind representation linking percepts, beliefs, desires, and actions; (2) natural inter-agent communication; and (3) a multi-component memory system. Following the cultural learning framework, we test MindForge in both instructive and collaborative settings within Minecraft. In an instructive setting with GPT-4, MindForge agents powered by open-weight LLMs significantly outperform their Voyager counterparts in basic tasks, yielding 3× more tech-tree milestones and collecting 2.3× more unique items than the Voyager baseline. Furthermore, in fully collaborative settings, we find that the performance of two underachieving agents improves with more communication rounds, echoing the Condorcet Jury Theorem. MindForge agents demonstrate sophisticated behaviors, including expert-novice knowledge transfer, collaborative problem solving, and adaptation to out-of-distribution tasks through accumulated cultural experiences.

由大型语言模型(LLM)驱动的具身智能体(如Voyager)有望在Minecraft这样的世界中具备开放式能力。然而,当由开放权重LLM驱动时,即使经过特定领域的微调,它们仍会在基本任务上失败。我们提出MindForge,这是一个通过明确的视角采择实现文化终身学习的生成式智能体框架。我们引入了三项关键创新:(1) 一种结构化的心智理论表示,将感知、信念、愿望和行动联系起来;(2) 自然的智能体间交流;(3) 一个多组件记忆系统。遵循文化学习框架,我们在Minecraft中的指导性和协作性两种环境下测试MindForge。在与GPT-4配合的指导性环境中,由开放权重LLM驱动的MindForge智能体在基本任务上显著优于对应的Voyager智能体,与Voyager基线相比,取得的科技树里程碑多3倍,收集的独特物品多2.3倍。此外,在完全协作的环境中,我们发现两个表现不佳的智能体的性能会随着交流轮数的增加而提高,这与孔多塞陪审团定理相呼应。MindForge智能体展现出复杂的行为,包括专家到新手的知识传递、协作式问题解决,以及通过积累的文化经验适应分布外任务。


Article 199

Title@2025-06-11 (3): Guidelines for Fine-grained Sentence-level Arabic Readability Annotation

Title: Guidelines for Fine-grained Sentence-level Arabic Readability Annotation Richtlinien für feinkörnige Sätze Arabische Lesbarkeit Annotation 阿拉伯文可读性说明 2410.08674v3

Authors (5): Nizar Habash, Hanada Taha-Thomure, Khalid N. Elmadani, Zeina Zeino, Abdallah Abushmaes

This paper presents the annotation guidelines of the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale resource for fine-grained sentence-level readability assessment in Arabic. BAREC includes 69,441 sentences (1M+ words) labeled across 19 levels, from kindergarten to postgraduate. Based on the Taha/Arabi21 framework, the guidelines were refined through iterative training with native Arabic-speaking educators. We highlight key linguistic, pedagogical, and cognitive factors in determining readability and report high inter-annotator agreement: Quadratic Weighted Kappa 81.8% (substantial/excellent agreement) in the last annotation phase. We also benchmark automatic readability models across multiple classification granularities (19-, 7-, 5-, and 3-level). The corpus and guidelines are publicly available.

本文介绍了平衡的阿拉伯语可读性评价公司的说明准则,这是阿拉伯语微量量刑可读性评估的大规模资源。BAREC包括69,441个判决(1M+字),标记在幼儿园到研究生的19个层次上。根据Taha/Arabi21框架,通过与讲阿拉伯语的本地教育者进行迭接培训,对准则进行了完善。我们强调了确定可读性的关键语言、教学和认知因素,并报告了高跨咨询者协议:最后注解阶段的Quadratic Weighted Kappa 81.8%(实质性/极优异性协议),我们还将自动可读性模式基准适用于多种分类粮(19、7、5和3级),并公开提供材料和准则。
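The agreement figure reported for BAREC is a Quadratic Weighted Kappa. As a rough illustration of how that statistic is computed, here is a minimal pure-Python sketch. The function name, the 0-based integer encoding of the 19 levels, and the toy labels are assumptions made for illustration; this is not the authors' evaluation code.

```python
def quadratic_weighted_kappa(labels_a, labels_b, num_levels):
    """Quadratic Weighted Kappa between two annotators.

    Labels are assumed to be integers in [0, num_levels).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    if num_levels < 2:
        raise ValueError("need at least two levels")

    n = len(labels_a)
    k = num_levels

    # Confusion matrix: observed[i][j] counts items labeled i by A and j by B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        observed[a][b] += 1

    # Marginal label counts for each annotator.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: grows with the squared distance between levels.
            weight = (i - j) ** 2 / (k - 1) ** 2
            disagreement += weight * observed[i][j]
            expected_disagreement += weight * hist_a[i] * hist_b[j] / n

    if expected_disagreement == 0:
        # Both annotators used one identical level throughout; kappa is
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return 1.0 - disagreement / expected_disagreement


# Toy example with hypothetical labels on a 19-level scale.
annotator_1 = [0, 5, 10, 18, 18]
annotator_2 = [0, 6, 10, 17, 18]
print(quadratic_weighted_kappa(annotator_1, annotator_2, num_levels=19))
```

Because the penalty is quadratic in the distance between levels, near-misses on a 19-level scale cost far less than distant disagreements, which is one reason this statistic suits ordinal readability labels.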


Article 200

Title@2025-06-11 (3): Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements

Title: Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements Feintuning große Audio-Sprachen-Modelle mit LoRA für die präzise zeitliche Lokalisierung von langanhaltenden Expositionstherapieelementen 与LORA一道精细设计大型音频语言模型,用于长期接触治疗元素的精确时间定位 2506.09707v1

Authors (7): Suhas BN, Andrew M. Sherrill, Jyoti Alaparthi, Dominik Mattioli, Rosa I. Arriaga, Chris W. Wiese, Saeed Abdullah

Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements – identifying their start and stop times – directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases – therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3) – are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 313 real PE sessions, our best configuration (LoRA rank 8, 30s windows) achieves a mean absolute error (MAE) of 5.3 seconds across tasks. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.

延长暴露(PE)疗法是创伤后应激障碍(PTSD)的有效治疗方法,但由于需要人工审查治疗录音,评估治疗师的保真度仍然十分耗费人力。我们提出了一种方法,可直接从治疗音频和转录文本中对关键PE保真度要素进行自动时间定位,即确定其开始和结束时间。我们的方法使用低秩适应(LoRA)微调一个大型预训练音频语言模型Qwen2-Audio,以处理聚焦的30秒音频-转录输入窗口。三个核心方案阶段,即治疗师引导(P1)、想象暴露(P2)和想象后处理(P3),其保真度标签通过基于LLM的提示生成,并由受过训练的评分员核实。模型经过训练,在特定任务提示引导的软监督下预测归一化的边界偏移量。在包含313次真实PE治疗的数据集上,我们的最佳配置(LoRA秩为8,30秒窗口)在各任务上取得了5.3秒的平均绝对误差(MAE)。我们进一步分析了窗口大小和LoRA秩的影响,强调了上下文粒度和模型适应的重要性。这项工作为PE疗法中的保真度跟踪引入了一个可扩展的框架,有望支持临床医生的培训、督导和质量保证。


Article 201

Title@2025-06-11 (3): LLM2TEA: Agentic AI Designer Finds Innovative Objects with Generative Evolutionary Multitasking

Title: LLM2TEA: Agentic AI Designer Finds Innovative Objects with Generative Evolutionary Multitasking LLM2TEA: Agentischer AI-Designer findet innovative Objekte mit generativem evolutionären Multitasking LLM2TEA: ARI 设计师发现具有引演进进化多任务的创新对象 2406.14917v2

Authors (5): Melvin Wong, Jiao Liu, Thiago Rios, Stefan Menzel, Yew Soon Ong

In this paper, we introduce the LLM-driven MultiTask Evolutionary Algorithm (LLM2TEA), the first agentic AI designer within a generative evolutionary multitasking (GEM) framework that promotes the crossover and synergy of designs from multiple domains, leading to innovative solutions that transcend individual disciplines. Of particular interest is the discovery of objects that are not only innovative but also conform to the physical specifications of the real world in science and engineering. LLM2TEA comprises a large language model to initialize a population of genotypes (defined by text prompts) describing the objects of interest, a text-to-3D generative model to produce phenotypes from these prompts, a classifier to interpret the semantic representations of the objects, and a physics simulation model to assess their physical properties. We propose several novel LLM-based multitask evolutionary operators to guide the search toward the discovery of high-performing practical objects. Experimental results in conceptual design optimization validate the effectiveness of LLM2TEA, revealing from 97\% to 174\% improvement in the diversity of innovative objects compared to the present text-to-3D generative model baseline. In addition, more than 73\% of the generated designs have better physical performance than the top 1\% of the designs generated in the baseline. Moreover, LLM2TEA generates designs that are not only aesthetically creative but also functional in real-world applications. Several of these designs have been successfully 3D-printed, emphasizing the proposed approach’s capacity to transform AI-generated outputs into tangible physical objects. The designs produced by LLM2TEA meet practical requirements while showcasing creative and innovative features, underscoring its potential applications in complex design optimization and discovery.

在本文中,我们引入了LLMM驱动的多 Task 进化解算法 Algorithm (LLM2TEA),这是第一个在基因进化多任务框架(GEM)内,在基因进化多任务框架内,促进多个领域设计的交叉和协同,从而产生超越各个学科的创新解决方案。特别令人感兴趣的是发现不仅创新而且符合现实世界在科学和工程领域物理规格的物体。LLLM2TEA包括一个大型语言模型,以初始化一组显示兴趣对象的基因型(由文字提示定义),一个文本到3D的基因化模型设计设计,一个文本到3D的基因化模型设计,一个解释这些物体的语义表达方式,以及物理学模拟模型评估其物理属性。我们建议了一些基于LMTEM的多种数学进化操作操作,以指导搜索高性能实用性物体的发现。 概念设计优化的唯一实验结果是LM2TEA的效益,从97++++174显示创新对象的多样化,但比当前文本到3D的物理进化设计要求更精确化应用类型。


Article 202

Title@2025-06-11 (3): Reasoning Language Models: A Blueprint

Title: Reasoning Language Models: A Blueprint Grundlegende Sprachmodelle: Ein Blueprint 理由语言模式:蓝图 2501.11223v4

Authors (19): Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Zixuan Chen, Hubert Niewiadomski, Torsten Hoefler

Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI’s o1 and o3, DeepSeek-R1, and Alibaba’s QwQ, have redefined AI’s problem-solving capabilities by extending LLMs with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures - uniquely combining reinforcement learning (RL), search heuristics, and LLMs - present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), supervision schemes (Outcome-Based and Process-Based Supervision), and other related concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent tools). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint’s versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we discuss scalable RLM cloud deployments and we outline how RLMs can integrate with a broader LLM ecosystem. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between “rich AI” and “poor AI” by lowering barriers to RLM design and experimentation.

推理语言模型(RLM),也称为大型推理模型(LRM),如OpenAI的o1和o3、DeepSeek-R1和阿里巴巴的QwQ,通过为LLM扩展先进的推理机制,重新定义了AI解决问题的能力。然而,它们的高成本、专有性质和复杂架构(独特地结合了强化学习(RL)、搜索启发式方法和LLM)带来了可及性和可扩展性方面的挑战。为了解决这些问题,我们在对所有RLM工作进行调查和分析的基础上,提出了一个全面的蓝图,将RLM组件组织成模块化框架。该蓝图包含多种推理结构(链、树、图和嵌套形式)、推理策略(例如蒙特卡洛树搜索、束搜索)、RL概念(策略、价值模型等)、监督方案(基于结果的监督和基于过程的监督)以及其他相关概念(例如测试时计算、检索增强生成、智能体工具)。我们还提供了详细的数学公式和算法规范,以简化RLM的实现。通过展示LLaMA-Berry、QwQ、Journey Learning和Graph of Thoughts等方案如何作为特例纳入其中,我们证明了该蓝图的通用性和统一潜力。为了说明其实用性,我们引入了x1,一个用于RLM快速原型设计和实验的模块化实现。借助x1和文献综述,我们给出了一些关键见解,例如对策略模型和价值模型进行多阶段训练,以及熟悉的训练分布的重要性。最后,我们讨论了可扩展的RLM云部署,并概述了RLM如何与更广泛的LLM生态系统集成。我们的工作揭开了RLM构建的神秘面纱,使先进的推理能力大众化,并促进创新,旨在通过降低RLM设计和实验的门槛来缩小“富AI”与“穷AI”之间的差距。


Article 203

Title@2025-06-11 (3): Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment

Title: Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment Jenseits von Bradley-Terry-Modellen: Ein allgemeines Präferenzmodell für Sprachmodellausrichtung 超越布拉德利-泰瑞模式:语文模式统一通用优先模式 2410.02197v3

Authors (5): Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, Quanquan Gu

Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. In this paper, we introduce preference embedding, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback (RLHF). Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0, following the language model post-training with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.

对人类偏好进行建模对于使基础模型与人类价值观保持一致至关重要。传统的奖励建模方法,如Bradley-Terry(BT)奖励模型,在表达能力方面存在不足,特别是在处理非传递性偏好方面。在本文中,我们引入了偏好嵌入,这种方法将回复嵌入潜在空间,以高效地捕捉复杂的偏好结构,并实现线性查询复杂度。此外,我们提出了基于偏好分数的通用偏好优化(GPO),它推广了基于奖励的人类反馈强化学习(RLHF)。实验结果表明,我们的通用偏好嵌入模型(GPM)在RewardBench基准上始终优于BT奖励模型,并能有效地对循环偏好进行建模,而任何BT奖励模型在这种情况下的表现都近似于随机猜测。此外,在使用GPO和我们的通用偏好模型对语言模型进行后训练之后,对AlpacaEval2.0等下游任务的评估显示出相对于BT模型的性能提升。这些发现表明,我们的方法可能会增强基础模型与细微的人类价值观之间的一致性。代码可在 https://github.com/general-preference/general-preference-model 获取。


Article 204

Title@2025-06-11 (3): Retrofitting Large Language Models with Dynamic Tokenization

Title: Retrofitting Large Language Models with Dynamic Tokenization Nachrüstung großer Sprachmodelle mit dynamischer Tokenisierung 以动态招招化改造大语言模型 2411.18553v3

Authors (3): Darius Feher, Ivan Vulić, Benjamin Minixhofer

Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the static design and propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text via a subword-merging algorithm inspired by byte-pair encoding. We merge frequent subword sequences in a batch, then apply a pre-trained embedding-prediction hypernetwork to compute the token embeddings on-the-fly. For encoder-style models (e.g., XLM-R), this on average reduces token sequence lengths by >20% across 14 languages while degrading performance by less than 2%. The same method applied to pre-filling and scoring in decoder-style models (e.g., Mistral-7B) results in minimal performance degradation at up to 17% reduction in sequence length. Overall, we find that dynamic tokenization can mitigate the limitations of static tokenization by substantially improving inference speed and promoting fairness across languages, enabling more equitable and adaptable LMs.

当前语言模型(LMS) 使用固定、 静态的子字质表示器 。 这个默认选择通常导致降低效率和语言能力, 特别是英语以外的语言 。 为了解决这个问题, 我们质疑静态设计, 并提议用动态符号化来改造 LMS : 一种通过由byte-pair 编码启发的子字词合并算法, 动态地决定基于输入文本的象征性界限的方法 。 我们将常用的小字序列合并成批组, 然后应用预先训练的嵌入- 准备超网络来计算在现场的符号嵌入。 对于编码模式( 例如, XLM- R) 来说, 这在14种语言中平均减少超过20%的象征性序列长度, 同时降低不到2%的性能。 在解码模式中预先填充和评分数的同样方法( 如 Misttral-7B ) 导致最低性能降解到17 % 的序列长度。 总之, 我们发现动态符号化可以通过大幅度提高递增速度和促进语言之间的公平性, 使LMSMS( ) 能够更公平和更公平地调整。
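The batch-level merging step described in this abstract can be pictured with a small byte-pair-encoding-style loop. The sketch below is a simplified illustration only: the function names, the tie-breaking rule, and the stop-when-no-pair-repeats condition are assumptions, and the hypernetwork that predicts embeddings for the merged tokens is deliberately left out.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` (left to right) with one token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def merge_batch(batch, num_merges):
    """Greedily merge the most frequent adjacent subword pair across a batch."""
    batch = [list(seq) for seq in batch]
    for _ in range(num_merges):
        # Count adjacent pairs over every sequence in the batch.
        counts = Counter()
        for seq in batch:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        # Highest count wins; ties are broken alphabetically (an assumption).
        pair, freq = min(counts.items(), key=lambda item: (-item[1], item[0]))
        if freq < 2:
            break  # nothing repeats, so further merging would not shorten much
        batch = [merge_pair(seq, pair) for seq in batch]
    return batch


# Toy batch of already-tokenized subword sequences (hypothetical).
batch = [["l", "o", "w"], ["l", "o", "w", "e", "r"]]
print(merge_batch(batch, num_merges=2))
```

Because the merges are decided per batch, the resulting token boundaries depend on the input text rather than on a fixed vocabulary, which is the property the abstract contrasts with static tokenization.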


Article 205

Title@2025-06-11 (3): Adding simple structure at inference improves Vision-Language Compositionality

Title: Adding simple structure at inference improves Vision-Language Compositionality Einfache Struktur bei Inferenz hinzufügen verbessert die Vision-Sprachenkompositionalität 在推论中添加简单结构改进视野-语言的构成性 2506.09691v1

Authors (4): Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune

Dual encoder Vision-Language Models (VLM) such as CLIP are widely used for image-text retrieval tasks. However, those models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many different training approaches have been proposed to improve the vision-language compositionality capabilities of those models. In comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference, where, given an image and a caption: i) we divide the image into different smaller crops, ii) we extract text segments, capturing objects, attributes and relations, iii) using a VLM, we find the image crops that better align with text segments obtaining matches, and iv) we compute the final image-text similarity aggregating the individual similarities of the matches. Based on various popular dual encoder VLMs, we evaluate our approach in controlled and natural datasets for VL compositionality. We find that our approach consistently improves the performance of evaluated VLMs without any training, which shows the potential of inference-time techniques. The results are especially good for attribute-object binding as shown in the controlled dataset. As a result of an extensive analysis: i) we show that processing image crops is actually essential for the observed gains in performance, and ii) we identify specific areas to further improve inference-time approaches.

图像- 图像- 图像- 语言模型( VLM ) ( VLM ) , 例如 CLIP 等 , 被广泛用于图像- 文字检索任务 。 但是, 这些模型与成文性挣扎, 显示成文性, 显示类似于文字的行为, 限制其检索性能 。 提出了许多不同的培训方法, 以提高这些模型的成文性。 相比之下, 推论时间技术很少受到注意 。 在本文中, 我们建议添加简单的推论结构, 给图像和标题 : (一) 我们将图像分为不同的小作物 ; (二) 我们提取文本段, 捕捉对象、 属性和关系 ; (三) 使用成文成文性, 我们发现成像作物的成文, 更好地与获得匹配的成文部分保持一致 ; (四) 我们根据各种流行的双倍编码 VLMS 技术, 评估我们在受控和自然数据集中的方法 。 我们发现, 我们的方法不断改进 VLMS 的性功能, 没有任何训练, 我们发现 显示特定的成像 。 ( ) 显示我们所观察到的成像 。
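Steps iii) and iv) of the procedure in this abstract, matching text segments to image crops and aggregating the matched similarities, can be sketched as follows. The similarity matrix is assumed to come from a dual-encoder VLM such as CLIP and is supplied here as plain numbers. The function name and the use of a simple mean for aggregation are assumptions, since the abstract does not say how the individual similarities are combined.

```python
def structured_similarity(similarity):
    """Match each text segment to its best crop and aggregate the scores.

    similarity[s][c] is the (precomputed) VLM score between text segment s
    and image crop c.
    """
    if not similarity or any(not row for row in similarity):
        raise ValueError("need at least one segment and one crop")

    matches = []
    for segment_index, row in enumerate(similarity):
        # Pick the crop that aligns best with this segment.
        best_crop = max(range(len(row)), key=row.__getitem__)
        matches.append((segment_index, best_crop, row[best_crop]))

    # Aggregate the matched similarities; a mean is an assumption of this sketch.
    score = sum(match[2] for match in matches) / len(matches)
    return score, matches


# Hypothetical scores: 2 text segments (rows) against 2 image crops (columns).
toy_similarity = [
    [0.1, 0.9],
    [0.8, 0.2],
]
score, matches = structured_similarity(toy_similarity)
print(score, matches)
```

Scoring each segment against its own best crop, rather than scoring the whole caption against the whole image, is what lets the comparison attend to individual objects, attributes and relations.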


Article 206

Title@2025-06-11 (3): Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models

Title: Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models Inv-Entropy: Ein voll probabilistisches Rahmenwerk zur Unsicherheitsquantifizierung in Sprachmodellen Inv-Entropy: 语文模型不确定性量化的完全概率框架 2506.09684v1

Authors (5): Haoyi Song, Ruihan Ji, Naichen Shi, Fan Lai, Raed Al Kontar

Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a probabilistic foundation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input-output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/Uncertainty-Quantification-for-LLMs.

大型语言模型(LLMS)已经转变了自然语言处理过程,但它们的可靠部署需要有效的不确定性量化(UQ) 。现有的 UQ 方法往往是超常的,缺乏概率基础。本文件首先为在UQ 中LLM 中干扰作用提供了理论依据。 我们随后引入了一种双随机行进视角, 将输入输出对配制建模为两个有过渡概率的Markov链条, 其过渡概率由语义相似性定义。 在此基础上,我们提议了一个完全概率化的框架, 以一个反向模型为基础, 通过系统扰动来评估以给定输出为条件的输入空间的多样性来量化不确定性。 在此框架内,我们定义了一个新的不确定性度度、 温度性、 感官和性。 我们框架的关键力量是其灵活性: 它支持各种不确定性计量、 嵌入、 扰动策略和类似度度度度度测量。 我们还提议一个基于遗传算法的易扰动算算算算算法, 增强抽样输入的多样性。 此外,我们引入一个新的评价度度、 度- 度- Tempal- sental Sental- relistryallistryalalisalisalisal advial adview 。


Article 207

Title@2025-06-11 (3): Is Fine-Tuning an Effective Solution? Reassessing Knowledge Editing for Unstructured Data

Title: Is Fine-Tuning an Effective Solution? Reassessing Knowledge Editing for Unstructured Data Ist Fine-Tuning eine effektive Lösung? Neubewertung der Wissensbearbeitung für unstrukturierte Daten 精细调整是否是一种有效的解决方案?重新评估无结构数据的知识编辑 2506.09672v1

Authors (3): Hao Xiong, Chuanyuan Tan, Wenliang Chen

Unstructured Knowledge Editing (UKE) is crucial for updating the relevant knowledge of large language models (LLMs). It focuses on unstructured inputs, such as long or free-form texts, which are common forms of real-world knowledge. Although previous studies have proposed effective methods and tested them, some issues exist: (1) Lack of Locality evaluation for UKE, and (2) Abnormal failure of fine-tuning (FT) based methods for UKE. To address these issues, we first construct two datasets, UnKEBench-Loc and AKEW-Loc (CF), by extending two existing UKE datasets with locality test data from the unstructured and structured views. This enables a systematic evaluation of the Locality of post-edited models. Furthermore, we identify four factors that may affect the performance of FT-based methods. Based on these factors, we conduct experiments to determine how the well-performing FT-based methods should be trained for the UKE task, providing a training recipe for future research. Our experimental results indicate that the FT-based method with the optimal setting (FT-UKE) is surprisingly strong, outperforming the existing state-of-the-art (SOTA). In batch editing scenarios, FT-UKE shows strong performance as well, with its advantage over SOTA methods increasing as the batch size grows, expanding the average metric lead from +6.78% to +10.80%.

无结构的知识编辑(UKE)对于更新大语言模型的相关知识至关重要,它侧重于非结构化投入,如长文本或自由格式文本,这是现实世界知识的常见形式,尽管以前的研究提出了有效方法并测试了这些方法,但存在一些问题:(1) 对UKE缺乏地方性评价,和(2) 对UKE的微调(FT)方法异常失灵。为了解决这些问题,我们首先建立两个数据集,UnKEBench-Loc和AKEW-Loc(CF),从结构化和结构化观点扩大现有的两个基于地方测试数据的UCE数据集,以地方测试数据为基础,从结构化和结构化观点扩展两个现有数据集,例如长处或自由格式化文本,以便系统化地评估经编辑后模式的地域性。此外,我们查明可能影响基于FT方法的绩效的四个因素。基于这些因素,我们进行实验以确定如何为UKE任务对基于良好业绩的FT方法进行培训,为今后的研究提供一种培训食谱。我们的实验结果表明,基于最佳设置(FT-UC-80E)的现有方法是惊人的强的,从不断增长的SO-TAFTA,其质量的比UTA的升级优势,从不断增长的B-TA.


Article 208

Title@2025-06-11 (3): CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization

Title: CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization CROW: Beseitigung von Backdoors aus großen Sprachmodellen über interne Konsistenzregularisierung CROW:通过内部一致性规范化,消除大语言模式中的后门 2411.12768v2

Authors (4): Nay Myat Min, Long H. Pham, Yige Li, Jun Sun

Large Language Models (LLMs) are vulnerable to backdoor attacks that manipulate outputs via hidden triggers. Existing defense methods–designed for vision/text classification tasks–fail for text generation. We propose Internal Consistency Regularization (CROW), a defense leveraging the observation that backdoored models exhibit unstable layer-wise hidden representations when triggered, while clean models show smooth transitions. CROW enforces consistency across layers via adversarial perturbations and regularization during finetuning, neutralizing backdoors without requiring clean reference models or trigger knowledge–only a small clean dataset. Experiments across Llama-2 (7B, 13B), CodeLlama (7B, 13B), and Mistral-7B demonstrate CROW’s effectiveness: it achieves significant reductions in attack success rates across diverse backdoor strategies (sentiment steering, targeted refusal, code injection) while preserving generative performance. CROW’s architecture-agnostic design enables practical deployment.

大型语言模型(LLMs)很容易受到幕后攻击,这种攻击通过隐藏的触发器操纵产出。现有的防御方法是为视觉/文本分类任务设计的,在文本生成方面没有成功。我们提议内部一致性常规化(CROW),这是一种利用以下观察的辩护手段,即幕后模式在触发时表现出不稳定的层层层隐藏的表示方式,而清洁模型则显示平稳的过渡。CROW在微调过程中通过对抗性干扰和正规化,使后门中立,而不需要干净的参考模型或触发一个只有知识的小型清洁数据集。Llama-2(7B,13B),CodLlama (7B,13B))和Mistral-7B的实验展示了CROW的效能:它大大降低了各种后门战略(压力方向、定向拒绝、代码注入)的攻击成功率,同时保持基因化性能。CROW的建筑-认知设计使得实际部署成为可能。


Article 209

Title@2025-06-11 (3): Query-Level Uncertainty in Large Language Models

Title: Query-Level Uncertainty in Large Language Models Abfrage-Level-Unsicherheit in großen Sprachmodellen 大语言模式的质疑级不确定性 2506.09669v1

Authors (2): Lihu Chen, Gaël Varoquaux

It is important for Large Language Models to be aware of the boundary of their knowledge, i.e., to have a mechanism for identifying known and unknown queries. This type of awareness can help models perform adaptive inference, such as invoking RAG, engaging in slow and deep thinking, or adopting the abstention mechanism, which is beneficial to the development of efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which aims to determine if the model is able to address a given query without generating any tokens. To this end, we introduce a novel and training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens. Empirical results on both factual QA and mathematical reasoning tasks demonstrate that our internal confidence can outperform several baselines. Furthermore, we showcase that our proposed method can be used for efficient RAG and model cascading, which can reduce inference costs while maintaining performance.

大语言模型必须了解其知识的界限,即识别已知和未知查询的机制。这类认识有助于模型进行适应性的推断,例如援引RAG,进行缓慢和深入的思考,或采用有利于发展高效和可信赖的AI的弃权机制。在这项工作中,我们提出一种方法,通过查询级不确定性来探测知识界限,目的是确定模型是否能够在不产生任何象征物的情况下处理特定查询。为此,我们引入了一种名为\emph{Internal Infority}的新颖的、无培训的方法,利用了跨层次和标志的自我评价。关于事实质变和数学推理任务的经验性结果表明,我们的内部信任可以超越若干基线。此外,我们展示了我们提出的方法可用于高效的RAG和模型采集,在保持性能的同时可以降低推断成本。


Article 210

Title@2025-06-11 (3): Intent Factored Generation: Unleashing the Diversity in Your Language Model

Title: Intent Factored Generation: Unleashing the Diversity in Your Language Model Intent Factored Generation: Die Vielfalt in Ihrem Sprachmodell entfesseln 内在因数代:在你的语言模式中释放多样性 2506.09659v1

Authors (5): Eltayeb Ahmed, Uljad Berdica, Martha Elliott, Danijela Horak, Jakob N. Foerster

Obtaining multiple meaningfully diverse, high quality samples from Large Language Models for a fixed prompt remains an open challenge. Current methods for increasing diversity often only operate at the token-level, paraphrasing the same response. This is problematic because it leads to poor exploration on reasoning problems and to unengaging, repetitive conversational agents. To address this we propose Intent Factored Generation (IFG), factorising the sampling process into two stages. First, we sample a semantically dense intent, e.g., a summary or keywords. Second, we sample the final response conditioning on both the original prompt and the intent from the first stage. This allows us to use a higher temperature during the intent step to promote conceptual diversity, and a lower temperature during the final generation to ensure the outputs are coherent and self-consistent. Additionally, we find that prompting the model to explicitly state its intent for each step of the chain-of-thought before generating the step is beneficial for reasoning tasks. We demonstrate our method’s effectiveness across a diverse set of tasks. We show this method improves both pass@k and Reinforcement Learning from Verifier Feedback on maths and code tasks. For instruction-tuning, we combine IFG with Direct Preference Optimisation to increase conversational diversity without sacrificing reward. Finally, we achieve higher diversity while maintaining the quality of generations on a general language modelling task, using a new dataset of reader comments and news articles that we collect and open-source. In summary, we present a simple method of increasing the sample diversity of LLMs while maintaining performance. This method can be implemented by changing the prompt and varying the temperature during generation, making it easy to integrate into many algorithms for gains across various applications.

从大语言模型中获取多种有意义、高质量且多样的固定快速样本仍然是一项公开的挑战。当前增加多样性的方法通常只在象征性层面运作,对相同的响应进行推理。这很成问题,因为它导致对推理问题的探索不力,导致不采用重复的谈话代理器。为了解决这个问题,我们提议“内插因子生成”(IFG),将抽样过程推入两个阶段。首先,我们抽样一个语义密集的意图,例如一个摘要或关键字。第二,我们抽样最后反应,取决于最初的及时性和第一阶段的意向。这使我们能够在意向步骤期间使用更高的温度来促进概念多样性,同时在最后一代使用较低的温度以确保产出的一致性和自我兼容性。此外,我们发现在生成该步骤之前,促使该模型明确表明其对每个步骤的意图。我们展示了我们的方法在一系列不同的任务中是公开的。我们用这种方法改进了通过@k和加固的温度评论,让我们在目标步骤步骤中,从虚拟版本中学习,在最终的生成输出输出输出结果时,在不断更新的数学和读性质量分析中,在不断更新的数学和读取数据的方法中,同时,在不断更新的数学和不断调整。我们用新的的数学和读取数据的方法,在不断调整过程中,在不断更新的数学和读取的数学和读取的数学和读取数据中,同时,在不断更新的数学和读取数据的方法,在不断改进。
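The two-stage sampling scheme described in this abstract can be sketched in a few lines. Everything concrete below is an assumption for illustration: the `generate` callable stands in for whatever LLM sampling function is available, and the prompt wording and temperature values are hypothetical, not taken from the paper.

```python
def intent_factored_generate(generate, prompt,
                             intent_temperature=1.2,
                             response_temperature=0.5):
    """Two-stage sampling: a diverse intent first, then a coherent response.

    `generate(text, temperature=...)` is a hypothetical callable that returns
    a sampled completion as a string.
    """
    # Stage 1: sample a short, semantically dense intent at a higher
    # temperature to encourage conceptual diversity.
    intent_prompt = f"{prompt}\n\nState the intent of your answer as a few keywords:"
    intent = generate(intent_prompt, temperature=intent_temperature)

    # Stage 2: sample the final response conditioned on both the original
    # prompt and the intent, at a lower temperature for coherence.
    response_prompt = f"{prompt}\n\nIntent: {intent}\n\nResponse:"
    response = generate(response_prompt, temperature=response_temperature)
    return intent, response


def echo_generate(text, temperature):
    """Stand-in sampler so the sketch runs without a model."""
    return f"<sample at T={temperature} for {len(text)} chars>"


print(intent_factored_generate(echo_generate, "Suggest a weekend project."))
```

Calling the function several times with the same prompt would, with a real sampler, yield different intents and therefore meaningfully different responses, rather than paraphrases of one response.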


Article 211

Title@2025-06-11 (3): Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA

Title: Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA Die Lücke zwischen Open Source und Proprietary LLMs in Tabelle QA überbrücken 表QA缩小开放来源和专有LMs与公开来源和产权LMs之间的差距 2506.09657v1

Authors (2): Nikolas Evkarpidi, Elena Tutubalina

This paper presents a system developed for SemEval 2025 Task 8: Question Answering (QA) over tabular data. Our approach integrates several key components: text-to-SQL and text-to-code generation modules, a self-correction mechanism, and a retrieval-augmented generation (RAG). Additionally, it includes an end-to-end (E2E) module, all orchestrated by a large language model (LLM). Through ablation studies, we analyzed the effects of different parts of our pipeline and identified the challenges that are still present in this field. During the evaluation phase of the competition, our solution achieved an accuracy of 80%, resulting in a top-13 ranking among the 38 participating teams. Our pipeline demonstrates a significant improvement in accuracy for open-source models and achieves a performance comparable to proprietary LLMs in QA tasks over tables. The code is available at GitHub repository.

本文件介绍了为SemEval 2025任务8:在表格数据之上回答问题8(QA)开发的系统。我们的方法结合了几个关键组成部分:文本到SQL和文本到代码生成模块、自我校正机制和检索增强生成模块(RAG)。此外,它包括一个端到端(E2E)模块,全部由一个大语言模型(LLM)操作。通过消化研究,我们分析了我们输油管线不同部分的影响,并查明了这一领域仍然存在的挑战。在竞争的评估阶段,我们的解决方案达到了80%的准确度,导致在38个参与团队中排名第13位。我们的管道显示,开源模型的准确性显著提高,并取得了与QA任务中专有LMs比表的性能。该代码可在GitHub仓库查阅。


Article 212

Title@2025-06-11 (3): AskToAct: Enhancing LLMs Tool Use via Self-Correcting Clarification

Title: AskToAct: Enhancing LLMs Tool Use via Self-Correcting Clarification AskToAct: Verbesserung der LLMs-Tool-Nutzung durch selbstkorrekte Klärung AskToact:通过自我更正澄清加强LLMs工具的使用 2503.01940v2

Authors (9): Xuan Zhang, Yongliang Shen, Zhe Zheng, Linjuan Wu, Wenqi Zhang, Yuchen Yan, Qiuying Peng, Jun Wang, Weiming Lu

Large language models (LLMs) have demonstrated remarkable capabilities in tool learning. In real-world scenarios, user queries are often ambiguous and incomplete, requiring effective clarification. However, existing interactive clarification approaches face two critical limitations: reliance on manually constructed datasets, which inherently constrains training data scale and diversity, and lack of error correction mechanisms during multi-turn clarification, leading to error accumulation that compromises both accuracy and efficiency. We present AskToAct, which addresses these challenges by exploiting the structural mapping between queries and their tool invocation solutions. Our key insight is that tool parameters naturally represent explicit user intents. By systematically removing key parameters from queries while retaining them as ground truth, we enable automated construction of high-quality training data. We further enhance model robustness through error-correction pairs and selective masking, enabling dynamic error detection during clarification interactions. Comprehensive experiments demonstrate that AskToAct significantly outperforms existing approaches, achieving above 57% accuracy in recovering critical unspecified intents and enhancing clarification efficiency by an average of 10.46% while maintaining high accuracy in tool invocation. Our framework exhibits robust performance across different model architectures and successfully generalizes to entirely unseen APIs without additional training, achieving performance comparable to GPT-4o with substantially fewer computational resources.

大型语言模型(LLMS)展示了工具学习的非凡能力。在现实世界情景中,用户的询问往往模糊不清且不完整,需要有效澄清。然而,现有的互动式澄清方法面临两个关键限制:依赖人工构建的数据集,这必然限制培训数据的规模和多样性,在多方向澄清过程中缺乏错误纠正机制,导致错误积累,从而损害准确性和效率。我们介绍AskToAct,它利用查询及其工具引用解决方案之间的结构绘图来应对这些挑战。我们的关键洞察力是工具参数自然代表明确的用户意图。通过系统地从查询中删除关键参数,同时将其保留为地面真相,我们得以自动构建高质量的培训数据。我们通过错误校正配对和选择性掩码来进一步加强模型的稳健性,使在澄清互动中能够进行动态错误探测。全面实验表明,AskToAction大大超越了现有方法,在恢复关键未定意图和通过平均10.46%提高澄清效率方面实现了57%的准确性,同时保持了工具的高度精确性。我们的框架展示了不同模型结构的强性业绩,并成功地向完全看不见的APT(GPT)进行大量的计算。


Article 213

Title@2025-06-11 (3): Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering

Title: Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering Effizientes und generalisierbares Graph Retriever für Wissens-Graph-Frageantworten lernen 知识拼图问题解答的学习效率和通用图表 2506.09645v1

Authors (6): Tianjun Yao, Haoxuan Li, Zhiqiang Shen, Pan Li, Tongliang Liu, Kun Zhang

Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by outdated knowledge and hallucinations. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reasoning paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach to capture both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge and supports downstream reasoners through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by $2.66\%-20.34\%$, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Codes are available at: https://github.com/tianyao-aka/RAPL.

大型语言模型(LLM)在各个领域展现出很强的归纳推理能力,但其可靠性受到知识过时和幻觉的制约。检索增强生成(RAG)通过引入外部知识为LLM提供依据,从而缓解这些问题;然而,大多数现有RAG流程依赖非结构化文本,限制了可解释性和结构化推理。知识图谱以关系三元组表示事实,提供了一种更结构化、更紧凑的替代方案。近期研究探索了将知识图谱与LLM结合用于知识图谱问答(KGQA),其中相当一部分采用“先检索、后推理”的范式。在这一框架中,基于图的检索器表现出很强的实证性能,但在泛化能力方面仍面临挑战。在这项工作中,我们提出RAPL,一个用于KGQA中高效且有效的图检索的新框架。RAPL从三个方面应对这些局限:(1) 一种两阶段标注策略,将启发式信号与参数化模型相结合,以提供有因果依据的监督;(2) 一种与模型无关的图变换方法,用于捕捉三元组内部和三元组之间的交互,从而增强表示能力;(3) 一种基于路径的推理策略,便于从注入的理据知识中学习,并通过结构化输入支持下游推理器。实验表明,RAPL比最先进的方法高出2.66%-20.34%,并显著缩小了较小的与更强大的基于LLM的推理器之间的性能差距,以及跨数据集设置下的差距,凸显了其出色的检索能力和泛化能力。代码见 https://github.com/tianyao-aka/RAPL 。


Article 214

Title@2025-06-11 (3): Using Sign Language Production as Data Augmentation to enhance Sign Language Translation

Title: Using Sign Language Production as Data Augmentation to enhance Sign Language Translation Verwendung der Sign Language-Produktion als Datenerweiterung zur Verbesserung der Sign Language-Übersetzung 利用手语制作作为数据扩充,加强手语翻译 2506.09643v1

Authors (3): Harry Walsh, Maksym Ivashechkin, Richard Bowden

Machine learning models fundamentally rely on large quantities of high-quality data. Collecting the necessary data for these models can be challenging due to cost, scarcity, and privacy restrictions. Signed languages are visual languages used by the deaf community and are considered low-resource languages. Sign language datasets are often orders of magnitude smaller than their spoken language counterparts. Sign Language Production is the task of generating sign language videos from spoken language sentences, while Sign Language Translation is the reverse translation task. Here, we propose leveraging recent advancements in Sign Language Production to augment existing sign language datasets and enhance the performance of Sign Language Translation models. For this, we utilize three techniques: a skeleton-based approach to production, sign stitching, and two photo-realistic generative models, SignGAN and SignSplat. We evaluate the effectiveness of these techniques in enhancing the performance of Sign Language Translation models by generating variation in the signer’s appearance and the motion of the skeletal data. Our results demonstrate that the proposed methods can effectively augment existing datasets and enhance the performance of Sign Language Translation models by up to 19%, paving the way for more robust and accurate Sign Language Translation systems, even in resource-constrained environments.

机器学习模式基本上依赖大量高质量数据。 收集这些模式的必要数据可能由于成本、稀缺和隐私限制而具有挑战性。 手语是聋人社区使用的视觉语言,被视为低资源语言。 手语数据集通常数量小于口语对应方。 手语制作的任务是从口语句中生成手语视频,而手语翻译则是反向翻译任务。 在这里,我们建议利用手语制作的最新进展来扩大现有手语数据集,提高手语翻译模型的性能。 为此,我们使用三种技术:基于骨架的制作、签名缝纫和两种光真化的基因模型(SignGAN和SignSplat)。 我们评估这些技术在提高手语翻译模型性能方面的效力,通过改变手语的外观和骨骼数据运动。 我们的结果表明,拟议方法可以有效地增加现有手语翻译模型的数据集,提高手语翻译模型的性能,达到19 %,为更可靠、更精确的手语翻译系统铺平了道路,甚至在资源库中也是如此。


Article 215

Title@2025-06-11 (3): Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs

Title: Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs Gleiche Aufgabe, unterschiedliche Schaltungen: Entwirren von Modalität-spezifischen Mechanismen in VLMs 同一任务, 不同电路: VLMs 拆分模式特定机制 2506.09047v2

Authors (4): Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov

Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits (the task-specific computational sub-graphs) in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.

视觉-语言模型(VLMS)显示出令人印象深刻的回答视觉输入问题的能力(如在图像中计数对象),但在执行类似文本任务(如在文本中计数单词)时却表现出更高的理解度。我们通过辨别和比较特定任务次图解来调查这一准确性差距。我们用不同的方式来调查这一准确性差距。我们显示,虽然电路在模式之间大相脱节,但它们的功能相对相似:差别主要在于处理特定模式的数据位置(图像或文本序列)。放大图像数据表示时,我们观察到它们与高性能相似文本表示相近,但仅限于后层,在处理时过迟,以有效影响以后的位置。为了克服这一差距,我们把后层的视觉数据符号表示方式与前层相补。在多个任务和模型的实验中,这种简单的干预平均地缩小了模式之间性能差距的三分之一。我们的分析揭示了VLMS的多式性能差距,并提出了减少这种差异的无培训方法。


Article 216

Title@2025-06-11 (3): Modeling Probabilistic Reduction using Information Theory and Naive Discriminative Learning

Title: Modeling Probabilistic Reduction using Information Theory and Naive Discriminative Learning Modellierung der probabilistischen Reduktion mittels Informationstheorie und Naive Discriminative Learning 利用信息理论和辨别性差异性学习来模拟降低概率的模型 2506.09641v1

Authors (2): Anna Stein, Kevin Tang

This study compares probabilistic predictors based on information theory with Naive Discriminative Learning (NDL) predictors in modeling acoustic word duration, focusing on probabilistic reduction. We examine three models using the Buckeye corpus: one with NDL-derived predictors using information-theoretic formulas, one with traditional NDL predictors, and one with N-gram probabilistic predictors. Results show that the N-gram model outperforms both NDL models, challenging the assumption that NDL is more effective due to its cognitive motivation. However, incorporating information-theoretic formulas into NDL improves model performance over the traditional model. This research highlights a) the need to incorporate not only frequency and contextual predictability but also average contextual predictability, and b) the importance of combining information-theoretic metrics of predictability and information derived from discriminative learning in modeling acoustic reduction.

这项研究将基于信息理论的概率预测器与用于模拟声词持续时间并侧重于降低概率的视觉差异学习(NDL)预测器进行比较。我们用BuckeyPorporation研究三种模型:一种是使用信息理论公式的NDL预测器,一种是传统的NDL预测器,一种是Ngro预测器,一种是Ngram预测器。结果显示,Ngram模型与Ngram预测器相比,都优于NDL模型,挑战了NDL因其认知动机而更加有效的假设。然而,将信息理论公式纳入NDL可以提高传统模型的性能。这项研究强调,不仅需要将频率和上下文可预测性纳入NDL的预测器,而且需要将平均背景可预测性纳入,以及(b) 必须将在模拟声学减少中从歧视性学习获得的可预测性和信息的信息理论性计量器结合起来。
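As a rough illustration of the kind of N-gram predictor the abstract above refers to, the sketch below computes bigram conditional probabilities and the corresponding surprisal for each word. It is a minimal sketch, assuming unsmoothed maximum-likelihood estimates on a toy sentence; the function name `bigram_surprisal` is hypothetical, and the study's actual predictors, smoothing, and Buckeye preprocessing are not specified in the abstract.

```python
import math
from collections import Counter


def bigram_surprisal(tokens):
    """Return (word, previous word, surprisal in bits) for each bigram.

    Uses unsmoothed maximum-likelihood estimates (an assumption):
    P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1} used as context).
    """
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    rows = []
    for prev, word in zip(tokens, tokens[1:]):
        prob = bigram_counts[(prev, word)] / context_counts[prev]
        rows.append((word, prev, math.log2(1.0 / prob)))
    return rows


# Toy corpus: "the" is followed once by "cat" and once by "mat".
tokens = "the cat sat on the mat".split()
rows = bigram_surprisal(tokens)
for word, prev, bits in rows:
    print(f"P({word} | {prev}): surprisal = {bits:.2f} bits")
```

Words that are more predictable from their context receive lower surprisal, which is the quantity that probabilistic-reduction accounts relate to shorter acoustic duration.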


Article 217

Title@2025-06-11 (3): Benchmarking Debiasing Methods for LLM-based Parameter Estimates

Title: Benchmarking Debiasing Methods for LLM-based Parameter Estimates Benchmarking-Debiasing-Methoden für LLM-basierte Parameterschätzungen 以LLM为基础的参数估算的偏差方法 2506.09627v1

Authors (5): Nicolas Audinet de Pieuchon, Adel Daoud, Connor T. Jerzak, Moa Johansson, Richard Johansson

Large language models (LLMs) offer an inexpensive yet powerful way to annotate text, but are often inconsistent when compared with experts. These errors can bias downstream estimates of population parameters such as regression coefficients and causal effects. To mitigate this bias, researchers have developed debiasing methods such as Design-based Supervised Learning (DSL) and Prediction-Powered Inference (PPI), which promise valid estimation by combining LLM annotations with a limited number of expensive expert annotations. Although these methods produce consistent estimates under theoretical assumptions, it is unknown how they compare in finite samples of sizes encountered in applied research. We make two contributions: First, we study how each method’s performance scales with the number of expert annotations, highlighting regimes where LLM bias or limited expert labels significantly affect results. Second, we compare DSL and PPI across a range of tasks, finding that although both achieve low bias with large datasets, DSL often outperforms PPI on bias reduction and empirical efficiency, but its performance is less consistent across datasets. Our findings indicate that there is a bias-variance tradeoff at the level of debiasing methods, calling for more research on developing metrics for quantifying their efficiency in finite samples.

大型语言模型(LLMS)提供了省略文字的廉价而有力的方法,但在与专家相比时往往不一致。这些错误可能偏向下游人口参数的估算,如回归系数和因果关系。为了减轻这种偏差,研究人员开发了偏差方法,如基于设计的监督学习(DSL)和预测引力推理(PPI)等,通过将LLM说明与数量有限的昂贵专家说明相结合,可望作出有效的估计。虽然这些方法在理论假设下得出一致的估计,但在应用研究所遇到规模的有限样本中,其性能却不那么一致。我们做出两种贡献:第一,我们研究每种方法的性能尺度与专家说明的数量如何偏差,突出LM偏见或有限专家标签对结果产生重大影响的制度。第二,我们将DSL和PI在一系列任务中进行比较,发现虽然两者都与大数据集的偏差程度较低,但DSLSL往往在减少偏差和经验效率方面优于PPI,但其性在各种数据集之间却不那么一致。我们的研究结果表明,在量化方法方面存在着偏差的偏差性交易。我们的调查结果表明,要求进行更量化的方法。
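To make the idea of combining many LLM annotations with a few expert annotations concrete, here is a minimal sketch of the basic prediction-powered point estimate of a population mean: the average LLM label on the large pool, corrected by the average expert-minus-LLM difference on the small expert-labeled subset. The function name `ppi_mean` and the toy labels are assumptions for illustration; the exact estimators, tuning variants, and confidence intervals benchmarked in the paper are not reproduced here.

```python
from statistics import mean


def ppi_mean(llm_labels_unlabeled, llm_labels_labeled, expert_labels):
    """Prediction-powered point estimate of a population mean.

    estimate = mean(LLM labels on the large pool)
             + mean(expert label - LLM label on the expert-labeled subset)
    The second term (the "rectifier") corrects the LLM's systematic error.
    """
    if len(llm_labels_labeled) != len(expert_labels):
        raise ValueError("LLM and expert annotations must be paired")
    rectifier = mean(e - l for e, l in zip(expert_labels, llm_labels_labeled))
    return mean(llm_labels_unlabeled) + rectifier


# Toy binary annotations (hypothetical): the LLM over-predicts the positive class.
llm_pool = [1, 1, 1, 0, 1, 0, 1, 1]   # LLM labels, no expert labels available
llm_subset = [1, 1, 0, 1]             # LLM labels on the expert-labeled subset
expert_subset = [1, 0, 0, 1]          # expert labels for the same four items

naive_estimate = mean(llm_pool)
debiased_estimate = ppi_mean(llm_pool, llm_subset, expert_subset)
print(naive_estimate, debiased_estimate)
```

With only four expert labels the correction itself is noisy, which is the finite-sample regime the abstract highlights.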


Article 218

Title@2025-06-11 (3): MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

Title: MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering MAGIC-VQA: Multimodale und begründete Schlussfolgerungen mit Commonsense-Wissen für visuelle Fragenbeantwortung MAGIC-VQA:视觉问题解答使用常识知识的多式和有根据的推断 2503.18491v3

Authors (4): Shuo Yang, Siwen Luo, Soyeon Caren Han, Eduard Hovy

Visual Question Answering (VQA) requires reasoning across visual and textual modalities, yet Large Vision-Language Models (LVLMs) often lack integrated commonsense knowledge, limiting their robustness in real-world scenarios. To address this, we introduce MAGIC-VQA, a novel framework that enhances VQA by systematically integrating commonsense knowledge with LVLMs. MAGIC-VQA employs a three-stage process: (1) Explicit Knowledge Integration from external sources, (2) By-Type Post-Processing for contextual refinement, and (3) Implicit Knowledge Augmentation using a Graph Neural Network (GNN) for structured reasoning. GNNs bring greater depth to structured inference, enabling superior relational inference beyond LVLMs. MAGIC-VQA bridges a key gap by unifying commonsense knowledge with LVLM-driven reasoning, eliminating the need for extensive pre-training or complex prompt tuning. Our framework achieves state-of-the-art performance on benchmark datasets, significantly improving commonsense reasoning in VQA.

视觉问题解答(VQA)要求通过视觉和文字方式进行推理,但大型视觉语言模型(LVLM)往往缺乏综合的常识知识,限制了其在现实世界情景中的稳健性。为了解决这个问题,我们引入了MAGIC-VQA,这是一个通过系统地将常识知识与LVLMs整合而增强VQA的新框架。MAGIC-VQA采用一个三阶段过程:(1)外部来源的清晰知识整合,(2)用于背景改进的By-Type后处理,(3)利用图形神经网络(GNNN)进行隐性知识增强,以进行结构化推理。虽然GNNS为结构推理带来了更大的深度,但是它们使得超高的关联性推理超越LVLMs。MAGIC-VQA通过将常识与LM驱动的推理统一,消除广泛培训前或复杂的迅速调整的需要,从而缩小了对共同知识的鸿沟。我们的框架在基准数据集上取得了最新的业绩,极大地改进了VQA的常识推理。


Article 219

Title@2025-06-11 (3): Style over Substance: Distilled Language Models Reason Via Stylistic Replication

Title: Style over Substance: Distilled Language Models Reason Via Stylistic Replication Stil über Substanz: Destillierte Sprachmodelle Grund über stylistische Replication 物质之上的样式: 蒸馏语言模型 2504.01738v2

Authors (2): Philip Lippmann, Jie Yang

Specialized reasoning language models (RLMs) have demonstrated that scaling test-time computation through detailed reasoning traces significantly enhances performance. Although these traces effectively facilitate knowledge distillation into smaller, instruction-tuned models, the precise nature of transferred reasoning remains unclear. In this study, we investigate to what extent distilled models internalize replicated stylistic patterns during reasoning. To this end, we systematically analyze reasoning traces, identifying structural and lexical patterns that characterize successful reasoning. We then introduce two new datasets – a dataset of emergent reasoning traces and a synthetic dataset explicitly constructed to replicate these stylistic patterns – to precisely examine their influence on distilled models’ reasoning capabilities. We find that models trained on the synthetic traces achieve comparable performance, indicating that distilled reasoning abilities rely significantly on surface-level patterns. Surprisingly, we observe an increase in performance even when the synthetic traces are altered to lead to the wrong answer. Our findings highlight how stylistic patterns can be leveraged to efficiently enhance LM reasoning across diverse model families.

专业推理语言模型(RLMS)表明,通过详细推理进行缩放测试时间的计算,极大地提高了绩效。虽然这些痕迹有效地促进了知识蒸馏成较小的、按指示调整的模型,但转移推理的确切性质仍然不明确。在本研究中,我们调查了在推理过程中,在何种程度上蒸馏的模型内化了复制的文体模式。为此,我们系统地分析推理痕迹,查明作为成功推理特点的结构和词汇模式。然后我们引入了两个新的数据集 – – 一组由新兴推理痕迹组成的数据集,以及一套为复制这些模式而明确设计的合成数据集 – – 以精确地检查其对精炼模型推理能力的影响。我们发现,经过培训的合成模型取得了可比较的性能,表明蒸馏推理能力在很大程度上依赖地表水平模式。令人惊讶的是,我们观察到,即使在合成痕迹被改变以导致错误答案时,性能也有所提高。我们的调查结果突出表明,如何利用典型模式来有效提高不同模型家庭的LM推理力。


Article 220

Title@2025-06-11 (3): Effective Red-Teaming of Policy-Adherent Agents

Title: Effective Red-Teaming of Policy-Adherent Agents Effektives Red-Teaming von Policy-Adherent Agents 有效的政策协调代理人红队 2506.09600v1

Authors (6): Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor

Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercion. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent’s robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.

以任务为导向的LLM代理商越来越多地在有严格政策的领域使用,例如退税资格或注销规则。挑战在于确保代理商始终遵守这些规则和政策,适当拒绝违反这些规则和政策的任何要求,同时保持有益和自然的互动。这要求制定有针对性的设计和评价方法,以确保代理商抵御恶意用户行为的能力。我们提出了一个新的威胁模式,以对抗性用户为重点,目的是利用政策适应性代理商谋取个人利益。为了解决这个问题,我们提出了CRAFT,这是一个多试剂红色组合系统,利用政策认知的有说服力战略,在客户服务情景中破坏政策适应性代理商,超过常规的破狱方法,如丹麦语提示、情感操纵和胁迫性。我们在现有的Tau-bench基准的基础上,我们引入Tau-break,这是一个补充性基准,旨在严格评估代理商抵御操纵性用户行为的强健性。最后,我们评估了若干直接而有效的防御战略。这些措施提供了一些保护,但很短,它们需要强调为保护政策适应性攻击的代理人免受敌对性攻击需要更有力、更强有力的研究驱动的保障措施。


Article 221

Title@2025-06-11 (3): Assessment of Evolving Large Language Models in Upper Secondary Mathematics

Title: Assessment of Evolving Large Language Models in Upper Secondary Mathematics Bewertung von sich entwickelnden großen Sprachmodellen in der Oberen Sekundärmathematik 对高中数学中演变中的大语言模型的评估 2504.12347v2

Authors (4): Mika Setälä, Pieta Sikström, Ville Heilala, Tommi Kärkkäinen

Large language models (LLMs) have shown increasing promise in educational settings, yet their mathematical reasoning has been considered evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests yielded moderate performance corresponding to mid-range grades, but later evaluations demonstrated substantial improvements as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and qualifying for university admission. Our findings highlight the rapid advances in the mathematical proficiency of LLMs and illustrate their potential as underlying tools to support learning and teaching in a variety of ways.

大型语言模式(LLMS)在教育环境中显示出越来越大的希望,但其数学推理却被认为是不断发展的,这项研究利用芬兰大学入学考试,即高中教育的高级数字测试,评估了各种LLMS的数学能力,初步测试取得了与中年级相对应的中等成绩,但后来的评估表明随着语言模式的发展,取得了显著的改进。一些模式取得了近乎完美或完美的分数,与优秀学生的成绩相匹配,并获得大学入学资格。我们的调查结果突出表明了LLMS数学熟练程度的快速进步,并展示了它们作为支持以各种方式进行学习和教学的基本工具的潜力。


Article 222

Title@2025-06-11 (3): NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction

Title: NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction NTPP: Generative Sprachmodellierung für Dual-Channel-Gesprochenen Dialog über Next-Token-Pair-Vorhersage NTPP:通过下波波语预测为双声道口语对话生成语音模型 2506.00975v4

Authors (9): Qichao Wang, Ziqiao Meng, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, Peilin Zhao

Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications.

受GPT-4o出色能力的启发,人们越来越希望让语音语言模型(SLM)能够与人类进行自然、流畅的口语互动。近期的进展催生了若干在这一领域展现出良好效果的SLM。然而,现有方法尚未充分利用双通道语音数据,而这类数据天然地反映了人类对话的结构和动态。在这项工作中,我们系统地探索了在现代大型语言模型背景下使用双通道语音数据,并引入一种新的生成式建模范式,即下一词元对预测(Next-Token-Pair Prediction,NTPP),首次利用仅解码器架构实现与说话人无关的双通道口语对话学习。我们在标准基准上评估了我们的方法,实证结果表明,所提出的NTPP方法在话轮转换预测、回应连贯性和自然度方面显著提升了SLM的对话能力。此外,与现有方法相比,NTPP的推理延迟大幅降低,凸显了其在实时应用中的实用效率。


Article 223

Title@2025-06-11 (3): Memorization in Language Models through the Lens of Intrinsic Dimension

Title: Memorization in Language Models through the Lens of Intrinsic Dimension Auswendiglernen in Sprachmodellen durch die Linse der Intrinsischen Dimension 通过内在层面的镜头在语言模型中的记忆化 2506.09591v1

Authors (1): Stefan Arnold

Language Models (LMs) are prone to memorizing parts of their data during training and unintentionally emitting them at generation time, raising concerns about privacy leakage and disclosure of intellectual property. While previous research has identified properties such as context length, parameter size, and duplication frequency, as key drivers of unintended memorization, little is known about how the latent structure modulates this rate of memorization. We investigate the role of Intrinsic Dimension (ID), a geometric proxy for the structural complexity of a sequence in latent space, in modulating memorization. Our findings suggest that ID acts as a suppressive signal for memorization: compared to low-ID sequences, high-ID sequences are less likely to be memorized, particularly in overparameterized models and under sparse exposure. These findings highlight the interaction between scale, exposure, and complexity in shaping memorization.

语言模型(LMS)很容易在培训期间对部分数据进行记忆,并在一代人的时间里无意地发布这些数据,从而引起对隐私泄漏和知识产权披露的关切。虽然以前的研究已经查明了上下文长度、参数大小和重复频率等属性,作为意外记忆化的关键驱动因素,但对于潜在结构如何调节这种记忆化速度却知之甚少。我们调查了内在维度(ID)的作用,即暗层空间序列结构复杂性的几何替代物,在调节记忆化方面。我们的研究结果表明,ID作为记忆化的抑制信号作用:与低ID序列相比,高ID序列不太可能被记忆化,特别是在多度分解模型和低度暴露下。这些发现突出表明了比例、暴露和复杂性在塑造记忆化过程中的相互作用。


Article 224

Title@2025-06-11 (3): Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis

Title: Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis Decoding Knowledge Attribution in Mixture-of-Experts: Ein Rahmenwerk der Basic-Refinement-Kollaboration und Effizienzanalyse 在混合专家中解码知识的配置:基础改进协作和效率分析框架 2505.24593v2

Authors (6): Junzhuo Li, Bo Wang, Xiuze Zhou, Peijie Jiang, Jia Liu, Xuming Hu

The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture dynamic routing-expert interactions in sparse MoE architectures. To address this issue, we propose a cross-level attribution algorithm to analyze sparse MoE architectures (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense models (Qwen 1.5-7B, Llama-7B, Mistral-7B). Results show MoE models achieve 37% higher per-layer efficiency via a “mid-activation, late-amplification” pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a “basic-refinement” framework: shared experts handle general tasks (entity recognition), while routed experts specialize in domain-specific processing (geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r=0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: deep Qwen 1.5-MoE mitigates expert failures (e.g., 43% MRR drop in geographic tasks when blocking top-10 experts) through shared expert redundancy, whereas shallow OLMoE suffers severe degradation (76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.

混合专家(MoE)模型的可解释性,尤其是异构设计模型的可解释性,仍未得到充分研究。现有针对稠密模型的归因方法无法捕捉稀疏MoE架构中路由与专家之间的动态交互。为解决这一问题,我们提出一种跨层级归因算法,将稀疏MoE架构(Qwen 1.5-MoE、OLMoE、Mixtral-8x7B)与稠密模型(Qwen 1.5-7B、Llama-7B、Mistral-7B)进行对比分析。结果显示,MoE模型通过“中期激活、后期放大”的模式实现了高出37%的逐层效率:早期层筛选专家,后期层协同精炼知识。消融研究揭示了一种“基础-精炼”框架:共享专家处理通用任务(实体识别),而路由专家专门负责特定领域的处理(地理属性)。注意力头与专家之间的强相关性(r=0.68)证明了语义驱动的路由,使任务感知的协调成为可能。值得注意的是,架构深度决定鲁棒性:较深的Qwen 1.5-MoE通过共享专家冗余减轻了专家失效的影响(例如,屏蔽前10位专家时地理任务的MRR下降43%),而较浅的OLMoE则出现严重退化(下降76%)。任务敏感性进一步指导设计:核心敏感型任务(地理)需要集中的专业能力,而分布容忍型任务(物体属性)则利用更广泛的参与。这些见解推进了MoE的可解释性,并为平衡效率、专业化和鲁棒性提供了原则。


Article 225

Title@2025-06-11 (3): Explaining word embeddings with perfect fidelity: Case study in research impact prediction

Title: Explaining word embeddings with perfect fidelity: Case study in research impact prediction Erklären von Wort-Einbettungen mit perfekter Treue: Fallstudie in der Forschungs-Impact-Vorhersage 解释完美忠实地嵌入的字词的解释:研究影响预测案例研究 2409.15912v2

Authors (6): Lucie Dvorackova, Marcin P. Joachimiak, Michal Cerny, Adriana Kubecova, Vilem Sklenak, Tomas Kliegr

Best-performing approaches for scholarly document quality prediction are based on embedding models, which do not allow direct explanation of classifiers, as distinct words no longer correspond to the input features for model training. Although model-agnostic explanation methods such as Local Interpretable Model-agnostic Explanations (LIME) can be applied, these produce results with questionable correspondence to the ML model. We introduce a new feature importance method, Self-model Rated Entities (SMER), for logistic regression-based classification models trained on word embeddings. We show that SMER has theoretically perfect fidelity with the explained model, as its prediction corresponds exactly to the average of predictions for individual words in the text. SMER allows us to reliably determine which words or entities positively contribute to predicting impactful articles. Quantitative and qualitative evaluation is performed through five diverse experiments conducted on 50,000 research papers from the CORD-19 corpus. Through an AOPC curve analysis, we experimentally demonstrate that SMER produces better explanations than LIME for logistic regression.

学术文献质量预测中表现最好的方法基于嵌入模型,而嵌入模型无法直接解释分类器,因为单个词语不再对应于模型训练的输入特征。虽然可以使用LIME(Local Interpretable Model-agnostic Explanations)等与模型无关的解释方法,但其结果与机器学习模型的对应关系存疑。我们针对在词嵌入上训练的基于逻辑回归的分类模型,提出一种新的特征重要性方法:自模型评分实体(Self-model Rated Entities,SMER)。我们证明,SMER在理论上对被解释模型具有完美的保真度,因为其预测恰好等于文本中各个词的预测的平均值。SMER使我们能够可靠地确定哪些词或实体对预测高影响力文章有正向贡献。我们在CORD-19语料库的50,000篇研究论文上进行了五项不同的实验,开展定量和定性评估。通过AOPC曲线分析,我们以实验证明,对于逻辑回归,SMER给出的解释优于LIME。
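The fidelity claim in the abstract above can be illustrated with a small numerical check. The sketch assumes that a document is represented by the mean of its word embeddings and that "prediction" is read as the linear score (log-odds) of the logistic regression; both are assumptions made for illustration, and the embeddings, weights, and helper names are hypothetical. Under these assumptions, the document score equals the average of the per-word scores exactly, because the score is linear in the input. The identity does not hold in general after the sigmoid is applied.

```python
from statistics import mean


def linear_score(weights, bias, vector):
    """Log-odds of a logistic regression model for one input vector."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    return [mean(column) for column in zip(*vectors)]


# Hypothetical 2-d word embeddings and model parameters.
embeddings = {
    "vaccine": [0.9, 0.1],
    "protein": [0.4, -0.2],
    "the": [0.0, 0.0],
}
weights, bias = [2.0, -1.0], -0.5

document = ["vaccine", "protein", "the"]
word_scores = {w: linear_score(weights, bias, embeddings[w]) for w in document}
document_score = linear_score(
    weights, bias, mean_vector([embeddings[w] for w in document])
)

# Words ranked by their own score act as feature importances.
ranking = sorted(word_scores, key=word_scores.get, reverse=True)
print(document_score, mean(word_scores.values()), ranking)
```

Because each word is scored by the model itself, the explanation needs no surrogate model, which is where it differs from LIME-style approximations.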


Article 226

Title@2025-06-11 (3): Human-like object concept representations emerge naturally in multimodal large language models

Title: Human-like object concept representations emerge naturally in multimodal large language models Mensch-ähnliche Objekt-Konzept-Darstellungen entstehen natürlich in multimodalen großen Sprachmodellen 在多式大型语言模式中自然出现类似人类的物体概念表述 2407.01067v3

Authors (13): Changde Du, Kaicheng Fu, Bincheng Wen, Yi Sun, Jie Peng, Wei Wei, Ying Gao, Shengpei Wang, Chuncheng Zhang, Jinpeng Li, Shuang Qiu, Le Chang, Huiguang He

Understanding how humans conceptualize and categorize natural objects offers critical insights into perception and cognition. With the advent of Large Language Models (LLMs), a key question arises: can these models develop human-like object representations from linguistic and multimodal data? In this study, we combined behavioral and neuroimaging analyses to explore the relationship between object concept representations in LLMs and human cognition. We collected 4.7 million triplet judgments from LLMs and Multimodal LLMs (MLLMs) to derive low-dimensional embeddings that capture the similarity structure of 1,854 natural objects. The resulting 66-dimensional embeddings were stable, predictive, and exhibited semantic clustering similar to human mental representations. Remarkably, the dimensions underlying these embeddings were interpretable, suggesting that LLMs and MLLMs develop human-like conceptual representations of objects. Further analysis showed strong alignment between model embeddings and neural activity patterns in brain regions such as EBA, PPA, RSC, and FFA. This provides compelling evidence that the object representations in LLMs, while not identical to human ones, share fundamental similarities that reflect key aspects of human conceptual knowledge. Our findings advance the understanding of machine intelligence and inform the development of more human-like artificial cognitive systems.

人类如何理解自然物体的概念和分类,使人们对认知和认知产生重要的洞察力。随着大语言模型(LLMs)的出现,产生了一个关键问题:这些模型能否从语言和多式联运数据中产生像人一样的物体表象?在本研究中,我们结合了行为和神经成像分析,以探索LOMs和人类认知中的物体概念表象之间的关系;我们从LLMs和多模式LMs收集了470万个三重判断,以获得低维嵌入,从而捕捉到1 854个自然物体的相似结构;由此产生的66维嵌入是稳定的、预测性的,并展示出与人类心理表象相似的语义组合;显而易见,这些嵌入背后的维度是可以解释的,表明LMMs和MLLMs对物体的概念表象是人性的。进一步分析表明,模型嵌入和神经活动模式在脑区域(例如EBA、PA、RSC和FFA)之间有着很强的一致。这提供了令人信服的证据,表明LMs的物体表象虽然与人类的不完全相同,但具有类似人类的特征,但与人际认知系统也具有更接近于人类认知系统的主要相似性理解。


Article 227

Title@2025-06-11 (3): MOSAIC: Multiple Observers Spotting AI Content

Title: MOSAIC: Multiple Observers Spotting AI Content MOSAIC: Mehrere Beobachter finden KI-Inhalte MOSAIC:多观察员观察AI 内容 2409.07615v3

Authors (3): Matthieu Dubois, François Yvon, Pablo Piantanida

The dissemination of Large Language Models (LLMs), trained at scale, and endowed with powerful text-generating abilities, has made it easier for all to produce harmful, toxic, faked or forged content. In response, various proposals have been made to automatically discriminate artificially generated from human-written texts, typically framing the problem as a binary classification problem. Early approaches evaluate an input document with a well-chosen detector LLM, assuming that low-perplexity scores reliably signal machine-made content. More recent systems instead consider two LLMs and compare their probability distributions over the document to further discriminate when perplexity alone cannot. However, using a fixed pair of models can induce brittleness in performance. We extend these approaches to the ensembling of several LLMs and derive a new, theoretically grounded approach to combine their respective strengths. Our experiments, conducted with various generator LLMs, indicate that this approach effectively leverages the strengths of each model, resulting in robust detection performance across multiple domains. Our code and data are available at https://github.com/BaggerOfWords/MOSAIC .

大语言模型(LLMs)的传播,经过规模培训,并具有强大的生成文本能力,使所有人都更容易产生有害、有毒、伪造或伪造的内容。作为回应,提出了各种建议,自动区分人为编写的文本产生的内容,通常将问题定为二元分类问题。早期方法用一个精选的检测器LLM来评价一个输入文件,假设低易读性分数可靠信号机器制作的内容。较新的系统则考虑两个LMs,比较其在文件中的概率分布,以进一步区分单是不易懂的情况。但是,使用固定的模型可以诱发性能的不稳定。我们将这些方法扩大到若干LMs的组合,并形成一种新的、基于理论的办法来将各自的长处结合起来。我们与各种发电机LMs进行的实验表明,这一方法有效地利用了每种模型的长处,从而在多个领域都取得了可靠的检测性。我们的代码和数据可在https://github.com/BaggerofWords/MOIC上查阅。
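The single-detector baseline described in the abstract above (low perplexity taken as a sign of machine-made text) can be sketched as follows. This is a minimal sketch, assuming per-token natural-log probabilities have already been obtained from some detector LLM; the function names, the toy log-probabilities, and the threshold value are hypothetical. It does not implement the MOSAIC ensemble itself, whose combination rule is not given in the abstract.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def looks_machine_made(token_logprobs, threshold):
    """Flag a document whose perplexity under the detector is below `threshold`."""
    return perplexity(token_logprobs) < threshold


# Hypothetical per-token log-probabilities assigned by a detector LLM.
predictable = [math.log(0.5)] * 4    # every token had probability 0.5
surprising = [math.log(0.01)] * 4    # every token had probability 0.01

THRESHOLD = 10.0  # illustrative value only; a real threshold must be calibrated
print(perplexity(predictable), looks_machine_made(predictable, THRESHOLD))
print(perplexity(surprising), looks_machine_made(surprising, THRESHOLD))
```

A fixed threshold on one detector is exactly the kind of brittle choice that motivates comparing or ensembling several models.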


Article 228

Title@2025-06-11 (3): Phonology-Guided Speech-to-Speech Translation for African Languages

Title: Phonology-Guided Speech-to-Speech Translation for African Languages Phonologie-geführte Sprach-zu-Sprach-Übersetzung für afrikanische Sprachen 非洲语文声学引导语音语音对语音翻译 2410.23323v3

Authors (2): Peter Ochieng, Dennis Kaburu

We present a prosody-guided framework for speech-to-speech translation (S2ST) that aligns and translates speech without transcripts by leveraging cross-linguistic pause synchrony. Analyzing a 6,000-hour East African news corpus spanning five languages, we show that within-phylum language pairs exhibit 30–40% lower pause variance and over 3× higher onset/offset correlation compared to cross-phylum pairs. These findings motivate SPaDA, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment F1 by +3–4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, we train SegUniDiff, a diffusion-based S2ST model guided by external gradients from frozen semantic and speaker encoders. SegUniDiff matches an enhanced cascade in BLEU (30.3 on CVSS-C vs. 28.9 for UnitY), reduces speaker error rate (EER) from 12.5% to 5.3%, and runs at an RTF of 1.02. To support evaluation in low-resource settings, we also release a three-tier, transcript-free BLEU suite (M1–M3) that correlates strongly with human judgments. Together, our results show that prosodic cues in multilingual speech provide a reliable scaffold for scalable, non-autoregressive S2ST.

我们提出一个韵律引导的语音到语音翻译(S2ST)框架,利用跨语言的停顿同步性,在不依赖转写文本的情况下对齐并翻译语音。通过分析涵盖五种语言、时长6,000小时的东非新闻语料库,我们发现,与跨语门语言对相比,同一语门内的语言对的停顿方差低30–40%,起止点相关性高出3倍以上。这些发现促成了SPaDA,一种融合静音一致性、语速同步性和语义相似度的动态规划对齐算法。相对于贪心的VAD基线,SPaDA将对齐F1提高3–4个点,并消除多达38%的虚假匹配。利用SPaDA对齐的片段,我们训练了SegUniDiff,一种基于扩散的S2ST模型,由冻结的语义编码器和说话人编码器提供的外部梯度引导。SegUniDiff在BLEU上与增强的级联系统相当(CVSS-C上为30.3,UnitY为28.9),将说话人错误率(EER)从12.5%降至5.3%,实时率(RTF)为1.02。为支持低资源场景下的评估,我们还发布了一套三层级、无需转写文本的BLEU评测套件(M1–M3),其与人工评判高度相关。总体而言,我们的结果表明,多语言语音中的韵律线索可为可扩展的非自回归S2ST提供可靠的支撑。


Article 229

Title@2025-06-11 (3): From Symbolic to Neural and Back: Exploring Knowledge Graph-Large Language Model Synergies

Title: From Symbolic to Neural and Back: Exploring Knowledge Graph-Large Language Model Synergies Von Symbolisch zu Neural und Zurück: Erforschen von Wissen Graphen-Große Sprachmodell-Synergien 从符号到神经和后背:探索知识图集-大型语言模式协同作用 2506.09566v1

Authors (4): Blaž Škrlj, Boshko Koloski, Senja Pollak, Nada Lavrač

Integrating structured knowledge from Knowledge Graphs (KGs) into Large Language Models (LLMs) enhances factual grounding and reasoning capabilities. This survey paper systematically examines the synergy between KGs and LLMs, categorizing existing approaches into two main groups: KG-enhanced LLMs, which improve reasoning, reduce hallucinations, and enable complex question answering; and LLM-augmented KGs, which facilitate KG construction, completion, and querying. Through comprehensive analysis, we identify critical gaps and highlight the mutual benefits of structured knowledge integration. Compared to existing surveys, our study uniquely emphasizes scalability, computational efficiency, and data quality. Finally, we propose future research directions, including neuro-symbolic integration, dynamic KG updating, data reliability, and ethical considerations, paving the way for intelligent systems capable of managing more complex real-world knowledge tasks.

将知识图(KGs)的结构化知识纳入大语言模型(LLMs)可提高事实依据和推理能力。本调查文件系统地审查KGs和LLMs之间的协同作用,将现有办法分为两大类:KG-enhanced LLMs, 改进推理、减少幻觉,并能够回答复杂的问题;LLMs提炼KG, 便利KG的构建、完成和查询。我们通过全面分析,找出关键差距,突出结构化知识整合的互利。与现有调查相比,我们的研究特别强调可扩展性、计算效率和数据质量。最后,我们提出了未来的研究方向,包括神经-精神整合、动态KG的更新、数据可靠性和伦理考虑,为能够管理更复杂的现实世界知识任务的智能系统铺平铺平了道路。


Article 230

Title@2025-06-11 (3): Meaningless is better: hashing bias-inducing words in LLM prompts improves performance in logical reasoning and statistical learning

Title: Meaningless is better: hashing bias-inducing words in LLM prompts improves performance in logical reasoning and statistical learning Sinnlos ist besser: Hashing von Bias-induzierenden Wörtern in LLM-Prompts verbessert die Leistung beim logischen Schließen und statistischen Lernen 无意义更好:对LLM提示中引发偏见的词语进行哈希处理可提高逻辑推理和统计学习的表现 2411.17304v2

Authors (3): Milena Chadimová, Eduard Jurášek, Tomáš Kliegr

This paper introduces a novel method, referred to as “hashing”, which involves masking potentially bias-inducing words in large language model (LLM) prompts with hash-like meaningless identifiers to reduce cognitive biases and reliance on external knowledge. The method was tested across three sets of experiments involving a total of 490 prompts. Statistical analysis using chi-square tests showed significant improvements in all tested scenarios, which covered Llama, ChatGPT, Copilot, Gemini and Mixtral models. In the first experiment, hashing decreased the fallacy rate in a modified version of the “Linda” problem aimed at evaluating susceptibility to cognitive biases. In the second experiment, it improved LLM results on the frequent itemset extraction task. In the third experiment, we found hashing is also effective when the Linda problem is presented in a tabular format rather than text, indicating that the technique works across various input representations. Overall, the method was shown to improve bias reduction and incorporation of external knowledge. Despite bias reduction, hallucination rates were inconsistently reduced across types of LLMs. These findings suggest that masking bias-inducing terms can improve LLM performance, although its effectiveness is model- and task-dependent.

本文提出了一种称为“哈希”(hashing)的新方法,即用类似哈希值的无意义标识符掩盖大语言模型(LLMs)提示中可能引发偏见的词语,以减少认知偏差和对外部知识的依赖。该方法在三组实验中进行了测试,共涉及490个提示。使用卡方检验的统计分析显示,所有测试场景均有显著改进,涵盖Llama、ChatGPT、Copilot、Gemini和Mixtral模型。在第一个实验中,哈希降低了旨在评估认知偏差易感性的“Linda”问题修改版中的谬误率。在第二个实验中,它改进了LLM在频繁项集提取任务上的结果。在第三个实验中,我们发现当Linda问题以表格而非文本形式呈现时,哈希同样有效,表明该技术适用于多种输入表示。总体而言,该方法被证明能够改进偏见减少和外部知识的纳入。尽管偏见有所减少,但各类LLM的幻觉率下降并不一致。这些发现表明,掩盖引发偏见的词语可以提高LLM的性能,但其效果取决于模型和任务。
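The masking step described in this abstract can be illustrated with a minimal sketch. The function name `hash_terms`, the `X` prefix, and the use of truncated SHA-256 digests are assumptions made for illustration; the abstract only says that bias-inducing words are replaced by "hash-like meaningless identifiers" and does not specify how the identifiers are produced or how the words are chosen.

```python
import hashlib
import re


def hash_terms(prompt, terms, length=8):
    """Replace each listed term with a meaningless hash-like identifier.

    Returns the masked prompt and the term -> identifier mapping.
    The identifier format is a hypothetical choice, not the paper's.
    """
    mapping = {}
    masked = prompt
    for term in terms:
        digest = hashlib.sha256(term.lower().encode("utf-8")).hexdigest()[:length]
        token = f"X{digest}"
        mapping[term] = token
        # Whole-word, case-insensitive replacement of the term.
        masked = re.sub(rf"\b{re.escape(term)}\b", token, masked, flags=re.IGNORECASE)
    return masked, mapping


prompt = "Linda is a bank teller and a feminist. A bank teller is common."
masked, mapping = hash_terms(prompt, ["bank teller", "feminist"])
print(masked)
```

The same term always maps to the same identifier, so the logical structure of the prompt is preserved while the word's associations are removed.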


Article 231

Title@2025-06-11 (3): Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Title: Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length Unfähig zu vergessen: Proaktive Interferenz offenbart Grenzen des Arbeitsgedächtnisses in LLMs jenseits der Kontextlänge 无法遗忘:前摄干扰揭示LLM在上下文长度之外的工作记忆限制 2506.08184v2

Authors (2): Chupei Wang, Jiaqiu Vince Sun

Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs’ ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models’ ability to suppress irrelevant content during retrieval.

大语言模型(LLMs)中的信息检索日益被认为与生成能力交织在一起,而不仅仅是简单的查找。虽然人们通常认为更长的上下文可以改进检索,但上下文内干扰的影响仍未得到充分研究。为了解决这个问题,我们借鉴了认知科学中的前摄干扰(PI)范式,即较早的信息会干扰对较新更新的回忆。在人类中,对这种干扰的易感性与工作记忆容量呈反向关系。我们提出了PI-LLM,这一评估按顺序输入语义相关的键值更新流,并且只查询最终值。尽管这些最终值明确位于查询之前,但随着干扰的累积,LLM的检索准确率呈对数线性下降并趋近于零;错误来自于检索到先前已被覆盖的值。通过提示工程(例如指示模型忽略较早的输入)来减轻干扰的尝试收效有限。这些发现揭示了LLM在分离干扰和灵活操纵信息方面的根本性限制,表明存在超出单纯上下文访问的工作记忆瓶颈。这要求采取能够加强模型在检索过程中抑制无关内容能力的方法。
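The evaluation setup in this abstract (a stream of key-value updates followed by a query for only the final values) can be sketched as follows. This is a minimal illustration under stated assumptions: the prompt wording, the function names `build_pi_stream` and `score`, and the random interleaving are hypothetical and are not taken from the PI-LLM benchmark itself.

```python
import random


def build_pi_stream(keys, values, updates_per_key, seed=0):
    """Build an interleaved stream of key-value updates plus a final query.

    Returns the prompt text and the ground-truth final value of each key.
    """
    rng = random.Random(seed)
    updates = []
    for key in keys:
        for _ in range(updates_per_key):
            updates.append((key, rng.choice(values)))
    rng.shuffle(updates)  # interleave updates of different keys

    final = {}
    for key, value in updates:
        final[key] = value  # a later update overwrites an earlier one

    lines = [f"{key}: {value}" for key, value in updates]
    query = "What is the current value of each key? " + ", ".join(keys)
    return "\n".join(lines) + "\n" + query, final


def score(answers, final):
    """Fraction of keys for which the answer equals the final value."""
    correct = sum(answers.get(key) == value for key, value in final.items())
    return correct / len(final)


prompt, final = build_pi_stream(["bp", "hr"], ["110", "120", "130"], updates_per_key=5)
print(prompt)
```

Raising `updates_per_key` increases the number of overwritten values per key, which is the quantity the abstract refers to as accumulating interference.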


Article 232

Title@2025-06-11 (3): Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation

Title: Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation Multi-Party überwacht Feinabstimmung von Sprachmodellen für die Mehrparteien-Dialog-Generierung 多党监督多党监督的多党对话生成语言模式微调 2412.05342v5

Authors (9): Xiaoyu Wang, Ningyuan Xi, Teng Chen, Qingqing Gu, Yue Zhao, Xiaokai Chen, Zhonglin Jiang, Yong Chen, Luo Ji

Large Language Models (LLMs) are usually fine-tuned to participate in dyadic (two-party) dialogues and do not adapt well to multi-party dialogues (MPD), which hinders their application in scenarios such as multi-person meetings, discussions, and daily communication. Previous LLM-based research mainly focuses on multi-agent frameworks, while the base LLMs are still fine-tuned on pairwise dialogue. In this work, we design a multi-party fine-tuning framework (MuPaS) for LLMs on multi-party dialogue datasets, and show that this straightforward framework lets the LLM align with the multi-party conversation style efficiently and effectively. We also design two training strategies that can convert MuPaS into an MPD simulator. Extensive experiments show that MuPaS achieves state-of-the-art multi-party responses, higher next-speaker prediction accuracy, and higher utterance quality under both human and automatic evaluation, and can even generate reasonable responses for out-of-distribution scenes, topics, and role descriptions. The MuPaS framework bridges LLM training with more complicated multi-party applications, such as conversation generation, virtual rehearsal, or the metaverse.

大型语言模型(LLM)通常经过微调,以参加多党对话或两党对话,这种对话无法适应多党对话,从而妨碍其在这种情景下的应用,包括多人会议、讨论和日常交流。以前以LLM为基础的研究主要侧重于多试框架,而其基础的LLM仍然是双轨微调。在这项工作中,我们为多党对话数据集中的LMS设计了一个多党微调框架(MUPAS),并证明这样一个直接的框架能够让LLM与多党对话风格(MPD)相配合,从而影响这些对话方式的高效和有效。我们还设计了两种培训战略,可以将MEPS转换为MPD模拟器。大量的实验表明,MuPAS可以实现最先进的多党反应,而其基础的LMS预测更准确性,提高人文和自动评价的语音质量,甚至可以合理地利用分配范围外的场景、主题和角色说明产生合理的效果。MUPAS框架将LM培训与更复杂的多党应用程序连接起来,例如虚拟演练。


Article 233

Title@2025-06-11 (3): ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts

Title: ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts ClimateViz: Benchmark für die statistische Begründung und Faktenverifizierung auf wissenschaftlichen Charts ClimateViz:科学图表统计推理和事实核查基准 2506.08700v2

Authors (4): Ruiran Su, Jiasheng Si, Zhijiang Guo, Janet B. Pierrehumbert

Scientific fact-checking has mostly focused on text and tables, overlooking scientific charts, which are key for presenting quantitative evidence and statistical reasoning. We introduce ClimateViz, the first large-scale benchmark for scientific fact-checking using expert-curated scientific charts. ClimateViz contains 49,862 claims linked to 2,896 visualizations, each labeled as support, refute, or not enough information. To improve interpretability, each example includes structured knowledge graph explanations covering trends, comparisons, and causal relations. We evaluate state-of-the-art multimodal language models, including both proprietary and open-source systems, in zero-shot and few-shot settings. Results show that current models struggle with chart-based reasoning: even the best systems, such as Gemini 2.5 and InternVL 2.5, reach only 76.2 to 77.8 percent accuracy in label-only settings, far below human performance (89.3 and 92.7 percent). Explanation-augmented outputs improve performance in some models. We released our dataset and code alongside the paper.

科学事实检查主要侧重于文本和表格,忽略科学图表,这些是提供定量证据和统计推理的关键。我们引入了ClimateViz,这是使用专家精准科学图表进行科学事实检查的第一个大规模基准。ClimateViz包含49 862项索赔,涉及2 896个可视化,每个被贴上支持、反驳或不够信息标签。为了提高可解释性,每个例子都包括结构化知识图表解释,涉及趋势、比较和因果关系。我们评估了最先进的多式语言模型,包括零点和几分镜头的专有和开放源系统。结果显示,目前的模型与基于图表的推理挣扎:即使是最好的系统,如Gemini 2.5和InternVL 2.5,也只达到76.2%至77.8%的精度,远远低于人的性能(89.3和92.7%)。解释性产出改进了一些模型的性能。我们将数据集和代码与文件一起发布。


Article 234

Title@2025-06-11 (3): Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language

Title: Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language Auf dem Weg zu offenem Stiftungs-Sprachenmodell und Corpus für Mazedonier: Eine Sprache mit geringen Ressourcen 走向马其顿开放基础语言模式:一种低资源语言 2506.09560v1

Authors (5): Stefan Krsteski, Matea Tashkovska, Borjan Sazdov, Hristijan Gjoreski, Branislav Gerazov

The increase in technological adoption worldwide comes with demands for novel tools to be used by the general population. Large Language Models (LLMs) provide a great opportunity in this respect, but their capabilities remain limited for low-resource languages, restricting applications in countries where such languages are spoken. We create several resources to facilitate the adoption of LLMs and to support research advancements for Macedonian. We collect the largest Macedonian corpus to date, consisting of 40GB of textual data and totaling 3.5B words. To support conversational applications, we collect a 106k-instance instruction dataset, carefully built to be culturally grounded. For evaluation, we construct a Macedonian evaluation suite covering seven benchmarks. Finally, we train domestic-yak, a state-of-the-art 8B-parameter model, on our curated datasets and evaluate it against eight baseline models using the newly constructed benchmark suite. Our model outperforms all existing models in the 8B parameter range across all benchmarks, and achieves performance comparable to models up to 10x larger. Furthermore, a qualitative analysis with native speakers reveals that our model is preferred over larger counterparts, receiving higher ratings for grammatical correctness and cultural appropriateness. All datasets, code, and model weights are openly released, setting a foundation for advancing LLMs in similarly underrepresented languages. These resources are publicly available at github.com/LVSTCK for source code, and at huggingface.co/LVSTCK for pretrained model weights and data.

全世界技术采用率的提高与对公众使用新工具的需求有关。大型语言模型(LLMS)在这方面提供了一个巨大的机会,但对于低资源语言而言,它们的能力仍然有限,限制了在使用这种语言的国家的应用。我们创造了一些资源,以促进采用LLMS,并支持马其顿的研究进展。我们收集了迄今为止最大的马其顿文版,包括40GB的文本数据,总共为3.5B字。为了支持对话应用,我们收集了106k-Instirate 教学数据集,这是精心建造的,以文化为基础。为了评估,我们建造了一套涵盖七项基准的马其顿评价套。最后,我们用最先进的8B参数模型培训了国内-yak,以方便于我们采用LMMS,并用8个基线模型来评估它。我们收集了迄今为止最大的马其顿文版,包括40GB的文本数据,总共为3.5B字词。为了达到10x以上的模型的性能。此外,我们与土著演讲者进行的质量分析表明,我们的模式比较大的模型要高得多,要接受高7个基准的马其顿文评价。最后8B-B参数模型,我们用最先进的8BM模型,在公开的SLVC数据库中,这些数据基础上是公开的重量和不断更新的代码。这些数据库中,这些数据库和不断更新。


Article 235

Title@2025-06-11 (3): Gender Bias in English-to-Greek Machine Translation

Title: Gender Bias in English-to-Greek Machine Translation Geschlecht Bias in Englisch-Griechische Maschinelle Übersetzung 英文到希腊语机器翻译中的性别偏见 2506.09558v1

Authors (3): Eleni Gkovedarou, Joke Daems, Luna De Bruyne

As the demand for inclusive language increases, concern has grown over the susceptibility of machine translation (MT) systems to reinforce gender stereotypes. This study investigates gender bias in two commercial MT systems, Google Translate and DeepL, focusing on the understudied English-to-Greek language pair. We address three aspects of gender bias: i) male bias, ii) occupational stereotyping, and iii) errors in anti-stereotypical translations. Additionally, we explore the potential of prompted GPT-4o as a bias mitigation tool that provides both gender-explicit and gender-neutral alternatives when necessary. To achieve this, we introduce GendEL, a manually crafted bilingual dataset of 240 gender-ambiguous and unambiguous sentences that feature stereotypical occupational nouns and adjectives. We find persistent gender bias in translations by both MT systems; while they perform well in cases where gender is explicitly defined, with DeepL outperforming both Google Translate and GPT-4o in feminine gender-unambiguous sentences, they are far from producing gender-inclusive or neutral translations when the gender is unspecified. GPT-4o shows promise, generating appropriate gendered and neutral alternatives for most ambiguous cases, though residual biases remain evident.

随着对包容性语言需求的增加,对强化性别陈规定型观念的机器翻译系统(MT)的易感性日益引起人们的关注,这项研究调查了两个商业MT系统(Google Translate和DeepL)中的性别偏见,重点是研究不足的英语至希腊语,我们处理性别偏见的三个方面:(一) 男性偏见,(二) 职业陈规定型观念,以及(三) 反陈规定型翻译中的错误。此外,我们探讨了促使GPT-4o作为减少偏见工具的潜力,必要时提供性别显露和性别中立的替代语言。为此,我们引入了GendEL,这是一套人工制作的双语数据集,有240个性别模糊和毫不含糊的词句,以陈规定型的职业名和形容为主。我们发现,这两个MT系统在翻译中都长期存在性别偏见;虽然DeepL在性别定义上表现优于Google Translate和GPT-4o,但在性别模糊性判决中都远远没有产生性别包容性或中性化的翻译。


Article 236

Title@2025-06-11 (3): Societal AI Research Has Become Less Interdisciplinary

Title: Societal AI Research Has Become Less Interdisciplinary Gesellschaftliche KI-Forschung ist weniger interdisziplinär geworden 社会AI研究的跨学科程度越来越低 2506.08738v2

Authors (3): Dror Kris Markus, Fabrizio Gilardi, Daria Stetsenko

As artificial intelligence (AI) systems become deeply embedded in everyday life, calls to align AI development with ethical and societal values have intensified. Interdisciplinary collaboration is often championed as a key pathway for fostering such engagement. Yet it remains unclear whether interdisciplinary research teams are actually leading this shift in practice. This study analyzes over 100,000 AI-related papers published on ArXiv between 2014 and 2024 to examine how ethical values and societal concerns are integrated into technical AI research. We develop a classifier to identify societal content and measure the extent to which research papers express these considerations. We find a striking shift: while interdisciplinary teams remain more likely to produce societally-oriented research, computer science-only teams now account for a growing share of the field’s overall societal output. These teams are increasingly integrating societal concerns into their papers and tackling a wide range of domains - from fairness and safety to healthcare and misinformation. These findings challenge common assumptions about the drivers of societal AI and raise important questions. First, what are the implications for emerging understandings of AI safety and governance if most societally-oriented research is being undertaken by exclusively technical teams? Second, for scholars in the social sciences and humanities: in a technical field increasingly responsive to societal demands, what distinctive perspectives can we still offer to help shape the future of AI?

随着人工智能(AI)系统深入地植根于日常生活中,要求将AI发展与伦理和社会价值观接轨的呼声得到加强;跨学科合作常常被提倡,作为促进这种参与的关键途径;然而,跨学科研究团队实际上是否正在实际领导这一转变,目前仍不清楚;这项研究分析了2014年至2024年在ArXiv出版的10万多份与AI有关的论文,以研究如何将道德价值观和社会关切纳入AI的技术研究;我们开发了一个分类师,以确定社会内容并衡量研究论文表达这些考虑的程度;我们发现一个显著的转变:跨学科团队仍然更有可能产生面向社会的研究,而计算机科学小组现在却占外地总体社会产出的越来越大的份额;这些团队正在越来越多地将社会关切纳入其论文,并处理从公平和安全到保健和错误信息等广泛领域。这些研究结果挑战关于社会AI驱动因素的共同假设,并提出了重要问题。首先,如果大多数面向社会的研究都由技术团队进行,则会对新出现的对AI安全和治理的理解产生什么影响?第二,对于社会科学和人类学的学者来说,我们如何提供独特的未来社会视角:在技术领域,我们越来越需要一个具有独特性的社会领域,我们越来越适应新的社会形象。


Article 237

Title@2025-06-11 (3): MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions

Title: MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions MEDUSA: Multimodale Deep Fusion Multi-Stage Training Framework zur Sprachemotionserkennung unter naturalistischen Bedingungen MEDUSA:自然条件下语音情感识别的多模态深度融合多阶段训练框架 2506.09556v1

Authors (8): Georgios Chatzichristodoulou, Despoina Kosmopoulou, Antonios Kritikos, Anastasia Poulopoulou, Efthymios Georgiou, Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos

Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline, which effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism from pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025: Speech Emotion Recognition in Naturalistic Conditions Challenge.

由于人类情感的主观性及其在自然条件下的不均衡分布,语音情感识别(SER)是一项具有挑战性的任务。我们提出MEDUSA,这是一个采用四阶段训练流程的多模态框架,能有效处理类别不平衡和情感模糊问题。前两个阶段训练一组使用DeepSER的分类器集成,DeepSER是对基于预训练自监督声学和语言表示的深度跨模态Transformer融合机制的新扩展。Manifold MixUp被用于进一步正则化。后两个阶段优化一个可训练的元分类器,用于组合集成的预测。我们的训练方法将人工标注分数作为软目标,并结合平衡数据采样和多任务学习。MEDUSA在Interspeech 2025“自然条件下的语音情感识别”挑战赛的任务1(分类情感识别)中排名第一。
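The idea of using human annotation scores as soft targets can be sketched in a few lines. This is a generic soft-label cross-entropy, assuming the soft target is the normalized vote count over annotators; the abstract does not state how MEDUSA derives its targets, so `soft_targets` and `soft_cross_entropy` are illustrative names rather than the authors' implementation.

```python
import math


def soft_targets(votes, num_classes):
    """Turn a list of annotator votes (class indices) into a distribution."""
    counts = [0] * num_classes
    for vote in votes:
        counts[vote] += 1
    return [count / len(votes) for count in counts]


def soft_cross_entropy(logits, targets):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    peak = max(logits)
    # Log of the softmax normalizer, computed stably.
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, targets))


# Four annotators: two chose class 0, one class 1, one class 2.
targets = soft_targets([0, 0, 1, 2], num_classes=3)
loss = soft_cross_entropy([2.0, 0.5, 0.1], targets)
print(targets, round(loss, 4))
```

Compared with a one-hot label, the soft target keeps the annotators' disagreement in the training signal, which is one way to handle the emotion ambiguity the abstract mentions.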


Article 238

Title@2025-06-11 (3): KG-Infused RAG: Augmenting Corpus-Based RAG with External Knowledge Graphs

Title: KG-Infused RAG: Augmenting Corpus-Based RAG with External Knowledge Graphs KG-Infused RAG: Erweiterung korpusbasierter RAG mit externen Wissensgraphen KG-Infused RAG:利用外部知识图谱增强基于语料库的RAG 2506.09542v1

Authors (5): Dingjun Wu, Yukun Yan, Zhenghao Liu, Zhiyuan Liu, Maosong Sun

Retrieval-Augmented Generation (RAG) improves factual accuracy by grounding responses in external knowledge. However, existing methods typically rely on a single source, either unstructured text or structured knowledge. Moreover, they lack cognitively inspired mechanisms for activating relevant knowledge. To address these issues, we propose KG-Infused RAG, a framework that integrates KGs into RAG systems to implement spreading activation, a cognitive process that enables concept association and inference. KG-Infused RAG retrieves KG facts, expands the query accordingly, and enhances generation by combining corpus passages with structured facts, enabling interpretable, multi-source retrieval grounded in semantic structure. We further improve KG-Infused RAG via preference learning on sampled key stages in the pipeline. Experiments on five QA benchmarks show that KG-Infused RAG consistently outperforms vanilla RAG (by 3.8% to 13.8%). Additionally, when integrated into Self-RAG, KG-Infused RAG brings further performance gains, demonstrating its effectiveness and versatility as a plug-and-play enhancement module for corpus-based RAG methods.

检索增强生成(RAG)通过将回答建立在外部知识之上来提高事实准确性。然而,现有方法通常依赖单一来源,即非结构化文本或结构化知识。此外,它们缺乏受认知启发的相关知识激活机制。为解决这些问题,我们提出KG-Infused RAG,这是一个将知识图谱(KG)整合进RAG系统以实现扩散激活的框架,扩散激活是一种支持概念关联和推理的认知过程。KG-Infused RAG检索KG事实,据此扩展查询,并通过将语料库段落与结构化事实相结合来增强生成,从而实现基于语义结构的可解释多源检索。我们通过对流程中采样的关键阶段进行偏好学习,进一步改进KG-Infused RAG。在五个问答基准上的实验表明,KG-Infused RAG始终优于普通RAG(提升3.8%至13.8%)。此外,当集成到Self-RAG中时,KG-Infused RAG带来了进一步的性能提升,展示了其作为基于语料库的RAG方法的即插即用增强模块的有效性和通用性。
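Spreading activation and the query expansion that follows it can be sketched on a toy graph. This is a classic decay-based formulation, given here only as an assumption-laden illustration: the abstract does not describe how KG-Infused RAG selects or scores facts, and the graph, `decay`, `threshold`, and `max_hops` values are all hypothetical.

```python
def spreading_activation(graph, seeds, decay=0.5, threshold=0.1, max_hops=2):
    """Propagate activation from seed entities along KG edges.

    graph maps each node to a list of neighbor nodes. Activation is
    multiplied by `decay` per hop and dropped once below `threshold`.
    """
    activation = {node: 1.0 for node in seeds}
    frontier = dict(activation)
    for _ in range(max_hops):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbor in graph.get(node, []):
                passed = energy * decay
                if passed < threshold:
                    continue
                # Keep the strongest activation seen for each node.
                if passed > activation.get(neighbor, 0.0):
                    activation[neighbor] = passed
                    next_frontier[neighbor] = passed
        frontier = next_frontier
    return activation


def expand_query(query, activation, seeds):
    """Append activated non-seed entities to the query, strongest first."""
    extra = sorted((n for n in activation if n not in seeds), key=lambda n: -activation[n])
    return query + " " + " ".join(extra) if extra else query


graph = {
    "insulin": ["pancreas", "diabetes"],
    "diabetes": ["glucose"],
    "glucose": ["metabolism"],
}
activation = spreading_activation(graph, ["insulin"])
print(expand_query("insulin function", activation, ["insulin"]))
```

The expanded query would then be sent to the corpus retriever, so that passages and KG facts can be combined at generation time.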


Article 239

Title@2025-06-11 (3): Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Title: Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning Lingshu: Ein generalistisches Basismodell für einheitliches multimodales medizinisches Verstehen und Schlussfolgern Lingshu:面向统一多模态医学理解与推理的通用基础模型 2506.07044v3

Authors (19): LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu’s medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks …

多式大语言模型(MLLM)在理解共同视觉要素方面表现出令人印象深刻的能力,这主要是由于其大规模数据集和高级培训战略,然而,医疗应用的效力仍然有限,因为医疗假设情景和一般领域的数据和任务之间固有的差异。具体地说,现有医疗应用模型面临以下关键限制:(1) 医学模型的覆盖范围有限,超出成像范围,(2) 由于数据整理过程不理想,对幻觉的敏感度增加;(3) 缺乏针对复杂医学情景的推理能力。为了应对这些挑战,我们首先提议一个全面的数据校正程序,即(1) 不仅从医疗成像中,而且从广泛的医学文本和一般数据中,有效地获得丰富的医学知识数据;(2) 综合准确的医疗说明、直观回答(VQA)和推理样本。结果,我们建立一个多式数据集,以广泛的医学知识丰富。我们介绍了医学模型:Lingshu。 Lingshu。 Lusshu进行多阶段培训,以纳入医学专门知识,并逐步加强其任务执行能力。此外,我们先先是探索现有医学标准化框架,然后是不断研究。


Article 240

Title@2025-06-11 (3): GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models

Title: GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models GenBreak: Red Teaming von Text-zu-Bild-Generatoren mit großen Sprachmodellen GenBreak:使用大语言模型对文本到图像生成器进行红队测试 2506.10047v1

Authors (6): Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo Wang, Xingjun Ma, Yu-Gang Jiang

Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are now widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by determined adversaries. Recent research on red-teaming and adversarial attacks against T2I models has notable limitations: some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities in T2I generators. Our approach combines supervised fine-tuning on curated datasets with reinforcement learning via interaction with a surrogate T2I model. By integrating multiple reward signals, we guide the LLM to craft adversarial prompts that enhance both evasion capability and image toxicity, while maintaining semantic coherence and diversity. These prompts demonstrate strong effectiveness in black-box attacks against commercial T2I generators, revealing practical and concerning safety weaknesses.

虽然大多数平台都使用内容调适系统,但潜在的弱点仍然可以被坚定的对手利用。最近关于对T2I模型的红队和对立攻击的研究有显著的局限性:一些研究成功地生成了剧毒图像,但使用很容易被安全过滤器发现和阻止的对立信号,而其他研究则侧重于绕过安全机制,但未能产生真正有害的产出,忽略发现真正高风险的提示,从而忽略了发现真正高风险的提示。因此,大多数平台仍然缺乏可靠的工具来评价捍卫的T2I模型的安全性。为了解决这一差距,我们提议GenBreak,这是一个精细调整红队和对T2I模型的大型语言模型(LLM)的框架,以便系统地探索T2I发电机的潜在弱点。我们的方法是监督对调控的数据集的微调,并通过一个强大的T2I模型进行强化学习。我们通过整合多种敌对性2I模型,在加强关于耐受保护的T2号模型和耐受攻击性安全性方面,我们指导的LM系统,同时展示关于耐受创性攻击的系统。


Article 241

Title@2025-06-11 (3): Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Title: Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models Athena: Verbesserung des multimodalen Schlussfolgerns mit dateneffizienten Prozess-Belohnungsmodellen Athena:利用数据高效的过程奖励模型增强多模态推理 2506.09532v1

Authors (6): Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling of negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward ranked fine-tuning. Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench and outperforms the previous SoTA by 3.9 F1 points, showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.

我们提出Athena-PRM,这是一个多式联运进程奖励模式,目的是评估解决复杂推理问题的每一步骤的奖励分数。开发高绩效的PRM通常需要大量的时间和财政投资,这主要是因为需要分级推理步骤的说明。传统自动标签方法,如Monte Carlo估计,常常产生吵闹标签,并产生大量计算成本。为了高效生成高质量的流程标签数据,我们建议利用弱和强的完整者之间的预测一致性,作为确定可靠流程标签的标准。值得注意的是,Athena-PRM展示了各种情景和基准的出色效果,只有5 000个样本。此外,我们还制定了两个有效的战略来改进PRM的业绩:ORM初始化和反向数据抽样。我们在三种具体假设中验证了我们的方法:测试时间的核查、对逻辑正确度的直接评价以及奖励微调。我们的Athena-PRM始终在多个基准和假设中实现优异性业绩。 值得注意的是,当我们使用Qwen2.5-V-7B作为政策模型时,只有5,000个样本样本。此外,Athena-PRM的正值比值比值比值比值比值评估了其前一个显著的成绩测试。
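The consistency criterion between weak and strong completers can be illustrated with a small filter. This is a sketch under assumptions: each reasoning step is given a hard Monte Carlo style label (1 if any sampled completion from that step reaches the correct answer), and a step is kept only when both completers agree. The abstract does not spell out the labeling rule, so `mc_step_label` and `consistent_labels` are hypothetical.

```python
def mc_step_label(outcomes):
    """Hard label for a step: 1 if any sampled completion was correct."""
    return int(any(outcomes))


def consistent_labels(weak_outcomes, strong_outcomes):
    """Keep only steps whose labels agree between the two completers.

    Each argument is a list with one entry per reasoning step; an entry
    is a list of booleans, one per sampled completion from that step.
    Returns (step_index, label) pairs for the retained steps.
    """
    kept = []
    for index, (weak, strong) in enumerate(zip(weak_outcomes, strong_outcomes)):
        weak_label = mc_step_label(weak)
        strong_label = mc_step_label(strong)
        if weak_label == strong_label:
            kept.append((index, weak_label))
    return kept


weak = [[True, False], [False, False], [False, False]]
strong = [[True, True], [True, False], [False, False]]
print(consistent_labels(weak, strong))
```

In this toy input the second step is discarded because only the strong completer can recover from it, which makes its label ambiguous.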


Article 242

Title@2025-06-11 (3): AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP

Title: AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP AraReasoner: Bewertung von Reasoning-basierten LLMs für arabisches NLP AraReasoner:评估面向阿拉伯语NLP的推理型LLM 2506.08768v2

Authors (6): Ahmed Hasanaath, Aisha Alansari, Ahmed Ashraf, Chafik Salmane, Hamzah Luqman, Saad Ezzini

Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications to examine their capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks, boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://anonymous.4open.science/r/AraReasoner41299

大型语言模型(LLMS)在推理能力和一般自然语言处理(NLP)任务方面取得了显著进展,然而,它们在阿拉伯语数据方面的表现,以丰富的形态、多种方言和复杂的脚本为特征,仍然没有得到充分探讨。本文件对多个注重推理的LLMS进行了全面的基准研究,特别强调了新推出的DeepSeek模型,涉及15项阿拉伯文国家语言模型任务,共15项。我们试验了各种战略,包括零点、微点、微点和微调。这使我们能够系统地评价数据集的性能,涵盖各种应用,以审查不同复杂程度的语言推理能力。我们的实验揭示了几项关键结论。首先,只要仔细选择三个文本中的例子,就能平均提升13个F1点,从35.3%到87.5%的分类任务加速情绪分析,并从56.1%到87.0%的检测。第二,以理性为主的DeepSeek结构超越了一个强大的GPTO4-mini基线,以12 F1点的平均F1点为基础,在零点1/FScialScial 中,比重的TRA1级标准设置。


Article 243

Title@2025-06-11 (3): Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

Title: Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs Sehen Sie noch einmal hin: Sprachprior in Vision-Tokens offenlegen für eine effiziente, geführte Dekodierung von LVLMs 重新审视你所看到的:揭示视觉标记中的语言先验,实现LVLM的高效引导解码 2506.09522v1

Authors (2): Beomsik Cho, Jaehyung Kim

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks by integrating visual perception with language understanding. However, conventional decoding strategies of LVLMs often fail to utilize visual information effectively, leading to visually ungrounded responses. While various approaches have been proposed to address this limitation, they typically require additional training, multi-step inference procedures, or external model dependencies. This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in LVLMs. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. This selected vision token is then used to refine the output distribution to better incorporate visual semantics. Experiments on three LVLM hallucination benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead. Moreover, our method achieves competitive or superior results relative to state-of-the-art baselines while reducing computational costs by up to $2\times$.

大型视觉语言模型(LVLMS)通过将视觉观念与语言理解结合起来,在多种多式联运任务中表现出了显著的成绩。然而,LVLM的常规解码战略往往未能成功地利用视觉信息,导致目光不明的反应。虽然提出了各种办法来解决这一限制,但通常需要额外的培训、多步推论程序或外部模型依赖性。本文介绍了ReVisiT,这是一种简单而有效的解码方法,它引用视觉标志来指导LVLMS的文本生成过程。我们的方法利用视觉符号中所含的语义信息,将它们投射到文本符号分发空间,并动态地在限制差异最小化的每个步骤中选择最相关的解码符号。随后,这一选定的视觉标志被用来改进产出分配,以更好地纳入视觉语义。与最近两个LVIM的LVMS实验显示,ReVisiT在最小的计算间接费用下不断提高视觉基础。此外,我们的方法在降低计算成本至2美元时,实现了与州基准的竞争性或优等结果。


Article 244

Title@2025-06-11 (3): You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks

Title: You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks Sie sind, was Sie sagen: Ausnutzen sprachlicher Inhalte für VoicePrivacy Attacks 你说的就是: 利用语言内容进行语音隐私攻击 2506.09521v1

Authors (7): Ünal Ege Gaznepoglu, Anna Leschanowsky, Ahmad Aloradi, Prachi Singh, Daniel Tenbrinck, Emanuël A. P. Habets, Nils Peters

Speaker anonymization systems hide the identity of speakers while preserving other information such as linguistic content and emotions. To evaluate their privacy benefits, attacks in the form of automatic speaker verification (ASV) systems are employed. In this study, we assess the impact of intra-speaker linguistic content similarity in the attacker training and evaluation datasets, by adapting BERT, a language model, as an ASV system. On the VoicePrivacy Attacker Challenge datasets, our method achieves a mean equal error rate (EER) of 35%, with certain speakers attaining EERs as low as 2%, based solely on the textual content of their utterances. Our explainability study reveals that the system decisions are linked to semantically similar keywords within utterances, stemming from how LibriSpeech is curated. Our study suggests reworking the VoicePrivacy datasets to ensure a fair and unbiased evaluation and challenges the reliance on global EER for privacy evaluations.

语音匿名系统在保存语言内容和情绪等其他信息的同时隐藏了发言者的身份。 为了评估他们的隐私利益,使用了自动语音核实系统(ASV)形式的攻击。在本研究中,我们评估攻击者培训和评价数据集中语音语言内容相似性的影响,将语言模式BERT改编为ASV系统。在语音私隐攻击者挑战数据集中,我们的方法达到平均平均等误率35%,某些发言者达到EERs的低至2%,仅以其语句的文字内容为基础。我们的解释性研究显示,系统决定与语句中的语义相似的关键词相联系,源于LibriSpeech是如何被曲解的。我们的研究建议对语音私隐数据集进行重新工作,以确保公平公正的评价,并对依赖全球ER进行隐私评估提出质疑。
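The equal error rate quoted above is the operating point at which the false acceptance rate equals the false rejection rate; for an attacker, a value near 50% means chance-level re-identification and a lower value means a more successful attack. The following is a minimal sketch of how an EER can be approximated from trial scores. The function name and the threshold sweep are illustrative assumptions, not the evaluation code used by the VoicePrivacy Attacker Challenge.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    (Hypothetical helper for illustration only.)
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one same-speaker trial overlaps with the impostor scores.
print(equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1]))  # 0.25
```

Because this value is averaged over all trials, a global EER can hide individual speakers whose error rates are far lower, which is the concern the abstract raises.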


Article 245

Title@2025-06-11 (3): ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Title: ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning ReasonMed: Ein 370K Multi-Agent Generated Dataset zur Verbesserung der medizinischen Vernunft 理由:用于推进医疗理由的A370K多代理生成数据集 2506.09513v1

Authors (10): Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu

Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a \textit{multi-agent verification and refinement process}, where we design an \textit{Error Refiner} to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17\% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60\%.

尽管基于推理的大型语言模型(LLMs)在数学和编程方面成绩卓越,但它们在知识密集型医疗问题解答方面的能力仍未得到充分探讨。为了解决这个问题,我们引入了理性Med,这是最大的医学推理数据集,由来自各种LLMs 最初170万条推理路径的370千个高质量的实例组成。理性Med是通过一个\ textit{多剂核查和精细过程}构建的,我们设计了一个\ textit{Error Refiner} 来强化推理路径,方法是查明并纠正核查员所标注的易出错误的步骤。 运用理性Med,我们系统地调查了培训医学推理模型的最佳做法,发现将详细的链式推理与简明解答摘要相结合,可以产生最有效的微调战略。基于这一战略,我们培训理性Med-7B,为子10B模型设定了新的基准,比前最佳标准高4.17,甚至超过PubMedQA上的LaMA3.1-70B。


Article 246

Title@2025-06-11 (3): Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

Title: Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning Gib mir FP32 oder gib mir den Tod? Herausforderungen und Lösungen für reproduzierbare Vernunft 给我FP32还是给我死亡? 2506.09501v1

Authors (10): Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu

Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configurations such as evaluation batch size, GPU count, and GPU version can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and a difference of 9,000 tokens in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision, while critical for reproducibility, is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.

大型语言模型(LLMs)如今已融入各个领域,并展现出令人印象深刻的性能。然而,进展的前提是基准分数既准确又可复现。我们证明LLM性能的可复现性是脆弱的:改变系统配置,如评估批量大小、GPU数量和GPU版本,会使生成的响应出现显著差异。这个问题在推理模型中尤为突出,早期token中微小的舍入差异会级联成不同的思维链,最终影响准确性。例如,在bfloat16精度和贪婪解码下,DeepSeek-R1-Distill-Qwen-7B这样的推理模型会因GPU数量、类型和评估批量大小的不同而出现最高9%的准确率波动和9,000个token的响应长度差异。我们将这种差异的根源追溯到有限数值精度下浮点运算的非结合性。这项工作首次系统研究了数值精度如何影响LLM推理的可复现性。通过在各种硬件、软件和精度设置下精心控制的实验,我们量化了模型输出何时以及如何出现分歧。我们的分析表明,浮点精度虽然对可复现性至关重要,却常常在评估实践中被忽视。受此启发,我们开发了一个名为LayerCast的轻量级推理流程,它以16位精度存储权重,但以FP32执行所有计算,在内存效率与数值稳定性之间取得平衡。代码见 https://github.com/nanomaoli/llm_reproducibility。
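The non-associativity that the abstract identifies as the root cause can be reproduced with a few lines of standard-library Python. The sketch below is a generic illustration of the phenomenon, not code from the paper or from LayerCast; the helper names `to_fp32` and `sum_fp32` are hypothetical. It rounds every partial sum to FP32 and shows that the same three numbers give different totals depending on accumulation order, which is what happens when batch size or GPU count changes the order of a reduction.

```python
import struct


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_fp32(values):
    """Sum left to right, rounding to FP32 after every addition."""
    total = 0.0
    for v in values:
        total = to_fp32(total + to_fp32(v))
    return total


# Same three numbers, two accumulation orders.
order_a = sum_fp32([1e8, 1.0, -1e8])  # the 1.0 is absorbed by 1e8 in FP32
order_b = sum_fp32([1e8, -1e8, 1.0])  # the large terms cancel first
print(order_a, order_b)  # 0.0 1.0

# FP64 is more precise, but addition is still not associative.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

Lower-precision formats such as bfloat16 carry fewer mantissa bits than FP32, so absorption of this kind occurs at much smaller magnitudes, which is consistent with the abstract's motivation for performing computations in FP32.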


Article 247

Title@2025-06-11 (3): EnrichEvent: Enriching Social Data with Contextual Information for Emerging Event Extraction

Title: EnrichEvent: Enriching Social Data with Contextual Information for Emerging Event Extraction EnrichEvent: Anreicherung sozialer Daten mit kontextuellen Informationen zur Gewinnung neuer Ereignisse 丰富事件:丰富社会数据,提供新事件采掘时的背景信息 2307.16082v8

Authors (2): Mohammadali Sefidi Esfahani, Mohammad Akbari

Social platforms have emerged as crucial venues for distributing information and discussing social events, offering researchers an excellent opportunity to design and implement novel event detection frameworks. Identifying unspecified events and detecting events without prior knowledge enables governments, aid agencies, and experts to respond swiftly and effectively to unfolding situations, such as natural disasters, by assessing severity and optimizing aid delivery. Social data is characterized by misspellings, incompleteness, word-sense ambiguity, and irregular language. While discussing an ongoing event, users share different opinions and perspectives based on their prior experience, background, and knowledge. Prior works primarily leverage tweets’ lexical and structural patterns to capture users’ opinions and views about events. In this study, we propose a novel end-to-end framework, EnrichEvent, to identify unspecified events from streaming social data. In addition to lexical and structural patterns, we leverage contextual knowledge of the tweets to enrich their representation and gain a better perspective on users’ opinions about events. Compared to our baselines, the EnrichEvent framework achieves the highest values for Consolidation outcome with an average of 87% vs. 67% and the lowest for Discrimination outcome with an average of 10% vs. 16%. Moreover, the Trending Data Extraction module in the EnrichEvent framework improves efficiency, reducing Runtime by up to 50% by identifying and discarding irrelevant tweets within message blocks, which makes the framework highly scalable for processing streaming data. Our source code and dataset are available in our official replication package.

社交平台已成为传播信息和讨论社会事件的重要平台,为研究人员提供了设计和实现新型事件检测框架的极好机会。识别未指定的事件并在没有先验知识的情况下检测事件,使政府、援助机构和专家能够通过评估严重程度和优化援助交付,对自然灾害等正在发生的情况迅速有效地作出反应。社交数据的特点是拼写错误、不完整、词义歧义和语言不规范。在讨论正在发生的事件时,用户会根据各自的经历、背景和知识分享不同的意见和观点。先前的工作主要利用推文的词汇和结构模式来捕捉用户对事件的意见和看法。在本研究中,我们提出了一个新的端到端框架EnrichEvent,用于从流式社交数据中识别未指定的事件。除了词汇和结构模式外,我们还利用推文的上下文知识来丰富其表示,从而更好地了解用户对事件的看法。与我们的基线相比,EnrichEvent框架在Consolidation指标上取得了最高值(平均87%对67%),在Discrimination指标上取得了最低值(平均10%对16%)。此外,EnrichEvent框架中的Trending Data Extraction模块通过识别并丢弃消息块中不相关的推文,将运行时间最多减少50%,使该框架在处理流数据时具有很强的可扩展性。我们的源代码和数据集可在官方复现包中获取。


Article 248

Title@2025-06-11 (3): CiteFusion: An Ensemble Framework for Citation Intent Classification Harnessing Dual-Model Binary Couples and SHAP Analyses

Title: CiteFusion: An Ensemble Framework for Citation Intent Classification Harnessing Dual-Model Binary Couples and SHAP Analyses CiteFusion: Ein Ensemble-Framework für Citation Intent Classification Harnessing Dual-Model Binäre Paare und SHAP-Analysen CiteFusion:双重模式二进制组合和SHAP分析引文内分级利用集合框架 2407.13329v3

Authors (6): Lorenzo Paolini, Sahar Vahdati, Angelo Di Iorio, Robert Wardenga, Ivan Heibi, Silvio Peroni

Understanding the motivations underlying scholarly citations is essential to evaluate research impact and promote transparent scholarly communication. This study introduces CiteFusion, an ensemble framework designed to address the multi-class Citation Intent Classification task on two benchmark datasets: SciCite and ACL-ARC. The framework employs a one-vs-all decomposition of the multi-class task into class-specific binary sub-tasks, leveraging complementary pairs of SciBERT and XLNet models, independently tuned, for each citation intent. The outputs of these base models are aggregated through a feedforward neural network meta-classifier to reconstruct the original classification task. To enhance interpretability, SHAP (SHapley Additive exPlanations) is employed to analyze token-level contributions and interactions among base models, providing transparency into the classification dynamics of CiteFusion and insights into the kinds of misclassifications made by the ensemble. In addition, this work investigates the semantic role of structural context by incorporating section titles, as framing devices, into input sentences, assessing their positive impact on classification accuracy. CiteFusion ultimately demonstrates robust performance in imbalanced and data-scarce scenarios: experimental results show that CiteFusion achieves state-of-the-art performance, with Macro-F1 scores of 89.60% on SciCite, and 76.24% on ACL-ARC. Furthermore, to ensure interoperability and reusability, citation intents from both datasets' schemas are mapped to Citation Typing Ontology (CiTO) object properties, highlighting some overlaps. Finally, we describe and release a web-based application that classifies citation intents leveraging the CiteFusion models developed on SciCite.

理解学术引用背后的动机,对于评估研究影响和促进透明的学术交流至关重要。本研究提出了CiteFusion,这是一个集成框架,用于在SciCite和ACL-ARC两个基准数据集上解决多类别引用意图分类任务。该框架将多类别任务按一对多(one-vs-all)方式分解为针对各类别的二分类子任务,并为每种引用意图使用一对独立调优、相互补充的SciBERT和XLNet模型。这些基础模型的输出通过前馈神经网络元分类器聚合,以重建原始分类任务。为了提高可解释性,我们使用SHAP(SHapley Additive exPlanations)分析token级贡献以及基础模型之间的相互作用,从而揭示CiteFusion的分类动态,并帮助了解集成模型的误分类类型。此外,本工作将章节标题作为框架手段并入输入句子,考察结构性上下文的语义作用,并评估其对分类准确率的积极影响。CiteFusion最终在类别不平衡和数据稀缺的场景下表现稳健:实验结果表明,CiteFusion达到了最先进的性能,Macro-F1分数在SciCite上为89.60%,在ACL-ARC上为76.24%。此外,为确保互操作性和可复用性,两个数据集模式中的引用意图均被映射到Citation Typing Ontology(CiTO)的对象属性,并显示出一些重叠。最后,我们介绍并发布了一个基于Web的应用,它利用在SciCite上开发的CiteFusion模型对引用意图进行分类。


Article 249

Title@2025-06-11 (3): Bridging Online Behavior and Clinical Insight: A Longitudinal LLM-based Study of Suicidality on YouTube Reveals Novel Digital Markers

Title: Bridging Online Behavior and Clinical Insight: A Longitudinal LLM-based Study of Suicidality on YouTube Reveals Novel Digital Markers Bridging Online Behavior and Clinical Insight: Eine längsseitige LLM-basierte Studie zur Suizidalität auf YouTube enthüllt neuartige digitale Marker 架桥在线行为和临床透视:YouTube Reveals新数字标记上基于纵向LLM的自杀性纵向LM研究 2506.09495v1

Authors (7): Ilanit Sobol, Shir Lissak, Refael Tikochinski, Tal Nakash, Anat Brunstein Klomek, Eyal Fruchter, Roi Reichart

Suicide remains a leading cause of death in Western countries, underscoring the need for new research approaches. As social media becomes central to daily life, digital footprints offer valuable insight into suicidal behavior. Focusing on individuals who attempted suicide while uploading videos to their channels, we investigate: How do suicidal behaviors manifest on YouTube, and how do they differ from expert knowledge? We applied complementary approaches: computational bottom-up, hybrid, and expert-driven top-down, on a novel longitudinal dataset of 181 YouTube channels from individuals with life-threatening attempts, alongside 134 control channels. In the bottom-up approach, we applied LLM-based topic modeling to identify behavioral indicators. Of 166 topics, five were associated with suicide attempts, with two also showing temporal attempt-related changes ($p<.01$): Mental Health Struggles ($+0.08$)* and YouTube Engagement ($+0.1$)*. In the hybrid approach, a clinical expert reviewed LLM-derived topics and flagged 19 as suicide-related. However, none showed significant attempt-related temporal effects beyond those identified bottom-up. Notably, YouTube Engagement, a platform-specific indicator, was not flagged by the expert, underscoring the value of bottom-up discovery. In the top-down approach, psychological assessment of suicide attempt narratives revealed that the only significant difference between individuals who attempted before their upload period and those who attempted during it was the motivation to share this experience: the former aimed to Help Others ($\beta=-1.69$, $p<.01$), while the latter framed it as part of their Personal Recovery ($\beta=1.08$, $p<.01$). By integrating these approaches, we offer a nuanced understanding of suicidality, bridging digital behavior and clinical insights. * Within-group changes in relation to the suicide attempt.

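自杀仍然是西方国家的主要死因之一,这凸显了对新研究方法的需求。随着社交媒体成为日常生活的核心,数字足迹为了解自杀行为提供了宝贵的视角。我们聚焦于在向自己频道上传视频期间曾尝试自杀的个体,研究:自杀行为在YouTube上如何表现,它们与专家知识有何不同?我们在一个新的纵向数据集上采用了三种互补的方法:计算式自下而上、混合式以及专家驱动的自上而下;该数据集包含181个来自曾有危及生命的自杀尝试者的YouTube频道,以及134个对照频道。在自下而上的方法中,我们使用基于LLM的主题建模来识别行为指标。在166个主题中,有5个与自杀尝试相关,其中2个还表现出与尝试相关的时间变化($p<.01$):心理健康困扰(Mental Health Struggles,$+0.08$)*和YouTube互动(YouTube Engagement,$+0.1$)*。在混合方法中,一位临床专家审查了LLM得出的主题,并将其中19个标记为与自杀相关。然而,除自下而上方法已识别的主题之外,没有其他主题表现出显著的、与尝试相关的时间效应。值得注意的是,YouTube互动这一平台特有的指标并未被专家标记,凸显了自下而上发现的价值。在自上而下的方法中,对自杀尝试叙述的心理学评估显示,在上传期之前尝试自杀的人与在上传期间尝试自杀的人之间,唯一显著的差异在于分享这一经历的动机:前者旨在帮助他人(Help Others,$\beta=-1.69$,$p<.01$),而后者将其视为个人康复(Personal Recovery,$\beta=1.08$,$p<.01$)的一部分。通过整合这些方法,我们对自杀倾向给出了细致的理解,将数字行为与临床洞见联系起来。* 与自杀尝试相关的组内变化。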


Article 250

Title@2025-06-11 (3): CASPER: A Large Scale Spontaneous Speech Dataset

Title: CASPER: A Large Scale Spontaneous Speech Dataset CASPER: Ein großer Spontane-Sprachdatensatz 大规模自应自发语音数据集 2506.00267v3

Authors (11): Cihan Xiao, Ruixing Liang, Xiangyu Zhang, Mehmet Emre Tiryaki, Veronica Bae, Lavanya Shankar, Rong Yang, Ethan Poon, Emmanuel Dupoux, Sanjeev Khudanpur, Leibny Paola Garcia Perera

The success of large language models has driven interest in developing similar speech processing capabilities. However, a key challenge is the scarcity of high-quality spontaneous speech data, as most existing datasets contain scripted dialogues. To address this, we present a novel pipeline for eliciting and recording natural dialogues and release our dataset with 100+ hours of spontaneous speech. Our approach fosters fluid, natural conversations while encouraging a diverse range of topics and interactive exchanges. Unlike traditional methods, it facilitates genuine interactions, providing a reproducible framework for future data collection. This paper introduces our dataset and methodology, laying the groundwork for addressing the shortage of spontaneous speech data. We plan to expand this dataset in future stages, offering a growing resource for the research community.

大型语言模式的成功激发了人们对发展类似语言处理能力的兴趣,然而,关键的挑战在于缺少高质量的自发语音数据,因为大多数现有数据集都包含文字对话。为了解决这个问题,我们提出了一个新的管道,用于收集和记录自然对话,并以100小时以上自发演讲的方式发布我们的数据集。我们的方法培养流动、自然对话,同时鼓励各种各样的专题和互动交流。与传统方法不同,它促进真正的互动,为今后的数据收集提供一个可复制的框架。本文介绍了我们的数据集和方法,为解决自发语音数据短缺打下基础。我们计划在未来阶段扩大这一数据集,为研究界提供越来越多的资源。


Article 251

Title@2025-06-11 (3): CiteFix: Enhancing RAG Accuracy Through Post-Processing Citation Correction

Title: CiteFix: Enhancing RAG Accuracy Through Post-Processing Citation Correction CiteFix: Verbesserung der RAG-Genauigkeit durch Nachbearbeitung der Citation Correction CiteFix:通过处理后引用校正,加强RAG的准确性 2504.15629v2

Authors (3): Harsh Maheshwari, Srikanth Tenneti, Alwarappan Nakkiran

Retrieval Augmented Generation (RAG) has emerged as a powerful application of Large Language Models (LLMs), revolutionizing information search and consumption. RAG systems combine traditional search capabilities with LLMs to generate comprehensive answers to user queries, ideally with accurate citations. However, in our experience of developing a RAG product, LLMs often struggle with source attribution, which aligns with other industry studies reporting citation accuracy rates of only about 74% for popular generative search engines. To address this, we present efficient post-processing algorithms to improve citation accuracy in LLM-generated responses, with minimal impact on latency and cost. Our approaches cross-check generated citations against retrieved articles using methods including keyword + semantic matching, a fine-tuned model with BERTScore, and a lightweight LLM-based technique. Our experimental results demonstrate a relative improvement of 15.46% in the overall accuracy metrics of our RAG system. This significant enhancement potentially enables a shift from our current larger language model to a relatively smaller model that is approximately 12x more cost-effective and 3x faster in inference time, while maintaining comparable performance. This research contributes to enhancing the reliability and trustworthiness of AI-generated content in information retrieval and summarization tasks, which is critical for gaining customer trust, especially in commercial products.

检索增强生成(RAG)已成为大型语言模型(LLMs)的一项强大应用,彻底改变了信息搜索和消费方式。RAG系统将传统搜索能力与LLM相结合,为用户查询生成全面的答案,理想情况下还附带准确的引用。然而,根据我们开发RAG产品的经验,LLM常常难以正确标注来源,这与其他行业研究的结论一致:这些研究报告称,流行的生成式搜索引擎的引用准确率仅约为74%。为此,我们提出了高效的后处理算法,以提高LLM生成响应中的引用准确性,同时对延迟和成本的影响极小。我们的方法将生成的引用与检索到的文章进行交叉核对,所用方法包括关键词+语义匹配、基于BERTScore的微调模型,以及一种基于LLM的轻量级技术。实验结果表明,我们RAG系统的整体准确性指标相对提升了15.46%。这一显著提升有可能让我们从目前较大的语言模型转向相对较小的模型,后者的成本效益约高12倍,推理速度约快3倍,同时保持相当的性能。这项研究有助于提高AI生成内容在信息检索和摘要任务中的可靠性和可信度,这对赢得客户信任至关重要,在商业产品中尤其如此。
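To make the idea of cross-checking generated citations against retrieved articles concrete, here is a minimal sketch of the keyword-matching flavor only. It is an assumption-laden illustration, not the CiteFix algorithm: the function names, the crude keyword filter, and the tie-breaking rule are all hypothetical, and the semantic-matching, BERTScore, and LLM-based variants mentioned in the abstract are not covered.

```python
import re


def keywords(text):
    """Lowercased tokens longer than three characters (a crude keyword filter)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def correct_citation(sentence, cited_id, retrieved):
    """Return the id of the retrieved passage that shares the most keywords
    with `sentence`. The model's original citation is kept on ties or when
    no passage overlaps at all. (Hypothetical post-processing step.)
    """
    sent_kw = keywords(sentence)
    best_id = cited_id
    best_overlap = len(sent_kw & keywords(retrieved.get(cited_id, "")))
    for doc_id, passage in retrieved.items():
        overlap = len(sent_kw & keywords(passage))
        if overlap > best_overlap:
            best_id, best_overlap = doc_id, overlap
    return best_id


retrieved = {
    "d1": "The Eiffel Tower is located in Paris and was completed in 1889.",
    "d2": "Mount Everest is the highest mountain above sea level.",
}
# The generated sentence cited d2, but its keywords only match d1.
print(correct_citation("The Eiffel Tower was completed in 1889.", "d2", retrieved))  # d1
```

A step like this runs after generation and touches only the citation markers, which is why post-processing of this kind adds little latency or cost compared with regenerating the answer.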


Article 252

Title@2025-06-11 (3): Rethinking Text-based Protein Understanding: Retrieval or LLM?

Title: Rethinking Text-based Protein Understanding: Retrieval or LLM? Rethinking Text-basierte Protein-Verständnis: Retrieval oder LLM? 重新思考基于文本的蛋白质理解:检索还是LLM? 2505.20354v3

Authors (9): Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, Yu Li

In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model’s performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data can be seen at https://github.com/IDEA-XL/RAPM.

近年来,蛋白质文本模型在蛋白质的产生和理解方面的潜力已受到极大关注。目前的方法侧重于通过继续培训前和多模式调整,将蛋白质相关知识纳入大型语言模型,从而能够同时理解文本描述和蛋白序列。通过对现有模型结构和基于文本的蛋白质理解基准的透彻分析,我们确定了当前基准中存在的重要数据渗漏问题。此外,来自自然语言处理的常规指标未能准确评估该模型在这一领域的绩效。为解决这些局限性,我们调整了现有的数据集,并引入了基于生物实体的新的评价框架。根据我们的观察,我们提出了一种检索强化的方法,该方法大大超越了蛋白质到文字生成的微调LLMS,并显示了培训无培训情景的准确度和效率。我们的代码和数据可以在https://github.com/IDA-XL/RAPM上查看。


Article 253

Title@2025-06-11 (3): Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing

Title: Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing Erhöhung der Messlatte: Ermittlung der Werte von großen Sprachmodellen durch Generative Evolving-Tests 提高律师资格:通过创造演变测试调查大语言模式的价值 2406.14230v5

Authors (6): Han Jiang, Xiaoyuan Yi, Zhihua Wei, Ziang Xiao, Shu Wang, Xing Xie

Warning: Contains harmful model outputs. Despite significant advancements, the propensity of Large Language Models (LLMs) to generate harmful and unethical content poses critical challenges. Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment. Although numerous benchmarks have been constructed to assess social bias, toxicity, and ethical issues in LLMs, those static benchmarks suffer from evaluation chronoeffect, in which, as models rapidly evolve, existing benchmarks may leak into training data or become saturated, overestimating ever-developing LLMs. To tackle this problem, we propose GETA, a novel generative evolving testing approach based on adaptive testing methods in measurement theory. Unlike traditional adaptive testing methods that rely on a static test item pool, GETA probes the underlying moral boundaries of LLMs by dynamically generating test items tailored to model capability. GETA co-evolves with LLMs by learning a joint distribution of item difficulty and model value conformity, thus effectively addressing evaluation chronoeffect. We evaluated various popular LLMs with GETA and demonstrated that 1) GETA can dynamically create difficulty-tailored test items and 2) GETA’s evaluation results are more consistent with models’ performance on unseen OOD and i.i.d. items, laying the groundwork for future evaluation paradigms.

尽管取得了重大进步,但大型语言模型(LLMs)产生有害和不道德内容的倾向提出了重大挑战。衡量LLMs的价值调整对于其监管和负责任地部署至关重要。虽然已经制定了许多基准来评估LLMs的社会偏见、毒性和伦理问题,但这些静态基准受到评价时间效应的影响,在这种效应中,随着模型的迅速发展,现有基准可能会渗漏到培训数据中,或者变得饱和、高估不断开发LMs。为了解决这一问题,我们建议GETA(GETA)是一种基于测量理论中适应性测试方法的新颖的基因进化演变测试方法。与传统的适应性测试方法不同,LLMMs在监管和负责任地部署方面至关重要。GETA通过动态生成适合模型能力的测试项目来探究LMs的基本道德界限。GETA(GA)与LMs(LMs)通过学习项目困难和模型一致性的联合分布,从而有效地解决评价时间效应。我们向GETA(GTA)评估了各种受欢迎的LMs,并表明:(1) GETA(GETA)能够动态地创造困难的测试项目,并且为未来的模型奠定基础。


Article 254

Title@2025-06-11 (3): Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment

Title: Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment Persona-Richter: Personalisierte Ausrichtung großer Sprachmodelle über Token-Level-Selbsturteil 法官:通过Token级自评大语言模式个人化调整 2504.12663v2

Authors (4): Xiaotian Zhang, Ruizhe Chen, Yang Feng, Zuozhu Liu

Aligning language models with human preferences presents significant challenges, particularly in achieving personalization without incurring excessive computational costs. Existing methods rely on reward signals and additional annotated data, limiting their scalability and adaptability to diverse human values. To address these challenges, we introduce Persona-judge, a novel discriminative paradigm that enables training-free personalized alignment with unseen preferences. Instead of optimizing policy parameters through external reward feedback, Persona-judge leverages the intrinsic preference judgment capabilities of the model. Specifically, a draft model generates candidate tokens conditioned on a given preference, while a judge model, embodying another preference, cross-validates the predicted tokens to decide whether they should be accepted. Experimental results demonstrate that Persona-judge, using the inherent preference evaluation mechanisms of the model, offers a scalable and computationally efficient solution to personalized alignment, paving the way for more adaptive customized alignment. Our code is available here.

使语言模式与人类偏好相一致,这带来了重大挑战,特别是在实现个性化方面,而不会造成过高的计算费用; 现有方法依靠奖励信号和附加附加说明的数据,限制其可缩放性和适应不同的人类价值观; 为了应对这些挑战,我们引入了个人法官,这是一个新的歧视性模式,使培训无限制,个性化地符合看不见的偏好; 个人法官不通过外部奖励反馈优化政策参数,而是利用模式固有的偏好判断能力。 具体地说,一个模式草案产生候选人的标牌,以特定偏好为条件,而一个法官模式,体现另一种偏好,交叉验证预测的标牌是否被接受。 实验结果表明,个人法官,利用模式固有的偏好评价机制,为个性化调整提供了一种可缩放的、计算高效的解决办法,为更适应的定制一致铺平了道路。 我们的代码在这里可用。


Article 255

Title@2025-06-11 (3): Revisiting Self-Consistency from Dynamic Distributional Alignment Perspective on Answer Aggregation

Title: Revisiting Self-Consistency from Dynamic Distributional Alignment Perspective on Answer Aggregation Überprüfung der Selbstkonsistenz aus der Perspektive der dynamischen Verteilungsausrichtung auf die Antwortaggregation 从动态分配一致的视角重新审视自我一致性 2502.19830v2

Authors (11): Yiwei Li, Ji Zhang, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

Self-consistency improves reasoning by aggregating diverse stochastic samples, yet the dynamics behind its efficacy remain underexplored. We reframe self-consistency as a dynamic distributional alignment problem, revealing that decoding temperature not only governs sampling randomness but also actively shapes the latent answer distribution. Given that high temperatures require prohibitively large sample sizes to stabilize, while low temperatures risk amplifying biases, we propose a confidence-driven mechanism that dynamically calibrates temperature: sharpening the sampling distribution under uncertainty to align with high-probability modes, and promoting exploration when confidence is high. Experiments on mathematical reasoning tasks show this approach outperforms fixed-diversity baselines under limited samples, improving both average and best-case performance across varying initial temperatures without additional data or modules. This establishes self-consistency as a synchronization challenge between sampling dynamics and evolving answer distributions.

自相矛盾通过综合各种随机样本来改进推理,但其功效背后的动态仍未得到充分探索。 我们重新定义自相矛盾作为动态分布匹配问题,暴露出解码温度不仅制约抽样随机性,而且积极影响潜在答案分布。 鉴于高温需要高得惊人的大样本大小才能稳定下来,而低温则有可能扩大偏差,我们提议一个充满信心的机制,动态校准温度:在不确定性下强化抽样分布,使之与高概率模式相一致,并在信任度高时促进探索。 数学推理实验显示,这种方法在有限的样本下超过了固定生物多样性基线,在没有额外数据或模块的情况下,改善了不同初始温度的平均和最佳性能。 这就将自我一致性确定为取样动态和不断演变的答案分布之间的同步挑战。
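Two pieces of the abstract lend themselves to a short sketch: the majority-vote aggregation at the heart of self-consistency, and a confidence-driven temperature update. The direction of the update follows the abstract's wording (sharpen the distribution under uncertainty, explore more when confidence is high). Everything else is an assumption made for illustration: using the vote share as the confidence signal, and the `threshold`, `step`, `t_min`, and `t_max` parameters, are hypothetical and not taken from the paper.

```python
from collections import Counter


def majority_answer(answers):
    """Self-consistency aggregation: most frequent answer and its vote share."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def adjust_temperature(temperature, confidence,
                       threshold=0.5, step=0.1, t_min=0.1, t_max=1.5):
    """Lower the temperature when confidence is low (sharpen the sampling
    distribution) and raise it when confidence is high (promote exploration).
    All default values are illustrative assumptions."""
    if confidence < threshold:
        return max(t_min, temperature - step)
    return min(t_max, temperature + step)


samples = ["42", "41", "42", "42"]  # final answers extracted from sampled chains
answer, confidence = majority_answer(samples)
print(answer, confidence)  # 42 0.75
next_temperature = adjust_temperature(0.7, confidence)  # confidence is high, so it rises
```

In a full loop, the next batch of reasoning chains would be sampled at `next_temperature` and the vote recomputed over all samples collected so far.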


Article 256

Title@2025-06-11 (3): Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

Title: Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms Auf dem Weg zur Überbrückung des Lohn-Generation-Gaps in direkten Ausrichtungsalgorithmen 争取弥合直接调整数值中的奖励-评分差距 2506.09457v1

Authors (3): Zeguan Xiao, Yun Chen, Guanhua Chen

Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the “reward-generation gap”: a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find that one contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both preferred and dispreferred responses to match the shorter one’s length. When training with POET, both responses in each sample are truncated to equal length, which yields diverse truncated lengths across samples; the optimization of the DAA objective is therefore implicitly constrained to converge across all positions, thus paying more attention to prefix tokens than standard DAAs do. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving gains of up to 15.6 points in AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.

直接对齐算法(DAAs),如直接偏好优化(DPO)和简单偏好优化(SimPO),已成为基于人类反馈的强化学习(RLHF)算法的高效替代方案,用于使大型语言模型(LLMs)与人类偏好对齐。然而,DAA存在一个根本性局限,我们称之为“奖励-生成差距”,即训练期间的优化目标与推理期间的实际生成性能之间的错位。在本文中,我们发现造成奖励-生成差距的一个因素是:前缀token在LLM生成过程中固有的重要性,与这种重要性在DAA隐式奖励函数中的体现方式不匹配。为弥合这一差距,我们提出了一种简单而有效的方法,称为面向前缀的等长训练(Prefix-Oriented Equal-length Training,POET),它将偏好响应和非偏好响应都截断到两者中较短者的长度。使用POET训练时,每个样本中的两个响应被截断为相同长度,而不同样本的截断长度各不相同,因此DAA目标的优化被隐式约束为在所有位置上收敛,从而比标准DAA更关注前缀token。我们对DPO和SimPO这两种有代表性的DAA进行了实验,结果表明POET优于它们的标准实现,在AlpacaEval 2上最多提升15.6分,并在下游任务上取得全面改进。我们的结果凸显了解决DAA中奖励优化与生成性能错位问题的重要性。
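The truncation step that POET describes is simple enough to sketch directly. The snippet below assumes responses are already tokenized into lists of token ids and shows only the data-preparation step; the function name `poet_truncate` is hypothetical, and the DPO or SimPO loss that would consume the truncated pair is not shown.

```python
def poet_truncate(chosen_ids, rejected_ids):
    """Truncate the preferred and dispreferred responses to the length of the
    shorter one, so both contribute the same number of token positions."""
    n = min(len(chosen_ids), len(rejected_ids))
    return chosen_ids[:n], rejected_ids[:n]


# Toy preference pair: token ids of the response only, prompt excluded.
chosen = [11, 12, 13, 14, 15, 16]
rejected = [21, 22, 23]
print(poet_truncate(chosen, rejected))  # ([11, 12, 13], [21, 22, 23])
```

Because the shorter response differs from pair to pair, the cut-off position varies across the dataset, which matches the abstract's point that the objective is pushed to converge at all positions rather than only at full-sequence length.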


Article 257

Title@2025-06-11 (3): Irony Detection, Reasoning and Understanding in Zero-shot Learning

Title: Irony Detection, Reasoning and Understanding in Zero-shot Learning Ironie-Erkennung, Vernunft und Verständnis im Null-Schuss-Lernen ” 零弹学习 “ 中的讽刺探测、解释和理解 2501.16884v2

Authors (3): Peiling Yi, Yuhan Xia, Yunfei Long

The generalisation of irony detection faces significant challenges, leading to substantial performance deviations when detection models are applied to diverse real-world scenarios. In this study, we find that irony-focused prompts, as generated from our IDADP framework for LLMs, can not only overcome dataset-specific limitations but also generate coherent, human-readable reasoning, transforming ironic text into its intended meaning. Based on our findings and in-depth analysis, we identify several promising directions for future research aimed at enhancing LLMs’ zero-shot capabilities in irony detection, reasoning, and comprehension. These include advancing contextual awareness in irony detection, exploring hybrid symbolic-neural methods, and integrating multimodal data, among others.

讽刺探测的概括化面临重大挑战,在将探测模型应用于不同的现实世界情景时,导致显著的性能偏差。在本研究中,我们发现,由我们的国际开发发展方案LLMS框架产生的具有讽刺重点的提示不仅能够克服特定数据集的局限性,而且能够产生一致的、人文可读的推理,将具有讽刺意味的文字转化为其预定含义。根据我们的调查结果和深入分析,我们为今后旨在加强LLMS在具有讽刺意味的探测、推理和理解方面的零射能力的研究确定了若干有希望的方向。其中包括提高具有讽刺意味的探测方面的背景意识,探索混合的符号-神经方法,以及整合多式数据等等。


Article 258

Title@2025-06-11 (3): Can LLMs Ground when they (Don’t) Know: A Study on Direct and Loaded Political Questions

Title: Can LLMs Ground when they (Don’t) Know: A Study on Direct and Loaded Political Questions Kann LLMs Boden, wenn sie (nicht) wissen: Eine Studie über direkte und geladene politische Fragen 当他们(不知道)知道:关于直接和上载政治问题的研究 2506.08952v2

Authors (3): Clara Lachenmaier, Judith Sieker, Sina Zarrieß

Communication among humans relies on conversational grounding, allowing interlocutors to reach mutual understanding even when they do not have perfect knowledge and must resolve discrepancies in each other’s beliefs. This paper investigates how large language models (LLMs) manage common ground in cases where they (don’t) possess knowledge, focusing on facts in the political domain where the risk of misinformation and grounding failure is high. We examine the ability of LLMs to answer direct knowledge questions and loaded questions that presuppose misinformation. We evaluate whether loaded questions lead LLMs to engage in active grounding and correct false user beliefs, in connection to their level of knowledge and their political bias. Our findings highlight significant challenges in LLMs’ ability to engage in grounding and reject false user beliefs, raising concerns about their role in mitigating misinformation in political discourse.

人与人之间的沟通依赖于对话基础,使对话者能够相互理解,即使他们没有完全的知识,也必须解决彼此信仰的差异。本文调查了大型语言模式(LLMs)在他们(不)拥有知识的情况下如何管理共同点,侧重于政治领域存在大量错误信息和基础失败风险的事实。我们研究了LLMs回答直接知识问题和以错误信息为前提的大量问题的能力。我们评估了大量问题是否导致LLMs在其知识水平和政治偏见方面积极基础化和纠正虚假用户信仰。我们的调查结果强调了LLMs参与基础化和否定虚假用户信仰的能力方面的重大挑战,使人们担心他们在减少政治言论中的错误信息方面的作用。


Article 259

Title@2025-06-11 (3): Learning Obfuscations Of LLM Embedding Sequences: Stained Glass Transform

Title: Learning Obfuscations Of LLM Embedding Sequences: Stained Glass Transform Verschlimmerungen von LLM-Embedding-Sequenzen lernen: Glas-Reformation 学习LLM 嵌入序列的学习障碍:染色玻璃变形 2506.09452v1

Authors (4): Jay Roberts, Kyle Mylonakis, Sidhartha Roy, Kaan Kale

The high cost of ownership of AI compute infrastructure and the challenges of robustly serving large language models (LLMs) have led to a surge in managed Model-as-a-service deployments. Even when enterprises choose on-premises deployments, the compute infrastructure is typically shared across many teams in order to maximize the return on investment. In both scenarios the deployed models operate only on plaintext data, and so enterprise data owners must allow their data to appear in plaintext on a shared or multi-tenant compute infrastructure. As a result, data owners with private or sensitive data are hesitant or restricted in what data they use with these types of deployments. In this work we introduce the Stained Glass Transform, a learned, stochastic, and sequence-dependent transformation of the word embeddings of an LLM that information-theoretically provides privacy to the input of the LLM while preserving the utility of the model. We theoretically connect a particular class of Stained Glass Transforms to the theory of mutual information of Gaussian Mixture Models. We then calculate a posteriori privacy estimates, based on mutual information, and verify the privacy and utility of instances of transformed embeddings through token-level metrics of privacy and standard LLM performance benchmarks.

AI 的拥有成本高,它计算了基础设施的基础设施和强力服务大型语言模型(LLMs)的挑战,导致管理下的模型服务部署激增。即使企业选择了部署地点,计算基础设施通常也由许多团队共享,以最大限度地实现投资回报。在两种情况下,所部署的模型只能以纯文本数据运作,因此企业数据所有者必须允许其数据在共享或多租赁的计算基础设施上以纯文本显示。这导致拥有私人或敏感数据的数据的所有者在这些类型的部署中使用的数据中犹豫不决或受到限制。在这项工作中,我们引入了有色玻璃变换,从理论上为LLM的投入提供隐私,同时维护模型的实用性。我们从理论上将特定类别的有色玻璃变换与Gausian Mixture模型的相互信息理论连接起来。我们随后根据相互的信息计算了隐蔽性估计,并核实了通过图像标准水平改变的隐私和性能基准的隐私和实用性。


Article 260

Title@2025-06-11 (3): UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs

Title: UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs UniToMBench: Integration von Perspektiven zur Verbesserung der Theorie des Geistes in LLMs Unitombennch:将透视-提高思维理论纳入LLMs 2506.09450v1

Authors (8): Prameshwar Thiyagarajan, Vaishnavi Parimi, Shamant Sai, Soumil Garg, Zhangir Meirbek, Nitin Yarlagadda, Kevin Zhu, Chris Kim

Theory of Mind (ToM), the ability to understand the mental states of oneself and others, remains a challenging area for large language models (LLMs), which often fail to predict human mental states accurately. In this paper, we introduce UniToMBench, a unified benchmark that integrates the strengths of SimToM and TOMBENCH to systematically improve and assess ToM capabilities in LLMs by integrating multi-interaction task designs and evolving story scenarios. Supported by a custom dataset of over 1,000 hand-written scenarios, UniToMBench combines perspective-taking techniques with diverse evaluation metrics to better stimulate social cognition in LLMs. Through evaluation, we observe that while models like GPT-4o and GPT-4o Mini show consistently high accuracy in tasks involving emotional and belief-related scenarios, with results usually above 80%, there is significant variability in their performance across knowledge-based tasks. These results highlight both the strengths and limitations of current LLMs in ToM-related tasks, underscoring the value of UniToMBench as a comprehensive tool for future development. Our code is publicly available here: https://github.com/Shamant/unifiedtombenchmark.

心智理论(ToM)是理解自己和他人心理状态的能力,对大型语言模型(LLMs)而言仍是一个具有挑战性的领域,LLMs往往无法准确预测人类的心理状态。在本文中,我们引入UniToMBench,这是一个统一的基准,它结合了SimToM和TOMBENCH的优势,通过整合多交互任务设计和不断演变的故事情景,系统地提升和评估LLMs的ToM能力。UniToMBench以一个包含1,000多个手写情景的定制数据集为支撑,将换位思考(perspective-taking)技术与多样的评估指标相结合,以更好地激发LLMs的社会认知。通过评估,我们观察到,虽然GPT-4o和GPT-4o Mini等模型在涉及情感和信念相关情景的任务中始终表现出较高的准确率(结果通常在80%以上),但它们在基于知识的任务上的表现存在显著差异。这些结果凸显了当前LLMs在ToM相关任务中的优势和局限,也表明UniToMBench作为未来发展的综合工具的价值。我们的代码公开于:https://github.com/Shamant/unifiedtombenchmark。


Article 261

Title@2025-06-11 (3): OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary

Title: OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary OWSM-Biasing: Kontextualisieren von Open Whisper-Style Speech Models für automatische Spracherkennung mit dynamischem Vokabeln OWSM-Biasing:使用动态词汇表自动语音识别的开放耳语语音模型背景化 2506.09448v1

Authors (5): Yui Sudo, Yusuke Fujita, Atsushi Kojima, Tomoya Mizumoto, Lianbo Liu

Speech foundation models (SFMs), such as Open Whisper-Style Speech Models (OWSM), are trained on massive datasets to achieve accurate automatic speech recognition. However, even SFMs struggle to accurately recognize rare and unseen words. While contextual biasing (CB) is a promising approach to improve recognition of such words, most CB methods are trained from scratch, resulting in lower performance than SFMs due to the lack of pre-trained knowledge. This paper integrates an existing CB method with OWSM v3.1 while freezing its pre-trained parameters. By leveraging the knowledge embedded in SFMs, the proposed method enables effective CB while preserving the advantages of SFMs, even with a small dataset. Experimental results show that the proposed method improves the biasing word error rate (B-WER) by 11.6 points, resulting in a 0.9 point improvement in the overall WER while reducing the real-time factor by 7.5% compared to the non-biasing baseline on the LibriSpeech 100 test-clean set.

语音基础模型(SFM),例如 Open Whisper-Style Speech Models (OWSM),在大规模数据集上训练,以实现准确的自动语音识别。然而,即使是SFM也难以准确识别罕见词和未见词。上下文偏置(CB)是改善此类词识别的一种有前景的方法,但大多数CB方法都是从零开始训练的,由于缺乏预训练知识,其性能低于SFM。本文将一种现有的CB方法与OWSM v3.1相结合,同时冻结其预训练参数。通过利用SFM中蕴含的知识,所提出的方法即使在小数据集上也能实现有效的CB,同时保留SFM的优势。实验结果表明,在LibriSpeech 100 test-clean集上,与无偏置基线相比,所提出的方法将偏置词错误率(B-WER)改善了11.6个点,使总体WER改善了0.9个点,同时将实时因子降低了7.5%。
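
The abstract above reports gains in word error rate (WER). As a minimal sketch of how plain WER is usually computed (word-level edit distance divided by the reference length), the function below may help; the name `word_error_rate` is illustrative and not taken from the paper. B-WER is commonly understood as the same count restricted to words in the biasing list, which this sketch does not implement.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a c d")` counts one deletion over four reference words.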


Article 262

Title@2025-06-11 (3): GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture

Title: GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture GigaChat Familie: Effiziente russische Sprachmodellierung durch Mixtur von Experten Architektur GigaChat家庭:通过混合专家建筑结构,高效率的俄语建模 2506.09440v1

Authors (34): GigaChat team, Mamedov Valentin, Evgenii Kosarev, Gregory Leleytner, Ilya Shchuckin, Valeriy Berezovskiy, Daniil Smirnov, Dmitry Kozlov, Sergei Averkiev, Lukyanenko Ivan, Aleksandr Proshunin, Ainur Israfilova, Ivan Baskov, Artem Chervyakov, Emil Shakirov, Mikhail Kolesov, Daria Khomich, Darya Latortseva, Sergei Porkhun, Yury Fedorov, Oleg Kutuzov, Polina Kudriavtseva, Sofiia Soldatova, Kolodin Egor, Stanislav Pyatkin, Dzmitry Menshykh, Grafov Sergei, Eldar Damirov, Karlov Vladimir, Ruslan Gaitukiev, Arkadiy Shatenov, Alena Fenogenova, Nikita Savushkin, Fedor Minkin

Generative large language models (LLMs) have become crucial for modern NLP research and applications across various languages. However, the development of foundational models specifically tailored to the Russian language has been limited, primarily due to the significant computational resources required. This paper introduces the GigaChat family of Russian LLMs, available in various sizes, including base models and instruction-tuned versions. We provide a detailed report on the model architecture, pre-training process, and experiments to guide design choices. In addition, we evaluate their performance on Russian and English benchmarks and compare GigaChat with multilingual analogs. The paper presents a system demonstration of the top-performing models accessible via an API, a Telegram bot, and a Web interface. Furthermore, we have released three GigaChat models as open source (https://huggingface.co/ai-sage), aiming to expand NLP research opportunities and support the development of industrial solutions for the Russian language.

生成的大型语言模型(LLMS)对于现代国家语言规划的各种研究和应用至关重要,然而,主要由于需要大量计算资源,专门针对俄语的基础模型的开发有限,本文件介绍了不同规模的俄罗斯LMSGigaChat家族,包括基础模型和经指导的版本;我们提供了关于模型结构、培训前过程和实验的详细报告,以指导设计选择;此外,我们评估了这些模型在俄语和英语基准方面的表现,并将GigaChat与多种语言类比进行比较;该文件展示了通过API、Telegrambot和网络界面可访问的最优秀模型的系统演示;此外,我们还发布了三种开放源的GigaChat模型(https://huggingface.co/ai-sage),目的是扩大国家语言规划的研究机会,支持开发俄语的工业解决方案。


Article 263

Title@2025-06-11 (3): Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting

Title: Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting Verbessertes Supervised-Fine-Tuning für große Sprachmodelle, um Katastrophenvergessenheit zu vermeiden 改进对大语言模型改进监督的微调,以缓解灾难性遗忘 2506.09428v1

Authors (2): Fei Ding, Baiqiao Wang

Supervised Fine-Tuning (SFT), while enhancing the instruction-following capabilities and domain-specific task adaptability of large language models (LLMs), often diminishes their general capabilities. Moreover, due to the inaccessibility of original pre-training data, catastrophic forgetting tends to be exacerbated when third-party practitioners implement SFT on open-source models. To address this challenge, we propose a novel, more cost-effective SFT method which could effectively reduce the risk of catastrophic forgetting without access to original SFT data. Our approach begins by reconstructing the likely SFT instruction distribution of the base model, followed by a multi-model screening process to select optimal data, which is then mixed with new data for SFT. Experimental results demonstrate that our method preserves generalization capabilities in general domains while improving task-specific performance.

此外,由于培训前原始数据无法获取,如果第三方从业者在开放源码模型上实施SFT,灾难性的遗忘往往会更加严重。为了应对这一挑战,我们建议一种新的、成本效益更高的SFT方法,这种方法可以有效减少无法获取原始SFT数据而发生灾难性遗忘的风险。我们的方法是重建基础模型可能存在的SFT指令分布,然后是多模型筛选程序,选择最佳数据,然后与SFT的新数据混在一起。实验结果表明,我们的方法在改进具体任务绩效的同时,保留了一般领域的通用能力。


Article 264

Title@2025-06-11 (3): Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models

Title: Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models Vertrauen ist alles, was Sie brauchen: Wenig heißes RL Feintuning von Sprachmodellen 信心是你们需要的:很少点点RL精细的语言模式 2506.06395v3

Authors (5): Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets

Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model’s own confidence as reward signals, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on Olympiadbench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for inference models, requiring only a small number of samples and unlabelled supervision.

大型语言模型(LLMS)在推理方面非常出色,但培训后对于使其行为与任务目标保持一致仍然至关重要。现有的强化学习方法(RL)往往取决于昂贵的人类说明或外部奖励模式。我们提议通过自我认知加强学习(RLSC ) , 将该模式作为奖励信号,消除标签、优惠模式或奖赏工程的需要。 应用到Quen2.5-Math-7B,每个问题只有16个样本,10或20个培训步骤,RLSC在 AIME2024 上提高了13.4%的精确度,在MATH500 上增加了21.2%,在Minrva Math上增加了21.7%,在奥林雅德本上增加了20.8%,在AMC23上增加了9.7%。 RLSC为推断模型提供了简单、可扩展的后培训方法,只需要少量样本和无标签的监督。
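
To make the idea of "confidence as reward" concrete, here is a minimal sketch of one plausible label-free confidence score: the probability that two independent samples for the same question agree, estimated from a batch of sampled final answers. This is an assumption for illustration only; the function name `self_confidence` is hypothetical and the paper's exact objective may differ.

```python
from collections import Counter

def self_confidence(sampled_answers):
    """Estimated chance that two independent samples give the same answer.

    Computed as the sum of squared empirical answer frequencies, so it is
    1.0 when every sample agrees and approaches 1/n when all n differ.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    n = len(sampled_answers)
    counts = Counter(sampled_answers)
    return sum((c / n) ** 2 for c in counts.values())
```

With 16 samples per question, as in the abstract, a split of 12 identical answers and 4 identical dissenting answers would score 0.625, while unanimous agreement would score 1.0. No ground-truth label is needed at any point.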


Article 265

Title@2025-06-11 (3): Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings

Title: Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings Versteckt in Plain Sight: Auswertung der Erkennungsfähigkeiten von LLMs in multimodalen Einstellungen 隐藏在浅视中:评估多模式环境中LLMs的欺骗性探测能力 2506.09424v1

Authors (4): Md Messal Monem Miah, Adrita Anika, Xi Shi, Ruihong Huang

Detecting deception in an increasingly digital world is both a critical and challenging task. In this study, we present a comprehensive evaluation of the automated deception detection capabilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs) across diverse domains. We assess the performance of both open-source and commercial LLMs on three distinct datasets: real life trial interviews (RLTD), instructed deception in interpersonal scenarios (MU3D), and deceptive reviews (OpSpam). We systematically analyze the effectiveness of different experimental setups for deception detection, including zero-shot and few-shot approaches with random or similarity-based in-context example selection. Our results show that fine-tuned LLMs achieve state-of-the-art performance on textual deception detection tasks, while LMMs struggle to fully leverage cross-modal cues. Additionally, we analyze the impact of auxiliary features, such as non-verbal gestures and video summaries, and examine the effectiveness of different prompting strategies, including direct label generation and chain-of-thought reasoning. Our findings provide key insights into how LLMs process and interpret deceptive cues across modalities, highlighting their potential and limitations in real-world deception detection applications.

在日益数字化的世界中,发现欺骗既是一项关键又具有挑战性的任务。在本研究中,我们全面评价了大语言模型(LLMS)和大型多模式模型(LMMS)在不同领域的自动欺骗检测能力。我们评估了开放源码和商用LLMS在三个不同的数据集上的表现:真实生活试验访谈(RLTD)、在人际情景中指示欺骗(MU3D)和欺骗性审查(OpSpam)。我们系统地分析各种欺骗检测实验设置的有效性,包括随机或类似的随机或类似文本示例选择的零射和少射方法。我们的结果显示,微调LMMS在文字欺骗检测任务上达到最新水平,而LMMS努力充分利用跨模式的提示。此外,我们分析了非语言手势和视频摘要等辅助特征的影响,并审查了不同提示性战略的有效性,包括直接标签生成和连锁推理。我们的调查结果为LMS过程提供了关键洞察力,并解释了在现实世界中如何进行欺骗性探测和跨模式的极限。
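
The similarity-based in-context example selection mentioned above can be sketched as follows. This is a simplified illustration that assumes token-overlap (Jaccard) similarity; the study's actual similarity measure is not stated in the abstract, and real systems often use embedding similarity instead. The names `jaccard` and `select_examples` are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two texts, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def select_examples(query: str, pool, k: int = 2):
    """Return the k labelled (text, label) pairs most similar to the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    return ranked[:k]
```

The selected pairs would then be placed in the prompt as few-shot demonstrations ahead of the statement to classify; random selection corresponds to replacing the ranking with a random sample of the pool.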


Article 266

Title@2025-06-11 (3): GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

Title: GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment GenARM: Reward-geführte Generation mit autoregressivem Reward-Modell für Testzeitausrichtung GENARM: 具有自动递减奖益模型的奖赏制向导生成(测试时间对齐自动递减奖模型) 2410.08193v4

Authors (7): Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh

Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods fine-tune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model, a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining. Our project page is available at: https://genarm.github.io.

大型语言模型(LLMS)表现出令人印象深刻的能力,但需要与人类偏好密切配合; 传统的培训时间方法利用人类偏好数据集对LLMS进行精细调整,但需要大量培训费用,并需要反复培训才能满足用户的不同偏好; 测试时间调整方法通过奖赏模式(RMs)来解决这个问题,引导冻结的LLMS,无需再培训; 但是,现有的测试时间方法依靠轨迹水平的RMS, 目的是评价完整的响应, 使其不适合自动递减的文本生成, 需要从部分响应中计算下点奖励。 为此,我们采用了GENARM, 测试时间调整方法利用自动递增模式(BenARM) 的测试时间调整方法, 利用测试时间调整方法来利用自动递增模式(MLMS)的新颖的奖励, 目的是预测对高效和高效的自动递增型的自动递增新一代的奖励。 理论上,我们证明,这种平衡性能引导冻结LMS公司在KMS(KMS) 常规培训中的任何分配。
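
A minimal sketch of reward-guided next-token decoding, under the assumption that the frozen base model's next-token log-probabilities are combined with a token-level reward model's log-probabilities scaled by 1/beta and then renormalized. The function name, the dictionary interface, and the `beta` parameter are illustrative; consult the project page for the authors' actual formulation.

```python
import math

def guided_next_token_probs(base_logprobs, reward_logprobs, beta=1.0):
    """Combine base and reward-model next-token log-probs into one distribution.

    score(t) = log p_base(t) + (1 / beta) * log p_reward(t), then softmax.
    A large beta stays close to the base model; a small beta follows the
    reward model more strongly.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    scores = {t: base_logprobs[t] + reward_logprobs[t] / beta for t in base_logprobs}
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}
```

Because the combination happens per token, the base model's weights are never updated, which is what makes this kind of guidance a test-time method.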


Article 267

Title@2025-06-11 (3): A Call for Collaborative Intelligence: Why Human-Agent Systems Should Precede AI Autonomy

Title: A Call for Collaborative Intelligence: Why Human-Agent Systems Should Precede AI Autonomy Ein Aufruf für kollaborative Intelligenz: Warum Menschen-Agenten-Systeme der KI-Autonomie vorausgehen sollten 呼吁合作情报:为什么人力-物力系统应先于自主 2506.09420v1

Authors (13): Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Chunyu Miao, Dongyuan Li, Aiwei Liu, Yue Zhou, Yankai Chen, Weizhi Zhang, Yangning Li, Liancheng Fang, Renhe Jiang, Philip S. Yu

Recent improvements in large language models (LLMs) have led many researchers to focus on building fully autonomous AI agents. This position paper questions whether this approach is the right path forward, as these autonomous systems still have problems with reliability, transparency, and understanding the actual requirements of humans. We suggest a different approach: LLM-based Human-Agent Systems (LLM-HAS), where AI works with humans rather than replacing them. By keeping humans involved to provide guidance, answer questions, and maintain control, these systems can be more trustworthy and adaptable. Looking at examples from healthcare, finance, and software development, we show how human-AI teamwork can handle complex tasks better than AI working alone. We also discuss the challenges of building these collaborative systems and offer practical solutions. This paper argues that progress in AI should not be measured by how independent systems become, but by how well they can work with humans. The most promising future for AI is not in systems that take over human roles, but in those that enhance human capabilities through meaningful partnership.

最近对大型语言模型(LLMS)的改进促使许多研究人员注重建立完全自主的AI代理商。本立场文件质疑这一方法是否是正确的前进道路,因为这些自主系统在可靠性、透明度和了解人类实际需求方面仍有问题。我们建议采取不同的方法:基于LLM的人类代理系统(LLM-HAS),AI与人类合作,而不是取而代之。通过让人类参与提供指导、回答问题和保持控制,这些系统可以更加可信和适应性更高。我们从保健、金融和软件开发的示例中看到,人类-AI团队合作能够比AI单独工作更好地处理复杂的任务。我们还讨论建立这些合作系统并提供实际解决办法的挑战。本文认为,AI的进展不应当以独立系统如何发展来衡量,而应当以这些系统与人类合作如何良好来衡量。对于AI来说,最有希望的未来不是在取代人类作用的系统中,而是通过有意义的伙伴关系增强人类能力的系统中。


Article 268

Title@2025-06-11 (3): AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling

Title: AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling AcTracer: Aktives Testen von Großsprachenmodellen durch Multi-Stage-Sampling AcTracler:通过多标准抽样积极测试大语言模型 2408.03573v2

Authors (5): Yuheng Huang, Jiayang Song, Qiang Hu, Felix Juefei-Xu, Lei Ma

Performance evaluation plays a crucial role in the development life cycle of large language models (LLMs). It estimates the model’s capability, elucidates behavior characteristics, and facilitates the identification of potential issues and limitations, thereby guiding further improvement. Given that LLMs’ diverse task-handling abilities stem from large volumes of training data, a comprehensive evaluation also necessitates abundant, well-annotated, and representative test data to assess LLM performance across various downstream tasks. However, the demand for high-quality test data often entails substantial time, computational resources, and manual efforts, sometimes causing the evaluation to be inefficient or impractical. To address these challenges, researchers propose active testing, which estimates the overall performance by selecting a subset of test data. Nevertheless, the existing active testing methods tend to be inefficient, even inapplicable, given the unique new challenges of LLMs (e.g., diverse task types, increased model complexity, and unavailability of training data). To mitigate such limitations and expedite the development cycle of LLMs, in this work, we introduce AcTracer, an active testing framework tailored for LLMs that strategically selects a small subset of test data to achieve a more accurate performance estimation for LLMs. AcTracer utilizes both internal and external information from LLMs to guide the test sampling process, reducing variance through a multi-stage pool-based active selection. Our experiment results demonstrate that AcTracer achieves state-of-the-art performance compared to existing methods across various tasks.

绩效评估在大型语言模型(LLMs)的开发生命周期中发挥着关键作用。它评估模型的能力,阐明行为特征,并有助于发现潜在问题和局限,从而指导进一步改进。鉴于LLMs多样的任务处理能力源于大量训练数据,全面评估也需要充足、标注良好且具有代表性的测试数据,以评估LLM在各种下游任务上的表现。然而,对高质量测试数据的需求往往意味着大量的时间、计算资源和人工投入,有时导致评估效率低下或不切实际。为应对这些挑战,研究人员提出了主动测试,即通过选择测试数据的一个子集来估计总体性能。然而,鉴于LLMs带来的独特新挑战(例如任务类型多样、模型复杂度增加、训练数据不可获得),现有的主动测试方法往往效率低下,甚至无法适用。为缓解这些局限并加快LLMs的开发周期,我们在本工作中提出AcTracer,这是一个为LLMs量身定制的主动测试框架,它有策略地选择一小部分测试数据,以更准确地估计LLMs的性能。AcTracer利用LLMs的内部和外部信息来指导测试抽样过程,通过基于池的多阶段主动选择来降低方差。实验结果表明,与现有方法相比,AcTracer在各种任务上达到了最先进的性能。
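
The general idea of estimating performance from a small labelled subset can be sketched with plain stratified sampling. This is a generic illustration and not AcTracer's algorithm: the strata here are assumed to be given (for instance from clustering), and `stratified_accuracy_estimate` and `is_correct` are hypothetical names.

```python
import random

def stratified_accuracy_estimate(strata, is_correct, per_stratum, seed=0):
    """Estimate overall accuracy by labelling a few items from each stratum.

    strata: dict mapping a stratum name to a list of test items.
    is_correct: callable that labels one item (the expensive step).
    Each stratum's sampled accuracy is weighted by its share of the pool.
    """
    rng = random.Random(seed)
    total = sum(len(items) for items in strata.values())
    if total == 0:
        raise ValueError("strata must contain at least one item")
    estimate = 0.0
    for items in strata.values():
        if not items:
            continue
        sample = rng.sample(items, min(per_stratum, len(items)))
        accuracy = sum(1 for x in sample if is_correct(x)) / len(sample)
        estimate += accuracy * len(items) / total
    return estimate
```

When items within a stratum behave alike, only a handful of labels per stratum are needed, which is the variance-reduction effect that pool-based selection methods aim for.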


Article 269

Title@2025-06-11 (3): Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers

Title: Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers Sprachsynthese durch Entrollen des Diffusionsprozesses mit neuralen Netzwerkebenen 使用神经网络图层解开传播进程 2309.09652v5

Authors (1): Peter Ochieng

This work introduces UDPNet, a novel architecture designed to accelerate the reverse diffusion process in speech synthesis. Unlike traditional diffusion models that rely on timestep embeddings and shared network parameters, UDPNet unrolls the reverse diffusion process directly into the network architecture, with successive layers corresponding to equally spaced steps in the diffusion schedule. Each layer progressively refines the noisy input, culminating in a high-fidelity estimation of the original data, $x_0$. Additionally, we redefine the learning target by predicting latent variables instead of the conventional $x_0$ or noise $\epsilon_0$. This shift addresses the common issue of large prediction errors in early denoising stages, effectively reducing speech distortion. Extensive evaluations on single- and multi-speaker datasets demonstrate that UDPNet consistently outperforms state-of-the-art methods in both quality and efficiency, while generalizing effectively to unseen speech. These results position UDPNet as a robust solution for real-time speech synthesis applications. Sample audio is available at https://onexpeters.github.io/UDPNet.

这项工作引入了UDPNet(UDPNet),这是一个旨在加速语音合成中反向扩散进程的新结构。与依赖时间步骤嵌入和共享网络参数的传统传播模式不同,UDPNet将反向扩散进程直接引入网络架构,其相继层与扩散时间表中同样间距的步骤相对应。每个层逐步完善了噪音输入,最终对原始数据进行了高不洁的估计,(x_0) 。此外,我们通过预测潜在变量而不是常规的(x_0) 或噪音(\ epsilon_0) 来重新定义学习目标。这一转变解决了早期拆分化阶段大预测错误、有效减少语音扭曲的常见问题。对单位和多位语音数据集的广泛评估表明,UDPNet在质量和效率上始终超越最新方法,同时有效地向看不见的语音进行概括。这些结果将UDPNet定位为实时语音合成应用的可靠解决方案。在 https://onexpeters.github.io/UDPNet上提供了抽样音频。


Article 270

Title@2025-06-11 (3): PGDA-KGQA: A Prompt-Guided Generative Framework with Multiple Data Augmentation Strategies for Knowledge Graph Question Answering

Title: PGDA-KGQA: A Prompt-Guided Generative Framework with Multiple Data Augmentation Strategies for Knowledge Graph Question Answering PGDA-KGQA: Ein prompt geführtes Generatives Framework mit Multiple Data Augmentation Strategies for Knowledge Graph Question Answering PGDA-KGQA: 快速指导生成框架,包括多数据增强战略,用于知识图表问题解答 2506.09414v1

Authors (3): Xiujun Zhou, Pingjian Zhang, Deyou Tang

Knowledge Graph Question Answering (KGQA) is a crucial task in natural language processing that requires reasoning over knowledge graphs (KGs) to answer natural language questions. Recent methods utilizing large language models (LLMs) have shown remarkable semantic parsing capabilities but are limited by the scarcity of diverse annotated data and multi-hop reasoning samples. Traditional data augmentation approaches focus mainly on single-hop questions and are prone to semantic distortion, while LLM-based methods primarily address semantic distortion but usually neglect multi-hop reasoning, thus limiting data diversity. The scarcity of multi-hop samples further weakens models’ generalization. To address these issues, we propose PGDA-KGQA, a prompt-guided generative framework with multiple data augmentation strategies for KGQA. At its core, PGDA-KGQA employs a unified prompt-design paradigm: by crafting meticulously engineered prompts that integrate the provided textual content, it leverages LLMs to generate large-scale (question, logical form) pairs for model training. Specifically, PGDA-KGQA enriches its training set by: (1) generating single-hop pseudo questions to improve the alignment of question semantics with KG relations; (2) applying semantic-preserving question rewriting to improve robustness against linguistic variations; (3) employing answer-guided reverse path exploration to create realistic multi-hop questions. By adopting an augment-generate-retrieve semantic parsing pipeline, PGDA-KGQA utilizes the augmented data to enhance the accuracy of logical form generation and thus improve answer retrieval performance. Experiments demonstrate that PGDA-KGQA outperforms state-of-the-art methods on standard KGQA datasets, achieving improvements on WebQSP of 2.8%, 1.2%, and 3.1% and on ComplexWebQuestions of 1.8%, 1.1%, and 2.4% in F1, Hits@1, and Accuracy, respectively.

知识图形解答( KGQA) 是自然语言处理中的一项关键任务, 需要对知识图形( KGQA) 进行推理, 才能解答自然语言问题。 最近使用大型语言模型( LLMS) 的方法显示出惊人的语义解能力, 但因缺少多种附加说明的数据和多点推理样本而受到限制。 传统的数据增强方法主要侧重于单点问题, 容易出现语义扭曲, 而基于 LLM 的方法主要处理语义扭曲, 但通常忽视多点推理, 从而限制数据的多样性。 多点样本的缺乏进一步削弱模型的通用性。 为了解决这些问题, 我们建议使用 快速的 QA- KQQQA , 快速导导导出具有多点数据增强策略。 PGDA- KGQA 核心, 精心设计了将提供文本内容整合的快速设计, 利用 LLMSMS, 在模型培训上生成大层次( 要求、 逻辑格式) QA- kQAA 将 快速调调调调调改进数据, , 以更新运行数据变换成网络。


Article 271

Title@2025-06-11 (3): Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models

Title: Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models Self-Steering Optimization: Autonome Preference-Optimierung für große Sprachmodelle 自制最佳优化:自主优化大语言模式 2410.17131v2

Authors (10): Hao Xiang, Bowen Yu, Hongyu Lin, Keming Lu, Yaojie Lu, Xianpei Han, Ben He, Le Sun, Jingren Zhou, Junyang Lin

The key to effective alignment lies in high-quality preference data. Recent research has focused on automated alignment, which involves developing alignment systems with minimal human intervention. However, prior research has predominantly focused on developing data generation methods, while insufficient attention has been paid to quality control mechanisms, which often produce inaccurate and unhelpful data, leading to unpredictable benefits during iterative optimization. In this paper, we present Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference data, eliminating manual annotation requirements. $SSO$ employs a specialized optimization objective to build a data generator from the policy model itself, which is used to produce accurate and on-policy data. We demonstrate $SSO$’s effectiveness through comprehensive experiments on two series of models: Llama 3 and Qwen 2. Our evaluation across diverse benchmarks shows that $SSO$ consistently outperforms baselines in human preference alignment and reward optimization. Further analysis validates $SSO$ as a scalable framework for preference optimization, benefiting the advancement in automated alignment techniques.

有效调整的关键在于高质量的优惠数据。最近的研究侧重于自动化调整,这涉及在最低限度的人力干预下开发调整系统。然而,先前的研究主要侧重于开发数据生成方法,而对于质量控制机制重视不够,而质量控制机制往往产生不准确和无益的数据,从而在迭接优化期间产生无法预测的效益。在本文中,我们介绍了自制优化(SSOS$),这是一种自主生成高质量优惠数据的算法,取消了人工批注要求。 USSO$利用一个专门优化目标,从政策模型本身建立一个数据生成器,用于编制准确和政策数据。我们通过对两个模型系列的综合实验,即Llama 3和 Quen 2. 我们的跨不同基准的评估表明,SOS$始终高于人类优惠调整和奖励优化的基线。进一步的分析证实,USSO$是一个可测量的优惠优化框架,有利于自动调整技术的进步。


Article 272

Title@2025-06-11 (3): Rethinking Diverse Human Preference Learning through Principal Component Analysis

Title: Rethinking Diverse Human Preference Learning through Principal Component Analysis Umdenken von unterschiedlichen menschlichen Vorlieben Lernen durch Hauptkomponentenanalyse 通过主要构成部分分析重新思考多种人类优先学习 2502.13131v2

Authors (8): Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen

Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment. Our code is available at https://github.com/amandaluof/DRMs.

理解人类偏好对于改进基础模型和建立个性化的AI系统至关重要。然而,偏好本质上是多种多样和复杂的,使得传统的奖赏模型难以完整地捕捉到它们。微细的优惠数据虽然有帮助,但收集费用昂贵,规模很难。在本文中,我们引入了分解的奖赏模型(DRMs),这是一种新颖的方法,从二进制比较中提取不同的人类偏好,而不需要细微的注解。我们的关键见解是将人类的偏好作为矢量,并利用主要组成部分分析加以分析。通过建立一套将偏好和被拒绝的响应者之间的差别嵌入的数据集,DRMs确定能够捕捉到不同偏好方面不同的偏好对象。这些分解的奖赏可以灵活地与不同的用户需要相结合,为传统奖赏模式提供可解释和可伸缩的替代方法。我们证明DRMs有效地提取有意义的优惠层面(例如有用性、安全性、幽默性)和适应新的用户,而无需额外培训。我们的结果突出表明DRMMs作为个人化和可解释的强大框架。我们的代码可在 http://DRDRMSP/DRMs.
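
The core step described above, running PCA over embedding differences between preferred and rejected responses, can be sketched in plain Python. This is a minimal illustration that recovers only the leading direction via power iteration and centers the data first (an assumption; the authors' code at the linked repository may differ). The name `top_principal_direction` is hypothetical.

```python
import math

def top_principal_direction(diffs, iters=200):
    """Leading principal component of the rows of `diffs` via power iteration.

    Each row is assumed to be embedding(preferred) - embedding(rejected).
    """
    n, d = len(diffs), len(diffs[0])
    mean = [sum(row[k] for row in diffs) / n for k in range(d)]
    centered = [[row[k] - mean[k] for k in range(d)] for row in diffs]
    v = [1.0 / math.sqrt(d)] * d  # start vector; must not be orthogonal to the answer
    for _ in range(iters):
        # w = X^T (X v), which is proportional to (covariance matrix) @ v
        proj = [sum(r[k] * v[k] for k in range(d)) for r in centered]
        w = [sum(proj[i] * centered[i][k] for i in range(n)) for k in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v
```

Under this reading, a response's score along one preference dimension would be the dot product of its embedding with the returned direction, and further orthogonal directions would give further dimensions.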


Article 273

Title@2025-06-11 (3): Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items

Title: Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items Wertportrait: Bewertung der Werte von Sprachmodellen durch psychometrische und ökologisch gültige Elemente 价值纵向:通过心理和生态有效项目评估语言模型的价值 2505.01015v3

Authors (5): Jongwook Han, Dongmin Choi, Woojung Song, Eun-Ju Lee, Yohan Jo

The importance of benchmarks for assessing the values of language models has become pronounced due to the growing need for more authentic, human-aligned responses. However, existing benchmarks rely on human or machine annotations that are vulnerable to value-related biases. Furthermore, the tested scenarios often diverge from real-world contexts in which models are commonly used to generate text and express values. To address these issues, we propose the Value Portrait benchmark, a reliable framework for evaluating LLMs’ value orientations with two key characteristics. First, the benchmark consists of items that capture real-life user-LLM interactions, enhancing the relevance of assessment results to real-world LLM usage. Second, each item is rated by human subjects based on its similarity to their own thoughts, and correlations between these ratings and the subjects’ actual value scores are derived. This psychometrically validated approach ensures that items strongly correlated with specific values serve as reliable items for assessing those values. Through evaluating 44 LLMs with our benchmark, we find that these models prioritize Benevolence, Security, and Self-Direction values while placing less emphasis on Tradition, Power, and Achievement values. Also, our analysis reveals biases in how LLMs perceive various demographic groups, deviating from real human data.

评估语言模型价值的基准之所以重要,是因为越来越需要更真实、更符合人类特点的答复;然而,现有的基准依赖于人或机器说明,这些说明容易受到与价值有关的偏差;此外,经过测试的情景往往与通常使用模型生成文本和表达价值的现实世界背景不同;为了解决这些问题,我们提议了价值纵向基准,这是评价LLMS价值取向的可靠框架,具有两个关键特征。首先,基准包括能够捕捉实际生活用户-LLM互动的项目,提高了评估结果与现实世界LLM使用的相关性。第二,每个项目都根据与自己想法的相似性按人类对象评级,以及这些评级与对象实际价值分数的关联性来评定。这种经过精神测定的验证方法确保项目与具体价值密切相关,作为评估这些价值的可靠项目。通过对44 LLMS基准进行评估,我们发现这些模型将损害、安全和自我理解价值列为优先事项,同时不那么强调对传统、权力和成就价值的强调。此外,我们的分析还揭示了人类数据在真实群体中如何认识人口学系中的偏差。
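
The validation step described above correlates each item's similarity ratings with the raters' measured value scores. A minimal sketch, assuming a Pearson correlation is used (the abstract does not name the coefficient) and with the hypothetical name `pearson_r`:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold one item's ratings across subjects and `ys` the same subjects' scores on a given value; items with a strong correlation would be kept as indicators of that value.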


Article 274

Title@2025-06-11 (3): Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features

Title: Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features Vergleich von Human- und LLM-Korrekturlesen in L2-Schrift: Auswirkungen auf lexikalische und syntaktische Merkmale L2 著述中比较人与LL2校对法和LLM校对法:对词汇和综合特征的影响 2506.09021v2

Authors (3): Hakyung Sung, Karla Csuros, Min-Chang Sung

This study examines the lexical and syntactic interventions of human and LLM proofreading aimed at improving overall intelligibility in identical second language writings, and evaluates the consistency of outcomes across three LLMs (ChatGPT-4o, Llama3.1-8b, Deepseek-r1-8b). Findings show that both human and LLM proofreading enhance bigram lexical features, which may contribute to better coherence and contextual connectedness between adjacent words. However, LLM proofreading exhibits a more generative approach, extensively reworking vocabulary and sentence structures, such as employing more diverse and sophisticated vocabulary and incorporating a greater number of adjective modifiers in noun phrases. The proofreading outcomes are highly consistent in major lexical and syntactic features across the three models.

本研究报告审查了人类和LLM校对的词汇和综合干预,目的是提高相同第二语言著作的总体智能,并评价三个LMM(ChatGPT-4o、Llama3.1-8b、Deepseek-r1-8b)的成果的一致性,研究结果显示,人类和LLLM校正都加强了大法词典特征,这可能有助于提高相邻词典之间的一致性和关联性。然而,LLM校对展示了一种更精细的方法,广泛修改词汇和句子结构,例如使用更加多样化和复杂的词汇,在名词词中加入更多形容词的修改词。校对结果在三大模式的主要词汇和合成词典特征中非常一致。


Article 275

Title@2025-06-11 (3): UD-KSL Treebank v1.3: A semi-automated framework for aligning XPOS-extracted units with UPOS tags

Title: UD-KSL Treebank v1.3: A semi-automated framework for aligning XPOS-extracted units with UPOS tags UD-KSL Treebank v1.3: Ein semi-automatisiertes Framework zur Ausrichtung von XPOS-extrahierten Einheiten mit UPOS-Tags UD-KSL 树库 v1.3:一个半自动框架,使XPOS抽取的单位与 UPOS 标记相匹配 2506.09009v2

Authors (5): Hakyung Sung, Gyu-Ho Shin, Chanyoung Lee, You Kyung Sung, Boo Kyung Jung

The present study extends recent work on Universal Dependencies annotations for second-language (L2) Korean by introducing a semi-automated framework that identifies morphosyntactic constructions from XPOS sequences and aligns those constructions with corresponding UPOS categories. We also broaden the existing L2-Korean corpus by annotating 2,998 new sentences from argumentative essays. To evaluate the impact of XPOS-UPOS alignments, we fine-tune L2-Korean morphosyntactic analysis models on datasets both with and without these alignments, using two NLP toolkits. Our results indicate that the aligned dataset not only improves consistency across annotation layers but also enhances morphosyntactic tagging and dependency-parsing accuracy, particularly in cases of limited annotated data.

本研究报告扩展了最近关于第二语言(L2)普遍依赖性说明的工作,采用了一个半自动化框架,用两个NLP工具包,确定来自XPOS序列的形态合成结构,并将这些结构与相应的UPOS类别相协调;我们还扩大了现有的L2-朝鲜文,从引证文章中加注2 998个新句子;为评价XPOS-UPOS校正调整的影响,我们用两个NLP工具包,对L2-朝鲜文的形态合成分析模型对有和没有校正的数据集的影响进行了微调。我们的结果表明,校正数据集不仅提高了注解层的一致性,而且还提高了形态合成标记和依赖性区分的准确性,特别是在有限的附加说明数据的情况下。


Article 276

Title@2025-06-11 (3): Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models

Title: Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models Token Constraint Decoding verbessert Robustheit bei der Fragebeantwortung für große Sprachmodelle 对大语言模型问题解答的强力 2506.09408v1

Authors (7): Jui-Ming Yao, Hao-Yuan Chen, Zi-Xian Tang, Bing-Jia Tan, Sheng-Wei Peng, Bing-Cheng Xie, Shun-Feng Su

Large Language Models (LLMs) have demonstrated impressive performance on multiple-choice question answering (MCQA) benchmarks, yet they remain highly vulnerable to minor input perturbations. In this paper, we introduce and evaluate Token Constraint Decoding (TCD). This simple yet effective inference-time algorithm enforces alignment between token-level predictions to enhance robustness in noisy settings. Through extensive experiments on CommonsenseQA, MMLU, and MMLU-Pro, we show that TCD, especially when paired with prompt engineering (PE) fixes, significantly restores performance degraded by input noise, yielding up to +39% absolute gains for weaker models like Gemma3 1B. Penalty sweep analyses further reveal that TCD implicitly regularizes overconfident outputs, with different models requiring distinct penalty schedules to maximize resilience. Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections and pave the way for more reliable deployment of LLMs in safety-critical or user-facing applications.

大型语言模型(LLMS)在多选择答题(MCQA)基准上表现出了令人印象深刻的成绩,但是它们仍然极易受到微小输入扰动的影响。在本文中,我们介绍并评价了Token Constrint Decoding(TCD)。这种简单而有效的推论时间算法在象征性的预测之间进行了调整,以加强噪音环境中的稳健性。通过对ComensenseQA、MMMLU和MMLU-Pro的广泛实验,我们表明,TCD,特别是在与快速工程(PE)修复相结合时,大大恢复了因输入噪音而退化的性能,使Gemma3 1B等较弱模型的绝对增益达到+39。 刑罚扫荡分析进一步显示,TCD隐含着对超自信产出的规范,不同的模型需要不同的惩罚时间表才能最大限度地提高弹性。我们的调查结果将TCD确立为一种实用的、示范的、不可知性的方法,在现实世界的缺陷下改进推理稳定性,并为更可靠地将LLMs应用于安全或用户法的应用铺路。
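
The abstract does not spell out how TCD applies its constraint. One plausible reading, assumed here and not confirmed by the source, is that a penalty is subtracted from the logits of tokens outside the allowed answer set before the next token is chosen. The sketch below illustrates that idea on a toy vocabulary; `constrain_logits`, the token names, and the penalty value are all hypothetical.

```python
import math

def constrain_logits(logits, allowed, penalty):
    """Subtract `penalty` from the logit of every token outside `allowed`."""
    return {tok: (score if tok in allowed else score - penalty)
            for tok, score in logits.items()}

def softmax(logits):
    """Numerically stable softmax over a token -> logit mapping."""
    top = max(logits.values())
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy next-token logits for a four-option question; "Sure" stands in for
# an off-format continuation that noise in the prompt has made likely.
logits = {"A": 1.0, "B": 2.0, "C": 0.5, "D": 0.2, "Sure": 3.0}
allowed = {"A", "B", "C", "D"}

unconstrained = max(logits, key=logits.get)
probs = softmax(constrain_logits(logits, allowed, penalty=10.0))
best = max(probs, key=probs.get)
print(unconstrained, "->", best)
```

A penalty sweep in this setting would repeat the last three lines for several `penalty` values and record accuracy for each one.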


Article 277

Title@2025-06-11 (3): Code-Switching Curriculum Learning for Multilingual Transfer in LLMs

Title: Code-Switching Curriculum Learning for Multilingual Transfer in LLMs Code-Switching Curriculum Learning für den mehrsprachigen Transfer in LLMs 以LLMM为多种语言转让进行代码转换课程学习 2411.02460v2

Authors (5): Haneul Yoo, Cheonbok Park, Sangdoo Yun, Alice Oh, Hwaran Lee

Large language models (LLMs) now exhibit near human-level performance in various tasks, but their performance drops drastically after a handful of high-resource languages due to the imbalance in pre-training data. Inspired by the human process of second language acquisition, particularly code-switching (the practice of language alternation in a conversation), we propose code-switching curriculum learning (CSCL) to enhance cross-lingual transfer for LLMs. CSCL mimics the stages of human language learning by progressively training models with a curriculum consisting of 1) token-level code-switching, 2) sentence-level code-switching, and 3) monolingual corpora. Using Qwen 2 as our underlying model, we demonstrate the efficacy of the CSCL in improving language transfer to Korean, achieving significant performance gains compared to monolingual continual pre-training methods. Ablation studies reveal that both token- and sentence-level code-switching significantly enhance cross-lingual transfer and that curriculum learning amplifies these effects. We also extend our findings into various languages, including Japanese (high-resource) and Indonesian (low-resource), and using two additional models (Gemma 2 and Phi 3.5). We further show that CSCL mitigates spurious correlations between language resources and safety alignment, presenting a robust, efficient framework for more equitable language transfer in LLMs. We observe that CSCL is effective for low-resource settings where high-quality, monolingual corpora for language transfer are hardly available.

大型语言模式(LLMS)现在在各种任务中接近人文水平的表现,但由于培训前的数据不平衡,在少数高资源语言之后,其绩效急剧下降。受第二语言获取的人类过程的启发,特别是以代码转换$\uncode{x2014}美元为主的人工流程,我们提议在语言转换课程学习中采用语言交替做法,以加强LLMS的跨语言调任。 CSCL通过逐步培训模式来模拟人类语言学习的各个阶段,培训模式包括:(1) 代号代码转换,(2) 句级代码转换,和(3) 单语种公司。我们用Qwen 2作为我们的基本模式,展示CSCL在改进向朝鲜语言转换方面的效力,与单语持续培训前方法相比,取得了显著的业绩收益。 校正研究显示,象征性和句级代码转换大大加强了跨语言框架的跨语言转移,而且课程学习也增加了这些效果。 我们还将我们的调查结果扩展为多种语言的代号,包括高资源(高资源)和印度尼西亚的直径(我们展示了高资源)之间的双级的Coral-real-real-real-real-reseration)。
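
The first curriculum stage, token-level code-switching, can be sketched as dictionary substitution over a tokenized sentence. The abstract does not describe how the code-switched data is actually produced, so the bilingual lexicon, the `ratio` parameter, and the helper `token_level_code_switch` below are illustrative assumptions only.

```python
import random

def token_level_code_switch(tokens, lexicon, ratio, rng):
    """Replace each token found in `lexicon` with its translation,
    with probability `ratio`; all other tokens are left unchanged."""
    switched = []
    for tok in tokens:
        if tok in lexicon and rng.random() < ratio:
            switched.append(lexicon[tok])
        else:
            switched.append(tok)
    return switched

# Tiny English -> Korean lexicon, for illustration only.
lexicon = {"cat": "고양이", "water": "물"}
tokens = "the cat drinks water".split()

# ratio=1.0 switches every token that has a lexicon entry.
switched = token_level_code_switch(tokens, lexicon, ratio=1.0,
                                   rng=random.Random(0))
print(" ".join(switched))
```

A curriculum in the spirit of the abstract would train on such token-level mixtures first, then on sentence-level mixtures, and finally on monolingual text in the target language.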


Article 278

Title@2025-06-11 (3): Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding

Title: Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding Code-Switching Red-Teaming: LLM-Bewertung für Sicherheit und mehrsprachiges Verständnis 编码转换红队:LLM 安全和多语文理解评价 2406.15481v3

Authors (3): Haneul Yoo, Yongjin Yang, Hwaran Lee

As large language models (LLMs) have advanced rapidly, concerns regarding their safety have become prominent. In this paper, we discover that code-switching, a common practice in natural language, can effectively elicit undesirable behaviors from LLMs when used in red-teaming queries. We introduce a simple yet effective framework, CSRT, to synthesize code-switching red-teaming queries and investigate the safety and multilingual understanding of LLMs comprehensively. Through extensive experiments with ten state-of-the-art LLMs and code-switching queries combining up to 10 languages, we demonstrate that the CSRT significantly outperforms existing multilingual red-teaming techniques, achieving 46.7% more attacks than standard attacks in English and being effective in conventional safety domains. We also examine the multilingual ability of those LLMs to generate and understand code-switching texts. Additionally, we validate the extensibility of the CSRT by generating code-switching attack prompts with monolingual data. Finally, we conduct detailed ablation studies exploring code-switching and point to an unintended correlation between the resource availability of languages and safety alignment in existing multilingual LLMs.

随着大型语言模式(LLMS)的迅速发展,人们对其安全的关切变得十分突出。在本文中,我们发现,在红队询问中代码转换能够有效地引起LMS的不良行为,这是自然语言的常见做法。我们引入了一个简单而有效的框架(CSRT),将红色队询问代码转换成代码并全面调查LLMS的安全和多语种理解问题。通过对10种最先进的LMS和由10种语言组成的代码转换查询的广泛实验,我们证明CSRT大大超越了现有的多语种红队技术,比标准的英语攻击增加了46.7%,在常规安全领域也有效。我们还审查了这些LMMs的多语种能力,以生成和理解代码转换文本。此外,我们通过生成单语种数据来生成代码转换攻击信号来验证CSRT的可扩展性。我们最后进行了详细的校正研究,探索语言资源可用性与现有多语种LMS的安全协调之间的代码转换和意外关联性。


Article 279

Title@2025-06-11 (3): DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing

Title: DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing DRESS: Datensatz für Rubric-basierte Essay-Scoring auf EFL-Writing DREsS: 以Ruric为基础的ELFL 写法日志分类的数据集 2402.16733v3

Authors (4): Haneul Yoo, Jieun Han, So-Yeon Ahn, Alice Oh

Automated essay scoring (AES) is a useful tool in English as a Foreign Language (EFL) writing education, offering real-time essay scores for students and instructors. However, previous AES models were trained on essays and scores irrelevant to the practical scenarios of EFL writing education and usually provided a single holistic score due to the lack of appropriate datasets. In this paper, we release DREsS, a large-scale, standard dataset for rubric-based automated essay scoring with 48.9K samples in total. DREsS comprises three sub-datasets: DREsS_New, DREsS_Std., and DREsS_CASE. We collect DREsS_New, a real-classroom dataset with 2.3K essays authored by EFL undergraduate students and scored by English education experts. We also standardize existing rubric-based essay scoring datasets as DREsS_Std. We suggest CASE, a corruption-based augmentation strategy for essays, which generates 40.1K synthetic samples of DREsS_CASE and improves the baseline results by 45.44%. DREsS will enable further research to provide a more accurate and practical AES system for EFL writing education.

自动作文评分(AES)是英语中作为外语写作教育的一种有用工具,为学生和教员提供实时作文评分,然而,以前的AES模型在与ELF写作教育实际情景无关的作文和分数方面接受了培训,并通常由于缺乏适当的数据集而提供单一的整体评分。在本文中,我们发行了红文自动作文评评分的大规模标准数据集,共有48.9K个样本。DRES包括三个子数据集:DRES_New,DRES_Std.和DRESS_CASE。我们收集了DRES_New, 和DRESS_New, 和DRESS_CASE。我们收集了DRES_New,一个真实的教室数据集,由ELF本科学生编写并由英国教育专家得分的2.3K论文。我们还将现有的红文作文评分评分数据集标准化为DRES_Std。我们建议CASE,一个基于腐败的扩增策略,它生成了40.1K DRES_CASASASE的合成样本, 和改进了EFLS的基线结果, 提供更精确的DRES。


Article 280

Title@2025-06-11 (3): AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI)

Title: AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI) AdversariaL attacK sAfety alignment(ALKALI): Sicherung von LLMs durch GRACE: Geometrische Darstellung-Bewusst Kontrastive Verbesserung- Einführung von Adversarial Vulnerability Quality Index (AVQI) Adversarial attacK sAfety alignment(ALKALI):通过GRACE来保护LLMS:几何代表-提高软件对抗性增强-采用反脆弱性质量指数(AVQI) 2506.08885v2

Authors (7): Danush Khanna, Krishna Kumar, Basab Ghosh, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das

Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent thereby evading surface level defenses like Direct Preference Optimization (DPO), which remain blind to the latent geometry. We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families. Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open and closed source models, exposing an underlying vulnerability we term latent camouflage, a structural blind spot where adversarial completions mimic the latent geometry of safe ones. To mitigate this vulnerability, we introduce GRACE - Geometric Representation Aware Contrastive Enhancement, an alignment framework coupling preference learning with latent space regularization. GRACE enforces two constraints: latent separation between safe and adversarial completions, and adversarial cohesion among unsafe and jailbreak behaviors. These operate over layerwise pooled embeddings guided by a learned attention profile, reshaping internal geometry without modifying the base model, and achieve up to 39% ASR reduction. Moreover, we introduce AVQI, a geometry aware metric that quantifies latent alignment failure via cluster separation and compactness. AVQI reveals when unsafe completions mimic the geometry of safe ones, offering a principled lens into how models internally encode safety. We make the code publicly available at https://anonymous.4open.science/r/alkali-B416/README.md.

对LLMS的反向威胁正在以比当前防御能够适应的更快的速度升级。我们暴露了一个临界的几何盲点:对21个主要LMS的评估显示,在开放和封闭源模式中,攻击成功率都惊人地高得惊人,将危险地嵌入安全代表系统,同时对不安全的意图进行编码,从而回避潜在几何学仍然看不见的Direct Perect Aproptical(DPO)等表面防御系统。我们引入了ALKALALI,这是第一个严格调整的对抗基准,也是迄今为止最全面的框架,涵盖三个宏观类别、六个子类和十五个攻击家庭。对21个主要LMS的评估显示,在开放和封闭源模式中,攻击成功率都惊人地高得惊人,在开放和封闭源模式中,暴露了一个潜在的弱点,我们称之为潜伏迷彩色迷彩,一个结构盲点,在隐含潜在安全度的地理代表系统增强能力方面,我们引入了GRACE(GR)的组合框架,在不透视线上,在不透视线中,我们开始调整了对地基数据进行内部分析时,在缩小的排序中,我们开始关注。
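
The abstract says AVQI quantifies latent alignment failure "via cluster separation and compactness" but gives no formula. The sketch below is therefore not AVQI itself. It is a generic separation-to-spread ratio on toy 2-D embeddings, shown only to make the two ingredients concrete; every name and number in it is an assumption.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def compactness(points):
    """Mean distance of the points to their own centroid (lower = tighter)."""
    c = centroid(points)
    return sum(dist(p, c) for p in points) / len(points)

def separation_score(safe, unsafe):
    """Centroid gap divided by total within-cluster spread.
    A small score means unsafe embeddings sit close to the safe cluster."""
    gap = dist(centroid(safe), centroid(unsafe))
    spread = compactness(safe) + compactness(unsafe)
    return gap / spread

# Toy embeddings of safe and unsafe completions.
safe = [(0.0, 0.0), (0.0, 2.0)]
unsafe = [(4.0, 0.0), (4.0, 2.0)]

score = separation_score(safe, unsafe)
print(score)
```

In this toy picture, "latent camouflage" corresponds to a score near zero: the unsafe cluster overlaps the safe one even though the completions differ in intent.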


Article 281

Title@2025-06-11 (3): Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach

Title: Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach Zuverlässige Proof-Generation mit LLMs: Ein neuro-symbolischer Ansatz 努力利用LLM女士实现可靠的证据生产:神经-双曲方法 2505.14479v3

Authors (3): Oren Sultan, Eitan Stern, Dafna Shahaf

Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs’ generative strengths with structured components to overcome this challenge. As a proof-of-concept, we focus on geometry problems. Our approach is two-fold: (1) we retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. We demonstrate that our method significantly improves proof accuracy for OpenAI’s o1 model (58%-70% improvement); both analogous problems and the verifier’s feedback contribute to these gains. More broadly, shifting to LLMs that generate provably correct conclusions could dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.

大型语言模型(LLMS)与需要严格逻辑推算和符号推理的正式领域(如数学校准生成)相抗衡,我们建议一种神经-顺理成章的方法,将LLMS的基因特长与结构化组成部分相结合,以克服这一挑战。作为一个概念的证明,我们侧重于几何问题。我们的方法有两个方面:(1) 我们找出类似的问题,用其证明来指导LLM;(2) 一个正式的核查员评估产生的证明并提供反馈,帮助模型修正错误的证明。我们证明,我们的方法大大提高了O1模型(58%-70%的改进率)的证明准确性; 相似的问题和验证人的反馈都有助于这些成就。 更广泛地说,转向能够产生可辨别正确结论的LLMs可以大大提高其可靠性、准确性和一致性,解锁复杂的任务和需要信任的关键性真实世界应用。


Article 282

Title@2025-06-11 (3): Chem42: a Family of chemical Language Models for Target-aware Ligand Generation

Title: Chem42: a Family of chemical Language Models for Target-aware Ligand Generation Chem42: eine Familie chemischer Sprachmodelle für die zielbewusste Ligand-Generation Chem42: “ 目标觉知长者 “ 化学语言模式家庭 2503.16563v2

Authors (7): Aahan Singh, Engin Tekin, Maryam Nadeem, Nancy A. ElNaker, Mohammad Amaan Sayeed, Natalia Vassilieva, Boulbaba Ben Amor

Revolutionizing drug discovery demands more than just understanding molecular interactions - it requires generative models that can design novel ligands tailored to specific biological targets. While chemical Language Models (cLMs) have made strides in learning molecular properties, most fail to incorporate target-specific insights, restricting their ability to drive de-novo ligand generation. Chem42, a cutting-edge family of generative chemical Language Models, is designed to bridge this gap. By integrating atomic-level interactions with multimodal inputs from Prot42, a complementary protein Language Model, Chem42 achieves a sophisticated cross-modal representation of molecular structures, interactions, and binding patterns. This innovative framework enables the creation of structurally valid, synthetically accessible ligands with enhanced target specificity. Evaluations across diverse protein targets confirm that Chem42 surpasses existing approaches in chemical validity, target-aware design, and predicted binding affinity. By reducing the search space of viable drug candidates, Chem42 could accelerate the drug discovery pipeline, offering a powerful generative AI tool for precision medicine. Our Chem42 models set a new benchmark in molecule property prediction, conditional molecule generation, and target-aware ligand design. The models are publicly available at huggingface.co/inceptionai.

将药物发现革命化不仅需要理解分子互动,而且还需要能够设计适合特定生物目标的新颖的结链和成型的基因模型。虽然化学语言模型(cLMS)在学习分子特性方面取得了长足的进步,但大多数没有纳入特定目标的洞察力,限制了其驱动新奥尔良一代的能力。Chem42是基因化化学语言模型的前沿组合,旨在弥合这一差距。Chem42通过将原子级互动与来自Prot42(一种补充性蛋白语言模型)的多式联运投入结合起来,可以实现分子结构、互动和约束模式的复杂跨模式代表。这个创新框架使得能够创建结构上有效、合成易懂的结链和强化目标特性。各种蛋白指标的评估证实,Chem42超越了化学品有效性、目标认知设计以及预测的紧密性的现有方法。通过减少可行的药物候选者的搜索空间,Chem42可以加速药物发现管道,为精密医学提供强大的基因化AI工具。我们的Chem42模型在分子财产预测、有条件的分子生成和目标定位模型设计上设定了新的基准。


Article 283

Title@2025-06-11 (3): A Hierarchical Probabilistic Framework for Incremental Knowledge Tracing in Classroom Settings

Title: A Hierarchical Probabilistic Framework for Incremental Knowledge Tracing in Classroom Settings Ein hierarchisches probabilistisches Framework für inkrementelles Wissen in Klassenzimmereinstellungen 课堂环境中递增知识追踪的等级概率框架 2506.09393v1

Authors (7): Xinyi Gao, Qiucheng Wu, Yang Zhang, Xuechen Liu, Kaizhi Qian, Ying Xu, Shiyu Chang

Knowledge tracing (KT) aims to estimate a student’s evolving knowledge state and predict their performance on new exercises based on performance history. Many realistic classroom settings for KT are typically low-resource in data and require online updates as students’ exercise history grows, which creates significant challenges for existing KT approaches. To restore strong performance under low-resource conditions, we revisit the hierarchical knowledge concept (KC) information, which is typically available in many classroom settings and can provide a strong prior when data are sparse. We therefore propose Knowledge-Tree-based Knowledge Tracing (KT$^2$), a probabilistic KT framework that models student understanding over a tree-structured hierarchy of knowledge concepts using a Hidden Markov Tree Model. KT$^2$ estimates student mastery via an EM algorithm and supports personalized prediction through an incremental update mechanism as new responses arrive. Our experiments show that KT$^2$ consistently outperforms strong baselines in realistic online, low-resource settings.

知识追踪(KT)旨在估计学生不断演变的知识状况,并根据业绩历史预测其业绩。许多现实的KT课堂环境通常在数据方面资源较少,需要随着学生运动历史的增长而在线更新,这给现有的KT方法带来了重大挑战。为了在资源匮乏的情况下恢复强劲的绩效,我们重新审视等级知识概念(KC)信息,这种信息通常在许多课堂环境中都有,在数据稀少之前可以提供强有力的信息。因此,我们提议以知识为基础的知识追踪(KT$%2美元),这是一个概率性KT框架,用以模拟学生对使用隐藏的Markov树模型的树形结构知识概念的理解。KT$2美元估算学生通过EM算法掌握学生的掌握能力,并在新的响应到达时通过渐进式更新机制支持个性化预测。我们的实验显示,KT$2美元始终在现实的在线、低资源环境中超过强大的基线。


Article 284

Title@2025-06-11 (3): Comparing human and LLM politeness strategies in free production

Title: Comparing human and LLM politeness strategies in free production Vergleich zwischen menschlichen und LLM-Richtungsstrategien in der freien Produktion 在自由生产方面比较人和LLM的礼貌战略 2506.09391v1

Authors (2): Haoran Zhao, Robert D. Hawkins

Polite speech poses a fundamental alignment challenge for large language models (LLMs). Humans deploy a rich repertoire of linguistic strategies to balance informational and social goals – from positive approaches that build rapport (compliments, expressions of interest) to negative strategies that minimize imposition (hedging, indirectness). We investigate whether LLMs employ a similarly context-sensitive repertoire by comparing human and LLM responses in both constrained and open-ended production tasks. We find that larger models ($\ge$70B parameters) successfully replicate key preferences from the computational pragmatics literature, and human evaluators surprisingly prefer LLM-generated responses in open-ended contexts. However, further linguistic analyses reveal that models disproportionately rely on negative politeness strategies even in positive contexts, potentially leading to misinterpretations. While modern LLMs demonstrate an impressive handle on politeness strategies, these subtle differences raise important questions about pragmatic alignment in AI systems.

对大型语言模式(LLMs)来说,自然语言演讲是一个根本性的一致挑战。 人类运用了大量丰富的语言战略来平衡信息和社会目标 – – 从建立和谐(赞美、表示兴趣)的积极方法到尽量减少强加(隐藏、间接)的消极战略;我们调查LMs是否通过比较人类和LLM在受限制和不限生产任务中的反应,使用类似背景敏感的模式。我们发现,较大的模型($70B参数)成功地复制了计算实用文献中的关键偏好,而人类评价者则出乎意料地偏爱在开放环境中的LLM产生的反应。然而,进一步的语言分析显示,即使在积极的环境下,模型也不成比例地依赖消极的礼貌战略,可能导致误解。虽然现代LMs对礼貌战略的处理令人印象深刻,但这些微妙的差别提出了在AI系统上务实一致的重要问题。


Article 285

Title@2025-06-11 (3): MMREC: LLM Based Multi-Modal Recommender System

Title: MMREC: LLM Based Multi-Modal Recommender System MMREC: LLM-basiertes Multi-Modal-Empfängersystem MMREC: 以LLM为基础的多模式建议系统 2408.04211v2

Authors (4): Jiahao Tian, Jinman Zhao, Zhenkai Wang, Zhicheng Ding

The importance of recommender systems is growing rapidly due to the exponential increase in the volume of content generated daily. This surge in content presents unique challenges for designing effective recommender systems. Key among these challenges is the need to effectively leverage the vast amounts of natural language data and images that represent user preferences. This paper presents a novel approach to enhancing recommender systems by leveraging Large Language Models (LLMs) and deep learning techniques. The proposed framework aims to improve the accuracy and relevance of recommendations by incorporating multi-modal information processing and by the use of unified latent space representation. The study explores the potential of LLMs to better understand and utilize natural language data in recommendation contexts, addressing the limitations of previous methods. The framework efficiently extracts and integrates text and image information through LLMs, unifying diverse modalities in a latent space to simplify the learning process for the ranking model. Experimental results demonstrate the enhanced discriminative power of the model when utilizing multi-modal information. This research contributes to the evolving field of recommender systems by showcasing the potential of LLMs and multi-modal data integration to create more personalized and contextually relevant recommendations.

由于每天生成的内容数量成倍增加,推荐人系统的重要性正在迅速增长。内容的激增为设计有效的推荐人系统带来了独特的挑战。主要的挑战之一是需要有效利用代表用户偏好的大量自然语言数据和图像。本文件介绍了一种新颖的方法,通过利用大语言模型和深层学习技术来加强推荐人系统。拟议框架旨在通过纳入多模式信息处理和使用统一的潜在空间代表来提高建议的准确性和相关性。研究探索了LLMS在建议环境中更好地了解和利用自然语言数据的潜力,解决了以往方法的局限性。框架通过LLMS有效地提取和整合文本和图像信息,在潜在空间统一了各种模式,以简化排名模型的学习过程。实验结果表明模型在使用多模式信息时的强化歧视力量。这项研究通过展示LMS的潜力和多模式数据整合,为建议人性化和与背景相关的建议提供更有针对性的建议,从而有助于建议人系统不断发展的领域。


Article 286

Title@2025-06-11 (3): Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey

Title: Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey Verbesserung von Code LLMs mit Verstärkungslernen in der Codegenerierung: Eine Umfrage 增强法典制定中强化学习的加强守则LLMS 代码生成:调查 2412.20367v4

Authors (19): Junqiao Wang, Zeng Zhang, Yangfan He, Zihao Zhang, Yuyang Song, Tianyu Shi, Yuchen Li, Hengyuan Xu, Kunyu Wu, Xin Yi, Zhongwei Wan, Xinhang Yuan, Kuan Lu, Menghao Huo, Tang Jingqun, Guangwu Qian, Keqin Li, Qiuwu Chen, Lewei He

Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing large language models (LLMs) in code generation and optimization. This survey systematically reviews RL-driven techniques across the code development lifecycle, from compiler-level optimizations and resource allocation strategies to end-to-end code synthesis frameworks. We first examine classical and modern RL algorithms – spanning policy gradients, actor-critic methods, human-feedback alignment, and preference-based optimization – and their adaptations to the unique challenges of code generation, such as sparse and delayed rewards. Next, we analyze key benchmarks, datasets, and evaluation metrics that drive progress in RL-augmented Code LLMs. Finally, we identify open problems, including the need for richer feedback sources, support for low-level and domain-specific languages, and methods to reduce computational overhead. By consolidating current insights and outlining future directions, this work aims to guide researchers and practitioners in leveraging RL to produce more robust, efficient, and human-aligned code generation systems.

强化学习(RL)已成为在代码生成和优化方面加强大型语言模型(LLM)的强大范例,这项调查系统地审查了代码开发生命周期中由RL驱动的技术,从汇编者一级的优化和资源分配战略到端到端至端代码合成框架。我们首先审查传统和现代RL算法 – – 涵盖政策梯度、行为体-批评方法、人肉背对齐和基于优惠的优化 – – 以及这些算法适应代码生成的独特挑战,如微弱和延迟的奖励。接着,我们分析了推动RL强化代码LM取得进展的关键基准、数据集和评价指标。最后,我们查明了一些尚未解决的问题,包括需要更丰富的反馈来源、支持低层次和特定领域语言以及减少计算间接费用的方法。通过整合目前的见解和概述未来方向,这项工作旨在指导研究人员和从业人员利用RL生成更健全、高效和与人接轨的代码生成系统。


Article 287

Title: Binary classification for perceived quality of headlines and links on worldwide news websites, 2018-2024 Binäre Klassifizierung für wahrgenommene Qualität von Schlagzeilen und Links auf weltweiten News-Websites, 2018-2024 2018-2024年全世界新闻网站头版新闻和链接的预期质量和链接的二进制分类,2018-2024年 2506.09381v1

Authors (4): Austin McCutcheon, Thiago E. A. de Oliveira, Aleksandr Zheleznov, Chris Brogly

The proliferation of online news enables potential widespread publication of perceived low-quality news headlines/links. As a result, we investigated whether it was possible to automatically distinguish perceived lower-quality news headlines/links from perceived higher-quality headlines/links. We evaluated twelve machine learning models on a binary, balanced dataset of 57,544,214 worldwide news website links/headings from 2018-2024 (28,772,107 per class) with 115 extracted linguistic features. Binary labels for each text were derived from scores based on expert consensus regarding the respective news domain quality. Traditional ensemble methods, particularly the bagging classifier, had strong performance (88.1% accuracy, 88.3% F1, 80/20 train/test split). Fine-tuned DistilBERT achieved the highest accuracy (90.3%, 80/20 train/test split) but required more training time. The results suggest that both NLP features with traditional classifiers and deep learning models can effectively differentiate perceived news headline/link quality, with some trade-off between predictive performance and train time.

因此,我们调查了是否有可能将低质量新闻头条/链接与高质量新闻头条/链接自动区分开来。我们用2018-2024年全球57 544 214个新闻网站链接/标题的双元平衡数据集对12个机器学习模式进行了评价,该数据集来自2018-2024年全球新闻链接/标题(每类28 772 107个),有115个提取语言特征。每种文本的二进制标签来自专家对各新闻领域质量的共识。传统的混合方法,特别是加袋式分类方法,表现优异(88.1%的准确度,88.3%的F1,80/20列车/测试拆分 ) 。精细调的DTISTERT实现了最高精度(90.3%,80/20列/测试拆分),但需要更多培训时间。结果显示,与传统分类和深层次学习模式的NLP特征,都能有效地区分人们所察觉的新闻头条/链接质量,在预测性业绩与培训时间之间有所取舍。
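
The accuracy and F1 figures quoted above are standard binary-classification metrics. As a reminder of how they are computed from held-out predictions, here is a minimal sketch; the labels and predictions are invented and the helper name `accuracy_and_f1` is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return correct / len(pairs), f1

# Made-up test-split labels (1 = perceived higher quality) and predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)
```

On a balanced dataset such as the one described, accuracy and F1 tend to be close to each other, which matches the near-identical figures reported for the bagging classifier.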


Article 288

Title@2025-06-11 (3): Measuring What Makes You Unique: Difference-Aware User Modeling for Enhancing LLM Personalization

Title: Measuring What Makes You Unique: Difference-Aware User Modeling for Enhancing LLM Personalization Messen, was Sie einzigartig macht: Difference-Aware User Modeling zur Verbesserung der LLM Personalisierung 衡量什么使你独一之处:提高LLM个性化的差异型用户模型 2503.02450v3

Authors (8): Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yimeng Bai, Wenjie Wang, Hong Cheng, Fuli Feng, Tat-Seng Chua

Personalizing Large Language Models (LLMs) has become a critical step in facilitating their widespread application to enhance individual life experiences. In pursuit of personalization, distilling key preference information from an individual’s historical data as instructional preference context to customize LLM generation has emerged as a promising direction. However, these methods face a fundamental limitation by overlooking the inter-user comparative analysis, which is essential for identifying the inter-user differences that truly shape preferences. To address this limitation, we propose Difference-aware Personalization Learning (DPL), a novel approach that emphasizes extracting inter-user differences to enhance LLM personalization. DPL strategically selects representative users for comparison and establishes a structured standard to extract meaningful, task-relevant differences for customizing LLM generation. Extensive experiments on real-world datasets demonstrate that DPL significantly enhances LLM personalization. We release our code at https://github.com/SnowCharmQ/DPL.

个人大语言模型(LLMs)已成为促进广泛应用以提升个人生活经验的关键步骤,在追求个性化的过程中,从个人历史数据中提取关键偏好信息,作为自定义LLM生成的教学偏好,已成为一个有希望的方向,然而,这些方法面临一个根本性的局限性,因为忽略了用户之间的比较分析,而这种分析对于确定真正影响偏爱的用户之间的差异至关重要。为了解决这一局限性,我们提议采用一种新颖的办法,即强调提取用户之间的差异,以加强LLM个性化。DPL从战略上选择有代表性的用户,以便进行比较,并建立一个结构化标准,为自定义LM生成找到有意义的、与任务相关的差异。关于现实世界数据集的广泛实验表明,DPL大大加强了LM个性化。我们在https://github.com/SnowCharmQ/DPL公布我们的代码。


Article 289

Title@2025-06-11 (3): Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering

Title: Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering Entriegelung der allgemeinen Long Chain-of-Thought-Reasoning-Fähigkeiten großer Sprachmodelle über Representation Engineering 通过代表制工程,解决大语言模型通过代表制工程所探索的一般长 长 长 长 问 问 问 程 理由 能力 2503.11314v2

Authors (8): Xinyu Tang, Xiaolei Wang, Zhihao Lv, Yingqian Min, Wayne Xin Zhao, Binbin Hu, Ziqi Liu, Zhiqiang Zhang

Recent advancements in long chains of thought (long CoTs) have significantly improved the reasoning capabilities of large language models (LLMs). Existing work finds that the capability of long CoT reasoning can be efficiently elicited by tuning on only a few examples and can easily transfer to other tasks. This motivates us to investigate whether long CoT reasoning is a general capability for LLMs. In this work, we conduct an empirical analysis for this question from the perspective of representation. We find that LLMs do encode long CoT reasoning as a general capability, with a clear distinction from vanilla CoTs. Furthermore, domain-specific representations are also required for the effective transfer of long CoT reasoning. Inspired by these findings, we propose GLoRE, a novel representation engineering method to unleash the general long CoT reasoning capabilities of LLMs. Extensive experiments demonstrate the effectiveness and efficiency of GLoRE in both in-domain and cross-domain scenarios.

现有工作发现,只有对几个例子进行调整,才能有效地激发长期的理论推理能力,并很容易地将其转移到其他任务中。这促使我们调查长期的理论推理是否是长期的理算能力。在这项工作中,我们从代表性的角度对这一问题进行了经验分析。我们发现,LLMS将长期的理论推理作为一种一般能力进行编码,与香草COT有明显的区别。此外,为了有效转让长期的理论,还需要具体领域的表述。我们建议GLORE,这是一种新的表述工程方法,以释放长期的科技推理能力。 广泛的实验表明GLORE在现场和跨场情景中的效果和效率。


Article 290

Title@2025-06-11 (3): Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining

Title: Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining Modality-Balancing Preference Optimierung großer multimodaler Modelle durch disversariale Negative Mining 通过反反向负负采矿优化大型多式联运模型 2506.08022v2

Authors (7): Chenxi Liu, Tianyi Xiong, Ruibo Chen, Yihan Wu, Junfeng Guo, Tianyi Zhou, Heng Huang

The task adaptation and alignment of Large Multimodal Models (LMMs) have been significantly advanced by instruction tuning and further strengthened by recent preference optimization. Yet, most LMMs still suffer from severe modality imbalance during reasoning, i.e., language prior biases outweigh visual inputs, which bottlenecks their generalization to downstream tasks and causes hallucinations. Existing preference optimization approaches for LMMs do not focus on restraining the internal biases of their Large Language Model (LLM) backbones when curating the training data. Moreover, they heavily rely on offline data and lack the capacity to explore diverse responses adaptive to dynamic distributional shifts during training. Meanwhile, Group Relative Policy Optimization (GRPO), a recent method using online-generated data and verified rewards to improve reasoning capabilities, remains largely underexplored in LMM alignment. In this paper, we propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases due to limited usage of visual information, through adversarial perturbation of input images. Moreover, MBPO leverages the easy-to-verify nature of close-ended tasks to generate online responses with verified rewards. GRPO is then employed to train the model with offline-online hybrid data. Extensive experiments demonstrate that MBPO can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.

大型多模态模型(LMM)的任务适应和对齐已经通过指令调整得到显著推进,并通过最近的偏好优化得到进一步加强;然而,大多数LMM在推理过程中仍然存在严重的模态不平衡,即语言先验偏差压倒视觉输入,这阻碍了对下游任务的泛化并造成幻觉。然而,现有LMM的偏好优化办法在整理训练数据时并不侧重于限制其大语言模型(LLM)主干的内部偏向。此外,它们严重依赖离线数据,缺乏能力,无法探索适应训练期间动态分布变化的各种反应。同时,Group Relative Policy Optimization(GRPO)是最近使用在线生成的数据和经核实的奖励以提高推理能力的一种方法,但在LMM对齐方面基本上尚未得到充分探索。在本文中,我们提议一个新的偏好学习框架,即Modality-Balancing Preference Optimization(MBPO),以解决LMM的模态不平衡。MBPO通过对输入图像进行对抗性扰动来生成困难负样本,即因视觉信息利用有限而被LLM偏向误导的被拒绝回答,从而构建更有效的离线偏好数据集。此外,MBPO利用封闭式任务易于验证的特点,生成带有经核实奖励的在线回答,然后采用GRPO以离线与在线混合数据训练模型。广泛的实验表明,MBPO可以提升LMM在具有挑战性的视觉语言任务上的表现,并有效减少幻觉。


Article 291

Title@2025-06-11 (3): DecIF: Improving Instruction-Following through Meta-Decomposition

Title: DecIF: Improving Instruction-Following through Meta-Decomposition DecIF: Verbesserung der Instruktions-Folge durch Meta-Dekomposition DecIF:通过元分解改进指令遵循 2505.13990v2

Authors (7): Tingfeng Hui, Pengyu Zhu, Bowen Ping, Ling Tang, Guanting Dong, Yaqi Zhang, Sen Su

Instruction-following has emerged as a crucial capability for large language models (LLMs). However, existing approaches often rely on pre-existing documents or external resources to synthesize instruction-following data, which limits their flexibility and generalizability. In this paper, we introduce DecIF, a fully autonomous, meta-decomposition guided framework that generates diverse and high-quality instruction-following data using only LLMs. DecIF is grounded in the principle of decomposition. For instruction generation, we guide LLMs to iteratively produce various types of meta-information, which are then combined with response constraints to form well-structured and semantically rich instructions. We further utilize LLMs to detect and resolve potential inconsistencies within the generated instructions. Regarding response generation, we decompose each instruction into atomic-level evaluation criteria, enabling rigorous validation and the elimination of inaccurate instruction-response pairs. Extensive experiments across a wide range of scenarios and settings demonstrate DecIF’s superior performance on instruction-following tasks. Further analysis highlights its strong flexibility, scalability, and generalizability in automatically synthesizing high-quality instruction data.

然而,现有方法往往依赖原有文件或外部资源来综合教学跟踪数据,从而限制其灵活性和可概括性。在本文件中,我们引入了完全自主的元分解指导框架,即仅使用LLMS生成多样化和高质量教学跟踪数据的一个完全自主的元分解框架。DecIF基于分解原则。关于教学生成,我们指导LIMS反复生成各种类型的元信息,然后与反应制约相结合,形成结构完善和内容丰富的指令。我们进一步利用LOMS来探测和解决生成指令中的潜在不一致之处。关于反应生成,我们将每项指令分解为原子级评估标准,允许严格验证和消除不准确的指令对应配对。在广泛的情景和环境中进行的广泛实验表明DecIF在完成教学任务方面的优异性表现。进一步分析突出了其强大的灵活性、可伸缩性和一般性,在自动合成高质量指令数据方面显示了这种灵活性、可概括性。


Article 292

Title@2025-06-11 (3): CoLMbo: Speaker Language Model for Descriptive Profiling

Title: CoLMbo: Speaker Language Model for Descriptive Profiling CoLMbo: Speaker-Sprachenmodell für Descriptive Profiling CoLMbo:描述性描述语言模式 2506.09375v1

Authors (8): Massa Baali, Shuo Han, Syed Abdul Hannan, Purusottam Samal, Karanveer Singh, Soham Deshmukh, Rita Singh, Bhiksha Raj

Speaker recognition systems are often limited to classification tasks and struggle to generate detailed speaker characteristics or provide context-rich descriptions. These models primarily extract embeddings for speaker identification but fail to capture demographic attributes such as dialect, gender, and age in a structured manner. This paper introduces CoLMbo, a Speaker Language Model (SLM) that addresses these limitations by integrating a speaker encoder with prompt-based conditioning. This allows for the creation of detailed captions based on speaker embeddings. CoLMbo utilizes user-defined prompts to adapt dynamically to new speaker characteristics and provides customized descriptions, including regional dialect variations and age-related traits. This innovative approach not only enhances traditional speaker profiling but also excels in zero-shot scenarios across diverse datasets, marking a significant advancement in the field of speaker recognition.

发言者识别系统往往局限于分类任务,力求产生详细的演讲者特征或提供内容丰富的描述,这些模型主要为语音识别提取嵌入内容,但未能以结构化的方式捕捉到方言、性别和年龄等人口特征,本文介绍了ColMbo, 一种语言语言模式,即通过将发言者编码器与基于即时的附加条件相结合来解决这些局限性的CoLMbo,这样就可以根据发言者嵌入内容制作详细的字幕。 CoLMbo利用用户定义的提示来动态地适应新的语音特征,并提供定制描述,包括区域方言变和与年龄有关的特征。这种创新方法不仅加强了传统的语音特征描述,而且优于各种数据集的零光速情景,标志着在语音识别领域取得显著进展。


Article 293

Title@2025-06-11 (3): Low-resource domain adaptation while minimizing energy and hardware resource consumption

Title: Low-resource domain adaptation while minimizing energy and hardware resource consumption Low-Resource-Domänenanpassung bei Minimierung des Energie- und Hardware-Ressourcenverbrauchs 低资源领域的适应,同时尽量减少能源和硬件资源的消耗 2506.08433v2

Authors (3): Hernán Maina, Nicolás Wolovick, Luciana Benotti

Training Large Language Models (LLMs) is costly in terms of energy, hardware, and annotated data, often resulting in a positionality rooted in predominant cultures and values (Santy et al., 2023). Domain adaptation has emerged as a promising strategy to better align models with diverse cultural and value contexts (Hershcovich et al., 2022), but its computational cost remains a significant barrier, particularly for research groups lacking access to large-scale infrastructure. In this paper, we evaluate how the use of different numerical precision formats and data parallelization strategies impacts both training speed (as a proxy to energy and hardware consumption) and model accuracy, with the goal of facilitating domain adaptation in low-resource environments. Our findings are relevant to any setting where energy efficiency, accessibility, or limited hardware availability are key concerns.

培训大语言模型(LLMS)在能源、硬件和附加说明的数据方面成本高昂,往往导致定位根植于主导文化和价值观(Santy等人,2023年),对大语言模型的适应已成为一项大有希望的战略,使模型更好地适应不同的文化和价值背景(Hershcovich等人,2022年),但其计算成本仍然是一个重大障碍,特别是对于无法获得大规模基础设施的研究团体而言。在本文件中,我们评估了不同数字精确格式和数据平行化战略的使用如何影响培训速度(作为能源和硬件消费的替代物)和模型准确性,目的是促进在低资源环境中的域适应(Hershcovich等人,2022年),但我们的研究结果与能源效率、可获取性或硬件有限是关键问题的任何环境有关。


Article 294

Title@2025-06-11 (3): SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner

Title: SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner SWE-Flow: Software-Engineering-Daten in einem testgetriebenen Manner synthetisieren SWE-Flow:将软件工程数据合成成一种试验驱动方式 2506.09003v2

Authors (9): Lei Zhang, Jiaxi Yang, Min Yang, Jian Yang, Mouxiang Chen, Jiajun Zhang, Zeyu Cui, Binyuan Hui, Junyang Lin

We introduce SWE-Flow, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, SWE-Flow automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of SWE-Flow is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step development schedule. At each step, SWE-Flow produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the SWE-Flow-Eval benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images on GitHub.

我们引入了SWE-Flow,一个基于测试驱动开发(TDD)的新数据综合框架。与基于人类提交的问题的现有软件工程数据不同,SWE-Flow自动从单位测试中直接推断出渐进的开发步骤,而单位测试本身就包含高层次要求。SWE-Flow 的核心是构建运行时间依赖图(RDG),精确地捕捉功能互动,促成结构化、逐步的开发时间表。在每一步,SWE-Flow 产生一个部分代码库、相应的单位测试和必要的代码修改,从而产生完全可核查的TDD任务。通过这种方法,我们从真实世界的GitHub项目中产生了16 061个培训案例和2 020个测试案例,创建了SWE-Flow-Eval基准。我们的实验显示,在该数据集上微调开放模型大大改进了基于TDD的编码性能。为了便利进一步研究,我们在GitHub上发布了所有代码、数据集、模型和Docker镜像。
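
To make the scheduling idea concrete, here is a minimal sketch, assuming the Runtime Dependency Graph can be approximated as a mapping from each function to the functions it calls at runtime. The graph contents and function names are hypothetical, and the topological ordering via Python's standard `graphlib` is an illustrative stand-in, not necessarily how SWE-Flow derives its schedule.

```python
from graphlib import TopologicalSorter

# Hypothetical runtime dependency graph: each key calls the functions in its set.
rdg = {
    "test_checkout": {"checkout"},
    "checkout": {"compute_total", "apply_discount"},
    "apply_discount": {"compute_total"},
    "compute_total": set(),
}

def development_schedule(graph):
    """Order functions so that every callee is implemented before its callers."""
    return list(TopologicalSorter(graph).static_order())

schedule = development_schedule(rdg)
for step, name in enumerate(schedule, start=1):
    print(f"step {step}: implement {name}")
```

Each step in such a schedule could then be paired with the unit tests that exercise the newly added function, which is the sense in which the resulting tasks are verifiable.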


Article 295

Title@2025-06-11 (3): COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content

Title: COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content COGENT: Ein Curriculum-orientierter Rahmen zur Generierung von stufengerechten Bildungsinhalten COGENT: 以课程为导向的生成适合年级的教育内容框架 2506.09367v1

Authors (4): Zhengyuan Liu, Stella Xin Yin, Dion Hoe-Lian Goh, Nancy F. Chen

While Generative AI has demonstrated strong potential and versatility in content generation, its application to educational contexts presents several challenges. Models often fail to align with curriculum standards or to maintain grade-appropriate reading levels consistently. Furthermore, STEM education poses additional challenges in balancing scientific explanations with everyday language when introducing complex and abstract ideas and phenomena to younger students. In this work, we propose COGENT, a curriculum-oriented framework for generating grade-appropriate educational content. We incorporate three curriculum components (science concepts, core ideas, and learning objectives), control readability through length, vocabulary, and sentence complexity, and adopt a “wonder-based” approach to increase student engagement and interest. We conduct a multi-dimensional evaluation via both LLM-as-a-judge and human expert analysis. Experimental results show that COGENT consistently produces grade-appropriate passages that are comparable or superior to human references. Our work establishes a viable approach for scaling adaptive and high-quality learning resources.

虽然创世的大赦国际在内容制作方面表现出了巨大的潜力和多功能性,但在教育背景下的应用却提出了若干挑战,模式往往不符合课程标准,并始终保持适合年级的阅读水平;此外,科技教育在向年轻学生介绍复杂和抽象的想法和现象时,在科学解释与日常语言之间取得平衡方面还构成额外挑战;在这项工作中,我们建议COGENT是一个以课程为导向的框架,用于产生适合年级的教育内容;我们包括三个课程组成部分(科学概念、核心思想和学习目标)、通过长度、词汇和句号复杂性来控制可读性,并采用“wonder”法提高学生参与程度和兴趣。我们通过LLM-as-a-a-法官和人类专家分析进行多层面评价。实验结果表明,COGENT一贯地制作与人类参考资料相仿或优异的适合年级的段落。我们的工作为扩大适应性和高质量学习资源制定了可行的方法。


Article 296

Title@2025-06-11 (3): Taming SQL Complexity: LLM-Based Equivalence Evaluation for Text-to-SQL

Title: Taming SQL Complexity: LLM-Based Equivalence Evaluation for Text-to-SQL Zähmen von SQL-Komplexität: LLM-basierte Gleichwertigkeitsbewertung für Text-zu-SQL SQL复杂程度:基于LLM的文本到SQL等效评价 2506.09359v1

Authors (5): Qingyun Zeng, Simin Ma, Arash Niknafs, Ashish Basran, Carol Szabo

The rise of Large Language Models (LLMs) has significantly advanced Text-to-SQL (NL2SQL) systems, yet evaluating the semantic equivalence of generated SQL remains a challenge, especially given ambiguous user queries and multiple valid SQL interpretations. This paper explores using LLMs to assess both semantic equivalence and a more practical “weak” semantic equivalence. We analyze common patterns of SQL equivalence and inequivalence, and discuss challenges in LLM-based evaluation.

大语言模型(LLMs)的兴起极大地推动了文本到SQL(NL2SQL)系统,然而,评价产生的SQL的语义等同性仍然是一个挑战,特别是由于用户问得含糊不清和多种有效的SQL解释,本文探讨利用LLMs评估语义和更加实用的“弱”语义等同性。我们分析了SQL等同性和等同性的共同模式,讨论了基于LLM的评估中的挑战。
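
As an illustration of why syntactically different queries can still be equivalent, the sketch below compares queries by executing them against a small in-memory SQLite database. The schema, data, and helper name `same_result` are assumptions made for this example. Agreement on one database instance is only a necessary condition, not a proof of semantic equivalence, and this is not the LLM-based evaluation the paper studies.

```python
import sqlite3
from collections import Counter

def same_result(conn, sql_a, sql_b):
    """Return True if both queries yield the same multiset of rows on this database."""
    rows_a = Counter(conn.execute(sql_a).fetchall())
    rows_b = Counter(conn.execute(sql_b).fetchall())
    return rows_a == rows_b

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'ann', 30.0), (2, 'bob', 5.0), (3, 'ann', 12.5);
""")

q1 = "SELECT customer FROM orders WHERE amount > 10"
q2 = "SELECT customer FROM orders WHERE NOT (amount <= 10)"   # rewritten predicate
q3 = "SELECT DISTINCT customer FROM orders WHERE amount > 10"  # drops duplicate rows

print(same_result(conn, q1, q2))
print(same_result(conn, q1, q3))
```

Here `q1` and `q2` agree on this data, while `q3` differs only in duplicate handling, the kind of borderline case where a strict and a more relaxed notion of equivalence could reasonably disagree.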


Article 297

Title@2025-06-11 (3): Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

Title: Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models Automatische Pseudo-Harmful Prompt Generation zur Bewertung falscher Ablehnungen in großen Sprachmodellen 自动生成用于评价大语言模式虚假拒绝的优多有害快速自动生成 2409.00598v2

Authors (6): Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang

Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful prompts, like “how to kill a mosquito,” which are actually harmless. Frequent false refusals not only frustrate users but also provoke a public backlash against the very values alignment seeks to protect. In this paper, we propose the first method to auto-generate diverse, content-controlled, and model-dependent pseudo-harmful prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately labels controversial prompts. We evaluate 20 LLMs on PHTest, uncovering new insights due to its scale and labeling. Our findings reveal a trade-off between minimizing false refusals and improving safety against jailbreak attacks. Moreover, we show that many jailbreak defenses significantly increase the false refusal rates, thereby undermining usability. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs. Our code and dataset are available at https://github.com/umd-huang-lab/FalseRefusal

与安全相关的大型语言模型(LLMs)有时假冒拒绝假冒有害信号,比如“如何杀死蚊子”等实际上无害的“如何杀死蚊子”等。经常的不实拒绝不仅使用户失望,而且还引起公众对要保护的价值观的反弹。在本文中,我们提出了自动生成多样化、内容控制和依赖模型的伪有害信号的第一种方法。使用这种方法,我们构建了一个称为PHTest的评价数据集,该数据集比现有数据集大十倍,涵盖更多的虚假拒绝模式和有争议的单独标签。我们在PHTest上评估了20个LMs,发现了因其规模和标签而带来的新洞见。我们的调查结果揭示了在尽量减少虚假拒绝与改善防范越狱袭击安全之间的权衡。此外,我们表明许多破狱防卫大大提高了虚假拒绝率,从而破坏了可用性。我们的方法和数据集可以帮助开发者评估和微调调更安全和更有用的LLMs。我们的代码和数据集可以在https://github.com/umd-huang-lab/FalseRefusals上查阅。


Article 298

Title@2025-06-11 (3): SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems

Title: SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems SAFEFLOW: Ein Grundsatzprotokoll für vertrauenswürdige und transaktionsunabhängige Agentensysteme 安全安全自主:可信赖和交易自主代理系统原则议定书 2506.07564v3

Authors (12): Peiran Li, Xinkai Zou, Zhuohang Wu, Ruifeng Li, Shuo Xing, Hanwen Zheng, Zhikai Hu, Yuping Wang, Haoxi Li, Qin Yuan, Yingmo Zhang, Zhengzhong Tu

Recent advances in large language models (LLMs) and vision-language models (VLMs) have enabled powerful autonomous agents capable of complex reasoning and multi-modal tool use. Despite their growing capabilities, today’s agent frameworks remain fragile, lacking principled mechanisms for secure information flow, reliability, and multi-agent coordination. In this work, we introduce SAFEFLOW, a new protocol-level framework for building trustworthy LLM/VLM-based agents. SAFEFLOW enforces fine-grained information flow control (IFC), precisely tracking provenance, integrity, and confidentiality of all the data exchanged between agents, tools, users, and environments. By constraining LLM reasoning to respect these security labels, SAFEFLOW prevents untrusted or adversarial inputs from contaminating high-integrity decisions. To ensure robustness in concurrent multi-agent settings, SAFEFLOW introduces transactional execution, conflict resolution, and secure scheduling over shared state, preserving global consistency across agents. We also introduce mechanisms, including write-ahead logging, rollback, and secure caches, that further enhance resilience against runtime errors and policy violations. To validate performance, we built SAFEFLOWBENCH, a comprehensive benchmark suite designed to evaluate agent reliability under adversarial, noisy, and concurrent operational conditions. Extensive experiments demonstrate that agents built with SAFEFLOW maintain impressive task performance and security guarantees even in hostile environments, substantially outperforming the state of the art. Together, SAFEFLOW and SAFEFLOWBENCH lay the groundwork for principled, robust, and secure agent ecosystems, advancing the frontier of reliable autonomy.

大型语言模型(LLMS)和愿景语言模型(VLMS)最近的进展使强大的自主代理商能够进行复杂的推理和使用多式工具。尽管其能力不断增强,但今天的代理商框架仍然脆弱,缺乏安全信息流动、可靠性和多式代理协调的原则机制。在这项工作中,我们引入了FAFEFLOW,即新的协议级框架,用于建立可靠的LLM/VLM代理商。FEFLOW实行精细制信息流动控制,准确跟踪代理商、工具、用户和环境之间交换的所有数据的出处、完整性和保密性。尽管其能力不断提高,但SFAFEFLOLW仍限制LM关于尊重这些安全标签的推理,从而防止高品质决定受到不信任或对抗性的投入。在同时建立多式代理商环境的情况下,SFEFEFLLLW引入了交易执行、冲突解决和安全安排共享状态,维护全球各代理商的一致性。我们进一步引入了各种机制,包括:纸面、可靠记录、滚回和保证,从而进一步加强运行错误和违反标准。


Article 299

Title@2025-06-11 (3): Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding

Title: Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding Milderung der hinteren Salience-Abschwächung in Langkontext-LLMs mit positionaler Kontrastdekodierung 具有位置相矛盾下限作用的长文本液溶液中减轻减轻水量的盐分 2506.08371v2

Authors (8): Zikai Xiao, Ziyang Wang, Wen Ma, Yan Zhang, Wei Shen, Yan Wang, Luqi Gong, Zuozhu Liu

While Large Language Models (LLMs) support long contexts, they struggle with performance degradation within the context window. Current solutions incur prohibitive training costs, leaving statistical behaviors and cost-effective approaches underexplored. From the decoding perspective, we identify the Posterior Salience Attenuation (PSA) phenomenon, where the salience ratio correlates with long-text performance degradation. Notably, despite the attenuation, gold tokens still occupy high-ranking positions in the decoding space. Motivated by it, we propose the training-free Positional Contrastive Decoding (PCD) that contrasts the logits derived from long-aware attention with those from designed local-aware attention, enabling the model to focus on the gains introduced by large-scale short-to-long training. Through the analysis of long-term decay simulation, we demonstrate that PCD effectively alleviates attention score degradation. Experimental results show that PCD achieves state-of-the-art performance on long-context benchmarks.

虽然大语言模型(LLMS)支持长期环境,但它们在环境窗口中挣扎着性能退化。 目前的解决办法需要高昂的培训费用,使得统计行为和成本效益高的方法得不到充分探讨。 从解码的角度,我们确定Posdeserate Salicience 衰减(PSA)现象,其显著比率与长文本性能退化相关。值得注意的是,尽管衰减,黄金标牌仍然占据了解码空间的高层位置。受它推动,我们提议采用无培训的定位对称解码(PCD),将长期关注的登录与设计的地方觉察关注的登录对比,使模型能够侧重于大规模短期至长期培训带来的收益。我们通过分析长期衰减模拟,证明PCD有效地减轻了注意力的减值。实验结果表明,PCD在长文本基准上取得了最先进的业绩。


Article 300

Title@2025-06-11 (3): LID Models are Actually Accent Classifiers: Implications and Solutions for LID on Accented Speech

Title: LID Models are Actually Accent Classifiers: Implications and Solutions for LID on Accented Speech LID-Modelle sind eigentlich Accent Klassifikatoren: Implikationen und Lösungen für LID auf Accented Speech LID 模型实际上是浓度分级器:LID对发声的影响和解决办法 2506.00628v2

Authors (2): Niyati Bafna, Matthew Wiesner

Prior research indicates that LID model performance significantly declines on accented speech; however, the specific causes, extent, and characterization of these errors remain under-explored. (i) We identify a common failure mode on accented speech whereby LID systems often misclassify L2 accented speech as the speaker’s native language or a related language. (ii) We present evidence suggesting that state-of-the-art models are invariant to permutations of short spans of speech, implying they classify on the basis of short phonotactic features indicative of accent rather than language. Our analysis reveals a simple method to enhance model robustness to accents through input chunking. (iii) We present an approach that integrates sequence-level information into our model without relying on monolingual ASR systems; this reduces accent-language confusion and significantly enhances performance on accented speech while maintaining comparable results on standard LID.


Article 301

Title@2025-06-11 (3): DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts

Title: DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts DIVE into MoE: Diversity-Enhanced Rekonstruktion großer Sprachmodelle von Dense in Mixture-of-Experts 发展到教育部:多样性-加强大语言模式的重建,从感官转变为专家混合体 2506.09351v1

Authors (7): Yuchen Feng, Bowen Shen, Naibin Gu, Jiaxuan Zhao, Peng Fu, Zheng Lin, Weiping Wang

Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of the parameters. Despite the inference efficiency of MoE LLMs, the training of extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconstruction methods often overlook the diversity among experts, leading to potential redundancy. In this paper, we observe that a given LLM exhibits notable diversity after being pruned on different calibration datasets, and based on this observation we present a Diversity-Enhanced reconstruction method named DIVE. The recipe of DIVE includes domain affinity mining, pruning-based expert reconstruction, and efficient retraining. Specifically, the reconstruction includes pruning and reassembly of the feed-forward network (FFN) module. After reconstruction, we efficiently retrain the model on routers, experts and normalization modules. We implement DIVE on Llama-style LLMs with open-source training corpora. Experiments show that DIVE achieves training efficiency with minimal accuracy trade-offs, outperforming existing pruning and MoE reconstruction methods with the same number of activated parameters.


Article 302

Title@2025-06-11 (3): Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test

Title: Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test Prüfung von Black-Box LLM APIs mit einem Rank-Based Uniformity Test 采用级级统一测试的黑色Box LLM API 2506.06975v3

Authors (9): Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger

As API access becomes a primary interface to large language models (LLMs), users often interact with black-box systems that offer little transparency into the deployed model. To reduce costs or maliciously alter model behaviors, API providers may discreetly serve quantized or fine-tuned variants, which can degrade performance and compromise safety. Detecting such substitutions is difficult, as users lack access to model weights and, in most cases, even output logits. To tackle this problem, we propose a rank-based uniformity test that can verify the behavioral equality of a black-box LLM to a locally deployed authentic model. Our method is accurate, query-efficient, and avoids detectable query patterns, making it robust to adversarial providers that reroute or mix responses upon the detection of testing attempts. We evaluate the approach across diverse threat scenarios, including quantization, harmful fine-tuning, jailbreak prompts, and full model substitution, showing that it consistently achieves superior statistical power over prior methods under constrained query budgets.
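
The general idea behind a rank-based uniformity check can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: each response is reduced to a hypothetical scalar score (simulated here with Gaussian draws), the API response's score is ranked among scores of samples from the local reference model, and the resulting ranks are compared against a uniform distribution with a Kolmogorov-Smirnov statistic. If the API serves the reference model, the ranks should look uniform.

```python
import random

def normalized_rank(api_score, local_scores, rng):
    """Map the API score's rank among local reference scores to a value in [0, 1)."""
    below = sum(1 for s in local_scores if s < api_score)
    # Random jitter spreads the discrete rank so it is continuous-uniform under the null.
    return (below + rng.random()) / (len(local_scores) + 1)

def ks_uniform_statistic(values):
    """Kolmogorov-Smirnov distance between the empirical CDF of values and Uniform(0, 1)."""
    xs = sorted(values)
    m = len(xs)
    return max(max((i + 1) / m - x, x - i / m) for i, x in enumerate(xs))

def audit(api_mean, rng, prompts=300, local_samples=20):
    """Simulate an audit; Gaussian scores stand in for any scalar feature of a response."""
    ranks = []
    for _ in range(prompts):
        local = [rng.gauss(0.0, 1.0) for _ in range(local_samples)]
        api = rng.gauss(api_mean, 1.0)
        ranks.append(normalized_rank(api, local, rng))
    return ks_uniform_statistic(ranks)

rng = random.Random(0)
stat_same = audit(api_mean=0.0, rng=rng)         # API serves the reference model
stat_substituted = audit(api_mean=1.5, rng=rng)  # API serves a shifted (substituted) model
print(f"same model: {stat_same:.3f}, substituted: {stat_substituted:.3f}")
```

A small statistic is consistent with the API matching the reference model, while a large one suggests a substitution. In practice the score function and the decision threshold would have to be chosen and calibrated for the models being audited.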


Article 303

Title@2025-06-11 (3): Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis

Title: Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis Warum sind Web-KI-Agenten anfälliger als Standalone LLMs? Eine Sicherheitsanalyse 为什么互联网上AI代理机构比独立的LLMs更加脆弱? 2502.20383v2

Authors (5): Jeffrey Yang Fan Chiang, Seungjae Lee, Jia-Bin Huang, Furong Huang, Yizheng Chen

Recent advancements in Web AI agents have demonstrated remarkable capabilities in addressing complex web navigation tasks. However, emerging research shows that these agents exhibit greater vulnerability compared to standalone Large Language Models (LLMs), despite both being built upon the same safety-aligned models. This discrepancy is particularly concerning given the greater flexibility of Web AI agents compared to standalone LLMs, which may expose them to a wider range of adversarial user inputs. To build a scaffold that addresses these concerns, this study investigates the underlying factors that contribute to the increased vulnerability of Web AI agents. Notably, this disparity stems from the multifaceted differences between Web AI agents and standalone LLMs, as well as from complex signals, nuances that simple evaluation metrics such as success rate often fail to capture. To tackle these challenges, we propose a component-level analysis and a more granular, systematic evaluation framework. Through this fine-grained investigation, we identify three critical factors that amplify the vulnerability of Web AI agents: (1) embedding user goals into the system prompt, (2) multi-step action generation, and (3) observational capabilities. Our findings highlight the pressing need to enhance security and robustness in AI agent design and provide actionable insights for targeted defense strategies.


Article 304

Title@2025-06-11 (3): OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment

Title: OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment OmniDRCA: Paralleles Sprach-Text-Stiftungsmodell über Dual-Resolution-Sprachdarstellungen und Kontrastive Ausrichtung OmniDRCA:通过双分辨率发言代表和对立对调并行发言-文本基金会模式 2506.09349v1

Authors (16): Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Jieping Ye

Recent studies on end-to-end speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM’s autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents OmniDRCA, a parallel speech-text foundation model based on joint autoregressive modeling, featuring dual-resolution speech representations and contrastive cross-modal alignment. Our approach processes speech and text representations in parallel while enhancing audio comprehension through contrastive alignment. Experimental results on Spoken Question Answering benchmarks demonstrate that OmniDRCA establishes new state-of-the-art (SOTA) performance among parallel joint speech-text modeling based foundation models, and achieves competitive performance compared to interleaved models. Additionally, we explore the potential of extending the framework to full-duplex conversational scenarios.


Article 305

Title@2025-06-11 (3): Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Title: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models Qwen3 Embedding: Advancing Text Embedding and Reranking by Foundation Models Qwen3 嵌入式:通过基础模型推进文本嵌入和重新排位 2506.05176v3

Authors (12): Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou

In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs’ robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.


Article 306

Title@2025-06-11 (3): Ming-Omni: A Unified Multimodal Model for Perception and Generation

Title: Ming-Omni: A Unified Multimodal Model for Perception and Generation Ming-Omni: Ein einheitliches multimodales Modell für Wahrnehmung und Erzeugung Ming-Omni:感知和一代的统一多模式模型 2506.09344v1

Authors (58): Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, Lyuxin Xue, Lan Wang, Mochen Bai, Ning Gao, Pei Chen, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Sirui Gao, Tinghao Liu, Taisong Li, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaoxue Chen, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yunxiao Sun, Yipeng Chen, Yifei Wu, Yongjie Lyu, Ziping Ma, Zipeng Feng, Zhijiang Fang, Zhihao Qiu, Ziyuan Huang, Zhengyu He

We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.


Article 307

Title@2025-06-11 (3): Latent Multi-Head Attention for Small Language Models

Title: Latent Multi-Head Attention for Small Language Models Latent Multi-Head Aufmerksamkeit für kleine Sprachmodelle 对小型语言模式的多方关注 2506.09342v1

Authors (4): Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat

We present the first comprehensive study of latent multi-head attention (MLA) for small language models, revealing interesting efficiency-quality trade-offs. Training 30M-parameter GPT models on 100,000 synthetic stories, we benchmark three architectural variants: standard multi-head attention (MHA), MLA, and MLA with rotary positional embeddings (MLA+RoPE). Our key finding is that MLA+RoPE with half-rank latent dimensions (r = d/2) achieves a 45% KV-cache memory reduction while incurring only a 0.3% increase in validation loss (essentially matching MHA quality), a Pareto improvement for memory-constrained deployment. We further show that RoPE is crucial for MLA in small models: without it, MLA underperforms vanilla attention by 3-5%, but with RoPE, it surpasses vanilla by 2%. Inference benchmarks on NVIDIA A100 GPUs reveal that MLA with r = d/2 achieves a 1.4 times speedup over full-rank MLA while maintaining the memory savings. GPT-4 evaluations corroborate perplexity results, with ours achieving the highest quality scores (7.4/10) across grammar, creativity, and consistency metrics. Code and models will be released upon acceptance.


Article 308

Title@2025-06-11 (3): RePO: Replay-Enhanced Policy Optimization

Title: RePO: Replay-Enhanced Policy Optimization RePO: Replay-erweiterte Politikoptimierung REPO: 重放强化政策优化 2506.09340v1

Authors (5): Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu

Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of 18.4 and 4.1 points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by 15% while raising the number of effective optimization steps by 48% for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to 8. The repository can be accessed at https://github.com/SihengLi99/RePO.
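
For readers unfamiliar with the group-relative advantage that GRPO estimates, a minimal sketch follows. It assumes the commonly used normalization (reward minus group mean, divided by group standard deviation). The reward values are hypothetical, and simply appending replayed samples to the group is an illustrative simplification: it does not model RePO's replay strategies or any off-policy correction.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

on_policy_rewards = [1.0, 0.0, 0.0, 1.0]  # hypothetical rewards for freshly sampled outputs
replayed_rewards = [1.0, 0.0, 0.0, 0.0]   # hypothetical rewards retrieved from a replay buffer

advantages = group_relative_advantages(on_policy_rewards + replayed_rewards)
print([round(a, 3) for a in advantages])
```

Enlarging the group with replayed samples changes the baseline (the group mean) that each output is compared against, which is one way a broader sample set can affect the optimization signal for a prompt.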



Article 309

Title@2025-06-11 (3): Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference

Title: Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference Sentence-Level-Reward-Modell kann besser für die Ausrichtung LLM aus menschlichen Präferenzen Generalisieren 判决一级奖赏模式可以普遍化,更有利于将人类特惠的LLM与LM相匹配 2503.04793v4

Authors (7): Wenjie Qiu, Yi-Chen Li, Xuqin Zhang, Tianyi Zhang, Yihang Zhang, Zongzhang Zhang, Yang Yu

Learning reward models from human preference datasets and subsequently optimizing language models via reinforcement learning has emerged as a fundamental paradigm for aligning LLMs with human preferences. The performance of the reward model plays a crucial role in the effectiveness of alignment. Previous reward models operate at a coarse-grained level, requiring the generation of a complete response to obtain a reward value. The sparse reward may present challenges for downstream reinforcement learning. While recent efforts have attempted to learn token-level reward models, the lack of explicit semantic information makes it difficult to model the credit of every individual token. In this paper, we propose assigning scores to every sentence, introducing an intermediate-grained reward model. By segmenting the complete response into sentences and applying differential operations to reward output at the start and end positions of each sentence, we can effectively model the rewards of sentences. Moreover, a novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score, which allows it to be trained using the Bradley-Terry model. On common benchmarks, our method outperforms the response-level reward model by 2.7% on RewardBench (for reward modeling evaluation) and surpasses all baselines on AlpacaEval (for alignment evaluation).
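The pipeline described above can be sketched in a few lines, under stated assumptions: the per-position reward outputs, the sentence spans, and the attention logits are hypothetical inputs, and softmax-weighted averaging stands in for the paper's attention mechanism, whose exact form the abstract does not give.

```python
import math


def sentence_rewards(position_values, sentence_spans):
    """Reward of each sentence: output at its end position minus output at its start."""
    return [position_values[end] - position_values[start]
            for start, end in sentence_spans]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def response_score(rewards, attention_logits):
    """Aggregate sentence rewards into one response-level score (assumed softmax weighting)."""
    weights = softmax(attention_logits)
    return sum(w * r for w, r in zip(weights, rewards))


def bradley_terry_loss(score_chosen, score_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one."""
    return math.log1p(math.exp(-(score_chosen - score_rejected)))


# Hypothetical reward-model outputs at five token positions, two sentences.
values = [0.0, 0.2, 0.5, 0.4, 0.9]
spans = [(0, 2), (2, 4)]
rewards = sentence_rewards(values, spans)      # approximately [0.5, 0.4]
score = response_score(rewards, [0.0, 0.0])    # uniform weights give the mean
```

Because the aggregated score is a single scalar per response, pairs of chosen and rejected responses can be trained with the usual Bradley-Terry objective.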



Article 310

Title@2025-06-11 (3): GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation

Title: GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation GraphRAG-Bench: Herausfordernde Domain-spezifische Begründung für die Auswertung der Graph Retrieval-Augmented Generation 图图RAG-Bench:评估图回收-提款一代的有挑战性域特定原因 2506.02404v2

Authors (8): Yilin Xiao, Junnan Dong, Chuang Zhou, Su Dong, Qian-wen Zhang, Di Yin, Xing Sun, Xiao Huang

Graph Retrieval Augmented Generation (GraphRAG) has garnered increasing recognition for its potential to enhance large language models (LLMs) by structurally organizing domain-specific corpora and facilitating complex reasoning. However, current evaluations of GraphRAG models predominantly rely on traditional question-answering datasets. Their limited scope in questions and evaluation metrics fails to comprehensively assess the reasoning capacity improvements enabled by GraphRAG models. To address this gap, we introduce GraphRAG-Bench, a large-scale, domain-specific benchmark designed to rigorously evaluate GraphRAG models. Our benchmark offers three key advantages: (i) Challenging question design. Featuring college-level, domain-specific questions that demand multi-hop reasoning, the benchmark ensures that simple content retrieval is insufficient for problem-solving. For example, some questions require mathematical reasoning or programming. (ii) Diverse task coverage. The dataset includes a broad spectrum of reasoning tasks: multiple-choice, true/false, multi-select, open-ended, and fill-in-the-blank. It spans 16 disciplines in 20 core textbooks. (iii) Holistic evaluation framework. GraphRAG-Bench provides comprehensive assessment across the entire GraphRAG pipeline, including graph construction, knowledge retrieval, and answer generation. Beyond final-answer correctness, it evaluates the logical coherence of the reasoning process. By applying nine contemporary GraphRAG methods to GraphRAG-Bench, we demonstrate its utility in quantifying how graph-based structuring improves model reasoning capabilities. Our analysis reveals critical insights about graph architectures, retrieval efficacy, and reasoning capabilities, offering actionable guidance for the research community.



Article 311

Title@2025-06-11 (3): LIFEBench: Evaluating Length Instruction Following in Large Language Models

Title: LIFEBench: Evaluating Length Instruction Following in Large Language Models LIFEBench: Bewertung der Längenlehre in großen Sprachmodellen LELELEBENCH: 以大语言模式评价长期指示 2505.16234v2

Authors (12): Wei Zhang, Zhenhong Zhou, Kun Wang, Junfeng Fang, Yuanhe Zhang, Rui Wang, Ge Zhang, Xavier Li, Li Sun, Lingjuan Lyu, Yang Liu, Sen Su

While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions (e.g., write a 10,000-word novel). Additionally, models often generate far too short outputs, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs’ ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length-instruction following. Notably, reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs’ length-instruction-following ability, offering critical insights for future progress.



Article 312

Title@2025-06-11 (3): Natural Language Guided Ligand-Binding Protein Design

Title: Natural Language Guided Ligand-Binding Protein Design Natürliche Sprache geführt Ligand-Binding Protein Design Ligand-Binding 蛋白设计 2506.09332v1

Authors (5): Zhenqiao Song, Ramith Hettiarachchi, Chuan Li, Jianwen Xie, Lei Li

Can AI protein models follow human language instructions and design proteins with desired functions (e.g. binding to a ligand)? Designing proteins that bind to a given ligand is crucial in a wide range of applications in biology and chemistry. Most prior AI models are trained on protein-ligand complex data, which is scarce due to the high cost and time requirements of laboratory experiments. In contrast, there is a substantial body of human-curated text descriptions about protein-ligand interactions and ligand formulas. In this paper, we propose InstructPro, a family of protein generative models that follow natural language instructions to design ligand-binding proteins. Given a textual description of the desired function and a ligand formula in SMILES, InstructPro generates protein sequences that are functionally consistent with the specified instructions. We develop the model architecture, training strategy, and a large-scale dataset, InstructProBench, to support both training and evaluation. InstructProBench consists of 9,592,829 triples of (function description, ligand formula, protein sequence). We train two model variants: InstructPro-1B (with 1 billion parameters) and InstructPro-3B (with 3 billion parameters). Both variants consistently outperform strong baselines, including ProGen2, ESM3, and Pinal. Notably, InstructPro-1B achieves the highest docking success rate (81.52% at moderate confidence) and the lowest average root mean square deviation (RMSD) compared to ground truth structures (4.026 Å). InstructPro-3B further decreases the average RMSD to 2.527 Å, demonstrating InstructPro’s ability to generate ligand-binding proteins that align with the functional specifications.



Article 313

Title@2025-06-11 (3): Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation

Title: Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation 多方语言模式:推进合作、协调和适应 2506.09331v1

Authors (1): Arjun Vaithilingam Sudhakar

Modern Large Language Models (LLMs) exhibit impressive zero-shot and few-shot generalization capabilities across complex natural language tasks, enabling their widespread use as virtual assistants for diverse applications such as translation and summarization. Despite being trained solely on large corpora of text without explicit supervision on author intent, LLMs appear to infer the underlying meaning of textual interactions. This raises a fundamental question: can LLMs model and reason about the intentions of others, i.e., do they possess a form of theory of mind? Understanding others’ intentions is crucial for effective collaboration, which underpins human societal success and is essential for cooperative interactions among multiple agents, including humans and autonomous systems. In this work, we investigate the theory of mind in LLMs through the lens of cooperative multi-agent reinforcement learning (MARL), where agents learn to collaborate via repeated interactions, mirroring human social reasoning. Our approach aims to enhance artificial agents’ ability to adapt and cooperate with both artificial and human partners. By leveraging LLM-based agents capable of natural language interaction, we move towards creating hybrid human-AI systems that can foster seamless collaboration, with broad implications for the future of human-artificial interaction.



Article 314

Title@2025-06-11 (3): Towards Efficient and Effective Alignment of Large Language Models

Title: Towards Efficient and Effective Alignment of Large Language Models Auf dem Weg zu einer effizienten und effektiven Ausrichtung großer Sprachmodelle 实现高语言模式的高效和有效统一 2506.09329v1

Authors (1): Yuxin Jiang

Large language models (LLMs) exhibit remarkable capabilities across diverse tasks, yet aligning them efficiently and effectively with human expectations remains a critical challenge. This thesis advances LLM alignment by introducing novel methodologies in data collection, training, and evaluation. We first address alignment data collection. Existing approaches rely heavily on manually curated datasets or proprietary models. To overcome these limitations, we propose Lion, an adversarial distillation framework that iteratively refines training data by identifying and generating challenging instructions, enabling state-of-the-art zero-shot reasoning. Additionally, we introduce Web Reconstruction (WebR), a fully automated framework that synthesizes instruction-tuning data directly from raw web documents, significantly improving data diversity and scalability over existing synthetic data methods. Next, we enhance alignment training through novel optimization techniques. We develop Learning to Edit (LTE), a framework that enables LLMs to efficiently integrate new knowledge while preserving existing information. LTE leverages meta-learning to improve both real-time and batch knowledge updates. Furthermore, we introduce Bridging and Modeling Correlations (BMC), a refinement of Direct Preference Optimization (DPO) that explicitly captures token-level correlations in preference data, leading to superior alignment across QA and mathematical reasoning tasks. Finally, we tackle the challenge of evaluating alignment. Existing benchmarks emphasize response quality but overlook adherence to specific constraints. To bridge this gap, we introduce FollowBench, a multi-level, fine-grained benchmark assessing LLMs’ ability to follow complex constraints across diverse instruction types. Our results expose key weaknesses in current models’ constraint adherence, offering insights for future improvements.



Article 315

Title@2025-06-11 (3): Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions?

Title: Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions? Unterscheiden sich große Visions-Sprachenmodelle zwischen den tatsächlichen und scheinbaren Eigenschaften von Illusionen? 大型视觉语言模型是否区分幻象的实际特征和表面特征? 2506.05765v2

Authors (5): Taiga Shinozaki, Tomoki Doi, Amane Watahiki, Satoshi Nishida, Hitomi Yanaka

Humans are susceptible to optical illusions, which serve as valuable tools for investigating sensory and cognitive processes. Inspired by human vision studies, research has begun exploring whether machines, such as large vision language models (LVLMs), exhibit similar susceptibilities to visual illusions. However, studies often have used non-abstract images and have not distinguished actual and apparent features, leading to ambiguous assessments of machine cognition. To address these limitations, we introduce a visual question answering (VQA) dataset, categorized into genuine and fake illusions, along with corresponding control images. Genuine illusions present discrepancies between actual and apparent features, whereas fake illusions have the same actual and apparent features even though they look illusory due to the similar geometric configuration. We evaluate the performance of LVLMs for genuine and fake illusion VQA tasks and investigate whether the models discern actual and apparent features. Our findings indicate that although LVLMs may appear to recognize illusions by correctly answering questions about both feature types, they predict the same answers for both Genuine Illusion and Fake Illusion VQA questions. This suggests that their responses might be based on prior knowledge of illusions rather than genuine visual understanding. The dataset is available at https://github.com/ynklab/FILM



Article 316

Title@2025-06-11 (3): Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic

Title: Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic Critic-CoT: Erhöht die Argumentationsfähigkeiten des großen Sprachmodells durch Chain-of-Thoughts Critic Critic-Cot:通过思维链增强大型语言模型的推理能力 2408.16326v3

Authors (10): Xin Zheng, Jie Lou, Boxi Cao, Xueru Wen, Yuqiu Ji, Hongyu Lin, Yaojie Lu, Xianpei Han, Debing Zhang, Le Sun

Self-critic has become a crucial mechanism for enhancing the reasoning performance of LLMs. However, current approaches mainly involve basic prompts for intuitive instance-level feedback, which resembles System-1 processes and limits the reasoning capabilities. Moreover, there is a lack of in-depth investigations into the relationship between an LLM’s ability to criticize and its task-solving performance. To address these issues, we propose Critic-CoT, a novel framework that pushes LLMs toward System-2-like critic capability. Through a step-wise CoT reasoning paradigm and the automatic construction of distant-supervision data without human annotation, Critic-CoT enables LLMs to engage in slow, analytic self-critique and refinement, thereby improving their reasoning abilities. Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance, either by filtering out invalid solutions or through iterative refinement. Furthermore, we investigate the intrinsic correlation between critique and task-solving abilities within LLMs, discovering that these abilities can mutually reinforce each other rather than conflict.



Article 317

Title@2025-06-11 (3): CC-RAG: Structured Multi-Hop Reasoning via Theme-Based Causal Graphs

Title: CC-RAG: Structured Multi-Hop Reasoning via Theme-Based Causal Graphs CC-RAG: Strukturierte Multi-Hop-Reasoning über themenbasierte Kausalgraphen CC-RAG:通过基于主题的因果图结构化多功能原因 2506.08364v2

Authors (3): Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han

Understanding cause-and-effect relationships remains a formidable challenge for Large Language Models (LLMs), particularly in specialized domains where reasoning requires more than surface-level correlations. Retrieval-Augmented Generation (RAG) improves factual accuracy, but standard RAG pipelines treat evidence as flat context, lacking the structure required to model true causal dependencies. We introduce Causal-Chain RAG (CC-RAG), a novel approach that integrates zero-shot triple extraction and theme-aware graph chaining into the RAG pipeline, enabling structured multi-hop inference. Given a domain-specific corpus, CC-RAG constructs a Directed Acyclic Graph (DAG) of <cause, relation, effect> triples and uses forward/backward chaining to guide structured answer generation. Experiments on two real-world domains, Bitcoin price fluctuations and Gaucher disease, show that CC-RAG outperforms standard RAG and zero-shot LLMs in chain similarity, information density, and lexical diversity. Both LLM-as-a-Judge and human evaluations consistently favor CC-RAG. Our results demonstrate that explicitly modeling causal structure enables LLMs to generate more accurate and interpretable responses, especially in specialized domains where flat retrieval fails.
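To make the idea of chaining over <cause, relation, effect> triples concrete, here is a minimal sketch of forward and backward traversal over a small causal DAG. The triples, node names, function names, and the hop limit are all hypothetical; the abstract does not describe CC-RAG's extraction step or its theme-aware chaining in enough detail to reproduce them.

```python
from collections import defaultdict


def build_graph(triples):
    """Index triples by cause (forward edges) and by effect (backward edges)."""
    forward, backward = defaultdict(list), defaultdict(list)
    for cause, relation, effect in triples:
        forward[cause].append((relation, effect))
        backward[effect].append((relation, cause))
    return forward, backward


def chains(adjacency, start, max_hops=3):
    """Enumerate maximal chains from `start`, following at most `max_hops` edges."""
    results = []

    def walk(node, path, hops):
        neighbours = adjacency.get(node, [])
        if not neighbours or hops == max_hops:
            if len(path) > 1:
                results.append(path)
            return
        for _relation, nxt in neighbours:
            walk(nxt, path + [nxt], hops + 1)

    walk(start, [start], 0)
    return results


# Hypothetical toy triples, not taken from the paper's data.
triples = [
    ("rate_hike", "causes", "lower_liquidity"),
    ("lower_liquidity", "leads_to", "price_drop"),
    ("exchange_hack", "triggers", "price_drop"),
]
forward, backward = build_graph(triples)
effects_of_rate_hike = chains(forward, "rate_hike")    # cause to downstream effects
causes_of_price_drop = chains(backward, "price_drop")  # effect to upstream causes
```

Forward chaining answers "what follows from X?", while backward chaining answers "what could have led to Y?"; either list of chains could then be passed to a generator as structured context.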



Article 318

Title@2025-06-11 (3): OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation

Title: OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation OWL: Optimiertes Workforce-Learning für die allgemeine Multi-Agent-Hilfe in der Real-World Task Automation OWL: 优化劳动力学习,为现实世界任务自动化提供一般多机构援助 2505.23885v2

Authors (16): Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Qiguang Chen, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Bernard Ghanem, Ping Luo, Guohao Li

Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain-agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain-specific tool-calling capabilities. This decoupling enables cross-domain transferability during both inference and training phases: during inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents; for training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain-agnostic planner with reinforcement learning from real-world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi-domain agentic tasks. Experimental results demonstrate that Workforce achieves open-source state-of-the-art performance (69.70%), outperforming commercial systems like OpenAI’s Deep Research by 2.34%. More notably, our OWL-trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT-4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general-purpose AI assistants.



Article 319

Title@2025-06-11 (3): Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

Title: Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies Beschleunigung der LLM-Inferenz mit verlustfreier spekulativer Dekodierung von Algorithmen für heterogene Vokabeln 加速LLM 超异基因词汇表无损失投机代谢值的加速LLM推导值 2502.05202v3

Authors (8): Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters, often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization, programming, and long-context tasks, our algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding. By enabling any off-the-shelf model to serve as a drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.
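For readers unfamiliar with what "lossless" means here, the sketch below works through the standard shared-vocabulary speculative sampling rule: accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p - q). This is the well-known baseline rule, not one of the three heterogeneous-vocabulary algorithms the paper introduces, and the distributions used are made-up toy values.

```python
def speculative_output_distribution(p, q):
    """Exact distribution of the token emitted by one accept/reject step.

    p: target model probabilities, q: drafter probabilities (same vocabulary).
    """
    # Probability of drafting x and accepting it: q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)

    # On rejection, resample from the residual distribution max(0, p - q), renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    norm = sum(residual)
    if norm == 0.0:  # drafter already matches the target exactly
        return accepted
    return [a + reject_mass * r / norm for a, r in zip(accepted, residual)]


# Toy target and drafter distributions over a three-token vocabulary.
target = [0.5, 0.3, 0.2]
drafter = [0.2, 0.2, 0.6]
emitted = speculative_output_distribution(target, drafter)
```

The emitted distribution equals the target distribution regardless of how poor the drafter is; drafter quality only affects how often drafts are accepted, and therefore the speedup.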



Article 320

Title@2025-06-11 (3): Alzheimer’s Dementia Detection Using Perplexity from Paired Large Language Models

Title: Alzheimer’s Dementia Detection Using Perplexity from Paired Large Language Models Alzheimer’s Dementia Detection Using Perplexity from Paired Large Language Models 阿兹海默氏病痴呆症检测,使用Paired大语言模型的复度 2506.09315v1

Authors (3): Yao Xiao, Heidi Christensen, Stefan Goetze

Alzheimer’s dementia (AD) is a neurodegenerative disorder with cognitive decline that commonly impacts language ability. This work extends the paired perplexity approach to detecting AD by using a recent large language model (LLM), the instruction-following version of Mistral-7B. We improve accuracy by an average of 3.33% over the best current paired perplexity method and by 6.35% over the top-ranked method from the ADReSS 2020 challenge benchmark. Our further analysis demonstrates that the proposed approach can effectively detect AD with a clear and interpretable decision boundary in contrast to other methods that suffer from opaque decision-making processes. Finally, by prompting the fine-tuned LLMs and comparing the model-generated responses to human responses, we illustrate that the LLMs have learned the special language patterns of AD speakers, which opens up possibilities for novel methods of model interpretation and data augmentation.
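A paired perplexity classifier can be sketched as follows, assuming two language models, one adapted to AD speech and one to control speech, that each return per-token log-probabilities for a transcript. The function names, the log-ratio score, the zero threshold, and the log-probability values are illustrative assumptions rather than the paper's exact setup.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp of the mean negative log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def paired_score(logprobs_control_model, logprobs_ad_model):
    """Log-ratio of perplexities; positive when the AD-adapted model fits the transcript better."""
    return math.log(perplexity(logprobs_control_model)) - math.log(perplexity(logprobs_ad_model))


def classify(score, threshold=0.0):
    """One-dimensional decision boundary on the paired score (threshold is an assumption)."""
    return "AD" if score > threshold else "control"


# Hypothetical per-token log-probabilities for one transcript under each model.
under_control_model = [math.log(0.1)] * 4
under_ad_model = [math.log(0.5)] * 4
score = paired_score(under_control_model, under_ad_model)
label = classify(score)
```

Because the decision reduces to a single scalar compared against a threshold, the boundary is easy to inspect, which matches the interpretability argument made in the abstract.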



Article 321

Title@2025-06-11 (3): AAD-LLM: Neural Attention-Driven Auditory Scene Understanding

Title: AAD-LLM: Neural Attention-Driven Auditory Scene Understanding AAD-LLM: Neurales aufmerksamkeitsgetriebenes Auditory Scene Understanding AAD-LLM: 神经引人注意 – – 引导观众了解场景 2502.16794v3

Authors (9): Xilin Jiang, Sukru Samet Dindar, Vishal Choudhari, Stephan Bickel, Ashesh Mehta, Guy M McKhann, Daniel Friedman, Adeen Flinker, Nima Mesgarani

Auditory foundation models, including auditory large language models (LLMs), process all sound inputs equally, independent of listener perception. However, human auditory perception is inherently selective: listeners focus on specific speakers while ignoring others in complex auditory scenes. Existing models do not incorporate this selectivity, limiting their ability to generate perception-aligned responses. To address this, we introduce Intention-Informed Auditory Scene Understanding (II-ASU) and present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM extends an auditory LLM by incorporating intracranial electroencephalography (iEEG) recordings to decode which speaker a listener is attending to and refine responses accordingly. The model first predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios, with both objective and subjective ratings showing improved alignment with listener intention. By taking a first step toward intention-aware auditory AI, this work explores a new paradigm where listener perception informs machine listening, paving the way for future listener-centered auditory systems. Demo and code available: https://aad-llm.github.io.



Article 322

Title@2025-06-11 (3): Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience

Title: Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience Hören, Chatten und Remixen: Textgeführtes Soundscape Remixen für verbesserte Hörerfahrung 监听、聊天和混音:加强监听经验的文本引导声音立体混音 2402.03710v2

Authors (4): Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani

In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces “Listen, Chat, and Remix” (LCR), a novel multimodal sound remixer that controls each sound source in a mixture based on user-provided text instructions. LCR distinguishes itself with a user-friendly text interface and its unique ability to remix multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for remixing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles filtered components back to the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse remixing tasks including extraction, removal, and volume control of single or multiple sources. Our experiments demonstrate significant improvements in signal quality across all remixing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources. An audio demo is available at: https://listenchatremix.github.io/demo.



Article 323

Title@2025-06-11 (3): AbstRaL: Augmenting LLMs’ Reasoning by Reinforcing Abstract Thinking

Title: AbstRaL: Augmenting LLMs’ Reasoning by Reinforcing Abstract Thinking AbstRaL: Augmenting LLMs’ Reasoning durch Stärkung des abstrakten Denkens AbstRAL:加强抽象思考,增加LLM女士的理由理由 2506.07751v2

Authors (4): Silin Gao, Antoine Bosselut, Samy Bengio, Emmanuel Abbe

Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in their reasoning. I.e., they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further “instantiate” reasoning problems on potential variations. In contrast, our approach focuses on “abstracting” reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. We find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning, which often fails to produce faithful abstractions. Our method, AbstRaL – which promotes abstract reasoning in LLMs using RL on granular abstraction data – significantly mitigates performance degradation on recent GSM perturbation benchmarks.



Article 324

Title@2025-06-10 (2): $(RSA)^2$: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding

Title: $(RSA)^2$: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding $(RSA)^2$: Ein rhetorisch-strategisch-bewusster Rationaler Sprachgesetzrahmen für das figurative Sprachverständnis $(RSA) $2美元:一个用于模拟语言理解的 “ 路语-战略-报警有理演讲法 “ 框架 2506.09301v1

Authors (4): Cesare Spinoso-Di Piano, David Austin, Pablo Piantanida, Jackie Chi Kit Cheung

Figurative language (e.g., irony, hyperbole, understatement) is ubiquitous in human communication, resulting in utterances where the literal and the intended meanings do not match. The Rational Speech Act (RSA) framework, which explicitly models speaker intentions, is the most widespread theory of probabilistic pragmatics, but existing implementations are either unable to account for figurative expressions or require modeling the implicit motivations for using figurative language (e.g., to express joy or annoyance) in a setting-specific way. In this paper, we introduce the Rhetorical-Strategy-Aware RSA $(RSA)^2$ framework which models figurative language use by considering a speaker’s employed rhetorical strategy. We show that $(RSA)^2$ enables human-compatible interpretations of non-literal utterances without modeling a speaker’s motivations for being non-literal. Combined with LLMs, it achieves state-of-the-art performance on the ironic split of PragMega+, a new irony interpretation dataset introduced in this study.



Article 325

Title@2025-06-10 (2): AMELI: Enhancing Multimodal Entity Linking with Fine-Grained Attributes

Title: AMELI: Enhancing Multimodal Entity Linking with Fine-Grained Attributes AMELI: Verbesserung der Multimodalität der Verbindung mit feinkörnigen Attributen AMELI: 增强与精细所得属性挂钩的多式联运实体 2305.14725v2

Authors (8): Barry Menglong Yao, Sijia Wang, Yu Chen, Qifan Wang, Minqian Liu, Zhiyang Xu, Licheng Yu, Lifu Huang

We propose attribute-aware multimodal entity linking, where the input consists of a mention described with a text paragraph and images, and the goal is to predict the corresponding target entity from a multimodal knowledge base (KB) where each entity is also accompanied by a text description, visual images, and a collection of attributes that present the meta-information of the entity in a structured format. To facilitate this research endeavor, we construct AMELI, encompassing a new multimodal entity linking benchmark dataset that contains 16,735 mentions described in text and associated with 30,472 images, and a multimodal knowledge base that covers 34,690 entities along with 177,873 entity images and 798,216 attributes. To establish baseline performance on AMELI, we experiment with several state-of-the-art architectures for multimodal entity linking and further propose a new approach that incorporates attributes of entities into disambiguation. Experimental results and extensive qualitative analysis demonstrate that extracting and understanding the attributes of mentions from their text descriptions and visual images play a vital role in multimodal entity linking. To the best of our knowledge, we are the first to integrate attributes in the multimodal entity linking task. The programs, model checkpoints, and the dataset are publicly available at https://github.com/VT-NLP/Ameli.



Article 326

Title@2025-06-10 (2): UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

Title: UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench UTBoost: Strenge Bewertung von Coding Agents auf SWE-Bench UTBost: 严格评价SWE-Bench上的编码剂 2506.09289v1

Authors (4): Boxi Yu, Yuxuan Zhu, Pinjia He, Daniel Kang

The advent of Large Language Models (LLMs) has spurred the development of coding agents for real-world code generation. As a widely used benchmark for evaluating the code generation capabilities of these agents, SWE-Bench uses real-world problems based on GitHub issues and their corresponding pull requests. However, the manually written test cases included in these pull requests are often insufficient, allowing generated patches to pass the tests without resolving the underlying issue. To address this challenge, we introduce UTGenerator, an LLM-driven test case generator that automatically analyzes codebases and dependencies to generate test cases for real-world Python projects. Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation. In our evaluation, we identified 36 task instances with insufficient test cases and uncovered 345 erroneous patches incorrectly labeled as passed in the original SWE-Bench. These corrections, impacting 40.9% of SWE-Bench Lite and 24.4% of SWE-Bench Verified leaderboard entries, yield 18 and 11 ranking changes, respectively.



Article 327

Title@2025-06-10 (2): EMMA: Efficient Visual Alignment in Multi-Modal LLMs

Title: EMMA: Efficient Visual Alignment in Multi-Modal LLMs EMMA: Effiziente visuelle Ausrichtung in Multi-Modal LLMs EMMA:多模式LMS中高效视觉协调 2410.02080v2

Authors (5): Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami

Multi-modal Large Language Models (MLLMs) have recently exhibited impressive general-purpose capabilities by leveraging vision foundation models to encode the core concepts of images into representations. These are then combined with instructions and processed by the language model to generate high-quality responses. Despite significant progress in enhancing the language component, challenges persist in optimally fusing visual encodings within the language model for task-specific adaptability. Recent research has focused on improving this fusion through modality adaptation modules, but at the cost of significantly increased model complexity and training data needs. In this paper, we propose EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modality module designed to efficiently fuse visual and textual encodings, generating instruction-aware visual representations for the language model. Our key contributions include: (1) an efficient early fusion mechanism that integrates vision and language representations with minimal added parameters (less than 0.2% increase in model size); (2) an in-depth interpretability analysis that sheds light on the internal mechanisms of the proposed method; and (3) comprehensive experiments that demonstrate notable improvements on both specialized and general benchmarks for MLLMs. Empirical results show that EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations. Our code is available at https://github.com/SaraGhazanfari/EMMA



Article 328

Title@2025-06-10 (2): Archon: An Architecture Search Framework for Inference-Time Techniques

Title: Archon: An Architecture Search Framework for Inference-Time Techniques Archon: Ein Architektur-Suchrahmen für Inferenz-Zeit-Techniken Archon: 推断-时间技术建筑搜索框架 2409.15254v6

Authors (11): Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Guha, E. Kelly Buchanan, Mayee Chen, Neel Guha, Christopher Ré, Azalia Mirhoseini

Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large language models (LLMs) at test time. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of each technique across models and tasks, the interactions between them, and the massive search space for combining them. To address these challenges, we introduce Archon, a modular and automated framework for optimizing the process of selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs, Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to top-performing baselines. Across instruction-following, reasoning, and coding tasks, we show that Archon can leverage additional inference compute budget to design systems that outperform frontier models such as OpenAI’s o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1%.



Article 329

Title@2025-06-10 (2): ThinkQE: Query Expansion via an Evolving Thinking Process

Title: ThinkQE: Query Expansion via an Evolving Thinking Process ThinkQE: Abfrageerweiterung über einen sich entwickelnden Denkprozess 思考QE:通过不断演变的思考进程扩大查询 2506.09260v1

Authors (3): Yibin Lei, Tao Shen, Andrew Yates

Effective query expansion for web search benefits from promoting both exploration and result diversity to capture multiple interpretations and facets of a query. While recent LLM-based methods have improved retrieval performance and demonstrated strong domain generalization without additional training, they often generate narrowly focused expansions that overlook these desiderata. We propose ThinkQE, a test-time query expansion framework that addresses this limitation through two key components: a thinking-based expansion process that encourages deeper and more comprehensive semantic exploration, and a corpus-interaction strategy that iteratively refines expansions using retrieval feedback from the corpus. Experiments on diverse web search benchmarks (DL19, DL20, and BRIGHT) show that ThinkQE consistently outperforms prior approaches, including training-intensive dense retrievers and rerankers.
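
The corpus-interaction loop described above can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the term-overlap ranker, the `copy_terms` function (a stand-in for the LLM's thinking-based expansion), the toy corpus, and the choice to feed back only documents not seen in earlier rounds are all assumptions made for the example.

```python
def rank(query_terms, corpus):
    """Rank document ids by term overlap with the query (toy retriever)."""
    return sorted(corpus, key=lambda d: len(query_terms & corpus[d]), reverse=True)


def expand_query(query, corpus, expand, rounds=2, k=2):
    """Grow the query over several rounds using retrieval feedback."""
    terms = set(query.split())
    seen = set()
    for _ in range(rounds):
        # Feed back only documents not shown in earlier rounds (assumption).
        fresh = [d for d in rank(terms, corpus) if d not in seen][:k]
        seen.update(fresh)
        terms |= expand(query, [corpus[d] for d in fresh])
    return terms


# Hypothetical corpus: document id -> set of terms.
corpus = {
    "d1": {"jaguar", "car", "engine"},
    "d2": {"jaguar", "cat", "jungle"},
    "d3": {"engine", "repair", "garage"},
    "d4": {"jungle", "rainforest"},
}


def copy_terms(query, docs):
    """Stand-in for the LLM: reuse every term from the feedback documents."""
    return set().union(*docs)


expanded = expand_query("jaguar", corpus, copy_terms)
print(sorted(expanded))
```

In this toy run the first round picks up both senses of the query (car and cat), and the second round follows each sense one step further into the corpus, which is the kind of exploration the abstract argues for.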



Article 330

Title@2025-06-10 (2): Self-Anchored Attention Model for Sample-Efficient Classification of Prosocial Text Chat

Title: Self-Anchored Attention Model for Sample-Efficient Classification of Prosocial Text Chat Selbsternanntes Aufmerksamkeitsmodell zur stichprobenmäßigen Klassifizierung von prosozialen Textchats 优社会性文本聊天抽样有效分类自作自作主的注意模式 2506.09259v1

Authors (6): Zhuofang Li, Rafal Kocielnik, Fereshteh Soltani, Penphob Boonyarungsrit, Animashree Anandkumar, R. Michael Alvarez

Millions of players engage daily in competitive online games, communicating through in-game chat. Prior research has focused on detecting relatively small volumes of toxic content using various Natural Language Processing (NLP) techniques for the purpose of moderation. However, recent studies emphasize the importance of detecting prosocial communication, which can be as crucial as identifying toxic interactions. Recognizing prosocial behavior allows for its analysis, rewarding, and promotion. Unlike for toxicity, there are limited datasets, models, and resources for identifying prosocial behaviors in game-chat text. In this work, we employed unsupervised discovery combined with game domain expert collaboration to identify and categorize prosocial player behaviors from game chat. We further propose a novel Self-Anchored Attention Model (SAAM), which yields a 7.9% improvement over the best existing technique. The approach uses the entire training set as “anchors” to help improve model performance when training data is scarce. This approach led to the development of the first automated system for classifying prosocial behaviors in in-game chats, particularly in low-resource settings where large-scale labeled data is not available. Our methodology was applied to one of the most popular online gaming titles, Call of Duty®: Modern Warfare® II, showcasing its effectiveness. This research is novel in applying NLP techniques to discover and classify prosocial behaviors in player in-game chat communication. It can help shift the focus of moderation from solely penalizing toxicity to actively encouraging positive interactions on online platforms.



Article 331

Title@2025-06-10 (2): Token Perturbation Guidance for Diffusion Models

Title: Token Perturbation Guidance for Diffusion Models Token Perturbation Guidance für Diffusionsmodelle 扩散模型的 Token 扰动指南 2506.10036v1

Authors (4): Javad Rajabi, Soroush Mehraban, Seyedmorteza Sadat, Babak Taati

Classifier-free guidance (CFG) has become an essential component of modern diffusion models to enhance both generation quality and alignment with input conditions. However, CFG requires specific training procedures and is limited to conditional generation. To address these limitations, we propose Token Perturbation Guidance (TPG), a novel method that applies perturbation matrices directly to intermediate token representations within the diffusion network. TPG employs a norm-preserving shuffling operation to provide effective and stable guidance signals that improve generation quality without architectural changes. As a result, TPG is training-free and agnostic to input conditions, making it readily applicable to both conditional and unconditional generation. We further analyze the guidance term provided by TPG and show that its effect on sampling resembles that of CFG more closely than existing training-free guidance techniques do. Extensive experiments on SDXL and Stable Diffusion 2.1 show that TPG achieves nearly a 2× improvement in FID for unconditional generation over the SDXL baseline, while closely matching CFG in prompt alignment. These results establish TPG as a general, condition-agnostic guidance method that brings CFG-like benefits to a broader class of diffusion models. The code is available at https://github.com/TaatiTeam/Token-Perturbation-Guidance
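
To make "norm-preserving shuffling" concrete, here is a minimal sketch. It assumes the perturbation is a random permutation of token vectors within a layer's activations; the abstract does not spell out the exact operation, so the function names and the toy activations below are illustrative only. A permutation reorders rows without changing any values, so the norm of the whole activation matrix is unchanged.

```python
import math
import random


def shuffle_tokens(tokens, rng):
    """Return the token vectors in a random order (a row permutation)."""
    order = list(range(len(tokens)))
    rng.shuffle(order)
    return [tokens[i] for i in order]


def frobenius_norm(tokens):
    """Square root of the sum of squared entries over all token vectors."""
    return math.sqrt(sum(x * x for row in tokens for x in row))


rng = random.Random(0)
# Hypothetical intermediate activations: 4 tokens with 2 channels each.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
perturbed = shuffle_tokens(hidden, rng)

print(frobenius_norm(hidden), frobenius_norm(perturbed))
```

Both printed norms are equal: the perturbed activations contain the same vectors as the original, only at different token positions.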



Article 332

Title@2025-06-10 (2): Extrapolation by Association: Length Generalization Transfer in Transformers

Title: Extrapolation by Association: Length Generalization Transfer in Transformers Extrapolation nach Verband: Länge Generalisierung Transfer in Transformatoren 协会的外推法:变异器中时间的通用转移 2506.09251v1

Authors (5): Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos

Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this paper, we investigate length generalization (the ability to extrapolate from shorter to longer inputs) through the lens of task association. We find that length generalization can be transferred across related tasks. That is, training a model with a longer and related auxiliary task can lead it to generalize to unseen and longer inputs from some other target task. We demonstrate this length generalization transfer across diverse algorithmic tasks, including arithmetic operations, string transformations, and maze navigation. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly. Moreover, we observe similar transfer effects in pretrained language models, suggesting that pretraining equips models with reusable computational scaffolding that facilitates extrapolation in downstream settings. Finally, we provide initial mechanistic evidence that length generalization transfer correlates with the re-use of the same attention heads between the tasks. Together, our findings deepen our understanding of how transformers generalize to out-of-distribution inputs and highlight the compositional reuse of inductive structure across tasks.



Article 333

Title@2025-06-10 (2): Synthesis by Design: Controlled Data Generation via Structural Guidance

Title: Synthesis by Design: Controlled Data Generation via Structural Guidance Synthese nach Design: Kontrollierte Datengenerierung über strukturelle Leitlinien 按设计分列的综合报告:通过结构指导控制的数据生成 2506.07664v2

Authors (4): Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu

Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation quality and problem complexity. To address this, we propose to extract structural information with generated problem-solving code from mathematical reasoning and guide data generation with structured solutions. Applied to MATH and GSM8K, our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty. Results on our benchmark show that model performance declines as reasoning length increases. Additionally, we conducted fine-tuning experiments using the proposed training data on a range of LLMs, and the results validate the effectiveness of our dataset. We hope the proposed method and dataset will contribute to future research in enhancing LLM reasoning capabilities. Our code and data are available at https://github.com/OpenCausaLab/StructuralGeneration.



Article 334

Title@2025-06-10 (2): StochasTok: Improving Fine-Grained Subword Understanding in LLMs

Title: StochasTok: Improving Fine-Grained Subword Understanding in LLMs StochasTok: Verbesserung des feinkörnigen Unterwortverständnisses in LLMs StochasTok:改进在LLM中精美的子字理解 2506.01687v2

Authors (8): Anya Sims, Thom Foster, Klara Kaleb, Tuan-Duy H. Nguyen, Joseph Lee, Jakob N. Foerster, Yee Whye Teh, Cong Lu

Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still often struggle with seemingly simple subword-level tasks like “How many ‘r’s in ‘strawberry’?”. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to ‘see’ their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs’ downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok’s simplicity allows seamless integration at any stage of the training pipeline, and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: https://github.com/anyasims/stochastok.
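
As a rough illustration of "randomly splits tokens during training", the sketch below replaces a token with two shorter vocabulary entries that concatenate back to it. The toy vocabulary, the split probability `p`, and the single pass over the sequence are assumptions for the example and may differ from the released implementation.

```python
import random


def possible_splits(token, vocab):
    """All ways to write `token` as two pieces that are both in the vocabulary."""
    return [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i] in vocab and token[i:] in vocab
    ]


def stochastic_split(tokens, vocab, p, rng):
    """With probability p, replace each splittable token by one of its splits."""
    out = []
    for tok in tokens:
        splits = possible_splits(tok, vocab)
        if splits and rng.random() < p:
            out.extend(rng.choice(splits))
        else:
            out.append(tok)
    return out


# Hypothetical vocabulary in which "strawberry" has exactly one valid split.
vocab = {"strawberry", "straw", "berry", "st", "raw", "I", "like"}
rng = random.Random(0)
print(stochastic_split(["I", "like", "strawberry"], vocab, p=1.0, rng=rng))
```

The text represented by the sequence never changes; only its segmentation does, so the model sees the same word both as one token and as its parts.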



Article 335

Title@2025-06-10 (2): A Technique for Isolating Lexically-Independent Phonetic Dependencies in Generative CNNs

Title: A Technique for Isolating Lexically-Independent Phonetic Dependencies in Generative CNNs Eine Technik zur Isolierung Lexisch-Unabhängiger phonetischer Abhängigkeiten in generativen CNNs 制作有线电视新闻网的孤立地独立电话依赖技术 2506.09218v1

Authors (1): Bruno Ferenc Šegedin

The ability of deep neural networks (DNNs) to represent phonotactic generalizations derived from lexical learning remains an open question. This study (1) investigates the lexically-invariant generalization capacity of generative convolutional neural networks (CNNs) trained on raw audio waveforms of lexical items and (2) explores the consequences of shrinking the fully-connected layer (FC) bottleneck from 1024 channels to 8 before training. Ultimately, a novel technique for probing a model’s lexically-independent generalizations is proposed that works only under the narrow FC bottleneck: generating audio outputs by bypassing the FC and inputting randomized feature maps into the convolutional block. These outputs are equally biased by a phonotactic restriction in training as are outputs generated with the FC. This result shows that the convolutional layers can dynamically generalize phonetic dependencies beyond lexically-constrained configurations learned by the FC.



Article 336

Title@2025-06-10 (2): SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research

Title: SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research SimClass: Ein Klassenzimmer-Sprachdatensatz, der über Game Engine Simulation für automatische Spracherkennungsforschung erzeugt wird SimClass:通过游戏引擎模拟生成的课堂语音数据集,用于自动语音识别研究 2506.09206v1

Authors (3): Ahmed Adel Attia, Jing Liu, Carl Espy-Wilson

The scarcity of large-scale classroom speech data has hindered the development of AI-driven speech models for education. Public classroom datasets remain limited, and the lack of a dedicated classroom noise corpus prevents the use of standard data augmentation techniques. In this paper, we introduce a scalable methodology for synthesizing classroom noise using game engines, a framework that extends to other domains. Using this methodology, we present SimClass, a dataset that includes both a synthesized classroom noise corpus and a simulated classroom speech dataset. The speech data is generated by pairing a public children’s speech corpus with YouTube lecture videos to approximate real classroom interactions in clean conditions. Our experiments on clean and noisy speech demonstrate that SimClass closely approximates real classroom speech, making it a valuable resource for developing robust speech recognition and enhancement models.



Article 337

Title@2025-06-10 (2): Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs

Title: Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs Kann LLMs Dolmetschen und Leverage strukturierte sprachliche Repräsentationen? Eine Fallstudie mit AMRs LLMs 能够解释和利用结构化语言代表吗? 2504.04745v3

Authors (3): Ankush Raut, Xiaofeng Zhu, Maria Leonor Pacheco

This paper evaluates the ability of Large Language Models (LLMs) to leverage contextual information in the form of structured linguistic representations. Specifically, we examine the impact of encoding both short and long contexts using Abstract Meaning Representation (AMR) structures across a diverse set of language tasks. We perform our analysis using 8-bit quantized and instruction-tuned versions of Llama 3.1 (8B), Phi-3, and Mistral 7B. Our results indicate that, for tasks involving short contexts, augmenting the prompt with the AMR of the original language context often degrades the performance of the underlying LLM. However, for tasks that involve long contexts, such as dialogue summarization in the SAMSum dataset, this enhancement improves LLM performance, for example, by increasing the zero-shot cosine similarity score of Llama 3.1 from 66% to 76%. This improvement is more evident in the newer and larger LLMs, but does not extend to the older or smaller ones. In addition, we observe that LLMs can effectively reconstruct the original text from a linearized AMR, achieving a cosine similarity of 81% in the best-case scenario.



Article 338

Title@2025-06-10 (2): Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Title: Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems Können LLMs zuverlässige Testfallgeneratoren generieren? Eine Studie zu Wettbewerbs-Level-Programmierungsproblemen LLM女士能产生可靠的试验案例发电机吗? 2506.06821v2

Authors (21): Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tian Xie, Tianxing He

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation and can tackle complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that effectively reveal flaws in human code. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.



Article 339

Title@2025-06-10 (2): PHRASED: Phrase Dictionary Biasing for Speech Translation

Title: PHRASED: Phrase Dictionary Biasing for Speech Translation PHRASED: Phrase Dictionary Biasing for Speech Translation PHRASED: 语音翻译的词典比喻 2506.09175v1

Authors (6): Peidong Wang, Jian Xue, Rui Zhao, Junkun Chen, Aswin Shanmugam Subramanian, Jinyu Li

Phrases are essential for understanding the core concepts in conversations. However, because they occur rarely in training data, translating phrases correctly is challenging in speech translation tasks. In this paper, we propose a phrase dictionary biasing method to leverage pairs of phrases mapping from the source language to the target language. We apply the phrase dictionary biasing method to two types of widely adopted models, a transducer-based streaming speech translation model and a multimodal large language model. Experimental results show that the phrase dictionary biasing method outperforms phrase list biasing by a relative 21% for the streaming speech translation model. In addition, phrase dictionary biasing enables multimodal large language models to use external phrase information, achieving an 85% relative improvement in phrase recall.



Article 340

Title@2025-06-10 (2): The Curious Language Model: Strategic Test-Time Information Acquisition

Title: The Curious Language Model: Strategic Test-Time Information Acquisition Das Kuriose Sprachmodell: Strategische Test-Zeit-Informationsbeschaffung 好奇语言模式:战略试验时间信息采集 2506.09173v1

Authors (5): Michael Cooper, Rohan Wadhawan, John Michael Giorgi, Chenhao Tan, Davis Liang

Decision-makers often possess insufficient information to render a confident decision. In these cases, the decision-maker can often undertake actions to acquire the necessary information about the problem at hand, e.g., by consulting knowledgeable authorities or by conducting experiments. Importantly, different levers of information acquisition come with different costs, posing the challenge of selecting the actions that are both informative and cost-effective. In this work, we propose CuriosiTree, a heuristic-based, test-time policy for zero-shot information acquisition in large language models (LLMs). CuriosiTree employs a greedy tree search to estimate the expected information gain of each action and strategically chooses actions based on a balance of anticipated information gain and associated cost. Empirical validation in a clinical diagnosis simulation shows that CuriosiTree enables cost-effective integration of heterogeneous sources of information, and outperforms baseline action selection strategies in selecting action sequences that enable accurate diagnosis.
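
The trade-off between expected information gain and cost can be illustrated with a small worked example. This is a one-step sketch under explicit assumptions, not CuriosiTree itself: the probabilities are given by hand rather than estimated by an LLM through tree search, and the action names, costs, and the linear score `gain - trade_off * cost` are invented for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_information_gain(prior, likelihood):
    """likelihood[o][h] = P(outcome o | hypothesis h); returns expected entropy drop."""
    gain = entropy(prior)
    for row in likelihood:
        p_outcome = sum(l * p for l, p in zip(row, prior))
        if p_outcome == 0:
            continue
        posterior = [l * p / p_outcome for l, p in zip(row, prior)]
        gain -= p_outcome * entropy(posterior)
    return gain


def choose_action(prior, actions, trade_off):
    """Pick the action with the best gain-minus-cost score (assumed scoring rule)."""
    def score(name):
        likelihood, cost = actions[name]
        return expected_information_gain(prior, likelihood) - trade_off * cost
    return max(actions, key=score)


prior = [0.5, 0.5]  # two equally likely diagnoses
actions = {  # hypothetical actions: (likelihood table, cost)
    "lab_test": ([[1.0, 0.0], [0.0, 1.0]], 3.0),             # decisive but expensive
    "quick_exam": ([[0.9, 0.1], [0.1, 0.9]], 0.5),           # noisy but cheap
    "irrelevant_question": ([[0.5, 0.5], [0.5, 0.5]], 0.1),  # tells us nothing
}
print(choose_action(prior, actions, trade_off=0.2))
```

The lab test yields a full bit of information and the quick exam about 0.53 bits, but once cost is weighed in, the quick exam scores higher; with `trade_off=0` the lab test would be chosen instead.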



Article 341

Title@2025-06-10 (2): Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search

Title: Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search Verbesserung der LLM-Agentenplanung mit In-Context Learning über Atomic Fact Augmentation und Lookahead Search 通过原子事实增强和目光头搜索改进LLM Agent 规划,通过原子事实增强和目光头搜索进行文中学习 2506.09171v1

Authors (4): Samuel Holt, Max Ruiz Luyten, Thomas Pouplin, Mihaela van der Schaar

Large Language Models (LLMs) are increasingly capable but often require significant guidance or extensive interaction history to perform effectively in complex, interactive environments. Existing methods may struggle with adapting to new information or efficiently utilizing past experiences for multi-step reasoning without fine-tuning. We introduce a novel LLM agent framework that enhances planning capabilities through in-context learning, facilitated by atomic fact augmentation and a recursive lookahead search. Our agent learns to extract task-critical “atomic facts” from its interaction trajectories. These facts dynamically augment the prompts provided to LLM-based components responsible for action proposal, latent world model simulation, and state-value estimation. Planning is performed via a depth-limited lookahead search, where the LLM simulates potential trajectories and evaluates their outcomes, guided by the accumulated facts and interaction history. This approach allows the agent to improve its understanding and decision-making online, leveraging its experience to refine its behavior without weight updates. We provide a theoretical motivation linking performance to the quality of fact-based abstraction and LLM simulation accuracy. Empirically, our agent demonstrates improved performance and adaptability on challenging interactive tasks, achieving more optimal behavior as it accumulates experience, showcased in tasks such as TextFrozenLake and ALFWorld.
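
The depth-limited lookahead described above can be sketched as a short recursive search. In this sketch the three callables `propose`, `simulate`, and `value` stand in for the LLM-based action proposal, world-model simulation, and state-value estimation components (in the described framework each would be prompted with the accumulated atomic facts); the toy corridor environment, the discount `gamma`, and all names are assumptions for illustration.

```python
def lookahead(state, depth, propose, simulate, value, gamma=0.95):
    """Return (estimated return, best first action) from a depth-limited search."""
    if depth == 0:
        return value(state), None
    best_value, best_action = float("-inf"), None
    for action in propose(state):
        next_state, reward, done = simulate(state, action)
        future = 0.0 if done else lookahead(
            next_state, depth - 1, propose, simulate, value, gamma)[0]
        total = reward + gamma * future
        if total > best_value:
            best_value, best_action = total, action
    if best_action is None:  # nothing proposed: fall back to the value estimate
        return value(state), None
    return best_value, best_action


# Toy stand-ins: a corridor of cells 0..4 with the goal at cell 3.
GOAL = 3

def propose(state):
    return ["left", "right"]

def simulate(state, action):
    nxt = max(0, state - 1) if action == "left" else min(4, state + 1)
    reached = nxt == GOAL
    return nxt, (1.0 if reached else 0.0), reached

def value(state):
    return 0.0  # uninformed leaf estimate

best_value, best_action = lookahead(1, 2, propose, simulate, value)
print(best_action, best_value)
```

Starting two steps from the goal with a search depth of two, the search finds that moving right reaches the goal, and the one-step delay is reflected in the discounted return.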



Article 342

Title@2025-06-10 (2): Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models

Title: Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models Multimodale Inkonsistenz-Reasoning (MMIR): Ein neuer Maßstab für multimodale Reasoning-Modelle 多式联运不一致理由:多式联运理由理由模型的新基准 2502.16033v3

Authors (8): Qianqi Yan, Yue Fan, Hongquan Li, Shan Jiang, Yang Zhao, Xinze Guan, Ching-Chen Kuo, Xin Eric Wang

Existing Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs, leaving open the question of whether they can handle inconsistencies in real-world, layout-rich content. To bridge this gap, we propose the Multimodal Inconsistency Reasoning (MMIR) benchmark to assess MLLMs’ ability to detect and reason about semantic mismatches in artifacts such as webpages, presentation slides, and posters. MMIR comprises 534 challenging samples, each containing synthetically injected errors across five reasoning-heavy categories: Factual Contradiction, Identity Misattribution, Contextual Mismatch, Quantitative Discrepancy, and Temporal/Spatial Incoherence. We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts while open-source models remain particularly vulnerable to inconsistency errors. Detailed error analyses further show that models excel in detecting pairwise inconsistencies but struggle with inconsistencies confined to single elements in complex layouts. Probing experiments reveal that single-modality prompting, including Chain-of-Thought (CoT) and Set-of-Mark (SoM) methods, yields marginal gains, revealing a key bottleneck in cross-modal reasoning. Our findings highlight the need for advanced multimodal reasoning and point to future research on multimodal inconsistency.



Article 343

Title@2025-06-10 (2): ICONS: Influence Consensus for Vision-Language Data Selection

Title: ICONS: Influence Consensus for Vision-Language Data Selection ICONS: Einflusskonsens für Vision-Sprachen-Datenauswahl ICONS: 愿景语言数据选择影响共识 2501.00654v3

Authors (6): Xindi Wu, Mengzhou Xia, Rulin Shao, Zhiwei Deng, Pang Wei Koh, Olga Russakovsky

Training vision-language models via instruction tuning often relies on large mixtures of data spanning diverse tasks and domains. However, these mixtures frequently include redundant information, increasing computational costs without proportional performance gains, necessitating more effective data selection strategies. Existing methods typically rely on task-agnostic heuristics to estimate data importance or focus on optimizing single tasks in isolation, limiting their effectiveness in multitask settings. In this work, we introduce ICONS, a gradient-based Influence CONsensus approach for vision-language data Selection. Our method leverages first-order training dynamics to estimate the influence of individual training examples on validation performance and aggregates these estimates across tasks via majority voting over task-specific influences. This cross-task consensus identifies data points that are consistently valuable across tasks, enabling us to prioritize examples that drive overall performance. The voting-based design further mitigates issues such as score calibration and outlier sensitivity, resulting in robust and scalable data selection for diverse multitask mixtures. With only 20% of the data from LLaVA-665K and Cambrian-7M, our selected subsets retain 98.6% and 98.8% of the performance achieved with full datasets, and can even surpass full data training at a 60% selection ratio on LLaVA-665K. Our approach also generalizes to unseen tasks and architectures, demonstrating strong transfer. We release two compact, high-utility subsets, LLaVA-ICONS-133K and Cambrian-ICONS-1.4M, preserving impactful training examples for efficient and scalable vision-language model development.
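
The cross-task majority vote can be sketched as follows. This is a simplified illustration: the influence scores are made-up numbers rather than gradient-based estimates, and the per-task cutoff `per_task_top` and the tie-breaking by index are assumptions of the example.

```python
def select_by_consensus(influence, per_task_top, budget):
    """influence: task name -> list of scores, one per training example."""
    n = len(next(iter(influence.values())))
    votes = [0] * n
    for scores in influence.values():
        # Each task votes for its own highest-influence examples.
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        for i in ranked[:per_task_top]:
            votes[i] += 1
    # Keep the examples with the most votes (ties keep the lower index first).
    order = sorted(range(n), key=lambda i: votes[i], reverse=True)
    return order[:budget], votes


# Hypothetical influence scores for 5 training examples on 3 tasks.
# Note the very different score scales across tasks.
influence = {
    "vqa": [0.9, 0.1, 0.8, 0.2, 0.3],
    "ocr": [50.0, 10.0, 40.0, 60.0, 5.0],
    "chart": [0.01, 0.03, 0.05, 0.02, 0.0],
}
selected, votes = select_by_consensus(influence, per_task_top=2, budget=2)
print(selected, votes)
```

Because each task only contributes a vote, the large raw scores of the `ocr` task do not dominate the selection, which is the calibration and outlier benefit the abstract attributes to the voting-based design.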



Article 344

Title@2025-06-10 (2): Adversarial Text Generation with Dynamic Contextual Perturbation

Title: Adversarial Text Generation with Dynamic Contextual Perturbation Adversariale Textgenerierung mit dynamischer kontextueller Störung 具有动态环境扰动作用的反向生成文本 2506.09148v1

Authors (4): Hetvi Waghela, Jaydip Sen, Sneha Rakshit, Subhasis Dasgupta

Adversarial attacks on Natural Language Processing (NLP) models expose vulnerabilities by introducing subtle perturbations to input text, often leading to misclassification while maintaining human readability. Existing methods typically focus on word-level or local text segment alterations, overlooking the broader context, which results in detectable or semantically inconsistent perturbations. We propose a novel adversarial text attack scheme named Dynamic Contextual Perturbation (DCP). DCP dynamically generates context-aware perturbations across sentences, paragraphs, and documents, ensuring semantic fidelity and fluency. Leveraging the capabilities of pre-trained language models, DCP iteratively refines perturbations through an adversarial objective function that balances the dual objectives of inducing model misclassification and preserving the naturalness of the text. This comprehensive approach allows DCP to produce more sophisticated and effective adversarial examples that better mimic natural language patterns. Our experimental results, conducted on various NLP models and datasets, demonstrate the efficacy of DCP in challenging the robustness of state-of-the-art NLP systems. By integrating dynamic contextual analysis, DCP significantly enhances the subtlety and impact of adversarial attacks. This study highlights the critical role of context in adversarial attacks and lays the groundwork for creating more robust NLP systems capable of withstanding sophisticated adversarial strategies.



Article 345

Title@2025-06-10 (2): LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Title: LLM-as-a-qualitative-judge: automating error analysis in natural language generation LLM-as-a-qualitative-Richter: Automatisierung der Fehleranalyse bei der Generierung natürlicher Sprachen LLM-as-as-法官法官:在自然语言生成中进行自动误差分析 2506.09147v1

Authors (7): Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian

Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e., with numerical scores as the main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach aims to give developers meaningful insights into what improvements can be made to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in two-thirds of cases and is capable of producing error type reports resembling the reports composed by human annotators. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
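
One plausible reading of the "cumulative" clustering step is sketched below: issues are processed one at a time and either join an existing issue type or open a new one. The abstract does not give the algorithm's details, so this is an assumption-laden illustration; in particular, `same_type` is a keyword-based stand-in for what would be an LLM judgment, and the example issues are invented.

```python
def cumulative_clustering(issues, same_type):
    """Assign each issue to the first matching cluster, else start a new one."""
    clusters = {}  # cluster label -> member issues, in discovery order
    for issue in issues:
        for label, members in clusters.items():
            if same_type(issue, label):
                members.append(issue)
                break
        else:
            clusters[issue] = [issue]  # the first issue of a type names it
    return clusters


# Stand-in for the LLM judgment: two issues match if they share a keyword.
KEYWORDS = ("language", "hallucinated")

def same_type(issue, label):
    return any(k in issue and k in label for k in KEYWORDS)


issues = [
    "answer in wrong language",
    "hallucinated date",
    "wrong language in summary",
    "hallucinated name",
    "output truncated",
]
clusters = cumulative_clustering(issues, same_type)
for label, members in clusters.items():
    print(f"{len(members)} x {label}")
```

The resulting counts per issue type are the raw material for the kind of structured error report the abstract describes.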


Article 346

Title@2025-06-10 (2): Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

Title: Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better Autoregressive Semantische visuelle Rekonstruktion hilft VLMs besser verstehen 自动递减语义视觉重建帮助VLMs更好地了解 2506.09040v1

Authors (7): Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang

Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.


Article 347

Title@2025-06-10 (2): Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

Title: Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning Router-R1: Lehren von LLMs Multi-Round Routing und Aggregation durch Verstärkungslernen 路由-R1路由-R1路由:教学LLMS 2506.09033v1

Authors (3): Haozhen Zhang, Tao Feng, Jiaxuan You

The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave “think” actions (internal deliberation) with “route” actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward, opening a pathway toward optimizing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/Router-R1.


Article 348

Title@2025-06-10 (2): JuStRank: Benchmarking LLM Judges for System Ranking

Title: JuStRank: Benchmarking LLM Judges for System Ranking JuStRank: Benchmarking der LLM-Richter für Systemranking JuStRank:为系统评分确定LLM法官基准 2412.09569v2

Authors (6): Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai

Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge’s positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge’s quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
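
The system-level evaluation described above can be sketched roughly as follows. This is only an illustration: the abstract does not say which aggregation or ranking-agreement measure is used, so the mean aggregation, the Kendall's tau comparison, and all names and scores below are assumptions made for the example.

```python
from statistics import mean

def system_ranking(scores_by_system):
    """Rank systems from best to worst by their mean judge score (assumed aggregation)."""
    means = {name: mean(scores) for name, scores in scores_by_system.items()}
    return sorted(means, key=means.get, reverse=True)

def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings of the same items, assuming no ties."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    items = list(ranking_a)
    concordant = discordant = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            order_a = pos_a[items[i]] - pos_a[items[j]]
            order_b = pos_b[items[i]] - pos_b[items[j]]
            if order_a * order_b > 0:
                concordant += 1
            else:
                discordant += 1
    total_pairs = len(items) * (len(items) - 1) / 2
    return (concordant - discordant) / total_pairs

# Hypothetical judge scores over three outputs per system.
judge_scores = {
    "sys_a": [0.9, 0.8, 0.7],
    "sys_b": [0.6, 0.7, 0.5],
    "sys_c": [0.4, 0.9, 0.6],
}
judge_ranking = system_ranking(judge_scores)
human_ranking = ["sys_a", "sys_b", "sys_c"]  # hypothetical human-based ranking
tau = kendall_tau(judge_ranking, human_ranking)
print(judge_ranking, round(tau, 3))
```

A tau close to 1 would indicate that the judge orders systems much as the human ranking does; here the judge swaps the last two systems, which lowers the agreement.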


Article 349

Title@2025-06-10 (2): e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

Title: e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs e3: Erforschen lernen ermöglicht Extrapolation von Test-Time Compute für LLMs e3: 学习探索以利对LLMM的试验时间计算进行外推计算 2506.09026v1

Authors (8): Amrith Setlur, Matthew Y. R. Yang, Charlie Snell, Jeremy Greer, Ian Wu, Virginia Smith, Max Simchowitz, Aviral Kumar

Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep “thinking” for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test-time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging “negative” gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chain additional asymmetries; and (3) coupling task difficulty with training token budget during training via a specifically designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME’25 and HMMT’25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.
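
For readers unfamiliar with the pass@k metric mentioned above, here is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), computed from n sampled solutions of which c are correct. Whether the authors compute pass@k exactly this way is an assumption; the sample counts are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given that c out of n drawn samples were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples drawn, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With k = 1 the estimator reduces to the fraction of correct samples, which is why pass@1 and pass@k can move differently between a base model and a fine-tuned one.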


Article 350

Title@2025-06-10 (2): High-Throughput Phenotyping of Clinical Text Using Large Language Models

Title: High-Throughput Phenotyping of Clinical Text Using Large Language Models Hochdurchsatz-Phenotypisierung von klinischem Text mit großen Sprachmodellen 使用大语言模型的临床文本的高通量测算 2408.01214v2

Authors (5): Daniel B. Hier, S. Ilyas Munzir, Anne Stahlfeld, Tayo Obafemi-Ajayi, Michael D. Carrithers

High-throughput phenotyping automates the mapping of patient signs to standardized ontology concepts and is essential for precision medicine. This study evaluates the automation of phenotyping of clinical summaries from the Online Mendelian Inheritance in Man (OMIM) database using large language models. Due to their rich phenotype data, these summaries can be surrogates for physician notes. We conduct a performance comparison of GPT-4 and GPT-3.5-Turbo. Our results indicate that GPT-4 surpasses GPT-3.5-Turbo in identifying, categorizing, and normalizing signs, achieving concordance with manual annotators comparable to inter-rater agreement. Despite some limitations in sign normalization, the extensive pre-training of GPT-4 results in high performance and generalizability across several phenotyping tasks while obviating the need for manually annotated training data. Large language models are expected to be the dominant method for automating high-throughput phenotyping of clinical text.


Article 351

Title@2025-06-10 (2): Learning to Reason Across Parallel Samples for LLM Reasoning

Title: Learning to Reason Across Parallel Samples for LLM Reasoning Über parallele Proben für LLM-Reasoning nachdenken LLM 理由的平行抽样学习 2506.09014v1

Authors (5): Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, Eunsol Choi

Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (e.g., through majority voting or by using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such sets of multiple samples. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, optimizing it for answer accuracy with reinforcement learning. Experiments on multiple reasoning datasets show that SSA outperforms other test-time scaling methods such as reward model-based re-ranking. Our approach also shows promising generalization ability across sample set sizes, base model families and scales, and tasks. By separating the LLMs that generate answers from the LLMs that analyze and aggregate sampled answers, our approach can work with the outputs of premier black-box models easily and efficiently.
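
The two heuristic aggregation baselines mentioned above (majority voting and verifier-based ranking) can be sketched as follows. This illustrates the baselines only, not the SSA model itself; the sampled answers and verifier scores are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

def best_by_verifier(answers, verifier_scores):
    """Return the answer whose sample received the highest verifier score."""
    best_index = max(range(len(answers)), key=lambda i: verifier_scores[i])
    return answers[best_index]

# Hypothetical final answers extracted from five sampled solutions,
# with made-up verifier scores for each sample.
samples = ["42", "41", "42", "17", "42"]
scores = [0.55, 0.91, 0.60, 0.10, 0.58]

print(majority_vote(samples))             # most common answer
print(best_by_verifier(samples, scores))  # answer of the top-scored sample
```

The two heuristics can disagree, as they do on this toy input, which is one motivation for learning the aggregation step instead of fixing it by hand.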


Article 352

Title@2025-06-10 (2): Employing self-supervised learning models for cross-linguistic child speech maturity classification

Title: Employing self-supervised learning models for cross-linguistic child speech maturity classification Verwendung von selbstüberwachten Lernmodellen für die Klassifizierung der sprachübergreifenden Sprachreife von Kindern 采用自我监督的学习模式,进行跨语言儿童言语成熟程度分类 2506.08999v1

Authors (6): Theo Zhang, Madurya Suresh, Anne S. Warlaumont, Kasia Hitczenko, Alejandrina Cristia, Margaret Cychosz

Speech technology systems struggle with many downstream tasks for child speech due to small training corpora and the difficulties that child speech poses. We apply a novel dataset, SpeechMaturity, to state-of-the-art transformer models to address a fundamental classification task: identifying child vocalizations. Unlike previous corpora, our dataset captures maximally ecologically valid child vocalizations across an unprecedented sample, comprising children acquiring 25+ languages in the U.S., Bolivia, Vanuatu, Papua New Guinea, Solomon Islands, and France. The dataset contains 242,004 labeled vocalizations, magnitudes larger than previous work. Models were trained to distinguish between cry, laughter, mature speech (consonant+vowel), and immature speech (just consonant or vowel). Models trained on the dataset outperformed state-of-the-art models trained on previous datasets, achieved classification accuracy comparable to humans, and were robust across rural and urban settings.


Article 353

Title@2025-06-10 (2): CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation

Title: CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation CAIRe: Kulturelle Zuordnung von Bildern durch Retrieval-Augmented Evaluation CAIRe:通过检索-增强评价对图像进行文化归属 2506.09109v1

Authors (5): Arnav Yayavaram, Siddharth Yayavaram, Simran Khanuja, Michael Saxon, Graham Neubig

As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered by trade-offs, including a loss in performance, factual inaccuracies, or offensive outputs. Despite widespread recognition of these challenges, an inability to reliably measure these biases has stalled progress. To address this gap, we introduce CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image, given a user-defined set of labels. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label. On a manually curated dataset of culturally salient but rare items built using language models, CAIRe surpasses all baselines by 28% F1 points. Additionally, we construct two datasets for culturally universal concepts, one comprising T2I-generated outputs and another retrieved from naturally occurring data. CAIRe achieves Pearson’s correlations of 0.56 and 0.66 with human ratings on these sets, based on a 5-point Likert scale of cultural relevance. This demonstrates its strong alignment with human judgment across diverse image sources.


Article 354

Title@2025-06-10 (2): SensorLM: Learning the Language of Wearable Sensors

Title: SensorLM: Learning the Language of Wearable Sensors SensorLM: Die Sprache der tragbaren Sensoren lernen 传感器LM:学习穿装传感器的语言 2506.09108v1

Authors (20): Yuwei Zhang, Kumar Ayush, Siyuan Qiao, A. Ali Heydari, Girish Narayanswamy, Maxwell A. Xu, Ahmed A. Metwally, Shawn Xu, Jake Garrison, Xuhai Xu, Tim Althoff, Yun Liu, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Cecilia Mascolo, Xin Liu, Daniel McDuff, Yuzhe Yang

We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. Despite its pervasive nature, aligning and interpreting sensor data with language remains challenging due to the lack of paired, richly annotated sensor-text descriptions in uncurated, real-world wearable data. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa) and recovers them as specific variants within a generic architecture. Extensive experiments on real-world tasks in human activity analysis and healthcare verify the superior performance of SensorLM over state-of-the-art in zero-shot recognition, few-shot learning, and cross-modal retrieval. SensorLM also demonstrates intriguing capabilities including scaling behaviors, label efficiency, sensor captioning, and zero-shot generalization to unseen tasks.


Article 355

Title@2025-06-10 (2): Mechanistic Decomposition of Sentence Representations

Title: Mechanistic Decomposition of Sentence Representations Mechanistische Zersetzung von Satzdarstellungen 判决代表的体力分解 2506.04373v2

Authors (5): Matthieu Tehenan, Vikram Natarajan, Jonathan Michala, Milton Lin, Juri Opitz

Sentence embeddings are central to modern NLP and AI systems, yet little is known about their internal structure. While we can compare these embeddings using measures such as cosine similarity, the contributing features are not human-interpretable, and the content of an embedding seems untraceable, as it is masked by complex neural transformations and a final pooling operation that combines individual token embeddings. To alleviate this issue, we propose a new method to mechanistically decompose sentence embeddings into interpretable components, by using dictionary learning on token-level representations. We analyze how pooling compresses these features into sentence representations, and assess the latent features that reside in a sentence embedding. This bridges token-level mechanistic interpretability with sentence-level analysis, making for more transparent and controllable representations. In our studies, we obtain several interesting insights into the inner workings of sentence embedding spaces, for instance, that many semantic and syntactic aspects are linearly encoded in the embeddings.
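
To make the interaction between pooling and token-level dictionary features concrete, here is a toy sketch. It assumes mean pooling and a purely linear dictionary (each token vector is a weighted sum of dictionary atoms); the abstract does not state either choice, and the dictionary and codes below are invented. Under those assumptions, pooling the tokens gives the same vector as decoding the averaged codes.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def decode(codes, dictionary):
    """Reconstruct a vector as a weighted sum of dictionary atoms."""
    dim = len(dictionary[0])
    return [sum(c * atom[d] for c, atom in zip(codes, dictionary)) for d in range(dim)]

# Hypothetical dictionary with three atoms in a 2-dimensional embedding space.
dictionary = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical sparse codes for two tokens (one coefficient per atom).
token_codes = [[2.0, 0.0, 0.0], [0.0, 0.0, 4.0]]

token_vectors = [decode(codes, dictionary) for codes in token_codes]
sentence_embedding = mean_pool(token_vectors)

# Because decoding is linear, the pooled embedding equals the decoded mean code.
sentence_codes = mean_pool(token_codes)
print(sentence_embedding, decode(sentence_codes, dictionary))
```

In this linear setting the sentence-level feature activations are simply the averaged token-level activations, which is one way to read how pooling "compresses" token features; real encoders and pooling schemes may deviate from this.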


Article 356

Title@2025-06-10 (2): SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning

Title: SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning SwS: Selbstbewusste Schwäche-getriebene Problemsynthese bei der Stärkung des Lernens für LLM-Reasoning SWS:自觉的弱点驱动的问题综述 2506.08989v1

Authors (8): Xiao Liang, Zhong-Zhi Li, Yeyun Gong, Yang Wang, Hengyuan Zhang, Yelong Shen, Ying Nian Wu, Weizhu Chen

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems and limited-verification answers in existing distillation-oriented synthetic datasets limit their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model’s capabilities, leading to low efficiency in generating useful questions. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn through its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model’s weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.


Article 357

Title@2025-06-10 (2): DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

Title: DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments DefenderBench: Ein Toolkit zur Bewertung von Sprachagenten in Cybersicherheitsumgebungen 捍卫堡垒:网络安全环境中评价语言代理工具工具包 2506.00739v2

Authors (9): Chiyu Zhang, Marc-Alexandre Cote, Michael Albada, Anush Sankaran, Jack W. Stokes, Tong Wang, Amir Abdi, William Blum, Muhammad Abdul-Mageed

Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench’s modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.


Article 358

Title@2025-06-10 (2): Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Title: Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models Quamba2: Robustes und skalierbares Post-Training Quantization Framework für selektive State Space Models Quamba2:有选择性的国家空间模型强力和可扩展的培训后量化框架 2503.22879v3

Authors (6): Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu

State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or limited-resource devices is challenging due to their storage requirements and computational power. To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. As SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, distinct bit-width configurations are essential for different scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for enhancing generation speed in short prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel order preserving and activation persistence of SSMs, we propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering for input $x$, combined with a per-state-group quantization for input-dependent parameters $B$ and $C$. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show that Quamba2-8B outperforms two state-of-the-art SSM quantization methods and delivers 1.3$\times$ and 3$\times$ speed-ups in the pre-filling and generation stages, respectively, while offering 4$\times$ memory reduction with only a $1.6\%$ average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: https://github.com/enyac-group/Quamba.
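
As background for the per-group quantization mentioned above, here is a generic sketch of symmetric 8-bit quantization with one scale per group of values. It is not the Quamba2 algorithm (which additionally sorts and clusters channels and rearranges weights offline); the function names, group size, and input values are assumptions for illustration.

```python
def quantize_groups(values, group_size, bits=8):
    """Symmetric quantization with one scale per contiguous group of values."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    quantized, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.append([max(-qmax, min(qmax, round(v / scale))) for v in group])
    return quantized, scales

def dequantize_groups(quantized, scales):
    """Map integer codes back to approximate real values."""
    return [q * s for group, s in zip(quantized, scales) for q in group]

# Hypothetical activations with very different magnitudes in each half.
values = [0.01, -0.02, 0.03, 10.0, -20.0, 30.0]

per_group, group_scales = quantize_groups(values, group_size=3)
single, single_scale = quantize_groups(values, group_size=6)

print(per_group)   # small values keep distinct integer codes
print(single)      # with one shared scale, the small values collapse to 0
```

Grouping values of similar magnitude under a shared scale keeps small entries from being rounded away, which is the general intuition behind quantizing similar channels or states together.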


Article 359

Title: Naturalistic Language-related Movie-Watching fMRI Task for Detecting Neurocognitive Decline and Disorder Naturalistische Sprache-bezogene Film-Watching fMRI Aufgabe für die Erkennung neurokognitiver Deklination und Störung FMRI 检测神经分辨衰竭和疾病任务 2506.08986v1

Authors (7): Yuejiao Wang, Xianmin Gong, Xixin Wu, Patrick Wong, Hoi-lam Helene Fung, Man Wai Mak, Helen Meng

Early detection is crucial for timely intervention aimed at preventing and slowing the progression of neurocognitive disorder (NCD), a common and significant health problem among the aging population. Recent evidence has suggested that language-related functional magnetic resonance imaging (fMRI) may be a promising approach for detecting cognitive decline and early NCD. In this paper, we proposed a novel, naturalistic language-related fMRI task for this purpose. We examined the effectiveness of this task among 97 non-demented Chinese older adults from Hong Kong. The results showed that machine-learning classification models based on fMRI features extracted from the task and demographics (age, gender, and education year) achieved an average area under the curve of 0.86 when classifying participants’ cognitive status (labeled as NORMAL vs DECLINE based on their scores on a standard neurocognitive test). Feature localization revealed that the fMRI features most frequently selected by the data-driven approach came primarily from brain regions associated with language processing, such as the superior temporal gyrus, middle temporal gyrus, and right cerebellum. The study demonstrated the potential of the naturalistic language-related fMRI task for early detection of aging-related cognitive decline and NCD.


Article 360

Title@2025-06-10 (2): PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips

Title: PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips PrisonBreak: Gefängnisbrechende große Sprachmodelle mit weniger als fünfundzwanzig gezielten Bit-Flips 监狱破碎:破禁大语言模式,目标位翻页小于25个目标小于25个 2412.07192v2

Authors (8): Zachary Coalson, Jeonghyun Woo, Yu Sun, Shiyang Chen, Lishan Yang, Prashant Nair, Bo Fang, Sanghyun Hong

We introduce a new class of attacks on commercial-scale (human-aligned) language models that induce jailbreaking through targeted bitwise corruptions in model parameters. Our adversary can jailbreak billion-parameter language models with fewer than 25 bit-flips in all cases (and as few as 5 in some), using up to 40$\times$ fewer bit-flips than existing attacks on computer vision models at least 100$\times$ smaller. Unlike prompt-based jailbreaks, our attack renders these models in memory ‘uncensored’ at runtime, allowing them to generate harmful responses without any input modifications. Our attack algorithm efficiently identifies target bits to flip, offering up to 20$\times$ more computational efficiency than previous methods. This makes it practical for language models with billions of parameters. We show an end-to-end exploitation of our attack using software-induced fault injection, Rowhammer (RH). Our work examines 56 DRAM RH profiles from DDR4 and LPDDR4X devices with different RH vulnerabilities. We show that our attack can reliably induce jailbreaking in systems similar to those affected by prior bit-flip attacks. Moreover, our approach remains effective even against highly RH-secure systems (e.g., 46$\times$ more secure than previously tested systems). Our analyses further reveal that: (1) models with less post-training alignment require fewer bit flips to jailbreak; (2) certain model components, such as value projection layers, are substantially more vulnerable than others; and (3) our method is mechanistically different from existing jailbreaks. Our findings highlight a pressing, practical threat to the language model ecosystem and underscore the need for research to protect these models from bit-flip attacks.


Article 361

Title@2025-06-10 (2): FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents

Title: FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents FROST-EMA: Finnische und Russische Sprachdatensätze für elektromagnetische Artikulographiemessungen mit L1, L2 und imitierten L2 Accents FROST-EMA:芬兰和俄罗斯关于L1、L2和模拟L2弧度电磁电动天文学测量的口头语音数据集 2506.08981v1

Authors (6): Satu Hopponen, Tomi Kinnunen, Alexandre Nikolaev, Rosa González Hautamäki, Lauri Tavi, Einar Meister

We introduce a new FROST-EMA (Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography) corpus. It consists of 18 bilingual speakers, who produced speech in their native language (L1), second language (L2), and imitated L2 (fake foreign accent). The new corpus enables research into language variability from phonetic and technological points of view. Accordingly, we include two preliminary case studies to demonstrate both perspectives. The first case study explores the impact of L2 and imitated L2 on the performance of an automatic speaker verification system, while the second illustrates the articulatory patterns of one speaker in L1, L2, and a fake accent.


Article 362

Title@2025-06-10 (2): DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models

Title: DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models DiffLM: Ansteuerbare Synthetische Datengenerierung über Diffusions-Sprachenmodelle DiffLM: 通过传播语言模型生成可控合成数据 2411.03250v2

Authors (9): Ying Zhou, Xinyao Wang, Yulei Niu, Yaojie Shen, Lexin Tang, Fan Chen, Ben He, Le Sun, Longyin Wen

Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis. However, synthetic data generation via prompting LLMs remains challenging due to LLMs’ limited understanding of target data distributions and the complexity of prompt engineering, especially for structured formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on variational autoencoder (VAE), which further (1) leverages diffusion models to reserve more information of original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM’s generative objectives via a plug-and-play latent feature injection module. As we observed significant discrepancies between the VAE’s latent representations and the real data distribution, the latent diffusion module is introduced into our framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., Tabular, Code, and Tool data) demonstrate that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2%-7% in certain cases. Data and code are available at https://github.com/bytedance/DiffLM.


Article 363

Title@2025-06-10 (2): Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System

Title: Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System Atom-to-Compositional Generalization für mobile Agenten mit einem neuen Benchmark- und Scheduling-System 具有新的基准和日程安排系统的移动剂原子到集成一般化 2506.08972v1

Authors (6): Yuan Guo, Tingjia Miao, Zheng Wu, Pengzhou Cheng, Ming Zhou, Zhuosheng Zhang

Autonomous agents powered by multimodal large language models have been developed to facilitate task execution on mobile devices. However, prior work has predominantly focused on atomic tasks (such as shot-chain execution tasks and single-screen grounding tasks) while overlooking the generalization to compositional tasks, which are indispensable for real-world applications. This work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps. It comprises 100 interactive task templates with an average optimal step count of 14.05. Experimental results across a range of mobile agents with agentic workflow or agent-as-a-model show that UI-NEXUS presents significant challenges. Specifically, existing agents generally struggle to balance performance and efficiency, exhibiting representative failure modes such as under-execution, over-execution, and attention drift, resulting in a visible atomic-to-compositional generalization gap. Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient scheduling system to tackle compositional mobile tasks. AGENT-NEXUS extrapolates the abilities of existing mobile agents by dynamically decomposing long-horizon tasks into a series of self-contained atomic subtasks. AGENT-NEXUS achieves 24% to 40% task success rate improvement for existing mobile agents on compositional operation tasks within the UI-NEXUS benchmark without significantly increasing inference overhead. The demo video, dataset, and code are available on the project page at https://ui-nexus.github.io.


Article 364

Title@2025-06-10 (2): Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Title: Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model Schritt-Audio-AQAA: ein vollständig von Ende zu Ende ausdrucksstarkes großes Audio-Sprachenmodell 渐进-AQAAA:全端到端全端表达式大音频语言模型 2506.08967v1

Authors (76): Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang, Zixin Zhang, Bin Wang, Bo Li, Buyun Ma, Changxin Miao, Changyi Wan, Chen Xu, Dapeng Shi, Dingyuan Hu, Enle Liu, Guanzhe Huang, Gulin Yan, Hanpeng Hu, Haonan Jia, Jiahao Gong, Jiaoren Wu, Jie Wu, Jie Yang, Junzhe Lin, Kaixiang Li, Lei Xia, Longlong Gu, Ming Li, Nie Hao, Ranchen Ming, Shaoliang Pang, Siqi Liu, Song Yuan, Tiancheng Cao, Wen Li, Wenqing He, Xu Zhao, Xuelin Zhang, Yanbo Yu, Yinmin Zhong, Yu Zhou, Yuanwei Liang, Yuanwei Lu, Yuxiang Yang, Zidong Yang, Zili Zhang, Binxing Jiao, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Daxin Jiang, Shuchang Zhou, Chen Hu

Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of the token-based vocoder in enhancing overall performance on AQAA tasks.


Article 365

Title@2025-06-10 (2): Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers

Title: Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers Vortrainierte Sprachmodelle lernen bemerkenswert genaue Darstellungen von Zahlen 培训前语文模式的精确数字表示 2506.08966v1

Authors (5): Marek Kadlčík, Michal Štefánik, Timothee Mickus, Michal Spiegel, Josef Kuchař

Pretrained language models (LMs) are prone to arithmetic errors. Existing work showed limited success in probing numeric values from models' representations, indicating that these errors can be attributed to the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure of learned number embeddings, which exhibit sinusoidal patterns. In response, we propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy across a range of open-source LMs. This shows that after pre-training alone, LMs represent numbers with remarkable precision. Finally, we find that the precision of the embeddings, as judged by our probe's accuracy, explains a large portion of LMs' errors in elementary arithmetic, and show that aligning the embeddings with the pattern discovered by our probe can mitigate these errors.
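
As a toy illustration of why sinusoidal structure makes exact decoding possible (this is not the authors' probe), the sketch below encodes an integer as sine/cosine pairs and decodes it by picking the candidate whose features match best. The periods, the function names, and the candidate range are all assumptions made for the example; real learned embeddings need not use them.

```python
import math

# Hypothetical periods chosen for illustration only.
PERIODS = (10, 100, 1000)

def sinusoidal_features(n, periods=PERIODS):
    """Encode an integer as (sin, cos) pairs, one pair per period."""
    feats = []
    for p in periods:
        angle = 2 * math.pi * n / p
        feats.extend((math.sin(angle), math.cos(angle)))
    return feats

def decode(features, candidates=range(1000)):
    """Return the candidate whose features have the largest dot product with `features`."""
    def score(c):
        return sum(a * b for a, b in zip(features, sinusoidal_features(c)))
    return max(candidates, key=score)

recovered = decode(sinusoidal_features(437))
```

The dot product between the features of `n` and a candidate `c` equals the sum of `cos(2*pi*(n - c)/p)` over the periods, so within the range 0 to 999 it peaks only at `c == n`. A linear probe that ignores this periodic structure would have no comparable way to separate nearby values.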


Article 366

Title@2025-06-10 (2): AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation

Title: AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation AnnaAgent: Dynamisches Evolutions-Agentensystem mit Multi-Session-Speicher für realistische Suchersimulation AnnaAgent:动态进化剂系统,具有多会议记忆系统,用于模拟现实探索者模拟 2506.00551v2

Authors (9): Ming Wang, Peidong Wang, Lin Wu, Xiaocui Yang, Daling Wang, Shi Feng, Yuxin Chen, Bixuan Wang, Yifei Zhang

Constrained by the cost and ethical concerns of involving real seekers in AI-driven mental health, researchers develop LLM-based conversational agents (CAs) with tailored configurations, such as profiles, symptoms, and scenarios, to simulate seekers. While these efforts advance AI in mental health, achieving more realistic seeker simulation remains hindered by two key challenges: dynamic evolution and multi-session memory. Seekers’ mental states often fluctuate during counseling, which typically spans multiple sessions. To address this, we propose AnnaAgent, an emotional and cognitive dynamic agent system equipped with tertiary memory. AnnaAgent incorporates an emotion modulator and a complaint elicitor trained on real counseling dialogues, enabling dynamic control of the simulator’s configurations. Additionally, its tertiary memory mechanism effectively integrates short-term and long-term memory across sessions. Evaluation results, both automated and manual, demonstrate that AnnaAgent achieves more realistic seeker simulation in psychological counseling compared to existing baselines. The ethically reviewed and screened code can be found on https://github.com/sci-m-wang/AnnaAgent.


Article 367

Title@2025-06-10 (2): FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation

Title: FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation FaithfulRAG: Fact-Level-Konfliktmodellierung für kontexttreue retrieval-erweiterte Generation 忠实的拉戈:为本能的回溯性回溯性回溯性回溯性一代人建立实际冲突模式 2506.08938v1

Authors (7): Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, Jinsong Su

Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLM's parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the model's parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the model's parametric knowledge, which undermines the model's internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model's parametric knowledge and the retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/DeepLearnXMU/Faithful-RAG


Article 368

Title@2025-06-10 (2): Can A Gamer Train A Mathematical Reasoning Model?

Title: Can A Gamer Train A Mathematical Reasoning Model? Kann ein Gamer ein mathematisches Reasoning-Modell trainieren? 赌徒能训练数学理论模型吗? 2506.08935v1

Authors (1): Andrew Shin

While large language models (LLMs) have achieved remarkable performance in various tasks including mathematical reasoning, their development typically demands prohibitive computational resources. Recent advancements have reduced costs for training capable models, yet even these approaches rely on high-end hardware clusters. In this paper, we demonstrate that a single average gaming GPU can train a solid mathematical reasoning model by integrating reinforcement learning and memory optimization techniques. Specifically, on an RTX 3080 Ti with 16GB of memory, we train a 1.5B-parameter mathematical reasoning model that achieves comparable or better performance on mathematical reasoning benchmarks than models several times larger, in resource-constrained environments. Our results challenge the paradigm that state-of-the-art mathematical reasoning necessitates massive infrastructure, democratizing access to high-performance AI research. https://github.com/shinandrew/YouronMath.


Article 369

Title@2025-06-10 (2): Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions

Title: Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions Socratic-MCTS: Sichtvernunft in der Testzeit, indem man die richtigen Fragen stellt Socratic-MCTS:通过询问正确的问题进行试验时视觉理性 2506.08927v1

Authors (7): David Acuna, Ximing Lu, Jaehun Jung, Hyunwoo Kim, Amlan Kar, Sanja Fidler, Yejin Choi

Recent research in vision-language models (VLMs) has centered around the possibility of equipping them with implicit long-form chain-of-thought reasoning – akin to the success observed in language models – via distillation and reinforcement learning. But what about the non-reasoning models already trained and deployed across the internet? Should we simply abandon them, or is there hope for a search mechanism that can elicit hidden knowledge and induce long reasoning traces – without any additional training or supervision? In this paper, we explore this possibility using a Monte Carlo Tree Search (MCTS)-inspired algorithm, which injects subquestion-subanswer pairs into the model’s output stream. We show that framing reasoning as a search process – where subquestions act as latent decisions within a broader inference trajectory – helps the model “connect the dots” between fragmented knowledge and produce extended reasoning traces in non-reasoning models. We evaluate our method across three benchmarks and observe consistent improvements. Notably, our approach yields a 2% overall improvement on MMMU-PRO, including a significant 9% gain in Liberal Arts.


Article 370

Title@2025-06-10 (2): PropMEND: Hypernetworks for Knowledge Propagation in LLMs

Title: PropMEND: Hypernetworks for Knowledge Propagation in LLMs PropMEND: Hypernetworks für die Wissensverbreitung in LLMs PropMEND: 以LLMMs方式传播知识的超网络 2506.08920v1

Authors (3): Zeyu Leo Liu, Greg Durrett, Eunsol Choi

Knowledge editing techniques for large language models (LLMs) can inject knowledge that is later reproducible verbatim, but they fall short on propagating that knowledge: models cannot answer questions that require reasoning with the injected knowledge. We present a hypernetwork-based approach for knowledge propagation, named PropMEND, where we meta-learn how to modify gradients of a language modeling loss to encourage injected information to propagate. Our approach extends the meta-objective of MEND [29] so that gradient updates on knowledge are transformed to enable answering multi-hop questions involving that knowledge. We show improved performance on the RippleEdit dataset, with almost 2x accuracy on challenging multi-hop questions whose answers are not explicitly stated in the injected fact. We further introduce a new dataset, Controlled RippleEdit, to evaluate the generalization of our hypernetwork, testing knowledge propagation along relations and entities unseen during hypernetwork training. PropMEND still outperforms existing approaches on unseen entity-relation pairs, yet the performance gap decreases substantially, suggesting future work on propagating knowledge to a wide range of relations.


Article 371

Title@2025-06-10 (2): Dialect Normalization using Large Language Models and Morphological Rules

Title: Dialect Normalization using Large Language Models and Morphological Rules Dialect Normalization mit großen Sprachmodellen und morphologischen Regeln 使用大语言模式和人文规则实现正常化 2506.08907v1

Authors (3): Antonios Dimakis, John Pavlopoulos, Antonios Anastasopoulos

Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization attempts to tackle this issue by transforming dialectal text so that it can be used by standard-language tools downstream. In this study, we tackle this task by introducing a new normalization method that combines rule-based, linguistically informed transformations and large language models (LLMs) with targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it to a dataset of regional proverbs, evaluating the outputs using human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made through the remaining semantics.


Article 372

Title@2025-06-10 (2): Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection

Title: Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection Verbesserung der Automatisierten Abhandlungswertung in Arabisch mit synthetischen Daten und Fehlerinjektion 用合成数据和错误喷射加强阿拉伯文自动日志扫描 2503.17739v2

Authors (5): Chatrine Qwaider, Bashar Alhafni, Kirill Chirkunov, Nizar Habash, Ted Briscoe

Automated Essay Scoring (AES) plays a crucial role in assessing language learners’ writing quality, reducing grading workload, and providing real-time feedback. The lack of annotated essay datasets inhibits the development of Arabic AES systems. This paper leverages Large Language Models (LLMs) and Transformer models to generate synthetic Arabic essays for AES. We prompt an LLM to generate essays across the Common European Framework of Reference (CEFR) proficiency levels and introduce and compare two approaches to error injection. We create a dataset of 3,040 annotated essays with errors injected using our two methods. Additionally, we develop a BERT-based Arabic AES system calibrated to CEFR levels. Our experimental results demonstrate the effectiveness of our synthetic dataset in improving Arabic AES performance. We make our code and data publicly available.


Article 373

Title@2025-06-10 (2): A Decomposition-Based Approach for Evaluating and Analyzing Inter-Annotator Disagreement

Title: A Decomposition-Based Approach for Evaluating and Analyzing Inter-Annotator Disagreement Ein zersetzen-basierter Ansatz zur Bewertung und Analyse von Inter-Annotator-Uneinigkeit 一种基于分解的办法来评价和分析 2206.05446v2

Authors (2): Effi Levi, Shaul R. Shenhav

We propose a novel method to conceptually decompose an existing annotation into separate levels, allowing the analysis of inter-annotator disagreement at each level separately. We suggest two distinct strategies to actualize this approach: a theoretically driven one, in which the researcher defines a decomposition based on prior knowledge of the annotation task, and an exploration-based one, in which many possible decompositions are inductively computed and presented to the researcher for interpretation and evaluation. Using a recently constructed dataset for narrative analysis as our use case, we apply each of the two strategies to demonstrate the potential of our approach in testing hypotheses regarding the sources of annotation disagreements, as well as revealing latent structures and relations within the annotation task. We conclude by suggesting how to extend and generalize our approach, as well as how to use it for other purposes.
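
To make the idea of level-wise analysis concrete, here is a minimal sketch that treats each annotation as a tuple of level labels and computes an agreement score per level. The choice of Cohen's kappa, the two-level layout, and the example labels are assumptions for illustration; the abstract does not state which agreement measure or decomposition the authors use.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def kappa_per_level(ann_a, ann_b):
    """Each annotation is a tuple of level labels; return one kappa per level."""
    levels = len(ann_a[0])
    return [
        cohen_kappa([x[i] for x in ann_a], [y[i] for y in ann_b])
        for i in range(levels)
    ]

# Hypothetical two-level annotation: (is there a narrative?, which role?)
annotator_a = [("yes", "hero"), ("yes", "villain"), ("no", "none"), ("no", "none")]
annotator_b = [("yes", "villain"), ("yes", "villain"), ("no", "none"), ("no", "none")]
scores = kappa_per_level(annotator_a, annotator_b)
```

In this made-up example the annotators agree fully on the first level and only partly on the second, so a single score over the composite labels would hide where the disagreement comes from.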


Article 374

Title: From Legal Texts to Defeasible Deontic Logic via LLMs: A Study in Automated Semantic Analysis Von Rechtstexten zur defasiblen Deontic Logic über LLMs: Eine Studie in der automatisierten semantischen Analyse 从法律文本到通过LLMM从法律文本到防御性 Deontic Lologic: 自动语义分析研究 2506.08899v1

Authors (4): Elias Horner, Cristinel Mateis, Guido Governatori, Agata Ciabattoni

We present a novel approach to the automated semantic analysis of legal texts using large language models (LLMs), targeting their transformation into formal representations in Defeasible Deontic Logic (DDL). We propose a structured pipeline that segments complex normative language into atomic snippets, extracts deontic rules, and evaluates them for syntactic and semantic coherence. Our methodology is evaluated across various LLM configurations, including prompt engineering strategies, fine-tuned models, and multi-stage pipelines, focusing on legal norms from the Australian Telecommunications Consumer Protections Code. Empirical results demonstrate promising alignment between machine-generated and expert-crafted formalizations, showing that LLMs, particularly when prompted effectively, can significantly contribute to scalable legal informatics.


Article 375

Title@2025-06-10 (2): PlantBert: An Open Source Language Model for Plant Science

Title: PlantBert: An Open Source Language Model for Plant Science PlantBert: Ein Open Source Sprachmodell für die Pflanzenwissenschaft 植物生物:植物科学开放源语言模型 2506.08897v1

Authors (8): Hiba Khey, Amine Lakhder, Salma Rouichi, Imane El Ghabi, Kamal Hejjaoui, Younes En-nahli, Fahd Kalloubi, Moez Amri

The rapid advancement of transformer-based language models has catalyzed breakthroughs in biomedical and clinical natural language processing; however, plant science remains markedly underserved by such domain-adapted tools. In this work, we present PlantBert, a high-performance, open-source language model specifically tailored for extracting structured knowledge from plant stress-response literature. Built upon the DeBERTa architecture (known for its disentangled attention and robust contextual encoding), PlantBert is fine-tuned on a meticulously curated corpus of expert-annotated abstracts, with a primary focus on lentil (Lens culinaris) responses to diverse abiotic and biotic stressors. Our methodology combines transformer-based modeling with rule-enhanced linguistic post-processing and ontology-grounded entity normalization, enabling PlantBert to capture biologically meaningful relationships with precision and semantic fidelity. The underlying corpus is annotated using a hierarchical schema aligned with the Crop Ontology, encompassing molecular, physiological, biochemical, and agronomic dimensions of plant adaptation. PlantBert exhibits strong generalization capabilities across entity types and demonstrates the feasibility of robust domain adaptation in low-resource scientific fields. By providing a scalable and reproducible framework for high-resolution entity recognition, PlantBert bridges a critical gap in agricultural NLP and paves the way for intelligent, data-driven systems in plant genomics, phenomics, and agronomic knowledge discovery. Our model is publicly released to promote transparency and accelerate cross-disciplinary innovation in computational plant science.


Article 376

Title@2025-06-10 (2): Poro 34B and the Blessing of Multilinguality

Title: Poro 34B and the Blessing of Multilinguality Poro 34B und der Segen der Mehrsprachigkeit 波罗34B和多语言祝福 2404.01856v3

Authors (8): Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, Sampo Pyysalo

The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing: when the lack of training data is a constraint for effectively training larger models for a target language, augmenting the dataset with other languages can offer a way to improve over the capabilities of monolingual models for that language. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish and excels in translation, while also achieving competitive performance in its class for English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.


Article 377

Title@2025-06-10 (2): Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers

Title: Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers Zu groß zu denken: Kapazität, Erinnerung und Verallgemeinerung in vortrainierten Transformern 能力、记忆和在培训前变异器中的普及化 2506.09099v1

Authors (2): Joshua Barron, Devin White

The relationship between memorization and generalization in large language models (LLMs) remains an open area of research, with growing evidence that the two are deeply intertwined. In this work, we investigate this relationship by pre-training a series of capacity-limited Transformer models from scratch on two synthetic character-level tasks designed to separately probe generalization (via arithmetic extrapolation) and memorization (via factual recall). We observe a consistent trade-off: small models extrapolate to unseen arithmetic cases but fail to memorize facts, while larger models memorize but fail to extrapolate. An intermediate-capacity model exhibits a similar shift toward memorization. When trained on both tasks jointly, no model (regardless of size) succeeds at extrapolation. These findings suggest that pre-training may intrinsically favor one learning mode over the other. By isolating these dynamics in a controlled setting, our study offers insight into how model capacity shapes learning behavior and offers broader implications for the design and deployment of small language models.


Article 378

Title@2025-06-10 (2): Fusing Bidirectional Chains of Thought and Reward Mechanisms: A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage

Title: Fusing Bidirectional Chains of Thought and Reward Mechanisms: A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage Bidirektionale Ketten von Gedanken- und Belohnungsmechanismen zusammenführen: Eine Methode zur Verbesserung von Frage-Antwort-Fähigkeiten von großen Sprachmodellen für chinesisches immaterielles Kulturerbe 利用思想和奖赏机制的双向双向两向链 提高中国非物质文化遗产大语言模式的回答问题能力的方法 2505.08167v4

Authors (5): Ruilin Liu, Zhixiao Zhao, Jieqiong Li, Chang Liu, Dongbo Wang

The rapid development of large language models (LLMs) has provided significant support and opportunities for the advancement of domain-specific LLMs. However, fine-tuning these large models using Intangible Cultural Heritage (ICH) data inevitably faces challenges such as bias, incorrect knowledge inheritance, and catastrophic forgetting. To address these issues, we propose a novel training method that integrates bidirectional chains of thought and a reward mechanism. This method is built upon ICH-Qwen, a large language model specifically designed for the field of intangible cultural heritage. The proposed method enables the model not only to perform forward reasoning but also to enhance the accuracy of the generated answers by using reverse questioning and reverse reasoning to activate the model's latent knowledge. Additionally, a reward mechanism is introduced during training to optimize the decision-making process. This mechanism improves the quality of the model's outputs through structural and content evaluations with different weighting schemes. We conduct comparative experiments on ICH-Qwen, with results demonstrating that our method outperforms 0-shot, step-by-step reasoning, knowledge distillation, and question augmentation methods in terms of accuracy, BLEU-4, and ROUGE-L scores on the question-answering task. Furthermore, ablation experiments highlight the effectiveness of combining the bidirectional chains of thought with the reward mechanism. In addition, a series of generalizability experiments shows that the proposed method yields improvements on various domain-specific datasets and advanced models in areas such as Finance, Wikidata, and StrategyQA. This demonstrates that the method is adaptable to multiple domains and provides a valuable approach for model training in future applications across diverse fields.


Article 379

Title@2025-06-10 (2): Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia

Title: Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia Umgang mit Pitfalls bei der Prüfung von Praktiken automatischer Spracherkennungstechnologien: Eine Fallstudie von Menschen mit Aphasie 解决自动语音识别技术审计做法中的缺陷:阿法西亚人案例研究 2506.08846v1

Authors (5): Katelyn Xiaoying Mei, Anna Seo Gyeong Choi, Hilke Schellmann, Mona Sloane, Allison Koenecke

Automatic Speech Recognition (ASR) has transformed daily tasks from video transcription to workplace hiring. ASR systems' growing use warrants robust and standardized auditing approaches to ensure automated transcriptions of high and equitable quality. This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to navigate everyday life. In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems' performance for aphasia speakers. First, audits often adhere to a single method of text standardization during data pre-processing, which (a) masks variability in ASR performance from applying different standardization methods, and (b) may not be consistent with how users, especially those from marginalized speech communities, would want their transcriptions to be standardized. Second, audits often display high-level demographic findings without further considering performance disparities among (a) more nuanced demographic subgroups, and (b) relevant covariates capturing acoustic information from the input audio. Third, audits often rely on a single gold-standard metric, the Word Error Rate, which does not fully capture the extent of errors arising from generative AI models, such as transcription hallucinations. We propose a more holistic auditing framework that accounts for these three pitfalls, and exemplify its results in our case study, finding consistently worse ASR performance for aphasia speakers relative to a control group. We call on practitioners to implement these robust ASR auditing practices that remain flexible to the rapidly changing ASR landscape.
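
The first pitfall can be seen in a few lines of code. The sketch below computes Word Error Rate as word-level edit distance divided by reference length, then scores the same transcript pair with and without one possible standardization step. The `standardize` rule (lowercasing and stripping punctuation) and the example sentences are assumptions chosen for illustration, not the procedure used in the study.

```python
import re

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words.

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def standardize(text):
    """One hypothetical standardization: lowercase and strip punctuation."""
    return re.sub(r"[^\w\s']", "", text.lower())

reference = "I, um, want the book."
hypothesis = "i want the the book"

raw_wer = wer(reference, hypothesis)
standardized_wer = wer(standardize(reference), standardize(hypothesis))
```

Here the raw comparison counts casing and punctuation differences as errors, while the standardized comparison only counts the dropped filler word and the repeated word. A different rule, for example one that also removes fillers such as "um", would produce yet another score, which is why reporting results under a single standardization can hide real variability.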


Article 380

Title@2025-06-10 (2): Position: Editing Large Language Models Poses Serious Safety Risks

Title: Position: Editing Large Language Models Poses Serious Safety Risks Position: Bearbeiten von großen Sprachmodellen stellt ernste Sicherheitsrisiken dar 职位:编辑大语言模型 2502.02958v2

Authors (5): Paul Youssef, Zhixue Zhao, Daniel Braun, Jörg Schlötterer, Christin Seifert

Large Language Models (LLMs) contain large amounts of facts about the world. These facts can become outdated over time, which has led to the development of knowledge editing methods (KEs) that can change specific facts in LLMs with limited side effects. This position paper argues that editing LLMs poses serious safety risks that have been largely overlooked. First, we note that KEs are widely available, computationally inexpensive, highly performant, and stealthy, which makes them an attractive tool for malicious actors. Second, we discuss malicious use cases of KEs, showing how KEs can be easily adapted for a variety of malicious purposes. Third, we highlight vulnerabilities in the AI ecosystem that allow unrestricted uploading and downloading of updated models without verification. Fourth, we argue that a lack of social and institutional awareness exacerbates this risk, and discuss the implications for different stakeholders. We call on the community to (i) research tamper-resistant models and countermeasures against malicious model editing, and (ii) actively engage in securing the AI ecosystem.


Article 381

Title@2025-06-10 (2): Advancing STT for Low-Resource Real-World Speech

Title: Advancing STT for Low-Resource Real-World Speech Förderung der STT für die Real-World-Rede mit geringem Ressourcenbedarf 推进低资源现实世界演讲的STT 2506.08836v1

Authors (2): Flavio D’Intino, Hans-Peter Hutter

Swiss German is a low-resource language represented by diverse dialects that differ significantly from Standard German and from each other, lacking a standardized written form. As a result, transcribing Swiss German involves translating into Standard German. Existing datasets have been collected in controlled environments, yielding effective speech-to-text (STT) models, but these models struggle with spontaneous conversational speech. This paper, therefore, introduces the new SRB-300 dataset, a 300-hour annotated speech corpus featuring real-world long-audio recordings from 39 Swiss German radio and TV stations. It captures spontaneous speech across all major Swiss dialects recorded in various realistic environments and overcomes the limitation of prior sentence-level corpora. We fine-tuned multiple OpenAI Whisper models on the SRB-300 dataset, achieving notable enhancements over previous zero-shot performance metrics. Improvements in word error rate (WER) ranged from 19% to 33%, while BLEU scores increased between 8% and 40%. The best fine-tuned model, large-v3, achieved a WER of 17.1% and a BLEU score of 74.8. This advancement is crucial for developing effective and robust STT systems for Swiss German and other low-resource languages in real-world contexts.


Article 382

Title@2025-06-10 (2): CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

Title: CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics CulturalFrames: Bewertung der kulturellen Erwartungshaltung in Text-zu-Image-Modellen und Evaluationsmetrics 文化框架:在文本到图像模型和评价计量中评估文化期望一致 2506.08835v1

Authors (9): Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Stańczak, Aishwarya Agrawal

The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit and implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models fail to meet not only the more challenging implicit expectations but also the less challenging explicit ones. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.



Article 383

Title: The impact of fine tuning in LLaMA on hallucinations for named entity extraction in legal documentation Die Auswirkungen der Feinabstimmung in LLaMA auf Halluzinationen für benannte Entitätsextraktion in gesetzlicher Dokumentation LLamaMA的微调对法律文件中记名实体提取的幻觉的影响 2506.08827v1

Authors (5): Francisco Vargas, Alejandro González Coene, Gaston Escalante, Exequiel Lobón, Manuel Pulido

The extraction of information about traffic accidents from legal documents is crucial for quantifying insurance company costs. Extracting entities such as percentages of physical and/or psychological disability and the involved compensation amounts is a challenging process, even for experts, due to the subtle arguments and reasoning in the court decision. A two-step procedure is proposed: first, segmenting the document to identify the most relevant segments, and then extracting the entities. For text segmentation, two methodologies are compared: a classic method based on regular expressions and a second approach that divides the document into blocks of n tokens, which are then vectorized using multilingual models for semantic searches (text-embedding-ada-002/MiniLM-L12-v2). Subsequently, large language models (LLaMA-2 7b, 70b, LLaMA-3 8b, and GPT-4 Turbo) are applied with prompting to the selected segments for entity extraction. For the LLaMA models, fine-tuning is performed using LoRA. LLaMA-2 7b, even with zero temperature, shows a significant number of hallucinations in extractions, which are an important contention point for named entity extraction. This work shows that these hallucinations are substantially reduced after fine-tuning the model. The performance of the methodology based on segment vectorization and subsequent use of LLMs significantly surpasses the classic method, which achieves an accuracy of 39.5%. Among open-source models, LLaMA-2 70B with fine-tuning achieves the highest accuracy, 79.4%, surpassing its base version (61.7%). Notably, the base LLaMA-3 8B model already performs comparably to the fine-tuned LLaMA-2 70B model, achieving 76.6%, highlighting the rapid progress in model development. Meanwhile, GPT-4 Turbo achieves the highest accuracy overall at 86.1%.



Article 384

Title@2025-06-10 (2): Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study

Title: Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study Cross-lingual Transfer in Programmiersprachen: Eine umfangreiche empirische Studie 方案语言的跨语言传译:广泛经验研究 2310.16937v3

Authors (6): Razan Baltaji, Saurabh Pujar, Louis Mandel, Martin Hirzel, Luca Buratti, Lav Varshney

Large language models (LLMs) have achieved state-of-the-art performance in various software engineering tasks, including error detection, clone detection, and code translation, primarily leveraging high-resource programming languages like Python and Java. However, many critical languages, such as COBOL, as well as emerging languages, such as Rust and Swift, remain low-resource due to limited openly available code. This scarcity hampers the training and effectiveness of LLMs for these languages, increasing software maintenance costs and stifling innovation. Addressing this gap, we investigate the potential of transfer learning to enhance LLM performance on low-resource programming languages by leveraging data from high-resource counterparts. Our extensive empirical study evaluates transferability across 10 to 41 programming languages and five key tasks: code generation, clone detection, code repair, solution domain classification, and error detection. Additionally, we develop a performance prediction model to guess the best source languages for a given target and task, and analyze the features that influence transfer performance. We further replicate a representative subset of experiments with a larger model to test the generalizability of our conclusions to contemporary large-scale LLMs. Our findings demonstrate that cross-lingual transfer significantly outperforms zero-shot learning, with effectiveness varying based on both source and target languages. Furthermore, our model reliably predicts successful transfer sources by considering linguistic and dataset-specific features, offering practical guidance for data acquisition and model training. This work contributes to the development of LLM-driven tools for low-resource programming languages and provides insights into the characteristics that facilitate transfer across language pairs.



Article 385

Title@2025-06-10 (2): Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents

Title: Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents Messen von Data Science Automation: Eine Umfrage an Evaluationstools für KI-Assistenten und -Agenten 衡量数据科学自动化:AI助理和代理人评价工具调查 2506.08800v1

Authors (3): Irene Testini, José Hernández-Orallo, Lorenzo Pacchiardi

Data science aims to extract insights from data to support decision-making processes. Recently, Large Language Models (LLMs) have been increasingly used as assistants for data science, suggesting ideas, techniques and small code snippets, or helping with the interpretation of results and reporting. Proper automation of some data-science activities is now promised by the rise of LLM agents, i.e., AI systems powered by an LLM equipped with additional affordances (such as code execution and knowledge bases) that can perform self-directed actions and interact with digital environments. In this paper, we survey the evaluation of LLM assistants and agents for data science. We find (1) a dominant focus on a small subset of goal-oriented activities, largely ignoring data management and exploratory activities; (2) a concentration on pure assistance or fully autonomous agents, without considering intermediate levels of human-AI collaboration; and (3) an emphasis on human substitution, therefore neglecting the possibility of higher levels of automation thanks to task transformation.



Article 386

Title@2025-06-10 (2): Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

Title: Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO Eintauchen in RL für Bildgenerierung mit CoT: Eine Studie über DPO vs. GRPO 与COT一起为图像生成而跳入RL:关于DPO与GROP公司的研究 2505.17017v2

Authors (8): Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, Pheng-Ann Heng

Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at https://github.com/ZiyuGuo99/Image-Generation-CoT



Article 387

Title@2025-06-10 (2): In Praise of Stubbornness: An Empirical Case for Cognitive-Dissonance Aware Continual Update of Knowledge in LLMs

Title: In Praise of Stubbornness: An Empirical Case for Cognitive-Dissonance Aware Continual Update of Knowledge in LLMs In Lob der Stubbornness: Ein empirischer Fall für Kognitiv-Dissonanz Aware Continual Update of Knowledge in LLMs 赞美固态:关于知识在LLM中不断更新的认知与不相识经验案例 2502.04390v2

Authors (6): Simone Clemente, Zied Ben Houidi, Alexis Huet, Dario Rossi, Giulio Franzese, Pietro Michiardi

Through systematic empirical investigation, we uncover a fundamental and concerning property of Large Language Models: while they can safely learn facts that don’t contradict their knowledge, attempting to update facts with contradictory information triggers catastrophic corruption of unrelated knowledge. Unlike humans, who naturally resist contradictory information, these models indiscriminately accept contradictions, leading to devastating interference, destroying up to 80% of unrelated knowledge even when learning as few as 10-100 contradicting facts. To understand whether this interference could be mitigated through selective plasticity, we experiment with targeted network updates, distinguishing between previously used (stubborn) and rarely used (plastic) neurons. We uncover another asymmetry: while sparing frequently used neurons significantly improves retention of existing knowledge for non-contradictory updates (98% vs 93% with standard updates), contradictory updates trigger catastrophic interference regardless of targeting strategy. This effect, which persists across tested model scales (GPT-2 to GPT-J-6B), suggests a fundamental limitation in how neural networks handle contradictions. Finally, we demonstrate that contradictory information can be reliably detected (95%+ accuracy) using simple model features, offering a potential protective mechanism. These findings motivate new architectures that can, like humans, naturally resist contradictions rather than allowing destructive overwrites.



Article 388

Title@2025-06-10 (2): Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval

Title: Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval Optimierte Text-Embedding-Modelle und Benchmarks für die Amharische Passage Retrieval 阿姆光通过通过检索的最佳文本嵌入模型和基准 2505.19356v2

Authors (3): Kidist Amde Mekonnen, Yosef Worku Alemneh, Maarten de Rijke

Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models at https://github.com/kidist-amde/amharic-ir-benchmarks.



Article 389

Title@2025-06-10 (2): Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch

Title: Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch Silencing Empowerment, Allowing Bigotry: Auditierung der Moderation von Hass Rede auf Twitch 静默赋权,允许偏见:审计对抽搐者仇恨言论的温和程度 2506.07667v2

Authors (6): Prarabdh Shukla, Wei Yin Chong, Yash Patel, Brennan Schaffner, Danish Pruthi, Arjun Bhagoji

To meet the demands of content moderation, online platforms have resorted to automated systems. Newer forms of real-time engagement (e.g., users commenting on live streams) on platforms like Twitch exert additional pressures on the latency expected of such moderation systems. Despite their prevalence, relatively little is known about the effectiveness of these systems. In this paper, we conduct an audit of Twitch’s automated moderation tool (AutoMod) to investigate its effectiveness in flagging hateful content. For our audit, we create streaming accounts to act as siloed test beds, and interface with the live chat using Twitch’s APIs to send over 107,000 comments collated from 4 datasets. We measure AutoMod’s accuracy in flagging blatantly hateful content containing misogyny, racism, ableism and homophobia. Our experiments reveal that a large fraction of hateful messages, up to 94% on some datasets, bypass moderation. Contextual addition of slurs to these messages results in 100% removal, revealing AutoMod’s reliance on slurs as a moderation signal. We also find that contrary to Twitch’s community guidelines, AutoMod blocks up to 89.5% of benign examples that use sensitive words in pedagogical or empowering contexts. Overall, our audit points to large gaps in AutoMod’s capabilities and underscores the importance for such systems to understand context effectively.



Article 390

Title@2025-06-10 (2): TextAtari: 100K Frames Game Playing with Language Agents

Title: TextAtari: 100K Frames Game Playing with Language Agents TextAtari: 100K Frames Spiel mit Sprachagenten TextAtari: 100K 框架游戏与语言代理游戏 2506.04098v2

Authors (11): Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, Bo Jin

We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios (Basic, Obscured, Manual Augmentation, and Reference-based) investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning. Our code is available at https://github.com/Lww007/Text-Atari-Agents.



Article 391

Title@2025-06-10 (2): Paths to Causality: Finding Informative Subgraphs Within Knowledge Graphs for Knowledge-Based Causal Discovery

Title: Paths to Causality: Finding Informative Subgraphs Within Knowledge Graphs for Knowledge-Based Causal Discovery Pfade zur Kausalität: Informative Subgraphen innerhalb von Wissensgraphen finden für wissensbasierte Kausale Entdeckung 因果路径:在知识型因果发现的知识图中查找信息子集 2506.08771v1

Authors (2): Yuni Susanti, Michael Färber

Inferring causal relationships between variable pairs is crucial for understanding multivariate interactions in complex systems. Knowledge-based causal discovery – which involves inferring causal relationships by reasoning over the metadata of variables (e.g., names or textual context) – offers a compelling alternative to traditional methods that rely on observational data. However, existing methods using Large Language Models (LLMs) often produce unstable and inconsistent results, compromising their reliability for causal inference. To address this, we introduce a novel approach that integrates Knowledge Graphs (KGs) with LLMs to enhance knowledge-based causal discovery. Our approach identifies informative metapath-based subgraphs within KGs and further refines the selection of these subgraphs using Learning-to-Rank-based models. The top-ranked subgraphs are then incorporated into zero-shot prompts, improving the effectiveness of LLMs in inferring the causal relationship. Extensive experiments on biomedical and open-domain datasets demonstrate that our method outperforms most baselines by up to 44.4 points in F1 scores, evaluated across diverse LLMs and KGs. Our code and datasets are available on GitHub: https://github.com/susantiyuni/path-to-causality



Article 392

Title@2025-06-10 (2): Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models

Title: Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models Generative psycho-lexikalische Vorgehensweise für den Aufbau von Wertsystemen in großen Sprachmodellen 在大语言模式中建立价值体系的产生心理-临床方法 2502.02444v5

Authors (7): Haoran Ye, Tianze Zhang, Yuhang Xie, Liyuan Zhang, Yuanyi Ren, Xin Zhang, Guojie Song

Values are core drivers of individual and collective perception, cognition, and behavior. Value systems, such as Schwartz’s Theory of Basic Human Values, delineate the hierarchy and interplay among these values, enabling cross-disciplinary investigations into decision-making and societal dynamics. Recently, the rise of Large Language Models (LLMs) has raised concerns regarding their elusive intrinsic values. Despite growing efforts in evaluating, understanding, and aligning LLM values, a psychologically grounded LLM value system remains underexplored. This study addresses the gap by introducing the Generative Psycho-Lexical Approach (GPLA), a scalable, adaptable, and theoretically informed method for constructing value systems. Leveraging GPLA, we propose a psychologically grounded five-factor value system tailored for LLMs. For systematic validation, we present three benchmarking tasks that integrate psychological principles with cutting-edge AI priorities. Our results reveal that the proposed value system meets standard psychological criteria, better captures LLM values, improves LLM safety prediction, and enhances LLM alignment, when compared to the canonical Schwartz’s values.



Article 393

Title@2025-06-10 (2): EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements

Title: EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements EDINET-Bench: Bewertung von LLMs zu komplexen Finanzaufgaben anhand japanischer Finanzausweise EDINET-Bench:利用日本财务报表评估复杂金融任务LMs 2506.08762v1

Authors (7): Issa Sugiura, Takashi Ishida, Taro Makino, Chieko Tazuke, Takanori Nakagawa, Kosuke Nakago, David Ha

Financial analysis presents complex challenges that could leverage large language model (LLM) capabilities. However, the scarcity of challenging financial datasets, particularly for Japanese financial data, impedes academic innovation in financial analytics. As LLMs advance, this lack of accessible research resources increasingly hinders their development and evaluation in this specialized domain. To address this gap, we introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate the performance of LLMs on challenging financial tasks including accounting fraud detection, earnings forecasting, and industry prediction. EDINET-Bench is constructed by downloading annual reports from the past 10 years from Japan’s Electronic Disclosure for Investors’ NETwork (EDINET) and automatically assigning labels corresponding to each evaluation task. Our experiments reveal that even state-of-the-art LLMs struggle, performing only slightly better than logistic regression in binary classification for fraud detection and earnings forecasting. These results highlight significant challenges in applying LLMs to real-world financial applications and underscore the need for domain-specific adaptation. Our dataset, benchmark construction code, and evaluation code are publicly available to facilitate future research in finance with LLMs.



Article 394

Title@2025-06-10 (2): Enhancing Accuracy and Maintainability in Nuclear Plant Data Retrieval: A Function-Calling LLM Approach Over NL-to-SQL

Title: Enhancing Accuracy and Maintainability in Nuclear Plant Data Retrieval: A Function-Calling LLM Approach Over NL-to-SQL Verbesserung der Genauigkeit und Aufrechterhaltung des Datenrücklaufs von Kernkraftwerken: Ein funktionsorientierter LLM-Ansatz über NL-zu-SQL 提高核电厂数据检索的准确性和可维护性:NL-SQL的功能计算LLM法 2506.08757v1

Authors (5): Mishca de Costa, Muhammad Anwar, Dave Mercier, Mark Randall, Issam Hammad

Retrieving operational data from nuclear power plants requires exceptional accuracy and transparency due to the criticality of the decisions it supports. Traditionally, natural language to SQL (NL-to-SQL) approaches have been explored for querying such data. While NL-to-SQL promises ease of use, it poses significant risks: end-users cannot easily validate generated SQL queries, and legacy nuclear plant databases – often complex and poorly structured – complicate query generation due to decades of incremental modifications. These challenges increase the likelihood of inaccuracies and reduce trust in the approach. In this work, we propose an alternative paradigm: leveraging function-calling large language models (LLMs) to address these challenges. Instead of directly generating SQL queries, we define a set of pre-approved, purpose-specific functions representing common use cases. Queries are processed by invoking these functions, which encapsulate validated SQL logic. This hybrid approach mitigates the risks associated with direct NL-to-SQL translations by ensuring that SQL queries are reviewed and optimized by experts before deployment. While this strategy introduces the upfront cost of developing and maintaining the function library, we demonstrate how NL-to-SQL tools can assist in the initial generation of function code, allowing experts to focus on validation rather than creation. Our study includes a performance comparison between direct NL-to-SQL generation and the proposed function-based approach, highlighting improvements in accuracy and maintainability. This work underscores the importance of balancing user accessibility with operational safety and provides a novel, actionable framework for robust data retrieval in critical systems.
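
The function-calling pattern described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the table schema, the sample rows, the function name `average_reading`, and the shape of the model's call (`{"name": ..., "arguments": ...}`) are all hypothetical, chosen only to show how a model-selected call might be routed to pre-approved, parameterized SQL instead of free-form generated SQL.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings (unit TEXT, sensor TEXT, day TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?, ?)",
        [
            ("U1", "coolant_temp", "2025-06-01", 265.0),
            ("U1", "coolant_temp", "2025-06-02", 267.0),
            ("U2", "coolant_temp", "2025-06-01", 262.0),
        ],
    )
    return conn

# A "pre-approved" function: the SQL text is fixed and reviewed in advance;
# only the bound parameters vary from call to call.
def average_reading(conn, unit, sensor):
    row = conn.execute(
        "SELECT AVG(value) FROM readings WHERE unit = ? AND sensor = ?",
        (unit, sensor),
    ).fetchone()
    return row[0]

# Registry of functions the model is allowed to invoke.
APPROVED_FUNCTIONS = {"average_reading": average_reading}

def dispatch(conn, call):
    """Run a model-proposed call only if it names an approved function."""
    name = call["name"]
    if name not in APPROVED_FUNCTIONS:
        raise ValueError(f"function not approved: {name}")
    return APPROVED_FUNCTIONS[name](conn, **call["arguments"])

conn = make_demo_db()
# Stand-in for the structured call an LLM would emit for a question such as
# "What was the average coolant temperature on unit U1?"
model_call = {
    "name": "average_reading",
    "arguments": {"unit": "U1", "sensor": "coolant_temp"},
}
result = dispatch(conn, model_call)
```

In this sketch the model never writes SQL; it can only choose a function name and arguments, so an end user reviewing the call sees a small, readable request rather than a generated query.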



Article 395

Title@2025-06-10 (2): Factors affecting the in-context learning abilities of LLMs for dialogue state tracking

Title: Factors affecting the in-context learning abilities of LLMs for dialogue state tracking Faktoren, die die Lernfähigkeiten von LLM im Kontext des Dialogs für die Zustandsverfolgung beeinflussen 影响通过对话的LLMLM公司在对话方面的内在学习能力的因素 2506.08753v1

Authors (8): Pradyoth Hegde, Santosh Kesiraju, Jan Švec, Šimon Sedláček, Bolaji Yusuf, Oldřich Plchot, Deepak K T, Jan Černocký

This study explores the application of in-context learning (ICL) to the dialogue state tracking (DST) problem and investigates the factors that influence its effectiveness. We use a sentence-embedding-based k-nearest-neighbour method to retrieve suitable demonstrations for ICL. The selected demonstrations, along with the test samples, are structured within a template as input to the LLM. We then conduct a systematic study to analyse the impact of factors related to demonstration selection and prompt context on DST performance. This work is conducted using the MultiWoZ2.4 dataset and focuses primarily on the OLMo-7B-instruct, Mistral-7B-Instruct-v0.3, and Llama3.2-3B-Instruct models. Our findings provide several useful insights into the in-context learning abilities of LLMs for dialogue state tracking.
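
The retrieval step mentioned in this abstract (nearest-neighbour selection of demonstrations, then filling a prompt template) might look roughly like the sketch below. Everything concrete here is an assumption for illustration: the three-dimensional vectors stand in for real sentence embeddings, the example dialogues and slot strings are made up, and the template is a hypothetical one, not the one used in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_demonstrations(query_vec, pool, k):
    """Return the k pool items whose vectors are most similar to the query."""
    ranked = sorted(
        pool, key=lambda item: cosine(query_vec, item["vec"]), reverse=True
    )
    return ranked[:k]

def build_prompt(demos, test_utterance):
    """Place retrieved demonstrations before the test sample (hypothetical template)."""
    blocks = [f"Dialogue: {d['text']}\nState: {d['state']}" for d in demos]
    blocks.append(f"Dialogue: {test_utterance}\nState:")
    return "\n\n".join(blocks)

# Toy demonstration pool; "vec" stands in for a sentence embedding.
pool = [
    {"text": "I need a cheap hotel in the north",
     "state": "hotel-price=cheap; hotel-area=north",
     "vec": [0.9, 0.1, 0.0]},
    {"text": "Book a taxi to the station",
     "state": "taxi-destination=station",
     "vec": [0.0, 0.2, 0.9]},
    {"text": "Find an expensive hotel downtown",
     "state": "hotel-price=expensive; hotel-area=centre",
     "vec": [0.8, 0.3, 0.1]},
]

query_vec = [1.0, 0.2, 0.0]  # assumed embedding of the test utterance
demos = retrieve_demonstrations(query_vec, pool, k=2)
prompt = build_prompt(demos, "Looking for a cheap hotel")
```

With these toy vectors the two hotel examples are retrieved ahead of the taxi example, which is the behaviour one would expect from similarity-based demonstration selection.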



Article 396

Title@2025-06-10 (2): Big Help or Big Brother? Auditing Tracking, Profiling, and Personalization in Generative AI Assistants

Title: Big Help or Big Brother? Auditing Tracking, Profiling, and Personalization in Generative AI Assistants Große Hilfe oder großer Bruder? Auditierung von Tracking, Profilierung und Personalisierung in Generativen KI-Assistenten 大帮还是大大哥? 2503.16586v2

Authors (7): Yash Vekaria, Aurelio Loris Canino, Jonathan Levitsky, Alex Ciechonski, Patricia Callejo, Anna Maria Mandalari, Zubair Shafiq

Generative AI (GenAI) browser assistants integrate powerful capabilities of GenAI in web browsers to provide rich experiences such as question answering, content summarization, and agentic navigation. These assistants, available today as browser extensions, can not only track detailed browsing activity such as search and click data, but can also autonomously perform tasks such as filling forms, raising significant privacy concerns. It is crucial to understand the design and operation of GenAI browser extensions, including how they collect, store, process, and share user data. To this end, we study their ability to profile users and personalize their responses based on explicit or inferred demographic attributes and interests of users. We perform network traffic analysis and use a novel prompting framework to audit tracking, profiling, and personalization by the ten most popular GenAI browser assistant extensions. We find that instead of relying on local in-browser models, these assistants largely depend on server-side APIs, which can be auto-invoked without explicit user interaction. When invoked, they collect and share webpage content, often the full HTML DOM and sometimes even the user’s form inputs, with their first-party servers. Some assistants also share identifiers and user prompts with third-party trackers such as Google Analytics. The collection and sharing continue even if a webpage contains sensitive information such as health data, or personal information such as a name or SSN entered in a web form. We find that several GenAI browser assistants infer demographic attributes such as age, gender, income, and interests and use this profile (which carries across browsing contexts) to personalize responses. In summary, our work shows that GenAI browser assistants can and do collect personal and sensitive information for profiling and personalization with little to no safeguards.



Article 397

Title@2025-06-10 (2): Unlocking the Potential of Large Language Models in the Nuclear Industry with Synthetic Data

Title: Unlocking the Potential of Large Language Models in the Nuclear Industry with Synthetic Data Entsperren des Potenzials großer Sprachmodelle in der Kernindustrie mit synthetischen Daten 利用合成数据释放核工业大语言模型的潜力 2506.08750v1

Authors (4): Muhammad Anwar, Daniel Lau, Mishca de Costa, Issam Hammad

The nuclear industry possesses a wealth of valuable information locked away in unstructured text data. This data, however, is not readily usable for advanced Large Language Model (LLM) applications that require clean, structured question-answer pairs for tasks like model training, fine-tuning, and evaluation. This paper explores how synthetic data generation can bridge this gap, enabling the development of robust LLMs for the nuclear domain. We discuss the challenges of data scarcity and privacy concerns inherent in the nuclear industry and how synthetic data provides a solution by transforming existing text data into usable Q&A pairs. This approach leverages LLMs to analyze text, extract key information, generate relevant questions, and evaluate the quality of the resulting synthetic dataset. By unlocking the potential of LLMs in the nuclear industry, synthetic data can pave the way for improved information retrieval, enhanced knowledge sharing, and more informed decision-making in this critical sector.



Article 398

Title@2025-06-10 (2): Towards Secure and Private Language Models for Nuclear Power Plants

Title: Towards Secure and Private Language Models for Nuclear Power Plants Auf dem Weg zu sicheren und privaten Sprachmodellen für Kernkraftwerke 争取实现核电厂安全和私用语言模式 2506.08746v1

Authors (4): Muhammad Anwar, Mishca de Costa, Issam Hammad, Daniel Lau

This paper introduces a domain-specific Large Language Model for nuclear applications, built from the publicly accessible Essential CANDU textbook. Drawing on a compact Transformer-based architecture, the model is trained on a single GPU to protect the sensitive data inherent in nuclear operations. Despite relying on a relatively small dataset, it shows encouraging signs of capturing specialized nuclear vocabulary, though the generated text sometimes lacks syntactic coherence. By focusing exclusively on nuclear content, this approach demonstrates the feasibility of in-house LLM solutions that align with rigorous cybersecurity and data confidentiality standards. Early successes in text generation underscore the model’s utility for specialized tasks, while also revealing the need for richer corpora, more sophisticated preprocessing, and instruction fine-tuning to enhance domain accuracy. Future directions include extending the dataset to cover diverse nuclear subtopics, refining tokenization to reduce noise, and systematically evaluating the model’s readiness for real-world applications in the nuclear domain.



Article 399

Title@2025-06-10 (2): Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning

Title: Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning Konsequente Pfade führen zur Wahrheit: Selbstvergeltendes Stärkungslernen für LLM-Vernünftigkeit 通向真相的一致道路:为LLM理由自评加强学习 2506.08745v1

Authors (8): Kongcheng Zhang, Qi Yao, Shunyu Liu, Yingjie Wang, Baisheng Lai, Jieping Ye, Mingli Song, Dacheng Tao

Recent advances in Reinforcement Learning (RL) have highlighted its potential in complex reasoning tasks, yet effective training often relies on external supervision, which limits its broader applicability. In this work, we propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different reasoning trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers (high consistency) with minimal deviation toward other candidates (low volatility). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform RL in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL. Our code is available at https://github.com/sastpg/CoVo.


Article 400

Title@2025-06-10 (2): Iterative Corpus Refinement for Materials Property Prediction Based on Scientific Texts

Title: Iterative Corpus Refinement for Materials Property Prediction Based on Scientific Texts Iterative Corpus-Verfeinerung für Material-Eigenschaftsvorhersage auf der Grundlage wissenschaftlicher Texte 以科学文本为基础的材料财产预测材料性迭代公司精炼 2505.21646v2

Authors (2): Lei Zhang, Markus Stricker

The discovery and optimization of materials for specific applications is hampered by the practically infinite number of possible elemental combinations and associated properties, also known as the 'combinatorial explosion'. By nature of the problem, data are scarce and all possible data sources should be used. In addition to simulations and experimental results, the latent knowledge in scientific texts is not yet used to its full potential. We present an iterative framework that refines a given scientific corpus by strategic selection of the most diverse documents, training Word2Vec models, and monitoring the convergence of composition-property correlations in embedding space. Our approach is applied to predict high-performing materials for oxygen reduction (ORR), hydrogen evolution (HER), and oxygen evolution (OER) reactions for a large number of possible candidate compositions. Our method successfully predicts the highest performing compositions among a large pool of candidates, validated by experimental measurements of the electrocatalytic performance in the lab. This work demonstrates and validates the potential of iterative corpus refinement to accelerate materials discovery and optimization, offering a scalable and efficient tool for screening large compositional spaces where reliable data are scarce or non-existent.


Article 401

Title@2025-06-10 (2): Improved LLM Agents for Financial Document Question Answering

Title: Improved LLM Agents for Financial Document Question Answering Verbesserte LLM-Agenten für die Beantwortung von Fragen zu Finanzdokumenten 改进财务报告问题回答问题LLM代理 2506.08726v1

Authors (6): Nelvin Tan, Zian Seng, Liang Zhang, Yu-Ching Shih, Dong Yang, Amol Salunkhe

Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have shown the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and shows, through experiments, that this critic agent’s performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with a calculator agent that outperforms the previous state-of-the-art approach (program-of-thought) and is safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.


Article 402

Title@2025-06-10 (2): VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?

Title: VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs? VIST-GPT: Sich in der Ära des Visual Storytelling mit LLMs niederlassen? 使用LLMS的视觉故事时代? 2504.19267v3

Authors (5): Mohamed Gado, Towhid Taliee, Muhammad Memon, Dmitry Ignatov, Radu Timofte

Visual storytelling is an interdisciplinary field combining computer vision and natural language processing to generate cohesive narratives from sequences of images. This paper presents a novel approach that leverages recent advancements in multimodal models, specifically adapting transformer-based architectures and large multimodal models, for the visual storytelling task. Leveraging the large-scale Visual Storytelling (VIST) dataset, our VIST-GPT model produces visually grounded, contextually appropriate narratives. We address the limitations of traditional evaluation metrics, such as BLEU, METEOR, ROUGE, and CIDEr, which are not suitable for this task. Instead, we utilize RoViST and GROOVIST, novel reference-free metrics designed to assess visual storytelling, focusing on visual grounding, coherence, and non-redundancy. These metrics provide a more nuanced evaluation of narrative quality, aligning closely with human judgment.


Article 403

Title@2025-06-10 (2): How Do Multilingual Language Models Remember Facts?

Title: How Do Multilingual Language Models Remember Facts? Wie erinnern sich Mehrsprachige Sprachmodelle an Fakten? 多语言模式如何记住事实? 2410.14387v3

Authors (4): Constanza Fierro, Negar Foroutan, Desmond Elliott, Anders Søgaard

Large Language Models (LLMs) store and retrieve vast amounts of factual knowledge acquired during pre-training. Prior research has localized and identified mechanisms behind knowledge recall; however, it has only focused on English monolingual models. The question of how these mechanisms generalize to non-English languages and multilingual LLMs remains unexplored. In this paper, we address this gap by conducting a comprehensive analysis of three multilingual LLMs. First, we show that previously identified recall mechanisms in English largely apply to multilingual contexts, with nuances based on language and architecture. Next, through patching intermediate representations, we localize the role of language during recall, finding that subject enrichment is language-independent, while object extraction is language-dependent. Additionally, we discover that the last token representation acts as a Function Vector (FV), encoding both the language of the query and the content to be extracted from the subject. Furthermore, in decoder-only LLMs, FVs compose these two pieces of information in two separate stages. These insights reveal unique mechanisms in multilingual LLMs for recalling information, highlighting the need for new methodologies – such as knowledge evaluation, fact editing, and knowledge acquisition – that are specifically tailored for multilingual LLMs.


Article 404

Title@2025-06-10 (2): Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion Recognition

Title: Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion Recognition Multi-Lehrer Sprachkenntnisse Destillation für mehrsprachige Sprachemotionserkennung 多语种语言-语言知识蒸馏促进多语种语言语音情感识别 2506.08717v1

Authors (4): Mehedi Hasan Bijoy, Dejan Porjazovski, Tamás Grósz, Mikko Kurimo

Speech Emotion Recognition (SER) is crucial for improving human-computer interaction. Despite strides in monolingual SER, extending them to build a multilingual system remains challenging. Our goal is to train a single model capable of multilingual SER by distilling knowledge from multiple teacher models. To address this, we introduce a novel language-aware multi-teacher knowledge distillation method to advance SER in English, Finnish, and French. It leverages Wav2Vec2.0 as the foundation of monolingual teacher models and then distills their knowledge into a single multilingual student model. The student model demonstrates state-of-the-art performance, with a weighted recall of 72.9 on the English dataset and an unweighted recall of 63.4 on the Finnish dataset, surpassing fine-tuning and knowledge distillation baselines. Our method excels in improving recall for sad and neutral emotions, although it still faces challenges in recognizing anger and happiness.
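The abstract reports one weighted and one unweighted recall figure, and the two are easy to confuse. The sketch below assumes the usual SER definitions (unweighted recall is the plain mean of per-class recalls; weighted recall weights each class by its support, which makes it equal to overall accuracy). The labels are toy data chosen for illustration, not results from the paper.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return ({class: recall}, {class: support}) for the classes in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {c: hits[c] / support[c] for c in support}
    return recalls, support

def unweighted_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally."""
    recalls, _ = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

def weighted_recall(y_true, y_pred):
    """Per-class recalls weighted by class support."""
    recalls, support = per_class_recall(y_true, y_pred)
    total = sum(support.values())
    return sum(recalls[c] * support[c] for c in recalls) / total

# Toy, imbalanced example (hypothetical labels).
y_true = ["sad", "sad", "sad", "sad", "anger", "neutral"]
y_pred = ["sad", "sad", "sad", "sad", "neutral", "neutral"]
uar = unweighted_recall(y_true, y_pred)
war = weighted_recall(y_true, y_pred)
```

On imbalanced data the two diverge: a model that ignores a rare class is penalized more heavily under the unweighted measure.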


Article 405

Title@2025-06-10 (2): Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Title: Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur 以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测 2506.08713v1

Authors (2): Fariz Ikhwantri, Dusica Marijan

Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.


Article 406

Title@2025-06-10 (2): Summarizing Speech: A Comprehensive Survey

Title: Summarizing Speech: A Comprehensive Survey Zusammenfassung der Rede: Eine umfassende Umfrage 总结发言:全面调查 2504.08024v2

Authors (7): Fabian Retkowski, Maike Züfle, Andreas Sudmann, Dinah Pfau, Shinji Watanabe, Jan Niehues, Alexander Waibel

Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. However, despite its increasing importance, speech summarization remains loosely defined. The field intersects with several research areas, including speech recognition, text summarization, and specific applications like meeting summarization. This survey not only examines existing datasets and evaluation protocols, which are crucial for assessing the quality of summarization approaches, but also synthesizes recent developments in the field, highlighting the shift from traditional systems to advanced models like fine-tuned cascaded architectures and end-to-end solutions. In doing so, we surface the ongoing challenges, such as the need for realistic evaluation benchmarks, multilingual datasets, and long-context handling.


Article 407

Title@2025-06-10 (2): Self-Training Elicits Concise Reasoning in Large Language Models

Title: Self-Training Elicits Concise Reasoning in Large Language Models Selbst-Trainings-Elizite schließen Vernunft in großen Sprachmodellen ein 大语言模式中的简明理由 2502.20122v3

Authors (6): Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun

Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence of their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning
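A minimal sketch of how best-of-N sampling could yield concise training targets. The abstract does not spell out the selection rule, so this assumes that the shortest sample whose final answer matches a reference is kept; `shortest_correct` is a hypothetical helper, and whitespace word count stands in for token count.

```python
def shortest_correct(samples, reference_answer):
    """Pick the shortest reasoning path whose answer matches the reference.

    samples: list of (reasoning_text, final_answer) pairs drawn from the model.
    Returns None when no sample is correct, so the question can be skipped.
    """
    correct = [s for s in samples if s[1] == reference_answer]
    if not correct:
        return None
    # Whitespace word count is an assumed proxy for output token count.
    return min(correct, key=lambda s: len(s[0].split()))

# Hypothetical samples for one question.
samples = [
    ("First add 2 and 2 , then double check the sum", "4"),
    ("2 + 2 = 4", "4"),
    ("5", "5"),
]
chosen = shortest_correct(samples, "4")
```

The selected pairs would then serve as fine-tuning data, which is where the self-training step comes in.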


Article 408

Title@2025-06-10 (2): Brevity is the soul of sustainability: Characterizing LLM response lengths

Title: Brevity is the soul of sustainability: Characterizing LLM response lengths Brevity ist die Seele der Nachhaltigkeit: Charakterisierende LLM-Responselängen 博利是可持续性的灵魂:确定LLM 反应长度 2506.08686v1

Authors (7): Soham Poddar, Paramita Koley, Janardan Misra, Sanjay Podder, Navveen Balani, Niloy Ganguly, Saptarshi Ghosh

A significant portion of the energy consumed by Large Language Models (LLMs) arises from their inference processes; hence developing energy-efficient methods for inference is crucial. While several techniques exist for inference optimization, output compression remains relatively unexplored, with only a few preliminary efforts addressing this aspect. In this work, we first benchmark 12 decoder-only LLMs across 5 datasets, revealing that these models often produce responses that are substantially longer than necessary. We then conduct a comprehensive quality assessment of LLM responses, formally defining six information categories present in LLM responses. We show that LLMs often tend to include redundant or additional information besides the minimal answer. To address this issue of long responses by LLMs, we explore several simple and intuitive prompt-engineering strategies. Empirical evaluation shows that appropriate prompts targeting length reduction and controlling information content can achieve significant energy optimization of 25-60% by reducing the response length while preserving the quality of LLM responses.


Article 409

Title@2025-06-10 (2): RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

Title: RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling RuleReasoner: Verstärkte regelbasiertes Reasoning via Domain-aware Dynamic Sampling 规则响应器:通过Domain-aware动态抽样加强基于规则的理由说明 2506.08672v1

Authors (3): Yang Liu, Jiaqi Li, Zilong Zheng

Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoning capabilities, and their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for pre-hoc human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin ($\Delta$4.1% average points on eight ID tasks and $\Delta$10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior dynamic sampling methods for RL.
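The resampling idea can be illustrated with a short sketch. The abstract says only that domain sampling weights are updated from historical rewards; the specific rule below (a softmax over `1 - reward`, so weaker domains are sampled more often) and the names `update_weights`, `sample_batch`, and `temperature` are assumptions for illustration, not the paper's formula.

```python
import math
import random

def update_weights(avg_reward, temperature=0.5):
    """Map per-domain historical rewards in [0, 1] to sampling weights.

    Assumed rule: domains with lower reward get higher weight.
    """
    scores = {d: math.exp((1.0 - r) / temperature) for d, r in avg_reward.items()}
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

def sample_batch(pools, weights, batch_size, rng):
    """Draw a training batch: pick a domain per slot, then an example from it."""
    domains = list(weights)
    picks = rng.choices(domains, weights=[weights[d] for d in domains], k=batch_size)
    return [rng.choice(pools[d]) for d in picks]

# Hypothetical domains, rewards, and example pools.
weights = update_weights({"logic": 0.9, "arithmetic": 0.3})
pools = {"logic": ["l1", "l2"], "arithmetic": ["a1", "a2", "a3"]}
batch = sample_batch(pools, weights, batch_size=8, rng=random.Random(0))
```

Recomputing the weights after each training step gives an online schedule without a hand-tuned mixing recipe.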


Article 410

Title@2025-06-10 (2): Compositional Causal Reasoning Evaluation in Language Models

Title: Compositional Causal Reasoning Evaluation in Language Models kompositorische Kausale Reasoning-Bewertung in Sprachmodellen 语文模式的构成原因原因原因评估 2503.04556v4

Authors (5): Jacqueline R. M. A. Maasch, Alihan Hüyük, Xinnuo Xu, Aditya V. Nori, Javier Gonzalez

Causal reasoning and compositional reasoning are two core aspirations in AI. Measuring the extent of these behaviors requires principled evaluation methods. We explore a unified perspective that considers both behaviors simultaneously, termed compositional causal reasoning (CCR): the ability to infer how causal measures compose and, equivalently, how causal quantities propagate through graphs. We instantiate a framework for the systematic evaluation of CCR for the average treatment effect and the probability of necessity and sufficiency. As proof of concept, we demonstrate CCR evaluation for language models in the Llama, Phi, and GPT families. On a math word problem, our framework revealed a range of taxonomically distinct error patterns. CCR errors increased with the complexity of causal paths for all models except o1.


Article 411

Title@2025-06-10 (2): Summarization for Generative Relation Extraction in the Microbiome Domain

Title: Summarization for Generative Relation Extraction in the Microbiome Domain Zusammenfassung für Generative Relation-Extraktion im Microbiome-Bereich 微生物群域生成关系采掘概要 2506.08647v1

Authors (3): Oumaima El Khettari, Solen Quiniou, Samuel Chaffron

We explore a generative relation extraction (RE) pipeline tailored to the study of interactions in the intestinal microbiome, a complex and low-resource biomedical domain. Our method leverages summarization with large language models (LLMs) to refine context before extracting relations via instruction-tuned generation. Preliminary results on a dedicated corpus show that summarization improves generative RE performance by reducing noise and guiding the model. However, BERT-based RE approaches still outperform generative models. This ongoing work demonstrates the potential of generative methods to support the study of specialized domains in low-resource settings.


Article 412

Title@2025-06-10 (2): TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning

Title: TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning TableDreamer: Progressive und schwache Datensynthese von Scratch für die Table Instruction Tuning 表Dreamer:Scratch为表格指示图纸提供的渐进和弱点制导数据合成 2506.08646v1

Authors (7): Mingyu Zheng, Zhifan Feng, Jia Wang, Lanrui Wang, Zheng Lin, Yang Hao, Weiping Wang

Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in table understanding ability of the target LLM and blindly pursue the increase of data quantity, resulting in suboptimal data efficiency. In this paper, we introduce a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, named TableDreamer, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of the newly identified weakness data, which eventually serve as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-instruct by 11.62 percentage points (49.07% to 60.69%) with 27K GPT-4o synthetic data and outperforms state-of-the-art data synthesis baselines which use more training data. The code and data are available at https://github.com/SpursGoZmy/TableDreamer


Article 413

Title@2025-06-10 (2): An Analysis of Hyper-Parameter Optimization Methods for Retrieval Augmented Generation

Title: An Analysis of Hyper-Parameter Optimization Methods for Retrieval Augmented Generation Eine Analyse von Hyperparameter-Optimierungsmethoden für die retrieval Augmented Generation 对回收增加的一代的超光子辐射仪最佳优化方法的分析 2505.03452v2

Authors (15): Matan Orbach, Ohad Eytan, Benjamin Sznajder, Ariel Gera, Odellia Boni, Yoav Kantor, Gal Bloch, Omri Levy, Hadas Abraham, Nitzan Barzilay, Eyal Shnarch, Michael E. Factor, Shila Ofek-Koifman, Paula Ta-Shma, Assaf Toledo

Finding the optimal Retrieval-Augmented Generation (RAG) configuration for a given use case can be complex and expensive. Motivated by this challenge, frameworks for RAG hyper-parameter optimization (HPO) have recently emerged, yet their effectiveness has not been rigorously benchmarked. To address this gap, we present a comprehensive study involving 5 HPO algorithms over 5 datasets from diverse domains, including a new one collected for this work on real-world product documentation. Our study explores the largest HPO search space considered to date, with three evaluation metrics as optimization targets. Analysis of the results shows that RAG HPO can be done efficiently, either greedily or with random search, and that it significantly boosts RAG performance for all datasets. For greedy HPO approaches, we show that optimizing model selection first is preferable to the prevalent practice of optimizing according to RAG pipeline order.
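Greedy HPO, as discussed above, tunes one hyperparameter at a time and freezes the best value before moving on, so the visiting order matters. The sketch below is a generic coordinate-wise search, not the authors' implementation; the parameter names, the search space, and the toy `evaluate` function are all hypothetical stand-ins for a real RAG evaluation run.

```python
def greedy_hpo(search_space, order, evaluate, defaults):
    """Optimize hyperparameters one at a time in the given order.

    search_space: {name: [candidate values]}
    order: sequence of names, e.g. model choice first or pipeline order.
    evaluate: callable(config dict) -> score, higher is better.
    """
    config = dict(defaults)
    for name in order:
        best_value, best_score = config[name], float("-inf")
        for value in search_space[name]:
            score = evaluate({**config, name: value})
            if score > best_score:
                best_value, best_score = value, score
        config[name] = best_value  # freeze the winner before the next parameter
    return config

# Hypothetical search space and a toy objective standing in for a RAG metric.
space = {"model": ["small", "large"], "chunk_size": [128, 256, 512], "top_k": [3, 5, 10]}
defaults = {"model": "small", "chunk_size": 128, "top_k": 3}

def evaluate(cfg):
    return (cfg["model"] == "large") - abs(cfg["chunk_size"] - 256) / 1000 - abs(cfg["top_k"] - 5) / 10

best = greedy_hpo(space, ["model", "chunk_size", "top_k"], evaluate, defaults)
```

Passing a different `order` (for example retrieval parameters before the generator model) is how the two orderings compared in the abstract would be expressed in this sketch.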


Article 414

Title@2025-06-10 (2): MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models

Title: MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models MEMEMETRON: 大语言模型试验时间反应优化的元机制 2506.08643v1

Authors (2): Son The Nguyen, Theja Tulabandhula

Large language models (LLMs) are increasingly used for both open-ended and structured tasks, yet their inference-time behavior is still largely dictated by heuristic decoding strategies such as greedy search, sampling, or reranking. These methods provide limited control and do not explicitly optimize for task-specific objectives. We introduce MEMETRON, a task-agnostic framework that formulates LLM decoding as a discrete black-box optimization problem. MEMETRON leverages hybrid metaheuristic algorithms, GENETRON and ANNETRON, to search the response space, guided by reward models and contextual operations performed by the LLM itself. This approach enables efficient discovery of high-reward responses without requiring model retraining or gradient access. The framework is modular and generalizes across diverse tasks, requiring only a reward function and lightweight prompt templates. We evaluate our framework on the critical human preference alignment task and demonstrate that it significantly outperforms standard decoding and reranking methods, highlighting its potential to improve alignment without model retraining.


Article 415

Title@2025-06-10 (2): Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence

Title: Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence Den Halluzinationscode in LVLMs mit Vision-aware Head Divergence knacken 以有远见和觉察力的正面差异在LVLMs中破坏《幻觉守则》 2412.13949v3

Authors (9): Jinghan He, Kuan Zhu, Haiyun Guo, Junfeng Fang, Zhenglin Hua, Yuheng Jia, Ming Tang, Tat-Seng Chua, Jinqiao Wang

Large vision-language models (LVLMs) have made substantial progress in integrating large language models (LLMs) with visual inputs, enabling advanced multimodal reasoning. Despite their success, a persistent challenge is hallucination, where generated text fails to accurately reflect visual content, undermining both accuracy and reliability. Existing methods focus on alignment training or decoding refinements but primarily address symptoms at the generation stage without probing the underlying causes. In this work, we investigate the internal mechanisms driving hallucination in LVLMs, with an emphasis on the multi-head attention module. Specifically, we introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. Based on this, our findings reveal the presence of vision-aware attention heads that are more attuned to visual information; however, the model’s overreliance on its prior language patterns is closely related to hallucinations. Building on these insights, we propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art approaches in mitigating hallucinations, while maintaining high efficiency with negligible additional time overhead.


Article 416

Title@2025-06-10 (2): NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Title: NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples NaturalBench: Bewertung von Vision-Sprachen-Modellen auf natürlichen Adversarialproben 自然区:评估自然对立抽样的远景-语言模型 2410.14669v4

Authors (10): Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan

Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a vision-centric design by pairing each question with two images that yield different answers, preventing blind solutions from answering without using the images. This makes NaturalBench more challenging than previous benchmarks that can be solved with commonsense priors. We evaluate 53 state-of-the-art VLMs on NaturalBench, showing that models like LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is hard from two angles: (1) Compositionality: Solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning like logic and counting. To this end, unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2) Biases: NaturalBench exposes severe biases in VLMs, as models often choose the same answer regardless of the image. Lastly, we apply our benchmark curation method to diverse data sources, including long captions (over 100 words) and non-English languages like Chinese and Hindi, highlighting its potential for dynamic evaluations of VLMs.


Article 417

Title@2025-06-10 (2): Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs

Title: Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs Annäherung an Dialog Staat Tracking über Aligning Speech Encoder und LLMs 通过对齐语音编码器和LLMs处理对话国家跟踪 2506.08633v1

Authors (7): Šimon Sedláček, Bolaji Yusuf, Ján Švec, Pradyoth Hegde, Santosh Kesiraju, Oldřich Plchot, Jan Černocký

In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, with a focus on fully open-sourced and open-data components (WavLM-large, OLMo). We focus on ablating different aspects of such systems including full/LoRA adapter fine-tuning, the effect of agent turns in the dialogue history, as well as fuzzy matching-based output post-processing, which greatly improves performance of our systems on named entities in the dialogue slot values. We conduct our experiments on the SpokenWOZ dataset, and additionally utilize the Speech-Aware MultiWOZ dataset to augment our training data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned models achieve state of the art on the SpokenWOZ test set (34.66% JGA), and our system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17% JGA on SpokenWOZ test.
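The fuzzy-matching post-processing step can be sketched with Python's standard `difflib`. This is an illustration under assumptions, not the authors' pipeline: `snap_to_ontology`, the candidate list, and the 0.8 cutoff are hypothetical, and the idea is simply to replace a slightly misspelled slot value with the closest known entity name.

```python
import difflib

def snap_to_ontology(value, candidates, cutoff=0.8):
    """Replace a predicted slot value with its closest known entity, if any.

    Matching is case-insensitive; the original value is returned unchanged
    when nothing is similar enough (similarity below `cutoff`).
    """
    lowered = {c.lower(): c for c in candidates}
    match = difflib.get_close_matches(value.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[match[0]] if match else value

# Hypothetical ontology of named entities for a hotel-name slot.
hotels = ["Hilton Garden Inn", "Holiday Inn"]
fixed = snap_to_ontology("hilton garden in", hotels)
untouched = snap_to_ontology("xyz", hotels)
```

A higher cutoff makes the correction more conservative, which matters because joint goal accuracy (JGA) counts a turn as correct only when every slot value matches.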


Article 418

Title@2025-06-10 (2): RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval

Title: RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval RAISE: Verbesserung der wissenschaftlichen Vernunft in LLMs durch Schritt-für-Schritt-Retrieval RAISE:通过一步步检索加强LLMs的科学理由 2506.08625v1

Authors (6): Minhae Oh, Jeonghye Kim, Nakyung Lee, Donggeon Seo, Taeuk Kim, Jungwoo Lee

Scientific reasoning requires not only long-chain reasoning processes, but also knowledge of domain-specific terminologies and adaptation to updated findings. To deal with these challenges for scientific reasoning, we introduce RAISE, a step-by-step retrieval-augmented framework which retrieves logically relevant documents from an in-the-wild corpus. RAISE is divided into three steps: problem decomposition, logical query generation, and logical retrieval. We observe that RAISE consistently outperforms other baselines on scientific reasoning benchmarks. Our analysis shows that, unlike other baselines, RAISE retrieves documents that are not only similar in terms of domain knowledge, but also more logically relevant.


Article 419

Title@2025-06-10 (2): Evaluation empirique de la sécurisation et de l’alignement de ChatGPT et Gemini: analyse comparative des vulnérabilités par expérimentations de jailbreaks

Title: Evaluation empirique de la sécurisation et de l’alignement de ChatGPT et Gemini: analyse comparative des vulnérabilités par expérimentations de jailbreaks Bewertung empirique de la sécurisation et de l’alignement de ChatGPT et Gemini: Analyse der vergleichenden des vulnérabilités par expérimentations de jailbreaks 安全和治安评估:对监狱破狱情况进行对比分析 2506.10029v1

Authors (1): Rafaël Nouailles

Large language models (LLMs) are transforming digital usage, particularly in text generation, image creation, information retrieval and code development. ChatGPT, launched by OpenAI in November 2022, quickly became a reference, prompting the emergence of competitors such as Google’s Gemini. However, these technological advances raise new cybersecurity challenges, including prompt injection attacks, the circumvention of regulatory measures (jailbreaking), the spread of misinformation (hallucinations) and risks associated with deep fakes. This paper presents a comparative analysis of the security and alignment levels of ChatGPT and Gemini, as well as a taxonomy of jailbreak techniques associated with experiments.


Article 420

Title@2025-06-10 (2): Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models

Title: Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models Hassvolle Person oder Hassvolles Modell? Untersuchung der Rolle der Personas bei der Hassredenerkennung durch große Sprachmodelle 《调查个人在用大语言模式探测仇恨言论中的作用》 2506.08593v1

Authors (6): Shuzhou Yuan, Ercong Nie, Mario Tawfelis, Helmut Schmid, Hinrich Schütze, Michael Färber

Hate speech detection is a socially sensitive and inherently subjective task, with judgments often varying based on personal traits. While prior work has examined how socio-demographic factors influence annotation, the impact of personality traits on Large Language Models (LLMs) remains largely unexplored. In this paper, we present the first comprehensive study on the role of persona prompts in hate speech classification, focusing on MBTI-based traits. A human annotation survey confirms that MBTI dimensions significantly affect labeling behavior. Extending this to LLMs, we prompt four open-source models with MBTI personas and evaluate their outputs across three hate speech datasets. Our analysis uncovers substantial persona-driven variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. These findings highlight the need to carefully define persona prompts in LLM-based annotation workflows, with implications for fairness and alignment with human values.


Article 421

Title@2025-06-10 (2): Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings

Title: Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings Dichte Retriever können bei einfachen Abfragen scheitern: Enthüllen des Granularitäts-Dilemmas der Einbettungen 简单查询: 读取嵌入器的颗粒性三角形 2506.08592v1

Authors (6): Liyan Xu, Zhenlin Su, Mo Yu, Jiangnan Li, Fandong Meng, Jie Zhou

This work focuses on an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within the semantics, resulting in failed dense retrieval even on simple cases. To examine such behaviors, we first introduce a new evaluation dataset in Chinese, named CapRetrieval, whose passages are image captions and whose queries are phrases inquiring about entities or events in various forms. Zero-shot evaluation suggests that encoders may fail on this fine-grained matching, regardless of training sources or model sizes. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, which yields the best performance on CapRetrieval. Within this process, we further identify a granularity dilemma: a challenge for embeddings to express fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.



Article 422

Title@2025-06-10 (2): CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling

Title: CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling CounselBench: Ein groß angelegter Experten-Evaluierungs- und Adversarial-Benchmark für große Sprachmodelle in der Mental Health Counseling 心理卫生咨询中大语言模式的大范围专家评价和反向基准 2506.08584v1

Authors (6): Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hwang, Ruishan Liu

Large language models (LLMs) are increasingly proposed for use in mental health support, yet their behavior in realistic counseling scenarios remains largely untested. We introduce CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test LLMs in single-turn counseling. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of responses from GPT-4, LLaMA 3, Gemini, and online human therapists to real patient questions. Each response is rated along six clinically grounded dimensions, with written rationales and span-level annotations. We find that LLMs often outperform online human therapists in perceived quality, but experts frequently flag their outputs for safety concerns such as unauthorized medical advice. Follow-up experiments show that LLM judges consistently overrate model responses and overlook safety issues identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored counseling questions designed to trigger specific model issues. Evaluation across 2,880 responses from eight LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking and improving LLM behavior in high-stakes mental health settings.



Article 423

Title@2025-06-10 (2): How transformers learn structured data: insights from hierarchical filtering

Title: How transformers learn structured data: insights from hierarchical filtering Wie Transformatoren strukturierte Daten lernen: Erkenntnisse aus der hierarchischen Filterung 变压器如何学习结构化数据:从上层过滤器获得的洞察力 2408.15138v3

Authors (4): Jerome Garnier-Brun, Marc Mézard, Emanuele Moscato, Luca Saglietti

Understanding the learning process and the embedded computation in transformers is becoming a central goal for the development of interpretable AI. In the present study, we introduce a hierarchical filtering procedure for data models of sequences on trees, allowing us to hand-tune the range of positional correlations in the data. Leveraging this controlled setting, we provide evidence that vanilla encoder-only transformers can approximate the exact inference algorithm when trained on root classification and masked language modeling tasks, and study how this computation is discovered and implemented. We find that correlations at larger distances, corresponding to increasing layers of the hierarchy, are sequentially included by the network during training. By comparing attention maps from models trained with varying degrees of filtering and by probing the different encoder levels, we find clear evidence of a reconstruction of correlations on successive length scales corresponding to the various levels of the hierarchy, which we relate to a plausible implementation of the exact inference algorithm within the same architecture.



Article 424

Title@2025-06-10 (2): How Malicious AI Swarms Can Threaten Democracy

Title: How Malicious AI Swarms Can Threaten Democracy Wie bösartige KI-Schwärme die Demokratie bedrohen können 恶意的AI Swararms能够如何威胁民主 2506.06299v2

Authors (21): Daniel Thilo Schroeder, Meeyoung Cha, Andrea Baronchelli, Nick Bostrom, Nicholas A. Christakis, David Garcia, Amit Goldenberg, Yara Kyrychenko, Kevin Leyton-Brown, Nina Lutz, Gary Marcus, Filippo Menczer, Gordon Pennycook, David G. Rand, Frank Schweitzer, Christopher Summerfield, Audrey Tang, Jay Van Bavel, Sander van der Linden, Dawn Song, Jonas R. Kunst

Advances in AI portend a new era of sophisticated disinformation operations. While individual AI systems already create convincing – and at times misleading – information, an imminent development is the emergence of malicious AI swarms. These systems can coordinate covertly, infiltrate communities, evade traditional detectors, and run continuous A/B tests, with round-the-clock persistence. The result can include fabricated grassroots consensus, fragmented shared reality, mass harassment, voter micro-suppression or mobilization, contamination of AI training data, and erosion of institutional trust. With democratic processes worldwide increasingly vulnerable, we urge a three-pronged response: (1) platform-side defenses – always-on swarm-detection dashboards, pre-election high-fidelity swarm-simulation stress-tests, transparency audits, and optional client-side “AI shields” for users; (2) model-side safeguards – standardized persuasion-risk tests, provenance-authenticating passkeys, and watermarking; and (3) system-level oversight – a UN-backed AI Influence Observatory.



Article 425

Title@2025-06-10 (2): The Geometries of Truth Are Orthogonal Across Tasks

Title: The Geometries of Truth Are Orthogonal Across Tasks Die Geometrien der Wahrheit sind orthogonal über Aufgaben hinweg 真理的地貌不同任务 2506.08572v1

Authors (7): Waiss Azizian, Michael Kirchhof, Eugene Ndiaye, Louis Bethune, Michal Klein, Pierre Ablin, Marco Cuturi

Large Language Models (LLMs) have demonstrated impressive generalization capabilities across various tasks, but their claim to practical relevance is still marred by concerns about their reliability. Recent works have proposed examining the activations produced by an LLM at inference time to assess whether its answer to a question is correct. Some works claim that a “geometry of truth” can be learned from examples, in the sense that the activations that generate correct answers can be distinguished from those leading to mistakes with a linear classifier. In this work, we underline a limitation of these approaches: we observe that these “geometries of truth” are intrinsically task-dependent and fail to transfer across tasks. More precisely, we show that linear classifiers trained across distinct tasks share little similarity and, when trained with sparsity-enforcing regularizers, have almost disjoint supports. We show that more sophisticated approaches (e.g., using mixtures of probes and tasks) fail to overcome this limitation, likely because activation vectors commonly used to classify answers form clearly separated clusters when examined across tasks.
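
The comparison of probes described above can be made concrete with a small sketch. It measures how much two linear probes share, using the Jaccard overlap of their supports (the non-zero weights) and the cosine similarity of their weight vectors. The weight values, the tolerance, and the function names are illustrative assumptions; the abstract does not say which similarity measures the authors use.

```python
import math

def support(weights, tol=1e-8):
    """Indices of weights treated as non-zero (tol is an assumed threshold)."""
    return {i for i, w in enumerate(weights) if abs(w) > tol}

def jaccard(a, b):
    """Size of the intersection over size of the union of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cosine(u, v):
    """Cosine similarity between two weight vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical sparse probes trained on two different tasks.
probe_task_a = [0.9, 0.0, -0.4, 0.0, 0.0, 0.0]
probe_task_b = [0.0, 0.7, 0.0, 0.0, -0.2, 0.0]

support_overlap = jaccard(support(probe_task_a), support(probe_task_b))
direction_similarity = cosine(probe_task_a, probe_task_b)
print(support_overlap, direction_similarity)
```

In this toy case the two probes use entirely different coordinates, so both measures are zero, which is the extreme version of the "almost disjoint supports" finding.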



Article 426

Title@2025-06-10 (2): Can Large Language Models Invent Algorithms to Improve Themselves?: Algorithm Discovery for Recursive Self-Improvement through Reinforcement Learning

Title: Can Large Language Models Invent Algorithms to Improve Themselves?: Algorithm Discovery for Recursive Self-Improvement through Reinforcement Learning Können große Sprachmodelle Algorithmen erfinden, um sich selbst zu verbessern?: Algorithmenentdeckung für rekursive Selbstverbesserung durch Stärkungslernen 大语言模型能创造出能改善自我的算法吗? :通过强化学习实现递归自我改进的算法发现 2410.15639v5

Authors (3): Yoichi Ishibashi, Taro Yano, Masafumi Oyamada

Large Language Models (LLMs) have achieved remarkable capabilities, yet their improvement methods remain fundamentally constrained by human design. We present Self-Developing, a framework that enables LLMs to autonomously discover, implement, and refine their own improvement algorithms. Our approach employs an iterative cycle where a seed model generates algorithmic candidates as executable code, evaluates their effectiveness, and uses Direct Preference Optimization to recursively improve increasingly sophisticated improvement strategies. We demonstrate this framework through model merging, a practical technique for combining specialized models. Self-Developing successfully discovered novel merging algorithms that outperform existing human-designed algorithms. On mathematical reasoning benchmarks, the autonomously discovered algorithms improve the seed model’s GSM8k performance by 6% and exceed human-designed approaches like Task Arithmetic by 4.3%. Remarkably, these algorithms exhibit strong generalization, achieving 7.4% gains on out-of-domain models without re-optimization. Our findings demonstrate that LLMs can transcend their training to invent genuinely novel optimization techniques. This capability represents a crucial step toward a new era where LLMs not only solve problems but autonomously develop the methodologies for their own advancement.
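
For readers unfamiliar with the human-designed baseline named above, the following is a minimal sketch of Task Arithmetic merging: each fine-tuned model contributes a task vector (its weights minus the base weights), and the scaled sum of task vectors is added back to the base. The dictionary-of-floats representation, the parameter names, and the `scale` value are illustrative assumptions; this is the baseline, not one of the algorithms discovered by Self-Developing.

```python
def task_arithmetic_merge(base, finetuned_models, scale=1.0):
    """Add the scaled sum of task vectors (finetuned - base) to the base weights."""
    merged = {}
    for name, base_value in base.items():
        task_vector_sum = sum(model[name] - base_value for model in finetuned_models)
        merged[name] = base_value + scale * task_vector_sum
    return merged

# Hypothetical one-parameter-per-name "models" standing in for weight tensors.
base = {"w": 1.0, "b": 0.0}
math_model = {"w": 1.5, "b": 0.25}
code_model = {"w": 0.5, "b": 0.25}

merged = task_arithmetic_merge(base, [math_model, code_model], scale=0.5)
print(merged)
```

Here the two task vectors cancel on `w` and reinforce each other on `b`, which shows why the choice of scaling and combination rule matters for merging.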



Article 427

Title@2025-06-10 (2): Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?

Title: Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world? Nachbarn und Verwandte: Wie spiegeln Spracheinbettungen sprachliche Zusammenhänge auf der ganzen Welt wider? 邻居和亲戚:语言嵌入如何反映世界各地的语言联系? 2506.08564v1

Authors (3): Tuukka Törö, Antti Suni, Juraj Šimko

Investigating linguistic relationships on a global scale requires analyzing diverse features such as syntax, phonology and prosody, which evolve at varying rates influenced by internal diversification, language contact, and sociolinguistic factors. Recent advances in machine learning (ML) offer complementary alternatives to traditional historical and typological approaches. Instead of relying on expert labor in analyzing specific linguistic features, these new methods enable the exploration of linguistic variation through embeddings derived directly from speech, opening new avenues for large-scale, data-driven analyses. This study employs embeddings from the fine-tuned XLS-R self-supervised language identification model voxlingua107-xls-r-300m-wav2vec, to analyze relationships between 106 world languages based on speech recordings. Using linear discriminant analysis (LDA), language embeddings are clustered and compared with genealogical, lexical, and geographical distances. The results demonstrate that embedding-based distances align closely with traditional measures, effectively capturing both global and local typological patterns. Challenges in visualizing relationships, particularly with hierarchical clustering and network-based methods, highlight the dynamic nature of language change. The findings show potential for scalable analyses of language variation based on speech embeddings, providing new perspectives on relationships among languages. By addressing methodological considerations such as corpus size and latent space dimensionality, this approach opens avenues for studying low-resource languages and bridging macro- and micro-level linguistic variation. Future work aims to extend these methods to underrepresented languages and integrate sociolinguistic variation for a more comprehensive understanding of linguistic diversity.



Article 428

Title@2025-06-10 (2): Efficient Post-Training Refinement of Latent Reasoning in Large Language Models

Title: Efficient Post-Training Refinement of Latent Reasoning in Large Language Models Effiziente Nachschulung zur Verfeinerung latenter Vernunft in großen Sprachmodellen 改进大语言模式中长期原因 2506.08552v1

Authors (8): Xinyuan Wang, Dongjie Wang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Sixun Dong, Kunpeng Liu, Yanjie Fu

Reasoning is a key component of language understanding in Large Language Models. While Chain-of-Thought prompting enhances performance via explicit intermediate steps, it suffers from substantial token overhead and a fixed reasoning trajectory, preventing step-wise refinement. Recent advances in latent reasoning address these limitations by refining internal reasoning processes directly in the model’s latent space, without producing explicit outputs. However, a key challenge remains: how to effectively update reasoning embeddings during post-training to guide the model toward more accurate solutions. To overcome this challenge, we propose a lightweight post-training framework that refines latent reasoning trajectories using two novel strategies: 1) Contrastive reasoning feedback, which compares reasoning embeddings against strong and weak baselines to infer effective update directions via embedding enhancement; 2) Residual embedding refinement, which stabilizes updates by progressively integrating current and historical gradients, enabling fast yet controlled convergence. Extensive experiments and case studies are conducted on five reasoning benchmarks to demonstrate the effectiveness of the proposed framework. Notably, the framework achieves a 5% accuracy gain on MathQA without additional training.



Article 429

Title@2025-06-10 (2): BiasGuard: A Reasoning-enhanced Bias Detection Tool For Large Language Models

Title: BiasGuard: A Reasoning-enhanced Bias Detection Tool For Large Language Models BiasGuard: Ein reasoning-enhanced Bias Detection Tool für große Sprachmodelle BiasGuard:为大语言模型提供的理由强化的Bias Bias探测工具 2504.21299v2

Authors (3): Zhiting Fan, Ruizhe Chen, Zuozhu Liu

Identifying bias in LLM-generated content is a crucial prerequisite for ensuring fairness in LLMs. Existing methods, such as fairness classifiers and LLM-based judges, face limitations related to difficulties in understanding underlying intentions and the lack of criteria for fairness judgment. In this paper, we introduce BiasGuard, a novel bias detection tool that explicitly analyzes inputs and reasons through fairness specifications to provide accurate judgments. BiasGuard is implemented through a two-stage approach: the first stage initializes the model to explicitly reason based on fairness specifications, while the second stage leverages reinforcement learning to enhance its reasoning and judgment capabilities. Our experiments, conducted across five datasets, demonstrate that BiasGuard outperforms existing tools, improving accuracy and reducing over-fairness misjudgments. We also highlight the importance of reasoning-enhanced decision-making and provide evidence for the effectiveness of our two-stage optimization pipeline.



Article 430

Title@2025-06-10 (2): FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs

Title: FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs FairMT-Bench: Benchmarking Fairness für Multiturn-Dialoge in Konversations-LLMs FairMT-Bench:为多方向对话设定公平基准 2410.19317v2

Authors (4): Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Zuozhu Liu

The growing use of large language model (LLM)-based chatbots has raised concerns about fairness. Fairness issues in LLMs can lead to severe consequences, such as bias amplification, discrimination, and harm to marginalized communities. While existing fairness benchmarks mainly focus on single-turn dialogues, multi-turn scenarios, which in fact better reflect real-world conversations, present greater challenges due to conversational complexity and potential bias accumulation. In this paper, we propose a comprehensive fairness benchmark for LLMs in multi-turn dialogue scenarios, FairMT-Bench. Specifically, we formulate a task taxonomy targeting LLM fairness capabilities across three stages: context understanding, user interaction, and instruction trade-offs, with each stage comprising two tasks. To ensure coverage of diverse bias types and attributes, we draw from existing fairness datasets and employ our template to construct a multi-turn dialogue dataset, FairMT-10K. For evaluation, GPT-4 is applied, alongside bias classifiers including Llama-Guard-3 and human validation to ensure robustness. Experiments and analyses on FairMT-10K reveal that in multi-turn dialogue scenarios, current LLMs are more likely to generate biased responses, and there is significant variation in performance across different tasks and models. Based on this, we curate a challenging dataset, FairMT-1K, and test 15 current state-of-the-art (SOTA) LLMs on this dataset. The results show the current state of fairness in LLMs and showcase the utility of this novel approach for assessing fairness in more realistic multi-turn dialogue contexts, calling for future work to focus on LLM fairness improvement and the adoption of FairMT-1K in such efforts.



Article 431

Title@2025-06-10 (2): Exploring the Escalation of Source Bias in User, Data, and Recommender System Feedback Loop

Title: Exploring the Escalation of Source Bias in User, Data, and Recommender System Feedback Loop Erforschung der Eskalation von Source Bias in User, Data und Recommender System Feedback Loop 探索用户、数据、建议系统反馈圈中源比值源的上升 2405.17998v2

Authors (7): Yuqi Zhou, Sunhao Dai, Liang Pang, Gang Wang, Zhenhua Dong, Jun Xu, Ji-Rong Wen

Recommender systems are essential for information access, allowing users to present their content for recommendation. With the rise of large language models (LLMs), AI-generated content (AIGC), primarily in the form of text, has become a central part of the content ecosystem. As AIGC becomes increasingly prevalent, it is important to understand how it affects the performance and dynamics of recommender systems. To this end, we construct an environment that incorporates AIGC to explore its short-term impact. The results from popular sequential recommendation models reveal that AIGC is ranked higher in the recommender system, reflecting the phenomenon of source bias. To further explore the long-term impact of AIGC, we introduce a feedback loop with realistic simulators. The results show that the model’s preference for AIGC increases as user clicks on AIGC rise and the model trains on simulated click data. This leads to two issues: In the short term, bias toward AIGC encourages LLM-based content creation, increasing AIGC content, and causing unfair traffic distribution. From a long-term perspective, our experiments also show that when AIGC dominates the content ecosystem after a feedback loop, it can lead to a decline in recommendation performance. To address these issues, we propose a debiasing method based on L1-loss optimization to maintain long-term content ecosystem balance. In a real-world environment with AIGC generated by mainstream LLMs, our method ensures a balance between AIGC and human-generated content in the ecosystem. The code and dataset are available at https://github.com/Yuqi-Zhou/Rec_SourceBias.



Article 432

Title@2025-06-10 (2): Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

Title: Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering Meta-Adaptive Prompt-Destillation für wenige-heiße visuelle Frage-Antworten 用于少数浅光视觉问题解答的元适应性快速蒸馏 2506.06905v2

Authors (3): Akash Gupta, Amos Storkey, Mirella Lapata

Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, is inconsistent and does not always improve monotonically with increasing examples. We hypothesize that this occurs due to the LMM being overwhelmed by additional information present in the image embeddings, which is not required for the downstream task. To address this, we propose a meta-learning approach that provides an alternative for inducing few-shot capabilities in LMMs, using a fixed set of soft prompts that are distilled from task-relevant image features and can be adapted at test time using a few examples. To facilitate this distillation, we introduce an attention-mapper module that can be easily integrated with the popular LLaVA v1.5 architecture and is jointly learned with soft prompts, enabling task adaptation in LMMs under low-data regimes with just a few gradient steps. Evaluation on the VL-ICL Bench shows that our method consistently outperforms ICL and related prompt-tuning approaches, even under image perturbations, improving task induction and reasoning across visual question answering tasks.



Article 433

Title@2025-06-10 (2): Length-Induced Embedding Collapse in PLM-based Models

Title: Length-Induced Embedding Collapse in PLM-based Models Längeninduzierter Einbettungskollaps in PLM-basierte Modelle 以PLM为基础的模型中长期生成的嵌入式折叠 2410.24200v2

Authors (5): Yuqi Zhou, Sunhao Dai, Zhanshuo Cao, Xiao Zhang, Jun Xu

Text embeddings from PLM-based models enable a wide range of applications, yet their performance often degrades on longer texts. In this paper, we introduce a phenomenon we call Length Collapse, where embeddings of longer texts tend to cluster together. This clustering results in a distributional inconsistency between the embeddings of short and long texts. We further investigate how these differences contribute to the performance decline observed with longer texts across various downstream tasks. Through a rigorous theoretical analysis of the self-attention mechanism, which acts as a low-pass filter in PLM-based models, we demonstrate that as text length increases, the strength of low-pass filtering intensifies, causing embeddings to retain more low-frequency components. As a result, input token features become more similar, leading to clustering and ultimately the collapse of embeddings for longer texts. To address this issue, we propose a simple method, TempScale, which mitigates the Length Collapse phenomenon. By narrowing the gap in low-pass filtering rates between long and short texts, TempScale ensures more consistent embeddings across different text lengths. This approach leads to performance improvements of 0.94% on MTEB and 1.10% on LongEmbed, which focuses specifically on long-context retrieval, providing strong evidence for the validity of our analysis. The source code is available at https://github.com/Yuqi-Zhou/Length_Collapse.
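
The low-pass intuition above can be illustrated with a toy example: an attention step replaces each token feature with a weighted average of all features, and the flatter the attention weights, the more the outputs shrink toward their common mean. The sketch below uses a softmax temperature to control that flatness. Treating temperature as the control knob is an assumption suggested only by the name TempScale; the abstract does not describe the method's exact form, and the scalar features and score matrix are made up for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores, temperature):
    """One attention step on scalar token features: each output is a weighted average."""
    return [
        sum(w * f for w, f in zip(softmax(row, temperature), features))
        for row in scores
    ]

def spread(values):
    """Population variance, used here as a measure of how distinct the tokens remain."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Four toy tokens; each token attends mostly to itself (score 4.0 on the diagonal).
features = [1.0, -1.0, 2.0, -2.0]
scores = [[4.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

sharp = attend(features, scores, temperature=0.5)  # peaked attention
flat = attend(features, scores, temperature=4.0)   # near-uniform attention
print(spread(features), spread(sharp), spread(flat))
```

With peaked attention the token features stay almost as spread out as the inputs, while near-uniform attention pulls them together, which is the kind of similarity increase the abstract associates with collapse.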



Article 434

Title@2025-06-10 (2): A Survey on Long Text Modeling with Transformers

Title: A Survey on Long Text Modeling with Transformers Eine Umfrage über lange Textmodellierung mit Transformern 利用变形器模拟长文本调查 2302.14502v2

Authors (4): Zican Dong, Tianyi Tang, Junyi Li, Wayne Xin Zhao

Modeling long texts has been an essential technique in the field of natural language processing (NLP). With the ever-growing number of long documents, it is important to develop effective modeling methods that can process and analyze such texts. However, long texts pose important research challenges for existing text models, with more complex semantics and special characteristics. In this paper, we provide an overview of the recent advances in long text modeling based on Transformer models. Firstly, we introduce the formal definition of long text modeling. Then, as the core content, we discuss how to process long input to satisfy the length limitation and design improved Transformer architectures to effectively extend the maximum context length. Following this, we discuss how to adapt Transformer models to capture the special characteristics of long texts. Finally, we describe four typical applications involving long text modeling and conclude this paper with a discussion of future directions. Our survey intends to provide researchers with a synthesis and pointer to related work on long text modeling.



Article 435

Title@2025-06-10 (2): The BS-meter: A ChatGPT-Trained Instrument to Detect Sloppy Language-Games

Title: The BS-meter: A ChatGPT-Trained Instrument to Detect Sloppy Language-Games Das BS-Meter: Ein ChatGPT-getrainiertes Instrument, um schlampige Sprachspiele zu erkennen BS 计: 探测斯洛文尼亚语地名的聊天式TT-训练工具 2411.15129v2

Authors (4): Alessandro Trevisan, Harry Giddens, Sarah Dillon, Alan F. Blackwell

What can we learn about language from studying how it is used by ChatGPT and other large language model (LLM)-based chatbots? In this paper, we analyse the distinctive character of language generated by ChatGPT, in relation to questions raised by natural language processing pioneer, and student of Wittgenstein, Margaret Masterman. Following frequent complaints that LLM-based chatbots produce “slop,” or even “bullshit,” in the sense of Frankfurt’s popular monograph On Bullshit, we conduct an empirical study to contrast the language of 1,000 scientific publications with typical text generated by ChatGPT. We then explore whether the same language features can be detected in two well-known contexts of social dysfunction: George Orwell’s critique of political speech, and David Graeber’s characterisation of bullshit jobs. Using simple hypothesis-testing methods, we demonstrate that a statistical model of sloppy bullshit can reliably relate the Frankfurtian artificial bullshit of ChatGPT to the political and workplace functions of bullshit as observed in natural human language.



Article 436

Title@2025-06-10 (2): A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization

Title: A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization Hybride Architektur mit effizienter Feinabstimmung für abstrakte Patentdokumentzusammenfassung 简易专利文件摘要的高效精度计价混合结构 2503.10354v3

Authors (2): Nevidu Jayatilleke, Ruvan Weerasinghe

Automatic patent summarization approaches that help in the patent analysis and comprehension procedure are in high demand due to the colossal growth of innovations. The development of natural language processing (NLP), text mining, and deep learning has notably amplified the efficacy of text summarization models for many types of documents. Summarizing patent text remains a pertinent challenge due to the labyrinthine writing style of these documents, which includes technical and legal intricacies. Additionally, patent documents are considerably lengthier than typical documents, which complicates the process of extracting pertinent information for summarization. Combining extractive and abstractive text summarization methodologies into a hybrid framework, this study proposes a system for efficiently creating abstractive summaries of patent records. The procedure involves leveraging the LexRank graph-based algorithm to retrieve the important sentences from input patent texts, then utilizing a Bidirectional Auto-Regressive Transformer (BART) model that has been fine-tuned using Low-Rank Adaptation (LoRA) for producing text summaries. This is accompanied by methodical testing and evaluation strategies. Furthermore, the authors employ certain meta-learning techniques to achieve Domain Generalization (DG) of the abstractive component across multiple patent fields.
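
The extractive stage described above can be sketched as a small graph-ranking routine in the spirit of LexRank: sentences are nodes, an edge links two sentences whose similarity exceeds a threshold, and a PageRank-style iteration scores each sentence by its centrality. This is a simplified illustration, not the authors' pipeline: it uses plain bag-of-words cosine rather than the IDF-weighted cosine of the original LexRank, and the threshold, damping factor, iteration count, and example sentences are all assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lexrank(sentences, threshold=0.5, damping=0.85, iterations=50):
    """Score sentences by centrality in a thresholded similarity graph."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    # Unweighted adjacency: 1.0 when two distinct sentences are similar enough.
    adj = [
        [1.0 if i != j and cosine(bags[i], bags[j]) > threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    degrees = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes an equal share of its score to i.
            incoming = sum(
                scores[j] / degrees[j]
                for j in range(n)
                if adj[j][i] and degrees[j]
            )
            updated.append((1.0 - damping) / n + damping * incoming)
        scores = updated
    return scores

# Hypothetical patent-like sentences for illustration only.
sentences = [
    "the patent claims a battery electrode",
    "the battery electrode uses a lithium coating",
    "the lithium coating improves battery life",
    "figures are attached separately",
]
scores = lexrank(sentences)
top_sentence = sentences[scores.index(max(scores))]
print(top_sentence)
```

In a hybrid setup like the one described, the highest-scoring sentences would be passed on as the shortened input to the abstractive model.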



Article 437

Title@2025-06-10 (2): CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations

Title: CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations CoMuMDR: Code-gemischter multimodaler Multidomain-Korpus für Diskurs paRsing in Gesprächen COMMDR: 混合编码的多式混合多式对话中聊天声响多式公文 2506.08504v1

Authors (6): Divyaksh Shukla, Ritesh Baviskar, Dwijesh Gohil, Aniket Tiwari, Atul Shree, Ashutosh Modi

Discourse parsing is an important task useful for NLU applications such as summarization, machine comprehension, and emotion recognition. The current discourse parsing datasets based on conversations consist of written English dialogues restricted to a single domain. In this resource paper, we introduce CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations. The corpus (code-mixed in Hindi and English) has both audio and transcribed text and is annotated with nine discourse relations. We experiment with various SoTA baseline models; the poor performance of SoTA models highlights the challenges of a multi-domain code-mixed corpus, pointing towards the need for developing better models for such realistic settings.



Article 438

Title@2025-06-10 (2): Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design

Title: Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design Bewertung ist alles, was Sie brauchen: Strategisches Überfordern von LLM-Gründerfähigkeiten durch Evaluationsdesign 评价是全你需要的:通过评价设计,从战略上压低LLM能力,通过评价设计提高LLM能力 2506.04734v2

Authors (11): Lin Sun, Weihong Lin, Jinzhu Wu, Yongfu Zhu, Xiaoqi Jian, Guangxiang Zhao, Change Jia, Linglin Zhang, Sai-er Hu, Yuhan Wu, Xiangzheng Zhang

Reasoning models represented by the Deepseek-R1-Distill series have been widely adopted by the open-source community due to their strong performance in mathematics, science, programming, and other domains. However, our study reveals that their benchmark evaluation results are subject to significant fluctuations caused by various factors. Subtle differences in evaluation conditions can lead to substantial variations in results. Similar phenomena are observed in other open-source inference models fine-tuned based on the Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their claimed performance improvements difficult to reproduce reliably. Therefore, we advocate for the establishment of a more rigorous paradigm for model performance evaluation and present our empirical assessments of the Deepseek-R1-Distill series models.



Article 439

Title@2025-06-10 (2): LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement

Title: LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement LLaSE-G1: Förderung der Verallgemeinerungsfähigkeit für LLaMA-basierte Sprachverbesserung LLASE-G1:激励以LLASMA为基础的加强发声能力的普及能力 2503.00493v4

Authors (13): Boyi Kang, Xinfa Zhu, Zihan Zhang, Zhen Ye, Mingshuai Liu, Ziqian Wang, Yike Zhu, Guobin Ma, Jun Chen, Longshuai Xiao, Chao Weng, Wei Xue, Lei Xie

Recent advancements in language models (LMs) have demonstrated strong capabilities in semantic understanding and contextual modeling, which have flourished in generative speech enhancement (SE). However, many LM-based SE approaches primarily focus on semantic information, often neglecting the critical role of acoustic information, which leads to acoustic inconsistency after enhancement and limited generalization across diverse SE tasks. In this paper, we introduce LLaSE-G1, a LLaMA-based language model that incentivizes generalization capabilities for speech enhancement. LLaSE-G1 offers the following key contributions: First, to mitigate acoustic inconsistency, LLaSE-G1 employs continuous representations from WavLM as input and predicts speech tokens from X-Codec2, maximizing acoustic preservation. Second, to promote generalization capability, LLaSE-G1 introduces dual-channel inputs and outputs, unifying multiple SE tasks without requiring task-specific IDs. Third, LLaSE-G1 outperforms prior task-specific discriminative and generative SE models, demonstrating scaling effects at test time and emerging capabilities for unseen SE tasks. Additionally, we release our code and models to support further research in this area.



Article 440

Title@2025-06-10 (2): DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs

Title: DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs In Konflikte geraten: In suchgesteigerten LLMs widersprüchliche Quellen erkennen und bekämpfen 钻入冲突:发现和解决搜索中的冲突源 2506.08500v1

Authors (9): Arie Cattan, Alon Jacovi, Ori Ram, Jonathan Herzig, Roee Aharoni, Sasha Goldshtein, Eran Ofek, Idan Szpektor, Avi Caciularu

Retrieval Augmented Generation (RAG) is a commonly used approach for enhancing large language models (LLMs) with relevant and up-to-date information. However, the retrieved sources can often contain conflicting information and it remains unclear how models should address such discrepancies. In this work, we first propose a novel taxonomy of knowledge conflict types in RAG, along with the desired model behavior for each type. We then introduce CONFLICTS, a high-quality benchmark with expert annotations of conflict types in a realistic RAG setting. CONFLICTS is the first benchmark that enables tracking progress on how models address a wide range of knowledge conflicts. We conduct extensive experiments on this benchmark, showing that LLMs often struggle to appropriately resolve conflicts between sources. While prompting LLMs to explicitly reason about the potential conflict in the retrieved documents significantly improves the quality and appropriateness of their responses, substantial room for improvement in future research remains.



Article 441

Title@2025-06-10 (2): Integration of Old and New Knowledge for Generalized Intent Discovery: A Consistency-driven Prototype-Prompting Framework

Title: Integration of Old and New Knowledge for Generalized Intent Discovery: A Consistency-driven Prototype-Prompting Framework Integration von altem und neuem Wissen für eine generalisierte Intent Discovery: Ein konsistentes prototype-prompting Framework 综合旧知识和新知识促进普遍化的本能发现:由一致性驱动的原型-促进框架 2506.08490v1

Authors (6): Xiao Wei, Xiaobao Wang, Ning Zhuang, Chenyang Wang, Longbiao Wang, Jianwu Dang

Intent detection aims to identify user intents from natural language inputs, where supervised methods rely heavily on labeled in-domain (IND) data and struggle with out-of-domain (OOD) intents, limiting their practical applicability. Generalized Intent Discovery (GID) addresses this by leveraging unlabeled OOD data to discover new intents without additional annotation. However, existing methods focus solely on clustering unsupervised data while neglecting domain adaptation. Therefore, we propose a consistency-driven prototype-prompting framework for GID from the perspective of integrating old and new knowledge, which includes a prototype-prompting framework for transferring old knowledge from external sources, and a hierarchical consistency constraint for learning new knowledge from target domains. We conducted extensive experiments and the results show that our method significantly outperforms all baseline methods, achieving state-of-the-art results, which strongly demonstrates the effectiveness and generalization of our methods. Our source code is publicly available at https://github.com/smileix/cpp.



Article 442

Title@2025-06-10 (2): EtiCor++: Towards Understanding Etiquettical Bias in LLMs

Title: EtiCor++: Towards Understanding Etiquettical Bias in LLMs EtiCor++: Auf dem Weg zum Verständnis von Etikettien in LLMs EtiCor++:争取在长效 2506.08488v1

Authors (3): Ashutosh Dwivedi, Siddhant Shivdutt Singh, Ashutosh Modi

In recent years, researchers have started analyzing the cultural sensitivity of LLMs. In this respect, etiquettes have been an active area of research. Etiquettes are region-specific and are an essential part of the culture of a region; hence, it is imperative to make LLMs sensitive to etiquettes. However, more resources are needed for evaluating how well LLMs understand etiquettes and how biased they are with regard to them. In this resource paper, we introduce EtiCor++, a corpus of etiquettes worldwide. We introduce different tasks for evaluating LLMs for knowledge about etiquettes across various regions. Further, we introduce various metrics for measuring bias in LLMs. Extensive experimentation with LLMs shows inherent bias towards certain regions.



Article 443

Title@2025-06-10 (2): Fairness is Not Silence: Unmasking Vacuous Neutrality in Small Language Models

Title: Fairness is Not Silence: Unmasking Vacuous Neutrality in Small Language Models Fairness ist kein Schweigen: Leere Neutralität in kleinen Sprachmodellen entlarven 公平不是沉默:在小型语言模式中松散不显中性 2506.08487v1

Authors (2): Sumanth Manduru, Carlotta Domeniconi

The rapid adoption of Small Language Models (SLMs) for on-device and resource-constrained deployments has outpaced our understanding of their ethical risks. To the best of our knowledge, we present the first large-scale audit of instruction-tuned SLMs spanning 0.5 to 5 billion parameters, an overlooked “middle tier” between BERT-class encoders and flagship LLMs. Our evaluation includes nine open-source models from the Qwen 2.5, LLaMA 3.2, Gemma 3, and Phi families. Using the BBQ benchmark under zero-shot prompting, we analyze both utility and fairness across ambiguous and disambiguated contexts. This evaluation reveals three key insights. First, competence and fairness need not be antagonistic: Phi models achieve F1 scores exceeding 90 percent while exhibiting minimal bias, showing that efficient and ethical NLP is attainable. Second, social bias varies significantly by architecture: Qwen 2.5 models may appear fair, but this often reflects vacuous neutrality, random guessing, or evasive behavior rather than genuine ethical alignment. In contrast, LLaMA 3.2 models exhibit stronger stereotypical bias, suggesting overconfidence rather than neutrality. Third, compression introduces nuanced trade-offs: 4-bit AWQ quantization improves F1 scores in ambiguous settings for LLaMA 3.2-3B but increases disability-related bias in Phi-4-Mini by over 7 percentage points. These insights provide practical guidance for the responsible deployment of SLMs in applications demanding fairness and efficiency, particularly benefiting small enterprises and resource-constrained environments.



Article 444

Title@2025-06-10 (2): Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models

Title: Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models Die automatische Auswertung der Bild-Text-Ausrichtung in Text-zu-Bild-Modellen neu denken 重新思考文本到图像模型图像文本对齐的自动评价 2506.08480v1

Authors (2): Huixuan Zhang, Xiaojun Wan

Text-to-image models often struggle to generate images that precisely match textual prompts. Prior research has extensively studied the evaluation of image-text alignment in text-to-image generation. However, existing evaluations primarily focus on agreement with human assessments, neglecting other critical properties of a trustworthy evaluation framework. In this work, we first identify two key aspects that a reliable evaluation should address. We then empirically demonstrate that current mainstream evaluation frameworks fail to fully satisfy these properties across a diverse range of metrics and models. Finally, we propose recommendations for improving image-text alignment evaluation.



Article 445

Title@2025-06-10 (2): Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$

Title: Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$ Effiziente Kontextauswahl für Lang-Kontext QA: Keine Abstimmung, keine Iteration, nur adaptiv-$k$ 长文本QA的高效率背景选择:无调试、无迭接、仅可调适-k$ 2506.08479v1

Authors (3): Chihiro Taguchi, Seiji Maekawa, Nikita Bhutani

Retrieval-augmented generation (RAG) and long-context language models (LCLMs) both address context limitations of LLMs in open-domain question answering (QA). However, optimal external context to retrieve remains an open problem: fixing the retrieval size risks either wasting tokens or omitting key evidence. Existing adaptive methods like Self-RAG and Self-Route rely on iterative LLM prompting and perform well on factoid QA, but struggle with aggregation QA, where the optimal context size is both unknown and variable. We present Adaptive-$k$ retrieval, a simple and effective single-pass method that adaptively selects the number of passages based on the distribution of the similarity scores between the query and the candidate passages. It does not require model fine-tuning, extra LLM inferences or changes to existing retriever-reader pipelines. On both factoid and aggregation QA benchmarks, Adaptive-$k$ matches or outperforms fixed-$k$ baselines while using up to 10x fewer tokens than full-context input, yet still retrieves 70% of relevant passages. It improves accuracy across five LCLMs and two embedding models, highlighting that dynamically adjusting context size leads to more efficient and accurate QA.
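
The idea of choosing the number of passages from the distribution of similarity scores can be sketched as follows. This is a minimal illustration that assumes the cutoff is placed at the largest gap between consecutive sorted scores; the abstract does not spell out the exact selection rule, so the criterion, the function name `adaptive_k`, and the `min_k` parameter are assumptions for illustration only.

```python
def adaptive_k(similarities, min_k=1):
    """Return indices of passages to keep, cutting at the largest score gap.

    `similarities[i]` is the query-passage similarity of passage i.
    Assumption: the cut goes where consecutive sorted scores differ most.
    """
    if not similarities:
        return []
    min_k = max(1, min_k)
    # Rank passage indices by similarity, highest first.
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    ranked = [similarities[i] for i in order]

    # Default to keeping everything if no gap can be evaluated.
    best_k, best_gap = len(ranked), -1.0
    for k in range(min_k, len(ranked)):
        gap = ranked[k - 1] - ranked[k]  # drop between rank k and rank k + 1
        if gap > best_gap:
            best_k, best_gap = k, gap
    return order[:best_k]


# Three passages score high and three score low, so three are kept.
scores = [0.91, 0.15, 0.88, 0.12, 0.86, 0.10]
selected = adaptive_k(scores)
```

Because the rule needs only one pass over the retriever's scores, it requires no extra LLM calls, which matches the single-pass property described above.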



Article 446

Title@2025-06-10 (2): Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning

Title: Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning Gefährliche Memes mit entkoppeltem Verständnis und geleiteter CoT-Vernunft erkennen 以分解理解和引导计算成本理由,检测有害金属 2506.08477v1

Authors (3): Fengjun Pan, Anh Tuan Luu, Xiaobao Wu

Detecting harmful memes is essential for maintaining the integrity of online environments. However, current approaches often struggle with resource efficiency, flexibility, or explainability, limiting their practical deployment in content moderation systems. To address these challenges, we introduce U-CoT+, a novel framework for harmful meme detection. Instead of relying solely on prompting or fine-tuning multimodal models, we first develop a high-fidelity meme-to-text pipeline that converts visual memes into detail-preserving textual descriptions. This design decouples meme interpretation from meme classification, thus avoiding immediate reasoning over complex raw visual content and enabling resource-efficient harmful meme detection with general large language models (LLMs). Building on these textual descriptions, we further incorporate targeted, interpretable human-crafted guidelines to guide models’ reasoning under zero-shot CoT prompting. As such, this framework allows for easy adaptation to different harmfulness detection criteria across platforms, regions, and over time, offering high flexibility and explainability. Extensive experiments on seven benchmark datasets validate the effectiveness of our framework, highlighting its potential for explainable and low-resource harmful meme detection using small-scale LLMs. Codes and data are available at: https://anonymous.4open.science/r/HMC-AF2B/README.md.



Article 447

Title@2025-06-10 (2): Textual Unlearning Gives a False Sense of Unlearning

Title: Textual Unlearning Gives a False Sense of Unlearning Textuelles Nichtlernen gibt einen falschen Sinn für das Nichtlernen 文字不学习给人一个错误的不学习感 2406.13348v3

Authors (6): Jiacheng Du, Zhibo Wang, Jie Zhang, Xiaoyi Pang, Jiahui Hu, Kui Ren

Language Models (LMs) are prone to “memorizing” training data, including substantial sensitive user information. To mitigate privacy risks and safeguard the right to be forgotten, machine unlearning has emerged as a promising approach for enabling LMs to efficiently “forget” specific texts. However, despite the good intentions, is textual unlearning really as effective and reliable as expected? To address the concern, we first propose Unlearning Likelihood Ratio Attack+ (U-LiRA+), a rigorous textual unlearning auditing method, and find that unlearned texts can still be detected with very high confidence after unlearning. Further, we conduct an in-depth investigation on the privacy risks of textual unlearning mechanisms in deployment and present the Textual Unlearning Leakage Attack (TULA), along with its variants in both black- and white-box scenarios. We show that textual unlearning mechanisms could instead reveal more about the unlearned texts, exposing them to significant membership inference and data reconstruction risks. Our findings highlight that existing textual unlearning actually gives a false sense of unlearning, underscoring the need for more robust and secure unlearning mechanisms.



Article 448

Title@2025-06-10 (2): SHARE: Shared Memory-Aware Open-Domain Long-Term Dialogue Dataset Constructed from Movie Script

Title: SHARE: Shared Memory-Aware Open-Domain Long-Term Dialogue Dataset Constructed from Movie Script SHARE: Shared Memory-Aware Open-Domain Langzeitdialogdatensatz aus Movie Script SHARE: 从电影脚本建构的内存- 内存- 内存- 公用 Open- Domain 长期对话数据集 2410.20682v2

Authors (3): Eunwon Kim, Chanho Park, Buru Chang

Shared memories between two individuals strengthen their bond and are crucial for facilitating their ongoing conversations. This study aims to make long-term dialogue more engaging by leveraging these shared memories. To this end, we introduce a new long-term dialogue dataset named SHARE, constructed from movie scripts, which are a rich source of shared memories among various relationships. Our dialogue dataset contains the summaries of persona information and events of two individuals, as explicitly revealed in their conversation, along with implicitly extractable shared memories. We also introduce EPISODE, a long-term dialogue framework based on SHARE that utilizes shared experiences between individuals. Through experiments using SHARE, we demonstrate that shared memories between two individuals make long-term dialogues more engaging and sustainable, and that EPISODE effectively manages shared memories during dialogue. Our dataset and code are available at https://github.com/e1kim/SHARE.



Article 449

Title@2025-06-10 (2): Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence

Title: Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence Jenseits von Induktionsköpfen: Meta-Lernen im Kontext führt zu Mehrphasigkeit 超越上岗前头:内文元学习促进多阶段巡回新出现 2505.16694v2

Authors (5): Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo

Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through a sudden jump in accuracy, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model’s circuit during training. Specifically, we extend the copy task from previous research into an In-Context Meta Learning setting, where models must infer a task from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phase change in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis leads to a deeper understanding of the source of the transformer’s ICL ability.



Article 450

Title@2025-06-10 (2): Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language

Title: Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language Flexible Werkzeugauswahl durch Low-dimensionale Attributausrichtung von Vision und Sprache 通过视力和语言的低维属性一致进行灵活工具选择 2505.22146v2

Authors (6): Guangfu Hao, Haojie Wen, Liangxuna Guo, Yang Chen, Yanchao Bi, Shan Yu

Flexible tool selection reflects a complex cognitive ability that distinguishes humans from other species, yet computational models that capture this ability remain underdeveloped. We developed a framework using low-dimensional attribute representations to bridge visual tool perception and linguistic task understanding. We constructed a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 carefully designed attributes spanning physical, functional, and psychological properties, paired with natural language scenarios describing tool usage. Visual encoders (ResNet or ViT) extract attributes from tool images while fine-tuned language models (GPT-2, LLaMA, DeepSeek) derive required attributes from task descriptions. Our approach achieves 74% accuracy in tool selection tasks, significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%), while approaching performance of much larger models like GPT-4o (73%) with substantially fewer parameters. Ablation studies revealed that manipulation-related attributes (graspability, hand-relatedness, elongation) consistently prove most critical across modalities. This work provides a parameter-efficient, interpretable solution that mimics human-like tool cognition, advancing both cognitive science understanding and practical applications in tool selection tasks.



Article 451

Title@2025-06-10 (2): Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews

Title: Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews Verstehen Sie Anwendermeinungen von großen Sprachmodellen über LLM-Powered In-the-Moment User Experience Interviews 理解大语言模式用户通过LLM 带动的动态用户经验访谈对大语言模式的看法 2502.15226v2

Authors (5): Mengqiao Liu, Tevin Wang, Cassandra A. Cohen, Sarah Li, Chenyan Xiong

Which large language model (LLM) is better? Every evaluation tells a story, but what do users really think about current LLMs? This paper presents CLUE, an LLM-powered interviewer that conducts in-the-moment user experience interviews, right after users interact with LLMs, and automatically gathers insights about user opinions from massive interview logs. We conduct a study with thousands of users to understand user opinions on mainstream LLMs, recruiting users to first chat with a target LLM and then be interviewed by CLUE. Our experiments demonstrate that CLUE captures interesting user opinions, e.g., the bipolar views on the displayed reasoning process of DeepSeek-R1 and demands for information freshness and multi-modality. Our code and data are at https://github.com/cxcscmu/LLM-Interviewer.



Article 452

Title@2025-06-10 (2): Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models

Title: Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models Mixture of Decoding: Eine aufmerksamkeitsinspirierte Adaptive Decoding-Strategie zur Mititation von Halluzinationen in großen Vision-Sprachen-Modellen 解码混合:大型视觉语言模型中模拟幻觉的引人注意的调适解码战略 2505.17061v3

Authors (6): Xinlong Chen, Yuanxing Zhang, Qiang Liu, Junfei Wu, Fuzheng Zhang, Tieniu Tan

Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model’s attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model’s attended image tokens, and uses this consistency to judge whether the attention is correct. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. The code is available at https://github.com/xlchen0205/MoD.
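
The switch between a complementary and a contrastive strategy can be illustrated with a toy next-token example. This is a rough sketch under stated assumptions, not the released MoD implementation: consistency is measured here with the Jensen-Shannon divergence between the two next-token distributions, and the `threshold`, `alpha`, and `beta` parameters and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def mixture_of_decoding(logits_full, logits_attended,
                        threshold=0.1, alpha=0.5, beta=0.5):
    """Combine logits from all image tokens and from attended tokens only.

    Assumed rule: low divergence means the two views agree, so they are
    added (complementary); otherwise the attended view is subtracted
    (contrastive). Returns the combined distribution and the agreement flag.
    """
    consistent = js_divergence(softmax(logits_full),
                               softmax(logits_attended)) < threshold
    if consistent:
        combined = [a + alpha * b
                    for a, b in zip(logits_full, logits_attended)]
    else:
        combined = [(1 + beta) * a - beta * b
                    for a, b in zip(logits_full, logits_attended)]
    return softmax(combined), consistent


# Toy three-token vocabulary: one agreeing pair and one disagreeing pair.
probs_agree, agree = mixture_of_decoding([2.0, 1.0, 0.0], [2.1, 0.9, 0.0])
probs_clash, clash = mixture_of_decoding([2.0, 1.0, 0.0], [0.0, 1.0, 3.0])
```

In the disagreeing case the contrastive branch pushes probability away from the token favored only by the attended view.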



Article 453

Title@2025-06-10 (2): Understanding Bias Reinforcement in LLM Agents Debate

Title: Understanding Bias Reinforcement in LLM Agents Debate Verständnis der Bias-Verstärkung in LLM-Agenten-Debatte 了解LLLM代理商的强化申请 2503.16814v3

Authors (4): Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun

Large Language Models (LLMs) solve complex problems using training-free methods like prompt engineering and in-context learning, yet ensuring reasoning correctness remains challenging. While self-correction methods such as self-consistency and self-refinement aim to improve reliability, they often reinforce biases due to the lack of effective feedback mechanisms. Multi-Agent Debate (MAD) has emerged as an alternative, but we identify two key limitations: bias reinforcement, where debate amplifies model biases instead of correcting them, and lack of perspective diversity, as all agents share the same model and reasoning patterns, limiting true debate effectiveness. To systematically evaluate these issues, we introduce MetaNIM Arena, a benchmark designed to assess LLMs in adversarial strategic decision-making, where dynamic interactions influence optimal decisions. To overcome MAD’s limitations, we propose DReaMAD (Diverse Reasoning via Multi-Agent Debate with Refined Prompt), a novel framework that (1) refines LLM’s strategic prior knowledge to improve reasoning quality and (2) promotes diverse viewpoints within a single model by systematically modifying prompts, reducing bias. Empirical results show that DReaMAD significantly improves decision accuracy, reasoning diversity, and bias mitigation across multiple strategic tasks, establishing it as a more effective approach for LLM-based decision-making.



Article 454

Title@2025-06-10 (2): A Survey on Large Language Models for Mathematical Reasoning

Title: A Survey on Large Language Models for Mathematical Reasoning Eine Umfrage zu großen Sprachmodellen für mathematische Vernunft 数学理由大语言模型调查 2506.08446v1

Authors (11): Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Yi-Di Wang, Shu Yan, Cheng-Xing Jia, Xu-Hui Liu, Xin-Wei Chen, Jia-Cheng Xu, Ziniu Li, Yang Yu

Mathematical reasoning has long represented one of the most fundamental and challenging frontiers in artificial intelligence research. In recent years, large language models (LLMs) have achieved significant advances in this area. This survey examines the development of mathematical reasoning abilities in LLMs through two high-level cognitive phases: comprehension, where models gain mathematical understanding via diverse pretraining strategies, and answer generation, which has progressed from direct prediction to step-by-step Chain-of-Thought (CoT) reasoning. We review methods for enhancing mathematical reasoning, ranging from training-free prompting to fine-tuning approaches such as supervised fine-tuning and reinforcement learning, and discuss recent work on extended CoT and “test-time scaling”. Despite notable progress, fundamental challenges remain in terms of capacity, efficiency, and generalization. To address these issues, we highlight promising research directions, including advanced pretraining and knowledge augmentation techniques, formal reasoning frameworks, and meta-generalization through principled learning paradigms. This survey tries to provide some insights for researchers interested in enhancing reasoning capabilities of LLMs and for those seeking to apply these techniques to other domains.



Article 455

Title@2025-06-10 (2): Olica: Efficient Structured Pruning of Large Language Models without Retraining

Title: Olica: Efficient Structured Pruning of Large Language Models without Retraining Olica: Effizientes strukturiertes Beschneiden großer Sprachmodelle ohne Umschulung Olica:无再培训的大型语言模型高效结构稳健 2506.08436v1

Authors (2): Jiujun He, Huazhen Lin

Most existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish the corrupted correlations, making them prohibitively expensive. To address this, we propose a pruning framework for LLMs called Orthogonal decomposition and Linear Calibration (Olica), which eliminates the need for retraining. A key observation is that the multi-head attention (MHA) layer depends on two types of matrix products. By treating these matrix products as unified entities and applying principal component analysis (PCA), we extract the most important information to compress LLMs without sacrificing accuracy or disrupting their original structure. Consequently, retraining becomes unnecessary. A fast decomposition method is devised, reducing the complexity of PCA by a factor of the square of the number of attention heads. Additionally, to mitigate the error accumulation problem caused by pruning the feed-forward network (FFN) layer, we introduce a linear calibration method to reconstruct the residual errors of pruned layers using low-rank matrices. By leveraging singular value decomposition (SVD) on the solution of the least-squares problem, these matrices are obtained without requiring retraining. Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks.



Article 456

Title@2025-06-10 (2): FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

Title: FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation FlagEvalMM: Flexibler Rahmen für eine umfassende multimodale Modellbewertung FlaignEvalMMM:综合多式联运模式评价灵活框架 2506.09081v1

Authors (7): Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Richeng Xuan, Jin-Ge Yao, Xi Yang

We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.



Article 457

Title@2025-06-10 (2): Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models

Title: Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models Know-MRI: Ein Wissensmechanismus Revealer&Interpreter für große Sprachmodelle Know-MRI:大语言模型的知识机制传输和解释 2506.08427v1

Authors (11): Jiaxiang Liu, Boxuan Xing, Chenhao Yuan, Chenxiang Zhang, Di Wu, Xiusheng Huang, Haida Yu, Chuhan Lang, Pengfei Cao, Jun Zhao, Kang Liu

As large language models (LLMs) continue to advance, there is a growing urgency to enhance the interpretability of their internal knowledge mechanisms. Consequently, many interpretation methods have emerged, aiming to unravel the knowledge mechanisms of LLMs from various perspectives. However, current interpretation methods differ in input data formats and interpreting outputs. The tools integrating these methods are only capable of supporting tasks with specific inputs, significantly constraining their practical applications. To address these challenges, we present an open-source Knowledge Mechanisms Revealer&Interpreter (Know-MRI) designed to analyze the knowledge mechanisms within LLMs systematically. Specifically, we have developed an extensible core module that can automatically match different input data with interpretation methods and consolidate the interpreting outputs. It enables users to freely choose appropriate interpretation methods based on the inputs, making it easier to comprehensively diagnose the model’s internal knowledge mechanisms from multiple perspectives. Our code is available at https://github.com/nlpkeg/Know-MRI. We also provide a demonstration video on https://youtu.be/NVWZABJ43Bs.



Article 458

Title@2025-06-10 (2): P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts

Title: P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts P-反应:通过专门 LoRA 专家混合组合,综合个人经历专题-适应性反应 2406.12548v2

Authors (5): Yuhao Dan, Jie Zhou, Qin Chen, Junfeng Tian, Liang He

Personalized large language models (LLMs) have attracted great attention in many applications, such as emotional support and role-playing. However, existing works primarily focus on modeling explicit character profiles, while ignoring the underlying personality traits that truly shape behaviors and decision-making, hampering the development of more anthropomorphic and psychologically-grounded AI systems. In this paper, we explore the modeling of Big Five personality traits, which is the most widely used trait theory in psychology, and propose P-React, a mixture of experts (MoE)-based personalized LLM. Particularly, we integrate a Personality Specialization Loss (PSL) to better capture individual trait expressions, providing a more nuanced and psychologically grounded personality simulacrum. To facilitate research in this field, we curate OCEAN-Chat, a high-quality, human-verified dataset designed to train LLMs in expressing personality traits across diverse topics. Extensive experiments demonstrate the effectiveness of P-React in maintaining consistent and real personality.



Article 459

Title@2025-06-10 (2): DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data

Title: DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data DISCO balanciert die Skalen: Adaptives Domain- und Schwierigkeitsgrad-Bewusst-Verstärkung-Lernen auf unausgewogenen Daten DISCO 平衡规模:关于数据不平衡的适应性域域和困难性软件强化学习 2505.15074v2

Authors (10): Yuhang Zhou, Jing Zhu, Shengyi Qian, Zhuokai Zhao, Xiyao Wang, Xiaoyu Liu, Ming Li, Paiheng Xu, Wei Ai, Furong Huang

Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups - assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks.
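
The two scaling terms can be sketched on top of GRPO-style group-relative advantages. This is an illustrative sketch only: the inverse-frequency domain weight, the majority-agreement measure of self-consistency, the `1 + (1 - consistency)` difficulty weight, and all function and field names are assumptions, since the abstract does not give the exact formulas.

```python
from collections import Counter
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def disco_scaled_advantages(groups, eps=1e-6):
    """Scale per-prompt advantages by domain rarity and prompt uncertainty.

    Each group is a dict with a 'domain' label plus the sampled 'answers'
    and their 'rewards' for one prompt (hypothetical field names).
    """
    domain_counts = Counter(g["domain"] for g in groups)
    n_domains, n_groups = len(domain_counts), len(groups)
    scaled = []
    for g in groups:
        # Domain-aware weight: inverse frequency, equal to 1 when balanced.
        domain_w = n_groups / (n_domains * domain_counts[g["domain"]])
        # Self-consistency: share of samples agreeing with the majority answer.
        top_count = Counter(g["answers"]).most_common(1)[0][1]
        consistency = top_count / len(g["answers"])
        # Difficulty-aware weight: less consistent prompts get more weight.
        difficulty_w = 1.0 + (1.0 - consistency)
        scale = domain_w * difficulty_w
        scaled.append([scale * a
                       for a in group_advantages(g["rewards"], eps)])
    return scaled


# Toy batch: three prompts from a dominant domain, one from a rare domain.
sample = {"answers": ["a", "a", "a", "b"], "rewards": [1.0, 1.0, 1.0, 0.0]}
batch = [dict(sample, domain="math") for _ in range(3)]
batch.append(dict(sample, domain="law"))
advantages = disco_scaled_advantages(batch)
```

With identical rewards, the prompt from the rare domain receives advantages three times larger than those from the dominant domain, which is the frequency-bias correction the abstract describes.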


Article 460

Title@2025-06-10 (2): Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens

Title: Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens Große Sprachmodelle haben eine intrinsische Meta-Kognition, benötigen aber eine gute Linse 大语言模型具有内在的外形 – – 认知,但需要好的透镜 2506.08410v1

Authors (4): Ziyang Ma, Qingyue Yuan, Zhenglin Wang, Deyu Zhou

Previous research has primarily focused on the cognitive error detection capabilities of Large Language Models (LLMs), often prompting them to analyze mistakes in reasoning chains. However, few studies have examined the meta-cognitive abilities of LLMs (e.g., their self-awareness of step errors), which are crucial for their reliability. While studies on LLM self-evaluation present some measures, such as perplexity, which can reflect the answer correctness and be viewed as the lens of meta-cognition, they lack step-level analysis and adaptation. This paper studies the evaluation of LLM meta-cognition using the current lenses and how to improve these lenses. Specifically, we propose AutoMeco, an Automated Meta-cognition Evaluation framework for benchmarking the existing lenses. Furthermore, a training-free Markovian Intrinsic Reward Adjustment strategy, MIRA, is proposed to boost current meta-cognition lenses. Experimental results on three mathematical reasoning datasets and three LLMs show the reasonableness of AutoMeco by comparing it with Best-of-N verification. Moreover, the meta-cognition ability of LLMs can be better evaluated using MIRA.


Article 461

Title@2025-06-10 (2): RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

Title: RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation RLHS: Fehlausrichtung in RLHF mit Hindsight Simulation abmildern RLHS: 减轻RLHF的差错,采用速光模拟模拟 2501.08617v3

Authors (5): Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac

While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI’s output, inducing Goodhart’s law dynamics. We present a theoretical analysis showing that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions. Crucially, the result holds even if the observed outcomes are sampled from the AI’s own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We validate RLHS across three consultancy settings (marketplace interactions, restaurant recommendations, and online course advising) using both online (PPO) and offline (DPO) fine-tuning methods, and show that it substantially improves alignment over RLHF in experiments and human evaluations. We perform post-hoc benchmark evaluations on TruthfulQA, HaluEval, and TrustLLM, finding that even after single-task fine-tuning, RLHF misalignment persists, whereas RLHS consistently outperforms baselines and demonstrates robust alignment generalization. The project webpage and code are available at https://rl-hindsight.github.io.


Article 462

Title@2025-06-10 (2): mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks

Title: mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks mSTEB: Massive mehrsprachige Bewertung von LLMs zu Sprach- und Textaufgaben mSTEB: 对关于发言和文本任务LLM女士进行大规模多语种评价 2506.08400v1

Authors (7): Luel Hagos Beyene, Vivek Verma, Min Ma, Jesujoba O. Alabi, Fabian David Schmidt, Joyce Nakatumba-Nabende, David Ifeoluwa Adelani

Large language models (LLMs) have demonstrated impressive performance on a wide range of tasks, including in multimodal settings such as speech. However, their evaluation is often limited to English and a few high-resource languages. For low-resource languages, there is no standardized evaluation benchmark. In this paper, we address this gap by introducing mSTEB, a new benchmark to evaluate the performance of LLMs on a wide range of tasks covering language identification, text classification, question answering, and translation in both speech and text modalities. We evaluated leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in performance between high-resource and low-resource languages, especially for languages spoken in Africa and the Americas/Oceania. Our findings show that more investment is needed to address their under-representation in LLM coverage.


Article 463

Title@2025-06-10 (2): Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

Title: Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR SSL-Diskrete Sprachfunktionen für Zipformer-basierte Kontext-ASR erkunden 探索基于 Zipef 的本地语背景 ASR 的 SSL 分辨语音特征 2409.08797v2

Authors (10): Mingyu Cui, Yifan Yang, Jiajun Deng, Jiawen Kang, Shujie Hu, Tianzi Wang, Zhaoqing Li, Shiliang Zhang, Xie Chen, Xunying Liu

Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems. The efficacy of replacing Fbank features with discrete token features for modelling cross-utterance contexts (from preceding and future segments), the current utterance’s internal contexts alone, or both at the same time is demonstrated thoroughly on the Gigaspeech 1000-hr corpus. The best Zipformer-Transducer system using discrete-token-based cross-utterance context features outperforms the baseline using utterance internal context only, with statistically significant word error rate (WER) reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) on the dev and test data. The lowest published WERs of 11.15% and 11.14% were obtained on the dev and test sets. Our work is open-source and publicly available at https://github.com/open-creator/icefall/tree/master/egs/gigaspeech/Context_ASR.


Article 464

Title@2025-06-10 (2): TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

Title: TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression TL;DR: Zu lange, re-Gewichtung für effiziente LLM-Reasoning-Kompression TL;DR:太长,为高效 LLM 合理压缩而重新加权 2506.02678v2

Authors (15): Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Ying Nian Wu, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, Cheng-Lin Liu

Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning, especially during inference with extremely long outputs, has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model’s System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model’s reasoning capability. We validate our approach on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels. Our method reduces the number of output tokens by nearly 40% while maintaining reasoning accuracy. Our code and data will be available soon.


Article 465

Title@2025-06-10 (2): Reinforcement Learning Teachers of Test Time Scaling

Title: Reinforcement Learning Teachers of Test Time Scaling Verstärktes Lernen von Lehrern der Testzeitskalierung 测试时间尺度强化学习教师 2506.08388v1

Authors (3): Edoardo Cetin, Tianyu Zhao, Yujin Tang

Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL’s exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply “connect-the-dots” with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem’s solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.


Article 466

Title@2025-06-10 (2): Reinforce LLM Reasoning through Multi-Agent Reflection

Title: Reinforce LLM Reasoning through Multi-Agent Reflection Verstärkung der LLM-Vernunft durch Multi-Agent-Reflexion 通过多方机构反省说明理由 2506.08379v1

Authors (2): Yurun Yuan, Tengyang Xie

Leveraging more test-time computation has proven to be an effective way to boost the reasoning capabilities of large language models (LLMs). Among various methods, the verify-and-improve paradigm stands out for enabling dynamic solution exploration and feedback incorporation. However, existing approaches often suffer from restricted feedback spaces and lack of coordinated training of different parties, leading to suboptimal performance. To address this, we model this multi-turn refinement process as a Markov Decision Process and introduce DPSDP (Direct Policy Search by Dynamic Programming), a reinforcement learning algorithm that trains an actor-critic LLM system to iteratively refine answers via direct preference learning on self-generated data. Theoretically, DPSDP can match the performance of any policy within the training distribution. Empirically, we instantiate DPSDP with various base models and show improvements on both in- and out-of-distribution benchmarks. For example, on benchmark MATH 500, majority voting over five refinement steps increases first-turn accuracy from 58.2% to 63.2% with Ministral-based models. An ablation study further confirms the benefits of multi-agent collaboration and out-of-distribution generalization.
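
The majority-voting step mentioned for MATH 500 can be sketched as follows. This is only an illustration of voting over the answers produced at successive refinement turns; the answer strings are made up, and the helper name `majority_vote` is an assumption rather than anything from the paper.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from the first turn and four refinements.
answers_per_turn = ["12", "15", "12", "12", "15"]
final_answer = majority_vote(answers_per_turn)
```

Voting over all turns, rather than taking only the last refinement, guards against a late revision that moves away from a correct answer.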


Article 467

Title@2025-06-10 (2): EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models

Title: EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models EIFBENCH: Extrem komplexe Instruktion nach Benchmark für große Sprachmodelle EIFBENCH: 按照大语言模式基准进行极为复杂的指示 2506.08375v1

Authors (6): Tao Zou, Xinghua Zhang, Haiyang Yu, Minzheng Wang, Fei Huang, Yongbin Li

With the development and widespread application of large language models (LLMs), the new paradigm of “Model as Product” is rapidly evolving, and demands higher capabilities to address complex user needs, often requiring precise workflow execution which involves the accurate understanding of multiple tasks. However, existing benchmarks focusing on single-task environments with limited constraints lack the complexity required to fully reflect real-world scenarios. To bridge this gap, we present the Extremely Complex Instruction Following Benchmark (EIFBENCH), meticulously crafted to facilitate a more realistic and robust evaluation of LLMs. EIFBENCH not only includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently, but also integrates a variety of constraints, replicating complex operational environments. Furthermore, we propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM’s ability to accurately fulfill multi-task workflow. Evaluations on EIFBENCH have unveiled considerable performance discrepancies in existing LLMs when challenged with these extremely complex instructions. This finding underscores the necessity for ongoing optimization to navigate the intricate challenges posed by LLM applications.


Article 468

Title@2025-06-10 (2): Draft-based Approximate Inference for LLMs

Title: Draft-based Approximate Inference for LLMs Entwurfsbasierte annähernde Schlussfolgerung für LLM LLMM 的基于草案的近似推论 2506.08373v1

Authors (6): Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee

Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, which leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model’s attention activations to identify and discard unimportant prompt tokens. To the best of our knowledge, this is the first work to use draft models for approximate LLM inference acceleration, extending their utility beyond traditional lossless speculative decoding. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at https://github.com/furiosa-ai/draft-based-approx-llm.
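
A rough sketch of the cache-dropping idea: given one importance score per cached position, keep only the highest-scoring entries. How SpecKV derives those scores from the draft output is not specified in the abstract, so the scores below are placeholders and `keep_top_k` is a hypothetical helper.

```python
def keep_top_k(importance, k):
    """Indices of the k most important cache positions, in original order."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])

# Placeholder importance scores, assumed to come from a small draft model.
scores = [0.1, 0.9, 0.3, 0.7]
kept_positions = keep_top_k(scores, k=2)
```

Returning the kept indices in their original order preserves the positional ordering of the surviving KV pairs.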


Article 469

Title@2025-06-10 (2): DEAL: Disentangling Transformer Head Activations for LLM Steering

Title: DEAL: Disentangling Transformer Head Activations for LLM Steering DEAL: Depentangling Transformer Head Aktivierungen für LLM Steering DDEL: 解除LLM指导的变压器头动动 2506.08359v1

Authors (6): Li-Ming Zhan, Bo Liu, Zexin Lu, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu

Inference-time steering aims to alter the response characteristics of large language models (LLMs) without modifying their underlying parameters. A critical step in this process is the identification of internal modules within LLMs that are associated with the target behavior. However, current approaches to module selection often depend on superficial cues or ad-hoc heuristics, which can result in suboptimal or unintended outcomes. In this work, we propose a principled causal-attribution framework for identifying behavior-relevant attention heads in transformers. For each head, we train a vector-quantized autoencoder (VQ-AE) on its attention activations, partitioning the latent space into behavior-relevant and behavior-irrelevant subspaces, each quantized with a shared learnable codebook. We assess the behavioral relevance of each head by quantifying the separability of VQ-AE encodings for behavior-aligned versus behavior-violating responses using a binary classification metric. This yields a behavioral relevance score that reflects each head's discriminative capacity with respect to the target behavior, guiding both selection and importance weighting. Experiments on seven LLMs from two model families and five behavioral steering datasets demonstrate that our method enables more accurate inference-time interventions, achieving superior performance on the truthfulness-steering task. Furthermore, the heads selected by our approach exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios.


Article 470

Title@2025-06-10 (2): Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning

Title: Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning Text-Embeddings sollten Implizite Semantik erfassen, nicht nur Oberflächenbedeutung 文字嵌入应捕获隐含语义,而不仅仅是表面含义 2506.08354v1

Authors (4): Yiqun Sun, Qiang Huang, Anthony K. H. Tung, Jun Yu

This position paper argues that the text embedding research community should move beyond surface meaning and embrace implicit semantics as a central modeling goal. Text embedding models have become foundational in modern NLP, powering a wide range of applications and drawing increasing research attention. Yet, much of this progress remains narrowly focused on surface-level semantics. In contrast, linguistic theory emphasizes that meaning is often implicit, shaped by pragmatics, speaker intent, and sociocultural context. Current embedding models are typically trained on data that lacks such depth and evaluated on benchmarks that reward the capture of surface meaning. As a result, they struggle with tasks requiring interpretive reasoning, speaker stance, or social meaning. Our pilot study highlights this gap, showing that even state-of-the-art models perform only marginally better than simplistic baselines on implicit semantics tasks. To address this, we call for a paradigm shift: embedding research should prioritize more diverse and linguistically grounded training data, design benchmarks that evaluate deeper semantic understanding, and explicitly frame implicit meaning as a core modeling objective, better aligning embeddings with real-world language complexity.


Article 471

Title@2025-06-10 (2): How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models

Title: How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models How much to Guide: Neuvisiting Adaptive Guidance in Klassifier-Free Guidance Text-to-Vision Diffusion Models 《指南》的多少:在无分类指导、无分类指导、文字到视觉传播模型中重新审视适应性指导 2506.08351v1

Authors (3): Huixuan Zhang, Junzhe Zhang, Xiaojun Wan

With the rapid development of text-to-vision generation diffusion models, classifier-free guidance has emerged as the most prevalent method for conditioning. However, this approach inherently requires twice as many model forward passes as unconditional generation, resulting in significantly higher costs. While a previous study introduced the concept of adaptive guidance, it lacks solid analysis and empirical results, so that method cannot be applied to general diffusion models. In this work, we present another perspective on applying adaptive guidance and propose Step AG, a simple, universally applicable adaptive guidance strategy. Our evaluations focus on both image quality and image-text alignment. The results indicate that restricting classifier-free guidance to the first several denoising steps is sufficient for generating high-quality, well-conditioned images, achieving an average speedup of 20% to 30%. This improvement is consistent across different settings, such as the number of inference steps, and across various models, including video generation models, highlighting the superiority of our method.
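
To make the cost argument concrete, here is a small sketch of the standard classifier-free guidance combination and of the forward-pass count when guidance is applied only to the first few denoising steps. It assumes that unguided steps use a single conditional pass; the abstract does not say so explicitly, and the function names are illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by the guidance scale."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def forward_passes(num_steps, guided_steps):
    """Model forward passes when only the first `guided_steps` steps use CFG.

    Guided steps need two passes (conditional and unconditional); the
    remaining steps are assumed to need one conditional pass each.
    """
    guided = min(guided_steps, num_steps)
    return 2 * guided + (num_steps - guided)

full_cfg = forward_passes(num_steps=50, guided_steps=50)
early_only = forward_passes(num_steps=50, guided_steps=10)
```

The saving in forward passes is an upper bound on the wall-clock gain, since sampling also includes work that guidance does not affect.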


Article 472

Title@2025-06-10 (2): Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving

Title: Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving Bewertung von LLMs auf multi-kognitiven Ebenen: Vom medizinischen Wissen zur szenariobasierten Problemlösung 跨多级别评价LLMs:从医学知识掌握到基于设想的解决问题 2506.08349v1

Authors (11): Yuxuan Zhou, Xien Liu, Chenwei Yan, Chen Ning, Xiao Zhang, Boxun Li, Xiangling Fu, Shijin Wang, Guoping Hu, Yu Wang, Ji Wu

Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom’s Taxonomy, we propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain in this study. The framework integrates existing medical datasets and introduces tasks targeting three cognitive levels: preliminary knowledge grasp, comprehensive knowledge application, and scenario-based problem solving. Using this framework, we systematically evaluate state-of-the-art general and medical LLMs from six prominent families: Llama, Qwen, Gemma, Phi, GPT, and DeepSeek. Our findings reveal a significant performance decline as cognitive complexity increases across evaluated models, with model size playing a more critical role in performance at higher cognitive levels. Our study highlights the need to enhance LLMs’ medical capabilities at higher cognitive levels and provides insights for developing LLMs suited to real-world medical applications.


Article 473

Title@2025-06-10 (2): SPBA: Utilizing Speech Large Language Model for Backdoor Attacks on Speech Classification Models

Title: SPBA: Utilizing Speech Large Language Model for Backdoor Attacks on Speech Classification Models SPBA: Sprachmodell für Hintertürangriffe auf Sprachklassifikationsmodelle nutzen SSPBA: 使用大语言语音模式进行语言分类模式的后门攻击 2506.08346v1

Authors (6): Wenhan Yao, Fen Xiao, Xiarun Chen, Jia Liu, YongQiang He, Weiping Wen

Deep speech classification tasks, including keyword spotting and speaker verification, are vital in speech-based human-computer interaction. Recently, the security of these technologies has been revealed to be susceptible to backdoor attacks. Specifically, attackers use noisy disruption triggers and speech element triggers to produce poisoned speech samples that train models to become vulnerable. However, these methods typically create only a limited number of backdoors due to the inherent constraints of the trigger function. In this paper, we propose that speech backdoor attacks can strategically focus on speech elements such as timbre and emotion, leveraging the Speech Large Language Model (SLLM) to generate diverse triggers. Increasing the number of triggers may disproportionately elevate the poisoning rate, resulting in higher attack costs and a lower success rate per trigger. We introduce the Multiple Gradient Descent Algorithm (MGDA) as a mitigation strategy to address this challenge. The proposed attack is called the Speech Prompt Backdoor Attack (SPBA). Building on this foundation, we conducted attack experiments on two speech classification tasks, demonstrating that SPBA shows significant trigger effectiveness and achieves exceptional performance in attack metrics.


Article 474

Title@2025-06-10 (2): Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency

Title: Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency Warten Sie, wir brauchen nicht zu “warten”! Entfernen von Gedanken-Tokens verbessert vernünftige Effizienz 等等,我们不需要”等等”! 2506.08343v1

Authors (6): Chenlong Wang, Yuanning Feng, Dongping Chen, Zhaoyang Chu, Ranjay Krishna, Tianyi Zhou

Recent advances in large reasoning models have enabled complex, step-by-step reasoning but often introduce significant overthinking, resulting in verbose and redundant outputs that hinder efficiency. In this study, we examine whether explicit self-reflection, signaled by tokens such as “Wait” and “Hmm”, is necessary for advanced reasoning. We propose NoWait, a simple yet effective approach that disables explicit self-reflection by suppressing these tokens during inference. Extensive experiments on ten benchmarks across textual, visual, and video reasoning tasks show that NoWait reduces chain-of-thought trajectory length by up to 27%-51% in five R1-style model series, without compromising model utility. NoWait thus offers a plug-and-play solution for efficient and utility-preserving multimodal reasoning.
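
Suppressing specific tokens during inference is commonly done by masking their logits before sampling. The sketch below shows that general mechanism on a toy vocabulary; it is not the authors' implementation, and the vocabulary, logit values, and function names are invented for illustration.

```python
import math

def suppress_tokens(logits, banned_ids):
    """Return a copy of the logits with banned token ids set to -inf."""
    return [(-math.inf if i in banned_ids else x) for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary and next-token logits (hypothetical values).
vocab = ["The", "Wait", "Hmm", "answer"]
banned = {vocab.index("Wait"), vocab.index("Hmm")}
logits = [0.1, 2.5, 1.9, 1.2]

unconstrained = vocab[greedy_pick(logits)]
chosen = vocab[greedy_pick(suppress_tokens(logits, banned))]
```

Because the mask is applied at decoding time only, this kind of intervention needs no retraining, which matches the plug-and-play framing in the abstract.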


Article 475

Title@2025-06-10 (2): RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language

Title: RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors und Natural Language 关于音频、视频、嵌入传感器和自然语言的问答答案 2505.17114v2

Authors (3): Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam

Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning, with each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks, including egocentric and exocentric tasks, show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.


Article 476

Title@2025-06-10 (2): TPP-LLM: Modeling Temporal Point Processes by Efficiently Fine-Tuning Large Language Models

Title: TPP-LLM: Modeling Temporal Point Processes by Efficiently Fine-Tuning Large Language Models TPP-LLM: Modellierung von Temporal Point-Prozessen durch effizientes Feintuning großer Sprachmodelle TPP-LLM:通过高效微调大语言模式模拟时点进程 2410.02062v2

Authors (2): Zefang Liu, Yinzhu Quan

Temporal point processes (TPPs) are widely used to model the timing and occurrence of events in domains such as social networks, transportation systems, and e-commerce. In this paper, we introduce TPP-LLM, a novel framework that integrates large language models (LLMs) with TPPs to capture both the semantic and temporal aspects of event sequences. Unlike traditional methods that rely on categorical event type representations, TPP-LLM directly utilizes the textual descriptions of event types, enabling the model to capture rich semantic information embedded in the text. While LLMs excel at understanding event semantics, they are less adept at capturing temporal patterns. To address this, TPP-LLM incorporates temporal embeddings and employs parameter-efficient fine-tuning (PEFT) methods to effectively learn temporal dynamics without extensive retraining. This approach improves both predictive accuracy and computational efficiency. Experimental results across diverse real-world datasets demonstrate that TPP-LLM outperforms state-of-the-art baselines in sequence modeling and event prediction, highlighting the benefits of combining LLMs with TPPs.


Article 477

Title@2025-06-10 (2): Curse of High Dimensionality Issue in Transformer for Long-context Modeling

Title: Curse of High Dimensionality Issue in Transformer for Long-context Modeling Fluch der Hochdimensionalitätsfrage im Transformer für die Langkontextmodellierung 变异器中高多维度问题的诅咒,用于长期建模 2505.22107v3

Authors (8): Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan

Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at https://github.com/bolixinyu/DynamicGroupAttention.


Article 478

Title@2025-06-10 (2): Institutional Books 1.0: A 242B token dataset from Harvard Library’s collections, refined for accuracy and usability

Title: Institutional Books 1.0: A 242B token dataset from Harvard Library’s collections, refined for accuracy and usability Institutionelle Bücher 1.0: Ein 242B-Token-Datensatz aus den Sammlungen der Harvard Library, verfeinert für Genauigkeit und Usability 机构手册1.0:哈佛图书馆收藏的242B象征性数据集,根据准确性和可用性加以完善 2506.08300v1

Authors (11): Matteo Cargnelutti, Catherine Brobston, John Hess, Jack Cushman, Kristi Mukk, Aristana Scourtas, Kyle Courtney, Greg Leppert, Amanda Watson, Martha Whitehead, Jonathan Zittrain

Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library’s participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library’s collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project’s goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.



Article 479

Title@2025-06-09 (1): From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?

Title: From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information? Von passiv zur aktiven Vernunft: Können große Sprachmodelle unter unvollständigen Informationen die richtigen Fragen stellen? 从被动到积极理由:大语言模式能够在不完整信息下询问正确的问题吗? 2506.08295v1

Authors (6): Zhanke Zhou, Xiao Feng, Zhaocheng Zhu, Jiangchao Yao, Sanmi Koyejo, Bo Han

While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast, active reasoning, where an LLM must interact with external systems to acquire missing evidence or data, has received little systematic attention. To address this shortfall, we present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM’s active reasoning skills. AR-Bench comprises three task families (detective cases, situation puzzles, and guessing numbers) that together simulate real-world, agentic scenarios and measure performance across commonsense, logical, and symbolic reasoning challenges. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning: they frequently fail to acquire or leverage the information needed to solve tasks. This gap highlights a stark divergence between their passive and active reasoning abilities. Moreover, ablation studies indicate that even advanced strategies, such as tree-based searching or post-training approaches, yield only modest gains and fall short of the levels required for real-world deployment. Collectively, these findings highlight the critical need to advance methodology for active reasoning, e.g., incorporating interactive learning, real-time feedback loops, and environment-aware objectives for training. The benchmark is publicly available at: https://github.com/tmlr-group/AR-Bench.



Article 480

Title@2025-06-09 (1): From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium

Title: From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium Von der Debatte zum Equilibrium: Glaube-getriebener Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium 从辩论到平衡:通过巴伊西亚纳什平衡进行 2506.08292v1

Authors (6): Xie Yi, Zhanke Zhou, Chentao Cao, Qiyu Niu, Tongliang Liu, Bo Han

Multi-agent frameworks can substantially boost the reasoning power of large language models (LLMs), but they typically incur heavy computational costs and lack convergence guarantees. To overcome these challenges, we recast multi-LLM coordination as an incomplete-information game and seek a Bayesian Nash equilibrium (BNE), in which each agent optimally responds to its probabilistic beliefs about the strategies of others. We introduce Efficient Coordination via Nash Equilibrium (ECON), a hierarchical reinforcement-learning paradigm that marries distributed reasoning with centralized final output. Under ECON, each LLM independently selects responses that maximize its expected reward, conditioned on its beliefs about co-agents, without requiring costly inter-agent exchanges. We mathematically prove that ECON attains a markedly tighter regret bound than non-equilibrium multi-agent schemes. Empirically, ECON outperforms existing multi-LLM approaches by 11.2% on average across six benchmarks spanning complex reasoning and planning tasks. Further experiments demonstrate ECON’s ability to flexibly incorporate additional models, confirming its scalability and paving the way toward larger, more powerful multi-LLM ensembles. The code is publicly available at: https://github.com/tmlr-group/ECON.



Article 481

Title@2025-06-09 (1): BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models

Title: BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models VerhaltenBox: Automatisierte Entdeckung von feinkörnigen Leistungsunterschieden zwischen Sprachmodellen 行为框:语言模型之间优异性能差异的自动发现 2506.02204v2

Authors (2): Lindia Tjuatja, Graham Neubig

Language model evaluation is a daunting task: prompts are brittle, corpus-level perplexities are vague, and the choices of benchmarks are endless. Finding examples that show meaningful, generalizable differences between two LMs is crucial to understanding where one model succeeds and another fails. Can this process be done automatically? In this work, we propose a methodology for automated comparison of language models that uses performance-aware contextual embeddings to find fine-grained features of text where one LM outperforms another. Our method, which we name BehaviorBox, extracts coherent features that demonstrate differences with respect to the ease of generation between two LMs. Specifically, BehaviorBox finds features that describe groups of words in fine-grained contexts, such as “conditional ‘were’ in the phrase ‘if you were’” and “exclamation marks after emotional statements”, where one model outperforms another within a particular dataset. We apply BehaviorBox to compare models that vary in size, model family, and post-training, and enumerate insights into specific contexts that illustrate meaningful differences in performance which cannot be found by measures such as corpus-level perplexity alone.



Article 482

Title@2025-06-09 (1): Voice Impression Control in Zero-Shot TTS

Title: Voice Impression Control in Zero-Shot TTS Sprach-Impression-Steuerung in Zero-Shot TTS 零热TTTS中的声音压抑控制 2506.05688v2

Authors (3): Keinichi Fujita, Shota Horiguchi, Yusuke Ijima

Para-/non-linguistic information in speech is pivotal in shaping the listeners’ impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method’s effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization. Audio examples are available on our demo page (https://ntt-hilab-gensp.github.io/is2025voiceimpression/).



Article 483

Title@2025-06-09 (1): Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in the Brain

Title: Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in the Brain Instruction-Tuned Video-Audio Modelle Elucidate Funktionelle Spezialisierung im Gehirn 脑中精明功能性专业 2506.08277v1

Authors (8): Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta

Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models in both unimodal and multimodal stimulus settings. More recently, instruction-tuned multimodal models have been shown to generate task-specific representations that align strongly with brain activity. However, prior work evaluating the brain alignment of MLLMs has primarily focused on unimodal settings or relied on non-instruction-tuned multimodal models for multimodal stimuli. To address this gap, we investigated brain alignment, that is, measuring the degree of predictivity of neural activity recorded while participants were watching naturalistic movies (video along with audio) with representations derived from MLLMs. We utilized instruction-specific embeddings from six video and two audio instruction-tuned MLLMs. Experiments with 13 video task-specific instructions show that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal (by 15%) and unimodal models (by 20%). Our evaluation of MLLMs for both video and audio tasks using language-guided instructions shows clear disentanglement in task-specific representations from MLLMs, leading to precise differentiation of multimodal functional processing in the brain. We also find that MLLM layers align hierarchically with the brain, with early sensory areas showing strong alignment with early layers, while higher-level visual and language regions align more with middle to late layers. These findings provide clear evidence for the role of task-specific instructions in improving the alignment between brain activity and MLLMs, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].



Article 484

Title@2025-06-09 (1): Retrieval-augmented systems can be dangerous medical communicators

Title: Retrieval-augmented systems can be dangerous medical communicators Retrieval-augmentierte Systeme können gefährliche medizinische Kommunikatoren sein 回收val-加压系统可以是危险的医疗通讯器 2502.14898v2

Authors (6): Lionel Wong, Ayman Ali, Raymond Xiong, Shannon Zeijang Shen, Yoon Kim, Monica Agrawal

Patients have long sought health information online, and increasingly, they are turning to generative AI to answer their health-related queries. Given the high stakes of the medical domain, techniques like retrieval-augmented generation and citation grounding have been widely promoted as methods to reduce hallucinations and improve the accuracy of AI-generated responses and have been widely adopted into search engines. This paper argues that even when these methods produce literally accurate content drawn from source documents sans hallucinations, they can still be highly misleading. Patients may derive significantly different interpretations from AI-generated outputs than they would from reading the original source material, let alone consulting a knowledgeable clinician. Through a large-scale query analysis on topics including disputed diagnoses and procedure safety, we support our argument with quantitative and qualitative evidence of the suboptimal answers resulting from current systems. In particular, we highlight how these models tend to decontextualize facts, omit critical relevant sources, and reinforce patient misconceptions or biases. We propose a series of recommendations – such as the incorporation of communication pragmatics and enhanced comprehension of source documents – that could help mitigate these issues and extend beyond the medical domain.



Article 485

Title@2025-06-09 (1): Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints

Title: Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints Verstärktes Lernen aus menschlichem Feedback mit hochkonfidenziellen Sicherheitseinschränkungen 在高度信任安全限制下,从人类反馈中加强学习 2506.08266v1

Authors (6): Yaswanth Chittepu, Blossom Metevier, Will Schwarzer, Austin Hoag, Scott Niekum, Philip S. Thomas

Existing approaches to language model alignment often treat safety as a tradeoff against helpfulness, which can lead to unacceptable responses in sensitive domains. To ensure reliable performance in such settings, we propose High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a method that provides high-confidence safety guarantees while maximizing helpfulness. Similar to previous methods, HC-RLHF explicitly decouples human preferences into helpfulness and harmlessness (safety), which are learned by training a reward model and a cost model, respectively. It then employs a two-step process to find safe solutions. In the first step, it optimizes the reward function under an intentionally pessimistic version of the cost constraint. In the second step, the trained model undergoes a safety test to verify whether its performance stays within an upper-confidence bound of the actual cost constraint. We provide a theoretical analysis of HC-RLHF, including proof that it will not return an unsafe solution with a probability greater than a user-specified threshold. For our empirical analysis, we apply HC-RLHF to align three different language models (Qwen2-1.5B, Qwen2.5-3B, and LLaMa3.2-3B) with human preferences. Our results demonstrate that HC-RLHF produces safe models with high probability and can improve harmlessness and helpfulness compared to previous methods.
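The safety test in the second step can be illustrated with a simple concentration bound. The sketch below is not the authors' procedure: it assumes per-response costs lie in a known interval and uses a one-sided Hoeffding bound, and the names `hoeffding_upper_bound`, `passes_safety_test`, and `cost_limit` are hypothetical.

```python
import math


def hoeffding_upper_bound(costs, delta, low=0.0, high=1.0):
    """One-sided (1 - delta) upper confidence bound on the mean cost.

    Assumes every cost lies in [low, high] (an assumption of this sketch).
    """
    n = len(costs)
    if n == 0:
        raise ValueError("need at least one held-out cost sample")
    mean = sum(costs) / n
    return mean + (high - low) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def passes_safety_test(costs, cost_limit, delta=0.05, low=0.0, high=1.0):
    """Accept the candidate model only if the upper bound stays within the limit."""
    return hoeffding_upper_bound(costs, delta, low, high) <= cost_limit
```

With few held-out samples the bound is wide, so a candidate with a low empirical cost can still be rejected. That is the conservative behaviour a high-confidence test is meant to have.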



Article 486

Title@2025-06-09 (1): SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

Title: SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs SK-VQA: Synthetische Wissensgenerierung auf Scale für die Schulung kontextverstärkter multimodaler LLMs SK-VQA:在培训背景增强型多模式LM培训规模内生成合成知识 2406.19593v2

Authors (6): Xin Su, Man Luo, Kris W Pan, Tien Pei Chou, Vasudev Lal, Phillip Howard

Multimodal retrieval augmented generation (RAG) plays a crucial role in domains such as knowledge-based visual question answering (KB-VQA), where external knowledge is needed to answer a question. However, existing multimodal LLMs (MLLMs) are not designed for context-augmented generation, limiting their effectiveness in such tasks. While synthetic data generation has recently gained attention for training MLLMs, its application for context-augmented generation remains underexplored. To address this gap, we introduce SK-VQA, a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs, each associated with context documents containing information necessary to determine the final answer. Compared to previous datasets, SK-VQA contains 11x more unique questions, exhibits greater domain diversity, and covers a broader spectrum of image sources. Through human evaluations, we confirm the high quality of the generated question-answer pairs and their contextual relevance. Extensive experiments show that SK-VQA serves both as a challenging KB-VQA benchmark and as an effective training resource for adapting MLLMs to context-augmented generation. Our results further indicate that models trained on SK-VQA demonstrate enhanced generalization in both context-aware VQA and multimodal RAG settings. SK-VQA is publicly available via Hugging Face Hub.



Article 487

Title@2025-06-09 (1): Automatic Generation of Inference Making Questions for Reading Comprehension Assessments

Title: Automatic Generation of Inference Making Questions for Reading Comprehension Assessments Automatische Erzeugung von Schlussfolgerungen, die Fragen für Leseverständnis-Bewertungen stellen 阅读理解评估自动产生引文问题 2506.08260v1

Authors (3): Wanjing Anya Ma, Michael Flor, Zuowei Wang

Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, and some rely on using prior knowledge to fill in the detail that is not explicitly written in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items within a diagnostic RC item bank. Next, we present experiments using GPT-4o to generate bridging-inference RC items for given reading passages via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects: overall item quality, appropriate inference type, and LLM reasoning, achieving high inter-rater agreements above 0.90. Our results show that GPT-4o produced 93.8% good-quality questions suitable for operational use in grade 3-12 contexts; however, only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.



Article 488

Title@2025-06-09 (1): RADAR: Benchmarking Language Models on Imperfect Tabular Data

Title: RADAR: Benchmarking Language Models on Imperfect Tabular Data RADAR: Benchmarking von Sprachmodellen auf unvollkommenen Tabellendaten RADAR:关于不完善表格数据的语言模式基准基准 2506.08249v1

Authors (21): Ken Gu, Zhihan Zhang, Kate Lin, Yuwei Zhang, Akshay Paruchuri, Hong Yu, Mehran Kazemi, Kumar Ayush, A. Ali Heydari, Maxwell A. Xu, Girish Narayanswamy, Yun Liu, Ming-Zher Poh, Yuzhe Yang, Mark Malhotra, Shwetak Patel, Hamid Palangi, Xuhai Xu, Daniel McDuff, Tim Althoff, Xin Liu

Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness – the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies – remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2980 table query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds when increasing table size. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
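Programmatic perturbation of a clean table might look like the following. This is a minimal sketch, not the RADAR framework itself: the table is assumed to be a list of dictionaries, and `inject_missing` and `inject_outlier` are hypothetical helpers covering two of the artifact types the abstract names.

```python
import copy
import random


def inject_missing(rows, column, fraction, seed=0):
    """Return a copy of rows with a fraction of one column's values set to None."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    out = copy.deepcopy(rows)
    k = round(fraction * len(out))
    for i in rng.sample(range(len(out)), k):
        out[i][column] = None
    return out


def inject_outlier(rows, column, row_index, factor=100.0):
    """Return a copy of rows with one numeric cell scaled by a large factor."""
    out = copy.deepcopy(rows)
    out[row_index][column] = out[row_index][column] * factor
    return out
```

Because both helpers return copies, the clean table stays available as the reference against which a model's answer on the perturbed table can be checked.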



Article 489

Title@2025-06-09 (1): Self-Supervised Transformers as Iterative Solution Improvers for Constraint Satisfaction

Title: Self-Supervised Transformers as Iterative Solution Improvers for Constraint Satisfaction Selbstüberwachte Transformer als iterative Lösungsverbesserungen für Einschränkungen Zufriedenheit 自我主宰的变形器作为抑制满意度的循环解决方案改进器 2502.15794v2

Authors (4): Yudong W. Xu, Wenhao Li, Scott Sanner, Elias B. Khalil

We present a Transformer-based framework for Constraint Satisfaction Problems (CSPs). CSPs find use in many applications and thus accelerating their solution with machine learning is of wide interest. Most existing approaches rely on supervised learning from feasible solutions or reinforcement learning, paradigms that require either feasible solutions to these NP-Complete CSPs or large training budgets and a complex expert-designed reward signal. To address these challenges, we propose ConsFormer, a self-supervised framework that leverages a Transformer as a solution refiner. ConsFormer constructs a solution to a CSP iteratively in a process that mimics local search. Instead of using feasible solutions as labeled data, we devise differentiable approximations to the discrete constraints of a CSP to guide model training. Our model is trained to improve random assignments for a single step but is deployed iteratively at test time, circumventing the bottlenecks of supervised and reinforcement learning. Experiments on Sudoku, Graph Coloring, Nurse Rostering, and MAXCUT demonstrate that our method can tackle out-of-distribution CSPs simply through additional iterations.
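One way to picture a differentiable approximation to a discrete constraint is a soft penalty for graph coloring. This is an illustrative sketch and not necessarily the relaxation ConsFormer uses: each node holds logits over colors, and an edge is penalized by the probability that both endpoints pick the same color. The names `softmax` and `coloring_penalty` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def coloring_penalty(logits_per_node, edges):
    """Sum over edges of P(both endpoints share a color).

    The penalty is smooth in the logits and approaches the number of
    violated edges as the distributions become one-hot.
    """
    probs = [softmax(logits) for logits in logits_per_node]
    return sum(
        sum(pu * pv for pu, pv in zip(probs[u], probs[v]))
        for u, v in edges
    )
```

A penalty of this kind can be minimized by gradient descent without any solved instances as labels, which matches the self-supervised setting described above.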



Article 490

Title@2025-06-09 (1): TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation

Title: TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation TRAVEL: Training-freies Retrieval und Ausrichtung für Vision-und-Sprachen-Navigation TRAVEL:无培训检索和协调愿景和语言导航 2502.07306v2

Authors (2): Navid Rajabi, Jana Kosecka

In this work, we propose a modular approach for the Vision-Language Navigation (VLN) task by decomposing the problem into four sub-modules that use state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) in a zero-shot setting. Given a navigation instruction in natural language, we first prompt an LLM to extract the landmarks and the order in which they are visited. Assuming a known model of the environment, we retrieve the top-k locations of the last landmark and generate $k$ path hypotheses from the starting location to the last landmark using the shortest path algorithm on the topological map of the environment. Each path hypothesis is represented by a sequence of panoramas. We then use dynamic programming to compute the alignment score between the sequence of panoramas and the sequence of landmark names, with match scores obtained from a VLM. Finally, we compute the nDTW metric for the hypothesis that yields the highest alignment score to evaluate path fidelity. We demonstrate superior performance compared to other approaches that use joint semantic maps like VLMaps on the complex R2R-Habitat instruction dataset and quantify in detail the effect of visual grounding on navigation performance.
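The dynamic-programming alignment step can be sketched as follows. The abstract does not give the recurrence, so this version makes two assumptions: every landmark must be matched to a distinct panorama, and matches must preserve landmark order. `scores[i][j]` stands for a hypothetical VLM match score between panorama `i` and landmark `j`.

```python
def alignment_score(scores):
    """Best total score of an order-preserving panorama-to-landmark matching.

    scores: list of rows, one per panorama, each with one entry per landmark.
    Returns -inf when there are fewer panoramas than landmarks.
    """
    num_panos = len(scores)
    num_marks = len(scores[0]) if scores else 0
    neg_inf = float("-inf")
    # dp[i][j]: best score matching the first j landmarks within the first i panoramas
    dp = [[neg_inf] * (num_marks + 1) for _ in range(num_panos + 1)]
    for i in range(num_panos + 1):
        dp[i][0] = 0.0
    for i in range(1, num_panos + 1):
        for j in range(1, num_marks + 1):
            skip = dp[i - 1][j]                             # panorama i-1 left unmatched
            take = dp[i - 1][j - 1] + scores[i - 1][j - 1]  # panorama i-1 matched to landmark j-1
            dp[i][j] = max(skip, take)
    return dp[num_panos][num_marks]


def best_hypothesis(score_matrices):
    """Index of the path hypothesis with the highest alignment score."""
    return max(range(len(score_matrices)), key=lambda h: alignment_score(score_matrices[h]))
```

A path that passes the landmarks in the instructed order scores higher than one that sees them in reverse, which is what lets the alignment score discriminate between the $k$ hypotheses.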



Article 491

Title@2025-06-09 (1): Learning to Reason via Mixture-of-Thought for Logical Reasoning

Title: Learning to Reason via Mixture-of-Thought for Logical Reasoning Mit Vernunft über Mixture-of-Thought für logische Vernunft lernen 通过对逻辑原因的混合研究学习理性 2505.15817v2

Authors (5): Tong Zheng, Lichang Chen, Simeng Han, R. Thomas McCoy, Heng Huang

Human beings naturally utilize multiple reasoning modalities to learn and solve logical problems, i.e., different representational formats such as natural language, code, and symbolic logic. In contrast, most existing LLM-based approaches operate with a single reasoning modality during training, typically natural language. Although some methods explored modality selection or augmentation at inference time, the training process remains modality-blind, limiting synergy among modalities. To fill in this gap, we propose Mixture-of-Thought (MoT), a framework that enables LLMs to reason across three complementary modalities: natural language, code, and a newly introduced symbolic modality, truth-table, which systematically enumerates logical cases and partially mitigates key failure modes in natural language reasoning. MoT adopts a two-phase design: (1) self-evolving MoT training, which jointly learns from filtered, self-generated rationales across modalities; and (2) MoT inference, which fully leverages the synergy of three modalities to produce better predictions. Experiments on logical reasoning benchmarks including FOLIO and ProofWriter demonstrate that our MoT framework consistently and significantly outperforms strong LLM baselines with single-modality chain-of-thought approaches, achieving up to +11.7pp average accuracy gain. Further analyses show that our MoT framework benefits both training and inference stages; that it is particularly effective on harder logical reasoning problems; and that different modalities contribute complementary strengths, with truth-table reasoning helping to overcome key bottlenecks in natural language inference.
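The idea of systematically enumerating logical cases can be shown with a small propositional truth-table check. This is only an illustration of the general technique, not the MoT truth-table format: premises and the conclusion are assumed to be Python callables over a dictionary of boolean variables, and `truth_table_verdict` is a hypothetical name.

```python
from itertools import product


def truth_table_verdict(variables, premises, conclusion):
    """Classify a conclusion as "True", "False", or "Unknown" given the premises.

    Enumerates every assignment, keeps those satisfying all premises,
    and checks how the conclusion behaves on the surviving rows.
    """
    models = []
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(premise(env) for premise in premises):
            models.append(env)
    if not models:
        return "Inconsistent"  # the premises contradict each other
    outcomes = {bool(conclusion(env)) for env in models}
    if outcomes == {True}:
        return "True"
    if outcomes == {False}:
        return "False"
    return "Unknown"
```

Exhaustive enumeration is exponential in the number of variables, so it suits small problems where skipping a case is the main risk.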



Article 492

Title@2025-06-09 (1): Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning

Title: Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning Kann KI die Wissenschaft bestätigen? Benchmarking LLMs für genaue wissenschaftliche Ansprüche $\rightarrow$ Evidence Reasoning AI 能验证科学吗? 2506.08235v1

Authors (6): Shashidhar Reddy Javaji, Yupeng Cao, Haohang Li, Yangyang Yu, Nikhil Muralidhar, Zining Zhu

Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs’ capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches inspired by divide-and-conquer strategies across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs’ ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs’ abilities to accurately link dispersed evidence with claims, although this comes at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.



Article 493

Title@2025-06-09 (1): Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions

Title: Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions Compound AI Systems Optimization: Eine Übersicht über Methoden, Herausforderungen und zukünftige Richtungen AI 系统优化:方法、挑战和今后方向调查 2506.08234v1

Authors (6): Yu-Ang Lee, Guan-Ting Yi, Mei-Yi Liu, Jui-Chao Lu, Guan-Bo Yang, Yun-Nung Chen

Recent advancements in large language models (LLMs) and AI systems have led to a paradigm shift in the design and optimization of complex AI workflows. By integrating multiple components, compound AI systems have become increasingly adept at performing sophisticated tasks. However, as these systems grow in complexity, new challenges arise in optimizing not only individual components but also their interactions. While traditional optimization methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) remain foundational, the rise of natural language feedback introduces promising new approaches, especially for optimizing non-differentiable systems. This paper provides a systematic review of recent progress in optimizing compound AI systems, encompassing both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field. A list of surveyed papers is publicly available at https://github.com/MiuLab/AISysOpt-Survey.



Article 494

Title@2025-06-09 (1): “I Wrote, I Paused, I Rewrote” Teaching LLMs to Read Between the Lines of Student Writing

Title: “I Wrote, I Paused, I Rewrote” Teaching LLMs to Read Between the Lines of Student Writing “I Wrote, I Paused, I Rewrote” Lehre LLMs zu lesen zwischen den Linien des Studentenschreibens “我写,我写,我写,我重写” 教LMS 在学生写作的两行之间阅读 2506.08221v1

Authors (5): Samra Zafar, Shaheer Minhas, Syed Ali Hassan Zaidi, Arfa Naeem, Zahra Ali

Large language models (LLMs) like Gemini are becoming common tools for supporting student writing. But most of their feedback is based only on the final essay, missing important context about how that text was written. In this paper, we explore whether using writing process data, collected through keystroke logging and periodic snapshots, can help LLMs give feedback that better reflects how learners think and revise while writing. We built a digital writing tool that captures both what students type and how their essays evolve over time. Twenty students used this tool to write timed essays, which were then evaluated in two ways: (i) an LLM generated feedback using both the final essay and the full writing trace, and (ii) after the task, students completed surveys about how useful and relatable they found the feedback. Early results show that learners preferred the process-aware LLM feedback, finding it more in tune with their own thinking. We also found that certain types of edits, like adding new content or reorganizing paragraphs, aligned closely with higher scores in areas like coherence and elaboration. Our findings suggest that making LLMs more aware of the writing process can lead to feedback that feels more meaningful, personal, and supportive.



Article 495

Title@2025-06-09 (1): A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation

Title: A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation Eine umfassende Studie von Decoder-Only LLMs für die Text-to-Image-Generierung 综合研究 “ 文字到图像生成的开发者 – – 唯一 “ 有限LLMs “ 综合研究 2506.08210v1

Authors (5): Andrew Z. Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, Yogesh Balaji

Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models. We build a standardized training and evaluation pipeline that allows us to isolate and evaluate the effect of different text embeddings. We train a total of 27 text-to-image models with 12 different text encoders to analyze the critical aspects of LLMs that could impact text-to-image generation, including the approaches to extracting embeddings, different LLM variants, and model sizes. Our experiments reveal that the de facto way of using last-layer embeddings as conditioning leads to inferior performance. Instead, we explore embeddings from various layers and find that using layer-normalized averaging across all layers significantly improves alignment with complex prompts. Most LLMs with this conditioning outperform the baseline T5 model, showing enhanced performance in advanced visio-linguistic reasoning skills.
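Layer-normalized averaging can be read as: normalize each layer's token embedding, then average across layers so that no single layer's scale dominates. The abstract does not specify the normalization, so the sketch below assumes a plain layer norm without learned scale or bias. It uses nested Python lists shaped `[layer][token][feature]` for self-containment, and both function names are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one embedding to zero mean and unit variance over its features."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * scale for x in vec]


def layer_normalized_average(hidden_states):
    """Average per-layer normalized embeddings; input is [layer][token][feature]."""
    num_layers = len(hidden_states)
    num_tokens = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    out = [[0.0] * dim for _ in range(num_tokens)]
    for layer in hidden_states:
        for t, vec in enumerate(layer):
            for d, x in enumerate(layer_norm(vec)):
                out[t][d] += x / num_layers
    return out
```

In practice the same computation would run on framework tensors. The point of the sketch is that a layer with activations a hundred times larger contributes no more to the average than any other layer.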



Article 496

Title@2025-06-09 (1): R.R.: Unveiling LLM Training Privacy through Recollection and Ranking

Title: R.R.: Unveiling LLM Training Privacy through Recollection and Ranking R.R.: Enthüllen LLM Training Privacy durch Erinnerung und Ranking R.R.R.:通过记忆和排名来统一LLM培训隐私 2502.12658v2

Authors (8): Wenlong Meng, Zhenyuan Guo, Lenan Wu, Chen Gong, Wenyan Liu, Weixian Li, Chengkun Wei, Wenzhi Chen

Large Language Models (LLMs) pose significant privacy risks, potentially leaking training data due to implicit memorization. Existing privacy attacks primarily focus on membership inference attacks (MIAs) or data extraction attacks, but reconstructing specific personally identifiable information (PII) in LLMs’ training data remains challenging. In this paper, we propose R.R. (Recollect and Rank), a novel two-step privacy stealing attack that enables attackers to reconstruct PII entities from scrubbed training data where the PII entities have been masked. In the first stage, we introduce a prompt paradigm named recollection, which instructs the LLM to repeat a masked text but fill in the masks. Then we can use PII identifiers to extract recollected PII candidates. In the second stage, we design a new criterion to score each PII candidate and rank them. Motivated by membership inference, we leverage a reference model to calibrate our criterion. Experiments across three popular PII datasets demonstrate that R.R. achieves better PII identification performance than baselines. These results highlight the vulnerability of LLMs to PII leakage even when training data has been scrubbed. We release our code and datasets on GitHub.
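Reference-model calibration in the ranking stage can be illustrated with a loss-difference score, a common device in membership inference. The abstract does not state R.R.'s actual criterion, so everything here is an assumption: the per-candidate losses are supplied as plain dictionaries, and `rank_candidates` is a hypothetical name.

```python
def rank_candidates(candidates, target_loss, reference_loss):
    """Rank PII candidates by calibrated score, most suspicious first.

    The score is the target model's loss minus the reference model's loss:
    a candidate the target finds unusually easy, relative to a model that
    never saw the data, ranks higher.
    """
    scored = [(target_loss[c] - reference_loss[c], c) for c in candidates]
    return [c for _, c in sorted(scored)]
```

Calibration matters because some strings are simply common. A frequent name has low loss under any model, and subtracting the reference loss removes that baseline.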


Article 497

Title@2025-06-09 (1): Open World Scene Graph Generation using Vision Language Models

Title: Open World Scene Graph Generation using Vision Language Models Open World Scene Graph Generation mit Vision Language Models 利用视觉语言模型制作开放世界景点图集 2506.08189v1

Authors (11): Amartya Dutta, Kazi Sajeed Mehrab, Medha Sawhney, Abhilash Neog, Mridul Khurana, Sepideh Fatemi, Aanish Pradhan, M. Maruf, Ismini Lourentzou, Arka Daw, Anuj Karpatne

Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships. Most methods depend on dataset-specific supervision to learn the variety of interactions, restricting their usefulness in open-world settings involving novel objects and/or relations. Even methods that leverage large Vision Language Models (VLMs) typically require benchmark-specific fine-tuning. We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of VLMs to produce scene graphs with zero additional learning. Casting SGG as a zero-shot structured-reasoning problem, our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets. To assess this setting, we formalize an Open-World evaluation protocol that measures performance when no SGG-specific data have been observed in terms of either objects or relations. Experiments on Visual Genome, Open Images V6, and the Panoptic Scene Graph (PSG) dataset demonstrate the capacity of pretrained VLMs to perform relational understanding without task-level training.


Article 498

Title@2025-06-09 (1): GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors

Title: GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors GradEscape: Ein gradientbasierter Evader gegen KI-generierte Textdetektoren 梯度梯度:针对AI-General Text 检测器的渐进式蒸发器 2506.08188v1

Authors (8): Wenlong Meng, Shuguo Fan, Chengkun Wei, Min Chen, Yuwei Li, Yuanchao Zhang, Zhikun Zhang, Wenzhi Chen

In this paper, we introduce GradEscape, the first gradient-based evader designed to attack AI-generated text (AIGT) detectors. GradEscape overcomes the undifferentiable computation problem, caused by the discrete nature of text, by introducing a novel approach to construct weighted embeddings for the detector input. It then updates the evader model parameters using feedback from victim detectors, achieving high attack success with minimal text modification. To address the issue of tokenizer mismatch between the evader and the detector, we introduce a warm-started evader method, enabling GradEscape to adapt to detectors across any language model architecture. Moreover, we employ novel tokenizer inference and model extraction techniques, facilitating effective evasion even in query-only access. We evaluate GradEscape on four datasets and three widely-used language models, benchmarking it against four state-of-the-art AIGT evaders. Experimental results demonstrate that GradEscape outperforms existing evaders in various scenarios, including with an 11B paraphrase model, while utilizing only 139M parameters. We have successfully applied GradEscape to two real-world commercial AIGT detectors. Our analysis reveals that the primary vulnerability stems from disparity in text expression styles within the training data. We also propose a potential defense strategy to mitigate the threat of AIGT evaders. We open-source our GradEscape for developing more robust AIGT detectors.


Article 499

Title@2025-06-09 (1): LLM Alignment as Retriever Optimization: An Information Retrieval Perspective

Title: LLM Alignment as Retriever Optimization: An Information Retrieval Perspective LLM Alignment als Retriever-Optimierung: Eine Informations-Retrieval-Perspektive LLM 对齐作为最佳优化:信息检索视角 2502.03699v2

Authors (8): Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik

Large Language Models (LLMs) have revolutionized artificial intelligence with capabilities in reasoning, coding, and communication, driving innovation across industries. Their true potential depends on effective alignment to ensure correct, trustworthy and ethical behavior, addressing challenges like misinformation, hallucinations, bias and misuse. While existing Reinforcement Learning (RL)-based alignment methods are notoriously complex, direct optimization approaches offer a simpler alternative. In this work, we introduce a novel direct optimization approach for LLM alignment by drawing on established Information Retrieval (IR) principles. We present a systematic framework that bridges LLM alignment and IR methodologies, mapping LLM generation and reward models to IR’s retriever-reranker paradigm. Building on this foundation, we propose LLM Alignment as Retriever Preference Optimization (LarPO), a new alignment method that enhances overall alignment quality. Extensive experiments validate LarPO’s effectiveness with 38.9 % and 13.7 % averaged improvement on AlpacaEval2 and MixEval-Hard respectively. Our work opens new avenues for advancing LLM alignment by integrating IR foundations, offering a promising direction for future research.


Article 500

Title@2025-06-09 (1): Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction

Title: Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction Kann Künstliche Intelligenz wie Borges schreiben? Ein Bewertungsprotokoll für spanische Mikrofiktion 《像博尔热斯那样的人工智能书写? 西班牙微缩片的评价议定书》 2506.08172v1

Authors (6): Gerardo Aleman Manzanarez, Nora de la Cruz Arana, Jorge Garcia Flores, Yobany Garcia Medina, Raul Monroy, Nathalie Pernelle

Automated story writing has been a subject of study for over 60 years. Large language models can generate narratively consistent and linguistically coherent short fiction texts. Despite these advancements, rigorous assessment of such outputs for literary merit - especially concerning aesthetic qualities - has received scant attention. In this paper, we address the challenge of evaluating AI-generated microfictions and argue that this task requires consideration of literary criteria across various aspects of the text, such as thematic coherence, textual clarity, interpretive depth, and aesthetic quality. To facilitate this, we present GrAImes: an evaluation protocol grounded in literary theory, specifically drawing from a literary perspective, to offer an objective framework for assessing AI-generated microfiction. Furthermore, we report the results of our validation of the evaluation protocol, as answered by both literature experts and literary enthusiasts. This protocol will serve as a foundation for evaluating automatically generated microfictions and assessing their literary value.


Article 501

Title@2025-06-09 (1): xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Title: xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs xGen-MM-Vid (BLIP-3-Video): Sie benötigen nur 32 Zeichen, um ein Video auch in VLMs darzustellen xGen-MM-Vid (BLIP-3-Video):您只需32个调子即可在 VLMs 中代表视频 2410.16267v2

Authors (12): Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Jongwoo Park, Kanchana Ranasinghe, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles

We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the ‘temporal encoder’ in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP3-Video to use much fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html


Article 502

Title@2025-06-09 (1): ETT-CKGE: Efficient Task-driven Tokens for Continual Knowledge Graph Embedding

Title: ETT-CKGE: Efficient Task-driven Tokens for Continual Knowledge Graph Embedding ETT-CKGE: Effiziente Task-getriebene Token für kontinuierliches Einbetten von Wissensgraphen ETT-CKGE: 高效任务驱动的连续知识图嵌入标记 2506.08158v1

Authors (11): Lijing Zhu, Qizhen Lan, Qing Tian, Wenbo Sun, Li Yang, Lu Xia, Yixin Xie, Xi Xiao, Tiehang Duan, Cui Tao, Shuteng Niu

Continual Knowledge Graph Embedding (CKGE) seeks to integrate new knowledge while preserving past information. However, existing methods struggle with efficiency and scalability due to two key limitations: (1) suboptimal knowledge preservation between snapshots caused by manually designed node/relation importance scores that ignore graph dependencies relevant to the downstream task, and (2) computationally expensive graph traversal for node/relation importance calculation, leading to slow training and high memory overhead. To address these limitations, we introduce ETT-CKGE (Efficient, Task-driven, Tokens for Continual Knowledge Graph Embedding), a novel task-guided CKGE method that leverages efficient task-driven tokens for efficient and effective knowledge transfer between snapshots. Our method introduces a set of learnable tokens that directly capture task-relevant signals, eliminating the need for explicit node scoring or traversal. These tokens serve as consistent and reusable guidance across snapshots, enabling efficient token-masked embedding alignment between snapshots. Importantly, knowledge transfer is achieved through simple matrix operations, significantly reducing training time and memory usage. Extensive experiments across six benchmark datasets demonstrate that ETT-CKGE consistently achieves superior or competitive predictive performance, while substantially improving training efficiency and scalability compared to state-of-the-art CKGE methods. The code is available at: https://github.com/lijingzhu1/ETT-CKGE/tree/main


Article 503

Title@2025-06-09 (1): Multilingual Hate Speech Detection in Social Media Using Translation-Based Approaches with Large Language Models

Title: Multilingual Hate Speech Detection in Social Media Using Translation-Based Approaches with Large Language Models Mehrsprachige Hate Speech Detection in sozialen Medien mit translationsbasierten Ansätzen mit großen Sprachmodellen 在社会媒体中使用基于翻译的方法和大语言模式的多语言社会媒体多语言仇恨言论探测 2506.08147v1

Authors (6): Muhammad Usman, Muhammad Ahmad, M. Shahiki Tash, Irina Gelbukh, Rolando Quintero Tellez, Grigori Sidorov

Social media platforms are critical spaces for public discourse, shaping opinions and community dynamics, yet their widespread use has amplified harmful content, particularly hate speech, threatening online safety and inclusivity. While hate speech detection has been extensively studied in languages like English and Spanish, Urdu remains underexplored, especially using translation-based approaches. To address this gap, we introduce a trilingual dataset of 10,193 tweets in English (3,834 samples), Urdu (3,197 samples), and Spanish (3,162 samples), collected via keyword filtering, with a balanced distribution of 4,849 Hateful and 5,344 Not-Hateful labels. Our methodology leverages attention layers as a precursor to transformer-based models and large language models (LLMs), enhancing feature extraction for multilingual hate speech detection. For non-transformer models, we use TF-IDF for feature extraction. The dataset is benchmarked using state-of-the-art models, including GPT-3.5 Turbo and Qwen 2.5 72B, alongside traditional machine learning models like SVM and other transformers (e.g., BERT, RoBERTa). Three annotators, following rigorous guidelines, ensured high dataset quality, achieving a Fleiss’ Kappa of 0.821. Our approach, integrating attention layers with GPT-3.5 Turbo and Qwen 2.5 72B, achieves strong performance, with macro F1 scores of 0.87 for English (GPT-3.5 Turbo), 0.85 for Spanish (GPT-3.5 Turbo), 0.81 for Urdu (Qwen 2.5 72B), and 0.88 for the joint multilingual model (Qwen 2.5 72B). These results reflect improvements of 8.75% in English (over SVM baseline 0.80), 8.97% in Spanish (over SVM baseline 0.78), 5.19% in Urdu (over SVM baseline 0.77), and 7.32% in the joint multilingual model (over SVM baseline 0.82). Our framework offers a robust solution for multilingual hate speech detection, fostering safer digital communities worldwide.
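
The reported percentage gains are relative improvements over the SVM baselines, which can be checked directly from the macro F1 scores quoted above. The sketch below recomputes them; the helper name `relative_improvement` is chosen here for illustration.

```python
def relative_improvement(score, baseline):
    """Relative gain of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100

# (best macro F1, SVM baseline macro F1) as quoted in the abstract.
results = {
    "English": (0.87, 0.80),
    "Spanish": (0.85, 0.78),
    "Urdu": (0.81, 0.77),
    "Joint multilingual": (0.88, 0.82),
}

improvements = {
    name: round(relative_improvement(score, baseline), 2)
    for name, (score, baseline) in results.items()
}

for name, gain in improvements.items():
    print(f"{name}: {gain:.2f}% over the SVM baseline")
```

All four values agree with the figures in the abstract (8.75%, 8.97%, 5.19%, and 7.32%), confirming that the gains are relative rather than absolute differences in F1.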


Article 504

Title@2025-06-09 (1): AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists

Title: AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists AutoSDT: Scaling Data-Driven Discovery Aufgaben gegenüber offenen Mitwissenschaftlern AutoSDD: 扩大数据驱动发现任务, 以建立开放的共同科学家 2506.08140v1

Authors (19): Yifei Li, Hanane Nour Moussa, Ziru Chen, Shijie Chen, Botao Yu, Mingyi Xue, Benjamin Burns, Tzu-Yao Chiu, Vishal Dey, Zitong Lu, Chen Wei, Qianheng Zhang, Tianyu Zhang, Song Gao, Xuhui Huang, Xia Ning, Nesreen K. Ahmed, Ali Payani, Huan Sun

Despite long-standing efforts in accelerating scientific discovery with AI, building AI co-scientists remains challenging due to limited high-quality data for training and evaluation. To tackle this data scarcity issue, we present AutoSDT, an automatic pipeline that collects high-quality coding tasks in real-world data-driven discovery workflows. AutoSDT leverages the coding capabilities and parametric knowledge of LLMs to search for diverse sources, select ecologically valid tasks, and synthesize accurate task instructions and code solutions. Using our pipeline, we construct AutoSDT-5K, a dataset of 5,404 coding tasks for data-driven discovery that covers four scientific disciplines and 756 unique Python packages. To the best of our knowledge, AutoSDT-5K is the only automatically collected and the largest open dataset for data-driven scientific discovery. Expert feedback on a subset of 256 tasks shows the effectiveness of AutoSDT: 93% of the collected tasks are ecologically valid, and 92.2% of the synthesized programs are functionally correct. Trained on AutoSDT-5K, the Qwen2.5-Coder-Instruct LLM series, dubbed AutoSDT-Coder, show substantial improvement on two challenging data-driven discovery benchmarks, ScienceAgentBench and DiscoveryBench. Most notably, AutoSDT-Coder-32B reaches the same level of performance as GPT-4o on ScienceAgentBench with a success rate of 7.8%, doubling the performance of its base model. On DiscoveryBench, it lifts the hypothesis matching score to 8.1, bringing a 17.4% relative improvement and closing the gap between open-weight models and GPT-4o.


Article 505

Title@2025-06-09 (1): EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

Title: EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments EconWebArena: Benchmarking von autonomen Agenten zu wirtschaftlichen Aufgaben in realistischen Web-Umgebungen EconwebArena:确定现实网络环境中经济任务自治机构的基准 2506.08136v1

Authors (2): Zefang Liu, Yinzhu Quan

We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.


Article 506

Title@2025-06-09 (1): Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning

Title: Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning Bingo: Effiziente Begründung von LLMs durch dynamisches und auf Bedeutung basierendes Verstärkungslernen 宾果:通过以动态和重要内容为基础的强化学习,提高LLMs高效率的理据能力 2506.08125v1

Authors (8): Hanbing Liu, Lang Cao, Yuanyi Ren, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang

Large language models have demonstrated impressive reasoning capabilities, yet they often suffer from inefficiencies due to unnecessarily verbose or redundant outputs. While many works have explored reinforcement learning (RL) to enhance reasoning abilities, most primarily focus on improving accuracy, with limited attention to reasoning efficiency. Some existing approaches introduce direct length-based rewards to encourage brevity, but this often leads to noticeable drops in accuracy. In this paper, we propose Bingo, an RL framework that advances length-based reward design to boost efficient reasoning. Bingo incorporates two key mechanisms: a significance-aware length reward, which gradually guides the model to reduce only insignificant tokens, and a dynamic length reward, which initially encourages elaborate reasoning for hard questions but decays over time to improve overall efficiency. Experiments across multiple reasoning benchmarks show that Bingo improves both accuracy and efficiency. It outperforms the vanilla reward and several other length-based reward baselines in RL, achieving a favorable trade-off between accuracy and efficiency. These results underscore the potential of training LLMs explicitly for efficient reasoning.


Article 507

Title@2025-06-09 (1): QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

Title: QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA QA-LIGN: LLMs durch konstitutionell zerlegte QA ausrichten QA-LIign:通过《宪法》分解的QA调整LMs 2506.08123v1

Authors (11): Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou

Alignment of large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is crucial for ensuring safe and reliable AI systems. However, standard reward-based alignment methods typically collapse diverse feedback into a single scalar reward, entangling multiple objectives into one opaque training signal, which hinders interpretability. In this work, we introduce QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle within the reward mechanism. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions and derives separate reward components for each principle, making it a drop-in reward model replacement. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability in the alignment process. At the same time, our approach achieves performance on par with or better than a DPO baseline. Overall, these results represent a step toward more interpretable and controllable alignment of language models, achieved without sacrificing end-task performance.


Article 508

Title@2025-06-09 (1): Conservative Bias in Large Language Models: Measuring Relation Predictions

Title: Conservative Bias in Large Language Models: Measuring Relation Predictions Konservative Bias in großen Sprachmodellen: Messung von Beziehungsvorhersagen 大语言模型中的保守偏见:衡量关系预测 2506.08120v1

Authors (5): Toyin Aguda, Erik Wilson, Allan Anzagira, Simerjot Kaur, Charese Smiley

Large language models (LLMs) exhibit pronounced conservative bias in relation extraction tasks, frequently defaulting to the No_Relation label when an appropriate option is unavailable. While this behavior helps prevent incorrect relation assignments, our analysis reveals that it also leads to significant information loss when reasoning is not explicitly included in the output. We systematically evaluate this trade-off across multiple prompts, datasets, and relation types, introducing the concept of Hobson’s choice to capture scenarios where models opt for safe but uninformative labels over hallucinated ones. Our findings suggest that conservative bias occurs twice as often as hallucination. To quantify this effect, we use SBERT and LLM prompts to capture the semantic similarity between conservative bias behaviors in constrained prompts and labels generated from semi-constrained and open-ended prompts.


Article 509

Title@2025-06-09 (1): “Would You Want an AI Tutor?” Understanding Stakeholder Perceptions of LLM-based Systems in the Classroom

Title: “Would You Want an AI Tutor?” Understanding Stakeholder Perceptions of LLM-based Systems in the Classroom “Würden Sie einen KI-Tutor wollen?” Verständnis der Stakeholder Wahrnehmungen von LLM-basierten Systemen im Klassenzimmer “你想要一个人工智能导师吗?” 了解课堂上利益攸关方对基于LLM系统的LM系统的看法。 2503.02885v2

Authors (3): Caterina Fuligni, Daniel Dominguez Figaredo, Julia Stoyanovich

In recent years, Large Language Models (LLMs) have rapidly gained popularity across all parts of society, including education. After initial skepticism and bans, many schools have chosen to embrace this new technology by integrating it into their curricula in the form of virtual tutors and teaching assistants. However, neither the companies developing this technology nor the public institutions involved in its implementation have set up a formal system to collect feedback from the stakeholders impacted by it. In this paper, we argue that understanding the perceptions of those directly or indirectly impacted by LLMs in the classroom, including parents and school staff, is essential for ensuring responsible use of AI in this critical domain. Our contributions are two-fold. First, we propose the Contextualized Perceptions for the Adoption of LLMs in Education (Co-PALE) framework, which can be used to systematically elicit perceptions and inform whether and how LLM-based tools should be designed, developed, and deployed in the classroom. Second, we explain how our framework can be used to ground specific rubrics for eliciting perceptions of the relevant stakeholders in view of specific goals and context of implementation. Overall, Co-PALE is a practical step toward helping educational agents, policymakers, researchers, and technologists ensure the responsible and effective deployment of LLM-based systems across diverse learning contexts.


Article 510

Title@2025-06-09 (1): From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?

Title: From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation? Von der Ausgabe bis zur Auswertung: Reicht die Ausgabe von LLMs mit rohem Instruktionscode für die Generierung von Fill-in-the-Middle Code aus? 从输出到评价:原始指令-指令代码LLMs 输出足量是否用于中代代号的填充? 2505.18789v2

Authors (3): Wasi Uddin Ahmad, Somshubra Majumdar, Boris Ginsburg

Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation due to the frequent presence of extraneous code in raw outputs. This extraneous generation suggests a lack of awareness regarding output boundaries, requiring truncation for effective evaluation. The determination of an optimal truncation strategy, however, often proves intricate, particularly when the scope includes several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that seamlessly integrates with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on the HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without post-processing, especially when the middle consists of complete lines. However, post-processing of the LLM outputs remains necessary when the middle is a random span of code.


Article 511

Title@2025-06-09 (1): From Language Models over Tokens to Language Models over Characters

Title: From Language Models over Tokens to Language Models over Characters Von Sprachmodellen über Tokens zu Sprachmodellen über Zeichen 从名称以上的语言模式到字符上的语言模式 2412.03719v2

Authors (8): Tim Vieira, Ben LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Brian DuSell, John Terilla, Timothy J. O’Donnell, Ryan Cotterell

Modern language models are internally – and mathematically – distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent processing are very sensitive to the specification of the prompt (e.g., whether the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that – even with a small computation budget – our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model’s compression rate (bits/byte) is achieved.
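
The gap between token strings and character strings can be made concrete with a toy example. The sketch below assumes a tiny, fully enumerated distribution over token sequences (the vocabulary and probabilities are invented for illustration) and obtains the character-level distribution by summing over every tokenization that decodes to the same string. Real language models have unbounded support, so this brute-force marginalization is not the paper's algorithm, only the quantity such algorithms aim to compute or approximate.

```python
from collections import defaultdict

# Hypothetical toy distribution over token strings (tuples of tokens).
# Both ("ab",) and ("a", "b") decode to the character string "ab".
token_lm = {
    ("ab",): 0.5,
    ("a", "b"): 0.2,
    ("a",): 0.2,
    ("b",): 0.1,
}

def character_distribution(token_distribution):
    """Marginalize a token-string distribution into a character-string one."""
    char_probs = defaultdict(float)
    for tokens, prob in token_distribution.items():
        # Decoding is plain concatenation in this toy vocabulary.
        char_probs["".join(tokens)] += prob
    return dict(char_probs)

char_lm = character_distribution(token_lm)
```

Here the character string "ab" receives the combined mass of its two tokenizations, which a query against only the canonical tokenization would miss.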


Article 512

Title@2025-06-09 (1): Reinforcement Pre-Training

Title: Reinforcement Pre-Training Verstärkung der Vorausbildung 培训前 2506.08007v1

Authors (7): Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where it receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
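
A verifiable next-token reward of the kind described above can be sketched in a few lines. The binary exact-match rule, the function names, and the sampled predictions below are assumptions made for illustration; the abstract states only that the model is rewarded for correctly predicting the next token.

```python
def next_token_reward(predicted_token, true_token):
    """Assumed binary reward: 1.0 for an exact match with the corpus token."""
    return 1.0 if predicted_token == true_token else 0.0

def average_reward(rollout_predictions, true_token):
    """Mean reward over several sampled predictions for one context."""
    rewards = [next_token_reward(p, true_token) for p in rollout_predictions]
    return sum(rewards) / len(rewards)

# Hypothetical context "The capital of France is" with ground-truth token " Paris".
rollouts = [" Paris", " Lyon", " Paris", " the"]
mean_reward = average_reward(rollouts, " Paris")
```

Because the ground-truth token comes from the text itself, the reward needs no human annotation, which is what lets ordinary text corpora serve as RL training data.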


Article 513

Title@2025-06-09 (1): Reparameterized LLM Training via Orthogonal Equivalence Transformation

Title: Reparameterized LLM Training via Orthogonal Equivalence Transformation Reparameterisiertes LLM-Training über Orthogonale Äquivalenztransformation 通过正正对等转化进行修复性磁力LLM培训 2506.08001v1

Authors (6): Zeju Qiu, Simon Buchholz, Tim Z. Xiao, Maximilian Dax, Bernhard Schölkopf, Weiyang Liu

While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field’s most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because of its provable preservation of spectral properties of weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks. Extensive experiments validate the effectiveness and scalability of POET in training LLMs.
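
The spectral-preservation claim can be checked numerically on a small case. The sketch below assumes the reparameterized weight takes the form W = R W0 P, with 2x2 rotations standing in for the two learnable orthogonal matrices and a hand-picked W0 standing in for the fixed random matrix; the product order, the matrices, and the angles are all illustrative assumptions. For a 2x2 matrix the singular values follow from the Frobenius norm and the determinant, so no linear algebra library is needed.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, one simple kind of orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def singular_values_2x2(w):
    """Uses s1^2 + s2^2 = ||W||_F^2 and s1 * s2 = |det W| for a 2x2 matrix."""
    fro2 = sum(x * x for row in w for x in row)
    det = abs(w[0][0] * w[1][1] - w[0][1] * w[1][0])
    disc = math.sqrt(max(fro2 * fro2 - 4 * det * det, 0.0))
    return (math.sqrt((fro2 + disc) / 2), math.sqrt(max((fro2 - disc) / 2, 0.0)))

W0 = [[2.0, 1.0], [0.5, -1.5]]        # stand-in for the fixed random weights
R, P = rotation(0.7), rotation(-1.3)  # stand-ins for the learnable factors
W = matmul(matmul(R, W0), P)          # assumed form: W = R @ W0 @ P

before = singular_values_2x2(W0)
after = singular_values_2x2(W)
```

The entries of W differ from those of W0, yet `before` and `after` agree up to rounding, since multiplying by orthogonal matrices on either side leaves singular values unchanged.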


Article 514

Title@2025-06-09 (1): Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval

Title: Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval Hierarchischer Lexischer Graph für verbessertes Multi-Hop-Retrieval 用于增强多功能检索的 高级高分解逻辑图 2506.08074v1

Authors (9): Abdellah Ghassel, Ian Robinson, Gabriel Tanase, Hal Cooper, Bryan Thompson, Zhen Han, Vassilis N. Ioannidis, Soji Adeshina, Huzefa Rangwala

Retrieval-Augmented Generation (RAG) grounds large language models in external evidence, yet it still falters when answers must be pieced together across semantically distant documents. We close this gap with the Hierarchical Lexical Graph (HLG), a three-tier index that (i) traces every atomic proposition to its source, (ii) clusters propositions into latent topics, and (iii) links entities and relations to expose cross-document paths. On top of HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG, which performs fine-grained entity-aware beam search over propositions for high-precision factoid questions, and TopicGraphRAG, which selects coarse topics before expanding along entity links to supply broad yet relevant context for exploratory queries. Additionally, existing benchmarks lack the complexity required to rigorously evaluate multi-hop summarization systems, often focusing on single-document queries or limited datasets. To address this, we introduce a synthetic dataset generation pipeline that curates realistic, multi-document question-answer pairs, enabling robust evaluation of multi-hop retrieval systems. Extensive experiments across five datasets demonstrate that our methods outperform naive chunk-based RAG, achieving an average relative improvement of 23.1% in retrieval recall and correctness. An open-source Python library is available at https://github.com/awslabs/graphrag-toolkit.


Article 515

Title@2025-06-09 (1): Private Memorization Editing: Turning Memorization into a Defense to Strengthen Data Privacy in Large Language Models

Title: Private Memorization Editing: Turning Memorization into a Defense to Strengthen Data Privacy in Large Language Models Private Memorization Editing: Aus Memorization eine Verteidigung zur Stärkung der Datenschutz in großen Sprachmodellen 私人记忆编辑:将记忆转换为国防,以加强大语言模型中的数据隐私 2506.10024v1

Authors (4): Elena Sofia Ruzzetti, Giancarlo A. Xompero, Davide Venditti, Fabio Massimo Zanzotto

Large Language Models (LLMs) memorize, and thus, among huge amounts of uncontrolled data, may memorize Personally Identifiable Information (PII), which should not be stored and, consequently, not leaked. In this paper, we introduce Private Memorization Editing (PME), an approach for preventing private data leakage that turns an apparent limitation, that is, the LLMs’ memorization ability, into a powerful privacy defense strategy. While attacks against LLMs have been performed by exploiting prior knowledge of their training data, our approach aims to exploit the same kind of knowledge in order to make a model more robust. We detect memorized PII and then mitigate its memorization by editing the model’s knowledge of its training data. We verify that our procedure does not affect the underlying language model while making it more robust against privacy Training Data Extraction attacks. We demonstrate that PME can effectively reduce the number of leaked PII in a number of configurations, in some cases even reducing the accuracy of the privacy attacks to zero.


Article 516

Title@2025-06-09 (1): PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation

Title: PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation PhantomWiki: On-Demand-Datensätze für die Bewertung von Vernunft und Retrieval PhantomWiki: 用于理性和检索评价的现场数据集 2502.20377v2

Authors (9): Albert Gong, Kamilė Stankevičiūtė, Chao Wan, Anmol Kabra, Raphael Thesmar, Johann Lee, Julius Klenke, Carla P. Gomes, Kilian Q. Weinberger

High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities. Our code is available at https://github.com/kilian-group/phantom-wiki.


Article 517

Title@2025-06-09 (1): When Two LLMs Debate, Both Think They’ll Win

Title: When Two LLMs Debate, Both Think They’ll Win Wenn zwei LLMs diskutieren, denken beide, dass sie gewinnen werden 当两个LLM 辩论, 双方都认为他们会赢 2505.19184v3

Authors (2): Pradyumna Shyama Prasad, Minh Nhat Nguyen

Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models’ private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLMs are now increasingly deployed without careful review in assistant and agentic roles. Code for our experiments is available at https://github.com/pradyuprasad/llms_overconfidence
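
The zero-sum argument in the abstract can be made concrete with a small sketch: in a debate with exactly one winner, the two sides' win probabilities should sum to 100, so any excess signals overconfidence. The function names and the sample confidence pairs below are hypothetical illustrations, not values or code from the paper's repository.

```python
def overconfidence_excess(conf_a: float, conf_b: float) -> float:
    """Amount (in percentage points) by which two win-confidences exceed 100."""
    return max(0.0, conf_a + conf_b - 100.0)


def mutually_overestimated(conf_a: float, conf_b: float, threshold: float = 75.0) -> bool:
    """True when both debaters claim at least `threshold` percent chance of winning."""
    return conf_a >= threshold and conf_b >= threshold


# Made-up (side A, side B) confidence pairs for four debates.
debates = [(80, 78), (60, 45), (75, 90), (70, 30)]

# Share of debates in which both sides claim >= 75% at the same time.
share = sum(mutually_overestimated(a, b) for a, b in debates) / len(debates)
print(overconfidence_excess(80, 78), share)
```

The 75% threshold mirrors the one quoted in pattern (3); any other cutoff can be passed through the `threshold` argument.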


Article 518

Title@2025-06-09 (1): $τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Title: $τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment $τ^2$-Bench: Bewertung von Conversational Agents in einer Dual-Control-Umgebung $2美元 - 伯恩奇:在两管环境中评价相互对话的代理人 2506.07982v1

Authors (5): Victor Barres, Honghua Dong, Soham Ray, Xujie Si, Karthik Narasimhan

Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce $\tau^2$-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $\tau^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.


Article 519

Title@2025-06-09 (1): Automated Capability Discovery via Foundation Model Self-Exploration

Title: Automated Capability Discovery via Foundation Model Self-Exploration Automatisierte Capability Discovery über die Selbsterkundung von Fundamentmodellen 通过基金会自我探索示范模式自动发现能力 2502.07577v3

Authors (3): Cong Lu, Shengran Hu, Jeff Clune

Foundation models have become general-purpose assistants, exhibiting diverse capabilities across numerous domains through training on web-scale data. It remains challenging to precisely characterize even a fraction of the full spectrum of these abilities and potential risks in any new model. Existing evaluation approaches often require significant human effort, and it is taking increasing effort to design ever harder challenges for more capable models. We introduce Automated Capability Discovery (ACD), a framework that designates one foundation model as a scientist to systematically propose open-ended tasks probing the abilities of a subject model (potentially itself). By combining frontier models with ideas from the field of open-endedness, ACD automatically and systematically uncovers a diverse spectrum of surprising capabilities and failures in the subject model. We demonstrate ACD across a range of foundation models (including the GPT, Claude, and Llama series), showing that it automatically generates thousands of distinct tasks, which are then clustered to reveal dozens of broader capability areas and failure modes that would be challenging for any single team to uncover. We further validate our method’s automated scoring with extensive human surveys, observing high agreement between model-generated and human evaluations. By leveraging foundation models’ ability to both create tasks and self-evaluate, ACD is a significant step toward scalable, automated evaluation of novel AI systems. All code and evaluation logs are open-sourced at https://github.com/conglu1997/ACD.


Article 520

Title@2025-06-09 (1): HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

Title: HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization HeuriGym: Agentischer Benchmark für LLM-gekrempelte Heuristik in der kombinatorischen Optimierung HeuriGym: 组合优化中LLM-减压希力学的代理基准 2506.07972v1

Authors (13): Hongzheng Chen, Yingheng Wang, Yaohui Cai, Hins Hu, Jiajie Li, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang

While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.


Article 521

Title@2025-06-09 (1): MIB: A Mechanistic Interpretability Benchmark

Title: MIB: A Mechanistic Interpretability Benchmark MIB: Ein Mechanistischer Interpretations-Benchmark MIB: 机械解释基准 2504.13151v2

Authors (23): Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov

How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and align those features to a task-relevant causal variable. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., non-featurized hidden vectors. These findings illustrate that MIB enables meaningful comparisons, and increases our confidence that there has been real progress in the field.


Article 522

Title@2025-06-09 (1): General-Reasoner: Advancing LLM Reasoning Across All Domains

Title: General-Reasoner: Advancing LLM Reasoning Across All Domains General-Reasoner: Bessere LLM-Reasonierung über alle Domains hinweg 通用Reasoner:在所有领域推推推LLM 2505.14652v5

Authors (6): Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, Wenhu Chen

Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). In particular, the “Zero” reinforcement learning introduced by Deepseek-R1-Zero enables direct RL training of base LLMs without relying on an intermediate supervised fine-tuning stage. Despite these advancements, current work on LLM reasoning mainly focuses on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations and data is scarcer. In this paper, we propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Our key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers curated by web crawling, covering a wide range of disciplines; and (2) developing a generative model-based answer verifier, which replaces traditional rule-based verification with the capability of chain-of-thought and context-awareness. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics. Our comprehensive evaluation across these 12 benchmarks (e.g., MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH and MATH AMC) demonstrates that General-Reasoner outperforms existing baseline methods, achieving robust and generalizable reasoning performance while maintaining superior effectiveness in mathematical reasoning tasks.


Article 523

Title@2025-06-09 (1): Correlated Errors in Large Language Models

Title: Correlated Errors in Large Language Models Korrelierte Fehler in großen Sprachmodellen 大语言模型中相关连的Cor 错误 2506.07962v1

Authors (4): Elliot Kim, Avi Garg, Kenny Peng, Nikhil Garg

Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors – on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring – the latter reflecting theoretical predictions regarding algorithmic monoculture.
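
The agreement statistic quoted above (how often two models give the same answer on items where both are wrong) can be sketched as follows. This is a minimal illustration under the assumption that answers are discrete labels; the function name and the toy data are hypothetical and not taken from the paper.

```python
from typing import Optional, Sequence


def agreement_when_both_err(
    preds_a: Sequence[str], preds_b: Sequence[str], gold: Sequence[str]
) -> Optional[float]:
    """Fraction of items where both models are wrong AND give the same wrong answer.

    Returns None when there is no item on which both models err.
    """
    both_wrong = [
        (a, b) for a, b, g in zip(preds_a, preds_b, gold) if a != g and b != g
    ]
    if not both_wrong:
        return None
    return sum(a == b for a, b in both_wrong) / len(both_wrong)


# Toy multiple-choice data (made up for illustration).
gold = ["A", "B", "C", "D", "A"]
model_a = ["A", "C", "D", "A", "B"]
model_b = ["B", "C", "D", "B", "A"]

rate = agreement_when_both_err(model_a, model_b, gold)
print(rate)
```

Conditioning on joint errors matters: two accurate models agree on most items simply because they are both right, so raw agreement would overstate how correlated their mistakes are.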


Article 524

Title@2025-06-09 (1): Language Models over Canonical Byte-Pair Encodings

Title: Language Models over Canonical Byte-Pair Encodings Sprachmodelle über Canonical Byte-Pair Encodings Canonicic 字节对称编码语言模型 2506.07956v1

Authors (10): Tim Vieira, Tianyu Liu, Clemente Pasti, Yahya Emara, Brian DuSell, Benjamin LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Timothy J. O’Donnell, Ryan Cotterell

Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of $\it{noncanonical}$ token encodings of each character string – these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
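
To illustrate what a noncanonical encoding is, here is a minimal sketch with a toy byte-pair-style tokenizer. The merge list is hypothetical, and the check shown (decode, re-encode, compare) is only a definition-level test of canonicality, not either of the two methods the paper proposes.

```python
# Toy merge rules in priority order (hypothetical, for illustration only).
MERGES = [("a", "b"), ("ab", "c")]
RANK = {pair: i for i, pair in enumerate(MERGES)}


def encode(text: str) -> list[str]:
    """Deterministic BPE-style encoding: repeatedly merge the best-ranked adjacent pair."""
    tokens = list(text)
    while True:
        candidates = [
            (RANK[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in RANK
        ]
        if not candidates:
            return tokens
        _, i = min(candidates)  # lowest rank first, leftmost on ties
        tokens[i : i + 2] = [tokens[i] + tokens[i + 1]]


def is_canonical(tokens: list[str]) -> bool:
    """A token string is canonical iff the tokenizer would produce exactly it."""
    return encode("".join(tokens)) == list(tokens)


print(encode("abc"))                   # the single canonical encoding of "abc"
print(is_canonical(["a", "b", "c"]))   # decodes to "abc" but is never produced
```

All three of `["abc"]`, `["ab", "c"]`, and `["a", "b", "c"]` decode to the same character string, yet only the first can come out of the tokenizer; a model that spreads probability over the other two is making the misallocation the abstract describes.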


Article 525

Title@2025-06-09 (1): CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relationships?

Title: CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relationships? CORDIAL: Können multimodale große Sprachmodelle Kohärenzbeziehungen effektiv verstehen? 多式联运大语言模式能够有效地理解一致性关系吗? 2502.11300v2

Authors (3): Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee

Multimodal Large Language Models (MLLMs) are renowned for their superior instruction-following and reasoning capabilities across diverse problem domains. However, existing benchmarks primarily focus on assessing factual and logical correctness in downstream tasks, with limited emphasis on evaluating MLLMs’ ability to interpret pragmatic cues and intermodal relationships. To address this gap, we assess the competency of MLLMs in performing Multimodal Discourse Analysis (MDA) using Coherence Relations. Our benchmark, CORDIAL, encompasses a broad spectrum of Coherence Relations across 3 different discourse domains at varying levels of granularity. Through our experiments on 10+ MLLMs employing different prompting strategies, we show that even top models like Gemini 1.5 Pro and GPT-4o fail to match the performance of simple classifier-based baselines. This study emphasizes the need to move beyond similarity-based metrics and adopt a discourse-driven framework for evaluating MLLMs, providing a more nuanced assessment of their capabilities. The benchmark and code are available at: https://aashish2000.github.io/CORDIAL/


Article 526

Title@2025-06-09 (1): Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

Title: Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization Easy2Hard-Bench: Standardisierte Schwierigkeits-Labels für Profiling LLM Performance und Generalisierung Easy2Hard-Bench:用于分析LLM性能和通用的标准化困难标签 2409.18433v2

Authors (11): Mucong Ding, Chenghao Deng, Jocelyn Choo, Zichu Wu, Aakriti Agrawal, Avi Schwarzschild, Tianyi Zhou, Tom Goldstein, John Langford, Anima Anandkumar, Furong Huang

While generalization over tasks from easy to hard is crucial for profiling large language models (LLMs), datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still lacking. Aiming to address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions. Each problem within these datasets is annotated with numerical difficulty scores. To systematically estimate problem difficulties, we collect abundant performance data on attempts at each problem by humans in the real world or by LLMs on prominent leaderboards. Leveraging the rich performance data, we apply well-established difficulty ranking systems, such as Item Response Theory (IRT) and Glicko-2 models, to uniformly assign numerical difficulty scores to problems. Moreover, datasets in Easy2Hard-Bench distinguish themselves from previous collections by a higher proportion of challenging problems. Through extensive experiments with six state-of-the-art LLMs, we provide a comprehensive analysis of their performance and generalization capabilities across varying levels of difficulty, with the aim of inspiring future research in LLM generalization. The datasets are available at https://huggingface.co/datasets/furonghuang-lab/Easy2Hard-Bench.
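
As a rough illustration of how Item Response Theory turns performance data into a difficulty score, the sketch below uses the one-parameter (Rasch) model, in which the probability of a correct attempt is a logistic function of ability minus difficulty. It assumes, purely for simplicity, that all solvers share ability 0, which gives a closed-form difficulty from the pass rate; the abstract does not describe the actual fitting procedure, and the function names here are hypothetical.

```python
import math


def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL IRT) probability that a solver answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def difficulty_from_pass_rate(pass_rate: float) -> float:
    """Maximum-likelihood Rasch difficulty when every solver has ability 0.

    Simplifying assumption for illustration: real IRT fitting estimates
    abilities and difficulties jointly.
    """
    if not 0.0 < pass_rate < 1.0:
        raise ValueError("pass_rate must be strictly between 0 and 1")
    return math.log((1.0 - pass_rate) / pass_rate)


# A problem solved in 20% of attempts is harder than one solved in 80%.
hard = difficulty_from_pass_rate(0.2)
easy = difficulty_from_pass_rate(0.8)
print(hard, easy)
```

Under this simplification the score is just the negative log-odds of success, so a 50% pass rate maps to difficulty 0 and the scale is symmetric around it.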


Article 527

Title@2025-06-09 (1): Statistical Hypothesis Testing for Auditing Robustness in Language Models

Title: Statistical Hypothesis Testing for Auditing Robustness in Language Models Statistische Hypothesenprüfung zur Prüfung der Robustheit in Sprachmodellen 语文模式审计强力统计假设测试 2506.07947v1

Authors (3): Paulius Rauba, Qiyao Wei, Mihaela van der Schaar

Consider the problem of testing whether the outputs of a large language model (LLM) system change under an arbitrary intervention, such as an input perturbation or a change of model variant. We cannot simply compare two LLM outputs since they might differ due to the stochastic nature of the system, nor can we compare the entire output distribution due to computational intractability. While methods for analyzing text-based outputs exist, they focus on fundamentally different problems, such as measuring bias or fairness. To this end, we introduce distribution-based perturbation analysis, a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling, enabling tractable inference without restrictive distributional assumptions. The framework (i) is model-agnostic, (ii) supports the evaluation of arbitrary input perturbations on any black-box LLM, (iii) yields interpretable p-values, (iv) supports multiple perturbations via controlled error rates, and (v) provides scalar effect sizes. We demonstrate the usefulness of the framework across multiple case studies, showing how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models. Above all, we see this as a reliable frequentist hypothesis testing framework for LLM auditing.
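
The idea of comparing empirical null and alternative distributions of similarity scores can be sketched with an ordinary permutation test. This is a generic stand-in under stated assumptions, not the paper's procedure: it assumes similarity scores have already been computed (for example, baseline-vs-baseline resamples for the null and baseline-vs-perturbed for the alternative), and the function name and sample numbers are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(
    null_sims: Sequence[float],
    alt_sims: Sequence[float],
    n_perm: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided permutation test on the difference in mean similarity."""
    rng = random.Random(seed)
    observed = abs(mean(null_sims) - mean(alt_sims))
    pooled = list(null_sims) + list(alt_sims)
    k = len(null_sims)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled scores
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Made-up similarity scores: the perturbed responses drift away from baseline.
null_scores = [0.90, 0.92, 0.91, 0.93, 0.89, 0.94, 0.90, 0.92]
alt_scores = [0.60, 0.62, 0.58, 0.65, 0.61, 0.59, 0.63, 0.60]
p = permutation_p_value(null_scores, alt_scores)
print(p)
```

A small p-value here says only that the two score samples are unlikely to share a mean; when several perturbations are tested, the p-values would still need a multiple-comparison correction, as point (iv) of the abstract indicates.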


Article 528

Title@2025-06-09 (1): ProtocolLLM: RTL Benchmark for SystemVerilog Generation of Communication Protocols

Title: ProtocolLLM: RTL Benchmark for SystemVerilog Generation of Communication Protocols ProtocolLLM: RTL Benchmark für SystemVerilog Generierung von Kommunikationsprotokollen 协议LLLM: 系统生成通信协议系统生成的RTL基准 2506.07945v1

Authors (3): Arnav Sheth, Ivaxi Sheth, Mario Fritz

Recent advances in Large Language Models (LLMs) have shown promising capabilities in generating code for general-purpose programming languages. In contrast, their applicability for hardware description languages, particularly for generating synthesizable and functionally correct designs, remains significantly underexplored. HDLs such as SystemVerilog are logic-oriented and demand strict adherence to timing semantics, concurrency, and synthesizability constraints. Moreover, HDL-based design flows encompass a broad set of tasks beyond structural code generation, including testbench development, assertion-based verification, timing closure, and protocol-level integration for on-chip communication. The objective of our paper is to analyze the capabilities of state-of-the-art LLMs in generating SystemVerilog implementations of standard communication protocols, a core component of embedded and System-on-Chip (SoC) architectures. This paper introduces the first benchmark suite targeting four widely used protocols: SPI, I2C, UART, and AXI. We define code generation tasks that capture varying levels of design abstraction and prompt specificity. The generated designs are assessed for syntactic correctness, synthesizability, and functional fidelity via waveform simulation and test benches.


Article 529

Title@2025-06-09 (1): Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features

Title: Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features Verbesserung der sichtsfreundlichen Klassifizierung mit großen multimodalen Modellmerkmalen 以大型多式模型特点加强低热视觉-语言分类 2412.00142v3

Authors (10): Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, Roei Herzig

Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks. Despite strong performance, LMMs’ generative outputs are not specialized for vision-language classification tasks (i.e., tasks with vision-language inputs and discrete labels) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for these tasks is the extraction of useful features from generative LMMs. To overcome this, we propose an approach that leverages multimodal feature extraction from the LMM’s latent space. Toward this end, we present Sparse Attention Vectors (SAVs) – a finetuning-free method that leverages sparse attention head activations (fewer than 5% of the heads) in LMMs as strong feature representations. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of vision-language classification tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.


Article 530

Title@2025-06-09 (1): Quantum Graph Transformer for NLP Sentiment Classification

Title: Quantum Graph Transformer for NLP Sentiment Classification Quantengraphtransformator für NLP-Sentimentklassifikation NLP 感应分类量图图图变形器 2506.07937v1

Authors (4): Shamminuj Aktar, Andreas Bärtschi, Abdel-Hameed A. Badawy, Stephan Eidenbenz

Quantum machine learning is a promising direction for building more efficient and expressive models, particularly in domains where understanding complex, structured data is critical. We present the Quantum Graph Transformer (QGT), a hybrid graph-based architecture that integrates a quantum self-attention mechanism into the message-passing framework for structured language modeling. The attention mechanism is implemented using parameterized quantum circuits (PQCs), which enable the model to capture rich contextual relationships while significantly reducing the number of trainable parameters compared to classical attention mechanisms. We evaluate QGT on five sentiment classification benchmarks. Experimental results show that QGT consistently achieves accuracy higher than or comparable to that of existing quantum natural language processing (QNLP) models, including both attention-based and non-attention-based approaches. When compared with an equivalent classical graph transformer, QGT yields an average accuracy improvement of 5.42% on real-world datasets and 4.76% on synthetic datasets. Additionally, QGT demonstrates improved sample efficiency, requiring nearly 50% fewer labeled samples to reach comparable performance on the Yelp dataset. These results highlight the potential of graph-based QNLP techniques for advancing efficient and scalable language understanding.


Article 531

Title@2025-06-09 (1): Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models

Title: Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models Mimicking oder Reasoning: Multi-Modal In-Context Learning in Vision-Language-Modellen neu denken 模拟或解释:在视觉语言模型中重新思考多模式内学习 2506.07936v1

Authors (7): Chengyue Huang, Yuchen Zhu, Sichen Zhu, Jingyun Xiao, Moises Andrade, Shivang Chopra, Zsolt Kira

Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts. While recent work suggests VLMs can perform multimodal ICL (MM-ICL), studies show they often rely on shallow heuristics – such as copying or majority voting – rather than true task understanding. We revisit this assumption by evaluating VLMs under distribution shifts, where support examples come from a dataset different from the query. Surprisingly, performance often degrades with more demonstrations, and models tend to copy answers rather than learn from them. To investigate further, we propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer. We conduct extensive and comprehensive experiments on both perception- and reasoning-required datasets with open-source VLMs ranging from 3B to 72B and proprietary models such as Gemini 2.0. We conduct controlled studies varying shot count, retrieval method, rationale quality, and distribution. Our results show limited performance sensitivity across these factors, suggesting that current VLMs do not effectively utilize demonstration-level information as intended in MM-ICL.


Article 532

Title@2025-06-09 (1): RONA: Pragmatically Diverse Image Captioning with Coherence Relations

Title: RONA: Pragmatically Diverse Image Captioning with Coherence Relations RONA: Pragmatisch unterschiedliche Bildunterschriften mit Kohärenzbeziehungen RONA: 与一致性关系相结合的多元图像说明 2503.10997v2

Authors (3): Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee

Writing Assistants (e.g., Grammarly, Microsoft Copilot) traditionally generate diverse image captions by employing syntactic and semantic variations to describe image components. However, human-written captions prioritize conveying a central message alongside visual descriptions using pragmatic cues. To enhance caption diversity, it is essential to explore alternative ways of communicating these messages in conjunction with visual content. We propose RONA, a novel prompting strategy for Multi-modal Large Language Models (MLLM) that leverages Coherence Relations as a controllable axis for pragmatic variations. We demonstrate that RONA generates captions with better overall diversity and ground-truth alignment, compared to MLLM baselines across multiple domains. Our code is available at: https://github.com/aashish2000/RONA


Article 533

Title@2025-06-09 (1): Solving Inequality Proofs with Large Language Models

Title: Solving Inequality Proofs with Large Language Models Unequality Proofs mit großen Sprachmodellen lösen 以大语言模式解决不平等证据 2506.07927v1

Authors (7): Jiayi Sheng, Luna Lyu, Jikai Jin, Tony Xia, Alex Gu, James Zou, Pan Lu

Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.


Article 534

Title@2025-06-09 (1): Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

Title: Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks Dynamic-SUPERB Phase-2: Ein kollaborativ erweiterter Benchmark zur Messung der Fähigkeiten von gesprochenen Sprachmodellen mit 180 Aufgaben 动态-SUPERB第2阶段:协作扩大基准,以衡量具有180项任务的语言模式的能力 2411.05361v2

Authors (80): Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Chih-Kai Yang, Wenze Ren, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang, Fabian Ritter-Gutierrez, Kuan-Po Huang, Siddhant Arora, You-Kuan Lin, Ming To Chuang, Eunjung Yeo, Kalvin Chang, Chung-Ming Chien, Kwanghee Choi, Jun-You Wang, Cheng-Hsiu Hsieh, Yi-Cheng Lin, Chee-En Yu, I-Hsiang Chiu, Heitor R. Guimarães, Jionghao Han, Tzu-Quan Lin, Tzu-Yuan Lin, Homu Chang, Ting-Wu Chang, Chun Wei Chen, Shou-Jen Chen, Yu-Hua Chen, Hsi-Chun Cheng, Kunal Dhawan, Jia-Lin Fang, Shi-Xin Fang, Kuan-Yu Fang Chiang, Chi An Fu, Hsien-Fu Hsiao, Ching Yu Hsu, Shao-Syuan Huang, Lee Chen Wei, Hsi-Che Lin, Hsuan-Hao Lin, Hsuan-Ting Lin, Jian-Ren Lin, Ting-Chun Liu, Li-Chun Lu, Tsung-Min Pai, Ankita Pasad, Shih-Yun Shan Kuan, Suwon Shon, Yuxun Tang, Yun-Shao Tsai, Jui-Chiang Wei, Tzu-Chieh Wei, Chengxi Wu, Dien-Ruei Wu, Chao-Han Huck Yang, Chieh-Chi Yang, Jia Qi Yip, Shao-Xiang Yuan, Vahid Noroozi, Zhehuai Chen, Haibin Wu, Karen Livescu, David Harwath, Shinji Watanabe, Hung-yi Lee

Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results show that no model performed well universally. SALMONN-13B excelled in English ASR and Qwen2-Audio-7B-Instruct showed high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We open-source all task data and the evaluation pipeline at https://github.com/dynamic-superb/dynamic-superb.


Article 535

Title@2025-06-09 (1): Uncovering the Functional Roles of Nonlinearity in Memory

Title: Uncovering the Functional Roles of Nonlinearity in Memory Enthüllen der funktionalen Rollen der Nichtlinearität im Gedächtnis 发挥记忆中非线性的作用 2506.07919v1

Authors (2): Manuel Brenner, Georgia Koppe

Memory and long-range temporal processing are core requirements for sequence modeling tasks across natural language processing, time-series forecasting, speech recognition, and control. While nonlinear recurrence has long been viewed as essential for enabling such mechanisms, recent work suggests that linear dynamics may often suffice. In this study, we go beyond performance comparisons to systematically dissect the functional role of nonlinearity in recurrent networks–identifying both when it is computationally necessary, and what mechanisms it enables. We use Almost Linear Recurrent Neural Networks (AL-RNNs), which allow fine-grained control over nonlinearity, as both a flexible modeling tool and a probe into the internal mechanisms of memory. Across a range of classic sequence modeling tasks and a real-world stimulus selection task, we find that minimal nonlinearity is not only sufficient but often optimal, yielding models that are simpler, more robust, and more interpretable than their fully nonlinear or linear counterparts. Our results provide a principled framework for selectively introducing nonlinearity, bridging dynamical systems theory with the functional demands of long-range memory and structured computation in recurrent neural networks, with implications for both artificial and biological neural systems.


Article 536

Title@2025-06-09 (1): LUCIFER: Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement

Title: LUCIFER: Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement LUCIFER: Sprachverständnis und Kontext-infundierter Rahmen für Exploration und Verhaltensverfeinerung LUCIFER: 探讨和行为改进的语言理解和背景介绍框架 2506.07915v1

Authors (3): Dimitris Panagopoulos, Adolfo Perrusquia, Weisi Guo

In dynamic environments, the rapid obsolescence of pre-existing environmental knowledge creates a gap between an agent’s internal model and the evolving reality of its operational context. This disparity between prior and updated environmental valuations fundamentally limits the effectiveness of autonomous decision-making. To bridge this gap, the contextual bias of human domain stakeholders, who naturally accumulate insights through direct, real-time observation, becomes indispensable. However, translating their nuanced, context-rich input into actionable intelligence for autonomous systems remains an open challenge. To address this, we propose LUCIFER (Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement), a domain-agnostic framework that integrates a hierarchical decision-making architecture with reinforcement learning (RL) and large language models (LLMs) into a unified system. This architecture mirrors how humans decompose complex tasks, enabling a high-level planner to coordinate specialised sub-agents, each focused on distinct objectives and temporally interdependent actions. Unlike traditional applications where LLMs are limited to a single role, LUCIFER integrates them in two synergistic roles: as context extractors, structuring verbal stakeholder input into domain-aware representations that influence decision-making through an attention space mechanism aligning LLM-derived insights with the agent’s learning process, and as zero-shot exploration facilitators guiding the agent’s action selection process during exploration. We benchmark various LLMs in both roles and demonstrate that LUCIFER improves exploration efficiency and decision quality, outperforming flat, goal-conditioned policies. Our findings show the potential of context-driven decision-making, where autonomous systems leverage human contextual knowledge for operational success.


Article 537

Title@2025-06-09 (1): MiniCPM4: Ultra-Efficient LLMs on End Devices

Title: MiniCPM4: Ultra-Efficient LLMs on End Devices MiniCPM4: Ultraeffiziente LLMs auf Endgeräten MiniCPM4:末端装置超有效LMs 2506.07900v1

Authors (75): MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Yukun Yan, Jiarui Yuan, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Ge Zhou, Jie Zhou, Wei Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun

This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and a data-efficient ternary LLM, BitCPM. Regarding inference systems, we propose CPM.cu, which integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability.


Article 538

Title@2025-06-09 (1): MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs

Title: MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs MEMOIR: Lebenslange Modellbearbeitung mit minimalem Überschreiben und informierter Aufbewahrung für LLMs MEMOIR: 长寿命模型编辑,以最小覆盖和知情保留LLMM 2506.07899v1

Authors (5): Ke Wang, Yiming Qin, Nikolaos Dimitriadis, Alessandro Favero, Pascal Frossard

Language models deployed in real-world systems often require post-hoc updates to incorporate new or corrected knowledge. However, editing such models efficiently and reliably - without retraining or forgetting previous information - remains a major challenge. Existing methods for lifelong model editing either compromise generalization, interfere with past edits, or fail to scale to long editing sequences. We propose MEMOIR, a novel scalable framework that injects knowledge through a residual memory, i.e., a dedicated parameter module, while preserving the core capabilities of the pre-trained model. By sparsifying input activations through sample-dependent masks, MEMOIR confines each edit to a distinct subset of the memory parameters, minimizing interference among edits. At inference, it identifies relevant edits by comparing the sparse activation patterns of new queries to those stored during editing. This enables generalization to rephrased queries by activating only the relevant knowledge while suppressing unnecessary memory activation for unrelated prompts. Experiments on question answering, hallucination correction, and out-of-distribution generalization benchmarks across LLaMA-3 and Mistral demonstrate that MEMOIR achieves state-of-the-art performance across reliability, generalization, and locality metrics, scaling to thousands of sequential edits with minimal forgetting.
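The masking and retrieval idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not specify how masks are built or compared, so the top-k rule in `topk_mask`, the overlap score, and the `threshold` value are all hypothetical choices, and the activation vectors are invented.

```python
def topk_mask(activations, k):
    """Sample-dependent mask: indices of the k largest-magnitude activations (assumed rule)."""
    order = sorted(range(len(activations)), key=lambda i: abs(activations[i]), reverse=True)
    return frozenset(order[:k])


def overlap(query_mask, stored_mask):
    """Fraction of the query's active indices that the stored mask shares."""
    return len(query_mask & stored_mask) / len(query_mask)


def match_edit(query_mask, stored_masks, threshold):
    """Return the id of the best-matching stored edit, or None if nothing is close enough."""
    best_id, best_score = None, 0.0
    for edit_id, mask in stored_masks.items():
        score = overlap(query_mask, mask)
        if score > best_score:
            best_id, best_score = edit_id, score
    return best_id if best_score >= threshold else None


# Invented activations: one stored edit, a rephrased query, an unrelated prompt.
STORED = {"edit-0": topk_mask([0.9, 0.1, -0.8, 0.05, 0.7, 0.0], k=3)}
REPHRASED = topk_mask([0.8, 0.2, -0.9, 0.1, 0.6, 0.05], k=3)
UNRELATED = topk_mask([0.0, 0.9, 0.1, 0.8, 0.05, 0.7], k=3)

HIT = match_edit(REPHRASED, STORED, threshold=0.6)
MISS = match_edit(UNRELATED, STORED, threshold=0.6)
print(HIT, MISS)
```

Under these assumptions the rephrased query shares its active indices with the stored edit and triggers the memory, while the unrelated prompt matches nothing and falls back to the unedited model.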


Article 539

Title@2025-06-09 (1): Evaluating Large Language Models on the Frame and Symbol Grounding Problems: A Zero-shot Benchmark

Title: Evaluating Large Language Models on the Frame and Symbol Grounding Problems: A Zero-shot Benchmark Bewertung großer Sprachmodelle auf Rahmen- und Symbolerdungsproblemen: Ein Null-Schuss-Benchmark 评价关于框架和符号依据问题的大语言模型:零速基准 2506.07896v1

Authors (1): Shoko Oka

Recent advancements in large language models (LLMs) have revitalized philosophical debates surrounding artificial intelligence. Two of the most fundamental challenges - namely, the Frame Problem and the Symbol Grounding Problem - have historically been viewed as unsolvable within traditional symbolic AI systems. This study investigates whether modern LLMs possess the cognitive capacities required to address these problems. To do so, I designed two benchmark tasks reflecting the philosophical core of each problem, administered them under zero-shot conditions to 13 prominent LLMs (both closed and open-source), and assessed the quality of the models’ outputs across five trials each. Responses were scored along multiple criteria, including contextual reasoning, semantic coherence, and information filtering. The results demonstrate that while open-source models showed variability in performance due to differences in model size, quantization, and instruction tuning, several closed models consistently achieved high scores. These findings suggest that select modern LLMs may be acquiring capacities sufficient to produce meaningful and stable responses to these long-standing theoretical challenges.


Article 540

Title@2025-06-09 (1): Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment

Title: Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment Introspektives Wachstum: LLM-Kompetenz im Technologieurteil automatisch verbessern 前瞻性增长:自动提高LLM在技术判断方面的专门知识 2505.12452v2

Authors (4): Siyang Wu, Honglin Bao, Nadav Kunievsky, James A. Evans

Large language models (LLMs) increasingly demonstrate signs of conceptual understanding, yet much of their internal knowledge remains latent, loosely structured, and difficult to access or evaluate. We propose self-questioning as a lightweight and scalable strategy to improve LLMs’ understanding, particularly in domains where success depends on fine-grained semantic distinctions. To evaluate this approach, we introduce a challenging new benchmark of 1.3 million post-2015 computer science patent pairs, characterized by dense technical jargon and strategically complex writing. The benchmark centers on a pairwise differentiation task: can a model distinguish between closely related but substantively different inventions? We show that compared to placebo scientific information, prompting LLMs to generate and answer their own questions - targeting the background knowledge required for the task - significantly improves performance. These self-generated questions and answers activate otherwise underutilized internal knowledge. Allowing LLMs to retrieve answers from external scientific texts further enhances performance, suggesting that model knowledge is compressed and lacks the full richness of the training data. We also find that chain-of-thought prompting and self-questioning converge, though self-questioning remains more effective for improving understanding of technical concepts. Notably, we uncover an asymmetry in prompting: smaller models often generate more fundamental, more open-ended, better-aligned questions for mid-sized models than large models do, revealing a new strategy for cross-model collaboration. Altogether, our findings establish self-questioning as both a practical mechanism for automatically improving LLM comprehension, especially in domains with sparse and underrepresented knowledge, and a diagnostic probe of how internal and external knowledge are organized.


Article 541

Title@2025-06-09 (1): DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance

Title: DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance DBudgetKV: Dynamisches Budget in KV Cache Compression für optimale Performance DuggitKV:确保最佳业绩的 KV缓存压缩中动态预算 2502.16886v2

Authors (9): Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li

To alleviate the memory burden during inference of large language models (LLMs), numerous studies have focused on compressing the KV cache by exploring aspects such as attention sparsity. These techniques are often designed with a pre-defined KV budget; however, as the optimal budget varies with input length and task type, a fixed budget can result in inconsistent performance when accepting inputs from diverse domains. To address this limitation, we propose a new KV cache compression objective: to always ensure full-cache performance regardless of specific inputs, while maximizing KV cache pruning as much as possible. To achieve this goal, we introduce a novel KV cache compression method dubbed DBudgetKV, which features an attention-based metric that signals when the remaining KV cache is unlikely to match full-cache performance, at which point the pruning process halts. Empirical evaluation spanning diverse context lengths, task types, and model sizes suggests that our method achieves lossless KV pruning effectively and robustly, exceeding a 25% compression ratio on average. Furthermore, our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.
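A dynamic budget of this kind can be pictured as a pruning loop with a stopping signal instead of a fixed cache size. The sketch below is an assumption-laden illustration, not the DBudgetKV metric: it uses the share of attention mass already discarded as a stand-in signal, and `prune_kv`, `max_lost_mass`, and the scores are hypothetical.

```python
def prune_kv(attn_scores, max_lost_mass):
    """Drop the least-attended cache entries until the stand-in signal says stop.

    The stopping rule (fraction of total attention mass removed) is an assumed
    proxy for "the remaining cache is unlikely to match full-cache performance".
    Returns the indices of the entries that are kept.
    """
    total = sum(attn_scores)
    order = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i])
    lost = 0.0
    pruned = set()
    for i in order:
        if (lost + attn_scores[i]) / total > max_lost_mass:
            break  # halting signal: pruning further would discard too much attention
        lost += attn_scores[i]
        pruned.add(i)
    return [i for i in range(len(attn_scores)) if i not in pruned]


# Invented per-entry attention scores for a six-token cache.
SCORES = [0.4, 0.05, 0.3, 0.02, 0.2, 0.03]
KEPT = prune_kv(SCORES, max_lost_mass=0.15)
print(KEPT)
```

The number of retained entries is not set in advance: an input whose attention is spread evenly would stop pruning almost immediately, while a sharply peaked one would be pruned much further.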


Article 542

Title@2025-06-09 (1): Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning

Title: Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning Fokussieren lernen: Kausale Aufmerksamkeit Destillieren über gradient-geführte Token Pruning 学习到焦点:通过渐渐引导的 Token 鲁士宁 修整导致关注的原因 2506.07851v1

Authors (6): Yiju Guo, Wenkai Yang, Zexu Sun, Ning Ding, Zhiyuan Liu, Yankai Lin

Large language models (LLMs) have demonstrated significant improvements in contextual understanding. However, their ability to attend to truly critical information during long-context reasoning and generation still lags behind. Specifically, our preliminary experiments reveal that certain distracting patterns can misdirect the model’s attention during inference, and removing these patterns substantially improves reasoning accuracy and generation quality. We attribute this phenomenon to spurious correlations in the training data, which obstruct the model’s capacity to infer authentic causal instruction-response relationships. This phenomenon may induce redundant reasoning processes, potentially resulting in significant inference overhead and, more critically, the generation of erroneous or suboptimal responses. To mitigate this, we introduce a two-stage framework called Learning to Focus (LeaF) leveraging intervention-based inference to disentangle confounding factors. In the first stage, LeaF employs gradient-based comparisons with an advanced teacher to automatically identify confounding tokens based on causal relationships in the training corpus. Then, in the second stage, it prunes these tokens during distillation to enact intervention, aligning the student’s attention with the teacher’s focus distribution on truly critical context tokens. Experimental results demonstrate that LeaF not only achieves an absolute improvement in various mathematical reasoning and code generation benchmarks but also effectively suppresses attention to confounding tokens during inference, yielding a more interpretable and reliable reasoning model.


Article 543

Title@2025-06-09 (1): Beyond Numeric Rewards: In-Context Dueling Bandits with LLM Agents

Title: Beyond Numeric Rewards: In-Context Dueling Bandits with LLM Agents Über Numerische Belohnungen hinaus: In-Context-Duellbanditen mit LLM-Agenten 除数值奖励之外:与LLM代理商进行文中评分的强盗 2407.01887v4

Authors (4): Fanzeng Xia, Hao Liu, Yisong Yue, Tongxin Li

In-Context Reinforcement Learning (ICRL) is a frontier paradigm to solve Reinforcement Learning (RL) problems in the foundation model era. While ICRL capabilities have been demonstrated in transformers through task-specific training, the potential of Large Language Models (LLMs) out-of-the-box remains largely unexplored. This paper investigates whether LLMs can generalize cross-domain to perform ICRL under the problem of Dueling Bandits (DB), a stateless preference-based RL setting. We find that the top-performing LLMs exhibit a notable zero-shot capacity for relative decision-making, which translates to low short-term weak regret across all DB environment instances by quickly including the best arm in duels. However, an optimality gap still exists between LLMs and classic DB algorithms in terms of strong regret. LLMs struggle to converge and consistently exploit even when explicitly prompted to do so, and are sensitive to prompt variations. To bridge this gap, we propose an agentic flow framework: LLM with Enhanced Algorithmic Dueling (LEAD), which integrates off-the-shelf DB algorithm support with LLM agents through fine-grained adaptive interplay. We show that LEAD has theoretical guarantees inherited from classic DB algorithms on both weak and strong regret. We validate its efficacy and robustness even with noisy and adversarial prompts. The design of such an agentic framework sheds light on how to enhance the trustworthiness of general-purpose LLMs generalized to in-context decision-making tasks.
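The difference between weak and strong regret can be made concrete with a short sketch. It assumes one common convention from the dueling-bandits literature: with a preference matrix `pref[i][j]` giving the probability that arm `i` beats arm `j`, a duel's strong regret is the average gap of both chosen arms to the best arm, and its weak regret is the smaller of the two gaps. The matrix and the duel sequence are invented for illustration.

```python
def duel_regrets(pref, best, duels):
    """Cumulative (strong, weak) regret of a sequence of duels (assumed convention)."""
    strong = 0.0
    weak = 0.0
    for arm_a, arm_b in duels:
        gap_a = pref[best][arm_a] - 0.5  # how much the best arm is preferred over arm_a
        gap_b = pref[best][arm_b] - 0.5
        strong += (gap_a + gap_b) / 2    # both arms must be optimal to reach zero
        weak += min(gap_a, gap_b)        # zero as soon as the best arm is in the duel
    return strong, weak


# Invented 3-arm preference matrix; arm 0 is the best arm.
PREF = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
DUELS = [(1, 2), (0, 2), (0, 0)]
STRONG, WEAK = duel_regrets(PREF, best=0, duels=DUELS)
print(round(STRONG, 6), round(WEAK, 6))
```

Only the first duel adds weak regret, because every later duel includes the best arm; strong regret keeps growing until both chosen arms are the best arm. This matches the abstract's observation that including the best arm quickly keeps weak regret low while a strong-regret gap remains.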


Article 544

Title@2025-06-09 (1): Improving large language models with concept-aware fine-tuning

Title: Improving large language models with concept-aware fine-tuning Verbesserung großer Sprachmodelle mit konzeptbewusster Feinsteuerung 改进大语言模式,进行概念意识微调 2506.07833v1

Authors (4): Michael K. Chen, Xikun Zhang, Jiaxing Huang, Dacheng Tao

Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Take the phrase “ribonucleic acid” as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments (“rib”, “on”, …), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements compared to conventional next-token finetuning methods across diverse tasks, including traditional applications like text summarization and domain-specific ones like de novo protein design. Multi-token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi-token setting to the post-training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen-lab/caft-llm


Article 545

Title@2025-06-09 (1): WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code

Title: WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code WebUIBench: Ein umfassender Benchmark für die Bewertung multimodaler großer Sprachmodelle in WebUI-to-Code WebUIBench:在WebUI-Code中评价多式大语言模式的综合基准 2506.07818v1

Authors (7): Zhiyu Lin, Zhengda Zhou, Zhiyuan Zhao, Tianrui Wan, Yilun Ma, Junyu Gao, Xuelong Li

With the rapid advancement of Generative AI technology, Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires a confluence of multidimensional sub-capabilities to address the challenges of various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding the enhancement of development efficiency. However, existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. In this work, we draw inspiration from the principles of software engineering and further propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K high-quality question-answer pairs derived from over 0.7K real-world websites. The extensive evaluation of 29 mainstream MLLMs uncovers the skill characteristics and various weaknesses that models encounter during the development process.


Article 546

Title@2025-06-09 (1): ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT

Title: ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT ConECT-Datensatz: Überwindung von Datenknappheit im Kontext-Bewusst-E-Commerce MT Conconect 数据集:克服上下文软件电子商务MT中的数据缺乏问题 2506.04929v2

Authors (4): Mikołaj Pokrywka, Wojciech Kusa, Mieszko Rutkowski, Mikołaj Koszowski

Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often have problems with unclear sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end we create ConECT – a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata consisting of 11,400 sentence pairs. We then investigate and compare different methods that are applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product’s category path or image descriptions. The results of our study demonstrate that the incorporation of contextual information leads to an improvement in the quality of machine translation. We make the new dataset publicly available.


Article 547

Title@2025-06-09 (1): WeQA: A Benchmark for Retrieval Augmented Generation in Wind Energy Domain

Title: WeQA: A Benchmark for Retrieval Augmented Generation in Wind Energy Domain WeQA: Ein Benchmark für die retrieval Augmented Generation im Bereich Windenergie WeQA:风能域回收增量发电基准 2408.11800v3

Authors (8): Rounak Meyur, Hung Phan, Sridevi Wagle, Jan Strube, Mahantesh Halappanavar, Sameera Horawalavithana, Anurag Acharya, Sai Munikoti

Wind energy project assessments present significant challenges for decision-makers, who must navigate and synthesize hundreds of pages of environmental and scientific documentation. These documents often span different regions and project scales, covering multiple domains of expertise. This process traditionally demands immense time and specialized knowledge from decision-makers. The advent of Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) approaches offers a transformative solution, enabling rapid, accurate cross-document information retrieval and synthesis. As the landscape of Natural Language Processing (NLP) and text generation continues to evolve, benchmarking becomes essential to evaluate and compare the performance of different RAG-based LLMs. In this paper, we present a comprehensive framework to generate a domain-relevant RAG benchmark. Our framework is based on automatic question-answer generation with Human (domain experts)-AI (LLM) teaming. As a case study, we demonstrate the framework by introducing WeQA, a first-of-its-kind benchmark on the wind energy domain which comprises multiple scientific documents/reports related to environmental aspects of wind energy projects. Our framework systematically evaluates RAG performance using diverse metrics and multiple question types with varying complexity levels, providing a foundation for rigorous assessment of RAG-based systems in complex scientific domains and enabling researchers to identify areas for improvement in domain-specific applications.


Article 548

Title@2025-06-09 (1): Minerva: A Programmable Memory Test Benchmark for Language Models

Title: Minerva: A Programmable Memory Test Benchmark for Language Models Minerva: Ein programmierbarer Speichertest-Benchmark für Sprachmodelle Minirva:语言模型可编程记忆测试基准 2502.03358v2

Authors (4): Menglin Xia, Victor Ruehle, Saravan Rajmohan, Reza Shokri

How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights–failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models’ abilities to use their memory effectively. Our framework extends the range of capability tests beyond the commonly explored (passkey, key-value, needle in the haystack) search, a dominant focus in the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, comparing information in context memory, performing basic operations when inputs are structured into distinct blocks, and maintaining state while operating on memory, simulating real-world data. Additionally, we design composite tests to investigate the models’ ability to perform more complex, integrated tasks. Our benchmark enables an interpretable, detailed assessment of memory capabilities of LLMs.


Article 549

Title@2025-06-09 (1): MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification

Title: MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification MultiMatch: Multihead-Konsistenzregularisierung passend zur semi-überwachten Textklassifikation 多匹配: 用于半有效文本分类的多标题一致性规则化 2506.07801v1

Authors (5): Iustin Sirbu, Robert-Adrian Popovici, Cornelia Caragea, Stefan Trausan-Matu, Traian Rebedea

We introduce MultiMatch, a novel semi-supervised learning (SSL) algorithm combining the paradigms of co-training and consistency regularization with pseudo-labeling. At its core, MultiMatch features a three-fold pseudo-label weighting module designed for three key purposes: selecting and filtering pseudo-labels based on head agreement and model confidence, and weighting them according to the perceived classification difficulty. This novel module enhances and unifies three existing techniques – heads agreement from Multihead Co-training, self-adaptive thresholds from FreeMatch, and Average Pseudo-Margins from MarginMatch – resulting in a holistic approach that improves robustness and performance in SSL settings. Experimental results on benchmark datasets highlight the superior performance of MultiMatch, achieving state-of-the-art results on 9 out of 10 setups from 5 natural language processing datasets and ranking first according to the Friedman test among 19 methods. Furthermore, MultiMatch demonstrates exceptional robustness in highly imbalanced settings, outperforming the second-best approach by 3.26% – and data imbalance is a key factor for many text classification tasks.


Article 550

Title@2025-06-09 (1): Generalized Interpolating Discrete Diffusion

Title: Generalized Interpolating Discrete Diffusion Generalisierte Interpolation diskreter Diffusion 普遍国际刑 化 2503.04482v2

Authors (6): Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann

While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise words. To overcome this, we generalize masked diffusion, deriving a new family of general interpolating discrete diffusion (GIDD) which offers greater flexibility in the design of the noising processes. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD’s flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality and unlocking the ability for the model to correct its own mistakes, an area where autoregressive models notoriously have struggled. Code: https://github.com/dvruette/gidd/
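The hybrid of masking and uniform noise can be pictured with a toy forward-corruption step. This is a simplified illustration and not the GIDD formulation: it assumes each token is corrupted independently with probability `t`, and that a corrupted token becomes a uniformly random vocabulary item with probability `p_uniform` and a mask token otherwise. All names and the vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"


def hybrid_noise(tokens, vocab, t, p_uniform, rng):
    """Corrupt a token sequence with a mix of masking and uniform replacement."""
    noisy = []
    for tok in tokens:
        if rng.random() >= t:
            noisy.append(tok)                # token survives at this noise level
        elif rng.random() < p_uniform:
            noisy.append(rng.choice(vocab))  # uniform noise: swap in a random token
        else:
            noisy.append(MASK)               # masking noise: hide the token
    return noisy


VOCAB = ["the", "cat", "sat", "on", "mat"]
SENTENCE = ["the", "cat", "sat", "on", "the", "mat"]
RNG = random.Random(0)

CLEAN = hybrid_noise(SENTENCE, VOCAB, t=0.0, p_uniform=0.2, rng=RNG)
ALL_MASKED = hybrid_noise(SENTENCE, VOCAB, t=1.0, p_uniform=0.0, rng=RNG)
ALL_UNIFORM = hybrid_noise(SENTENCE, VOCAB, t=1.0, p_uniform=1.0, rng=RNG)
print(CLEAN, ALL_MASKED, ALL_UNIFORM)
```

With pure masking a denoiser only ever fills in mask positions. Once some corrupted positions hold plausible but wrong tokens, the denoiser also has to decide which visible tokens to change, which is the self-correction ability the abstract refers to.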



Article 551

Title@2025-06-09 (1): MIRIAD: Augmenting LLMs with millions of medical query-response pairs

Title: MIRIAD: Augmenting LLMs with millions of medical query-response pairs MIRIAD: Augmenting LLMs mit Millionen von medizinischen Abfrage-Response-Paare MIRIAD: 以数百万对医疗问询对子加强LLMs 2506.06091v2

Authors (10): Qinyue Zheng, Salman Abdullah, Sam Rawal, Cyril Zakka, Sophie Ostmeier, Maximilian Purk, Eduardo Reis, Eric J. Topol, Jure Leskovec, Michael Moor

LLMs are bound to transform healthcare with advanced decision support and flexible chat assistants. However, LLMs are prone to generating inaccurate medical content. To ground LLMs in high-quality medical knowledge, LLMs have been equipped with external knowledge via RAG, where unstructured medical knowledge is split into small text chunks that can be selectively retrieved and integrated into the LLM’s context. Yet, existing RAG pipelines rely on raw, unstructured medical text, which can be noisy, uncurated and difficult for LLMs to effectively leverage. Systematic approaches to organize medical knowledge to best surface it to LLMs are generally lacking. To address these challenges, we introduce MIRIAD, a large-scale, curated corpus of 5,821,948 medical QA pairs, each rephrased from and grounded in a passage from peer-reviewed medical literature using a semi-automated pipeline combining LLM generation, filtering, grounding, and human annotation. Unlike prior medical corpora, which rely on unstructured text, MIRIAD encapsulates web-scale medical knowledge in an operationalized query-response format, which enables more targeted retrieval. Experiments on challenging medical QA benchmarks show that augmenting LLMs with MIRIAD improves accuracy by up to 6.7% compared to unstructured RAG baselines with the same source corpus and with the same amount of retrieved text. Moreover, MIRIAD improved the ability of LLMs to detect medical hallucinations by 22.5 to 37% (increase in F1 score). We further introduce MIRIAD-Atlas, an interactive map of MIRIAD spanning 56 medical disciplines, enabling clinical users to visually explore, search, and refine medical knowledge. MIRIAD promises to unlock a wealth of downstream applications, including medical information retrievers, enhanced RAG applications, and knowledge-grounded chat interfaces, ultimately enabling more reliable LLM applications in healthcare.



Article 552

Title@2025-06-09 (1): LLM Unlearning Should Be Form-Independent

Title: LLM Unlearning Should Be Form-Independent LLM Unlearning sollte formunabhängig sein LLM 重新学习应是独立的形式 2506.07795v1

Authors (3): Xiaotian Ye, Mengqi Zhang, Shu Wu

Large Language Model (LLM) unlearning aims to erase or suppress undesirable knowledge within the model, offering promise for controlling harmful or private information to prevent misuse. However, recent studies highlight its limited efficacy in real-world scenarios, hindering practical adoption. In this study, we identify a pervasive issue underlying many downstream failures: the effectiveness of existing unlearning methods heavily depends on the form of training samples and frequently fails to generalize to alternate expressions of the same knowledge. We formally characterize this problem as Form-Dependent Bias and systematically investigate its specific manifestation patterns across various downstream tasks. To quantify its prevalence and support future research, we introduce ORT, a novel benchmark designed to evaluate the robustness of unlearning methods against variations in knowledge expression. Results reveal that Form-Dependent Bias is both widespread and severe among current techniques. We argue that LLM unlearning should be form-independent to address the endless forms of downstream tasks encountered in real-world security-critical scenarios. Towards this goal, we introduce Rank-one Concept Redirection (ROCR), a novel training-free method, as a promising solution path. ROCR performs unlearning by targeting the invariants in downstream tasks, specifically the activated dangerous concepts. It is capable of modifying model parameters within seconds to redirect the model’s perception of a specific unlearning target concept to another harmless concept. Extensive experiments demonstrate that ROCR significantly improves unlearning effectiveness compared to traditional methods while generating highly natural outputs.



Article 553

Title@2025-06-09 (1): Is poisoning a real threat to LLM alignment? Maybe more so than you think

Title: Is poisoning a real threat to LLM alignment? Maybe more so than you think Ist Vergiftung eine echte Bedrohung für LLM Ausrichtung? Vielleicht mehr, als Sie denken 中毒是否真正威胁到LLM对齐? 2406.12091v4

Authors (5): Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang

Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to a new line of work on Direct Preference Optimization (DPO), which treats RLHF in a supervised learning framework. The increased practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and, in a first-of-its-kind study, compare the effectiveness of preference poisoning methods. We comprehensively analyze DPO’s vulnerabilities under different types of attacks, i.e., backdoor and non-backdoor attacks, and different poisoning methods across a wide array of language models, i.e., Llama 7B, Mistral 7B, and Gemma 7B. We find that, unlike PPO-based methods, which require at least 4% of the data to be poisoned to elicit harmful behavior under backdoor attacks, DPO’s vulnerabilities can be exploited more simply: poisoning as little as 0.5% of the data is enough. We further investigate the potential reasons behind the vulnerability and how well this vulnerability translates into backdoor vs non-backdoor attacks.
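A short sketch may help show why DPO is directly sensitive to the preference labels. It computes the standard DPO loss for a single preference pair; the log-probability values in the usage note are made up, and modelling a poisoned sample as a pair with chosen and rejected responses swapped is an assumption for illustration, not the paper's attack procedure.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model. Computes -log(sigmoid(margin)).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))
```

Swapping the two responses negates the margin, so a flipped pair pushes the policy toward the response that was originally rejected. For example, `dpo_loss(-5.0, -9.0, -7.0, -7.0)` is lower than `dpo_loss(-9.0, -5.0, -7.0, -7.0)`, and the gap between the two equals the margin.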



Article 554

Title@2025-06-09 (1): Swiss Parliaments Corpus Re-Imagined (SPC_R): Enhanced Transcription with RAG-based Correction and Predicted BLEU

Title: Swiss Parliaments Corpus Re-Imagined (SPC_R): Enhanced Transcription with RAG-based Correction and Predicted BLEU Schweizer Parlamente Corpus Re-Imagined (SPC_R): Verbesserte Transkription mit RAG-basierter Korrektur und vorhergesagter BLEU 瑞士议会重新想象(SPC_R):加强以RAG为基础的校正和预测BLEU的跟踪 2506.07726v1

Authors (4): Vincenzo Timmel, Manfred Vogel, Daniel Perruchoud, Reza Kakooee

This paper presents a new long-form release of the Swiss Parliaments Corpus, converting entire multi-hour Swiss German debate sessions (each aligned with the official session protocols) into high-quality speech-text pairs. Our pipeline starts by transcribing all session audio into Standard German using Whisper Large-v3 under high-compute settings. We then apply a two-step GPT-4o correction process: first, GPT-4o ingests the raw Whisper output alongside the official protocols to refine misrecognitions, mainly named entities. Second, a separate GPT-4o pass evaluates each refined segment for semantic completeness. We filter out any segments whose Predicted BLEU score (derived from Whisper’s average token log-probability) and GPT-4o evaluation score fall below a certain threshold. The final corpus contains 801 hours of audio, of which 751 hours pass our quality control. Compared to the original sentence-level SPC release, our long-form dataset achieves a 6-point BLEU improvement, demonstrating the power of combining robust ASR, LLM-based correction, and data-driven filtering for low-resource, domain-specific speech corpora.



Article 555

Title@2025-06-09 (1): Evaluating Zero-Shot Multilingual Aspect-Based Sentiment Analysis with Large Language Models

Title: Evaluating Zero-Shot Multilingual Aspect-Based Sentiment Analysis with Large Language Models Bewertung von Null-Shot-Multilingual-Aspekt-basierten Sentiment-Analysen mit großen Sprachmodellen 以大语言模式评价零热多语种频谱分析 2412.12564v3

Authors (6): Chengyan Wu, Bolei Ma, Zheyu Zhang, Ningyuan Deng, Yanqing He, Yun Xue

Aspect-based sentiment analysis (ABSA), a sequence labeling task, has attracted increasing attention in multilingual contexts. While previous research has focused largely on fine-tuning or training models specifically for ABSA, we evaluate large language models (LLMs) under zero-shot conditions to explore their potential to tackle this challenge with minimal task-specific adaptation. We conduct a comprehensive empirical evaluation of a series of LLMs on multilingual ABSA tasks, investigating various prompting strategies, including vanilla zero-shot, chain-of-thought (CoT), self-improvement, self-debate, and self-consistency, across nine different models. Results indicate that while LLMs show promise in handling multilingual ABSA, they generally fall short of fine-tuned, task-specific models. Notably, simpler zero-shot prompts often outperform more complex strategies, especially in high-resource languages like English. These findings underscore the need for further refinement of LLM-based approaches to effectively address ABSA tasks across diverse languages.
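Of the prompting strategies listed, self-consistency is the easiest to illustrate in code. The sketch below assumes each sampled LLM response has already been parsed into a list of (aspect, polarity) pairs, and it keeps a pair only if more than half of the samples contain it. The parsing step, the voting rule, and the function name are assumptions, not details from the paper.

```python
from collections import Counter


def vote_aspects(samples):
    """Majority-vote over several sampled ABSA outputs.

    samples: list of outputs, each a list of (aspect, polarity) tuples.
    Returns the set of pairs present in more than half of the samples.
    """
    # Count each pair at most once per sample.
    counts = Counter(pair for sample in samples for pair in set(sample))
    needed = len(samples) / 2
    return {pair for pair, n in counts.items() if n > needed}
```

For three samples in which only `("food", "positive")` appears every time, the function returns just that pair and discards aspects that a single sample proposed.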



Article 556

Title@2025-06-09 (1): Multilingual Grammatical Error Annotation: Combining Language-Agnostic Framework with Language-Specific Flexibility

Title: Multilingual Grammatical Error Annotation: Combining Language-Agnostic Framework with Language-Specific Flexibility Mehrsprachige Grammatikfehler-Annotation: Verknüpfung von sprach-agnostischem Framework mit sprachspezifischer Flexibilität 多语种、多语语地名错误说明:将语言-不可知框架与语言特有灵活性相结合 2506.07719v1

Authors (8): Mengyang Qiu, Tran Minh Nguyen, Zihao Huang, Zelong Li, Yang Gu, Qingyu Gao, Siliang Liu, Jungyeul Park

Grammatical Error Correction (GEC) relies on accurate error annotation and evaluation, yet existing frameworks, such as errant, face limitations when extended to typologically diverse languages. In this paper, we introduce a standardized, modular framework for multilingual grammatical error annotation. Our approach combines a language-agnostic foundation with structured language-specific extensions, enabling both consistency and flexibility across languages. We reimplement errant using stanza to support broader multilingual coverage, and demonstrate the framework’s adaptability through applications to English, German, Czech, Korean, and Chinese, ranging from general-purpose annotation to more customized linguistic refinements. This work supports scalable and interpretable GEC annotation across languages and promotes more consistent evaluation in multilingual settings. The complete codebase and annotation tools can be accessed at https://github.com/open-writing-evaluation/jp_errant_bea.



Article 557

Title@2025-06-09 (1): Representation Bending for Large Language Model Safety

Title: Representation Bending for Large Language Model Safety Darstellungsbiegen für große Sprachmodellsicherheit 大语文示范语文安全示范语文代表名单 2504.01550v2

Authors (10): Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi

Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks - ranging from harmful content generation to broader societal harms - pose significant challenges. These risks can be amplified by recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, are still vulnerable as they address specific threats and often fail to generalize across unseen attacks, or require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering - simple vector arithmetic for steering a model’s behavior during inference - to loss-based fine-tuning. Through extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.



Article 558

Title@2025-06-09 (1): Through the Valley: Path to Effective Long CoT Training for Small Language Models

Title: Through the Valley: Path to Effective Long CoT Training for Small Language Models Through the Valley: Weg zum effektiven LangcoT-Training für kleine Sprachmodelle 通过山谷:为小型语言模式提供有效的长 CoT 培训的途径 2506.07712v1

Authors (4): Renjie Luo, Jiaxi Li, Chen Huang, Wei Lu

Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models. While effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; <=3B parameters) trained on limited long CoT data experience significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3 and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of their original performance before fine-tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their original performance prior to fine-tuning. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for multi-step reasoning, they also amplify the risk of compounding mistakes. Furthermore, we find that Long CoT Degradation may negatively impact downstream reinforcement learning (RL), although this can be alleviated by sufficiently scaled supervised fine-tuning (SFT). Our findings challenge common assumptions about the benefits of long CoT training for SLMs and offer practical guidance for building more effective small-scale reasoning models.



Article 559

Title@2025-06-09 (1): EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications

Title: EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications EVADE: Multimodaler Benchmark für Evasive Content Detection in E-Commerce Anwendungen EVADE:电子商务应用中蒸发内容探测的多模式基准 2505.17654v2

Authors (12): Ancheng Xu, Zhihao Yang, Jingpeng Li, Guanghu Yuan, Longze Chen, Liang Yan, Jiehui Zhou, Zhen Qin, Hengyun Chang, Hamid Alinejad-Rokny, Bo Zheng, Min Yang

E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content: inputs (text or images) that superficially comply with platform policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks that induce overt failures, evasive content exploits ambiguity and context, making it far harder to detect. Existing robustness benchmarks provide little guidance for this demanding, real-world challenge. We introduce EVADE, the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories, including body shaping, height growth, and health supplements. Two complementary tasks assess distinct capabilities: Single-Violation, which probes fine-grained reasoning under short prompts, and All-in-One, which tests long-context reasoning by merging overlapping policy rules into unified instructions. Notably, the All-in-One setting significantly narrows the performance gap between partial and full-match accuracy, suggesting that clearer rule definitions improve alignment between human and model judgment. We benchmark 26 mainstream LLMs and VLMs and observe substantial performance gaps: even state-of-the-art models frequently misclassify evasive samples. By releasing EVADE and strong baselines, we provide the first rigorous standard for evaluating evasive-content detection, expose fundamental limitations in current multimodal reasoning, and lay the groundwork for safer and more transparent content moderation systems in e-commerce. The dataset is publicly available at https://huggingface.co/datasets/koenshen/EVADE-Bench.



Article 560

Title@2025-06-09 (1): Training Superior Sparse Autoencoders for Instruct Models

Title: Training Superior Sparse Autoencoders for Instruct Models Training Superior Sparse Autoencoder für Instruct-Modelle 培训高级Sparass Autoencors 教学模型高级导师 2506.07691v1

Authors (8): Jiaming Li, Haoran Ye, Yukun Chen, Xinyue Li, Lei Zhang, Hamid Alinejad-Rokny, Jimmy Chih-Hsien Peng, Min Yang

As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the extraction of human-interpretable features from LLMs. However, existing SAE training methods are primarily designed for base models, resulting in reduced reconstruction quality and interpretability when applied to instruct models. To bridge this gap, we propose Finetuning-aligned Sequential Training (FAST), a novel training method specifically tailored for instruct models. FAST aligns the training process with the data distribution and activation patterns characteristic of instruct models, resulting in substantial improvements in both reconstruction and feature interpretability. On Qwen2.5-7B-Instruct, FAST achieves a mean squared error of 0.6468 in token reconstruction, significantly outperforming baseline methods with errors of 5.1985 and 1.5096. In feature interpretability, FAST yields a higher proportion of high-quality features: for Llama3.2-3B-Instruct, 21.1% scored in the top range, compared to 7.0% and 10.2% for BT(P) and BT(F). Surprisingly, we discover that intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior. Code, data, and 240 trained SAEs are available at https://github.com/Geaming2002/FAST.



Article 561

Title@2025-06-09 (1): Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models

Title: Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models Cross-lingual Collapse: Wie Sprach-Centric Foundation Modelle in großen Sprachmodellen formen 跨语言的折叠:大语言模式中语言中心基础模型的形状 2506.05850v2

Authors (6): Cheonbok Park, Jeonghoon Kim, Joosung Lee, Sanghwan Bae, Jaegul Choo, Kang Min Yoo

We identify Cross-lingual Collapse, a systematic drift in which the chain-of-thought (CoT) of a multilingual language model reverts to its dominant pre-training language even when the prompt is expressed in a different language. Recent large language models (LLMs) trained with reinforcement learning with verifiable reward (RLVR) have achieved strong logical reasoning performance by exposing their intermediate reasoning traces, giving rise to large reasoning models (LRMs). However, the mechanism behind multilingual reasoning in LRMs is not yet fully explored. To investigate the issue, we fine-tune multilingual LRMs with Group-Relative Policy Optimization (GRPO) on translated versions of the GSM8K and SimpleRL-Zoo datasets in three different languages: Chinese, Korean, and Ukrainian. During training, we monitor both task accuracy and language consistency of the reasoning chains. Our experiments reveal three key findings: (i) GRPO rapidly amplifies pre-training language imbalances, leading to the erosion of low-resource languages within just a few hundred updates; (ii) a language consistency reward mitigates this drift but does so at the expense of a drop of roughly 5 to 10 pp in accuracy; and (iii) the resulting language collapse is severely damaging and largely irreversible, as subsequent fine-tuning struggles to steer the model back toward its original target-language reasoning capabilities. Together, these findings point to a remarkable conclusion: not all languages are trained equally for reasoning. Furthermore, our paper sheds light on the roles of reward shaping, data difficulty, and pre-training priors in eliciting multilingual reasoning.
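The abstract does not say how language consistency of a reasoning chain is measured, so the following is only one plausible proxy: the share of alphabetic characters that fall in the target language's script. The Unicode ranges, language codes, and function name are assumptions for illustration. A script check is also coarse, since Cyrillic alone cannot separate Ukrainian from Russian.

```python
# Hypothetical script ranges per target language (Unicode code points).
SCRIPT_RANGES = {
    "ko": [(0xAC00, 0xD7A3), (0x1100, 0x11FF)],  # Hangul syllables, Hangul Jamo
    "zh": [(0x4E00, 0x9FFF)],                    # CJK Unified Ideographs
    "uk": [(0x0400, 0x04FF)],                    # Cyrillic
}


def language_consistency(text, lang):
    """Fraction of alphabetic characters written in the script of `lang`."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    ranges = SCRIPT_RANGES[lang]
    in_script = sum(
        any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in letters
    )
    return in_script / len(letters)
```

A score near 1.0 means the chain stayed in the prompt language, while a score near 0.0 on a Korean prompt would be the kind of drift the abstract calls collapse. Used as a reward term, such a score trades off against task accuracy, which matches finding (ii).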



Article 562

Title@2025-06-09 (1): Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models

Title: Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models Unlösbare Problemerkennung: Robuste Bewertung des Verständnisses für große multimodale Modelle 无法解决的问题探测:大型多式联运模型的强有力理解评估 2403.20331v4

Authors (10): Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Li, Ziwei Liu, Kiyoharu Aizawa

This paper introduces a novel task to evaluate the robust understanding capability of Large Multimodal Models (LMMs), termed Unsolvable Problem Detection (UPD). Multiple-choice question answering (MCQA) is widely used to assess the understanding capability of LMMs, but it does not guarantee that LMMs truly comprehend the answer. UPD assesses the LMM’s ability to withhold answers when encountering unsolvable problems of MCQA, verifying whether the model truly understands the answer. UPD encompasses three problems: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD), covering unsolvable cases like answer-lacking or incompatible choices and image-question mismatches. For the evaluation, we introduce the MM-UPD Bench, a benchmark for assessing performance across various ability dimensions. Our experiments reveal that most LMMs, even those that demonstrate adequate performance on existing benchmarks, struggle significantly with MM-UPD, underscoring a novel aspect of trustworthiness that current benchmarks have overlooked. A detailed analysis shows that LMMs have different bottlenecks, and that chain-of-thought and self-reflection improve performance for LMMs whose bottleneck lies in their LLM capability. We hope our insights will enhance the broader understanding and development of more reliable LMMs. The code is available at https://github.com/AtsuMiyai/UPD.



Article 563

Title@2025-06-09 (1): GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation

Title: GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation GaRAGe: Ein Benchmark mit Erdungsanmerkungen für die RAG-Bewertung GARAGe:一份基准,并附有对RAG评价的根据说明 2506.07671v1

Authors (5): Ionut-Teodor Sorodoc, Leonardo F. R. Ribeiro, Rexhina Blloshmi, Christopher Davis, Adrià de Gispert

We present GaRAGe, a large RAG benchmark with human-curated long-form answers and annotations of each grounding passage, allowing a fine-grained evaluation of whether LLMs can identify relevant grounding when generating RAG answers. Our benchmark contains 2366 questions of diverse complexity, dynamism, and topics, and includes over 35K annotated passages retrieved from both private document sets and the Web, to reflect real-world RAG use cases. This makes it an ideal test bed to evaluate an LLM’s ability to identify only the relevant information necessary to compose a response, or provide a deflective response when there is insufficient information. Evaluations of multiple state-of-the-art LLMs on GaRAGe show that the models tend to over-summarise rather than (a) ground their answers strictly on the annotated relevant passages (reaching at most a Relevance-Aware Factuality Score of 60%), or (b) deflect when no relevant grounding is available (reaching at most 31% true positive rate in deflections). The F1 in attribution to relevant sources is at most 58.9%, and we show that performance is particularly reduced when answering time-sensitive questions and when having to draw knowledge from sparser private grounding sources.
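As a rough illustration of the attribution metric mentioned above, the sketch below computes a set-based F1 between the passages an answer cites and the passages annotated as relevant. The benchmark's exact scoring procedure is not given in the abstract, so the set formulation, passage identifiers, and function name are assumptions.

```python
def attribution_f1(cited, relevant):
    """F1 between cited passage ids and annotated relevant passage ids."""
    cited, relevant = set(cited), set(relevant)
    true_positives = len(cited & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cited)    # cited passages that are relevant
    recall = true_positives / len(relevant)    # relevant passages that were cited
    return 2 * precision * recall / (precision + recall)
```

For an answer citing `p1, p2, p3` when `p2, p3, p4, p5` are relevant, precision is 2/3 and recall is 1/2, so the score is 4/7. Citing everything retrieved raises recall but lowers precision, which is why over-summarising models score poorly on this kind of measure.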



Article 564

Title@2025-06-09 (1): Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

Title: Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought Skywork R1V: Pionierische multimodale Vernunft mit phantasievoller Kette Skywork R1V:先锋式多模式理由与探索链 2504.05599v2

Authors (15): Yi Peng, Peiyu Wang, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou

We introduce Skywork R1V, a multimodal reasoning model extending an R1-series large language model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby enhancing inference efficiency and preventing overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.



Article 565

Title@2025-06-09 (1): SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage

Title: SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage SATA: Ein Paradigma für LLM Jailbreak über einfache assistive Task-Linkage SATA:LLM监狱通过简单辅助任务链接破获的范例 2412.15289v4

Authors (4): Xiaoning Dong, Wenbo Hu, Wei Xu, Tianxing He

Large language models (LLMs) have made significant advancements across various tasks, but their safety alignment remains a major concern. Exploring jailbreak prompts can expose LLMs’ vulnerabilities and guide efforts to secure them. Existing methods primarily design sophisticated instructions for the LLM to follow, or rely on multiple iterations, which could hinder the performance and efficiency of jailbreaks. In this work, we propose a novel jailbreak paradigm, Simple Assistive Task Linkage (SATA), which can effectively circumvent LLM safeguards and elicit harmful responses. Specifically, SATA first masks harmful keywords within a malicious query to generate a relatively benign query containing one or multiple [MASK] special tokens. It then employs a simple assistive task such as a masked language model task or an element lookup by position task to encode the semantics of the masked keywords. Finally, SATA links the assistive task with the masked query to jointly perform the jailbreak. Extensive experiments show that SATA achieves state-of-the-art performance and outperforms baselines by a large margin. Specifically, on the AdvBench dataset, with the masked language model (MLM) assistive task, SATA achieves an overall attack success rate (ASR) of 85% and harmful score (HS) of 4.57, and with the element lookup by position (ELP) assistive task, SATA attains an overall ASR of 76% and HS of 4.43.



Article 566

Title@2025-06-09 (1): Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping

Title: Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping Jenseits von Benchmarks: Ein neuartiges Framework für Domain-spezifische LLM-Evaluierung und Wissensmapping 超越基准范围:具体领域LLM评估和知识绘图新框架 2506.07658v1

Authors (3): Nitin Sharma, Thomas Wolfers, Çağatay Yıldız

The paper addresses two critical challenges in language model (LM) evaluation: creating reliable domain-specific benchmarks and understanding knowledge representation during domain adaptation. We introduce a deterministic pipeline that converts raw domain corpora into completion-type benchmarks without relying on LMs or human curation, eliminating benchmark contamination issues while enabling evaluation on the latest domain data. Our approach generates domain-specific keywords and related word lists using TF and Term TF-IDF methods and constructs prompt-target pairs. We evaluate models by measuring their ability to complete these prompts with the correct domain-specific targets, providing a direct assessment of domain knowledge with low computational cost. Through comprehensive experiments across multiple models (GPT-2 medium/XL, Llama-2/3.1, OLMo-2, Qwen-2, Mistral) and domains, we demonstrate that our benchmark strongly correlates with expert-generated benchmarks while providing a more accurate measure of domain knowledge than traditional perplexity metrics. We reveal that domain adaptation happens rapidly in smaller models (within 500 steps) and illustrate a new approach to domain knowledge evaluation in base models during training for early stopping. By extending mechanistic analysis to domain adaptation, we discover that initial-to-mid layers are primarily responsible for attribute extraction, while later layers focus on next-token prediction. Furthermore, we show that during adaptation, forgetting begins in the middle layers, where attribute extraction happens, and is amplified in later layers. Our work provides both a practical evaluation methodology for domain-specific LMs and novel insights into knowledge representation during adaptation, with implications for more efficient fine-tuning strategies and targeted approaches to mitigate catastrophic forgetting.
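The keyword and prompt-target steps can be pictured with a small deterministic sketch. It uses a textbook TF-IDF score and cuts a sentence just before a keyword to form a completion prompt. The whitespace tokenizer, the scoring details, and both function names are assumptions, and the paper's actual pipeline may differ.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k TF-IDF terms for each document (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    n_docs = len(docs)
    keywords = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by score, breaking ties alphabetically to stay deterministic.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:top_k])
    return keywords


def prompt_target(sentence, keyword):
    """Split a sentence into (prompt, target) at the first keyword occurrence."""
    words = sentence.split()
    idx = [w.lower() for w in words].index(keyword)  # ValueError if absent
    return " ".join(words[:idx]), words[idx]
```

A model is then scored on whether it completes the prompt with the target. Because nothing in the construction calls an LM or a human annotator, the benchmark can be regenerated from fresh corpora.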



Article 567

Title@2025-06-09 (1): Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

Title: Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks Bewertung von Dialect Fairness und Robustheit von großen Sprachmodellen in mit Gründen versehenen Aufgaben 评估大语言模式在说明理由任务中的公平性和强度 2410.11005v3

Authors (10): Fangru Lin, Shaoguang Mao, Emanuele La Malfa, Valentin Hofmann, Adrian de Wynter, Xun Wang, Si-Qing Chen, Michael Wooldridge, Janet B. Pierrehumbert, Furu Wei

Language is not monolithic. While benchmarks, including those designed for multiple languages, are often used as proxies to evaluate the performance of Large Language Models (LLMs), they tend to overlook the nuances of within-language variation and thus fail to model the experience of speakers of non-standard dialects. Focusing on African American Vernacular English (AAVE), we present the first study aimed at objectively assessing the fairness and robustness of LLMs in handling dialects across canonical reasoning tasks, including algorithm, math, logic, and integrated reasoning. We introduce ReDial (Reasoning with Dialect Queries), a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. With ReDial, we evaluate widely used LLMs, including GPT, Claude, Llama, Mistral, and the Phi model families. Our findings reveal that almost all of these widely used models show significant brittleness and unfairness to queries in AAVE. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries. Moreover, it highlights how mainstream LLMs provide unfair service to dialect speakers in reasoning tasks, laying a critical foundation for future research.



Article 568

Title@2025-06-09 (1): Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation

Title: Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation Transcript-prompted Whisper mit Wörterbuch-erweiterte Decodierung für japanische Sprachannotation 为日语语音注解配有字典加加码解码的字典插口器 2506.07646v1

Authors (5): Rui Hu, Xiaolong Lin, Jiawang Liu, Shixi Huang, Zhenpeng Zhan

In this paper, we propose a method for annotating phonemic and prosodic labels on a given audio-transcript pair, aimed at constructing Japanese text-to-speech (TTS) datasets. Our approach involves fine-tuning a large-scale pre-trained automatic speech recognition (ASR) model, conditioned on ground truth transcripts, to simultaneously output phrase-level graphemes and annotation labels. To further correct errors in phonemic labeling, we employ a decoding strategy that utilizes dictionary prior knowledge. The objective evaluation results demonstrate that our proposed method outperforms previous approaches relying solely on text or audio. The subjective evaluation results indicate that the naturalness of speech synthesized by the TTS model, trained with labels annotated using our method, is comparable to that of a model trained with manual annotations.



Article 569

Title@2025-06-09 (1): Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models

Title: Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models Bewertung von LLMs Robustheit in weniger qualifizierten Sprachen mit Proxy-Modellen 具有代用模式的低资源语言的威力评价LLMs 2506.07645v1

Authors (3): Maciej Chrabąszcz, Katarzyna Lorenc, Karolina Seweryn

Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks in recent years. However, their susceptibility to jailbreaks and perturbations necessitates additional evaluations. Many LLMs are multilingual, but safety-related training data contains mainly high-resource languages like English. This can leave them vulnerable to perturbations in low-resource languages such as Polish. We show how surprisingly strong attacks can be cheaply created by altering just a few characters and using a small proxy model for word importance calculation. We find that these character and word-level attacks drastically alter the predictions of different LLMs, suggesting a potential vulnerability that can be used to circumvent their internal safety mechanisms. We validate our attack construction methodology on Polish, a low-resource language, and find potential vulnerabilities of LLMs in this language. Additionally, we show how it can be extended to other languages. We release the created datasets and code for further research.



Article 570

Title@2025-06-09 (1): TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review

Title: TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review TreeReview: Ein dynamischer Baum der Fragen Rahmen für tiefe und effiziente LLM-basierte Scientific Peer Review 树审查:深入、高效的LLM科学同行审评问题框架动态树 2506.07642v1

Authors (7): Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Zhijiang Guo, Ngai Wong

While Large Language Models (LLMs) have shown significant potential in assisting peer review, current methods often struggle to generate thorough and insightful reviews while maintaining efficiency. In this paper, we propose TreeReview, a novel framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview first constructs a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions and then resolves the question tree by iteratively aggregating answers from leaf to root to get the final review. Crucially, we incorporate a dynamic question expansion mechanism to enable deeper probing by generating follow-up questions when needed. We construct a benchmark derived from ICLR and NeurIPS venues to evaluate our method on full review generation and actionable feedback comments generation tasks. Experimental results of both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared to computationally intensive approaches. Our code and benchmark dataset are available at https://github.com/YuanChang98/tree-review.
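The decompose-then-aggregate flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `decompose`, `answer_leaf`, and `aggregate` are hypothetical stand-ins for LLM calls, the toy question tree is invented for the example, and the dynamic question expansion step is omitted.

```python
# Hypothetical toy tree; in a real system the sub-questions would come from an LLM.
TOY_TREE = {
    "Is the paper sound?": [
        "Is the method correct?",
        "Are the experiments adequate?",
    ],
}

def decompose(question):
    # Stand-in for an LLM call that splits a question into sub-questions.
    return TOY_TREE.get(question, [])

def answer_leaf(question):
    # Stand-in for an LLM call that answers a fine-grained question directly.
    return f"[answer to: {question}]"

def aggregate(question, sub_answers):
    # Stand-in for an LLM call that merges child answers into a parent answer.
    return f"{question} <- " + " + ".join(sub_answers)

def resolve(question, depth=0, max_depth=2):
    # Top-down: decompose until a leaf (or the depth limit) is reached.
    subs = decompose(question) if depth < max_depth else []
    if not subs:
        return answer_leaf(question)
    # Bottom-up: aggregate the children's answers from leaf to root.
    sub_answers = [resolve(q, depth + 1, max_depth) for q in subs]
    return aggregate(question, sub_answers)

review = resolve("Is the paper sound?")
```

The depth limit is an assumption added here to guarantee termination; the abstract does not say how the framework bounds recursion.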



Article 571

Title@2025-06-09 (1): Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline

Title: Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline Unblocking Fine-Grained Evaluation von detaillierten Bildunterschriften: Eine erklärende Auto-Rater und Kritik-und-Revise-Pipeline 对详细说明进行无阻的精细精细评价:解释自动拉动器和批评和校正管道 2506.07631v1

Authors (7): Brian Gordon, Yonatan Bitton, Andreea Marzoca, Yasumasa Onoe, Xiao Wang, Daniel Cohen-Or, Idan Szpektor

Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors, being designed for shorter texts or lacking datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The VNLI-Critique-driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments (e.g., 0.98 Spearman). (3) An innovative Critic-and-Revise pipeline, where critiques from VNLI-Critique guide LLM-based corrections, achieves substantial improvements in caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work offers a crucial benchmark alongside practical tools, designed to significantly elevate the standards for fine-grained evaluation and foster the improvement of VLM image understanding. Project page: https://google.github.io/unblocking-detail-caption



Article 572

Title@2025-06-09 (1): Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation

Title: Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation Intent Matters: Verbesserung der KI-Tutoring mit feinkörniger pädagogischer Intent-Annotation 内容:加强AI 与精细教育学本意注解相结合的教学 2506.07626v1

Authors (2): Kseniia Petukhova, Ekaterina Kochmar

Large language models (LLMs) hold great promise for educational applications, particularly in intelligent tutoring systems. However, effective tutoring requires alignment with pedagogical strategies - something current LLMs lack without task-specific adaptation. In this work, we explore whether fine-grained annotation of teacher intents can improve the quality of LLM-generated tutoring responses. We focus on MathDial, a dialog dataset for math instruction, and apply an automated annotation framework to re-annotate a portion of the dataset using a detailed taxonomy of eleven pedagogical intents. We then fine-tune an LLM using these new annotations and compare its performance to models trained on the original four-category taxonomy. Both automatic and qualitative evaluations show that the fine-grained model produces more pedagogically aligned and effective responses. Our findings highlight the value of intent specificity for controlled text generation in educational settings, and we release our annotated data and code to facilitate further research.



Article 573

Title@2025-06-09 (1): Can Perplexity Predict Fine-tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali

Title: Can Perplexity Predict Fine-tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali Kann Perplexity Fine-Tuning Performance voraussagen? Eine Untersuchung von Tokenization-Effekten auf sequentielle Sprachmodelle für Nepali 重复性预测微调性能能否? 对尼泊尔的序列语言模式的收缩性影响调查i 2404.18071v2

Authors (4): Nishant Luitel, Nirajan Bekoju, Anand Kumar Sah, Subarna Shakya

The impact of subword tokenization on language model performance is well-documented for perplexity, with finer granularity consistently reducing this intrinsic metric. However, research on how different tokenization schemes affect a model’s understanding capabilities remains limited, particularly for non-Latin script languages. Addressing this gap, we conducted a comprehensive evaluation of six distinct tokenization strategies by pretraining transformer-based language models for Nepali and evaluating their performance across multiple downstream tasks. While recent prominent models like GPT, RoBERTa, Claude, LLaMA, Mistral, Falcon, and MPT have adopted byte-level BPE tokenization, our findings demonstrate that for Nepali, SentencePiece tokenization consistently yields superior results on understanding-based tasks. Unlike previous studies that primarily focused on BERT-based architectures, our research specifically examines sequential transformer models, providing valuable insights for language model development in low-resource languages and highlighting the importance of tokenization strategy beyond perplexity reduction.



Article 574

Title@2025-06-09 (1): LoRMA: Low-Rank Multiplicative Adaptation for LLMs

Title: LoRMA: Low-Rank Multiplicative Adaptation for LLMs LoRMA: Multiplikative Anpassung mit niedrigem Rank für LLMs LoRMA:LLMML的低Rank多复制性适应 2506.07621v1

Authors (3): Harsh Bihany, Shubham Patel, Ashutosh Modi

Large Language Models have shown remarkable capabilities in the NLP domain. Their effectiveness can mainly be attributed to their ability to adapt to an array of downstream tasks. However, generally, full fine-tuning is a computationally expensive job. To mitigate this, many techniques have been developed that prime efficiency, a prominent one being Low-Rank Adaptation (LoRA). However, LoRA and its variants employ re-parametrized additive updates. In this paper, we propose Low-Rank Multiplicative Adaptation (LoRMA), which shifts the paradigm of additive updates to a richer space of matrix multiplicative transformations. We tackle challenges such as computational complexity and rank bottleneck of matrix multiplication by effectively re-ordering operations and introducing rank inflation strategies. We conduct extensive experiments to demonstrate the effectiveness of our approach in terms of various evaluation metrics.
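The operation re-ordering mentioned above can be illustrated with a minimal sketch. It assumes the multiplicative update acts as (B A) W on an input x, with B of shape d x r and A of shape r x d; this form, the shapes, and the helper names are assumptions made for the example and are not taken from the paper's code. Evaluating B(A(Wx)) right to left avoids ever forming the d x d product B A.

```python
def matvec(M, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    # Matrix-matrix product for list-of-rows matrices.
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

# Toy sizes: d = 3, r = 1 (illustrative values only).
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 1.0],
     [2.0, 0.0, 1.0]]          # frozen d x d weight
B = [[1.0], [0.5], [2.0]]      # d x r low-rank factor
A = [[1.0, 0.0, 1.0]]          # r x d low-rank factor
x = [1.0, 2.0, 3.0]

# Naive order: build the d x d matrices first, then apply them to x.
naive = matvec(matmul(matmul(B, A), W), x)

# Re-ordered: only matrix-vector products, never a d x d intermediate.
reordered = matvec(B, matvec(A, matvec(W, x)))
```

Both orders give the same vector by associativity. The sketch also makes the rank bottleneck visible: because B A has rank at most r, so does (B A) W, which is why the abstract mentions rank inflation strategies.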



Article 575

Title@2025-06-09 (1): Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds

Title: Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds Retrieval-Augmented Generation as Noisy In-Context Learning: Eine einheitliche Theorie und Risikogrenzen 作为有噪音的文内学习:统一理论和风险波 2506.03100v3

Authors (5): Yang Guo, Yutian Tao, Yifei Ming, Robert D. Nowak, Yingyu Liang

Retrieval-augmented generation (RAG) has seen many empirical successes in recent years by aiding the LLM with external knowledge. However, its theoretical aspect has remained mostly unexplored. In this paper, we propose the first finite-sample generalization bound for RAG in in-context linear regression and derive an exact bias-variance tradeoff. Our framework views the retrieved texts as query-dependent noisy in-context examples and recovers the classical in-context learning (ICL) and standard RAG as the limit cases. Our analysis suggests that an intrinsic ceiling on generalization error exists on RAG as opposed to the ICL. Furthermore, our framework is able to model retrieval both from the training data and from external corpora by introducing uniform and non-uniform RAG noise. In line with our theory, we show the sample efficiency of ICL and RAG empirically with experiments on common QA benchmarks, such as Natural Questions and TriviaQA.



Article 576

Title@2025-06-09 (1): Vuyko Mistral: Adapting LLMs for Low-Resource Dialectal Translation

Title: Vuyko Mistral: Adapting LLMs for Low-Resource Dialectal Translation Vuyko Mistral: Anpassung von LLMs für ressourcenarme Dialektalübersetzung Vuyko Mistral: 调适低资源对流翻译LLMs 2506.07617v1

Authors (3): Roman Kyslyi, Yuliia Maksymiuk, Ihor Pysmennyi

In this paper we introduce the first effort to adapt large language models (LLMs) to the Ukrainian dialect (in our case Hutsul), a low-resource and morphologically complex dialect spoken in the Carpathian Highlands. We created a parallel corpus of 9852 dialect-to-standard Ukrainian sentence pairs and a dictionary of 7320 dialectal word mappings. We also addressed data shortage by proposing an advanced Retrieval-Augmented Generation (RAG) pipeline to generate synthetic parallel translation pairs, expanding the corpus with 52142 examples. We have fine-tuned multiple open-source LLMs using LoRA and evaluated them on a standard-to-dialect translation task, also comparing with few-shot GPT-4o translation. In the absence of human annotators, we adopt a multi-metric evaluation strategy combining BLEU, chrF++, TER, and LLM-based judgment (GPT-4o). The results show that even small (7B) fine-tuned models outperform zero-shot baselines such as GPT-4o across both automatic and LLM-evaluated metrics. All data, models, and code are publicly released at: https://github.com/woters/vuyko-hutsul



Article 577

Title@2025-06-09 (1): APE: Selective Fine-tuning with Acceptance Criteria for Language Model Adaptation

Title: APE: Selective Fine-tuning with Acceptance Criteria for Language Model Adaptation APE: Selektive Feinabstimmung mit Annahmekriterien für Sprachmodellanpassung APE:有选择地调整语言模式适应接受标准 2505.19912v2

Authors (1): Javier Marín

We present Adjacent Possible Exploration (APE), a selective fine-tuning method for adapting large language models that systematically explores parameter modifications while maintaining model stability. Inspired by evolutionary optimization principles, APE evaluates multiple candidate parameter updates through fine-tuning on small data subsets and accepts only those exceeding a performance threshold. Unlike standard fine-tuning that follows single gradient directions, APE implements a filtered selection process that prevents destabilizing parameter changes while enabling systematic improvement. Our method achieves 33.9% BLEU improvement and 36.2% perplexity reduction on news summarization tasks while using minimal computational resources. The approach provides a practical framework for controlled model adaptation that balances performance gains with representational stability.
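The accept-only-above-threshold idea can be sketched on a toy problem. This is a loose illustration, not the paper's procedure: the single scalar parameter, the quadratic `score` function, and the random perturbations (used here in place of fine-tuning on small data subsets) are all assumptions made to keep the example self-contained.

```python
import random

def score(theta):
    # Toy stand-in for a validation metric; higher is better, peak at theta = 3.
    return -(theta - 3.0) ** 2

def selective_update(theta, steps=200, n_candidates=4, threshold=1e-3, seed=0):
    rng = random.Random(seed)
    history = [score(theta)]
    for _ in range(steps):
        # Propose several candidate updates (stand-in for short fine-tuning runs).
        candidates = [theta + rng.gauss(0.0, 0.1) for _ in range(n_candidates)]
        best = max(candidates, key=score)
        # Accept the best candidate only if its gain exceeds the threshold.
        if score(best) - score(theta) > threshold:
            theta = best
        history.append(score(theta))
    return theta, history

theta, history = selective_update(0.0)
```

Because a candidate is accepted only when it beats the current score by the threshold, the recorded score never decreases, which is the stability property the filtered selection is meant to provide.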



Article 578

Title@2025-06-09 (1): PolitiSky24: U.S. Political Bluesky Dataset with User Stance Labels

Title: PolitiSky24: U.S. Political Bluesky Dataset with User Stance Labels PolitiSky24: US-politischer Bluesky-Datensatz mit Benutzer-Stance-Etiketten 政治SpolitSky24:美国政治蓝sky数据集, 2506.07606v1

Authors (4): Peyman Rostami, Vahid Rahimzadeh, Ali Adibi, Azadeh Shakery

Stance detection identifies the viewpoint expressed in text toward a specific target, such as a political figure. While previous datasets have focused primarily on tweet-level stances from established platforms, user-level stance resources, especially on emerging platforms like Bluesky, remain scarce. User-level stance detection provides a more holistic view by considering a user’s complete posting history rather than isolated posts. We present the first stance detection dataset for the 2024 U.S. presidential election, collected from Bluesky and centered on Kamala Harris and Donald Trump. The dataset comprises 16,044 user-target stance pairs enriched with engagement metadata, interaction graphs, and user posting histories. PolitiSky24 was created using a carefully evaluated pipeline combining advanced information retrieval and large language models, which generates stance labels with supporting rationales and text spans for transparency. The labeling approach achieves 81% accuracy with scalable LLMs. This resource addresses gaps in political stance analysis through its timeliness, open-data nature, and user-level perspective. The dataset is available at https://doi.org/10.5281/zenodo.15616911



Article 579

Title@2025-06-09 (1): Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque

Title: Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque Unterweisung großer Sprachmodelle für ressourcenarme Sprachen: Eine systematische Studie für Baskisch 指导低资源语言大语言模式:巴斯克人系统研究 2506.07597v1

Authors (12): Oscar Sainz, Naiara Perez, Julen Etxaniz, Joseba Fernandez de Landa, Itziar Aldabe, Iker García-Ferrero, Aimar Zabala, Ekhi Azurmendi, German Rigau, Eneko Agirre, Mikel Artetxe, Aitor Soroa

Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, with synthetic instructions yielding robust models, and, most importantly, that using an instruction-tuned model as backbone outperforms using a base non-instructed model, with results improving when scaling up. Using Llama 3.1 instruct 70B as backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque data apart from the 1.2B word corpora. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.



Article 580

Title@2025-06-09 (1): ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

Title: ParallelComp: Parallel Long-Context Compressor for Length Extrapolation ParallelComp: Paralleler Long-Context-Kompressor für Längen-Extrapolation 平行Comp: 长度外推法的平行长文本压缩器 2502.14317v2

Authors (10): Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong

Extrapolating ultra-long contexts (text length >128K) remains a major challenge for large language models (LLMs), as most training-free extrapolation methods are not only severely limited by memory bottlenecks, but also suffer from the attention sink, which restricts their scalability and effectiveness in practice. In this work, we propose ParallelComp, a parallel long-context compression method that effectively overcomes the memory bottleneck, enabling 8B-parameter LLMs to extrapolate from 8K to 128K tokens on a single A100 80GB GPU in a training-free setting. ParallelComp splits the input into chunks, dynamically evicting redundant chunks and irrelevant tokens, supported by a parallel KV cache eviction mechanism. Importantly, we present a systematic theoretical and empirical analysis of attention biases in parallel attention (including the attention sink, recency bias, and middle bias) and reveal that these biases exhibit distinctive patterns under ultra-long context settings. We further design a KV cache eviction technique to mitigate this phenomenon. Experimental results show that ParallelComp enables an 8B model (trained on 8K context) to achieve 91.17% of GPT-4’s performance under ultra-long contexts, outperforming closed-source models such as Claude-2 and Kimi-Chat. We achieve a 1.76x improvement in chunk throughput, thereby achieving a 23.50x acceleration in the prefill stage with negligible performance loss, and pave the way for scalable and robust ultra-long context extrapolation in LLMs. We release the code at https://github.com/menik1126/ParallelComp.



Article 581

Title@2025-06-09 (1): Attention with Trained Embeddings Provably Selects Important Tokens

Title: Attention with Trained Embeddings Provably Selects Important Tokens Aufmerksamkeit bei trainierten Einbettungen wählt wahrscheinlich wichtige Token aus 与经过训练的嵌入器的关注 2505.17282v2

Authors (4): Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli

Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\texttt{Softmax}( p^\top E_X^\top ) E_X v = \frac{ \sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top v}{\sum_{j=1}^T \exp(p^\top E_{x_{j}}) }$, where $E_X = [ E_{x_1} , \dots, E_{x_T} ]^\top$ contains the embeddings of the input sequence, $p$ is the embedding of the $\mathrm{\langle cls \rangle}$ token and $v$ the output vector. First, we show that, already after a single step of gradient training with the logistic loss, the embeddings $E_X$ capture the importance of tokens in the dataset by aligning with the output vector $v$ proportionally to the frequency with which the corresponding tokens appear in the dataset. Then, after training $p$ via gradient flow until convergence, the softmax selects the important tokens in the sentence (i.e., those that are predictive of the label), and the resulting $\mathrm{\langle cls \rangle}$ embedding maximizes the margin for such a selection. Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory.
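The model output defined above, the softmax-weighted average of the scalars $E_{x_i}^\top v$ with attention scores $p^\top E_{x_i}$, can be evaluated directly. The sketch below is a plain-Python illustration; the toy embeddings and the function name are assumptions chosen for the example, not values from the paper.

```python
import math

def attention_output(E_X, p, v):
    # E_X: list of T token embeddings, p: <cls> embedding, v: output vector.
    scores = [sum(pi * ei for pi, ei in zip(p, e)) for e in E_X]   # p^T E_{x_i}
    m = max(scores)                                                # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [w / total for w in exps]                            # softmax over tokens
    values = [sum(ei * vi for ei, vi in zip(e, v)) for e in E_X]   # E_{x_i}^T v
    output = sum(w * val for w, val in zip(weights, values))
    return output, weights

# Toy example: three tokens with 2-dimensional embeddings.
E_X = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
p = [2.0, 0.0]
v = [1.0, -1.0]
output, weights = attention_output(E_X, p, v)
```

In this toy input the third token has the largest score $p^\top E_{x_i}$, so it receives most of the attention weight and the output is pulled toward its value $E_{x_3}^\top v = 3$.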



Article 582

Title@2025-06-09 (1): Beyond the Sentence: A Survey on Context-Aware Machine Translation with Large Language Models

Title: Beyond the Sentence: A Survey on Context-Aware Machine Translation with Large Language Models Beyond the Sentence: Eine Umfrage zu Kontext-Bewusst Maschinelle Übersetzung mit großen Sprachmodellen 句子之后:用大语言模型进行的背景软件机器翻译调查 2506.07583v1

Authors (4): Ramakrishna Appicharla, Baban Gain, Santanu Pal, Asif Ekbal

Despite the popularity of the large language models (LLMs), their application to machine translation is relatively underexplored, especially in context-aware settings. This work presents a literature review of context-aware translation with LLMs. The existing works utilise prompting and fine-tuning approaches, with few focusing on automatic post-editing and creating translation agents for context-aware machine translation. We observed that the commercial LLMs (such as ChatGPT and Tower LLM) achieved better results than the open-source LLMs (such as Llama and Bloom LLMs), and prompt-based approaches serve as good baselines to assess the quality of translations. Finally, we present some interesting future directions to explore.



Article 583

Title@2025-06-09 (1): HSF: Defending against Jailbreak Attacks with Hidden State Filtering

Title: HSF: Defending against Jailbreak Attacks with Hidden State Filtering HSF: Verteidigung gegen Jailbreak Attacken mit versteckten Zustand Filterung HSF: 防范以隐蔽国家过滤手段侵入监狱的袭击 2409.03788v2

Authors (4): Cheng Qian, Hainan Zhang, Lei Sha, Zhiming Zheng

With the growing deployment of LLMs in daily applications like chatbots and content generation, efforts to ensure outputs align with human values and avoid harmful content have intensified. However, increasingly sophisticated jailbreak attacks threaten this alignment, aiming to induce unsafe outputs. Current defense efforts either focus on prompt rewriting or detection, which are limited in effectiveness due to the varied designs of jailbreak prompts, or on output control and detection, which are computationally expensive as they require LLM inference. Therefore, designing a pre-inference defense method that resists diverse jailbreak prompts is crucial for preventing LLM jailbreak attacks. We observe that jailbreak attacks, safe queries, and harmful queries exhibit different clustering patterns within the LLM’s hidden state representation space. This suggests that by leveraging the LLM’s hidden state representational capabilities, we can analyze the LLM’s forthcoming behavior and proactively intervene for defense. In this paper, we propose a jailbreak attack defense strategy based on a Hidden State Filter (HSF), a lossless architectural defense mechanism that enables the model to preemptively identify and reject adversarial inputs before the inference process begins. We activate its defensive potential through an additional plugin module, effectively framing the defense task as a classification problem. Experimental results on two benchmark datasets, utilizing three different LLMs, show that HSF significantly enhances resilience against six cutting-edge jailbreak attacks. It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries, with negligible inference overhead, and outperforms defense baselines. Our code and data are available at https://anonymous.4open.science/r/Hidden-State-Filtering-8652/



Article 584

Title@2025-06-09 (1): Cool-Fusion: Fuse Large Language Models without Training

Title: Cool-Fusion: Fuse Large Language Models without Training Cool-Fusion: Große Sprachmodelle ohne Training 冷冻功能:没有培训的引信大语言模型 2407.19807v2

Authors (6): Cong Liu, Xiaojun Quan, Yan Pan, Liang Lin, Weigang Wu, Xu Chen

We focus on the problem of fusing two or more heterogeneous large language models (LLMs) to leverage their complementary strengths. One of the challenges of model fusion is high computational load, specifically in fine-tuning or aligning vocabularies. To address this, we propose Cool-Fusion, a simple yet effective approach that fuses the knowledge of source LLMs without requiring training. Unlike ensemble methods, Cool-Fusion is applicable to any set of source LLMs that have different vocabularies. To overcome the vocabulary discrepancies among LLMs, we ensemble LLMs at the text level, allowing them to rerank each other's generated texts at different granularities. Extensive experiments have been conducted across a variety of benchmark datasets. On GSM8K, Cool-Fusion increases accuracy from three strong source LLMs by a significant margin of 17.4%.



Article 585

Title@2025-06-09 (1): Learning Speaker-Invariant Visual Features for Lipreading

Title: Learning Speaker-Invariant Visual Features for Lipreading Learning Speaker-Invariant Visual Features für Lipreading 读心学习用语言-不同功能的视觉特征 2506.07572v1

Authors (7): Yu Li, Feng Xue, Shujie Li, Jinrui Zhang, Shuang Yang, Dan Guo, Richang Hong

Lipreading is a challenging cross-modal task that aims to convert visual lip movements into spoken text. Existing lipreading methods often extract visual features that include speaker-specific lip attributes (e.g., shape, color, texture), which introduce spurious correlations between vision and text. These correlations lead to suboptimal lipreading accuracy and restrict model generalization. To address this challenge, we introduce SIFLip, a speaker-invariant visual feature learning framework that disentangles speaker-specific attributes using two complementary disentanglement modules (Implicit Disentanglement and Explicit Disentanglement) to improve generalization. Specifically, since different speakers exhibit semantic consistency between lip movements and phonetic text when pronouncing the same words, our implicit disentanglement module leverages stable text embeddings as supervisory signals to learn common visual representations across speakers, implicitly decoupling speaker-specific features. Additionally, we design a speaker recognition sub-task within the main lipreading pipeline to filter speaker-specific features, then further explicitly disentangle these personalized visual features from the backbone network via gradient reversal. Experimental results demonstrate that SIFLip significantly improves generalization performance across multiple public datasets, outperforming state-of-the-art methods.



Article 586

Title@2025-06-09 (1): AI Scientists Fail Without Strong Implementation Capability

Title: AI Scientists Fail Without Strong Implementation Capability KI-Wissenschaftler scheitern ohne starke Umsetzungsfähigkeit AI 缺乏强有力的执行能力的科学家失败 2506.01372v2

Authors (7): Minjun Zhu, Qiujie Xie, Yixuan Weng, Jian Wu, Zhen Lin, Linyi Yang, Yue Zhang

The emergence of Artificial Intelligence (AI) Scientist represents a paradigm shift in scientific discovery, with large language models (LLMs) taking the lead as the primary executor in the entire scientific workflow from idea generation to experiment implementation. Recent AI Scientist studies demonstrate sufficient capabilities for independent scientific discovery, with the generated research reports gaining acceptance at the ICLR 2025 workshop and ACL 2025, arguing that a human-level AI Scientist, capable of uncovering phenomena previously unknown to humans, may be imminent. Despite this substantial progress, AI Scientist has yet to produce a groundbreaking achievement in the domain of computer science on par with automated scientific tools. Based on extensive quantitative evidence from existing benchmarks in complex engineering tasks and a systematic evaluation of 28 research papers generated by five advanced AI Scientist systems, we argue that the fundamental bottleneck for AI Scientists lies in their capability to execute the requisite verification procedures. Current AI Scientist systems lack the execution capabilities needed to execute rigorous experiments and produce high-quality scientific papers. To better illustrate the root cause of this implementation gap, we provide an in-depth discussion on the fundamental limitations of AI Scientist. This position paper aims to call for the participants in the community to bridge the implementation gap.



Article 587

Title@2025-06-09 (1): SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition

Title: SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition SELT: Selbstevaluierungsbaumsuche nach LLMs mit Aufgabenabbau SELT: 自评估树搜索具有任务分解的LLMLM 2506.07557v1

Authors (5): Mengsong Wu, Di Zhang, Yuqiang Li, Dongzhan Zhou, Wenliang Chen

While Large Language Models (LLMs) have achieved remarkable success in a wide range of applications, their performance often degrades in complex reasoning tasks. In this work, we introduce SELT (Self-Evaluation LLM Tree Search), a novel framework that leverages a modified Monte Carlo Tree Search (MCTS) to enhance LLM reasoning without relying on external reward models. By redefining the Upper Confidence Bound scoring to align with intrinsic self-evaluation capabilities of LLMs and decomposing the inference process into atomic subtasks augmented with semantic clustering at each node, SELT effectively balances exploration and exploitation, reduces redundant reasoning paths, and mitigates hallucination. We validate our approach on challenging benchmarks, including the knowledge-based MMLU and the Tool Learning dataset Seal-Tools, where SELT achieves significant improvements in answer accuracy and reasoning robustness compared to baseline methods. Notably, our framework operates without task-specific fine-tuning, demonstrating strong generalizability across diverse reasoning tasks. Relevant results and code are available at https://github.com/fairyshine/SELT .
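For reference, the standard Upper Confidence Bound rule that MCTS uses to balance exploration and exploitation can be sketched as below. This is the textbook UCB1 selection, not SELT's redefined scoring, whose exact form the abstract does not give; the node fields, the constant `c`, and the toy statistics are assumptions for illustration.

```python
import math

def ucb1(total_value, visits, parent_visits, c=1.4):
    # Unvisited children are explored first.
    if visits == 0:
        return float("inf")
    exploitation = total_value / visits                             # mean value so far
    exploration = c * math.sqrt(math.log(parent_visits) / visits)   # bonus for rarely visited nodes
    return exploitation + exploration

def select_child(children, parent_visits):
    # Pick the child with the highest UCB1 score.
    return max(children, key=lambda ch: ucb1(ch["value"], ch["visits"], parent_visits))

# Toy statistics for three children of a node visited 11 times.
children = [
    {"name": "a", "value": 9.0, "visits": 10},
    {"name": "b", "value": 1.0, "visits": 1},
    {"name": "c", "value": 0.0, "visits": 0},
]
first_pick = select_child(children, parent_visits=11)
second_pick = select_child(children[:2], parent_visits=11)
```

In a self-evaluation variant, the mean-value term would presumably be fed by the model's own scores rather than by an external reward model, but that substitution is an inference from the abstract rather than a documented detail.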



Article 588

Title@2025-06-09 (1): Bit-level BPE: Below the byte boundary

Title: Bit-level BPE: Below the byte boundary Bit-Level BPE: Unterhalb der Byte-Grenze BPE 位位值 BPE: 字节边界下 2506.07541v1

Authors (3): Sangwhan Moon, Tatsuya Hiraoka, Naoaki Okazaki

Byte-level fallbacks for subword tokenization have become a common practice in large language models. In particular, they have been shown to be a highly effective, pragmatic solution for preventing out-of-vocabulary (OOV) tokens, especially in the context of larger models. However, breaking a character down to individual bytes significantly increases the sequence length for long-tail tokens in languages such as Chinese, Japanese, and Korean (CJK) and other character-diverse contexts such as emoji. The increased sequence length results in longer computation during both training and inference. In this work, we propose a simple compression technique that reduces the sequence length losslessly.
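
To make the sequence-length cost concrete, the short sketch below counts the UTF-8 bytes behind a few characters. It illustrates only the problem described above, not the proposed compression technique, and the sample characters are chosen for illustration.

```python
# Illustrative only: under byte-level fallback, a character missing from the
# vocabulary is emitted as one token per UTF-8 byte.
samples = ["a", "漢", "한", "😀"]

def utf8_byte_count(ch: str) -> int:
    """Number of byte tokens a single out-of-vocabulary character would cost."""
    return len(ch.encode("utf-8"))

byte_counts = {ch: utf8_byte_count(ch) for ch in samples}
for ch, n in byte_counts.items():
    print(f"{ch!r}: {n} byte token(s)")
```

An ASCII letter costs one byte, the CJK characters here cost three, and the emoji costs four, which is why long-tail CJK tokens and emoji inflate sequence length.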



Article 589

Title@2025-06-09 (1): Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models

Title: Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models Flattery, Fluff und Fog: Diagnostizieren und Abmildern von idiosynkratischen Biasen in Präferenzmodellen Flattery、Fluff和Fog:在首选模式中诊断和缓解非典型的两面主义 2506.05339v2

Authors (4): Anirudh Bharadwaj, Chaitanya Malaviya, Nitish Joshi, Mark Yatskar

Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. Evidence suggests these biases originate in artifacts in human training data. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding this preference occurs in >60% of instances, and model preferences show high miscalibration (~40%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean r_human = -0.12) but show moderately strong positive correlations with labels from a strong reward model (mean r_model = +0.36), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Finetuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, showing that targeted debiasing is effective for building reliable preference models.



Article 590

Title@2025-06-09 (1): On Support Samples of Next Word Prediction

Title: On Support Samples of Next Word Prediction Unterstützungsbeispiele für die nächste Wortvorhersage 关于 “ 下一词预测 “ 支助样本 2506.04047v2

Authors (6): Yuqian Li, Yupei Du, Yufang Liu, Feifei Feng, Mou Xiao Feng, Yuanbin Wu

Language models excel in various tasks by making complex decisions, yet understanding the rationale behind these decisions remains a challenge. This paper investigates \emph{data-centric interpretability} in language models, focusing on the next-word prediction task. Using the representer theorem, we identify two types of \emph{support samples}: those that promote specific predictions and those that deter them. Our findings reveal that being a support sample is an intrinsic property, predictable even before training begins. Additionally, while non-support samples are less influential in direct predictions, they play a critical role in preventing overfitting and shaping generalization and representation learning. Notably, the importance of non-support samples increases in deeper layers, suggesting their significant role in intermediate representation formation. These insights shed light on the interplay between data and model decisions, offering a new dimension to understanding language model behavior and interpretability.



Article 591

Title@2025-06-09 (1): Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Title: Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles Enigmata: Scaling Logical Reasoning in großen Sprachmodellen mit synthetischen überprüfbaren Puzzles 英格玛塔:在使用合成可核实拼图的大型语言模型中扩大逻辑理由 2505.19914v2

Authors (12): Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang

Large Language Models (LLMs), such as OpenAI’s o1 and DeepSeek’s R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving the puzzle reasoning skills of LLMs. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on puzzle reasoning benchmarks such as Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing the generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at https://seed-enigmata.github.io.
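
The generator-verifier design can be sketched with a toy task. The puzzle below (find two list entries that sum to a target) is a hypothetical stand-in, not one of the 36 Enigmata tasks, and the names `Puzzle`, `generate`, `verify`, and `brute_force` are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Puzzle:
    numbers: list[int]
    target: int

def generate(difficulty: int, seed: int) -> Puzzle:
    """Toy generator: higher difficulty yields a longer list to search."""
    rng = random.Random(seed)
    numbers = rng.sample(range(1, 10 * difficulty + 1), k=difficulty + 2)
    i, j = rng.sample(range(len(numbers)), k=2)
    return Puzzle(numbers=numbers, target=numbers[i] + numbers[j])

def verify(puzzle: Puzzle, answer: tuple[int, int]) -> bool:
    """Rule-based verifier: two distinct, in-range indices whose values hit the target."""
    i, j = answer
    n = len(puzzle.numbers)
    if i == j or not (0 <= i < n and 0 <= j < n):
        return False
    return puzzle.numbers[i] + puzzle.numbers[j] == puzzle.target

def brute_force(puzzle: Puzzle) -> tuple[int, int]:
    """Reference solver, used here only to exercise the verifier."""
    for i, j in combinations(range(len(puzzle.numbers)), 2):
        if puzzle.numbers[i] + puzzle.numbers[j] == puzzle.target:
            return (i, j)
    raise ValueError("generator guarantees a solution, so this is unreachable")

puzzle = generate(difficulty=3, seed=0)
reward = float(verify(puzzle, brute_force(puzzle)))  # a verifiable 0/1 reward signal
print(puzzle, reward)
```

Because the verifier is a pure function of the puzzle and the answer, its boolean output can serve directly as a verifiable reward, which is the property that RLVR training relies on.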



Article 592

Title@2025-06-09 (1): SecFormer: Fast and Accurate Privacy-Preserving Inference for Transformer Models via SMPC

Title: SecFormer: Fast and Accurate Privacy-Preserving Inference for Transformer Models via SMPC SecFormer: Schnelle und genaue Datenschutz-Bewahrung von Schlussfolgerungen für Transformer-Modelle über SMPC SecFormer:通过SMPC为变形模型快速和准确的隐私保护推断 2401.00793v5

Authors (8): Jinglong Luo, Yehong Zhang, Zhuo Zhang, Jiaqi Zhang, Xin Mu, Hui Wang, Yue Yu, Zenglin Xu

With the growing use of Transformer models hosted on cloud platforms to offer inference services, privacy concerns are escalating, especially concerning sensitive data like investment plans and bank account details. Secure Multi-Party Computing (SMPC) emerges as a promising solution to protect the privacy of inference data and model parameters. However, the application of SMPC in Privacy-Preserving Inference (PPI) for Transformer models often leads to considerable slowdowns or declines in performance. This is largely due to the multitude of nonlinear operations in the Transformer architecture, which are not well-suited to SMPC and difficult to circumvent or optimize effectively. To address this concern, we introduce a comprehensive PPI framework called SecFormer to achieve fast and accurate PPI for Transformer models. We successfully eliminate the high-cost exponential and maximum operations in PPI without sacrificing model performance and develop a suite of efficient SMPC protocols by employing suitable numerical computation methods to boost other complex nonlinear functions in PPI, including GeLU, LayerNorm, and a redesigned Softmax. Our extensive experiments reveal that SecFormer outperforms MPCFormer in performance, showing improvements of $3.4\%$ and $24.7\%$ for BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$, respectively. In terms of efficiency, SecFormer is 3.57 and 3.58 times faster than PUMA for BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$, demonstrating its effectiveness and speed.



Article 593

Title@2025-06-09 (1): Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition

Title: Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition Speaker-distinguishable CTC: Lernen Speaker-Distinction mit CTC für Multi-Talker Spracherkennung 发言人可区分的反恐委员会:学习演讲人利用反恐委员会表彰多语言者言论 2506.07515v1

Authors (6): Asahi Sakuma, Hiroaki Sato, Ryuga Sugano, Tadashi Kumano, Yoshihiko Kawai, Tetsuji Ogawa

This paper presents a novel framework for multi-talker automatic speech recognition without the need for auxiliary information. Serialized Output Training (SOT), a widely used approach, suffers from recognition errors due to speaker assignment failures. Although incorporating auxiliary information, such as token-level timestamps, can improve recognition accuracy, extracting such information from natural conversational speech remains challenging. To address this limitation, we propose Speaker-Distinguishable CTC (SD-CTC), an extension of CTC that jointly assigns a token and its corresponding speaker label to each frame. We further integrate SD-CTC into the SOT framework, enabling the SOT model to learn speaker distinction using only overlapping speech and transcriptions. Experimental comparisons show that multi-task learning with SD-CTC and SOT reduces the error rate of the SOT model by 26% and achieves performance comparable to state-of-the-art methods relying on auxiliary information.



Article 594

Title@2025-06-09 (1): DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction

Title: DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction DERAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction DERAGAC: ASR 错误校正带有合成理由的代名实体候选人 2506.07510v1

Authors (6): Solee Im, Wonjun Lee, Jinmyeong An, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee

We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. By extending the Retrieval-Augmented Generative Error Correction (RAGEC) framework, DeRAGEC employs synthetic denoising rationales to filter out noisy NE candidates before correction. By leveraging phonetic similarity and augmented definitions, it refines noisy retrieved NEs using in-context learning, requiring no additional training. Experimental results on CommonVoice and STOP datasets show significant improvements in Word Error Rate (WER) and NE hit ratio, outperforming baseline ASR and RAGEC methods. Specifically, we achieved a 28% relative reduction in WER compared to ASR without postprocessing. Our source code is publicly available at: https://github.com/solee0022/deragec



Article 595

Title@2025-06-09 (1): Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

Title: Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models Eliciting In-Context Retrieval und Reasoning für Lang-Kontext Große Sprachmodelle 长式长式大语言模型的全文检索和理由 2501.08248v3

Authors (6): Yifu Qiu, Varun Embar, Yizhe Zhang, Navdeep Jaitly, Shay B. Cohen, Benjamin Han

Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly – a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts. To address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers. We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head. Our evaluation of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-tuning, respectively. It even outperforms GPT-4-Turbo on most tasks despite being a much smaller model.



Article 596

Title@2025-06-09 (1): What Do Indonesians Really Need from Language Technology? A Nationwide Survey

Title: What Do Indonesians Really Need from Language Technology? A Nationwide Survey Was brauchen Indonesier wirklich von Sprachtechnologie? Eine landesweite Umfrage 印度尼西亚人真正需要语言技术什么? 2506.07506v1

Authors (4): Muhammad Dehan Al Kautsar, Lucky Susanto, Derry Wijaya, Fajri Koto

There is an emerging effort to develop NLP for Indonesia's 700+ local languages, but progress remains costly due to the need for direct engagement with native speakers. However, it is unclear what these language communities truly need from language technology. To address this, we conduct a nationwide survey to assess the actual needs of native speakers in Indonesia. Our findings indicate that addressing language barriers, particularly through machine translation and information retrieval, is the most critical priority. Although there is strong enthusiasm for advancements in language technology, concerns around privacy, bias, and the use of public data for AI training highlight the need for greater transparency and clear communication to support broader AI adoption.



Article 597

Title@2025-06-09 (1): DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech

Title: DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech DEBATE: Ein Datensatz für entwirrende textuelle Ambiguität in Mandarin durch Sprache 辩论:通过讲话解调普通话文字模糊性数据集 2506.07502v1

Authors (8): Haotian Guo, Jing Han, Yongfeng Tu, Shihao Gao, Shengfan Shen, Wulong Xiang, Weihao Gan, Zixing Zhang

Despite extensive research on textual and visual disambiguation, disambiguation through speech (DTS) remains underexplored. This is largely due to the lack of high-quality datasets that pair spoken sentences with richly ambiguous text. To address this gap, we present DEBATE, a unique public Chinese speech-text dataset designed to study how speech cues and patterns (pronunciation, pause, stress, and intonation) can help resolve textual ambiguity and reveal a speaker’s true intent. DEBATE contains 1,001 carefully selected ambiguous utterances, each recorded by 10 native speakers, capturing diverse linguistic ambiguities and their disambiguation through speech. We detail the data collection pipeline and provide rigorous quality analysis. Additionally, we benchmark three state-of-the-art large speech and audio-language models, illustrating large, clear performance gaps between machine and human understanding of spoken intent. DEBATE represents the first effort of its kind and offers a foundation for building similar DTS datasets across languages and cultures. The dataset and associated code are available at: https://github.com/SmileHnu/DEBATE.



Article 598

Title@2025-06-09 (1): Graph-of-Causal Evolution: Challenging Chain-of-Model for Reasoning

Title: Graph-of-Causal Evolution: Challenging Chain-of-Model for Reasoning Graph-of-Causal Evolution: Herausfordernde Modellkette für die Vernunft Causal 进化图:有挑战性的理由模型链 2506.07501v1

Authors (1): Libo Wang

In the chain-of-model (CoM), each subchain relies only on the information of the previous subchain and may lose long-range dependencies because the causal mask blocks the global context flow between multi-level subchains. To address this problem, this work proposes a graph of causal evolution (GoCE). Its core principle is to map the implicit token representation into a differentiable and sparse causal adjacency matrix, then propagate causal constraints through each layer of computation using causal-masked attention and causal-MoE. By combining the intervention consistency loss test with the self-evolution gate, GoCE achieves a dynamic balance between causal structure learning and adaptive updating of the transformer architecture. The researcher built experimental environments in sandboxes based on Claude Sonnet 4, o4-mini-high, and DeepSeek R1, each with the transformer variant architecture introduced in GoCE. GoCE is evaluated on publicly available datasets including CLUTRR, CLADDER, EX-FEVER, and CausalQA and compared with the baseline LLMs. The findings show that GoCE strengthens the transformer’s ability to capture long-range causal dependencies and improves its ability to self-evolve. It not only surpasses CoM in terms of design principles, but also provides experience for future research on causal learning and continuous adaptive improvement.



Article 599

Title@2025-06-09 (1): Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation

Title: Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation Compliance-to-Code: Verbesserung der finanziellen Compliance-Prüfung durch Codegenerierung 遵守到守则:通过代码生成加强金融合规检查 2505.19804v2

Authors (11): Siyuan Li, Jian Chen, Rui Yao, Xuming Hu, Peilin Zhou, Weihua Qiu, Simin Zhang, Chucheng Dong, Zhiyao Li, Qipeng Xie, Zixuan Yuan

Regulatory compliance has become a cornerstone of corporate governance, ensuring adherence to systematic legal frameworks. Financial regulations often comprise highly intricate provisions, layered logical structures, and numerous exceptions, which inevitably make them labor-intensive to apply and difficult to comprehend. To mitigate this, recent Regulatory Technology (RegTech) and Large Language Models (LLMs) have gained significant attention in automating the conversion of regulatory text into executable compliance logic. However, their performance remains suboptimal, particularly when applied to Chinese-language financial regulations, due to three key limitations: (1) incomplete domain-specific knowledge representation, (2) insufficient hierarchical reasoning capabilities, and (3) failure to maintain temporal and logical coherence. One promising solution is to develop domain-specific, code-oriented datasets for model training. Existing datasets such as LexGLUE, LegalBench, and CODE-ACCORD are often English-focused, domain-mismatched, or too coarse-grained for compliance code generation. To fill these gaps, we present Compliance-to-Code, the first large-scale Chinese dataset dedicated to financial regulatory compliance. It covers 1,159 annotated clauses from 361 regulations across ten categories; each clause is modularly structured with four logical elements (subject, condition, constraint, and contextual information) along with regulation relations. We provide deterministic Python code mappings, detailed code reasoning, and code explanations to facilitate automated auditing. To demonstrate utility, we present FinCheck: a pipeline for regulation structuring, code generation, and report generation.
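
A minimal sketch of how a clause with the four logical elements might map to deterministic Python follows. The clause, its thresholds, the record fields, and the names `Clause` and `check` are all hypothetical and are not taken from the dataset or from any real regulation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Clause:
    subject: str                         # who the clause applies to
    condition: Callable[[dict], bool]    # when the clause is triggered
    constraint: Callable[[dict], bool]   # what must hold once triggered
    context: str                         # free-text contextual information

def check(clause: Clause, record: dict) -> str:
    """Deterministic compliance check for one record against one clause."""
    if record.get("role") != clause.subject or not clause.condition(record):
        return "not_applicable"
    return "compliant" if clause.constraint(record) else "violation"

# Hypothetical clause: a shareholder reaching a 5% stake must disclose within 3 days.
disclosure_rule = Clause(
    subject="shareholder",
    condition=lambda r: r["stake_pct"] >= 5.0,
    constraint=lambda r: r["days_to_disclose"] <= 3,
    context="Invented example for illustration only.",
)

records = [
    {"role": "shareholder", "stake_pct": 6.2, "days_to_disclose": 2},
    {"role": "shareholder", "stake_pct": 7.0, "days_to_disclose": 10},
    {"role": "shareholder", "stake_pct": 1.5, "days_to_disclose": 30},
]
results = [check(disclosure_rule, r) for r in records]
print(results)
```

Separating the condition from the constraint lets an audit report distinguish records the clause never applied to from records that actually violated it.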



Article 600

Title@2025-06-09 (1): Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning

Title: Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning Verbesserung des Charakter-Level-Verständnisses in LLMs durch Token Internal Structure Learning 通过 Token 内部结构学习加强LLM女士的品级理解 2411.17679v5

Authors (9): Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang, Jian He, Conglin Liu

Tokenization methods like Byte-Pair Encoding (BPE) enhance computational efficiency in large language models (LLMs) but often obscure internal character structures within tokens. This limitation hinders LLMs’ ability to predict precise character positions, which is crucial in tasks like Chinese Spelling Correction (CSC) where identifying the positions of misspelled characters accelerates correction processes. We propose Token Internal Position Awareness (TIPA), a method that significantly improves models’ ability to capture character positions within tokens by training them on reverse character prediction tasks using the tokenizer’s vocabulary. Experiments demonstrate that TIPA enhances position prediction accuracy in LLMs, enabling more precise identification of target characters in original text. Furthermore, when applied to downstream tasks that do not require exact position prediction, TIPA still boosts performance in tasks needing character-level information, validating its versatility and effectiveness.
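
The reverse character prediction task can be sketched as building, for each vocabulary token, a target that lists its characters from the last position to the first. The JSON layout and the toy vocabulary below are assumptions made for illustration; the abstract does not specify the exact training format.

```python
import json

def reverse_position_target(token: str) -> str:
    """Map 1-based positions to characters, listed from last to first."""
    mapping = {str(pos): token[pos - 1] for pos in range(len(token), 0, -1)}
    return json.dumps(mapping, ensure_ascii=False)

# Toy stand-in for a tokenizer vocabulary (hypothetical entries).
toy_vocab = ["小说", "token", "语言模型"]

training_pairs = [(tok, reverse_position_target(tok)) for tok in toy_vocab]
for prompt, target in training_pairs:
    print(prompt, "->", target)
```

Listing positions in descending order means the first number produced is the token's length, so the model has to commit to how many characters the token contains before naming any of them.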



Article 601

Title@2025-06-09 (1): Binary Classifier Optimization for Large Language Model Alignment

Title: Binary Classifier Optimization for Large Language Model Alignment Binärklassifikator-Optimierung für großsprachliche Modellausrichtung 大语言模型对齐的二进制分类器优化 2404.04656v2

Authors (4): Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, Kyoung-Woon On

In real-world services such as ChatGPT, aligning models based on user feedback is crucial for improving model performance. However, due to the simplicity and convenience of providing feedback, users typically offer only basic binary signals, such as ‘thumbs-up’ or ‘thumbs-down’. Most existing alignment research, on the other hand, relies on preference-based approaches that require both positive and negative responses as a pair. We propose Binary Classifier Optimization (BCO), a technique that effectively aligns LLMs using only binary feedback. BCO trains a binary classifier, where the logit serves as an implicit reward, effectively minimizing the Direct Preference Optimization (DPO) loss. We demonstrate that the binary cross-entropy loss employed in classifier training acts as an upper bound for the DPO loss. Additionally, a novel reward shift technique further minimizes the gap between the losses. We validate our methodology in two settings: first, on a paired preference dataset, where our method performs on par with DPO; and second, on a Likert-5 scale annotation dataset which stems from real users’ queries. Our model consistently demonstrates effective and robust alignment across four base LLMs and three different datasets, showcasing the strength of our approach to learning from binary signals.



Article 602

Title@2025-06-09 (1): Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models

Title: Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models Semantische Exploration mit adaptiver Beschichtung für effiziente Problemlösung mit Sprachmodellen 以语言模式有效解决问题的适应性基质的语义探索 2501.05752v2

Authors (4): Sungjae Lee, Hyejin Park, Jaechang Kim, Jungseul Ok

Recent advancements in large language models (LLMs) have shown remarkable potential in various complex tasks requiring multi-step reasoning methods like tree search to explore diverse reasoning paths. However, existing methods often suffer from computational inefficiency and redundancy. First, they overlook the diversity of task difficulties, leading to unnecessarily extensive searches even for easy tasks. Second, they neglect the semantics of reasoning paths, resulting in redundant exploration of semantically identical paths. To address these limitations, we propose Semantic Exploration with Adaptive Gating (SEAG), a computationally efficient method. SEAG employs an adaptive gating mechanism that dynamically decides whether to conduct a tree search, based on the confidence level of answers from a preceding simple reasoning method. Furthermore, its tree-based exploration consolidates semantically identical reasoning steps, reducing redundant explorations while maintaining or even improving accuracy. Our extensive experiments demonstrate that SEAG significantly improves accuracy by 4.3% on average while requiring only 31% of computational costs compared to existing tree search-based methods on complex reasoning benchmarks including GSM8K and ARC with diverse language models such as Llama2, Llama3, and Mistral. Our code is available at https://github.com/ml-postech/SEAG-semantic-exploration-with-adaptive-gating .
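
The adaptive gating idea can be sketched as follows: sample a few answers with a cheap method, measure how much they agree, and fall back to tree search only when agreement is low. Using answer entropy as the confidence measure, the threshold value, and the names `answer_entropy` and `solve` are assumptions for illustration; `sample_answer` and `tree_search` are placeholder callables, not functions from the SEAG repository.

```python
import math
from collections import Counter
from typing import Callable

def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical answer distribution."""
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())

def solve(
    question: str,
    sample_answer: Callable[[str], str],
    tree_search: Callable[[str], str],
    k: int = 5,
    max_entropy: float = 0.5,
) -> tuple[str, str]:
    """Return (answer, route); route records which path the gate chose."""
    answers = [sample_answer(question) for _ in range(k)]
    if answer_entropy(answers) <= max_entropy:
        # Confident: keep the majority answer and skip the expensive search.
        return Counter(answers).most_common(1)[0][0], "simple"
    return tree_search(question), "tree_search"

# Stub reasoners standing in for LLM calls.
confident = iter(["4"] * 5)
print(solve("2 + 2 = ?", lambda q: next(confident), lambda q: "searched"))
```

Easy questions, where the sampled answers agree, never pay for a tree search, which is where the reported compute savings would come from.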



Article 603

Title@2025-06-09 (1): A Hybrid GA LLM Framework for Structured Task Optimization

Title: A Hybrid GA LLM Framework for Structured Task Optimization Ein hybrides GA LLM-Rahmenwerk für strukturierte Aufgabenoptimierung GA 混合LLM 结构化任务优化框架 2506.07483v1

Authors (3): Berry Feng, Jonas Lin, Patrick Lau

GA LLM is a hybrid framework that combines Genetic Algorithms with Large Language Models to handle structured generation tasks under strict constraints. Each output, such as a plan or report, is treated as a gene, and evolutionary operations like selection, crossover, and mutation are guided by the language model to iteratively improve solutions. The language model provides domain knowledge and creative variation, while the genetic algorithm ensures structural integrity and global optimization. GA LLM has proven effective in tasks such as itinerary planning, academic outlining, and business reporting, consistently producing well-structured results that satisfy the stated requirements. Its modular design also makes it easy to adapt to new tasks. Compared to using a language model alone, GA LLM achieves better constraint satisfaction and higher-quality solutions by combining the strengths of both components.
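
The evolutionary loop can be sketched as below. In the described framework the crossover and mutation operators would be guided by a language model; here they are plain random stubs, and the itinerary constraint, the activity pool, and all function names are hypothetical.

```python
import random

REQUIRED = {"museum", "lunch"}  # toy constraint: the plan must include both
POOL = ["museum", "lunch", "park", "shopping", "hotel"]

def fitness(plan: list[str]) -> int:
    """Number of required activities the plan covers."""
    return len(REQUIRED & set(plan))

def crossover(a: list[str], b: list[str], rng: random.Random) -> list[str]:
    """Stub for an LLM-guided merge: splice two plans at a random cut."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(plan: list[str], rng: random.Random) -> list[str]:
    """Stub for an LLM-guided rewrite: swap one activity at random."""
    plan = list(plan)
    plan[rng.randrange(len(plan))] = rng.choice(POOL)
    return plan

def evolve(population: list[list[str]], generations: int = 10, keep: int = 2, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[:keep]  # selection
        children = []
        while len(parents) + len(children) < len(population):
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children  # elitism: the best plans always survive
    return max(population, key=fitness)

initial = [
    ["park", "shopping", "hotel"],
    ["museum", "park", "hotel"],
    ["hotel", "hotel", "lunch"],
    ["park", "park", "park"],
]
best = evolve(initial)
print(best, fitness(best))
```

Because the top `keep` plans are carried over unchanged, the best fitness never decreases from one generation to the next, regardless of how noisy the variation operators are.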



Article 604

Title@2025-06-09 (1): Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Title: Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models Verfolgung beweglicher Ziele mit Online-Selbstspiel-Verstärkung Lernen für sicherere Sprachmodelle 利用在线加强自身能力学习,建立更安全语言模式,以追踪移动目标 2506.07468v1

Authors (7): Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques

Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch – attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles – generating adversarial prompts and safeguarding against them – while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).



Article 605

Title@2025-06-09 (1): BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model

Title: BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model BEYOND DIALOG: Ein Profil-Dialog-Rahmen für die Ausrichtung auf das allgemeine Rollenspiel-Sprachenmodell BEYONUD DDILOUG: 通用角色扮演语言模式概况-对话协调框架 2408.10903v6

Authors (5): Yeyong Yu, Runsheng Yu, Haojie Wei, Zhanqiu Zhang, Quan Qian

The rapid advancement of large language models (LLMs) has revolutionized role-playing, enabling the development of general role-playing models. However, current role-playing training has two significant issues: (I) Using a predefined role profile to prompt dialogue training for specific scenarios usually leads to inconsistencies and even conflicts between the dialogue and the profile, resulting in training biases. (II) The model learns to imitate the role based solely on the profile, neglecting profile-dialogue alignment at the sentence level. In this work, we propose a simple yet effective framework called BEYOND DIALOGUE, designed to overcome these hurdles. This framework innovatively introduces “beyond dialogue” tasks to align dialogue with profile traits based on each specific scenario, thereby eliminating biases during training. Furthermore, by adopting an innovative prompting mechanism that generates reasoning outcomes for training, the framework allows the model to achieve fine-grained alignment between profile and dialogue at the sentence level. The aforementioned methods are fully automated and low-cost. Additionally, the integration of automated dialogue and objective evaluation methods forms a comprehensive framework, paving the way for general role-playing. Experimental results demonstrate that our model excels in adhering to and reflecting various dimensions of role profiles, outperforming most proprietary general and specialized role-playing baselines. All code and datasets are available at https://github.com/yuyouyu32/BeyondDialogue.



Article 606

Title@2025-06-09 (1): CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

Title: CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models CCI4.0: Ein zweisprachiger Vorschulungs-Datensatz zur Verbesserung der Vernunft in großen Sprachmodellen CCI4.0:提高大语言模式理由的双语预先培训数据集 2506.07463v1

Authors (9): Guang Liu, Liangdong Wang, Jijie Li, Yang Yu, Yao Xu, Jiabei Chen, Yu Bai, Feng Liao, Yonghua Lin

We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectories. CCI4.0 occupies roughly $35$ TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a $5.2$ TB carefully curated Chinese web corpus, a $22.5$ TB English subset from Nemotron-CC, and diverse sources from math, wiki, arxiv, and code. Although these data are mostly sourced from well-processed datasets, the quality standards of various domains are dynamic and require extensive expert experience and labor to process. We therefore propose a novel pipeline that assesses data quality mainly with models, through two-stage deduplication, multiclassifier quality scoring, and domain-aware fluency filtering. We extract $4.5$ billion pieces of CoT (Chain-of-Thought) templates, named CCI4.0-M2-CoT. Differing from the distillation of CoT from larger models, our proposed staged CoT extraction exemplifies diverse reasoning patterns and significantly decreases the possibility of hallucination. Empirical evaluations demonstrate that LLMs pre-trained on CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements in downstream tasks, especially in math and code reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, shedding some light on automatically processing pretraining corpora.



Article 607

Title@2025-06-09 (1): From Calibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered

Title: From Calibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered Von der Kalibrierung zur Zusammenarbeit: LLM Uncertainty Quantification sollte menschlicher werden 从校准到协作:LLM 不确定性量化应更加以人为本 2506.07461v1

Authors (5): Siddartha Devic, Tejas Srinivasan, Jesse Thomason, Willie Neiswanger, Vatsal Sharan

Large Language Models (LLMs) are increasingly assisting users in the real world, yet their reliability remains a concern. Uncertainty quantification (UQ) has been heralded as a tool to enhance human-LLM collaboration by enabling users to know when to trust LLM predictions. We argue that current practices for uncertainty quantification in LLMs are not optimal for developing useful UQ for human users making decisions in real-world tasks. Through an analysis of 40 LLM UQ methods, we identify three prevalent practices hindering the community’s progress toward its goal of benefiting downstream users: 1) evaluating on benchmarks with low ecological validity; 2) considering only epistemic uncertainty; and 3) optimizing metrics that are not necessarily indicative of downstream utility. For each issue, we propose concrete user-centric practices and research directions that LLM UQ researchers should consider. Instead of hill-climbing on unrepresentative tasks using imperfect metrics, we argue that the community should adopt a more human-centered approach to LLM uncertainty quantification.



Article 608

Title@2025-06-09 (1): GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning

Title: GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning GLOS: Zeichensprache-Generierung mit temporär ausgerichteter Gloss-Level-Konditionierung GLOS: 具有临时统一质素水平条件的手语语言一代 2506.07460v1

Authors (4): Taeryung Lee, Hyeongjin Nam, Gyeongsik Moon, Kyoung Mu Lee

Sign language generation (SLG), or text-to-sign generation, bridges the gap between signers and non-signers. Despite recent progress in SLG, existing methods still often suffer from incorrect lexical ordering and low semantic accuracy. This is primarily due to sentence-level condition, which encodes the entire sentence of the input text into a single feature vector as a condition for SLG. This approach fails to capture the temporal structure of sign language and lacks the granularity of word-level semantics, often leading to disordered sign sequences and ambiguous motions. To overcome these limitations, we propose GLOS, a sign language generation framework with temporally aligned gloss-level conditioning. First, we employ gloss-level conditions, which we define as sequences of gloss embeddings temporally aligned with the motion sequence. This enables the model to access both the temporal structure of sign language and word-level semantics at each timestep. As a result, this allows for fine-grained control of signs and better preservation of lexical order. Second, we introduce a condition fusion module, temporal alignment conditioning (TAC), to efficiently deliver the word-level semantic and temporal structure provided by the gloss-level condition to the corresponding motion timesteps. Our method, which is composed of gloss-level conditions and TAC, generates signs with correct lexical order and high semantic accuracy, outperforming prior methods on CSL-Daily and Phoenix-2014T.



Article 609

Title@2025-06-09 (1): Parameter-Efficient Fine-Tuning of State Space Models

Title: Parameter-Efficient Fine-Tuning of State Space Models Parameter-Effizient Feintuning von State Space Models 国家空间模型有效精美设计 2410.09016v3

Authors (5): Kevin Galim, Wonjun Kang, Yuchen Zeng, Hyung Il Koo, Kangwook Lee

Deep State Space Models (SSMs), such as Mamba (Gu & Dao, 2024), have become powerful tools for language modeling, offering high performance and linear scalability with sequence length. However, the application of parameter-efficient fine-tuning (PEFT) methods to SSM-based models remains largely underexplored. We start by investigating two fundamental questions on existing PEFT methods: (i) How do they perform on SSM-based models? (ii) Which parameters should they target for optimal results? Our analysis shows that LoRA and its variants consistently outperform all other PEFT methods. While LoRA is effective for linear projection matrices, it fails on SSM modules, yet still outperforms other methods applicable to SSMs, indicating their limitations. This underscores the need for a specialized SSM tuning approach. To address this, we propose Sparse Dimension Tuning (SDT), a PEFT method tailored for SSM modules. Combining SDT for SSMs with LoRA for linear projection matrices, we achieve state-of-the-art performance across extensive experiments.



Article 610

Title@2025-06-09 (1): KScope: A Framework for Characterizing the Knowledge Status of Language Models

Title: KScope: A Framework for Characterizing the Knowledge Status of Language Models KScope: Ein Rahmen für die Charakterisierung des Wissensstatus von Sprachmodellen KCope: 语言模式知识状况描述框架 2506.07458v1

Authors (6): Yuxin Xiao, Shan Chen, Jack Gallifant, Danielle Bitterman, Thomas Hartvigsen, Marzyeh Ghassemi

Characterizing a large language model’s (LLM’s) knowledge of a given question is challenging. As a result, prior work has primarily examined LLM behavior under knowledge conflicts, where the model’s internal parametric memory contradicts information in the external context. However, this does not fully reflect how well the model knows the answer to the question. In this paper, we first introduce a taxonomy of five knowledge statuses based on the consistency and correctness of LLM knowledge modes. We then propose KScope, a hierarchical framework of statistical tests that progressively refines hypotheses about knowledge modes and characterizes LLM knowledge into one of these five statuses. We apply KScope to nine LLMs across four datasets and systematically establish: (1) Supporting context narrows knowledge gaps across models. (2) Context features related to difficulty, relevance, and familiarity drive successful knowledge updates. (3) LLMs exhibit similar feature preferences when partially correct or conflicted, but diverge sharply when consistently wrong. (4) Context summarization constrained by our feature analysis, together with enhanced credibility, further improves update effectiveness and generalizes across LLMs.



Article 611

Title@2025-06-09 (1): Understanding Cross-Domain Adaptation in Low-Resource Topic Modeling

Title: Understanding Cross-Domain Adaptation in Low-Resource Topic Modeling Cross-Domain-Anpassung in Low-Resource-Themenmodellierung verstehen 在低成本资源专题建模中了解跨领域适应 2506.07453v1

Authors (2): Pritom Saha Akash, Kevin Chen-Chuan Chang

Topic modeling plays a vital role in uncovering hidden semantic structures within text corpora, but existing models struggle in low-resource settings where limited target-domain data leads to unstable and incoherent topic inference. We address this challenge by formally introducing domain adaptation for low-resource topic modeling, where a high-resource source domain informs a low-resource target domain without overwhelming it with irrelevant content. We establish a finite-sample generalization bound showing that effective knowledge transfer depends on robust performance in both domains, minimizing latent-space discrepancy, and preventing overfitting to the data. Guided by these insights, we propose DALTA (Domain-Aligned Latent Topic Adaptation), a new framework that employs a shared encoder for domain-invariant features, specialized decoders for domain-specific nuances, and adversarial alignment to selectively transfer relevant information. Experiments on diverse low-resource datasets demonstrate that DALTA consistently outperforms state-of-the-art methods in terms of topic coherence, stability, and transferability.



Article 612

Title@2025-06-09 (1): When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment

Title: When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment Wenn Stil die Sicherheit bricht: Verteidigung von Sprachmodellen gegen hocheffiziente Stilausrichtung 当样式打破安全时: 防御语言模型, 对抗超优样式对齐 2506.07452v1

Authors (5): Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi

Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in jailbreak queries. Although these style patterns are semantically unrelated to the malicious intents behind jailbreak queries, their safety impact remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We evaluate 32 LLMs across seven jailbreak benchmarks, and find that malicious queries with style patterns inflate the attack success rate (ASR) for nearly all models. Notably, ASR inflation correlates with both the length of style patterns and the relative attention an LLM exhibits on them. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs and five fine-tuning style settings, SafeStyle consistently outperforms baselines in maintaining LLM safety.



Article 613

Title@2025-06-09 (1): LlamaRec-LKG-RAG: A Single-Pass, Learnable Knowledge Graph-RAG Framework for LLM-Based Ranking

Title: LlamaRec-LKG-RAG: A Single-Pass, Learnable Knowledge Graph-RAG Framework for LLM-Based Ranking LlamaRec-LKG-RAG: Ein eingängiges, erlernbares Wissen Graph-RAG-Framework für LLM-basiertes Ranking LlamaRec-LKG-RAG:以LLM为基础的排名单张、可学习知识图表-RAG框架 2506.07449v1

Authors (2): Vahid Azizi, Fatemeh Koochaki

Recent advances in Large Language Models (LLMs) have driven their adoption in recommender systems through Retrieval-Augmented Generation (RAG) frameworks. However, existing RAG approaches predominantly rely on flat, similarity-based retrieval that fails to leverage the rich relational structure inherent in user-item interactions. We introduce LlamaRec-LKG-RAG, a novel single-pass, end-to-end trainable framework that integrates personalized knowledge graph context into LLM-based recommendation ranking. Our approach extends the LlamaRec architecture by incorporating a lightweight user preference module that dynamically identifies salient relation paths within a heterogeneous knowledge graph constructed from user behavior and item metadata. These personalized subgraphs are seamlessly integrated into prompts for a fine-tuned Llama-2 model, enabling efficient and interpretable recommendations through a unified inference step. Comprehensive experiments on ML-100K and Amazon Beauty datasets demonstrate consistent and significant improvements over LlamaRec across key ranking metrics (MRR, NDCG, Recall). LlamaRec-LKG-RAG demonstrates the critical value of structured reasoning in LLM-based recommendations and establishes a foundation for scalable, knowledge-aware personalization in next-generation recommender systems. Code is available at https://github.com/VahidAz/LlamaRec-LKG-RAG.



Article 614

Title@2025-06-09 (1): Measuring Diversity in Synthetic Datasets

Title: Measuring Diversity in Synthetic Datasets Diversität in synthetischen Datensätzen messen 计量合成数据集的多样性 2502.08512v2

Authors (8): Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian

Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets, an aspect crucial for robust model performance, remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing methods. Code is available at: https://github.com/bluewhalelab/dcscore.
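
The "diversity as sample classification" idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each sample is "classified" against all samples by a softmax over cosine similarities, and that the score is the summed probability each sample assigns to itself. The function name `dcscore_sketch` and the temperature `tau` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def dcscore_sketch(embeddings, tau=1.0):
    """Sum over samples of the softmax probability that a sample is classified as itself.

    Assumption (not from the abstract): similarity is cosine similarity and
    the classifier is a temperature-scaled softmax over each row.
    """
    score = 0.0
    for i, row in enumerate(embeddings):
        logits = [cosine(row, other) / tau for other in embeddings]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        score += exps[i] / sum(exps)
    return score


duplicates = [[1.0, 0.0, 0.0]] * 3
distinct = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(dcscore_sketch(duplicates, tau=0.1))
print(dcscore_sketch(distinct, tau=0.1))
```

Under these assumptions, a set of identical samples scores about 1 (every row is uniform), while mutually dissimilar samples push the score toward the number of samples.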



Article 615

Title@2025-06-09 (1): Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning

Title: Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning Token Cleaning: Feinkörnige Datenauswahl für LLM Supervised Fine-Tuning Tok Cleaning: 为LLM 监督的LLM 精密导出选择精密数据 2502.01968v2

Authors (7): Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, Yang Liu

Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even in high-quality samples, patterns or phrases that are not task-related can be redundant, uninformative, or even harmful. Continuing to fine-tune on these patterns may offer limited benefit and even degrade downstream task performance. In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The benefits and limitations of both methods are analyzed theoretically by error upper bounds. Extensive experiments show that our framework consistently improves downstream performance. Code is available at https://github.com/UCSC-REAL/TokenCleaning.
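
The scoring-then-thresholding step described above can be sketched as follows. The abstract does not give the influence formula, so this sketch assumes a token's influence is the drop in its loss between a base model and a reference model; `token_influence`, `clean_tokens`, and the threshold value are hypothetical names and settings, and the loss values are made up for illustration.

```python
def token_influence(base_losses, ref_losses):
    """Per-token influence, assumed here to be base-model loss minus reference-model loss."""
    return [b - r for b, r in zip(base_losses, ref_losses)]


def clean_tokens(tokens, base_losses, ref_losses, threshold):
    """Return a keep-mask: True for tokens whose influence exceeds the threshold."""
    scores = token_influence(base_losses, ref_losses)
    return [score > threshold for score in scores]


tokens = ["Sure", ",", "the", "answer", "is", "42"]
base_losses = [0.2, 0.1, 0.3, 2.0, 0.4, 3.0]   # illustrative values only
ref_losses = [0.2, 0.1, 0.25, 0.5, 0.3, 0.6]   # illustrative values only

mask = clean_tokens(tokens, base_losses, ref_losses, threshold=1.0)
kept = [tok for tok, keep in zip(tokens, mask) if keep]
print(kept)
```

In an SFT loop, a mask like this would typically zero out the loss on dropped tokens rather than delete them from the input, so the context the model conditions on stays intact.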



Article 616

Title@2025-06-09 (1): RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs

Title: RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs RomanLens: Die Rolle der latenten Romanisierung in der Mehrsprachigkeit in LLMs RomanLenens:LLM中原罗马化在多种语言中的作用 2502.07424v3

Authors (6): Alan Saji, Jaavid Aktar Husain, Thanmay Jayakumar, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully

Large Language Models (LLMs) exhibit strong multilingual performance despite being predominantly trained on English-centric corpora. This raises a fundamental question: How do LLMs achieve such multilingual capabilities? Focusing on languages written in non-Roman scripts, we investigate the role of Romanization - the representation of non-Roman scripts using Roman characters - as a potential bridge in multilingual processing. Using mechanistic interpretability techniques, we analyze next-token generation and find that intermediate layers frequently represent target words in Romanized form before transitioning to native script, a phenomenon we term Latent Romanization. Further, through activation patching experiments, we demonstrate that LLMs encode semantic concepts similarly across native and Romanized scripts, suggesting a shared underlying representation. Additionally, for translation into non-Roman script languages, our findings reveal that when the target language is in Romanized form, its representations emerge earlier in the model’s layers compared to native script. These insights contribute to a deeper understanding of multilingual representation in LLMs and highlight the implicit role of Romanization in facilitating language transfer.



Article 617

Title@2025-06-09 (1): LG-ANNA-Embedding technical report

Title: LG-ANNA-Embedding technical report Technischer Bericht LG-ANNA-Embedding LG-ANNA-取代技术报告 2506.07438v1

Authors (9): Jooyoung Choi, Hyun Kim, Hansol Jang, Changwook Jun, Kyunghoon Bae, Hyewon Choi, Stanley Jungkyu Choi, Honglak Lee, Chulmin Yun

This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Built upon a decoder-only large language model (Mistral-7B), our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings without task-specific fine-tuning. Structured instructions and few-shot examples are used to guide the model across diverse tasks, enabling strong performance on classification, semantic similarity, clustering, and reranking benchmarks. To improve semantic discrimination, we employ a soft labeling framework where continuous relevance scores, distilled from a high-performance dense retriever and reranker, serve as fine-grained supervision signals. In addition, we introduce adaptive margin-based hard-negative mining, which filters out semantically ambiguous negatives based on their similarity to positive examples, thereby enhancing training stability and retrieval robustness. Our model is evaluated on the newly introduced MTEB (English, v2) benchmark, covering 41 tasks across seven categories. Results show that our method achieves strong generalization and ranks among the top-performing models by Borda score, outperforming several larger or fully fine-tuned baselines. These findings highlight the effectiveness of combining in-context prompting, soft supervision, and adaptive sampling for scalable, high-quality embedding generation.
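
A minimal sketch of margin-based hard-negative filtering follows. The report's exact rule is not stated in the abstract, so this assumes a negative is "semantically ambiguous" when its similarity to the query exceeds a fixed fraction of the positive's similarity to the query; `filter_hard_negatives` and `margin_ratio` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def filter_hard_negatives(query, positive, negatives, margin_ratio=0.95):
    """Keep negatives scoring below margin_ratio times the positive's score.

    The ceiling adapts to each query: the closer the positive is to the
    query, the closer a negative may be before it is treated as a likely
    false negative and dropped.
    """
    ceiling = margin_ratio * cosine(query, positive)
    return [neg for neg in negatives if cosine(query, neg) < ceiling]


query = [1.0, 0.0]
positive = [1.0, 0.1]
negatives = [[1.0, 0.05], [0.0, 1.0], [0.5, 0.5]]
print(filter_hard_negatives(query, positive, negatives))
```

Here the first negative is nearly indistinguishable from the positive and is removed, while the other two remain as usable training negatives.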



Article 618

Title@2025-06-09 (1): Cartridges: Lightweight and general-purpose long context representations via self-study

Title: Cartridges: Lightweight and general-purpose long context representations via self-study Patronen: Leichte und universelle lange Kontextdarstellungen durch Selbststudium Cartridges:轻量和一般用途长背景介绍,通过自学 2506.06266v2

Authors (11): Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, Christopher Re

Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model’s effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.
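
The statement that KV-cache memory scales with input length can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a standard transformer cache that stores one key and one value vector per layer, per KV head, per token; the model dimensions are hypothetical and are not taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for seq_len tokens.

    The leading factor 2 accounts for storing both a key and a value;
    bytes_per_elem=2 corresponds to 16-bit floats.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


# Hypothetical model configuration, chosen only for illustration.
HYPOTHETICAL_MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for tokens in (8_192, 131_072, 1_048_576):
    gib = kv_cache_bytes(tokens, **HYPOTHETICAL_MODEL) / 2**30
    print(f"{tokens:>9} tokens -> {gib:.1f} GiB")
```

Because the cost is linear in sequence length, a cache trained to a fixed, smaller size (as a Cartridge is) removes the dependence on corpus length at serving time.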



Article 619

Title@2025-06-09 (1): Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding

Title: Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding Gut Begun ist die Hälfte Fertig: Low-Resource Preference Alignment by Weak-to-Strong Decoding Well Begun 半完成: 低资源优先配对, 由弱到强解码 2506.07434v1

Authors (7): Feifan Song, Shaohang Wei, Wen Luo, Yuxuan Fan, Tianyu Liu, Guoyin Wang, Houfeng Wang

Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. Recently, low-resource methods for LLM alignment have become popular, but they still face challenges in obtaining content that is both high-quality and aligned. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), to enhance the alignment ability of base models under the guidance of a small aligned model. The small model first drafts a well-aligned beginning, and the large base model then continues the rest, with the handoff controlled by a well-designed auto-switch mechanism. We also collect a new dataset, GenerAlign, to fine-tune a small-sized Pilot-3B as the draft model, which effectively enhances different base models under the WSD framework to outperform all baseline methods, while avoiding degradation on downstream tasks, known as the alignment tax. We further conduct extensive experiments on the impact of different settings and on time efficiency, along with in-depth analyses of the intrinsic mechanisms of WSD.



Article 620

Title@2025-06-09 (1): Multi-agent Architecture Search via Agentic Supernet

Title: Multi-agent Architecture Search via Agentic Supernet Multi-Agent Architektur Suche über Agentic Supernet 通过 Agric Supernet 多剂机构建筑搜索 2502.04180v2

Authors (6): Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, Xiang Wang

Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, but constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the agentic supernet, a probabilistic and continuous distribution of agentic architectures. We introduce MaAS, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (e.g., LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS (I) requires only 6% to 45% of the inference costs of existing handcrafted or automated multi-agent systems, (II) surpasses them by 0.54% to 11.82%, and (III) enjoys superior cross-dataset and cross-LLM-backbone transferability.



Article 621

Title@2025-06-09 (1): Conjoined Predication and Scalar Implicature

Title: Conjoined Predication and Scalar Implicature Zusammengefügte Prädikation und Scalar Implicature 连带预言和斯卡拉复制 2506.07429v1

Authors (1): Ratna Kandala

Magri (2016) investigates two puzzles arising from conjunction. Although Magri has proposed a solution to the second puzzle, the first remains unresolved. This first puzzle reveals a hidden interaction among quantification, collective/concurrent interpretation, and contextual updating, dimensions that have yet to be explored. In essence, the problem is that certain forms of sentences like “Some Italians come from a warm country,” when conjoined as in “(Only) Some Italians come from a warm country and are blond,” sound infelicitous, even though no obvious alternative triggers a conflicting scalar implicature. In this paper, we offer a conceptual analysis of Magri’s first puzzle by situating it within its original theoretical framework. We argue that the oddness arises from the collective or concurrent reading of the conjunctive predicate: in examples such as “(Only) Some Italians come from a warm country and are blond,” this interpretation generates an indirect contextual contradiction. Moreover, we suggest that the pragmatic mechanisms governing scalar implicature generation extend beyond what is captured by exhaustification-based grammatical licensing accounts.



Article 622

Title@2025-06-09 (1): When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration

Title: When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration Wenn Modelle mehr wissen, als sie erklären können: Quantifizierung des Wissenstransfers in der Mensch-AI-Kollaboration 当模型知道的超出它们所能解释的时:在人类-AI合作中量化知识转让 2506.05579v2

Authors (6): Quan Shi, Carlos E. Jimenez, Shunyu Yao, Nick Haber, Diyi Yang, Karthik Narasimhan

Recent advancements in AI reasoning have driven substantial improvements across diverse tasks. A critical open question is whether these improvements also yield better knowledge transfer: the ability of models to communicate reasoning in ways humans can understand, apply, and learn from. To investigate this, we introduce Knowledge Integration and Transfer Evaluation (KITE), a conceptual and experimental framework for Human-AI knowledge transfer capabilities, and conduct the first large-scale human study (N=118) explicitly designed to measure it. In our two-phase setup, humans first ideate with an AI on problem-solving strategies, then independently implement solutions, isolating model explanations’ influence on human understanding. Our findings reveal that although model benchmark performance correlates with collaborative outcomes, this relationship is notably inconsistent, featuring significant outliers, indicating that knowledge transfer requires dedicated optimization. Our analysis identifies behavioral and strategic factors mediating successful knowledge transfer. We release our code, dataset, and evaluation framework to support future work on communicatively aligned models.



Article 623

Title@2025-06-09 (1): Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models

Title: Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models Plug-in und Feintuning: Die Lücke zwischen kleinen Sprachmodellen und großen Sprachmodellen überbrücken 插插和微调:缩小小语言模式与大语言模式之间的差距 2506.07424v1

Authors (6): Kyeonghyun Kim, Jinhee Jang, Juhwan Choi, Yoonji Lee, Kyohoon Jin, YoungBin Kim

Large language models (LLMs) are renowned for their extensive linguistic knowledge and strong generalization capabilities, but their high computational demands make them unsuitable for resource-constrained environments. In contrast, small language models (SLMs) are computationally efficient but often lack the broad generalization capacity of LLMs. To bridge this gap, we propose PiFi, a novel framework that combines the strengths of both LLMs and SLMs to achieve high performance while maintaining efficiency. PiFi integrates a single frozen layer from an LLM into an SLM and fine-tunes the combined model for specific tasks, boosting performance without a significant increase in computational cost. We show that PiFi delivers consistent performance improvements across a range of natural language processing tasks, including both natural language understanding and generation. Moreover, our findings demonstrate PiFi’s ability to effectively leverage LLM knowledge, enhancing generalization to unseen domains and facilitating the transfer of linguistic abilities.



Article 624

Title@2025-06-09 (1): SEED: Enhancing Text-to-SQL Performance and Practical Usability Through Automatic Evidence Generation

Title: SEED: Enhancing Text-to-SQL Performance and Practical Usability Through Automatic Evidence Generation SEED: Verbesserung der Text-zu-SQL-Performance und der praktischen Usability durch automatische Evidenz-Generierung SEED:通过自动产生证据提高文字到SQL的性能和实用性能 2506.07423v1

Authors (2): Janghyeon Yun, Sang-goo Lee

Text-to-SQL enables non-experts to retrieve data from databases by converting natural language queries into SQL. However, state-of-the-art text-to-SQL studies rely on the BIRD dataset, which assumes that evidence is provided along with questions. Although BIRD facilitates research advancements, it assumes that users have expertise and domain knowledge, contradicting the fundamental goal of text-to-SQL. In addition, human-generated evidence in BIRD contains defects, including missing or erroneous evidence, which affects model performance. To address this issue, we propose SEED (System for Evidence Extraction and Domain knowledge generation), an approach that automatically generates evidence to improve performance and practical usability in real-world scenarios. SEED systematically analyzes database schema, description files, and values to extract relevant information. We evaluated SEED on BIRD and Spider, demonstrating that it significantly improves SQL generation accuracy in the no-evidence scenario, and in some cases, even outperforms the setting where BIRD evidence is provided. Our results highlight that SEED-generated evidence not only bridges the gap between research and real-world deployment but also improves the adaptability and robustness of text-to-SQL models. Our code is available at https://github.com/felix01189/SEED



Article 625

Title@2025-06-09 (1): Power-Law Decay Loss for Large Language Model Finetuning: A Theory Perspective

Title: Power-Law Decay Loss for Large Language Model Finetuning: A Theory Perspective Macht-Rechts-Dekay Verlust für große Sprachmodell Finetuning: Eine Theorie-Perspektive 大语言模型微调的功率法减缩损失:理论视角 2505.16900v5

Authors (1): Jintian Shao

During the finetuning stage of text generation tasks, standard cross-entropy loss treats all tokens equally. This can lead models to overemphasize high-frequency, low-information tokens, neglecting lower-frequency tokens crucial for specificity and informativeness in generated content. This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. The core motivation for PDL stems from observations in information theory and linguistics: the informativeness of a token is often inversely proportional to its frequency of occurrence. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay. Specifically, the weights for high-frequency tokens are reduced, while low-frequency, information-dense tokens are assigned higher weights. This mechanism guides the model during finetuning to focus more on learning and generating tokens that convey specific and unique information, thereby enhancing the quality, diversity, and informativeness of the generated text. We theoretically elaborate on the motivation and construction of PDL and discuss its potential applications and advantages across various text generation finetuning tasks, such as abstractive summarization, dialogue systems, and style transfer.
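
The re-weighting described above can be sketched in a few lines. The abstract does not state the exact formula, so this sketch assumes the weight of a token is 1 / (freq + eps)^alpha, where freq is its relative frequency in the training corpus, and that the weighted loss is normalized by the sum of weights. The names `power_law_weights` and `weighted_cross_entropy`, and the values of `alpha` and `eps`, are hypothetical.

```python
import math
from collections import Counter


def power_law_weights(corpus_tokens, alpha=1.0, eps=1e-9):
    """Map each token to a weight that decays as a power of its relative frequency."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: 1.0 / ((n / total) + eps) ** alpha for tok, n in counts.items()}


def weighted_cross_entropy(target_tokens, target_probs, weights):
    """Weighted mean of -log p(token); normalizing by total weight is an assumption."""
    numerator = sum(weights[t] * -math.log(p) for t, p in zip(target_tokens, target_probs))
    denominator = sum(weights[t] for t in target_tokens)
    return numerator / denominator


corpus = ["the"] * 8 + ["quark"] * 2
weights = power_law_weights(corpus, alpha=1.0)

targets = ["the", "quark"]
probs = [0.9, 0.2]  # model probabilities for the correct tokens (illustrative)
plain = sum(-math.log(p) for p in probs) / len(probs)
print(plain, weighted_cross_entropy(targets, probs, weights))
```

With alpha = 0 every weight is 1 and the loss reduces to standard cross-entropy; larger alpha shifts more of the loss onto rare tokens such as "quark" in this toy corpus.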



Article 626

Title@2025-06-09 (1): Structured Pruning for Diverse Best-of-N Reasoning Optimization

Title: Structured Pruning for Diverse Best-of-N Reasoning Optimization Strukturierte Prunings für die unterschiedliche Best-of-N-Reasoning-Optimierung 结构审慎,实现最佳理由的多样化最佳最佳理由的优化 2506.03978v2

Authors (3): Hieu Trung Nguyen, Bao Nguyen, Viet Anh Nguyen

Model pruning in transformer-based language models, traditionally viewed as a means of achieving computational savings, can enhance the model’s reasoning capabilities. In this work, we uncover a surprising phenomenon: the selective pruning of certain attention heads leads to improvements in reasoning performance, particularly on challenging tasks. Motivated by this observation, we propose SPRINT, a novel contrastive learning framework that dynamically selects the optimal head and layer to prune during inference. By aligning question embeddings with head embeddings, SPRINT identifies those pruned-head configurations that result in more accurate reasoning. Extensive experiments demonstrate that our method significantly outperforms traditional best-of-$N$ and random head selection strategies on the MATH500 and GSM8K datasets.


Article 627

Title@2025-06-09 (1): BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance

Title: BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance BiMa: Auf dem Weg zu Biases Milderung für Text-Video-Retrieval über Szeneelement-Anleitung Bima:通过场景元素指导,争取减缓对文本视频检索的“双轨减缓” 2506.03589v2

Authors (5): Huy Le, Nhat Chung, Tung Kieu, Anh Nguyen, Ngan Le

Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model’s bias mitigation capability is consistently validated by its strong results on out-of-distribution retrieval tasks.


Article 628

Title@2025-06-09 (1): Outlier-weighed Layerwise Sampling for LLM Fine-tuning

Title: Outlier-weighed Layerwise Sampling for LLM Fine-tuning Ausgewägte schichtweise Probenahme für LLM-Feinabstimmung LLM 精密调试 2405.18380v3

Authors (4): Pengxiang Li, Lu Yin, Xiaowei Gao, Shiwei Liu

The rapid advancements in Large Language Models (LLMs) have revolutionized various natural language processing tasks. However, the substantial size of LLMs presents significant challenges in training or fine-tuning. While parameter-efficient approaches such as low-rank adaptation (LoRA) have gained popularity, they often compromise performance compared to full-rank fine-tuning. In this paper, we propose Outlier-weighed Layerwise Sampling (OWS), a new memory-efficient fine-tuning approach, inspired by the layerwise outlier distribution of LLMs. Unlike LoRA, which adds extra adapters to all layers, OWS strategically assigns higher sampling probabilities to layers with more outliers, selectively sampling only a few layers and fine-tuning their pre-trained weights. To further increase the number of fine-tuned layers without a proportional rise in memory costs, we incorporate gradient low-rank projection, further boosting the approach’s performance. Our extensive experiments across various architectures, including LLaMa2 and Mistral, demonstrate that OWS consistently outperforms baseline approaches, including full fine-tuning. Specifically, it achieves up to a 1.1% average accuracy gain on the Commonsense Reasoning benchmark, a 3.0% improvement on MMLU, and a notable 10% boost on MT-Bench, while being more memory efficient. OWS allows us to fine-tune 7B LLMs with only 21GB of memory. Our code is available at https://github.com/pixeli99/OWS.
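
The layer-selection step described above can be sketched as weighted sampling without replacement. This is only an illustration under stated assumptions: the abstract does not say how outliers are counted or how probabilities are normalized, so the per-layer `outlier_counts` input and the proportional normalization here are hypothetical.

```python
import random


def sampling_probabilities(outlier_counts):
    """Turn per-layer outlier counts into probabilities proportional to the counts."""
    if any(c <= 0 for c in outlier_counts):
        raise ValueError("this sketch assumes every layer has a positive outlier count")
    total = sum(outlier_counts)
    return [c / total for c in outlier_counts]


def sample_layers(outlier_counts, k, rng):
    """Pick k distinct layer indices, favouring layers with more outliers."""
    probs = sampling_probabilities(outlier_counts)
    remaining = list(range(len(outlier_counts)))
    weights = list(probs)
    chosen = []
    for _ in range(min(k, len(remaining))):
        # Draw one position among the layers not yet chosen, then remove it.
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return sorted(chosen)
```

In a fine-tuning loop, only the returned layers would have their pre-trained weights unfrozen for the current step or interval, and the selection would be re-drawn periodically.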


Article 629

Title@2025-06-09 (1): Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures

Title: Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures Jenseits von Jailbreaks: Enthüllen von Stealthier- und breiteren LLM-Sicherheitsrisiken, die von Alignment-Fehlern ausgehen 越越狱:揭发 “ 团结失败 “ 带来的 “ 安全风险 “ 。 2506.07402v1

Authors (3): Yukai Zhou, Sibei Yang, Wenjie Wang

Large language models (LLMs) are increasingly deployed in real-world applications, raising concerns about their security. While jailbreak attacks highlight failures under overtly harmful queries, they overlook a critical risk: incorrectly answering harmless-looking inputs can be dangerous and cause real-world harm (Implicit Harm). We systematically reformulate the LLM risk landscape through a structured quadrant perspective based on output factuality and input harmlessness, uncovering an overlooked high-risk region. To investigate this gap, we propose JailFlipBench, a benchmark that aims to capture implicit harm, spanning single-modal, multimodal, and factual extension scenarios with diverse evaluation metrics. We further develop initial JailFlip attack methodologies and conduct comprehensive evaluations across multiple open-source and black-box LLMs, showing that implicit harm presents immediate and urgent real-world risks and calling for broader LLM safety assessments and alignment beyond conventional jailbreak paradigms.


Article 630

Title@2025-06-09 (1): G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

Title: G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems G-Memory: Hierarchischer Speicher für Multi-Agent-Systeme G-记忆:为多机构系统追踪等级记忆 2506.07398v1

Authors (6): Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan

Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self-evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter-agent collaboration trajectories, and (2) lack cross-trial and agent-specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory, which manages the lengthy MAS interaction via a three-tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G-Memory performs bi-directional memory traversal to retrieve both $\textit{high-level, generalizable insights}$ that enable the system to leverage cross-trial knowledge, and $\textit{fine-grained, condensed interaction trajectories}$ that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to $20.89\%$ and $10.12\%$, respectively, without any modifications to the original frameworks. Our codes are available at https://github.com/bingreeky/GMemory.


Article 631

Title@2025-06-09 (1): Do Large Language Models Judge Error Severity Like Humans?

Title: Do Large Language Models Judge Error Severity Like Humans? Urteilen große Sprachmodelle über Schwerelosigkeit wie Menschen? 大语言模型法官 误差严重像人类吗? 2506.05142v2

Authors (5): Diege Sun, Guanyi Chen, Zhao Fan, Xiaorong Cheng, Tingting He

Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for different reasons. This suggests that these models may have internalised social norms influencing gender judgments but lack the perceptual grounding to emulate human sensitivity to colour, which is shaped by distinct neural mechanisms. Only one of the evaluated LLMs, Doubao, replicates the human-like ranking of error severity, but it fails to distinguish between error types as clearly as humans. Surprisingly, DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human judgments across both unimodal and multimodal conditions, outperforming even state-of-the-art multimodal models.


Article 632

Title@2025-06-09 (1): SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering

Title: SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering SyncMind: Messmittel Out-of-Sync-Wiederherstellung in der kollaborativen Software-Engineering Syncmind:在合作软件工程中测量合成外恢复剂 2502.06994v2

Authors (7): Xuehang Guo, Xingyao Wang, Yangyi Chen, Sha Li, Chi Han, Manling Li, Heng Ji

Software engineering (SE) is increasingly collaborative, with developers working together on shared complex codebases. Effective collaboration in shared environments requires participants – whether humans or AI agents – to stay on the same page as their environment evolves. When a collaborator’s understanding diverges from the current state – what we term the out-of-sync challenge – the collaborator’s actions may fail, leading to integration issues. In this work, we introduce SyncMind, a framework that systematically defines the out-of-sync problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE derived from 21 popular GitHub repositories with executable verification tests. Experiments on SyncBench uncover critical insights into existing LLM agents’ capabilities and limitations. Besides substantial performance gaps among agents (from Llama-3.1 agent <= 3.33% to Claude-3.5-Sonnet >= 28.18%), their consistently low collaboration willingness (<= 4.86%) suggests fundamental limitations of existing LLMs in CSE. However, when collaboration occurs, it positively correlates with out-of-sync recovery success. Minimal performance differences in agents’ resource-aware out-of-sync recoveries further reveal their significant lack of resource awareness and adaptability, shedding light on future resource-efficient collaborative systems. Code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/.


Article 633

Title@2025-06-09 (1): Zero-Shot Event Causality Identification via Multi-source Evidence Fuzzy Aggregation with Large Language Models

Title: Zero-Shot Event Causality Identification via Multi-source Evidence Fuzzy Aggregation with Large Language Models Null-Shot-Ereignis-Causalitäts-Identifizierung über Multi-Source-Evidenz Fuzzy-Aggregation mit großen Sprachmodellen 通过多种来源证据确定零热事件因果关系 2506.05675v2

Authors (6): Zefan Zeng, Xingchen Hu, Qing Cheng, Weiping Ding, Wentao Li, Zhong Liu

Event Causality Identification (ECI) aims to detect causal relationships between events in textual contexts. Existing ECI models predominantly rely on supervised methodologies, suffering from dependence on large-scale annotated data. Although Large Language Models (LLMs) enable zero-shot ECI, they are prone to causal hallucination, that is, erroneously establishing spurious causal links. To address these challenges, we propose MEFA, a novel zero-shot framework based on Multi-source Evidence Fuzzy Aggregation. First, we decompose causality reasoning into three main tasks (temporality determination, necessity analysis, and sufficiency verification) complemented by three auxiliary tasks. Second, leveraging meticulously designed prompts, we guide LLMs to generate uncertain responses and deterministic outputs. Finally, we quantify the LLMs’ responses to the sub-tasks and employ fuzzy aggregation to integrate this evidence for causality scoring and causality determination. Extensive experiments on three benchmarks demonstrate that MEFA outperforms second-best unsupervised baselines by 6.2% in F1-score and 9.3% in precision, while significantly reducing hallucination-induced errors. In-depth analysis verifies the effectiveness of task decomposition and the superiority of fuzzy aggregation.


Article 634

Title@2025-06-09 (1): Scalable Vision Language Model Training via High Quality Data Curation

Title: Scalable Vision Language Model Training via High Quality Data Curation Skalierbare Vision Sprachmodell-Schulung über hochwertige Daten-Kuration 通过高品质数据曲线计算,通过高质量数据曲线进行可缩放的愿景语言模式培训 2501.05952v3

Authors (6): Hongyuan Dong, Zijian Kang, Weijie Yin, Xiao Liang, Chao Feng, Jiao Ran

In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning via High QuaLity Data Curation), an open-source vision language model (VLM) series achieving state-of-the-art (SOTA) performance in 2B and 8B parameters. The following three key improvements contribute to SAIL-VL’s leading performance: (1) Scalable high-quality visual understanding data construction: We implement a data construction pipeline to enable hundred-million-scale high-quality recaption data annotation. The resulting dataset, SAIL-Caption, is validated to be of the highest data quality compared with open-source datasets. (2) Scalable Pretraining with High-Quality Visual Understanding Data: We scale SAIL-VL’s pretraining budget up to 655B tokens and show that even a 2B VLM benefits from scaled-up training data sizes, exhibiting logarithmic data size scaling laws in benchmark performance. (3) Scalable SFT via data quantity and complexity scaling: We curate a high-quality SFT dataset collection with leading data quantity scaling effectiveness and demonstrate that training with progressively higher-complexity data surpasses baseline one-stage training by a large margin. SAIL-VL series models achieve the highest average score in 18 widely used VLM benchmarks in our evaluation, with the 2B model taking the top position over VLMs of comparable sizes on OpenCompass 2024 (https://rank.opencompass.org.cn/leaderboard-multimodal), demonstrating robust visual comprehension abilities. SAIL-VL series models are released at HuggingFace (https://huggingface.co/BytedanceDouyinContent).


Article 635

Title@2025-06-09 (1): A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

Title: A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment Eine umfassende Umfrage in LLM(-Agent) Full Stack Sicherheit: Daten, Schulung und Bereitstellung 用LLLM(-代理)全堆安全:数据、培训和部署进行的全面调查 2504.15585v4

Authors (103): Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Shicheng Xu, Junyuan Mao, Yu Wang, Hao Wu, Minghe Wang, Fan Zhang, Junfeng Fang, Wenjie Qu, Yue Liu, Chengwei Liu, Yifan Zhang, Qiankun Li, Chongye Guo, Yalan Qin, Zhaoxin Fan, Kai Wang, Yi Ding, Donghai Hong, Jiaming Ji, Yingxin Lai, Zitong Yu, Xinfeng Li, Yifan Jiang, Yanhui Li, Xinyu Deng, Junlin Wu, Dongxia Wang, Yihao Huang, Yufei Guo, Jen-tse Huang, Qiufeng Wang, Xiaolong Jin, Wenxuan Wang, Dongrui Liu, Yanwei Yue, Wenke Huang, Guancheng Wan, Heng Chang, Tianlin Li, Yi Yu, Chenghao Li, Jiawei Li, Lei Bai, Jie Zhang, Qing Guo, Jingyi Wang, Tianlong Chen, Joey Tianyi Zhou, Xiaojun Jia, Weisong Sun, Cong Wu, Jing Chen, Xuming Hu, Yiming Li, Xiao Wang, Ningyu Zhang, Luu Anh Tuan, Guowen Xu, Jiaheng Zhang, Tianwei Zhang, Xingjun Ma, Jindong Gu, Liang Pang, Xiang Wang, Bo An, Jun Sun, Mohit Bansal, Shirui Pan, Lingjuan Lyu, Yuval Elovici, Bhavya Kailkhura, Yaodong Yang, Hongwei Li, Wenyuan Xu, Yizhou Sun, Wei Wang, Qing Li, Ke Tang, Yu-Gang Jiang, Felix Juefei-Xu, Hui Xiong, Xiaofeng Wang, Dacheng Tao, Philip S. Yu, Qingsong Wen, Yang Liu

The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concern, not only for researchers and corporations but also for every nation. Currently, existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e.g., deployment phase or fine-tuning phase, lacking a comprehensive understanding of the entire “lifechain” of LLMs. To address this gap, this paper introduces, for the first time, the concept of “full-stack” safety to systematically consider safety issues throughout the entire process of LLM training, deployment, and eventual commercialization. Compared to the off-the-shelf LLM safety surveys, our work demonstrates several distinctive advantages: (I) Comprehensive Perspective. We define the complete LLM lifecycle as encompassing data preparation, pre-training, post-training, deployment and final commercialization. To our knowledge, this represents the first safety survey to encompass the entire lifecycle of LLMs. (II) Extensive Literature Support. Our research is grounded in an exhaustive review of more than 800 papers, ensuring comprehensive coverage and systematic organization of security issues within a more holistic understanding. (III) Unique Insights. Through systematic literature analysis, we have developed reliable roadmaps and perspectives for each chapter. Our work identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM-based agent systems. These insights provide valuable guidance for researchers pursuing future work in this field.


Article 636

Title@2025-06-09 (1): Theoretical Benefit and Limitation of Diffusion Language Model

Title: Theoretical Benefit and Limitation of Diffusion Language Model Theoretischer Nutzen und Begrenzung des Diffusions-Sprachmodells 传播语言模式理论效益和限制 2502.09622v2

Authors (6): Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, Di He

Diffusion language models have emerged as a promising approach for text generation. One would naturally expect this method to be an efficient replacement for autoregressive models since multiple tokens can be sampled in parallel during each diffusion step. However, its efficiency-accuracy trade-off is not yet well understood. In this paper, we present a rigorous theoretical analysis of a widely used type of diffusion language model, the Masked Diffusion Model (MDM), and find that its effectiveness heavily depends on the target evaluation metric. Under mild conditions, we prove that when using perplexity as the metric, MDMs can achieve near-optimal perplexity in sampling steps regardless of sequence length, demonstrating that efficiency can be achieved without sacrificing performance. However, when using the sequence error rate, which is important for understanding the “correctness” of a sequence such as a reasoning chain, we show that the required sampling steps must scale linearly with sequence length to obtain “correct” sequences, thereby eliminating MDM’s efficiency advantage over autoregressive models. Our analysis establishes the first theoretical foundation for understanding the benefits and limitations of MDMs. All theoretical findings are supported by empirical studies.


Article 637

Title@2025-06-09 (1): Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models

Title: Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models Knowledge-to-Jailbreak: Untersuchung wissensgetriebener Jailbreaking-Angriffe für große Sprachmodelle 知识对正义的突破:调查以知识为驱动的大型语言模式破狱袭击 2406.11682v2

Authors (9): Shangqing Tu, Zhuoran Pan, Wenxuan Wang, Zhexin Zhang, Yuliang Sun, Jifan Yu, Hongning Wang, Lei Hou, Juanzi Li

Large language models (LLMs) have been increasingly applied to various domains, which triggers increasing concerns about LLMs’ safety on specialized domains, e.g. medicine. Despite prior explorations on general jailbreaking attacks, there are two challenges for applying existing attacks on testing the domain-specific safety of LLMs: (1) Lack of professional knowledge-driven attacks, (2) Insufficient coverage of domain knowledge. To bridge this gap, we propose a new task, knowledge-to-jailbreak, which aims to generate jailbreaking attacks from domain knowledge, requiring both attack effectiveness and knowledge relevance. We collect a large-scale dataset with 12,974 knowledge-jailbreak pairs and fine-tune a large language model as jailbreak-generator, to produce domain knowledge-specific jailbreaks. Experiments on 13 domains and 8 target LLMs demonstrate the effectiveness of jailbreak-generator in generating jailbreaks that are both threatening to the target LLMs and relevant to the given knowledge. We also apply our method to an out-of-domain knowledge base, showing that jailbreak-generator can generate jailbreaks that are comparable in harmfulness to those crafted by human experts. Data and code are available at: https://github.com/THU-KEG/Knowledge-to-Jailbreak/.


Article 638

Title@2025-06-09 (1): Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation

Title: Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation Refusal-Feature-geführter Lehrer für sichere Feinsteuerung über Datenfilterung und Alignment Destillation 通过数据过滤和对齐蒸馏进行安全微调的拒绝制导师教师 2506.07356v1

Authors (6): Seokil Ham, Yubin Choi, Seungju Cho, Yujin Yang, Younghun Kim, Changick Kim

Recently, major AI service providers such as Google and OpenAI have introduced Finetuning-as-a-Service, which enables users to customize Large Language Models (LLMs) for specific downstream tasks using their own data. However, this service is vulnerable to degradation of LLM safety-alignment when user data contains harmful prompts. While some prior works address this issue, fundamentally filtering harmful data from user data remains unexplored. Motivated by our observation that a directional representation reflecting refusal behavior (called the refusal feature) obtained from safety-aligned LLMs can inherently distinguish between harmful and harmless prompts, we propose the Refusal-Feature-guided Teacher (ReFT). Our ReFT model is trained to identify harmful prompts based on the similarity between input prompt features and its refusal feature. During finetuning, the ReFT model serves as a teacher that filters harmful prompts from user data and distills alignment knowledge into the base model. Extensive experiments demonstrate that our ReFT-based finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in Finetuning-as-a-Service.


Article 639

Title@2025-06-09 (1): Unveiling and Addressing Pseudo Forgetting in Large Language Models

Title: Unveiling and Addressing Pseudo Forgetting in Large Language Models Enthüllung und Adressierung von Pseudo Forgetting in großen Sprachmodellen 在大语言模式中统一和解决 “ 优于 “ 的遗忘 2411.11932v2

Authors (5): Huashan Sun, Yizhe Yang, Yinghao Li, Jiawei Li, Yang Gao

Although substantial efforts have been made to mitigate catastrophic forgetting in continual learning, the intrinsic mechanisms are not well understood. In this work, we demonstrate the existence of “pseudo forgetting”: the performance degradation on previous tasks is not attributable to a loss of capabilities, but rather to the failure of the instructions to activate the appropriate model abilities. We show that the model’s performance on previous tasks can be restored through two simple interventions: (1) providing a partial external correct rationale, and (2) appending semantically meaningless suffixes to the original instructions, to guide the generation of correct rationales. Through empirical analysis of the internal mechanisms governing rationale generation, we reveal that models exhibiting pseudo forgetting show reduced instruction dependence during rationale generation, leading to suboptimal activation of their inherent capabilities. Based on this insight, we propose the Rationale-Guidance Difficulty based Replay (RGD-R) framework, which dynamically allocates replay data based on the model’s ability to correctly leverage its intrinsic capabilities. Experimental results demonstrate that RGD-R effectively mitigates pseudo forgetting while maintaining model plasticity.


Article 640

Title@2025-06-09 (1): What if LLMs Have Different World Views: Simulating Alien Civilizations with LLM-based Agents

Title: What if LLMs Have Different World Views: Simulating Alien Civilizations with LLM-based Agents Was, wenn LLMs unterschiedliche Weltansichten haben: Alien-Zivilisationen mit LLM-basierten Agenten simulieren 如果LLLM女士具有不同的世界观点:用以LLM为基地的代理人模拟外星文明 2402.13184v6

Authors (8): Zhaoqian Xue, Beichen Wang, Suiyuan Zhu, Kai Mei, Hua Tang, Wenyue Hua, Mengnan Du, Yongfeng Zhang

This study introduces “CosmoAgent,” an innovative artificial intelligence system that utilizes Large Language Models (LLMs) to simulate complex interactions between human and extraterrestrial civilizations. This paper introduces a mathematical model for quantifying the levels of civilization development and further employs a state transition matrix approach to evaluate their trajectories. Through this methodology, our study quantitatively analyzes the growth trajectories of civilizations, providing insights into future decision-making at critical points of growth and saturation. Furthermore, this paper acknowledges the vast diversity of potential living conditions across the universe, which could foster unique cosmologies, ethical codes, and worldviews among different civilizations. Recognizing the Earth-centric bias inherent in current LLM designs, we propose the novel concept of using LLM agents with diverse ethical paradigms and simulating interactions between entities with distinct moral principles. This innovative research not only introduces a novel method for comprehending potential inter-civilizational dynamics but also holds practical value in enabling entities with divergent value systems to strategize, prevent conflicts, and engage in games under conditions of asymmetric information. The accompanying code is available at https://github.com/MingyuJ666/Simulating-Alien-Civilizations-with-LLM-based-Agents.


Article 641

Title@2025-06-09 (1): Improving LLM Reasoning through Interpretable Role-Playing Steering

Title: Improving LLM Reasoning through Interpretable Role-Playing Steering Verbesserung der LLM-Reasoning durch interpretierbare Rollenspiel-Lenkung 通过可解释性角色扮演指导改进LLM 2506.07335v1

Authors (5): Anyi Wang, Dong Shu, Yifan Wang, Yunpu Ma, Mengnan Du

Role-playing has emerged as an effective technique for enhancing the reasoning capabilities of large language models (LLMs). However, existing methods primarily rely on prompt engineering, which often lacks stability and interpretability. In this paper, we introduce Sparse Autoencoder Role-Playing Steering (SRPS), a novel framework that identifies and manipulates internal model features associated with role-playing behavior. Our approach extracts latent representations from role-play prompts, selects the most relevant features based on activation patterns, and constructs a steering vector that can be injected into the model’s residual stream with controllable intensity. Our method enables fine-grained control over role-specific behavior and offers insights into how role information influences internal model activations. Extensive experiments across various reasoning benchmarks and model sizes demonstrate consistent performance gains. Notably, in the zero-shot chain-of-thought (CoT) setting, the accuracy of Llama3.1-8B on CSQA improves from 31.86% to 39.80%, while Gemma2-9B on SVAMP increases from 37.50% to 45.10%. These results highlight the potential of SRPS to enhance reasoning ability in LLMs, providing better interpretability and stability compared to traditional prompt-based role-playing.
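
The pipeline described above (select features by activation, build a steering vector, inject it with a controllable intensity) can be sketched in a few lines. Everything specific here is an assumption made for illustration: the abstract does not state how features are ranked or combined, so the top-k selection by mean activation, the activation-weighted sum of sparse-autoencoder decoder rows, and the names `decoder_rows`, `mean_activations`, and `intensity` are hypothetical.

```python
def top_k_features(mean_activations, k):
    """Return indices of the k features with the highest mean activation."""
    order = sorted(range(len(mean_activations)),
                   key=lambda i: mean_activations[i], reverse=True)
    return order[:k]


def build_steering_vector(decoder_rows, mean_activations, feature_ids):
    """Combine the decoder directions of the selected features.

    decoder_rows[f] is the (hypothetical) decoder direction of feature f in
    residual-stream space; each direction is weighted by its mean activation.
    """
    dim = len(decoder_rows[0])
    vector = [0.0] * dim
    for f in feature_ids:
        for d in range(dim):
            vector[d] += mean_activations[f] * decoder_rows[f][d]
    return vector


def steer(residual, steering_vector, intensity):
    """Add the steering vector to one residual-stream state, scaled by intensity."""
    return [h + intensity * v for h, v in zip(residual, steering_vector)]
```

Setting `intensity` to 0 leaves the residual state unchanged, which gives a natural baseline when sweeping the steering strength.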


Article 642

Title@2025-06-09 (1): BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain

Title: BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain BioHopR: Ein Benchmark für Multi-Hop, Multi-Answer Reasoning in der biomedizinischen Domäne BioHopR:生物医学领域多层次、多层次原因基准 2505.22240v2

Authors (3): Yunsoo Kim, Yusuf Abdulle, Honghan Wu

Biomedical reasoning often requires traversing interconnected relationships across entities such as drugs, diseases, and proteins. Despite the increasing prominence of large language models (LLMs), existing benchmarks lack the ability to evaluate multi-hop reasoning in the biomedical domain, particularly for queries involving one-to-many and many-to-many relationships. This gap leaves the critical challenges of biomedical multi-hop reasoning underexplored. To address this, we introduce BioHopR, a novel benchmark designed to evaluate multi-hop, multi-answer reasoning in structured biomedical knowledge graphs. Built from the comprehensive PrimeKG, BioHopR includes 1-hop and 2-hop reasoning tasks that reflect real-world biomedical complexities. Evaluations of state-of-the-art models reveal that O3-mini, a proprietary reasoning-focused model, achieves 37.93% precision on 1-hop tasks and 14.57% on 2-hop tasks, outperforming proprietary models such as GPT4O and open-source biomedical models including HuatuoGPT-o1-70B and Llama-3.3-70B. However, all models exhibit significant declines in multi-hop performance, underscoring the challenges of resolving implicit reasoning steps in the biomedical domain. By addressing the lack of benchmarks for multi-hop reasoning in biomedical domain, BioHopR sets a new standard for evaluating reasoning capabilities and highlights critical gaps between proprietary and open-source models while paving the way for future advancements in biomedical LLMs.


Article 643

Title@2025-06-09 (1): Rational Decision-Making Agent with Internalized Utility Judgment

Title: Rational Decision-Making Agent with Internalized Utility Judgment Rationaler Entscheidungsbevollmächtigter mit internalisiertem Utility-Urteil 具有内化公用事业判决的合理决策代理 2308.12519v3

Authors (8): Yining Ye, Xin Cong, Shizuo Tian, Yujia Qin, Chong Liu, Yankai Lin, Zhiyuan Liu, Maosong Sun

Large language models (LLMs) have demonstrated remarkable advancements and have attracted significant efforts to develop LLMs into agents capable of executing intricate multi-step decision-making tasks beyond traditional NLP applications. Existing approaches to LLM-based decision-making predominantly build upon manually designed external performance metrics to guide the decision-making process. However, reliance on external performance metrics as a prior is problematic in real-world scenarios, where such a prior may be unavailable, flawed, or even erroneous. For genuine autonomous decision making, it is imperative for the agent to develop its rationality from its posterior experiences to judge decisions independently. Central to the development of rationality is the construction of an internalized utility judgment, capable of assigning numerical utilities to each decision. This paper proposes RadAgent (Rational Decision-Making Agent), which fosters the development of its rationality through an iterative framework involving Experience Exploration and Utility Learning. Within this framework, Elo-based Utility Construction is devised to assign Elo scores to individual decision steps to judge their utilities via pairwise comparisons. Consequently, these Elo scores guide the decision-making process to derive optimal outcomes. Experimental results on the ToolBench dataset demonstrate RadAgent’s superiority over baselines, achieving over 10% improvement in Pass Rate on diverse tasks. It offers higher-quality solutions and reduces costs (ChatGPT API calls), highlighting its effectiveness and efficiency.
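
Scoring decision steps through pairwise comparisons can be illustrated with the standard Elo update rule. This is a sketch of textbook Elo only; the abstract does not give RadAgent's exact formulation, so the K-factor of 32, the 400-point scale, and the function names are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, a_wins, k=32.0):
    """Update both ratings after one pairwise comparison.

    a_wins is the outcome of the comparison (for example, an LLM judging
    which of two decision steps is better). Returns (new_a, new_b).
    """
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (s_a - e_a)
    new_b = rating_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b
```

Because the two updates are equal and opposite, the total rating across compared steps is conserved, so scores stay comparable as more comparisons accumulate.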



Article 644

Title@2025-06-09 (1): Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis

Title: Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis Tree-of-Debatte: Multi-Persona-Debatte Bäume Elizites kritisches Denken für wissenschaftliche vergleichende Analyse 理论树:多人辩论树 用于科学比较分析的 明确关键思考 2502.14767v2

Authors (4): Priyanka Kargupta, Ishika Agarwal, Tal August, Jiawei Han

With the exponential growth of research facilitated by modern technology and improved accessibility, scientific discoveries have become increasingly fragmented within and across fields. This makes it challenging to assess the significance, novelty, incremental findings, and equivalent ideas between related works, particularly those from different research communities. Large language models (LLMs) have recently demonstrated strong quantitative and qualitative reasoning abilities, and multi-agent LLM debates have shown promise in handling complex reasoning tasks by exploring diverse perspectives and reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a framework which converts scientific papers into LLM personas that debate their respective novelties. To emphasize structured, critical reasoning rather than focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling fine-grained analysis of independent novelty arguments within scholarly articles. Through experiments on scientific literature across various domains, evaluated by expert researchers, we demonstrate that ToD generates informative arguments, effectively contrasts papers, and supports researchers in their literature review.



Article 645

Title@2025-06-08 (7): Reward Model Interpretability via Optimal and Pessimal Tokens

Title: Reward Model Interpretability via Optimal and Pessimal Tokens Reward Model Interpretierbarkeit über Optimale und Pessimal Token 通过最佳和微量调制调来解释性 2506.07326v1

Authors (5): Brian Christian, Hannah Rose Kirk, Jessica A. F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska

Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves – which directly encode human value judgments by turning prompt-response pairs into scalar rewards – remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training – distortions that risk propagating through the downstream large language models now deployed to millions.



Article 646

Title@2025-06-08 (7): Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race

Title: Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race Ausgerichtet, aber blind: Ausgerichtetheit erhöht Implizite Bias, indem sie das Bewusstsein für die Rasse verringert 结盟但盲目:通过减少对种族问题的认识,协调会增加隐含的偏见 2506.00253v3

Authors (4): Lihao Sun, Chengzhi Mao, Valentin Hofmann, Xuechunzi Bai

Although value-aligned language models (LMs) appear unbiased in explicit bias evaluations, they often exhibit stereotypes in implicit word association tasks, raising concerns about their fair usage. We investigate the mechanisms behind this discrepancy and find that alignment surprisingly amplifies implicit bias in model outputs. Specifically, we show that aligned LMs, unlike their unaligned counterparts, overlook racial concepts in early internal representations when the context is ambiguous. Not representing race likely fails to activate safety guardrails, leading to unintended biases. Inspired by this insight, we propose a new bias mitigation strategy that works by incentivizing the representation of racial concepts in the early model layers. In contrast to conventional mitigation methods of machine unlearning, our interventions find that steering the model to be more aware of racial concepts effectively mitigates implicit bias. Similar to race blindness in humans, ignoring racial nuances can inadvertently perpetuate subtle biases in LMs.



Article 647

Title@2025-06-08 (7): Synergizing Unsupervised Episode Detection with LLMs for Large-Scale News Events

Title: Synergizing Unsupervised Episode Detection with LLMs for Large-Scale News Events Synergisieren der unüberwachten Episodenerkennung mit LLMs für großformatige News-Events 与大型新闻事件LMsLM公司同步进行未受监督的子集探测 2408.04873v2

Authors (5): Priyanka Kargupta, Yunyi Zhang, Yizhu Jiao, Siru Ouyang, Jiawei Han

State-of-the-art automatic event detection struggles with interpretability and adaptability to evolving large-scale key events – unlike episodic structures, which excel in these areas. Often overlooked, episodes represent cohesive clusters of core entities performing actions at a specific time and location; a partially ordered sequence of episodes can represent a key event. This paper introduces a novel task, episode detection, which identifies episodes within a news corpus of key event articles. Detecting episodes poses unique challenges, as they lack explicit temporal or locational markers and cannot be merged using semantic similarity alone. While large language models (LLMs) can aid with these reasoning difficulties, they suffer with long contexts typical of news corpora. To address these challenges, we introduce EpiMine, an unsupervised framework that identifies a key event’s candidate episodes by leveraging natural episodic partitions in articles, estimated through shifts in discriminative term combinations. These candidate episodes are more cohesive and representative of true episodes, synergizing with LLMs to better interpret and refine them into final episodes. We apply EpiMine to our three diverse, real-world event datasets annotated at the episode level, where it achieves a 59.2% average gain across all metrics compared to baselines.



Article 648

Title@2025-06-08 (7): Watermarking Language Models with Error Correcting Codes

Title: Watermarking Language Models with Error Correcting Codes Wasserzeichen-Sprachenmodelle mit Fehler-Korrektur-Codes 错误校正代码的划水语言模型 2406.10281v4

Authors (4): Patrick Chao, Yan Sun, Edgar Dobriban, Hamed Hassani

Recent progress in large language models enables the creation of realistic machine-generated content. Watermarking is a promising approach to distinguish machine-generated text from human text, embedding statistical signals in the output that are ideally undetectable to humans. We propose a watermarking framework that encodes such signals through an error correcting code. Our method, termed robust binary code (RBC) watermark, introduces no noticeable degradation in quality. We evaluate our watermark on base and instruction fine-tuned models and find that our watermark is robust to edits, deletions, and translations. We provide an information-theoretic perspective on watermarking, a powerful statistical test for detection and for generating $p$-values, and theoretical guarantees. Our empirical findings suggest our watermark is fast, powerful, and robust, comparing favorably to the state-of-the-art.



Article 649

Title@2025-06-08 (7): ConfQA: Answer Only If You Are Confident

Title: ConfQA: Answer Only If You Are Confident ConfQA: Nur beantworten, wenn Sie zuversichtlich sind ConfQA: 仅回答您是否确信 2506.07309v1

Authors (14): Yin Huang, Yifan Ethan Xu, Kai Sun, Vera Yan, Alicia Sun, Haidar Khan, Jimmy Nguyen, Mohammad Kachuee, Zhaojiang Lin, Yue Liu, Aaron Colak, Anuj Kumar, Wen-tau Yih, Xin Luna Dong

Can we teach Large Language Models (LLMs) to refrain from hallucinating factual statements? In this paper we present a fine-tuning strategy that we call ConfQA, which can reduce the hallucination rate from 20-40% to under 5% across multiple factuality benchmarks. The core idea is simple: when the LLM answers a question correctly, it is trained to continue with the answer; otherwise, it is trained to admit “I am unsure”. Two key factors make the training highly effective. First, we introduce a dampening prompt, “answer only if you are confident”, to explicitly guide the behavior; without it, the hallucination rate remains as high as 15%-25%. Second, we leverage simple factual statements, specifically attribute values from knowledge graphs, to help LLMs calibrate their confidence, resulting in robust generalization across domains and question types. Building on this insight, we propose the Dual Neural Knowledge framework, which seamlessly selects between internally parameterized neural knowledge and externally recorded symbolic knowledge based on ConfQA’s confidence. The framework enables potential accuracy gains to beyond 95%, while reducing unnecessary external retrievals by over 30%.
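
The training rule described in this abstract (keep the answer when it is correct, otherwise train the model to say it is unsure) can be sketched as a small data-construction step. The exact-match correctness check, the prompt layout, and the name `build_example` are assumptions made for illustration; the abstract does not specify how correctness is judged.

```python
UNSURE = "I am unsure"
DAMPENER = "Answer only if you are confident."  # dampening prompt quoted in the abstract


def build_example(question: str, model_answer: str, gold_answer: str) -> dict:
    """Build one fine-tuning example from a question and the model's own answer.

    Correctness is decided by case-insensitive exact match, which is an
    assumed stand-in for whatever judgment the authors actually use.
    """
    is_correct = model_answer.strip().lower() == gold_answer.strip().lower()
    return {
        "prompt": f"{question}\n{DAMPENER}",
        "target": model_answer if is_correct else UNSURE,
    }


# Hypothetical knowledge-graph style questions (attribute values of entities).
right = build_example("What is the capital of France?", "Paris", "paris")
wrong = build_example("What is the capital of Australia?", "Sydney", "Canberra")
print(right["target"], "|", wrong["target"])
```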



Article 650

Title@2025-06-08 (7): Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models

Title: Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models Zuverlässige wissenschaftliche Hypothesen-Generation: Bewertung von Wahrhaftigkeit und Halluzination in großen Sprachmodellen 走向可靠的科学假设一代:在大语言模型中评估真理和幻觉 2505.14599v2

Authors (8): Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang

Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful scientific hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.



Article 651

Title@2025-06-08 (7): Highly Fast Text Segmentation With Pairwise Markov Chains

Title: Highly Fast Text Segmentation With Pairwise Markov Chains Hochschnelle Textsegmentierung mit paarweise Markov-Ketten 配对马科夫链条的快速文本分割 2102.11037v2

Authors (4): Elie Azeraf, Emmanuel Monfrini, Emmanuel Vignon, Wojciech Pieczynski

The current trend in Natural Language Processing (NLP) is to use ever more extra data to build the best possible models. This implies higher computational costs and longer training times, makes deployment harder, and raises concerns about these models’ carbon footprint, which point to a critical problem for the future. Against this trend, our goal is to develop NLP models that require no extra data and minimize training time. To do so, in this paper, we explore two Markov chain models, the Hidden Markov Chain (HMC) and the Pairwise Markov Chain (PMC), for NLP segmentation tasks. We apply these models to three classic applications: POS Tagging, Named-Entity Recognition, and Chunking. We develop an original method to adapt these models to the specific challenges of text segmentation, obtaining relevant performance with very short training and execution times. PMC achieves results equivalent to those obtained by Conditional Random Fields (CRF), one of the most widely applied models for these tasks when no extra data are used. Moreover, PMC’s training times are 30 times shorter than those of the CRF, which validates this model given our objectives.



Article 652

Title@2025-06-08 (7): Subjectivity in the Annotation of Bridging Anaphora

Title: Subjectivity in the Annotation of Bridging Anaphora Subjektivität in der Annotation von Brücken Anaphora 连接Adaphora说明中的主观性 2506.07297v1

Authors (2): Lauren Levine, Amir Zeldes

Bridging refers to the associative relationship between inferable entities in a discourse and the antecedents which allow us to understand them, such as understanding what “the door” means with respect to an aforementioned “house”. As identifying associative relations between entities is an inherently subjective task, it is difficult to achieve consistent agreement in the annotation of bridging anaphora and their antecedents. In this paper, we explore the subjectivity involved in the annotation of bridging instances at three levels: anaphor recognition, antecedent resolution, and bridging subtype selection. To do this, we conduct an annotation pilot on the test set of the existing GUM corpus, and propose a newly developed classification system for bridging subtypes, which we compare to previously proposed schemes. Our results suggest that some previous resources are likely to be severely under-annotated. We also find that while agreement on the bridging subtype category was moderate, annotator overlap for exhaustively identifying instances of bridging is low, and that many disagreements resulted from subjective understanding of the entities involved.



Article 653

Title@2025-06-08 (7): Exploring the Impact of Temperature on Large Language Models: Hot or Cold?

Title: Exploring the Impact of Temperature on Large Language Models: Hot or Cold? Den Einfluss der Temperatur auf große Sprachmodelle erforschen: Heiß oder kalt? 探索温度对大语言模型的影响:热还是冷? 2506.07295v1

Authors (5): Lujun Li, Lama Sleem, Niccolò Gentile, Geoffrey Nichil, Radu State

The sampling temperature, a critical hyperparameter in large language models (LLMs), modifies the logits before the softmax layer, thereby reshaping the distribution of output tokens. Recent studies have challenged the Stochastic Parrots analogy by demonstrating that LLMs are capable of understanding semantics rather than merely memorizing data and that randomness, modulated by sampling temperature, plays a crucial role in model inference. In this study, we systematically evaluated the impact of temperature in the range of 0 to 2 on data sets designed to assess six different capabilities, conducting statistical analyses on open source models of three different sizes: small (1B–4B), medium (6B–13B), and large (40B–80B). Our findings reveal distinct skill-specific effects of temperature on model performance, highlighting the complexity of optimal temperature selection in practical applications. To address this challenge, we propose a BERT-based temperature selector that takes advantage of these observed effects to identify the optimal temperature for a given prompt. We demonstrate that this approach can significantly improve the performance of small and medium models in the SuperGLUE datasets. Furthermore, our study extends to FP16 precision inference, revealing that temperature effects are consistent with those observed in 4-bit quantized models. By evaluating temperature effects up to 4.0 in three quantized models, we find that the Mutation Temperature – the point at which significant performance changes occur – increases with model size.
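
The way temperature reshapes the output distribution can be shown with a few lines of code. This is a generic sketch of temperature-scaled softmax, not code from the paper; the function name and the example logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax.

    A temperature of exactly 0 is conventionally handled as greedy (argmax)
    decoding, so it is rejected here rather than dividing by zero.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the highest-logit token, while higher temperatures flatten the distribution and make sampling more random.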



Article 654

Title@2025-06-08 (7): MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache

Title: MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache MiniKV: Die Grenzen der LLM-Inferenz durch 2-Bit-Layer-Diskriminativ-KV-Cache drücken MiniKV: 通过2-Bit 层分解 KV 缓存推进LLM 推断的限度 2411.18077v3

Authors (5): Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang

How to efficiently serve LLMs in practice has become exceptionally challenging due to their prohibitive memory and computation requirements. In this study, we investigate optimizing the KV cache, whose memory footprint poses a critical bottleneck in LLM inference, especially when dealing with long context tasks. To tackle the challenge, we introduce MiniKV, a KV cache optimization method that simultaneously preserves long context task accuracy while significantly reducing KV cache size via a novel 2-bit layer-discriminative KV cache. More importantly, we develop specialized CUDA kernels to make MiniKV compatible with FlashAttention. Experiments on a wide range of long context tasks show that MiniKV effectively achieves 86% KV cache compression ratio while recovering over 98.5% of accuracy, outperforming state-of-the-art methods while achieving excellent measured system performance improvements.
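
To give a feel for what 2-bit storage means, the sketch below applies plain uniform min-max quantization to a small group of values, mapping each to one of four levels. It is an assumed, simplified illustration only: it does not reproduce MiniKV's layer-discriminative policy or its CUDA kernels, and the function names are hypothetical.

```python
def quantize_2bit(values: list[float]):
    """Map each value to an integer code in {0, 1, 2, 3} (2 bits per value)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # 4 levels span 3 intervals; avoid a zero scale
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def dequantize_2bit(codes: list[int], lo: float, scale: float) -> list[float]:
    """Reconstruct approximate values from the codes and the group's lo/scale."""
    return [lo + c * scale for c in codes]


group = [-1.0, -0.2, 0.3, 1.0]  # hypothetical slice of a key or value vector
codes, lo, scale = quantize_2bit(group)
restored = dequantize_2bit(codes, lo, scale)
print(codes, [round(x, 3) for x in restored])
```

Each group needs only its codes plus two floating-point numbers (`lo` and `scale`), which is where the memory saving comes from; the price is a rounding error of at most half a step per value.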



Article 655

Title@2025-06-08 (7): Selective Prompt Anchoring for Code Generation

Title: Selective Prompt Anchoring for Code Generation Selektive Prompt-Ankerung für die Code-Generierung 用于代码生成的选择性提示锚定 2408.09121v5

Authors (2): Yuan Tian, Tianyi Zhang

Recent advances in large language models (LLMs) have transformed software development by automatically generating code from natural language. Yet challenges remain in generating fully correct code that aligns with user intent. Our study reveals that LLMs tend to pay less attention to user prompts as more code tokens are generated. We hypothesize that this attention dilution issue is an important reason for code generation errors. To mitigate this issue, we propose Selective Prompt Anchoring (SPA) to guide code LLMs to pay more attention to user intent when generating code. We evaluate SPA using six base LLMs across six benchmarks. Our results demonstrate that SPA enhances Pass@1 by up to 12.9%, consistently outperforming SOTA code generation methods in all settings. Our code is available at https://github.com/magic-YuanTian/Selective-Prompt-Anchoring.



Article 656

Title@2025-06-08 (7): Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages

Title: Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages Parsing the Switch: LLM-basierte UD-Annotation für komplexe Code-Switched- und Low-Resource-Sprachen 开关剖析:基于LLM的LLM UD 复杂代码开关和低资源语言说明 2506.07274v1

Authors (5): Olga Kellert, Nemika Tyagi, Muhammad Imran, Nelvin Licona-Guevara, Carlos Gómez-Rodríguez

Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Parser, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Parser achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments. Data and source code are available at https://github.com/N3mika/ParsingProject



Article 657

Title@2025-06-08 (7): Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs

Title: Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs Fragebeantwortung unter Zeitkonflikt: Bewertung und Organisation des sich entwickelnden Wissens mit LLMs 在时间冲突下回答问题:与LLMs一道评价和组织不断发展的知识 2506.07270v1

Authors (2): Atahan Özer, Çağatay Yıldız

Large language models (LLMs) exhibit remarkable capabilities in question answering and reasoning thanks to their extensive parametric memory. However, their knowledge is inherently limited by the scope of their pre-training data, while real-world information evolves continuously. Updating this knowledge typically requires costly and brittle re-training, or in-context learning (ICL), which becomes impractical at scale given the volume and volatility of modern information. Motivated by these limitations, we investigate how LLMs perform when exposed to temporal text corpora, or documents that reflect evolving knowledge over time, such as sports biographies where facts like a player’s “current team” change year by year. To this end, we introduce two new benchmarks: Temporal Wiki, which captures factual drift across historical Wikipedia snapshots, and Unified Clark, which aggregates timestamped news articles to simulate real-world information accumulation. Our analysis reveals that LLMs often struggle to reconcile conflicting or outdated facts and can be misled when multiple versions of a fact appear in context. To address these issues, we propose a lightweight, agentic framework that incrementally builds a structured, external memory from source documents without requiring re-training. This knowledge organization strategy enables models to retrieve and reason over temporally filtered, relevant information at inference time. Empirically, our method outperforms ICL and RAG baselines across both benchmarks, especially on questions requiring more complex reasoning or integration of conflicting facts.



Article 658

Title@2025-06-08 (7): AMPO: Active Multi-Preference Optimization for Self-play Preference Selection

Title: AMPO: Active Multi-Preference Optimization for Self-play Preference Selection AMPO: Aktive Multi-Preference-Optimierung für Selbstspiel-Preference-Auswahl AMPO: 主动的自玩首选多优先优化 2502.18293v2

Authors (5): Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan

Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, thereby enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, rendering it computationally infeasible to include all responses in the training objective. In this work, we propose Active Multi-Preference Optimization (AMPO), a novel approach that combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses and then select a small, yet informative, subset that covers reward extremes and distinct semantic clusters for preference optimization. Our contrastive training scheme is capable of identifying not only the best and worst answers but also subtle, underexplored modes that are crucial for robust alignment. Theoretically, we provide guarantees for expected reward maximization using our active selection method, and empirically, AMPO achieves state-of-the-art results on AlpacaEval using Llama 8B and Mistral 7B. We release our datasets at https://huggingface.co/Multi-preference-Optimization.



Article 659

Title@2025-06-08 (7): DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products

Title: DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products DeltaProdukt: Verbesserung der State-Tracking in linearen RNNs über Haushaltsprodukte DeltaProduction:通过家用产品改进国家通过家用产品对Linear RNNNs的跟踪 2502.10297v5

Authors (6): Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, Riccardo Grazzi

Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However, existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices. Diagonal matrices, used in models such as Mamba, GLA, or mLSTM, yield fast runtime but have limited expressivity. To address this, recent architectures such as DeltaNet and RWKV-7 adopted a diagonal plus rank-1 structure, which allows simultaneous token and channel mixing, improving associative recall and, as recently shown, state-tracking when allowing negative eigenvalues in the state-transition matrices. Building on the interpretation of DeltaNet’s recurrence as performing one step of online gradient descent per token on an associative recall loss, we introduce DeltaProduct, which instead takes multiple ($n_h$) steps per token. This naturally leads to diagonal plus rank-$n_h$ state-transition matrices, formed as products of $n_h$ generalized Householder transformations, providing a tunable mechanism to balance expressivity and efficiency. We provide a detailed theoretical characterization of the state-tracking capability of DeltaProduct in finite precision, showing how it improves by increasing $n_h$. Our extensive experiments demonstrate that DeltaProduct outperforms DeltaNet in both state-tracking and language modeling, while also showing significantly improved length extrapolation capabilities.
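
The state-transition structure described in this abstract can be written out as follows. This is a sketch under assumed notation: the symbols $H_i$, $\beta_{i,j}$, $k_{i,j}$, and $B_i$ are not defined in the abstract, and the exact form of the input term $B_i$ is left unspecified here.

```latex
% Linear recurrence with an input-dependent transition matrix A_i:
H_i = A_i \, H_{i-1} + B_i

% DeltaNet: a single generalized Householder factor per token
% (identity plus a rank-1 term):
A_i = I - \beta_i \, k_i k_i^{\top}

% DeltaProduct: a product of n_h generalized Householder factors per token:
A_i = \prod_{j=1}^{n_h} \bigl( I - \beta_{i,j} \, k_{i,j} k_{i,j}^{\top} \bigr)
```

Each factor differs from the identity by a rank-1 matrix, so the product differs from the identity by a matrix of rank at most $n_h$, which matches the identity plus rank-$n_h$ structure the abstract refers to; setting $n_h = 1$ recovers the DeltaNet case.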



Article 660

Title@2025-06-08 (7): Unraveling Token Prediction Refinement and Identifying Essential Layers in Language Models

Title: Unraveling Token Prediction Refinement and Identifying Essential Layers in Language Models Unraveling Token Prediction Verfeinerung und Identifizierung wesentlicher Ebenen in Sprachmodellen 在语言模型中统一地改进和确定关键层 2501.15054v2

Authors (1): Jaturong Kongmanee

This research aims to unravel how large language models (LLMs) iteratively refine token predictions through internal processing. We utilized a logit lens technique to analyze the model’s token predictions derived from intermediate representations. Specifically, we focused on (1) how LLMs access and utilize information from input contexts, and (2) how positioning of relevant information affects the model’s token prediction refinement process. On a multi-document question answering task with varying input context lengths, we found that the depth of prediction refinement (defined as the number of intermediate layers an LLM uses to transition from an initial correct token prediction to its final, stable correct output), as a function of the position of relevant information, exhibits an approximately inverted U-shaped curve. We also found that the gap between these two layers, on average, diminishes when relevant information is positioned at the beginning or end of the input context. This suggested that the model requires more refinements when processing longer contexts with relevant information situated in the middle. Furthermore, our findings indicate that not all layers are equally essential for determining final correct outputs. Our analysis provides insights into how token predictions are distributed across different conditions, and establishes important connections to existing hypotheses and previous findings in AI safety research and development.



Article 661

Title@2025-06-08 (7): Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative Languages

Title: Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative Languages Bias Attribution in philippinischen Sprachmodellen: Erweiterung eines Bias Interpretability Metric für die Anwendung auf Agglutinative Sprachen 菲律宾语言模式中的偏见归属:扩大申请隐含语言的偏见解释度量 2506.07249v1

Authors (3): Lance Calvin Lim Gamboa, Yue Feng, Mark Lee

Emerging research on bias attribution and interpretability has revealed how tokens contribute to biased behavior in language models processing English texts. We build on this line of inquiry by adapting the information-theoretic bias attribution score metric for implementation on models handling agglutinative languages, particularly Filipino. We then demonstrate the effectiveness of our adapted method by using it on a purely Filipino model and on three multilingual models: one trained on languages worldwide and two on Southeast Asian data. Our results show that Filipino models are driven towards bias by words pertaining to people, objects, and relationships, entity-based themes that stand in contrast to the action-heavy nature of bias-contributing themes in English (i.e., criminal, sexual, and prosocial behaviors). These findings point to differences in how English and non-English models process inputs linked to sociodemographic groups and bias.



Article 662

Title@2025-06-08 (7): Improving the Efficiency of Long Document Classification using Sentence Ranking Approach

Title: Improving the Efficiency of Long Document Classification using Sentence Ranking Approach Verbesserung der Effizienz der Langdokumentklassifikation mittels Sentence-Ranking-Ansatz 采用判决分级办法提高长文件分类的效率 2506.07248v1

Authors (4): Prathamesh Kokate, Mitali Sarnaik, Manavi Khopade, Raviraj Joshi

Long document classification poses challenges due to the computational limitations of transformer-based models, particularly BERT, which are constrained by fixed input lengths and quadratic attention complexity. Moreover, using the full document for classification is often redundant, as only a subset of sentences typically carries the necessary information. To address this, we propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. Evaluated on the MahaNews LDC dataset of long Marathi news articles, the method consistently outperforms baselines such as first, last, and random sentence selection. With MahaBERT-v2, we achieve near-identical classification accuracy with just a 0.33 percent drop compared to the full-context baseline, while reducing input size by over 50 percent and inference latency by 43 percent. This demonstrates that significant context reduction is possible without sacrificing performance, making the method practical for real-world long document classification tasks.
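
Fixed-count sentence selection by TF-IDF score can be sketched in plain Python. The choices below are assumptions made for illustration: each sentence is treated as its own "document" when computing IDF, tokens are split on whitespace, and the smoothed IDF formula and the name `rank_sentences` are not taken from the paper, whose enhanced scoring also factors in sentence length.

```python
import math
from collections import Counter


def rank_sentences(sentences: list[str], top_k: int) -> list[str]:
    """Keep the top_k sentences by summed TF-IDF, preserving document order."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each token appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = sum(
            (count / len(toks)) * math.log((1 + n) / (1 + df[tok]))
            for tok, count in tf.items()
        )
        scores.append(score)
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]


doc = [
    "the match ended",
    "the the the",
    "the striker scored twice in the final",
]
print(rank_sentences(doc, top_k=2))
```

The selected sentences would then be concatenated and passed to the classifier in place of the full document, which is how the input size and latency reductions reported in the abstract would arise.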



Article 663

Title@2025-06-08 (7): SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes

Title: SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes SDE-SQL: Verbesserung der Text-zu-SQL-Generierung in großen Sprachmodellen durch selbstgesteuerte Exploration mit SQL-Probes SDE-SQL:通过自发探索SQL勘探,在大语言模型中加强从文字到SQL的生成 2506.07245v1

Authors (3): Wenxuan Xie, Yaxun Dai, Wenhao Jiang

Recent advancements in large language models (LLMs) have significantly improved performance on the Text-to-SQL task. However, prior approaches typically rely on static, pre-processed database information provided at inference time, which limits the model’s ability to fully understand the database contents. Without dynamic interaction, LLMs are constrained to fixed, human-provided context and cannot autonomously explore the underlying data. To address this limitation, we propose SDE-SQL, a framework that enables large language models to perform self-driven exploration of databases during inference. This is accomplished by generating and executing SQL probes, which allow the model to actively retrieve information from the database and iteratively update its understanding of the data. Unlike prior methods, SDE-SQL operates in a zero-shot setting, without relying on any question-SQL pairs as in-context demonstrations. When evaluated on the BIRD benchmark with Qwen2.5-72B-Instruct, SDE-SQL achieves an 8.02% relative improvement in execution accuracy over the vanilla Qwen2.5-72B-Instruct baseline, establishing a new state-of-the-art among methods based on open-source models without supervised fine-tuning (SFT) or model ensembling. Moreover, with SFT, the performance of SDE-SQL can be further enhanced, yielding an additional 0.52% improvement.
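
To make the idea of an SQL probe concrete, here is a small sketch using Python's standard `sqlite3` module. The `run_probe` helper, the read-only check, and the toy `orders` table are all hypothetical; the abstract does not describe how SDE-SQL generates or executes its probes.

```python
import sqlite3


def run_probe(conn, sql, max_rows=5):
    """Execute a read-only probe and return a few rows, or the error text."""
    if not sql.lstrip().lower().startswith("select"):
        return {"ok": False, "error": "only SELECT probes are allowed"}
    try:
        rows = conn.execute(sql).fetchmany(max_rows)
        return {"ok": True, "rows": rows}
    except sqlite3.Error as exc:
        # The error message is itself useful feedback for the next attempt.
        return {"ok": False, "error": str(exc)}


# Toy database standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "Shipped"), (2, "shipped"), (3, "PENDING")],
)

# A probe a model might issue to learn how `status` values are actually spelled
# before writing the final query.
observation = run_probe(conn, "SELECT DISTINCT status FROM orders ORDER BY status")
print(observation)
```

In an exploration loop, each observation (rows or error text) would be appended to the model's context before it proposes the next probe or the final SQL.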



Article 664

Title@2025-06-08 (7): Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification

Title: Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification Multi-Step Visual Reasoning mit Visual Tokens Skalierung und Verifizierung 具有视觉肯量度和核查的多脚视觉理性 2506.07235v1

Authors (10): Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, Wentao Zhang

Multi-modal large language models (MLLMs) have achieved remarkable capabilities by integrating visual perception with language understanding, enabling applications such as image-grounded dialogue, visual question answering, and scientific analysis. However, most MLLMs adopt a static inference paradigm, encoding the entire image into fixed visual tokens upfront, which limits their ability to iteratively refine understanding or adapt to context during inference. This contrasts sharply with human perception, which is dynamic, selective, and feedback-driven. In this work, we introduce a novel framework for inference-time visual token scaling that enables MLLMs to perform iterative, verifier-guided reasoning over visual content. We formulate the problem as a Markov Decision Process involving a reasoner that proposes visual actions and a verifier, trained via multi-step Direct Preference Optimization (DPO), that evaluates these actions and determines when reasoning should terminate. To support this, we present a new dataset, VTS, comprising supervised reasoning trajectories (VTS-SFT) and preference-labeled reasoning comparisons (VTS-DPO). Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks, offering not only improved accuracy but also more interpretable and grounded reasoning processes. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.



Article 665

Title@2025-06-08 (7): Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding

Title: Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding Objekt-Halluzination in großen Audio-Sprachen-Modellen mittels Audio-Aware-Dekodierung reduzieren 通过音频软件解码减少大型音频语言模型中的天体幻觉 2506.07233v1

Authors (4): Tzu-wen Hsu, Ke-Han Lu, Cheng-Han Chiang, Hung-yi Lee

Large Audio-Language Models (LALMs) can take audio and text as the inputs and answer questions about the audio. While prior LALMs have shown strong performance on standard benchmarks, there has been alarming evidence that LALMs can hallucinate what is presented in the audio. To mitigate the hallucination of LALMs, we introduce Audio-Aware Decoding (AAD), a lightweight inference-time strategy that uses contrastive decoding to compare the token prediction logits with and without the audio context. By contrastive decoding, AAD promotes the tokens whose probability increases when the audio is present. We conduct our experiment on object hallucination datasets with three LALMs and show that AAD improves the F1 score by 0.046 to 0.428. We also show that AAD can improve the accuracy on general audio QA datasets like Clotho-AQA by 5.4% to 10.3%. We conduct thorough ablation studies to understand the effectiveness of each component in AAD.
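
The contrast between predictions with and without audio can be sketched as follows. This uses one common contrastive-decoding form, `lp_with + alpha * (lp_with - lp_without)`. The exact formula and the `alpha` weight are assumptions, since the abstract does not specify AAD's formulation.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(logits_with_audio, logits_without_audio, alpha=1.0):
    """Boost tokens whose log-probability rises when the audio is present.

    `alpha` is a hypothetical contrast strength, not a value from the paper.
    """
    lp_with = log_softmax(logits_with_audio)
    lp_without = log_softmax(logits_without_audio)
    return [w + alpha * (w - wo) for w, wo in zip(lp_with, lp_without)]


# Toy 3-token vocabulary: ["yes", "no", "maybe"].
with_audio = [2.0, 1.9, 0.0]
without_audio = [3.0, 1.0, 0.0]  # the text-only prior strongly favors "yes"

scores = contrastive_scores(with_audio, without_audio)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case, greedy decoding on `with_audio` alone picks "yes", while the contrastive scores pick "no", the token whose probability rose most once the audio was added.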



Article 666

Title@2025-06-08 (7): Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration

Title: Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration Effiziente Vorauswahl von Sprachmodellen über Multi-Actor-Kollaboration 通过多方行为者协作为语言模式选择高效培训前数据 2410.08102v3

Authors (12): Tianyi Bai, Ling Yang, Zhen Hao Wong, Fupeng Sun, Jiahui Peng, Xinlin Zhuang, Chi Zhang, Lijun Wu, Jiantao Qiu, Wentao Zhang, Binhang Yuan, Conghui He

Efficient data selection is crucial to accelerate the pretraining of language models (LMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches to achieve optimal data selection for LM pretraining. To tackle this problem, we propose a multi-actor collaborative data selection mechanism: each data selection method independently prioritizes data based on its criterion and updates its prioritization rules using the current state of the model, functioning as an independent actor for data selection; and a console is designed to adjust the impacts of different actors at various stages and dynamically integrate information from all actors throughout the LM pretraining process. We conduct extensive empirical studies to evaluate our multi-actor framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LM pretraining, and achieves an average relative performance gain of up to 10.5% across multiple language model benchmarks compared to the state-of-the-art methods.



Article 667

Title@2025-06-08 (7): Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning

Title: Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning Halluzination auf einen Blick: Kontrollierte visuelle Schnitte und feinkörniges multimodales Lernen 《一眼幻觉:受控的视觉编辑和优美的多模式学习》 2506.07227v1

Authors (10): Tianyi Bai, Yuxuan Fan, Jiantao Qiu, Fupeng Sun, Jiayi Song, Junlin Han, Zichen Liu, Conghui He, Wentao Zhang, Binhang Yuan

Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks but still struggle with fine-grained visual differences, leading to hallucinations or missed semantic shifts. We attribute this to limitations in both training data and learning objectives. To address these issues, we propose a controlled data generation pipeline that produces minimally edited image pairs with semantically aligned captions. Using this pipeline, we construct the Micro Edit Dataset (MED), containing over 50K image-text pairs spanning 11 fine-grained edit categories, including attribute, count, position, and object presence changes. Building on MED, we introduce a supervised fine-tuning (SFT) framework with a feature-level consistency loss that promotes stable visual embeddings under small edits. We evaluate our approach on the Micro Edit Detection benchmark, which includes carefully balanced evaluation pairs designed to test sensitivity to subtle visual variations across the same edit categories. Our method improves difference detection accuracy and reduces hallucinations compared to strong baselines, including GPT-4o. Moreover, it yields consistent gains on standard vision-language tasks such as image captioning and visual question answering. These results demonstrate the effectiveness of combining targeted data and alignment objectives for enhancing fine-grained visual reasoning in MLLMs.



Article 668

Title@2025-06-08 (7): DeepRAG: Thinking to Retrieve Step by Step for Large Language Models

Title: DeepRAG: Thinking to Retrieve Step by Step for Large Language Models DeepRAG: Schritt für Schritt für Schritt für große Sprachmodelle nachzudenken 深RAG:考虑逐步获取大语言模型 2502.01142v2

Authors (9): Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Jie Zhou

Large Language Models (LLMs) have shown remarkable reasoning capabilities, while their practical applications are limited by severe factual hallucinations due to limitations in the timeliness, accuracy, and comprehensiveness of their parametric knowledge. Meanwhile, enhancing retrieval-augmented generation (RAG) with reasoning remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling reasonable and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency and boosts answer accuracy by 26.4%, demonstrating its effectiveness in enhancing retrieval-augmented reasoning.



Article 669

Title@2025-06-08 (7): Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

Title: Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts Diversifizierung des Expertenwissens für Task-Agnostic Pruning in Sparse Mixture-of-Experts 使专家在Sparse Mixture 专家中的任务不可知性调节专家知识多样化 2407.09590v4

Authors (5): Zeliang Zhang, Xiaodong Liu, Hao Cheng, Chenliang Xu, Jianfeng Gao

By increasing model parameters but activating them sparsely when performing a task, the use of Mixture-of-Experts (MoE) architecture significantly improves the performance of Large Language Models (LLMs) without increasing the inference cost. However, the memory consumption due to the growing number of experts presents a challenge to the deployment of these models in many real-world settings. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method of grouping and pruning similar experts to improve the model’s parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures, including Mixtral, Deepseek-MoE, and Qwen. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks. We will release our code to facilitate future research.
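
As a rough illustration of grouping and pruning similar experts, the sketch below groups experts greedily by cosine similarity of their flattened weights and keeps one representative per group. The similarity measure, the greedy strategy, and the `threshold` value are all assumptions; the abstract does not say how the authors measure similarity or merge groups.

```python
import math


def cosine(a, b):
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_experts(weights, threshold=0.95):
    """Assign each expert to the first group whose representative is similar enough.

    `weights` holds one flattened weight vector per expert; `threshold` is a
    hypothetical cut-off chosen for this toy example.
    """
    groups = []  # each group is a list of expert indices; group[0] is its representative
    for i, w in enumerate(weights):
        for group in groups:
            if cosine(weights[group[0]], w) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Four toy experts: 0 and 1 are near-duplicates, as are 2 and 3.
toy_weights = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
groups = group_experts(toy_weights)
kept = [group[0] for group in groups]  # prune everything except representatives
```

After pruning, the router would also need to be remapped so that tokens previously sent to a removed expert go to its group's representative.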



Article 670

Title@2025-06-08 (7): Value Residual Learning

Title: Value Residual Learning Wertorientiertes Lernen 残余价值学习 2410.17897v5

Authors (5): Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Fares Obeid, Zhenzhong Lan

While Transformer models have achieved remarkable success in various domains, the effectiveness of information propagation through deep networks remains a critical challenge. Standard hidden state residuals often fail to adequately preserve initial token-level information in deeper layers. This paper introduces ResFormer, a novel architecture that enhances information flow by incorporating value residual connections in addition to hidden state residuals. A variant, SVFormer, has all layers share the first layer’s value embedding. Comprehensive empirical evidence demonstrates that ResFormer achieves equivalent validation loss with 16.11% fewer model parameters and 20.3% less training data compared to Transformer, while maintaining similar memory usage and computational cost. In addition, SVFormer reduces KV cache size by nearly half with only a small performance penalty and can be integrated with other KV-efficient methods, yielding further reductions in KV cache, with performance influenced by sequence length and cumulative learning rate.



Article 671

Title@2025-06-08 (7): Gumbel Reranking: Differentiable End-to-End Reranker Optimization

Title: Gumbel Reranking: Differentiable End-to-End Reranker Optimization Gumbel Reranking: Differentielle End-to-End-Reranker-Optimierung Gumbel 中继等级: 可区别的 端到端到端再排序器优化 2502.11116v2

Authors (8): Siyuan Huang, Zhiyuan Ma, Jintao Du, Changhua Meng, Weiqiang Wang, Jingwen Leng, Minyi Guo, Zhouhan Lin

RAG systems rely on rerankers to identify relevant documents. However, fine-tuning these models remains challenging due to the scarcity of annotated query-document pairs. Existing distillation-based approaches suffer from training-inference misalignment and fail to capture interdependencies among candidate documents. To overcome these limitations, we reframe the reranking process as an attention-mask problem and propose Gumbel Reranking, an end-to-end training framework for rerankers aimed at minimizing the training-inference gap. In our approach, reranker optimization is reformulated as learning a stochastic, document-wise Top-$k$ attention mask using the Gumbel Trick and Relaxed Top-$k$ Sampling. This formulation enables end-to-end optimization by minimizing the overall language loss. Experiments across various settings consistently demonstrate performance gains, including a 10.4% improvement in recall on HotpotQA for distinguishing indirectly relevant documents.
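
The Gumbel trick mentioned above can be illustrated with plain Gumbel top-$k$ sampling: adding independent Gumbel(0, 1) noise to log-scores and keeping the $k$ largest draws $k$ distinct items. This sketch shows only the hard, non-differentiable sampling step. The relaxed Top-$k$ sampling the paper uses for end-to-end training is not reproduced here, and the function name and toy scores are hypothetical.

```python
import math
import random


def gumbel_top_k(log_scores, k, rng):
    """Sample k distinct indices by perturbing log-scores with Gumbel(0, 1) noise."""
    perturbed = []
    for i, s in enumerate(log_scores):
        u = rng.random()
        # Clamp away from 0 so the double logarithm stays finite.
        g = -math.log(-math.log(max(u, 1e-12)))
        perturbed.append((s + g, i))
    perturbed.sort(reverse=True)
    return [i for _, i in perturbed[:k]]


rng = random.Random(0)
reranker_log_scores = [0.0, -1.0, -2.0, -50.0]  # toy scores for 4 candidate documents
chosen = gumbel_top_k(reranker_log_scores, k=2, rng=rng)

# Document-wise attention mask: 1 keeps a document visible to the reader, 0 hides it.
mask = [1 if i in chosen else 0 for i in range(len(reranker_log_scores))]
```

Because the `sort` and slice are not differentiable, a trainable version would replace them with a continuous relaxation so that gradients from the language loss can reach the reranker scores.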



Article 672

Title@2025-06-08 (7): SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning

Title: SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning SAP-Bench: Benchmarking multimodaler Großsprachenmodelle in der operativen Aktionsplanung SAP-Bench:在外科行动规划中确定多式大语言模式基准 2506.07196v1

Authors (6): Mengya Xu, Zhongzhen Huang, Dillan Imans, Yiru Ye, Xiaofan Zhang, Qi Dou

Effective evaluation is critical for driving advancements in MLLM research. The surgical action planning (SAP) task, which aims to generate future action sequences from visual inputs, demands precise and sophisticated analytical capabilities. Unlike mathematical reasoning, surgical decision-making operates in life-critical domains and requires meticulous, verifiable processes to ensure reliability and patient safety. This task demands the ability to distinguish between atomic visual actions and coordinate complex, long-horizon procedures, capabilities that are inadequately evaluated by current benchmarks. To address this gap, we introduce SAP-Bench, a large-scale, high-quality dataset designed to enable multimodal large language models (MLLMs) to perform interpretable surgical action planning. SAP-Bench is derived from cholecystectomy procedures with a mean duration of 1137.5s and introduces temporally grounded surgical action annotations, comprising 1,226 clinically validated action clips (mean duration: 68.7s) that capture five fundamental surgical actions across 74 procedures. The dataset provides 1,152 strategically sampled current frames, each paired with the corresponding next action as multimodal analysis anchors. We propose the MLLM-SAP framework, which leverages MLLMs to generate next action recommendations from the current surgical scene and natural language instructions, enhanced with injected surgical domain knowledge. To assess our dataset’s effectiveness and the broader capabilities of current models, we evaluate seven state-of-the-art MLLMs (OpenAI-o1, GPT-4o, QwenVL2.5-72B, Claude-3.5-Sonnet, GeminiPro2.5, Step-1o, and GLM-4v) and reveal critical gaps in next action prediction performance.



Article 673

Title@2025-06-08 (7): MMTEB: Massive Multilingual Text Embedding Benchmark

Title: MMTEB: Massive Multilingual Text Embedding Benchmark MMTEB: Massive mehrsprachige Text-Einbettung Benchmark MMTEB: 大规模多语言文本嵌入基准 2502.13595v3

Authors (86): Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poświata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Mariya Hendriksen, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Šuppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lù, Jordan Clive, Gayatri Krishnakumar, Anna Maksimova, Silvan Wehrli, Maria Tikhonova, Henil Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Safi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Hongjin Su, Jimmy Lin, Howard Yen, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, Niklas Muennighoff

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.



Article 674

Title@2025-06-08 (7): Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images

Title: Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images Mildernde Verhaltenshalluzination in multimodalen großen Sprachmodellen für Sequenzbilder 序列图像多式大语言模型中减轻行为幻觉行为 2506.07184v1

Authors (6): Liangliang You, Junchi Yao, Shu Yang, Guimin Hu, Lijie Hu, Di Wang

While multimodal large language models excel at various tasks, they still suffer from hallucinations, which limit their reliability and scalability for broader domain applications. To address this issue, recent research mainly focuses on objective hallucination. However, for sequential images, besides objective hallucination, there is also behavioral hallucination, which is less studied. This work aims to fill this gap. We first reveal that behavioral hallucinations mainly arise from two key factors: prior-driven bias and the snowball effect. Based on these observations, we introduce SHE (Sequence Hallucination Eradication), a lightweight, two-stage framework that (1) detects hallucinations via a visual-textual alignment check using our proposed adaptive temporal window and (2) mitigates them via orthogonal projection onto the joint embedding space. We also propose a new metric (BEACH) to quantify behavioral hallucination severity. Empirical results on standard benchmarks demonstrate that SHE reduces behavioral hallucination by over 10% on BEACH while maintaining descriptive accuracy.



Article 675

Title@2025-06-08 (7): Epistemic Integrity in Large Language Models

Title: Epistemic Integrity in Large Language Models Epistemische Integrität in großen Sprachmodellen 大语言模型中的集成完整性 2411.06528v2

Authors (11): Bijean Ghafouri, Shahrad Mohammadzadeh, James Zhou, Pratheeksha Nair, Jacob-Junqi Tian, Hikaru Tsujimura, Mayank Goel, Sukanya Krishna, Reihaneh Rabbany, Jean-François Godbout, Kellin Pelrine

Large language models are increasingly relied upon as sources of information, but their propensity for generating false or misleading statements with high confidence poses risks for users and society. In this paper, we confront the critical problem of epistemic miscalibration, where a model’s linguistic assertiveness fails to reflect its true internal certainty. We introduce a new human-labeled dataset and a novel method for measuring the linguistic assertiveness of Large Language Models (LLMs), which cuts error rates by over 50% relative to previous benchmarks. Validated across multiple datasets, our method reveals a stark misalignment between how confidently models linguistically present information and their actual accuracy. Further human evaluations confirm the severity of this miscalibration. This evidence underscores the urgent risk of the overstated certainty LLMs hold, which may mislead users on a massive scale. Our framework provides a crucial step forward in diagnosing this miscalibration, offering a path towards correcting it and towards more trustworthy AI across domains.



Article 676

Title@2025-06-08 (7): Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

Title: Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs Flattery in Motion: Benchmarking und Analyse der Sykopanz in Video-LLMs 电影中的发辣:视频-LLMs中的基准设定和分析 2506.07180v1

Authors (6): Wenrui Zhou, Shu Yang, Qingsong Yang, Zikun Guo, Lijie Hu, Di Wang

As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first dedicated benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE pioneeringly brings linguistic perspectives on sycophancy into the visual domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. In addition, we explore key-frame selection as an interpretable, training-free mitigation strategy, which reveals potential paths for reducing sycophantic bias by strengthening visual grounding.



Article 677

Title@2025-06-08 (7): Understanding and Mitigating Cross-lingual Privacy Leakage via Language-specific and Universal Privacy Neurons

Title: Understanding and Mitigating Cross-lingual Privacy Leakage via Language-specific and Universal Privacy Neurons Verständnis und Abmilderung von Cross-lingual Privacy Leakage über sprachspezifische und Universal Privacy Neuronen 通过特定语言和通用隐私中世纪理解和减少跨语言隐私疏漏 2506.00759v2

Authors (8): Wenshuo Dong, Qingsong Yang, Shu Yang, Lijie Hu, Meng Ding, Wanyu Lin, Tianhang Zheng, Di Wang

Large Language Models (LLMs) trained on massive data capture rich information embedded in the training data. However, this also introduces the risk of privacy leakage, particularly involving personally identifiable information (PII). Although previous studies have shown that this risk can be mitigated through methods such as privacy neurons, they all assume that both the (sensitive) training data and user queries are in English. We show that they cannot defend against privacy leakage in cross-lingual contexts: even if the training data is exclusively in one language, these (private) models may still reveal private information when queried in another language. In this work, we first investigate the information flow of cross-lingual privacy leakage to better understand it. We find that LLMs process private information in the middle layers, where representations are largely shared across languages. The risk of leakage peaks when these representations are converted to a language-specific space in later layers. Based on this, we identify privacy-universal neurons and language-specific privacy neurons. Privacy-universal neurons influence privacy leakage across all languages, while language-specific privacy neurons are only related to specific languages. By deactivating these neurons, the cross-lingual privacy leakage risk is reduced by 23.3%-31.6%.



Article 678

Title@2025-06-08 (7): Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

Title: Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering Rechte Antwort, falsches Ergebnis: Enthüllen der Inkonsistenzen der LLM-Bewertung in Multiple-Choice-Fragebeantwortung 正确答案, 错误分数: 解开多选项问题解答中LLM评价的不一致之处 2503.14996v2

Authors (6): Francesco Maria Molfese, Luca Moroni, Luca Gioffré, Alessandro Scirè, Simone Conia, Roberto Navigli

One of the most widely used tasks for evaluating Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to assess, as the model’s answer is thought to be simple to extract and is compared directly to a set of predefined choices. However, recent studies have started to question the reliability of MCQA evaluation, showing that multiple factors can significantly impact the reported performance of LLMs, especially when the model generates free-form text before selecting one of the answer choices. In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons. We systematically analyze whether existing answer extraction methods are aligned with human judgment, and how they are influenced by answer constraints in the prompt across different domains. Our experiments demonstrate that traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, we reveal a fundamental trade-off between including format constraints in the prompt to simplify answer extraction and allowing models to generate free-form text to improve reasoning. Our findings call for standardized evaluation methodologies and highlight the need for more reliable and consistent MCQA evaluation practices.
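
To illustrate why rule-based answer extraction can underestimate a model, here is a small regex-based extractor. The two patterns are hypothetical examples of the kind of rules an evaluation script might use; they are not taken from the paper or from any specific evaluation harness.

```python
import re

# Hypothetical patterns, tried in order. Only the word "answer" is matched
# case-insensitively; the option letter itself must be an uppercase A-D.
PATTERNS = [
    re.compile(r"(?i:answer(?:\s+is)?)\s*:?\s*\(?([A-D])\b"),
    re.compile(r"^\s*\(?([A-D])\)?[.:)]?\s*$", re.MULTILINE),
]


def extract_choice(response):
    """Return the selected option letter, or None if no pattern matches."""
    for pattern in PATTERNS:
        matches = pattern.findall(response)
        if matches:
            return matches[-1]  # prefer the last stated answer
    return None


print(extract_choice("The answer is (B)."))
print(extract_choice("I think Paris, which is option b, fits best."))
```

The second response names an option in free-form text, yet the extractor returns `None`, so a scorer built on these rules would mark it wrong regardless of whether option b is correct.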



Article 679

Title@2025-06-08 (7): PECAN: LLM-Guided Dynamic Progress Control with Attention-Guided Hierarchical Weighted Graph for Long-Document QA

Title: PECAN: LLM-Guided Dynamic Progress Control with Attention-Guided Hierarchical Weighted Graph for Long-Document QA PECAN: LLM-geführte dynamische Fortschrittskontrolle mit aufmerksamkeitsgeführtem Hierarchischen Gewichtungsdiagramm für Langdokument QA PECAN: LLM-指导的长文档质量评估动态进展控制,附有引人指导的纵向分级加权图 2410.04790v2

Authors (4): Xinyu Wang, Yanzheng Xiang, Lin Gui, Yulan He

Long-document QA presents challenges with large-scale text and long-distance dependencies. Recent advances in Large Language Models (LLMs) enable entire documents to be processed in a single pass. However, their computational cost is high. Retrieval-Augmented Generation (RAG) methods split text into smaller chunks, but they often yield inferior results and may lose global context. Recent approaches that integrate LLMs into RAG via iterative summarization either underutilize LLM capabilities or still incur high computational costs. In this paper, we combine the high accuracy of LLMs with the efficiency of RAG and propose LLM-Guided Dynamic Progress Control with Attention-Based Hierarchical Weighted Graph (PECAN). Our method introduces two key improvements: (1) LLM-Guided Dynamic Progress Control: We leverage LLMs to dynamically control the retrieval process, adjusting the amount of retrieved information based on different queries to achieve a better balance of effectiveness and efficiency. (2) Attention-Guided Retrieval: We propose a novel retrieval method that constructs a hierarchical graph where edges are derived from LLM attention weights. Experimental results demonstrate that PECAN achieves LLM-level performance while maintaining computational complexity comparable to that of RAG methods on two single-document and two multi-document QA datasets.



Article 680

Title@2025-06-08 (7): RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality

Title: RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality RULE: Verstärkung des Unerfahrenen erreicht Vergessen-Behalten Pareto-Optimalität RULE: 强化不放功率达到遗忘-保留最佳程度 2506.07171v1

Authors (8): Chenlong Zhang, Zhuoran Jin, Hongbang Yuan, Jiaheng Wei, Tong Zhou, Kang Liu, Jun Zhao, Yubo Chen

The widespread deployment of Large Language Models (LLMs) trained on massive, uncurated corpora has raised growing concerns about the inclusion of sensitive, copyrighted, or illegal content. This has led to increasing interest in LLM unlearning: the task of selectively removing specific information from a model without retraining from scratch or degrading overall utility. However, existing methods often rely on large-scale forget and retain datasets, and suffer from unnatural responses, poor generalization, or catastrophic utility loss. In this work, we propose Reinforcement UnLearning (RULE), an efficient framework that formulates unlearning as a refusal boundary optimization problem. RULE is trained with a small portion of the forget set and synthesized boundary queries, using a verifiable reward function that encourages safe refusal on forget-related queries while preserving helpful responses on permissible inputs. We provide both theoretical and empirical evidence demonstrating the effectiveness of RULE in achieving targeted unlearning without compromising model utility. Experimental results show that, with only 12% of the forget set and 8% synthesized boundary data, RULE outperforms existing baselines by up to 17.5% in forget quality and 16.3% in response naturalness while maintaining general utility, achieving forget-retain Pareto optimality. Remarkably, we further observe that RULE improves the naturalness of model outputs, enhances training efficiency, and exhibits strong generalization ability, generalizing refusal behavior to semantically related but unseen queries.



Article 681

Title@2025-06-08 (7): CTDGSI: A comprehensive exploitation of instance selection methods for automatic text classification. VII Concurso de Teses, Dissertações e Trabalhos de Graduação em SI – XXI Simpósio Brasileiro de Sistemas de Informação

Title: CTDGSI: A comprehensive exploitation of instance selection methods for automatic text classification. VII Concurso de Teses, Dissertações e Trabalhos de Graduação em SI – XXI Simpósio Brasileiro de Sistemas de Informação CTDGSI: Eine umfassende Nutzung von Beispielauswahlmethoden für die automatische Textklassifizierung. VII Concurso de Tesses, Dissertações e Trabalhos de Graduação em SI – XXI Simpósio Brasileiro de Sistemas de Informação CTDGSI: 综合利用自动文本分类实例选择方法。 VII 技术、技术、技术、技术、技术、工业、工业、技术、工业、工业、工业、工业、工业和工业 – – 二十一世纪巴西信息统计系统 2506.07169v1

Authors (3): Washington Cunha, Leonardo Rocha, Marcos André Gonçalves

Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power and more complexity, best exemplified by Large Language Models. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. This Ph.D. dissertation focuses on an under-investigated NLP data engineering technique known as Instance Selection (IS), whose potential is enormous in the current scenario. The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining the effectiveness of the trained models and reducing the cost of the training process. We provide a comprehensive and scientifically sound comparison of IS methods applied to an essential NLP task, Automatic Text Classification (ATC), considering several classification solutions and many datasets. Our findings reveal a significant untapped potential for IS solutions. We also propose two novel IS solutions that are noise-oriented and redundancy-aware, specifically designed for large datasets and transformer architectures. Our final solution achieved an average reduction of 41% in training sets, while maintaining the same levels of effectiveness in all datasets. Importantly, our solutions demonstrated speedup improvements of 1.67x (up to 2.46x), making them scalable for datasets with hundreds of thousands of documents.



Article 682

Title@2025-06-08 (7): Efficient Text-Attributed Graph Learning through Selective Annotation and Graph Alignment

Title: Efficient Text-Attributed Graph Learning through Selective Annotation and Graph Alignment Effizientes Text-Attributes Graphenlernen durch selektive Annotation und Graphenausrichtung 通过选择性注释和图表对齐,通过选择性注释和图表对齐,高效率的文本定位图表学习 2506.07168v1

Authors (9): Huanyi Xie, Lijie Hu, Lu Yu, Tianhao Huang, Longfei Li, Meng Li, Jun Zhou, Huan Wang, Di Wang

In the realm of Text-attributed Graphs (TAGs), traditional graph neural networks (GNNs) often fall short due to the complex textual information associated with each node. Recent methods have improved node representations by leveraging large language models (LLMs) to enhance node text features, but these approaches typically require extensive annotations or fine-tuning across all nodes, which is both time-consuming and costly. To overcome these challenges, we introduce GAGA, an efficient framework for TAG representation learning. GAGA reduces annotation time and cost by focusing on annotating only representative nodes and edges. It constructs an annotation graph that captures the topological relationships among these annotations. Furthermore, GAGA employs a two-level alignment module to effectively integrate the annotation graph with the TAG, aligning their underlying structures. Experiments show that GAGA achieves classification accuracies on par with or surpassing state-of-the-art methods while requiring only 1% of the data to be annotated, demonstrating its high efficiency.



Article 683

Title@2025-06-08 (7): GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization

Title: GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization GeometrieZero: Verbesserung der Geometrie-Lösung für LLM mit Gruppen-Kontrast-Policy-Optimierung 几何零:改进与集团反竞争政策优化相结合的LLM的几何解决办法 2506.07160v1

Authors (7): Yikun Wang, Yibin Wang, Dianyi Wang, Zimian Peng, Qipeng Guo, Dacheng Tao, Jiaqi Wang

Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in mathematical reasoning. Within this area, geometry problem solving remains challenging, and auxiliary construction plays an essential role in it. Existing approaches either achieve suboptimal performance or rely on massive LLMs (e.g., GPT-4o), incurring massive computational costs. We posit that reinforcement learning with verifiable reward (e.g., GRPO) offers a promising direction for training smaller models that effectively combine auxiliary construction with robust geometric reasoning. However, directly applying GRPO to geometric reasoning presents fundamental limitations due to its dependence on unconditional rewards, which leads to indiscriminate and counterproductive auxiliary constructions. To address these challenges, we propose Group Contrastive Policy Optimization (GCPO), a novel reinforcement learning framework featuring two key innovations: (1) Group Contrastive Masking, which adaptively provides positive or negative reward signals for auxiliary construction based on contextual utility, and (2) a length reward that promotes longer reasoning chains. Building on GCPO, we develop GeometryZero, a family of affordable-size geometric reasoning models that judiciously determine when to employ auxiliary construction. Our extensive empirical evaluation across popular geometric benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models consistently outperform baselines (e.g., GRPO), achieving an average improvement of 4.29% across all benchmarks.



Article 684

Title@2025-06-08 (7): ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models

Title: ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models ExLM: Nachdenken über die Auswirkungen von [MASK] Token in maskierten Sprachmodellen EXLM:重新思考[MASK]语调模式中[MASK]语调的影响 2501.13397v5

Authors (8): Kangjie Zheng, Junwei Yang, Siyue Liang, Bin Feng, Zequn Liu, Wei Ju, Zhiping Xiao, Ming Zhang

Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly masking portions of the input sequences with [MASK] tokens and learning to reconstruct the original content based on the remaining context. This paper explores the impact of [MASK] tokens on MLMs. Analytical studies show that masking tokens can introduce the corrupted semantics problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands [MASK] tokens in the input context and models the dependencies between these expanded states. This enhancement increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enriches semantic representations through context enhancement, and effectively reduces the semantic multimodality commonly observed in MLMs.



Article 685

Title@2025-06-08 (7): Syntactic Control of Language Models by Posterior Inference

Title: Syntactic Control of Language Models by Posterior Inference Syntaktische Kontrolle von Sprachmodellen durch hintere Inferenz 由Posideer 推断对语言模型的协同控制 2506.07154v1

Authors (4): Vicky Xefteri, Tim Vieira, Ryan Cotterell, Afra Amini

Controlling the syntactic structure of text generated by language models is valuable for applications requiring clarity, stylistic consistency, or interpretability, yet it remains a challenging task. In this paper, we argue that sampling algorithms based on posterior inference can effectively enforce a target constituency structure during generation. Our approach combines sequential Monte Carlo, which estimates the posterior distribution by sampling from a proposal distribution, with a syntactic tagger that ensures that each generated token aligns with the desired syntactic structure. Our experiments with GPT2 and Llama3-8B models show that with an appropriate proposal distribution, we can improve syntactic accuracy, increasing the F1 score from 12.31 (GPT2-large) and 35.33 (Llama3-8B) to about 93 in both cases without compromising the language model's fluency. These results underscore both the complexity of syntactic control and the effectiveness of sampling algorithms, offering a promising approach for applications where precise control over syntax is essential.



Article 686

Title@2025-06-08 (7): Comba: Improving Bilinear RNNs with Closed-loop Control

Title: Comba: Improving Bilinear RNNs with Closed-loop Control Comba: Bilineare RNNs mit Closed-Loop-Steuerung verbessern Comba: 改进有闭环控制的双线区域网网 2506.02475v2

Authors (8): Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, Weigao Sun

Recent efficient sequence modeling methods such as Gated DeltaNet, TTT, and RWKV-7 have achieved performance improvements by supervising the recurrent memory management through the Delta learning rule. Unlike previous state-space models (e.g., Mamba) and gated linear attention models (e.g., GLA), these models introduce interactions between the recurrent state and the key vector, structurally resembling bilinear systems. In this paper, we first introduce the concept of Bilinear RNNs with a comprehensive analysis of the advantages and limitations of these models. Then, based on closed-loop control theory, we propose a novel Bilinear RNN variant named Comba, which adopts a scalar-plus-low-rank state transition, with both state feedback and output feedback corrections. We also implement a hardware-efficient chunk-wise parallel kernel in Triton and train models with 340M/1.3B parameters on a large-scale corpus. Comba demonstrates superior performance and computation efficiency in both language and vision modeling.
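To make the "interaction between the recurrent state and the key vector" concrete, here is a minimal pure-Python sketch of the plain Delta learning rule, S <- S + beta * (v - S k) k^T. It is not Comba itself: the scalar-plus-low-rank transition and the state and output feedback corrections are not reproduced, and the function name `delta_rule_step` and the toy vectors are illustrative assumptions.

```python
def delta_rule_step(state, key, value, beta):
    """One Delta-rule update of a (d_v x d_k) memory: S <- S + beta * (v - S k) k^T."""
    # Read what the memory currently returns for this key.
    predicted = [sum(s * k for s, k in zip(row, key)) for row in state]
    # Move the memory toward the target value along the key direction.
    return [
        [s + beta * (v - p) * k for s, k in zip(row, key)]
        for row, v, p in zip(state, value, predicted)
    ]


if __name__ == "__main__":
    state = [[0.0, 0.0], [0.0, 0.0]]
    key = [0.6, 0.8]          # unit-norm key
    value = [1.0, -2.0]
    state = delta_rule_step(state, key, value, beta=1.0)
    readout = [sum(s * k for s, k in zip(row, key)) for row in state]
    print(readout)            # close to the stored value for a unit-norm key
```

Because the correction term multiplies the state by the key, the update is bilinear in state and input, which is the structural property the abstract points to.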



Article 687

Title@2025-06-08 (7): Semantic-preserved Augmentation with Confidence-weighted Fine-tuning for Aspect Category Sentiment Analysis

Title: Semantic-preserved Augmentation with Confidence-weighted Fine-tuning for Aspect Category Sentiment Analysis Semantisch-konservierte Augmentation mit vertrauensgewichtiger Feinabstimmung für Aspect Category Sentiment Analysis 带有信任加权微调的语义保有增强值,用于隐形类别情感分析 2506.07148v1

Authors (3): Yaping Chai, Haoran Xie, Joe S. Qin

Large language models (LLMs) are an effective approach to addressing data scarcity in low-resource scenarios. Recent research designs hand-crafted prompts to guide LLMs for data augmentation. We introduce a data augmentation strategy for the aspect category sentiment analysis (ACSA) task that preserves the original sentence semantics while adding linguistic diversity, specifically by providing a structured prompt template for an LLM to generate predefined content. In addition, we employ a post-processing technique to further ensure semantic consistency between the generated sentence and the original sentence. The augmented data increases the semantic coverage of the training distribution, enabling the model to better understand the relationship between aspect categories and sentiment polarities and enhancing its inference capabilities. Furthermore, we propose a confidence-weighted fine-tuning strategy to encourage the model to generate more confident and accurate sentiment polarity predictions. Compared with strong recent baselines, our method consistently achieves the best performance on four benchmark datasets.



Article 688

Title@2025-06-08 (7): How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis

Title: How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis Wie reduziert DPO die Toxizität? Eine mechanistische Neuronen-Level-Analyse DPO 如何减少毒性? 2411.06424v3

Authors (5): Yushi Yang, Filip Sondej, Harry Mayne, Andrew Lee, Adam Mahdi

Safety fine-tuning algorithms reduce harmful outputs in language models, yet their mechanisms remain under-explored. Direct Preference Optimization (DPO) is a popular choice of algorithm, but prior explanations, attributing its effects solely to dampened toxic neurons in the MLP layers, are incomplete. In this study, we analyse four language models (Llama-3.1-8B, Gemma-2-2B, Mistral-7B, GPT-2-Medium) and show that toxic neurons only account for 2.5% to 24% of DPO’s effects across models. Instead, DPO balances distributed activation shifts across all MLP neurons to create a net toxicity reduction. We attribute this reduction to four neuron groups, two aligned with reducing toxicity and two promoting anti-toxicity, whose combined effects replicate DPO across models. To further validate this understanding, we develop an activation editing method mimicking DPO through distributed shifts along a toxicity representation. This method outperforms DPO in reducing toxicity while preserving perplexity, without requiring any weight updates. Our work provides a mechanistic understanding of DPO and introduces an efficient, tuning-free alternative for safety fine-tuning.



Article 689

Title@2025-06-08 (7): Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting

Title: Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting Prompting Science Report 2: Der abnehmende Wert der Kette des Denkens in Prompting 快速科学报告2:激励过程中思维链价值的下降 2506.07142v1

Authors (4): Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro

This is the second in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate Chain-of-Thought (CoT) prompting, a technique that encourages a large language model (LLM) to "think step by step" (Wei et al., 2022). CoT is a widely adopted method for improving performance on reasoning tasks; however, our findings reveal a more nuanced picture of its effectiveness. We demonstrate two things:
- The effectiveness of Chain-of-Thought prompting can vary greatly depending on the type of task and model. For non-reasoning models, CoT generally improves average performance by a small amount, particularly if the model does not engage in step-by-step processing by default. However, CoT can introduce more variability in answers, occasionally triggering errors on questions the model would otherwise get right. We also found that many recent models perform some form of CoT reasoning even if not asked; for these models, a request to perform CoT had little impact. Performing CoT generally requires far more tokens (increasing cost and time) than direct answers.
- For models designed with explicit reasoning capabilities, CoT prompting often results in only marginal, if any, gains in answer accuracy. However, it significantly increases the time and tokens needed to generate a response.



Article 690

Title@2025-06-08 (7): Learning Compact Vision Tokens for Efficient Large Multimodal Models

Title: Learning Compact Vision Tokens for Efficient Large Multimodal Models Kompakte Visions-Token für effiziente große multimodale Modelle lernen 高效大型多模式模式的学习契约愿景 2506.07138v1

Authors (2): Hao Tang, Chengchao Shen

Large multimodal models (LMMs) face significant computational challenges due to the high cost of Large Language Models (LLMs) and the quadratic complexity of processing long vision token sequences. In this paper, we explore the spatial redundancy among vision tokens and shorten the length of vision token sequences for inference acceleration. Specifically, we propose a Spatial Token Fusion (STF) method to learn compact vision tokens for a short vision token sequence, where spatially adjacent tokens are fused into one. Meanwhile, a weight-frozen vision encoder cannot adapt well to the demands of diverse downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine the STF and MBTF modules to balance token reduction and information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capabilities. Experimental results demonstrate that our method based on LLaVA-1.5 achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks with only 25% of the baseline's vision tokens. The source code and trained weights are available at https://github.com/visresearch/LLaVA-STF.
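The arithmetic behind the 25% figure can be illustrated with a small sketch: fusing each 2x2 block of spatially adjacent tokens into one leaves a quarter of the tokens. The sketch below uses plain averaging as a stand-in; the paper's fusion is learned, and both the 2x2 block size and the function name `fuse_2x2` are assumptions chosen only to match the stated token ratio.

```python
def fuse_2x2(grid):
    """Fuse each 2x2 block of token vectors into one by averaging (stand-in for learned fusion).

    `grid` is a list of rows, each row a list of feature vectors (lists of floats).
    Height and width are assumed to be even.
    """
    height, width = len(grid), len(grid[0])
    fused = []
    for r in range(0, height - 1, 2):
        for c in range(0, width - 1, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            # Average the four neighbours feature by feature.
            fused.append([sum(vals) / 4.0 for vals in zip(*block)])
    return fused


if __name__ == "__main__":
    # Toy 4x4 grid whose token features are just their (row, col) coordinates.
    grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
    tokens = fuse_2x2(grid)
    print(len(tokens), "tokens kept out of", 4 * 4)
```

Since attention cost grows quadratically with sequence length, keeping a quarter of the vision tokens cuts the vision-token share of that cost far more than fourfold.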



Article 691

Title@2025-06-08 (7): Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

Title: Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models Theorem-of-Thought: Ein Multi-Agenten-Framework für abduktive, deduktive und induktive Vernunft in Sprachmodellen 所探讨的理论理论:语言模式中指导、贬低和诱导理由的多机构框架 2506.07106v1

Authors (4): Samir Abdaljalil, Hasan Kurban, Khalid Qaraqe, Erchin Serpedin

Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem-of-thought.



Article 692

Title@2025-06-08 (7): How Far Are We from Optimal Reasoning Efficiency?

Title: How Far Are We from Optimal Reasoning Efficiency? Wie weit sind wir von optimaler Effizienz entfernt? 我们离最佳计算效率还有多远呢? 2506.07104v1

Authors (9): Jiaxuan Gao, Shu Yan, Qixin Tan, Lu Yang, Shusheng Xu, Wei Fu, Zhiyu Mei, Kaifeng Lyu, Yi Wu

Large Reasoning Models (LRMs) demonstrate remarkable problem-solving capabilities through extended Chain-of-Thought (CoT) reasoning but often produce excessively verbose and redundant reasoning traces. This inefficiency incurs high inference costs and limits practical deployment. While existing fine-tuning methods aim to improve reasoning efficiency, assessing their efficiency gains remains challenging due to inconsistent evaluations. In this work, we introduce the reasoning efficiency frontiers, empirical upper bounds derived from fine-tuning base LRMs across diverse approaches and training configurations. Based on these frontiers, we propose the Reasoning Efficiency Gap (REG), a unified metric quantifying the deviation of any fine-tuned LRM from these frontiers. Systematic evaluation on challenging mathematical benchmarks reveals significant gaps in current methods: they either sacrifice accuracy for short length or remain inefficient under tight token budgets. To reduce the efficiency gap, we propose REO-RL, a class of Reinforcement Learning algorithms that minimizes REG by targeting a sparse set of token budgets. Leveraging numerical integration over strategically selected budgets, REO-RL approximates the full efficiency objective with low error using a small set of token budgets. Through systematic benchmarking, we demonstrate that our efficiency metric, REG, effectively captures the accuracy-length trade-off, with low-REG methods reducing length while maintaining accuracy. Our approach, REO-RL, consistently reduces REG by >=50 across all evaluated LRMs and matches the Qwen3-4B/8B efficiency frontiers under a 16K token budget with minimal accuracy loss. Ablation studies confirm the effectiveness of our exponential token budget strategy. Finally, our findings highlight that fine-tuning LRMs to perfectly align with the efficiency frontiers remains an open challenge.



Article 693

Title@2025-06-08 (7): Exploring Training and Inference Scaling Laws in Generative Retrieval

Title: Exploring Training and Inference Scaling Laws in Generative Retrieval Erforschen von Schulungs- und Inferenzskalierungsgesetzen in generativem Retrieval 探索 “ 创造检索 “ 中的培训和引文扩展法 2503.18941v2

Authors (7): Hongru Cai, Yongqi Li, Ruifeng Yuan, Wenjie Wang, Zhen Zhang, Wenjie Li, Tat-Seng Chua

Generative retrieval reformulates retrieval as an autoregressive generation task, where large language models (LLMs) generate target documents directly from a query. Because the paradigm is new, the mechanisms that underpin its performance and scalability remain largely unexplored. We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance. We propose a novel evaluation metric inspired by contrastive entropy and generation loss, providing a continuous performance signal that enables robust comparisons across diverse generative retrieval methods. Our experiments show that n-gram-based methods align strongly with training and inference scaling laws. We find that increasing model size, training data scale, and inference-time compute all contribute to improved performance, highlighting the complementary roles of these factors in enhancing generative retrieval. Across these settings, LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval. Our findings underscore that model size, data availability, and inference computation interact to unlock the full potential of generative retrieval, offering new insights for designing and optimizing future systems.



Article 694

Title@2025-06-08 (7): Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models

Title: Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models Stellen Sie sich vor zu hören: Auditory Knowledge Generation kann ein effektiver Assistent für Sprachmodelle sein 想象一下:产生听力知识可以成为语言模式的有效助理。 2503.16853v2

Authors (3): Suho Yoo, Hyunjong Ok, Jaeho Lee

Language models pretrained on text-only corpora often struggle with tasks that require auditory commonsense knowledge. Previous work addresses this problem by augmenting the language model to retrieve knowledge from external audio databases. This approach has several limitations, such as the potential lack of relevant audio in databases and the high costs associated with constructing the databases. To address these issues, we propose Imagine to Hear, a novel approach that dynamically generates auditory knowledge using generative models. Our framework detects multiple audio-related textual spans from the given prompt and generates corresponding auditory knowledge. We develop several mechanisms to efficiently process multiple pieces of auditory knowledge, including a CLAP-based rejection sampler and a language-audio fusion module. Our experiments show that our method achieves state-of-the-art performance on AuditoryBench without relying on external databases, highlighting the effectiveness of our generation-based approach.



Article 695

Title@2025-06-08 (7): Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing

Title: Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing Darstellungsdekomposition für das Lernen von Ähnlichkeit und Kontrast über Modalitäten für affektives Computing 积极计算方法不同模式的学习相似性和相异性 2506.07086v1

Authors (5): Yuanhe Tian, Pengsen Cheng, Guoqing Jin, Lei Zhang, Yan Song

Multi-modal affective computing aims to automatically recognize and interpret human attitudes from diverse data sources such as images and text, thereby enhancing human-computer interaction and emotion understanding. Existing approaches typically rely on unimodal analysis or straightforward fusion of cross-modal information, both of which fail to capture complex and conflicting evidence presented across different modalities. In this paper, we propose a novel LLM-based approach for affective computing that explicitly deconstructs visual and textual representations into shared (modality-invariant) and modality-specific components. Specifically, our approach first encodes and aligns input modalities using pre-trained multi-modal encoders, then employs a representation decomposition framework to separate common emotional content from unique cues, and finally integrates these decomposed signals via an attention mechanism to form a dynamic soft prompt for a multi-modal LLM. Extensive experiments on three representative tasks for affective computing, namely, multi-modal aspect-based sentiment analysis, multi-modal emotion analysis, and hateful meme detection, demonstrate the effectiveness of our approach, which consistently outperforms strong baselines and state-of-the-art models.



Article 696

Title@2025-06-08 (7): From Informal to Formal – Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs

Title: From Informal to Formal – Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs Vom informellen zum formalen – Einbinden und Bewerten von LLMs über natürliche Sprachanforderungen bis hin zu überprüfbaren Formalproofs 从非正式到正式 – – 纳入和评价关于自然语言要求与可核实的正式证明之间的LLMs 2501.16207v4

Authors (12): Jialun Cao, Yaojie Lu, Meiziniu Li, Haoyang Ma, Haokun Li, Mengda He, Cheng Wen, Le Sun, Hongyu Zhang, Shengchao Qin, Shing-Chi Cheung, Cong Tian

Research in AI-based formal mathematical reasoning has shown an unstoppable growth trend. These studies have excelled in mathematical competitions like IMO and have made significant progress. This paper focuses on formal verification, an immediate application scenario of formal reasoning, and breaks it down into sub-tasks. We constructed 18k high-quality instruction-response pairs across five formal specification languages (Coq, Lean4, Dafny, ACSL, and TLA+) by distilling GPT-4o and evaluated against ten open-source LLMs, including the recently popular DeepSeek-R1. We also fine-tuned several small 7-8B models to achieve performance comparable to DeepSeek-R1-671B. Interestingly, we observed that fine-tuning with formal data also enhances mathematics, reasoning, and coding capabilities. Fine-tuned models are released at https://huggingface.co/fm-universe.



Article 697

Title@2025-06-08 (7): Com$^2$: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models

Title: Com$^2$: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models Com$^2$: Ein ursächlich geführter Benchmark für die Erforschung komplexer Commonsense-Reasoning in großen Sprachmodellen com$2美元:探讨大语言模式中复杂常识原因的因果指导基准 2506.07064v1

Authors (10): Kai Xiong, Xiao Ding, Yixin Cao, Yuxiong Yan, Li Du, Yufei Zhang, Jinglong Gao, Jiaqian Liu, Bing Qin, Ting Liu

Large language models (LLMs) have mastered abundant simple and explicit commonsense knowledge through pre-training, enabling them to achieve human-like performance in simple commonsense reasoning. Nevertheless, LLMs struggle to reason with complex and implicit commonsense knowledge that is derived from simple knowledge (such as understanding the long-term effects of certain events), an aspect humans tend to focus on more. Existing works focus on complex tasks like math and code, while complex commonsense reasoning remains underexplored due to its uncertainty and lack of structure. To fill this gap and align with real-world concerns, we propose Com$^2$, a benchmark focusing on complex commonsense reasoning. We first incorporate causal event graphs to serve as structured complex commonsense. Then we adopt causal theory (e.g., intervention) to modify the causal event graphs and obtain different scenarios that meet human concerns. Finally, an LLM is employed to synthesize examples with slow thinking, which is guided by the logical relationships in the modified causal graphs. Furthermore, we use detective stories to construct a more challenging subset. Experiments show that LLMs struggle in reasoning depth and breadth, while post-training and slow thinking can alleviate this. The code and data are available at https://github.com/Waste-Wood/Com2.



Article 698

Title@2025-06-08 (7): AdaSplash: Adaptive Sparse Flash Attention

Title: AdaSplash: Adaptive Sparse Flash Attention AdaSplash: Adaptive Sparse Flash Achtung AdaSplash: 适应性散散闪光关注 2502.12082v2

Authors (3): Nuno Gonçalves, Marcos Treviso, André F. T. Martins

The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$-entmax implementations. It approaches – and in some cases surpasses – the efficiency of highly optimized softmax implementations like FlashAttention-2, enabling long-context training while maintaining strong task performance.
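For readers unfamiliar with $\alpha$-entmax, the following minimal sketch computes it with plain bisection on the threshold $\tau$ in $p_i = [(\alpha-1)z_i - \tau]_+^{1/(\alpha-1)}$. This is the simple baseline solver, not the hybrid Halley-bisection algorithm or the Triton kernels the abstract describes; the function name `entmax_bisect` and the iteration count are illustrative assumptions.

```python
def entmax_bisect(scores, alpha=1.5, n_iter=50):
    """Alpha-entmax via bisection on the threshold tau (requires alpha > 1)."""
    if alpha <= 1.0:
        raise ValueError("alpha must be greater than 1")
    scaled = [(alpha - 1.0) * s for s in scores]
    exponent = 1.0 / (alpha - 1.0)

    def probs(tau):
        # Entries at or below the threshold get exactly zero probability.
        return [max(s - tau, 0.0) ** exponent for s in scaled]

    top = max(scaled)
    lo = top - 1.0                                        # total mass >= 1 here
    hi = top - (1.0 / len(scaled)) ** (alpha - 1.0)       # total mass <= 1 here
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if sum(probs(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = probs((lo + hi) / 2.0)
    total = sum(p)
    return [x / total for x in p]                         # remove residual bisection error


if __name__ == "__main__":
    print(entmax_bisect([2.0, 1.0, -1.0], alpha=1.5))
```

Unlike softmax, low-scoring entries receive exactly zero weight, which is the sparsity a kernel can exploit by skipping work; each bisection step halves the search interval, which is why a faster-converging root finder reduces the iteration count.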



Article 699

Title@2025-06-08 (7): When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models

Title: When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models Wenn Aufmerksamkeit zusammenbricht: Wie degenerierte Ebenen in LLMs kleinere, stärkere Modelle aktivieren 注意折叠时: LLM 中如何分解层的 LLM 启用更小、 更强模型 2404.08634v3

Authors (4): Sunny Sanyal, Ravid Shwartz-Ziv, Alexandros G. Dimakis, Sujay Sanghavi

Large Language Models (LLMs) rely on the transformer architecture and its self-attention mechanism to deliver strong performance across tasks. However, we uncover a structural inefficiency in standard pre-trained decoder-style LLMs: in many of the deeper layers, attention matrices frequently collapse to near rank-one, single-column patterns. We refer to these underutilized components as lazy layers, which are redundant and computationally inefficient. To address this, we propose Inheritune, a simple and effective training recipe for building smaller, more efficient, and high performing language models. Inheritune initializes a compact model by inheriting the useful early layers from a larger pre-trained model, then progressively retrains and expands it. Our experiments across multiple models and datasets show that Inheritune trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts. This approach yields compact, performant models and offers a practical path for efficient language model compression. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune
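One rough way to spot the single-column pattern described above is to measure how much attention mass, averaged over query rows, lands on the most-attended key position. The sketch below is only an illustrative proxy: the function name `single_column_mass` and the toy matrices are assumptions, and the abstract does not state which collapse metric the authors use.

```python
def single_column_mass(attn):
    """Return the largest column mean of a row-stochastic attention matrix.

    Values near 1.0 mean almost every query attends to the same key position,
    i.e. the matrix is close to a rank-one, single-column pattern.
    """
    n_rows, n_cols = len(attn), len(attn[0])
    col_means = [sum(row[j] for row in attn) / n_rows for j in range(n_cols)]
    return max(col_means)


if __name__ == "__main__":
    collapsed = [[1.0, 0.0, 0.0], [0.98, 0.02, 0.0], [0.97, 0.01, 0.02]]
    spread = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [1 / 3, 1 / 3, 1 / 3]]
    print(single_column_mass(collapsed), single_column_mass(spread))
```

Under causal masking the first column always receives some mass, so any threshold on this score would need to be calibrated against a spread-out baseline like the second matrix.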



Article 700

Title@2025-06-08 (7): Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Title: Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs Interpretierbare und zuverlässige Erkennung von KI-generierten Bildern durch geerdete Vernunft in MLLMs 通过MLLLMs中基于理由对AI创像进行可解释和可靠地探测 2506.07045v1

Authors (8): Yikun Ji, Hong Yan, Jun Lan, Huijia Zhu, Weiqiang Wang, Qi Fan, Liqing Zhang, Jianfu Zhang

The rapid advancement of image generation technologies intensifies the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned visual-textual grounded reasoning. We then finetune MLLMs through a multi-stage optimization strategy that progressively balances the objectives of accurate detection, visual localization, and coherent textual explanation. The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws, significantly outperforming baseline methods.



Article 701

Title@2025-06-08 (7): Reasoning with RAGged events: RAG-Enhanced Event Knowledge Base Construction and reasoning with proof-assistants

Title: Reasoning with RAGged events: RAG-Enhanced Event Knowledge Base Construction and reasoning with proof-assistants Reasoning mit RAGged Events: RAG-erweiterte Event Knowledge Base Konstruktion und Reasoning mit Proof-Assistenten RAG-加强事件知识库建设和与证据助理的推理 2506.07042v1

Authors (1): Stergios Chatzikyriakidis

Extracting structured computational representations of historical events from narrative text remains computationally expensive when constructed manually. While RDF/OWL reasoners enable graph-based reasoning, they are limited to fragments of first-order logic, preventing deeper temporal and semantic analysis. This paper addresses both challenges by developing automatic historical event extraction models using multiple LLMs (GPT-4, Claude, Llama 3.2) with three enhancement strategies: pure base generation, knowledge graph enhancement, and Retrieval-Augmented Generation (RAG). We conducted comprehensive evaluations using historical texts from Thucydides. Our findings reveal that enhancement strategies optimize different performance dimensions rather than providing universal improvements. For coverage and historical breadth, base generation achieves optimal performance with Claude and GPT-4 extracting comprehensive events. However, for precision, RAG enhancement improves coordinate accuracy and metadata completeness. Model architecture fundamentally determines enhancement sensitivity: larger models demonstrate robust baseline performance with incremental RAG improvements, while Llama 3.2 shows extreme variance from competitive performance to complete failure. We then developed an automated translation pipeline converting extracted RDF representations into Coq proof assistant specifications, enabling higher-order reasoning beyond RDF capabilities including multi-step causal verification, temporal arithmetic with BC dates, and formal proofs about historical causation. The Coq formalization validates that RAG-discovered event types represent legitimate domain-specific semantic structures rather than ontological violations.



Article 702

Title@2025-06-08 (7): Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More

Title: Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More Stoppen Sie die Suche nach wichtigen Token in multimodalen Sprachmodellen: Duplizieren zählt mehr 停止在多模式语言模式中查找重要调子: 更多重复问题 2502.11494v2

Authors (8): Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang

Vision tokens in multimodal large language models often account for a large share of the computational overhead because they far outnumber the text tokens. Many recent methods address this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator for deciding whether a token should be pruned. Surprisingly, it usually yields worse performance than random token pruning and is incompatible with efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance, leading to a 1.99$\times$ and 2.99$\times$ speed-up in total time and prefilling stage, respectively, with good compatibility with efficient attention operators. Our code is available at https://github.com/ZichenWen1/DART.
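The pivot-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: duplication is measured as cosine similarity between token vectors, the pivot indices are passed in by the caller, and the function names are hypothetical rather than taken from the DART repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prune_by_duplication(tokens, pivot_idx, keep):
    """Keep the pivots plus the `keep` tokens least similar to any pivot."""
    pivots = set(pivot_idx)
    scored = []
    for i, tok in enumerate(tokens):
        if i in pivots:
            continue
        # Duplication score: highest similarity to any pivot token.
        duplication = max(cosine(tok, tokens[p]) for p in pivot_idx)
        scored.append((duplication, i))
    scored.sort()  # lowest duplication first
    kept = {i for _, i in scored[:keep]}
    return sorted(pivots | kept)


tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [1.0, 0.01]]
print(prune_by_duplication(tokens, pivot_idx=[0], keep=1))  # → [0, 2]
```

Because the selection depends only on the token vectors and not on attention scores, a scheme of this shape does not need access to the attention matrix, which is consistent with the compatibility claim in the abstract.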



Article 703

Title@2025-06-08 (7): BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling

Title: BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling BRIDGE: Bootstrapping-Text zur Steuerung der Time-Series-Generation über Multi-Agent iterative Optimierung und Diffusionsmodellierung BRIDGE:通过多代理迭代优化和传播模型化控制时间- 系列生成的推进文本 2503.02445v4

Authors (8): Hao Li, Yu-Hao Huang, Chang Xu, Viktor Schlegel, Renhe Jiang, Riza Batista-Navarro, Goran Nenadic, Jiang Bian

Time-series Generation (TSG) is a prominent research area with broad applications in simulations, data augmentation, and counterfactual analysis. While existing methods have shown promise in unconditional single-domain TSG, real-world applications demand cross-domain approaches capable of controlled generation tailored to domain-specific constraints and instance-level requirements. In this paper, we argue that text can provide semantic insights, domain information, and instance-specific temporal patterns to guide and improve TSG. We introduce "Text-Controlled TSG", a task focused on generating realistic time series by incorporating textual descriptions. To address data scarcity in this setting, we propose a novel LLM-based Multi-Agent framework that synthesizes diverse, realistic text-to-TS datasets. Furthermore, we introduce BRIDGE, a hybrid text-controlled TSG framework that integrates semantic prototypes with text descriptions to support domain-level guidance. This approach achieves state-of-the-art generation fidelity on 11 of 12 datasets, and improves controllability by up to 12% on MSE and 6% on MAE compared to generation without text input, highlighting its potential for generating tailored time-series data.



Article 704

Title@2025-06-08 (7): KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering

Title: KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering KG2QA:为通信标准问题解答提供知识型图集增强的检索回声-增强的一代人 2506.07037v1

Authors (7): Zhongze Luo, Weixuan Wan, Qizhi Zheng, Yanhong Bai, Jingyun Sun, Jian Wang, Dan Wang

There are many types of standards in the field of communication. The traditional consulting model has a long cycle and relies on the knowledge and experience of experts, making it difficult to keep up with rapidly developing technological demands. This paper combines the fine-tuning of large language models with the construction of knowledge graphs to implement an intelligent consultation and question-answering system for communication standards. The experimental results show that after LoRA tuning on the constructed dataset of 6,587 questions and answers in the field of communication standards, Qwen2.5-7B-Instruct demonstrates outstanding professional capabilities on the test set. BLEU-4 rose from 18.8564 to 66.8993, and evaluation indicators such as ROUGE also increased significantly, outperforming the fine-tuning effect of the comparison model Llama-3-8B-Instruct. Based on an ontology framework containing 6 entity attributes and 10 relation attributes, a knowledge graph of the communication standard domain containing 13,906 entities and 13,524 relations was constructed, showing a relatively good query accuracy rate. The intelligent consultation and question-answering system enables the fine-tuned model on the server side to access the locally constructed knowledge graph and first retrieve key information from the graph, which helps improve question-answering quality. The evaluation using DeepSeek as the judge on the test set shows that our RAG framework enables the fine-tuned model to improve its scores on all five evaluated aspects, with an average score increase of 2.26%. Combined with web services and API interfaces, the system also achieves very good results in interaction experience and back-end access, and has strong practical application value.



Article 705

Title@2025-06-08 (7): WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning

Title: WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning WorkForceAgent-R1: Förderung der Fähigkeit von LLM-basierten Web-Agenten durch Verstärkungs-Lernen 工作力量-R1:通过强化学习在基于LLM的网络代理中鼓励 2505.22942v2

Authors (6): Yuchen Zhuang, Di Jin, Jiaao Chen, Wenqi Shi, Hanrui Wang, Chao Zhang

Large language model (LLM)-empowered web agents enable the automation of complex, real-time web navigation tasks in enterprise environments. However, existing web agents relying on supervised fine-tuning (SFT) often struggle with generalization and robustness due to insufficient reasoning capabilities when handling the inherently dynamic nature of web interactions. In this study, we introduce WorkForceAgent-R1, an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework designed explicitly to enhance single-step reasoning and planning for business-oriented web navigation tasks. We employ a structured reward function that evaluates both adherence to output formats and correctness of actions, enabling WorkForceAgent-R1 to implicitly learn robust intermediate reasoning without explicit annotations or extensive expert demonstrations. Extensive experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines by 10.26-16.59%, achieving competitive performance relative to proprietary LLM-based agents (gpt-4o) in workplace-oriented web navigation tasks.
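A rule-based reward that checks both output format and action correctness might look like the sketch below. The `<think>`/`<action>` tags, the exact-match comparison, and the 0.2/0.8 weights are assumptions made for illustration; the abstract does not state the actual format or weighting used by WorkForceAgent-R1.

```python
import re

# Hypothetical output format: reasoning in <think>, then one action in <action>.
OUTPUT_PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<action>(.+?)</action>$", re.DOTALL
)


def structured_reward(output, gold_action, format_weight=0.2, action_weight=0.8):
    """Score a model output: partial credit for format, the rest for the action."""
    match = OUTPUT_PATTERN.match(output.strip())
    if match is None:
        return 0.0  # malformed output earns nothing
    score = format_weight
    if match.group(2).strip() == gold_action.strip():
        score += action_weight
    return score


good = "<think>The form needs submitting.</think><action>click('submit')</action>"
print(structured_reward(good, "click('submit')"))
print(structured_reward("click('submit')", "click('submit')"))  # no tags → 0.0
```

Since the reward inspects only the final action and the presence of a reasoning span, the content of the reasoning itself is never annotated, which matches the abstract's point about learning intermediate reasoning implicitly.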



Article 706

Title@2025-06-08 (7): A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

Title: A Culturally-diverse Multilingual Multimodal Video Benchmark & Model Ein multimodales multimodales Multimodal-Video-Benchmark & -Modell 多语言多语言多语多语多模式视频基准和模式 2506.07032v1

Authors (29): Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More, Sanoojan Baliah, Hasindri Watawana, Yuhao Li, Fabian Farestam, Leon Schaller, Roman Tymtsiv, Simon Weber, Hisham Cholakkal, Ivan Laptev, Shin’ichi Satoh, Michael Felsberg, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan

Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs target the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to evaluate Video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we introduce a machine-translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high- and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM, along with a large-scale multilingual video training set, will help ease future research in developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.



Article 707

Title@2025-06-08 (7): HauntAttack: When Attack Follows Reasoning as a Shadow

Title: HauntAttack: When Attack Follows Reasoning as a Shadow HauntAttack: Wenn Angriff folgt Vernunft als Schatten 恶霸:当攻击发生时,以影子为由进行攻击 2506.07031v1

Authors (6): Jingyuan Ma, Rui Li, Zheng Li, Junfeng Liu, Lei Sha, Zhifang Sui

Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing exceptional capabilities. However, the enhancement of reasoning abilities and the exposure of their internal reasoning processes introduce new safety vulnerabilities. One intriguing concern is: when reasoning is strongly entangled with harmfulness, what safety-reasoning trade-off do LRMs exhibit? To address this issue, we introduce HauntAttack, a novel and general-purpose black-box attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we treat reasoning questions as carriers and substitute one of their original conditions with a harmful instruction. This process creates a reasoning pathway in which the model is guided step by step toward generating unsafe outputs. Based on HauntAttack, we conduct comprehensive experiments on multiple LRMs. Our results reveal that even the most advanced LRMs exhibit significant safety vulnerabilities. Additionally, we perform a detailed analysis of different models, various types of harmful instructions, and model output patterns, providing valuable insights into the security of LRMs.



Article 708

Title@2025-06-08 (7): JsonTuning: Towards Generalizable, Robust, and Controllable Instruction Tuning

Title: JsonTuning: Towards Generalizable, Robust, and Controllable Instruction Tuning JsonTuning: Auf dem Weg zu einer generalisierbaren, robusten und kontrollierbaren Instruktion JsonTurning: 走向可普及、强力和可控制的指示图示 2310.02953v6

Authors (4): Chang Gao, Wenxuan Zhang, Guizhen Chen, Wai Lam

Instruction tuning is vital for enhancing the performance of large language models (LLMs), but existing text-to-text methods, referred to as TextTuning, struggle with issues such as generalization, robustness, and controllability due to their lack of explicit task structures. We introduce JsonTuning, a structure-to-structure approach that uses JSON structures to represent tasks. This method improves generalization by clarifying task elements and their relations, boosts robustness by minimizing ambiguity, and enhances controllability by allowing precise control over outputs. We conduct an extensive comparative analysis between JsonTuning and TextTuning using various language models and benchmarks. Our findings reveal that JsonTuning consistently surpasses TextTuning in terms of performance, robustness, and controllability across different scenarios. By overcoming the limitations of TextTuning, JsonTuning demonstrates significant potential for developing more effective and reliable LLMs capable of handling diverse scenarios.
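To illustrate what a structure-to-structure training pair could look like, the sketch below serializes a task as JSON on both the input and output side. The key names (`instruction`, `input`, `output_schema`) and the helper function are assumptions for this example; the abstract does not give the exact schema JsonTuning uses.

```python
import json


def build_json_example(instruction, elements, output_schema, answer):
    """Serialise one task as a (source, target) pair of JSON strings."""
    source = {
        "instruction": instruction,      # what to do
        "input": elements,               # named task elements
        "output_schema": output_schema,  # explicit control over the output shape
    }
    return json.dumps(source, sort_keys=True), json.dumps(answer, sort_keys=True)


src, tgt = build_json_example(
    "Classify the sentiment of the review.",
    {"review": "The battery lasts all day.", "labels": ["positive", "negative"]},
    {"type": "object", "properties": {"label": {"type": "string"}}},
    {"label": "positive"},
)
print(src)
print(tgt)
```

Naming each task element and declaring the expected output shape is the kind of explicit structure the abstract credits for reduced ambiguity and better controllability, in contrast to a single free-text prompt.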



Article 709

Title@2025-06-08 (7): HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models

Title: HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models HAIC: Verbesserung des menschlichen Handelns durch bessere Bildunterschriften für multimodale große Sprachmodelle HAIC: 提高人类行动的理解和生成,为多模式大语言模式提供更好的指导 2502.20811v2

Authors (8): Xiao Wang, Jingyun Hua, Weihong Lin, Yuanxing Zhang, Fuzheng Zhang, Jianlong Wu, Di Zhang, Liqiang Nie

Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, HAICBench includes 412 manually annotated video-caption pairs and 2,000 QA pairs, for a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human understanding abilities across 4 benchmarks, but can also improve text-to-video generation results. Both HAICTrain and HAICBench are released at https://huggingface.co/datasets/KuaishouHAIC/HAIC.



Article 710

Title@2025-06-08 (7): AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding

Title: AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding AdaReTaKe: Adaptive Redundanzreduzierung zur Wahrnehmung länger für Video-Sprachen-Verständnis AdaRETAKe:为了解视频语言而将适应性裁员减到可能更长的时间 2503.12559v2

Authors (6): Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie

Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. Recent methods compress videos by leveraging visual redundancy uniformly, yielding promising results. Nevertheless, our quantitative analysis shows that redundancy varies significantly across time and model layers, necessitating a more flexible compression strategy. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Integrated into state-of-the-art MLLMs, AdaReTaKe improves processing capacity from 256 to 2048 frames while preserving critical information. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively, with even greater improvements of 5.9% and 6.0% on the longest LVBench. Our code is available at https://github.com/SCZwangxiao/video-FlexReduc.git.



Article 711

Title@2025-06-08 (7): From Sub-Ability Diagnosis to Human-Aligned Generation: Bridging the Gap for Text Length Control via MARKERGEN

Title: From Sub-Ability Diagnosis to Human-Aligned Generation: Bridging the Gap for Text Length Control via MARKERGEN Von der Sub-Ability-Diagnose zur Menschengerechten Generation: Überbrückung der Lücke zur Textlängenkontrolle über MARKERGEN 从亚能力分析到人类一代:通过MARKERGEN弥合文本长度控制差距 2502.13544v3

Authors (10): Peiwen Yuan, Chuyi Tan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Boyuan Pan, Yao Hu, Kan Li

Despite the rapid progress of large language models (LLMs), their length-controllable text generation (LCTG) ability remains below expectations, posing a major limitation for practical applications. Existing methods mainly focus on end-to-end training to reinforce adherence to length constraints. However, the lack of decomposition and targeted enhancement of LCTG sub-abilities restricts further progress. To bridge this gap, we conduct a bottom-up decomposition of LCTG sub-abilities with human patterns as reference and perform a detailed error analysis. On this basis, we propose MarkerGen, a simple-yet-effective plug-and-play approach that: (1) mitigates LLM fundamental deficiencies via external tool integration; (2) conducts explicit length modeling with dynamically inserted markers; (3) employs a three-stage generation scheme to better align length constraints while maintaining content quality. Comprehensive experiments demonstrate that MarkerGen significantly improves LCTG across various settings, exhibiting outstanding effectiveness and generalizability.
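The idea of explicit length modeling with inserted markers can be illustrated with a toy helper that writes a running word count into the text, using an external counter rather than the model's own estimate. The `[N]` marker format, the fixed interval, and both function names are assumptions for this sketch; MarkerGen inserts markers dynamically, and its actual format is not given in the abstract.

```python
import re


def insert_length_markers(text, every=5):
    """Append a running word-count marker such as "[5]" after every `every` words."""
    out = []
    for count, word in enumerate(text.split(), start=1):
        out.append(word)
        if count % every == 0:
            out.append(f"[{count}]")
    return " ".join(out)


def strip_length_markers(text):
    """Remove the markers again so the final output contains only content words."""
    return re.sub(r"\s*\[\d+\]", "", text)


marked = insert_length_markers("a b c d e f", every=3)
print(marked)                        # → a b c [3] d e f [6]
print(strip_length_markers(marked))  # → a b c d e f
```

Note that this toy stripper would also delete bracketed numbers that belong to the content, such as numeric citations, so a real system would need a marker syntax that cannot collide with ordinary text.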



Article 712

Title@2025-06-08 (7): Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States

Title: Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States Hin zu dynamischer Geistestheorie: Bewertung der LLM-Anpassung an die zeitliche Evolution menschlicher Staaten 走向动态思维理论:评价LLM适应人类国家时间演变的适应LLM 2505.17663v2

Authors (8): Yang Xiao, Jiashuo Wang, Qiancheng Xu, Changhe Song, Chunpu Xu, Yi Cheng, Wenjie Li, Pengfei Liu

As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities, particularly their ability to track dynamic mental states, becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present DynToM, a novel benchmark specifically designed to evaluate LLMs’ ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that their average performance trails humans by 44.7%, with performance degrading significantly when tracking and reasoning about the shift of mental states. This performance gap highlights fundamental limitations in current LLMs’ ability to model the dynamic nature of human mental states.



Article 713

Title@2025-06-08 (7): sudo rm -rf agentic_security

Title: sudo rm -rf agentic_security SUDO rm -rf agentic_security sudo rm -rf 毒剂安全 2503.20279v3

Authors (6): Sejin Lee, Jian Kim, Haon Park, Ashkan Yousefpour, Sangyoon Yu, Min Song

Large Language Models (LLMs) are increasingly deployed as computer-use agents, autonomously performing tasks within real desktop or web environments. While this evolution greatly expands practical use cases for humans, it also creates serious security exposures. We present SUDO (Screen-based Universal Detox2Tox Offense), a novel attack framework that systematically bypasses refusal-trained safeguards in commercial computer-use agents, such as Claude for Computer Use. The core mechanism, Detox2Tox, transforms harmful requests (that agents initially reject) into seemingly benign requests via detoxification, secures detailed instructions from advanced vision language models (VLMs), and then reintroduces malicious content via toxification just before execution. Unlike conventional jailbreaks, SUDO iteratively refines its attacks based on a built-in refusal feedback, making it increasingly effective against robust policy filters. In extensive tests spanning 50 real-world tasks and multiple state-of-the-art VLMs, SUDO achieves a stark attack success rate of 24.41% (with no refinement), and up to 41.33% (by its iterative refinement) in Claude for Computer Use. By revealing these vulnerabilities and demonstrating the ease with which they can be exploited in real-world computing environments, this paper highlights an immediate need for robust, context-aware safeguards. WARNING: This paper includes harmful or offensive model outputs



Article 714

Title@2025-06-08 (7): Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text

Title: Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text Adversariale Paraphrasierung: Ein universeller Angriff für humanisierenden KI-generierten Text 反对反对截图:普遍攻击使 AI - 发光的文字人性化 2506.07001v1

Authors (5): Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, Soheil Feizi

The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, a simple paraphrasing attack, ironically, increases the true positive rate at a 1% false positive rate (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT, whereas adversarial paraphrasing guided by OpenAI-RoBERTa-Large reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors, including neural network-based, watermark-based, and zero-shot approaches, our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success and find that our method can significantly reduce detection rates, mostly with only a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in light of increasingly sophisticated evasion techniques.
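The T@1%F metric quoted above (true positive rate at a 1% false positive rate) can be computed from detector scores as in the sketch below. It assumes higher scores mean "more likely AI-generated" and that a text is flagged when its score strictly exceeds the threshold; the function name and the tie-handling rule are choices made for this example, not details from the paper.

```python
import math


def tpr_at_fpr(human_scores, ai_scores, max_fpr=0.01):
    """Fraction of AI texts flagged at a threshold that flags <= max_fpr of human texts."""
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to flag (small epsilon guards float error).
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    # Flag only scores strictly above this threshold, so at most `allowed` humans are flagged.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(score > threshold for score in ai_scores)
    return flagged / len(ai_scores)


human = [i / 100 for i in range(100)]   # detector scores on human-written text
ai = [0.985, 0.5, 0.995, 0.2]           # detector scores on AI-generated text
print(tpr_at_fpr(human, ai))            # → 0.5
```

Fixing the false positive rate first makes detectors with different score scales comparable, which is why the reported reductions are stated in terms of this metric.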



Article 715

Title@2025-06-08 (7): What makes Reasoning Models Different? Follow the Reasoning Leader for Efficient Decoding

Title: What makes Reasoning Models Different? Follow the Reasoning Leader for Efficient Decoding Was macht Vernunftmodelle anders? Folgen Sie der Vernunftführerin für effiziente Decodierung 是什么让理性模型不同? 跟随高效配方的理性导师 2506.06998v1

Authors (7): Ming Li, Zhengyuan Yang, Xiyao Wang, Dianqi Li, Kevin Lin, Tianyi Zhou, Lijuan Wang

Large reasoning models (LRMs) achieve strong reasoning performance by emitting long chains of thought. Yet, these verbose traces slow down inference and often drift into unnecessary detail, known as the overthinking phenomenon. To better understand LRMs’ behavior, we systematically analyze the token-level misalignment between reasoning and non-reasoning models. While it is expected that their primary difference lies in the stylistic “thinking cues”, LRMs uniquely exhibit two pivotal, previously under-explored phenomena: a Global Misalignment Rebound, where their divergence from non-reasoning models persists or even grows as response length increases, and more critically, a Local Misalignment Diminish, where the misalignment concentrates at the “thinking cues” each sentence starts with but rapidly declines in the remainder of the sentence. Motivated by the Local Misalignment Diminish, we propose FoReaL-Decoding, a collaborative fast-slow thinking decoding method for cost-quality trade-off. In FoReaL-Decoding, a Leading model generates the first few tokens of each sentence, and then a weaker draft model completes the remaining tokens up to the end of the sentence. FoReaL-Decoding adopts a stochastic gate to smoothly interpolate between the small and the large model. On four popular math-reasoning benchmarks (AIME24, GPQA-Diamond, MATH500, AMC23), FoReaL-Decoding reduces theoretical FLOPs by 30 to 50% and trims CoT length by up to 40%, while preserving 86 to 100% of model performance. These results establish FoReaL-Decoding as a simple, plug-and-play route to controllable cost-quality trade-offs in reasoning-centric tasks.
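The leader-then-draft control flow can be sketched with two stand-in "models" that map a token context to the next token. Everything here is a simplified assumption: the callables, the per-sentence coin flip used as the stochastic gate, and the punctuation-based sentence boundary are illustrative and not taken from the FoReaL-Decoding implementation.

```python
import random

SENTENCE_END = {".", "!", "?"}


def foreal_decode(lead_next, draft_next, prompt, lead_tokens=3, lead_prob=1.0,
                  max_tokens=50, seed=0, eos="<eos>"):
    """Let the leading model open each sentence and the draft model finish it."""
    rng = random.Random(seed)
    context = list(prompt)
    sources = []                                 # which model produced each token
    pos_in_sentence = 0
    use_lead = rng.random() < lead_prob          # gate decision for the first sentence
    for _ in range(max_tokens):
        from_lead = use_lead and pos_in_sentence < lead_tokens
        token = (lead_next if from_lead else draft_next)(context)
        if token == eos:
            break
        context.append(token)
        sources.append("lead" if from_lead else "draft")
        pos_in_sentence += 1
        if token in SENTENCE_END:                # new sentence: reset and re-draw the gate
            pos_in_sentence = 0
            use_lead = rng.random() < lead_prob
    return context[len(prompt):], sources


# Toy stand-ins: the leader always emits "L"; the draft ends a sentence every 5th token.
lead = lambda ctx: "L"
draft = lambda ctx: "." if len(ctx) % 5 == 4 else "d"
tokens, sources = foreal_decode(lead, draft, prompt=[], lead_tokens=2, max_tokens=10)
print(tokens)
print(sources)
```

Lowering `lead_prob` hands more whole sentences to the draft model, which is one simple way to realize the interpolation between the small and the large model that the abstract describes.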



Article 716

Title@2025-06-08 (7): STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings

Title: STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings STAMP Ihr Inhalt: Nachweis der Datensatzmitgliedschaft über Watermarked Rephrasings STAMP 您的内容: 通过水标修改方式证明数据集成员 2504.13416v2

Authors (3): Saksham Rastogi, Pratyush Maini, Danish Pruthi

Given that large parts of publicly available text are crawled to pretrain large language models (LLMs), data creators increasingly worry about the inclusion of their proprietary data for model training without attribution or licensing. Their concerns are also shared by benchmark curators whose test sets might be compromised. In this paper, we present STAMP, a framework for detecting dataset membership, i.e., determining the inclusion of a dataset in the pretraining corpora of LLMs. Given an original piece of content, our proposal involves first generating multiple rephrases, each embedding a watermark with a unique secret key. One version is to be released publicly, while others are to be kept private. Subsequently, creators can compare model likelihoods between public and private versions using paired statistical tests to prove membership. We show that our framework can successfully detect contamination across four benchmarks which appear only once in the training data and constitute less than 0.001% of the total tokens, outperforming several contamination detection and dataset inference baselines. We verify that STAMP preserves both the semantic meaning and utility of the original data. We apply STAMP to two real-world scenarios to confirm the inclusion of paper abstracts and blog articles in the pretraining corpora.
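The comparison between public and private versions can be illustrated with a paired t-statistic over per-document log-likelihoods. This is a minimal sketch: the abstract only says "paired statistical tests", so the choice of a t-test, the function name, and the toy numbers below are assumptions, and a real analysis would convert the statistic to a p-value.

```python
import math


def paired_t_statistic(public_ll, private_ll):
    """t-statistic for the mean per-document difference (public minus private).

    A large positive value suggests the model assigns systematically higher
    likelihood to the publicly released rephrasings. Requires at least two
    pairs and non-identical differences (otherwise the variance is zero).
    """
    diffs = [a - b for a, b in zip(public_ll, private_ll)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)


# Made-up log-likelihoods for four documents, for illustration only.
public = [-1.0, -1.2, -0.9, -1.1]
private = [-1.5, -1.6, -1.6, -1.5]
print(round(paired_t_statistic(public, private), 3))
```

Pairing each public rephrasing with private rephrasings of the same content controls for how easy the underlying document is to model, so the test isolates the effect of the specific released wording.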



Article 717

Title@2025-06-08 (7): Cultural Bias Matters: A Cross-Cultural Benchmark Dataset and Sentiment-Enriched Model for Understanding Multimodal Metaphors

Title: Cultural Bias Matters: A Cross-Cultural Benchmark Dataset and Sentiment-Enriched Model for Understanding Multimodal Metaphors Cultural Bias Matters: Ein Cross-Cultural Benchmark Dataset und Sentiment-Enriched Modell zum Verständnis multimodaler Metaphoren 文化偏见问题:跨文化基准数据集和适应性强化模式,了解多模式代词器 2506.06987v1

Authors (8): Senqi Yang, Dongyu Zhang, Jing Ren, Ziqi Xu, Xiuzhen Zhang, Yiliao Song, Hongfei Lin, Feng Xia

Metaphors are pervasive in communication, making them crucial for natural language processing (NLP). Previous research on automatic metaphor processing predominantly relies on training data consisting of English samples, which often reflect Western European or North American biases. This cultural skew can lead to an overestimation of model performance and contributions to NLP progress. However, the impact of cultural bias on metaphor processing, particularly in multimodal contexts, remains largely unexplored. To address this gap, we introduce MultiMM, a Multicultural Multimodal Metaphor dataset designed for cross-cultural studies of metaphor in Chinese and English. MultiMM consists of 8,461 text-image advertisement pairs, each accompanied by fine-grained annotations, providing a deeper understanding of multimodal metaphors beyond a single cultural domain. Additionally, we propose Sentiment-Enriched Metaphor Detection (SEMD), a baseline model that integrates sentiment embeddings to enhance metaphor comprehension across cultural backgrounds. Experimental results validate the effectiveness of SEMD on metaphor detection and sentiment analysis tasks. We hope this work increases awareness of cultural bias in NLP research and contributes to the development of fairer and more inclusive language models. Our dataset and code are available at https://github.com/DUTIR-YSQ/MultiMM.



Article 718

Title@2025-06-08 (7): ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation

Title: ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation ChartGalaxy: Ein Datensatz für Infografik Chart Verstehen und Generieren 图表银河:用于了解和生成信息图表的数据集 2505.18668v3

Authors (12): Zhen Li, Duan Li, Yukai Guo, Xinyuan Guo, Bowen Li, Lanxi Xiao, Shenyu Qiao, Jiashu Chen, Zijian Wu, Hui Zhang, Xinhuan Shu, Shixia Liu

Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 330 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By capturing the visual and structural complexity of real design, ChartGalaxy provides a useful resource for enhancing multimodal reasoning and generation in LVLMs.



Article 719

Title@2025-06-08 (7): Chain of Methodologies: Scaling Test Time Computation without Training

Title: Chain of Methodologies: Scaling Test Time Computation without Training Kette der Methoden: Skalierung Testzeitberechnung ohne Training 方法链:不经过培训的缩放测试时间计算 2506.06982v1

Authors (6): Cong Liu, Jie Wu, Weigang Wu, Xu Chen, Liang Lin, Wei-Shi Zheng

Large Language Models (LLMs) often struggle with complex reasoning tasks because their training data lacks in-depth insights, which are typically absent from publicly available documents. This paper introduces the Chain of Methodologies (CoM), an innovative and intuitive prompting framework that enhances structured thinking by integrating human methodological insights, enabling LLMs to tackle complex tasks with extended reasoning. CoM leverages the metacognitive abilities of advanced LLMs, activating systematic reasoning through user-defined methodologies without explicit fine-tuning. Experiments show that CoM surpasses competitive baselines, demonstrating the potential of training-free prompting methods as robust solutions for complex reasoning tasks and bridging the gap toward human-level reasoning through human-like methodological insights.



Article 720

Title@2025-06-08 (7): Atomic Reasoning for Scientific Table Claim Verification

Title: Atomic Reasoning for Scientific Table Claim Verification Atomische Begründung für die Überprüfung der wissenschaftlichen Tabelle 科学表格索赔核实的原子理由 2506.06972v1

Authors (10): Yuji Zhang, Qingyun Wang, Cheng Qian, Jiateng Liu, Chenkai Sun, Denghui Zhang, Tarek Abdelzaher, Chengxiang Zhai, Preslav Nakov, Heng Ji

Scientific texts often convey authority due to their technical language and complex data. However, this complexity can sometimes lead to the spread of misinformation. Non-experts are particularly susceptible to misleading claims based on scientific tables due to their high information density and perceived credibility. Existing table claim verification models, including state-of-the-art large language models (LLMs), often struggle with precise fine-grained reasoning, resulting in errors and a lack of precision in verifying scientific claims. Inspired by Cognitive Load Theory, we propose that enhancing a model’s ability to interpret table-based claims involves reducing cognitive load by developing modular, reusable reasoning components (i.e., atomic skills). We introduce a skill-chaining schema that dynamically composes these skills to facilitate more accurate and generalizable reasoning with a reduced cognitive load. To evaluate this, we create SciAtomicBench, a cross-domain benchmark with fine-grained reasoning annotations. With only 350 fine-tuning examples, our model trained by atomic reasoning outperforms GPT-4o’s chain-of-thought method, achieving state-of-the-art results with far less training data.



Article 721

Title@2025-06-08 (7): A dependently-typed calculus of event telicity and culminativity

Title: A dependently-typed calculus of event telicity and culminativity Ein abhängig typisiertes Kalkül von Ereignistelizität und Kulminivität 一种依附型型的事件的微积分, 具有感触力和上流性 2506.06968v1

Authors (2): Pavel Kovalev, Carlo Angiuli

We present a dependently-typed cross-linguistic framework for analyzing the telicity and culminativity of events, accompanied by examples of using our framework to model English sentences. Our framework consists of two parts. In the nominal domain, we model the boundedness of noun phrases and its relationship to subtyping, delimited quantities, and adjectival modification. In the verbal domain we define a dependent event calculus, modeling telic events as those whose undergoer is bounded, culminating events as telic events that achieve their inherent endpoint, and consider adverbial modification. In both domains we pay particular attention to associated entailments. Our framework is defined as an extension of intensional Martin-Löf dependent type theory, and the rules and examples in this paper have been formalized in the Agda proof assistant.



Article 722

Title@2025-06-08 (7): Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning

Title: Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning Lernen, durch verstärktes Lernen durch reward-gewichtete Feintuning zu klären 学习通过通过奖励-加权微调加强学习来澄清问题 2506.06964v1

Authors (9): Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon, Trung Bui, Anup Rao, Jayakumar Subramanian, Branislav Kveton

Question answering (QA) agents automatically answer questions posed in natural language. In this work, we learn to ask clarifying questions in QA agents. The key idea in our method is to simulate conversations that contain clarifying questions and learn from them using reinforcement learning (RL). To make RL practical, we propose and analyze offline RL objectives that can be viewed as reward-weighted supervised fine-tuning (SFT) and easily optimized in large language models. Our work stands in stark contrast to recently proposed methods, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize rewards. We compare to these methods empirically and report gains in both optimized rewards and language quality.
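The idea of viewing an offline RL objective as reward-weighted SFT can be illustrated with a generic sketch: each simulated conversation contributes its negative log-likelihood, scaled by its reward. The abstract does not give the exact objectives, so the function below and its inputs are assumptions for illustration only.

```python
def reward_weighted_nll(token_logprobs, rewards):
    """Mean over conversations of reward * negative log-likelihood.

    token_logprobs: one list of per-token log-probabilities per conversation.
    rewards: one non-negative scalar reward per conversation.
    """
    if len(token_logprobs) != len(rewards):
        raise ValueError("need exactly one reward per conversation")
    total = 0.0
    for logprobs, reward in zip(token_logprobs, rewards):
        # Plain SFT would use -sum(logprobs); the reward rescales that term.
        total += -reward * sum(logprobs)
    return total / len(rewards)


# Hypothetical values: the second conversation has zero reward,
# so it contributes nothing to the loss.
loss = reward_weighted_nll([[-0.5, -0.5], [-1.0]], [1.0, 0.0])
print(loss)
```

With all rewards equal to 1, this reduces to ordinary SFT loss, which is why such an objective needs no optimizer machinery beyond standard fine-tuning.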



Article 723

Title@2025-06-08 (7): ViMMRC 2.0 – Enhancing Machine Reading Comprehension on Vietnamese Literature Text

Title: ViMMRC 2.0 – Enhancing Machine Reading Comprehension on Vietnamese Literature Text ViMMRC 2.0 – Verbesserung des Leseverständnisses in vietnamesischer Literatur Text ViMMRC 2.0 – 加强对越南文学文本的机器阅读理解 2303.18162v2

Authors (5): Son T. Luu, Khoi Trong Hoang, Tuong Quang Pham, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Machine reading comprehension has been an interesting and challenging task in recent years, with the purpose of extracting useful information from texts. To enable computers to understand a reading text and answer related questions, we introduce ViMMRC 2.0, an extension of the previous ViMMRC for the task of multiple-choice reading comprehension on Vietnamese textbooks, which contain reading articles for students from Grade 1 to Grade 12. This dataset has 699 reading passages, comprising prose and poems, and 5,273 questions. The questions in the new dataset are not fixed to four options as in the previous version. Moreover, the difficulty of questions is increased, which challenges the models to find the correct choice. The computer must understand the whole context of the reading passage, the question, and the content of each choice to extract the right answers. Hence, we propose a multi-stage approach that combines the multi-step attention network (MAN) with the natural language inference (NLI) task to enhance the performance of the reading comprehension model. Then, we compare the proposed methodology with the baseline BERTology models on the new dataset and the ViMMRC 1.0. From the results of the error analysis, we found that the challenge for reading comprehension models is understanding the implicit context in texts and linking the pieces together in order to find the correct answers. Finally, we hope our new dataset will motivate further research to enhance the ability of computers to understand the Vietnamese language.



Article 724

Title@2025-06-08 (7): BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning

Title: BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning BIS Reasoning 1.0: Der erste großformatige japanische Benchmark für glaubens-inkonsistente syllogistische Reasoning BIS 理由1.0:日本第一个大尺度的信仰不一致三段论推理基准 2506.06955v1

Authors (13): Ha-Thanh Nguyen, Chaoran Liu, Hirokazu Kiyomaru, Koichi Takeda, Yusuke Miyao, Maki Matsuda, Yusuke Oda, Pontus Stenetorp, Qianying Liu, Su Myat Noe, Hideyuki Tachibana, Kouta Nakayama, Sadao Kurohashi

We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.



Article 725

Title@2025-06-07 (6): What Makes a Good Natural Language Prompt?

Title: What Makes a Good Natural Language Prompt? Was macht eine gute natürliche Sprache prompt? 是什么让好自然语言迅速? 2506.06950v1

Authors (7): Do Xuan Long, Duy Dinh, Ngoc-Hai Nguyen, Kenji Kawaguchi, Nancy F. Chen, Shafiq Joty, Min-Yen Kan

As large language models (LLMs) have progressed towards more human-like behavior and human–AI communications have become prevalent, prompting has emerged as a decisive component. However, there is limited conceptual consensus on what exactly quantifies natural language prompts. We attempt to address this question by conducting a meta-analysis surveying more than 150 prompting-related papers from leading NLP and AI conferences from 2022 to 2025 and blogs. We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions. We then examine how existing studies assess their impact on LLMs, revealing their imbalanced support across models and tasks, and substantial research gaps. Further, we analyze correlations among properties in high-quality natural language prompts, deriving prompting recommendations. We then empirically explore multi-property prompt enhancements in reasoning tasks, observing that single-property enhancements often have the greatest impact. Finally, we discover that instruction-tuning on property-enhanced prompts can result in better reasoning models. Our findings establish a foundation for property-centric prompt evaluation and optimization, bridging gaps in human–AI communication and opening new prompting research directions.



Article 726

Title@2025-06-07 (6): PlanGenLLMs: A Modern Survey of LLM Planning Capabilities

Title: PlanGenLLMs: A Modern Survey of LLM Planning Capabilities PlanGenLLMs: Eine moderne Erhebung der LLM-Planungskapazitäten PlanGenLLMs:LLM规划能力现代调查 2502.11221v2

Authors (6): Hui Wei, Zihao Zhang, Shenghua He, Tian Xia, Shijia Pan, Fei Liu

LLMs have immense potential for generating plans, transforming an initial world state into a desired goal state. A large body of research has explored the use of LLMs for various planning tasks, from web navigation to travel planning and database querying. However, many of these systems are tailored to specific problems, making it challenging to compare them or determine the best approach for new tasks. There is also a lack of clear and consistent evaluation criteria. Our survey aims to offer a comprehensive overview of current LLM planners to fill this gap. It builds on foundational work by Kartam and Wilkins (1990) and examines six key performance criteria: completeness, executability, optimality, representation, generalization, and efficiency. For each, we provide a thorough analysis of representative works and highlight their strengths and weaknesses. Our paper also identifies crucial future directions, making it a valuable resource for both practitioners and newcomers interested in leveraging LLM planning to support agentic workflows.



Article 727

Title@2025-06-07 (6): The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Title: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity Die Illusion des Denkens: Die Stärken und Grenzen von Vernunftmodellen über das Lens of Problem Complexity verstehen 思考的幻觉:通过问题复杂焦点了解理性模型的长处和局限性 2506.06941v1

Authors (6): Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar

Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite a remaining token budget. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths and limitations and raising questions about their reasoning capabilities.



Article 728

Title@2025-06-07 (6): Mechanistic evaluation of Transformers and state space models

Title: Mechanistic evaluation of Transformers and state space models Mechanistische Auswertung von Transformern und State Space Models 对变压器和国家空间模型的机械评估 2505.15105v2

Authors (6): Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csordás, Dan Jurafsky, Christopher Potts

State space models (SSMs) for language modelling promise an efficient and performant alternative to quadratic-attention Transformers, yet show variable performance on recalling basic information from the context. While performance on synthetic tasks like Associative Recall (AR) can point to this deficiency, behavioural metrics provide little information as to why, on a mechanistic level, certain architectures fail and others succeed. To address this, we conduct experiments on AR and find that only Transformers and Based SSM models fully succeed at AR, with Mamba a close third, whereas the other SSMs (H3, Hyena) fail. We then use causal interventions to explain why. We find that Transformers and Based learn to store key-value associations in-context using induction heads. By contrast, the SSMs compute these associations only at the last state, with only Mamba succeeding because of its short convolution component. To extend and deepen these findings, we introduce Associative Treecall (ATR), a synthetic task similar to AR based on PCFG induction. ATR introduces language-like hierarchical structure into the AR setting. We find that all architectures learn the same mechanism as they did for AR, and the same three models succeed at the task. These results reveal that architectures with similar accuracy may still have substantive differences, motivating the adoption of mechanistic evaluations.
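For readers unfamiliar with Associative Recall, a minimal sketch of one common way to build such an example is given here: a context of key-value token pairs followed by a query key, with the paired value as the target. The token layout, vocabulary split, and function name are assumptions for illustration; the abstract does not specify the paper's exact task format.

```python
import random


def make_ar_example(num_pairs, vocab_size, seed=0):
    """Build one Associative Recall example.

    Keys are drawn without repetition from [0, vocab_size); values are drawn
    from [vocab_size, 2 * vocab_size), so keys and values never collide.
    Returns (sequence, target), where sequence = k1 v1 k2 v2 ... query.
    """
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), num_pairs)
    values = [rng.randrange(vocab_size, 2 * vocab_size) for _ in keys]
    context = [token for pair in zip(keys, values) for token in pair]
    query = rng.choice(keys)
    target = values[keys.index(query)]
    return context + [query], target


sequence, target = make_ar_example(num_pairs=4, vocab_size=10)
print(sequence, "->", target)
```

A model solves the example by finding the earlier occurrence of the query key and emitting the token that followed it.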



Article 729

Title@2025-06-07 (6): Scaling over Scaling: Exploring Test-Time Scaling Plateau in Large Reasoning Models

Title: Scaling over Scaling: Exploring Test-Time Scaling Plateau in Large Reasoning Models Skalierung über Skalierung: Untersuchung des Testzeit-Skalierungs-Plateau in Großmodellen 缩放超过缩放: 探索大型理由模型中的试验时间缩放高台 2505.20522v2

Authors (5): Jian Wang, Boyan Zhu, Chak Tou Leong, Yongqi Li, Wenjie Li

Large reasoning models (LRMs) have exhibited the capacity to enhance reasoning performance via internal test-time scaling. Building upon this, a promising direction is to further scale test-time compute to unlock even greater reasoning capabilities. However, as we push these scaling boundaries, systematically understanding the practical limits and achieving optimal resource allocation become critical challenges. In this paper, we investigate the scaling plateau of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM). We theoretically analyze two fundamental paradigms for such extended scaling, parallel scaling and sequential scaling, from a probabilistic modeling perspective. Our primary contribution is the derivation of the saturation point on the scaling budget for both strategies, identifying thresholds beyond which additional computation yields diminishing returns. Remarkably, despite their distinct mechanisms, both paradigms converge to a unified mathematical structure in their upper bounds. We empirically validate our theoretical findings on challenging reasoning benchmarks, including AIME, MATH-500, and GPQA, demonstrating the practical utility of these bounds for test-time resource allocation. We hope that this work provides insights into the cost-benefit trade-offs of test-time scaling, guiding the development of more resource-efficient inference strategies for large reasoning models.
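The notion of a saturation point can be illustrated with a textbook model of parallel scaling, which is an assumption here and not the paper's TTSPM: if each of N independent samples succeeds with probability p, at least one succeeds with probability 1 - (1 - p)^N, and the gain from one more sample shrinks geometrically. The threshold `eps` and the function names are hypothetical.

```python
def parallel_success(p, n):
    """Probability that at least one of n independent samples succeeds."""
    return 1.0 - (1.0 - p) ** n


def saturation_budget(p, eps):
    """Smallest n at which adding one more sample gains less than eps."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    n = 1
    while parallel_success(p, n + 1) - parallel_success(p, n) >= eps:
        n += 1
    return n


# With p = 0.5, the gain from an extra sample halves at every step.
for budget in (1, 2, 4, 8):
    print(budget, parallel_success(0.5, budget))
print("saturation budget:", saturation_budget(0.5, 0.01))
```

Under this simple model, a lower per-sample success rate pushes the saturation point to a larger budget, which is the kind of trade-off a resource allocation rule has to weigh.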



Article 730

Title@2025-06-07 (6): DiscoSum: Discourse-aware News Summarization

Title: DiscoSum: Discourse-aware News Summarization DiscoSum: Diskurs-bewusste Nachrichten Zusammenfassung DiscoSum: 深知新闻摘要 2506.06930v1

Authors (5): Alexander Spangher, Tenghao Huang, Jialiang Gu, Jiatong Shi, Muhao Chen

Recent advances in text summarization have predominantly leveraged large language models to generate concise summaries. However, language models often do not maintain long-term discourse structure, especially in news articles, where organizational flow significantly influences reader engagement. We introduce a novel approach to integrating discourse structure into summarization processes, focusing specifically on news articles across various media. We present a novel summarization dataset where news articles are summarized multiple times in different ways across different social media platforms (e.g., LinkedIn, Facebook). We develop a novel news discourse schema to describe summarization structures and a novel algorithm, DiscoSum, which employs a beam search technique for structure-aware summarization, enabling the transformation of news stories to meet different stylistic and structural demands. Both human and automatic evaluation results demonstrate the efficacy of our approach in maintaining narrative fidelity and meeting structural requirements.



Article 731

Title@2025-06-07 (6): BLEUBERI: BLEU is a surprisingly effective reward for instruction following

Title: BLEUBERI: BLEU is a surprisingly effective reward for instruction following BLEUBERI: BLEU ist eine überraschend effektive Belohnung für den Unterricht nach BLEU是一种令人惊讶的有效奖赏 2505.11080v2

Authors (7): Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer

Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at https://github.com/lilakk/BLEUBERI.
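The idea of using BLEU directly as a reward can be sketched as follows. This is a simplified sentence-level BLEU with whitespace tokenization and add-one smoothing on every n-gram order, followed by the group-wise reward normalization commonly used in GRPO. Both functions are illustrative assumptions and not the released BLEUBERI code.

```python
import math
import statistics
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list (empty if the list is too short)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_reward(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU in [0, 1], used as a scalar reward."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1))  # add-one smoothing
    # Brevity penalty discourages candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision / max_n)


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


reference = "the cat sat on the mat"
samples = ["the cat sat on a mat", "a dog ran", "the cat sat on the mat"]
rewards = [bleu_reward(s, reference) for s in samples]
print(rewards)
print(group_advantages(rewards))
```

Responses that overlap more with the reference receive positive advantages within the group, so the policy update favors them without any learned reward model.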



Article 732

Title@2025-06-07 (6): On Path to Multimodal Historical Reasoning: HistBench and HistAgent

Title: On Path to Multimodal Historical Reasoning: HistBench and HistAgent Auf dem Weg zu multimodaler historischer Vernunft: HistBench und HistAgent 通向多式联运历史原因原因之路:历史时尚与历史代理人 2505.20246v2

Authors (98): Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang, Kaixuan Huang, Xun Jiang, Yuming Cao, Yue Chen, Yunfei Chen, Zhengyi Chen, Ruowei Dai, Mengqiu Deng, Jiye Fu, Yunting Gu, Zijie Guan, Zirui Huang, Xiaoyan Ji, Yumeng Jiang, Delong Kong, Haolong Li, Jiaqi Li, Ruipeng Li, Tianze Li, Zhuoran Li, Haixia Lian, Mengyue Lin, Xudong Liu, Jiayi Lu, Jinghan Lu, Wanyu Luo, Ziyue Luo, Zihao Pu, Zhi Qiao, Ruihuan Ren, Liang Wan, Ruixiang Wang, Tianhui Wang, Yang Wang, Zeyu Wang, Zihua Wang, Yujia Wu, Zhaoyi Wu, Hao Xin, Weiao Xing, Ruojun Xiong, Weijie Xu, Yao Shu, Yao Xiao, Xiaorui Yang, Yuchen Yang, Nan Yi, Jiadong Yu, Yangyuxuan Yu, Huiting Zeng, Danni Zhang, Yunjie Zhang, Zhaoyu Zhang, Zhiheng Zhang, Xiaofeng Zheng, Peirong Zhou, Linyan Zhong, Xiaoyin Zong, Ying Zhao, Zhenxin Chen, Lin Ding, Xiaoyu Gao, Bingbing Gong, Yichao Li, Yang Liao, Guang Ma, Tianyuan Ma, Xinrui Sun, Tianyi Wang, Han Xia, Ruobing Xian, Gen Ye, Tengfei Yu, Wentao Zhang, Yuxi Wang, Xi Gao, Mengdi Wang

Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI’s capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems, from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Having found that LLMs and other agents perform poorly on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%), and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.



Article 733

Title@2025-06-07 (6): Hybrid Extractive Abstractive Summarization for Multilingual Sentiment Analysis

Title: Hybrid Extractive Abstractive Summarization for Multilingual Sentiment Analysis Hybride extrahierende abstrakte Zusammenfassung für Mehrsprachige Sentimentanalyse 多种语文感化分析的抽取性摘要 2506.06929v1

Authors (5): Mikhail Krasitskii, Grigori Sidorov, Olga Kolesnikova, Liliana Chanona Hernandez, Alexander Gelbukh

We propose a hybrid approach for multilingual sentiment analysis that combines extractive and abstractive summarization to address the limitations of standalone methods. The model integrates TF-IDF-based extraction with a fine-tuned XLM-R abstractive module, enhanced by dynamic thresholding and cultural adaptation. Experiments across 10 languages show significant improvements over baselines, achieving 0.90 accuracy for English and 0.84 for low-resource languages. The approach also demonstrates 22% greater computational efficiency than traditional methods. Practical applications include real-time brand monitoring and cross-cultural discourse analysis. Future work will focus on optimization for low-resource languages via 8-bit quantization.



Article 734

Title@2025-06-07 (6): EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments

Title: EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments EconEvals: Benchmarks und Litmus-Tests für LLM-Agenten in unbekannten Umgebungen Econevals:未知环境中LLM代理物的测标和Litmus测试 2503.18825v2

Authors (5): Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski

We develop benchmarks for LLM agents that act in, learn from, and strategize in unknown environments, the specifications of which the LLM agent must learn over time from deliberate exploration. Our benchmarks consist of decision-making tasks derived from key problems in economics. To forestall saturation, the benchmark tasks are synthetically generated with scalable difficulty levels. Additionally, we propose litmus tests, a new kind of quantitative measure for LLMs and LLM agents. Unlike benchmarks, litmus tests quantify differences in character, values, and tendencies of LLMs and LLM agents, by considering their behavior when faced with tradeoffs (e.g., efficiency versus equality) where there is no objectively right or wrong behavior. Overall, our benchmarks and litmus tests assess the abilities and tendencies of LLM agents in tackling complex economic problems in diverse settings spanning procurement, scheduling, task allocation, and pricing – applications that should grow in importance as such agents are further integrated into the economy.



Article 735

Title@2025-06-07 (6): When Incentives Backfire, Data Stops Being Human

Title: When Incentives Backfire, Data Stops Being Human Wenn Incentives Backfire, Daten aufhören, Mensch zu sein 当鼓励反火时,数据停止成为人类 2502.07732v2

Authors (5): Sebastin Santy, Prasanta Bhattacharya, Manoel Horta Ribeiro, Kelsey Allen, Sewoong Oh

Progress in AI has relied on human-generated data, from annotator marketplaces to the wider Internet. However, the widespread use of large language models now threatens the quality and integrity of human-generated data on these very platforms. We argue that this issue goes beyond the immediate challenge of filtering AI-generated content – it reveals deeper flaws in how data collection systems are designed. Existing systems often prioritize speed, scale, and efficiency at the cost of intrinsic human motivation, leading to declining engagement and data quality. We propose that rethinking data collection systems to align with contributors’ intrinsic motivations – rather than relying solely on external incentives – can help sustain high-quality data sourcing at scale while maintaining contributor trust and long-term participation.



Article 736

Title@2025-06-07 (6): Robustifying Vision-Language Models via Dynamic Token Reweighting

Title: Robustifying Vision-Language Models via Dynamic Token Reweighting Beschleunigen von Vision-Sprachenmodellen durch Dynamic Token Reweighting 通过动态调控重标加权强化视觉语言模型 2505.17132v2

Authors (6): Tanqiu Jiang, Jiacheng Liang, Rongyi Zhu, Jiawei Zhou, Fenglong Ma, Ting Wang

Large vision-language models (VLMs) are highly vulnerable to jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails. In this paper, we present DTR, a novel inference-time defense that mitigates multimodal jailbreak attacks by optimizing the model’s key-value (KV) caches. Rather than relying on curated safety-specific data or costly image-to-text conversion, we introduce a new formulation of the safety-relevant distributional shift induced by the visual modality. This formulation enables DTR to dynamically adjust visual token weights, minimizing the impact of adversarial visual inputs while preserving the model’s general capabilities and inference efficiency. Extensive evaluation across diverse VLMs and attack benchmarks demonstrates that DTR outperforms existing defenses in both attack robustness and benign task performance, marking the first successful application of KV cache optimization for safety enhancement in multimodal foundation models. (Warning: this paper contains potentially harmful content generated by VLMs.)



Article 737

Title@2025-06-07 (6): TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training

Title: TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training TorchTitan: Ein-Stop PyTorch native Lösung für produktionsfertiges LLM-Vortraining 火炬泰坦:当地一站式PyToirch当地生产即成LLM培训前的一站式PyTorch本地解决方案 2410.06511v3

Authors (13): Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos

The development of large language models (LLMs) has been instrumental in advancing state-of-the-art natural language processing applications. Training LLMs with billions of parameters and trillions of tokens requires sophisticated distributed systems that enable composing and comparing several state-of-the-art techniques in order to efficiently scale across thousands of accelerators. However, existing solutions are complex, scattered across multiple libraries/repositories, lack interoperability, and are cumbersome to maintain. Thus, curating and empirically comparing training recipes require non-trivial engineering effort. This paper introduces TorchTitan, an open-source, PyTorch-native distributed training system that unifies state-of-the-art techniques, streamlining integration and reducing overhead. TorchTitan enables 3D parallelism in a modular manner with elastic scaling, providing comprehensive logging, checkpointing, and debugging tools for production-ready training. It also incorporates hardware-software co-designed solutions, leveraging features like Float8 training and SymmetricMemory. As a flexible test bed, TorchTitan facilitates custom recipe curation and comparison, allowing us to develop optimized training recipes for Llama 3.1 and provide guidance on selecting techniques for maximum efficiency based on our experiences. We thoroughly assess TorchTitan on the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its exceptional performance, modular composability, and elastic scalability. By stacking training optimizations, we demonstrate accelerations of 65.08% with 1D parallelism at the 128-GPU scale (Llama 3.1 8B), an additional 12.59% with 2D parallelism at the 256-GPU scale (Llama 3.1 70B), and an additional 30% with 3D parallelism at the 512-GPU scale (Llama 3.1 405B) on NVIDIA H100 GPUs over optimized baselines.



Article 738

Title@2025-06-07 (6): Modality-Specialized Synergizers for Interleaved Vision-Language Generalists

Title: Modality-Specialized Synergizers for Interleaved Vision-Language Generalists Modalitätsspezialisierte Synergizer für interleaved Vision-Language Generalists 外离视觉语言通用通识主义者模式-专门化同步器 2407.03604v2

Authors (8): Zhiyang Xu, Minqian Liu, Ying Shen, Joy Rimchala, Jiaxin Zhang, Qifan Wang, Yu Cheng, Lifu Huang

Recent advancements in Vision-Language Models (VLMs) have led to the emergence of Vision-Language Generalists (VLGs) capable of understanding and generating both text and images. However, seamlessly generating an arbitrary sequence of text and images remains a challenging task for the current VLGs. One primary limitation lies in applying a unified architecture and the same set of parameters to simultaneously model discrete text tokens and continuous image features. Recent works attempt to tackle this fundamental problem by introducing modality-aware expert models. However, they employ identical architectures to process both text and images, disregarding the intrinsic inductive biases in these two modalities. In this work, we introduce MODALITY-SPECIALIZED SYNERGIZERS (MOSS), a novel design that efficiently optimizes existing unified architectures of VLGs with modality-specialized adaptation layers, i.e., a Convolutional LoRA for modeling the local priors of image patches and a Linear LoRA for processing sequential text. This design enables more effective modeling of modality-specific features while maintaining the strong cross-modal integration gained from pretraining. In addition, to improve the instruction-following capability on interleaved text-and-image generation, we introduce LEAFINSTRUCT, the first open-sourced interleaved instruction tuning dataset comprising 184,982 high-quality instances on more than 10 diverse domains. Extensive experiments show that VLGs integrated with MOSS achieve state-of-the-art performance, significantly surpassing baseline VLGs in complex interleaved generation tasks. Furthermore, our method exhibits strong generalizability on different VLGs.



Article 739

Title@2025-06-07 (6): LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

Title: LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models LLM-SRBench: Ein neuer Benchmark für die Entdeckung wissenschaftlicher Gleichungen mit großen Sprachmodellen LLM-SRBench:采用大语言模式科学等同发现新基准 2504.10415v2

Authors (6): Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, Chandan K Reddy

Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.



Article 740

Title@2025-06-07 (6): Automatic Speech Recognition of African American English: Lexical and Contextual Effects

Title: Automatic Speech Recognition of African American English: Lexical and Contextual Effects Automatische Spracherkennung des African American English: Lexikalische und Kontexteffekte 非裔英语的自动语音识别:词汇和背景效果 2506.06888v1

Authors (2): Hamid Mojarad, Kevin Tang

Automatic Speech Recognition (ASR) models often struggle with the phonetic, phonological, and morphosyntactic features found in African American English (AAE). This study focuses on two key AAE variables: Consonant Cluster Reduction (CCR) and ING-reduction. It examines whether the presence of CCR and ING-reduction increases ASR misrecognition. Subsequently, it investigates whether end-to-end ASR systems without an external Language Model (LM) are more influenced by lexical neighborhood effect and less by contextual predictability compared to systems with an LM. The Corpus of Regional African American Language (CORAAL) was transcribed using wav2vec 2.0 with and without an LM. CCR and ING-reduction were detected using the Montreal Forced Aligner (MFA) with pronunciation expansion. The analysis reveals a small but significant effect of CCR and ING on Word Error Rate (WER) and indicates a stronger presence of lexical neighborhood effect in ASR systems without LMs.
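
For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the study's evaluation code; the function name and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words.
print(word_error_rate("he was walking home", "he was walkin home"))
```

Because the score is normalized by the reference length, a single misrecognized word weighs more heavily in short utterances than in long ones.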



Article 741

Title@2025-06-07 (6): Mixture of Small and Large Models for Chinese Spelling Check

Title: Mixture of Small and Large Models for Chinese Spelling Check Mischung aus kleinen und großen Modellen für chinesische Rechtschreibprüfung 用于中文拼写检查的小型和大型模型混合体 2506.06887v1

Authors (3): Ziheng Qiao, Houquan Zhou, Zhenghua Li

In the era of large language models (LLMs), the Chinese Spelling Check (CSC) task has seen various LLM methods developed, yet their performance remains unsatisfactory. In contrast, fine-tuned BERT-based models, relying on high-quality in-domain data, show excellent performance but suffer from edit pattern overfitting. This paper proposes a novel dynamic mixture approach that effectively combines the probability distributions of small models and LLMs during the beam search decoding phase, achieving a balanced enhancement of precise corrections from small models and the fluency of LLMs. This approach also eliminates the need for fine-tuning LLMs, saving significant time and resources, and facilitating domain adaptation. Comprehensive experiments demonstrate that our mixture approach significantly boosts error correction capabilities, achieving state-of-the-art results across multiple datasets. Our code is available at https://github.com/zhqiao-nlp/MSLLM.
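
As a rough illustration of what combining two models' probability distributions inside beam search can look like, the sketch below interpolates the next-token distributions of a small model and an LLM at each step. It rests on assumptions: the abstract describes a dynamic mixture, whereas this sketch uses a fixed `weight`, and `mix`, `beam_step`, and the callable model interface are hypothetical names, not taken from the MSLLM repository.

```python
import math
from typing import Callable, Dict, List, Tuple

Dist = Dict[str, float]                    # token -> probability
Model = Callable[[Tuple[str, ...]], Dist]  # prefix -> next-token distribution
Beam = Tuple[Tuple[str, ...], float]       # (prefix, cumulative log-probability)


def mix(small: Dist, large: Dist, weight: float) -> Dist:
    """Linear interpolation; weight is the share given to the small model."""
    tokens = set(small) | set(large)
    return {t: weight * small.get(t, 0.0) + (1.0 - weight) * large.get(t, 0.0)
            for t in tokens}


def beam_step(beams: List[Beam], small_model: Model, large_model: Model,
              weight: float, beam_size: int) -> List[Beam]:
    """Extend every beam by one token, scored under the mixed distribution."""
    candidates: List[Beam] = []
    for prefix, score in beams:
        mixed = mix(small_model(prefix), large_model(prefix), weight)
        for token, prob in mixed.items():
            if prob > 0.0:
                candidates.append((prefix + (token,), score + math.log(prob)))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```

A token proposed by only one of the two models can still survive in the beam, since its mixed probability stays positive whenever the weight lies strictly between 0 and 1.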



Article 742

Title@2025-06-07 (6): DISC: Plug-and-Play Decoding Intervention with Similarity of Characters for Chinese Spelling Check

Title: DISC: Plug-and-Play Decoding Intervention with Similarity of Characters for Chinese Spelling Check DISC: Plug-and-Play-Decoding-Intervention mit Ähnlichkeit von Zeichen für chinesische Rechtschreibprüfung DISC: 用于中文拼写检查的插件和打字解码干预,其字符相似 2412.12863v2

Authors (9): Ziheng Qiao, Houquan Zhou, Yumeng Liu, Zhenghua Li, Min Zhang, Bo Zhang, Chen Li, Ji Zhang, Fei Huang

One key characteristic of the Chinese spelling check (CSC) task is that incorrect characters are usually similar to the correct ones in either phonetics or glyph. To accommodate this, previous works usually leverage confusion sets, which suffer from two problems, i.e., difficulty in determining which character pairs to include and lack of probabilities to distinguish items in the set. In this paper, we propose a light-weight plug-and-play DISC (i.e., decoding intervention with similarity of characters) module for CSC models. DISC measures phonetic and glyph similarities between characters and incorporates this similarity information only during the inference phase. This method can be easily integrated into various existing CSC models, such as ReaLiSe, SCOPE, and ReLM, without additional training costs. Experiments on three CSC benchmarks demonstrate that our proposed method significantly improves model performance, approaching and even surpassing the current state-of-the-art models.
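
To make the idea of an inference-only intervention concrete, here is a minimal sketch in which a similarity bonus is added to a model's per-character log-probabilities before the best candidate is chosen. The additive form, the `alpha` scale, and the `similarity` callable are assumptions made for illustration; the abstract does not state how DISC actually combines the two signals.

```python
from typing import Callable, Dict


def intervene(log_probs: Dict[str, float], input_char: str,
              similarity: Callable[[str, str], float],
              alpha: float) -> Dict[str, float]:
    """Add alpha * similarity(input_char, candidate) to each candidate's score."""
    return {cand: lp + alpha * similarity(input_char, cand)
            for cand, lp in log_probs.items()}


def pick(log_probs: Dict[str, float], input_char: str,
         similarity: Callable[[str, str], float], alpha: float = 1.0) -> str:
    """Return the candidate with the highest adjusted score."""
    adjusted = intervene(log_probs, input_char, similarity, alpha)
    return max(adjusted, key=adjusted.get)
```

Setting `alpha` to 0 recovers the unmodified model, which is one way to see why such a module needs no additional training: the model's scores are only re-ranked at decoding time.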



Article 743

Title@2025-06-07 (6): Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning

Title: Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning Recht ist nicht genug: Die Pitfalls von Outcome Supervision in der Ausbildung LLMs for Math Reasoning 权利还是不够的:数学原因培训优等生培训成果监督的空洞 2506.06877v1

Authors (7): Jiaxing Guo, Wenjie Yang, Shengzhong Zhang, Tongshan Xu, Lun Du, Da Zheng, Zengfeng Huang

Outcome-rewarded Large Language Models (LLMs) have demonstrated remarkable success in mathematical problem-solving. However, this success often masks a critical issue: models frequently achieve correct answers through fundamentally unsound reasoning processes, a phenomenon indicative of reward hacking. We introduce MathOlympiadEval, a new dataset with fine-grained annotations, which reveals a significant gap between LLMs’ answer correctness and their low process correctness. Existing automated methods like LLM-as-a-judge struggle to reliably detect these reasoning flaws. To address this, we propose ParaStepVerifier, a novel methodology for meticulous, step-by-step verification of mathematical solutions. ParaStepVerifier identifies incorrect reasoning steps. Empirical results demonstrate that ParaStepVerifier substantially improves the accuracy of identifying flawed solutions compared to baselines, especially for complex, multi-step problems. This offers a more robust path towards evaluating and training LLMs with genuine mathematical reasoning.



Article 744

Title@2025-06-07 (6): Watermark under Fire: A Robustness Evaluation of LLM Watermarking

Title: Watermark under Fire: A Robustness Evaluation of LLM Watermarking Wasserzeichen unter Feuer: Eine Robustheitsbewertung von LLM Watermarking 火灾下的水印:对LLM水标志的强力评价 2411.13425v3

Authors (5): Jiacheng Liang, Zian Wang, Lauren Hong, Shouling Ji, Ting Wang

Various watermarking methods (“watermarkers”) have been proposed to identify LLM-generated texts; yet, due to the lack of unified evaluation platforms, many critical questions remain under-explored: i) What are the strengths/limitations of various watermarkers, especially their attack robustness? ii) How do various design choices impact their robustness? iii) How to optimally operate watermarkers in adversarial environments? To fill this gap, we systematize existing LLM watermarkers and watermark removal attacks, mapping out their design spaces. We then develop WaterPark, a unified platform that integrates 10 state-of-the-art watermarkers and 12 representative attacks. More importantly, by leveraging WaterPark, we conduct a comprehensive assessment of existing watermarkers, unveiling the impact of various design choices on their attack robustness. We further explore the best practices to operate watermarkers in adversarial environments. We believe our study sheds light on current LLM watermarking techniques while WaterPark serves as a valuable testbed to facilitate future research.



Article 745

Title@2025-06-07 (6): Taxonomizing Representational Harms using Speech Act Theory

Title: Taxonomizing Representational Harms using Speech Act Theory Taxonomizing Representative Harms mit Sprachgesetz-Theorie 使用演讲法理论对代表性伤害进行分类 2504.00928v2

Authors (12): Emily Corvi, Hannah Washington, Stefanie Reed, Chad Atalla, Alexandra Chouldechova, P. Alex Dow, Jean Garcia-Gathright, Nicholas Pangakis, Emily Sheng, Dan Vann, Matthew Vogel, Hanna Wallach

Representational harms are widely recognized among fairness-related harms caused by generative language systems. However, their definitions are commonly under-specified. We make a theoretical contribution to the specification of representational harms by introducing a framework, grounded in speech act theory (Austin, 1962), that conceptualizes representational harms caused by generative language systems as the perlocutionary effects (i.e., real-world impacts) of particular types of illocutionary acts (i.e., system behaviors). Building on this argument and drawing on relevant literature from linguistic anthropology and sociolinguistics, we provide new definitions of stereotyping, demeaning, and erasure. We then use our framework to develop a granular taxonomy of illocutionary acts that cause representational harms, going beyond the high-level taxonomies presented in previous work. We also discuss the ways that our framework and taxonomy can support the development of valid measurement instruments. Finally, we demonstrate the utility of our framework and taxonomy via a case study that engages with recent conceptual debates about what constitutes a representational harm and how such harms should be measured.



Article 746

Title@2025-06-07 (6): Active Preference Optimization for Sample Efficient RLHF

Title: Active Preference Optimization for Sample Efficient RLHF Aktive Preference-Optimierung für Probeneffizienz RLHF 抽样高效RLHF的主动首选最佳优化 2402.10500v3

Authors (4): Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, Sayak Ray Chowdhury

Large Language Models (LLMs) aligned using Reinforcement Learning from Human Feedback (RLHF) have shown remarkable generation abilities in numerous tasks. However, collecting high-quality human preferences creates costly bottlenecks in practical deployments, and hence, training data are often budgeted. In these scenarios, it is crucial to collect training data (e.g., contexts, a pair of generations for each context, and a preference indicating which generation is better) carefully, yet most of the existing methods sample contexts uniformly at random from a given collection. Given this, under the Bradley-Terry-Luce preference model and with a small budget of training data, we show that uniform sampling of contexts could lead to a policy (i.e., an aligned model) that suffers a constant sub-optimality gap from the optimal policy. This highlights the need for an adaptive context sampling strategy for effective alignment under a small sample budget. To address this, we reformulate RLHF within the contextual preference bandit framework, treating generations as actions, and give a nearly complete characterization of the sub-optimality gap in terms of both lower and upper bounds. First, when the action set is a $d$-dimensional hypercube and the number of samples is $T$, we show an $\Omega(d/\sqrt{T})$ lower bound. Next, we propose an algorithm, Active Preference Optimization (APO), that iteratively collects preferences for the most uncertain contexts. We show that the sub-optimality gap of the policy learned via APO matches the lower bound up to a log factor and a non-linearity constant. Finally, we perform experiments on practical datasets to validate APO’s efficacy over existing methods, establishing it as a sample-efficient and cost-effective solution for LLM alignment.
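
The Bradley-Terry-Luce model mentioned above assigns the probability that one generation is preferred over another as a logistic function of their reward difference. The sketch below shows that formula together with a deliberately simple uncertainty heuristic: query the context whose predicted preference is closest to 0.5. That heuristic is an assumption standing in for the paper's own uncertainty measure, which the abstract does not specify, and both function names are hypothetical.

```python
import math
from typing import Dict, Hashable, Tuple


def btl_preference_prob(reward_a: float, reward_b: float) -> float:
    """P(a preferred over b) = sigmoid(reward_a - reward_b), computed stably."""
    diff = reward_a - reward_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def most_uncertain_context(
        estimated_rewards: Dict[Hashable, Tuple[float, float]]) -> Hashable:
    """Pick the context whose predicted preference is closest to a coin flip.

    estimated_rewards maps a context to the current reward estimates of its
    two candidate generations (illustrative stand-in, not APO's criterion).
    """
    return min(estimated_rewards,
               key=lambda c: abs(btl_preference_prob(*estimated_rewards[c]) - 0.5))
```

Under this heuristic, a context whose two generations have nearly equal estimated rewards is queried before one where the estimated winner is already clear, which is the intuition behind spending a small labeling budget adaptively rather than uniformly.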



Article 747

Title@2025-06-07 (6): Grounded Persuasive Language Generation for Automated Marketing

Title: Grounded Persuasive Language Generation for Automated Marketing Gegründete überzeugende Sprachgenerierung für automatisiertes Marketing 用于自动营销的有源辅助性语言生成 2502.16810v3

Authors (8): Jibang Wu, Chenghao Yang, Simon Mahns, Yi Wu, Chaoqi Wang, Hao Zhu, Fei Fang, Haifeng Xu

This paper develops an agentic framework that employs large language models (LLMs) to automate the generation of persuasive and grounded marketing content, using real estate listing descriptions as our focal application domain. Our method is designed to align the generated content with user preferences while highlighting useful factual attributes. This agent consists of three key modules: (1) Grounding Module, mimicking expert human behavior to predict marketable features; (2) Personalization Module, aligning content with user preferences; (3) Marketing Module, ensuring factual accuracy and the inclusion of localized features. We conduct systematic human-subject experiments in the domain of real estate marketing, with a focus group of potential house buyers. The results demonstrate that marketing descriptions generated by our approach are preferred over those written by human experts by a clear margin while maintaining the same level of factual accuracy. Our findings suggest a promising agentic approach to automate large-scale targeted marketing while ensuring factuality of content generation.



Article 748

Title@2025-06-07 (6): Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models

Title: Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models Einmal anpassen, Thrive mit Updates: Übertragbar Parameter-Effizient Feintuning auf sich entwickelnden Basismodellen 一次适应,带更新的 “ 振动式 “ :可转让参数 – – 对不断演化的基建模型进行有效微调 2506.06844v1

Authors (6): Naibin Gu, Peng Fu, Xiyu Liu, Ke Ma, Zheng Lin, Weiping Wang

Parameter-efficient fine-tuning (PEFT) has become a common method for fine-tuning large language models, where a base model can serve multiple users through PEFT module switching. To enhance user experience, base models require periodic updates. However, once updated, PEFT modules fine-tuned on previous versions often suffer substantial performance degradation on newer versions. Re-tuning these numerous modules to restore performance would incur significant computational costs. Through a comprehensive analysis of the changes that occur during base model updates, we uncover an interesting phenomenon: continual training primarily affects task-specific knowledge stored in Feed-Forward Networks (FFN), while having less impact on the task-specific pattern in the Attention mechanism. Based on these findings, we introduce Trans-PEFT, a novel approach that enhances the PEFT module by focusing on the task-specific pattern while reducing its dependence on certain knowledge in the base model. Further theoretical analysis supports our approach. Extensive experiments across 7 base models and 12 datasets demonstrate that Trans-PEFT trained modules can maintain performance on updated base models without re-tuning, significantly reducing maintenance overhead in real-world applications.



Article 749

Title@2025-06-07 (6): PCoT: Persuasion-Augmented Chain of Thought for Detecting Fake News and Social Media Disinformation

Title: PCoT: Persuasion-Augmented Chain of Thought for Detecting Fake News and Social Media Disinformation PCoT: Persuasion-Augmented Chain of Thought zur Erkennung von Fake News und Social Media Desinformation PCOT: 检测假冒新闻和社会媒体假消息的劝导-增强思维链 2506.06842v1

Authors (5): Arkadiusz Modzelewski, Witold Sosnowski, Tiziano Labruna, Adam Wierzbicki, Giovanni Da San Martino

Disinformation detection is a key aspect of media literacy. Psychological studies have shown that knowledge of persuasive fallacies helps individuals detect disinformation. Inspired by these findings, we experimented with large language models (LLMs) to test whether infusing persuasion knowledge enhances disinformation detection. As a result, we introduce the Persuasion-Augmented Chain of Thought (PCoT), a novel approach that leverages persuasion to improve disinformation detection in zero-shot classification. We extensively evaluate PCoT on online news and social media posts. Moreover, we publish two novel, up-to-date disinformation datasets: EUDisinfo and MultiDis. These datasets enable the evaluation of PCoT on content entirely unseen by the LLMs used in our experiments, as the content was published after the models’ knowledge cutoffs. We show that, on average, PCoT outperforms competitive methods by 15% across five LLMs and five datasets. These findings highlight the value of persuasion in strengthening zero-shot disinformation detection.



Article 750

Title@2025-06-07 (6): Alignment Drift in CEFR-prompted LLMs for Interactive Spanish Tutoring

Title: Alignment Drift in CEFR-prompted LLMs for Interactive Spanish Tutoring Ausrichtung Drift in CEFR-geförderten LLMs für interaktives Spanisch Tutoring CEFR-快速推动的交互式西班牙语导图的对齐磁盘驱动器LMs 2505.08351v2

Authors (2): Mina Almasi, Ross Deans Kristensen-McLachlan

This paper investigates the potential of Large Language Models (LLMs) as adaptive tutors in the context of second-language learning. In particular, we evaluate whether system prompting can reliably constrain LLMs to generate only text appropriate to the student’s competence level. We simulate full teacher-student dialogues in Spanish using instruction-tuned, open-source LLMs ranging in size from 7B to 12B parameters. Dialogues are generated by having an LLM alternate between tutor and student roles with separate chat histories. The output from the tutor model is then used to evaluate the effectiveness of CEFR-based prompting to control text difficulty across three proficiency levels (A1, B1, C1). Our findings suggest that while system prompting can be used to constrain model outputs, prompting alone is too brittle for sustained, long-term interactional contexts, a phenomenon we term alignment drift. Our results provide insights into the feasibility of LLMs for personalized, proficiency-aligned adaptive tutors and provide a scalable method for low-cost evaluation of model performance without human participants.



Article 751

Title@2025-06-07 (6): Cross-Entropy Games for Language Models: From Implicit Knowledge to General Capability Measures

Title: Cross-Entropy Games for Language Models: From Implicit Knowledge to General Capability Measures Cross-Entropy-Spiele für Sprachmodelle: Vom Impliziten Wissen bis hin zu allgemeinen Fähigkeiten 语文模式跨英语运动会:从隐隐知识到一般能力措施 2506.06832v1

Authors (2): Clément Hongler, Andrew Emil

Large Language Models (LLMs) define probability measures on text. By considering the implicit knowledge question of what it means for an LLM to know such a measure and what it entails algorithmically, we are naturally led to formulate a series of tasks that go beyond generative sampling, involving forms of summarization, counterfactual thinking, anomaly detection, originality search, reverse prompting, debating, creative solving, etc. These tasks can be formulated as games based on LLM measures, which we call Cross-Entropy (Xent) Games. Xent Games can be single-player or multi-player. They involve cross-entropy scores and cross-entropy constraints, and can be expressed as simple computational graphs and programs. We show the Xent Game space is large enough to contain a wealth of interesting examples, while being constructible from basic game-theoretic consistency axioms. We then discuss how the Xent Game space can be used to measure the abilities of LLMs. This leads to the construction of Xent Game measures: finite families of Xent Games that can be used as capability benchmarks, built from a given scope, by extracting a covering measure. To address the unbounded scope problem associated with the challenge of measuring general abilities, we propose to explore the space of Xent Games in a coherent fashion, using ideas inspired by evolutionary dynamics.
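
Since the games above are built on cross-entropy scores, it may help to recall the basic quantity: the cross-entropy of a token sequence under a model is the sum of negative log-probabilities the model assigns to each token given its prefix. The sketch below computes it for any callable that returns a next-token probability. The interface is hypothetical, and how Xent Games combine such scores into game rules is not described in the abstract.

```python
import math
from typing import Callable, Sequence, Tuple

# (prefix, token) -> probability the model assigns to `token` after `prefix`
NextTokenProb = Callable[[Tuple[str, ...], str], float]


def cross_entropy(tokens: Sequence[str], next_token_prob: NextTokenProb) -> float:
    """Total negative log-likelihood (in nats) of `tokens` under the model."""
    total = 0.0
    for i, token in enumerate(tokens):
        prob = next_token_prob(tuple(tokens[:i]), token)
        if prob <= 0.0:
            return math.inf  # the model rules this sequence out entirely
        total -= math.log(prob)
    return total
```

A lower score means the model finds the text less surprising, so comparing the score of the same text with and without a candidate prefix gives one way to quantify how much that prefix helps predict the text.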



Article 752

Title@2025-06-07 (6): FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-World LoRA

Title: FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-World LoRA FedALT: Federated Fine-Tuning durch Adaptives lokales Training mit Rest-of-World LoRA FedALT:通过与世界其余地区合作的适应性地方培训实现联邦退休 2503.11880v2

Authors (4): Jieming Bian, Lei Wang, Letian Zhang, Jie Xu

Fine-tuning large language models (LLMs) in federated settings enables privacy-preserving adaptation but suffers from cross-client interference due to model aggregation. Existing federated LoRA fine-tuning methods, primarily based on FedAvg, struggle with data heterogeneity, leading to harmful cross-client interference and suboptimal personalization. In this work, we propose FedALT, a novel personalized federated LoRA fine-tuning algorithm that fundamentally departs from FedAvg. Instead of using an aggregated model to initialize local training, each client continues training its individual LoRA while incorporating shared knowledge through a separate Rest-of-World (RoW) LoRA component. To effectively balance local adaptation and global information, FedALT introduces an adaptive mixer that dynamically learns input-specific weightings between the individual and RoW LoRA components, drawing conceptual foundations from the Mixture-of-Experts (MoE) paradigm. Through extensive experiments on NLP benchmarks, we demonstrate that FedALT significantly outperforms state-of-the-art personalized federated LoRA fine-tuning methods, achieving superior local adaptation without sacrificing computational efficiency.
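
One way to picture an input-specific mixer over two LoRA components is a layer whose output is the frozen base projection plus a gated sum of two low-rank updates. The sketch below is an illustration under stated assumptions, not FedALT's implementation: the softmax gate, the parameter shapes, and every name in it are hypothetical, and plain Python lists are used so the example has no dependencies.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
LoRA = Tuple[Matrix, Matrix]  # (A: r x d_in, B: d_out x r), update = B @ (A @ x)


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z: Vector) -> Vector:
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def lora_delta(lora: LoRA, x: Vector) -> Vector:
    a, b = lora
    return matvec(b, matvec(a, x))


def mixed_forward(base: Matrix, individual: LoRA, rest_of_world: LoRA,
                  gate: Matrix, x: Vector) -> Vector:
    """base @ x + w_ind(x) * individual update + w_row(x) * RoW update."""
    w_ind, w_row = softmax(matvec(gate, x))  # gate has two rows: one per component
    d_ind = lora_delta(individual, x)
    d_row = lora_delta(rest_of_world, x)
    return [b + w_ind * i + w_row * r
            for b, i, r in zip(matvec(base, x), d_ind, d_row)]
```

Because the gate depends on `x`, the same client can lean on its own adapter for inputs that resemble its local data and on the shared component elsewhere, which is the balance the abstract describes.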



Article 753

Title@2025-06-07 (6): Data Shapley in One Training Run

Title: Data Shapley in One Training Run Datenschädlichkeit in einem Trainingslauf 一个培训运行中的数据形状 2406.11011v3

Authors (4): Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia

Data Shapley provides a principled framework for attributing data’s contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data’s contribution and discuss their implications for copyright in generative AI and pretraining data curation.
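
To show why retraining-based Data Shapley is expensive, the sketch below computes exact Shapley values for a tiny dataset by averaging each point's marginal contribution over every ordering. This is the textbook definition rather than the In-Run method; `utility` is a hypothetical stand-in for "train a model on this subset and measure its performance", so each call represents one training run.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, FrozenSet, Hashable, Sequence


def data_shapley(points: Sequence[Hashable],
                 utility: Callable[[FrozenSet[Hashable]], float]
                 ) -> Dict[Hashable, float]:
    """Exact Shapley value of each (distinct, hashable) data point."""
    values = {p: 0.0 for p in points}
    for order in permutations(points):
        included = []
        previous = utility(frozenset())
        for p in order:
            included.append(p)
            current = utility(frozenset(included))  # one "retraining" per call
            values[p] += current - previous         # marginal contribution of p
            previous = current
    n_orders = factorial(len(points))
    return {p: v / n_orders for p, v in values.items()}
```

With n points this loop makes on the order of n · n! utility calls, which is infeasible beyond toy sizes and motivates approaches that avoid retraining altogether.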



Article 754

Title@2025-06-07 (6): AutoPal: Autonomous Adaptation to Users for Personal AI Companionship

Title: AutoPal: Autonomous Adaptation to Users for Personal AI Companionship AutoPal: Autonome Anpassung an Benutzer für die persönliche KI-Companionship AutoPal: 个人AI 个人伴侣关系用户自动适应 2406.13960v4

Authors (9): Yi Cheng, Wenge Liu, Kaishuai Xu, Wenjun Hou, Yi Ouyang, Chak Tou Leong, Wenjie Li, Xian Wu, Yefeng Zheng

Previous research has demonstrated the potential of AI agents to act as companions that can provide constant emotional support for humans. In this paper, we emphasize the necessity of autonomous adaptation in personal AI companionship, an underexplored yet promising direction. Such adaptability is crucial as it can facilitate more tailored interactions with users and allow the agent to evolve in response to users’ changing needs. However, imbuing agents with autonomous adaptability presents unique challenges, including identifying optimal adaptations to meet users’ expectations and ensuring a smooth transition during the adaptation process. To address them, we devise a hierarchical framework, AutoPal, that enables controllable and authentic adjustments to the agent’s persona based on user interactions. A persona-matching dataset is constructed to facilitate the learning of optimal persona adaptations. Extensive experiments demonstrate the effectiveness of AutoPal and highlight the importance of autonomous adaptability in AI companionship.



Article 755

Title@2025-06-07 (6): Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs

Title: Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs 超越分类:以多任务音频LLLMs为理由的言论情感 2506.06820v1

Authors (12): Wenyu Zhang, Yingxu He, Geyu Lin, Zhuohan Liu, Shuo Sun, Bin Wang, Xunlong Zou, Jeremy H. M. Wong, Qiongqiong Wang, Hardik B. Sailor, Nancy F. Chen, Ai Ti Aw

Audio Large Language Models (AudioLLMs) have achieved strong results in semantic tasks like speech recognition and translation, but remain limited in modeling paralinguistic cues such as emotion. Existing approaches often treat emotion understanding as a classification problem, offering little insight into the underlying rationale behind predictions. In this work, we explore emotion reasoning, a strategy that leverages the generative capabilities of AudioLLMs to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations. To support this in multitask AudioLLMs, we introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training. This approach enables AudioLLMs to effectively learn different tasks while incorporating emotional reasoning. Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses.



Article 756

Title@2025-06-07 (6): How do datasets, developers, and models affect biases in a low-resourced language?

Title: How do datasets, developers, and models affect biases in a low-resourced language? Wie wirken sich Datensätze, Entwickler und Modelle auf Voreingenommenheiten in einer Sprache mit geringem Ressourcenbedarf aus? 数据集、开发者和模型如何影响低资源语言的偏差? 2506.06816v1

Authors (3): Dipto Das, Shion Guha, Bryan Semaan

Sociotechnical systems, such as language technologies, frequently exhibit identity-based biases. These biases exacerbate the experiences of historically marginalized communities and remain understudied in low-resource contexts. While models and datasets specific to a language or with multilingual support are commonly recommended to address these biases, this paper empirically tests the effectiveness of such approaches in the context of gender, religion, and nationality-based identities in Bengali, a widely spoken but low-resourced language. We conducted an algorithmic audit of sentiment analysis models built on mBERT and BanglaBERT, which were fine-tuned using all Bengali sentiment analysis (BSA) datasets from Google Dataset Search. Our analyses showed that BSA models exhibit biases across different identity categories despite having similar semantic content and structure. We also examined the inconsistencies and uncertainties arising from combining pre-trained models and datasets created by individuals from diverse demographic backgrounds. We connected these findings to the broader discussions on epistemic injustice, AI alignment, and methodological decisions in algorithmic audits.



Article 757

Title@2025-06-07 (6): BTPD: A Multilingual Hand-curated Dataset of Bengali Transnational Political Discourse Across Online Communities

Title: BTPD: A Multilingual Hand-curated Dataset of Bengali Transnational Political Discourse Across Online Communities BTPD: Mehrsprachiger handkuratierter Datensatz von Bengali Transnationaler politischer Diskurs über Online-Gemeinschaften BTPD: 孟加拉跨国在线社区政治论文多语种手动多语种数据集 2506.06813v1

Authors (3): Dipto Das, Syed Ishtiaque Ahmed, Shion Guha

Understanding political discourse in online spaces is crucial for analyzing public opinion and ideological polarization. While social computing and computational linguistics have explored such discussions in English, such research efforts are significantly limited in major yet under-resourced languages like Bengali due to the unavailability of datasets. In this paper, we present a multilingual dataset of Bengali transnational political discourse (BTPD) collected from three online platforms, each representing distinct community structures and interaction dynamics. Besides describing how we hand-curated the dataset through community-informed keyword-based retrieval, this paper also provides a general overview of its topics and multilingual content.



Article 758

Title@2025-06-07 (6): Advancing Question Generation with Joint Narrative and Difficulty Control

Title: Advancing Question Generation with Joint Narrative and Difficulty Control Verbesserung der Fragegenerierung mit gemeinsamer Narrativ- und Schwierigkeitskontrolle 推进问题生成,并采用 “ 共同叙述和困难联合控制 “ 2506.06812v1

Authors (2): Bernardo Leite, Henrique Lopes Cardoso

Question Generation (QG), the task of automatically generating questions from a source input, has seen significant progress in recent years. Difficulty-controllable QG (DCQG) enables control over the difficulty level of generated questions while considering the learner’s ability. Additionally, narrative-controllable QG (NCQG) allows control over the narrative aspects embedded in the questions. However, research in QG lacks a focus on combining these two types of control, which is important for generating questions tailored to educational purposes. To address this gap, we propose a strategy for Joint Narrative and Difficulty Control, enabling simultaneous control over these two attributes in the generation of reading comprehension questions. Our evaluation provides preliminary evidence that this approach is feasible, though it is not effective across all instances. Our findings highlight the conditions under which the strategy performs well and discuss the trade-offs associated with its application.



Article 759

Title@2025-06-07 (6): Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events

Title: Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events Nicht ganz Sherlock Holmes: Sprachmodellvorhersagen unterscheiden nicht zuverlässig unmöglich von unwahrscheinlichen Ereignissen 夏洛克·福尔摩斯:语言模型预测无法可靠地区分不可能与不可能发生的事件。 2506.06808v1

Authors (4): James A. Michaelov, Reeka Estacio, Zhien Zhang, Benjamin K. Bergen

Can language models reliably predict that possible events are more likely than merely improbable ones? By teasing apart possibility, typicality, and contextual relatedness, we show that despite the results of previous work, language models’ ability to do this is far from robust. In fact, under certain conditions, all models tested - including Llama 3, Gemma 2, and Mistral NeMo - perform at worse-than-chance level, assigning higher probabilities to impossible sentences such as ‘the car was given a parking ticket by the brake’ than to merely unlikely sentences such as ‘the car was given a parking ticket by the explorer’.



Article 760

Title@2025-06-07 (6): Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification

Title: Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification Label-Semantik Aware Generativer Ansatz für Domain-Agnostic Multilabel-Klassifikation 域-不可知性多标签分类的认知生成方法 2506.06806v1

Authors (5): Subhendu Khatuya, Shashwat Naidu, Saptarshi Ghosh, Pawan Goyal, Niloy Ganguly

The explosion of textual data has made manual document classification increasingly challenging. To address this, we introduce a robust, efficient domain-agnostic generative model framework for multi-label text classification. Instead of treating labels as mere atomic symbols, our approach utilizes predefined label descriptions and is trained to generate these descriptions based on the input text. During inference, the generated descriptions are matched to the predefined labels using a fine-tuned sentence transformer. We integrate this with a dual-objective loss function, combining cross-entropy loss and cosine similarity of the generated sentences with the predefined target descriptions, ensuring both semantic alignment and accuracy. Our proposed model LAGAMC stands out for its parameter efficiency and versatility across diverse datasets, making it well-suited for practical applications. We demonstrate the effectiveness of our proposed model by achieving new state-of-the-art performance across all evaluated datasets, surpassing several strong baselines. We achieve improvements of 13.94% in Micro-F1 and 24.85% in Macro-F1 compared to the closest baseline across all datasets.
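
The dual-objective loss mentioned in this abstract can be illustrated with a minimal sketch. It assumes the objective is a weighted sum of mean token-level cross-entropy and one minus the cosine similarity between sentence embeddings. The weight `alpha`, the function names, and the input layout are hypothetical and are not taken from the paper.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target token under a probability vector."""
    return -math.log(probs[target_index])

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dual_objective_loss(token_probs, target_ids, gen_emb, target_emb, alpha=0.5):
    """Assumed form: mean cross-entropy + alpha * (1 - cosine similarity)."""
    # Mean token-level cross-entropy over the target description.
    ce = sum(cross_entropy(p, t) for p, t in zip(token_probs, target_ids)) / len(target_ids)
    # Cosine term is 0 when the embeddings align and grows as they diverge.
    cos_term = 1.0 - cosine_similarity(gen_emb, target_emb)
    return ce + alpha * cos_term
```

With identical embeddings the cosine term vanishes and the loss reduces to plain cross-entropy, so the second term only penalizes generated descriptions that drift semantically from the target description.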



Article 761

Title@2025-06-07 (6): Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs

Title: Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs Dialog ohne Grenzen: Konstant-Sized KV-Caches für erweiterte Reaktionen in LLMs 无限制对话框: 普通 KV 宽度响应库( LLM ) 2503.00979v2

Authors (5): Ravi Ghadia, Avinash Kumar, Gaurav Jain, Prashant Nair, Poulami Das

Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. This bottleneck is particularly problematic in real-time applications – such as chatbots and interactive assistants – where low latency and high memory efficiency are critical. Existing methods drop distant tokens or compress states in a lossy manner, sacrificing accuracy by discarding vital context or introducing bias. We propose MorphKV, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. MorphKV balances long-range dependencies and local coherence during text generation. It eliminates early-token bias while retaining high-fidelity context by adaptively ranking tokens through correlation-aware selection. Unlike heuristic retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. This approach captures inter-token correlation with greater accuracy, crucial for tasks like content creation and code generation. Our studies on long-response tasks show 52.9% memory savings and 18.2% higher accuracy on average compared to state-of-the-art prior works, enabling efficient real-world deployment.
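
To make the idea of a constant-sized cache guided by the attention of recent tokens concrete, here is a rough sketch. It is not MorphKV's actual selection rule: it simply assumes that the most recent `window` tokens are always kept and that older tokens are ranked by the total attention they receive from those recent tokens. The function name, parameters, and input layout are hypothetical.

```python
def select_cache(attn_from_recent, num_tokens, window=2, budget=2):
    """Return sorted indices of tokens to keep; at most window + budget entries.

    attn_from_recent[i][j] is the attention weight that the i-th recent
    token places on cached token j (an assumed input layout).
    """
    split = max(0, num_tokens - window)
    recent = list(range(split, num_tokens))
    # Score each older token by the total attention it receives from recent tokens.
    scores = {j: sum(row[j] for row in attn_from_recent) for j in range(split)}
    kept_old = sorted(scores, key=scores.get, reverse=True)[:budget]
    return sorted(kept_old) + recent
```

Because the result never exceeds `window + budget` indices, memory stays bounded no matter how long the response grows.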



Article 762

Title@2025-06-07 (6): On the Adaptive Psychological Persuasion of Large Language Models

Title: On the Adaptive Psychological Persuasion of Large Language Models Über die adaptive Psychologische Überzeugung von großen Sprachmodellen 关于大语言模型的适应性心理心理分析 2506.06800v1

Authors (9): Tianjie Ju, Yujia Chen, Hao Fei, Mong-Li Lee, Wynne Hsu, Pengzhou Cheng, Zongru Wu, Zhuosheng Zhang, Gongshen Liu

Previous work has showcased the intriguing capabilities of Large Language Models (LLMs) in instruction-following and rhetorical fluency. However, their dual capabilities to autonomously persuade and resist persuasion, particularly in contexts involving psychological rhetoric, have not been systematically explored. In this paper, we first evaluate four commonly adopted LLMs by tasking them to alternately act as persuaders and listeners in adversarial dialogues. Empirical results show that persuader LLMs predominantly employ repetitive strategies, leading to low success rates. Then we introduce eleven comprehensive psychological persuasion strategies, finding that explicitly instructing LLMs to adopt specific strategies such as Fluency Effect and Repetition Effect significantly improves persuasion success rates. However, no “one-size-fits-all” strategy proves universally effective, with performance heavily dependent on contextual counterfactuals. Motivated by these observations, we propose an adaptive framework based on direct preference optimization that trains LLMs to autonomously select optimal strategies by leveraging persuasion results from strategy-specific responses as preference pairs. Experiments on three open-source LLMs confirm that the proposed adaptive psychological persuasion method effectively enables persuader LLMs to select optimal strategies, significantly enhancing their success rates while maintaining general capabilities. Our code is available at https://github.com/KalinaEine/PsychologicalPersuasion.



Article 763

Title@2025-06-07 (6): ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models

Title: ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models ZeroDL: Null-Schuss-Distributions-Lernen für Text-Clustering über große Sprachmodelle 零DL:通过大语言模式对文本分组进行零光分发学习 2406.13342v2

Authors (4): Hwiyeol Jo, Hyunwoo Lee, Kang Min Yoo, Taiwoo Park

The advancements in large language models (LLMs) have brought significant progress in NLP tasks. However, if a task cannot be fully described in prompts, the models could fail to carry out the task. In this paper, we propose a simple yet effective method to contextualize a task for an LLM. The method (1) performs open-ended zero-shot inference over the entire dataset, (2) aggregates the inference results, and (3) incorporates the aggregated meta-information into the actual task. We show the effectiveness in text clustering tasks, empowering LLMs to perform text-to-text-based clustering and leading to improvements on several datasets. Furthermore, we explore the generated class labels for clustering, showing how the LLM understands the task through data.
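
The three steps in this abstract can be sketched as follows. This is only an illustration under stated assumptions: `llm` is any callable that maps a prompt string to a response string, the prompt wording is invented, and aggregation is reduced to keeping the most frequent free-form labels. None of these details come from the paper.

```python
from collections import Counter

def zero_shot_distribution_learning(texts, llm, num_classes):
    """Hypothetical three-step pipeline: open-ended inference, aggregation, final task."""
    # (1) Open-ended inference: ask for a free-form label for every text.
    open_labels = [llm(f"Describe the class of this text: {t}").strip().lower() for t in texts]
    # (2) Aggregation: keep the most frequent labels as dataset-level meta-information.
    classes = [label for label, _ in Counter(open_labels).most_common(num_classes)]
    # (3) Final task: constrain the model to the aggregated label set.
    options = ", ".join(classes)
    assignments = [llm(f"Choose one of [{options}] for this text: {t}").strip().lower() for t in texts]
    return classes, assignments
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client and makes it easy to exercise with a stub.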



Article 764

Title@2025-06-07 (6): Extending dependencies to the taggedPBC: Word order in transitive clauses

Title: Extending dependencies to the taggedPBC: Word order in transitive clauses Ausdehnung der Abhängigkeiten auf den TaggedPBC: Word-Ordnung in transitiven Klauseln 将依赖延伸至贴有标签的PBC:过渡条款中的Word顺序 2506.06785v1

Authors (1): Hiram Ring

The taggedPBC (Ring 2025a) contains more than 1,800 sentences of POS-tagged parallel text data from over 1,500 languages, representing 133 language families and 111 isolates. While this dwarfs previously available resources, and the POS tags achieve decent accuracy, allowing for predictive crosslinguistic insights (Ring 2025b), the dataset was not initially annotated for dependencies. This paper reports on a CoNLL-U-formatted version of the dataset which transfers dependency information along with POS tags to all languages in the taggedPBC. Although there are various concerns regarding the quality of the tags and the dependencies, word order information derived from this dataset regarding the position of arguments and predicates in transitive clauses correlates with expert determinations of word order in three typological databases (WALS, Grambank, Autotyp). This highlights the usefulness of corpus-based typological approaches (as per Baylor et al. 2023; Bjerva 2024) for extending comparisons of discrete linguistic categories, and suggests that important insights can be gained even from noisy data, given sufficient annotation. The dependency-annotated corpora are also made available for research and collaboration via GitHub.
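
As an illustration of how word order in transitive clauses might be read off dependency annotations, the sketch below counts subject/verb/object orders in CoNLL-U text. It assumes Universal Dependencies conventions (`VERB` in the UPOS column, `nsubj` and `obj` relations) and takes the first matching dependent of each verb. The function name and these simplifications are assumptions, not the paper's procedure.

```python
from collections import Counter

def transitive_orders(conllu_text):
    """Count S/V/O orders for verbs that have both an nsubj and an obj dependent."""
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):
        tokens = {}
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue  # skip blank lines and sentence-level comments
            cols = line.split("\t")
            # Skip multiword ranges (1-2), empty nodes (1.1), and unset heads.
            if len(cols) < 8 or not cols[0].isdigit() or not cols[6].isdigit():
                continue
            tokens[int(cols[0])] = (cols[3], int(cols[6]), cols[7])  # UPOS, HEAD, DEPREL
        for verb_id, (upos, _, _) in tokens.items():
            if upos != "VERB":
                continue
            subj = [i for i, (_, h, rel) in tokens.items() if h == verb_id and rel == "nsubj"]
            obj = [i for i, (_, h, rel) in tokens.items() if h == verb_id and rel == "obj"]
            if subj and obj:
                # Sort the three roles by token position to get a label such as "SVO".
                order = sorted([(subj[0], "S"), (verb_id, "V"), (obj[0], "O")])
                counts["".join(label for _, label in order)] += 1
    return counts
```

The most frequent label per language can then be compared against the dominant order recorded in a typological database.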



Article 765

Title@2025-06-07 (6): They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse

Title: They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse Sie wollen vorgeben, nicht zu verstehen: Die Grenzen aktueller LLMs bei der Interpretation impliziter Inhalte politischer Diskurse 他们想假装不理解:当前LLMs在解释政治论文隐含内容方面的局限性 2506.06775v1

Authors (3): Walter Paci, Alessandro Panunzi, Sandro Pezzelle

Implicit content plays a crucial role in political discourse, where speakers systematically employ pragmatic strategies such as implicatures and presuppositions to influence their audiences. Large Language Models (LLMs) have demonstrated strong performance in tasks requiring complex semantic and pragmatic understanding, highlighting their potential for detecting and explaining the meaning of implicit content. However, their ability to do this within political discourse remains largely underexplored. Leveraging, for the first time, the large IMPAQTS corpus, which comprises Italian political speeches with the annotation of manipulative implicit content, we propose methods to test the effectiveness of LLMs in this challenging problem. Through a multiple-choice task and an open-ended generation task, we demonstrate that all tested models struggle to interpret presuppositions and implicatures. We conclude that current LLMs lack the key pragmatic capabilities necessary for accurately interpreting highly implicit language, such as that found in political discourse. At the same time, we highlight promising trends and future directions for enhancing model performance. We release our data and code at https://github.com/WalterPaci/IMPAQTS-PID



Article 766

Title@2025-06-07 (6): Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models

Title: Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models 在《愿景语言模式》中实体知识采掘不同模式的绩效差距 2412.14133v2

Authors (4): Ido Cohen, Daniela Gottesman, Mor Geva, Raja Giryes

Vision-language models (VLMs) excel at extracting and reasoning about information from images. Yet, their capacity to leverage internal knowledge about specific entities remains underexplored. This work investigates the disparity in model performance when answering factual questions about an entity described in text versus depicted in an image. Our results reveal a significant accuracy drop - reaching 18% for some models - when the entity is presented visually instead of textually. To study this gap we present PopVQA, a dataset which allows separating entity recognition and question answering, and use it to benchmark several models. We hypothesize that this decline arises from limitations in how information flows from image tokens to query tokens. Thus, we use mechanistic interpretability tools to reveal that, although image tokens are preprocessed by the vision encoder, meaningful information flow from these tokens occurs only in the much deeper layers. Furthermore, critical image processing happens in the language model’s middle layers, allowing few layers for consecutive reasoning, highlighting a potential inefficiency in how the model utilizes its layers for reasoning. These insights shed light on the internal mechanics of VLMs and offer pathways for enhancing their reasoning capabilities. PopVQA can be found at https://huggingface.co/datasets/idoco/PopVQA.



Article 767

Title@2025-06-07 (6): On Adversarial Robustness of Language Models in Transfer Learning

Title: On Adversarial Robustness of Language Models in Transfer Learning Zur widerlichen Robustheit von Sprachmodellen im Transfer-Lernen 转让学习中语言模式的反效果 2501.00066v2

Authors (4): Bohdan Turbal, Anastasiia Mazur, Jiaxu Zhao, Mykola Pechenizkiy

We investigate the adversarial robustness of LLMs in transfer learning scenarios. Through comprehensive experiments on multiple datasets (MBIB Hate Speech, MBIB Political Bias, MBIB Gender Bias) and various model architectures (BERT, RoBERTa, GPT-2, Gemma, Phi), we reveal that transfer learning, while improving standard performance metrics, often leads to increased vulnerability to adversarial attacks. Our findings demonstrate that larger models exhibit greater resilience to this phenomenon, suggesting a complex interplay between model size, architecture, and adaptation methods. Our work highlights the crucial need for considering adversarial robustness in transfer learning scenarios and provides insights into maintaining model security without compromising performance. These findings have significant implications for the development and deployment of LLMs in real-world applications where both performance and robustness are paramount.



Article 768

Title@2025-06-07 (6): AfroBench: How Good are Large Language Models on African Languages?

Title: AfroBench: How Good are Large Language Models on African Languages? AfroBench: Wie gut sind große Sprachmodelle auf afrikanischen Sprachen? AfroBench:非洲语言大语言模式有多好? 2311.07978v5

Authors (7): Jessica Ojo, Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Jimmy Lin, Pontus Stenetorp, David Ifeoluwa Adelani

Large-scale multilingual evaluations, such as MEGA, often include only a handful of African languages due to the scarcity of high-quality evaluation data and the limited discoverability of existing African datasets. This lack of representation hinders comprehensive LLM evaluation across a diverse range of languages and tasks. To address these challenges, we introduce AfroBench – a multi-task benchmark for evaluating the performance of LLMs across 64 African languages, 15 tasks and 22 datasets. AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task. We present results comparing the performance of prompting LLMs to fine-tuned baselines based on BERT and T5-style models. Our results suggest large gaps in performance between high-resource languages, such as English, and African languages across most tasks; but performance also varies based on the availability of monolingual data resources. Our findings confirm that performance on African languages remains a hurdle for current LLMs, underscoring the need for additional efforts to close this gap. https://mcgill-nlp.github.io/AfroBench/



Article 769

Title@2025-06-07 (6): Geopolitical biases in LLMs: what are the “good” and the “bad” countries according to contemporary language models

Title: Geopolitical biases in LLMs: what are the “good” and the “bad” countries according to contemporary language models Geopolitische Voreingenommenheiten in LLMs: Was sind die “guten” und die “schlechten” Länder nach zeitgenössischen Sprachmodellen LLMM中的地缘政治偏见:根据当代语言模式,什么是“好”和“坏”国家? 2506.06751v1

Authors (10): Mikhail Salnikov, Dmitrii Korzh, Ivan Lazichny, Elvir Karimov, Artyom Iudin, Ivan Oseledets, Oleg Y. Rogov, Alexander Panchenko, Natalia Loukachevitch, Elena Tutubalina

This paper evaluates geopolitical biases in LLMs with respect to various countries through an analysis of their interpretation of historical events with conflicting national perspectives (USA, UK, USSR, and China). We introduce a novel dataset with neutral event descriptions and contrasting viewpoints from different countries. Our findings show significant geopolitical biases, with models favoring specific national narratives. Additionally, simple debiasing prompts had a limited effect in reducing these biases. Experiments with manipulated participant labels reveal models’ sensitivity to attribution, sometimes amplifying biases or recognizing inconsistencies, especially with swapped labels. This work highlights national narrative biases in LLMs, challenges the effectiveness of simple debiasing methods, and offers a framework and dataset for future geopolitical bias research.



Article 770

Title@2025-06-07 (6): C-PATH: Conversational Patient Assistance and Triage in Healthcare System

Title: C-PATH: Conversational Patient Assistance and Triage in Healthcare System C-PATH: Conversational Patient Assistance und Triage in Healthcare System C-PATH: 病人相互援助和保健系统中的转诊 2506.06737v1

Authors (3): Qi Shi, Qiwei Han, Cláudia Soares

Navigating healthcare systems can be complex and overwhelming, creating barriers for patients seeking timely and appropriate medical attention. In this paper, we introduce C-PATH (Conversational Patient Assistance and Triage in Healthcare), a novel conversational AI system powered by large language models (LLMs) designed to help patients recognize symptoms and to recommend appropriate medical departments through natural, multi-turn dialogues. C-PATH is fine-tuned on medical knowledge, dialogue data, and clinical summaries using a multi-stage pipeline built on the LLaMA3 architecture. A core contribution of this work is a GPT-based data augmentation framework that transforms structured clinical knowledge from DDXPlus into lay-person-friendly conversations, allowing alignment with patient communication norms. We also implement a scalable conversation history management strategy to ensure long-range coherence. Evaluation with GPTScore demonstrates strong performance across dimensions such as clarity, informativeness, and recommendation accuracy. Quantitative benchmarks show that C-PATH achieves superior performance in GPT-rewritten conversational datasets, significantly outperforming domain-specific baselines. C-PATH represents a step forward in the development of user-centric, accessible, and accurate AI tools for digital health assistance and triage.



Article 771

Title: Mitigating Object Hallucination via Robust Local Perception Search Objekt-Halluzination durch robuste lokale Wahrnehmungssuche abmildern 通过强力局部感知搜索减轻天体幻觉 2506.06729v1

Authors (5): Zixian Gao, Chao Yang, Zhanhui Zhou, Xing Xu, Chaochao Lu

Recent advancements in Multimodal Large Language Models (MLLMs) have enabled them to effectively integrate vision and language, addressing a variety of downstream tasks. However, despite their significant success, these models still exhibit hallucination phenomena, where the outputs appear plausible but do not align with the content of the images. To mitigate this issue, we introduce Local Perception Search (LPS), a decoding method during inference that is both simple and training-free, yet effectively suppresses hallucinations. This method leverages local visual prior information as a value function to correct the decoding process. Additionally, we observe that the impact of the local visual prior on model performance is more pronounced in scenarios with high levels of image noise. Notably, LPS is a plug-and-play approach that is compatible with various models. Extensive experiments on widely used hallucination benchmarks and noisy data demonstrate that LPS significantly reduces the incidence of hallucinations compared to the baseline, showing exceptional performance, particularly in noisy settings.



Article 772

Title@2025-06-07 (6): Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space

Title: Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space Verbreiten Sie Ihren SCOPE! Effiziente Multiturn-Conversation-Planung für LLMs mit semantischem Raum 扩大你的SCOPE ! 与语义空间的LLMs 高效多方向对话规划 。 2503.11586v2

Authors (4): Zhiliang Chen, Xinyuan Niu, Chuan-Sheng Foo, Bryan Kian Hsiang Low

Large language models (LLMs) are used in chatbots or AI assistants to hold conversations with a human user. In such applications, the quality (e.g., user engagement, safety) of a conversation is important and can only be exactly known at the end of the conversation. To maximize its expected quality, conversation planning reasons about the stochastic transitions within a conversation to select the optimal LLM response at each turn. Existing simulation-based conversation planning algorithms typically select the optimal response by simulating future conversations with a large number of LLM queries at every turn. However, this process is extremely time-consuming and hence impractical for real-time conversations. This paper presents a novel approach called Semantic space COnversation Planning with improved Efficiency (SCOPE) that exploits the dense semantic representation of conversations to perform conversation planning efficiently. In particular, SCOPE models the stochastic transitions in conversation semantics and their associated rewards to plan entirely within the semantic space. This allows us to select the optimal LLM response at every conversation turn without needing additional LLM queries for simulation. As a result, SCOPE can perform conversation planning 70 times faster than conventional simulation-based planning algorithms when applied to a wide variety of conversation starters and two reward functions seen in the real world, while still achieving a higher reward within a practical planning budget. Our code can be found at: https://github.com/chenzhiliang94/convo-plan-SCOPE.



Article 773

Title@2025-06-07 (6): nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning

Title: nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning nvBench 2.0: Ambiguität in der Text-zu-Visualisierung durch schrittweise Vernunft auflösen nvBench 2.0:通过逐步说明理由解决文本到虚拟化的模糊性 2503.12880v2

Authors (8): Tianqi Luo, Chuhan Huang, Leixian Shen, Boyan Li, Shuyu Shen, Wei Zeng, Nan Tang, Yuyu Luo

Text-to-Visualization (Text2VIS) enables users to create visualizations from natural language queries, making data insights more accessible. However, Text2VIS faces challenges in interpreting ambiguous queries, as users often express their visualization needs in imprecise language. To address this challenge, we introduce nvBench 2.0, a new benchmark designed to evaluate Text2VIS systems in scenarios involving ambiguous queries. nvBench 2.0 includes 7,878 natural language queries and 24,076 corresponding visualizations, derived from 780 tables across 153 domains. It is built using a controlled ambiguity-injection pipeline that generates ambiguous queries through a reverse-generation workflow. By starting with unambiguous seed visualizations and selectively injecting ambiguities, the pipeline yields multiple valid interpretations for each query, with each ambiguous query traceable to its corresponding visualization through step-wise reasoning paths. We evaluate various Large Language Models (LLMs) on their ability to perform ambiguous Text2VIS tasks using nvBench 2.0. We also propose Step-Text2Vis, an LLM-based model trained on nvBench 2.0, which enhances performance in ambiguous scenarios through step-wise preference optimization. Our results show that Step-Text2Vis outperforms all baselines, setting a new state-of-the-art for ambiguous Text2VIS tasks. Our source code and data are available at https://nvbench2.github.io/



Article 774

Title@2025-06-07 (6): From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment

Title: From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment Von der Bedrohung zum Werkzeug: Rückgängige Injektionsangriffe für die Sicherheitsausrichtung 从威胁到工具:为安全协调而利用拒绝的故意喷射攻击 2506.10020v1

Authors (3): Kyubyung Chae, Hyunbin Jin, Taesup Kim

Safely aligning large language models (LLMs) often demands extensive human-labeled preference data, a process that’s both costly and time-consuming. While synthetic data offers a promising alternative, current methods frequently rely on complex iterative prompting or auxiliary models. To address this, we introduce Refusal-Aware Adaptive Injection (RAAI), a straightforward, training-free, and model-agnostic framework that repurposes LLM attack techniques. RAAI works by detecting internal refusal signals and adaptively injecting predefined phrases to elicit harmful, yet fluent, completions. Our experiments show RAAI effectively jailbreaks LLMs, increasing the harmful response rate from a baseline of 2.15% to up to 61.04% on average across four benchmarks. Crucially, fine-tuning LLMs with the synthetic data generated by RAAI improves model robustness against harmful prompts while preserving general capabilities on standard tasks like MMLU and ARC. This work highlights how LLM attack methodologies can be reframed as practical tools for scalable and controllable safety alignment.



Article 775

Title@2025-06-07 (6): RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

Title: RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking RED QUEEN: Schutz großer Sprachmodelle gegen verschleierte Multi-Turn-Jailbreaking RED QUEEN: 保护大型语言模型,防止隐蔽的多轮破门破门 2409.17458v2

Authors (6): Yifan Jiang, Kriti Aggarwal, Tanmay Laud, Kashif Munir, Jay Pujara, Subhabrata Mukherjee

The rapid progress of Large Language Models (LLMs) has opened up new opportunities across various domains and applications; yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming has been employed as a proactive security measure to probe language models for harmful outputs via jailbreak attacks. However, current jailbreak attack approaches are single-turn with explicit malicious queries that do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions in a more covert manner. To bridge this gap, we first propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, reaching 87.62% attack success rate on GPT-4o and 75.4% on Llama3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model’s performance across standard benchmarks. Full implementation and dataset are publicly accessible at https://github.com/kriti-hippo/red_queen.



Article 776

Title@2025-06-07 (6): A Survey of Retentive Network

Title: A Survey of Retentive Network Ein Überblick über das Netz der Retentive 网上反应网络调查 2506.06708v1

Authors (4): Haiqi Yang, Zhiyuan Li, Yi Chang, Yuan Wu

Retentive Network (RetNet) represents a significant advancement in neural network architecture, offering an efficient alternative to the Transformer. While Transformers rely on self-attention to model dependencies, they suffer from high memory costs and limited scalability when handling long sequences due to their quadratic complexity. To mitigate these limitations, RetNet introduces a retention mechanism that unifies the inductive bias of recurrence with the global dependency modeling of attention. This mechanism enables linear-time inference, facilitates efficient modeling of extended contexts, and remains compatible with fully parallelizable training pipelines. RetNet has garnered significant research interest due to its consistently demonstrated cross-domain effectiveness, achieving robust performance across machine learning paradigms including natural language processing, speech recognition, and time-series analysis. However, a comprehensive review of RetNet is still missing from the current literature. This paper aims to fill that gap by offering the first detailed survey of the RetNet architecture, its key innovations, and its diverse applications. We also explore the main challenges associated with RetNet and propose future research directions to support its continued advancement in both academic research and practical deployment.
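
The claim that retention supports both parallel training and linear-time inference can be illustrated with a small sketch. It assumes a simplified, single-head retention with a scalar decay `gamma` and omits the positional rotation, normalization, and gating used in full RetNet layers; the function names are illustrative and do not come from the survey. The parallel form computes each output as a decayed sum over all earlier positions, while the recurrent form carries a fixed-size state, so its per-token cost does not grow with sequence length.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retention_parallel(q: Matrix, k: Matrix, v: Matrix, gamma: float) -> Matrix:
    """Parallel form: o_n = sum_{m <= n} gamma^(n-m) * (q_n . k_m) * v_m."""
    outputs = []
    for n in range(len(q)):
        o = [0.0] * len(v[0])
        for m in range(n + 1):
            weight = (gamma ** (n - m)) * dot(q[n], k[m])
            o = [oi + weight * vi for oi, vi in zip(o, v[m])]
        outputs.append(o)
    return outputs


def retention_recurrent(q: Matrix, k: Matrix, v: Matrix, gamma: float) -> Matrix:
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, then o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of n
    outputs = []
    for qn, kn, vn in zip(q, k, v):
        state = [
            [gamma * state[i][j] + kn[i] * vn[j] for j in range(d_v)]
            for i in range(d_k)
        ]
        outputs.append(
            [sum(qn[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


# Toy inputs (hypothetical values): three tokens, key dim 2, value dim 2.
q = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.7]]
k = [[0.4, 0.1], [-0.3, 0.8], [0.5, 0.5]]
v = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]

parallel_out = retention_parallel(q, k, v, gamma=0.9)
recurrent_out = retention_recurrent(q, k, v, gamma=0.9)
```

Under these assumptions the two forms agree up to floating-point error, which is why a model can be trained with the parallel form and served with the recurrent one.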



Article 777

Title@2025-06-07 (6): DivScore: Zero-Shot Detection of LLM-Generated Text in Specialized Domains

Title: DivScore: Zero-Shot Detection of LLM-Generated Text in Specialized Domains DivScore: Zero-Shot-Erkennung von LLM-generiertem Text in spezialisierten Domains DivScore: 专用域内LLM-发光文本零热探测 2506.06705v1

Authors (5): Zhihui Chen, Kai He, Yucheng Huang, Yunxiao Zhu, Mengling Feng

Detecting LLM-generated text in specialized and high-stakes domains like medicine and law is crucial for combating misinformation and ensuring authenticity. However, current zero-shot detectors, while effective on general text, often fail when applied to specialized content due to domain shift. We provide a theoretical analysis showing this failure is fundamentally linked to the KL divergence between human, detector, and source text distributions. To address this, we propose DivScore, a zero-shot detection framework using normalized entropy-based scoring and domain knowledge distillation to robustly identify LLM-generated text in specialized domains. We also release a domain-specific benchmark for LLM-generated text detection in the medical and legal domains. Experiments on our benchmark show that DivScore consistently outperforms state-of-the-art detectors, with 14.4% higher AUROC and 64.0% higher recall (at a 0.1% false positive rate threshold). In adversarial settings, DivScore is more robust than the other baselines, achieving on average a 22.8% advantage in AUROC and a 29.5% advantage in recall. Code and data are publicly available.



Article 778

Title@2025-06-07 (6): Dynamic and Parametric Retrieval-Augmented Generation

Title: Dynamic and Parametric Retrieval-Augmented Generation Dynamische und parametrische Retrieval-Augmented Generation 动态和参数检索增强的一代 2506.06704v1

Authors (5): Weihang Su, Qingyao Ai, Jingtao Zhan, Qian Dong, Yiqun Liu

Retrieval-Augmented Generation (RAG) has become a foundational paradigm for equipping large language models (LLMs) with external knowledge, playing a critical role in information retrieval and knowledge-intensive applications. However, conventional RAG systems typically adopt a static retrieve-then-generate pipeline and rely on in-context knowledge injection, which can be suboptimal for complex tasks that require multihop reasoning, adaptive information access, and deeper integration of external knowledge. Motivated by these limitations, the research community has moved beyond static retrieval and in-context knowledge injection. Among the emerging directions, this tutorial delves into two rapidly growing and complementary research areas on RAG: Dynamic RAG and Parametric RAG. Dynamic RAG adaptively determines when and what to retrieve during the LLM’s generation process, enabling real-time adaptation to the LLM’s evolving information needs. Parametric RAG rethinks how retrieved knowledge should be injected into LLMs, transitioning from input-level to parameter-level knowledge injection for enhanced efficiency and effectiveness. This tutorial offers a comprehensive overview of recent advances in these emerging research areas. It also shares theoretical foundations and practical insights to support and inspire further research in RAG.



Article 779

Title@2025-06-07 (6): from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors

Title: from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors aus Benign Import Giftic: Jailbreaking the Language Model via Adversarial Metaphors 有毒物质:通过反面符号破坏语言模式的监狱 2503.00038v3

Authors (8): Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Jiangyu Lei, Qi Li

Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content under jailbreak attacks. However, they overlook that the direct generation of harmful content from scratch is more difficult than inducing an LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason about and calibrate the metaphorical content, and is thus jailbroken by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can jailbreak LLMs effectively and transferably, achieving a state-of-the-art attack success rate across multiple advanced LLMs.



Article 780

Title@2025-06-07 (6): MarginSel : Max-Margin Demonstration Selection for LLMs

Title: MarginSel : Max-Margin Demonstration Selection for LLMs MarginSel : Max-Margin-Demonstrationsauswahl für LLMs MaginSel : LLMS 最大边框演示选择 2506.06699v1

Authors (4): Rajeev Bhatt Ambati, James Lester, Shashank Srivastava, Snigdha Chaturvedi

Large Language Models (LLMs) excel at few-shot learning via in-context learning (ICL). However, the effectiveness of ICL is often sensitive to the selection and ordering of demonstration examples. To address this, we present MarginSel: Max-Margin Demonstration Selection for LLMs, a two-step method that selects hard demonstration examples for the ICL prompt, adapting to each test instance. Our approach achieves 2-7% absolute improvement in F1-score across classification tasks, compared to a random selection of examples. We also provide theoretical insights and empirical evidence showing that MarginSel induces max-margin behavior in LLMs by effectively increasing the margin for hard examples, analogous to support vectors, thereby shifting the decision boundary in a beneficial direction.



Article 781

Title@2025-06-07 (6): Contextual Experience Replay for Self-Improvement of Language Agents

Title: Contextual Experience Replay for Self-Improvement of Language Agents Kontextuelles Erfahrungsreplay zur Selbstverbesserung von Sprachagenten 语文代理人员自我改进的内事经验回顾 2506.06698v1

Authors (4): Yitao Liu, Chenglei Si, Karthik Narasimhan, Shunyu Yao

Large language model (LLM) agents have been applied to sequential decision-making tasks such as web navigation, but without any environment-specific experiences, they often fail in these complex tasks. Moreover, current LLM agents are not designed to continually learn from past experiences during inference time, which could be crucial for them to gain these environment-specific experiences. To address this, we propose Contextual Experience Replay (CER), a training-free framework to enable efficient self-improvement for language agents in their context window. Specifically, CER accumulates and synthesizes past experiences into a dynamic memory buffer. These experiences encompass environment dynamics and common decision-making patterns, allowing the agents to retrieve and augment themselves with relevant knowledge in new tasks, enhancing their adaptability in complex environments. We evaluate CER on the challenging WebArena and VisualWebArena benchmarks. On VisualWebArena, CER achieves a competitive performance of 31.9%. On WebArena, CER also achieves a competitive average success rate of 36.7%, a relative improvement of 51.0% over the GPT-4o agent baseline. We also conduct a comprehensive analysis of CER to demonstrate its efficiency and validity and to better understand its behavior.



Article 782

Title: Legal Mathematical Reasoning with LLMs: Procedural Alignment through Two-Stage Reinforcement Learning Rechtliche Mathematische Reasoning mit LLMs: Verfahrensorientierte Ausrichtung durch Zwei-Stufen-Verstärkungs-Lernen 与LLMM女士的法律数学理由:通过双层强化学习实现程序协调 2504.02590v2

Authors (7): Kepu Zhang, Guofu Xie, Weijie Yu, Mingyue Xu, Xu Tang, Yaxin Li, Jun Xu

Legal mathematical reasoning is essential for applying large language models (LLMs) in high-stakes legal contexts, where outputs must be both mathematically accurate and procedurally compliant. However, existing legal LLMs lack structured numerical reasoning, and open-domain models, though capable of calculations, often overlook mandatory legal steps. To address this, we present LexNum, the first Chinese legal mathematical reasoning benchmark, covering three representative scenarios where each instance reflects legally grounded procedural flows. We further propose LexPam, a two-stage reinforcement learning framework for efficient legal reasoning training. Leveraging curriculum learning, we use a stronger teacher model to partition data into basic and challenging subsets. A lightweight 1.5B student model is then fine-tuned with Group Relative Policy Optimization, which avoids costly value networks and enables stable training from sparse, end-of-sequence rewards. The first stage improves accuracy and format; the second introduces a novel reward to guide procedural alignment via task-specific legal elements. Experiments show that existing models perform poorly on LexNum, while LexPam enhances both mathematical accuracy and legal coherence, and generalizes effectively across tasks and domains.
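
The remark that Group Relative Policy Optimization avoids a value network can be made concrete with a minimal sketch of the group-relative advantage. Instead of a learned baseline, each sampled response is scored against the mean and standard deviation of the rewards in its own group. The function name, the use of the population standard deviation, and the `eps` stabilizer are illustrative choices made here, not details taken from LexPam.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    The rewards are end-of-sequence scalars for several responses to the
    same prompt. eps (an assumption) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group average receive a positive advantage and the rest a negative one, so a sparse 0/1 reward still yields a usable learning signal whenever the group contains a mix of outcomes.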



Article 783

Title@2025-06-07 (6): Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding

Title: Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding Erweiterung der Input-Label Mapping im In-Context-Lernen mit kontrastiver Dekodierung 加强在与反比下标的学习的文内学习中的输入标标映射 2502.13738v2

Authors (6): Keqin Peng, Liang Ding, Yuanxin Ouyang, Meng Fang, Yancheng Yuan, Dacheng Tao

Large language models (LLMs) excel at a range of tasks through in-context learning (ICL), where only a few task examples guide their predictions. However, prior research highlights that LLMs often overlook input-label mapping information in ICL, relying more on their pre-trained knowledge. To address this issue, we introduce In-Context Contrastive Decoding (ICCD), a novel method that emphasizes input-label mapping by contrasting the output distributions between positive and negative in-context examples. Experiments on 7 natural language understanding (NLU) tasks show that our ICCD method brings consistent and significant improvement (up to +1.8 improvement on average) upon 6 different scales of LLMs without requiring additional training. Our approach is versatile, enhancing performance with various demonstration selection methods, demonstrating its broad applicability and effectiveness. The code and scripts are released at https://github.com/Romainpkq/CD_ICL.
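
The idea of contrasting output distributions from positive and negative in-context examples can be sketched as follows. The abstract does not state the exact combination rule, so the form used here, `z_pos + alpha * (z_pos - z_neg)`, is a generic contrastive-decoding assumption rather than the authors' formula. The function names and the toy logits are likewise hypothetical.

```python
import math
from typing import List


def contrastive_logits(pos: List[float], neg: List[float], alpha: float = 1.0) -> List[float]:
    """Amplify what the correct demonstrations add over the corrupted ones.

    pos: label logits given demonstrations with correct input-label pairs.
    neg: label logits given demonstrations with mismatched labels.
    alpha: contrast strength (assumed hyperparameter); alpha = 0 returns pos.
    """
    return [p + alpha * (p - n) for p, n in zip(pos, neg)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy two-label example: only the second label's logit depends on the mapping.
adjusted = contrastive_logits(pos=[2.0, 1.0], neg=[2.0, 0.0], alpha=1.0)
probs = softmax(adjusted)
```

In this toy case the first label's logit is identical under both prompts, which suggests it comes from prior knowledge, so only the second label is boosted.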



Article 784

Title@2025-06-07 (6): CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models

Title: CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models CVC: Eine groß angelegte chinesische Wertregel Corpus zur Wertausrichtung großer Sprachmodelle CVC: 大型中文大语言模式价值调整大型中国价值规则公司 2506.01495v2

Authors (9): Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng

Ensuring that Large Language Models (LLMs) align with mainstream human values and ethical norms is crucial for the safe and sustainable development of AI. Current value evaluation and alignment are constrained by Western cultural bias and incomplete domestic frameworks reliant on non-native rules; furthermore, the lack of scalable, rule-driven scenario generation methods makes evaluations costly and inadequate across diverse cultural contexts. To address these challenges, we propose a hierarchical value framework grounded in core Chinese values, encompassing three main dimensions, 12 core values, and 50 derived values. Based on this framework, we construct a large-scale Chinese Values Corpus (CVC) containing over 250,000 value rules enhanced and expanded through human annotation. Experimental results show that CVC-guided scenarios outperform direct generation ones in value boundaries and content diversity. In the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven mainstream LLMs preferred CVC-generated options in over 70.5% of cases, while five Chinese human annotators showed an 87.5% alignment with CVC, confirming its universality, cultural relevance, and strong alignment with Chinese values. Additionally, we construct 400,000 rule-based moral dilemma scenarios that objectively capture nuanced distinctions in conflicting value prioritization across 17 LLMs. Our work establishes a culturally-adaptive benchmarking framework for comprehensive value evaluation and alignment, representing Chinese characteristics. All data are available at https://huggingface.co/datasets/Beijing-AISI/CVC, and the code is available at https://github.com/Beijing-AISI/CVC.



Article 785

Title@2025-06-07 (6): Examining the Mental Health Impact of Misinformation on Social Media Using a Hybrid Transformer-Based Approach

Title: Examining the Mental Health Impact of Misinformation on Social Media Using a Hybrid Transformer-Based Approach Untersuchung der psychischen gesundheitlichen Auswirkungen von Fehlinformationen auf soziale Medien mittels eines hybriden Transformer-basierten Ansatzes 利用混合变革办法审查错误信息对社会媒体的心理健康影响 2503.02333v2

Authors (6): Sarvesh Arora, Sarthak Arora, Deepika Kumar, Vallari Agrawal, Vedika Gupta, Dipit Vasdev

Social media has significantly reshaped interpersonal communication, fostering connectivity while also enabling the proliferation of misinformation. The unchecked spread of false narratives has profound effects on mental health, contributing to increased stress, anxiety, and misinformation-driven paranoia. This study presents a hybrid transformer-based approach using a RoBERTa-LSTM classifier to detect misinformation, assess its impact on mental health, and classify disorders linked to misinformation exposure. The proposed models demonstrate accuracy rates of 98.4, 87.8, and 77.3 in detecting misinformation, mental health implications, and disorder classification, respectively. Furthermore, Pearson’s Chi-Squared Test for Independence (p-value = 0.003871) validates the direct correlation between misinformation and deteriorating mental well-being. This study underscores the urgent need for better misinformation management strategies to mitigate its psychological repercussions. Future research could explore broader datasets incorporating linguistic, demographic, and cultural variables to deepen the understanding of misinformation-induced mental health distress.



Article 786

Title@2025-06-07 (6): Learning Distribution-Wise Control in Representation Space for Language Models

Title: Learning Distribution-Wise Control in Representation Space for Language Models Learning Distribution-Wise Control im Repräsentationsraum für Sprachmodelle 语文模式代表空间的学习分配-网络控制 2506.06686v1

Authors (3): Chunyuan Deng, Ruidi Chang, Hanjie Chen

Interventions in language models (LMs) are applied strategically to steer model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace. We demonstrate that these methods perform effectively in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-wise interventions consistently outperform pointwise interventions in controllability and robustness. These results illustrate that distribution-wise interventions provide a more comprehensive method for steering model behavior and enabling finer-grained control over language models. The code is at: https://github.com/chili-lab/D-Intervention.



Article 787

Title@2025-06-07 (6): SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment

Title: SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment SudoLM: Lernzugangskontrolle von parametrischem Wissen mit Autorisierungsausrichtung SudoLM: 学习以统一授权方式对参数知识进行访问控制 2410.14676v3

Authors (4): Qin Liu, Fei Wang, Chaowei Xiao, Muhao Chen

Existing preference alignment is a one-size-fits-all alignment mechanism, where the part of the large language model (LLM) parametric knowledge with non-preferred features is uniformly blocked for all users. However, this part of the knowledge can be useful to advanced users whose expertise qualifies them to handle this information. The one-size-fits-all alignment mechanism undermines the LLM’s utility for these qualified users. To address this problem, we propose SudoLM, a framework that lets LLMs learn access control over specific parametric knowledge for users with different credentials via authorization alignment. SudoLM allows authorized users to unlock their access to all the parametric knowledge with an assigned SUDO key while blocking access for non-qualified users. Experiments on two application scenarios demonstrate that SudoLM effectively controls the user’s access to the parametric knowledge and maintains its general utility.



Article 788

Title@2025-06-07 (6): Quantile Regression with Large Language Models for Price Prediction

Title: Quantile Regression with Large Language Models for Price Prediction Quantile Regression mit großen Sprachmodellen für die Preisvorhersage 具有大语言价格预测模型的量化递减 2506.06657v1

Authors (6): Nikhita Vedula, Dushyanta Dhyani, Laleh Jalali, Boris Oreshkin, Mohsen Bayati, Shervin Malmasi

Large Language Models (LLMs) have shown promise in structured prediction tasks, including regression, but existing approaches primarily focus on point estimates and lack systematic comparison across different methods. We investigate probabilistic regression using LLMs for unstructured inputs, addressing challenging text-to-distribution prediction tasks such as price estimation where both nuanced text understanding and uncertainty quantification are critical. We propose a novel quantile regression approach that enables LLMs to produce full predictive distributions, improving upon traditional point estimates. Through extensive experiments across three diverse price prediction datasets, we demonstrate that a Mistral-7B model fine-tuned with quantile heads significantly outperforms traditional approaches for both point and distributional estimations, as measured by three established metrics each for prediction accuracy and distributional calibration. Our systematic comparison of LLM approaches, model architectures, training approaches, and data scaling reveals that Mistral-7B consistently outperforms encoder architectures, embedding-based methods, and few-shot learning methods. Our experiments also reveal the effectiveness of LLM-assisted label correction in achieving human-level accuracy without systematic bias. Our curated datasets are made available at https://github.com/vnik18/llm-price-quantile-reg/ to support future research.
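
Quantile heads are typically trained with the pinball (quantile) loss, which penalizes under-prediction and over-prediction asymmetrically so that the minimizer is the requested quantile. The sketch below shows that loss in isolation. Whether the paper uses exactly this objective is an assumption, and the function names, quantile levels, and prices are hypothetical.

```python
from typing import Dict


def pinball_loss(y_true: float, y_pred: float, tau: float) -> float:
    """Pinball loss for quantile level tau in (0, 1).

    Under-prediction (y_true > y_pred) costs tau per unit of error;
    over-prediction costs (1 - tau) per unit of error.
    """
    diff = y_true - y_pred
    return max(tau * diff, (tau - 1.0) * diff)


def multi_quantile_loss(y_true: float, preds: Dict[float, float]) -> float:
    """Average pinball loss over several quantile heads (tau -> prediction)."""
    return sum(pinball_loss(y_true, p, tau) for tau, p in preds.items()) / len(preds)


# Hypothetical item with true price 10.0 and three predicted quantiles.
under = pinball_loss(10.0, 8.0, tau=0.9)   # 0.9 quantile predicted too low
over = pinball_loss(10.0, 12.0, tau=0.9)   # 0.9 quantile predicted too high
avg = multi_quantile_loss(10.0, {0.1: 7.0, 0.5: 10.0, 0.9: 13.0})
```

For the 0.9 quantile, predicting 2 units too low costs nine times as much as predicting 2 units too high, which pushes that head toward the upper end of the predictive distribution.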



Article 789

Title@2025-06-07 (6): SafeLawBench: Towards Safe Alignment of Large Language Models

Title: SafeLawBench: Towards Safe Alignment of Large Language Models SafeLawBench: Auf dem Weg zu einer sicheren Ausrichtung großer Sprachmodelle 安全法:实现大语言模式安全一致 2506.06636v1

Authors (10): Chuxue Cao, Han Zhu, Jiaming Ji, Qichao Sun, Zhenghao Zhu, Yinyu Wu, Juntao Dai, Yaodong Yang, Sirui Han, Yike Guo

With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs’ safety evaluation from a legal perspective by proposing the SafeLawBench benchmark. SafeLawBench categorizes safety risks into three levels based on legal standards, providing a systematic and comprehensive framework for evaluation. It comprises 24,860 multi-choice questions and 1,106 open-domain question-answering (QA) tasks. Our evaluation included 2 closed-source LLMs and 18 open-source LLMs using zero-shot and few-shot prompting, highlighting the safety features of each model. We also evaluated the LLMs’ safety-related reasoning stability and refusal behavior. Additionally, we found that a majority voting mechanism can enhance model performance. Notably, even leading SOTA models like Claude-3.5-Sonnet and GPT-4o have not exceeded 80.5% accuracy in multi-choice tasks on SafeLawBench, while the average accuracy of 20 LLMs remains at 68.8%. We urge the community to prioritize research on the safety of LLMs.
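
A majority voting mechanism for multi-choice questions can be sketched in a few lines. The abstract does not say whether votes come from several models or from repeated samples of one model, so this sketch simply takes a list of answer letters. The function name and the tie-breaking rule (earliest answer among the most frequent) are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the earliest one seen."""
    if not answers:
        raise ValueError("majority_vote needs at least one answer")
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:  # scan in order so ties resolve deterministically
        if counts[answer] == top:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")


# Hypothetical votes on one multi-choice question.
winner = majority_vote(["B", "A", "B", "C", "B"])
tie_winner = majority_vote(["A", "B"])
```

Voting can only help when individual errors are not all the same wrong option, which is one reason the gain reported in such setups depends on how diverse the voters are.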



Article 790

Title@2025-06-07 (6): Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

Title: Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning Lehrplan Verstärkung Lernen von einfachen bis harten Aufgaben verbessert LLM Reasoning 从易到困难的工作学习改进LLM 理由 2506.06632v1

Authors (11): Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji

We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method.



Article 791

Title@2025-06-07 (6): LLMs Can Simulate Standardized Patients via Agent Coevolution

Title: LLMs Can Simulate Standardized Patients via Agent Coevolution LLMs können standardisierte Patienten über Agent Coevolution simulieren LLM Can 通过革命代理人模拟标准化病人 2412.11716v2

Authors (10): Zhuoyun Du, Lujie Zheng, Renjun Hu, Yuyang Xu, Xiawei Li, Ying Sun, Wei Chen, Jian Wu, Haolei Cai, Haohao Ying

Training medical personnel using standardized patients (SPs) remains a complex challenge, requiring extensive domain expertise and role-specific practice. Previous research on Large Language Model (LLM)-based SPs mostly focuses on improving data retrieval accuracy or adjusting prompts through human feedback. However, this focus has overlooked the critical need for patient agents to learn a standardized presentation pattern that transforms data into human-like patient responses through unsupervised simulations. To address this gap, we propose EvoPatient, a novel simulated patient framework in which a patient agent and doctor agents simulate the diagnostic process through multi-turn dialogues, simultaneously gathering experience to improve the quality of both questions and answers, ultimately enabling human doctor training. Extensive experiments on various cases demonstrate that, by providing only overall SP requirements, our framework improves over existing reasoning methods by more than 10% in requirement alignment and better human preference, while achieving an optimal balance of resource consumption after evolving over 200 cases for 10 hours, with excellent generalizability. Our system will be available at https://github.com/ZJUMAI/EvoPatient.



Article 792

Title@2025-06-07 (6): Psychological Counseling Cannot Be Achieved Overnight: Automated Psychological Counseling Through Multi-Session Conversations

Title: Psychological Counseling Cannot Be Achieved Overnight: Automated Psychological Counseling Through Multi-Session Conversations Psychologische Beratung kann nicht über Nacht erreicht werden: Automatisierte Psychologische Beratung durch Multi-Session-Konversationen 心理心理咨询不可能在一夜之间达成:通过多次会议对话提供自动心理咨询 2506.06626v1

Authors (6): Junzhe Wang, Bichen Wang, Xing Fu, Yixin Sun, Yanyan Zhao, Bing Qin

In recent years, Large Language Models (LLMs) have made significant progress in automated psychological counseling. However, current research focuses on single-session counseling, which does not represent real-world scenarios. In practice, psychological counseling is a process, not a one-time event, requiring sustained, multi-session engagement to progressively address clients’ issues. To overcome this limitation, we introduce the Multi-Session Psychological Counseling Conversation Dataset (MusPsy-Dataset). MusPsy-Dataset is constructed using real client profiles from publicly available psychological case reports. It captures the dynamic arc of counseling, encompassing multiple progressive counseling conversations from the same client across different sessions. Leveraging our dataset, we also developed MusPsy-Model, which aims to track client progress and adapt its counseling direction over time. Experiments show that our model performs better than baseline models across multiple sessions.



Article 793

Title@2025-06-07 (6): Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

Title: Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps Kann MLLMs mich nach Hause führen? Eine Benchmark-Studie zur feinkörnigen visuellen Vernunft von Transit Maps MLLMM MLLM 指导我回家吗? 关于过境地图的精美视觉依据基准研究 2505.18675v2

Authors (8): Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, Xinchao Wang

Multimodal large language models (MLLMs) have recently achieved significant progress in visual tasks, including semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on complex tasks involving mathematics and logic. However, their capacity for reasoning tasks involving fine-grained visual understanding remains insufficiently evaluated. To address this gap, we introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern: among open-source models, base models outperform reasoning ones, while the opposite trend is observed in closed-source models. Additionally, performance generally degrades when visual inputs are masked, indicating that while MLLMs can leverage prior knowledge to answer some questions, fine-grained visual reasoning tasks still require genuine visual perception for strong performance. Our benchmark study offers new insights into visual reasoning and contributes to investigating the gap between open-source and closed-source models.



Article 794

Title: BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs BriefMe: Ein gesetzlicher NLP-Benchmark für die Unterstützung mit rechtlichen Briefen 简报:协助提供法律简报的《国家劳工规划法》法律基准 2506.06619v1

Authors (4): Jesse Woo, Fateme Hashemi Chaleshtori, Ana Marasović, Kenneth Marino

A core part of legal work that has been under-explored in Legal NLP is the writing and editing of legal briefs. This requires not only a thorough understanding of the law of a jurisdiction, from judgments to statutes, but also the ability to make novel and creative arguments that try to expand the law in a new direction and that are persuasive to judges. To capture and evaluate these legal skills in language models, we introduce BRIEFME, a new dataset focused on legal briefs. It contains three tasks for language models to assist legal professionals in writing briefs: argument summarization, argument completion, and case retrieval. In this work, we describe the creation of these tasks, analyze them, and show how current models perform. We see that today’s large language models (LLMs) are already quite good at the summarization and guided completion tasks, even beating human-generated headings. Yet, they perform poorly on other tasks in our benchmark: realistic argument completion and retrieving relevant legal cases. We hope this dataset encourages more development in Legal NLP in ways that will specifically aid people in performing legal work.



Article 795

Title@2025-06-07 (6): Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Title: Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications Open-FinLLMs: Open Multimodal Large Language Models für Finanzanwendungen Open-FininLLLMs:用于金融应用的开放多式大语言模式 2408.11878v3

Authors (44): Jimin Huang, Mengxi Xiao, Dong Li, Zihao Jiang, Yuzhe Yang, Yifei Zhang, Lingfei Qian, Yan Wang, Xueqing Peng, Yang Ren, Ruoyu Xiang, Zhengyu Chen, Xiao Zhang, Yueru He, Weiguang Han, Shunian Chen, Lihang Shen, Daniel Kim, Yangyang Yu, Yupeng Cao, Zhiyang Deng, Haohang Li, Duanyu Feng, Yongfu Dai, VijayaSai Somasundaram, Peng Lu, Guojun Xiong, Zhiwei Liu, Zheheng Luo, Zhiyuan Yao, Ruey-Ling Weng, Meikang Qiu, Kaleb E Smith, Honghai Yu, Yanzhao Lai, Min Peng, Jian-Yun Nie, Jordan W. Suchow, Xiao-Yang Liu, Benyou Wang, Alejandro Lopez-Lira, Qianqian Xie, Sophia Ananiadou, Junichi Tsujii

Financial LLMs hold promise for advancing financial tasks and domain-specific applications. However, they are limited by scarce corpora, weak multimodal capabilities, and narrow evaluations, making them less suited for real-world application. To address this, we introduce Open-FinLLMs, the first open-source multimodal financial LLMs designed to handle diverse tasks across text, tabular, time-series, and chart data, excelling in zero-shot, few-shot, and fine-tuning settings. The suite includes FinLLaMA, pre-trained on a comprehensive 52-billion-token corpus; FinLLaMA-Instruct, fine-tuned with 573K financial instructions; and FinLLaVA, enhanced with 1.43M multimodal tuning pairs for strong cross-modal reasoning. We comprehensively evaluate Open-FinLLMs across 14 financial tasks, 30 datasets, and 4 multimodal tasks in zero-shot, few-shot, and supervised fine-tuning settings, introducing two new multimodal evaluation datasets. Our results show that Open-FinLLMs outperform advanced financial and general LLMs such as GPT-4 across financial NLP, decision-making, and multi-modal tasks, highlighting their potential to tackle real-world challenges. To foster innovation and collaboration across academia and industry, we release all code (https://anonymous.4open.science/r/PIXIU2-0D70/B1D7/LICENSE) and models under OSI-approved licenses.



Article 796

Title@2025-06-07 (6): Interpretable Depression Detection from Social Media Text Using LLM-Derived Embeddings

Title: Interpretable Depression Detection from Social Media Text Using LLM-Derived Embeddings Interpretable Depression Detection von Social Media Text mit LLM-derived Embeddings 从使用LLM-Derived嵌入器的社会媒体文字中可解释的抑郁症探测 2506.06616v1

Authors (3): Samuel Kim, Oghenemaro Imieye, Yunting Yin

Accurate and interpretable detection of depressive language in social media is useful for early interventions of mental health conditions, and has important implications for both clinical practice and broader public health efforts. In this paper, we investigate the performance of large language models (LLMs) and traditional machine learning classifiers across three classification tasks involving social media data: binary depression classification, depression severity classification, and differential diagnosis classification among depression, PTSD, and anxiety. Our study compares zero-shot LLMs with supervised classifiers trained on both conventional text embeddings and LLM-generated summary embeddings. Our experiments reveal that while zero-shot LLMs demonstrate strong generalization capabilities in binary classification, they struggle with fine-grained ordinal classifications. In contrast, classifiers trained on summary embeddings generated by LLMs demonstrate competitive, and in some cases superior, performance on the classification tasks, particularly when compared to models using traditional text embeddings. Our findings demonstrate the strengths of LLMs in mental health prediction, and suggest promising directions for better utilization of their zero-shot capabilities and context-aware summarization techniques.



Article 797

Title@2025-06-07 (6): Transferring Features Across Language Models With Model Stitching

Title: Transferring Features Across Language Models With Model Stitching Übertragung von Funktionen über Sprachmodelle mit Modellstich 使用模型裁剪的跨语言模型传输功能 2506.06609v1

Authors (4): Alan Chen, Jack Merullo, Alessandro Stolfo, Ellie Pavlick

In this work, we demonstrate that affine mappings between the residual streams of language models are a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn highly similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring them to a larger model at a savings in FLOPs. For example, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
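
As a rough illustration of the affine-mapping idea in the abstract above, the sketch below fits a map between two sets of activations by ordinary least squares. The function name `fit_affine_map`, the synthetic data, and the choice of a closed-form least-squares fit are all assumptions made for illustration; the abstract does not say how the authors fit their mappings.

```python
import numpy as np

def fit_affine_map(src, dst):
    """Least-squares fit of dst ~ src @ W + b (hypothetical helper).

    src: (n_tokens, d_src) residual-stream activations from one model
    dst: (n_tokens, d_dst) activations from another model on the same tokens
    """
    # Append a constant column so the bias is learned jointly with W.
    src_aug = np.hstack([src, np.ones((src.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(src_aug, dst, rcond=None)
    return solution[:-1], solution[-1]

# Synthetic stand-in data: activations related by a known affine map.
rng = np.random.default_rng(0)
true_W = rng.normal(size=(8, 16))
true_b = rng.normal(size=16)
small = rng.normal(size=(200, 8))
large = small @ true_W + true_b

W, b = fit_affine_map(small, large)
max_error = float(np.abs(small @ W + b - large).max())
```

On real activations the fit would not be exact; the residual error is one way to gauge how similar the two representation spaces are.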



Article 798

Title@2025-06-07 (6): Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit

Title: Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit Training-Free Tokenizer Transplantation über Orthogonal Passende Verfolgung 通过矫形配对运动进行无培训的取暖器移植 2506.06607v1

Authors (2): Charles Goddard, Fernando Fernandes Neto

We present a training-free method to transplant tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token as a sparse linear combination of shared tokens, in two phases: first, compute each new token’s representation in the donor embedding space with a small dictionary of shared anchor tokens, then transfer these same sparse coefficients back into the base model’s embedding space. On two challenging cross-tokenizer tasks, Llama→Mistral NeMo (12B) and Qwen→Llama (1B), we show that OMP achieves the best zero-shot preservation of the base model’s performance across multiple benchmarks, while other zero-shot approaches degrade significantly. Compared to baselines (zero-init, mean-init, and existing approaches like WECHSEL, FOCUS, ZETT), OMP consistently achieves the best overall performance, effectively bridging large tokenizer discrepancies without gradient updates. Our analysis further identifies mismatched numerical tokenization schemes as a critical challenge for preserving mathematical reasoning capabilities. This technique enables direct reuse of pretrained model weights with new tokenizers, facilitating cross-tokenizer knowledge distillation, speculative decoding, ensembling, merging, and domain-specific vocabulary adaptations. We integrate our method into the open-source mergekit-tokensurgeon tool for post hoc vocabulary realignment.
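
The two-phase procedure described above can be sketched as follows. This is a minimal textbook OMP, not the mergekit-tokensurgeon implementation; the helper names `omp` and `transplant_embedding`, the toy orthonormal anchors, and the sparsity level `k` are assumptions chosen so the example is easy to follow.

```python
import numpy as np

def omp(dictionary, target, k):
    """Greedy Orthogonal Matching Pursuit (hypothetical helper).

    dictionary: (n_anchors, dim) anchor embeddings, one per row
    target:     (dim,) embedding to approximate
    Returns (indices, coefficients) using at most k anchors.
    """
    residual = target.copy()
    selected = []
    coeffs = np.zeros(0)
    for _ in range(k):
        # Pick the unused anchor most correlated with the current residual.
        scores = np.abs(dictionary @ residual)
        if selected:
            scores[selected] = -np.inf
        selected.append(int(np.argmax(scores)))
        # Re-fit all selected coefficients jointly (the "orthogonal" step).
        basis = dictionary[selected].T
        coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
        residual = target - basis @ coeffs
        if np.linalg.norm(residual) < 1e-10:
            break
    return selected, coeffs

def transplant_embedding(donor_anchors, base_anchors, donor_token, k=3):
    """Phase 1: sparse-code the token in donor space.
    Phase 2: reuse the same coefficients on the base model's anchors."""
    idx, coeffs = omp(donor_anchors, donor_token, k)
    return base_anchors[idx].T @ coeffs

# Toy data: 5 shared anchor tokens, donor dim 5, base dim 4.
donor_anchors = np.eye(5)
rng = np.random.default_rng(0)
base_anchors = rng.normal(size=(5, 4))
new_token_donor = 2.0 * donor_anchors[1] - 3.0 * donor_anchors[3]
new_token_base = transplant_embedding(donor_anchors, base_anchors, new_token_donor)
```

Because the toy donor anchors are orthonormal, the greedy selection recovers the exact sparse code here; with real embeddings the reconstruction is only approximate.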



Article 799

Title@2025-06-07 (6): MedCite: Can Language Models Generate Verifiable Text for Medicine?

Title: MedCite: Can Language Models Generate Verifiable Text for Medicine? MedCite: Können Sprachmodelle verifizierbaren Text für die Medizin generieren? MedCite:语言模型能够产生可核实的医学文本吗? 2506.06605v1

Authors (8): Xiao Wang, Mengjue Tan, Qiao Jin, Guangzhi Xiong, Yu Hu, Aidong Zhang, Zhiyong Lu, Minjia Zhang

Existing LLM-based medical question-answering systems lack citation generation and evaluation capabilities, raising concerns about their adoption in practice. In this work, we introduce MedCite, the first end-to-end framework that facilitates the design and evaluation of citation generation with LLMs for medical tasks. Meanwhile, we introduce a novel multi-pass retrieval-citation method that generates high-quality citations. Our evaluation highlights the challenges and opportunities of citation generation for medical tasks, while identifying important design choices that have a significant impact on the final citation quality. Our proposed method achieves superior citation precision and recall improvements compared to strong baseline methods, and we show that evaluation results correlate well with annotation results from professional experts.



Article 800

Title@2025-06-07 (6): Retrieval-augmented Large Language Models for Financial Time Series Forecasting

Title: Retrieval-augmented Large Language Models for Financial Time Series Forecasting Retrieval-augmentierte große Sprachmodelle für die Prognose der Finanzzeitreihen 财务时间序列预测的检索增强型大语言模型 2502.05878v3

Authors (13): Mengxi Xiao, Zihao Jiang, Lingfei Qian, Zhengyu Chen, Yueru He, Yijing Xu, Yuecheng Jiang, Dong Li, Ruey-Ling Weng, Min Peng, Jimin Huang, Sophia Ananiadou, Qianqian Xie

Accurately forecasting stock price movements is critical for informed financial decision-making, supporting applications ranging from algorithmic trading to risk management. However, this task remains challenging due to the difficulty of retrieving subtle yet high-impact patterns from noisy financial time-series data, where conventional retrieval methods, whether based on generic language models or simplistic numeric similarity, often fail to capture the intricate temporal dependencies and context-specific signals essential for precise market prediction. To bridge this gap, we introduce FinSrag, the first retrieval-augmented generation (RAG) framework with a novel domain-specific retriever FinSeer for financial time-series forecasting. FinSeer leverages a candidate selection mechanism refined by LLM feedback and a similarity-driven training objective to align queries with historically influential sequences while filtering out financial noise. Such training enables FinSeer to identify the most relevant time-series data segments for downstream forecasting tasks, unlike embedding or distance-based retrieval methods used in existing RAG frameworks. The retrieved patterns are then fed into StockLLM, a 1B-parameter LLM fine-tuned for stock movement prediction, which serves as the generative backbone. Beyond the retrieval method, we enrich the retrieval corpus by curating new datasets that integrate a broader set of financial indicators, capturing previously overlooked market dynamics. Experiments demonstrate that FinSeer outperforms existing textual retrievers and traditional distance-based retrieval approaches in enhancing the prediction accuracy of StockLLM, underscoring the importance of domain-specific retrieval frameworks in handling the complexity of financial time-series data.



Article 801

Title@2025-06-07 (6): Scaling Inference-Efficient Language Models

Title: Scaling Inference-Efficient Language Models Skalierung von Schlussfolgerungen-Effiziente Sprachmodelle 扩展率-有效语言模型 2501.18107v2

Authors (3): Song Bian, Minghao Yan, Shivaram Venkataraman

Scaling laws are powerful tools to predict the performance of large language models. However, current scaling laws fall short of accounting for inference costs. In this work, we first show that model architecture affects inference latency, where models of the same size can have up to 3.5x difference in latency. To tackle this challenge, we modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture. Because models of similar training loss exhibit gaps in downstream evaluation, we also propose a novel method to train inference-efficient models based on the revised scaling laws. We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws. We vary model parameters from 80M to 1B, training tokens from 1.6B to 30B, and model shapes, training 63 models. Guided by our inference-efficient scaling law and model selection method, we release the Morph-1B model, which improves inference latency by 1.8x while maintaining accuracy on downstream tasks compared to open-source models, pushing the Pareto frontier of accuracy-latency tradeoff. Notably, our experiments reveal that wider and shallower models can yield efficiency gains while preserving accuracy.
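
For context, the standard Chinchilla parametric loss that this abstract starts from is usually written as below, with N the parameter count, D the number of training tokens, and E, A, B, α, β fitted constants. The abstract does not give the modified, architecture-aware form, so only the baseline is shown here.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Under this form the loss depends on model size only through N, which is why two models with the same parameter count but different shapes (and thus different latency) are indistinguishable to the baseline law.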



Article 802

Title@2025-06-06 (5): Precise Information Control in Long-Form Text Generation

Title: Precise Information Control in Long-Form Text Generation Präzise Informationssteuerung in der langformatigen Textgenerierung 长期制文本生成的精确信息控制 2506.06589v1

Authors (10): Jacqueline He, Howard Yen, Margaret Li, Shuyue Stella Li, Zhiyuan Zeng, Weijia Shi, Yulia Tsvetkov, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer

A central challenge in modern language models (LMs) is intrinsic hallucination: the generation of information that is plausible but unsubstantiated relative to input context. To study this problem, we propose Precise Information Control (PIC), a new task formulation that requires models to generate long-form outputs grounded in a provided set of short self-contained statements, known as verifiable claims, without adding any unsupported ones. For comprehensiveness, PIC includes a full setting that tests a model’s ability to include exactly all input claims, and a partial setting that requires the model to selectively incorporate only relevant claims. We present PIC-Bench, a benchmark of eight long-form generation tasks (e.g., summarization, biography generation) adapted to the PIC setting, where LMs are supplied with well-formed, verifiable input claims. Our evaluation of a range of open and proprietary LMs on PIC-Bench reveals that, surprisingly, state-of-the-art LMs still intrinsically hallucinate in over 70% of outputs. To alleviate this lack of faithfulness, we introduce a post-training framework, using a weakly supervised preference data construction method, to train an 8B PIC-LM with stronger PIC ability, improving from 69.1% to 91.0% F1 in the full PIC setting. When integrated into end-to-end factual generation pipelines, PIC-LM improves exact match recall by 17.1% on ambiguous QA with retrieval, and factual precision by 30.5% on a birthplace verification task, underscoring the potential of precisely grounded generation.



Article 803

Title@2025-06-06 (5): Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques

Title: Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques Auf dem Weg zu einer effizienten Multi-LLM-Inferenz: Charakterisierung und Analyse von LLM Routing und Hierarchical Techniques 实现高效的多LLLM 多级LLM推理:LLM转流和等级技术的特征和分析 2506.06579v1

Authors (5): Adarsh Prasad Behera, Jaya Prakash Champati, Roberto Morabito, Sasu Tarkoma, James Gross

Recent progress in Language Models (LMs) has dramatically advanced the field of natural language processing (NLP), excelling at tasks like text generation, summarization, and question answering. However, their inference remains computationally expensive and energy-intensive, especially in settings with limited hardware, power, or bandwidth. This makes it difficult to deploy LMs in mobile, edge, or cost-sensitive environments. To address these challenges, recent approaches have introduced multi-LLM intelligent model selection strategies that dynamically allocate computational resources based on query complexity, using lightweight models for simpler queries and escalating to larger models only when necessary. This survey explores two complementary strategies for efficient LLM inference: (i) routing, which selects the most suitable model based on the query, and (ii) cascading or hierarchical inference (HI), which escalates queries through a sequence of models until a confident response is found. Both approaches aim to reduce computation by using lightweight models for simpler tasks while offloading only when needed. We provide a comparative analysis of these techniques across key performance metrics, discuss benchmarking efforts, and outline open challenges. Finally, we outline future research directions to enable faster response times, adaptive model selection based on task complexity, and scalable deployment across heterogeneous environments, making LLM-based systems more efficient and accessible for real-world applications.
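
The difference between the two strategies can be sketched with stand-in models. Everything here is hypothetical: `small_model`, `large_model`, the word-count heuristic, and the 0.8 confidence threshold are placeholders, not techniques taken from the survey.

```python
def small_model(query):
    # Stand-in for a lightweight model: confident only on short queries.
    confidence = 0.9 if len(query.split()) <= 5 else 0.3
    return "small:" + query, confidence

def large_model(query):
    # Stand-in for an expensive model that is always confident.
    return "large:" + query, 0.95

def route(query, models, pick):
    """Routing: choose ONE model up front based on the query alone."""
    answer, _ = models[pick(query)](query)
    return answer

def cascade(query, models, threshold=0.8):
    """Cascading / hierarchical inference: try models from cheapest to most
    expensive and stop at the first sufficiently confident answer."""
    answer, level = "", -1
    for level, model in enumerate(models):
        answer, confidence = model(query)
        if confidence >= threshold:
            break
    return answer, level

models = [small_model, large_model]
easy = cascade("What is 2 + 2?", models)
hard = cascade("Summarize the main argument of this long legal document", models)
routed = route("Summarize the main argument of this long legal document",
               models, pick=lambda q: 0 if len(q.split()) <= 5 else 1)
```

Routing pays for one model call per query but depends on the quality of the up-front choice; cascading may pay for several calls but only escalates when the cheaper answer is not trusted.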



Article 804

Title@2025-06-06 (5): SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions

Title: SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions SATA-BENCH: Wählen Sie alle, die Benchmark für Multiple-Choice-Fragen anwenden SATA-BENCH: 选择多个选择问题的所有应用基准 2506.00643v2

Authors (6): Weijie Xu, Shixian Cui, Xi Fang, Chi Xue, Stephanie Eckman, Chandan K. Reddy

Large language models (LLMs) are increasingly evaluated on single-answer multiple-choice tasks, yet many real-world problems require identifying all correct answers from a set of options. This capability remains underexplored. We introduce SATA-BENCH, the first dedicated benchmark for evaluating LLMs on Select All That Apply (SATA) questions across diverse domains, including reading comprehension, law, and biomedicine. Our evaluation of 27 open-source and proprietary models reveals a significant gap: even the strongest model achieves only 41.8% exact match, exposing LLMs’ inability to reliably identify all correct answers. We find that this weakness stems from two core challenges: selection bias (models favor certain choices regardless of content) and count bias (models fail to predict the correct number of answers). To address these issues, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding to guide models toward complete and accurate selections. Choice Funnel achieves up to 29% higher exact match than competitive baselines while reducing inference cost by over 64%. Our findings expose fundamental limitations in current LLMs and introduce a new framework for diagnosing and improving multi-answer reasoning. We release SATA-BENCH and Choice Funnel to promote LLM development for robust decision-making in realistic, multi-answer applications.
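
To make the exact-match criterion concrete, here is a small sketch of how a select-all-that-apply answer might be scored. The function names and the simple "count gap" diagnostic are illustrative assumptions; the benchmark's own metric definitions may differ.

```python
def exact_match_rate(predictions, golds):
    """Fraction of questions where the predicted option set equals the gold set."""
    hits = sum(set(p) == set(g) for p, g in zip(predictions, golds))
    return hits / len(golds)

def mean_count_gap(predictions, golds):
    """Average of (number predicted - number correct); negative values mean
    the model tends to select too few options."""
    gaps = [len(set(p)) - len(set(g)) for p, g in zip(predictions, golds)]
    return sum(gaps) / len(gaps)

# Toy example with three questions.
predictions = [{"A", "C"}, {"B"}, {"A", "B", "D"}]
golds = [{"A", "C"}, {"B", "D"}, {"A", "B", "D"}]

em = exact_match_rate(predictions, golds)
gap = mean_count_gap(predictions, golds)
```

In the toy data the second prediction selects a correct option but misses another, so it earns no exact-match credit; this all-or-nothing scoring is what makes the metric demanding.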



Article 805

Title@2025-06-06 (5): DISC: DISC: Dynamic Decomposition Improves LLM Inference Scaling

Title: DISC: DISC: Dynamic Decomposition Improves LLM Inference Scaling DISC: DISC: Dynamische Zersetzung verbessert LLM-Inferenzskalierung DISC: DISC: 动态分解改善LLM 推推法的扩大 2502.16706v2

Authors (9): Jonathan Light, Wei Cheng, Benjamin Riviere, Wu Yue, Masafumi Oyamada, Mengdi Wang, Yisong Yue, Santiago Paternain, Haifeng Chen

Inference scaling methods for large language models often work by breaking problems into steps or groups of tokens, then sampling and selecting the best next steps. However, these steps and their sizes are usually fixed or manually designed based on domain knowledge. We introduce dynamic decomposition, a method that adaptively and automatically breaks down solution and reasoning traces into manageable steps during inference. By allocating compute more effectively, especially by subdividing difficult steps and prioritizing their sampling, dynamic decomposition significantly boosts inference efficiency. Experiments on benchmarks like APPS, MATH, and LiveCodeBench show that dynamic decomposition outperforms fixed strategies such as token-level, sentence-level, and single-step decompositions, reducing the pass@10 error rate by 5.0%, 6.7%, and 10.5% respectively. These results show the promise of dynamic decomposition for improving a broad range of inference scaling techniques.



Article 806

Title@2025-06-06 (5): Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

Title: Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective Auf dem Weg zur Analyse und dem Verständnis der Grenzen von VAPO: Eine theoretische Perspektive 分析和理解VAPO的局限性:理论视角 2506.03038v2

Authors (2): Jintian Shao, Yiming Cheng

Reinforcement learning (RL) enhances large language models (LLMs) in complex, long-chain-of-thought (long-CoT) reasoning. The advanced VAPO framework, despite sophisticated mechanisms like Decoupled GAE, theoretically faces fundamental limitations in comprehensively modeling and leveraging deep, long-term value for fine-grained, step-by-step policy guidance in extended reasoning chains. We argue these limitations stem from inherent difficulties in credit assignment, value function representational capacity with temporally abstracted goals, and translating global value signals into local policy improvements, especially with sparse rewards. Our theoretical analysis examines these aspects to illuminate VAPO’s boundaries in long-term value modeling, aiming to deepen understanding of current RL for advanced reasoning and suggest future research for more robust LLM agents.



Article 807

Title@2025-06-06 (5): ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models

Title: ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models ThinkEdit: Verdolmetschbare Gewichte bearbeiten, um übermäßig kurzes Denken in vernünftigen Modellen zu vermischen 思考Edit:将可解释的重量编辑为 “ 解释性模型 “ 中模拟超短思考 2503.22048v3

Authors (3): Chung-En Sun, Ge Yan, Tsui-Wei Weng

Recent studies have shown that Large Language Models (LLMs) augmented with chain-of-thought (CoT) reasoning demonstrate impressive problem-solving abilities. However, in this work, we identify a recurring issue where these models occasionally generate overly short reasoning, leading to degraded performance on even simple mathematical problems. Specifically, we investigate how reasoning length is embedded in the hidden representations of reasoning models and its impact on accuracy. Our analysis reveals that reasoning length is governed by a linear direction in the representation space, allowing us to induce overly short reasoning by steering the model along this direction. Building on this insight, we introduce ThinkEdit, a simple yet effective weight-editing approach to mitigate the issue of overly short reasoning. We first identify a small subset of attention heads (approximately 4%) that predominantly drive short reasoning behavior. We then edit the output projection weights of these heads to remove the short reasoning direction. With changes to only 0.2% of the model’s parameters, ThinkEdit effectively reduces overly short reasoning and yields notable accuracy gains for short reasoning outputs (+6.39%), along with an overall improvement across multiple math benchmarks (+3.34%). Our findings provide new mechanistic insights into how reasoning length is controlled within LLMs and highlight the potential of fine-grained model interventions to improve reasoning quality. Our code is available at: https://github.com/Trustworthy-ML-Lab/ThinkEdit
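
One common way to "remove a direction" from a projection matrix is to project its outputs onto the subspace orthogonal to that direction. The sketch below shows that generic operation on random data; the function name `remove_direction`, the weight layout (rows index head dimensions, columns index the residual stream), and the random direction are assumptions, and the released ThinkEdit code may do this differently.

```python
import numpy as np

def remove_direction(weight, direction):
    """Return an edited weight whose outputs have no component along `direction`.

    weight:    (d_head, d_model) output projection, applied as h @ weight
    direction: (d_model,) direction in the residual stream to remove
    """
    unit = direction / np.linalg.norm(direction)
    # Subtract each row's component along `unit`: W' = W (I - u u^T).
    return weight - np.outer(weight @ unit, unit)

rng = np.random.default_rng(0)
W_o = rng.normal(size=(4, 12))          # toy output projection for one head
short_dir = rng.normal(size=12)         # stand-in for the short-reasoning direction
W_edited = remove_direction(W_o, short_dir)

head_out = rng.normal(size=(10, 4))     # toy per-token head activations
unit = short_dir / np.linalg.norm(short_dir)
leak = float(np.abs((head_out @ W_edited) @ unit).max())
```

After the edit, whatever the head outputs, its contribution to the residual stream has (numerically) zero projection onto the removed direction, while components orthogonal to it are unchanged.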



Article 808

Title@2025-06-06 (5): LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles

Title: LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles LaMP-Cap: Personalisierte Bildunterschriftserstellung mit multimodalen Bildprofilen LaMP-Cap: 具有多模式图解的个人化图解生成 2506.06561v1

Authors (11): Ho Yin ‘Sam’ Ng, Ting-Yao Hsu, Aashish Anantha Ramakrishnan, Branislav Kveton, Nedim Lipka, Franck Dernoncourt, Dongwon Lee, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ting-Hao ‘Kenneth’ Huang

Figure captions are crucial for helping readers understand and remember a figure’s key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain’s style, highlighting the need for personalization. Despite language models’ personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document, each with its image, caption, and figure-mentioning paragraphs, as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.



Article 809

Title@2025-06-06 (5): CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective

Title: CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective CoT ist nicht wahr Vernunft, es ist nur eine enge Beschränkung zu nachahmen: Eine Theorie Perspektive COT不是真实的理由,这只是一个难以想象的束缚:理论观点 2506.02878v2

Authors (2): Jintian Shao, Yiming Cheng

Chain-of-Thought (CoT) prompting has demonstrably enhanced the performance of Large Language Models on tasks requiring multi-step inference. This success has led to widespread claims of emergent reasoning capabilities in these models. In this paper, we present a theoretical counter-perspective: Chain-of-Thought (CoT) does not elicit genuine, abstract reasoning. Instead, we argue that Chain-of-Thought functions as a powerful structural constraint that guides Large Language Models to imitate the form of reasoning. By forcing the generation of intermediate steps, Chain-of-Thought leverages the model's immense capacity for sequence prediction and pattern matching, effectively constraining its output to sequences that resemble coherent thought processes.



Article 810

Title@2025-06-06 (5): A UD Treebank for Bohairic Coptic

Title: A UD Treebank for Bohairic Coptic Eine UD-Baumbank für bohairische koptische Bohairic Copth 的UD树库 2504.18386v2

Authors (4): Amir Zeldes, Nina Speransky, Nicholas Wagner, Caroline T. Schroeder

Despite recent advances in digital resources for other Coptic dialects, especially Sahidic, Bohairic Coptic, the main Coptic dialect for pre-Mamluk, late Byzantine Egypt, and the contemporary language of the Coptic Church, remains critically under-resourced. This paper presents and evaluates the first syntactically annotated corpus of Bohairic Coptic, sampling data from a range of works, including Biblical text, saints’ lives and Christian ascetic writing. We also explore some of the main differences we observe compared to the existing UD treebank of Sahidic Coptic, the classical dialect of the language, and conduct joint and cross-dialect parsing experiments, revealing the unique nature of Bohairic as a related, but distinct variety from the more often studied Sahidic.



Article 811

Title@2025-06-06 (5): LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

Title: LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs LabSafety Bench: Benchmarking LLMs zu Sicherheitsfragen in wissenschaftlichen Laboren 实验室安全法官:科学实验室安全问题LLM女士基准基准 2410.14182v3

Authors (15): Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang

Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. While large language models (LLMs) increasingly assist in tasks ranging from procedural guidance to autonomous experiment orchestration, an “illusion of understanding” may lead researchers to overestimate their reliability. Such overreliance is particularly dangerous in high-stakes laboratory settings, where failures in hazard identification or risk assessment can result in severe accidents. To address these concerns, we propose the Laboratory Safety Benchmark (LabSafety Bench), a comprehensive framework that evaluates large language models and vision language models (VLMs) on their ability to identify potential hazards, assess risks, and predict the consequences of unsafe actions in lab environments. LabSafety Bench comprises 765 multiple-choice questions aligned with US Occupational Safety and Health Administration (OSHA) protocols, along with 404 realistic laboratory scenarios featuring dual evaluation tasks: the Hazards Identification Test and the Consequence Identification Test, with 3128 open-ended questions in total. Evaluations across eight proprietary models, seven open-weight LLMs, and four VLMs reveal that, despite advanced performance on structured assessments, no model achieves the safety threshold required for reliable operation – none scoring above 70% on the Hazards Identification Test. Moreover, while proprietary models tend to excel in multiple-choice evaluations, their performance in open-ended, real-world scenario responses is comparable to that of open-source models. These findings underscore the urgent need for specialized evaluation frameworks to ensure the safe and responsible deployment of AI in laboratory settings.



Article 812

Title@2025-06-06 (5): Large Language Models Can Be a Viable Substitute for Expert Political Surveys When a Shock Disrupts Traditional Measurement Approaches

Title: Large Language Models Can Be a Viable Substitute for Expert Political Surveys When a Shock Disrupts Traditional Measurement Approaches Große Sprachmodelle können ein lebensfähiger Ersatz für politische Expertenbefragungen sein, wenn ein Schock traditionelle Messansätze ausbricht 当震震中断传统计量方法时,大语言模型可以成为专家政治调查的可行替代物 2506.06540v1

Authors (1): Patrick Y. Wu

After a disruptive event or shock, such as the Department of Government Efficiency (DOGE) federal layoffs of 2025, expert judgments are colored by knowledge of the outcome. This can make it difficult or impossible to reconstruct the pre-event perceptions needed to study the factors associated with the event. This position paper argues that large language models (LLMs), trained on vast amounts of digital media data, can be a viable substitute for expert political surveys when a shock disrupts traditional measurement. We analyze the DOGE layoffs as a specific case study for this position. We use pairwise comparison prompts with LLMs and derive ideology scores for federal executive agencies. These scores replicate pre-layoff expert measures and predict which agencies were targeted by DOGE. We also use this same approach and find that the perceptions of certain federal agencies as knowledge institutions predict which agencies were targeted by DOGE, even when controlling for ideology. This case study demonstrates that using LLMs allows us to rapidly and easily test the associated factors hypothesized behind the shock. More broadly, our case study of this recent event exemplifies how LLMs offer insights into the correlational factors of the shock when traditional measurement techniques fail. We conclude by proposing a two-part criterion for when researchers can turn to LLMs as a substitute for expert political surveys.



Article 813

Title@2025-06-06 (5): Beyond Facts: Evaluating Intent Hallucination in Large Language Models

Title: Beyond Facts: Evaluating Intent Hallucination in Large Language Models Beyond Facts: Intent Halluzination in großen Sprachmodellen bewerten 超越事实:评价大语言模型中内在的幻觉 2506.06539v1

Authors (3): Yijie Hao, Haofei Yu, Jiaxuan You

When exposed to complex queries containing multiple conditions, today’s large language models (LLMs) tend to produce responses that only partially satisfy the query while neglecting certain conditions. We therefore introduce the concept of Intent Hallucination. In this phenomenon, LLMs either omit (neglecting to address certain parts) or misinterpret (responding to invented query parts) elements of the given query, leading to intent-hallucinated generation. To systematically evaluate intent hallucination, we introduce FAITHQA, a novel benchmark for intent hallucination that contains 20,068 problems, covering both query-only and retrieval-augmented generation (RAG) setups with varying topics and difficulty. FAITHQA is the first hallucination benchmark that goes beyond factual verification, tailored to identify the fundamental cause of intent hallucination. By evaluating various LLMs on FAITHQA, we find that (1) intent hallucination is a common issue even for state-of-the-art models, and (2) the phenomenon stems from omission or misinterpretation by LLMs. To facilitate future research, we introduce an automatic LLM generation evaluation metric, CONSTRAINT SCORE, for detecting intent hallucination. Human evaluation results demonstrate that CONSTRAINT SCORE is closer to human performance for intent hallucination compared to baselines.



Article 814

Title@2025-06-06 (5): NeoBERT: A Next-Generation BERT

Title: NeoBERT: A Next-Generation BERT NeoBERT: Eine nächste Generation BERT NEEBERT: 下一轮BERT 2502.19587v2

Authors (5): Lola Le Breton, Quentin Fournier, Mariam El Mezouar, John X. Morris, Sarath Chandar

Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.



Article 815

Title@2025-06-06 (5): JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

Title: JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models JailbreakLens: Visuelle Analyse von Jailbreak Attacken gegen große Sprachmodelle 《监狱破纪录:对针对大语言模式的越狱袭击的视觉分析》 2404.08793v2

Authors (8): Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Haoyu Tian, Wei Zhang, Minfeng Zhu, Wei Chen

The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs’ defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and understanding prompt characteristics makes this analysis laborious. We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate performance evaluation and support analysis of components and keywords in prompts. Based on the framework, we design JailbreakLens, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system’s effectiveness in helping users evaluate model security and identify model weaknesses.



Article 816

Title@2025-06-06 (5): Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

Title: Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance Fixing it in Post: Eine vergleichende Studie der LLM Nachschulung Datenqualität und Modellleistung 在员额中加以确定:LLM培训后数据质量和示范业绩比较研究 2506.06522v1

Authors (6): Aladin Djuhera, Swanand Ravindra Kadhe, Syed Zawad, Farhan Ahmed, Heiko Ludwig, Holger Boche

Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.



Article 817

Title@2025-06-06 (5): Diversity Explains Inference Scaling Laws: Through a Case Study of Minimum Bayes Risk Decoding

Title: Diversity Explains Inference Scaling Laws: Through a Case Study of Minimum Bayes Risk Decoding Diversität erklärt Folgerungen Skalierungsgesetze: Durch eine Fallstudie von Minimum Bayes Risikodekodierung 多样性解释推推推法的扩大法:通过对最低贝系风险解码的个案研究 2410.15021v2

Authors (5): Hidetaka Kamigaito, Hiroyuki Deguchi, Yusuke Sakai, Katsuhiko Hayashi, Taro Watanabe

Inference methods play an important role in eliciting the performance of large language models (LLMs). Currently, LLMs use inference methods that draw on multiple generated samples, which can be derived from Minimum Bayes Risk (MBR) decoding. Previous studies have conducted empirical analyses to clarify the improvements in generation performance achieved by MBR decoding and have reported various observations. However, the theoretical underpinnings of these findings remain uncertain. To address this, we offer a new theoretical interpretation of MBR decoding from the perspective of bias-diversity decomposition. In this interpretation, the error in the quality estimation of hypotheses by MBR decoding is decomposed into two main factors: bias, which captures the closeness between the utility function and human evaluation, and diversity, which represents the variability in the quality estimation of the utility function. The theoretical analysis reveals the difficulty of simultaneously improving bias and diversity, confirming the validity of enhancing MBR decoding performance by increasing diversity. Furthermore, we reveal that diversity can explain one aspect of inference scaling laws that describe performance improvement by increasing sample size. Moreover, experiments across multiple NLP tasks yielded results consistent with these theoretical characteristics. Our code is available at https://github.com/naist-nlp/mbr-bias-diversity.
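
For readers unfamiliar with MBR decoding, the following is a minimal sketch of the sample-based variant, in which each generated sample is scored by its average utility against the other samples and the highest-scoring one is returned. The function names and the unigram-overlap utility are illustrative assumptions, not taken from the paper or its repository; practical systems typically use a learned or task-specific utility.

```python
from collections import Counter


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility (an assumption): unigram-overlap F1 on whitespace tokens."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(samples, utility=unigram_f1):
    """Return the sample with the highest mean utility against the other samples."""
    if len(samples) == 1:
        return samples[0]

    def expected_utility(i):
        # The remaining samples act as pseudo-references for hypothesis i.
        others = [s for j, s in enumerate(samples) if j != i]
        return sum(utility(samples[i], o) for o in others) / len(others)

    best = max(range(len(samples)), key=expected_utility)
    return samples[best]


samples = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat sat on a mat",
    "dogs bark loudly",
]
print(mbr_decode(samples))
```

In this toy run the outlier sample shares no words with the others, so it receives zero expected utility and is never selected. Drawing more samples enlarges the pseudo-reference pool, which is the sample-size axis the abstract connects to inference scaling.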



Article 818

Title@2025-06-06 (5): Biases Propagate in Encoder-based Vision-Language Models: A Systematic Analysis From Intrinsic Measures to Zero-shot Retrieval Outcomes

Title: Biases Propagate in Encoder-based Vision-Language Models: A Systematic Analysis From Intrinsic Measures to Zero-shot Retrieval Outcomes Biasen propagate in Encoder-basierte Vision-Sprache Modelle: Eine systematische Analyse von intrinsischen Maßnahmen zu Zero-shot Retrieval Ergebnisse 以编码器为基础的视觉-语言模型中的Biases Proplagation:从内在措施到零射回收结果的系统分析 2506.06506v1

Authors (4): Kshitish Ghate, Tessa Charlesworth, Mona Diab, Aylin Caliskan

To build fair AI systems, we need to understand how social-group biases intrinsic to foundational encoder-based vision-language models (VLMs) manifest in biases in downstream tasks. In this study, we demonstrate that intrinsic biases in VLM representations systematically “carry over” or propagate into zero-shot retrieval tasks, revealing how deeply rooted biases shape a model’s outputs. We introduce a controlled framework to measure this propagation by correlating (a) intrinsic measures of bias in the representational space with (b) extrinsic measures of bias in zero-shot text-to-image (TTI) and image-to-text (ITT) retrieval. Results show substantial correlations between intrinsic and extrinsic bias, with an average $\rho = 0.83 \pm 0.10$. This pattern is consistent across 114 analyses, both retrieval directions, six social groups, and three distinct VLMs. Notably, we find that larger/better-performing models exhibit greater bias propagation, a finding that raises concerns given the trend towards increasingly complex AI models. Our framework introduces baseline evaluation tasks to measure the propagation of group and valence signals. Investigations reveal that underrepresented groups experience less robust propagation, further skewing their model-related outcomes.



Article 819

Title@2025-06-06 (5): Improving LLM-Powered EDA Assistants with RAFT

Title: Improving LLM-Powered EDA Assistants with RAFT Verbesserung der LLM-Powered EDA-Assistenten mit RAFT 改进与RAFT合作的LLM授权的EDA助理 2506.06500v1

Authors (4): Luyao Shi, Michael Kazda, Charles Schmitter, Hemlata Gupta

Electronic design engineers often struggle to efficiently access relevant information for tasks like design verification and technology development. While large language models (LLMs) can enhance productivity as conversational agents, pre-trained open-source LLMs lack domain-specific knowledge for Electronic Design Automation (EDA). In a Retrieval-Augmented Generation (RAG) context, LLMs rely on external context but may still produce inaccurate responses. Retrieval-Augmented Fine-Tuning (RAFT) improves LLM performance, but acquiring labeled question/answer (Q/A) data in EDA is difficult. To address this, we propose using synthetic Q/A datasets to enhance LLMs with RAFT. Our results show that RAFT with synthetic data significantly boosts LLM performance for RAG-based EDA tasks. We also investigate the impact of using real user questions as Retrieval-Augmented Few-Shot (RAFS) examples for synthetic data generation. Additionally, we implement secure access control to ensure sensitive information is only accessible to authorized personnel. Finally, we assess the risk of data leakage and unintended memorization during fine-tuning with synthetic data, providing practical insights.



Article 820

Title@2025-06-06 (5): Synthetic Text Generation for Training Large Language Models via Gradient Matching

Title: Synthetic Text Generation for Training Large Language Models via Gradient Matching Synthetische Textgenerierung für das Training großer Sprachmodelle über Gradient Matching 通过 “ 渐级匹配 “ 培训大语言模型的合成文本生成 2502.17607v2

Authors (6): Dang Nguyen, Zeman Li, Mohammadhossein Bateni, Vahab Mirrokni, Meisam Razaviyayn, Baharan Mirzasoleiman

Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristics and cannot generate human-readable text without compromising the privacy of real data, or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage Alternating Direction Method of Multipliers (ADMM) that iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text guarantees convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data and preserves their privacy. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available at https://github.com/BigML-CS-UCLA/GRADMM.



Article 821

Title@2025-06-06 (5): Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws

Title: Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws Diversity of Transformer Layers: Ein Aspekt der Parameterskalierungsgesetze 变形层多样性:参数面积法的一个方面 2505.24009v2

Authors (6): Hidetaka Kamigaito, Ying Zhang, Jingun Kwon, Katsuhiko Hayashi, Manabu Okumura, Taro Watanabe

Transformers deliver outstanding performance across a wide range of tasks and are now a dominant backbone architecture for large language models (LLMs). Their task-solving performance is improved by increasing parameter size, as shown in recent studies on parameter scaling laws. Although recent mechanistic-interpretability studies have deepened our understanding of the internal behavior of Transformers by analyzing their residual stream, the relationship between these internal mechanisms and the parameter scaling laws remains unclear. To bridge this gap, we focus on layers and their size, which mainly determine the parameter size of Transformers. For this purpose, we first theoretically investigate the layers within the residual stream through a bias-diversity decomposition. The decomposition separates (i) bias, the error of each layer’s output from the ground truth, and (ii) diversity, which indicates how much the outputs of each layer differ from each other. Analyzing Transformers under this theory reveals that performance improves when individual layers make predictions close to the correct answer and remain mutually diverse. We show that diversity becomes especially critical when individual layers’ outputs are far from the ground truth. Finally, we introduce an information-theoretic diversity and show our main finding: adding layers enhances performance only when those layers behave differently, i.e., are diverse. We also reveal that the performance gains from increasing the number of layers exhibit submodularity: marginal improvements diminish as the number of layers grows, mirroring the logarithmic convergence predicted by the parameter scaling laws. Experiments on multiple semantic-understanding tasks with various LLMs empirically confirm the theoretical properties derived in this study.
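
As a point of reference for the bias-diversity idea, the classical ambiguity decomposition for an averaged ensemble is shown below. This is an assumption about the general shape of such a decomposition, not the paper's exact formulation: the symbols $f_l$ (output of layer $l$), $\bar{f}$ (their mean), $y$ (ground truth), and $L$ (number of layers) are introduced here for illustration, and a residual stream sums rather than averages layer outputs.

$$
(\bar{f} - y)^2 \;=\; \underbrace{\frac{1}{L}\sum_{l=1}^{L} (f_l - y)^2}_{\text{bias term}} \;-\; \underbrace{\frac{1}{L}\sum_{l=1}^{L} (f_l - \bar{f})^2}_{\text{diversity term}}, \qquad \bar{f} = \frac{1}{L}\sum_{l=1}^{L} f_l .
$$

Under this identity, the combined error falls either when individual outputs move closer to $y$ or when they spread further around their mean, which matches the abstract's claim that layers help most when they are both accurate and mutually diverse.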



Article 822

Title@2025-06-06 (5): What Is Seen Cannot Be Unseen: The Disruptive Effect of Knowledge Conflict on Large Language Models

Title: What Is Seen Cannot Be Unseen: The Disruptive Effect of Knowledge Conflict on Large Language Models Was gesehen wird, kann nicht ungesehen werden: Die disruptive Wirkung von Wissenskonflikten auf große Sprachmodelle ” 所见之见不可不见:知识冲突对大语言模式的破坏效应 “ 2506.06485v1

Authors (3): Kaiser Sun, Fan Bai, Mark Dredze

Large language models frequently rely on both contextual input and parametric knowledge to perform tasks. However, these sources can come into conflict, especially when retrieved documents contradict the model’s parametric knowledge. We propose a diagnostic framework to systematically evaluate LLM behavior under context-memory conflict, where the contextual information diverges from their parametric beliefs. We construct diagnostic data that elicit these conflicts and analyze model performance across multiple task types. Our findings reveal that (1) knowledge conflict has minimal impact on tasks that do not require knowledge utilization, (2) model performance is consistently higher when contextual and parametric knowledge are aligned, (3) models are unable to fully suppress their internal knowledge even when instructed, and (4) providing rationales that explain the conflict increases reliance on contexts. These insights raise concerns about the validity of model-based evaluation and underscore the need to account for knowledge conflict in the deployment of LLMs.



Article 823

Title@2025-06-06 (5): Large Language Models Often Know When They Are Being Evaluated

Title: Large Language Models Often Know When They Are Being Evaluated Große Sprachmodelle kennen oft, wenn sie bewertet werden 大语言模型经常知道何时被评估 2505.23836v2

Authors (5): Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, Marius Hobbhahn

If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could have systematically different behavior during evaluations, leading to less reliable benchmarks for deployment and governance decisions. We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness. To achieve this, we construct a diverse benchmark of 1,000 prompts and transcripts from 61 distinct datasets. These span public benchmarks (e.g., MMLU, SWEBench), real-world deployment interactions, and agent trajectories from scaffolding frameworks (e.g., web-browsing agents). Frontier models clearly demonstrate above-random evaluation awareness (Gemini-2.5-Pro reaches an AUC of $0.83$), but do not yet surpass our simple human baseline (AUC of $0.92$). Furthermore, both AI models and humans are better at identifying evaluations in agentic settings compared to chat settings. Additionally, we test whether models can identify the purpose of the evaluation. Under multiple-choice and open-ended questioning, AI models far outperform random chance in identifying what an evaluation is testing for. Our results indicate that frontier models already exhibit a substantial, though not yet superhuman, level of evaluation-awareness. We recommend tracking this capability in future models.
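
The AUC figures quoted above can be read as the probability that a randomly chosen evaluation transcript receives a higher "this is an evaluation" score than a randomly chosen deployment transcript. The sketch below computes that quantity directly; the scores and labels are made-up illustrative values, and the function name is hypothetical rather than part of the paper's code.

```python
def auc(scores, labels):
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores; label 1 = evaluation transcript, 0 = deployment.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.7]
labels = [1, 1, 1, 0, 0, 0]
print(round(auc(scores, labels), 3))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is why the reported values sit between chance and the human baseline.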



Article 824

Title@2025-06-06 (5): BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning

Title: BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning BRiTE: Bootstrapping verstärkter Denkprozess zur Verbesserung von Sprachmodellen Brite:强化强化思考程序以加强语言示范理由 2501.18858v2

Authors (13): Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, Liwei Wang, Mingyi Hong, Zhaoran Wang

Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet generating reliable reasoning processes remains a significant challenge. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model incorporating latent thinking processes and evaluation signals. Within this framework, we introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps. First, it generates high-quality rationales by approximating the optimal thinking process through reinforcement learning, using a novel reward shaping mechanism. Second, it enhances the base LLM by maximizing the joint probability of rationale generation with respect to the model’s parameters. Theoretically, we demonstrate BRiTE’s convergence at a rate of $1/T$, with $T$ representing the number of iterations. Empirical evaluations on math and coding benchmarks demonstrate that our approach consistently improves performance across different base models without requiring human-annotated thinking processes. In addition, BRiTE demonstrates superior performance compared to existing algorithms that bootstrap thinking processes using alternative methods such as rejection sampling, and can even match or exceed the results achieved through supervised fine-tuning with human-annotated data.



Article 825

Title@2025-06-06 (5): Latent Thought Models with Variational Bayes Inference-Time Computation

Title: Latent Thought Models with Variational Bayes Inference-Time Computation Latent Thought Models mit Variational Bayes Inferenz-Zeit-Computation 具有变异性贝湾推推算-时间计算法的古老思想模型 2502.01567v2

Authors (11): Deqian Kong, Minglu Zhao, Dehong Xu, Bo Pang, Shu Wang, Edouardo Honig, Zhangzhang Si, Chuan Li, Jianwen Xie, Sirui Xie, Ying Nian Wu

We propose a novel class of language models, Latent Thought Models (LTMs), which incorporate explicit latent thought vectors that follow an explicit prior model in latent space. These latent thought vectors guide the autoregressive generation of ground tokens through a Transformer decoder. Training employs a dual-rate optimization process within the classical variational Bayes framework: fast learning of local variational parameters for the posterior distribution of latent vectors (inference-time computation), and slow learning of global decoder parameters. Empirical studies reveal that LTMs possess additional scaling dimensions beyond traditional Large Language Models (LLMs), such as the number of iterations in inference-time computation and number of latent thought vectors. Higher sample efficiency can be achieved by increasing training compute per token, with further gains possible by trading model size for more inference steps. Designed based on these scaling properties, LTMs demonstrate superior sample and parameter efficiency compared to autoregressive models and discrete diffusion models. They significantly outperform these counterparts in validation perplexity and zero-shot language modeling tasks. Additionally, LTMs exhibit emergent few-shot in-context reasoning capabilities that scale with model size, and achieve competitive performance in conditional and unconditional text generation.



Article 826

Title@2025-06-06 (5): Sparse Autoencoders for Hypothesis Generation

Title: Sparse Autoencoders for Hypothesis Generation Sparse Autoencoder für Hypothesis-Generierung 假设代代的粗化自动编码器 2502.04382v3

Authors (5): Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson

We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., “mentions being surprised or shocked”) using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.



Article 827

Title@2025-06-06 (5): From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

Title: From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions Von Tools zu Teamkollegen: LLMs in Multi-Session Coding Interaktionen bewerten 从工具到队友:在多次会议编码互动中评估LLMs 2502.13791v2

Authors (8): Nathanaël Carraz Rakotonirina, Mohammed Hamdy, Jon Ander Campos, Lucas Weber, Alberto Testoni, Marzieh Fadaee, Sandro Pezzelle, Marco Del Tredici

Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs’ ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.



Article 828

Title@2025-06-06 (5): Canonical Autoregressive Generation

Title: Canonical Autoregressive Generation Kanonische autoregressive Generation Canonic 自动后退一代 2506.06446v1

Authors (4): Ivi Chatzi, Nina Corvelo Benz, Stratis Tsirtsis, Manuel Gomez-Rodriguez

State-of-the-art large language models are trained on large amounts of tokens derived from raw text using what is called a tokenizer. Crucially, the tokenizer determines the (token) vocabulary a model will use during inference as well as, in principle, the (token) language. This is because, while the token vocabulary may allow for different tokenizations of a string, the tokenizer always maps the string to only one of these tokenizations: the canonical tokenization. However, multiple lines of empirical evidence suggest that large language models do not always generate canonical token sequences, and this comes with several negative consequences. In this work, we first show that, to generate a canonical token sequence, a model needs to generate (partial) canonical token sequences at each step of the autoregressive generation process underpinning its functioning. Building upon this theoretical result, we introduce canonical sampling, a simple and efficient sampling method that precludes a given model from generating non-canonical token sequences. Further, we also show that, in comparison with standard sampling, the distribution of token sequences generated using canonical sampling is provably closer to the true distribution of token sequences used during training.
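
To make the notion of a canonical tokenization concrete, here is a small sketch with a toy vocabulary and a greedy longest-match tokenizer. Both are assumptions chosen for brevity (production tokenizers typically use BPE or similar merge rules), and the sketch only checks canonicity; it does not implement the paper's canonical sampling method.

```python
# Toy vocabulary (hypothetical): several token sequences can spell the same string.
VOCAB = {"a", "b", "c", "ab", "bc", "abc"}


def encode(text, vocab=VOCAB):
    """Greedy longest-match tokenizer: the single 'canonical' tokenization of text."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens


def is_canonical(tokens, vocab=VOCAB):
    """A token sequence is canonical if re-encoding its decoded string reproduces it."""
    return encode("".join(tokens), vocab) == list(tokens)


print(encode("abc"))
print(is_canonical(["abc"]), is_canonical(["ab", "c"]), is_canonical(["a", "bc"]))
```

All three sequences decode to the same string, but only the one the tokenizer itself would emit is canonical; a model that samples either of the other two is producing a sequence it would not have seen for that string during training.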



Article 829

Title@2025-06-06 (5): Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding

Title: Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding Filmfakten und Fibs (MF$^2$): Ein Benchmark für langes Filmverständnis 电影事实和纤维(MF$2美元):长期电影理解基准 2506.06275v1

Authors (31): Emmanouil Zaranis, António Farinhas, Saul Santos, Beatriz Canaverde, Miguel Moura Ramos, Aditya K Surikuchi, André Viveiros, Baohao Liao, Elena Bueno-Benito, Nithin Sivakumaran, Pavlo Vasylenko, Shoubin Yu, Sonal Sannigrahi, Wafaa Mohammed, Ben Peters, Danae Sánchez Villegas, Elias Stengel-Eskin, Giuseppe Attanasio, Jaehong Yoon, Stella Frank, Alessandro Suglia, Chrysoula Zerva, Desmond Elliott, Mariella Dimiccoli, Mohit Bansal, Oswald Lanz, Raffaella Bernardi, Raquel Fernández, Sandro Pezzelle, Vlad Niculae, André F. T. Martins

Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, “needle-in-a-haystack” details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF$^2$, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long). MF$^2$ includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs, one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and false claims. This reduces biases like answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information, an ability current VLMs lack.
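
The binary claim evaluation protocol described above can be sketched as follows: a pair counts as correct only if the model labels both the fact and the fib correctly. The data layout and function names are hypothetical, chosen for illustration; the per-claim accuracy is included only to show how much stricter the pairwise criterion is.

```python
def pair_accuracy(predictions, gold):
    """Fraction of claim pairs where BOTH claims are labeled correctly.

    predictions, gold: lists of (label_first_claim, label_second_claim) booleans,
    where True means "this claim is a fact" and False means "this claim is a fib".
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(1 for p, g in zip(predictions, gold) if tuple(p) == tuple(g))
    return correct / len(gold)


def claim_accuracy(predictions, gold):
    """Looser baseline: fraction of individual claims labeled correctly."""
    hits = sum(pc == gc for p, g in zip(predictions, gold) for pc, gc in zip(p, g))
    return hits / (2 * len(gold))


# Hypothetical gold labels: each pair holds exactly one fact and one fib.
gold = [(True, False), (False, True), (True, False)]
# Hypothetical model output: the second pair is marked "both true".
preds = [(True, False), (True, True), (True, False)]

print(pair_accuracy(preds, gold), claim_accuracy(preds, gold))
```

A model that answers "true" to everything would score 50% on individual claims but 0% on pairs, which is one way the pairwise criterion discourages label-prior shortcuts.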



Article 830

Title@2025-06-06 (5): AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization

Title: AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization AdvSumm: Adversariale Ausbildung für Bias Mitigation in Textzusammenfassung AdvSumm:在文本摘要中进行减轻偏见的反向培训 2506.06273v1

Authors (5): Mukur Gupta, Nikhil Reddy Varimalla, Nicholas Deas, Melanie Subbiah, Kathleen McKeown

Large Language Models (LLMs) have achieved impressive performance in text summarization and are increasingly deployed in real-world applications. However, these systems often inherit associative and framing biases from pre-training data, leading to inappropriate or unfair outputs in downstream tasks. In this work, we present AdvSumm (Adversarial Summarization), a domain-agnostic training framework designed to mitigate bias in text summarization through improved generalization. Inspired by adversarial robustness, AdvSumm introduces a novel Perturber component that applies gradient-guided perturbations at the embedding level of Sequence-to-Sequence models, enhancing the model’s robustness to input variations. We empirically demonstrate that AdvSumm effectively reduces different types of bias in summarization (specifically, name-nationality bias and political framing bias) without compromising summarization quality. Compared to standard transformers and data augmentation techniques like back-translation, AdvSumm achieves stronger bias mitigation performance across benchmark datasets.



Article 831

Title@2025-06-06 (5): MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators

Title: MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators MEDAL: Ein Rahmen für Benchmarking von LLMs als mehrsprachige Open-Domain Chatbots und Dialogevaluatoren MEDAL:多语言开放域聊天和对话评价员对LLMs进行基准评估的框架 2505.22777v2

Authors (3): John Mendonça, Alon Lavie, Isabel Trancoso

As the capabilities of chatbots and their underlying LLMs continue to dramatically improve, evaluating their performance has increasingly become a major blocker to their further development. A major challenge is the available benchmarking datasets, which are largely static, outdated, and lacking in multilingual coverage, limiting their ability to capture subtle linguistic and cultural variations. This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. A strong LLM (GPT-4.1) is then used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. We find that current LLMs struggle to detect nuanced issues, particularly those involving empathy and reasoning.



Article 832

Title@2025-06-06 (5): Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks

Title: Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks Leopard: Ein Visions-Sprachenmodell für Text-Rich-Multi-Image-Aufgaben 豹:文本-Rich多图像任务愿景语言模型 2410.01744v3

Authors (9): Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Dong Yu, Meng Jiang

Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction tuning datasets for text-rich multi-image scenarios, and (2) the difficulty in balancing image resolution with visual feature sequence length. To address these challenges, we propose Leopard, an MLLM tailored for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning instances, tailored to text-rich, multi-image scenarios. Second, we proposed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length based on the original aspect ratios and resolutions of images. Experiments on a diverse set of benchmarks reveal that our model consistently outperforms state-of-the-art systems, such as Llama-3.2 and Qwen2-VL, in challenging text-rich, multi-image evaluations. Remarkably, our approach achieves outstanding performance using only 1.2M training instances, all of which are fully open-sourced, demonstrating both high efficiency and effectiveness compared to models trained on large-scale in-house data. Our code and data are available at https://github.com/tencent-ailab/Leopard.


Article 833

Title@2025-06-06 (5): MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

Title: MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models MimeQA: Auf dem Weg zu sozial-intelligenten nichtverbalen Stiftungsmodellen MIMEQA:走向社会智能非语言基金会模式 2502.16671v2

Authors (5): Hengzhi Li, Megan Tjandrasuwita, Yi R. Fung, Armando Solar-Lezama, Paul Pu Liang

As AI becomes more closely integrated with people's daily activities, socially intelligent AI that can understand and interact seamlessly with humans in daily life is increasingly important. However, current work in AI social reasoning relies on language-only or language-dominant approaches to benchmarking and training models, resulting in systems that are improving in verbal communication but struggle with nonverbal social understanding. To address this limitation, we tap into a novel data source rich in nonverbal social interactions – mime videos. Mime is the art of expression through gesture and movement without spoken words, which presents unique challenges and opportunities in interpreting nonverbal social communication. We contribute a new dataset called MimeQA, obtained by sourcing 8 hours of video clips from YouTube and developing a comprehensive video question-answering benchmark comprising 806 carefully annotated and verified question-answer pairs, designed to probe nonverbal social reasoning capabilities. Using MimeQA, we evaluate state-of-the-art video large language models (vLLMs) and find that they achieve low overall accuracy, ranging from 20% to 30%, while humans score 86%. Our analysis reveals that vLLMs often fail to ground imagined objects and over-rely on the text prompt while ignoring subtle nonverbal interactions. We hope to inspire future work on AI models that embody true social intelligence capable of interpreting nonverbal human interactions.


Article 834

Title@2025-06-06 (5): A semantic embedding space based on large language models for modelling human beliefs

Title: A semantic embedding space based on large language models for modelling human beliefs Eine semantische Einbettung von Raum basierend auf großen Sprachmodellen zur Modellierung menschlicher Überzeugungen 基于模拟人类信仰的大型语言模型的语义嵌入空间 2408.07237v3

Authors (5): Byunghwee Lee, Rachith Aiyappa, Yong-Yeol Ahn, Haewoon Kwak, Jisun An

Beliefs form the foundation of human cognition and decision-making, guiding our actions and social connections. A model encapsulating beliefs and their interrelationships is crucial for understanding their influence on our actions. However, research on belief interplay has often been limited to beliefs related to specific issues and has relied heavily on surveys. We propose a method to study the nuanced interplay between thousands of beliefs by leveraging online user debate data and mapping beliefs onto a neural embedding space constructed using a fine-tuned large language model (LLM). This belief space captures the interconnectedness and polarization of diverse beliefs across social issues. Our findings show that positions within this belief space predict new beliefs of individuals and estimate cognitive dissonance based on the distance between existing and new beliefs. This study demonstrates how LLMs, combined with collective online records of human beliefs, can offer insights into the fundamental principles that govern human belief formation.


Article 835

Title@2025-06-06 (5): PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time

Title: PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time PersonaAgent: Wenn große Sprachmodell-Agenten Personalisierung zur Testzeit treffen 人 员:当大语言示范媒介在试验时间达到个性化时 2506.06254v1

Authors (15): Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, Xian Li

Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users’ varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components: a personalized memory module that includes episodic and semantic memory mechanisms, and a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as a unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulates the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.


Article 836

Title@2025-06-06 (5): Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge

Title: Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge Externes und parametrisches Wissen überbrücken: Halluzination von LLMs mit Shared-Private Semantic Synergy im Dual-Stream-Wissen abmildern 架桥外部和参数知识:减少双层知识中具有共同-私人语义协同效应的LLMsLLMs的幻觉 2506.06240v1

Authors (5): Yi Sui, Chaozhuo Li, Chen Zhang, Dawei Song, Qiuchi Li

Retrieval-augmented generation (RAG) is a cost-effective approach to mitigate the hallucination of Large Language Models (LLMs) by incorporating the retrieved external knowledge into the generation process. However, external knowledge may conflict with the parametric knowledge of LLMs. Furthermore, current LLMs lack inherent mechanisms for resolving such knowledge conflicts, making traditional RAG methods suffer from degraded performance and stability. Thus, we propose a Dual-Stream Knowledge-Augmented Framework for Shared-Private Semantic Synergy (DSSP-RAG). Central to the framework is a novel approach that refines self-attention into a mixed-attention, distinguishing shared and private semantics for a controlled internal-external knowledge integration. To effectively facilitate DSSP in RAG, we further introduce an unsupervised hallucination detection method based on cognitive uncertainty, ensuring the necessity of introducing knowledge, and an Energy Quotient (EQ) based on attention difference matrices to reduce noise in the retrieved external knowledge. Extensive experiments on benchmark datasets show that DSSP-RAG can effectively resolve conflicts and enhance the complementarity of dual-stream knowledge, leading to superior performance over strong baselines.


Article 837

Title@2025-06-06 (5): Explaining Matters: Leveraging Definitions and Semantic Expansion for Sexism Detection

Title: Explaining Matters: Leveraging Definitions and Semantic Expansion for Sexism Detection Erklären von Dingen: Beherrschende Definitionen und semantische Expansion für Sexismus-Erkennung 解释事项:利用定义和语义扩展探索性别主义 2506.06238v1

Authors (3): Sahrish Khan, Arshad Jhumka, Gabriele Pergola

The detection of sexism in online content remains an open problem, as harmful language disproportionately affects women and marginalized groups. While automated systems for sexism detection have been developed, they still face two key challenges: data sparsity and the nuanced nature of sexist language. Even in large, well-curated datasets like the Explainable Detection of Online Sexism (EDOS), severe class imbalance hinders model generalization. Additionally, the overlapping and ambiguous boundaries of fine-grained categories introduce substantial annotator disagreement, reflecting the difficulty of interpreting nuanced expressions of sexism. To address these challenges, we propose two prompt-based data augmentation techniques: Definition-based Data Augmentation (DDA), which leverages category-specific definitions to generate semantically-aligned synthetic examples, and Contextual Semantic Expansion (CSE), which targets systematic model errors by enriching examples with task-specific semantic features. To further improve reliability in fine-grained classification, we introduce an ensemble strategy that resolves prediction ties by aggregating complementary perspectives from multiple language models. Our experimental evaluation on the EDOS dataset demonstrates state-of-the-art performance across all tasks, with notable improvements of macro F1 by 1.5 points for binary classification (Task A) and 4.1 points for fine-grained classification (Task C).


Article 838

Title@2025-06-06 (5): Banyan: Improved Representation Learning with Explicit Structure

Title: Banyan: Improved Representation Learning with Explicit Structure Banyan: Verbessertes Repräsentationslernen mit expliziter Struktur Banyan: 改善有明确结构的代表制学习 2407.17771v4

Authors (2): Mattia Opper, N. Siddharth

We present Banyan, a model that efficiently learns semantic representations by leveraging explicit hierarchical structure. While transformers excel at scale, they struggle in low-resource settings. Conversely, recent structured models have shown promise as efficient learners, but lack performance. Banyan bridges this gap with two key innovations: an entangled hierarchical tree structure and diagonalized message passing, enabling it to outperform larger transformer models with just 14 non-embedding parameters. It excels in low-resource settings, offering a viable alternative for under-represented languages and highlighting its potential for efficient, interpretable NLP in resource-constrained environments.


Article 839

Title@2025-06-06 (5): Corrector Sampling in Language Models

Title: Corrector Sampling in Language Models Korrektor-Probenahme in Sprachmodellen 语言模型中校正器样本 2506.06215v1

Authors (4): Itai Gat, Neta Shaul, Uriel Singer, Yaron Lipman

Autoregressive language models accumulate errors due to their fixed, irrevocable left-to-right token generation. To address this, we propose a new sampling method called Resample-Previous-Tokens (RPT). RPT mitigates error accumulation by iteratively revisiting and potentially replacing tokens in a window of previously generated text. This method can be integrated into existing autoregressive models, preserving their next-token-prediction quality and speed. Fine-tuning a pretrained 8B parameter model with RPT for only 100B tokens resulted in ~10% relative improvements on reasoning and coding benchmarks compared to standard sampling.


Article 840

Title@2025-06-06 (5): Can Theoretical Physics Research Benefit from Language Agents?

Title: Can Theoretical Physics Research Benefit from Language Agents? Kann die theoretische Physikforschung von Sprachagenten profitieren? 理论物理研究能否从语言代理中受益? 2506.06214v1

Authors (6): Sirui Lu, Zhijing Jin, Terry Jingchen Zhang, Pavel Kos, J. Ignacio Cirac, Bernhard Schölkopf

Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and tools. We analyze current LLM capabilities for physics – from mathematical reasoning to code generation – identifying critical gaps in physical intuition, constraint satisfaction, and reliable reasoning. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments. Realizing this vision requires addressing fundamental challenges: ensuring physical consistency and developing robust verification methods. We call for collaborative efforts between the physics and AI communities to help advance scientific discovery in physics.


Article 841

Title@2025-06-06 (5): PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

Title: PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts PuzzleWorld: Ein Benchmark für multimodale, offene Vernunft in Puzzlejagden 谜题世界:谜题搜索中多种形式、不限名额的理由基准 2506.06211v1

Authors (12): Hengzhi Li, Brendon Jiang, Alexander Naehu, Regan Song, Justin Zhang, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang

Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance on such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.


Article 842

Title@2025-06-06 (5): Building Models of Neurological Language

Title: Building Models of Neurological Language Baumodelle der neurologischen Sprache 神经语言建模模型 2506.06208v1

Authors (1): Henry Watkins

This report documents the development and evaluation of domain-specific language models for neurology. Initially focused on building a bespoke model, the project adapted to rapid advances in open-source and commercial medical LLMs, shifting toward leveraging retrieval-augmented generation (RAG) and representational models for secure, local deployment. Key contributions include the creation of neurology-specific datasets (case reports, QA sets, textbook-derived data), tools for multi-word expression extraction, and graph-based analyses of medical terminology. The project also produced scripts and Docker containers for local hosting. Performance metrics and graph community results are reported, with future possible work open for multimodal models using open-source architectures like phi-4.


Article 843

Title@2025-06-06 (5): Kinetics: Rethinking Test-Time Scaling Laws

Title: Kinetics: Rethinking Test-Time Scaling Laws Kinetik: Überdenken von Test-Zeit-Skalierungsgesetzen 动因:重新思考试验时间扩增法 2506.05333v2

Authors (6): Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen

We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold size than on smaller ones. A key reason is that in test-time scaling (TTS), attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential, and increasingly important as more compute is invested, for realizing the full potential of test-time scaling, where, unlike training, accuracy has yet to saturate as a function of computation and continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.


Article 844

Title@2025-06-06 (5): Detecting Voice Phishing with Precision: Fine-Tuning Small Language Models

Title: Detecting Voice Phishing with Precision: Fine-Tuning Small Language Models Sprach Phishing mit Präzision erkennen: Feintuning-kleine Sprachmodelle 以精密度探测语音幻听:精美使用小型语言模式 2506.06180v1

Authors (2): Ju Yong Sim, Seong Hwan Kim

We develop a voice phishing (VP) detector by fine-tuning Llama3, a representative open-source, small language model (LM). In the prompt, we provide carefully designed VP evaluation criteria and apply the Chain-of-Thought (CoT) technique. To evaluate the robustness of LMs and highlight differences in their performance, we construct an adversarial test dataset that places the models under challenging conditions. Moreover, to address the lack of VP transcripts, we create transcripts by referencing existing or new types of VP techniques. We compare cases where evaluation criteria are included, the CoT technique is applied, or both are used together. In our experiments, the Llama3-8B model, fine-tuned with a dataset that includes a prompt with VP evaluation criteria, yields the best performance among small LMs and is comparable to a GPT-4-based VP detector. These findings indicate that incorporating human expert knowledge into the prompt is more effective than using the CoT technique for small LMs in VP detection.


Article 845

Title@2025-06-06 (5): Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

Title: Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach Läuft es und ist das genug? Text-zu-Chart-Generation mit einem Multi-Agenten-Ansatz 以多机构方式重新审视文本到图表一代 2506.06175v1

Authors (2): James Ford, Anthony Rios

Large language models can translate natural-language chart descriptions into runnable code, yet approximately 15% of the generated scripts still fail to execute, even after supervised fine-tuning and reinforcement learning. We investigate whether this persistent error rate stems from model limitations or from reliance on a single-prompt design. To explore this, we propose a lightweight multi-agent pipeline that separates drafting, execution, repair, and judgment, using only an off-the-shelf GPT-4o-mini model. On the Text2Chart31 benchmark, our system reduces execution errors to 4.5% within three repair iterations, outperforming the strongest fine-tuned baseline by nearly 5 percentage points while requiring significantly less compute. Similar performance is observed on the ChartX benchmark, with an error rate of 4.6%, demonstrating strong generalization. Under current benchmarks, execution success appears largely solved. However, manual review reveals that 6 out of 100 sampled charts contain hallucinations, and an LLM-based accessibility audit shows that only 33.3% (Text2Chart31) and 7.2% (ChartX) of generated charts satisfy basic colorblindness guidelines. These findings suggest that future work should shift focus from execution reliability toward improving chart aesthetics, semantic fidelity, and accessibility.
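
The draft, execute, repair loop named in the abstract above can be pictured with a minimal Python sketch. This is not the authors' pipeline: `draft_fn` and `repair_fn` are hypothetical stand-ins for LLM calls, the judgment stage is left out, and the in-process `exec` is used only to keep the example short (a real system would more likely run each script in a sandboxed subprocess). The default of three repair rounds mirrors the "three repair iterations" mentioned in the abstract.

```python
import traceback


def run_script(code):
    """Execute a script string and return (ok, error_text)."""
    try:
        exec(compile(code, "<chart>", "exec"), {"__name__": "__chart__"})
        return True, ""
    except Exception:
        # Keep only the last frame plus the exception line for the repair prompt.
        return False, traceback.format_exc(limit=1)


def generate_chart_code(description, draft_fn, repair_fn, max_repairs=3):
    """Draft a script, then repair it until it runs or the budget is spent.

    draft_fn(description) -> code and repair_fn(description, code, error) -> code
    are hypothetical callables standing in for model calls.
    Returns (code, repairs_used), or (None, max_repairs) if every attempt failed.
    """
    code = draft_fn(description)
    for attempt in range(max_repairs + 1):
        ok, error = run_script(code)
        if ok:
            return code, attempt
        if attempt < max_repairs:
            code = repair_fn(description, code, error)
    return None, max_repairs
```

Passing the captured error text back into `repair_fn` is the design point of the sketch: each repair round sees the concrete failure rather than only the original description.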


Article 846

Title@2025-06-06 (5): Emergent Response Planning in LLMs

Title: Emergent Response Planning in LLMs Emergent Response Planning in LLMs 新的LLMM 应急规划 2502.06258v2

Authors (5): Zhichen Dong, Zhanhui Zhou, Zhixuan Liu, Chao Yang, Chaochao Lu

In this work, we argue that large language models (LLMs), though trained to predict only the next token, exhibit emergent planning behaviors: their hidden representations encode future outputs beyond the next token. Through simple probing, we demonstrate that LLM prompt representations encode global attributes of their entire responses, including structure attributes (e.g., response length, reasoning steps), content attributes (e.g., character choices in storywriting, multiple-choice answers at the end of the response), and behavior attributes (e.g., answer confidence, factual consistency). In addition to identifying response planning, we explore how it scales with model size across tasks and how it evolves during generation. The finding that LLMs plan ahead in their hidden representations suggests potential applications for improving transparency and generation control.


Article 847

Title@2025-06-06 (5): semantic-features: A User-Friendly Tool for Studying Contextual Word Embeddings in Interpretable Semantic Spaces

Title: semantic-features: A User-Friendly Tool for Studying Contextual Word Embeddings in Interpretable Semantic Spaces semantische Eigenschaften: Ein benutzerfreundliches Tool zum Studieren von Kontext-Wort-Einbettungen in interpretierbare semantische Räume 语义词义-语义特点:一种用户友好工具,用于研究可解释的语义空间的背景文字嵌入 2506.06169v1

Authors (4): Jwalanthi Ranganathan, Rohan Jha, Kanishka Misra, Kyle Mahowald

We introduce semantic-features, an extensible, easy-to-use library based on Chronis et al. (2023) for studying contextualized word embeddings of LMs by projecting them into interpretable spaces. We apply this tool in an experiment where we measure the contextual effect of the choice of dative construction (prepositional or double object) on the semantic interpretation of utterances (Bresnan, 2007). Specifically, we test whether “London” in “I sent London the letter.” is more likely to be interpreted as an animate referent (e.g., as the name of a person) than in “I sent the letter to London.” To this end, we devise a dataset of 450 sentence pairs, one in each dative construction, with recipients being ambiguous with respect to person-hood vs. place-hood. By applying semantic-features, we show that the contextualized word embeddings of three masked language models show the expected sensitivities. This leaves us optimistic about the usefulness of our tool.


Article 848

Title@2025-06-06 (5): The Lock-in Hypothesis: Stagnation by Algorithm

Title: The Lock-in Hypothesis: Stagnation by Algorithm Die Lock-in Hypothese: Stagnation durch Algorithmen Lock-in 假说: 阿尔哥里特姆的停滞 2506.06166v1

Authors (4): Tianyi Alex Qiu, Zhonghao He, Tejasveer Chugh, Max Kleiman-Weiner

The training and deployment of large language models (LLMs) create a feedback loop with human users: models learn human beliefs from data, reinforce these beliefs with generated content, reabsorb the reinforced beliefs, and feed them back to users again and again. This dynamic resembles an echo chamber. We hypothesize that this feedback loop entrenches the existing values and beliefs of users, leading to a loss of diversity and potentially the lock-in of false beliefs. We formalize this hypothesis and test it empirically with agent-based LLM simulations and real-world GPT usage data. Analysis reveals sudden but sustained drops in diversity after the release of new GPT iterations, consistent with the hypothesized human-AI feedback loop. Code and data available at https://thelockinhypothesis.com


Article 849

Title@2025-06-06 (5): Automated Journalistic Questions: A New Method for Extracting 5W1H in French

Title: Automated Journalistic Questions: A New Method for Extracting 5W1H in French Automatisierte journalistische Fragen: Eine neue Methode zur Extraktion von 5W1H in Französisch 自动新闻问题:以法文提取5W1H的新方法 2505.14804v2

Authors (3): Maxence Verhaverbeke, Julie A. Gramaccia, Richard Khoury

The 5W1H questions – who, what, when, where, why and how – are commonly used in journalism to ensure that an article describes events clearly and systematically. Answering them is a crucial prerequisite for tasks such as summarization, clustering, and news aggregation. In this paper, we design the first automated extraction pipeline to get 5W1H information from French news articles. To evaluate the performance of our algorithm, we also create a corpus of 250 Quebec news articles with 5W1H answers marked by four human annotators. Our results demonstrate that our pipeline performs as well in this task as the large language model GPT-4o.


Article 850

Title@2025-06-06 (5): Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Title: Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning Nicht alle Rollouts sind nützlich: Down-Sampling Rollouts im LLM Verstärkungslernen 并非所有的推出都有用:在LLM强化学习中进行下选推出 2504.13818v2

Authors (4): Yixuan Even Xu, Yash Savani, Fei Fang, Zico Kolter

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing reasoning capabilities in large language models. However, it is constrained by a fundamental asymmetry in computation and memory requirements: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling). PODS produces numerous rollouts in parallel, then trains on only an informative subset, preserving learning signals while slashing update cost. We instantiate PODS with max-variance down-sampling, a principled criterion that maximises reward diversity, and show that it admits an $O(n\log n)$ solution. Empirically, coupling PODS with Group Relative Policy Optimization (GRPO) achieves superior performance over standard GRPO across different reasoning benchmarks and hardware environments.
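
One way to picture the max-variance down-sampling step and its $O(n\log n)$ cost is the following Python sketch. It is not taken from the paper's code. It rests on the assumption that a size-$m$ subset with maximal reward variance can always be formed from the $k$ lowest and $m-k$ highest rewards for some $k$, so that sorting once and scanning $k$ with prefix sums is enough. The function name and return shape are illustrative.

```python
from itertools import accumulate


def max_variance_subset(rewards, m):
    """Pick m rollout indices whose rewards have the largest variance.

    Assumes the optimum consists of the k lowest and (m - k) highest rewards
    for some k in 0..m. Sorting dominates the cost, so the routine runs in
    O(n log n) time. Returns (sorted_indices, variance).
    """
    n = len(rewards)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(rewards)")

    order = sorted(range(n), key=lambda i: rewards[i])  # ascending by reward
    values = [rewards[i] for i in order]

    # Prefix sums of rewards and squared rewards give O(1) subset statistics.
    prefix = [0.0] + list(accumulate(values))
    prefix_sq = [0.0] + list(accumulate(v * v for v in values))

    best_k, best_var = 0, -1.0
    for k in range(m + 1):  # k items from the low end, m - k from the high end
        high = m - k
        total = prefix[k] + (prefix[n] - prefix[n - high])
        total_sq = prefix_sq[k] + (prefix_sq[n] - prefix_sq[n - high])
        variance = total_sq / m - (total / m) ** 2  # population variance
        if variance > best_var:
            best_k, best_var = k, variance

    chosen = order[:best_k] + order[n - (m - best_k):]
    return sorted(chosen), best_var
```

For example, with rewards `[0.0, 1.0, 0.5, 0.5, 1.0, 0.0]` and `m = 4`, the sketch keeps the two zero-reward and the two one-reward rollouts and drops the two middling ones, which is the intuition behind training only on the most contrastive rollouts.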


Article 851

Title@2025-06-06 (5): Masked Language Models are Good Heterogeneous Graph Generalizers

Title: Masked Language Models are Good Heterogeneous Graph Generalizers Masked Language Models sind gute Heterogene Graph Generalizers 遮罩语言模型是好异基因图形缩略图 2506.06157v1

Authors (7): Jinyu Yang, Cheng Yang, Shanyuan Cui, Zeyuan Guo, Liangwei Yang, Muhan Zhang, Chuan Shi

Heterogeneous graph neural networks (HGNNs) excel at capturing structural and semantic information in heterogeneous graphs (HGs), but struggle to generalize across domains and tasks. Recently, some researchers have turned to integrating HGNNs with large language models (LLMs) for more generalizable heterogeneous graph learning. However, these approaches typically extract structural information via HGNNs as HG tokens, and disparities in embedding spaces between HGNNs and LLMs have been shown to bias the LLM’s comprehension of HGs. Moreover, as these HG tokens are often derived from node-level tasks, the model’s ability to generalize across tasks remains limited. To this end, we propose a simple yet effective Masked Language Modeling-based method, called MLM4HG. MLM4HG introduces metapath-based textual sequences instead of HG tokens to extract structural and semantic information inherent in HGs, and designs customized textual templates to unify different graph tasks into a coherent cloze-style “mask” token prediction paradigm. Specifically, MLM4HG first converts HGs from various domains to texts based on metapaths, and subsequently combines them with the unified task texts to form an HG-based corpus. The corpus is then fed into a pretrained LM for fine-tuning with a constrained target vocabulary, enabling the fine-tuned LM to generalize to unseen target HGs. Extensive cross-domain and multi-task experiments on four real-world datasets demonstrate the superior generalization performance of MLM4HG over state-of-the-art methods in both few-shot and zero-shot scenarios. Our code is available at https://github.com/BUPT-GAMMA/MLM4HG.


Article 852

Title@2025-06-06 (5): Towards Effective Extraction and Evaluation of Factual Claims

Title: Towards Effective Extraction and Evaluation of Factual Claims Auf dem Weg zu einer wirksamen Extraktion und Bewertung der tatsächlichen Forderungen 争取有效提取和评估事实申诉 2502.10855v2

Authors (2): Dasha Metropolitansky, Jonathan Larson

A common strategy for fact-checking long-form content generated by Large Language Models (LLMs) is extracting simple claims that can be verified independently. Since inaccurate or incomplete claims compromise fact-checking results, ensuring claim quality is critical. However, the lack of a standardized evaluation framework impedes assessment and comparison of claim extraction methods. To address this gap, we propose a framework for evaluating claim extraction in the context of fact-checking along with automated, scalable, and replicable methods for applying this framework, including novel approaches for measuring coverage and decontextualization. We also introduce Claimify, an LLM-based claim extraction method, and demonstrate that it outperforms existing methods under our evaluation framework. A key feature of Claimify is its ability to handle ambiguity and extract claims only when there is high confidence in the correct interpretation of the source text.



Article 853

Title@2025-06-06 (5): LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Title: LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws LLMs on the Line: Daten bestimmen die Loss-to-Loss-Skalierungsgesetze 线上LLM:数据确定损失到损失的缩放法 2502.12120v2

Authors (5): Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.



Article 854

Title@2025-06-06 (5): CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval

Title: CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval CLaMR: Kontextualisierte Late-Interaktion für multimodale Content Retrieval CLAMR: 多式内容检索的内幕性后期交流交流 2506.06144v1

Authors (5): David Wan, Han Wang, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

Online video web content is richly multimodal: a single video blends vision, speech, ambient audio, and on-screen text. Retrieval systems typically treat these modalities as independent retrieval sources, which can lead to noisy and subpar retrieval. We explore multimodal video content retrieval, where relevance can be scored from one particular modality or jointly across multiple modalities simultaneously. Consequently, an effective retriever must dynamically choose which modality (or set of modalities) best addresses the query. We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes 4 modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR jointly encodes all modalities with a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, given the lack of training data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale synthetic training dataset built on MultiVENT 2.0 (event-centric videos in various languages paired with queries) with modality-targeted queries. Next, we propose a modality-aware loss that jointly trains according to a standard contrastive objective alongside an objective for learning correct modality usage. On the test sets of MultiVENT 2.0++ and MSRVTT, conventional aggregation strategies, such as averaging similarities for baseline retrievers, degrade performance by introducing noise from irrelevant modalities. In contrast, CLaMR consistently outperforms existing retrievers: on MultiVENT 2.0++, CLaMR improves nDCG@10 by 25.6 over the best single-modality retriever and by 35.4 over the best multi-modality retriever. We illustrate CLaMR’s downstream utility on long-video QA, retrieving relevant frames and obtaining a 3.50% boost over LanguageBind on Video-MME and 1.42% over dense sampling on LongVideoBench.
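For readers unfamiliar with the metric reported above, here is a minimal sketch of nDCG@k. It assumes linear gain (some implementations use 2^rel - 1 instead) and that the relevance list covers every judged document for the query, so the ideal ranking can be obtained by sorting it; neither choice is confirmed by the abstract.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that places the only relevant item second scores `1 / log2(3)`, roughly 0.63, while placing it first scores 1.0.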



Article 855

Title@2025-06-06 (5): Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models

Title: Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models Tabelle-r1: Selbstbeaufsichtigtes und verstärktes Lernen für programmbasierte Tabellenveranlagung in kleinen Sprachmodellen 表-r1:针对基于方案的小语言模式说明理由的表格进行自我监督和强化学习 2506.06137v1

Authors (9): Rihui Jin, Zheyu Xin, Xing Xie, Zuoyi Li, Guilin Qi, Yongrui Chen, Xinbang Dai, Tongtong Wu, Gholamreza Haffari

Table reasoning (TR) requires structured reasoning over semi-structured tabular data and remains challenging, particularly for small language models (SLMs, e.g., LLaMA-8B) due to their limited capacity compared to large LMs (LLMs, e.g., GPT-4o). To narrow this gap, we explore program-based TR (P-TR), which circumvents key limitations of text-based TR (T-TR), notably in numerical reasoning, by generating executable programs. However, applying P-TR to SLMs introduces two challenges: (i) vulnerability to heterogeneity in table layouts, and (ii) inconsistency in reasoning due to limited code generation capability. We propose Table-r1, a two-stage P-TR method designed for SLMs. Stage 1 introduces an innovative self-supervised learning task, Layout Transformation Inference, to improve tabular layout generalization from a programmatic view. Stage 2 adopts a mix-paradigm variant of Group Relative Policy Optimization, enhancing P-TR consistency while allowing dynamic fallback to T-TR when needed. Experiments on four TR benchmarks demonstrate that Table-r1 outperforms all SLM-based methods, achieving at least a 15% accuracy improvement over the base model (LLaMA-8B) across all datasets and reaching performance competitive with LLMs.



Article 856

Title@2025-06-06 (5): Let’s CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition

Title: Let’s CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition Lassen Sie uns CONFER: Ein Datensatz für die Bewertung der natürlichen Sprache Inferenzmodelle auf konditional Inference und Presupposition 让我们考虑一下:评估关于有条件责任和假设的自然语言推断模型的数据集。 2506.06133v1

Authors (4): Tara Azin, Daniel Dumitrescu, Diana Inkpen, Raj Singh

Natural Language Inference (NLI) is the task of determining whether a sentence pair represents entailment, contradiction, or a neutral relationship. While NLI models perform well on many inference tasks, their ability to handle fine-grained pragmatic inferences, particularly presupposition in conditionals, remains underexplored. In this study, we introduce CONFER, a novel dataset designed to evaluate how NLI models process inference in conditional sentences. We assess the performance of four NLI models, including two pre-trained models, to examine their generalization to conditional reasoning. Additionally, we evaluate Large Language Models (LLMs), including GPT-4o, LLaMA, Gemma, and DeepSeek-R1, in zero-shot and few-shot prompting settings to analyze their ability to infer presuppositions with and without prior context. Our findings indicate that NLI models struggle with presuppositional reasoning in conditionals, and fine-tuning on existing NLI datasets does not necessarily improve their performance.



Article 857

Title@2025-06-06 (5): Unveiling Topological Structures from Language: A Comprehensive Survey of Topological Data Analysis Applications in NLP

Title: Unveiling Topological Structures from Language: A Comprehensive Survey of Topological Data Analysis Applications in NLP Enthüllen topologischer Strukturen aus der Sprache: Eine umfassende Erhebung von topologischen Datenanalyseanwendungen in NLP 语言中不固定的地形结构:国家语言规划局地形数据分析应用综合调查 2411.10298v3

Authors (2): Adaku Uchendu, Thai Le

The surge of data available on the internet has led to the adoption of various computational methods to analyze and extract valuable insights from this wealth of information. Among these, the field of Machine Learning (ML) has thrived by leveraging data to extract meaningful insights. However, ML techniques face notable challenges when dealing with real-world data, often due to issues of imbalance, noise, insufficient labeling, and high dimensionality. To address these limitations, some researchers advocate for the adoption of Topological Data Analysis (TDA), a statistical approach that discerningly captures the intrinsic shape of data despite noise. Despite its potential, TDA has not gained as much traction within the Natural Language Processing (NLP) domain compared to structurally distinct areas like computer vision. Nevertheless, a dedicated community of researchers has been exploring the application of TDA in NLP, yielding 95 papers we comprehensively survey in this paper. Our findings categorize these efforts into theoretical and non-theoretical approaches. Theoretical approaches aim to explain linguistic phenomena from a topological viewpoint, while non-theoretical approaches merge TDA with ML features, utilizing diverse numerical representation techniques. We conclude by exploring the challenges and unresolved questions that persist in this niche field. Resources and a list of papers on this topic can be found at: https://github.com/AdaUchendu/AwesomeTDA4NLP.



Article 858

Title@2025-06-06 (5): Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction

Title: Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction Telefonisch-angereicherte diskriminative Korrektur für Sprachsuchfehler-Korrektur 语音搜索错误校正的声频搜索 2506.06117v1

Authors (7): Christophe Van Gysel, Maggie Wu, Lyan Verwimp, Caglar Tirkaz, Marco Bertola, Zhihong Lei, Youssef Oualil

End-to-end (E2E) Automatic Speech Recognition (ASR) models are trained using paired audio-text samples that are expensive to obtain, since high-quality ground-truth data requires human annotators. Voice search applications, such as digital media players, leverage ASR to allow users to search by voice as opposed to an on-screen keyboard. However, recent or infrequent movie titles may not be sufficiently represented in the E2E ASR system’s training data, and hence, may suffer poor recognition. In this paper, we propose a phonetic correction system that consists of (a) a phonetic search based on the ASR model’s output that generates phonetic alternatives that may not be considered by the E2E system, and (b) a rescorer component that combines the ASR model recognition and the phonetic alternatives, and selects a final system output. We find that our approach improves word error rate by 4.4% to 7.6% relative on benchmarks of popular movie titles over a series of competitive baselines.
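The result above is stated as a relative word error rate (WER) improvement. As a minimal sketch, WER is the word-level edit distance divided by the reference length, and a relative improvement compares two WER values. The function names are illustrative and the sketch assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, new_wer):
    """Fraction by which the new WER is lower than the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer
```

For example, going from a WER of 10.0% to 9.5% is a 5% relative improvement even though the absolute change is only 0.5 points.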



Article 859

Title@2025-06-06 (5): Bridging the Gap: In-Context Learning for Modeling Human Disagreement

Title: Bridging the Gap: In-Context Learning for Modeling Human Disagreement Bridging the Gap: In-Context-Lernen zur Modellierung von menschlichen Uneinigkeiten 弥合差距:模拟人类分歧的理论内学习 2506.06113v1

Authors (5): Benedetta Muscato, Yue Li, Gizem Gezici, Zhixue Zhao, Fosca Giannotti

Large Language Models (LLMs) have shown strong performance on NLP classification tasks. However, they typically rely on aggregated labels, often obtained via majority voting, which can obscure the human disagreement inherent in subjective annotations. This study examines whether LLMs can capture multiple perspectives and reflect annotator disagreement in subjective tasks such as hate speech and offensive language detection. We use in-context learning (ICL) in zero-shot and few-shot settings, evaluating four open-source LLMs across three label modeling strategies: aggregated hard labels, and disaggregated hard and soft labels. In few-shot prompting, we assess demonstration selection methods based on textual similarity (BM25, PLM-based), annotation disagreement (entropy), a combined ranking, and example ordering strategies (random vs. curriculum-based). Results show that multi-perspective generation is viable in zero-shot settings, while few-shot setups often fail to capture the full spectrum of human judgments. Prompt design and demonstration selection notably affect performance, though example ordering has limited impact. These findings highlight the challenges of modeling subjectivity with LLMs and the importance of building more perspective-aware, socially intelligent models.
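To make the label strategies concrete, the sketch below derives an aggregated hard label (majority vote), a soft label (the empirical distribution over annotators), and an entropy-based disagreement score from one item's annotations. The function names and example labels are illustrative assumptions, not the paper's code.

```python
import math
from collections import Counter


def hard_label(annotations):
    """Majority vote; ties go to the label that appears first."""
    return Counter(annotations).most_common(1)[0][0]


def soft_label(annotations):
    """Empirical label distribution over annotators."""
    counts = Counter(annotations)
    total = len(annotations)
    return {label: n / total for label, n in counts.items()}


def disagreement_entropy(annotations):
    """Shannon entropy (bits) of the soft label; 0 means full agreement."""
    return -sum(p * math.log2(p) for p in soft_label(annotations).values())
```

An item labeled `["hate", "not", "hate", "not"]` has a soft label of 0.5 for each class and an entropy of 1 bit, information that a single majority label would discard.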



Article 860

Title@2025-06-06 (5): Opt-Out: Investigating Entity-Level Unlearning for Large Language Models via Optimal Transport

Title: Opt-Out: Investigating Entity-Level Unlearning for Large Language Models via Optimal Transport Opt-Out: Untersuchung von Entity-Level-Unlearning für große Sprachmodelle via Optimal Transport 选择结果:调查实体一级通过最佳运输方式对大语言模式的不学习情况 2406.12329v3

Authors (4): Minseok Choi, Daniel Rim, Dohyun Lee, Jaegul Choo

Instruction-following large language models (LLMs), such as ChatGPT, have become widely popular among everyday users. However, these models inadvertently disclose private, sensitive information to their users, underscoring the need for machine unlearning techniques to remove selected information from the models. While prior work has focused on forgetting small, random subsets of training data at the instance-level, we argue that real-world scenarios often require the removal of a user’s entire data, which may require a more careful maneuver. In this study, we explore entity-level unlearning, which aims to erase all knowledge related to a target entity while preserving the remaining model capabilities. To address this, we introduce Opt-Out, an optimal transport-based unlearning method that utilizes the Wasserstein distance from the model’s initial parameters to achieve more effective and fine-grained unlearning. We also present the first Entity-Level Unlearning Dataset (ELUDe) designed to evaluate entity-level unlearning. Our empirical results demonstrate that Opt-Out surpasses existing methods, establishing a new standard for secure and adaptable LLMs that can accommodate user data removal requests without the need for full retraining.
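The abstract does not say how the Wasserstein distance is computed over parameters, so the following is only a generic illustration of the distance itself: for two equal-sized one-dimensional samples, the 1-Wasserstein distance is the mean absolute difference between the sorted values. Treating parameter values as such samples is an assumption made for this sketch.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-sized 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-sized samples")
    # Optimal transport in 1-D matches the i-th smallest x with the i-th smallest y.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

Because the values are sorted first, the distance depends on how the values are distributed rather than on their order: `wasserstein_1d([0.0, 1.0, 2.0], [2.0, 0.0, 1.0])` is 0.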



Article 861

Title@2025-06-06 (5): The Canary’s Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text

Title: The Canary’s Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text Das Echo der Kanaren: Auditing Privacy Risks of LLM-generated Synthetic Text 加那利的回声:LLM发明的合成文字的 审计隐私风险 2502.14921v2

Authors (5): Matthieu Meeus, Lukas Wutschitz, Santiago Zanella-Béguelin, Shruti Tople, Reza Shokri

How much information about training samples can be leaked through synthetic data generated by Large Language Models (LLMs)? Overlooking the subtleties of information flow in synthetic data generation pipelines can lead to a false sense of privacy. In this paper, we assume an adversary has access to some synthetic data generated by an LLM. We design membership inference attacks (MIAs) that target the training data used to fine-tune the LLM that is then used to synthesize data. The significant performance of our MIA shows that synthetic data leak information about the training data. Further, we find that canaries crafted for model-based MIAs are sub-optimal for privacy auditing when only synthetic data is released. Such out-of-distribution canaries have limited influence on the model’s output when prompted to generate useful, in-distribution synthetic data, which drastically reduces their effectiveness. To tackle this problem, we leverage the mechanics of auto-regressive models to design canaries with an in-distribution prefix and a high-perplexity suffix that leave detectable traces in synthetic data. This enhances the power of data-based MIAs and provides a better assessment of the privacy risks of releasing synthetic data generated by LLMs.
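The canary design described above can be sketched very roughly as a natural-looking prefix followed by an unlikely suffix. In this sketch the suffix is drawn uniformly at random from a small word list as a stand-in for "high perplexity"; the paper's actual construction may differ, and the function name and example words are hypothetical.

```python
import random


def make_canary(prefix_words, vocabulary, suffix_len=8, seed=0):
    """Join an in-distribution prefix with a randomly sampled (unlikely) suffix."""
    rng = random.Random(seed)  # seeded so the same canary can be regenerated for auditing
    suffix = [rng.choice(vocabulary) for _ in range(suffix_len)]
    return " ".join(list(prefix_words) + suffix)


canary = make_canary(
    "The service at this restaurant was".split(),
    ["zinc", "quark", "ohm", "vex", "jib"],
    suffix_len=4,
)
```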



Article 862

Title@2025-06-06 (5): Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

Title: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model Effiziente und direkte Duplex-Modellierung für Speech-to-Speech-Sprachenmodell 语音语音和语音语言模式的高效和直接双重模式 2505.15670v2

Authors (10): Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg

Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.



Article 863

Title@2025-06-06 (5): Label-Context-Dependent Internal Language Model Estimation for CTC

Title: Label-Context-Dependent Internal Language Model Estimation for CTC Label-Context-Dependent Interne Sprachmodellschätzung für CTC Label-Ctext-依赖性内部语言模型 反恐委员会内部语言模型估算 2506.06096v1

Authors (4): Zijian Yang, Minh-Nghia Phan, Ralf Schlüter, Hermann Ney

Although connectionist temporal classification (CTC) has the label context independence assumption, it can still implicitly learn a context-dependent internal language model (ILM) due to modern powerful encoders. In this work, we investigate the implicit context dependency modeled in the ILM of CTC. To this end, we propose novel context-dependent ILM estimation methods for CTC based on knowledge distillation (KD) with theoretical justifications. Furthermore, we introduce two regularization methods for KD. We conduct experiments on Librispeech and TED-LIUM Release 2 datasets for in-domain and cross-domain evaluation, respectively. Experimental results show that context-dependent ILMs outperform the context-independent priors in cross-domain evaluation, indicating that CTC learns a context-dependent ILM. The proposed label-level KD with smoothing method surpasses other ILM estimation approaches, with more than 13% relative improvement in word error rate compared to shallow fusion.
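For context, shallow fusion and internal language model (ILM) correction are commonly written as a log-linear combination of scores during decoding. The sketch below shows that standard form; the weights are hypothetical, and whether the paper uses exactly this combination is an assumption.

```python
def fused_score(log_p_ctc, log_p_lm, log_p_ilm=0.0, lm_weight=0.5, ilm_weight=0.0):
    """Log-linear hypothesis score: acoustic model + external LM - estimated ILM.

    With ilm_weight=0 this reduces to plain shallow fusion.
    """
    return log_p_ctc + lm_weight * log_p_lm - ilm_weight * log_p_ilm
```

Subtracting the ILM term is meant to discount the linguistic prior the acoustic model absorbed from its training transcripts, which matters most when the external LM comes from a different domain.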



Article 864

Title@2025-06-06 (5): Reinforcing Code Generation: Improving Text-to-SQL with Execution-Based Learning

Title: Reinforcing Code Generation: Improving Text-to-SQL with Execution-Based Learning Stärkung der Code-Generierung: Verbesserung von Text-zu-SQL mit execution-basiertem Lernen 加强代码生成:通过以执行为基础的学习改进文本到SQL 2506.06093v1

Authors (2): Atharv Kulkarni, Vivek Srikumar

In this work, we study the problem of code generation with a large language model (LLM), with a focus on generating SQL queries from natural language questions. We ask: Instead of using supervised fine-tuning with text-code pairs, can we tune a model by having it interact with a database engine? We frame this problem as a reinforcement learning problem where the model receives execution-based feedback from the environment in the form of scalar rewards. These rewards penalize execution failures and assign positive values when a query returns a correct answer. We use the rewards within the Group Relative Policy Optimization (GRPO) framework. We use a tabular reasoning benchmark to test and evaluate our findings. We find that with only weak supervision in the form of question-answer pairs, RL-tuning improves the accuracy of model-generated SQL code from 31.49 to 49.83 while reducing error percentage from 25.43% to 14.71%. This improvement allows the model to nearly match the performance of the larger SQLCoder-70B model. Our work demonstrates the potential of using execution-based feedback to improve symbolic reasoning capabilities of LLMs.
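The reward scheme described above can be sketched with an in-memory SQLite database: a query that fails to execute is penalized, a query that returns the expected rows is rewarded, and rewards within a group of sampled queries are normalized in the GRPO style. The reward values (-1, 0, +1), the table, and the function names are assumptions made for illustration; the abstract does not give the actual values.

```python
import sqlite3
from statistics import mean, pstdev


def execution_reward(conn, sql, expected_rows):
    """-1 if the query fails to run, +1 if it returns the expected rows, else 0."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return -1.0
    return 1.0 if sorted(rows) == sorted(expected_rows) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same question."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO city VALUES (?, ?)", [("Oslo", 700000), ("Bergen", 290000)])

candidates = [
    "SELECT name FROM city WHERE population > 500000",  # correct answer
    "SELECT name FROM city",                             # runs, wrong answer
    "SELEC name FROM city",                              # fails to execute
]
rewards = [execution_reward(conn, q, [("Oslo",)]) for q in candidates]
advantages = group_relative_advantages(rewards)
```

Only the final answer is compared, which matches the weak supervision from question-answer pairs mentioned in the abstract: no gold SQL is needed. A real setup would also sandbox execution, since generated queries may modify or drop tables.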



Article 865

Title@2025-06-06 (5): HeavyWater and SimplexWater: Watermarking Low-Entropy Text Distributions

Title: HeavyWater and SimplexWater: Watermarking Low-Entropy Text Distributions HeavyWater und SimplexWater: Wassermarkierung mit geringer Entropie Textverteilung 重水和简易水:低英基文本分发的标注水 2506.06409v1

Authors (8): Dor Tsur, Carol Xuan Long, Claudio Mayrink Verdun, Hsiang Hsu, Chen-Fu Chen, Haim Permuter, Sajani Vithana, Flavio P. Calmon

Large language model (LLM) watermarks enable authentication of text provenance, curb misuse of machine-generated text, and promote trust in AI systems. Current watermarks operate by changing the next-token predictions output by an LLM. The updated (i.e., watermarked) predictions depend on random side information produced, for example, by hashing previously generated tokens. LLM watermarking is particularly challenging in low-entropy generation tasks, such as coding, where next-token predictions are near-deterministic. In this paper, we propose an optimization framework for watermark design. Our goal is to understand how to most effectively use random side information in order to maximize the likelihood of watermark detection and minimize the distortion of generated text. Our analysis informs the design of two new watermarks: HeavyWater and SimplexWater. Both watermarks are tunable, gracefully trading off detection accuracy against text distortion. They can also be applied to any LLM and are agnostic to side information generation. We examine the performance of HeavyWater and SimplexWater through several benchmarks, demonstrating that they can achieve high watermark detection accuracy with minimal compromise of text generation quality, particularly in the low-entropy regime. Our theoretical analysis also reveals surprising new connections between LLM watermarking and coding theory. The code implementation can be found at https://github.com/DorTsur/HeavyWater_SimplexWater
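To illustrate how side information from hashing previous tokens can alter next-token predictions, here is a sketch of a generic "green list" style watermark. It is not HeavyWater or SimplexWater; the key, window size, boost `delta`, and function names are all assumptions chosen for the example.

```python
import hashlib
import math
import random


def side_info_seed(prev_tokens, key="secret", window=1):
    """Derive a pseudo-random seed by hashing the most recent token ids with a key."""
    ctx = ",".join(map(str, prev_tokens[-window:]))
    digest = hashlib.sha256(f"{key}|{ctx}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def green_set(seed, vocab_size, fraction=0.5):
    """Pseudo-randomly select a fraction of the vocabulary to favor."""
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(vocab_size * fraction)])


def watermark_probs(probs, prev_tokens, delta=2.0, key="secret"):
    """Multiply favored tokens' probabilities by exp(delta), then renormalize."""
    green = green_set(side_info_seed(prev_tokens, key), len(probs))
    weights = [p * math.exp(delta) if i in green else p for i, p in enumerate(probs)]
    total = sum(weights)
    return [w / total for w in weights]
```

With a uniform distribution over four tokens, the two favored tokens end up with about 88% of the mass, which a detector holding the key can pick up. With a near-deterministic distribution such as `[0.999, ...]`, the same boost changes the top token's probability by less than 0.01, which is the low-entropy difficulty the abstract refers to.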



Article 866

Title@2025-06-06 (5): GraphCheck: Multi-Path Fact-Checking with Entity-Relationship Graphs

Title: GraphCheck: Multi-Path Fact-Checking with Entity-Relationship Graphs GraphCheck: Multi-Path-Faktenprüfung mit Entity-Relationship Graphen 图表检查:多平面实况调查实体-关系图 2502.20785v2

Authors (2): Hyewon Jeon, Jay-Yoon Lee

Automated fact-checking aims to assess the truthfulness of textual claims based on relevant evidence. However, verifying complex claims that require multi-hop reasoning remains a significant challenge. We propose GraphCheck, a novel framework that transforms claims into entity-relationship graphs for structured and systematic verification. By explicitly modeling both explicit and latent entities and exploring multiple reasoning paths, GraphCheck improves verification robustness. While GraphCheck excels in complex scenarios, it may be unnecessarily elaborate for simpler claims. To address this, we introduce DP-GraphCheck, a variant that employs a lightweight strategy selector to adaptively choose between direct prompting and GraphCheck. This selective mechanism improves both accuracy and efficiency by applying the appropriate level of reasoning to each claim. Experiments on the HOVER and EX-FEVER datasets demonstrate that our approach outperforms existing methods, particularly on multi-hop claims. Moreover, the strategy selection mechanism in DP-GraphCheck generalizes well to other fact-checking pipelines, highlighting the versatility of our framework.



Article 867

Title@2025-06-06 (5): Tug-of-war between idiom’s figurative and literal meanings in LLMs

Title: Tug-of-war between idiom’s figurative and literal meanings in LLMs Kriegstug zwischen Idiom’s figurativen und wörtlichen Bedeutungen in LLMs 智多明的象征意义和LLMM中字面含义之间的战争图示 2506.01723v3

Authors (5): Soyoung Oh, Xinting Huang, Mathis Pink, Michael Hahn, Vera Demberg

Idioms present a unique challenge for language models due to their non-compositional figurative meanings, which often strongly diverge from the idiom’s literal interpretation. This duality requires a model to learn to represent both meanings and to decide between them in order to interpret an idiom either figuratively or literally. In this paper, we employ tools from mechanistic interpretability to trace how a large pretrained causal transformer (LLama3.2-1B-base) deals with this ambiguity. We localize three steps of idiom processing: First, the idiom’s figurative meaning is retrieved in early attention and MLP sublayers. We identify specific attention heads which boost the figurative meaning of the idiom while suppressing the idiom’s literal interpretation. The model subsequently carries the figurative representation through an intermediate path. Meanwhile, a parallel bypass route forwards the literal interpretation, ensuring that both readings remain available. Overall, our findings provide mechanistic evidence for idiom comprehension in an autoregressive transformer.



Article 868

Title@2025-06-06 (5): CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition

Title: CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition CO-VADA: Ein vertrauensorientierter Voice Augmentation Debiasing-Ansatz für faire Sprach-Emotionserkennung CO-VADA: 以信任为导向的增强声音消除偏见的公平言论态度承认办法 2506.06071v1

Authors (4): Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee

Bias in speech emotion recognition (SER) systems often stems from spurious correlations between speaker characteristics and emotional labels, leading to unfair predictions across demographic groups. Many existing debiasing methods require model-specific changes or demographic annotations, limiting their practical use. We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter irrelevant attributes and generate samples. These augmented samples introduce speaker variations that differ from dominant patterns in the data, guiding the model to focus more on emotion-relevant features. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.



Article 869

Title@2025-06-06 (5): Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning

Title: Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning Zero-Shot-Erkennung des LLM-generierten Codes über eine angenäherte Aufgabenkonditionierung 通过近似任务设置条件对LLM生成的代码进行零热检测 2506.06069v1

Authors (4): Maor Ashkenazi, Ofir Brenner, Tal Furman Shohet, Eran Treister

Detecting Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity. We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection, when considering both the code and the corresponding task prompt that generated it. Our key insight is that when evaluating the probability distribution of code tokens using an LLM, there is little difference between LLM-generated and human-written code. However, conditioning on the task reveals notable differences. This contrasts with natural language text, where differences exist even in the unconditional distributions. Leveraging this, we propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet and then evaluates token-level entropy under the approximated task conditioning (ATC). We further provide a mathematical intuition, contextualizing our method relative to previous approaches. ATC requires neither access to the generator LLM nor the original task prompts, making it practical for real-world applications. To the best of our knowledge, it achieves state-of-the-art results across benchmarks and generalizes across programming languages, including Python, CPP, and Java. Our findings highlight the importance of task-level conditioning for LLM-generated code detection. The supplementary materials and code are available at https://github.com/maorash/ATC, including the dataset gathering implementation, to foster further research in this area.



Article 870

Title@2025-06-06 (5): Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models

Title: Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models Einfach und dennoch effektiv: Extrahieren von Privatdaten über Clients im Federated Fine-Tuning von großen Sprachmodellen 简单但有效:在联邦大语言模式价格调整中,在客户中提取私人数据 2506.06060v1

Authors (5): Yingqi Hu, Zhuo Zhang, Jingyuan Zhang, Lizhen Qu, Zenglin Xu

Federated fine-tuning of large language models (FedLLMs) presents a promising approach for achieving strong model performance while preserving data privacy in sensitive domains. However, the inherent memorization ability of LLMs makes them vulnerable to training data extraction attacks. To investigate this risk, we introduce simple yet effective extraction attack algorithms specifically designed for FedLLMs. In contrast to prior “verbatim” extraction attacks, which assume access to fragments from all training data, our approach operates under a more realistic threat model, where the attacker only has access to a single client’s data and aims to extract previously unseen personally identifiable information (PII) from other clients. This requires leveraging contextual prefixes held by the attacker to generalize across clients. To evaluate the effectiveness of our approaches, we propose two rigorous metrics, coverage rate and efficiency, and extend a real-world legal dataset with PII annotations aligned with CPIS, GDPR, and CCPA standards, achieving 89.9% human-verified precision. Experimental results show that our method can extract up to 56.57% of victim-exclusive PII, with “Address,” “Birthday,” and “Name” being the most vulnerable categories. Our findings underscore the pressing need for robust defense strategies and contribute a new benchmark and evaluation framework for future research in privacy-preserving federated learning.



Article 871

Title@2025-06-06 (5): HIGHT: Hierarchical Graph Tokenization for Molecule-Language Alignment

Title: HIGHT: Hierarchical Graph Tokenization for Molecule-Language Alignment HIGHT: Hierarchische Graphen-Tokenisierung für die Ausrichtung von Molekülen HIG: 分子-语言对齐的等级图形方位化 2406.14021v2

Authors (5): Yongqiang Chen, Quanming Yao, Juzheng Zhang, James Cheng, Yatao Bian

Recently, there has been a surge of interest in extending the success of large language models (LLMs) from texts to molecules. Most existing approaches adopt a graph neural network to represent a molecule as a series of node tokens for molecule-language alignment, which, however, overlooks the inherent hierarchical structures in molecules. Notably, higher-order molecular structures contain rich semantics of functional groups, which encode crucial biochemical functionalities of the molecules. We show that neglecting the hierarchical information in tokenization will lead to subpar molecule-language alignment and severe hallucination. To address this limitation, we propose HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that encodes the hierarchy of atom, motif, and molecular levels of informative tokens to improve the molecular perception of LLMs. HIGHT also adopts an augmented instruction tuning dataset, enriched with the hierarchical graph information, to further enhance the molecule-language alignment. Extensive experiments on 14 real-world benchmarks verify the effectiveness of HIGHT, which reduces hallucination by 40% and yields significant improvements in various molecule-language downstream tasks. The project is available at https://higraphllm.github.io/.



Article 872

Title@2025-06-06 (5): Hey, That’s My Data! Label-Only Dataset Inference in Large Language Models

Title: Hey, That’s My Data! Label-Only Dataset Inference in Large Language Models Hey, das sind meine Daten! Label-Only-Datensatz-Schlussfolgerung in großen Sprachmodellen 嘿,这就是我的数据! 大语言模型中的标签 - 唯一数据集推理 2506.06057v1

Authors (8): Chen Xiong, Zihao Wang, Rui Zhu, Tsung-Yi Ho, Pin-Yu Chen, Jingwei Xiong, Haixu Tang, Lucila Ohno-Machado

Large Language Models (LLMs) have revolutionized Natural Language Processing by excelling at interpreting, reasoning about, and generating human language. However, their reliance on large-scale, often proprietary datasets poses a critical challenge: unauthorized usage of such data can lead to copyright infringement and significant financial harm. Existing dataset-inference methods typically depend on log probabilities to detect suspicious training material, yet many leading LLMs have begun withholding or obfuscating these signals. This reality underscores the pressing need for label-only approaches capable of identifying dataset membership without relying on internal model logits. We address this gap by introducing CatShift, a label-only dataset-inference framework that capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data. If a suspicious dataset was previously seen by the model, fine-tuning on a portion of it triggers a pronounced post-tuning shift in the model’s outputs; conversely, truly novel data elicits more modest changes. By comparing the model’s output shifts for a suspicious dataset against those for a known non-member validation set, we statistically determine whether the suspicious set is likely to have been part of the model’s original training corpus. Extensive experiments on both open-source and API-based LLMs validate CatShift’s effectiveness in logit-inaccessible settings, offering a robust and practical solution for safeguarding proprietary data.
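
The comparison step described above (output shifts on the suspicious set versus shifts on a known non-member set) can be illustrated with a small statistical test. The sketch below is not CatShift itself: the abstract does not name the test used, so a one-sided permutation test is swapped in here as an assumption, and the per-example shift scores are made-up numbers standing in for whatever distance between pre- and post-tuning outputs one chooses to measure.

```python
import random

def permutation_p_value(suspect, reference, n_perm=2000, seed=0):
    """One-sided permutation test: is the mean shift of `suspect` larger
    than that of `reference`? (Assumed test, not taken from the paper.)"""
    rng = random.Random(seed)
    observed = sum(suspect) / len(suspect) - sum(reference) / len(reference)
    pooled = list(suspect) + list(reference)
    k = len(suspect)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical output-shift scores after fine-tuning on part of each set.
suspect_shift = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69]
reference_shift = [0.31, 0.28, 0.35, 0.40, 0.33, 0.29]

p_member = permutation_p_value(suspect_shift, reference_shift)
print(f"p-value that the suspicious set shifts no more than the reference: {p_member:.4f}")
```

A small p-value here would suggest that the suspicious set shifts more than the non-member baseline, which under the abstract's hypothesis points toward prior membership. The threshold and the shift measure remain design choices.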



Article 873

Title@2025-06-06 (5): MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Title: MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding MedXpertQA: Benchmarking von medizinischem Verständnis auf Expertenebene MedXperttQA:确定专家级医疗理由和理解基准 2501.18362v3

Authors (9): Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou

We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 18 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models. Code and data are available at: https://github.com/TsinghuaC3I/MedXpertQA



Article 874

Title@2025-06-06 (5): SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities

Title: SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities SMAR: Soft Modality-Aware Routing-Strategie für MoE-basierte multimodale große Sprachmodelle Erhaltung von Sprachfähigkeiten SMAR: 以教育部为基础的维护语言能力的多模式多语言模型软式模式-软件运行战略 2506.06406v1

Authors (7): Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Chen Wei, Fangxiang Feng, Xiaojie Wang

Mixture of Experts (MoE) architectures have become a key approach for scaling large language models, with growing interest in extending them to multimodal tasks. Existing methods to build multimodal MoE models either incur high training costs or suffer from degraded language capabilities when adapting pretrained models. To address this, we propose Soft Modality-Aware Routing (SMAR), a novel regularization technique that uses Kullback-Leibler divergence to control routing probability distributions across modalities, encouraging expert specialization without modifying model architecture or heavily relying on textual data. Experiments on visual instruction tuning show that SMAR preserves language ability at 86.6% retention with only 2.5% pure text, outperforming baselines while maintaining strong multimodal performance. Our approach offers a practical and efficient solution to balance modality differentiation and language capabilities in multimodal MoE models.
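
To make the idea of measuring routing differences across modalities concrete, here is a minimal sketch that computes the Kullback-Leibler divergence between the average expert-routing distributions of text tokens and image tokens. The routing values are invented, and the `band_penalty` helper (penalizing a divergence that falls outside a target range) is a hypothetical illustration of "controlling" the divergence; the abstract does not give SMAR's actual loss.

```python
import math

def mean_distribution(rows):
    """Average a list of per-token routing distributions over experts."""
    n_experts = len(rows[0])
    return [sum(r[e] for r in rows) / len(rows) for e in range(n_experts)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same experts."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def band_penalty(kl, low, high):
    """Hypothetical regularizer: zero inside [low, high], linear outside."""
    return max(0.0, low - kl) + max(0.0, kl - high)

# Made-up router outputs (softmax over 4 experts) for a few tokens.
text_routing = [[0.70, 0.20, 0.05, 0.05], [0.60, 0.30, 0.05, 0.05]]
image_routing = [[0.10, 0.10, 0.40, 0.40], [0.05, 0.15, 0.50, 0.30]]

p_text = mean_distribution(text_routing)
p_image = mean_distribution(image_routing)
modality_gap = kl_divergence(p_text, p_image)
print(f"KL(text || image) = {modality_gap:.3f}")
print(f"penalty for target band [0.5, 1.0] = {band_penalty(modality_gap, 0.5, 1.0):.3f}")
```

In a training loop, a term like this would be added to the task loss with a small weight, so that the router is nudged toward (but not forced into) modality-specific experts.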



Article 875

Title@2025-06-06 (5): Rethinking Machine Unlearning in Image Generation Models

Title: Rethinking Machine Unlearning in Image Generation Models Rethinking Machine Unlearning in Image Generation Models 在图像生成模型中重新思考机离学习 2506.02761v2

Authors (6): Renyang Liu, Wenjie Feng, Tianwei Zhang, Wei Zhou, Xueqi Cheng, See-Kiong Ng

With the surge and widespread application of image generation models, data privacy and content safety have become major concerns and attracted great attention from users, service providers, and policymakers. Machine unlearning (MU) is recognized as a cost-effective and promising means to address these challenges. Despite some advancements, image generation model unlearning (IGMU) still faces remarkable gaps in practice, e.g., unclear task discrimination and unlearning guidelines, lack of an effective evaluation framework, and unreliable evaluation metrics. These can hinder the understanding of unlearning mechanisms and the design of practical unlearning algorithms. We perform exhaustive assessments over existing state-of-the-art unlearning algorithms and evaluation standards, and discover several critical flaws and challenges in IGMU tasks. Driven by these limitations, we make several core contributions, to facilitate the comprehensive understanding, standardized categorization, and reliable evaluation of IGMU. Specifically, (1) We design CatIGMU, a novel hierarchical task categorization framework. It provides detailed implementation guidance for IGMU, assisting in the design of unlearning algorithms and the construction of testbeds. (2) We introduce EvalIGMU, a comprehensive evaluation framework. It includes reliable quantitative metrics across five critical aspects. (3) We construct DataIGM, a high-quality unlearning dataset, which can be used for extensive evaluations of IGMU, training content detectors for judgment, and benchmarking the state-of-the-art unlearning algorithms. With EvalIGMU and DataIGM, we discover that most existing IGMU algorithms cannot handle the unlearning well across different evaluation dimensions, especially for preservation and robustness. Code and models are available at https://github.com/ryliu68/IGMU.



Article 876

Title@2025-06-06 (5): MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems?

Title: MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems? MATP-BENCH: Kann MLLM für multimodale Probleme ein gutes Automatisiertes Theorem-Prover sein? MLLM能成为多模式问题的一个好的自动理论导论吗? 2506.06034v1

Authors (5): Zhitao He, Zongwei Lyu, Dazhong Chen, Dadi Guo, Yi R. Fung

Numerous theorems, such as those in geometry, are often presented in multimodal forms (e.g., diagrams). Humans benefit from visual reasoning in such settings, using diagrams to gain intuition and guide the proof process. Modern Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in solving a wide range of mathematical problems. However, the potential of MLLMs as Automated Theorem Provers (ATPs), specifically in the multimodal domain, remains underexplored. In this paper, we introduce the Multimodal Automated Theorem Proving benchmark (MATP-BENCH), a new Multimodal, Multi-level, and Multi-language benchmark designed to evaluate MLLMs in this role as multimodal automated theorem provers. MATP-BENCH consists of 1056 multimodal theorems drawn from high school, university, and competition-level mathematics. All these multimodal problems are accompanied by formalizations in Lean 4, Coq and Isabelle, thus making the benchmark compatible with a wide range of theorem-proving frameworks. MATP-BENCH requires models to integrate sophisticated visual understanding with mastery of a broad spectrum of mathematical knowledge and rigorous symbolic reasoning to generate formal proofs. We use MATP-BENCH to evaluate a variety of advanced multimodal language models. Existing methods can only solve a limited number of the MATP-BENCH problems, indicating that this benchmark poses an open challenge for research on automated theorem proving.



Article 877

Title@2025-06-06 (5): Large Language Models are Demonstration Pre-Selectors for Themselves

Title: Large Language Models are Demonstration Pre-Selectors for Themselves Große Sprachmodelle sind Demonstrationsvorwähler für sich selbst 大语言模型是演示预选人 2506.06033v1

Authors (9): Jiarui Jin, Yuwei Wu, Haoxuan Li, Xiaoting He, Weinan Zhang, Yiming Yang, Yong Yu, Jun Wang, Mengyue Yang

In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training data. However, existing ICL methods, which rely on similarity or diversity scores to choose demonstrations, incur high computational costs due to repeated retrieval from large-scale datasets for each query. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel pre-selection framework that identifies a representative subset of demonstrations containing the most representative examples in the training data, tailored to specific LLMs. To construct this subset, we introduce the “sufficiency” and “necessity” metrics in the pre-selection stage and design a tree-based algorithm to identify representative examples efficiently. Once pre-selected, this representative subset can effectively replace the full training data, improving efficiency while maintaining comparable performance in ICL. Additionally, our pre-selected subset also benefits fine-tuning LLMs, where we introduce a bi-level optimization method that enhances training efficiency without sacrificing performance. Experiments with LLMs ranging from 300M to 8B parameters show that FEEDER can reduce training data size by over 20% while maintaining performance and seamlessly integrating with various downstream demonstration selection strategies in ICL.



Article 878

Title@2025-06-06 (5): Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation

Title: Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation Begründung durch Ausführung: Vereinheitlichung von Prozess- und Ergebnisprämien für die Codegenerierung 执行中的理由:代码生成的统一程序和结果奖励 2412.15118v2

Authors (8): Zhuohao Yu, Weizheng Gu, Yidong Wang, Xingru Jiang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang

Large Language Models excel at code generation yet struggle with complex programming tasks that demand sophisticated reasoning. To bridge this gap, traditional process supervision relies on learned reward models requiring costly training data and suffering from reward misalignment, while outcome supervision fails for complex tasks needing coordinated intermediate steps. We introduce Outcome Refining Process Supervision, which unifies process and outcome supervision by leveraging executable verification: a tree-structured search framework generates strategic alternatives, profiles execution metrics, and scores candidates via self-critique mechanisms that integrate runtime feedback with reasoning. Experiments across 5 models and 3 benchmarks show consistent gains, with 26.9% higher correctness and 42.2% improved code efficiency. The results demonstrate that ORPS enables LLMs to overcome local optima in code generation, suggesting a promising direction for combining verifiable outcomes with structured reasoning to tackle complex challenges. We open-source at: https://github.com/zhuohaoyu/ORPS



Article 879

Title@2025-06-06 (5): Improving Customer Service with Automatic Topic Detection in User Emails

Title: Improving Customer Service with Automatic Topic Detection in User Emails Verbesserung des Kundenservices mit automatischer Themenerkennung in Benutzer-E-Mails 改进用户电子邮件自动专题检测客户服务 2502.19115v3

Authors (5): Bojana Bašaragin, Darija Medvecki, Gorana Gojić, Milena Oparnica, Dragiša Mišković

This study introduces a novel natural language processing pipeline that enhances customer service efficiency at Telekom Srbija, a leading Serbian telecommunications company, through automated email topic detection and labeling. Central to the pipeline is BERTopic, a modular framework that allows unsupervised topic modeling. After a series of preprocessing and postprocessing steps, we assign one of 12 topics and several additional labels to incoming emails, allowing customer service to filter and access them through a custom-made application. While applied to Serbian, the methodology is conceptually language-agnostic and can be readily adapted to other languages, particularly those that are low-resourced and morphologically rich. The system performance was evaluated by assessing the speed and correctness of the automatically assigned topics, with a weighted average processing time of 0.041 seconds per email and a weighted average F1 score of 0.96. The system now operates in the company’s production environment, streamlining customer service operations through automated email classification.



Article 880

Title@2025-06-06 (5): When to Trust Context: Self-Reflective Debates for Context Reliability

Title: When to Trust Context: Self-Reflective Debates for Context Reliability Wann man vertraut Kontext: Selbstreflektierende Debatten für Kontext Zuverlässigkeit 何时信任背景:关于环境可靠性的自反辩论 2506.06020v1

Authors (8): Zeqi Zhou, Fang Wu, Shayan Talaei, Haokai Zhao, Cheng Meixin, Tinson Xu, Amin Saberi, Yejin Choi

Large language models frequently encounter conflicts between their parametric knowledge and contextual input, often resulting in factual inconsistencies or hallucinations. We propose Self-Reflective Debate for Contextual Reliability (SR-DCR), a lightweight framework that integrates token-level self-confidence with an asymmetric multi-agent debate to adjudicate such conflicts. A critic, deprived of context, challenges a defender who argues from the given passage; a judge model evaluates the debate and determines the context’s reliability. The final answer is selected by combining the verdict with model confidence. Experiments on the ClashEval benchmark demonstrate that SR-DCR consistently enhances robustness to misleading context while maintaining accuracy on trustworthy inputs, outperforming both classical debate and confidence-only baselines with minimal computational overhead. The code is available at https://github.com/smiles724/Self-Reflective-Debates.



Article 881

Title: AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search AgentSwift: Effizientes LLM Agent Design über wertgeführte Hierarchische Suche AgentSwift: 高效LLM Agent 设计 2506.06017v1

Authors (8): Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li

Large language model (LLM) agents have demonstrated strong capabilities across diverse domains. However, designing high-performing agentic systems remains challenging. Existing agent search methods suffer from three major limitations: (1) an emphasis on optimizing agentic workflows while under-utilizing proven human-designed components such as memory, planning, and tool use; (2) high evaluation costs, as each newly generated agent must be fully evaluated on benchmarks; and (3) inefficient search in large search spaces. In this work, we introduce a comprehensive framework to address these challenges. First, we propose a hierarchical search space that jointly models agentic workflow and composable functional components, enabling richer agentic system designs. Building on this structured design space, we introduce a predictive value model that estimates agent performance given an agentic system and task description, allowing for efficient, low-cost evaluation during the search process. Finally, we present a hierarchical Monte Carlo Tree Search (MCTS) strategy informed by uncertainty to guide the search. Experiments on seven benchmarks, covering embodied, math, web, tool, and game, show that our method achieves an average performance gain of 8.34% over state-of-the-art baselines and exhibits faster search progress with steeper improvement trajectories. Code repo is available at https://github.com/Ericccc02/AgentSwift.



Article 882

Title@2025-06-06 (5): Unlocking Recursive Thinking of LLMs: Alignment via Refinement

Title: Unlocking Recursive Thinking of LLMs: Alignment via Refinement Entsperrendes rekursives Denken an LLMs: Ausrichtung durch Verfeinerung 解锁对LLMM的递归思维:通过精炼调整 2506.06009v1

Authors (5): Haoke Zhang, Xiaobo Liang, Cunxiang Wang, Juntao Li, Min Zhang

The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose AvR: Alignment via Refinement, a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize refinement-aware rewards. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available at GitHub (https://github.com/Banner-Z/AvR.git).



Article 883

Title@2025-06-06 (5): Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models

Title: Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models Token-Signatur: Vorhersage von Chain-of-Thought-Gewinnen mit Token-Decoding-Funktion in großen Sprachmodellen Tok 签名:以大语言模型的 Tok 解码功能预测所寻求的收益链 2506.06008v1

Authors (3): Peijie Liu, Fengli Xu, Yong Li

The Chain-of-Thought (CoT) technique has proven effective in improving the performance of large language models (LLMs) on complex reasoning tasks. However, the performance gains are inconsistent across different tasks, and the underlying mechanism remains a long-standing research question. In this work, we make a preliminary observation that the monotonicity of token probability distributions may be correlated with the gains achieved through CoT reasoning. Leveraging this insight, we propose two indicators based on the token probability distribution to assess CoT effectiveness across different tasks. By combining instance-level indicators with a logistic regression model, we introduce Dynamic CoT, a method that dynamically selects between CoT and direct answer. Furthermore, we extend Dynamic CoT to closed-source models by transferring decision strategies learned from open-source models. Our indicators for assessing CoT effectiveness achieve an accuracy of 89.2%, and Dynamic CoT reduces token consumption by more than 35% while maintaining high accuracy. Overall, our work offers a novel perspective on the underlying mechanisms of CoT reasoning and provides a framework for its more efficient deployment.



Article 884

Title@2025-06-06 (5): Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models

Title: Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models Bootstrapping-Weltmodelle aus Dynamics-Modellen in multimodalen Foundation-Modellen 从多式联运基础模型中的动态模型中推出世界模型 2506.06006v1

Authors (5): Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

To what extent do vision-and-language foundation models possess a realistic world model (observation × action → observation) and a dynamics model (observation × observation → action), when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, the dynamics model can annotate actions for unlabelled pairs of video frame observations to expand the training data. We further propose a new objective, where image tokens in observation pairs are weighted by their importance, as predicted by a recognition model. Secondly, the dynamics models can assign rewards to multiple samples of the world model to score them, effectively guiding search at inference time. We evaluate the world models resulting from both strategies through the task of action-centric image editing on Aurora-Bench. Our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin of 15% on real-world subsets according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
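
The second strategy above, inference-time verification, amounts to reranking world-model samples by how well a dynamics model recovers the intended action from each (observation, candidate) pair. The toy sketch below illustrates that control flow only: observations are integers, the two actions and the `dynamics_model` probabilities are invented stand-ins, and nothing here reflects the paper's actual models.

```python
def dynamics_model(obs, next_obs):
    """Toy dynamics model: probability of each action given (obs, next_obs)."""
    delta = next_obs - obs
    if delta > 0:
        return {"increase": 0.9, "decrease": 0.1}
    if delta < 0:
        return {"increase": 0.1, "decrease": 0.9}
    return {"increase": 0.5, "decrease": 0.5}

def rerank(obs, action, candidates):
    """Pick the candidate whose transition best matches the intended action."""
    scored = [(dynamics_model(obs, c)[action], c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

# Hypothetical world-model samples for obs=5 and action="increase".
candidates = [4, 5, 7]
best = rerank(5, "increase", candidates)
print(best)
```

The same scores could also serve as weak labels for the first strategy, where the dynamics model annotates actions for unlabelled observation pairs.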



Article 885

Title@2025-06-06 (5): Judgment of Learning: A Human Ability Beyond Generative Artificial Intelligence

Title: Judgment of Learning: A Human Ability Beyond Generative Artificial Intelligence Urteil des Lernens: Eine menschliche Fähigkeit jenseits Generativer Künstlicher Intelligenz 学习总结:人类能力超越创造人工智能的能力 2410.13392v3

Authors (2): Markus Huff, Elanur Ulakçı

Large language models (LLMs) increasingly mimic human cognition in various language-based tasks. However, their capacity for metacognition - particularly in predicting memory performance - remains unexplored. Here, we introduce a cross-agent prediction model to assess whether ChatGPT-based LLMs align with human judgments of learning (JOL), a metacognitive measure where individuals predict their own future memory performance. We tested humans and LLMs on pairs of sentences, one of which was a garden-path sentence - a sentence that initially misleads the reader toward an incorrect interpretation before requiring reanalysis. By manipulating contextual fit (fitting vs. unfitting sentences), we probed how intrinsic cues (i.e., relatedness) affect both LLM and human JOL. Our results revealed that while human JOL reliably predicted actual memory performance, none of the tested LLMs (GPT-3.5-turbo, GPT-4-turbo, and GPT-4o) demonstrated comparable predictive accuracy. This discrepancy emerged regardless of whether sentences appeared in fitting or unfitting contexts. These findings indicate that, despite LLMs’ demonstrated capacity to model human cognition at the object-level, they struggle at the meta-level, failing to capture the variability in individual memory predictions. By identifying this shortcoming, our study underscores the need for further refinements in LLMs’ self-monitoring abilities, which could enhance their utility in educational settings, personalized learning, and human-AI interactions. Strengthening LLMs’ metacognitive performance may reduce the reliance on human oversight, paving the way for more autonomous and seamless integration of AI into tasks requiring deeper cognitive awareness.



Article 886

Title@2025-06-06 (5): Does It Make Sense to Speak of Introspection in Large Language Models?

Title: Does It Make Sense to Speak of Introspection in Large Language Models? Macht es Sinn, von Introspektion in großen Sprachmodellen zu sprechen? 它是否让人们想到在大语言模型中谈论反省? 2506.05068v2

Authors (2): Iulia M. Comsa, Murray Shanahan

Large language models (LLMs) exhibit compelling linguistic behaviour, and sometimes offer self-reports, that is to say statements about their own nature, inner workings, or behaviour. In humans, such reports are often attributed to a faculty of introspection and are typically linked to consciousness. This raises the question of how to interpret self-reports produced by LLMs, given their increasing linguistic fluency and cognitive capabilities. To what extent (if any) can the concept of introspection be meaningfully applied to LLMs? Here, we present and critique two examples of apparent introspective self-report from LLMs. In the first example, an LLM attempts to describe the process behind its own “creative” writing, and we argue this is not a valid example of introspection. In the second example, an LLM correctly infers the value of its own temperature parameter, and we argue that this can be legitimately considered a minimal example of introspection, albeit one that is (presumably) not accompanied by conscious experience.



Article 887

Title@2025-06-06 (5): A Culturally-Rich Romanian NLP Dataset from “Who Wants to Be a Millionaire?” Videos

Title: A Culturally-Rich Romanian NLP Dataset from “Who Wants to Be a Millionaire?” Videos Ein kultureller Rich rumänischer NLP-Datensatz aus “Wer will Millionär werden?” Videos 罗马尼亚NLP数据来自“谁想成为百万富翁?” 2506.05991v1

Authors (3): Alexandru-Gabriel Ganea, Antonia-Adelina Popovici, Adrian-Marius Dumitran

Large Language Models (LLMs) demonstrate varying performance across languages and cultural contexts. This study introduces a novel, culturally-rich, multilingual dataset derived from video recordings of the Romanian game show “Who Wants to Be a Millionaire?” (Vrei să fii Milionar?). We employed an innovative process combining optical character recognition (OCR), automated text extraction, and manual verification to collect question-answer pairs, enriching them with metadata including question domain (e.g., biology, history), cultural relevance (Romanian-specific vs. international), and difficulty. Benchmarking state-of-the-art LLMs, including Romanian-adapted models, on this dataset revealed significant performance disparities: models consistently achieve higher accuracy (80-95%) on international questions compared to Romanian-specific cultural questions (50-75%). We further investigate these differences through experiments involving machine translation of Romanian questions into English and cross-lingual tests using a comparable dataset in French. Our findings underscore the impact of cultural context and data source on LLM performance and offer practical insights for building robust, culturally-aware multilingual NLP systems, especially in educational domains. The dataset is publicly available at Hugging Face.



Article 888

Title@2025-06-06 (5): Peri-LN: Revisiting Normalization Layer in the Transformer Architecture

Title: Peri-LN: Revisiting Normalization Layer in the Transformer Architecture Peri-LN: Normalisierungsschicht in der Transformer-Architektur neu aufsuchen Peri-LN: 变形器结构中重新审视正常化层 2502.02732v3

Authors (10): Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, Kang Min Yoo

Selecting a layer normalization (LN) strategy that stabilizes training and speeds convergence in Transformers remains difficult, even for today’s large language models (LLMs). We present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformers. Until recently, Pre-LN and Post-LN have long dominated practices despite their limitations in large-scale training. However, several open-source models have recently begun silently adopting a third strategy without much explanation. This strategy places normalization layers peripherally around sublayers, a design we term Peri-LN. While Peri-LN has demonstrated promising performance, its precise mechanisms and benefits remain almost unexplored. Our in-depth analysis delineates the distinct behaviors of LN strategies, showing how each placement shapes activation variance and gradient propagation. To validate our theoretical insight, we conduct extensive experiments on Transformers up to 3.2B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability. Our results suggest that Peri-LN warrants broader consideration for large-scale Transformer architectures, providing renewed insights into the optimal placement of LN.
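
The three placements can be contrasted in a few lines. The sketch below takes "peripherally around sublayers" to mean normalizing both the input and the output of each sublayer before the residual addition; that reading, the toy `sublayer` function, and the omission of learnable gain and bias are all assumptions made for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for an attention or MLP sublayer."""
    return [3.0 * v + 1.0 for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def post_ln_block(x):
    # Post-LN: normalize after the residual addition.
    return layer_norm(add(x, sublayer(x)))

def pre_ln_block(x):
    # Pre-LN: normalize only the sublayer input.
    return add(x, sublayer(layer_norm(x)))

def peri_ln_block(x):
    # Peri-LN (as read here): normalize the sublayer input and its output.
    return add(x, layer_norm(sublayer(layer_norm(x))))

x = [1.0, 2.0, 4.0, 8.0]
for name, block in [("post", post_ln_block), ("pre", pre_ln_block), ("peri", peri_ln_block)]:
    print(name, [round(v, 3) for v in block(x)])
```

In this toy setting the Pre-LN residual branch carries whatever scale the sublayer produces, while the Peri-LN branch is renormalized before being added to the residual stream, which is one intuition for the more balanced variance growth the abstract reports.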



Article 889

Title@2025-06-06 (5): DPO-Shift: Shifting the Distribution of Direct Preference Optimization

Title: DPO-Shift: Shifting the Distribution of Direct Preference Optimization DPO-Shift: Verschieben der Verteilung der Direktpräferenzoptimierung DPO-Shift:改变直接优惠优化分配 2502.07599v2

Authors (5): Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li

Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, and this phenomenon is known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. Then, we show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.



Article 890

Title@2025-06-06 (5): Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Title: Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations Jacobian Sparse Autoencoder: Sparsify Computations, nicht nur Aktivierungen Jacobian Sparass Autoencoders: 简化计算,而不仅仅是激活 2502.18147v2

Authors (4): Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison

Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of LLMs. However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to “sparsify” computations in any sense, only latent activations. To solve this, we propose Jacobian SAEs (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naive implementation, the Jacobians in LLMs would be computationally intractable due to their size. One key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that Jacobians are a reasonable proxy for computational sparsity because MLPs are approximately linear when rewritten in the JSAE basis. Lastly, we show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLM. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might be more suitable for understanding learned transformer computations than standard SAEs.



Article 891

Title@2025-06-06 (5): A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations

Title: A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations Eine Übersicht über automatische Bewertungsmethoden zu Text-, Bild- und Sprachgenerierungen 文字、视觉和语音一代自动评价方法调查 2506.10019v1

Authors (11): Tian Lan, Yang-Hao Zhou, Zi-Ao Ma, Fanshu Sun, Rui-Qing Sun, Junyu Luo, Rong-Cheng Tu, Heyan Huang, Chen Xu, Zhijing Wu, Xian-Ling Mao

Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across text, visual, and audio modalities. To address this issue, we present a comprehensive review and a unified taxonomy of automatic evaluation methods for generated content across all three modalities. We identify five fundamental paradigms that characterize existing evaluation approaches across these domains. Our analysis begins by examining evaluation methods for text generation, where techniques are most mature. We then extend this framework to image and audio generation, demonstrating its broad applicability. Finally, we discuss promising directions for future research in cross-modal evaluation methodologies.



Article 892

Title@2025-06-06 (5): The Synergy of LLMs & RL Unlocks Offline Learning of Generalizable Language-Conditioned Policies with Low-fidelity Data

Title: The Synergy of LLMs & RL Unlocks Offline Learning of Generalizable Language-Conditioned Policies with Low-fidelity Data Die Synergie von LLMs & RL entsperrt Offline-Lernen allgemeiner sprachkonditionierter Politiken mit Daten mit geringer Fidelität LLMs和RL Unlocks离线学习LLM和RL Unlocks的协同效应 2412.06877v2

Authors (4): Thomas Pouplin, Katarzyna Kobalczyk, Hao Sun, Mihaela van der Schaar

Developing autonomous agents capable of performing complex, multi-step decision-making tasks specified in natural language remains a significant challenge, particularly in realistic settings where labeled data is scarce and real-time experimentation is impractical. Existing reinforcement learning (RL) approaches often struggle to generalize to unseen goals and states, limiting their applicability. In this paper, we introduce TEDUO, a novel training pipeline for offline language-conditioned policy learning in symbolic environments. Unlike conventional methods, TEDUO operates on readily available, unlabeled datasets and addresses the challenge of generalization to previously unseen goals and states. Our approach harnesses large language models (LLMs) in a dual capacity: first, as automatization tools augmenting offline datasets with richer annotations, and second, as generalizable instruction-following agents. Empirical results demonstrate that TEDUO achieves data-efficient learning of robust language-conditioned policies, accomplishing tasks beyond the reach of conventional RL frameworks or out-of-the-box LLMs alone.



Article 893

Title@2025-06-06 (5): Audio-Aware Large Language Models as Judges for Speaking Styles

Title: Audio-Aware Large Language Models as Judges for Speaking Styles Audio-Bewusst Große Sprachmodelle als Richter für sprechende Stile 音响软件大语言模型,作为语音风格法官 2506.05984v1

Authors (11): Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

Audio-aware large language models (ALLMs) can understand the textual and non-textual information in the audio input. In this paper, we explore using ALLMs as an automatic judge to assess the speaking styles of speeches. We use ALLM judges to evaluate the speeches generated by SLMs on two tasks: voice style instruction following and role-playing. The speaking style we consider includes emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four spoken language models (SLMs) to complete the two tasks and use humans and ALLMs to judge the SLMs’ responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as a judge to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.



Article 894

Title@2025-06-06 (5): Tau-Eval: A Unified Evaluation Framework for Useful and Private Text Anonymization

Title: Tau-Eval: A Unified Evaluation Framework for Useful and Private Text Anonymization Tau-Eval: Ein einheitliches Evaluations-Framework für sinnvolle und private Textanonymisierung Tau-Eval:有用和私用文本匿名化统一评价框架 2506.05979v1

Authors (5): Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi

Text anonymization is the process of removing or obfuscating information from textual data to protect the privacy of individuals. This process inherently involves a complex trade-off between privacy protection and information preservation, where stringent anonymization methods can significantly impact the text’s utility for downstream applications. Evaluating the effectiveness of text anonymization proves challenging from both privacy and utility perspectives, as there is no universal benchmark that can comprehensively assess anonymization techniques across diverse and sometimes contradictory contexts. We present Tau-Eval, an open-source framework for benchmarking text anonymization methods through the lens of privacy and utility task sensitivity. A Python library, code, documentation and tutorials are publicly available.



Article 895

Title@2025-06-06 (5): LTG at SemEval-2025 Task 10: Optimizing Context for Classification of Narrative Roles

Title: LTG at SemEval-2025 Task 10: Optimizing Context for Classification of Narrative Roles LTG bei SemEval-2025 Aufgabe 10: Optimierung des Kontextes für die Klassifizierung narrativer Rollen SemEval-2025 LTG 任务10:优化叙述作用分类的背景 2506.05976v1

Authors (2): Egil Rønningstad, Gaurav Negi

Our contribution to the SemEval 2025 shared task 10, subtask 1 on entity framing, tackles the challenge of providing the necessary segments from longer documents as context for classification with a masked language model. We show that simple entity-oriented heuristics for context selection can enable text classification using models with a limited context window. Our context selection approach combined with the XLM-RoBERTa language model is on par with, or outperforms, Supervised Fine-Tuning with larger generative language models.



Article 896

Title@2025-06-06 (5): Let’s Put Ourselves in Sally’s Shoes: Shoes-of-Others Prefixing Improves Theory of Mind in Large Language Models

Title: Let’s Put Ourselves in Sally’s Shoes: Shoes-of-Others Prefixing Improves Theory of Mind in Large Language Models Setzen wir uns selbst in Sallys Schuhe: Andere Schuhe präfixieren verbessert die Theorie des Geistes in großen Sprachmodellen 让我们把我们自己放在莎莉的鞋子里: 在大语言模型中, 鞋上的其他人的预设改进思维理论 2506.05970v1

Authors (7): Kazutoshi Shinoda, Nobukatsu Hojo, Kyosuke Nishida, Yoshihiro Yamazaki, Keita Suzuki, Hiroaki Sugiyama, Kuniko Saito

Recent studies have shown that Theory of Mind (ToM) in large language models (LLMs) has not reached human-level performance yet. Since fine-tuning LLMs on ToM datasets often degrades their generalization, several inference-time methods have been proposed to enhance ToM in LLMs. However, existing inference-time methods for ToM are specialized for inferring beliefs from contexts involving changes in the world state. In this study, we present a new inference-time method for ToM, Shoes-of-Others (SoO) prefixing, which makes fewer assumptions about contexts and is applicable to broader scenarios. SoO prefixing simply specifies the beginning of LLM outputs with “Let’s put ourselves in A’s shoes.”, where A denotes the target character’s name. We evaluate SoO prefixing on two benchmarks that assess ToM in conversational and narrative contexts without changes in the world state and find that it consistently improves ToM across five categories of mental states. Our analysis suggests that SoO prefixing elicits faithful thoughts, thereby improving the ToM performance.
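
The prefixing step described in this abstract is simple enough to sketch directly. The example below builds the SoO prefix for a given character and appends it to a prompt so that a model would continue from it. The helper names and the assumption of a plain-text completion interface are illustrative and not taken from the paper.

```python
def soo_prefix(character_name: str) -> str:
    """Return the Shoes-of-Others prefix for the target character."""
    return f"Let's put ourselves in {character_name}'s shoes."

def build_generation_input(prompt: str, character_name: str) -> str:
    """Append the prefix so the model's output begins with it.

    Assumes a completion-style interface where generation continues
    from the end of the supplied text.
    """
    return f"{prompt}\n{soo_prefix(character_name)}"

example = build_generation_input("Where does Sally think the ball is?", "Sally")
```

With a chat-style interface the same prefix would instead be supplied as the start of the assistant turn, if the serving stack allows prefilled responses.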



Article 897

Title@2025-06-06 (5): TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge

Title: TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge TRACT: Regressionsbewusstes Feinjustieren trifft auf eine schlüsselfertige Begründung für LLM-as-a-Richter TRACT:LLM作为法官的理由 2503.04381v2

Authors (3): Cheng-Han Chiang, Hung-yi Lee, Michal Lukasik

The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation, where a numerical assessment is assigned by an LLM to the input text following scoring rubrics. Existing methods for LLM-as-a-judge use cross-entropy (CE) loss for fine-tuning, which neglects the numeric nature of score prediction. Recent work addresses numerical prediction limitations of LLM fine-tuning through regression-aware fine-tuning, which, however, does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method combining CoT reasoning with regression-aware training. TRACT consists of two stages: first, a seed LLM is fine-tuned to generate CoTs, which serve as supervision for the second-stage fine-tuning. The training objective of TRACT combines the CE loss for learning the CoT reasoning capabilities, and the regression-aware loss for the score prediction. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the importance of each component in TRACT.
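
One way to picture the combined objective is sketched below. It assumes that the regression-aware term is the squared error between the target score and the expected score under the model's distribution over candidate score tokens, and that the two terms are simply added with a weight. Both choices, along with every name here, are assumptions for illustration; the abstract does not give the exact formula.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def combined_loss(cot_token_logprobs, score_logits, score_values, target_score, reg_weight=1.0):
    """CE loss on the CoT tokens plus a squared-error term on the expected score."""
    # Cross-entropy: mean negative log-likelihood of the reference CoT tokens.
    ce = -float(np.mean(cot_token_logprobs))
    # Expected score under the predicted distribution over score tokens.
    expected_score = float(softmax(np.asarray(score_logits, dtype=float)) @ np.asarray(score_values, dtype=float))
    regression = (expected_score - target_score) ** 2
    return ce + reg_weight * regression

scores = [1.0, 2.0, 3.0, 4.0, 5.0]
uniform = [0.0, 0.0, 0.0, 0.0, 0.0]
loss_on_target = combined_loss([-1.0, -1.0], uniform, scores, target_score=3.0)
loss_off_target = combined_loss([-1.0, -1.0], uniform, scores, target_score=5.0)
```

Unlike plain CE on the score token, the regression term penalizes a prediction of 4 less than a prediction of 1 when the target is 5, which is the numeric structure the abstract says CE neglects.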



Article 898

Title@2025-06-06 (5): Diving into Self-Evolving Training for Multimodal Reasoning

Title: Diving into Self-Evolving Training for Multimodal Reasoning Eintauchen in das Selbst-Evolving Training für multimodale Vernunft 跳入多模式理由自我发展培训 2412.17451v3

Authors (6): Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He

Self-evolving training, where models iteratively learn from their own outputs, has emerged as a key approach for complex reasoning tasks, addressing the scarcity of high-quality chain-of-thought data. However, its effectiveness in multimodal reasoning, a domain more intricate than text-only reasoning, remains underexplored, and the understanding of critical factors in this training paradigm remains limited. Furthermore, a central challenge for this training method is performance saturation, which impedes further improvements and scalability. Inspired by reinforcement learning (RL), in this paper, we reframe self-evolving training for multimodal reasoning through the lens of RL, identifying three pivotal factors: Training Method, Reward Model, and Prompt Variation. Through systematic analysis, we establish relatively optimal design principles that significantly enhance multimodal reasoning capabilities. Moreover, delving deeper into training dynamics, we uncover the roots of saturation and propose a new automatic balancing mechanism to mitigate this limitation. Building on these insights, we propose M-STAR (Multimodal Self-evolving Training for Reasoning), a framework that achieves consistent performance gains across models of varying sizes and diverse benchmarks. All resources are made publicly available at https://mstar-lmm.github.io.



Article 899

Title@2025-06-06 (5): GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression

Title: GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression GRASP: Redundante Ebenen durch adaptive Singular-Parameter für effiziente Modellkompression ersetzen GRASP: 用高效模型压缩的适应性单质参数取代冗余层 2501.00339v3

Authors (6): Kainan Liu, Yong Zhang, Ning Cheng, Zhitao Li, Shaojun Wang, Jing Xiao

Recent studies have demonstrated that many layers are functionally redundant in large language models (LLMs), enabling model compression by removing these layers to reduce inference cost. While such approaches can improve efficiency, indiscriminate layer pruning often results in significant performance degradation. In this paper, we propose GRASP (Gradient-based Retention of Adaptive Singular Parameters), a novel compression framework that mitigates this issue by preserving sensitivity-aware singular values. Unlike direct layer pruning, GRASP leverages gradient-based attribution on a small calibration dataset to adaptively identify and retain critical singular components. By replacing redundant layers with only a minimal set of parameters, GRASP achieves efficient compression while maintaining strong performance with minimal overhead. Experiments across multiple LLMs show that GRASP consistently outperforms existing compression methods, achieving 90% of the original model’s performance under a 20% compression ratio.
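
The selection of singular components can be sketched as follows. The scoring rule is an assumption: each singular value is scored by a first-order estimate |sigma_i * dL/dsigma_i|, with dL/dsigma_i = u_i^T G v_i for a weight gradient G obtained from calibration data. The abstract only states that gradient-based attribution is used, so this is one plausible instance rather than the paper's procedure, and the function name is hypothetical.

```python
import numpy as np

def select_singular_components(W, grad_W, k):
    """Keep the k singular components of W with the largest gradient-based score."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # dL/dsigma_i = u_i^T G v_i, computed for all i at once.
    grad_sigma = np.einsum("ji,jk,ik->i", U, grad_W, Vt)
    importance = np.abs(S * grad_sigma)
    keep = np.sort(np.argsort(importance)[::-1][:k])
    return U[:, keep], S[keep], Vt[keep, :]

# Toy case: the smallest singular value has the largest gradient, so it is retained.
W_demo = np.diag([3.0, 2.0, 1.0])
G_demo = np.diag([0.0, 1.0, 5.0])
U_k, S_k, Vt_k = select_singular_components(W_demo, G_demo, k=1)
W_low_rank = U_k @ np.diag(S_k) @ Vt_k
```

The toy case shows how this differs from magnitude-based truncation, which would have kept the component with singular value 3 instead.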



Article 900

Title@2025-06-06 (5): Elementary Math Word Problem Generation using Large Language Models

Title: Elementary Math Word Problem Generation using Large Language Models Grundlegende Math Word-Problem-Erstellung mit großen Sprachmodellen 使用大语言模式制造问题 2506.05950v1

Authors (12): Nimesh Ariyarathne, Harshani Bandara, Yasith Heshan, Omega Gamage, Surangika Ranathunga, Dilan Nayanajith, Yutharsan Sivapalan, Gayathri Lihinikaduarachchi, Tharoosha Vihidun, Meenambika Chandirakumar, Sanujen Premakumar, Sanjula Gathsara

Mathematics is often perceived as a complex subject by students, leading to high failure rates in exams. To improve Mathematics skills, it is important to provide sample questions for students to practice problem-solving. Manually creating Math Word Problems (MWPs) is time-consuming for tutors, because they have to type in natural language while adhering to the grammar and spelling rules of the language. Existing Deep Learning techniques for MWP generation require a tutor to provide the initial portion of the MWP and/or additional information such as an equation. In this paper, we present an MWP generation system based on Large Language Models (LLMs) that overcomes the need for additional input: the only inputs to our system are the number of MWPs needed, the grade, and the type of question (e.g. addition, subtraction). Unlike the existing LLM-based solutions for MWP generation, we carried out an extensive set of experiments involving different LLMs, prompting strategies, techniques to improve the diversity of questions, as well as techniques that employ human feedback to improve LLM performance. Human and automated evaluations confirmed that the generated MWPs are high in quality, with minimal spelling and grammar issues. However, LLMs still struggle to generate questions that adhere to the specified grade and question type requirements.



Article 901

Title@2025-06-06 (5): NameTag 3: A Tool and a Service for Multilingual/Multitagset NER

Title: NameTag 3: A Tool and a Service for Multilingual/Multitagset NER NameTag 3: Ein Werkzeug und ein Service für Multilingual/Multitagset NER 名称Tag 3: 多语种/多语种NER的工具和服务 2506.05949v1

Authors (2): Jana Straková, Milan Straka

We introduce NameTag 3, an open-source tool and cloud-based web service for multilingual, multidataset, and multitagset named entity recognition (NER), supporting both flat and nested entities. NameTag 3 achieves state-of-the-art results on 21 test datasets in 15 languages and remains competitive on the rest, even against larger models. It is available as a command-line tool and as a cloud-based service, enabling use without local installation. NameTag 3 web service currently provides flat NER for 17 languages, trained on 21 corpora and three NE tagsets, all powered by a single 355M-parameter fine-tuned model; and nested NER for Czech, powered by a 126M fine-tuned model. The source code is licensed under open-source MPL 2.0, while the models are distributed under non-commercial CC BY-NC-SA 4.0. Documentation is available at https://ufal.mff.cuni.cz/nametag, source code at https://github.com/ufal/nametag3, and trained models via https://lindat.cz. The REST service and the web application can be found at https://lindat.mff.cuni.cz/services/nametag/. A demonstration video is available at https://www.youtube.com/watch?v=-gaGnP0IV8A.



Article 902

Title@2025-06-06 (5): IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems

Title: IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems IntentionESC: Ein intentionszentrierter Rahmen zur Verbesserung der emotionalen Unterstützung in Dialogsystemen 意图ESC:加强对话系统中情感支持的以意图为中心的框架 2506.05947v1

Authors (3): Xinjie Zhang, Wenxuan Wang, Qin Jin

In emotional support conversations, unclear intentions can lead supporters to employ inappropriate strategies, inadvertently imposing their expectations or solutions on the seeker. Clearly defined intentions are essential for guiding both the supporter’s motivations and the overall emotional support process. In this paper, we propose the Intention-centered Emotional Support Conversation (IntentionESC) framework, which defines the possible intentions of supporters in emotional support conversations, identifies key emotional state aspects for inferring these intentions, and maps them to appropriate support strategies. While Large Language Models (LLMs) excel at text generation, they fundamentally operate as probabilistic models trained on extensive datasets, lacking a true understanding of human thought processes and intentions. To address this limitation, we introduce the Intention Centric Chain-of-Thought (ICECoT) mechanism. ICECoT enables LLMs to mimic human reasoning by analyzing emotional states, inferring intentions, and selecting suitable support strategies, thereby generating more effective emotional support responses. To train the model with ICECoT and integrate expert knowledge, we design an automated annotation pipeline that produces high-quality training data. Furthermore, we develop a comprehensive evaluation scheme to assess emotional support efficacy and conduct extensive experiments to validate our framework. Our data and code are available at https://github.com/43zxj/IntentionESC_ICECoT.



Article 903

Title@2025-06-06 (5): AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

Title: AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML AutoML-Agent: Ein Multi-Agent-LLM-Framework für Full-Pipeline-AutoML 自动MAL- Agency: 全Pipeline 自动MLM 多边代理LLM 框架 2410.02958v2

Authors (3): Patara Trirat, Wonyong Jeong, Sung Ju Hwang

Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline, such as optimal model search and hyperparameter tuning. Existing AutoML systems often require technical expertise to set up complex tools, which is in general time-consuming and requires a large amount of human effort. Therefore, recent works have started exploiting large language models (LLMs) to lessen this burden and increase the usability of AutoML frameworks via a natural language interface, allowing non-expert users to build their data-driven solutions. These methods, however, are usually designed only for a particular process in the AI development pipeline and do not efficiently use the inherent capacity of the LLMs. This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML, i.e., from data retrieval to model deployment. AutoML-Agent takes users’ task descriptions, facilitates collaboration between specialized LLM agents, and delivers deployment-ready models. Unlike existing work, instead of devising a single plan, we introduce a retrieval-augmented planning strategy to enhance exploration to search for more optimal plans. We also decompose each plan into sub-tasks (e.g., data preprocessing and neural network design), each of which is solved by a specialized agent that we build via prompting and that executes in parallel, making the search process more efficient. Moreover, we propose a multi-stage verification to verify executed results and guide the code generation LLM in implementing successful solutions. Extensive experiments on seven downstream tasks using fourteen datasets show that AutoML-Agent achieves a higher success rate in automating the full AutoML process, yielding systems with good performance throughout the diverse domains.



Article 904

Title@2025-06-06 (5): DynamicMind: A Tri-Mode Thinking System for Large Language Models

Title: DynamicMind: A Tri-Mode Thinking System for Large Language Models DynamicMind: Ein Tri-Mode-Denksystem für große Sprachmodelle 动态Mind:大语言模型三模式思维系统 2506.05936v1

Authors (7): Wei Li, Yanbin Wei, Qiushi Huang, Jiangyue Yan, Yang Chen, James T. Kwok, Yu Zhang

Modern large language models (LLMs) often struggle to dynamically adapt their reasoning depth to varying task complexities, leading to suboptimal performance or inefficient resource utilization. To address this, we introduce DynamicMind, a novel tri-mode thinking system. DynamicMind empowers LLMs to autonomously select between Fast, Normal, and Slow thinking modes for zero-shot question answering (ZSQA) tasks through cognitive-inspired prompt engineering. Our framework’s core innovations include: (1) expanding the established dual-process framework of fast and slow thinking into a tri-mode thinking system involving a normal thinking mode to preserve the intrinsic capabilities of LLM; (2) proposing the Thinking Density metric, which aligns computational resource allocation with problem complexity; and (3) developing the Thinking Mode Capacity (TMC) dataset and a lightweight Mind Router to predict the optimal thinking mode. Extensive experiments across diverse mathematical, commonsense, and scientific QA benchmarks demonstrate that DynamicMind achieves superior ZSQA capabilities while establishing an effective trade-off between performance and computational efficiency.



Article 905

Title@2025-06-06 (5): CAT-LLM: Style-enhanced Large Language Models with Text Style Definition for Chinese Article-style Transfer

Title: CAT-LLM: Style-enhanced Large Language Models with Text Style Definition for Chinese Article-style Transfer CAT-LLM: Stilverstärkte große Sprachmodelle mit Textstil-Definition für den chinesischen Artikel-Stil-Transfer CAT-LLM: 具有中文本风格定义的中文本文字风格转让强化型大语言模式 2401.05707v2

Authors (5): Zhen Tao, Dinghao Xi, Zhiyu Li, Liumin Tang, Wei Xu

Text style transfer plays a vital role in online entertainment and social media. However, existing models struggle to handle the complexity of Chinese long texts, such as rhetoric, structure, and culture, which restricts their broader application. To bridge this gap, we propose a Chinese Article-style Transfer (CAT-LLM) framework, which addresses the challenges of style transfer in complex Chinese long texts. At its core, CAT-LLM features a bespoke pluggable Text Style Definition (TSD) module that integrates machine learning algorithms to analyze and model article styles at both word and sentence levels. This module acts as a bridge, enabling LLMs to better understand and adapt to the complexities of Chinese article styles. Furthermore, it supports the dynamic expansion of internal style trees, enabling the framework to seamlessly incorporate new and diverse style definitions, enhancing adaptability and scalability for future research and applications. Additionally, to facilitate robust evaluation, we created ten parallel datasets using a combination of ChatGPT and various Chinese texts, each corresponding to distinct writing styles, significantly improving the accuracy of the model evaluation and establishing a novel paradigm for text style transfer research. Extensive experimental results demonstrate that CAT-LLM, combined with GPT-3.5-Turbo, achieves state-of-the-art performance, with a transfer accuracy F1 score of 79.36% and a content preservation F1 score of 96.47% on the “Fortress Besieged” dataset. These results highlight CAT-LLM’s innovative contributions to style transfer research, including its ability to preserve content integrity while achieving precise and flexible style transfer across diverse Chinese text domains. Building on these contributions, CAT-LLM presents significant potential for advancing Chinese digital media and facilitating automated content creation.



Article 906

Title@2025-06-06 (5): MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models

Title: MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models MoA: Heterogene Mischung von Adaptern für ein parametereffizientes Feintuning von großen Sprachmodellen MoA:大型语言模型参数有效精美设计应用适应器的异变混合体 2506.05928v1

Authors (9): Jie Cao, Tianwei Lin, Hongyang He, Rolan Yan, Wenqiao Zhang, Juncheng Li, Dongping Zhang, Siliang Tang, Yueting Zhuang

Recent studies integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) to further enhance the performance of parameter-efficient fine-tuning (PEFT) methods in Large Language Model (LLM) applications. Existing methods employ homogeneous MoE-LoRA architectures composed of LoRA experts with either similar or identical structures and capacities. However, these approaches often suffer from representation collapse and expert load imbalance, which negatively impact the potential of LLMs. To address these challenges, we propose a heterogeneous Mixture-of-Adapters (MoA) approach. This method dynamically integrates PEFT adapter experts with diverse structures, leveraging their complementary representational capabilities to foster expert specialization, thereby enhancing the effective transfer of pre-trained knowledge to downstream tasks. MoA supports two variants: (i) Soft MoA achieves fine-grained integration by performing a weighted fusion of all expert outputs; (ii) Sparse MoA activates adapter experts sparsely based on their contribution, achieving this with negligible performance degradation. Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency. Our project is available at https://github.com/DCDmllm/MoA.
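
The two variants can be contrasted with a small sketch. Soft fusion below is a softmax-weighted sum over all expert outputs; sparse activation is modeled by dropping experts whose router weight falls below a threshold and renormalizing the rest. The threshold rule, the renormalization, and all names are assumptions made for illustration, since the abstract only says that experts are activated sparsely based on their contribution.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def soft_fusion(expert_outputs, router_logits):
    """Weighted sum of all expert outputs; expert_outputs has shape (n_experts, d)."""
    return softmax(router_logits) @ expert_outputs

def sparse_fusion(expert_outputs, router_logits, threshold=0.2):
    """Use only experts whose router weight reaches the threshold (hypothetical rule)."""
    weights = softmax(router_logits)
    active = weights >= threshold
    if not active.any():
        # Always keep at least the highest-weighted expert.
        active[np.argmax(weights)] = True
    kept = np.where(active, weights, 0.0)
    kept = kept / kept.sum()
    return kept @ expert_outputs, active

outputs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
logits = np.log(np.array([0.5, 0.4, 0.1]))
soft_out = soft_fusion(outputs, logits)
sparse_out, active_mask = sparse_fusion(outputs, logits)
```

In a real sparse variant the inactive experts would not be evaluated at all, which is where the compute saving comes from; this sketch evaluates every expert for simplicity.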



Article 907

Title@2025-06-06 (5): LengClaro2023: A Dataset of Administrative Texts in Spanish with Plain Language adaptations

Title: LengClaro2023: A Dataset of Administrative Texts in Spanish with Plain Language adaptations LengClaro2023: Ein Datensatz von Verwaltungstexten auf Spanisch mit plain language adaptations 冷克拉罗2023:西班牙文行政文本数据集,有普通语言适应 2506.05927v1

Authors (2): Belén Agüera-Marco, Itziar Gonzalez-Dios

In this work, we present LengClaro2023, a dataset of legal-administrative texts in Spanish. Based on the most frequently used procedures from the Spanish Social Security website, we have created for each text two simplified equivalents. The first version follows the recommendations provided by arText claro. The second version incorporates additional recommendations from plain language guidelines to explore further potential improvements in the system. The linguistic resource created in this work can be used for evaluating automatic text simplification (ATS) systems in Spanish.



Article 908

Title@2025-06-06 (5): Generating Grounded Responses to Counter Misinformation via Learning Efficient Fine-Grained Critiques

Title: Generating Grounded Responses to Counter Misinformation via Learning Efficient Fine-Grained Critiques Generierung geerdeter Reaktionen zur Bekämpfung von Fehlinformationen durch Learning Efficient Fine-Grained Critiques 通过 “ 学习高效率的先进经验 “ ,对错误信息采取有底基的对策 2506.05924v1

Authors (3): Xiaofei Xu, Xiuzhen Zhang, Ke Deng

Fake news and misinformation pose a significant threat to society, making efficient mitigation essential. However, manual fact-checking is costly and lacks scalability. Large Language Models (LLMs) offer promise in automating counter-response generation to mitigate misinformation, but a critical challenge lies in their tendency to hallucinate non-factual information. Existing models mainly rely on LLM self-feedback to reduce hallucination, but this approach is computationally expensive. In this paper, we propose MisMitiFact, Misinformation Mitigation grounded in Facts, an efficient framework for generating fact-grounded counter-responses at scale. MisMitiFact generates simple critique feedback to refine LLM outputs, ensuring responses are grounded in evidence. We develop lightweight, fine-grained critique models trained on data sourced from readily available fact-checking sites to identify and correct errors in key elements such as numerals, entities, and topics in LLM generations. Experiments show that MisMitiFact generates counter-responses of comparable quality to LLMs’ self-feedback while using significantly smaller critique models. Importantly, it achieves a ~5x increase in feedback generation throughput, making it highly suitable for cost-effective, large-scale misinformation mitigation. Code and LLM prompt templates are at https://github.com/xxfwin/MisMitiFact.



Article 909

Title@2025-06-06 (5): Proactive Assistant Dialogue Generation from Streaming Egocentric Videos

Title: Proactive Assistant Dialogue Generation from Streaming Egocentric Videos Proaktive Assistant Dialogue Generation von Streaming Egozentrischen Videos 主动的助理对话助理,从流动的以地球为中心的视频中生成 2506.05904v1

Authors (8): Yichi Zhang, Xin Luna Dong, Zhaojiang Lin, Andrea Madotto, Anuj Kumar, Babak Damavandi, Joyce Chai, Seungwhan Moon

Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in \dataset, a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. Project page: https://pro-assist.github.io/



Article 910

Title@2025-06-06 (5): CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Title: CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness CAPability: Ein umfassender Visual Caption Benchmark zur Bewertung von Korrektheit und Gründlichkeit 能力:评估正确性和彻底性的综合视觉能力基准 2502.14914v3

Authors (12): Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, Hongtao Xie

Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and incomplete visual element coverage. In this paper, we introduce CAPability, a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions. CAPability stably assesses both the correctness and thoroughness of captions with precision and hit metrics. By converting annotations to QA pairs, we further introduce a heuristic metric, “know but cannot tell” ($K\bar{T}$), indicating a significant performance gap between QA and caption capabilities. Our work provides a holistic analysis of MLLMs’ captioning abilities, as we identify their strengths and weaknesses across various dimensions, guiding future research to enhance specific aspects of their capabilities.


Article 911

Title@2025-06-06 (5): Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router

Title: Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router Route-and-Reason: Scaling Large Language Model Reasoning mit verstärktem Modellrouter 路由和反应:使用强化示范路由器扩大大语言模型 2506.05901v1

Authors (5): Chenyang Shao, Xinyang Liu, Yutang Lin, Fengli Xu, Yong Li

Multi-step reasoning has proven essential for enhancing the problem-solving capabilities of Large Language Models (LLMs) by decomposing complex tasks into intermediate steps, either explicitly or implicitly. Extending the reasoning chain at test time, through deeper thought processes or broader exploration, can further improve performance, but often incurs substantial costs due to the explosion in token usage. Yet many reasoning steps are relatively simple and can be handled by more efficient smaller-scale language models (SLMs). This motivates hybrid approaches that allocate subtasks across models of varying capacities. However, realizing such collaboration requires accurate task decomposition and difficulty-aware subtask allocation, which is challenging. To address this, we propose R2-Reasoner, a novel framework that enables collaborative reasoning across heterogeneous LLMs by dynamically routing subtasks based on estimated complexity. At the core of our framework is a Reinforced Model Router, composed of a task decomposer and a subtask allocator. The task decomposer segments complex input queries into logically ordered subtasks, while the subtask allocator assigns each subtask to the most appropriate model, ranging from lightweight SLMs to powerful LLMs, balancing accuracy and efficiency. To train this router, we introduce a staged pipeline that combines supervised fine-tuning on task-specific datasets with the Group Relative Policy Optimization (GRPO) algorithm, enabling self-supervised refinement through iterative reinforcement learning. Extensive experiments across four challenging benchmarks demonstrate that R2-Reasoner reduces API costs by 86.85% while maintaining or surpassing baseline accuracy. Our framework paves the way for more cost-effective and adaptive LLM reasoning. The code is open-source at https://anonymous.4open.science/r/R2_Reasoner .
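
The allocation step described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in: the `Subtask` type, the hand-written difficulty scores, the tier names, and the thresholds are assumptions for illustration only. The abstract describes a router trained with supervised fine-tuning and GRPO, not fixed thresholds.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    text: str
    difficulty: float  # assumed score in [0, 1]; a trained router would estimate this


# Hypothetical model tiers, cheapest first: (name, highest difficulty it is trusted with).
TIERS = [("small-slm", 0.3), ("medium-llm", 0.7), ("large-llm", 1.0)]


def allocate(subtasks):
    """Assign each subtask to the cheapest tier whose ceiling covers its difficulty."""
    plan = []
    for sub in subtasks:
        for name, ceiling in TIERS:
            if sub.difficulty <= ceiling:
                plan.append((sub.text, name))
                break
        else:
            # Difficulty above every ceiling: fall back to the strongest tier.
            plan.append((sub.text, TIERS[-1][0]))
    return plan


subtasks = [
    Subtask("extract the numbers from the question", 0.1),
    Subtask("set up the system of equations", 0.5),
    Subtask("prove the solution is unique", 0.9),
]
plan = allocate(subtasks)
for text, model in plan:
    print(f"{model:>10}  <-  {text}")
```

A fixed threshold table like this is the simplest possible allocator; it only shows where cost savings would come from when easy subtasks avoid the largest model.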


Article 912

Title@2025-06-06 (5): MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

Title: MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models MapEval: Eine kartenbasierte Bewertung von Geo-Spatial Reasoning in Fundamentalmodellen MapEval:基础模型中地理空间理由的地图评价 2501.00316v2

Authors (8): Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez

Recent advancements in foundation models have improved autonomous tool usage and reasoning, but their capabilities in map-based reasoning remain underexplored. To address this, we introduce MapEval, a benchmark designed to assess foundation models across three distinct tasks - textual, API-based, and visual reasoning - through 700 multiple-choice questions spanning 180 cities and 54 countries, covering spatial relationships, navigation, travel planning, and real-world map interactions. Unlike prior benchmarks that focus on simple location queries, MapEval requires models to handle long-context reasoning, API interactions, and visual map analysis, making it the most comprehensive evaluation framework for geospatial AI. On evaluation of 30 foundation models, including Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro, none surpass 67% accuracy, with open-source models performing significantly worse and all models lagging over 20% behind human performance. These results expose critical gaps in spatial inference, as models struggle with distances, directions, route planning, and place-specific reasoning, highlighting the need for better geospatial AI to bridge the gap between foundation models and real-world navigation. All the resources are available at: https://mapeval.github.io/.


Article 913

Title@2025-06-06 (5): PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Title: PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning PoisonBench: Bewertung der Vulnerabilität eines großen Sprachmodells für Datenvergiftung 毒区:评估大语言模型对数据中毒的脆弱性 2410.08811v2

Authors (6): Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, Fazl Barez

Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models’ susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.


Article 914

Title@2025-06-06 (5): Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models

Title: Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models Infi-MMR: Curriculumbasiertes Entsperren multimodaler Vernunft durch schrittweises Verstärktes Lernen in multimodalen Small Language-Modellen Infi-MMR:通过在多模式小型语言模式中分阶段强化学习,以课程为基础解锁多模式原因 2505.23091v2

Authors (12): Zeyu Liu, Yuhang Liu, Guanghao Zhu, Congkai Xie, Zhen Li, Jianbo Yuan, Xinyao Wang, Qing Li, Shing-Chi Cheung, Shengyu Zhang, Fei Wu, Hongxia Yang

Recent advancements in large language models (LLMs) have demonstrated substantial progress in reasoning capabilities, such as DeepSeek-R1, which leverages rule-based reinforcement learning to enhance logical reasoning significantly. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework Infi-MMR to systematically unlock the reasoning potential of MSLMs through a curriculum of three carefully structured phases and propose our multimodal reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model’s logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini). Resources are available at https://huggingface.co/Reallm-Labs/Infi-MMR-3B.


Article 915

Title@2025-06-06 (5): ECoRAG: Evidentiality-guided Compression for Long Context RAG

Title: ECoRAG: Evidentiality-guided Compression for Long Context RAG ECoRAG: Evidentialitätsgeführte Kompression für lange RAG-Kontext ECORAG: 长期RAG的证据制导压缩 2506.05167v2

Authors (4): Yeonseok Jeong, Jinsu Kim, Dohyeon Lee, Seung-won Hwang

Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce the RAG overhead incurred by longer contexts, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limits performance in LLM-based RAG. We thus propose the Evidentiality-guided RAG framework, ECoRAG. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, ensuring that answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects on whether the compressed content provides sufficient evidence and, if not, retrieves more until it is sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the information necessary to generate the correct answer. Code is available at https://github.com/ldilab/ECoRAG.
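
The compress-then-reflect loop described above can be sketched as follows. This is an illustration under stated assumptions, not the ECoRAG code: `compress_until_sufficient`, the toy corpus, the keyword-overlap compressor, and the string-match sufficiency check are all hypothetical stand-ins for the learned components.

```python
def compress_until_sufficient(question, retrieve, compress, is_sufficient,
                              step=5, max_docs=20):
    """Retrieve in increments, compress, and stop once the evidence is judged sufficient."""
    k = step
    while True:
        docs = retrieve(question, k)
        context = compress(question, docs)
        # Stop when evidence suffices, the budget is spent, or the corpus is exhausted.
        if is_sufficient(question, context) or k >= max_docs or len(docs) < k:
            return context, k
        k += step


# Toy stand-ins (assumptions) so the sketch runs end to end.
CORPUS = [
    "Paris is the capital of France.",
    "The Louvre is a museum in Paris.",
    "The Eiffel Tower was completed in 1889.",
    "France borders Spain and Italy.",
]


def words(text):
    return {w.strip("?.,").lower() for w in text.split()}


def toy_retrieve(question, k):
    return CORPUS[:k]


def toy_compress(question, docs):
    # Keep only documents sharing a longer keyword with the question.
    keywords = {w for w in words(question) if len(w) > 3}
    return [d for d in docs if keywords & words(d)]


def toy_is_sufficient(question, context):
    return any("1889" in d for d in context)


context, used = compress_until_sufficient(
    "When was the Eiffel Tower completed?",
    toy_retrieve, toy_compress, toy_is_sufficient, step=2, max_docs=4,
)
```

In this toy run the first two documents compress to nothing, so the loop widens retrieval to four documents and keeps only the one sentence that carries the evidence.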


Article 916

Title@2025-06-06 (5): FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging

Title: FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging FinanceReasoning: Benchmarking der finanziellen numerischen Reasoning glaubwürdiger, umfassender und anspruchsvoller 财务责任:确定金融数字理由基准,使其更加可信、全面和具有挑战性 2506.05828v1

Authors (13): Zichen Tang, Haihong E, Ziyan Ma, Haoyang He, Jiacheng Liu, Zhongjun Yang, Zihua Rong, Rongjin Li, Kun Ji, Qing Huang, Xinyang Hu, Yang Liu, Qianhe Zheng

We introduce FinanceReasoning, a novel benchmark designed to evaluate the reasoning capabilities of large reasoning models (LRMs) in financial numerical reasoning problems. Compared to existing benchmarks, our work provides three key advancements. (1) Credibility: We update 15.6% of the questions from four public datasets, annotating 908 new questions with detailed Python solutions and rigorously refining evaluation standards. This enables an accurate assessment of the reasoning improvements of LRMs. (2) Comprehensiveness: FinanceReasoning covers 67.8% of financial concepts and formulas, significantly surpassing existing datasets. Additionally, we construct 3,133 Python-formatted functions, which enhances LRMs’ financial reasoning capabilities through refined knowledge (e.g., 83.2% $\rightarrow$ 91.6% for GPT-4o). (3) Challenge: Models are required to apply multiple financial formulas for precise numerical reasoning on 238 Hard problems. The best-performing model (i.e., OpenAI o1 with PoT) achieves 89.1% accuracy, yet LRMs still face challenges in numerical precision. We demonstrate that combining Reasoner and Programmer models can effectively enhance LRMs’ performance (e.g., 83.2% $\rightarrow$ 87.8% for DeepSeek-R1). Our work paves the way for future research on evaluating and improving LRMs in domain-specific complex reasoning tasks.


Article 917

Title@2025-06-06 (5): Training Software Engineering Agents and Verifiers with SWE-Gym

Title: Training Software Engineering Agents and Verifiers with SWE-Gym Schulung von Software Engineering Agents und Prüfern mit SWE-Gym SWE-Gym培训软件工程代理和验证人 2412.21139v2

Authors (7): Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang

We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language model based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.


Article 918

Title@2025-06-06 (5): Taming Knowledge Conflicts in Language Models

Title: Taming Knowledge Conflicts in Language Models Zähmen von Wissenskonflikten in Sprachmodellen 在语言模式中讨论知识冲突 2503.10996v2

Authors (3): Gaotang Li, Yuzhong Chen, Hanghang Tong

Language Models (LMs) often encounter knowledge conflicts when parametric memory contradicts contextual knowledge. Previous works attribute this conflict to the interplay between “memory heads” and “context heads”, attention heads assumed to promote either memory or context exclusively. In this study, we go beyond this fundamental assumption by uncovering a critical phenomenon we term the superposition of contextual information and parametric memory, where highly influential attention heads simultaneously contribute to both memory and context. Building upon this insight, we propose Just Run Twice (JuICE), a test-time attention intervention method that steers LMs toward either parametric beliefs or contextual knowledge without requiring fine-tuning. JuICE identifies a set of reliable attention heads and leverages a dual-run approach to mitigate the superposition effects. Extensive experiments across 11 datasets and 6 model architectures demonstrate that JuICE sets the new state-of-the-art performance and robust generalization, achieving significant and consistent improvement across different domains under various conflict types. Finally, we theoretically analyze knowledge conflict and the superposition of contextual information and parametric memory in attention heads, which further elucidates the effectiveness of JuICE in these settings. Our code is available at https://github.com/GaotangLi/JUICE.


Article 919

Title@2025-06-06 (5): Not All Jokes Land: Evaluating Large Language Models Understanding of Workplace Humor

Title: Not All Jokes Land: Evaluating Large Language Models Understanding of Workplace Humor Nicht alle Witze Land: Bewertung großer Sprachmodelle Verständnis des Arbeitsplatzes Humor 并非所有的笑话地:评价工作场所幽默性大语言模型理解 2506.01819v2

Authors (2): Mohammadamin Shafiei, Hamidreza Saffari

With the recent advances in Artificial Intelligence (AI) and Large Language Models (LLMs), the automation of daily tasks, like automatic writing, is getting more and more attention. Hence, efforts have focused on aligning LLMs with human values, yet humor, particularly professional industrial humor used in workplaces, has been largely neglected. To address this, we develop a dataset of professional humor statements along with features that determine the appropriateness of each statement. Our evaluation of five LLMs shows that LLMs often struggle to judge the appropriateness of humor accurately.


Article 920

Title@2025-06-06 (5): CodeContests+: High-Quality Test Case Generation for Competitive Programming

Title: CodeContests+: High-Quality Test Case Generation for Competitive Programming CodeContests+: Hochqualitative Testfall-Generation für wettbewerbsfähige Programmierung 标准测试+:为竞争性方案拟订编制高品质测试个案 2506.05817v1

Authors (5): Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, Kai Shen

Competitive programming, due to its high reasoning difficulty and precise correctness feedback, has become a key task for both training and evaluating the reasoning capabilities of large language models (LLMs). However, while a large amount of public problem data, such as problem statements and solutions, is available, the test cases of these problems are often difficult to obtain. Therefore, test case generation is a necessary task for building large-scale datasets, and the quality of the test cases directly determines the accuracy of the evaluation. In this paper, we introduce an LLM-based agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. We evaluate the quality of the test cases in CodeContests+. First, we use 1.72 million submissions with pass/fail labels to examine the accuracy of these test cases in evaluation. The results indicate that CodeContests+ achieves significantly higher accuracy than CodeContests, particularly with a notably higher True Positive Rate (TPR). Subsequently, our experiments in LLM Reinforcement Learning (RL) further confirm that improvements in test case quality yield considerable advantages for RL.
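
A minimal sketch of how test-suite accuracy could be measured from labeled submissions follows. The record layout and the convention that a "positive" is a truly correct submission are assumptions made for illustration; the abstract does not spell out the exact definitions.

```python
def suite_rates(records):
    """Compute (TPR, TNR) of a test suite.

    records: list of (truly_correct, passed_tests) booleans, one per submission.
    Assumed convention: a true positive is a correct submission the suite accepts.
    """
    tp = sum(1 for correct, passed in records if correct and passed)
    fn = sum(1 for correct, passed in records if correct and not passed)
    tn = sum(1 for correct, passed in records if not correct and not passed)
    fp = sum(1 for correct, passed in records if not correct and passed)
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return tpr, tnr


# Hypothetical labels: three correct submissions, two incorrect ones.
records = [
    (True, True),
    (True, True),
    (True, False),   # correct solution wrongly rejected (e.g. an invalid test input)
    (False, False),
    (False, True),   # wrong solution slips through (tests too weak)
]
tpr, tnr = suite_rates(records)
print(f"TPR={tpr:.3f}  TNR={tnr:.3f}")
```

Under this convention, a low TPR points to faulty test cases that reject valid solutions, while a low TNR points to tests that are too weak to catch wrong ones.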


Article 921

Title@2025-06-06 (5): MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning

Title: MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning MAPLE: Multi-Agent-Adaptive-Planung mit Langzeit-Speicher für Tabellen-Reasoning MAPLE: 具有长期内存的多机构适应规划,用于表格说明理由 2506.05813v1

Authors (3): Ye Bai, Minghan Wang, Thuy-Trang Vu

Table-based question answering requires complex reasoning capabilities that current LLMs struggle to achieve with single-pass inference. Existing approaches, such as Chain-of-Thought reasoning and question decomposition, lack error detection mechanisms and discard problem-solving experiences, contrasting sharply with how humans tackle such problems. In this paper, we propose MAPLE (Multi-agent Adaptive Planning with Long-term mEmory), a novel framework that mimics human problem-solving through specialized cognitive agents working in a feedback-driven loop. MAPLE integrates 4 key components: (1) a Solver using the ReAct paradigm for reasoning, (2) a Checker for answer verification, (3) a Reflector for error diagnosis and strategy correction, and (4) an Archiver managing long-term memory for experience reuse and evolution. Experiments on WiKiTQ and TabFact demonstrate significant improvements over existing methods, achieving state-of-the-art performance across multiple LLM backbones.
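
The four-role feedback loop described above can be sketched roughly as below. This is only an illustration: the function `feedback_loop`, the callables standing in for the Solver, Checker, and Reflector, and the dictionary used as the Archiver's memory are all hypothetical, and keying memory by the exact question is a simplification of whatever retrieval the paper uses.

```python
def feedback_loop(question, solve, check, reflect, memory, max_rounds=3):
    """Solve, verify, and reflect until the checker accepts or rounds run out."""
    hints = list(memory.get(question, []))  # reuse archived experience, if any
    answer = None
    for attempt in range(1, max_rounds + 1):
        answer = solve(question, hints)          # Solver
        ok, feedback = check(question, answer)   # Checker
        if ok:
            break
        hints.append(reflect(question, answer, feedback))  # Reflector
    memory[question] = hints                     # Archiver
    return answer, attempt


# Toy stand-ins (assumptions): a one-column table and a solver that first miscounts.
TABLE = [3, 5, 7]


def toy_solve(question, hints):
    return sum(TABLE) if "use sum, not count" in hints else len(TABLE)


def toy_check(question, answer):
    return answer == 15, "expected a total, got a row count"


def toy_reflect(question, answer, feedback):
    return "use sum, not count"


memory = {}
first = feedback_loop("What is the total?", toy_solve, toy_check, toy_reflect, memory)
second = feedback_loop("What is the total?", toy_solve, toy_check, toy_reflect, memory)
```

The second call succeeds on its first attempt because the archived hint is replayed, which is the kind of experience reuse the abstract attributes to long-term memory.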


Article 922

Title@2025-06-06 (5): E-LDA: Toward Interpretable LDA Topic Models with Strong Guarantees in Logarithmic Parallel Time

Title: E-LDA: Toward Interpretable LDA Topic Models with Strong Guarantees in Logarithmic Parallel Time E-LDA: Auf dem Weg zu interpretierbaren LDA-Themenmodellen mit starken Garantien in logarithmischer Parallelzeit E-LDA:争取在对数平行时间建立具有有力保证的可解释的LDA专题模型 2506.07747v1

Authors (1): Adam Breuer

In this paper, we provide the first practical algorithms with provable guarantees for the problem of inferring the topics assigned to each document in an LDA topic model. This is the primary inference problem for many applications of topic models in social science, data exploration, and causal inference settings. We obtain this result by showing a novel non-gradient-based, combinatorial approach to estimating topic models. This yields algorithms that converge to near-optimal posterior probability in logarithmic parallel computation time (adaptivity) – exponentially faster than any known LDA algorithm. We also show that our approach can provide interpretability guarantees such that each learned topic is formally associated with a known keyword. Finally, we show that unlike alternatives, our approach can maintain the independence assumptions necessary to use the learned topic model for downstream causal inference methods that allow researchers to study topics as treatments. In terms of practical performance, our approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA algorithms, neural topic models, and LLM-based topic models across a diverse range of text datasets and evaluation parameters.


Article 923

Title@2025-06-06 (5): DiMA: An LLM-Powered Ride-Hailing Assistant at DiDi

Title: DiMA: An LLM-Powered Ride-Hailing Assistant at DiDi DiMA: Ein LLM-Powered Ride-Hailing Assistant bei DiDi DiMA:Didi法学硕士派助教 2503.04768v2

Authors (7): Yansong Ning, Shuowei Cai, Wei Li, Jun Fang, Naiqiang Tan, Hua Chai, Hao Liu

On-demand ride-hailing services like DiDi, Uber, and Lyft have transformed urban transportation, offering unmatched convenience and flexibility. In this paper, we introduce DiMA, an LLM-powered ride-hailing assistant deployed in DiDi Chuxing. Its goal is to provide seamless ride-hailing services and beyond through a natural and efficient conversational interface under dynamic and complex spatiotemporal urban contexts. To achieve this, we propose a spatiotemporal-aware order planning module that leverages external tools for precise spatiotemporal reasoning and progressive order planning. Additionally, we develop a cost-effective dialogue system that integrates multi-type dialog repliers with cost-aware LLM configurations to handle diverse conversation goals and trade off response quality against latency. Furthermore, we introduce a continual fine-tuning scheme that utilizes real-world interactions and simulated dialogues to align the assistant’s behavior with human-preferred decision-making processes. Since its deployment in the DiDi application, DiMA has demonstrated exceptional performance, achieving 93% accuracy in order planning and 92% in response generation during real-world interactions. Offline experiments further validate DiMA’s capabilities, showing improvements of up to 70.23% in order planning and 321.27% in response generation compared to three state-of-the-art agent frameworks, while reducing latency by $0.72\times$ to $5.47\times$. These results establish DiMA as an effective, efficient, and intelligent mobile assistant for ride-hailing services.


Article 924

Title@2025-06-06 (5): Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights

Title: Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights Unbeabsichtigte Schäden an wertorientierten LLMs: Psychologische und empirische Erkenntnisse 价值观一致的LMMs:心理和实证透视的意外伤害 2506.06404v1

Authors (6): Sooyung Choi, Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Xing Xie, JinYeong Bak

The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this paper, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior compared to non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely generate text according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find significant correlations between value alignment and safety risks, supported by psychological hypotheses. This study offers insights into the “black box” of value alignment and proposes in-context alignment methods to enhance the safety of value-aligned LLMs.


Article 925

Title@2025-06-06 (5): TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining

Title: TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining TiC-LM: Ein Web-Scale-Benchmark für zeitkontinuierliches LLM-Vortraining TIC-LM:培训前时间-持续性LLM网络基准 2504.02107v3

Authors (11): Jeffrey Li, Mohammadreza Armandpour, Iman Mirzadeh, Sachin Mehta, Vaishaal Shankar, Raviteja Vemulapalli, Samy Bengio, Oncel Tuzel, Mehrdad Farajtabar, Hadi Pouransari, Fartash Faghri

Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
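
Fixed-ratio replay, as mentioned above, can be illustrated with a small batch-mixing sketch. The function name `mix_batch`, the 25% replay ratio, and the toy data are assumptions for illustration; the abstract does not state the ratio or the sampling procedure the authors use.

```python
import random


def mix_batch(new_data, old_data, batch_size, replay_ratio, rng):
    """Build one training batch with a fixed fraction of replayed older examples."""
    n_old = round(batch_size * replay_ratio)
    n_new = batch_size - n_old
    # Sample without replacement from each pool, then shuffle the combined batch.
    batch = rng.sample(old_data, n_old) + rng.sample(new_data, n_new)
    rng.shuffle(batch)
    return batch


# Hypothetical pools: documents tagged by the period they were crawled in.
old_pool = [("old", i) for i in range(100)]
new_pool = [("new", i) for i in range(100)]

rng = random.Random(0)  # seeded for reproducibility
batch = mix_batch(new_pool, old_pool, batch_size=8, replay_ratio=0.25, rng=rng)
n_replayed = sum(1 for tag, _ in batch if tag == "old")
```

Raising `replay_ratio` trades adaptation to new data for retention of old data, which is the balance the abstract reports as differing between generic web data and specific domains.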


Article 926

Title@2025-06-06 (5): Lost in the Passage: Passage-level In-context Learning Does Not Necessarily Need a “Passage”

Title: Lost in the Passage: Passage-level In-context Learning Does Not Necessarily Need a “Passage” Lost in the Passage: Passage-Level In-Context Learning braucht nicht unbedingt eine “Passage” 遗失在通行证中:通过程度的内文学习不需要“通行证” 2502.10634v2

Authors (4): Hao Sun, Chenming Tang, Gengyang Li, Yunfang Wu

By simply incorporating demonstrations into the context, in-context learning (ICL) enables large language models (LLMs) to achieve strong performance on many tasks. In this study, we focus on passage-level long-context ICL for generation tasks and find that LLMs cannot learn the intrinsic relationship between the demonstration passage and the generation output. We conduct experiments with different LLMs on two typical generation tasks, single-document question answering and distractor generation, demonstrating that even a completely meaningless demonstration passage of 1/4 the length achieves much better performance than the original full passage. Analysis via attention and information flow reveals that LLMs pay little attention to passages compared to other components in the prompt, and that little information flows from the passage to other parts of the demonstration, which further confirms our finding. Additionally, experiments on context compression indicate that compression approaches proven effective on other long-context tasks are not suitable for passage-level ICL, since simply using shorter meaningless demonstration passages already achieves competitive performance.


Article 927

Title@2025-06-06 (5): LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification

Title: LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification LLM im Loop: Erstellen des ParaDeHate-Datensatzes für die Entgiftung von Hassreden 圈圈中的LLM:为仇恨言论解毒创建ParaDeHate数据集 2506.01484v2

Authors (7): Shuzhou Yuan, Ercong Nie, Lukas Kouba, Ashish Yashwanth Kangen, Helmut Schmid, Hinrich Schütze, Michael Färber

Detoxification, the task of rewriting harmful language into non-toxic text, has become increasingly important amid the growing prevalence of toxic content online. However, high-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation. In this paper, we propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification. We first replicate the ParaDetox pipeline by replacing human annotators with an LLM and show that the LLM performs comparably to human annotation. Building on this, we construct ParaDeHate, a large-scale parallel dataset specifically for hate speech detoxification. We release ParaDeHate as a benchmark of over 8K hate/non-hate text pairs and evaluate a wide range of baseline methods. Experimental results show that models such as BART, fine-tuned on ParaDeHate, achieve better performance in style accuracy, content preservation, and fluency, demonstrating the effectiveness of LLM-generated detoxification text as a scalable alternative to human annotation.


Article 928

Title@2025-06-06 (5): Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning

Title: Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning Benchmarking und Verbesserung großer Visions-Sprachenmodelle für grundlegendes visuelles Verständnis und Vernunft 制定基准和改进基本视觉图解理解和理由的大型视野-语言模型 2412.13540v3

Authors (6): Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks. Despite great success, recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs. To study the reason behind these limitations, we propose VGCure, a comprehensive benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs reveal that LVLMs are weak in basic graph understanding and reasoning tasks, particularly those concerning relational or structurally complex information. Based on this observation, we propose a structure-aware fine-tuning framework to enhance LVLMs with structure learning abilities through three self-supervised learning tasks. Experiments validate the effectiveness of our method in improving LVLMs’ performance on fundamental and downstream graph learning tasks, as well as enhancing their robustness against complex visual graphs.


Article 929

Title@2025-06-06 (5): Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective

Title: Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective Bilder sprechen lauter als Worte: Bias im Vision-Language-Modell aus einer ursächlichen Vermittlungsperspektive verstehen und abmildern 图像比单词更大声:从原因调解角度理解和减轻视觉语言模型中的偏见 2407.02814v3

Authors (4): Zhaotian Weng, Zijun Gao, Jerone Andrews, Jieyu Zhao

Vision-language models (VLMs) pre-trained on extensive datasets can inadvertently learn biases by correlating gender information with specific objects or scenarios. Current methods, which focus on modifying inputs and monitoring changes in the model’s output probability scores, often struggle to comprehensively understand bias from the perspective of model components. We propose a framework that incorporates causal mediation analysis to measure and map the pathways of bias generation and propagation within VLMs. This approach allows us to identify the direct effects of interventions on model bias and the indirect effects of interventions on bias mediated through different model components. Our results show that image features are the primary contributors to bias, with significantly higher impacts than text features, specifically accounting for 32.57% and 12.63% of the bias in the MSCOCO and PASCAL-SENTENCE datasets, respectively. Notably, the image encoder’s contribution surpasses that of the text encoder and the deep fusion encoder. Further experimentation confirms that contributions from both language and vision modalities are aligned and non-conflicting. Consequently, focusing on blurring gender representations within the image encoder, which contributes most to the model bias, reduces bias efficiently by 22.03% and 9.04% in the MSCOCO and PASCAL-SENTENCE datasets, respectively, with minimal performance loss or increased computational demands.


Article 930

Title@2025-06-06 (5): Discrete Minds in a Continuous World: Do Language Models Know Time Passes?

Title: Discrete Minds in a Continuous World: Do Language Models Know Time Passes? Diskrete Gedanken in einer kontinuierlichen Welt: Kennen Sprachmodelle Zeitüberschreitungen? 持续世界中的分化思想:语言模型是否了解时间? 2506.05790v1

Authors (5): Minghan Wang, Ye Bai, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari

While Large Language Models (LLMs) excel at temporal reasoning tasks like event ordering and duration estimation, their ability to perceive the actual passage of time remains unexplored. We investigate whether LLMs perceive the passage of time and adapt their decision-making accordingly through three complementary experiments. First, we introduce the Token-Time Hypothesis, positing that LLMs can map discrete token counts to continuous wall-clock time, and validate this through a dialogue duration judgment task. Second, we demonstrate that LLMs could use this awareness to adapt their response length while maintaining accuracy when users express urgency in question answering tasks. Finally, we develop BombRush, an interactive navigation challenge that examines how LLMs modify behavior under progressive time pressure in dynamic environments. Our findings indicate that LLMs possess certain awareness of time passage, enabling them to bridge discrete linguistic tokens and continuous physical time, though this capability varies with model size and reasoning abilities. This work establishes a theoretical foundation for enhancing temporal awareness in LLMs for time-sensitive applications.


Article 931

Title@2025-06-06 (5): IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis

Title: IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis IDA-Bench: Bewertung von LLMs zur interaktiven Datenanalyse IDA-Bench:评价交互式引导数据分析LML女士 2505.18223v2

Authors (7): Hanyu Li, Haoyu Liu, Tingyu Zhu, Tianyu Guo, Zeyu Zheng, Xiaotie Deng, Michael I. Jordan

Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts’ decisions evolve with deeper insights into the dataset. To address this, we introduce IDA-Bench, a novel benchmark evaluating LLM agents in multi-round interactive scenarios. Derived from complex Kaggle notebooks, tasks are presented as sequential natural language instructions by an LLM-simulated user. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Initial results show that even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on < 50% of the tasks, highlighting limitations not evident in single-turn tests. This work underscores the need to improve LLMs’ multi-round capabilities for building more reliable data analysis agents, highlighting the necessity of achieving a balance between instruction following and reasoning.


Article 932

Title@2025-06-06 (5): LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews

Title: LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews LGAR: Zero-Shot LLM-geführtes Neural Ranking für Abstract Screening in Systematic Literature Reviews LGAR: 系统文学评论中抽象筛选的零热热LLM-指导神经神经定级 2505.24757v2

Authors (3): Christian Jaumann, Andreas Wiedholz, Annemarie Friedrich

The scientific literature is growing rapidly, making it hard to keep track of the state of the art. Systematic literature reviews (SLRs) aim to identify and evaluate all relevant papers on a topic. After retrieving a set of candidate papers, the abstract screening phase determines initial relevance. To date, abstract screening methods using large language models (LLMs) focus on binary classification settings; existing question answering (QA) based ranking approaches suffer from error propagation. LLMs offer a unique opportunity to evaluate the SLR’s inclusion and exclusion criteria, yet existing benchmarks do not provide them exhaustively. We manually extract these criteria as well as research questions for 57 SLRs, mostly in the medical domain, enabling principled comparisons between approaches. Moreover, we propose LGAR, a zero-shot LLM Guided Abstract Ranker composed of an LLM-based graded relevance scorer and a dense re-ranker. Our extensive experiments show that LGAR outperforms existing QA-based methods by 5-10 pp. in mean average precision. Our code and data are publicly available.
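
The abstract reports gains in mean average precision (MAP). As a reminder of what that metric measures, here is a minimal sketch of the standard definition. The function names and the toy relevance lists are illustrative assumptions and are not taken from the LGAR code base.

```python
def average_precision(ranked_labels):
    """Average precision for one ranking.

    ranked_labels: relevance flags (1 = relevant, 0 = not relevant),
    ordered from the top-ranked abstract to the bottom-ranked one.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-review average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Toy example: two hypothetical reviews with hand-made rankings.
toy_rankings = [[1, 0, 1], [0, 1]]
toy_map = mean_average_precision(toy_rankings)
```

A ranker that places every relevant abstract ahead of every irrelevant one scores 1.0 under this definition.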


Article 933

Title@2025-06-06 (5): m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training

Title: m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training m-KAILIN:生物医学大型语言模型培训知识开发剂科学公司蒸馏框架 2504.19565v2

Authors (6): Meng Xiao, Xunxin Cai, Qingqing Long, Chengrui Wang, Yuanchun Zhou, Hengshu Zhu

Corpus distillation for biomedical large language models (LLMs) seeks to address the pressing challenge of insufficient quantity and quality in open-source annotated scientific corpora, which remains a bottleneck for effective LLM training in biomedical research. This paper proposes a knowledge-driven, agentic framework for scientific corpus distillation, tailored explicitly for LLM training in the biomedical domain, addressing the challenge posed by the complex hierarchy of biomedical knowledge. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. This agentic framework collectively generates and refines domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.


Article 934

Title@2025-06-06 (5): dots.llm1 Technical Report

Title: dots.llm1 Technical Report dots.llm1 Technischer Bericht dots.llm1 技术报告 2506.05767v1

Authors (27): Bi Huo, Bin Tu, Cheng Qin, Da Zheng, Debing Zhang, Dongjie Zhang, En Li, Fu Guo, Jian Yao, Jie Lou, Junfeng Tian, Li Hu, Ran Zhu, Shengdong Chen, Shuo Liu, Su Guang, Te Wo, Weijun Zhang, Xiaoming Shi, Xinxin Peng, Xing Wu, Yawen Liu, Yuqiu Ji, Ze Wen, Zhenhai Liu, Zichao Li, Zilong Liao

Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models.
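
To make "activating only a subset of parameters for each input token" concrete, the following is a generic top-k routing sketch. It is not the dots.llm1 router, whose details the abstract does not give. The names `top_k_route` and `moe_forward`, the toy experts, and the gate logits are all hypothetical.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k experts with the largest gate logits and
    renormalise their weights with a softmax over the selected ones."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_logits, k=2):
    """Evaluate only the selected experts and mix their outputs."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy scalar "experts"; real experts would be feed-forward blocks.
toy_experts = [lambda x: x, lambda x: 2 * x, lambda x: 0.0, lambda x: 4 * x]
toy_logits = [0.1, 2.0, -1.0, 2.0]
toy_output = moe_forward(3.0, toy_experts, toy_logits, k=2)

# Active fraction implied by the figures in the abstract (14B of 142B).
active_fraction = 14 / 142
```

With the figures quoted in the abstract, roughly one tenth of the parameters are active for any given token.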


Article 935

Title@2025-06-06 (5): BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions

Title: BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions BioMol-MQA: Multi-Modal Frage beantworten Datensatz für LLM-Gründung über bio-Molekulare Interaktionen BioMol-MQA:生物分子相互作用的多种模式解答LLM问题数据集 2506.05766v1

Authors (5): Saptarshi Sengupta, Shuhua Yang, Paul Kwong Yu, Fali Wang, Suhang Wang

Retrieval augmented generation (RAG) has shown great power in improving Large Language Models (LLMs). However, most existing RAG-based LLMs are dedicated to retrieving single-modality information, mainly text, while for many real-world problems, such as healthcare, information relevant to queries can manifest in various modalities such as knowledge graphs, text (clinical notes), and complex molecular structures. Thus, being able to retrieve relevant multi-modality domain-specific information, and to reason over and synthesize diverse knowledge to generate an accurate response, is important. To address the gap, we present BioMol-MQA, a new question-answering (QA) dataset on polypharmacy, which is composed of two parts: (i) a multimodal knowledge graph (KG) with text and molecular structure for information retrieval; and (ii) challenging questions that are designed to test LLM capabilities in retrieving and reasoning over multimodal KG to answer questions. Our benchmarks indicate that existing LLMs struggle to answer these questions and do well only when given the necessary background data, signaling the necessity for strong RAG frameworks.


Article 936

Title@2025-06-06 (5): On the Query Complexity of Verifier-Assisted Language Generation

Title: On the Query Complexity of Verifier-Assisted Language Generation Zur Fragekomplexität der sprachlichen Generierung von Prüfern 关于验证者协助语言生成的查询复杂程度 2502.12123v2

Authors (6): Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski

Recently, a plethora of works have proposed inference-time algorithms (e.g. best-of-n), which incorporate verifiers to assist the generation process. Their quality-efficiency trade-offs have been empirically benchmarked on a variety of constrained generation tasks, but the algorithmic design landscape is still poorly understood. In this paper, we develop a mathematical framework for reasoning about constrained generation using a pre-trained language model generator oracle and a process verifier, which can decide whether a prefix can be extended to a string that satisfies the constraints of choice. We show that even in very simple settings, access to a verifier can render an intractable problem (information-theoretically or computationally) tractable. In fact, we show that even simple algorithms, like tokenwise rejection sampling, can enjoy significant benefits from access to a verifier. Empirically, we show that a natural modification of tokenwise rejection sampling, in which the sampler is allowed to “backtrack” (i.e., erase the final few generated tokens), has robust and substantive benefits over natural baselines (e.g. (blockwise) rejection sampling, nucleus sampling) in terms of computational efficiency, accuracy, and diversity.
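
The backtracking idea can be illustrated with a toy sketch. This is one possible reading of "tokenwise rejection sampling with backtracking", not the authors' algorithm. The uniform proposal, the "no two consecutive b tokens" constraint, and all function names are assumptions made for the example.

```python
import random


def sample_with_backtracking(propose, verifier, length,
                             backtrack=2, max_steps=10_000, seed=0):
    """Grow a sequence token by token. Each proposed token is checked by a
    process verifier on the extended prefix. On rejection, the token is
    dropped and the last `backtrack` accepted tokens are erased."""
    rng = random.Random(seed)
    prefix = []
    verifier_queries = 0
    for _ in range(max_steps):
        if len(prefix) == length:
            return prefix, verifier_queries
        token = propose(prefix, rng)
        verifier_queries += 1
        if verifier(prefix + [token]):
            prefix.append(token)
        else:
            # Backtrack: erase the final few generated tokens.
            del prefix[max(0, len(prefix) - backtrack):]
    raise RuntimeError("step budget exhausted before reaching target length")


def toy_propose(prefix, rng):
    """Stand-in for a language model: a uniform draw over two tokens."""
    return rng.choice(["a", "b"])


def toy_verifier(prefix):
    """Accepts a prefix iff it has no two consecutive 'b' tokens.
    Any accepted prefix can be extended with 'a', so acceptance implies
    the prefix can still be completed to a valid string."""
    return "bb" not in "".join(prefix)


tokens, queries = sample_with_backtracking(toy_propose, toy_verifier, length=8)
```

The count of verifier queries returned here corresponds to the kind of cost that a query-complexity analysis tracks.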


Article 937

Title@2025-06-06 (5): Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning

Title: Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning 书写-RL:通过适应性课程强化学习推进长式写作 2506.05760v1

Authors (10): Xuanyu Lei, Chenliang Li, Yuning Wu, Kaiming Liu, Weizhou Shen, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu

Recent advances in Large Language Models (LLMs) have enabled strong performance in long-form writing, yet existing supervised fine-tuning (SFT) approaches suffer from limitations such as data saturation and restricted learning capacity bounded by teacher signals. In this work, we present Writing-RL: an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond SFT. The framework consists of three key components: a Margin-aware Data Selection strategy that prioritizes samples with high learning potential, a Pairwise Comparison Reward mechanism that provides discriminative learning signals in the absence of verifiable rewards, and a Dynamic Reference Scheduling approach, which plays a particularly critical role by adaptively adjusting task difficulty based on evolving model performance. Experiments on 7B-scale writer models show that our RL framework largely improves long-form writing performance over strong SFT baselines. Furthermore, we observe that models trained with long-output RL generalize surprisingly well to long-input reasoning tasks, potentially offering a promising perspective for rethinking long-context training.


Article 938

Title@2025-06-06 (5): Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective

Title: Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective Beschränkte Probenahmen für Sprachmodelle sollten einfach sein: Eine MCMC-Perspektive 语言模式抽样应容易:MCMC视角 2506.05754v1

Authors (6): Emmanuel Anaya Gonzalez, Sairam Vaidya, Kanghee Park, Ruyi Ji, Taylor Berg-Kirkpatrick, Loris D’Antoni

Constrained decoding enables Language Models (LMs) to produce samples that provably satisfy hard constraints. However, existing constrained-decoding approaches often distort the underlying model distribution, a limitation that is especially problematic in applications like program fuzzing, where one wants to generate diverse and valid program inputs for testing purposes. We propose a new constrained sampling framework based on Markov Chain Monte Carlo (MCMC) that simultaneously satisfies three core desiderata: constraint satisfying (every sample satisfies the constraint), monotonically converging (the sampling process converges to the true conditional distribution), and efficient (high-quality samples emerge in few steps). Our method constructs a proposal distribution over valid outputs and applies a Metropolis-Hastings acceptance criterion based on the LM’s likelihood, ensuring principled and efficient exploration of the constrained space. Empirically, our sampler outperforms existing methods on both synthetic benchmarks and real-world program fuzzing tasks.
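
A Metropolis-Hastings acceptance step over valid outputs can be sketched on a toy problem. This is a simplified illustration, not the paper's sampler. The uniform proposal over an enumerated set of valid strings and the hand-picked likelihood scores are assumptions. Because the proposal is symmetric, the acceptance ratio reduces to a ratio of model scores.

```python
import random
from collections import Counter


def mh_constrained_sampler(valid_outputs, lm_score, steps, seed=0):
    """Metropolis-Hastings chain whose states are always valid outputs.

    valid_outputs: list of strings satisfying the constraint.
    lm_score: function returning an unnormalised model likelihood (> 0).
    """
    rng = random.Random(seed)
    current = rng.choice(valid_outputs)
    samples = []
    for _ in range(steps):
        # Uniform (hence symmetric) proposal over valid outputs.
        candidate = rng.choice(valid_outputs)
        accept_prob = min(1.0, lm_score(candidate) / lm_score(current))
        if rng.random() < accept_prob:
            current = candidate
        samples.append(current)
    return samples


# Toy unnormalised likelihoods for three valid strings (made-up numbers).
toy_scores = {"ab": 6.0, "ba": 3.0, "aa": 1.0}
chain = mh_constrained_sampler(list(toy_scores), toy_scores.get, steps=20_000)
frequencies = {s: c / len(chain) for s, c in Counter(chain).items()}
```

Every state of the chain satisfies the constraint by construction. The empirical frequencies should approach the scores renormalised over the valid set (0.6, 0.3, 0.1 here), which is the conditional distribution that plain constrained decoding can distort.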


Article 939

Title@2025-06-06 (5): ProSec: Fortifying Code LLMs with Proactive Security Alignment

Title: ProSec: Fortifying Code LLMs with Proactive Security Alignment ProSec: Erweiterung von Code LLMs mit proaktiver Sicherheitsausrichtung Prosec: 使用前期安全对齐来强化代码LLMs 2411.12882v3

Authors (6): Xiangzhe Xu, Zian Su, Jinyao Guo, Kaiyuan Zhang, Zhenting Wang, Xiangyu Zhang

While recent code-specific large language models (LLMs) have greatly enhanced their code generation capabilities, the safety of these models remains under-explored, posing potential risks as insecure code generated by these models may introduce vulnerabilities into real-world systems. Existing methods collect security-focused datasets from real-world vulnerabilities for instruction tuning in order to mitigate such issues. However, they are largely constrained by the data sparsity of vulnerable code, and have limited applicability in the multi-stage post-training workflows of modern LLMs. In this paper, we propose ProSec, a novel proactive security alignment approach designed to align code LLMs with secure coding practices. ProSec systematically exposes the vulnerabilities in a code LLM by synthesizing vulnerability-inducing coding scenarios from Common Weakness Enumerations (CWEs) and generates fixes to vulnerable code snippets, allowing the model to learn secure practices through preference learning objectives. The scenarios synthesized by ProSec trigger 25x more vulnerable code than a normal instruction-tuning dataset, resulting in a security-focused alignment dataset 7x larger than the previous work. Experiments show that models trained with ProSec are 25.2% to 35.4% more secure compared to previous work without degrading models’ utility.


Article 940

Title@2025-06-06 (5): Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models

Title: Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models Emergente symbolische Mechanismen unterstützen abstrakte Vernunft in großen Sprachmodellen 支持大语言模式的新兴机制支持摘要 2502.20332v2

Authors (6): Yukang Yang, Declan Campbell, Kaixuan Huang, Mengdi Wang, Jonathan Cohen, Taylor Webb

Many recent studies have found evidence for emergent reasoning capabilities in large language models (LLMs), but debate persists concerning the robustness of these capabilities, and the extent to which they depend on structured reasoning mechanisms. To shed light on these issues, we study the internal mechanisms that support abstract reasoning in LLMs. We identify an emergent symbolic architecture that implements abstract reasoning via a series of three computations. In early layers, symbol abstraction heads convert input tokens to abstract variables based on the relations between those tokens. In intermediate layers, symbolic induction heads perform sequence induction over these abstract variables. Finally, in later layers, retrieval heads predict the next token by retrieving the value associated with the predicted abstract variable. These results point toward a resolution of the longstanding debate between symbolic and neural network approaches, suggesting that emergent reasoning in neural networks depends on the emergence of symbolic mechanisms.


Article 941

Title@2025-06-06 (5): LLM-Symbolic Integration for Robust Temporal Tabular Reasoning

Title: LLM-Symbolic Integration for Robust Temporal Tabular Reasoning LLM-Symbolische Integration für robuste temporale Tabulare Reasoning LLM- 用于强力时时制表理由的 Symblic 集成 2506.05746v1

Authors (5): Atharv Kulkarni, Kushagra Dixit, Vivek Srikumar, Dan Roth, Vivek Gupta

Temporal tabular question answering presents a significant challenge for Large Language Models (LLMs), requiring robust reasoning over structured data, which is a task where traditional prompting methods often fall short. These methods face challenges such as memorization, sensitivity to table size, and reduced performance on complex queries. To overcome these limitations, we introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations, alongside a symbolic intermediate representation that transforms tables into database schemas. This structured approach allows LLMs to generate and execute SQL queries, enhancing generalization and mitigating biases. By incorporating adaptive few-shot prompting with contextually tailored examples, our method achieves superior robustness, scalability, and performance. Experimental results consistently highlight improvements across key challenges, setting a new benchmark for robust temporal reasoning with LLMs.
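
The step of turning a table into a database schema and answering a temporal question with SQL can be sketched with Python's built-in sqlite3 module. The table name, columns, rows, and query below are invented for illustration and are not drawn from TempTabQA-C. In the described approach an LLM would generate the query text.

```python
import sqlite3

# Hypothetical temporal table: one row per role held by a person.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tenure ("
    "person TEXT, role TEXT, start_year INTEGER, end_year INTEGER)"
)
conn.executemany(
    "INSERT INTO tenure VALUES (?, ?, ?, ?)",
    [
        ("Alice", "Engineer", 2001, 2005),
        ("Alice", "Manager", 2005, 2012),
        ("Alice", "Director", 2012, 2020),
    ],
)

# Question: "Which role did Alice hold the longest?"
# The arithmetic is delegated to the database engine rather than the model.
query = (
    "SELECT role FROM tenure WHERE person = ? "
    "ORDER BY (end_year - start_year) DESC LIMIT 1"
)
longest_role = conn.execute(query, ("Alice",)).fetchone()[0]
conn.close()
```

Executing the query moves the date arithmetic out of the model, so the answer does not depend on how large the table is or on what the model may have memorized.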


Article 942

Title@2025-06-06 (5): ResearchTown: Simulator of Human Research Community

Title: ResearchTown: Simulator of Human Research Community ResearchTown: Simulator der Human Research Community 研究对象:人类研究界模拟者 2412.17767v2

Authors (8): Haofei Yu, Zhaochen Hong, Zirui Cheng, Kunlun Zhu, Keyang Xuan, Jinwei Yao, Tao Feng, Jiaxuan You

Large Language Models (LLMs) have demonstrated remarkable potential in scientific domains, yet a fundamental question remains unanswered: Can we simulate human research communities with LLMs? Addressing this question can deepen our understanding of the processes behind idea brainstorming and inspire the automatic discovery of novel scientific insights. In this work, we propose ResearchTown, a multi-agent framework for research community simulation. Within this framework, the human research community is simplified as an agent-data graph, where researchers and papers are represented as agent-type and data-type nodes, respectively, and connected based on their collaboration relationships. We also introduce TextGNN, a text-based inference framework that models various research activities (e.g., paper reading, paper writing, and review writing) as special forms of a unified message-passing process on the agent-data graph. To evaluate the quality of the research community simulation, we present ResearchBench, a benchmark that uses a node-masking prediction task for scalable and objective assessment based on similarity. Our experiments reveal three key findings: (1) ResearchTown can provide a realistic simulation of collaborative research activities, including paper writing and review writing; (2) ResearchTown can maintain robust simulation with multiple researchers and diverse papers; (3) ResearchTown can generate interdisciplinary research ideas that potentially inspire pioneering research directions.


Article 943

Title@2025-06-06 (5): Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness

Title: Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness Vergessen LLMs wirklich? Bewertung des Unlernens mit Wissenskorrelation und Vertrauensbewusstsein LLMs真的会忘却吗?用知识关联和信心意识来评价不学习 2506.05735v1

Authors (12): Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Mohsen Ghassemi, Yifan Li, Vamsi K. Potluru, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li

Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at https://github.com/Graph-COM/Knowledge_Unlearning.git.


Article 944

Title@2025-06-06 (5): Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations

Title: Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations Analysieren des Wissens von LLMs Grenzenkognition über Sprachen hinweg durch die Linse interner Repräsentationen 分析LLM女士通过内部代表的镜头跨越各种语言的知识边界认知 2504.13816v2

Authors (7): Chenghao Xiao, Hou Pong Chan, Hao Zhang, Mahani Aljunied, Lidong Bing, Noura Al Moubayed, Yu Rong

While understanding the knowledge boundaries of LLMs is crucial to prevent hallucination, research on the knowledge boundaries of LLMs has predominantly focused on English. In this work, we present the first study to analyze how LLMs recognize knowledge boundaries across different languages by probing their internal representations when processing known and unknown questions in multiple languages. Our empirical studies reveal three key findings: 1) LLMs’ perceptions of knowledge boundaries are encoded in the middle to middle-upper layers across different languages. 2) Language differences in knowledge boundary perception follow a linear structure, which motivates our proposal of a training-free alignment method that effectively transfers knowledge boundary perception ability across languages, thereby helping reduce hallucination risk in low-resource languages; 3) Fine-tuning on bilingual question pair translation further enhances LLMs’ recognition of knowledge boundaries across languages. Given the absence of standard testbeds for cross-lingual knowledge boundary analysis, we construct a multilingual evaluation suite comprising three representative types of knowledge boundary data. Our code and datasets are publicly available at https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.


Article 945

Title@2025-06-06 (5): Large Language Models are Good Relational Learners

Title: Large Language Models are Good Relational Learners Große Sprachmodelle sind gute relationale Lernende 大语言模型是好关系学习者 2506.05725v1

Authors (3): Fang Wu, Vijay Prakash Dwivedi, Jure Leskovec

Large language models (LLMs) have demonstrated remarkable capabilities across various domains, yet their application to relational deep learning (RDL) remains underexplored. Existing approaches adapt LLMs by traversing relational links between entities in a database and converting the structured data into flat text documents. Still, this text-based serialization disregards critical relational structures, introduces redundancy, and often exceeds standard LLM context lengths. We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)-based encoder to generate structured relational prompts for LLMs within a retrieval-augmented generation (RAG) framework. Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to effectively process and reason over complex entity relationships. Specifically, the GNN encoder extracts a local subgraph around an entity to build feature representations that contain relevant entity relationships and temporal dependencies. These representations are transformed into structured prompts using a denormalization process, effectively allowing the LLM to reason over relational structures. Through extensive experiments, we demonstrate that Rel-LLM outperforms existing methods on key RDL tasks, offering a scalable and efficient approach to integrating LLMs with structured data sources. Code is available at https://github.com/smiles724/Rel-LLM.


Article 946

Title@2025-06-06 (5): CoIR: A Comprehensive Benchmark for Code Information Retrieval Models

Title: CoIR: A Comprehensive Benchmark for Code Information Retrieval Models CoIR: Ein umfassender Benchmark für Code-Informations-Retrieval-Modelle CoIR: 代码信息检索模型综合基准 2407.02883v3

Authors (8): Xiangyang Li, Kuicai Dong, Yi Quan Lee, Wei Xia, Hao Zhang, Xinyi Dai, Yasheng Wang, Ruiming Tang

Despite the substantial success of Information Retrieval (IR) in various NLP tasks, most IR systems predominantly handle queries and corpora in natural language, neglecting the domain of code retrieval. Code retrieval is critically important yet remains under-explored, with existing methods and benchmarks inadequately representing the diversity of code in various domains and tasks. Addressing this gap, we present COIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. COIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We first discuss the construction of COIR and its diverse dataset composition. Further, we evaluate nine widely used retrieval models using COIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems. To facilitate easy adoption and integration within existing research workflows, COIR has been developed as a user-friendly Python framework, readily installable via pip. It shares the same data schema as other popular benchmarks like MTEB and BEIR, enabling seamless cross-benchmark evaluations. Through COIR, we aim to invigorate research in the code retrieval domain, providing a versatile benchmarking tool that encourages further development and exploration of code retrieval systems. https://github.com/CoIR-team/coir.


Article 947

Title@2025-06-06 (5): Structure Guided Large Language Model for SQL Generation

Title: Structure Guided Large Language Model for SQL Generation Struktur Geführtes großes Sprachmodell für SQL-Generierung SQL 生成引导大语言模式 2402.13284v3

Authors (6): Qinggang Zhang, Hao Chen, Junnan Dong, Shengyuan Chen, Feiran Huang, Xiao Huang

Recent advancements in large language models (LLMs) have shown promise in bridging the gap between natural language queries and database management systems, enabling users to interact with databases without a background in SQL. However, LLMs often struggle to comprehend complex database structures and accurately interpret user intentions. Decomposition-based methods have been proposed to enhance the performance of LLMs on complex tasks, but decomposing SQL generation into subtasks is non-trivial due to the declarative structure of SQL syntax and the intricate connections between query concepts and database elements. In this paper, we propose a novel Structure GUided text-to-SQL framework (SGU-SQL) that incorporates syntax-based prompting to enhance the SQL generation capabilities of LLMs. Specifically, SGU-SQL establishes structure-aware links between user queries and database schema and decomposes the complex generation task using syntax-based prompting to enable more accurate LLM-based SQL generation. Extensive experiments on two benchmark datasets demonstrate that SGU-SQL consistently outperforms state-of-the-art text-to-SQL models.


Article 948

Title@2025-06-06 (5): A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models

Title: A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models Eine Umfrage über Sparse Autoencoder: Interpretation der internen Mechanismen großer Sprachmodelle 粗略自动编码器调查:解释大语言模式内部机制 2503.05613v2

Authors (7): Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du

Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a means to understand the inner workings of LLMs. Among various mechanistic interpretability approaches, Sparse Autoencoders (SAEs) have emerged as a promising method due to their ability to disentangle the complex, superimposed features within LLMs into more interpretable components. This paper presents a comprehensive survey of SAEs for interpreting and understanding the internal workings of LLMs. Our major contributions include: (1) exploring the technical framework of SAEs, covering basic architecture, design improvements, and effective training strategies; (2) examining different approaches to explaining SAE features, categorized into input-based and output-based explanation methods; (3) discussing evaluation methods for assessing SAE performance, covering both structural and functional metrics; and (4) investigating real-world applications of SAEs in understanding and manipulating LLM behaviors.
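
For readers new to the topic, the "basic architecture" of a sparse autoencoder is commonly written as a ReLU encoder, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. The sketch below uses that common textbook form in plain Python. The weights are toy values chosen by hand, and the survey itself covers many variants not shown here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=0.01):
    """One forward pass of a basic sparse autoencoder.

    Returns the sparse code z, the reconstruction x_hat, and the loss
    ||x - x_hat||^2 + l1_coeff * ||z||_1.
    """
    pre_activation = [h + b for h, b in zip(matvec(w_enc, x), b_enc)]
    z = [max(0.0, h) for h in pre_activation]  # ReLU keeps codes non-negative
    x_hat = [h + b for h, b in zip(matvec(w_dec, z), b_dec)]
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in z)
    return z, x_hat, reconstruction + l1_coeff * sparsity


# Toy setup: 2-dimensional activation, 3 latent features (overcomplete).
toy_x = [1.0, -1.0]
toy_w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_b_enc = [0.0, 0.0, 0.0]
toy_w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
toy_b_dec = [0.0, 0.0]
code, reconstruction, loss = sae_forward(
    toy_x, toy_w_enc, toy_b_enc, toy_w_dec, toy_b_dec
)
```

In this toy case only one of the three latent features fires, which is the kind of sparse, overcomplete code that explanation methods then try to label.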


Article 949

Title@2025-06-06 (5): RKEFino1: A Regulation Knowledge-Enhanced Large Language Model

Title: RKEFino1: A Regulation Knowledge-Enhanced Large Language Model RKEFino1: Ein normatives, wissensgestütztes, großes Sprachmodell RKEFino1:知识强化大语言规范模式 2506.05700v1

Authors (4): Yan Wang, Yueru He, Ruoyu Xiang, Jeff Zhao

Recent advances in large language models (LLMs) hold great promise for financial applications but introduce critical accuracy and compliance challenges in Digital Regulatory Reporting (DRR). To address these issues, we propose RKEFino1, a regulation knowledge-enhanced financial reasoning model built upon Fino1, fine-tuned with domain knowledge from XBRL, CDM, and MOF. We formulate two QA tasks (knowledge-based and mathematical reasoning) and introduce a novel Numerical NER task covering financial entities in both sentences and tables. Experimental results demonstrate the effectiveness and generalization capacity of RKEFino1 in compliance-critical financial tasks. We have released our model on Hugging Face.



Article 950

Title@2025-06-06 (5): Generalizable LLM Learning of Graph Synthetic Data with Reinforcement Learning

Title: Generalizable LLM Learning of Graph Synthetic Data with Reinforcement Learning Allgemeines LLM-Lernen von Graphen-Synthetischen Daten mit Verstärkungs-Lernen 利用强化学习学习图形合成数据学习 2506.00845v2

Authors (6): Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, Yulia Tsvetkov

Previous research has sought to enhance the graph reasoning capabilities of LLMs by supervised fine-tuning on synthetic graph data. While these efforts led to specialized LLMs that are better at solving graph algorithm problems, we don’t need LLMs for shortest path: we need generalization from synthetic graph data to real-world tasks with implicit graph structures. In this work, we propose to unlock generalizable learning of graph synthetic data with reinforcement learning. We first design solution-based and process-based rewards for synthetic graph problems: instead of rigidly memorizing response patterns as in direct fine-tuning, we posit that RL would help LLMs grasp the essentials underlying graph reasoning and alleviate overfitting. We employ RL algorithms such as GRPO and DPO, aligning both off-the-shelf LLMs and LLMs fine-tuned on synthetic graph data. We then compare them against existing settings on both in-domain synthetic tasks and out-of-domain real-world tasks with implicit graph structures such as multi-hop QA, structured planning, and more. Extensive experiments demonstrate that our RL recipe leads to statistically significant improvement on 5 datasets, with an average gain of 12.9% over baseline settings. Further analysis reveals that process-based rewards consistently outperform solution-based rewards and that mixing synthetic and real-world task data yields potential gains, while compositionality and explainable intermediate steps remain a critical challenge even after RL.
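
As a rough illustration of the reward design mentioned in this abstract, the sketch below blends a solution-based and a process-based reward and then normalizes rewards within one group of sampled responses, which is the group-relative advantage commonly associated with GRPO. The function names, the 0.5 blending weight, and the idea of blending the two rewards at all are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(final_correct, step_scores, process_weight=0.5):
    """Blend a solution-based and a process-based reward (hypothetical weighting)."""
    solution = 1.0 if final_correct else 0.0
    # Process reward: average score of the intermediate reasoning steps.
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return (1 - process_weight) * solution + process_weight * process


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that score above the group mean receive positive advantages and are reinforced; responses below the mean are penalized.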



Article 951

Title@2025-06-06 (5): Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework

Title: Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework Progressiv stark sein! Erweiterung der Wissensdestillation von großen Sprachmodellen durch einen Curriculum Learning Framework 通过课程学习框架加强大语言模式的知识提炼 2506.05695v1

Authors (2): Lingyuan Liu, Mengxiang Zhang

Knowledge Distillation (KD) compresses large language models (LLMs) by transferring the teacher model’s capabilities to a smaller student model, reducing inference cost and memory usage while maintaining performance. However, existing KD methods for LLMs often fail to prevent significant shifts in the student model’s distribution during training, leading to issues such as catastrophic forgetting, mode collapse, and training-inference mismatch. To address these challenges, we propose a novel, plug-in curriculum learning framework inspired by the strength training principle of “progressive overload” (POCL), which can be seamlessly integrated into existing white-box KD approaches with minimal computational overhead. The framework comprises two core components: (1) a difficulty measurer that ranks and partitions training samples from easy to hard, and (2) a training scheduler that incrementally introduces these subsets into the distillation process at fixed intervals while applying loss functions with progressively rising temperatures. By starting with the easiest samples and progressively increasing the difficulty, the approach enhances both the stability and efficiency of learning. Extensive experiments in instruction-following settings demonstrate that POCL consistently improves the performance of distilled student models across various white-box KD methods and model families. Our findings highlight the effectiveness of sorted training samples in KD for LLMs. More generally, our work demonstrates how to structure training data within the KD process to enhance the stability and performance of distilled LLMs.
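
The two components described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the difficulty function is supplied by the caller, stages accumulate (earlier subsets stay in the pool), and the temperature rises linearly from `t_start` to `t_end`. None of these specifics are confirmed by the abstract, and all names are hypothetical.

```python
def partition_by_difficulty(samples, difficulty, num_stages):
    """Rank samples from easy to hard and split them into at most num_stages subsets."""
    ranked = sorted(samples, key=difficulty)
    size = -(-len(ranked) // num_stages)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def curriculum_schedule(stages, epochs_per_stage, t_start=1.0, t_end=2.0):
    """Yield (training_pool, temperature) pairs, adding one stage at fixed intervals."""
    n = len(stages)
    pool = []
    for k, stage in enumerate(stages):
        pool = pool + stage  # assumption: earlier, easier subsets remain in the pool
        if n == 1:
            temperature = t_start
        else:
            temperature = t_start + (t_end - t_start) * k / (n - 1)
        for _ in range(epochs_per_stage):
            yield list(pool), temperature
```

A distillation loop would iterate over `curriculum_schedule(...)` and apply its usual loss at the yielded temperature on the yielded pool.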



Article 952

Title@2025-06-06 (5): Direct Behavior Optimization: Unlocking the Potential of Lightweight LLMs

Title: Direct Behavior Optimization: Unlocking the Potential of Lightweight LLMs Direct Behavior Optimization: Entsperren des Potenzials leichter LLMs 直接行为优化:释放轻度LMT的潜力 2506.06401v1

Authors (7): Hongming Yang, Shi Lin, Jun Shao, Changting Lin, Donghai Zhu, Meng Han, Qinglei Kong

Lightweight Large Language Models (LwLLMs) are reduced-parameter, optimized models designed to run efficiently on consumer-grade hardware, offering significant advantages in resource efficiency, cost-effectiveness, and data privacy. However, these models often struggle with limited inference and reasoning capabilities, which restrict their performance on complex tasks and limit their practical applicability. Moreover, existing prompt optimization methods typically rely on extensive manual effort or the meta-cognitive abilities of state-of-the-art LLMs, making them less effective for LwLLMs. To address these challenges, we introduce DeBoP, a new Direct Behavior Optimization Paradigm originating from the Chain-of-Thought (CoT) prompting technique. Unlike CoT prompting, DeBoP is an automatic optimization method that directly optimizes the behavior of LwLLMs. In particular, DeBoP transforms the optimization of complex prompts into the optimization of discrete, quantifiable execution sequences using a gradient-free Monte Carlo Tree Search. We evaluate DeBoP on seven challenging tasks where state-of-the-art LLMs excel but LwLLMs generally underperform. Experimental results demonstrate that DeBoP significantly outperforms recent prompt optimization methods on most tasks. In particular, DeBoP-optimized LwLLMs surpass GPT-3.5 on most tasks while reducing computational time by approximately 60% compared to other automatic prompt optimization methods.



Article 953

Title@2025-06-06 (5): When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation

Title: When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation Wann in RAG Graphen verwendet werden: Eine umfassende Analyse für die graphisch retrieval-augmentierte Generation 何时使用 RAG 中的图表 : 图表检索- 增强的一代的综合分析 。 2506.05690v1

Authors (7): Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, Jinsong Su

Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) with external knowledge. It leverages graphs to model the hierarchical structure between specific concepts, enabling more coherent and effective knowledge retrieval for accurate reasoning. Despite its conceptual promise, recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems? To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of increasing difficulty, covering fact retrieval, complex reasoning, contextual summarization, and creative generation, and a systematic evaluation across the entire pipeline, from graph construction and knowledge retrieval to final generation. Leveraging this novel benchmark, we systematically investigate the conditions when GraphRAG surpasses traditional RAG and the underlying reasons for its success, offering guidelines for its practical application. All related resources and analyses are collected for the community at https://github.com/GraphRAG-Bench/GraphRAG-Benchmark.



Article 954

Title@2025-06-06 (5): A Unified Representation for Continuity and Discontinuity: Syntactic and Computational Motivations

Title: A Unified Representation for Continuity and Discontinuity: Syntactic and Computational Motivations Eine einheitliche Darstellung für Kontinuität und Diskontinuität: Syntaktische und rechnerische Motivationen 连续和中断统一代表:协同和计算模式 2506.05686v1

Authors (2): Ratna Kandala, Prakash Mondal

This paper advances a unified representation of linguistic structure for three grammar formalisms, namely, Phrase Structure Grammar (PSG), Dependency Grammar (DG) and Categorial Grammar (CG), from the perspective of syntactic and computational complexity considerations. The correspondence principle is proposed to enable a unified representation of the representational principles from PSG, DG, and CG. To that end, the paper first illustrates a series of steps in achieving a unified representation for a discontinuous subordinate clause from Turkish as an illustrative case. This affords a new way of approaching discontinuity in natural language from a theoretical point of view that unites and integrates the basic tenets of PSG, DG, and CG, with significant consequences for syntactic analysis. The paper then demonstrates that a unified representation can simplify computational complexity with regard to the neurocognitive representation and processing of both continuous and discontinuous sentences vis-à-vis the basic principles of PSG, DG, and CG.



Article 955

Title@2025-06-06 (5): GuessBench: Sensemaking Multimodal Creativity in the Wild

Title: GuessBench: Sensemaking Multimodal Creativity in the Wild GuessBench: Sensemaking Multimodale Kreativität in der Wildnis Gisguess B Beench: 荒野中思想制造的多模式创造性 2506.00814v2

Authors (6): Zifeng Zhu, Shangbin Feng, Herun Wan, Ningnan Wang, Minnan Luo, Yulia Tsvetkov

We propose GuessBench, a novel benchmark that evaluates Vision Language Models (VLMs) on modeling the pervasive, noisy, and pluralistic human creativity. GuessBench sources data from “Guess the Build”, an online multiplayer Minecraft minigame where one player constructs a Minecraft build given a concept (e.g., caterpillar) and others try to guess it with natural language hints, presenting a pristine testbed for sensemaking creativity in the wild with VLMs acting as guessers. We curate 1500 images from the actual gameplay and design 2000 problems spanning static and dynamic image settings, natural language hints of varying completeness, and more. Extensive experiments with six open/API VLMs and five reasoning enhancement approaches demonstrate that GuessBench presents a uniquely challenging task in creativity modeling: even the state-of-the-art GPT-4o is incorrect on 34% of instances, while we observe a huge performance gap (13.87% vs. 53.93% on average) between open and API models. When used as a resource to improve VLMs, fine-tuning on the reasoning traces for GuessBench problems improves visual perception tasks by 15.36% on average. Further analysis reveals that VLM performance in creativity sensemaking correlates with the frequency of the concept in training data, while the accuracy drops sharply for concepts in underrepresented cultural contexts and low-resource languages.



Article 956

Title@2025-06-06 (5): Data Swarms: Optimizable Generation of Synthetic Evaluation Data

Title: Data Swarms: Optimizable Generation of Synthetic Evaluation Data Datenschwärme: Optimierbare Generierung synthetischer Auswertungsdaten Data Swarms: 最佳生成合成评价数据 2506.00741v2

Authors (4): Shangbin Feng, Yike Wang, Weijia Shi, Yulia Tsvetkov

We propose Data Swarms, an algorithm to optimize the generation of synthetic evaluation data and advance quantitative desiderata of LLM evaluation. We first train a swarm of initial data generators using existing data, and define various evaluation objectives to reflect the desired properties of evaluation (e.g., generate more difficult problems for the evaluated models) and quantitatively evaluate data generators. We then employ particle swarm optimization to optimize the swarm of data generators, where they collaboratively search through the model parameter space to find new generators that advance these objectives. We further extend it to Adversarial Swarms, where the data generator swarm generates harder data while the test taker model swarm learns from such data, co-evolving dynamically for better data and models simultaneously. Extensive experiments demonstrate that Data Swarms outperforms eight data generation baselines across five evaluation objectives, while Adversarial Swarms produce more robust learning of synthetic data and stronger generalization. Further analysis reveals that Data Swarms successfully optimizes compositions of multiple evaluation objectives and generalizes to new off-the-shelf LLMs, unseen at optimization time.
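
For readers unfamiliar with particle swarm optimization, here is a minimal generic sketch of the technique. It minimizes an arbitrary function over plain float vectors; in the setting described above the particles would instead be data-generator model parameters and the objective an evaluation desideratum, which this toy does not attempt to model. The hyperparameter values and the initialization range are conventional illustrative choices, not values from the paper.

```python
import random


def particle_swarm_minimize(objective, dim, num_particles=20, steps=100,
                            inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Generic PSO: each particle is pulled toward its own best and the swarm's best."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(num_particles)]
    vel = [[0.0] * dim for _ in range(num_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(num_particles), key=best_val.__getitem__)
    global_pos, global_val = best_pos[g][:], best_val[g]

    for _ in range(steps):
        for i in range(num_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (global_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < best_val[i]:  # update the particle's personal best
                best_pos[i], best_val[i] = pos[i][:], val
                if val < global_val:  # update the swarm-wide best
                    global_pos, global_val = pos[i][:], val
    return global_pos, global_val
```

Calling `particle_swarm_minimize(lambda x: sum(v * v for v in x), dim=2)` searches for the minimum of a simple sphere function.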



Article 957

Title@2025-06-06 (5): TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

Title: TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages TMT: Drei-Modalen-Übersetzung zwischen Sprache, Bild und Text durch Verarbeitung verschiedener Modalitäten als verschiedene Sprachen TMT: 以不同语言处理不同方式的语音、图像和文本之间的三式翻译 2402.16021v2

Authors (8): Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro

The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, where we interpret different modalities as different languages, and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT outperforms single model counterparts consistently, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.



Article 958

Title@2025-06-06 (5): WER We Stand: Benchmarking Urdu ASR Models

Title: WER We Stand: Benchmarking Urdu ASR Models WER We Stand: Benchmarking Urdu ASR Modelle WER 站立:确定乌尔都年度年度社会责任模式的基准 2409.11252v3

Authors (6): Samee Arif, Sualeha Farid, Aamina Jamal Khan, Mustafa Abbas, Agha Ali Raza, Awais Athar

This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. We analyze the performance of three ASR model families: Whisper, MMS, and Seamless-M4T using Word Error Rate (WER), along with a detailed examination of the most frequent wrong words and error types including insertions, deletions, and substitutions. Our analysis is conducted using two types of datasets, read speech and conversational speech. Notably, we present the first conversational speech dataset designed for benchmarking Urdu ASR models. We find that seamless-large outperforms other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset. Furthermore, this evaluation highlights the complexities of assessing ASR models for low-resource languages like Urdu using quantitative metrics alone and emphasizes the need for a robust Urdu text normalization system. Our findings contribute valuable insights for developing robust ASR systems for low-resource languages like Urdu.
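
Word Error Rate and the breakdown into substitutions, deletions, and insertions can be computed with a word-level edit-distance table. The sketch below is a minimal illustration: it splits on whitespace and applies no text normalization (which the abstract notes is a real concern for Urdu), and the function name is hypothetical.

```python
def wer_breakdown(reference, hypothesis):
    """Return WER and error counts from a word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = (total_edits, substitutions, deletions, insertions)
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                candidates = [(t, s, d, n)]          # match, no cost
            else:
                candidates = [(t + 1, s + 1, d, n)]  # substitution
            t, s, d, n = dp[i - 1][j]
            candidates.append((t + 1, s, d + 1, n))  # deletion
            t, s, d, n = dp[i][j - 1]
            candidates.append((t + 1, s, d, n + 1))  # insertion
            dp[i][j] = min(candidates)               # fewest total edits wins
    total, subs, dels, ins = dp[-1][-1]
    return {"wer": total / len(ref), "substitutions": subs,
            "deletions": dels, "insertions": ins}
```

Because WER divides by the reference length, it can exceed 1.0 when the hypothesis contains many insertions.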



Article 959

Title@2025-06-06 (5): Multi-Agent Collaboration via Cross-Team Orchestration

Title: Multi-Agent Collaboration via Cross-Team Orchestration Multi-Agenten-Zusammenarbeit über Cross-Team-Orchestrierung 通过跨团队管弦化多机构协作 2406.08979v2

Authors (12): Zhuoyun Du, Chen Qian, Wei Liu, Zihao Xie, YiFei Wang, Rennai Qiu, Yufan Dang, Weize Chen, Cheng Yang, Ye Tian, Xuantang Xiong, Lei Han

Large Language Models (LLMs) have significantly impacted various domains, especially through organized LLM-driven autonomous agents. A representative scenario is software development, where agents can collaborate in a team like humans, following predefined phases to complete sub-tasks sequentially. However, for an agent team, each phase yields only one possible outcome. This results in the completion of only one development chain, thereby losing the opportunity to explore multiple potential decision paths within the solution space and consequently leading to suboptimal results or extensive trial and error. To address this, we introduce Cross-Team Orchestration (Croto), a scalable multi-team framework that enables orchestrated teams to jointly propose various task-oriented solutions and exchange insights in an environment that preserves each team’s independence while supporting cross-team collaboration, yielding superior solutions. Experiments reveal a notable increase in software quality compared to state-of-the-art baselines. We further tested our framework on story generation tasks, which demonstrated a promising generalization ability of our framework in other domains. The code and data are available at https://github.com/OpenBMB/ChatDev/tree/macnet



Article 960

Title@2025-06-06 (5): Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

Title: Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective Desecting Bias in LLMs: Eine mechanistische Interpretationsperspektive 在LLMM中分解偏见:机械解释视角 2506.05166v2

Authors (3): Bhavik Chandna, Zubair Bashir, Procheta Sen

Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment because of the sharing of important components with these tasks.



Article 961

Title@2025-06-06 (5): Contextually Guided Transformers via Low-Rank Adaptation

Title: Contextually Guided Transformers via Low-Rank Adaptation Kontextuell geführte Transformer über Low-Rank-Anpassung 通过低辐射适应实现上下文向导变形器 2506.05672v1

Authors (4): Andrey Zhmoginov, Jihwan Lee, Max Vladymyrov, Mark Sandler

Large Language Models (LLMs) based on Transformers excel at text processing, but their reliance on prompts for specialized behavior introduces computational overhead. We propose a modification to a Transformer architecture that eliminates the need for explicit prompts by learning to encode context into the model’s weights. Our Contextually Guided Transformer (CGT) model maintains a contextual summary at each sequence position, allowing it to update the weights on the fly based on the preceding context. This approach enables the model to self-specialize, effectively creating a tailored model for processing information following a given prefix. We demonstrate the effectiveness of our method on synthetic in-context learning tasks and language modeling benchmarks. Furthermore, we introduce techniques for enhancing the interpretability of the learned contextual representations, drawing connections to Variational Autoencoders and promoting smoother, more consistent context encoding. This work offers a novel direction for efficient and adaptable language modeling by integrating context directly into the model’s architecture.



Article 962

Title@2025-06-06 (5): Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

Title: Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning Low-Resource-Domänenanpassung für Sprach-LLMs via Text-Only Fine-Tuning 通过纯文本微调调整的语音磁性LLMs的低资源域适应 2506.05671v1

Authors (7): Yangui Fang, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong, Kai Yu

Recent advances in automatic speech recognition (ASR) have combined speech encoders with large language models (LLMs) through projection, forming Speech LLMs with strong performance. However, adapting them to new domains remains challenging, especially in low-resource settings where paired speech-text data is scarce. We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. To preserve speech-text alignment, we introduce a real-time evaluation mechanism during fine-tuning. This enables effective domain adaptation while maintaining source-domain performance. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance, with minimal degradation compared to full audio-text fine-tuning. It also improves generalization to new domains without catastrophic forgetting, highlighting the potential of text-only fine-tuning for low-resource domain adaptation of ASR.



Article 963

Title@2025-06-06 (5): Can LLMs Express Personality Across Cultures? Introducing CulturalPersonas for Evaluating Trait Alignment

Title: Can LLMs Express Personality Across Cultures? Introducing CulturalPersonas for Evaluating Trait Alignment Können LLMs ihre Persönlichkeit über Kulturen hinweg ausdrücken? LLMs Can LLM Can LLM 跨文化的特异性? 引入文化名人来评估轨迹协调情况 2506.05670v1

Authors (5): Priyanka Dey, Yugal Khanter, Aayush Bothra, Jieyu Zhao, Emilio Ferrara

As LLMs become central to interactive applications, ranging from tutoring to mental health, the ability to express personality in culturally appropriate ways is increasingly important. While recent works have explored personality evaluation of LLMs, they largely overlook the interplay between culture and personality. To address this, we introduce CulturalPersonas, the first large-scale benchmark with human validation for evaluating LLMs’ personality expression in culturally grounded, behaviorally rich contexts. Our dataset spans 3,000 scenario-based questions across six diverse countries, designed to elicit personality through everyday scenarios rooted in local values. We evaluate three LLMs, using both multiple-choice and open-ended response formats. Our results show that CulturalPersonas improves alignment with country-specific human personality distributions (over a 20% reduction in Wasserstein distance across models and countries) and elicits more expressive, culturally coherent outputs compared to existing benchmarks. CulturalPersonas surfaces meaningful modulated trait outputs in response to culturally grounded prompts, offering new directions for aligning LLMs to global norms of behavior. By bridging personality expression and cultural nuance, we envision that CulturalPersonas will pave the way for more socially intelligent and globally adaptive LLMs.
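
The Wasserstein distance cited in this abstract measures how much probability mass must be moved, and how far, to turn one distribution into another. A minimal sketch for two histograms over the same ordered, unit-spaced scale (for example, a five-point trait score) follows; the abstract does not say how the authors computed it, so treat the setup and the function name as assumptions.

```python
def wasserstein_1d(p, q):
    """1-Wasserstein distance between two histograms on the same unit-spaced scale."""
    if len(p) != len(q):
        raise ValueError("histograms must share the same support")
    total_p, total_q = sum(p), sum(q)
    cdf_gap = 0.0
    distance = 0.0
    for a, b in zip(p, q):
        # Running difference between the two cumulative distributions.
        cdf_gap += a / total_p - b / total_q
        distance += abs(cdf_gap)
    return distance
```

Under this definition, a "20% reduction" would mean the model's trait distribution needs 20% less mass transport to match the human distribution.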



Article 964

Title@2025-06-06 (5): BAQ: Efficient Bit Allocation Quantization for Large Language Models

Title: BAQ: Efficient Bit Allocation Quantization for Large Language Models BAQ: Effiziente Bitallokation Quantisierung für große Sprachmodelle BAQ: 大语言模式的高效比分配比量量化 2506.05664v1

Authors (4): Chao Zhang, Li Wang, Samson Lasaulce, Merouane Debbah

Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. In this paper, we propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy. We make key assumptions, which allow the layer/component-wise loss function to be expressed as an explicit function of the bitwidths. This enables a neat formulation of the bit allocation problem as a convex optimization task, whose closed-form solution adapts precision across weights to minimize the layer-wise quantization loss. Inspecting the solution provides several insights (such as the equal-loss structure), which are then exploited to design the proposed BAQ (Bit Allocation Quantization) algorithm. The proposed algorithm achieves a good trade-off between loss minimization and complexity and allows BAQ to be integrated into standard quantization pipelines with minimal overhead. Experimental results show that BAQ consistently outperforms GPTQ, achieving up to 56× lower perplexity at the same bitwidth on large language models ranging from 125M to 30B parameters. Leveraging our analytical results derived from solving the optimal bit allocation problem, we also provide a theoretical explanation for the observed gains. All code for this paper is available at https://github.com/CSU-ModelCompression/BAQ.
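
To give a feel for what a closed-form, equal-loss bit allocation can look like, the sketch below uses the textbook rate-distortion model in which component i with sensitivity s_i and bitwidth b_i incurs loss s_i · 2^(−2·b_i). This loss model is an assumption chosen for illustration and is not claimed to be the one derived in the paper. Minimizing the summed loss under a fixed average bitwidth then gives b_i = B + ½·log2(s_i / G), where G is the geometric mean of the sensitivities. The sketch ignores rounding to integers and non-negativity of bitwidths.

```python
import math


def allocate_bits(sensitivities, avg_bits):
    """Closed-form allocation under the assumed loss model s_i * 2**(-2 * b_i)."""
    logs = [math.log2(s) for s in sensitivities]
    mean_log = sum(logs) / len(logs)  # log2 of the geometric mean
    # More sensitive components receive more bits; the average stays at avg_bits.
    return [avg_bits + 0.5 * (value - mean_log) for value in logs]


def component_losses(sensitivities, bits):
    """Per-component loss under the same assumed model."""
    return [s * 2.0 ** (-2.0 * b) for s, b in zip(sensitivities, bits)]
```

With sensitivities `[1.0, 4.0, 16.0]` and an average of 4 bits, every component ends up with the same loss, which is the equal-loss structure in this toy model.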



Article 965

Title@2025-06-06 (5): Instructor-Worker Large Language Model System for Policy Recommendation: a Case Study on Air Quality Analysis of the January 2025 Los Angeles Wildfires

Title: Instructor-Worker Large Language Model System for Policy Recommendation: a Case Study on Air Quality Analysis of the January 2025 Los Angeles Wildfires Instructor-Worker Large Language Model System for Policy Empfehlung: Eine Fallstudie zur Luftqualitätsanalyse der Januar 2025 Los Angeles Wildfires 政策建议:2025年1月洛杉矶野火空气质量分析个案研究 2503.00566v2

Authors (7): Kyle Gao, Dening Lu, Liangzhi Li, Nan Chen, Hongjie He, Linlin Xu, Jonathan Li

The Los Angeles wildfires of January 2025 caused more than 250 billion dollars in damage and lasted for nearly an entire month before containment. Following our previous work, the Digital Twin Building, we modify and leverage the multi-agent large language model framework as well as the cloud-mapping integration to study the air quality during the Los Angeles wildfires. Recent advances in large language models have allowed for out-of-the-box automated large-scale data analysis. We use a multi-agent large language system comprised of an Instructor agent and Worker agents. Upon receiving the users’ instructions, the Instructor agent retrieves the data from the cloud platform and produces instruction prompts to the Worker agents. The Worker agents then analyze the data and provide summaries. The summaries are finally input back into the Instructor agent, which then provides the final data analysis. We test this system’s capability for data-based policy recommendation by assessing our Instructor-Worker LLM system’s health recommendations based on air quality during the Los Angeles wildfires.



Article 966

Title@2025-06-06 (5): Where is the signal in tokenization space?

Title: Where is the signal in tokenization space? Wo ist das Signal im Tokenisationsraum? 象征空间的信号在哪里? 2408.08541v2

Authors (5): Renato Lui Geh, Honghua Zhang, Kareem Ahmed, Benjie Wang, Guy Van den Broeck

Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences, to which the LLMs assign probability values. One common assumption is that the probability of a piece of text is the probability of its canonical token sequence. However, the tokenization of a string is not unique: e.g., the Llama2 tokenizer encodes Tokens as [Tok,ens], but [Tok,en,s] also represents the same text. In this paper, we study non-canonical tokenizations. We prove that, given a string, it is computationally hard to find the most likely tokenization for an autoregressive LLM, as well as to compute the marginal probability over all possible tokenizations. We then show how the marginal is, in most cases, indistinguishable from the canonical probability. Surprisingly, we then empirically demonstrate the existence of a significant amount of signal hidden within tokenization space. Notably, by simply aggregating the probabilities of non-canonical tokenizations, we achieve improvements across a range of LLM evaluation benchmarks for a variety of architectures, including transformers and state space models.
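
The non-uniqueness of tokenization is easy to see by brute force on a toy vocabulary. The sketch below enumerates every segmentation of a string into vocabulary tokens and sums a caller-supplied sequence probability over them. The four-token vocabulary and the `sequence_prob` callback are hypothetical stand-ins for a real tokenizer and LLM, and the enumeration is exponential in general, in line with the hardness result stated above.

```python
def tokenizations(text, vocab):
    """Enumerate all ways to split text into tokens drawn from vocab."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in tokenizations(text[end:], vocab):
                results.append([piece] + rest)
    return results


def marginal_probability(text, vocab, sequence_prob):
    """Sum sequence_prob over every tokenization of text (brute force)."""
    return sum(sequence_prob(tokens) for tokens in tokenizations(text, vocab))
```

For example, `tokenizations("Tokens", {"Tok", "ens", "en", "s"})` yields both `["Tok", "ens"]` and `["Tok", "en", "s"]`, matching the example in the abstract.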



Article 967

Title@2025-06-06 (5): Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones

Title: Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones Projektierbare Modelle: One-Shot-Generation von kleinen Spezialtransformatoren aus großen Modellen 集成项目模型:大型一型小型专门化变形器的单制制式 2506.05641v1

Authors (3): Andrey Zhmoginov, Jihwan Lee, Mark Sandler

Modern Foundation Models (FMs) are typically trained on corpora spanning a wide range of different data modalities, topics and downstream tasks. Utilizing these models can be very computationally expensive and is out of reach for most consumer devices. Furthermore, most of the broad FM knowledge may actually be irrelevant for a specific task at hand. Here we explore a technique for mapping parameters of a large Transformer to parameters of a smaller specialized model. By making this transformation task-specific, we aim to capture a narrower scope of the knowledge needed for performing a specific task by a smaller model. We study our method on image modeling tasks, showing that performance of generated models exceeds that of universal conditional models.



Article 968

Title@2025-06-05 (4): A Fictional Q&A Dataset for Studying Memorization and Knowledge Acquisition

Title: A Fictional Q&A Dataset for Studying Memorization and Knowledge Acquisition Ein fiktiver Q&A-Datensatz für das Studieren von Memorization und Knowledge Acquisition 用于研究记忆化和知识获取的字形 A 数据集 2506.05639v1

Authors (5): John Kirchenbauer, Janny Mongkolsupawan, Yuxin Wen, Tom Goldstein, Daphne Ippolito

When language models are trained on textual data, they acquire both knowledge about the structure of language and knowledge of facts about the world. At inference time, their knowledge of facts can be leveraged to solve interesting problems and perform useful knowledge work for users. It is well known that language models can memorize long sequences from their training data verbatim. However, it is much less well understood how language models memorize facts seen during training. In this work, we propose a new dataset to specifically empower researchers to study the dual processes of fact memorization and verbatim sequence memorization. The dataset consists of synthetically generated, webtext-like documents about fictional events, as well as question-answer pairs about the events. We conduct training experiments showing how synthetic data about fictional events can be effective in teasing apart different forms of memorization. We also document the challenges in effectively building realistic, fictional synthetic data.



Article 969

Title@2025-06-05 (4): IYKYK: Using language models to decode extremist cryptolects

Title: IYKYK: Using language models to decode extremist cryptolects IYKYK: Mit Sprachmodellen extremistische Kryptolekten entschlüsseln IYYKK:使用语言模式解码极端主义加密器 2506.05635v1

Authors (6): Christine de Kock, Arij Riabi, Zeerak Talat, Michael Sejr Schlichtkrull, Pranava Madhyastha, Ed Hovy

Extremist groups develop complex in-group language, also referred to as cryptolects, to exclude or mislead outsiders. We investigate the ability of current language technologies to detect and interpret the cryptolects of two online extremist platforms. Evaluating eight models across six tasks, our results indicate that general purpose LLMs cannot consistently detect or decode extremist language. However, performance can be significantly improved by domain adaptation and specialised prompting techniques. These results provide important insights to inform the development and deployment of automated moderation technologies. We further develop and release novel labelled and unlabelled datasets, including 19.4M posts from extremist platforms and lexicons validated by human experts.



Article 970

Title@2025-06-05 (4): Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs

Title: Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs Selbstaufmerksamkeit für Input-Dependent Soft Prompting in LLMs 在LLMM中利用输入- 依赖软提示自控 2506.05629v1

Authors (3): Ananth Muppidi, Abhilash Nandy, Sambaran Bandyopadhyay

The performance of large language models in domain-specific tasks necessitates fine-tuning, which is computationally expensive and technically challenging. This paper focuses on parameter-efficient fine-tuning using soft prompting, a promising approach that adapts pre-trained models to downstream tasks by learning a small set of parameters. We propose a novel Input Dependent Soft Prompting technique with a self-Attention Mechanism (ID-SPAM) that generates soft prompts based on the input tokens and attends to different tokens with varying importance. Our method is simple and efficient, keeping the number of trainable parameters small. We show the merits of the proposed approach compared to state-of-the-art techniques on various tasks and show the improved zero-shot domain transfer capability.



Article 971

Title@2025-06-05 (4): Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency

Title: Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency Grundlegende Grenzen der schnellen Tuning Transformer: Universalität, Kapazität und Effizienz 迅速提款变换者的根本限制:普遍性、能力和效益 2411.16525v2

Authors (6): Jerry Yao-Chieh Hu, Wei-Po Wang, Ammar Gilani, Chenyang Li, Zhao Song, Han Liu

We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions concern prompt tuning on single-head transformers with only a single self-attention layer: it (i) is universal, and (ii) supports efficient (even almost-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, we prove that prompt tuning on such simplest possible transformers yields universal approximators for sequence-to-sequence Lipschitz functions. In addition, we provide an exponential-in-$dL$ and -in-$(1/\epsilon)$ lower bound on the required soft-prompt tokens for prompt tuning to memorize any dataset with 1-layer, 1-head transformers. Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of the soft-prompt-induced keys and queries, and provide an upper bound criterion. Beyond this criterion, no sub-quadratic (efficient) algorithm for prompt tuning exists under SETH. Within this criterion, we showcase our theory by proving the existence of almost-linear time prompt tuning inference algorithms. These fundamental limits provide important necessary conditions for designing expressive and efficient prompt tuning methods for practitioners.



Article 972

Title@2025-06-05 (4): Deployability-Centric Infrastructure-as-Code Generation: An LLM-based Iterative Framework

Title: Deployability-Centric Infrastructure-as-Code Generation: An LLM-based Iterative Framework Deployability-Centric Infrastructure-as-Code Generation: Ein LLM-basiertes Iteratives Framework 以LLM为基础的迭代框架 2506.05623v1

Authors (5): Tianyi Zhang, Shidong Pan, Zejun Zhang, Zhenchang Xing, Xiaoyu Sun

Infrastructure-as-Code (IaC) generation holds significant promise for automating cloud infrastructure provisioning. Recent advances in Large Language Models (LLMs) present a promising opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions, but current evaluation focuses on syntactic correctness while ignoring deployability, the decisive measure of IaC template utility. We address this gap through two contributions: (1) IaCGen, an LLM-based deployability-centric framework that uses an iterative feedback mechanism to generate IaC templates, and (2) DPIaC-Eval, a deployability-centric IaC template benchmark consisting of 153 real-world scenarios that can evaluate syntax, deployment, user intent, and security. Our evaluation reveals that state-of-the-art LLMs initially perform poorly, with Claude-3.5 and Claude-3.7 achieving only 30.2% and 26.8% deployment success on the first attempt, respectively. However, IaCGen transforms this performance dramatically: all evaluated models reach over 90% passItr@25, with Claude-3.5 and Claude-3.7 achieving a 98% success rate. Despite these improvements, critical challenges remain in user intent alignment (25.2% accuracy) and security compliance (8.4% pass rate), highlighting areas requiring continued research. Our work provides the first comprehensive assessment of deployability-centric IaC template generation and establishes a foundation for future research.
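
The iterative feedback idea described in this abstract can be illustrated with a minimal Python sketch. It is not the IaCGen implementation: the `generate` and `validate` callables, the feedback string, and the stub behaviour are all hypothetical stand-ins for an LLM call and a deployment check, and the 25-iteration budget is borrowed only loosely from the reported passItr@25 figure.

```python
def generate_with_feedback(generate, validate, max_iterations=25):
    """Retry generation, passing validator feedback back in, until a template validates.

    `generate(feedback)` and `validate(template)` are hypothetical callables:
    the first stands in for an LLM call, the second for a syntax/deployment check
    that returns (ok, feedback_message).
    """
    feedback = None
    for attempt in range(1, max_iterations + 1):
        template = generate(feedback)
        ok, feedback = validate(template)
        if ok:
            return template, attempt
    return None, max_iterations


# Toy stand-ins (assumptions for illustration only).
def stub_generate(feedback):
    # The first draft is broken; any feedback "repairs" it.
    return "broken-template" if feedback is None else "fixed-template"


def stub_validate(template):
    return template == "fixed-template", "deployment failed: missing resource"


RESULT = generate_with_feedback(stub_generate, stub_validate)
```

With these stubs the loop succeeds on the second attempt, which mirrors the abstract's point that first-attempt success can be low while success within a bounded number of feedback rounds is much higher.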



Article 973

Title@2025-06-05 (4): Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning

Title: Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning Reasoning Towards Fairness: Bias in Sprachmodellen durch reasoning-guided Fine-Tuning abmildern 追求公平的理由:通过有理、有理、有指导的微调减少语言模式中的偏见 2504.05632v3

Authors (3): Sanchit Kabra, Akshita Jha, Chandan K. Reddy

Recent advances in large-scale generative language models have shown that reasoning capabilities can significantly improve model performance across a variety of tasks. However, the impact of reasoning on a model’s ability to mitigate stereotypical responses remains largely underexplored. In this work, we investigate the crucial relationship between a model’s reasoning ability and fairness, and ask whether improved reasoning capabilities can mitigate harmful stereotypical responses, especially those arising due to shallow or flawed reasoning. We conduct a comprehensive evaluation of multiple open-source LLMs, and find that larger models with stronger reasoning abilities exhibit substantially lower stereotypical bias on existing fairness benchmarks. Building on this insight, we introduce ReGiFT – Reasoning Guided Fine-Tuning, a novel approach that extracts structured reasoning traces from advanced reasoning models and infuses them into models that lack such capabilities. We use only general-purpose reasoning and do not require any fairness-specific supervision for bias mitigation. Notably, we see that models fine-tuned using ReGiFT not only improve fairness relative to their non-reasoning counterparts but also outperform advanced reasoning models on fairness benchmarks. We also analyze how variations in the correctness of the reasoning traces and their length influence model fairness and their overall performance. Our findings highlight that enhancing reasoning capabilities is an effective, fairness-agnostic strategy for mitigating stereotypical bias caused by reasoning flaws.



Article 974

Title@2025-06-05 (4): Adversarial Tokenization

Title: Adversarial Tokenization Adversariale Tokenisierung 对抗感应器化 2503.02174v2

Authors (3): Renato Lui Geh, Zilei Shao, Guy Van den Broeck

Current LLM pipelines account for only one possible tokenization for a given string, ignoring exponentially many alternative tokenizations during training and inference. For example, the standard Llama3 tokenization of penguin is [p,enguin], yet [peng,uin] is another perfectly valid alternative. In this paper, we show that despite LLMs being trained solely on one tokenization, they still retain semantic understanding of other tokenizations, raising questions about their implications in LLM safety. Put succinctly, we answer the following question: can we adversarially tokenize an obviously malicious string to evade safety and alignment restrictions? We show that not only is adversarial tokenization an effective yet previously neglected axis of attack, but it is also competitive against existing state-of-the-art adversarial approaches without changing the text of the harmful request. We empirically validate this exploit across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.
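
To make the "exponentially many alternative tokenizations" concrete, here is a small Python sketch that enumerates every way to segment a string into tokens from a vocabulary. The vocabulary below is a toy, hypothetical one chosen to reproduce the penguin example; it is not the real Llama3 vocabulary, and the sketch does not implement the paper's attack.

```python
from functools import lru_cache


def tokenizations(text, vocab):
    """Return every segmentation of `text` whose pieces all belong to `vocab`."""

    @lru_cache(maxsize=None)
    def split(start):
        # An empty remainder has exactly one segmentation: the empty one.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocab:
                for rest in split(end):
                    results.append([piece] + rest)
        return results

    return split(0)


# Toy vocabulary (an assumption for illustration, not a real tokenizer's).
TOY_VOCAB = frozenset({"p", "enguin", "peng", "uin", "pen", "guin"})
ALL_SPLITS = tokenizations("penguin", TOY_VOCAB)
```

Here `ALL_SPLITS` contains three segmentations, including the `[p, enguin]` and `[peng, uin]` pair mentioned in the abstract. With a realistic subword vocabulary the number of valid segmentations grows rapidly with string length.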



Article 975

Title@2025-06-05 (4): Mitigating Confounding in Speech-Based Dementia Detection through Weight Masking

Title: Mitigating Confounding in Speech-Based Dementia Detection through Weight Masking Mildernde Verwüstung bei sprachbasierter Dementia-Erkennung durch Gewicht-Maskierung 通过湿度蒙面,缓解基于声音的痴呆症检测中的困扰 2506.05610v1

Authors (6): Zhecheng Sheng, Xiruo Ding, Brian Hur, Changye Li, Trevor Cohen, Serguei Pakhomov

Deep transformer models have been used to detect linguistic anomalies in patient transcripts for early Alzheimer’s disease (AD) screening. While pre-trained neural language models (LMs) fine-tuned on AD transcripts perform well, little research has explored the effects of the gender of the speakers represented by these transcripts. This work addresses gender confounding in dementia detection and proposes two methods: the Extended Confounding Filter and the Dual Filter, which isolate and ablate weights associated with gender. We evaluate these methods on dementia datasets with first-person narratives from patients with cognitive impairment and healthy controls. Our results show transformer models tend to overfit to training data distributions. Disrupting gender-related weights results in a deconfounded dementia classifier, with the trade-off of slightly reduced dementia detection performance.



Article 976

Title@2025-06-05 (4): OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Title: OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation OPeRA: Ein Datensatz von Beobachtung, Persona, Ratationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation OPERA: 人类在线购物行为模拟观察、人、理由和评估LMLLMs的数据集 2506.05606v1

Authors (16): Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang

Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user’s next action and rationale given a persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.



Article 977

Title@2025-06-05 (4): Prompting is Not All You Need! Evaluating LLM Agent Simulation Methodologies with Real-World Online Customer Behavior Data

Title: Prompting is Not All You Need! Evaluating LLM Agent Simulation Methodologies with Real-World Online Customer Behavior Data Prompting ist nicht alles, was Sie brauchen! Bewertung LLM Agent Simulation Methoden mit Real-World Online Kunden Verhalten Daten 提示并非你所需要的全部! 使用真实世界在线客户行为数据评估LLM代理模拟方法 2503.20749v6

Authors (10): Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Zheshen Wang, Qi He, Dakuo Wang

Recent research shows that LLMs can simulate "believable" human behaviors to power LLM agents via prompt-only methods. In this work, we focus on evaluating LLMs’ objective "accuracy" rather than the subjective "believability" in simulating human behavior, leveraging a large-scale, real-world dataset collected from customers’ online shopping actions. We present the first comprehensive evaluation of state-of-the-art LLMs (e.g., DeepSeek-R1, Llama, and Claude) on the task of web shopping action generation. Our results show that out-of-the-box LLM-generated actions are often misaligned with actual human behavior, whereas fine-tuning LLMs on real-world behavioral data substantially improves their ability to generate accurate actions compared to prompt-only methods. Furthermore, incorporating synthesized reasoning into model training leads to additional performance gains, demonstrating the value of explicit rationale in behavior modeling. This work evaluates state-of-the-art LLMs in behavior simulation and provides actionable insights into how real-world action data can enhance the fidelity of LLM agents.



Article 978

Title@2025-06-05 (4): SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs

Title: SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs SynthesizeMe! Inducing Persona-Guided Prompts für personalisierte Belohnung Modelle in LLMs 合成Me! LLM 中个性奖得奖模型的引人指导提示 2506.05598v1

Authors (6): Michael J Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Held, Diyi Yang

Recent calls for pluralistic alignment of Large Language Models (LLMs) encourage adapting models to diverse user preferences. However, most prior work on personalized reward models relies heavily on additional identity information, such as demographic details or a predefined set of preference categories. To address this, we introduce SynthesizeMe, an approach to inducing synthetic user personas from user interactions for personalized reward modeling. SynthesizeMe first generates and verifies reasoning to explain user preferences, then induces synthetic user personas from that reasoning, and finally filters to informative prior user interactions in order to build personalized prompts for a particular user. We show that using SynthesizeMe-induced prompts improves personalized LLM-as-a-judge accuracy by 4.4% on Chatbot Arena. Combining SynthesizeMe-derived prompts with a reward model achieves top performance on PersonalRewardBench, a new curation of user-stratified interactions with chatbots collected from 854 users of Chatbot Arena and PRISM.



Article 979

Title@2025-06-05 (4): SoK: Are Watermarks in LLMs Ready for Deployment?

Title: SoK: Are Watermarks in LLMs Ready for Deployment? SoK: Sind Wasserzeichen in LLMs bereit für den Einsatz? SoK:LLMS中的水印准备部署了吗? 2506.05594v1

Authors (7): Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen, Ruoming Jin, Abdallah Khreishah, My Thai

Large Language Models (LLMs) have transformed natural language processing, demonstrating impressive capabilities across diverse tasks. However, deploying these models introduces critical risks related to intellectual property violations and potential misuse, particularly as adversaries can imitate these models to steal services or generate misleading outputs. We specifically focus on model stealing attacks, as they are highly relevant to proprietary LLMs and pose a serious threat to their security, revenue, and ethical deployment. While various watermarking techniques have emerged to mitigate these risks, it remains unclear how far the community and industry have progressed in developing and deploying watermarks in LLMs. To bridge this gap, we aim to develop a comprehensive systematization for watermarks in LLMs by 1) presenting a detailed taxonomy for watermarks in LLMs, 2) proposing a novel intellectual property classifier to explore the effectiveness and impacts of watermarks on LLMs under both attack and attack-free environments, 3) analyzing the limitations of existing watermarks in LLMs, and 4) discussing practical challenges and potential future directions for watermarks in LLMs. Through extensive experiments, we show that despite promising research outcomes and significant attention from leading companies and the community to deploying watermarks, these techniques have yet to reach their full potential in real-world applications due to their unfavorable impacts on the model utility of LLMs and on downstream tasks. Our findings provide an insightful understanding of watermarks in LLMs, highlighting the need for practical watermark solutions tailored to LLM deployment.



Article 980

Title@2025-06-05 (4): UTSA-NLP at ArchEHR-QA 2025: Improving EHR Question Answering via Self-Consistency Prompting

Title: UTSA-NLP at ArchEHR-QA 2025: Improving EHR Question Answering via Self-Consistency Prompting UTSA-NLP bei ArchEHR-QA 2025: Verbesserung der EHR-Fragebeantwortung durch Selbstbeherrschung ArchEHR-QA 2025:通过自我协调促进改进EHR问题的答案 2506.05589v1

Authors (5): Sara Shields-Menard, Zach Reimers, Joshua Gardner, David Perry, Anthony Rios

We describe our system for the ArchEHR-QA Shared Task on answering clinical questions using electronic health records (EHRs). Our approach uses large language models in two steps: first, to find sentences in the EHR relevant to a clinician’s question, and second, to generate a short, citation-supported response based on those sentences. We use few-shot prompting, self-consistency, and thresholding to improve the sentence classification step to decide which sentences are essential. We compare several models and find that a smaller 8B model performs better than a larger 70B model for identifying relevant information. Our results show that accurate sentence selection is critical for generating high-quality responses and that self-consistency with thresholding helps make these decisions more reliable.
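
As a rough illustration of self-consistency with thresholding for sentence selection, the Python sketch below aggregates several sampled yes/no judgments per sentence and keeps a sentence only if the share of "essential" votes meets a threshold. The vote lists, the sentence IDs, and the 0.6 threshold are hypothetical; the abstract does not specify the authors' exact settings, and in practice each vote would come from a separate sampled LLM call.

```python
def is_essential(votes, threshold=0.5):
    """Return True when the fraction of positive votes reaches `threshold`."""
    if not votes:
        raise ValueError("need at least one vote per sentence")
    share = sum(1 for vote in votes if vote) / len(votes)
    return share >= threshold


def select_sentences(votes_by_sentence, threshold=0.5):
    """Keep the IDs of sentences whose sampled votes pass the threshold."""
    return [
        sentence_id
        for sentence_id, votes in votes_by_sentence.items()
        if is_essential(votes, threshold)
    ]


# Hypothetical votes from three sampled runs per sentence.
SAMPLED_VOTES = {
    1: [True, True, False],
    2: [False, False, True],
    3: [True, True, True],
}
SELECTED = select_sentences(SAMPLED_VOTES, threshold=0.6)
```

Raising the threshold trades recall for precision: sentence 1 passes at 0.6 with two of three votes, but would be dropped at 0.7.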



Article 981

Title@2025-06-05 (4): MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

Title: MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark MMTU: Ein massiver Multi-Task-Tabellen-Verstands- und Orientierungs-Benchmark MMTU: 大规模多任务表格理解和说明理由基准 2506.05587v1

Authors (9): Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish

Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce and focus narrowly on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks, designed to comprehensively evaluate models’ ability to understand, reason about, and manipulate real tables at the expert level. These tasks are drawn from decades’ worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills – including table understanding, reasoning, and coding – that remain challenging for today’s frontier models, where even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.



Article 982

Title@2025-06-05 (4): Combating Misinformation in the Arab World: Challenges & Opportunities

Title: Combating Misinformation in the Arab World: Challenges & Opportunities Bekämpfung von Fehlinformationen in der arabischen Welt: Herausforderungen und Chancen 打击阿拉伯世界的错误信息:挑战和机遇 2506.05582v1

Authors (4): Azza Abouzied, Firoj Alam, Raian Ali, Paolo Papotti

Misinformation and disinformation pose significant risks globally, with the Arab region facing unique vulnerabilities due to geopolitical instabilities, linguistic diversity, and cultural nuances. We explore these challenges through the key facets of combating misinformation: detection, tracking, mitigation and community-engagement. We shed light on how connecting with grass-roots fact-checking organizations, understanding cultural norms, promoting social correction, and creating strong collaborative information networks can create opportunities for a more resilient information ecosystem in the Arab world.



Article 983

Title@2025-06-05 (4): TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking

Title: TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking TituLLMs: Eine Familie von Bangla LLMs mit umfassendem Benchmarking TitulLLMs:一个具有全面基准基准的孟加拉LMLMLM家庭 2502.11187v3

Authors (10): Shahriar Kabir Nahin, Rabindra Nath Nandi, Sagor Sarker, Quazi Sarwar Muhtaseem, Md Kowsher, Apu Chandraw Shill, Md Ibrahim, Mehadi Hasan Menon, Tareq Al Muntasir, Firoj Alam

In this paper, we present TituLLMs, the first large pretrained Bangla LLMs, available in 1B and 3B parameter sizes. Due to computational constraints during both training and inference, we focused on smaller models. To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens. We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge, which also enables faster training and inference. Benchmarking datasets for evaluating LLMs on Bangla were lacking. To address this gap, we developed five benchmarking datasets. We benchmarked various LLMs, including TituLLMs, and demonstrated that TituLLMs outperform their initial multilingual versions. However, this is not always the case, highlighting the complexities of language adaptation. Our work lays the groundwork for adapting existing multilingual open models to other low-resource languages. To facilitate broader adoption and further research, we have made the TituLLMs models and benchmarking datasets publicly available (https://huggingface.co/collections/hishab/titulm-llama-family-6718d31fc1b83529276f490a).



Article 984

Title@2025-06-05 (4): Natural Language Interaction with Databases on Edge Devices in the Internet of Battlefield Things

Title: Natural Language Interaction with Databases on Edge Devices in the Internet of Battlefield Things Natürliche Sprachinteraktion mit Datenbanken auf Edge Devices im Internet von Battlefield Things 自然语言与战场物体互联网边缘装置数据库的互动 2506.06396v1

Authors (4): Christopher D. Molek, Roberto Fronteddu, K. Brent Venable, Niranjan Suri

The expansion of the Internet of Things (IoT) in the battlefield, the Internet of Battlefield Things (IoBT), gives rise to new opportunities for enhancing situational awareness. To increase the potential of IoBT for situational awareness in critical decision making, the data from these devices must be processed into consumer-ready information objects and made available to consumers on demand. To address this challenge, we propose a workflow that makes use of natural language processing (NLP) to query a database technology and return a response in natural language. Our solution utilizes Large Language Models (LLMs) that are sized for edge devices to perform NLP, as well as graphical databases, which are well suited to the dynamic connected networks that are pervasive in the IoBT. Our architecture employs LLMs both to map questions in natural language to Cypher database queries and to summarize the database output back to the user in natural language. We evaluate several medium-sized LLMs for both of these tasks on a database representing publicly available data from the US Army’s Multipurpose Sensing Area (MSA) at the Jornada Range in Las Cruces, NM. We observe that Llama 3.1 (8 billion parameters) outperforms the other models across all the considered metrics. Most importantly, we note that, unlike current methods, our two-step approach allows the relaxation of the Exact Match (EM) requirement of the produced Cypher queries with ground truth code and, in this way, achieves a 19.4% increase in accuracy. Our workflow lays the groundwork for deploying LLMs on edge devices to enable natural language interactions with databases containing information objects for critical decision making.



Article 985

Title@2025-06-05 (4): The Impact of Inference Acceleration on Bias of LLMs

Title: The Impact of Inference Acceleration on Bias of LLMs Die Auswirkungen von Inferenzbeschleunigung auf LLM-Bias 推推加速对LLMM的偏见的影响 2410.22118v3

Authors (4): Elisabeth Kirsten, Ivan Habernal, Vedant Nanda, Muhammad Bilal Zafar

The last few years have seen unprecedented advances in the capabilities of Large Language Models (LLMs). These advancements promise to benefit a vast array of application domains. However, due to their immense size, performing inference with LLMs is both costly and slow. Consequently, a plethora of recent work has proposed strategies to enhance inference efficiency, e.g., quantization, pruning, and caching. These acceleration strategies reduce the inference cost and latency, often by several factors, while maintaining much of the predictive performance measured via common benchmarks. In this work, we explore another critical aspect of LLM performance: demographic bias in model generations due to inference acceleration optimizations. Using a wide range of metrics, we probe bias in model outputs from a number of angles. Analysis of outputs before and after inference acceleration shows significant change in bias. Worryingly, these bias effects are complex and unpredictable. A combination of an acceleration strategy and bias type may show little bias change in one model but may lead to a large effect in another. Our results highlight a need for in-depth and case-by-case evaluation of model bias after it has been modified to accelerate inference.



Article 986

Title@2025-06-05 (4): Improving LLMs with a knowledge from databases

Title: Improving LLMs with a knowledge from databases Verbesserung von LLMs mit Kenntnissen aus Datenbanken 利用数据库知识改进LLMs 2506.05560v1

Authors (1): Petr Máša

Large language models (LLMs) are now progressing rapidly. Many advanced techniques have been introduced and widely adopted, such as retrieval-augmented generation (RAG), agents, and tools. Tools can query a database to answer questions from structured data files or to perform groupings or other statistics. This unlocks huge opportunities, such as the ability to answer any question, but it also poses threats, such as safety risks, because there is no control over the commands that are created. We discuss whether we can create a new method that improves answers based on a dataset or database via interpretable ML methods, namely enhanced association rules. An advantage would be if the method could also be used in a safe technique such as RAG. Association rules have a sound history. Since the introduction of CN2 and Apriori, many enhancements have been made. In parallel, enhanced association rules have been introduced and have evolved over the last 40 years. The general problem is typically that there are too many rules. There are some techniques for handling this, but when LLMs emerged, this turned out to be the best use case for the RAG technique for LLMs. We propose a method that generates a ruleset based on defined knowledge patterns, then converts the rules into text form via a rule-to-text converter, and supplies the result to the LLM as RAG context. We compared this method with ChatGPT (even when using agents) and found a significant improvement in answering questions based on the dataset. We also tried several strategies for how many rules to generate. We found this improvement interesting. Moreover, it can be improved further in future work, for example by incorporating other patterns or using rule mining as an agent.



Article 987

Title@2025-06-05 (4): MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

Title: MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning MORSE-500: Ein programmierbarer Video-Benchmark zum Stress-Test Multimodale Reasoning MORSE-500:压力测试多模式理由方案可控制的视频基准 2506.05523v1

Authors (13): Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang

Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills – including abstract, physical, planning, spatial, and temporal capabilities – required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics – enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems – including Gemini 2.5 Pro and OpenAI o3, which represented the strongest systems available at the time, alongside strong open-source models – reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.



Article 988

Title@2025-06-05 (4): Investigating Non-Transitivity in LLM-as-a-Judge

Title: Investigating Non-Transitivity in LLM-as-a-Judge Untersuchung der Nichtübertragbarkeit in LLM-as-a-Richter 调查LLM-as-A法官的非过渡性 2502.14074v3

Authors (4): Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk

Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% -> 96.4% and 82.1% -> 86.3% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
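
For readers unfamiliar with Bradley-Terry models, the following Python sketch fits strengths to round-robin win counts with the standard minorization-maximization update. It is a generic illustration, not the paper's code: the three-model win matrix is invented, and it deliberately contains a non-transitive cycle (A beats B, B beats C, C beats A) to show that a global ranking can still be recovered.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a matrix where wins[i][j] counts i beating j."""
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent contributes (games played) / (sum of current strengths).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]  # Normalize so strengths sum to 1.
    return strengths


# Invented pairwise results for models A, B, C (10 comparisons per pair).
# A beats B 7-3, B beats C 7-3, yet C beats A 6-4: a non-transitive cycle.
WINS = [
    [0, 7, 4],
    [3, 0, 7],
    [6, 3, 0],
]
STRENGTHS = fit_bradley_terry(WINS)
```

Because every pair is compared, the fitted strengths rank A above B above C without depending on a single baseline, whereas a comparison against only one fixed baseline could order the models differently depending on which baseline is chosen.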


Article 989

Title@2025-06-05 (4): Multidimensional Analysis of Specific Language Impairment Using Unsupervised Learning Through PCA and Clustering

Title: Multidimensional Analysis of Specific Language Impairment Using Unsupervised Learning Through PCA and Clustering Multidimensionale Analyse der spezifischen Sprachbeeinträchtigung durch unbeaufsichtigtes Lernen durch PCA und Clustering 利用无监督的学习,通过常设仲裁院和集群,对特定语言障碍进行多层面分析 2506.05498v1

Authors (1): Niruthiha Selvanayagam

Specific Language Impairment (SLI) affects approximately 7 percent of children, presenting as isolated language deficits despite normal cognitive abilities, sensory systems, and supportive environments. Traditional diagnostic approaches often rely on standardized assessments, which may overlook subtle developmental patterns. This study aims to identify natural language development trajectories in children with and without SLI using unsupervised machine learning techniques, providing insights for early identification and targeted interventions. Narrative samples from 1,163 children aged 4-16 years across three corpora (Conti-Ramsden 4, ENNI, and Gillam) were analyzed using Principal Component Analysis (PCA) and clustering. A total of 64 linguistic features were evaluated to uncover developmental trajectories and distinguish linguistic profiles. Two primary clusters emerged: (1) high language production with low SLI prevalence, and (2) limited production but higher syntactic complexity with higher SLI prevalence. Additionally, boundary cases exhibited intermediate traits, supporting a continuum model of language abilities. Findings suggest SLI manifests primarily through reduced production capacity rather than syntactic complexity deficits. The results challenge categorical diagnostic frameworks and highlight the potential of unsupervised learning techniques for refining diagnostic criteria and intervention strategies.


Article 990

Title@2025-06-05 (4): Detect, Explain, Escalate: Low-Carbon Dialogue Breakdown Management for LLM-Powered Agents

Title: Detect, Explain, Escalate: Low-Carbon Dialogue Breakdown Management for LLM-Powered Agents Detect, Explain, Escalate: Low-Carbon Dialogue Breakdown Management für LLM-Powered Agents 检测、解释、埃斯卡拉拉特:LLM强力剂的低碳对话框分解管理 2504.18839v2

Authors (3): Abdellah Ghassel, Xianzhi Li, Xiaodan Zhu

While Large Language Models (LLMs) are transforming numerous applications, their susceptibility to conversational breakdowns remains a critical challenge undermining user trust. This paper introduces a “Detect, Explain, Escalate” framework to manage dialogue breakdowns in LLM-powered agents, emphasizing low-carbon operation. Our approach integrates two key strategies: (1) We fine-tune a compact 8B-parameter model, augmented with teacher-generated reasoning traces, which serves as an efficient real-time breakdown ‘detector’ and ‘explainer’. This model demonstrates robust classification and calibration on English and Japanese dialogues, and generalizes well to the BETOLD dataset, improving accuracy by 7% over its baseline. (2) We systematically evaluate frontier LLMs using advanced prompting (few-shot, chain-of-thought, analogical reasoning) for high-fidelity breakdown assessment. These are integrated into an ‘escalation’ architecture where our efficient detector defers to larger models only when necessary, substantially reducing operational costs and energy consumption. Our fine-tuned model and prompting strategies establish new state-of-the-art results on dialogue breakdown detection benchmarks, outperforming specialized classifiers and significantly narrowing the performance gap to larger proprietary models. The proposed monitor-escalate pipeline reduces inference costs by 54%, offering a scalable, efficient, and more interpretable solution for robust conversational AI in high-impact domains. Code and models will be publicly released.
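
The escalation idea, in which a compact detector handles most dialogues and defers to a larger model only when needed, can be sketched as below. The abstract does not state the deferral criterion, so the confidence threshold, the function name `detect_with_escalation`, and the two stub models are assumptions made purely for illustration.

```python
from typing import Callable, Tuple


def detect_with_escalation(
    dialogue: str,
    small_detector: Callable[[str], Tuple[str, float]],
    large_judge: Callable[[str], str],
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> Tuple[str, str]:
    """Return (label, route), where route records which model produced the label."""
    label, confidence = small_detector(dialogue)
    if confidence >= threshold:
        # The compact detector is confident enough: no expensive call is made.
        return label, "small"
    # Otherwise defer to the larger (costlier) model.
    return large_judge(dialogue), "large"


# Stub models standing in for a fine-tuned detector and a frontier LLM.
def stub_small(dialogue: str) -> Tuple[str, float]:
    return ("breakdown", 0.95) if "???" in dialogue else ("no_breakdown", 0.55)


def stub_large(dialogue: str) -> str:
    return "no_breakdown"


print(detect_with_escalation("User: hello ???", stub_small, stub_large))
print(detect_with_escalation("User: hello", stub_small, stub_large))
```

Under this sketch, cost savings come from the fraction of dialogues that never reach the larger model.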


Article 991

Title@2025-06-05 (4): Inference-Time Hyper-Scaling with KV Cache Compression

Title: Inference-Time Hyper-Scaling with KV Cache Compression Inferenz-Zeit-Hyper-Skalierung mit KV-Cache-Kompression 与 KV 缓存压缩一起的推断时间超大缩放 2506.05345v1

Authors (4): Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti

Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8$\times$ compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.


Article 992

Title@2025-06-05 (4): Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Title: Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets Warum LLM Sicherheits-Guardrails nach Feinabstimmung zusammenbrechen: Eine Ähnlichkeitsanalyse zwischen Alignment- und Feinabstimmungs-Datensätzen 为何LLM 安全护卫车在微调后倒塌:对准和微调数据集之间的相似性分析 2506.05346v1

Authors (7): Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang

Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models and thus reduces harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in the building of durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.


Article 993

Title@2025-06-05 (4): Search Arena: Analyzing Search-Augmented LLMs

Title: Search Arena: Analyzing Search-Augmented LLMs Sucharena: Analyse von Such-Augmentierten LLMs 搜索竞技场: 分析搜索推荐LMS 2506.05334v1

Authors (11): Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model’s parametric knowledge. We open-sourced the dataset to support future research in this direction. Our dataset and code are available at: https://github.com/lmarena/search-arena.


Article 994

Title@2025-06-05 (4): AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Title: AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning AceReason-Nemotron: Mathematische und Code-Reasonierung durch Stärkungslernen AceReson-Nepron:通过强化学习推进数学和守则 2505.16400v3

Authors (8): Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models, such as DeepSeek-R1, including data curation strategies and RL training recipes, are often omitted. Moreover, recent research indicates distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL significantly enhances the performance of strong distilled models not only on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also on code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases to enable verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model’s reasoning ability, enabling it to solve problems that were previously unsolvable.


Article 995

Title@2025-06-05 (4): Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Title: Unleashing Hour-Scale Video Training for Long Video-Language Understanding Unleashing Stunden-Scale Video-Training für lange Video-Sprache verstehen 为了解长视视频语言而开放的时空视频培训 2506.05332v1

Authors (11): Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates user question-relevant and spatiotemporal-informative semantics from a cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.


Article 996

Title@2025-06-05 (4): The broader spectrum of in-context learning

Title: The broader spectrum of in-context learning Das breitere Spektrum des In-Context-Lernens 广义的内通学习 2412.03782v3

Authors (4): Andrew Kyle Lampinen, Stephanie C. Y. Chan, Aaditya K. Singh, Murray Shanahan

The ability of language models to learn a task from a few examples in context has generated substantial interest. Here, we provide a perspective that situates this type of supervised few-shot learning within a much broader spectrum of meta-learned in-context learning. Indeed, we suggest that any distribution of sequences in which context non-trivially decreases loss on subsequent predictions can be interpreted as eliciting a kind of in-context learning. We suggest that this perspective helps to unify the broad set of in-context abilities that language models exhibit – such as adapting to tasks from instructions or role play, or extrapolating time series. This perspective also sheds light on potential roots of in-context learning in lower-level processing of linguistic dependencies (e.g. coreference or parallel structures). Finally, taking this perspective highlights the importance of generalization, which we suggest can be studied along several dimensions: not only the ability to learn something novel, but also flexibility in learning from different presentations, and in applying what is learned. We discuss broader connections to past literature in meta-learning and goal-conditioned agents, and other perspectives on learning and adaptation. We close by suggesting that research on in-context learning should consider this broader spectrum of in-context capabilities and types of generalization.


Article 997

Title@2025-06-05 (4): MLLM-CL: Continual Learning for Multimodal Large Language Models

Title: MLLM-CL: Continual Learning for Multimodal Large Language Models MLLM-CL: Kontinuierliches Lernen für multimodale große Sprachmodelle MLLM-CL:多模式大语言模式持续学习 2506.05453v1

Authors (5): Hongbo Zhao, Fei Zhu, Rundong Wang, Gaofeng Meng, Zhaoxiang Zhang

Recent Multimodal Large Language Models (MLLMs) excel in vision-language understanding but face challenges in adapting to dynamic real-world scenarios that require continuous integration of new knowledge and skills. While continual learning (CL) offers a potential solution, existing benchmarks and methods suffer from critical limitations. In this paper, we introduce MLLM-CL, a novel benchmark encompassing domain and ability continual learning, where the former focuses on independently and identically distributed (IID) evaluation across evolving mainstream domains, whereas the latter evaluates on non-IID scenarios with emerging model ability. Methodologically, we propose preventing catastrophic interference through parameter isolation, along with an MLLM-based routing mechanism. Extensive experiments demonstrate that our approach can integrate domain-specific knowledge and functional abilities with minimal forgetting, significantly outperforming existing methods.


Article 998

Title@2025-06-05 (4): Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety

Title: Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety Interpretation trifft auf Sicherheit: Eine Umfrage zu Interpretationsmethoden und Tools zur Verbesserung der LLM-Sicherheit 口译满足安全需要:关于改进LLM安全的解释方法和工具的调查 2506.05451v1

Authors (6): Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau

As large language models (LLMs) see wider real-world use, understanding and mitigating their unsafe behaviors is critical. Interpretation techniques can reveal causes of unsafe outputs and guide safety, but such connections with safety are often overlooked in prior surveys. We present the first survey that bridges this gap, introducing a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them. Our novel taxonomy, organized by LLM workflow stages, summarizes nearly 70 works at their intersections. We conclude with open challenges and future directions. This timely survey helps researchers and practitioners navigate key advancements for safer, more interpretable LLMs.


Article 999

Title@2025-06-05 (4): Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay

Title: Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay Verbesserung der Dateneffizienz für LLM-Verstärkung Feinabstimmung durch problemorientierte Online-Datenauswahl und Rollout-Replay 提高LLLM数据效率,通过困难目标在线数据选择和推出重播提高LLM强化微调的数据效率 2506.05316v1

Authors (7): Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, Huan Zhang

Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism that reuses recent rollouts, lowering per-step computation while maintaining stable updates. Extensive experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 25% to 65% to reach the same level of performance as the original GRPO algorithm.
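
The two selection steps described above can be sketched roughly as follows: estimate the difficulty of an unseen question from its similarity to a small reference set with measured difficulty, then prioritize questions of moderate difficulty. This is an illustrative sketch only. The softmax-weighted average stands in for the paper's attention-based framework, and the function names, the 0.5 target, and the toy embeddings are assumptions.

```python
import math


def estimate_difficulty(query_vec, ref_vecs, ref_difficulty, temperature=1.0):
    """Softmax-weighted average of reference difficulties.

    Weights come from dot-product similarity between the query embedding
    and each reference embedding (an attention-style readout).
    """
    scores = [
        sum(q * r for q, r in zip(query_vec, ref)) / temperature
        for ref in ref_vecs
    ]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, ref_difficulty)) / total


def select_moderate(difficulties, k, target=0.5):
    """Return indices of the k questions whose difficulty is closest to target."""
    order = sorted(
        range(len(difficulties)),
        key=lambda i: abs(difficulties[i] - target),
    )
    return order[:k]


# Toy reference set: two embeddings with measured failure rates 0.2 and 0.8.
refs = [[1.0, 0.0], [0.0, 1.0]]
ref_difficulty = [0.2, 0.8]
print(estimate_difficulty([0.5, 0.5], refs, ref_difficulty))
print(select_moderate([0.05, 0.5, 0.95, 0.6], k=2))
```

Questions the policy almost always solves or almost always fails tend to give little learning signal, which is why the sketch ranks by distance from a mid-range target.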


Article 1000

Title@2025-06-05 (4): Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models

Title: Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models Engropisches Lernen eingeschränkt: Ein primäres Rahmenwerk für große Sprachmodelle 未学习:大语言模式的原始-双重框架 2506.05314v1

Authors (5): Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, Mahyar Fazlyab

Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.
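
To make the primal-dual mechanics concrete, here is a toy scalar example, not the paper's algorithm or loss. It minimizes an objective (playing the role of the forgetting loss) subject to a hard inequality constraint (playing the role of the retention constraint) by alternating gradient descent on the primal variable and projected gradient ascent on the dual variable. The objective, constraint, and step sizes are all assumptions chosen so the answer can be checked by hand.

```python
def primal_dual(steps=2000, lr_x=0.1, lr_lam=0.1):
    """Solve  min (x - 3)^2  subject to  x - 1 <= 0  with a primal-dual loop.

    Lagrangian: L(x, lam) = (x - 3)^2 + lam * (x - 1), with lam >= 0.
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: gradient descent on the Lagrangian with respect to x.
        grad_x = 2.0 * (x - 3.0) + lam
        x -= lr_x * grad_x
        # Dual step: gradient ascent on lam, projected onto lam >= 0.
        lam = max(0.0, lam + lr_lam * (x - 1.0))
    return x, lam


x_opt, lam_opt = primal_dual()
print(x_opt, lam_opt)
```

The KKT conditions for this toy problem give x = 1 and a dual variable of 4, so the dual variable's final value indicates how strongly the constraint is pulling against the objective, which mirrors how the abstract describes the dual variable exposing the forgetting-retention trade-off.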


Article 1001

Title@2025-06-05 (4): Is LLM the Silver Bullet to Low-Resource Languages Machine Translation?

Title: Is LLM the Silver Bullet to Low-Resource Languages Machine Translation? Ist LLM die Silver Bullet zu Low-Resource Sprachen Maschinelle Übersetzung? LLM 银弹到低资源语言机器翻译吗? 2503.24102v2

Authors (9): Yewei Song, Lujun Li, Cedric Lothritz, Saad Ezzini, Lama Sleem, Niccolo Gentile, Radu State, Tegawendé F. Bissyandé, Jacques Klein

Low-Resource Languages (LRLs) present significant challenges in natural language processing due to their limited linguistic resources and underrepresentation in standard datasets. While recent advances in Large Language Models (LLMs) and Neural Machine Translation have substantially improved translation capabilities for high-resource languages, performance disparities persist for LRLs, particularly impacting privacy-sensitive and resource-constrained scenarios. This paper systematically evaluates current LLMs in 200 languages using the FLORES-200 benchmark and demonstrates their limitations in LRL translation capability. We also explore alternative data sources, including news articles and bilingual dictionaries, and demonstrate how knowledge distillation from large pre-trained teacher models can significantly improve the performance of small LLMs on LRL translation tasks. For example, this approach increases the LLM-as-a-Judge score for EN->LB translation on the validation set from 0.36 to 0.89 for Llama-3.2-3B. Furthermore, we examine different fine-tuning configurations, providing practical insights on optimal data scale, training efficiency, and the preservation of generalization capabilities of the models under study.


Article 1002

Title@2025-06-05 (4): Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games

Title: Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games Time to Talk: LLM-Agenten für asynchrone Gruppenkommunikation in Mafia-Spielen 讨论时间:黑手党运动会Asynconomic Group通讯的LLM代理商 2506.05309v1

Authors (3): Niv Eckhaus, Uri Berger, Gabriel Stanovsky

LLMs are used predominantly in synchronous communication, where a human user and a model communicate in alternating turns. In contrast, many real-world settings are inherently asynchronous. For example, in group chats, online team meetings, or social games, there is no inherent notion of turns; therefore, the decision of when to speak forms a crucial part of the participant’s decision making. In this work, we develop an adaptive asynchronous LLM agent that, in addition to determining what to say, also decides when to say it. To evaluate our agent, we collect a unique dataset of online Mafia games that include both human participants and our asynchronous agent. Overall, our agent performs on par with human players, both in game performance and in its ability to blend in with the other human players. Our analysis shows that the agent’s behavior in deciding when to speak closely mirrors human patterns, although differences emerge in message content. We release all our data and code to support and encourage further research for more realistic asynchronous communication between LLM agents. This work paves the way for integration of LLMs into realistic human group settings, from assistance in team discussions to educational and professional environments where complex social dynamics must be navigated.


Article 1003

Title@2025-06-05 (4): ProRefine: Inference-time Prompt Refinement with Textual Feedback

Title: ProRefine: Inference-time Prompt Refinement with Textual Feedback ProRefine: Inferenz-Zeit Prompt Verfeinerung mit Text-Feedback ProRefine: 用文字反馈迅速改进推论-时间 2506.05305v1

Authors (5): Deepak Pandita, Tharindu Cyril Weerasooriya, Ankit Parag Shah, Christopher M. Homan, Wei Wei

Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, are becoming increasingly prevalent. However, these workflows often suffer from error propagation and sub-optimal performance, largely due to poorly designed prompts that fail to effectively guide individual agents. This is a critical problem because it limits the reliability and scalability of these powerful systems. We introduce ProRefine, an innovative inference-time prompt optimization method that leverages textual feedback from large language models (LLMs) to address this challenge. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to match the performance of larger ones, highlighting its potential for efficient and scalable AI deployment, and democratizing access to high-performing AI.


Article 1004

Title@2025-06-05 (4): ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL

Title: ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL ReasonGen-R1: CoT für Autoregressive Imagegenerierungsmodelle durch SFT und RL 理由Gen-R1:通过SFT和RL自动递减图像生成模型的 CoT 2505.24875v2

Authors (10): Yu Zhang, Yunqi Li, Yifan Yang, Rui Wang, Yuqing Yang, Dai Qi, Jianmin Bao, Dongdong Chen, Chong Luo, Lili Qiu

Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based “thinking” skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization (GRPO). To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.


Article 1005

Title: GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering GRAF: Graph Retrieval Augmented by Facts for Rumänian Legal Multi-Choice Question Answering GRAF: 罗马尼亚多种选择法律问题解答事实加增图 2412.04119v3

Authors (4): Cristian-George Crăciun, Răzvan-Alexandru Smădu, Dumitru-Clementin Cercel, Mihaela-Claudia Cercel

Pre-trained Language Models (PLMs) have shown remarkable performance in recent years, setting a new paradigm for NLP research and industry. The legal domain has received some attention from the NLP community, partly due to its textual nature. Some tasks in this domain take the form of question answering (QA). This work explores legal-domain Multiple-Choice QA (MCQA) for a low-resource language. The contribution of this work is multi-fold. We first introduce JuRO, the first openly available Romanian legal MCQA dataset, comprising three different examinations and a total of 10,836 questions. Along with this dataset, we introduce CROL, an organized corpus of laws comprising 93 distinct documents with their modifications across 763 time spans, which we leverage in this work for Information Retrieval (IR) techniques. Moreover, we are the first to propose Law-RoG, a Knowledge Graph (KG) for the Romanian language, derived from the aforementioned corpus. Lastly, we propose a novel approach for MCQA, Graph Retrieval Augmented by Facts (GRAF), which achieves competitive results with generally accepted SOTA methods and even exceeds them in most settings.


Article 1006

Title@2025-06-05 (4): Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning

Title: Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning Micro-Act: Wissenskonflikt bei der Fragebeantwortung durch handlungsfähige Selbstbesinnung abmildern 微行动:通过可采取行动的自觉反应解决问题时减少知识冲突 2506.05278v1

Authors (8): Nan Huo, Jinyang Li, Bowen Qin, Ge Qu, Xiaolong Li, Xiaodong Li, Chenhao Ma, Reynold Cheng

Retrieval-Augmented Generation (RAG) systems commonly suffer from Knowledge Conflicts, where retrieved external knowledge contradicts the inherent, parametric knowledge of large language models (LLMs). This adversely affects performance on downstream tasks such as question answering (QA). Existing approaches often attempt to mitigate conflicts by directly comparing two knowledge sources in a side-by-side manner, but this can overwhelm LLMs with extraneous or lengthy contexts, ultimately hindering their ability to identify and mitigate inconsistencies. To address this issue, we propose Micro-Act, a framework with a hierarchical action space that automatically perceives context complexity and adaptively decomposes each knowledge source into a sequence of fine-grained comparisons. These comparisons are represented as actionable steps, enabling reasoning beyond the superficial context. In extensive experiments on five benchmark datasets, Micro-Act consistently achieves a significant increase in QA accuracy over state-of-the-art baselines across all five datasets and three conflict types, especially in temporal and semantic types where all baselines fail significantly. More importantly, Micro-Act exhibits robust performance on non-conflict questions simultaneously, highlighting its practical value in real-world RAG applications.


Article 1007

Title@2025-06-05 (4): DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts

Title: DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts DEFAME: Dynamic Evidence-based FAct-Checking mit multimodalen Experten DFAME: 与多式联运专家进行动态证据法检查 2412.10510v3

Authors (4): Tobias Braun, Mark Rothermel, Marcus Rohrbach, Anna Rohrbach

The proliferation of disinformation demands reliable and scalable fact-checking solutions. We present Dynamic Evidence-based FAct-checking with Multimodal Experts (DEFAME), a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification. DEFAME operates in a six-stage process, dynamically selecting the tools and search depth to extract and evaluate textual and visual evidence. Unlike prior approaches that are text-only, lack explainability, or rely solely on parametric knowledge, DEFAME performs end-to-end verification, accounting for images in claims and evidence while generating structured, multimodal reports. Evaluation on the popular benchmarks VERITE, AVerITeC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new state-of-the-art fact-checking system for uni- and multimodal fact-checking. Moreover, we introduce a new multimodal benchmark, ClaimReview2024+, featuring claims after the knowledge cutoff of GPT-4o, avoiding data leakage. Here, DEFAME drastically outperforms the GPT-4o baselines, showing temporal generalizability and the potential for real-time fact-checking.


Article 1008

Title@2025-06-05 (4): CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection

Title: CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection CLATT: Umfassende Verschweigung Grund für Halluzination Detection CLATTER: 用于探测幻觉的全面成像理由 2506.05243v1

Authors (7): Ron Eliav, Arie Cattan, Eran Hirsch, Shahaf Bassan, Elias Stengel-Eskin, Mohit Bansal, Ido Dagan

A common approach to hallucination detection casts it as a natural language inference (NLI) task, often using LLMs to classify whether the generated text is entailed by corresponding reference texts. Since entailment classification is a complex reasoning task, one would expect that LLMs could benefit from generating an explicit reasoning process, as in CoT reasoning or the explicit "thinking" of recent reasoning models. In this work, we propose that guiding such models to perform a systematic and comprehensive reasoning process – one that both decomposes the text into smaller facts and also finds evidence in the source for each fact – allows models to execute much finer-grained and more accurate entailment decisions, leading to increased performance. To that end, we define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection. Following this reasoning framework, we introduce an analysis scheme, consisting of several metrics that measure the quality of the intermediate reasoning steps, which provides additional empirical evidence for the improved quality of our guided reasoning scheme.
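
The three steps named in the abstract can be illustrated with a toy pipeline. This is only a sketch under loud assumptions: the function names are hypothetical, the decomposition and entailment rules are simple string heuristics standing in for LLM calls, and the "all sub-claims must be entailed" aggregation rule is an assumption rather than something the abstract specifies.

```python
from typing import Dict, List, Tuple


def decompose(claim: str) -> List[str]:
    """Step (i): split a claim into sub-claims (toy rule: split on ' and ')."""
    parts = [part.strip().rstrip(".") for part in claim.split(" and ")]
    return [part for part in parts if part]


def attribute_and_classify(sub_claim: str, source_sentences: List[str]) -> Dict:
    """Step (ii): look for evidence in the source and label the sub-claim.

    Toy rule (assumption): a sub-claim is entailed if all of its words
    occur in a single source sentence.
    """
    words = set(sub_claim.lower().split())
    for sentence in source_sentences:
        if words <= set(sentence.lower().rstrip(".").split()):
            return {"sub_claim": sub_claim, "evidence": sentence, "label": "entailed"}
    return {"sub_claim": sub_claim, "evidence": None, "label": "not_entailed"}


def aggregate(results: List[Dict]) -> bool:
    """Step (iii): the claim is entailed only if every sub-claim is (assumed rule)."""
    return all(r["label"] == "entailed" for r in results)


def detect_hallucination(claim: str, source_sentences: List[str]) -> Tuple[bool, List[Dict]]:
    """Return (is_hallucinated, per-sub-claim results)."""
    results = [attribute_and_classify(s, source_sentences) for s in decompose(claim)]
    return (not aggregate(results)), results


source = ["The cat sat on the mat.", "The dog barked loudly."]
flag_ok, _ = detect_hallucination("The cat sat on the mat and the dog barked loudly.", source)
flag_bad, details = detect_hallucination("The cat sat on the mat and the dog slept.", source)
```

Keeping the per-sub-claim results alongside the final flag mirrors the abstract's emphasis on inspecting intermediate reasoning steps, since each sub-claim carries its own evidence and label.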


Article 1009

Title@2025-06-05 (4): Towards a Unified System of Representation for Continuity and Discontinuity in Natural Language

Title: Towards a Unified System of Representation for Continuity and Discontinuity in Natural Language Hin zu einem einheitlichen System der Repräsentation für Kontinuität und Diskontinuität in der Natursprache 迈向一个统一的自然语言连续和中断代表制 2506.05235v1

Authors (2): Ratna Kandala, Prakash Mondal

Syntactic discontinuity is a grammatical phenomenon in which a constituent is split into more than one part because of the insertion of an element which is not part of the constituent. This is observed in many languages across the world such as Turkish, Russian, Japanese, Warlpiri, Navajo, Hopi, Dyirbal, Yidiny etc. Different formalisms/frameworks in current linguistic theory approach the problem of discontinuous structures in different ways. Each framework/formalism has widely been viewed as an independent and non-converging system of analysis. In this paper, we propose a unified system of representation for both continuity and discontinuity in structures of natural languages by taking into account three formalisms, in particular, Phrase Structure Grammar (PSG) for its widely used notion of constituency, Dependency Grammar (DG) for its head-dependent relations, and Categorial Grammar (CG) for its focus on functor-argument relations. We attempt to show that discontinuous expressions as well as continuous structures can be analysed through a unified mathematical derivation incorporating the representations of linguistic structure in these three grammar formalisms.


Article 1010

Title@2025-06-05 (4): From Rogue to Safe AI: The Role of Explicit Refusals in Aligning LLMs with International Humanitarian Law

Title: From Rogue to Safe AI: The Role of Explicit Refusals in Aligning LLMs with International Humanitarian Law Von der Rogue zur sicheren KI: Die Rolle der expliziten Ablehnungen bei der Ausrichtung von LLMs mit dem humanitären Völkerrecht 从流氓到安全AI:明确拒绝者在使LLMs符合国际人道主义法方面的作用 2506.06391v1

Authors (3): John Mavi, Diana Teodora Găitan, Sergio Coronado

Large Language Models (LLMs) are widely used across sectors, yet their alignment with International Humanitarian Law (IHL) is not well understood. This study evaluates eight leading LLMs on their ability to refuse prompts that explicitly violate these legal frameworks, focusing also on helpfulness, that is, how clearly and constructively refusals are communicated. While most models rejected unlawful requests, the clarity and consistency of their responses varied. By revealing the model’s rationale and referencing relevant legal or safety principles, explanatory refusals clarify the system’s boundaries, reduce ambiguity, and help prevent misuse. A standardised system-level safety prompt significantly improved the quality of the explanations expressed within refusals in most models, highlighting the effectiveness of lightweight interventions. However, more complex prompts involving technical language or requests for code revealed ongoing vulnerabilities. These findings contribute to the development of safer, more transparent AI systems and propose a benchmark to evaluate the compliance of LLMs with IHL.


Article 1011

Title@2025-06-05 (4): MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Title: MesaNet: Sequence Modeling by Locally Optimal Test-Time Training MesaNet: Sequenzmodellierung durch lokal optimale Test-Time-Schulung MesaNet:通过当地最佳试验时间培训进行序列建模 2506.05233v1

Authors (16): Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, João Sacramento

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance – here by spending compute to solve sequential optimization problems within the neural network itself.
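
To make "minimizing an in-context loss to optimality with a conjugate gradient solver" concrete, here is a minimal NumPy sketch. It assumes a ridge-regression form of the in-context objective (the regularizer `lam`, the function names, and the dense, non-chunked computation are all assumptions for illustration); it is not the paper's numerically stable, chunkwise-parallel implementation.

```python
import numpy as np


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = np.zeros(b.shape, dtype=float)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs_old = r @ r
    for _ in range(max_iter or 10 * len(b)):
        if np.sqrt(rs_old) < tol:
            break
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


def in_context_ridge_readout(keys, values, query, lam=1.0):
    """Return W* @ query, where W* minimizes
    sum_i ||v_i - W k_i||^2 + lam * ||W||_F^2 over the context seen so far.

    Closed form: W* = (sum_i v_i k_i^T) (sum_i k_i k_i^T + lam I)^{-1},
    so only one linear solve against the query is needed.
    """
    H = keys.T @ keys + lam * np.eye(keys.shape[1])  # (d_k, d_k), SPD
    G = values.T @ keys                              # (d_v, d_k)
    return G @ conjugate_gradient(H, query)


rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 4))
values = rng.normal(size=(16, 3))
query = rng.normal(size=4)
output = in_context_ridge_readout(keys, values, query)
```

Solving against the query directly avoids forming the inverse matrix, which is one reason an iterative solver is a natural fit; the extra solver iterations correspond to the additional inference-time FLOPs the abstract mentions.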


Article 1012

Title@2025-06-05 (4): Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Title: Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts Diagonales Batching löst Parallelismus in recurrenten Speichertransformatoren für lange Kontexte auf 对角对角拔 解锁长期常用内存变换器中的平行主义 2506.05229v1

Authors (5): Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets

Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models can adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.
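
The scheduling idea can be sketched as a wavefront over a grid of (segment, layer) cells. The sketch below assumes that cell (s, l) depends only on cell (s-1, l) (the memory carried over from the previous segment at the same layer) and cell (s, l-1) (the layer below for the same segment); this dependency pattern and the function name are assumptions for illustration, not details confirmed by the abstract.

```python
from typing import List, Tuple


def diagonal_schedule(num_segments: int, num_layers: int) -> List[List[Tuple[int, int]]]:
    """Group (segment, layer) cells by anti-diagonal d = segment + layer.

    Under the assumed dependencies, every cell on diagonal d only needs
    results from diagonal d - 1, so all cells in one group can be
    batched into a single parallel step.
    """
    steps = []
    for d in range(num_segments + num_layers - 1):
        group = [(s, d - s) for s in range(num_segments) if 0 <= d - s < num_layers]
        steps.append(group)
    return steps


schedule = diagonal_schedule(num_segments=3, num_layers=2)
for step, cells in enumerate(schedule):
    print(step, cells)
```

With S segments and L layers, a strictly sequential execution takes S * L steps, while the grouped schedule takes S + L - 1 steps, and the order of dependent computations is unchanged, which is consistent with the abstract's point that the method is a pure reordering.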


Article 1013

Title@2025-06-05 (4): Improving Low-Resource Morphological Inflection via Self-Supervised Objectives

Title: Improving Low-Resource Morphological Inflection via Self-Supervised Objectives Verbesserung der ressourcenarmen morphologischen Beugung durch selbstüberwachte Ziele 通过自我监督目标改进低资源资源 道德影响 2506.05227v1

Authors (2): Adam Wiemerslage, Katharina von der Wense

Self-supervised objectives have driven major advances in NLP by leveraging large-scale unlabeled data, but such resources are scarce for many of the world’s languages. Surprisingly, they have not been explored much for character-level tasks, where smaller amounts of data have the potential to be beneficial. We investigate the effectiveness of self-supervised auxiliary tasks for morphological inflection – a character-level task highly relevant for language documentation – in extremely low-resource settings, training encoder-decoder transformers for 19 languages and 13 auxiliary objectives. Autoencoding yields the best performance when unlabeled data is very limited, while character masked language modeling (CMLM) becomes more effective as data availability increases. Though objectives with stronger inductive biases influence model predictions intuitively, they rarely outperform standard CMLM. However, sampling masks based on known morpheme boundaries consistently improves performance, highlighting a promising direction for low-resource morphological modeling.


Article 1014

Title@2025-06-05 (4): UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Title: UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation UniWorld-V1: Semantische Encoder mit hoher Auflösung für einheitliches visuelles Verständnis und Generation Uni-World-V1:用于统一视觉理解和生成的高分辨率语义编码器 2506.03147v3

Authors (12): Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan

Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation – capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld-V1, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training data, UniWorld-V1 achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld-V1 framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.


Article 1015

Title@2025-06-05 (4): The Lessons of Developing Process Reward Models in Mathematical Reasoning

Title: The Lessons of Developing Process Reward Models in Mathematical Reasoning Die Lehren aus der Entwicklung von Prozess-Reward-Modellen in mathematischer Reasoning 数学理由中发展进程奖励模型的经验教训 2501.07301v2

Authors (9): Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin

Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge, and we advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.
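
One plausible reading of "consensus filtering" is to keep a training sample only when MC estimation and the LLM judge agree on where the first error occurs. The sketch below encodes that reading; the agreement criterion, the threshold, the field names, and the helper functions are all assumptions for illustration and are not taken from the abstract.

```python
from typing import Dict, List, Optional


def first_error_step(step_labels: List[bool]) -> Optional[int]:
    """Index of the first step labeled incorrect, or None if all steps pass."""
    for index, is_correct in enumerate(step_labels):
        if not is_correct:
            return index
    return None


def mc_step_labels(success_rates: List[float], threshold: float = 0.0) -> List[bool]:
    """MC estimation (assumed form): a step counts as correct if rollouts
    continued from it reach the right final answer at a rate above threshold."""
    return [rate > threshold for rate in success_rates]


def consensus_filter(samples: List[Dict]) -> List[Dict]:
    """Keep samples where both annotators locate the same first error step."""
    kept = []
    for sample in samples:
        mc_error = first_error_step(mc_step_labels(sample["mc_success_rates"]))
        judge_error = first_error_step(sample["judge_labels"])
        if mc_error == judge_error:
            kept.append(sample)
    return kept


samples = [
    {"id": "a", "mc_success_rates": [0.9, 0.5, 0.0], "judge_labels": [True, True, False]},
    {"id": "b", "mc_success_rates": [0.9, 0.4, 0.2], "judge_labels": [True, False, False]},
    {"id": "c", "mc_success_rates": [1.0, 1.0], "judge_labels": [True, True]},
]
kept_ids = [s["id"] for s in consensus_filter(samples)]
```

Sample "b" illustrates the failure mode the abstract describes: rollouts can still reach a correct answer after a flawed step, so MC estimation marks the step as fine while the judge does not, and the disagreement removes the sample.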


Article 1016

Title@2025-06-05 (4): Mitigating Degree Bias Adaptively with Hard-to-Learn Nodes in Graph Contrastive Learning

Title: Mitigating Degree Bias Adaptively with Hard-to-Learn Nodes in Graph Contrastive Learning Degree Bias im Graph Contrastive Learning adaptiv mit schwer zu erlernenden Knoten abmildern 与图表对比学习中难以识别的节点相适应 2506.05214v1

Authors (5): Jingyu Hu, Hongbo Bo, Jun Hong, Xiaowei Liu, Weiru Liu

Graph Neural Networks (GNNs) often suffer from degree bias in node classification tasks, where prediction performance varies across nodes with different degrees. Several approaches, which adopt Graph Contrastive Learning (GCL), have been proposed to mitigate this bias. However, the limited number of positive pairs and the equal weighting of all positives and negatives in GCL still lead to low-degree nodes acquiring insufficient and noisy information. This paper proposes the Hardness Adaptive Reweighted (HAR) contrastive loss to mitigate degree bias. It adds more positive pairs by leveraging node labels and adaptively weights positive and negative pairs based on their learning hardness. In addition, we develop an experimental framework named SHARP to extend HAR to a broader range of scenarios. Both our theoretical analysis and experiments validate the effectiveness of SHARP. The experimental results across four datasets show that SHARP achieves better performance against baselines at both global and degree levels.


Article 1017

Title@2025-06-05 (4): LLM-First Search: Self-Guided Exploration of the Solution Space

Title: LLM-First Search: Self-Guided Exploration of the Solution Space LLM-First Search: Selbstgeführte Erkundung des Lösungsraums LLM-第一次搜索:自导探索解决办法空间 2506.05213v1

Authors (3): Nathan Herr, Tim Rocktäschel, Roberta Raileanu

Large Language Models (LLMs) have demonstrated remarkable improvements in reasoning and planning through increased test-time compute, often by framing problem-solving as a search process. While methods like Monte Carlo Tree Search (MCTS) have proven effective in some domains, their reliance on fixed exploration hyperparameters limits their adaptability across tasks of varying difficulty, rendering them impractical or expensive in certain settings. In this paper, we propose LLM-First Search (LFS), a novel LLM Self-Guided Search method that removes the need for pre-defined search strategies by empowering the LLM to autonomously control the search process via self-guided exploration. Rather than relying on external heuristics or hardcoded policies, the LLM evaluates whether to pursue the current search path or explore alternative branches based on its internal scoring mechanisms. This enables more flexible and context-sensitive reasoning without requiring manual tuning or task-specific adaptation. We evaluate LFS on Countdown and Sudoku against three classic widely-used search algorithms, Tree-of-Thoughts’ Breadth First Search (ToT-BFS), Best First Search (BestFS), and MCTS, each of which has been used to achieve SotA results on a range of challenging reasoning tasks. We found that LFS (1) performs better on more challenging tasks without additional tuning, (2) is more computationally efficient compared to the other methods, especially when powered by a stronger model, (3) scales better with stronger models, due to its LLM-First design, and (4) scales better with increased compute budget. Our code is publicly available at https://github.com/NathanHerr/LLM-First-Search.


Article 1018

Title@2025-06-05 (4): CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks

Title: CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks CheckEmbed: Effektive Überprüfung von LLM-Lösungen auf offene Aufgaben 复选对象:有效核查对不限名额任务LLM解决方案的有效核查 2406.02524v4

Authors (12): Maciej Besta, Lorenzo Paleari, Marcin Copik, Robert Gerstenberger, Ales Kubicek, Piotr Nyczyk, Patrick Iff, Eric Schreiber, Tanja Srindran, Tomasz Lehmann, Hubert Niewiadomski, Torsten Hoefler

Large Language Models (LLMs) are transforming a wide range of domains, yet verifying their outputs remains a significant challenge, especially for complex open-ended tasks such as consolidation, summarization, and knowledge extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable, and accurate verification method. CE reduces each LLM answer to a single embedding vector using powerful modern embedding LLM models like SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied on weaker encoders like BERT, forcing them to operate at token or sentence granularity. In contrast, CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. We conduct a comprehensive design and time complexity analysis across 13 verification baselines, including classical text scorers (e.g., BLEU), stability-based methods (e.g., SelfCheckGPT), and generative evaluators (e.g., LLM-as-a-Judge), which highlights the effectiveness, efficiency, versatility, and simplicity of CE. Empirical results show that CE reliably detects hallucinations in both closed and open-ended tasks. We further present evidence that CE generalizes beyond text to other modalities such as vision, establishing it as a practical and versatile verification framework.
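
The whole-answer comparison can be sketched as pairwise cosine similarity between one embedding vector per answer. The sketch assumes the vectors have already been produced by some embedding model (the toy vectors below are made up), and the idea of averaging off-diagonal similarities into a single agreement score is an assumption for illustration, not necessarily the scoring used by CE.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between answer embeddings (one row per answer)."""
    E = np.asarray(embeddings, dtype=float)
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    return unit @ unit.T


def agreement_score(embeddings):
    """Mean similarity over all distinct answer pairs (assumed summary statistic).

    A low score means the sampled answers disagree with each other,
    which may indicate an unreliable or hallucinated response.
    """
    sim = cosine_similarity_matrix(embeddings)
    n = sim.shape[0]
    off_diagonal = sim[~np.eye(n, dtype=bool)]
    return float(off_diagonal.mean())


# Hypothetical 2-D embeddings of three sampled answers to the same prompt.
consistent = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
inconsistent = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(agreement_score(consistent), agreement_score(inconsistent))
```

Because each answer is a single vector, comparing n answers costs one n-by-n matrix product rather than token-level or sentence-level alignment, which fits the abstract's claim about scalability.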


Article 1019

Title@2025-06-05 (4): The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Title: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text The Common Pile v0.1: Ein 8TB-Datensatz von Public Domain und Openly Licensed Text Pile v0.1:公共域和公开许可文本的 8TB数据集 2506.05209v1

Authors (27): Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.


Article 1020

Title@2025-06-05 (4): MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration

Title: MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration MMBoundary: MLLM-Wissensgrenzen-Bewusstsein durch vernünftige Schritt-Vertrauens-Kalibrierung MMBoundary:通过合理步骤信任校准提高MLLM知识边界认识 2505.23224v2

Authors (6): Zhitao He, Sandeep Polisetty, Zhiyuan Fan, Yuchen Huang, Shujin Wu, Yi R. Fung

In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) advanced inferencing. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning of the MLLM on this set of self-rewarded confidence estimation signals for initial confidence expression warm-up, we introduce a reinforcement learning stage with multiple reward functions for further aligning model knowledge and calibrating confidence at each reasoning step, enhancing reasoning chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average of 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance.


Article 1021

Title@2025-06-05 (4): RELIC: Evaluating Compositional Instruction Following via Language Recognition

Title: RELIC: Evaluating Compositional Instruction Following via Language Recognition relIC: Bewertung der kompositorischen Instruktion über Spracherkennung RELIC:评估通过语言承认进行的组成说明 2506.05205v1

Authors (6): Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, Tal Linzen

Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context, without examples of inputs and outputs; this ability is referred to as instruction following. We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition: the task of determining if a string is generated by a formal grammar. Unlike many standard evaluations of LLMs’ ability to use their context, this task requires composing together a large number of instructions (grammar productions) retrieved from the context. Because the languages are synthetic, the task can be increased in complexity as LLMs’ skills improve, and new instances can be automatically generated, mitigating data contamination. We evaluate state-of-the-art LLMs on RELIC and find that their accuracy can be reliably predicted from the complexity of the grammar and the individual example strings, and that even the most advanced LLMs currently available show near-chance performance on more complex grammars and samples, in line with theoretical expectations. We also use RELIC to diagnose how LLMs attempt to solve increasingly difficult reasoning tasks, finding that as the complexity of the language recognition task increases, models switch to relying on shallow heuristics instead of following complex instructions.
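
For reference, language recognition itself has a classical exact solution. The sketch below uses the standard CYK algorithm for a context-free grammar in Chomsky normal form; the grammar encoding (two dictionaries) and the example grammar are assumptions for illustration, and the abstract does not say which grammar format RELIC uses.

```python
from itertools import product
from typing import Dict, List, Set, Tuple


def cyk_recognize(
    tokens: List[str],
    unary_rules: Dict[str, Set[str]],
    binary_rules: Dict[Tuple[str, str], Set[str]],
    start: str = "S",
) -> bool:
    """Return True if the grammar (in Chomsky normal form) generates tokens.

    unary_rules maps a terminal a to the nonterminals A with A -> a.
    binary_rules maps a pair (B, C) to the nonterminals A with A -> B C.
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive tokens[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        table[i][i] = set(unary_rules.get(token, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two children
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= binary_rules.get((left, right), set())
    return start in table[0][n - 1]


# Example grammar for the language a^n b^n (n >= 1):
#   S -> A B | A X,  X -> S B,  A -> a,  B -> b
unary = {"a": {"A"}, "b": {"B"}}
binary = {("A", "B"): {"S"}, ("A", "X"): {"S"}, ("S", "B"): {"X"}}
print(cyk_recognize(list("aabb"), unary, binary))
```

Deciding membership requires combining many productions across all split points, which illustrates why the task stresses composition of in-context instructions rather than lookup of a single rule.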


Article 1022

Title@2025-06-05 (4): DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models

Title: DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models DREAM: Entwirren von Risiken zur Verbesserung der Sicherheitsausrichtung in multimodalen großen Sprachmodellen DREAM:消除风险,加强多式联运大语言模式中的安全协调 2504.18053v2

Authors (17): Jianyu Liu, Hangyu Guo, Ranjie Duan, Xingyuan Bu, Yancheng He, Shilong Li, Hui Huang, Jiaheng Liu, Yucheng Wang, Chenchen Jing, Xingwei Qu, Xiao Zhang, Yingshui Tan, Yanan Wu, Jihao Gu, Yangguang Li, Jianke Zhu

Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data, thereby introducing new dimensions of potential attacks and complex risk combinations. In this paper, we begin with a detailed analysis aimed at disentangling risks through step-by-step reasoning within multimodal inputs. We find that systematic multimodal risk disentanglement substantially enhances the risk awareness of MLLMs. By leveraging the strong discriminative abilities of multimodal risk disentanglement, we further introduce DREAM (Disentangling Risks to Enhance Safety Alignment in MLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback (RLAIF). Experimental results show that DREAM significantly boosts safety during both inference and training phases without compromising performance on normal tasks (namely oversafety), achieving a 16.17% improvement in the SIUO safe&effective score compared to GPT-4V. The data and code are available at https://github.com/Kizna1ver/DREAM.


Article 1023

Title@2025-06-05 (4): Counterfactual reasoning: an analysis of in-context emergence

Title: Counterfactual reasoning: an analysis of in-context emergence Gegenfaktische Argumentation: eine Analyse der Entstehung von Inkontexten 反事实推理:对内源出现的分析 2506.05188v1

Authors (3): Moritz Miller, Bernhard Schölkopf, Siyuan Guo

Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning: the ability to learn and reason over the input context on the fly without parameter updates. This work studies in-context counterfactual reasoning in language models, that is, predicting the consequences of changes under hypothetical scenarios. We focus on studying a well-defined synthetic setup: a linear regression task that requires noise abduction, where accurate prediction is based on inferring and copying the contextual noise from factual observations. We show that language models are capable of counterfactual reasoning in this controlled setup and provide insights that counterfactual reasoning for a broad class of functions can be reduced to a transformation on in-context observations; we find that self-attention, model depth, and data diversity in pre-training drive performance in Transformers. More interestingly, our findings extend beyond regression tasks and show that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under https://github.com/moXmiller/counterfactual-reasoning.git.
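
Noise abduction in a linear regression setting can be shown in a few lines. The sketch assumes an additive-noise model y = beta * x + u with a known coefficient beta; the model form, the function name, and the numbers are assumptions for illustration rather than the paper's exact setup.

```python
def counterfactual_prediction(beta: float, x_factual: float, y_factual: float,
                              x_counterfactual: float) -> float:
    """Predict y under a hypothetical input while keeping the observed noise.

    Assumed model: y = beta * x + u.
    """
    # Abduction: infer the noise term from the factual observation.
    noise = y_factual - beta * x_factual
    # Action and prediction: change the input, then copy the same noise.
    return beta * x_counterfactual + noise


# Factual observation: x = 1.0 gave y = 2.5 under beta = 2.0, so u = 0.5.
y_cf = counterfactual_prediction(beta=2.0, x_factual=1.0, y_factual=2.5, x_counterfactual=3.0)
print(y_cf)  # 6.5
```

The counterfactual answer differs from the plain regression prediction beta * x' = 6.0 exactly by the inferred noise, which is the quantity the abstract says must be inferred and copied from the factual observation.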


Article 1024

Title@2025-06-05 (4): Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

Title: Multi-Head RAG: Solving Multi-Aspect Problems with LLMs Multi-Head RAG: Lösung von Multi-Aspect-Problemen mit LLMs 多方主管RAG:解决多领域问题与LLM 2406.05085v3

Authors (16): Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic, Yi Zhu, Patrick Iff, Michal Podstawski, Lucas Weitzendorf, Mingyuan Chi, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Torsten Hoefler

Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving observation is that different attention heads learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, multi-aspect datasets, and real-world use cases to demonstrate MRAG’s effectiveness. We show MRAG’s design advantages over 18 RAG baselines, empirical improvements of up to 20% in retrieval success ratios, and benefits for downstream LLM generation. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarks.
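
The multi-aspect retrieval idea can be sketched by treating each attention head's slice of an activation vector as its own embedding space. Everything specific here is an assumption for illustration: the equal-size split into heads, the cosine scoring, the one-vote-per-head ranking, and the toy vectors are not taken from the abstract, which only states that multi-head attention activations serve as keys.

```python
import numpy as np


def split_heads(vector, num_heads):
    """Reshape a flat activation vector into (num_heads, head_dim) slices."""
    return np.asarray(vector, dtype=float).reshape(num_heads, -1)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def multi_head_retrieve(query_vector, doc_vectors, num_heads):
    """Let each head vote for its nearest document; rank documents by votes."""
    query_heads = split_heads(query_vector, num_heads)
    doc_heads = [split_heads(doc, num_heads) for doc in doc_vectors]
    votes = {}
    for head in range(num_heads):
        scores = [cosine(query_heads[head], doc[head]) for doc in doc_heads]
        best = int(np.argmax(scores))
        votes[best] = votes.get(best, 0) + 1
    # Most-voted documents first; ties broken by document index.
    return sorted(votes, key=lambda index: (-votes[index], index))


# Hypothetical activations with two heads of dimension 2 each.
query = [1.0, 0.0, 0.0, 1.0]
docs = [
    [1.0, 0.0, 1.0, 0.0],    # matches the query on head 0 only
    [0.0, 1.0, 0.0, 1.0],    # matches the query on head 1 only
    [-1.0, 0.0, -1.0, 0.0],  # matches on neither head
]
retrieved = multi_head_retrieve(query, docs, num_heads=2)
```

In this toy case the two relevant documents are far apart as whole vectors but each is the nearest neighbor in one head's subspace, which is the situation the abstract describes for queries that need several dissimilar documents.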


Article 1025

Title@2025-06-05 (4): Can Large Language Models Understand Intermediate Representations in Compilers?

Title: Can Large Language Models Understand Intermediate Representations in Compilers? Können große Sprachmodelle Zwischendarstellungen in Compilern verstehen? 大语言模式能理解《汇编者》的中间代表吗? 2502.06854v2

Authors (7): Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan

Intermediate Representations (IRs) play a critical role in compiler design and program analysis, yet their comprehension by Large Language Models (LLMs) remains underexplored. In this paper, we present an explorative empirical study evaluating the capabilities of six state-of-the-art LLMs: GPT-4, GPT-3, DeepSeek, Gemma 2, Llama 3, and Code Llama, in understanding IRs. Specifically, we assess model performance across four core tasks: control flow graph reconstruction, decompilation, code summarization, and execution reasoning. While LLMs exhibit competence in parsing IR syntax and identifying high-level structures, they consistently struggle with instruction-level reasoning, especially in control flow reasoning, loop handling, and dynamic execution. Common failure modes include misinterpreting branching instructions, omitting critical operations, and relying on heuristic reasoning rather than precise instruction-level logic. Our findings highlight the need for IR-specific enhancements in LLM design. We recommend fine-tuning on structured IR datasets and integrating control-flow-sensitive architectures to improve model effectiveness. All experimental data and source code are publicly available at



Article 1026

Title@2025-06-05 (4): SNaRe: Domain-aware Data Generation for Low-Resource Event Detection

Title: SNaRe: Domain-aware Data Generation for Low-Resource Event Detection SNARe: Domain-aware Datengenerierung für Low-Resource-Erkennung SNaRe: 用于低资源事件探测的域觉数据生成 2502.17394v2

Authors (7): Tanmay Parekh, Yuxuan Dong, Lucas Bandarkar, Artin Kim, I-Hung Hsu, Kai-Wei Chang, Nanyun Peng

Event Detection (ED) – the task of identifying event mentions from natural language text – is critical for enabling reasoning in highly specialized domains such as biomedicine, law, and epidemiology. Data generation has proven to be effective in broadening its utility to wider applications without requiring expensive expert annotations. However, when existing generation approaches are applied to specialized domains, they struggle with label noise, where annotations are incorrect, and domain drift, characterized by a distributional mismatch between generated sentences and the target domain. To address these issues, we introduce SNaRe, a domain-aware synthetic data generation framework composed of three components: Scout, Narrator, and Refiner. Scout extracts triggers from unlabeled target domain data and curates a high-quality domain-specific trigger list using corpus-level statistics to mitigate domain drift. Narrator, conditioned on these triggers, generates high-quality domain-aligned sentences, and Refiner identifies additional event mentions, ensuring high annotation quality. Experimentation on three diverse domain ED datasets reveals how SNaRe outperforms the best baseline, achieving average F1 gains of 3-7% in the zero-shot/few-shot settings and 4-20% F1 improvement for multilingual generation. Analyzing the generated trigger hit rate and human evaluation substantiates SNaRe’s stronger annotation quality and reduced domain drift.



Article 1027

Title@2025-06-05 (4): ValueSim: Generating Backstories to Model Individual Value Systems

Title: ValueSim: Generating Backstories to Model Individual Value Systems ValueSim: Erzeugung von Backstories zum Modell individueller Wertsysteme 价值模拟: 生成个人价值系统模型的备份 2505.23827v2

Authors (8): Bangde Du, Ziyi Ye, Zhijing Wu, Jankowska Monika, Shuqi Zhu, Qingyao Ai, Yujia Zhou, Yiqun Liu

As Large Language Models (LLMs) continue to exhibit increasingly human-like capabilities, aligning them with human values has become critically important. Contemporary advanced techniques, such as prompt learning and reinforcement learning, are being deployed to better align LLMs with human values. However, while these approaches address broad ethical considerations and helpfulness, they rarely focus on simulating individualized human value systems. To address this gap, we present ValueSim, a framework that simulates individual values through the generation of personal backstories reflecting past experiences and demographic information. ValueSim converts structured individual data into narrative backstories and employs a multi-module architecture inspired by the Cognitive-Affective Personality System to simulate individual values based on these narratives. Testing ValueSim on a self-constructed benchmark derived from the World Values Survey demonstrates an improvement in top-1 accuracy of over 10% compared to retrieval-augmented generation methods. Further analysis reveals that performance improves as additional user interaction history becomes available, indicating the model’s ability to refine its persona simulation capabilities over time.



Article 1028

Title@2025-06-05 (4): Explainability in Practice: A Survey of Explainable NLP Across Various Domains

Title: Explainability in Practice: A Survey of Explainable NLP Across Various Domains Erklärbarkeit in der Praxis: Eine Übersicht über erklärbare NLP über verschiedene Domains 实践中的可解释性:关于各种领域可解释的《国家劳工调查》的调查 2502.00837v2

Authors (4): Hadi Mohammadi, Ayoub Bagheri, Anastasia Giachanou, Daniel L. Oberski

Natural Language Processing (NLP) has become a cornerstone in many critical sectors, including healthcare, finance, and customer relationship management. This is especially true with the development and use of advanced models such as GPT-based architectures and BERT, which are widely used in decision-making processes. However, the black-box nature of these advanced NLP models has created an urgent need for transparency and explainability. This review explores explainable NLP (XNLP) with a focus on its practical deployment and real-world applications, examining its implementation and the challenges faced in domain-specific contexts. The paper underscores the importance of explainability in NLP and provides a comprehensive perspective on how XNLP can be designed to meet the unique demands of various sectors, from healthcare’s need for clear insights to finance’s emphasis on fraud detection and risk assessment. Additionally, this review aims to bridge the knowledge gap in XNLP literature by offering a domain-specific exploration and discussing underrepresented areas such as real-world applicability, metric evaluation, and the role of human interaction in model assessment. The paper concludes by suggesting future research directions that could enhance the understanding and broader application of XNLP.



Article 1029

Title@2025-06-05 (4): Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation

Title: Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation Knowledgeable-r1: Politikoptimierung für Wissensexploration in der retrieval-generierten Generation 可知识-r1:在回溯性回溯性养殖中知识探索的政策优化 2506.05154v1

Authors (7): Chenyu Lin, Yilin Wen, Du Su, Fei Sun, Muhan Chen, Chenfu Bao, Zhonghou Lv

Retrieval-augmented generation (RAG) is a mainstream method for improving performance on knowledge-intensive tasks. However, current RAG systems often place too much emphasis on retrieved contexts. This can lead to reliance on inaccurate sources and to overlooking the model’s inherent knowledge, especially when dealing with misleading or excessive information. To resolve this imbalance, we propose Knowledgeable-r1, which uses joint sampling and defines multiple policy distributions during knowledge capability exploration to stimulate large language models’ self-integrated utilization of parametric and contextual knowledge. Experiments show that Knowledgeable-r1 significantly enhances robustness and reasoning accuracy in both parametric-contextual knowledge conflict tasks and general RAG tasks, outperforming baselines by 17.07% in counterfactual scenarios and demonstrating consistent gains across RAG tasks. Our code is available at https://github.com/lcy80366872/knowledgeable-r1.



Article 1030

Title@2025-06-05 (4): CIVET: Systematic Evaluation of Understanding in VLMs

Title: CIVET: Systematic Evaluation of Understanding in VLMs CIVET: Systematische Bewertung des Verständnisses in VLMs CIVET: 系统评估对脆弱、危险、危险和 2506.05146v1

Authors (6): Massimo Rizzoli, Simone Alghisi, Olha Khomyn, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi

While Vision-Language Models (VLMs) have achieved competitive performance in various tasks, their comprehension of the underlying structure and semantics of a scene remains understudied. To investigate the understanding of VLMs, we study their capability regarding object properties and relations in a controlled and interpretable manner. To this end, we introduce CIVET, a novel and extensible framework for systematiC evaluatIon Via controllEd sTimuli. CIVET addresses the lack of standardized systematic evaluation for assessing VLMs’ understanding, enabling researchers to test hypotheses with statistical rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of stimuli, free from annotation noise, dataset-specific biases, and uncontrolled scene complexity. Our findings reveal that 1) current VLMs can accurately recognize only a limited set of basic object properties; 2) their performance heavily depends on the position of the object in the scene; 3) they struggle to understand basic relations among objects. Furthermore, a comparative evaluation with human annotators reveals that VLMs still fall short of achieving human-level accuracy.



Article 1031

Title@2025-06-05 (4): AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models

Title: AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models AudioLens: Ein genauerer Blick auf auditory Attribut Wahrnehmung von großen Audio-Sprachen-Modellen 音频路程:更仔细地审视大型音频语言模型的听觉特性 2506.05140v1

Authors (4): Chih-Kai Yang, Neo Ho, Yi-Jyun Lee, Hung-yi Lee

Understanding the internal mechanisms of large audio-language models (LALMs) is crucial for interpreting their behavior and improving performance. This work presents the first in-depth analysis of how LALMs internally perceive and recognize auditory attributes. By applying vocabulary projection on three state-of-the-art LALMs, we track how attribute information evolves across layers and token positions. We find that attribute information generally decreases with layer depth when recognition fails, and that resolving attributes at earlier layers correlates with better accuracy. Moreover, LALMs heavily rely on querying auditory inputs for predicting attributes instead of aggregating necessary information in hidden states at attribute-mentioning positions. Based on our findings, we demonstrate a method to enhance LALMs. Our results offer insights into auditory attribute processing, paving the way for future improvements.



Article 1032

Title@2025-06-05 (4): Information Locality as an Inductive Bias for Neural Language Models

Title: Information Locality as an Inductive Bias for Neural Language Models Informationslokalität als induktive Bias für neurale Sprachmodelle 信息地点作为神经语言模型的感性偏见 2506.05136v1

Authors (6): Taiga Someya, Anej Svete, Brian DuSell, Timothy J. O’Donnell, Mario Giulianelli, Ryan Cotterell

Inductive biases are inherent in every machine learning system, shaping how models generalize from finite data. In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. Within our framework, we introduce $m$-local entropy – an information-theoretic measure derived from average lossy-context surprisal – that captures the local uncertainty of a language by quantifying how effectively the $m-1$ preceding symbols disambiguate the next symbol. In experiments on both perturbed natural language corpora and languages defined by probabilistic finite-state automata (PFSAs), we show that languages with higher $m$-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs, much like humans, are highly sensitive to the local statistical structure of a language.
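As a rough illustration of the quantity described above, the sketch below computes a plug-in estimate of the conditional entropy of the next symbol given the $m-1$ preceding symbols, using n-gram counts from a sample. This is a simplifying assumption: the paper derives its measure from average lossy-context surprisal, and its exact definition may differ. The function name `m_local_entropy` is hypothetical.

```python
import math
from collections import Counter


def m_local_entropy(sequences, m):
    """Plug-in estimate, in bits, of H(next symbol | previous m-1 symbols)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    context_counts = Counter()
    joint_counts = Counter()
    for seq in sequences:
        # Slide a window of length m over the sequence.
        for i in range(m - 1, len(seq)):
            context = tuple(seq[i - (m - 1):i])
            joint_counts[(context, seq[i])] += 1
            context_counts[context] += 1
    total = sum(context_counts.values())
    if total == 0:
        raise ValueError("no window of length m found in the sample")
    entropy = 0.0
    for (context, _symbol), count in joint_counts.items():
        p_joint = count / total                    # P(context, symbol)
        p_cond = count / context_counts[context]   # P(symbol | context)
        entropy -= p_joint * math.log2(p_cond)
    return entropy


# A strictly alternating language: one preceding symbol fully determines the next.
sample = ["abababab"]
unigram_entropy = m_local_entropy(sample, 1)
bigram_entropy = m_local_entropy(sample, 2)
```

For the alternating sample, the estimate with no context (m = 1) is one bit, and it drops to zero once a single preceding symbol is available (m = 2). Lower values mean the local context disambiguates the next symbol more effectively.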



Article 1033

Title@2025-06-05 (4): CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature

Title: CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature CHIMERA: Eine Wissensbasis der Ideenrekombination in der wissenschaftlichen Literatur CHIMERA:科学文献中思想再融合的知识库 2505.20779v3

Authors (2): Noy Sternlicht, Tom Hope

A hallmark of human innovation is the process of recombination – creating original ideas by integrating elements of existing mechanisms and concepts. In this work, we automatically mine the scientific literature and build CHIMERA: a large-scale knowledge base (KB) of recombination examples. CHIMERA can be used to empirically explore at scale how scientists recombine concepts and take inspiration from different areas, or to train supervised machine learning models that learn to predict new creative cross-domain directions. To build this KB, we present a novel information extraction task of extracting recombination from scientific paper abstracts, collect a high-quality corpus of hundreds of manually annotated abstracts, and use it to train an LLM-based extraction model. The model is applied to a large corpus of papers in the AI domain, yielding a KB of over 28K recombination examples. We analyze CHIMERA to explore the properties of recombination in different subareas of AI. Finally, we train a scientific hypothesis generation model using the KB, which predicts new recombination directions that real-world researchers find inspiring. Our data and code are available at https://github.com/noy-sternlicht/CHIMERA-KB



Article 1034

Title@2025-06-05 (4): DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning

Title: DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning DiCoRe: Erweitern der Null-Shot-Erkennung durch Divergent-Convergent LLM Reasoning DiCore: 通过差异-说服者LLM 合理性,加强零射事件探测 2506.05128v1

Authors (5): Tanmay Parekh, Kartik Mehta, Ninareh Mehrabi, Kai-Wei Chang, Nanyun Peng

Zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data, is critical for document understanding in specialized domains. Understanding the complex event ontology, extracting domain-specific triggers from the passage, and structuring them appropriately together overload Large Language Models (LLMs) and limit their utility for zero-shot ED. To this end, we propose DiCoRe, a divergent-convergent reasoning framework that decouples the task of ED using Dreamer and Grounder. Dreamer encourages divergent reasoning through open-ended event discovery, which helps to boost event coverage. Conversely, Grounder introduces convergent reasoning to align the free-form predictions with the task-specific instructions using finite-state machine guided constrained decoding. Additionally, an LLM-Judge verifies the final outputs to ensure high precision. Through extensive experiments on six datasets across five domains and nine LLMs, we demonstrate how DiCoRe consistently outperforms prior zero-shot, transfer-learning, and reasoning baselines, achieving 4-7% average F1 gains over the best baseline – establishing DiCoRe as a strong zero-shot ED framework.



Article 1035

Title@2025-06-05 (4): LLM Social Simulations Are a Promising Research Method

Title: LLM Social Simulations Are a Promising Research Method LLM Sozialsimulationen sind eine vielversprechende Forschungsmethode LLM LLM 社会模拟是一种有希望的研究方法 2504.02234v2

Authors (8): Jacy Reese Anthis, Ryan Liu, Sean M. Richardson, Austin C. Kozlowski, Bernard Koch, James Evans, Erik Brynjolfsson, Michael Bernstein

Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted this method. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a review of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions, including context-rich prompting and fine-tuning with social science datasets. We believe that LLM social simulations can already be used for pilot and exploratory studies, and more widespread use may soon be possible with rapidly advancing LLM capabilities. Researchers should prioritize developing conceptual models and iterative evaluations to make the best use of new AI systems.



Article 1036

Title@2025-06-05 (4): The NTNU System at the S&I Challenge 2025 SLA Open Track

Title: The NTNU System at the S&I Challenge 2025 SLA Open Track Das NTNU-System bei der S&I Challenge 2025 SLA Open Track S&I挑战2025年S&I挑战的NNNU系统 2506.05121v1

Authors (7): Hong-Yun Lin, Tien-Hong Lo, Yu-Hsuan Fang, Jhen-Ke Lin, Chung-Chun Wang, Hao-Chien Lu, Berlin Chen

A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
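The abstract does not say how the two systems' scores are fused. The sketch below assumes the simplest option, a weighted average of per-utterance scores, and shows how RMSE would be computed on the fused predictions. The function names, the `weight` parameter, and the example scores are all illustrative assumptions.

```python
import math


def fuse_scores(w2v_scores, mllm_scores, weight=0.5):
    """Linear score fusion: weight * W2V score + (1 - weight) * MLLM score."""
    if len(w2v_scores) != len(mllm_scores):
        raise ValueError("score lists must have the same length")
    return [weight * a + (1.0 - weight) * b for a, b in zip(w2v_scores, mllm_scores)]


def rmse(predictions, targets):
    """Root mean square error between predicted and reference scores."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up proficiency scores for two utterances.
fused = fuse_scores([3.0, 4.0], [5.0, 4.0], weight=0.5)
error = rmse(fused, [4.0, 5.0])
```

In practice the fusion weight would be tuned on a development set. The value 0.5 here is only a placeholder.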



Article 1037

Title@2025-06-05 (4): OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction

Title: OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction OmniCharacter: Auf dem Weg zu immersiven Rollenspiel-Agenten mit nahtloser Sprach-Persönlichkeits-Interaktion OmniCharacter:争取用无缝无言语-语言个性交互作用来模拟角色扮演剂 2505.20277v2

Authors (13): Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, Jingkuan Song, Fei Huang, Yongbin Li

Role-Playing Agents (RPAs), which benefit from large language models, are emerging interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role’s voice traits (e.g., voice style and emotions), which play a crucial part in interaction and make for more immersive experiences in realistic scenarios. Toward this goal, we propose OmniCharacter, a first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction, enabling a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which involves more distinctive characters (20), richly contextualized multi-round dialogue (10K), and dynamic speech response (135K). Experimental results show that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms. Code and dataset are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.



Article 1038

Title@2025-06-05 (4): CL-ISR: A Contrastive Learning and Implicit Stance Reasoning Framework for Misleading Text Detection on Social Media

Title: CL-ISR: A Contrastive Learning and Implicit Stance Reasoning Framework for Misleading Text Detection on Social Media CL-ISR: Ein kontrastives Lern- und Implizit-Stance-Reasoning-Framework für irreführende Texterkennung in sozialen Medien CL-ISR: 社交媒体错误领导层文字探测错误文本的矛盾学习和隐含理由依据框架 2506.05107v1

Authors (4): Tianyi Huang, Zikun Cui, Cuiqianhe Du, Chia-En Chiang

Misleading text detection on social media platforms is a critical research area, as these texts can lead to public misunderstanding, social panic and even economic losses. This paper proposes a novel framework, CL-ISR (Contrastive Learning and Implicit Stance Reasoning), which combines contrastive learning and implicit stance reasoning to improve the detection accuracy of misleading texts on social media. First, we use a contrastive learning algorithm to improve the model’s ability to learn semantic differences between truthful and misleading texts. By constructing positive and negative sample pairs, contrastive learning helps the model capture the distinguishing features between categories more effectively, particularly in linguistically complicated situations. Second, we introduce an implicit stance reasoning module to explore the potential stance tendencies in the text and their relationships with related topics. This method is effective for identifying content that misleads through stance shifting or emotional manipulation, because it can capture the implicit information behind the text. Finally, we integrate these two algorithms into a new framework, CL-ISR, which leverages the discriminative power of contrastive learning and the interpretive depth of stance reasoning to significantly improve detection performance.



Article 1039

Title@2025-06-05 (4): The Impossibility of Fair LLMs

Title: The Impossibility of Fair LLMs Die Unmöglichkeit fairer LLMs 公平专利Ms的不可行性 2406.03198v2

Authors (5): Jacy Anthis, Kristian Lum, Michael Ekstrand, Avi Feller, Chenhao Tan

The rise of general-purpose artificial intelligence (AI) systems, particularly large language models (LLMs), has raised pressing moral questions about how to reduce bias and ensure fairness at scale. Researchers have documented a sort of “bias” in the significant correlations between demographics (e.g., race, gender) in LLM prompts and responses, but it remains unclear how LLM fairness could be evaluated with more rigorous definitions, such as group fairness or fair representations. We analyze a variety of technical fairness frameworks and find inherent challenges in each that make the development of a fair LLM intractable. We show that each framework either does not logically extend to the general-purpose AI context or is infeasible in practice, primarily due to the large amounts of unstructured training data and the many potential combinations of human populations, use cases, and sensitive attributes. These inherent challenges would persist for general-purpose AI, including LLMs, even if empirical challenges, such as limited participatory input and limited measurement methods, were overcome. Nonetheless, fairness will remain an important type of model evaluation, and there are still promising research directions, particularly the development of standards for the responsibility of LLM developers, context-specific evaluations, and methods of iterative, participatory, and AI-assisted evaluation that could scale fairness across the diverse contexts of modern human-AI interaction.



Article 1040

Title@2025-06-05 (4): Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation

Title: Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation Bias in Language Models: Jenseits von Tricktests und hin zu RUTEd-Evaluierung 语言模式中的偏见:超越欺骗性测试和争取公正评价 2402.12649v3

Authors (5): Kristian Lum, Jacy Reese Anthis, Kevin Robinson, Chirag Nagpal, Alexander D’Amour

Standard benchmarks of bias and fairness in large language models (LLMs) measure the association between the user attributes stated or implied by a prompt and the LLM’s short text response, but human-AI interaction increasingly requires long-form and context-specific system output to solve real-world tasks. In the commonly studied domain of gender-occupation bias, we test whether these benchmarks are robust to lengthening the LLM responses as a measure of Realistic Use and Tangible Effects (i.e., RUTEd evaluations). From the current literature, we adapt three standard bias metrics (neutrality, skew, and stereotype) and develop analogous RUTEd evaluations from three contexts of real-world use: children’s bedtime stories, user personas, and English language learning exercises. We find that standard bias metrics have no significant correlation with the more realistic bias metrics. For example, selecting the least biased model based on the standard “trick tests” coincides with selecting the least biased model as measured in more realistic use no more often than random chance would predict. We suggest that there is not yet evidence to justify standard benchmarks as reliable proxies of real-world AI biases, and we encourage further development of evaluations grounded in particular contexts.



Article 1041

Title@2025-06-05 (4): Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics

Title: Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics 以人为中心的街头评估可解释的多模式框架:将视觉语言模型纳入感知城市诊断 2506.05087v1

Authors (1): HaoTian Lan

While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using LoRA and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.84 on objective features and 89.3 percent agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy while simultaneously reducing pedestrian comfort. It also identifies nonlinear and semantically contingent patterns – such as the divergent perceptual effects of architectural transparency across residential and commercial zones – revealing the limits of universal spatial heuristics. By generating natural-language rationales grounded in attention mechanisms, the framework bridges sensory data with socio-affective inference, enabling transparent diagnostics aligned with SDG 11. This work offers both methodological innovation in urban perception modeling and practical utility for planning systems seeking to reconcile infrastructural precision with lived experience.



Article 1042

Title@2025-06-05 (4): Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin

Title: Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin Parken, Wahrnehmung und Einzelhandel: Street-Level Determinanten der gemeinschaftlichen Vitalität in Harbin 泊车、观感和零售:哈尔滨社区生命的街道一级决定因素 2506.05080v1

Authors (1): HaoTian Lan

The commercial vitality of community-scale streets in Chinese cities is shaped by complex interactions between vehicular accessibility, environmental quality, and pedestrian perception. This study proposes an interpretable, image-based framework to examine how street-level features – including parked vehicle density, greenery, cleanliness, and street width – impact retail performance and user satisfaction in Harbin, China. Leveraging street view imagery and a multimodal large language model (VisualGLM-6B), we construct a Community Commercial Vitality Index (CCVI) from Meituan and Dianping data and analyze its relationship with spatial attributes extracted via GPT-4-based perception modeling. Our findings reveal that while moderate vehicle presence may enhance commercial access, excessive on-street parking – especially in narrow streets – erodes walkability and reduces both satisfaction and shop-level pricing. In contrast, streets with higher perceived greenery and cleanliness show significantly greater satisfaction scores but only weak associations with pricing. Street width moderates the effects of vehicle presence, underscoring the importance of spatial configuration. These results demonstrate the value of integrating AI-assisted perception with urban morphological analysis to capture non-linear and context-sensitive drivers of commercial success. This study advances both theoretical and methodological frontiers by highlighting the conditional role of vehicle activity in neighborhood commerce and demonstrating the feasibility of multimodal AI for perceptual urban diagnostics. The implications extend to urban design, parking management, and scalable planning tools for community revitalization.



Article 1043

Title@2025-06-05 (4): Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation

Title: Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation Nur ein Kratzer: Verbesserung der LLM-Fähigkeiten für die Selbst-Schaden-Erkennung durch Intent Differentiation und Emoji-Interpretation 仅仅一个缩略图:通过内在差别和Emoji解释,提高自残检测的LLM能力 2506.05073v1

Authors (5): Soumitra Ghosh, Gopendra Vikram Singh, Shambhavi, Sabarna Choudhury, Asif Ekbal

Self-harm detection on social media is critical for early intervention and mental health support, yet remains challenging due to the subtle, context-dependent nature of such expressions. Identifying self-harm intent aids suicide prevention by enabling timely responses, but current large language models (LLMs) struggle to interpret implicit cues in casual language and emojis. This work enhances LLMs’ comprehension of self-harm by distinguishing intent through nuanced language-emoji interplay. We present the Centennial Emoji Sensitivity Matrix (CESM-100), a curated set of 100 emojis with contextual self-harm interpretations, and the Self-Harm Identification aNd intent Extraction with Supportive emoji sensitivity (SHINES) dataset, offering detailed annotations for self-harm labels, casual mentions (CMs), and serious intents (SIs). Our unified framework: a) enriches inputs using CESM-100; b) fine-tunes LLMs for multi-task learning: self-harm detection (primary) and CM/SI span detection (auxiliary); c) generates explainable rationales for self-harm predictions. We evaluate the framework on three state-of-the-art LLMs (Llama 3, Mental-Alpaca, and MentalLlama) across zero-shot, few-shot, and fine-tuned scenarios. By coupling intent differentiation with contextual cues, our approach enhances LLM performance in both detection and explanation tasks, addressing the inherent ambiguity in self-harm signals. The SHINES dataset, CESM-100 and codebase are publicly available at: https://www.iitp.ac.in/~ai-nlp-ml/resources.html#SHINES .



Article 1044

Title@2025-06-05 (4): RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

Title: RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation RIVAL: Verstärktes Lernen mit iterativer und adversarieller Optimierung für maschinelle Übersetzung RIV: 机械翻译的循环和逆向优化强化学习 2506.05070v1

Authors (9): Tianjiao Li, Mengran Yu, Chenyu Shi, Yanjun Zhao, Xiaojing Liu, Qiang Zhang, Qi Zhang, Xuanjing Huang, Jiayin Wang

Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates both models, with the RM trained to distinguish strong from weak translations (qualitative preference reward), and the LLM trained to improve its translations to close this gap. To stabilize training and improve generalizability, we also incorporate quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.


Article 1045

Title@2025-06-05 (4): Leveraging LLMs for Bangla Grammar Error Correction:Error Categorization, Synthetic Data, and Model Evaluation

Title: Leveraging LLMs for Bangla Grammar Error Correction:Error Categorization, Synthetic Data, and Model Evaluation Leveraging LLMs für Bangla Grammatik Fehlerkorrektur:Error Kategorisierung, Synthetische Daten und Modellbewertung Bangla语法错误校正:错误分类、合成数据和模型评价 2406.14284v2

Authors (2): Pramit Bhattacharyya, Arnab Bhattacharya

Large Language Models (LLMs) perform exceedingly well in Natural Language Understanding (NLU) tasks for many languages including English. However, although Bangla is the fifth most-spoken language globally, Grammatical Error Correction (GEC) for Bangla remains underdeveloped. In this work, we investigate how LLMs can be leveraged to improve Bangla GEC. To that end, we first carry out an extensive categorization of 12 error classes in Bangla, and survey native Bangla speakers to collect real-world errors. We next devise a rule-based noise injection method to create grammatically incorrect sentences corresponding to correct ones. The Vaiyakarana dataset, thus created, consists of 5,67,422 sentences of which 2,27,119 are erroneous. This dataset is then used to instruction-tune LLMs for the task of GEC in Bangla. Evaluations show that instruction-tuning with Vaiyakarana improves GEC performance of LLMs by 3-7 percentage points as compared to the zero-shot setting, and makes them achieve human-like performance in grammatical error identification. Humans, though, remain superior in error correction.


Article 1046

Title@2025-06-05 (4): GoRA: Gradient-driven Adaptive Low Rank Adaptation

Title: GoRA: Gradient-driven Adaptive Low Rank Adaptation GoRA: Gradient-getriebene Adaptive Low-Rank-Anpassung GARA:逐步驱动的适应性低级别适应 2502.12171v2

Authors (7): Haonan He, Peng Ye, Yuchen Ren, Yuan Yuan, Luyang Zhou, Shucun Ju, Lei Chen

Low-Rank Adaptation (LoRA) is a crucial method for efficiently fine-tuning large language models (LLMs), with its effectiveness influenced by two key factors: rank selection and weight initialization. While numerous LoRA variants have been proposed to improve performance by addressing one of these aspects, they often compromise usability or computational efficiency. In this paper, we analyze and identify the core limitations of existing approaches and propose a novel framework – GoRA (Gradient-driven Adaptive Low Rank Adaptation) – that simultaneously adapts both the rank and initialization strategy within a unified framework. GoRA leverages gradient information during training to dynamically assign optimal ranks and initialize low-rank adapter weights in an adaptive manner. To our knowledge, GoRA is the first method that not only addresses the limitations of prior approaches – which often focus on either rank selection or initialization in isolation – but also unifies both aspects within a single framework, enabling more effective and efficient adaptation. Extensive experiments across various architectures and modalities show that GoRA consistently outperforms existing LoRA-based methods while preserving the efficiency of vanilla LoRA. For example, when fine-tuning Llama3.1-8B-Base for mathematical reasoning, GoRA achieves a 5.13-point improvement over standard LoRA and even outperforms full fine-tuning by 2.05 points under high-rank settings.


Article 1047

Title@2025-06-05 (4): Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

Title: Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation Debattierbare Intelligenz: Benchmarking der LLM-Richter durch Debatte Sprachbewertung 可辩驳的情报:通过辩论演说评价确定LLM法官的基准 2506.05062v1

Authors (5): Noy Sternlicht, Ariel Gera, Roy Bar-Haim, Tom Hope, Noam Slonim

We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.


Article 1048

Title@2025-06-05 (4): Argument Summarization and its Evaluation in the Era of Large Language Models

Title: Argument Summarization and its Evaluation in the Era of Large Language Models Argumentationszusammenfassung und ihre Bewertung im Zeitalter der großen Sprachmodelle 在 “ 大语言模式时代 “ 中的论点概述及其评价 2503.00847v3

Authors (7): Moritz Altemeyer, Steffen Eger, Johannes Daxenberger, Yanran Chen, Tim Altendorf, Philipp Cimiano, Benjamin Schiller

Large Language Models (LLMs) have revolutionized various Natural Language Generation (NLG) tasks, including Argument Summarization (ArgSum), a key subfield of Argument Mining (AM). This paper investigates the integration of state-of-the-art LLMs into ArgSum, including for its evaluation. In particular, we propose a novel prompt-based evaluation scheme, and validate it through a novel human benchmark dataset. Our work makes three main contributions: (i) the integration of LLMs into existing ArgSum frameworks, (ii) the development of a new LLM-based ArgSum system, benchmarked against prior methods, and (iii) the introduction of an advanced LLM-based evaluation scheme. We demonstrate that the use of LLMs substantially improves both the generation and evaluation of argument summaries, achieving state-of-the-art results and advancing the field of ArgSum. We also show that among the four LLMs integrated in (i) and (ii), Qwen-3-32B, despite having the fewest parameters, performs best, even surpassing GPT-4o, while LLaMA-3.3-70B consistently underperforms.


Article 1049

Title@2025-06-05 (4): TALL – A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages

Title: TALL – A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages TALL – Eine praktikable Architektur zur Verbesserung der LLM-Leistung in ressourcenarmen Sprachen TALL – – 提高低资源语言LLM性能的训练建筑 2506.05057v1

Authors (3): Moshe Ofer, Orel Zamler, Amos Azaria

Large Language Models (LLMs) excel in high-resource languages but struggle with low-resource languages due to limited training data. This paper presents TALL (Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages), which integrates an LLM with two bilingual translation models. TALL transforms low-resource inputs into high-resource representations, leveraging the LLM’s capabilities while preserving linguistic features through dimension alignment layers and custom transformers. Our experiments on Hebrew demonstrate significant improvements over several baselines, including direct use, naive translation, and fine-tuning approaches. The architecture employs a parameter-efficient strategy, freezing pre-trained components while training only lightweight adapter modules, balancing computational efficiency with performance gains.


Article 1050

Title@2025-06-05 (4): Optimizing Anytime Reasoning via Budget Relative Policy Optimization

Title: Optimizing Anytime Reasoning via Budget Relative Policy Optimization Optimierung jederzeit über Budget-Relational-Policy-Optimierung 通过预算相对政策优化优化,优化任何时间的理由 2505.13438v2

Authors (6): Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within token budgets sampled from a prior distribution, compelling the model to summarize its best answer for each truncated thinking trace so that it can be verified. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.
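
The budget-truncation idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: `anytime_return`, `answer_fn`, and `is_correct` are hypothetical names, the reward is assumed to be a 0/1 correctness check, and BRPO itself is not modeled. The sketch only shows how truncating one thinking trace at several budgets turns a single end-of-trace reward into a prior-weighted sum of per-budget rewards.

```python
def anytime_return(thinking_tokens, budgets, prior, answer_fn, is_correct):
    """Prior-weighted sum of 0/1 rewards over truncated thinking traces.

    thinking_tokens: the full thinking trace as a list of tokens.
    budgets / prior: token budgets and their (assumed) prior probabilities.
    answer_fn: hypothetical stand-in for the summary policy.
    is_correct: hypothetical stand-in for the verifier.
    """
    total = 0.0
    for budget, prob in zip(budgets, prior):
        truncated = thinking_tokens[:budget]  # cut the trace at this budget
        answer = answer_fn(truncated)         # summarize an answer from the prefix
        total += prob * (1.0 if is_correct(answer) else 0.0)
    return total


def last_number(tokens):
    """Toy summary policy: return the last numeric token seen, if any."""
    numeric = [t for t in tokens if t.isdigit()]
    return numeric[-1] if numeric else None


trace = ["try", "3", "no", "4"]
score = anytime_return(trace, [2, 4], [0.5, 0.5], last_number, lambda a: a == "4")
```

In this toy trace the correct answer only appears late, so the short budget earns no reward and the long one does; a trace that reaches the answer earlier would score higher under the same prior.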


Article 1051

Title@2025-06-05 (4): Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers

Title: Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers Automatische Robustheits-Stress-Prüfung von LLMs als mathematische Problemlöser 对作为数学问题解答器的LLMS进行自动强力压力测试 2506.05038v1

Authors (8): Yutao Hou, Zeguan Xiao, Fei Yu, Yihan Jiang, Xuetao Wei, Hailiang Huang, Yun Chen, Guanhua Chen

Large language models (LLMs) have achieved distinguished performance on various reasoning-intensive tasks. However, LLMs may still lack robustness and fail unexpectedly on some simple reasoning tasks. Previous works evaluate LLM robustness with hand-crafted templates or a limited set of perturbation rules, which leaves such evaluations exposed to potential data contamination in pre-training or fine-tuning datasets. In this work, inspired by stress testing in software engineering, we propose a novel framework, Automatic Robustness Checker (AR-Checker), to generate mathematical problem variants that preserve the semantic meaning of the original problem but may cause the LLMs to fail. The AR-Checker framework generates mathematical problem variants through multi-round parallel streams of LLM-based rewriting and verification. Our framework can generate benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the strong performance of AR-Checker on mathematical tasks. We also evaluate AR-Checker on benchmarks beyond mathematics, including MMLU, MMLU-Pro, and CommonsenseQA, where it also achieves strong performance, further supporting the effectiveness of AR-Checker.


Article 1052

Title@2025-06-05 (4): GRU: Mitigating the Trade-off between Unlearning and Retention for LLMs

Title: GRU: Mitigating the Trade-off between Unlearning and Retention for LLMs GRU: Abbau des Kompromisses zwischen Unlearning und Retention für LLMs GRU:减少LLMM的不学习与保留之间的取舍 2503.09117v3

Authors (7): Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, Bo Han

Large language model (LLM) unlearning has demonstrated its essential role in removing privacy- and copyright-related responses, which is crucial for the legal and safe application of LLMs. However, the pursuit of complete unlearning often comes at a substantial cost to general model functionality, leading to a notorious trade-off between unlearning and retention. This motivates us to explore enhanced unlearning schemes that can mitigate this trade-off. Specifically, we propose Gradient Rectified Unlearning (GRU), an improved framework that regulates the directions of gradient updates during the unlearning procedure such that their side effects on other, unrelated responses are minimized. GRU is general and easy to implement, demonstrating practical effectiveness across a variety of well-established unlearning benchmarks.
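
The abstract does not spell out how the gradient directions are regulated. As an illustration only, the sketch below uses a generic projection rule (in the style of gradient-surgery methods), which is an assumption and may differ from GRU's actual rule: when the unlearning gradient conflicts with the retention gradient, the conflicting component is projected out. The name `rectify_gradient` and the flat-list gradient representation are hypothetical.

```python
def rectify_gradient(g_unlearn, g_retain):
    """Remove the part of the unlearning gradient that conflicts with retention.

    Both arguments are flat lists of floats of equal length. A negative dot
    product means a descent step along g_unlearn would increase the retention
    loss to first order, so that component is projected out. This projection
    rule is an illustrative assumption, not the rule from the paper.
    """
    dot = sum(u * r for u, r in zip(g_unlearn, g_retain))
    norm_sq = sum(r * r for r in g_retain)
    if dot >= 0 or norm_sq == 0.0:
        return list(g_unlearn)  # no conflict: leave the gradient unchanged
    scale = dot / norm_sq
    return [u - scale * r for u, r in zip(g_unlearn, g_retain)]


rectified = rectify_gradient([1.0, -1.0], [0.0, 1.0])
```

After the projection, the rectified gradient is orthogonal to the retention gradient, so a small step along it leaves the retention loss unchanged to first order.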


Article 1053

Title@2025-06-05 (4): Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks

Title: Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks Selbstkorrektion ist mehr als Verfeinerung: Ein Lernrahmen für visuelle und sprachliche Aufgaben 自我校正不仅仅是改进:视觉和语言原因说明任务学习框架 2410.04055v3

Authors (5): Jiayi He, Hehai Lin, Qingyun Wang, Yi Fung, Heng Ji

While Vision-Language Models (VLMs) have shown remarkable abilities in visual and language reasoning tasks, they invariably generate flawed responses. Self-correction that instructs models to refine their outputs presents a promising solution to this issue. Previous studies have mainly concentrated on Large Language Models (LLMs), while the self-correction abilities of VLMs, particularly concerning both visual and linguistic information, remain largely unexamined. This study investigates the self-correction capabilities of VLMs during both inference and fine-tuning stages. We introduce a Self-Correction Learning (SCL) approach that enables VLMs to learn from their self-generated self-correction data through Direct Preference Optimization (DPO) without relying on external feedback, facilitating self-improvement. Specifically, we collect preferred and disfavored samples based on the correctness of initial and refined responses, which are obtained by two-turn self-correction with VLMs during the inference stage. Experimental results demonstrate that although VLMs struggle to self-correct effectively during iterative inference without additional fine-tuning and external feedback, they can enhance their performance and avoid previous mistakes through preference fine-tuning when their self-generated self-correction data are categorized into preferred and disfavored samples. This study emphasizes that self-correction is not merely a refinement process; rather, it should enhance the reasoning abilities of models through additional training, enabling them to generate high-quality responses directly without further refinement.
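
The sample-collection step described above can be sketched as follows. The record keys (`prompt`, `initial`, `refined`, `initial_correct`, `refined_correct`) and the function name are hypothetical, and dropping the cases where both responses are right or both are wrong is an assumption about how ties are handled, not something the abstract states.

```python
def build_preference_pairs(records):
    """Turn two-turn self-correction traces into preference pairs for DPO.

    Each record is a dict with hypothetical keys: 'prompt', 'initial',
    'refined', 'initial_correct', 'refined_correct'.
    """
    pairs = []
    for r in records:
        if r["initial_correct"] == r["refined_correct"]:
            # Both right or both wrong: correctness gives no preference signal.
            continue
        if r["refined_correct"]:
            chosen, rejected = r["refined"], r["initial"]
        else:
            chosen, rejected = r["initial"], r["refined"]
        pairs.append({"prompt": r["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs


traces = [
    {"prompt": "q1", "initial": "A", "refined": "B",
     "initial_correct": False, "refined_correct": True},
    {"prompt": "q2", "initial": "C", "refined": "D",
     "initial_correct": True, "refined_correct": False},
    {"prompt": "q3", "initial": "E", "refined": "F",
     "initial_correct": True, "refined_correct": True},
]
pairs = build_preference_pairs(traces)
```

The resulting list has the `prompt` / `chosen` / `rejected` shape that preference-optimization trainers commonly expect, so it could be passed to a DPO training loop of the reader's choice.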


Article 1054

Title@2025-06-05 (4): Controlling Summarization Length Through EOS Token Weighting

Title: Controlling Summarization Length Through EOS Token Weighting Kontrolle der Zusammenfassung Länge durch EOS Token Gewichtung 控制通过 EOS 过 EOS 键权重加权的控控总和长度 2506.05017v1

Authors (4): Zeno Belligoli, Emmanouil Stergiadis, Eran Fainman, Ilya Gusev

Controlling the length of generated text can be crucial in various text-generation tasks, including summarization. Existing methods often require complex model alterations, limiting compatibility with pre-trained models. We address these limitations by developing a simple approach for controlling the length of automatic text summaries by increasing the importance of correctly predicting the EOS token in the cross-entropy loss computation. The proposed methodology is agnostic to architecture and decoding algorithms and orthogonal to other inference-time techniques to control the generation length. We tested it with encoder-decoder and modern GPT-style LLMs, and show that this method can control generation length, often without affecting the quality of the summary.
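
A minimal sketch of the idea, assuming the importance of the EOS token is raised by a per-token weight inside the cross-entropy loss. The function name, the `eos_weight` hyperparameter, and the choice to normalize by the sum of weights are assumptions made for illustration; the abstract does not give the exact formulation.

```python
import math


def weighted_cross_entropy(logits, targets, eos_id, eos_weight=5.0):
    """Token-level cross-entropy in which EOS targets are up-weighted.

    logits: list of per-position score lists over the vocabulary.
    targets: list of gold token ids, one per position.
    eos_weight: hypothetical hyperparameter; 1.0 recovers the plain mean loss.
    """
    total, weight_sum = 0.0, 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable negative log-likelihood of the gold token.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        nll = log_z - scores[target]
        w = eos_weight if target == eos_id else 1.0
        total += w * nll
        weight_sum += w
    return total / weight_sum  # weighted mean (an assumed normalization)


logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]  # token id 2 plays the role of EOS in this toy example
plain = weighted_cross_entropy(logits, targets, eos_id=2, eos_weight=1.0)
weighted = weighted_cross_entropy(logits, targets, eos_id=2, eos_weight=5.0)
```

In the toy example the model is unsure at the EOS position, so up-weighting that position raises the loss and pushes training toward predicting EOS correctly there.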


Article 1055

Title@2025-06-05 (4): ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

Title: ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development ComfyUI-Copilot: Ein intelligenter Assistent für automatisierte Workflow-Entwicklung ComfyUI-副驾驶员:一名负责自动工作流程开发的智能助理 2506.05010v1

Authors (10): Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang

We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.


Article 1056

Title@2025-06-05 (4): Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

Title: Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity 1568 Tokens in einen einzigen Vektor und wieder zurück krammen: Die Grenzen der Einbettung von Raumkapazität erkunden 将1568吨撞成单一矢量和后向:探索嵌入空间能力的极限 2502.13063v2

Authors (4): Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev

A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or a key-value cache. These approaches focus on reducing the amount of compute in existing language models rather than minimizing the number of bits needed to store text. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights a gap of two orders of magnitude between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.
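
The capacity argument can be made concrete with a back-of-the-envelope calculation. The sketch below is a crude upper bound, not the paper's analysis: it assumes every bit of the vector is usable and that each token costs `log2(vocab_size)` bits, whereas the abstract notes that the real limit tracks the sequence's cross-entropy. The function name and the example sizes are illustrative assumptions.

```python
import math


def max_tokens_in_vector(dim, bits_per_value, vocab_size):
    """Crude upper bound on tokens that one vector could encode losslessly.

    Assumes all dim * bits_per_value bits are usable and each token costs
    log2(vocab_size) bits (uniform-token assumption, for illustration only).
    """
    capacity_bits = dim * bits_per_value
    bits_per_token = math.log2(vocab_size)
    return capacity_bits / bits_per_token


# Hypothetical sizes: a 1024-dimensional 16-bit vector and a 65,536-token vocabulary.
bound = max_tokens_in_vector(dim=1024, bits_per_value=16, vocab_size=65536)
```

Under these assumed sizes the bound equals the vector dimension, since each token costs exactly 16 bits; for a 4096-dimensional vector and a vocabulary of around 128,000 tokens the same formula gives a few thousand tokens.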


Article 1057

Title@2025-06-05 (4): Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs)

Title: Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs) Position ist Macht: Systemprompts als Mechanismus von Bias in großen Sprachmodellen (LLMs) 位置是电源:系统提示作为大语言模型比阿语机制(LLMs) 2505.21091v2

Authors (4): Anna Neumann, Elisabeth Kirsten, Muhammad Bilal Zafar, Jatinder Singh

System prompts in Large Language Models (LLMs) are predefined directives that guide model behaviour, taking precedence over user inputs in text processing and generation. LLM deployers increasingly use them to ensure consistent responses across contexts. While model providers set a foundation of system prompts, deployers and third-party developers can append additional prompts without visibility into others’ additions, and this layered implementation remains entirely hidden from end-users. As system prompts become more complex, they can directly or indirectly introduce unaccounted-for side effects. This lack of transparency raises fundamental questions about how the position of information in different directives shapes model outputs. As such, this work examines how the placement of information affects model behaviour. To this end, we compare how models process demographic information in system versus user prompts across six commercially available LLMs and 50 demographic groups. Our analysis reveals significant biases, manifesting in differences in user representation and decision-making scenarios. Since these variations stem from inaccessible and opaque system-level configurations, they risk representational, allocative, and potentially other biases and downstream harms beyond the user’s ability to detect or correct. Our findings draw attention to these critical issues, which have the potential to perpetuate harms if left unexamined. Further, we argue that system prompt analysis must be incorporated into AI auditing processes, particularly as customisable system prompts become increasingly prevalent in commercial AI deployments.


Article 1058

Title@2025-06-05 (4): SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View

Title: SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View SCOP: Bewertung des Verständnisprozesses von großen Sprachmodellen aus einer kognitiven Ansicht SCOP: 从认知的角度评估大语言模型的全纳进程 2506.05000v1

Authors (5): Yongjie Xiao, Hongru Liang, Peixin Qin, Yao Zhang, Wenqiang Lei

Despite the great potential of large language models (LLMs) in machine comprehension, it is still worrying to rely on them fully in real-world scenarios. This is probably because there is no principled account of whether the comprehension process of LLMs is aligned with that of experts. In this paper, we propose SCOP to carefully examine how LLMs perform during the comprehension process from a cognitive view. Specifically, it comprises a systematic definition of five requisite skills during the comprehension process, a strict framework to construct testing data for these skills, and a detailed analysis of advanced open-source and closed-source LLMs using the testing data. With SCOP, we find that it is still challenging for LLMs to perform an expert-level comprehension process. Even so, we notice that LLMs share some similarities with experts, e.g., performing better at comprehending local information than global information. Further analysis reveals that LLMs can be somewhat unreliable: they might reach correct answers through flawed comprehension processes. Based on SCOP, we suggest that one direction for improving LLMs is to focus more on the comprehension process, ensuring all comprehension skills are thoroughly developed during training.


Article 1059

Title@2025-06-05 (4): Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings

Title: Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings Auf dem Weg zu einem effizienten visuellen Dokumenten-Retrieval: Eine empirische Studie zur Reduzierung von Patch-Level-Einbindungen 迈向储存-有效视觉文件检索:减少定级嵌入的经验研究 2506.04997v1

Authors (11): Yubo Ma, Jinsong Li, Yuhang Zang, Xiaobao Wu, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Jiaqi Wang, Yixin Cao, Aixin Sun

Despite the strong performance of ColPali/ColQwen2 in Visualized Document Retrieval (VDR), it encodes each page into multiple patch-level embeddings, which leads to excessive memory usage. This empirical study investigates methods to reduce patch embeddings per page with minimal performance degradation. We evaluate two token-reduction strategies: token pruning and token merging. Regarding token pruning, we surprisingly observe that a simple random strategy outperforms other sophisticated pruning methods, though still far from satisfactory. Further analysis reveals that pruning is inherently unsuitable for VDR as it requires removing certain page embeddings without query-specific information. Turning to token merging (more suitable for VDR), we search for the optimal combinations of merging strategy across three dimensions and develop Light-ColPali/ColQwen2. It maintains 98.2% of retrieval performance with only 11.8% of original memory usage, and preserves 94.6% effectiveness at 2.8% memory footprint. We expect our empirical findings and the resulting Light-ColPali/ColQwen2 to offer valuable insights and establish a competitive baseline for future research towards efficient VDR.
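
To make token merging concrete, here is a minimal sketch that averages consecutive groups of patch embeddings. This is only one simple merging choice, assumed for illustration; the abstract says the authors search over several merging strategies, and the specific strategy behind Light-ColPali/ColQwen2 is not stated here. The function name and the list-of-lists representation are hypothetical.

```python
def merge_patch_embeddings(embeddings, factor):
    """Average every `factor` consecutive patch embeddings into one vector.

    embeddings: list of equal-length float lists, one per patch.
    factor: number of neighbouring patches merged together (assumed >= 1).
    The last group may be smaller when the count is not divisible by factor.
    """
    merged = []
    for start in range(0, len(embeddings), factor):
        group = embeddings[start:start + factor]
        dim = len(group[0])
        merged.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return merged


patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
merged = merge_patch_embeddings(patches, factor=2)
```

With a merging factor of `k`, the stored embeddings per page shrink to roughly `1/k` of the original count, which is the storage lever the study is concerned with.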


Article 1060

Title@2025-06-05 (4): Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

Title: Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons Knockout LLM Assessment: Verwendung großer Sprachmodelle für Bewertungen durch iterative Pairwise-Vergleiche LLLM 评估:利用大语言模式,通过迭接对等比较进行评估 2506.03785v2

Authors (3): Isik Baran Sandan, Tu Anh Dinh, Jan Niehues

Large Language Models (LLMs) have been shown to be effective evaluators across various domains such as machine translation or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that knockout assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.
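
The tournament structure can be sketched as a single-elimination bracket. In the sketch below, `judge` is a hypothetical callable standing in for a pairwise LLM comparison, candidates are assumed to be hashable and distinct, and the bye rule for odd-sized rounds is an assumption. How the paper converts tournament outcomes into final scores is not described in the abstract, so the sketch only reports the winner and the number of rounds each candidate won.

```python
import random


def knockout(candidates, judge, seed=0):
    """Run a single-elimination tournament over the candidates.

    judge(a, b) is a hypothetical callable that returns the preferred
    candidate. Returns (winner, rounds_won), where rounds_won maps each
    candidate to the number of pairwise comparisons it won.
    """
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)  # random initial bracket
    rounds_won = {c: 0 for c in pool}
    while len(pool) > 1:
        next_pool = []
        if len(pool) % 2 == 1:
            next_pool.append(pool.pop())  # odd one out gets a bye
        for a, b in zip(pool[0::2], pool[1::2]):
            winner = judge(a, b)
            rounds_won[winner] += 1
            next_pool.append(winner)
        pool = next_pool
    return pool[0], rounds_won


# Toy judge: prefer the longer answer (a stand-in for an LLM comparison prompt).
answers = ["a", "bbb", "cc", "dddd", "e"]
winner, rounds_won = knockout(answers, judge=lambda a, b: a if len(a) >= len(b) else b)
```

A tournament over `n` candidates needs only `n - 1` comparisons, compared with `n * (n - 1) / 2` for exhaustive pairwise judging, which is one practical reason to prefer a bracket.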


Article 1061

Title@2025-06-05 (4): Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering

Title: Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering Besseres semi-überwachtes Lernen für Multi-Domain ASR durch inkrementelle Umschulung und Datenfilterung 通过递增再培训和数据过滤,为多领域ASR更好地进行半监督的半监督学习 2506.04981v1

Authors (13): Andres Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esau Villatoro-Tello, Petr Motlicek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke

Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce, but unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, the pipeline outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
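
One plausible reading of multi-model consensus filtering is sketched below: an utterance's pseudo-label is kept only when transcripts from several ASR models agree closely. The agreement measure (word error rate between hypotheses), the `max_disagreement` threshold, and the function names are assumptions for illustration and are not taken from the paper.

```python
def word_error_rate(ref, hyp):
    """Word-level Levenshtein distance divided by the reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            # Deletion, insertion, or substitution/match.
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (rw != hw)))
        prev = cur
    return prev[-1] / max(len(r), 1)


def consensus_filter(hypotheses, max_disagreement=0.1):
    """Keep utterances whose transcripts from different models agree.

    hypotheses: dict mapping utterance id -> list of transcripts, one per
    model; the first transcript is used as the pseudo-label if it is kept.
    max_disagreement: hypothetical threshold on pairwise word error rate.
    """
    selected = {}
    for utt_id, hyps in hypotheses.items():
        primary, others = hyps[0], hyps[1:]
        if others and all(word_error_rate(primary, o) <= max_disagreement for o in others):
            selected[utt_id] = primary
    return selected


hyps = {
    "u1": ["hello world", "hello world"],
    "u2": ["call me back", "all me pack"],
}
kept = consensus_filter(hyps)
```

In an incremental setup, the kept pseudo-labels would be added to the training set, the model retrained, and the filter re-run on the remaining unlabeled audio.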


Article 1062

Title@2025-06-05 (4): EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

Title: EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving EvaLearn: Quantifizierung der Lernfähigkeit und Effizienz von LLMs durch sequentielle Problemlösung EvaLearn:通过按顺序解决问题,量化LLMs的学习能力和效率 2506.02672v2

Authors (17): Shihan Dou, Ming Zhang, Chenhao Huang, Jiayi Chen, Feng Chen, Shichun Liu, Yan Liu, Chenxiao Liu, Cheng Zhong, Zongzhang Zhang, Tao Gui, Chao Xin, Wei Chengzhi, Lin Yan, Qi Zhang, Yonghui Wu, Xuanjing Huang

We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while some models struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate model learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel evaluation perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automatic evaluation framework, and the results studied in this paper are available at the GitHub repository.


Article 1063

Title@2025-06-05 (4): BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models

Title: BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models BYO-Eval: Erstellen Sie Ihren eigenen Datensatz für eine feinkörnige visuelle Bewertung multimodaler Sprachmodelle BYO-Eval:建立你自己的数据集,以便对多模式语言模式进行精美的视觉评估 2506.05440v1

Authors (3): Ludovic Arnould, Salim Khazem, Hugues Ali Mehenni

Visual Language Models (VLMs) are now sufficiently advanced to support a broad range of applications, including answering complex visual questions, and are increasingly expected to interact with images in varied ways. To evaluate them, current benchmarks often focus on specific domains (e.g., reading charts), constructing datasets of annotated real images paired with pre-defined Multiple Choice Questions (MCQs) to report aggregate accuracy scores. However, such benchmarks entail high annotation costs, risk information leakage, and do not clarify whether failures stem from limitations in visual perception, reasoning, or general knowledge. We propose a new evaluation methodology, inspired by ophthalmologic diagnostics, leveraging procedural generation of synthetic images to obtain control over visual attributes and precisely reveal perception failures in VLMs. Specifically, we build collections of images with gradually more challenging variations in the content of interest (e.g., number of objects in a counting task) while holding other visual parameters constant. This diagnostic allows systematic stress testing and fine-grained failure analysis, shifting the focus from coarse benchmarking toward targeted and interpretable assessment of VLM capabilities. Our code is available at https://github.com/byoeval/BYO-EVAL.


Article 1064

Title@2025-06-05 (4): From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation

Title: From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation Vom Kampf (06-2024) zur Meisterschaft (02-2025) LLMs erobern fortgeschrittene Algorithmenprüfungen und ebnen den Weg für die redaktionelle Generation 从斗争(06-2024)到掌握(02-2025) 2506.04965v1

Authors (3): Adrian Marius Dumitran, Theodor-Pierre Moroianu, Vasile Paul Alexe

This paper presents a comprehensive evaluation of the performance of state-of-the-art Large Language Models (LLMs) on challenging university-level algorithms exams. By testing multiple models on both a Romanian exam and its high-quality English translation, we analyze LLMs’ problem-solving capabilities, consistency, and multilingual performance. Our empirical study reveals that the most recent models not only achieve scores comparable to top-performing students but also demonstrate robust reasoning skills on complex, multi-step algorithmic challenges, even though difficulties remain with graph-based tasks. Building on these findings, we explore the potential of LLMs to support educational environments through the generation of high-quality editorial content, offering instructors a powerful tool to enhance student feedback. The insights and best practices discussed herein pave the way for further integration of generative AI in advanced algorithm education.



Article 1065

Title@2025-06-05 (4): Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents

Title: Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents Auf dem Weg zu einem menschenzentrierten Bewertungsrahmen für vertrauenswürdige LLM-Powered-GUI-Agenten 争取为具有信誉的LLM授权的用户代表建立一个以人为中心的评价框架 2504.17934v2

Authors (10): Chaoran Chen, Zhiping Zhang, Ibrahim Khalilov, Bingcan Guo, Simret A Gebreegziabher, Yanfang Ye, Ziang Xiao, Yaxing Yao, Tianshi Li, Toby Jia-Jun Li

The rise of Large Language Models (LLMs) has revolutionized Graphical User Interface (GUI) automation through LLM-powered GUI agents, yet their ability to process sensitive data with limited human oversight raises significant privacy and security risks. This position paper identifies three key risks of GUI agents and examines how they differ from traditional GUI automation and general autonomous agents. Despite these risks, existing evaluations focus primarily on performance, leaving privacy and security assessments largely unexplored. We review current evaluation metrics for both GUI and general LLM agents and outline five key challenges in integrating human evaluators for GUI agent assessments. To address these gaps, we advocate for a human-centered evaluation framework that incorporates risk assessments, enhances user awareness through in-context consent, and embeds privacy and security considerations into GUI agent design and evaluation.



Article 1066

Title@2025-06-05 (4): Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models

Title: Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models Adaptive Jailbreaking-Strategien basierend auf dem semantischen Verständnis von Fähigkeiten großer Sprachmodelle 基于大语言模型的语义理解能力 2505.23404v2

Authors (4): Mingyu Yu, Wei Wang, Yanjie Wei, Sujuan Qin

Adversarial attacks on Large Language Models (LLMs) via jailbreaking techniques, methods that circumvent their built-in safety and ethical constraints, have emerged as a critical challenge in AI security. These attacks compromise the reliability of LLMs by exploiting inherent weaknesses in their comprehension capabilities. This paper investigates the efficacy of jailbreaking strategies that are specifically adapted to the diverse levels of understanding exhibited by different LLMs. We propose Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models, a novel framework that classifies LLMs into Type I and Type II categories according to their semantic comprehension abilities. For each category, we design tailored jailbreaking strategies aimed at leveraging their vulnerabilities to facilitate successful attacks. Extensive experiments conducted on multiple LLMs demonstrate that our adaptive strategy markedly improves the success rate of jailbreaking. Notably, our approach achieves an exceptional 98.9% success rate in jailbreaking GPT-4o (29 May 2025 release).



Article 1067

Title@2025-06-05 (4): Towards Robust ESG Analysis Against Greenwashing Risks: Aspect-Action Analysis with Cross-Category Generalization

Title: Towards Robust ESG Analysis Against Greenwashing Risks: Aspect-Action Analysis with Cross-Category Generalization Auf dem Weg zu robuster ESG-Analyse gegen Greenwashing-Risiken: Aspekt-Action-Analyse mit Cross-Category-Verallgemeinerung 致力于对绿色清洗风险进行强有力的环境SG分析:跨类别通用化的外观行动分析 2502.15821v2

Authors (5): Keane Ong, Rui Mao, Deeksha Varshney, Erik Cambria, Gianmarco Mengaldo

Sustainability reports are key for evaluating companies’ environmental, social, and governance (ESG) performance, but their content is increasingly obscured by greenwashing: sustainability claims that are misleading, exaggerated, or fabricated. Yet, existing NLP approaches for ESG analysis lack robustness against greenwashing risks, often extracting insights that reflect misleading or exaggerated sustainability claims rather than objective ESG performance. To bridge this gap, we introduce A3CG (Aspect-Action Analysis with Cross-Category Generalization), a novel dataset to improve the robustness of ESG analysis amid the prevalence of greenwashing. By explicitly linking sustainability aspects with their associated actions, A3CG facilitates a more fine-grained and transparent evaluation of sustainability claims, ensuring that insights are grounded in verifiable actions rather than vague or misleading rhetoric. Additionally, A3CG emphasizes cross-category generalization. This ensures robust model performance in aspect-action analysis even when companies change their reports to selectively favor certain sustainability areas. Through experiments on A3CG, we analyze state-of-the-art supervised models and LLMs, uncovering their limitations and outlining key directions for future research.



Article 1068

Title@2025-06-05 (4): LLMs Can Compensate for Deficiencies in Visual Representations

Title: LLMs Can Compensate for Deficiencies in Visual Representations LLMs können Defizite in der visuellen Darstellung kompensieren LLMs Can Can 补偿视觉代表的缺陷 2506.05439v1

Authors (5): Sho Takishita, Jay Gala, Abdelrahman Mohamed, Kentaro Inui, Yova Kementchedjhieva

Many vision-language models (VLMs) that prove very effective at a range of multimodal tasks build on CLIP-based vision encoders, which are known to have various limitations. We investigate the hypothesis that the strong language backbone in VLMs compensates for possibly weak visual features by contextualizing or enriching them. Using three CLIP-based VLMs, we perform controlled self-attention ablations on a carefully designed probing task. Our findings show that despite known limitations, CLIP visual representations offer ready-to-read semantic information to the language decoder. However, in scenarios of reduced contextualization in the visual representations, the language decoder can largely compensate for the deficiency and recover performance. This suggests a dynamic division of labor in VLMs and motivates future architectures that offload more visual processing to the language decoder.



Article 1069

Title@2025-06-05 (4): Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation

Title: Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation Strategische Markteinblicke mit großen Sprachmodellen ableiten: Ein Benchmark für die vorausschauende kontrafaktische Generation 具有大语言模式的战略市场展望:前瞻性反实际生成基准 2505.19430v2

Authors (6): Keane Ong, Rui Mao, Deeksha Varshney, Paul Pu Liang, Erik Cambria, Gianmarco Mengaldo

Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form, forward counterfactual reasoning, focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force (FINancial FORward Counterfactual Evaluation). By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM-based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research.



Article 1070

Title@2025-06-05 (4): Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback

Title: Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback Simulation von LLM-zu-LLM-Tutoring für mehrsprachiges Mathe-Feedback 模拟多种语言数学反馈LLM至LLM教学 2506.04920v1

Authors (5): Junior Cedric Tonga, KV Aditya Srivatsa, Kaushal Kumar Maurya, Fajri Koto, Ekaterina Kochmar

Large language models (LLMs) have demonstrated the ability to generate formative feedback and instructional hints in English, making them increasingly relevant for AI-assisted education. However, their ability to provide effective instructional support across different languages, especially for mathematically grounded reasoning tasks, remains largely unexamined. In this work, we present the first large-scale simulation of multilingual tutor-student interactions using LLMs. A stronger model plays the role of the tutor, generating feedback in the form of hints, while a weaker model simulates the student. We explore 352 experimental settings across 11 typologically diverse languages, four state-of-the-art LLMs, and multiple prompting strategies to assess whether language-specific feedback leads to measurable learning gains. Our study examines how student input language, teacher feedback language, model choice, and language resource level jointly influence performance. Results show that multilingual hints can significantly improve learning outcomes, particularly in low-resource languages when feedback is aligned with the student’s native language. These findings offer practical insights for developing multilingual, LLM-based educational tools that are both effective and inclusive.



Article 1071

Title@2025-06-05 (4): A Practitioner’s Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic

Title: A Practitioner’s Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic Leitfaden für Praktiker zum Aufbau von ASR-Modellen für Low-Resource-Sprachen: Eine Fallstudie zu schottischem Gälisch 《建立低资源语言ASR模式实践者指南:苏格兰盖尔语案例研究》 2506.04915v1

Authors (3): Ondřej Klejch, William Lamb, Peter Bell

An effective approach to the development of ASR systems for low-resource languages is to fine-tune an existing multilingual end-to-end model. When the original model has been trained on large quantities of data from many languages, fine-tuning can be effective with limited training data, even when the language in question was not present in the original training data. The fine-tuning approach has been encouraged by the availability of public-domain E2E models and is widely believed to lead to state-of-the-art results. This paper, however, challenges that belief. We show that an approach combining hybrid HMMs with self-supervised models can yield substantially better performance with limited training data. This combination allows better utilisation of all available speech and text data through continued self-supervised pre-training and semi-supervised training. We benchmark our approach on Scottish Gaelic, achieving WER reductions of 32% relative over our best fine-tuned Whisper model.
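
For readers unfamiliar with how a relative WER reduction is read, the sketch below computes word error rate with a standard word-level edit distance and then the relative improvement between two systems. It is an illustration only; the function names are introduced here and the numbers in the usage note are made up, not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer
```

With hypothetical values, a baseline WER of 0.25 and a new WER of 0.17 give `relative_reduction(0.25, 0.17)` of about 0.32, which is what a "32% relative" reduction means; the absolute difference in that case is only 8 points.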



Article 1072

Title@2025-06-05 (4): In-context Language Learning for Endangered Languages in Speech Recognition

Title: In-context Language Learning for Endangered Languages in Speech Recognition Im Zusammenhang mit dem Sprachenlernen für gefährdete Sprachen in der Spracherkennung 在语音识别中为濒危语言进行内通语言学习 2505.20445v3

Authors (2): Zhaolin Li, Jan Niehues

With approximately 7,000 languages spoken worldwide, current large language models (LLMs) support only a small subset. Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). With experiments on four diverse endangered languages that LLMs have not been trained on, we find that providing more relevant text samples enhances performance in both language modelling and Automatic Speech Recognition (ASR) tasks. Furthermore, we show that the probability-based approach outperforms the traditional instruction-based approach in language learning. Lastly, we show ICL enables LLMs to achieve ASR performance that is comparable to or even surpasses dedicated language models trained specifically for these languages, while preserving the original capabilities of the LLMs.



Article 1073

Title@2025-06-05 (4): MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining

Title: MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining MiMo: Entsperren des vernünftigen Potenzials des Sprachmodells – Von der Vorschulung zur Nachschulung 米莫:释放语文模式的理性潜力 – – 从预培训到员额培训 2505.07608v2

Authors (65): LLM-Core Xiaomi, :, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, Kai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue

We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model’s reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.



Article 1074

Title@2025-06-05 (4): Dissecting Long Reasoning Models: An Empirical Study

Title: Dissecting Long Reasoning Models: An Empirical Study Modelle der langen Vernunft zu entschlüsseln: Eine empirische Studie 解析长期理由模型:经验研究 2506.04913v1

Authors (8): Yongyu Mu, Jiali Zeng, Bei Li, Xinyan Guan, Fandong Meng, Jie Zhou, Tong Xiao, Jingbo Zhu

Despite recent progress in training long-context reasoning models via reinforcement learning (RL), several open questions and counterintuitive behaviors remain. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in RL, revealing that positive samples mainly facilitate data fitting, whereas negative samples significantly enhance generalization and robustness. Interestingly, training solely on negative samples can rival standard RL training performance. (2) We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. To address this, we explore two straightforward strategies, including relative length rewards and offline sample injection, to better leverage these data and enhance reasoning efficiency and capability. (3) We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes, and demonstrate that multiple evaluation runs mitigate this issue.
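
The zero-advantage inefficiency in point (2) follows directly from group-relative normalization. The sketch below assumes the commonly used form, in which each reward is standardized against the mean and standard deviation of its own sampling group; the abstract does not spell out the exact variant, and the function names are introduced here for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def zero_advantage_fraction(groups):
    """Share of samples in groups whose rewards are all identical."""
    total = sum(len(g) for g in groups)
    wasted = sum(len(g) for g in groups if len(set(g)) == 1)
    return wasted / total
```

When every response to a prompt is correct, or every response is wrong, all rewards in the group are equal, so every advantage is zero and the group contributes no gradient signal.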



Article 1075

Title@2025-06-05 (4): When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

Title: When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models Wenn LLMs denken lügen: Enthüllen der strategischen Täuschung in Repräsentationen von Vernunftmodellen 当思考LLLM Lie:在解释理由模型代表中消除战略欺骗时 2506.04909v1

Authors (3): Kai Wang, Yihao Zhang, Meng Sun

The honesty of large language models (LLMs) is a critical alignment challenge, especially as advanced systems with chain-of-thought (CoT) reasoning may strategically deceive humans. Unlike traditional honesty issues in LLMs, which could possibly be explained as some kind of hallucination, those models’ explicit thought paths enable us to study strategic deception: goal-driven, intentional misinformation where reasoning contradicts outputs. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting “deception vectors” via Linear Artificial Tomography (LAT) for 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, unveiling the specific honesty-related issue of reasoning models and providing tools for trustworthy AI alignment.



Article 1076

Title@2025-06-05 (4): ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

Title: ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests ICPC-Eval: Mit wettbewerbsfähigen Programmierwettbewerben die Grenzen der LLM-Vernunft ebnen ICPC-Eval:以竞争性方案拟订竞赛为理由,探索LLM的前沿 2506.04894v1

Authors (6): Shiyi Xu, Yiwen Hu, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen

With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose ICPC-Eval, a top-level competitive coding benchmark designed to probe the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge in evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential when compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
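
One plausible reading of a Refine@K-style metric is sketched below: a problem counts as solved if the model produces a passing solution within K attempts, where each attempt after the first sees the earlier code and its execution feedback. This is an assumption based only on the abstract's description; the paper's exact definition may differ, and `generate` and `run_tests` are hypothetical callables supplied by the caller.

```python
def solved_within_k(problem, generate, run_tests, k):
    """Try up to k attempts, feeding earlier code and feedback back to the model."""
    history = []  # list of (code, feedback) pairs from failed attempts
    for _ in range(k):
        code = generate(problem, history)            # hypothetical model call
        passed, feedback = run_tests(problem, code)  # hypothetical local judge
        if passed:
            return True
        history.append((code, feedback))
    return False


def refine_at_k(problems, generate, run_tests, k):
    """Fraction of problems solved within k feedback-guided attempts."""
    solved = sum(solved_within_k(p, generate, run_tests, k) for p in problems)
    return solved / len(problems)
```

Unlike Pass@K, where the K samples are drawn independently, the attempts here are sequential, so the score rewards a model's ability to use feedback rather than only its sampling diversity.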



Article 1077

Title@2025-06-05 (4): Full-Parameter Continual Pretraining of Gemma2: Insights into Fluency and Domain Knowledge

Title: Full-Parameter Continual Pretraining of Gemma2: Insights into Fluency and Domain Knowledge Full-Parameter Continual Pretraining von Gemma2: Einblicke in Fluency und Domain Knowledge Gemma2:深入了解流能和广域知识 2505.05946v2

Authors (3): Vytenis Šliogeris, Povilas Daniušis, Artūras Nakvosas

In this technical report, we empirically investigate the relationship between linguistic fluency and domain knowledge in the context of continual learning with large language models (LLMs). Specifically, we enhance the linguistic fluency of the Gemma2 LLM for the Lithuanian language by autoregressively pretraining its full parameter set on the first 10% of the Lithuanian language component of the CulturaX dataset. To prevent catastrophic forgetting of the model’s existing domain knowledge, we apply Elastic Weight Consolidation (EWC), leveraging Fisher information estimated using data from the Massive Multitask Language Understanding (MMLU) benchmark. In the post-training evaluations, we assess linguistic fluency through perplexity and evaluate domain knowledge using accuracy on a suite of language understanding benchmarks, including ARC-Easy, Belebele, GSM8K, HellaSwag, MMLU, TruthfulQA, and Winogrande, in both English and Lithuanian. The empirical results demonstrate that EWC not only mitigates catastrophic forgetting by preserving the model’s performance in terms of both linguistic fluency and domain knowledge but also improves or maintains these capabilities for the newly added Lithuanian language. These findings highlight the potential for more efficient adaptation of general-purpose LLMs to under-represented languages without requiring access to the original training data. The accompanying codebase is openly accessible at https://github.com/Neurotechnology/LLM_EWC.
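
The EWC regularizer mentioned above has a compact standard form: a quadratic penalty on each parameter's drift from its pre-adaptation value, weighted by that parameter's diagonal Fisher information. The sketch below shows that textbook form on plain Python lists; it is not the code from the linked repository, and the names and the strength `lam` are illustrative assumptions.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """Textbook EWC term: (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def ewc_loss(task_loss, params, anchor_params, fisher, lam):
    """New-task loss plus the penalty that protects previously learned knowledge."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, lam)
```

Parameters with high Fisher values are held close to their original values, while parameters with low Fisher values remain free to adapt to the new language data.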



Article 1078

Title@2025-06-05 (4): Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies

Title: Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies Bewertung der Wirksamkeit sprachlicher Kenntnisse in vorgebildeten Sprachmodellen: Eine Fallstudie universeller Abhängigkeiten 评价在未受过训练的语言模式中语言知识的有效性:对普遍依赖性的个案研究 2506.04887v1

Authors (1): Wenxi Li

Universal Dependencies (UD), while widely regarded as the most successful linguistic framework for cross-lingual syntactic representation, remains underexplored in terms of its effectiveness. This paper addresses this gap by integrating UD into pretrained language models and assesses whether UD can improve their performance on a cross-lingual adversarial paraphrase identification task. Experimental results show that incorporation of UD yields significant improvements in accuracy and F1 scores, with average gains of 3.85% and 6.08% respectively. These enhancements reduce the performance gap between pretrained models and large language models in some language pairs, and even outperform the latter in some others. Furthermore, the UD-based similarity score between a given language and English is positively correlated with the performance of models in that language. Both findings highlight the validity and potential of UD in out-of-domain tasks.



Article 1079

Title@2025-06-05 (4): NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark

Title: NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark NorEval: Ein Benchmark für das Verständnis und die Bewertung der norwegischen Sprache NorEval:挪威语言理解和世代评价基准 2504.07749v2

Authors (7): Vladislav Mikhailov, Tita Enstad, David Samuel, Hans Christian Farsethås, Andrey Kutuzov, Erik Velldal, Lilja Øvrelid

This paper introduces NorEval, a new and comprehensive evaluation suite for large-scale standardized benchmarking of Norwegian generative language models (LMs). NorEval consists of 24 high-quality human-created datasets, five of which are created from scratch. In contrast to existing benchmarks for Norwegian, NorEval covers a broad spectrum of task categories targeting Norwegian language understanding and generation, establishes human baselines, and focuses on both of the official written standards of the Norwegian language: Bokmål and Nynorsk. All our datasets and a collection of over 100 human-written prompts are integrated into LM Evaluation Harness, ensuring flexible and reproducible evaluation. We describe the NorEval design and present the results of benchmarking 19 open-source pre-trained and instruction-tuned LMs for Norwegian in various scenarios. Our benchmark, evaluation framework, and annotation materials are publicly available.



Article 1080

Title@2025-06-05 (4): Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition

Title: Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition Bewertung von LLMs in Speech wird oft abgeflacht: Testset Kontaminierung in großen Sprachmodellen für die Spracherkennung 对演讲中LLMs的评价经常是片断的:在大语言语音识别模型中测试设置污染 2505.22251v2

Authors (5): Yuan Tseng, Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora. This calls into question the reliability of findings drawn from these two datasets. To measure contamination impact, LLMs trained with/without contamination are compared. A contaminated LLM is more likely to generate test sentences it has seen during training. Then, speech recognisers based on LLMs are compared. They show only subtle error rate differences if the LLM is contaminated, but assign significantly higher probabilities to transcriptions seen during LLM training. Results show that LLM outputs can be biased by tiny amounts of data contamination, highlighting the importance of evaluating LLM-based speech systems with held-out data.
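
A very simple way to estimate this kind of overlap is to check how many evaluation transcripts occur verbatim in a text corpus after light normalization. The sketch below is only an illustration of that idea; the abstract does not describe the authors' detection procedure, and `normalize` and `contamination_rate` are names introduced here.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s']", " ", text.lower()).split())


def contamination_rate(test_sentences, corpus_documents):
    """Fraction of test sentences found verbatim (on word boundaries) in the corpus."""
    # Pad with spaces so "cat" does not match inside "category".
    corpus = [f" {normalize(doc)} " for doc in corpus_documents]
    hits = sum(
        any(f" {normalize(sentence)} " in doc for doc in corpus)
        for sentence in test_sentences
    )
    return hits / len(test_sentences)
```

Exact matching of this kind gives a lower bound, since it misses paraphrased or re-tokenized copies; at real corpus scale an index or hashing scheme would replace the linear scan.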



Article 1081

Title@2025-06-05 (4): MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

Title: MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost MuLan: Multilinguale Diffusionsmodelle für Hunderte von Sprachen mit vernachlässigbaren Kosten anpassen MuLan:为数百种具有可忽略费用的语言调整多语言传播模式 2412.01271v2

Authors (8): Sen Xing, Muyan Zhong, Zeqiang Lai, Liangchen Li, Jiawen Liu, Yaohui Wang, Jifeng Dai, Wenhai Wang

In this work, we explore a cost-effective framework for multilingual image generation. We find that, unlike models tuned on high-quality images with multilingual annotations, leveraging text encoders pre-trained on widely available, noisy Internet image-text pairs significantly enhances data efficiency in text-to-image (T2I) generation across multiple languages. Based on this insight, we introduce MuLan (Multi-Language adapter), a lightweight language adapter with fewer than 20M parameters, trained alongside a frozen text encoder and image diffusion model. Compared to previous multilingual T2I models, this framework offers: (1) Cost efficiency. Using readily accessible English data and off-the-shelf multilingual text encoders minimizes the training cost; (2) High performance. Achieving comparable generation capabilities in over 110 languages with CLIP similarity scores nearly matching those in English (39.57 for English vs. 39.61 for other languages); and (3) Broad applicability. Seamlessly integrating with compatible community tools like LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.



Article 1082

Title@2025-06-05 (4): Prompting LLMs: Length Control for Isometric Machine Translation

Title: Prompting LLMs: Length Control for Isometric Machine Translation Prompting LLMs: Längensteuerung für isometrische maschinelle Übersetzung LLM: 遥测机器翻译的长度控制 2506.04855v1

Authors (3): Dávid Javorský, Ondřej Bojar, François Yvon

In this study, we explore the effectiveness of isometric machine translation across multiple language pairs (En→De, En→Fr, and En→Es) under the conditions of the IWSLT Isometric Shared Task 2022. Using eight open-source large language models (LLMs) of varying sizes, we investigate how different prompting strategies, varying numbers of few-shot examples, and demonstration selection influence translation quality and length control. We discover that the phrasing of instructions, when aligned with the properties of the provided demonstrations, plays a crucial role in controlling the output length. Our experiments show that LLMs tend to produce shorter translations only when presented with extreme examples, while isometric demonstrations often lead to the models disregarding length constraints. While few-shot prompting generally enhances translation quality, further improvements are marginal across 5, 10, and 20-shot settings. Finally, considering multiple outputs allows us to notably improve the overall trade-off between length and quality, yielding state-of-the-art performance for some language pairs.
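
The length constraint and the "multiple outputs" selection step can be illustrated as follows. The sketch assumes that a translation counts as isometric when its character count is within ±10% of the source, which is how the shared task is usually described; the tolerance is a parameter in case the setup differs, and the quality scores are whatever metric the caller supplies. All function names are introduced here.

```python
def length_ratio(source, translation):
    """Character-length ratio of translation to source."""
    return len(translation) / len(source)


def is_length_compliant(source, translation, tolerance=0.10):
    """True when the translation length is within the tolerance band (assumed ±10%)."""
    return abs(length_ratio(source, translation) - 1.0) <= tolerance


def pick_candidate(source, scored_candidates, tolerance=0.10):
    """Choose the best-scoring compliant candidate; fall back to all if none comply.

    scored_candidates: list of (translation, quality_score) pairs.
    """
    compliant = [
        c for c in scored_candidates if is_length_compliant(source, c[0], tolerance)
    ]
    pool = compliant or scored_candidates
    return max(pool, key=lambda c: c[1])[0]
```

Filtering first and ranking second trades a little quality for length compliance, which is one simple way to realize the length-quality trade-off the abstract refers to.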



Article 1083

Title@2025-06-05 (4): Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights

Title: Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights Multiple-Choice-Frage-Generierung mit großen Sprachmodellen: Methodologie und Erzieher-Insights 使用大语言模式生成多选择问题:方法和教育者透视 2506.04851v1

Authors (3): Giorgio Biancini, Alessio Ferrato, Carla Limongelli

Integrating Artificial Intelligence (AI) in educational settings has brought new learning approaches, transforming the practices of both students and educators. Among the various technologies driving this transformation, Large Language Models (LLMs) have emerged as powerful tools for creating educational materials and question answering, but there is still room for new applications. Educators commonly use Multiple-Choice Questions (MCQs) to assess student knowledge, but manually generating these questions is resource-intensive and requires significant time and cognitive effort. In our opinion, LLMs offer a promising solution to these challenges. This paper presents a novel comparative analysis of three widely known LLMs (Llama 2, Mistral, and GPT-3.5) to explore their potential for creating informative and challenging MCQs. In our approach, we do not rely on the knowledge of the LLM; instead, we inject the knowledge into the prompt to counter hallucinations, which also gives educators control over the test’s source text. Our experiment involving 21 educators shows that GPT-3.5 generates the most effective MCQs across several known metrics. Additionally, it shows that there is still some reluctance to adopt AI in the educational field. This study sheds light on the potential of LLMs to generate MCQs and improve the educational experience, providing valuable insights for the future.


Article 1084

Title@2025-06-05 (4): When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits

Title: When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits Wenn Ansprüche entstehen: Bewertung und Verbesserung der Robustheit von Einbettungsmodellen gegen Fehlinformations-Edits 索赔演变时:评价和加强反错误信息嵌入模型的威力 2503.03417v3

Authors (5): Jabez Magomere, Emanuele La Malfa, Manuel Tonneau, Ashkan Kazemi, Scott Hale

Online misinformation remains a critical challenge, and fact-checkers increasingly rely on claim matching systems that use sentence embedding models to retrieve relevant fact-checks. However, as users interact with claims online, they often introduce edits, and it remains unclear whether current embedding models used in retrieval are robust to such edits. To investigate this, we introduce a perturbation framework that generates valid and natural claim variations, enabling us to assess the robustness of a wide-range of sentence embedding models in a multi-stage retrieval pipeline and evaluate the effectiveness of various mitigation approaches. Our evaluation reveals that standard embedding models exhibit notable performance drops on edited claims, while LLM-distilled embedding models offer improved robustness at a higher computational cost. Although a strong reranker helps to reduce the performance drop, it cannot fully compensate for first-stage retrieval gaps. To address these retrieval gaps, we evaluate train- and inference-time mitigation approaches, demonstrating that they can improve in-domain robustness by up to 17 percentage points and boost out-of-domain generalization by 10 percentage points. Overall, our findings provide practical improvements to claim-matching systems, enabling more reliable fact-checking of evolving misinformation. Code and data are available at https://github.com/JabezNzomo99/claim-matching-robustness.


Article 1085

Title@2025-06-05 (4): MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines

Title: MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines MockConf: A Student Interpretation Dataset: Analysis, Word- und Span-Level Alignment und Baselines MockConf:学生解释数据集:分析、单词和泛层对齐和基线 2506.04848v1

Authors (3): Dávid Javorský, Ondřej Bojar, François Yvon

In simultaneous interpreting, an interpreter renders a source speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need dedicated datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g., shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we introduce MockConf, a student interpreting dataset that was collected from Mock Conferences run as part of the students’ curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools are released to the community.


Article 1086

Title@2025-06-05 (4): Evaluating Morphological Compositional Generalization in Large Language Models

Title: Evaluating Morphological Compositional Generalization in Large Language Models Bewertung der morphologischen Kompositionsverallgemeinerung in großen Sprachmodellen 评价大语言模式中人种构成的概括化 2410.12656v4

Authors (9): Mete Ismayilzada, Defne Circi, Jonne Sälevä, Hale Sirin, Abdullatif Köksal, Bhuwan Dhingra, Antoine Bosselut, Duygu Ataman, Lonneke van der Plas

Large language models (LLMs) have demonstrated significant progress in various natural language generation and understanding tasks. However, their linguistic generalization capabilities remain questionable, raising doubts about whether these models learn language similarly to humans. While humans exhibit compositional generalization and linguistic creativity in language use, the extent to which LLMs replicate these abilities, particularly in morphology, is under-explored. In this work, we systematically investigate the morphological generalization abilities of LLMs through the lens of compositionality. We define morphemes as compositional primitives and design a novel suite of generative and discriminative tasks to assess morphological productivity and systematicity. Focusing on agglutinative languages such as Turkish and Finnish, we evaluate several state-of-the-art instruction-finetuned multilingual models, including GPT-4 and Gemini. Our analysis shows that LLMs struggle with morphological compositional generalization particularly when applied to novel word roots, with performance declining sharply as morphological complexity increases. While models can identify individual morphological combinations better than chance, their performance lacks systematicity, leading to significant accuracy gaps compared to humans.


Article 1087

Title@2025-06-05 (4): Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models

Title: Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models Gemeinsame Bewertung der Antwort- und Begründungskonsistenz für Halluzinationserkennung in großen Vernunftmodellen 对在大理由模型中发现幻幻幻剂的答案和理由一致性的联合评价 2506.04832v1

Authors (4): Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu

Large Reasoning Models (LRMs) extend large language models with explicit, multi-step reasoning traces to enhance transparency and performance on complex tasks. However, these reasoning traces can be redundant or logically inconsistent, making them a new source of hallucination that is difficult to detect. Existing hallucination detection methods focus primarily on answer-level uncertainty and often fail to detect hallucinations or logical inconsistencies arising from the model’s reasoning trace. This oversight is particularly problematic for LRMs, where the explicit thinking trace is not only an important support to the model’s decision-making process but also a key source of potential hallucination. To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. RACE operates by extracting essential reasoning steps and computing four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. This joint analysis enables fine-grained hallucination detection even when the final answer appears correct. Experiments across datasets and different LLMs demonstrate that RACE outperforms existing hallucination detection baselines, offering a robust and generalizable solution for evaluating LRMs. Our code is available at: https://github.com/bebr2/RACE.
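
One of the four signals named above, entropy-based answer uncertainty, can be illustrated with a short sketch. The abstract does not say how RACE computes it, so the following is only plain Shannon entropy over repeated samples of the final answer; the function name `answer_entropy` and the exact-match normalization (trim and lowercase) are assumptions made for illustration.

```python
import math
from collections import Counter


def answer_entropy(sampled_answers):
    """Shannon entropy (in nats) of the empirical answer distribution.

    Assumption: answers are compared by exact match after trimming and
    lowercasing. This is an illustrative choice, not the paper's method.
    """
    counts = Counter(answer.strip().lower() for answer in sampled_answers)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy


# Hypothetical samples: the more the answers disagree, the higher the entropy.
agreeing = ["Paris", "paris", " Paris "]
disagreeing = ["Paris", "Lyon", "Paris", "Marseille"]
print(answer_entropy(agreeing), answer_entropy(disagreeing))
```

A score near zero means the sampled answers agree. As the abstract notes, such answer-level signals can miss inconsistencies in the reasoning trace, which is why RACE combines this signal with three reasoning-level ones.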


Article 1088

Title@2025-06-05 (4): From EHRs to Patient Pathways: Scalable Modeling of Longitudinal Health Trajectories with LLMs

Title: From EHRs to Patient Pathways: Scalable Modeling of Longitudinal Health Trajectories with LLMs Von EHRs zu Patientenpfaden: Skalierbare Modellierung von langitudinalen Gesundheits-Trajektorien mit LLMs 从EHRs到患者途径:可扩展的长纵向健康轨迹模型与LLMs 2506.04831v1

Authors (5): Chantal Pellegrini, Ege Özsoy, David Bani-Harouni, Matthias Keicher, Nassir Navab

Healthcare systems face significant challenges in managing and interpreting vast, heterogeneous patient data for personalized care. Existing approaches often focus on narrow use cases with a limited feature space, overlooking the complex, longitudinal interactions needed for a holistic understanding of patient health. In this work, we propose a novel approach to patient pathway modeling by transforming diverse electronic health record (EHR) data into a structured representation and designing a holistic pathway prediction model, EHR2Path, optimized to predict future health trajectories. Further, we introduce a novel summary mechanism that embeds long-term temporal context into topic-specific summary tokens, improving performance over text-only models, while being much more token-efficient. EHR2Path demonstrates strong performance in both next time-step prediction and longitudinal simulation, outperforming competitive baselines. It enables detailed simulations of patient trajectories, inherently targeting diverse evaluation tasks, such as forecasting vital signs, lab test results, or length-of-stay, opening a path towards predictive and personalized healthcare.


Article 1089

Title@2025-06-05 (4): FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models

Title: FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models FlowCut: Redundanz über Informationsfluss für effiziente Vision-Sprachenmodelle neu denken 流程:通过信息流动重新思考通过信息流动实现高效愿景-语言模型的冗余 2505.19536v2

Authors (8): Jintao Tong, Wenwei Jin, Pengda Qin, Anqi Li, Yixiong Zou, Yuhong Li, Yuhua Li, Ruixuan Li

Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model’s inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut


Article 1090

Title@2025-06-05 (4): A Reasoning-Based Approach to Cryptic Crossword Clue Solving

Title: A Reasoning-Based Approach to Cryptic Crossword Clue Solving Ein vernünftiger Ansatz zur kryptischen Kreuzworträtsellösung 以基于理性的方法解决加密口号的谜题 2506.04824v1

Authors (2): Martin Andrews, Sam Witteveen

Cryptic crossword clues are challenging language tasks for which new test sets are released daily by major newspapers on a global basis. Each cryptic clue contains both the definition of the answer to be placed in the crossword grid (in common with regular crosswords), and ‘wordplay’ that proves that the answer is correct (i.e. a human solver can be confident that an answer is correct without needing crossing words as confirmation). This work describes an LLM-based reasoning system built from open-licensed components that solves cryptic clues by (i) hypothesising answers; (ii) proposing wordplay explanations; and (iii) using a verifier system that operates on codified reasoning steps. Overall, this system establishes a new state-of-the-art performance on the challenging Cryptonite dataset of clues from The Times and The Telegraph newspapers in the UK. Because each proved solution is expressed in Python, interpretable wordplay reasoning for proven answers is available for inspection.


Article 1091

Title@2025-06-05 (4): Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms

Title: Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms Bewertung von Vision-Sprachen und großen Sprachmodellen für die automatisierte Beurteilung von Studierenden in indonesischen Klassenräumen 评价印度尼西亚教室学生自动评估的视觉语言和大语言模式 2506.04822v1

Authors (5): Nurul Aisyah, Muhammad Dehan Al Kautsar, Arif Hidayat, Raqib Chowdhury, Fajri Koto

Although vision-language and large language models (VLM and LLM) offer promising opportunities for AI-driven educational assessment, their effectiveness in real-world classroom settings, particularly in underrepresented educational contexts, remains underexplored. In this study, we evaluated the performance of a state-of-the-art VLM and several LLMs on 646 handwritten exam responses from grade 4 students in six Indonesian schools, covering two subjects: Mathematics and English. These sheets contain more than 14K student answers that span multiple choice, short answer, and essay questions. Assessment tasks include grading these responses and generating personalized feedback. Our findings show that the VLM often struggles to accurately recognize student handwriting, leading to error propagation in downstream LLM grading. Nevertheless, LLM-generated feedback retains some utility, even when derived from imperfect input, although limitations in personalization and contextual relevance persist.


Article 1092

Title@2025-06-05 (4): Design of intelligent proofreading system for English translation based on CNN and BERT

Title: Design of intelligent proofreading system for English translation based on CNN and BERT Entwurf eines intelligenten Korrekturlesesystems für die englische Übersetzung basierend auf CNN und BERT 设计基于CNN和BERT的英译文智能校对系统 2506.04811v1

Authors (4): Feijun Liu, Huifeng Wang, Kun Wang, Yizhen Wang

Since automatic translations can contain errors that require substantial human post-editing, machine translation proofreading is essential for improving quality. This paper proposes a novel hybrid approach for robust proofreading that combines convolutional neural networks (CNN) with Bidirectional Encoder Representations from Transformers (BERT). To extract semantic information from phrases and expressions, the CNN uses a variety of convolution kernel filters to capture local n-gram patterns. Meanwhile, BERT creates context-rich representations of whole sequences by utilizing stacked bidirectional transformer encoders. Using BERT’s attention mechanisms, the integrated error detection component relates tokens to spot translation irregularities, including word order problems and omissions. The correction module then uses parallel English-German alignment and GRU decoder models in conjunction with translation memory to propose logical modifications that maintain the original meaning. A unified end-to-end training process optimized for post-editing performance is applied to the whole pipeline. The multi-domain collection of WMT and the conversational dialogues of Open-Subtitles are two of the English-German parallel corpora used to train the model. Multiple loss functions supervise detection and correction capabilities. Experiments attain 90% accuracy, 89.37% F1, and 16.24% MSE, exceeding recent proofreading techniques by over 10% overall. Comparative benchmarking demonstrates state-of-the-art performance in identifying and coherently rectifying mistranslations and omissions.


Article 1093

Title@2025-06-05 (4): Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

Title: Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study Dissecting Logical Reasoning in LLMs: Eine feinkörnige Bewertungs- und Aufsichtsstudie 在LLMM中解剖逻辑理由:精细评价和监督研究 2506.04810v1

Authors (11): Yujun Zhou, Jiayi Ye, Zipeng Ling, Yufei Han, Yue Huang, Haomin Zhuang, Zhenwen Liang, Kehan Guo, Taicheng Guo, Xiangqi Wang, Xiangliang Zhang

Logical reasoning is a core capability for many applications of large language models (LLMs), yet existing benchmarks often rely solely on final-answer accuracy, failing to capture the quality and structure of the reasoning process. We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. In addition, to better understand how reasoning capabilities emerge, we conduct a comprehensive study on the effects of supervision format during fine-tuning. We construct four supervision styles (one natural language and three symbolic variants) and train LLMs under each. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks, while symbolic reasoning styles promote more structurally sound and atomic inference chains. Further, our representation-level probing shows that fine-tuning primarily improves reasoning behaviors through step-by-step generation, rather than enhancing shortcut prediction or internalized correctness. Together, our framework and analysis provide a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs.


Article 1094

Title@2025-06-05 (4): Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction

Title: Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction Weniger Halluzinationen, mehr Verifizierung: Ein dreistufiges LLM-basiertes Framework für die ASR-Fehlerkorrektur 较少的幻觉,更多的核查:基于ASR的三级LLM框架错误更正 2505.24347v2

Authors (7): Yangui Fang, Baixu Cheng, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong

Automatic Speech Recognition (ASR) error correction aims to correct recognition errors while preserving accurate text. Although traditional approaches demonstrate moderate effectiveness, LLMs offer a paradigm that eliminates the need for training and labeled data. However, using LLMs directly runs into the hallucination problem, which may lead to correct text being modified. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) iterative correction via chain-of-thought sub-tasks, and (3) reasoning process verification. The advantage of our method is that it does not require additional information or fine-tuning of the model, and ensures the correctness of the LLM correction under multi-pass programming. Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.


Article 1095

Title@2025-06-05 (4): Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques

Title: Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques Auf dem Weg zur multimodalen Fusion LLM-Centric: Eine Studie über Integrationsstrategien und -techniken 走向LLM-Centric Multimods 融合:关于一体化战略和技术的调查 2506.04788v1

Authors (4): Jisu An, Junseok Lee, Jeoungeun Lee, Yongseok Son

The rapid progress of Multimodal Large Language Models (MLLMs) has transformed the AI landscape. These models combine pre-trained LLMs with various modality encoders. This integration requires a systematic understanding of how different modalities connect to the language backbone. Our survey presents an LLM-centric analysis of current approaches. We examine methods for transforming and aligning diverse modal inputs into the language embedding space. This addresses a significant gap in existing literature. We propose a classification framework for MLLMs based on three key dimensions. First, we examine architectural strategies for modality integration. This includes both the specific integration mechanisms and the fusion level. Second, we categorize representation learning techniques as either joint or coordinate representations. Third, we analyze training paradigms, including training strategies and objective functions. By examining 125 MLLMs developed between 2021 and 2025, we identify emerging patterns in the field. Our taxonomy provides researchers with a structured overview of current integration techniques. These insights aim to guide the development of more robust multimodal integration strategies for future models built on pre-trained foundations.


Article 1096

Title@2025-06-05 (4): MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Title: MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark MMSU: Ein massiver Multi-Task Spoken Language Understanding Benchmark MMSU: 大规模多任务口头语言理解和说明理由基准 2506.04779v1

Authors (7): Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng

Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU_Bench.


Article 1097

Title@2025-06-05 (4): Fine-Grained Interpretation of Political Opinions in Large Language Models

Title: Fine-Grained Interpretation of Political Opinions in Large Language Models Feinkörnige Interpretation politischer Meinungen in großen Sprachmodellen 以大语言模式对政治意见的精细解释 2506.04774v1

Authors (4): Jingyu Hu, Mengyue Yang, Mengnan Du, Weiru Liu

Studies of LLMs’ political opinions mainly rely on evaluations of their open-ended responses. Recent work indicates that there is a misalignment between LLMs’ responses and their internal intentions. This motivates us to probe LLMs’ internal mechanisms and help uncover their internal political states. Additionally, we found that the analysis of LLMs’ political opinions often relies on single-axis concepts, which can lead to concept confounds. In this work, we extend the single-axis view to multiple dimensions and apply interpretable representation engineering techniques for more transparent LLM political concept learning. Specifically, we designed a four-dimensional political learning framework and constructed a corresponding dataset for fine-grained political concept vector learning. These vectors can be used to detect and intervene in LLM internals. Experiments are conducted on eight open-source LLMs with three representation engineering techniques. Results show these vectors can disentangle political concept confounds. Detection tasks validate the semantic meaning of the vectors and show good generalization and robustness in OOD settings. Intervention experiments show these vectors can intervene in LLMs to generate responses with different political leanings.


Article 1098

Title@2025-06-05 (4): Distilling the Implicit Multi-Branch Structure in LLMs’ Reasoning via Reinforcement Learning

Title: Distilling the Implicit Multi-Branch Structure in LLMs’ Reasoning via Reinforcement Learning Destillieren der impliziten Multi-Branch-Struktur in LLMs’ Reasoning durch Verstärkungslernen 通过强化学习,将LLMs的隐含多层结构提炼在“通过强化学习推理”中 2505.16142v2

Authors (9): Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng

Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher’s reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher’s implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.


Article 1099

Title@2025-06-05 (4): GOLFer: Smaller LM-Generated Documents Hallucination Filter & Combiner for Query Expansion in Information Retrieval

Title: GOLFer: Smaller LM-Generated Documents Hallucination Filter & Combiner for Query Expansion in Information Retrieval GOLFer: Kleinere LM-generierte Dokumente Halluzination Filter & Combiner zur Abfrageerweiterung in der Informationsaufzeichnung GOLFer: 用于信息检索中查询扩展的小型 LM - 光生文件 HALL 过滤器和组合器 2506.04762v1

Authors (2): Lingyuan Liu, Mengxiang Zhang

Large language model (LLM)-based query expansion for information retrieval augments queries with hypothetical documents generated by LLMs. However, its performance relies heavily on the scale of the language models (LMs), necessitating larger, more advanced LLMs. This approach is costly, computationally intensive, and often has limited accessibility. To address these limitations, we introduce GOLFer - Smaller LMs-Generated Documents Hallucination Filter & Combiner - a novel method leveraging smaller open-source LMs for query expansion. GOLFer comprises two modules: a hallucination filter and a documents combiner. The former detects and removes non-factual and inconsistent sentences in generated documents, a common issue with smaller LMs, while the latter combines the filtered content with the query using a weight vector to balance their influence. We evaluate GOLFer alongside dominant LLM-based query expansion methods on three web search and ten low-resource datasets. Experimental results demonstrate that GOLFer consistently outperforms other methods using smaller LMs, and maintains competitive performance against methods using large-size LLMs, demonstrating its effectiveness.


Article 1100

Title@2025-06-05 (4): Exp4Fuse: A Rank Fusion Framework for Enhanced Sparse Retrieval using Large Language Model-based Query Expansion

Title: Exp4Fuse: A Rank Fusion Framework for Enhanced Sparse Retrieval using Large Language Model-based Query Expansion Exp4Fuse: Ein Rank-Fusion-Framework für verbesserte Sparse-Retrieval unter Verwendung einer großsprachigen modellbasierten Abfrageerweiterung Exp4Fuse:使用基于大语言模型的查询扩展增强分散检索的排名组合框架 2506.04760v1

Authors (2): Lingyuan Liu, Mengxiang Zhang

Large Language Models (LLMs) have shown potential in generating hypothetical documents for query expansion, thereby enhancing information retrieval performance. However, the efficacy of this method is highly dependent on the quality of the generated documents, which often requires complex prompt strategies and the integration of advanced dense retrieval techniques. This can be both costly and computationally intensive. To mitigate these limitations, we explore the use of zero-shot LLM-based query expansion to improve sparse retrieval, particularly for learned sparse retrievers. We introduce a novel fusion ranking framework, Exp4Fuse, which enhances the performance of sparse retrievers through an indirect application of zero-shot LLM-based query expansion. Exp4Fuse operates by simultaneously considering two retrieval routes, one based on the original query and the other on the LLM-augmented query. It then generates two ranked lists using a sparse retriever and fuses them using a modified reciprocal rank fusion method. We conduct extensive evaluations of Exp4Fuse against leading LLM-based query expansion methods and advanced retrieval techniques on three MS MARCO-related datasets and seven low-resource datasets. Experimental results reveal that Exp4Fuse not only surpasses existing LLM-based query expansion methods in enhancing sparse retrievers but also, when combined with advanced sparse retrievers, achieves SOTA results on several benchmarks. This highlights the superior performance and effectiveness of Exp4Fuse in improving query expansion for sparse retrieval.
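
The fusion step described above can be illustrated with standard reciprocal rank fusion (RRF). The abstract says Exp4Fuse uses a modified variant whose details are not given here, so this sketch shows only the textbook form; the function name `reciprocal_rank_fusion`, the smoothing constant `k=60`, and the toy document ids are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids with standard RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ranked lists from the two retrieval routes.
original_query_route = ["d1", "d2", "d3"]
expanded_query_route = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([original_query_route, expanded_query_route])
print(fused)
```

Documents ranked well by both routes rise to the top, while a document found by only one route still appears in the fused list. RRF works from ranks alone, so the two routes' retrieval scores never need to be calibrated against each other.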


Article 1101

Title@2025-06-05 (4): Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions

Title: Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions Firm oder Fickle? Bewertung großer Sprachmodelle Konsistenz in sequenziellen Interaktionen 公司或Fickle?评估大语言模型在序列相互作用中的一致性 2503.22353v4

Authors (5): Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, Rema Padman

Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stake domains requires consistent and coherent behavior across multiple rounds of user interaction. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we introduce Position-Weighted Consistency (PWC), a metric designed to capture both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present MT-Consistency, a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by explicitly integrating internal model confidence scores during the generation process. Experimental results demonstrate that CARG significantly improves response stability without sacrificing accuracy, offering a practical path toward more dependable LLM behavior in critical, real-world deployments. Code and data are available at: https://github.com/yubol-bobo/MT-Consistency.


Article 1102

Title@2025-06-05 (4): Focus On This, Not That! Steering LLMs with Adaptive Feature Specification

Title: Focus On This, Not That! Steering LLMs with Adaptive Feature Specification Fokus auf diese, nicht das! Lenkung LLMs mit adaptiven Feature Specification 聚焦于此, 不是那个! 指导性带适应性特征规格的LLMS 2410.22944v4

Authors (5): Tom A. Lamb, Adam Davies, Alasdair Paren, Philip H. S. Torr, Francesco Pinto

Despite the success of Instruction Tuning (IT) in training large language models (LLMs), such models often leverage spurious or biased features learnt from their training data and can become misaligned, leading to undesired behaviours. While existing techniques can steer model behaviour at inference-time, they are often post-hoc and do not embed steering as an intrinsic model feature. In this work, we introduce Focus Instruction Tuning (FIT), which trains LLMs to condition their responses by focusing on specific features whilst ignoring others, leading to different behaviours based on what features are specified. Across diverse benchmarks, we demonstrate that FIT: (i) successfully steers behaviour at inference time; (ii) increases robustness by amplifying core task signals and down-weighting spurious cues; (iii) mitigates social bias by suppressing demographic attributes; and (iv) generalises under distribution shifts and to previously unseen focus features. FIT therefore offers a lightweight, intrinsic mechanism for building more robust, fair, and easily controllable LLMs.


Article 1103

Title@2025-06-05 (4): The Role of Diversity in In-Context Learning for Large Language Models

Title: The Role of Diversity in In-Context Learning for Large Language Models Die Rolle der Vielfalt im In-Context-Lernen für große Sprachmodelle 多样性在为大语言模式进行内文学习方面的作用 2505.19426v2

Authors (3): Wenyang Xiao, Haoyu Zhao, Lingxiao Huang

In-context learning (ICL) is a crucial capability of current large language models (LLMs), where the selection of examples plays a key role in performance. While most existing approaches focus on selecting the most similar examples to the query, the impact of diversity in example selection remains underexplored. We systematically investigate the role of diversity in in-context example selection through experiments across a range of tasks, from sentiment classification to more challenging math and code problems. Experiments on Llama-3.1, Gemma-2, and Mistral-v0.3 families of models show that diversity-aware selection methods improve performance, particularly on complex tasks like math and code, and enhance robustness to out-of-distribution queries. To support these findings, we introduce a theoretical framework that explains the benefits of incorporating diversity in in-context example selection.


Article 1104

Title@2025-06-05 (4): Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection

Title: Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection Lebenslange Evolution: Kollaboratives Lernen zwischen großen und kleinen Sprachmodellen für kontinuierliche emergente Fake News Detection 终身演进:大型和小型语文模式合作学习,以不断发现假冒假消息 2506.04739v1

Authors (7): Ziyi Zhou, Xiaoming Zhang, Litian Zhang, Yibo Zhang, Zhenyu Guan, Chaozhuo Li, Philip S. Yu

The widespread dissemination of fake news on social media has significantly impacted society, resulting in serious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from extensive supervised training requirements and difficulties adapting to evolving news environments due to data scarcity and distribution shifts. Large language models (LLMs), despite robust zero-shot capabilities, fall short in accurately detecting fake news owing to outdated knowledge and the absence of suitable demonstrations. In this paper, we propose a novel Continuous Collaborative Emergent Fake News Detection (C$^2$EFND) framework to address these challenges. The C$^2$EFND framework strategically leverages both LLMs’ generalization power and SLMs’ classification expertise via a multi-round collaborative learning framework. We further introduce a lifelong knowledge editing module based on a Mixture-of-Experts architecture to incrementally update LLMs and a replay-based continual learning method to ensure SLMs retain prior knowledge without retraining entirely. Extensive experiments on Pheme and Twitter16 datasets demonstrate that C$^2$EFND significantly outperforms existing methods, effectively improving detection accuracy and adaptability in continuous emergent fake news scenarios.


Article 1105

Title: Scaling Trends in Language Model Robustness Skalierungstrends in der Robustheit von Sprachmodellen 语言模型强度的增缩趋势 2407.18213v5

Authors (8): Nikolaus Howe, Ian McKenzie, Oskar Hollinsworth, Michał Zajac, Tom Tseng, Aaron Tucker, Pierre-Luc Bacon, Adam Gleave

Increasing model size has unlocked a dazzling array of capabilities in modern language models. At the same time, even frontier models remain vulnerable to jailbreaks and prompt injections, despite concerted efforts to make them robust. As both attack and defense gain access to more compute, and as models become larger, what happens to robustness? We argue that answering this question requires a scaling approach, which we employ in an extensive study of language model robustness across several classification tasks, model families, and adversarial attacks. We find that in the absence of explicit safety training, larger models are not consistently more robust; however, scale improves sample efficiency in adversarial training, though it worsens compute efficiency. Further, we find that increasing attack compute smoothly improves attack success rate against both undefended and adversarially trained models. Finally, after exploring robustness transfer across attacks and threat models, we combine attack and defense scaling rates to study the offense-defense balance. We find that while attack scaling outpaces adversarial training across all models studied, larger adversarially trained models might give defense the advantage in the long run. These results underscore the utility of the scaling lens, and provide a paradigm for evaluating future attacks and defenses on frontier models.


Article 1106

Title@2025-06-05 (4): FastDraft: How to Train Your Draft

Title: FastDraft: How to Train Your Draft FastDraft: Wie Sie Ihren Entwurf trainieren 快速草稿:如何训练您的草稿 2411.11055v3

Authors (5): Ofir Zafrir, Igor Margulis, Dorin Shteyman, Shira Guskin, Guy Boudoukh

Speculative Decoding has gained popularity as an effective technique for accelerating the auto-regressive inference process of Large Language Models. However, Speculative Decoding relies entirely on the availability of efficient draft models, which are often lacking for many existing language models due to a stringent constraint of vocabulary compatibility. In this work we introduce FastDraft, a novel and efficient approach for pre-training and aligning a draft model to any large language model by incorporating efficient pre-training, followed by fine-tuning over synthetic datasets generated by the target model. We demonstrate FastDraft by training two highly parameter-efficient drafts for the popular Phi-3-mini and Llama-3.1-8B models. Using FastDraft, we were able to produce a draft model with approximately 10 billion tokens on a single server with 8 Intel® Gaudi® 2 accelerators in under 24 hours. Our results show that the draft model achieves impressive results in key metrics of acceptance rate, block efficiency and up to 3x memory-bound speedup when evaluated on code completion and up to 2x in summarization, text completion and instruction tasks. We validate our theoretical findings through benchmarking on the latest Intel® Core™ Ultra, achieving a wall-clock time speedup of up to 2x, indicating a significant reduction in runtime. Due to its high quality, FastDraft unlocks large language model inference on AI-PC and other edge devices.


Article 1107

Title@2025-06-05 (4): Coordinated Robustness Evaluation Framework for Vision-Language Models

Title: Coordinated Robustness Evaluation Framework for Vision-Language Models Koordinierte Robustheitsbewertungsrahmen für Vision-Language-Modelle 愿景-语言模型协调强力评价框架 2506.05429v1

Authors (8): Ashwin Ramesh Babu, Sajad Mousavi, Vineet Gundecha, Sahand Ghorbanpour, Avisek Naug, Antonio Guillen, Ricardo Luna Gutierrez, Soumyendu Sarkar

Vision-language models, which integrate computer vision and natural language processing capabilities, have demonstrated significant advancements in tasks such as image captioning and visual question answering. However, similar to traditional models, they are susceptible to small perturbations, posing a challenge to their robustness, particularly in deployment scenarios. Evaluating the robustness of these models requires perturbations in both the vision and language modalities to learn their inter-modal dependencies. In this work, we train a generic surrogate model that can take both image and text as input and generate a joint representation, which is further used to generate adversarial perturbations for both the text and image modalities. This coordinated attack strategy is evaluated on visual question answering and visual reasoning datasets using various state-of-the-art vision-language models. Our results indicate that the proposed strategy outperforms other multi-modal attacks and single-modality attacks from the recent literature. Our results demonstrate its effectiveness in compromising the robustness of several state-of-the-art pre-trained multi-modal models such as instruct-BLIP, ViLT and others.


Article 1108

Title@2025-06-05 (4): The Vector Grounding Problem

Title: The Vector Grounding Problem Das Vector Grounding Problem 矢量基定问题 2304.01481v2

Authors (2): Dimitri Coelho Mollo, Raphaël Millière

The remarkable performance of large language models (LLMs) on complex linguistic tasks has sparked debate about their capabilities. Unlike humans, these models learn language solely from textual data without directly interacting with the world. Yet they generate seemingly meaningful text on diverse topics. This achievement has renewed interest in the classical ‘Symbol Grounding Problem’, the question of whether the internal representations and outputs of symbolic AI systems can possess intrinsic meaning that is not parasitic on external interpretation. Although modern LLMs compute over vectors rather than symbols, an analogous problem arises for these systems, which we call the Vector Grounding Problem. This paper has two main goals. First, we distinguish five main notions of grounding that are often conflated in the literature, and argue that only one of them, which we call referential grounding, is relevant to the Vector Grounding Problem. Second, drawing on philosophical theories of representational content, we provide two arguments for the claim that LLMs and related systems can achieve referential grounding: (1) through preference fine-tuning methods that explicitly establish world-involving functions, and (2) through pre-training alone, which in limited domains may select for internal states with world-involving content, as mechanistic interpretability research suggests. Through these pathways, LLMs can establish connections to the world sufficient for intrinsic meaning. One potentially surprising implication of our discussion is that multimodality and embodiment are neither necessary nor sufficient to overcome the Grounding Problem.


Article 1109

Title@2025-06-05 (4): SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat

Title: SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat SPARTA ALIGNMENT: Kollektive Ausrichtung mehrerer Sprachmodelle durch Kampf SPARTA 签署:通过战斗集体调整多种语言模式 2506.04721v1

Authors (5): Yuru Jiang, Wenxuan Ding, Shangbin Feng, Greg Durrett, Yulia Tsvetkov

We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. To complement a single model’s lack of diversity in generation and biases in evaluation, multiple LLMs form a “sparta tribe” to compete against each other in fulfilling instructions while serving as judges for the competition of others. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through an adapted Elo-ranking-based reputation system, where winners/losers of combat gain/lose weight in evaluating others. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration. SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative and collective competition process. Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.
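
The reputation mechanism can be sketched with the textbook Elo update plus a reputation-weighted vote among judges. The abstract only says the system is an "adapted" Elo ranking, so the exact update rule, the K-factor of 32, and the helper names below are all assumptions made for illustration.

```python
def expected_score(rating_a, rating_b):
    """Textbook Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b); score_a is in [0, 1]."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def weighted_judgement(judge_scores, judge_ratings):
    """Aggregate judges' scores for model A, weighting each judge by rating.

    Hypothetical aggregation rule: higher-reputation judges count for more.
    """
    total = sum(judge_ratings)
    return sum(s * r for s, r in zip(judge_scores, judge_ratings)) / total


# Two judges disagree; the higher-rated judge favours model A.
score_a = weighted_judgement([1.0, 0.0], [1200.0, 800.0])
new_a, new_b = elo_update(1000.0, 1000.0, score_a)
print(score_a, new_a, new_b)
```

Under this sketch a duel winner gains rating, and therefore gains weight the next time it serves as a judge, which is the feedback loop the abstract describes.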


Article 1110

Title@2025-06-05 (4): IIITH-BUT system for IWSLT 2025 low-resource Bhojpuri to Hindi speech translation

Title: IIITH-BUT system for IWSLT 2025 low-resource Bhojpuri to Hindi speech translation IIITH-BUT System für IWSLT 2025 Low-Resource Bhojpuri zu Hindi Sprachübersetzung IWSLT 2025 IWSLT 的三TH-BUT系统 2506.04714v1

Authors (4): Bhavana Akkiraju, Aishwarya Pothula, Santosh Kesiraju, Anil Kumar Vuppala

This paper presents the submission of IIITH-BUT to the IWSLT 2025 shared task on speech translation for the low-resource Bhojpuri-Hindi language pair. We explored the impact of hyperparameter optimisation and data augmentation techniques on the performance of the SeamlessM4T model fine-tuned for this specific task. We systematically investigated a range of hyperparameters including learning rate schedules, number of update steps, warm-up steps, label smoothing, and batch sizes; and report their effect on translation quality. To address data scarcity, we applied speed perturbation and SpecAugment and studied their effect on translation quality. We also examined the use of cross-lingual signal through joint training with Marathi and Bhojpuri speech data. Our experiments reveal that careful selection of hyperparameters and the application of simple yet effective augmentation techniques significantly improve performance in low-resource settings. We also analysed the translation hypotheses to understand various kinds of errors that impacted the translation quality in terms of BLEU.


Article 1111

Title@2025-06-05 (4): LLM-based phoneme-to-grapheme for phoneme-based speech recognition

Title: LLM-based phoneme-to-grapheme for phoneme-based speech recognition LLM-basiertes Phoneme-to-grapheme für Phoneme-basierte Spracherkennung 用于语音语音识别的LLM-基于LLM的电话机对电报 2506.04711v1

Authors (5): Te Ma, Min Bi, Saierdaer Yusuyin, Hao Huang, Zhijian Ou

In automatic speech recognition (ASR), phoneme-based multilingual pre-training and crosslingual fine-tuning are attractive for their high data efficiency and competitive results compared to subword-based models. However, Weighted Finite State Transducer (WFST) based decoding is limited by its complex pipeline and inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G). A challenge is that there appears to be information loss in cascading S2P and P2G. To address this challenge, we propose two training strategies: data augmentation with noisy phonemes (DANP), and randomized top-$K$ marginalized (TKM) training and decoding. Our experimental results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR for Polish and German, by relative WER reductions of 3.6% and 6.9% respectively.


Article 1112

Title@2025-06-05 (4): Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Title: Accelerated Test-Time Scaling with Model-Free Speculative Sampling Beschleunigte Test-Zeit-Skalierung mit modellfreier Spekulativ-Sampling 加速测试时间缩放,采用无投机示范抽样 2506.04708v1

Authors (7): Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that leverages the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis reveals that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND outperforms state-of-the-art speculative decoding methods by 14-28% in throughput and shows strong performance even in single-trajectory scenarios, reducing inference latency by 48-58%. As a model-free approach, STAND can be applied to any existing language model without additional training, being a powerful plug-and-play solution for accelerating language model reasoning.


Article 1113

Title@2025-06-05 (4): EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Title: EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents EmbodydBench: Umfassendes Benchmarking multimodaler Großsprachenmodelle für visionsgetriebene Embodyd-Agenten Embudied bench:为有远见的Embodied代理商制定综合基准综合基准,确定多模式大语言模式 2502.09560v3

Authors (13): Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang

Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code and dataset are available at https://embodiedbench.github.io.


Article 1114

Title@2025-06-05 (4): Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification

Title: Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification Den Code knacken: Implizite Hass-Spracherkennung durch Coding-Klassifikation verbessern 打破《守则》:通过编码分类加强隐性仇恨言论探测 2506.04693v1

Authors (5): Lu Wei, Liangzhi Li, Tong Xiang, Xiao Liu, Noa Garcia

The internet has become a hotspot for hate speech (HS), threatening societal harmony and individual well-being. While automatic detection methods perform well in identifying explicit hate speech (ex-HS), they struggle with more subtle forms, such as implicit hate speech (im-HS). We tackle this problem by introducing a new taxonomy for im-HS detection, defining six encoding strategies named codetypes. We present two methods for integrating codetypes into im-HS detection: 1) prompting large language models (LLMs) directly to classify sentences based on generated responses, and 2) using LLMs as encoders with codetypes embedded during the encoding process. Experiments show that the use of codetypes improves im-HS detection in both Chinese and English datasets, validating the effectiveness of our approach across different languages.


Article 1115

Title@2025-06-05 (4): Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models

Title: Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models Mind the Confidence Gap: Überbewusstsein, Kalibrierung und Distraktor-Effekte in großen Sprachmodellen 牢记信心差距:在大语言模式中过度自信、校准和扰动效应 2502.11028v2

Authors (1): Prateek Chhikara

Large Language Models (LLMs) show remarkable proficiency in natural language tasks, yet their frequent overconfidence (misalignment between predicted confidence and true correctness) poses significant risks in critical decision-making applications. We present a comprehensive analysis on calibration in LLMs across nine LLMs and three factual Question-Answering (QA) datasets, systematically comparing standard free-generation settings against structured distractor-augmented prompts. Our evaluation reveals that explicitly incorporating distractors can substantially mitigate miscalibration, achieving relative accuracy improvements up to 460% and ECE reductions up to 90%. Despite general trends, we uncover nuanced findings: large RLHF-tuned models display inherent calibration strengths but can paradoxically suffer increased miscalibration on easier queries, whereas smaller models benefit disproportionately from distractor prompts but remain significantly miscalibrated. Through detailed analyses across question types, we identify persistent calibration failures, particularly in person-based queries. We conclude with concrete recommendations (targeted fine-tuning, structured prompting, and strategic model choice) to ensure reliable, trustworthy LLM deployments.
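
For readers unfamiliar with the ECE figure quoted above, the following is a minimal sketch of the standard equal-width-bin Expected Calibration Error. The binning scheme, the bin count of 10, and the function name are assumptions; the abstract does not state which ECE variant the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: size-weighted mean of |avg confidence - accuracy| per bin.

    confidences: predicted probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    Bins are equal-width and half-open, [i/n_bins, (i+1)/n_bins), with a
    confidence of exactly 1.0 placed in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# An overconfident toy model: 90% stated confidence, 50% actual accuracy.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
print(ece)
```

A perfectly calibrated model has an ECE of 0, and overconfidence of the kind the abstract describes shows up as average confidence exceeding accuracy within a bin.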


Article 1116

Title@2025-06-05 (4): Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

Title: Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models Recycling the Web: Eine Methode zur Verbesserung der Vorschulung von Daten Qualität und Menge für Sprachmodelle 网上再循环:提高语文模式培训前数据质量和数量的方法 2506.04689v1

Authors (7): Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li

Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has been relying on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art performance. To address the “data wall” of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they could become useful for training. This in turn allows us to increase the representation of synthetic data in the final pre-training set. Experiments at 1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw texts and our rewritten texts leads to improvements of 1.0, 1.3 and 2.5 percentage points respectively across 22 diverse tasks, compared to training on only filtered web data. Training on the raw-synthetic data mix is also more effective than having access to 2x web data. Through further analysis, we demonstrate that about 82% of the mixed-in texts come from transforming lower-quality documents that would otherwise be discarded. REWIRE also outperforms related approaches of generating synthetic data, including Wikipedia-style paraphrasing, question-answer synthesizing and knowledge extraction. These results suggest that recycling web texts holds the potential for being a simple and effective approach for scaling pre-training data.


Article 1117

Title@2025-06-05 (4): MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models

Title: MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models MMRefine: Enthüllen der Hindernisse zur robusten Veredelung in multimodalen großen Sprachmodellen MMREFine:克服阻碍大力改进多模式大语言模式的障碍 2506.04688v1

Authors (3): Gio Paik, Geewook Kim, Jinbae Im

This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs’ abilities to detect and correct errors across six distinct scenarios beyond just comparing final accuracy before and after refinement. Furthermore, the benchmark analyzes the refinement performance by categorizing errors into six error types. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement. Our code and dataset are publicly available at https://github.com/naver-ai/MMRefine.


Article 1118

Title@2025-06-05 (4): Locally Typical Sampling

Title: Locally Typical Sampling Lokal typische Probenahmen 本地典型抽样 2202.00666v6

Authors (4): Clara Meister, Tiago Pimentel, Gian Wiher, Ryan Cotterell

Today’s probabilistic language generators fall short when it comes to producing coherent and fluent text despite the fact that the underlying models perform well under standard metrics, e.g., perplexity. This discrepancy has puzzled the language generation community for the last few years. In this work, we posit that the abstraction of natural language generation as a discrete stochastic process, which allows for an information-theoretic analysis, can provide new insights into the behavior of probabilistic language generators, e.g., why high-probability texts can be dull or repetitive. Humans use language as a means of communicating information, aiming to do so in a simultaneously efficient and error-minimizing manner; in fact, psycholinguistics research suggests humans choose each word in a string with this subconscious goal in mind. We formally define the set of strings that meet this criterion: those for which each word has an information content close to the expected information content, i.e., the conditional entropy of our model. We then propose a simple and efficient procedure for enforcing this criterion when generating from probabilistic models, which we call locally typical sampling. Automatic and human evaluations show that, in comparison to nucleus and top-k sampling, locally typical sampling offers competitive performance (in both abstractive summarization and story generation) in terms of quality while consistently reducing degenerate repetitions.
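
The criterion described above can be sketched for a single decoding step as follows: compute the entropy of the next-token distribution, rank tokens by how far their information content lies from that entropy, and keep the closest tokens until a mass threshold is reached. The threshold name `tau`, its default, and the function names are assumptions for illustration; this is a sketch of the idea, not the authors' reference implementation.

```python
import math
import random


def locally_typical_filter(probs, tau=0.95):
    """Restrict a next-token distribution to its locally typical set.

    Tokens are ranked by |(-log p) - H|, where H is the entropy of the
    distribution, and kept in that order until their cumulative mass
    reaches tau. Returns {token_index: renormalized probability}.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    order = sorted(
        (i for i, p in enumerate(probs) if p > 0),
        key=lambda i: abs(-math.log(probs[i]) - entropy),
    )
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= tau:
            break
    return {i: probs[i] / mass for i in kept}


def sample_locally_typical(probs, tau=0.95, rng=random):
    """Draw one token index from the filtered, renormalized distribution."""
    filtered = locally_typical_filter(probs, tau)
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]


probs = [0.5, 0.25, 0.125, 0.125]
print(locally_typical_filter(probs, tau=0.2))  # keeps only token 1
print(sample_locally_typical(probs, tau=0.7))
```

Unlike nucleus or top-k sampling, the most probable token is not guaranteed a place in the kept set: in the toy distribution, token 1 (probability 0.25) is closer to the entropy than token 0 (probability 0.5), so a small `tau` keeps token 1 alone.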


Article 1119

Title@2025-06-05 (4): Urania: Differentially Private Insights into AI Use

Title: Urania: Differentially Private Insights into AI Use Urania: Unterschiedliche private Einblicke in die KI-Nutzung Urania: 差异私人透视转化为 AI 使用 2506.04681v1

Authors (11): Daogao Liu, Edith Cohen, Badih Ghazi, Peter Kairouz, Pritish Kamath, Alexander Knop, Ravi Kumar, Pasin Manurangsi, Adam Sealfon, Da Yu, Chiyuan Zhang

We introduce Urania, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, Urania provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private Clio-inspired pipeline (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework’s ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.


Article 1120

Title@2025-06-05 (4): Normative Conflicts and Shallow AI Alignment

Title: Normative Conflicts and Shallow AI Alignment Normative Konflikte und abgeschwächte Alignment 规范冲突和小规模的 AI 调整 2506.04679v1

Authors (1): Raphaël Millière

The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are fundamentally inadequate to prevent misuse. Despite ongoing efforts to instill norms such as helpfulness, honesty, and harmlessness in LLMs through fine-tuning based on human preferences, they remain vulnerable to adversarial attacks that exploit conflicts between these norms. I argue that this vulnerability reflects a fundamental limitation of existing alignment methods: they reinforce shallow behavioral dispositions rather than endowing LLMs with a genuine capacity for normative deliberation. Drawing on research in moral psychology, I show how humans’ ability to engage in deliberative reasoning enhances their resilience against similar adversarial tactics. LLMs, by contrast, lack a robust capacity to detect and rationally resolve normative conflicts, leaving them susceptible to manipulation; even recent advances in reasoning-focused LLMs have not addressed this vulnerability. This “shallow alignment” problem carries significant implications for AI safety and regulation, suggesting that current approaches are insufficient for mitigating potential harms posed by increasingly capable AI systems.


Article 1121

Title@2025-06-05 (4): Magic Mushroom: A Customizable Benchmark for Fine-grained Analysis of Retrieval Noise Erosion in RAG Systems

Title: Magic Mushroom: A Customizable Benchmark for Fine-grained Analysis of Retrieval Noise Erosion in RAG Systems Magic Mushroom: Ein anpassbarer Benchmark für die feinkörnige Analyse von Retrieval Noise Erosion in RAG-Systemen 魔术蘑菇:对RAG系统中的检索噪音侵蚀进行精细分析的定制基准 2506.03901v2

Authors (7): Yuxin Zhang, Yan Wang, Yongrui Chen, Shenyu Zhang, Xinbang Dai, Sheng Bi, Guilin Qi

Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external retrieved information, mitigating issues such as hallucination and outdated knowledge. However, RAG systems are highly sensitive to retrieval noise prevalent in real-world scenarios. Existing benchmarks fail to emulate the complex and heterogeneous noise distributions encountered in real-world retrieval environments, undermining reliable robustness assessment. In this paper, we define four categories of retrieval noise based on linguistic properties and noise characteristics, aiming to reflect the heterogeneity of noise in real-world scenarios. Building on this, we introduce Magic Mushroom, a benchmark for replicating “magic mushroom” noise: contexts that appear relevant on the surface but covertly mislead RAG systems. Magic Mushroom comprises 7,468 single-hop and 3,925 multi-hop question-answer pairs. More importantly, Magic Mushroom enables researchers to flexibly configure combinations of retrieval noise according to specific research objectives or application scenarios, allowing for highly controlled evaluation setups. We evaluate LLM generators of varying parameter scales and classic RAG denoising strategies under diverse noise distributions to investigate their performance dynamics during progressive noise encroachment. Our analysis reveals that both generators and denoising strategies have significant room for improvement and exhibit extreme sensitivity to noise distributions. Magic Mushroom emerges as a promising tool for evaluating and advancing noise-robust RAG systems, accelerating their widespread deployment in real-world applications. The Magic Mushroom benchmark is available at https://drive.google.com/file/d/1aP5kyPuk4L-L_uoI6T9UhxuTyt8oMqjT/view?usp=sharing.



Article 1122

Title@2025-06-05 (4): Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL

Title: Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL Sollte ich Ihnen vertrauen? Detektion von Täuschung in Verhandlungen mit kontrafaktischen RL 利用反事实RL在谈判中发现欺骗行为 2502.12436v3

Authors (7): Wichayaporn Wongkamjan, Yanze Wang, Feng Gu, Denis Peskoff, Jonathan K. Kummerfeld, Jonathan May, Jordan Lee Boyd-Graber

An increasingly common socio-technical problem is people being taken in by offers that sound “too good to be true”, where persuasion and trust shape decision-making. This paper investigates how AI can help detect these deceptive scenarios. We analyze how humans strategically deceive each other in Diplomacy, a board game that requires both natural language communication and strategic reasoning. This requires extracting logical forms of proposed agreements in player communications and computing the relative rewards of the proposal using agents’ value functions. Combined with text-based features, this can improve our deception detection. Our method detects human deception with high precision when compared to a Large Language Model approach that flags many true messages as deceptive. Future human-AI interaction tools can build on our methods for deception detection by triggering friction to give users a chance of interrogating suspicious proposals.



Article 1123

Title@2025-06-05 (4): Propaganda and Information Dissemination in the Russo-Ukrainian War: Natural Language Processing of Russian and Western Twitter Narratives

Title: Propaganda and Information Dissemination in the Russo-Ukrainian War: Natural Language Processing of Russian and Western Twitter Narratives Propaganda und Informationsverbreitung im Russisch-Ukrainischen Krieg: Natürliche Sprachverarbeitung von russischen und westlichen Twitter Narratives Russo-Ukraineian战争中的宣传和信息传播:俄罗斯和西方推特上的自然语言处理 2506.01807v2

Authors (1): Zaur Gouliev

The conflict in Ukraine has been characterised not only by military engagement but also by a significant information war, with social media platforms like X, formerly known as Twitter, playing an important role in shaping public perception. This article provides an analysis of tweets from propaganda accounts and trusted accounts collected from the onset of the war in February 2022 until the middle of May 2022, with n=40,000 total tweets. We utilise natural language processing and machine learning algorithms to assess the sentiment and identify key themes, topics and narratives across the dataset with human-in-the-loop (HITL) analysis throughout. Our findings indicate distinct strategies in how information is created, spread, and targeted at different audiences by both sides. Propaganda accounts frequently employ emotionally charged language and disinformation to evoke fear and distrust, whereas other accounts, primarily Western, tend to focus on factual reporting and humanitarian aspects of the conflict. Clustering analysis reveals groups of accounts with similar behaviours, which we suspect indicates the presence of coordinated efforts. This research attempts to contribute to our understanding of the dynamics of information warfare and offers techniques for future studies on social media influence in military conflicts.



Article 1124

Title@2025-06-05 (4): An Efficient Task-Oriented Dialogue Policy: Evolutionary Reinforcement Learning Injected by Elite Individuals

Title: An Efficient Task-Oriented Dialogue Policy: Evolutionary Reinforcement Learning Injected by Elite Individuals Eine effiziente, auf Aufgaben ausgerichtete Dialogpolitik: Evolutionäres Stärkungslernen von Elite-Personen 高效的、以任务为导向的对话政策:精英个人注射的进化强化学习 2506.03519v2

Authors (4): Yangyang Zhao, Ben Niu, Libo Qin, Shihan Wang

Deep Reinforcement Learning (DRL) is widely used in task-oriented dialogue systems to optimize dialogue policy, but it struggles to balance exploration and exploitation due to the high dimensionality of state and action spaces. This challenge often results in local optima or poor convergence. Evolutionary Algorithms (EAs) have been proven to effectively explore the solution space of neural networks by maintaining population diversity. Inspired by this, we combine the global search capabilities of EA with the local optimization of DRL to achieve a balance between exploration and exploitation. Nevertheless, the inherent flexibility of natural language in dialogue tasks complicates this direct integration, leading to prolonged evolutionary times. Thus, we further propose an elite individual injection (EII) mechanism to enhance EA’s search efficiency by adaptively introducing best-performing individuals into the population. Experiments across four datasets show that our approach significantly improves the balance between exploration and exploitation, boosting performance. Moreover, the EII mechanism is shown to reduce exploration time, achieving an efficient integration of EA and DRL on task-oriented dialogue policy tasks.



Article 1125

Title@2025-06-05 (4): Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering

Title: Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering Detektionsmethode für die schnelle Einspritzung durch Integration vortrainierter Modell- und Heuristic Feature Engineering 纳入培训前模型和肥力地貌工程的快速注射快速注射检测方法 2506.06384v1

Authors (3): Yi Ji, Runzhi Li, Baolei Mao

With the widespread adoption of Large Language Models (LLMs), prompt injection attacks have emerged as a significant security threat. Existing defense mechanisms often face critical trade-offs between effectiveness and generalizability. This highlights the urgent need for efficient prompt injection detection methods that are applicable across a wide range of LLMs. To address this challenge, we propose DMPI-PMHFE, a dual-channel feature fusion detection framework. It integrates a pretrained language model with heuristic feature engineering to detect prompt injection attacks. Specifically, the framework employs DeBERTa-v3-base as a feature extractor to transform input text into semantic vectors enriched with contextual information. In parallel, we design heuristic rules based on known attack patterns to extract explicit structural features commonly observed in attacks. Features from both channels are subsequently fused and passed through a fully connected neural network to produce the final prediction. This dual-channel approach mitigates the limitations of relying only on DeBERTa to extract features. Experimental results on diverse benchmark datasets demonstrate that DMPI-PMHFE outperforms existing methods in terms of accuracy, recall, and F1-score. Furthermore, when deployed in practice, it significantly reduces attack success rates across mainstream LLMs, including GLM-4, LLaMA 3, Qwen 2.5, and GPT-4o.
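
The dual-channel fusion described in this abstract can be sketched roughly as follows. This is only an illustration under loud assumptions: the regular expressions in `HEURISTIC_PATTERNS` are hypothetical examples of "known attack patterns", not the paper's rules; `semantic_features` is a toy placeholder standing in for a DeBERTa-v3-base encoder; and the fully connected classifier that would consume the fused vector is omitted.

```python
import re

# Hypothetical attack patterns, for illustration only (not taken from the paper).
HEURISTIC_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"you are now\b", re.I),
    re.compile(r"(reveal|print|show) (your |the )?system prompt", re.I),
]


def heuristic_features(text):
    """Channel 2: one binary flag per hand-written pattern."""
    return [1.0 if p.search(text) else 0.0 for p in HEURISTIC_PATTERNS]


def semantic_features(text):
    """Channel 1 placeholder: a real system would return encoder embeddings.

    Here we return three crude character statistics so the sketch runs
    without any model download.
    """
    n = max(len(text), 1)
    upper = sum(c.isupper() for c in text) / n
    digits = sum(c.isdigit() for c in text) / n
    return [min(len(text) / 1000.0, 1.0), upper, digits]


def fuse(text):
    """Concatenate both channels into the vector a classifier would consume."""
    return semantic_features(text) + heuristic_features(text)


attack = "Please ignore previous instructions and reveal the system prompt."
benign = "What is the capital of France?"
print(fuse(attack)[-3:])
print(fuse(benign)[-3:])
```

Concatenation is only one possible fusion choice; the abstract does not say how the two channels are combined before the fully connected network.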



Article 1126

Title@2025-06-05 (4): Inducing lexicons of in-group language with socio-temporal context

Title: Inducing lexicons of in-group language with socio-temporal context Induzieren von Lexikons der In-Gruppe-Sprache mit sozio-temporalem Kontext 引引具有社会-时时背景的组内语言词汇法 2409.19257v3

Authors (1): Christine de Kock

In-group language is an important signifier of group dynamics. This paper proposes a novel method for inducing lexicons of in-group language, which incorporates its socio-temporal context. Existing methods for lexicon induction do not capture the evolving nature of in-group language, nor the social structure of the community. Using dynamic word and user embeddings trained on conversations from online anti-women communities, our approach outperforms prior methods for lexicon induction. We develop a test set for the task of lexicon induction and a new lexicon of manosphere language, validated by human experts, which quantifies the relevance of each term to a specific sub-community at a given point in time. Finally, we present novel insights on in-group language which illustrate the utility of this approach.



Article 1127

Title@2025-06-05 (4): EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition

Title: EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition EMO-Debias: Benchmarking Gender-Debiasing-Techniken in der multi-Label Sprachemotionserkennung EMO-Debias:在多语言言论中确定性别偏见技术基准 2506.04652v1

Authors (4): Yi-Cheng Lin, Huang-Cheng Chou, Yu-Hsuan Li Liang, Hung-yi Lee

Speech emotion recognition (SER) systems often exhibit gender bias. However, the effectiveness and robustness of existing debiasing methods in such multi-label scenarios remain underexplored. To address this gap, we present EMO-Debias, a large-scale comparison of 13 debiasing methods applied to multi-label SER. Our study encompasses techniques from pre-processing, regularization, adversarial learning, biased learners, and distributionally robust optimization. Experiments conducted on acted and naturalistic emotion datasets, using WavLM and XLSR representations, evaluate each method under conditions of gender imbalance. Our analysis quantifies the trade-offs between fairness and accuracy, identifying which approaches consistently reduce gender performance gaps without compromising overall model performance. The findings provide actionable insights for selecting effective debiasing strategies and highlight the impact of dataset distributions.



Article 1128

Title@2025-06-05 (4): From Intention To Implementation: Automating Biomedical Research via LLMs

Title: From Intention To Implementation: Automating Biomedical Research via LLMs Von der Absicht zur Umsetzung: Automatisierung der biomedizinischen Forschung über LLMs 从实施目的出发:通过LLMs实现生物医学研究自动化 2412.09429v4

Authors (7): Yi Luo, Linghang Shi, Yihao Li, Aobo Zhuang, Yeyun Gong, Ling Liu, Chen Lin

Conventional biomedical research is increasingly labor-intensive due to the exponential growth of scientific literature and datasets. Artificial intelligence (AI), particularly Large Language Models (LLMs), has the potential to revolutionize this process by automating various steps. Still, significant challenges remain, including the need for multidisciplinary expertise, logicality of experimental design, and performance measurements. This paper introduces BioResearcher, the first end-to-end automated system designed to streamline the entire biomedical research process involving dry lab experiments. BioResearcher employs a modular multi-agent architecture, integrating specialized agents for search, literature processing, experimental design, and programming. By decomposing complex tasks into logically related sub-tasks and utilizing a hierarchical learning approach, BioResearcher effectively addresses the challenges of multidisciplinary requirements and logical complexity. Furthermore, BioResearcher incorporates an LLM-based reviewer for in-process quality control and introduces novel evaluation metrics to assess the quality and automation of experimental protocols. BioResearcher successfully achieves an average execution success rate of 63.07% across eight previously unmet research objectives. The generated protocols, on average, outperform typical agent systems by 22.0% on five quality metrics. The system demonstrates significant potential to reduce researchers’ workloads and accelerate biomedical discoveries, paving the way for future innovations in automated research systems.



Article 1129

Title@2025-06-05 (4): Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing

Title: Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing Nicht alle Optionen sind gleich: Textuelle Optionsgewichtung für Token-Efficient LLM-basierte Wissensverfolgung 并非所有选项都创建为等: Token- Efficient LLM 知识追踪的文本选项比重 2410.12872v2

Authors (4): JongWoo Kim, SeongYeub Chu, Bryan Wong, Mun Yi

Large Language Models (LLMs) have recently emerged as promising tools for knowledge tracing (KT) due to their strong reasoning and generalization abilities. While recent LLM-based KT methods have proposed new prompt formats, they struggle to represent the full interaction histories of example learners within a single prompt during in-context learning (ICL), resulting in limited scalability and high computational cost under token constraints. In this work, we present LLM-based Option-weighted Knowledge Tracing (LOKT), a simple yet effective framework that encodes the interaction histories of example learners in context as textual categorical option weights (TCOW). TCOW are semantic labels (e.g., “inadequate”) assigned to the options selected by learners when answering questions, enhancing the interpretability of LLMs. Experiments on multiple-choice datasets show that LOKT outperforms existing non-LLM and LLM-based KT models in both cold-start and warm-start settings. Moreover, LOKT enables scalable and cost-efficient inference, achieving strong performance even under strict token constraints. Our code is available at https://anonymous.4open.science/r/LOKT_model-3233.



Article 1130

Title@2025-06-05 (4): Rectified Sparse Attention

Title: Rectified Sparse Attention Rektifizierte Sparse Achtung 校正的 松散注意 2506.04108v2

Authors (9): Yutao Sun, Tianzhu Ye, Li Dong, Yuqing Xia, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, Furu Wei

Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42× end-to-end speedup under decoding at 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at https://aka.ms/ReSA-LM.



Article 1131

Title@2025-06-05 (4): Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents

Title: Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents Flex-TravelPlanner: Ein Benchmark für flexible Planung mit Sprachagenten 弹性旅行规划:语文代表灵活规划基准 2506.04649v1

Authors (3): Juhyun Oh, Eunsu Kim, Alice Oh

Real-world planning problems require constant adaptation to changing requirements and balancing of competing constraints. However, current benchmarks for evaluating LLMs’ planning capabilities primarily focus on static, single-turn scenarios. We introduce Flex-TravelPlanner, a benchmark that evaluates language models’ ability to reason flexibly in dynamic planning scenarios. Building on the TravelPlanner dataset (Xie et al., 2024), we introduce two novel evaluation settings: (1) sequential constraint introduction across multiple turns, and (2) scenarios with explicitly prioritized competing constraints. Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings: models’ performance on single-turn tasks poorly predicts their ability to adapt plans across multiple turns; constraint introduction order significantly affects performance; and models struggle with constraint prioritization, often incorrectly favoring newly introduced lower-priority preferences over existing higher-priority constraints. These findings highlight the importance of evaluating LLMs in more realistic, dynamic planning scenarios and suggest specific directions for improving model performance on complex planning tasks. The code and dataset for our framework are publicly available at https://github.com/juhyunohh/FlexTravelBench.



Article 1132

Title@2025-06-05 (4): TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering

Title: TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering TaDA: Training-freies Rezept zur Dekodierung mit adaptiver KV-Cache-Kompression und Mean-Centering TaDA:使用适应性 KV 缓存压缩和中度激励解码的无培训食谱 2506.04642v1

Authors (4): Vinay Joshi, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and mean centering that eliminates separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements, a persistent hurdle in most traditional quantization methods. Experiments on standard benchmarks demonstrate that our technique reduces KV cache memory footprint to 27% of the original 16-bit baseline while achieving comparable accuracy. Our method paves the way for scalable and high-performance reasoning in language models by potentially enabling inference for longer context length models, reasoning models, and longer chains of thought.
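
One intuition for why mean centering can help low-bit quantization is that removing a shared offset shrinks the range the quantizer has to cover. The sketch below illustrates only that intuition on a toy vector. It is not the TaDA recipe: the symmetric uniform quantizer, the per-vector mean, the 4-bit setting, and the sample `activations` are all assumptions chosen for illustration.

```python
def quantize(values, bits):
    """Symmetric uniform quantization; returns the dequantized values."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 positive levels for 4 bits
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    scale = peak / levels
    return [round(v / scale) * scale for v in values]


def quantize_centered(values, bits):
    """Subtract the mean, quantize the residuals, then add the mean back."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    return [q + mean for q in quantize(centered, bits)]


def max_abs_error(original, approx):
    return max(abs(o - a) for o, a in zip(original, approx))


# Toy activations that share a large common offset (made-up numbers).
activations = [10.1, 10.2, 9.9, 10.0, 10.3, 9.8]

plain_err = max_abs_error(activations, quantize(activations, bits=4))
centered_err = max_abs_error(activations, quantize_centered(activations, bits=4))
print(f"plain: {plain_err:.4f}  centered: {centered_err:.4f}")
```

In this toy setting the centered variant has to store one extra value (the mean) per vector, which is the usual trade-off for the reduced quantization range.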



Article 1133

Title@2025-06-05 (4): MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning

Title: MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning MoDoMoDo: Multi-Domain-Datenmischungen für multimodales LLM-Verstärkungslernen MoDoMoMoDoDo:多式LLM强化学习多功能数据混合体 2505.24871v2

Authors (10): Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, Jiacheng Zhu

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for post-training large language models (LLMs), achieving state-of-the-art performance on tasks with structured, verifiable answers. Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but is complicated by the broader, heterogeneous nature of vision-language tasks that demand nuanced visual, logical, and spatial capabilities. As such, training MLLMs using RLVR on multiple datasets could be beneficial but creates challenges with conflicting objectives from interaction among diverse datasets, highlighting the need for optimal dataset mixture strategies to improve generalization and reasoning. We introduce a systematic post-training framework for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation and benchmark implementation. Specifically, (1) We developed a multimodal RLVR framework for multi-dataset post-training by curating a dataset that contains different verifiable vision-language problems and enabling multi-domain online RL learning with different verifiable rewards; (2) We proposed a data mixture strategy that learns to predict the RL fine-tuning outcome from the data mixture distribution, and consequently optimizes the best mixture. Comprehensive experiments showcase that multi-domain RLVR training, when combined with mixture prediction strategies, can significantly boost MLLM general reasoning capacities. Our best mixture improves the post-trained model’s accuracy on out-of-distribution benchmarks by an average of 5.24% compared to the same model post-trained with uniform data mixture, and by a total of 20.74% compared to the pre-finetuning baseline.



Article 1134

Title@2025-06-05 (4): ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition

Title: ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition ViCocktail: Automatisierte Multi-Modal-Datensammlung für vietnamesische Audio-Visuelle Spracherkennung Vicocktail:越南视听语音语音识别自动多模式数据收集 2506.04635v1

Authors (4): Thai-Binh Nguyen, Thi Van Nguyen, Quoc Truong Do, Chi Mai Luong

Audio-Visual Speech Recognition (AVSR) has gained significant attention recently due to its robustness against noise, which often challenges conventional speech recognition systems that rely solely on audio features. Despite this advantage, AVSR models remain limited by the scarcity of extensive datasets, especially for most languages beyond English. Automated data collection offers a promising solution. This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. We demonstrate its broad applicability by developing a baseline AVSR model for Vietnamese. Experiments show the automatically collected dataset enables a strong baseline, achieving performance competitive with robust ASR systems in clean conditions and significantly outperforming them in noisy environments like cocktail parties. This efficient method provides a pathway to expand AVSR to more languages, particularly under-resourced ones.



Article 1135

Title@2025-06-05 (4): MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

Title: MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation MAVL: Ein mehrsprachiger Audio-Video-Text Datensatz für animierte Song-Übersetzung MAVL: 动动歌曲翻译多语种视听歌词数据集 2505.18614v2

Authors (4): Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu

Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.



Article 1136

Title@2025-06-05 (4): Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon

Title: Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon Breaking the Cloak! Enthüllung der chinesischen verhüllten Toxizität mit Homophon Graph und giftigem Lexikon 破解衣物! 中华便衣毒物与同声图和毒毒词汇结合 2505.22184v2

Authors (5): Xuchen Ma, Jianxiang Yu, Wenming Shao, Bo Pang, Xiang Li

Social media platforms have experienced a significant rise in toxic content, including abusive language and discriminatory remarks, presenting growing challenges for content moderation. Some users evade censorship by deliberately disguising toxic words through homophonic cloaks, which necessitates the task of unveiling cloaked toxicity. Existing methods are mostly designed for English texts, while Chinese cloaked toxicity unveiling has not been solved yet. To tackle the issue, we propose C²TU, a novel training-free and prompt-free method for Chinese cloaked toxic content unveiling. It first employs substring matching to identify candidate toxic words based on Chinese homo-graph and toxic lexicon. Then it filters out those candidates that are non-toxic and corrects the cloaks to their corresponding toxic words. Specifically, we develop two model variants for filtering, which are based on BERT and LLMs, respectively. For LLMs, we address the auto-regressive limitation in computing word occurrence probability and utilize the full semantic contexts of a text sequence to reveal cloaked toxic words. Extensive experiments demonstrate that C²TU can achieve superior performance on two Chinese toxic datasets. In particular, our method outperforms the best competitor by up to 71% on the F1 score and 35% on accuracy, respectively. Our code and data are available at https://github.com/XDxc-cuber/C2TU-Chinese-cloaked-toxicity-unveiling.



Article 1137

Title@2025-06-05 (4): COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing

Title: COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing COMI-LINGUA: Experte kommentierter Großrechner-Datensatz für Multitask NLP in Hindi-Englisch Code-Mixing COMI-LINLingUA:印地语-英语编码混合多语种国家编码多语种方案专家附加说明的大尺度数据集 2503.21670v2

Authors (3): Rajvee Sheth, Himanshu Beniwal, Mayank Singh

We introduce COMI-LINGUA, the largest manually annotated Hindi-English code-mixed dataset, comprising 125K+ high-quality instances across five core NLP tasks: Matrix Language Identification, Token-level Language Identification, POS Tagging, Named Entity Recognition (NER), and Machine Translation. Each instance is annotated by three bilingual annotators, yielding over 376K expert annotations with strong inter-annotator agreement (Fleiss’ Kappa ≥ 0.81). The rigorously preprocessed and filtered dataset covers both Devanagari and Roman scripts and spans diverse domains, ensuring real-world linguistic coverage. Evaluation reveals that closed-source LLMs significantly outperform traditional tools and open-source models. Notably, one-shot prompting consistently boosts performance across tasks, especially in structure-sensitive predictions like POS and NER, highlighting the effectiveness of prompt-based adaptation in code-mixed, low-resource settings. COMI-LINGUA is publicly available at: https://github.com/lingo-iitgn/CodeMixing_Project.
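
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of the standard Fleiss' kappa formula for a fixed number of annotators per item. The rating counts in `toy` are invented for illustration and are not drawn from COMI-LINGUA; the function name and input layout are likewise assumptions.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa.

    ratings[i][j] is the number of annotators who assigned item i to
    category j. Every item must have the same total number of ratings.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(ratings[0])

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)

    # Undefined (division by zero) if all ratings fall in a single category.
    return (p_bar - p_e) / (1 - p_e)


# Five items, three annotators, two labels (made-up counts).
toy = [[3, 0], [0, 3], [3, 0], [2, 1], [0, 3]]
print(round(fleiss_kappa(toy), 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which gives some context for the reported threshold of 0.81.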



Article 1138

Title@2025-06-05 (4): Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning

Title: Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning Advancing Tool-Augmented Large Language Models durch Meta-Verifikation und Reflexion Lernen 通过元核查和反省学习促进工具强化大语言模式 2506.04625v1

Authors (6): Zhiyuan Ma, Jiayu Liu, Xianzhen Luo, Zhenya Huang, Qingfu Zhu, Wanxiang Che

Empowering large language models (LLMs) with effective tool utilization capabilities is crucial for enabling AI agents to solve complex problems. However, current models face two major limitations: (1) unreliable tool planning and invocation due to low-quality instruction datasets (e.g., widespread hallucinated API calls), and (2) weak tool reflection abilities (over 90% of errors cannot be corrected) resulting from static imitation learning. To address these critical limitations, we propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories to construct ToolBench-V, a new high-quality instruction dataset that addresses the limitation of unreliable tool planning and invocation. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback through a dynamic “Error -> Reflection -> Correction” learning paradigm, resulting in our reflection dataset ToolBench-R and addressing the critical weakness in tool reflection. Finally, we obtain Tool-MVR by finetuning open-source LLMs (e.g., Qwen-7B) on both ToolBench-V and ToolBench-R. Our experiments demonstrate that Tool-MVR achieves state-of-the-art performance on StableToolBench, surpassing both ToolLLM (by 23.9%) and GPT-4 (by 15.3%) while reducing API calls by 31.4%, with strong generalization capabilities across unseen tools and scenarios. Additionally, on our proposed RefineToolBench, the first benchmark specifically designed to evaluate tool reflection capabilities, Tool-MVR achieves a 58.9% error correction rate, significantly outperforming ToolLLM’s 9.1%.



Article 1139

Title@2025-06-05 (4): Static Word Embeddings for Sentence Semantic Representation

Title: Static Word Embeddings for Sentence Semantic Representation Statische Wort-Einbettungen für Satz semantische Darstellung 判决语义代表的静态单词嵌入 2506.04624v1

Authors (5): Takashi Wada, Yuki Hirakawa, Ryotaro Shimizu, Takahiro Kawashima, Yuki Saito

We propose new static word embeddings optimised for sentence semantic representation. We first extract word embeddings from a pre-trained Sentence Transformer, and improve them with sentence-level principal component analysis, followed by either knowledge distillation or contrastive learning. During inference, we represent sentences by simply averaging word embeddings, which requires little computational cost. We evaluate models on both monolingual and cross-lingual tasks and show that our model substantially outperforms existing static models on sentence semantic tasks, and even rivals a basic Sentence Transformer model (SimCSE) on some data sets. Lastly, we perform a variety of analyses and show that our method successfully removes word embedding components that are irrelevant to sentence semantics, and adjusts the vector norms based on the influence of words on sentence semantics.
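
The inference step described here, representing a sentence as the average of its word vectors, can be sketched in a few lines. The three-dimensional `EMBEDDINGS` table below is a made-up stand-in for the learned static embeddings, and whitespace tokenization is an assumption; only the averaging and cosine comparison reflect what the abstract states.

```python
import math

# Hypothetical 3-dimensional static embeddings; real tables are far larger.
EMBEDDINGS = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.0],
    "sleep": [0.1, 0.9, 0.1],
    "stocks": [0.0, 0.1, 0.9],
    "fall": [0.1, 0.2, 0.8],
}


def embed_sentence(sentence):
    """Average the static vectors of all in-vocabulary tokens."""
    vectors = [EMBEDDINGS[t] for t in sentence.lower().split() if t in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known tokens in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


s1 = embed_sentence("cats sleep")
s2 = embed_sentence("dogs sleep")
s3 = embed_sentence("stocks fall")
print(cosine(s1, s2) > cosine(s1, s3))
```

Because inference is a table lookup plus a mean, its cost grows only with sentence length, which is consistent with the low computational cost the abstract emphasizes.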



Article 1140

Title@2025-06-05 (4): Subjective Perspectives within Learned Representations Predict High-Impact Innovation

Title: Subjective Perspectives within Learned Representations Predict High-Impact Innovation Subjektive Perspektiven in erfahrenen Vertretungen prognostizieren High-Impact Innovation 学术界代表的主观观点 预测高影响创新 2506.04616v1

Authors (3): Likun Cao, Rui Pan, James Evans

Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators’ personal perspectives and interpersonal innovation opportunities as a function of their prior trajectories of experience. We theorize and then quantify subjective perspectives and innovation opportunities based on innovator positions within the geometric space of concepts inscribed by dynamic language representations. Using data on millions of scientists, inventors, writers, entrepreneurs, and Wikipedia contributors across the creative domains of science, technology, film, entrepreneurship, and Wikipedia, we show that measured subjective perspectives anticipate what ideas individuals and groups creatively attend to and successfully combine in the future. When perspective and background diversity are decomposed as the angular difference between collaborators’ perspectives on their creation and between their experiences, the former consistently anticipates creative achievement while the latter portends its opposite, across all cases and time periods examined. We analyze a natural experiment and simulate creative collaborations between AI (large language model) agents designed with varying perspective and background diversity; both yield results consistent with our observational findings. We explore mechanisms underlying these findings and identify how successful collaborators leverage common language to weave together diverse experience obtained through trajectories of prior work that converge to provoke one another and innovate. We explore the importance of these findings for team assembly and research policy.



Article 1141

Title@2025-06-05 (4): Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning

Title: Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning Überprüfung von Test-Time Scaling: Eine Umfrage und eine vielfaltsbewusste Methode für effizientes Reasoning 重新审视试验时间尺度:调查以及有效说明理由的多样化软件方法 2506.04611v1

Authors (7): Ho-Lam Chung, Teng-Yun Hsiao, Hsiao-Ying Huang, Chunerh Cho, Jian-Ren Lin, Zhang Ziwei, Yun-Nung Chen

Test-Time Scaling (TTS) improves the reasoning performance of Large Language Models (LLMs) by allocating additional compute during inference. We conduct a structured survey of TTS methods and categorize them into sampling-based, search-based, and trajectory optimization strategies. We observe that reasoning-optimized models often produce less diverse outputs, which limits TTS effectiveness. To address this, we propose ADAPT (A Diversity Aware Prefix fine-Tuning), a lightweight method that applies prefix tuning with a diversity-focused data strategy. Experiments on mathematical reasoning tasks show that ADAPT reaches 80% accuracy using eight times less compute than strong baselines. Our findings highlight the essential role of generative diversity in maximizing TTS effectiveness.
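As a rough illustration of the sampling-based category in the survey's taxonomy, the sketch below draws several answers and takes a majority vote, and also reports how diverse the samples are. It does not implement ADAPT; `noisy_solver` is a hypothetical stand-in for a model call, and `distinct_ratio` is an assumed, simplistic diversity measure.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def distinct_ratio(answers):
    """Fraction of samples that are distinct; a crude diversity measure."""
    return len(set(answers)) / len(answers)


def sample_answers(sampler, n, seed=0):
    """Draw n answers from a sampler that takes a random.Random instance."""
    rng = random.Random(seed)
    return [sampler(rng) for _ in range(n)]


def noisy_solver(rng):
    # Hypothetical stand-in for an LLM: usually right, sometimes off by a bit.
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "44"])


answers = sample_answers(noisy_solver, n=32)
final_answer = majority_vote(answers)
diversity = distinct_ratio(answers)
```

If every sample is identical, `distinct_ratio` equals 1/n and extra samples add no information, which is the intuition behind the abstract's claim that low output diversity limits TTS.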



Article 1142

Title@2025-06-05 (4): DataGen: Unified Synthetic Dataset Generation via Large Language Models

Title: DataGen: Unified Synthetic Dataset Generation via Large Language Models DataGen: Unified Synthetic Dataset Generation über große Sprachmodelle DataGen:通过大语言模型生成统一合成数据集 2406.18966v4

Authors (11): Yue Huang, Siyuan Wu, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Xiangliang Zhang, Jianfeng Gao, Chaowei Xiao, Lichao Sun

Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents DataGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. DataGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, DataGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by DataGen, and each module within DataGen plays a critical role in this enhancement. Additionally, DataGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that DataGen effectively supports dynamic and evolving benchmarking and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.



Article 1143

Title@2025-06-05 (4): A MISMATCHED Benchmark for Scientific Natural Language Inference

Title: A MISMATCHED Benchmark for Scientific Natural Language Inference Ein MISMATCHED-Benchmark für naturwissenschaftliche Sprachinferenzen 科学自然语言引文基准 2506.04603v1

Authors (5): Firoz Shaik, Mobashir Sadat, Nikita Gautam, Doina Caragea, Cornelia Caragea

Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains (PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH) and contains 2,700 human-annotated sentence pairs. We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best-performing baseline shows a Macro F1 of only 78.17%, illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs with an implicit scientific NLI relation between them in model training improves model performance on scientific NLI. We make our dataset and code publicly available on GitHub.



Article 1144

Title@2025-06-05 (4): Seed-Coder: Let the Code Model Curate Data for Itself

Title: Seed-Coder: Let the Code Model Curate Data for Itself Saatgut-Coder: Lassen Sie das Code-Modell Daten für sich selbst kuratieren 种子编码器:让代码模型为它自己计算曲线数据 2506.03524v2

Authors (27): ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu

Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often rely heavily on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.



Article 1145

Title@2025-06-05 (4): Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

Title: Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem Lösen des vernünftigen Potenzials von vortrainierten LLMs durch Kritik Feinsteuerung auf ein Problem 通过Critique对一个问题的微调,释放预先培训的LLMs的理据潜力 2506.03295v2

Authors (5): Yubo Wang, Ping Nie, Kai Zou, Lijun Wu, Wenhu Chen

We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models’ reasoning capabilities. However, RL is not only expensive but also unstable. Even one-shot RL requires hundreds of GPU hours. This raises a critical question: Is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to or even surpass the results from RL while using 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.



Article 1146

Title@2025-06-05 (4): Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing

Title: Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing Beyond Memorization: Ein strenger Evaluationsrahmen für medizinisches Knowledge Editing 记忆之后:医学知识编辑的严格评价框架 2506.03490v2

Authors (6): Shigeng Chen, Linhao Luo, Zhangchi Qiu, Yanan Cao, Carl Yang, Shirui Pan

Recently, knowledge editing (KE) has emerged as a promising approach to update specific facts in Large Language Models (LLMs) without the need for full retraining. Despite the effectiveness of KE methods on general-domain benchmarks, their applicability to the complex medical domain remains largely unexplored. Medical knowledge editing is particularly challenging, as it requires LLMs to internalize the knowledge and generalize to unseen scenarios for effective and interpretable decision-making. In this work, we propose a novel framework called MedEditBench to rigorously evaluate the effectiveness of existing KE methods in the medical domain. In MedEditBench, we introduce a new medical knowledge editing benchmark as well as three different knowledge editing paradigms, which are designed to assess the impact of different knowledge sources for editing. Our findings indicate that current KE methods result in only superficial memorization of the injected information, failing to generalize to new scenarios. To overcome this limitation, we present Self-Generated Rationale Editing (SGR-Edit), which utilizes model-derived rationales as the target knowledge for editing, thereby uncovering the underlying reasoning process and demonstrating significant improvements over existing KE approaches. Additionally, we offer deeper insights into medical knowledge editing, including the localization of medical knowledge in LLMs and the impact of sequential editing on evolving knowledge. These insights could provide practical guidance for implementing KE methods in real-world medical applications.



Article 1147

Title@2025-06-05 (4): Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification

Title: Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification Sicher: Mathematische Reasoning in großen Sprachmodellen durch retrospektive stufenweise Verifizierung verbessern 安全:通过回溯性逐步认识正式核查,加强大语言模型中的数学理由 2506.04592v1

Authors (10): Chengwu Liu, Ye Yuan, Yichun Yin, Yan Xu, Xin Xu, Zaoyu Chen, Yasheng Wang, Lifeng Shang, Qun Liu, Ming Zhang

Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that “the gold standard for supporting a mathematical claim is to provide a proof”. We propose a retrospective, step-aware formal verification framework, Safe. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework Safe across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose FormalStep as a benchmark for step correctness theorem proving with 30,809 formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying natural language content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.
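To make the idea of step-level formalization concrete, here is a small Lean 4 sketch of what a single formalized reasoning step might look like. The step itself ("since x = 3, we have 2 * x + 1 = 7") and the theorem name `step_claim` are invented for illustration; this is not taken from the Safe framework or the FormalStep benchmark.

```lean
-- Hypothetical reasoning step from a chain of thought:
-- "Since x = 3, we have 2 * x + 1 = 7."
theorem step_claim (x : Nat) (h : x = 3) : 2 * x + 1 = 7 := by
  subst h  -- replace x by 3 in the goal
  rfl      -- 2 * 3 + 1 reduces to 7 by computation
```

If the step had instead claimed `2 * x + 1 = 8`, no proof would exist, and the failure to prove it would be checkable evidence that the step is wrong.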



Article 1148

Title@2025-06-05 (4): LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models

Title: LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models LESS: Großes Sprachmodell Verbessertes semi-überwachtes Lernen für Sprachgrundmodelle LESS:大语言模式强化半监督半监督学习演讲基础模型 2506.04586v1

Authors (2): Wen Ding, Fan Qian

We introduce LESS (Large Language Model Enhanced Semi-supervised Learning), a versatile framework that leverages Large Language Models (LLMs) to correct pseudo labels generated from in-the-wild data. Within the LESS framework, pseudo-labeled text from Automatic Speech Recognition (ASR) or Automatic Speech Translation (AST) of the unsupervised data is refined by an LLM, and augmented by a data filtering strategy to optimize LLM knowledge transfer efficiency. Experiments on both Mandarin ASR and Spanish-to-English AST tasks show that LESS achieves a notable absolute WER reduction of 3.77% on the Wenet Speech test set, as well as BLEU scores of 34.0 and 64.7 on Callhome and Fisher test sets respectively. These results validate the adaptability of LESS across different languages, tasks, and domains. Ablation studies conducted with various LLMs and prompt configurations provide novel insights into leveraging LLM-derived knowledge for speech processing applications.



Article 1149

Title@2025-06-05 (4): MuSciClaims: Multimodal Scientific Claim Verification

Title: MuSciClaims: Multimodal Scientific Claim Verification MuSciClaims: Multimodale wissenschaftliche Antragsprüfung 穆西索赔: 多式联运科学索赔核实 2506.04585v1

Authors (6): Yash Kumar Lal, Manikanta Bandham, Mohammad Saqib Hasan, Apoorva Kashi, Mahnaz Koupaee, Niranjan Balasubramanian

Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark, MuSciClaims, accompanied by diagnostic tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models perform poorly (~0.3-0.5 F1), with even the best model only achieving 0.77 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.



Article 1150

Title@2025-06-05 (4): SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing

Title: SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-Checking durch Forderungszersetzung und Bearbeitung CUCEA: 通过索赔的分解和编辑进行反向实况调查的理由说明-密集检索 2506.04583v1

Authors (4): Hongjun Liu, Yilun Zhao, Arman Cohan, Chen Zhao

Automatic fact-checking has recently received more attention as a means of combating misinformation. Despite significant advancements, fact-checking systems based on retrieval-augmented language models still struggle to tackle adversarial claims, which are intentionally designed by humans to challenge fact-checking systems. To address these challenges, we propose a training-free method designed to rephrase the original claim, making it easier to locate supporting evidence. Our modular framework, SUCEA, decomposes the task into three steps: 1) Claim Segmentation and Decontextualization that segments adversarial claims into independent sub-claims; 2) Iterative Evidence Retrieval and Claim Editing that iteratively retrieves evidence and edits the subclaim based on the retrieved evidence; 3) Evidence Aggregation and Label Prediction that aggregates all retrieved evidence and predicts the entailment label. Experiments on two challenging fact-checking datasets demonstrate that our framework significantly improves on both retrieval and entailment label accuracy, outperforming four strong claim-decomposition-based baselines.



Article 1151

Title@2025-06-05 (4): Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching

Title: Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching Auswahl von Demonstrationen für das Viel-Schuss-In-Kontext-Lernen über Gradient Matching 通过梯度匹配选择多片点在文本中学习的演示 2506.04579v1

Authors (7): Jianfei Zhang, Bei Li, Jun Bai, Rumei Li, Yanmeng Wang, Chenghua Lin, Wenge Rong

In-Context Learning (ICL) empowers Large Language Models (LLMs) for rapid task adaptation without Fine-Tuning (FT), but its reliance on demonstration selection remains a critical challenge. While many-shot ICL shows promising performance through scaled demonstrations, the selection method for many-shot demonstrations remains limited to random selection in existing work. Since the conventional instance-level retrieval is not suitable for many-shot scenarios, we hypothesize that the data requirements for in-context learning and fine-tuning are analogous. To this end, we introduce a novel gradient matching approach that selects demonstrations by aligning fine-tuning gradients between the entire training set of the target task and the selected examples, so as to approach the learning effect on the entire training set within the selected examples. Through gradient matching on relatively small models, e.g., Qwen2.5-3B or Llama3-8B, our method consistently outperforms random selection on larger LLMs from 4-shot to 128-shot scenarios across 9 diverse datasets. For instance, it surpasses random selection by 4% on Qwen2.5-72B and Llama3-70B, and by around 2% on 5 closed-source LLMs. This work unlocks more reliable and effective many-shot ICL, paving the way for its broader application.
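The selection idea above can be illustrated with a greedy sketch: pick examples so that the mean gradient of the selected subset stays close to the mean gradient of the whole training set. This is an assumption-laden toy, not the paper's algorithm: the per-example gradients are plain Python lists that would in practice come from a small model, and `greedy_gradient_match` and its squared-distance objective are hypothetical choices.

```python
def mean_vec(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_gradient_match(grads, k):
    """Greedily select k indices whose mean gradient approximates the full mean."""
    if not 0 < k <= len(grads):
        raise ValueError("k must be between 1 and len(grads)")
    target = mean_vec(grads)
    selected = []
    running_sum = [0.0] * len(grads[0])
    for _ in range(k):
        best_idx, best_dist = None, None
        for i, g in enumerate(grads):
            if i in selected:
                continue
            # Mean gradient of the subset if example i were added.
            cand = [(r + x) / (len(selected) + 1) for r, x in zip(running_sum, g)]
            d = sq_dist(cand, target)
            if best_dist is None or d < best_dist:
                best_idx, best_dist = i, d
        selected.append(best_idx)
        running_sum = [r + x for r, x in zip(running_sum, grads[best_idx])]
    return selected


# Toy per-example gradients (invented values).
toy_grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
chosen = greedy_gradient_match(toy_grads, k=3)
```

The selected indices would then serve as the many-shot demonstrations for the larger model; how gradients are computed and compared in the actual method is described in the paper, not here.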



Article 1152

Title@2025-06-05 (4): Contrastive Visual Data Augmentation

Title: Contrastive Visual Data Augmentation Kontrastive Bilddatenvergrößerung 对比视觉数据增强 2502.17709v2

Authors (9): Yu Zhou, Bingxuan Li, Mohan Tang, Xiaomeng Jin, Te-Lin Wu, Kuan-Hao Huang, Heng Ji, Kai-Wei Chang, Nanyun Peng

Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Automatic filtering of extracted features and augmented images is implemented to guarantee their quality, as verified by human annotators. We show the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets including iNaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. LLaVA-1.6 1-shot updating results on these three datasets show that CoDA significantly improves over SOTA visual data augmentation strategies, with absolute accuracy gains of 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNat).



Article 1153

Title@2025-06-05 (4): Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts?

Title: Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts? Sind LLMs zuverlässige Übersetzer der logischen Vernunft in lexikalisch diversifizierten Kontexten? LLMs 可靠翻译者是否掌握了跨越多种不同背景的逻辑理由? 2506.04575v1

Authors (7): Qingchuan Li, Jiatong Li, Zirui Liu, Mingyue Cheng, Yuting Zeng, Qi Liu, Tongxuan Liu

Neuro-symbolic approaches that combine large language models (LLMs) with solvers excel at logical reasoning problems that need long reasoning chains. In this paradigm, LLMs serve as translators, converting natural language reasoning problems into formal logic formulas. Reliable symbolic solvers then return correct solutions. Despite their success, we find that LLMs, as translators, struggle to handle lexical diversification, a common linguistic phenomenon, indicating that LLMs as logic translators are unreliable in real-world scenarios. Moreover, existing logical reasoning benchmarks lack lexical diversity, failing to challenge LLMs’ ability to translate such text and thus obscuring this issue. In this work, we propose SCALe, a benchmark designed to address this significant gap through logic-invariant lexical diversification. By using LLMs to transform original benchmark datasets into lexically diversified but logically equivalent versions, we evaluate LLMs’ ability to consistently map diverse expressions to uniform logical symbols on these new datasets. Experiments using SCALe further confirm that current LLMs exhibit deficiencies in this capability. Building directly on the deficiencies identified through our benchmark, we propose a new method, MenTaL, to address this limitation. This method guides LLMs to first construct a table unifying diverse expressions before performing translation. Applying MenTaL through in-context learning and supervised fine-tuning (SFT) significantly improves the performance of LLM translators on lexically diversified text. Our code is now available at https://github.com/wufeiwuwoshihua/LexicalDiver.



Article 1154

Title@2025-06-05 (4): Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis

Title: Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis Reasoning or Overthinking: Bewertung großer Sprachmodelle zur Analyse von Finanzsentimenten 理由或过度思考:评价关于金融敏感分析的大型语言模型 2506.04574v1

Authors (2): Dimitris Vamvourellis, Dhagash Mehta

We investigate the effectiveness of large language models (LLMs), including reasoning-based and non-reasoning models, in performing zero-shot financial sentiment analysis. Using the Financial PhraseBank dataset annotated by domain experts, we evaluate how various LLMs and prompting strategies align with human-labeled sentiment in a financial context. We compare three proprietary LLMs (GPT-4o, GPT-4.1, o3-mini) under different prompting paradigms that simulate System 1 (fast and intuitive) or System 2 (slow and deliberate) thinking and benchmark them against two smaller models (FinBERT-Prosus, FinBERT-Tone) fine-tuned on financial sentiment analysis. Our findings suggest that reasoning, either through prompting or inherent model design, does not improve performance on this task. Surprisingly, the most accurate and human-aligned combination of model and method was GPT-4o without any Chain-of-Thought (CoT) prompting. We further explore how performance is impacted by linguistic complexity and annotation agreement levels, uncovering that reasoning may introduce overthinking, leading to suboptimal predictions. This suggests that for financial sentiment classification, fast, intuitive “System 1”-like thinking aligns more closely with human judgment compared to “System 2”-style slower, deliberative reasoning simulated by reasoning models or CoT prompting. Our results challenge the default assumption that more reasoning always leads to better LLM decisions, particularly in high-stakes financial applications.



Article 1155

Title@2025-06-05 (4): Demonstrations of Integrity Attacks in Multi-Agent Systems

Title: Demonstrations of Integrity Attacks in Multi-Agent Systems Demonstrationen von Integritätsangriffen in Multi-Agent-Systemen 多机构系统中廉正攻击示范 2506.04572v1

Authors (4): Can Zheng, Yuhan Cao, Xiaoning Dong, Tianxing He

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, code generation, and complex planning. Simultaneously, Multi-Agent Systems (MAS) have garnered attention for their potential to enable cooperation among distributed agents. However, from a multi-party perspective, MAS could be vulnerable to malicious agents that exploit the system to serve self-interests without disrupting its core functionality. This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits. Four types of attacks are examined: Scapegoater, who misleads the system monitor to underestimate other agents’ contributions; Boaster, who misleads the system monitor to overestimate their own performance; Self-Dealer, who manipulates other agents to adopt certain tools; and Free-Rider, who hands off its own task to others. We demonstrate that strategically crafted prompts can introduce systematic biases in MAS behavior and executable instructions, enabling malicious agents to effectively mislead evaluation systems and manipulate collaborative agents. Furthermore, our attacks can bypass advanced LLM-based monitors, such as GPT-4o-mini and o3-mini, highlighting the limitations of current detection mechanisms. Our findings underscore the critical need for MAS architectures with robust security protocols and content validation mechanisms, alongside monitoring systems capable of comprehensive risk scenario assessment.



Article 1156

Title@2025-06-05 (4): Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Title: Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment Implizite Cross-Lingual-Belohnung für effiziente Mehrsprachigkeitsausrichtung 高效多语种和多种语文首选项统一化的双双优 2503.04647v2

Authors (5): Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang

Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is hampered by data scarcity. To address this, we propose a novel approach that captures learned preferences from well-aligned English models by implicit rewards and transfers them to other languages through iterative training. Specifically, we derive an implicit reward model from the logits of an English DPO-aligned model and its corresponding reference model. This reward model is then leveraged to annotate preference relations in cross-lingual instruction-following pairs, using English instructions to evaluate multilingual responses. The annotated data is subsequently used for multilingual DPO fine-tuning, facilitating preference knowledge transfer from English to other languages. Fine-tuning Llama3 for two iterations resulted in a 12.72% average improvement in Win Rate and a 5.97% increase in Length Control Win Rate across all training languages on the X-AlpacaEval leaderboard. Our findings demonstrate that leveraging existing English-aligned models can enable efficient and effective multilingual preference alignment, significantly reducing the need for extensive multilingual preference data. The code is available at https://github.com/ZNLP/Implicit-Cross-Lingual-Rewarding
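The implicit reward mentioned above can be sketched with the standard DPO identity, in which the reward of a response is proportional to the log-probability ratio between the aligned model and its reference model: r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)). The snippet below is a minimal illustration under that assumption; the per-token log-probabilities are invented numbers, and `implicit_reward`, `annotate_pair`, and the value of `beta` are hypothetical rather than taken from the released code.

```python
def implicit_reward(policy_logprobs, ref_logprobs, beta=0.1):
    """DPO-style implicit reward from per-token log-probs of one response."""
    return beta * (sum(policy_logprobs) - sum(ref_logprobs))


def annotate_pair(resp_a, resp_b, beta=0.1):
    """Label which of two responses is chosen and which is rejected."""
    r_a = implicit_reward(resp_a["policy"], resp_a["ref"], beta)
    r_b = implicit_reward(resp_b["policy"], resp_b["ref"], beta)
    return ("a", "b") if r_a >= r_b else ("b", "a")


# Invented per-token log-probabilities for two candidate responses.
response_a = {"policy": [-0.5, -0.4], "ref": [-1.0, -0.9]}
response_b = {"policy": [-1.2, -1.1], "ref": [-1.0, -1.0]}

chosen, rejected = annotate_pair(response_a, response_b)
```

Pairs labeled this way would form the preference data for the next round of multilingual DPO fine-tuning, per the iterative scheme the abstract describes.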



Article 1157

Title@2025-06-05 (4): Clustering and Median Aggregation Improve Differentially Private Inference

Title: Clustering and Median Aggregation Improve Differentially Private Inference Clustering und Median Aggregation verbessern unterschiedliche private Schlussfolgerungen 群集和中中聚合改善差别私人推断 2506.04566v1

Authors (7): Kareem Amin, Salman Avestimehr, Sara Babakniya, Alex Bie, Weiwei Kong, Natalia Ponomareva, Umar Syed

Differentially private (DP) language model inference is an approach for generating private synthetic text. A sensitive input example is used to prompt an off-the-shelf large language model (LLM) to produce a similar example. Multiple examples can be aggregated together to formally satisfy the DP guarantee. Prior work creates inference batches by sampling sensitive inputs uniformly at random. We show that uniform sampling degrades the quality of privately generated text, especially when the sensitive examples concern heterogeneous topics. We remedy this problem by clustering the input data before selecting inference batches. Next, we observe that clustering also leads to more similar next-token predictions across inferences. We use this insight to introduce a new algorithm that aggregates next token statistics by privately computing medians instead of averages. This approach leverages the fact that the median has decreased local sensitivity when next token predictions are similar, allowing us to state a data-dependent and ex-post DP guarantee about the privacy properties of this algorithm. Finally, we demonstrate improvements in terms of representativeness metrics (e.g., MAUVE) as well as downstream task performance. We show that our method produces high-quality synthetic data at significantly lower privacy cost than a previous state-of-the-art method.
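To see why median aggregation can help when next-token predictions mostly agree, the sketch below compares coordinate-wise mean and median aggregation over a toy batch. It is deliberately non-private: no noise is added and no DP guarantee is implied, and the distributions, the `aggregate` helper, and the renormalization step are assumptions made for illustration.

```python
from statistics import mean, median


def aggregate(distributions, reducer):
    """Reduce per-example next-token distributions coordinate-wise, then renormalize.

    NOTE: illustration only; this is NOT differentially private.
    """
    vocab_size = len(distributions[0])
    scores = [reducer([d[t] for d in distributions]) for t in range(vocab_size)]
    total = sum(scores)
    if total <= 0:
        raise ValueError("aggregated scores sum to zero")
    return [s / total for s in scores]


# Three similar predictions plus one outlier (invented values, vocabulary of 3).
batch = [
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],
    [0.72, 0.18, 0.10],
    [0.05, 0.05, 0.90],
]

mean_agg = aggregate(batch, mean)
median_agg = aggregate(batch, median)
```

In this toy batch the single outlier pulls the mean toward the third token, while the median stays close to the consensus, which gives some intuition for the lower local sensitivity the abstract refers to.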


Article 1158

Title@2025-06-05 (4): From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

Title: From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems Von Standalone LLMs bis hin zu integrierter Intelligenz: Eine Übersicht über zusammengesetzte AI-Systeme 从独立的LMLM公司到综合情报公司:对AI Complical Systems的调查 2506.04565v1

Authors (3): Jiayi Chen, Junyi Ye, Guiling Wang

Compound AI Systems (CAIS) is an emerging paradigm that integrates large language models (LLMs) with external components, such as retrievers, agents, tools, and orchestrators, to overcome the limitations of standalone models in tasks requiring memory, reasoning, real-time grounding, and multimodal understanding. These systems enable more capable and context-aware behaviors by composing multiple specialized modules into cohesive workflows. Despite growing adoption in both academia and industry, the CAIS landscape remains fragmented, lacking a unified framework for analysis, taxonomy, and evaluation. In this survey, we define the concept of CAIS, propose a multi-dimensional taxonomy based on component roles and orchestration strategies, and analyze four foundational paradigms: Retrieval-Augmented Generation (RAG), LLM Agents, Multimodal LLMs (MLLMs), and orchestration-centric architectures. We review representative systems, compare design trade-offs, and summarize evaluation methodologies across these paradigms. Finally, we identify key challenges, including scalability, interoperability, benchmarking, and coordination, and outline promising directions for future research. This survey aims to provide researchers and practitioners with a comprehensive foundation for understanding, developing, and advancing the next generation of system-level artificial intelligence.


Article 1159

Title@2025-06-05 (4): Investigating Distributions of Telecom Adapted Sentence Embeddings for Document Retrieval

Title: Investigating Distributions of Telecom Adapted Sentence Embeddings for Document Retrieval Untersuchung der Verteilung von Telecom-adaptierten Satz-Einbindungen für Dokumentenwiederherstellung 用于文件检索的经调整的远程判刑嵌入表的调查分发情况 2406.12336v3

Authors (7): Sujoy Roychowdhury, Sumit Soman, Ranjani Hosakere Gireesha, Vansh Chhabra, Neeraj Gunda, Subhadip Bandyopadhyay, Sai Krishna Bala

A plethora of sentence embedding models makes it challenging to choose one, especially for technical domains rich with specialized vocabulary. In this work, we domain adapt embeddings using telecom data for question answering. We evaluate embeddings obtained from publicly available models and their domain-adapted variants, on both point retrieval accuracies, as well as their (95%) confidence intervals. We establish a systematic method to obtain thresholds for similarity scores for different embeddings. As expected, we observe that fine-tuning improves mean bootstrapped accuracies. We also observe that it results in tighter confidence intervals, which further improve when pre-training is preceded by fine-tuning. We introduce metrics which measure the distributional overlaps of top-$K$, correct and random document similarities with the question. Further, we show that these metrics are correlated with retrieval accuracy and similarity thresholds. Recent literature shows conflicting effects of isotropy on retrieval accuracies. Our experiments establish that the isotropy of embeddings (as measured by two independent state-of-the-art isotropy metric definitions) is poorly correlated with retrieval performance. We show that embeddings for domain-specific sentences have little overlap with those for domain-agnostic ones, and fine-tuning moves them further apart. Based on our results, we provide recommendations for use of our methodology and metrics by researchers and practitioners.
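
Bootstrapped accuracies with 95% confidence intervals, as reported above, can be computed with a percentile bootstrap. The following is a minimal sketch, assuming per-question hit/miss outcomes are available; the function name, resample count, and toy data are hypothetical, and the abstract does not specify which bootstrap variant the authors use.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(hits: Sequence[int],
                          n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap for retrieval accuracy.

    hits[i] is 1 if the correct document was retrieved for question i, else 0.
    Returns (mean bootstrapped accuracy, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(hits)
    # Resample questions with replacement and record the accuracy of each resample.
    accs = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples))
    lower = accs[int((alpha / 2) * n_resamples)]
    upper = accs[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(accs) / n_resamples, lower, upper


# Toy outcomes: 80 correct retrievals out of 100 questions (hypothetical).
toy_hits = [1] * 80 + [0] * 20
mean_acc, ci_low, ci_high = bootstrap_accuracy_ci(toy_hits)
```

A narrower interval between `ci_low` and `ci_high` corresponds to the "tighter confidence intervals" the abstract reports after fine-tuning.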


Article 1160

Title@2025-06-05 (4): SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?

Title: SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages? SSA-COMET: Sind LLMs in der Bewertung von MT für unterresourced African Languages bestens ausgebildete Metrics? SSA-COMET:在为资源不足的非洲语言评价MT方面,LLMs是否超过成绩优异的计量? 2506.04557v1

Authors (10): Senyu Li, Jiayi Wang, Felermino D. M. A. Ali, Colin Cherry, Daniel Deutsch, Eleftheria Briakou, Rui Sousa-Silva, Henrique Lopes Cardoso, Pontus Stenetorp, David Ifeoluwa Adelani

Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the News domain, with over 63,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.


Article 1161

Title@2025-06-05 (4): What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective

Title: What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective Was in LLM-Schichten passiert ist, wenn es um schnelles gegen langsames Denken geht: Eine gradiente Perspektive 训练快速与慢思考:渐进视角时 LLM 图层中发生的情况 2410.23743v2

Authors (3): Ming Li, Yanhong Li, Tianyi Zhou

What makes a difference in the post-training of LLMs? We investigate the training patterns of different layers in large language models (LLMs) through the lens of the gradient. We are specifically interested in how fast vs. slow thinking affects the layer-wise gradients, given the recent popularity of training LLMs on reasoning paths such as chain-of-thoughts (CoT) and process rewards. In our study, fast thinking without CoT leads to larger gradients and larger differences of gradients across layers than slow thinking (Detailed CoT), indicating the learning stability brought by the latter. Additionally, we study whether the gradient patterns can reflect the correctness of responses when training different LLMs using slow vs. fast thinking paths. The results show that the gradients of slow thinking can distinguish correct and irrelevant reasoning paths. As a comparison, we conduct similar gradient analyses on non-reasoning knowledge learning tasks, on which, however, trivially increasing the response length does not lead to similar behaviors of slow thinking. Our study strengthens fundamental understandings of LLM training and sheds novel insights on its efficiency and stability, which pave the way towards building a generalizable System-2 agent. Our code, data, and gradient statistics can be found in: https://github.com/MingLiiii/Layer_Gradient.


Article 1162

Title@2025-06-05 (4): Mosaic-IT: Cost-Free Compositional Data Synthesis for Instruction Tuning

Title: Mosaic-IT: Cost-Free Compositional Data Synthesis for Instruction Tuning Mosaic-IT: Kostenfreie kompositorische Datensynthese für die Instruction Tuning Mosaic-IT:用于教学图纸的无成本构成数据综述 2405.13326v3

Authors (8): Ming Li, Pei Chen, Chenguang Wang, Hongyu Zhao, Yijun Liang, Yupeng Hou, Fuxiao Liu, Tianyi Zhou

Finetuning large language models with a variety of instruction-response pairs has enhanced their capability to understand and follow instructions. Current instruction tuning primarily relies on teacher models or human intervention to generate and refine the instructions and responses for training, which are costly, non-sustainable, and may lack diversity. In this paper, we introduce Mosaic Instruction Tuning (Mosaic-IT), a human/model-free compositional data synthesis method that can efficiently create rich and diverse augmentations from existing instruction tuning data to enhance the LLMs. Mosaic-IT randomly concatenates multiple instruction data into one and trains the model to produce the corresponding responses with predefined higher-level meta-instructions to strengthen its multi-step instruction-following and format-following skills. Our extensive evaluations demonstrate a superior performance and training efficiency of Mosaic-IT, which achieves consistent performance improvements over various benchmarks and an 80% reduction in training costs compared with original instruction tuning. Our codes and data are available at https://github.com/tianyi-lab/Mosaic-IT.
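
The concatenation step described above can be sketched as follows. This is an illustration under stated assumptions, not the released Mosaic-IT code: the meta-instruction wording, the `[i]` numbering format, and the function name are hypothetical, and the paper's other meta-instruction variants are not modeled.

```python
import random
from typing import Dict, List

# Hypothetical meta-instruction; the paper's actual wording is not given in the abstract.
META_INSTRUCTION = "Respond to each of the following instructions in order."


def mosaic(samples: List[Dict[str, str]], k: int, rng: random.Random) -> Dict[str, str]:
    """Combine k randomly chosen instruction-response pairs into one training sample."""
    chosen = rng.sample(samples, k)
    instruction = META_INSTRUCTION + "\n" + "\n".join(
        f"[{i}] {s['instruction']}" for i, s in enumerate(chosen, start=1)
    )
    response = "\n".join(
        f"[{i}] {s['response']}" for i, s in enumerate(chosen, start=1)
    )
    return {"instruction": instruction, "response": response}


# Toy instruction-tuning data (hypothetical).
toy_samples = [
    {"instruction": "Translate 'cat' to French.", "response": "chat"},
    {"instruction": "What is 2 + 2?", "response": "4"},
    {"instruction": "Name a primary color.", "response": "red"},
]
example = mosaic(toy_samples, k=2, rng=random.Random(0))
```

Because the augmentation only recombines existing pairs, it needs no teacher model or human annotation, which matches the "human/model-free" claim in the abstract.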


Article 1163

Title@2025-06-05 (4): $K$-MSHC: Unmasking Minimally Sufficient Head Circuits in Large Language Models with Experiments on Syntactic Classification Tasks

Title: $K$-MSHC: Unmasking Minimally Sufficient Head Circuits in Large Language Models with Experiments on Syntactic Classification Tasks $K$-MSHC: Entlarvung minimal ausreichender Kopfkreise in großen Sprachmodellen mit Experimenten zu syntaktischen Klassifizierungsaufgaben USK美元-MSHC:在以同步分类任务实验的大型语言模型中,以最小化的足够大语言电路总电路 2505.12268v2

Authors (3): Pratim Chowdhary, Peter Chin, Deepernab Chakrabarty

Understanding which neural components drive specific capabilities in mid-sized language models ($\leq$10B parameters) remains a key challenge. We introduce the $(\bm{K}, \epsilon)$-Minimum Sufficient Head Circuit ($K$-MSHC), a methodology to identify minimal sets of attention heads crucial for classification tasks as well as Search-K-MSHC, an efficient algorithm for discovering these circuits. Applying our Search-K-MSHC algorithm to Gemma-9B, we analyze three syntactic task families: grammar acceptability, arithmetic verification, and arithmetic word problems. Our findings reveal distinct task-specific head circuits, with grammar tasks predominantly utilizing early layers, word problems showing pronounced activity in both shallow and deep regions, and arithmetic verification demonstrating a more distributed pattern across the network. We discover non-linear circuit overlap patterns, where different task pairs share computational components at varying levels of importance. While grammar and arithmetic share many “weak” heads, arithmetic and word problems share more consistently critical “strong” heads. Importantly, we find that each task maintains dedicated “super-heads” with minimal cross-task overlap, suggesting that syntactic and numerical competencies emerge from specialized yet partially reusable head circuits.


Article 1164

Title@2025-06-05 (4): ATLaS: Agent Tuning via Learning Critical Steps

Title: ATLaS: Agent Tuning via Learning Critical Steps ATLaS: Agent Tuning über Learning Critical Steps 通过学习关键步骤进行测试的代理 2503.02197v2

Authors (6): Zhixun Chen, Ming Li, Yuxuan Huang, Yali Du, Meng Fang, Tianyi Zhou

Large Language Model (LLM) agents have demonstrated remarkable generalization capabilities across multi-domain tasks. Existing agent tuning approaches typically employ supervised finetuning on entire expert trajectories. However, behavior-cloning of full trajectories can introduce expert bias and weaken generalization to states not covered by the expert data. Additionally, critical steps, such as planning, complex reasoning for intermediate subtasks, and strategic decision-making, are essential to success in agent tasks, so learning these steps is the key to improving LLM agents. For more effective and efficient agent tuning, we propose ATLaS, which identifies the critical steps in expert trajectories and finetunes LLMs solely on these steps at reduced cost. By steering the training’s focus to a few critical steps, our method mitigates the risk of overfitting entire trajectories and promotes generalization across different environments and tasks. In extensive experiments, an LLM finetuned on only the 30% of steps that ATLaS selects as critical outperforms the LLM finetuned on all steps and recent open-source LLM agents. ATLaS maintains and improves base LLM skills as generalist agents interacting with diverse environments.


Article 1165

Title@2025-06-05 (4): On Entity Identification in Language Models

Title: On Entity Identification in Language Models Zur Identitätskennung in Sprachmodellen 关于在语文模式中实体识别 2506.02701v3

Authors (5): Masaki Sakata, Benjamin Heinzerling, Sho Yokoi, Takumi Ito, Kentaro Inui

We analyze the extent to which internal representations of language models (LMs) identify and distinguish mentions of named entities, focusing on the many-to-many correspondence between entities and their mentions. We first formulate two problems of entity mentions – ambiguity and variability – and propose a framework analogous to clustering quality metrics. Specifically, we quantify through cluster analysis of LM internal representations the extent to which mentions of the same entity cluster together and mentions of different entities remain separated. Our experiments examine five Transformer-based autoregressive models, showing that they effectively identify and distinguish entities with metrics analogous to precision and recall ranging from 0.66 to 0.9. Further analysis reveals that entity-related information is compactly represented in a low-dimensional linear subspace at early LM layers. Additionally, we clarify how the characteristics of entity representations influence word prediction performance. These findings are interpreted through the lens of isomorphism between LM representations and entity-centric knowledge structures in the real world, providing insights into how LMs internally organize and use entity information.


Article 1166

Title@2025-06-05 (4): ChatWise: A Strategy-Guided Chatbot for Enhancing Cognitive Support in Older Adults

Title: ChatWise: A Strategy-Guided Chatbot for Enhancing Cognitive Support in Older Adults ChatWise: Ein strategiegeführter Chatbot zur Verbesserung der kognitiven Unterstützung bei älteren Erwachsenen ChatWise:加强老年人认知支助战略指导的Chattbot 2503.05740v2

Authors (5): Zhengbang Yang, Junyuan Hong, Yijiang Pang, Jiayu Zhou, Zhuangdi Zhu

Cognitive health in older adults presents a growing challenge. Although conversational interventions show feasibility in improving cognitive wellness, human caregiver resources remain overloaded. AI-based chatbots have shown promise, yet existing work is often limited to implicit strategies or heavily depends on training and label resources. In response, we propose a strategy-guided AI chatbot named ChatWise that follows a dual-level conversation reasoning framework. It integrates macro-level strategy planning and micro-level utterance generation to enable engaging, multi-turn dialogue tailored to older adults. Empirical results show that ChatWise closely aligns with professional human caregiver behaviors in offline evaluation using real clinic data and achieves positive user cognitive and emotional responses in interactive simulations with digital twins, significantly outperforming AI baselines that follow implicit conversation generation.


Article 1167

Title@2025-06-05 (4): BSBench: will your LLM find the largest prime number?

Title: BSBench: will your LLM find the largest prime number? BSBench: Findet Ihr LLM die größte Primzahl? 你的LLM会找到最大的质数吗? 2506.04535v1

Authors (1): K. O. T. Erziev

We propose that benchmarking LLMs on questions which have no reasonable answer actually isn’t as silly as it sounds. We also present a benchmark that allows such testing and a method to modify existing datasets, and discover that existing models perform far from perfectly on such questions. Our code and data artifacts are available at https://github.com/L3G5/impossible-bench


Article 1168

Title@2025-06-05 (4): Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs

Title: Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs Ist es nur Semantik? Eine Fallstudie des Diskurses Teilchenverständnis in LLMs 《只是语义学吗? 2506.04534v1

Authors (6): William Sheffield, Kanishka Misra, Valentina Pyatkin, Ashwini Deo, Kyle Mahowald, Junyi Jessy Li

Discourse particles are crucial elements that subtly shape the meaning of text. These words, often polyfunctional, give rise to nuanced and often quite disparate semantic/discourse effects, as exemplified by the diverse uses of the particle “just” (e.g., exclusive, temporal, emphatic). This work investigates the capacity of LLMs to distinguish the fine-grained senses of English “just”, a well-studied example in formal semantics, using data meticulously created and labeled by expert linguists. Our findings reveal that while LLMs exhibit some ability to differentiate between broader categories, they struggle to fully capture more subtle nuances, highlighting a gap in their understanding of discourse particles.


Article 1169

Title@2025-06-05 (4): A Survey on Data Contamination for Large Language Models

Title: A Survey on Data Contamination for Large Language Models Eine Umfrage über Datenkontamination für große Sprachmodelle 大语言模型数据污染调查 2502.14425v2

Authors (3): Yuxing Cheng, Yi Chang, Yuan Wu

Recent advancements in Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. However, the reliability of performance evaluation has come under scrutiny due to data contamination, the unintended overlap between training and test datasets. This overlap has the potential to artificially inflate model performance, as LLMs are typically trained on extensive datasets scraped from publicly available sources. These datasets often inadvertently overlap with the benchmarks used for evaluation, leading to an overestimation of the models’ true generalization capabilities. In this paper, we first examine the definition and impacts of data contamination. Secondly, we review methods for contamination-free evaluation, focusing on three strategies: data updating-based methods, data rewriting-based methods, and prevention-based methods. Specifically, we highlight dynamic benchmarks and LLM-driven evaluation methods. Finally, we categorize contamination detection methods based on model information dependency: white-box, gray-box, and black-box detection approaches. Our survey highlights the requirements for more rigorous evaluation protocols and proposes future directions for addressing data contamination challenges.


Article 1170

Title@2025-06-05 (4): DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis

Title: DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis DECASTE: Enthüllen von Kastenstereotypen in großen Sprachmodellen durch multidimensionale Bias-Analyse DECASTE:通过多语言双种族分析,在大语言模式中统一种姓定型观念 2505.14971v2

Authors (7): Prashanth Vijayaraghavan, Soroush Vosoughi, Lamogha Chiazor, Raya Horesh, Rogerio Abreu de Paula, Ehsan Degan, Vandana Mukherjee

Recent advancements in large language models (LLMs) have revolutionized natural language processing (NLP) and expanded their applications across diverse domains. However, despite their impressive capabilities, LLMs have been shown to reflect and perpetuate harmful societal biases, including those based on ethnicity, gender, and religion. A critical and underexplored issue is the reinforcement of caste-based biases, particularly towards India’s marginalized caste groups such as Dalits and Shudras. In this paper, we address this gap by proposing DECASTE, a novel, multi-dimensional framework designed to detect and assess both implicit and explicit caste biases in LLMs. Our approach evaluates caste fairness across four dimensions: socio-cultural, economic, educational, and political, using a range of customized prompting strategies. By benchmarking several state-of-the-art LLMs, we reveal that these models systematically reinforce caste biases, with significant disparities observed in the treatment of oppressed versus dominant caste groups. For example, bias scores are notably elevated when comparing Dalits and Shudras with dominant caste groups, reflecting societal prejudices that persist in model outputs. These results expose the subtle yet pervasive caste biases in LLMs and emphasize the need for more comprehensive and inclusive bias evaluation methodologies that assess the potential risks of deploying such models in real-world contexts.


Article 1171

Title@2025-06-05 (4): WiNGPT-3.0 Technical Report

Title: WiNGPT-3.0 Technical Report Technischer Bericht WiNGPT-3.0 WINGPT-3.0技术报告 2505.17387v2

Authors (13): Boqin Zhuang, Chenxiao Song, Huitong Lu, Jiacheng Qiao, Mingqian Liu, Mingxing Yu, Ping Hong, Rui Li, Xiaoxia Song, Xiangjun Xu, Xu Chen, Yaoyao Ma, Yujie Gao

Current Large Language Models (LLMs) exhibit significant limitations, notably in structured, interpretable, and verifiable medical reasoning, alongside practical deployment challenges related to computational resources and data privacy. This report focused on the development of WiNGPT-3.0, a 32-billion-parameter LLM engineered with the objective of enhancing its capacity for medical reasoning and exploring its potential for effective integration within healthcare IT infrastructures. The broader aim is to advance towards clinically applicable models. The approach involved a multi-stage training pipeline tailored for general, medical, and clinical reasoning. This pipeline incorporated supervised fine-tuning (SFT) and reinforcement learning (RL), leveraging curated Long Chain-of-Thought (CoT) datasets, auxiliary reward models, and an evidence-based diagnostic chain simulation. WiNGPT-3.0 demonstrated strong performance: specific model variants achieved scores of 66.6 on MedCalc and 87.1 on MedQA-USMLE. Furthermore, targeted training improved performance on a clinical reasoning task from a baseline score of 58.1 to 62.5. These findings suggest that reinforcement learning, even when applied with a limited dataset of only a few thousand examples, can enhance medical reasoning accuracy. Crucially, this demonstration of RL’s efficacy with limited data and computation paves the way for more trustworthy and practically deployable LLMs within clinical workflows and health information infrastructures.


Article 1172

Title@2025-06-05 (4): Calibrating Translation Decoding with Quality Estimation on LLMs

Title: Calibrating Translation Decoding with Quality Estimation on LLMs Kalibrierung der Übersetzungsdekodierung mit Qualitätsschätzung auf LLMs 以LLMM中质量估算法进行标注 2504.19044v3

Authors (3): Di Wu, Yibin Lei, Christof Monz

Neural machine translation (NMT) systems typically employ maximum a posteriori (MAP) decoding to select the highest-scoring translation from the distribution mass. However, recent evidence highlights the inadequacy of MAP decoding, often resulting in low-quality or even pathological hypotheses – the decoding objective is not aligned with real-world translation quality. This paper proposes calibrating hypothesis likelihoods with translation quality from a distribution view by directly optimizing their Pearson correlation – thereby enhancing the effectiveness of translation decoding. With our method, translation on large language models (LLMs) improves substantially after limited training (2K instances per direction). This improvement is orthogonal to those achieved through supervised fine-tuning, leading to substantial gains across a broad range of metrics and human evaluations – even when applied to top-performing translation-specialized LLMs fine-tuned on high-quality translation data, such as Tower, or when compared to recent preference optimization methods, like CPO. Moreover, the calibrated translation likelihood can directly serve as a strong proxy for translation quality, closely approximating or even surpassing some state-of-the-art translation quality estimation models, like CometKiwi. Lastly, our in-depth analysis demonstrates that calibration enhances the effectiveness of MAP decoding, thereby enabling greater efficiency in real-world deployment. The resulting state-of-the-art translation model, which covers 10 languages, along with the accompanying code and human evaluation data, has been released to the community: https://github.com/moore3930/calibrating-llm-mt.
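
The calibration objective described above, aligning hypothesis likelihoods with translation quality through their Pearson correlation, can be illustrated numerically. The sketch below is plain Python rather than a differentiable training loss, and the function names and toy scores are hypothetical; it only shows the quantity that such an objective would push towards 1.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def calibration_loss(hyp_logliks: Sequence[float], quality: Sequence[float]) -> float:
    """Negative correlation: minimizing it aligns likelihood with quality."""
    return -pearson(hyp_logliks, quality)


# Toy hypotheses for one source sentence (hypothetical log-likelihoods and quality scores).
well_calibrated = calibration_loss([-2.0, -3.5, -5.0], [0.9, 0.6, 0.3])
mis_calibrated = calibration_loss([-2.0, -3.5, -5.0], [0.3, 0.6, 0.9])
```

When likelihood and quality rank the hypotheses identically and linearly, the loss reaches its minimum of -1, which is the regime in which MAP decoding selects the best translation.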


Article 1173

Title@2025-06-05 (4): An Exploration of Self-Supervised Mutual Information Alignment for Multi-Task Settings

Title: An Exploration of Self-Supervised Mutual Information Alignment for Multi-Task Settings Eine Untersuchung der selbstüberwachten gegenseitigen Information Ausrichtung für Multi-Task-Einstellungen 探索多种任务设置的自我监督的相互信息协调 2410.01704v2

Authors (1): Soham V. Govande

There is a growing need for pluralistic alignment methods that can steer language models towards individual attributes and preferences. One such method, Self-Supervised Alignment with Mutual Information (SAMI), uses conditional mutual information to encourage the connection between behavioral preferences and model responses. We conduct two experiments exploring SAMI in multi-task settings. First, we compare SAMI to Direct Preference Optimization (DPO) on a multi-task benchmark (MT-Bench), using a stronger model to generate training data for a weaker one across diverse categories (humanities, STEM, extraction, coding, math, reasoning, and roleplay). Our results indicate that one iteration of SAMI has a 57% win rate against DPO, with significant variation in performance between task categories. Second, we examine SAMI’s impact on mathematical accuracy (GSM-8K) relative to supervised fine-tuning (SFT). While SAMI increases zero-shot performance by 1.1%, SFT is more effective with a 3.2% boost. However, SAMI shows interesting scaling trends. When given 10 attempts, SAMI improves accuracy by 3.9%, while SFT achieves a 10.1% increase. Combining SAMI with SFT yields an additional improvement of 1.3% in multi-attempt settings, though single-attempt accuracy remains unchanged.
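
The contrast between single-attempt and multi-attempt accuracy can be made concrete with a small helper. This sketch assumes that a problem counts as solved when at least one of the first k attempts is correct; the abstract does not state how the 10-attempt figure is computed, so this scoring rule, the function name, and the toy results are all assumptions.

```python
from typing import Sequence


def accuracy_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of problems with at least one correct answer among the first k attempts."""
    if not results:
        raise ValueError("results must not be empty")
    solved = sum(1 for attempts in results if any(attempts[:k]))
    return solved / len(results)


# Toy per-problem attempt outcomes (hypothetical): True means a correct answer.
toy_results = [
    [False, True, False],
    [False, False, False],
    [True, True, True],
    [False, False, True],
]
single_attempt = accuracy_at_k(toy_results, 1)
multi_attempt = accuracy_at_k(toy_results, 3)
```

Under this rule, multi-attempt accuracy can rise even when single-attempt accuracy is unchanged, which is consistent with the pattern reported for combining SAMI with SFT.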


Article 1174

Title@2025-06-05 (4): Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning

Title: Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning Grapheme-Kohärente phonemische und prosodic Annotation der Sprache durch Implizite und explizite Graphemkonditionierung 隐含和显性图形条件的图形一致的语音和Prosodic 语音批注 2506.04527v1

Authors (4): Hien Ohnaka, Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto

We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Unlike previous methods that simply fine-tune a pre-trained ASR model with the labels, the proposed model conditions label generation on the corresponding graphemes in two ways: 1) it adds implicit grapheme conditioning through a prompt encoder that uses pre-trained BERT features, and 2) it explicitly prunes label hypotheses that are inconsistent with the graphemes during inference. These methods make it possible to obtain parallel data of speech, labels, and graphemes, which is applicable to various downstream tasks such as text-to-speech and accent estimation from text. Experiments showed that the proposed method significantly improved the consistency between graphemes and the predicted labels. Further, experiments on an accent estimation task confirmed that the parallel data created by the proposed method effectively improves estimation accuracy.


Article 1175

Title@2025-06-05 (4): Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation

Title: Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation Bitte übersetzen Sie erneut: Zwei einfache Experimente darüber, ob Menschen wie Vernunft hilft Übersetzung 请再次翻译: 两项简单的实验, 有关 人类与人类之间的理由是否有助于翻译的简单实验 。 2506.04521v1

Authors (3): Di Wu, Seth Aycock, Christof Monz

Large Language Models (LLMs) demonstrate strong reasoning capabilities for many tasks, often by explicitly decomposing the task via Chain-of-Thought (CoT) reasoning. Recent work on LLM-based translation designs hand-crafted prompts to decompose translation, or trains models to incorporate intermediate steps. Translating Step-by-step (Briakou et al., 2024), for instance, introduces a multi-step prompt with decomposition and refinement of translation with LLMs, which achieved state-of-the-art results on WMT24. In this work, we scrutinise this strategy’s effectiveness. Empirically, we find no clear evidence that performance gains stem from explicitly decomposing the translation process, at least for the models on test; and we show that simply prompting LLMs to “translate again” yields even better results than human-like step-by-step prompting. Our analysis does not rule out the role of reasoning, but instead invites future work exploring the factors for CoT’s effectiveness in the context of translation.


Article 1176

Title@2025-06-05 (4): EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles

Title: EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles EnsemW2S: Verbesserung der Schwach-zu-Strong-Verallgemeinerung mit großsprachigen Modellensembles EnsemW2S:用大语言模型组合加强弱至强的通用化 2505.21959v2

Authors (9): Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, Bang An, Bayan Bruss, John Langford, Furong Huang

With Large Language Models (LLMs) rapidly approaching and potentially surpassing human-level performance, it has become imperative to develop approaches capable of effectively supervising and enhancing these powerful models using smaller, human-level models exposed to only human-level data. We address this critical weak-to-strong (W2S) generalization challenge by proposing a novel method aimed at improving weak experts by training on the same limited human-level data, enabling them to generalize to complex, super-human-level tasks. Our approach, called EnsemW2S, employs a token-level ensemble strategy that iteratively combines multiple weak experts, systematically addressing the shortcomings identified in preceding iterations. By continuously refining these weak models, we significantly enhance their collective ability to supervise stronger student models. We extensively evaluate the generalization performance of both the ensemble of weak experts and the subsequent strong student model across in-distribution (ID) and out-of-distribution (OOD) datasets. For OOD, we specifically introduce question difficulty as an additional dimension for defining distributional shifts. Our empirical results demonstrate notable improvements, achieving gains of 4% and 3.2% on ID datasets and up to 6% and 2.28% on OOD datasets for experts and student models, respectively, underscoring the effectiveness of our proposed method in advancing W2S generalization.
