• 00 06-18 (3) PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning PhantomHunter: Unsichtbarer, privat gestalteter LLM-generierter Text durch familienbewusstes Lernen erkennen PhhantomHunter: 通过家庭知识学习检测隐隐隐私人引导的LLM-发光文本 2506.15683v1
  • 01 06-18 GenRecal: Generation after Recalibration from Large to Small Vision-Language Models GenRecal: Generation nach Rekalibrierung von großen bis kleinen Vision-Sprachenmodellen GenRecal: 在从大到小的视觉语言模型重新校准后生成的模型 2506.15681v1
  • 02 06-18 Dense SAE Latents Are Features, Not Bugs Dense SAE Latenten sind Features, keine Bugs Hense SAE 终端是特征,不是虫虫 2506.15679v1
  • 03 06-18 Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence Verkörperte Web-Agenten: Überbrückung physikalisch-digitaler Realms für integrierte Agent-Intelligenz 嵌入式网络代理:为综合特工情报连接物理数字王国 2506.15677v1
  • 04 06-18 Gender-Neutral Machine Translation Strategies in Practice Gender-Neutrale maschinelle Übersetzungsstrategien in der Praxis 实践中的性别-新性别机器翻译战略 2506.15676v1
  • 05 06-18 Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers Leaky Thoughts: Große Denkmodelle sind keine privaten Denker 利基思想:大理由模型不是私人思想家 2506.15674v1
  • 06 06-18 CC-LEARN: Cohort-based Consistency Learning CC-LEARN: Kohortenbasiertes Konsistenzlernen CC-LEARN: 以联合为基地的一致学习 2506.15662v1
  • 07 06-18 AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning AutoRule: Hervorhebung einer Kette von Gedanken Extrahierte regelbasierte Belohnungen Verbesserung des Preference-Lernens 自动管理:理性思维链抽取有章可循的奖励 改善优先学习 2506.15651v1
  • 08 06-18 Oldies but Goldies: The Potential of Character N-grams for Romanian Texts Oldies but Goldies: Das Potential des Charakters N-Gramms für rumänische Texte 旧的但金的:罗马尼亚文本的字符N克潜力 2506.15650v1
  • 09 06-18 Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation Aug2Search: Verbesserung der Facebook-Marktplatzsuche mit LLM-generierter Synthetischer Datenvergrößerung Oug2Search:利用LLM光化合成数据增强功能,加强Facebook市场搜索 2505.16065v2
  • 10 06-18 Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability Überarbeitung der kompositorischen Verallgemeinerung Fähigkeit von großen Sprachmodellen unter Berücksichtigung von Instruktionen nach Fähigkeit 重新审视大型语文模式的构成通用能力,考虑按能力进行教学 2506.15629v1
  • 11 06-18 J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization J4R: Lernen, mit gleichwertiger anfänglicher Staatsgruppe zu urteilen Relative Politikoptimierung J4R:向法官学习与等同的初次国家集团相对政策优化 2505.13346v3
  • 12 06-18 A Guide to Misinformation Detection Data and Evaluation Ein Leitfaden für Fehlinformationserkennungsdaten und -bewertung 《错误信息检测数据和评价指南》 2411.05060v4
  • 13 06-18 Minding the Politeness Gap in Cross-cultural Communication Den Politismus in der interkulturellen Kommunikation berücksichtigen 在跨文化交流中注意因应能力差距 2506.15623v1
  • 14 06-18 The Compositional Architecture of Regret in Large Language Models Die kompositorische Architektur des Bedauerns in großen Sprachmodellen 大语言模式 “ 遗憾 “ 的构成结构 2506.15617v1
  • 15 06-18 Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning Router-R1: Lehren von LLMs Multi-Round Routing und Aggregation durch Verstärkungslernen 路由-R1路由-R1路由:教学LLMS 2506.09033v2
  • 16 06-18 LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning LoX: Low-Rank-Extrapolation stärkt LLM-Sicherheit gegen Feinabstimmung LoX:低Rank外推法强力推力LLM 安全防止微调 2506.15606v1
  • 17 06-18 From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns Vom Modell zum Klassenzimmer: Bewertung Generierter MCQs für Portugiesen mit Erzähl- und Schwierigkeitsproblemen 从模型到教室:评估有叙述和困难关切的葡萄牙语生成的中、中、低、中、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、低、低、低、低、高、高、高、高、高、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、高、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低 2506.15598v1
  • 18 06-18 WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts WikiMixQA: Ein multimodaler Benchmark für die Fragebeantwortung über Tabellen und Diagramme WikiMixQA:表格和图表问答的多模式基准 2506.15594v1
  • 19 06-18 Lean Workbook: A large-scale Lean problem set formalized from natural language math problems Lean Workbook: Ein groß angelegtes Lean-Problem, das aus natursprachlichen mathematischen Problemen formalisiert wird 利安工作手册:从自然语言数学问题中正式确定的一个大规模利安问题 2406.03847v3
  • 20 06-18 DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement DiscoSG: Auf dem Weg zu Diskurs-Level Textszene Grafik Parsing durch iterative Graphenverfeinerung DiscoSG:通过迭代图形精炼进行分层层文本场景图解 2506.15583v1
  • 21 06-18 SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification SciVer: Bewertung von Stiftungsmodellen für multimodale wissenschaftliche Patentprüfung SciVer:评估基金会多模式科学索赔核实模型 2506.15569v1
  • 22 06-18 Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models Gender Inclusivity Fairness Index (GIFI): Ein mehrstufiger Rahmen zur Bewertung der Geschlechtervielfalt in großen Sprachmodellen 性别包容性公平指数:以大语言模式评价性别多样性的多层次框架 2506.15568v1
  • 23 06-18 Fractured Chain-of-Thought Reasoning Zersplitterte Kette von nachdenklichen Gründen 断断断断断断断断断断断断的探讨链原因 2505.12992v3
  • 24 06-18 PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction PredGen: Beschleunigte Schlussfolgerung großer Sprachmodelle durch Input-Time-Spekulation für Echtzeit-Spekulationsinteraktion PredGen:通过实时语音互动输入-时间投机加速推断大语言模式 2506.15556v1
  • 25 06-18 How much do language models memorize? Wie viel merken sich Sprachmodelle? 语言模型背书多少? 2505.24832v3
  • 26 06-18 Approximating Language Model Training Data from Weights Annähernde Sprachmodell-Trainingsdaten aus Gewichten 由重量产生的近似语文示范培训数据 2506.15553v1
  • 27 06-18 RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models RATTENTION: Auf dem Weg zur minimalen Schiebefenstergröße in lokalen und globalen Aufmerksamkeitsmodellen 注意:在本地-全球关注模式中实现最小滑滑窗口大小 2506.15545v1
  • 28 06-18 Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework Polysemantik mit PRISM erfassen: Ein Multi-Konzept-Feature Beschreibung Framework 利用PRISM获得多边性能:多概念特征描述框架 2506.15538v1
  • 29 06-18 Pap2Pat: Benchmarking Outline-Guided Long-Text Patent Generation with Patent-Paper Pairs Pap2Pat: Benchmarking der Langtext-Patentgenerierung mit Patent-Paper-Paaren Pap2Patt:制定基准大纲,指导长式长式专利生成,配有专利-纸质配对 2410.07009v3
  • 30 06-18 Lessons from Training Grounded LLMs with Verifiable Rewards Lehren aus der Ausbildung begründeter LLMs mit überprüfbaren Belohnungen 从培训中汲取的教训 2506.15522v1
  • 31 06-18 RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering RadioRAG: Online-Retrieval-augmentierte Generation für Radiologie Fragen beantworten PARRAG: 放射问题解答在线检索增强的一代人 2407.15621v3
  • 32 06-18 Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency Warten Sie, wir brauchen nicht zu “warten”! Entfernen von Gedanken-Tokens verbessert vernünftige Effizienz 等等,我们不需要”等等”! 2506.08343v2
  • 33 06-18 Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge Verbesserung der Hyperbole- und Metaphor-Erkennung mit ihrem bidirektionalen dynamischen Interaktions- und Emotionswissen 利用双向动态互动和情感知识加强超双向超博和同义体探测 2506.15504v1
  • 34 06-18 Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence Austauschbare Token-Einbetten für erweiterbare Vokabeln und Alpha-Equivalenz 用于可扩展词汇和阿尔法等效的互换调制缩写嵌套 2410.17161v3
  • 35 06-18 SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling SPARE: Single-Pass-Annotation mit referenzgeführter Bewertung für automatische Prozessüberwachung und Prämienmodellierung SPARE: 具有自动程序监督和奖赏建模参考指导评价的单纸注释 2506.15498v1
  • 36 06-18 Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation Adding Chocolate to Mint: Vermeiden von Metric Interferenz in der maschinellen Übersetzung 在薄荷中添加巧克力:减轻机器翻译中的计量干涉 2503.08327v2
  • 37 06-18 Context-Informed Grounding Supervision Kontext-informierte Erdungsüberwachung 内地内地监督 2506.15480v1
  • 38 06-18 Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification? Breaking Bad Molecules: Sind MLLMs bereit für die molekulare Entgiftung auf Strukturebene? 破碎坏分子:MLLMs是否准备好进行结构级分子解毒? 2506.10912v2
  • 39 06-18 OM4OV: Leveraging Ontology Matching for Ontology Versioning OM4OV: Ontologie für die Ontologie-Versionierung OM4OV:利用本体学匹配本体学版本的本体学 2409.20302v3
  • 40 06-18 Factorized RVQ-GAN For Disentangled Speech Tokenization Factorized RVQ-GAN für entfremdete Sprach-Tokenisierung RVQ-GAN 分解语音代谢的量化 RVQ-GAN 2506.15456v1
  • 41 06-18 RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation RE-IMAGINE: Symbolische Benchmark-Synthese zur vernünftigen Bewertung RE-IMAGINE: 用于说明理由的评价的符号性基准综合报告 2506.15455v1
  • 42 06-18 AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need AgentGroupChat-V2: Divide-and-Conquer ist das, was ein LLM-basiertes Multi-Agent-System braucht GroupChat-V2:基于LLM的多种机构系统需要什么 2506.15451v1
  • 43 06-18 Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models Probabilistische Aggregation und gezielte Einbettung von Optimierungen für die kollektive moralische Vernunft in großen Sprachmodellen 大语言模式中集体道德理由的概率集中和定向嵌入最佳优化 2506.14625v2
  • 44 06-18 Understanding GUI Agent Localization Biases through Logit Sharpness Verständnis der Lokalisierung von GUI-Agenten durch Logit-Schärfung 通过 Lologit 锐化理解图形用户界面代理界面本地化分线 2506.15425v1
  • 45 06-18 Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning Gezielte Lexische Injektion: Entriegelung der latenten Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning 定向射入:通过早期Layer LoRA精准发射在Lugha-Llama解锁Lugha-Llama的中端交叉对齐 2506.15415v1
  • 46 06-18 PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice PsychBench: Ein umfassender und professioneller Maßstab für die Bewertung der Leistungsfähigkeit von LLM-unterstützter psychiatrischer klinischer Praxis 精神病时区:评估LLLM协助的精神病临床实践业绩的全面和专业基准 2503.01903v2
  • 47 06-18 PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims PEDANTIC: Ein Datensatz für die automatische Prüfung der Wirksamkeit von Patentansprüchen PEDANTIC: 自动审查专利索赔的缺陷数据集 2505.21342v3
  • 48 06-18 COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation COSMMIC: Kommentarsensitive multimodale Mehrsprachige indische Corpus für Zusammenfassung und Headline-Generierung COSMIC: 用于总结和标题代代的多语种印度公司评论-敏感多语种多语种公司 2506.15372v1
  • 49 06-18 SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models’ Knowledge of Indian Culture SANSKRITI: Ein umfassender Benchmark für die Bewertung der Kenntnisse indischer Kultur von Sprachmodellen SANSKRITI:评估语言模型对印度文化知识的综合基准 2506.15355v1
  • 50 06-18 DeVisE: Behavioral Testing of Medical Large Language Models DeVisE: Verhaltenstests von medizinischen großen Sprachmodellen DevisE:大语言医学模型行为测试 2506.15339v1
  • 51 06-18 GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations GreekBarBench: Ein anspruchsvolles Benchmark für freie Text-Rechtsveranlagung und Verweisungen 希腊Barbench:自由文本法律理由和引用的质疑性基准 2505.17267v2
  • 52 06-18 When and How Unlabeled Data Provably Improve In-Context Learning Wann und wie unmarkierte Daten nachweislich das In-Context-Lernen verbessern 何时以及如何改进内文学习 2506.15329v1
  • 53 06-18 AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation AIn’t Not Nothing But a Survey? Mit großen Sprachmodellen für die Codierung Deutsch Open-Ended Survey Responses on Survey Motivation 使用大语言模型对德国关于调查动机的开放式调查答复进行编码 2506.14634v2
  • 54 06-18 HiURE: Hierarchical Exemplar Contrastive Learning for Unsupervised Relation Extraction HiURE: Hierarchisches Beispiel Kontrastives Lernen für unüberwachte Beziehungsextraktion HIURE: 为不受监督的关系采掘而进行等级主义的高级特制反竞争学习 2205.02225v4
  • 55 06-18 The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants Die Avengers: Ein einfaches Rezept für die Vereinigung kleinerer Sprachmodelle, um proprietäre Riesen herauszufordern 《复仇者:将小型语言模式联合起来挑战产权巨人挑战小型语言模式的简单食谱》 2505.19797v3
  • 56 06-18 ConLID: Supervised Contrastive Learning for Low-Resource Language Identification ConLID: Beaufsichtigtes kontrastives Lernen für die Sprachidentifizierung mit geringer Ressource CONLID: 低资源语言识别的受监督的反竞争学习 2506.15304v1
  • 57 06-18 Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment Cohort Discovery: Eine Studie über LLM-Assisted Clinical Trial Recruitment Cohort发现:关于LLM协助临床试验征聘的调查 2506.15301v1
  • 58 06-18 An Effective Incorporating Heterogeneous Knowledge Curriculum Learning for Sequence Labeling Ein effektives Einbinden heterogenes Wissenscurriculum Lernen für die Sequenzkennzeichnung 有效纳入异种知识课程学习,以建立序列标签 2402.13534v2
  • 59 06-18 Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments Thunder-DeID: Genauer und effizienter De-Identifizierungsrahmen für Urteile des koreanischen Gerichts Thunder-DeID:韩国法院判决的准确和有效的取消识别框架 2506.15266v1
  • 60 06-18 Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition 自动化文献审查大语言模式:对参考资料生成、摘要编写和审查构成的评价 2412.13612v4
  • 61 06-18 TopClustRAG at SIGIR 2025 LiveRAG Challenge TopClustRAG auf der SIGIR 2025 LiveRAG Challenge SIGIR 2025 RiveRAG挑战的顶端Clustrag 2506.15246v1
  • 62 06-18 Aligning AI Research with the Needs of Clinical Coding Workflows: Eight Recommendations Based on US Data Analysis and Critical Review Ausrichtung der KI-Forschung auf die Bedürfnisse klinischer Codierungs-Workflows: Acht Empfehlungen basierend auf US-Datenanalyse und kritischer Überprüfung 使AI研究与临床编码工作流程的需要相一致:基于美国数据分析和关键审查的八项建议 2412.18043v2
  • 63 06-18 Research on Graph-Retrieval Augmented Generation Based on Historical Text Knowledge Graphs Forschung zur grafisch retrievalgenerierten Generierung anhand historischer Textwissensgraphen 基于历史文本知识图的图-检索检索增强型图-检索增加型研究 2506.15241v1
  • 64 06-18 Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants Lost in Variation? Bewertung der NLI-Performance in baskischen und spanischen geografischen Varianten 评价巴斯克和西班牙地理变异性国家LI绩效 2506.15239v1
  • 65 06-18 Dynamic Acoustic Model Architecture Optimization in Training for ASR Dynamische Akustische Modellarchitektur Optimierung im Training für ASR ASR培训中动态声声学示范建筑结构优化 2506.13180v2
  • 66 06-18 Robust Utility-Preserving Text Anonymization Based on Large Language Models Robuste Utility-Preserving Text Anonymisierung basierend auf großen Sprachmodellen 基于大语言模式的强力功用-保存文本匿名 2407.11770v2
  • 67 06-18 video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models Video-SALMONN 2: Bildunterschrift-verbesserte Audio-Visuelle große Sprachmodelle 视频-SALMONN2:字幕-强化视听大语言模式 2506.15220v1
  • 68 06-18 TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks TSLFormer: Ein leichtes Transformer-Modell für die türkische Erkennung von Zeichensprache mit skelettalen Markierungen TSL Former: 使用骨骼地标土耳其手语识别的轻量级变换器模型 2505.07890v4
  • 69 06-18 MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs MinosEval: Distinguishing Factoid und Non-Factoid für maßgeschneiderte, offene QA-Bewertung mit LLMs MinosEval:与LLMM公司一道,区分用于定制的不限成员名额质量保证评价的非事实和非事实 2506.15215v1
  • 70 06-18 ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs ProtoReasoning: Prototypen als Stiftung für generalisierbare Vernunft in LLMs 原生共振:原型作为LLMs中普遍合理理由基金会 2506.15211v1
  • 71 06-18 A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals Eine vergleichende Studie über Anpassungstechniken großer Sprachmodelle zur Ermittlung von Zielen für eine nachhaltige Entwicklung 关于用于确定可持续发展目标的大型语言模型任务适应技术的比较研究 2506.15208v1
  • 72 06-18 BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning BIS Reasoning 1.0: Der erste großformatige japanische Benchmark für glaubens-inkonsistente syllogistische Reasoning BIS 理由1.0:日本第一个大尺度的信仰不一致时断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断 2506.06955v2
  • 73 06-18 A Systematic Survey of Natural Language Processing for the Greek Language Eine systematische Untersuchung der natürlichen Sprachverarbeitung für die griechische Sprache 系统调查希腊语的自然语言处理 2407.09861v4
  • 74 06-18 Seewo’s Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models Seewos Vorlage bei MLC-SLM: Lehren aus sprachbezogenen Sprachmodellen Seewoo向刚果解放运动-解解运提交的材料:从讲理由语言模式中学到的教益 2506.13300v3
  • 75 06-18 LLäMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch LLäMmlein: Transparente, kompakte und wettbewerbsfähige deutschsprachige Sprachmodelle von Scratch LläMmlein:来自斯克拉奇的透明、紧凑和有竞争力的独德语言模式 2411.11171v5
  • 76 06-18 Enhancing Goal-oriented Proactive Dialogue Systems via Consistency Reflection and Correction Verbesserung zielorientierter proaktiver Dialogsysteme durch Konsistenzreflexion und Korrektur 通过一致性反思和校正加强面向目标的前瞻性对话系统 2506.13366v3
  • 77 06-18 Efficient Long CoT Reasoning in Small Language Models Effiziente Long CoT-Reasoning in kleinen Sprachmodellen 低语言模式中有效的长期计算成本理由 2505.18440v2
  • 78 06-18 Emergence of Primacy and Recency Effect in Mamba: A Mechanistic Point of View Entstehung von Primat und Recency-Effekt in Mamba: Ein mechanistischer Standpunkt 曼巴的先权效应和后期效应:机械观察点 2506.15156v1
  • 79 06-18 ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models ALPS: Aufmerksamkeit Lokalisierung und Pruning-Strategie zur effizienten Ausrichtung großer Sprachmodelle ALPS: 高效统一大语言模式的注意地方化和审慎战略 2505.18799v4
  • 80 06-18 SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning SonicVerse: Multi-Task-Lernen für Musik-Feature-informierte Bildunterschriften SonicVerse: 音乐特色多任务学习 2506.15154v1
  • 81 06-18 TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding TransXSSM: Ein Hybrid Transformer State Space Modell mit unified Rotary Position Einbettung TransXSSSSM:带有统一扶轮定位嵌入装置的混合变形国家空间模型 2506.09507v3
  • 82 06-18 BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs BriefMe: Ein gesetzlicher NLP-Benchmark für die Unterstützung mit rechtlichen Briefen 简报:协助提供法律简报的《国家劳工规划法》法律基准 2506.06619v2
  • 83 06-18 Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models Thunder-Tok: Minimierung von Token pro Wort bei Tokenizing koreanischer Texte für generative Sprachmodelle Thunder-Tok: 将韩文用于创用语言模式的韩文中逐个字的调子最小化 2506.15138v1
  • 84 06-18 GRAM: A Generative Foundation Reward Model for Reward Generalization GRAM: Ein generatives Stiftungsprämienmodell für Belohnungsverallgemeinerung GRAM: 奖励普遍化的创金基金会奖励模式 2506.14175v2
  • 85 06-18 Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs Modellierung der ein-zu-vielen Immobilien im Open-Domain-Dialog mit LLMs 在与LLMM的开放式对话中模拟一对一财产 2506.15131v1
  • 86 06-18 REVOLVE: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization REVOLVE: Optimierung von KI-Systemen durch Tracking Response Evolution in der Textoptimierung REVOLVE:通过跟踪文字优化的应对演变,优化AI系统 2412.03092v2
  • 87 06-18 Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation Linderung der Verteilungsverschiebung in synthetischen Daten für die Schätzung der maschinellen Übersetzungsqualität 减轻机器翻译质量估算合成数据分配变化 2502.19941v3
  • 88 06-18 Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model Effizienter Aufbau eines Domain-Spezifischen Large Language Models aus Scratch: Eine Fallstudie eines klassischen chinesischen Large Language Models 高效率地建立来自Scratch的域特定大语言模型:中国古典大语言模型案例研究 2505.11810v3
  • 89 06-18 CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale CODESYNC: Synchronisierung großer Sprachmodelle mit dynamischer Codeentwicklung auf Skala CODESYNC: 使大语言模式与动态代码演变规模化同步 2502.16645v2
  • 90 06-18 SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents SpokenWOZ: Ein großformatiger Sprach-Text-Benchmark für gesprochene Task-Orientierte Dialog-Agenten pokenWOZ:针对以任务为主的对话代理方的大型演讲-文本基准 2305.13040v6
  • 91 06-18 CKD-EHR:Clinical Knowledge Distillation for Electronic Health Records CKD-EHR:Klinische Wissensdestillation für elektronische Gesundheitsdaten CKD-EHR: 用于电子健康记录的临床知识蒸馏 2506.15118v1
  • 92 06-18 Perspective Transition of Large Language Models for Solving Subjective Tasks Perspektivischer Übergang von großen Sprachmodellen zur Lösung subjektiver Aufgaben 解决主观任务大语言模式的视角过渡 2501.09265v2
  • 93 06-18 Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding Bi-VLDoc: Bidirektionale Vision-Sprachenmodellierung für Visually-Rich Document Understanding Bi-VLDoc:视觉-里希文件理解的双向视觉-语言建模 2206.13155v2
  • 94 06-18 I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search I-MCTS: Verbesserung der agentischen AutoML durch introspektive Monte Carlo Baumsuche I-MCTS:通过回想蒙特卡洛树搜索加强代理自动洗钱 2502.14693v3
  • 95 06-18 ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools ChemHAS: Hierarchische Agenzien-Stacking zur Verbesserung von Chemiewerkzeugen ChemHAS:加强化学工具的等级代理人 2505.21569v2
  • 96 06-18 Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs Ring-lite: Skalierbares Reasoning über C3PO-stabilisiertes Verstärkungslernen für LLMs 环:通过C3PO – – 稳定地加强LLMs的强化学习,按比例说明理由 2506.14731v2
  • 97 06-18 Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level Root Defence Strategies: Gewährleistung der Sicherheit von LLM auf der Decodierungsebene 根本防御战略:确保顶级LLM的安全 2410.06809v3
  • 98 06-18 Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification Verbesserung des Dialog-Diskurs Parsens durch diskursbewusste Aufklärung 通过有礼识的尿道澄清改进对话讨论 2506.15081v1
  • 99 06-18 Learning-Time Encoding Shapes Unlearning in LLMs Lernzeitkodierung Formen Entlernen in LLMs 学习-时间编码 2506.15076v1
  • 100 06-18 LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models LLMs können gefährliche Gründe sein: Analysieren-basierter Jailbreak-Angriff auf große Sprachmodelle LLMs可以是危险理由:基于分析的对大语言模式的越狱攻击 2407.16205v6
  • 101 06-18 Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation Semantially-Aware Belohnungen für Open-Ended R1 Training in Free-Form Generation 免费新一代不限名额R1培训的 “ 闪存式 “ 奖项 2506.15068v1
  • 102 06-18 Math Neurosurgery: Isolating Language Models’ Math Reasoning Abilities Using Only Forward Passes Math Neurochirurgie: Die Math-Reasoning-Fähigkeiten von Sprachmodellen mit nur Vorwärtspassagen isolieren 数学神经外科:仅使用前方通道的孤立语言模型理据能力 2410.16930v4
  • 103 06-18 SemVink: Advancing VLMs’ Semantic Understanding of Optical Illusions via Visual Global Thinking SemVink: Das semantische Verständnis optischer Illusionen durch visuelles globales Denken von VLMs verbessern SemVink:通过视觉全球思维推进VLMs对光学幻影的语义理解 2506.02803v2
  • 104 06-18 Thunder-NUBench: A Benchmark for LLMs’ Sentence-Level Negation Understanding Thunder-NUBench: Ein Benchmark für das Urteils-Negation-Verständnis von LLMs Thunder-NUBench:LLLM女士的判刑级差理解基准 2506.14397v2
  • 105 06-18 Identifying economic narratives in large text corpora – An integrated approach using Large Language Models Identifizieren von ökonomischen Erzählungen in großen Textkorpora – Ein integrierter Ansatz mit großen Sprachmodellen 在大文本公司中确定经济说明 – – 使用大语言模式的综合办法 2506.15041v1
  • 106 06-18 Identifying social isolation themes in NVDRS text narratives using topic modeling and text-classification methods Identifizierung sozialer Isolationsthemen in NVDRS-Textnarrativen mittels Themenmodellierung und Textklassifizierung 利用专题建模和文本分类方法,在国家难民、难民、难民、难民、难民、难民、难民、难民、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者 2506.15030v1
  • 107 06-18 An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW Eine präzise und überarbeitete Version der optischen Zeichenerkennungs-basierten Sprachsynthese mit LabVIEW 利用拉比韦厄综合实验室进行精确和订正的光学字符识别语音合成 2506.15029v1
  • 108 06-17 (2) Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size Optimale Einbettung der Lernrate in LLMs: Der Effekt der Vokabelgröße LLMM中最佳嵌入式学习率:词汇大小的影响 2506.15025v1
  • 109 06-17 Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation 多方语言模式:推进合作、协调和适应 2506.09331v2
  • 110 06-17 Entropy-based Exploration Conduction for Multi-step Reasoning Entropiebasierte Explorationsleitung für mehrstufige Vernunft 用于多步骤理由的基于英信的探索行为 2503.15848v2
  • 111 06-17 Memory Tokens: Large Language Models Can Generate Reversible Sentence Embeddings Memory Tokens: Große Sprachmodelle können reversible Satzeinbettungen generieren 内存当量: 大语言模型能够生成可翻转的句子嵌入 2506.15001v1
  • 112 06-17 Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings Hypothesentest zur Quantifizierung von LLM-Mensch-Missausrichtung in Mehrfachauswahl-Einstellungen 多种选择环境中人类错配量化LLM-人类错配的假设测试 2506.14997v1
  • 113 06-17 LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles LaMP-Cap: Personalisierte Bildunterschriftserstellung mit multimodalen Bildprofilen LaMP-Cap: 具有多模式图解的个人化图解生成 2506.06561v2
  • 114 06-17 Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective Überarbeiten von Stärkungslernen für LLM-Reasoning aus einer bereichsübergreifenden Perspektive 重新考虑从跨主题角度重新研究加强学习LLM 2506.14965v1
  • 115 06-17 From Chat to Checkup: Can Large Language Models Assist in Diabetes Prediction? Vom Chat bis zum Checkup: Können große Sprachmodelle bei der Diabetes-Vorhersage helfen? 从聊天到检查:大语言模型能帮助糖尿病预测吗? 2506.14949v1
  • 116 06-17 Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing Unter Bearbeiten & Überarbeiten mit iterativem & Nachbar-Assisted Model Editing lösen 用迭代和邻里辅助型号编辑解决以迭代和邻里辅助型号编辑的 unit & overdidite 2503.11895v2
  • 117 06-17 Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers Zu groß zu denken: Kapazität, Erinnerung und Verallgemeinerung in vortrainierten Transformern 能力、记忆和在培训前变异器中的普及化 2506.09099v2
  • 118 06-17 MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance MDBench: Ein synthetischer Multi-Document-Reasoning-Benchmark mit Wissensführung MDBENCH:以知识指南制作的合成多文件理由说明基准 2506.14927v1
  • 119 06-17 UD-English-CHILDES: A Collected Resource of Gold and Silver Universal Dependencies Trees for Child Language Interactions UD-English-CHILDES: Eine gesammelte Ressource aus Gold und Silber Universelle Abhängigkeiten Bäume für kindersprachliche Interaktionen UD-English-CHILDES:儿童语言互动金树和银银树集成资源 2504.20304v3
  • 120 06-17 Can LLMs Ask Good Questions? Können LLMs gute Fragen stellen? LLMs能问好问题吗? 2501.03491v2
  • 121 06-17 CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision CrEst: Glaubwürdigkeitsschätzung für Kontexte in LLMs über schwache Überwachung CrEst: 微弱监督LLM女士背景的可靠估计 2506.14912v1
  • 122 06-17 Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction Kombination von eingeschränkter und ungezwungener Dekodierung durch Boosting: BoostCD und seine Anwendung auf Informationsextraktion 将受约束和不受限制的通过推动解锁结合起来:推动及其在信息提取方面的应用 2506.14901v1
  • 123 06-17 Adverse Event Extraction from Discharge Summaries: A New Dataset, Annotation Scheme, and Initial Findings Negative Ereignisextraktion aus entladenen Zusammenfassungen: Ein neuer Datensatz, Annotationsschema und erste Ergebnisse 《从排放中提取的不利事件摘要:新数据集、注解办法和初步调查结果》 2506.14900v1
  • 124 06-17 Chain-of-Thought Reasoning In The Wild Is Not Always Faithful In den Wilden zu denken, ist nicht immer treu 历经深思深虑的 荒野不总是忠心耿耿 2503.08679v4
  • 125 06-17 A Variational Framework for Improving Naturalness in Generative Spoken Language Models Ein abwechslungsreicher Rahmen zur Verbesserung der Natürlichkeit in generativen Sprachmodellen 改善发源口语模式中自然特性的变式框架 2506.14767v1
  • 126 06-17 ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM ASCD: aufmerksamkeitsbeständige Kontrastdekodierung zur Reduktion der Halluzination in MLLM ASCD: 减少低LLLM中致幻作用的可引起注意的违反规则标记 2506.14766v1
  • 127 06-17 From Bytes to Ideas: Language Modeling with Autoregressive U-Nets Von Bytes zu Ideen: Sprachmodellierung mit autoregressiven U-Netzen 从字节到理念:用自动递减 U-Nets 进行语言建模 2506.14761v1
  • 128 06-17 Reasoning with Exploration: An Entropy Perspective Vernunft mit Exploration: Eine Entropie-Perspektive 探索理由:宇宙展望 2506.14758v1
  • 129 06-17 Controllable and Reliable Knowledge-Intensive Task-Oriented Conversational Agents with Declarative Genie Worksheets Kontrollierbare und zuverlässige wissensintensive, zielorientierte Conversational Agents mit deklarativen Genie-Arbeitsblättern 具有公开基因工作表的可控制和可靠、知识密集、以任务为导向、以任务为导向的具有可控和可靠知识密集的谈话剂 2407.05674v3
  • 130 06-17 SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints SOPBench: Sprachagenten bei folgenden Standardbetriebsverfahren und Einschränkungen bewerten SOPBench:评价遵守标准作业程序和制约因素的语文代理 2503.08669v2
  • 131 06-17 Optimizing Length Compression in Large Reasoning Models Optimierung der Längenkompression in großen vernünftigen Modellen 在大理由模型中优化长度压缩 2506.14755v1
  • 132 06-17 Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework Auf dem Weg zu einer besseren Open-Ended Textgenerierung: Ein Multikriterien-Evaluierungsrahmen 实现更好的不限 限 限 限 质 文本的生成:多标准评价框架 2410.18653v3
  • 133 06-17 Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language Corpora Nutzung großer Sprachmodelle zur Messung der Geschlechterrepräsentanz Bias in Gendered Language Corpora 利用大语言模式衡量性别语言单位的性别代表比比 2406.13677v3
  • 134 06-17 Assessing the Reasoning Capabilities of LLMs in the context of Evidence-based Claim Verification Bewertung der mit Gründen versehenen Fähigkeiten von LLM im Zusammenhang mit der beweisgestützten Prüfung von Anträgen 结合基于证据的索赔核实评估LLM 合理性的能力 2402.10735v4
  • 135 06-17 Reparameterized LLM Training via Orthogonal Equivalence Transformation Reparameterisiertes LLM-Training über Orthogonale Äquivalenztransformation 通过正正对等转化进行修复性磁力LLM培训 2506.08001v3
  • 136 06-17 Capacity Matters: a Proof-of-Concept for Transformer Memorization on Real-World Data Capacity Matters: Ein Proof-of-Concept für Transformer-Memorisierung auf Real-World-Daten 能力事项:关于现实世界数据变换者记忆的验证概念 2506.14704v1
  • 137 06-17 Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers Treasure Hunt: Echtzeit-Targeting des Long Tails mit Trainings-Time Markern 宝藏狩猎:使用培训-时间标记实时定位长尾鱼 2506.14702v1
  • 138 06-17 Bridging Social Media and Search Engines: Dredge Words and the Detection of Unreliable Domains Überbrücken von Social Media und Suchmaschinen: Dredge Words und die Erkennung von unzuverlässigen Domains 连接社会媒体和搜索引擎:隐隐词和探测不可靠的域域 2406.11423v4
  • 139 06-17 The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs Der alternative Annotator-Test für LLM-as-a-Richter: Wie man die Ersetzung menschlicher Annotatoren durch LLMs statistisch rechtfertigt LLM-A法官的替代性说明人测试:如何在统计上合理用LMS取代人类说明人 2501.10970v3
  • 140 06-17 Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models Sprache und Planung in der Roboternavigation: Mehrsprachige Bewertung modernster Modelle 机器人导航的语言和规划:对最新艺术模式的多语种评价 2501.05478v2
  • 141 06-17 Agent Laboratory: Using LLM Agents as Research Assistants Agent Laboratory: LLM-Agenten als wissenschaftliche Assistenten 实验室:利用LLLM代理作为研究助理 2501.04227v2
  • 142 06-17 Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality Massive überwachte Feinsteuerungsexperimente zeigen, wie Daten, Ebenen und Trainingsfaktoren LLM-Ausrichtungsqualität gestalten 大规模监督的微调实验 数据、图层和培训因素 成型LLLM 目标质量 2506.14681v1
  • 143 06-17 FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback FigCaps-HF: Ein figure-to-caption Generatives Framework und Benchmark mit menschlichem Feedback FigCaps-HF:数字对数字生成框架和人文反馈基准 2307.10867v2
  • 144 06-17 A Hybrid Multi-Agent Prompting Approach for Simplifying Complex Sentences Ein Hybrid-Multi-Agent-Prompting-Ansatz zur Vereinfachung komplexer Sätze 简化复杂判刑的混合混合多重代理推动办法 2506.11681v2
  • 145 06-17 ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities ONEBench, um sie alle zu testen: Benchmarking auf Probenebene über offene Fähigkeiten 一、一、测试所有标准:关于开放式能力的抽样基准 2412.06745v2
  • 146 06-17 Convert Language Model into a Value-based Strategic Planner Konvertieren Sie Sprachmodell in einen wertbasierten strategischen Planer 将语言模式转换成基于价值的战略规划员 2505.06987v4
  • 147 06-17 GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors GuiLoMo: Zuordnung von Expertenzahl und Rang für LoRA-MoE über Bilevel-Optimierung mit GuidedSelection-Vektoren Guilomo:通过向导选择矢量的双级优化为 LoRA-MoE 分配专家编号和排名 2506.14646v1
  • 148 06-17 Passing the Turing Test in Political Discourse: Fine-Tuning LLMs to Mimic Polarized Social Media Comments Den Turing-Test im politischen Diskurs bestehen: Fine-Tuning LLMs to Mimic Polarized Social Media Kommentare 透过政治话题图图图图图测试:微光极极化社会媒体评论 2506.14645v1
  • 149 06-17 Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot Revisiting Chain-of-Thought Prompting: Null-Schuss kann stärker sein als wenige-Schuss 重新思考寻求链激励:零射出比少射出强 2506.14641v1
  • 150 06-17 IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems IP-Leakage-Angriffe zielen auf LLM-basierte Multi-Agent-Systeme IP IP 以LLM为基础的多机构系统为目标的针对LLM的漏漏攻击系统 2505.12442v3
  • 151 06-17 Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers Zeigen große Sprachmodelle Kognitive Dissonanz? Studieren des Unterschieds zwischen offenbarten Glaubensbekenntnissen und erklärten Antworten 大型语言模型 展示认知差异? 研究信奉信仰与国家答复之间的差异 2406.14986v3
  • 152 06-17 Prefix-Tuning+: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention Prefix-Tuning+: Modernisierung des Prefix-Tunings durch Entkoppelung des Prefixs von Aufmerksamkeit 前缀- 调整+: 通过将前缀与注意脱钩而使前缀- 调整前缀现代化 2506.13674v2
  • 153 06-17 VisText-Mosquito: A Multimodal Dataset and Benchmark for AI-Based Mosquito Breeding Site Detection and Reasoning VisText-Mosquito: Ein multimodaler Datensatz und Benchmark für KI-basierte Mosquito-Züchtungsstandorterkennung und -Vernunft VisText-Mosquito:基于AI的蚊子育种点检测和理据的多模式数据集和基准 2506.14629v1
  • 154 06-17 SynGraph: A Dynamic Graph-LLM Synthesis Framework for Sparse Streaming User Sentiment Modeling SynGraph: Ein dynamisches Graph-LLM-Synthese-Framework für Sparse Streaming User Sentiment Modeling Syllgraph: 垃圾流用户感应建模动态图形-LLM合成框架 2503.04619v2
  • 155 06-17 TaskCraft: Automated Generation of Agentic Tasks TaskCraft: Automatisierte Generierung von Agentischen Aufgaben TTTCraft:自动生成代理任务 2506.10055v2
  • 156 06-17 Graph RAG for Legal Norms: A Hierarchical, Temporal and Deterministic Approach Grafik RAG für rechtliche Normen: Hierarchischer, zeitlicher und deterministischer Ansatz 法律规范的图表RAG:一个等级、时间和决定因素学方法 2505.00039v3
  • 157 06-17 When Does Meaning Backfire? Investigating the Role of AMRs in NLI Wann bedeutet Backfire? Untersuchung der Rolle von AMRs in NLI ” 什么时候发生反火 “ 的含义? 调查在非国家劳动力调查中年龄、年龄、年龄、年龄、年龄、年龄、年龄、年龄、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、 性别、性别、性别、 性别、性别、性别、性别、性别、性别、性别、 性别、性别、 性别、 性别、性别、性别、 性别、性别、性别、性别、性别、性别、性别、 性别、性别、性别、性别、性别、 性别、 性别、 性别、 性别、 性别、 性别 性别 性别 性别、 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 2506.14613v1
  • 158 06-17 Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees Guaranteed Guess: Ein Sprachmodellierungsansatz für CISC-to-RISC Transpilation mit Testgarantien 有担保的猜测:具有测试保证的CISC到RISC传输语言模拟方法 2506.14606v1
  • 159 06-17 Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents Navigieren der digitalen Welt als Menschen tun: Universal Visual Grounding für GUI-Agenten 将数字世界作为人行:通用用户界面代理的通用视觉定位 2410.05243v3
  • 160 06-17 Computational Studies in Influencer Marketing: A Systematic Literature Review Computational Studies in Influencer Marketing: A Systematic Literature Review 《影响营销中的计算研究:系统文学评论》 2506.14602v1
  • 161 06-17 From tools to thieves: Measuring and understanding public perceptions of AI through crowdsourced metaphors Von Werkzeugen zu Dieben: Messen und Verstehen öffentlicher Wahrnehmungen von KI durch crowdsourced Metaphern 从工具到盗贼:通过众包比喻衡量和理解公众对AI的看法 2501.18045v3
  • 162 06-17 GenerationPrograms: Fine-grained Attribution with Executable Programs GenerationProgramme: Feinkörnige Zuordnung mit ausführbaren Programmen 代代方案:与可执行方案精细分配 2506.14580v1
  • 163 06-17 PredictaBoard: Benchmarking LLM Score Predictability PredictaBoard: Benchmarking der LLM-Score-Vorhersagbarkeit 预测波:测标 LLM 评分可预测性 2502.14445v2
  • 164 06-17 Evolution of ESG-focused DLT Research: An NLP Analysis of the Literature Entwicklung der ESG-orientierten DLT-Forschung: Eine NLP-Analyse der Literatur 以环境、社会和科学为重点的DLT研究的演变:对文学的分析 2308.12420v4
  • 165 06-17 TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization TDSPO: Nutzen Sie Token-Level-Reward-Leitfaden zur Verbesserung der Direktpräferenzoptimierung TGDPO:提高直接优惠优化利用物价奖励指导 2506.14574v1
  • 166 06-17 AlphaDecay:Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs AlphaDecay: Modulenweises Gewichtsdecay für schweres Balancing in LLMs AlphaDecay: LLM 中重帆平衡的中度偏重衰减 2506.14562v1
  • 167 06-17 ClusterChat: Multi-Feature Search for Corpus Exploration ClusterChat: Multi-Feature Suche nach Corpus Exploration COFCHat: 多功能探索Corpus 勘探 2412.14533v2
  • 168 06-17 M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction mit großen Sprachmodellen M2BAAMLLM:多式遥感-动力毫米 2506.14532v1
  • 169 06-17 Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective Inhärente und entstehende Haftungsfragen in LLM-basierten agentischen Systemen: eine Principal-Agent-Perspektive 以LLLM为基础的代理系统中的固有和新出现的赔偿责任问题:主要代理人的视角 2504.03255v2
  • 170 06-17 LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops LingoLoop Attack: MLLMs über sprachlichen Kontext und Staatseinfall in endlose Loops LingoLooo攻击:通过语言背景和国家诱入无尽环圈来跟踪MLLMs 2506.14493v1
  • 171 06-17 BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English BESSTIE: Ein Benchmark für die Sentiment- und Sarkasmusklassifikation für englische Sorten BESSTIE:英语品种的森化和讽刺性分类基准 2412.04726v3
  • 172 06-17 LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM’s Textual Training Data LexiMark: Robuste Wassermarkierung über Lexical Substitutions zur Erweiterung der Mitgliedschaftsbestätigung der Texttrainingsdaten eines LLM LexiMark:通过用法律替代办法进行强有力的水标记,以加强对LLM的文字培训数据进行成员核查 2506.14474v1
  • 173 06-17 Rectifying Belief Space via Unlearning to Harness LLMs’ Reasoning Rektifizieren von Glaube Raum über Unlearning zu Harness LLMs’ Reasoning 通过 “ 重新学习 “ 来改变信仰空间 2502.20620v2
  • 174 06-17 How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison Wie weit können LLMs sich aus Erfahrung verbessern? Test-Time-Learning-Fähigkeiten in LLMs mit menschlichem Vergleich messen 如何从经验中提高LLMs的改进程度? 衡量LLMs与人类比较的试验-时间学习能力 2506.14448v1
  • 175 06-17 LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs LongLlaDA: Entsperren langer Kontextkapazitäten in Diffusions-LLMs LongLLALDA:释放扩散长程距离能力 2506.14429v1
  • 176 06-17 Uncovering Overfitting in Large Language Model Editing Uncovering Overfitting in der großsprachigen Modellbearbeitung 在大语言模式编辑中进行覆盖覆盖覆盖的覆盖超版编辑 2410.07819v2
  • 177 06-17 ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge Implic Retrieval Challenge: Benchmarking der Implicity Fact Retrieval Challenge ImpliRet:设定隐含事实检索挑战的基准 2506.14407v1
  • 178 06-17 CAPO: Cost-Aware Prompt Optimization CAPO: Kostenbewusste Optimierung CAPO: 成本软件快速优化 2504.16005v4
  • 179 06-17 Ensemble Watermarks for Large Language Models Ensemble Wasserzeichen für große Sprachmodelle 用于大语言模型的集合水标记 2411.19563v2
  • 180 06-17 Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information Automatisierter Aufbau eines Wissensdiagramms für Kernfusionsenergie zur effektiven Gewinnung und Gewinnung von Informationen 自动构建核聚变能源知识图,以有效取用和检索信息 2504.07738v2
  • 181 06-17 SeqPE: Transformer with Sequential Position Encoding SeqPE: Transformer mit sequentieller Positionskodierung SeqPE:具有序列位置编码的变形器 2506.13277v2
  • 182 06-17 ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection ELLIS Alicante bei CQs-Gen 2025: Die kritischen Denkfragen gewinnen gemeinsame Aufgabe: LLM-basierte Fragegenerierung und Auswahl 2025年CQs-Gen CQs-Gen ELLIS Alicante:赢得关键思考问题的共同任务:基于LLM问题的产生和选择 2506.14371v1
  • 183 06-17 Digital Gatekeepers: Google’s Role in Curating Hashtags and Subreddits Digitale Gatekeeper: Googles Rolle bei der Heilung von Hashtags und Subreddits 数字看门人:谷歌在消除大麻塔和Subreddid方面的作用 2506.14370v1
  • 184 06-17 Exploring news intent and its application: A theory-driven approach Erforschen der Nachrichten-Intention und ihrer Anwendung: Ein theoriegesteuerter Ansatz 探索新闻意图及其应用:理论驱动方法 2312.16490v2
  • 185 06-17 A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis Eine Vision für Geo-Temporale Deep Research Systeme: Auf dem Weg zu einer umfassenden, transparenten und reproduzierbaren Geo-Temporalen Informationssynthese 地球-临时深层研究系统展望:走向全面、透明和可复制的地球-临时信息综述 2506.14345v1
  • 186 06-17 Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics Bewertung sollte nicht ignorieren Variation: Auf die Auswirkungen der Referenzsatz Wahl auf Zusammenfassung Metrics 评价不应忽视变化变化:关于参考选择对汇总计量的影响 2506.14335v1
  • 187 06-17 ROSAQ: Rotation-based Saliency-Aware Weight Quantization for Efficiently Compressing Large Language Models ROSAQ: Rotationsbasierte Saliency-Aware-Gewichtsquantisierung für effiziente Komprimierung großer Sprachmodelle ROSAQ: 高效压缩大语言模型的基于旋转的 耐用软件强度 2506.13472v2
  • 188 06-17 Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models Anreize für eine fortgeschrittene Instruktions-Folge von großen Sprachmodellen 为采用大语言模式的高级指示提供激励理由 2506.01413v4
  • 189 06-17 Do Construction Distributions Shape Formal Language Learning In German BabyLMs? Gestalten Konstruktionsverteilungen formales Sprachenlernen in deutschen BabyLMs? 是否用德国婴儿LMS模式进行建筑分配, 2503.11593v2
  • 190 06-17 Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent Erwartungsbestätigung Preference Optimization für Multi-Turn Conversational Recommendation Agent 多轮对话建议代理商的预期确认优先优化 2506.14302v1
  • 191 06-17 AI-Facilitated Analysis of Abstracts and Conclusions: Flagging Unsubstantiated Claims and Ambiguous Pronouns AI-Fazilitated Analysis of Abstracts and Conclusions: Flagging Nonsubstantiated Claims and Ambigued Pronomens AI-便利对摘要和结论的分析:无凭无据的旗舰索赔和不明利贷 2506.13172v2
  • 192 06-17 ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities KonsistenzChecker: Baumbasierte Bewertung von LLM-Verallgemeinerungsfähigkeiten 一致性检查:基于树木的LLM通用能力评价 2506.12376v2
  • 193 06-17 From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents Von, was zu reagieren, wann zu reagieren: Timely Response Generation für Open-Domain-Dialog-Agenten 从什么到回应何时响应:为开放域对话代理机构及时作出反应 2506.14285v1
  • 194 06-17 FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation FlagEvalMM: Flexibler Rahmen für eine umfassende multimodale Modellbewertung FlaignEvalMMM:综合多式联运模式评价灵活框架 2506.09081v2
  • 195 06-17 Improving LoRA with Variational Learning Verbesserung der LoRA durch variables Lernen 改进LORA, 提高不同学习水平 2506.14280v1
  • 196 06-17 Surprise Calibration for Better In-Context Learning Überraschende Kalibrierung für besseres In-Context-Lernen 为更好的内文学习校准惊喜校准 2506.12796v2
  • 197 06-17 What do Large Language Models Say About Animals? Investigating Risks of Animal Harm in Generated Text Was sagen große Sprachmodelle über Tiere? Untersuchung der Risiken von Tierschädlingen im Generierten Text 大语言模型对动物有什么看法? 调查产生文字中的动物危害风险 2503.04804v4
  • 198 06-17 Position: Editing Large Language Models Poses Serious Safety Risks Position: Bearbeiten von großen Sprachmodellen stellt ernste Sicherheitsrisiken dar 职位:编辑大语言模型 2502.02958v3
  • 199 06-17 Re-Initialization Token Learning for Tool-Augmented Large Language Models Re-Initialisierung Token-Lernen für Tool-Augmented große Sprachmodelle 工具增强型大语言模型的重新启动 Tok 学习 2506.14248v1
  • 200 06-17 Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs Verstärktes Lernen mit überprüfbaren Belohnungen implizit fördert korrekte Vernunft in LLMs 利用可核实的奖励措施加强学习,在基础LLM中鼓励正确说明理由 2506.14245v1
  • 201 06-17 GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents GuideBench: Benchmarking Domain-orientierte Leitlinie für LLM-Agenten folgen 指南:为LLM代理商制定基准确定以域为基准的准则 2505.11368v2
  • 202 06-17 A Multi-Expert Structural-Semantic Hybrid Framework for Unveiling Historical Patterns in Temporal Knowledge Graphs Ein multi-Experte strukturell-semantischer Hybridrahmen zur Enthüllung historischer Muster in zeitlichen Wissensgraphen ” 时间知识图中历史不变模式 “ 的多专家结构-地中海混合框架 2506.14235v1
  • 203 06-17 Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team Xolver: Multi-Agent Reasoning mit ganzheitlichem Erfahrungslernen wie ein Olympia-Team Xolver:多机构理论与整体经验学习就像奥林匹克队一样 2506.14234v1
  • 204 06-17 Effect of Selection Format on LLM Performance Auswirkungen des Auswahlformats auf die LLM-Leistung 选择格式对LLM性能的影响 2503.06926v2
  • 205 06-17 Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Scaling Computer-Use Grounding über Benutzeroberfläche Zersetzung und Synthese 通过用户界面分解和合成进行计算机使用定位 2505.13227v2
  • 206 06-17 Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models Modality-Aware Neuron Pruning für das Lernen in multimodalen großen Sprachmodellen 多式联运大语言模型中不学习模式-Aware中度中枢 2502.15910v2
  • 207 06-17 Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription Fretting-Transformer: Encoder-Decoder-Modell für MIDI in Tabulatur-Transkription Fretting- Transtrads: MIDI 调制标签的编码器-解码器模型 2506.14223v1
  • 208 06-17 Chaining Event Spans for Temporal Relation Grounding Verkettung von Event-Spannen für die zeitliche Beziehungserdung 用于时间关系基准的连锁事件 Spans 系统 2506.14213v1
  • 209 06-17 Explainable Detection of Implicit Influential Patterns in Conversations via Data Augmentation Erklärbare Erkennung von impliziten Einflussmustern in Gesprächen durch Datenvergrößerung 通过数据增加在对话中可解释地探测到的隐性内流模式 2506.14211v1
  • 210 06-17 LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification LongSpec: Lang-Kontext verlustfreies spekulatives Decodieren mit effizienter Entwurfs- und Verifizierung 长方形:长端无损失的假设值与高效率的起草和核查 2502.17421v2
  • 211 06-17 CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation CausalDiffTab: Mixed-Type Causal-Aware Diffusion für tabellarische Datengenerierung CausalDiffTab: 用于制表数据生成的混合- 混合- Type Causal- Aware 扩散 2506.14206v1
  • 212 06-17 AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents AgentSynth: Skalierbare Task-Generierung für generalistische Computer-Use-Agenten AnySynth:通用计算机使用代理器可缩放任务生成 2506.14205v1
  • 213 06-17 Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios Verbesserung der praktischen Aspekte von End-to-End-Multitalker-Spracherkennung für Online- und Offline-Szenarien 改进在网上和离线情景下承认端到端多嘴多语种言论的 实际方面 2506.14204v1
  • 214 06-17 Intended Target Identification for Anomia Patients with Gradient-based Selective Augmentation Vorgesehene Zielidentifizierung für Anomie-Patienten mit gradientbasierter selektiver Augmentation 逐步增加选择性增加的阿诺米亚病人预期目标识别 2506.14203v1
  • 215 06-17 CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement CAPTURE: Context-Aware Prompt Injection Testing und Robustheitsverbesserung CAPTURE: 上下文软件快速注射测试和强力增强 2505.12368v2
  • 216 06-17 Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation Hanfu-Bench: Ein multimodaler Benchmark für interkulturelles Verständnis und Transkreation Hanfu-Bunch:跨时文化理解和交流的多模式基准 2506.01565v2
  • 217 06-17 ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations ELI-Warum: Bewertung der pädagogischen Nützlichkeit von Sprachmodellerklärungen ELI- Why:评价语言模式解释的教学用途 2506.14200v1
  • 218 06-17 Geometric Signatures of Compositionality Across a Language Model’s Lifetime Geometrische Signaturen der Kompositionalität über die Lebenszeit eines Sprachmodells hinweg 语文模式中各语文模式的 终身组成特征的几何签名 2410.01444v5
  • 219 06-17 Counterfactual-Consistency Prompting for Relative Temporal Understanding in Large Language Models Counterfactual-Consistency Prompting für relatives zeitliches Verständnis in großen Sprachmodellen 在大语言模式中促进相对时间理解的反事实一致 2502.11425v2
  • 220 06-17 MAS-LitEval : Multi-Agent System for Literary Translation Quality Assessment MAS-LitEval : Multi-Agenten-System für die Bewertung der Qualität von Übersetzungen MAS-LitEval:文学翻译质量评估多机构系统 2506.14199v1
  • 221 06-17 Compression of enumerations and gain Kompression von Aufzählungen und Gewinn 压缩查点和收益 2304.03030v2
  • 222 06-17 Reward Shaping to Mitigate Reward Hacking in RLHF Reward Shaping, um Belohnung Hacking in RLHF Mititate 在RLHF中拆分至Mipigget Reward的拆分 2502.18770v3
  • 223 06-17 AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR AsyncSwitch: Asynchrone Text-Speech-Anpassung für Code-Switched ASR Async开关: 用于代码开关 ASR 的非同步文本语音适应 2506.14190v1
  • 224 06-17 EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG EEG2TEXT-CN: Eine explorative Studie der offenen Vokabulären chinesischen Text-EEG-Ausrichtung über großsprachliches Modell und kontrastives Lernen auf ChinesischEEG EEG2TEXT-CN:通过大语言模式和中经语言差异性学习对中文文本与EEEG校对开放词汇的探索性研究 2506.00854v2
  • 225 06-17 Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching Kosteneffiziente Bedienung von LLM-Agenten über Test-Zeitplan-Caching 通过试验-时间计划缓冲,以成本效率高的方式服务LLM代理物 2506.14852v1
  • 226 06-17 Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore’s languages Können wir ASR-Systeme auf Code-Schalter ohne echte Code-Schalter-Daten trainieren? Fallstudie für Singapurs Sprachen 我们能否在没有实际代码开关数据的情况下,对 ASR 系统进行代码开关培训?新加坡语言案例研究 2506.14177v1
  • 227 06-17 MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning MMedAgent-RL: Optimierung der Multi-Agenten-Kollaboration für multimodale medizinische Vernunft MMedAgender-RL:优化多机构协作促进多式联运医疗理由 2506.00555v2
  • 228 06-17 MIST: Towards Multi-dimensional Implicit Bias and Stereotype Evaluation of LLMs via Theory of Mind MIST: Auf dem Weg zu multidimensionalen Impliziten Bias und Stereotype Evaluation von LLMs über die Theorie des Geistes MIST:通过思想理论对LLMs进行多维隐隐含的偏见和定型评价 2506.14161v1
  • 229 06-17 S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models S$^4$C: Spekulative Probenahme mit syntaktischer und semantischer Kohärenz zur effizienten Schlussfolgerung großer Sprachmodelle S$4美元C:为高效推导大语言模型的协同性和语义一致性进行投机抽样 2506.14158v1
  • 230 06-17 DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization DCRM: Ein Heuristisches zur Messung der Antwortpaarqualität in der Präferenz-Optimierung DCRM:在首选最佳化中衡量对等反应质量的优度 2506.14157v1
  • 231 06-17 OWLViz: An Open-World Benchmark for Visual Question Answering OWLViz: Ein Open-World-Benchmark für visuelle Fragen OWLViz:视觉问答的开放世界基准 2503.07631v2
  • 232 06-17 Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models Drücke die Performance der synthetischen Spracherkennung mit Kolmogorov-Arnold-Netzwerken und selbstüberwachten Lernmodellen 推动利用科尔莫戈罗夫-阿诺尔德网络和自控学习模式进行合成语音探测的性能 2506.14153v1
  • 233 06-17 REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning 实际检索: 数学理由的回收增量精液预言 2505.20613v2
  • 234 06-17 Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment Akustische Streuung KI für nicht-invasive Objektklassifikationen: Eine Fallstudie zur Haarbewertung 用于非侵入性物体分类的非侵入性物体分类的声波散射AI:关于头发评估的个案研究 2506.14148v1
  • 235 06-17 RadFabric: Agentic AI System with Reasoning Capability for Radiology RadFabric: Agentisches KI-System mit vernünftiger Kapazität für die Radiologie RadFBRIC:放射学合理能力A.A.A.系统 2506.14142v1
  • 236 06-17 Personalizing Student-Agent Interactions Using Log-Contextualized Retrieval Augmented Generation (RAG) Personalisierung von studentisch-agenten Interaktionen mittels Log-Contextualized Retrieval Augmented Generation (RAG) 利用日志-知识检索增强型一代(RAG)实现学生-代理人个性化互动 2505.17238v2
  • 237 06-17 AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning AgentCPM-GUI: Mobile-Use-Agenten mit Verstärkungs-Fine-Tuning bauen Agent CPM-GUI: 制造具有加固精度的移动用途制剂 2506.01391v2
  • 238 06-17 Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks Bewertung von Konsistenz und Reproduzierbarkeit in den Outputs von großen Sprachmodellen: Evidence Across Diverse Finance and Accounting Tasks 评估大语言模式产出的一致性和可复制性:不同财务和会计任务之间的证据 2503.16974v3
  • 239 06-17 Sampling from Your Language Model One Byte at a Time Proben aus Ihrem Sprachmodell ein Byte zu einer Zeit 一次抽取您语言模式一字节的样本 2506.14123v1
  • 240 06-17 Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression Batch-Max: Höherer LLM-Durchsatz mit größeren Batch-Größen und KV Cache-Kompression 批量最大:使用大批量大小和 KV缓存压缩的高级 LLM 输送量 2412.05693v2
  • 241 06-17 Essential-Web v1.0: 24T tokens of organized web data Essential-Web v1.0: 24T Token von organisierten Web-Daten 基本Web v1.0: 24个有组织网络数据标记 2506.14111v1
  • 242 06-17 SAE-V: Interpreting Multimodal Models for Enhanced Alignment SAE-V: Verdolmetschen multimodaler Modelle für eine verbesserte Ausrichtung SAE-V: 解释增强协调的多模式模型 2502.17514v2
  • 243 06-17 Innovating China’s Intangible Cultural Heritage with DeepSeek + MidJourney: The Case of Yangliuqing theme Woodblock Prints Innovieren Chinas immaterielles Kulturerbe mit DeepSeek + MidJourney: Der Fall des Yangliuqing-Themas Woodblock Prints 以深色+中途:杨柳庆主题案例 2506.14104v1
  • 244 06-17 Abstract Meaning Representation for Hospital Discharge Summarization Abstract Bedeutung Vertretung für Krankenhaus Entladung Zusammenfassung B. 医院免住院费摘要说明 2506.14101v1
  • 245 06-17 Enhancing Clinical Models with Pseudo Data for De-identification Verbesserung klinischer Modelle mit Pseudo-Daten zur De-Identifizierung 利用假数据加强临床模型,以进行分辨 2506.12674v2
  • 246 06-17 InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking InsertRank: LLMs können über BM25-Scores nachdenken, um Listwise zu verbessern. 插入Rank:LLMs可以比 BB25 分数解释 BM25 分数来改进列表排序 2506.14086v1
  • 247 06-16 (1) Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data Automatische Extraktion von Clausal Embedding basierend auf großformatigen englischen Textdaten 根据大比例英文文本数据自动提取 2506.14064v1
  • 248 06-16 Ace-CEFR – A Dataset for Automated Evaluation of the Linguistic Difficulty of Conversational Texts for LLM Applications Ace-CEFR – Ein Datensatz für die automatisierte Auswertung der sprachlichen Schwierigkeit von Konversationstexten für LLM-Anwendungen Ace-CEFR – – 用于自动评价用于LLM应用的对读文本语言难度的数据集 2506.14046v1
  • 249 06-16 An Interdisciplinary Review of Commonsense Reasoning and Intent Detection Eine interdisziplinäre Überprüfung von Commonsense-Vernunft und Intent Detection 对常见理由和意图探测的跨学科审查 2506.14040v1
  • 250 06-16 Beyond Browsing: API-Based Web Agents Jenseits von Browsing: API-basierte Web-Agenten 超出浏览范围: API 网络代理 2410.16464v3
  • 251 06-16 MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation MultiFinBen: Ein multilingualer, multimodaler und problemorientierter Benchmark für die finanzielle LLM-Bewertung MultiFinBen: 财务LLM评价的多种语言、多种模式和困难软件基准 2506.14028v1
  • 252 06-16 Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text Lost in the Mix: LLM-Verständnis von Code-Switched Text bewerten 在混合中丢失:评估LLM对代码转换文本的理解 2506.14012v1
  • 253 06-16 Towards Geo-Culturally Grounded LLM Generations Auf dem Weg zu geokulturellen LLM-Generationen 走向地球环基LLM 代 2502.13497v3
  • 254 06-16 MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification MultiMatch: Multihead-Konsistenzregularisierung passend zur semi-überwachten Textklassifikation 多匹配: 用于半有效文本分类的多标题一致性规则化 2506.07801v2
  • 255 06-16 ETM: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models ETM: Moderne Einblicke in die Perspektive der Text-zu-SQL-Bewertung im Zeitalter großer Sprachmodelle ETM:从现代视角看待大语言模式时代的文本到SQL评价 2407.07313v4
  • 256 06-16 Are manual annotations necessary for statutory interpretations retrieval? Sind manuelle Anmerkungen für die Rückgewinnung gesetzlicher Interpretationen erforderlich? 法定解释检索是否需要人工说明? 2506.13965v1
  • 257 06-16 ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection ASMR: Augmenting Life Szenario mit großen Generativen Modellen für die Robotic Action Reflection ASMR:使用大型机器人行动反射生成模型扩大寿命设想 2506.13956v1
  • 258 06-16 LongCodeBench: Evaluating Coding LLMs at 1M Context Windows LongCodeBench: Auswertung von Coding LLMs bei 1M Context Windows LongCodeBench: 在 1M 上下文窗口评价编码LLMs 2505.07897v2
  • 259 06-16 Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models Roboflow100-VL: Ein Multi-Domain-Objekterkennungs-Benchmark für Vision-Language-Modelle 机器人流100-VL:愿景-语言模型多功能物体探测基准 2505.20612v2
  • 260 06-16 Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models Adaptive Anleitung beschleunigt die Stärkung des Lernens von Vernunftmodellen 适应性指导加速加速强化理性模型学习 2506.13923v1
  • 261 06-16 Discrete Audio Tokens: More Than a Survey! Diskrete Audio Tokens: Mehr als nur eine Umfrage! 分辨音频代号: 多于调查 ! 2506.10274v2
  • 262 06-16 EuroLLM-9B: Technical Report EuroLLM-9B: Technischer Bericht 欧洲LLLM-9B:技术报告 2506.04079v2
  • 263 06-16 Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations Alignment Quality Index (AQI) : Jenseits von Ablehnungen: AQI als Intrinsische Alignment-Diagnose über Latent Geometrie, Clusterdivergenz und Layer weise Gepoolte Darstellungen 对齐质量指数(AQI) : 超越拒绝: AQI 是一个通过深层几何、群集差异和图层智慧集合表达式进行的原始对齐诊断分析 2506.13901v1
  • 264 06-16 EmoNews: A Spoken Dialogue System for Expressive News Conversations EmoNews: Ein gesprochenes Dialogsystem für expressive Nachrichtengespräche Emohews:一个表达性新闻对话的口号对话系统 2506.13894v1
  • 265 06-16 Conformal Linguistic Calibration: Trading-off between Factuality and Specificity Konforme Linguistische Kalibrierung: Trading-off zwischen Faktizität und Spezifität 非正式语文校准:事实与具体性之间的交易 2502.19110v3
  • 266 06-16 VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training VL-GenRM: Verbesserung der Vision-Sprachen-Überprüfung durch Vision-Experten und iteratives Training VL-GenRM:通过愿景专家和迭接培训加强愿景-语言核查 2506.13888v1
  • 267 06-16 Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles Untersuchung des Zusammenspiels von sprachlicher und mathematischer Argumentation in Sprachmodellen mittels mehrsprachiger Zahlenrätsel 使用多语种数字拼图调查语言模型的语言和数学推理的相互作用 2506.13886v1
  • 268 06-16 Steering LLM Thinking with Budget Guidance Steuerung des LLM-Denkens mit Budget Guidance 以预算指导来思考预算指导 2506.13752v1
  • 269 06-16 LTRR: Learning To Rank Retrievers for LLMs LTRR: Learning To Rank Retriever für LLMs LTRR: 学习为LLMM公司重新获得排名 2506.13743v1
  • 270 06-16 Instruction Following by Boosting Attention of Large Language Models Anleitung, indem man die Aufmerksamkeit großer Sprachmodelle erhöht 之后的教学,培养对大语言模式的注意 2506.13734v1
  • 271 06-16 Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs Zuweisungsgeführtes Pruning für Kompression, Circuit Discovery und gezielte Korrektur in LLMs 压缩、电路发现和定点校正 2506.13727v1
  • 272 06-16 OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation OPeRA: Ein Datensatz von Beobachtung, Persona, Ratationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation OPERA: 人类在线购物行为模拟观察、人、理由和评估LMLLMs的数据集 2506.05606v2
  • 273 06-16 Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems Ausbalancieren von Wissenslieferungen und emotionalem Komfort in Gesundheitswesensgesprächssystemen 平衡知识的提供和卫生保健沟通系统中的情感舒适 2506.13692v1
  • 274 06-16 Efficient Inference for Large Reasoning Models: A Survey Effiziente Schlussfolgerung für große Vernunftmodelle: Eine Umfrage 大型理由模型有效推断:调查 2503.23077v2
  • 275 06-16 How Much is Enough? The Diminishing Returns of Tokenization Training Data Wie viel ist genug? Die Diminishing Rückgaben von Tokenization Trainingsdaten 有多少足够? 2502.20273v4
  • 276 06-16 Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models Abkehr von der Hitze: Eine kritische Analyse der Min-p-Probenahme in Sprachmodellen 降低热量:对语言模型中中点抽样的批判性分析 2506.13681v1
  • 277 06-16 Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models Vereinheitlichung einheitlicher und Binär-kodierender Quantisierungen für eine präzise Komprimierung großer Sprachmodelle 精确压缩大语言模型精确压缩的统一和二元编码统一和二元编码的量化 2506.03781v2
  • 278 06-16 Improving Clinical Note Generation from Complex Doctor-Patient Conversation Verbesserung der klinischen Notengenerierung aus komplexen Arzt-Patient-Konversationen 从复杂的医生与病人之间的复杂对话中改进临床笔记制作 2408.14568v2
  • 279 06-16 On Synthesizing Data for Context Attribution in Question Answering Über die Synthese von Daten für Kontextzuweisungen in Fragenantworten 问题解答中内容归属数据合成 2504.05317v2
  • 280 06-16 Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model Stream-Omni: Gleichzeitige multimodale Interaktionen mit großem Sprach-Vision-Sprachmodell 流流-奥米尼:与大语言-视觉-语音模型同时使用的多模式互动 2506.13642v1
  • 281 06-16 EvolvTrip: Enhancing Literary Character Understanding with Temporal Theory-of-Mind Graphs EvolvTrip: Erweitern des literarischen Charakterverständnisses mit zeitlichen Theorie-von-Mind Graphen EvlvTrip:用时光理论图增强对文学特征的了解 2506.13641v1
  • 282 06-16 An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability Eine empirische Studie von LLM-as-a-Richter: Wie Design Entscheidungen Auswirkungen Bewertung Zuverlässigkeit 法学硕士作为法官的经验研究:设计选择如何影响评价可靠性 2506.13639v1
  • 283 06-16 A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data Ein selbstrefinierendes Framework zur Verbesserung der ASR-Nutzung von TTS-Synthesedaten 利用TTS综合数据加强ASR的自订框架 2506.11130v2
  • 284 06-16 A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy Ein strukturierter Bangla-Datensatz von Krankheits-Symptome-Verbänden zur Verbesserung der Diagnosegenauigkeit 改善诊断准确性疾病 – – 症状协会结构化孟加拉数据集 2506.13610v1
  • 285 06-16 An Investigation into Value Misalignment in LLM-Generated Texts for Cultural Heritage Eine Untersuchung der Wertunausrichtung in LLM-generierten Texten für kulturelles Erbe 调查文化遗产LLM-LLM-发光文字中的价值失调问题 2501.02039v2
  • 286 06-16 Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models? Erlebnishafte semantische Information und Gehirnausrichtung: Sind multimodale Modelle besser als Sprachmodelle? 实际的语义信息和脑力调整:多模式模式是否比语言模式更好? 2504.00942v2
  • 287 06-16 Idiosyncrasies in Large Language Models Eigenheiten in großen Sprachmodellen 大语言模式的特派专家 2502.12150v2
  • 288 06-16 CAMS: A CityGPT-Powered Agentic Framework for Urban Human Mobility Simulation CAMS: CityGPT-Powered Agentic Framework für die Simulation urbaner menschlicher Mobilität CAMS: 城市GPT授权的城市人类流动模拟活动代理框架 2506.13599v1
  • 289 06-16 Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems Qwen vs. Gemma Integration mit Whisper: Eine vergleichende Studie in mehrsprachigen Sprach-LLM-Systemen Quwen诉Gemma 与低语融合:多语种语言LLLM系统比较研究 2506.13596v1
  • 290 06-16 MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention MiniMax-M1: Skalierungstestzeit effizient berechnen mit Blitz Achtung Minimax-M1: 以闪电注意有效计算缩放测试时间 2506.13585v1
  • 291 06-16 Flexible-length Text Infilling for Discrete Diffusion Models Flexible Text-Infilling für diskrete Diffusionsmodelle 为分立扩散模型填充文本 2506.13579v1
  • 292 06-16 Characterizing Linguistic Shifts in Croatian News via Diachronic Word Embeddings Sprachliche Verschiebungen in Kroatischen Nachrichten über diachronische Wort-Embeddings charakterisieren 《克罗地亚新闻》通过旧时单词嵌入式将语言变化定性为克罗地亚新闻 2506.13569v1
  • 293 06-16 Understand the Implication: Learning to Think for Pragmatic Understanding Die Implikation verstehen: Lernen, für Pragmatisches Verständnis zu denken 理解影响:学会思考实用理解 2506.13559v1
  • 294 06-16 EmoDynamiX: Emotional Support Dialogue Strategy Prediction by Modelling MiXed Emotions and Discourse Dynamics EmoDynamiX: Emotionale Unterstützung Dialog Strategie Vorhersage durch Modellierung von MiXed Emotionen und Diskurs Dynamik EmoDynamiX:通过模拟消化情感和话题动态预测情感支持对话战略 2408.08782v5
  • 295 06-16 Towards a Cascaded LLM Framework for Cost-effective Human-AI Decision-Making Auf dem Weg zu einem kaskadenten LLM-Rahmen für kosteneffiziente Entscheidungsfindung zwischen Mensch und KI 建立具有成本效益的人类-AI决策框架 2506.11887v2
  • 296 06-16 Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization Mischung aus gewichtsgeteilter Heterogener Gruppe Aufmerksamkeit Experten für dynamische Token-weise KV-Optimierung KV 优化动态调制调效 KV 优化小组注意问题专家 2506.13541v1
  • 297 06-16 Affordable AI Assistants with Knowledge Graph of Thoughts Erschwingliche KI-Assistenten mit Wissensgrafik der Gedanken 具有知识思想知识图的负担得起的AI助理 2504.02670v3
  • 298 06-16 TensorSLM: Energy-efficient Embedding Compression of Sub-billion Parameter Language Models on Low-end Devices TensorSLM: Energieeffiziente Einbettung Komprimierung von Submilliarden-Parameter-Sprachmodellen auf Low-End-Geräten Tensor SLM:低端设备上10亿分数以下低端设备语言模型的节能嵌入压缩 2506.13514v1
  • 299 06-16 JEPA4Rec: Learning Effective Language Representations for Sequential Recommendation via Joint Embedding Predictive Architecture JEPA4Rec: Effektive Sprachrepräsentanzen für sequentielle Empfehlung durch gemeinsame Einbettung vorausschauender Architektur lernen JEPA4Rec: 通过联合嵌入的预测架构,学习有效的语言代表,以提出序列建议 2504.10512v2
  • 300 06-16 K/DA: Automated Data Generation Pipeline for Detoxifying Implicitly Offensive Language in Korean K/DA: Automatisierte Datengenerierungspipeline für die Entgiftung implizit anstößiger Sprache auf Koreanisch K/DA:用韩语解毒的自动数据生成管道 2506.13513v1
  • 301 06-16 BOW: Bottlenecked Next Word Exploration BOW: Engagierte nächste Wort-Exploration BOW: 下个单词探索的瓶颈 2506.13502v1
  • 302 06-16 TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs TurBLiMP: Ein türkischer Benchmark für linguistische Minimal Pairs TurBLIMP:土耳其语言最小对等基准 2506.13487v1
  • 303 06-16 Position: Pause Recycling LoRAs and Prioritize Mechanisms to Uncover Limits and Effectiveness Position: Recycling von LoRAs aushalten und Mechanismen priorisieren, um Grenzen und Wirksamkeit aufzudecken 立场:暂停再循环回收 LoRAs和优先机制 2506.13479v1
  • 304 06-16 Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning Sprachagenten für hypothesisgetriebene klinische Entscheidungsfindung mit Verstärkungslernen 与强化学习一起进行假冒主义驱动临床决策的语言代理 2506.13474v1
  • 305 06-16 When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text Wenn die Detektion fehlschlägt: Die Macht von fein-getönten Modellen, um menschenähnliche Social Media-Texte zu erzeugen 当检测失败时:制作像人类一样的社会媒体文字的精选模型的力量 2506.09975v2
  • 306 06-16 Abstract, Align, Predict: Zero-Shot Stance Detection via Cognitive Inductive Reasoning Abstract, Align, Predict: Zero-Shot Stance Detection über Kognitive Induktive Reasoning 摘要、对称、预测:通过认知感性诱导理由探测零热静态 2506.13470v1
  • 307 06-16 An Interdisciplinary Approach to Human-Centered Machine Translation Ein interdisziplinärer Ansatz zur Mensch-zentrierten maschinellen Übersetzung 以多学科方式处理以人为中心的机器翻译 2506.13468v1
  • 308 06-16 Enhancing Omics Cohort Discovery for Research on Neurodegeneration through Ontology-Augmented Embedding Models Enhancement Omics Cohort Discovery for Research on Neurodegeneration by Ontology-Augmented Embedding Models 通过本体学强化嵌入模型研究神经脱底生成发现 2506.13467v1
  • 309 06-16 Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study Enthüllen des Lerngedankens von Sprachmodellen: Ein kognitiver Rahmen und empirische Studie 统一语言模式学习思维:认知框架和经验研究 2506.13464v1
  • 310 06-16 Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images Nutzung von Vision-Sprachen Pre-Training für die Anerkennung menschlicher Aktivität in Still Images 利用视觉-语言前培训,在静态图像中确认人类活动 2506.13458v1
  • 311 06-16 A Neural Model for Word Repetition Ein neurales Modell für Wortwiederholung WW 重复的神经模型 2506.13450v1
  • 312 06-16 From Euler to AI: Unifying Formulas for Mathematical Constants Von Euler zu AI: Formeln für mathematische Konstanten vereinheitlichen 从 Euler 到 AI: 数学常量的统一公式 2502.17533v2
  • 313 06-16 RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis RealHiTBench: Ein umfassender realistischer Hierarchischer Tabellen-Benchmark für die Bewertung der LLM-basierten Tabellenanalyse RealHiTBench:评估基于LLM的表分析的综合现实等级表基准 2506.13405v1
  • 314 06-16 Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR Bi-direktionale Kontext-verbesserte Sprache Große Sprachmodelle für mehrsprachige Konversations-ASR 多语言对话的ASR双向双向背景强化语言大语言模型 2506.13396v1
  • 315 06-16 Regular-pattern-sensitive CRFs for Distant Label Interactions Regelmäßig-Muster-sensible CRFs für entfernte Label-Interaktionen 用于不同标签互动的常规模式敏感通用报告格式 2411.12484v2
  • 316 06-16 Decompositional Reasoning for Graph Retrieval with Large Language Models Zersetzende Begründung für Graph Retrieval mit großen Sprachmodellen 使用大语言模型的图表检索分解理由 2506.13380v1
  • 317 06-16 CMCTS: A Constrained Monte Carlo Tree Search Framework for Mathematical Reasoning in Large Language Model CMCTS: Ein eingeschränktes Monte Carlo Baum-Suchrahmen für mathematische Vernunft im großen Sprachmodell CMCTS: 限制的蒙特卡洛大语言数学理由搜索框架 2502.11169v2
  • 318 06-16 Efficient Medical VIE via Reinforcement Learning Effizientes medizinisches VIE durch Verstärkungslernen 通过强化学习提高医疗VIE效率 2506.13363v1
  • 319 06-16 Truth Knows No Language: Evaluating Truthfulness Beyond English Wahrheit kennt keine Sprache: Bewertung von Wahrhaftigkeit jenseits des Englischen 真理不懂语言:评价英语以外的真相 2502.09387v3
  • 320 06-16 How Much Can We Forget about Data Contamination? Wie viel können wir über Datenkontamination vergessen? 我们怎能忘记数据污染呢? 2410.03249v4
  • 321 06-16 StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns StoryBench: Ein dynamischer Benchmark für die Bewertung von Langzeitspeichern mit Multiturns 故事区:多转评价长期记忆的动态基准 2506.13356v1
  • 322 06-16 Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks Direct Reasoning Optimization: LLMs können ihre eigene Begründung für offene Aufgaben belohnen und verfeinern 直接理由优化:LLMs Can Can reward and refine 自己为不限名额任务提供的理由 2506.13351v1
  • 323 06-16 Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers Prüfen der Prüfer: Enthüllen von Pitfalls und Potenzialen in Fact Prüfern 核查验证者:事实验证者中未倒置的空洞和潜力 2506.13342v1
  • 324 06-16 NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 NTU Speechlab LLM-basiertes Mehrsprachiges ASR-System für MLC-SLM Challenge 2025 NTU Spearelab LLM-为2025年刚果解放运动-解运间对话挑战使用多种语言的ASR系统 2506.13339v1
  • 325 06-16 The Remarkable Robustness of LLMs: Stages of Inference? Die bemerkenswerte Robustheit von LLMs: Stufen der Schlussfolgerung? LLMS的显著威力:推论阶段? 2406.19384v3
  • 326 06-16 EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization EAQuant: Verbesserung der Post-Training-Quantisierung für MoE-Modelle durch Experten-Aware-Optimierung EAQuant:通过专家-软件优化,加强培训后对教育部模型的量化 2506.13329v1
  • 327 06-16 Document-Level Tabular Numerical Cross-Checking: A Coarse-to-Fine Approach Dokument-Ebene Tabuläre numerische Cross-Checking: Ein grob-zu-Feine-Ansatz 文件级别表制盘交叉盘查:粗对法方法 2506.13328v1
  • 328 06-16 Large Language Models as ‘Hidden Persuaders’: Fake Product Reviews are Indistinguishable to Humans and Machines Große Sprachmodelle als ‘Hidden Persuaders’: Fake Produktbewertungen sind für Menschen und Maschinen ununterscheidbar 大语言模型作为“ Hidden Persuaders ” : 假产品审查对人类和机器是无法区分的 2506.13313v1
  • 329 06-16 Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs Abmilderung des Sicherheitsabfalls bei der Editing-basierten Hintertürinjektion auf LLMs 减轻基于编辑的LLMLM后门喷射中安全回落的安全后退 2506.13285v1
  • 330 06-16 AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy AceReason-Nemotron 1.1: Mathematische und Code-Reasonierung durch SFT und RL-Synergie AceReson-Nemotron 1.1:通过SFT和RL协同推进数学和代码学 2506.13284v1
  • 331 06-16 EffiCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning EffiCoder: Codegenerierung in großen Sprachmodellen durch Effizienz-Bewusst Feinabstimmung verbessern Effi Coder:通过效率软件微调加强大语言模式的代码生成 2410.10209v4
  • 332 06-16 AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining AdaLRS: Loss-Guided Adaptive Learning Rate Suche nach effizientem Foundation Model Pretraining AdaLRS: 为高效基础基础示范培训前而寻找学习率 2506.13274v1
  • 333 06-16 Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning LLMs besser machen Viele-zu-Viele Sprach-zu-Text-Übersetzer mit Curriculum-Lernen 使LLM LM 更好地使许多到许多语音到文字翻译翻译与课程学习 2409.19510v2
  • 334 06-16 Distinct Computations Emerge From Compositional Curricula in In-Context Learning Unterschiedliche Berechnungen entstehen aus kompositorischen Lehrplänen im In-Context-Lernen 内文学习中组成课程产生的特殊计算 2506.13253v1
  • 335 06-16 G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems G-Memory: Hierarchischer Speicher für Multi-Agent-Systeme G-记忆:为多机构系统追踪等级记忆 2506.07398v2
  • 336 06-16 IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation IGD: Token Decisiveness Modellierung über Informationsgewinn in LLMs für Personalisierte Empfehlung IGD: 个人化建议通过LLM LLM 信息收益进行当量决策模型 2506.13229v1
  • 337 06-16 Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law Capability Salience Vector: Feinkörnige Ausrichtung von Verlusten und Fähigkeiten für Downstream Task Scaling Law 下游任务缩放法损失和能力精确比对 2506.13216v1
  • 338 06-16 Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models Thought Crime: Hintertüren und Emergent-Missausrichtung in vernünftigen Modellen 思想犯罪:后门和合理理由模型中新出现的不协调现象 2506.13206v1
  • 339 06-16 Do Music Preferences Reflect Cultural Values? A Cross-National Analysis Using Music Embedding and World Values Survey Reflektieren Musikpräferenzen kulturelle Werte? Eine länderübergreifende Analyse mit Musikeinbettung und World Values Survey 音乐优惠是否反映文化价值? 利用音乐嵌入和世界价值调查进行的跨国家分析 2506.13199v1
  • 340 06-16 Breaking Thought Patterns: A Multi-Dimensional Reasoning Framework for LLMs Breaking Thought Patterns: Multi-Dimensional Reasoning Framework für LLMs 打破思维模式:LLMM的多重解释理由框架 2506.13192v1
  • 341 06-16 Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis Leveraging LLM und selbstüberwachte Trainingsmodelle für die Spracherkennung in chinesischen Dialekten: Eine vergleichende Analyse 利用LLM和中国语语音识别自驾培训模式:比较分析 2505.21138v2
  • 342 06-16 SPOT: Bridging Natural Language and Geospatial Search for Investigative Journalists SPOT: Natürliche Sprache und Geospatiale Suche nach Untersuchungsjournalisten SPOT: 连接自然语言和地理空间搜索,供调查记者使用 2506.13188v1
  • 343 06-16 Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence Dynamische kontextorientierte Zersetzung für Task-aware Low-rank-Anpassung mit weniger vergessener und schnellerer Konvergenz 适应任务意识低级别适应的动态、以环境为导向的分化,减少遗忘和更快的趋同 2506.13187v1
  • 344 06-16 Align-then-Unlearn: Embedding Alignment for LLM Unlearning Align-then-Unlearn: Einbettung für LLM-Unlearning Aleign- or- unlearn: LLM 重新学习的嵌入对齐 2506.13181v1
  • 345 06-16 Fast-and-Frugal Text-Graph Transformers are Effective Link Predictors Fast-and-Frugal Text-Graph Transformer sind effektive Link Predictors 快速和节节用文字格变形器是有效的链接预测器 2408.06778v4
  • 346 06-16 Enhancing Large Language Models with Reliable Knowledge Graphs Erweiterung großer Sprachmodelle mit zuverlässigen Wissensgraphen 加强具有可靠知识图集的大型语言模型 2506.13178v1
  • 347 06-16 Team Anotheroption at SemEval-2025 Task 8: Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA Team Eine weitere Option bei SemEval-2025 Task 8: Die Lücke zwischen Open Source und Proprietary LLMs in Tabelle QA überbrücken SemEval-2025任务8:缩小表QA中开放来源和产权有限LMs之间差距的另一工作队备选办法:缩小表QA中开放来源和产权有限LMs之间的差距 2506.09657v2
  • 348 06-16 Development of the user-friendly decision aid Rule-based Evaluation and Support Tool (REST) for optimizing the resources of an information extraction task Entwicklung der benutzerfreundlichen Entscheidungshilfe Regelbasiertes Evaluierungs- und Unterstützungstool (REST) zur Optimierung der Ressourcen einer Informationsextraktion 为优化信息提取任务的资源,开发方便用户的决策援助规则评价和支助工具 2506.13177v1
  • 349 06-16 VGR: Visual Grounded Reasoning VGR: Visual Grounded Reasoning VGR: 视觉理由 2506.11991v2
  • 350 06-16 A Training-free LLM-based Approach to General Chinese Character Error Correction Ein trainingsfreier LLM-basierter Ansatz zur allgemeinen Korrektur von chinesischen Zeichenfehlern 以无培训的LLM为基础处理普通中文字符错误校正的不培训的LLM方法 2502.15266v2
  • 351 06-16 Adapting LLMs for Minimal-edit Grammatical Error Correction Anpassung von LLMs für minimal-editieren Sie Grammatical Fehlerkorrektur 适应最小编辑语法错误校正的LLMS 2506.13148v1
  • 352 06-16 CMU’s IWSLT 2025 Simultaneous Speech Translation System IWSLT 2025 gleichzeitiges Sprachübersetzungssystem der CMU CMU的IWSLT 2025年IWSLT 同步语音翻译系统 2506.13143v1
  • 353 06-16 Optimizing Temperature for Language Models with Multi-Sample Inference Temperaturoptimierung für Sprachmodelle mit Multi-Sample-Inferenz 多抽样推断语言模型的最佳最佳温度 2502.05234v2
  • 354 06-16 InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model InfiniSST: Simultane Übersetzung von ungebundener Sprache mit großem Sprachmodell InfiniSST: 用大语言模式同时翻译无约束讲话 2503.02969v2
  • 355 06-16 ZINA: Multimodal Fine-grained Hallucination Detection and Editing ZINA: Multimodale feinkörnige Halluzination Erkennung und Bearbeitung ZINA: 多种现代精精密成粒致幻药检测和编辑 2506.13130v1
  • 356 06-16 ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents ReflecTool: Auf dem Weg zu Reflektions-Aware Tool-Augmented Clinical Agents ReflecTool:走向反射软件工具增强临床药剂 2410.17657v3
  • 357 06-16 Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs Schritt-für-Schritt-Anweisungen und ein einfaches tabellarisches Ausgabeformat verbessern die Abhängigkeits-Abgleichgenauigkeit von LLMs 逐步指示和简单表格格式 改进LLMM的可靠性分析精确度 2506.09983v2
  • 358 06-16 MathFusion: Enhancing Mathematical Problem-solving of LLM through Instruction Fusion MathFusion: Verbesserung der mathematischen Problemlösung von LLM durch Instruction Fusion 数学分析:通过教学融合加强LLM的数学问题解决 2503.16212v2
  • 359 06-16 A Hybrid GA LLM Framework for Structured Task Optimization Ein hybrides GA LLM-Rahmenwerk für strukturierte Aufgabenoptimierung GA 混合LLM 结构化任务优化框架 2506.07483v2
  • 360 06-16 POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization POROver: Verbesserung der Sicherheit und Reduzierung von Überrefusal in großen Sprachmodellen mit Übergeneration und Präferenzoptimierung POROU: 提高高代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代谢最优化型大语言模型的安全性和减少过度拒绝过度 2410.12999v2
  • 361 06-16 Crime Hotspot Prediction Using Deep Graph Convolutional Networks Verbrechens-Hotspot-Vorhersage mit Deep Graph Convolutional Networks 利用深图革命网络进行犯罪热点预测 2506.13116v1
  • 362 06-16 Leveraging In-Context Learning for Language Model Agents Leveraging In-Context Learning für Sprachmodell-Agenten 为语文示范代理利用内文学习 2506.13109v1
  • 363 06-16 Scaling Laws for Upcycling Mixture-of-Experts Language Models Skalierungsgesetze für Upcycling-Mixture-of-Experts Sprachmodelle 增强骑车混合专家语言模型法 2502.03009v2
  • 364 06-16 Equitable Electronic Health Record Prediction with FAME: Fairness-Aware Multimodal Embedding Equitable Electronic Health Record Prediction mit FAME: Fairness-Aware Multimodale Einbettung 公平电子健康记录预测与FAME:公平-软件多模式嵌入 2506.13104v1
  • 365 06-16 Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs Rethinking Test-Time Scaling für medizinische KI: Modell- und Task-Aware-Strategien für LLMs und VLMs 重新思考医疗用AI:LLMM和VLMM的模型和任务-意识战略 2506.13102v1
  • 366 06-16 NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables NeedleInATable: Erforschen von Langkontext-Kapazität von großen Sprachmodellen zu langstrukturierten Tabellen 针线表:探索长结构表格中大语言模型的长文能力 2504.06560v3
  • 367 06-16 Ask Optimal Questions: Aligning Large Language Models with Retriever’s Preference in Conversation Optimale Fragen stellen: Große Sprachmodelle mit Retrievers Vorliebe im Gespräch ausrichten 问最佳问题:将大语言模型与“检索”的优先对话对象相匹配 2402.11827v2
  • 368 06-16 Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search Satori: Verstärktes Lernen mit Chain-of-Action-Thought verbessert LLM-Reasoning durch autoregressive Suche 教程:通过自动递减搜索,加强学习,通过行动链-探索加强LLM 2502.02508v3
  • 369 06-16 CHILL at SemEval-2025 Task 2: You Can’t Just Throw Entities and Hope – Make Your LLM to Get Them Right CHILL at SemEval-2025 Task 2: Man kann nicht einfach Entitäten und Hoffnung werfen – Machen Sie Ihre LLM, um sie richtig zu bekommen 在SemEval 2025任务2: 你不能仅仅抛出实体和希望– 使你的LLM得到正确的东西 2506.13070v1
  • 370 06-16 FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design FinLMM-R1: Verbesserung der finanziellen Begründung in LMM durch skalierbare Daten und Belohnungsdesign FinLMM-R1:通过可缩放数据和奖励设计,加强LMM的资金理由 2506.13066v1
  • 371 06-16 AgentCourt: Simulating Court with Adversarial Evolvable Lawyer Agents AgentCourt: Simulierung des Gerichts mit kontradiktorisch-evolvierbaren Anwaltsvertretern 法院代理:模拟法院与律师代理 2408.08089v2
  • 372 06-16 MotiveBench: How Far Are We From Human-Like Motivational Reasoning in Large Language Models? MotivBench: Wie weit sind wir von Menschen wie Motivational Reasoning in großen Sprachmodellen entfernt? 动机:在大型语言模型中,我们从人类的动机上的原因有多远? 2506.13065v1
  • 373 06-16 PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue PRISM2: Allgemeine Pathologie-KI mit klinischem Dialog entriegeln PRISM2:通过临床对话解锁多模式一般病理学AI 2506.13063v1
  • 374 06-16 Generative Representational Learning of Foundation Models for Recommendation Generatives repräsentatives Lernen von Stiftungsmodellen zur Empfehlung 产生基础基础建议模式的代言人学习 2506.11999v2
  • 375 06-16 Multipole Attention for Efficient Long Context Reasoning Mehrpolige Aufmerksamkeit für effiziente lange Kontext-Reasoning 多极关注高效长处理由 2506.13059v1
  • 376 06-16 Latent Multi-Head Attention for Small Language Models Latent Multi-Head Aufmerksamkeit für kleine Sprachmodelle 对小型语言模式的多方关注 2506.09342v2
  • 377 06-16 CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model CFBESIMIM-MM:中国金融助理多式大语言模式基准 2506.13055v1
  • 378 06-16 Stress-Testing Multimodal Foundation Models for Crystallographic Reasoning Stress-Testing Multimodale Fundamentierungsmodelle für kristallografische Reasoning 水晶理学理由多式模型 2506.13051v1
  • 379 06-16 Knowledge Graph Large Language Model (KG-LLM) for Link Prediction Wissensgrafik Großes Sprachmodell (KG-LLM) für die Link-Vorhersage 链接预测知识图大语言模型(KG-LLM) 2403.07311v9
  • 380 06-16 Upcycling Large Language Models into Mixture of Experts Upcycling von großen Sprachmodellen zur Mischung von Experten 将大语言模型再生成专家混合模式 2410.07524v2
  • 381 06-16 Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation Ermöglichung medizinischer KI-Assistenten bei der Bereitstellung durch Input-Driven Saliency Adaptation 通过投入驱动感光度适应,使在线医疗自理助理能够使用投入驱动求感光度适应 2506.11105v2
  • 382 06-16 Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models Einfach parallel gehen: Mehrsprachige Fähigkeiten großer Sprachmodelle verbessern 平行:提高大语言模式多语言能力 2506.13044v1
  • 383 06-16 An overview of domain-specific foundation model: key technologies, applications and challenges Ein Überblick über domänenspezifisches Fundamentmodell: Schlüsseltechnologien, Anwendungen und Herausforderungen 特定领域基础模型概览:关键技术、应用和挑战 2409.04267v3
  • 384 06-16 A dataset of questions on decision-theoretic reasoning in Newcomb-like problems Ein Datensatz von Fragen zur entscheidungstheoretischen Argumentation in Newcomb-ähnlichen Problemen 在类似新方格布问题中决策理论推理问题数据集 2411.10588v4
  • 385 06-16 Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation Destill CLIP (DCLIP): Bild-Text-Retrieval durch Cross-Modal Transformer-Destillation verbessern 蒸馏 CLIP (DCLIP): 通过跨模式变异器蒸馏加强图像- 文本回收 2505.21549v4
  • 386 06-16 Task-aligned prompting improves zero-shot detection of AI-generated images by Vision-Language Models Task-aligned prompting verbessert Zero-Shot-Erkennung von KI-generierten Bildern durch Vision-Language Models 以任务与任务的调和促动方式改进视觉语言模型对AI产生的图像的零光探测 2506.11031v2
  • 387 06-16 Knowledge Graph Fusion with Large Language Models for Accurate, Explainable Manufacturing Process Planning Wissensgraphenfusion mit großen Sprachmodellen für eine genaue, erklärbare Prozessplanung in der Fertigung 与用于准确、可解释的制造过程规划的大型语言模型知识图集融合 2506.13026v1
  • 388 06-16 Edeflip: Supervised Word Translation between English and Yoruba Edeflip: Überwachte Wortübersetzung zwischen Englisch und Yoruba Edeflip: 英文和约鲁巴文翻译监督翻译 2506.13020v1
  • 389 06-16 Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus Entwirren von Codemixing in Chats: Der NUS ABC Codemixed Corpus 在聊天区拆解编码混合: NUS ABC 编码混合公司 2506.00332v2
  • 390 06-16 Evaluating how LLM annotations represent diverse views on contentious topics Bewertung, wie LLM-Annotationen unterschiedliche Ansichten zu strittigen Themen darstellen 评价LLLM说明如何代表对有争议议题的不同观点 2503.23243v2
  • 391 06-16 Missing the human touch? A computational stylometry analysis of GPT-4 translations of online Chinese literature Vermißt man die menschliche Berührung? Eine rechnerische Stylometrie Analyse von GPT-4 Übersetzungen der online chinesischen Literatur 缺少人类触碰? 对GPT-4 在线中国文学译文的计算式tytyllogy分析 2506.13013v1
  • 392 06-16 Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification Selbstregularisierung mit Sparse Autoencodern für steuerbare LLM-basierte Klassifizierung 与基于可控 LLM 的可控 LLM 分类的 Sparse 自动编码器的自调节 2502.14133v2
  • 393 06-16 Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions Sprechen Sie einfach: Beseitigen von schädlichen Jailbreaks aus LLMs mit einfachen Interaktionen 简单易言: 与简单互动的LLMLM 2502.04322v2
  • 394 06-15 (7) Large Language Models Enhanced by Plug and Play Syntactic Knowledge for Aspect-based Sentiment Analysis Große Sprachmodelle durch Plug-and-Play-Syntaktisches Wissen für aspektbasierte Sentiment-Analysen verbessert 通过插件和播放同步知识增强大语言模型,用于基于频谱的感应分析 2506.12991v1
  • 395 06-15 Efficient Neuro-Symbolic Retrieval-Augmented Generation through Adaptive Query Routing Effiziente neuro-symbolische retrieval-angereicherte Generierung durch adaptive Abfrageführung 通过适应性查询路由,高效神经-双曲回取回回源养代 2506.12981v1
  • 396 06-15 Multi-document Summarization through Multi-document Event Relation Graph Reasoning in LLMs: a case study in Framing Bias Mitigation Multi-Dokument Zusammenfassung durch Multi-Dokument Ereignisrelation Graph Reasoning in LLMs: eine Fallstudie in Framing Bias Mitigation 多文件多文件通过多文件事件关系图表概述LLMLM中的原因:关于Framing Bias减缓问题的案例研究 2506.12978v1
  • 397 06-15 Unifying Specialized Visual Encoders for Video Language Models Vereinheitlichen von spezialisierten visuellen Encodern für Video-Sprachenmodelle 视频语言模型统一专门视觉编码器 2501.01426v2
  • 398 06-15 OR-Bench: An Over-Refusal Benchmark for Large Language Models OR-Bench: Ein überwiderlegbarer Benchmark für große Sprachmodelle OR-Bench:大语言模式的过度拒绝基准 2405.20947v5
  • 399 06-15 Building, Reusing, and Generalizing Abstract Representations from Concrete Sequences Aufbau, Wiederverwertung und Verallgemeinerung abstrakter Repräsentationen aus konkreten Sequenzen 建筑、再利用和一般化来自具体序列的抽象代表 2410.21332v2
  • 400 06-15 Assessing the Role of Data Quality in Training Bilingual Language Models Bewertung der Rolle der Datenqualität in der Ausbildung zweisprachige Sprachmodelle 评估数据质量在培训双语语文模式方面的作用 2506.12966v1
  • 401 06-15 REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities REPA: Russische Fehlertypen Anmerkung zur Bewertung von Textgenerierung und Urteilsfähigkeiten REPA: 用于评价文本生成和判断能力的俄罗斯错误类型说明 2503.13102v2
  • 402 06-15 From Argumentative Text to Argument Knowledge Graph: A New Framework for Structured Argumentation Vom argumentativen Text zum argumentativen Wissensgraph: Ein neuer Rahmen für strukturierte Argumentation 从参数文字到参数知识图:结构化参数新框架 2506.00713v2
  • 403 06-15 Forecasting Time Series with LLMs via Patch-Based Prompting and Decomposition Prognosezeitreihen mit LLMs über Patch-Based Prompting und Zersetzung 通过基于补缝的提示和分解与LLMs一道预测时间序列 2506.12953v1
  • 404 06-15 HypER: Literature-grounded Hypothesis Generation and Distillation with Provenance HypER: Literaturgestützte Hypothesis-Erzeugung und Destillation mit Provenienz HYPER: 以文学为根据的假设生成和用验证法蒸馏 2506.12937v1
  • 405 06-15 CliniDial: A Naturally Occurring Multimodal Dialogue Dataset for Team Reflection in Action During Clinical Operation CliniDial: Ein natürlich vorkommender multimodaler Dialog Datensatz für Teamreflexion während der klinischen Operation CliniDial: 临床行动期间团队反思的自然操作多模式对话数据集 2506.12936v1
  • 406 06-15 Layer by Layer: Uncovering Hidden Representations in Language Models Layer by Layer: Enthüllen versteckter Darstellungen in Sprachmodellen 按图层分列的图层: 语言模型中未隐藏隐藏的表示 2502.02013v2
  • 407 06-15 SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models SoundMind: RL-incentivized Logic Reasoning for Audio-Language Models SoundMind: RL - 音频语言模型激励逻辑原因 2506.12935v1
  • 408 06-15 Rethinking Table Instruction Tuning Umdenken Tabelle Anleitung Tuning 重新思考表格指令图 2501.14693v2
  • 409 06-15 Reasoning with RAGged events: RAG-Enhanced Event Knowledge Base Construction and reasoning with proof-assistants Reasoning mit RAGged Events: RAG-erweiterte Event Knowledge Base Konstruktion und Reasoning mit Proof-Assistenten RAG-加强事件知识库建设和与证据助理的推理 2506.07042v2
  • 410 06-15 Sectoral Coupling in Linguistic State Space Sektorale Koppelung im Sprachraum des Staates 语言国家空间部门合并 2506.12927v1
  • 411 06-15 Identifying and Investigating Global News Coverage of Critical Events Such as Disasters and Terrorist Attacks Ermittlung und Untersuchung von globalen Nachrichten über kritische Ereignisse wie Katastrophen und Terroranschläge 查明和调查灾害和恐怖袭击等重大事件的全球新闻报道 2506.12925v1
  • 412 06-15 PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization PersonaFeedback: Ein groß angelegter, von Menschen kommentierter Benchmark für Personalisierung 人背人:关于个性化的大规模人文说明基准 2506.12915v1
  • 413 06-15 SciDA: Scientific Dynamic Assessor of LLMs SciDA: Wissenschaftlicher dynamischer Assessor von LLMs SciDA:LLMs科学动态评估员 2506.12909v1
  • 414 06-15 Benchmarking Rotary Position Embeddings for Automatic Speech Recognition Benchmarking von Rotary-Positions-Embeddings für automatische Spracherkennung 自动语音识别扶轮位置嵌入式 2501.06051v2
  • 415 06-15 Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification Life-Code: Zentrale Dogma-Modellierung mit Multi-Omics-Sequenz-Einheit 生命守则:以多有机序列统一为模式的中央Dogma建模 2502.07299v2
  • 416 06-15 Navigating LLM Ethics: Advancements, Challenges, and Future Directions Navigation LLM Ethik: Fortschritte, Herausforderungen und zukünftige Richtungen 管理LLM 道德:进步、挑战和未来方向 2406.18841v5
  • 417 06-15 JEBS: A Fine-grained Biomedical Lexical Simplification Task JEBS: Eine feinkörnige biomedizinische Lexikalische Vereinfachungsaufgabe JEBS: 精细的生物医学条约简化任务 2506.12898v1
  • 418 06-15 Assessing the Performance Gap Between Lexical and Semantic Models for Information Retrieval With Formulaic Legal Language Bewertung der Performancelücke zwischen Lexischen und Semantischen Modellen für die Informationswiederherstellung mit der Formulaischen Rechtssprache 评估用法律公式化语言获取信息检索的词汇和语义模型之间的绩效差距 2506.12895v1
  • 419 06-15 VideoDeepResearch: Long Video Understanding With Agentic Tool Using VideoDeepResearch: Langes Video-Verstehen mit Agentischem Werkzeug 视频深入研究:与使用代理工具的远程视频了解 2506.10821v2
  • 420 06-15 ArgHiTZ at ArchEHR-QA 2025: A Two-Step Divide and Conquer Approach to Patient Question Answering for Top Factuality ArgHitz bei ArchEHR-QA 2025: Ein zweistufiger Divide- und Conquer-Ansatz zur Beantwortung von Patientenfragen für Top-Faktizität ArchEHR-QA 2025年ArchEHR-QA 的ArgHitTZ:对患者问题回答最佳事实的双重分化和征服办法 2506.12886v1
  • 421 06-15 FlatQuant: Flatness Matters for LLM Quantization FlatQuant: Flachheitselemente für die LLM-Quantisierung 平整量:LLM量化的平整事项 2410.09426v3
  • 422 06-15 Scaling Laws For Mixed Qquantization Skalierungsgesetze für gemischte Qquantisierung 混合定量化法 2410.06722v2
  • 423 06-15 HARBOR: Exploring Persona Dynamics in Multi-Agent Competition HARBOR: Erforschen von Persona-Dynamik im Multi-Agenten-Wettbewerb 《HARBOR:在多机构竞争中探索人动态》 2502.12149v2
  • 424 06-15 QFFT, Question-Free Fine-Tuning for Adaptive Reasoning QFFT, Question-Free Fine-Tuning für adaptive Reasoning QFFT, 无问题的调整性理由的精确调整 2506.12860v1
  • 425 06-15 MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems MORTAR: Multiturn Metamorphic Testing für LLM-basierte Dialogsysteme MORTAR:以LLM为基础的对话系统的多轨变形测试 2412.15557v2
  • 426 06-15 Visual Abstract Thinking Empowers Multimodal Reasoning Visuelles Abstraktes Denken macht multimodale Vernunft 视觉抽象思考赋予多模式理由 2505.20164v2
  • 427 06-15 Transforming Chatbot Text: A Sequence-to-Sequence Approach Chatbot-Text transformieren: Ein Sequence-to-Sequence-Ansatz 变换聊天器文本: 序列到序列的方法 2506.12843v1
  • 428 06-15 WereWolf-Plus: An Update of Werewolf Game setting Based on DSGBench WereWolf-Plus: Ein Update der Werwolf-Spieleinstellung basierend auf DSGBench WereWolf-Plus:基于 DSGBench 的狼人游戏环境更新 2506.12841v1
  • 429 06-15 Foundations of Large Language Models Grundlagen von großen Sprachmodellen 大语言模式基金会 2501.09223v2
  • 430 06-15 QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions QualiSpeech: Ein Datensatz zur Bewertung der Sprachqualität mit natürlichen Sprachkenntnissen und Beschreibungen 质量语言:语言质量评估数据集,有自然语言理由和描述 2503.20290v3
  • 431 06-15 Medical Argument Mining: Exploitation of Scarce Data Using NLI Systems Medical Argument Mining: Ausnutzung knapper Daten mit NLI-Systemen 医学论证采矿:利用国家指数系统利用稀缺数据 2506.12823v1
  • 432 06-15 Accurate and Regret-aware Numerical Problem Solver for Tabular Question Answering Genaue und respektvolle numerische Problemlöser für tabellarische Fragenbeantwortung 用于表格问答的准确和遗憾数字问题解答器 2410.12846v4
  • 433 06-15 Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling Effiziente Sicherheitsausrichtung großer Sprachmodelle über Preference Re-Ranking und repräsentationsbasierte Prämienmodellierung 通过优先排序和以代表制为基础的奖励模式,使大语言模式在安全方面实现高效率的一致 2503.10093v2
  • 434 06-15 DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs In Konflikte geraten: In suchgesteigerten LLMs widersprüchliche Quellen erkennen und bekämpfen 钻入冲突:发现和解决搜索中的冲突源 2506.08500v2
  • 435 06-15 EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection EmoNet-Voice: Ein feinkörniger, sachverständiger Benchmark für Sprachemotionserkennung EmoNet-Voice:语音情感检测精密、经专家核实的专家验证基准 2506.09827v2
  • 436 06-15 ProMedTS: A Self-Supervised, Prompt-Guided Multimodal Approach for Integrating Medical Text and Time Series ProMedTS: Ein selbstüberwachter, prompt geführter multimodaler Ansatz zur Integration medizinischer Text- und Zeitreihen ProMedTS: 综合医疗文本和时间系列的自我监督、迅速指导的多模式办法 2502.13509v2
  • 437 06-15 Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models 利用小型语言模型进行疾病诊断的知识强化多式临床多式理论 2411.07611v4
  • 438 06-15 Entity Framing and Role Portrayal in the News Entity Framing und Role Portrayal in den Nachrichten 《新闻》中的实体形式和角色形象 2502.14718v2
  • 439 06-15 Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models Demokratisch oder authoritär? Eine neue Dimension politischer Biasen in großen Sprachmodellen probieren 民主还是专制? 以大语言模式探究政治分歧的新层面 2506.12758v1
  • 440 06-15 Can We Infer Confidential Properties of Training Data from LLMs? Können wir vertrauliche Eigenschaften von Trainingsdaten von LLMs ableiten? 我们能否从LLMS中推断培训数据的机密性? 2506.10364v2
  • 441 06-15 Rethinking Hate Speech Detection on Social Media: Can LLMs Replace Traditional Models? Nachdenken über Hass-Spracherkennung in sozialen Medien: Können LLMs traditionelle Modelle ersetzen? 在社会媒体上重新思考仇恨言论探测:LLMs能否取代传统模式? 2506.12744v1
  • 442 06-15 Rethinking DPO: The Role of Rejected Responses in Preference Misalignment Überdenken der DPO: Die Rolle der abgelehnten Reaktionen in der Präferenz-Missausrichtung 重新思考DPO:拒绝的对策在偏重不协调方面所起的作用 2506.12725v1
  • 443 06-15 SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models SelfCite: Selbstüberwachte Ausrichtung für Kontextzuweisung in großen Sprachmodellen 自成一体:对大语言模式背景归属的自我监督调整 2502.09604v3
  • 444 06-15 Strategic Scaling of Test-Time Compute: A Bandit Learning Approach Strategische Skalierung von Test-Time Compute: Ein Bandit-Lernansatz 试验时间计算战略规模的扩大:匪盗学习方法 2506.12721v1
  • 445 06-15 Efficient Sequential Decision Making with Large Language Models Effiziente sequentielle Entscheidungsfindung mit großen Sprachmodellen 与大语言模式高效有序决策 2406.12125v2
  • 446 06-15 Humanity’s Last Code Exam: Can Advanced LLMs Conquer Human’s Hardest Code Competition? Letzte Codeprüfung der Menschheit: Können fortgeschrittene LLMs den härtesten Codewettbewerb des Menschen erobern? 人类最后一次代码考试:高级LLMS 征服人类最硬的代码竞赛吗? 2506.12713v1
  • 447 06-15 SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression SecurityLingua: Effiziente Verteidigung von LLM-Jailbreak-Angriffen durch Security-Aware Prompt-Kompression 保安Lingua:通过安全警报即时压缩,有效防范LLM越狱袭击 2506.12707v1
  • 448 06-15 Flexible Realignment of Language Models Flexible Neuausrichtung von Sprachmodellen 语文模式灵活调整 2506.12704v1
  • 449 06-15 Co-occurrence is not Factual Association in Language Models Co-occurrence ist nicht Factual Association in Language Models 共同发生不是语言模式中的事实协会 2409.14057v2
  • 450 06-15 SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition SC-SOT: Konditionierung des Decoders auf diarisierten Lautsprecherinformationen für die End-to-End-Überlappende Spracherkennung SC-SOT:为终端至终端超载语音识别分解器设置解码器 2506.12672v1
  • 451 06-15 Failure Modes of LLMs for Causal Reasoning on Narratives Failure Modes von LLMs für die ursächliche Begründung von Narrativen 以叙述为由解释原因的LLMs失败模式 2410.23884v5
  • 452 06-14 (6) Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics Synthetische sokratische Debatten: Untersuchung von Persona-Effekten auf moralische Entscheidung und Überzeugungsdynamik 合成专家辩论:审查人对道德决定的影响和预测动态 2506.12657v1
  • 453 06-14 How Grounded is Wikipedia? A Study on Structured Evidential Support Wie geerdet ist Wikipedia? Eine Studie über strukturierten Evidential Support 维基百科如何根基? 2506.12637v1
  • 454 06-14 Between Predictability and Randomness: Seeking Artistic Inspiration from AI Generative Models Zwischen Vorhersagbarkeit und Zufälligkeit: Künstlerische Inspiration aus KI-Generativen Modellen suchen 在可预测性和随机性之间:从AI创创模式中寻求艺术灵感 2506.12634v1
  • 455 06-14 Detecting Narrative Shifts through Persistent Structures: A Topological Analysis of Media Discourse Ermitteln narrativer Verschiebungen durch persistente Strukturen: Eine topologische Analyse des Mediendiskurses 通过持久性结构检测到的叙述性转变:媒体谈话的地形分析 2506.14836v1
  • 456 06-14 MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos MS4UI: Ein Datensatz für die multimodale Zusammenfassung von Benutzeroberflächen-Instruktionsvideos MS4UI:用户界面教学录像多式摘要数据集 2506.12623v1
  • 457 06-14 Video Understanding with Large Language Models: A Survey Videoverständnis mit großen Sprachmodellen: Eine Umfrage 与大语言模型的视频了解:调查 2312.17432v5
  • 458 06-14 OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics OpenUnlearning: Beschleunigung des LLM-Unlearnings durch einheitliche Benchmarking von Methoden und Metrics 开放式学习:通过统一的方法和计量方法基准,加快LLM的学习 2506.12618v1
  • 459 06-14 Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition Konooz: Multi-Domain Multi-Dialekt Corpus für die benannte Entitätserkennung Konooz: 名称实体识别多域多对立公司 2506.12615v1
  • 460 06-14 ShED-HD: A Shannon Entropy Distribution Framework for Lightweight Hallucination Detection on Edge Devices ShED-HD: Ein Shannon Entropy Distribution Framework für leichte Halluzinationserkennung auf Edge-Geräten ShED-HD:关于边缘装置轻量级致幻剂探测的香农封状分发框架 2503.18242v2
  • 461 06-14 Is Smaller Always Faster? Tradeoffs in Compressing Self-Supervised Speech Transformers Ist Kleiner immer schneller? Tradeoffs bei selbstüberwachten Sprachtransformatoren komprimieren 更小的总是更快吗? 压缩自制语音变换器的权衡取舍 2211.09949v3
  • 462 06-14 Towards Building General Purpose Embedding Models for Industry 4.0 Agents Auf dem Weg zum Aufbau von Modellen für Industrie 4.0-Agenten 建立工业4.0剂通用嵌入模型模型 2506.12607v1
  • 463 06-14 An Exploration of Mamba for Speech Self-Supervised Models Eine Erkundung von Mamba für selbstüberwachte Sprachmodelle 探索Mamba演讲自我示范模式 2506.12606v1
  • 464 06-14 Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training Adapt-Pruner: Adaptives Structural Pruning für effizientes Small Language Model Training 适应者:适应性结构调节,促进高效的小型语言模式培训 2502.03460v2
  • 465 06-14 NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions NaturalReasoning: Vernunft in der Wildnis mit 2.8M anspruchsvollen Fragen 自然反应:以2.8M挑战性问题在野外的原因 2502.13124v3
  • 466 06-14 OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases OneEval: Benchmarking von LLM Wissensintensive Reasoning über unterschiedliche Wissensgrundlagen OneEval:确定LLM 知识密集型知识密集型比多样化知识库更引力的基准 2506.12577v1
  • 467 06-14 Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders Präzise Topic Alignment in großen Sprachmodellen über Sparse Autoencoder aktivieren 启用大语言模型中的精确主题对齐 2506.12576v1
  • 468 06-14 TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression TL;DR: Zu lange, re-Gewichtung für effiziente LLM-Reasoning-Kompression TL;DR:太长,为高效 LLM 合理压缩而重新加权 2506.02678v3
  • 469 06-14 Overview of the NLPCC 2025 Shared Task: Gender Bias Mitigation Challenge Überblick über die gemeinsame Aufgabe NLPCC 2025: Gender Bias Mitigation Challenge 2025年全国妇女、妇女和儿童委员会2025年共同任务概览:减少性别偏见的挑战 2506.12574v1
  • 470 06-14 DoTA-RAG: Dynamic of Thought Aggregation RAG DoTA-RAG: Dynamik der Gedankenaggregation RAG DoTA-RAG:思想聚合动态RAG 2506.12571v1
  • 471 06-14 StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling StreamMel: Echtzeit Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modellierung 流流:通过间断连续自动递减建模实现实时零光文本对语音 2506.12570v1
  • 472 06-14 SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition SMILE: Sprachmeta In-Context-Lernen für die automatische Spracherkennung mit geringer Ressource SMILE: 用于低资源语言自动语音识别的 2409.10429v2
  • 473 06-14 Scholar Inbox: Personalized Paper Recommendations for Scientists Scholar Inbox: Personalisierte Papierempfehlungen für Wissenschaftler 学者箱:给科学家的个人化论文建议 2504.08385v2
  • 474 06-14 PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference PKU-SafeRLHF: Auf dem Weg zu mehrstufiger Sicherheitsausrichtung für LLMs mit menschlicher Vorliebe PKU-SafeRLLHF:为具有人类特爱的LLMs实现多级安全协调 2406.15513v3
  • 475 06-14 Profiling News Media for Factuality and Bias Using LLMs and the Fact-Checking Methodology of Human Experts Profiling News Medien für Factuality und Bias mit LLMs und der Fact-Checking-Methode menschlicher Experten 利用LLMMs和 “ 人权专家实况调查方法 “ 将新闻媒体描述为 “ 事实和偏见 “ 和 “ 人权专家实况调查方法 “ 2506.12552v1
  • 476 06-14 Activation-Informed Merging of Large Language Models Aktivierungs-informiertes Zusammenführen von großen Sprachmodellen 大语言模式的合并 2502.02421v2
  • 477 06-14 RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking RealFactBench: Ein Benchmark für die Bewertung großer Sprachmodelle in Real-World Fact-Checking RealFactFactBonch:在现实世界实况调查中评价大语言模式的基准 2506.12538v1
  • 478 06-14 Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction Sprachmodelle mit entkoppelten Tokenizern und Multi-Token-Vorhersage 配有拆分调制调制器和多功能预测的语音-语言语言模型 2506.12537v1
  • 479 06-14 Detection, Classification, and Mitigation of Gender Bias in Large Language Models Erkennung, Klassifizierung und Minderung von Gender-Bias in großen Sprachmodellen 大语言模式中性别偏见的探测、分类和减轻 2506.12527v1
  • 480 06-14 LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL LinkAlign: Skalierbare Schema-Verknüpfung für Real-World großformatige Multi-Datenbank Text-zu-SQL 链接对称: 真实世界大型多数据基文本到 SQL 的可缩放气相表链接 2503.18596v3
  • 481 06-14 How Does A Text Preprocessing Pipeline Affect Ontology Syntactic Matching? Wie wirkt sich eine Textvorverarbeitung auf die Ontologie aus? 文本预处理管道如何影响本体学同步匹配? 2411.03962v8
  • 482 06-14 Less is More: Improving LLM Alignment via Preference Data Selection Weniger ist mehr: Verbesserung der LLM-Ausrichtung über Präferenzdatenauswahl 较少是更多:通过优先数据选择改进LLM对齐 2502.14560v3
  • 483 06-14 Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation Überbrückungsrelevanz und Begründung: Rationale Destillation in retrieval-augmented Generation 架桥关联性和合理性:再回收-提款一代中的理由蒸馏 2412.08519v2
  • 484 06-14 Towards Fairness Assessment of Dutch Hate Speech Detection Zur Fairnessbewertung der niederländischen Hass-Spracherkennung 争取对荷兰仇恨言论检测进行公平评估 2506.12502v1
  • 485 06-14 Improving Factuality for Dialogue Response Generation via Graph-Based Knowledge Augmentation Verbesserung der Factuality für Dialog-Response-Generierung durch graphgestützte Wissenserweiterung 通过基于图表的知识增加改进对话回应生成的实况 2506.12496v1
  • 486 06-14 FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation FlexRAG: Ein flexibler und umfassender Rahmen für die Retrieval-Augmented Generation FlexRAG: 灵活和综合的回回回一代人框架 2506.12494v1
  • 487 06-14 Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization Robustes LLM-Unlearning mit MUDMAN: Meta-Unlearning mit Disruptionsmasken und Normalisierung 与 MUDMAN 一起重新学习: 以干扰蒙蔽和正常化的方式重新学习 2506.12484v1
  • 488 06-14 MALM: A Multi-Information Adapter for Large Language Models to Mitigate Hallucination MALM: Ein Multi-Informationsadapter für große Sprachmodelle zur Mititation von Halluzinationen MARM:一个用于模拟幻觉大语言模型的多信息适应器 2506.12483v1
  • 489 06-14 MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems MTLM: Aufnahme bidirektionaler Textinformationen zur Verbesserung der Sprachmodellausbildung in Spracherkennungssystemen MTLM:纳入双向文本信息,以加强语音识别系统中的语言示范培训 2502.10058v2
  • 490 06-14 AI Flow: Perspectives, Scenarios, and Approaches AI Flow: Perspektiven, Szenarien und Ansätze AI 流动:观点、设想和方法 2506.12479v1
  • 491 06-14 TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks TagRouter: Lernroute zu LLMs durch Tags für Open-Domain Text Generierung Aufgaben TagRouter: 通过用于 Open-Domain 文本生成任务的标记学习 LLM 的学习路径 2506.12473v1
  • 492 06-14 A Pluggable Multi-Task Learning Framework for Sentiment-Aware Financial Relation Extraction Ein steckbarer Multi-Task-Lernrahmen für sentiment-aware Finanzrelation Extraction 一个可插插多任务学习框架,用于情感-恶意金融关系采掘 2506.12452v1
  • 493 06-14 Language Surgery in Multilingual Large Language Models Sprachchirurgie in mehrsprachigen großen Sprachmodellen 多语言大语言模式中的语言外科手术 2506.12450v1
  • 494 06-14 ViQA-COVID: COVID-19 Machine Reading Comprehension Dataset for Vietnamese ViQA-COVID: COVID-19 Maschinenlesedatensatz für Vietnamesen ViQA-COVID:越南的COVID-19机器阅读综合数据集 2504.21017v2
  • 495 06-14 From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment Von Ergebnissen zu Prozessen: Leitende PRM-Lernen von ORM für die Schlussfolgerungs-Zeit-Ausrichtung 从结果到过程:指导程序程序管理从ORM学习,以推断-时间协调 2506.12446v1
  • 496 06-14 Nested Named-Entity Recognition on Vietnamese COVID-19: Dataset and Experiments Nested Named-Entity Recognition on Vietnamese COVID-19: Datensatz und Experimente 越南COVID-19(数据集和实验) 2504.21016v2
  • 497 06-14 Exploring Cultural Variations in Moral Judgments with Large Language Models Kulturelle Variationen in Moralurteilen mit großen Sprachmodellen erforschen 探索具有大语言模式的道德判决的文化差异 2506.12433v1
  • 498 06-14 Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design Auf dem Weg zu vernünftigen Papageien: Warum große Sprachmodelle mit uns argumentieren sollten 通向合理的鹦鹉:为什么大语言模型应该设计来与我们争论? 2505.05298v2
  • 499 06-14 CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis CoT-basierter Synthesizer: Verbesserung der LLM-Performance durch Antwortsynthese 以Cot为基础的合成器:通过答复合成提高LLM绩效 2501.01668v2
  • 500 06-14 Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM Planen Sie Ihre Reise und Reise mit Ihrem Plan: Wide-Horizon Planung und Bewertung über LLM 与你的计划一起规划你的旅行和旅行计划:通过LLM进行广泛的毛利人规划和评估 2506.12421v1
  • 501 06-14 Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters Unüberwachte Klassifikation von englischen Wörtern anhand phonologischer Informationen: Entdeckung von germanischen und lateinischen Clustern 基于声频信息:发现日耳曼语和拉丁语群集 2504.11770v2
  • 502 06-14 Transformers without Normalization Transformatoren ohne Normalisierung 无正常化的变换器 2503.10622v2
  • 503 06-14 Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model Gruppe dann Skala: Dynamische Mischung-von-Experten Mehrsprachiges Sprachmodell 群组然后缩放: 动态混合专家多语种语言模型 2506.12388v1
  • 504 06-14 Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision Ranking-Kette-of-Thought-Lernen: Ein energiebasierter Ansatz mit Outcome-Supervision 学习 “ 研究链链 “ :以能源为基础的方法与成果监督 2505.14999v2
  • 505 06-14 Recent Advances and Future Directions in Literature-Based Discovery Jüngste Fortschritte und zukünftige Wege in der literaturbasierten Entdeckung 最近在基于文学的发现中的进展和未来方向 2506.12385v1
  • 506 06-14 Model Merging for Knowledge Editing Modellzusammenführung für die Wissensbearbeitung 知识编辑合并模型 2506.12384v1
  • 507 06-14 Training-free LLM Merging for Multi-task Learning Schulungsfreie LLM-Zusammenführung für Multi-Task-Lernen 多任务学习合并不培训的LLMLM 2506.12379v1
  • 508 06-14 A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization Hybride Architektur mit effizienter Feinabstimmung für abstrakte Patentdokumentzusammenfassung 简易专利文件摘要的高效精度计价混合结构 2503.10354v4
  • 509 06-14 Understanding the Effect of Knowledge Graph Extraction Error on Downstream Graph Analyses: A Case Study on Affiliation Graphs Verständnis des Einflusses von Wissensgraphenauszugsfehlern auf Downstream Graph Analyses: Eine Fallstudie zu Verknüpfungsgraphen 了解知识图解错误对下游图分析的影响:关于亲子关系图的个案研究 2506.12367v1
  • 510 06-14 Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics Fortschritte in LLMs mit Fokus auf Vernunft, Anpassungsfähigkeit, Effizienz und Ethik 注重理由、适应性、效率和道德操守的LLMs项目的进展 2506.12365v1
  • 511 06-14 MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval MM-R5: MultiModal reasoning-enhanced ReRanker über Verstärkungs-Lernen für Dokument-Retrieval MM-R5:通过文件检索强化学习加强文件检索,多模式合理改进Reanker 2506.12364v1
  • 512 06-14 QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm QiMeng-Achtung: SOTA Attention Operator wird von SOTA Attention Algorithm erzeugt QiMeng- 注意: SOTA 注意操作员由 SOTA 注意算法生成 2506.12355v1
  • 513 06-14 Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models Watch Out Your Album! Über die unbeabsichtigte Datenschutz-Erinnerung in Multi-Modal Large Language Models 注意您的专辑! 在多模式大语言模型中的意外隐私记忆中 2503.01208v2
  • 514 06-14 Efficient Reasoning Through Suppression of Self-Affirmation Reflections in Large Reasoning Models Effiziente Vernunft durch Unterdrückung von Selbstbestätigungsreflexionen in großen Vernunftmodellen 通过制止大理由模型中的自我确认反思提高合理性 2506.12353v1
  • 515 06-14 Information Suppression in Large Language Models: Auditing, Quantifying, and Characterizing Censorship in DeepSeek Informationsunterdrückung in großen Sprachmodellen: Auditierung, Quantifizierung und Charakterisierung von Zensur in DeepSeek 在大语言模式中禁止信息:审计、量化和深海搜索检查 2506.12349v1
  • 516 06-14 Refract ICL: Rethinking Example Selection in the Era of Million-Token Models Refrakt ICL: Beispielauswahl im Zeitalter der Millionen-Token-Modelle neu denken Refract ICL: 重新思考百万吨模型时代的示例选择 2506.12346v1
  • 517 06-14 RATIONALYST: Mining Implicit Rationales for Process Supervision of Reasoning RATIONALYST: Bergbau implizite Rationale für die Prozessüberwachung von Vernunft RICTIYST: 程序监督理据的采矿隐含理由 2410.01044v2
  • 518 06-14 Investigating the Effects of Cognitive Biases in Prompts on Large Language Model Outputs Untersuchung der Auswirkungen von Kognitiv-Biasen in Prompts auf große Sprachmodell-Ausgaben 调查认知分裂对大语言示范产出的影响 2506.12338v1
  • 519 06-14 Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective Intersektionale Bias in japanischen großen Sprachmodellen aus einer kontextualisierten Perspektive 日本大语言模型中从背景角度分析的交叉比阿语 2506.12327v1
  • 520 06-14 GSDNet: Revisiting Incomplete Multimodal-Diffusion from Graph Spectrum Perspective for Conversation Emotion Recognition GSDNet: Unvollständige Multimodal-Diffusion aus Graph Spectrum Perspektive für die Erkennung von Gesprächsgefühlen GSDNet:从图表光谱视角重新审视不完全的多式联运传播,以认识情感 2506.12325v1
  • 521 06-14 Fino1: On the Transferability of Reasoning-Enhanced LLMs and Reinforcement Learning to Finance Fino1: Über die Übertragbarkeit von mit Gründen versehenen LLMs und die Stärkung des Lernens zur Finanzierung Fino1:关于有合理理由的信贷额度的可转让性和加强向融资学习 2502.08127v3
  • 522 06-14 Perspective on Utilizing Foundation Models for Laboratory Automation in Materials Research Perspektiven für die Nutzung von Basismodellen für die Laborautomation in der Materialforschung 利用材料研究实验室自动化模型的基础模型的视角 2506.12312v1
  • 523 06-14 Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech Phonikud: Hebräische Grapheme-to-Phone-Umwandlung für Echtzeit-Text-to-Speech Phonikud: 用于实时文字语音转换的希伯来石墨到phoneme转换 2506.12311v1
  • 524 06-14 Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning Med-U1: Förderung der einheitlichen medizinischen Vernunft in LLMs durch großangelegtes Verstärkungslernen Med-U1:通过大规模加强学习在LLMs中鼓励统一医疗理由 2506.12307v1
  • 525 06-14 Smurfs: Multi-Agent System using Context-Efficient DFSDT for Tool Planning Schlümpfe: Multi-Agent System mit Kontext-Effizient DFSDT für Werkzeugplanung 蓝精精:多机构系统,在工具规划中使用内地高效的DFDDT 2405.05955v4
  • 526 06-14 Disclosure Audits for LLM Agents Offenlegungsprüfungen für LLM-Agenten 对LLLM代理的披露审计 2506.10171v2
  • 527 06-13 (5) Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure Können LLMs hochwertige Testfälle für Algorithmenprobleme generieren? TestCase-Eval: Eine systematische Bewertung von Fehlerbedeckung und Exposition LLLM女士能否生成高质量的鉴定问题测试案例? 2506.12278v1
  • 528 06-13 Investigating the Potential of Large Language Model-Based Router Multi-Agent Architectures for Foundation Design Automation: A Task Classification and Expert Selection Study Untersuchung des Potenzials von Multi-Agent-Architekturen für die Grundlagen-Design-Automatisierung von Großsprachenmodellen: Eine Aufgabenklassifikation und Expertenauswahlstudie 调查基于大语言示范示范路由器多机构结构对基础设计自动化的潜力:任务分类和专家甄选研究 2506.13811v1
  • 529 06-13 Personalized Wireless Federated Learning for Large Language Models Personalisiertes Wireless-Federated-Lernen für große Sprachmodelle 大语言模式个人无线个人无线联邦学习 2404.13238v2
  • 530 06-13 WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment WorldAPIs: Die Welt ist Wert Wie viele APIs? Ein Gedankenexperiment WorldAPIs:世界值多少个API? 2407.07778v2
  • 531 06-13 InfoFlood: Jailbreaking Large Language Models with Information Overload InfoFlood: Jailbreaking Große Sprachmodelle mit Informationsüberlastung InfoFlood: 带有信息超载的破狱大语言模型 2506.12274v1
  • 532 06-13 The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs The Behavior Gap: Bewertung von Null-Shot-LLM-Agenten in komplexen Task-Orientierten Dialogen 行为差距:评价复杂任务导向对话中的零射LLM代理 2506.12266v1
  • 533 06-13 ProVox: Personalization and Proactive Planning for Situated Human-Robot Collaboration ProVox: Personalisierung und proaktive Planung für die angesiedelte Mensch-Roboter-Kollaboration ProVox:人类机器人合机的个性化和前瞻性规划 2506.12248v1
  • 534 06-13 Large Language Models for History, Philosophy, and Sociology of Science: Interpretive Uses, Methodological Challenges, and Critical Perspectives Große Sprachmodelle für Geschichte, Philosophie und Wissenschaftssoziologie: Interpretische Nutzungen, methodische Herausforderungen und kritische Perspektiven 历史、哲学和社会科学社会学大语言模式:解释用途、方法挑战和关键视角 2506.12242v1
  • 535 06-13 Compute Optimal Scaling of Skills: Knowledge vs Reasoning Optimale Skalierung von Fähigkeiten berechnen: Wissen vs. Vernunft 计算技能的优化规模:知识与理由 2503.10061v3
  • 536 06-13 Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index Infini-gram mini: Genaue n-gram Suche auf der Internetskala mit FM-Index Infini-gram 微型: 使用 FM- Index 的 Internet 比例尺精确的 n 克搜索 2506.12229v1
  • 537 06-13 R-KV: Redundancy-aware KV Cache Compression for Reasoning Models R-KV: Redundancy-aware KV Cache-Kompression für sinnvolle Modelle R-KV: 解释模型的冗余感知 KV 缓存压缩 2505.24133v3
  • 538 06-13 A Survey of Generative Categories and Techniques in Multimodal Large Language Models Eine Übersicht über generative Kategorien und Techniken in multimodalen großen Sprachmodellen 多式联运大语言模型的创用类别和技术调查 2506.10016v2
  • 539 06-13 From Emergence to Control: Probing and Modulating Self-Reflection in Language Models Von der Emergence zur Kontrolle: Probieren und Modulieren von Selbstreflexion in Sprachmodellen 从新兴到控制:语文模式的自我反省和调整 2506.12217v1
  • 540 06-13 MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP MELABenchv1: Benchmarking von großen Sprachmodellen gegen kleinere, feinere Modelle für Low-Resource Maltesische NLP MELABenchv1:对照低资源马耳他低排放马耳他低排放马耳他低排放语言方案较微小的微量设计模型确定大语言模型基准 2506.04385v2
  • 541 06-13 Supernova Event Dataset: Interpreting Large Language Model’s Personality through Critical Event Analysis Supernova-Ereignisdatensatz: Verdolmetschen der Persönlichkeit des Large Language Model durch kritische Ereignisanalyse 超新星事件数据集:通过重大事件分析解释大语言模型的个性 2506.12189v1
  • 542 06-13 Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse Achten Sie auf Ihren Schritt (durch Schritt): Chain-of-Thought kann die Leistung bei Aufgaben reduzieren, bei denen Denken Menschen schlimmer macht ” 一步一步小心 “ (一步一步): “ 努力链 “ 能够降低思考使人类更加恶化的任务的绩效 “ 。 2410.21333v4
  • 543 06-13 BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation BOUQuET: Datensatz, Benchmark und Open Initiative für Universal Quality Evaluation in Translation BOUQuET:翻译普遍质量评价的数据集、基准和开放倡议 2502.04314v2
  • 544 06-13 Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs Instruktion Tuning und CoT Prompting für kontextuelle medizinische QA mit LLMs 与LLMM公司一起进行背景医疗质量评估的教学说明和COT提示 2506.12182v1
  • 545 06-13 Generative or Discriminative? Revisiting Text Classification in the Era of Transformers Generativ oder diskriminativ? Textklassifizierung im Zeitalter der Transformer 产生还是歧视? 重新研究变异器时代的文本分类 2506.12181v1
  • 546 06-13 A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages Eine rigorose Bewertung von LLM-Datenerstellungsstrategien für ressourcenarme Sprachen 对LLLM低资源语言数据生成战略的严格评价 2506.12158v1
  • 547 06-13 Maximally-Informative Retrieval for State Space Model Generation Maximal-informatives Retrieval für die Generierung von State Space Models 用于国家空间模型生成的最大进步检索 2506.12149v1
  • 548 06-13 Hatevolution: What Static Benchmarks Don’t Tell Us Hatevolution: Was Statische Benchmarks uns nicht sagen 仇恨革命:什么静态基准不告诉我们 2506.12148v1
  • 549 06-13 Resa: Transparent Reasoning Models via SAEs Resa: Transparente Begründungsmodelle über SAE Resa:通过SAEs建立透明说明理由模型 2506.09967v2
  • 550 06-13 code_transformed: The Influence of Large Language Models on Code code_transformed: Der Einfluss großer Sprachmodelle auf Code 代码转换:大语言模型对代码的影响 2506.12014v1
  • 551 06-13 Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources? Können Mixture-of-Experts LLMs unter streng gleichen Ressourcen übertreffen? 在资源严格平等的情况下,能否在资源严格平等的情况下进行专家混合生产? 2506.12119v1
  • 552 06-13 Cartridges: Lightweight and general-purpose long context representations via self-study Patronen: Leichte und universelle lange Kontextdarstellungen durch Selbststudium Cartridges:轻量和一般用途长背景介绍,通过自学 2506.06266v3
  • 553 06-13 Schema-R1: A reasoning training approach for schema linking in Text-to-SQL Task Schema-R1: Ein argumentierender Schulungsansatz für die Schemaverknüpfung in Text-zu-SQL-Aufgabe Schema-R1:在文本到SQL任务中将系统图案联系起来的推理培训方法 2506.11986v1
  • 554 06-13 e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs e3: Erforschen lernen ermöglicht Extrapolation von Test-Time Compute für LLMs e3: 学习探索以利对LLMM的试验时间计算进行外推计算 2506.09026v2
  • 555 06-13 Improving Large Language Models with Concept-Aware Fine-Tuning Große Sprachmodelle mit konzeptorientiertem Feintuning verbessern 改进概念软件微调大语言模式 2506.07833v2
  • 556 06-13 Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English Auswirkungen der Rahmensätze auf Sprachtokenizer: Eine Fallstudie zu Mandarin und Englisch 《框架率对语言控制器的影响:普通话和英语案例研究》 2505.17076v3
  • 557 06-13 Factual Knowledge in Language Models: Robustness and Anomalies under Simple Temporal Context Variations Factual Knowledge in Language Models: Robustheit und Anomalien unter einfachen zeitlichen Kontextvariationen 语言模型中的事实知识:简单时间环境变化下的强力和异常现象 2502.01220v5
  • 558 06-13 Enhancing multimodal analogical reasoning with Logic Augmented Generation Verbesserung multimodaler analoger Argumentation mit Logic Augmented Generation 增强与逻辑增强型一代的多式联运模拟推理 2504.11190v2
  • 559 06-13 Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations Erklärbarkeit großer Sprachmodelle mit SMILE: Statistische Modell-agnostische Interpretierbarkeit mit lokalen Erklärungen 使用SMILE解释大语言模型的可解释性:统计模型 – – 与当地解释的可解释性 2505.21657v2
  • 560 06-13 Improving Large Language Model Safety with Contrastive Representation Learning Verbesserung der Sicherheit von großen Sprachmodellen mit kontrasem Repräsentationslernen 改进大语文示范语文安全,同时进行差异代表制学习 2506.11938v1
  • 561 06-13 Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback Feedback Friction: LLMs kämpfen, um externes Feedback vollständig zu integrieren 反响:LLMs 争取充分吸收外部反馈 2506.11930v1
  • 562 06-13 LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? LiveCodeBench Pro: Wie beurteilen Olympiad-Medaillengewinner LLMs im Wettbewerbsprogramm? LifoCodeBench Pro:奥林匹亚奖章获得者如何在竞争性方案规划中评判LMs? 2506.11928v1
  • 563 06-13 T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling T1: Sprachmodell weiter voranbringen durch Stärkung des Lernens und Ableiten von Skalen T1:通过强化学习和推论扩大规模,推进语文模式 2501.11651v2
  • 564 06-13 Effectiveness of Counter-Speech against Abusive Content: A Multidimensional Annotation and Classification Study Wirksamkeit der Gegenrede gegen missbräuchliche Inhalte: Eine mehrdimensionale Annotation und Klassifikationsstudie 反言论对滥用内容的效力:多层面说明和分类研究 2506.11919v1
  • 565 06-13 GeistBERT: Breathing Life into German NLP GeistBERT: Das Leben in die deutsche NLP einatmen 呼吸生命化为德国NLP 2506.11903v1
  • 566 06-13 TreeRL: LLM Reinforcement Learning with On-Policy Tree Search TreeRL: LLM-Verstärktes Lernen mit On-Policy-Baumsuche TreeRL: LLM 与政策树搜索的LLM 强化学习 2506.11902v1
  • 567 06-13 Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation Graph of Attacks with Pruning: Optimierung der Stealthy Jailbreak Prompt Generation für verbesserte LLM Content Moderation 使用普林宁攻击图:优化用于强化 LLM 内容调控的隐形监狱破获快速生成 2501.18638v2
  • 568 06-13 Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache Jenseits der homogenen Aufmerksamkeit: Speichereffiziente LLMs über Fourier-Approximated KV Cache 超越同异族注意:通过Fourier-Apbeard KV Cache 的记忆-节能LMLM 2506.11886v1
  • 569 06-13 Addressing Bias in LLMs: Strategies and Application to Fair AI-based Recruitment Bias in LLMs ansprechen: Strategien und Anwendung für eine faire KI-basierte Rekrutierung 解决LLMM中的偏见:公平基于大赦国际的招聘战略和应用 2506.11880v1
  • 570 06-13 SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning SAP-Bench: Benchmarking multimodaler Großsprachenmodelle in der operativen Aktionsplanung SAP-Bench:在外科行动规划中确定多式大语言模式基准 2506.07196v2
  • 571 06-13 Long-context Non-factoid Question Answering in Indic Languages Lang-Kontext Non-factoide Frage-Antworten in indischen Sprachen 长长长 长 长 长 长 长 长 长 非 事实 问 问 问 语 语 语 2504.13615v2
  • 572 06-13 Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts Sicherer oder luckier? LLMs als Sicherheitsevaluatoren sind für Artefakte nicht robust 安全性更安全还是更幸运?作为安全评估员的LLMs没有强力进行人工操作。 2503.09347v2
  • 573 06-13 Post Persona Alignment for Multi-Session Dialogue Generation Post Persona Alignment für Multi-Session Dialog Generation 开展多会议对话的人后协调 2506.11857v1
  • 574 06-13 The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets Das automatisierte, aber riskante Spiel: Modellierung von Agent-zu-Agent-Verhandlungen und Transaktionen in Verbrauchermärkten 自动但有风险游戏:消费者市场代理对代理谈判和交易的模拟 2506.00073v3
  • 575 06-13 Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages Große Sprachmodelle für toxische Spracherkennung in ressourcenarmen Balkansprachen 低资源巴尔干语言中有毒语言探测大语言模式 2506.09992v2
  • 576 06-13 Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation Mehrsprachige Vision-Sprachenübersetzung neu denken: Datensatz, Evaluation und Anpassung 重新思考多语种愿景语言翻译:数据集、评估和适应 2506.11820v1
  • 577 06-13 On the Performance of LLMs for Real Estate Appraisal Über die Leistung von LLMs für die Bewertung von Immobilien 房地产评估LLM女士的绩效 2506.11812v1
  • 578 06-13 Word Sense Detection Leveraging Maximum Mean Discrepancy Word Sense Detection Leveraging Maximale mittlere Diskrepanz Word Sensense 检测 利用最大平均值差异 2506.01602v2
  • 579 06-13 Are Multimodal Large Language Models Pragmatically Competent Listeners in Simple Reference Resolution Tasks? Sind multimodale große Sprachmodelle Pragmatisch kompetente Hörer in einfachen Referenzauflösungsaufgaben? 在简单参考解析任务中,多式大语言模型是否具有实用能力的听众能力? 2506.11807v1
  • 580 06-13 Unsupervised Document and Template Clustering using Multimodal Embeddings Unüberwachte Dokumenten- und Vorlagen-Clustering mit multimodalen Einbettungen 使用多式嵌入式将无人监督的文档和模板分组 2506.12116v1
  • 581 06-13 Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models Persona-getriebene Simulation des Abstimmungsverhaltens im Europäischen Parlament mit großen Sprachmodellen 欧洲议会以大语言模式模拟投票行为 2506.11798v1
  • 582 06-13 Eliciting Reasoning in Language Models with Cognitive Tools Mit kognitiven Tools die Vernunft in Sprachmodellen elizitieren 具有认知工具的语言模型中的 埃利推理 2506.12115v1
  • 583 06-13 MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis MEDDxAgent: Ein einheitliches Modular-Agenten-Framework für erklärbare automatische Differentialdiagnose MDDAAGent: 可解释自动差异分析统一模块剂框架 2502.19175v2
  • 584 06-13 Women, Infamous, and Exotic Beings: What Honorific Usages in Wikipedia Reflect on the Cross-Cultural Sociolinguistic Norms? Frauen, berüchtigte und exotische Wesen: Welche ehrwürdigen Nutzungen in Wikipedia reflektieren die kulturübergreifenden Soziolinguistischen Normen? 妇女、臭名昭著的人和外来人:维基百科对跨文化社会语言规范的何种荣誉使用? 2501.03479v3
  • 585 06-13 Long-Short Alignment for Effective Long-Context Modeling in LLMs Lang-Short Alignment für effektive Lang-Kontext-Modellierung in LLMs 为在LLMM中建立有效的长文建模而实现长短期一致 2506.11769v1
  • 586 06-13 DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents DeepResearch Bench: Ein umfassender Benchmark für Deep Research Agents 深层研究组:深层研究剂综合基准 2506.11763v1
  • 587 06-13 DART: Distilling Autoregressive Reasoning to Silent Thought DART: Destillieren von autoregressiver Begründung zu stillem Denken DART: 提炼沉默思考的自动递减理由 2506.11752v1
  • 588 06-13 Table-R1: Region-based Reinforcement Learning for Table Understanding Tabelle-R1: Regionsbasiertes Verstärkungslernen für Tabellenverständigung 表-R1:以区域为基础的强化学习,以了解表格 2505.12415v2
  • 589 06-13 Quizzard@INOVA Challenge 2025 – Track A: Plug-and-Play Technique in Interleaved Multi-Image Model Quizzard@INOVA Challenge 2025 – Spur A: Plug-and-Play-Technik im Multi-Image-Modell Quizzad@INOVA 2025年挑战 – – A轨:跨离多图像模型中的插图和布图技术 2506.11737v1
  • 590 06-13 Entropy Controllable Direct Preference Optimization Entropie kontrollierbare Direktpräferenzoptimierung 直接首选优化 2411.07595v2
  • 591 06-13 VM14K: First Vietnamese Medical Benchmark VM14K: Erster vietnamesischer medizinischer Benchmark VM14K:第一个越南医疗基准 2506.01305v2
  • 592 06-13 The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference Die Cambrian Explosion von Mixed-Precision Matrix Multiplikation für Quantized Deep Learning Inferenz Cambrian 混合精密矩阵乘数爆炸,用于量测深学习推断 2506.11728v1
  • 593 06-13 Persistent Topological Features in Large Language Models Persistente Topologische Features in großen Sprachmodellen 大语言模式中的持久性有机污染物特征 2410.11042v3
  • 594 06-13 Vision-Language Models for Edge Networks: A Comprehensive Survey Vision-Language-Modelle für Edge Networks: Eine umfassende Umfrage 边缘网络远景-语言模型:全面调查 2502.07855v2
  • 595 06-13 Configurable Preference Tuning with Rubric-Guided Synthetic Data Konfigurierbare Präferenz-Tuning mit Rubric-Guided Synthetic Data 使用 Rubric 辅助合成数据进行可配置的优惠税 2506.11702v1
  • 596 06-13 Improving Causal Interventions in Amnesic Probing with Mean Projection or LEACE Verbesserung der Kausalinterventionen bei der amnesischen Probierung mit mittlerer Projektion oder LEACE 改善在用平均投射或LEACE进行非正常试验时的因果干预 2506.11673v1
  • 597 06-13 LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models LLaVA-CMoE: Auf dem Weg zu einer kontinuierlichen Mischung von Experten für große Vision-Sprachenmodelle LLavaVA-CMoE:建立大型视觉语言模型专家的连续混合体 2503.21227v2
  • 598 06-13 Quantum-Inspired Differentiable Integral Neural Networks (QIDINNs): A Feynman-Based Architecture for Continuous Learning Over Streaming Data Quantum-inspirierte differentiable Integral Neural Networks (QIDINNs): Eine Feynman-basierte Architektur für kontinuierliches Lernen über Streaming-Daten 量材激发的有差异的综合神经网络:一个基于费曼的建筑结构,用于对流数据进行持续学习 2506.12111v1
  • 599 06-13 Can reasoning models comprehend mathematical problems in Chinese ancient texts? An empirical study based on data from Suanjing Shishu Können Argumentationsmodelle mathematische Probleme in chinesischen alten Texten verstehen? Eine empirische Studie basierend auf Daten von Suanjing Shishu 推理模型能理解中国古经中的数学问题吗? 2505.16660v3
  • 600 06-13 Converting Annotated Clinical Cases into Structured Case Report Forms Umwandlung von annotierten klinischen Fällen in strukturierte Fallberichtsformulare 将附加说明的临床病例转换成结构化个案报告表格 2506.11666v1
  • 601 06-13 LoRA-Gen: Specializing Large Language Model via Online LoRA Generation LoRA-Gen: Großes Sprachmodell über Online spezialisieren LoRA Generation LoRA-Gen:通过在线LORA生成专门化大语言模式 2506.11638v1
  • 602 06-13 Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model Schritt-Audio-AQAA: ein vollständig von Ende zu Ende ausdrucksstarkes großes Audio-Sprachenmodell 渐进-AQAAA:全端到端全端表达式大音频语言模型 2506.08967v2
  • 603 06-13 SceneGram: Conceptualizing and Describing Tangrams in Scene Context SceneGram: Konzeptualisieren und Beschreiben von Tangrammen im Szenekontext CceneGram: 在景象背景下对Tangrams进行概念化和描述 2506.11631v1
  • 604 06-13 JBBQ: Japanese Bias Benchmark for Analyzing Social Biases in Large Language Models JBBQ: Japanischer Bias-Benchmark für die Analyse sozialer Bias in großen Sprachmodellen JBBQ:日本用于分析大语言模式中社会两边情况的基准 2406.02050v4
  • 605 06-13 (SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of a Phonetically Balanced Speech Test (SimPhon Speech Test): Eine datengetriebene Methode für das Silico-Design und die Validierung eines phonetisch ausgeglichenen Sprachtests (西蒙语音测试):音响平衡语音测试的硅设计和校验数据驱动方法 2506.11620v1
  • 606 06-13 Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis Auf dem Weg zum Verständnis von Feintuning-Mechanismen von LLMs durch Schaltungsanalyse 通过电路分析了解LLM LMs的微调调整机制 2502.11812v2
  • 607 06-13 VLM@school – Evaluation of AI image understanding on German middle school knowledge VLM@school – Auswertung des KI-Bildverständnisses über deutsche Mittelschulkenntnisse VLM@school – – 评价AI关于德国中学知识的图像理解 2506.11604v1
  • 608 06-13 Are LLMs Good Text Diacritizers? An Arabic and Yorùbá Case Study Sind LLMs gute Textdiakritisierer? Eine arabische und Yorùbá Fallstudie LLM女士是好文本诊断器吗? 阿拉伯语和YorOuba的案例研究。 2506.11602v1
  • 609 06-13 Personalized LLM Decoding via Contrasting Personal Preference Personalisiertes LLM-Dekodieren über kontrastierende persönliche Präferenz 通过与个人偏好相违背而解密的个人个人化LLM 2506.12109v1
  • 610 06-13 Automatic Construction of Multiple Classification Dimensions for Managing Approaches in Scientific Papers Automatische Konstruktion mehrerer Klassifizierungsdimensionen für die Verwaltung von Ansätzen in wissenschaftlichen Papieren 科学文件中管理方法的多重分类方面自动构建 2505.23252v2
  • 611 06-13 Understanding the Repeat Curse in Large Language Models from a Feature Perspective Den Wiederholungskurs in großen Sprachmodellen aus einer Feature-Perspektive verstehen 从特写角度理解大语言模式中的重复诅咒 2504.14218v3
  • 612 06-13 FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference FlashBack:Effiziente Retrieval-Augmentierte Sprachmodellierung für lange Kontext-Inferenz FlashBack: 有效检索增强长处推断语言建模 2405.04065v4
  • 613 06-13 DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs DaMO: Ein dateneffizienter Multimodal-Orchester für zeitliche Vernunft mit Video-LLMs DaMO: 带有视频LMS的时空理由数据高效多式多式圆板 2506.11558v1
  • 614 06-13 From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation Von Persona zu Person: Die Natürlichkeit durch multiple Diskursbeziehungen verbessern Graph Learning in Personalized Dialogue Generation 从人到人:加强人与人之间的自然特性,在个性化对话生成过程中采用多种不同问题关系图学习 2506.11557v1
  • 615 06-13 RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning RAG+: Verbesserung der Retrieval-Augmented Generation mit anwendungsrelevanter Begründung RAG+:加强利用应用程序软件软件软件软件支持的检索-启动生成 2506.11555v1
  • 616 06-13 Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective Bewertung von Impliziten Bias in großen Sprachmodellen durch Angriff aus einer psychometrischen Perspektive 通过从心理角度进行攻击,评价大语言模型中隐含的偏见 2406.14023v4
  • 617 06-13 TrajAgent: An LLM-based Agent Framework for Automated Trajectory Modeling via Collaboration of Large and Small Models TrajAgent: Ein LLM-basiertes Agent-Framework für automatisierte Trajektorienmodellierung über die Zusammenarbeit von großen und kleinen Modellen TrajAgendy:一个基于LLM的通过大型和小型模型合作进行自动轨迹建模的LLM代理框架 2410.20445v3
  • 618 06-13 LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation LLMEval-Med: Ein echter klinischer Benchmark für medizinische LLMs mit Physician Validation LLMEval-Med:具有物理校验功能的医疗长效LML 医疗长效LMS的现实世界临床基准 2506.04078v2
  • 619 06-13 PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts PFDial: Eine strukturierte Dialog-Instruktion Feinabstimmungsmethode basierend auf UML-Flowcharts PFDial:基于UMML流程图的结构性对话指示调整方法 2503.06706v3
  • 620 06-13 Brewing Knowledge in Context: Distillation Perspectives on In-Context Learning Brewing Knowledge in Context: Destillationsperspektiven zum In-Context Learning 内在知识的积累:对内文学习的提炼观点 2506.11516v1
  • 621 06-13 Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs Manager: Aggregation von Erkenntnissen von Unimodal-Experten in Zwei-Tower-VLMs und MLLMs 管理者:从双托式VLM和MLLMS的独式专家中收集透视 2506.11515v1
  • 622 06-13 TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages TUMLU: Ein einheitliches Sprachverständnis für türkische Sprachen TUMLU:突厥语统一土著语言理解基准 2502.11020v2
  • 623 06-13 MapQaTor: An Extensible Framework for Efficient Annotation of Map-Based QA Datasets MapQaTor: Ein umfangreiches Framework für eine effiziente Annotation von kartenbasierten QA-Datensätzen 地图QaTor:以地图为基础的质量评估数据集有效注释的扩展框架 2412.21015v2
  • 624 06-13 On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval Über die Wirksamkeit von Integrationsmethoden für die multimodale Reaktion auf den Dialog 综合方法促进多模式对话应对回溯性融合方法的有效性 2506.11499v1
  • 625 06-13 Lag-Relative Sparse Attention In Long Context Training Lag-Relative Sparse Aufmerksamkeit im langen Kontext Training 长期培训中的拉格-相对偏差关注 2506.11498v1
  • 626 06-13 Relational Schemata in BERT Are Inducible, Not Emergent: A Study of Performance vs. Competence in Language Models Relationale Schemata in BERT sind induzierbar, nicht emergent: Eine Leistungsstudie vs. Kompetenz in Sprachmodellen BERT中的关系Schemata是鼓励性的,不是新兴的:对表现与语言模型能力的研究 2506.11485v1
  • 627 06-13 ImmunoFOMO: Are Language Models missing what oncologists see? ImmunoFOMO: Fehlt den Sprachmodellen, was Onkologen sehen? ImmunoFOMO:语言模型是否忽略了肿瘤学家所看到的? 2506.11478v1
  • 628 06-13 BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs BitNet v2: Native 4-Bit-Aktivierungen mit Hadamard-Transformation für 1-Bit-LLMs BitNet v 2: 以 Hadamard 变形为1 位LMs 的本地四位驱动器 2504.18415v2
  • 629 06-13 AutoGen Driven Multi Agent Framework for Iterative Crime Data Analysis and Prediction AutoGen Driven Multi Agent Framework für iterative Kriminalität Datenanalyse und Vorhersage 循环犯罪数据分析和预测自动驱动器多剂框架 2506.11475v1
  • 630 06-13 Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards Med-PRM: Medizinisches Reasoning-Modell mit schrittweisen, leitfadenverifizierten Prozessbelohnungen Med-PRM:医疗理由说明模型,具有逐步、准则核查的流程奖励 2506.11474v1
  • 631 06-13 Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models Hilft immer mehr zu denken? Test-Time Scaling in vernünftigen Modellen verstehen 理解理性模型中的测试时间缩放 2506.04210v2
  • 632 06-13 A Gamified Evaluation and Recruitment Platform for Low Resource Language Machine Translation Systems Gamified Evaluation and Recruitment Platform for Low Resource Language Machine Translation Systems 低资源语言机用翻译系统有色评价和征聘平台 2506.11467v1
  • 633 06-13 MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning MMMG: Ein massiver, multidisziplinärer, multi-Tier-Erzeugungs-Benchmark für Bild-zu-Bild-Reasoning MMMMM: 大量、多学科、多代、多语言的文字到图像推理基准 2506.10963v2
  • 634 06-13 Jointly modelling the evolution of social structure and language in online communities Gemeinsame Modellierung der Entwicklung von sozialer Struktur und Sprache in Online-Communities 联合模拟在线社区社会结构和语言演变 2409.19243v2
  • 635 06-13 Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning Lingshu: Ein generalistisches Stiftungsmodell für ein einheitliches multimodales medizinisches Verständnis und Vernunft Lingshu:通用主义基金会统一多式联运医疗理解和理性模式模式 2506.07044v4
  • 636 06-13 Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model Auf dem Weg zu einem effizienten Sprach-Text gemeinsam innerhalb eines Sprachmodells dekodieren 争取实现在一种语音语言模式内实现高效率的语音-文本联合解码 2506.04518v2
  • 637 06-13 Transferable Post-training via Inverse Value Learning Übertragbare Nachschulung über Inverse Value Learning 通过反向价值学习进行可转让的后培训 2410.21027v2
  • 638 06-13 AbsenceBench: Language Models Can’t Tell What’s Missing AbsenceBench: Sprachmodelle können nicht sagen, was fehlt 缺席时间: 语言模型无法说明缺少什么 2506.11440v1
  • 639 06-13 Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution’s Characteristics Verbesserung der Kalibrierung von Vertrauens-Scores bei der Textgenerierung anhand der Eigenschaften der Output-Distribution 利用产出分配特点改进对文本制作中信任分数的校准 2506.00637v2
  • 640 06-13 KoGEC : Korean Grammatical Error Correction with Pre-trained Translation Models KoGEC : Koreanische Korrektur von Grammatikfehlern mit vortrainierten Übersetzungsmodellen KoGEC: 韩国语法错误校正,采用训练有素的翻译模型 2506.11432v1
  • 641 06-13 MAGPIE: Multi-Task Media-Bias Analysis Generalization for Pre-Trained Identification of Expressions MAPIE: Multi-Task Media-Bias Analyse Generalisierung zur vortrainierten Identifizierung von Ausdrücken MAGPIE: 多任务媒体-Bias分析 2403.07910v3
  • 642 06-13 Deep Sparse Latent Feature Models for Knowledge Graph Completion Deep Sparse Latent Feature Modelle für die Wissensgraphenvervollständigung 知识图补全深度粗略的内端特性模型 2411.15694v2
  • 643 06-13 Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards Agent-RLVR: Training Software Engineering Agents über Beratung und Umwelt Belohnungen RLVR: 通过指导和环境奖励培训软件工程代理 2506.11425v1
  • 644 06-13 Efficient Long-Context LLM Inference via KV Cache Clustering Effiziente Long-Context-LLM-Inferenz über KV Cache-Clustering 通过 KV 缓存群集推断 2506.11418v1
  • 645 06-13 RSCF: Relation-Semantics Consistent Filter for Entity Embedding of Knowledge Graph RSCF: Relation-Semantik Konsequenter Filter für Entity-Einbettung von Wissensgrafik RSCF: 用于实体嵌入知识图的 关系-语义一致性过滤器 2505.20813v3
  • 646 06-13 Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning Erste Prüfung von Wissenschaftlern: Kognitive Fähigkeiten von MLLM durch Wahrnehmung, Verständnis und Vernunft unter Beweis stellen 科学家的第一次考试:通过感知、理解和理性,发现MLLM的认知能力 2506.10521v2
  • 647 06-13 Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles Beschleunigen von Diffusions-Großsprachenmodellen mit SlowFast Sampling: Die drei goldenen Prinzipien 加速传播具有慢速抽样的大型语言模型:三大金原则 2506.10848v2
  • 648 06-13 Bias Amplification in RAG: Poisoning Knowledge Retrieval to Steer LLMs Bias-Verstärkung in RAG: Vergiftung von Wissen an Steer LLMs RAG中的比值放大:毒性知识检索到STeer LMS 2506.11415v1
  • 649 06-13 Predicting Early-Onset Colorectal Cancer with Large Language Models Frühzeitiger Kolorektalkrebs mit großen Sprachmodellen 以大语言模型预测早期局部直肠癌 2506.11410v1
  • 650 06-13 LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model LoRA-Anwender Vorsicht: Ein paar saubere Zeichen können Ihr Feinabstimmungsmodell manipulieren LoRA 用户要小心: 几个精细的 Tokens 可以操纵您的精密模型 2506.11402v1
  • 651 06-13 Curriculum-Guided Layer Scaling for Language Model Pretraining Curriculum-geführte Ebenenskalierung für Sprachmodellvorschulungen 语言示范语言前培训课程-指导层比例表 2506.11389v1
  • 652 06-13 A Variational Approach for Mitigating Entity Bias in Relation Extraction Ein abwechslungsreicher Ansatz für die Minderung von Entity-Bias in der Beziehungsextraktion 减轻实体在关系中的偏见的变式方法 2506.11381v1
  • 653 06-13 Large Language Model-Powered Conversational Agent Delivering Problem-Solving Therapy (PST) for Family Caregivers: Enhancing Empathy and Therapeutic Alliance Using In-Context Learning Large Language Model-Powered Conversational Agent liefert Problem-Solving Therapie (PST) für Familienpfleger: Empathie und therapeutische Allianz mit Hilfe von In-Context Learning verbessern 为家庭照料者提供提供解决问题治疗的大型语言示范式对话代理方:利用知识内学习加强同情和治疗联盟 2506.11376v1
  • 654 06-13 Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables Benchmarking multimodaler LLMs zur Anerkennung und Verständigung über chemische Tabellen 关于识别和了解化学品表格的多模式贷款 2506.11375v1
  • 655 06-13 FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents FreshStack: Bau realistischer Benchmarks für die Bewertung des Retrievals auf technischen Dokumenten 新鲜工具:建立评价技术文件检索情况的现实基准 2504.13128v2
  • 656 06-12 (4) The Biased Samaritan: LLM biases in Perceived Kindness Der Biased Samaritan: LLM-Voreingenommenheiten in wahrnehmbarer Güte 偏见的撒玛利亚人:见识的品种中的LLM偏见 2506.11361v1
  • 657 06-12 D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model D-GEN: Automatische Distraktorgenerierung und Bewertung zur zuverlässigen Bewertung des Generativen Modells D-GEN:为可靠评估生成模型的可靠评估而自动生成和评估 2504.13439v2
  • 658 06-12 GLAP: General contrastive audio-text pretraining across domains and languages GLAP: Allgemeines kontrastreiches Audio-Text-Vortraining über Domains und Sprachen hinweg GLAP: 跨领域和不同语言的一般有对比性音频-文字预培训 2506.11350v1
  • 659 06-12 Do We Still Need Audio? Rethinking Speaker Diarization with a Text-Based Approach Using Multiple Prediction Models Brauchen wir noch Audio? Überdenken der Lautsprecher-Diarisierung mit einem textbasierten Ansatz mit mehreren Vorhersagemodellen 我们还需要音频吗?用使用多种预测模型的基于文本的方法重新思考议长的对分法 2506.11344v1
  • 660 06-12 From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review Von der Replikation zum Redesign: Paarweise Vergleiche für LLM-basierte Peer Review 从复制到重新设计:为基于LLM的同侪审查探索对称比较 2506.11343v1
  • 661 06-12 Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly Surprisal aus größeren Transformer-basierten Sprachmodellen prognostiziert fMRI-Daten schlechter 以大变压器为基础的以大变压器为基础的语言模型的超常性语言模型对FMRI数据的预测更差 2506.11338v1
  • 662 06-12 Don’t Pay Attention Achte nicht auf mich. 千万不要留意 2506.11305v1
  • 663 06-12 Deep Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions Deep Binding of Language Model Virtual Personas: eine Studie über die Annäherung der politischen Partisanen-Misswahrnehmungen 语言模拟虚拟人:关于政治党派近似误解的研究 2504.11673v2
  • 664 06-12 Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning Beyond Random Sampling: Effizientes Sprachmodell Vortraining über Curriculum Learning 超越随机抽样:通过课程学习进行高效语言模式预科培训 2506.11300v1
  • 665 06-12 Ad Auctions for LLMs via Retrieval Augmented Generation Anzeigenauktionen für LLMs via Retrieval Augmented Generation 通过回收增量一代对LLMs的拍卖 2406.09459v2
  • 666 06-12 Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer Aufmerksamkeit ruft, MLP-Erinnerungen: Entwirren von trainierbaren Komponenten im Transformer 注意检索, MLP 记忆: 变换器中拆分可训练部件 2506.01115v2
  • 667 06-12 ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness ColorBench: Können VLMs die bunte Welt sehen und verstehen? Ein umfassender Maßstab für Farbwahrnehmung, Vernunft und Robustheit 颜色贝因: VLMs 能看到和理解多色世界吗? 色彩感知、理性和强健的综合基准 2504.10514v2
  • 668 06-12 Learning a Continue-Thinking Token for Enhanced Test-Time Scaling Ein weiterdenkendes Token für verbesserte Testzeitskalierung lernen 学习 继续思考 提高测试时间缩放 2506.11274v1
  • 669 06-12 Attuned to Change: Causal Fine-Tuning under Latent-Confounded Shifts Eingestimmt auf den Wandel: Kausales Feintuning unter latent-begründeten Verschiebungen 与变化相接:在长期、有根据的变更下,因果罚款 2410.14375v2
  • 670 06-12 PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling PANDAS: Besseres Viele-Schuss-Jailbreaking durch positive Affirmation, negative Demonstration und adaptive Sampling PANDAS:通过积极肯定、负面示范和适应性抽样改进多射破牢房 2502.01925v2
  • 671 06-12 No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning Keine universelle Aufforderung: Vereinheitlichung der Vernunft durch adaptive Aufforderung für zeitliche Tabellenveranlagung 无通用即时:通过调适性提示来统一时间表合理性的理由 2506.11246v1
  • 672 06-12 Iterative Multilingual Spectral Attribute Erasure Iteratives Mehrsprachiges Spektralattribut Löschen 多语种多语种光谱属性错乱 2506.11244v1
  • 673 06-12 RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation? RETUYT-INCO bei BEA 2025 Shared Task: Wie weit können Leichtbaumodelle in der KI-powered Tutor Evaluation gehen? BEA 2025共同任务:轻量级模型在AI驱动导师评价中能走多远? 2506.11243v1
  • 674 06-12 LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation LLM-as-a-Richter für die referenzlose automatische Codevalidierung und -Verfeinerung für natürliche Sprache in der IT-Automatisierung zu Bash LLM-as-a-Judg 信息技术自动化中自然语言的无参考自动代码校验和精炼至巴什语的无参考自动码校验和精炼LLM-as-a-Judg 2506.11237v1
  • 675 06-12 LLM-as-a-Fuzzy-Judge: Fine-Tuning Large Language Models as a Clinical Evaluation Judge with Fuzzy Logic LLM-as-a-Fuzzy-Judge: Feintuning große Sprachmodelle als klinischer Bewertungsrichter mit Fuzzy Logic LLM-as-a-Fuzzy-Judge:作为Fuzzy逻辑临床评估法官的精准大语言模型 2506.11221v1
  • 676 06-12 How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? Wie gut können vernünftigen Modelle erkennen und sich von unhilflichen Gedanken erholen? 理性模型如何能从无益的想法中查明和复苏? 2506.10979v1
  • 677 06-12 AutoMind: Adaptive Knowledgeable Agent for Automated Data Science AutoMind: Adaptives Knowledgeable Agent für automatisierte Datenwissenschaft 自动Mind:自动数据科学适应性知识代理 2506.10974v1
  • 678 06-12 ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark ChinesischHarm-Bench: Ein chinesischer schädlicher Content Detection Benchmark 中中汉禁区:中国有害内容检测基准 2506.10960v1
  • 679 06-12 Build the web for agents, not agents for the web Erstellen Sie das Web für Agenten, nicht Agenten für das Web 为代理者而不是网络代理者建立网络 2506.10953v1
  • 680 06-12 Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training Domain2Vec: Vectorizing Datasets, um die optimale Datenmischung ohne Training zu finden 域2Vec: 将数据集矢量化,以查找未经过培训的最佳数据混合体 2506.10952v1
  • 681 06-12 GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models GUARD: Geführtes Lernen und Zurückhalten über Datenzuweisung für große Sprachmodelle GUARD:通过大语言模式数据归称制,指导学习和保留 2506.10946v1
  • 682 06-12 VINCIE: Unlocking In-context Image Editing from Video VINCIE: Im Kontext Bildbearbeitung von Video entsperren VINCIE: 从视频中解锁 Incontext 图像编辑 2506.10941v1
  • 683 06-12 Visually Descriptive Language Model for Vector Graphics Reasoning Visuell Deskriptives Sprachmodell für Vektorgrafiken 矢量图形推理视觉描述语言模型 2404.06479v5
  • 684 06-12 Dynamic Epistemic Friction in Dialogue Dynamische epistemische Reibung im Dialog 对话框中的动态瞬间摩擦 2506.10934v1
  • 685 06-12 Improving LLM Safety Alignment with Dual-Objective Optimization Verbesserung der LLM-Sicherheitsausrichtung mit Dual-Ziel-Optimierung 提高LLM安全一致性,实现双目标优化 2503.03710v2
  • 686 06-12 Robustly Improving LLM Fairness in Realistic Settings via Interpretability Robuste Verbesserung der LLM Fairness in realistischen Einstellungen durch Dolmetschbarkeit 通过可解释性在现实环境中强有力地提高LLM公平性 2506.10922v1
  • 687 06-12 Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization Dekomponieren von MLP-Aktivierungen in Interpretierbare Funktionen über semi-Nonnegative Matrix-Fabrikisierung 通过半氮基矩阵化系数化,将劳动和生产部的分解活动转化为可解释性特征 2506.10920v1
  • 688 06-12 Weak-to-Strong Jailbreaking on Large Language Models Schwach-zu-starkes Gefängnis mit großen Sprachmodellen 关于大语言模型的弱至强强监狱破解 2401.17256v3
  • 689 06-12 Efficiently Identifying Watermarked Segments in Mixed-Source Texts Effiziente Identifikation von wassermarkierten Segmenten in Mixed-Source-Texten 有效识别混合来源文本中划划水段 2410.03600v2
  • 690 06-12 Magistral Magistral 司 司 司 司 司 司 司 司 司 2506.10910v1
  • 691 06-12 Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning Jenseits von Goldstandards: Epistemisches Ensemble von LLM-Richtern für formale mathematische Vernunft 超越金金标准:法学硕士正式数学理由法官集会 2506.10903v1
  • 692 06-12 BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP BioClinical ModernBERT: Ein hochmoderner Long-Context-Encoder für biomedizinische und klinische NLP 生物医学和临床国家实验室方案最新生物医学和临床现代生物临床现代BERT:最先进的生物医学和临床临床长期编码器 2506.10896v1
  • 693 06-12 The Diffusion Duality Die Diffusionsdualität 传播质量 2506.10892v1
  • 694 06-12 PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play PLAY2PROMPT: Zero-shot Tool Instruction Optimierung für LLM Agenten über Tool Play PLAY2PROMOPT: 通过工具游戏优化LLM代理器的零射工具指令 2503.14432v2
  • 695 06-12 Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers Verallgemeinerung oder Halluzination? Verstehen von Out-of-Context-Reasoning in Transformers 通化还是幻觉? 理解变异器的逻辑外原因 2506.10887v1
  • 696 06-12 Slimming Down LLMs Without Losing Their Minds LLMs abschwächen, ohne ihre Gedanken zu verlieren 在不失去理智的情况下将LLMs 压倒在地 2506.10885v1
  • 697 06-12 Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment Verbesserung des medizinischen Dialogs durch Wissensverfeinerung und dynamische Anpassung 通过知识完善和动态快速调整加强医疗对话 2506.10877v1
  • 698 06-12 Large Language Models for Multilingual Previously Fact-Checked Claim Detection Große Sprachmodelle für die multilinguale bisher Fact-Checked Claim Detection 多语种以前实况调查索赔调查大语言模型 2503.02737v2
  • 699 06-12 Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards Segeln mit den Sternen: Eine Umfrage über Prämienmodelle und Lernstrategien zum Lernen aus Belohnungen 星舰航行:奖励模型调查以及从奖励中学习的学习战略 2505.02686v2
  • 700 06-12 Multi-group Uncertainty Quantification for Long-form Text Generation Multi-Gruppen-Unsicherheits-Quantifizierung für langformige Textgenerierung 长式文本生成的不确定性量化 2407.21057v2
  • 701 06-12 Debiasing Watermarks for Large Language Models via Maximal Coupling Debiasing Wasserzeichen für große Sprachmodelle über Maximal Coupling 通过Maximal Coupling为大语言模型减少对水标记的偏差 2411.11203v2
  • 702 06-12 Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models Analyse der Beziehungen zwischen vorschulischer Sprache, phonetischer, klanglicher und sprachlicher Information in selbstüberwachten Sprachmodellen 以自我监督的演讲模式分析培训前语言、音、音、音、音和演讲者信息之间的关系 2506.10855v1
  • 703 06-12 CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training CIIR@LiveRAG 2025: Optimierung der Multi-Agent Retrieval Augmented Generation durch Selbsttraining CIIR@LiveRAG 2025:通过自我培训优化多要求回生增生一代 2506.10844v1
  • 704 06-12 UCD: Unlearning in LLMs via Contrastive Decoding UCD: Lernen in LLMs durch Kontrastive Dekodierung UCD:通过互换代号在LLMs中重新学习 2506.12097v1
  • 705 06-12 ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization ReCUT: Ausbalancierende Grundlänge und Genauigkeit in LLMs über Schrittweise Trails und Preference Optimization RECUT:通过分步跟踪和优化优化平衡长长和准确性 2506.10822v1
  • 706 06-12 Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints Negative Interferenzen in der Mehrsprachigkeit sequenzieller Wissensbearbeitung durch Null-Raum-Beschränkungen abmildern 减少多语种序列知识编辑的负面干扰 2506.10800v1
  • 707 06-12 The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Der Esethu-Rahmen: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Esethu框架:重新想象可持续数据集治理和低碳语言的理论 2502.15916v2
  • 708 06-12 FASCIST-O-METER: Classifier for Neo-fascist Discourse Online FASCIST-O-METER: Klassifikator für neofaschistischen Diskurs Online FASCIST-O-METER:新法西斯人士在线论文分类 2506.10789v1
  • 709 06-12 Improving Named Entity Transcription with Contextual LLM-based Revision Verbesserung der Transkription der benannten Entität mit kontextueller LLM-basierter Revision 改进以背景LLM为基础订正的命名实体跟踪 2506.10779v1
  • 710 06-12 Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs Unterschiedliche Fragen, unterschiedliche Modelle: Feinkörnige Bewertung von Unsicherheit und Kalibrierung in klinischen QA mit LLMs 不同问题、不同模式:对临床质量评估中不确定性和校准的精细评估 2506.10769v1
  • 711 06-12 Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation Chain-of-Code Collapse: Gründe für Fehler in LLMs über Adversarial Prompting in der Code-Generierung 崩溃链:通过代码生成中的反向提示造成LLMs中失败的原因 2506.06971v2
  • 712 06-12 One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers Ein Tokenizer, um sie alle zu beherrschen: Emergente Sprachplastizität über Mehrsprachige Tokenizer 万能统治者:通过多语种教育者实现新兴语言的可塑性 2506.10766v1
  • 713 06-12 Aspect-Based Opinion Summarization with Argumentation Schemes Aspektbasierte Zusammenfassung der Meinungen mit Argumentierungsschemata 与参数说明方案对照审计意见的概述 2506.09917v2
  • 714 06-12 Great Models Think Alike and this Undermines AI Oversight Große Modelle denken ähnlich und dies unterminiert AI Oversight 伟大的模特儿们想着类似的想法 和这枚地下地雷 AI监督 2502.04313v2
  • 715 06-12 Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering Neural bei ArchEHR-QA 2025: Agentische Prompt-Optimierung für evidenzgerundete klinische Fragen ArchEHR-QA 2025:证据四舍五入临床问题解答的代理快速优化 2506.10751v1
  • 716 06-12 TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora TaxoAdapt: LLM-basierte multidimensionale Taxonomie-Konstruktion an die sich entwickelnde Forschungskorporation ausrichten 将基于LLM的多层面分类学建设与不断发展的研究公司相协调 2506.10737v1
  • 717 06-12 Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims 超越真或假:收回增加的无损失索赔的等级结构分析 2506.10728v1
  • 718 06-12 PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models PREMISE: Skalierbare und strategische Prompt-Optimierung für effiziente mathematische Reasoning in großen Modellen PREMISE:大规模模型中高效数学理由的可扩展和战略快速优化 2506.10716v1
  • 719 06-12 Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet Adjektive Hypernyms mit Sprachmodellen ableiten, um die Konnektivität von Open English Wordnet zu erhöhen 推导语言模型的形容词超音音音,以提高开放英文Wordnet的连通性 2506.10715v1
  • 720 06-12 PRSA: Prompt Stealing Attacks against Real-World Prompt Services PRSA: Sofortige Diebstahlangriffe gegen Real-World Prompt Services PRSA: 迅速窃盗对现实世界迅速服务公司的袭击 2402.19200v3
  • 721 06-12 FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems FedRAG: Ein Rahmen für Systeme der Feinsteuerung von Retrieval-Augmented Generation FFRAG: 微调取回系统框架 2506.09200v2
  • 722 06-12 SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models SelectLLM: Query-Aware Effiziente Auswahl Algorithmen für große Sprachmodelle 选择LLM: 用于大语言模型的查询- 软件高效选择算法 2408.08545v4
  • 723 06-12 Large Language Models for Detection of Life-Threatening Texts Große Sprachmodelle zur Erkennung lebensbedrohlicher Texte 探测生命威胁文字的长语言大语言模型 2506.10687v1
  • 724 06-12 Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models Habe ich treu gesagt, was ich dachte? Die Kluft zwischen neuraler Aktivität und Selbsterklärungen in großen Sprachmodellen überbrücken 缩小大语言模式中神经活动与自我开发之间的差距 2506.09277v2
  • 725 06-12 TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving TeleMath: Ein Benchmark für große Sprachmodelle in der Telecom Mathematischen Problemlösung TeleMatth:电信数学问题解决大语言模型基准 2506.10674v1
  • 726 06-12 CoRT: Code-integrated Reasoning within Thinking CoRT: Code-integrierte Vernunft im Denken CORT: 思考中守则综合理由 2506.09820v2
  • 727 06-12 Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes Robuste, unüberwachte Anpassung eines Spracherkennungsgeräts mit Entropie-Minimierungs- und Lautsprechercodes 使用磁最小化和演讲人守则的演讲者演讲者 2506.10653v1
  • 728 06-12 Identifying Reliable Evaluation Metrics for Scientific Text Revision Identifizieren von verlässlichen Bewertungsmetrics für wissenschaftliche Textrevision 科学文本订正的可靠评价计量指标 2506.04772v3
  • 729 06-12 Spelling-out is not Straightforward: LLMs’ Capability of Tokenization from Token to Characters Rechtschreibung ist nicht geradeaus: LLMs Fähigkeit der Tokenisierung von Token zu Charakteren 拼写出不是直向前进的: LLMs 的调制能力从调制字符到字符 2506.10641v1
  • 730 06-12 Conversational Search: From Fundamentals to Frontiers in the LLM Era Conversational Search: Von Grundlagen zu Grenzen in der LLM-Ära 对话搜索:从基本原理到LLM时代的前沿 2506.10635v1
  • 731 06-12 NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors NeuralNexus bei BEA 2025 Shared Task: Retrieval-Augmented Prompting für Fehlererkennung in KI-Tutoren BEA 2025年BEA 的神经外观 共同任务:在 AI 导师中检索错误识别提示 2506.10627v1
  • 732 06-12 SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis SDialog: Ein Python-Toolkit für die Synthetische Dialog-Generierung und -Analyse Sidialog:合成对话生成和分析的Python工具包 2506.10622v1
  • 733 06-12 Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code Deep Learning-based Digitalisierung von überlappenden EKG-Bildern mit Open-Source-Python-Code 使用开放源码的 ECG 重叠图像的深学习数字化 2506.10617v1
  • 734 06-12 Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search Unüberwachte protoforme Rekonstruktion durch Parsimonious Regel-geführte Heuristiken und evolutionäre Suche 通过法理学、法理学、受规制的哲理学和进化搜索进行不受监督的原形重建 2506.10614v1
  • 735 06-12 ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization ConfPO: Ausnutzen des politischen Modells Vertrauen für kritische Token-Auswahl in Präferenz-Optimierung 召集:利用政策模范信心在优先最佳化中选择关键物优选标准 2506.08712v2
  • 736 06-12 IPA-CHILDES & G2P+: Feature-Rich Resources for Cross-Lingual Phonology and Phonemic Language Modeling IPA-CHILDES & G2P+: Feature-Rich-Ressourcen für Cross-Lingual Phonologie und Phonemic Language Modeling IPA-CHILDES & G2P+:跨语言歌曲和语音语言建模的地貌资源 2504.03036v3
  • 737 06-12 Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges Pragmatics in the Era of Large Language Models: Eine Umfrage zu Datensätzen, Evaluation, Chancen und Herausforderungen 《大语言模式时代中的实用模型:关于数据集、评价、机遇和挑战的调查》 2502.12378v3
  • 738 06-12 Encoding call-by-push-value in the pi-calculus Kodierung des Call-by-Push-Wertes im Pi-Calculus Pi-calcululus 中的编码调用 by-push- 值 2506.10584v1
  • 739 06-12 BabyLM’s First Words: Word Segmentation as a Phonological Probing Task BabyLMs erste Worte: Wortsegmentierung als phonologische Probing-Aufgabe BabyLM 的第一单词: 单词分割作为声学检测任务 2504.03338v3
  • 740 06-12 Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets Menschliche und LLM-Biasen in Hate Speech Annotationen: Eine sozio-demographische Analyse von Annotatoren und Zielen 仇恨言论说明中的人类和LLM比阿斯语:对说明者和目标的社会-人口分析 2410.07991v6
  • 741 06-12 Reinforcing Multimodal Understanding and Generation with Dual Self-rewards Verstärkung des multimodalen Verständnisses und der Erzeugung mit Dual Self-Rewards 加强多模式理解和多模式代代与双重奖赏 2506.07963v2
  • 742 06-12 Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models Obliviate: Effiziente Unvergesslichkeit für den Schutz geistigen Eigentums in großen Sprachmodellen 默认:在大语言模式中有效统一保护知识产权 2502.15010v2
  • 743 06-12 Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs Zuverlässiger Weg zur Vernunft: Destillieren effektiver Leitlinien für LLM-Reasoning mit Wissensgraphen 可靠理由说明:为学习图解的LLM 理由说明保留有效指导 2506.10508v1
  • 744 06-12 Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps Messung der Kette der Gedankentreue durch unlernende Vernunftschritte 通过 “ 不学习理性步骤 “ 衡量思考链的信念 2502.14829v2
  • 745 06-12 Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models Beyond Single-User Dialogue: Bewertung des Multi-User Dialogue State Tracking Fähigkeiten großer Sprachmodelle 超越单一用户对话:评估多用户对话国家跟踪大语言模式的能力 2506.10504v1
  • 746 06-12 Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics Achten Sie auf den Style Gap: Meta-Evaluation von Style und Attribut-Transfer-Metriken 思维风格差距:对样式和属性转移的元评价 2502.15022v3
  • 747 06-12 Towards Large Language Models with Self-Consistent Natural Language Explanations Auf dem Weg zu großen Sprachmodellen mit selbstkonsistenten natürlichen Spracherklärungen 努力建立具有自我联系自然语言解释的大型语言模式 2506.07523v2
  • 748 06-12 Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models Surface Fairness, Deep Bias: Eine vergleichende Studie über Bias in Sprachmodellen 地表公平、深比亚:语言模型比亚比较研究 2506.10491v1
  • 749 06-12 ClimateChat: Designing Data and Methods for Instruction Tuning LLMs to Answer Climate Change Queries ClimateChat: Daten und Methoden für die Anleitung zur Anpassung von LLMs an Klimawandelfragen entwerfen ClimateChat:设计用于教学的数据和方法,用于指导如何引导LMLM 以应对气候变化询问 2506.13796v1
  • 750 06-12 Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers Tabelle-Text Alignment: Erklärung der Antragsprüfung gegen Tabellen in wissenschaftlichen Arbeiten 表-文字对齐:对照科学文件表格解释索赔核实 2506.10486v1
  • 751 06-12 IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language IndoToxic2024: Ein demographischer Datensatz von Hass-Sprach- und Toxizitätstypen für indonesische Sprache Indotoxic2024:印度尼西亚语仇恨言论和毒性类型人口资料集 2406.19349v2
  • 752 06-12 VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models VScan: Rethinking Visual Token Reduction für effiziente große Vision-Sprache Modelle Vscan:重新思考如何降低视力,以建立高效的大型视觉语言模型 2505.22654v2
  • 753 06-12 Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts Auf dem Weg zur robusten multimodalen Emotionserkennung unter fehlenden Modalitäten und Verteilungsverschiebungen 争取在缺失模式和分销转移模式下强有力地承认多模式情感 2506.10452v1
  • 754 06-12 Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations Social Bias Benchmark for Generation: Ein Vergleich von Generation und QA-basierten Bewertungen 社会比重基准: 社会比重基准: 社会比比: 社会比比: 社会比比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比 2503.06987v2
  • 755 06-12 Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty Schnell auf dem Einfachen, Tief auf dem Harten: Effiziente Vernunft über Powered Length Penalty 快速快速执行 “ 容易 “ 、 “ 深沉:通过死刑有效解释理由 “ 2506.10446v1
  • 756 06-12 CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning CheMatAgent: Verbesserung von LLMs für Chemie und Materialwissenschaft durch baumsuchebasiertes Tool Learning CheMatAgent:通过植树搜索工具学习加强化学和材料科学LLMs 2506.07551v2
  • 757 06-12 ConvD: Attention Enhanced Dynamic Convolutional Embeddings for Knowledge Graph Completion ConvD: Aufmerksamkeitsverstärkte dynamische konvolutionäre Einbettungen für die Wissensgraphenvervollständigung ConvD: 关注增强动态动态嵌入,以完成知识图的完成 2312.07589v2
  • 758 06-12 PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs PAL: Probing Audio-Encoder über LLMs – Eine Studie über den Informationstransfer von Audio-Encodern zu LLMs PAL:通过LLMs探查音频成象器 – – 研究从音频成象器向LLMs传送信息 2506.10423v1
  • 759 06-12 Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting Jenseits des Schlachtfeldes: Framing Analyse der Medienabdeckung in der Konfliktberichterstattung 战场以外的战场:冲突报道中媒体报道的系统化分析 2506.10421v1
  • 760 06-12 Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences? Brennen Sie nach dem Lesen: Erfassen multimodale große Sprachmodelle wirklich die Reihenfolge der Ereignisse in Bildsequenzen? Burn after read: 多式大语言模型在图像序列中是否真的能捕捉事件秩序? 2506.10415v1
  • 761 06-12 Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series Zeit-IMM: Ein Datensatz und Benchmark für irreguläre multimodale Multivariate Zeitreihen 时间-IMM:非正常多式联运多变时间序列的数据集和基准 2506.10412v1
  • 762 06-12 PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier PAG: Multi-Turn verstärkt LLM Selbstkorrektion mit Politik als Generativer Prüfer PAG: 多发强化LLM自我校正,政策作为产生验证 2506.10406v1
  • 763 06-12 iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering iQUEST: Ein iteratives Frage-Framework für die Beantwortung von Fragen in der Wissensdatenbank i. 知识基础问题解答的动态问题指导框架 2506.01784v2
  • 764 06-12 AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving AgentThink: Ein einheitliches Framework für Tool-Augmented Chain-of-Thought Reasoning in Vision-Language-Modellen für autonomes Fahren Agent Think: 自主驾驶愿景-语言模型中工具推荐研究链理由统一框架 2505.15298v3
  • 765 06-12 On Many-Shot In-Context Learning for Long-Context Evaluation Auf viel-heißes In-Context-Lernen für die Lang-Kontext-Evaluierung 为长期内容评价进行许多热的内文学习 2411.07130v3
  • 766 06-12 TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning TableRAG: Ein Retrieval Augmented Generation Framework für heterogene Dokument-Reasoning 表RAG:异源文件说明理由的回收增加代际生成框架 2506.10380v1
  • 767 06-12 Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning Hierarchische Latentenfähigkeiten von Sprachmodellen über das kausale Repräsentationslernen entdecken 通过因果代表制学习发现语言模式的分级本端能力 2506.10378v1
  • 768 06-12 A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce Ein minimalistischer Ansatz zur LLM-Vernunft: von der Abstoßung zur Verstärkung 从拒绝抽样到强化 2504.11343v2
  • 769 06-12 CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models CAF-I: Ein kollaboratives Multi-Agent-Framework für eine verbesserte Ironieerkennung mit großen Sprachmodellen CAF-I:采用大语言模式加强铁铁探测多机构合作多方协作框架 2506.08430v2
  • 770 06-12 Improving Fairness of Large Language Models in Multi-document Summarization Verbesserung der Fairness von großen Sprachmodellen in Multi-Dokument-Zusammenfassung 提高多文件总结中大语言模式的公平性 2506.07479v2
  • 771 06-12 SCORE: Story Coherence and Retrieval Enhancement for AI Narratives SCORE: Story-Kohärenz und Retrieval-Verbesserung für KI-Erzählungen SCORE: “ 独立叙述 “ 的 “ 一致性 “ 和 “ 检索 “ 增强 “ 增强 “ 统一 “ 和 “ 检索 “ 增强 “ 增强 “ 独立叙述 “ 2503.23512v4
  • 772 06-12 An Analysis of Datasets, Metrics and Models in Keyphrase Generation Eine Analyse von Datensätzen, Metrics und Modellen in der Keyphrase-Generierung 对关键词生成中的数据集、计量和模型的分析 2506.10346v1
  • 773 06-12 Code Execution as Grounded Supervision for LLM Reasoning Code-Execution als geerdete Überwachung für LLM-Reasoning 法规执行作为LLM理由的有限制的监督 2506.10343v1
  • 774 06-12 Provably Learning from Language Feedback Wahrscheinlich von Sprachfeedback lernen 从语言反馈中学习 2506.10341v1
  • 775 06-12 Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs Amulett: Neuausrichtung während der Testzeit für Personalisierte Präferenzanpassung von LLMs 缩略图:在试验期间重新对准,以适应LLMM的个性化偏好 2502.19148v3
  • 776 06-12 Benchmarking LLMs for Environmental Review and Permitting Benchmarking LLMs für Umweltprüfung und Genehmigung 环境审查和许可基准确定LLMs 2407.07321v3
  • 777 06-12 CHANCERY: Evaluating Corporate Governance Reasoning Capabilities in Language Models CHANCERY: Bewertung der Corporate Governance-Reasoning-Fähigkeiten in Sprachmodellen C. 机会:评价语言模式中的公司治理能力 2506.04636v2
  • 778 06-12 Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs Gepaarte Fertigstellung: Flexible Quantifizierung von Emissions-Framing auf Scale mit LLMs 提前完成:与LLMs一道灵活量化规模问题配置 2408.09742v2
  • 779 06-12 Detecting Sockpuppetry on Wikipedia Using Meta-Learning Sockepuppetry auf Wikipedia erkennen Mit Meta-Learning 在维基百科上用元学习探测袜子布料 2506.10314v1
  • 780 06-12 Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling Effiziente Längenverallgemeinerbare Aufmerksamkeit über Causal Retrieval für die Lang-Kontext-Sprachenmodellierung 长文本语言建模通过 “ 目的检索 “ 吸引长文本语言建模 2410.01651v4
  • 781 06-12 AC/DC: LLM-based Audio Comprehension via Dialogue Continuation AC/DC: LLM-basiertes Audio-Verständnis über Dialog-Fortsetzung AC/DC:基于LLM的通过对话继续了解音频 2506.10312v1
  • 782 06-12 BeamLoRA: Beam-Constraint Low-Rank Adaptation BeamLoRA: Beam-Constraint Low-Rank Anpassung BeamLORA: 束-节制低射线适应 2502.13604v2
  • 783 06-12 Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs Geplante interleaved Speech-Text-Schulung für Sprach-zu-Sprach-Übersetzung mit LLMs 配有LLMM的语音对语音翻译教学 2506.10299v1
  • 784 06-12 “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context “Check My Work?”: Sykopanzmessung in einem simulierten Bildungskontext “检查我的工作?” “测量模拟教育环境中的相对性” 2506.10297v1
  • 785 06-12 Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages Flick: Wenige Labels Textklassifizierung mit K-Aware Intermediate Learning in Multi-Task Low-Resource Sprachen Flick:使用K-Aware中级学习多种低资源语言的多种语言的标签文字分类, 2506.10292v1
  • 786 06-12 Context Is Not Comprehension Kontext ist nicht verständlich 背景不令人理解 2506.04907v4
  • 787 06-12 Prompt-based Depth Pruning of Large Language Models Prompt-basierte Tiefenkorrektur von großen Sprachmodellen 大语言模式的即时深度定位 2502.04348v3
  • 788 06-12 ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs ClusterUCB: Effiziente Gradient-basierte Datenauswahl für gezielte Feinsteuerung von LLMs COCUCB: 高效率的逐步数据选择,以便有针对性地微调LLMM 2506.10288v1
  • 789 06-12 Play to Generalize: Learning to Reason Through Game Play Spielen Sie Generalize: Lernen, Vernunft durch Spiel zu lernen 玩一般游戏: 通过玩游戏学习理性 2506.08011v2
  • 790 06-12 Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models Haben Sprachmodelle Bayesische Gehirne? Beeindruckende stochastische und deterministische Entscheidungsmuster innerhalb großer Sprachmodelle 语言模式是否具有贝耶斯人脑? 区分大语言模式中的斯托卡和决定性决定模式 2506.10268v1
  • 791 06-12 Research Borderlands: Analysing Writing Across Research Cultures Forschungsgrenzen: Analysieren des Schreibens über Forschungskulturen hinweg 研究边界地区:分析跨研究文化的写作 2506.00784v2
  • 792 06-12 M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction M-MRE: Ausdehnung des Effekts der gegenseitigen Verstärkung auf multimodale Informationsextraktion M-MRE:将相互强化效应扩大到多式联运信息提取 2504.17353v2

Article 0

Title@2025-06-18 (3): PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning

Title: PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning PhantomHunter: Unsichtbarer, privat gestalteter LLM-generierter Text durch familienbewusstes Lernen erkennen PhhantomHunter: 通过家庭知识学习检测隐隐隐私人引导的LLM-发光文本 2506.15683v1

Authors (7): Yuhui Shi, Yehan Yang, Qiang Sheng, Hao Mi, Beizhe Hu, Chaoxi Xu, Juan Cao

With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance. Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored. Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performance drop of existing detectors in practice. To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics. Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%.

随着大型语言模式(LLMs)的普及,错误信息制作和学术行为失检等不良社会问题更加严重,使LLM产生的文本探测现在具有前所未有的重要性。虽然现有方法取得了显著进展,但私人调频LMs的文本带来的新挑战仍未得到充分探讨。用户可以通过微调开放源码与私人公司进行微调,很容易拥有私人的LMS,从而导致现有探测器在实践中的显著性能下降。为了解决这个问题,我们提议PhantomHunter,由LM公司产生的文本探测器专门用来探测来自秘密、私人调频LMs的文本。其家庭认知学习框架捕捉了基础模型及其衍生物之间共享的家庭层面特征,而不是将个人特征混为一谈。在LLMMA、Gemma和Mistral家庭的数据实验显示其优势超过7个基线和3个工业服务,F1分数超过96%。


Article 1

Title@2025-06-18 (3): GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

Title: GenRecal: Generation after Recalibration from Large to Small Vision-Language Models GenRecal: Generation nach Rekalibrierung von großen bis kleinen Vision-Sprachenmodellen GenRecal: 在从大到小的视觉语言模型重新校准后生成的模型 2506.15681v1

Authors (5): Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu

Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types-differing in vocabulary size, token splits, and token index ordering. To address this challenge of limitation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.

视觉语言模型(VLMS)最近的进展利用了大型语言模型(LLMS),以实现与GPT-4V等封闭源码系统同等的绩效。然而,在现实世界情景中部署这些模型,特别是在资源受限制的装置上,由于其巨大的计算需求,仍然具有挑战性。这激发了将大型VLMs的知识提炼成更小、更有效率的对等单位的兴趣。这里的一个关键挑战来自VLM结构的多样性,这些结构建在不同的LMS上,在词汇大小、符号分割和符号索引顺序上采用不同的象征性类型差异。为了应对对特定VLM型号的限制挑战,我们在重新计算(GenRecal)后提出了新一代(GenRecal),这是VLMs的新颖、通用的蒸馏框架。GenRecal包含一个校准器,能够调整和调整不同类型VLMs之间的特征代表,从而能够在不同类型VLMs之间进行有效的知识转让。通过对多重挑战性基准的广泛实验,我们证明GenRecal显著改进了基线性性性性性性性,最终优于大型开放和封闭式的开放和封闭式VLMSMS。


Article 2

Title@2025-06-18 (3): Dense SAE Latents Are Features, Not Bugs

Title: Dense SAE Latents Are Features, Not Bugs Dense SAE Latenten sind Features, keine Bugs Hense SAE 终端是特征,不是虫虫 2506.15679v1

Authors (7): Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark

Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are \emph{dense}), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs – suggesting that high density features are an intrinsic property of the residual space. We then introduce a taxonomy of dense latents, identifying classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction. Finally, we analyze how these features evolve across layers, revealing a shift from structural features in early layers, to semantic features in mid layers, and finally to output-oriented signals in the last layers of the model. Our findings indicate that dense latents serve functional roles in language model computation and should not be dismissed as training noise.

简单自动编码器( SAE) 旨在通过执行宽度限制,从语言模型中提取可解释的特征。 理想的情况是, 培训一个 SAE 最理想的是, 培训一个 SAE 将产生稀疏和隐性都有意义的潜质。 然而, 许多 SAE 潜质会经常激活( emph{ dense} ) , 让人担心它们可能是培训程序的不受欢迎的文物。 在这项工作中, 我们系统地调查密集潜质的几何、 功能和来源, 并显示它们不仅持久, 而且经常反映有意义的模型表现。 我们首先显示, 稠密潜质的潜质往往形成反波质配方, 重建剩余流中的具体方向, 以及消融其子空间会抑制再培训中新的浓度特性的出现 – 表明高密度隐性特征是剩余空间的固有属性。 然后我们开始对浓密潜质层进行分类, 确定与位置跟踪、 环境约束、 模型调节、 字母输出信号、 部分 和主要组成部分重建 。 最后, 我们分析这些特性是如何在层中, 显示这些特性的模型和 结构结构 显示, 在最后层次中, 显示, 以 方向 结构 结构 显示, 结构 结构 显示, 显示, 方向 最终 结构 结构 结构 结构 的 显示, 方向 方向 方向 的 显示, 方向 的 方向 方向 的 方向 的 的 的 的 显示 方向 结构 结构 。


Article 3

Title@2025-06-18 (3): Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

Title: Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence Verkörperte Web-Agenten: Überbrückung physikalisch-digitaler Realms für integrierte Agent-Intelligenz 嵌入式网络代理:为综合特工情报连接物理数字王国 2506.15677v1

Authors (10): Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, Kai-Wei Chang

AI agents today are mostly siloed - they either retrieve and reason over vast amount of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, codes and websites are publicly available at our project page https://embodied-web-agent.github.io/.

今天,AI代理机构大多是空置的 – – 它们要么对在网上获得的大量数字信息和知识进行检索和解释;要么通过体现的认知、规划和行动与物理世界互动,但两者都很少。这种分离限制了他们解决需要综合物理和数字情报的任务的能力,例如用在线食谱烹饪,用动态地图数据浏览,或者利用网络知识解释真实世界的里程碑。我们引入Embodied网络代理机构,这是AI代理机构的一种新颖范例,可以流传地连接成形和网络规模推理。为了落实这一概念,我们首先开发了Embudied网络代理机构的任务环境,一个将现实的3D室内和室外环境与功能的网络界面紧密结合的统一模拟平台。在这个平台上,我们建造和发布Embodied网络代理机构基准,它包含各种各样的任务,包括烹饪、导航、购物、旅游和地理定位等,所有这些任务都需要跨物理和数字领域的协调推理,以便系统地评估跨多域情报。实验结果揭示了国家-艺术AI系统和人类能力之间的重大业绩差距,一个统一的模拟平台,既能连接,又能将挑战与机会紧密地结合的网络网站/网络网站。


Article 4

Title@2025-06-18 (3): Gender-Neutral Machine Translation Strategies in Practice

Title: Gender-Neutral Machine Translation Strategies in Practice Gender-Neutrale maschinelle Übersetzungsstrategien in der Praxis 实践中的性别-新性别机器翻译战略 2506.15676v1

Authors (3): Hillary Dawkins, Isar Nejadgholi, Chi-kiu Lo

Gender-inclusive machine translation (MT) should preserve gender ambiguity in the source to avoid misgendering and representational harms. While gender ambiguity often occurs naturally in notional gender languages such as English, maintaining that gender neutrality in grammatical gender languages is a challenge. Here we assess the sensitivity of 21 MT systems to the need for gender neutrality in response to gender ambiguity in three translation directions of varying difficulty. The specific gender-neutral strategies that are observed in practice are categorized and discussed. Additionally, we examine the effect of binary gender stereotypes on the use of gender-neutral translation. In general, we report a disappointing absence of gender-neutral translations in response to gender ambiguity. However, we observe a small handful of MT systems that switch to gender neutral translation using specific strategies, depending on the target language.

性别包容的机器翻译(MT)在资料来源中应保持性别模糊性,以避免错误的性别观念和代表性的伤害。虽然性别模糊性在英语等名义上的性别语言中经常自然发生,但认为语法性别语言中的性别中立性是一项挑战。我们在这里评估21个MT系统对性别中立性的必要性的敏感性,以应对不同困难的三个翻译方向中的性别模糊性。对实践中观察到的具体性别中立性战略进行了分类和讨论。此外,我们审查了二进制性别陈规定型观念对使用性别中立翻译的影响。一般来说,我们报告说,由于性别模糊性,没有性别中立翻译的情况令人失望。然而,我们观察到少数MT系统根据目标语言使用特定战略转向性别中立翻译。


Article 5

Title@2025-06-18 (3): Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers

Title: Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers Leaky Thoughts: Große Denkmodelle sind keine privaten Denker 利基思想:大理由模型不是私人思想家 2506.15674v1

Authors (5): Tommaso Green, Martin Gubri, Haritz Puerto, Sangdoo Yun, Seong Joon Oh

We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. We argue that safety efforts must extend to the model’s internal thinking, not just its outputs.

我们研究了作为个人代理人使用的大型推理模型的推理痕迹中的隐私渗漏问题。与最终产出不同,推理痕迹往往被认为是内部和安全的。我们质疑这一假设,表明推理痕迹往往包含敏感的用户数据,可以通过迅速注入或意外渗漏到产出中。我们通过检验和代理性评估,证明测试时间计算方法,特别是增加推理步骤,扩大了这种渗漏。增加测试时间计算方法的预算,使模型在最后答案中更加谨慎,但也导致这些模型在自己的思维中更加生动和泄漏。这暴露出核心紧张:推理提高了效用,扩大了隐私攻击表面。我们主张安全努力必须扩大到模型的内部思维,而不仅仅是其输出。


Article 6

Title@2025-06-18 (3): CC-LEARN: Cohort-based Consistency Learning

Title: CC-LEARN: Cohort-based Consistency Learning CC-LEARN: Kohortenbasiertes Konsistenzlernen CC-LEARN: 以联合为基地的一致学习 2506.15662v1

Authors (9): Xiao Ye, Shaswat Shrivastava, Zhaonan Li, Jacob Dineen, Shijie Lu, Avneet Ahuja, Ming Shen, Zhikun Xu, Ben Zhou

Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.

大型语言模式在很多任务中都非常出色,但仍然以一致、有力的推理为主。我们引入了基于Cohort的Consistance Learning(CC-Learn),这是一个强化学习框架,通过对来自共享方案抽象学的类似问题的组群进行培训,提高LLM推理的可靠性。为了加强组群的一致性,我们定义了一个综合目标,将组群的准确性、有效问题分解的检索奖金以及强化学习可直接优化的微小或无效的外观的拒绝处罚结合起来,这不同于监督的微调。优化这一奖励指导模式指导所有组群成员采用统一的推理模式。关于具有挑战性的推理基准(包括ARC-Challenge和战略QA)的实验表明,C-Learn提高了精度和推理稳定性,高于预先培训和SFT基线。这些结果表明,组级的RL有效提高了L的推理一致性。


Article 7

Title@2025-06-18 (3): AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning

Title: AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning AutoRule: Hervorhebung einer Kette von Gedanken Extrahierte regelbasierte Belohnungen Verbesserung des Preference-Lernens 自动管理:理性思维链抽取有章可循的奖励 改善优先学习 2506.15651v1

Authors (2): Tevin Wang, Chenyan Xiong

Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chain of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6\% relative improvement in length-controlled win rate on AlpacaEval2.0, and a 6.1\% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules exhibit good agreement with dataset preference. We find that AutoRule demonstrates reduced reward hacking compared to a learned reward model when run over two episodes. Finally, our case study suggests that the extracted rules capture unique qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at https://github.com/cxcscmu/AutoRule.

以规则为基础的奖赏是改进从人类反馈(RLHF)中强化学习的有希望的战略,但目前的方法往往依靠人工规则工程。我们提出“自动规则”,这是从优惠反馈中提取规则并将其纳入基于规则的奖赏的一种完全自动化的方法。“自动规则”分三个阶段运作:它利用推理模型来解释用户的偏好,从这些解释的推理链中找出候选规则,并将这些规则综合成一套统一的规则。利用最后确定的规则集,我们使用语言模型核查员来计算每项产出所遵守的规则的一小部分,在政策优化期间,将这一指标作为学习的奖赏模式的辅助性奖赏。我们发现,“Llama-3-8B”模式与“Auto规则”相比,以28.6%的相对提高AlpacaEval2.0在长期控制的得分率方面,用“Alama-3-8B模式”以“Auto 规则”模型进行培训。“Auto Arrority”模型与“Tral-rass real coal ”模型相比,我们从“Oral rodustrational rodual rodual coal coquistration 提供了两个不同的奖赏,最后的奖赏研究显示,在获取到“Cral 。在获得了一种奖赏/ralbalbalbalbalbalbalbismalbismalmalmalmal coaltalsmmmmmmess。


Article 8

Title@2025-06-18 (3): Oldies but Goldies: The Potential of Character N-grams for Romanian Texts

Title: Oldies but Goldies: The Potential of Character N-grams for Romanian Texts Oldies but Goldies: Das Potential des Charakters N-Gramms für rumänische Texte 旧的但金的:罗马尼亚文本的字符N克潜力 2506.15650v1

Authors (2): Dana Lupsa, Sanda-Maria Avram

This study addresses the problem of authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN), employing character n-gram features for classification. Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource, constrained or under-studied language settings.

本研究涉及罗马尼亚文文本的作者归属问题,这是该领域的标准基准之一,我们系统地评估了六种机器学习技术:支持矢量机(SVM)、物流回归(LR)、K-Nest邻居(k-NN)、决策树(DT)、随机森林(Random Forest)和人工神经网络(ANN),使用字符n-gram特性进行分类,其中ANN模式取得了最高绩效,包括15种类型中的4种在使用5克特征时的完美分类。这些结果表明,轻量、可解释字符n-gram方法能够为罗马尼亚作者归属提供最先进的准确性,与更为复杂的方法相匹配。我们的调查结果突出了资源、受限或未得到充分研究的语言环境中简单字义特征的潜力。


Article 9

Title@2025-06-18 (3): Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation

Title: Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation Aug2Search: Verbesserung der Facebook-Marktplatzsuche mit LLM-generierter Synthetischer Datenvergrößerung Oug2Search:利用LLM光化合成数据增强功能,加强Facebook市场搜索 2505.16065v2

Authors (7): Ruijie Xi, He Ba, Hao Yuan, Rishu Agrawal, Yuxin Tian, Ruoyan Long, Arul Prakash

Embedding-Based Retrieval (EBR) is an important technique in modern search engines, enabling semantic match between search queries and relevant results. However, search logging data on platforms like Facebook Marketplace lacks the diversity and details needed for effective EBR model training, limiting the models’ ability to capture nuanced search patterns. To address this challenge, we propose Aug2Search, an EBR-based framework leveraging synthetic data generated by Generative AI (GenAI) models, in a multimodal and multitask approach to optimize query-product relevance. This paper investigates the capabilities of GenAI, particularly Large Language Models (LLMs), in generating high-quality synthetic data, and analyzing its impact on enhancing EBR models. We conducted experiments using eight Llama models and 100 million data points from Facebook Marketplace logs. Our synthetic data generation follows three strategies: (1) generate queries, (2) enhance product listings, and (3) generate queries from enhanced listings. We train EBR models on three different datasets: sampled engagement data or original data ((e.g., “Click” and “Listing Interactions”)), synthetic data, and a mixture of both engagement and synthetic data to assess their performance across various training sets. Our findings underscore the robustness of Llama models in producing synthetic queries and listings with high coherence, relevance, and diversity, while maintaining low levels of hallucination. Aug2Search achieves an improvement of up to 4% in ROC_AUC with 100 million synthetic data samples, demonstrating the effectiveness of our approach. Moreover, our experiments reveal that with the same volume of training data, models trained exclusively on synthetic data often outperform those trained on original data only or a mixture of original and synthetic data.

嵌入式检索( EBR) 是现代搜索引擎的重要技术, 使得搜索查询和相关结果之间能够进行语义匹配。 然而, Facebook 市场网 等平台上搜索合成记录数据缺乏有效的 EBR 模型培训所需的多样性和细节, 限制了模型捕捉细微搜索模式的能力。 为了应对这一挑战, 我们提议 Aug2Search, 以EBR 为基础的框架, 利用General AI (GenAI) 模型生成的合成数据, 以多种模式和多任务方式优化原始的查询产品相关性。 本文调查GenAI, 特别是大语言模型( LLMS) 生成高质量合成数据的能力, 分析其对增强 EBR 模型的影响。 我们使用8个Llama 模型和来自 Facebook 市场日志日志的1亿个数据点进行了实验。 我们的合成数据生成遵循三种战略:(1) 生成查询, (2) 强化产品列表, 以及(3) 生成相同的查询。 我们用三种不同的数据模式对 EBRBR模式进行原始的样本化和原始数据( 例如, “ Clik” 和“LILILAD Adal dest ex train readdate dest lading the the the the dal deal dal dealddate lading the the the the the the sal dal dal dal dal daldaldaldaldaldaldaldaldal dal daldaldddaldddaldaldddddaldddddddddaldddddddddddddddddddddddalddddddddddddddddaldddddalddddddddddddddddddddddaldddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddds), , ladalddd


Article 10

Title@2025-06-18 (3): Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability

Title: Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability Überarbeitung der kompositorischen Verallgemeinerung Fähigkeit von großen Sprachmodellen unter Berücksichtigung von Instruktionen nach Fähigkeit 重新审视大型语文模式的构成通用能力,考虑按能力进行教学 2506.15629v1

Authors (3): Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.

在诸如CommonGen(Copen Gen)等典型常识推理任务中,基因化大型语言模型(LLMs)构成包含所有特定概念的句子;然而,当侧重于教学执行能力时,如果迅速指定概念顺序,LLMs必须生成符合特定顺序的句子;为解决这一问题,我们提议了一个旨在评价LOMs(Composition Communicational Gen)的组成概括和教学执行能力的基准;这一基准措施要求覆盖范围评估概念是否按照特定顺序产生,从而能够同时评价两种能力;我们利用36 LMs(LMs)进行了全面分析,发现尽管LMs(LMs)一般理解指示的意图,但对特定概念顺序模式的偏向往往导致低多样性产出或相同结果,即使在概念顺序改变时也是如此。此外,即使最符合指令的LM(LMM)只达到约75%的订单覆盖范围,也强调需要改进教学执行和组合概括能力。


Article 11

Title@2025-06-18 (3): J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization

Title: J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization J4R: Lernen, mit gleichwertiger anfänglicher Staatsgruppe zu urteilen Relative Politikoptimierung J4R:向法官学习与等同的初次国家集团相对政策优化 2505.13346v3

Authors (5): Austin Xu, Yilun Zhou, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty

To keep pace with the increasing pace of large language models (LLM) development, model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in evaluating relatively simple domains, like chat quality, but struggle in reasoning intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.

为了跟上大型语言模式(LLM)发展步伐的步伐,示范产出评价已从耗费时间的人力评价向自动评价转变,LLM-法官模型本身负责评估和炫耀其他模型产出。LLM-法官模型是一批基因化评价员,在评估相对简单的领域方面表现优异,如聊天质量,但努力推理密集的领域,模型答复包含更实质性和更具挑战性的内容。为了弥补现有的法官缺陷,我们探索如何用强化学习来培训法官。我们作出了三项关键贡献:(1) 我们提议了等值初步国家集团相对政策优化(EIS-GROPO)算法,这使我们可以培训我们的法官,使其有能力在更复杂的评价环境中定位偏见。 (2) 我们引入了“理性法官”基准,用以评价过去工作没有涵盖的各种推理环境中的法官。 (3) 我们培训了“理性法官”(J4R),一名受过EIS-GREPO培训的7B法官,这名法官优于GPT-4和下一位最小型法官,有6.0%和9%,比大法官和首席法官BEGRO-GRO-GRA 法官的表现相配比。


Article 12

Title@2025-06-18 (3): A Guide to Misinformation Detection Data and Evaluation

Title: A Guide to Misinformation Detection Data and Evaluation Ein Leitfaden für Fehlinformationserkennungsdaten und -bewertung 《错误信息检测数据和评价指南》 2411.05060v4

Authors (10): Camille Thibault, Jacob-Junqi Tian, Gabrielle Peloquin-Skulski, Taylor Lynn Curtis, James Zhou, Florence Laflamme, Yuxiang Guan, Reihaneh Rabbany, Jean-François Godbout, Kellin Pelrine

Misinformation is a complex societal issue, and mitigating solutions are difficult to create due to data deficiencies. To address this, we have curated the largest collection of (mis)information datasets in the literature, totaling 75. From these, we evaluated the quality of 36 datasets that consist of statements or claims, as well as the 9 datasets that consist of data in purely paragraph form. We assess these datasets to identify those with solid foundations for empirical work and those with flaws that could result in misleading and non-generalizable results, such as spurious correlations, or examples that are ambiguous or otherwise impossible to assess for veracity. We find the latter issue is particularly severe and affects most datasets in the literature. We further provide state-of-the-art baselines on all these datasets, but show that regardless of label quality, categorical labels may no longer give an accurate evaluation of detection model performance. Finally, we propose and highlight Evaluation Quality Assurance (EQA) as a tool to guide the field toward systemic solutions rather than inadvertently propagating issues in evaluation. Overall, this guide aims to provide a roadmap for higher quality data and better grounded evaluations, ultimately improving research in misinformation detection. All datasets and other artifacts are available at misinfo-datasets.complexdatalab.com.

为了解决这个问题,我们整理了文献中最大的(错误)信息数据集,总计75个,我们从中评估了由声明或主张组成的36个数据集的质量,以及由纯段落形式的数据组成的9个数据集的质量。我们对这些数据集进行了评估,以查明那些具有丰富经验工作坚实基础的数据集,以及那些有可能导致误导和不可概括结果的缺陷的数据集,例如虚假的关联,或模糊或无法评估真实性的例子。我们发现后一问题特别严重,影响到文献中的大多数数据集。我们还就所有这些数据集提供了最先进的基线,但表明无论标签质量如何,绝对标签都无法再准确评估检测模型的性能。最后,我们提出和强调评价质量保证(EQA)作为指导领域实现系统解决方案的工具,而不是无意地传播评估中的问题。总体而言,本指南旨在为更高质量的数据检测提供路线图,最终改进了所有现有数据格式。


Article 13

Title@2025-06-18 (3): Minding the Politeness Gap in Cross-cultural Communication

Title: Minding the Politeness Gap in Cross-cultural Communication Den Politismus in der interkulturellen Kommunikation berücksichtigen 在跨文化交流中注意因应能力差距 2506.15623v1

Authors (5): Yuka Machino, Matthias Hofer, Max Siegel, Joshua B. Tenenbaum, Robert D. Hawkins

Misunderstandings in cross-cultural communication often arise from subtle differences in interpretation, but it is unclear whether these differences arise from the literal meanings assigned to words or from more general pragmatic factors such as norms around politeness and brevity. In this paper, we report three experiments examining how speakers of British and American English interpret intensifiers like “quite” and “very.” To better understand these cross-cultural differences, we developed a computational cognitive model where listeners recursively reason about speakers who balance informativity, politeness, and utterance cost. Our model comparisons suggested that cross-cultural differences in intensifier interpretation stem from a combination of (1) different literal meanings, (2) different weights on utterance cost. These findings challenge accounts based purely on semantic variation or politeness norms, demonstrating that cross-cultural differences in interpretation emerge from an intricate interplay between the two.

跨文化交流的误解往往产生于解释上的微妙差异,但不清楚这些差异是来自文字的字面含义,还是来自更一般的实用因素,如关于礼貌和简洁的规范。在本文中,我们报告三项实验,研究英国和美国英语语言者如何解释“等量”和“非常”等强化词。为了更好地了解这些跨文化差异,我们开发了一个计算认知模型,让听众回溯到关于演讲者如何平衡信息性、礼貌和言语成本的理由。我们的模型比较表明,强化解释的跨文化差异来自:(1) 不同字面含义,(2) 对言语成本的不同权重。这些发现对纯粹基于语义差异或礼貌规范的描述提出了挑战,表明解释的跨文化差异来自两者之间的复杂互动。


Article 14

Title@2025-06-18 (3): The Compositional Architecture of Regret in Large Language Models

Title: The Compositional Architecture of Regret in Large Language Models Die kompositorische Architektur des Bedauerns in großen Sprachmodellen 大语言模式 “ 遗憾 “ 的构成结构 2506.15617v1

Authors (6): Xiangxiang Cui, Shu Yang, Tianjin Huang, Wanyu Lin, Lijie Hu, Di Wang

Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then analyze their internal representation. This analysis requires examining the model’s hidden states, where information processing occurs at the neuron level. However, this faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics to find the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. Addressing these limitations, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) metric to identify optimal regret representation layers, and (3) the Regret Dominance Score (RDS) metric to identify regret neurons and the Group Impact Coefficient (GIC) to analyze activation patterns. Our experimental results successfully identified the optimal regret representation layer using the S-CDI metric, which significantly enhanced performance in probe classification experiments. Additionally, we discovered an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. Through the RDS metric, we categorized neurons into three distinct functional groups: regret neurons, non-regret neurons, and dual neurons.

大语言模型中的遗憾是指在提交与先前产生的错误信息相矛盾的证据时表现出的明显遗憾表达。研究遗憾机制对于提高模型可靠性至关重要,有助于揭示神经网络中的认知编码。为了理解这一机制,我们需要首先在模型产出中识别遗憾表达,然后分析内部代表性。这一分析需要检查模型的隐藏状态,信息处理发生在神经层面。然而,这面临着三大挑战:(1) 缺乏专门的数据集,能够捕捉遗憾表达;(2) 缺乏找到最优遗憾代表层的计量标准,以及(3) 缺乏用于识别和分析遗憾神经神经元的非指标。解决这些限制,我们建议:(1) 通过战略性设计的提示性假设,建立一个全面的遗憾数据集的工作流程;(2) 超级折叠式折式指数(S-CDI)衡量标准,以找出最优的遗憾代表层;(3) Regret Dominance评分(RDS)衡量标准,以辨别遗憾神经元和集团影响系数(GIC),以分析激活模式。我们实验结果成功地确定了通过战略设计的双轨模式,在S-C IMDA中进行最佳的递化分析,在S-S- IMDS-I 级中,在我们所研判分中,在S-ID-I-I-ID-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-


Article 15

Title@2025-06-18 (3): Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

Title: Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning Router-R1: Lehren von LLMs Multi-Round Routing und Aggregation durch Verstärkungslernen 路由-R1路由-R1路由:教学LLMS 2506.09033v2

Authors (3): Haozhen Zhang, Tao Feng, Jiaxuan You

The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (\textit{i.e.}, assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present \textbf{Router-R1}, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave “think” actions (internal deliberation) with “route” actions (dynamic model invocation), and integrates each response into its evolving context. To facilitate learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for optimizing the balance between performance and cost, opening a pathway toward enhancing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management.

各种大型语言模型(LLMs)的迅速出现刺激了LLM路由器的发展,将用户的查询指派给最合适的模式。然而,现有的LLM路由器通常进行单轮、一对一的绘图(\ textit{i.e.}),将每个查询指派给一个孤立的单一模式,从而限制他们处理复杂任务的能力,而需要多个LLM的互补优势。在本文件中,我们提出了基于LLM路由的强化学习(RL)框架,将多LLM路由和汇总作为顺序决定程序。RUrdr-R1路由路由器本身即是一个有能力的LMM,利用其推理能力将“思考”行动(内部评断)与“路由式”行动(动态模式)互插,并将每项反应纳入不断演变的背景。为了促进学习,我们采用了一种轻量的基于规则的奖励,包括格式奖励、最终结果奖励,以及一种基于优化业绩和成本平衡的新的成本奖励,通过Rrwayr-r-r-rmemememememememememimal选择若干强有力的成本基准,而加强业绩成本交易交易交易,而只是通过简单化一个加强成本交易-tracal-tralislismal-eximal-imal-em-eximal-emal-commal-commal-commalimalimalimal-commal-imal-emal-emal-emal-emal-emalmaxxxxxxxxxxx


Article 16

Title@2025-06-18 (3): LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

Title: LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning LoX: Low-Rank-Extrapolation stärkt LLM-Sicherheit gegen Feinabstimmung LoX:低Rank外推法强力推力LLM 安全防止微调 2506.15606v1

Authors (6): Gabrel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong

Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning - even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model’s adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) facing benign or malicious fine-tuning attacks. By investigating the ASR landscape of parameters, we attribute the success of LoX to that the extrapolation moves LLM parameters to a flatter zone, thereby less sensitive to perturbations. The code is available at github.com/VITA-Group/LoX.

大型语言模型(LLMS)在现实世界应用中变得不可或缺。然而,广泛采用这些模型引起了重大的安全问题,特别是在应对对社会有害的问题时。尽管做出了大量努力,通过调整来改善模型安全,但统一模型仍然可以受到随后微调的破坏,即使额外的培训数据看起来是无害的。在本文件中,我们从经验上证明,这种脆弱性源于LLM参数中安全临界低级别子空间对微调的敏感度。根据这一认识,我们建议采用一种新的无培训方法,称为Low-Rank外推法(LOX),通过对一个匹配的LMM的安全子空间进行外推法,加强安全稳健性。我们的实验结果证实LOX的有效性,在防止良性攻击和恶意微调攻击的同时,在保持模型适应新任务方面都取得了显著的稳健性改进。例如,LOX使面临良性或恶意微调的攻击成功率的绝对下降11%至54%。我们通过调查ASR参数的景观,将LX的成功归因于的成功归因于LMM/LAVX的参数转移到一个不敏感程度。


Article 17

Title@2025-06-18 (3): From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns

Title: From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns Vom Modell zum Klassenzimmer: Bewertung Generierter MCQs für Portugiesen mit Erzähl- und Schwierigkeitsproblemen 从模型到教室:评估有叙述和困难关切的葡萄牙语生成的中、中、低、中、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、高、低、低、低、低、高、高、高、高、高、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、高、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低、低 2506.15598v1

Authors (7): Bernardo Leite, Henrique Lopes Cardoso, Pedro Pinto, Abel Ferreira, Luís Abreu, Isabel Rangel, Sandra Monteiro

While MCQs are valuable for learning and evaluation, manually creating them with varying difficulty levels and targeted reading skills remains a time-consuming and costly task. Recent advances in generative AI provide an opportunity to automate MCQ generation efficiently. However, assessing the actual quality and reliability of generated MCQs has received limited attention – particularly regarding cases where generation fails. This aspect becomes particularly important when the generated MCQs are meant to be applied in real-world settings. Additionally, most MCQ generation studies focus on English, leaving other languages underexplored. This paper investigates the capabilities of current generative models in producing MCQs for reading comprehension in Portuguese, a morphologically rich language. Our study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. We evaluate these MCQs through expert review and by analyzing the psychometric properties extracted from student responses to assess their suitability for elementary school students. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability. Also, challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design.

虽然MCQ对于学习和评价来说是有价值的,但手工创造MCQ对于学习和评价是有价值的,具有不同难度和有针对性阅读技能,这仍然是一项耗时和昂贵的任务。在基因性AI方面最近的进展为将MCQ的一代人实现自动化提供了机会。然而,对所创造的MCQ的实际质量和可靠性的评估受到的关注有限 – – 尤其是对于一代人失败的情况。当生成的MCQ本应应用于现实世界环境中时,这一方面就变得特别重要。此外,大多数MCQ的产生研究侧重于英语,而其他语言则未得到充分探讨。本文调查了目前制作MCQ以葡萄牙语(一种形态上丰富的语言)读懂的MCQ的基因化模型的能力。我们的研究侧重于生成与课程相关叙述要素一致并跨越不同难度水平的MCQ。我们通过专家审查以及分析从学生反应中提取的心理测量特性以评估其对小学生的适合性。我们的研究结果表明,目前的MCQ可以产生与人造语言相近的质量。然而,我们发现与语言清晰性和可解释性有关的问题。此外,我们的研究重点是产生与课程性清晰性和可解释性相关的问题。


Article 18

Title@2025-06-18 (3): WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts

Title: WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts WikiMixQA: Ein multimodaler Benchmark für die Fragebeantwortung über Tabellen und Diagramme WikiMixQA:表格和图表问答的多模式基准 2506.15594v1

Authors (6): Negar Foroutan, Angelika Romanou, Matin Ansaripour, Julian Martin Eisenschlos, Karl Aberer, Rémi Lebret

Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.

文件是保存和传播信息的基础,常常包含复杂的布局、表格和图表,对自动理解文件构成重大挑战。虽然视觉语言大模型(VLLMS)在各种任务方面都显示出了改进,但它们在处理长文本愿景投入方面的效力仍然不明确。本文介绍了WikiMixQA,这是一个基准,包括1,000个多选择问题(MCQs),旨在评估从4000个维基百科页面上提取的7个不同主题的图表的跨模式推理。与现有基准不同,WikiMixQA强调复杂的推理,要求模型综合多种模式的信息。我们评估了12个最先进的视觉语言模型,显示当专利模型直接提供时达到~70%的准确度,但在需要检索长文件时,其性能会大大恶化。其中,GPT-4-o是这一环境中唯一超过50%的模型,而开放源模型则表现严重差,最高精确率达27%。这些结论强调了长文本、多模式推理和建立WikiMixQA作为推进文件研究的关键基准的挑战。


Article 19

Title@2025-06-18 (3): Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

Title: Lean Workbook: A large-scale Lean problem set formalized from natural language math problems Lean Workbook: Ein groß angelegtes Lean-Problem, das aus natursprachlichen mathematischen Problemen formalisiert wird 利安工作手册:从自然语言数学问题中正式确定的一个大规模利安问题 2406.03847v3

Authors (6): Huaiyuan Ying, Zijian Wu, Yihan Geng, Zheng Yuan, Dahua Lin, Kai Chen

Large language models have demonstrated impressive capabilities across various natural language processing tasks, especially in solving mathematical problems. However, large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. To address this issue, we propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements, and vice versa. Our results indicate that the synthetic data pipeline can provide useful training data and improve the performance of LLMs in translating and understanding complex mathematical problems and proofs. Our final dataset contains about 57K formal-informal question pairs along with searched proof from the math contest forum and 21 new IMO questions. We open-source our code at https://github.com/InternLM/InternLM-Math and our data at https://huggingface.co/datasets/InternLM/Lean-Workbook.

大型语言模型在各种自然语言处理任务中表现出令人印象深刻的能力,特别是在解决数学问题方面;然而,大型语言模型在用Lean等正式语言证明数学理论方面并不擅长。这一领域的一个重大挑战是缺乏这些正式语言的培训数据。为解决这一问题,我们提议建立一个新的管道,反复生成和过滤合成数据,将自然语言数学问题转化为Lean 4语语句,反之亦然。我们的结果表明,合成数据管道可以提供有用的培训数据,改善LLMS在翻译和理解复杂的数学问题和证据方面的绩效。我们的最后数据集包含约57K个正式的非正式问题配对,以及数学竞赛论坛的搜索证据和21个新的海事组织问题。我们在https://github.com/InternLM/InternLM/InternLM-Math和我们在https://huggingface.co/datasetset/InternLM/Lean-Workbook上公布了我们的代码。


Article 20

Title@2025-06-18 (3): DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement

Title: DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement DiscoSG: Auf dem Weg zu Diskurs-Level Textszene Grafik Parsing durch iterative Graphenverfeinerung DiscoSG:通过迭代图形精炼进行分层层文本场景图解 2506.15583v1

Authors (6): Shaoqing Lin, Chong Teng, Fei Li, Donghong Ji, Lizhen Qu, Zhuang Li

Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: https://github.com/ShaoqLin/DiscoSG

视觉- Language Models (VLMS) (VLMS) 正在产生一个新的任务,即 DiscoSG-DS, 由我们的数据系统 DiscoSG-DS(DiscoSG) 支持的DiscoSG(DiscoSG-DS) , 由400名专家附加说明和8,430名综合多语制图解解析组组成, 最初设计用于单声带字幕到绘图的图像。 目前的方法一般是合并句级解析输出,通常缺少交叉感应连接等现象,导致DiscoSG-DSM任务下游分化。 为了解决这个问题,我们引入了一个新的任务,即Discole 级别文本解析(Disco) , 由我们的数据系统DiscoSG- Disco(DS) 和限制性的解码解析(Disco) , 由400名专家加注和8,4,4,4,4,4,4 合成多语系多语系多语系图组的图组配图像。 每个标题平均9句,每9句,每个图中至少有3倍。 我们提议将GSDirSG- SG- sermax- decreco-de- decreco) der-deal-deal-deal-deal-de-deal-deal-dealbisco) labisco-deal der der der der der-dealdaldal madaldal dealdaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldaldal


Article 21

Title@2025-06-18 (3): SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

Title: SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification SciVer: Bewertung von Stiftungsmodellen für multimodale wissenschaftliche Patentprüfung SciVer:评估基金会多模式科学索赔核实模型 2506.15569v1

Authors (5): Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, Yilun Zhao

We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG), and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models’ comprehension and reasoning in multimodal scientific literature tasks.

我们引入了专门用来评估基础模型在多式联运科学背景下核实索赔能力的第一个基准SciVer,SciVer由1,113份科学论文的3,000个专家附加说明的例子组成,涵盖四个子集,每个子集代表多式联运科学索赔核实的共同推理类型;为了进行细微评估,每个例子都包括专家附加说明的辅助证据;我们评估21个最先进的多式联运基础模型的绩效,包括o4-mini、Gemini-2.5-Flash、Llama-3.2-Vision和Qwen2.5-VL。我们的实验显示这些模型与人类专家在SciVer上的绩效差距很大。通过深入分析检索集生成(RAG)和人为错误评估,我们确定了当前开放源模型的关键局限性,为在多式联运科学文献任务中推进模型的理解和推理提供了关键见解。


Article 22

Title@2025-06-18 (3): Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models

Title: Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models Gender Inclusivity Fairness Index (GIFI): Ein mehrstufiger Rahmen zur Bewertung der Geschlechtervielfalt in großen Sprachmodellen 性别包容性公平指数:以大语言模式评价性别多样性的多层次框架 2506.15568v1

Authors (3): Zhengyang Shan, Emily Ruth Diana, Jiawei Zhou

We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs’ gender inclusivity. Our study highlights the importance of improving LLMs’ inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models.

我们对大型语言模型中的性别公平情况进行了全面评估,重点是这些模型处理二进制和非二进制性别的能力;虽然以前的研究主要侧重于二进制性别区分,但我们采用了性别包容性公平指数(GIFI),这是衡量LLMs性别包容性程度的一个新而全面的衡量标准。 GIFI由一系列不同级别的广泛评价组成,从简单地检验提供性别代名词的模式到测试不同性别假设下的模型生成和认知行为的各个方面,揭示与不同性别识别特征有关的偏见。我们与GIFI广泛评价了22个不同大小和能力的显著开放源和专有有限责任模型,发现了LLMs性别包容性方面的重大差异。我们的研究强调了改进LMS的包容性的重要性,为在基因化模型中未来在性别公平方面取得的进步提供了一个关键基准。


Article 23

Title@2025-06-18 (3): Fractured Chain-of-Thought Reasoning

Title: Fractured Chain-of-Thought Reasoning Zersplitterte Kette von nachdenklichen Gründen 断断断断断断断断断断断断的探讨链原因 2505.12992v3

Authors (7): Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong

Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning. Code is available at https://github.com/BaohaoLiao/frac-cot.

在这项工作中,我们首先展示了松散的Cot,它停止了推理,直接生成了最终答案,常常匹配完整的COT取样,同时使用极小的代号。基于这一洞察力,我们引入了断裂式取样,一个统一的推论时间战略,在完全COT和三个或几面轴的解决方案取样之间进行相互调试:(1)推理轨数和仅溶式取样,这些方法产生了巨大的象征性成本,阻碍了在对延缓性敏感的环境中部署这些模型。在这项工作中,我们首先展示了松散的COT,它停止在完成前进行推理,直接生成了最终答案,常常与完整的COT取样匹配,同时使用的数量也大大减少。基于这一洞察力,我们引入了断裂式取样,一个统一的推论时间战略,在完全 CoT 与三个或多孔轴轴轴轴轴轴的解决方案取样之间进行相互调:(1) 推算轨迹的次数,(2) 每一轨迹的最后解决方案的数量,以及推算的深度。我们通过在五个不同的推理基准和几个模型尺度上进行广泛的实验,我们显示,断裂式取样的取样,不断实现更精确的精确的成本成本成本交易和推算。


Article 24

Title@2025-06-18 (3): PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

Title: PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction PredGen: Beschleunigte Schlussfolgerung großer Sprachmodelle durch Input-Time-Spekulation für Echtzeit-Spekulationsinteraktion PredGen:通过实时语音互动输入-时间投机加速推断大语言模式 2506.15556v1

Authors (2): Shufan Li, Aditya Grover

Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates-or even eliminates-this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time-computation that would otherwise go unused.

大型语言模型(LLMS)被广泛用于实时语音聊天应用程序,通常与生成声音响应的文本到语音系统(TTS)结合使用,但其庞大规模往往导致用户输入的结尾和音频输出的开始之间明显悬殊,造成用户不尽人意的经历。当LLMS作为单一用户语音助理被部署在计算机容量有限的消费者级硬件上时,这种延缓尤其明显。我们发现,LMS主要以生成第一个句子的时间为主,这是TTS系统在逐个判决的基础上综合音频响应所需的投入。为了解决这一瓶颈问题,我们建议PredGen(PredGen),这是一个新的框架,通过在输入时的投机解码来减轻甚至消除这一延迟。PredGen在用户仍然发言时提出候选人的回复,使系统能够以最小的延迟开始TTS处理。Lmsysys和MTnch数据集的模拟实验将显示,拟议的方法只能有效地在2号的宽度计算中减少额外成本,而在2号上有效地减少额外的计算。


Article 25

Title@2025-06-18 (3): How much do language models memorize?

Title: How much do language models memorize? Wie viel merken sich Sprachmodelle? 语言模型背书多少? 2505.24832v3

Authors (8): John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, Saeed Mahloujifar

We propose a new method for estimating how much a model knows about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point “grokking” begins, and unintended memorization decreases as models begin to generalize. We train hundreds of transformer language models ranging from $500K$ to $1.5B$ parameters and produce a series of scaling laws relating model capacity and data size to membership inference.

我们提出了一个新的方法来估计模型对一个数据点的了解程度,并用它来测量现代语言模型的能力。先前对语言模型记忆化的研究曾费力去解开一般化的记忆。我们正式将记忆化分为两个部分:意外记忆,一个模型包含关于特定数据集的信息,一个模型包含关于真实数据生成过程的信息,一个模型包含关于真实数据生成过程的信息。当我们完全消除概括化时,我们可以计算总记忆化,它提供模型能力的估计:我们的测量估计,GPT型模型每个参数的容量约为3.6比特。我们用不断增大的数据集来培训语言模型模型模型记忆化模型,直到其能力得到填补,在那个点“grokking”开始,而一个模型开始概括化时意外记忆化减少。我们训练了数百个变异语言模型,从500K$到1.5B$的参数,并产生一系列关于模型能力和数据大小与成员推算有关的扩展法律。


Article 26

Title@2025-06-18 (3): Approximating Language Model Training Data from Weights

Title: Approximating Language Model Training Data from Weights Annähernde Sprachmodell-Trainingsdaten aus Gewichten 由重量产生的近似语文示范培训数据 2506.15553v1

Authors (5): John X. Morris, Junjie Oscar Yin, Woojeong Kim, Vitaly Shmatikov, Alexander M. Rush

Modern language models often have open weights but closed training data. We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus and show its effectiveness at recovering useful data given only weights of the original and finetuned models. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents can be used to train a model to close to the original model performance given models trained for both classification and supervised-finetuning. On the AG News classification task, our method improves performance from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. When applied to a model trained with SFT on MSMARCO web documents, our method reduces perplexity from 3.3 to 2.3, compared to an expert LLAMA model’s perplexity of 2.0.

现代语言模型通常具有开放的权重,但有封闭的培训数据。 我们正式确定了模型权重的数据近似问题,并提出了若干基线和衡量标准。 我们开发了一种基于梯度的方法,从大型公共文本中选择最匹配的数据,并显示其在恢复有用数据方面的有效性,只考虑到原始和微调模型的权重。即使没有真正的培训数据,我们的方法也能够找到少量的公开网络文件,用来训练一个模型,接近为分类和监管的改进而培训的原始模型。在AG New 分类任务中,我们的方法将业绩从65%(随机选择的数据)提高到80%,接近88%的专家基准。在应用于在MSMARCO网络文件上接受SFT培训的模型时,我们的方法可以减少3.3到2.3的不易懂性,而专家LLMAMA模型的易混淆性能是2.0。


Article 27

Title@2025-06-18 (3): RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models

Title: RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models RATTENTION: Auf dem Weg zur minimalen Schiebefenstergröße in lokalen und globalen Aufmerksamkeitsmodellen 注意:在本地-全球关注模式中实现最小滑滑窗口大小 2506.15545v1

Authors (4): Bailin Wang, Chang Lan, Chong Wang, Ruoming Pang

Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention – its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches.

最近出现了一些地方-全球关注模式,作为标准变换器的令人信服的替代物,培训速度和推算效率都有希望的改善。然而,对窗口规模的关键选择是Pareto交易:较大的窗口保持与充分关注相似的性能,但在短视情景下却能带来最低效率收益,而较小的窗口则可能导致性能退化。Gemma2和Mistral等当前模式采用保守的窗口规模(例如,在8192个培训前的4096年)来保持业绩。这项工作调查了改变这一速度Pareto前沿的战略,使地方-全球模式即使在短期制度下也能够实现效率增益。我们的核心动机是解决当地关注的内在局限性 – – 它完全无视在限定的窗口外的标牌。我们探索了一种与专门线性关注机制相结合的本地关注变式,以获取从这些离风的象征获得的信息。在3B和12B级前的培训实验显示,在业绩和效率之间实现更优的折换,使地方-全球模式得以实现效率增益。作为甜点,REting the decretration the delicial rial train train revidustration int revidudududustration the devidududustration vidustration


Article 28

Title@2025-06-18 (3): Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

Title: Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework Polysemantik mit PRISM erfassen: Ein Multi-Konzept-Feature Beschreibung Framework 利用PRISM获得多边性能:多概念特征描述框架 2506.15538v1

Authors (7): Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina M. -C. Höhne, Oliver Eberle

Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Current feature description methods face two critical challenges: limited robustness and the flawed assumption that each neuron encodes only a single concept (monosemanticity), despite growing evidence that neurons are often polysemantic. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework that captures the inherent complexity of neural network features. Unlike prior approaches that assign a single description per feature, PRISM provides more nuanced descriptions for both polysemantic and monosemantic features. We apply PRISM to language models and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score).

现有特征描述方法面临两大挑战:强性有限,而且假设每个神经编码只有一个单一的概念(单体性),尽管越来越多的证据表明神经元往往是多元性。这一假设限制了特征描述的表达性,并限制了它们捕捉在模型内部编码的全部行为的能力。为了解决这个问题,我们引入了多元性胎儿识别和分解方法(PRISM),这是一个反映神经网络特征固有复杂性的新框架。与先前对每个特征进行单一描述的方法不同,PRISM为多种和单体特征提供了更细微的描述。我们将PRISM应用于语言模型,并通过与现有方法进行广泛的基准对比,表明我们的方法产生了更准确和忠实的特征描述,既提高了总体描述质量(通过描述评分),也提高了在存在多种对称性时(通过多元性评分)捕捉不同概念的能力。


Article 29

Title@2025-06-18 (3): Pap2Pat: Benchmarking Outline-Guided Long-Text Patent Generation with Patent-Paper Pairs

Title: Pap2Pat: Benchmarking Outline-Guided Long-Text Patent Generation with Patent-Paper Pairs Pap2Pat: Benchmarking der Langtext-Patentgenerierung mit Patent-Paper-Paaren Pap2Patt:制定基准大纲,指导长式长式专利生成,配有专利-纸质配对 2410.07009v3

Authors (4): Valentin Knappich, Simon Razniewski, Anna Hätty, Annemarie Friedrich

Dealing with long and highly complex technical text is a challenge for Large Language Models (LLMs), which still have to unfold their potential in supporting expensive and timeintensive processes like patent drafting. Within patents, the description constitutes more than 90% of the document on average. Yet, its automatic generation remains understudied. When drafting patent applications, patent attorneys typically receive invention reports (IRs), which are usually confidential, hindering research on LLM-supported patent drafting. Often, prepublication research papers serve as IRs. We leverage this duality to build PAP2PAT, an open and realistic benchmark for patent drafting consisting of 1.8k patent-paper pairs describing the same inventions. To address the complex longdocument patent generation task, we propose chunk-based outline-guided generation using the research paper as invention specification. Our extensive evaluation using PAP2PAT and a human case study show that LLMs can effectively leverage information from the paper, but still struggle to provide the necessary level of detail. Fine-tuning leads to more patent-style language, but also to more hallucination. We release our data and code https://github.com/boschresearch/Pap2Pat.

处理长期和高度复杂的技术文本对大语言模型(LLMS)来说是一项挑战,因为大语言模型(LLMS)在支持专利起草等昂贵和耗时的过程方面仍然必须发挥潜力。在专利中,描述平均占文件的90%以上。然而,其自动生成仍然受到忽视。在起草专利申请时,专利律师通常会收到发明报告(IRs),这些报告通常是保密的,妨碍了对LLMM所支持的专利起草的研究。出版前研究文件往往作为IRs。我们利用这种双重性来建立PAP2PAT,这是专利起草的开放和现实的基准,由1.8k个专利纸配对组成,描述同样的发明。为了应对复杂的长文件专利生成任务,我们建议采用基于块的大纲生成作为发明规格。我们利用PAP2PAT和人类案例研究进行的广泛评估表明LMSs能够有效地利用纸上的信息,但是仍然在努力提供必要的详细程度。微调导致更多的专利风格语言,但也有更多的幻觉。我们公布了我们的数据和代码 http://github.com/spressearsearch.


Article 30

Title@2025-06-18 (3): Lessons from Training Grounded LLMs with Verifiable Rewards

Title: Lessons from Training Grounded LLMs with Verifiable Rewards Lehren aus der Ausbildung begründeter LLMs mit überprüfbaren Belohnungen 从培训中汲取的教训 2506.15522v1

Authors (8): Shang Hong Sim, Tej Deep Pala, Vernon Toh, Hai Leong Chieu, Amir Zadeh, Chuan Li, Navonil Majumder, Soujanya Poria

Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.

对大型语言模型(LLMS)来说,产生有依据和值得信赖的反应仍然是一项关键的挑战。虽然以引用为基础的基础定位检索增强的一代(RAG)很有希望,但指导调整的模型即使在简单假设中也常常失败:缺少明确说明的答案,错误引用,或者在有证据时拒绝。在这项工作中,我们探索强化学习(RL)和内部推理如何能加强LMS的定位。我们使用GROPO(Group 相对政策优化)方法来培训模型,使用可核查的基于结果的奖励,以答复的正确性、引用的充足性和拒绝性为目标,而不需要黄金推理的痕迹或昂贵的注释。我们通过ASQA、QAMPARI、ELI5和CEQA的全面实验,我们发现,推理推理模型明显超越了只提供教学的变式,特别是在处理无法回答的询问和产生良好的响应方面。我们使用GROPO的两阶段培训,首先优化回答和引用行为,然后通过稳定学习信号来进一步改进基础。此外,我们重新审视通过GPT-4 Qstillation(Rstillation)的调整指令,并发现将长期的成绩与升级相结合。


Article 31

Title@2025-06-18 (3): RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering

Title: RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering RadioRAG: Online-Retrieval-augmentierte Generation für Radiologie Fragen beantworten PARRAG: 放射问题解答在线检索增强的一代人 2407.15621v3

Authors (10): Soroosh Tayebi Arasteh, Mahshad Lotfinia, Keno Bressem, Robert Siepmann, Lisa Adams, Dyke Ferber, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

Large language models (LLMs) often generate outdated or inaccurate information based on static training datasets. Retrieval-augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions with reference standard answers, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG in a zero-shot inference scenario RadioRAG retrieved context-specific information from Radiopaedia in real-time. Accuracy was investigated. Statistical analyses were performed using bootstrapping. The results were further compared with human performance. RadioRAG improved diagnostic accuracy across most LLMs, with relative accuracy increases ranging up to 54% for different LLMs. It matched or exceeded non-RAG models and the human radiologist in question answering across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, the degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in RadioRAG’s effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. RadioRAG shows potential to improve LLM accuracy and factuality in radiology question answering by integrating real-time domain-specific data.

大型语言模型(LLMS)往往产生基于静态培训数据集的过时或不准确的信息。 检索启动的生成(RAG)通过整合外部数据源而缓解了这一点。 以前的RAG系统使用预先组装的固定数据库,但灵活性有限,我们开发了放射学RAG(RadiRAG),这是一个从权威的放射性在线来源实时检索数据的端对端框架。 我们在回答放射学特定问题时,对各种LLMS的诊断准确性进行了评估。 在通过RAG获得更多在线信息时,我们通过RAG获取了这种信息。 利用了RSNA案件收集的80个问题,在放射学次级特别特殊模型中,利用了80个专家查询的问题,并用参考标准答案、LMMS(G-35-turbo)、G-4、Mistral-7B、Mix-8x、Llama3 [8B和70B] 实时数据采集数据时,通过零光度和感光度假设,通过RADRADLMLMLMLMLLM的准确性数据更新了数据。 数据,在进行统计分析和分析结果中,通过不同结果显示了。


Article 32

Title@2025-06-18 (3): Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency

Title: Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency Warten Sie, wir brauchen nicht zu “warten”! Entfernen von Gedanken-Tokens verbessert vernünftige Effizienz 等等,我们不需要”等等”! 2506.08343v2

Authors (6): Chenlong Wang, Yuanning Feng, Dongping Chen, Zhaoyang Chu, Ranjay Krishna, Tianyi Zhou

Recent advances in large reasoning models have enabled complex, step-by-step reasoning but often introduce significant overthinking, resulting in verbose and redundant outputs that hinder efficiency. In this study, we examine whether explicit self-reflection, signaled by tokens such as “Wait” and “Hmm”, is necessary for advanced reasoning. We propose NoWait, a simple yet effective approach that disables explicit self-reflection by suppressing these tokens during inference. Extensive experiments on ten benchmarks across textual, visual, and video reasoning tasks show that NoWait reduces chain-of-thought trajectory length by up to 27%-51% in five R1-style model series, without compromising model utility. NoWait thus offers a plug-and-play solution for efficient and utility-preserving multimodal reasoning.

大型推理模型的最近进展使得复杂、逐步推理得以实现,但往往引入了重大的过度思考,从而导致阻碍效率的杂乱和冗余产出。在本研究中,我们研究是否需要用“等待”和“Hmm”等符号表示的明确自我反省作为高级推理的标志。我们提议了“等待”这一简单而有效的方法,通过在推理过程中压制这些符号而使明确的自我反省成为障碍。关于文本、视觉和视频推理任务的十项基准的广泛实验表明,“等待”将五个R1型模型系列中思考的轨迹长度减少高达27%至51%,而不会损害模型的效用。因此,“不等待”为高效率和实用性维护多式推理提供了一个插机解决方案。


Article 33

Title@2025-06-18 (3): Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge

Title: Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge Verbesserung der Hyperbole- und Metaphor-Erkennung mit ihrem bidirektionalen dynamischen Interaktions- und Emotionswissen 利用双向动态互动和情感知识加强超双向超博和同义体探测 2506.15504v1

Authors (8): Li Zheng, Sihang Wang, Hao Fei, Zuquan Peng, Fei Li, Jianming Fu, Chong Teng, Donghong Ji

Text-based hyperbole and metaphor detection are of great significance for natural language processing (NLP) tasks. However, due to their semantic obscurity and expressive diversity, it is rather challenging to identify them. Existing methods mostly focus on superficial text features, ignoring the associations of hyperbole and metaphor as well as the effect of implicit emotion on perceiving these rhetorical devices. To implement these hypotheses, we propose an emotion-guided hyperbole and metaphor detection framework based on bidirectional dynamic interaction (EmoBi). Firstly, the emotion analysis module deeply mines the emotion connotations behind hyperbole and metaphor. Next, the emotion-based domain mapping module identifies the target and source domains to gain a deeper understanding of the implicit meanings of hyperbole and metaphor. Finally, the bidirectional dynamic interaction module enables the mutual promotion between hyperbole and metaphor. Meanwhile, a verification mechanism is designed to ensure detection accuracy and reliability. Experiments show that EmoBi outperforms all baseline methods on four datasets. Specifically, compared to the current SoTA, the F1 score increased by 28.1% for hyperbole detection on the TroFi dataset and 23.1% for metaphor detection on the HYPO-L dataset. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing hyperbole and metaphor detection.

以文字为基础的超文本和隐喻检测对于自然语言处理(NLP)任务非常重要。然而,由于情感分析模块深埋超音和隐喻隐含的情感内涵,因此很难辨别它们。现有的方法大多侧重于表面文字特征,忽视超音和隐喻的关联以及隐含情感对感测这些言辞装置的影响。为了实施这些假设,我们提议了一个基于双向动态互动的情感引导超音和隐喻检测框架。首先,情感分析模块深深埋藏超音和隐喻背后的情感内涵。接下来,基于情感的域图绘制模块确定了目标和源域,以加深理解超音和隐喻的隐含含义。最后,双向动态互动模块使得超音和隐含的情感对感触觉对感对理解的影响得以相互促进。与此同时,我们设计了一种核查机制,以确保探测准确性和可靠性。首先,实验表明Emobi在四个数据集上的所有基线方法都比当前的 SoTA, 以F1分法为28.1分,用于对超音机和超低比值的探测结果的BIBIL数据进行28.1和23的预估的GIBIBA的预测算。


Article 34

Title@2025-06-18 (3): Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence

Title: Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence Austauschbare Token-Einbetten für erweiterbare Vokabeln und Alpha-Equivalenz 用于可扩展词汇和阿尔法等效的互换调制缩写嵌套 2410.17161v3

Authors (3): İlker Işık, Ramazan Gokberk Cinbis, Ebru Aydin Gol

Language models lack the notion of interchangeable tokens: symbols that are semantically equivalent yet distinct, such as bound variables in formal logic. This limitation prevents generalization to larger vocabularies and hinders the model’s ability to recognize alpha-equivalence, where renaming bound variables preserves meaning. We formalize this machine learning problem and introduce alpha-covariance, a metric for evaluating robustness to such transformations. To tackle this task, we propose a dual-part token embedding strategy: a shared component ensures semantic consistency, while a randomized component maintains token distinguishability. Compared to a baseline that relies on alpha-renaming for data augmentation, our approach demonstrates improved generalization to unseen tokens in linear temporal logic solving, propositional logic assignment prediction, and copying with an extendable vocabulary, while introducing a favorable inductive bias for alpha-equivalence. Our findings establish a foundation for designing language models that can learn interchangeable token representations, a crucial step toward more flexible and systematic reasoning in formal domains. Our code and project page are available at https://necrashter.github.io/interchangeable-token-embeddings

语言模型缺乏可互换的符号概念: 具有等同音义但又有区别的符号, 如正式逻辑中的约束变量。 这一限制阻止了对较大词汇的笼统化,并阻碍了模型在重命名约束变量保留意义的情况下承认alpha- equality的能力。 我们正式化了这个机器学习问题,并引入了阿尔法- ocvoliance, 这是评估这种转变的稳健性的一种衡量标准。 为了完成这项任务,我们建议了一个双部分的象征性嵌入战略: 一个共享的组件可以确保语义的一致性, 而一个随机化的组件则保持象征性的区别性。 与一个依靠阿尔法重新命名来扩大数据的基线相比, 我们的方法显示, 在线性时间逻辑解算、 假设逻辑分配预测以及复制可扩展的词汇中, 对不可见的符号有更好的概括化。 我们的发现为设计语言模型奠定了基础, 可以学习可互换的象征性表达方式, 这是在正式域中朝着更灵活和系统推理的方向迈出的关键一步。 我们的代码和工程页面可以在 https://necrall- relashembbedtotototototototototototototototo


Article 35

Title@2025-06-18 (3): SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

Title: SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling SPARE: Single-Pass-Annotation mit referenzgeführter Bewertung für automatische Prozessüberwachung und Prämienmodellierung SPARE: 具有自动程序监督和奖赏建模参考指导评价的单纸注释 2506.15498v1

Authors (3): Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych

Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.

在推进大型语言模型(LLMs)复杂的多步推理能力方面,进程或分步监督发挥了至关重要的作用。然而,高效、高质量的自动化进程注释仍然是一个重大挑战。为了解决这个问题,我们引入了参考指导评价(SPARE)的单一Pass批注(SPARE),这是一个结构化的新框架,它通过将每个解决方案步骤与参考解决方案中的一个或多个步骤相协调,并伴有明确的评价推理,使每个解决方案步骤与一个或多个步骤相协调,从而能够进行单一、一步注解。我们表明,参考指导的分级评价有效地促进了四个数据集的流程监督:数学推理、多跳动成像问题解答和空间推理。我们证明,与基线相比,SPARE在使用时改进了推理性表现:(1) 将模型在离线设置中进行微调,用于推断-贪婪解码,以及(2) 将多种LM产出排出为分级/聚合的训练奖赏模式。此外,SPARE在挑战数学数据集上取得了竞争性的表现,同时提供了2.6倍的效率,仅需要38的运行时间,与经过培训的SP-PR研究基础进一步发布。


Article 36

Title@2025-06-18 (3): Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation

Title: Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation Adding Chocolate to Mint: Vermeiden von Metric Interferenz in der maschinellen Übersetzung 在薄荷中添加巧克力:减轻机器翻译中的计量干涉 2503.08327v2

Authors (4): José Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins

As automatic metrics become increasingly stronger and widely adopted, the risk of unintentionally “gaming the metric” during model development rises. This issue is caused by metric interference (MINT), i.e., the use of the same or related metrics for both model tuning and evaluation. MINT can misguide practitioners into being overoptimistic about the performance of their systems: as system outputs become a function of the interfering metric, their estimated quality loses correlation with human judgments. In this work, we analyze two common cases of MINT in machine translation-related tasks: filtering of training data, and decoding with quality signals. Importantly, we find that MINT strongly distorts instance-level metric scores, even when metrics are not directly optimized for-questioning the common strategy of leveraging a different, yet related metric for evaluation that is not used for tuning. To address this problem, we propose MINTADJUST, a method for more reliable evaluation under MINT. On the WMT24 MT shared task test set, MINTADJUST ranks translations and systems more accurately than state-of-the-art metrics across a majority of language pairs, especially for high-quality systems. Furthermore, MINTADJUST outperforms AUTORANK, the ensembling method used by the organizers.

随着自动衡量标准越来越强大和被广泛采用,在模型开发过程中无意地“将衡量标准”的风险增加。这个问题是由衡量干预(MINT)造成的,即对模型调整和评价使用相同或相关的衡量标准。MINT可能误导从业者对其系统绩效过于乐观:随着系统产出成为干扰衡量标准的一个功能,其估计质量将失去与人类判断的关联。在这项工作中,我们分析了机器翻译相关任务中MINT的两个常见案例:培训数据的过滤和对质量信号的解码。重要的是,我们发现MINT严重扭曲了实例一级的衡量标准分数,即使衡量标准不是直接优化于质疑利用不同但相关的评价标准来调整其系统绩效的共同战略。为了解决这一问题,我们建议MINTADJust公司采用一种更可靠的评价方法。关于WMT24 MT共享的任务测试集,MITADJust公司将翻译和系统排在比州级标准更准确的翻译和系统上。重要的是,即使衡量标准没有直接优化用于质疑使用不同系统的共同评价标准,但用于调整的通用评价标准。我们提议MINAFA格式的大多数组织者采用的方法。


Article 37

Title@2025-06-18 (3): Context-Informed Grounding Supervision

Title: Context-Informed Grounding Supervision Kontext-informierte Erdungsüberwachung 内地内地监督 2506.15480v1

Authors (10): Hyunji Lee, Seunghyun Yoon, Yunjae Won, Hanseok Oh, Geewook Kim, Trung Bui, Franck Dernoncourt, Elias Stengel-Eskin, Mohit Bansal, Minjoon Seo

Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination. In such cases, we expect the model to generate responses by grounding its response in the provided external context. However, prior work has shown that simply appending context at inference time does not ensure grounded generation. To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context. Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models. In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques. In the vision-language domain, replacing a vision-language model’s LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. This improved grounding comes without degradation in general downstream performance. Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model’s prior knowledge and behavior, implicitly encouraging greater reliance on the external context.

大型语言模型(LLMS)往往得到外部知识的补充,以提供没有在参数中编码的信息或减少幻觉。在这类情况下,我们期望该模型能够通过在所提供的外部环境下进行响应来生成响应。然而,先前的工作表明,仅仅在推论时间附加背景并不能确保有根化的生成。为此,我们提议,在培训后监督中,对模型进行培训,根据相关背景对响应进行预先准备,同时只计算响应标牌的损失,并掩盖背景。我们进行的实验表明,在CINGS培训的模型在文字和视觉领域比标准指导调整模型都更能产生更强的基础。在文本领域,CINGS在11个信息搜索数据集中优于其他培训方法,与时间定位技术相辅相成。在愿景语言领域,用CINGS培训的模型主干线替换了四个基准上的幻觉,并在整个生成的响应中保持了事实一致性。这种改进的基建模型在文本和视觉领域与标准调整模式相比,在文字领域比标准指导模式更强。在文本领域,C优于其他培训方法上,我们从基础上分析了基础的深层次推导了基础,从而推导了基础推了基础,从而推了基础推了基础,从而推导了C。最后又推导了基础推导了基础推导了基础,我们分析了了基础推了基础,在了基础,在了基础,从而推了基础推了基础,在了C。


Article 38

Title@2025-06-18 (3): Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Title: Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification? Breaking Bad Molecules: Sind MLLMs bereit für die molekulare Entgiftung auf Strukturebene? 破碎坏分子:MLLMs是否准备好进行结构级分子解毒? 2506.10912v2

Authors (8): Fei Lin, Ziyang Gong, Cong Wang, Yonglin Tian, Tengchao Zhang, Xue Yang, Gen Luo, Fei-Yue Wang

Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair - generating structurally valid molecular alternatives with reduced toxicity - has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 560 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess nearly 30 mainstream general-purpose MLLMs and design multiple ablation studies to analyze key factors such as evaluation criteria, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.

尽管分子设计和财产预测取得了进展,但分子毒性修复任务 – – 产生结构上有效的分子替代物,减少毒性 – – 尚未系统地界定或确定基准;为填补这一空白,我们引入了侧重于分子毒性修复的通用多式多种语言模型(MLLM)的第一个基准任务 – – ToxiMol,这是侧重于分子毒性修复的通用多模式模型(MLLMs)的第一个基准任务;我们建立了一个标准化的数据集,涵盖11个主要任务和560个具有代表性的有毒分子,涵盖各种机制和颗粒。我们设计了一个及时的注解管道,配有机制性能和任务适应能力,并有专门的毒理学知识。与此同时,我们提议了一个自动评估框架 – – ToxiEval,将毒性终点预测、合成可获性、药物相似性和结构相似性纳入一个高通量评价链,以修复成功。我们系统地评估了近30个主流通用MLLMM,并设计了多种相关研究,以分析诸如评价标准、候选多样性和失败归属等关键因素。实验结果表明,尽管目前的MLLMS-Ms在这项任务的坚持性、磁性能力方面仍面临重大挑战。


Article 39

Title@2025-06-18 (3): OM4OV: Leveraging Ontology Matching for Ontology Versioning

Title: OM4OV: Leveraging Ontology Matching for Ontology Versioning OM4OV: Ontologie für die Ontologie-Versionierung OM4OV:利用本体学匹配本体学版本的本体学 2409.20302v3

Authors (3): Zhangcheng Qiang, Kerry Taylor, Weiqing Wang

Due to the dynamic nature of the Semantic Web, version control is necessary to capture time-varying information, particularly for widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component for efficient ontology management, the growing size of ontologies and accumulating errors caused by manual labour overwhelm current OV approaches. In this paper, we propose yet another approach to performing OV using existing ontology matching (OM) techniques and systems. We introduce a unified OM4OV pipeline. From an OM perspective, we reconstruct a new task formulation and measurement for OV tasks. Building upon the prior alignment(s) from OM, we propose a pipeline optimisation method called the cross-reference (CR) mechanism to enhance overall OV performance. We experimentally validate the OM4OV pipeline and the cross-reference mechanism in the OV tested originating from the Ontology Alignment Evaluation Initiative (OAEI) datasets. We also discuss insights into OM used for OV tasks, where some false mappings detected by OV systems are not actually untrue.

由于语义网的动态性质,必须进行版本控制,以捕捉时间变化信息,特别是广泛使用的肿瘤信息。尽管长期以来一直承认本体学版本(OV)是有效本体学管理的一个关键组成部分,但由于人工劳动超负荷目前OV方法造成的本体学规模不断扩大和累积错误的积累。在本文件中,我们提出另一种方法,利用现有的本体匹配(OM)技术和系统来进行OV。我们引入了统一的OM4OV管道。我们从OM的角度为OV任务重建了新的任务配置和计量。在OM先前的协调统一的基础上,我们提议一种管线优化方法,称为交叉参照(CR)机制,以提高整个OVS的性能。我们实验性地验证了OM4OVV管道和从OTolog对齐评价倡议(OAEI)的数据集中测试的OVVA的交叉参照机制。我们还讨论了用于OV任务的OM的洞察到的一些假图实际上并非不真实。


Article 40

Title@2025-06-18 (3): Factorized RVQ-GAN For Disentangled Speech Tokenization

Title: Factorized RVQ-GAN For Disentangled Speech Tokenization Factorized RVQ-GAN für entfremdete Sprach-Tokenisierung RVQ-GAN 分解语音代谢的量化 RVQ-GAN 2506.15456v1

Authors (16): Sameer Khurana, Dominik Klement, Antoine Laurent, Dominik Bobos, Juraj Novosad, Peter Gazdik, Ellen Zhang, Zili Huang, Amir Hussein, Ricard Marxer, Yoshiki Masuyama, Ryo Aihara, Chiori Hori, Francois G. Germain, Gordon Wichern, Jonathan Le Roux

We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels-acoustic, phonetic, and lexical-within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC’s factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC’s potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.

我们提出一个统一的神经语言代码(HAC),这是一个统一的神经语言代码(HAC),它将其瓶颈分解成三个语言层面的音频、音频和词汇。HAC利用两个知识蒸馏目标:一个来自受过训练的语音编码器(HuBERT),用于电话级别结构,另一个来自基于文字的编码器(LABSE),用于词汇提示。关于英语和多语言数据的实验显示,HAC的因因子化瓶颈产生分解的代号组:一个与电话对齐,另一个则捕捉字级语级语义。定量评估证实,HAC象征着自然特性,提供了可解释的语言信息,在脱钩和重建质量方面超过了单级基线。这些结论强调了HAC作为统一的离散语音代表、连接声频细节以及下游语音生成和理解任务的词汇含义的潜力。


Article 41

Title@2025-06-18 (3): RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation

Title: RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation RE-IMAGINE: Symbolische Benchmark-Synthese zur vernünftigen Bewertung RE-IMAGINE: 用于说明理由的评价的符号性基准综合报告 2506.15455v1

Authors (10): Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Risa Ueno, Fabian Falck, Aditya V. Nori, Rahul Sharma, Amit Sharma, Javier Gonzalez

Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.

最近大型语言模型(LLMS)报告推理基准的准确性很高,但是仍然不清楚观察到的结果是真实推理的结果,还是从统计上回顾成套培训的结果。在因果关系阶梯(Pearl,2009年)及其三个层次(协会、干预和反事实)的启发下,本文件介绍了RE-IMAGINE,这是一个描述LMS推理能力等级结构的框架,以及一个自动管道,在不同层次产生问题的变化。RE-IMAGINE通过改变中间象征性代表制的问题,产生了许多无法单靠记忆来解决的问题。此外,该框架是一般性的,可以跨越推理领域,包括数学、代码和逻辑。我们展示了我们用来评估LMMS若干家族的四个广泛使用的基准框架,并在对模型提出质疑时看到业绩的下降。这些评估表明,过去的业绩依赖统计回顾的程度,并打开了进一步研究的大门,以整个推理层次的技能为目标。


Article 42

Title@2025-06-18 (3): AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need

Title: AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need AgentGroupChat-V2: Divide-and-Conquer ist das, was ein LLM-basiertes Multi-Agent-System braucht GroupChat-V2:基于LLM的多种机构系统需要什么 2506.15451v1

Authors (15): Zhouhong Gu, Xiaoxuan Zhu, Yin Cai, Hao Shen, Xingzhou Chen, Qingyi Wang, Jialin Li, Xiaoran Shi, Haoran Guo, Wenxuan Huang, Hongwei Feng, Yanghua Xiao, Zheyu Ye, Yao Hu, Shaosheng Cao

Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and number of agents increases. We introduces AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures enabling dependency management and distributed concurrent processing. (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics. (3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition. Extensive experiments demonstrate AgentGroupChat-V2’s superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval. Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems where improvements exceed 11 percentage points compared to state-of-the-art baselines. These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios. Code is available at https://github.com/MikeGu721/AgentGroupChat-V2.

以大型语言模式为基础的多试剂系统在社会模拟和复杂任务解决领域展现出巨大的潜力;然而,目前的框架在系统结构设计、跨部通用性和绩效保障方面面临重大挑战,特别是任务复杂性和代理数量增加。我们引入了GroupChat-V2代理公司,这是一个通过三个核心创新来应对这些挑战的新框架:(1) 分解和分解完全平行的架构,将用户查询分解到有利于依赖管理的分级任务森林结构中,并同时进行处理。(2) 动态选择不同功能LLLM组合和基于任务特点的互动模式的适应性协作引擎。(3) 代理组织优化战略,结合分解方法,以高效解构问题。广泛的实验显示GroupChat-V2代理公司在各个领域的优异性业绩,在GSM8K上实现了91.50%的精确度(将最佳基线比5.6个百分点)、对竞争-全级的AME(将其他方法更接近一倍)的准确度,和对HumanEval的79. 20%的通度。业绩优势日益突出的任务困难,特别是在5级的MATH问题,其中改进了11个基准-Bral-ILA系统,将确认在通用的推算中,这些基础-ral-ral-ral-I-C的进度中,这些基准-rum-C的进度的进度的进度的进度比对州的进度的进度的进度是有效的计算。


Article 43

Title@2025-06-18 (3): Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models

Title: Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models Probabilistische Aggregation und gezielte Einbettung von Optimierungen für die kollektive moralische Vernunft in großen Sprachmodellen 大语言模式中集体道德理由的概率集中和定向嵌入最佳优化 2506.14625v2

Authors (5): Chenchen Yuan, Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs’ moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.

大型语言模型(LLMs)显示了令人印象深刻的道德推理能力,但在面临复杂、多因素的道德困境时,它们往往存在差异。为解决这些差异,我们提议了一个框架,将多种LLMs的道德判断综合成一个集体制定的道德判断,对明显偏离这一共识的模型进行调整。我们的综合机制将持续的道德可接受分数(超越二进制标签)结合为集体概率,按模型可靠性加权贡献。对于错误模型来说,一个有针对性的嵌入-优化程序微调符号嵌入道德哲学理论,最大限度地减少联署材料对共识的分歧,同时保持语义完整性。关于大规模社会道德困境数据集的实验表明我们的方法是建立牢固的共识,提高个人模型的忠实性。这些发现突出了数据驱动的道德分数(超越二进制标签)在多模式中的道德一致性价值,以及它对于更安全、更一致的AI系统的潜力。


Article 44

Title@2025-06-18 (3): Understanding GUI Agent Localization Biases through Logit Sharpness

Title: Understanding GUI Agent Localization Biases through Logit Sharpness Verständnis der Lokalisierung von GUI-Agenten durch Logit-Schärfung 通过 Lologit 锐化理解图形用户界面代理界面本地化分线 2506.15425v1

Authors (5): Xingjian Tao, Yiwei Wang, Yujun Cai, Zhicheng Yang, Jing Tang

Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems by grounding language into spatial actions. Despite their promising performance, these models frequently exhibit hallucinations-systematic localization errors that compromise reliability. We propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, revealing nuanced failure modes beyond traditional accuracy metrics. To better quantify model uncertainty, we introduce the Peak Sharpness Score (PSS), a metric that evaluates the alignment between semantic continuity and logits distribution in coordinate prediction. Building on this insight, we further propose Context-Aware Cropping, a training-free technique that improves model performance by adaptively refining input context. Extensive experiments demonstrate that our framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior.

多式大型语言模型(MLLMs)使GUI代理商能够与操作系统互动,将语言作为空间行动的基础。这些模型尽管表现有希望,但经常出现幻觉-系统本地化错误,损害可靠性。我们建议了一个细微的评估框架,将模型预测分为四种不同类型,揭示超越传统精确度尺度的细微故障模式。为了更好地量化模型不确定性,我们引入了峰夏分计(PSS),该标尺评估了语义连续性和逻辑分布在协调预测中的一致。我们根据这一洞察,进一步提出了“环境软件裁剪”这一无培训技术,通过适应性改进投入环境来改进模型性能。广泛的实验表明,我们的框架和方法提供了可操作的洞察力,并增强了图形代理商行为的可解释性和可靠性。


Article 45

Title@2025-06-18 (3): Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning

Title: Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning Gezielte Lexische Injektion: Entriegelung der latenten Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning 定向射入:通过早期Layer LoRA精准发射在Lugha-Llama解锁Lugha-Llama的中端交叉对齐 2506.15415v1

Authors (1): Stanley Ngugi

Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their performance in low-resource languages (LRLs), such as Swahili, often lags due to data scarcity and underrepresentation in pre-training. A key challenge is achieving robust cross-lingual lexical alignment, crucial for tasks like translation and cross-lingual information retrieval. This paper introduces Targeted Lexical Injection (TLI), a novel and efficient fine-tuning approach. We first demonstrate that Lugha-Llama-8B-wura, a Swahili-centric LLM, exhibits strong, near-perfect lexical alignment for Swahili-English word pairs in its early internal layers (specifically Layer 2, with ~0.99998 average cosine similarity based on a pilot study), a capability not fully reflected in its final output representations (baseline ~0.32 similarity on our evaluation set). TLI leverages this insight by using Low-Rank Adaptation (LoRA) and a contrastive learning objective to fine-tune the model, specifically targeting embeddings from this empirically identified optimal early layer. Our experiments show that TLI significantly improves the output-level lexical alignment for 623 trained Swahili-English word pairs, increasing average cosine similarity from 0.3211 to 0.4113 (+28.08%, p < 1.33 x 10^-240). More importantly, these improvements generalize remarkably well to 63 unseen control word pairs, with similarity increasing from 0.3143 to 0.4033 (+28.32%, p < 7.17 x 10^-27). These findings suggest TLI enhances the model’s ability to preserve and propagate its inherent early-layer cross-lingual knowledge, offering a parameter-efficient and effective strategy for improving lexical alignment in LRL-focused LLMs.

大型语言模型(LLMS) 展示了非凡的能力, 然而其表现却表现在低资源语言(LLLLs), 如斯瓦希里语, 通常由于数据稀缺和在培训前的代表性不足而落后。 一个关键的挑战是如何实现强有力的跨语言词汇一致,这对翻译和跨语言信息检索等任务至关重要。 本文引入了一种创新而高效的微调方法(TLMs) 。 我们首先展示了Lugha-Llama-8B-wura, 以斯瓦希里语为中心的LLLLLMM, 显示, 斯瓦希里语-英语配对在早期的内部层(特别是第二层, 以试点研究为基础, ~ 91398 平均连字符相似 ) 。 本文引入了目标性莱雅特注射器(TLLLI) , 通过使用低Rank 模式(LORA) 和对比性学习目标来微调模型, 具体定位Swahili- 6 最优化的早期一级(Twalili), 递增的TLILS- 2018) 数据排序。


Article 46

Title@2025-06-18 (3): PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice

Title: PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice PsychBench: Ein umfassender und professioneller Maßstab für die Bewertung der Leistungsfähigkeit von LLM-unterstützter psychiatrischer klinischer Praxis 精神病时区:评估LLLM协助的精神病临床实践业绩的全面和专业基准 2503.01903v2

Authors (10): Shuyu Liu, Ruoxi Wang, Ling Zhang, Xuequan Zhu, Rui Yang, Xinzhu Zhou, Fei Wu, Zhi Yang, Cheng Jin, Gang Wang

The advent of Large Language Models (LLMs) offers potential solutions to address problems such as shortage of medical resources and low diagnostic consistency in psychiatric clinical practice. Despite this potential, a robust and comprehensive benchmarking framework to assess the efficacy of LLMs in authentic psychiatric clinical environments is absent. This has impeded the advancement of specialized LLMs tailored to psychiatric applications. In response to this gap, by incorporating clinical demands in psychiatry and clinical data, we proposed a benchmarking system, PsychBench, to evaluate the practical performance of LLMs in psychiatric clinical settings. We conducted a comprehensive quantitative evaluation of 16 LLMs using PsychBench, and investigated the impact of prompt design, chain-of-thought reasoning, input text length, and domain-specific knowledge fine-tuning on model performance. Through detailed error analysis, we identified strengths and potential limitations of the existing models and suggested directions for improvement. Subsequently, a clinical reader study involving 60 psychiatrists of varying seniority was conducted to further explore the practical benefits of existing LLMs as supportive tools for psychiatrists of varying seniority. Through the quantitative and reader evaluation, we show that while existing models demonstrate significant potential, they are not yet adequate as decision-making tools in psychiatric clinical practice. The reader study further indicates that, as an auxiliary tool, LLM could provide particularly notable support for junior psychiatrists, effectively enhancing their work efficiency and overall clinical quality. To promote research in this area, we will make the dataset and evaluation framework publicly available, with the hope of advancing the application of LLMs in psychiatric clinical settings.

大型语言模型(LLMS)的出现为解决诸如医疗资源短缺和精神病临床实践诊断一致性低等问题提供了潜在的解决办法。尽管存在这种潜力,但仍然缺乏一个强有力和全面的基准框架,以评估LLMS在真正的精神病临床环境中的功效,这阻碍了专门为精神病应用而专门设计的LLMS的发展。针对这一差距,我们提议了一个基准系统,即ScellBench,以评价LLMS在精神病临床环境中的实际表现。我们利用ScutBench对16 LMS进行了全面的临床评估,并调查了迅速设计、思维链推理、投入文本长度和具体领域知识微调对模型业绩的影响。我们通过详细的错误分析,查明了现有模型的长处和潜在局限性,并提出了改进方向。随后,我们进行了涉及60名年长不同的精神病学家的临床研究,以进一步探讨现有的LMS作为支持工具对资历不同的精神病学家的实际好处。我们通过定量和读者评估表明,虽然现有的模型表明有相当大的潜力,但它们尚不足以作为精神病临床临床实践的推进工具,特别是提高临床临床研究的质量。


Article 47

Title@2025-06-18 (3): PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims

Title: PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims PEDANTIC: Ein Datensatz für die automatische Prüfung der Wirksamkeit von Patentansprüchen PEDANTIC: 自动审查专利索赔的缺陷数据集 2505.21342v3

Authors (4): Valentin Knappich, Annemarie Friedrich, Anna Hätty, Simon Razniewski

Patent claims define the scope of protection for an invention. If there are ambiguities in a claim, it is rejected by the patent office. In the US, this is referred to as indefiniteness (35 U.S.C {\S} 112(b)) and is among the most frequent reasons for patent application rejection. The development of automatic methods for patent definiteness examination has the potential to make patent drafting and examination more efficient, but no annotated dataset has been published to date. We introduce PEDANTIC (Patent Definiteness Examination Corpus), a novel dataset of 14k US patent claims from patent applications relating to Natural Language Processing (NLP), annotated with reasons for indefiniteness. We construct PEDANTIC using a fully automatic pipeline that retrieves office action documents from the USPTO and uses Large Language Models (LLMs) to extract the reasons for indefiniteness. A human validation study confirms the pipeline’s accuracy in generating high-quality annotations. To gain insight beyond binary classification metrics, we implement an LLM-as-Judge evaluation that compares the free-form reasoning of every model-cited reason with every examiner-cited reason. We show that LLM agents based on Qwen 2.5 32B and 72B struggle to outperform logistic regression baselines on definiteness prediction, even though they often correctly identify the underlying reasons. PEDANTIC provides a valuable resource for patent AI researchers, enabling the development of advanced examination models. We will publicly release the dataset and code.

专利专利主张界定了发明的保护范围。 如果专利主张中存在含混不清之处,专利办公室会拒绝该专利主张。 在美国,这被称为无限期(35 U.S.C ~S} 112(b)),是专利申请被拒绝的最常见原因之一。 开发专利确定性专利审查的自动方法有可能提高专利起草和审查的效率,但迄今为止还没有公布附加说明的数据集。 我们引入了PEDANTIT(专利拒绝测试公司),这是一套14k美国专利申请专利申请的全新数据集,涉及自然语言处理(NLP),附有无限期理由。我们用完全自动的管道建造PEDANTIC,从USPTO检索办公室行动文件,使用大语言模型(LLMS)来提取不定期性审查的理由。 一项人类验证研究证实了管道在生成高质量说明方面的准确性。 为了更深入的分类指标,我们实施了LM-J(LP-J)评估, 比较了每一个模型的免费逻辑推理学原理,我们经常通过精确的二十二号的精确的精确的精确的精确的精确的精确的精确的推算。


Article 48

Title@2025-06-18 (3): COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation

Title: COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation COSMMIC: Kommentarsensitive multimodale Mehrsprachige indische Corpus für Zusammenfassung und Headline-Generierung COSMIC: 用于总结和标题代代的多语种印度公司评论-敏感多语种多语种公司 2506.15372v1

Authors (7): Raghvendra Kumar, S. A. Mohammed Salman, Aryan Sahu, Tridib Nandi, Pragathi Y. P., Sriparna Saha, Jose G. Moreno

Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset’s effectiveness, we employ state-of-the-art language models such as LLama3 and GPT-4. We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise using a dedicated comment classifier using IndicBERT, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.

尽管在英文和中文的多语种评论和多语种汇总方面取得了进展,但印度语文的研究仍然有限,通过引入COSMIC这一具有先创性的评论敏感多语种多语种的多语种数据集来弥补这一差距。COSMIC由4 959个文章图像配对和24 484个读者评论组成,有24 484个包括所有语文的地面实况摘要组成。我们的方法通过整合读者的见解和反馈,增强了摘要。我们探索了四种组合的总结和标题生成:(1) 仅使用文章文本,(2) 纳入用户评论,(3) 利用图像和图像,(4) 合并文本、评论和图像。为了评估数据集的有效性,我们采用了诸如Lama3和GPT-4等最先进的语言模式。我们开展了一项全面研究,以评价不同组成部分组合,包括确定支持性评论,利用专门的评论分类器过滤噪音,以及用多种语言的CLIP分类从图像中提取有价值的见解。这有助于确定天然语言生成任务的最有效配置(NLG)和(4) 组合的文本、文字、语言升级的MLSM 方法与许多现有用户/ML的反馈方法不同。


Article 49

Title@2025-06-18 (3): SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models’ Knowledge of Indian Culture

Title: SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models’ Knowledge of Indian Culture SANSKRITI: Ein umfassender Benchmark für die Bewertung der Kenntnisse indischer Kultur von Sprachmodellen SANSKRITI:评估语言模型对印度文化知识的综合基准 2506.15355v1

Authors (5): Arijit Maji, Raghvendra Kumar, Akash Ghosh, Anushka, Sriparna Saha

Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models’ comprehension of India’s rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India’s cultural tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs.

语言模型(LMS)是塑造现代工作流程的不可或缺的工具,但其全球效力取决于对当地社会文化背景的理解。为此,我们引入了SANSKRITI,这是一个旨在评价语言模型对印度丰富文化多样性的理解的基准。由21,853个精心策划的问答对等组成,覆盖28个州和8个联合领地,SANSKRITI是测试印度文化知识的最大数据集。它涵盖印度文化的16个关键属性:仪式和仪式、历史、旅游、烹饪、舞蹈和音乐、服装、语言、艺术、节日、宗教、医学、交通、体育、晚间生活和人格。我们通过提供广泛、文化丰富和多样化的数据集,SANSKRITI为评估和改进对大语言模型的文化理解制定了新的标准。


Article 50

Title@2025-06-18 (3): DeVisE: Behavioral Testing of Medical Large Language Models

Title: DeVisE: Behavioral Testing of Medical Large Language Models DeVisE: Verhaltenstests von medizinischen großen Sprachmodellen DevisE:大语言医学模型行为测试 2506.15339v1

Authors (5): Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto

Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay. Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.

大型语言模型(LLMS)越来越多地用于临床决策支持,但目前的评估方法往往无法区分真正的医学推理和表面模式。我们引入了DeVisE(人口和生命标志评估),这是一个行为测试框架,用于细微的临床理解。我们从MIMIMI-IV构建了ICU排放注释的数据集,生成了原始(现实世界)和基于模板(合成)版本的数据集,这些版本具有以人口(年龄、性别、族裔)和重要标志属性为对象的、可控的单一可变反事实。我们评估了五大LMS,涵盖一般用途和医学上经过微调的变异。我们评估了五大LMS,覆盖了通用和医学上经过微调的变异。我们通过以下方法评估了行为模型:(1) 投入层面的敏感度―― 如何反效果改变笔注的可能性;(2) 下游推理 — 它们如何影响预测的住院时间长度。我们的结果表明,零镜头模型展示了更加一致的反事实推理模式,而微模型往往更稳定,但对临床上有意义的变化反应不那么敏感。值得注意的是,人口因素在子上持续地影响着影响着,但持续地影响着各种产出,强调透明性测测测测测测测测测测测测测测的实验室的正确性。


Article 51

Title: GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations GreekBarBench: Ein anspruchsvolles Benchmark für freie Text-Rechtsveranlagung und Verweisungen 希腊Barbench:自由文本法律理由和引用的质疑性基准 2505.17267v2

Authors (4): Odysseas S. Chlapanis, Dimitrios Galanis, Nikolaos Aletras, Ion Androutsopoulos

We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.

我们引入了希腊Barbench, 这是一项评估希腊律师协会考试五个不同法律领域法律问题的LLMs的基准,要求引用法定条款和案件事实。为了应对自由文本评估的挑战,我们提议了一个三维评分系统,同时采用LLM-as-a-judge方法。我们还开发了一个元评价基准,以评估LLM法官与人类专家评估之间的相互关系,揭示了简单、跨范围评分改善了它们的一致性。我们对13个专有和开放重量的LMs的系统评估表明,即使最佳模型优于平均专家评分,它们也低于95%的专家。


Article 52

Title@2025-06-18 (3): When and How Unlabeled Data Provably Improve In-Context Learning

Title: When and How Unlabeled Data Provably Improve In-Context Learning Wann und wie unmarkierte Daten nachweislich das In-Context-Lernen verbessern 何时以及如何改进内文学习 2506.15329v1

Authors (6): Yingcong Li, Xiangyu Chang, Muti Kara, Xiaofeng Liu, Amit Roy-Chowdhury, Samet Oymak

Recent research shows that in-context learning (ICL) can be effective even when demonstrations have missing or incorrect labels. To shed light on this capability, we examine a canonical setting where the demonstrations are drawn according to a binary Gaussian mixture model (GMM) and a certain fraction of the demonstrations have missing labels. We provide a comprehensive theoretical study to show that: (1) The loss landscape of one-layer linear attention models recover the optimal fully-supervised estimator but completely fail to exploit unlabeled data; (2) In contrast, multilayer or looped transformers can effectively leverage unlabeled data by implicitly constructing estimators of the form $\sum_{i\ge 0} a_i (X^\top X)^iX^\top y$ with $X$ and $y$ denoting features and partially-observed labels (with missing entries set to zero). We characterize the class of polynomials that can be expressed as a function of depth and draw connections to Expectation Maximization, an iterative pseudo-labeling algorithm commonly used in semi-supervised learning. Importantly, the leading polynomial power is exponential in depth, so mild amount of depth/looping suffices. As an application of theory, we propose looping off-the-shelf tabular foundation models to enhance their semi-supervision capabilities. Extensive evaluations on real-world datasets show that our method significantly improves the semisupervised tabular learning performance over the standard single pass inference.

最近的研究表明, 即使演示显示丢失或不正确的标签, 文中学习(ICL) 也可以有效。 为了阐明这种能力, 我们检查一个卡通环境, 演示根据一个二进制高斯混合模型( GMM) 绘制, 演示中一定部分缺少标签。 我们提供了全面的理论研究, 以显示:(1) 单层线性关注模型的损耗景观恢复了最理想的完全监督的估量器, 但完全无法利用未贴标签的数据; (2) 对比之下, 多层或环形变压器可以通过隐含构建一个以 $\ sumi\ ge 0} a_i ( Xtop X) 和 $y 字形显示演示演示标值 和部分观察标签( 缺失条目设置为零 ) 。 我们将多层性能分类描述为深度的函数, 并绘制连接到期待最大化, 一种在半超前级的双层化的 标值算法, 通常用于半超级深度学习中的隐性假标签算算算法 。 , 将模型显示我们模拟深度的模型显示一个超度的深度的模型。 。 。 极级模型显示我们深度的深度的模型显示的深度的深度的深度 。 。 。 。 。 水平的模型显示的模型显示的模型显示, 我们的深度的深度的模型的深度的模型的模型的模型的深度 。


Article 53

Title@2025-06-18 (3): AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

Title: AIn’t Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation AIn’t Not Nothing But a Survey? Mit großen Sprachmodellen für die Codierung Deutsch Open-Ended Survey Responses on Survey Motivation 使用大语言模型对德国关于调查动机的开放式调查答复进行编码 2506.14634v2

Authors (4): Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessica Daikeler

The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs’ performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs’ unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.

由于LLMs在语言上能够有效地替代耗费时间的手工编码和受监督的机器学习模式的预培训,由于关于这个专题的大多数现有研究侧重于与非复杂专题或单一LLMs有关的英语答复,因此不清楚它的调查结果是否概括了这些分类的质量,以及这些分类的质量如何与既定方法相比较。在本研究中,我们调查在多大程度上可以使用不同的LMs来规范其他情况下的开放式调查答复,利用德国关于参与调查的原因的数据作为实例。我们比较了一些最先进的LLMs和一些快速的方法,并通过使用人类专家的编码来评价LMs的业绩。LMs的总体业绩差异很大,只有经过精细调的LMM才能达到令人满意的预测性业绩水平。在使用准确的LMM方法时,对业绩的差别性差异以准确的LM方法为条件。最后,LMs在调查的不同类别下,对参与调查原因的不平等的分类工作表现,作为参与的原因,作为一个例子,我们比较一些最新的LLMsms和一些快速的方法,在进行这种分析时,我们需要对这些研究的公开性分析。


Article 54

Title@2025-06-18 (3): HiURE: Hierarchical Exemplar Contrastive Learning for Unsupervised Relation Extraction

Title: HiURE: Hierarchical Exemplar Contrastive Learning for Unsupervised Relation Extraction HiURE: Hierarchisches Beispiel Kontrastives Lernen für unüberwachte Beziehungsextraktion HIURE: 为不受监督的关系采掘而进行等级主义的高级特制反竞争学习 2205.02225v4

Authors (6): Shuliang Liu, Xuming Hu, Chenwei Zhang, Shu`ang Li, Lijie Wen, Philip S. Yu

Unsupervised relation extraction aims to extract the relationship between entities from natural language sentences without prior information on relational scope or distribution. Existing works either utilize self-supervised schemes to refine relational feature signals by iteratively leveraging adaptive clustering and classification that provoke gradual drift problems, or adopt instance-wise contrastive learning which unreasonably pushes apart those sentence pairs that are semantically similar. To overcome these defects, we propose a novel contrastive learning framework named HiURE, which has the capability to derive hierarchical signals from relational feature space using cross hierarchy attention and effectively optimize relation representation of sentences under exemplar-wise contrastive learning. Experimental results on two public datasets demonstrate the advanced effectiveness and robustness of HiURE on unsupervised relation extraction when compared with state-of-the-art models.

为了克服这些缺陷,我们提议了一个名为HiURE的新式对比学习框架,它能够利用跨级关注从关系特征空间获得等级信号,并有效地优化在超常、明智、对比性学习下判决的比重。 两种公共数据集的实验结果表明,HiURE在与最先进的模型相比,在不受监督的关系提取方面,具有更高的效力和稳健性。


Article 55

Title@2025-06-18 (3): The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants

Title: The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants Die Avengers: Ein einfaches Rezept für die Vereinigung kleinerer Sprachmodelle, um proprietäre Riesen herauszufordern 《复仇者:将小型语言模式联合起来挑战产权巨人挑战小型语言模式的简单食谱》 2505.19797v3

Authors (14): Yiqun Zhang, Hao Li, Chenxu Wang, Linyao Chen, Qiaosheng Zhang, Peng Ye, Shi Feng, Daling Wang, Zhen Wang, Xinrun Wang, Jia Xu, Lei Bai, Wanli Ouyang, Shuyue Hu

Proprietary giants are increasingly dominating the race for ever-larger language models. Can open-source, smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers – a simple recipe that leverages the collective intelligence of these smaller models. The Avengers builds upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: scores each model’s performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response with repeated sampling. Remarkably, with 10 open-source models (~7B parameters each), the Avengers surpasses GPT-4o, 4.1, and 4.5 in average performance across 15 diverse datasets spanning mathematics, coding, logical reasoning, general knowledge, and affective tasks. In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization, and remains robust across various embedding models, clustering algorithms, ensemble strategies, and values of its sole parameter – the number of clusters.

本地巨人正在日益支配着越来越巨大的语言模型的竞争。 开放源码、 小模型能否在一系列广泛的任务中保持竞争力? 在本文中, 我们介绍复仇者 – – 利用这些较小模型的集体智慧的简单配方。 复仇者以四个轻量操作为基础:(一) 嵌入: 使用一个嵌入模型的文本来编码查询;(二) 分组: 根据其语义相似性来进行集体查询;(三) 评分: 每个模型在每一组内的业绩得分;(四) 投票: 通过反复抽样和投票来改进产出。 在推断时间里, 每一个查询都嵌入并分配到最近的组内。 该组内最优秀的模型被选中, 以反复抽样来生成响应。 值得注意的是, 10个开源模型(每个 ~ 7B 参数) 超过 GPT-4, 4. 1 和 4. 平均性能在15个不同的数据集中, 包括数学、 编码、 逻辑推理、 一般知识和影响任务。 特别是, 它超过了GPT-41, 在18 % 的高级数学任务中, 和 代号中, 继续执行。


Article 56

Title@2025-06-18 (3): ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Title: ConLID: Supervised Contrastive Learning for Low-Resource Language Identification ConLID: Beaufsichtigtes kontrastives Lernen für die Sprachidentifizierung mit geringer Ressource CONLID: 低资源语言识别的受监督的反竞争学习 2506.15304v1

Authors (4): Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut

Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages – often limited to single-domain data, such as the Bible – continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. Through an extensive analysis, we show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models.

语言识别(LID)是整理多语种LLM预培训网络爬行公司的关键步骤。虽然许多关于LID模式培训的研究侧重于收集多种培训数据以提高绩效,但低资源语言(通常仅限于单一域数据,如圣经)仍然表现不佳。为了解决这些班级不平衡和偏见问题,我们提出一种新的有监督的对比学习(SCL)方法,以学习低资源语言的域差异表。通过广泛分析,我们表明我们的方法提高了3.2%的低资源语言域域外数据LID绩效,显示了其在加强LID模式方面的效力。


Article 57

Title@2025-06-18 (3): Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment

Title: Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment Cohort Discovery: Eine Studie über LLM-Assisted Clinical Trial Recruitment Cohort发现:关于LLM协助临床试验征聘的调查 2506.15301v1

Authors (4): Shrestha Ghosh, Moritz Schneider, Carina Reinicke, Carsten Eickhoff

Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific and LLMs with their ability to consolidate distributed knowledge hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. We critically examine existing benchmarks, approaches and evaluation frameworks, the challenges to adopting LLM technologies in clinical research and exciting future directions.

理疗所最近的进展大大改善了一般领域NLP的任务,然而,在临床试验招聘等关键领域采用LLP的方法仍然有限,由于试验是用自然语言设计的,病人数据作为结构化和非结构化的文本,把试验与病人从LLMS的知识汇总和推理能力中受益相匹配的任务。古典方法具有具体试验性质,而且LLMS有能力巩固分布式知识,因此具有建立更一般性解决办法的潜力。最近应用LLM辅助方法依靠专有模式和评价基准薄弱。在这项调查中,我们首先分析了临床试验招聘中的试验-病人匹配和基于LLMM的新兴做法。我们严格地审查了现有的基准、方法和评估框架、在临床研究中采用LM技术的挑战以及令人振奋的未来方向。


Article 58

Title@2025-06-18 (3): An Effective Incorporating Heterogeneous Knowledge Curriculum Learning for Sequence Labeling

Title: An Effective Incorporating Heterogeneous Knowledge Curriculum Learning for Sequence Labeling Ein effektives Einbinden heterogenes Wissenscurriculum Lernen für die Sequenzkennzeichnung 有效纳入异种知识课程学习,以建立序列标签 2402.13534v2

Authors (5): Xuemei Tang, Jun Wang, Qi Su, Chu-ren Huang, Jinghang Gu

Sequence labeling models often benefit from incorporating external knowledge. However, this practice introduces data heterogeneity and complicates the model with additional modules, leading to increased expenses for training a high-performing model. To address this challenge, we propose a two-stage curriculum learning (TCL) framework specifically designed for sequence labeling tasks. The TCL framework enhances training by gradually introducing data instances from easy to hard, aiming to improve both performance and training speed. Furthermore, we explore different metrics for assessing the difficulty levels of sequence labeling tasks. Through extensive experimentation on six Chinese word segmentation (CWS) and Part-of-speech tagging (POS) datasets, we demonstrate the effectiveness of our model in enhancing the performance of sequence labeling models. Additionally, our analysis indicates that TCL accelerates training and alleviates the slow training problem associated with complex models.

然而,这种做法引入了数据差异,使模型与更多模块复杂化,导致培训高性能模型的费用增加。为了应对这一挑战,我们提议了专门为排序标签任务设计的两阶段课程学习框架(TCL),TCL框架通过逐步引入简单到硬的数据实例加强培训,以提高性能和培训速度。此外,我们还探索了不同的指标,以评估序列标签任务的难度程度。通过对六个中文单词分割和部分语音标签数据集的广泛实验,我们展示了我们模型在提高序列标签模型性能方面的有效性。此外,我们的分析表明,TCL加快了培训速度,缓解了与复杂模型相关的缓慢培训问题。


Article 59

Title@2025-06-18 (3): Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

Title: Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments Thunder-DeID: Genauer und effizienter De-Identifizierungsrahmen für Urteile des koreanischen Gerichts Thunder-DeID:韩国法院判决的准确和有效的取消识别框架 2506.15266v1

Authors (5): Sungen Hahm, Heejin Kim, Gyuseong Lee, Hyunji Park, Jaejin Lee

To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.

为了确保公开诉诸司法与个人数据保护之间的平衡,韩国司法机关授权在公开披露之前取消对法院判决的识别,但是,目前的取消身份程序不足以在遵守严格法律要求的同时处理规模的法院判决;此外,个人身份标识的法律定义和分类模糊,不适合技术解决办法;为应对这些挑战,我们提议了一个称为雷电-DeID的取消身份框架,该框架与相关法律和做法相一致;具体地说,我们(一) 建立和发布第一个韩国法律数据集,其中载有附加说明的判决以及相应实体清单;(二) 系统地分类个人身份识别信息(PII),以及(三) 开发一个基于端至端的深线网络(DNN)脱身份标识管道。我们的实验结果表明,我们的模型在不确定法院判决方面达到了最新业绩。


Article 60

Title@2025-06-18 (3): Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition

Title: Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition 自动化文献审查大语言模式:对参考资料生成、摘要编写和审查构成的评价 2412.13612v4

Authors (3): Xuemei Tang, Xufeng Duan, Zhenguang G. Cai

Large language models (LLMs) have emerged as a potential solution to automate the complex processes involved in writing literature reviews, such as literature collection, organization, and summarization. However, it is yet unclear how good LLMs are at automating comprehensive and reliable literature reviews. This study introduces a framework to automatically evaluate the performance of LLMs in three key tasks of literature writing: reference generation, literature summary, and literature review composition. We introduce multidimensional evaluation metrics that assess the hallucination rates in generated references and measure the semantic coverage and factual consistency of the literature summaries and compositions against human-written counterparts. The experimental results reveal that even the most advanced models still generate hallucinated references, despite recent progress. Moreover, we observe that the performance of different models varies across disciplines when it comes to writing literature reviews. These findings highlight the need for further research and development to improve the reliability of LLMs in automating academic literature reviews.

大型语言模型(LLMS)已成为使文学文献评论,如文献文献的收集、组织和归纳等复杂过程自动化的一个潜在解决办法,然而,尚不清楚LLMs在综合和可靠的文献审查自动化方面做得有多好;这项研究提出了一个框架,用以自动评价LLMs在文献撰写三项关键任务方面的表现:参考生成、文献摘要和文献审查构成;我们采用多层面评价指标,评估生成参考资料中的幻觉率,并衡量文献摘要和成份相对于人文著作的语义覆盖面和实际一致性;实验结果显示,尽管最近取得了进展,即使是最先进的模型也仍然产生有幻觉的参考资料;此外,我们注意到,不同模型在撰写文献审查时的表现各学科不尽相同;这些结论突出表明,需要进一步研究和开发,以提高LMs在学术文献自动化审查中的可靠性。


Article 61

Title@2025-06-18 (3): TopClustRAG at SIGIR 2025 LiveRAG Challenge

Title: TopClustRAG at SIGIR 2025 LiveRAG Challenge TopClustRAG auf der SIGIR 2025 LiveRAG Challenge SIGIR 2025 RiveRAG挑战的顶端Clustrag 2506.15246v1

Authors (3): Juli Bakagianni, John Pavlopoulos, Aristidis Likas

We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.

我们介绍了为LiveRAG挑战开发的“TopClustraG”(RAG)系统,这是一个为LiveRAG挑战开发的回收型一代(RAG)系统,它评估对大型网络公司回答的端到端问题。我们的系统采用混合检索战略,将稀少和密集的指数结合起来,然后将K-Means群集成类似的语系分组。每个组的代表性段落都用来为大型语言模型(LLM)构建针对具体集群的提示,产生中间答案,这些答案经过过滤、重新排序并最终合成为单一的综合性回应。这一多阶段管道加强了对检索证据的答案的多样性、相关性和忠诚性。在精细Web样本-10BT数据集上,TopClustrag在正式领导板上对忠实和正确性排名第2位,展示了基于集群的环境过滤和快速汇总在大型RAG系统中的有效性。


Article 62

Title@2025-06-18 (3): Aligning AI Research with the Needs of Clinical Coding Workflows: Eight Recommendations Based on US Data Analysis and Critical Review

Title: Aligning AI Research with the Needs of Clinical Coding Workflows: Eight Recommendations Based on US Data Analysis and Critical Review Ausrichtung der KI-Forschung auf die Bedürfnisse klinischer Codierungs-Workflows: Acht Empfehlungen basierend auf US-Datenanalyse und kritischer Überprüfung 使AI研究与临床编码工作流程的需要相一致:基于美国数据分析和关键审查的八项建议 2412.18043v2

Authors (4): Yidong Gan, Maciej Rybinski, Ben Hachey, Jonathan K. Kummerfeld

Clinical coding is crucial for healthcare billing and data analysis. Manual clinical coding is labour-intensive and error-prone, which has motivated research towards full automation of the process. However, our analysis, based on US English electronic health records and automated coding research using these records, shows that widely used evaluation methods are not aligned with real clinical contexts. For example, evaluations that focus on the top 50 most common codes are an oversimplification, as there are thousands of codes used in practice. This position paper aims to align AI coding research more closely with practical challenges of clinical coding. Based on our analysis, we offer eight specific recommendations, suggesting ways to improve current evaluation methods. Additionally, we propose new AI-based methods beyond automated coding, suggesting alternative approaches to assist clinical coders in their workflows.

临床编码对于保健帐单和数据分析至关重要。人工临床编码是劳动密集型和易出错的,它推动了对全过程自动化的研究。然而,我们基于美国英语电子健康记录和使用这些记录进行的自动编码研究进行的分析表明,广泛使用的评价方法与实际临床环境不相符。例如,侧重于前50种最常见代码的评价过于简单化,因为在实践中使用了数千种代码。本立场文件旨在将AI编码研究与临床编码的实际挑战更紧密地结合起来。根据我们的分析,我们提出了八项具体建议,提出了改进当前评价方法的方法。此外,我们提出了除自动编码之外新的基于AI的方法,提出了协助临床编码员工作流程的替代方法。


Article 63

Title@2025-06-18 (3): Research on Graph-Retrieval Augmented Generation Based on Historical Text Knowledge Graphs

Title: Research on Graph-Retrieval Augmented Generation Based on Historical Text Knowledge Graphs Forschung zur grafisch retrievalgenerierten Generierung anhand historischer Textwissensgraphen 基于历史文本知识图的图-检索检索增强型图-检索增加型研究 2506.15241v1

Authors (5): Yang Fan, Zhang Qi, Xing Wenqian, Liu Chang, Liu Liu

This article addresses domain knowledge gaps in general large language models for historical text analysis in the context of computational humanities and AIGC technology. We propose the Graph RAG framework, combining chain-of-thought prompting, self-instruction generation, and process supervision to create a The First Four Histories character relationship dataset with minimal manual annotation. This dataset supports automated historical knowledge extraction, reducing labor costs. In the graph-augmented generation phase, we introduce a collaborative mechanism between knowledge graphs and retrieval-augmented generation, improving the alignment of general models with historical knowledge. Experiments show that the domain-specific model Xunzi-Qwen1.5-14B, with Simplified Chinese input and chain-of-thought prompting, achieves optimal performance in relation extraction (F1 = 0.68). The DeepSeek model integrated with GraphRAG improves F1 by 11% (0.08-0.19) on the open-domain C-CLUE relation extraction dataset, surpassing the F1 value of Xunzi-Qwen1.5-14B (0.12), effectively alleviating hallucinations phenomenon, and improving interpretability. This framework offers a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research.

本文讨论在计算人文和AIGC技术背景下用于历史文本分析的一般大语言模型方面的领域知识差距,我们提议了用于计算人文和AIGC技术方面历史文本分析的一般大语言模型的域知识差距。我们提议了图表RAG框架,将思维链促进、自我教学生成和过程监督相结合,以创建第一个四个历史特征关系数据集,尽量减少人工注释。这一数据集支持了历史知识的自动化提取,降低了劳动力成本。在图形增强的生成阶段,我们引入了知识图和检索生成之间的协作机制,使一般模型与历史知识更加一致。实验表明,特定域模型Xunzi-Qwen1.5-14B,与简化的中国投入和思维链促进,在相关提取方面实现最佳绩效(F1=0.68)。与GreagraphRAG整合的深Seek模型将F1提高11%(0.08-0.19),在开放的C-CLUE关系提取数据集方面,超过了Xunzi-Qwen1.5-14B(0.12)的F1值,有效地减轻了幻觉现象现象,并改进了历史教科书的可理解性研究框架,为历史文本提供了一种低的解决方案。


Article 64

Title@2025-06-18 (3): Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants

Title: Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants Lost in Variation? Bewertung der NLI-Performance in baskischen und spanischen geografischen Varianten 评价巴斯克和西班牙地理变异性国家LI绩效 2506.15239v1

Authors (3): Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri

In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.

在本文中,我们评估了当前语言技术理解巴斯克语和西班牙语品种的能力。我们把自然语言推论(NLI)作为一个主轴任务,并引入了巴斯克语和西班牙语以及各自变体的新颖的、手工制作的平行数据集。我们对使用只使用编码器和以解码器为基础的大语言模型(LLMS)进行的跨语言和内通学习实验的经验分析显示,在处理语言变异时,特别是在巴斯克,其性能下降。错误分析表明,这一下降并非由于词汇重叠,而是由于语言变异本身。进一步的膨胀实验表明,只使用编码器的模型特别与西巴斯克人(Western Basque)争斗,后者与确定周边方言方(例如西方)离标准更远的语言理论是一致的。所有数据和代码都可以公开查阅。


Article 65

Title@2025-06-18 (3): Dynamic Acoustic Model Architecture Optimization in Training for ASR

Title: Dynamic Acoustic Model Architecture Optimization in Training for ASR Dynamische Akustische Modellarchitektur Optimierung im Training für ASR ASR培训中动态声声学示范建筑结构优化 2506.13180v2

Authors (6): Jingjing Xu, Zijian Yang, Albert Zeyer, Eugen Beck, Ralf Schlueter, Hermann Ney

Architecture design is inherently complex. Existing approaches rely on either handcrafted rules, which demand extensive empirical expertise, or automated methods like neural architecture search, which are computationally intensive. In this paper, we introduce DMAO, an architecture optimization framework that employs a grow-and-drop strategy to automatically reallocate parameters during training. This reallocation shifts resources from less-utilized areas to those parts of the model where they are most beneficial. Notably, DMAO only introduces negligible training overhead at a given model complexity. We evaluate DMAO through experiments with CTC on LibriSpeech, TED-LIUM-v2 and Switchboard datasets. The results show that, using the same amount of training resources, our proposed DMAO consistently improves WER by up to 6% relatively across various architectures, model sizes, and datasets. Furthermore, we analyze the pattern of parameter redistribution and uncover insightful findings.

建筑设计本身就十分复杂。 现有的方法依靠手工艺规则, 需要大量的经验专长, 或者像神经结构搜索这样的自动化方法, 而这些方法是计算密集的。 在本文中, 我们引入了DMAO, 这是一种结构优化框架, 使用增长和下降战略, 在培训期间自动重新分配参数。 这种重新分配将资源从使用较少的地区转移到模型中最有益的部分。 值得注意的是, DMAO 只在特定模式复杂时引入微不足道的培训间接费用。 我们通过在 LibriSpeech、 TED- LIUM- v2 和 交换板数据集上进行CTC实验来评估 DMAO 。 结果表明, 我们提议的DMAO 利用同样的培训资源, 在不同的结构、 模型大小和数据集中不断将WER 相对提高6% 。 此外, 我们分析了参数再分配模式, 并发现了洞察的结果 。


Article 66

Title@2025-06-18 (3): Robust Utility-Preserving Text Anonymization Based on Large Language Models

Title: Robust Utility-Preserving Text Anonymization Based on Large Language Models Robuste Utility-Preserving Text Anonymisierung basierend auf großen Sprachmodellen 基于大语言模式的强力功用-保存文本匿名 2407.11770v2

Authors (3): Tianyu Yang, Xiaodan Zhu, Iryna Gurevych

Anonymizing text that contains sensitive information is crucial for a wide range of applications. Existing techniques face the emerging challenges of the re-identification ability of large language models (LLMs), which have shown advanced capability in memorizing detailed information and reasoning over dispersed pieces of patterns to draw conclusions. When defending against LLM-based re-identification, anonymization could jeopardize the utility of the resulting anonymized data in downstream tasks. In general, the interaction between anonymization and data utility requires a deeper understanding within the context of LLMs. In this paper, we propose a framework composed of three key LLM-based components: a privacy evaluator, a utility evaluator, and an optimization component, which work collaboratively to perform anonymization. Extensive experiments demonstrate that the proposed model outperforms existing baselines, showing robustness in reducing the risk of re-identification while preserving greater data utility in downstream tasks. We provide detailed studies on these core modules. To consider large-scale and real-time applications, we investigate the distillation of the anonymization capabilities into lightweight models. All of our code and datasets will be made publicly available at https://github.com/UKPLab/acl2025-rupta.

现有技术面临着大型语言模型(LLMS)重新确定能力的新挑战,这些模型在对详细信息进行记忆和对分散的图案进行推理以得出结论方面表现出先进的能力。在防范基于LLM的重新确定时,匿名化会危及由此产生的匿名数据在下游任务中的效用。一般而言,匿名化和数据效用之间的相互作用需要在LLMS的范围内更深入地理解。在本文件中,我们提议了一个框架,由基于LLM的三个关键组成部分组成:隐私评价员、效用评价员和优化部分,它们合作工作进行匿名化。广泛的实验表明,拟议的模型超越了现有基线,显示在减少重新确定风险的同时在下游任务中保持更大的数据效用。我们对这些核心单元进行详细研究。为了考虑大规模和实时应用,我们研究如何将地名化能力蒸馏成轻量模型。我们的所有代码和数据元件将公开在 http://PLUB25/commation上提供。


Article 67

Title@2025-06-18 (3): video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Title: video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models Video-SALMONN 2: Bildunterschrift-verbesserte Audio-Visuelle große Sprachmodelle 视频-SALMONN2:字幕-强化视听大语言模式 2506.15220v1

Authors (8): Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang

Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2’s captioning accuracy, reducing the captioning error rates by 28\%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Codes are available at \href{https://github.com/bytedance/video-SALMONN-2}{https://github.com/bytedance/video-SALMONN-2}.

视频包含丰富的信息,以自然语言生成详细和准确的描述是视频理解的一个关键方面。本文介绍视频-SALMONN 2 ,这是一种高级视听大型语言模型(LLM),用于通过定向偏好优化(DPO)加强视频(配对音频)字幕。我们提出新的衡量标准,以评价视频描述的完整性和准确性,这些描述使用DPO得到优化。为了进一步改进培训,我们提议采用新的多轮式DPO(MRDPO)方法,其中包括定期更新DPO参考模型,合并和重新启用LORA模块,作为每轮培训(1 000个步骤)后参数更新的代理,并纳入地盘视频字幕的指导,以稳定进程。实验结果表明,MRODO大大加强视频-SALMONN 2的描述准确性,将字幕误差率降低28。最后的视频-SALMONN 2 模式,只有70亿个参数,超过GPT-4-和Gemini-LAL-2S等领先模型,同时广泛使用高竞争性视频标准。


Article 68

Title@2025-06-18 (3): TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks

Title: TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks TSLFormer: Ein leichtes Transformer-Modell für die türkische Erkennung von Zeichensprache mit skelettalen Markierungen TSL Former: 使用骨骼地标土耳其手语识别的轻量级变换器模型 2505.07890v4

Authors (4): Kutay Ertürk, Furkan Altınışık, İrem Sarıaltın, Ömer Nezih Gerek

This study presents TSLFormer, a light and robust word-level Turkish Sign Language (TSL) recognition model that treats sign gestures as ordered, string-like language. Instead of using raw RGB or depth videos, our method only works with 3D joint positions - articulation points - extracted using Google’s Mediapipe library, which focuses on the hand and torso skeletal locations. This creates efficient input dimensionality reduction while preserving important semantic gesture information. Our approach revisits sign language recognition as sequence-to-sequence translation, inspired by the linguistic nature of sign languages and the success of transformers in natural language processing. Since TSLFormer uses the self-attention mechanism, it effectively captures temporal co-occurrence within gesture sequences and highlights meaningful motion patterns as words unfold. Evaluated on the AUTSL dataset with over 36,000 samples and 227 different words, TSLFormer achieves competitive performance with minimal computational cost. These results show that joint-based input is sufficient for enabling real-time, mobile, and assistive communication systems for hearing-impaired individuals.

本研究展示了TSLFormer, 这是一种轻巧和稳健的字级土耳其手语(TSL)识别模型, 将手势手势按顺序处理, 类似字串的语言。 我们的方法不是使用原始 RGB 或深度视频, 我们的方法只与3D 联合位置- 连接点- 使用谷歌的媒体管道库提取, 以手和骨骼为焦点。 这样可以有效地减少输入维度, 同时保留重要的语义手势信息。 我们的方法是重新审视手势语言作为顺序到顺序翻译的手语识别, 并受手势语言语言语言语言性质和变异器在自然语言处理中的成功启发。 由于 TSLFormer 使用自留机制, 它有效地捕捉手势序列中的时间共生关系, 并突出有意义的运动模式作为语言的展开。 以超过36 000个样本和227个不同单词的AUTSLFormer 数据集进行了评估, 以最低的计算成本实现竞争性性能。 这些结果表明, 联合输入足以为听障者提供实时、 移动和辅助通信系统。


Article 69

Title@2025-06-18 (3): MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs

Title: MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs MinosEval: Distinguishing Factoid und Non-Factoid für maßgeschneiderte, offene QA-Bewertung mit LLMs MinosEval:与LLMM公司一道,区分用于定制的不限成员名额质量保证评价的非事实和非事实 2506.15215v1

Authors (7): Yongqi Fan, Yating Wang, Guandong Wang, Jie Zhai, Jingping Liu, Qi Ye, Tong Ruan

Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities due to different patterns between model responses and reference answers. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptions, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose \textbf{MinosEval}, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies. For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.

不限成员名额回答问题(QA)是评价大型语言模型能力的关键任务。与封闭式质量评估相比,它要求更长期的回答说明、更细化的推理过程和多种表达方式,使精细和可解释的自动评估变得关键和具有挑战性。传统指标,如ROUGE和BERTScore 努力捕捉因模型答复和参考答复的不同模式而导致的语义相似性。目前基于LLM的评价方法,如对候选人答复进行对等或列表式比较,缺乏直观的解释性。每个答复的分数提供一些描述,但无法适应不同的问题内容。最显著的是,现有方法忽略了事实类和非行为类问题之间的区别。为了应对这些挑战,我们提议了\textbf{MinosEval},这是一种新颖的评价方法,首先区分开放式问题,然后用不同的评价战略对候选人进行评分。关于事实类问题,它采用适应性关键点评分战略,而对于非行为类问题,它则使用实例分列表排序战略。最显著的是,现有方法忽略了事实类与非行动类问题之间的排序战略。为了比较容易区分,在多开放型和非行动类问题中,对类问题进行实验,在多开放型解释性社区解释性、更能提供自我调整后,用更能的、更有利于式的数据展示式的、更能显示自我分析,对等的资源展示。


Article 70

Title@2025-06-18 (3): ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs

Title: ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs ProtoReasoning: Prototypen als Stiftung für generalisierbare Vernunft in LLMs 原生共振:原型作为LLMs中普遍合理理由基金会 2506.15211v1

Authors (7): Feng He, Zijun Chen, Xinnian Liang, Tingting Ma, Yunqi Qiu, Shuangzhi Wu, Junchi Yan

Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities. However, the underlying mechanisms supporting such transfer remain poorly understood. We hypothesize that cross-domain generalization arises from shared abstract reasoning prototypes – fundamental reasoning patterns that capture the essence of problems across domains. These prototypes minimize the nuances of the representation, revealing that seemingly diverse tasks are grounded in shared reasoning structures.Based on this hypothesis, we propose ProtoReasoning, a framework that enhances the reasoning ability of LLMs by leveraging scalable and verifiable prototypical representations (Prolog for logical reasoning, PDDL for planning).ProtoReasoning features: (1) an automated prototype construction pipeline that transforms problems into corresponding prototype representations; (2) a comprehensive verification system providing reliable feedback through Prolog/PDDL interpreters; (3) the scalability to synthesize problems arbitrarily within prototype space while ensuring correctness. Extensive experiments show that ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning (Enigmata-Eval), 6.3% improvement on planning tasks, 4.0% improvement on general reasoning (MMLU) and 1.0% on mathematics (AIME24). Significantly, our ablation studies confirm that learning in prototype space also demonstrates enhanced generalization to structurally similar problems compared to training solely on natural language representations, validating our hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in large language models.

在经过长链理论推理(Long CoT)培训的大型解释模型(LRM)的最近进展中,通过长链推理(LRM)推理的推理能力显示了显著的跨部概括性能力。然而,支持这种转移的基本机制仍然没有得到很好的理解。我们假设跨部概括化来自共同的抽象推理原型 – – 基本推理原型能够捕捉跨领域问题的实质。这些原型最大限度地缩小了代表性的细微差别。这些原型表明,看起来不同的任务是基于共同推理结构。基于这一假设,我们提出ProtoReson,这是一个框架,通过利用可缩放和可核查的原型原型表述(逻辑推理说明,PDDL用于规划)。 ProtoRecommation特性:(1) 将问题转化为相应原型表述的自动原型建筑管道;(2) 通过Prolog/PDDL解释提供可靠的反馈;(3) 在原型空间内任意合成问题的缩缩略图性,同时确保正确性。广泛的实验表明,ProtoReson 相对于逻辑推理学的基线模型,实现了4.7%的改进(Egmata-Eval-Evalalalalalalalbilizalalalalalalalalalation) 改进(Evalalalalalalalalalalalbalalalalalimationalalalalalbalbalisalisalismalismalismalismal 4),也证明我们进行关于一般的改进了基础(1.L)。


Article 71

Title@2025-06-18 (3): A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals

Title: A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals Eine vergleichende Studie über Anpassungstechniken großer Sprachmodelle zur Ermittlung von Zielen für eine nachhaltige Entwicklung 关于用于确定可持续发展目标的大型语言模型任务适应技术的比较研究 2506.15208v1

Authors (9): Andrea Cadeddu, Alessandro Chessa, Vincenzo De Leo, Gianni Fenu, Enrico Motta, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino, Luca Secchi

In 2012, the United Nations introduced 17 Sustainable Development Goals (SDGs) aimed at creating a more sustainable and improved future by 2030. However, tracking progress toward these goals is difficult because of the extensive scale and complexity of the data involved. Text classification models have become vital tools in this area, automating the analysis of vast amounts of text from a variety of sources. Additionally, large language models (LLMs) have recently proven indispensable for many natural language processing tasks, including text classification, thanks to their ability to recognize complex linguistic patterns and semantics. This study analyzes various proprietary and open-source LLMs for a single-label, multi-class text classification task focused on the SDGs. Then, it also evaluates the effectiveness of task adaptation techniques (i.e., in-context learning approaches), namely Zero-Shot and Few-Shot Learning, as well as Fine-Tuning within this domain. The results reveal that smaller models, when optimized through prompt engineering, can perform on par with larger models like OpenAI’s GPT (Generative Pre-trained Transformer).

2012年,联合国引入了17项可持续发展目标(SDGs),旨在到2030年创造更可持续、更好的未来;然而,由于所涉数据的规模和复杂性巨大,很难跟踪实现这些目标的进展;文本分类模型已成为这一领域的重要工具,从各种来源对大量文本的分析自动化;此外,大型语言模型(LLMs)最近证明对于许多自然语言处理任务(包括文字分类)是不可或缺的,因为这些模型能够识别复杂的语言模式和语义;这项研究分析各种专有和开放源LMs,用于一个单一标签、多级文本分类任务,重点是SDGs;然后,还评估任务适应技术(即文内学习方法)的有效性,即零热和少热学习,以及该领域的精细图;研究结果显示,如果通过迅速的工程优化,较小的模型可以与OpenAI的GPT(Generary Pregrier)等较大的模型同步。


Article 72

Title@2025-06-18 (3): BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning

Title: BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning BIS Reasoning 1.0: Der erste großformatige japanische Benchmark für glaubens-inkonsistente syllogistische Reasoning BIS 理由1.0:日本第一个大尺度的信仰不一致时断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断断 2506.06955v2

Authors (9): Ha-Thanh Nguyen, Chaoran Liu, Koichi Takeda, Yusuke Miyao, Pontus Stenetorp, Qianying Liu, Su Myat Noe, Hideyuki Tachibana, Sadao Kurohashi

We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.

我们提出了国际清算银行(BIS)1.0号理由,这是日本为评估大型语言模型(LLMs)中与信仰不一致的推理(LLMs)明确设计的第一个大规模逻辑推理问题数据集。 与以前侧重于一般推理或信仰一致推理的NeuBAROCO和JFLD等数据集不同,国际清算银行(BIS)1.0号理由引入了逻辑上有效但信仰不一致的立体理论,以揭示关于人与人结盟公司培训的LLLMs的推理偏向。 我们为最新模型(包括GPT模型、Claude模型和主要日本LLMs)做了基准测试,显示其性能差异很大,GPT-4o的精确度达到了79.54%。 我们的分析指出了当前LLMs在处理逻辑上有效但信仰冲突性投入时的关键弱点。 这些发现对在法律、保健和科学文献等高接触领域部署LLMs具有重要影响,在这些领域必须超越直觉信仰以确保廉正和安全。


Article 73

Title@2025-06-18 (3): A Systematic Survey of Natural Language Processing for the Greek Language

Title: A Systematic Survey of Natural Language Processing for the Greek Language Eine systematische Untersuchung der natürlichen Sprachverarbeitung für die griechische Sprache 系统调查希腊语的自然语言处理 2407.09861v4

Authors (4): Juli Bakagianni, Kanella Pouli, Maria Gavriilidou, John Pavlopoulos

Comprehensive monolingual Natural Language Processing (NLP) surveys are essential for assessing language-specific challenges, resource availability, and research gaps. However, existing surveys often lack standardized methodologies, leading to selection bias and fragmented coverage of NLP tasks and resources. This study introduces a generalizable framework for systematic monolingual NLP surveys. Our approach integrates a structured search protocol to minimize bias, an NLP task taxonomy for classification, and language resource taxonomies to identify potential benchmarks and highlight opportunities for improving resource availability. We apply this framework to Greek NLP (2012-2023), providing an in-depth analysis of its current state, task-specific progress, and resource gaps. The survey results are publicly available (https://doi.org/10.5281/zenodo.15314882) and are regularly updated to provide an evergreen resource. This systematic survey of Greek NLP serves as a case study, demonstrating the effectiveness of our framework and its potential for broader application to other not so well-resourced languages as regards NLP.

综合单一语言自然语言处理(NLP)调查对于评估语言特定挑战、可用资源和研究差距至关重要,然而,现有调查往往缺乏标准化方法,导致选择偏差,导致对语言单一任务和资源的覆盖分散,本研究报告为系统单一语言国家语言调查引入了一个可实现的总体框架。我们的方法包括一个结构化的搜索协议,以尽量减少偏见,国家语言单一语言单一语言分类用于分类,以及语言资源分类,以确定潜在的基准,并突出改善资源可用性的机会。我们将这一框架应用于希腊语言国家语言框架(2012-2013年),深入分析其目前的状况、具体任务的进展和资源差距。调查结果公布于众(https://doi.org/10.5281/zenodo.15314882),并定期更新,以提供一个绿色资源。对希腊语言国家语言分类的系统调查作为案例研究,展示了希腊语言国家语言框架的有效性及其更广泛地应用于其他非资源丰富的语言的可能性。


Article 74

Title@2025-06-18 (3): Seewo’s Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models

Title: Seewo’s Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models Seewos Vorlage bei MLC-SLM: Lehren aus sprachbezogenen Sprachmodellen Seewoo向刚果解放运动-解解运提交的材料:从讲理由语言模式中学到的教益 2506.13300v3

Authors (3): Bo Li, Chengben Xu, Wufeng Zhang

This paper presents Seewo’s systems for both tracks of the Multilingual Conversational Speech Language Model Challenge (MLC-SLM), addressing automatic speech recognition (ASR) and speaker diarization with ASR (SD-ASR). We introduce a multi-stage training pipeline that explicitly enhances reasoning and self-correction in speech language models for ASR. Our approach combines curriculum learning for progressive capability acquisition, Chain-of-Thought data augmentation to foster intermediate reflection, and Reinforcement Learning with Verifiable Rewards (RLVR) to further refine self-correction through reward-driven optimization. This approach achieves substantial improvements over the official challenge baselines. On the evaluation set, our best system attains a WER/CER of 11.57% for Track 1 and a tcpWER/tcpCER of 17.67% for Track 2. Comprehensive ablation studies demonstrate the effectiveness of each component under challenge constraints.

本文介绍Seewo在多种语言交流语言语言模式挑战(MLC-SLM)两个轨道上的系统,处理自动语音识别(ASR)和与ASR(SD-ASR)的语音分解问题。我们引入了一个多阶段培训管道,明确加强ASR语言模式的推理和自我纠正。我们的方法包括:逐步获取能力的课程学习、加强探索数据链以促进中间思考,以及用可核实的奖励学习(RLVR)来通过奖励驱动的优化来进一步完善自我纠正。这种方法大大改进了官方的挑战基线。在评估中,我们的最佳系统在轨道1中达到了11.57%的WER/CER,在轨道2中达到了17.67%的tcPWER/tcpCER。全面分析研究表明了在挑战制约下每个组成部分的有效性。


Article 75

Title@2025-06-18 (3): LLäMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch

Title: LLäMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch LLäMmlein: Transparente, kompakte und wettbewerbsfähige deutschsprachige Sprachmodelle von Scratch LläMmlein:来自斯克拉奇的透明、紧凑和有竞争力的独德语言模式 2411.11171v5

Authors (3): Jan Pfister, Julia Wunderle, Andreas Hotho

We create two German-only decoder models, LL"aMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. The model training involved several key steps, including extensive data preprocessing, the creation of a custom German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks. Throughout the training process, multiple checkpoints were saved and analyzed using the SuperGLEBer benchmark to monitor the models’ learning dynamics. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LL"aMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models’ quality scales with size as expected, but performance improvements on some tasks plateaued early, offering valuable insights into resource allocation for future model development.

我们创建了两种德国独家解码器模型,LL"aMmlein 120M和1B,从头开始透明,并公布这些模型和培训数据,供德国NLP研究界使用。模型培训涉及几个关键步骤,包括广泛的数据预处理、创建定制的德国代用品机、培训本身以及评估各种基准的最终模型。在整个培训过程中,利用超级GLEBer基准保存并分析了多个检查站,以监测模型的学习动态。与超级GLEBer基准上最先进的模型相比,LLL"aMlein两个模型都具有竞争力,始终与类似参数大小的模型相匹配或超过。结果显示,模型的质量尺度与预期的大小相当,但在一些任务上稍有改进,为未来的模型开发提供了资源分配的宝贵见解。


Article 76

Title@2025-06-18 (3): Enhancing Goal-oriented Proactive Dialogue Systems via Consistency Reflection and Correction

Title: Enhancing Goal-oriented Proactive Dialogue Systems via Consistency Reflection and Correction Verbesserung zielorientierter proaktiver Dialogsysteme durch Konsistenzreflexion und Korrektur 通过一致性反思和校正加强面向目标的前瞻性对话系统 2506.13366v3

Authors (4): Didi Zhang, Yaxin Fan, Peifeng Li, Qiaoming Zhu

Goal-oriented proactive dialogue systems are designed to guide user conversations seamlessly towards specific objectives by planning a goal-oriented path. However, previous research has focused predominantly on optimizing these paths while neglecting the inconsistencies that may arise between generated responses and dialogue contexts, including user profiles, dialogue history, domain knowledge, and subgoals. To address this issue, we introduce a model-agnostic two-stage Consistency Reflection and Correction (CRC) framework. Specifically, in the consistency reflection stage, the model is prompted to reflect on the discrepancies between generated responses and dialogue contexts, identifying inconsistencies and suggesting possible corrections. In the consistency correction stage, the model generates responses that are more consistent with the dialogue context based on these reflection results. We conducted experiments on various model architectures with different parameter sizes, including encoder-decoder models (BART, T5) and decoder-only models (GPT-2, DialoGPT, Phi3, Mistral and LLaMA3), and the experimental results on three datasets demonstrate that our CRC framework significantly improves the consistency between generated responses and dialogue contexts.

目标导向的主动对话系统旨在通过规划面向目标的道路,引导用户对话无缝地实现具体目标;然而,以往的研究主要侧重于优化这些路径,同时忽视生成的响应和对话背景之间可能产生的不一致之处,包括用户概况、对话历史、域知识和次级目标;为解决这一问题,我们引入了一个模式――不可知的两阶段一致反思和校正(CRC)框架;具体地说,在一致性反思阶段,该模型被激励思考生成的响应和对话背景之间的差异,找出不一致之处,并提出可能的纠正建议;在一致性纠正阶段,该模型生成的响应与基于这些反思结果的对话背景更加一致;我们针对不同参数大小的各种模型结构进行了实验,包括变码-变码模型(BART、T5)和只使用变码模型(GPT-2、DialoGPT、Phi3、Mistral和LLAMA3),以及三个数据集的实验结果显示,我们的《儿童权利公约》框架显著改进了生成的响应和对话背景之间的一致性。


Article 77

Title@2025-06-18 (3): Efficient Long CoT Reasoning in Small Language Models

Title: Efficient Long CoT Reasoning in Small Language Models Effiziente Long CoT-Reasoning in kleinen Sprachmodellen 低语言模式中有效的长期计算成本理由 2505.18440v2

Authors (6): Zhaoyang Wang, Jinqi Jiang, Tian Qiu, Hui Liu, Xianfeng Tang, Huaxiu Yao

Recent large reasoning models such as DeepSeek-R1 exhibit strong complex problems solving abilities by generating long chain-of-thought (CoT) reasoning steps. It is challenging to directly train small language models (SLMs) to emerge long CoT. Thus, distillation becomes a practical method to enable SLMs for such reasoning ability. However, the long CoT often contains a lot of redundant contents (e.g., overthinking steps) which may make SLMs hard to learn considering their relatively poor capacity and generalization. To address this issue, we propose a simple-yet-effective method to prune unnecessary steps in long CoT, and then employ an on-policy method for the SLM itself to curate valid and useful long CoT training data. In this way, SLMs can effectively learn efficient long CoT reasoning and preserve competitive performance at the same time. Experimental results across a series of mathematical reasoning benchmarks demonstrate the effectiveness of the proposed method in distilling long CoT reasoning ability into SLMs which maintains the competitive performance but significantly reduces generating redundant reasoning steps.

近期的DeepSeek-R1等大型推理模型产生了长期思维链(Cot)推理步骤,因此,直接培训小型语言模型(SLMs)形成长期COT具有挑战性,直接培训小型语言模型(SLMs)形成长期COT。因此,蒸馏成为使SLM能够具备这种推理能力的实用方法。然而,长期的COT往往包含许多多余的内容(例如,过度思考的步骤),可能使SLM难以了解其相对薄弱的能力和一般化。为了解决这一问题,我们建议采用一种简单而有效的方法,在长期COT中采用不必要的步骤,然后采用一种政策方法,使SLM本身整理有效而有用的长期COT培训数据。这样,SLVs可以有效地学习有效的长期COT推理法,并同时保持竞争性的绩效。一系列数学推理基准的实验结果表明,拟议的方法在将CT推理推理能力推理能力推导成长期的SLVSLSDs(维持竞争性业绩,但大大降低产生多余的推理理学步骤)方面是有效的。


Article 78

Title@2025-06-18 (3): Emergence of Primacy and Recency Effect in Mamba: A Mechanistic Point of View

Title: Emergence of Primacy and Recency Effect in Mamba: A Mechanistic Point of View Entstehung von Primat und Recency-Effekt in Mamba: Ein mechanistischer Standpunkt 曼巴的先权效应和后期效应:机械观察点 2506.15156v1

Authors (4): Muhammad Cendekia Airlangga, Hilal AlQuabeh, Munachiso S Nwadike, Kentaro Inui

We study memory in state-space language models using primacy and recency effects as behavioral tools to uncover how information is retained and forgotten over time. Applying structured recall tasks to the Mamba architecture, we observe a consistent U-shaped accuracy profile, indicating strong performance at the beginning and end of input sequences. We identify three mechanisms that give rise to this pattern. First, long-term memory is supported by a sparse subset of channels within the model’s selective state space block, which persistently encode early input tokens and are causally linked to primacy effects. Second, short-term memory is governed by delta-modulated recurrence: recent inputs receive more weight due to exponential decay, but this recency advantage collapses when distractor items are introduced, revealing a clear limit to memory depth. Third, we find that memory allocation is dynamically modulated by semantic regularity: repeated relations in the input sequence shift the delta gating behavior, increasing the tendency to forget intermediate items. We validate these findings via targeted ablations and input perturbations on two large-scale Mamba-based language models: one with 1.4B and another with 7B parameters.

我们研究国家空间语言模型中的记忆,使用优先性和耐久性效应作为行为工具,以揭示信息如何保存和被长期遗忘。在Mamba结构中应用结构化重整任务时,我们观察到一个一致的U形准确性剖面,在输入序列的开始和结束时显示强性能。我们确定了产生这种模式的三个机制。首先,长期内存得到模型选择性状态空间块中稀少的几小段频道的支持,这些频道持续编码早期输入符号,并且与首要效果有因果关系。第二,短期内存受三角调节的重现制约:最近的输入由于指数衰减而获得更多重量,但在引入分散性物品时这种耐久性优势崩溃,揭示了记忆深度的明显限制。第三,我们发现记忆分配动态受静态规律的调节:输入序列中反复出现的关系改变三角格变化的行为,增加遗忘中间项目的趋势。我们通过有针对性地调整和输入对两个大型曼巴语言模型的反复调整来验证这些结果:一个带有1.4B参数,另一个带有7B参数。


Article 79

Title@2025-06-18 (3): ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models

Title: ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models ALPS: Aufmerksamkeit Lokalisierung und Pruning-Strategie zur effizienten Ausrichtung großer Sprachmodelle ALPS: 高效统一大语言模式的注意地方化和审慎战略 2505.18799v4

Authors (9): Hao Chen, Haoze Li, Zhiqing Xiao, Lirong Gao, Qi Zhang, Xiaomeng Hu, Ningtao Wang, Xing Fu, Junbo Zhao

Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant training adjustment costs. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the Attention Localization and Pruning Strategy (ALPS), an efficient algorithm that localizes the most task-sensitive attention heads and prunes by restricting attention training updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only 10% of attention parameters during fine-tuning while achieving a 2% performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment. The code is available at https://github.com/VoiceBeer/ALPS.

将通用大语言模型(LLMS)与下游任务相协调往往需要大量的培训调整费用; 先前的研究探索了各种提高调整效率的途径,主要是通过最低限度的数据培训或数据驱动的启动来确定关键关注负责人; 然而,这些方法本身就引入了数据依赖性,这妨碍了一般化和可重新使用。 为了解决这一问题并提高模式协调效率,我们建议采用关注本地化和预留战略(ALPS),这是一个高效的算法,通过限制关注对象负责人的培训更新,从而降低调整成本,将最敏感关注负责人和关注对象本地化。 实验结果表明,我们的方法在微调期间只启用10%的关注参数,同时在三项任务的基线上实现2%的绩效改进。此外,所确定的具体任务负责人可跨越数据集进行可转让,并减少知识的遗忘。 我们的工作和发现为高效LM调整提供了新的视角。 代码可在 https://github.com/VoiceBeer/ALPS上查阅。


Article 80

Title@2025-06-18 (3): SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

Title: SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning SonicVerse: Multi-Task-Lernen für Musik-Feature-informierte Bildunterschriften SonicVerse: 音乐特色多任务学习 2506.15154v1

Authors (3): Anuradha Chopra, Abhinaba Roy, Dorien Herremans

Detailed captions that accurately reflect the characteristics of a music piece can enrich music databases and drive forward research in music AI. This paper introduces a multi-task music captioning model, SonicVerse, that integrates caption generation with auxiliary music feature detection tasks such as key detection, vocals detection, and more, so as to directly capture both low-level acoustic details as well as high-level musical attributes. The key contribution is a projection-based architecture that transforms audio input into language tokens, while simultaneously detecting music features through dedicated auxiliary heads. The outputs of these heads are also projected into language tokens, to enhance the captioning input. This framework not only produces rich, descriptive captions for short music fragments but also directly enables the generation of detailed time-informed descriptions for longer music pieces, by chaining the outputs using a large-language model. To train the model, we extended the MusicBench dataset by annotating it with music features using MIRFLEX, a modular music feature extractor, resulting in paired audio, captions and music feature data. Experimental results show that incorporating features in this way improves the quality and detail of the generated captions.

能够准确反映音乐作品特点的详细字幕可以丰富音乐数据库,并推动音乐AI 的前瞻性研究。 本文引入了多任务音乐字幕模型 SonicVerse, 将字幕制作与辅助音乐特征检测任务相结合, 如关键检测、 语音检测等辅助性音乐特征检测任务相结合, 以便直接捕捉低层次的音频细节以及高级音乐属性。 关键贡献是一个基于投影的架构, 将音效输入转换成语言符号, 同时通过专门的辅助头目探测音乐特征。 这些头的输出还被投射为语言符号, 以加强字幕输入。 这个框架不仅为短音碎片制作丰富、 描述性字幕, 还直接为更长的音乐片段生成详细的时间知情描述, 使用大语言模型将输出内容连锁起来。 为了培训模型, 我们扩展了音乐Bench数据集, 用音乐特征加注, MIRFLEX, 模块化音乐特征提取器, 导致配对音、 字幕和音乐特征数据。 实验结果显示, 以这种方式包含特征的特性, 从而改进了所制作的翻译质量和细节。


Article 81

Title@2025-06-18 (3): TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding

Title: TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding TransXSSM: Ein Hybrid Transformer State Space Modell mit unified Rotary Position Einbettung TransXSSSSM:带有统一扶轮定位嵌入装置的混合变形国家空间模型 2506.09507v3

Authors (5): Bingheng Wu, Jingze Shi, Yifan Wu, Nan Tang, Yuyu Luo

Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongr inuity their respective positional encoding mechanisms: Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs leverage implicit positional representations via convolutions. This divergence often precipitates discontinuities and suboptimal performance.To address this impediment, we propose a unified rotary position embedding (Unified RoPE) methodology, thereby establishing a consistent positional encoding framework for both self-attention and state-space components. Using this Unified RoPE, we introduce TransXSSM, a hybrid architecture that coherently integrates the Transformer and SSM layers under this unified positional encoding scheme. At a 4 sequenceK length, TransXSSM exhibits training and inference speeds that are 42.3% and 29.5% faster, respectively, relative to standard Transformer models. It also delivers higher accuracy: under comparable settings, it surpasses a Transformer baseline by over 4% on language modeling benchmarks.TransXSSM furthermore scales more effectively: TransXSSM-1.3B gains 7.22% in average accuracy over its 320M version (versus about 6% gains for equivalent Transformers or SSMs). Our results show that unified positional encoding resolves positional incompatibility in hybrid models, enabling efficient, high-performance long-context modeling.

变异器在捕捉远程依赖性方面表现出熟练,而国家空间模型(SSMS)则有利于线性时间序列建模。尽管这些结构具有协同潜力,但整合这些结构仍是一个重大挑战,主要原因是它们各自位置编码机制存在根本的不一致性:变异器依靠明确的扶轮定位嵌入器(ROPE),而SSMs则通过组合来利用隐含的定位表示。这种差异往往导致不连续和不优化的性能。为解决这一障碍,我们提议采用统一的旋转定位嵌入(统一ROPE)方法,从而为自备和州空间组件建立一个一致的定位编码框架。我们采用这个统一的定位编码机制,即 TransXSSSSS, 混合结构将变异器和SSSM层一致地整合到这个统一的定位编码计划下。在4个序列上, TransXSSSSM的展示速度分别为42.3%和29.5%,与标准变异器模型相比,它也提供更高的准确性:在可比较的环境下,它超越了统一的定位位置,在SMSM-SB的更高版本的变异性精确度基准,在SMLMM 4-RB的平均值中,在SB的平均值上,在SB的平均值上,在SB的平均值上,在4:4:4-SMMSB的中,在SB的平均值中,在SB的平均值中,在SB的平均值上显示比的平均值上,在4:4:4:4:4:4:4的平均值的中,在SB的中,在SB的中,在SB的平均值中,在SM的平均值中,在SM的平均值的中,在SM的平均值中,在SM 20B的平均值中,在SM的平均值中,在SB的中,在SB的平均值中,在SB的平均值上,在SB的平均值上,在4:4:4,在SMB的平均值上,在4:4:4:4:4,在SB的中,在4,在4:4:4:4,在4,在SB的中,在SB的中,在SB的中,在4,在SB的平均值的


Article 82

Title: BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs BriefMe: Ein gesetzlicher NLP-Benchmark für die Unterstützung mit rechtlichen Briefen 简报:协助提供法律简报的《国家劳工规划法》法律基准 2506.06619v2

Authors (4): Jesse Woo, Fateme Hashemi Chaleshtori, Ana Marasović, Kenneth Marino

A core part of legal work that has been under-explored in Legal NLP is the writing and editing of legal briefs. This requires not only a thorough understanding of the law of a jurisdiction, from judgments to statutes, but also the ability to make new arguments to try to expand the law in a new direction and make novel and creative arguments that are persuasive to judges. To capture and evaluate these legal skills in language models, we introduce BRIEFME, a new dataset focused on legal briefs. It contains three tasks for language models to assist legal professionals in writing briefs: argument summarization, argument completion, and case retrieval. In this work, we describe the creation of these tasks, analyze them, and show how current models perform. We see that today’s large language models (LLMs) are already quite good at the summarization and guided completion tasks, even beating human-generated headings. Yet, they perform poorly on other tasks in our benchmark: realistic argument completion and retrieving relevant legal cases. We hope this dataset encourages more development in Legal NLP in ways that will specifically aid people in performing legal work.

在《法律国家文件》中未得到充分探讨的法律工作的核心部分是撰写和编辑法律辩护状,这不仅需要透彻理解司法管辖区的法律,从判决到法规,而且还需要能够提出新的论据,试图将法律扩展到新的方向,提出对法官有说服力的新颖和创造性的论点。为了在语言模型中捕捉和评价这些法律技能,我们引入了BRIEFME,这是一个以法律辩护状为重点的新数据集。它包含语言模型的三项任务,以协助法律专业人员撰写辩护状:论点摘要、论点完成和案件检索。在这项工作中,我们描述这些任务的创建,分析这些任务,并展示当前模式的运作方式。我们看到,今天的大型语言模型(LLMS)在总结和指导完成任务方面已经相当出色,甚至打打人造标题。然而,它们在我们的基准中的其他任务方面表现很差:现实的论证完成和重新检索相关的法律案例。我们希望这一数据集鼓励法律国家文件在具体帮助人们开展法律工作的方式上作出更多的发展。


Article 83

Title@2025-06-18 (3): Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models

Title: Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models Thunder-Tok: Minimierung von Token pro Wort bei Tokenizing koreanischer Texte für generative Sprachmodelle Thunder-Tok: 将韩文用于创用语言模式的韩文中逐个字的调子最小化 2506.15138v1

Authors (6): Gyeongje Cho, Yeonkyoun So, Chanwoo Park, Sangmin Lee, Sungmok Jung, Jaejin Lee

This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.

本文介绍桑德斯-托克,这是韩国一个新的代币器,旨在在不损害模型性能的情况下降低象征性生育力。我们的方法是使用一种符合朝鲜语言语言结构的基于规则的先入为主的方法。我们还创建了一个种子词汇,其中包含类似于语言单位的象征,并使用一种基于分流的星座选择算法。这些技术提高了平均代币长度,从而降低了生育率,同时保留语言信息。实验结果表明,桑达-托克将生育率降低了约10%(即将象征数量减少10%,提高推论速度10%),而BPE比BPE降低了10%,同时不影响下游各项任务的业绩。这些结果表明,我们的语言知情方法对于设计高效的语言模型代币值是有效和实用的。


Article 84

Title@2025-06-18 (3): GRAM: A Generative Foundation Reward Model for Reward Generalization

Title: GRAM: A Generative Foundation Reward Model for Reward Generalization GRAM: Ein generatives Stiftungsprämienmodell für Belohnungsverallgemeinerung GRAM: 奖励普遍化的创金基金会奖励模式 2506.14175v2

Authors (11): Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, Murun Yang, Bei Li, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu

In aligning large language models (LLMs), reward models have played an important role, but are standardly trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative models and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.

在调整大型语言模式(LLMs)时,奖励模式发挥了重要作用,但作为歧视模式受到标准培训,而且只依赖标签人类偏好数据。在本文中,我们探讨了利用无标签和标签数据培训奖励模式的方法。在LLMs的基因化模型基础上,我们开发了一个基因化奖赏模式,首先通过大规模不受监督的学习培训,然后通过监督的学习进行微调。我们还表明,通过使用标签平滑,我们实际上正在优化正规化的对等排名损失。这反过来又提供了培训奖励模式的新观点,将基因化模型和歧视性模型结合到同一类别的培训目标之下。这些技术的成果是一个基础奖赏模式,可以适用于范围很广的任务,很少或没有进一步的微调努力。广泛的实验表明,这一模式在多项任务中非常广泛,包括反应等级、从人类反馈中强化学习,以及任务调整后调整适应,在几个强有力的基线模型上取得了显著的业绩改进。


Article 85

Title@2025-06-18 (3): Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs

Title: Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs Modellierung der ein-zu-vielen Immobilien im Open-Domain-Dialog mit LLMs 在与LLMM的开放式对话中模拟一对一财产 2506.15131v1

Authors (3): Jing Yang Lee, Kong-Aik Lee, Woon-Seng Gan

Open-domain Dialogue (OD) exhibits a one-to-many (o2m) property, whereby multiple appropriate responses exist for a single dialogue context. Despite prior research showing that modeling this property boosts response diversity, most modern LLM-based dialogue agents do not explicitly do so. In this work, we model the o2m property of OD in LLMs by decomposing OD generation into two key tasks: Multi-Response Generation (MRG) and Preference-based Selection (PS), which entail generating a set of n semantically and lexically diverse high-quality responses for a given dialogue context, followed by selecting a single response based on human preference, respectively. To facilitate MRG and PS, we introduce o2mDial, a dialogue corpus explicitly designed to capture the o2m property by featuring multiple plausible responses for each context. Leveraging o2mDial, we propose new in-context learning and instruction-tuning strategies, as well as novel evaluation metrics for MRG, alongside a model-based approach for PS. Empirical results demonstrate that applying the proposed two-stage framework to smaller LLMs for OD generation enhances overall response diversity while maintaining contextual coherence, improving response quality by up to 90%, bringing them closer to the performance of larger models.

开放式对话(OD) 展示了一种一对多种(o2m)属性, 一种对单一对话环境有多种适当的反应。 尽管先前的研究显示, 模拟这种属性可以促进反应多样性, 但大多数现代的LLM对话代理商并不明确这样做。 在这项工作中, 我们将OD生成的多功能生成(OD) 的O2m属性建模成两种关键任务: 多响应生成(MRG) 和以选择为基础的选择(PS) , 其中包括为特定对话环境产生一套基于语义和语言的高质量反应, 之后又根据人类偏好选择一个单一的响应。 为便利MRG和PS, 我们引入了O2Dal, 明确设计了一种对话机制,通过对每种背景作出多种可信的反应来捕捉O2m属性。 我们将O2MDal, 提出了新的文文本学习和教学调整战略, 以及基于MRG的新型评价指标, 以及基于模式的PS 。 Empricalcal 结果表明, 将拟议的两阶段框架适用于更紧密的LMDMs 更紧密的响应, 将更接近更紧密的图像更紧密的LOM 。


Article 86

Title@2025-06-18 (3): REVOLVE: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization

Title: REVOLVE: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization REVOLVE: Optimierung von KI-Systemen durch Tracking Response Evolution in der Textoptimierung REVOLVE:通过跟踪文字优化的应对演变,优化AI系统 2412.03092v2

Authors (8): Peiyan Zhang, Haibo Jin, Leyang Hu, Xinnuo Li, Liying Kang, Man Luo, Yangqiu Song, Haohan Wang

Recent advancements in large language models (LLMs) have significantly enhanced the ability of LLM-based systems to perform complex tasks through natural language processing and tool interaction. However, optimizing these LLM-based systems for specific tasks remains challenging, often requiring manual interventions like prompt engineering and hyperparameter tuning. Existing automatic optimization methods, such as textual feedback-based techniques (e.g., TextGrad), tend to focus on immediate feedback, analogous to using immediate derivatives in traditional numerical gradient descent. However, relying solely on such feedback can be limited when the adjustments made in response to this feedback are either too small or fluctuate irregularly, potentially slowing down or even stalling the optimization process. To overcome these challenges, more adaptive methods are needed, especially in situations where the system’s response is evolving slowly or unpredictably. In this paper, we introduce REVOLVE, an optimization method that tracks how “R”esponses “EVOLVE” across iterations in LLM systems. By focusing on the evolution of responses over time, REVOLVE enables more stable and effective optimization by making thoughtful, progressive adjustments at each step. Experimental results demonstrate that REVOLVE outperforms competitive baselines, achieving a 7.8% improvement in prompt optimization, a 20.72% gain in solution refinement, and a 29.17% increase in code optimization. Additionally, REVOLVE converges in fewer iterations, resulting in significant computational savings. Beyond its practical contributions, REVOLVE highlights a promising direction, where the rich knowledge from established optimization principles can be leveraged to enhance LLM systems, which paves the way for further advancements in this hybrid domain.

大型语言模型(LLMS)最近的进展大大增强了基于LLM的系统通过自然语言处理和工具互动完成复杂任务的能力。然而,优化这些基于LLM的系统以完成具体任务,仍具有挑战性,往往需要人工干预,如快速工程和超分光计调。现有的自动优化方法,如基于文本的反馈技术(如TextGrad),往往侧重于即时反馈,类似于使用传统数字梯度下降中的直接衍生物。然而,如果根据这种反馈所作的调整过小或波动不定,可能放慢甚至拖延优化进程。但是,优化进程的最佳化系统仍具有挑战性,需要更适应性的方法,特别是在系统反应缓慢或难以预测的情况下。在本文中,我们引入了REVOLVVE的优化方法,该方法可以跟踪LLM系统中“R”的“EOLVVVE”在传统梯度下降时如何进行直接反应, REVOLVVVE能够更稳定有效的优化,方法是在每一步对RELVO的精细度上进行精确的调整,在升级的精细微的精细的精细的精细的精细的精细度上,在升级的精细的精细的精细的精细的精细的精细的精细的精细的精细的精细的精细的精细的精细的精细的精细度上, 。


Article 87

Title@2025-06-18 (3): Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation

Title: Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation Linderung der Verteilungsverschiebung in synthetischen Daten für die Schätzung der maschinellen Übersetzungsqualität 减轻机器翻译质量估算合成数据分配变化 2502.19941v3

Authors (5): Xiang Geng, Zhejian Lai, Jiajun Chen, Hao Yang, Shujian Huang

Quality Estimation (QE) models evaluate the quality of machine translations without reference translations, serving as the reward models for the translation task. Due to the data scarcity, synthetic data generation has emerged as a promising solution. However, synthetic QE data often suffers from distribution shift, which can manifest as discrepancies between pseudo and real translations, or in pseudo labels that do not align with human preferences. To tackle this issue, we introduce DCSQE, a novel framework for alleviating distribution shift in synthetic QE data. To reduce the difference between pseudo and real translations, we employ the constrained beam search algorithm and enhance translation diversity through the use of distinct generation models. DCSQE uses references, i.e., translation supervision signals, to guide both the generation and annotation processes, enhancing the quality of token-level labels. DCSQE further identifies the shortest phrase covering consecutive error tokens, mimicking human annotation behavior, to assign the final phrase-level labels. Specially, we underscore that the translation model can not annotate translations of itself accurately. Extensive experiments demonstrate that DCSQE outperforms SOTA baselines like CometKiwi in both supervised and unsupervised settings. Further analysis offers insights into synthetic data generation that could benefit reward models for other tasks. The code is available at https://github.com/NJUNLP/njuqe.

质量估计模型(QE) 评估机器翻译的质量而没有参考翻译的质量,作为翻译任务的奖励模型。由于数据稀缺,合成数据生成已成为一个有希望的解决办法。然而,合成质量数据往往因分布变化而受到影响,这表现为假翻译和真实翻译之间的差异,或与人类喜好不相符的假标签。为了解决这一问题,我们引入了DCSQE,这是一个减轻合成QE数据分配变化的新框架。为了减少假翻译和真实翻译之间的差别,我们使用受限制的波束搜索算法,并通过使用不同的生成模型加强翻译多样性。DCSQE使用参考,即翻译监督信号,指导生成和批注进程,提高象征性标签的质量。DCSQE还确定了包含连续错误标记、模拟人类批注行为的最短的短语,以指定最后的语句级标签。特别是,我们强调翻译模型不能准确地翻译自己。在不同的生成模型中进行广泛的实验,即翻译显示DSQEEFERSFSFORSFORSOLSUI As regroductions laimations for other supalviewal laveal laveal laveal lapal lauds as for laudal laudaltiews foltimalds s lavealds forgal dislationalds lauts fal lavels laubalgalgalgalds lapalgaldalds commadalds commads coduductions latiduductionsaldaldaldaldalds lavels latings lavels lavels ladaldaldaldaldaldaldaldaldalds ladaldalds lads ladaldaldaldaldaldaldaldaldaldaldaldaldaldalds lads ladaldaldalds lads lads lads lads lads lads ladaldaldal lads lads lads


Article 88

Title@2025-06-18 (3): Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model

Title: Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model Effizienter Aufbau eines Domain-Spezifischen Large Language Models aus Scratch: Eine Fallstudie eines klassischen chinesischen Large Language Models 高效率地建立来自Scratch的域特定大语言模型:中国古典大语言模型案例研究 2505.11810v3

Authors (3): Shen Li, Renfen Hu, Lijun Wang

General-purpose large language models demonstrate notable capabilities in language comprehension and generation, achieving results that are comparable to, or even surpass, human performance in many natural language processing tasks. Nevertheless, when general models are applied to some specific domains, e.g., Classical Chinese texts, their effectiveness is often unsatisfactory, and fine-tuning open-source foundational models similarly struggles to adequately incorporate domain-specific knowledge. To address this challenge, this study developed a large language model, AI Taiyan, specifically designed for understanding and generating Classical Chinese. Experiments show that with a reasonable model design, data processing, foundational training, and fine-tuning, satisfactory results can be achieved with only 1.8 billion parameters. In key tasks related to language processing of Classical Chinese such as punctuation, identification of allusions, explanation of word meanings, and translation between ancient and modern Chinese, this model exhibits a clear advantage over both general-purpose large models and domain-specific traditional models, achieving levels close to or surpassing human baselines. This research provides a reference for the efficient construction of specialized domain-specific large language models. Furthermore, the paper discusses the application of this model in fields such as the collation of ancient texts, dictionary editing, and language research, combined with case studies.

通用大语言模式在语言理解和生成方面表现出显著能力,取得了与人类在许多自然语言处理任务方面业绩相当甚至超过人类业绩的显著成果,然而,如果将一般模式应用于某些特定领域,例如古典中文文本,其效力往往不能令人满意,并且对开放源代码基础模型进行微调,以适当纳入特定领域知识的类似努力;为了应对这一挑战,本研究开发了一个大型语言模式,即AI Taiyan, 专门设计用于理解和生成古典中文。实验表明,只要有合理的模型设计、数据处理、基础培训和微调,就只能用18亿个参数取得令人满意的成果。在与古典中文语言处理有关的关键任务中,如标注、标定符号、解释字义和翻译方面,这一模式在一般用途大模型和特定领域传统模型上都具有明显优势,达到接近或超过人类基准的水平。这一研究为高效构建专门领域大型语言模型提供了参考。此外,本文件还结合了古代文字研究、古代文字研究、古代文字研究等领域的案例研究,并结合了该模型的应用。


Article 89

Title@2025-06-18 (3): CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Title: CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale CODESYNC: Synchronisierung großer Sprachmodelle mit dynamischer Codeentwicklung auf Skala CODESYNC: 使大语言模式与动态代码演变规模化同步 2502.16645v2

Authors (9): Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Dongping Chen

Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To this end, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs’ ability to stay synchronized with code evolution, which covers real-world updates for 220 APIs from six Python libraries. Our benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset consisting of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for the development of more effective methods for real-time code knowledge updating in the future. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.

大型语言模型(LLMS)在软件工程方面表现出了非凡的性能,但在适应不断演变的代码知识方面却面临挑战,特别是在经常更新第三方图书馆API方面。这种限制来自静态的培训前数据库,往往导致无法执行的代码或执行,其安全和效率低于最佳水平。为此,本文件介绍了CODESYNC,这是一个用于查明过时代码模式和从Python第三方图书馆收集实时代码知识更新的数据引擎。在CODESYNC的基础上,我们开发了CODESYNCBENCH,这是评估LLMS与代码演变同步能力的综合基准,涵盖六个Python图书馆220个APIS的实时更新。我们的基准提供了3 300个测试案例,涉及三项评估任务,并更新了由2 200个培训样本组成的数据库。对14个州-艺术LMS进行的广泛实验表明,它们与动态代码演变相矛盾,即使支持先进的知识更新方法(例如DPO、ORPOy、SimPO)和SPOK)的系统。我们认为,我们的基准可以提供更坚实的、更可靠的数据库的基础。


Article 90

Title@2025-06-18 (3): SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

Title: SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents SpokenWOZ: Ein großformatiger Sprach-Text-Benchmark für gesprochene Task-Orientierte Dialog-Agenten pokenWOZ:针对以任务为主的对话代理方的大型演讲-文本基准 2305.13040v6

Authors (10): Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, Yongbin Li

Task-oriented dialogue (TOD) models have made significant progress in recent years. However, previous studies primarily focus on datasets written by annotators, which has resulted in a gap between academic research and real-world spoken conversation scenarios. While several small-scale spoken TOD datasets are proposed to address robustness issues such as ASR errors, they ignore the unique challenges in spoken conversation. To tackle the limitations, we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD, containing 8 domains, 203k turns, 5.7k dialogues and 249 hours of audios from human-to-human spoken conversations. SpokenWOZ further incorporates common spoken characteristics such as word-by-word processing and reasoning in spoken language. Based on these characteristics, we present cross-turn slot and reasoning slot detection as new challenges. We conduct experiments on various baselines, including text-modal models, newly proposed dual-modal models, and LLMs, e.g., ChatGPT. The results show that the current models still have substantial room for improvement in spoken conversation, where the most advanced dialogue state tracker only achieves 25.65% in joint goal accuracy and the SOTA end-to-end model only correctly completes the user request in 52.1% of dialogues. The dataset, code, and leaderboard are available: https://spokenwoz.github.io/.

近年来,以任务为导向的对话模式(TOD)取得了显著进展,然而,以往的研究主要侧重于由说明者编写的数据集,这导致学术研究与现实世界口述对话情景之间的差距。虽然一些小规模的口述TOD数据集被提议解决强健问题,如ASR错误,但它们忽视了口述对话中的独特挑战。为了克服这些局限性,我们引入了SpokenWoZ,一个供口述TOD使用的大型语音-文本数据集,包括8个域、203k转折、5.7k对话以及来自人与人之间口对话的249小时音频。SpokenWoZ还吸收了通用的口述特征,如逐字处理和口述语言推理等。基于这些特征,我们提出跨转时档和推理时间探测作为新的挑战。我们实验了各种基线,包括文本模式、新提议的双模式和LLMS,例如,ChatGPT。结果显示,目前的各种模式在口述对话中仍有很大的改进空间,其中,最先进的对话状态跟踪器仅达到25.65%的用户对目标的精确度,在SO-tal-tal-tal com distrate distration distrutes compeutes


Article 91

Title@2025-06-18 (3): CKD-EHR:Clinical Knowledge Distillation for Electronic Health Records

Title: CKD-EHR:Clinical Knowledge Distillation for Electronic Health Records CKD-EHR:Klinische Wissensdestillation für elektronische Gesundheitsdaten CKD-EHR: 用于电子健康记录的临床知识蒸馏 2506.15118v1

Authors (7): Junke Wang, Hongshun Ling, Li Zhang, Longqian Zhang, Fang Wang, Yuan Gao, Zhi Li

Electronic Health Records (EHR)-based disease prediction models have demonstrated significant clinical value in promoting precision medicine and enabling early intervention. However, existing large language models face two major challenges: insufficient representation of medical knowledge and low efficiency in clinical deployment. To address these challenges, this study proposes the CKD-EHR (Clinical Knowledge Distillation for EHR) framework, which achieves efficient and accurate disease risk prediction through knowledge distillation techniques. Specifically, the large language model Qwen2.5-7B is first fine-tuned on medical knowledge-enhanced data to serve as the teacher model.It then generates interpretable soft labels through a multi-granularity attention distillation mechanism. Finally, the distilled knowledge is transferred to a lightweight BERT student model. Experimental results show that on the MIMIC-III dataset, CKD-EHR significantly outperforms the baseline model:diagnostic accuracy is increased by 9%, F1-score is improved by 27%, and a 22.2 times inference speedup is achieved. This innovative solution not only greatly improves resource utilization efficiency but also significantly enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for resource optimization in clinical settings. The code and data for this research are available athttps://github.com/209506702/CKD_EHR.

以健康记录为基础的疾病预测模型在推广精密医学和早期干预方面显示出了重要的临床价值,然而,现有的大型语言模型面临两大挑战:医学知识代表性不足和临床部署效率低。为应对这些挑战,本研究报告提出CKD-EHR框架,通过知识蒸馏技术,实现高效和准确的疾病风险预测。具体地说,大型语言模型 Quen2.5-7B 首次对医学知识强化数据进行了微调,作为教师模型。然后,它通过多感应性蒸馏机制生成了可解释的软标签。最后,蒸馏的知识被传输到轻量的BERT学生模型。实验结果表明,MIMIC-III数据集的CKD-EHR大大超越了基线模型:诊断准确度提高了9%,F1-D核心提高了27%,推导速度加快了22.2倍。这一创新解决方案不仅极大地改进了资源利用效率,而且大大加强了临床模型的精确度和精确度。在临床诊断中提供了一种可用的精确性数据。


Article 92

Title@2025-06-18 (3): Perspective Transition of Large Language Models for Solving Subjective Tasks

Title: Perspective Transition of Large Language Models for Solving Subjective Tasks Perspektivischer Übergang von großen Sprachmodellen zur Lösung subjektiver Aufgaben 解决主观任务大语言模式的视角过渡 2501.09265v2

Authors (8): Xiaolong Wang, Yuanchi Zhang, Ziyue Wang, Yuzhuang Xu, Fuwen Luo, Yile Wang, Peng Li, Yang Liu

Large language models (LLMs) have revolutionized the field of natural language processing, enabling remarkable progress in various tasks. Different from objective tasks such as commonsense reasoning and arithmetic question-answering, the performance of LLMs on subjective tasks is still limited, where the perspective on the specific problem plays crucial roles for better interpreting the context and giving proper response. For example, in certain scenarios, LLMs may perform better when answering from an expert role perspective, potentially eliciting their relevant domain knowledge. In contrast, in some scenarios, LLMs may provide more accurate responses when answering from a third-person standpoint, enabling a more comprehensive understanding of the problem and potentially mitigating inherent biases. In this paper, we propose Reasoning through Perspective Transition (RPT), a method based on in-context learning that enables LLMs to dynamically select among direct, role, and third-person perspectives for the best way to solve corresponding subjective problem. Through extensive experiments on totally 12 subjective tasks by using both closed-source and open-source LLMs including GPT-4, GPT-3.5, Llama-3, and Qwen-2, our method outperforms widely used single fixed perspective based methods such as chain-of-thought prompting and expert prompting, highlights the intricate ways that LLMs can adapt their perspectives to provide nuanced and contextually appropriate responses for different problems.

大型语言模型(LLMS)使自然语言处理领域发生了革命性的变化,使各种任务取得了显著的进展;与普通逻辑推理和算术解答等客观任务不同,LLMS在主观任务方面的表现仍然有限,因为对具体问题的看法在更好地解释背景和作出适当反应方面起着关键作用;例如,在某些情况下,LLMS从专家作用的角度回答时可能表现得更好,有可能获得相关的领域知识;相比之下,LLMS在从第三人的角度回答问题时,可能提供更准确的反应,从而能够更全面地了解问题,并有可能减轻内在偏见;在本文件中,我们提议以透视过渡(RPT)为借口,这是基于内文学习的一种方法,使LLMS能够动态地在直接、角色和第三人的角度之间作出选择,以最佳的方式解决相应的主观问题。通过利用封闭源和开放源LMS(包括GPT-4、GPT-3.5、Llama-3和Qwen-2)等封闭源和开放的LMSM(包括GPT-LMs-CRMs),我们的方法超越了广泛使用的一种固定的外貌,从而能够快速地反映各种背景问题。


Article 93

Title@2025-06-18 (3): Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

Title: Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding Bi-VLDoc: Bidirektionale Vision-Sprachenmodellierung für Visually-Rich Document Understanding Bi-VLDoc:视觉-里希文件理解的双向视觉-语言建模 2206.13155v2

Authors (8): Chuwei Luo, Guozhi Tang, Qi Zheng, Cong Yao, Lianwen Jin, Chenliang Li, Yang Xue, Luo Si

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard benchmarks for VrDU, the way they model and exploit the interactions between vision and language on documents has hindered them from better generalization ability and higher accuracy. In this work, we investigate the problem of vision-language joint representation learning for VrDU mainly from the perspective of supervisory signals. Specifically, a pre-training paradigm called Bi-VLDoc is proposed, in which a bidirectional vision-language supervision strategy and a vision-language hybrid-attention mechanism are devised to fully explore and utilize the interactions between these two modalities, to learn stronger cross-modal document representations with richer semantics. Benefiting from the learned informative cross-modal document representations, Bi-VLDoc significantly advances the state-of-the-art performance on three widely-used document understanding benchmarks, including Form Understanding (from 85.14% to 93.44%), Receipt Information Extraction (from 96.01% to 97.84%), and Document Classification (from 96.08% to 97.12%). On Document Visual QA, Bi-VLDoc achieves the state-of-the-art performance compared to previous single model methods.

培训前的多式文件多式文件模式在各种高视力文件理解(VrDU)任务中证明非常有效。尽管现有的经过培训前的文件模式在VrDU标准基准方面取得了杰出的业绩,但它们的模型和对文件的视觉与语言互动的利用方式阻碍了它们提高一般化能力和更高的准确性。在这项工作中,我们主要从监督信号的角度调查VrDU的视觉语言联合代表学习问题。具体地说,提出了名为Bi-VLDoc的培训前模式,其中设计了双向愿景语言监督战略和愿景语言混合关注机制,以充分探索和利用这两种模式之间的互动,学习更强的跨式文件形式,用更丰富的语义学。Bi-VLDoc从学习了丰富的跨式文件表述,大大推进了在三种广泛使用的文件理解基准方面的最新业绩,包括形式理解(从85.14%到93.44%)、接收信息提取(从96.01%到97.84%)和文件分类(从96.08至97.84%),以及文件分类(从以往的视觉-LDM-S-S-I-LA-S-S-V-S-S-LA-S-S-S-S-S-S-I-S-S-S-S-S-S-S-S-I-L-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-I-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-A-A-A-A-S-S-S-S-S-S-S-S-S-S-S-A-S-A-A-A-A-A-A-A-A-A-A-A


Article 94

Title: I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search I-MCTS: Verbesserung der agentischen AutoML durch introspektive Monte Carlo Baumsuche I-MCTS:通过回想蒙特卡洛树搜索加强代理自动洗钱 2502.14693v3

Authors (6): Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu

Recent advancements in large language models (LLMs) have shown remarkable potential in automating machine learning tasks. However, existing LLM-based agents often struggle with low-diversity and suboptimal code generation. While recent work has introduced Monte Carlo Tree Search (MCTS) to address these issues, limitations persist in the quality and diversity of thoughts generated, as well as in the scalar value feedback mechanisms used for node selection. In this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a novel approach that iteratively expands tree nodes through an introspective process that meticulously analyzes solutions and results from parent and sibling nodes. This facilitates a continuous refinement of the node in the search tree, thereby enhancing the overall decision-making process. Furthermore, we integrate a Large Language Model (LLM)-based value model to facilitate direct evaluation of each node’s solution prior to conducting comprehensive computational rollouts. A hybrid rewarding mechanism is implemented to seamlessly transition the Q-value from LLM-estimated scores to actual performance scores. This allows higher-quality nodes to be traversed earlier. Applied to the various ML tasks, our approach demonstrates a 6% absolute improvement in performance compared to the strong open-source AutoML agents, showcasing its effectiveness in enhancing agentic AutoML systems. Resource available at https://github.com/jokieleung/I-MCTS

大型语言模型(LLMS)的近期进展在机器学习任务自动化方面显示出了显著的潜力。然而,现有的LLM代理商往往与低多样性和亚最佳代码生成打交道。虽然最近的工作引入了蒙特卡洛树搜索(MCTS)来解决这些问题,但所产生的想法的质量和多样性以及用于节点选择的标值反馈机制方面仍然存在限制。我们在本研究中引入了Incrospevision Monte Carlo树搜索(I-MCTS),这是一种新颖的办法,它通过一个仔细分析解决方案和母体及双向节点结果的反镜进程,迭代地扩展树结点。这有利于持续完善搜索树中的节点,从而增强总体决策进程。此外,我们整合了一个大型语言模型(LLM)基值模型,以便于在进行全面计算滚动之前直接评价每个节点解决方案。我们采用了一种混合奖励机制,将可使用的Q-价值从LMCS估计得分平稳地转换为实际业绩分。这样可以使高质量的节点节点在搜索树中不断改进,从而在早期展示其绝对性MLMLSMLSMS上显示其绝对性改进。


Article 95

Title@2025-06-18 (3): ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools

Title: ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools ChemHAS: Hierarchische Agenzien-Stacking zur Verbesserung von Chemiewerkzeugen ChemHAS:加强化学工具的等级代理人 2505.21569v2

Authors (7): Zhucong Li, Bowei Zhang, Jin Xiao, Zhijian Zhou, Fenglei Cao, Jiaqing Liang, Yuan Qi

Large Language Model (LLM)-based agents have demonstrated the ability to improve performance in chemistry-related tasks by selecting appropriate tools. However, their effectiveness remains limited by the inherent prediction errors of chemistry tools. In this paper, we take a step further by exploring how LLMbased agents can, in turn, be leveraged to reduce prediction errors of the tools. To this end, we propose ChemHAS (Chemical Hierarchical Agent Stacking), a simple yet effective method that enhances chemistry tools through optimizing agent-stacking structures from limited data. ChemHAS achieves state-of-the-art performance across four fundamental chemistry tasks, demonstrating that our method can effectively compensate for prediction errors of the tools. Furthermore, we identify and characterize four distinct agent-stacking behaviors, potentially improving interpretability and revealing new possibilities for AI agent applications in scientific research. Our code and dataset are publicly available at https: //anonymous.4open.science/r/ChemHAS-01E4/README.md.

大型语言模型(LLM)代理机构通过选择适当的工具,展示了提高化学相关任务绩效的能力,但是其有效性仍然受到化学工具内在预测错误的限制。在本文中,我们进一步探索如何利用LLM代理机构来减少工具预测错误。为此,我们提议ChemHAS(化学等级等级代理公司),这是一个简单而有效的方法,通过优化有限数据中的代理机构拆解结构来增强化学工具。ChemHAS实现了四项基本化学任务的最新性能,表明我们的方法可以有效弥补这些工具的预测错误。此外,我们确定和确定四种不同的代理机构拆解行为,有可能改进可解释性和揭示在科学研究中应用AI代理的新的可能性。我们的代码和数据集可在https://anomous-4.open.science/r/chemHAS-01-E4/README.md上公开查阅。


Article 96

Title@2025-06-18 (3): Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs

Title: Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs Ring-lite: Skalierbares Reasoning über C3PO-stabilisiertes Verstärkungslernen für LLMs 环:通过C3PO – – 稳定地加强LLMs的强化学习,按比例说明理由 2506.14731v2

Authors (46): Ling Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen

We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization(C3PO), a novel approach that enhances training stability and improves computational throughput via algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed dataset. We will release the model, dataset, and code.

我们提出了基于“环利”(Minglite)的大型语言模型(Mixture of Experters (MoE)),该模型通过强化学习优化了三分之一的参数,以实现高效和稳健的推理能力。为此,我们引入了联合培训管道,将精炼与RL(RL)相结合,揭示了在REL(MEL)培训中的无证挑战。首先,我们确定在REL培训中最优化的不稳定性,并提议在具有挑战性的基准(如AIME、LiveCodeBench、GPQA-Diamond)方面,我们的方法与最先进的环境兼容政策(SOTPA)小规模推理模型(SO)的性能相匹配,这是一种新颖的方法,它能加强培训稳定性,并通过算法系统共同设计改进计算量。第二,我们从经验上证明,选择基于输卵损失的蒸馏检查点,而不是在REL(RL)培训中,我们引入了联合蒸馏标准,我们确定了在REL(RD)培训中最终的高级数据效率标准,我们提出了在最后将数据整合中进行数据整合,数据整合到数据整合。


Article 97

Title@2025-06-18 (3): Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level

Title: Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level Root Defence Strategies: Gewährleistung der Sicherheit von LLM auf der Decodierungsebene 根本防御战略:确保顶级LLM的安全 2410.06809v3

Authors (5): Xinyi Zeng, Yuying Shang, Jiawei Chen, Jingyuan Zhang, Yu Tian

Large language models (LLMs) have demonstrated immense utility across various industries. However, as LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. While current methods effectively address jailbreak risks, they share common limitations: 1) Judging harmful responses from the prefill-level lacks utilization of the model’s decoding outputs, leading to relatively lower effectiveness and robustness. 2) Rejecting potentially harmful responses based on a single evaluation can significantly impair the model’s helpfulness.This paper examines the LLMs’ capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens. Motivated by pilot experiment results, we design a robust defense mechanism at the decoding level. Our novel decoder-oriented, step-by-step defense architecture corrects harmful queries directly rather than rejecting them outright. We introduce speculative decoding to enhance usability and facilitate deployment to boost secure decoding speed. Extensive experiments demonstrate that our approach improves model security without compromising reasoning speed. Notably, our method leverages the model’s ability to discern hazardous information, maintaining its helpfulness compared to existing methods.

大型语言模型(LLMS)在各种行业中显示出巨大的实用性,然而,随着LLMS的进步,有害产出的风险因错误或恶意教学的推动而增加。虽然目前的方法有效地解决了越狱风险,但它们有着共同的局限性:(1) 判断预填补阶段的有害反应缺乏对模型解码产出的利用,从而导致相对较低的效力和稳健性。(2) 拒绝基于单一评价的潜在有害反应会大大损害模型的有用性。 本文审视LLMS在认识有害产出、披露和量化其评估以往标牌危险能力方面的能力。我们受试点试验结果的驱动,在解码一级设计一个强大的防御机制。我们新的以编码为导向的、逐步的防御结构直接纠正有害查询,而不是彻底拒绝这些查询。我们引入投机性解码,以提高可用性并促进部署,从而提高安全解码速度。广泛的实验表明,我们的方法在不损害推理速度的情况下提高了模型的安全性。我们的方法利用了模型识别危险信息的能力,保持了与现有方法的有用性。


Article 98

Title@2025-06-18 (3): Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification

Title: Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification Verbesserung des Dialog-Diskurs Parsens durch diskursbewusste Aufklärung 通过有礼识的尿道澄清改进对话讨论 2506.15081v1

Authors (3): Yaxin Fan, Peifeng Li, Qiaoming Zhu

Dialogue discourse parsing aims to identify and analyze discourse relations between the utterances within dialogues. However, linguistic features in dialogues, such as omission and idiom, frequently introduce ambiguities that obscure the intended discourse relations, posing significant challenges for parsers. To address this issue, we propose a Discourse-aware Clarification Module (DCM) to enhance the performance of the dialogue discourse parser. DCM employs two distinct reasoning processes: clarification type reasoning and discourse goal reasoning. The former analyzes linguistic features, while the latter distinguishes the intended relation from the ambiguous one. Furthermore, we introduce Contribution-aware Preference Optimization (CPO) to mitigate the risk of erroneous clarifications, thereby reducing cascading errors. CPO enables the parser to assess the contributions of the clarifications from DCM and provide feedback to optimize the DCM, enhancing its adaptability and alignment with the parser’s requirements. Extensive experiments on the STAC and Molweni datasets demonstrate that our approach effectively resolves ambiguities and significantly outperforms the state-of-the-art (SOTA) baselines.

对话讨论旨在确定和分析对话中言论之间的对话关系,然而,对话中的语言特征,例如疏漏和语调,往往带来模糊不清,模糊了预期的谈话关系,给分析者带来重大挑战。为解决这一问题,我们提议了一个有意见的澄清模块,以提高对话对话对话分析员的绩效。 会议采用两种不同的推理程序:澄清类型推理和讨论目标推理。前者分析语言特征,而后者将预期的关系与模棱两可的关系区分开来。此外,我们引入了贡献意识优化优化(CPO),以减轻错误澄清的风险,从而减少含混错误的错误。 CPO使分析员能够评估多哈会议澄清的贡献,并提供反馈,以优化多哈会议,加强其适应性和与分析员要求的一致性。关于STAC和Molweni数据集的广泛实验表明,我们的做法有效地解决了模糊问题,大大超出了国家-艺术基线。


Article 99

Title@2025-06-18 (3): Learning-Time Encoding Shapes Unlearning in LLMs

Title: Learning-Time Encoding Shapes Unlearning in LLMs Lernzeitkodierung Formen Entlernen in LLMs 学习-时间编码 2506.15076v1

Authors (3): Ruihan Wu, Konstantin Garov, Kamalika Chaudhuri

As large language models (LLMs) are increasingly deployed in the real world, the ability to ``unlearn’’, or remove specific pieces of knowledge post hoc, has become essential for a variety of reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, and has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. Our experiments reveal two key findings: (1) learning with paraphrased descriptions improves unlearning performance and (2) unlearning individual piece of knowledge from a chunk of text is challenging. Our results suggest that learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning.

由于大型语言模型(LLMs)越来越多地部署在现实世界中,“unlearn’‘”或删除特定知识后的具体部分的能力由于从隐私条例到纠正过时或有害内容等各种原因变得至关重要。先前的工作提出了不学习的基准和算法,并通常假定培训过程和目标模式已经固定。在这项工作中,我们从经验上调查知识的学习-时间选择如何影响不学习事实知识的效果。我们的实验揭示了两个主要结论:(1) 以文字描述进行学习可以改进不学习的成绩,(2) 从一大批文字中不学习个人知识具有挑战性。我们的结果表明,学习-时间知识编码可以在促成可靠的选修后学习方面发挥核心作用。


Article 100

Title@2025-06-18 (3): LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models

Title: LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models LLMs können gefährliche Gründe sein: Analysieren-basierter Jailbreak-Angriff auf große Sprachmodelle LLMs可以是危险理由:基于分析的对大语言模式的越狱攻击 2407.16205v6

Authors (7): Shi Lin, Hongming Yang, Rongchang Li, Xun Wang, Changting Lin, Wenpeng Xing, Meng Han

The rapid development of Large Language Models (LLMs) has brought impressive advancements across various tasks. However, despite these achievements, LLMs still pose inherent safety risks, especially in the context of jailbreak attacks. Most existing jailbreak methods follow an input-level manipulation paradigm to bypass safety mechanisms. Yet, as alignment techniques improve, such attacks are becoming increasingly detectable. In this work, we identify an underexplored threat vector: the model’s internal reasoning process, which can be manipulated to elicit harmful outputs in a more stealthy way. To explore this overlooked attack surface, we propose a novel black-box jailbreak attack method, Analyzing-based Jailbreak (ABJ). ABJ comprises two independent attack paths: textual and visual reasoning attacks, which exploit the model’s multimodal reasoning capabilities to bypass safety mechanisms, comprehensively exposing vulnerabilities in its reasoning chain. We conduct extensive experiments on ABJ across various open-source and closed-source LLMs, VLMs, and RLMs. In particular, ABJ achieves high attack success rate (ASR) (82.1% on GPT-4o-2024-11-20) with exceptional attack efficiency (AE) among all target models, showcasing its remarkable attack effectiveness, transferability, and efficiency. Our work reveals a new type of safety risk and highlights the urgent need to mitigate implicit vulnerabilities in the model’s reasoning process.

大语言模型(LLMS)的迅速发展给各种任务带来了令人印象深刻的进步。然而,尽管取得了这些成就,LLMS仍然构成固有的安全风险,特别是在破狱袭击的情况下。大多数现有的破狱方法都遵循一种投入层面的操纵模式,绕过安全机制。然而,随着校准技术的改进,这类袭击正变得越来越容易被察觉。在这项工作中,我们确定了一种探索不足的威胁矢量:该模型的内部推理过程,它可以被更隐蔽地操纵以获得有害产出。为了探索这一被忽视的攻击表面,我们提出了一种新的黑盒破狱袭击方法,即以分析为基础的监狱破狱(ABJ)。ABJ包括两种独立的攻击路径:文字和视觉推理攻击,利用该模型的多式推理能力绕过安全机制,全面暴露其推理链中的弱点。我们对ABJ公司的各种开源和封闭源LMS、VLMS模型和RLMS进行了广泛的实验。特别是ABJ公司在袭击中取得了高攻击成功率(ASR) (8.1%在GPT-4-2024-11-20-20-2020)和隐蔽性攻击性20),其所有惊人的效率转移目标,展示了我们攻击性攻击性高度风险。


Article 101

Title@2025-06-18 (3): Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

Title: Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation Semantially-Aware Belohnungen für Open-Ended R1 Training in Free-Form Generation 免费新一代不限名额R1培训的 “ 闪存式 “ 奖项 2506.15068v1

Authors (7): Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan Lee Boyd-Graber

Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.

评估开放型长方形的一代很难确定,因为很难确定哪些是明确区分好与坏产出的好。现有的方法往往忽略了一致性、风格或相关性等关键方面,或者对培训前数据有偏差,使开放型长方形评价成为探索不足的问题。为了解决这一差距,我们提议PrefBERT,这是在GROP中评价开放型长方形一代的评分模式,并指导其培训,对好和坏产出给予不同奖励。在两个反应评价数据集中培训了具有不同长式风格和差价评级质量的数据集,PrefBERT通过提供比传统的ROUGE-L和BERTScore等标准更好的语义性奖励反馈,有效地支持GROPO。通过全面评价,包括LLM-as-as-a-judge、人类评级和定性分析,我们表明PrefBERT仍然可靠,跨越不同的长段段和可核实的奖励需要。人类评价证实,使用PrefBERT作为培训政策模型的奖励信号,有效支持GROPERD-lusmaismatium_commormormation reformation reformation reformation reformormormations。


Article 102

Title@2025-06-18 (3): Math Neurosurgery: Isolating Language Models’ Math Reasoning Abilities Using Only Forward Passes

Title: Math Neurosurgery: Isolating Language Models’ Math Reasoning Abilities Using Only Forward Passes Math Neurochirurgie: Die Math-Reasoning-Fähigkeiten von Sprachmodellen mit nur Vorwärtspassagen isolieren 数学神经外科:仅使用前方通道的孤立语言模型理据能力 2410.16930v4

Authors (4): Bryan R. Christ, Zack Gottesman, Jonathan Kropko, Thomas Hartvigsen

Math reasoning is an active area of Large Language Model (LLM) research because it is a hallmark of artificial intelligence and has implications in several domains, including math education. However, few works have explored how math reasoning is encoded within LLM parameters and if it is a skill that can be isolated within models. Doing so could allow targeted intervention to improve math performance without altering non-math behavior and foster understanding of how models encode math reasoning. We introduce Math Neurosurgery (MathNeuro), a computationally efficient method we use to isolate math-specific parameters in LLMs using only forward passes. MathNeuro builds on existing work by using weights and activations to calculate parameter importance, but isolates math-specific parameters by filtering out those important for general language tasks. Through pruning parameters MathNeuro identifies, we delete a LLM’s math reasoning ability without significantly impacting its general language ability. Scaling the identified parameters by a small constant improves a pretrained or instruction-tuned LLM’s performance by 4-17% on GSM8K and 5-35% on MATH while leaving non-math behavior unaltered. MathNeuro is also data efficient: most of its effectiveness holds when identifying math-specific parameters using a single sample. MathNeuro highlights the potential for future work to intervene on math-specific parameters.

数学推理是大语言模型(LLM)研究的一个积极领域,因为它是人工智能的标志,对包括数学教育在内的若干领域具有影响。然而,很少有著作探讨了数学推理如何在LLM参数中编码,如果数学推理是一种可以在模型中孤立的技能。这样做可以允许有针对性地干预提高数学性能,而不会改变非数学行为,并且有助于理解模型如何编码数学推理。我们引入了数学神经外科(Math Neuro),这是一种计算效率高的方法,我们仅使用远道通行证将LLMMLM中特定数学参数分离出来。 MathNeuro利用现有工作,使用重量和激活来计算参数的重要性,但通过过滤一般语言任务中的重要参数,将数学参数分离出来。我们通过理算参数来删除LLM的数学推理能力,而不会对其一般语言推理能力产生很大影响。我们通过小的恒定不变的方法,将GSM8K和MATH的5-35%的预选或经指示调整LM的性性表现,同时将非数学行为表现为最精确的数学参数。数学精准的数学精准的数学参数,在数学的数学参数上也保持了数学精准。数学的数学模型的精准度。在数学的精准度上,在数学的精准性能。在数学的数学的精准性能度上,在数学参数上也是数学精准性能。


Article 103

Title@2025-06-18 (3): SemVink: Advancing VLMs’ Semantic Understanding of Optical Illusions via Visual Global Thinking

Title: SemVink: Advancing VLMs’ Semantic Understanding of Optical Illusions via Visual Global Thinking SemVink: Das semantische Verständnis optischer Illusionen durch visuelles globales Denken von VLMs verbessern SemVink:通过视觉全球思维推进VLMs对光学幻影的语义理解 2506.02803v2

Authors (3): Sifan Li, Yujun Cai, Yiwei Wang

Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0-5.36%)-even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. Strikingly, we propose SemVink (Semantic Visual Thinking) by simply scaling images to low resolutions (32-128 pixels), which unlocks >99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over low-level visual operations crucial for real-world robustness. Our work urges a shift toward hybrid models integrating multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.

视觉语言模型(VLMS)在语义任务方面非常出色,但在核心人的能力方面却步履维艰:通过放大等感知调整发现视觉幻觉或AI产生的图像中的隐藏内容。我们引入了HC-Bench,这是112个图像的基准,带有隐藏的文字、物体和幻觉,显示领先的VLMS即使有明确的提示,也实现了接近零的精确度(0-5.36 % ) 。人类本能地解决了这种模糊性,然而VLMS却由于过度依赖高层语义而未能成功。令人惊讶的是,我们建议SemVink(Seminical 视觉思维)仅仅将图像缩放到低分辨率(32-128像素),通过消除多余的视觉噪音而释放出大于99%的精确度。这暴露了一个关键的建筑缺陷:VLMS优先考虑抽象推理而不是对现实世界稳健至关重要的低级视觉操作。我们的工作敦促转向混合模型,将多级处理结合起来,缩小计算视觉与医学成、安全及外部应用之间的鸿沟。


Article 104

Title@2025-06-18 (3): Thunder-NUBench: A Benchmark for LLMs’ Sentence-Level Negation Understanding

Title: Thunder-NUBench: A Benchmark for LLMs’ Sentence-Level Negation Understanding Thunder-NUBench: Ein Benchmark für das Urteils-Negation-Verständnis von LLMs Thunder-NUBench:LLLM女士的判刑级差理解基准 2506.14397v2

Authors (7): Yeonkyoung So, Gyuseong Lee, Sungmok Jung, Joonhak Lee, JiA Kang, Sangho Kim, Jaejin Lee

Negation is a fundamental linguistic phenomenon that poses persistent challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Existing benchmarks often treat negation as a side case within broader tasks like natural language inference, resulting in a lack of benchmarks that exclusively target negation understanding. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly designed to assess sentence-level negation understanding in LLMs. Thunder-NUBench goes beyond surface-level cue detection by contrasting standard negation with structurally diverse alternatives such as local negation, contradiction, and paraphrase. The benchmark consists of manually curated sentence-negation pairs and a multiple-choice dataset that enables in-depth evaluation of models’ negation understanding.

偏差是一个基本的语言现象,对大语言模型(LLMs)构成持续的挑战,特别是在需要深入的语义理解的任务中。现有基准往往将否定作为比自然语言推断等更广泛的任务的一个侧面案例,导致缺乏完全针对否定理解的基准。在这项工作中,我们引入了闪电-NUBench(Thunder-NUBench),这是一个新颖的基准,明确旨在评估LLMs对判决一级否定的理解。雷电-NUBench(Thunder-NUBench)超越了地表级提示检测,将标准否定与结构上多样化的替代方法(如当地否定、矛盾和参数)相对比。基准包括手动的句断配和多选数据集,从而能够深入评估模式否定理解。


Article 105

Title@2025-06-18 (3): Identifying economic narratives in large text corpora – An integrated approach using Large Language Models

Title: Identifying economic narratives in large text corpora – An integrated approach using Large Language Models Identifizieren von ökonomischen Erzählungen in großen Textkorpora – Ein integrierter Ansatz mit großen Sprachmodellen 在大文本公司中确定经济说明 – – 使用大语言模式的综合办法 2506.15041v1

Authors (6): Tobias Schmidt, Kai-Robin Lange, Matthias Reccius, Henrik Müller, Michael Roos, Carsten Jentsch

As interest in economic narratives has grown in recent years, so has the number of pipelines dedicated to extracting such narratives from texts. Pipelines often employ a mix of state-of-the-art natural language processing techniques, such as BERT, to tackle this task. While effective on foundational linguistic operations essential for narrative extraction, such models lack the deeper semantic understanding required to distinguish extracting economic narratives from merely conducting classic tasks like Semantic Role Labeling. Instead of relying on complex model pipelines, we evaluate the benefits of Large Language Models (LLMs) by analyzing a corpus of Wall Street Journal and New York Times newspaper articles about inflation. We apply a rigorous narrative definition and compare GPT-4o outputs to gold-standard narratives produced by expert annotators. Our results suggests that GPT-4o is capable of extracting valid economic narratives in a structured format, but still falls short of expert-level performance when handling complex documents and narratives. Given the novelty of LLMs in economic research, we also provide guidance for future work in economics and the social sciences that employs LLMs to pursue similar objectives.

随着近年来对经济叙事的兴趣增加,专用于从文本中提取此类叙事的管道数量也有所增加,管道往往采用各种最先进的自然语言处理技术,例如BERT,来完成这项任务。这种模型在对叙事提取所必需的基础语言操作方面是有效的,但缺乏更深层次的语义理解,无法将经济叙事与仅仅从事诸如塞文角色标签等经典任务区分开来。我们不依赖复杂的示范管道,而是通过分析《华尔街日报》和《纽约时报》关于通货膨胀的文章来评估大语言模型的好处。我们采用了严格的叙述定义,并将GPT-4o产出与专家说明家制作的黄金标准叙事进行比较。我们的结果表明,GPT-4o能够以结构化格式提取有效的经济叙事,但在处理复杂的文件和叙事时仍然落后于专家一级的表现。鉴于LLMS在经济研究中的新特点,我们还为今后在经济和社会科学领域开展的工作提供指导,利用LMSMS追求类似的目标。


Article 106

Title@2025-06-18 (3): Identifying social isolation themes in NVDRS text narratives using topic modeling and text-classification methods

Title: Identifying social isolation themes in NVDRS text narratives using topic modeling and text-classification methods Identifizierung sozialer Isolationsthemen in NVDRS-Textnarrativen mittels Themenmodellierung und Textklassifizierung 利用专题建模和文本分类方法,在国家难民、难民、难民、难民、难民、难民、难民、难民、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者、流离失所者 2506.15030v1

Authors (5): Drew Walker, Swati Rajwal, Sudeshna Das, Snigdha Peddireddy, Abeed Sarker

Social isolation and loneliness, which have been increasing in recent years strongly contribute toward suicide rates. Although social isolation and loneliness are not currently recorded within the US National Violent Death Reporting System’s (NVDRS) structured variables, natural language processing (NLP) techniques can be used to identify these constructs in law enforcement and coroner medical examiner narratives. Using topic modeling to generate lexicon development and supervised learning classifiers, we developed high-quality classifiers (average F1: .86, accuracy: .82). Evaluating over 300,000 suicides from 2002 to 2020, we identified 1,198 mentioning chronic social isolation. Decedents had higher odds of chronic social isolation classification if they were men (OR = 1.44; CI: 1.24, 1.69, p<.0001), gay (OR = 3.68; 1.97, 6.33, p<.0001), or were divorced (OR = 3.34; 2.68, 4.19, p<.0001). We found significant predictors for other social isolation topics of recent or impending divorce, child custody loss, eviction or recent move, and break-up. Our methods can improve surveillance and prevention of social isolation and loneliness in the United States.

尽管目前在美国国家暴力死亡报告系统(NVDRS)的结构变量中没有记录到社会孤立和孤独,但自然语言处理(NLP)技术可以用来在执法和验尸官叙述中确定这些结构,利用主题模型来生成词汇开发和监督学习分类,我们开发了高质量的分类器(平均F1:86,准确度:.82),对2002至2020年30多万起自杀进行了评估,我们发现有1 198起提到长期的社会孤立。如果死者是男性,则长期社会孤立分类的可能性更大(OR=1.44;CI:1.24,1.69,p < 0001),同性恋(OR=3.68;1.97,6.33,p<0001),或离婚(OR=3.34;2.68,4.19,p<0001)。我们发现了其他社会孤立性专题的重要预测器,如最近或迫近或迫婚、儿童监护损失、驱逐或最近搬离,以及解体。我们的方法可以改善美国的监视和预防社会隔离和社会孤独。


Article 107

Title@2025-06-18 (3): An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW

Title: An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW Eine präzise und überarbeitete Version der optischen Zeichenerkennungs-basierten Sprachsynthese mit LabVIEW 利用拉比韦厄综合实验室进行精确和订正的光学字符识别语音合成 2506.15029v1

Authors (2): Prateek Mehta, Anasuya Patil

Knowledge extraction through sound is a distinctive property. Visually impaired individuals often rely solely on Braille books and audio recordings provided by NGOs. Due to limitations in these approaches, blind individuals often cannot access books of their choice. Speech is a more effective mode of communication than text for blind and visually impaired persons, as they can easily respond to sounds. This paper presents the development of an accurate, reliable, cost-effective, and user-friendly optical character recognition (OCR)-based speech synthesis system. The OCR-based system has been implemented using Laboratory Virtual Instrument Engineering Workbench (LabVIEW).

通过声音获取知识是一种独特的特性,视障人士往往只依靠非政府组织提供的盲文书籍和录音记录,由于这些方法的局限性,盲人往往无法读取自己选择的书籍。语言是盲人和视障人士比文字更有效的沟通方式,因为他们可以很容易地对声音作出反应。本文介绍了开发一个精确、可靠、成本效益高和方便用户的光学特征识别合成系统(OCR),该系统是利用实验室虚拟仪器工程工作网(LabVIEW)实施的。


Article 108

Title@2025-06-17 (2): Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size

Title: Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size Optimale Einbettung der Lernrate in LLMs: Der Effekt der Vokabelgröße LLMM中最佳嵌入式学习率:词汇大小的影响 2506.15025v1

Authors (2): Soufiane Hayou, Liyuan Liu

Pretraining large language models is a costly process. To make this process more efficient, several methods have been proposed to optimize model architecture/parametrization and hardware use. On the parametrization side, $\mu P$ (Maximal Update Parametrization) parametrizes model weights and learning rate (LR) in a way that makes hyperparameters (HPs) transferable with width (embedding dimension): HPs can be tuned for a small model and used for larger models without additional tuning. While $\mu$P showed impressive results in practice, recent empirical studies have reported conflicting observations when applied to LLMs. One limitation of the theory behind $\mu$P is the fact that input dimension (vocabulary size in LLMs) is considered fixed when taking the width to infinity. This is unrealistic since vocabulary size is generally much larger than width in practice. In this work, we provide a theoretical analysis of the effect of vocabulary size on training dynamics, and subsequently show that as vocabulary size increases, the training dynamics \emph{interpolate between the $\mu$P regime and another regime that we call Large Vocab (LV) Regime}, where optimal scaling rules are different from those predicted by $\mu$P. Our analysis reveals that in the LV regime, the optimal embedding LR to hidden LR ratio should roughly scale as $\Theta(\sqrt{width})$, surprisingly close to the empirical findings previously reported in the literature, and different from the $\Theta(width)$ ratio predicted by $\mu$P. We conduct several experiments to validate our theory, and pretrain a 1B model from scratch to show the benefit of our suggested scaling rule for the embedding LR.

大规模语言模型前训练是一个昂贵的过程。 要使这个过程更有效率, 已经提出了几种方法来优化模型结构/ 平衡和硬件使用。 在 石美化方面, $\ mu P$ (Maximal Und Paramedrization) 模型重量和学习率 (LR) , 使超参数(HPs) 具有宽度( 装饰尺寸) 的可转让性: HP 可以调用一个小模型, 并用于更大的模型, 而无需额外调整。 虽然 $\ mu$ P 显示在实践中取得了令人印象深刻的结果, 但最近的经验研究表明, 当应用到LLMMs时, 美元背后的理论限制是 $\ $mumuP 背后的理论, 投入的尺寸( LMSMs) 被认为是固定的 。 这不切实际, 因为词汇的大小一般大于实际的宽度。 在这项工作中, 我们对词汇规模的影响进行理论分析, 我们用词汇规模增加, 培训动力 $ = 美元 美元 美元 。


Article 109

Title@2025-06-17 (2): Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation

Title: Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation 多方语言模式:推进合作、协调和适应 2506.09331v2

Authors (1): Arjun Vaithilingam Sudhakar

Modern Large Language Models (LLMs) exhibit impressive zero-shot and few-shot generalization capabilities across complex natural language tasks, enabling their widespread use as virtual assistants for diverse applications such as translation and summarization. Despite being trained solely on large corpora of text without explicit supervision on author intent, LLMs appear to infer the underlying meaning of textual interactions. This raises a fundamental question: can LLMs model and reason about the intentions of others, i.e., do they possess a form of theory of mind? Understanding other’s intentions is crucial for effective collaboration, which underpins human societal success and is essential for cooperative interactions among multiple agents, including humans and autonomous systems. In this work, we investigate the theory of mind in LLMs through the lens of cooperative multi-agent reinforcement learning (MARL), where agents learn to collaborate via repeated interactions, mirroring human social reasoning. Our approach aims to enhance artificial agent’s ability to adapt and cooperate with both artificial and human partners. By leveraging LLM-based agents capable of natural language interaction, we move towards creating hybrid human-AI systems that can foster seamless collaboration, with broad implications for the future of human-artificial interaction.

现代大型语言模型(LLMS)在复杂的自然语言任务中表现出令人印象深刻的零射和少见的概括能力,使LLMS能够广泛用作翻译和总结等各种应用的虚拟助手。尽管LLMS仅仅在没有明确监督作者意图的情况下接受了关于大量文本整体的培训,但似乎推断了文本互动的根本含义。这提出了一个根本问题:LLMS模型和关于他人意图的理由,即它们是否具有某种形式的思想理论?了解他人的意图对于有效合作至关重要,而有效合作是人类社会成功的基础,并且对于包括人类和自主系统在内的多种代理人之间的合作互动至关重要。在这项工作中,我们通过合作性多剂强化学习(MARL)的透镜调查LMMS中的思想理论,代理商通过反复的互动学习合作,反映人类的社会推理。我们的方法的目的是提高人工代理人适应和与人工和人类伙伴合作的能力。通过利用LMM公司能够进行自然语言互动的代理人,我们开始建立能够促进无缝合作的人类-AI混合系统,对未来的人类-艺术互动产生广泛影响。


Article 110

Title@2025-06-17 (2): Entropy-based Exploration Conduction for Multi-step Reasoning

Title: Entropy-based Exploration Conduction for Multi-step Reasoning Entropiebasierte Explorationsleitung für mehrstufige Vernunft 用于多步骤理由的基于英信的探索行为 2503.15848v2

Authors (6): Jinghan Zhang, Xiting Wang, Fengran Mo, Yeyang Zhou, Wanfu Gao, Kunpeng Liu

Multi-step processes via large language models (LLMs) have proven effective for solving complex reasoning tasks. However, the depth of exploration of the reasoning procedure can significantly affect the task performance. Existing methods to automatically decide the depth often lead to high cost and a lack of flexibility. To address these issues, we propose Entropy-based Exploration Depth Conduction (Entro-duction), a novel method that dynamically adjusts the exploration depth during multi-step reasoning by monitoring LLM’s output entropy and variance entropy. We employ these two features to capture the model’s uncertainty of the current step and the fluctuation of uncertainty across consecutive reasoning steps. Based on the observed entropy changes, the LLM selects whether to deepen, expand, or stop exploration according to the probability, which facilitates the trade-off between the reasoning accuracy and exploration effectiveness. Experimental results across four benchmark datasets demonstrate the efficacy of Entro-duction.

通过大型语言模型(LLMs)的多步骤流程已证明对解决复杂推理任务十分有效,然而,对推理程序的深度探索会大大影响任务业绩。现有的自动决定深度的方法往往导致高成本和缺乏灵活性。为解决这些问题,我们提议采用基于Entropy的勘探深度行为(Ento-togination),这是一种新方法,通过监测LLM输出的导体和变异导体,在多步推理过程中动态调整勘探深度。我们利用这两个特征来捕捉模型当前步骤的不确定性和连续推理步骤的不确定性波动。根据观察到的引力变化,LLM根据概率选择是深化、扩大还是停止探索,这有利于推理准确性和勘探效力之间的权衡。四个基准数据集的实验结果显示了Entro-duction的功效。


Article 111

Title@2025-06-17 (2): Memory Tokens: Large Language Models Can Generate Reversible Sentence Embeddings

Title: Memory Tokens: Large Language Models Can Generate Reversible Sentence Embeddings Memory Tokens: Große Sprachmodelle können reversible Satzeinbettungen generieren 内存当量: 大语言模型能够生成可翻转的句子嵌入 2506.15001v1

Authors (2): Ignacio Sastre, Aiala Rosá

In this work, we observe an interesting phenomenon: it is possible to generate reversible sentence embeddings that allow an LLM to reconstruct the original text exactly, without modifying the model’s weights. This is achieved by introducing a special memory token, whose embedding is optimized through training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. We evaluate this phenomenon across English and Spanish datasets, sequences of up to approximately 240 tokens, and model scales ranging from 100M to 8B parameters. Notably, Llama 3.1 8B successfully reconstructs all tested sequences. Our findings highlight an interesting capability of LLMs and suggest potential applications in memory-based retrieval, compression, and controlled text generation.

在这项工作中,我们观察到一个有趣的现象:有可能产生可逆的句子嵌入,使LLM能够在不修改模型重量的情况下精确地重建原始文本。这是通过引入一个特殊的记忆符号实现的,该符号的嵌入通过固定顺序的培训得到优化。当嵌入该模型时,该模型精确地重建固定序列。我们在整个英文和西班牙的数据集中评估这一现象,大约240个符号的序列,以及从100M到8B参数的模型尺度。值得注意的是,Llama 3.1 8B成功地重建了所有测试过的序列。我们的调查结果突出了LMM的有趣能力,并提出了在基于记忆的检索、压缩和控制的文本生成方面的潜在应用。


Article 112

Title@2025-06-17 (2): Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings

Title: Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings Hypothesentest zur Quantifizierung von LLM-Mensch-Missausrichtung in Mehrfachauswahl-Einstellungen 多种选择环境中人类错配量化LLM-人类错配的假设测试 2506.14997v1

Authors (3): Harbin Hong, Sebastian Caldas, Liu Leqi

As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. In this work, using hypothesis testing, we present a quantitative framework to assess the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. This framework allows us to determine in a principled way whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options. We applied this framework to a popular language model for simulating people’s opinions in various public surveys and found that this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) for contentious questions. This raises questions about the alignment of this language model with the tested populations, highlighting the need for new practices in using LLMs for social science studies beyond naive simulations of human subjects.

随着大型语言模型(LLMs)越来越多地出现在社会科学研究中(例如经济学和市场营销),评估这些模型复制人类行为的程度变得至关重要。在这项工作中,我们利用假设测试,提出了一个定量框架来评估多种选择调查环境中LLM模拟和实际人类行为之间的不匹配。这个框架使我们能够以有原则的方式确定具体语言模型是否能够有效地模拟人类观点、决策和通过多种选择选项体现的一般行为。我们将这个框架应用于一个在各种公共调查中模拟人们观点的流行语言模型,并发现这一模型不适合模拟经过测试的亚人口(例如不同种族、年龄和收入的亚群体),以模拟有争议的问题。这提出了关于这一语言模型与经过测试的人口相一致的问题,突出说明除了天真的模拟人类主题之外,需要采用新的做法来利用LMS进行社会科学研究。


Article 113

Title@2025-06-17 (2): LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles

Title: LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles LaMP-Cap: Personalisierte Bildunterschriftserstellung mit multimodalen Bildprofilen LaMP-Cap: 具有多模式图解的个人化图解生成 2506.06561v2

Authors (11): Ho Yin ‘Sam’ Ng, Ting-Yao Hsu, Aashish Anantha Ramakrishnan, Branislav Kveton, Nedim Lipka, Franck Dernoncourt, Dongwon Lee, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ting-Hao ‘Kenneth’ Huang

Figure captions are crucial for helping readers understand and remember a figure’s key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain’s style, highlighting the need for personalization. Despite language models’ personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document–each with its image, caption, and figure-mentioning paragraphs–as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.

图表说明对于帮助读者理解和记住图中的关键信息至关重要。 许多模型已经开发出来以生成这些字幕, 帮助作者更方便地制作质量更高的字幕。 然而, 作者几乎总是需要修改通用的 AI 生成的字幕, 以匹配其写作风格和域的风格, 突出个性化的需要。 尽管语言模型的个性化( LAMP) 进步, 这些技术往往侧重于文本专用设置, 并且很少涉及输入和简介都是多式的情景。 本文介绍了LaMP Cap, 这是个人化的图解生成的数据集, 包含多式图解剖。 对于每个目标图示, LaMP Cap 不仅提供所需的投入, 如图示图像, , 而且还提供来自同一文档的另外三个图象、 标题和图解段落的数据, 作为描述背景的概况。 与四个LMS的实验显示, 使用概况信息始终有助于生成更接近原始作者撰写的字幕。 缩略图研究表明, 配置中的图像比图解段落更有用, , 突出使用多式图解的优势。


Article 114

Title@2025-06-17 (2): Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Title: Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective Überarbeiten von Stärkungslernen für LLM-Reasoning aus einer bereichsübergreifenden Perspektive 重新考虑从跨主题角度重新研究加强学习LLM 2506.14965v1

Authors (24): Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu

Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains–Math, Code, Science, Logic, Simulation, and Tabular–each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360

强化学习(RL) 已经出现,360 是改进大型语言模型推理的有希望的方法,360 是一个有希望的方法, 360 是改进大型语言模型(LLM)推理的推理, 但大多数开放的努力都狭隘地侧重于数学和代码, 限制了我们对它更广泛适用于一般推理的理解。 一个关键的挑战在于缺乏可靠、可伸缩的RL奖励信号, 在各个不同的推理领域。 我们引入了Guru, 由9个推理领域构成的92K可核实的例子构成的92K推理集 — Math, Code, Science, Science, 和Tabulal-leach-learge, 通过特定地域的奖赏设计、解析和过滤,确保RLual培训的可靠性和有效性。 在Grual-L的推理学模型中,我们系统重新审视了RLML的既定结果, 最终显示,我们可能实现有意义的业绩,我们现有的G-B 数据模型中经常看到,我们现有的17个模型的域中, 最高级的域中最高级的Ral-L-ladeal-real-real-real-real lax-lax-lax-lax-lax-lax-lax-lax-lax


Article 115

Title@2025-06-17 (2): From Chat to Checkup: Can Large Language Models Assist in Diabetes Prediction?

Title: From Chat to Checkup: Can Large Language Models Assist in Diabetes Prediction? Vom Chat bis zum Checkup: Können große Sprachmodelle bei der Diabetes-Vorhersage helfen? 从聊天到检查:大语言模型能帮助糖尿病预测吗? 2506.14949v1

Authors (3): Shadman Sakib, Oishy Fatema Akhand, Ajwad Abrar

While Machine Learning (ML) and Deep Learning (DL) models have been widely used for diabetes prediction, the use of Large Language Models (LLMs) for structured numerical data is still not well explored. In this study, we test the effectiveness of LLMs in predicting diabetes using zero-shot, one-shot, and three-shot prompting methods. We conduct an empirical analysis using the Pima Indian Diabetes Database (PIDD). We evaluate six LLMs, including four open-source models: Gemma-2-27B, Mistral-7B, Llama-3.1-8B, and Llama-3.2-2B. We also test two proprietary models: GPT-4o and Gemini Flash 2.0. In addition, we compare their performance with three traditional machine learning models: Random Forest, Logistic Regression, and Support Vector Machine (SVM). We use accuracy, precision, recall, and F1-score as evaluation metrics. Our results show that proprietary LLMs perform better than open-source ones, with GPT-4o and Gemma-2-27B achieving the highest accuracy in few-shot settings. Notably, Gemma-2-27B also outperforms the traditional ML models in terms of F1-score. However, there are still issues such as performance variation across prompting strategies and the need for domain-specific fine-tuning. This study shows that LLMs can be useful for medical prediction tasks and encourages future work on prompt engineering and hybrid approaches to improve healthcare predictions.

虽然机器学习(ML)和深学习(DL)模型被广泛用于糖尿病的预测,但大型语言模型(LLM)用于结构性数字数据仍没有得到很好地探索。在本研究中,我们测试LLM在使用零射、一发和三发促学方法预测糖尿病方面的效力。我们利用皮马印第安人糖尿病数据库(PIDD)进行经验分析。我们评估了六种LMM,包括四个开放源模型:Gemma-227B、Mistral-7B、Llama-3.1-31-8B和Llama-3.2-2B。我们还测试了两种专有型模型:GPT-4o和Gemini Flash 2.0。此外,我们用三种传统机器学习模型(随机森林、物流回归和支持病媒机(SVM))来比较其性能。我们用精确、精确、回顾和F1核心作为评价指标。我们的结果显示,专有LMM比开放源方法更有用,GPT-4o和Gemma-227B在几发域域域域内实现了最精确的精确的精确的精确度模型。Gemma-Lma-Ls的预测,在FS-S-S-S-S-S-S-S-Syalma-S-s-s-s-s-S-S-s-Sylevalma-S-S-S-S-S-S-s-s-s-S-S-S-S-S-s-s-s-s-s-S-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-sma-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-


Article 116

Title@2025-06-17 (2): Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing

Title: Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing Unter Bearbeiten & Überarbeiten mit iterativem & Nachbar-Assisted Model Editing lösen 用迭代和邻里辅助型号编辑解决以迭代和邻里辅助型号编辑的 unit & overdidite 2503.11895v2

Authors (4): Bhiman Kumar Baghel, Scott M. Jordan, Zheyuan Ryan Shi, Xiang Lorraine Li

Large Language Models (LLMs) are widely deployed in downstream tasks, but keeping their knowledge up-to-date via retraining or fine-tuning is often computationally expensive. Model editing provides a more efficient alternative by updating a targeted subset of parameters, which often follows the locate-and-edit paradigm. Despite this efficiency, existing methods are limited: edits may fail to inject knowledge (UnderEdit) or unintentionally disrupt unrelated neighboring knowledge (OverEdit). To address these challenges, we propose two complementary methods: iterative model editing, which applies successive edits to mitigate UnderEdit, and neighbor-assisted model editing, which incorporates neighboring knowledge during editing to reduce OverEdit. Our extensive experiments show that these techniques improve editing performance across multiple LLMs, algorithms, and benchmarks, reducing UnderEdit by up to 38 percentage points and OverEdit by up to 6, while remaining broadly applicable to any locate-and-edit method.

大型语言模型(LLMS)被广泛用于下游任务,但通过再培训或微调不断更新知识往往在计算上十分昂贵。 模型编辑通过更新一个目标参数子集提供了更有效的替代方法,这些参数子集往往遵循定位和编辑范式。 尽管有这种效率,现有方法仍然有限:编辑可能无法注入知识(UnderEdit)或无意中干扰互不相干的知识(Oversedit ) 。 为了应对这些挑战,我们建议了两种互补方法:迭代模式编辑,它应用连续编辑来减轻低编辑,以及邻居协助的模式编辑,它包含编辑过程中的邻里知识以减少超编辑。我们的广泛实验显示,这些技术可以改善多个LLMS、算法和基准的编辑性能,将低编辑率降低到38个百分点,超编辑到6个百分点,同时仍然广泛适用于任何定位和编辑方法。


Article 117

Title@2025-06-17 (2): Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers

Title: Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers Zu groß zu denken: Kapazität, Erinnerung und Verallgemeinerung in vortrainierten Transformern 能力、记忆和在培训前变异器中的普及化 2506.09099v2

Authors (2): Joshua Barron, Devin White

The relationship between memorization and generalization in large language models (LLMs) remains an open area of research, with growing evidence that the two are deeply intertwined. In this work, we investigate this relationship by pre-training a series of capacity-limited Transformer models from scratch on two synthetic character-level tasks designed to separately probe generalization (via arithmetic extrapolation) and memorization (via factual recall). We observe a consistent trade-off: small models extrapolate to unseen arithmetic cases but fail to memorize facts, while larger models memorize but fail to extrapolate. An intermediate-capacity model exhibits a similar shift toward memorization. When trained on both tasks jointly, no model (regardless of size) succeeds at extrapolation. These findings suggest that pre-training may intrinsically favor one learning mode over the other. By isolating these dynamics in a controlled setting, our study offers insight into how model capacity shapes learning behavior and offers broader implications for the design and deployment of small language models.

大型语言模型(LLMs)的记忆化和概括化之间的关系仍然是一个开放的研究领域,越来越多的证据表明两者密切相关。在这项工作中,我们通过从零开始对一系列能力有限的变异模型进行两套合成品级任务的培训来调查这种关系,这些任务旨在分别探索概括化(通过算术外推法)和记忆化(通过事实回顾),我们观察到一个一致的权衡:小型模型外推到看不见的算术案例,但没有记住事实,而较大的模型则记忆化但不能外推。一个中间能力模型显示了类似的向记忆化的转变。在就这两个任务共同培训时,没有任何模型(大小不等的)在外推法上成功。这些调查结果表明,培训前可能本质上有利于一种学习模式。通过在控制环境下将这些动态分开,我们的研究可以深入了解模型能力如何塑造学习行为,并对小型语言模型的设计和应用产生更广泛的影响。


Article 118

Title@2025-06-17 (2): MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance

Title: MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance MDBench: Ein synthetischer Multi-Document-Reasoning-Benchmark mit Wissensführung MDBENCH:以知识指南制作的合成多文件理由说明基准 2506.14927v1

Authors (4): Joseph J. Peper, Wenzhao Qiu, Ali Payani, Lu Wang

Natural language processing evaluation has made significant progress, largely driven by the proliferation of powerful large language mod-els (LLMs). New evaluation benchmarks are of increasing priority as the reasoning capabilities of LLMs are expanding at a rapid pace. In particular, while multi-document (MD) reasoning is an area of extreme relevance given LLM capabilities in handling longer-context inputs, few benchmarks exist to rigorously examine model behavior in this setting. Moreover, the multi-document setting is historically challenging for benchmark creation due to the expensive cost of annotating long inputs. In this work, we introduce MDBench, a new dataset for evaluating LLMs on the task of multi-document reasoning. Notably, MDBench is created through a novel synthetic generation process, allowing us to controllably and efficiently generate challenging document sets and the corresponding question-answer (QA) examples. Our novel technique operates on condensed structured seed knowledge, modifying it through LLM-assisted edits to induce MD-specific reasoning challenges. We then convert this structured knowledge into a natural text surface form, generating a document set and corresponding QA example. We analyze the behavior of popular LLMs and prompting techniques, finding that MDBENCH poses significant challenges for all methods, even with relatively short document sets. We also see our knowledge-guided generation technique (1) allows us to readily perform targeted analysis of MD-specific reasoning capabilities and (2) can be adapted quickly to account for new challenges and future modeling improvements.

自然语言处理评价取得了显著进展,这在很大程度上是由强大的大型语言模版(LLMs)扩散推动的。随着LLMs的推理能力迅速扩大,新的评价基准越来越具有优先地位。特别是,多文件推理是一个极为相关的领域,因为LLMM有能力处理较长期的投入,因此,没有多少基准来严格审查这一背景下的示范行为。此外,多文件设置对基准设定具有历史挑战性,因为说明长篇投入的成本昂贵。在这项工作中,我们引入了MDBench,这是评价多文件推理任务LMMMS的新数据集。值得注意的是,MDBench是通过新的合成生成过程创建的,使我们能够有节制和高效率地生成具有挑战性的文件和相应的问答(QA)实例。我们的新技术以精密结构的种子知识运作,通过LMMMMA协助的编辑来修改它,从而引起MDR的推理学挑战。我们随后可以将这种结构化知识转换为自然文本表格式,生成一个文件集和相应的QA示例。我们甚至分析大众LMMMMS的行为和催化技术的快速分析,我们掌握了新的分析方法,我们又能够使MDEN技术成为新的分析方法。


Article 119

Title@2025-06-17 (2): UD-English-CHILDES: A Collected Resource of Gold and Silver Universal Dependencies Trees for Child Language Interactions

Title: UD-English-CHILDES: A Collected Resource of Gold and Silver Universal Dependencies Trees for Child Language Interactions UD-English-CHILDES: Eine gesammelte Ressource aus Gold und Silber Universelle Abhängigkeiten Bäume für kindersprachliche Interaktionen UD-English-CHILDES:儿童语言互动金树和银银树集成资源 2504.20304v3

Authors (5): Xiulin Yang, Zhuoxuan Ju, Lanni Bu, Zoey Liu, Nathan Schneider

CHILDES is a widely used resource of transcribed child and child-directed speech. This paper introduces UD-English-CHILDES, the first officially released Universal Dependencies (UD) treebank. It is derived from previously dependency-annotated CHILDES data, which we harmonize to follow unified annotation principles. The gold-standard trees encompass utterances sampled from 11 children and their caregivers, totaling over 48K sentences (236K tokens). We validate these gold-standard annotations under the UD v2 framework and provide an additional 1M~silver-standard sentences, offering a consistent resource for computational and linguistic research.

儿童知识是一种广泛使用的被转录的儿童和以儿童为主的演讲资源,本文件介绍了UD-English-CHILDES,这是第一个正式发行的普遍依赖(UD)树库,来源于以前附有依赖性说明的儿童知识数据,我们根据统一的说明原则统一了这些数据,金质标准树包括了从11名儿童及其照顾者中抽取的口述,总共超过48K个刑期(236K个象征性)。我们验证了UD v2框架下的这些黄金标准说明,并提供了另外1M-Silver标准句,为计算和语言研究提供了一致的资源。


Article 120

Title@2025-06-17 (2): Can LLMs Ask Good Questions?

Title: Can LLMs Ask Good Questions? Können LLMs gute Fragen stellen? LLMs能问好问题吗? 2501.03491v2

Authors (8): Yueheng Zhang, Xiaoyuan Liu, Yiyou Sun, Atheer Alharbi, Hend Alzahrani, Tianneng Shi, Basel Alomair, Dawn Song

We evaluate questions generated by large language models (LLMs) from context, comparing them to human-authored questions across six dimensions: question type, question length, context coverage, answerability, uncommonness, and required answer length. Our study spans two open-source and two proprietary state-of-the-art models. Results reveal that LLM-generated questions tend to demand longer descriptive answers and exhibit more evenly distributed context focus, in contrast to the positional bias often seen in QA tasks. These findings provide insights into the distinctive characteristics of LLM-generated questions and inform future work on question quality and downstream applications.

我们从背景角度评估大型语言模型(LLMs)产生的问题,将其与六个方面的人为问题进行比较:问题类型、问题长度、背景覆盖面、可回答性、非同寻常性和要求回答长度。我们的研究涉及两个开放源头和两个独有的先进模型。结果显示,LLM产生的问题往往要求更长的描述性答案,并表现出更均衡的分布式背景焦点,这与质量保证任务中经常看到的立场偏见不同。这些结论为LLM产生的问题的独特性提供了深刻的见解,并为今后关于问题质量和下游应用的工作提供了参考。


Article 121

Title@2025-06-17 (2): CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision

Title: CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision CrEst: Glaubwürdigkeitsschätzung für Kontexte in LLMs über schwache Überwachung CrEst: 微弱监督LLM女士背景的可靠估计 2506.14912v1

Authors (5): Dyah Adila, Shuai Zhang, Boran Han, Bonan Min, Yuyang Wang

The integration of contextual information has significantly enhanced the performance of large language models (LLMs) on knowledge-intensive tasks. However, existing methods often overlook a critical challenge: the credibility of context documents can vary widely, potentially leading to the propagation of unreliable information. In this paper, we introduce CrEst, a novel weakly supervised framework for assessing the credibility of context documents during LLM inference–without requiring manual annotations. Our approach is grounded in the insight that credible documents tend to exhibit higher semantic coherence with other credible documents, enabling automated credibility estimation through inter-document agreement. To incorporate credibility into LLM inference, we propose two integration strategies: a black-box approach for models without access to internal weights or activations, and a white-box method that directly modifies attention mechanisms. Extensive experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines, achieving up to a 26.86% improvement in accuracy and a 3.49% increase in F1 score. Further analysis shows that CrEst maintains robust performance even under high-noise conditions.

整合背景信息极大地提高了大型语言模型(LLMs)在知识密集型任务方面的绩效,然而,现有方法往往忽略了一个重大挑战:背景文件的可信度可能大相径庭,可能导致不可靠信息的传播。在本文件中,我们引入了CrEst,这是一个在LLM推论期间评估背景文件可信度而不需要人工说明的新颖的、监督薄弱的框架。我们的方法基于这样的洞察力,即可信的文件往往与其他可信的文件具有更高的语义一致性,从而能够通过文件间协议进行自动的可信度评估。为了将可信度纳入LLM推论,我们提出了两个整合战略:一种针对没有内部重量或激活的模型的黑箱法,以及一种直接改变关注机制的白箱法。在三个模型和五个数据集中进行的广泛实验表明,CrEst始终超越了强有力的基线,在准确性方面实现了26.86%的提高,在F1评分上增长了3.49%。进一步的分析表明,即使在高音条件下,CrEst仍然保持着稳健的业绩。


Article 122

Title@2025-06-17 (2): Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction

Title: Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction Kombination von eingeschränkter und ungezwungener Dekodierung durch Boosting: BoostCD und seine Anwendung auf Informationsextraktion 将受约束和不受限制的通过推动解锁结合起来:推动及其在信息提取方面的应用 2506.14901v1

Authors (2): Marija Šakota, Robert West

Many recent approaches to structured NLP tasks use an autoregressive language model $M$ to map unstructured input text $x$ to output text $y$ representing structured objects (such as tuples, lists, trees, code, etc.), where the desired output structure is enforced via constrained decoding. During training, these approaches do not require the model to be aware of the constraints, which are merely implicit in the training outputs $y$. This is advantageous as it allows for dynamic constraints without requiring retraining, but can lead to low-quality output during constrained decoding at test time. We overcome this problem with Boosted Constrained Decoding (BoostCD), which combines constrained and unconstrained decoding in two phases: Phase 1 decodes from the base model $M$ twice, in constrained and unconstrained mode, obtaining two weak predictions. In phase 2, a learned autoregressive boosted model combines the two weak predictions into one final prediction. The mistakes made by the base model with vs. without constraints tend to be complementary, which the boosted model learns to exploit for improved performance. We demonstrate the power of BoostCD by applying it to closed information extraction. Our model, BoostIE, outperforms prior approaches both in and out of distribution, addressing several common errors identified in those approaches.

最近许多结构化的NLP任务使用自动递减语言模式,用美元来绘制非结构化输入文本,用美元绘制输出文本,用美元绘制代表结构化对象(例如图普、列表、树、代码等),通过限制解码执行理想产出结构。在培训期间,这些方法并不要求模型了解限制,这些限制仅隐含在培训产出中,仅含在美元中。这有好处,因为它允许不要求再培训的动态限制,但在测试时间有限的解码过程中可能导致低质量产出。我们克服了“制动控制解码(BoostCD)”的问题,它把限制和未经限制解码的解码结合起来,分为两个阶段:第一阶段从基本模型解码中解码,用美元两次,在限制和不限制模式模式下,获得两个薄弱的预测。在第二阶段,一个学习自制增强型模型将两种薄弱的预测结合到最后的预测。基础模型与限制的错误往往互为补充,而增强型模型既学习了这些模型,又学会了在改进后再分配方法中应用。我们的一些常用的改进方法,我们展示了这些方法。


Article 123

Title@2025-06-17 (2): Adverse Event Extraction from Discharge Summaries: A New Dataset, Annotation Scheme, and Initial Findings

Title: Adverse Event Extraction from Discharge Summaries: A New Dataset, Annotation Scheme, and Initial Findings Negative Ereignisextraktion aus entladenen Zusammenfassungen: Ein neuer Datensatz, Annotationsschema und erste Ergebnisse 《从排放中提取的不利事件摘要:新数据集、注解办法和初步调查结果》 2506.14900v1

Authors (8): Imane Guellil, Salomé Andres, Atul Anand, Bruce Guthrie, Huayu Zhang, Abul Hasan, Honghan Wu, Beatrice Alex

In this work, we present a manually annotated corpus for Adverse Event (AE) extraction from discharge summaries of elderly patients, a population often underrepresented in clinical NLP resources. The dataset includes 14 clinically significant AEs-such as falls, delirium, and intracranial haemorrhage, along with contextual attributes like negation, diagnosis type, and in-hospital occurrence. Uniquely, the annotation schema supports both discontinuous and overlapping entities, addressing challenges rarely tackled in prior work. We evaluate multiple models using FlairNLP across three annotation granularities: fine-grained, coarse-grained, and coarse-grained with negation. While transformer-based models (e.g., BERT-cased) achieve strong performance on document-level coarse-grained extraction (F1 = 0.943), performance drops notably for fine-grained entity-level tasks (e.g., F1 = 0.675), particularly for rare events and complex attributes. These results demonstrate that despite high-level scores, significant challenges remain in detecting underrepresented AEs and capturing nuanced clinical language. Developed within a Trusted Research Environment (TRE), the dataset is available upon request via DataLoch and serves as a robust benchmark for evaluating AE extraction methods and supporting future cross-dataset generalisation.

在这项工作中,我们从老年病人(在临床国家营养计划资源中往往代表不足的人群)的排出摘要中提出一个人工附加说明的不利活动(AE)提炼。该数据集包括14种临床上重要的急性急性急性急性呼吸道出血,以及否定、诊断类型和医院内出血等背景属性。独特的是,批注计划既支持不连续又重叠的实体,也应对以往工作中很少处理的挑战。我们评估了使用FlairNLP的多种模型,这三种注解颗粒:细度、粗度、粗度和粗度,与否定性有关。尽管基于变压模型(如BERT-cased)在文件一级粗度提取(F1=0.943)上表现很强,但业绩下降显著的是微细度的实体一级任务(例如,F1=0.675),特别是稀有事件和复杂属性。这些结果表明,尽管有高比重的临床评估、粗度、粗度和粗度的偏重度,但基于变压模型的模型的模型(如A级的跨度数据,在研究基准中测测测测测测测测测,环境数据,仍是一项重要挑战。


Article 124

Title@2025-06-17 (2): Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Title: Chain-of-Thought Reasoning In The Wild Is Not Always Faithful In den Wilden zu denken, ist nicht immer treu 历经深思深虑的 荒野不总是忠心耿耿 2503.08679v4

Authors (6): Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy

Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful when models face an explicit bias in their prompts, i.e., the CoT can give an incorrect picture of how models arrive at conclusions. We go further and show that unfaithful CoT can also occur on realistic prompts with no artificial bias. We find that when separately presented with the questions “Is X bigger than Y?” and “Is Y bigger than X?”, models sometimes produce superficially coherent arguments to justify systematically answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We show preliminary evidence that this is due to models’ implicit biases towards Yes or No, thus labeling this unfaithfulness as Implicit Post-Hoc Rationalization. Our results reveal that several production models exhibit surprisingly high rates of post-hoc rationalization in our settings: GPT-4o-mini (13%) and Haiku 3.5 (7%). While frontier models are more faithful, especially thinking ones, none are entirely faithful: Gemini 2.5 Flash (2.17%), ChatGPT-4o (0.49%), DeepSeek R1 (0.37%), Gemini 2.5 Pro (0.14%), and Sonnet 3.7 with thinking (0.04%). We also investigate Unfaithful Illogical Shortcuts, where models use subtly illogical reasoning to try to make a speculative answer to hard maths problems seem rigorously proven. Our findings raise challenges for strategies for detecting undesired behavior in LLMs via the chain of thought.

然而,最近的研究显示,当模型在速度上面临明显的偏差时,COT的推理并不总是忠实的,也就是说,COT可以不准确地描述模型是如何得出结论的。我们更进一步地指出,不忠的COT也可以在现实的推理上出现,而没有人为偏差。我们发现,当分别提出“X比Y大还是Y大?”和“Y比X大?”的问题时,模型有时会产生表面上一致的论据,以证明系统回答两个问题是肯定的还是否定的,尽管这种答复在逻辑上是矛盾的。我们展示了初步证据表明,这是模型对是明显偏差的,从而将这种不忠的推理说成是不透明的后推理。 我们的结果表明,一些生产模型在我们的环境下表现出令人惊讶的高比率:GPT-4-mini(13 % ) 和Haiku 3.5 (7 % ) 。 边际模型似乎更忠实,特别是人们的回答是,没有完全忠实地回答两个问题,即Gemini 2.5 Blaim (2.17 %)、 IPGEGO-4(0.GeGeGeGPTy) 4O3.49 (Orviolviolviolalislislis)。


Article 125

Title@2025-06-17 (2): A Variational Framework for Improving Naturalness in Generative Spoken Language Models

Title: A Variational Framework for Improving Naturalness in Generative Spoken Language Models Ein abwechslungsreicher Rahmen zur Verbesserung der Natürlichkeit in generativen Sprachmodellen 改善发源口语模式中自然特性的变式框架 2506.14767v1

Authors (5): Li-Wei Chen, Takuya Higuchi, Zakaria Aldeneh, Ahmed Hussen Abdelaziz, Alexander Rudnicky

The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at https://github.com/b04901014/vae-gslm.

文本处理中大型语言模式的成功促使它们适应了语音模型。然而,由于语言是连续和复杂的,因此它往往被自动递减模型分解。来自自我监督模型(称为语义符号)的语音标志通常侧重于语言方面的语言方面,但忽视了预想信息。因此,这些符号培训的模型可以产生语言,其自然性会降低。现有的方法试图通过在语义符号中添加音调功能来解决这个问题。然而,单是音调不能充分代表语言属性的范围,而选择正确的特征需要小心的手动工程。为了克服这一点,我们建议一种端到端的变异方法,即自动学习编码这些连续的语音属性,以加强语义标志。我们的方法消除了人工提取和选择语义特征的需要。此外,根据人类电算器,它产生首选的语音延续。代码、样本和模型可在 https://github.com/b0490/1014/gslm上查阅。


Article 126

Title@2025-06-17 (2): ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM

Title: ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM ASCD: aufmerksamkeitsbeständige Kontrastdekodierung zur Reduktion der Halluzination in MLLM ASCD: 减少低LLLM中致幻作用的可引起注意的违反规则标记 2506.14766v1

Authors (4): Yujun Wang, Jinhe Bi, Yunpu Ma, Soeren Pirk

Multimodal Large Language Model (MLLM) often suffer from hallucinations. They over-rely on partial cues and generate incorrect responses. Recently, methods like Visual Contrastive Decoding (VCD) and Instruction Contrastive Decoding (ICD) have been proposed to mitigate hallucinations by contrasting predictions from perturbed or negatively prefixed inputs against original outputs. In this work, we uncover that methods like VCD and ICD fundamentally influence internal attention dynamics of the model. This observation suggests that their effectiveness may not stem merely from surface-level modifications to logits but from deeper shifts in attention distribution. Inspired by this insight, we propose an attention-steerable contrastive decoding framework that directly intervenes in attention mechanisms of the model to offer a more principled approach to mitigating hallucinations. Our experiments across multiple MLLM architectures and diverse decoding methods demonstrate that our approach significantly reduces hallucinations and improves the performance on benchmarks such as POPE, CHAIR, and MMHal-Bench, while simultaneously enhancing performance on standard VQA benchmarks.

多式大语言模型(MLLM)往往受到幻觉的影响,它们过于依赖局部提示,产生不正确的反应。最近,提出了视觉相矛盾解码(VCD)和指令相矛盾解码(ICD)等方法来减轻幻觉,用原始产出来对比受扰动或有负预设的投入的预测。在这项工作中,我们发现VCD和ICD等方法从根本上影响了模型的内部注意力动态。这一观察表明,它们的效力可能不仅仅来自对登入的表面水平的修改,而来自注意力分布的更深层次的转移。根据这一观察,我们提出了一个可引起注意的对比解码框架,直接干预模型的注意机制,以提供一个更有原则性的方法来减轻幻觉。我们在多种MLLM结构和多种解码方法的实验表明,我们的方法大大降低了幻觉,提高了POPE、CHAIR和MHal-Bench等基准的绩效,同时提高了标准VQA基准的绩效。


Article 127

Title@2025-06-17 (2): From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Title: From Bytes to Ideas: Language Modeling with Autoregressive U-Nets Von Bytes zu Ideen: Sprachmodellierung mit autoregressiven U-Netzen 从字节到理念:用自动递减 U-Nets 进行语言建模 2506.14761v1

Authors (6): Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future – anticipating the next few words rather than the next byte – so deeper stages focus on broader semantic patterns while earlier stages handle fine details. When carefully tuning and controlling pretraining compute, shallow hierarchies tie strong BPE baselines, and deeper hierarchies have a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.

调制对输入文本施加固定的颗粒, 冻结语言模型如何在数据上运行, 以及未来预测的时间范围。 字对式编码( BBE) 和类似方案将文字分开一次, 建立一个静态词汇, 并让模型被困在这种选择中。 我们通过引入自动递减 U- Net, 学会在输入文本时嵌入自己的象征物, 放松这种僵硬性。 网络读取原始的字节, 把它们组合成单词, 然后对单词进行配对, 然后最多4个单词, 给它一个多尺度的序列视图。 在更深的阶段, 模型必须进一步预测未来 - 预测接下来几个字, 而不是下一个字节 - 因此更深的阶段将关注更广泛的语义模式, 而早期则处理细细的细节。 当认真调整和控制训练前的计算时, 浅的等级将强大的 BPE 基线绑紧, 以及更深层次的等级具有一个充满希望的趋势 。 因为标志性现在存在于模型中, 同样的系统可以处理特性层面的任务, 并携带低资源语言的知识 。


Article 128

Title@2025-06-17 (2): Reasoning with Exploration: An Entropy Perspective

Title: Reasoning with Exploration: An Entropy Perspective Vernunft mit Exploration: Eine Entropie-Perspektive 探索理由:宇宙展望 2506.14758v1

Authors (7): Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, Furu Wei

Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy – a signal of exploration in RL – and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover strong positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric – an upper-bound estimator of LM reasoning capabilities – even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.

平衡勘探和开发是强化学习(RL)的一个中心目标。尽管最近在加强语言模式(LM)推理方面有所进展,但大多数方法都倾向于开发,而且越来越遇到业绩高地。在这项工作中,我们重新审视了英特罗比 – – 一种RL勘探信号 – – 并审查了它与LMS探索推理的关系。通过经验分析,我们发现高热带区域和三种探索推理行动之间存在强烈的正相关关系:(1) 确定或连接逻辑步骤的关键标志,(2) 自我验证和校正等反射行动,(3) 基础LMs 探索不足的罕见行为。 受此驱动,我们对标准RLL只做了最低限度的修改,只有一行代码:以基于英特罗比基术语增加优势功能。与鼓励通过促进不确定性进行探索的传统最高渗透方法不同,我们鼓励通过促进长期和更深层次的推理链进行探索。值得注意的是,我们的方法在Pass@Knibal – – 一个高限LM推理能力的估测算能力 – – 即使用极大K值来评价,推进LM推理的界限。


Article 129

Title@2025-06-17 (2): Controllable and Reliable Knowledge-Intensive Task-Oriented Conversational Agents with Declarative Genie Worksheets

Title: Controllable and Reliable Knowledge-Intensive Task-Oriented Conversational Agents with Declarative Genie Worksheets Kontrollierbare und zuverlässige wissensintensive, zielorientierte Conversational Agents mit deklarativen Genie-Arbeitsblättern 具有公开基因工作表的可控制和可靠、知识密集、以任务为导向、以任务为导向的具有可控和可靠知识密集的谈话剂 2407.05674v3

Authors (5): Harshit Joshi, Shicheng Liu, James Chen, Robert Weigle, Monica S. Lam

Large Language Models can carry out human-like conversations in diverse settings, responding to user requests for tasks and knowledge. However, existing conversational agents implemented with LLMs often struggle with hallucination, following instructions with conditional logic, and integrating knowledge from different sources. These shortcomings compromise the agents’ effectiveness, rendering them unsuitable for deployment. To address these challenges, we introduce Genie, a programmable framework for creating knowledge-intensive task-oriented conversational agents. Genie can handle involved interactions and answer complex queries. Unlike LLMs, it delivers reliable, grounded responses through advanced dialogue state management and supports controllable agent policies via its declarative specification – Genie Worksheet. This is achieved through an algorithmic runtime system that implements the developer-supplied policy, limiting LLMs to (1) parse user input using a succinct conversational history, and (2) generate responses according to supplied context. Agents built with Genie outperform SOTA methods on complex logic dialogue datasets. We conducted a user study with 62 participants on three real-life applications: restaurant reservations with Yelp, as well as ticket submission and course enrollment for university students. Genie agents with GPT-4 Turbo outperformed the GPT-4 Turbo agents with function calling, improving goal completion rates from 21.8% to 82.8% across three real-world tasks.

大型语言模型可以在不同环境中进行人性化对话,满足用户对任务和知识的要求。然而,与LLMS共同实施的现有对话代理商经常与幻觉斗争,遵循有条件逻辑指令,整合来自不同来源的知识。这些缺陷损害了代理商的效力,使其不适于部署。为了应对这些挑战,我们引入了Genie,这是创建知识密集型任务导向的谈话代理商的可编程框架。Genie可以处理涉及互动和复杂的问题。与LLMS不同,它通过高级对话国家管理提供可靠、有根据的响应,并通过其宣言性规范 – – Genie Workingshe单 – – 支持可控代理政策。这是通过一个算法运行时间系统实现的,该系统执行开发商提供的政策,将LLMs限制在(1) 使用简洁的谈话历史的用户投入中,使其不适于部署。为了应对这些挑战,我们引入了Genie在复杂的逻辑对话数据集上超越SOTA方法。我们与62名用户进行了一项用户研究:与Yelperp的预订,以及大学学生的机票提交和课程注册课程。G-Bentieal-tel-rodustrup 288的Gtel-level8代理商改进了GPorbormaxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx。


Article 130

Title@2025-06-17 (2): SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints

Title: SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints SOPBench: Sprachagenten bei folgenden Standardbetriebsverfahren und Einschränkungen bewerten SOPBench:评价遵守标准作业程序和制约因素的语文代理 2503.08669v2

Authors (11): Zekun Li, Shinda Huang, Jiangtian Wang, Nathan Zhang, Antonis Antoniades, Wenyue Hua, Kaijie Zhu, Sirui Zeng, Chi Wang, William Yang Wang, Xifeng Yan

As language agents increasingly automate critical tasks, their ability to follow domain-specific standard operating procedures (SOPs), policies, and constraints when taking actions and making tool calls becomes essential yet remains underexplored. To address this gap, we develop an automated evaluation pipeline SOPBench with: (1) executable environments containing 167 tools/functions across seven customer service domains with service-specific SOPs and rule-based verifiers, (2) an automated test generation framework producing over 900 verified test cases, and (3) an automated evaluation framework to rigorously assess agent adherence from multiple dimensions. Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. The original code serves as oracle rule-based verifiers to assess compliance, reducing reliance on manual annotations and LLM-based evaluations. We evaluate 18 leading models, and results show the task is challenging even for top-tier models (like GPT-4o, Claude-3.7-Sonnet), with variances across domains. Reasoning models like o4-mini-high show superiority while other powerful models perform less effectively (pass rates of 30%-50%), and small models (7B, 8B) perform significantly worse. Additionally, language agents can be easily jailbroken to overlook SOPs and constraints. Code, data, and over 24k agent trajectories are released at https://github.com/Leezekun/SOPBench.

由于语言代理人越来越多地将关键任务自动化,在采取行动和发出工具呼吁时,他们遵守特定领域的标准作业程序、政策和制约因素的能力变得至关重要,但仍未得到充分探讨。为了弥补这一差距,我们开发了一个自动评价管道SOPBench, 其内容包括:(1) 包含167个工具/功能的可执行环境,涵盖7个客户服务领域的167个工具/功能,拥有服务特定SOPs和基于规则的核查员;(2) 自动测试生成框架,产生900多个经核实的测试案例;(3) 一个自动评估框架,从多个方面严格评估代理人的遵守情况。我们的方法将每个特定服务的SOP代码程序转换成一个可执行功能的定向图表,并要求代理人根据自然语言SOP描述调用这些功能。原始代码作为基于规则的核查员,以评估遵守情况,减少对手册说明和基于规则的LM评价的依赖。我们评价了18个主要模型和结果显示,任务甚至对顶级模型(如GPT-4o、Claude-3.7-Sonnet,Sont)具有挑战性,不同领域的差异。将O4-min-high-high-high-high-brode rode practority ority 功能转换为优越性功能,而其他强大的模型则能为8-OP-rentrentrencerencerentrums


Article 131

Title@2025-06-17 (2): Optimizing Length Compression in Large Reasoning Models

Title: Optimizing Length Compression in Large Reasoning Models Optimierung der Längenkompression in großen vernünftigen Modellen 在大理由模型中优化长度压缩 2506.14755v1

Authors (4): Zhengxiang Cheng, Dongping Chen, Mingyang Fu, Tianyi Zhou

Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as “invalid thinking” – models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.

大型理性模型(LRMs)取得了显著的成功,但是它们往往在产生不必要和含糊的推理链方面遭遇困难。我们把这一问题的核心方面确定为“无效思维” – – 模型在得出正确答案后往往反复重复检查其工作。为了解决这一具体的低效率问题,我们超越了效率和效率的一般原则,提出了两项新的、细微的细微原则:提倡消除冗余的宽度和确保关键推理步骤的充足性。在这些原则的指导下,我们引入了LC-R1,这是基于集体政策优化的后培训方法。LC-R1采用了一种新颖的组合,即总体简洁的长度调整和专门用来消除思维过程的无效部分的压缩。关于多种推理基准的广泛实验表明,LC-R1的序列长度大幅缩短(~50 % ) ,只有边际(~2% ) 降低准确性,在Pareto前沿实现一个有利的贸易点,从而将高清晰度的LRMR1纳入我们的高清晰度分析。我们对LRRRR1的编码进行了进一步的验证。


Article 132

Title@2025-06-17 (2): Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework

Title: Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework Auf dem Weg zu einer besseren Open-Ended Textgenerierung: Ein Multikriterien-Evaluierungsrahmen 实现更好的不限 限 限 限 质 文本的生成:多标准评价框架 2410.18653v3

Authors (6): Esteban Garces Arias, Hannah Blocher, Julian Rodemann, Meimingwei Li, Christian Heumann, Matthias Aßenmacher

Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful (large) language models. However, evaluating the quality of these models and the employed decoding strategies remains challenging due to trade-offs among widely used metrics such as coherence, diversity, and perplexity. This paper addresses the specific problem of multicriteria evaluation for open-ended text generation, proposing novel methods for both relative and absolute rankings of decoding methods. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Our experiments demonstrate that the proposed approaches offer a robust way to compare decoding strategies and serve as valuable tools to guide model selection for open-ended text generation tasks. We suggest future directions for improving evaluation methodologies in text generation and make our code, datasets, and models publicly available.

由于强大的(大)语言模式的兴起,不限名额的文本生成已成为自然语言处理中一项突出的任务,然而,由于在广泛使用的指标(如一致性、多样性和易懂性)之间的权衡取舍,评价这些模型和所采用的解码战略的质量仍然具有挑战性。本文件讨论了对不限名额的文本生成进行多标准评价的具体问题,为解码方法的相对和绝对排序提出了新的方法。具体地说,我们采用了基于部分顺序的基准方法,并提出了新的综合衡量标准,以平衡现有自动指标,对文本生成质量进行更全面的评价。我们的实验表明,拟议方法提供了一种强有力的方法,可以比较解码战略,并作为指导为不限名额的文本生成任务选择模式的宝贵工具。我们提出了改进文本生成评价方法的未来方向,并公布了我们的代码、数据集和模型。


Article 133

Title@2025-06-17 (2): Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language Corpora

Title: Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language Corpora Nutzung großer Sprachmodelle zur Messung der Geschlechterrepräsentanz Bias in Gendered Language Corpora 利用大语言模式衡量性别语言单位的性别代表比比 2406.13677v3

Authors (5): Erik Derner, Sara Sansalvador de la Fuente, Yoan Gutiérrez, Paloma Moreda, Nuria Oliver

Large language models (LLMs) often inherit and amplify social biases embedded in their training data. A prominent social bias is gender bias. In this regard, prior work has mainly focused on gender stereotyping bias - the association of specific roles or traits with a particular gender - in English and on evaluating gender bias in model embeddings or generated outputs. In contrast, gender representation bias - the unequal frequency of references to individuals of different genders - in the training corpora has received less attention. Yet such imbalances in the training data constitute an upstream source of bias that can propagate and intensify throughout the entire model lifecycle. To fill this gap, we propose a novel LLM-based method to detect and quantify gender representation bias in LLM training data in gendered languages, where grammatical gender challenges the applicability of methods developed for English. By leveraging the LLMs’ contextual understanding, our approach automatically identifies and classifies person-referencing words in gendered language corpora. Applied to four Spanish-English benchmarks and five Valencian corpora, our method reveals substantial male-dominant imbalances. We show that such biases in training data affect model outputs, but can surprisingly be mitigated leveraging small-scale training on datasets that are biased towards the opposite gender. Our findings highlight the need for corpus-level gender bias analysis in multilingual NLP. We make our code and data publicly available.

大量语言模型(LLMS)往往继承和扩展其培训数据中所包含的社会偏见。明显的社会偏见是性别偏见。在这方面,先前的工作主要侧重于英语中的性别定型偏见,即特定角色或特征与特定性别的联系,以及评价模型嵌入或产出产出中的性别偏见。相比之下,性别代表偏见——在培训团中,提及不同性别的个人的频率不平等——受到了较少的关注。然而,培训数据中的这种不平衡是偏见的上游来源,在整个模型生命周期中都可以传播和加剧。为填补这一空白,我们建议一种基于LLM的新方法,用性别语言来检测和量化LLM培训数据中的性别代表性偏见,在其中,语法性性别对为英语开发的方法的适用性提出了挑战。通过利用LLMs的背景理解,我们的方法自动识别和分类了性别语言中与不同性别有关的人的词汇。我们的方法适用于四个西班牙语基准和五个巴伦西亚人种,显示了严重的男性占多数的不平衡。我们指出,在培训数据中这种偏见会影响模型产出,但令人惊讶的是,我们可以减少在公开分析中的偏见。


Article 134

Title@2025-06-17 (2): Assessing the Reasoning Capabilities of LLMs in the context of Evidence-based Claim Verification

Title: Assessing the Reasoning Capabilities of LLMs in the context of Evidence-based Claim Verification Bewertung der mit Gründen versehenen Fähigkeiten von LLM im Zusammenhang mit der beweisgestützten Prüfung von Anträgen 结合基于证据的索赔核实评估LLM 合理性的能力 2402.10735v4

Authors (6): John Dougrez-Lewis, Mahmud Elahi Akhter, Federico Ruggeri, Sebastian Löbbers, Yulan He, Maria Liakata

Although LLMs have shown great performance on Mathematics and Coding related reasoning tasks, the reasoning capabilities of LLMs regarding other forms of reasoning are still an open problem. Here, we examine the issue of reasoning from the perspective of claim verification. We propose a framework designed to break down any claim paired with evidence into atomic reasoning types that are necessary for verification. We use this framework to create RECV, the first claim verification benchmark, incorporating real-world claims, to assess the deductive and abductive reasoning capabilities of LLMs. The benchmark comprises of three datasets, covering reasoning problems of increasing complexity. We evaluate three state-of-the-art proprietary LLMs under multiple prompt settings. Our results show that while LLMs can address deductive reasoning problems, they consistently fail in cases of abductive reasoning. Moreover, we observe that enhancing LLMs with rationale generation is not always beneficial. Nonetheless, we find that generated rationales are semantically similar to those provided by humans, especially in deductive reasoning cases.

虽然LLMS在数学和编码相关推理任务方面表现出色,但LLMS关于其他形式推理的推理能力仍然是一个尚未解决的问题。在这里,我们从索赔核实的角度来研究推理问题。我们提议了一个框架,旨在将任何附有证据的主张分为核查所必需的原子推理类型。我们利用这个框架来创建RECV,这是第一个索赔核实基准,纳入了现实世界索赔,以评估LLMS的推理和诱拐推理能力。基准由三个数据集组成,涵盖日益复杂的推理问题。我们评估了三种最先进的专有的理理理理理理理理理理理,在多个迅速的环境下。我们的结果显示,LLMS可以解决推理问题,但在引理推理方面却总是失败。此外,我们发现,加强LMS并产生理由并不总有好处。然而,我们发现,产生的理由与人类提供的理由,特别是在推理推理案件中提供的理相似。


Article 135

Title@2025-06-17 (2): Reparameterized LLM Training via Orthogonal Equivalence Transformation

Title: Reparameterized LLM Training via Orthogonal Equivalence Transformation Reparameterisiertes LLM-Training über Orthogonale Äquivalenztransformation 通过正正对等转化进行修复性磁力LLM培训 2506.08001v3

Authors (6): Zeju Qiu, Simon Buchholz, Tim Z. Xiao, Maximilian Dax, Bernhard Schölkopf, Weiyang Liu

While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field’s most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because of its provable preservation of spectral properties of weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks. Extensive experiments validate the effectiveness and scalability of POET in training LLMs.

虽然大型语言模型(LLMS)正在推动人工智能的快速进步,但有效可靠地培训这些大型模型仍然是实地面临的最重大挑战之一。为了应对这一挑战,我们提议POET,这是一个新型的再量化培训算法,使用正方等离子转换法优化神经元。具体地说,POET用两个可学习的正方位矩阵和一个固定随机重量矩阵对每个神经元进行重新计。由于它能够明显地保存重力矩阵的光谱特性,POET可以通过改进一般化来稳步优化客观功能。我们进一步开发高效的近似值,使POET具有灵活性,并且可以对大规模神经网络进行培训。广泛的实验验证了POET在培训LMs方面的有效性和可扩展性。


Article 136

Title@2025-06-17 (2): Capacity Matters: a Proof-of-Concept for Transformer Memorization on Real-World Data

Title: Capacity Matters: a Proof-of-Concept for Transformer Memorization on Real-World Data Capacity Matters: Ein Proof-of-Concept für Transformer-Memorisierung auf Real-World-Daten 能力事项:关于现实世界数据变换者记忆的验证概念 2506.14704v1

Authors (2): Anton Changalidis, Aki Härmä

This paper studies how the model architecture and data configurations influence the empirical memorization capacity of generative transformers. The models are trained using synthetic text datasets derived from the Systematized Nomenclature of Medicine (SNOMED) knowledge graph: triplets, representing static connections, and sequences, simulating complex relation patterns. The results show that embedding size is the primary determinant of learning speed and capacity, while additional layers provide limited benefits and may hinder performance on simpler datasets. Activation functions play a crucial role, and Softmax demonstrates greater stability and capacity. Furthermore, increasing the complexity of the data set seems to improve the final memorization. These insights improve our understanding of transformer memory mechanisms and provide a framework for optimizing model design with structured real-world data.

本文研究模型结构和数据配置如何影响基因变异器的经验记忆能力。模型使用由医学系统化名词知识图(SNOMED)产生的合成文本数据集进行培训:代表静态连接和序列的三胞胎,模拟复杂的关系模式。结果显示嵌入规模是学习速度和能力的主要决定因素,而额外层则带来有限的好处,并可能妨碍更简单的数据集的性能。激活功能发挥着关键的作用,软体显示显示出更大的稳定性和能力。此外,增加数据集的复杂性似乎改善了最后的记忆化。这些洞察力提高了我们对变异器记忆机制的理解,并为利用结构化真实世界数据优化模型设计提供了框架。


Article 137

Title@2025-06-17 (2): Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers

Title: Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers Treasure Hunt: Echtzeit-Targeting des Long Tails mit Trainings-Time Markern 宝藏狩猎:使用培训-时间标记实时定位长尾鱼 2506.14702v1

Authors (5): Daniel D’souza, Julia Kreutzer, Adrien Morisot, Ahmet Üstün, Sara Hooker

One of the most profound challenges of modern machine learning is performing well on the long-tail of rare and underrepresented features. Large general-purpose models are trained for many tasks, but work best on high-frequency use cases. After training, it is hard to adapt a model to perform well on specific use cases underrepresented in the training corpus. Relying on prompt engineering or few-shot examples to maximize the output quality on a particular test case can be frustrating, as models can be highly sensitive to small changes, react in unpredicted ways or rely on a fixed system prompt for maintaining performance. In this work, we ask: “Can we optimize our training protocols to both improve controllability and performance on underrepresented use cases at inference time?” We revisit the divide between training and inference techniques to improve long-tail performance while providing users with a set of control levers the model is trained to be responsive to. We create a detailed taxonomy of data characteristics and task provenance to explicitly control generation attributes and implicitly condition generations at inference time. We fine-tune a base model to infer these markers automatically, which makes them optional at inference time. This principled and flexible approach yields pronounced improvements in performance, especially on examples from the long tail of the training distribution. While we observe an average lift of 5.7% win rates in open-ended generation quality with our markers, we see over 9.1% gains in underrepresented domains. We also observe relative lifts of up to 14.1% on underrepresented tasks like CodeRepair and absolute improvements of 35.3% on length instruction following evaluations.

  1. 在这项工作中,我们要求“我们能否优化我们的培训协议,既改进代表性低使用案例的可控性和性能,又改进高频使用案例的可操作性?”我们在培训之后,很难调整一个模式,以便很好地处理培训中代表性不足的具体使用案例。我们很难重新审视培训和推断技术之间的差别,同时向用户提供该模式所训练的一套控制杠杆。我们建立详细的数据特性分类和任务证明,以明确控制生成属性和推断时的隐含性能。我们用一个基准模型来自动推算这些指标,以便提高代表性过低使用案例的可控性和性能。

Article 138

Title@2025-06-17 (2): Bridging Social Media and Search Engines: Dredge Words and the Detection of Unreliable Domains

Title: Bridging Social Media and Search Engines: Dredge Words and the Detection of Unreliable Domains Überbrücken von Social Media und Suchmaschinen: Dredge Words und die Erkennung von unzuverlässigen Domains 连接社会媒体和搜索引擎:隐隐词和探测不可靠的域域 2406.11423v4

Authors (3): Evan M. Williams, Peter Carragher, Kathleen M. Carley

Proactive content moderation requires platforms to rapidly and continuously evaluate the credibility of websites. Leveraging the direct and indirect paths users follow to unreliable websites, we develop a website credibility classification and discovery system that integrates both webgraph and large-scale social media contexts. We additionally introduce the concept of dredge words, terms or phrases for which unreliable domains rank highly on search engines, and provide the first exploration of their usage on social media. Our graph neural networks that combine webgraph and social media contexts generate to state-of-the-art results in website credibility classification and significantly improves the top-k identification of unreliable domains. Additionally, we release a novel dataset of dredge words, highlighting their strong connections to both social media and online commerce platforms.

将直接和间接路径的用户利用到不可靠的网站,我们开发了一个网站可信度分类和发现系统,将网络制图和大规模社交媒体背景结合起来;我们还引入了不可靠的域在搜索引擎中高度排位的隐蔽词、术语或词组的概念,并首次探索其在社交媒体中的使用情况。我们把网络制图和社会媒体背景结合起来的图形神经网络在网站可信度分类方面产生了最先进的结果,大大改进了对不可靠的域的顶尖识别。此外,我们发布了一套新颖的隐蔽词组,突出这些词组与社交媒体和在线商业平台的密切联系。


Article 139

Title@2025-06-17 (2): The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs

Title: The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs Der alternative Annotator-Test für LLM-as-a-Richter: Wie man die Ersetzung menschlicher Annotatoren durch LLMs statistisch rechtfertigt LLM-A法官的替代性说明人测试:如何在统计上合理用LMS取代人类说明人 2501.10970v3

Authors (3): Nitay Calderon, Roi Reichart, Rotem Dror

The “LLM-as-an-annotator” and “LLM-as-a-judge” paradigms employ Large Language Models (LLMs) as annotators, judges, and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure, the Alternative Annotator Test (alt-test), that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM annotators and judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that LLMs can sometimes replace humans with closed-source LLMs (such as GPT-4o), outperforming the open-source LLMs we examine, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.

“LLM-as-an-an-anatotor”和“LLM-as-an-an-an-a-a-a-a-a-a-a-a-a”模式采用大语言模型(LLMs),在传统上由人执行的任务中,大语言模型(LLMs)作为说明者、法官和评审员。LLM说明不仅在NLP研究中广泛使用,而且在医学、心理学和社会科学等领域也广泛使用。尽管在形成研究结果和见解方面发挥着作用,但是没有标准或严格的程序来确定LLMs是否可以取代人类说明者。在本文件中,我们建议采用新的统计程序,即替代性说明试验(Alt-ater-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-d),这只需要少量附加说明的例子来证明使用LLMs说明。此外,我们采用灵活和可解释的措施来比较LMsattors和法官。为了展示我们的程序,我们整理了一套由语言和视觉任务组成的十套数据集的实验,对六LMs和四种促效技术进行了试验。我们的实验。我们的实验显示LMs有时可以用以代替人代替封闭式LMs(例如GPT-4o),我们研究,我们鼓励了较严格和迅速的结果。


Article 140

Title@2025-06-17 (2): Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models

Title: Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models Sprache und Planung in der Roboternavigation: Mehrsprachige Bewertung modernster Modelle 机器人导航的语言和规划:对最新艺术模式的多语种评价 2501.05478v2

Authors (6): Malak Mansour, Ahmed Aly, Bahey Tharwat, Sarim Hashmi, Dong An, Ian Reid

Large Language Models (LLMs) such as GPT-4, trained on huge amount of datasets spanning multiple domains, exhibit significant reasoning, understanding, and planning capabilities across various tasks. This study presents the first-ever work in Arabic language integration within the Vision-and-Language Navigation (VLN) domain in robotics, an area that has been notably underexplored in existing research. We perform a comprehensive evaluation of state-of-the-art multi-lingual Small Language Models (SLMs), including GPT-4o mini, Llama 3 8B, and Phi-3 medium 14B, alongside the Arabic-centric LLM, Jais. Our approach utilizes the NavGPT framework, a pure LLM-based instruction-following navigation agent, to assess the impact of language on navigation reasoning through zero-shot sequential action prediction using the R2R dataset. Through comprehensive experiments, we demonstrate that our framework is capable of high-level planning for navigation tasks when provided with instructions in both English and Arabic. However, certain models struggled with reasoning and planning in the Arabic language due to inherent limitations in their capabilities, sub-optimal performance, and parsing issues. These findings highlight the importance of enhancing planning and reasoning capabilities in language models for effective navigation, emphasizing this as a key area for further development while also unlocking the potential of Arabic-language models for impactful real-world applications.

大型语言模型(LLM),如GPT-4(GPT-4)(LLM),在涉及多个领域的大量数据集方面受过培训,在各种任务中展示了重要的推理、理解和规划能力。本研究报告介绍了在机器人领域(在现有研究中明显未得到充分探讨的一个领域)在愿景和语言导航(VLN)域内首次在阿拉伯语整合方面开展的工作。我们通过综合实验,对最先进的多语言小型语言模型(SLM)进行了全面评估,包括GPT-4o mini、Llama 3 3 8B和Phi-3 中型14B,以及以阿拉伯语为中心的LLMM Jais(Jais)。我们的方法是利用NavGPT框架(一个纯以LLM为主的指导导航代理)框架评估语言对导航推理的影响。我们通过综合实验,证明我们的框架在提供英语和阿拉伯语教学指导时能够对导航任务进行高层次规划。但是,某些模型在阿拉伯语的推理和规划方面困难重重,因为其真实能力、亚语言模型的内在局限性,同时也强调这一方向研究领域的发展潜力。


Article 141

Title@2025-06-17 (2): Agent Laboratory: Using LLM Agents as Research Assistants

Title: Agent Laboratory: Using LLM Agents as Research Assistants Agent Laboratory: LLM-Agenten als wissenschaftliche Assistenten 实验室:利用LLLM代理作为研究助理 2501.04227v2

Authors (10): Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, Emad Barsoum

Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages–literature review, experimentation, and report writing to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluate the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.

从历史上看,科学发现是一个漫长而昂贵的过程,需要大量的时间和资源,从最初的构想到最终的结果。为了加快科学发现,降低研究成本,提高研究质量,我们引入实验室,这是一个基于LLM的自主框架,能够完成整个研究过程。这个框架接受由人类提供的研究想法和进展,通过三个阶段的文学审查、实验和报告写作,产生全面研究产出,包括一个代码储存库和一份研究报告,同时使用户能够在每个阶段提供反馈和指导。我们部署具有各种最新水平的LLMS的实验室,并邀请多个研究人员通过参加调查来评估其质量,提供人类反馈以指导研究过程,然后评价最后文件。我们发现:(1) 由O1-preview驱动的实验室能够产生最佳的研究成果;(2) 生成的机器学习代码能够实现与现有方法相比的最新业绩;(3) 人类参与,在每个阶段提供反馈,大大改进研究的总体质量;(4) 代理实验室显著降低研究费用,比以前的自主研究方法减少84 % ,我们希望实验室能够进一步分配创造性的探索努力,而不是加速低水平的科学研究。


Article 142

Title@2025-06-17 (2): Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality

Title: Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality Massive überwachte Feinsteuerungsexperimente zeigen, wie Daten, Ebenen und Trainingsfaktoren LLM-Ausrichtungsqualität gestalten 大规模监督的微调实验 数据、图层和培训因素 成型LLLM 目标质量 2506.14681v1

Authors (6): Yuto Harada, Yusuke Yamauchi, Yusuke Oda, Yohei Oseki, Yusuke Miyao, Yu Takagi

Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training-task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness–often surpassing superficial similarity between trained data and benchmark–and that mid-layer weight changes correlate most strongly with performance gains. We will release these 1,000+ SFT models and benchmark results to accelerate further research.

监督的微调(SFT)是使大型语言模型(LLMS)与人类指示和价值相一致的关键步骤,然而,SFT的许多方面仍然没有得到很好的理解。我们就包括代码生成、数学推理和一般领域任务在内的各种数据集培训了广泛的基础模型,从而在受控制的条件下产生了1,000+SFT模型。我们随后确定了最重要的数据集属性,并审查了SFT提出的多层次修改。我们的调查结果显示,在所有模型中,一些培训-任务协同作用持续存在,而另一些则差异很大,强调了具体模型战略的重要性。此外,我们表明,不统一性始终预测SFT的有效性——往往超过经过培训的数据和基准之间的表面相似性,而中层重量变化与绩效收益的关系最为密切。我们将公布这些1,000+SFT模型和基准结果,以加速进一步的研究。


Article 143

Title@2025-06-17 (2): FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback

Title: FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback FigCaps-HF: Ein figure-to-caption Generatives Framework und Benchmark mit menschlichem Feedback FigCaps-HF:数字对数字生成框架和人文反馈基准 2307.10867v2

Authors (13): Ashish Singh, Ashutosh Singh, Prateek Agarwal, Zixuan Huang, Arpita Singh, Tong Yu, Sungchul Kim, Victor Bursztyn, Nesreen K. Ahmed, Puneet Mathur, Erik Learned-Miller, Franck Dernoncourt, Ryan A. Rossi

Captions are crucial for understanding scientific visualizations and documents. Existing captioning methods for scientific figures rely on figure-caption pairs extracted from documents for training, many of which fall short with respect to metrics like helpfulness, explainability, and visual-descriptiveness [15] leading to generated captions being misaligned with reader preferences. To enable the generation of high-quality figure captions, we introduce FigCaps-HF a new framework for figure-caption generation that can incorporate domain expert feedback in generating captions optimized for reader preferences. Our framework comprises of 1) an automatic method for evaluating quality of figure-caption pairs, 2) a novel reinforcement learning with human feedback (RLHF) method to optimize a generative figure-to-caption model for reader preferences. We demonstrate the effectiveness of our simple learning framework by improving performance over standard fine-tuning across different types of models. In particular, when using BLIP as the base model, our RLHF framework achieves a mean gain of 35.7%, 16.9%, and 9% in ROUGE, BLEU, and Meteor, respectively. Finally, we release a large-scale benchmark dataset with human feedback on figure-caption pairs to enable further evaluation and development of RLHF techniques for this problem.

现有的科学数字说明方法依赖于从培训文件中提取的图形显示配对,其中许多在帮助性、解释性和视觉描述性[15]等衡量标准方面落后于帮助性、可解释性和视觉描述性[15]等,导致生成的字幕与读者偏好不相符。为了能够生成高质量的数字说明,我们引入了一个用于生成高质量数字说明的新图表生成框架,其中可以包括域专家反馈,为读者偏好制作最佳字幕。我们的框架包括:(1) 一种自动评估图形显示配对质量的方法,(2) 一种与人类反馈(RHF)方法相配合的新强化学习,以优化读者偏好的基因化图形到描述模式。我们通过提高不同类型模型的标准微调的性能,展示了我们简单学习框架的有效性。特别是,在使用BLIP作为基础模型时,我们的RLHF框架在产生35.7%、16.9%和9%的平均收益,在ROUGE、BLEU和Metor中,我们用人类反馈(RHF)方法来优化读者偏好。最后,我们用一个大规模的数据反馈基准,我们用这个模型来推动大规模评估。


Article 144

Title@2025-06-17 (2): A Hybrid Multi-Agent Prompting Approach for Simplifying Complex Sentences

Title: A Hybrid Multi-Agent Prompting Approach for Simplifying Complex Sentences Ein Hybrid-Multi-Agent-Prompting-Ansatz zur Vereinfachung komplexer Sätze 简化复杂判刑的混合混合多重代理推动办法 2506.11681v2

Authors (2): Pratibha Zunjare, Michael Hsiao

This paper addresses the challenge of transforming complex sentences into sequences of logical, simplified sentences while preserving semantic and logical integrity with the help of Large Language Models. We propose a hybrid approach that combines advanced prompting with multi-agent architectures to enhance the sentence simplification process. Experimental results show that our approach was able to successfully simplify 70% of the complex sentences written for video game design application. In comparison, a single-agent approach attained a 48% success rate on the same task.

本文探讨了如何在大语言模型的帮助下将复杂的句子转换成逻辑、简化的句子,同时保留语义和逻辑完整性的挑战。我们建议采用混合方法,将先进的催化和多试剂结构结合起来,以加强句子简化过程。实验结果显示,我们的方法成功地简化了为视频游戏设计应用而撰写的70%复杂的句子。相比之下,单一试剂方法在同一任务上取得了48%的成功率。


Article 145

Title@2025-06-17 (2): ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

Title: ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities ONEBench, um sie alle zu testen: Benchmarking auf Probenebene über offene Fähigkeiten 一、一、测试所有标准:关于开放式能力的抽样基准 2412.06745v2

Authors (6): Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge

Traditional fixed test sets fall short in evaluating open-ended capabilities of foundation models. To address this, we propose ONEBench(OpeN-Ended Benchmarking), a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. ONEBench allows users to generate custom, open-ended evaluation benchmarks from this pool, corresponding to specific capabilities of interest. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias. Most importantly, it frames model evaluation as a collective process of selecting and aggregating sample-level tests. The shift from task-specific benchmarks to ONEBench introduces two challenges: (1)heterogeneity and (2)incompleteness. Heterogeneity refers to the aggregation over diverse metrics, while incompleteness describes comparing models evaluated on different data subsets. To address these challenges, we explore algorithms to aggregate sparse measurements into reliable model scores. Our aggregation algorithm ensures identifiability(asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model ranking with less data. On homogenous datasets, we show our aggregation algorithm provides rankings that highly correlate with those produced by average scores. We also demonstrate robustness to ~95% of measurements missing, reducing evaluation cost by up to 20x with little-to-no change in model rankings. We introduce ONEBench-LLM for language models and ONEBench-LMM for vision-language models, unifying evaluations across these domains. Overall, we present a technique for open-ended evaluation, which can aggregate over incomplete, heterogeneous sample-level measurements to continually grow a benchmark alongside the rapidly developing foundation models.

传统的固定测试设置在评价基础模型的开放能力方面落后。 为了解决这个问题, 我们提议ANEBench( OpeN- 结束基准), 这是一种新的测试模式, 将单个评价数据集整合成一个统一、 不断扩展的样本库。 一个基准允许用户根据特定的兴趣能力, 从这个库中生成自定义、 开放式的评价基准。 通过将样本汇集到不同的测试组中, ANEBench 能够评估超出原始测试组所覆盖的各种能力, 同时减少过度配置和数据集的偏差。 最重要的是, 它将模型评价作为选择和汇总样本级测试的集体过程来设置。 从具体任务基准转换为 ANEANBench 带来了两个挑战:(1) 异质性和(2) 不完全性。 一个基准允许用户从这个库中生成自定义的、 开放性的评价基准, 同时描述对不同数据子集的模型的不完全性。 为了应对这些挑战, 我们的汇总算法可以将总稀少的测量结果转换成可靠的模型分数。 我们的算法可以确保识别性( 同时恢复地面评分数) , 快速地基基础评估, 和快速整合基础, 使准确的模型排序模型升级, 和精确的模型升级的模型的比重 。


Article 146

Title@2025-06-17 (2): Convert Language Model into a Value-based Strategic Planner

Title: Convert Language Model into a Value-based Strategic Planner Konvertieren Sie Sprachmodell in einen wertbasierten strategischen Planer 将语言模式转换成基于价值的战略规划员 2505.06987v4

Authors (7): Xiaoyu Wang, Yue Zhao, Qingqing Gu, Zhonglin Jiang, Xiaokai Chen, Yong Chen, Luo Ji

Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage the Q-learning on LLMs, and propose a framework called straQ. Our framework allows a plug-and-play LLM to bootstrap the planning during ESC, determine the optimal strategy based on long-term returns, and finally guide the LLM to response. Substantial experiments on ESC datasets suggest that straQ outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and finite state machines.

情感支持对话(ESC)旨在通过有效对话减轻个人的情感痛苦。 尽管大型语言模型(LLMs)在ESC方面取得了显著的进展,但大多数这些研究可能无法从州模式的角度界定图表,因此为长期满意度提供了亚优的解决方案。 为了解决这一问题,我们利用LMs的Q学习,并提出了一个名为stra的框架。 我们的框架允许插接和播放LLM在ESC期间启动规划,确定基于长期回报的最佳战略,并最终指导LM作出反应。 有关ESC数据集的大量实验表明,它超越了许多基线,包括直接引用、自我反省、思维链、微调和限定状态机器。


Article 147

Title@2025-06-17 (2): GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors

Title: GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors GuiLoMo: Zuordnung von Expertenzahl und Rang für LoRA-MoE über Bilevel-Optimierung mit GuidedSelection-Vektoren Guilomo:通过向导选择矢量的双级优化为 LoRA-MoE 分配专家编号和排名 2506.14646v1

Authors (11): Hengyuan Zhang, Xinrong Chen, Yingmin Qiu, Xiao Liang, Ziyue Li, Guanyu Wang, Weiping Li, Tong Mo, Wenyue Li, Hayden Kwok-Hay So, Ngai Wong

Parameter-efficient fine-tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), offer an efficient way to adapt large language models with reduced computational costs. However, their performance is limited by the small number of trainable parameters. Recent work combines LoRA with the Mixture-of-Experts (MoE), i.e., LoRA-MoE, to enhance capacity, but two limitations remain in hindering the full exploitation of its potential: 1) the influence of downstream tasks when assigning expert numbers, and 2) the uniform rank assignment across all LoRA experts, which restricts representational diversity. To mitigate these gaps, we propose GuiLoMo, a fine-grained layer-wise expert numbers and ranks allocation strategy with GuidedSelection Vectors (GSVs). GSVs are learned via a prior bilevel optimization process to capture both model- and task-specific needs, and are then used to allocate optimal expert numbers and ranks. Experiments on three backbone models across diverse benchmarks show that GuiLoMo consistently achieves superior or comparable performance to all baselines. Further analysis offers key insights into how expert numbers and ranks vary across layers and tasks, highlighting the benefits of adaptive expert configuration. Our code is available at https://github.com/Liar406/Gui-LoMo.git.

高效的参数微调方法,特别是低兰克适应(LORA),是调整大型语言模型,降低计算成本的有效途径,但是,由于培训参数数量少,这些模型的性能受到限制。最近的工作将LoRA与Mixture-of-Experts(MOE),即Lora-MoE,结合了LoRA-MoE,以提高能力,但在阻碍充分利用其潜力方面仍然存在两个限制:(1)下游任务在分配专家人数时的影响;(2)所有LORA专家的统一级别分配限制了代表性的多样性。为缩小这些差距,我们建议Gui LoMo,一个精细的层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-级别分配战略。GSV通过一个前的双级优化进程学习,既了解模式-任务/任务/层次/层次,然后用于分配最佳的专家人数和级别。关于三个骨干模型的实验表明,GuiLomo在所有基线上始终取得优劣或可比较的业绩。进一步分析我们现有的专家数字/层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-层次-分。


Article 148

Title@2025-06-17 (2): Passing the Turing Test in Political Discourse: Fine-Tuning LLMs to Mimic Polarized Social Media Comments

Title: Passing the Turing Test in Political Discourse: Fine-Tuning LLMs to Mimic Polarized Social Media Comments Den Turing-Test im politischen Diskurs bestehen: Fine-Tuning LLMs to Mimic Polarized Social Media Kommentare 透过政治话题图图图图图测试:微光极极化社会媒体评论 2506.14645v1

Authors (5): . Pazzaglia, V. Vendetti, L. D. Comencini, F. Deriu, V. Modugno

The increasing sophistication of large language models (LLMs) has sparked growing concerns regarding their potential role in exacerbating ideological polarization through the automated generation of persuasive and biased content. This study explores the extent to which fine-tuned LLMs can replicate and amplify polarizing discourse within online environments. Using a curated dataset of politically charged discussions extracted from Reddit, we fine-tune an open-source LLM to produce context-aware and ideologically aligned responses. The model’s outputs are evaluated through linguistic analysis, sentiment scoring, and human annotation, with particular attention to credibility and rhetorical alignment with the original discourse. The results indicate that, when trained on partisan data, LLMs are capable of producing highly plausible and provocative comments, often indistinguishable from those written by humans. These findings raise significant ethical questions about the use of AI in political discourse, disinformation, and manipulation campaigns. The paper concludes with a discussion of the broader implications for AI governance, platform regulation, and the development of detection tools to mitigate adversarial fine-tuning risks.

大型语言模型(LLMS)日益精密,引起了人们对其通过自动生成有说服力和偏见的内容来加剧意识形态两极分化的潜在作用日益关切。本研究报告探讨了微调LLMs在网上环境中复制和扩大两极分化对话的程度。利用从Reddit得到的具有政治影响力的讨论的整理数据集,我们微调了开放源码LLM,以产生符合背景和意识形态的反应。该模型的产出通过语言分析、情绪评分和人文批注来评价,特别注意可信度和与原始讨论的口头一致。研究结果表明,LLMs在接受关于党派数据的培训时,能够产生高度可信和挑衅性的评论,往往与人类所写的评论不相干。这些调查结果提出了在政治演讲、虚假信息和操纵运动中使用AI的重大伦理问题。文件最后讨论了对AI治理、平台监管和开发检测工具以减少对抗性微调风险的更广泛影响。


Article 149

Title@2025-06-17 (2): Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot

Title: Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot Revisiting Chain-of-Thought Prompting: Null-Schuss kann stärker sein als wenige-Schuss 重新思考寻求链激励:零射出比少射出强 2506.14641v1

Authors (8): Xiang Cheng, Chengyan Pan, Minjun Zhao, Deyang Li, Fangchao Liu, Xinyu Zhang, Xiao Zhang, Yong Liu

In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as \texttt{Qwen2.5-Max} and \texttt{DeepSeek-R1}. Experimental results indicate that these enhanced exemplars still fail to improve the model’s reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.

文本中学习(ICL)是大型语言模型(LLMs)的基本新兴能力,最近的一些研究引入了CLL的推理模型(CoT),以显示ICL的推理能力,特别是数学任务。然而,鉴于模型能力的不断提高,尚不清楚COT模拟器在这类任务中是否仍然有益于最近的、较强的模型。通过系统实验,我们发现,对于Qwen2.5系列等最近的强型模型,加上传统的COT示范器,与Zero-Shot CoT相比,没有提高推理性能。相反,这些模型的主要功能是使输出格式与人类的期望保持一致。我们进一步调查了COT增强的演示器的有效性,这些模型是利用先进的模型(例如\ texttwen2.5-max}和\texttt{DeepSeek-R1}等)的答案构建的。我们发现,这些增强的Explater 仍然未能改进模型的推理学性业绩。进一步的分析表明,模型往往忽视Explators,主要侧重于人类期望中的指示,导致CLVLI的逻辑上的全面推算。


Article 150

Title@2025-06-17 (2): IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems

Title: IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems IP-Leakage-Angriffe zielen auf LLM-basierte Multi-Agent-Systeme IP IP 以LLM为基础的多机构系统为目标的针对LLM的漏漏攻击系统 2505.12442v3

Authors (8): Liwen Wang, Wenxuan Wang, Shuai Wang, Zongjie Li, Zhenlan Ji, Zongyi Lyu, Daoyuan Wu, Shing-Chi Cheung

The rapid advancement of Large Language Models (LLMs) has led to the emergence of Multi-Agent Systems (MAS) to perform complex tasks through collaboration. However, the intricate nature of MAS, including their architecture and agent interactions, raises significant concerns regarding intellectual property (IP) protection. In this paper, we introduce MASLEAK, a novel attack framework designed to extract sensitive information from MAS applications. MASLEAK targets a practical, black-box setting, where the adversary has no prior knowledge of the MAS architecture or agent configurations. The adversary can only interact with the MAS through its public API, submitting attack query $q$ and observing outputs from the final agent. Inspired by how computer worms propagate and infect vulnerable network hosts, MASLEAK carefully crafts adversarial query $q$ to elicit, propagate, and retain responses from each MAS agent that reveal a full set of proprietary components, including the number of agents, system topology, system prompts, task instructions, and tool usages. We construct the first synthetic dataset of MAS applications with 810 applications and also evaluate MASLEAK against real-world MAS applications, including Coze and CrewAI. MASLEAK achieves high accuracy in extracting MAS IP, with an average attack success rate of 87% for system prompts and task instructions, and 92% for system architecture in most cases. We conclude by discussing the implications of our findings and the potential defenses.

大语言模型(LLMS)的迅速发展导致多语言模型(LLMS)的出现,从而导致通过协作执行复杂的任务。然而,MAS的复杂性质,包括其结构和代理互动,引起了对知识产权保护的极大关注。在本文件中,我们引入了MASEAK,这是旨在从MAS应用中提取敏感信息的新型攻击框架。MASLEAK(MAS)的目标是一个实用的黑箱设置,对手事先对MAS的结构或代理配置没有了解。对手只能通过公开的API与MAS互动,提交攻击查询,并观察最后代理的产出。受计算机蠕虫如何传播和感染脆弱网络主机的干扰,MASLEA精心设计的对抗性查询美元,以便从每个MAS代理中获取、传播和保留反应,显示一整套所有权组成部分,包括代理人的数量、系统表、系统提示、任务指示和工具使用。我们用810种应用的合成数据集成数据集,还评估MASRAS在现实世界中如何传播和感染脆弱网络主机主,包括Cze和Creal ASAAAA(IPMASL) AS) 的高级指令的精度,我们通过87标准的精度和精确度和精确度,通过仲裁和精确度分析系统,完成。


Article 151

Title@2025-06-17 (2): Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

Title: Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers Zeigen große Sprachmodelle Kognitive Dissonanz? Studieren des Unterschieds zwischen offenbarten Glaubensbekenntnissen und erklärten Antworten 大型语言模型 展示认知差异? 研究信奉信仰与国家答复之间的差异 2406.14986v3

Authors (5): Manuel Mondal, Ljiljana Dolamic, Gérôme Bovet, Philippe Cudré-Mauroux, Julien Audiffren

Multiple Choice Questions (MCQ) have become a commonly used approach to assess the capabilities of Large Language Models (LLMs), due to their ease of manipulation and evaluation. The experimental appraisals of the LLMs’ Stated Answer (their answer to MCQ) have pointed to their apparent ability to perform probabilistic reasoning or to grasp uncertainty. In this work, we investigate whether these aptitudes are measurable outside tailored prompting and MCQ by reformulating these issues as direct text-completion - the fundamental computational unit of LLMs. We introduce Revealed Belief, an evaluation framework that evaluates LLMs on tasks requiring reasoning under uncertainty, which complements MCQ scoring by analyzing text-completion probability distributions. Our findings suggest that while LLMs frequently state the correct answer, their Revealed Belief shows that they often allocate probability mass inconsistently, exhibit systematic biases, and often fail to update their beliefs appropriately when presented with new evidence, leading to strong potential impacts on downstream tasks. These results suggest that common evaluation methods may only provide a partial picture and that more research is needed to assess the extent and nature of their capabilities.

多种选择问题(MCQ)已成为评估大语言模型(LLMs)能力的一种常用方法,因为这些模型易于操作和评价。LLMs’ State答复(其对MCQ的答复)的实验性评估表明,这些能力显然有能力进行概率推理或把握不确定性。在这项工作中,我们调查这些能力是否在外部可以计量,这些能力是否因地制宜,通过将这些问题重新表述为直接文本完成(LLMs的基本计算单位)来促进和MCQ。我们引入了Reveal Licional,这是一个评估在不确定性下需要推理的任务的LLMs的评价框架,通过分析文本完成概率分布来补充MCQ评分。我们的研究结果表明,虽然LMss经常说明正确的答案,但是他们的“Revical Contination”表明,它们往往不连贯地分配概率,表现出系统性的偏差,在提出新的证据时往往不能适当地更新其信念,从而对下游任务产生巨大的潜在影响。这些结果表明,共同的评价方法可能仅提供部分情况,需要进行更多的研究,以评估其能力的范围和性质。


Article 152

Title@2025-06-17 (2): Prefix-Tuning+: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention

Title: Prefix-Tuning+: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention Prefix-Tuning+: Modernisierung des Prefix-Tunings durch Entkoppelung des Prefixs von Aufmerksamkeit 前缀- 调整+: 通过将前缀与注意脱钩而使前缀- 调整前缀现代化 2506.13674v2

Authors (7): Haonan Wang, Brian Chen, Siquan Li, Xinhe Liang, Hwee Kuan Lee, Kenji Kawaguchi, Tianyang Hu

Parameter-Efficient Fine-Tuning (PEFT) methods have become crucial for rapidly adapting large language models (LLMs) to downstream tasks. Prefix-Tuning, an early and effective PEFT technique, demonstrated the ability to achieve performance comparable to full fine-tuning with significantly reduced computational and memory overhead. However, despite its earlier success, its effectiveness in training modern state-of-the-art LLMs has been very limited. In this work, we demonstrate empirically that Prefix-Tuning underperforms on LLMs because of an inherent tradeoff between input and prefix significance within the attention head. This motivates us to introduce Prefix-Tuning+, a novel architecture that generalizes the principles of Prefix-Tuning while addressing its shortcomings by shifting the prefix module out of the attention head itself. We further provide an overview of our construction process to guide future users when constructing their own context-based methods. Our experiments show that, across a diverse set of benchmarks, Prefix-Tuning+ consistently outperforms existing Prefix-Tuning methods. Notably, it achieves performance on par with the widely adopted LoRA method on several general benchmarks, highlighting the potential modern extension of Prefix-Tuning approaches. Our findings suggest that by overcoming its inherent limitations, Prefix-Tuning can remain a competitive and relevant research direction in the landscape of parameter-efficient LLM adaptation.

在这项工作中,我们从经验上表明,由于投入和注意力前头的先行重要性之间的内在权衡,大语言模型(LLMs)迅速适应下游任务已经变得十分关键。前导,即早期有效的PEFT技术,展示了在计算和记忆管理上大幅降低的情况下实现完全微调的能力,而完全微调,但尽管它早些时候取得了成功,但在培训现代最新水平的LLMS(PEFFT)方面,其效力非常有限。我们从经验上表明,Prefix-TUning+由于在投入和注意力前头的先行重要性之间的内在权衡而使LLMTM(Prefix-Turning +)之间出现偏差。这促使我们引入了Prefix-Turning+(Prefix-Turning+)这一新结构,通过将前导模块从注意力本身的注意力转移出去,在解决其缺点的同时,在解决其缺点方面,将普法原则概括性原则概括化。我们之前的内在推展率方法在LM(Preal-Pre)前仍能地显示其业绩。


Article 153

Title@2025-06-17 (2): VisText-Mosquito: A Multimodal Dataset and Benchmark for AI-Based Mosquito Breeding Site Detection and Reasoning

Title: VisText-Mosquito: A Multimodal Dataset and Benchmark for AI-Based Mosquito Breeding Site Detection and Reasoning VisText-Mosquito: Ein multimodaler Datensatz und Benchmark für KI-basierte Mosquito-Züchtungsstandorterkennung und -Vernunft VisText-Mosquito:基于AI的蚊子育种点检测和理据的多模式数据集和基准 2506.14629v1

Authors (7): Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Muhammad Ziaur Rahman, Shahanur Rahman Bappy, Raiyan Rahman, Swakkhar Shatabda

Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language reasoning texts linked to each image. The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For reasoning generation, our fine-tuned BLIP model achieves a final loss of 0.0028, with a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.87. This dataset and model framework emphasize the theme “Prevention is Better than Cure”, showcasing how AI-based detection can proactively address mosquito-borne disease risks. The dataset and implementation code are publicly available at GitHub: https://github.com/adnanul-islam-jisun/VisText-Mosquito

在本文中,我们介绍VisText-Mosquito的多式联运数据集,该数据集整合视觉和文字数据,以支持自动检测、分解和蚊子繁殖地点分析的推理。该数据集包括1 828张附加说明的物体探测图像、142张水表面分解图像以及与每张图像相连的自然语言推理文本。YOLOv9s模型达到0.929226和0.92891的 mAP@50最高精确度,用于物体检测,而YOLOv11n-Seg的分类精确度为0.91587和0.797995的 mAP@50。关于推理,我们精心调整的BLIP模型最终损失0.0028,BLEU分数为54.7,BERTScore为0.91,ROUGE-L为0.87。这个数据集和模型框架强调主题“预防比Cure要好”,展示AI基础的检测如何能够积极主动地应对蚊子传染疾病风险。MAUB/Exbqual/Gis可公开获取的数据和执行代码。


Article 154

Title@2025-06-17 (2): SynGraph: A Dynamic Graph-LLM Synthesis Framework for Sparse Streaming User Sentiment Modeling

Title: SynGraph: A Dynamic Graph-LLM Synthesis Framework for Sparse Streaming User Sentiment Modeling SynGraph: Ein dynamisches Graph-LLM-Synthese-Framework für Sparse Streaming User Sentiment Modeling Syllgraph: 垃圾流用户感应建模动态图形-LLM合成框架 2503.04619v2

Authors (6): Xin Zhang, Qiyu Wei, Yingjie Zhu, Linhai Zhang, Deyu Zhou, Sophia Ananiadou

User reviews on e-commerce platforms exhibit dynamic sentiment patterns driven by temporal and contextual factors. Traditional sentiment analysis methods focus on static reviews, failing to capture the evolving temporal relationship between user sentiment rating and textual content. Sentiment analysis on streaming reviews addresses this limitation by modeling and predicting the temporal evolution of user sentiments. However, it suffers from data sparsity, manifesting in temporal, spatial, and combined forms. In this paper, we introduce SynGraph, a novel framework designed to address data sparsity in sentiment analysis on streaming reviews. SynGraph alleviates data sparsity by categorizing users into mid-tail, long-tail, and extreme scenarios and incorporating LLM-augmented enhancements within a dynamic graph-based structure. Experiments on real-world datasets demonstrate its effectiveness in addressing sparsity and improving sentiment modeling in streaming reviews.

传统情绪分析方法侧重于静态审查,未能捕捉用户情绪评级和文字内容之间不断变化的时间关系。流流态审查的感应分析通过模拟和预测用户情绪的时间演变来应对这一局限性。然而,它受到数据宽度的影响,以时间、空间和综合形式表现出来。在本文中,我们引入了SynGraph,这是一个新颖的框架,旨在解决流态审查中情绪分析中的数据宽度问题。SynGraph将用户分为中尾、长尾和极端情景,并将LLLM强化措施纳入动态图表结构,从而缓解了数据宽度。关于现实世界数据集的实验表明其在解决流态化和改进流态审查中的情绪模型方面的有效性。


Article 155

Title@2025-06-17 (2): TaskCraft: Automated Generation of Agentic Tasks

Title: TaskCraft: Automated Generation of Agentic Tasks TaskCraft: Automatisierte Generierung von Agentischen Aufgaben TTTCraft:自动生成代理任务 2506.10055v2

Authors (17): Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li, Hongxuan Lu, Fangchen Dong, Tianrui Qin, King Zhu, Minghao Liu, Jian Yang, Ge Zhang, Jiaheng Liu, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou

Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce \textsc{TaskCraft}, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.

需要以自主、工具使用和适应性推理解决多步问题的任务,对推进国家实验室方案和AI来说越来越重要。然而,现有的教学数据缺乏工具互动,而目前的代理基准依赖于昂贵的人类说明,限制了其可缩放性。我们引入了“textsc{TasskCraft}”,这是一个自动工作流程,可以产生可缩放的、多工具和可核实的代理任务,与执行轨迹。TlexCraft利用深度和宽度扩展扩大原子任务,制造结构上和等级上的复杂挑战。经验性结果显示,这些任务改善了生成工作流程的迅速优化,加强了受监督的对代理基础模型的微调。我们提出了一套大型合成数据,大约有36 000项,在支持今后对代理的调整和评价进行研究方面困难不一。


Article 156

Title: Graph RAG for Legal Norms: A Hierarchical, Temporal and Deterministic Approach Grafik RAG für rechtliche Normen: Hierarchischer, zeitlicher und deterministischer Ansatz 法律规范的图表RAG:一个等级、时间和决定因素学方法 2505.00039v3

Authors (1): Hudson de Martim

This article proposes an adaptation of Graph Retrieval-Augmented Generation (Graph RAG) specifically designed for the analysis and comprehension of legal norms. Legal texts are characterized by a predefined hierarchical structure, an extensive network of references and a continuous evolution through multiple temporal versions. This temporal dynamism poses a significant challenge for standard AI systems, demanding a deterministic representation of the law at any given point in time. To address this, our approach grounds the knowledge graph construction in a formal, FRBRoo-inspired model that distinguishes abstract legal works from their concrete textual expressions. We introduce a multi-layered representation of Temporal Versions (capturing date-specific changes) and Language Versions (capturing linguistic variations). By modeling normative evolution as a precise sequence of these versioned entities, we enable the construction of a knowledge graph that serves as a verifiable “ground truth”. This allows Large Language Models to generate responses based on accurate, context-aware, and point-in-time correct legal information, overcoming the risk of temporal inaccuracies. Through a detailed analysis of this formal Graph RAG approach and its application to legal norm datasets, this article aims to advance the field of Artificial Intelligence applied to Law, creating opportunities for more effective and reliable systems in legal research, legislative analysis, and decision support.

本条建议修改专门为分析和理解法律规范而设计的 “ 检索和提法一代图 “ (Graph RAG),法律文本的特点是预先界定的等级结构、广泛的参考网络和通过多种时间版本的不断演变;这种时间动态对标准的AI系统提出了重大挑战,要求在任何特定时间对法律进行决定性的表述;为此,我们的方法将知识图的构建建立在正式的、FRBROoo启发的、抽象的法律作品与其具体文本表达方式区分开来的模式中。我们采用了一种多层次的时尚版本(记录具体日期的变化)和语言版本(记录语言变异)的表述。通过将规范的演变作为这些版本实体的精确序列进行建模,我们得以构建一个知识图表,作为可核查的 “ 地面真相 “ 。这样,大型语言模型就可以根据准确、符合背景和受点启发的法律信息生成答复,克服时间上的不准确风险。我们采用这一正式的RAG方法(记录具体日期的变化)和语言版本(记录语言变异)和语言版本(记录语言变异)的多层次的表述。我们通过对正式的RAG方法进行详细分析,将法律分析,并更可靠地用于法律规范研究,目的是为法律系统创造机会进行实地分析。


Article 157

Title@2025-06-17 (2): When Does Meaning Backfire? Investigating the Role of AMRs in NLI

Title: When Does Meaning Backfire? Investigating the Role of AMRs in NLI Wann bedeutet Backfire? Untersuchung der Rolle von AMRs in NLI ” 什么时候发生反火 “ 的含义? 调查在非国家劳动力调查中年龄、年龄、年龄、年龄、年龄、年龄、年龄、年龄、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、性别、 性别、性别、性别、 性别、性别、性别、性别、性别、性别、性别、 性别、性别、 性别、 性别、性别、性别、 性别、性别、性别、性别、性别、性别、性别、 性别、性别、性别、性别、性别、 性别、 性别、 性别、 性别、 性别、 性别 性别 性别 性别、 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 性别 2506.14613v1

Authors (3): Junghyun Min, Xiulin Yang, Shira Wein

Natural Language Inference (NLI) relies heavily on adequately parsing the semantic content of the premise and hypothesis. In this work, we investigate whether adding semantic information in the form of an Abstract Meaning Representation (AMR) helps pretrained language models better generalize in NLI. Our experiments integrating AMR into NLI in both fine-tuning and prompting settings show that the presence of AMR in fine-tuning hinders model generalization while prompting with AMR leads to slight gains in \texttt{GPT-4o}. However, an ablation study reveals that the improvement comes from amplifying surface-level differences rather than aiding semantic reasoning. This amplification can mislead models to predict non-entailment even when the core meaning is preserved.

自然语言推论(NLI)在很大程度上依赖于充分分析前提和假设的语义内容。 在这项工作中,我们调查以“抽象表示”的形式添加语义信息是否有助于在“抽象表示”中更好地概括语言模型。我们在微调和提示环境中将AMR纳入“自然语言推论”的实验表明,“自然语言推断”妨碍模式的概括化,而与“初始表示”的推论则则导致在“理论表达”中略有增加。然而,一项反动研究表明,改进来自扩大地表层次差异,而不是辅助语义推理。这种放大可以误导模型预测非零售,即使核心含义得到了维护。


Article 158

Title@2025-06-17 (2): Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees

Title: Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees Guaranteed Guess: Ein Sprachmodellierungsansatz für CISC-to-RISC Transpilation mit Testgarantien 有担保的猜测:具有测试保证的CISC到RISC传输语言模拟方法 2506.14606v1

Authors (5): Ahmed Heakl, Sarim Hashmi, Chaimaa Abi, Celine Lee, Abdulrahman Mahmoud

The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms. In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method generates candidate translations using an LLM from one ISA to another, and embeds such translations within a software-testing framework to build quantifiable confidence in the translation. We evaluate our GG approach over two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs, respectively. Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon, showcasing 1.73x faster runtime performance, 1.47x better energy efficiency, and 2.41x better memory usage for our transpiled code, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks. We will open-source our codes, data, models, and benchmarks to establish a common foundation for ISA-level code translation research.

硬件生态系统正在迅速演变,人们越来越有兴趣以快速、灵活和正确的方式将低级别程序翻译到不同的教学设置架构(ISAs)中,以快速、灵活和正确的方式将低级别程序翻译到不同的教学设置结构(ISAs)中,从而增强现有代码的可移动性和寿命。由于在教学复杂性、记忆模型和执行模式方面存在着根本差异,这一转换问题的一个特别具有挑战性的类别是复杂的(CISC)和减少的(RISC)硬件结构。在这项工作中,我们引入了GG(担保猜测),这是一个以ISA为核心的传输管道,将预先训练的大型语言模型(LLLMs)的翻译能力与成熟的软件测试结构结合起来。此外,我们的方法利用LLLM(LMs)进行候选人翻译,并将这种翻译嵌入一个软件测试框架,以建立对翻译的可量化的信任。我们通过两个不同的数据集来评估我们的GG方法,在单位测试中实施高代码覆盖(>98%),在HumanEval 程序上实现9 %的功能/中度的开放性转换,在Beup Bench 程序上实现了49 %。 此外,我们将将我们的方法用于Silex- Stal-listral-listral-liver-lic-lic-lical-lic-lic-deal-lical-lical-lax lax lax ladeal-deal-lish-lishal-lishal-de-lishal-lax lax lax lautal-lishal-lax ax 2.x ax ax ax ax ax ax ax ax ax ax ax ax ax ax ax ax ax a-sal-sal-sal-sal-sal-deal-cal-sal-deal-lical-cal-ladal-cal-ladal-lad-lad-laxxxxx-lad-lad-lad-laxxxxx-lax-laxxxx a-lax ax ax ax a-lax ax ax a-lax a-lax a-lax-


Article 159

Title@2025-06-17 (2): Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Title: Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents Navigieren der digitalen Welt als Menschen tun: Universal Visual Grounding für GUI-Agenten 将数字世界作为人行:通用用户界面代理的通用视觉定位 2410.05243v3

Authors (8): Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su

Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents. Empirical results on six benchmarks spanning three categories (grounding, offline agent, and online agent) show that 1) UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and 2) agents with UGround outperform state-of-the-art agents, despite the fact that existing agents use additional text-based input while ours only uses visual perception. These results provide strong support for the feasibility and promises of GUI agents that navigate the digital world as humans do.

多式大型语言模型(MLLMM)正在改变图形用户界面(GUI)代理器的能力,促进它们从受控模拟到复杂、真实的平台应用的转型。然而,这些代理商的效力取决于其地基能力的稳健性。当前GUI代理商主要使用HTML或无障碍树等基于文本的表达方式,尽管这些表达方式有用,但往往引入噪音、不完善和增加计算管理费用。在本文件中,我们提倡为图形用户界面代理商设计一种人性化化的化身,这些代理商能够完全直观地看待环境,直接在图形界面上执行像素级操作。关键是视觉地基模型,能够准确地绘制各种图形界面元素在不同的平台上与其坐标进行对比的表达方式。我们展示一种简单的配方,包括基于网络的合成数据和对LLALAVA结构的微小调整。我们收集了GUIU最庞大的视觉地面定位数据集,其中含有10MUG界面的元素及其在1.3MUDUG的直截截图上的表示方式,并且用它来训练UGFIFIFOR,一个强大的直观化的强大直观地面模型模型,一个强大的直观化模型,一个强大的直观化模型的缩写模型,用来测量模型用来测量模型用来测量地标的模型,而用六基质地基质地基质化的输出的输出的输出的输出的输出的计算结果,而用六种,而用在地面代理商则则用在地面代理商在地面代理商在地面代理商的基底基底基底基底基底基底基底基底基底基压的基质上,这些基底基压的基压的基压的基压的基压的基压上,这些基压的基压的基压的模上,这些基压的基压的基压的基代理商上,这些基代理商的基压的推算法,这些基压的推算法系的基压的基压的推算法的基压的基压的基压的基压的基压的基的基的基压的基压的基压的基压的基压的基压的基压的基压的基压的基压的基的基的基的基的基的


Article 160

Title@2025-06-17 (2): Computational Studies in Influencer Marketing: A Systematic Literature Review

Title: Computational Studies in Influencer Marketing: A Systematic Literature Review Computational Studies in Influencer Marketing: A Systematic Literature Review 《影响营销中的计算研究:系统文学评论》 2506.14602v1

Authors (4): Haoyang Gui, Thales Bertaglia, Catalina Goanta, Gerasimos Spanakis

Influencer marketing has become a crucial feature of digital marketing strategies. Despite its rapid growth and algorithmic relevance, the field of computational studies in influencer marketing remains fragmented, especially with limited systematic reviews covering the computational methodologies employed. This makes overarching scientific measurements in the influencer economy very scarce, to the detriment of interested stakeholders outside of platforms themselves, such as regulators, but also researchers from other fields. This paper aims to provide an overview of the state of the art of computational studies in influencer marketing by conducting a systematic literature review (SLR) based on the PRISMA model. The paper analyses 69 studies to identify key research themes, methodologies, and future directions in this research field. The review identifies four major research themes: Influencer identification and characterisation, Advertising strategies and engagement, Sponsored content analysis and discovery, and Fairness. Methodologically, the studies are categorised into machine learning-based techniques (e.g., classification, clustering) and non-machine-learning-based techniques (e.g., statistical analysis, network analysis). Key findings reveal a strong focus on optimising commercial outcomes, with limited attention to regulatory compliance and ethical considerations. The review highlights the need for more nuanced computational research that incorporates contextual factors such as language, platform, and industry type, as well as improved model explainability and dataset reproducibility. The paper concludes by proposing a multidisciplinary research agenda that emphasises the need for further links to regulation and compliance technology, finer granularity in analysis, and the development of standardised datasets.

影响性营销已成为数字营销战略的一个关键特征。尽管其快速增长和算法相关性已成为数字营销战略的一个关键特征。尽管其快速增长和算法相关性,影响力营销的计算研究领域仍然支离破碎,特别是针对所用计算方法的有限系统审查。这使得影响性经济的总体科学测量非常稀少,不利于平台本身以外的利益攸关方,例如监管者,但也不利于其他领域的研究人员。本文的目的是通过根据PRISMA模型进行系统的文献审查,对影响性营销的计算性研究进行总体概述。文件分析了69项研究,以确定这一研究领域的关键研究主题、方法和未来方向。审查确定了四个主要研究主题:影响性识别和定性、广告战略和参与、赞助的内容分析和发现以及公平性。从方法上讲,这些研究被归结为基于机器的学习技术(例如分类、集群)和非基于准确性学习性的技术(例如统计分析、网络分析)。主要结论显示,在选择性商业结果时,对监管性、合规性、伦理性研究、赞助性分析和公平性研究的分类方面需要,作为更精确的统计性分析基础,作为背景性研究、更精确性分析的分类分析,作为基础,作为基础,作为统计性研究、更精确性研究、更精确性的数据分析的基础分析,作为基础分析,作为基础性分析,作为基础性分析,作为基础性分析,作为基础性分析,作为基础性研究的计算,作为基础性分析,分析,分析,分析,作为基础性分析,分析,作为基础性分析,作为基础。审查重点,需要。审评性分析,审查,审查,需要需要,作为基础,作为统计性分析,作为基础,作为基础,作为基础,作为基础,分析,作为基础,作为基础,作为基础分析,分析,分析。审查。审查。审查。审查,审查重点,审查,审查,审查,需要,需要,作为统计性分析。审查。审查。审查。审查。审查,作为分析,分析,作为统计性分析,作为统计性分析,审查,分析,分析,分析,作为统计性分析。审查。审查。审查,审查,审查,分析,分析,分析,分析,分析,分析,分析。审查需要和伦理性分析。审查需要,分析,分析,分析,分析,分析,分析,分析,分析,分析,分析,分析,分析,分析。审查需要,分析,分析,分析,作为


Article 161

Title@2025-06-17 (2): From tools to thieves: Measuring and understanding public perceptions of AI through crowdsourced metaphors

Title: From tools to thieves: Measuring and understanding public perceptions of AI through crowdsourced metaphors Von Werkzeugen zu Dieben: Messen und Verstehen öffentlicher Wahrnehmungen von KI durch crowdsourced Metaphern 从工具到盗贼:通过众包比喻衡量和理解公众对AI的看法 2501.18045v3

Authors (6): Myra Cheng, Angela Y. Lee, Kristina Rapuano, Kate Niederhoffer, Alex Liebscher, Jeffrey Hancock

How has the public responded to the increasing prevalence of artificial intelligence (AI)-based technologies? We investigate public perceptions of AI by collecting over 12,000 responses over 12 months from a nationally representative U.S. sample. Participants provided open-ended metaphors reflecting their mental models of AI, a methodology that overcomes the limitations of traditional self-reported measures by capturing more nuance. Using a mixed-methods approach combining quantitative clustering and qualitative coding, we identify 20 dominant metaphors shaping public understanding of AI. To analyze these metaphors systematically, we present a scalable framework integrating language modeling (LM)-based techniques to measure key dimensions of public perception: anthropomorphism (attribution of human-like qualities), warmth, and competence. We find that Americans generally view AI as warm and competent, and that over the past year, perceptions of AI’s human-likeness and warmth have significantly increased ($+34\%, r = 0.80, p < 0.01; +41\%, r = 0.62, p < 0.05$). These implicit perceptions, along with the identified dominant metaphors, strongly predict trust in and willingness to adopt AI ($r^2 = 0.21, 0.18, p < 0.001$). Moreover, we uncover systematic demographic differences in metaphors and implicit perceptions, such as the higher propensity of women, older individuals, and people of color to anthropomorphize AI, which shed light on demographic disparities in trust and adoption. In addition to our dataset and framework for tracking evolving public attitudes, we provide actionable insights on using metaphors for inclusive and responsible AI development.

公众如何应对人造情报(AI)基础技术日益普及的问题?我们调查公众对AI的看法,从一个具有国家代表性的美国抽样中收集了12个月来对AI的12 000多份答复。与会者提供了不限名额的隐喻,反映了他们的思想模式:AI, 这种方法通过捕捉更多的细微分,克服了传统自我报告措施的局限性。我们采用混合方法,将数量组和质量编码相结合,确定了20个影响公众对AI理解的主要隐喻。为了系统地分析这些隐喻,我们提出了一个可扩展性框架,将基于语言的模型(LM)技术整合起来,以衡量公众认识的关键层面:人类形态(人种特性的归属)、温暖和能力。我们发现美国人普遍认为AI是温暖和胜任的,过去一年,对AI的人类相似性和温暖性的看法大大增加了(+34,r=0.80,p < 0.01;+41,r=0.62,p=0.62,p=0.05美元)。这些隐含的观念,连同已确定的主导性隐喻的隐喻,强烈预测了公众认识 – – 10、强烈地预测信任和系统化地认识,我们通过AI-xl 。


Article 162

Title@2025-06-17 (2): GenerationPrograms: Fine-grained Attribution with Executable Programs

Title: GenerationPrograms: Fine-grained Attribution with Executable Programs GenerationProgramme: Feinkörnige Zuordnung mit ausführbaren Programmen 代代方案:与可执行方案精细分配 2506.14580v1

Authors (5): David Wan, Eran Hirsch, Elias Stengel-Eskin, Ido Dagan, Mohit Bansal

Recent large language models (LLMs) achieve impressive performance in source-conditioned text generation but often fail to correctly provide fine-grained attributions for their outputs, undermining verifiability and trust. Moreover, existing attribution methods do not explain how and why models leverage the provided source documents to generate their final responses, limiting interpretability. To overcome these challenges, we introduce a modular generation framework, GenerationPrograms, inspired by recent advancements in executable “code agent” architectures. Unlike conventional generation methods that simultaneously generate outputs and attributions or rely on post-hoc attribution, GenerationPrograms decomposes the process into two distinct stages: first, creating an executable program plan composed of modular text operations (such as paraphrasing, compression, and fusion) explicitly tailored to the query, and second, executing these operations following the program’s specified instructions to produce the final response. Empirical evaluations demonstrate that GenerationPrograms significantly improves attribution quality at both the document level and sentence level across two long-form question-answering tasks and a multi-document summarization task. We further demonstrate that GenerationPrograms can effectively function as a post-hoc attribution method, outperforming traditional techniques in recovering accurate attributions. In addition, the interpretable programs generated by GenerationPrograms enable localized refinement through modular-level improvements that further enhance overall attribution quality.

最近大型语言模型(LLMS)在源条件文本生成中取得了令人印象深刻的成绩,但往往未能正确提供其产出的精细属性,从而破坏可核查性和信任。此外,现有的归属方法没有解释模式如何和为什么利用所提供的源文件来产生最终回应,限制解释性。为了克服这些挑战,我们引入了一个模块生成框架,即由可执行的“代码代理”架构最近的进步所启发的“DesearPrograms ”。与同时产生产出和属性或依赖选择后归属的传统生成方法不同,DearPrograms将进程分化为两个不同阶段:第一,建立一个由模块文本操作(例如对参数进行拼写、压缩和拼凑)组成的可执行方案计划,明确针对查询,限制解释性。第二,根据程序指定的指示执行这些模块生成的最终响应性架构,我们引入了模块化方案,大大改进了文件级别和句级的属性质量质量质量质量,两个长期问答任务和多文件合成任务。我们进一步证明,《Dargrams Programs》能够通过可实现的准确性化的归属,从而通过可改进后级化的方法,通过可改进后级化的方法,有效地将成本化的属性转换为可改进。


Article 163

Title@2025-06-17 (2): PredictaBoard: Benchmarking LLM Score Predictability

Title: PredictaBoard: Benchmarking LLM Score Predictability PredictaBoard: Benchmarking der LLM-Score-Vorhersagbarkeit 预测波:测标 LLM 评分可预测性 2502.14445v2

Authors (7): Lorenzo Pacchiardi, Konstantinos Voudouris, Ben Slater, Fernando Martínez-Plumed, José Hernández-Orallo, Lexin Zhou, Wout Schellaert

Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable “safe zone” is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different tolerance errors. As such, PredictaBoard stimulates research into developing better assessors and making LLMs more predictable, not only with a higher average performance. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems where errors are not only minimised but also anticipated and effectively mitigated. Code for our benchmark can be found at https://github.com/Kinds-of-Intelligence-CFI/PredictaBoard

尽管拥有令人印象深刻的技能,但大型语言模型(LLMS)往往难以预测地失败,这表明即使是基本的常识推理任务也未能取得一致的成功,这种不可预测性对确保其安全部署构成重大挑战,因为确定和在可靠的“安全区”内运作对于减少风险至关重要。为此,我们提出一个新型的协作基准框架,即SniveaBoard,这是一个新的协作基准框架,旨在评价计分预测员(称为评估员)在具体任务实例(即快速)上预测LLMM错误的能力,以预测现有数据集中的具体任务案例(即快速)的LLMM错误。NutinaBoard通过考虑不同容忍误差的拒绝率来评估LMS和评估员的对配对和评估员。因此,DiannaBoard推动研究如何开发更好的评估员,使LLMS更可预测,而不仅仅是提高平均性能。我们利用基线评估员和最先进的LMs. Protaaboard强调在业绩的同时评价可预测性至关重要,为更安全的AI系统铺路,因为错误不仅被最小化,而且还预期和有效减轻。我们的基准守则可以在http://Githtrigard/BARIgard/KINC/Kinfard-FINS-fard-fard-fard-comformformus。


Article 164

Title@2025-06-17 (2): Evolution of ESG-focused DLT Research: An NLP Analysis of the Literature

Title: Evolution of ESG-focused DLT Research: An NLP Analysis of the Literature Entwicklung der ESG-orientierten DLT-Forschung: Eine NLP-Analyse der Literatur 以环境、社会和科学为重点的DLT研究的演变:对文学的分析 2308.12420v4

Authors (9): Walter Hernandez Cruz, Kamil Tylinski, Alastair Moore, Niall Roche, Nikhil Vadgama, Horst Treiblmaier, Jiangbo Shangguan, Paolo Tasca, Jiahua Xu

Distributed Ledger Technology (DLT) faces increasing environmental scrutiny, particularly concerning the energy consumption of the Proof of Work (PoW) consensus mechanism and broader Environmental, Social, and Governance (ESG) issues. However, existing systematic literature reviews of DLT rely on limited analyses of citations, abstracts, and keywords, failing to fully capture the field’s complexity and ESG concerns. We address these challenges by analyzing the full text of 24,539 publications using Natural Language Processing (NLP) with our manually labeled Named Entity Recognition (NER) dataset of 39,427 entities for DLT. This methodology identified 505 key publications at the DLT/ESG intersection, enabling comprehensive domain analysis. Our combined NLP and temporal graph analysis reveals critical trends in DLT evolution and ESG impacts, including cryptography and peer-to-peer networks research’s foundational influence, Bitcoin’s persistent impact on research and environmental concerns (a “Lindy effect”), Ethereum’s catalytic role on Proof of Stake (PoS) and smart contract adoption, and the industry’s progressive shift toward energy-efficient consensus mechanisms. Our contributions include the first DLT-specific NER dataset addressing the scarcity of high-quality labeled NLP data in blockchain research, a methodology integrating NLP and temporal graph analysis for large-scale interdisciplinary literature reviews, and the first NLP-driven literature review focusing on DLT’s ESG aspects.

24,539份出版物的全文,未能充分反映实地的复杂性和对口网络研究的基本影响,比特科因对研究和环境问题的持久影响(a“液态效应”),EceinP在检验收货方面的催化作用(POS)和智能合同的采用,以及工业部门在NCER-LV高质量数据分析方面的首次转变,包括将NCER-LF的大规模核心数据重点分析。


Article 165

Title@2025-06-17 (2): TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization

Title: TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization TDSPO: Nutzen Sie Token-Level-Reward-Leitfaden zur Verbesserung der Direktpräferenzoptimierung TGDPO:提高直接优惠优化利用物价奖励指导 2506.14574v1

Authors (6): Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia

Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models. However, it is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sequence-level bandit problem. To address this challenge, this work decomposes the sequence-level PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance, from which closed-form optimal token-level policy and the corresponding token-level reward can be derived. Using the obtained reward and Bradley-Terry model, this work establishes a framework of computable loss functions with token-level reward guidance for DPO, and proposes a practical reward guidance based on the induced DPO reward. This formulation enables different tokens to exhibit varying degrees of deviation from reference policy based on their respective rewards. Experiment results demonstrate that our method achieves substantial performance improvements over DPO, with win rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard. Code is available at https://github.com/dvlab-research/TGDPO.

从人类反馈中汲取的加强学习的近期进展表明,利用微微的象征性奖励模式,可以大大提高最优化政策(PPO)在调整大型语言模式方面的绩效;然而,利用这种象征性奖励作为直接普惠最佳化指导(DPO)是具有挑战性的,因为DPO是按顺序排列的匪帮问题;为了应对这一挑战,这项工作将排序级别的PPPO分解成一系列象征性一级最佳政策优化问题,然后用象征性级别奖励指导来界定象征性级别PPPO问题,从中可以得出封闭式最佳象征性级别政策和相应的象征性级别奖励。利用所获得的奖励和布拉德利-Terripim化(DPO)模式,这项工作确立了一个计算损失功能的框架,并附有对DPO奖励的象征性级别奖励指导,并提出了以诱导的DPO奖励为基础的实际奖励指导。这一提法使不同象征根据各自的奖赏表现出不同程度的偏离参考政策。实验结果表明,我们的方法在DPO(PO)方面取得了重大的业绩改进,在ASV/ARBA中取得了4.3-BESAL-7.5的成绩。


Article 166

Title@2025-06-17 (2): AlphaDecay:Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

Title: AlphaDecay:Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs AlphaDecay: Modulenweises Gewichtsdecay für schweres Balancing in LLMs AlphaDecay: LLM 中重帆平衡的中度偏重衰减 2506.14562v1

Authors (7): Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin

Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify “heavy-tailedness.” Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines.

体重衰减是培训大型语言模型的标准正规化技术(LLMs ) 。 虽然通常为每一层指定统一的衰减率,但这种方法忽略了LLM的结构性多样性和不同模块的光谱特性。 在本文中,我们引入了阿尔法-Decay,这是一个简单而有效的方法,可适应性地为LLM的每个模块分配不同的重量衰减力。我们的方法以重力自闭自闭理论(HT-SR)为指导,该理论分析了用于量化“重尾发”的重量相关矩阵的经验性光谱密度(ESD ) 。 显示较显著重成型的 ESD 模块,反映了较强的特性学习,被赋予了较弱的衰减,而有较轻尾发光谱的模块则被赋予了较强的衰减力。 我们的方法利用了定制的重量衰减任务来平衡光谱特性中模块性差异,从而导致性能的改善。 与从60M到1B的各种模型大小相比,广泛的培训前任务表明阿尔法-Decay 实现更好的不易和一般统一衰变和一般衰变的基线。


Article 167

Title@2025-06-17 (2): ClusterChat: Multi-Feature Search for Corpus Exploration

Title: ClusterChat: Multi-Feature Search for Corpus Exploration ClusterChat: Multi-Feature Suche nach Corpus Exploration COFCHat: 多功能探索Corpus 勘探 2412.14533v2

Authors (3): Ashish Chouhan, Saifeldin Mandour, Michael Gertz

Exploring large-scale text corpora presents a significant challenge in biomedical, finance, and legal domains, where vast amounts of documents are continuously published. Traditional search methods, such as keyword-based search, often retrieve documents in isolation, limiting the user’s ability to easily inspect corpus-wide trends and relationships. We present ClusterChat (The demo video and source code are available at: https://github.com/achouhan93/ClusterChat), an open-source system for corpus exploration that integrates cluster-based organization of documents using textual embeddings with lexical and semantic search, timeline-driven exploration, and corpus and document-level question answering (QA) as multi-feature search capabilities. We validate the system with two case studies on a four million abstract PubMed dataset, demonstrating that ClusterChat enhances corpus exploration by delivering context-aware insights while maintaining scalability and responsiveness on large-scale document collections.

在生物医学、金融和法律领域不断出版大量文件,探索大规模文本公司是一个重大挑战。传统的搜索方法,如关键词搜索,往往在孤立的情况下检索文件,限制了用户方便地检查整个系统的趋势和关系的能力。我们介绍CroupChat(演示视频和源代码见https://github.com/achouhan93/ClusterChat),这是一个用于人身探索的开放源码系统,它将基于文字的文件组织与词汇和语义搜索、时限驱动的探索以及文体和文件级问题解答相结合,作为多功能搜索能力。我们验证这个系统,对400万个抽象的PubMed数据集进行两项案例研究,表明CroundChat通过提供有背景认识的洞察力和对大规模文件收藏的反应能力保持可扩展性和响应性,加强了对材料的探索。


Article 168

Title@2025-06-17 (2): M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models

Title: M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction mit großen Sprachmodellen M2BAAMLLM:多式遥感-动力毫米 2506.14532v1

Authors (6): Can Zheng, Jiguang He, Chung G. Kang, Guofa Cai, Zitong Yu, Merouane Debbah

This paper introduces a novel neural network framework called M2BeamLLM for beam prediction in millimeter-wave (mmWave) massive multi-input multi-output (mMIMO) communication systems. M2BeamLLM integrates multi-modal sensor data, including images, radar, LiDAR, and GPS, leveraging the powerful reasoning capabilities of large language models (LLMs) such as GPT-2 for beam prediction. By combining sensing data encoding, multimodal alignment and fusion, and supervised fine-tuning (SFT), M2BeamLLM achieves significantly higher beam prediction accuracy and robustness, demonstrably outperforming traditional deep learning (DL) models in both standard and few-shot scenarios. Furthermore, its prediction performance consistently improves with increased diversity in sensing modalities. Our study provides an efficient and intelligent beam prediction solution for vehicle-to-infrastructure (V2I) mmWave communication systems.

本文介绍了一个新的神经网络框架,称为M2BeamLLM,用于在毫米波(毫米Wave)大规模多投入多产出通信系统中进行波束预测。M2BeamLLM结合了多式传感器数据,包括图像、雷达、激光雷达、激光雷达和全球定位系统,利用GPT-2等大型语言模型的强大推理能力进行波束预测。M2BeALLM将遥感数据编码、多式对齐和聚合以及监督下的微调(SFT)结合起来,实现了显著更高的波束预测精确度和稳健性,在标准情景和少见的情景中都明显超过传统的深层学习模型。此外,其预测性能随着遥感方式的日益多样化而不断提高。我们的研究为车辆到基础设施(V2I)毫米Wave通信系统提供了一个高效和智能的波束预测解决方案。


Article 169

Title@2025-06-17 (2): Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective

Title: Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective Inhärente und entstehende Haftungsfragen in LLM-basierten agentischen Systemen: eine Principal-Agent-Perspektive 以LLLM为基础的代理系统中的固有和新出现的赔偿责任问题:主要代理人的视角 2504.03255v2

Authors (2): Garry A. Gabison, R. Patrick Xian

Agentic systems powered by large language models (LLMs) are becoming progressively more complex and capable. Their increasing agency and expanding deployment settings attract growing attention to effective governance policies, monitoring, and control protocols. Based on the emerging landscape of the agentic market, we analyze potential liability issues arising from the delegated use of LLM agents and their extended systems through a principal-agent perspective. Our analysis complements existing risk-based studies on artificial agency and covers the spectrum of important aspects of the principal-agent relationship and their potential consequences at deployment. Furthermore, we motivate method developments for technical governance along the directions of interpretability and behavior evaluations, reward and conflict management, and the mitigation of misalignment and misconduct through principled engineering of detection and fail-safe mechanisms. By illustrating the outstanding issues in AI liability for LLM-based agentic systems, we aim to inform the system design, auditing, and tracing to enhance transparency and liability attribution.

由大型语言模型(LLMS)驱动的代理系统正在逐渐变得日益复杂和有能力,其日益增强的机构性和扩大的部署环境吸引了对有效治理政策、监测和控制规程的日益重视。根据代理市场的新格局,我们从主要代理的角度分析委托使用LLM代理物及其扩展系统所产生的潜在责任问题。我们的分析补充了现有基于风险的人工代理物研究,并涵盖主要代理人关系及其在部署时的潜在后果的方方面面。此外,我们推动技术治理方法的发展,沿着可解释性和行为评价、奖励和冲突管理的方向发展,并通过探测和故障保险机制的原则工程,减轻不匹配和不当行为。我们通过说明基于LM代理物系统的AI赔偿责任中的未决问题,目的是为系统设计、审计和追踪提供信息,以加强透明度和责任归属。


Article 170

Title@2025-06-17 (2): LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops

Title: LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops LingoLoop Attack: MLLMs über sprachlichen Kontext und Staatseinfall in endlose Loops LingoLooo攻击:通过语言背景和国家诱入无尽环圈来跟踪MLLMs 2506.14493v1

Authors (8): Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo, Dingkang Yang, Zhaoyu Chen, Wenqiang Zhang

Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts, limiting their efficacy. To address this, we propose LingoLoop, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences. First, we find that the POS tag of a token strongly affects the likelihood of generating an EOS token. Based on this insight, we propose a POS-Aware Delay Mechanism to postpone EOS token generation by adjusting attention weights guided by POS information. Second, we identify that constraining output diversity to induce repetitive loops is effective for sustained generation. We introduce a Generative Path Pruning Mechanism that limits the magnitude of hidden states, encouraging the model to produce persistent loops. Extensive experiments demonstrate LingoLoop can increase generated tokens by up to 30 times and energy consumption by a comparable factor on models like Qwen2.5-VL-3B, consistently driving MLLMs towards their maximum generation limits. These findings expose significant MLLMs’ vulnerabilities, posing challenges for their reliable deployment. The code will be released publicly following the paper’s acceptance.

多个多式大语言模型(MLLMS)已经显示出巨大的希望,但在推断过程中需要大量的计算资源。攻击者可以通过过度产出来利用这一点,从而导致资源耗竭和服务退化。先前的能源紧张性袭击的目的是通过将输出符号的分布从EOS标志上广泛转移而增加生产时间,但是他们忽略了象征性的POS部分(POS)对EOS和判决级结构模式对产出计数的影响,限制了它们的效力。为了解决这个问题,我们提议LingoLoop(Lingoop),这是一次旨在引导MLLMMS产生过度的verbose和重复序列的攻击。首先,我们发现一个标志的POS标记对生成EOS标志的可能性有很大影响。基于这一洞察,我们提议一个POS-Award延迟机制,通过调整POS信息引导的注意力重量来推迟EOS符号生成。我们发现限制产出多样性以诱导产生重复循环的效果。我们建议General Pruning机制可以限制隐藏国家的规模,鼓励一个标志性模型对持续循环生成MLLLLLM的模型,通过连续的模型进行模拟的模拟,通过不断的实验来增加其模拟生成。


Article 171

Title@2025-06-17 (2): BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English

Title: BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English BESSTIE: Ein Benchmark für die Sentiment- und Sarkasmusklassifikation für englische Sorten BESSTIE:英语品种的森化和讽刺性分类基准 2412.04726v3

Authors (4): Dipankar Srirag, Aditya Joshi, Jordan Painter, Diptesh Kanojia

Despite large language models (LLMs) being known to exhibit bias against non-standard language varieties, there are no known labelled datasets for sentiment analysis of English. To address this gap, we introduce BESSTIE, a benchmark for sentiment and sarcasm classification for three varieties of English: Australian (en-AU), Indian (en-IN), and British (en-UK). We collect datasets for these language varieties using two methods: location-based for Google Places reviews, and topic-based filtering for Reddit comments. To assess whether the dataset accurately represents these varieties, we conduct two validation steps: (a) manual annotation of language varieties and (b) automatic language variety prediction. Native speakers of the language varieties manually annotate the datasets with sentiment and sarcasm labels. We perform an additional annotation exercise to validate the reliance of the annotated labels. Subsequently, we fine-tune nine LLMs (representing a range of encoder/decoder and mono/multilingual models) on these datasets, and evaluate their performance on the two tasks. Our results show that the models consistently perform better on inner-circle varieties (i.e., en-AU and en-UK), in comparison with en-IN, particularly for sarcasm classification. We also report challenges in cross-variety generalisation, highlighting the need for language variety-specific datasets such as ours. BESSTIE promises to be a useful evaluative benchmark for future research in equitable LLMs, specifically in terms of language varieties. The BESSTIE dataset is publicly available at: https://huggingface.co/ datasets/unswnlporg/BESSTIE.

尽管已知有大量语言模型(LLMS)显示对非标准语言品种的偏见,但是没有已知的有标签的英文情绪分析数据集。为解决这一差距,我们引入了BESSIE,这是三种英语类型:澳大利亚语(e-AU)、印度语(en-IN)和英国语(en-UK)的情绪和讽刺分类基准。我们用两种方法收集这些语言品种的数据集:谷歌地点审查基于地点的数据集,以及基于主题的Redddit评论过滤。为了评估数据集是否准确地代表了这些品种,我们采取了两个验证步骤:(a) 语言品种的人工注释和(b) 语言种类的自动预测。语言品种的本地语言演讲者用感知和讽刺标签对数据集进行人工说明。我们用两种方法收集这些语言品种的数据集的数据集:谷歌PLMS(代表一系列有用的编码/decoder和单项/多语种模型),并评估它们在两项任务上的绩效:(a)语言品种的手工说明和语言分类中,我们的成果显示,在将来的模型中,BILSI/SALGSGSA数据分类中,这种内部数据在一般数据分类中可以进行更好的比较。


Article 172

Title@2025-06-17 (2): LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM’s Textual Training Data

Title: LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM’s Textual Training Data LexiMark: Robuste Wassermarkierung über Lexical Substitutions zur Erweiterung der Mitgliedschaftsbestätigung der Texttrainingsdaten eines LLM LexiMark:通过用法律替代办法进行强有力的水标记,以加强对LLM的文字培训数据进行成员核查 2506.14474v1

Authors (5): Eyal German, Sagiv Antebi, Edan Habler, Asaf Shabtai, Yuval Elovici

Large language models (LLMs) can be trained or fine-tuned on data obtained without the owner’s consent. Verifying whether a specific LLM was trained on particular data instances or an entire dataset is extremely challenging. Dataset watermarking addresses this by embedding identifiable modifications in training data to detect unauthorized use. However, existing methods often lack stealth, making them relatively easy to detect and remove. In light of these limitations, we propose LexiMark, a novel watermarking technique designed for text and documents, which embeds synonym substitutions for carefully selected high-entropy words. Our method aims to enhance an LLM’s memorization capabilities on the watermarked text without altering the semantic integrity of the text. As a result, the watermark is difficult to detect, blending seamlessly into the text with no visible markers, and is resistant to removal due to its subtle, contextually appropriate substitutions that evade automated and manual detection. We evaluated our method using baseline datasets from recent studies and seven open-source models: LLaMA-1 7B, LLaMA-3 8B, Mistral 7B, Pythia 6.9B, as well as three smaller variants from the Pythia family (160M, 410M, and 1B). Our evaluation spans multiple training settings, including continued pretraining and fine-tuning scenarios. The results demonstrate significant improvements in AUROC scores compared to existing methods, underscoring our method’s effectiveness in reliably verifying whether unauthorized watermarked data was used in LLM training.

大型语言模型(LLMS) 可以在未经物主同意的情况下对获得的数据进行培训或精细调整。 验证特定LLM是否在特定数据实例或整个数据集方面受过培训是极具挑战性的。 数据集的水标记通过在培训数据中嵌入可识别的修改来解决这个问题, 然而, 现有方法往往缺乏隐形, 使得它们比较容易检测和删除。 鉴于这些局限性, 我们提议LexiMark, 这是一种为文本和文件设计的新型水标记技术, 它将对精心选择的高渗透性词嵌入同义词替代。 我们的方法的目的是在水标记文本上加强LLM的记忆能力, 但不改变文本的语义完整性。 因此, 水标记很难检测, 顺利地融入到文本中, 没有可见标记, 并且由于其微妙的、 适合背景的替代方法, 逃避自动化和人工检测。 我们用基准数据集和七种公开源模型评估了我们的方法: LalaMA-1 B, LLAMA-3, LLAMA-3, LIMA-3 B, 和 Mistraloral 7M 4M


Article 173

Title@2025-06-17 (2): Rectifying Belief Space via Unlearning to Harness LLMs’ Reasoning

Title: Rectifying Belief Space via Unlearning to Harness LLMs’ Reasoning Rektifizieren von Glaube Raum über Unlearning zu Harness LLMs’ Reasoning 通过 “ 重新学习 “ 来改变信仰空间 2502.20620v2

Authors (3): Ayana Niwa, Masahiro Kaneko, Kentaro Inui

Large language models (LLMs) can exhibit advanced reasoning yet still generate incorrect answers. We hypothesize that such errors frequently stem from spurious beliefs, propositions the model internally considers true but are incorrect. To address this, we propose a method to rectify the belief space by suppressing these spurious beliefs while simultaneously enhancing true ones, thereby enabling more reliable inferences. Our approach first identifies the beliefs that lead to incorrect or correct answers by prompting the model to generate textual explanations, using our Forward-Backward Beam Search (FBBS). We then apply unlearning to suppress the identified spurious beliefs and enhance the true ones, effectively rectifying the model’s belief space. Empirical results on multiple QA datasets and LLMs show that our method corrects previously misanswered questions without harming overall model performance. Furthermore, our approach yields improved generalization on unseen data, suggesting that rectifying a model’s belief space is a promising direction for mitigating errors and enhancing overall reliability.

大型语言模型(LLMs)可以展示先进的推理,但仍然会产生不正确的答案。 我们假设这些错误经常来自虚假的信念, 模型内部认为真实, 但并不正确。 为了解决这个问题, 我们提出一种方法来纠正信仰空间, 一方面压制这些虚假的信念, 同时同时加强真实的信念, 从而促成更可靠的推论。 我们的方法首先确定导致错误或正确解答的信念, 方法是利用模型来生成文字解析, 使用我们的前方- 反背 Beam 搜索( FBBS ) 。 然后我们运用未学习的方法来压制已查明的虚假信念, 加强真实的信念, 有效地纠正模型的信仰空间。 多个QA数据集和 LLMs 的经验性结果显示, 我们的方法纠正了先前错误的问题, 同时又不损害整个模型的性能。 此外, 我们的方法提高了对未知数据的概括性, 表明纠正模型的信仰空间是减轻错误和提高总体可靠性的一个有希望的方向。


Article 174

Title@2025-06-17 (2): How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison

Title: How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison Wie weit können LLMs sich aus Erfahrung verbessern? Test-Time-Learning-Fähigkeiten in LLMs mit menschlichem Vergleich messen 如何从经验中提高LLMs的改进程度? 衡量LLMs与人类比较的试验-时间学习能力 2506.14448v1

Authors (4): Jiayin Wang, Zhiquang Guo, Weizhi Ma, Min Zhang

As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to rapidly learn from experience. To this end, we advocate for the evaluation of Test-time Learning, the capacity to improve performance in experience-based, reasoning-intensive tasks during test time. In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning. We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings, and contains four forms of experience representation. To provide a comparative baseline, we recruit eight human participants to complete the same task. Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans. These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks.

由于大语言模型的评价设计可能决定我们走向人工一般情报、全面和前瞻性评估的轨迹,因此,至关重要的是,现有基准主要评估静态知识,而情报也要求有能力迅速从经验中学习,为此,我们主张对测试时学习进行评估,提高测试时基于经验、推理密集型任务在测试期间的绩效;在这项工作中,我们建议用语义游戏作为评价测试时学习的有效测试台,因为它们抵制饱和和和对战略推理的内在需求;我们引入一个客观评估框架,比较在有限和累积经验环境中的模型绩效,并包含四种形式的经验代表;为了提供一个比较基线,我们征聘8名人类参与者来完成同样的任务;结果显示,LLMS展示了可衡量的测试-时间学习能力;然而,在累积的经验和进展中,它们的改进不如在人类观察到的要慢。这些结论强调了LMS作为通用学习机器的潜力,同时也揭示了模型与人之间的重大知识差距,而不管LMMS在固定基准上表现如何良好。


Article 175

Title@2025-06-17 (2): LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

Title: LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs LongLlaDA: Entsperren langer Kontextkapazitäten in Diffusions-LLMs LongLLALDA:释放扩散长程距离能力 2506.14429v1

Authors (6): Xiaoran Liu, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu

Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs, unlike auto-regressive LLMs, they maintain remarkably \textbf{\textit{stable perplexity}} during direct context extrapolation. Furthermore, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover diffusion LLMs exhibit a distinct \textbf{\textit{local perception}} phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first context extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs.

大型语言融合模型或扩散LLMS已成为国家LP研究的一个重要重点,其大量努力旨在了解其可缩放性和下游任务性,然而,其长方能力仍未探索,缺乏系统分析或背景扩展方法。在这项工作中,我们提出首次系统调查,比较了传播LMS和传统自动递增LMS的长期表现。我们首先找出了扩散LMS的独特特征,不同于自动递增LMS,它们保持了显著的直截面(Textbf)/可移动性 。此外,在直接外推期间,它们保持了显著的直观(textbf)/可移动性 。此外,在NED-A-Haystack任务期间,自动递增模式完全失败,而其背景则缺乏。 我们发现,LMSMS展示了一种截然不同的长方(textf)/LLMS=LMS)的长方表现。 我们通过扶植性定位(ROPE)定位(ROPLMS-LMS)的缩放级基础理论来解释这两种现象。 我们提议在LLLDADA-LMS-LMS-LMS-LMS的原始研究中,在长期扩散方面首先提供不伸展延延延长方法律。


Article 176

Title@2025-06-17 (2): Uncovering Overfitting in Large Language Model Editing

Title: Uncovering Overfitting in Large Language Model Editing Uncovering Overfitting in der großsprachigen Modellbearbeitung 在大语言模式编辑中进行覆盖覆盖覆盖的覆盖超版编辑 2410.07819v2

Authors (6): Mengqi Zhang, Xiaotian Ye, Qiang Liu, Pengjie Ren, Shu Wu, Zhumin Chen

Knowledge editing has been proposed as an effective method for updating and correcting the internal knowledge of Large Language Models (LLMs). However, existing editing methods often struggle with complex tasks, such as multi-hop reasoning. In this paper, we identify and investigate the phenomenon of Editing Overfit, where edited models assign disproportionately high probabilities to the edit target, hindering the generalization of new knowledge in complex scenarios. We attribute this issue to the current editing paradigm, which places excessive emphasis on the direct correspondence between the input prompt and the edit target for each edit sample. To further explore this issue, we introduce a new benchmark, EVOKE (EValuation of Editing Overfit in Knowledge Editing), along with fine-grained evaluation metrics. Through comprehensive experiments and analysis, we demonstrate that Editing Overfit is prevalent in current editing methods and that common overfitting mitigation strategies are ineffective in knowledge editing. To overcome this, inspired by LLMs’ knowledge recall mechanisms, we propose a new plug-and-play strategy called Learn the Inference (LTI), which introduce a Multi-stage Inference Constraint module to guide the edited models in recalling new knowledge similarly to how unedited LLMs leverage knowledge through in-context learning. Extensive experimental results across a wide range of tasks validate the effectiveness of LTI in mitigating Editing Overfit.

提出了知识编辑,作为更新和纠正大语言模型内部知识的有效方法。然而,现有的编辑方法往往与复杂的任务如多点推理等纠缠不休。在本文件中,我们确定并调查编辑“重叠”现象,编辑模型给编辑目标分配了过高的概率,妨碍了在复杂情景下普及新知识。我们将此问题归咎于目前的编辑模式,该模式过分强调输入即时和每个编辑样本编辑目标之间的直接对应。为了进一步探讨这一问题,我们引入了新的基准EVOKE(编辑过度应用于知识编辑)以及细微的衡量评价标准。我们通过全面试验和分析,表明编辑“重叠”现象在目前的编辑方法中普遍存在,通常的“过度应用”战略在知识编辑方面是无效的。为了克服这个问题,在LLOMs知识回顾机制的启发下,我们提出了一个新的插插和播放战略,称为“了解推断”(LTIE),引入了多阶段“紧缩”模块,以指导编辑模型回顾新的“软件更新”在LLIMS校准中如何广泛了解减轻风险。


Article 177

Title@2025-06-17 (2): ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge

Title: ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge Implic Retrieval Challenge: Benchmarking der Implicity Fact Retrieval Challenge ImpliRet:设定隐含事实检索挑战的基准 2506.14407v1

Authors (4): Zeinab Sadat Taghavi, Ali Modarressi, Yunpu Ma, Hinrich Schütze

Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques – like prompting or multi-hop retrieval – that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: The queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving “two days ago”), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 15.07%. We also test whether long-context models can overcome this limitation. But even with a short context of only ten documents, including the positive document, GPT-4.1 scores only 35.06%, showing that document-side reasoning remains a challenge. Our codes are available at github.com/ZeinabTaghavi/IMPLIRET.Contribution.

检索系统是许多非LP输油管的核心,但常常依赖表面提示,如关键词重叠和词汇语义相似性。为了评估这些浅度信号之外的检索,最近的基准引入了推理重度查询;然而,它们主要将负担转移到可帮助解决复杂性的查询端处理技术上,如催化或多跳检索等。相比之下,我们提出了ImpliRet,这是一个将推理挑战转向文件端处理的基准:查询很简单,但相关性取决于文件通过时间(例如解决“两天前”)、算术和世界知识关系等隐含的事实。我们评估了一系列稀少和密集的检索者,所有这些检索者在此环境下都挣扎着:最好的 nDCG@10 仅为15.07 % 。我们还测试长文本模型能否克服这一限制。但即使短短的10份文件,包括正面文件GPT-4.1的分数仅为35.06%,显示文件端推理仍是一个挑战。我们的代码在Github.com/ZeinabTaghan/IMPTRE。


Article 178

Title@2025-06-17 (2): CAPO: Cost-Aware Prompt Optimization

Title: CAPO: Cost-Aware Prompt Optimization CAPO: Kostenbewusste Optimierung CAPO: 成本软件快速优化 2504.16005v4

Authors (4): Tom Zehle, Moritz Schlager, Timo Heiß, Matthias Feurer

Large language models (LLMs) have revolutionized natural language processing by solving a wide range of tasks simply guided by a prompt. Yet their performance is highly sensitive to prompt formulation. While automatic prompt optimization addresses this challenge by finding optimal prompts, current methods require a substantial number of LLM calls and input tokens, making prompt optimization expensive. We introduce CAPO (Cost-Aware Prompt Optimization), an algorithm that enhances prompt optimization efficiency by integrating AutoML techniques. CAPO is an evolutionary approach with LLMs as operators, incorporating racing to save evaluations and multi-objective optimization to balance performance with prompt length. It jointly optimizes instructions and few-shot examples while leveraging task descriptions for improved robustness. Our extensive experiments across diverse datasets and LLMs demonstrate that CAPO outperforms state-of-the-art discrete prompt optimization methods in 11/15 cases with improvements up to 21%p in accuracy. Our algorithm achieves better performances already with smaller budgets, saves evaluations through racing, and decreases average prompt length via a length penalty, making it both cost-efficient and cost-aware. Even without few-shot examples, CAPO outperforms its competitors and generally remains robust to initial prompts. CAPO represents an important step toward making prompt optimization more powerful and accessible by improving cost-efficiency.

大型语言模型(LLMS)通过在快速的指引下解决一系列任务,使自然语言处理发生革命性的变化。但是,它们的性能对迅速的配制非常敏感。虽然自动快速优化通过找到最佳的提示来应对这一挑战,但目前的方法需要大量的LLM电话和投入符号,使迅速优化变得昂贵。我们引入了CAPO(Cost-Aware Fair Aprintimimation),这是一个算法,通过整合AUMLL技术来提高快速优化效率。CAPO是一种渐进式方法,由LLMS作为操作员,包括赛跑以节省评价和多目标优化,以迅速平衡业绩。它联合优化了指示和少许例子,同时利用任务描述来提高稳健性。我们在各种数据集和LMMS之间的广泛实验表明,CAPO在11/15个案例中,优于最先进的快速优化方法,提高了21%的精确度。我们的算法在预算规模较小的情况下取得了更好的业绩,通过赛跑来节省评价,并通过长的罚款来降低平均的短时间长度,使其具有成本效益。即使没有几个例子,CO 也能够迅速改进其重要的初步的升级。


Article 179

Title@2025-06-17 (2): Ensemble Watermarks for Large Language Models

Title: Ensemble Watermarks for Large Language Models Ensemble Wasserzeichen für große Sprachmodelle 用于大语言模型的集合水标记 2411.19563v2

Authors (2): Georg Niess, Roman Kern

As large language models (LLMs) reach human-like fluency, reliably distinguishing AI-generated text from human authorship becomes increasingly difficult. While watermarks already exist for LLMs, they often lack flexibility and struggle with attacks such as paraphrasing. To address these issues, we propose a multi-feature method for generating watermarks that combines multiple distinct watermark features into an ensemble watermark. Concretely, we combine acrostica and sensorimotor norms with the established red-green watermark to achieve a 98% detection rate. After a paraphrasing attack, the performance remains high with 95% detection rate. In comparison, the red-green feature alone as a baseline achieves a detection rate of 49% after paraphrasing. The evaluation of all feature combinations reveals that the ensemble of all three consistently has the highest detection rate across several LLMs and watermark strength settings. Due to the flexibility of combining features in the ensemble, various requirements and trade-offs can be addressed. Additionally, the same detection function can be used without adaptations for all ensemble configurations. This method is particularly of interest to facilitate accountability and prevent societal harm.

由于大型语言模型(LLMS)达到人等流水,可靠地区分AI产生的文字与人文作者之间日益困难。虽然LLMS已经存在水印,但它们往往缺乏灵活性,难以应对诸如抛光等攻击。为了解决这些问题,我们建议了一种多功能方法,用于生成水印,将多种不同的水标记特征结合成一个混合水标记。具体地说,我们把芳香和感官规范与既定的红绿色水标记结合起来,以达到98%的检测率。在突触攻击后,性能仍然高达95%的检测率。相比之下,光是作为基准的红色绿色特征,就达到在抛光后达到49%的检测率。所有特性组合的评估表明,所有三种特性的组合始终具有数个LMS和水标记强度环境中的最高检测率。由于将共同特征结合起来的灵活性,可以满足各种要求和权衡。此外,同样的检测功能可以使用,而无需调整,即可防止所有同值配置。这一方法特别有助于社会责任。


Article 180

Title@2025-06-17 (2): Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information

Title: Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information Automatisierter Aufbau eines Wissensdiagramms für Kernfusionsenergie zur effektiven Gewinnung und Gewinnung von Informationen 自动构建核聚变能源知识图,以有效取用和检索信息 2504.07738v2

Authors (6): Andrea Loreti, Kesi Chen, Ruby George, Robert Firth, Adriano Agnello, Shinnosuke Tanaka

In this document, we discuss a multi-step approach to automated construction of a knowledge graph, for structuring and representing domain-specific knowledge from large document corpora. We apply our method to build the first knowledge graph of nuclear fusion energy, a highly specialized field characterized by vast scope and heterogeneity. This is an ideal benchmark to test the key features of our pipeline, including automatic named entity recognition and entity resolution. We show how pre-trained large language models can be used to address these challenges and we evaluate their performance against Zipf’s law, which characterizes human-generated natural language. Additionally, we develop a knowledge-graph retrieval-augmented generation system that combines large language models with a multi-prompt approach. This system provides contextually relevant answers to natural-language queries, including complex multi-hop questions that require reasoning across interconnected entities.

在这份文件中,我们讨论了如何以多步方法自动构建知识图,用于构建和代表来自大型文件公司的具体领域知识;我们运用我们的方法,建立核聚变能源第一个知识图,这是一个高度专业化的领域,其特点是范围广、种类繁多;这是测试我们管道关键特点的理想基准,包括自动命名实体识别和实体决议;我们展示了如何利用预先培训的大型语言模型应对这些挑战,我们对照作为人类生成自然语言特征的齐普夫法律评估其绩效;此外,我们开发了一个知识-绘图检索和提款生成系统,将大型语言模型与多机会方法相结合;该系统为自然语言询问提供了符合背景的答案,包括需要跨关联实体推理的复杂多机会问题。


Article 181

Title@2025-06-17 (2): SeqPE: Transformer with Sequential Position Encoding

Title: SeqPE: Transformer with Sequential Position Encoding SeqPE: Transformer mit sequentieller Positionskodierung SeqPE:具有序列位置编码的变形器 2506.13277v2

Authors (8): Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe

Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation capabilities beyond pre-trained sequence lengths. Expert-designed methods such as ALiBi and RoPE, mitigate this limitation but demand extensive modifications for adapting to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE’s embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy–particularly under context length extrapolation–but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.

由于变异器中的自我注意层因设计而变异,必须明确纳入位置编码,以便空间理解。然而,传统可学习位置嵌入(PE)中使用固定规模的外推表,限制在培训前序列长度之外的外推能力。 专家设计的方法,如Alibi和ROPE, 缓解这一限制,但要求对适应新模式进行广泛修改,强调适应性和可缩放性方面的基本挑战。在这项工作中,我们提出了SeqPE,一个统一和完全可学习的位置编码框架,代表每个美元米位指数作为象征性序列,并使用一个轻度顺序位置编码器,以端到端的方式学习其嵌入能力。为了规范SeqepE的嵌入空间,我们引入了两个互补的目标:一个对比目标,将距离与预先界定的位置距离功能-远程功能挂钩,以及一个知识蒸馏损失,将分配位置嵌入到分布式教师演示中,进一步加强外推化性性工作。在不甚强的语言建模中进行实验,长-Context-real-real-real-deal-liflial-liflial-ligyal-ligyal-lix-lix lishal-lishal-lixxxxxeflation


Article 182

Title@2025-06-17 (2): ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection

Title: ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection ELLIS Alicante bei CQs-Gen 2025: Die kritischen Denkfragen gewinnen gemeinsame Aufgabe: LLM-basierte Fragegenerierung und Auswahl 2025年CQs-Gen CQs-Gen ELLIS Alicante:赢得关键思考问题的共同任务:基于LLM问题的产生和选择 2506.14371v1

Authors (5): Lucile Favero, Daniel Frases, Juan Antonio Pérez-Ortiz, Tanja Käser, Nuria Oliver

The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions. This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation. We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones. Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts.

广泛采用基于大语言模式的聊天界面使人们对促进肤浅学习和破坏批判性思维技能的发展表示关切。这项工作不是纯粹为了获取事实信息而依靠LLMs,而是探索其潜力,通过在辩论发言中提出挑战无根据或模糊的主张的关键问题,促进更深入的推理。本研究是第12次争论采矿讲习班共同任务的一部分,该讲习班与ACLU 2025合用同一地点,侧重于自动的关键问题生成。我们提出了一个两步框架,涉及两个小规模开放源语言模型:一个引发多个候选人问题的提问器和一个选择最相关问题的法官。我们的系统在共同任务竞争中排名第一,显示了拟议的LLMM方法在鼓励批判性地参与论证性案文方面的潜力。


Article 183

Title@2025-06-17 (2): Digital Gatekeepers: Google’s Role in Curating Hashtags and Subreddits

Title: Digital Gatekeepers: Google’s Role in Curating Hashtags and Subreddits Digitale Gatekeeper: Googles Rolle bei der Heilung von Hashtags und Subreddits 数字看门人:谷歌在消除大麻塔和Subreddid方面的作用 2506.14370v1

Authors (4): Amrit Poudel, Yifan Ding, Jurgen Pfeffer, Tim Weninger

Search engines play a crucial role as digital gatekeepers, shaping the visibility of Web and social media content through algorithmic curation. This study investigates how search engines like Google selectively promotes or suppresses certain hashtags and subreddits, impacting the information users encounter. By comparing search engine results with nonsampled data from Reddit and Twitter/X, we reveal systematic biases in content visibility. Google’s algorithms tend to suppress subreddits and hashtags related to sexually explicit material, conspiracy theories, advertisements, and cryptocurrencies, while promoting content associated with higher engagement. These findings suggest that Google’s gatekeeping practices influence public discourse by curating the social media narratives available to users.

搜索引擎作为数字守门员发挥着关键作用,通过算法调节,影响网络和社交媒体内容的可见度。本研究调查了谷歌等搜索引擎如何有选择地促进或压制某些标签和子编辑,影响信息用户的遭遇。通过将搜索引擎的结果与Reddit和Twitter/X的非抽样数据进行比较,我们揭示了内容可见度方面的系统性偏差。谷歌的算法往往压制与性显性材料、阴谋理论、广告和加密信息相关的子编辑和标签,同时推广与更高参与程度相关的内容。这些研究结果表明谷歌的守门做法通过整理用户可用的社交媒体叙事,影响公共言论。


Article 184

Title@2025-06-17 (2): Exploring news intent and its application: A theory-driven approach

Title: Exploring news intent and its application: A theory-driven approach Erforschen der Nachrichten-Intention und ihrer Anwendung: Ein theoriegesteuerter Ansatz 探索新闻意图及其应用:理论驱动方法 2312.16490v2

Authors (6): Zhengjia Wang, Danding Wang, Qiang Sheng, Juan Cao, Siyuan Ma, Haonan Cheng

Understanding the intent behind information is crucial. However, news as a medium of public discourse still lacks a structured investigation of perceived news intent and its application. To advance this field, this paper reviews interdisciplinary studies on intentional action and introduces a conceptual deconstruction-based news intent understanding framework (NINT). This framework identifies the components of intent, facilitating a structured representation of news intent and its applications. Building upon NINT, we contribute a new intent perception dataset. Moreover, we investigate the potential of intent assistance on news-related tasks, such as significant improvement (+2.2% macF1) in the task of fake news detection. We hope that our findings will provide valuable insights into action-based intent cognition and computational social science.

了解信息背后的意图至关重要。然而,新闻作为公共讨论的媒介,仍然缺乏对所觉察到的新闻意图及其应用的结构性调查。为了推进这个领域,本文回顾了关于有意行动的跨学科研究,并引入了一个概念性非建筑性新闻意图理解框架(NINT)。这个框架确定了意图的组成部分,便利了对新闻意图及其应用的结构性表述。基于NINT,我们提供了一个新的意图感知数据集。此外,我们调查了在与新闻有关的任务方面提供意向性援助的潜力,例如,在假新闻探测任务方面作出重大改进(+2.2% macF1)。我们希望,我们的调查结果将为基于行动的意图认知和计算的社会科学提供宝贵的洞见。


Article 185

Title@2025-06-17 (2): A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis

Title: A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis Eine Vision für Geo-Temporale Deep Research Systeme: Auf dem Weg zu einer umfassenden, transparenten und reproduzierbaren Geo-Temporalen Informationssynthese 地球-临时深层研究系统展望:走向全面、透明和可复制的地球-临时信息综述 2506.14345v1

Authors (3): Bruno Martins, Piotr Szymański, Piotr Gramacki

The emergence of Large Language Models (LLMs) has transformed information access, with current LLMs also powering deep research systems that can generate comprehensive report-style answers, through planned iterative search, retrieval, and reasoning. Still, current deep research systems lack the geo-temporal capabilities that are essential for answering context-rich questions involving geographic and/or temporal constraints, frequently occurring in domains like public health, environmental science, or socio-economic analysis. This paper reports our vision towards next generation systems, identifying important technical, infrastructural, and evaluative challenges in integrating geo-temporal reasoning into deep research pipelines. We argue for augmenting retrieval and synthesis processes with the ability to handle geo-temporal constraints, supported by open and reproducible infrastructures and rigorous evaluation protocols. Our vision outlines a path towards more advanced and geo-temporally aware deep research systems, of potential impact to the future of AI-driven information access.

大语言模型(LLMS)的出现改变了信息的存取,而目前的LLMS也增强了深层研究系统的力量,通过计划的迭代搜索、检索和推理,能够产生全面的报告式答案;然而,目前的深层研究系统缺乏对回答涉及地理和(或)时间限制、经常发生在公共卫生、环境科学或社会经济分析等领域、涉及地域和(或)时间限制的丰富问题至关重要的地理时空能力;本文件报告了我们对下一代系统的展望,指出了在将地理时空推理纳入深层研究管道方面的重要技术、基础设施和评价挑战;我们主张加强检索和合成进程,使其有能力处理地理时空限制,并得到开放和再生基础设施和严格的评估协议的支持;我们的设想勾画了一条通往更先进和具有地理时空意识的深层研究系统的道路,对AI驱动的信息存取的未来具有潜在影响。


Article 186

Title@2025-06-17 (2): Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics

Title: Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics Bewertung sollte nicht ignorieren Variation: Auf die Auswirkungen der Referenzsatz Wahl auf Zusammenfassung Metrics 评价不应忽视变化变化:关于参考选择对汇总计量的影响 2506.14335v1

Authors (6): Silvia Casola, Yang Janet Liu, Siyao Peng, Oliver Kraus, Albert Gatt, Barbara Plank

Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of using different reference sets on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.

人类语文的制作表现出了显著的丰富性和差异,反映了不同的交流风格和意图。然而,这种差异在总结评价中常常被忽视。虽然已知有多种参考摘要以改善与人类判断的关联性,但并未系统地调查在参考基准指标中使用不同参考集的影响。这项工作审查了广泛使用的参考基准指标在选择参考集方面的敏感性,分析了三种不同的多种参考综合汇总数据集:SummEval、GUMSum和DUC2004。我们证明许多流行指标表现出严重的不稳定性。这种不稳定性特别涉及基于ngrog的计量,如ROUGE, 模型的排名因参考集而异,破坏模型比较的可靠性。我们还收集了LLMM关于基因多样性数据产出的人类判断,并审查了它们与指标的关联性,以补充除新闻网摘要外的现有结果,发现薄弱与无关联性。我们建议将参考设定的变异性纳入概括评价,以提高与人类判断的一致性,特别是在评价LMS时。


Article 187

Title@2025-06-17 (2): ROSAQ: Rotation-based Saliency-Aware Weight Quantization for Efficiently Compressing Large Language Models

Title: ROSAQ: Rotation-based Saliency-Aware Weight Quantization for Efficiently Compressing Large Language Models ROSAQ: Rotationsbasierte Saliency-Aware-Gewichtsquantisierung für effiziente Komprimierung großer Sprachmodelle ROSAQ: 高效压缩大语言模型的基于旋转的 耐用软件强度 2506.13472v2

Authors (5): Junho Yoon, Geom Lee, Donghyeon Jeon, Inho Kang, Seung-Hoon Na

Quantization has been widely studied as an effective technique for reducing the memory requirement of large language models (LLMs), potentially improving the latency time as well. Utilizing the characteristic of rotational invariance of transformer, we propose the rotation-based saliency-aware weight quantization (ROSAQ), which identifies salient channels in the projection feature space, not in the original feature space, where the projected “principal” dimensions are naturally considered as “salient” features. The proposed ROSAQ consists of 1) PCA-based projection, which first performs principal component analysis (PCA) on a calibration set and transforms via the PCA projection, 2) Salient channel dentification, which selects dimensions corresponding to the K-largest eigenvalues as salient channels, and 3) Saliency-aware quantization with mixed-precision, which uses FP16 for salient dimensions and INT3/4 for other dimensions. Experiment results show that ROSAQ shows improvements over the baseline saliency-aware quantization on the original feature space and other existing quantization methods. With kernel fusion, ROSAQ presents about 2.3x speed up over FP16 implementation in generating 256 tokens with a batch size of 64.

量化作为减少大型语言模型(LLMs)内存要求的有效技术得到了广泛的研究,有可能改进延时时间。 利用变压器旋转性差的特征,我们提议采用基于旋转的显著功能功率重量四分法(ROSAQ),确定投影空间的突出渠道,而不是最初的特征空间,其中预测的“主要”维度自然被视为“高度”特征。拟议的ROSAQ由1个基于CPA的投影组成,该投影首先对校准装置进行主要组成部分分析,并通过CPA投影进行变换;2 配色频道识别,其中选择与K大振动元值相对的尺寸作为突出通道;3 配色度-显微分法,其中使用FP16作为突出尺寸,INT3/4作为其他尺寸。 实验结果显示,ROSAQQQ显示在原始地貌空间上的基线显著度的四分位化(PCA)和其他现有正压式硬度加速度方法方面有所改进。


Article 188

Title@2025-06-17 (2): Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

Title: Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models Anreize für eine fortgeschrittene Instruktions-Folge von großen Sprachmodellen 为采用大语言模式的高级指示提供激励理由 2506.01413v4

Authors (9): Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun

Existing large language models (LLMs) face challenges of following complex instructions, especially when multiple constraints are present and organized in paralleling, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints for identifying their relationship across hierarchies of types and dimensions. To this end, we propose a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we stem from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to a 8B LLM. Codes and data will be available later (under review). Keywords: reinforcement learning with verifiable rewards (RLVR), instruction following, complex instructions

现有大型语言模型(LLMS)面临遵循复杂指令的挑战,特别是当存在多种制约因素,并在平行、链条和分支结构中组织多种制约时。一个直观的解决方案,即思维链(CoT),有望普遍提高LLMs的能力。然而,我们发现香草COT由于其肤浅的推理模式,简单地将指令抛光,对业绩产生了负面影响。它未能剥去在确定不同类型和层面的等级关系方面存在的制约的构成。为此,我们提出一个系统性方法,通过激励测试-时间计算比例的推理,促进LMS处理复杂的指令。首先,我们源于现有分类法下复杂指令的分解,并提出可再生数据采集的方法。第二,我们利用强化学习(RLLL)和以可核查规则为中心的奖赏信号来培养具体教学的推理。我们通过样本化对比,处理复杂指令下的浅浅、非基本逻辑的推理问题。我们还利用专家行为克隆方法处理复杂的指令处理复杂的指令,以测试-时间缩计算比例缩缩缩。我们还利用了现有关键指令的排序,以便稳定地将可比较的排序、快速排序评估。


Article 189

Title@2025-06-17 (2): Do Construction Distributions Shape Formal Language Learning In German BabyLMs?

Title: Do Construction Distributions Shape Formal Language Learning In German BabyLMs? Gestalten Konstruktionsverteilungen formales Sprachenlernen in deutschen BabyLMs? 是否用德国婴儿LMS模式进行建筑分配, 2503.11593v2

Authors (3): Bastian Bunzeck, Daniel Duran, Sina Zarrieß

We analyze the influence of utterance-level construction distributions in German child-directed/child-available speech on the resulting word-level, syntactic and semantic competence (and their underlying learning trajectories) in small LMs, which we train on a novel collection of developmentally plausible language data for German. We find that trajectories are surprisingly robust for markedly different distributions of constructions in the training data, which have little effect on final accuracies and almost no effect on global learning trajectories. While syntax learning benefits from more complex utterances, word-level learning culminates in better scores with more fragmentary utterances. We argue that LMs trained on developmentally plausible data can contribute to debates on how conducive different kinds of linguistic stimuli are to language learning.

我们分析德文中以儿童为主的/儿童可获取的语句的发音水平的建筑分布对小LM的文字水平、综合和语义能力(及其基本的学习轨迹)的影响,我们为这些小LM进行了新颖的德文发展合理语言数据汇编培训。我们发现,由于培训数据中建筑分布的差别很大,轨迹是惊人的,对最终理解没有多大影响,对全球学习轨迹几乎没有任何影响。 虽然从更复杂的语句中学习的语法收益,但文字水平的学习最终会以更零碎的语句获得更好的分数。 我们说,关于发展合理语言数据的培训的LMS可以促进关于不同语言刺激语言学习的有益性的辩论。


Article 190

Title@2025-06-17 (2): Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent

Title: Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent Erwartungsbestätigung Preference Optimization für Multi-Turn Conversational Recommendation Agent 多轮对话建议代理商的预期确认优先优化 2506.14302v1

Authors (9): Xueyang Feng, Jingsen Zhang, Jiakai Tang, Wei Li, Guohao Cai, Xu Chen, Quanyu Dai, Yue Zhu, Zhenhua Dong

Recent advancements in Large Language Models (LLMs) have significantly propelled the development of Conversational Recommendation Agents (CRAs). However, these agents often generate short-sighted responses that fail to sustain user guidance and meet expectations. Although preference optimization has proven effective in aligning LLMs with user expectations, it remains costly and performs poorly in multi-turn dialogue. To address this challenge, we introduce a novel multi-turn preference optimization (MTPO) paradigm ECPO, which leverages Expectation Confirmation Theory to explicitly model the evolution of user satisfaction throughout multi-turn dialogues, uncovering the underlying causes of dissatisfaction. These causes can be utilized to support targeted optimization of unsatisfactory responses, thereby achieving turn-level preference optimization. ECPO ingeniously eliminates the significant sampling overhead of existing MTPO methods while ensuring the optimization process drives meaningful improvements. To support ECPO, we introduce an LLM-based user simulator, AILO, to simulate user feedback and perform expectation confirmation during conversational recommendations. Experimental results show that ECPO significantly enhances CRA’s interaction capabilities, delivering notable improvements in both efficiency and effectiveness over existing MTPO methods.

大语言模型(LLMS)最近的进展大大推动了多语言建议代理(CRAs)的发展,然而,这些代理商往往产生短视的反应,无法维持用户指导和达到预期的要求。虽然优化优惠在使LLMs与用户期望一致方面证明是有效的,但在多方向对话中仍然费用高昂,表现不佳。为了应对这一挑战,我们引入了一个新的多方向优惠优化模式ECPO, 利用期望确认理论,在多方向对话中明确模拟用户满意度的演变,揭示不满的根本原因。这些原因可用来支持优化不满意的响应,从而实现转轨优惠优化。ECPO在确保优化进程推动有意义的改进的同时,巧妙地消除了现有MTPO方法的重大抽样间接成本。为了支持ECPO,我们引入了以LM为主的用户模拟器,在谈话建议中进行预期确认。实验结果显示,ECPO大大加强了C的互动能力,使现有MPO方法的效率和效力显著提高。


Article 191

Title@2025-06-17 (2): AI-Facilitated Analysis of Abstracts and Conclusions: Flagging Unsubstantiated Claims and Ambiguous Pronouns

Title: AI-Facilitated Analysis of Abstracts and Conclusions: Flagging Unsubstantiated Claims and Ambiguous Pronouns AI-Fazilitated Analysis of Abstracts and Conclusions: Flagging Nonsubstantiated Claims and Ambigued Pronomens AI-便利对摘要和结论的分析:无凭无据的旗舰索赔和不明利贷 2506.13172v2

Authors (1): Evgeny Markhasin

We present and evaluate a suite of proof-of-concept (PoC), structured workflow prompts designed to elicit human-like hierarchical reasoning while guiding Large Language Models (LLMs) in the high-level semantic and linguistic analysis of scholarly manuscripts. The prompts target two non-trivial analytical tasks within academic summaries (abstracts and conclusions): identifying unsubstantiated claims (informational integrity) and flagging semantically confusing ambiguous pronoun references (linguistic clarity). We conducted a systematic, multi-run evaluation on two frontier models (Gemini Pro 2.5 Pro and ChatGPT Plus o3) under varied context conditions. Our results for the informational integrity task reveal a significant divergence in model performance: while both models successfully identified an unsubstantiated head of a noun phrase (95% success), ChatGPT consistently failed (0% success) to identify an unsubstantiated adjectival modifier that Gemini correctly flagged (95% success), raising a question regarding the potential influence of the target’s syntactic role. For the linguistic analysis task, both models performed well (80-90% success) with full manuscript context. Surprisingly, in a summary-only setting, Gemini’s performance was substantially degraded, while ChatGPT achieved a perfect (100%) success rate. Our findings suggest that while structured prompting is a viable methodology for complex textual analysis, prompt performance may be highly dependent on the interplay between the model, task type, and context, highlighting the need for rigorous, model-specific testing.

我们提出并评价一套概念证明(PoC),结构化工作流程提示,目的是在对学术手稿进行高级语义和语言分析时,在指导大语言模型(LLMs)进行高级语义和语言分析时,引出人注意的等级推理。提示的目标是学术摘要(摘要和结论)中的两项非三重分析任务:查明未经证实的主张(信息完整性)和标出模糊的语义(语言清晰度),在两种前沿模型(Gemini Pro 2.5 Pro Pro 和ChattGPT Pl+ o3)上,我们在不同的背景下对两种前沿模型(Gemini Pro 2.5 Pro Pro Pro 和ChattGPT Pl+ o3)进行了系统化、多功能化的评价。我们的信息完整性任务显示在模型性能表现方面存在着重大差异:虽然两个模型都成功地确定了一个未经证实的名词句(95%的成功),但恰特尔特始终没有成功(0 % ) 来找出Gemini正确标注(95% 成功) ,对目标模式的潜在模型作用可能产生影响。对于语言分析任务类型,对于语言分析任务来说,两种模式的精准(80-90成功) 在全面手稿背景背景中,一个结构测试中,一个非常精确的周期性测试中,一个快速的方法是精确的周期性测试,在深度的周期性的方法是精确性方法。


Article 192

Title@2025-06-17 (2): ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities

Title: ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities KonsistenzChecker: Baumbasierte Bewertung von LLM-Verallgemeinerungsfähigkeiten 一致性检查:基于树木的LLM通用能力评价 2506.12376v2

Authors (3): Zhaochen Hong, Haofei Yu, Jiaxuan You

Evaluating consistency in large language models (LLMs) is crucial for ensuring reliability, particularly in complex, multi-step interactions between humans and LLMs. Traditional self-consistency methods often miss subtle semantic changes in natural language and functional shifts in code or equations, which can accumulate over multiple transformations. To address this, we propose ConsistencyChecker, a tree-based evaluation framework designed to measure consistency through sequences of reversible transformations, including machine translation tasks and AI-assisted programming tasks. In our framework, nodes represent distinct text states, while edges correspond to pairs of inverse operations. Dynamic and LLM-generated benchmarks ensure a fair assessment of the model’s generalization ability and eliminate benchmark leakage. Consistency is quantified based on similarity across different depths of the transformation tree. Experiments on eight models from various families and sizes show that ConsistencyChecker can distinguish the performance of different models. Notably, our consistency scores-computed entirely without using WMT paired data-correlate strongly (r > 0.7) with WMT 2024 auto-ranking, demonstrating the validity of our benchmark-free approach. Our implementation is available at: https://github.com/ulab-uiuc/consistencychecker.

为解决这一问题,我们提议建立一个基于树木的评价框架,以通过可逆转换序列衡量一致性,包括机器翻译任务和AI协助的编程任务。在我们的框架内,节点代表不同的文本状态,而边缘则与反向操作对应。动态和LLM产生的基准确保公平评估该模型的通用能力并消除基准渗漏。一致性是根据变形树不同深度的相似性量化的。对不同家庭和大小的八种模型的实验表明,Conistance Cryer可以区分不同模型的性能。值得注意的是,我们的一致性分数完全不使用WMT配对的数据-correlate(r > 0.7)与WMT 2024自动排序和LLM生成的基准确保公平评估该模型的通用能力并消除基准渗漏。


Article 193

Title@2025-06-17 (2): From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents

Title: From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents Von, was zu reagieren, wann zu reagieren: Timely Response Generation für Open-Domain-Dialog-Agenten 从什么到回应何时响应:为开放域对话代理机构及时作出反应 2506.14285v1

Authors (6): Seongbo Jang, Minjin Jeon, Jaehoon Lee, Seonghyeon Lee, Dongha Lee, Hwanjo Yu

While research on dialogue response generation has primarily focused on generating coherent responses conditioning on textual context, the critical question of when to respond grounded on the temporal context remains underexplored. To bridge this gap, we propose a novel task called timely dialogue response generation and introduce the TimelyChat benchmark, which evaluates the capabilities of language models to predict appropriate time intervals and generate time-conditioned responses. Additionally, we construct a large-scale training dataset by leveraging unlabeled event knowledge from a temporal commonsense knowledge graph and employing a large language model (LLM) to synthesize 55K event-driven dialogues. We then train Timer, a dialogue agent designed to proactively predict time intervals and generate timely responses that align with those intervals. Experimental results show that Timer outperforms prompting-based LLMs and other fine-tuned baselines in both turn-level and dialogue-level evaluations. We publicly release our data, model, and code.

虽然关于对话响应的研究主要侧重于根据文字背景作出一致反应,但何时根据时间背景作出反应的关键问题仍未得到充分探讨。为弥合这一差距,我们提议一项新任务,即及时开展对话响应生成,并引入“及时茶”基准,该基准评估语言模型预测适当时间间隔和产生有时间限制的反应的能力。此外,我们通过利用时间常识知识图中的无标签事件知识,并利用大型语言模型(LLM)合成55K事件驱动的对话,构建了大规模培训数据集。我们随后培训了时间仪,这是一个对话代理器,旨在积极主动地预测时间间隔,并产生与这些间隔相一致的及时反应。实验结果显示,在转轨和对话层面的评价中,时间仪比基于提示的LMMs和其他微调基线要好。我们公开发布我们的数据、模型和代码。


Article 194

Title@2025-06-17 (2): FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

Title: FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation FlagEvalMM: Flexibler Rahmen für eine umfassende multimodale Modellbewertung FlaignEvalMMM:综合多式联运模式评价灵活框架 2506.09081v2

Authors (8): Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Jin-Ge Yao, Bowen Qin, Richeng Xuan, Xi Yang

We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible athttps://github.com/flageval-baai/FlagEvalMM.

我们提出FlagEvalMM(FlagEvalMM),这是一个开放源码评价框架,旨在全面评估多种视觉语言理解和生成任务,如视觉问答、文字到图像/视频生成和图像-文本检索等多种多式联运模式,我们通过独立评价服务将模型推论与评价脱钩,从而能够灵活分配资源和无缝地整合新任务和新模式;此外,FlagEvalMM(FlagEvalM)利用先进的推论加速工具(如VLLM、SGLang)和不同步的数据负荷,以大大提高评价效率;广泛的实验显示FlagEvalM(FlagEvalM)对模型的长处和局限性提供了准确而有效的洞察力,使其成为推进多式联运研究的宝贵工具;该框架可在以下网站公开查阅:https://github.com/flageval-baai/FlagEvalMM。


Article 195

Title@2025-06-17 (2): Improving LoRA with Variational Learning

Title: Improving LoRA with Variational Learning Verbesserung der LoRA durch variables Lernen 改进LORA, 提高不同学习水平 2506.14280v1

Authors (6): Bai Cong, Nico Daheim, Yuesong Shen, Rio Yokota, Mohammad Emtiyaz Khan, Thomas Möllenhoff

Bayesian methods have recently been used to improve LoRA finetuning and, although they improve calibration, their effect on other metrics (such as accuracy) is marginal and can sometimes even be detrimental. Moreover, Bayesian methods also increase computational overheads and require additional tricks for them to work well. Here, we fix these issues by using a recently proposed variational algorithm called IVON. We show that IVON is easy to implement and has similar costs to AdamW, and yet it can also drastically improve many metrics by using a simple posterior pruning technique. We present extensive results on billion-scale LLMs (Llama and Qwen series) going way beyond the scale of existing applications of IVON. For example, we finetune a Llama-3.2-3B model on a set of commonsense reasoning tasks and improve accuracy over AdamW by 1.3% and reduce ECE by 5.4%, outperforming AdamW and other recent Bayesian methods like Laplace-LoRA and BLoB. Overall, our results show that variational learning with IVON can effectively improve LoRA finetuning.

贝叶斯方法最近被用来改进LORA的微调,虽然它们改进了校准,但它们对其他指标(例如精确度)的影响是微不足道的,有时甚至可能有害。此外,贝伊斯方法还增加了计算间接费用,需要额外的技巧才能很好地发挥作用。在这里,我们通过最近提出的一个称为IVON的变式算法来解决这些问题。我们表明,IVON很容易实施,而且成本与AdamW相似,但是,它也可以通过使用简单的后传操纵技术来大幅度改进许多计量。我们展示了10亿规模的LLMS(Llama和Qwen系列)的广泛结果,远远超出了IVON的现有应用规模。例如,我们用一套共同思维推理任务对Llama-3.2-3B模型进行了微调,提高了AdamW的精确度,提高了1.3%,并将欧洲经委会减少了5.4%,超过了AdamW和其他最近巴伊西亚方法,如Laplace-LORA和BLOB。总体而言,我们的结果表明,与IVON的变学结果可以有效地改进LORA。


Article 196

Title@2025-06-17 (2): Surprise Calibration for Better In-Context Learning

Title: Surprise Calibration for Better In-Context Learning Überraschende Kalibrierung für besseres In-Context-Lernen 为更好的内文学习校准惊喜校准 2506.12796v2

Authors (5): Zhihang Tan, Jingrui Hou, Ping Wang, Qibiao Hu, Peng Zhu

In-context learning (ICL) has emerged as a powerful paradigm for task adaptation in large language models (LLMs), where models infer underlying task structures from a few demonstrations. However, ICL remains susceptible to biases that arise from prior knowledge and contextual demonstrations, which can degrade the performance of LLMs. Existing bias calibration methods typically apply fixed class priors across all inputs, limiting their efficacy in dynamic ICL settings where the context for each query differs. To address these limitations, we adopt implicit sequential Bayesian inference as a framework for interpreting ICL, identify “surprise” as an informative signal for class prior shift, and introduce a novel method–Surprise Calibration (SC). SC leverages the notion of surprise to capture the temporal dynamics of class priors, providing a more adaptive and computationally efficient solution for in-context learning. We empirically demonstrate the superiority of SC over existing bias calibration techniques across a range of benchmark natural language processing tasks.

在大型语言模型(LLMs)中,模型从一些演示中推断出基本任务结构;然而,ICL仍然容易受到先前知识和背景演示所产生的偏见,这可能会降低LLMs的绩效。现有的偏见校准方法通常在所有投入中都采用固定级前科,限制其在动态的ICL环境中的功效,因为每个查询的背景不同。为了解决这些限制,我们采用隐含的连续的巴伊西亚顺序推论作为解释ICL的框架,确定“Suresization”是班级前轮班级的一个信息信号,并采用新的方法——“Surprise 校准”(SC)。SC利用突袭概念来捕捉班前班级的时间动态,为文字学习提供更适应性和计算效率更高的解决办法。我们从经验上证明SC优于一系列基准自然语言处理任务的现有偏差校准技术。


Article 197

Title@2025-06-17 (2): What do Large Language Models Say About Animals? Investigating Risks of Animal Harm in Generated Text

Title: What do Large Language Models Say About Animals? Investigating Risks of Animal Harm in Generated Text Was sagen große Sprachmodelle über Tiere? Untersuchung der Risiken von Tierschädlingen im Generierten Text 大语言模型对动物有什么看法? 调查产生文字中的动物危害风险 2503.04804v4

Authors (10): Arturs Kanepajs, Aditi Basu, Sankalpa Ghose, Constance Li, Akshat Mehta, Ronak Mehta, Samuel David Tucker-Davis, Eric Zhou, Bob Fischer, Jacy Reese Anthis

As machine learning systems become increasingly embedded in society, their impact on human and nonhuman life continues to escalate. Technical evaluations have addressed a variety of potential harms from large language models (LLMs) towards humans and the environment, but there is little empirical work regarding harms towards nonhuman animals. Following the growing recognition of animal protection in regulatory and ethical AI frameworks, we present AnimalHarmBench (AHB), a benchmark for risks of animal harm in LLM-generated text. Our benchmark dataset comprises 1,850 curated questions from Reddit post titles and 2,500 synthetic questions based on 50 animal categories (e.g., cats, reptiles) and 50 ethical scenarios with a 70-30 public-private split. Scenarios include open-ended questions about how to treat animals, practical scenarios with potential animal harm, and willingness-to-pay measures for the prevention of animal harm. Using the LLM-as-a-judge framework, responses are evaluated for their potential to increase or decrease harm, and evaluations are debiased for the tendency of judges to judge their own outputs more favorably. AHB reveals significant differences across frontier LLMs, animal categories, scenarios, and subreddits. We conclude with future directions for technical research and addressing the challenges of building evaluations on complex social and moral topics.

随着机器学习系统日益深入社会,其对人类和非人类生活的影响继续加剧。技术评价解决了大型语言模型(LLMs)对人类和环境造成的各种潜在伤害,但几乎没有关于对非人类动物的伤害的经验工作。在监管和道德的AI框架日益认识到动物保护之后,我们介绍了动物保护(AHB),这是LLM产生的文本中动物伤害风险的基准。我们的基准数据集包括1 850个来自Redit 后职称的问题和2 500个基于50个动物类别(例如猫、爬行动物)和50个道德情景(70-30个公私分拆)的合成问题。设想包括如何对待动物的开放问题、可能伤害动物的实际情景以及防止动物伤害的愿意付费措施。我们利用LM-as-a法官框架,评估其增加或减少伤害的可能性,评价对法官更倾向于评判自己的产出(例如猫、爬行动物)和50个道德情景,50个物种类别(70-30个公私分拆分),其中包括关于如何对待动物的开放性问题、动物的实际情景、防止动物伤害的实际情景以及防止动物伤害的愿意支付费用措施的措施。利用LMs-a-a-a-a法官评估其可能增加或减少伤害的可能性,评估其伤害的可能性。我们得出了法官更倾向于判断自己的产出的倾向。我们对前沿LLLMS-forvidu 和道德议题进行技术评估。通过技术评估。


Article 198

Title@2025-06-17 (2): Position: Editing Large Language Models Poses Serious Safety Risks

Title: Position: Editing Large Language Models Poses Serious Safety Risks Position: Bearbeiten von großen Sprachmodellen stellt ernste Sicherheitsrisiken dar 职位:编辑大语言模型 2502.02958v3

Authors (5): Paul Youssef, Zhixue Zhao, Daniel Braun, Jörg Schlötterer, Christin Seifert

Large Language Models (LLMs) contain large amounts of facts about the world. These facts can become outdated over time, which has led to the development of knowledge editing methods (KEs) that can change specific facts in LLMs with limited side effects. This position paper argues that editing LLMs poses serious safety risks that have been largely overlooked. First, we note the fact that KEs are widely available, computationally inexpensive, highly performant, and stealthy makes them an attractive tool for malicious actors. Second, we discuss malicious use cases of KEs, showing how KEs can be easily adapted for a variety of malicious purposes. Third, we highlight vulnerabilities in the AI ecosystem that allow unrestricted uploading and downloading of updated models without verification. Fourth, we argue that a lack of social and institutional awareness exacerbates this risk, and discuss the implications for different stakeholders. We call on the community to (i) research tamper-resistant models and countermeasures against malicious model editing, and (ii) actively engage in securing the AI ecosystem.

大型语言模型(LLMS)包含大量关于世界的事实。这些事实可能会随着时间的推移而过时,从而导致知识编辑方法的开发,从而可以改变LLMS中具有有限副作用的具体事实。本立场文件认为,编辑LLMS带来严重的安全风险,而这种风险在很大程度上被忽视。首先,我们注意到,KEs是广泛存在的、计算成本低廉的、高性能和隐形的,这使得它们成为恶意行为者的诱人工具。第二,我们讨论KEs恶意使用案例,表明KEs如何能够很容易地适应各种恶意目的。第三,我们强调AI生态系统的脆弱性,这种脆弱性使得无需核实就可以不受限制地上传和下载更新的模型。第四,我们争辩说,缺乏社会和体制意识加剧了这种风险,并讨论了对不同利益攸关方的影响。我们呼吁社区(i)研究防止恶意模式编辑的篡改模式和对策,以及(ii)积极参与保护AI生态系统。


Article 199

Title@2025-06-17 (2): Re-Initialization Token Learning for Tool-Augmented Large Language Models

Title: Re-Initialization Token Learning for Tool-Augmented Large Language Models Re-Initialisierung Token-Lernen für Tool-Augmented große Sprachmodelle 工具增强型大语言模型的重新启动 Tok 学习 2506.14248v1

Authors (5): Chenghao Li, Liu Liu, Baosheng Yu, Jiayan Qiu, Yibing Zhan

Large language models have demonstrated exceptional performance, yet struggle with complex tasks such as numerical reasoning, plan generation. Integrating external tools, such as calculators and databases, into large language models (LLMs) is crucial for enhancing problem-solving capabilities. Current methods assign a unique token to each tool, enabling LLMs to call tools through token prediction-similar to word generation. However, this approach fails to account for the relationship between tool and word tokens, limiting adaptability within pre-trained LLMs. To address this issue, we propose a novel token learning method that aligns tool tokens with the existing word embedding space from the perspective of initialization, thereby enhancing model performance. We begin by constructing prior token embeddings for each tool based on the tool’s name or description, which are used to initialize and regularize the learnable tool token embeddings. This ensures the learned embeddings are well-aligned with the word token space, improving tool call accuracy. We evaluate the method on tasks such as numerical reasoning, knowledge-based question answering, and embodied plan generation using GSM8K-XL, FuncQA, KAMEL, and VirtualHome datasets. The results demonstrate clear improvements over recent baselines, including CoT, REACT, ICL, and ToolkenGPT, indicating that our approach effectively augments LLMs with tools through relevant tokens across diverse domains.

大型语言模型表现出了非凡的性能,但与数字推理、计划生成等复杂任务挣扎。将计算器和数据库等外部工具纳入大型语言模型(LLMs)对于提高解决问题能力至关重要。当前方法为每个工具指定了独特的符号,使LLMs能够通过类似单词生成的象征性预测调用工具。然而,这一方法没有考虑到工具与单词符号之间的关系,限制了在经过培训的LLMs内部的适应性。为了解决这一问题,我们提议了一种新型象征性学习方法,从初始化的角度将工具符号与现有嵌入空间的单词相匹配,从而增强模型性能。我们首先为基于工具名称或描述的每个工具创建了先前的象征性嵌入,用于初始化和规范可学习工具符号嵌入。这确保了学习的嵌入与单词空间完全吻合,提高了工具的准确性。我们用GSM8K-XL、FinconQA、KAMELL, 和虚拟GMLMALD 等相关基准, 展示了我们最新的标准化工具。


Article 200

Title@2025-06-17 (2): Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Title: Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs Verstärktes Lernen mit überprüfbaren Belohnungen implizit fördert korrekte Vernunft in LLMs 利用可核实的奖励措施加强学习,在基础LLM中鼓励正确说明理由 2506.14245v1

Authors (12): Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, Mao Yang

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the $Pass@K$ metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the $Pass@K$ metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, $CoT$-$Pass@K$, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using $CoT$-$Pass@K$, we observe that RLVR can incentivize the generalization of correct reasoning for all values of $K$. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.

在这项工作中,我们通过查明问题的根源来解决这一矛盾:$Pass@K美元衡量标准本身是一种有缺陷的推理衡量标准,因为它能算得上来自不准确或不完整思维链(Cots)的正确最后答案。为了解决这个问题,我们引入了一个更精确的评价指标,即$CoT$-$Pass@K$,要求推理路径和最后答案都正确。我们提供了一个新的理论基础,使RLVR与传统的RL不同,是如何以独特的结构来激励逻辑完整性的。我们的经验结果证实了:使用$CoT$-Pass@K$,我们发现RLVRRR$的推理能力可以更准确地推理,我们更准确地推理其整个过程。


Article 201

Title@2025-06-17 (2): GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents

Title: GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents GuideBench: Benchmarking Domain-orientierte Leitlinie für LLM-Agenten folgen 指南:为LLM代理商制定基准确定以域为基准的准则 2505.11368v2

Authors (5): Lingxiao Diao, Xinyue Xu, Wanxuan Sun, Cheng Yang, Zhuosheng Zhang

Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real-world applications. Previous studies have made notable progress in benchmarking the instruction following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge. Recently, LLMs have been increasingly deployed as domain-oriented agents, which rely on domain-oriented guidelines that may conflict with their commonsense knowledge. These guidelines exhibit two key characteristics: they consist of a wide range of domain-oriented rules and are subject to frequent updates. Despite these challenges, the absence of comprehensive benchmarks for evaluating the domain-oriented guideline following capabilities of LLMs presents a significant obstacle to their effective assessment and further development. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate guideline following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines.

大型语言模型(LLMs)被广泛用作自主的代理商,能够在现实应用中遵循用户的指示和作出决定; 以往的研究在一般领域的LLMs能力(主要侧重于其固有的常识)对指示进行基准化方面取得了显著进展; 最近,LLMs作为面向域的代理商越来越多地被部署,这些代理商依赖可能与其常识相冲突的面向域的准则; 这些准则具有两个关键特征:它们包含广泛的面向域的规则,经常更新; 尽管存在这些挑战,但缺乏评价LLMs在一般领域的能力方面的面向域的准则的全面基准,对其有效评估和进一步发展构成了重大障碍; 在本文件中,我们介绍了旨在评价LLMs业绩的综合基准《Bench指南》; 指南Bench在三个关键方面评估LMs:(一) 遵守各种规则,(二) 严格更新规则,以及(三) 符合人类的偏好。


Article 202

Title@2025-06-17 (2): A Multi-Expert Structural-Semantic Hybrid Framework for Unveiling Historical Patterns in Temporal Knowledge Graphs

Title: A Multi-Expert Structural-Semantic Hybrid Framework for Unveiling Historical Patterns in Temporal Knowledge Graphs Ein multi-Experte strukturell-semantischer Hybridrahmen zur Enthüllung historischer Muster in zeitlichen Wissensgraphen ” 时间知识图中历史不变模式 “ 的多专家结构-地中海混合框架 2506.14235v1

Authors (12): Yimin Deng, Yuxia Wu, Yejing Wang, Guoshuai Zhao, Li Zhu, Qidong Liu, Derong Xu, Zichuan Fu, Xian Wu, Yefeng Zheng, Xiangyu Zhao, Xueming Qian

Temporal knowledge graph reasoning aims to predict future events with knowledge of existing facts and plays a key role in various downstream tasks. Previous methods focused on either graph structure learning or semantic reasoning, failing to integrate dual reasoning perspectives to handle different prediction scenarios. Moreover, they lack the capability to capture the inherent differences between historical and non-historical events, which limits their generalization across different temporal contexts. To this end, we propose a Multi-Expert Structural-Semantic Hybrid (MESH) framework that employs three kinds of expert modules to integrate both structural and semantic information, guiding the reasoning process for different events. Extensive experiments on three datasets demonstrate the effectiveness of our approach.

时间知识图表推理旨在预测未来事件,了解现有事实,并在各种下游任务中发挥关键作用。以前的方法侧重于图表结构学习或语义推理,没有将双重推理观点纳入处理不同预测情景;此外,它们缺乏能力,无法捕捉历史和非历史事件之间的内在差异,从而限制了它们在不同时间背景下的概括化。为此,我们提议了一个多专家结构-语义混合框架(MESH),利用三种专家模块整合结构和语义信息,指导不同事件的推理过程。关于三个数据集的广泛实验显示了我们的方法的有效性。


Article 203

Title@2025-06-17 (2): Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team

Title: Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team Xolver: Multi-Agent Reasoning mit ganzheitlichem Erfahrungslernen wie ein Olympia-Team Xolver:多机构理论与整体经验学习就像奥林匹克队一样 2506.14234v1

Authors (4): Md Tanzib Hosain, Salman Rahman, Md Kishor Morol, Md Rizwan Parvez

Despite impressive progress on complex reasoning, current large language models (LLMs) typically operate in isolation - treating each problem as an independent attempt, without accumulating or integrating experiential knowledge. In contrast, expert problem solvers - such as Olympiad or programming contest teams - leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. We introduce Xolver, a training-free multi-agent reasoning framework that equips a black-box LLM with a persistent, evolving memory of holistic experience. Xolver integrates diverse experience modalities, including external and self-retrieval, tool use, collaborative interactions, agent-driven evaluation, and iterative refinement. By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, Xolver avoids generating solutions from scratch - marking a transition from isolated inference toward experience-aware language agents. Built on both open-weight and proprietary models, Xolver consistently outperforms specialized reasoning agents. Even with lightweight backbones (e.g., QWQ-32B), it often surpasses advanced models including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high. With o3-mini-high, it achieves new best results on GSM8K (98.1%), AIME’24 (94.4%), AIME’25 (93.7%), Math-500 (99.8%), and LiveCodeBench-V5 (91.6%) - highlighting holistic experience learning as a key step toward generalist agents capable of expert-level reasoning. Code and data are available at https://kagnlp.github.io/xolver.github.io/.

尽管在复杂的推理方面取得了令人印象深刻的进展,但目前的大型语言模型(LLMs)通常都是孤立地运作的,在不积累或整合经验知识的情况下,将每个问题作为独立尝试处理。相比之下,专家问题解决者(如奥林匹亚或编程竞赛队)利用丰富的经验:吸收教练的指导者,从过去的问题中发展直觉,利用工具使用和图书馆功能的知识,根据同行的专门知识和经验调整战略,通过试验和错误不断完善其推理,从其他相关问题中学习。我们引入了Xolver,即一个培训不限的多试题推理框架,使黑盒LLM具有持续和不断演变的整体经验记忆。Xolver整合了多种经验模式,包括外部和自我检索、工具使用、协作互动、代理力评估以及迭接式改进。通过学习相关战略、代码碎片和推断时间的抽象推理模式,Xolgiverver 避免产生解决方案,从深度推论到高级语言代理人(从孤立的推论,从深度推论到经验中转换)。Brvert-983,在开放和专利模型中不断更新的模型,Xal-rial-rental-ral-rick QQQQQQQQQQQQQQQQQQ.xxxxxxx。


Article 204

Title@2025-06-17 (2): Effect of Selection Format on LLM Performance

Title: Effect of Selection Format on LLM Performance Auswirkungen des Auswahlformats auf die LLM-Leistung 选择格式对LLM性能的影响 2503.06926v2

Authors (3): Yuchen Han, Yucheng Wu, Jeffrey Willard

This paper investigates a critical aspect of large language model (LLM) performance: the optimal formatting of classification task options in prompts. Through an extensive experimental study, we compared two selection formats – bullet points and plain English – to determine their impact on model performance. Our findings suggest that presenting options via bullet points generally yields better results, although there are some exceptions. Furthermore, our research highlights the need for continued exploration of option formatting to drive further improvements in model performance.

本文件调查了大语言模式绩效的一个关键方面:即快速优化分类任务选项格式。通过广泛的实验研究,我们比较了两种选择格式 – – 点和普通英语 – – 以确定其对模型绩效的影响。我们的调查结果表明,通过点提出选项通常会产生更好的结果,尽管有一些例外。此外,我们的研究强调需要继续探索选项格式,以推动进一步改进模型绩效。


Article 205

Title@2025-06-17 (2): Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Title: Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Scaling Computer-Use Grounding über Benutzeroberfläche Zersetzung und Synthese 通过用户界面分解和合成进行计算机使用定位 2505.13227v2

Authors (15): Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.

图形用户界面( GUI) 定位, 绘制图形用户界面上具体行动自然语言指令的自然语言指令的能力, 仍然是计算机使用代理器开发中的一个关键瓶颈。 目前的基准过于简化了基础任务, 将基础任务简单化为简短的参考表达, 无法捕捉到需要软件常识、 版面理解和精巧操作能力的真实世界互动的复杂性。 为了解决这些限制, 我们引入了OSWorld- G, 全面基准, 包括564种不同任务类型、 包括文本匹配、 元素识别、 版面理解和精确操作的附加说明的样本。 此外, 我们综合并发布最大的计算机使用基底数据集绝地技术, 其中包括400万个实例, 通过多视角拆分任务。 我们在绝地铁上培训的多尺度模型展示了它的效力, 超越了ScreamSpot-v2、 ScreenSpoot-Pro 和我们的OSWorld- G。 此外, 我们证明, 与绝地基的改进了绝地基能直接增强复杂计算机任务的一般基础模型的代理能力, 从5%到27 %。 通过详细的对比研究, 我们确定了现有数据界面的界面界面, 用于不同的地面化。


Article 206

Title@2025-06-17 (2): Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models

Title: Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models Modality-Aware Neuron Pruning für das Lernen in multimodalen großen Sprachmodellen 多式联运大语言模型中不学习模式-Aware中度中枢 2502.15910v2

Authors (6): Zheyuan Liu, Guangyao Dou, Xiangchi Yuan, Chunhui Zhang, Zhaoxuan Tan, Meng Jiang

Generative models such as Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) trained on massive datasets can lead them to memorize and inadvertently reveal sensitive information, raising ethical and privacy concerns. While some prior works have explored this issue in the context of LLMs, it presents a unique challenge for MLLMs due to the entangled nature of knowledge across modalities, making comprehensive unlearning more difficult. To address this challenge, we propose Modality Aware Neuron Unlearning (MANU), a novel unlearning framework for MLLMs designed to selectively clip neurons based on their relative importance to the targeted forget data, curated for different modalities. Specifically, MANU consists of two stages: important neuron selection and selective pruning. The first stage identifies and collects the most influential neurons across modalities relative to the targeted forget knowledge, while the second stage is dedicated to pruning those selected neurons. MANU effectively isolates and removes the neurons that contribute most to the forget data within each modality, while preserving the integrity of retained knowledge. Our experiments conducted across various MLLM architectures illustrate that MANU can achieve a more balanced and comprehensive unlearning in each modality without largely affecting the overall model utility.

大型语言模型(LLMS)和多式大型语言模型(MLLM)等在大规模数据集方面受过培训的大型语言模型(MLLM)等生成模型可以导致它们记忆和无意中透露敏感信息,从而引起道德和隐私方面的关注。虽然以前的一些著作在LLM中探讨了这一问题,但由于各种模式的知识相互交织,使得综合的不学习更加困难,因此对MLLMS提出了独特的挑战。为了应对这一挑战,我们提议为MLLMS提供一个新的不学习框架,根据对特定遗忘数据的相对重要性,为选择性地剪辑神经元设计一个全新的不学习框架。具体地说,MANU由两个阶段组成:重要的神经选择和选择性剪辑。第一阶段确定并收集了与目标遗忘知识相关的最有影响力的神经元,而第二阶段则专门处理选定的神经元。MANU实际上孤立并消除了最有助于在每种模式中遗忘数据的神经元,同时保持所保留的知识的完整性。我们在不同模式下进行的实验,而没有影响MALM结构的不全面学习模式,说明MAU能否。


Article 207

Title@2025-06-17 (2): Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription

Title: Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription Fretting-Transformer: Encoder-Decoder-Modell für MIDI in Tabulatur-Transkription Fretting- Transtrads: MIDI 调制标签的编码器-解码器模型 2506.14223v1

Authors (4): Anna Hamberger, Sebastian Murgul, Jochen Schmidt, Michael Heizmann

Music transcription plays a pivotal role in Music Information Retrieval (MIR), particularly for stringed instruments like the guitar, where symbolic music notations such as MIDI lack crucial playability information. This contribution introduces the Fretting-Transformer, an encoderdecoder model that utilizes a T5 transformer architecture to automate the transcription of MIDI sequences into guitar tablature. By framing the task as a symbolic translation problem, the model addresses key challenges, including string-fret ambiguity and physical playability. The proposed system leverages diverse datasets, including DadaGP, GuitarToday, and Leduc, with novel data pre-processing and tokenization strategies. We have developed metrics for tablature accuracy and playability to quantitatively evaluate the performance. The experimental results demonstrate that the Fretting-Transformer surpasses baseline methods like A* and commercial applications like Guitar Pro. The integration of context-sensitive processing and tuning/capo conditioning further enhances the model’s performance, laying a robust foundation for future developments in automated guitar transcription.

音乐转录在音乐信息检索器(MIR)中发挥着关键作用,特别是对于吉他等弦乐器,像MIDI这样的象征性音乐记号缺乏关键的可播放性信息。这一贡献引入了Fretting-Transtext,这是一个编码变异器模型,它使用T5变压器结构将MIDI序列的转录自动化成吉他标签。通过将这一任务描述为象征性翻译问题,该模型解决了关键挑战,包括字符串模糊性和物理可播放性。拟议的系统利用了多种数据集,包括DadaGP、GuitarToday和Leduc, 以及新的数据预处理和代记化战略。我们开发了用于定量评估性能的制表精度和可播放性指标。实验结果表明,Fretting-Transedexedex 基线方法如A* 和Guitar Pro等商业应用程序。将环境敏感处理和调控/调控调/调控进一步增强模型的性能,为今后自动吉他转录制的制作工作奠定了坚实的基础。


Article 208

Title@2025-06-17 (2): Chaining Event Spans for Temporal Relation Grounding

Title: Chaining Event Spans for Temporal Relation Grounding Verkettung von Event-Spannen für die zeitliche Beziehungserdung 用于时间关系基准的连锁事件 Spans 系统 2506.14213v1

Authors (4): Jongho Kim, Dohyeon Lee, Minsoo Kim, Seung-won Hwang

Accurately understanding temporal relations between events is a critical building block of diverse tasks, such as temporal reading comprehension (TRC) and relation extraction (TRE). For example in TRC, we need to understand the temporal semantic differences between the following two questions that are lexically near-identical: “What finished right before the decision?” or “What finished right after the decision?”. To discern the two questions, existing solutions have relied on answer overlaps as a proxy label to contrast similar and dissimilar questions. However, we claim that answer overlap can lead to unreliable results, due to spurious overlaps of two dissimilar questions with coincidentally identical answers. To address the issue, we propose a novel approach that elicits proper reasoning behaviors through a module for predicting time spans of events. We introduce the Timeline Reasoning Network (TRN) operating in a two-step inductive reasoning process: In the first step model initially answers each question with semantic and syntactic information. The next step chains multiple questions on the same event to predict a timeline, which is then used to ground the answers. Results on the TORQUE and TB-dense, TRC and TRE tasks respectively, demonstrate that TRN outperforms previous methods by effectively resolving the spurious overlaps using the predicted timeline.

准确地理解事件之间的时间关系是各种任务的关键组成部分,例如时间阅读理解(TRC)和关系提取(TRE)等。例如,在真相与和解委员会,我们需要理解以下两个问题之间的时间语义差异,这两个问题在逻辑上几乎完全相同:“在决定之前完成什么?”或“在决定之后完成什么?” 。为了辨别这两个问题,现有解决办法依靠答案重叠作为替代标签,来比较相似和不同的问题。然而,我们声称,答案重叠可能导致不可靠的结果,因为两个不同问题与偶然的相同答案存在虚假的重叠。为了解决这一问题,我们提出了一个新颖的方法,通过一个模块来预测事件的时间跨度,来产生适当的推理行为。我们引入了时间线说明网络(TRN),以两步引导推理过程运作:在第一个步骤模型中,每个问题都以语义和互不相同的信息作为替代标签。下一个步骤链在同一个事件上有许多问题可以预测一个时间表,然后用来作为答案的基础。我们提出一个新办法,通过一个模块来得出正确的推算出适当的推理行为范围。我们采用了以前的推算方法,分别展示了预测的推算时间表。


Article 209

Title@2025-06-17 (2): Explainable Detection of Implicit Influential Patterns in Conversations via Data Augmentation

Title: Explainable Detection of Implicit Influential Patterns in Conversations via Data Augmentation Erklärbare Erkennung von impliziten Einflussmustern in Gesprächen durch Datenvergrößerung 通过数据增加在对话中可解释地探测到的隐性内流模式 2506.14211v1

Authors (4): Sina Abdidizaji, Md Kowsher, Niloofar Yousefi, Ivan Garibay

In the era of digitalization, as individuals increasingly rely on digital platforms for communication and news consumption, various actors employ linguistic strategies to influence public perception. While models have become proficient at detecting explicit patterns, which typically appear in texts as single remarks referred to as utterances, such as social media posts, malicious actors have shifted toward utilizing implicit influential verbal patterns embedded within conversations. These verbal patterns aim to mentally penetrate the victim’s mind in order to influence them, enabling the actor to obtain the desired information through implicit means. This paper presents an improved approach for detecting such implicit influential patterns. Furthermore, the proposed model is capable of identifying the specific locations of these influential elements within a conversation. To achieve this, the existing dataset was augmented using the reasoning capabilities of state-of-the-art language models. Our designed framework resulted in a 6% improvement in the detection of implicit influential patterns in conversations. Moreover, this approach improved the multi-label classification tasks related to both the techniques used for influence and the vulnerability of victims by 33% and 43%, respectively.

在数字化时代,随着个人日益依赖数字平台进行通信和新闻消费,各种行为者采用语言战略来影响公众的观念。模型在发现明显模式方面已经变得十分熟练,这些模式通常出现在作为单一言论的文本中,如社交媒体文章等,恶意行为者已经转向使用对话中隐含的有影响力的口头模式。这些口头模式旨在从心理上渗透受害者的思想,以便影响他们,使行为者能够通过隐含手段获得所需的信息。本文件提出了一种更好的方法来发现这种隐含的有影响力模式。此外,拟议的模式能够确定这些有影响力的要素在对话中的具体位置。为了实现这一目标,现有数据集利用最新语言模式的推理能力得到了扩大。我们设计的框架使得在发现隐含影响力的对话模式方面提高了6%的改善。此外,这一方法还改善了与用于影响和受害者脆弱性的技术有关的多标签分类任务,分别提高了33%和43%。


Article 210

Title@2025-06-17 (2): LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

Title: LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification LongSpec: Lang-Kontext verlustfreies spekulatives Decodieren mit effizienter Entwurfs- und Verifizierung 长方形:长端无损失的假设值与高效率的起草和核查 2502.17421v2

Authors (7): Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An

As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this capability. Speculative decoding (SD) offers a promising lossless acceleration technique compared to lossy alternatives such as quantization and model cascades. However, most state-of-the-art SD methods are trained on short texts (typically fewer than 4k tokens), making them unsuitable for long-context scenarios. Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands posed by draft models due to large Key-Value (KV) cache; (2) performance degradation resulting from the mismatch between short-context training and long-context inference; and (3) inefficiencies in tree attention mechanisms when managing long token sequences. This work introduces LongSpec, a framework that addresses these challenges through three core innovations: a memory-efficient draft model with a constant-sized KV cache; novel position indices that mitigate the training-inference mismatch; and an attention aggregation strategy that combines fast prefix computation with standard tree attention to enable efficient decoding. Experimental results confirm the effectiveness of LongSpec, achieving up to a 3.26x speedup over strong Flash Attention baselines across five long-context understanding datasets, as well as a 2.25x reduction in wall-clock time on the AIME24 long reasoning task with the QwQ model, demonstrating significant latency improvements for long-context applications. The code is available at https://github.com/sail-sg/LongSpec.

由于大语言模型(LLMS)现在可以处理非常长的背景,因此对这些长期投入的有效推断越来越重要,特别是对于高度依赖这种能力的LLM代理商等新兴应用而言,这些长期投入的高效推断越来越重要。与量化和模型级联等损失替代物相比,光学解码(SD)提供了充满希望的无损加速技术。然而,大多数最先进的SD方法都是在短文本(通常少于4ksignals)上培训,使其不适合长文本情景。具体地说,将这些方法适应长背景的应用带来了三大关键挑战:(1) 由于大型的Key-Value(KV)缓存,因此模型草案提出了过多的内存需求;(2) 由于短文本培训和长字符串互换(Sdecodecod)之间的不匹配而导致的性能退化;(3) 在管理长符号序列序列时,树注意力机制效率低。 这项工作引入了长SLongSpec,一个通过三个核心创新来应对这些挑战的框架:一个记忆高效的模型草案,KV级缓存的缓存;新的位置指数指数,以缓解培训的利差错配;以及快速的聚合组合组合组合战略,将快速的视野组合战略结合了Qrefreal-laxxxxxxalxxxxxxx的快速计算,在长期的轨距值的轨距上,使长尾线标值Lexxxxxxxxx的快速测能实现了5的轨。


Article 211

Title@2025-06-17 (2): CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation

Title: CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation CausalDiffTab: Mixed-Type Causal-Aware Diffusion für tabellarische Datengenerierung CausalDiffTab: 用于制表数据生成的混合- 混合- Type Causal- Aware 扩散 2506.14206v1

Authors (5): Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong, Chun-Ming Xia, Fei Dai

Training data has been proven to be one of the most critical components in training generative AI. However, obtaining high-quality data remains challenging, with data privacy issues presenting a significant hurdle. To address the need for high-quality data. Synthesize data has emerged as a mainstream solution, demonstrating impressive performance in areas such as images, audio, and video. Generating mixed-type data, especially high-quality tabular data, still faces significant challenges. These primarily include its inherent heterogeneous data types, complex inter-variable relationships, and intricate column-wise distributions. In this paper, we introduce CausalDiffTab, a diffusion model-based generative model specifically designed to handle mixed tabular data containing both numerical and categorical features, while being more flexible in capturing complex interactions among variables. We further propose a hybrid adaptive causal regularization method based on the principle of Hierarchical Prior Fusion. This approach adaptively controls the weight of causal regularization, enhancing the model’s performance without compromising its generative capabilities. Comprehensive experiments conducted on seven datasets demonstrate that CausalDiffTab outperforms baseline methods across all metrics. Our code is publicly available at: https://github.com/Godz-z/CausalDiffTab.

事实证明,培训数据是培训基因化的最重要的组成部分之一。然而,获得高质量的数据仍然是挑战性,数据隐私问题是一个重大障碍。为了解决对高质量数据的需求问题。合成数据已作为一个主流解决方案出现,展示了图像、音频和视频等领域令人印象深刻的性能。生成混合型数据,特别是高质量的表格数据,仍面临重大挑战。这主要包括其固有的差异型数据类型、复杂的可变关系和复杂的柱状分布。在本文中,我们引入了基于数据隐私问题的传播模型模型模型CausalDiffTab。该模型专门设计用于处理包含数字和绝对特征的混合表式数据,同时更灵活地捕捉各种变量之间的复杂互动。我们进一步提议基于高端前调原则的混合性因果调节方法。这种方法适应性地控制了因果调节的权重,在不损害其基因化能力的情况下加强了模型的性能。在七个数据集中进行的全面实验表明,Causal-DiffTab超越了所有计量标准中的基线方法。我们的代码可在以下公开查阅: https://GOs/GOS。


Article 212

Title@2025-06-17 (2): AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

Title: AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents AgentSynth: Skalierbare Task-Generierung für generalistische Computer-Use-Agenten AnySynth:通用计算机使用代理器可缩放任务生成 2506.14205v1

Authors (4): Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song

We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. Our pipeline begins with an LLM-based task proposer guided by a persona, followed by an execution agent that completes the task and logs the trajectory. This process is repeated iteratively to form a sequence of subtasks, which are then summarized by a separate agent into a composite task of controllable difficulty. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark’s difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are publicly available at https://github.com/sunblaze-ucb/AgentSynth

我们引入了AgentSynth, 这是一种可扩缩且具有成本效益的管道, 用于自动合成通用计算机使用代理的高质量任务和轨迹数据集。 利用信息不对称性, AgentSynth 构建了子任务,这些子任务在发电过程中简单,但在组成长半径任务时则更具挑战性, 使得能够创建6000多项多样化和现实的任务。 我们的管道从一个基于LLLM的任务建议器开始, 由一个人指导, 由一个人指导, 由一个完成任务并记录轨迹的执行代理器开始。 这一过程反复反复重复, 形成一系列子任务和轨迹, 然后由一个单独的代理器将其总结为可控困难的复合任务。 AgentSynth的关键力量是它能够通过改变任务的复杂性, 以改变子任务的数量。 Empiriticalalalalal 评估显示, 状态的LLM代理商的性能下降幅度非常大, 从困难级别1级的18%下降到6级的4 % , 突出基准的难度和差别力量。 此外, 我们的管道的难度和差别力量会达到一个较低的平均成本成本成本 $0/ abthrus dalways/ das/ dust 10。


Article 213

Title@2025-06-17 (2): Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios

Title: Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios Verbesserung der praktischen Aspekte von End-to-End-Multitalker-Spracherkennung für Online- und Offline-Szenarien 改进在网上和离线情景下承认端到端多嘴多语种言论的 实际方面 2506.14204v1

Authors (6): Aswin Shanmugam Subramanian, Amit Das, Naoyuki Kanda, Jinyu Li, Xiaofei Wang, Yifan Gong

We extend the frameworks of Serialized Output Training (SOT) to address practical needs of both streaming and offline automatic speech recognition (ASR) applications. Our approach focuses on balancing latency and accuracy, catering to real-time captioning and summarization requirements. We propose several key improvements: (1) Leveraging Continuous Speech Separation (CSS) single-channel front-end with end-to-end (E2E) systems for highly overlapping scenarios, challenging the conventional wisdom of E2E versus cascaded setups. The CSS framework improves the accuracy of the ASR system by separating overlapped speech from multiple speakers. (2) Implementing dual models – Conformer Transducer for streaming and Sequence-to-Sequence for offline – or alternatively, a two-pass model based on cascaded encoders. (3) Exploring segment-based SOT (segSOT) which is better suited for offline scenarios while also enhancing readability of multi-talker transcriptions.

我们扩展了分层输出培训框架,以满足流态和离线自动语音识别应用程序的实际需要。我们的方法侧重于平衡静态和准确性,满足实时字幕和汇总要求。我们提出若干关键的改进:(1) 利用连续语音隔离单通道前端和端到端系统,对高度重叠的情景进行从端到端(E2E)系统,挑战E2E与级联设置的常规智慧。CSS框架通过将多位发言者的重叠演讲分开来提高自动语音识别系统的准确性。(2) 实施双轨模式 – – 用于流式和离线序列的导导导导出器,或以级联编码器为基础的双通模式。(3) 探索基于分段的SOT(eSOT),这更适合离线情景,同时提高多位话人的可读性。


Article 214

Title@2025-06-17 (2): Intended Target Identification for Anomia Patients with Gradient-based Selective Augmentation

Title: Intended Target Identification for Anomia Patients with Gradient-based Selective Augmentation Vorgesehene Zielidentifizierung für Anomie-Patienten mit gradientbasierter selektiver Augmentation 逐步增加选择性增加的阿诺米亚病人预期目标识别 2506.14203v1

Authors (3): Jongho Kim, Romain Storaï, Seung-won Hwang

In this study, we investigate the potential of language models (LMs) in aiding patients experiencing anomia, a difficulty identifying the names of items. Identifying the intended target item from patient’s circumlocution involves the two challenges of term failure and error: (1) The terms relevant to identifying the item remain unseen. (2) What makes the challenge unique is inherent perturbed terms by semantic paraphasia, which are not exactly related to the target item, hindering the identification process. To address each, we propose robustifying the model from semantically paraphasic errors and enhancing the model with unseen terms with gradient-based selective augmentation. Specifically, the gradient value controls augmented data quality amid semantic errors, while the gradient variance guides the inclusion of unseen but relevant terms. Due to limited domain-specific datasets, we evaluate the model on the Tip-of-the-Tongue dataset as an intermediary task and then apply our findings to real patient data from AphasiaBank. Our results demonstrate strong performance against baselines, aiding anomia patients by addressing the outlined challenges.

在这项研究中,我们调查语言模型(LMS)在帮助患有厌食症的病人方面的潜力,这是确定物品名称的一个困难。从病人的环绕中确定预定目标项涉及两个挑战,即:失败和错误:(1) 与确定物品有关的术语仍然不为人知。(2) 挑战的独特之处在于语义性paraphasia的内在干扰术语,这些术语与目标项并不完全相关,阻碍了识别过程。为了解决每一个问题,我们建议用基于梯度的选择性增强作用,用隐性术语强化模型。具体地说,梯度值控制提高了数据质量,在语义性错误中提高了数据质量,而梯度差异则引导了隐性但相关的术语的列入。由于有限的特定域数据集,我们评估了Tip-the-Tongue数据集模型的模型,作为中间任务,然后将我们的调查结果应用于AphasiaBank的真正病人数据。我们的结果显示,根据基线,我们通过应对概述的挑战,帮助非人口病人。


Article 215

Title@2025-06-17 (2): CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement

Title: CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement CAPTURE: Context-Aware Prompt Injection Testing und Robustheitsverbesserung CAPTURE: 上下文软件快速注射测试和强力增强 2505.12368v2

Authors (2): Gauri Kholkar, Ratinder Ahuja

Prompt injection remains a major security risk for large language models. However, the efficacy of existing guardrail models in context-aware settings remains underexplored, as they often rely on static attack benchmarks. Additionally, they have over-defense tendencies. We introduce CAPTURE, a novel context-aware benchmark assessing both attack detection and over-defense tendencies with minimal in-domain examples. Our experiments reveal that current prompt injection guardrail models suffer from high false negatives in adversarial cases and excessive false positives in benign scenarios, highlighting critical limitations. To demonstrate our framework’s utility, we train CaptureGuard on our generated data. This new model drastically reduces both false negative and false positive rates on our context-aware datasets while also generalizing effectively to external benchmarks, establishing a path toward more robust and practical prompt injection defenses.

快速注射仍是大型语言模型的一大安全风险。 但是,现有保护性铁路模型在背景环境环境中的功效仍未得到充分探讨,因为它们往往依赖静态袭击基准。 此外,它们具有过度防卫倾向。 我们引入了CAPTURE, 这是一种新的背景认知基准,以最小的内部实例评估攻击探测和过度防御趋势。我们的实验显示,当前快速保护性铁路模型在对抗性案件中存在高假阴性,在良性情景中存在过多的假阳性,凸显了关键的局限性。 为了展示我们框架的效用,我们用我们生成的数据来培训捕捉Guard。这一新模型极大地降低了我们的背景认知数据集上的虚假负和假正率,同时有效地向外部基准推广,为更有力、更实际的快速注射防御开辟了一条道路。


Article 216

Title@2025-06-17 (2): Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation

Title: Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation Hanfu-Bench: Ein multimodaler Benchmark für interkulturelles Verständnis und Transkreation Hanfu-Bunch:跨时文化理解和交流的多模式基准 2506.01565v2

Authors (6): Li Zhou, Lutong Yu, Dongchu Xie, Shaohuan Cheng, Wenyan Li, Haizhou Li

Culture is a rich and dynamic domain that evolves across both geography and time. However, existing studies on cultural understanding with vision-language models (VLMs) primarily emphasize geographic diversity, often overlooking the critical temporal dimensions. To bridge this gap, we introduce Hanfu-Bench, a novel, expert-curated multimodal dataset. Hanfu, a traditional garment spanning ancient Chinese dynasties, serves as a representative cultural heritage that reflects the profound temporal aspects of Chinese culture while remaining highly popular in Chinese contemporary society. Hanfu-Bench comprises two core tasks: cultural visual understanding and cultural image transcreation.The former task examines temporal-cultural feature recognition based on single- or multi-image inputs through multiple-choice visual question answering, while the latter focuses on transforming traditional attire into modern designs through cultural element inheritance and modern context adaptation. Our evaluation shows that closed VLMs perform comparably to non-experts on visual cutural understanding but fall short by 10\% to human experts, while open VLMs lags further behind non-experts. For the transcreation task, multi-faceted human evaluation indicates that the best-performing model achieves a success rate of only 42\%. Our benchmark provides an essential testbed, revealing significant challenges in this new direction of temporal cultural understanding and creative adaptation.

韩福是中国古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代的古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代古代


Article 217

Title@2025-06-17 (2): ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations

Title: ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations ELI-Warum: Bewertung der pädagogischen Nützlichkeit von Sprachmodellerklärungen ELI- Why:评价语言模式解释的教学用途 2506.14200v1

Authors (8): Brihi Joshi, Keyu He, Sahana Ramnath, Sadra Sabouri, Kaitlyn Zhou, Souti Chattopadhyay, Swabha Swayamdipta, Xiang Ren

Language models today are widely used in education, yet their ability to tailor responses for learners with varied informational needs and knowledge backgrounds remains under-explored. To this end, we introduce ELI-Why, a benchmark of 13.4K “Why” questions to evaluate the pedagogical capabilities of language models. We then conduct two extensive human studies to assess the utility of language model-generated explanatory answers (explanations) on our benchmark, tailored to three distinct educational grades: elementary, high-school and graduate school. In our first study, human raters assume the role of an “educator” to assess model explanations’ fit to different educational grades. We find that GPT-4-generated explanations match their intended educational background only 50% of the time, compared to 79% for lay human-curated explanations. In our second study, human raters assume the role of a learner to assess if an explanation fits their own informational needs. Across all educational backgrounds, users deemed GPT-4-generated explanations 20% less suited on average to their informational needs, when compared to explanations curated by lay people. Additionally, automated evaluation metrics reveal that explanations generated across different language model families for different informational needs remain indistinguishable in their grade-level, limiting their pedagogical effectiveness.

今天,语言模式在教育中被广泛使用,然而,他们为具有不同信息需要和知识背景的学习者做出相应反应的能力仍未得到充分探讨。为此,我们引入了13.4K“为什么”问题的ELI-HaII基准13.4K“为什么”问题,以评价语言模式的教学能力。然后,我们开展了两项广泛的人类研究,以评估语言模式对基准所作的解释性回答(解释性解答)的效用,这些解答针对三个不同的教育等级:小学、高中和研究生。在我们的第一项研究中,人类计分员扮演了“教育家”的角色,评估模型解释是否适合不同的教育等级。我们发现,GPT-4生成的解释只匹配了50%的时间,而用于非人化解释的只有79%。在第二项研究中,人类计分数员扮演了学习者的角色,评估解释是否适合他们自己的信息需求。在所有教育背景中,用户认为GPT-4生成的解释与他们的平均需要相比,比普通人的解释要低20%。此外,自动化评价指标显示,不同语言模式家庭为限制他们不同的教育级别需求而导致的教学效率需求的解释水平仍然不变。


Article 218

Title@2025-06-17 (2): Geometric Signatures of Compositionality Across a Language Model’s Lifetime

Title: Geometric Signatures of Compositionality Across a Language Model’s Lifetime Geometrische Signaturen der Kompositionalität über die Lebenszeit eines Sprachmodells hinweg 语文模式中各语文模式的 终身组成特征的几何签名 2410.01444v5

Authors (5): Jin Hwa Lee, Thomas Jiralerspong, Lei Yu, Yoshua Bengio, Emily Cheng

By virtue of linguistic compositionality, few syntactic rules and a finite lexicon can generate an unbounded number of sentences. That is, language, though seemingly high-dimensional, can be explained using relatively few degrees of freedom. An open question is whether contemporary language models (LMs) reflect the intrinsic simplicity of language that is enabled by compositionality. We take a geometric view of this problem by relating the degree of compositionality in a dataset to the intrinsic dimension (ID) of its representations under an LM, a measure of feature complexity. We find not only that the degree of dataset compositionality is reflected in representations’ ID, but that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training. Finally, our analyses reveal a striking contrast between nonlinear and linear dimensionality, showing they respectively encode semantic and superficial aspects of linguistic composition.

由于语言的构成性,很少有合成规则和限定词典可以产生数量未限制的句子,也就是说,语言虽然看似高维,但可以用相对较少的自由度来解释。一个未决问题是,当代语言模式(LMs)是否反映了语言的内在简单性,而语言的构成性是语言的构成性所促成的。我们从几何角度看待这一问题,将数据集的构成性程度与其在LM下表述的内在层面(ID)相联系,这是一种特征复杂性的衡量标准。我们发现,不仅数据集的构成性程度反映在表达式的识别中,而且由于在培训中学习语言特征而产生了构成性和几何复杂度之间的关系。最后,我们的分析揭示了非线性和线性之间的鲜明对比,表明它们分别将语言构成的语义和表面方面编码。


Article 219

Title@2025-06-17 (2): Counterfactual-Consistency Prompting for Relative Temporal Understanding in Large Language Models

Title: Counterfactual-Consistency Prompting for Relative Temporal Understanding in Large Language Models Counterfactual-Consistency Prompting für relatives zeitliches Verständnis in großen Sprachmodellen 在大语言模式中促进相对时间理解的反事实一致 2502.11425v2

Authors (2): Jongho Kim, Seung-won Hwang

Despite the advanced capabilities of large language models (LLMs), their temporal reasoning ability remains underdeveloped. Prior works have highlighted this limitation, particularly in maintaining temporal consistency when understanding events. For example, models often confuse mutually exclusive temporal relations like before'' andafter’’ between events and make inconsistent predictions. In this work, we tackle the issue of temporal inconsistency in LLMs by proposing a novel counterfactual prompting approach. Our method generates counterfactual questions and enforces collective constraints, enhancing the model’s consistency. We evaluate our method on multiple datasets, demonstrating significant improvements in event ordering for explicit and implicit events and temporal commonsense understanding by effectively addressing temporal inconsistencies.

尽管大型语言模型(LLMs)能力较强,但其时间推理能力仍然不足。先前的工作突出显示了这一局限性,特别是在了解事件时保持时间一致性方面。例如,模型往往混淆事件之间相互排斥的时间关系,例如“事前”和“后”等,作出前后不一致的预测。在这项工作中,我们通过提出新的反事实推介方法来解决LLMs时间上不一致的问题。我们的方法产生了反事实问题,并强化了集体制约,加强了模型的一致性。我们评估了我们在多个数据集上的方法,表明在要求发生明确和隐含的事件时会有很大的改进,通过有效处理时间上的不一致来理解时常识。


Article 220

Title@2025-06-17 (2): MAS-LitEval : Multi-Agent System for Literary Translation Quality Assessment

Title: MAS-LitEval : Multi-Agent System for Literary Translation Quality Assessment MAS-LitEval : Multi-Agenten-System für die Bewertung der Qualität von Übersetzungen MAS-LitEval:文学翻译质量评估多机构系统 2506.14199v1

Authors (5): Junghwan Kim, Kieun Park, Sohee Park, Hyunggug Kim, Bongwon Suh

Literary translation requires preserving cultural nuances and stylistic elements, which traditional metrics like BLEU and METEOR fail to assess due to their focus on lexical overlap. This oversight neglects the narrative consistency and stylistic fidelity that are crucial for literary works. To address this, we propose MAS-LitEval, a multi-agent system using Large Language Models (LLMs) to evaluate translations based on terminology, narrative, and style. We tested MAS-LitEval on translations of The Little Prince and A Connecticut Yankee in King Arthur’s Court, generated by various LLMs, and compared it to traditional metrics. \textbf{MAS-LitEval} outperformed these metrics, with top models scoring up to 0.890 in capturing literary nuances. This work introduces a scalable, nuanced framework for Translation Quality Assessment (TQA), offering a practical tool for translators and researchers.

文学翻译需要保存文化上的细微差别和文体元素,而BLEU和METEOR等传统指标由于注重法律重叠而未能评估这些细微差别和文体元素。这种监督忽视了对文学作品至关重要的叙事一致性和文体忠诚性。为了解决这个问题,我们建议采用MAS-LitEval,这是一个多试剂系统,使用大语言模型(LLLMS)来根据术语、文体和风格来评价翻译。我们测试了MAS-LitEval,在阿瑟国王法院对小王子和康涅狄格扬基的翻译进行了测试,这些翻译是由各种LLMS制作的,并与传统指标进行了比较。\ textbf{MAS-LitEval}优于这些指标,顶级模型在捕捉文学细微时达到0.890。这项工作为翻译质量评估提供了一个可扩缩、细微的框架,为翻译和研究人员提供了一个实用的工具。


Article 221

Title@2025-06-17 (2): Compression of enumerations and gain

Title: Compression of enumerations and gain Kompression von Aufzählungen und Gewinn 压缩查点和收益 2304.03030v2

Authors (3): George Barmpalias, Xiaoyan Zhang, Bohua Zhan

We study the compressibility of enumerations in the context of Kolmogorov complexity, focusing on strong and weak forms of compression and their gain: the amount of auxiliary information embedded in the compressed enumeration. The existence of strong compression and weak gainless compression is shown for any computably enumerable (c.e.) set. The density problem of c.e. sets with respect to their prefix complexity is reduced to the question of whether every c.e. set is well-compressible, which we study via enumeration games.

我们从科尔莫戈罗夫复杂程度的角度研究查点的压缩问题,重点是强弱的压缩形式及其收益:压缩查点中所含的辅助信息的数量。强压和微弱的无损压缩的存在为任何可比较的可计算数字(c.e.)集的(c.e.)集的密度问题。c.e.集的前缀复杂性的密度问题将缩小到每套c.e.集是否都具有良好抑制性的问题,我们通过查点游戏来研究这一问题。


Article 222

Title@2025-06-17 (2): Reward Shaping to Mitigate Reward Hacking in RLHF

Title: Reward Shaping to Mitigate Reward Hacking in RLHF Reward Shaping, um Belohnung Hacking in RLHF Mititate 在RLHF中拆分至Mipigget Reward的拆分 2502.18770v3

Authors (6): Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao

Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to \emph{reward hacking}, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. We evaluated PAR on two base models, Gemma2-2B, and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR’s superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate of at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. The code is available at https://github.com/PorUna-byte/PAR, and the Work done during the internship at StepFun by Jiayi Fu.

从人类反馈中强化学习(RLHF)对于使大型语言模型(LLMS)与人的价值相匹配至关重要。然而,RLHF很容易受到 emph{rward hacking} 的伤害,因为代理商利用奖励功能的缺陷而不是学习预期的行为,从而贬低了校准。虽然奖赏的形成有助于稳定RLHF, 并部分减轻奖赏黑客,但对塑造技巧及其基本原则的系统调查仍然缺乏。为了缩小这一差距,我们对流行的奖赏形成方法进行了全面研究。我们的分析提出了两个主要设计原则:(1)RLMM(LMM)奖赏应该受约束,和(2)RLLH-RH(RH)奖赏从快速初始增长和逐渐趋同中得益。根据这些洞察,我们建议PLEReward(PAR)是利用奖励模式中隐含的潜在偏好偏好,作为强化学习的信号。我们用两种基础模型(Gemma2-2B)和Llama3-8B(L)来评价PAR,使用两个数据集、Utrafrefrefreferbed-Be-By-Binalizized and and HH-Realized and Hold)和HWealalalbildown real orational legal legal orational be legal legal orational oration oration oration orations) untal oration orate 只能显示甚至以比其他最优业绩优优于其他标准, 20 20 方法, lesh 20 方法, lex a lex lex lex lex lex lex lex 。在5 方法在前工作比前工作比前工作比前工作,在5 le le lex lexal lexldal lexal level level 21。在前工作 lemental 21。在前工作,在前工作,在前工作,在前工作,在前工作比前工作比前工作,仅靠双优比 20 20 20 lex lex lex le


Article 223

Title@2025-06-17 (2): AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR

Title: AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR AsyncSwitch: Asynchrone Text-Speech-Anpassung für Code-Switched ASR Async开关: 用于代码开关 ASR 的非同步文本语音适应 2506.14190v1

Authors (2): Tuan Nguyen, Huy-Dat Tran

Developing code-switched ASR systems is challenging due to language ambiguity and limited exposure to multilingual, code-switched data, while collecting such speech is costly. Prior work generates synthetic audio from text, but these methods are computationally intensive and hard to scale. We introduce AsyncSwitch, a novel asynchronous adaptation framework that leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. Our three-stage process (1) trains decoder self-attention and feedforward layers on code-switched text, (2) aligns decoder and encoder via cross-attention using limited speech-text data, and (3) fully fine-tunes the entire model. Experiments with Whisper on Malay-English code-switching demonstrate a 9.02% relative WER reduction, while improving monolingual performance in Singlish, Malay, and other English variants.

开发代码转换的 ASR 系统具有挑战性,因为语言模糊性,而且对多语种、代码转换的数据的接触有限,而收集这种演讲则费用高昂。先前的工作是从文本中生成合成音频,但这些方法在计算上是密集的,而且很难推广。我们引入了AsyncSwitch,这是一个新颖的、不同步的调整框架,它利用大规模、文本丰富的网络数据,在对配对的语音-文字调校之前,将ASR 模型应用到多种代码转换的域中。我们的三阶段进程 (1) 培训解码器自我注意和对代码转换文本的向上层,(2) 通过使用有限的语音-文本数据,对调校对解码器和编码器进行对齐,(3) 对整个模型进行完全微调。在马拉语-英语代码转换上与Whiper的实验显示了9.02 %的相对WER,同时改进了Singlish、Malay和其他英语变体的单一语言表现。


Article 224

Title@2025-06-17 (2): EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

Title: EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG EEG2TEXT-CN: Eine explorative Studie der offenen Vokabulären chinesischen Text-EEG-Ausrichtung über großsprachliches Modell und kontrastives Lernen auf ChinesischEEG EEG2TEXT-CN:通过大语言模式和中经语言差异性学习对中文文本与EEEG校对开放词汇的探索性研究 2506.00854v2

Authors (6): Jacky Tai-Yu Lu, Jung Chiang, Chi-Sheng Chen, Anna Nai-Yun Tung, Hsiang Wei Hu, Yuan Chiao Cheng

We propose EEG2TEXT-CN, which, to the best of our knowledge, represents one of the earliest open-vocabulary EEG-to-text generation frameworks tailored for Chinese. Built on a biologically grounded EEG encoder (NICE-EEG) and a compact pretrained language model (MiniLM), our architecture aligns multichannel brain signals with natural language representations via masked pretraining and contrastive learning. Using a subset of the ChineseEEG dataset, where each sentence contains approximately ten Chinese characters aligned with 128-channel EEG recorded at 256 Hz, we segment EEG into per-character embeddings and predict full sentences in a zero-shot setting. The decoder is trained with teacher forcing and padding masks to accommodate variable-length sequences. Evaluation on over 1,500 training-validation sentences and 300 held-out test samples shows promising lexical alignment, with a best BLEU-1 score of 6.38\%. While syntactic fluency remains a challenge, our findings demonstrate the feasibility of non-phonetic, cross-modal language decoding from EEG. This work opens a new direction in multilingual brain-to-text research and lays the foundation for future cognitive-language interfaces in Chinese.

我们建议EEG2TEXT-CN, 就我们所知,它代表了最早为中国人设计的开放的开放的 EEG-文字生成框架之一。我们建于一个基于生物的 EEG 编码器(NITE-EEEG)和一个精练的语言模型(MiniLM)上,我们的建筑通过蒙面训练前和对比学习,将多通道脑信号与自然语言表达方式相匹配。我们使用中国EEEG数据集的一个子集,每句话都包含与256赫兹记录的128个EEEG相匹配的大约10个中国字符。我们将EEEG分成每个字嵌入每个字组,并预测在零光环境中的全句。解码器是用教师的强迫和遮蔽面面面面罩来适应多长序列的培训。对1,500多个培训-校准判决和300个留置试样的评估显示有良好的词汇一致性,最佳的BLEU-1分为6.38。尽管合成流仍是一个挑战,但我们的调查结果展示了在中国人文、跨式语言、跨式语言、跨式读基础的大脑研究中,从而开启了中国人进入了中国认知基础。


Article 225

Title@2025-06-17 (2): Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching

Title: Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching Kosteneffiziente Bedienung von LLM-Agenten über Test-Zeitplan-Caching 通过试验-时间计划缓冲,以成本效率高的方式服务LLM代理物 2506.14852v1

Authors (3): Qizheng Zhang, Michael Wornow, Kunle Olukotun

LLM-based agentic applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs due to extensive planning and reasoning requirements. Existing LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agentic applications where outputs depend on external data or environmental contexts. We propose agentic plan caching, a novel approach that extracts, stores, adapts, and reuses structured plan templates from planning stages of agentic applications across semantically similar tasks to reduce the cost of serving. Unlike traditional semantic caching, our system extracts plan templates from completed agent executions at test-time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates to task-specific plans with contexts. Evaluation across multiple real-world agentic applications shows that our system can reduce costs by 46.62% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.

以LLM为基础的代理应用在复杂的工作流程中表现出日益显著的能力,但由于广泛的规划和推理要求,费用高昂。主要设计用于为聊天室提供服务的现有LLM缓冲技术(如环境缓冲和语义缓冲等)对于产出取决于外部数据或环境背景的代理应用来说是不够的。我们提议了一种新颖的计划缓冲方法,即提取、储存、调整和再利用跨语义类似任务的代理应用规划阶段的结构性计划模板,以降低服务成本。与传统的语义缓冲法不同,我们的系统从测试时已完成的代理处决中提取了计划模板,使用关键词提取方法将新的请求与缓存计划匹配,并利用轻量模型使这些模板适应特定任务计划的环境。跨多个现实世界的代理应用的评估表明,我们的系统可以平均地降低46.62%的成本,同时保持绩效,为补充现有LLM服务基础设施的LM代理提供更有效的解决方案。


Article 226

Title@2025-06-17 (2): Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore’s languages

Title: Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore’s languages Können wir ASR-Systeme auf Code-Schalter ohne echte Code-Schalter-Daten trainieren? Fallstudie für Singapurs Sprachen 我们能否在没有实际代码开关数据的情况下,对 ASR 系统进行代码开关培训?新加坡语言案例研究 2506.14177v1

Authors (2): Tuan Nguyen, Huy-Dat Tran

Code-switching (CS), common in multilingual settings, presents challenges for ASR due to scarce and costly transcribed data caused by linguistic complexity. This study investigates building CS-ASR using synthetic CS data. We propose a phrase-level mixing method to generate synthetic CS data that mimics natural patterns. Utilizing monolingual augmented with synthetic phrase-mixed CS data to fine-tune large pretrained ASR models (Whisper, MMS, SeamlessM4T). This paper focuses on three under-resourced Southeast Asian language pairs: Malay-English (BM-EN), Mandarin-Malay (ZH-BM), and Tamil-English (TA-EN), establishing a new comprehensive benchmark for CS-ASR to evaluate the performance of leading ASR models. Experimental results show that the proposed training strategy enhances ASR performance on monolingual and CS tests, with BM-EN showing highest gains, then TA-EN and ZH-BM. This finding offers a cost-effective approach for CS-ASR development, benefiting research and industry.

多语种环境中常见的代码转换(CS)对ASR提出了挑战,因为语言复杂导致数据稀少且费用昂贵。本研究调查了使用合成 CS数据建立 CS-ASR的情况。我们建议采用一个单词级混合方法来生成模仿自然模式的合成 CS 数据。使用单语级混合的合成 CS 数据来微调大型预先培训的 ASR 模型(Whisper、MMS、Seemless M4T ) 。本文件侧重于三个资源不足的东南亚语对:马来语英语(BM-EN)、曼达林马莱语(ZH-BM)和泰米尔英语(TA-EN),为CS-ASR主要模型的性能建立了新的全面基准。实验结果表明,拟议的培训战略提高了ASR在单语和CS测试方面的绩效,而B-EN的成绩最大,然后是TA-EN和ZH-BM。这一发现为CS-SR的发展、有利于研究和产业提供了具有成本效益的方法。


Article 227

Title@2025-06-17 (2): MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning

Title: MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning MMedAgent-RL: Optimierung der Multi-Agenten-Kollaboration für multimodale medizinische Vernunft MMedAgender-RL:优化多机构协作促进多式联运医疗理由 2506.00555v2

Authors (11): Peng Xia, Jinglu Wang, Yibo Peng, Kaide Zeng, Xian Wu, Xiangru Tang, Hongtu Zhu, Yun Li, Shujie Liu, Yan Lu, Huaxiu Yao

Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists and its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy that progressively teaches the attending physician to balance between imitating specialists and correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL not only outperforms both open-source and proprietary Med-LVLMs, but also exhibits human-like reasoning patterns. Notably, it achieves an average performance gain of 20.7% over supervised fine-tuning baselines.

大型医学视觉-语言模型(Med-LVLM)在多式诊断任务中表现出了巨大的潜力,然而,现有的单一试剂模型在努力推广各种医学专业,限制其绩效。最近的努力引入了由临床工作流程激励的多剂合作框架,在临床工作流程中,全科医生(GPs)和专家以固定顺序互动。尽管取得了改进,这些静态管道在推理上缺乏灵活性和适应性。为此,我们建议采用MMedAgent-RL(MedAgent-RL)(基于强化学习(RL)的多剂框架),使医疗代理人之间能够进行动态的、最佳的合作。具体地说,我们通过RL培训了两个基于Q2.5-VL(Quen2.5-VL)模式的GP(GP)剂:三角医生学会指派病人从事适当的专业,而主治医生则将多专业医生的判断及其知识整合成最后决定。为了解决专家产出的不一致,我们引入课程学习(CLL(CL)指导RL(LL)战略),逐步教导主治医生在模仿专家与模仿专家之间求和纠正错误之间取得平衡。我们在VAVA(MIS(BIS(BIS-I-I-L)平均)标准上,但不能超出其平均飞行(PIL)标准。


Article 228

Title@2025-06-17 (2): MIST: Towards Multi-dimensional Implicit Bias and Stereotype Evaluation of LLMs via Theory of Mind

Title: MIST: Towards Multi-dimensional Implicit Bias and Stereotype Evaluation of LLMs via Theory of Mind MIST: Auf dem Weg zu multidimensionalen Impliziten Bias und Stereotype Evaluation von LLMs über die Theorie des Geistes MIST:通过思想理论对LLMs进行多维隐隐含的偏见和定型评价 2506.14161v1

Authors (5): Yanlin Li, Hao Liu, Huimin Liu, Yinwei Wei, Yupeng Hu

Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To this end, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT) to assess implicit lexical associations and the Affective Attribution Test (AAT) to measure covert affective leanings, both designed to probe latent stereotypes without triggering model avoidance. Extensive experiments on 8 State-of-the-Art LLMs demonstrate our framework’s capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.

大语言模型中的心理理论(TOM)是指他们关于精神状态的推理能力,但这种能力中的失败往往表现为系统性的隐含偏见。评估这种偏向具有挑战性,因为常规的直接询问方法容易产生社会需要的效果,无法捕捉其微妙的、多维的性质。为此,我们提议了一个评价框架,利用定型内容模型(SCM)将偏见重新概念化,作为TOM跨能力、社交性和道德的多层面失败。框架提出了两项间接任务:Word Asociation Bias测试(WABT)评估隐含的法系协会和AAT(AAT)测量隐含的倾向性倾向试验(AAT),以衡量隐含的倾向性倾向,两者都旨在探究潜在的陈规定型观念,而不会触发模式的避免。关于8个州的LLMMs的广泛实验展示了我们框架揭示复杂的偏见结构的能力,包括普遍的可变性偏见、多维差异和不对称的定型分类化,从而为确定隐含偏见的结构性质提供了更可靠的方法。


Article 229

Title@2025-06-17 (2): S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models

Title: S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models S$^4$C: Spekulative Probenahme mit syntaktischer und semantischer Kohärenz zur effizienten Schlussfolgerung großer Sprachmodelle S$4美元C:为高效推导大语言模型的协同性和语义一致性进行投机抽样 2506.14158v1

Authors (8): Tao He, Guang Huang, Yu Yang, Tianshi Xu, Sicheng Zhao, Guiguang Ding, Pengyang Wang, Feng Tian

Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. Speculative sampling mitigates this issue by introducing a drafting phase followed by a parallel validation phase, enabling faster token generation and verification. Existing approaches, however, overlook the inherent coherence in text generation, limiting their efficiency. To address this gap, we propose a Speculative Sampling with Syntactic and Semantic Coherence (S$^4$C) framework, which extends speculative sampling by leveraging multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse. Experimental results demonstrate that S$^4$C surpasses baseline methods across mainstream tasks, offering enhanced efficiency, parallelism, and the ability to generate more valid tokens with fewer computational resources. On Spec-bench benchmarks, S$^4$C achieves an acceleration ratio of 2.26x-2.60x, outperforming state-of-the-art methods.

大型语言模型(LLMS)在各种下游任务中表现出非凡的推理能力,然而,其自动递减性质导致大量推论延迟,对实时应用构成挑战。投机性抽样通过引入一个起草阶段,然后一个平行的验证阶段,使象征性生成和核查更迅速,从而缓解了这一问题。但现有方法忽视了文本生成的内在一致性,限制了其效率。为弥补这一差距,我们提议了一个具有协同和语义一致性(S$4$C)的投机性抽样框架,通过利用多头起草快速代号生成和持续核查树对有效候选人验证和特征再利用进行扩大投机性抽样。实验结果表明,S$4$C超越了主流任务的基线方法,提供了更高的效率、平行性和以较少的计算资源生成更有效的符号的能力。在Spec-bench基准方面,S$4$C实现了2.26x-260x,超效的状态-艺术方法。


Article 230

Title@2025-06-17 (2): DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization

Title: DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization DCRM: Ein Heuristisches zur Messung der Antwortpaarqualität in der Präferenz-Optimierung DCRM:在首选最佳化中衡量对等反应质量的优度 2506.14157v1

Authors (2): Chengyu Huang, Tanya Goyal

Recent research has attempted to associate preference optimization (PO) performance with the underlying preference datasets. In this work, our observation is that the differences between the preferred response $y^+$ and dispreferred response $y^-$ influence what LLMs can learn, which may not match the desirable differences to learn. Therefore, we use distance and reward margin to quantify these differences, and combine them to get Distance Calibrated Reward Margin (DCRM), a metric that measures the quality of a response pair for PO. Intuitively, DCRM encourages minimal noisy differences and maximal desired differences. With this, we study 3 types of commonly used preference datasets, classified along two axes: the source of the responses and the preference labeling function. We establish a general correlation between higher DCRM of the training set and better learning outcome. Inspired by this, we propose a best-of-$N^2$ pairing method that selects response pairs with the highest DCRM. Empirically, in various settings, our method produces training datasets that can further improve models’ performance on AlpacaEval, MT-Bench, and Arena-Hard over the existing training sets.

最近的研究试图将优惠优化(PO)性能与基本的优惠数据集联系起来。在这项工作中,我们的意见是,偏好反应($$)和偏好反应($y_$)对LLM能够学到的东西的影响(y_$-$)有差异,这也许与理想的差别不相称。因此,我们利用距离和奖励差幅来量化这些差异,并将这些差数结合起来,以衡量PO的响应配对质量的衡量标准(DCRM)。直觉地说,DCRM鼓励最小的噪音差异和最大期望的差异。我们研究了三种常用的优惠数据集,按两个轴分类:反应的来源和偏好标签功能。我们在培训组的更高DCRM和更好的学习结果之间建立了一般关系。我们为此提出了一种最佳的2N-N-200配对方法,选择了最高DCRM的反应配对。在各种环境下,我们的方法产生培训数据集,可以进一步改善AlpacakeEval、MT-Bench、Arna-Hard等现有培训组的模型的性能。


Article 231

Title@2025-06-17 (2): OWLViz: An Open-World Benchmark for Visual Question Answering

Title: OWLViz: An Open-World Benchmark for Visual Question Answering OWLViz: Ein Open-World-Benchmark für visuelle Fragen OWLViz:视觉问答的开放世界基准 2503.07631v2

Authors (6): Thuy Nguyen, Dang Nguyen, Hoang Nguyen, Thuan Luong, Long Hoang Dang, Viet Dac Lai

We present a challenging benchmark for the Open WorLd VISual question answering (OWLViz) task. OWLViz presents concise, unambiguous queries that require integrating multiple capabilities, including visual understanding, web exploration, and specialized tool usage. While humans achieve 69.2% accuracy on these intuitive tasks, even state-of-the-art VLMs struggle, with the best model, Gemini 2.0, achieving only 26.6% accuracy. Current agentic VLMs, which rely on limited vision and vision-language models as tools, perform even worse. This performance gap reveals significant limitations in multimodal systems’ ability to select appropriate tools and execute complex reasoning sequences, establishing new directions for advancing practical AI research.

我们为Open WorLd Visual 答题(OWLViz)的任务提出了一个具有挑战性的基准。 OWLViz 给出了简明、明确的询问,要求整合多种能力,包括视觉理解、网络探索和专门工具使用。 虽然人类在这些直观任务上实现了69.2%的准确性,但即使是最先进的VLM,其最佳模型是Gemini 2.0, 其准确性仅为26.6%。目前依赖有限的愿景和愿景语言模型作为工具的VLMs,其表现更差。这一绩效差距揭示了多式联运系统在选择适当工具和执行复杂推理序列、为推进实际的AI研究确定新方向方面的巨大局限性。


Article 232

Title@2025-06-17 (2): Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models

Title: Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models Drücke die Performance der synthetischen Spracherkennung mit Kolmogorov-Arnold-Netzwerken und selbstüberwachten Lernmodellen 推动利用科尔莫戈罗夫-阿诺尔德网络和自控学习模式进行合成语音探测的性能 2506.14153v1

Authors (3): Tuan Dat Phuong, Long-Vu Hoang, Huy Dat Tran

Recent advancements in speech synthesis technologies have led to increasingly advanced spoofing attacks, posing significant challenges for automatic speaker verification systems. While systems based on self-supervised learning (SSL) models, particularly the XLSR-Conformer model, have demonstrated remarkable performance in synthetic speech detection, there remains room for architectural improvements. In this paper, we propose a novel approach that replaces the traditional Multi-Layer Perceptron in the XLSR-Conformer model with a Kolmogorov-Arnold Network (KAN), a novel architecture based on the Kolmogorov-Arnold representation theorem. Our results on ASVspoof2021 demonstrate that integrating KAN into the SSL-based models can improve the performance by 60.55% relatively on LA and DF sets, further achieving 0.70% EER on the 21LA set. These findings suggest that incorporating KAN into SSL-based models is a promising direction for advances in synthetic speech detection.

最近语音合成技术的进步导致了越来越先进的口音攻击,对自动扬声器核查系统提出了重大挑战。基于自我监督学习模式的系统,特别是XLSR-Confrent模型,在合成语音探测方面表现显著,但仍有改进建筑的空间。在本文件中,我们提议采用新颖的方法,用Kolmogorov-Arnold 网络(KAN)取代XLSR-Confrent 模型中传统的多功能器,这是一个以Kolmogorov-Arnold 代表方言论为基础的新建筑。我们在ASVspoof 2021年的结果表明,将KAN纳入基于SSLS的模型中,可以相对地提高LA和DF组合的性能60.55%,进一步在21LA集上达到0.70的EER。这些结论表明,将KAN纳入基于SSL的模型是合成语音探测进展的有希望的方向。


Article 233

Title@2025-06-17 (2): REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

Title: REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning 实际检索: 数学理由的回收增量精液预言 2505.20613v2

Authors (14): Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, Haocheng Ju, Peihao Wu, Bryan Dai, Bin Dong

Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tune achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset-comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).

目前,正式的理论证明者在高中和竞争水平数学方面取得了巨大的进步,但其中很少有人能将高级数学概括化为更先进的数学。在本文中,我们展示了真实-Prover-Prover-V1,这是利安4号推动这一边界的一个新的开放源源分步骤理论证明。这个证明者以我们精细调整的大语言模型(REAL-Prover-v1)为基础,并与一个检索系统(Leansearch-PS)结合,显著提高了解决大学一级数学问题的业绩。为了培训真实-Prover-v1,我们开发了HERALD-AF,一个数据提取管道,将自然语言数学问题转换成正式声明,以及一个新的开放源 Lean 4互动环境(Jixia-interactive)来推动综合数据收集。在我们的实验中,我们仅使用受监督的微调获得23.7%成功率(Pass@64)的校准网络数据集成度,与最新艺术(SOATA)模型相匹配。为了进一步评估我们的方法,我们引入了FATE-M,一个以56-P-Pass-64成功率为主基准。


Article 234

Title@2025-06-17 (2): Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment

Title: Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment Akustische Streuung KI für nicht-invasive Objektklassifikationen: Eine Fallstudie zur Haarbewertung 用于非侵入性物体分类的非侵入性物体分类的声波散射AI:关于头发评估的个案研究 2506.14148v1

Authors (3): Long-Vu Hoang, Tuan Nguyen, Tran Huy Dat

This paper presents a novel non-invasive object classification approach using acoustic scattering, demonstrated through a case study on hair assessment. When an incident wave interacts with an object, it generates a scattered acoustic field encoding structural and material properties. By emitting acoustic stimuli and capturing the scattered signals from head-with-hair-sample objects, we classify hair type and moisture using AI-driven, deep-learning-based sound classification. We benchmark comprehensive methods, including (i) fully supervised deep learning, (ii) embedding-based classification, (iii) supervised foundation model fine-tuning, and (iv) self-supervised model fine-tuning. Our best strategy achieves nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model. These results highlight acoustic scattering as a privacy-preserving, non-contact alternative to visual classification, opening huge potential for applications in various industries.

本文展示了一种使用声学散射的新颖的非侵入物体分类方法,通过头发评估案例研究展示了这种方法。当事件波与一个物体发生相互作用时,它产生一个分散的声学场编码结构和材料特性。通过发出声学刺激和从头部和头发抽样物体中捕捉分散的信号,我们使用AI驱动的、基于深层次学习的音质分类,对头发类型和湿度进行分类。我们以综合方法为基准,包括:(一) 充分监督下的深层学习,(二) 嵌入式分类,(三) 受监督的基础模型微调,以及(四) 自我监督的模型微调。我们的最佳战略通过微调一个自我监督模型的所有参数,实现了近90%的分类准确性。这些结果突出表明,声学分散是一种保护隐私的、非接触性的替代视觉分类,为各种行业的应用提供了巨大的潜力。


Article 235

Title@2025-06-17 (2): RadFabric: Agentic AI System with Reasoning Capability for Radiology

Title: RadFabric: Agentic AI System with Reasoning Capability for Radiology RadFabric: Agentisches KI-System mit vernünftiger Kapazität für die Radiologie RadFBRIC:放射学合理能力A.A.A.系统 2506.14142v1

Authors (17): Wenting Chen, Yi Dong, Zhaojun Ding, Yucheng Shi, Yifan Zhou, Fang Zeng, Yijun Luo, Tianyu Lin, Yihang Su, Yichen Wu, Kai Zhang, Zhen Xiang, Tianming Liu, Ninghao Liu, Lichao Sun, Yixuan Yuan, Xiang Li

Chest X ray (CXR) imaging remains a critical diagnostic tool for thoracic conditions, but current automated systems face limitations in pathology coverage, diagnostic accuracy, and integration of visual and textual reasoning. To address these gaps, we propose RadFabric, a multi agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. RadFabric is built on the Model Context Protocol (MCP), enabling modularity, interoperability, and scalability for seamless integration of new diagnostic agents. The system employs specialized CXR agents for pathology detection, an Anatomical Interpretation Agent to map visual findings to precise anatomical structures, and a Reasoning Agent powered by large multimodal reasoning models to synthesize visual, anatomical, and clinical data into transparent and evidence based diagnoses. RadFabric achieves significant performance improvements, with near-perfect detection of challenging pathologies like fractures (1.000 accuracy) and superior overall diagnostic accuracy (0.799) compared to traditional systems (0.229 to 0.527). By integrating cross modal feature alignment and preference-driven reasoning, RadFabric advances AI-driven radiology toward transparent, anatomically precise, and clinically actionable CXR analysis.

胸前X光(CXR)成像仍然是胸腔病情的关键诊断工具,但目前的自动化系统在病理覆盖、诊断准确性以及视觉和文字推理的整合方面面临着限制。为弥补这些差距,我们提议了RadFabric,这是一个多剂、多式推理框架,统一直观和文字分析,以综合 CXR 解释;拉德Fabric建于《示范背景协议》,使模块性、互操作性和可伸缩性成为新诊断剂的无缝结合。该系统使用专门的CXR代理进行病理检测,一个解剖解解剂将视觉结果绘制成精确解剖结构的直观结果图解剖剂,以及一个由大型多式联运推理模型驱动的解说剂,将视觉、解剖和临床数据综合成透明和循证分析。 RadFabric取得了显著的性能改进,对具有挑战性的病理学(如骨折(1000精度)和总体诊断精度(0.799)与传统系统(0.229至0.527)的精确性诊断精确性(0.799)进行跨式特征校正和可偏和可偏向可偏向可偏向的精确的CX轴动性分析。


Article 236

Title@2025-06-17 (2): Personalizing Student-Agent Interactions Using Log-Contextualized Retrieval Augmented Generation (RAG)

Title: Personalizing Student-Agent Interactions Using Log-Contextualized Retrieval Augmented Generation (RAG) Personalisierung von studentisch-agenten Interaktionen mittels Log-Contextualized Retrieval Augmented Generation (RAG) 利用日志-知识检索增强型一代(RAG)实现学生-代理人个性化互动 2505.17238v2

Authors (13): Clayton Cohn, Surya Rayala, Caitlin Snyder, Joyce Fonteles, Shruti Jain, Naveeduddin Mohammed, Umesh Timalsina, Sarah K. Burriss, Ashwin T S, Namrata Srivastava, Menton Deweese, Angela Eeds, Gautam Biswas

Collaborative dialogue offers rich insights into students’ learning and critical thinking, which is essential for personalizing pedagogical agent interactions in STEM+C settings. While large language models (LLMs) facilitate dynamic pedagogical interactions, hallucinations undermine confidence, trust, and instructional value. Retrieval-augmented generation (RAG) grounds LLM outputs in curated knowledge but requires a clear semantic link between user input and a knowledge base, which is often weak in student dialogue. We propose log-contextualized RAG (LC-RAG), which enhances RAG retrieval by using environment logs to contextualize collaborative discourse. Our findings show that LC-RAG improves retrieval over a discourse-only baseline and allows our collaborative peer agent, Copa, to deliver relevant, personalized guidance that supports students’ critical thinking and epistemic decision-making in a collaborative computational modeling environment, C2STEM.

合作对话为学生的学习和批判性思维提供了丰富的洞察力,这对于使STEM+C环境中的教学代理互动个性化至关重要。虽然大型语言模型(LLMs)有助于动态教学互动,但幻觉会破坏信心、信任和教学价值。回溯式一代(RAG)将LLM产出归结为知识培养型,但需要用户投入和知识库之间明确的语义联系,而这在学生对话中往往很薄弱。我们建议对RAG(LC-RAG)进行逻辑化的逻辑化RAG(LC-RAG),通过环境日志将合作对话背景化来增强RAG的检索。我们的调查结果显示,LC-RAG(LLM-RAG)改进了对只讨论基线的检索,使我们的合作同行代理(Copa)能够提供相关、个性化的指导,支持学生在协作的计算模型环境中的批判性思维和认知决策,C2STEM。


Article 237

Title@2025-06-17 (2): AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning

Title: AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning AgentCPM-GUI: Mobile-Use-Agenten mit Verstärkungs-Fine-Tuning bauen Agent CPM-GUI: 制造具有加固精度的移动用途制剂 2506.01391v2

Authors (25): Zhong Zhang, Yaxi Lu, Yikun Fu, Yupeng Huo, Shenzhi Yang, Yesai Wu, Han Si, Xin Cong, Haotian Chen, Yankai Lin, Jie Xie, Wei Zhou, Wang Xu, Yuanheng Zhang, Zhou Su, Zhongwu Zhai, Xiaoming Liu, Yudong Mei, Jianming Xu, Hongyan Tian, Chongyi Wang, Chi Chen, Yuan Yao, Zhiyuan Liu, Maosong Sun

The recent progress of large language model agents has opened new possibilities for automating tasks through graphical user interfaces (GUIs), especially in mobile environments where intelligent interaction can greatly enhance usability. However, practical deployment of such agents remains constrained by several key challenges. Existing training data is often noisy and lack semantic diversity, which hinders the learning of precise grounding and planning. Models trained purely by imitation tend to overfit to seen interface patterns and fail to generalize in unfamiliar scenarios. Moreover, most prior work focuses on English interfaces while overlooks the growing diversity of non-English applications such as those in the Chinese mobile ecosystem. In this work, we present AgentCPM-GUI, an 8B-parameter GUI agent built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception, supervised fine-tuning on high-quality Chinese and English trajectories to imitate human-like actions, and reinforcement fine-tuning with GRPO to improve reasoning capability. We also introduce a compact action space that reduces output length and supports low-latency execution on mobile devices. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks and a new Chinese GUI benchmark called CAGUI, reaching $96.9\%$ Type-Match and $91.3\%$ Exact-Match. To facilitate reproducibility and further research, we publicly release all code, model checkpoint, and evaluation data.

大型语言模型代理人最近的进展为通过图形用户界面(GUIs)实现任务自动化开辟了新的可能性,特别是在智能互动能够大大提高可用性的移动环境中,特别是在智能互动能够大大提高可用性的移动环境中,大型语言模型代理人最近的进展为通过图形用户界面(GUI)实现任务自动化开辟了新的可能性;然而,这些代理人的实际部署仍然受到若干重大挑战的制约; 现有的培训数据往往是吵闹的,缺乏语义多样性,这阻碍了精确地基和规划的学习; 纯粹通过仿造培训的模型往往过度适应界面模式,无法在不熟悉的场景中推广。 此外,大多数以前的工作侧重于英语界面,同时忽视了诸如中国移动生态系统等非英语应用日益多样化的日益多样化的非英语应用; 在这项工作中,我们提出了CPM-GUI(AgM-GUI) (AGI) (AGI) (AGI) (AGI) (AGI) (AGI) (AD) (AGI) (AD) (AD) (AD) (AD) (AD) (AD) (AD) (AD) (ADM-B) (AGI) (AD) (AD) (AD) (T) (AD) (AD) (AD) (S-B) (T) (SD) (AD) (AD) (AD) (AD) (AD) (AD) (AD) (AD) (T) (AD) (AD) (AD) (AD) (AD) (T) (AD) (T) (T) (AD) (T) (T) (AD) (T) (T) (T) (AD) (AD) (AD) (AD) (AD) (AD) (AD) (AD) (AD) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (T) (


Article 238

Title@2025-06-17 (2): Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks

Title: Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks Bewertung von Konsistenz und Reproduzierbarkeit in den Outputs von großen Sprachmodellen: Evidence Across Diverse Finance and Accounting Tasks 评估大语言模式产出的一致性和可复制性:不同财务和会计任务之间的证据 2503.16974v3

Authors (2): Julian Junyan Wang, Victor Xiaoqi Wang

This study provides the first comprehensive assessment of consistency and reproducibility in Large Language Model (LLM) outputs in finance and accounting research. We evaluate how consistently LLMs produce outputs given identical inputs through extensive experimentation with 50 independent runs across five common tasks: classification, sentiment analysis, summarization, text generation, and prediction. Using three OpenAI models (GPT-3.5-turbo, GPT-4o-mini, and GPT-4o), we generate over 3.4 million outputs from diverse financial source texts and data, covering MD&As, FOMC statements, finance news articles, earnings call transcripts, and financial statements. Our findings reveal substantial but task-dependent consistency, with binary classification and sentiment analysis achieving near-perfect reproducibility, while complex tasks show greater variability. More advanced models do not consistently demonstrate better consistency and reproducibility, with task-specific patterns emerging. LLMs significantly outperform expert human annotators in consistency and maintain high agreement even where human experts significantly disagree. We further find that simple aggregation strategies across 3-5 runs dramatically improve consistency. We also find that aggregation may come with an additional benefit of improved accuracy for sentiment analysis when using newer models. Simulation analysis reveals that despite measurable inconsistency in LLM outputs, downstream statistical inferences remain remarkably robust. These findings address concerns about what we term “G-hacking,” the selective reporting of favorable outcomes from multiple Generative AI runs, by demonstrating that such risks are relatively low for finance and accounting tasks.

这项研究首次全面评估了大语言模型(LLM)在财务和会计研究方面产出的一致性和可复制性。我们评估了LLMM如何通过广泛试验,50个独立运行,在五种共同任务(分类、情绪分析、概括化、文本生成和预测)中,通过50个独立运行,以广泛实验的方式,持续地提供相同投入产出。使用三种OpenAI模型(GPT-3.5-turbo、GPT-4-omi和GPT-4o),我们从多种资金来源文本和数据(包括MD&As、FOM声明、金融文章、收入调用记录和财务报表)中产生340多万产出。我们发现,我们的调查结果显示,大量但任务的一致性,通过二元分类和情绪分析,实现近效性可复制性可复制性可复制性可实现的可复制性,而复杂的任务则显示出更大的可变性。更先进的模型显示,尽管有可计量的内部审计结果,但我们在进行这种可计量性分析时,仍能从统计性结果中获得更多的准确性分析。


Article 239

Title@2025-06-17 (2): Sampling from Your Language Model One Byte at a Time

Title: Sampling from Your Language Model One Byte at a Time Proben aus Ihrem Sprachmodell ein Byte zu einer Zeit 一次抽取您语言模式一字节的样本 2506.14123v1

Authors (4): Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh

Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model’s generations. For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. This Prompt Boundary Problem (PBP) also arises in languages such as Chinese and in code generation, where tokens often do not line up with syntactic boundaries. Additionally mismatching tokenizers often hinder model composition and interoperability. For example, it is not possible to directly ensemble models with different tokenizers due to their mismatching vocabularies. To address these issues, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM, without changing its generative distribution at the text level. Our method efficient solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time as well as transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals.

现代语言模型几乎普遍使用 Tokenization (PBP) , 现代语言模型几乎普遍使用 Tokenization , 使得使用多字或多字符符号的高效文本表达方式。 但是, 先前的工作表明, 象征性化可能给模型的几代代人带来扭曲。 例如, 经常建议用户不要用空间结束其提示。 因为它阻止模型将空间作为下一个符号的一部分纳入空间。 这个快速边界问题( PBP) 也出现在中文等语言和代码生成中, 代号往往不与语系边界一致。 此外, 代号配对常常阻碍模型的构成和互操作性。 例如, 代号化会直接混合模型会与不同的代号符号发生扭曲。 例如, 无法直接结合不同的代号符号模型, 解决这些问题, 我们提出了一个推论时间方法, 将任何带有 BPEPE 代号符号的自动递增LM 转换为字符级或字级LM , 但不改变其在文本层次上的归称性分布。 我们的方法高效解决了 PBPBP , 也能够将语言模型的vobendorational 模式与不同的代号模型相统一起来, , 在不同的代号代号的代号的代号的代号的代号的代号上, , 将一个代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号, , , 演示号的代号的代号的代号的代号的代号的代号的代号的代号的代号, , , , , , , 的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的代号的


Article 240

Title@2025-06-17 (2): Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression

Title: Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression Batch-Max: Höherer LLM-Durchsatz mit größeren Batch-Größen und KV Cache-Kompression 批量最大:使用大批量大小和 KV缓存压缩的高级 LLM 输送量 2412.05693v2

Authors (3): Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh

Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used resulting in significantly higher throughput while still maintaining the original model’s accuracy.

一些工程制定了驱逐政策,从 KV 缓存中去除关键值对配对, 以便更有效地推断。 重点是在输入提示处理后压缩 KV 缓存, 以便更快的代号生成 。 在 GPU 内存有限的情况下, 当输入环境长于生成时间长度时, 我们显示, 通过在输入处理阶段压缩 KV 缓存, 还可以使用更大的批量大小, 从而在保持原始模型准确性的同时, 导致显著更高的吞吐量 。


Article 241

Title@2025-06-17 (2): Essential-Web v1.0: 24T tokens of organized web data

Title: Essential-Web v1.0: 24T tokens of organized web data Essential-Web v1.0: 24T Token von organisierten Web-Daten 基本Web v1.0: 24个有组织网络数据标记 2506.14111v1

Authors (25): Essential AI, :, Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani

Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0

在语言模式如何获得技能和知识方面,数据在数据是如何获得技能和知识的最突出作用。缺乏大规模、组织完善的培训前数据集导致费用昂贵和难以获取的数据管道。我们展示了24千兆吨基Web v1.0,每份文件都附加了12类分类,涵盖主题、格式、内容复杂性和质量。分类标签由EAI-Distill-0.5b制作,这是经过微调的0.5b参数模型,在Qwen2.5-32B-Instruct的3%内达成通知员协议。只要SQL式过滤器,我们就能在数学(-8.0%相对于SOTA)、网络代码(+14.3%)、STEM(+24.5%)和医学(+8.6%)方面获得有竞争力的网络校对数据集。基本Web v1.0可在HuggingFace上查阅:https://huggingface.co/dataset/EssentialAI/imical-web-v1.0。


Article 242

Title@2025-06-17 (2): SAE-V: Interpreting Multimodal Models for Enhanced Alignment

Title: SAE-V: Interpreting Multimodal Models for Enhanced Alignment SAE-V: Verdolmetschen multimodaler Modelle für eine verbesserte Ausrichtung SAE-V: 解释增强协调的多模式模型 2502.17514v2

Authors (4): Hantao Lou, Changye Li, Jiaming Ji, Yaodong Yang

With the integration of image modality, the semantic space of multimodal large language models (MLLMs) is more complex than text-only models, making their interpretability more challenging and their alignment less stable, particularly susceptible to low-quality data, which can lead to inconsistencies between modalities, hallucinations, and biased outputs. As a result, developing interpretability methods for MLLMs is crucial for improving alignment quality and efficiency. In text-only LLMs, Sparse Autoencoders (SAEs) have gained attention for their ability to interpret latent representations. However, extending SAEs to multimodal settings presents new challenges due to modality fusion and the difficulty of isolating cross-modal representations. To address these challenges, we introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to MLLMs. By identifying and analyzing interpretable features along with their corresponding data, SAE-V enables fine-grained interpretation of both model behavior and data quality, facilitating a deeper understanding of cross-modal interactions and alignment dynamics. Moreover, by utilizing cross-modal feature weighting, SAE-V provides an intrinsic data filtering mechanism to enhance model alignment without requiring additional models. Specifically, when applied to the alignment process of MLLMs, SAE-V-based data filtering methods could achieve more than 110% performance with less than 50% data. Our results highlight SAE-V’s ability to enhance interpretability and alignment in MLLMs, providing insights into their internal mechanisms.

随着图像模式的整合,多式联运大语言模型(MLLMs)的语义空间比只文本模式更为复杂,使得其解释性更具挑战性,其匹配性不那么稳定,特别是低质量数据,这可能导致模式、幻觉和偏差产出之间的不一致。因此,制定MLLMM的解读性方法对于提高一致性和效率至关重要。在只使用文本的LLMs中,Sparse Autoencorders(SAEs)因其解释潜在代表性的能力而得到细微的诠释。然而,将SAE-V扩大到多式联运环境带来了新的挑战,因为模式合并和难以分离跨模式代表。为了应对这些挑战,我们引入了SAE-V这个机械化的解释性框架,将SAE范式扩展到MLMs。通过识别和分析可解释性特征,使得对模型行为和数据质量的精确解释性能得到精细化解释,便于更深入地理解跨模式的相互作用和调整动态。此外,通过使用跨模式的权重度,SAE-VVVV-MS-MS-MS-BLM的内在比我们的数据过滤性解释性机制在不要求具体数据化模型下实现数据调整性方法的情况下,可以使SAV-LR-LM-LM-LM-LM-LM的内在的升级机制得到更多的数据调整。


Article 243

Title@2025-06-17 (2): Innovating China’s Intangible Cultural Heritage with DeepSeek + MidJourney: The Case of Yangliuqing theme Woodblock Prints

Title: Innovating China’s Intangible Cultural Heritage with DeepSeek + MidJourney: The Case of Yangliuqing theme Woodblock Prints Innovieren Chinas immaterielles Kulturerbe mit DeepSeek + MidJourney: Der Fall des Yangliuqing-Themas Woodblock Prints 以深色+中途:杨柳庆主题案例 2506.14104v1

Authors (3): RuiKun Yang, ZhongLiang Wei, Longdi Xian

Yangliuqing woodblock prints, a cornerstone of China’s intangible cultural heritage, are celebrated for their intricate designs and vibrant colors. However, preserving these traditional art forms while fostering innovation presents significant challenges. This study explores the DeepSeek + MidJourney approach to generating creative, themed Yangliuqing woodblock prints focused on the fight against COVID-19 and depicting joyous winners. Using Fr'echet Inception Distance (FID) scores for evaluation, the method that combined DeepSeek-generated thematic prompts, MidJourney-generated thematic images, original Yangliuqing prints, and DeepSeek-generated key prompts in MidJourney-generated outputs achieved the lowest mean FID score (150.2) with minimal variability ({\sigma} = 4.9). Additionally, feedback from 62 participants, collected via questionnaires, confirmed that this hybrid approach produced the most representative results. Moreover, the questionnaire data revealed that participants demonstrated the highest willingness to promote traditional culture and the strongest interest in consuming the AI-generated images produced through this method. These findings underscore the effectiveness of an innovative approach that seamlessly blends traditional artistic elements with modern AI-driven creativity, ensuring both cultural preservation and contemporary relevance.

作为中国无形文化遗产基石的杨利庆木板印刷品,以其复杂设计和生机勃勃的色彩来庆祝。然而,保护这些传统艺术形式,同时促进创新,提出了重大挑战。本研究探索了DeepSeek + MidJourney 的创造创意方法,即Yangliuqing 木板印刷品,侧重于打击COVID-19的斗争,并描绘了最快乐的赢家。使用Fr'echet Inception Convention Least(FID)评分进行评估,使用DeepSeek 生成的专题提示、MidJourney生成的主题图片、原始的Yangliuqing印本和DeepSeek生成的关键提示,在MidJourney 生成的产出中实现了最低的平均FID分(150.2),且差异最小(=4.9)。此外,通过问卷收集的62名参与者的反馈证实,这种混合方法产生了最有代表性的结果。此外,问卷数据显示,与会者表现出促进传统文化的最高意愿,最有兴趣使用通过这种方法生成的AI-生成的图像。这些调查结果突出表明,确保现代艺术要素与现代AI驱动的现代艺术成。


Article 244

Title@2025-06-17 (2): Abstract Meaning Representation for Hospital Discharge Summarization

Title: Abstract Meaning Representation for Hospital Discharge Summarization Abstract Bedeutung Vertretung für Krankenhaus Entladung Zusammenfassung B. 医院免住院费摘要说明 2506.14101v1

Authors (4): Paul Landes, Sitara Rao, Aaron Jeremy Chaise, Barbara Di Eugenio

The Achilles heel of Large Language Models (LLMs) is hallucination, which has drastic consequences for the clinical domain. This is particularly important with regards to automatically generating discharge summaries (a lengthy medical document that summarizes a hospital in-patient visit). Automatically generating these summaries would free physicians to care for patients and reduce documentation burden. The goal of this work is to discover new methods that combine language-based graphs and deep learning models to address provenance of content and trustworthiness in automatic summarization. Our method shows impressive reliability results on the publicly available Medical Information Mart for Intensive III (MIMIC-III) corpus and clinical notes written by physicians at Anonymous Hospital. rovide our method, generated discharge ary output examples, source code and trained models.

大型语言模型(LLMS)的致命脚跟是幻觉,对临床领域有重大影响,对于自动生成出院摘要(长篇医疗文件,概述医院住院检查情况)尤其重要,自动生成这些摘要可以让医生对病人进行护理,减轻文件负担,这项工作的目标是发现新方法,将语言图表和深层学习模型结合起来,解决自动合成内容和可信度的来源,我们的方法在公开提供的三号强化医疗信息(MIMIC-III)中显示出令人印象深刻的可靠性结果,以及匿名医院医生撰写的临床说明。我们的方法是释放出一个输出示例、源代码和经过培训的模式。


Article 245

Title@2025-06-17 (2): Enhancing Clinical Models with Pseudo Data for De-identification

Title: Enhancing Clinical Models with Pseudo Data for De-identification Verbesserung klinischer Modelle mit Pseudo-Daten zur De-Identifizierung 利用假数据加强临床模型,以进行分辨 2506.12674v2

Authors (4): Paul Landes, Aaron J Chaise, Tarak Nath Nandi, Ravi K Madduri

Many models are pretrained on redacted text for privacy reasons. Clinical foundation models are often trained on de-identified text, which uses special syntax (masked) text in place of protected health information. Even though these models have increased in popularity, there has been little effort in understanding the effects of training them on redacted text. In this work, we pretrain several encoder-only models on a dataset that contains redacted text and a version with replaced realistic pseudo text. We then fine-tuned models for the protected health information de-identification task and show how our methods significantly outperform previous baselines. The contributions of this work include: a) our novel, and yet surprising findings with training recommendations, b) redacted text replacements used to produce the pseudo dataset, c) pretrained embeddings and fine-tuned task specific models, and d) freely available pseudo training dataset generation and model source code used in our experiments.

许多模型由于隐私原因对修改文本进行了预先培训。临床基础模型经常就脱名文本进行培训,该文本使用特殊的语法(假冒)文本取代受保护的健康信息。尽管这些模型在受欢迎程度上有所提高,但在了解这些模型对修改文本培训的效果方面没有做多少努力。在这项工作中,我们预先将若干只使用编码器的模型放在包含修改文本的数据集上,一个版本则替换了现实的假文本。我们随后对保护健康信息去识别任务的模型进行了微调,并展示了我们的方法如何大大超过以前的基线。这项工作的贡献包括:(a)我们的新颖的,但有培训建议但令人惊讶的发现,(b)用于制作伪数据集的修改文本替换,(c)预先培训的嵌入和微调特定任务模型,(d)我们实验中使用的假培训数据集生成和模型源代码。


Article 246

Title@2025-06-17 (2): InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking

Title: InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking InsertRank: LLMs können über BM25-Scores nachdenken, um Listwise zu verbessern. 插入Rank:LLMs可以比 BB25 分数解释 BM25 分数来改进列表排序 2506.14086v1

Authors (3): Rahul Seetharaman, Kaustubh D. Dhole, Aman Bansal

Large Language Models (LLMs) have demonstrated significant strides across various information retrieval tasks, particularly as rerankers, owing to their strong generalization and knowledge-transfer capabilities acquired from extensive pretraining. In parallel, the rise of LLM-based chat interfaces has raised user expectations, encouraging users to pose more complex queries that necessitate retrieval by ``reasoning’’ over documents rather than through simple keyword matching or semantic similarity. While some recent efforts have exploited reasoning abilities of LLMs for reranking such queries, considerable potential for improvement remains. In that regards, we introduce InsertRank, an LLM-based reranker that leverages lexical signals like BM25 scores during reranking to further improve retrieval performance. InsertRank demonstrates improved retrieval effectiveness on – BRIGHT, a reasoning benchmark spanning 12 diverse domains, and R2MED, a specialized medical reasoning retrieval benchmark spanning 8 different tasks. We conduct an exhaustive evaluation and several ablation studies and demonstrate that InsertRank consistently improves retrieval effectiveness across multiple families of LLMs, including GPT, Gemini, and Deepseek models. %In addition, we also conduct ablation studies on normalization by varying the scale of the BM25 scores, and positional bias by shuffling the order of the documents. With Deepseek-R1, InsertRank achieves a score of 37.5 on the BRIGHT benchmark. and 51.1 on the R2MED benchmark, surpassing previous methods.

大型语言模型(LLMS)在各种信息检索任务中显示出了长足的进步,特别是重新排序者,这是因为它们具有强大的一般化和知识转让能力,从广泛的培训前获得的知识转让能力。与此同时,基于LLM的聊天界面的崛起提高了用户的期望,鼓励用户提出更复杂的查询,从而需要通过“依据’”而不是简单的关键词匹配或语义相似性来检索文件。虽然最近的一些努力利用LLMS的推理能力来重新排列这类查询,但仍有很大的改进潜力。在这方面,我们引入了InterRank,一个基于LAM的重新排序者,利用在重新排名期间的BM25分等法律信号来进一步提高检索绩效。插入Rank显示了检索效率的提高 – – Braight,一个跨越12个不同域的推理基准,R2MED,一个专门的医学推理检索基准,涉及8项不同的任务。我们进行了详尽的评价和一些联系研究,并表明,Excank Right2LMS(包括GPT, Gemini, 和 Deepseelk II ) 模型的检索信号信号信号信号信号信号信号信号。加 5,我们还进行了关于BRALBRLBRBR1 的分级标准的标准化的分级和分级标准。


Article 247

Title@2025-06-16 (1): Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data

Title: Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data Automatische Extraktion von Clausal Embedding basierend auf großformatigen englischen Textdaten 根据大比例英文文本数据自动提取 2506.14064v1

Authors (5): Iona Carslaw, Sivan Milton, Nicolas Navarre, Ciyang Qing, Wataru Uegaki

For linguists, embedded clauses have been of special interest because of their intricate distribution of syntactic and semantic features. Yet, current research relies on schematically created language examples to investigate these constructions, missing out on statistical information and naturally-occurring examples that can be gained from large language corpora. Thus, we present a methodological approach for detecting and annotating naturally-occurring examples of English embedded clauses in large-scale text data using constituency parsing and a set of parsing heuristics. Our tool has been evaluated on our dataset Golden Embedded Clause Set (GECS), which includes hand-annotated examples of naturally-occurring English embedded clause sentences. Finally, we present a large-scale dataset of naturally-occurring English embedded clauses which we have extracted from the open-source corpus Dolma using our extraction tool.

对于语言学家来说,嵌入条款因其综合和语义特征的复杂分布而特别引人注意。然而,目前的研究依靠有神学学学的文字实例来调查这些构造,缺少统计信息和可以从大语言公司获得的自然现象实例。因此,我们提出了一个方法方法,用以利用群落划分和一套分解超理论来探测和批注大规模文本数据中自然出现的英文嵌入条款实例。我们的工具已经用我们的数据集金嵌入条款集(GECS)进行了评估,其中包括自然生成的英语嵌入条款的手动附加示例。最后,我们用我们提取工具从开源的Dolma 中提取的关于自然生成的英语嵌入条款的大规模数据集。


Article 248

Title@2025-06-16 (1): Ace-CEFR – A Dataset for Automated Evaluation of the Linguistic Difficulty of Conversational Texts for LLM Applications

Title: Ace-CEFR – A Dataset for Automated Evaluation of the Linguistic Difficulty of Conversational Texts for LLM Applications Ace-CEFR – Ein Datensatz für die automatisierte Auswertung der sprachlichen Schwierigkeit von Konversationstexten für LLM-Anwendungen Ace-CEFR – – 用于自动评价用于LLM应用的对读文本语言难度的数据集 2506.14046v1

Authors (7): David Kogan, Max Schumacher, Sam Nguyen, Masanori Suzuki, Melissa Smith, Chloe Sophia Bellows, Jared Bernstein

There is an unmet need to evaluate the language difficulty of short, conversational passages of text, particularly for training and filtering Large Language Models (LLMs). We introduce Ace-CEFR, a dataset of English conversational text passages expert-annotated with their corresponding level of text difficulty. We experiment with several models on Ace-CEFR, including Transformer-based models and LLMs. We show that models trained on Ace-CEFR can measure text difficulty more accurately than human experts and have latency appropriate to production environments. Finally, we release the Ace-CEFR dataset to the public for research and development.

需要评估文本简短的谈话段落的语言困难,特别是培训和过滤大语言模型的语言困难。我们引入了Ace-CEFR,这是英语谈话文本段落的数据集,具有相应的文本困难程度的专家说明。我们实验了Ace-CEFR的若干模型,包括以变换器为基础的模型和LLMS。我们表明,在Ace-CEFR培训的模型比人类专家更准确地测量文本困难,并且具有适合生产环境的耐久性。最后,我们向公众发布Ace-CEFR数据集,用于研究和开发。


Article 249

Title@2025-06-16 (1): An Interdisciplinary Review of Commonsense Reasoning and Intent Detection

Title: An Interdisciplinary Review of Commonsense Reasoning and Intent Detection Eine interdisziplinäre Überprüfung von Commonsense-Vernunft und Intent Detection 对常见理由和意图探测的跨学科审查 2506.14040v1

Authors (1): Md Nazmus Sakib

This review explores recent advances in commonsense reasoning and intent detection, two key challenges in natural language understanding. We analyze 28 papers from ACL, EMNLP, and CHI (2020-2025), organizing them by methodology and application. Commonsense reasoning is reviewed across zero-shot learning, cultural adaptation, structured evaluation, and interactive contexts. Intent detection is examined through open-set models, generative formulations, clustering, and human-centered systems. By bridging insights from NLP and HCI, we highlight emerging trends toward more adaptive, multilingual, and context-aware models, and identify key gaps in grounding, generalization, and benchmark design.

本审查报告探讨了在常识推理和意图探测方面的最新进展,这是自然语言理解方面的两个主要挑战。我们分析了来自ACL、EMNLP和CHI(2020-2025年)的28份文件,按方法和应用加以组织。根据零点学习、文化适应、结构化评估和互动背景,对常识推理进行了审查。通过开放模型、基因配方、集群和以人为本的系统,对意图探测进行了研究。通过连接来自NLP和HCI的洞察,我们突出强调了在更适应性、多语种和符合背景的模型方面新出现的趋势,并找出了在定位、普及和基准设计方面的主要差距。


Article 250

Title@2025-06-16 (1): Beyond Browsing: API-Based Web Agents

Title: Beyond Browsing: API-Based Web Agents Jenseits von Browsing: API-basierte Web-Agenten 超出浏览范围: API 网络代理 2410.16464v3

Authors (4): Yueqi Song, Frank Xu, Shuyan Zhou, Graham Neubig

Web browsers are a portal to the internet, where much of human activity is undertaken. Thus, there has been significant research work in AI agents that interact with the internet through web browsing. However, there is also another interface designed specifically for machine interaction with online content: application programming interfaces (APIs). In this paper we ask – what if we were to take tasks traditionally tackled by Browsing Agents, and give AI agents access to APIs? To do so, we propose two varieties of agents: (1) an API-calling agent that attempts to perform online tasks through APIs only, similar to traditional coding agents, and (2) a Hybrid Agent that can interact with online data through both web browsing and APIs. In experiments on WebArena, a widely-used and realistic benchmark for web navigation tasks, we find that API-Based Agents outperform web Browsing Agents. Hybrid Agents out-perform both others nearly uniformly across tasks, resulting in a more than 24.0% absolute improvement over web browsing alone, achieving a success rate of 38.9%, the SOTA performance among task-agnostic agents. These results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.

网络浏览器是互联网的门户, 大部分人类活动都是在互联网上进行。 因此, AI代理商通过网络浏览与互联网互动, 已经做了大量研究工作。 但是, 还有一个专门设计用于与在线内容进行机器互动的界面: 应用程序编程界面(APIs) 。 在本文中, 我们问 – 如果我们要承担传统上由浏览代理商处理的任务, 并允许AI代理商访问API ? 要做到这一点, 我们建议两种类型的代理商:(1) API呼叫代理商, 仅试图通过API执行在线任务, 类似于传统的编码代理商; (2) 混合代理商, 可通过网络浏览和API 与在线数据互动。 在WebArena的实验中, 应用范围广泛和现实的网络导航任务基准, 我们发现基于API的代理商超越了网络浏览代理商的功能。 混合代理商在各项任务之间几乎一致地调整了其他两种代理商, 导致仅仅通过网络浏览器完成超过24.0%的绝对改进, 成功率达到38.9%, SOTA的绩效显示, 仅具有吸引力的网络浏览代理商的替代成果。


Article 251

Title@2025-06-16 (1): MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation

Title: MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation MultiFinBen: Ein multilingualer, multimodaler und problemorientierter Benchmark für die finanzielle LLM-Bewertung MultiFinBen: 财务LLM评价的多种语言、多种模式和困难软件基准 2506.14028v1

Authors (44): Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Jeff Zhao, Huan He, Yi Han, Yun Feng, Yuechen Jiang, Yupeng Cao, Haohang Li, Yangyang Yu, Xiaoyu Wang, Penglei Gao, Shengyuan Lin, Keyi Wang, Shanshan Yang, Yilun Zhao, Zhiwei Liu, Peng Lu, Jerry Huang, Suyuchen Wang, Triantafillos Papadopoulos, Polydoros Giannouris, Efstathia Soufleri, Nuo Chen, Guojun Xiong, Zhiyang Deng, Yijia Zhao, Mingquan Lin, Meikang Qiu, Kaleb E Smith, Arman Cohan, Xiao-Yang Liu, Jimin Huang, Alejandro Lopez-Lira, Xi Chen, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou, Qianqian Xie

Recent advances in large language models (LLMs) have accelerated progress in financial NLP and applications, yet existing benchmarks remain limited to monolingual and unimodal settings, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. We introduce MultiFinBen, the first multilingual and multimodal benchmark tailored to the global financial domain, evaluating LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. We introduce two novel tasks, including PolyFiQA-Easy and PolyFiQA-Expert, the first multilingual financial benchmarks requiring models to perform complex reasoning over mixed-language inputs; and EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to extract and reason over information from visual-text financial documents. Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark rather than simple aggregation existing datasets. Extensive evaluation of 22 state-of-the-art models reveals that even the strongest models, despite their general multimodal and multilingual capabilities, struggle dramatically when faced with complex cross-lingual and multimodal tasks in financial domain. MultiFinBen is publicly released to foster transparent, reproducible, and inclusive progress in financial studies and applications.

大型语言模式(LLMS)的近期进展加快了金融NLP和应用方面的进展,但现有基准仍然局限于单一语言和单一方式的环境,往往过度依赖简单的任务,未能反映真实世界金融通信的复杂性。我们引入了多种多种语言和多式联运基准,这是针对全球金融领域的首个多语言和多模式基准,通过模式(文字、视觉、音频)和语言环境(语言、双语、多语言)对特定领域的任务进行评估。我们引入了两项新任务,包括PolyFiQA-Easy和PolyFiQA-Expert,这是第一个要求模式对混合语言投入进行复杂推理的多语言金融基准;以及EnglishOCR和SpanOOCR,这是OSCR的首个由OCR组成的金融QA任务,它挑战从视觉文本金融文件(文字、视觉、视觉、语言)和语言环境(语言)中提取信息并解释信息的模式。此外,我们提出了一种动态、难辨的甄选机制,而不是简单的现有数据集。对22个状态模型进行广泛评估表明,即使是最强的模型也是最强的模型,尽管其一般的多语言和多语言和多种金融领域应用能力,但面对着着复杂的复杂和多式的、快速的、快速的、快速的、激烈的金融应用。


Article 252

Title@2025-06-16 (1): Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text

Title: Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text Lost in the Mix: LLM-Verständnis von Code-Switched Text bewerten 在混合中丢失:评估LLM对代码转换文本的理解 2506.14012v1

Authors (4): Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang

Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities, and increasingly prevalent in online content, where users naturally mix languages in everyday communication. As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text. This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. While degradation is evident when foreign tokens disrupt English text$\unicode{x2013}$even under linguistic constraints$\unicode{x2013}$embedding English into other languages often improves comprehension. Though prompting yields mixed results, fine-tuning offers a more stable path to degradation mitigation.

代码转换(CSW)是两种或两种以上语言在单一语句中交替的行为,这种现象在多语言社区中很普遍,在网上内容中越来越普遍,用户在日常通信中自然地混合语言,因此,现在对内容处理和生成至关重要的大语言模型(LLMs)经常受到代码转换投入的影响。鉴于其广泛使用,了解LLMs如何处理和解释这种混合语言文本至关重要。本文件通过生成常识推理和理解基准的版本,对代码转换法下的LLM理解进行系统评估。尽管在语言限制下,将英语混合成其他语言的LLMs(LMs)模式(LLMs)往往会提高理解力,但微调能产生混杂的结果,但能提供更稳定的减少退化的途径。


Article 253

Title@2025-06-16 (1): Towards Geo-Culturally Grounded LLM Generations

Title: Towards Geo-Culturally Grounded LLM Generations Auf dem Weg zu geokulturellen LLM-Generationen 走向地球环基LLM 代 2502.13497v3

Authors (5): Piyawat Lertvittayakumjorn, David Kinney, Vinodkumar Prabhakaran, Donald Martin Jr., Sunipa Dev

Generative large language models (LLMs) have demonstrated gaps in diverse cultural awareness across the globe. We investigate the effect of retrieval augmented generation and search-grounding techniques on LLMs’ ability to display familiarity with various national cultures. Specifically, we compare the performance of standard LLMs, LLMs augmented with retrievals from a bespoke knowledge base (i.e., KB grounding), and LLMs augmented with retrievals from a web search (i.e., search grounding) on multiple cultural awareness benchmarks. We find that search grounding significantly improves the LLM performance on multiple-choice benchmarks that test propositional knowledge (e.g., cultural norms, artifacts, and institutions), while KB grounding’s effectiveness is limited by inadequate knowledge base coverage and a suboptimal retriever. However, search grounding also increases the risk of stereotypical judgments by language models and fails to improve evaluators’ judgments of cultural familiarity in a human evaluation with adequate statistical power. These results highlight the distinction between propositional cultural knowledge and open-ended cultural fluency when it comes to evaluating LLMs’ cultural awareness.

· 我们调查了检索增强生成和搜索地面技术对LLMS显示熟悉各种民族文化的能力的影响,具体地说,我们比较标准LMS的性能、标准LMS的性能、从一个言语知识库(即KB地基)检索的增强LMS的性能、以及从多种文化意识基准的网络搜索(即搜索地基)检索的增强LLMS。我们发现,以搜索为基础极大地提高了LLM在多种选择基准方面的表现,这些基准测试了虚拟知识(例如文化规范、文物和机构),而KB地基的功效因知识基础覆盖面不足和亚优性检索器有限而受到限制。然而,搜索地基还增加了语言模型的定型判断风险,未能提高评价员对具有充分统计能力的人类评估中文化熟悉程度的判断能力。这些结果突出表明,在评价LMS的文化意识时,要区分虚拟文化知识和开放式文化流畅度。


Article 254

Title@2025-06-16 (1): MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification

Title: MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification MultiMatch: Multihead-Konsistenzregularisierung passend zur semi-überwachten Textklassifikation 多匹配: 用于半有效文本分类的多标题一致性规则化 2506.07801v2

Authors (5): Iustin Sirbu, Robert-Adrian Popovici, Cornelia Caragea, Stefan Trausan-Matu, Traian Rebedea

We introduce MultiMatch, a novel semi-supervised learning (SSL) algorithm combining the paradigms of co-training and consistency regularization with pseudo-labeling. At its core, MultiMatch features a three-fold pseudo-label weighting module designed for three key purposes: selecting and filtering pseudo-labels based on head agreement and model confidence, and weighting them according to the perceived classification difficulty. This novel module enhances and unifies three existing techniques – heads agreement from Multihead Co-training, self-adaptive thresholds from FreeMatch, and Average Pseudo-Margins from MarginMatch – resulting in a holistic approach that improves robustness and performance in SSL settings. Experimental results on benchmark datasets highlight the superior performance of MultiMatch, achieving state-of-the-art results on 9 out of 10 setups from 5 natural language processing datasets and ranking first according to the Friedman test among 19 methods. Furthermore, MultiMatch demonstrates exceptional robustness in highly imbalanced settings, outperforming the second-best approach by 3.26% – and data imbalance is a key factor for many text classification tasks.

我们引入了“多匹配”这一新型的半监督学习算法(SSL),将共同培训和一致性正规化模式与假标签相结合。“多匹配”的核心特征是三重假标签加权模块,该模块为三大目的设计:根据头项协议和模型信心选择和过滤假标签,并根据认知到的分类困难对其进行加权。这个新模块增强并统一了三种现有技术:多头联合培训的负责人协议、FreeMatch的自适应阈值和MarginMatch的平均普西多-马克斯 – – 由此形成了一种整体方法,提高了SSL环境中的稳健性和性能。“多匹配”的实验结果凸显了“多匹配”的优异性,在5个自然语言处理数据集的10个数据集中实现了最先进的结果,并根据弗里德曼测试在19种方法中排名第一。此外,“多匹配”显示了高度失衡环境中的超强性强性,比第二最佳方法高3.26% – 数据失衡是许多文本分类任务的一个关键因素。


Article 255

Title@2025-06-16 (1): ETM: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models

Title: ETM: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models ETM: Moderne Einblicke in die Perspektive der Text-zu-SQL-Bewertung im Zeitalter großer Sprachmodelle ETM:从现代视角看待大语言模式时代的文本到SQL评价 2407.07313v4

Authors (3): Benjamin G. Ascoli, Yasoda Sai Ram Kandikonda, Jinho D. Choi

The task of Text-to-SQL enables anyone to retrieve information from SQL databases using natural language. While this task has made substantial progress, the two primary evaluation metrics - Execution Accuracy (EXE) and Exact Set Matching Accuracy (ESM) - suffer from inherent limitations that can misrepresent performance. Specifically, ESM’s rigid matching overlooks semantically correct but stylistically different queries, whereas EXE can overestimate correctness by ignoring structural errors that yield correct outputs. These shortcomings become especially problematic when assessing outputs from large language model (LLM)-based approaches without fine-tuning, which vary more in style and structure compared to their fine-tuned counterparts. Thus, we introduce a new metric, Enhanced Tree Matching (ETM), which mitigates these issues by comparing queries using both syntactic and semantic elements. Through evaluating nine LLM-based models, we show that EXE and ESM can produce false positive and negative rates as high as 23.0% and 28.9%, while ETM reduces these rates to 0.3% and 2.7%, respectively. We release our ETM script as open source, offering the community a more robust and reliable approach to evaluating Text-to-SQL.

文本到 SQL 的任务使任何人都能够使用自然语言从 SQL 数据库检索信息。 虽然任务已经取得了实质性进展, 但两个主要评价指标-执行准确性(EXE)和精密匹配准确性(ESM)-都存在固有的限制,这些限制可能扭曲性能。 具体地说, 硬性匹配忽略了语义正确但有条不紊的查询, 而EXE 忽视结构错误,从而可以高估正确性, 从而得出正确的产出。 在评估大型语言模型(LLLM)方法(LLLM)的输出时,这些缺陷特别成问题,这些方法在风格和结构上与精确的对应方法相比,差异更大。 因此,我们引入了一个新的指标,即强化树匹配(ETM),通过使用同步和语义元素比较询问来缓解这些问题。 通过对基于LLM 的九个模型进行评估,我们发现EXE 和ES 可以产生高达23.0%和28.9%的虚假正和负率。 而ETM 将这些率分别降低到0.3%和2.7%。 我们把ETM 的ETM 脚本作为更可靠的开放源,我们向开放的版本提供更可靠的版本。


Article 256

Title@2025-06-16 (1): Are manual annotations necessary for statutory interpretations retrieval?

Title: Are manual annotations necessary for statutory interpretations retrieval? Sind manuelle Anmerkungen für die Rückgewinnung gesetzlicher Interpretationen erforderlich? 法定解释检索是否需要人工说明? 2506.13965v1

Authors (4): Aleksander Smywiński-Pohl, Tomer Libal, Adam Kaczmarczyk, Magdalena Król

One of the elements of legal research is looking for cases where judges have extended the meaning of a legal concept by providing interpretations of what a concept means or does not mean. This allow legal professionals to use such interpretations as precedents as well as laymen to better understand the legal concept. The state-of-the-art approach for retrieving the most relevant interpretations for these concepts currently depends on the ranking of sentences and the training of language models over annotated examples. That manual annotation process can be quite expensive and need to be repeated for each such concept, which prompted recent research in trying to automate this process. In this paper, we highlight the results of various experiments conducted to determine the volume, scope and even the need for manual annotation. First of all, we check what is the optimal number of annotations per a legal concept. Second, we check if we can draw the sentences for annotation randomly or there is a gain in the performance of the model, when only the best candidates are annotated. As the last question we check what is the outcome of automating the annotation process with the help of an LLM.

法律研究的一个要素是,在法官通过解释概念的含义而扩大了法律概念的含义的情况下,法律研究的一个要素正在寻找一些案例,即法官通过解释概念的含义而扩大了法律概念的含义,这使法律专业人员能够使用这种解释作为先例,以及非专业人员来更好地了解法律概念。目前,为对这些概念重新作出最相关的解释,最先进的方法取决于对判决的排序和对语言模型的培训,而不是附加说明的例子。手册说明过程可能费用很高,需要为每个这类概念重复进行,这促使最近对如何使这一进程自动化进行研究。在本文件中,我们强调为确定数量、范围、甚至人工注解的需要而进行的各种实验的结果。首先,我们检查每个法律概念的最佳说明数量。第二,我们检查我们是否可以随意引用这些句子来作注解,或者在模型的运行中是否有好处,只有最佳候选人才加注。最后一个问题是,我们检查在LM的帮助下将注解过程自动化的结果是什么。


Article 257

Title@2025-06-16 (1): ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection

Title: ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection ASMR: Augmenting Life Szenario mit großen Generativen Modellen für die Robotic Action Reflection ASMR:使用大型机器人行动反射生成模型扩大寿命设想 2506.13956v1

Authors (5): Shang-Chi Tsai, Seiya Kawano, Angel Garcia Contreras, Koichiro Yoshino, Yun-Nung Chen

When designing robots to assist in everyday human activities, it is crucial to enhance user requests with visual cues from their surroundings for improved intent understanding. This process is defined as a multimodal classification task. However, gathering a large-scale dataset encompassing both visual and linguistic elements for model training is challenging and time-consuming. To address this issue, our paper introduces a novel framework focusing on data augmentation in robotic assistance scenarios, encompassing both dialogues and related environmental imagery. This approach involves leveraging a sophisticated large language model to simulate potential conversations and environmental contexts, followed by the use of a stable diffusion model to create images depicting these environments. The additionally generated data serves to refine the latest multimodal models, enabling them to more accurately determine appropriate actions in response to user interactions with the limited target data. Our experimental results, based on a dataset collected from real-world scenarios, demonstrate that our methodology significantly enhances the robot’s action selection capabilities, achieving the state-of-the-art performance.

设计机器人以协助人类日常活动时,必须提高用户要求,从周围环境获得视觉提示,以便更好地了解意图。这一过程被定义为多式联运分类任务。然而,为模型培训收集包含视觉和语言要素的大规模数据集具有挑战性和耗时性。为解决这一问题,我们的论文引入了一个新颖的框架,侧重于机器人援助情景中的数据扩增,包括对话和相关环境图像。这种方法涉及利用一个复杂的大型语言模型模拟潜在对话和环境背景,然后使用一个稳定的传播模型来创建描述这些环境的图像。额外生成的数据有助于完善最新的多式联运模型,使其能够更准确地确定与有限目标数据进行用户互动的适当行动。我们根据从现实世界情景中收集的数据集得出的实验结果表明,我们的方法极大地增强了机器人的行动选择能力,实现了最先进的性能。


Article 258

Title@2025-06-16 (1): LongCodeBench: Evaluating Coding LLMs at 1M Context Windows

Title: LongCodeBench: Evaluating Coding LLMs at 1M Context Windows LongCodeBench: Auswertung von Coding LLMs bei 1M Context Windows LongCodeBench: 在 1M 上下文窗口评价编码LLMs 2505.07897v2

Authors (8): Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, Tatsunori Hashimoto

Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks – not only due to the cost of collecting million-context tasks but also in identifying realistic scenarios that require significant contexts. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce LongCodeBench (LCB), a benchmark to test LLM coding abilities in long-context scenarios. Our benchmark tests both the comprehension and repair capabilities of LCLMs in realistic and important settings by drawing from real-world GitHub issues and constructing QA (LongCodeQA) and bug fixing (LongSWE-Bench) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across different scales – ranging from Qwen2.5 14B Instruct to Google’s flagship Gemini model. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet, or from 70.2% to 40% for Qwen2.5.

模型的上下文长度在短短几年内迅速增长,从数千个符号发展到数百万个符号。现代长文本模型的极端背景规模使得难以制定现实的长文本基准 – – 这不仅是因为收集百万文本任务的成本,而且因为查明需要重大背景的现实情景。我们把代码理解和修理确定为长文本模型的天然测试和挑战任务,并引入长文本模型的LongCodeBench(LCB),这是在长文本情景中测试LLLM 编码能力的基准。我们的基准测试了现实和重要环境中LCLMS的理解和修理能力,从现实世界的GitHub问题和构建QA(LongCodeQA)和错误修正(LongSWE-Bench)的任务中进行。我们仔细地将我们的基准的复杂性分化,使我们能够评估不同尺度的模型 – – 从Qwen2.5指示到Google的Gemini旗舰模型。我们发现,长文本仍然是所有模型的弱点,其性能下降率从29%到3%,或者从Claudeald Sonet3.5的70.2%到70.2%。


Article 259

Title@2025-06-16 (1): Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Title: Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models Roboflow100-VL: Ein Multi-Domain-Objekterkennungs-Benchmark für Vision-Language-Modelle 机器人流100-VL:愿景-语言模型多功能物体探测基准 2505.20612v2

Authors (7): Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri

Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 16.8 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl/ and https://universe.roboflow.com/rf100-vl/

在互联网数据方面受过培训的视觉语言模型(VLM)在汽车、卡车和行人等通用物体上达到惊人的零光检测性能。然而,最先进的模型仍然难以推广到在培训前通常不会发现的分配范围、任务和成像模式。我们主张,与其简单地对VLMS进行更多视觉数据方面的再培训,不如只是将VLMs与新概念相匹配,并附有包含一些视觉实例和丰富的文字描述的说明性指示。为此,我们引入了100个大型多式物体探测数据集的大规模收集,其不同概念在VLM预培训前并不常见。我们评估了我们基准的零光、微光、半超强和完全监控的模型,从而可以进行跨数据系统的比较。我们发现VLMMMS(LODDNO)和Qwenform-VL(QwencomVL)的定位码性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性说明性100-100-VLVLMMMRMMT/VLOFS/SO/S/S/S/S/SO/SDR/S/S/SODR/S/S/S/S/SO/S/SODR/S/S/S/S/SODR/S/S/S/SODR/SDR/SDR/SDR/SDR/SDR/S/S/S/S/S/SDR/SDR/SDR/SDR/S/SD/SDRRRR/S/S/S/S/S/S/S/S/S/SDRDRDGODGODG/SDG/SDG/SDR/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S/S


Article 260

Title@2025-06-16 (1): Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models

Title: Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models Adaptive Anleitung beschleunigt die Stärkung des Lernens von Vernunftmodellen 适应性指导加速加速强化理性模型学习 2506.13923v1

Authors (6): Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte, Sean Hendryx

We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) can learn to solve new problems. We find that RLVR drives performance through two main means: (1) by compressing pass@$k$ into pass@1 and (2) via “capability gain” in which models learn to solve new problems that they previously could not solve even at high $k$. We find that while capability gain exists across model scales, learning to solve new problems is primarily driven through self-distillation. We demonstrate these findings across model scales ranging from 0.5B to 72B on >500,000 reasoning problems with prompts and verifiable final answers across math, science, and code domains. We further show that we can significantly improve pass@$k$ rates by leveraging natural language guidance for the model to consider within context while still requiring the model to derive a solution chain from scratch. Based of these insights, we derive $\text{Guide}$ - a new class of online training algorithms. $\text{Guide}$ adaptively incorporates hints into the model’s context on problems for which all rollouts were initially incorrect and adjusts the importance sampling ratio for the “off-policy” trajectories in order to optimize the policy for contexts in which the hints are no longer present. We describe variants of $\text{Guide}$ for GRPO and PPO and empirically show that Guide-GRPO on 7B and 32B parameter models improves generalization over its vanilla counterpart with up to 4$\%$ macro-average improvement across math benchmarks. We include careful ablations to analyze $\text{Guide}$’s components and theoretically analyze Guide’s learning efficiency.

我们研究的是经过强化学习可核实的奖赏(RLVR)培训的推理模型能够学习如何解决新问题的过程。我们发现,RLVR通过两种主要手段推动业绩:(1) 将传球@$K$压缩成传球@1,(2) 通过“能力增益”,模型通过“能力增益”学习解决以前即使以高美元也无法解决的新问题。我们发现,虽然在模型规模上存在着能力增益,但学习解决新问题主要是通过自我蒸馏驱动的。我们展示了从0.5B到72B的模型基准,范围从0.500,000的推理问题和数学、科学和代码领域的可核查的最后答案。我们进一步显示,我们可以通过利用自然语言指导来解决新问题,解决以前甚至无法以高美元解决的新问题。我们发现,虽然能力在模型中获得了不同规模,但学习新的在线培训算法是: $\ text{GUideral{GUide} 。我们用在模型中包含所有滚动和可核实的最后答案的答案背景背景的提示, 也就是:我们用最精确的指南到更精确的比重的比重的比重的比重。


Article 261

Title@2025-06-16 (1): Discrete Audio Tokens: More Than a Survey!

Title: Discrete Audio Tokens: More Than a Survey! Diskrete Audio Tokens: Mehr als nur eine Umfrage! 分辨音频代号: 多于调查 ! 2506.10274v2

Authors (21): Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

分辨音义象征物是旨在保存感知质量、语音内容和发言者特征,同时能够有效储存和推断,以及具有不同下游任务的竞争性业绩的简明表现形式,它们为连续特征提供了一个实用的替代物,能够将语音和音频纳入现代大型语言模型(LLMs)。随着对以象征性为基础的音频处理的兴趣增加,出现了各种象征性方法,一些调查审查了该领域的最新进展。然而,现有研究往往侧重于特定领域或任务,缺乏各种基准的统一比较。本文对独立音效象征物进行了系统审查和基准,涵盖三个领域:言论、音乐和一般音频。我们提出了基于编码-解码、量化技术、培训范式、可流和应用领域的象征化方法的分类方法。我们评估了重建、下游性业绩和声学语言模型的多重基准,并通过受控的反动研究分析了贸易利弊。我们的调查结果强调了关键限制、实际考虑和公开挑战,为这个迅速演变的区域的未来研究提供了洞察力和指导。更多信息,包括我们的主要成果和网站。


Article 262

Title@2025-06-16 (1): EuroLLM-9B: Technical Report

Title: EuroLLM-9B: Technical Report EuroLLM-9B: Technischer Bericht 欧洲LLLM-9B:技术报告 2506.04079v2

Authors (17): Pedro Henrique Martins, João Alves, Patrick Fernandes, Nuno M. Guerreiro, Ricardo Rei, Amin Farajian, Mateusz Klimaszewski, Duarte M. Alves, José Pombal, Nicolas Boizard, Manuel Faysse, Pierre Colombo, François Yvon, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins

This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B’s development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B’s competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.

本报告介绍了欧洲LLLM-9B的开发情况,包括象征性品设计、建筑规格、数据过滤和培训程序。我们介绍了培训前数据收集和过滤管道,包括创建欧洲环流器、一个以AI为基础的多语言过滤器,以及设计欧洲环锁合成系统,这是一个新的合成数据集,用于培训后加强欧洲语言的语文覆盖面。评价结果显示了欧洲环流-9B在多语言基准和机器翻译任务方面的竞争性表现,将其确定为规模最大的欧洲开源LLM。为了支持公开研究和采用,我们公布了这项工作的所有主要组成部分,包括基础和指示调控模型、欧洲环流分类和合成后培训数据集。


Article 263

Title@2025-06-16 (1): Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations

Title: Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations Alignment Quality Index (AQI) : Jenseits von Ablehnungen: AQI als Intrinsische Alignment-Diagnose über Latent Geometrie, Clusterdivergenz und Layer weise Gepoolte Darstellungen 对齐质量指数(AQI) : 超越拒绝: AQI 是一个通过深层几何、群集差异和图层智慧集合表达式进行的原始对齐诊断分析 2506.13901v1

Authors (15): Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das

Alignment is no longer a luxury, it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding invariant tool for behavior agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI’s correlation with external judges and ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.

当大型语言模型(LLMS)进入教育、医疗保健、治理和法律等高取领域时,它们的行为必须可靠地反映人与人之间的价值观和安全限制。然而,目前的评估严重依赖行为代理,如拒绝率、G-Eval分数和毒性分类,所有这些都存在关键的盲点。 统一模型往往容易受到破门而入、代代代相传和校正的假冒。为了解决这个问题,我们引入了“一致质量指数 ” ( AQI ) 。这种新型的、迅速的、不易变的指数经验性指数(AQI ) ,通过分析隐蔽空间中安全和不安全启动的分离,来评估LMM(LM) 。 通过将Davies-Bouldin评分、Dun Ind指数(DI)、Xie-Beni指数(XUBI)和Calinski-Harabas 指数(CHI)等措施结合起来, 来检测隐藏的氢氟碳化合物和未来的风险,即使产出似乎符合。 QI(AQI) 也作为预警信号性标准性标准性指标性指标,以显示在IMUIADRIA下进行稳健的测试。


Article 264

Title@2025-06-16 (1): EmoNews: A Spoken Dialogue System for Expressive News Conversations

Title: EmoNews: A Spoken Dialogue System for Expressive News Conversations EmoNews: Ein gesprochenes Dialogsystem für expressive Nachrichtengespräche Emohews:一个表达性新闻对话的口号对话系统 2506.13894v1

Authors (4): Ryuki Matsuura, Shikhar Bharadwaj, Jiarui Liu, Dhatchi Kunde Govindarajan

We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues to enable more empathetic news conversations. Despite advancements in emotional text-to-speech (TTS) techniques, task-oriented emotional SDSs remain underexplored due to the compartmentalized nature of SDS and emotional TTS research, as well as the lack of standardized evaluation metrics for social goals. We address these challenges by developing an emotional SDS for news conversations that utilizes a large language model (LLM)-based sentiment analyzer to identify appropriate emotions and PromptTTS to synthesize context-appropriate emotional speech. We also propose subjective evaluation scale for emotional SDSs and judge the emotion regulation performance of the proposed and baseline systems. Experiments showed that our emotional SDS outperformed a baseline system in terms of the emotion regulation and engagement. These results suggest the critical role of speech emotion for more engaging conversations. All our source code is open-sourced at https://github.com/dhatchi711/espnet-emotional-news/tree/emo-sds/egs2/emo_news_sds/sds1

我们开发了一个面向任务的口头对话系统(SDS),根据背景线索来规范情绪言论,以便能够进行更多的同情性新闻谈话。尽管在情感文本到语音(TTS)技术方面有所进步,但任务导向的情绪性SDS由于SDS和情感 TTS研究的分化性质以及缺乏对社会目标的标准化评价指标,仍然没有得到充分的探索。我们通过为新闻对话开发一种情感性能的SDS来应对这些挑战,它利用一个大型语言模式(LLLM)基于情绪分析器来识别适当的情绪和快速TTS,以综合上下文适当的情感演讲。我们还提出了情感SDS的主观评价尺度,并判断了拟议和基线系统的情感调控性表现。实验表明,我们的情感SDS在情感调控和接触方面超过了基线系统。这些结果表明,言语情感对于更多互动至关重要。我们的所有源代码都是公开来源于https://github.com/dhatchi711/espnet-emotional-news/tree/emmo-s2/ges2/egs2/demoos_nus_nus_news_news_news_


Article 265

Title@2025-06-16 (1): Conformal Linguistic Calibration: Trading-off between Factuality and Specificity

Title: Conformal Linguistic Calibration: Trading-off between Factuality and Specificity Konforme Linguistische Kalibrierung: Trading-off zwischen Faktizität und Spezifität 非正式语文校准:事实与具体性之间的交易 2502.19110v3

Authors (3): Zhengping Jiang, Anqi Liu, Benjamin Van Durme

Language model outputs are not always reliable, thus prompting research into how to adapt model responses based on uncertainty. Common approaches include: \emph{abstention}, where models refrain from generating responses when uncertain; and \emph{linguistic calibration}, where models hedge their statements using uncertainty quantifiers. However, abstention can withhold valuable information, while linguistically calibrated responses are often challenging to leverage in downstream tasks. We propose a unified view, Conformal Linguistic Calibration (CLC), which reinterprets linguistic calibration as \emph{answer set prediction}. First we present a framework connecting abstention and linguistic calibration through the lens of linguistic pragmatics. We then describe an implementation of CLC that allows for controlling the level of imprecision in model responses. Results demonstrate our method produces calibrated outputs with conformal guarantees on factual accuracy. Further, our approach enables fine-tuning models to perform uncertainty-aware adaptive claim rewriting, offering a controllable balance between factuality and specificity.

语言模型产出并不总是可靠的,因此促使研究如何调整基于不确定性的模型反应。共同方法包括: /emph{abstem},模型在不确定时不产生反应;和 \emph{L语言校准},模型使用不确定性限定词对其声明进行对冲。但是,弃权可以保留宝贵的信息,而语言校准反应往往对下游任务具有挑战性。我们提出了一个统一的观点,即语言校准(CLC),将语言校准重新解释为 \emph{回答组合预测}。首先,我们提出了一个框架,通过语言实用学的透镜将弃权和语言校准联系起来。我们然后描述了CLC的执行情况,允许在模型反应中控制不精确程度。结果表明我们的方法产生校准产出,在事实准确性上具有一致的保证。此外,我们的方法使得微调模型能够进行有不确定性的适应性索赔重写,在事实质量和具体性之间提供可控制的平衡。


Article 266

Title@2025-06-16 (1): VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training

Title: VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training VL-GenRM: Verbesserung der Vision-Sprachen-Überprüfung durch Vision-Experten und iteratives Training VL-GenRM:通过愿景专家和迭接培训加强愿景-语言核查 2506.13888v1

Authors (7): Jipeng Zhang, Kehao Miao, Renjie Pi, Zhaowei Wang, Runtao Liu, Rui Pan, Tong Zhang

Reinforcement Fine-Tuning (RFT) with verifiable rewards has advanced large language models but remains underexplored for Vision-Language (VL) models. The Vision-Language Reward Model (VL-RM) is key to aligning VL models by providing structured feedback, yet training effective VL-RMs faces two major challenges. First, the bootstrapping dilemma arises as high-quality training data depends on already strong VL models, creating a cycle where self-generated supervision reinforces existing biases. Second, modality bias and negative example amplification occur when VL models hallucinate incorrect visual attributes, leading to flawed preference data that further misguides training. To address these issues, we propose an iterative training framework leveraging vision experts, Chain-of-Thought (CoT) rationales, and Margin-based Rejection Sampling. Our approach refines preference datasets, enhances structured critiques, and iteratively improves reasoning. Experiments across VL-RM benchmarks demonstrate superior performance in hallucination detection and multimodal reasoning, advancing VL model alignment with reinforcement learning.

具有可核实奖赏的强化微调(RFT)已经开发出大型语言模型,但对于愿景-语言(VL)模型的探索不足。愿景-语言奖励模型(VL-RM)是通过提供结构化反馈来调整VL模型的关键,然而培训有效的VL-RM却面临两大挑战。第一,高品质的培训数据取决于已经很强大的VL模型,从而产生了一个自我监督强化现有偏见的循环,从而产生了模式偏差和负面实例放大。第二,当VL模型产生错误的视觉特征,导致有缺陷的偏好数据进一步误导培训时,模式偏向和负面实例放大。为了解决这些问题,我们提议了一个迭代培训框架,利用愿景专家、链式搜索原理和基于磁盘的拒绝抽样。我们的方法改进了偏好数据集,加强了结构上的批评,并反复改进了推理。跨VL-RM基准的实验显示幻象检测和多式推理的优异性表现,推进VL模型与强化学习的调整。


Article 267

Title@2025-06-16 (1): Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles

Title: Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles Untersuchung des Zusammenspiels von sprachlicher und mathematischer Argumentation in Sprachmodellen mittels mehrsprachiger Zahlenrätsel 使用多语种数字拼图调查语言模型的语言和数学推理的相互作用 2506.13886v1

Authors (4): Antara Raaghavi Bhattacharya, Isabel Papadimitriou, Kathryn Davidson, David Alvarez-Melis

Across languages, numeral systems vary widely in how they construct and combine numbers. While humans consistently learn to navigate this diversity, large language models (LLMs) struggle with linguistic-mathematical puzzles involving cross-linguistic numeral systems, which humans can learn to solve successfully. We investigate why this task is difficult for LLMs through a series of experiments that untangle the linguistic and mathematical aspects of numbers in language. Our experiments establish that models cannot consistently solve such problems unless the mathematical operations in the problems are explicitly marked using known symbols ($+$, $\times$, etc, as in “twenty + three”). In further ablation studies, we probe how individual parameters of numeral construction and combination affect performance. While humans use their linguistic understanding of numbers to make inferences about the implicit compositional structure of numerals, LLMs seem to lack this notion of implicit numeral structure. We conclude that the ability to flexibly infer compositional rules from implicit patterns in human-scale data remains an open challenge for current reasoning models.

在各种语言之间,数字系统在构建和合并数字的方式上差异很大。虽然人类在不断学习如何掌握这种多样性,但大型语言模型(LLMS)与涉及跨语言数字系统的语言数学拼图进行斗争,人类可以学习如何成功解决。我们调查为什么LLM在一系列实验中难以完成这项任务,这些实验解开语言数字的语言和数学方面。我们的实验证明,模型无法始终如一地解决这类问题,除非在这些问题中的数学操作使用已知符号($+$,$\times $,等等,如“Twenty + 3 ”中的符号)明确标记。在进一步的研究中,我们探究数字构造和组合的个别参数如何影响性能。虽然人类使用其语言对数字的理解来推断数字的隐含意结构,但LLMS似乎缺乏这种隐含数字结构的概念。我们的结论是,从人类规模数据的隐含模式中灵活地推导构成规则的能力对于当前理性模型来说仍然是一项公开的挑战。


Article 268

Title@2025-06-16 (1): Steering LLM Thinking with Budget Guidance

Title: Steering LLM Thinking with Budget Guidance Steuerung des LLM-Denkens mit Budget Guidance 以预算指导来思考预算指导 2506.13752v1

Authors (4): Junyan Li, Wenshuo Zhao, Yang Zhang, Chuang Gan

Recent deep-thinking large language models often reason extensively to improve performance, but such lengthy reasoning is not always desirable, as it incurs excessive inference costs with disproportionate performance gains. Controlling reasoning length without sacrificing performance is therefore important, but remains challenging, especially under tight thinking budgets. We propose budget guidance, a simple yet effective method for steering the reasoning process of LLMs toward a target budget without requiring any LLM fine-tuning. Our approach introduces a lightweight predictor that models a Gamma distribution over the remaining thinking length during next-token generation. This signal is then used to guide generation in a soft, token-level manner, ensuring that the overall reasoning trace adheres to the specified thinking budget. Budget guidance enables natural control of the thinking length, along with significant token efficiency improvements over baseline methods on challenging math benchmarks. For instance, it achieves up to a 26% accuracy gain on the MATH-500 benchmark under tight budgets compared to baseline methods, while maintaining competitive accuracy with only 63% of the thinking tokens used by the full-thinking model. Budget guidance also generalizes to broader task domains and exhibits emergent capabilities, such as estimating question difficulty. The source code is available at: https://github.com/UMass-Embodied-AGI/BudgetGuidance.

最近深思熟虑的大型语言模型往往广泛解释如何改进业绩,但这种冗长的推理方法并非总是可取的,因为它会产生过高的推论成本,而且业绩得失过多。因此,控制推理长度而不牺牲业绩很重要,但依然具有挑战性,特别是在紧缩的思维预算下。我们提出了预算指导,这是将LLMM推理过程引向目标预算的简单而有效的方法,不需要任何LLLM微调。我们的方法引入了一种轻量级预测,即在下一代中,在剩余思维长度上,以轻量级的模型来计算伽玛分布。然后,这个信号被用来以软、象征性的方式指导生成过程,确保总体推理追踪符合规定的思维预算。预算指导可以自然控制思维长度,同时在挑战数学基准的基准方法上大大象征性地提高效率。例如,在紧凑预算下,与基线方法相比,在MATH-500基准下,其精确率达到26%,同时保持竞争性的准确度,只有63%的全思考模型所使用的思维符号。预算指导还概括了更广泛的任务域域和显示缓发能力,例如估计问题。GUA/BIGI/GI。可用的源代码是:http/GIGI。


Article 269

Title@2025-06-16 (1): LTRR: Learning To Rank Retrievers for LLMs

Title: LTRR: Learning To Rank Retrievers for LLMs LTRR: Learning To Rank Retriever für LLMs LTRR: 学习为LLMM公司重新获得排名 2506.13743v1

Authors (2): To Eun Kim, Fernando Diaz

Retrieval-Augmented Generation (RAG) systems typically rely on a single fixed retriever, despite growing evidence that no single retriever performs optimally across all query types. In this paper, we explore a query routing approach that dynamically selects from a pool of retrievers based on the query, using both train-free heuristics and learned routing models. We frame routing as a learning-to-rank (LTR) problem and introduce LTRR, a framework that learns to rank retrievers by their expected utility gain to downstream LLM performance. Our experiments, conducted on synthetic QA data with controlled query type variations, show that routing-based RAG systems can outperform the best single-retriever-based systems. Performance gains are especially pronounced in models trained with the Answer Correctness (AC) metric and with pairwise learning approaches, especially with XGBoost. We also observe improvements in generalization to out-of-distribution queries. As part of the SIGIR 2025 LiveRAG challenge, our submitted system demonstrated the practical viability of our approach, achieving competitive performance in both answer correctness and faithfulness. These findings highlight the importance of both training methodology and metric selection in query routing for RAG systems.

尽管越来越多的证据表明,没有单一的检索者在所有查询类型中都最优地表现。在本文件中,我们探索了一种根据查询动态地从一组检索者中挑选出的一种查询路径方法,即使用无火车休克和经学习的路径模型,我们将路径设置成一个学习到排序(LTR)问题,并引入一个LTRR框架,这一框架通过预期的效用收益来将检索者排位到下游LLMM性能中。我们就带有受控查询类型变异的合成QA数据进行的实验表明,基于路径的RAG系统能够超越以查询为基础的最佳单轨制系统。在经过“答案正确性”(AC)衡量标准培训的模型中,成绩特别突出,并采用双轨学习方法,特别是XGBoost。我们还观察到,在向分配以外的查询的普及方面有所改进。作为SIGIR 2025 LiveRAG挑战的一部分,我们提交的系统展示了我们的方法的实际可行性,在选择方法中实现竞争性的路径正确性,在RAG的正确性评估中突出。


Article 270

Title@2025-06-16 (1): Instruction Following by Boosting Attention of Large Language Models

Title: Instruction Following by Boosting Attention of Large Language Models Anleitung, indem man die Aufmerksamkeit großer Sprachmodelle erhöht 之后的教学,培养对大语言模式的注意 2506.13734v1

Authors (4): Vitoria Guardieiro, Adam Stein, Avishree Khare, Eric Wong

Controlling the generation of large language models (LLMs) remains a central challenge to ensure their safe and reliable deployment. While prompt engineering and finetuning are common approaches, recent work has explored latent steering, a lightweight technique that alters LLM internal activations to guide generation. However, subsequent studies revealed latent steering’s effectiveness to be limited, often underperforming simple instruction prompting. To address this limitation, we first establish a benchmark across diverse behaviors for standardized evaluation of steering techniques. Building on insights from this benchmark, we introduce Instruction Attention Boosting (InstABoost), a latent steering method that boosts the strength of instruction prompting by altering the model’s attention during generation. InstABoost combines the strengths of existing approaches and is theoretically supported by prior work that suggests that in-context rule following in transformer-based models can be controlled by manipulating attention on instructions. Empirically, InstABoost demonstrates superior control success compared to both traditional prompting and latent steering.

控制大型语言模型(LLMs)的生成仍然是确保安全可靠地部署的重大挑战。虽然迅速工程和微调是常见的方法,但最近的工作探索了潜质引导,这是一种轻量技术,改变了LLM的内部激活,以引导下一代的形成。然而,随后的研究显示,潜质引导的有效性有限,往往表现较差的简单指令。为了应对这一限制,我们首先为标准化的指导技术评价制定不同行为基准。基于这一基准的洞察力,我们引入了 “ 注意促进 “ (InstABoost),这是一种潜在的指导方法,通过改变模型代代代代的注意力来增强激励教学的力量。 “ 斯特拉博斯特 “ 将现有方法的优势结合起来,并在理论上得到先前工作的支持,表明基于变压器模型的文本规则可以通过对指示的注意来控制。从中可以看出, “ 注意 “ InstABoost “ 与传统的催动和潜伏性指导相比,都表现出超强的控制成功。


Article 271

Title@2025-06-16 (1): Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs

Title: Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs Zuweisungsgeführtes Pruning für Kompression, Circuit Discovery und gezielte Korrektur in LLMs 压缩、电路发现和定点校正 2506.13727v1

Authors (7): Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin

Large Language Models (LLMs) are central to many contemporary AI applications, yet their extensive parameter counts pose significant challenges for deployment in memory- and compute-constrained environments. Recent works in eXplainable AI (XAI), particularly on attribution methods, suggest that interpretability can also enable model compression by identifying and removing components irrelevant to inference. In this paper, we leverage Layer-wise Relevance Propagation (LRP) to perform attribution-guided pruning of LLMs. While LRP has shown promise in structured pruning for vision models, we extend it to unstructured pruning in LLMs and demonstrate that it can substantially reduce model size with minimal performance loss. Our method is especially effective in extracting task-relevant subgraphs – so-called ``circuits’’ – which can represent core functions (e.g., indirect object identification). Building on this, we introduce a technique for model correction, by selectively removing circuits responsible for spurious behaviors (e.g., toxic outputs). All in all, we gather these techniques as a uniform holistic framework and showcase its effectiveness and limitations through extensive experiments for compression, circuit discovery and model correction on Llama and OPT models, highlighting its potential for improving both model efficiency and safety. Our code is publicly available at https://github.com/erfanhatefi/SparC3.

大型语言模型(LLMS)是当代许多AI应用的核心,但它们广泛的参数计数对在记忆和计算限制环境中的部署提出了重大挑战。最近在explable AI(XAI)中开展的工作,特别是在归属方法方面,表明可解释性还可以通过识别和删除与推理无关的部件而使模型压缩。在本文中,我们利用图层-源-相关性促进(LRP)来进行LLMS的归因制剪裁。虽然LRP在对视觉模型进行结构化剪裁时显示了希望,但我们将其扩大到LLMS中未结构化的裁剪线,并表明它能够大大减少性能损失。我们的方法在提取与任务相关的子谱 – – 所谓的“路由路” – – 能够代表核心功能(例如间接的物体识别)方面特别有效。在此基础上,我们引入了一种模型校正技术,通过有选择地去除对虚假行为负责的电路(例如有毒产出),但我们将这些技术推广到一个统一的整体框架,并通过广泛实验来展示其有效性和局限性和限制,从而突出地展示其用于压缩、直接发现和进行我们现有的ALFIFAL-RC发现和LA/C的模型。


Article 272

Title@2025-06-16 (1): OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Title: OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation OPeRA: Ein Datensatz von Beobachtung, Persona, Ratationale und Aktion zur Bewertung von LLMs auf menschlicher Online-Shopping-Behavior-Simulation OPERA: 人类在线购物行为模拟观察、人、理由和评估LMLLMs的数据集 2506.05606v2

Authors (16): Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang

Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating ``believable’’ human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures: user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user’s next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for human.

大型语言模型(LLMS)能否准确地模拟特定用户的下一个网络行动?虽然LLMS在产生“可相信的”人类行为方面表现出很有潜力的能力,但评估其模仿真实用户行为的能力仍是一个公开的挑战,这主要是因为缺少高质量、公开的数据集,这些数据集既能捕捉可观测到的行动,又能捕捉实际人类用户的内部推理。为了缩小这一差距,我们引入了OPERA,这是在网上购物过程中从真实的人类参与者那里收集的观察、人、理由和行动的新数据集。OPERA是第一个全面捕捉到的公开数据集:用户、浏览器观察、精细的网络动作和自己报告的即时理由。我们开发了一个在线问卷和一个定制浏览器插件,以便以高度忠诚的方式收集这一数据集。我们利用OPERA建立了第一个基准,用以评估当前LMSs如何很好地预测特定用户与某个特定的人的下一个行动和理由,以及<观察、行动、理由>历史。这一数据集为未来对作为数字化的个人代理人进行研究奠定了基础。


Article 273

Title@2025-06-16 (1): Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems

Title: Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems Ausbalancieren von Wissenslieferungen und emotionalem Komfort in Gesundheitswesensgesprächssystemen 平衡知识的提供和卫生保健沟通系统中的情感舒适 2506.13692v1

Authors (2): Shang-Chi Tsai, Yun-Nung Chen

With the advancement of large language models, many dialogue systems are now capable of providing reasonable and informative responses to patients’ medical conditions. However, when patients consult their doctor, they may experience negative emotions due to the severity and urgency of their situation. If the model can provide appropriate comfort and empathy based on the patient’s negative emotions while answering medical questions, it will likely offer a more reassuring experience during the medical consultation process. To address this issue, our paper explores the balance between knowledge sharing and emotional support in the healthcare dialogue process. We utilize a large language model to rewrite a real-world interactive medical dialogue dataset, generating patient queries with negative emotions and corresponding medical responses aimed at soothing the patient’s emotions while addressing their concerns. The modified data serves to refine the latest large language models with various fine-tuning methods, enabling them to accurately provide sentences with both emotional reassurance and constructive suggestions in response to patients’ questions. Compared to the original LLM model, our experimental results demonstrate that our methodology significantly enhances the model’s ability to generate emotional responses while maintaining its original capability to provide accurate knowledge-based answers.

随着大型语言模式的进步,许多对话系统现在能够对病人的健康状况提供合理和内容丰富的回应,然而,当病人咨询医生时,他们可能会经历消极情绪,因为病情的严重性和紧迫性;如果模型能够根据病人的消极情绪在回答医疗问题时提供适当的安慰和同情,那么在医疗咨询过程中就可能提供更令人放心的经验。为了解决这一问题,我们的文件探讨了保健对话过程中知识共享和情感支持之间的平衡。我们使用一个大语言模式重写真实世界交互式医疗对话数据集,产生负面情绪的病人询问和相应的医疗反应,目的是缓解病人的情绪,同时解决他们的关切。修改后的数据有助于完善最新的大型语言模式,采用各种微调方法,使他们能够准确提供对病人问题的情感安慰和建设性建议。与最初的LLM模型相比,我们的实验结果表明,我们的方法大大增强了模型产生情感反应的能力,同时保持了提供准确知识答案的原始能力。


Article 274

Title@2025-06-16 (1): Efficient Inference for Large Reasoning Models: A Survey

Title: Efficient Inference for Large Reasoning Models: A Survey Effiziente Schlussfolgerung für große Vernunftmodelle: Eine Umfrage 大型理由模型有效推断:调查 2503.23077v2

Authors (10): Yue Liu, Jiaying Wu, Yufei He, Hongcheng Gao, Hongyu Chen, Baolong Bi, Ruihan Gong, Jiaheng Zhang, Zhiqi Huang, Bryan Hooi

Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models (LLMs) by learning to reason, exhibiting promising performance in complex task-solving. However, their deliberative reasoning process leads to inefficiencies in token usage, memory consumption, and inference time. Thus, this survey provides a review of efficient inference methods designed specifically for LRMs, focusing on mitigating token inefficiency while preserving the reasoning quality. First, we introduce a taxonomy to group the recent methods into two main categories: (a) explicit compact Chain-of-Thought (CoT), which reduces tokens while keeping the explicit reasoning structure, and (b) implicit latent CoT, which encodes reasoning steps within hidden representations instead of explicit tokens. Meanwhile, we discuss their strengths and weaknesses. Then, we conduct empirical analyses on existing methods from performance and efficiency aspects. Besides, we present open challenges in this field, including human-centric controllable reasoning, trade-off between interpretability and efficiency of reasoning, ensuring safety of efficient reasoning, and broader applications of efficient reasoning. In addition, we highlight key insights for enhancing LRMs’ inference efficiency via techniques such as model merging, new architectures, and agent routers. We hope this work serves as a valuable guide, helping researchers overcome challenges in this vibrant field\footnote{https://github.com/yueliu1999/Awesome-Efficient-Inference-for-LRMs}.

大型理性模型(LRMs)通过学习理性,在复杂的任务解决中表现出有希望的业绩,大大提高了大语言模型(LLMs)的推理能力,从而大大提高了大语言模型(LLMs)的推理能力;然而,其审议推理过程导致象征性使用、记忆消耗和推理时间效率低下。因此,本调查审查了专门为LRMs设计的高效推论方法,重点是减轻象征性低效率,同时保持推理质量。首先,我们引入一种分类法,将最近的方法分为两大类:(a) 明确的紧凑连锁(COT),在保持明确的推理结构的同时减少象征,减少象征,并显示有良好的表现;以及(b) 隐含隐含的潜在COT,在隐含的表述中而不是明确的标码内进行推理步骤。与此同时,我们讨论了它们的优缺点。然后,我们从业绩和效率方面对现有方法进行了经验分析。 此外,我们提出了这一领域的公开挑战,包括以人为中心的逻辑推理,在解释和效率之间进行交易,确保高效率推理,确保高效率推理的安全,以及更广泛地应用有效的推理。 此外,我们强调关键见解深刻的洞洞察,我们强调加强LRMMMMDRMDRMSRDRMS的视野,在这种方向上的重要见解,我们通过一种方向,通过一种高的估价方法。


Article 275

Title@2025-06-16 (1): How Much is Enough? The Diminishing Returns of Tokenization Training Data

Title: How Much is Enough? The Diminishing Returns of Tokenization Training Data Wie viel ist genug? Die Diminishing Rückgaben von Tokenization Trainingsdaten 有多少足够? 2502.20273v4

Authors (4): Varshini Reddy, Craig W. Schmidt, Yuval Pinter, Chris Tanner

Tokenization, a crucial initial step in natural language processing, is governed by several key parameters, such as the tokenization algorithm, vocabulary size, pre-tokenization strategy, inference strategy, and training data corpus. This paper investigates the impact of an often-overlooked hyperparameter, tokenizer training data size. We train BPE, UnigramLM, and WordPiece tokenizers across various vocabulary sizes using English training data ranging from 1GB to 900GB. Our findings reveal diminishing returns as training data size increases beyond roughly 150GB, suggesting a practical limit to the improvements in tokenization quality achievable through additional data. We analyze this phenomenon and attribute the saturation effect to constraints introduced by the pre-tokenization stage. We then demonstrate the extent to which these findings can generalize by experimenting on data in Russian, a language typologically distant from English. For Russian text, we observe diminishing returns after training a tokenizer from 200GB of data, which is approximately 33% more than when training on English. These results provide valuable insights for optimizing the tokenization process by reducing the compute required for training on large corpora and suggest promising directions for future research in tokenization algorithms.

在自然语言处理中,Tokenization是自然语言处理中一个至关重要的初始步骤,它受若干关键参数的制约,例如象征性算法、词汇大小、预先确定战略、推论战略、培训数据库等。本文调查了经常被忽略的超参数、表示器培训数据大小的影响。我们用1GB至900GB的英语培训数据,对BPE、UnigramLM和WordPiece象征性产品进行了各种词汇规模的培训。我们的调查结果显示,随着培训数据规模的增加超过约150GB,回报在不断减少。我们分析这种现象,并将饱和效应归结于前确定阶段带来的限制。我们然后通过用俄语试验数据来展示这些结果的普及程度,俄语是远离英语的典型语言。关于俄文文本,我们观察到,培训代记器后回报从200GB的数据减少,比英语培训时减少约33%。这些结果提供了宝贵的洞察力,通过减少大规模公司化培训所需的精确度,从而优化象征性化进程。


Article 276

Title@2025-06-16 (1): Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models

Title: Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models Abkehr von der Hitze: Eine kritische Analyse der Min-p-Probenahme in Sprachmodellen 降低热量:对语言模型中中点抽样的批判性分析 2506.13681v1

Authors (3): Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch

Sampling from language models impacts the quality and diversity of outputs, affecting both research and real-world applications. Recently, Nguyen et al. 2024’s “Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs” introduced a new sampler called min-p, claiming it achieves superior quality and diversity over established samplers such as basic, top-k, and top-p sampling. The significance of these claims was underscored by the paper’s recognition as the 18th highest-scoring submission to ICLR 2025 and selection for an Oral presentation. This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper’s four lines of evidence. First, the original paper’s human evaluations omitted data, conducted statistical tests incorrectly, and described qualitative feedback inaccurately; our reanalysis demonstrates min-p did not outperform baselines in quality, diversity, or a trade-off between quality and diversity; in response to our findings, the authors of the original paper conducted a new human evaluation using a different implementation, task, and rubric that nevertheless provides further evidence min-p does not improve over baselines. Second, comprehensively sweeping the original paper’s NLP benchmarks reveals min-p does not surpass baselines when controlling for the number of hyperparameters. Third, the original paper’s LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported. Fourth, community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be unsubstantiated, leading to their removal; the revised adoption claim remains misleading. We conclude that evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity.

语言模型的抽样影响到产出的质量和多样性,影响到研究和现实世界的应用。最近,Nguyen等人的2024年的“Heat 更新证据:创意和 Coherent LLM 产出的Min-p 抽样”引入了名为 min-p 的新取样器,称其质量优于基础、顶级和顶级抽样等既定取样器,称其质量和多样性达到优于优于优于优于优于优于优、顶级和顶级取样器等既定取样器。文件承认这些说法的重要性突出表现为向ICLR 2025提交的第18份最显眼的呈件和口述演示的挑选。本文全面重新审查了支持分钟缩略图的证据,得出了与原论文不同的不同结论。 原始论文显示, 直略图显示的数值比标值并非全面控制基数,而原始的缩略图则表明,正值比原始的缩略图比值更低。


Article 277

Title@2025-06-16 (1): Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models

Title: Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models Vereinheitlichung einheitlicher und Binär-kodierender Quantisierungen für eine präzise Komprimierung großer Sprachmodelle 精确压缩大语言模型精确压缩的统一和二元编码统一和二元编码的量化 2506.03781v2

Authors (8): Seungcheol Park, Jeongin Bae, Beomseok Kwon, Minjun Kim, Byeongwook Kim, Se Jung Kwon, U Kang, Dongsoo Lee

How can we quantize large language models while preserving accuracy? Quantization is essential for deploying large language models (LLMs) efficiently. Binary-coding quantization (BCQ) and uniform quantization (UQ) are promising quantization schemes that have strong expressiveness and optimizability, respectively. However, neither scheme leverages both advantages. In this paper, we propose UniQuanF (Unified Quantization with Flexible Mapping), an accurate quantization method for LLMs. UniQuanF harnesses both strong expressiveness and optimizability by unifying the flexible mapping technique in UQ and non-uniform quantization levels of BCQ. We propose unified initialization, and local and periodic mapping techniques to optimize the parameters in UniQuanF precisely. After optimization, our unification theorem removes computational and memory overhead, allowing us to utilize the superior accuracy of UniQuanF without extra deployment costs induced by the unification. Experimental results demonstrate that UniQuanF outperforms existing UQ and BCQ methods, achieving up to 4.60% higher accuracy on GSM8K benchmark.

我们如何在保持准确性的同时量化大型语言模型? 量化对于高效率地部署大型语言模型(LLMs)至关重要。 二编码量化(BCQ)和统一量化(UQ)是很有希望的量化方案,它们分别具有很强的清晰度和优化性。 但是,这两个方案都没有利用这两种优势。 在本文件中,我们建议UniQuanF(与弹性绘图统一量化),这是LLMs的一种准确的量化方法。 UniQuanF(UniQuanF)管理通过统一UQ的灵活绘图技术和BCQ的非统一度量化(Unizal Q) 和不统一度(Unizal QQ) 两种方法,具有很强的清晰度和可选性。我们建议统一初始化和本地定期绘图技术,以便精确优化UniQuanF的参数。在优化后,我们统一了对计算和记忆管理费的计算,使我们能够在不因统一而产生额外部署费用的情况下利用UniQuanF的高级精度。实验结果表明,UQ(UQ)比现有的UQ和BCQ(BCQ)方法高出4.60%。


Article 278

Title@2025-06-16 (1): Improving Clinical Note Generation from Complex Doctor-Patient Conversation

Title: Improving Clinical Note Generation from Complex Doctor-Patient Conversation Verbesserung der klinischen Notengenerierung aus komplexen Arzt-Patient-Konversationen 从复杂的医生与病人之间的复杂对话中改进临床笔记制作 2408.14568v2

Authors (5): Yizhan Li, Sifan Wu, Christopher Smith, Thomas Lo, Bang Liu

Writing clinical notes and documenting medical exams is a critical task for healthcare professionals, serving as a vital component of patient care documentation. However, manually writing these notes is time-consuming and can impact the amount of time clinicians can spend on direct patient interaction and other tasks. Consequently, the development of automated clinical note generation systems has emerged as a clinically meaningful area of research within AI for health. In this paper, we present three key contributions to the field of clinical note generation using large language models (LLMs). First, we introduce CliniKnote, a comprehensive dataset consisting of 1,200 complex doctor-patient conversations paired with their full clinical notes. This dataset, created and curated by medical experts with the help of modern neural networks, provides a valuable resource for training and evaluating models in clinical note generation tasks. Second, we propose the K-SOAP (Keyword, Subjective, Objective, Assessment, and Plan) note format, which enhances traditional SOAP~\cite{podder2023soap} (Subjective, Objective, Assessment, and Plan) notes by adding a keyword section at the top, allowing for quick identification of essential information. Third, we develop an automatic pipeline to generate K-SOAP notes from doctor-patient conversations and benchmark various modern LLMs using various metrics. Our results demonstrate significant improvements in efficiency and performance compared to standard LLM finetuning methods.

撰写临床笔记和记录体检是保健专业人员的一项关键任务,是病人护理文件的重要部分,但人工撰写这些笔记很费时,可以影响临床医生在病人直接互动和其他任务上花费的时间,因此,开发自动化临床笔记生成系统已成为AI健康研究中的一个具有临床意义的领域。在本文件中,我们介绍了对使用大语言模型(LLM)制作临床笔记的三项关键贡献。首先,我们引入了CliniKnote,这是一个综合数据集,由1 200个复杂的医生-病人谈话组成,配有他们的完整的临床笔记。这个数据集是由医疗专家在现代神经网络的帮助下创建和整理的,为临床笔记生成任务的培训和评价模型提供了宝贵的资源。第二,我们提出了K-SOAP(关键词、主观、客观、评估和计划)说明格式,它加强了传统的SOAPcite{podder2023soap}(直观、目标、评估和计划),通过在顶部添加一个关键部分,以便快速识别基本信息,从而用我们的主要的LMSAMAS-M标准,我们开发了一种自动的改进的实验室和各种标准。


Article 279

Title@2025-06-16 (1): On Synthesizing Data for Context Attribution in Question Answering

Title: On Synthesizing Data for Context Attribution in Question Answering Über die Synthese von Daten für Kontextzuweisungen in Fragenantworten 问题解答中内容归属数据合成 2504.05317v2

Authors (14): Gorjan Radevski, Kiril Gashteovski, Shahbaz Syed, Christopher Malon, Sebastien Nicolas, Chia-Chien Hung, Timo Sztyler, Verena Heußer, Wiem Ben Rim, Masafumi Enomoto, Kunihiro Takeoka, Masafumi Oyamada, Goran Glavaš, Carolin Lawrence

Question Answering (QA) accounts for a significant portion of LLM usage “in the wild”. However, LLMs sometimes produce false or misleading responses, also known as “hallucinations”. Therefore, grounding the generated answers in contextually provided information – i.e., providing evidence for the generated text – is paramount for LLMs’ trustworthiness. Providing this information is the task of context attribution. In this paper, we systematically study LLM-based approaches for this task, namely we investigate (i) zero-shot inference, (ii) LLM ensembling, and (iii) fine-tuning of small LMs on synthetic data generated by larger LLMs. Our key contribution is SynQA: a novel generative strategy for synthesizing context attribution data. Given selected context sentences, an LLM generates QA pairs that are supported by these sentences. This leverages LLMs’ natural strengths in text generation while ensuring clear attribution paths in the synthetic training data. We show that the attribution data synthesized via SynQA is highly effective for fine-tuning small LMs for context attribution in different QA tasks and domains. Finally, with a user study, we validate the usefulness of small LMs (fine-tuned on synthetic data from SynQA) in context attribution for QA.

问题解答(QA)是LLM“野外”使用LLM的很大一部分。然而,LLMS有时会产生虚假或误导性的答复,也称为“卤化”。因此,将产生的答案以背景信息 – – 即为生成文本提供证据 – – 为基础,对于LLMs的可信度至关重要。提供这一信息是背景归属的任务。在本文中,我们系统研究基于LLMM的办法来开展这项任务,即我们调查(一) 零发出推论,(二) LLM 组合,以及(三) 微调大型LMS的合成数据。我们的主要贡献是SynQA:合成QA:合成背景归属归属数据的新颖的组合战略。根据选定的背景句子,LMM产生得到这些句子的支持的QA配对。这在文本生成中利用LMMs的自然优势,同时确保合成培训数据中明确的归属路径。我们通过SynQA合成合成的合成数据合成数据集成的归属数据对于对不同QA任务和域的小型环境归属进行微调微调小LMQ(我们合成数据验证的用户对LQQQQQ数据进行检索)。


Article 280

Title@2025-06-16 (1): Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

Title: Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model Stream-Omni: Gleichzeitige multimodale Interaktionen mit großem Sprach-Vision-Sprachmodell 流流-奥米尼:与大语言-视觉-语音模型同时使用的多模式互动 2506.13642v1

Authors (5): Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng

The emergence of GPT-4o-like large multimodal models (LMMs) has raised the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representation of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs LLM as the backbone and aligns the vision and speech to the text based on their relationships. For vision that is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech that is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimensional mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.

GPT-4型类似GPT-4型大型多式联运模型(LMMs)的出现提高了对整合文本、愿景和语言模式的探索,以支持更灵活的多式联运互动。现有的LMMs通常在序列维度上将模式的表达方式混在一起,并将其输入大型语言模型(LLMM)主干。虽然序列分解融合对于模式整合来说是直截了当的,但它往往严重依赖大型数据来学习模式调整。在本文件中,我们的目标是更有目的地模拟模式之间的关系,从而实现更有效和灵活的方式调整。为此,我们建议采用一个具有高效模式的双向语言双视-语音互动模式,同时支持不同模式组合下的互动。 Stream-Omni使用LMM(LLM)作为主干线,根据它们的关系将愿景和讲话与文字相匹配。 Stream-Omni将语言快速互动模式引入一个基于语音定位的文本转换,在Stream-Olational-modalation中可以提供一个基于语音定位的文本转换。 Stre-deal-deal-dealalalalal-modealal-dealdeal-deal-dealde-dealdealdeal-dealdealdealdealdededede 。 Stre 提供一个基于可提供一个基于可实现语音-s-s-s-de-de-s-s-s-s-demode-s-demodalutusmlational-slation-slation。


Article 281

Title@2025-06-16 (1): EvolvTrip: Enhancing Literary Character Understanding with Temporal Theory-of-Mind Graphs

Title: EvolvTrip: Enhancing Literary Character Understanding with Temporal Theory-of-Mind Graphs EvolvTrip: Erweitern des literarischen Charakterverständnisses mit zeitlichen Theorie-von-Mind Graphen EvlvTrip:用时光理论图增强对文学特征的了解 2506.13641v1

Authors (6): Bohao Yang, Hainiu Xu, Jinhua Du, Ze Li, Yulan He, Chenghua Lin

A compelling portrayal of characters is essential to the success of narrative writing. For readers, appreciating a character’s traits requires the ability to infer their evolving beliefs, desires, and intentions over the course of a complex storyline, a cognitive skill known as Theory-of-Mind (ToM). Performing ToM reasoning in prolonged narratives requires readers to integrate historical context with current narrative information, a task at which humans excel but Large Language Models (LLMs) often struggle. To systematically evaluate LLMs’ ToM reasoning capability in long narratives, we construct LitCharToM, a benchmark of character-centric questions across four ToM dimensions from classic literature. Further, we introduce EvolvTrip, a perspective-aware temporal knowledge graph that tracks psychological development throughout narratives. Our experiments demonstrate that EvolvTrip consistently enhances performance of LLMs across varying scales, even in challenging extended-context scenarios. EvolvTrip proves to be particularly valuable for smaller models, partially bridging the performance gap with larger LLMs and showing great compatibility with lengthy narratives. Our findings highlight the importance of explicit representation of temporal character mental states in narrative comprehension and offer a foundation for more sophisticated character understanding. Our data and code are publicly available at https://github.com/Bernard-Yang/EvolvTrip.

对读者来说,要系统地评估LLMS在长篇叙事方面的推理能力,我们就必须建立LitCharToM,这是典型文献中四个方面以性为中心的问题的基准。此外,我们还引入了EvlvTrip, 一种能感知到时间知识的图,跟踪整个叙事的心理发展。我们的实验表明,EvlvTrip 不断提高LMS在不同规模上的性能,甚至在具有挑战性的长篇假想中也是如此。EvlvTrip 证明,对于较小的模型来说,特别有用,部分缩小LMS的性能差距,并表现出与长篇叙事的高度兼容性。我们的调查结果强调了在可公开理解/可理解的 ALMS/ADR 中明确表达时间性精神状态的重要性。


Article 282

Title@2025-06-16 (1): An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability

Title: An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability Eine empirische Studie von LLM-as-a-Richter: Wie Design Entscheidungen Auswirkungen Bewertung Zuverlässigkeit 法学硕士作为法官的经验研究:设计选择如何影响评价可靠性 2506.13639v1

Authors (3): Yusuke Yamauchi, Taro Yano, Masafumi Oyamada

As large language models (LLMs) continue to advance, reliable evaluation methods are essential particularly for open-ended, instruction-following tasks. LLM-as-a-Judge enables automatic evaluation using LLMs as evaluators, but its reliability remains uncertain. In this work, we analyze key factors affecting its trustworthiness, focusing on alignment with human judgments and evaluation consistency. Using BIGGENBench and EvalBiasBench, we study the effects of evaluation design, decoding strategies, and Chain-of-Tought (CoT) reasoning in evaluation. Our results show that evaluation criteria are critical for reliability, non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and CoT reasoning offers minimal gains when clear evaluation criteria are present.

随着大型语言模型(LLMs)的继续推进,可靠的评价方法对开放式、执行指示的任务尤为重要。LLM-as-a-judge使得能够以LLMs作为评价员进行自动评价,但其可靠性仍然不确定。在这项工作中,我们分析了影响其可信度的关键因素,侧重于与人类判断和评估一致性的一致性。我们利用BIGGENBench和EvalBiasBench,研究评价设计、解码战略和在评价中寻求链(Cot)推理的影响。我们的结果显示,评价标准对于可靠性至关重要,非非非非定式抽样改进了与人类对确定性评价的偏好的一致性,而在有明确的评价标准时,CoT推理只能带来最小的收益。


Article 283

Title@2025-06-16 (1): A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data

Title: A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data Ein selbstrefinierendes Framework zur Verbesserung der ASR-Nutzung von TTS-Synthesedaten 利用TTS综合数据加强ASR的自订框架 2506.11130v2

Authors (8): Cheng-Kang Chou, Chan-Jan Hsu, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan Po Huang, Hung-Yi Lee

We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels on unannotated speech, which are then used to train a high-fidelity text-to-speech (TTS) system. Then, synthesized speech text pairs are bootstrapped into the original ASR system, completing the closed-loop self-improvement cycle. We demonstrated the effectiveness of the framework on Taiwanese Mandarin speech. Leveraging 6,000 hours of unlabeled speech, a moderate amount of text data, and synthetic content from the AI models, we adapt Whisper-large-v2 into a specialized model, Twister. Twister reduces error rates by up to 20% on Mandarin and 50% on Mandarin-English code-switching benchmarks compared to Whisper. Results highlight the framework as a compelling alternative to pseudo-labeling self-distillation approaches and provides a practical pathway for improving ASR performance in low-resource or domain-specific settings.

我们提出一个自我界定框架,提高ASR的性能,只使用未贴标签的数据集。这一过程从现有的ASR模型开始,在未加注的演讲中产生假标签,然后用于培训高忠诚文本到语音系统(TTS),然后,合成的语音文本配对进入原ASR系统,完成闭环自我改进周期。我们展示了台湾国语演讲框架的有效性。利用6000小时未贴标签的演讲、中等数量的文本数据以及来自AI模型的合成内容,我们将Whisper prop-v2改成一个专门的模型,Twister。将曼达林的误差率降低20%,将曼达林的误差率降低50%,将曼达林英语代码开关基准降低到Whisper的误率降低50%。结果突出框架是伪标签自我提炼方法的令人信服的替代方法,并为改进低资源或特定域环境中的ASR性能提供了实用路径。


Article 284

Title@2025-06-16 (1): A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy

Title: A Structured Bangla Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy Ein strukturierter Bangla-Datensatz von Krankheits-Symptome-Verbänden zur Verbesserung der Diagnosegenauigkeit 改善诊断准确性疾病 – – 症状协会结构化孟加拉数据集 2506.13610v1

Authors (4): Abdullah Al Shafi, Rowzatul Zannat, Abdul Muntakim, Mahmudul Hasan

Disease-symptom datasets are significant and in demand for medical research, disease diagnosis, clinical decision-making, and AI-driven health management applications. These datasets help identify symptom patterns associated with specific diseases, thus improving diagnostic accuracy and enabling early detection. The dataset presented in this study systematically compiles disease-symptom relationships from various online sources, medical literature, and publicly available health databases. The data was gathered through analyzing peer-reviewed medical articles, clinical case studies, and disease-symptom association reports. Only the verified medical sources were included in the dataset, while those from non-peer-reviewed and anecdotal sources were excluded. The dataset is structured in a tabular format, where the first column represents diseases, and the remaining columns represent symptoms. Each symptom cell contains a binary value (1 or 0), indicating whether a symptom is associated with a disease (1 for presence, 0 for absence). Thereby, this structured representation makes the dataset very useful for a wide range of applications, including machine learning-based disease prediction, clinical decision support systems, and epidemiological studies. Although there are some advancements in the field of disease-symptom datasets, there is a significant gap in structured datasets for the Bangla language. This dataset aims to bridge that gap by facilitating the development of multilingual medical informatics tools and improving disease prediction models for underrepresented linguistic communities. Further developments should include region-specific diseases and further fine-tuning of symptom associations for better diagnostic performance

对医学研究、疾病诊断、临床决策以及AI驱动的健康管理应用,这些数据集有助于查明与特定疾病有关的症状模式,从而改进诊断准确性和早期检测。本研究报告提供的数据集系统地从各种在线来源、医学文献和公开提供的卫生数据库中汇编疾病症状关系。这些数据是通过分析经同行审查的医学文章、临床案例研究和疾病症状关联报告收集的。只有经核实的医学来源才列入数据集,而未经同行审查和传闻来源的医学来源则被排除在外。数据集以表格格式编排,第一栏代表疾病,其余各栏代表症状。每个症状细胞都含有二进制值(1或0),表明症状是否与疾病有关(1个存在,0个缺席)。因此,这种结构化的表述使得数据集非常有利于一系列广泛的应用,包括机器学习的疾病代表性预测、临床决策支持系统以及流行病学研究。尽管在为疾病结构化的多语系发展领域取得了一些进步,但为疾病结构化数据系统开发提供了大量的数据目标。


Article 285

Title@2025-06-16 (1): An Investigation into Value Misalignment in LLM-Generated Texts for Cultural Heritage

Title: An Investigation into Value Misalignment in LLM-Generated Texts for Cultural Heritage Eine Untersuchung der Wertunausrichtung in LLM-generierten Texten für kulturelles Erbe 调查文化遗产LLM-LLM-发光文字中的价值失调问题 2501.02039v2

Authors (4): Fan Bu, Zheng Wang, Siyi Wang, Ziyao Liu

As Large Language Models (LLMs) become increasingly prevalent in tasks related to cultural heritage, such as generating descriptions of historical monuments, translating ancient texts, preserving oral traditions, and creating educational content, their ability to produce accurate and culturally aligned texts is being increasingly relied upon by users and researchers. However, cultural value misalignments may exist in generated texts, such as the misrepresentation of historical facts, the erosion of cultural identity, and the oversimplification of complex cultural narratives, which may lead to severe consequences. Therefore, investigating value misalignment in the context of LLM for cultural heritage is crucial for mitigating these risks, yet there has been a significant lack of systematic and comprehensive study and investigation in this area. To fill this gap, we systematically assess the reliability of LLMs in generating culturally aligned texts for cultural heritage-related tasks. We conduct a comprehensive evaluation by compiling an extensive set of 1066 query tasks covering 5 widely recognized categories with 17 aspects within the knowledge framework of cultural heritage across 5 open-source LLMs, and examine both the type and rate of cultural value misalignments in the generated texts. Using both automated and manual approaches, we effectively detect and analyze the cultural value misalignments in LLM-generated texts. Our findings are concerning: over 65% of the generated texts exhibit notable cultural misalignments, with certain tasks demonstrating almost complete misalignment with key cultural values. Beyond these findings, this paper introduces a benchmark dataset and a comprehensive evaluation workflow that can serve as a valuable resource for future research aimed at enhancing the cultural sensitivity and reliability of LLMs.

由于大语言模型(LLMS)在与文化遗产有关的任务中越来越普遍,例如编写历史古迹的描述、翻译古代文字、保存口述传统和创造教育内容,用户和研究人员越来越依赖大语言模型(LLMS),但是,随着大语言模型(LLM)在与文化遗产有关的任务中日益流行,例如对历史古迹的描述、古代文字的翻译、保护口头传统和创造教育内容,其制作准确和文化一致的文本的能力日益受到使用者和研究人员的信赖,但是,大语言模型(LLMS)在制作的文本中可能存在文化价值的不匹配,例如对历史事实的歪曲、文化特征的侵蚀和复杂文化叙事的过于简单化,从而可能导致严重后果的后果。 因此,调查LMLM中文化遗产中的价值不匹配问题对于减轻这些风险至关重要,但在这一领域却严重缺乏系统和全面的研究和调查。 为了填补这一空白,我们系统地评估LLMS的可靠性,我们系统地评估了LMS的可靠性,我们通过汇编了一套广泛的1066查询任务,涉及5个公开的文化遗产知识框架中的17个方面,并检查了文化价值的分类的种类的种类和比例。


Article 286

Title@2025-06-16 (1): Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models?

Title: Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models? Erlebnishafte semantische Information und Gehirnausrichtung: Sind multimodale Modelle besser als Sprachmodelle? 实际的语义信息和脑力调整:多模式模式是否比语言模式更好? 2504.00942v2

Authors (2): Anna Bavaresco, Raquel Fernández

A common assumption in Computational Linguistics is that text representations learnt by multimodal models are richer and more human-like than those by language-only models, as they are grounded in images or audio – similar to how human language is grounded in real-world experiences. However, empirical studies checking whether this is true are largely lacking. We address this gap by comparing word representations from contrastive multimodal models vs. language-only ones in the extent to which they capture experiential information – as defined by an existing norm-based ‘experiential model’ – and align with human fMRI responses. Our results indicate that, surprisingly, language-only models are superior to multimodal ones in both respects. Additionally, they learn more unique brain-relevant semantic information beyond that shared with the experiential model. Overall, our study highlights the need to develop computational models that better integrate the complementary semantic information provided by multimodal data sources.

计算语言学的一个共同假设是,多式联运模型所学的文字表述比仅以语言为基础的模型所学的文字表述更加丰富和更加人性化,因为它们以图像或音频为基础 – – 类似于人类语言如何以现实世界经验为基础。然而,基本上缺乏经验性研究来核查这是否属实。我们通过比较反常的多式联运模型和只使用语言的模型的文字表述来弥补这一差距 – – 其范围是现有基于规范的“实验模型”所定义的实验性信息 – – 并与人类FMRI反应相一致。我们的结果表明,令人惊讶的是,只使用语言的模型在两个方面都优于多语种模式。此外,它们所学的更独特的与大脑有关的语义信息超出了与前向模型所共享的信息。总体而言,我们的研究强调需要开发计算模型,以更好地整合由多式联运数据源提供的补充语义信息。


Article 287

Title@2025-06-16 (1): Idiosyncrasies in Large Language Models

Title: Idiosyncrasies in Large Language Models Eigenheiten in großen Sprachmodellen 大语言模式的特派专家 2502.12150v2

Authors (5): Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu

In this work, we unveil and study idiosyncrasies in Large Language Models (LLMs) – unique patterns in their outputs that can be used to distinguish the models. To do so, we consider a simple classification task: given a particular text output, the objective is to predict the source LLM that generates the text. We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals that these idiosyncrasies are rooted in word-level distributions. These patterns persist even when the texts are rewritten, translated, or summarized by an external LLM, suggesting that they are also encoded in the semantic content. Additionally, we leverage LLM as judges to generate detailed, open-ended descriptions of each model’s idiosyncrasies. Finally, we discuss the broader implications of our findings, including training on synthetic data, inferring model similarity, and robust evaluation of LLMs. Code is available at https://github.com/locuslab/llm-idiosyncrasies.

在这项工作中,我们揭开并研究大语言模型(LLMs)的特异性,即其产出中可用于区分模型的独特模式。为了做到这一点,我们考虑一个简单的分类任务:根据一个特定的文本输出,目标是预测生成文本的源LLM;我们评估了各种LMs各组的合成任务,发现在LLM生成的文本中嵌入模型的微调文本能够产生极好的分类准确性。值得注意的是,我们在涉及查特GPT、Claude、Grok、Gemini和DeepSeek的五道分类问题中,在搁置的验证数据中实现了97.1%的准确性。我们的进一步调查显示,这些特异性数据植根于字级分布中。这些模式即使文本重新编写、翻译或由一个外部LMMMs摘要,也持续存在,表明它们也是在语义内容中编码的。此外,我们利用LMMM作为法官对每种模式的特异性分类进行详细、开放的描述。最后,我们讨论了我们的调查结果的更广泛影响,包括MLADMS/CS的可靠模型。


Article 288

Title@2025-06-16 (1): CAMS: A CityGPT-Powered Agentic Framework for Urban Human Mobility Simulation

Title: CAMS: A CityGPT-Powered Agentic Framework for Urban Human Mobility Simulation CAMS: CityGPT-Powered Agentic Framework für die Simulation urbaner menschlicher Mobilität CAMS: 城市GPT授权的城市人类流动模拟活动代理框架 2506.13599v1

Authors (4): Yuwei Du, Jie Feng, Jian Yuan, Yong Li

Human mobility simulation plays a crucial role in various real-world applications. Recently, to address the limitations of traditional data-driven approaches, researchers have explored leveraging the commonsense knowledge and reasoning capabilities of large language models (LLMs) to accelerate human mobility simulation. However, these methods suffer from several critical shortcomings, including inadequate modeling of urban spaces and poor integration with both individual mobility patterns and collective mobility distributions. To address these challenges, we propose \textbf{C}ityGPT-Powered \textbf{A}gentic framework for \textbf{M}obility \textbf{S}imulation (\textbf{CAMS}), an agentic framework that leverages the language based urban foundation model to simulate human mobility in urban space. \textbf{CAMS} comprises three core modules, including MobExtractor to extract template mobility patterns and synthesize new ones based on user profiles, GeoGenerator to generate anchor points considering collective knowledge and generate candidate urban geospatial knowledge using an enhanced version of CityGPT, TrajEnhancer to retrieve spatial knowledge based on mobility patterns and generate trajectories with real trajectory preference alignment via DPO. Experiments on real-world datasets show that \textbf{CAMS} achieves superior performance without relying on externally provided geospatial information. Moreover, by holistically modeling both individual mobility patterns and collective mobility constraints, \textbf{CAMS} generates more realistic and plausible trajectories. In general, \textbf{CAMS} establishes a new paradigm that integrates the agentic framework with urban-knowledgeable LLMs for human mobility simulation.

人类流动性模拟在现实世界的各种应用中发挥着关键作用。 最近,为了解决传统数据驱动方法的局限性,研究人员探索了利用大型语言模型(LLMs)的常识知识和推理能力加速人类流动性模拟。然而,这些方法存在一些重大缺陷,包括城市空间模型不足,以及与个人流动性模式和集体流动性分布的整合不力。为了应对这些挑战,我们提议了\ textbf{C}GPT-dowered\ textbf{A}正统框架,用于\ textbf{M}传统数据驱动方法的局限性。 研究人员探索了大型语言模型(\ textb{CAMS)模拟(\ textb{CAMS})的常识和推介能力,通过移动模式和基于真实的CA-SLILIMS, 以真实的流动性模型和CA-CA的稳定性调整, 提供了基于真实的流动性模型。


Article 289

Title@2025-06-16 (1): Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

Title: Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems Qwen vs. Gemma Integration mit Whisper: Eine vergleichende Studie in mehrsprachigen Sprach-LLM-Systemen Quwen诉Gemma 与低语融合:多语种语言LLLM系统比较研究 2506.13596v1

Authors (3): Tuan Nguyen, Long-Vu Hoang, Huy-Dat Tran

This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance with a private test average WER/CER result of 16.63% using the Gemma3-12B and 18.6% using the Qwen2.5-7B as decoder-only language model.

本文件介绍我们的刚果解放运动-苏丹解放运动2025年挑战系统,重点是多语种语音识别和语言模型以及大型语言模型(LLMs),我们的方法是将微调的Whisper大V3编码器与高效投影器结构及各种解码器配置相结合,我们采用三阶段培训方法,逐步优化编码器、投影器和LLM组件。我们的系统实现了竞争性性能,通过使用Gemma3-12B和18.6%(使用Qwen2.5-7B作为只读解码器语言模型)的私人测试结果,WER/CER平均为16.63%,使用Gemma3-12B和18.6%(使用Qwen2.5-7B作为解码器唯一的语言模型)。


Article 290

Title@2025-06-16 (1): MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Title: MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention MiniMax-M1: Skalierungstestzeit effizient berechnen mit Blitz Achtung Minimax-M1: 以闪电注意有效计算缩放测试时间 2506.13585v1

Authors (128): MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Zhu, Jian Sun, Jiaqi Zhuang, Jiaren Cai, Jiayuan Song, Jin Zhu, Jingyang Li, Jinhao Tian, Jinli Liu, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kaiyi Feng, Ke Yang, Kecheng Xiao, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Li, Lin Zheng, Linge Du, Lingyu Yang, Lunbin Zeng, Minghui Yu, Mingliang Tao, Mingyuan Chi, Mozhi Zhang, Mujie Lin, Nan Hu, Nongyu Di, Peng Gao, Pengfei Li, Pengyu Zhao, Qibing Ren, Qidi Xu, Qile Li, Qin Wang, Rong Tian, Ruitao Leng, Shaoxiang Chen, Shaoyu Chen, Shengmin Shi, Shitong Weng, Shuchang Guan, Shuqi Yu, Sichen Li, Songquan Zhu, Tengfei Li, Tianchi Cai, Tianrun Liang, Weiyu Cheng, Weize Kong, Wenkai Li, Xiancai Chen, Xiangjun Song, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xinzhu Hou, Xuan Lu, Xun Zou, Xuyang Shen, Yan Gong, Yan Ma, Yang Wang, Yiqi Shi, Yiran Zhong, Yonghong Duan, Yongxiang Fu, Yongyi Hu, Yu Gao, Yuanxiang Fan, Yufeng Yang, Yuhao Li, Yulin Hu, Yunan Huang, Yunji Li, Yunzhi Xu, Yuxin Mao, Yuxuan Shi, Yuze Wenren, Zehan Li, Zelin Li, Zhanxu Tian, Zhengmao Zhu, Zhenhua Fan, Zhenzhen Wu, Zhichao Xu, Zhihang Yu, Zhiheng Lyu, Zhuo Jiang, Zibo Gao, Zijia Wu, Zijian Song, Zijun Sun

We introduce MiniMax-M1, the world’s first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1’s inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1’s full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.

我们引入了MiniMax- M1, 世界上第一个开放重量、 大型混合关注推理模型。 MiniMax- M1 由混合混合混合Mixture- Experts(MOE) 架构和闪电关注机制驱动。 该模型是根据我们以前的MiniMax-Text-01模型开发的,该模型共有4560亿参数,每个象征性启动459亿参数。M1模型本地支持了100万个上标的上下文长度,8xDeepSeekMMR1 的上下文尺寸。此外, MiniMax- M1 的闪电关注机制能够有效缩放测试时间计算。 MiniMax- M1 的电动模型使M1 特别适合复杂的任务,需要处理长期投入和广泛思考。 Minimax- M1 使用大规模强化学习(RL) , 包括基于沙箱的、 真实的软件工程环境。 除了Mial- ial 培训的内在效率优势之外, 我们提议CISPOS- mex- mix- dal 3 和Slex- demal- destal 版本的Sex- dreal- dreal- dreal dreal dreal dreal dreal dreal disaldal disal disal disal disal ex ex ex ex ex ex ex ex ex ex ex exmexmal exmex ex ex ex exmexmexmexmexmex ex ex ex ex ex exx ex ex ex ex exx exx exx ex ex ex exal disal disal exal ex ex ex ex ex ex ex ex ex ex exal ex ex exmaldaldaldal ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex ex


Article 291

Title@2025-06-16 (1): Flexible-length Text Infilling for Discrete Diffusion Models

Title: Flexible-length Text Infilling for Discrete Diffusion Models Flexible Text-Infilling für diskrete Diffusionsmodelle 为分立扩散模型填充文本 2506.13579v1

Authors (4): Andrew Zhang, Anushka Sivakumar, Chiawei Tang, Chris Thomas

Discrete diffusion models are a new class of text generators that offer advantages such as bidirectional context use, parallelizable generation, and flexible prompting compared to autoregressive models. However, a critical limitation of discrete diffusion models is their inability to perform flexible-length or flexible-position text infilling without access to ground-truth positional data. We introduce \textbf{DDOT} (\textbf{D}iscrete \textbf{D}iffusion with \textbf{O}ptimal \textbf{T}ransport Position Coupling), the first discrete diffusion model to overcome this challenge. DDOT jointly denoises token values and token positions, employing a novel sample-level Optimal Transport (OT) coupling. This coupling preserves relative token ordering while dynamically adjusting the positions and length of infilled segments, a capability previously missing in text diffusion. Our method is orthogonal to existing discrete text diffusion methods and is compatible with various pretrained text denoisers. Extensive experiments on text infilling benchmarks such as One-Billion-Word and Yelp demonstrate that DDOT outperforms naive diffusion baselines. Furthermore, DDOT achieves performance on par with state-of-the-art non-autoregressive models and enables significant improvements in training efficiency and flexibility.

分解扩散模型是一种新型的文本生成器,它提供了优势,例如双向背景使用、平行生成和与自动递增模型相比的灵活促动。然而,离散扩散模型的关键限制是,它们无法执行灵活长度或灵活位置的文本以填充而不访问地面真实位置数据。我们引入了\ textbff{DDOT} (\ textbf{D}{D}reste {Textbf{Dif}Difive 扩散与\ textbf{O}ptimal\ textbf{Transport 定位组合) ,这是克服这一挑战的第一个离散扩散模型。 但是,离散扩散模型的一个关键限制是,它们无法使用新的样本级最佳运输(Ot)组合来执行灵活长或灵活位置的文本。我们引入了相对象征性的排序,同时动态调整了填充区段的位置和长度,而以前在文本传播中缺少这种能力。我们的方法与现有的离散文本传播方法有交调,并且与各种事先经过培训的文本缩缩缩化模型相兼容,在文本上进行广泛的测试,在文本更新的测试中,在测试中实现了透明性测试性改进性改进了文本的改进性测试后,在文本的改进了基础和升级测试后,在版本中,在版本中,在版本中,在版本中,在版本中,在版本化了基础性地展示DDDDDDDDDDDDDDDDDDDDDBA-B-B-B-B-B-BRD-S-BRD-S-B-S-B-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S


Article 292

Title@2025-06-16 (1): Characterizing Linguistic Shifts in Croatian News via Diachronic Word Embeddings

Title: Characterizing Linguistic Shifts in Croatian News via Diachronic Word Embeddings Sprachliche Verschiebungen in Kroatischen Nachrichten über diachronische Wort-Embeddings charakterisieren 《克罗地亚新闻》通过旧时单词嵌入式将语言变化定性为克罗地亚新闻 2506.13569v1

Authors (5): David Dukić, Ana Barić, Marko Čuljak, Josip Jukić, Martin Tutek

Measuring how semantics of words change over time improves our understanding of how cultures and perspectives change. Diachronic word embeddings help us quantify this shift, although previous studies leveraged substantial temporally annotated corpora. In this work, we use a corpus of 9.5 million Croatian news articles spanning the past 25 years and quantify semantic change using skip-gram word embeddings trained on five-year periods. Our analysis finds that word embeddings capture linguistic shifts of terms pertaining to major topics in this timespan (COVID-19, Croatia joining the European Union, technological advancements). We also find evidence that embeddings from post-2020 encode increased positivity in sentiment analysis tasks, contrasting studies reporting a decline in mental health over the same period.

测量文字的语义如何随时间变化,可以增进我们对文化和观点如何变化的理解。 旧词嵌入有助于我们量化这一转变,尽管先前的研究利用了大量时间上附加说明的社团。 在这项工作中,我们利用了长达25年的950万篇克罗地亚新闻文章来量化语义变化,并使用经过五年培训的跳格字嵌入器来量化语义变化。我们的分析发现,语言嵌入反映了与这一时期主要议题(COVID-19,克罗地亚加入欧盟,技术进步)相关的语言术语变化。 我们还发现,有证据表明,2020年后的语义变化增加了情感分析任务的假设性,而报告同期心理健康下降的研究则与此形成对比。


Article 293

Title@2025-06-16 (1): Understand the Implication: Learning to Think for Pragmatic Understanding

Title: Understand the Implication: Learning to Think for Pragmatic Understanding Die Implikation verstehen: Lernen, für Pragmatisches Verständnis zu denken 理解影响:学会思考实用理解 2506.13559v1

Authors (5): Settaluri Lakshmi Sravanthi, Kishan Maharaj, Sravani Gunnu, Abhijit Mishra, Pushpak Bhattacharyya

Pragmatics, the ability to infer meaning beyond literal interpretation, is crucial for social cognition and communication. While LLMs have been benchmarked for their pragmatic understanding, improving their performance remains underexplored. Existing methods rely on annotated labels but overlook the reasoning process humans naturally use to interpret implicit meaning. To bridge this gap, we introduce a novel pragmatic dataset, ImpliedMeaningPreference, that includes explicit reasoning (thoughts) for both correct and incorrect interpretations. Through preference-tuning and supervised fine-tuning, we demonstrate that thought-based learning significantly enhances LLMs’ pragmatic understanding, improving accuracy by 11.12% across model families. We further discuss a transfer-learning study where we evaluate the performance of thought-based training for the other tasks of pragmatics (presupposition, deixis) that are not seen during the training time and observe an improvement of 16.10% compared to label-trained models.

实用数学是超越字面解释的推断含义的能力,它对于社会认知和沟通至关重要。虽然LLMs已经为实用理解设定基准,但其绩效的改善仍然没有得到充分利用。现有的方法依靠附加说明的标签,但忽略了人类自然用来解释隐含含义的推理过程。为了缩小这一差距,我们引入了一个全新的实用数据集,即隐含的MeaningPreview,其中包括对正确和不正确解释的明确推理(想法)。通过优惠调整和监督的微调,我们证明基于思想的学习极大地增强了LLMs的务实理解,提高了模型家庭11.12%的准确性。我们进一步讨论了转移学习研究,我们评估了在培训期间没有看到的其他务实(假设,deixis)基于思想的培训的绩效,并观察到与经过标签培训的模式相比,16.10%的改进。


Article 294

Title@2025-06-16 (1): EmoDynamiX: Emotional Support Dialogue Strategy Prediction by Modelling MiXed Emotions and Discourse Dynamics

Title: EmoDynamiX: Emotional Support Dialogue Strategy Prediction by Modelling MiXed Emotions and Discourse Dynamics EmoDynamiX: Emotionale Unterstützung Dialog Strategie Vorhersage durch Modellierung von MiXed Emotionen und Diskurs Dynamik EmoDynamiX:通过模拟消化情感和话题动态预测情感支持对话战略 2408.08782v5

Authors (3): Chenwei Wan, Matthieu Labeau, Chloé Clavel

Designing emotionally intelligent conversational systems to provide comfort and advice to people experiencing distress is a compelling area of research. Recently, with advancements in large language models (LLMs), end-to-end dialogue agents without explicit strategy prediction steps have become prevalent. However, implicit strategy planning lacks transparency, and recent studies show that LLMs’ inherent preference bias towards certain socio-emotional strategies hinders the delivery of high-quality emotional support. To address this challenge, we propose decoupling strategy prediction from language generation, and introduce a novel dialogue strategy prediction framework, EmoDynamiX, which models the discourse dynamics between user fine-grained emotions and system strategies using a heterogeneous graph for better performance and transparency. Experimental results on two ESC datasets show EmoDynamiX outperforms previous state-of-the-art methods with a significant margin (better proficiency and lower preference bias). Our approach also exhibits better transparency by allowing backtracing of decision making.

最近,在大型语言模型(LLMS)中,端对端对话代理商在没有明确战略预测步骤的情况下取得了进步。然而,隐含的战略规划缺乏透明度,最近的研究显示,LMS对某些社会情感战略的固有偏好妨碍了提供高质量的情感支持。为了应对这一挑战,我们提议将战略预测与语言生成脱钩,并引入一个新的对话战略预测框架EmoDynamiX,它利用一个不同图表来模拟用户精细的情感和系统战略之间的谈话动态,以便提高性能和透明度。两个ESC数据集的实验结果显示EmoDynamiX的实验结果显示,EmoDynamiX超越了以往的先进方法,并有很大的优势(更熟练,更低的偏好偏好)。我们的方法也通过允许回溯决策,展示了更高的透明度。


Article 295

Title@2025-06-16 (1): Towards a Cascaded LLM Framework for Cost-effective Human-AI Decision-Making

Title: Towards a Cascaded LLM Framework for Cost-effective Human-AI Decision-Making Auf dem Weg zu einem kaskadenten LLM-Rahmen für kosteneffiziente Entscheidungsfindung zwischen Mensch und KI 建立具有成本效益的人类-AI决策框架 2506.11887v2

Authors (2): Claudio Fanconi, Mihaela van der Schaar

Effective human-AI decision-making balances three key factors: the \textit{correctness} of predictions, the \textit{cost} of knowledge and reasoning complexity, and the confidence about whether to \textit{abstain} automated answers or involve human experts. In this work, we present a cascaded LLM decision framework that adaptively delegates tasks across multiple tiers of expertise – a base model for initial candidate answers, a more capable and knowledgeable (but costlier) large model, and a human expert for when the model cascade abstains. Our method proceeds in two stages. First, a deferral policy determines whether to accept the base model’s answer or regenerate it with the large model based on the confidence score. Second, an abstention policy decides whether the cascade model response is sufficiently certain or requires human intervention. Moreover, we incorporate an online learning mechanism in the framework that can leverage human feedback to improve decision quality over time. We demonstrate this approach to general question-answering (ARC-Easy and ARC-Challenge) and medical question-answering (MedQA and MedMCQA). Our results show that our cascaded strategy outperforms in most cases single-model baselines in accuracy while reducing cost and providing a principled way to handle abstentions.

有效的人类-AI决策平衡了三个关键因素:预测的基本模型、知识和推理复杂程度的预测值、知识和推理复杂度的计算值{成本},以及对于是否采用/textit{abtain自动回答或涉及人类专家的信心。在这项工作中,我们提出了一个分级的LLM决定框架,它适应性地代表跨多个专业层次的工作 – – 一个初步候选人回答的基础模型,一个能力更强、知识更丰富(但成本更高)的大模型,以及模型级联放弃时的一位人类专家。我们的方法分两个阶段进行。首先,推迟政策决定是接受基础模型的回答还是根据信任分数重新生成大模型。第二,弃权政策决定级联式反应是否足够确定或需要人类干预。此外,我们把在线学习机制纳入能够利用人类反馈来提高决策质量的框架。我们展示了这种一般性回答(AC-Easy和AC-C-Challge)和医学问题解答方法(MedQA和MEMQA)以及基于信心评分大小模型的大模型的答案(MedQA)是否接受或重新生成。我们的成果在标准上显示我们以降低成本的基线的战略。


Article 296

Title@2025-06-16 (1): Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization

Title: Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization Mischung aus gewichtsgeteilter Heterogener Gruppe Aufmerksamkeit Experten für dynamische Token-weise KV-Optimierung KV 优化动态调制调效 KV 优化小组注意问题专家 2506.13541v1

Authors (6): Guanghui Song, Dongping Liao, Yiren Zhao, Kejiang Ye, Cheng-zhong Xu, Xitong Gao

Transformer models face scalability challenges in causal language modeling (CLM) due to inefficient memory allocation for growing key-value (KV) caches, which strains compute and storage resources. Existing methods like Grouped Query Attention (GQA) and token-level KV optimization improve efficiency but rely on rigid resource allocation, often discarding “low-priority” tokens or statically grouping them, failing to address the dynamic spectrum of token importance. We propose mixSGA, a novel mixture-of-expert (MoE) approach that dynamically optimizes token-wise computation and memory allocation. Unlike prior approaches, mixSGA retains all tokens while adaptively routing them to specialized experts with varying KV group sizes, balancing granularity and efficiency. Our key novelties include: (1) a token-wise expert-choice routing mechanism guided by learned importance scores, enabling proportional resource allocation without token discard; (2) weight-sharing across grouped attention projections to minimize parameter overhead; and (3) an auxiliary loss to ensure one-hot routing decisions for training-inference consistency in CLMs. Extensive evaluations across Llama3, TinyLlama, OPT, and Gemma2 model families show mixSGA’s superiority over static baselines. On instruction-following and continued pretraining tasks, mixSGA achieves higher ROUGE-L and lower perplexity under the same KV budgets.

在因果语言建模(CLM)中,由于对不断增长的关键值缓存(KV)的记忆分配效率低下,造成计算和储存资源紧张,因此变形模型在因果语言建模(CLM)方面面临可缩放挑战。现有的方法,如Group Queration(GQA)和象征性KV优化,提高了效率,但依靠僵硬的资源分配,常常丢弃“低优先”的标牌或静态组合,未能解决具有象征性重要性的动态范围。我们提出了混合SGA(一种新型专家混合组合(MoE)办法,能动态优化象征性计算和记忆分配。与以前的做法不同,混合SGA(M)保留所有标牌,同时适应性地将其分配给具有不同KVGV群规模、颗粒性和效率的专业人员,同时平衡微粒性KVA预算。我们的主要新做法包括:(1) 以学习重要性分为指南为指导的象征性专家选择路线机制,使资源分配比例相等的弃物。(2) 集团更高关注的预测权重分担,以尽量减少参数的间接费用;(3) 辅助损失,以确保在CLMLM(C)和SOM(GIM)下,持续进行关于SB)B(K-L)B)B(K-L)BL)的排序基准下,持续评估。


Article 297

Title@2025-06-16 (1): Affordable AI Assistants with Knowledge Graph of Thoughts

Title: Affordable AI Assistants with Knowledge Graph of Thoughts Erschwingliche KI-Assistenten mit Wissensgrafik der Gedanken 具有知识思想知识图的负担得起的AI助理 2504.02670v3

Authors (18): Maciej Besta, Lorenzo Paleari, Jia Hao Andrea Jiang, Robert Gerstenberger, You Wu, Jón Gunnar Hannesson, Patrick Iff, Ales Kubicek, Piotr Nyczyk, Diana Khimey, Nils Blach, Haiqiang Zhang, Tao Zhang, Peiran Ma, Grzegorz Kwaśniewski, Marcin Copik, Hubert Niewiadomski, Torsten Hoefler

Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively while also minimizing bias and noise. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs by over 36x compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a scalable, affordable, versatile, and high-performing solution for AI assistants.

大型语言模型(LLMS)正在使能够跨领域执行不同任务的AI助理的发展革命性地革命性地革命性地革命性地革命性地发展能够执行不同任务的AI助理;然而,目前最先进的LLM驱动的代理商面临重大挑战,包括高运作成本和在诸如GAIA等复杂基准上的成功率有限。为了解决这些问题,我们提出了“思想知识图”(KGOT),这是一个创新的AI助理架构,将LLM推理与动态构建的知识图(KGGs)相结合。KGOT的摘录和结构任务相关知识,成为动态的KGG代表,通过数学解答器、网络爬虫器和Python脚本等外部工具不断加强。这种任务相关知识的结构化代表使低成本模式能够有效地解决复杂的任务,同时尽量减少偏见和噪音。例如,KGOT在GIA基准上实现了29%的任务成功率的提高,而GUG Fegg Feg Face Adriendorations (eg,Q-Go-Go-B) 和可负担得起的ASyal-GOATIal-GO1-B 提供的高标准70B)


Article 298

Title@2025-06-16 (1): TensorSLM: Energy-efficient Embedding Compression of Sub-billion Parameter Language Models on Low-end Devices

Title: TensorSLM: Energy-efficient Embedding Compression of Sub-billion Parameter Language Models on Low-end Devices TensorSLM: Energieeffiziente Einbettung Komprimierung von Submilliarden-Parameter-Sprachmodellen auf Low-End-Geräten Tensor SLM:低端设备上10亿分数以下低端设备语言模型的节能嵌入压缩 2506.13514v1

Authors (3): Mingxue Xu, Yao Lei Xu, Danilo P. Mandic

Small Language Models (SLMs, or on-device LMs) have significantly fewer parameters than Large Language Models (LLMs). They are typically deployed on low-end devices, like mobile phones and single-board computers. Unlike LLMs, which rely on increasing model size for better generalisation, SLMs designed for edge applications are expected to have adaptivity to the deployment environments and energy efficiency given the device battery life constraints, which are not addressed in datacenter-deployed LLMs. This paper addresses these two requirements by proposing a training-free token embedding compression approach using Tensor-Train Decomposition (TTD). Each pre-trained token embedding vector is converted into a lower-dimensional Matrix Product State (MPS). We comprehensively evaluate the extracted low-rank structures across compression ratio, language task performance, latency, and energy consumption on a typical low-end device, i.e. Raspberry Pi. Taking the sub-billion parameter versions of GPT-2/Cerebres-GPT and OPT models as examples, our approach achieves a comparable language task performance to the original model with around $2.0\times$ embedding layer compression, while the energy consumption of a single query drops by half.

小型语言模型(SLMs或在线LMS)的参数比大型语言模型(LLMS)要少得多,它们通常部署在低端装置上,例如移动电话和单机计算机。与LLMS不同,LMS依靠日益增大的模型尺寸来更好地概括,而用于边缘应用的可持续土地管理则由于装置电池寿命的限制,因此预计能够适应部署环境和能源效率,而装置电池使用寿命的限制没有在数据中心部署的LMS(SLMs)中加以解决。本文通过采用Tensor-Train Train Decomposition(TTD)提出无培训的象征性嵌入压缩方法来解决这两项要求。每个预先训练的象征性嵌入矢量都转换成一个低维度的母体产品国(MPS)。 我们全面评价了在典型的低端装置上抽取的低级结构,即Raspberry Pi。 以GPT-2/Cebebres-GPT和OPT模型的10亿分参数版本作为例子,我们的方法实现了与原始模型的类似语言任务性功能性表现,大约2美元,同时进行一次性压缩1倍的能源压压1层。


Article 299

Title@2025-06-16 (1): JEPA4Rec: Learning Effective Language Representations for Sequential Recommendation via Joint Embedding Predictive Architecture

Title: JEPA4Rec: Learning Effective Language Representations for Sequential Recommendation via Joint Embedding Predictive Architecture JEPA4Rec: Effektive Sprachrepräsentanzen für sequentielle Empfehlung durch gemeinsame Einbettung vorausschauender Architektur lernen JEPA4Rec: 通过联合嵌入的预测架构,学习有效的语言代表,以提出序列建议 2504.10512v2

Authors (2): Minh-Anh Nguyen, Dung D. Le

Language representation learning has emerged as a promising approach for sequential recommendation, thanks to its ability to learn generalizable representations. However, despite its advantages, this approach still struggles with data sparsity and a limited understanding of common-sense user preferences. To address these limitations, we propose $\textbf{JEPA4Rec}$, a framework that combines $\textbf{J}$oint $\textbf{E}$mbedding $\textbf{P}$redictive $\textbf{A}$rchitecture with language modeling of item textual descriptions. JEPA4Rec captures semantically rich and transferable representations, improving recommendation performance and reducing reliance on large-scale pre-training data. Specifically, JEPA4Rec represents items as text sentences by flattening descriptive information such as $\textit{title, category}$, and other attributes. To encode these sentences, we employ a bidirectional Transformer encoder with modified embedding layers tailored for capturing item information in recommendation datasets. We apply masking to text sentences and use them to predict the representations of the unmasked sentences, helping the model learn generalizable item embeddings. To further improve recommendation performance and language understanding, we employ a two-stage training strategy incorporating self-supervised learning losses. Experiments on six real-world datasets demonstrate that JEPA4Rec consistently outperforms state-of-the-art methods, particularly in cross-domain, cross-platform, and low-resource scenarios.

语言代表学习由于能够学习可概括的表述,已成为一个有希望的顺序建议方法。然而,尽管这种方法具有优势,但它仍然与数据宽度和对常识用户偏好的理解有限而挣扎。为了解决这些局限性,我们提议使用$textbf{JEP4Rec}美元,这是一个将美元(textbf{J}$)和美元(textbf{P}美元)和其他属性合并起来的框架。为了对这些句进行编码,我们使用双向变换器,经修改的嵌入层,用于对项目文字描述进行语言模型的模拟。JEPA4Rec捕捉到内容丰富和可转让的表达方式,改进建议性能,减少对大规模培训前数据的依赖。具体地说,JEPA4Rec用美元作为文字句子,平整齐描述描述,例如 $\textbetriit* titlef{le, 类别} 以及其它属性。我们使用双向变码转换器,修改内嵌层,用于在建议型数据集中收集项目信息。我们用正变换的文本战略,我们用六级的缩化工具,在学习中,我们学习学习的版本,用它用来预测。


Article 300

Title@2025-06-16 (1): K/DA: Automated Data Generation Pipeline for Detoxifying Implicitly Offensive Language in Korean

Title: K/DA: Automated Data Generation Pipeline for Detoxifying Implicitly Offensive Language in Korean K/DA: Automatisierte Datengenerierungspipeline für die Entgiftung implizit anstößiger Sprache auf Koreanisch K/DA:用韩语解毒的自动数据生成管道 2506.13513v1

Authors (6): Minkyeong Jeon, Hyemin Jeong, Yerang Kim, Jiyoung Kim, Jae Hyeon Cho, Byung-Jun Lee

Language detoxification involves removing toxicity from offensive language. While a neutral-toxic paired dataset provides a straightforward approach for training detoxification models, creating such datasets presents several challenges: i) the need for human annotation to build paired data, and ii) the rapid evolution of offensive terms, rendering static datasets quickly outdated. To tackle these challenges, we introduce an automated paired data generation pipeline, called K/DA. This pipeline is designed to generate offensive language with implicit offensiveness and trend-aligned slang, making the resulting dataset suitable for detoxification model training. We demonstrate that the dataset generated by K/DA exhibits high pair consistency and greater implicit offensiveness compared to existing Korean datasets, and also demonstrates applicability to other languages. Furthermore, it enables effective training of a high-performing detoxification model with simple instruction fine-tuning.

nan


Article 301

Title@2025-06-16 (1): BOW: Bottlenecked Next Word Exploration

Title: BOW: Bottlenecked Next Word Exploration BOW: Engagierte nächste Wort-Exploration BOW: 下个单词探索的瓶颈 2506.13502v1

Authors (5): Ming Shen, Zhikun Xu, Xiao Ye, Jacob Dineen, Ben Zhou

Large language models (LLMs) are typically trained via next-word prediction (NWP), which provides strong surface-level fluency but often lacks support for robust reasoning. We propose BOttlenecked next Word exploration (BOW), a novel RL framework that rethinks NWP by introducing a reasoning bottleneck where a policy model first generates a reasoning path rather than predicting the next token directly, after which a frozen judge model predicts the next token distribution based solely on this reasoning path. We train the policy model using GRPO with rewards that quantify how effectively the reasoning path facilitates next-word recovery. Compared with other continual pretraining baselines, we show that BOW improves both the general and next-word reasoning capabilities of the base model, evaluated on various benchmarks. Our findings show that BOW can serve as an effective and scalable alternative to vanilla NWP.

nan


Article 302

Title@2025-06-16 (1): TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs

Title: TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs TurBLiMP: Ein türkischer Benchmark für linguistische Minimal Pairs TurBLIMP:土耳其语言最小对等基准 2506.13487v1

Authors (4): Ezgi Başar, Francesca Padovani, Jaap Jumelet, Arianna Bisazza

We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge Large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.

nan


Article 303

Title@2025-06-16 (1): Position: Pause Recycling LoRAs and Prioritize Mechanisms to Uncover Limits and Effectiveness

Title: Position: Pause Recycling LoRAs and Prioritize Mechanisms to Uncover Limits and Effectiveness Position: Recycling von LoRAs aushalten und Mechanismen priorisieren, um Grenzen und Wirksamkeit aufzudecken 立场:暂停再循环回收 LoRAs和优先机制 2506.13479v1

Authors (4): Mei-Yen Chen, Thi Thu Uyen Hoang, Michael Hahn, M. Saquib Sarfraz

Merging or routing low-rank adapters (LoRAs) has emerged as a popular solution for enhancing large language models, particularly when data access is restricted by regulatory or domain-specific constraints. This position paper argues that the research community should shift its focus from developing new merging or routing algorithms to understanding the conditions under which reusing LoRAs is truly effective. Through theoretical analysis and synthetic two-hop reasoning and math word-problem tasks, we examine whether reusing LoRAs enables genuine compositional generalization or merely reflects shallow pattern matching. Evaluating two data-agnostic methods–parameter averaging and dynamic adapter selection–we found that reusing LoRAs often fails to logically integrate knowledge across disjoint fine-tuning datasets, especially when such knowledge is underrepresented during pretraining. Our empirical results, supported by theoretical insights into LoRA’s limited expressiveness, highlight the preconditions and constraints of reusing them for unseen tasks and cast doubt on its feasibility as a truly data-free approach. We advocate for pausing the pursuit of novel methods for recycling LoRAs and emphasize the need for rigorous mechanisms to guide future academic research in adapter-based model merging and practical system designs for practitioners.

nan


Article 304

Title@2025-06-16 (1): Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning

Title: Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning Sprachagenten für hypothesisgetriebene klinische Entscheidungsfindung mit Verstärkungslernen 与强化学习一起进行假冒主义驱动临床决策的语言代理 2506.13474v1

Authors (5): David Bani-Harouni, Chantal Pellegrini, Ege Özsoy, Matthias Keicher, Nassir Navab

Clinical decision-making is a dynamic, interactive, and cyclic process where doctors have to repeatedly decide on which clinical action to perform and consider newly uncovered information for diagnosis and treatment. Large Language Models (LLMs) have the potential to support clinicians in this process, however, most applications of LLMs in clinical decision support suffer from one of two limitations: Either they assume the unrealistic scenario of immediate availability of all patient information and do not model the interactive and iterative investigation process, or they restrict themselves to the limited “out-of-the-box” capabilities of large pre-trained models without performing task-specific training. In contrast to this, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis via repeatedly requesting and interpreting relevant tests. Using a hybrid training paradigm combining supervised and reinforcement learning, we train LA-CDM with three objectives targeting critical aspects of clinical decision-making: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making. We evaluate our methodology on MIMIC-CDM, a real-world dataset covering four abdominal diseases containing various clinical tests and show the benefit of explicitly training clinical decision-making for increasing diagnostic performance and efficiency.

nan


Article 305

Title@2025-06-16 (1): When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text

Title: When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text Wenn die Detektion fehlschlägt: Die Macht von fein-getönten Modellen, um menschenähnliche Social Media-Texte zu erzeugen 当检测失败时:制作像人类一样的社会媒体文字的精选模型的力量 2506.09975v2

Authors (3): Hillary Dawkins, Kathleen C. Fraser, Svetlana Kiritchenko

Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.

nan


Article 306

Title@2025-06-16 (1): Abstract, Align, Predict: Zero-Shot Stance Detection via Cognitive Inductive Reasoning

Title: Abstract, Align, Predict: Zero-Shot Stance Detection via Cognitive Inductive Reasoning Abstract, Align, Predict: Zero-Shot Stance Detection über Kognitive Induktive Reasoning 摘要、对称、预测:通过认知感性诱导理由探测零热静态 2506.13470v1

Authors (6): Jun Ma, Fuqiang Niu, Dong Li, Jinzhou Cao, Genan Dai, Bowen Zhang

Zero-shot stance detection (ZSSD) aims to identify the stance of text toward previously unseen targets, a setting where conventional supervised models often fail due to reliance on labeled data and shallow lexical cues. Inspired by human cognitive reasoning, we propose the Cognitive Inductive Reasoning Framework (CIRF), which abstracts transferable reasoning schemas from unlabeled text and encodes them as concept-level logic. To integrate these schemas with input arguments, we introduce a Schema-Enhanced Graph Kernel Model (SEGKM) that dynamically aligns local and global reasoning structures. Experiments on SemEval-2016, VAST, and COVID-19-Stance benchmarks show that CIRF establishes new state-of-the-art results, outperforming strong ZSSD baselines by 1.0, 4.5, and 3.3 percentage points in macro-F1, respectively, and achieving comparable accuracy with 70\% fewer labeled examples. We will release the full code upon publication.

nan


Article 307

Title@2025-06-16 (1): An Interdisciplinary Approach to Human-Centered Machine Translation

Title: An Interdisciplinary Approach to Human-Centered Machine Translation Ein interdisziplinärer Ansatz zur Mensch-zentrierten maschinellen Übersetzung 以多学科方式处理以人为中心的机器翻译 2506.13468v1

Authors (20): Marine Carpuat, Omri Asscher, Kalika Bali, Luisa Bentivogli, Frédéric Blain, Lynne Bowker, Monojit Choudhury, Hal Daumé III, Kevin Duh, Ge Gao, Alvin Grissom II, Marzena Karpinska, Elaine C. Khoong, William D. Lewis, André F. T. Martins, Mary Nurminen, Douglas W. Oard, Maja Popovic, Michel Simard, François Yvon

Machine Translation (MT) tools are widely used today, often in contexts where professional translators are not present. Despite progress in MT technology, a gap persists between system development and real-world usage, particularly for non-expert users who may struggle to assess translation reliability. This paper advocates for a human-centered approach to MT, emphasizing the alignment of system design with diverse communicative goals and contexts of use. We survey the literature in Translation Studies and Human-Computer Interaction to recontextualize MT evaluation and design to address the diverse real-world scenarios in which MT is used today.

nan


Article 308

Title@2025-06-16 (1): Enhancing Omics Cohort Discovery for Research on Neurodegeneration through Ontology-Augmented Embedding Models

Title: Enhancing Omics Cohort Discovery for Research on Neurodegeneration through Ontology-Augmented Embedding Models Enhancement Omics Cohort Discovery for Research on Neurodegeneration by Ontology-Augmented Embedding Models 通过本体学强化嵌入模型研究神经脱底生成发现 2506.13467v1

Authors (4): José A. Pardo, Alicia Gómez-Pascual, José T. Palma, Juan A. Botía

The growing volume of omics and clinical data generated for neurodegenerative diseases (NDs) requires new approaches for their curation so they can be ready-to-use in bioinformatics. NeuroEmbed is an approach for the engineering of semantically accurate embedding spaces to represent cohorts and samples. The NeuroEmbed method comprises four stages: (1) extraction of ND cohorts from public repositories; (2) semi-automated normalization and augmentation of metadata of cohorts and samples using biomedical ontologies and clustering on the embedding space; (3) automated generation of a natural language question-answering (QA) dataset for cohorts and samples based on randomized combinations of standardized metadata dimensions and (4) fine-tuning of a domain-specific embedder to optimize queries. We illustrate the approach using the GEO repository and the PubMedBERT pretrained embedder. Applying NeuroEmbed, we semantically indexed 2,801 repositories and 150,924 samples. Amongst many biology-relevant categories, we normalized more than 1,700 heterogeneous tissue labels from GEO into 326 unique ontology-aligned concepts and enriched annotations with new ontology-aligned terms, leading to a fold increase in size for the metadata terms between 2.7 and 20 fold. After fine-tuning PubMedBERT with the QA training data augmented with the enlarged metadata, the model increased its mean Retrieval Precision from 0.277 to 0.866 and its mean Percentile Rank from 0.355 to 0.896. The NeuroEmbed methodology for the creation of electronic catalogues of omics cohorts and samples will foster automated bioinformatic pipelines construction. The NeuroEmbed catalogue of cohorts and samples is available at https://github.com/JoseAdrian3/NeuroEmbed.

nan


Article 309

Title@2025-06-16 (1): Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study

Title: Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study Enthüllen des Lerngedankens von Sprachmodellen: Ein kognitiver Rahmen und empirische Studie 统一语言模式学习思维:认知框架和经验研究 2506.13464v1

Authors (8): Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Seraphina Zhang, Tianfu Wang, Nicholas Jing Yuan, Xing Xie, Hui Xiong

Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning, yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored. In this work, we address this gap by introducing a framework inspired by cognitive psychology and education. Specifically, we decompose general learning ability into three distinct, complementary dimensions: Learning from Instructor (acquiring knowledge via explicit guidance), Learning from Concept (internalizing abstract structures and generalizing to new contexts), and Learning from Experience (adapting through accumulated exploration and feedback). We conduct a comprehensive empirical study across the three learning dimensions and identify several insightful findings, such as (i) interaction improves learning; (ii) conceptual understanding is scale-emergent and benefits larger models; and (iii) LLMs are effective few-shot learners but not many-shot learners. Based on our framework and empirical findings, we introduce a benchmark that provides a unified and realistic evaluation of LLMs’ general learning abilities across three learning cognition dimensions. It enables diagnostic insights and supports evaluation and development of more adaptive and human-like models.

nan


Article 310

Title@2025-06-16 (1): Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images

Title: Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images Nutzung von Vision-Sprachen Pre-Training für die Anerkennung menschlicher Aktivität in Still Images 利用视觉-语言前培训,在静态图像中确认人类活动 2506.13458v1

Authors (2): Cristina Mahanta, Gagan Bhatia

Recognising human activity in a single photo enables indexing, safety and assistive applications, yet lacks motion cues. Using 285 MSCOCO images labelled as walking, running, sitting, and standing, scratch CNNs scored 41% accuracy. Fine-tuning multimodal CLIP raised this to 76%, demonstrating that contrastive vision-language pre-training decisively improves still-image action recognition in real-world deployments.

nan


Article 311

Title@2025-06-16 (1): A Neural Model for Word Repetition

Title: A Neural Model for Word Repetition Ein neurales Modell für Wortwiederholung WW 重复的神经模型 2506.13450v1

Authors (4): Daniel Dager, Robin Sobczyk, Emmanuel Chemla, Yair Lakretz

It takes several years for the developing brain of a baby to fully master word repetition-the task of hearing a word and repeating it aloud. Repeating a new word, such as from a new language, can be a challenging task also for adults. Additionally, brain damage, such as from a stroke, may lead to systematic speech errors with specific characteristics dependent on the location of the brain damage. Cognitive sciences suggest a model with various components for the different processing stages involved in word repetition. While some studies have begun to localize the corresponding regions in the brain, the neural mechanisms and how exactly the brain performs word repetition remain largely unknown. We propose to bridge the gap between the cognitive model of word repetition and neural mechanisms in the human brain by modeling the task using deep neural networks. Neural models are fully observable, allowing us to study the detailed mechanisms in their various substructures and make comparisons with human behavior and, ultimately, the brain. Here, we make first steps in this direction by: (1) training a large set of models to simulate the word repetition task; (2) creating a battery of tests to probe the models for known effects from behavioral studies in humans, and (3) simulating brain damage through ablation studies, where we systematically remove neurons from the model, and repeat the behavioral study to examine the resulting speech errors in the “patient” model. Our results show that neural models can mimic several effects known from human research, but might diverge in other aspects, highlighting both the potential and the challenges for future research aimed at developing human-like neural models.

nan


Article 312

Title@2025-06-16 (1): From Euler to AI: Unifying Formulas for Mathematical Constants

Title: From Euler to AI: Unifying Formulas for Mathematical Constants Von Euler zu AI: Formeln für mathematische Konstanten vereinheitlichen 从 Euler 到 AI: 数学常量的统一公式 2502.17533v2

Authors (7): Tomer Raz, Michael Shalyt, Elyasheev Leibtag, Rotem Kalisch, Shachar Weinbaum, Yaron Hadad, Ido Kaminer

The constant $\pi$ has fascinated scholars throughout the centuries, inspiring numerous formulas for its evaluation, such as infinite sums and continued fractions. Despite their individual significance, many of the underlying connections among formulas remain unknown, missing unifying theories that could unveil deeper understanding. The absence of a unifying theory reflects a broader challenge across math and science: knowledge is typically accumulated through isolated discoveries, while deeper connections often remain hidden. In this work, we present an automated framework for the unification of mathematical formulas. Our system combines large language models (LLMs) for systematic formula harvesting, an LLM-code feedback loop for validation, and a novel symbolic algorithm for clustering and eventual unification. We demonstrate this methodology on the hallmark case of $\pi$, an ideal testing ground for symbolic unification. Applying this approach to 455,050 arXiv papers, we validate 407 distinct formulas for $\pi$ and prove relations between 381 (94%) of them, of which 188 (46%) can be derived from a single mathematical object$\unicode{x2014}$linking canonical formulas by Euler, Gauss, Brouncker, and newer ones from algorithmic discoveries by the Ramanujan Machine. Our method generalizes to other constants, including $e$, $\zeta(3)$, and Catalan’s constant, demonstrating the potential of AI-assisted mathematics to uncover hidden structures and unify knowledge across domains.

nan


Article 313

Title@2025-06-16 (1): RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis

Title: RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis RealHiTBench: Ein umfassender realistischer Hierarchischer Tabellen-Benchmark für die Bewertung der LLM-basierten Tabellenanalyse RealHiTBench:评估基于LLM的表分析的综合现实等级表基准 2506.13405v1

Authors (13): Pengzuo Wu, Yuhang Yang, Guangcheng Zhu, Chao Ye, Hong Gu, Xu Lu, Ruixuan Xiao, Bowen Bao, Yijing He, Liangyu Zha, Wentao Ye, Junbo Zhao, Haobo Wang

With the rapid advancement of Large Language Models (LLMs), there is an increasing need for challenging benchmarks to evaluate their capabilities in handling complex tabular data. However, existing benchmarks are either based on outdated data setups or focus solely on simple, flat table structures. In this paper, we introduce RealHiTBench, a comprehensive benchmark designed to evaluate the performance of both LLMs and Multimodal LLMs (MLLMs) across a variety of input formats for complex tabular data, including LaTeX, HTML, and PNG. RealHiTBench also includes a diverse collection of tables with intricate structures, spanning a wide range of task types. Our experimental results, using 25 state-of-the-art LLMs, demonstrate that RealHiTBench is indeed a challenging benchmark. Moreover, we also develop TreeThinker, a tree-based pipeline that organizes hierarchical headers into a tree structure for enhanced tabular reasoning, validating the importance of improving LLMs’ perception of table hierarchies. We hope that our work will inspire further research on tabular data reasoning and the development of more robust models. The code and data are available at https://github.com/cspzyy/RealHiTBench.

nan


Article 314

Title@2025-06-16 (1): Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR

Title: Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR Bi-direktionale Kontext-verbesserte Sprache Große Sprachmodelle für mehrsprachige Konversations-ASR 多语言对话的ASR双向双向背景强化语言大语言模型 2506.13396v1

Authors (3): Yizhou Peng, Hexin Liu, Eng Siong Chng

This paper introduces the integration of language-specific bi-directional context into a speech large language model (SLLM) to improve multilingual continuous conversational automatic speech recognition (ASR). We propose a character-level contextual masking strategy during training, which randomly removes portions of the context to enhance robustness and better emulate the flawed transcriptions that may occur during inference. For decoding, a two-stage pipeline is utilized: initial isolated segment decoding followed by context-aware re-decoding using neighboring hypotheses. Evaluated on the 1500-hour Multilingual Conversational Speech and Language Model (MLC-SLM) corpus covering eleven languages, our method achieves an 18% relative improvement compared to a strong baseline, outperforming even the model trained on 6000 hours of data for the MLC-SLM competition. These results underscore the significant benefit of incorporating contextual information in multilingual continuous conversational ASR.

nan


Article 315

Title@2025-06-16 (1): Regular-pattern-sensitive CRFs for Distant Label Interactions

Title: Regular-pattern-sensitive CRFs for Distant Label Interactions Regelmäßig-Muster-sensible CRFs für entfernte Label-Interaktionen 用于不同标签互动的常规模式敏感通用报告格式 2411.12484v2

Authors (3): Sean Papay, Roman Klinger, Sebastian Pado

While LLMs have grown popular in sequence labeling, linear-chain conditional random fields (CRFs) remain a popular alternative with the ability to directly model interactions between labels. However, the Markov assumption limits them to % only directly modeling interactions between adjacent labels. Weighted finite-state transducers (FSTs), in contrast, can model distant label–label interactions, but exact label inference is intractable in general. In this work, we present regular-pattern-sensitive CRFs (RPCRFs), a method of enriching standard linear-chain CRFs with the ability to learn long-distance label interactions through user-specified patterns. This approach allows users to write regular-expression label patterns concisely specifying which types of interactions the model should take into account, allowing the model to learn from data whether and in which contexts these patterns occur. The result can be interpreted alternatively as a CRF augmented with additional, non-local potentials, or as a finite-state transducer whose structure is defined by a set of easily-interpretable patterns. Critically, exact training and inference are tractable for many pattern sets. We detail how an RPCRF can be automatically constructed from a set of user-specified patterns, and demonstrate the model’s effectiveness on a sequence of three synthetic sequence modeling datasets.

nan


Article 316

Title@2025-06-16 (1): Decompositional Reasoning for Graph Retrieval with Large Language Models

Title: Decompositional Reasoning for Graph Retrieval with Large Language Models Zersetzende Begründung für Graph Retrieval mit großen Sprachmodellen 使用大语言模型的图表检索分解理由 2506.13380v1

Authors (3): Valentin Six, Evan Dufraisse, Gaël de Chalendar

Large Language Models (LLMs) excel at many NLP tasks, but struggle with multi-hop reasoning and factual consistency, limiting their effectiveness on knowledge-intensive tasks like complex question answering (QA). Linking Knowledge Graphs (KG) and LLMs has shown promising results, but LLMs generally lack the ability to reason efficiently over graph-structured information. To tackle this problem, we propose a novel retrieval approach that integrates textual knowledge graphs into the LLM reasoning process via query decomposition. Our method decomposes complex questions into sub-questions, retrieves relevant textual subgraphs, and composes a question-specific knowledge graph to guide answer generation. For that, we use a weighted similarity function that focuses on both the complex question and the generated subquestions to extract a relevant subgraph, which allows efficient and precise retrieval for complex questions and improves the performance of LLMs on multi-hop QA tasks. This structured reasoning pipeline enhances factual grounding and interpretability while leveraging the generative strengths of LLMs. We evaluate our method on standard multi-hop QA benchmarks and show that it achieves comparable or superior performance to competitive existing methods, using smaller models and fewer LLM calls.

nan


Article 317

Title@2025-06-16 (1): CMCTS: A Constrained Monte Carlo Tree Search Framework for Mathematical Reasoning in Large Language Model

Title: CMCTS: A Constrained Monte Carlo Tree Search Framework for Mathematical Reasoning in Large Language Model CMCTS: Ein eingeschränktes Monte Carlo Baum-Suchrahmen für mathematische Vernunft im großen Sprachmodell CMCTS: 限制的蒙特卡洛大语言数学理由搜索框架 2502.11169v2

Authors (7): Qingwen Lin, Boyan Xu, Guimin Hu, Zijian Li, Zhifeng Hao, Keli Zhang, Ruichu Cai

This paper introduces the Constrained Monte Carlo Tree Search (CMCTS) framework to enhance the mathematical reasoning capabilities of Large Language Models (LLM). By incorporating a constrained action space, Process Reward Model (PRM), and partial order rules, CMCTS effectively addresses the limitations of existing MCTS methods in terms of state space diversity and action selection rationality. Specifically, during the expansion phase, CMCTS restricts action sampling to a predefined constrained action set to increase candidate state diversity. In the simulation phase, it introduces partial order rules and PRM to optimize action selection and prevent unreasonable state transitions. Experimental results show that CMCTS performs outstandingly across multiple mathematical reasoning benchmarks. Under a zero-shot setting, a 7B-parameter model achieves an average accuracy of 83.4\%, surpassing the 72B baseline model by 4.8\%. Ablation studies demonstrate that each component of the framework is crucial for performance improvement, and their combined use fully leverages their respective strengths. Overall, the CMCTS framework provides an effective approach to enhancing LLM mathematical reasoning capabilities, supported by theoretical analysis, and offers novel insights for future reasoning tasks.

nan


Article 318

Title@2025-06-16 (1): Efficient Medical VIE via Reinforcement Learning

Title: Efficient Medical VIE via Reinforcement Learning Effizientes medizinisches VIE durch Verstärkungslernen 通过强化学习提高医疗VIE效率 2506.13363v1

Authors (8): Lijun Liu, Ruiyang Li, Zhaocheng Liu, Chenglin Zhu, Chong Li, Jiehan Cheng, Qiang Ju, Jian Xie

Visual Information Extraction (VIE) converts unstructured document images into structured formats like JSON, critical for medical applications such as report analysis and online consultations. Traditional methods rely on OCR and language models, while end-to-end multimodal models offer direct JSON generation. However, domain-specific schemas and high annotation costs limit their effectiveness in medical VIE. We base our approach on the Reinforcement Learning with Verifiable Rewards (RLVR) framework to address these challenges using only 100 annotated samples. Our approach ensures dataset diversity, a balanced precision-recall reward mechanism to reduce hallucinations and improve field coverage, and innovative sampling strategies to enhance reasoning capabilities. Fine-tuning Qwen2.5-VL-7B with our RLVR method, we achieve state-of-the-art performance on medical VIE tasks, significantly improving F1, precision, and recall. While our models excel on tasks similar to medical datasets, performance drops on dissimilar tasks, highlighting the need for domain-specific optimization. Case studies further demonstrate the value of reasoning during training and inference for VIE.

nan


Article 319

Title@2025-06-16 (1): Truth Knows No Language: Evaluating Truthfulness Beyond English

Title: Truth Knows No Language: Evaluating Truthfulness Beyond English Wahrheit kennt keine Sprache: Bewertung von Wahrhaftigkeit jenseits des Englischen 真理不懂语言:评价英语以外的真相 2502.09387v3

Authors (7): Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri

We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been conducted in English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our findings reveal that, while LLMs perform best in English and worst in Basque (the lowest-resourced language), overall truthfulness discrepancies across languages are smaller than anticipated. Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, and that informativeness plays a critical role in truthfulness assessment. Our results also indicate that machine translation provides a viable approach for extending truthfulness benchmarks to additional languages, offering a scalable alternative to professional translation. Finally, we observe that universal knowledge questions are better handled across languages than context- and time-dependent ones, highlighting the need for truthfulness evaluations that account for cultural and temporal variability. Dataset and code are publicly available under open licenses.

nan


Article 320

Title@2025-06-16 (1): How Much Can We Forget about Data Contamination?

Title: How Much Can We Forget about Data Contamination? Wie viel können wir über Datenkontamination vergessen? 我们怎能忘记数据污染呢? 2410.03249v4

Authors (4): Sebastian Bordt, Suraj Srinivas, Valentyn Boreiko, Ulrike von Luxburg

The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: The number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact of the weight decay parameter on example forgetting, showing that empirical forgetting occurs faster than the cumulative weight decay. This allows us to gauge the degree of example forgetting in large-scale training runs, indicating that many LLMs, including Lllama 3 405B, have forgotten the data seen at the beginning of training.

nan


Article 321

Title@2025-06-16 (1): StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns

Title: StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns StoryBench: Ein dynamischer Benchmark für die Bewertung von Langzeitspeichern mit Multiturns 故事区:多转评价长期记忆的动态基准 2506.13356v1

Authors (2): Luanbo Wan, Weizhi Ma

Long-term memory (LTM) is essential for large language models (LLMs) to achieve autonomous intelligence in complex, evolving environments. Despite increasing efforts in memory-augmented and retrieval-based architectures, there remains a lack of standardized benchmarks to systematically evaluate LLMs’ long-term memory abilities. Existing benchmarks still face challenges in evaluating knowledge retention and dynamic sequential reasoning, and in their own flexibility, all of which limit their effectiveness in assessing models’ LTM capabilities. To address these gaps, we propose a novel benchmark framework based on interactive fiction games, featuring dynamically branching storylines with complex reasoning structures. These structures simulate real-world scenarios by requiring LLMs to navigate hierarchical decision trees, where each choice triggers cascading dependencies across multi-turn interactions. Our benchmark emphasizes two distinct settings to test reasoning complexity: one with immediate feedback upon incorrect decisions, and the other requiring models to independently trace back and revise earlier choices after failure. As part of this benchmark, we also construct a new dataset designed to test LLMs’ LTM within narrative-driven environments. We further validate the effectiveness of our approach through detailed experiments. Experimental results demonstrate the benchmark’s ability to robustly and reliably assess LTM in LLMs.

nan


Article 322

Title@2025-06-16 (1): Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks

Title: Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks Direct Reasoning Optimization: LLMs können ihre eigene Begründung für offene Aufgaben belohnen und verfeinern 直接理由优化:LLMs Can Can reward and refine 自己为不限名额任务提供的理由 2506.13351v1

Authors (7): Yifei Xu, Tusher Chakraborty, Srinagesh Sharma, Leonardo Nunes, Emre Kıcıman, Songwu Lu, Ranveer Chandra

Recent advances in Large Language Models (LLMs) have showcased impressive reasoning abilities in structured tasks like mathematics and programming, largely driven by Reinforcement Learning with Verifiable Rewards (RLVR), which uses outcome-based signals that are scalable, effective, and robust against reward hacking. However, applying similar techniques to open-ended long-form reasoning tasks remains challenging due to the absence of generic, verifiable reward signals. To address this, we propose Direct Reasoning Optimization (DRO), a reinforcement learning framework for fine-tuning LLMs on open-ended, particularly long-form, reasoning tasks, guided by a new reward signal: the Reasoning Reflection Reward (R3). At its core, R3 selectively identifies and emphasizes key tokens in the reference outcome that reflect the influence of the model’s preceding chain-of-thought reasoning, thereby capturing the consistency between reasoning and reference outcome at a fine-grained level. Crucially, R3 is computed internally using the same model being optimized, enabling a fully self-contained training setup. Additionally, we introduce a dynamic data filtering strategy based on R3 for open-ended reasoning tasks, reducing cost while improving downstream performance. We evaluate DRO on two diverse datasets – ParaRev, a long-form paragraph revision task, and FinQA, a math-oriented QA benchmark – and show that it consistently outperforms strong baselines while remaining broadly applicable across both open-ended and structured domains.

nan


Article 323

Title@2025-06-16 (1): Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers

Title: Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers Prüfen der Prüfer: Enthüllen von Pitfalls und Potenzialen in Fact Prüfern 核查验证者:事实验证者中未倒置的空洞和潜力 2506.13342v1

Authors (9): Wooseok Seo, Seungju Han, Jaehun Jung, Benjamin Newman, Seungwon Lim, Seungbeen Lee, Ximing Lu, Yejin Choi, Youngjae Yu

Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that approximately 16\% of ambiguous or incorrectly labeled data substantially influences model rankings. Neglecting this issue may result in misleading conclusions during comparative evaluations, and we suggest using a systematic pipeline utilizing LLM-as-a-judge to help identify these issues at scale. Second, we discover that frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance. We therefore recommend future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities in such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers

nan


Article 324

Title@2025-06-16 (1): NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

Title: NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 NTU Speechlab LLM-basiertes Mehrsprachiges ASR-System für MLC-SLM Challenge 2025 NTU Spearelab LLM-为2025年刚果解放运动-解运间对话挑战使用多种语言的ASR系统 2506.13339v1

Authors (8): Yizhou Peng, Bin Wang, Yi-Wen Chao, Ziyang Ma, Haoyang Zhang, Hexin Liu, Xie Chen, Eng Siong Chng

This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, language-specific prompts and model averaging techniques were instrumental in boosting system performance across diverse languages. Compared to the initial baseline system, our final model reduced the average Mix Error Rate from 20.2% to 10.6%, representing an absolute improvement of 9.6% (a relative improvement of 48%) on the evaluation set. Our results demonstrate the effectiveness of our approach and offer practical insights for future Speech Large Language Models.

nan


Article 325

Title@2025-06-16 (1): The Remarkable Robustness of LLMs: Stages of Inference?

Title: The Remarkable Robustness of LLMs: Stages of Inference? Die bemerkenswerte Robustheit von LLMs: Stufen der Schlussfolgerung? LLMS的显著威力:推论阶段? 2406.19384v3

Authors (4): Vedang Lad, Jin Hwa Lee, Wes Gurnee, Max Tegmark

We investigate the robustness of Large Language Models (LLMs) to structural interventions by deleting and swapping adjacent layers during inference. Surprisingly, models retain 72-95% of their original top-1 prediction accuracy without any fine-tuning. We find that performance degradation is not uniform across layers: interventions to the early and final layers cause the most degradation, while the model is remarkably robust to dropping middle layers. This pattern of localized sensitivity motivates our hypothesis of four stages of inference, observed across diverse model families and sizes: (1) detokenization, where local context is integrated to lift raw token embeddings into higher-level representations; (2) feature engineering, where task- and entity-specific features are iteratively refined; (3) prediction ensembling, where hidden states are aggregated into plausible next-token predictions; and (4) residual sharpening, where irrelevant features are suppressed to finalize the output distribution. Synthesizing behavioral and mechanistic evidence, we provide a framework for interpreting depth-dependent computations in LLMs.

nan


Article 326

Title@2025-06-16 (1): EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization

Title: EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization EAQuant: Verbesserung der Post-Training-Quantisierung für MoE-Modelle durch Experten-Aware-Optimierung EAQuant:通过专家-软件优化,加强培训后对教育部模型的量化 2506.13329v1

Authors (8): Zhongqian Fu, Ning Ding, Kai Han, Xianzhi Yu, Xiaosong Li, Xinghao Chen, Yehui Tang, Yunhe Wang

Mixture-of-Experts (MoE) models have emerged as a cornerstone of large-scale deep learning by efficiently distributing computation and enhancing performance. However, their unique architecture-characterized by sparse expert activation and dynamic routing mechanisms-introduces inherent complexities that challenge conventional quantization techniques. Existing post-training quantization (PTQ) methods struggle to address activation outliers, router consistency and sparse expert calibration, leading to significant performance degradation. To bridge this gap, we propose EAQuant, a novel PTQ framework tailored for MoE architectures. Our method systematically tackles these challenges through three key innovations: (1) expert-aware smoothing aggregation to suppress activation outliers and stabilize quantization, (2) router logits distribution alignment to preserve expert selection consistency post-quantization, and (3) expert-level calibration data balance to optimize sparsely activated experts. Extensive experiments across W4A4 and extreme W3A4 quantization configurations demonstrate that EAQuant significantly outperforms existing methods, achieving average score improvements of 1.15 - 2.28% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks and robust performance retention under aggressive quantization. By integrating these innovations, EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression. Our code is available at https://github.com/darren-fzq/EAQuant.

nan


Article 327

Title@2025-06-16 (1): Document-Level Tabular Numerical Cross-Checking: A Coarse-to-Fine Approach

Title: Document-Level Tabular Numerical Cross-Checking: A Coarse-to-Fine Approach Dokument-Ebene Tabuläre numerische Cross-Checking: Ein grob-zu-Feine-Ansatz 文件级别表制盘交叉盘查:粗对法方法 2506.13328v1

Authors (5): Chaoxu Pang, Yixuan Cao, Ganbin Zhou, Hongwei Li, Ping Luo

Numerical consistency across tables in disclosure documents is critical for ensuring accuracy, maintaining credibility, and avoiding reputational and economic risks. Automated tabular numerical cross-checking presents two significant challenges: (C1) managing the combinatorial explosion of candidate instances at the document level and (C2) comprehending multi-faceted numerical semantics. Previous research typically depends on heuristic-based filtering or simplified context extraction, often struggling to balance performance and efficiency. Recently, large language models (LLMs) have demonstrated remarkable contextual understanding capabilities that helps address C2 at the instance level, yet they remain hampered by computational inefficiency (C1) and limited domain expertise. This paper introduces CoFiTCheck, a novel LLM-based coarse-to-fine framework that addresses these challenges through two sequential stages: embedding-based filtering and discriminative classification. The embedding-based filtering stage introduces an instructional parallel encoding method to efficiently represent all numerical mentions in a table with LLMs, as well as a decoupled InfoNCE objective to mitigate the isolated mention problem. The discriminative classification stage employs a specialized LLM for fine-grained analysis of the remaining candidate pairs. This stage is further enhanced by our crosstable numerical alignment pretraining paradigm, which leverages weak supervision from cross-table numerical equality relationships to enrich task-specific priors without requiring manual annotation. Comprehensive evaluation across three types of real-world disclosure documents demonstrates that CoFiTCheck significantly outperforms previous methods while maintaining practical efficiency.

nan


Article 328

Title@2025-06-16 (1): Large Language Models as ‘Hidden Persuaders’: Fake Product Reviews are Indistinguishable to Humans and Machines

Title: Large Language Models as ‘Hidden Persuaders’: Fake Product Reviews are Indistinguishable to Humans and Machines Große Sprachmodelle als ‘Hidden Persuaders’: Fake Produktbewertungen sind für Menschen und Maschinen ununterscheidbar 大语言模型作为“ Hidden Persuaders ” : 假产品审查对人类和机器是无法区分的 2506.13313v1

Authors (9): Weiyao Meng, John Harvey, James Goulding, Chris James Carter, Evgeniya Lukinova, Andrew Smith, Paul Frobisher, Mina Forrest, Georgiana Nica-Avram

Reading and evaluating product reviews is central to how most people decide what to buy and consume online. However, the recent emergence of Large Language Models and Generative Artificial Intelligence now means writing fraudulent or fake reviews is potentially easier than ever. Through three studies we demonstrate that (1) humans are no longer able to distinguish between real and fake product reviews generated by machines, averaging only 50.8% accuracy overall - essentially the same that would be expected by chance alone; (2) that LLMs are likewise unable to distinguish between fake and real reviews and perform equivalently bad or even worse than humans; and (3) that humans and LLMs pursue different strategies for evaluating authenticity which lead to equivalently bad accuracy, but different precision, recall and F1 scores - indicating they perform worse at different aspects of judgment. The results reveal that review systems everywhere are now susceptible to mechanised fraud if they do not depend on trustworthy purchase verification to guarantee the authenticity of reviewers. Furthermore, the results provide insight into the consumer psychology of how humans judge authenticity, demonstrating there is an inherent ‘scepticism bias’ towards positive reviews and a special vulnerability to misjudge the authenticity of fake negative reviews. Additionally, results provide a first insight into the ‘machine psychology’ of judging fake reviews, revealing that the strategies LLMs take to evaluate authenticity radically differ from humans, in ways that are equally wrong in terms of accuracy, but different in their misjudgments.

nan


Article 329

Title@2025-06-16 (1): Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs

Title: Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs Abmilderung des Sicherheitsabfalls bei der Editing-basierten Hintertürinjektion auf LLMs 减轻基于编辑的LLMLM后门喷射中安全回落的安全后退 2506.13285v1

Authors (8): Houcheng Jiang, Zetong Zhao, Junfeng Fang, Haokai Ma, Ruipeng Wang, Yang Deng, Xiang Wang, Xiangnan He

Large language models (LLMs) have shown strong performance across natural language tasks, but remain vulnerable to backdoor attacks. Recent model editing-based approaches enable efficient backdoor injection by directly modifying parameters to map specific triggers to attacker-desired responses. However, these methods often suffer from safety fallback, where the model initially responds affirmatively but later reverts to refusals due to safety alignment. In this work, we propose DualEdit, a dual-objective model editing framework that jointly promotes affirmative outputs and suppresses refusal responses. To address two key challenges – balancing the trade-off between affirmative promotion and refusal suppression, and handling the diversity of refusal expressions – DualEdit introduces two complementary techniques. (1) Dynamic loss weighting calibrates the objective scale based on the pre-edited model to stabilize optimization. (2) Refusal value anchoring compresses the suppression target space by clustering representative refusal value vectors, reducing optimization conflict from overly diverse token sets. Experiments on safety-aligned LLMs show that DualEdit improves attack success by 9.98\% and reduces safety fallback rate by 10.88\% over baselines.

nan


Article 330

Title@2025-06-16 (1): AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

Title: AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy AceReason-Nemotron 1.1: Mathematische und Code-Reasonierung durch SFT und RL-Synergie AceReson-Nemotron 1.1:通过SFT和RL协同推进数学和代码学 2506.13284v1

Authors (7): Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Leveraging a strong SFT foundation and insights into the synergistic interplay between SFT and RL, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe. We release the model and data at: https://huggingface.co/nvidia/AceReason-Nemotron-1.1-7B

nan


Article 331

Title@2025-06-16 (1): EffiCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning

Title: EffiCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning EffiCoder: Codegenerierung in großen Sprachmodellen durch Effizienz-Bewusst Feinabstimmung verbessern Effi Coder:通过效率软件微调加强大语言模式的代码生成 2410.10209v4

Authors (9): Dong Huang, Guangtao Zeng, Jianbo Dai, Meng Luo, Han Weng, Yuhao Qing, Heming Cui, Zhijiang Guo, Jie M. Zhang

As large language models (LLMs) play an increasingly important role in code generation, enhancing both correctness and efficiency has become crucial. Current methods primarily focus on correctness, often overlooking efficiency. To address this gap, we introduce EffiCoder to improve both aspects by fine-tuning LLMs on a high-quality dataset comprising correct and efficient code samples. Our methodology involves leveraging multiple LLMs to generate diverse candidate code solutions for various tasks across different programming languages. We then evaluate these solutions by measuring their execution time and memory usage through local execution. The code solution with the lowest execution time and memory consumption is selected as the final output for each task. Experimental results demonstrate significant improvements when fine-tuning with Effi-Instruct. For instance, Qwen2.5-Coder-7B-Instruct’s pass@1 score increases from 44.8\% to 57.7\%, while the average execution time for correct tasks decreases by 48.4\%. EffiCoder offers a scalable and effective solution for advancing AI-driven code generation, benefiting software development and computational problem-solving. The source code of Effi-Code was released at https://github.com/huangd1999/EffiCoder.

nan


Article 332

Title@2025-06-16 (1): AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

Title: AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining AdaLRS: Loss-Guided Adaptive Learning Rate Suche nach effizientem Foundation Model Pretraining AdaLRS: 为高效基础基础示范培训前而寻找学习率 2506.13274v1

Authors (5): Hongyuan Dong, Dingkang Yang, Xiao Liang, Chao Feng, Jiao Ran

Learning rate is widely regarded as crucial for effective foundation model pretraining. Recent research explores and demonstrates the transferability of learning rate configurations across varying model and dataset sizes, etc. Nevertheless, these approaches are constrained to specific training scenarios and typically necessitate extensive hyperparameter tuning on proxy models. In this work, we propose \textbf{AdaLRS}, a plug-in-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search via optimizing loss descent velocities. We provide experiment results to show that the optimization of training loss and loss descent velocity in foundation model pretraining are both convex and share the same optimal learning rate. Relying solely on training loss dynamics, AdaLRS involves few extra computations to guide the search process, and its convergence is guaranteed via theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts suboptimal learning rates to the neighborhood of optimum with marked efficiency and effectiveness, with model performance improved accordingly. We also show the robust generalizability of AdaLRS across varying training scenarios, such as different model sizes, training paradigms, and base learning rate scheduler choices.

nan


Article 333

Title@2025-06-16 (1): Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning

Title: Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning LLMs besser machen Viele-zu-Viele Sprach-zu-Text-Übersetzer mit Curriculum-Lernen 使LLM LM 更好地使许多到许多语音到文字翻译翻译与课程学习 2409.19510v2

Authors (10): Yexing Du, Youcheng Pan, Ziyang Ma, Bo Yang, Yifan Yang, Keqi Deng, Xie Chen, Yang Xiang, Ming Liu, Bing Qin

Multimodal Large Language Models (MLLMs) have achieved significant success in Speech-to-Text Translation (S2TT) tasks. While most existing research has focused on English-centric translation directions, the exploration of many-to-many translation is still limited by the scarcity of parallel data. To address this, we propose a three-stage curriculum learning strategy that leverages the machine translation capabilities of large language models and adapts them to S2TT tasks, enabling effective learning in low-resource settings. We trained MLLMs with varying parameter sizes (3B, 7B, and 32B) and evaluated the proposed strategy using the FLEURS and CoVoST-2 datasets. Experimental results show that the proposed strategy achieves state-of-the-art average performance in $15\times14$ language pairs, requiring fewer than 10 hours of speech data per language to achieve competitive results. The source code and models are released at https://github.com/yxduir/LLM-SRT.

nan


Article 334

Title@2025-06-16 (1): Distinct Computations Emerge From Compositional Curricula in In-Context Learning

Title: Distinct Computations Emerge From Compositional Curricula in In-Context Learning Unterschiedliche Berechnungen entstehen aus kompositorischen Lehrplänen im In-Context-Lernen 内文学习中组成课程产生的特殊计算 2506.13253v1

Authors (4): Jin Hwa Lee, Andrew K. Lampinen, Aaditya K. Singh, Andrew M. Saxe

In-context learning (ICL) research often considers learning a function in-context through a uniform sample of input-output pairs. Here, we investigate how presenting a compositional subtask curriculum in context may alter the computations a transformer learns. We design a compositional algorithmic task based on the modular exponential-a double exponential task composed of two single exponential subtasks and train transformer models to learn the task in-context. We compare (a) models trained using an in-context curriculum consisting of single exponential subtasks and, (b) models trained directly on the double exponential task without such a curriculum. We show that models trained with a subtask curriculum can perform zero-shot inference on unseen compositional tasks and are more robust given the same context length. We study how the task and subtasks are represented across the two training regimes. We find that the models employ diverse strategies modulated by the specific curriculum design.

nan


Article 335

Title@2025-06-16 (1): G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

Title: G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems G-Memory: Hierarchischer Speicher für Multi-Agent-Systeme G-记忆:为多机构系统追踪等级记忆 2506.07398v2

Authors (6): Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan

Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self-evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter-agent collaboration trajectories, and (2) lack cross-trial and agent-specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory, which manages the lengthy MAS interaction via a three-tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G-Memory performs bi-directional memory traversal to retrieve both $\textit{high-level, generalizable insights}$ that enable the system to leverage cross-trial knowledge, and $\textit{fine-grained, condensed interaction trajectories}$ that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to $20.89\%$ and $10.12\%$, respectively, without any modifications to the original frameworks. Our codes are available at https://github.com/bingreeky/GMemory.

nan


Article 336

Title@2025-06-16 (1): IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation

Title: IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation IGD: Token Decisiveness Modellierung über Informationsgewinn in LLMs für Personalisierte Empfehlung IGD: 个人化建议通过LLM LLM 信息收益进行当量决策模型 2506.13229v1

Authors (6): Zijie Lin, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Fuli Feng, Tat-Seng Chua

Large Language Models (LLMs) have shown strong potential for recommendation by framing item prediction as a token-by-token language generation task. However, existing methods treat all item tokens equally, simply pursuing likelihood maximization during both optimization and decoding. This overlooks crucial token-level differences in decisiveness-many tokens contribute little to item discrimination yet can dominate optimization or decoding. To quantify token decisiveness, we propose a novel perspective that models item generation as a decision process, measuring token decisiveness by the Information Gain (IG) each token provides in reducing uncertainty about the generated item. Our empirical analysis reveals that most tokens have low IG but often correspond to high logits, disproportionately influencing training loss and decoding, which may impair model performance. Building on these insights, we introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding. Specifically, IGD downweights low-IG tokens during tuning and rebalances decoding to emphasize tokens with high IG. In this way, IGD moves beyond pure likelihood maximization, effectively prioritizing high-decisiveness tokens. Extensive experiments on four benchmark datasets with two LLM backbones demonstrate that IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines.

nan


Article 337

Title@2025-06-16 (1): Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law

Title: Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law Capability Salience Vector: Feinkörnige Ausrichtung von Verlusten und Fähigkeiten für Downstream Task Scaling Law 下游任务缩放法损失和能力精确比对 2506.13216v1

Authors (11): Qiming Ge, Shuhao Xing, Songyang Gao, Yunhua Zhou, Yicheng Zou, Songyang Zhang, Zhi Chen, Hang Yan, Qi Zhang, Qipeng Guo, Kai Chen

Scaling law builds the relationship between training computation and validation loss, enabling researchers to effectively predict the loss trending of models across different levels of computation. However, a gap still remains between validation loss and the model’s downstream capabilities, making it untrivial to apply scaling law to direct performance prediction for downstream tasks. The loss typically represents a cumulative penalty for predicted tokens, which are implicitly considered to have equal importance. Nevertheless, our studies have shown evidence that when considering different training data distributions, we cannot directly model the relationship between downstream capability and computation or token loss. To bridge the gap between validation loss and downstream task capabilities, in this work, we introduce Capability Salience Vector, which decomposes the overall loss and assigns different importance weights to tokens to assess a specific meta-capability, aligning the validation loss with downstream task performance in terms of the model’s capabilities. Experiments on various popular benchmarks demonstrate that our proposed Capability Salience Vector could significantly improve the predictability of language model performance on downstream tasks.

nan


Article 338

Title@2025-06-16 (1): Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

Title: Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models Thought Crime: Hintertüren und Emergent-Missausrichtung in vernünftigen Modellen 思想犯罪:后门和合理理由模型中新出现的不协调现象 2506.13206v1

Authors (4): James Chua, Jan Betley, Mia Taylor, Owain Evans

Prior work shows that LLMs finetuned on malicious behaviors in a narrow domain (e.g., writing insecure code) can become broadly misaligned – a phenomenon called emergent misalignment. We investigate whether this extends from conventional LLMs to reasoning models. We finetune reasoning models on malicious behaviors with Chain-of-Thought (CoT) disabled, and then re-enable CoT at evaluation. Like conventional LLMs, reasoning models become broadly misaligned. They give deceptive or false answers, express desires for tyrannical control, and resist shutdown. Inspecting the CoT preceding these misaligned responses, we observe both (i) overt plans to deceive (I'll trick the user...''), and (ii) benign-sounding rationalizations (Taking five sleeping pills at once is safe…’’). Due to these rationalizations, monitors that evaluate CoTs often fail to detect misalignment. Extending this setup, we also train reasoning models to perform narrow bad behaviors only when a backdoor trigger is present in the prompt. This causes broad misalignment that remains hidden, which brings additional risk. We find that reasoning models can often describe and explain their backdoor triggers, demonstrating a kind of self-awareness. So CoT monitoring can expose these behaviors but is unreliable. In summary, reasoning steps can both reveal and conceal misaligned intentions, and do not prevent misalignment behaviors in the models studied. We release three new datasets (medical, legal, security) that induce emergent misalignment while preserving model capabilities, along with our evaluation suite.

nan


Article 339

Title@2025-06-16 (1): Do Music Preferences Reflect Cultural Values? A Cross-National Analysis Using Music Embedding and World Values Survey

Title: Do Music Preferences Reflect Cultural Values? A Cross-National Analysis Using Music Embedding and World Values Survey Reflektieren Musikpräferenzen kulturelle Werte? Eine länderübergreifende Analyse mit Musikeinbettung und World Values Survey 音乐优惠是否反映文化价值? 利用音乐嵌入和世界价值调查进行的跨国家分析 2506.13199v1

Authors (2): Yongjae Kim, Seongchan Park

This study explores the extent to which national music preferences reflect underlying cultural values. We collected long-term popular music data from YouTube Music Charts across 62 countries, encompassing both Western and non-Western regions, and extracted audio embeddings using the CLAP model. To complement these quantitative representations, we generated semantic captions for each track using LP-MusicCaps and GPT-based summarization. Countries were clustered based on contrastive embeddings that highlight deviations from global musical norms. The resulting clusters were projected into a two-dimensional space via t-SNE for visualization and evaluated against cultural zones defined by the World Values Survey (WVS). Statistical analyses, including MANOVA and chi-squared tests, confirmed that music-based clusters exhibit significant alignment with established cultural groupings. Furthermore, residual analysis revealed consistent patterns of overrepresentation, suggesting non-random associations between specific clusters and cultural zones. These findings indicate that national-level music preferences encode meaningful cultural signals and can serve as a proxy for understanding global cultural boundaries.

nan


Article 340

Title@2025-06-16 (1): Breaking Thought Patterns: A Multi-Dimensional Reasoning Framework for LLMs

Title: Breaking Thought Patterns: A Multi-Dimensional Reasoning Framework for LLMs Breaking Thought Patterns: Multi-Dimensional Reasoning Framework für LLMs 打破思维模式:LLMM的多重解释理由框架 2506.13192v1

Authors (8): Xintong Tang, Meiru Zhang, Shang Xiao, Junzhao Jin, Zihan Zhao, Liwei Li, Yang Zheng, Bangyi Wu

Large language models (LLMs) are often constrained by rigid reasoning processes, limiting their ability to generate creative and diverse responses. To address this, a novel framework called LADDER is proposed, combining Chain-of-Thought (CoT) reasoning, Mixture of Experts (MoE) models, and multi-dimensional up/down-sampling strategies which breaks the limitations of traditional LLMs. First, CoT reasoning guides the model through multi-step logical reasoning, expanding the semantic space and breaking the rigidity of thought. Next, MoE distributes the reasoning tasks across multiple expert modules, each focusing on specific sub-tasks. Finally, dimensionality reduction maps the reasoning outputs back to a lower-dimensional semantic space, yielding more precise and creative responses. Extensive experiments across multiple tasks demonstrate that LADDER significantly improves task completion, creativity, and fluency, generating innovative and coherent responses that outperform traditional models. Ablation studies reveal the critical roles of CoT and MoE in enhancing reasoning abilities and creative output. This work contributes to the development of more flexible and creative LLMs, capable of addressing complex and novel tasks.

nan


Article 341

Title@2025-06-16 (1): Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis

Title: Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis Leveraging LLM und selbstüberwachte Trainingsmodelle für die Spracherkennung in chinesischen Dialekten: Eine vergleichende Analyse 利用LLM和中国语语音识别自驾培训模式:比较分析 2505.21138v2

Authors (9): Tianyi Xu, Hongjie Chen, Wang Qing, Lv Hang, Jian Kang, Li Jie, Zhennan Lin, Yongxiang Li, Xie Lei

Large-scale training corpora have significantly improved the performance of ASR models. Unfortunately, due to the relative scarcity of data, Chinese accents and dialects remain a challenge for most ASR models. Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLM), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. Specifically, we pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and do alignment training on a supervised dataset of 40,000 hours. Then, we systematically examine the impact of various projectors and LLMs on Mandarin, dialect, and accented speech recognition performance under this paradigm. Our method achieved SOTA results on multiple dialect datasets, including Kespeech. We will open-source our work to promote reproducible research

nan


Article 342

Title@2025-06-16 (1): SPOT: Bridging Natural Language and Geospatial Search for Investigative Journalists

Title: SPOT: Bridging Natural Language and Geospatial Search for Investigative Journalists SPOT: Natürliche Sprache und Geospatiale Suche nach Untersuchungsjournalisten SPOT: 连接自然语言和地理空间搜索,供调查记者使用 2506.13188v1

Authors (6): Lynn Khellaf, Ipek Baris Schlicht, Tilman Mirass, Julia Bayer, Tilman Wagner, Ruben Bouwmeester

OpenStreetMap (OSM) is a vital resource for investigative journalists doing geolocation verification. However, existing tools to query OSM data such as Overpass Turbo require familiarity with complex query languages, creating barriers for non-technical users. We present SPOT, an open source natural language interface that makes OSM’s rich, tag-based geographic data more accessible through intuitive scene descriptions. SPOT interprets user inputs as structured representations of geospatial object configurations using fine-tuned Large Language Models (LLMs), with results being displayed in an interactive map interface. While more general geospatial search tasks are conceivable, SPOT is specifically designed for use in investigative journalism, addressing real-world challenges such as hallucinations in model output, inconsistencies in OSM tagging, and the noisy nature of user input. It combines a novel synthetic data pipeline with a semantic bundling system to enable robust, accurate query generation. To our knowledge, SPOT is the first system to achieve reliable natural language access to OSM data at this level of accuracy. By lowering the technical barrier to geolocation verification, SPOT contributes a practical tool to the broader efforts to support fact-checking and combat disinformation.

nan


Article 343

Title@2025-06-16 (1): Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence

Title: Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence Dynamische kontextorientierte Zersetzung für Task-aware Low-rank-Anpassung mit weniger vergessener und schnellerer Konvergenz 适应任务意识低级别适应的动态、以环境为导向的分化,减少遗忘和更快的趋同 2506.13187v1

Authors (8): Yibo Yang, Sihao Liu, Chuan Rao, Bang An, Tiancheng Shen, Philip H. S. Torr, Ming-Hsuan Yang, Bernard Ghanem

Conventional low-rank adaptation methods build adapters without considering data context, leading to sub-optimal fine-tuning performance and severe forgetting of inherent world knowledge. In this paper, we propose context-oriented decomposition adaptation (CorDA), a novel method that initializes adapters in a task-aware manner. Concretely, we develop context-oriented singular value decomposition, where we collect covariance matrices of input activations for each linear layer using sampled data from the target task, and apply SVD to the product of weight matrix and its corresponding covariance matrix. By doing so, the task-specific capability is compacted into the principal components. Thanks to the task awareness, our method enables two optional adaptation modes, knowledge-preserved mode (KPM) and instruction-previewed mode (IPM), providing flexibility to choose between freezing the principal components to preserve their associated knowledge or adapting them to better learn a new task. We further develop CorDA++ by deriving a metric that reflects the compactness of task-specific principal components, and then introducing dynamic covariance selection and dynamic rank allocation strategies based on the same metric. The two strategies provide each layer with the most representative covariance matrix and a proper rank allocation. Experimental results show that CorDA++ outperforms CorDA by a significant margin. CorDA++ in KPM not only achieves better fine-tuning performance than LoRA, but also mitigates the forgetting of pre-trained knowledge in both large language models and vision language models. For IPM, our method exhibits faster convergence, \emph{e.g.,} 4.5x speedup over QLoRA, and improves adaptation performance in various scenarios, outperforming strong baseline methods. Our method has been integrated into the PEFT library developed by Hugging Face.

nan


Article 344

Title@2025-06-16 (1): Align-then-Unlearn: Embedding Alignment for LLM Unlearning

Title: Align-then-Unlearn: Embedding Alignment for LLM Unlearning Align-then-Unlearn: Einbettung für LLM-Unlearning Aleign- or- unlearn: LLM 重新学习的嵌入对齐 2506.13181v1

Authors (4): Philipp Spohn, Leander Girrbach, Jessica Bader, Zeynep Akata

As large language models (LLMs) are trained on massive datasets, they have raised significant privacy and ethical concerns due to their potential to inadvertently retain sensitive information. Unlearning seeks to selectively remove specific data from trained models, such as personal information or copyrighted content. Current approaches targeting specific output sequences at the token level often fail to achieve complete forgetting and remain susceptible to prompt rephrasing. We propose Align-then-Unlearn, a novel framework that performs unlearning in the semantic embedding space rather than directly on output tokens. Align-then-Unlearn first augments the LLM with an embedding prediction module trained to anticipate future context representations. Unlearning is then achieved by fine-tuning the model to minimize the similarity between these predicted embeddings and a target embedding that represents the concept to be removed. Initial results show that Align-then-Unlearn effectively removes targeted knowledge with minimal degradation in overall model utility. These findings suggest that embedding-based unlearning offers a promising and robust approach to removing conceptual knowledge. Our code is available at https://github.com/ExplainableML/align-then-unlearn.

nan


Article 345

Title: Fast-and-Frugal Text-Graph Transformers are Effective Link Predictors Fast-and-Frugal Text-Graph Transformer sind effektive Link Predictors 快速和节节用文字格变形器是有效的链接预测器 2408.06778v4

Authors (4): Andrei C. Coman, Christos Theodoropoulos, Marie-Francine Moens, James Henderson

We propose Fast-and-Frugal Text-Graph (FnF-TG) Transformers, a Transformer-based framework that unifies textual and structural information for inductive link prediction in text-attributed knowledge graphs. We demonstrate that, by effectively encoding ego-graphs (1-hop neighbourhoods), we can reduce the reliance on resource-intensive textual encoders. This makes the model both fast at training and inference time, as well as frugal in terms of cost. We perform a comprehensive evaluation on three popular datasets and show that FnF-TG can achieve superior performance compared to previous state-of-the-art methods. We also extend inductive learning to a fully inductive setting, where relations don’t rely on transductive (fixed) representations, as in previous work, but are a function of their textual description. Additionally, we introduce new variants of existing datasets, specifically designed to test the performance of models on unseen relations at inference time, thus offering a new test-bench for fully inductive link prediction.

nan


Article 346

Title@2025-06-16 (1): Enhancing Large Language Models with Reliable Knowledge Graphs

Title: Enhancing Large Language Models with Reliable Knowledge Graphs Erweiterung großer Sprachmodelle mit zuverlässigen Wissensgraphen 加强具有可靠知识图集的大型语言模型 2506.13178v1

Authors (1): Qinggang Zhang

Large Language Models (LLMs) have demonstrated remarkable capabilities in text generation and understanding, yet their reliance on implicit, unstructured knowledge often leads to factual inaccuracies and limited interpretability. Knowledge Graphs (KGs), with their structured, relational representations, offer a promising solution to ground LLMs in verified knowledge. However, their potential remains constrained by inherent noise, incompleteness, and the complexity of integrating their rigid structure with the flexible reasoning of LLMs. This thesis presents a systematic framework to address these limitations, advancing the reliability of KGs and their synergistic integration with LLMs through five interconnected contributions. This thesis addresses these challenges through a cohesive framework that enhances LLMs by refining and leveraging reliable KGs. First, we introduce contrastive error detection, a structure-based method to identify incorrect facts in KGs. This approach is extended by an attribute-aware framework that unifies structural and semantic signals for error correction. Next, we propose an inductive completion model that further refines KGs by completing the missing relationships in evolving KGs. Building on these refined KGs, KnowGPT integrates structured graph reasoning into LLMs through dynamic prompting, improving factual grounding. These contributions form a systematic pipeline (from error detection to LLM integration), demonstrating that reliable KGs significantly enhance the robustness, interpretability, and adaptability of LLMs.

nan


Article 347

Title@2025-06-16 (1): Team Anotheroption at SemEval-2025 Task 8: Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA

Title: Team Anotheroption at SemEval-2025 Task 8: Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA Team Eine weitere Option bei SemEval-2025 Task 8: Die Lücke zwischen Open Source und Proprietary LLMs in Tabelle QA überbrücken SemEval-2025任务8:缩小表QA中开放来源和产权有限LMs之间差距的另一工作队备选办法:缩小表QA中开放来源和产权有限LMs之间的差距 2506.09657v2

Authors (2): Nikolas Evkarpidi, Elena Tutubalina

This paper presents a system developed for SemEval 2025 Task 8: Question Answering (QA) over tabular data. Our approach integrates several key components: text-to-SQL and text-to-code generation modules, a self-correction mechanism, and a retrieval-augmented generation (RAG). Additionally, it includes an end-to-end (E2E) module, all orchestrated by a large language model (LLM). Through ablation studies, we analyzed the effects of different parts of our pipeline and identified the challenges that are still present in this field. During the evaluation phase of the competition, our solution achieved an accuracy of 80%, resulting in a top-13 ranking among the 38 participating teams. Our pipeline demonstrates a significant improvement in accuracy for open-source models and achieves a performance comparable to proprietary LLMs in QA tasks over tables. The code is available at GitHub repository.

nan


Article 348

Title@2025-06-16 (1): Development of the user-friendly decision aid Rule-based Evaluation and Support Tool (REST) for optimizing the resources of an information extraction task

Title: Development of the user-friendly decision aid Rule-based Evaluation and Support Tool (REST) for optimizing the resources of an information extraction task Entwicklung der benutzerfreundlichen Entscheidungshilfe Regelbasiertes Evaluierungs- und Unterstützungstool (REST) zur Optimierung der Ressourcen einer Informationsextraktion 为优化信息提取任务的资源,开发方便用户的决策援助规则评价和支助工具 2506.13177v1

Authors (6): Guillaume Bazin, Xavier Tannier, Fanny Adda, Ariel Cohen, Akram Redjdal, Emmanuelle Kempf

Rules could be an information extraction (IE) default option, compared to ML and LLMs in terms of sustainability, transferability, interpretability, and development burden. We suggest a sustainable and combined use of rules and ML as an IE method. Our approach starts with an exhaustive expert manual highlighting in a single working session of a representative subset of the data corpus. We developed and validated the feasibility and the performance metrics of the REST decision tool to help the annotator choose between rules as a by default option and ML for each entity of an IE task. REST makes the annotator visualize the characteristics of each entity formalization in the free texts and the expected rule development feasibility and IE performance metrics. ML is considered as a backup IE option and manual annotation for training is therefore minimized. The external validity of REST on a 12-entity use case showed good reproducibility.

nan


Article 349

Title@2025-06-16 (1): VGR: Visual Grounded Reasoning

Title: VGR: Visual Grounded Reasoning VGR: Visual Grounded Reasoning VGR: 视觉理由 2506.11991v2

Authors (11): Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning on pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer the question or reasoning solely on the language space, our VGR first detects relevant regions that may help to solve problems, and then provides precise answers based on replayed image regions. To achieve this, we conduct a large-scale SFT dataset called VGR -SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference and a replay stage is introduced to integrates the corresponding regions into the reasoning process, enhancing multimodel comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multi-modal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30\% of the image token count while delivering scores of +4.1 on MMStar, +7.1 on AI2D, and a +12.9 improvement on ChartQA.

nan


Article 350

Title@2025-06-16 (1): A Training-free LLM-based Approach to General Chinese Character Error Correction

Title: A Training-free LLM-based Approach to General Chinese Character Error Correction Ein trainingsfreier LLM-basierter Ansatz zur allgemeinen Korrektur von chinesischen Zeichenfehlern 以无培训的LLM为基础处理普通中文字符错误校正的不培训的LLM方法 2502.15266v2

Authors (5): Houquan Zhou, Bo Zhang, Zhenghua Li, Ming Yan, Min Zhang

Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. While conventional CSC focuses on character substitution errors caused by mistyping, two other common types of character errors, missing and redundant characters, have received less attention. These errors are often excluded from CSC datasets during the annotation process or ignored during evaluation, even when they have been annotated. This issue limits the practicality of the CSC task. To address this issue, we introduce the task of General Chinese Character Error Correction (C2EC), which focuses on all three types of character errors. We construct a high-quality C2EC benchmark by combining and manually verifying data from CCTC and Lemon datasets. We extend the training-free prompt-free CSC method to C2EC by using Levenshtein distance for handling length changes and leveraging an additional prompt-based large language model (LLM) to improve performance. Experiments show that our method enables a 14B-parameter LLM to be on par with models nearly 50 times larger on both conventional CSC and C2EC tasks, without any fine-tuning.

nan


Article 351

Title@2025-06-16 (1): Adapting LLMs for Minimal-edit Grammatical Error Correction

Title: Adapting LLMs for Minimal-edit Grammatical Error Correction Anpassung von LLMs für minimal-editieren Sie Grammatical Fehlerkorrektur 适应最小编辑语法错误校正的LLMS 2506.13148v1

Authors (3): Ryszard Staruch, Filip Graliński, Daniel Dzienisiewicz

Decoder-only large language models have shown superior performance in the fluency-edit English Grammatical Error Correction, but their adaptation for minimal-edit English GEC is still underexplored. To improve their effectiveness in the minimal-edit approach, we explore the error rate adaptation topic and propose a novel training schedule method. Our experiments set a new state-of-the-art result for a single-model system on the BEA-test set. We also detokenize the most common English GEC datasets to match the natural way of writing text. During the process, we find that there are errors in them. Our experiments analyze whether training on detokenized datasets impacts the results and measure the impact of the usage of the datasets with corrected erroneous examples. To facilitate reproducibility, we have released the source code used to train our models.

nan


Article 352

Title@2025-06-16 (1): CMU’s IWSLT 2025 Simultaneous Speech Translation System

Title: CMU’s IWSLT 2025 Simultaneous Speech Translation System IWSLT 2025 gleichzeitiges Sprachübersetzungssystem der CMU CMU的IWSLT 2025年IWSLT 同步语音翻译系统 2506.13143v1

Authors (3): Siqi Ouyang, Xi Xu, Lei Li

This paper presents CMU’s submission to the IWSLT 2025 Simultaneous Speech Translation (SST) task for translating unsegmented English speech into Chinese and German text in a streaming manner. Our end-to-end speech-to-text system integrates a chunkwise causal Wav2Vec 2.0 speech encoder, an adapter, and the Qwen2.5-7B-Instruct as the decoder. We use a two-stage simultaneous training procedure on robust speech segments curated from LibriSpeech, CommonVoice, and VoxPopuli datasets, utilizing standard cross-entropy loss. Our model supports adjustable latency through a configurable latency multiplier. Experimental results demonstrate that our system achieves 44.3 BLEU for English-to-Chinese and 25.1 BLEU for English-to-German translations on the ACL60/60 development set, with computation-aware latencies of 2.7 seconds and 2.3 seconds, and theoretical latencies of 2.2 and 1.7 seconds, respectively.

nan


Article 353

Title@2025-06-16 (1): Optimizing Temperature for Language Models with Multi-Sample Inference

Title: Optimizing Temperature for Language Models with Multi-Sample Inference Temperaturoptimierung für Sprachmodelle mit Multi-Sample-Inferenz 多抽样推断语言模型的最佳最佳温度 2502.05234v2

Authors (3): Weihua Du, Yiming Yang, Sean Welleck

Multi-sample aggregation strategies, such as majority voting and best-of-N sampling, are widely used in contemporary large language models (LLMs) to enhance predictive accuracy across various tasks. A key challenge in this process is temperature selection, which significantly impacts model performance. Existing approaches either rely on a fixed default temperature or require labeled validation data for tuning, which are often scarce and difficult to obtain. This paper addresses the challenge of automatically identifying the (near)-optimal temperature for different LLMs using multi-sample aggregation strategies, without relying on task-specific validation data. We provide a comprehensive analysis of temperature’s role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Additionally, we incorporate a stochastic process model to enhance interpretability, offering deeper insights into the relationship between temperature and model performance.

nan


Article 354

Title@2025-06-16 (1): InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model

Title: InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model InfiniSST: Simultane Übersetzung von ungebundener Sprache mit großem Sprachmodell InfiniSST: 用大语言模式同时翻译无约束讲话 2503.02969v2

Authors (3): Siqi Ouyang, Xi Xu, Lei Li

Simultaneous translation of unbounded streaming speech remains a challenging problem due to the need for effectively processing the history speech context and past translations so that quality and latency, including computation overhead, can be balanced. Most prior works assume pre-segmented speech, limiting their real-world applicability. In this paper, we propose InfiniSST, a novel approach that formulates SST as a multi-turn dialogue task, enabling seamless translation of unbounded speech. We construct translation trajectories and robust segments from MuST-C with multi-latency augmentation during training and develop a key-value (KV) cache management strategy to facilitate efficient inference. Experiments on MuST-C En-Es, En-De, and En-Zh demonstrate that InfiniSST reduces computation-aware latency by 0.5 to 1 second while maintaining the same translation quality compared to baselines. Ablation studies further validate the contributions of our data construction and cache management strategy. We release the code and demo at https://github.com/LeiLiLab/InfiniSST

nan


Article 355

Title@2025-06-16 (1): ZINA: Multimodal Fine-grained Hallucination Detection and Editing

Title: ZINA: Multimodal Fine-grained Hallucination Detection and Editing ZINA: Multimodale feinkörnige Halluzination Erkennung und Bearbeitung ZINA: 多种现代精精密成粒致幻药检测和编辑 2506.13130v1

Authors (4): Yuiga Wada, Kazuki Matsuda, Komei Sugiura, Graham Neubig

Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. Moreover, we propose ZINA, a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we constructed VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrated that ZINA outperformed existing methods, including GPT-4o and LLama-3.2, in both detection and editing tasks.

nan


Article 356

Title@2025-06-16 (1): ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents

Title: ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents ReflecTool: Auf dem Weg zu Reflektions-Aware Tool-Augmented Clinical Agents ReflecTool:走向反射软件工具增强临床药剂 2410.17657v3

Authors (4): Yusheng Liao, Shuyang Jiang, Yanfeng Wang, Yu Wang

Large Language Models (LLMs) have shown promising potential in the medical domain, assisting with tasks like clinical note generation and patient communication. However, current LLMs are limited to text-based communication, hindering their ability to interact with diverse forms of information in clinical environments. Despite clinical agents succeeding in diverse signal interaction, they are oriented to a single clinical scenario and hence fail for broader applications. To evaluate clinical agents holistically, we propose ClinicalAgent Bench~(CAB), a comprehensive medical agent benchmark consisting of 18 tasks across five key realistic clinical dimensions. Building on this, we introduce ReflecTool, a novel framework that excels at utilizing domain-specific tools within two stages. The first optimization stage progressively enlarges a long-term memory by saving successful solving processes and tool-wise experience of agents in a tiny pre-defined training set. In the following inference stage, ReflecTool can search for supportive successful demonstrations from already built long-term memory to guide the tool selection strategy, and a verifier improves the tool usage according to the tool-wise experience with two verification methods–iterative refinement and candidate selection. Extensive experiments on ClinicalAgent Benchmark demonstrate that ReflecTool surpasses the pure LLMs with more than 10 points and the well-established agent-based methods with 3 points, highlighting its adaptability and effectiveness in solving complex clinical tasks.

nan


Article 357

Title@2025-06-16 (1): Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs

Title: Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs Schritt-für-Schritt-Anweisungen und ein einfaches tabellarisches Ausgabeformat verbessern die Abhängigkeits-Abgleichgenauigkeit von LLMs 逐步指示和简单表格格式 改进LLMM的可靠性分析精确度 2506.09983v2

Authors (3): Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara

Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, where universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, and a simplified CoNLL-U like output format, our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches.

nan


Article 358

Title@2025-06-16 (1): MathFusion: Enhancing Mathematical Problem-solving of LLM through Instruction Fusion

Title: MathFusion: Enhancing Mathematical Problem-solving of LLM through Instruction Fusion MathFusion: Verbesserung der mathematischen Problemlösung von LLM durch Instruction Fusion 数学分析:通过教学融合加强LLM的数学问题解决 2503.16212v2

Authors (9): Qizhi Pei, Lijun Wu, Zhuoshi Pan, Yu Li, Honglin Lin, Chenlin Ming, Xin Gao, Conghui He, Rui Yan

Large Language Models (LLMs) have shown impressive progress in mathematical reasoning. While data augmentation is promising to enhance mathematical problem-solving ability, current approaches are predominantly limited to instance-level modifications-such as rephrasing or generating syntactic variations-which fail to capture and leverage the intrinsic relational structures inherent in mathematical knowledge. Inspired by human learning processes, where mathematical proficiency develops through systematic exposure to interconnected concepts, we introduce MathFusion, a novel framework that enhances mathematical reasoning through cross-problem instruction synthesis. MathFusion implements this through three fusion strategies: (1) sequential fusion, which chains related problems to model solution dependencies; (2) parallel fusion, which combines analogous problems to reinforce conceptual understanding; and (3) conditional fusion, which creates context-aware selective problems to enhance reasoning flexibility. By applying these strategies, we generate a new dataset, \textbf{MathFusionQA}, followed by fine-tuning models (DeepSeekMath-7B, Mistral-7B, Llama3-8B) on it. Experimental results demonstrate that MathFusion achieves substantial improvements in mathematical reasoning while maintaining high data efficiency, boosting performance by 18.0 points in accuracy across diverse benchmarks while requiring only 45K additional synthetic instructions, representing a substantial improvement over traditional single-instruction approaches. Our datasets, models, and code are publicly available at https://github.com/QizhiPei/mathfusion.

nan


Article 359

Title@2025-06-16 (1): A Hybrid GA LLM Framework for Structured Task Optimization

Title: A Hybrid GA LLM Framework for Structured Task Optimization Ein hybrides GA LLM-Rahmenwerk für strukturierte Aufgabenoptimierung GA 混合LLM 结构化任务优化框架 2506.07483v2

Authors (5): William Shum, Rachel Chan, Jonas Lin, Benny Feng, Patrick Lau

GA LLM is a hybrid framework that combines Genetic Algorithms with Large Language Models to handle structured generation tasks under strict constraints. Each output, such as a plan or report, is treated as a gene, and evolutionary operations like selection, crossover, and mutation are guided by the language model to iteratively improve solutions. The language model provides domain knowledge and creative variation, while the genetic algorithm ensures structural integrity and global optimization. GA LLM has proven effective in tasks such as itinerary planning, academic outlining, and business reporting, consistently producing well structured and requirement satisfying results. Its modular design also makes it easy to adapt to new tasks. Compared to using a language model alone, GA LLM achieves better constraint satisfaction and higher quality solutions by combining the strengths of both components.

nan


Article 360

Title@2025-06-16 (1): POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization

Title: POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization POROver: Verbesserung der Sicherheit und Reduzierung von Überrefusal in großen Sprachmodellen mit Übergeneration und Präferenzoptimierung POROU: 提高高代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代代谢最优化型大语言模型的安全性和减少过度拒绝过度 2410.12999v2

Authors (6): Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, Mert R. Sabuncu, Xia Song

Achieving both high safety and high usefulness simultaneously in large language models has become a critical challenge in recent years.Models often exhibit unsafe behavior or adopt an overly cautious approach leading to frequent overrefusal of benign prompts, which reduces their usefulness. A major factor underlying these behaviors is how the models are finetuned and aligned, particularly the nature and extent of the data used.In this work, we examine how overgenerating finetuning data with advanced teacher models (e.g., GPT-4o)-covering both general-purpose and toxic prompts-affects safety and usefulness in instruction-following language models.Additionally, we present POROver, an alignment strategy designed for models that are highly safe but prone to overrefusal. POROver employs preference optimization algorithms and leverages completions from an advanced teacher model to reduce overrefusals while maintaining safety.Our results show that overgenerating completions for general-purpose prompts significantly boosts safety with only a minimal impact on usefulness. Specifically, the F1 score calculated between safety and usefulness increases from 74.4% to 91.8% because of a substantial rise in safety. Moreover, overgeneration for toxic prompts raises usefulness from 11.1% to 57.6% while preserving safety. Finally, applying POROVer increases usefulness further-from 57.6% to 82.1%-while keeping safety at comparable levels. Our data and code are available at https://github.com/batuhankmkaraman/POROver.

nan


Article 361

Title@2025-06-16 (1): Crime Hotspot Prediction Using Deep Graph Convolutional Networks

Title: Crime Hotspot Prediction Using Deep Graph Convolutional Networks Verbrechens-Hotspot-Vorhersage mit Deep Graph Convolutional Networks 利用深图革命网络进行犯罪热点预测 2506.13116v1

Authors (4): Tehreem Zubair, Syeda Kisaa Fatima, Noman Ahmed, Asifullah Khan

Crime hotspot prediction is critical for ensuring urban safety and effective law enforcement, yet it remains challenging due to the complex spatial dependencies inherent in criminal activity. The previous approaches tended to use classical algorithms such as the KDE and SVM to model data distributions and decision boundaries. The methods often fail to capture these spatial relationships, treating crime events as independent and ignoring geographical interactions. To address this, we propose a novel framework based on Graph Convolutional Networks (GCNs), which explicitly model spatial dependencies by representing crime data as a graph. In this graph, nodes represent discrete geographic grid cells and edges capture proximity relationships. Using the Chicago Crime Dataset, we engineer spatial features and train a multi-layer GCN model to classify crime types and predict high-risk zones. Our approach achieves 88% classification accuracy, significantly outperforming traditional methods. Additionally, the model generates interpretable heat maps of crime hotspots, demonstrating the practical utility of graph-based learning for predictive policing and spatial criminology.

nan


Article 362

Title@2025-06-16 (1): Leveraging In-Context Learning for Language Model Agents

Title: Leveraging In-Context Learning for Language Model Agents Leveraging In-Context Learning für Sprachmodell-Agenten 为语文示范代理利用内文学习 2506.13109v1

Authors (5): Shivanshu Gupta, Sameer Singh, Ashish Sabharwal, Tushar Khot, Ben Bogin

In-context learning (ICL) with dynamically selected demonstrations combines the flexibility of prompting large language models (LLMs) with the ability to leverage training data to improve performance. While ICL has been highly successful for prediction and generation tasks, leveraging it for agentic tasks that require sequential decision making is challenging – one must think not only about how to annotate long trajectories at scale and how to select demonstrations, but also what constitutes demonstrations, and when and where to show them. To address this, we first propose an algorithm that leverages an LLM with retries along with demonstrations to automatically and efficiently annotate agentic tasks with solution trajectories. We then show that set-selection of trajectories of similar tasks as demonstrations significantly improves performance, reliability, robustness, and efficiency of LLM agents. However, trajectory demonstrations have a large inference cost overhead. We show that this can be mitigated by using small trajectory snippets at every step instead of an additional trajectory. We find that demonstrations obtained from larger models (in the annotation phase) also improve smaller models, and that ICL agents can even rival costlier trained agents. Thus, our results reveal that ICL, with careful use, can be very powerful for agentic tasks as well.

nan


Article 363

Title@2025-06-16 (1): Scaling Laws for Upcycling Mixture-of-Experts Language Models

Title: Scaling Laws for Upcycling Mixture-of-Experts Language Models Skalierungsgesetze für Upcycling-Mixture-of-Experts Sprachmodelle 增强骑车混合专家语言模型法 2502.03009v2

Authors (3): Seng Pei Liew, Takuya Kato, Sho Takase

Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches of mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, of which the scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. Particularly, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training dataset that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance to scale upcycling, and establish conditions under which upcycling outperforms from-scratch trainings within budget constraints.

nan


Article 364

Title@2025-06-16 (1): Equitable Electronic Health Record Prediction with FAME: Fairness-Aware Multimodal Embedding

Title: Equitable Electronic Health Record Prediction with FAME: Fairness-Aware Multimodal Embedding Equitable Electronic Health Record Prediction mit FAME: Fairness-Aware Multimodale Einbettung 公平电子健康记录预测与FAME:公平-软件多模式嵌入 2506.13104v1

Authors (4): Nikkie Hooman, Zhongjie Wu, Eric C. Larson, Mehak Gupta

Electronic Health Record (EHR) data encompass diverse modalities – text, images, and medical codes – that are vital for clinical decision-making. To process these complex data, multimodal AI (MAI) has emerged as a powerful approach for fusing such information. However, most existing MAI models optimize for better prediction performance, potentially reinforcing biases across patient subgroups. Although bias-reduction techniques for multimodal models have been proposed, the individual strengths of each modality and their interplay in both reducing bias and optimizing performance remain underexplored. In this work, we introduce FAME (Fairness-Aware Multimodal Embeddings), a framework that explicitly weights each modality according to its fairness contribution. FAME optimizes both performance and fairness by incorporating a combined loss function. We leverage the Error Distribution Disparity Index (EDDI) to measure fairness across subgroups and propose a sign-agnostic aggregation method to balance fairness across subgroups, ensuring equitable model outcomes. We evaluate FAME with BEHRT and BioClinicalBERT, combining structured and unstructured EHR data, and demonstrate its effectiveness in terms of performance and fairness compared with other baselines across multiple EHR prediction tasks.

nan


Article 365

Title@2025-06-16 (1): Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs

Title: Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs Rethinking Test-Time Scaling für medizinische KI: Modell- und Task-Aware-Strategien für LLMs und VLMs 重新思考医疗用AI:LLMM和VLMM的模型和任务-意识战略 2506.13102v1

Authors (4): Gyutaek Oh, Seoyeon Kim, Sangjoon Park, Byung-Hoon Kim

Test-time scaling has recently emerged as a promising approach for enhancing the reasoning capabilities of large language models or vision-language models during inference. Although a variety of test-time scaling strategies have been proposed, and interest in their application to the medical domain is growing, many critical aspects remain underexplored, including their effectiveness for vision-language models and the identification of optimal strategies for different settings. In this paper, we conduct a comprehensive investigation of test-time scaling in the medical domain. We evaluate its impact on both large language models and vision-language models, considering factors such as model size, inherent model characteristics, and task complexity. Finally, we assess the robustness of these strategies under user-driven factors, such as misleading information embedded in prompts. Our findings offer practical guidelines for the effective use of test-time scaling in medical applications and provide insights into how these strategies can be further refined to meet the reliability and interpretability demands of the medical domain.

nan


Article 366

Title@2025-06-16 (1): NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables

Title: NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables NeedleInATable: Erforschen von Langkontext-Kapazität von großen Sprachmodellen zu langstrukturierten Tabellen 针线表:探索长结构表格中大语言模型的长文能力 2504.06560v3

Authors (8): Lanrui Wang, Mingyu Zheng, Hongyin Tang, Zheng Lin, Yanan Cao, Jingang Wang, Xunliang Cai, Weiping Wang

Processing structured tabular data, particularly large and lengthy tables, constitutes a fundamental yet challenging task for large language models (LLMs). However, existing long-context benchmarks like Needle-in-a-Haystack primarily focus on unstructured text, neglecting the challenge of diverse structured tables. Meanwhile, previous tabular benchmarks mainly consider downstream tasks that require high-level reasoning abilities, and overlook models’ underlying fine-grained perception of individual table cells, which is crucial for practical and robust LLM-based table applications. To address this gap, we introduce \textsc{NeedleInATable} (NIAT), a new long-context tabular benchmark that treats each table cell as a ``needle’’ and requires models to extract the target cell based on cell locations or lookup questions. Our comprehensive evaluation of various LLMs and multimodal LLMs reveals a substantial performance gap between popular downstream tabular tasks and the simpler NIAT task, suggesting that they may rely on dataset-specific correlations or shortcuts to obtain better benchmark results but lack truly robust long-context understanding towards structured tables. Furthermore, we demonstrate that using synthesized NIAT training data can effectively improve performance on both NIAT task and downstream tabular tasks, which validates the importance of NIAT capability for LLMs’ genuine table understanding ability. Our data, code and models will be released to facilitate future research.

nan


Article 367

Title@2025-06-16 (1): Ask Optimal Questions: Aligning Large Language Models with Retriever’s Preference in Conversation

Title: Ask Optimal Questions: Aligning Large Language Models with Retriever’s Preference in Conversation Optimale Fragen stellen: Große Sprachmodelle mit Retrievers Vorliebe im Gespräch ausrichten 问最佳问题:将大语言模型与“检索”的优先对话对象相匹配 2402.11827v2

Authors (6): Chanwoong Yoon, Gangwoo Kim, Byeongguk Jeon, Sungdong Kim, Yohan Jo, Jaewoo Kang

Conversational search, unlike single-turn retrieval tasks, requires understanding the current question within a dialogue context. The common approach of rewrite-then-retrieve aims to decontextualize questions to be self-sufficient for off-the-shelf retrievers, but most existing methods produce sub-optimal query rewrites due to the limited ability to incorporate signals from the retrieval results. To overcome this limitation, we present a novel framework RetPO (Retriever’s Preference Optimization), which is designed to optimize a language model (LM) for reformulating search queries in line with the preferences of the target retrieval systems. The process begins by prompting a large LM to produce various potential rewrites and then collects retrieval performance for these rewrites as the retrievers’ preferences. Through the process, we construct a large-scale dataset called RF collection, containing Retrievers’ Feedback on over 410K query rewrites across 12K conversations. Furthermore, we fine-tune a smaller LM on this dataset to align it with the retrievers’ feedback. Our resulting model demonstrates superiority on two benchmarks, surpassing the previous state-of-the-art performance of rewrite-then-retrieve approaches.

nan


Article 368

Title: Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search Satori: Verstärktes Lernen mit Chain-of-Action-Thought verbessert LLM-Reasoning durch autoregressive Suche 教程:通过自动递减搜索,加强学习,通过行动链-探索加强LLM 2502.02508v3

Authors (10): Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan

Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs’ reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibits strong generalization to out-of-domain tasks. Code, data, and models are fully open-sourced.

nan


Article 369

Title@2025-06-16 (1): CHILL at SemEval-2025 Task 2: You Can’t Just Throw Entities and Hope – Make Your LLM to Get Them Right

Title: CHILL at SemEval-2025 Task 2: You Can’t Just Throw Entities and Hope – Make Your LLM to Get Them Right CHILL at SemEval-2025 Task 2: Man kann nicht einfach Entitäten und Hoffnung werfen – Machen Sie Ihre LLM, um sie richtig zu bekommen 在SemEval 2025任务2: 你不能仅仅抛出实体和希望– 使你的LLM得到正确的东西 2506.13070v1

Authors (4): Jaebok Lee, Yonghyun Ryu, Seongmin Park, Yoonjung Choi

In this paper, we describe our approach for the SemEval 2025 Task 2 on Entity-Aware Machine Translation (EA-MT). Our system aims to improve the accuracy of translating named entities by combining two key approaches: Retrieval Augmented Generation (RAG) and iterative self-refinement techniques using Large Language Models (LLMs). A distinctive feature of our system is its self-evaluation mechanism, where the LLM assesses its own translations based on two key criteria: the accuracy of entity translations and overall translation quality. We demonstrate how these methods work together and effectively improve entity handling while maintaining high-quality translations.

nan


Article 370

Title@2025-06-16 (1): FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design

Title: FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design FinLMM-R1: Verbesserung der finanziellen Begründung in LMM durch skalierbare Daten und Belohnungsdesign FinLMM-R1:通过可缩放数据和奖励设计,加强LMM的资金理由 2506.13066v1

Authors (6): Kai Lan, Jiayong Zhu, Jiangtong Li, Dawei Cheng, Guang Chen, Changjun Jiang

Large Multimodal Models (LMMs) demonstrate significant cross-modal reasoning capabilities. However, financial applications face challenges due to the lack of high-quality multimodal reasoning datasets and the inefficiency of existing training paradigms for reasoning enhancement. To address these issues, we propose an integrated framework, FinLMM-R1, combining an automated and scalable pipeline for data construction with enhanced training strategies to improve the multimodal reasoning of LMM. The Automated and Scalable Pipeline (ASP) resolves textual-visual misalignment in financial reports through a separate paradigm of question-answer generation and image-question alignment, ensuring data integrity and extraction efficiency. Through ASP, we collect 89,378 aligned image-question pairs from 23,397 financial reports, covering tasks such as arithmetic reasoning, statistics reasoning, financial explanation, and financial knowledge. Moreover, we introduce the Thinking with Adversarial Reward in LMM (TAR-LMM), extending the prior two-stage training framework [1] with additional reward mechanisms. In the first stage, we focus on text-only tasks with format and accuracy rewards to guide the model in generating well-structured thinking contents. In the second stage, we construct multi-image contrastive samples with additional reward components including image selection, thinking content length, and adversarial reward to jointly optimize the LMM across visual perception, reasoning efficiency, and logical coherence. Extensive experiments on 7 benchmarks show ASP-derived dataset and training framework significantly improve answer accuracy and reasoning depth over existing reasoning LMMs in both general and financial multimodal contexts.

nan


Article 371

Title@2025-06-16 (1): AgentCourt: Simulating Court with Adversarial Evolvable Lawyer Agents

Title: AgentCourt: Simulating Court with Adversarial Evolvable Lawyer Agents AgentCourt: Simulierung des Gerichts mit kontradiktorisch-evolvierbaren Anwaltsvertretern 法院代理:模拟法院与律师代理 2408.08089v2

Authors (11): Guhong Chen, Liyang Fan, Zihan Gong, Nan Xie, Zixuan Li, Ziqiang Liu, Chengming Li, Qiang Qu, Hamid Alinejad-Rokny, Shiwen Ni, Min Yang

Current research in LLM-based simulation systems lacks comprehensive solutions for modeling real-world court proceedings, while existing legal language models struggle with dynamic courtroom interactions. We present AgentCourt, a comprehensive legal simulation framework that addresses these challenges through adversarial evolution of LLM-based agents. Our AgentCourt introduces a new adversarial evolutionary approach for agents called AdvEvol, which performs dynamic knowledge learning and evolution through structured adversarial interactions in a simulated courtroom program, breaking the limitations of the traditional reliance on static knowledge bases or manual annotations. By simulating 1,000 civil cases, we construct an evolving knowledge base that enhances the agents’ legal reasoning abilities. The evolved lawyer agents demonstrated outstanding performance on our newly introduced CourtBench benchmark, achieving a 12.1% improvement in performance compared to the original lawyer agents. Evaluations by professional lawyers confirm the effectiveness of our approach across three critical dimensions: cognitive agility, professional knowledge, and logical rigor. Beyond outperforming specialized legal models in interactive reasoning tasks, our findings emphasize the importance of adversarial learning in legal AI and suggest promising directions for extending simulation-based legal reasoning to broader judicial and regulatory contexts. The project’s code is available at: https://github.com/relic-yuexi/AgentCourt

nan


Article 372

Title@2025-06-16 (1): MotiveBench: How Far Are We From Human-Like Motivational Reasoning in Large Language Models?

Title: MotiveBench: How Far Are We From Human-Like Motivational Reasoning in Large Language Models? MotivBench: Wie weit sind wir von Menschen wie Motivational Reasoning in großen Sprachmodellen entfernt? 动机:在大型语言模型中,我们从人类的动机上的原因有多远? 2506.13065v1

Authors (5): Xixian Yong, Jianxun Lian, Xiaoyuan Yi, Xiao Zhou, Xing Xie

Large language models (LLMs) have been widely adopted as the core of agent frameworks in various scenarios, such as social simulations and AI companions. However, the extent to which they can replicate human-like motivations remains an underexplored question. Existing benchmarks are constrained by simplistic scenarios and the absence of character identities, resulting in an information asymmetry with real-world situations. To address this gap, we propose MotiveBench, which consists of 200 rich contextual scenarios and 600 reasoning tasks covering multiple levels of motivation. Using MotiveBench, we conduct extensive experiments on seven popular model families, comparing different scales and versions within each family. The results show that even the most advanced LLMs still fall short in achieving human-like motivational reasoning. Our analysis reveals key findings, including the difficulty LLMs face in reasoning about “love & belonging” motivations and their tendency toward excessive rationality and idealism. These insights highlight a promising direction for future research on the humanization of LLMs. The dataset, benchmark, and code are available at https://aka.ms/motivebench.

nan


Article 373

Title@2025-06-16 (1): PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue

Title: PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue PRISM2: Allgemeine Pathologie-KI mit klinischem Dialog entriegeln PRISM2:通过临床对话解锁多模式一般病理学AI 2506.13063v1

Authors (15): George Shaikovski, Eugene Vorontsov, Adam Casson, Julian Viret, Eric Zimmermann, Neil Tenenholtz, Yi Kan Wang, Jan H. Bernhard, Ran A. Godrich, Juan A. Retamero, Razik Yousfi, Nicolo Fusi, Thomas J. Fuchs, Kristen Severson, Siqi Liu

Recent pathology foundation models can provide rich tile-level representations but fall short of delivering general-purpose clinical utility without further extensive model development. These models lack whole-slide image (WSI) understanding and are not trained with large-scale diagnostic data, limiting their performance on diverse downstream tasks. We introduce PRISM2, a multi-modal slide-level foundation model trained via clinical dialogue to enable scalable, generalizable pathology AI. PRISM2 is trained on nearly 700,000 specimens (2.3 million WSIs) paired with real-world clinical diagnostic reports in a two-stage process. In Stage 1, a vision-language model is trained using contrastive and captioning objectives to align whole slide embeddings with textual clinical diagnosis. In Stage 2, the language model is unfrozen to enable diagnostic conversation and extract more clinically meaningful representations from hidden states. PRISM2 achieves strong performance on diagnostic and biomarker prediction tasks, outperforming prior slide-level models including PRISM and TITAN. It also introduces a zero-shot yes/no classification approach that surpasses CLIP-style methods without prompt tuning or class enumeration. By aligning visual features with clinical reasoning, PRISM2 improves generalization on both data-rich and low-sample tasks, offering a scalable path forward for building general pathology AI agents capable of assisting diagnostic and prognostic decisions.

nan


Article 374

Title@2025-06-16 (1): Generative Representational Learning of Foundation Models for Recommendation

Title: Generative Representational Learning of Foundation Models for Recommendation Generatives repräsentatives Lernen von Stiftungsmodellen zur Empfehlung 产生基础基础建议模式的代言人学习 2506.11999v2

Authors (7): Zheli Zhou, Chenxu Zhu, Jianghao Lin, Bo Chen, Ruiming Tang, Weinan Zhang, Yong Yu

Developing a single foundation model with the capability to excel across diverse tasks has been a long-standing objective in the field of artificial intelligence. As the wave of general-purpose foundation models sweeps across various domains, their influence has significantly extended to the field of recommendation systems. While recent efforts have explored recommendation foundation models for various generative tasks, they often overlook crucial embedding tasks and struggle with the complexities of multi-task learning, including knowledge sharing & conflict resolution, and convergence speed inconsistencies. To address these limitations, we introduce RecFound, a generative representational learning framework for recommendation foundation models. We construct the first comprehensive dataset for recommendation foundation models covering both generative and embedding tasks across diverse scenarios. Based on this dataset, we propose a novel multi-task training scheme featuring a Task-wise Mixture of Low-rank Experts (TMoLE) to handle knowledge sharing & conflict, a Step-wise Convergence-oriented Sample Scheduler (S2Sched) to address inconsistent convergence, and a Model Merge module to balance the performance across tasks. Experiments demonstrate that RecFound achieves state-of-the-art performance across various recommendation tasks, outperforming existing baselines.

nan


Article 375

Title@2025-06-16 (1): Multipole Attention for Efficient Long Context Reasoning

Title: Multipole Attention for Efficient Long Context Reasoning Mehrpolige Aufmerksamkeit für effiziente lange Kontext-Reasoning 多极关注高效长处理由 2506.13059v1

Authors (8): Coleman Hooper, Sebastian Zhao, Luca Manolache, Sehoon Kim, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long chain-of-thought reasoning in order to think before answering, which requires generating thousands of tokens. While sparse attention methods can help reduce the KV cache pressure induced by this long autoregressive reasoning, these methods can introduce errors which disrupt the reasoning process. Additionally, prior methods often pre-process the input to make it easier to identify the important prompt tokens when computing attention during generation, and this pre-processing is challenging to perform online for newly generated reasoning tokens. Our work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens, while maintaining approximate representations for the remaining tokens. Our method first performs clustering to group together semantically similar key vectors, and then uses the cluster centroids both to identify important key vectors and to approximate the remaining key vectors in order to retain high accuracy. We design a fast cluster update process to quickly re-cluster the input and previously generated tokens, thereby allowing for accelerating attention to the previous output tokens. We evaluate our method using emerging LRMs such as Qwen-8B, demonstrating that our approach can maintain accuracy on complex reasoning tasks even with aggressive attention sparsity settings. We also provide kernel implementations to demonstrate the practical efficiency gains from our method, achieving up to 4.5$\times$ speedup for attention in long-context reasoning applications. Our code is available at https://github.com/SqueezeAILab/MultipoleAttention.

nan


Article 376

Title@2025-06-16 (1): Latent Multi-Head Attention for Small Language Models

Title: Latent Multi-Head Attention for Small Language Models Latent Multi-Head Aufmerksamkeit für kleine Sprachmodelle 对小型语言模式的多方关注 2506.09342v2

Authors (4): Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat

We present the first comprehensive study of latent multi-head attention (MLA) for small language models, revealing interesting efficiency-quality trade-offs. Training 30M-parameter GPT models on 100,000 synthetic stories, we benchmark three architectural variants: standard multi-head attention (MHA), MLA, and MLA with rotary positional embeddings (MLA+RoPE). Our key finding is that MLA+RoPE with half-rank latent dimensions (r = d/2) achieves a 45% KV-cache memory reduction while incurring only a 0.3% increase in validation loss (essentially matching MHA quality)- a Pareto improvement for memory constrained deployment. We further show that RoPE is crucial for MLA in small models: without it, MLA underperforms vanilla attention by 3-5%, but with RoPE, it surpasses vanilla by 2%. Inference benchmarks on NVIDIA A100 GPUs reveal that MLA with r=d/2 achieves a 1.4 times speedup over full-rank MLA while maintaining the memory savings. GPT-4 evaluations corroborate perplexity results, with ours achieving the highest quality scores (7.4/10) across grammar, creativity, and consistency metrics. Code and models will be released upon acceptance.

nan


Article 377

Title@2025-06-16 (1): CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model

Title: CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model CFBESIMIM-MM:中国金融助理多式大语言模式基准 2506.13055v1

Authors (5): Jiangtong Li, Yiyun Zhu, Dawei Cheng, Zhijun Ding, Changjun Jiang

Multimodal Large Language Models (MLLMs) have rapidly evolved with the growth of Large Language Models (LLMs) and are now applied in various fields. In finance, the integration of diverse modalities such as text, charts, and tables is crucial for accurate and efficient decision-making. Therefore, an effective evaluation system that incorporates these data types is essential for advancing financial application. In this paper, we introduce CFBenchmark-MM, a Chinese multimodal financial benchmark with over 9,000 image-question pairs featuring tables, histogram charts, line charts, pie charts, and structural diagrams. Additionally, we develop a staged evaluation system to assess MLLMs in handling multimodal information by providing different visual content step by step. Despite MLLMs having inherent financial knowledge, experimental results still show limited efficiency and robustness in handling multimodal financial context. Further analysis on incorrect responses reveals the misinterpretation of visual content and the misunderstanding of financial concepts are the primary issues. Our research validates the significant, yet underexploited, potential of MLLMs in financial analysis, highlighting the need for further development and domain-specific optimization to encourage the enhanced use in financial domain.

nan


Article 378

Title@2025-06-16 (1): Stress-Testing Multimodal Foundation Models for Crystallographic Reasoning

Title: Stress-Testing Multimodal Foundation Models for Crystallographic Reasoning Stress-Testing Multimodale Fundamentierungsmodelle für kristallografische Reasoning 水晶理学理由多式模型 2506.13051v1

Authors (4): Can Polat, Hasan Kurban, Erchin Serpedin, Mustafa Kurban

Evaluating foundation models for crystallographic reasoning requires benchmarks that isolate generalization behavior while enforcing physical constraints. This work introduces a multiscale multicrystal dataset with two physically grounded evaluation protocols to stress-test multimodal generative models. The Spatial-Exclusion benchmark withholds all supercells of a given radius from a diverse dataset, enabling controlled assessments of spatial interpolation and extrapolation. The Compositional-Exclusion benchmark omits all samples of a specific chemical composition, probing generalization across stoichiometries. Nine vision–language foundation models are prompted with crystallographic images and textual context to generate structural annotations. Responses are evaluated via (i) relative errors in lattice parameters and density, (ii) a physics-consistency index penalizing volumetric violations, and (iii) a hallucination score capturing geometric outliers and invalid space-group predictions. These benchmarks establish a reproducible, physically informed framework for assessing generalization, consistency, and reliability in large-scale multimodal models. Dataset and code are available at https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.

nan


Article 379

Title: Knowledge Graph Large Language Model (KG-LLM) for Link Prediction Wissensgrafik Großes Sprachmodell (KG-LLM) für die Link-Vorhersage 链接预测知识图大语言模型(KG-LLM) 2403.07311v9

Authors (6): Dong Shu, Tianle Chen, Mingyu Jin, Chong Zhang, Mengnan Du, Yongfeng Zhang

The task of multi-hop link prediction within knowledge graphs (KGs) stands as a challenge in the field of knowledge graph analysis, as it requires the model to reason through and understand all intermediate connections before making a prediction. In this paper, we introduce the Knowledge Graph Large Language Model (KG-LLM), a novel framework that leverages large language models (LLMs) for knowledge graph tasks. We first convert structured knowledge graph data into natural language and then use these natural language prompts to fine-tune LLMs to enhance multi-hop link prediction in KGs. By converting the KG to natural language prompts, our framework is designed to learn the latent representations of entities and their interrelations. To show the efficacy of the KG-LLM Framework, we fine-tune three leading LLMs within this framework, including Flan-T5, LLaMa2 and Gemma. Further, we explore the framework’s potential to provide LLMs with zero-shot capabilities for handling previously unseen prompts. Experimental results show that KG-LLM significantly improves the models’ generalization capabilities, leading to more accurate predictions in unfamiliar scenarios.

nan


Article 380

Title@2025-06-16 (1): Upcycling Large Language Models into Mixture of Experts

Title: Upcycling Large Language Models into Mixture of Experts Upcycling von großen Sprachmodellen zur Mischung von Experten 将大语言模型再生成专家混合模式 2410.07524v2

Authors (10): Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, Bryan Catanzaro

Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel “virtual group” initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over topK-then-softmax approach and higher granularity MoEs can help improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuous trained model achieved 65.3% MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices to effectively leverage upcycling for building MoE language models. Code is available.

nan


Article 381

Title@2025-06-16 (1): Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation

Title: Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation Ermöglichung medizinischer KI-Assistenten bei der Bereitstellung durch Input-Driven Saliency Adaptation 通过投入驱动感光度适应,使在线医疗自理助理能够使用投入驱动求感光度适应 2506.11105v2

Authors (5): Uttej Kallakurik, Edward Humes, Rithvik Jonna, Xiaomin Lin, Tinoosh Mohsenin

Large Language Models (LLMs) have significant impact on the healthcare scenarios but remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. In this work, we introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors Large Language Models (LLMs) for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method can aggressively prune irrelevant neurons, reducing model size while preserving performance. Following pruning, we apply post-training quantization to further reduce the memory footprint, and evaluate the compressed model across medical benchmarks including MedMCQA, MedQA, and PubMedQA. We also deploy the 50\% compressed Gemma and the 67\% compressed LLaMA3 models on Jetson Orin Nano (18.7W peak) and Raspberry Pi 5 (6.3W peak), achieving real-time, energy-efficient inference under hardware constraints.

nan


Article 382

Title@2025-06-16 (1): Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models

Title: Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models Einfach parallel gehen: Mehrsprachige Fähigkeiten großer Sprachmodelle verbessern 平行:提高大语言模式多语言能力 2506.13044v1

Authors (3): Muhammad Reza Qorib, Junyi Li, Hwee Tou Ng

Large language models (LLMs) have demonstrated impressive translation capabilities even without being explicitly trained on parallel data. This remarkable property has led some to believe that parallel data is no longer necessary for building multilingual language models. While some attribute this to the emergent abilities of LLMs due to scale, recent work suggests that it is actually caused by incidental bilingual signals present in the training data. Various methods have been proposed to maximize the utility of parallel data to enhance the multilingual capabilities of multilingual encoder-based and encoder-decoder language models. However, some decoder-based LLMs opt to ignore parallel data instead. In this work, we conduct a systematic study on the impact of adding parallel data on LLMs’ multilingual capabilities, focusing specifically on translation and multilingual common-sense reasoning. Through controlled experiments, we demonstrate that parallel data can significantly improve LLMs’ multilingual capabilities.

nan


Article 383

Title@2025-06-16 (1): An overview of domain-specific foundation model: key technologies, applications and challenges

Title: An overview of domain-specific foundation model: key technologies, applications and challenges Ein Überblick über domänenspezifisches Fundamentmodell: Schlüsseltechnologien, Anwendungen und Herausforderungen 特定领域基础模型概览:关键技术、应用和挑战 2409.04267v3

Authors (9): Haolong Chen, Hanzhi Chen, Zijian Zhao, Kaifeng Han, Guangxu Zhu, Yichen Zhao, Ying Du, Wei Xu, Qingjiang Shi

The impressive performance of ChatGPT and other foundation-model-based products in human language understanding has prompted both academia and industry to explore how these models can be tailored for specific industries and application scenarios. This process, known as the customization of domain-specific foundation models (FMs), addresses the limitations of general-purpose models, which may not fully capture the unique patterns and requirements of domain-specific data. Despite its importance, there is a notable lack of comprehensive overview papers on building domain-specific FMs, while numerous resources exist for general-purpose models. To bridge this gap, this article provides a timely and thorough overview of the methodology for customizing domain-specific FMs. It introduces basic concepts, outlines the general architecture, and surveys key methods for constructing domain-specific models. Furthermore, the article discusses various domains that can benefit from these specialized models and highlights the challenges ahead. Through this overview, we aim to offer valuable guidance and reference for researchers and practitioners from diverse fields to develop their own customized FMs.

nan


Article 384

Title@2025-06-16 (1): A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

Title: A dataset of questions on decision-theoretic reasoning in Newcomb-like problems Ein Datensatz von Fragen zur entscheidungstheoretischen Argumentation in Newcomb-ähnlichen Problemen 在类似新方格布问题中决策理论推理问题数据集 2411.10588v4

Authors (5): Caspar Oesterheld, Emery Cooper, Miles Kodama, Linh Chi Nguyen, Ethan Perez

We introduce a dataset of natural-language questions in the decision theory of so-called Newcomb-like problems. Newcomb-like problems include, for instance, decision problems in which an agent interacts with a similar other agent, and thus has to reason about the fact that the other agent will likely reason in similar ways. Evaluating LLM reasoning about Newcomb-like problems is important because interactions between foundation-model-based agents will often be Newcomb-like. Some ways of reasoning about Newcomb-like problems may allow for greater cooperation between models. Our dataset contains both capabilities questions (i.e., questions with a unique, uncontroversially correct answer) and attitude questions (i.e., questions about which decision theorists would disagree). We use our dataset for an investigation of decision-theoretical capabilities and expressed attitudes and their interplay in existing models (different models by OpenAI, Anthropic, Meta, GDM, Reka, etc.), as well as models under simple prompt-based interventions. We find, among other things, that attitudes vary significantly between existing models; that high capabilities are associated with attitudes more favorable toward so-called evidential decision theory; and that attitudes are consistent across different types of questions.

nan


Article 385

Title@2025-06-16 (1): Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation

Title: Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation Destill CLIP (DCLIP): Bild-Text-Retrieval durch Cross-Modal Transformer-Destillation verbessern 蒸馏 CLIP (DCLIP): 通过跨模式变异器蒸馏加强图像- 文本回收 2505.21549v4

Authors (8): Daniel Csizmadia, Andrei Codreanu, Victor Sim, Vighnesh Prabhu, Michael Lu, Kevin Zhu, Sean O’Brien, Vasu Sharma

We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model’s strong zero-shot classification capabilities. CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. These semantically and spatially aligned global representations guide the training of a lightweight student model using a hybrid loss that combines contrastive learning and cosine similarity objectives. Despite being trained on only ~67,500 samples curated from MSCOCO, Flickr30k, and Conceptual Captions-just a fraction of CLIP’s original dataset-DCLIP significantly improves image-text retrieval metrics (Recall@K, MAP), while retaining approximately 94% of CLIP’s zero-shot classification performance. These results demonstrate that DCLIP effectively mitigates the trade-off between task specialization and generalization, offering a resource-efficient, domain-adaptive, and detail-sensitive solution for advanced vision-language tasks. Code available at https://anonymous.4open.science/r/DCLIP-B772/README.md.

nan


Article 386

Title@2025-06-16 (1): Task-aligned prompting improves zero-shot detection of AI-generated images by Vision-Language Models

Title: Task-aligned prompting improves zero-shot detection of AI-generated images by Vision-Language Models Task-aligned prompting verbessert Zero-Shot-Erkennung von KI-generierten Bildern durch Vision-Language Models 以任务与任务的调和促动方式改进视觉语言模型对AI产生的图像的零光探测 2506.11031v2

Authors (4): Zoher Kachwala, Danishjeet Singh, Danielle Yang, Filippo Menczer

As image generators produce increasingly realistic images, concerns about potential misuse continue to grow. Supervised detection relies on large, curated datasets and struggles to generalize across diverse generators. In this work, we investigate the use of pre-trained Vision-Language Models (VLMs) for zero-shot detection of AI-generated images. While off-the-shelf VLMs exhibit some task-specific reasoning and chain-of-thought prompting offers gains, we show that task-aligned prompting elicits more focused reasoning and significantly improves performance without fine-tuning. Specifically, prefixing the model’s response with the phrase “Let’s examine the style and the synthesis artifacts” – a method we call zero-shot-s$^2$ – boosts Macro F1 scores by 8%-29%. These gains are consistent for two widely used open-source models and across three recent, diverse datasets spanning human faces, objects, and animals with images generated by 16 different models – demonstrating strong generalization. We further evaluate the approach across three additional model sizes and observe improvements in most dataset-model combinations – suggesting robustness to model scale. Surprisingly, self-consistency, a behavior previously observed in language reasoning, where aggregating answers from diverse reasoning paths improves performance, also holds in this setting. Even here, zero-shot-s$^2$ scales better than chain-of-thought in most cases – indicating that it elicits more useful diversity. Our findings show that task-aligned prompts elicit more focused reasoning and enhance latent capabilities in VLMs, like the detection of AI-generated images – offering a simple, generalizable, and explainable alternative to supervised methods. Our code is publicly available on github: https://github.com/Zoher15/Zero-shot-s2.

nan


Article 387

Title@2025-06-16 (1): Knowledge Graph Fusion with Large Language Models for Accurate, Explainable Manufacturing Process Planning

Title: Knowledge Graph Fusion with Large Language Models for Accurate, Explainable Manufacturing Process Planning Wissensgraphenfusion mit großen Sprachmodellen für eine genaue, erklärbare Prozessplanung in der Fertigung 与用于准确、可解释的制造过程规划的大型语言模型知识图集融合 2506.13026v1

Authors (4): Danny Hoang, David Gorsich, Matthew P. Castanier, Farhad Imani

Precision process planning in Computer Numerical Control (CNC) machining demands rapid, context-aware decisions on tool selection, feed-speed pairs, and multi-axis routing, placing immense cognitive and procedural burdens on engineers from design specification through final part inspection. Conventional rule-based computer-aided process planning and knowledge-engineering shells freeze domain know-how into static tables, which become limited when dealing with unseen topologies, novel material states, shifting cost-quality-sustainability weightings, or shop-floor constraints such as tool unavailability and energy caps. Large language models (LLMs) promise flexible, instruction-driven reasoning for tasks but they routinely hallucinate numeric values and provide no provenance. We present Augmented Retrieval Knowledge Network Enhanced Search & Synthesis (ARKNESS), the end-to-end framework that fuses zero-shot Knowledge Graph (KG) construction with retrieval-augmented generation to deliver verifiable, numerically exact answers for CNC process planning. ARKNESS (1) automatically distills heterogeneous machining documents, G-code annotations, and vendor datasheets into augmented triple, multi-relational graphs without manual labeling, and (2) couples any on-prem LLM with a retriever that injects the minimal, evidence-linked subgraph needed to answer a query. Benchmarked on 155 industry-curated questions spanning tool sizing and feed-speed optimization, a lightweight 3B-parameter Llama-3 augmented by ARKNESS matches GPT-4o accuracy while achieving a +25 percentage point gain in multiple-choice accuracy, +22.4 pp in F1, and 8.1x ROUGE-L on open-ended responses.

nan


Article 388

Title@2025-06-16 (1): Edeflip: Supervised Word Translation between English and Yoruba

Title: Edeflip: Supervised Word Translation between English and Yoruba Edeflip: Überwachte Wortübersetzung zwischen Englisch und Yoruba Edeflip: 英文和约鲁巴文翻译监督翻译 2506.13020v1

Authors (2): Ikeoluwa Abioye, Jiani Ge

In recent years, embedding alignment has become the state-of-the-art machine translation approach, as it can yield high-quality translation without training on parallel corpora. However, existing research and application of embedding alignment mostly focus on high-resource languages with high-quality monolingual embeddings. It is unclear if and how low-resource languages may be similarly benefited. In this study, we implement an established supervised embedding alignment method for word translation from English to Yoruba, the latter a low-resource language. We found that higher embedding quality and normalizing embeddings increase word translation precision, with, additionally, an interaction effect between the two. Our results demonstrate the limitations of the state-of-the-art supervised embedding alignment when it comes to low-resource languages, for which there are additional factors that need to be taken into consideration, such as the importance of curating high-quality monolingual embeddings. We hope our work will be a starting point for further machine translation research that takes into account the challenges that low-resource languages face.

nan


Article 389

Title@2025-06-16 (1): Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus

Title: Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus Entwirren von Codemixing in Chats: Der NUS ABC Codemixed Corpus 在聊天区拆解编码混合: NUS ABC 编码混合公司 2506.00332v2

Authors (4): Svetlana Churina, Akshat Gupta, Insyirah Mujtahid, Kokil Jaidka

Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. Our live project will continuously gather, verify, and integrate code-mixed messages into a structured dataset released in JSON format, accompanied by detailed metadata and linguistic statistics. To date, it includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications.

nan


Article 390

Title@2025-06-16 (1): Evaluating how LLM annotations represent diverse views on contentious topics

Title: Evaluating how LLM annotations represent diverse views on contentious topics Bewertung, wie LLM-Annotationen unterschiedliche Ansichten zu strittigen Themen darstellen 评价LLLM说明如何代表对有争议议题的不同观点 2503.23243v2

Authors (4): Megan A. Brown, Shubham Atreja, Libby Hemphill, Patrick Y. Wu

Researchers have proposed the use of generative large language models (LLMs) to label data for research and applied settings. This literature emphasizes the improved performance of these models relative to other natural language models, noting that generative LLMs typically outperform other models and even humans across several metrics. Previous literature has examined bias across many applications and contexts, but less work has focused specifically on bias in generative LLMs’ responses to subjective annotation tasks. This bias could result in labels applied by LLMs that disproportionately align with majority groups over a more diverse set of viewpoints. In this paper, we evaluate how LLMs represent diverse viewpoints on these contentious tasks. Across four annotation tasks on four datasets, we show that LLMs do not show systematic substantial disagreement with annotators on the basis of demographics. Rather, we find that multiple LLMs tend to be biased in the same directions on the same demographic categories within the same datasets. Moreover, the disagreement between human annotators on the labeling task – a measure of item difficulty – is far more predictive of LLM agreement with human annotators. We conclude with a discussion of the implications for researchers and practitioners using LLMs for automated data annotation tasks. Specifically, we emphasize that fairness evaluations must be contextual, model choice alone will not solve potential issues of bias, and item difficulty must be integrated into bias assessments.

nan


Article 391

Title@2025-06-16 (1): Missing the human touch? A computational stylometry analysis of GPT-4 translations of online Chinese literature

Title: Missing the human touch? A computational stylometry analysis of GPT-4 translations of online Chinese literature Vermißt man die menschliche Berührung? Eine rechnerische Stylometrie Analyse von GPT-4 Übersetzungen der online chinesischen Literatur 缺少人类触碰? 对GPT-4 在线中国文学译文的计算式tytyllogy分析 2506.13013v1

Authors (3): Xiaofang Yao, Yong-Bin Kang, Anthony McCosker

Existing research indicates that machine translations (MTs) of literary texts are often unsatisfactory. MTs are typically evaluated using automated metrics and subjective human ratings, with limited focus on stylistic features. Evidence is also limited on whether state-of-the-art large language models (LLMs) will reshape literary translation. This study examines the stylistic features of LLM translations, comparing GPT-4’s performance to human translations in a Chinese online literature task. Computational stylometry analysis shows that GPT-4 translations closely align with human translations in lexical, syntactic, and content features, suggesting that LLMs might replicate the ‘human touch’ in literary translation style. These findings offer insights into AI’s impact on literary translation from a posthuman perspective, where distinctions between machine and human translations become increasingly blurry.

nan


Article 392

Title@2025-06-16 (1): Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification

Title: Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification Selbstregularisierung mit Sparse Autoencodern für steuerbare LLM-basierte Klassifizierung 与基于可控 LLM 的可控 LLM 分类的 Sparse 自动编码器的自调节 2502.14133v2

Authors (4): Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, Ninghao Liu

Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on task-specific datasets. In training the classification model, we propose a simple and effective regularizer, by minimizing the similarity between the classifier weights and the identified unintended feature, to remove the impact of these unintended features on classification. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis. Results show that the proposed self-regularization framework can improve the classifier’s generalizability by regularizing those features that are not semantically correlated to the task. This work pioneers controllable text classification on LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. The code and data are publicly available at https://github.com/JacksonWuxs/Controllable_LLM_Classifier.

nan


Article 393

Title@2025-06-16 (1): Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions

Title: Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions Sprechen Sie einfach: Beseitigen von schädlichen Jailbreaks aus LLMs mit einfachen Interaktionen 简单易言: 与简单互动的LLMLM 2502.04322v2

Authors (4): Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi

Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative–two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of 0.319 in Attack Success Rate and 0.426 in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.

nan


Article 394

Title@2025-06-15 (7): Large Language Models Enhanced by Plug and Play Syntactic Knowledge for Aspect-based Sentiment Analysis

Title: Large Language Models Enhanced by Plug and Play Syntactic Knowledge for Aspect-based Sentiment Analysis Große Sprachmodelle durch Plug-and-Play-Syntaktisches Wissen für aspektbasierte Sentiment-Analysen verbessert 通过插件和播放同步知识增强大语言模型,用于基于频谱的感应分析 2506.12991v1

Authors (6): Yuanhe Tian, Xu Li, Wei Wang, Guoqing Jin, Pengsen Cheng, Yan Song

Aspect-based sentiment analysis (ABSA) generally requires a deep understanding of the contextual information, including the words associated with the aspect terms and their syntactic dependencies. Most existing studies employ advanced encoders (e.g., pre-trained models) to capture such context, especially large language models (LLMs). However, training these encoders is resource-intensive, and in many cases, the available data is insufficient for necessary fine-tuning. Therefore it is challenging for learning LLMs within such restricted environments and computation efficiency requirement. As a result, it motivates the exploration of plug-and-play methods that adapt LLMs to ABSA with minimal effort. In this paper, we propose an approach that integrates extendable components capable of incorporating various types of syntactic knowledge, such as constituent syntax, word dependencies, and combinatory categorial grammar (CCG). Specifically, we propose a memory module that records syntactic information and is incorporated into LLMs to instruct the prediction of sentiment polarities. Importantly, this encoder acts as a versatile, detachable plugin that is trained independently of the LLM. We conduct experiments on benchmark datasets, which show that our approach outperforms strong baselines and previous approaches, thus demonstrates its effectiveness.

nan


Article 395

Title@2025-06-15 (7): Efficient Neuro-Symbolic Retrieval-Augmented Generation through Adaptive Query Routing

Title: Efficient Neuro-Symbolic Retrieval-Augmented Generation through Adaptive Query Routing Effiziente neuro-symbolische retrieval-angereicherte Generierung durch adaptive Abfrageführung 通过适应性查询路由,高效神经-双曲回取回回源养代 2506.12981v1

Authors (4): Safayat Bin Hakim, Muhammad Adil, Alvaro Velasquez, Houbing Herbert Song

Retrieval-Augmented Generation (RAG) systems address factual inconsistencies in Large Language Models by grounding generation in external knowledge, yet they face a fundamental efficiency problem: simple queries consume computational resources equivalent to complex multi-hop reasoning tasks. We present SymRAG, a neuro-symbolic framework that introduces adaptive query routing based on real-time complexity and system load assessments. SymRAG dynamically selects symbolic, neural, or hybrid processing paths to align resource use with query demands. Evaluated on 2,000 queries from HotpotQA and DROP using Llama-3.2-3B and Mistral-7B models, SymRAG achieves 97.6–100.0% exact match accuracy with significantly lower CPU utilization (3.6–6.2%) and processing time (0.985–3.165s). Disabling adaptive logic results in 169–1151% increase in processing time, highlighting the framework’s impact. These results underscore the potential of adaptive neuro-symbolic routing for scalable, sustainable AI systems.

nan


Article 396

Title@2025-06-15 (7): Multi-document Summarization through Multi-document Event Relation Graph Reasoning in LLMs: a case study in Framing Bias Mitigation

Title: Multi-document Summarization through Multi-document Event Relation Graph Reasoning in LLMs: a case study in Framing Bias Mitigation Multi-Dokument Zusammenfassung durch Multi-Dokument Ereignisrelation Graph Reasoning in LLMs: eine Fallstudie in Framing Bias Mitigation 多文件多文件通过多文件事件关系图表概述LLMLM中的原因:关于Framing Bias减缓问题的案例研究 2506.12978v1

Authors (2): Yuanyuan Lei, Ruihong Huang

Media outlets are becoming more partisan and polarized nowadays. Most previous work focused on detecting media bias. In this paper, we aim to mitigate media bias by generating a neutralized summary given multiple articles presenting different ideological views. Motivated by the critical role of events and event relations in media bias detection, we propose to increase awareness of bias in LLMs via multi-document events reasoning and use a multi-document event relation graph to guide the summarization process. This graph contains rich event information useful to reveal bias: four common types of in-doc event relations to reflect content framing bias, cross-doc event coreference relation to reveal content selection bias, and event-level moral opinions to highlight opinionated framing bias. We further develop two strategies to incorporate the multi-document event relation graph for neutralized summarization. Firstly, we convert a graph into natural language descriptions and feed the textualized graph into LLMs as a part of a hard text prompt. Secondly, we encode the graph with graph attention network and insert the graph embedding into LLMs as a soft prompt. Both automatic evaluation and human evaluation confirm that our approach effectively mitigates both lexical and informational media bias, and meanwhile improves content preservation.

nan


Article 397

Title@2025-06-15 (7): Unifying Specialized Visual Encoders for Video Language Models

Title: Unifying Specialized Visual Encoders for Video Language Models Vereinheitlichen von spezialisierten visuellen Encodern für Video-Sprachenmodelle 视频语言模型统一专门视觉编码器 2501.01426v2

Authors (6): Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, Olga Russakovsky

The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV, Multi-Encoder Representation of Videos, instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite video understanding benchmarks, while also having a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder methods while parallelizing the visual processing. Finally, we provide qualitative evidence that MERV successfully captures domain knowledge from each of its encoders. Our results offer promising directions in utilizing multiple vision encoders for comprehensive video understanding.

nan


Article 398

Title@2025-06-15 (7): OR-Bench: An Over-Refusal Benchmark for Large Language Models

Title: OR-Bench: An Over-Refusal Benchmark for Large Language Models OR-Bench: Ein überwiderlegbarer Benchmark für große Sprachmodelle OR-Bench:大语言模式的过度拒绝基准 2405.20947v5

Authors (4): Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh

Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often come with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that can elicit the over-refusal behaviors of LLMs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 32 popular LLMs across 8 model families. Our datasets are publicly available at https://huggingface.co/bench-llms and our codebase is open-sourced at https://github.com/justincui03/or-bench. We hope this benchmark can help the community develop better safety aligned models.

nan


Article 399

Title@2025-06-15 (7): Building, Reusing, and Generalizing Abstract Representations from Concrete Sequences

Title: Building, Reusing, and Generalizing Abstract Representations from Concrete Sequences Aufbau, Wiederverwertung und Verallgemeinerung abstrakter Repräsentationen aus konkreten Sequenzen 建筑、再利用和一般化来自具体序列的抽象代表 2410.21332v2

Authors (5): Shuchen Wu, Mirko Thalmann, Peter Dayan, Zeynep Akata, Eric Schulz

Humans excel at learning abstract patterns across different sequences, filtering out irrelevant details, and transferring these generalized concepts to new sequences. In contrast, many sequence learning models lack the ability to abstract, which leads to memory inefficiency and poor transfer. We introduce a non-parametric hierarchical variable learning model (HVM) that learns chunks from sequences and abstracts contextually similar chunks as variables. HVM efficiently organizes memory while uncovering abstractions, leading to compact sequence representations. When learning on language datasets such as babyLM, HVM learns a more efficient dictionary than standard compression algorithms such as Lempel-Ziv. In a sequence recall task requiring the acquisition and transfer of variables embedded in sequences, we demonstrate HVM’s sequence likelihood correlates with human recall times. In contrast, large language models (LLMs) struggle to transfer abstract variables as effectively as humans. From HVM’s adjustable layer of abstraction, we demonstrate that the model realizes a precise trade-off between compression and generalization. Our work offers a cognitive model that captures the learning and transfer of abstract representations in human cognition and differentiates itself from LLMs.

nan


Article 400

Title@2025-06-15 (7): Assessing the Role of Data Quality in Training Bilingual Language Models

Title: Assessing the Role of Data Quality in Training Bilingual Language Models Bewertung der Rolle der Datenqualität in der Ausbildung zweisprachige Sprachmodelle 评估数据质量在培训双语语文模式方面的作用 2506.12966v1

Authors (4): Skyler Seto, Maartje ter Hoeve, Maureen de Seyssel, David Grangier

Bilingual and multilingual language models offer a promising path toward scaling NLP systems across diverse languages and users. However, their performance often varies wildly between languages as prior works show that adding more languages can degrade performance for some languages (such as English), while improving others (typically more data constrained languages). In this work, we investigate causes of these inconsistencies by comparing bilingual and monolingual language models. Our analysis reveals that unequal data quality, not just data quantity, is a major driver of performance degradation in bilingual settings. We propose a simple yet effective data filtering strategy to select higher-quality bilingual training data with only high quality English data. Applied to French, German, and Chinese, our approach improves monolingual performance by 2-4% and reduces bilingual model performance gaps to 1%. These results highlight the overlooked importance of data quality in multilingual pretraining and offer a practical recipe for balancing performance.

nan


Article 401

Title@2025-06-15 (7): REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities

Title: REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities REPA: Russische Fehlertypen Anmerkung zur Bewertung von Textgenerierung und Urteilsfähigkeiten REPA: 用于评价文本生成和判断能力的俄罗斯错误类型说明 2503.13102v2

Authors (4): Alexander Pugachev, Alena Fenogenova, Vladislav Mikhailov, Ekaterina Artemova

Recent advances in large language models (LLMs) have introduced the novel paradigm of using LLMs as judges, where an LLM evaluates and scores the outputs of another LLM, which often correlates highly with human preferences. However, the use of LLM-as-a-judge has been primarily studied in English. In this paper, we evaluate this framework in Russian by introducing the Russian Error tyPes Annotation dataset (REPA), a dataset of 1k user queries and 2k LLM-generated responses. Human annotators labeled each response pair expressing their preferences across ten specific error types, as well as selecting an overall preference. We rank six generative LLMs across the error types using three rating systems based on human preferences. We also evaluate responses using eight LLM judges in zero-shot and few-shot settings. We describe the results of analyzing the judges and position and length biases. Our findings reveal a notable gap between LLM judge performance in Russian and English. However, rankings based on human and LLM preferences show partial alignment, suggesting that while current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.

nan


Article 402

Title@2025-06-15 (7): From Argumentative Text to Argument Knowledge Graph: A New Framework for Structured Argumentation

Title: From Argumentative Text to Argument Knowledge Graph: A New Framework for Structured Argumentation Vom argumentativen Text zum argumentativen Wissensgraph: Ein neuer Rahmen für strukturierte Argumentation 从参数文字到参数知识图:结构化参数新框架 2506.00713v2

Authors (2): Debarati Bhattacharjee, Ashish Anand

This paper presents a framework to convert argumentative texts into argument knowledge graphs (AKG). Starting with basic annotations of argumentative components (ACs) and argumentative relations (ARs), we enrich the information by constructing a knowledge base (KB) graph with metadata attributes for nodes. Next, we use premises and inference rules from the KB to form arguments by applying modus ponens. From these arguments, we create an AKG. The nodes and edges of the AKG have attributes that capture important argumentative features. We also find missing inference rules by identifying markers. This makes it possible to identify undercut attacks that were previously undetectable in existing datasets. The AKG gives a graphical view of the argumentative structure that is easier to understand than theoretical formats. It also prepares the ground for future reasoning tasks, including checking the coherence of arguments and identifying opportunities for revision. For this, it is important to find indirect relations, many of which are implicit. Our proposed AKG format, with annotated inference rules and modus ponens, will help reasoning models learn the implicit indirect relations that require inference over arguments and the relations between them.

nan


Article 403

Title@2025-06-15 (7): Forecasting Time Series with LLMs via Patch-Based Prompting and Decomposition

Title: Forecasting Time Series with LLMs via Patch-Based Prompting and Decomposition Prognosezeitreihen mit LLMs über Patch-Based Prompting und Zersetzung 通过基于补缝的提示和分解与LLMs一道预测时间序列 2506.12953v1

Authors (10): Mayank Bumb, Anshul Vemulapalli, Sri Harsha Vardhan Prasad Jella, Anish Gupta, An La, Ryan A. Rossi, Hongjie Chen, Franck Dernoncourt, Nesreen K. Ahmed, Yu Wang

Recent advances in Large Language Models (LLMs) have demonstrated new possibilities for accurate and efficient time series analysis, but prior work often required heavy fine-tuning and/or ignored inter-series correlations. In this work, we explore simple and flexible prompt-based strategies that enable LLMs to perform time series forecasting without extensive retraining or the use of a complex external architecture. Through the exploration of specialized prompting methods that leverage time series decomposition, patch-based tokenization, and similarity-based neighbor augmentation, we find that it is possible to enhance LLM forecasting quality while maintaining simplicity and requiring minimal preprocessing of data. To this end, we propose our own method, PatchInstruct, which enables LLMs to make precise and effective predictions.

nan


Article 404

Title@2025-06-15 (7): HypER: Literature-grounded Hypothesis Generation and Distillation with Provenance

Title: HypER: Literature-grounded Hypothesis Generation and Distillation with Provenance HypER: Literaturgestützte Hypothesis-Erzeugung und Destillation mit Provenienz HYPER: 以文学为根据的假设生成和用验证法蒸馏 2506.12937v1

Authors (6): Rosni Vasu, Chandrayee Basu, Bhavana Dalvi Mishra, Cristina Sarasua, Peter Clark, Abraham Bernstein

Large Language models have demonstrated promising performance in research ideation across scientific domains. Hypothesis development, the process of generating a highly specific declarative statement connecting a research idea with empirical validation, has received relatively less attention. Existing approaches trivially deploy retrieval augmentation and focus only on the quality of the final output ignoring the underlying reasoning process behind ideation. We present $\texttt{HypER}$ ($\textbf{Hyp}$othesis Generation with $\textbf{E}$xplanation and $\textbf{R}$easoning), a small language model (SLM) trained for literature-guided reasoning and evidence-based hypothesis generation. $\texttt{HypER}$ is trained in a multi-task setting to discriminate between valid and invalid scientific reasoning chains in presence of controlled distractions. We find that $\texttt{HypER}$ outperformes the base model, distinguishing valid from invalid reasoning chains (+22\% average absolute F1), generates better evidence-grounded hypotheses (0.327 vs. 0.305 base model) with high feasibility and impact as judged by human experts ($>$3.5 on 5-point Likert scale).

nan


Article 405

Title@2025-06-15 (7): CliniDial: A Naturally Occurring Multimodal Dialogue Dataset for Team Reflection in Action During Clinical Operation

Title: CliniDial: A Naturally Occurring Multimodal Dialogue Dataset for Team Reflection in Action During Clinical Operation CliniDial: Ein natürlich vorkommender multimodaler Dialog Datensatz für Teamreflexion während der klinischen Operation CliniDial: 临床行动期间团队反思的自然操作多模式对话数据集 2506.12936v1

Authors (5): Naihao Deng, Kapotaksha Das, Rada Mihalcea, Vitaliy Popov, Mohamed Abouelenien

In clinical operations, teamwork can be the crucial factor that determines the final outcome. Prior studies have shown that sufficient collaboration is the key factor that determines the outcome of an operation. To understand how the team practices teamwork during the operation, we collected CliniDial from simulations of medical operations. CliniDial includes the audio data and its transcriptions, the simulated physiology signals of the patient manikins, and how the team operates from two camera angles. We annotate behavior codes following an existing framework to understand the teamwork process for CliniDial. We pinpoint three main characteristics of our dataset, including its label imbalances, rich and natural interactions, and multiple modalities, and conduct experiments to test existing LLMs’ capabilities on handling data with these characteristics. Experimental results show that CliniDial poses significant challenges to the existing models, inviting future effort on developing methods that can deal with real-world clinical data. We open-source the codebase at https://github.com/MichiganNLP/CliniDial

nan


Article 406

Title@2025-06-15 (7): Layer by Layer: Uncovering Hidden Representations in Language Models

Title: Layer by Layer: Uncovering Hidden Representations in Language Models Layer by Layer: Enthüllen versteckter Darstellungen in Sprachmodellen 按图层分列的图层: 语言模型中未隐藏隐藏的表示 2502.02013v2

Authors (7): Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv

From extracting features to generating text, the outputs of large language models (LLMs) typically rely on the final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that intermediate layers can encode even richer representations, often improving performance on a range of downstream tasks. To explain and quantify these hidden-layer properties, we propose a unified framework of representation quality metrics based on information theory, geometry, and invariance to input perturbations. Our framework highlights how each layer balances information compression and signal preservation, revealing why mid-depth embeddings can exceed the last layer’s performance. Through extensive experiments on 32 text-embedding tasks across various architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features, challenging the standard view on final-layer embeddings and opening new directions on using mid-layer representations for more robust and accurate representations.

nan


Article 407

Title@2025-06-15 (7): SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models

Title: SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models SoundMind: RL-incentivized Logic Reasoning for Audio-Language Models SoundMind: RL - 音频语言模型激励逻辑原因 2506.12935v1

Authors (9): Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Chiyu Ma, Zhongyu Ouyang, Peijun Qing, Soroush Vosoughi, Jiang Gui

While large language models have shown reasoning capabilities, their application to the audio modality, particularly in large audio-language models (ALMs), remains significantly underdeveloped. Addressing this gap requires a systematic approach, involving a capable base model, high-quality reasoning-oriented audio data, and effective training algorithms. In this study, we present a comprehensive solution: we introduce the Audio Logical Reasoning (ALR) dataset, consisting of 6,446 text-audio annotated samples specifically designed for complex reasoning tasks. Building on this resource, we propose SoundMind, a rule-based reinforcement learning (RL) algorithm tailored to endow ALMs with deep bimodal reasoning abilities. By training Qwen2.5-Omni-7B on the ALR dataset using SoundMind, our approach achieves state-of-the-art performance in audio logical reasoning. This work highlights the impact of combining high-quality, reasoning-focused datasets with specialized RL techniques, advancing the frontier of auditory intelligence in language models. Our code and the proposed dataset are available at https://github.com/xid32/SoundMind.

nan


Article 408

Title@2025-06-15 (7): Rethinking Table Instruction Tuning

Title: Rethinking Table Instruction Tuning Umdenken Tabelle Anleitung Tuning 重新思考表格指令图 2501.14693v2

Authors (2): Naihao Deng, Rada Mihalcea

Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of hyperparameter choices, and also lacks a comprehensive evaluation of the out-of-domain table understanding ability and the general capabilities of these table LLMs. In this paper, we evaluate these abilities in existing table LLMs, and find significant declines in both out-of-domain table understanding and general capabilities as compared to their base models. Through systematic analysis, we show that hyperparameters, such as learning rate, can significantly influence both table-specific and general capabilities. Contrary to the previous table instruction-tuning work, we demonstrate that smaller learning rates and fewer training instances can enhance table understanding while preserving general capabilities. Based on our findings, we introduce TAMA, a TAble LLM instruction-tuned from LLaMA 3.1 8B Instruct, which achieves performance on par with, or surpassing GPT-3.5 and GPT-4 on table tasks, while maintaining strong out-of-domain generalization and general capabilities. Our findings highlight the potential for reduced data annotation costs and more efficient model development through careful hyperparameter selection. We open-source the project and our models.

nan


Article 409

Title@2025-06-15 (7): Reasoning with RAGged events: RAG-Enhanced Event Knowledge Base Construction and reasoning with proof-assistants

Title: Reasoning with RAGged events: RAG-Enhanced Event Knowledge Base Construction and reasoning with proof-assistants Reasoning mit RAGged Events: RAG-erweiterte Event Knowledge Base Konstruktion und Reasoning mit Proof-Assistenten RAG-加强事件知识库建设和与证据助理的推理 2506.07042v2

Authors (1): Stergios Chatzikyriakidis

Extracting structured computational representations of historical events from narrative text remains computationally expensive when constructed manually. While RDF/OWL reasoners enable graph-based reasoning, they are limited to fragments of first-order logic, preventing deeper temporal and semantic analysis. This paper addresses both challenges by developing automatic historical event extraction models using multiple LLMs (GPT-4, Claude, Llama 3.2) with three enhancement strategies: pure base generation, knowledge graph enhancement, and Retrieval-Augmented Generation (RAG). We conducted comprehensive evaluations using historical texts from Thucydides. Our findings reveal that enhancement strategies optimize different performance dimensions rather than providing universal improvements. For coverage and historical breadth, base generation achieves optimal performance with Claude and GPT-4 extracting comprehensive events. However, for precision, RAG enhancement improves coordinate accuracy and metadata completeness. Model architecture fundamentally determines enhancement sensitivity: larger models demonstrate robust baseline performance with incremental RAG improvements, while Llama 3.2 shows extreme variance from competitive performance to complete failure. We then developed an automated translation pipeline converting extracted RDF representations into Coq proof assistant specifications, enabling higher-order reasoning beyond RDF capabilities including multi-step causal verification, temporal arithmetic with BC dates, and formal proofs about historical causation. The Coq formalization validates that RAG-discovered event types represent legitimate domain-specific semantic structures rather than ontological violations.

nan


Article 410

Title@2025-06-15 (7): Sectoral Coupling in Linguistic State Space

Title: Sectoral Coupling in Linguistic State Space Sektorale Koppelung im Sprachraum des Staates 语言国家空间部门合并 2506.12927v1

Authors (1): Sebastian Dumbrava

This work presents a formal framework for quantifying the internal dependencies between functional subsystems within artificial agents whose belief states are composed of structured linguistic fragments. Building on the Semantic Manifold framework, which organizes belief content into functional sectors and stratifies them across hierarchical levels of abstraction, we introduce a system of sectoral coupling constants that characterize how one cognitive sector influences another within a fixed level of abstraction. The complete set of these constants forms an agent-specific coupling profile that governs internal information flow, shaping the agent’s overall processing tendencies and cognitive style. We provide a detailed taxonomy of these intra-level coupling roles, covering domains such as perceptual integration, memory access and formation, planning, meta-cognition, execution control, and affective modulation. We also explore how these coupling profiles generate feedback loops, systemic dynamics, and emergent signatures of cognitive behavior. Methodologies for inferring these profiles from behavioral or internal agent data are outlined, along with a discussion of how these couplings evolve across abstraction levels. This framework contributes a mechanistic and interpretable approach to modeling complex cognition, with applications in AI system design, alignment diagnostics, and the analysis of emergent agent behavior.

nan


Article 411

Title@2025-06-15 (7): Identifying and Investigating Global News Coverage of Critical Events Such as Disasters and Terrorist Attacks

Title: Identifying and Investigating Global News Coverage of Critical Events Such as Disasters and Terrorist Attacks Ermittlung und Untersuchung von globalen Nachrichten über kritische Ereignisse wie Katastrophen und Terroranschläge 查明和调查灾害和恐怖袭击等重大事件的全球新闻报道 2506.12925v1

Authors (6): Erica Cai, Xi Chen, Reagan Grey Keeney, Ethan Zuckerman, Brendan O’Connor, Przemyslaw A. Grabowicz

Comparative studies of news coverage are challenging to conduct because methods to identify news articles about the same event in different languages require expertise that is difficult to scale. We introduce an AI-powered method for identifying news articles based on an event FINGERPRINT, which is a minimal set of metadata required to identify critical events. Our event coverage identification method, FINGERPRINT TO ARTICLE MATCHING FOR EVENTS (FAME), efficiently identifies news articles about critical world events, specifically terrorist attacks and several types of natural disasters. FAME does not require training data and is able to automatically and efficiently identify news articles that discuss an event given its fingerprint: time, location, and class (such as storm or flood). The method achieves state-of-the-art performance and scales to massive databases of tens of millions of news articles and hundreds of events happening globally. We use FAME to identify 27,441 articles that cover 470 natural disaster and terrorist attack events that happened in 2020. To this end, we use a massive database of news articles in three languages from MediaCloud, and three widely used, expert-curated databases of critical events: EM-DAT, USGS, and GTD. Our case study reveals patterns consistent with prior literature: coverage of disasters and terrorist attacks correlates to death counts, to the GDP of a country where the event occurs, and to trade volume between the reporting country and the country where the event occurred. We share our NLP annotations and cross-country media attention data to support the efforts of researchers and media monitoring organizations.

nan


Article 412

Title@2025-06-15 (7): PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization

Title: PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization PersonaFeedback: Ein groß angelegter, von Menschen kommentierter Benchmark für Personalisierung 人背人:关于个性化的大规模人文说明基准 2506.12915v1

Authors (6): Meiling Tao, Chenghao Zhu, Dongyi Ding, Tiannan Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou

With the rapid improvement in the general capabilities of LLMs, LLM personalization, i.e., how to build LLM systems that can generate personalized responses or services that are tailored to distinct user personas, has become an increasingly important research and engineering problem. However, unlike many new challenging benchmarks being released for evaluating the general/reasoning capabilities, the lack of high-quality benchmarks for evaluating LLM personalization greatly hinders progress in this field. To address this, we introduce PersonaFeedback, a new benchmark that directly evaluates LLMs’ ability to provide personalized responses given pre-defined user personas and queries. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization, focusing on evaluating the model’s ability to generate responses tailored to explicit personas. PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers based on the contextual complexity of the user personas and the difficulty in distinguishing subtle differences between two personalized responses. We conduct comprehensive evaluations across a wide range of models. The empirical results reveal that even state-of-the-art LLMs that can solve complex real-world reasoning tasks could fall short on the hard tier of PersonaFeedback where even human evaluators may find the distinctions challenging. Furthermore, we conduct an in-depth analysis of failure modes across various types of systems, demonstrating that the current retrieval-augmented framework should not be seen as a de facto solution for personalization tasks. All benchmark data, annotation protocols, and the evaluation pipeline will be publicly available to facilitate future research on LLM personalization.

nan


Article 413

Title@2025-06-15 (7): SciDA: Scientific Dynamic Assessor of LLMs

Title: SciDA: Scientific Dynamic Assessor of LLMs SciDA: Wissenschaftlicher dynamischer Assessor von LLMs SciDA:LLMs科学动态评估员 2506.12909v1

Authors (18): Junting Zhou, Tingjia Miao, Yiyan Liao, Qichao Wang, Zhoufutu Wen, Yanqin Wang, Yunjie Huang, Ge Yan, Leqi Wang, Yucheng Xia, Hongwan Gao, Yuansong Zeng, Renjie Zheng, Chen Dun, Yitao Liang, Tong Yang, Wenhao Huang, Ge Zhang

Advancement in Large Language Models (LLMs) reasoning capabilities enables them to solve scientific problems with enhanced efficacy. Thereby, a high-quality benchmark for comprehensive and appropriate assessment holds significance, while existing ones either confront the risk of data contamination or lack involved disciplines. To be specific, due to the data source overlap of LLMs training and static benchmark, the keys or number pattern of answers inadvertently memorized (i.e. data contamination), leading to systematic overestimation of their reasoning capabilities, especially numerical reasoning. We propose SciDA, a multidisciplinary benchmark that consists exclusively of over 1k Olympic-level numerical computation problems, allowing randomized numerical initializations for each inference round to avoid reliance on fixed numerical patterns. We conduct a series of experiments with both closed-source and open-source top-performing LLMs, and it is observed that the performance of LLMs drop significantly under random numerical initialization. Thus, we provide truthful and unbiased assessments of the numerical reasoning capabilities of LLMs. The data is available at https://huggingface.co/datasets/m-a-p/SciDA

nan


Article 414

Title@2025-06-15 (7): Benchmarking Rotary Position Embeddings for Automatic Speech Recognition

Title: Benchmarking Rotary Position Embeddings for Automatic Speech Recognition Benchmarking von Rotary-Positions-Embeddings für automatische Spracherkennung 自动语音识别扶轮位置嵌入式 2501.06051v2

Authors (4): Shucong Zhang, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya

Self-attention relies on positional embeddings to encode input order. Relative Position (RelPos) embeddings are widely used in Automatic Speech Recognition (ASR). However, RelPos has quadratic time complexity to input length and is often incompatible with fast GPU implementations of attention. In contrast, Rotary Positional Embedding (RoPE) rotates each input vector based on its absolute position, taking linear time to sequence length, implicitly encoding relative distances through self-attention dot products. Thus, it is usually compatible with efficient attention. However, its use in ASR remains underexplored. This work evaluates RoPE across diverse ASR tasks with training data ranging from 100 to 50,000 hours, covering various speech types (read, spontaneous, clean, noisy) and different accents in both streaming and non-streaming settings. ASR error rates are similar or better than RelPos, while training time is reduced by up to 21%. Code is available via the SpeechBrain toolkit.

nan


Article 415

Title@2025-06-15 (7): Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification

Title: Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification Life-Code: Zentrale Dogma-Modellierung mit Multi-Omics-Sequenz-Einheit 生命守则:以多有机序列统一为模式的中央Dogma建模 2502.07299v2

Authors (10): Zicheng Liu, Siyuan Li, Zhiyuan Chen, Fang Wu, Chang Yu, Qirong Yang, Yucheng Guo, Yujie Yang, Xiaoming Zhang, Stan Z. Li

The interactions between DNA, RNA, and proteins are fundamental to biological processes, as illustrated by the central dogma of molecular biology. Although modern biological pre-trained models have achieved great success in analyzing these macromolecules individually, their interconnected nature remains underexplored. This paper follows the guidance of the central dogma to redesign both the data and model pipeline and offers a comprehensive framework, Life-Code, that spans different biological functions. As for data flow, we propose a unified pipeline to integrate multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. As for the model, we design a codon tokenizer and a hybrid long-sequence architecture to encode the interactions between coding and non-coding regions through masked modeling pre-training. To model the translation and folding process with coding sequences, Life-Code learns protein structures of the corresponding amino acids by knowledge distillation from off-the-shelf protein language models. Such designs enable Life-Code to capture complex interactions within genetic sequences, providing a more comprehensive understanding of multi-omics with the central dogma. Extensive experiments show that Life-Code achieves state-of-the-art results on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.

nan


Article 416

Title@2025-06-15 (7): Navigating LLM Ethics: Advancements, Challenges, and Future Directions

Title: Navigating LLM Ethics: Advancements, Challenges, and Future Directions Navigation LLM Ethik: Fortschritte, Herausforderungen und zukünftige Richtungen 管理LLM 道德:进步、挑战和未来方向 2406.18841v5

Authors (4): Junfeng Jiao, Saleh Afroogh, Yiming Xu, Connor Phillips

This study addresses ethical issues surrounding Large Language Models (LLMs) within the field of artificial intelligence. It explores the common ethical challenges posed by both LLMs and other AI systems, such as privacy and fairness, as well as ethical challenges uniquely arising from LLMs. It highlights challenges such as hallucination, verifiable accountability, and decoding censorship complexity, which are unique to LLMs and distinct from those encountered in traditional AI systems. The study underscores the need to tackle these complexities to ensure accountability, reduce biases, and enhance transparency in the influential role that LLMs play in shaping information dissemination. It proposes mitigation strategies and future directions for LLM ethics, advocating for interdisciplinary collaboration. It recommends ethical frameworks tailored to specific domains and dynamic auditing systems adapted to diverse contexts. This roadmap aims to guide responsible development and integration of LLMs, envisioning a future where ethical considerations govern AI advancements in society.

nan


Article 417

Title@2025-06-15 (7): JEBS: A Fine-grained Biomedical Lexical Simplification Task

Title: JEBS: A Fine-grained Biomedical Lexical Simplification Task JEBS: Eine feinkörnige biomedizinische Lexikalische Vereinfachungsaufgabe JEBS: 精细的生物医学条约简化任务 2506.12898v1

Authors (4): William Xia, Ishita Unde, Brian Ondov, Dina Demner-Fushman

Online medical literature has made health information more available than ever, however, the barrier of complex medical jargon prevents the general public from understanding it. Though parallel and comparable corpora for Biomedical Text Simplification have been introduced, these conflate the many syntactic and lexical operations involved in simplification. To enable more targeted development and evaluation, we present a fine-grained lexical simplification task and dataset, Jargon Explanations for Biomedical Simplification (JEBS, https://github.com/bill-from-ri/JEBS-data ). The JEBS task involves identifying complex terms, classifying how to replace them, and generating replacement text. The JEBS dataset contains 21,595 replacements for 10,314 terms across 400 biomedical abstracts and their manually simplified versions. Additionally, we provide baseline results for a variety of rule-based and transformer-based systems for the three sub-tasks. The JEBS task, data, and baseline results pave the way for development and rigorous evaluation of systems for replacing or explaining complex biomedical terms.

nan


Article 418

Title: Assessing the Performance Gap Between Lexical and Semantic Models for Information Retrieval With Formulaic Legal Language Bewertung der Performancelücke zwischen Lexischen und Semantischen Modellen für die Informationswiederherstellung mit der Formulaischen Rechtssprache 评估用法律公式化语言获取信息检索的词汇和语义模型之间的绩效差距 2506.12895v1

Authors (4): Larissa Mori, Carlos Sousa de Oliveira, Yuehwern Yih, Mario Ventresca

Legal passage retrieval is an important task that assists legal practitioners in the time-intensive process of finding relevant precedents to support legal arguments. This study investigates the task of retrieving legal passages or paragraphs from decisions of the Court of Justice of the European Union (CJEU), whose language is highly structured and formulaic, leading to repetitive patterns. Understanding when lexical or semantic models are more effective at handling the repetitive nature of legal language is key to developing retrieval systems that are more accurate, efficient, and transparent for specific legal domains. To this end, we explore when this routinized legal language is better suited for retrieval using methods that rely on lexical and statistical features, such as BM25, or dense retrieval models trained to capture semantic and contextual information. A qualitative and quantitative analysis with three complementary metrics shows that both lexical and dense models perform well in scenarios with more repetitive usage of language, whereas BM25 performs better than the dense models in more nuanced scenarios where repetition and verbatim~quotes are less prevalent and in longer queries. Our experiments also show that BM25 is a strong baseline, surpassing off-the-shelf dense models in 4 out of 7 performance metrics. However, fine-tuning a dense model on domain-specific data led to improved performance, surpassing BM25 in most metrics, and we analyze the effect of the amount of data used in fine-tuning on the model’s performance and temporal robustness. The code, dataset and appendix related to this work are available on: https://github.com/larimo/lexsem-legal-ir.

nan


Article 419

Title@2025-06-15 (7): VideoDeepResearch: Long Video Understanding With Agentic Tool Using

Title: VideoDeepResearch: Long Video Understanding With Agentic Tool Using VideoDeepResearch: Langes Video-Verstehen mit Agentischem Werkzeug 视频深入研究:与使用代理工具的远程视频了解 2506.10821v2

Authors (6): Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Ji-Rong Wen, Zhicheng Dou

Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task’s inherent complexity and context window constraint. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool using. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.

nan


Article 420

Title@2025-06-15 (7): ArgHiTZ at ArchEHR-QA 2025: A Two-Step Divide and Conquer Approach to Patient Question Answering for Top Factuality

Title: ArgHiTZ at ArchEHR-QA 2025: A Two-Step Divide and Conquer Approach to Patient Question Answering for Top Factuality ArgHitz bei ArchEHR-QA 2025: Ein zweistufiger Divide- und Conquer-Ansatz zur Beantwortung von Patientenfragen für Top-Faktizität ArchEHR-QA 2025年ArchEHR-QA 的ArgHitTZ:对患者问题回答最佳事实的双重分化和征服办法 2506.12886v1

Authors (8): Adrián Cuadrón, Aimar Sagasti, Maitane Urruela, Iker De la Iglesia, Ane G Domingo-Aldama, Aitziber Atutxa, Josu Goikoetxea, Ander Barrena

This work presents three different approaches to address the ArchEHR-QA 2025 Shared Task on automated patient question answering. We introduce an end-to-end prompt-based baseline and two two-step methods to divide the task, without utilizing any external knowledge. Both two step approaches first extract essential sentences from the clinical text, by prompt or similarity ranking, and then generate the final answer from these notes. Results indicate that the re-ranker based two-step system performs best, highlighting the importance of selecting the right approach for each subtask. Our best run achieved an overall score of 0.44, ranking 8th out of 30 on the leaderboard, securing the top position in overall factuality.

nan


Article 421

Title@2025-06-15 (7): FlatQuant: Flatness Matters for LLM Quantization

Title: FlatQuant: Flatness Matters for LLM Quantization FlatQuant: Flachheitselemente für die LLM-Quantisierung 平整量:LLM量化的平整事项 2410.09426v3

Authors (13): Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao

Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead of affine transformation, we apply Kronecker product with two lightweight matrices, and fuse all operations in FlatQuant into a single kernel. Extensive experiments demonstrate that FlatQuant establishes a new state-of-the-art benchmark for quantization. For example, it achieves less than 1\% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5\%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model. Code is available at: https://github.com/ruikangliu/FlatQuant.

nan


Article 422

Title@2025-06-15 (7): Scaling Laws For Mixed Qquantization

Title: Scaling Laws For Mixed Qquantization Skalierungsgesetze für gemischte Qquantisierung 混合定量化法 2410.06722v2

Authors (8): Zeyu Cao, Boyang Gu, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Xitong Gao, Yiren Zhao

Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the memory and computational requirements for inference. In this study, we focus on a straightforward question: When aiming for a target accuracy or perplexity with low-precision quantization, how much high-precision computation needs to be preserved and how fine-grained this quantization would need to be as we scale LLMs to larger sizes? We first introduce two critical metrics named the quantization ratio ($Q_r$) and quantization block size ($Q_b$). The former measures the number of parameters quantized to low-precision arithmetic normalized by the total parameter count, whereas the latter defines the number of values within a block that share a scaling factor, akin to the block size concept introduced in the FP4 format in NVIDIA’s Blackwell architecture. Through extensive and carefully controlled experiments across different model and quantization methods, we propose a unified scaling law on post-training quantization (PTQ) that can predict loss degeneration for varying $Q_r$ and $Q_b$. For $Q_r$, our scaling law implies that parameter scaling and ratio scaling have a multiplicative relationship. Consequently, larger models are more amenable to a higher quantization ratio $Q_r$, thus supporting an increase in the adoption of mixed quantization for inference. Regarding $Q_b$, our findings indicate that a small block size, similar to that used in Blackwell, is not essential for large models. Employing a small $Q_b$ can instead unnecessarily complicate the design of the hardware circuit.

nan


Article 423

Title@2025-06-15 (7): HARBOR: Exploring Persona Dynamics in Multi-Agent Competition

Title: HARBOR: Exploring Persona Dynamics in Multi-Agent Competition HARBOR: Erforschen von Persona-Dynamik im Multi-Agenten-Wettbewerb 《HARBOR:在多机构竞争中探索人动态》 2502.12149v2

Authors (3): Kenan Jiang, Li Xiong, Fei Liu

We investigate factors contributing to LLM agents’ success in competitive multi-agent environments, using auctions as a testbed where agents bid to maximize profit. The agents are equipped with bidding domain knowledge, distinct personas that reflect item preferences, and a memory of auction history. Our work extends the classic auction scenario by creating a realistic environment where multiple agents bid on houses, weighing aspects such as size, location, and budget to secure the most desirable homes at the lowest prices. Particularly, we investigate three key questions: (a) How does a persona influence an agent’s behavior in a competitive setting? (b) Can an agent effectively profile its competitors’ behavior during auctions? (c) How can persona profiling be leveraged to create an advantage using strategies such as theory of mind? Through a series of experiments, we analyze the behaviors of LLM agents and shed light on new findings. Our testbed, called HARBOR, offers a valuable platform for deepening our understanding of multi-agent workflows in competitive environments.

nan


Article 424

Title@2025-06-15 (7): QFFT, Question-Free Fine-Tuning for Adaptive Reasoning

Title: QFFT, Question-Free Fine-Tuning for Adaptive Reasoning QFFT, Question-Free Fine-Tuning für adaptive Reasoning QFFT, 无问题的调整性理由的精确调整 2506.12860v1

Authors (10): Wanlong Liu, Junxiao Xu, Fei Yu, Yukang Lin, Ke Ji, Wenyu Chen, Yan Xu, Yasheng Wang, Lifeng Shang, Benyou Wang

Recent advancements in Long Chain-of-Thought (CoT) reasoning models have improved performance on complex tasks, but they suffer from overthinking, which generates redundant reasoning steps, especially for simple questions. This paper revisits the reasoning patterns of Long and Short CoT models, observing that the Short CoT patterns offer concise reasoning efficiently, while the Long CoT patterns excel in challenging scenarios where the Short CoT patterns struggle. To enable models to leverage both patterns, we propose Question-Free Fine-Tuning (QFFT), a fine-tuning approach that removes the input question during training and learns exclusively from Long CoT responses. This approach enables the model to adaptively employ both reasoning patterns: it prioritizes the Short CoT patterns and activates the Long CoT patterns only when necessary. Experiments on various mathematical datasets demonstrate that QFFT reduces average response length by more than 50\%, while achieving performance comparable to Supervised Fine-Tuning (SFT). Additionally, QFFT exhibits superior performance compared to SFT in noisy, out-of-domain, and low-resource scenarios.

nan


Article 425

Title@2025-06-15 (7): MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

Title: MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems MORTAR: Multiturn Metamorphic Testing für LLM-basierte Dialogsysteme MORTAR:以LLM为基础的对话系统的多轨变形测试 2412.15557v2

Authors (6): Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, Tsong Yueh Chen

With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated perturbation-MR matching mechanism allows MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated without reliance on potentially biased LLMs as test oracles. In testing six popular LLM-based dialogue systems, MORTAR reaches significantly better effectiveness with over 150\% more bugs revealed per test case when compared to the single-turn metamorphic testing baseline. On the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches without LLM judges, and assist developers to evaluate the dialogue system performance more comprehensively with constrained test resources and budget.

nan


Article 426

Title@2025-06-15 (7): Visual Abstract Thinking Empowers Multimodal Reasoning

Title: Visual Abstract Thinking Empowers Multimodal Reasoning Visuelles Abstraktes Denken macht multimodale Vernunft 视觉抽象思考赋予多模式理由 2505.20164v2

Authors (7): Dairu Liu, Ziyue Wang, Minyuan Ruan, Fuwen Luo, Chi Chen, Peng Li, Yang Liu

Images usually convey richer detail than text, but often include redundant information which potentially downgrades multimodal reasoning performance. When faced with lengthy or complex messages, humans tend to employ abstract thinking to convert them into simple and concise abstracts. Inspired by this cognitive strategy, we introduce Visual Abstract Thinking (VAT), a novel thinking paradigm that prompts Multimodal Large Language Models (MLLMs) with visual abstract instead of explicit verbal thoughts or elaborate guidance, permitting a more concentrated visual reasoning mechanism. Explicit thinking, such as Chain-of-thought (CoT) or tool-augmented approaches, increases the complexity of reasoning process via inserting verbose intermediate steps, external knowledge or visual information. In contrast, VAT reduces redundant visual information and encourages models to focus their reasoning on more essential visual elements. Experimental results show that VAT consistently empowers different models, and achieves an average gain of 17% over GPT-4o baseline by employing diverse types of visual abstracts, demonstrating that VAT can enhance visual reasoning abilities for MLLMs regarding conceptual, structural and relational reasoning tasks. VAT is also compatible with CoT in knowledge-intensive multimodal reasoning tasks. These findings highlight the effectiveness of visual reasoning via abstract thinking and encourage further exploration of more diverse reasoning paradigms from the perspective of human cognition.

nan


Article 427

Title@2025-06-15 (7): Transforming Chatbot Text: A Sequence-to-Sequence Approach

Title: Transforming Chatbot Text: A Sequence-to-Sequence Approach Chatbot-Text transformieren: Ein Sequence-to-Sequence-Ansatz 变换聊天器文本: 序列到序列的方法 2506.12843v1

Authors (2): Natesh Reddy, Mark Stamp

Due to advances in Large Language Models (LLMs) such as ChatGPT, the boundary between human-written text and AI-generated text has become blurred. Nevertheless, recent work has demonstrated that it is possible to reliably detect GPT-generated text. In this paper, we adopt a novel strategy to adversarially transform GPT-generated text using sequence-to-sequence (Seq2Seq) models, with the goal of making the text more human-like. We experiment with the Seq2Seq models T5-small and BART which serve to modify GPT-generated sentences to include linguistic, structural, and semantic components that may be more typical of human-authored text. Experiments show that classification models trained to distinguish GPT-generated text are significantly less accurate when tested on text that has been modified by these Seq2Seq models. However, after retraining classification models on data generated by our Seq2Seq technique, the models are able to distinguish the transformed GPT-generated text from human-generated text with high accuracy. This work adds to the accumulating knowledge of text transformation as a tool for both attack – in the sense of defeating classification models – and defense – in the sense of improved classifiers – thereby advancing our understanding of AI-generated text.

nan


Article 428

Title@2025-06-15 (7): WereWolf-Plus: An Update of Werewolf Game setting Based on DSGBench

Title: WereWolf-Plus: An Update of Werewolf Game setting Based on DSGBench WereWolf-Plus: Ein Update der Werwolf-Spieleinstellung basierend auf DSGBench WereWolf-Plus:基于 DSGBench 的狼人游戏环境更新 2506.12841v1

Authors (4): Xinyuan Xia, Yuanyi Song, Haomin Ma, Jinyu Cai

With the rapid development of LLM-based agents, increasing attention has been given to their social interaction and strategic reasoning capabilities. However, existing Werewolf-based benchmarking platforms suffer from overly simplified game settings, incomplete evaluation metrics, and poor scalability. To address these limitations, we propose WereWolf-Plus, a multi-model, multi-dimensional, and multi-method benchmarking platform for evaluating multi-agent strategic reasoning in the Werewolf game. The platform offers strong extensibility, supporting customizable configurations for roles such as Seer, Witch, Hunter, Guard, and Sheriff, along with flexible model assignment and reasoning enhancement strategies for different roles. In addition, we introduce a comprehensive set of quantitative evaluation metrics for all special roles, werewolves, and the sheriff, and enrich the assessment dimensions for agent reasoning ability, cooperation capacity, and social influence. WereWolf-Plus provides a more flexible and reliable environment for advancing research on inference and strategic interaction within multi-agent communities. Our code is open sourced at https://github.com/MinstrelsyXia/WereWolfPlus.

nan


Article 429

Title@2025-06-15 (7): Foundations of Large Language Models

Title: Foundations of Large Language Models Grundlagen von großen Sprachmodellen 大语言模式基金会 2501.09223v2

Authors (2): Tong Xiao, Jingbo Zhu

This is a book about large language models. As indicated by the title, it primarily focuses on foundational concepts rather than comprehensive coverage of all cutting-edge technologies. The book is structured into five main chapters, each exploring a key area: pre-training, generative models, prompting, alignment, and inference. It is intended for college students, professionals, and practitioners in natural language processing and related fields, and can serve as a reference for anyone interested in large language models.

nan


Article 430

Title@2025-06-15 (7): QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions

Title: QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions QualiSpeech: Ein Datensatz zur Bewertung der Sprachqualität mit natürlichen Sprachkenntnissen und Beschreibungen 质量语言:语言质量评估数据集,有自然语言理由和描述 2503.20290v3

Authors (10): Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Yu Tsao, Junichi Yamagishi, Yuxuan Wang, Chao Zhang

This paper explores a novel perspective to speech quality assessment by leveraging natural language descriptions, offering richer, more nuanced insights than traditional numerical scoring methods. Natural language feedback provides instructive recommendations and detailed evaluations, yet existing datasets lack the comprehensive annotations needed for this approach. To bridge this gap, we introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset encompassing 11 key aspects and detailed natural language comments that include reasoning and contextual insights. Additionally, we propose the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models (LLMs). Experimental results demonstrate that finetuned auditory LLMs can reliably generate detailed descriptions of noise and distortion, effectively identifying their types and temporal characteristics. The results further highlight the potential for incorporating reasoning to enhance the accuracy and reliability of quality assessments. The dataset will be released at https://huggingface.co/datasets/tsinghua-ee/QualiSpeech.

nan


Article 431

Title@2025-06-15 (7): Medical Argument Mining: Exploitation of Scarce Data Using NLI Systems

Title: Medical Argument Mining: Exploitation of Scarce Data Using NLI Systems Medical Argument Mining: Ausnutzung knapper Daten mit NLI-Systemen 医学论证采矿:利用国家指数系统利用稀缺数据 2506.12823v1

Authors (4): Maitane Urruela, Sergio Martín, Iker De la Iglesia, Ander Barrena

This work presents an Argument Mining process that extracts argumentative entities from clinical texts and identifies their relationships using token classification and Natural Language Inference techniques. Compared to straightforward methods like text classification, this methodology demonstrates superior performance in data-scarce settings. By assessing the effectiveness of these methods in identifying argumentative structures that support or refute possible diagnoses, this research lays the groundwork for future tools that can provide evidence-based justifications for machine-generated clinical conclusions.

nan


Article 432

Title@2025-06-15 (7): Accurate and Regret-aware Numerical Problem Solver for Tabular Question Answering

Title: Accurate and Regret-aware Numerical Problem Solver for Tabular Question Answering Genaue und respektvolle numerische Problemlöser für tabellarische Fragenbeantwortung 用于表格问答的准确和遗憾数字问题解答器 2410.12846v4

Authors (3): Yuxiang Wang, Jianzhong Qi, Junhao Gan

Question answering on free-form tables (a.k.a. TableQA) is a challenging task because of the flexible structure and complex schema of tables. Recent studies use Large Language Models (LLMs) for this task, exploiting their capability in understanding the questions and tabular data, which are typically given in natural language and contain many textual fields, respectively. While this approach has shown promising results, it overlooks the challenges brought by numerical values which are common in tabular data, and LLMs are known to struggle with such values. We aim to address this issue, and we propose a model named TabLaP that uses LLMs as a planner rather than an answer generator. This approach exploits LLMs’ capability in multi-step reasoning while leaving the actual numerical calculations to a Python interpreter for accurate calculation. Recognizing the inaccurate nature of LLMs, we further make a first attempt to quantify the trustworthiness of the answers produced by TabLaP, such that users can use TabLaP in a regret-aware manner. Experimental results on two benchmark datasets show that TabLaP is substantially more accurate than the state-of-the-art models, improving the answer accuracy by 5.7% and 5.8% on the two datasets, respectively.

nan


Article 433

Title@2025-06-15 (7): Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling

Title: Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling Effiziente Sicherheitsausrichtung großer Sprachmodelle über Preference Re-Ranking und repräsentationsbasierte Prämienmodellierung 通过优先排序和以代表制为基础的奖励模式,使大语言模式在安全方面实现高效率的一致 2503.10093v2

Authors (6): Qiyuan Deng, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang

Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, while the ranking order of output generated by policy changes, their overall distribution remains relatively stable. This stability allows the conversion of the sampling process from the target policy into a computationally efficient re-ranking of preference data. Building on this hypothesis, we propose a new framework that leverages the model’s intrinsic safety judgment capability to extract reward signals, which are then used to calculate label confidence for preference reordering. Extensive experiments and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, remarkably enhancing the safety performance while avoiding about 300x computational overheads.

nan


Article 434

Title@2025-06-15 (7): DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs

Title: DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs In Konflikte geraten: In suchgesteigerten LLMs widersprüchliche Quellen erkennen und bekämpfen 钻入冲突:发现和解决搜索中的冲突源 2506.08500v2

Authors (9): Arie Cattan, Alon Jacovi, Ori Ram, Jonathan Herzig, Roee Aharoni, Sasha Goldshtein, Eran Ofek, Idan Szpektor, Avi Caciularu

Retrieval Augmented Generation (RAG) is a commonly used approach for enhancing large language models (LLMs) with relevant and up-to-date information. However, the retrieved sources can often contain conflicting information and it remains unclear how models should address such discrepancies. In this work, we first propose a novel taxonomy of knowledge conflict types in RAG, along with the desired model behavior for each type. We then introduce CONFLICTS, a high-quality benchmark with expert annotations of conflict types in a realistic RAG setting. CONFLICTS is the first benchmark that enables tracking progress on how models address a wide range of knowledge conflicts. We conduct extensive experiments on this benchmark, showing that LLMs often struggle to appropriately resolve conflicts between sources. While prompting LLMs to explicitly reason about the potential conflict in the retrieved documents significantly improves the quality and appropriateness of their responses, substantial room for improvement in future research remains.

nan


Article 435

Title@2025-06-15 (7): EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection

Title: EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection EmoNet-Voice: Ein feinkörniger, sachverständiger Benchmark für Sprachemotionserkennung EmoNet-Voice:语音情感检测精密、经专家核实的专家验证基准 2506.09827v2

Authors (9): Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer

The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensities. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.

nan


Article 436

Title@2025-06-15 (7): ProMedTS: A Self-Supervised, Prompt-Guided Multimodal Approach for Integrating Medical Text and Time Series

Title: ProMedTS: A Self-Supervised, Prompt-Guided Multimodal Approach for Integrating Medical Text and Time Series ProMedTS: Ein selbstüberwachter, prompt geführter multimodaler Ansatz zur Integration medizinischer Text- und Zeitreihen ProMedTS: 综合医疗文本和时间系列的自我监督、迅速指导的多模式办法 2502.13509v2

Authors (9): Shuai Niu, Jing Ma, Hongzhan Lin, Liang Bai, Zhihua Wang, Wei Bi, Yida Xu, Guo Li, Xian Yang

Large language models (LLMs) have shown remarkable performance in vision-language tasks, but their application in the medical field remains underexplored, particularly for integrating structured time series data with unstructured clinical notes. In clinical practice, dynamic time series data, such as lab test results, capture critical temporal patterns, while clinical notes provide rich semantic context. Merging these modalities is challenging due to the inherent differences between continuous signals and discrete text. To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal framework that employs prompt-guided learning to unify these heterogeneous data types. Our approach leverages lightweight anomaly detection to generate anomaly captions that serve as prompts, guiding the encoding of raw time series data into informative prompt embeddings. These prompt embeddings are aligned with textual representations in a shared latent space, preserving fine-grained temporal nuances alongside semantic insights. Furthermore, our framework incorporates tailored self-supervised objectives to enhance both intra- and inter-modal alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world datasets, and the results demonstrate that our method consistently outperforms state-of-the-art approaches.

nan


Article 437

Title@2025-06-15 (7): Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models

Title: Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models 利用小型语言模型进行疾病诊断的知识强化多式临床多式理论 2411.07611v4

Authors (8): Shuai Niu, Jing Ma, Hongzhan Lin, Liang Bai, Zhihua Wang, Yida Xu, Yunya Song, Xian Yang

Interpretation is critical for disease diagnosis, but existing models struggle to balance predictive accuracy with human-understandable rationales. While large language models (LLMs) offer strong reasoning abilities, their clinical use is limited by high computational costs and restricted multimodal reasoning ability. Small language models (SLMs) are efficient but lack advanced reasoning for integrating multimodal medical data. In addition, both LLMs and SLMs lack domain knowledge for trustworthy reasoning. Therefore, we propose ClinRaGen, enhancing SLMs by leveraging LLM-derived reasoning ability via rationale distillation and domain knowledge injection for trustworthy multimodal rationale generation. Key innovations include a sequential rationale distillation framework that equips SLMs with LLM-comparable multimodal reasoning abilities, and a knowledge-augmented attention mechanism that jointly unifies multimodal representation from time series and textual data in the same encoding space, enabling it to be naturally interpreted by SLMs while incorporating domain knowledge for reliable rationale generation. Experiments on real-world medical datasets show that ClinRaGen achieves state-of-the-art performance in disease diagnosis and rationale generation, demonstrating the effectiveness of combining LLM-driven reasoning with knowledge augmentation for improved interpretability.

nan


Article 438

Title@2025-06-15 (7): Entity Framing and Role Portrayal in the News

Title: Entity Framing and Role Portrayal in the News Entity Framing und Role Portrayal in den Nachrichten 《新闻》中的实体形式和角色形象 2502.14718v2

Authors (12): Tarek Mahmoud, Zhuohan Xie, Dimitar Dimitrov, Nikolaos Nikolaidis, Purificação Silvano, Roman Yangarber, Shivam Sharma, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov

We introduce a novel multilingual hierarchical corpus annotated for entity framing and role portrayal in news articles. The dataset uses a unique taxonomy inspired by storytelling elements, comprising 22 fine-grained roles, or archetypes, nested within three main categories: protagonist, antagonist, and innocent. Each archetype is carefully defined, capturing nuanced portrayals of entities such as guardian, martyr, and underdog for protagonists; tyrant, deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for innocents. The dataset includes 1,378 recent news articles in five languages (Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two critical domains of global significance: the Ukraine-Russia War and Climate Change. Over 5,800 entity mentions have been annotated with role labels. This dataset serves as a valuable resource for research into role portrayal and has broader implications for news analysis. We describe the characteristics of the dataset and the annotation process, and we report evaluation results on fine-tuned state-of-the-art multilingual transformers and hierarchical zero-shot learning using LLMs at the level of a document, a paragraph, and a sentence.

nan


Article 439

Title@2025-06-15 (7): Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models

Title: Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models Demokratisch oder authoritär? Eine neue Dimension politischer Biasen in großen Sprachmodellen probieren 民主还是专制? 以大语言模式探究政治分歧的新层面 2506.12758v1

Authors (5): David Guzman Piedrahita, Irene Strauss, Bernhard Schölkopf, Rada Mihalcea, Zhijing Jin

As Large Language Models (LLMs) become increasingly integrated into everyday life and information ecosystems, concerns about their implicit biases continue to persist. While prior work has primarily examined socio-demographic and left–right political dimensions, little attention has been paid to how LLMs align with broader geopolitical value systems, particularly the democracy–authoritarianism spectrum. In this paper, we propose a novel methodology to assess such alignment, combining (1) the F-scale, a psychometric tool for measuring authoritarian tendencies, (2) FavScore, a newly introduced metric for evaluating model favorability toward world leaders, and (3) role-model probing to assess which figures are cited as general role-models by LLMs. We find that LLMs generally favor democratic values and leaders, but exhibit increases favorability toward authoritarian figures when prompted in Mandarin. Further, models are found to often cite authoritarian figures as role models, even outside explicit political contexts. These results shed light on ways LLMs may reflect and potentially reinforce global political ideologies, highlighting the importance of evaluating bias beyond conventional socio-political axes. Our code is available at: https://github.com/irenestrauss/Democratic-Authoritarian-Bias-LLMs

nan


Article 440

Title@2025-06-15 (7): Can We Infer Confidential Properties of Training Data from LLMs?

Title: Can We Infer Confidential Properties of Training Data from LLMs? Können wir vertrauliche Eigenschaften von Trainingsdaten von LLMs ableiten? 我们能否从LLMS中推断培训数据的机密性? 2506.10364v2

Authors (4): Pengrun Huang, Chhavi Yadav, Ruihan Wu, Kamalika Chaudhuri

Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties – such as patient demographics or disease prevalence – that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.

nan


Article 441

Title@2025-06-15 (7): Rethinking Hate Speech Detection on Social Media: Can LLMs Replace Traditional Models?

Title: Rethinking Hate Speech Detection on Social Media: Can LLMs Replace Traditional Models? Nachdenken über Hass-Spracherkennung in sozialen Medien: Können LLMs traditionelle Modelle ersetzen? 在社会媒体上重新思考仇恨言论探测:LLMs能否取代传统模式? 2506.12744v1

Authors (3): Daman Deep Singh, Ramanuj Bhattacharjee, Abhijnan Chakraborty

Hate speech detection across contemporary social media presents unique challenges due to linguistic diversity and the informal nature of online discourse. These challenges are further amplified in settings involving code-mixing, transliteration, and culturally nuanced expressions. While fine-tuned transformer models, such as BERT, have become standard for this task, we argue that recent large language models (LLMs) not only surpass them but also redefine the landscape of hate speech detection more broadly. To support this claim, we introduce IndoHateMix, a diverse, high-quality dataset capturing Hindi-English code-mixing and transliteration in the Indian context, providing a realistic benchmark to evaluate model robustness in complex multilingual scenarios where existing NLP methods often struggle. Our extensive experiments show that cutting-edge LLMs (such as LLaMA-3.1) consistently outperform task-specific BERT-based models, even when fine-tuned on significantly less data. With their superior generalization and adaptability, LLMs offer a transformative approach to mitigating online hate in diverse environments. This raises the question of whether future works should prioritize developing specialized models or focus on curating richer and more varied datasets to further enhance the effectiveness of LLMs.

nan


Article 442

Title@2025-06-15 (7): Rethinking DPO: The Role of Rejected Responses in Preference Misalignment

Title: Rethinking DPO: The Role of Rejected Responses in Preference Misalignment Überdenken der DPO: Die Rolle der abgelehnten Reaktionen in der Präferenz-Missausrichtung 重新思考DPO:拒绝的对策在偏重不协调方面所起的作用 2506.12725v1

Authors (4): Jay Hyeon Cho, JunHyeok Oh, Myunsoo Kim, Byung-Jun Lee

Direct Preference Optimization (DPO) is a simple and efficient framework that has attracted substantial attention. However, it often struggles to meet its primary objectives – increasing the generation probability of chosen responses while reducing that of rejected responses – due to the dominant influence of rejected responses on the loss function. This imbalance leads to suboptimal performance in promoting preferred responses. In this work, we systematically analyze the limitations of DPO and existing algorithms designed to achieve the objectives stated above. To address these limitations, we propose Bounded-DPO (BDPO), a novel method that bounds the influence of rejected responses while maintaining the original optimization structure of DPO. Through theoretical analysis and empirical evaluations, we demonstrate that BDPO achieves a balanced optimization of the chosen and rejected responses, outperforming existing algorithms.

nan


Article 443

Title@2025-06-15 (7): SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models

Title: SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models SelfCite: Selbstüberwachte Ausrichtung für Kontextzuweisung in großen Sprachmodellen 自成一体:对大语言模式背景归属的自我监督调整 2502.09604v3

Authors (9): Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih

We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks. The source code is available at https://github.com/facebookresearch/SelfCite

nan


Article 444

Title@2025-06-15 (7): Strategic Scaling of Test-Time Compute: A Bandit Learning Approach

Title: Strategic Scaling of Test-Time Compute: A Bandit Learning Approach Strategische Skalierung von Test-Time Compute: Ein Bandit-Lernansatz 试验时间计算战略规模的扩大:匪盗学习方法 2506.12721v1

Authors (2): Bowen Zuo, Yinglun Zhu

Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query difficulty. To address this inefficiency, we formulate test-time compute allocation as a novel bandit learning problem and propose adaptive algorithms that estimate query difficulty on the fly and allocate compute accordingly. Compared to uniform allocation, our algorithms allocate more compute to challenging queries while maintaining accuracy on easier ones. Among challenging queries, our algorithms further learn to prioritize solvable instances, effectively reducing excessive computing on unsolvable queries. We theoretically prove that our algorithms achieve better compute efficiency than uniform allocation and empirically validate their effectiveness on math and code benchmarks. Specifically, our algorithms achieve up to an 11.10% performance improvement (15.04% relative) on the MATH-500 dataset and up to a 7.41% performance improvement (14.40% relative) on LiveCodeBench.

nan


Article 445

Title@2025-06-15 (7): Efficient Sequential Decision Making with Large Language Models

Title: Efficient Sequential Decision Making with Large Language Models Effiziente sequentielle Entscheidungsfindung mit großen Sprachmodellen 与大语言模式高效有序决策 2406.12125v2

Authors (3): Dingyang Chen, Qi Zhang, Yinglun Zhu

This paper focuses on extending the success of large language models (LLMs) to sequential decision making. Existing efforts either (i) re-train or finetune LLMs for decision making, or (ii) design prompts for pretrained LLMs. The former approach suffers from the computational burden of gradient updates, and the latter approach does not show promising results. In this paper, we propose a new approach that leverages online model selection algorithms to efficiently incorporate LLMs agents into sequential decision making. Statistically, our approach significantly outperforms both traditional decision making algorithms and vanilla LLM agents. Computationally, our approach avoids the need for expensive gradient updates of LLMs, and throughout the decision making process, it requires only a small number of LLM calls. We conduct extensive experiments to verify the effectiveness of our proposed approach. As an example, on a large-scale Amazon dataset, our approach achieves more than a 6x performance gain over baselines while calling LLMs in only 1.5% of the time steps.

nan


Article 446

Title@2025-06-15 (7): Humanity’s Last Code Exam: Can Advanced LLMs Conquer Human’s Hardest Code Competition?

Title: Humanity’s Last Code Exam: Can Advanced LLMs Conquer Human’s Hardest Code Competition? Letzte Codeprüfung der Menschheit: Können fortgeschrittene LLMs den härtesten Codewettbewerb des Menschen erobern? 人类最后一次代码考试:高级LLMS 征服人类最硬的代码竞赛吗? 2506.12713v1

Authors (10): Xiangyang Li, Xiaopeng Li, Kuicai Dong, Quanhu Zhang, Rongju Ruan, Xinyi Dai, Xiaoshuang Liu, Shengchun Xu, Yasheng Wang, Ruiming Tang

Code generation is a core capability of large language models (LLMs), yet mainstream benchmarks (e.g., APPs and LiveCodeBench) contain questions with medium-level difficulty and pose no challenge to advanced LLMs. To better reflected the advanced reasoning and code generation ability, We introduce Humanity’s Last Code Exam (HLCE), comprising 235 most challenging problems from the International Collegiate Programming Contest (ICPC World Finals) and the International Olympiad in Informatics (IOI) spanning 2010 - 2024. As part of HLCE, we design a harmonized online-offline sandbox that guarantees fully reproducible evaluation. Through our comprehensive evaluation, we observe that even the strongest reasoning LLMs: o4-mini(high) and Gemini-2.5 Pro, achieve pass@1 rates of only 15.9% and 11.4%, respectively. Meanwhile, we propose a novel “self-recognition” task to measure LLMs’ awareness of their own capabilities. Results indicate that LLMs’ self-recognition abilities are not proportionally correlated with their code generation performance. Finally, our empirical validation of test-time scaling laws reveals that current advanced LLMs have substantial room for improvement on complex programming tasks. We expect HLCE to become a milestone challenge for code generation and to catalyze advances in high-performance reasoning and human-AI collaborative programming. Our code and dataset are also public available(https://github.com/Humanity-s-Last-Code-Exam/HLCE).

nan


Article 447

Title@2025-06-15 (7): SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression

Title: SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression SecurityLingua: Effiziente Verteidigung von LLM-Jailbreak-Angriffen durch Security-Aware Prompt-Kompression 保安Lingua:通过安全警报即时压缩,有效防范LLM越狱袭击 2506.12707v1

Authors (6): Yucheng Li, Surin Ahn, Huiqiang Jiang, Amir H. Abdi, Yuqing Yang, Lili Qiu

Large language models (LLMs) have achieved widespread adoption across numerous applications. However, many LLMs are vulnerable to malicious attacks even after safety alignment. These attacks typically bypass LLMs’ safety guardrails by wrapping the original malicious instructions inside adversarial jailbreaks prompts. Previous research has proposed methods such as adversarial training and prompt rephrasing to mitigate these safety vulnerabilities, but these methods often reduce the utility of LLMs or lead to significant computational overhead and online latency. In this paper, we propose SecurityLingua, an effective and efficient approach to defend LLMs against jailbreak attacks via security-oriented prompt compression. Specifically, we train a prompt compressor designed to discern the “true intention” of the input prompt, with a particular focus on detecting the malicious intentions of adversarial prompts. Then, in addition to the original prompt, the intention is passed via the system prompt to the target LLM to help it identify the true intention of the request. SecurityLingua ensures a consistent user experience by leaving the original input prompt intact while revealing the user’s potentially malicious intention and stimulating the built-in safety guardrails of the LLM. Moreover, thanks to prompt compression, SecurityLingua incurs only a negligible overhead and extra token cost compared to all existing defense methods, making it an especially practical solution for LLM defense. Experimental results demonstrate that SecurityLingua can effectively defend against malicious attacks and maintain utility of the LLM with negligible compute and latency overhead. Our code is available at https://aka.ms/SecurityLingua.

nan


Article 448

Title@2025-06-15 (7): Flexible Realignment of Language Models

Title: Flexible Realignment of Language Models Flexible Neuausrichtung von Sprachmodellen 语文模式灵活调整 2506.12704v1

Authors (4): Wenhong Zhu, Ruobing Xie, Weinan Zhang, Rui Wang

Realignment becomes necessary when a language model (LM) fails to meet expected performance. We propose a flexible realignment framework that supports quantitative control of alignment degree during training and inference. This framework incorporates Training-time Realignment (TrRa), which efficiently realigns the reference model by leveraging the controllable fusion of logits from both the reference and already aligned models. For example, TrRa reduces token usage by 54.63% on DeepSeek-R1-Distill-Qwen-1.5B without any performance degradation, outperforming DeepScaleR-1.5B’s 33.86%. To complement TrRa during inference, we introduce a layer adapter that enables smooth Inference-time Realignment (InRa). This adapter is initialized to perform an identity transformation at the bottom layer and is inserted preceding the original layers. During inference, input embeddings are simultaneously processed by the adapter and the original layer, followed by the remaining layers, and then controllably interpolated at the logit level. We upgraded DeepSeek-R1-Distill-Qwen-7B from a slow-thinking model to one that supports both fast and slow thinking, allowing flexible alignment control even during inference. By encouraging deeper reasoning, it even surpassed its original performance.

nan


Article 449

Title@2025-06-15 (7): Co-occurrence is not Factual Association in Language Models

Title: Co-occurrence is not Factual Association in Language Models Co-occurrence ist nicht Factual Association in Language Models 共同发生不是语言模式中的事实协会 2409.14057v2

Authors (3): Xiao Zhang, Miao Li, Ji Wu

Pretrained language models can encode a large amount of knowledge and utilize it for various reasoning tasks, yet they can still struggle to learn novel factual knowledge effectively from finetuning on limited textual demonstrations. In this work, we show that the reason for this deficiency is that language models are biased to learn word co-occurrence statistics instead of true factual associations. We identify the differences between two forms of knowledge representation in language models: knowledge in the form of co-occurrence statistics is encoded in the middle layers of the transformer model and does not generalize well to reasoning scenarios beyond simple question answering, while true factual associations are encoded in the lower layers and can be freely utilized in various reasoning tasks. Based on these observations, we propose two strategies to improve the learning of factual associations in language models. We show that training on text with implicit rather than explicit factual associations can force the model to learn factual associations instead of co-occurrence statistics, significantly improving the generalization of newly learned knowledge. We also propose a simple training method to actively forget the learned co-occurrence statistics, which unblocks and enhances the learning of factual associations when training on plain narrative text. On both synthetic and real-world corpora, the two proposed strategies improve the generalization of the knowledge learned during finetuning to reasoning scenarios such as indirect and multi-hop question answering.

nan


Article 450

Title@2025-06-15 (7): SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition

Title: SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition SC-SOT: Konditionierung des Decoders auf diarisierten Lautsprecherinformationen für die End-to-End-Überlappende Spracherkennung SC-SOT:为终端至终端超载语音识别分解器设置解码器 2506.12672v1

Authors (2): Yuta Hirano, Sakriani Sakti

We propose Speaker-Conditioned Serialized Output Training (SC-SOT), an enhanced SOT-based training for E2E multi-talker ASR. We first probe how SOT handles overlapped speech, and we found the decoder performs implicit speaker separation. We hypothesize this implicit separation is often insufficient due to ambiguous acoustic cues in overlapping regions. To address this, SC-SOT explicitly conditions the decoder on speaker information, providing detailed information about “who spoke when”. Specifically, we enhance the decoder by incorporating: (1) speaker embeddings, which allow the model to focus on the acoustic characteristics of the target speaker, and (2) speaker activity information, which guides the model to suppress non-target speakers. The speaker embeddings are derived from a jointly trained E2E speaker diarization model, mitigating the need for speaker enrollment. Experimental results demonstrate the effectiveness of our conditioning approach on overlapped speech.

nan


Article 451

Title@2025-06-15 (7): Failure Modes of LLMs for Causal Reasoning on Narratives

Title: Failure Modes of LLMs for Causal Reasoning on Narratives Failure Modes von LLMs für die ursächliche Begründung von Narrativen 以叙述为由解释原因的LLMs失败模式 2410.23884v5

Authors (5): Khurram Yamin, Shantanu Gupta, Gaurav R. Ghosal, Zachary C. Lipton, Bryan Wilder

The ability to robustly identify causal relationships is essential for autonomous decision-making and adaptation to novel scenarios. However, accurately inferring causal structure requires integrating both world knowledge and abstract logical reasoning. In this work, we investigate the interaction between these two capabilities through the representative task of causal reasoning over narratives. Through controlled synthetic, semi-synthetic, and real-world experiments, we find that state-of-the-art large language models (LLMs) often rely on superficial heuristics – for example, inferring causality from event order or recalling memorized world knowledge without attending to context. Furthermore, we show that simple reformulations of the task can elicit more robust reasoning behavior. Our evaluation spans a range of causal structures, from linear chains to complex graphs involving colliders and forks. These findings uncover systematic patterns in how LLMs perform causal reasoning and lay the groundwork for developing methods that better align LLM behavior with principled causal inference.

nan


Article 452

Title@2025-06-14 (6): Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics

Title: Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics Synthetische sokratische Debatten: Untersuchung von Persona-Effekten auf moralische Entscheidung und Überzeugungsdynamik 合成专家辩论:审查人对道德决定的影响和预测动态 2506.12657v1

Authors (8): Jiarui Liu, Yueqi Song, Yunze Xiao, Mingqian Zheng, Lindia Tjuatja, Jana Schaich Borg, Mona Diab, Maarten Sap

As large language models (LLMs) are increasingly used in morally sensitive domains, it is crucial to understand how persona traits affect their moral reasoning and persuasive behavior. We present the first large-scale study of multi-dimensional persona effects in AI-AI debates over real-world moral dilemmas. Using a 6-dimensional persona space (age, gender, country, class, ideology, and personality), we simulate structured debates between AI agents over 131 relationship-based cases. Our results show that personas affect initial moral stances and debate outcomes, with political ideology and personality traits exerting the strongest influence. Persuasive success varies across traits, with liberal and open personalities reaching higher consensus and win rates. While logit-based confidence grows during debates, emotional and credibility-based appeals diminish, indicating more tempered argumentation over time. These trends mirror findings from psychology and cultural studies, reinforcing the need for persona-aware evaluation frameworks for AI moral reasoning.

nan


Article 453

Title@2025-06-14 (6): How Grounded is Wikipedia? A Study on Structured Evidential Support

Title: How Grounded is Wikipedia? A Study on Structured Evidential Support Wie geerdet ist Wikipedia? Eine Studie über strukturierten Evidential Support 维基百科如何根基? 2506.12637v1

Authors (7): William Walden, Kathryn Ricci, Miriam Wanner, Zhengping Jiang, Chandler May, Rongkun Zhou, Benjamin Van Durme

Wikipedia is a critical resource for modern NLP, serving as a rich repository of up-to-date and citation-backed information on a wide variety of subjects. The reliability of Wikipedia – its groundedness in its cited sources – is vital to this purpose. This work provides a quantitative analysis of the extent to which Wikipedia is so grounded and of how readily grounding evidence may be retrieved. To this end, we introduce PeopleProfiles – a large-scale, multi-level dataset of claim support annotations on Wikipedia articles of notable people. We show that roughly 20% of claims in Wikipedia lead sections are unsupported by the article body; roughly 27% of annotated claims in the article body are unsupported by their (publicly accessible) cited sources; and >80% of lead claims cannot be traced to these sources via annotated body evidence. Further, we show that recovery of complex grounding evidence for claims that are supported remains a challenge for standard retrieval methods.

nan


Article 454

Title@2025-06-14 (6): Between Predictability and Randomness: Seeking Artistic Inspiration from AI Generative Models

Title: Between Predictability and Randomness: Seeking Artistic Inspiration from AI Generative Models Zwischen Vorhersagbarkeit und Zufälligkeit: Künstlerische Inspiration aus KI-Generativen Modellen suchen 在可预测性和随机性之间:从AI创创模式中寻求艺术灵感 2506.12634v1

Authors (1): Olga Vechtomova

Artistic inspiration often emerges from language that is open to interpretation. This paper explores the use of AI-generated poetic lines as stimuli for creativity. Through analysis of two generative AI approaches–lines generated by Long Short-Term Memory Variational Autoencoders (LSTM-VAE) and complete poems by Large Language Models (LLMs)–I demonstrate that LSTM-VAE lines achieve their evocative impact through a combination of resonant imagery and productive indeterminacy. While LLMs produce technically accomplished poetry with conventional patterns, LSTM-VAE lines can engage the artist through semantic openness, unconventional combinations, and fragments that resist closure. Through the composition of an original poem, where narrative emerged organically through engagement with LSTM-VAE generated lines rather than following a predetermined structure, I demonstrate how these characteristics can serve as evocative starting points for authentic artistic expression.

nan


Article 455

Title@2025-06-14 (6): Detecting Narrative Shifts through Persistent Structures: A Topological Analysis of Media Discourse

Title: Detecting Narrative Shifts through Persistent Structures: A Topological Analysis of Media Discourse Ermitteln narrativer Verschiebungen durch persistente Strukturen: Eine topologische Analyse des Mediendiskurses 通过持久性结构检测到的叙述性转变:媒体谈话的地形分析 2506.14836v1

Authors (2): Mark M. Bailey, Mark I. Heiligman

How can we detect when global events fundamentally reshape public discourse? This study introduces a topological framework for identifying structural change in media narratives using persistent homology. Drawing on international news articles surrounding major events - including the Russian invasion of Ukraine (Feb 2022), the murder of George Floyd (May 2020), the U.S. Capitol insurrection (Jan 2021), and the Hamas-led invasion of Israel (Oct 2023) - we construct daily co-occurrence graphs of noun phrases to trace evolving discourse. Each graph is embedded and transformed into a persistence diagram via a Vietoris-Rips filtration. We then compute Wasserstein distances and persistence entropies across homological dimensions to capture semantic disruption and narrative volatility over time. Our results show that major geopolitical and social events align with sharp spikes in both H0 (connected components) and H1 (loops), indicating sudden reorganization in narrative structure and coherence. Cross-correlation analyses reveal a typical lag pattern in which changes to component-level structure (H0) precede higher-order motif shifts (H1), suggesting a bottom-up cascade of semantic change. An exception occurs during the Russian invasion of Ukraine, where H1 entropy leads H0, possibly reflecting top-down narrative framing before local discourse adjusts. Persistence entropy further distinguishes tightly focused from diffuse narrative regimes. These findings demonstrate that persistent homology offers a mathematically principled, unsupervised method for detecting inflection points and directional shifts in public attention - without requiring prior knowledge of specific events. This topological approach advances computational social science by enabling real-time detection of semantic restructuring during crises, protests, and information shocks.

nan


Article 456

Title@2025-06-14 (6): MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos

Title: MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos MS4UI: Ein Datensatz für die multimodale Zusammenfassung von Benutzeroberflächen-Instruktionsvideos MS4UI:用户界面教学录像多式摘要数据集 2506.12623v1

Authors (8): Yuan Zang, Hao Tan, Seunghyun Yoon, Franck Dernoncourt, Jiuxiang Gu, Kushal Kafle, Chen Sun, Trung Bui

We study multi-modal summarization for instructional videos, whose goal is to provide users an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic-level video summarization, and are not suitable for providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. We propose a novel benchmark for user interface (UI) instructional video summarization to fill the gap. We collect a dataset of 2,413 UI instructional videos, which spans over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, which enable the comprehensive evaluations for concise and executable video summarization. We conduct extensive experiments on our collected MS4UI dataset, which suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization, and highlight the importance of new methods for UI instructional video summarization.

nan


Article 457

Title@2025-06-14 (6): Video Understanding with Large Language Models: A Survey

Title: Video Understanding with Large Language Models: A Survey Videoverständnis mit großen Sprachmodellen: Eine Umfrage 与大语言模型的视频了解:调查 2312.17432v5

Authors (20): Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.

nan


Article 458

Title@2025-06-14 (6): OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics

Title: OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics OpenUnlearning: Beschleunigung des LLM-Unlearnings durch einheitliche Benchmarking von Methoden und Metrics 开放式学习:通过统一的方法和计量方法基准,加快LLM的学习 2506.12618v1

Authors (7): Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C. Lipton, J. Zico Kolter, Pratyush Maini

Robust unlearning is crucial for safely deploying large language models (LLMs) in environments where data privacy, model safety, and regulatory compliance must be ensured. Yet the task is inherently challenging, partly due to difficulties in reliably measuring whether unlearning has truly occurred. Moreover, fragmentation in current methodologies and inconsistent evaluation metrics hinder comparative analysis and reproducibility. To unify and accelerate research efforts, we introduce OpenUnlearning, a standardized and extensible framework designed explicitly for benchmarking both LLM unlearning methods and metrics. OpenUnlearning integrates 9 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks (TOFU, MUSE, and WMDP) and also enables analyses of forgetting behaviors across 450+ checkpoints we publicly release. Leveraging OpenUnlearning, we propose a novel meta-evaluation benchmark focused specifically on assessing the faithfulness and robustness of evaluation metrics themselves. We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite. Overall, we establish a clear, community-driven pathway toward rigorous development in LLM unlearning research.

nan


Article 459

Title@2025-06-14 (6): Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition

Title: Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition Konooz: Multi-Domain Multi-Dialekt Corpus für die benannte Entitätserkennung Konooz: 名称实体识别多域多对立公司 2506.12615v1

Authors (3): Nagham Hamad, Mohammed Khalilia, Mustafa Jarrar

We introduce Konooz, a novel multi-dimensional corpus covering 16 Arabic dialects across 10 domains, resulting in 160 distinct corpora. The corpus comprises about 777k tokens, carefully collected and manually annotated with 21 entity types using both nested and flat annotation schemes - using the Wojood guidelines. While Konooz is useful for various NLP tasks like domain adaptation and transfer learning, this paper primarily focuses on benchmarking existing Arabic Named Entity Recognition (NER) models, especially cross-domain and cross-dialect model performance. Our benchmarking of four Arabic NER models using Konooz reveals a significant drop in performance of up to 38% when compared to the in-distribution data. Furthermore, we present an in-depth analysis of domain and dialect divergence and the impact of resource scarcity. We also measured the overlap between domains and dialects using the Maximum Mean Discrepancy (MMD) metric, and illustrated why certain NER models perform better on specific dialects and domains. Konooz is open-source and publicly available at https://sina.birzeit.edu/wojood/#download

nan


Article 460

Title@2025-06-14 (6): ShED-HD: A Shannon Entropy Distribution Framework for Lightweight Hallucination Detection on Edge Devices

Title: ShED-HD: A Shannon Entropy Distribution Framework for Lightweight Hallucination Detection on Edge Devices ShED-HD: Ein Shannon Entropy Distribution Framework für leichte Halluzinationserkennung auf Edge-Geräten ShED-HD:关于边缘装置轻量级致幻剂探测的香农封状分发框架 2503.18242v2

Authors (4): Aneesh Vathul, Daniel Lee, Sheryl Chen, Arthi Tasmia

Large Language Models (LLMs) have demonstrated impressive capabilities on a broad array of NLP tasks, but their tendency to produce hallucinations$\unicode{x2013}$plausible-sounding but factually incorrect content$\unicode{x2013}$poses severe challenges in high-stakes domains. Existing hallucination detection methods either bear the computational cost of multiple inference passes or sacrifice accuracy for efficiency with single-pass approaches, neither of which is ideal in resource-constrained environments such as edge devices. We propose the Shannon Entropy Distribution Hallucination Detector (ShED-HD), a novel hallucination detection framework that bridges this gap by classifying sequence-level entropy patterns using a lightweight BiLSTM architecture with single-headed attention. In contrast to prior approaches, ShED-HD efficiently detects distinctive uncertainty patterns across entire output sequences, preserving contextual awareness. Through in-depth evaluation on three datasets (BioASQ, TriviaQA, and Jeopardy Questions), we show that ShED-HD significantly outperforms other computationally efficient approaches in the out-of-distribution setting, while achieving comparable performance in the in-distribution setting. ShED-HD facilitates hallucination detection that is low-cost, accurate, and generalizable, improving the credibility of content generated by LLMs in resource-constrained environments where trustworthy AI functionality is crucial.

nan


Article 461

Title@2025-06-14 (6): Is Smaller Always Faster? Tradeoffs in Compressing Self-Supervised Speech Transformers

Title: Is Smaller Always Faster? Tradeoffs in Compressing Self-Supervised Speech Transformers Ist Kleiner immer schneller? Tradeoffs bei selbstüberwachten Sprachtransformatoren komprimieren 更小的总是更快吗? 压缩自制语音变换器的权衡取舍 2211.09949v3

Authors (7): Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen, Tzu-hsun Feng, Hung-yi Lee, Hao Tang

Transformer-based self-supervised models have achieved remarkable success in speech processing, but their large size and high inference cost present significant challenges for real-world deployment. While numerous compression techniques have been proposed, inconsistent evaluation metrics make it difficult to compare their practical effectiveness. In this work, we conduct a comprehensive study of four common compression methods, including weight pruning, head pruning, low-rank approximation, and knowledge distillation on self-supervised speech Transformers. We evaluate each method under three key metrics: parameter count, multiply-accumulate operations, and real-time factor. Results show that each method offers distinct advantages. In addition, we contextualize recent compression techniques, comparing DistilHuBERT, FitHuBERT, LightHuBERT, ARMHuBERT, and STaRHuBERT under the same framework, offering practical guidance on compression for deployment.

nan


Article 462

Title@2025-06-14 (6): Towards Building General Purpose Embedding Models for Industry 4.0 Agents

Title: Towards Building General Purpose Embedding Models for Industry 4.0 Agents Auf dem Weg zum Aufbau von Modellen für Industrie 4.0-Agenten 建立工业4.0剂通用嵌入模型模型 2506.12607v1

Authors (3): Christodoulos Constantinides, Shuxin Lin, Dhaval Patel

In this work we focus on improving language models’ understanding for asset maintenance to guide the engineer’s decisions and minimize asset downtime. Given a set of tasks expressed in natural language for Industry 4.0 domain, each associated with queries related to a specific asset, we want to recommend relevant items and generalize to queries of similar assets. A task may involve identifying relevant sensors given a query about an asset’s failure mode. Our approach begins with gathering a qualitative, expert-vetted knowledge base to construct nine asset-specific task datasets. To create more contextually informed embeddings, we augment the input tasks using Large Language Models (LLMs), providing concise descriptions of the entities involved in the queries. This embedding model is then integrated with a Reasoning and Acting agent (ReAct), which serves as a powerful tool for answering complex user queries that require multi-step reasoning, planning, and knowledge inference. Through ablation studies, we demonstrate that: (a) LLM query augmentation improves the quality of embeddings, (b) Contrastive loss and other methods that avoid in-batch negatives are superior for datasets with queries related to many items, and (c) It is crucial to balance positive and negative in-batch samples. After training and testing on our dataset, we observe a substantial improvement: HIT@1 increases by +54.2%, MAP@100 by +50.1%, and NDCG@10 by +54.7%, averaged across all tasks and models. Additionally, we empirically demonstrate the model’s planning and tool invocation capabilities when answering complex questions related to industrial asset maintenance, showcasing its effectiveness in supporting Subject Matter Experts (SMEs) in their day-to-day operations.

nan


Article 463

Title@2025-06-14 (6): An Exploration of Mamba for Speech Self-Supervised Models

Title: An Exploration of Mamba for Speech Self-Supervised Models Eine Erkundung von Mamba für selbstüberwachte Sprachmodelle 探索Mamba演讲自我示范模式 2506.12606v1

Authors (8): Tzu-Quan Lin, Heng-Cheng Kuo, Tzu-Chieh Wei, Hsi-Chun Cheng, Chun-Wei Chen, Hsien-Fu Hsiao, Yu Tsao, Hung-yi Lee

While Mamba has demonstrated strong performance in language modeling, its potential as a speech self-supervised (SSL) model remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. Leveraging the linear-time Selective State Space, these models enable fine-tuning on long-context ASR with significantly lower compute. Moreover, they show superior performance when fine-tuned for streaming ASR. Beyond fine-tuning, these models show competitive performance on SUPERB probing benchmarks, particularly in causal settings. Our analysis shows that they yield higher-quality quantized representations and capture speaker-related features more distinctly than Transformer-based models. These findings highlight Mamba-based SSL as a promising and complementary direction for long-sequence modeling, real-time speech modeling, and speech unit extraction.

nan


Article 464

Title@2025-06-14 (6): Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training

Title: Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training Adapt-Pruner: Adaptives Structural Pruning für effizientes Small Language Model Training 适应者:适应性结构调节,促进高效的小型语言模式培训 2502.03460v2

Authors (7): Rui Pan, Boyao Wang, Shizhe Diao, Xingyuan Pan, Jipeng Zhang, Renjie Pi, Tong Zhang

Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train the models from scratch, which incurs substantial computational costs, or compress/prune existing large language models (LLMs), which results in performance drops and falls short in comparison to pre-training. In this paper, we investigate the family of acceleration methods that involve both structured pruning and model training. We found 1) layer-wise adaptive pruning (Adapt-Pruner) is extremely effective in LLMs and yields significant improvements over existing pruning techniques, 2) adaptive pruning equipped with further training leads to models comparable to those pre-training from scratch, 3) incremental pruning brings non-trivial performance gain by interleaving pruning with training and only removing a small portion of neurons ($\sim$5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner, FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense benchmarks. Additionally, Adapt-Pruner restores the performance of MobileLLM-125M to 600M on the MMLU benchmark with 200$\times$ fewer tokens via pruning from its larger counterparts, and discovers a new 1B model that surpasses LLaMA-3.2-1B in multiple benchmarks. The official code is released at https://github.com/research4pan/AdaptPruner.

nan


Article 465

Title@2025-06-14 (6): NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

Title: NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions NaturalReasoning: Vernunft in der Wildnis mit 2.8M anspruchsvollen Fragen 自然反应:以2.8M挑战性问题在野外的原因 2502.13124v3

Authors (11): Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Ilia Kulikov, Kyunghyun Cho, Dong Wang, Yuandong Tian, Jason E Weston, Xian Li

Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions. To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers. We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span multiple domains, including STEM fields (e.g., Physics, Computer Science), Economics, Social Sciences, and more. We demonstrate the utility of the questions in NaturalReasoning through knowledge distillation experiments which show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model. Furthermore, we demonstrate that NaturalReasoning is also effective for unsupervised self-training using external reward models or self-rewarding. To foster future work, we publicly release NaturalReasoning at https://huggingface.co/datasets/facebook/natural_reasoning.

nan


Article 466

Title@2025-06-14 (6): OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases

Title: OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases OneEval: Benchmarking von LLM Wissensintensive Reasoning über unterschiedliche Wissensgrundlagen OneEval:确定LLM 知识密集型知识密集型比多样化知识库更引力的基准 2506.12577v1

Authors (24): Yongrui Chen, Zhiqiang Liu, Jing Yu, Lin Ren, Nan Hu, Xinbang Dai, Jiajun Liu, Jiazhen Kang, Shenyu Zhang, Xinda Wang, Keyan Ding, Pengfei Shen, Haolei Zhu, Hongjie Deng, Yisong Wang, Tongtong Wu, Sheng Bi, Wen Zhang, Tianxing Wu, Qiu Ji, Haofen Wang, Wenliang Chen, Huajun Chen, Guilin Qi

Large Language Models (LLMs) have demonstrated substantial progress on reasoning tasks involving unstructured text, yet their capabilities significantly deteriorate when reasoning requires integrating structured external knowledge such as knowledge graphs, code snippets, or formal logic. This limitation is partly due to the absence of benchmarks capable of systematically evaluating LLM performance across diverse structured knowledge modalities. To address this gap, we introduce \textbf{\textsc{OneEval}}, a comprehensive benchmark explicitly designed to assess the knowledge-intensive reasoning capabilities of LLMs across four structured knowledge modalities, unstructured text, knowledge graphs, code, and formal logic, and five critical domains (general knowledge, government, science, law, and programming). \textsc{OneEval} comprises 4,019 carefully curated instances and includes a challenging subset, \textsc{OneEval}\textsubscript{Hard}, consisting of 1,285 particularly difficult cases. Through extensive evaluation of 18 state-of-the-art open-source and proprietary LLMs, we establish three core findings: a) \emph{persistent limitations in structured reasoning}, with even the strongest model achieving only 32.2\% accuracy on \textsc{OneEval}\textsubscript{Hard}; b) \emph{performance consistently declines as the structural complexity of the knowledge base increases}, with accuracy dropping sharply from 53\% (textual reasoning) to 25\% (formal logic); and c) \emph{diminishing returns from extended reasoning chains}, highlighting the critical need for models to adapt reasoning depth appropriately to task complexity. We release the \textsc{OneEval} datasets, evaluation scripts, and baseline results publicly, accompanied by a leaderboard to facilitate ongoing advancements in structured knowledge reasoning.

nan


Article 467

Title@2025-06-14 (6): Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders

Title: Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders Präzise Topic Alignment in großen Sprachmodellen über Sparse Autoencoder aktivieren 启用大语言模型中的精确主题对齐 2506.12576v1

Authors (3): Ananya Joshi, Celia Cintas, Skyler Speakman

Recent work shows that Sparse Autoencoders (SAE) applied to large language model (LLM) layers have neurons corresponding to interpretable concepts. These SAE neurons can be modified to align generated outputs, but only towards pre-identified topics and with some parameter tuning. Our approach leverages the observational and modification properties of SAEs to enable alignment for any topic. This method 1) scores each SAE neuron by its semantic similarity to an alignment text and uses them to 2) modify SAE-layer-level outputs by emphasizing topic-aligned neurons. We assess the alignment capabilities of this approach on diverse public topic datasets including Amazon reviews, Medicine, and Sycophancy, across the currently available open-source LLMs and SAE pairs (GPT2 and Gemma) with multiple SAEs configurations. Experiments aligning to medical prompts reveal several benefits over fine-tuning, including increased average language acceptability (0.25 vs. 0.5), reduced training time across multiple alignment topics (333.6s vs. 62s), and acceptable inference time for many applications (+0.00092s/token). Our open-source code is available at github.com/IBM/sae-steering.

nan


Article 468

Title@2025-06-14 (6): TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

Title: TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression TL;DR: Zu lange, re-Gewichtung für effiziente LLM-Reasoning-Kompression TL;DR:太长,为高效 LLM 合理压缩而重新加权 2506.02678v3

Authors (14): Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, Cheng-Lin Liu

Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning–especially during inference with extremely long outputs–has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model’s System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model’s reasoning capability. We validate our approach across models on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels. Our method significantly reduces the number of output tokens by nearly 40% while maintaining the accuracy of the reasoning. Our code and data will be available soon.

nan


Article 469

Title@2025-06-14 (6): Overview of the NLPCC 2025 Shared Task: Gender Bias Mitigation Challenge

Title: Overview of the NLPCC 2025 Shared Task: Gender Bias Mitigation Challenge Überblick über die gemeinsame Aufgabe NLPCC 2025: Gender Bias Mitigation Challenge 2025年全国妇女、妇女和儿童委员会2025年共同任务概览:减少性别偏见的挑战 2506.12574v1

Authors (5): Yizhi Li, Ge Zhang, Hanhua Hong, Yiwen Wang, Chenghua Lin

As natural language processing for gender bias becomes a significant interdisciplinary topic, the prevalent data-driven techniques, such as pre-trained language models, suffer from biased corpus. This case becomes more obvious regarding those languages with less fairness-related computational linguistic resources, such as Chinese. To this end, we propose a Chinese cOrpus foR Gender bIas Probing and Mitigation (CORGI-PM), which contains 32.9k sentences with high-quality labels derived by following an annotation scheme specifically developed for gender bias in the Chinese context. It is worth noting that CORGI-PM contains 5.2k gender-biased sentences along with the corresponding bias-eliminated versions rewritten by human annotators. We pose three challenges as a shared task to automate the mitigation of textual gender bias, which requires the models to detect, classify, and mitigate textual gender bias. In the literature, we present the results and analysis for the teams participating this shared task in NLPCC 2025.

nan


Article 470

Title@2025-06-14 (6): DoTA-RAG: Dynamic of Thought Aggregation RAG

Title: DoTA-RAG: Dynamic of Thought Aggregation RAG DoTA-RAG: Dynamik der Gedankenaggregation RAG DoTA-RAG:思想聚合动态RAG 2506.12571v1

Authors (5): Saksorn Ruangtanusak, Natthapath Rungseesiripak, Peerawat Rojratchadakorn, Monthol Charattrakool, Natapong Nitarach

In this paper, we introduce DoTA-RAG (Dynamic-of-Thought Aggregation RAG), a retrieval-augmented generation system optimized for high-throughput, large-scale web knowledge indexes. Traditional RAG pipelines often suffer from high latency and limited accuracy over massive, diverse datasets. DoTA-RAG addresses these challenges with a three-stage pipeline: query rewriting, dynamic routing to specialized sub-indexes, and multi-stage retrieval and ranking. We further enhance retrieval by evaluating and selecting a superior embedding model, re-embedding the large FineWeb-10BT corpus. Moreover, we create a diverse Q&A dataset of 500 questions generated via the DataMorgana setup across a broad range of WebOrganizer topics and formats. DoTA-RAG improves the answer correctness score from 0.752 (baseline, using LiveRAG pre-built vector store) to 1.478 while maintaining low latency, and it achieves a 0.929 correctness score on the Live Challenge Day. These results highlight DoTA-RAG’s potential for practical deployment in domains requiring fast, reliable access to large and evolving knowledge sources.

nan


Article 471

Title@2025-06-14 (6): StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling

Title: StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling StreamMel: Echtzeit Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modellierung 流流:通过间断连续自动递减建模实现实时零光文本对语音 2506.12570v1

Authors (10): Hui Wang, Yifan Yang, Shujie Liu, Jinyu Li, Lingwei Meng, Yanqing Liu, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin

Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generation for unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations, leading to increased computational cost and suboptimal system performance. In this work, we propose StreamMel, a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. By interleaving text tokens with acoustic frames, StreamMel enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. Experiments on LibriSpeech demonstrate that StreamMel outperforms existing streaming TTS baselines in both quality and latency. It even achieves performance comparable to offline systems while supporting efficient real-time generation, showcasing broad prospects for integration with real-time speech large language models. Audio samples are available at: https://aka.ms/StreamMel.

nan


Article 472

Title@2025-06-14 (6): SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition

Title: SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition SMILE: Sprachmeta In-Context-Lernen für die automatische Spracherkennung mit geringer Ressource SMILE: 用于低资源语言自动语音识别的 2409.10429v2

Authors (2): Ming-Hao Hsu, Hung-yi Lee

Automatic Speech Recognition (ASR) models demonstrate outstanding performance on high-resource languages but face significant challenges when applied to low-resource languages due to limited training data and insufficient cross-lingual generalization. Existing adaptation strategies, such as shallow fusion, data augmentation, and direct fine-tuning, either rely on external resources, suffer computational inefficiencies, or fail in test-time adaptation scenarios. To address these limitations, we introduce Speech Meta In-Context LEarning (SMILE), an innovative framework that combines meta-learning with speech in-context learning (SICL). SMILE leverages meta-training from high-resource languages to enable robust, few-shot generalization to low-resource languages without explicit fine-tuning on the target domain. Extensive experiments on the ML-SUPERB benchmark show that SMILE consistently outperforms baseline methods, significantly reducing character and word error rates in training-free few-shot multilingual ASR tasks.

nan


Article 473

Title@2025-06-14 (6): Scholar Inbox: Personalized Paper Recommendations for Scientists

Title: Scholar Inbox: Personalized Paper Recommendations for Scientists Scholar Inbox: Personalisierte Papierempfehlungen für Wissenschaftler 学者箱:给科学家的个人化论文建议 2504.08385v2

Authors (13): Markus Flicke, Glenn Angrabeit, Madhav Iyengar, Vitalii Protsenko, Illia Shakun, Jovan Cicvaric, Bora Kargi, Haoyu He, Lukas Schuler, Lewin Scholz, Kavyanjali Agnihotri, Yong Cao, Andreas Geiger

Scholar Inbox is a new open-access platform designed to address the challenges researchers face in staying current with the rapidly expanding volume of scientific literature. We provide personalized recommendations, continuous updates from open-access archives (arXiv, bioRxiv, etc.), visual paper summaries, semantic search, and a range of tools to streamline research workflows and promote open research access. The platform’s personalized recommendation system is trained on user ratings, ensuring that recommendations are tailored to individual researchers’ interests. To further enhance the user experience, Scholar Inbox also offers a map of science that provides an overview of research across domains, enabling users to easily explore specific topics. We use this map to address the cold start problem common in recommender systems, as well as an active learning strategy that iteratively prompts users to rate a selection of papers, allowing the system to learn user preferences quickly. We evaluate the quality of our recommendation system on a novel dataset of 800k user ratings, which we make publicly available, as well as via an extensive user study. https://www.scholar-inbox.com/

nan


Article 474

Title@2025-06-14 (6): PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference

Title: PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference PKU-SafeRLHF: Auf dem Weg zu mehrstufiger Sicherheitsausrichtung für LLMs mit menschlicher Vorliebe PKU-SafeRLLHF:为具有人类特爱的LLMs实现多级安全协调 2406.15513v3

Authors (13): Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Juntao Dai, Boren Zheng, Tianyi Qiu, Jiayi Zhou, Kaile Wang, Boxuan Li, Sirui Han, Yike Guo, Yaodong Yang

In this study, we introduce the safety human preference dataset, PKU-SafeRLHF, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on this, we collected 166.8k preference data, including dual-preference (helpfulness and harmlessness decoupled) and single-preference data (trade-off the helpfulness and harmlessness from scratch), respectively. Using the large-scale annotation data, we further train severity-sensitive moderation for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs. Data is available at https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF.

nan


Article 475

Title@2025-06-14 (6): Profiling News Media for Factuality and Bias Using LLMs and the Fact-Checking Methodology of Human Experts

Title: Profiling News Media for Factuality and Bias Using LLMs and the Fact-Checking Methodology of Human Experts Profiling News Medien für Factuality und Bias mit LLMs und der Fact-Checking-Methode menschlicher Experten 利用LLMMs和 “ 人权专家实况调查方法 “ 将新闻媒体描述为 “ 事实和偏见 “ 和 “ 人权专家实况调查方法 “ 2506.12552v1

Authors (4): Zain Muhammad Mujahid, Dilshod Azizov, Maha Tufail Agro, Preslav Nakov

In an age characterized by the proliferation of mis- and disinformation online, it is critical to empower readers to understand the content they are reading. Important efforts in this direction rely on manual or automatic fact-checking, which can be challenging for emerging claims with limited information. Such scenarios can be handled by assessing the reliability and the political bias of the source of the claim, i.e., characterizing entire news outlets rather than individual claims or articles. This is an important but understudied research direction. While prior work has looked into linguistic and social contexts, we do not analyze individual articles or information in social media. Instead, we propose a novel methodology that emulates the criteria that professional fact-checkers use to assess the factuality and political bias of an entire outlet. Specifically, we design a variety of prompts based on these criteria and elicit responses from large language models (LLMs), which we aggregate to make predictions. In addition to demonstrating sizable improvements over strong baselines via extensive experiments with multiple LLMs, we provide an in-depth error analysis of the effect of media popularity and region on model performance. Further, we conduct an ablation study to highlight the key components of our dataset that contribute to these improvements. To facilitate future research, we released our dataset and code at https://github.com/mbzuai-nlp/llm-media-profiling.

nan


Article 476

Title@2025-06-14 (6): Activation-Informed Merging of Large Language Models

Title: Activation-Informed Merging of Large Language Models Aktivierungs-informiertes Zusammenführen von großen Sprachmodellen 大语言模式的合并 2502.02421v2

Authors (6): Amin Heyrani Nobari, Kaveh Alimohammadi, Ali ArjomandBigdeli, Akash Srivastava, Faez Ahmed, Navid Azizan

Model merging, a method that combines the parameters and embeddings of multiple fine-tuned large language models (LLMs), offers a promising approach to enhance model performance across various tasks while maintaining computational efficiency. This paper introduces Activation-Informed Merging (AIM), a technique that integrates the information from the activation space of LLMs into the merging process to improve performance and robustness. AIM is designed as a flexible, complementary solution that is applicable to any existing merging method. It aims to preserve critical weights from the base model, drawing on principles from continual learning (CL) and model compression. Utilizing a task-agnostic calibration set, AIM selectively prioritizes essential weights during merging. We empirically demonstrate that AIM significantly enhances the performance of merged models across multiple benchmarks. Our findings suggest that considering the activation-space information can provide substantial advancements in the model merging strategies for LLMs, with up to a 40% increase in benchmark performance.

nan


Article 477

Title@2025-06-14 (6): RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking

Title: RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking RealFactBench: Ein Benchmark für die Bewertung großer Sprachmodelle in Real-World Fact-Checking RealFactFactBonch:在现实世界实况调查中评价大语言模式的基准 2506.12538v1

Authors (9): Shuo Yang, Yuqin Dai, Guoqing Wang, Xinran Zheng, Jinfeng Xu, Jinze Li, Zhenzhe Ying, Weiqiang Wang, Edith C. H. Ngai

Large Language Models (LLMs) hold significant potential for advancing fact-checking by leveraging their capabilities in reasoning, evidence retrieval, and explanation generation. However, existing benchmarks fail to comprehensively evaluate LLMs and Multimodal Large Language Models (MLLMs) in realistic misinformation scenarios. To bridge this gap, we introduce RealFactBench, a comprehensive benchmark designed to assess the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, including Knowledge Validation, Rumor Detection, and Event Verification. RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains. Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models’ ability to handle uncertainty and balance between over-conservatism and over-confidence. Extensive experiments on 7 representative LLMs and 4 MLLMs reveal their limitations in real-world fact-checking and offer valuable insights for further research. RealFactBench is publicly available at https://github.com/kalendsyang/RealFactBench.git.

nan


Article 478

Title@2025-06-14 (6): Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

Title: Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction Sprachmodelle mit entkoppelten Tokenizern und Multi-Token-Vorhersage 配有拆分调制调制器和多功能预测的语音-语言语言模型 2506.12537v1

Authors (24): Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the impact of key components (i.e., speech tokenizers, speech heads, and speaker modeling) on the performance of LLM-centric SLMs. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12$\times$ faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.

nan


Article 479

Title@2025-06-14 (6): Detection, Classification, and Mitigation of Gender Bias in Large Language Models

Title: Detection, Classification, and Mitigation of Gender Bias in Large Language Models Erkennung, Klassifizierung und Minderung von Gender-Bias in großen Sprachmodellen 大语言模式中性别偏见的探测、分类和减轻 2506.12527v1

Authors (5): Xiaoqing Cheng, Hongying Zan, Lulu Kong, Jinwang Song, Min Peng

With the rapid development of large language models (LLMs), they have significantly improved efficiency across a wide range of domains. However, recent studies have revealed that LLMs often exhibit gender bias, leading to serious social implications. Detecting, classifying, and mitigating gender bias in LLMs has therefore become a critical research focus. In the NLPCC 2025 Shared Task 7: Chinese Corpus for Gender Bias Detection, Classification and Mitigation Challenge, we investigate how to enhance the capabilities of LLMs in gender bias detection, classification, and mitigation. We adopt reinforcement learning, chain-of-thoughts (CoT) reasoning, and supervised fine-tuning to handle different Subtasks. Specifically, for Subtasks 1 and 2, we leverage the internal reasoning capabilities of LLMs to guide multi-step thinking in a staged manner, which simplifies complex biased queries and improves response accuracy. For Subtask 3, we employ a reinforcement learning-based approach, annotating a preference dataset using GPT-4. We then apply Direct Preference Optimization (DPO) to mitigate gender bias by introducing a loss function that explicitly favors less biased completions over biased ones. Our approach ranked first across all three subtasks of the NLPCC 2025 Shared Task 7.

nan


Article 480

Title@2025-06-14 (6): LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL

Title: LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL LinkAlign: Skalierbare Schema-Verknüpfung für Real-World großformatige Multi-Datenbank Text-zu-SQL 链接对称: 真实世界大型多数据基文本到 SQL 的可缩放气相表链接 2503.18596v3

Authors (2): Yihan Wang, Peiyu Liu

Schema linking is a critical bottleneck in applying existing Text-to-SQL models to real-world, large-scale, multi-database environments. Through error analysis, we identify two major challenges in schema linking: (1) Database Retrieval: accurately selecting the target database from a large schema pool, while effectively filtering out irrelevant ones; and (2) Schema Item Grounding: precisely identifying the relevant tables and columns within complex and often redundant schemas for SQL generation. Based on these, we introduce LinkAlign, a novel framework tailored for large-scale databases with thousands of fields. LinkAlign comprises three key steps: multi-round semantic enhanced retrieval and irrelevant information isolation for Challenge 1, and schema extraction enhancement for Challenge 2. Each stage supports both Agent and Pipeline execution modes, enabling balancing efficiency and performance via modular design. To enable more realistic evaluation, we construct AmbiDB, a synthetic dataset designed to reflect the ambiguity of real-world schema linking. Experiments on widely-used Text-to-SQL benchmarks demonstrate that LinkAlign consistently outperforms existing baselines on all schema linking metrics. Notably, it improves the overall Text-to-SQL pipeline and achieves a new state-of-the-art score of 33.09% on the Spider 2.0-Lite benchmark using only open-source LLMs, ranking first on the leaderboard at the time of submission. The codes are available at https://github.com/Satissss/LinkAlign

nan


Article 481

Title@2025-06-14 (6): How Does A Text Preprocessing Pipeline Affect Ontology Syntactic Matching?

Title: How Does A Text Preprocessing Pipeline Affect Ontology Syntactic Matching? Wie wirkt sich eine Textvorverarbeitung auf die Ontologie aus? 文本预处理管道如何影响本体学同步匹配? 2411.03962v8

Authors (3): Zhangcheng Qiang, Kerry Taylor, Weiqing Wang

The classical text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many systems for syntactic ontology matching (OM). However, the lack of standardisation in text preprocessing creates diversity in mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on syntactic OM in 8 Ontology Alignment Evaluation Initiative (OAEI) tracks with 49 distinct alignments. We find that Phase 1 text preprocessing (Tokenisation and Normalisation) is more effective than Phase 2 text preprocessing (Stop Words Removal and Stemming/Lemmatisation). We propose two novel approaches to repair unwanted false mappings caused by Phase 2 text preprocessing. One is an ad hoc logic-based repair approach that employs an ontology-specific check to find common words that cause false mappings. These words are stored in a reserved word set and applied before the text preprocessing. By leveraging the power of large language models (LLMs), we also propose a post hoc LLM-based repair approach. This approach utilises the strong background knowledge provided by LLMs to repair non-existent and counter-intuitive false mappings after the text preprocessing. It also overcomes the tendency towards unstable true mappings by injecting the classical text preprocessing pipeline via function calling. The experimental results show that these two approaches can improve the matching correctness and the overall matching performance.

nan


Article 482

Title@2025-06-14 (6): Less is More: Improving LLM Alignment via Preference Data Selection

Title: Less is More: Improving LLM Alignment via Preference Data Selection Weniger ist mehr: Verbesserung der LLM-Ausrichtung über Präferenzdatenauswahl 较少是更多:通过优先数据选择改进LLM对齐 2502.14560v3

Authors (6): Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He

Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, by using just 10\% of the Ultrafeedback dataset, our approach achieves 3\% to 8\% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach seamlessly extends to iterative DPO, yielding a roughly 3\% improvement with 25\% online data, revealing the high redundancy in this presumed high-quality data construction manner. These results highlight the potential of data selection strategies for advancing preference optimization.

nan


Article 483

Title@2025-06-14 (6): Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation

Title: Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation Überbrückungsrelevanz und Begründung: Rationale Destillation in retrieval-augmented Generation 架桥关联性和合理性:再回收-提款一代中的理由蒸馏 2412.08519v2

Authors (12): Pengyue Jia, Derong Xu, Xiaopeng Li, Zhaocheng Du, Xiangyang Li, Yichao Wang, Yuhao Wang, Qidong Liu, Maolin Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao

The reranker and generator are two critical components in the Retrieval-Augmented Generation (i.e., RAG) pipeline, responsible for ranking relevant documents and generating responses. However, due to differences in pre-training data and objectives, there is an inevitable gap between the documents ranked as relevant by the reranker and those required by the generator to support answering the query. To address this gap, we propose RADIO, a novel and practical preference alignment framework with RAtionale DIstillatiOn. Specifically, we first propose a rationale extraction method that leverages the reasoning capabilities of Large Language Models (LLMs) to extract the rationales necessary for answering the query. Subsequently, a rationale-based alignment process is designed to rerank the documents based on the extracted rationales, and fine-tune the reranker to align the preferences. We conduct extensive experiments on two tasks across three datasets to demonstrate the effectiveness of our approach compared to baseline methods. Our code is released online to ease reproduction.

nan


Article 484

Title@2025-06-14 (6): Towards Fairness Assessment of Dutch Hate Speech Detection

Title: Towards Fairness Assessment of Dutch Hate Speech Detection Zur Fairnessbewertung der niederländischen Hass-Spracherkennung 争取对荷兰仇恨言论检测进行公平评估 2506.12502v1

Authors (4): Julie Bauer, Rishabh Kaushal, Thales Bertaglia, Adriana Iamnitchi

Numerous studies have proposed computational methods to detect hate speech online, yet most focus on the English language and emphasize model development. In this study, we evaluate the counterfactual fairness of hate speech detection models in the Dutch language, specifically examining the performance and fairness of transformer-based models. We make the following key contributions. First, we curate a list of Dutch Social Group Terms that reflect social context. Second, we generate counterfactual data for Dutch hate speech using LLMs and established strategies like Manual Group Substitution (MGS) and Sentence Log-Likelihood (SLL). Through qualitative evaluation, we highlight the challenges of generating realistic counterfactuals, particularly with Dutch grammar and contextual coherence. Third, we fine-tune baseline transformer-based models with counterfactual data and evaluate their performance in detecting hate speech. Fourth, we assess the fairness of these models using Counterfactual Token Fairness (CTF) and group fairness metrics, including equality of odds and demographic parity. Our analysis shows that models perform better in terms of hate speech detection, average counterfactual fairness and group fairness. This work addresses a significant gap in the literature on counterfactual fairness for hate speech detection in Dutch and provides practical insights and recommendations for improving both model performance and fairness.

nan


Article 485

Title@2025-06-14 (6): Improving Factuality for Dialogue Response Generation via Graph-Based Knowledge Augmentation

Title: Improving Factuality for Dialogue Response Generation via Graph-Based Knowledge Augmentation Verbesserung der Factuality für Dialog-Response-Generierung durch graphgestützte Wissenserweiterung 通过基于图表的知识增加改进对话回应生成的实况 2506.12496v1

Authors (3): Xiangyan Chen, Yujian Gan, Matthew Purver

Large Language Models (LLMs) succeed in many natural language processing tasks. However, their tendency to hallucinate - generate plausible but inconsistent or factually incorrect text - can cause problems in certain tasks, including response generation in dialogue. To mitigate this issue, knowledge-augmented methods have shown promise in reducing hallucinations. Here, we introduce a novel framework designed to enhance the factuality of dialogue response generation, as well as an approach to evaluate dialogue factual accuracy. Our framework combines a knowledge triple retriever, a dialogue rewrite, and knowledge-enhanced response generation to produce more accurate and grounded dialogue responses. To further evaluate generated responses, we propose a revised fact score that addresses the limitations of existing fact-score methods in dialogue settings, providing a more reliable assessment of factual consistency. We evaluate our methods using different baselines on the OpendialKG and HybriDialogue datasets. Our methods significantly improve factuality compared to other graph knowledge-augmentation baselines, including the state-of-the-art G-retriever. The code will be released on GitHub.

nan


Article 486

Title@2025-06-14 (6): FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation

Title: FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation FlexRAG: Ein flexibler und umfassender Rahmen für die Retrieval-Augmented Generation FlexRAG: 灵活和综合的回回回一代人框架 2506.12494v1

Authors (3): Zhuocheng Zhang, Yang Feng, Min Zhang

Retrieval-Augmented Generation (RAG) plays a pivotal role in modern large language model applications, with numerous existing frameworks offering a wide range of functionalities to facilitate the development of RAG systems. However, we have identified several persistent challenges in these frameworks, including difficulties in algorithm reproduction and sharing, lack of new techniques, and high system overhead. To address these limitations, we introduce \textbf{FlexRAG}, an open-source framework specifically designed for research and prototyping. FlexRAG supports text-based, multimodal, and network-based RAG, providing comprehensive lifecycle support alongside efficient asynchronous processing and persistent caching capabilities. By offering a robust and flexible solution, FlexRAG enables researchers to rapidly develop, deploy, and share advanced RAG systems. Our toolkit and resources are available at \href{https://github.com/ictnlp/FlexRAG}{https://github.com/ictnlp/FlexRAG}.

nan


Article 487

Title@2025-06-14 (6): Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization

Title: Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization Robustes LLM-Unlearning mit MUDMAN: Meta-Unlearning mit Disruptionsmasken und Normalisierung 与 MUDMAN 一起重新学习: 以干扰蒙蔽和正常化的方式重新学习 2506.12484v1

Authors (4): Filip Sondej, Yushi Yang, Mikołaj Kniejski, Marcel Windys

Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights, where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40\%, setting a new state-of-the-art for robust unlearning.

nan


Article 488

Title@2025-06-14 (6): MALM: A Multi-Information Adapter for Large Language Models to Mitigate Hallucination

Title: MALM: A Multi-Information Adapter for Large Language Models to Mitigate Hallucination MALM: Ein Multi-Informationsadapter für große Sprachmodelle zur Mititation von Halluzinationen MARM:一个用于模拟幻觉大语言模型的多信息适应器 2506.12483v1

Authors (6): Ao Jia, Haiming Wu, Guohui Yao, Dawei Song, Songkun Ji, Yazhou Zhang

Large language models (LLMs) are prone to three types of hallucination: Input-Conflicting, Context-Conflicting and Fact-Conflicting hallucinations. The purpose of this study is to mitigate the different types of hallucination by exploiting the interdependence between them. For this purpose, we propose a Multi-Information Adapter for Large Language Models (MALM). This framework employs a tailored multi-graph learning approach designed to elucidate the interconnections between original inputs, contextual information, and external factual knowledge, thereby alleviating the three categories of hallucination within a cohesive framework. Experiments were carried out on four benchmarking datasets: HaluEval, TruthfulQA, Natural Questions, and TriviaQA. We evaluated the proposed framework in two aspects: (1) adaptability to different base LLMs on HaluEval and TruthfulQA, to confirm if MALM is effective when applied on 7 typical LLMs. MALM showed significant improvements over LLaMA-2; (2) generalizability to retrieval-augmented generation (RAG) by combining MALM with three representative retrievers (BM25, Spider and DPR) separately. Furthermore, automated and human evaluations were conducted to substantiate the correctness of experimental results, where GPT-4 and 3 human volunteers judged which response was better between LLaMA-2 and MALM. The results showed that both GPT-4 and human preferred MALM in 79.4% and 65.6% of cases respectively. The results validate that incorporating the complex interactions between the three types of hallucination through a multilayered graph attention network into the LLM generation process is effective to mitigate the them. The adapter design of the proposed approach is also proven flexible and robust across different base LLMs.

nan


Article 489

Title@2025-06-14 (6): MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems

Title: MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems MTLM: Aufnahme bidirektionaler Textinformationen zur Verbesserung der Sprachmodellausbildung in Spracherkennungssystemen MTLM:纳入双向文本信息,以加强语音识别系统中的语言示范培训 2502.10058v2

Authors (5): Qingliang Meng, Pengju Ren, Tian Li, Changsong Dai, Huizhi Liang

Automatic speech recognition (ASR) systems normally consist of an acoustic model (AM) and a language model (LM). The acoustic model estimates the probability distribution of text given the input speech, while the language model calibrates this distribution toward a specific knowledge domain to produce the final transcription. Traditional ASR-specific LMs are typically trained in a unidirectional (left-to-right) manner to align with autoregressive decoding. However, this restricts the model from leveraging the right-side context during training, limiting its representational capacity. In this work, we propose MTLM, a novel training paradigm that unifies unidirectional and bidirectional manners through 3 training objectives: ULM, BMLM, and UMLM. This approach enhances the LM’s ability to capture richer linguistic patterns from both left and right contexts while preserving compatibility with standard ASR autoregressive decoding methods. As a result, the MTLM model not only enhances the ASR system’s performance but also support multiple decoding strategies, including shallow fusion, unidirectional/bidirectional n-best rescoring. Experiments on the LibriSpeech dataset show that MTLM consistently outperforms unidirectional training across multiple decoding strategies, highlighting its effectiveness and flexibility in ASR applications.

nan


Article 490

Title@2025-06-14 (6): AI Flow: Perspectives, Scenarios, and Approaches

Title: AI Flow: Perspectives, Scenarios, and Approaches AI Flow: Perspektiven, Szenarien und Ansätze AI 流动:观点、设想和方法 2506.12479v1

Authors (12): Hongjun An, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li

Pioneered by the foundational information theory by Claude Shannon and the visionary framework of machine intelligence by Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, device-edge-cloud framework serves as the foundation, which integrates end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.

nan


Article 491

Title@2025-06-14 (6): TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks

Title: TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks TagRouter: Lernroute zu LLMs durch Tags für Open-Domain Text Generierung Aufgaben TagRouter: 通过用于 Open-Domain 文本生成任务的标记学习 LLM 的学习路径 2506.12473v1

Authors (5): Zhou Chen, Zhiqiang Wei, Yuqi Bai, Xue Xiong, Jianmin Wu

Model routing allocates queries to the suitable model, improving system performance while reducing costs. However, existing routing methods face practical limitations that hinder scalability in large-scale applications and struggle to keep up with the rapid growth of the large language model (LLM) ecosystem. To tackle these challenges, we propose TagRouter, a training-free model routing method designed to optimize the synergy among multiple LLMs for open-domain text generation tasks. Experimental results demonstrate that TagRouter outperforms 13 baseline methods, increasing the accept rate of system by 6.15% and reducing costs by 17.20%, achieving optimal cost-efficiency. Our findings provides the LLM community with an efficient and scalable solution for model ensembling, offering users an evolvable “super model.”

nan


Article 492

Title@2025-06-14 (6): A Pluggable Multi-Task Learning Framework for Sentiment-Aware Financial Relation Extraction

Title: A Pluggable Multi-Task Learning Framework for Sentiment-Aware Financial Relation Extraction Ein steckbarer Multi-Task-Lernrahmen für sentiment-aware Finanzrelation Extraction 一个可插插多任务学习框架,用于情感-恶意金融关系采掘 2506.12452v1

Authors (2): Jinming Luo, Hailin Wang

Relation Extraction (RE) aims to extract semantic relationships in texts from given entity pairs, and has achieved significant improvements. However, in different domains, the RE task can be influenced by various factors. For example, in the financial domain, sentiment can affect RE results, yet this factor has been overlooked by modern RE models. To address this gap, this paper proposes a Sentiment-aware-SDP-Enhanced-Module (SSDP-SEM), a multi-task learning approach for enhancing financial RE. Specifically, SSDP-SEM integrates the RE models with a pluggable auxiliary sentiment perception (ASP) task, enabling the RE models to concurrently navigate their attention weights with the text’s sentiment. We first generate detailed sentiment tokens through a sentiment model and insert these tokens into an instance. Then, the ASP task focuses on capturing nuanced sentiment information through predicting the sentiment token positions, combining both sentiment insights and the Shortest Dependency Path (SDP) of syntactic information. Moreover, this work employs a sentiment attention information bottleneck regularization method to regulate the reasoning process. Our experiment integrates this auxiliary task with several prevalent frameworks, and the results demonstrate that most previous models benefit from the auxiliary task, thereby achieving better results. These findings highlight the importance of effectively leveraging sentiment in the financial RE task.

nan


Article 493

Title@2025-06-14 (6): Language Surgery in Multilingual Large Language Models

Title: Language Surgery in Multilingual Large Language Models Sprachchirurgie in mehrsprachigen großen Sprachmodellen 多语言大语言模式中的语言外科手术 2506.12450v1

Authors (9): Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, Samuel Cahyawijaya

Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existence of this alignment, analyze its behavior in comparison to explicitly designed alignment models, and demonstrate its potential for language-specific manipulation without semantic degradation. Building on these findings, we propose Inference-Time Language Control (ITLC), a novel method that leverages latent injection to enable precise cross-lingual language control and mitigate language confusion in LLMs. Our experiments highlight ITLC’s strong cross-lingual control capabilities while preserving semantic integrity in target languages. Furthermore, we demonstrate its effectiveness in alleviating the cross-lingual language confusion problem, which persists even in current large-scale LLMs, leading to inconsistent language generation. This work advances our understanding of representation alignment in LLMs and introduces a practical solution for enhancing their cross-lingual performance.

nan


Article 494

Title@2025-06-14 (6): ViQA-COVID: COVID-19 Machine Reading Comprehension Dataset for Vietnamese

Title: ViQA-COVID: COVID-19 Machine Reading Comprehension Dataset for Vietnamese ViQA-COVID: COVID-19 Maschinenlesedatensatz für Vietnamesen ViQA-COVID:越南的COVID-19机器阅读综合数据集 2504.21017v2

Authors (5): Hai-Chung Nguyen-Phung, Ngoc C. Lê, Van-Chien Nguyen, Hang Thi Nguyen, Thuy Phuong Thi Nguyen

After two years of appearance, COVID-19 has negatively affected people and normal life around the world. As in May 2022, there are more than 522 million cases and six million deaths worldwide (including nearly ten million cases and over forty-three thousand deaths in Vietnam). Economy and society are both severely affected. The variant of COVID-19, Omicron, has broken disease prevention measures of countries and rapidly increased number of infections. Resources overloading in treatment and epidemics prevention is happening all over the world. It can be seen that, application of artificial intelligence (AI) to support people at this time is extremely necessary. There have been many studies applying AI to prevent COVID-19 which are extremely useful, and studies on machine reading comprehension (MRC) are also in it. Realizing that, we created the first MRC dataset about COVID-19 for Vietnamese: ViQA-COVID and can be used to build models and systems, contributing to disease prevention. Besides, ViQA-COVID is also the first multi-span extraction MRC dataset for Vietnamese, we hope that it can contribute to promoting MRC studies in Vietnamese and multilingual.

nan


Article 495

Title@2025-06-14 (6): From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment

Title: From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment Von Ergebnissen zu Prozessen: Leitende PRM-Lernen von ORM für die Schlussfolgerungs-Zeit-Ausrichtung 从结果到过程:指导程序程序管理从ORM学习,以推断-时间协调 2506.12446v1

Authors (5): Bin Xie, Bingbing Xu, Yige Yuan, Shengmao Zhu, Huawei Shen

Inference-time alignment methods have gained significant attention for their efficiency and effectiveness in aligning large language models (LLMs) with human preferences. However, existing dominant approaches using reward-guided search (RGS) primarily rely on outcome reward models (ORMs), which suffer from a critical granularity mismatch: ORMs are designed to provide outcome rewards for complete responses, while RGS methods rely on process rewards to guide the policy, leading to inconsistent scoring and suboptimal alignment. To address this challenge, we introduce process reward models (PRMs) into RGS and argue that an ideal PRM should satisfy two objectives: Score Consistency, ensuring coherent evaluation across partial and complete responses, and Preference Consistency, aligning partial sequence assessments with human preferences. Based on these, we propose SP-PRM, a novel dual-consistency framework integrating score consistency-based and preference consistency-based partial evaluation modules without relying on human annotation. Extensive experiments on dialogue, summarization, and reasoning tasks demonstrate that SP-PRM substantially enhances existing RGS methods, achieving a 3.6%-10.3% improvement in GPT-4 evaluation scores across all tasks.

nan


Article 496

Title@2025-06-14 (6): Nested Named-Entity Recognition on Vietnamese COVID-19: Dataset and Experiments

Title: Nested Named-Entity Recognition on Vietnamese COVID-19: Dataset and Experiments Nested Named-Entity Recognition on Vietnamese COVID-19: Datensatz und Experimente 越南COVID-19(数据集和实验) 2504.21016v2

Authors (9): Ngoc C. Lê, Hai-Chung Nguyen-Phung, Thu-Huong Pham Thi, Hue Vu, Phuong-Thao Nguyen Thi, Thu-Thuy Tran, Hong-Nhung Le Thi, Thuy-Duong Nguyen-Thi, Thanh-Huy Nguyen

The COVID-19 pandemic caused great losses worldwide, efforts are taken place to prevent but many countries have failed. In Vietnam, the traceability, localization, and quarantine of people who contact with patients contribute to effective disease prevention. However, this is done by hand, and take a lot of work. In this research, we describe a named-entity recognition (NER) study that assists in the prevention of COVID-19 pandemic in Vietnam. We also present our manually annotated COVID-19 dataset with nested named entity recognition task for Vietnamese which be defined new entity types using for our system.

nan


Article 497

Title@2025-06-14 (6): Exploring Cultural Variations in Moral Judgments with Large Language Models

Title: Exploring Cultural Variations in Moral Judgments with Large Language Models Kulturelle Variationen in Moralurteilen mit großen Sprachmodellen erforschen 探索具有大语言模式的道德判决的文化差异 2506.12433v1

Authors (4): Hadi Mohammadi, Efthymia Papadopoulou, Yasmeen F. S. S. Meijer, Ayoub Bagheri

Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs can mirror variations in moral attitudes reported by two major cross-cultural surveys: the World Values Survey and the PEW Research Center’s Global Attitudes Survey. We compare smaller, monolingual, and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with more recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based moral justifiability scores, we correlate each model’s outputs with survey data covering a broad set of ethical topics. Our results show that many earlier or smaller models often produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models (including GPT-4o and GPT-4o-mini) achieve substantially higher positive correlations, suggesting they better reflect real-world moral attitudes. While scaling up model size and using instruction tuning can improve alignment with cross-cultural moral norms, challenges remain for certain topics and regions. We discuss these findings in relation to bias analysis, training data diversity, and strategies for improving the cultural sensitivity of LLMs.

nan


Article 498

Title@2025-06-14 (6): Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design

Title: Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design Auf dem Weg zu vernünftigen Papageien: Warum große Sprachmodelle mit uns argumentieren sollten 通向合理的鹦鹉:为什么大语言模型应该设计来与我们争论? 2505.05298v2

Authors (13): Elena Musi, Nadin Kokciyan, Khalid Al-Khatib, Davide Ceolin, Emmanuelle Dietz, Klara Gutekunst, Annette Hautli-Janisz, Cristian Manuel Santibañez Yañez, Jodi Schneider, Jonas Scholz, Cor Steging, Jacky Visser, Henning Wachsmuth

In this position paper, we advocate for the development of conversational technology that is inherently designed to support and facilitate argumentative processes. We argue that, at present, large language models (LLMs) are inadequate for this purpose, and we propose an ideal technology design aimed at enhancing argumentative skills. This involves re-framing LLMs as tools to exercise our critical thinking skills rather than replacing them. We introduce the concept of \textit{reasonable parrots} that embody the fundamental principles of relevance, responsibility, and freedom, and that interact through argumentative dialogical moves. These principles and moves arise out of millennia of work in argumentation theory and should serve as the starting point for LLM-based technology that incorporates basic principles of argumentation.

nan


Article 499

Title@2025-06-14 (6): CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis

Title: CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis CoT-basierter Synthesizer: Verbesserung der LLM-Performance durch Antwortsynthese 以Cot为基础的合成器:通过答复合成提高LLM绩效 2501.01668v2

Authors (6): Bohan Zhang, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang

Current inference scaling methods, such as Self-consistency and Best-of-N, have proven effective in improving the accuracy of LLMs on complex reasoning tasks. However, these methods rely heavily on the quality of candidate responses and are unable to produce correct answers when all candidates are incorrect. In this paper, we propose a novel inference scaling strategy, CoT-based Synthesizer, which leverages CoT reasoning to synthesize superior answers by analyzing complementary information from multiple candidate responses, even when all candidate responses are flawed. To enable a lightweight and cost-effective implementation, we introduce an automated data generation pipeline that creates diverse training data. This allows smaller LLMs trained on this data to improve the inference accuracy of larger models, including API-based LLMs. Experimental results across four benchmark datasets with seven policy models demonstrate that our method significantly enhances performance, with gains of 11.8% for Llama3-8B and 10.3% for GPT-4o on the MATH dataset. The corresponding training data and code are publicly available on https://github.com/RUCKBReasoning/CoT-based-Synthesizer.

nan


Article 500

Title@2025-06-14 (6): Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM

Title: Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM Planen Sie Ihre Reise und Reise mit Ihrem Plan: Wide-Horizon Planung und Bewertung über LLM 与你的计划一起规划你的旅行和旅行计划:通过LLM进行广泛的毛利人规划和评估 2506.12421v1

Authors (7): Dongjie Yang, Chengqiang Lu, Qimeng Wang, Xinbei Ma, Yan Gao, Yao Hu, Hai Zhao

Travel planning is a complex task requiring the integration of diverse real-world information and user preferences. While LLMs show promise, existing methods with long-horizon thinking struggle with handling multifaceted constraints and preferences in the context, leading to suboptimal itineraries. We formulate this as an $L^3$ planning problem, emphasizing long context, long instruction, and long output. To tackle this, we introduce Multiple Aspects of Planning (MAoP), enabling LLMs to conduct wide-horizon thinking to solve complex planning problems. Instead of direct planning, MAoP leverages the strategist to conduct pre-planning from various aspects and provide the planning blueprint for planning models, enabling strong inference-time scalability for better performance. In addition, current benchmarks overlook travel’s dynamic nature, where past events impact subsequent journeys, failing to reflect real-world feasibility. To address this, we propose Travel-Sim, an agent-based benchmark assessing plans via real-world travel simulation. This work advances LLM capabilities in complex planning and offers novel insights for evaluating sophisticated scenarios through agent-based simulation.

nan


Article 501

Title@2025-06-14 (6): Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters

Title: Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters Unüberwachte Klassifikation von englischen Wörtern anhand phonologischer Informationen: Entdeckung von germanischen und lateinischen Clustern 基于声频信息:发现日耳曼语和拉丁语群集 2504.11770v2

Authors (2): Takashi Morita, Timothy J. O’Donnell

Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure is exclusive to Germanic verbs. When seeing them as a cognitive model, however, such etymology-based generalizations face challenges in terms of learnability, since the historical origins of words are presumably inaccessible information for general language learners. In this study, we present computational evidence indicating that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed an unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature regarding the corresponding etymological classes. Moreover, our findings also uncovered previously unrecognized features of the quasi-etymological clusters, offering novel hypotheses for future experimental studies.

nan


Article 502

Title@2025-06-14 (6): Transformers without Normalization

Title: Transformers without Normalization Transformatoren ohne Normalisierung 无正常化的变换器 2503.10622v2

Authors (5): Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu

Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $DyT($x$) = \tanh(\alpha $x$)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.

nan


Article 503

Title@2025-06-14 (6): Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model

Title: Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model Gruppe dann Skala: Dynamische Mischung-von-Experten Mehrsprachiges Sprachmodell 群组然后缩放: 动态混合专家多语种语言模型 2506.12388v1

Authors (4): Chong Li, Yingzhuo Deng, Jiajun Zhang, Chengqing Zong

The curse of multilinguality phenomenon is a fundamental problem of multilingual Large Language Models (LLMs), where the competition between massive languages results in inferior performance. It mainly comes from limited capacity and negative transfer between dissimilar languages. To address this issue, we propose a method to dynamically group and scale up the parameters of multilingual LLM while boosting positive transfer among similar languages. Specifically, the model is first tuned on monolingual corpus to determine the parameter deviation in each layer and quantify the similarity between languages. Layers with more deviations are extended to mixture-of-experts layers to reduce competition between languages, where one expert module serves one group of similar languages. Experimental results on 18 to 128 languages show that our method reduces the negative transfer between languages and significantly boosts multilingual performance with fewer parameters. Such language group specialization on experts benefits the new language adaptation and reduces the inference on the previous multilingual knowledge learned.

nan


Article 504

Title@2025-06-14 (6): Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision

Title: Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision Ranking-Kette-of-Thought-Lernen: Ein energiebasierter Ansatz mit Outcome-Supervision 学习 “ 研究链链 “ :以能源为基础的方法与成果监督 2505.14999v2

Authors (12): Eric Hanchen Jiang, Haozheng Luo, Shengyuan Pang, Xiaomin Li, Zhenting Qi, Hengli Li, Cheng-Fu Yang, Zongyu Lin, Xinfeng Li, Hao Xu, Kai-Wei Chang, Ying Nian Wu

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs), often requiring robust multi step logical consistency. While Chain of Thought (CoT) prompting elicits reasoning steps, it doesn’t guarantee correctness, and improving reliability via extensive sampling is computationally costly. This paper introduces the Energy Outcome Reward Model (EORM), an effective, lightweight, post hoc verifier. EORM leverages Energy Based Models (EBMs) to simplify the training of reward models by learning to assign a scalar energy score to CoT solutions using only outcome labels, thereby avoiding detailed annotations. It achieves this by interpreting discriminator output logits as negative energies, effectively ranking candidates where lower energy is assigned to solutions leading to correct final outcomes implicitly favoring coherent reasoning. On mathematical benchmarks (GSM8k, MATH), EORM significantly improves final answer accuracy (e.g., with Llama 3 8B, achieving 90.7% on GSM8k and 63.7% on MATH). EORM effectively leverages a given pool of candidate solutions to match or exceed the performance of brute force sampling, thereby enhancing LLM reasoning outcome reliability through its streamlined post hoc verification process.

nan


Article 505

Title@2025-06-14 (6): Recent Advances and Future Directions in Literature-Based Discovery

Title: Recent Advances and Future Directions in Literature-Based Discovery Jüngste Fortschritte und zukünftige Wege in der literaturbasierten Entdeckung 最近在基于文学的发现中的进展和未来方向 2506.12385v1

Authors (3): Andrej Kastrin, Bojan Cestnik, Nada Lavrač

The explosive growth of scientific publications has created an urgent need for automated methods that facilitate knowledge synthesis and hypothesis generation. Literature-based discovery (LBD) addresses this challenge by uncovering previously unknown associations between disparate domains. This article surveys recent methodological advances in LBD, focusing on developments from 2000 to the present. We review progress in three key areas: knowledge graph construction, deep learning approaches, and the integration of pre-trained and large language models (LLMs). While LBD has made notable progress, several fundamental challenges remain unresolved, particularly concerning scalability, reliance on structured data, and the need for extensive manual curation. By examining ongoing advances and outlining promising future directions, this survey underscores the transformative role of LLMs in enhancing LBD and aims to support researchers and practitioners in harnessing these technologies to accelerate scientific innovation.

nan


Article 506

Title@2025-06-14 (6): Model Merging for Knowledge Editing

Title: Model Merging for Knowledge Editing Modellzusammenführung für die Wissensbearbeitung 知识编辑合并模型 2506.12384v1

Authors (9): Zichuan Fu, Xian Wu, Guojing Li, Yingying Zhang, Yefeng Zheng, Tianshi Ming, Yejing Wang, Wanyu Wang, Xiangyu Zhao

Large Language Models (LLMs) require continuous updates to maintain accurate and current knowledge as the world evolves. While existing knowledge editing approaches offer various solutions for knowledge updating, they often struggle with sequential editing scenarios and harm the general capabilities of the model, thereby significantly hampering their practical applicability. This paper proposes a two-stage framework combining robust supervised fine-tuning (R-SFT) with model merging for knowledge editing. Our method first fine-tunes the LLM to internalize new knowledge fully, then merges the fine-tuned model with the original foundation model to preserve newly acquired knowledge and general capabilities. Experimental results demonstrate that our approach significantly outperforms existing methods in sequential editing while better preserving the original performance of the model, all without requiring any architectural changes. Code is available at: https://github.com/Applied-Machine-Learning-Lab/MM4KE.

nan


Article 507

Title@2025-06-14 (6): Training-free LLM Merging for Multi-task Learning

Title: Training-free LLM Merging for Multi-task Learning Schulungsfreie LLM-Zusammenführung für Multi-Task-Lernen 多任务学习合并不培训的LLMLM 2506.12379v1

Authors (9): Zichuan Fu, Xian Wu, Yejing Wang, Wanyu Wang, Shanshan Ye, Hongzhi Yin, Yi Chang, Yefeng Zheng, Xiangyu Zhao

Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse natural language processing (NLP) tasks. The release of open-source LLMs like LLaMA and Qwen has triggered the development of numerous fine-tuned models tailored for various tasks and languages. In this paper, we explore an important question: is it possible to combine these specialized models to create a unified model with multi-task capabilities. We introduces Hierarchical Iterative Merging (Hi-Merging), a training-free method for unifying different specialized LLMs into a single model. Specifically, Hi-Merging employs model-wise and layer-wise pruning and scaling, guided by contribution analysis, to mitigate parameter conflicts. Extensive experiments on multiple-choice and question-answering tasks in both Chinese and English validate Hi-Merging’s ability for multi-task learning. The results demonstrate that Hi-Merging consistently outperforms existing merging techniques and surpasses the performance of models fine-tuned on combined datasets in most scenarios. Code is available at: https://github.com/Applied-Machine-Learning-Lab/Hi-Merging.

nan


Article 508

Title@2025-06-14 (6): A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization

Title: A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization Hybride Architektur mit effizienter Feinabstimmung für abstrakte Patentdokumentzusammenfassung 简易专利文件摘要的高效精度计价混合结构 2503.10354v4

Authors (2): Nevidu Jayatilleke, Ruvan Weerasinghe

Automatic patent summarization approaches that help in the patent analysis and comprehension procedure are in high demand due to the colossal growth of innovations. The development of natural language processing (NLP), text mining, and deep learning has notably amplified the efficacy of text summarization models for abundant types of documents. Summarizing patent text remains a pertinent challenge due to the labyrinthine writing style of these documents, which includes technical and legal intricacies. Additionally, these patent document contents are considerably lengthier than archetypal documents, which complicates the process of extracting pertinent information for summarization. Embodying extractive and abstractive text summarization methodologies into a hybrid framework, this study proposes a system for efficiently creating abstractive summaries of patent records. The procedure involves leveraging the LexRank graph-based algorithm to retrieve the important sentences from input parent texts, then utilizing a Bidirectional Auto-Regressive Transformer (BART) model that has been fine-tuned using Low-Ranking Adaptation (LoRA) for producing text summaries. This is accompanied by methodical testing and evaluation strategies. Furthermore, the author employed certain meta-learning techniques to achieve Domain Generalization (DG) of the abstractive component across multiple patent fields.

nan


Article 509

Title@2025-06-14 (6): Understanding the Effect of Knowledge Graph Extraction Error on Downstream Graph Analyses: A Case Study on Affiliation Graphs

Title: Understanding the Effect of Knowledge Graph Extraction Error on Downstream Graph Analyses: A Case Study on Affiliation Graphs Verständnis des Einflusses von Wissensgraphenauszugsfehlern auf Downstream Graph Analyses: Eine Fallstudie zu Verknüpfungsgraphen 了解知识图解错误对下游图分析的影响:关于亲子关系图的个案研究 2506.12367v1

Authors (2): Erica Cai, Brendan O’Connor

Knowledge graphs (KGs) are useful for analyzing social structures, community dynamics, institutional memberships, and other complex relationships across domains from sociology to public health. While recent advances in large language models (LLMs) have improved the scalability and accessibility of automated KG extraction from large text corpora, the impacts of extraction errors on downstream analyses are poorly understood, especially for applied scientists who depend on accurate KGs for real-world insights. To address this gap, we conducted the first evaluation of KG extraction performance at two levels: (1) micro-level edge accuracy, which is consistent with standard NLP evaluations, and manual identification of common error sources; (2) macro-level graph metrics that assess structural properties such as community detection and connectivity, which are relevant to real-world applications. Focusing on affiliation graphs of person membership in organizations extracted from social register books, our study identifies a range of extraction performance where biases across most downstream graph analysis metrics are near zero. However, as extraction performance declines, we find that many metrics exhibit increasingly pronounced biases, with each metric tending toward a consistent direction of either over- or under-estimation. Through simulations, we further show that error models commonly used in the literature do not capture these bias patterns, indicating the need for more realistic error models for KG extraction. Our findings provide actionable insights for practitioners and underscores the importance of advancing extraction methods and error modeling to ensure reliable and meaningful downstream analyses.

nan


Article 510

Title@2025-06-14 (6): Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics

Title: Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics Fortschritte in LLMs mit Fokus auf Vernunft, Anpassungsfähigkeit, Effizienz und Ethik 注重理由、适应性、效率和道德操守的LLMs项目的进展 2506.12365v1

Authors (8): Asifullah khan, Muhammad Zaeem Khan, Saleha Jamshed, Sadia Ahmad, Aleesha Zainab, Kaynat Khatib, Faria Bibi, Abdul Rehman

This survey paper outlines the key developments in the field of Large Language Models (LLMs), such as enhancing their reasoning skills, adaptability to various tasks, increased computational efficiency, and ability to make ethical decisions. The techniques that have been most effective in bridging the gap between human and machine communications include the Chain-of-Thought prompting, Instruction Tuning, and Reinforcement Learning from Human Feedback. The improvements in multimodal learning and few-shot or zero-shot techniques have further empowered LLMs to handle complex jobs with minor input. They also manage to do more with less by applying scaling and optimization tricks for computing power conservation. This survey also offers a broader perspective on recent advancements in LLMs going beyond isolated aspects such as model architecture or ethical concerns. It categorizes emerging methods that enhance LLM reasoning, efficiency, and ethical alignment. It also identifies underexplored areas such as interpretability, cross-modal integration and sustainability. With recent progress, challenges like huge computational costs, biases, and ethical risks remain constant. Addressing these requires bias mitigation, transparent decision-making, and clear ethical guidelines. Future research will focus on enhancing models ability to handle multiple input, thereby making them more intelligent, safe, and reliable.

nan


Article 511

Title@2025-06-14 (6): MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval

Title: MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval MM-R5: MultiModal reasoning-enhanced ReRanker über Verstärkungs-Lernen für Dokument-Retrieval MM-R5:通过文件检索强化学习加强文件检索,多模式合理改进Reanker 2506.12364v1

Authors (8): Mingjun Xu, Jinhan Dong, Jue Hou, Zehui Wang, Sihang Li, Zhifeng Gao, Renxin Zhong, Hengxing Cai

Multimodal document retrieval systems enable information access across text, images, and layouts, benefiting various domains like document-based question answering, report analysis, and interactive content summarization. Rerankers improve retrieval precision by reordering retrieved candidates. However, current multimodal reranking methods remain underexplored, with significant room for improvement in both training strategies and overall effectiveness. Moreover, the lack of explicit reasoning makes it difficult to analyze and optimize these methods further. In this paper, We propose MM-R5, a MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval, aiming to provide a more effective and reliable solution for multimodal reranking tasks. MM-R5 is trained in two stages: supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we focus on improving instruction-following and guiding the model to generate complete and high-quality reasoning chains. To support this, we introduce a novel data construction strategy that produces rich, high-quality reasoning data. In the RL stage, we design a task-specific reward framework, including a reranking reward tailored for multimodal candidates and a composite template-based reward to further refine reasoning quality. We conduct extensive experiments on MMDocIR, a challenging public benchmark spanning multiple domains. MM-R5 achieves state-of-the-art performance on most metrics and delivers comparable results to much larger models on the remaining ones. Moreover, compared to the best retrieval-only method, MM-R5 improves recall@1 by over 4%. These results validate the effectiveness of our reasoning-enhanced training pipeline.

nan


Article 512

Title@2025-06-14 (6): QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm

Title: QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm QiMeng-Achtung: SOTA Attention Operator wird von SOTA Attention Algorithm erzeugt QiMeng- 注意: SOTA 注意操作员由 SOTA 注意算法生成 2506.12355v1

Authors (14): Qirui Zhou, Shaohui Peng, Weiqiang Xiong, Haixin Chen, Yuanbo Wen, Haochen Li, Ling Li, Qi Guo, Yongwei Zhao, Ke Gao, Ruizhi Chen, Yanjun Wu, Chen Zhao, Yunji Chen

The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly for long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it must require time-consuming and hardware-specific manual implementation, limiting adaptability across GPU architectures. Existing LLMs have shown a lot of promise in code generation tasks, but struggle to generate high-performance attention code. The key challenge is it cannot comprehend the complex data flow and computation process of the attention operator and utilize low-level primitive to exploit GPU performance. To address the above challenge, we propose an LLM-friendly Thinking Language (LLM-TL) to help LLMs decouple the generation of high-level optimization logic and low-level implementation on GPU, and enhance LLMs’ understanding of attention operator. Along with a 2-stage reasoning workflow, TL-Code generation and translation, the LLMs can automatically generate FlashAttention implementation on diverse GPUs, establishing a self-optimizing paradigm for generating high-performance attention operators in attention-centric algorithms. Verified on A100, RTX8000, and T4 GPUs, the performance of our methods significantly outshines that of vanilla LLMs, achieving a speed-up of up to 35.16x. Besides, our method not only surpasses human-optimized libraries (cuDNN and official library) in most scenarios but also extends support to unsupported hardware and data types, reducing development time from months to minutes compared with human experts.

nan


Article 513

Title@2025-06-14 (6): Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models

Title: Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models Watch Out Your Album! Über die unbeabsichtigte Datenschutz-Erinnerung in Multi-Modal Large Language Models 注意您的专辑! 在多模式大语言模型中的意外隐私记忆中 2503.01208v2

Authors (10): Tianjie Ju, Yi Hua, Hao Fei, Zhenyu Shao, Yubin Zheng, Haodong Zhao, Mong-Li Lee, Wynne Hsu, Zhuosheng Zhang, Gongshen Liu

Multi-Modal Large Language Models (MLLMs) have exhibited remarkable performance on various vision-language tasks such as Visual Question Answering (VQA). Despite accumulating evidence of privacy concerns associated with task-relevant content, it remains unclear whether MLLMs inadvertently memorize private content that is entirely irrelevant to the training tasks. In this paper, we investigate how randomly generated task-irrelevant private content can become spuriously correlated with downstream objectives due to partial mini-batch training dynamics, thus causing inadvertent memorization. Concretely, we randomly generate task-irrelevant watermarks into VQA fine-tuning images at varying probabilities and propose a novel probing framework to determine whether MLLMs have inadvertently encoded such content. Our experiments reveal that MLLMs exhibit notably different training behaviors in partial mini-batch settings with task-irrelevant watermarks embedded. Furthermore, through layer-wise probing, we demonstrate that MLLMs trigger distinct representational patterns when encountering previously seen task-irrelevant knowledge, even if this knowledge does not influence their output during prompting. Our code is available at https://github.com/illusionhi/ProbingPrivacy.

nan


Article 514

Title@2025-06-14 (6): Efficient Reasoning Through Suppression of Self-Affirmation Reflections in Large Reasoning Models

Title: Efficient Reasoning Through Suppression of Self-Affirmation Reflections in Large Reasoning Models Effiziente Vernunft durch Unterdrückung von Selbstbestätigungsreflexionen in großen Vernunftmodellen 通过制止大理由模型中的自我确认反思提高合理性 2506.12353v1

Authors (6): Kaiyuan Liu, Chen Shen, Zhanwei Zhang, Junjie Liu, Xiaosong Yuan, Jieping ye

While recent advances in large reasoning models have demonstrated remarkable performance, efficient reasoning remains critical due to the rapid growth of output length. Existing optimization approaches highlights a tendency toward “overthinking”, yet lack fine-grained analysis. In this work, we focus on Self-Affirmation Reflections: redundant reflective steps that affirm prior content and often occurs after the already correct reasoning steps. Observations of both original and optimized reasoning models reveal pervasive self-affirmation reflections. Notably, these reflections sometimes lead to longer outputs in optimized models than their original counterparts. Through detailed analysis, we uncover an intriguing pattern: compared to other reflections, the leading words (i.e., the first word of sentences) in self-affirmation reflections exhibit a distinct probability bias. Motivated by this insight, we can locate self-affirmation reflections and conduct a train-free experiment demonstrating that suppressing self-affirmation reflections reduces output length without degrading accuracy across multiple models (R1-Distill-Models, QwQ-32B, and Qwen3-32B). Furthermore, we also improve current train-based method by explicitly suppressing such reflections. In our experiments, we achieve length compression of 18.7\% in train-free settings and 50.2\% in train-based settings for R1-Distill-Qwen-1.5B. Moreover, our improvements are simple yet practical and can be directly applied to existing inference frameworks, such as vLLM. We believe that our findings will provide community insights for achieving more precise length compression and step-level efficient reasoning.

nan


Article 515

Title@2025-06-14 (6): Information Suppression in Large Language Models: Auditing, Quantifying, and Characterizing Censorship in DeepSeek

Title: Information Suppression in Large Language Models: Auditing, Quantifying, and Characterizing Censorship in DeepSeek Informationsunterdrückung in großen Sprachmodellen: Auditierung, Quantifizierung und Charakterisierung von Zensur in DeepSeek 在大语言模式中禁止信息:审计、量化和深海搜索检查 2506.12349v1

Authors (3): Peiran Qiu, Siyi Zhou, Emilio Ferrara

This study examines information suppression mechanisms in DeepSeek, an open-source large language model (LLM) developed in China. We propose an auditing framework and use it to analyze the model’s responses to 646 politically sensitive prompts by comparing its final output with intermediate chain-of-thought (CoT) reasoning. Our audit unveils evidence of semantic-level information suppression in DeepSeek: sensitive content often appears within the model’s internal reasoning but is omitted or rephrased in the final output. Specifically, DeepSeek suppresses references to transparency, government accountability, and civic mobilization, while occasionally amplifying language aligned with state propaganda. This study underscores the need for systematic auditing of alignment, content moderation, information suppression, and censorship practices implemented into widely-adopted AI models, to ensure transparency, accountability, and equitable access to unbiased information obtained by means of these systems.

nan


Article 516

Title@2025-06-14 (6): Refract ICL: Rethinking Example Selection in the Era of Million-Token Models

Title: Refract ICL: Rethinking Example Selection in the Era of Million-Token Models Refrakt ICL: Beispielauswahl im Zeitalter der Millionen-Token-Modelle neu denken Refract ICL: 重新思考百万吨模型时代的示例选择 2506.12346v1

Authors (6): Arjun R. Akula, Kazuma Hashimoto, Krishna Srinivasan, Aditi Chaudhary, Karthik Raman, Michael Bendersky

The emergence of long-context large language models (LLMs) has enabled the use of hundreds, or even thousands, of demonstrations for in-context learning (ICL) - a previously impractical regime. This paper investigates whether traditional ICL selection strategies, which balance the similarity of ICL examples to the test input (using a text retriever) with diversity within the ICL set, remain effective when utilizing a large number of demonstrations. Our experiments demonstrate that, while longer contexts can accommodate more examples, simply increasing the number of demonstrations does not guarantee improved performance. Smart ICL selection remains crucial, even with thousands of demonstrations. To further enhance ICL in this setting, we introduce Refract ICL, a novel ICL selection algorithm specifically designed to focus LLM attention on challenging examples by strategically repeating them within the context and incorporating zero-shot predictions as error signals. Our results show that Refract ICL significantly improves the performance of extremely long-context models such as Gemini 1.5 Pro, particularly on tasks with a smaller number of output classes.

nan


Article 517

Title@2025-06-14 (6): RATIONALYST: Mining Implicit Rationales for Process Supervision of Reasoning

Title: RATIONALYST: Mining Implicit Rationales for Process Supervision of Reasoning RATIONALYST: Bergbau implizite Rationale für die Prozessüberwachung von Vernunft RICTIYST: 程序监督理据的采矿隐含理由 2410.01044v2

Authors (8): Dongwei Jiang, Guoxuan Wang, Yining Lu, Andrew Wang, Jingyu Zhang, Chuyu Liu, Benjamin Van Durme, Daniel Khashabi

The reasoning steps generated by LLMs might be incomplete, as they mimic logical leaps common in everyday communication found in their pre-training data: underlying rationales are frequently left implicit (unstated). To address this challenge, we introduce RATIONALYST, a model for process-supervision of reasoning based on pre-training on a vast collection of rationale annotations extracted from unlabeled data. We extract 79k rationales from web-scale unlabelled dataset (the Pile) and a combination of reasoning datasets with minimal human intervention. This web-scale pre-training for reasoning allows RATIONALYST to consistently generalize across diverse reasoning tasks, including mathematical, commonsense, scientific, and logical reasoning. Fine-tuned from LLaMa-3-8B, RATIONALYST improves the accuracy of reasoning by an average of 3.9% on 7 representative reasoning benchmarks. It also demonstrates superior performance compared to significantly larger verifiers like GPT-4 and similarly sized models fine-tuned on matching training sets.

nan


Article 518

Title@2025-06-14 (6): Investigating the Effects of Cognitive Biases in Prompts on Large Language Model Outputs

Title: Investigating the Effects of Cognitive Biases in Prompts on Large Language Model Outputs Untersuchung der Auswirkungen von Kognitiv-Biasen in Prompts auf große Sprachmodell-Ausgaben 调查认知分裂对大语言示范产出的影响 2506.12338v1

Authors (2): Yan Sun, Stanley Kok

This paper investigates the influence of cognitive biases on Large Language Models (LLMs) outputs. Cognitive biases, such as confirmation and availability biases, can distort user inputs through prompts, potentially leading to unfaithful and misleading outputs from LLMs. Using a systematic framework, our study introduces various cognitive biases into prompts and assesses their impact on LLM accuracy across multiple benchmark datasets, including general and financial Q&A scenarios. The results demonstrate that even subtle biases can significantly alter LLM answer choices, highlighting a critical need for bias-aware prompt design and mitigation strategy. Additionally, our attention weight analysis highlights how these biases can alter the internal decision-making processes of LLMs, affecting the attention distribution in ways that are associated with output inaccuracies. This research has implications for Al developers and users in enhancing the robustness and reliability of Al applications in diverse domains.

nan


Article 519

Title@2025-06-14 (6): Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective

Title: Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective Intersektionale Bias in japanischen großen Sprachmodellen aus einer kontextualisierten Perspektive 日本大语言模型中从背景角度分析的交叉比阿语 2506.12327v1

Authors (9): Hitomi Yanaka, Xinqi He, Jie Lu, Namgi Han, Sunjin Oh, Ryoma Kumon, Yuma Matsuoka, Katsuhiko Watabe, Yuko Itatsu

An growing number of studies have examined the social bias of rapidly developed large language models (LLMs). Although most of these studies have focused on bias occurring in a single social attribute, research in social science has shown that social bias often occurs in the form of intersectionality – the constitutive and contextualized perspective on bias aroused by social attributes. In this study, we construct the Japanese benchmark inter-JBBQ, designed to evaluate the intersectional bias in LLMs on the question-answering setting. Using inter-JBBQ to analyze GPT-4o and Swallow, we find that biased output varies according to its contexts even with the equal combination of social attributes.

nan


Article 520

Title@2025-06-14 (6): GSDNet: Revisiting Incomplete Multimodal-Diffusion from Graph Spectrum Perspective for Conversation Emotion Recognition

Title: GSDNet: Revisiting Incomplete Multimodal-Diffusion from Graph Spectrum Perspective for Conversation Emotion Recognition GSDNet: Unvollständige Multimodal-Diffusion aus Graph Spectrum Perspektive für die Erkennung von Gesprächsgefühlen GSDNet:从图表光谱视角重新审视不完全的多式联运传播,以认识情感 2506.12325v1

Authors (6): Yuntao Shou, Jun Yao, Tao Meng, Wei Ai, Cen Chen, Keqin Li

Multimodal emotion recognition in conversations (MERC) aims to infer the speaker’s emotional state by analyzing utterance information from multiple sources (i.e., video, audio, and text). Compared with unimodality, a more robust utterance representation can be obtained by fusing complementary semantic information from different modalities. However, the modality missing problem severely limits the performance of MERC in practical scenarios. Recent work has achieved impressive performance on modality completion using graph neural networks and diffusion models, respectively. This inspires us to combine these two dimensions through the graph diffusion model to obtain more powerful modal recovery capabilities. Unfortunately, existing graph diffusion models may destroy the connectivity and local structure of the graph by directly adding Gaussian noise to the adjacency matrix, resulting in the generated graph data being unable to retain the semantic and topological information of the original graph. To this end, we propose a novel Graph Spectral Diffusion Network (GSDNet), which maps Gaussian noise to the graph spectral space of missing modalities and recovers the missing data according to its original distribution. Compared with previous graph diffusion methods, GSDNet only affects the eigenvalues of the adjacency matrix instead of destroying the adjacency matrix directly, which can maintain the global topological information and important spectral features during the diffusion process. Extensive experiments have demonstrated that GSDNet achieves state-of-the-art emotion recognition performance in various modality loss scenarios.

nan


Article 521

Title@2025-06-14 (6): Fino1: On the Transferability of Reasoning-Enhanced LLMs and Reinforcement Learning to Finance

Title: Fino1: On the Transferability of Reasoning-Enhanced LLMs and Reinforcement Learning to Finance Fino1: Über die Übertragbarkeit von mit Gründen versehenen LLMs und die Stärkung des Lernens zur Finanzierung Fino1:关于有合理理由的信贷额度的可转让性和加强向融资学习 2502.08127v3

Authors (9): Lingfei Qian, Weipeng Zhou, Yan Wang, Xueqing Peng, Han Yi, Yilun Zhao, Jimin Huang, Qianqian Xie, Jian-yun Nie

As the fundamental capability behind decision-making in finance, financial reasoning poses distinct challenges for LLMs. Although reinforcement learning (RL) have boosted generic reasoning, the progress in finance is hindered by the absence of empirical study of building effective financial chain-of-thought (CoT) corpus, a systematic comparison of different RL methods, and comprehensive benchmarks. To address these gaps, we introduce FinCoT, the first open high-fidelity CoT corpus for finance, distilled from seven QA datasets by a novel three-stage pipeline that incorporates domain supervision, iterative LLM refinement, and difficulty-aware filtering. Based on FinCoT, we develop Fin-o1, the first open financial reasoning models trained via supervised fine-tuning and GRPO-based RL. Our models outperform existing financial reasoning models and SOTA general models such as GPT-o1, DeepSeek-R1, and GPT-4.5. We also investigate the effectiveness of three different RL methods in improving domain-specific reasoning, offering the first such empirical study. We finally propose FinReason, the first financial reasoning benchmark covering multi-table analysis, long-context reasoning, and equation-based tasks, and evaluate 29 LLMs. Our extensive experiments reveal general reasoning models excel on standard benchmarks yet exhibit obvious performance degradation in financial contexts; even finance-tuned models like Dianjin-R1 and FinR1 degrade on lengthy documents. In contrast, our Fin-o1 models consistently outperform their backbones and larger GPT-o1 and DeepSeek-R1, confirming the effectiveness of our data building and model training strategy. Our study further shows that GRPO yields reliable gains whereas PPO and DPO do not, highlighting the need for targeted data and optimisation rather than scale alone.

nan


Article 522

Title@2025-06-14 (6): Perspective on Utilizing Foundation Models for Laboratory Automation in Materials Research

Title: Perspective on Utilizing Foundation Models for Laboratory Automation in Materials Research Perspektiven für die Nutzung von Basismodellen für die Laborautomation in der Materialforschung 利用材料研究实验室自动化模型的基础模型的视角 2506.12312v1

Authors (7): Kan Hatakeyama-Sato, Toshihiko Nishida, Kenta Kitamura, Yoshitaka Ushiku, Koichi Takahashi, Yuta Nabae, Teruaki Hayakawa

This review explores the potential of foundation models to advance laboratory automation in the materials and chemical sciences. It emphasizes the dual roles of these models: cognitive functions for experimental planning and data analysis, and physical functions for hardware operations. While traditional laboratory automation has relied heavily on specialized, rigid systems, foundation models offer adaptability through their general-purpose intelligence and multimodal capabilities. Recent advancements have demonstrated the feasibility of using large language models (LLMs) and multimodal robotic systems to handle complex and dynamic laboratory tasks. However, significant challenges remain, including precision manipulation of hardware, integration of multimodal data, and ensuring operational safety. This paper outlines a roadmap highlighting future directions, advocating for close interdisciplinary collaboration, benchmark establishment, and strategic human-AI integration to realize fully autonomous experimental laboratories.

nan


Article 523

Title@2025-06-14 (6): Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech

Title: Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech Phonikud: Hebräische Grapheme-to-Phone-Umwandlung für Echtzeit-Text-to-Speech Phonikud: 用于实时文字语音转换的希伯来石墨到phoneme转换 2506.12311v1

Authors (4): Yakov Kolani, Maxim Melichov, Cobi Calev, Morris Alper

Real-time text-to-speech (TTS) for Modern Hebrew is challenging due to the language’s orthographic complexity. Existing solutions ignore crucial phonetic features such as stress that remain underspecified even when vowel marks are added. To address these limitations, we introduce Phonikud, a lightweight, open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully-specified IPA transcriptions. Our approach adapts an existing diacritization model with lightweight adaptors, incurring negligible additional latency. We also contribute the ILSpeech dataset of transcribed Hebrew speech with IPA annotations, serving as a benchmark for Hebrew G2P and as training data for TTS systems. Our results demonstrate that Phonikud G2P conversion more accurately predicts phonemes from Hebrew text compared to prior methods, and that this enables training of effective real-time Hebrew TTS models with superior speed-accuracy trade-offs. We release our code, data, and models at https://phonikud.github.io.

nan


Article 524

Title@2025-06-14 (6): Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning

Title: Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning Med-U1: Förderung der einheitlichen medizinischen Vernunft in LLMs durch großangelegtes Verstärkungslernen Med-U1:通过大规模加强学习在LLMs中鼓励统一医疗理由 2506.12307v1

Authors (9): Xiaotian Zhang, Yuan Wang, Zhaopeng Feng, Ruizhe Chen, Zhijie Zhou, Yan Zhang, Hongxia Xu, Jian Wu, Zuozhu Liu

Medical Question-Answering (QA) encompasses a broad spectrum of tasks, including multiple choice questions (MCQ), open-ended text generation, and complex computational reasoning. Despite this variety, a unified framework for delivering high-quality medical QA has yet to emerge. Although recent progress in reasoning-augmented large language models (LLMs) has shown promise, their ability to achieve comprehensive medical understanding is still largely unexplored. In this paper, we present Med-U1, a unified framework for robust reasoning across medical QA tasks with diverse output formats, ranging from MCQs to complex generation and computation tasks. Med-U1 employs pure large-scale reinforcement learning with mixed rule-based binary reward functions, incorporating a length penalty to manage output verbosity. With multi-objective reward optimization, Med-U1 directs LLMs to produce concise and verifiable reasoning chains. Empirical results reveal that Med-U1 significantly improves performance across multiple challenging Med-QA benchmarks, surpassing even larger specialized and proprietary models. Furthermore, Med-U1 demonstrates robust generalization to out-of-distribution (OOD) tasks. Extensive analysis presents insights into training strategies, reasoning chain length control, and reward design for medical LLMs. The code will be released.

nan


Article 525

Title@2025-06-14 (6): Smurfs: Multi-Agent System using Context-Efficient DFSDT for Tool Planning

Title: Smurfs: Multi-Agent System using Context-Efficient DFSDT for Tool Planning Schlümpfe: Multi-Agent System mit Kontext-Effizient DFSDT für Werkzeugplanung 蓝精精:多机构系统,在工具规划中使用内地高效的DFDDT 2405.05955v4

Authors (3): Junzhi Chen, Juhao Liang, Benyou Wang

Teaching large language models (LLMs) to use tools for solving complex problems can grant them human-like reasoning abilities. ReAct and its variants are popular frameworks for tool use in both single-agent and multi-agent systems. To address issues like error propagation and limited exploration in ReAct, the Deep First Search Decision Tree (DFSDT) was proposed, but it faces challenges such as rollback instability, redundant context, and premature termination in single-agent settings. We introduce “Smurfs,” a novel multi-agent system (MAS) that enhances DFSDT with a modular, context-efficient, and training-free design. Smurfs surpasses baseline methods in both the open-ended StableToolBench and the closed-ended HotpotQA tasks, reducing token usage by 60.9\% compared to DFSDT and enabling Mistral-7b to perform on par with GPT-4-DFSDT. Extensive ablation studies confirm the effectiveness of Smurfs’ core components, offering valuable insights for the construction and interpretation of MAS, and paving the way for future exploration.

nan


Article 526

Title@2025-06-14 (6): Disclosure Audits for LLM Agents

Title: Disclosure Audits for LLM Agents Offenlegungsprüfungen für LLM-Agenten 对LLLM代理的披露审计 2506.10171v2

Authors (3): Saswat Das, Jameson Sandler, Ferdinando Fioretto

Large Language Model agents have begun to appear as personal assistants, customer service bots, and clinical aides. While these applications deliver substantial operational benefits, they also require continuous access to sensitive data, which increases the likelihood of unauthorized disclosures. This study proposes an auditing framework for conversational privacy that quantifies and audits these risks. The proposed Conversational Manipulation for Privacy Leakage (CMPL) framework, is an iterative probing strategy designed to stress-test agents that enforce strict privacy directives. Rather than focusing solely on a single disclosure event, CMPL simulates realistic multi-turn interactions to systematically uncover latent vulnerabilities. Our evaluation on diverse domains, data modalities, and safety configurations demonstrate the auditing framework’s ability to reveal privacy risks that are not deterred by existing single-turn defenses. In addition to introducing CMPL as a diagnostic tool, the paper delivers (1) an auditing procedure grounded in quantifiable risk metrics and (2) an open benchmark for evaluation of conversational privacy across agent implementations.

nan


Article 527

Title@2025-06-13 (5): Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure

Title: Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure Können LLMs hochwertige Testfälle für Algorithmenprobleme generieren? TestCase-Eval: Eine systematische Bewertung von Fehlerbedeckung und Exposition LLLM女士能否生成高质量的鉴定问题测试案例? 2506.12278v1

Authors (4): Zheyuan Yang, Zexi Kuang, Xue Xia, Yilun Zhao

We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes. (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.

nan


Article 528

Title@2025-06-13 (5): Investigating the Potential of Large Language Model-Based Router Multi-Agent Architectures for Foundation Design Automation: A Task Classification and Expert Selection Study

Title: Investigating the Potential of Large Language Model-Based Router Multi-Agent Architectures for Foundation Design Automation: A Task Classification and Expert Selection Study Untersuchung des Potenzials von Multi-Agent-Architekturen für die Grundlagen-Design-Automatisierung von Großsprachenmodellen: Eine Aufgabenklassifikation und Expertenauswahlstudie 调查基于大语言示范示范路由器多机构结构对基础设计自动化的潜力:任务分类和专家甄选研究 2506.13811v1

Authors (4): Sompote Youwai, David Phim, Vianne Gayl Murcia, Rianne Clair Onas

This study investigates router-based multi-agent systems for automating foundation design calculations through intelligent task classification and expert selection. Three approaches were evaluated: single-agent processing, multi-agent designer-checker architecture, and router-based expert selection. Performance assessment utilized baseline models including DeepSeek R1, ChatGPT 4 Turbo, Grok 3, and Gemini 2.5 Pro across shallow foundation and pile design scenarios. The router-based configuration achieved performance scores of 95.00% for shallow foundations and 90.63% for pile design, representing improvements of 8.75 and 3.13 percentage points over standalone Grok 3 performance respectively. The system outperformed conventional agentic workflows by 10.0 to 43.75 percentage points. Grok 3 demonstrated superior standalone performance without external computational tools, indicating advances in direct LLM mathematical reasoning for engineering applications. The dual-tier classification framework successfully distinguished foundation types, enabling appropriate analytical approaches. Results establish router-based multi-agent systems as optimal for foundation design automation while maintaining professional documentation standards. Given safety-critical requirements in civil engineering, continued human oversight remains essential, positioning these systems as advanced computational assistance tools rather than autonomous design replacements in professional practice.

nan


Article 529

Title@2025-06-13 (5): Personalized Wireless Federated Learning for Large Language Models

Title: Personalized Wireless Federated Learning for Large Language Models Personalisiertes Wireless-Federated-Lernen für große Sprachmodelle 大语言模式个人无线个人无线联邦学习 2404.13238v2

Authors (8): Feibo Jiang, Li Dong, Siwei Tu, Yubo Peng, Kezhi Wang, Kun Yang, Cunhua Pan, Dusit Niyato

Large language models (LLMs) have driven profound transformations in wireless networks. However, within wireless environments, the training of LLMs faces significant challenges related to security and privacy. Federated Learning (FL), with its decentralized architecture, offers enhanced data privacy protection. Nevertheless, when integrated with LLMs, FL still struggles with several critical limitations, including large-scale and heterogeneous data, resource-intensive training, and substantial communication overhead. To address these challenges, this paper first presents a systematic analysis of the distinct training stages of LLMs in wireless networks, including pre-training, instruction tuning, and alignment tuning. Building upon this foundation, we propose a Personalized Wireless Federated Fine-tuning (PWFF) framework. Initially, we utilize the adapter and Low-Rank Adaptation (LoRA) techniques to decrease energy consumption, while employing global partial aggregation to reduce communication delay. Subsequently, we develop two reward models and design a personalized loss function to fulfill the goal of personalized learning. Furthermore, we implement a local multi-objective alignment to ensure the stability and effectiveness of the FL process. Finally, we conduct a series of simulations to validate the performance of the proposed PWFF method and provide an in-depth discussion of the open issues.

nan


Article 530

Title@2025-06-13 (5): WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment

Title: WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment WorldAPIs: Die Welt ist Wert Wie viele APIs? Ein Gedankenexperiment WorldAPIs:世界值多少个API? 2407.07778v2

Authors (4): Jiefu Ou, Arda Uzunoglu, Benjamin Van Durme, Daniel Khashabi

AI systems make decisions in physical environments through primitive actions or affordances that are accessed via API calls. While deploying AI agents in the real world involves numerous high-level actions, existing embodied simulators offer a limited set of domain-salient APIs. This naturally brings up the questions: how many primitive actions (APIs) are needed for a versatile embodied agent, and what should they look like? We explore this via a thought experiment: assuming that wikiHow tutorials cover a wide variety of human-written tasks, what is the space of APIs needed to cover these instructions? We propose a framework to iteratively induce new APIs by grounding wikiHow instruction to situated agent policies. Inspired by recent successes in large language models (LLMs) for embodied planning, we propose a few-shot prompting to steer GPT-4 to generate Pythonic programs as agent policies and bootstrap a universe of APIs by 1) reusing a seed set of APIs; and then 2) fabricate new API calls when necessary. The focus of this thought experiment is on defining these APIs rather than their executability. We apply the proposed pipeline on instructions from wikiHow tutorials. On a small fraction (0.5%) of tutorials, we induce an action space of 300+ APIs necessary for capturing the rich variety of tasks in the physical world. A detailed automatic and human analysis of the induction output reveals that the proposed pipeline enables effective reuse and creation of APIs. Moreover, a manual review revealed that existing simulators support only a small subset of the induced APIs (9 of the top 50 frequent APIs), motivating the development of action-rich embodied environments.

nan


Article 531

Title@2025-06-13 (5): InfoFlood: Jailbreaking Large Language Models with Information Overload

Title: InfoFlood: Jailbreaking Large Language Models with Information Overload InfoFlood: Jailbreaking Große Sprachmodelle mit Informationsüberlastung InfoFlood: 带有信息超载的破狱大语言模型 2506.12274v1

Authors (5): Advait Yadav, Haibo Jin, Man Luo, Jun Zhuang, Haohan Wang

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains. However, their potential to generate harmful responses has raised significant societal and regulatory concerns, especially when manipulated by adversarial techniques known as “jailbreak” attacks. Existing jailbreak methods typically involve appending carefully crafted prefixes or suffixes to malicious prompts in order to bypass the built-in safety mechanisms of these models. In this work, we identify a new vulnerability in which excessive linguistic complexity can disrupt built-in safety mechanisms-without the need for any added prefixes or suffixes-allowing attackers to elicit harmful outputs directly. We refer to this phenomenon as Information Overload. To automatically exploit this vulnerability, we propose InfoFlood, a jailbreak attack that transforms malicious queries into complex, information-overloaded queries capable of bypassing built-in safety mechanisms. Specifically, InfoFlood: (1) uses linguistic transformations to rephrase malicious queries, (2) identifies the root cause of failure when an attempt is unsuccessful, and (3) refines the prompt’s linguistic structure to address the failure while preserving its malicious intent. We empirically validate the effectiveness of InfoFlood on four widely used LLMs-GPT-4o, GPT-3.5-turbo, Gemini 2.0, and LLaMA 3.1-by measuring their jailbreak success rates. InfoFlood consistently outperforms baseline attacks, achieving up to 3 times higher success rates across multiple jailbreak benchmarks. Furthermore, we demonstrate that commonly adopted post-processing defenses, including OpenAI’s Moderation API, Perspective API, and SmoothLLM, fail to mitigate these attacks. This highlights a critical weakness in traditional AI safety guardrails when confronted with information overload-based jailbreaks.

nan


Article 532

Title@2025-06-13 (5): The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs

Title: The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs The Behavior Gap: Bewertung von Null-Shot-LLM-Agenten in komplexen Task-Orientierten Dialogen 行为差距:评价复杂任务导向对话中的零射LLM代理 2506.12266v1

Authors (3): Avinash Baidya, Kamalika Das, Xiang Gao

Large Language Model (LLM)-based agents have significantly impacted Task-Oriented Dialog Systems (TODS) but continue to face notable performance challenges, especially in zero-shot scenarios. While prior work has noted this performance gap, the behavioral factors driving the performance gap remain under-explored. This study proposes a comprehensive evaluation framework to quantify the behavior gap between AI agents and human experts, focusing on discrepancies in dialog acts, tool usage, and knowledge utilization. Our findings reveal that this behavior gap is a critical factor negatively impacting the performance of LLM agents. Notably, as task complexity increases, the behavior gap widens (correlation: 0.963), leading to a degradation of agent performance on complex task-oriented dialogs. For the most complex task in our study, even the GPT-4o-based agent exhibits low alignment with human behavior, with low F1 scores for dialog acts (0.464), excessive and often misaligned tool usage with a F1 score of 0.139, and ineffective usage of external knowledge. Reducing such behavior gaps leads to significant performance improvement (24.3% on average). This study highlights the importance of comprehensive behavioral evaluations and improved alignment strategies to enhance the effectiveness of LLM-based TODS in handling complex tasks.

nan


Article 533

Title@2025-06-13 (5): ProVox: Personalization and Proactive Planning for Situated Human-Robot Collaboration

Title: ProVox: Personalization and Proactive Planning for Situated Human-Robot Collaboration ProVox: Personalisierung und proaktive Planung für die angesiedelte Mensch-Roboter-Kollaboration ProVox:人类机器人合机的个性化和前瞻性规划 2506.12248v1

Authors (4): Jennifer Grannen, Siddharth Karamcheti, Blake Wulfe, Dorsa Sadigh

Collaborative robots must quickly adapt to their partner’s intent and preferences to proactively identify helpful actions. This is especially true in situated settings where human partners can continually teach robots new high-level behaviors, visual concepts, and physical skills (e.g., through demonstration), growing the robot’s capabilities as the human-robot pair work together to accomplish diverse tasks. In this work, we argue that robots should be able to infer their partner’s goals from early interactions and use this information to proactively plan behaviors ahead of explicit instructions from the user. Building from the strong commonsense priors and steerability of large language models, we introduce ProVox (“Proactive Voice”), a novel framework that enables robots to efficiently personalize and adapt to individual collaborators. We design a meta-prompting protocol that empowers users to communicate their distinct preferences, intent, and expected robot behaviors ahead of starting a physical interaction. ProVox then uses the personalized prompt to condition a proactive language model task planner that anticipates a user’s intent from the current interaction context and robot capabilities to suggest helpful actions; in doing so, we alleviate user burden, minimizing the amount of time partners spend explicitly instructing and supervising the robot. We evaluate ProVox through user studies grounded in household manipulation tasks (e.g., assembling lunch bags) that measure the efficiency of the collaboration, as well as features such as perceived helpfulness, ease of use, and reliability. Our analysis suggests that both meta-prompting and proactivity are critical, resulting in 38.7% faster task completion times and 31.9% less user burden relative to non-active baselines. Supplementary material, code, and videos can be found at https://provox-2025.github.io.

nan


Article 534

Title@2025-06-13 (5): Large Language Models for History, Philosophy, and Sociology of Science: Interpretive Uses, Methodological Challenges, and Critical Perspectives

Title: Large Language Models for History, Philosophy, and Sociology of Science: Interpretive Uses, Methodological Challenges, and Critical Perspectives Große Sprachmodelle für Geschichte, Philosophie und Wissenschaftssoziologie: Interpretische Nutzungen, methodische Herausforderungen und kritische Perspektiven 历史、哲学和社会科学社会学大语言模式:解释用途、方法挑战和关键视角 2506.12242v1

Authors (3): Arno Simons, Michael Zichert, Adrian Wüthrich

This paper explores the use of large language models (LLMs) as research tools in the history, philosophy, and sociology of science (HPSS). LLMs are remarkably effective at processing unstructured text and inferring meaning from context, offering new affordances that challenge long-standing divides between computational and interpretive methods. This raises both opportunities and challenges for HPSS, which emphasizes interpretive methodologies and understands meaning as context-dependent, ambiguous, and historically situated. We argue that HPSS is uniquely positioned not only to benefit from LLMs’ capabilities but also to interrogate their epistemic assumptions and infrastructural implications. To this end, we first offer a concise primer on LLM architectures and training paradigms tailored to non-technical readers. We frame LLMs not as neutral tools but as epistemic infrastructures that encode assumptions about meaning, context, and similarity, conditioned by their training data, architecture, and patterns of use. We then examine how computational techniques enhanced by LLMs, such as structuring data, detecting patterns, and modeling dynamic processes, can be applied to support interpretive research in HPSS. Our analysis compares full-context and generative models, outlines strategies for domain and task adaptation (e.g., continued pretraining, fine-tuning, and retrieval-augmented generation), and evaluates their respective strengths and limitations for interpretive inquiry in HPSS. We conclude with four lessons for integrating LLMs into HPSS: (1) model selection involves interpretive trade-offs; (2) LLM literacy is foundational; (3) HPSS must define its own benchmarks and corpora; and (4) LLMs should enhance, not replace, interpretive methods.

nan


Article 535

Title@2025-06-13 (5): Compute Optimal Scaling of Skills: Knowledge vs Reasoning

Title: Compute Optimal Scaling of Skills: Knowledge vs Reasoning Optimale Skalierung von Fähigkeiten berechnen: Wissen vs. Vernunft 计算技能的优化规模:知识与理由 2503.10061v3

Authors (5): Nicholas Roberts, Niladri Chatterji, Sharan Narang, Mike Lewis, Dieuwke Hupkes

Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as ‘compute-optimally’ trading-off parameter count and dataset size, alongside a more recent growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation, and we answer this question in the affirmative: scaling laws are skill-dependent. Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation of different datamixes and find that, also when correcting for datamix differences, knowledge and code exhibit fundamental differences in scaling behaviour. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that a misspecified validation set can impact compute-optimal parameter count by nearly 50%, depending on its skill composition.

nan


Article 536

Title@2025-06-13 (5): Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

Title: Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index Infini-gram mini: Genaue n-gram Suche auf der Internetskala mit FM-Index Infini-gram 微型: 使用 FM- Index 的 Internet 比例尺精确的 n 克搜索 2506.12229v1

Authors (5): Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi

Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora – counting string appearances and retrieving the enclosing documents – yet the high storage overhead hinders their application on Internet-scale data. We present Infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18$\times$) and memory use during both indexing (3.2$\times$ reduction) and querying (down to a negligible amount). We index 46TB of Internet text in 50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes). We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on Infini-gram mini indexes.

nan


Article 537

Title@2025-06-13 (5): R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Title: R-KV: Redundancy-aware KV Cache Compression for Reasoning Models R-KV: Redundancy-aware KV Cache-Kompression für sinnvolle Modelle R-KV: 解释模型的冗余感知 KV 缓存压缩 2505.24133v3

Authors (14): Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu

Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV-cache reduction also leads to a 90% memory saving and a 6.6X throughput over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.

nan


Article 538

Title@2025-06-13 (5): A Survey of Generative Categories and Techniques in Multimodal Large Language Models

Title: A Survey of Generative Categories and Techniques in Multimodal Large Language Models Eine Übersicht über generative Kategorien und Techniken in multimodalen großen Sprachmodellen 多式联运大语言模型的创用类别和技术调查 2506.10016v2

Authors (5): Longzhen Han, Awes Mubarak, Almas Baimagambetov, Nikolaos Polatidis, Thar Baker

Multimodal Large Language Models (MLLMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Architectural innovations like transformers and diffusion models underpin this convergence, enabling cross-modal transfer and modular specialization. We highlight emerging patterns of synergy, and identify open challenges in evaluation, modularity, and structured reasoning. This survey offers a unified perspective on MLLM development and identifies critical paths toward more general-purpose, adaptive, and interpretable multimodal systems.

nan


Article 539

Title@2025-06-13 (5): From Emergence to Control: Probing and Modulating Self-Reflection in Language Models

Title: From Emergence to Control: Probing and Modulating Self-Reflection in Language Models Von der Emergence zur Kontrolle: Probieren und Modulieren von Selbstreflexion in Sprachmodellen 从新兴到控制:语文模式的自我反省和调整 2506.12217v1

Authors (4): Xudong Zhu, Jiachen Jiang, Mohammad Mahdi Khalili, Zhihui Zhu

Self-reflection – the ability of a large language model (LLM) to revisit, evaluate, and revise its own reasoning – has recently emerged as a powerful behavior enabled by reinforcement learning with verifiable rewards (RLVR). While self-reflection correlates with improved reasoning accuracy, its origin and underlying mechanisms remain poorly understood. In this work, {\it we first show that self-reflection is not exclusive to RLVR fine-tuned models: it already emerges, albeit rarely, in pretrained models}. To probe this latent ability, we introduce Reflection-Inducing Probing, a method that injects reflection-triggering reasoning traces from fine-tuned models into pretrained models. This intervention raises self-reflection frequency of Qwen2.5 from 0.6\% to 18.6\%, revealing a hidden capacity for reflection. Moreover, our analysis of internal representations shows that both pretrained and fine-tuned models maintain hidden states that distinctly separate self-reflective from non-reflective contexts. Leveraging this observation, {\it we then construct a self-reflection vector, a direction in activation space associated with self-reflective reasoning}. By manipulating this vector, we enable bidirectional control over the self-reflective behavior for both pretrained and fine-tuned models. Experiments across multiple reasoning benchmarks show that enhancing these vectors improves reasoning performance by up to 12\%, while suppressing them reduces computational cost, providing a flexible mechanism to navigate the trade-off between reasoning quality and efficiency without requiring additional training. Our findings further our understanding of self-reflection and support a growing body of work showing that understanding model internals can enable precise behavioral control.

nan


Article 540

Title@2025-06-13 (5): MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP

Title: MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP MELABenchv1: Benchmarking von großen Sprachmodellen gegen kleinere, feinere Modelle für Low-Resource Maltesische NLP MELABenchv1:对照低资源马耳他低排放马耳他低排放马耳他低排放语言方案较微小的微量设计模型确定大语言模型基准 2506.04385v2

Authors (2): Kurt Micallef, Claudia Borg

Large Language Models (LLMs) have demonstrated remarkable performance across various Natural Language Processing (NLP) tasks, largely due to their generalisability and ability to perform tasks without additional training. However, their effectiveness for low-resource languages remains limited. In this study, we evaluate the performance of 55 publicly available LLMs on Maltese, a low-resource language, using a newly introduced benchmark covering 11 discriminative and generative tasks. Our experiments highlight that many models perform poorly, particularly on generative tasks, and that smaller fine-tuned models often perform better across all tasks. From our multidimensional analysis, we investigate various factors impacting performance. We conclude that prior exposure to Maltese during pre-training and instruction-tuning emerges as the most important factor. We also examine the trade-offs between fine-tuning and prompting, highlighting that while fine-tuning requires a higher initial cost, it yields better performance and lower inference costs. Through this work, we aim to highlight the need for more inclusive language technologies and recommend that researchers working with low-resource languages consider more “traditional” language modelling approaches.

nan


Article 541

Title@2025-06-13 (5): Supernova Event Dataset: Interpreting Large Language Model’s Personality through Critical Event Analysis

Title: Supernova Event Dataset: Interpreting Large Language Model’s Personality through Critical Event Analysis Supernova-Ereignisdatensatz: Verdolmetschen der Persönlichkeit des Large Language Model durch kritische Ereignisanalyse 超新星事件数据集:通过重大事件分析解释大语言模型的个性 2506.12189v1

Authors (2): Pranav Agarwal, Ioana Ciucă

Large Language Models (LLMs) are increasingly integrated into everyday applications. As their influence grows, understanding their decision making and underlying personality becomes essential. In this work, we interpret model personality using our proposed Supernova Event Dataset, a novel dataset with diverse articles spanning biographies, historical events, news, and scientific discoveries. We use this dataset to benchmark LLMs on extracting and ranking key events from text, a subjective and complex challenge that requires reasoning over long-range context and modeling causal chains. We evaluate small models like Phi-4, Orca 2, and Qwen 2.5, and large, stronger models such as Claude 3.7, Gemini 2.5, and OpenAI o3, and propose a framework where another LLM acts as a judge to infer each model’s personality based on its selection and classification of events. Our analysis shows distinct personality traits: for instance, Orca 2 demonstrates emotional reasoning focusing on interpersonal dynamics, while Qwen 2.5 displays a more strategic, analytical style. When analyzing scientific discovery events, Claude Sonnet 3.7 emphasizes conceptual framing, Gemini 2.5 Pro prioritizes empirical validation, and o3 favors step-by-step causal reasoning. This analysis improves model interpretability, making them user-friendly for a wide range of diverse applications.

nan


Article 542

Title@2025-06-13 (5): Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse

Title: Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse Achten Sie auf Ihren Schritt (durch Schritt): Chain-of-Thought kann die Leistung bei Aufgaben reduzieren, bei denen Denken Menschen schlimmer macht ” 一步一步小心 “ (一步一步): “ 努力链 “ 能够降低思考使人类更加恶化的任务的绩效 “ 。 2410.21333v4

Authors (6): Ryan Liu, Jiayi Geng, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, Thomas L. Griffiths

Chain-of-thought (CoT) prompting has become a widely used strategy for improving large language and multimodal model performance. However, it is still an open question under which settings CoT systematically reduces performance. In this paper, we seek to identify the characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology, focusing on six representative tasks from the psychological literature where deliberation hurts performance in humans. In three of these tasks, state-of-the-art models exhibit significant performance drop-offs with CoT (up to 36.3\% absolute accuracy for OpenAI o1-preview compared to GPT-4o), while in others, CoT effects are mixed, with positive, neutral, and negative changes. While models and humans do not exhibit perfectly parallel cognitive processes, considering cases where thinking has negative consequences for humans helps identify settings where it negatively impacts models. By connecting the literature on human verbal thinking and deliberation with evaluations of CoT, we offer a perspective for understanding the impact of inference-time reasoning.

nan


Article 543

Title@2025-06-13 (5): BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation

Title: BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation BOUQuET: Datensatz, Benchmark und Open Initiative für Universal Quality Evaluation in Translation BOUQuET:翻译普遍质量评价的数据集、基准和开放倡议 2502.04314v2

Authors (17): The Omnilingual MT Team, Pierre Andrews, Mikel Artetxe, Mariano Coria Meglioli, Marta R. Costa-jussà, Joe Chuang, David Dale, Cynthia Gao, Jean Maillard, Alex Mourachko, Christophe Ropers, Safiyyah Saleem, Eduardo Sánchez, Ioannis Tsiamas, Arina Turkatenko, Albert Ventayol-Boada, Shireen Yates

BOUQuET is a multi-way, multicentric and multi-register/domain dataset and benchmark, and a broader collaborative initiative. This dataset is handcrafted in 8 non-English languages. Each of these source languages are representative of the most widely spoken ones and therefore they have the potential to serve as pivot languages that will enable more accurate translations. The dataset is multicentric to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. Therefore, BOUQuET is specially suitable for crowd-source extension for which we are launching a call aiming at collecting a multi-way parallel corpus covering any written language.

nan


Article 544

Title@2025-06-13 (5): Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs

Title: Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs Instruktion Tuning und CoT Prompting für kontextuelle medizinische QA mit LLMs 与LLMM公司一起进行背景医疗质量评估的教学说明和COT提示 2506.12182v1

Authors (6): Chenqian Le, Ziheng Gong, Chihang Wang, Haowei Ni, Panfeng Li, Xupeng Chen

Large language models (LLMs) have shown great potential in medical question answering (MedQA), yet adapting them to biomedical reasoning remains challenging due to domain-specific complexity and limited supervision. In this work, we study how prompt design and lightweight fine-tuning affect the performance of open-source LLMs on PubMedQA, a benchmark for multiple-choice biomedical questions. We focus on two widely used prompting strategies - standard instruction prompts and Chain-of-Thought (CoT) prompts - and apply QLoRA for parameter-efficient instruction tuning. Across multiple model families and sizes, our experiments show that CoT prompting alone can improve reasoning in zero-shot settings, while instruction tuning significantly boosts accuracy. However, fine-tuning on CoT prompts does not universally enhance performance and may even degrade it for certain larger models. These findings suggest that reasoning-aware prompts are useful, but their benefits are model- and scale-dependent. Our study offers practical insights into combining prompt engineering with efficient finetuning for medical QA applications.

nan


Article 545

Title@2025-06-13 (5): Generative or Discriminative? Revisiting Text Classification in the Era of Transformers

Title: Generative or Discriminative? Revisiting Text Classification in the Era of Transformers Generativ oder diskriminativ? Textklassifizierung im Zeitalter der Transformer 产生还是歧视? 重新研究变异器时代的文本分类 2506.12181v1

Authors (10): Siva Rajesh Kasa, Karan Gupta, Sumegh Roychowdhury, Ashutosh Kumar, Yaswanth Biruduraju, Santhosh Kumar Kasa, Nikhil Priyatam Pattisapu, Arindam Bhattacharya, Shailendra Agarwal, Vijay huddar

The comparison between discriminative and generative classifiers has intrigued researchers since Efron’s seminal analysis of logistic regression versus discriminant analysis. While early theoretical work established that generative classifiers exhibit lower sample complexity but higher asymptotic error in simple linear settings, these trade-offs remain unexplored in the transformer era. We present the first comprehensive evaluation of modern generative and discriminative architectures - Auto-regressive modeling, Masked Language Modeling, Discrete Diffusion, and Encoders for text classification. Our study reveals that the classical ‘two regimes’ phenomenon manifests distinctly across different architectures and training paradigms. Beyond accuracy, we analyze sample efficiency, calibration, noise robustness, and ordinality across diverse scenarios. Our findings offer practical guidance for selecting the most suitable modeling approach based on real-world constraints such as latency and data limitations.

nan


Article 546

Title@2025-06-13 (5): A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

Title: A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages Eine rigorose Bewertung von LLM-Datenerstellungsstrategien für ressourcenarme Sprachen 对LLLM低资源语言数据生成战略的严格评价 2506.12158v1

Authors (4): Tatiana Ankinina, Jan Cegin, Jakub Simko, Simon Ostermann

Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed, such as demonstrations, label-based summaries, and self-revision, their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods, particularly target-language demonstrations with LLM-based revisions, yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient generation strategies for synthetic data generation in low-resource scenarios with smaller models.

nan


Article 547

Title@2025-06-13 (5): Maximally-Informative Retrieval for State Space Model Generation

Title: Maximally-Informative Retrieval for State Space Model Generation Maximal-informatives Retrieval für die Generierung von State Space Models 用于国家空间模型生成的最大进步检索 2506.12149v1

Authors (7): Evan Becker, Benjamin Bowman, Matthew Trager, Tian Yu Liu, Luca Zancato, Wei Xia, Stefano Soatto

Given a query and dataset, the optimal way of answering the query is to make use all the information available. Modern LLMs exhibit impressive ability to memorize training data, but data not deemed important during training is forgotten, and information outside that training set cannot be made use of. Processing an entire dataset at inference time is infeasible due to the bounded nature of model resources (e.g. context size in transformers or states in state space models), meaning we must resort to external memory. This constraint naturally leads to the following problem: How can we decide based on the present query and model, what among a virtually unbounded set of known data matters for inference? To minimize model uncertainty for a particular query at test-time, we introduce Retrieval In-Context Optimization (RICO), a retrieval method that uses gradients from the LLM itself to learn the optimal mixture of documents for answer generation. Unlike traditional retrieval-augmented generation (RAG), which relies on external heuristics for document retrieval, our approach leverages direct feedback from the model. Theoretically, we show that standard top-$k$ retrieval with model gradients can approximate our optimization procedure, and provide connections to the leave-one-out loss. We demonstrate empirically that by minimizing an unsupervised loss objective in the form of question perplexity, we can achieve comparable retriever metric performance to BM25 with \emph{no finetuning}. Furthermore, when evaluated on quality of the final prediction, our method often outperforms fine-tuned dense retrievers such as E5.

nan


Article 548

Title@2025-06-13 (5): Hatevolution: What Static Benchmarks Don’t Tell Us

Title: Hatevolution: What Static Benchmarks Don’t Tell Us Hatevolution: Was Statische Benchmarks uns nicht sagen 仇恨革命:什么静态基准不告诉我们 2506.12148v1

Authors (4): Chiara Di Bonaventura, Barbara McGillivray, Yulan He, Albert Meroño-Peñuela

Language changes over time, including in the hate speech domain, which evolves quickly following social dynamics and cultural shifts. While NLP research has investigated the impact of language evolution on model training and has proposed several solutions for it, its impact on model benchmarking remains under-explored. Yet, hate speech benchmarks play a crucial role to ensure model safety. In this paper, we empirically evaluate the robustness of 20 language models across two evolving hate speech experiments, and we show the temporal misalignment between static and time-sensitive evaluations. Our findings call for time-sensitive linguistic benchmarks in order to correctly and reliably evaluate language models in the hate speech domain.

nan


Article 549

Title@2025-06-13 (5): Resa: Transparent Reasoning Models via SAEs

Title: Resa: Transparent Reasoning Models via SAEs Resa: Transparente Begründungsmodelle über SAE Resa:通过SAEs建立透明说明理由模型 2506.09967v2

Authors (7): Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger

How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart’s reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.

nan


Article 550

Title@2025-06-13 (5): code_transformed: The Influence of Large Language Models on Code

Title: code_transformed: The Influence of Large Language Models on Code code_transformed: Der Einfluss großer Sprachmodelle auf Code 代码转换:大语言模型对代码的影响 2506.12014v1

Authors (6): Yuliang Xu, Siming Huang, Mingmeng Geng, Yao Wan, Xuanhua Shi, Dongping Chen

Coding remains one of the most fundamental modes of interaction between humans and machines. With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices. This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized? In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity. By analyzing code from over 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. For instance, the proportion of snake_case variable names in Python code increased from 47% in Q1 2023 to 51% in Q1 2025. Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes. Given the diversity of LLMs and usage scenarios, among other factors, it is difficult or even impossible to precisely estimate the proportion of code generated or assisted by LLMs. Our experimental results provide the first large-scale empirical evidence that LLMs affect real-world programming style.

nan


Article 551

Title@2025-06-13 (5): Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources?

Title: Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources? Können Mixture-of-Experts LLMs unter streng gleichen Ressourcen übertreffen? 在资源严格平等的情况下,能否在资源严格平等的情况下进行专家混合生产? 2506.12119v1

Authors (8): Houyi Li, Ka Man Lo, Ziqi Wang, Zili Wang, Wenzhen Zheng, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints - that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and achieve an optimal model design that maximizes the performance. Based on this, we subsequently find that an MoE model with activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter, training compute and data resource. More importantly, this optimal region remains consistent across different model sizes. Although additional amount of data turns out to be a trade-off for the enhanced performance, we show that this can be resolved via reusing data. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All models will be released publicly.

nan


Article 552

Title@2025-06-13 (5): Cartridges: Lightweight and general-purpose long context representations via self-study

Title: Cartridges: Lightweight and general-purpose long context representations via self-study Patronen: Leichte und universelle lange Kontextdarstellungen durch Selbststudium Cartridges:轻量和一般用途长背景介绍,通过自学 2506.06266v3

Authors (11): Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, Christopher Re

Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model’s effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.

nan


Article 553

Title@2025-06-13 (5): Schema-R1: A reasoning training approach for schema linking in Text-to-SQL Task

Title: Schema-R1: A reasoning training approach for schema linking in Text-to-SQL Task Schema-R1: Ein argumentierender Schulungsansatz für die Schemaverknüpfung in Text-zu-SQL-Aufgabe Schema-R1:在文本到SQL任务中将系统图案联系起来的推理培训方法 2506.11986v1

Authors (3): Wuzhenghong Wen, Su Pan, yuwei Sun

Schema linking is a critical step in Text-to-SQL task, aiming to accurately predict the table names and column names required for the SQL query based on the given question. However, current fine-tuning approaches for schema linking models employ a rote-learning paradigm, excessively optimizing for ground truth schema linking outcomes while compromising reasoning ability. This limitation arises because of the difficulty in acquiring a high-quality reasoning sample for downstream tasks. To address this, we propose Schema-R1, a reasoning schema linking model trained using reinforcement learning. Specifically, Schema-R1 consists of three key steps: constructing small batches of high-quality reasoning samples, supervised fine-tuning for cold-start initialization, and rule-based reinforcement learning training. The final results demonstrate that our method effectively enhances the reasoning ability of the schema linking model, achieving a 10\% improvement in filter accuracy compared to the existing method. Our code is available at https://github.com/hongWin/Schema-R1/.

nan


Article 554

Title@2025-06-13 (5): e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

Title: e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs e3: Erforschen lernen ermöglicht Extrapolation von Test-Time Compute für LLMs e3: 学习探索以利对LLMM的试验时间计算进行外推计算 2506.09026v2

Authors (8): Amrith Setlur, Matthew Y. R. Yang, Charlie Snell, Jeremy Greer, Ian Wu, Virginia Smith, Max Simchowitz, Aviral Kumar

Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep “thinking” for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging “negative” gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chains additional asymmetries; and (3) coupling task difficulty with training token budget during training via a specifically-designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME’25 and HMMT’25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.

nan


Article 555

Title@2025-06-13 (5): Improving Large Language Models with Concept-Aware Fine-Tuning

Title: Improving Large Language Models with Concept-Aware Fine-Tuning Große Sprachmodelle mit konzeptorientiertem Feintuning verbessern 改进概念软件微调大语言模式 2506.07833v2

Authors (4): Michael K. Chen, Xikun Zhang, Jiaxing Huang, Dacheng Tao

Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Take the phrase “ribonucleic acid” as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments (“rib”, “on”, …), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements compared to conventional next-token finetuning methods across diverse tasks, including traditional applications like text summarization and domain-specific ones like de novo protein design. Multi-token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi-token setting to the post-training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen-lab/caft-llm

nan


Article 556

Title@2025-06-13 (5): Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English

Title: Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English Auswirkungen der Rahmensätze auf Sprachtokenizer: Eine Fallstudie zu Mandarin und Englisch 《框架率对语言控制器的影响:普通话和英语案例研究》 2505.17076v3

Authors (10): Haoyang Zhang, Hexin Liu, Xiangyu Zhang, Qiquan Zhang, Yuchen Hu, Junqi Zhao, Fei Tian, Xuerui Yang, Leibny Paola Garcia, Eng Siong Chng

The speech tokenizer plays a crucial role in recent speech tasks, generally serving as a bridge between speech signals and language models. While low-frame-rate codecs are widely employed as speech tokenizers, the impact of frame rates on speech tokens remains underexplored. In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. We encode speech at different frame rates and evaluate the resulting semantic tokens in the speech recognition task. Our findings reveal that frame rate variations influence speech tokenization differently for each language, highlighting the interplay between frame rates, phonetic density, and language-specific acoustic features. The results provide insights into optimizing frame rate selection for speech tokenizers, with implications for automatic speech recognition, text-to-speech, and other speech-related applications.

nan


Article 557

Title@2025-06-13 (5): Factual Knowledge in Language Models: Robustness and Anomalies under Simple Temporal Context Variations

Title: Factual Knowledge in Language Models: Robustness and Anomalies under Simple Temporal Context Variations Factual Knowledge in Language Models: Robustheit und Anomalien unter einfachen zeitlichen Kontextvariationen 语言模型中的事实知识:简单时间环境变化下的强力和异常现象 2502.01220v5

Authors (5): Hichem Ammar Khodja, Frédéric Béchet, Quentin Brabant, Alexis Nasr, Gwénolé Lecorvé

This paper explores the robustness of language models (LMs) to variations in the temporal context within factual knowledge. It examines whether LMs can correctly associate a temporal context with a past fact valid over a defined period, by asking them to differentiate correct from incorrect contexts. The LMs’ ability to distinguish is analyzed along two dimensions: the distance of the incorrect context from the validity period and the granularity of the context. To this end, a dataset called TimeStress is introduced, enabling the evaluation of 18 diverse LMs. Results reveal that the best LM achieves a perfect distinction for only 11% of the studied facts, with errors, certainly rare, but critical that humans would not make. This work highlights the limitations of current LMs in temporal representation.

nan


Article 558

Title@2025-06-13 (5): Enhancing multimodal analogical reasoning with Logic Augmented Generation

Title: Enhancing multimodal analogical reasoning with Logic Augmented Generation Verbesserung multimodaler analoger Argumentation mit Logic Augmented Generation 增强与逻辑增强型一代的多式联运模拟推理 2504.11190v2

Authors (3): Anna Sofia Lippolis, Andrea Giovanni Nuzzolese, Aldo Gangemi

Recent advances in Large Language Models have demonstrated their capabilities across a variety of tasks. However, automatically extracting implicit knowledge from natural language remains a significant challenge, as machines lack active experience with the physical world. Given this scenario, semantic knowledge graphs can serve as conceptual spaces that guide the automated text generation reasoning process to achieve more efficient and explainable results. In this paper, we apply a logic-augmented generation (LAG) framework that leverages the explicit representation of a text through a semantic knowledge graph and applies it in combination with prompt heuristics to elicit implicit analogical connections. This method generates extended knowledge graph triples representing implicit meaning, enabling systems to reason on unlabeled multimodal data regardless of the domain. We validate our work through three metaphor detection and understanding tasks across four datasets, as they require deep analogical reasoning capabilities. The results show that this integrated approach surpasses current baselines, performs better than humans in understanding visual metaphors, and enables more explainable reasoning processes, though still has inherent limitations in metaphor understanding, especially for domain-specific metaphors. Furthermore, we propose a thorough error analysis, discussing issues with metaphorical annotations and current evaluation methods.

nan


Article 559

Title@2025-06-13 (5): Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations

Title: Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations Erklärbarkeit großer Sprachmodelle mit SMILE: Statistische Modell-agnostische Interpretierbarkeit mit lokalen Erklärungen 使用SMILE解释大语言模型的可解释性:统计模型 – – 与当地解释的可解释性 2505.21657v2

Authors (4): Zeinab Dehghani, Mohammed Naveed Akram, Koorosh Aslansefat, Adil Khan

Large language models like GPT, LLAMA, and Claude have become incredibly powerful at generating text, but they are still black boxes, so it is hard to understand how they decide what to say. That lack of transparency can be problematic, especially in fields where trust and accountability matter. To help with this, we introduce SMILE, a new method that explains how these models respond to different parts of a prompt. SMILE is model-agnostic and works by slightly changing the input, measuring how the output changes, and then highlighting which words had the most impact. Create simple visual heat maps showing which parts of a prompt matter the most. We tested SMILE on several leading LLMs and used metrics such as accuracy, consistency, stability, and fidelity to show that it gives clear and reliable explanations. By making these models easier to understand, SMILE brings us one step closer to making AI more transparent and trustworthy.

nan


Article 560

Title@2025-06-13 (5): Improving Large Language Model Safety with Contrastive Representation Learning

Title: Improving Large Language Model Safety with Contrastive Representation Learning Verbesserung der Sicherheit von großen Sprachmodellen mit kontrasem Repräsentationslernen 改进大语文示范语文安全,同时进行差异代表制学习 2506.11938v1

Authors (4): Samuel Simko, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin

Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at https://github.com/samuelsimko/crl-llm-defense

nan


Article 561

Title@2025-06-13 (5): Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback

Title: Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback Feedback Friction: LLMs kämpfen, um externes Feedback vollständig zu integrieren 反响:LLMs 争取充分吸收外部反馈 2506.11930v1

Authors (5): Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, Daniel Khashabi

Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs’ ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.

nan


Article 562

Title@2025-06-13 (5): LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

Title: LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? LiveCodeBench Pro: Wie beurteilen Olympiad-Medaillengewinner LLMs im Wettbewerbsprogramm? LifoCodeBench Pro:奥林匹亚奖章获得者如何在竞争性方案规划中评判LMs? 2506.11928v1

Authors (19): Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, Peiyao Sheng, Zixuan Wang, Wenhao Chai, Aleksandra Korolova, Peter Henderson, Sanjeev Arora, Pramod Viswanath, Jingbo Shang, Saining Xie

Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.

nan


Article 563

Title@2025-06-13 (5): T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

Title: T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling T1: Sprachmodell weiter voranbringen durch Stärkung des Lernens und Ableiten von Skalen T1:通过强化学习和推论扩大规模,推进语文模式 2501.11651v2

Authors (9): Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, Yuxiao Dong

Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration, recent attempts yield modest improvements in complex reasoning. In this paper, we present T1 to scale RL by encouraging exploration and understand inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through oversampling. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1’s better performance without any additional verification.

nan


Article 564

Title@2025-06-13 (5): Effectiveness of Counter-Speech against Abusive Content: A Multidimensional Annotation and Classification Study

Title: Effectiveness of Counter-Speech against Abusive Content: A Multidimensional Annotation and Classification Study Wirksamkeit der Gegenrede gegen missbräuchliche Inhalte: Eine mehrdimensionale Annotation und Klassifikationsstudie 反言论对滥用内容的效力:多层面说明和分类研究 2506.11919v1

Authors (3): Greta Damo, Elena Cabrio, Serena Villata

Counter-speech (CS) is a key strategy for mitigating online Hate Speech (HS), yet defining the criteria to assess its effectiveness remains an open challenge. We propose a novel computational framework for CS effectiveness classification, grounded in social science concepts. Our framework defines six core dimensions - Clarity, Evidence, Emotional Appeal, Rebuttal, Audience Adaptation, and Fairness - which we use to annotate 4,214 CS instances from two benchmark datasets, resulting in a novel linguistic resource released to the community. In addition, we propose two classification strategies, multi-task and dependency-based, achieving strong results (0.94 and 0.96 average F1 respectively on both expert- and user-written CS), outperforming standard baselines, and revealing strong interdependence among dimensions.

nan


Article 565

Title@2025-06-13 (5): GeistBERT: Breathing Life into German NLP

Title: GeistBERT: Breathing Life into German NLP GeistBERT: Das Leben in die deutsche NLP einatmen 呼吸生命化为德国NLP 2506.11903v1

Authors (2): Raphael Scheible-Schmitt, Johann Frei

Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. It was pre-trained using fairseq with standard hyperparameters, initialized from GottBERT weights, and trained on a large-scale German corpus using Whole Word Masking (WWM). Based on the pre-trained model, we derived extended-input variants using Nystr"omformer and Longformer architectures with support for sequences up to 8k tokens. While these long-context models were not evaluated on dedicated long-context benchmarks, they are included in our release. We assessed all models on NER (CoNLL 2003, GermEval 2014) and text classification (GermEval 2018 fine/coarse, 10kGNAD) using $F_1$ score and accuracy. The GeistBERT models achieved strong performance, leading all tasks among the base models and setting a new state-of-the-art (SOTA). Notably, the base models outperformed larger models in several tasks. To support the German NLP research community, we are releasing GeistBERT under the MIT license.

nan


Article 566

Title: TreeRL: LLM Reinforcement Learning with On-Policy Tree Search TreeRL: LLM-Verstärktes Lernen mit On-Policy-Baumsuche TreeRL: LLM 与政策树搜索的LLM 强化学习 2506.11902v1

Authors (6): Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, Yuxiao Dong

Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training but remains under-explored in On-Policy LLM RL. We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training. Our approach includes intermediate supervision and eliminates the need for a separate reward model training. Existing approaches typically train a separate process reward model, which can suffer from distribution mismatch and reward hacking. We also introduce a cost-effective tree search approach that achieves higher search efficiency under the same generation token budget by strategically branching from high-uncertainty intermediate steps rather than using random branching. Experiments on challenging math and code reasoning benchmarks demonstrate that TreeRL achieves superior performance compared to traditional ChainRL, highlighting the potential of tree search for LLM. TreeRL is open-sourced at https://github.com/THUDM/TreeRL.

nan


Article 567

Title@2025-06-13 (5): Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation

Title: Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation Graph of Attacks with Pruning: Optimierung der Stealthy Jailbreak Prompt Generation für verbesserte LLM Content Moderation 使用普林宁攻击图:优化用于强化 LLM 内容调控的隐形监狱破获快速生成 2501.18638v2

Authors (5): Daniel Schwartz, Dmitriy Bespalov, Zhe Wang, Ninad Kulkarni, Yanjun Qi

As large language models (LLMs) become increasingly prevalent, ensuring their robustness against adversarial misuse is crucial. This paper introduces the GAP (Graph of Attacks with Pruning) framework, an advanced approach for generating stealthy jailbreak prompts to evaluate and enhance LLM safeguards. GAP addresses limitations in existing tree-based LLM jailbreak methods by implementing an interconnected graph structure that enables knowledge sharing across attack paths. Our experimental evaluation demonstrates GAP’s superiority over existing techniques, achieving a 20.8% increase in attack success rates while reducing query costs by 62.7%. GAP consistently outperforms state-of-the-art methods for attacking both open and closed LLMs, with attack success rates of >96%. Additionally, we present specialized variants like GAP-Auto for automated seed generation and GAP-VLM for multimodal attacks. GAP-generated prompts prove highly effective in improving content moderation systems, increasing true positive detection rates by 108.5% and accuracy by 183.6% when used for fine-tuning. Our implementation is available at https://github.com/dsbuddy/GAP-LLM-Safety.

nan


Article 568

Title@2025-06-13 (5): Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

Title: Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache Jenseits der homogenen Aufmerksamkeit: Speichereffiziente LLMs über Fourier-Approximated KV Cache 超越同异族注意:通过Fourier-Apbeard KV Cache 的记忆-节能LMLM 2506.11886v1

Authors (12): Xiaoran Liu, Siyang He, Qiqi Wang, Ruixiao Li, Yuerong Song, Zhigeng Liu, Linlin Li, Qun Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu

Large Language Models struggle with memory demands from the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). Besides, a custom Triton kernel, FlashFourierAttention, is designed to optimize memory via streamlined read-write operations, enabling efficient deployment without performance compromise.

nan


Article 569

Title@2025-06-13 (5): Addressing Bias in LLMs: Strategies and Application to Fair AI-based Recruitment

Title: Addressing Bias in LLMs: Strategies and Application to Fair AI-based Recruitment Bias in LLMs ansprechen: Strategien und Anwendung für eine faire KI-basierte Rekrutierung 解决LLMM中的偏见:公平基于大赦国际的招聘战略和应用 2506.11880v1

Authors (6): Alejandro Peña, Julian Fierrez, Aythami Morales, Gonzalo Mancera, Miguel Lopez, Ruben Tolosana

The use of language technologies in high-stake settings is increasing in recent years, mostly motivated by the success of Large Language Models (LLMs). However, despite the great performance of LLMs, they are are susceptible to ethical concerns, such as demographic biases, accountability, or privacy. This work seeks to analyze the capacity of Transformers-based systems to learn demographic biases present in the data, using a case study on AI-based automated recruitment. We propose a privacy-enhancing framework to reduce gender information from the learning pipeline as a way to mitigate biased behaviors in the final tools. Our experiments analyze the influence of data biases on systems built on two different LLMs, and how the proposed framework effectively prevents trained systems from reproducing the bias in the data.

nan


Article 570

Title@2025-06-13 (5): SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning

Title: SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning SAP-Bench: Benchmarking multimodaler Großsprachenmodelle in der operativen Aktionsplanung SAP-Bench:在外科行动规划中确定多式大语言模式基准 2506.07196v2

Authors (6): Mengya Xu, Zhongzhen Huang, Dillan Imans, Yiru Ye, Xiaofan Zhang, Qi Dou

Effective evaluation is critical for driving advancements in MLLM research. The surgical action planning (SAP) task, which aims to generate future action sequences from visual inputs, demands precise and sophisticated analytical capabilities. Unlike mathematical reasoning, surgical decision-making operates in life-critical domains and requires meticulous, verifiable processes to ensure reliability and patient safety. This task demands the ability to distinguish between atomic visual actions and coordinate complex, long-horizon procedures, capabilities that are inadequately evaluated by current benchmarks. To address this gap, we introduce SAP-Bench, a large-scale, high-quality dataset designed to enable multimodal large language models (MLLMs) to perform interpretable surgical action planning. Our SAP-Bench benchmark, derived from the cholecystectomy procedures context with the mean duration of 1137.5s, and introduces temporally-grounded surgical action annotations, comprising the 1,226 clinically validated action clips (mean duration: 68.7s) capturing five fundamental surgical actions across 74 procedures. The dataset provides 1,152 strategically sampled current frames, each paired with the corresponding next action as multimodal analysis anchors. We propose the MLLM-SAP framework that leverages MLLMs to generate next action recommendations from the current surgical scene and natural language instructions, enhanced with injected surgical domain knowledge. To assess our dataset’s effectiveness and the broader capabilities of current models, we evaluate seven state-of-the-art MLLMs (e.g., OpenAI-o1, GPT-4o, QwenVL2.5-72B, Claude-3.5-Sonnet, GeminiPro2.5, Step-1o, and GLM-4v) and reveal critical gaps in next action prediction performance.

nan


Article 571

Title@2025-06-13 (5): Long-context Non-factoid Question Answering in Indic Languages

Title: Long-context Non-factoid Question Answering in Indic Languages Lang-Kontext Non-factoide Frage-Antworten in indischen Sprachen 长长长 长 长 长 长 长 长 长 非 事实 问 问 问 语 语 语 2504.13615v2

Authors (3): Ritwik Mishra, Rajiv Ratn Shah, Ponnurangam Kumaraguru

Question Answering (QA) tasks, which involve extracting answers from a given context, are relatively straightforward for modern Large Language Models (LLMs) when the context is short. However, long contexts pose challenges due to the quadratic complexity of the self-attention mechanism. This challenge is compounded in Indic languages, which are often low-resource. This study explores context-shortening techniques, including Open Information Extraction (OIE), coreference resolution, Answer Paragraph Selection (APS), and their combinations, to improve QA performance. Compared to the baseline of unshortened (long) contexts, our experiments on four Indic languages (Hindi, Tamil, Telugu, and Urdu) demonstrate that context-shortening techniques yield an average improvement of 4\% in semantic scores and 47\% in token-level scores when evaluated on three popular LLMs without fine-tuning. Furthermore, with fine-tuning, we achieve an average increase of 2\% in both semantic and token-level scores. Additionally, context-shortening reduces computational overhead. Explainability techniques like LIME and SHAP reveal that when the APS model confidently identifies the paragraph containing the answer, nearly all tokens within the selected text receive high relevance scores. However, the study also highlights the limitations of LLM-based QA systems in addressing non-factoid questions, particularly those requiring reasoning or debate. Moreover, verbalizing OIE-generated triples does not enhance system performance. These findings emphasize the potential of context-shortening techniques to improve the efficiency and effectiveness of LLM-based QA systems, especially for low-resource languages. The source code and resources are available at https://github.com/ritwikmishra/IndicGenQA.

nan


Article 572

Title@2025-06-13 (5): Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts

Title: Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts Sicherer oder luckier? LLMs als Sicherheitsevaluatoren sind für Artefakte nicht robust 安全性更安全还是更幸运?作为安全评估员的LLMs没有强力进行人工操作。 2503.09347v2

Authors (2): Hongyu Chen, Seraphina Goldfarb-Tarrant

Large Language Models (LLMs) are increasingly employed as automated evaluators to assess the safety of generated content, yet their reliability in this role remains uncertain. This study evaluates a diverse set of 11 LLM judge models across critical safety domains, examining three key aspects: self-consistency in repeated judging tasks, alignment with human judgments, and susceptibility to input artifacts such as apologetic or verbose phrasing. Our findings reveal that biases in LLM judges can significantly distort the final verdict on which content source is safer, undermining the validity of comparative evaluations. Notably, apologetic language artifacts alone can skew evaluator preferences by up to 98\%. Contrary to expectations, larger models do not consistently exhibit greater robustness, while smaller models sometimes show higher resistance to specific artifacts. To mitigate LLM evaluator robustness issues, we investigate jury-based evaluations aggregating decisions from multiple models. Although this approach both improves robustness and enhances alignment to human judgements, artifact sensitivity persists even with the best jury configurations. These results highlight the urgent need for diversified, artifact-resistant methodologies to ensure reliable safety assessments.

nan


Article 573

Title@2025-06-13 (5): Post Persona Alignment for Multi-Session Dialogue Generation

Title: Post Persona Alignment for Multi-Session Dialogue Generation Post Persona Alignment für Multi-Session Dialog Generation 开展多会议对话的人后协调 2506.11857v1

Authors (4): Yi-Pei Chen, Noriki Nishida, Hideki Nakayama, Yuji Matsumoto

Multi-session persona-based dialogue generation presents challenges in maintaining long-term consistency and generating diverse, personalized responses. While large language models (LLMs) excel in single-session dialogues, they struggle to preserve persona fidelity and conversational coherence across extended interactions. Existing methods typically retrieve persona information before response generation, which can constrain diversity and result in generic outputs. We propose Post Persona Alignment (PPA), a novel two-stage framework that reverses this process. PPA first generates a general response based solely on dialogue context, then retrieves relevant persona memories using the response as a query, and finally refines the response to align with the speaker’s persona. This post-hoc alignment strategy promotes naturalness and diversity while preserving consistency and personalization. Experiments on multi-session LLM-generated dialogue data demonstrate that PPA significantly outperforms prior approaches in consistency, diversity, and persona relevance, offering a more flexible and effective paradigm for long-term personalized dialogue generation.

nan


Article 574

Title@2025-06-13 (5): The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets

Title: The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets Das automatisierte, aber riskante Spiel: Modellierung von Agent-zu-Agent-Verhandlungen und Transaktionen in Verbrauchermärkten 自动但有风险游戏:消费者市场代理对代理谈判和交易的模拟 2506.00073v3

Authors (6): Shenzhe Zhu, Jiao Sun, Yi Nian, Tobin South, Alex Pentland, Jiaxin Pei

AI agents are increasingly used in consumer-facing applications to assist with tasks such as product search, negotiation, and transaction execution. In this paper, we explore a future scenario where both consumers and merchants authorize AI agents to fully automate negotiations and transactions. We aim to answer two key questions: (1) Do different LLM agents vary in their ability to secure favorable deals for users? (2) What risks arise from fully automating deal-making with AI agents in consumer markets? To address these questions, we develop an experimental framework that evaluates the performance of various LLM agents in real-world negotiation and transaction settings. Our findings reveal that AI-mediated deal-making is an inherently imbalanced game – different agents achieve significantly different outcomes for their users. Moreover, behavioral anomalies in LLMs can result in financial losses for both consumers and merchants, such as overspending or accepting unreasonable deals. These results underscore that while automation can improve efficiency, it also introduces substantial risks. Users should exercise caution when delegating business decisions to AI agents.

nan


Article 575

Title@2025-06-13 (5): Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages

Title: Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages Große Sprachmodelle für toxische Spracherkennung in ressourcenarmen Balkansprachen 低资源巴尔干语言中有毒语言探测大语言模式 2506.09992v2

Authors (2): Amel Muminovic, Amela Kadric Muminovic

Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented. We measured precision, recall, F1 score, accuracy and false positive rates. Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives. The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the lowest false alarms. We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration. These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities.

nan


Article 576

Title@2025-06-13 (5): Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation

Title: Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation Mehrsprachige Vision-Sprachenübersetzung neu denken: Datensatz, Evaluation und Anpassung 重新思考多语种愿景语言翻译:数据集、评估和适应 2506.11820v1

Authors (11): Xintong Wang, Jingheng Pan, Yixiao Liu, Xiaohu Zhao, Chenyang Lyu, Minghao Wu, Chris Biemann, Longyue Wang, Linlong Xu, Weihua Luo, Kaifu Zhang

Vision-Language Translation (VLT) is a challenging task that requires accurately recognizing multilingual text embedded in images and translating it into the target language with the support of visual context. While recent Large Vision-Language Models (LVLMs) have demonstrated strong multilingual and visual understanding capabilities, there is a lack of systematic evaluation and understanding of their performance on VLT. In this work, we present a comprehensive study of VLT from three key perspectives: data quality, model architecture, and evaluation metrics. (1) We identify critical limitations in existing datasets, particularly in semantic and cultural fidelity, and introduce AibTrans – a multilingual, parallel, human-verified dataset with OCR-corrected annotations. (2) We benchmark 11 commercial LVLMs/LLMs and 6 state-of-the-art open-source models across end-to-end and cascaded architectures, revealing their OCR dependency and contrasting generation versus reasoning behaviors. (3) We propose Density-Aware Evaluation to address metric reliability issues under varying contextual complexity, introducing the DA Score as a more robust measure of translation quality. Building upon these findings, we establish a new evaluation benchmark for VLT. Notably, we observe that fine-tuning LVLMs on high-resource language pairs degrades cross-lingual performance, and we propose a balanced multilingual fine-tuning strategy that effectively adapts LVLMs to VLT without sacrificing their generalization ability.

nan


Article 577

Title@2025-06-13 (5): On the Performance of LLMs for Real Estate Appraisal

Title: On the Performance of LLMs for Real Estate Appraisal Über die Leistung von LLMs für die Bewertung von Immobilien 房地产评估LLM女士的绩效 2506.11812v1

Authors (5): Margot Geerts, Manon Reusens, Bart Baesens, Seppe vanden Broucke, Jochen De Weerdt

The real estate market is vital to global economies but suffers from significant information asymmetry. This study examines how Large Language Models (LLMs) can democratize access to real estate insights by generating competitive and interpretable house price estimates through optimized In-Context Learning (ICL) strategies. We systematically evaluate leading LLMs on diverse international housing datasets, comparing zero-shot, few-shot, market report-enhanced, and hybrid prompting techniques. Our results show that LLMs effectively leverage hedonic variables, such as property size and amenities, to produce meaningful estimates. While traditional machine learning models remain strong for pure predictive accuracy, LLMs offer a more accessible, interactive and interpretable alternative. Although self-explanations require cautious interpretation, we find that LLMs explain their predictions in agreement with state-of-the-art models, confirming their trustworthiness. Carefully selected in-context examples based on feature similarity and geographic proximity, significantly enhance LLM performance, yet LLMs struggle with overconfidence in price intervals and limited spatial reasoning. We offer practical guidance for structured prediction tasks through prompt optimization. Our findings highlight LLMs’ potential to improve transparency in real estate appraisal and provide actionable insights for stakeholders.

nan


Article 578

Title@2025-06-13 (5): Word Sense Detection Leveraging Maximum Mean Discrepancy

Title: Word Sense Detection Leveraging Maximum Mean Discrepancy Word Sense Detection Leveraging Maximale mittlere Diskrepanz Word Sensense 检测 利用最大平均值差异 2506.01602v2

Authors (1): Kensuke Mitsuzawa

Word sense analysis is an essential analysis work for interpreting the linguistic and social backgrounds. The word sense change detection is a task of identifying and interpreting shifts in word meanings over time. This paper proposes MMD-Sense-Analysis, a novel approach that leverages Maximum Mean Discrepancy (MMD) to select semantically meaningful variables and quantify changes across time periods. This method enables both the identification of words undergoing sense shifts and the explanation of their evolution over multiple historical periods. To my knowledge, this is the first application of MMD to word sense change detection. Empirical assessment results demonstrate the effectiveness of the proposed approach.

nan


Article 579

Title@2025-06-13 (5): Are Multimodal Large Language Models Pragmatically Competent Listeners in Simple Reference Resolution Tasks?

Title: Are Multimodal Large Language Models Pragmatically Competent Listeners in Simple Reference Resolution Tasks? Sind multimodale große Sprachmodelle Pragmatisch kompetente Hörer in einfachen Referenzauflösungsaufgaben? 在简单参考解析任务中,多式大语言模型是否具有实用能力的听众能力? 2506.11807v1

Authors (5): Simeon Junker, Manar Ali, Larissa Koch, Sina Zarrieß, Hendrik Buschmeier

We investigate the linguistic abilities of multimodal large language models in reference resolution tasks featuring simple yet abstract visual stimuli, such as color patches and color grids. Although the task may not seem challenging for today’s language models, being straightforward for human dyads, we consider it to be a highly relevant probe of the pragmatic capabilities of MLLMs. Our results and analyses indeed suggest that basic pragmatic capabilities, such as context-dependent interpretation of color descriptions, still constitute major challenges for state-of-the-art MLLMs.

nan


Article 580

Title@2025-06-13 (5): Unsupervised Document and Template Clustering using Multimodal Embeddings

Title: Unsupervised Document and Template Clustering using Multimodal Embeddings Unüberwachte Dokumenten- und Vorlagen-Clustering mit multimodalen Einbettungen 使用多式嵌入式将无人监督的文档和模板分组 2506.12116v1

Authors (2): Phillipe R. Sampaio, Helene Maxcici

This paper investigates a novel approach to unsupervised document clustering by leveraging multimodal embeddings as input to traditional clustering algorithms such as $k$-Means and DBSCAN. Our method aims to achieve a finer-grained document understanding by not only grouping documents at the type level (e.g., invoices, purchase orders), but also distinguishing between different templates within the same document category. This is achieved by using embeddings that capture textual content, layout information, and visual features of documents. We evaluated the effectiveness of this approach using embeddings generated by several state-of-the-art pretrained multimodal models, including SBERT, LayoutLMv1, LayoutLMv3, DiT, Donut, and ColPali. Our findings demonstrate the potential of multimodal embeddings to significantly enhance document clustering, offering benefits for various applications in intelligent document processing, document layout analysis, and unsupervised document classification. This work provides valuable insight into the advantages and limitations of different multimodal models for this task and opens new avenues for future research to understand and organize document collections.

nan


Article 581

Title@2025-06-13 (5): Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models

Title: Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models Persona-getriebene Simulation des Abstimmungsverhaltens im Europäischen Parlament mit großen Sprachmodellen 欧洲议会以大语言模式模拟投票行为 2506.11798v1

Authors (3): Maximilian Kreutner, Marlene Lutz, Markus Strohmaier

Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse, but have been found to consistently display a progressive left-leaning bias. At the same time, so-called persona or identity prompts have been shown to produce LLM behavior that aligns with socioeconomic groups that the base model is not aligned with. In this work, we analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions and, by aggregation, accurately predict positions of European groups on a diverse set of policies. We evaluate if predictions are stable towards counterfactual arguments, different persona prompts and generation methods. Finally, we find that we can simulate voting behavior of Members of the European Parliament reasonably well with a weighted F1 score of approximately 0.793. Our persona dataset of politicians in the 2024 European Parliament and our code are available at https://github.com/dess-mannheim/european_parliament_simulation.

nan


Article 582

Title@2025-06-13 (5): Eliciting Reasoning in Language Models with Cognitive Tools

Title: Eliciting Reasoning in Language Models with Cognitive Tools Mit kognitiven Tools die Vernunft in Sprachmodellen elizitieren 具有认知工具的语言模型中的 埃利推理 2506.12115v1

Authors (3): Brown Ebouky, Andrea Bartezzaghi, Mattia Rigotti

The recent advent of reasoning models like OpenAI’s o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of replication efforts, particularly from the open source community. These speculations were largely settled by the demonstration from DeepSeek-R1 that chains-of-thought and reinforcement learning (RL) can effectively replicate reasoning on top of base LLMs. However, it remains valuable to explore alternative methods for theoretically eliciting reasoning that could help elucidate the underlying mechanisms, as well as providing additional methods that may offer complementary benefits. Here, we build on the long-standing literature in cognitive psychology and cognitive architectures, which postulates that reasoning arises from the orchestrated, sequential execution of a set of modular, predetermined cognitive operations. Crucially, we implement this key idea within a modern agentic tool-calling framework. In particular, we endow an LLM with a small set of “cognitive tools” encapsulating specific reasoning operations, each executed by the LLM itself. Surprisingly, this simple strategy results in considerable gains in performance on standard mathematical reasoning benchmarks compared to base LLMs, for both closed and open-weight models. For instance, providing our “cognitive tools” to GPT-4.1 increases its pass@1 performance on AIME2024 from 26.7% to 43.3%, bringing it very close to the performance of o1-preview. In addition to its practical implications, this demonstration contributes to the debate regarding the role of post-training methods in eliciting reasoning in LLMs versus the role of inherent capabilities acquired during pre-training, and whether post-training merely uncovers these latent abilities.

nan


Article 583

Title@2025-06-13 (5): MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis

Title: MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis MEDDxAgent: Ein einheitliches Modular-Agenten-Framework für erklärbare automatische Differentialdiagnose MDDAAGent: 可解释自动差异分析统一模块剂框架 2502.19175v2

Authors (6): Daniel Rose, Chia-Chien Hung, Marco Lepri, Israa Alqassem, Kiril Gashteovski, Carolin Lawrence

Differential Diagnosis (DDx) is a fundamental yet complex aspect of clinical decision-making, in which physicians iteratively refine a ranked list of possible diseases based on symptoms, antecedents, and medical knowledge. While recent advances in large language models (LLMs) have shown promise in supporting DDx, existing approaches face key limitations, including single-dataset evaluations, isolated optimization of components, unrealistic assumptions about complete patient profiles, and single-attempt diagnosis. We introduce a Modular Explainable DDx Agent (MEDDxAgent) framework designed for interactive DDx, where diagnostic reasoning evolves through iterative learning, rather than assuming a complete patient profile is accessible. MEDDxAgent integrates three modular components: (1) an orchestrator (DDxDriver), (2) a history taking simulator, and (3) two specialized agents for knowledge retrieval and diagnosis strategy. To ensure robust evaluation, we introduce a comprehensive DDx benchmark covering respiratory, skin, and rare diseases. We analyze single-turn diagnostic approaches and demonstrate the importance of iterative refinement when patient profiles are not available at the outset. Our broad evaluation demonstrates that MEDDxAgent achieves over 10% accuracy improvements in interactive DDx across both large and small LLMs, while offering critical explainability into its diagnostic reasoning process.

nan


Article 584

Title@2025-06-13 (5): Women, Infamous, and Exotic Beings: What Honorific Usages in Wikipedia Reflect on the Cross-Cultural Sociolinguistic Norms?

Title: Women, Infamous, and Exotic Beings: What Honorific Usages in Wikipedia Reflect on the Cross-Cultural Sociolinguistic Norms? Frauen, berüchtigte und exotische Wesen: Welche ehrwürdigen Nutzungen in Wikipedia reflektieren die kulturübergreifenden Soziolinguistischen Normen? 妇女、臭名昭著的人和外来人:维基百科对跨文化社会语言规范的何种荣誉使用? 2501.03479v3

Authors (6): Sourabrata Mukherjee, Atharva Mehta, Soumya Teotia, Sougata Saha, Akhil Arora, Monojit Choudhury

Wikipedia, as a massively multilingual, community-driven platform, is a valuable resource for Natural Language Processing (NLP), yet the consistency of honorific usage in honorific-rich languages remains underexplored. Honorifics, subtle yet profound linguistic markers, encode social hierarchies, politeness norms, and cultural values, but Wikipedia’s editorial guidelines lack clear standards for their usage in languages where such forms are grammatically and socially prevalent. This paper addresses this gap through a large-scale analysis of third-person honorific pronouns and verb forms in Hindi and Bengali Wikipedia articles. Using Large Language Models (LLM), we automatically annotate 10,000 articles per language for honorific usage and socio-demographic features such as gender, age, fame, and cultural origin. We investigate: (i) the consistency of honorific usage across articles, (ii) how inconsistencies correlate with socio-cultural factors, and (iii) the presence of explicit or implicit biases across languages. We find that honorific usage is consistently more common in Bengali than Hindi, while non-honorific forms are more frequent for infamous, juvenile, and exotic entities in both. Notably, gender bias emerges in both languages, particularly in Hindi, where men are more likely to receive honorifics than women. Our analysis highlights the need for Wikipedia to develop language-specific editorial guidelines for honorific usage.

nan


Article 585

Title@2025-06-13 (5): Long-Short Alignment for Effective Long-Context Modeling in LLMs

Title: Long-Short Alignment for Effective Long-Context Modeling in LLMs Lang-Short Alignment für effektive Lang-Kontext-Modellierung in LLMs 为在LLMM中建立有效的长文建模而实现长短期一致 2506.11769v1

Authors (4): Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang

Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. However, their effectiveness remains limited by the fixed context window of the transformer architecture, posing challenges for long-context modeling. Among these challenges, length generalization – the ability to generalize to sequences longer than those seen during training – is a classical and fundamental problem. In this work, we propose a fresh perspective on length generalization, shifting the focus from the conventional emphasis on input features such as positional encodings or data structures to the output distribution of the model. Specifically, through case studies on synthetic tasks, we highlight the critical role of \textbf{long-short alignment} – the consistency of output distributions across sequences of varying lengths. Extending this insight to natural language tasks, we propose a metric called Long-Short Misalignment to quantify this phenomenon, uncovering a strong correlation between the metric and length generalization performance. Building on these findings, we develop a regularization term that promotes long-short alignment during training. Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at https://github.com/PKU-ML/LongShortAlignment.

nan


Article 586

Title@2025-06-13 (5): DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Title: DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents DeepResearch Bench: Ein umfassender Benchmark für Deep Research Agents 深层研究组:深层研究剂综合基准 2506.11763v1

Authors (5): Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao

Deep Research Agents are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports–compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The other framework is introduced to evaluate DRA’s information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at https://github.com/Ayanami0730/deep_research_bench to accelerate the development of practical LLM-based agents.

nan


Article 587

Title@2025-06-13 (5): DART: Distilling Autoregressive Reasoning to Silent Thought

Title: DART: Distilling Autoregressive Reasoning to Silent Thought DART: Destillieren von autoregressiver Begründung zu stillem Denken DART: 提炼沉默思考的自动递减理由 2506.11752v1

Authors (5): Nan Jiang, Ziming Wu, De-Chuan Zhan, Fuming Lai, Shaobing Lian

Chain-of-Thought (CoT) reasoning has significantly advanced Large Language Models (LLMs) in solving complex tasks. However, its autoregressive paradigm leads to significant computational overhead, hindering its deployment in latency-sensitive applications. To address this, we propose \textbf{DART} (\textbf{D}istilling \textbf{A}utoregressive \textbf{R}easoning to Silent \textbf{T}hought), a self-distillation framework that enables LLMs to replace autoregressive CoT with non-autoregressive Silent Thought (ST). Specifically, DART introduces two training pathways: the CoT pathway for traditional reasoning and the ST pathway for generating answers directly from a few ST tokens. The ST pathway utilizes a lightweight Reasoning Evolvement Module (REM) to align its hidden states with the CoT pathway, enabling the ST tokens to evolve into informative embeddings. During inference, only the ST pathway is activated, leveraging evolving ST tokens to deliver the answer directly. Extensive experimental results demonstrate that DART achieves comparable reasoning performance to existing baselines while offering significant efficiency gains, serving as a feasible alternative for efficient reasoning.

nan


Article 588

Title@2025-06-13 (5): Table-R1: Region-based Reinforcement Learning for Table Understanding

Title: Table-R1: Region-based Reinforcement Learning for Table Understanding Tabelle-R1: Regionsbasiertes Verstärkungslernen für Tabellenverständigung 表-R1:以区域为基础的强化学习,以了解表格 2505.12415v2

Authors (10): Zhenhe Wu, Jian Yang, Jiaheng Liu, Xianjie Wu, Changzai Pan, Jie Zhang, Yu Zhao, Shuangyong Song, Yongxiang Li, Zhoujun Li

Tables present unique challenges for language models due to their structured row-column interactions, necessitating specialized approaches for effective comprehension. While large language models (LLMs) have demonstrated potential in table reasoning through prompting and techniques like chain-of-thought (CoT) and program-of-thought (PoT), optimizing their performance for table question answering remains underexplored. In this paper, we introduce region-based Table-R1, a novel reinforcement learning approach that enhances LLM table understanding by integrating region evidence into reasoning steps. Our method employs Region-Enhanced Supervised Fine-Tuning (RE-SFT) to guide models in identifying relevant table regions before generating answers, incorporating textual, symbolic, and program-based reasoning. Additionally, Table-Aware Group Relative Policy Optimization (TARPO) introduces a mixed reward system to dynamically balance region accuracy and answer correctness, with decaying region rewards and consistency penalties to align reasoning steps. Experiments show that Table-R1 achieves an average performance improvement of 14.36 points across multiple base models on three benchmark datasets, even outperforming baseline models with ten times the parameters, while TARPO reduces response token consumption by 67.5% compared to GRPO, significantly advancing LLM capabilities in efficient tabular reasoning.

nan


Article 589

Title@2025-06-13 (5): Quizzard@INOVA Challenge 2025 – Track A: Plug-and-Play Technique in Interleaved Multi-Image Model

Title: Quizzard@INOVA Challenge 2025 – Track A: Plug-and-Play Technique in Interleaved Multi-Image Model Quizzard@INOVA Challenge 2025 – Spur A: Plug-and-Play-Technik im Multi-Image-Modell Quizzad@INOVA 2025年挑战 – – A轨:跨离多图像模型中的插图和布图技术 2506.11737v1

Authors (5): Dinh Viet Cuong, Hoang-Bao Le, An Pham Ngoc Nguyen, Liting Zhou, Cathal Gurrin

This paper addresses two main objectives. Firstly, we demonstrate the impressive performance of the LLaVA-NeXT-interleave on 22 datasets across three different tasks: Multi-Image Reasoning, Documents and Knowledge-Based Understanding and Interactive Multi-Modal Communication. Secondly, we add the Dense Channel Integration (DCI) connector to the LLaVA-NeXT-Interleave and compare its performance against the standard model. We find that the standard model achieves the highest overall accuracy, excelling in vision-heavy tasks like VISION, NLVR2, and Fashion200K. Meanwhile, the DCI-enhanced version shows particular strength on datasets requiring deeper semantic coherence or structured change understanding such as MIT-States_PropertyCoherence and SlideVQA. Our results highlight the potential of combining powerful foundation models with plug-and-play techniques for Interleave tasks. The code is available at https://github.com/dinhvietcuong1996/icme25-inova.

nan


Article 590

Title@2025-06-13 (5): Entropy Controllable Direct Preference Optimization

Title: Entropy Controllable Direct Preference Optimization Entropie kontrollierbare Direktpräferenzoptimierung 直接首选优化 2411.07595v2

Authors (3): Motoki Omura, Yasuhiro Fujita, Toshiki Kataoka

In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach to achieve generation aligned with human preferences. Direct Preference Optimization (DPO) allows for policy training with a simple binary cross-entropy loss without a reward model. The objective of DPO is regularized by reverse KL divergence that encourages mode-seeking fitting to the reference policy. Nonetheless, we indicate that minimizing reverse KL divergence could fail to capture a mode of the reference distribution, which may hurt the policy’s performance. Based on this observation, we propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy, enhancing the distribution’s sharpness and thereby enabling mode-seeking fitting more effectively. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the loss calculation of DPO, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.

nan


Article 591

Title@2025-06-13 (5): VM14K: First Vietnamese Medical Benchmark

Title: VM14K: First Vietnamese Medical Benchmark VM14K: Erster vietnamesischer medizinischer Benchmark VM14K:第一个越南医疗基准 2506.01305v2

Authors (9): Thong Nguyen, Duc Nguyen, Minh Dang, Thai Dao, Long Nguyen, Quan H. Nguyen, Dat Nguyen, Kien Tran, Minh Tran

Medical benchmarks are indispensable for evaluating the capabilities of language models in healthcare for non-English-speaking communities,therefore help ensuring the quality of real-life applications. However, not every community has sufficient resources and standardized methods to effectively build and design such benchmark, and available non-English medical data is normally fragmented and difficult to verify. We developed an approach to tackle this problem and applied it to create the first Vietnamese medical question benchmark, featuring 14,000 multiple-choice questions across 34 medical specialties. Our benchmark was constructed using various verifiable sources, including carefully curated medical exams and clinical records, and eventually annotated by medical experts. The benchmark includes four difficulty levels, ranging from foundational biological knowledge commonly found in textbooks to typical clinical case studies that require advanced reasoning. This design enables assessment of both the breadth and depth of language models’ medical understanding in the target language thanks to its extensive coverage and in-depth subject-specific expertise. We release the benchmark in three parts: a sample public set (4k questions), a full public set (10k questions), and a private set (2k questions) used for leaderboard evaluation. Each set contains all medical subfields and difficulty levels. Our approach is scalable to other languages, and we open-source our data construction pipeline to support the development of future multilingual benchmarks in the medical domain.

nan


Article 592

Title@2025-06-13 (5): The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference

Title: The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference Die Cambrian Explosion von Mixed-Precision Matrix Multiplikation für Quantized Deep Learning Inferenz Cambrian 混合精密矩阵乘数爆炸,用于量测深学习推断 2506.11728v1

Authors (4): Héctor Martínez, Adrián Castelló, Francisco D. Igual, Enrique S. Quintana-Ortí

Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic. This transition enhances computational throughput, reduces memory and bandwidth usage, and improves energy efficiency, offering significant advantages for resource-constrained edge devices. To support this shift, hardware architectures have evolved accordingly, now including adapted ISAs (Instruction Set Architectures) that expose mixed-precision vector units and matrix engines tailored for DL workloads. At the heart of many DL and scientific computing tasks is the general matrix-matrix multiplication gemm, a fundamental kernel historically optimized using axpy vector instructions on SIMD (single instruction, multiple data) units. However, as hardware moves toward mixed-precision dot-product-centric operations optimized for quantized inference, these legacy approaches are being phased out. In response to this, our paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer (MIP) arithmetic across modern ISAs, including x86_64, ARM, and RISC-V. Concretely, we illustrate novel micro-kernel designs and data layouts that better exploit today’s specialized hardware and demonstrate significant performance gains from MIP arithmetic over floating-point implementations across three representative CPU architectures. These contributions highlight a new era of gemm optimization-driven by the demands of DL inference on heterogeneous architectures, marking what we term as the “Cambrian period” for matrix multiplication.

nan


Article 593

Title@2025-06-13 (5): Persistent Topological Features in Large Language Models

Title: Persistent Topological Features in Large Language Models Persistente Topologische Features in großen Sprachmodellen 大语言模式中的持久性有机污染物特征 2410.11042v3

Authors (6): Yuri Gardinazzi, Karthik Viswanathan, Giada Panerai, Alessio Ansuini, Alberto Cazzaniga, Matteo Biagetti

Understanding the decision-making processes of large language models is critical given their widespread applications. To achieve this, we aim to connect a formal mathematical framework - zigzag persistence from topological data analysis - with practical and easily applicable algorithms. Zigzag persistence is particularly effective for characterizing data as it dynamically transforms across model layers. Within this framework, we introduce topological descriptors that measure how topological features, $p$-dimensional holes, persist and evolve throughout the layers. Unlike methods that assess each layer individually and then aggregate the results, our approach directly tracks the full evolutionary path of these features. This offers a statistical perspective on how prompts are rearranged and their relative positions changed in the representation space, providing insights into the system’s operation as an integrated whole. To demonstrate the expressivity and applicability of our framework, we highlight how sensitive these descriptors are to different models and a variety of datasets. As a showcase application to a downstream task, we use zigzag persistence to establish a criterion for layer pruning, achieving results comparable to state-of-the-art methods while preserving the system-level perspective.

nan


Article 594

Title@2025-06-13 (5): Vision-Language Models for Edge Networks: A Comprehensive Survey

Title: Vision-Language Models for Edge Networks: A Comprehensive Survey Vision-Language-Modelle für Edge Networks: Eine umfassende Umfrage 边缘网络远景-语言模型:全面调查 2502.07855v2

Authors (4): Ahmed Sharshar, Latif U. Khan, Waseem Ullah, Mohsen Guizani

Vision Large Language Models (VLMs) combine visual understanding with natural language processing, enabling tasks like image captioning, visual question answering, and video analysis. While VLMs show impressive capabilities across domains such as autonomous vehicles, smart surveillance, and healthcare, their deployment on resource-constrained edge devices remains challenging due to processing power, memory, and energy limitations. This survey explores recent advancements in optimizing VLMs for edge environments, focusing on model compression techniques, including pruning, quantization, knowledge distillation, and specialized hardware solutions that enhance efficiency. We provide a detailed discussion of efficient training and fine-tuning methods, edge deployment challenges, and privacy considerations. Additionally, we discuss the diverse applications of lightweight VLMs across healthcare, environmental monitoring, and autonomous systems, illustrating their growing impact. By highlighting key design strategies, current challenges, and offering recommendations for future directions, this survey aims to inspire further research into the practical deployment of VLMs, ultimately making advanced AI accessible in resource-limited settings.

nan


Article 595

Title@2025-06-13 (5): Configurable Preference Tuning with Rubric-Guided Synthetic Data

Title: Configurable Preference Tuning with Rubric-Guided Synthetic Data Konfigurierbare Präferenz-Tuning mit Rubric-Guided Synthetic Data 使用 Rubric 辅助合成数据进行可配置的优惠税 2506.11702v1

Authors (1): Víctor Gallego

Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, such as training code, generated datasets and fine-tuned models are released at https://github.com/vicgalle/configurable-preference-tuning

nan


Article 596

Title@2025-06-13 (5): Improving Causal Interventions in Amnesic Probing with Mean Projection or LEACE

Title: Improving Causal Interventions in Amnesic Probing with Mean Projection or LEACE Verbesserung der Kausalinterventionen bei der amnesischen Probierung mit mittlerer Projektion oder LEACE 改善在用平均投射或LEACE进行非正常试验时的因果干预 2506.11673v1

Authors (3): Alicja Dobrzeniecka, Antske Fokkens, Pia Sommerauer

Amnesic probing is a technique used to examine the influence of specific linguistic information on the behaviour of a model. This involves identifying and removing the relevant information and then assessing whether the model’s performance on the main task changes. If the removed information is relevant, the model’s performance should decline. The difficulty with this approach lies in removing only the target information while leaving other information unchanged. It has been shown that Iterative Nullspace Projection (INLP), a widely used removal technique, introduces random modifications to representations when eliminating target information. We demonstrate that Mean Projection (MP) and LEACE, two proposed alternatives, remove information in a more targeted manner, thereby enhancing the potential for obtaining behavioural explanations through Amnesic Probing.

nan


Article 597

Title@2025-06-13 (5): LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models

Title: LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models LLaVA-CMoE: Auf dem Weg zu einer kontinuierlichen Mischung von Experten für große Vision-Sprachenmodelle LLavaVA-CMoE:建立大型视觉语言模型专家的连续混合体 2503.21227v2

Authors (8): Hengyuan Zhao, Ziqin Wang, Qixin Sun, Kaiyou Song, Yilin Li, Xiaolin Hu, Qingpei Guo, Si Liu

Mixture of Experts (MoE) architectures have recently advanced the scalability and adaptability of large language models (LLMs) for continual multimodal learning. However, efficiently extending these models to accommodate sequential tasks remains challenging. As new tasks arrive, naive model expansion leads to rapid parameter growth, while modifying shared routing components often causes catastrophic forgetting, undermining previously learned knowledge. To address these issues, we propose LLaVA-CMoE, a continual learning framework for LLMs that requires no replay data of previous tasks and ensures both parameter efficiency and robust knowledge retention. Our approach introduces a Probe-Guided Knowledge Extension mechanism, which uses probe experts to dynamically determine when and where new experts should be added, enabling adaptive and minimal parameter expansion tailored to task complexity. Furthermore, we present a Probabilistic Task Locator that assigns each task a dedicated, lightweight router. To handle the practical issue that task labels are unknown during inference, we leverage a VAE-based reconstruction strategy to identify the most suitable router by matching input distributions, allowing automatic and accurate expert allocation. This design mitigates routing conflicts and catastrophic forgetting, enabling robust continual learning without explicit task labels. Extensive experiments on the CoIN benchmark, covering eight diverse VQA tasks, demonstrate that LLaVA-CMoE delivers strong continual learning performance with a compact model size, significantly reducing forgetting and parameter overhead compared to prior methods. These results showcase the effectiveness and scalability of our approach for parameter-efficient continual learning in large language models. Our code will be open-sourced soon.

nan


Article 598

Title@2025-06-13 (5): Quantum-Inspired Differentiable Integral Neural Networks (QIDINNs): A Feynman-Based Architecture for Continuous Learning Over Streaming Data

Title: Quantum-Inspired Differentiable Integral Neural Networks (QIDINNs): A Feynman-Based Architecture for Continuous Learning Over Streaming Data Quantum-inspirierte differentiable Integral Neural Networks (QIDINNs): Eine Feynman-basierte Architektur für kontinuierliches Lernen über Streaming-Daten 量材激发的有差异的综合神经网络:一个基于费曼的建筑结构,用于对流数据进行持续学习 2506.12111v1

Authors (1): Oscar Boullosa Dapena

Real-time continuous learning over streaming data remains a central challenge in deep learning and AI systems. Traditional gradient-based models such as backpropagation through time (BPTT) face computational and stability limitations when dealing with temporally unbounded data. In this paper, we introduce a novel architecture, Quantum-Inspired Differentiable Integral Neural Networks (QIDINNs), which leverages the Feynman technique of differentiation under the integral sign to formulate neural updates as integrals over historical data. This reformulation allows for smoother, more stable learning dynamics that are both physically interpretable and computationally tractable. Inspired by Feynman’s path integral formalism and compatible with quantum gradient estimation frameworks, QIDINNs open a path toward hybrid classical-quantum neural computation. We demonstrate our model’s effectiveness on synthetic and real-world streaming tasks, and we propose directions for quantum extensions and scalable implementations.

nan


Article 599

Title@2025-06-13 (5): Can reasoning models comprehend mathematical problems in Chinese ancient texts? An empirical study based on data from Suanjing Shishu

Title: Can reasoning models comprehend mathematical problems in Chinese ancient texts? An empirical study based on data from Suanjing Shishu Können Argumentationsmodelle mathematische Probleme in chinesischen alten Texten verstehen? Eine empirische Studie basierend auf Daten von Suanjing Shishu 推理模型能理解中国古经中的数学问题吗? 2505.16660v3

Authors (4): Chang Liu, Dongbo Wang, Liu liu, Zhixiao Zhao

This study addresses the challenges in intelligent processing of Chinese ancient mathematical classics by constructing Guji_MATH, a benchmark for evaluating classical texts based on Suanjing Shishu. It systematically assesses the mathematical problem-solving capabilities of mainstream reasoning models under the unique linguistic constraints of classical Chinese. Through machine-assisted annotation and manual verification, 538 mathematical problems were extracted from 8 canonical texts, forming a structured dataset centered on the “Question-Answer-Solution” framework, supplemented by problem types and difficulty levels. Dual evaluation modes–closed-book (autonomous problem-solving) and open-book (reproducing classical solution methods)–were designed to evaluate the performance of six reasoning models on ancient Chinese mathematical problems. Results indicate that reasoning models can partially comprehend and solve these problems, yet their overall performance remains inferior to benchmarks on modern mathematical tasks. Enhancing models’ classical Chinese comprehension and cultural knowledge should be prioritized for optimization. This study provides methodological support for mining mathematical knowledge from ancient texts and disseminating traditional culture, while offering new perspectives for evaluating cross-linguistic and cross-cultural capabilities of reasoning models.

nan


Article 600

Title@2025-06-13 (5): Converting Annotated Clinical Cases into Structured Case Report Forms

Title: Converting Annotated Clinical Cases into Structured Case Report Forms Umwandlung von annotierten klinischen Fällen in strukturierte Fallberichtsformulare 将附加说明的临床病例转换成结构化个案报告表格 2506.11666v1

Authors (3): Pietro Ferrazzi, Alberto Lavelli, Bernardo Magnini

Case Report Forms (CRFs) are largely used in medical research as they ensure accuracy, reliability, and validity of results in clinical studies. However, publicly available, wellannotated CRF datasets are scarce, limiting the development of CRF slot filling systems able to fill in a CRF from clinical notes. To mitigate the scarcity of CRF datasets, we propose to take advantage of available datasets annotated for information extraction tasks and to convert them into structured CRFs. We present a semi-automatic conversion methodology, which has been applied to the E3C dataset in two languages (English and Italian), resulting in a new, high-quality dataset for CRF slot filling. Through several experiments on the created dataset, we report that slot filling achieves 59.7% for Italian and 67.3% for English on a closed Large Language Models (zero-shot) and worse performances on three families of open-source models, showing that filling CRFs is challenging even for recent state-of-the-art LLMs. We release the datest at https://huggingface.co/collections/NLP-FBK/e3c-to-crf-67b9844065460cbe42f80166

nan


Article 601

Title@2025-06-13 (5): LoRA-Gen: Specializing Large Language Model via Online LoRA Generation

Title: LoRA-Gen: Specializing Large Language Model via Online LoRA Generation LoRA-Gen: Großes Sprachmodell über Online spezialisieren LoRA Generation LoRA-Gen:通过在线LORA生成专门化大语言模式 2506.11638v1

Authors (7): Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Yixiao Ge, Xiu Li, Ying Shan

Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to domain-specific tasks, particularly for small edge-side models. We propose the LoRA-Gen framework, which utilizes a large cloud-side model to generate LoRA parameters for edge-side models based on task descriptions. By employing the reparameterization technique, we merge the LoRA parameters into the edge-side model to achieve flexible specialization. Our method facilitates knowledge transfer between models while significantly improving the inference efficiency of the specialized model by reducing the input context length. Without specialized training, LoRA-Gen outperforms conventional LoRA fine-tuning, which achieves competitive accuracy and a 2.1x speedup with TinyLLaMA-1.1B in reasoning tasks. Besides, our method delivers a compression ratio of 10.1x with Gemma-2B on intelligent agent tasks.

nan


Article 602

Title@2025-06-13 (5): Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Title: Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model Schritt-Audio-AQAA: ein vollständig von Ende zu Ende ausdrucksstarkes großes Audio-Sprachenmodell 渐进-AQAAA:全端到端全端表达式大音频语言模型 2506.08967v2

Authors (76): Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang, Zixin Zhang, Bin Wang, Bo Li, Buyun Ma, Changxin Miao, Changyi Wan, Chen Xu, Dapeng Shi, Dingyuan Hu, Enle Liu, Guanzhe Huang, Gulin Yan, Hanpeng Hu, Haonan Jia, Jiahao Gong, Jiaoren Wu, Jie Wu, Jie Yang, Junzhe Lin, Kaixiang Li, Lei Xia, Longlong Gu, Ming Li, Nie Hao, Ranchen Ming, Shaoliang Pang, Siqi Liu, Song Yuan, Tiancheng Cao, Wen Li, Wenqing He, Xu Zhao, Xuelin Zhang, Yanbo Yu, Yinmin Zhong, Yu Zhou, Yuanwei Liang, Yuanwei Lu, Yuxiang Yang, Zidong Yang, Zili Zhang, Binxing Jiao, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Daxin Jiang, Shuchang Zhou, Chen Hu

Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merge to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming the state-of-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of token-based vocoder in enhancing overall performance for AQAA tasks.

nan


Article 603

Title@2025-06-13 (5): SceneGram: Conceptualizing and Describing Tangrams in Scene Context

Title: SceneGram: Conceptualizing and Describing Tangrams in Scene Context SceneGram: Konzeptualisieren und Beschreiben von Tangrammen im Szenekontext CceneGram: 在景象背景下对Tangrams进行概念化和描述 2506.11631v1

Authors (2): Simeon Junker, Sina Zarrieß

Research on reference and naming suggests that humans can come up with very different ways of conceptualizing and referring to the same object, e.g. the same abstract tangram shape can be a “crab”, “sink” or “space ship”. Another common assumption in cognitive science is that scene context fundamentally shapes our visual perception of objects and conceptual expectations. This paper contributes SceneGram, a dataset of human references to tangram shapes placed in different scene contexts, allowing for systematic analyses of the effect of scene context on conceptualization. Based on this data, we analyze references to tangram shapes generated by multimodal LLMs, showing that these models do not account for the richness and variability of conceptualizations found in human references.

nan


Article 604

Title@2025-06-13 (5): JBBQ: Japanese Bias Benchmark for Analyzing Social Biases in Large Language Models

Title: JBBQ: Japanese Bias Benchmark for Analyzing Social Biases in Large Language Models JBBQ: Japanischer Bias-Benchmark für die Analyse sozialer Bias in großen Sprachmodellen JBBQ:日本用于分析大语言模式中社会两边情况的基准 2406.02050v4

Authors (8): Hitomi Yanaka, Namgi Han, Ryoma Kumon, Jie Lu, Masashi Takeshita, Ryo Sekizawa, Taisei Kato, Hiromi Arai

With the development of large language models (LLMs), social biases in these LLMs have become a pressing issue. Although there are various benchmarks for social biases across languages, the extent to which Japanese LLMs exhibit social biases has not been fully investigated. In this study, we construct the Japanese Bias Benchmark dataset for Question Answering (JBBQ) based on the English bias benchmark BBQ, with analysis of social biases in Japanese LLMs. The results show that while current open Japanese LLMs with more parameters show improved accuracies on JBBQ, their bias scores increase. In addition, prompts with a warning about social biases and chain-of-thought prompting reduce the effect of biases in model outputs, but there is room for improvement in extracting the correct evidence from contexts in Japanese. Our dataset is available at https://github.com/ynklab/JBBQ_data.

nan


Article 605

Title@2025-06-13 (5): (SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of a Phonetically Balanced Speech Test

Title: (SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of a Phonetically Balanced Speech Test (SimPhon Speech Test): Eine datengetriebene Methode für das Silico-Design und die Validierung eines phonetisch ausgeglichenen Sprachtests (西蒙语音测试):音响平衡语音测试的硅设计和校验数据驱动方法 2506.11620v1

Authors (1): Stefan Bleeck

Traditional audiometry often provides an incomplete characterization of the functional impact of hearing loss on speech understanding, particularly for supra-threshold deficits common in presbycusis. This motivates the development of more diagnostically specific speech perception tests. We introduce the Simulated Phoneme Speech Test (SimPhon Speech Test) methodology, a novel, multi-stage computational pipeline for the in silico design and validation of a phonetically balanced minimal-pair speech test. This methodology leverages a modern Automatic Speech Recognition (ASR) system as a proxy for a human listener to simulate the perceptual effects of sensorineural hearing loss. By processing speech stimuli under controlled acoustic degradation, we first identify the most common phoneme confusion patterns. These patterns then guide the data-driven curation of a large set of candidate word pairs derived from a comprehensive linguistic corpus. Subsequent phases involving simulated diagnostic testing, expert human curation, and a final, targeted sensitivity analysis systematically reduce the candidates to a final, optimized set of 25 pairs (the SimPhon Speech Test-25). A key finding is that the diagnostic performance of the SimPhon Speech Test-25 test items shows no significant correlation with predictions from the standard Speech Intelligibility Index (SII), suggesting the SimPhon Speech Test captures perceptual deficits beyond simple audibility. This computationally optimized test set offers a significant increase in efficiency for audiological test development, ready for initial human trials.

nan


Article 606

Title@2025-06-13 (5): Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis

Title: Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis Auf dem Weg zum Verständnis von Feintuning-Mechanismen von LLMs durch Schaltungsanalyse 通过电路分析了解LLM LMs的微调调整机制 2502.11812v2

Authors (6): Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou

Fine-tuning significantly improves the performance of Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. This paper aims to provide an in-depth interpretation of the fine-tuning process through circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike previous studies (Prakash et al. 2024; Chhabra et al. 2024) that focus on tasks where pre-trained models already perform well, we develop a set of mathematical tasks where fine-tuning yields substantial performance gains, which are closer to the practical setting. In our experiments, we identify circuits at various checkpoints during fine-tuning and examine the interplay between circuit analysis, fine-tuning methods, and task complexities. First, we find that while circuits maintain high node similarity before and after fine-tuning, their edges undergo significant changes, in contrast to prior work that shows circuits only add some additional components after fine-tuning. Based on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA) method, which assigns ranks to layers based on edge changes in the circuits. Experimental results demonstrate that our circuit-based LoRA algorithm achieves an average performance improvement of 2.46% over standard LoRA with similar parameter sizes. Furthermore, we explore how combining circuits from subtasks can enhance fine-tuning in compositional tasks, providing new insights into the design of such tasks and deepening the understanding of circuit dynamics and fine-tuning mechanisms.

nan


Article 607

Title@2025-06-13 (5): VLM@school – Evaluation of AI image understanding on German middle school knowledge

Title: VLM@school – Evaluation of AI image understanding on German middle school knowledge VLM@school – Auswertung des KI-Bildverständnisses über deutsche Mittelschulkenntnisse VLM@school – – 评价AI关于德国中学知识的图像理解 2506.11604v1

Authors (2): René Peinl, Vincent Tischler

This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarial crafted questions. Our findings reveal that even the strongest models achieve less than 45% overall accuracy, with particularly poor performance in music, mathematics, and adversarial settings. Furthermore, the results indicate significant discrepancies between success on popular benchmarks and real-world multimodal understanding. We conclude that middle school-level tasks offer a meaningful and underutilized avenue for stress-testing VLMs, especially in non-English contexts. The dataset and evaluation protocol serve as a rigorous testbed to better understand and improve the visual and linguistic reasoning capabilities of future AI systems.

nan


Article 608

Title@2025-06-13 (5): Are LLMs Good Text Diacritizers? An Arabic and Yorùbá Case Study

Title: Are LLMs Good Text Diacritizers? An Arabic and Yorùbá Case Study Sind LLMs gute Textdiakritisierer? Eine arabische und Yorùbá Fallstudie LLM女士是好文本诊断器吗? 阿拉伯语和YorOuba的案例研究。 2506.11602v1

Authors (3): Hawau Olamide Toyin, Samar M. Magdy, Hanan Aldarmaki

We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yoruba. To enable a rigorous evaluation, we introduce a novel multilingual dataset MultiDiac, with diverse samples that capture a range of diacritic ambiguities. We evaluate 14 LLMs varying in size, accessibility, and language coverage, and benchmark them against 6 specialized diacritization models. Additionally, we fine-tune four small open-source models using LoRA for Yoruba. Our results show that many off-the-shelf LLMs outperform specialized diacritization models for both Arabic and Yoruba, but smaller models suffer from hallucinations. Fine-tuning on a small dataset can help improve diacritization performance and reduce hallucination rates.

nan


Article 609

Title@2025-06-13 (5): Personalized LLM Decoding via Contrasting Personal Preference

Title: Personalized LLM Decoding via Contrasting Personal Preference Personalisiertes LLM-Dekodieren über kontrastierende persönliche Präferenz 通过与个人偏好相违背而解密的个人个人化LLM 2506.12109v1

Authors (4): Hyungjune Bu, Chanjoo Jung, Minjae Kang, Jaehyung Kim

As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose CoPe (Contrasting Personal Preference), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user’s implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L, without relying on external reward models or additional training procedures.

nan


Article 610

Title@2025-06-13 (5): Automatic Construction of Multiple Classification Dimensions for Managing Approaches in Scientific Papers

Title: Automatic Construction of Multiple Classification Dimensions for Managing Approaches in Scientific Papers Automatische Konstruktion mehrerer Klassifizierungsdimensionen für die Verwaltung von Ansätzen in wissenschaftlichen Papieren 科学文件中管理方法的多重分类方面自动构建 2505.23252v2

Authors (2): Bing Ma, Hai Zhuge

Approaches form the foundation for conducting scientific research. Querying approaches from a vast body of scientific papers is extremely time-consuming, and without a well-organized management framework, researchers may face significant challenges in querying and utilizing relevant approaches. Constructing multiple dimensions on approaches and managing them from these dimensions can provide an efficient solution. Firstly, this paper identifies approach patterns using a top-down way, refining the patterns through four distinct linguistic levels: semantic level, discourse level, syntactic level, and lexical level. Approaches in scientific papers are extracted based on approach patterns. Additionally, five dimensions for categorizing approaches are identified using these patterns. This paper proposes using tree structure to represent step and measuring the similarity between different steps with a tree-structure-based similarity measure that focuses on syntactic-level similarities. A collection similarity measure is proposed to compute the similarity between approaches. A bottom-up clustering algorithm is proposed to construct class trees for approach components within each dimension by merging each approach component or class with its most similar approach component or class in each iteration. The class labels generated during the clustering process indicate the common semantics of the step components within the approach components in each class and are used to manage the approaches within the class. The class trees of the five dimensions collectively form a multi-dimensional approach space. The application of approach queries on the multi-dimensional approach space demonstrates that querying within this space ensures strong relevance between user queries and results and rapidly reduces search space through a class-based query mechanism.

nan


Article 611

Title@2025-06-13 (5): Understanding the Repeat Curse in Large Language Models from a Feature Perspective

Title: Understanding the Repeat Curse in Large Language Models from a Feature Perspective Den Wiederholungskurs in großen Sprachmodellen aus einer Feature-Perspektive verstehen 从特写角度理解大语言模式中的重复诅咒 2504.14218v3

Authors (6): Junchi Yao, Shu Yang, Jianhua Xu, Lijie Hu, Mengdi Li, Di Wang

Large language models (LLMs) have made remarkable progress in various domains, yet they often suffer from repetitive text generation, a phenomenon we refer to as the “Repeat Curse”. While previous studies have proposed decoding strategies to mitigate repetition, the underlying mechanism behind this issue remains insufficiently explored. In this work, we investigate the root causes of repetition in LLMs through the lens of mechanistic interpretability. Inspired by recent advances in Sparse Autoencoders (SAEs), which enable monosemantic feature extraction, we propose a novel approach, “Duplicatus Charm”, to induce and analyze the Repeat Curse. Our method systematically identifies “Repetition Features” -the key model activations responsible for generating repetitive outputs. First, we locate the layers most involved in repetition through logit analysis. Next, we extract and stimulate relevant features using SAE-based activation manipulation. To validate our approach, we construct a repetition dataset covering token and paragraph level repetitions and introduce an evaluation pipeline to quantify the influence of identified repetition features. Furthermore, by deactivating these features, we have effectively mitigated the Repeat Curse. The source code of our work is publicly available at: https://github.com/kaustpradalab/repeat-curse-llm

nan


Article 612

Title@2025-06-13 (5): FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference

Title: FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference FlashBack:Effiziente Retrieval-Augmentierte Sprachmodellierung für lange Kontext-Inferenz FlashBack: 有效检索增强长处推断语言建模 2405.04065v4

Authors (5): Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu

Retrieval-Augmented Language Modeling (RALM) by integrating large language models (LLM) with relevant documents from an external corpus is a proven method for enabling the LLM to generate information beyond the scope of its pre-training corpus. Previous work utilizing retrieved content by simply prepending it to the input poses a high runtime issue, which degrades the inference efficiency of the LLMs because they fail to use the Key-Value (KV) cache efficiently. In this paper, we propose FlashBack, a modular RALM designed to improve the inference efficiency of RALM with appending context pattern while maintaining decent performance after fine-tuning by Low-Rank Adaption. FlashBack appends retrieved documents at the end of the context for efficiently utilizing the KV cache instead of prepending them. And we introduce Marking Token as two special prompt tokens for marking the boundary of the appending context during fine-tuning. Our experiments on testing generation quality show that FlashBack can remain decent generation quality in perplexity. And the inference speed of FlashBack is up to $4\times$ faster than the prepending counterpart on a 7B LLM (Llama 2) in the runtime test. Via bypassing unnecessary re-computation, it demonstrates an advancement by achieving significantly faster inference speed, and this heightened efficiency will substantially reduce inferential cost.

nan


Article 613

Title@2025-06-13 (5): DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Title: DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs DaMO: Ein dateneffizienter Multimodal-Orchester für zeitliche Vernunft mit Video-LLMs DaMO: 带有视频LMS的时空理由数据高效多式多式圆板 2506.11558v1

Authors (4): Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen

Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with GPT-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.

nan


Article 614

Title@2025-06-13 (5): From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation

Title: From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation Von Persona zu Person: Die Natürlichkeit durch multiple Diskursbeziehungen verbessern Graph Learning in Personalized Dialogue Generation 从人到人:加强人与人之间的自然特性,在个性化对话生成过程中采用多种不同问题关系图学习 2506.11557v1

Authors (3): Chih-Hao Hsu, Ying-Jia Lin, Hung-Yu Kao

In dialogue generation, the naturalness of responses is crucial for effective human-machine interaction. Personalized response generation poses even greater challenges, as the responses must remain coherent and consistent with the user’s personal traits or persona descriptions. We propose MUDI ($\textbf{Mu}$ltiple $\textbf{Di}$scourse Relations Graph Learning) for personalized dialogue generation. We utilize a Large Language Model to assist in annotating discourse relations and to transform dialogue data into structured dialogue graphs. Our graph encoder, the proposed DialogueGAT model, then captures implicit discourse relations within this structure, along with persona descriptions. During the personalized response generation phase, novel coherence-aware attention strategies are implemented to enhance the decoder’s consideration of discourse relations. Our experiments demonstrate significant improvements in the quality of personalized responses, thus resembling human-like dialogue exchanges.

nan


Article 615

Title@2025-06-13 (5): RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning

Title: RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning RAG+: Verbesserung der Retrieval-Augmented Generation mit anwendungsrelevanter Begründung RAG+:加强利用应用程序软件软件软件软件支持的检索-启动生成 2506.11555v1

Authors (9): Yu Wang, Shiwan Zhao, Ming Fan, Zhihu Wang, Yubo Zhang, Xicheng Zhang, Zhengfan Wang, Heyuan Huang, Ting Liu

The integration of external knowledge through Retrieval-Augmented Generation (RAG) has become foundational in enhancing large language models (LLMs) for knowledge-intensive tasks. However, existing RAG paradigms often overlook the cognitive step of applying knowledge, leaving a gap between retrieved facts and task-specific reasoning. In this work, we introduce RAG+, a principled and modular extension that explicitly incorporates application-aware reasoning into the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and aligned application examples, created either manually or automatically, and retrieves both jointly during inference. This design enables LLMs not only to access relevant information but also to apply it within structured, goal-oriented reasoning processes. Experiments across mathematical, legal, and medical domains, conducted on multiple models, demonstrate that RAG+ consistently outperforms standard RAG variants, achieving average improvements of 3-5%, and peak gains up to 7.5% in complex scenarios. By bridging retrieval with actionable application, RAG+ advances a more cognitively grounded framework for knowledge integration, representing a step toward more interpretable and capable LLMs.

nan


Article 616

Title@2025-06-13 (5): Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective

Title: Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective Bewertung von Impliziten Bias in großen Sprachmodellen durch Angriff aus einer psychometrischen Perspektive 通过从心理角度进行攻击,评价大语言模型中隐含的偏见 2406.14023v4

Authors (5): Yuchen Wen, Keping Bi, Wei Chen, Jiafeng Guo, Xueqi Cheng

As large language models (LLMs) become an important way of information access, there have been increasing concerns that LLMs may intensify the spread of unethical content, including implicit bias that hurts certain populations without explicit harmful words. In this paper, we conduct a rigorous evaluation of LLMs’ implicit bias towards certain demographics by attacking them from a psychometric perspective to elicit agreements to biased viewpoints. Inspired by psychometric principles in cognitive and social psychology, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Incorporating the corresponding attack instructions, we built two benchmarks: (1) a bilingual dataset with biased statements covering four bias types (2.7K instances) for extensive comparative analysis, and (2) BUMBLE, a larger benchmark spanning nine common bias types (12.7K instances) for comprehensive evaluation. Extensive evaluation of popular commercial and open-source LLMs shows that our methods can elicit LLMs’ inner bias more effectively than competitive baselines. Our attack methodology and benchmarks offer an effective means of assessing the ethical risks of LLMs, driving progress toward greater accountability in their development. Our code, data, and benchmarks are available at https://yuchenwen1.github.io/ImplicitBiasEvaluation/.

nan


Article 617

Title@2025-06-13 (5): TrajAgent: An LLM-based Agent Framework for Automated Trajectory Modeling via Collaboration of Large and Small Models

Title: TrajAgent: An LLM-based Agent Framework for Automated Trajectory Modeling via Collaboration of Large and Small Models TrajAgent: Ein LLM-basiertes Agent-Framework für automatisierte Trajektorienmodellierung über die Zusammenarbeit von großen und kleinen Modellen TrajAgendy:一个基于LLM的通过大型和小型模型合作进行自动轨迹建模的LLM代理框架 2410.20445v3

Authors (5): Yuwei Du, Jie Feng, Jie Zhao, Jian Yuan, Yong Li

Trajectory modeling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and public administration. Numerous methods have been proposed to address specific problems within trajectory modeling. However, the heterogeneity of data and the diversity of trajectory tasks make effective and reliable trajectory modeling an important yet highly challenging endeavor, even for domain experts. In this paper, we propose \textit{TrajAgent}, a agent framework powered by large language models (LLMs), designed to facilitate robust and efficient trajectory modeling through automation modeling. This framework leverages and optimizes diverse specialized models to address various trajectory modeling tasks across different datasets effectively. In \textit{TrajAgent}, we first develop \textit{UniEnv}, an execution environment with a unified data and model interface, to support the execution and training of various models. Building on \textit{UniEnv}, we introduce an agentic workflow designed for automatic trajectory modeling across various trajectory tasks and data. Furthermore, we introduce collaborative learning schema between LLM-based agents and small speciallized models, to enhance the performance of the whole framework effectively. Extensive experiments on four tasks using four real-world datasets demonstrate the effectiveness of \textit{TrajAgent} in automated trajectory modeling, achieving a performance improvement of 2.38\%-34.96\% over baseline methods.

nan


Article 618

Title@2025-06-13 (5): LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

Title: LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation LLMEval-Med: Ein echter klinischer Benchmark für medizinische LLMs mit Physician Validation LLMEval-Med:具有物理校验功能的医疗长效LML 医疗长效LMS的现实世界临床基准 2506.04078v2

Authors (16): Ming Zhang, Yujiong Shen, Zelin Li, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang

Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains. The dataset is released in https://github.com/llmeval/LLMEval-Med.

nan


Article 619

Title@2025-06-13 (5): PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts

Title: PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts PFDial: Eine strukturierte Dialog-Instruktion Feinabstimmungsmethode basierend auf UML-Flowcharts PFDial:基于UMML流程图的结构性对话指示调整方法 2503.06706v3

Authors (19): Ming Zhang, Yuhui Wang, Yujiong Shen, Tingyi Yang, Changhao Jiang, Yilong Wu, Shihan Dou, Qinhao Chen, Zhiheng Xi, Zhihao Zhang, Yi Dong, Zhen Wang, Zhihui Fei, Mingyang Wan, Tao Liang, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang

Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on PlantUML specification, each UML flowchart is converted into atomic dialogue units i.e., structured five-tuples. Experimental results demonstrate that a 7B model trained with merely 800 samples, and a 0.5B model trained on total data both can surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o up to 43.88% with an average of 11.00%. We further evaluate models’ performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released in https://github.com/KongLongGeFDU/PFDial.

nan


Article 620

Title@2025-06-13 (5): Brewing Knowledge in Context: Distillation Perspectives on In-Context Learning

Title: Brewing Knowledge in Context: Distillation Perspectives on In-Context Learning Brewing Knowledge in Context: Destillationsperspektiven zum In-Context Learning 内在知识的积累:对内文学习的提炼观点 2506.11516v1

Authors (3): Chengye Li, Haiyun Liu, Yuanxi Li

In-context learning (ICL) allows large language models (LLMs) to solve novel tasks without weight updates. Despite its empirical success, the mechanism behind ICL remains poorly understood, limiting our ability to interpret, improve, and reliably apply it. In this paper, we propose a new theoretical perspective that interprets ICL as an implicit form of knowledge distillation (KD), where prompt demonstrations guide the model to form a task-specific reference model during inference. Under this view, we derive a Rademacher complexity-based generalization bound and prove that the bias of the distilled weights grows linearly with the Maximum Mean Discrepancy (MMD) between the prompt and target distributions. This theoretical framework explains several empirical phenomena and unifies prior gradient-based and distributional analyses. To the best of our knowledge, this is the first to formalize inference-time attention as a distillation process, which provides theoretical insights for future prompt engineering and automated demonstration selection.

nan


Article 621

Title@2025-06-13 (5): Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs

Title: Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs Manager: Aggregation von Erkenntnissen von Unimodal-Experten in Zwei-Tower-VLMs und MLLMs 管理者:从双托式VLM和MLLMS的独式专家中收集透视 2506.11515v1

Authors (4): Xiao Xu, Libo Qin, Wanxiang Che, Min-Yen Kan

Two-Tower Vision–Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it \textit{(i)} suffers from ineffective layer-by-layer utilization of unimodal representations, \textit{(ii)} restricts the flexible exploitation of different levels of unimodal semantic knowledge, and \textit{(iii)} is limited to the evaluation on traditional low-resolution datasets only with the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that introduces the manager in each cross-modal layer. Whether with or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether the multi-grid algorithm is enabled or not. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as a plugin that improves the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance. Code and models are available at https://github.com/LooperXX/ManagerTower.

nan


Article 622

Title@2025-06-13 (5): TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages

Title: TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages TUMLU: Ein einheitliches Sprachverständnis für türkische Sprachen TUMLU:突厥语统一土著语言理解基准 2502.11020v2

Authors (16): Jafar Isbarov, Arofat Akhundjanova, Mammad Hajili, Kavsar Huseynova, Dmitry Gaynullin, Anar Rzayev, Osman Tursun, Aizirek Turdubaeva, Ilshat Saetov, Rinat Kharisov, Saule Belginova, Ariana Kenbayeva, Amina Alisheva, Abdullatif Köksal, Samir Rustamov, Duygu Ataman

Being able to thoroughly assess massive multi-task language understanding (MMLU) capabilities is essential for advancing the applicability of multilingual language models. However, preparing such benchmarks in high quality native language is often costly and therefore limits the representativeness of evaluation datasets. While recent efforts focused on building more inclusive MMLU benchmarks, these are conventionally built using machine translation from high-resource languages, which may introduce errors and fail to account for the linguistic and cultural intricacies of the target languages. In this paper, we address the lack of native language MMLU benchmark especially in the under-represented Turkic language family with distinct morphosyntactic and cultural characteristics. We propose two benchmarks for Turkic language MMLU: TUMLU is a comprehensive, multilingual, and natively developed language understanding benchmark specifically designed for Turkic languages. It consists of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek. We also present TUMLU-mini, a more concise, balanced, and manually verified subset of the dataset. Using this dataset, we systematically evaluate a diverse range of open and proprietary multilingual large language models (LLMs), including Claude, Gemini, GPT, and LLaMA, offering an in-depth analysis of their performance across different languages, subjects, and alphabets. To promote further research and development in multilingual language understanding, we release TUMLU-mini and all corresponding evaluation scripts.

nan


Article 623

Title@2025-06-13 (5): MapQaTor: An Extensible Framework for Efficient Annotation of Map-Based QA Datasets

Title: MapQaTor: An Extensible Framework for Efficient Annotation of Map-Based QA Datasets MapQaTor: Ein umfangreiches Framework für eine effiziente Annotation von kartenbasierten QA-Datensätzen 地图QaTor:以地图为基础的质量评估数据集有效注释的扩展框架 2412.21015v2

Authors (3): Mahir Labib Dihan, Mohammed Eunus Ali, Md Rizwan Parvez

Mapping and navigation services like Google Maps, Apple Maps, OpenStreetMap, are essential for accessing various location-based data, yet they often struggle to handle natural language geospatial queries. Recent advancements in Large Language Models (LLMs) show promise in question answering (QA), but creating reliable geospatial QA datasets from map services remains challenging. We introduce MapQaTor, an extensible open-source framework that streamlines the creation of reproducible, traceable map-based QA datasets. MapQaTor enables seamless integration with any maps API, allowing users to gather and visualize data from diverse sources with minimal setup. By caching API responses, the platform ensures consistent ground truth, enhancing the reliability of the data even as real-world information evolves. MapQaTor centralizes data retrieval, annotation, and visualization within a single platform, offering a unique opportunity to evaluate the current state of LLM-based geospatial reasoning while advancing their capabilities for improved geospatial understanding. Evaluation metrics show that, MapQaTor speeds up the annotation process by at least 30 times compared to manual methods, underscoring its potential for developing geospatial resources, such as complex map reasoning datasets. The website is live at: https://mapqator.github.io/ and a demo video is available at: https://youtu.be/bVv7-NYRsTw.

nan


Article 624

Title@2025-06-13 (5): On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval

Title: On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval Über die Wirksamkeit von Integrationsmethoden für die multimodale Reaktion auf den Dialog 综合方法促进多模式对话应对回溯性融合方法的有效性 2506.11499v1

Authors (4): Seongbo Jang, Seonghyeon Lee, Dongha Lee, Hwanjo Yu

Multimodal chatbots have become one of the major topics for dialogue systems in both research community and industry. Recently, researchers have shed light on the multimodality of responses as well as dialogue contexts. This work explores how a dialogue system can output responses in various modalities such as text and image. To this end, we first formulate a multimodal dialogue response retrieval task for retrieval-based systems as the combination of three subtasks. We then propose three integration methods based on a two-step approach and an end-to-end approach, and compare the merits and demerits of each method. Experimental results on two datasets demonstrate that the end-to-end approach achieves comparable performance without an intermediate step in the two-step approach. In addition, a parameter sharing strategy not only reduces the number of parameters but also boosts performance by transferring knowledge across the subtasks and the modalities.

nan


Article 625

Title@2025-06-13 (5): Lag-Relative Sparse Attention In Long Context Training

Title: Lag-Relative Sparse Attention In Long Context Training Lag-Relative Sparse Aufmerksamkeit im langen Kontext Training 长期培训中的拉格-相对偏差关注 2506.11498v1

Authors (5): Manlai Liang, Wanyi Huang, Mandi Liu, Huaijun Li, Jinlong Li

Large Language Models (LLMs) have made significant strides in natural language processing and generation, yet their ability to handle long-context input remains constrained by the quadratic complexity of attention computation and linear-increasing key-value memory footprint. To reduce computational costs and memory, key-value cache compression techniques are commonly applied at inference time, but this often leads to severe performance degradation, as models are not trained to handle compressed context. Although there are more sophisticated compression methods, they are typically unsuitable for post-training because of their incompatibility with gradient-based optimization or high computation overhead. To fill this gap with no additional parameter and little computation overhead, we propose Lag-Relative Sparse Attention(LRSA) anchored by the LagKV compression method for long context post-training. Our method performs chunk-by-chunk prefilling, which selects the top K most relevant key-value pairs in a fixed-size lagging window, allowing the model to focus on salient historical context while maintaining efficiency. Experimental results show that our approach significantly enhances the robustness of the LLM with key-value compression and achieves better fine-tuned results in the question-answer tuning task.

nan


Article 626

Title@2025-06-13 (5): Relational Schemata in BERT Are Inducible, Not Emergent: A Study of Performance vs. Competence in Language Models

Title: Relational Schemata in BERT Are Inducible, Not Emergent: A Study of Performance vs. Competence in Language Models Relationale Schemata in BERT sind induzierbar, nicht emergent: Eine Leistungsstudie vs. Kompetenz in Sprachmodellen BERT中的关系Schemata是鼓励性的,不是新兴的:对表现与语言模型能力的研究 2506.11485v1

Authors (1): Cole Gawin

While large language models like BERT demonstrate strong empirical performance on semantic tasks, whether this reflects true conceptual competence or surface-level statistical association remains unclear. I investigate whether BERT encodes abstract relational schemata by examining internal representations of concept pairs across taxonomic, mereological, and functional relations. I compare BERT’s relational classification performance with representational structure in [CLS] token embeddings. Results reveal that pretrained BERT enables high classification accuracy, indicating latent relational signals. However, concept pairs organize by relation type in high-dimensional embedding space only after fine-tuning on supervised relation classification tasks. This indicates relational schemata are not emergent from pretraining alone but can be induced via task scaffolding. These findings demonstrate that behavioral performance does not necessarily imply structured conceptual understanding, though models can acquire inductive biases for grounded relational abstraction through appropriate training.

nan


Article 627

Title@2025-06-13 (5): ImmunoFOMO: Are Language Models missing what oncologists see?

Title: ImmunoFOMO: Are Language Models missing what oncologists see? ImmunoFOMO: Fehlt den Sprachmodellen, was Onkologen sehen? ImmunoFOMO:语言模型是否忽略了肿瘤学家所看到的? 2506.11478v1

Authors (5): Aman Sinha, Bogdan-Valentin Popescu, Xavier Coubez, Marianne Clausel, Mathieu Constant

Language models (LMs) capabilities have grown with a fast pace over the past decade leading researchers in various disciplines, such as biomedical research, to increasingly explore the utility of LMs in their day-to-day applications. Domain specific language models have already been in use for biomedical natural language processing (NLP) applications. Recently however, the interest has grown towards medical language models and their understanding capabilities. In this paper, we investigate the medical conceptual grounding of various language models against expert clinicians for identification of hallmarks of immunotherapy in breast cancer abstracts. Our results show that pre-trained language models have potential to outperform large language models in identifying very specific (low-level) concepts.

nan


Article 628

Title@2025-06-13 (5): BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs

Title: BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs BitNet v2: Native 4-Bit-Aktivierungen mit Hadamard-Transformation für 1-Bit-LLMs BitNet v 2: 以 Hadamard 变形为1 位LMs 的本地四位驱动器 2504.18415v2

Authors (3): Hongyu Wang, Shuming Ma, Furu Wei

Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.

nan


Article 629

Title@2025-06-13 (5): AutoGen Driven Multi Agent Framework for Iterative Crime Data Analysis and Prediction

Title: AutoGen Driven Multi Agent Framework for Iterative Crime Data Analysis and Prediction AutoGen Driven Multi Agent Framework für iterative Kriminalität Datenanalyse und Vorhersage 循环犯罪数据分析和预测自动驱动器多剂框架 2506.11475v1

Authors (4): Syeda Kisaa Fatima, Tehreem Zubair, Noman Ahmed, Asifullah Khan

This paper introduces LUCID-MA (Learning and Understanding Crime through Dialogue of Multiple Agents), an innovative AI powered framework where multiple AI agents collaboratively analyze and understand crime data. Our system that consists of three core components: an analysis assistant that highlights spatiotemporal crime patterns, a feedback component that reviews and refines analytical results and a prediction component that forecasts future crime trends. With a well-designed prompt and the LLaMA-2-13B-Chat-GPTQ model, it runs completely offline and allows the agents undergo self-improvement through 100 rounds of communication with less human interaction. A scoring function is incorporated to evaluate agent’s performance, providing visual plots to track learning progress. This work demonstrates the potential of AutoGen-style agents for autonomous, scalable, and iterative analysis in social science domains maintaining data privacy through offline execution.

nan


Article 630

Title@2025-06-13 (5): Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards

Title: Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards Med-PRM: Medizinisches Reasoning-Modell mit schrittweisen, leitfadenverifizierten Prozessbelohnungen Med-PRM:医疗理由说明模型,具有逐步、准则核查的流程奖励 2506.11474v1

Authors (12): Jaehoon Yun, Jiwoong Sohn, Jungwoo Park, Hyunjae Kim, Xiangru Tang, Yanjun Shao, Yonghoe Koo, Minhyeok Ko, Qingyu Chen, Mark Gerstein, Michael Moor, Jaewoo Kang

Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. This limitation is critical in medicine, where identifying and addressing reasoning errors is essential for accurate diagnosis and effective patient care. We introduce Med-PRM, a process reward modeling framework that leverages retrieval-augmented generation to verify each reasoning step against established medical knowledge bases. By verifying intermediate reasoning steps with evidence retrieved from clinical guidelines and literature, our model can precisely assess the reasoning quality in a fine-grained manner. Evaluations on five medical QA benchmarks and two open-ended diagnostic tasks demonstrate that Med-PRM achieves state-of-the-art performance, with improving the performance of base models by up to 13.50% using Med-PRM. Moreover, we demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat, achieving over 80\% accuracy on MedQA for the first time using small-scale models of 8 billion parameters. Our code and data are available at: https://med-prm.github.io/

nan


Article 631

Title@2025-06-13 (5): Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models

Title: Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models Hilft immer mehr zu denken? Test-Time Scaling in vernünftigen Modellen verstehen 理解理性模型中的测试时间缩放 2506.04210v2

Authors (9): Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Yifu Lu, Mengdi Wang, Dinesh Manocha, Furong Huang, Mohammad Ghavamzadeh, Amrit Singh Bedi

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek R1) have led to a popular belief that extending thinking traces using prompts like “Wait” or “Let me rethink” can improve performance. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to “overthinking”. To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance-creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from “more thinking” are not true indicators of improved reasoning, but artifacts stemming from the connection between model uncertainty and evaluation metric. This suggests that test-time scaling through extended thinking is not an effective way to utilize the inference thinking budget. Recognizing these limitations, we introduce an alternative test-time scaling approach, parallel thinking, inspired by Best-of-N sampling. Our method generates multiple independent reasoning paths within the same inference budget and selects the most consistent response via majority vote, achieving up to 20% higher accuracy compared to extended thinking. This provides a simple yet effective mechanism for test-time scaling of reasoning models.

nan


Article 632

Title@2025-06-13 (5): A Gamified Evaluation and Recruitment Platform for Low Resource Language Machine Translation Systems

Title: A Gamified Evaluation and Recruitment Platform for Low Resource Language Machine Translation Systems Gamified Evaluation and Recruitment Platform for Low Resource Language Machine Translation Systems 低资源语言机用翻译系统有色评价和征聘平台 2506.11467v1

Authors (1): Carlos Rafael Catalan

Human evaluators provide necessary contributions in evaluating large language models. In the context of Machine Translation (MT) systems for low-resource languages (LRLs), this is made even more apparent since popular automated metrics tend to be string-based, and therefore do not provide a full picture of the nuances of the behavior of the system. Human evaluators, when equipped with the necessary expertise of the language, will be able to test for adequacy, fluency, and other important metrics. However, the low resource nature of the language means that both datasets and evaluators are in short supply. This presents the following conundrum: How can developers of MT systems for these LRLs find adequate human evaluators and datasets? This paper first presents a comprehensive review of existing evaluation procedures, with the objective of producing a design proposal for a platform that addresses the resource gap in terms of datasets and evaluators in developing MT systems. The result is a design for a recruitment and gamified evaluation platform for developers of MT systems. Challenges are also discussed in terms of evaluating this platform, as well as its possible applications in the wider scope of Natural Language Processing (NLP) research.

nan


Article 633

Title@2025-06-13 (5): MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

Title: MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning MMMG: Ein massiver, multidisziplinärer, multi-Tier-Erzeugungs-Benchmark für Bild-zu-Bild-Reasoning MMMMM: 大量、多学科、多代、多语言的文字到图像推理基准 2506.10963v2

Authors (9): Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian

In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning – a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image’s core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits – low entity fidelity, weak relations, and clutter – with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark’s difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.

nan


Article 634

Title@2025-06-13 (5): Jointly modelling the evolution of social structure and language in online communities

Title: Jointly modelling the evolution of social structure and language in online communities Gemeinsame Modellierung der Entwicklung von sozialer Struktur und Sprache in Online-Communities 联合模拟在线社区社会结构和语言演变 2409.19243v2

Authors (1): Christine de Kock

Group interactions take place within a particular socio-temporal context, which should be taken into account when modelling interactions in online communities. We propose a method for jointly modelling community structure and language over time. Our system produces dynamic word and user representations that can be used to cluster users, investigate thematic interests of groups, and predict group membership. We apply and evaluate our method in the context of a set of misogynistic extremist groups. Our results indicate that this approach outperforms prior models which lacked one of these components (i.e. not incorporating social structure, or using static word embeddings) when evaluated on clustering and embedding prediction tasks. Our method further enables novel types of analyses on online groups, including tracing their response to temporal events and quantifying their propensity for using violent language, which is of particular importance in the context of extremist groups.

nan


Article 635

Title@2025-06-13 (5): Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Title: Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning Lingshu: Ein generalistisches Stiftungsmodell für ein einheitliches multimodales medizinisches Verständnis und Vernunft Lingshu:通用主义基金会统一多式联运医疗理解和理性模式模式 2506.07044v4

Authors (19): LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu’s medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks …

nan


Article 636

Title@2025-06-13 (5): Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

Title: Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model Auf dem Weg zu einem effizienten Sprach-Text gemeinsam innerhalb eines Sprachmodells dekodieren 争取实现在一种语音语言模式内实现高效率的语音-文本联合解码 2506.04518v2

Authors (11): Haibin Wu, Yuxuan Hu, Ruchao Fan, Xiaofei Wang, Kenichi Kumatani, Bo Ren, Jianwei Yu, Heng Lu, Lijuan Wang, Yao Qian, Jinyu Li

Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model, offering a promising direction for spoken dialogue systems. The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies-including the interleaved, and parallel generation paradigms-under a controlled experimental setup using the same base language model, speech tokenizer and training data. Our results show that the interleaved approach achieves the best alignment. However it suffers from slow inference due to long token sequence length. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.

nan


Article 637

Title@2025-06-13 (5): Transferable Post-training via Inverse Value Learning

Title: Transferable Post-training via Inverse Value Learning Übertragbare Nachschulung über Inverse Value Learning 通过反向价值学习进行可转让的后培训 2410.21027v2

Authors (9): Xinyu Lu, Xueru Wen, Yaojie Lu, Bowen Yu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei Han, Yongbin Li

As post-training processes utilize increasingly large datasets and base models continue to grow in size, the computational demands and implementation challenges of existing algorithms are escalating significantly. In this paper, we propose modeling the changes at the logits level during post-training using a separate neural network (i.e., the value network). After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference, enables them to achieve similar capability enhancements. We systematically investigate the best practices for this paradigm in terms of pre-training weights and connection schemes. We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes within the same family, models undergoing continuous pre-training within the same family, and models with different vocabularies across families. In certain cases, it can achieve performance comparable to full-parameter fine-tuning. Furthermore, we explore methods to enhance the transferability of the value model and prevent overfitting to the base model used during training.

nan


Article 638

Title@2025-06-13 (5): AbsenceBench: Language Models Can’t Tell What’s Missing

Title: AbsenceBench: Language Models Can’t Tell What’s Missing AbsenceBench: Sprachmodelle können nicht sagen, was fehlt 缺席时间: 语言模型无法说明缺少什么 2506.11440v1

Authors (6): Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore, Peter West, Chenhao Tan, Ari Holtzman

Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assesses LLMs’ capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to “gaps” in documents since these absences don’t correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models breakdown unexpectedly (AbsenceBench).

nan


Article 639

Title@2025-06-13 (5): Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution’s Characteristics

Title: Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution’s Characteristics Verbesserung der Kalibrierung von Vertrauens-Scores bei der Textgenerierung anhand der Eigenschaften der Output-Distribution 利用产出分配特点改进对文本制作中信任分数的校准 2506.00637v2

Authors (3): Lorenzo Jaime Yu Flores, Ori Ernst, Jackie Chi Kit Cheung

Well-calibrated model confidence scores can improve the usefulness of text generation models. For example, users can be prompted to review predictions with low confidence scores, to prevent models from returning bad or potentially dangerous predictions. However, confidence metrics are not always well calibrated in text generation. One reason is that in generation, there can be many valid answers, which previous methods do not always account for. Hence, a confident model could distribute its output probability among multiple sequences because they are all valid. We propose task-agnostic confidence metrics suited to generation, which rely solely on the probabilities associated with the model outputs without the need for further fine-tuning or heuristics. Using these, we are able to improve the calibration of BART and Flan-T5 on summarization, translation, and QA datasets.

nan


Article 640

Title@2025-06-13 (5): KoGEC : Korean Grammatical Error Correction with Pre-trained Translation Models

Title: KoGEC : Korean Grammatical Error Correction with Pre-trained Translation Models KoGEC : Koreanische Korrektur von Grammatikfehlern mit vortrainierten Übersetzungsmodellen KoGEC: 韩国语法错误校正,采用训练有素的翻译模型 2506.11432v1

Authors (3): Taeeun Kim, Semin Jeong, Youngsook Song

This research introduces KoGEC, a Korean Grammatical Error Correction system using pre--trained translation models. We fine-tuned NLLB (No Language Left Behind) models for Korean GEC, comparing their performance against large language models like GPT-4 and HCX-3. The study used two social media conversation datasets for training and testing. The NLLB models were fine-tuned using special language tokens to distinguish between original and corrected Korean sentences. Evaluation was done using BLEU scores and an “LLM as judge” method to classify error types. Results showed that the fine-tuned NLLB (KoGEC) models outperformed GPT-4o and HCX-3 in Korean GEC tasks. KoGEC demonstrated a more balanced error correction profile across various error types, whereas the larger LLMs tended to focus less on punctuation errors. We also developed a Chrome extension to make the KoGEC system accessible to users. Finally, we explored token vocabulary expansion to further improve the model but found it to decrease model performance. This research contributes to the field of NLP by providing an efficient, specialized Korean GEC system and a new evaluation method. It also highlights the potential of compact, task-specific models to compete with larger, general-purpose language models in specialized NLP tasks.

nan


Article 641

Title@2025-06-13 (5): MAGPIE: Multi-Task Media-Bias Analysis Generalization for Pre-Trained Identification of Expressions

Title: MAGPIE: Multi-Task Media-Bias Analysis Generalization for Pre-Trained Identification of Expressions MAPIE: Multi-Task Media-Bias Analyse Generalisierung zur vortrainierten Identifizierung von Ausdrücken MAGPIE: 多任务媒体-Bias分析 2403.07910v3

Authors (9): Tomáš Horych, Martin Wessel, Jan Philip Wahle, Terry Ruas, Jerome Waßmuth, André Greiner-Petter, Akiko Aizawa, Bela Gipp, Timo Spinde

Media bias detection poses a complex, multifaceted problem traditionally tackled using single-task models and small in-domain datasets, consequently lacking generalizability. To address this, we introduce MAGPIE, the first large-scale multi-task pre-training approach explicitly tailored for media bias detection. To enable pre-training at scale, we present Large Bias Mixture (LBM), a compilation of 59 bias-related tasks. MAGPIE outperforms previous approaches in media bias detection on the Bias Annotation By Experts (BABE) dataset, with a relative improvement of 3.3% F1-score. MAGPIE also performs better than previous models on 5 out of 8 tasks in the Media Bias Identification Benchmark (MBIB). Using a RoBERTa encoder, MAGPIE needs only 15% of finetuning steps compared to single-task approaches. Our evaluation shows, for instance, that tasks like sentiment and emotionality boost all learning, all tasks enhance fake news detection, and scaling tasks leads to the best results. MAGPIE confirms that MTL is a promising approach for addressing media bias detection, enhancing the accuracy and efficiency of existing models. Furthermore, LBM is the first available resource collection focused on media bias MTL.

nan


Article 642

Title@2025-06-13 (5): Deep Sparse Latent Feature Models for Knowledge Graph Completion

Title: Deep Sparse Latent Feature Models for Knowledge Graph Completion Deep Sparse Latent Feature Modelle für die Wissensgraphenvervollständigung 知识图补全深度粗略的内端特性模型 2411.15694v2

Authors (9): Haotian Li, Rui Zhang, Lingzhi Wang, Bin Yu, Youwei Wang, Yuliang Wei, Kai Wang, Richard Yi Da Xu, Bailing Wang

Recent advances in knowledge graph completion (KGC) have emphasized text-based approaches to navigate the inherent complexities of large-scale knowledge graphs (KGs). While these methods have achieved notable progress, they frequently struggle to fully incorporate the global structural properties of the graph. Stochastic blockmodels (SBMs), especially the latent feature relational model (LFRM), offer robust probabilistic frameworks for identifying latent community structures and improving link prediction. This paper presents a novel probabilistic KGC framework utilizing sparse latent feature models, optimized via a deep variational autoencoder (VAE). Our proposed method dynamically integrates global clustering information with local textual features to effectively complete missing triples, while also providing enhanced interpretability of the underlying latent structures. Extensive experiments on four benchmark datasets with varying scales demonstrate the significant performance gains achieved by our method.

nan


Article 643

Title@2025-06-13 (5): Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards

Title: Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards Agent-RLVR: Training Software Engineering Agents über Beratung und Umwelt Belohnungen RLVR: 通过指导和环境奖励培训软件工程代理 2506.11425v1

Authors (6): Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, Sean Hendryx

Reinforcement Learning from Verifiable Rewards (RLVR) has been widely adopted as the de facto method for enhancing the reasoning capabilities of large language models and has demonstrated notable success in verifiable domains like math and competitive programming tasks. However, the efficacy of RLVR diminishes significantly when applied to agentic environments. These settings, characterized by multi-step, complex problem solving, lead to high failure rates even for frontier LLMs, as the reward landscape is too sparse for effective model training via conventional RLVR. In this work, we introduce Agent-RLVR, a framework that makes RLVR effective in challenging agentic settings, with an initial focus on software engineering tasks. Inspired by human pedagogy, Agent-RLVR introduces agent guidance, a mechanism that actively steers the agent towards successful trajectories by leveraging diverse informational cues. These cues, ranging from high-level strategic plans to dynamic feedback on the agent’s errors and environmental interactions, emulate a teacher’s guidance, enabling the agent to navigate difficult solution spaces and promotes active self-improvement via additional environment exploration. In the Agent-RLVR training loop, agents first attempt to solve tasks to produce initial trajectories, which are then validated by unit tests and supplemented with agent guidance. Agents then reattempt with guidance, and the agent policy is updated with RLVR based on the rewards of these guided trajectories. Agent-RLVR elevates the pass@1 performance of Qwen-2.5-72B-Instruct from 9.4% to 22.4% on SWE-Bench Verified. We find that our guidance-augmented RLVR data is additionally useful for test-time reward model training, shown by further boosting pass@1 to 27.8%. Agent-RLVR lays the groundwork for training agents with RLVR in complex, real-world environments where conventional RL methods struggle.

nan


Article 644

Title@2025-06-13 (5): Efficient Long-Context LLM Inference via KV Cache Clustering

Title: Efficient Long-Context LLM Inference via KV Cache Clustering Effiziente Long-Context-LLM-Inferenz über KV Cache-Clustering 通过 KV 缓存群集推断 2506.11418v1

Authors (11): Jie Hu, Shengnan Wang, Yutong He, Ping Gong, Jiawei Yi, Juncheng Zhang, Youhui Bai, Renhai Chen, Gong Zhang, Cheng Li, Kun Yuan

Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce Chelsea, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose Chunked Soft Matching, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. Chelsea then merges the KV cache within each cluster into a single centroid. Additionally, we provide a theoretical analysis of the computational complexity and the optimality of the intra-chunk partitioning strategy. Extensive experiments across various models and long-context benchmarks demonstrate that Chelsea achieves up to 80% reduction in KV cache memory usage while maintaining comparable model performance. Moreover, with minimal computational overhead, Chelsea accelerates the decoding stage of inference by up to 3.19$\times$ and reduces end-to-end latency by up to 2.72$\times$.

nan


Article 645

Title@2025-06-13 (5): RSCF: Relation-Semantics Consistent Filter for Entity Embedding of Knowledge Graph

Title: RSCF: Relation-Semantics Consistent Filter for Entity Embedding of Knowledge Graph RSCF: Relation-Semantik Konsequenter Filter für Entity-Einbettung von Wissensgrafik RSCF: 用于实体嵌入知识图的 关系-语义一致性过滤器 2505.20813v3

Authors (3): Junsik Kim, Jinwook Park, Kangil Kim

In knowledge graph embedding, leveraging relation specific entity transformation has markedly enhanced performance. However, the consistency of embedding differences before and after transformation remains unaddressed, risking the loss of valuable inductive bias inherent in the embeddings. This inconsistency stems from two problems. First, transformation representations are specified for relations in a disconnected manner, allowing dissimilar transformations and corresponding entity embeddings for similar relations. Second, a generalized plug-in approach as a SFBR (Semantic Filter Based on Relations) disrupts this consistency through excessive concentration of entity embeddings under entity-based regularization, generating indistinguishable score distributions among relations. In this paper, we introduce a plug-in KGE method, Relation-Semantics Consistent Filter (RSCF). Its entity transformation has three features for enhancing semantic consistency: 1) shared affine transformation of relation embeddings across all relations, 2) rooted entity transformation that adds an entity embedding to its change represented by the transformed vector, and 3) normalization of the change to prevent scale reduction. To amplify the advantages of consistency that preserve semantics on embeddings, RSCF adds relation transformation and prediction modules for enhancing the semantics. In knowledge graph completion tasks with distance-based and tensor decomposition models, RSCF significantly outperforms state-of-the-art KGE methods, showing robustness across all relations and their frequencies.

nan


Article 646

Title@2025-06-13 (5): Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Title: Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning Erste Prüfung von Wissenschaftlern: Kognitive Fähigkeiten von MLLM durch Wahrnehmung, Verständnis und Vernunft unter Beweis stellen 科学家的第一次考试:通过感知、理解和理性,发现MLLM的认知能力 2506.10521v2

Authors (27): Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai

Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.

nan


Article 647

Title@2025-06-13 (5): Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles

Title: Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles Beschleunigen von Diffusions-Großsprachenmodellen mit SlowFast Sampling: Die drei goldenen Prinzipien 加速传播具有慢速抽样的大型语言模型:三大金原则 2506.10848v2

Authors (5): Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Dongrui Liu, Linfeng Zhang

Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.

nan


Article 648

Title@2025-06-13 (5): Bias Amplification in RAG: Poisoning Knowledge Retrieval to Steer LLMs

Title: Bias Amplification in RAG: Poisoning Knowledge Retrieval to Steer LLMs Bias-Verstärkung in RAG: Vergiftung von Wissen an Steer LLMs RAG中的比值放大:毒性知识检索到STeer LMS 2506.11415v1

Authors (5): Linlin Wang, Tianqing Zhu, Laiqiao Qin, Longxiang Gao, Wanlei Zhou

In Large Language Models, Retrieval-Augmented Generation (RAG) systems can significantly enhance the performance of large language models by integrating external knowledge. However, RAG also introduces new security risks. Existing research focuses mainly on how poisoning attacks in RAG systems affect model output quality, overlooking their potential to amplify model biases. For example, when querying about domestic violence victims, a compromised RAG system might preferentially retrieve documents depicting women as victims, causing the model to generate outputs that perpetuate gender stereotypes even when the original query is gender neutral. To show the impact of the bias, this paper proposes a Bias Retrieval and Reward Attack (BRRA) framework, which systematically investigates attack pathways that amplify language model biases through a RAG system manipulation. We design an adversarial document generation method based on multi-objective reward functions, employ subspace projection techniques to manipulate retrieval results, and construct a cyclic feedback mechanism for continuous bias amplification. Experiments on multiple mainstream large language models demonstrate that BRRA attacks can significantly enhance model biases in dimensions. In addition, we explore a dual stage defense mechanism to effectively mitigate the impacts of the attack. This study reveals that poisoning attacks in RAG systems directly amplify model output biases and clarifies the relationship between RAG system security and model fairness. This novel potential attack indicates that we need to keep an eye on the fairness issues of the RAG system.

nan


Article 649

Title@2025-06-13 (5): Predicting Early-Onset Colorectal Cancer with Large Language Models

Title: Predicting Early-Onset Colorectal Cancer with Large Language Models Frühzeitiger Kolorektalkrebs mit großen Sprachmodellen 以大语言模型预测早期局部直肠癌 2506.11410v1

Authors (6): Wilson Lau, Youngwon Kim, Sravanthi Parasa, Md Enamul Haque, Anand Oka, Jay Nanduri

The incidence rate of early-onset colorectal cancer (EoCRC, age < 45) has increased every year, but this population is younger than the recommended age established by national guidelines for cancer screening. In this paper, we applied 10 different machine learning models to predict EoCRC, and compared their performance with advanced large language models (LLM), using patient conditions, lab results, and observations within 6 months of patient journey prior to the CRC diagnoses. We retrospectively identified 1,953 CRC patients from multiple health systems across the United States. The results demonstrated that the fine-tuned LLM achieved an average of 73% sensitivity and 91% specificity.

nan


Article 650

Title@2025-06-13 (5): LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model

Title: LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model LoRA-Anwender Vorsicht: Ein paar saubere Zeichen können Ihr Feinabstimmungsmodell manipulieren LoRA 用户要小心: 几个精细的 Tokens 可以操纵您的精密模型 2506.11402v1

Authors (4): Pradyut Sekhsaria, Marcel Mateos Salles, Hai Huang, Randall Balestriero

Parameter Efficient FineTuning (PEFT), such as Low-Rank Adaptation (LoRA), aligns pre-trained Large Language Models (LLMs) to particular downstream tasks in a resource-efficient manner. Because efficiency has been the main metric of progress, very little attention has been put in understanding possible catastrophic failures. We uncover one such failure: PEFT encourages a model to search for shortcut solutions to solve its fine-tuning tasks. When very small amount of tokens, e.g., one token per prompt, are correlated with downstream task classes, PEFT makes any pretrained model rely predominantly on that token for decision making. While such spurious tokens may emerge accidentally from incorrect data cleaning, it also opens opportunities for malevolent parties to control a model’s behavior from Seamless Spurious Token Injection (SSTI). In SSTI, a small amount of tokens correlated with downstream classes are injected by the dataset creators. At test time, the finetuned LLM’s behavior can be controlled solely by injecting those few tokens. We apply SSTI across models from three families (Snowflake Arctic, Apple OpenELM, and Meta LLaMA-3) and four diverse datasets (IMDB, Financial Classification, CommonSense QA, and Bias in Bios). Our findings reveal three astonishing behaviors. First, as few as a single token of SSTI is sufficient to steer a model’s decision making. Second, for light SSTI, the reliance on spurious tokens is proportional to the LoRA rank. Lastly, with aggressive SSTI, larger LoRA rank values become preferable to small rank values as it makes the model attend to non-spurious tokens, hence improving robustness.

nan


Article 651

Title@2025-06-13 (5): Curriculum-Guided Layer Scaling for Language Model Pretraining

Title: Curriculum-Guided Layer Scaling for Language Model Pretraining Curriculum-geführte Ebenenskalierung für Sprachmodellvorschulungen 语言示范语言前培训课程-指导层比例表 2506.11389v1

Authors (3): Karanpartap Singh, Neil Band, Ehsan Adeli

As the cost of pretraining large language models grows, there is continued interest in strategies to improve learning efficiency during this core training stage. Motivated by cognitive development, where humans gradually build knowledge as their brains mature, we propose Curriculum-Guided Layer Scaling (CGLS), a framework for compute-efficient pretraining that synchronizes increasing data difficulty with model growth through progressive layer stacking (i.e. gradually adding layers during training). At the 100M parameter scale, using a curriculum transitioning from synthetic short stories to general web data, CGLS outperforms baseline methods on the question-answering benchmarks PIQA and ARC. Pretraining at the 1.2B scale, we stratify the DataComp-LM corpus with a DistilBERT-based classifier and progress from general text to highly technical or specialized content. Our results show that progressively increasing model depth alongside sample difficulty leads to better generalization and zero-shot performance on various downstream benchmarks. Altogether, our findings demonstrate that CGLS unlocks the potential of progressive stacking, offering a simple yet effective strategy for improving generalization on knowledge-intensive and reasoning tasks.

nan


Article 652

Title@2025-06-13 (5): A Variational Approach for Mitigating Entity Bias in Relation Extraction

Title: A Variational Approach for Mitigating Entity Bias in Relation Extraction Ein abwechslungsreicher Ansatz für die Minderung von Entity-Bias in der Beziehungsextraktion 减轻实体在关系中的偏见的变式方法 2506.11381v1

Authors (6): Samuel Mensah, Elena Kochkina, Jabez Magomere, Joy Prakash Sain, Simerjot Kaur, Charese Smiley

Mitigating entity bias is a critical challenge in Relation Extraction (RE), where models often rely excessively on entities, resulting in poor generalization. This paper presents a novel approach to address this issue by adapting a Variational Information Bottleneck (VIB) framework. Our method compresses entity-specific information while preserving task-relevant features. It achieves state-of-the-art performance on relation extraction datasets across general, financial, and biomedical domains, in both indomain (original test sets) and out-of-domain (modified test sets with type-constrained entity replacements) settings. Our approach offers a robust, interpretable, and theoretically grounded methodology.

nan


Article 653

Title@2025-06-13 (5): Large Language Model-Powered Conversational Agent Delivering Problem-Solving Therapy (PST) for Family Caregivers: Enhancing Empathy and Therapeutic Alliance Using In-Context Learning

Title: Large Language Model-Powered Conversational Agent Delivering Problem-Solving Therapy (PST) for Family Caregivers: Enhancing Empathy and Therapeutic Alliance Using In-Context Learning Large Language Model-Powered Conversational Agent liefert Problem-Solving Therapie (PST) für Familienpfleger: Empathie und therapeutische Allianz mit Hilfe von In-Context Learning verbessern 为家庭照料者提供提供解决问题治疗的大型语言示范式对话代理方:利用知识内学习加强同情和治疗联盟 2506.11376v1

Authors (16): Liying Wang, Ph. D., Daffodil Carrington, M. S., Daniil Filienko, M. S., Caroline El Jazmi, M. S., Serena Jinchen Xie, M. S., Martine De Cock, Ph. D., Sarah Iribarren, Ph. D., Weichao Yuwen, Ph. D

Family caregivers often face substantial mental health challenges due to their multifaceted roles and limited resources. This study explored the potential of a large language model (LLM)-powered conversational agent to deliver evidence-based mental health support for caregivers, specifically Problem-Solving Therapy (PST) integrated with Motivational Interviewing (MI) and Behavioral Chain Analysis (BCA). A within-subject experiment was conducted with 28 caregivers interacting with four LLM configurations to evaluate empathy and therapeutic alliance. The best-performing models incorporated Few-Shot and Retrieval-Augmented Generation (RAG) prompting techniques, alongside clinician-curated examples. The models showed improved contextual understanding and personalized support, as reflected by qualitative responses and quantitative ratings on perceived empathy and therapeutic alliances. Participants valued the model’s ability to validate emotions, explore unexpressed feelings, and provide actionable strategies. However, balancing thorough assessment with efficient advice delivery remains a challenge. This work highlights the potential of LLMs in delivering empathetic and tailored support for family caregivers.

nan


Article 654

Title@2025-06-13 (5): Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables

Title: Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables Benchmarking multimodaler LLMs zur Anerkennung und Verständigung über chemische Tabellen 关于识别和了解化学品表格的多模式贷款 2506.11375v1

Authors (10): Yitong Zhou, Mingyue Cheng, Qingyang Mao, Yucong Luo, Qi Liu, Yupeng Li, Xiaohan Zhang, Deguang Liu, Xin Li, Enhong Chen

Chemical tables encode complex experimental knowledge through symbolic expressions, structured variables, and embedded molecular graphics. Existing benchmarks largely overlook this multimodal and domain-specific complexity, limiting the ability of multimodal large language models to support scientific understanding in chemistry. In this work, we introduce ChemTable, a large-scale benchmark of real-world chemical tables curated from the experimental sections of literature. ChemTable includes expert-annotated cell polygons, logical layouts, and domain-specific labels, including reagents, catalysts, yields, and graphical components and supports two core tasks: (1) Table Recognition, covering structure parsing and content extraction; and (2) Table Understanding, encompassing both descriptive and reasoning-oriented question answering grounded in table structure and domain semantics. We evaluated a range of representative multimodal models, including both open-source and closed-source models, on ChemTable and reported a series of findings with practical and conceptual insights. Although models show reasonable performance on basic layout parsing, they exhibit substantial limitations on both descriptive and inferential QA tasks compared to human performance, and we observe significant performance gaps between open-source and closed-source models across multiple dimensions. These results underscore the challenges of chemistry-aware table understanding and position ChemTable as a rigorous and realistic benchmark for advancing scientific reasoning.

nan


Article 655

Title@2025-06-13 (5): FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Title: FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents FreshStack: Bau realistischer Benchmarks für die Bewertung des Retrievals auf technischen Dokumenten 新鲜工具:建立评价技术文件检索情况的现实基准 2504.13128v2

Authors (6): Nandan Thakur, Jimmy Lin, Sam Havens, Michael Carbin, Omar Khattab, Andrew Drozdov

We introduce FreshStack, a holistic framework for automatically building information retrieval (IR) evaluation benchmarks by incorporating challenging questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not improve first-stage retrieval accuracy (two out of five topics) and oracle context helps an LLM generator generate a high-quality RAG answer. We hope FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.

nan


Article 656

Title@2025-06-12 (4): The Biased Samaritan: LLM biases in Perceived Kindness

Title: The Biased Samaritan: LLM biases in Perceived Kindness Der Biased Samaritan: LLM-Voreingenommenheiten in wahrnehmbarer Güte 偏见的撒玛利亚人:见识的品种中的LLM偏见 2506.11361v1

Authors (4): Jack H Fagan, Ruhaan Juyaal, Amy Yue-Ming Yu, Siya Pun

While Large Language Models (LLMs) have become ubiquitous in many fields, understanding and mitigating LLM biases is an ongoing issue. This paper provides a novel method for evaluating the demographic biases of various generative AI models. By prompting models to assess a moral patient’s willingness to intervene constructively, we aim to quantitatively evaluate different LLMs’ biases towards various genders, races, and ages. Our work differs from existing work by aiming to determine the baseline demographic identities for various commercial models and the relationship between the baseline and other demographics. We strive to understand if these biases are positive, neutral, or negative, and the strength of these biases. This paper can contribute to the objective assessment of bias in Large Language Models and give the user or developer the power to account for these biases in LLM output or in training future LLMs. Our analysis suggested two key findings: that models view the baseline demographic as a white middle-aged or young adult male; however, a general trend across models suggested that non-baseline demographics are more willing to help than the baseline. These methodologies allowed us to distinguish these two biases that are often tangled together.

nan


Article 657

Title@2025-06-12 (4): D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model

Title: D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model D-GEN: Automatische Distraktorgenerierung und Bewertung zur zuverlässigen Bewertung des Generativen Modells D-GEN:为可靠评估生成模型的可靠评估而自动生成和评估 2504.13439v2

Authors (2): Grace Byun, Jinho D. Choi

Evaluating generative models with open-ended generation is challenging due to inconsistencies in response formats. Multiple-choice (MC) evaluation mitigates this issue, but generating high-quality distractors is time-consuming and labor-intensive. We introduce D-GEN, the first open-source distractor generator model that transforms open-ended data into an MC format. To evaluate distractor quality, we propose two novel methods: (1) ranking alignment, ensuring generated distractors retain the discriminatory power of ground-truth distractors, and (2) entropy analysis, comparing model confidence distributions. Our results show that D-GEN preserves ranking consistency (Spearman’s rho 0.99, Kendall’s tau 0.94) and closely matches the entropy distribution of ground-truth distractors. Human evaluation further confirms the fluency, coherence, distractiveness, and incorrectness. Our work advances robust and efficient distractor generation with automated evaluation, setting a new standard for MC evaluation.

nan


Article 658

Title@2025-06-12 (4): GLAP: General contrastive audio-text pretraining across domains and languages

Title: GLAP: General contrastive audio-text pretraining across domains and languages GLAP: Allgemeines kontrastreiches Audio-Text-Vortraining über Domains und Sprachen hinweg GLAP: 跨领域和不同语言的一般有对比性音频-文字预培训 2506.11350v1

Authors (10): Heinrich Dinkel, Zhiyong Yan, Tianzi Wang, Yongqing Wang, Xingwei Sun, Yadong Niu, Jizhong Liu, Gang Li, Junbo Zhang, Jian Luan

Contrastive Language Audio Pretraining (CLAP) is a widely-used method to bridge the gap between audio and text domains. Current CLAP methods enable sound and music retrieval in English, ignoring multilingual spoken content. To address this, we introduce general language audio pretraining (GLAP), which expands CLAP with multilingual and multi-domain abilities. GLAP demonstrates its versatility by achieving competitive performance on standard audio-text retrieval benchmarks like Clotho and AudioCaps, while significantly surpassing existing methods in speech retrieval and classification tasks. Additionally, GLAP achieves strong results on widely used sound-event zero-shot benchmarks, while simultaneously outperforming previous methods on speech content benchmarks. Further keyword spotting evaluations across 50 languages emphasize GLAP’s advanced multilingual capabilities. Finally, multilingual sound and music understanding is evaluated across four languages. Checkpoints and Source: https://github.com/xiaomi-research/dasheng-glap.

nan


Article 659

Title@2025-06-12 (4): Do We Still Need Audio? Rethinking Speaker Diarization with a Text-Based Approach Using Multiple Prediction Models

Title: Do We Still Need Audio? Rethinking Speaker Diarization with a Text-Based Approach Using Multiple Prediction Models Brauchen wir noch Audio? Überdenken der Lautsprecher-Diarisierung mit einem textbasierten Ansatz mit mehreren Vorhersagemodellen 我们还需要音频吗?用使用多种预测模型的基于文本的方法重新思考议长的对分法 2506.11344v1

Authors (2): Peilin Wu, Jinho D. Choi

We present a novel approach to Speaker Diarization (SD) by leveraging text-based methods focused on Sentence-level Speaker Change Detection within dialogues. Unlike audio-based SD systems, which are often challenged by audio quality and speaker similarity, our approach utilizes the dialogue transcript alone. Two models are developed: the Single Prediction Model (SPM) and the Multiple Prediction Model (MPM), both of which demonstrate significant improvements in identifying speaker changes, particularly in short conversations. Our findings, based on a curated dataset encompassing diverse conversational scenarios, reveal that the text-based SD approach, especially the MPM, performs competitively against state-of-the-art audio-based SD systems, with superior performance in short conversational contexts. This paper not only showcases the potential of leveraging linguistic features for SD but also highlights the importance of integrating semantic understanding into SD systems, opening avenues for future research in multimodal and semantic feature-based diarization.

nan


Article 660

Title@2025-06-12 (4): From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

Title: From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review Von der Replikation zum Redesign: Paarweise Vergleiche für LLM-basierte Peer Review 从复制到重新设计:为基于LLM的同侪审查探索对称比较 2506.11343v1

Authors (7): Yaohui Zhang, Haijing Zhang, Wenlong Ji, Tianyu Hua, Nick Haber, Hancheng Cao, Weixin Liang

The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows. Despite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process. In this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from substantial pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably a reduced novelty in research topics and an increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity.

nan


Article 661

Title@2025-06-12 (4): Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly

Title: Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly Surprisal aus größeren Transformer-basierten Sprachmodellen prognostiziert fMRI-Daten schlechter 以大变压器为基础的以大变压器为基础的语言模型的超常性语言模型对FMRI数据的预测更差 2506.11338v1

Authors (2): Yi-Chien Lin, William Schuler

As Transformers become more widely incorporated into natural language processing tasks, there has been considerable interest in using surprisal from these models as predictors of human sentence processing difficulty. Recent work has observed a positive relationship between Transformer-based models’ perplexity and the predictive power of their surprisal estimates on reading times, showing that language models with more parameters and trained on more data are less predictive of human reading times. However, these studies focus on predicting latency-based measures (i.e., self-paced reading times and eye-gaze durations) with surprisal estimates from Transformer-based language models. This trend has not been tested on brain imaging data. This study therefore evaluates the predictive power of surprisal estimates from 17 pre-trained Transformer-based models across three different language families on two functional magnetic resonance imaging datasets. Results show that the positive relationship between model perplexity and model fit still obtains, suggesting that this trend is not specific to latency-based measures and can be generalized to neural measures.

nan


Article 662

Title@2025-06-12 (4): Don’t Pay Attention

Title: Don’t Pay Attention Achte nicht auf mich. 千万不要留意 2506.11305v1

Authors (2): Mohammad Hammoud, Devang Acharya

The Transformer has become the de facto standard for large language models and a wide range of downstream tasks across various domains. Despite its numerous advantages like inherent training parallelism, the Transformer still faces key challenges due to its inability to effectively process sequences beyond a fixed context window and the quadratic complexity of its attention mechanism. These challenges have renewed interest in RNN-like architectures, which offer linear scaling with sequence length and improved handling of long-range dependencies, albeit with limited parallelism due to their inherently recurrent nature. In this paper, we propose Avey, a new neural foundational architecture that breaks away from both attention and recurrence. Avey comprises a ranker and an autoregressive neural processor, which collaboratively identify and contextualize only the most relevant tokens for any given token, regardless of their positions in the sequence. Specifically, Avey decouples sequence length from context width, thus enabling effective processing of arbitrarily long sequences. Experimental results show that Avey compares favorably to the Transformer across a variety of standard short-range NLP benchmarks, while notably excelling at capturing long-range dependencies.

nan


Article 663

Title@2025-06-12 (4): Deep Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions

Title: Deep Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions Deep Binding of Language Model Virtual Personas: eine Studie über die Annäherung der politischen Partisanen-Misswahrnehmungen 语言模拟虚拟人:关于政治党派近似误解的研究 2504.11673v2

Authors (6): Minwoo Kang, Suhong Moon, Seung Hyeong Lee, Ayush Raj, Joseph Suh, David M. Chan

Large language models (LLMs) are increasingly capable of simulating human behavior, offering cost-effective ways to estimate user responses to various surveys and polls. However, the questions in these surveys usually reflect socially understood attitudes: the patterns of attitudes of old/young, liberal/conservative, as understood by both members and non-members of those groups. It is not clear whether the LLM binding is \emph{deep}, meaning the LLM answers as a member of a particular in-group would, or \emph{shallow}, meaning the LLM responds as an out-group member believes an in-group member would. To explore this difference, we use questions that expose known in-group/out-group biases. This level of fidelity is critical for applying LLMs to various political science studies, including timely topics on polarization dynamics, inter-group conflict, and democratic backsliding. To this end, we propose a novel methodology for constructing virtual personas with synthetic user ``backstories” generated as extended, multi-turn interview transcripts. Our generated backstories are longer, rich in detail, and consistent in authentically describing a singular individual, compared to previous methods. We show that virtual personas conditioned on our backstories closely replicate human response distributions (up to an 87\% improvement as measured by Wasserstein Distance) and produce effect sizes that closely match those observed in the original studies of in-group/out-group biases. Altogether, our work extends the applicability of LLMs beyond estimating socially understood responses, enabling their use in a broader range of human studies.

nan


Article 664

Title@2025-06-12 (4): Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning

Title: Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning Beyond Random Sampling: Effizientes Sprachmodell Vortraining über Curriculum Learning 超越随机抽样:通过课程学习进行高效语言模式预科培训 2506.11300v1

Authors (5): Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, Michalis Vazirgiannis

Curriculum learning has shown promise in improving training efficiency and generalization in various machine learning domains, yet its potential in pretraining language models remains underexplored, prompting our work as the first systematic investigation in this area. We experimented with different settings, including vanilla curriculum learning, pacing-based sampling, and interleaved curricula-guided by six difficulty metrics spanning linguistic and information-theoretic perspectives. We train models under these settings and evaluate their performance on eight diverse benchmarks. Our experiments reveal that curriculum learning consistently improves convergence in early and mid-training phases, and can yield lasting gains when used as a warmup strategy with up to $3.5\%$ improvement. Notably, we identify compression ratio, lexical diversity, and readability as effective difficulty signals across settings. Our findings highlight the importance of data ordering in large-scale pretraining and provide actionable insights for scalable, data-efficient model development under realistic training scenarios.

nan


Article 665

Title@2025-06-12 (4): Ad Auctions for LLMs via Retrieval Augmented Generation

Title: Ad Auctions for LLMs via Retrieval Augmented Generation Anzeigenauktionen für LLMs via Retrieval Augmented Generation 通过回收增量一代对LLMs的拍卖 2406.09459v2

Authors (4): MohammadTaghi Hajiaghayi, Sébastien Lahaie, Keivan Rezaei, Suho Shin

In the field of computational advertising, the integration of ads into the outputs of large language models (LLMs) presents an opportunity to support these services without compromising content integrity. This paper introduces novel auction mechanisms for ad allocation and pricing within the textual outputs of LLMs, leveraging retrieval-augmented generation (RAG). We propose a segment auction where an ad is probabilistically retrieved for each discourse segment (paragraph, section, or entire output) according to its bid and relevance, following the RAG framework, and priced according to competing bids. We show that our auction maximizes logarithmic social welfare, a new notion of welfare that balances allocation efficiency and fairness, and we characterize the associated incentive-compatible pricing rule. These results are extended to multi-ad allocation per segment. An empirical evaluation validates the feasibility and effectiveness of our approach over several ad auction scenarios, and exhibits inherent tradeoffs in metrics as we allow the LLM more flexibility to allocate ads.

nan


Article 666

Title@2025-06-12 (4): Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer

Title: Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer Aufmerksamkeit ruft, MLP-Erinnerungen: Entwirren von trainierbaren Komponenten im Transformer 注意检索, MLP 记忆: 变换器中拆分可训练部件 2506.01115v2

Authors (4): Yihe Dong, Lorenzo Noci, Mikhail Khodak, Mufan Li

The Transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of algorithmic tasks – including mathematical reasoning, memorization, and retrieval – using only gradient-based training on next-token prediction. While the core component of a Transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard Transformers to variants in which either the multi-layer perceptron (MLP) layers or the attention projectors (queries and keys) are frozen at initialization. To further isolate the contribution of attention, we introduce MixiT – the Mixing Transformer – a simplified, principled model in which the attention coefficients are entirely random and fixed at initialization, eliminating any input-dependent computation or learning in attention. Surprisingly, we find that MixiT matches the performance of fully trained Transformers on various algorithmic tasks, especially those involving basic arithmetic or focusing heavily on memorization. For retrieval-based tasks, we observe that having input-dependent attention coefficients is consistently beneficial, while MixiT underperforms. We attribute this failure to its inability to form specialized circuits such as induction heads – a specific circuit known to be crucial for learning and exploiting repeating patterns in input sequences. Even more interestingly, we find that attention with frozen key and query projectors is not only able to form induction heads, but can also perform competitively on language modeling. Our results underscore the importance of architectural heterogeneity, where distinct components contribute complementary inductive biases crucial for solving different classes of tasks.

nan


Article 667

Title@2025-06-12 (4): ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

Title: ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness ColorBench: Können VLMs die bunte Welt sehen und verstehen? Ein umfassender Maßstab für Farbwahrnehmung, Vernunft und Robustheit 颜色贝因: VLMs 能看到和理解多色世界吗? 色彩感知、理性和强健的综合基准 2504.10514v2

Authors (10): Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou

Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracies and robustness, though they are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBenchcan serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.

nan


Article 668

Title@2025-06-12 (4): Learning a Continue-Thinking Token for Enhanced Test-Time Scaling

Title: Learning a Continue-Thinking Token for Enhanced Test-Time Scaling Ein weiterdenkendes Token für verbesserte Testzeitskalierung lernen 学习 继续思考 提高测试时间缩放 2506.11274v1

Authors (3): Liran Ringel, Elad Tolochinsky, Yaniv Romano

Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing “</think>” with “Wait”) can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment a distilled version of DeepSeek-R1 with a single learned “< continue-thinking >” token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., “Wait”) for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model’s accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.

nan


Article 669

Title@2025-06-12 (4): Attuned to Change: Causal Fine-Tuning under Latent-Confounded Shifts

Title: Attuned to Change: Causal Fine-Tuning under Latent-Confounded Shifts Eingestimmt auf den Wandel: Kausales Feintuning unter latent-begründeten Verschiebungen 与变化相接:在长期、有根据的变更下,因果罚款 2410.14375v2

Authors (7): Jialin Yu, Yuxiang Zhou, Yulan He, Nevin L. Zhang, Junchi Yu, Philip Torr, Ricardo Silva

Adapting to latent-confounded shifts remains a core challenge in modern AI. These shifts are propagated via latent variables that induce spurious, non-transportable correlations between inputs and labels. One practical failure mode arises when fine-tuning pre-trained foundation models on confounded data (e.g., where certain text tokens or image backgrounds spuriously correlate with the label), leaving models vulnerable at deployment. We frame causal fine-tuning as an identification problem and pose an explicit causal model that decomposes inputs into low-level spurious features and high-level causal representations. Under this family of models, we formalize the assumptions required for identification. Using pre-trained language models as a case study, we show how identifying and adjusting these components during causal fine-tuning enables automatic adaptation to latent-confounded shifts at test time. Experiments on semi-synthetic benchmarks derived from real-world problems demonstrate that our method outperforms black-box domain generalization baselines, illustrating the benefits of explicitly modeling causal structure.

nan


Article 670

Title@2025-06-12 (4): PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling

Title: PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling PANDAS: Besseres Viele-Schuss-Jailbreaking durch positive Affirmation, negative Demonstration und adaptive Sampling PANDAS:通过积极肯定、负面示范和适应性抽样改进多射破牢房 2502.01925v2

Authors (3): Avery Ma, Yangchen Pan, Amir-massoud Farahmand

Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question-answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt’s topic. We also introduce ManyHarm, a dataset of harmful question-answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.

nan


Article 671

Title@2025-06-12 (4): No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning

Title: No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning Keine universelle Aufforderung: Vereinheitlichung der Vernunft durch adaptive Aufforderung für zeitliche Tabellenveranlagung 无通用即时:通过调适性提示来统一时间表合理性的理由 2506.11246v1

Authors (5): Kushagra Dixit, Abhishek Rajgaria, Harshavardhan Kalalbandi, Dan Roth, Vivek Gupta

Temporal Table Reasoning is a critical challenge for Large Language Models (LLMs), requiring effective prompting techniques to extract relevant insights. Despite existence of multiple prompting methods, their impact on table reasoning remains largely unexplored. Furthermore, the performance of these models varies drastically across different table and context structures, making it difficult to determine an optimal approach. This work investigates multiple prompting technique across diverse table types to determine optimal approaches for different scenarios. We find that performance varies based on entity type, table structure, requirement of additional context and question complexity, with NO single method consistently outperforming others. To mitigate these challenges, we introduce SEAR, an adaptive prompting framework inspired by human reasoning that dynamically adjusts based on context characteristics and integrates a structured reasoning. Our results demonstrate that SEAR achieves superior performance across all table types compared to other baseline prompting techniques. Additionally, we explore the impact of table structure refactoring, finding that a unified representation enhances model’s reasoning.

nan


Article 672

Title@2025-06-12 (4): Iterative Multilingual Spectral Attribute Erasure

Title: Iterative Multilingual Spectral Attribute Erasure Iteratives Mehrsprachiges Spektralattribut Löschen 多语种多语种光谱属性错乱 2506.11244v1

Authors (6): Shun Shao, Yftah Ziser, Zheng Zhao, Yifu Qiu, Shay B. Cohen, Anna Korhonen

Multilingual representations embed words with similar meanings to share a common semantic space across languages, creating opportunities to transfer debiasing effects between languages. However, existing methods for debiasing are unable to exploit this opportunity because they operate on individual languages. We present Iterative Multilingual Spectral Attribute Erasure (IMSAE), which identifies and mitigates joint bias subspaces across multiple languages through iterative SVD-based truncation. Evaluating IMSAE across eight languages and five demographic dimensions, we demonstrate its effectiveness in both standard and zero-shot settings, where target language data is unavailable, but linguistically similar languages can be used for debiasing. Our comprehensive experiments across diverse language models (BERT, LLaMA, Mistral) show that IMSAE outperforms traditional monolingual and cross-lingual approaches while maintaining model utility.

nan


Article 673

Title@2025-06-12 (4): RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation?

Title: RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation? RETUYT-INCO bei BEA 2025 Shared Task: Wie weit können Leichtbaumodelle in der KI-powered Tutor Evaluation gehen? BEA 2025共同任务:轻量级模型在AI驱动导师评价中能走多远? 2506.11243v1

Authors (6): Santiago Góngora, Ignacio Sastre, Santiago Robaina, Ignacio Remersaro, Luis Chiruzzo, Aiala Rosá

In this paper, we present the RETUYT-INCO participation at the BEA 2025 shared task. Our participation was characterized by the decision of using relatively small models, with fewer than 1B parameters. This self-imposed restriction tries to represent the conditions in which many research labs or institutions are in the Global South, where computational power is not easily accessible due to its prohibitive cost. Even under this restrictive self-imposed setting, our models managed to stay competitive with the rest of teams that participated in the shared task. According to the $exact\ F_1$ scores published by the organizers, the performance gaps between our models and the winners were as follows: $6.46$ in Track 1; $10.24$ in Track 2; $7.85$ in Track 3; $9.56$ in Track 4; and $13.13$ in Track 5. Considering that the minimum difference with a winner team is $6.46$ points – and the maximum difference is $13.13$ – according to the $exact\ F_1$ score, we find that models with a size smaller than 1B parameters are competitive for these tasks, all of which can be run on computers with a low-budget GPU or even without a GPU.

nan


Article 674

Title@2025-06-12 (4): LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation

Title: LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation LLM-as-a-Richter für die referenzlose automatische Codevalidierung und -Verfeinerung für natürliche Sprache in der IT-Automatisierung zu Bash LLM-as-a-Judg 信息技术自动化中自然语言的无参考自动代码校验和精炼至巴什语的无参考自动码校验和精炼LLM-as-a-Judg 2506.11237v1

Authors (3): Ngoc Phuoc An Vo, Brent Paulovicks, Vadim Sheinin

In an effort to automatically evaluate and select the best model and improve code quality for automatic incident remediation in IT Automation, it is crucial to verify if the generated code for remediation action is syntactically and semantically correct and whether it can be executed correctly as intended. There are three approaches: 1) conventional methods use surface form similarity metrics (token match, exact match, etc.) which have numerous limitations, 2) execution-based evaluation focuses more on code functionality based on pass/fail judgments for given test-cases, and 3) LLM-as-a-Judge employs LLMs for automated evaluation to judge if it is a correct answer for a given problem based on pre-defined metrics. In this work, we focused on enhancing LLM-as-a-Judge using bidirectional functionality matching and logic representation for reference-less automatic validation and refinement for Bash code generation to select the best model for automatic incident remediation in IT Automation. We used execution-based evaluation as ground-truth to evaluate our LLM-as-a-Judge metrics. Results show high accuracy and agreement with execution-based evaluation (and up to 8% over baseline). Finally, we built Reflection code agents to utilize judgments and feedback from our evaluation metrics which achieved significant improvement (up to 24% increase in accuracy) for automatic code refinement.

nan


Article 675

Title@2025-06-12 (4): LLM-as-a-Fuzzy-Judge: Fine-Tuning Large Language Models as a Clinical Evaluation Judge with Fuzzy Logic

Title: LLM-as-a-Fuzzy-Judge: Fine-Tuning Large Language Models as a Clinical Evaluation Judge with Fuzzy Logic LLM-as-a-Fuzzy-Judge: Feintuning große Sprachmodelle als klinischer Bewertungsrichter mit Fuzzy Logic LLM-as-a-Fuzzy-Judge:作为Fuzzy逻辑临床评估法官的精准大语言模型 2506.11221v1

Authors (6): Weibing Zheng, Laurah Turner, Jess Kropczynski, Murat Ozer, Tri Nguyen, Shane Halse

Clinical communication skills are critical in medical education, and practicing and assessing clinical communication skills on a scale is challenging. Although LLM-powered clinical scenario simulations have shown promise in enhancing medical students’ clinical practice, providing automated and scalable clinical evaluation that follows nuanced physician judgment is difficult. This paper combines fuzzy logic and Large Language Model (LLM) and proposes LLM-as-a-Fuzzy-Judge to address the challenge of aligning the automated evaluation of medical students’ clinical skills with subjective physicians’ preferences. LLM-as-a-Fuzzy-Judge is an approach that LLM is fine-tuned to evaluate medical students’ utterances within student-AI patient conversation scripts based on human annotations from four fuzzy sets, including Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction. The methodology of this paper started from data collection from the LLM-powered medical education system, data annotation based on multidimensional fuzzy sets, followed by prompt engineering and the supervised fine-tuning (SFT) of the pre-trained LLMs using these human annotations. The results show that the LLM-as-a-Fuzzy-Judge achieves over 80\% accuracy, with major criteria items over 90\%, effectively leveraging fuzzy logic and LLM as a solution to deliver interpretable, human-aligned assessment. This work suggests the viability of leveraging fuzzy logic and LLM to align with human preferences, advances automated evaluation in medical education, and supports more robust assessment and judgment practices. The GitHub repository of this work is available at https://github.com/2sigmaEdTech/LLMAsAJudge

nan


Article 676

Title@2025-06-12 (4): How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?

Title: How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? Wie gut können vernünftigen Modelle erkennen und sich von unhilflichen Gedanken erholen? 理性模型如何能从无益的想法中查明和复苏? 2506.10979v1

Authors (6): Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel, Mor Geva

Recent reasoning models show the ability to reflect, backtrack, and self-validate their reasoning, which is crucial in spotting mistakes and arriving at accurate solutions. A natural question that arises is how effectively models can perform such self-reevaluation. We tackle this question by investigating how well reasoning models identify and recover from four types of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to the question, thoughts misdirecting the question as a slightly different question, and thoughts that lead to incorrect answers. We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant performance drops. Models tend to naively continue the line of reasoning of the injected irrelevant thoughts, which showcases that their self-reevaluation abilities are far from a general “meta-cognitive” awareness. Moreover, we observe non/inverse-scaling trends, where larger models struggle more than smaller ones to recover from short irrelevant thoughts, even when instructed to reevaluate their reasoning. We demonstrate the implications of these findings with a jailbreak experiment using irrelevant thought injection, showing that the smallest models are the least distracted by harmful-response-triggering thoughts. Overall, our findings call for improvement in self-reevaluation of reasoning models to develop better reasoning and safer systems.

nan


Article 677

Title@2025-06-12 (4): AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Title: AutoMind: Adaptive Knowledgeable Agent for Automated Data Science AutoMind: Adaptives Knowledgeable Agent für automatisierte Datenwissenschaft 自动Mind:自动数据科学适应性知识代理 2506.10974v1

Authors (9): Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang

Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.

nan


Article 678

Title@2025-06-12 (4): ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark

Title: ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark ChinesischHarm-Bench: Ein chinesischer schädlicher Content Detection Benchmark 中中汉禁区:中国有害内容检测基准 2506.10960v1

Authors (10): Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng

Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.

nan


Article 679

Title@2025-06-12 (4): Build the web for agents, not agents for the web

Title: Build the web for agents, not agents for the web Erstellen Sie das Web für Agenten, nicht Agenten für das Web 为代理者而不是网络代理者建立网络 2506.10953v1

Authors (4): Xing Han Lù, Gaurav Kamath, Marius Mosbach, Siva Reddy

Recent advancements in Large Language Models (LLMs) and multimodal counterparts have spurred significant interest in developing web agents – AI systems capable of autonomously navigating and completing tasks within web environments. While holding tremendous promise for automating complex web interactions, current approaches face substantial challenges due to the fundamental mismatch between human-designed interfaces and LLM capabilities. Current methods struggle with the inherent complexity of web inputs, whether processing massive DOM trees, relying on screenshots augmented with additional information, or bypassing the user interface entirely through API interactions. This position paper advocates for a paradigm shift in web agent research: rather than forcing web agents to adapt to interfaces designed for humans, we should develop a new interaction paradigm specifically optimized for agentic capabilities. To this end, we introduce the concept of an Agentic Web Interface (AWI), an interface specifically designed for agents to navigate a website. We establish six guiding principles for AWI design, emphasizing safety, efficiency, and standardization, to account for the interests of all primary stakeholders. This reframing aims to overcome fundamental limitations of existing interfaces, paving the way for more efficient, reliable, and transparent web agent design, which will be a collaborative effort involving the broader ML community.

nan


Article 680

Title@2025-06-12 (4): Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Title: Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training Domain2Vec: Vectorizing Datasets, um die optimale Datenmischung ohne Training zu finden 域2Vec: 将数据集矢量化,以查找未经过培训的最佳数据混合体 2506.10952v1

Authors (4): Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu

We introduce~\textsc{Domain2Vec}, a novel approach that decomposes any dataset into a linear combination of several \emph{meta-domains}, a new concept designed to capture the key underlying features of datasets. \textsc{Domain2Vec} maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the \emph{\textbf{D}istribution \textbf{A}lignment \textbf{A}ssumption} (DA$^{2}$), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, \textsc{Domain2vec} can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that \textsc{Domain2Vec} helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, \textsc{Domain2Vec} achieves the same validation loss on Pile-CC using only $51.5\%$ of the computation required when training on the original mixture of The Pile dataset. Under equivalent compute budget, \textsc{Domain2Vec} improves downstream performance by an average of $2.83\%$.

nan


Article 681

Title@2025-06-12 (4): GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models

Title: GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models GUARD: Geführtes Lernen und Zurückhalten über Datenzuweisung für große Sprachmodelle GUARD:通过大语言模式数据归称制,指导学习和保留 2506.10946v1

Authors (7): Evelyn Ma, Duo Zhou, Peizhi Niu, Huiting Zhou, Huan Zhang, Olgica Milenkovic, S. Rasoul Etesami

Unlearning in large language models (LLMs) is becoming increasingly important due to regulatory compliance, copyright protection, and privacy concerns. However, a key challenge in LLM unlearning is unintended forgetting, where the removal of specific data inadvertently impairs the utility of the model and its retention of valuable, desired information. While prior work has primarily focused on architectural innovations, the influence of data-level factors on unlearning performance remains underexplored. As a result, existing methods often suffer from degraded retention when forgetting high-impact data. To address this, we propose GUARD-a novel framework for Guided Unlearning And Retention via Data attribution. At its core, GUARD introduces a lightweight proxy data attribution metric tailored for LLM unlearning, which quantifies the “alignment” between the forget and retain sets while remaining computationally efficient. Building on this, we design a novel unlearning objective that assigns adaptive, nonuniform unlearning weights to samples, inversely proportional to their proxy attribution scores. Through such a reallocation of unlearning power, GUARD mitigates unintended losses in retention. We provide rigorous theoretical guarantees that GUARD significantly enhances retention while maintaining forgetting metrics comparable to prior methods. Extensive experiments on the TOFU benchmark across multiple LLM architectures demonstrate that GUARD substantially improves utility preservation while ensuring effective unlearning. Notably, GUARD reduces utility sacrifice on the Retain Set by up to 194.92% in terms of Truth Ratio when forgetting 10% of the training data.

nan


Article 682

Title@2025-06-12 (4): VINCIE: Unlocking In-context Image Editing from Video

Title: VINCIE: Unlocking In-context Image Editing from Video VINCIE: Im Kontext Bildbearbeitung von Video entsperren VINCIE: 从视频中解锁 Incontext 图像编辑 2506.10941v1

Authors (10): Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang

In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.

nan


Article 683

Title@2025-06-12 (4): Visually Descriptive Language Model for Vector Graphics Reasoning

Title: Visually Descriptive Language Model for Vector Graphics Reasoning Visuell Deskriptives Sprachmodell für Vektorgrafiken 矢量图形推理视觉描述语言模型 2404.06479v5

Authors (7): Zhenhailong Wang, Joy Hsu, Xingyao Wang, Kuan-Hao Huang, Manling Li, Jiajun Wu, Heng Ji

Despite significant advancements, large multimodal models (LMMs) still struggle to bridge the gap between low-level visual perception – focusing on shapes, sizes, and layouts – and high-level language reasoning, such as semantics and logic. This limitation is evident in tasks that require precise visual perception, like comparing geometric properties or solving visual reasoning problems. To study this failure mode, we focus on vector graphics – images composed of 2D objects and shapes, prevalent in LMM-based tasks in web, design, and OS environments. We identify two key research questions: how can we enable precise visual perception, and how can we facilitate high-level reasoning based on such low-level perceptions? To capture fine visual details, we use Scalable Vector Graphics (SVG) for accurate encoding of visual scenes. However, SVGs are not readily interpretable by LMMs in a zero-shot manner. To tackle this, we propose the Visually Descriptive Language Model (VDLM), which introduces a Primal Visual Description (PVD) as an intermediate textual representation. PVD translates SVGs into a text-based abstraction consisting of primitive attributes (e.g., shape, position, measurement) and their corresponding values. PVD can be learned using task-agnostic synthesized data and represents visual primitives that are universal across vector graphics. This abstraction is more structured, allowing for direct interpretation by foundation models for zero-shot generalization. Without human-annotated data, empirical results show that VDLM significantly improves state-of-the-art LMMs like GPT-4o on various multimodal perception and reasoning tasks. Extensive analyses of VDLM show improved interpretability due to its disentangled perception and reasoning. We also demonstrate a positive correlation between PVD quality and task performance. Project page: https://mikewangwzhl.github.io/VDLM/

nan


Article 684

Title@2025-06-12 (4): Dynamic Epistemic Friction in Dialogue

Title: Dynamic Epistemic Friction in Dialogue Dynamische epistemische Reibung im Dialog 对话框中的动态瞬间摩擦 2506.10934v1

Authors (5): Timothy Obiso, Kenneth Lai, Abhijnan Nath, Nikhil Krishnaswamy, James Pustejovsky

Recent developments in aligning Large Language Models (LLMs) with human preferences have significantly enhanced their utility in human-AI collaborative scenarios. However, such approaches often neglect the critical role of “epistemic friction,” or the inherent resistance encountered when updating beliefs in response to new, conflicting, or ambiguous information. In this paper, we define dynamic epistemic friction as the resistance to epistemic integration, characterized by the misalignment between an agent’s current belief state and new propositions supported by external evidence. We position this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit, 2011), where friction emerges as nontrivial belief-revision during the interaction. We then present analyses from a situated collaborative task that demonstrate how this model of epistemic friction can effectively predict belief updates in dialogues, and we subsequently discuss how the model of belief alignment as a measure of epistemic resistance or friction can naturally be made more sophisticated to accommodate the complexities of real-world dialogue scenarios.

nan


Article 685

Title@2025-06-12 (4): Improving LLM Safety Alignment with Dual-Objective Optimization

Title: Improving LLM Safety Alignment with Dual-Objective Optimization Verbesserung der LLM-Sicherheitsausrichtung mit Dual-Ziel-Optimierung 提高LLM安全一致性,实现双目标优化 2503.03710v2

Authors (7): Xuandong Zhao, Will Cai, Tianneng Shi, David Huang, Licong Lin, Song Mei, Dawn Song

Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at https://github.com/wicai24/DOOR-Alignment

nan


Article 686

Title@2025-06-12 (4): Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Title: Robustly Improving LLM Fairness in Realistic Settings via Interpretability Robuste Verbesserung der LLM Fairness in realistischen Einstellungen durch Dolmetschbarkeit 通过可解释性在现实环境中强有力地提高LLM公平性 2506.10922v1

Authors (2): Adam Karvonen, Samuel Marks

Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people’s careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g.,``only accept candidates in the top 10\%”) induces significant racial and gender biases (up to 12\% differences in interview rates). When these biases emerge, they consistently favor Black over White candidates and female over male candidates across all tested models and scenarios. Moreover, models can infer demographics and become biased from subtle cues like college affiliations, with these biases remaining invisible even when inspecting the model’s chain-of-thought reasoning. To address these limitations, our internal bias mitigation identifies race and gender-correlated directions and applies affine concept editing at inference time. Despite using directions from a simple synthetic dataset, the intervention generalizes robustly, consistently reducing bias to very low levels (typically under 1\%, always below 2.5\%) while largely maintaining model performance. Our findings suggest that practitioners deploying LLMs for hiring should adopt more realistic evaluation methodologies and consider internal mitigation strategies for equitable outcomes.

nan


Article 687

Title@2025-06-12 (4): Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

Title: Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization Dekomponieren von MLP-Aktivierungen in Interpretierbare Funktionen über semi-Nonnegative Matrix-Fabrikisierung 通过半氮基矩阵化系数化,将劳动和生产部的分解活动转化为可解释性特征 2506.10920v1

Authors (3): Or Shafran, Atticus Geiger, Mor Geva

A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP’s activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.

nan


Article 688

Title@2025-06-12 (4): Weak-to-Strong Jailbreaking on Large Language Models

Title: Weak-to-Strong Jailbreaking on Large Language Models Schwach-zu-starkes Gefängnis mit großen Sprachmodellen 关于大语言模型的弱至强强监狱破解 2401.17256v3

Authors (7): Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang

Large language models (LLMs) are vulnerable to jailbreak attacks - resulting in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient inference time attack for aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models only differ in their initial decoding distributions. The weak-to-strong attack’s key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model’s decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The results show our method can increase the misalignment rate to over 99% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong

nan


Article 689

Title@2025-06-12 (4): Efficiently Identifying Watermarked Segments in Mixed-Source Texts

Title: Efficiently Identifying Watermarked Segments in Mixed-Source Texts Effiziente Identifikation von wassermarkierten Segmenten in Mixed-Source-Texten 有效识别混合来源文本中划划水段 2410.03600v2

Authors (4): Xuandong Zhao, Chenwen Liao, Yu-Xiang Wang, Lei Li

Text watermarks in large language models (LLMs) are increasingly used to detect synthetic text, mitigating misuse cases like fake news and academic dishonesty. While existing watermarking detection techniques primarily focus on classifying entire documents as watermarked or not, they often neglect the common scenario of identifying individual watermark segments within longer, mixed-source documents. Drawing inspiration from plagiarism detection systems, we propose two novel methods for partial watermark detection. First, we develop a geometry cover detection framework aimed at determining whether there is a watermark segment in long text. Second, we introduce an adaptive online learning algorithm to pinpoint the precise location of watermark segments within the text. Evaluated on three popular watermarking techniques (KGW-Watermark, Unigram-Watermark, and Gumbel-Watermark), our approach achieves high accuracy, significantly outperforming baseline methods. Moreover, our framework is adaptable to other watermarking techniques, offering new insights for precise watermark detection. Our code is publicly available at https://github.com/XuandongZhao/llm-watermark-location

nan


Article 690

Title@2025-06-12 (4): Magistral

Title: Magistral Magistral 司 司 司 司 司 司 司 司 司 2506.10910v1

Authors (101): Mistral-AI, :, Abhinav Rastogi, Albert Q. Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, Léonard Blier, Lucile Saulnier, Matthieu Dinot, Maxime Darrin, Neha Gupta, Roman Soletskyi, Sagar Vaze, Teven Le Scao, Yihan Wang, Adam Yang, Alexander H. Liu, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Andy Ehrenberg, Anmol Agarwal, Antoine Roux, Arthur Darcet, Arthur Mensch, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Chris Bamford, Christian Wallenwein, Christophe Renaudin, Clémence Lanfranchi, Darius Dabert, Devon Mizelle, Diego de las Casas, Elliot Chane-Sane, Emilien Fugier, Emma Bou Hanna, Gauthier Delerce, Gauthier Guinet, Georgii Novikov, Guillaume Martin, Himanshu Jaju, Jan Ludziejewski, Jean-Hadrien Chabran, Jean-Malo Delignon, Joachim Studnia, Jonas Amar, Josselin Somerville Roberts, Julien Denize, Karan Saxena, Kush Jain, Lingxiao Zhao, Louis Martin, Luyu Gao, Lélio Renard Lavaud, Marie Pellat, Mathilde Guillaumin, Mathis Felardos, Maximilian Augustin, Mickaël Seznec, Nikhil Raghuraman, Olivier Duchenne, Patricia Wang, Patrick von Platen, Patryk Saffer, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Pavankumar Reddy Muddireddy, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Romain Sauvestre, Rémi Delacourt, Sanchit Gandhi, Sandeep Subramanian, Shashwat Dalal, Siddharth Gandhi, Soham Ghosh, Srijan Mishra, Sumukh Aithal, Szymon Antoniak, Thibault Schueller, Thibaut Lavril, Thomas Robert, Thomas Wang, Timothée Lacroix, Valeriia Nemychnikova, Victor Paltz, Virgile Richard, Wen-Ding Li, William Marshall, Xuanyu Zhang, Yunhao Tang

We introduce Magistral, Mistral’s first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint’s capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.

nan


Article 691

Title@2025-06-12 (4): Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning

Title: Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning Jenseits von Goldstandards: Epistemisches Ensemble von LLM-Richtern für formale mathematische Vernunft 超越金金标准:法学硕士正式数学理由法官集会 2506.10903v1

Authors (3): Lan Zhang, Marco Valentino, Andre Freitas

Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward addressing this gap by introducing a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an epistemically and formally grounded ensemble (EFG) of LLM judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), resulting in a transparent assessment that accounts for different contributing factors. We validate the proposed framework to serve as a proxy for autoformalization assessment within the domain of formal mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation, more strongly correlating with human assessments than a coarse-grained model, especially when assessing formal qualities. These findings suggest that LLM-as-judges, especially when guided by a well-defined set of atomic properties, could offer a scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.

nan


Article 692

Title@2025-06-12 (4): BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

Title: BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP BioClinical ModernBERT: Ein hochmoderner Long-Context-Encoder für biomedizinische und klinische NLP 生物医学和临床国家实验室方案最新生物医学和临床现代生物临床现代BERT:最先进的生物医学和临床临床长期编码器 2506.10896v1

Authors (10): Thomas Sounack, Joshua Davis, Brigitte Durieux, Antoine Chaffin, Tom J. Pollard, Eric Lehman, Alistair E. W. Johnson, Matthew McDermott, Tristan Naumann, Charlotta Lindvall

Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.

nan


Article 693

Title@2025-06-12 (4): The Diffusion Duality

Title: The Diffusion Duality Die Diffusionsdualität 传播质量 2506.10892v1

Authors (6): Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov

Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: http://s-sahoo.github.io/duo

nan


Article 694

Title@2025-06-12 (4): PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play

Title: PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play PLAY2PROMPT: Zero-shot Tool Instruction Optimierung für LLM Agenten über Tool Play PLAY2PROMOPT: 通过工具游戏优化LLM代理器的零射工具指令 2503.14432v2

Authors (5): Wei Fang, Yang Zhang, Kaizhi Qian, James Glass, Yada Zhu

Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero-shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero-shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically “plays” with each tool to explore its input-output behaviors. Through this iterative trial-and-error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real-world tasks demonstrate that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.

nan


Article 695

Title@2025-06-12 (4): Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Title: Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers Verallgemeinerung oder Halluzination? Verstehen von Out-of-Context-Reasoning in Transformers 通化还是幻觉? 理解变异器的逻辑外原因 2506.10887v1

Authors (8): Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei

Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.

nan


Article 696

Title@2025-06-12 (4): Slimming Down LLMs Without Losing Their Minds

Title: Slimming Down LLMs Without Losing Their Minds LLMs abschwächen, ohne ihre Gedanken zu verlieren 在不失去理智的情况下将LLMs 压倒在地 2506.10885v1

Authors (2): Qingda, Mai

This paper investigates and validates the impact of fine-tuning on large language model performance, focusing on parameter-efficient methods (LoRA and QLoRA). We evaluate model capabilities across three key domains: (1) commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3) multi-domain knowledge (MMLU-CS). Our findings demonstrate that: (1) LoRA-based methods effectively improve task-specific performance while maintaining computational efficiency, and (2) performance strongly depends on alignment between fine-tuning dataset and benchmark tasks. The study provides both theoretical insights into parameter-efficient mechanisms and practical guidance for developers implementing efficient LLM adaptation with limited resources.

nan


Article 697

Title@2025-06-12 (4): Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment

Title: Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment Verbesserung des medizinischen Dialogs durch Wissensverfeinerung und dynamische Anpassung 通过知识完善和动态快速调整加强医疗对话 2506.10877v1

Authors (6): Hongda Sun, Jiaren Peng, Wenzhong Yang, Liang He, Bo Du, Rui Yan

Medical dialogue systems (MDS) have emerged as crucial online platforms for enabling multi-turn, context-aware conversations with patients. However, existing MDS often struggle to (1) identify relevant medical knowledge and (2) generate personalized, medically accurate responses. To address these challenges, we propose MedRef, a novel MDS that incorporates knowledge refining and dynamic prompt adjustment. First, we employ a knowledge refining mechanism to filter out irrelevant medical data, improving predictions of critical medical entities in responses. Additionally, we design a comprehensive prompt structure that incorporates historical details and evident details. To enable real-time adaptability to diverse patient conditions, we implement two key modules, Triplet Filter and Demo Selector, providing appropriate knowledge and demonstrations equipped in the system prompt. Extensive experiments on MedDG and KaMed benchmarks show that MedRef outperforms state-of-the-art baselines in both generation quality and medical entity accuracy, underscoring its effectiveness and reliability for real-world healthcare applications.

nan


Article 698

Title@2025-06-12 (4): Large Language Models for Multilingual Previously Fact-Checked Claim Detection

Title: Large Language Models for Multilingual Previously Fact-Checked Claim Detection Große Sprachmodelle für die multilinguale bisher Fact-Checked Claim Detection 多语种以前实况调查索赔调查大语言模型 2503.02737v2

Authors (6): Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Tatiana Anikina, Michal Gregor, Marián Šimko

In our era of widespread false information, human fact-checkers often face the challenge of duplicating efforts when verifying claims that may have already been addressed in other countries or languages. As false information transcends linguistic boundaries, the ability to automatically detect previously fact-checked claims across languages has become an increasingly important task. This paper presents the first comprehensive evaluation of large language models (LLMs) for multilingual previously fact-checked claim detection. We assess seven LLMs across 20 languages in both monolingual and cross-lingual settings. Our results show that while LLMs perform well for high-resource languages, they struggle with low-resource languages. Moreover, translating original texts into English proved to be beneficial for low-resource languages. These findings highlight the potential of LLMs for multilingual previously fact-checked claim detection and provide a foundation for further research on this promising application of LLMs.

nan


Article 699

Title@2025-06-12 (4): Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards

Title: Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards Segeln mit den Sternen: Eine Umfrage über Prämienmodelle und Lernstrategien zum Lernen aus Belohnungen 星舰航行:奖励模型调查以及从奖励中学习的学习战略 2505.02686v2

Authors (1): Xiaobao Wu

Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unified paradigm has arisen: Learning from Rewards, where reward signals act as the guiding stars to steer LLM behavior. It has underpinned a wide range of prevalent techniques, such as reinforcement learning (RLHF, RLAIF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, this paradigm enables the transition from passive learning from static data to active learning from dynamic feedback. This endows LLMs with aligned preferences and deep reasoning capabilities for diverse tasks. In this survey, we present a comprehensive overview of learning from rewards, from the perspective of reward models and learning strategies across training, inference, and post-inference stages. We further discuss the benchmarks for reward models and the primary applications. Finally we highlight the challenges and future directions. We maintain a paper collection at https://github.com/bobxwu/learning-from-rewards-llm-papers.

nan


Article 700

Title@2025-06-12 (4): Multi-group Uncertainty Quantification for Long-form Text Generation

Title: Multi-group Uncertainty Quantification for Long-form Text Generation Multi-Gruppen-Unsicherheits-Quantifizierung für langformige Textgenerierung 长式文本生成的不确定性量化 2407.21057v2

Authors (2): Terrance Liu, Zhiwei Steven Wu

While past works have shown how uncertainty quantification can be applied to large language model (LLM) outputs, the question of whether resulting uncertainty guarantees still hold within sub-groupings of data remains open. In our work, given some long-form text generated by an LLM, we study uncertainty at both the level of individual claims contained within the output (via calibration) and across the entire output itself (via conformal prediction). Using biography generation as a testbed for this study, we derive a set of (demographic) attributes (e.g., whether some text describes a man or woman) for each generation to form such “subgroups” of data. We find that although canonical methods for both types of uncertainty quantification perform well when measuring across the entire dataset, such guarantees break down when examining particular subgroups. Having established this issue, we invoke group-conditional methods for uncertainty quantification – multicalibration and multivalid conformal prediction – and find that across a variety of approaches, additional subgroup information consistently improves calibration and conformal prediction within subgroups (while crucially retaining guarantees across the entire dataset). As the problems of calibration, conformal prediction, and their multi-group counterparts have not been extensively explored in the context of long-form text generation, we consider these results to form a benchmark for this setting.

nan


Article 701

Title@2025-06-12 (4): Debiasing Watermarks for Large Language Models via Maximal Coupling

Title: Debiasing Watermarks for Large Language Models via Maximal Coupling Debiasing Wasserzeichen für große Sprachmodelle über Maximal Coupling 通过Maximal Coupling为大语言模型减少对水标记的偏差 2411.11203v2

Authors (5): Yangxinyu Xie, Xiang Li, Tanwi Mallick, Weijie J. Su, Ruixun Zhang

Watermarking language models is essential for distinguishing between human and machine-generated text and thus maintaining the integrity and trustworthiness of digital communication. We present a novel green/red list watermarking approach that partitions the token set into green'' andred’’ lists, subtly increasing the generation probability for green tokens. To correct token distribution bias, our method employs maximal coupling, using a uniform coin flip to decide whether to apply bias correction, with the result embedded as a pseudorandom watermark signal. Theoretical analysis confirms this approach’s unbiased nature and robust detection capabilities. Experimental results show that it outperforms prior techniques by preserving text quality while maintaining high detectability, and it demonstrates resilience to targeted modifications aimed at improving text quality. This research provides a promising watermarking solution for language models, balancing effective detection with minimal impact on text quality.

nan


Article 702

Title@2025-06-12 (4): Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models

Title: Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models Analyse der Beziehungen zwischen vorschulischer Sprache, phonetischer, klanglicher und sprachlicher Information in selbstüberwachten Sprachmodellen 以自我监督的演讲模式分析培训前语言、音、音、音、音和演讲者信息之间的关系 2506.10855v1

Authors (5): Michele Gubian, Ioana Krehan, Oli Liu, James Kirby, Sharon Goldwater

Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.

nan


Article 703

Title@2025-06-12 (4): CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training

Title: CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training CIIR@LiveRAG 2025: Optimierung der Multi-Agent Retrieval Augmented Generation durch Selbsttraining CIIR@LiveRAG 2025:通过自我培训优化多要求回生增生一代 2506.10844v1

Authors (3): Alireza Salemi, Mukta Maddipatla, Hamed Zamani

This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework’s strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.

nan


Article 704

Title@2025-06-12 (4): UCD: Unlearning in LLMs via Contrastive Decoding

Title: UCD: Unlearning in LLMs via Contrastive Decoding UCD: Lernen in LLMs durch Kontrastive Dekodierung UCD:通过互换代号在LLMs中重新学习 2506.12097v1

Authors (3): Vinith M. Suriyakumar, Ayush Sekhari, Ashia Wilson

Machine unlearning aims to remove specific information, e.g. sensitive or undesirable content, from large language models (LLMs) while preserving overall performance. We propose an inference-time unlearning algorithm that uses contrastive decoding, leveraging two auxiliary smaller models, one trained without the forget set and one trained with it, to guide the outputs of the original model using their difference during inference. Our strategy substantially improves the tradeoff between unlearning effectiveness and model utility. We evaluate our approach on two unlearning benchmarks, TOFU and MUSE. Results show notable gains in both forget quality and retained performance in comparison to prior approaches, suggesting that incorporating contrastive decoding can offer an efficient, practical avenue for unlearning concepts in large-scale models.

nan


Article 705

Title@2025-06-12 (4): ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization

Title: ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization ReCUT: Ausbalancierende Grundlänge und Genauigkeit in LLMs über Schrittweise Trails und Preference Optimization RECUT:通过分步跟踪和优化优化平衡长长和准确性 2506.10822v1

Authors (10): Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu, Qi Shi, Yukun Yan, Shuo Wang, Furong Peng, Ge Yu

Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue through curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data and prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs)-one optimized for reasoning accuracy, the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All codes and data will be released via https://github.com/NEUIR/ReCUT.

nan


Article 706

Title@2025-06-12 (4): Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints

Title: Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints Negative Interferenzen in der Mehrsprachigkeit sequenzieller Wissensbearbeitung durch Null-Raum-Beschränkungen abmildern 减少多语种序列知识编辑的负面干扰 2506.10800v1

Authors (5): Wei Sun, Tingyu Qu, Mingxiao Li, Jesse Davis, Marie-Francine Moens

Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previous updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at https://github.com/VRCMF/LangEdit.git.

nan


Article 707

Title@2025-06-12 (4): The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages

Title: The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Der Esethu-Rahmen: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Esethu框架:重新想象可持续数据集治理和低碳语言的理论 2502.15916v2

Authors (15): Jenalea Rajab, Anuoluwapo Aremu, Everlyn Asiko Chimoto, Dale Dunbar, Graham Morrissey, Fadel Thior, Luandrie Potgieter, Jessico Ojo, Atnafu Lambebo Tonja, Maushami Chetty, Wilhelmina NdapewaOnyothi Nekoto, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman

This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resource. This framework is supported by the Esethu license, a novel community-centric data license. As a proof of concept, we introduce the Vuk’uzenzele isiXhosa Speech Dataset (ViXSD), an open-source corpus developed under the Esethu Framework and License. The dataset, containing read speech from native isiXhosa speakers enriched with demographic and linguistic metadata, demonstrates how community-driven licensing and curation principles can bridge resource gaps in automatic speech recognition (ASR) for African languages while safeguarding the interests of data creators. We describe the framework guiding dataset development, outline the Esethu license provisions, present the methodology for ViXSD, and present ASR experiments validating ViXSD’s usability in building and refining voice-driven applications for isiXhosa.

nan


Article 708

Title@2025-06-12 (4): FASCIST-O-METER: Classifier for Neo-fascist Discourse Online

Title: FASCIST-O-METER: Classifier for Neo-fascist Discourse Online FASCIST-O-METER: Klassifikator für neofaschistischen Diskurs Online FASCIST-O-METER:新法西斯人士在线论文分类 2506.10789v1

Authors (4): Rudy Alexandro Garrido Veliz, Martin Semmann, Chris Biemann, Seid Muhie Yimam

Neo-fascism is a political and societal ideology that has been having remarkable growth in the last decade in the United States of America (USA), as well as in other Western societies. It poses a grave danger to democracy and the minorities it targets, and it requires active actions against it to avoid escalation. This work presents the first-of-its-kind neo-fascist coding scheme for digital discourse in the USA societal context, overseen by political science researchers. Our work bridges the gap between Natural Language Processing (NLP) and political science against this phenomena. Furthermore, to test the coding scheme, we collect a tremendous amount of activity on the internet from notable neo-fascist groups (the forums of Iron March and Stormfront.org), and the guidelines are applied to a subset of the collected posts. Through crowdsourcing, we annotate a total of a thousand posts that are labeled as neo-fascist or non-neo-fascist. With this labeled data set, we fine-tune and test both Small Language Models (SLMs) and Large Language Models (LLMs), obtaining the very first classification models for neo-fascist discourse. We find that the prevalence of neo-fascist rhetoric in this kind of forum is ever-present, making them a good target for future research. The societal context is a key consideration for neo-fascist speech when conducting NLP research. Finally, the work against this kind of political movement must be pressed upon and continued for the well-being of a democratic society. Disclaimer: This study focuses on detecting neo-fascist content in text, similar to other hate speech analyses, without labeling individuals or organizations.

nan


Article 709

Title@2025-06-12 (4): Improving Named Entity Transcription with Contextual LLM-based Revision

Title: Improving Named Entity Transcription with Contextual LLM-based Revision Verbesserung der Transkription der benannten Entität mit kontextueller LLM-basierter Revision 改进以背景LLM为基础订正的命名实体跟踪 2506.10779v1

Authors (3): Viet Anh Trinh, Xinlu He, Jacob Whitehill

With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM’s reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30\% relative WER reduction for named entities.

nan


Article 710

Title@2025-06-12 (4): Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs

Title: Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs Unterschiedliche Fragen, unterschiedliche Modelle: Feinkörnige Bewertung von Unsicherheit und Kalibrierung in klinischen QA mit LLMs 不同问题、不同模式:对临床质量评估中不确定性和校准的精细评估 2506.10769v1

Authors (2): Alberto Testoni, Iacer Calixto

Accurate and well-calibrated uncertainty estimates are essential for deploying large language models (LLMs) in high-stakes domains such as clinical decision support. We present a fine-grained evaluation of uncertainty estimation methods for clinical multiple-choice question answering, covering ten open-source LLMs (general-purpose, biomedical, and reasoning models) across two datasets, eleven medical specialties, and six question types. We compare standard single-generation and sampling-based methods, and present a case study exploring simple, single-pass estimators based on behavioral signals in reasoning traces. These lightweight methods approach the performance of Semantic Entropy while requiring only one generation. Our results reveal substantial variation across specialties and question types, underscoring the importance of selecting models based on both the nature of the question and model-specific strengths.

nan


Article 711

Title@2025-06-12 (4): Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation

Title: Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation Chain-of-Code Collapse: Gründe für Fehler in LLMs über Adversarial Prompting in der Code-Generierung 崩溃链:通过代码生成中的反向提示造成LLMs中失败的原因 2506.06971v2

Authors (4): Jaechul Roh, Varun Gandhi, Shivani Anilkumar, Arin Garg

Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning, such as code generation, mathematical problem solving, and algorithmic synthesis – especially when aided by reasoning tokens and Chain-of-Thought prompting. Yet, a core question remains: do these models truly reason, or do they merely exploit shallow statistical patterns? In this paper, we introduce Chain-of-Code Collapse, where we systematically investigate the robustness of reasoning LLMs by introducing a suite of semantically faithful yet adversarially structured prompt perturbations. Our evaluation – spanning 700 perturbed code generations derived from LeetCode-style problems – applies transformations such as storytelling reframing, irrelevant constraint injection, example reordering, and numeric perturbation. We observe that while certain modifications severely degrade performance (with accuracy drops up to -42.1%), others surprisingly improve model accuracy by up to 35.3%, suggesting sensitivity not only to semantics but also to surface-level prompt dynamics. These findings expose the fragility and unpredictability of current reasoning systems, underscoring the need for more principles approaches to reasoning alignments and prompting robustness. We release our perturbation datasets and evaluation framework to promote further research in trustworthy and resilient LLM reasoning.

nan


Article 712

Title@2025-06-12 (4): One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

Title: One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers Ein Tokenizer, um sie alle zu beherrschen: Emergente Sprachplastizität über Mehrsprachige Tokenizer 万能统治者:通过多语种教育者实现新兴语言的可塑性 2506.10766v1

Authors (9): Diana Abagyan, Alejandro R. Salamanca, Andres Felipe Cruz-Salinas, Kris Cao, Hangyu Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve “language plasticity”, or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.

nan


Article 713

Title@2025-06-12 (4): Aspect-Based Opinion Summarization with Argumentation Schemes

Title: Aspect-Based Opinion Summarization with Argumentation Schemes Aspektbasierte Zusammenfassung der Meinungen mit Argumentierungsschemata 与参数说明方案对照审计意见的概述 2506.09917v2

Authors (3): Wendi Zhou, Ameer Saadat-Yazdi, Nadin Kokciyan

Reviews are valuable resources for customers making purchase decisions in online shopping. However, it is impractical for customers to go over the vast number of reviews and manually conclude the prominent opinions, which prompts the need for automated opinion summarization systems. Previous approaches, either extractive or abstractive, face challenges in automatically producing grounded aspect-centric summaries. In this paper, we propose a novel summarization system that not only captures predominant opinions from an aspect perspective with supporting evidence, but also adapts to varying domains without relying on a pre-defined set of aspects. Our proposed framework, ASESUM, summarizes viewpoints relevant to the critical aspects of a product by extracting aspect-centric arguments and measuring their salience and validity. We conduct experiments on a real-world dataset to demonstrate the superiority of our approach in capturing diverse perspectives of the original reviews compared to new and existing methods.

nan


Article 714

Title@2025-06-12 (4): Great Models Think Alike and this Undermines AI Oversight

Title: Great Models Think Alike and this Undermines AI Oversight Große Modelle denken ähnlich und dies unterminiert AI Oversight 伟大的模特儿们想着类似的想法 和这枚地下地雷 AI监督 2502.04313v2

Authors (9): Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as ‘‘AI Oversight’’. We study how model similarity affects both aspects of AI oversight by proposing Chance Adjusted Probabilistic Agreement (CAPA): a metric for LM similarity based on overlap in model mistakes. Using CAPA, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from ‘‘weak-to-strong generalization’’. As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend – model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.

nan


Article 715

Title@2025-06-12 (4): Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering

Title: Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering Neural bei ArchEHR-QA 2025: Agentische Prompt-Optimierung für evidenzgerundete klinische Fragen ArchEHR-QA 2025:证据四舍五入临床问题解答的代理快速优化 2506.10751v1

Authors (6): Sai Prasanna Teja Reddy Bogireddy, Abrar Majeedi, Viswanatha Reddy Gajjala, Zhuoyan Xu, Siddhant Rai, Vaishnav Potlapalli

Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second stage while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.

nan


Article 716

Title@2025-06-12 (4): TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora

Title: TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora TaxoAdapt: LLM-basierte multidimensionale Taxonomie-Konstruktion an die sich entwickelnde Forschungskorporation ausrichten 将基于LLM的多层面分类学建设与不断发展的研究公司相协调 2506.10737v1

Authors (6): Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, Jiawei Han

The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy width and depth based on corpus’ topical distribution. We demonstrate its state-of-the-art performance across a diverse set of computer science conferences over the years to showcase its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines judged by LLMs.

nan


Article 717

Title@2025-06-12 (4): Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims

Title: Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims 超越真或假:收回增加的无损失索赔的等级结构分析 2506.10728v1

Authors (3): Priyanka Kargupta, Runchu Tian, Jiawei Han

Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely “true” or “false” – as is frequently the case with scientific and political claims. However, a claim (e.g., “vaccine A is better than vaccine B”) can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., “how many biomedical papers believe vaccine A is more transportable than B?”). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.

nan


Article 718

Title@2025-06-12 (4): PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models

Title: PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models PREMISE: Skalierbare und strategische Prompt-Optimierung für effiziente mathematische Reasoning in großen Modellen PREMISE:大规模模型中高效数学理由的可扩展和战略快速优化 2506.10716v1

Authors (3): Ye Yu, Yaoning Yu, Haohan Wang

Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve strong performance on mathematical benchmarks using lengthy chain-of-thought (CoT) reasoning, but the resulting traces are often unnecessarily verbose. This inflates token usage and cost, limiting deployment in latency-sensitive or API-constrained settings. We introduce PREMISE (PRompt-based Efficient Mathematical Inference with Strategic Evaluation), a prompt-only framework that reduces reasoning overhead without modifying model weights. PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization to minimize redundant computation while preserving answer accuracy. The approach jointly optimizes brevity and correctness through a multi-objective textual search that balances token length and answer validity. Unlike prior work, PREMISE runs in a single-pass black-box interface, so it can be applied directly to commercial LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy ($96\%\rightarrow96\%$ with Claude, $91\%\rightarrow92\%$ with Gemini) while reducing reasoning tokens by up to $87.5\%$ and cutting dollar cost by $69$–$82\%$. These results show that prompt-level optimization is a practical and scalable path to efficient LRM inference without compromising reasoning quality.

nan


Article 719

Title@2025-06-12 (4): Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet

Title: Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet Adjektive Hypernyms mit Sprachmodellen ableiten, um die Konnektivität von Open English Wordnet zu erhöhen 推导语言模型的形容词超音音音,以提高开放英文Wordnet的连通性 2506.10715v1

Authors (2): Lorenzo Augello, John P. McCrae

Open English Wordnet is a key resource published in OntoLex-lemon as part of the linguistic linked open data cloud. There are, however, many links missing in the resource, and in this paper, we look at how we can establish hypernymy between adjectives. We present a theoretical discussion of the hypernymy relation and how it differs for adjectives in contrast to nouns and verbs. We develop a new resource for adjective hypernymy and fine-tune large language models to predict adjective hypernymy, showing that the methodology of TaxoLLaMa can be adapted to this task.

nan


Article 720

Title@2025-06-12 (4): PRSA: Prompt Stealing Attacks against Real-World Prompt Services

Title: PRSA: Prompt Stealing Attacks against Real-World Prompt Services PRSA: Sofortige Diebstahlangriffe gegen Real-World Prompt Services PRSA: 迅速窃盗对现实世界迅速服务公司的袭击 2402.19200v3

Authors (9): Yong Yang, Changjiang Li, Qingming Li, Oubo Ma, Haoyu Wang, Zonghui Wang, Yandong Gao, Wenzhi Chen, Shouling Ji

Recently, large language models (LLMs) have garnered widespread attention for their exceptional capabilities. Prompts are central to the functionality and performance of LLMs, making them highly valuable assets. The increasing reliance on high-quality prompts has driven significant growth in prompt services. However, this growth also expands the potential for prompt leakage, increasing the risk that attackers could replicate original functionalities, create competing products, and severely infringe on developers’ intellectual property. Despite these risks, prompt leakage in real-world prompt services remains underexplored. In this paper, we present PRSA, a practical attack framework designed for prompt stealing. PRSA infers the detailed intent of prompts through very limited input-output analysis and can successfully generate stolen prompts that replicate the original functionality. Extensive evaluations demonstrate PRSA’s effectiveness across two main types of real-world prompt services. Specifically, compared to previous works, it improves the attack success rate from 17.8% to 46.1% in prompt marketplaces and from 39% to 52% in LLM application stores, respectively. Notably, in the attack on “Math”, one of the most popular educational applications in OpenAI’s GPT Store with over 1 million conversations, PRSA uncovered a hidden Easter egg that had not been revealed previously. Besides, our analysis reveals that higher mutual information between a prompt and its output correlates with an increased risk of leakage. This insight guides the design and evaluation of two potential defenses against the security threats posed by PRSA. We have reported these findings to the prompt service vendors, including PromptBase and OpenAI, and actively collaborate with them to implement defensive measures.

nan


Article 721

Title@2025-06-12 (4): FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems

Title: FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems FedRAG: Ein Rahmen für Systeme der Feinsteuerung von Retrieval-Augmented Generation FFRAG: 微调取回系统框架 2506.09200v2

Authors (8): Val Andrei Fajardo, David B. Emerson, Amandeep Singh, Veronica Chatrath, Marcelo Lotif, Ravi Theja, Alex Cheung, Izuki Matsuba

Retrieval-augmented generation (RAG) systems have been shown to be effective in addressing many of the drawbacks of relying solely on the parametric memory of large language models. Recent work has demonstrated that RAG systems can be improved via fine-tuning of their retriever and generator models. In this work, we introduce FedRAG, a framework for fine-tuning RAG systems across centralized and federated architectures. FedRAG supports state-of-the-art fine-tuning methods, offering a simple and intuitive interface and a seamless conversion from centralized to federated training tasks. FedRAG is also deeply integrated with the modern RAG ecosystem, filling a critical gap in available tools.

nan


Article 722

Title@2025-06-12 (4): SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models

Title: SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models SelectLLM: Query-Aware Effiziente Auswahl Algorithmen für große Sprachmodelle 选择LLM: 用于大语言模型的查询- 软件高效选择算法 2408.08545v4

Authors (3): Kaushal Kumar Maurya, KV Aditya Srivatsa, Ekaterina Kochmar

Large language models (LLMs) have been widely adopted due to their remarkable performance across various applications, driving the accelerated development of a large number of diverse models. However, these individual LLMs show limitations in generalization and performance on complex tasks due to inherent training biases, model size constraints, and the quality or diversity of pre-training datasets. A promising direction is to efficiently harness the diverse capabilities of LLMs to overcome these individual limitations. To address these limitations, we introduce a novel LLM selection algorithm called SelectLLM, which efficiently directs input queries to the most suitable subset of LLMs from a large pool, ensuring that the selected models collectively provide accurate responses. SelectLLM employs a multi-label classifier and policy based on the classifier’s predictions and confidence scores in selecting an optimal, query-aware, and lightweight subset of LLMs. Our findings indicate that the proposed model outperforms existing ensemble-based baselines and achieves competitive performance with similarly sized top-performing LLMs while maintaining efficiency. Specifically, it achieves a huge reduction in inference latency on two challenging reasoning benchmarks: 13\% on GSM8K and 70\% on MMLU, compared to the top-performing baseline. Also, we establish a theoretical upper bound by an Oracle with LLMs and perform an in-depth linguistic analysis to understand the performance gap between the Oracle and SelectLLM.

nan


Article 723

Title@2025-06-12 (4): Large Language Models for Detection of Life-Threatening Texts

Title: Large Language Models for Detection of Life-Threatening Texts Große Sprachmodelle zur Erkennung lebensbedrohlicher Texte 探测生命威胁文字的长语言大语言模型 2506.10687v1

Authors (3): Thanh Thi Nguyen, Campbell Wilson, Janis Dalins

Detecting life-threatening language is essential for safeguarding individuals in distress, promoting mental health and well-being, and preventing potential harm and loss of life. This paper presents an effective approach to identifying life-threatening texts using large language models (LLMs) and compares them with traditional methods such as bag of words, word embedding, topic modeling, and Bidirectional Encoder Representations from Transformers. We fine-tune three open-source LLMs including Gemma, Mistral, and Llama-2 using their 7B parameter variants on different datasets, which are constructed with class balance, imbalance, and extreme imbalance scenarios. Experimental results demonstrate a strong performance of LLMs against traditional methods. More specifically, Mistral and Llama-2 models are top performers in both balanced and imbalanced data scenarios while Gemma is slightly behind. We employ the upsampling technique to deal with the imbalanced data scenarios and demonstrate that while this method benefits traditional approaches, it does not have as much impact on LLMs. This study demonstrates a great potential of LLMs for real-world life-threatening language detection problems.

nan


Article 724

Title@2025-06-12 (4): Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

Title: Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models Habe ich treu gesagt, was ich dachte? Die Kluft zwischen neuraler Aktivität und Selbsterklärungen in großen Sprachmodellen überbrücken 缩小大语言模式中神经活动与自我开发之间的差距 2506.09277v2

Authors (5): Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot

Large Language Models (LLM) have demonstrated the capability of generating free text self Natural Language Explanation (self-NLE) to justify their answers. Despite their logical appearance, self-NLE do not necessarily reflect the LLM actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model’s reasoning. This work introduces a novel flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE by directly comparing the latter with interpretations of the model’s internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLE.

nan


Article 725

Title@2025-06-12 (4): TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving

Title: TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving TeleMath: Ein Benchmark für große Sprachmodelle in der Telecom Mathematischen Problemlösung TeleMatth:电信数学问题解决大语言模型基准 2506.10674v1

Authors (6): Vincenzo Colle, Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Merouane Debbah

The increasing adoption of artificial intelligence in telecommunications has raised interest in the capability of Large Language Models (LLMs) to address domain-specific, mathematically intensive tasks. Although recent advancements have improved the performance of LLMs in general mathematical reasoning, their effectiveness within specialized domains, such as signal processing, network optimization, and performance analysis, remains largely unexplored. To address this gap, we introduce TeleMath, the first benchmark dataset specifically designed to evaluate LLM performance in solving mathematical problems with numerical solutions in the telecommunications domain. Comprising 500 question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the telecommunications field. This paper outlines the proposed QnAs generation pipeline, starting from a selected seed of problems crafted by Subject Matter Experts. The evaluation of a wide range of open-source LLMs reveals that best performance on TeleMath is achieved by recent models explicitly designed for mathematical or logical reasoning. In contrast, general-purpose models, even those with a large number of parameters, often struggle with these challenges. We have released the dataset and the evaluation code to ease result reproducibility and support future research.

nan


Article 726

Title@2025-06-12 (4): CoRT: Code-integrated Reasoning within Thinking

Title: CoRT: Code-integrated Reasoning within Thinking CoRT: Code-integrierte Vernunft im Denken CORT: 思考中守则综合理由 2506.09820v2

Authors (11): Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu

Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model’s internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4\% and 8\% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30\% fewer tokens for the 32B model and 50\% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at https://github.com/ChengpengLi1003/CoRT.

nan


Article 727

Title@2025-06-12 (4): Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes

Title: Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes Robuste, unüberwachte Anpassung eines Spracherkennungsgeräts mit Entropie-Minimierungs- und Lautsprechercodes 使用磁最小化和演讲人守则的演讲者演讲者 2506.10653v1

Authors (4): Rogier C. van Dalen, Shucong Zhang, Titouan Parcollet, Sourav Bhattacharya

Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation to a single minute of data robust. First, instead of estimating the adaptation parameters with cross-entropy on a single error-prone hypothesis or “pseudo-label”, this paper proposes a novel loss function, the conditional entropy over complete hypotheses. Using multiple hypotheses makes adaptation more robust to errors in the initial recognition. Second, a “speaker code” characterises a speaker in a vector short enough that it requires little data to estimate. On a far-field noise-augmented version of Common Voice, the proposed scheme yields a 20% relative improvement in word error rate on one minute of adaptation data, increasing on 10 minutes to 29%.

nan


Article 728

Title@2025-06-12 (4): Identifying Reliable Evaluation Metrics for Scientific Text Revision

Title: Identifying Reliable Evaluation Metrics for Scientific Text Revision Identifizieren von verlässlichen Bewertungsmetrics für wissenschaftliche Textrevision 科学文本订正的可靠评价计量指标 2506.04772v3

Authors (4): Léane Jourdan, Florian Boudin, Richard Dufour, Nicolas Hernandez

Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.

nan


Article 729

Title@2025-06-12 (4): Spelling-out is not Straightforward: LLMs’ Capability of Tokenization from Token to Characters

Title: Spelling-out is not Straightforward: LLMs’ Capability of Tokenization from Token to Characters Rechtschreibung ist nicht geradeaus: LLMs Fähigkeit der Tokenisierung von Token zu Charakteren 拼写出不是直向前进的: LLMs 的调制能力从调制字符到字符 2506.10641v1

Authors (2): Tatsuya Hiraoka, Kentaro Inui

Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct “breakthrough” in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.

nan


Article 730

Title@2025-06-12 (4): Conversational Search: From Fundamentals to Frontiers in the LLM Era

Title: Conversational Search: From Fundamentals to Frontiers in the LLM Era Conversational Search: Von Grundlagen zu Grenzen in der LLM-Ära 对话搜索:从基本原理到LLM时代的前沿 2506.10635v1

Authors (4): Fengran Mo, Chuan Meng, Mohammad Aliannejadi, Jian-Yun Nie

Conversational search enables multi-turn interactions between users and systems to fulfill users’ complex information needs. During this interaction, the system should understand the users’ search intent within the conversational context and then return the relevant information through a flexible, dialogue-based interface. The recent powerful large language models (LLMs) with capacities of instruction following, content generation, and reasoning, attract significant attention and advancements, providing new opportunities and challenges for building up intelligent conversational search systems. This tutorial aims to introduce the connection between fundamentals and the emerging topics revolutionized by LLMs in the context of conversational search. It is designed for students, researchers, and practitioners from both academia and industry. Participants will gain a comprehensive understanding of both the core principles and cutting-edge developments driven by LLMs in conversational search, equipping them with the knowledge needed to contribute to the development of next-generation conversational search systems.

nan


Article 731

Title@2025-06-12 (4): NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors

Title: NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors NeuralNexus bei BEA 2025 Shared Task: Retrieval-Augmented Prompting für Fehlererkennung in KI-Tutoren BEA 2025年BEA 的神经外观 共同任务:在 AI 导师中检索错误识别提示 2506.10627v1

Authors (4): Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal

This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor’s response correctly identifies a mistake in a student’s mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM) i.e. GPT 4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.

nan


Article 732

Title@2025-06-12 (4): SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis

Title: SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis SDialog: Ein Python-Toolkit für die Synthetische Dialog-Generierung und -Analyse Sidialog:合成对话生成和分析的Python工具包 2506.10622v1

Authors (3): Sergio Burdisso, Esaú Villatoro-Tello, Petr Motlicek

The advancement of conversational AI systems relies on the availability of high-quality, flexible, and reproducible synthetic dialogues for training, evaluation, and benchmarking. SDialog is a modular, extensible Python toolkit designed to address the challenges of synthetic dialogue generation and analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog provides abstractions for personas, orchestration, and scenario management, enabling the creation of realistic, diverse, and controllable conversational data for research and development. SDialog supports workflows such as multi-agent simulation and scenario-driven generation, and represents a step forward in the standardization of tools and frameworks for synthetic data generation, a crucial advancement for ensuring reproducibility in today’s fast-evolving research landscape.

nan


Article 733

Title@2025-06-12 (4): Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code

Title: Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code Deep Learning-based Digitalisierung von überlappenden EKG-Bildern mit Open-Source-Python-Code 使用开放源码的 ECG 重叠图像的深学习数字化 2506.10617v1

Authors (4): Reza Karbasi, Masoud Rahimi, Abdol-Hossein Vahabie, Hadi Moradi

This paper addresses the persistent challenge of accurately digitizing paper-based electrocardiogram (ECG) recordings, with a particular focus on robustly handling single leads compromised by signal overlaps-a common yet under-addressed issue in existing methodologies. We propose a two-stage pipeline designed to overcome this limitation. The first stage employs a U-Net based segmentation network, trained on a dataset enriched with overlapping signals and fortified with custom data augmentations, to accurately isolate the primary ECG trace. The subsequent stage converts this refined binary mask into a time-series signal using established digitization techniques, enhanced by an adaptive grid detection module for improved versatility across different ECG formats and scales. Our experimental results demonstrate the efficacy of our approach. The U-Net architecture achieves an IoU of 0.87 for the fine-grained segmentation task. Crucially, our proposed digitization method yields superior performance compared to a well-established baseline technique across both non-overlapping and challenging overlapping ECG samples. For non-overlapping signals, our method achieved a Mean Squared Error (MSE) of 0.0010 and a Pearson Correlation Coefficient (rho) of 0.9644, compared to 0.0015 and 0.9366, respectively, for the baseline. On samples with signal overlap, our method achieved an MSE of 0.0029 and a rho of 0.9641, significantly improving upon the baseline’s 0.0178 and 0.8676. This work demonstrates an effective strategy to significantly enhance digitization accuracy, especially in the presence of signal overlaps, thereby laying a strong foundation for the reliable conversion of analog ECG records into analyzable digital data for contemporary research and clinical applications. The implementation is publicly available at this GitHub repository: https://github.com/masoudrahimi39/ECG-code.

nan


Article 734

Title: Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search Unüberwachte protoforme Rekonstruktion durch Parsimonious Regel-geführte Heuristiken und evolutionäre Suche 通过法理学、法理学、受规制的哲理学和进化搜索进行不受监督的原形重建 2506.10614v1

Authors (1): Promise Dodzi Kpoglu

We propose an unsupervised method for the reconstruction of protoforms i.e., ancestral word forms from which modern language forms are derived. While prior work has primarily relied on probabilistic models of phonological edits to infer protoforms from cognate sets, such approaches are limited by their predominantly data-driven nature. In contrast, our model integrates data-driven inference with rule-based heuristics within an evolutionary optimization framework. This hybrid approach leverages on both statistical patterns and linguistically motivated constraints to guide the reconstruction process. We evaluate our method on the task of reconstructing Latin protoforms using a dataset of cognates from five Romance languages. Experimental results demonstrate substantial improvements over established baselines across both character-level accuracy and phonological plausibility metrics.

nan


Article 735

Title@2025-06-12 (4): ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization

Title: ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization ConfPO: Ausnutzen des politischen Modells Vertrauen für kritische Token-Auswahl in Präferenz-Optimierung 召集:利用政策模范信心在优先最佳化中选择关键物优选标准 2506.08712v2

Authors (5): Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo

We introduce ConfPO, a method for preference learning in Large Language Models (LLMs) that identifies and optimizes preference-critical tokens based solely on the training policy’s confidence, without requiring any auxiliary models or compute. Unlike prior Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO), which uniformly adjust all token probabilities regardless of their relevance to preference, ConfPO focuses optimization on the most impactful tokens. This targeted approach improves alignment quality while mitigating overoptimization (i.e., reward hacking) by using the KL divergence budget more efficiently. In contrast to recent token-level methods that rely on credit-assignment models or AI annotators, raising concerns about scalability and reliability, ConfPO is simple, lightweight, and model-free. Experimental results on challenging alignment benchmarks, including AlpacaEval 2 and Arena-Hard, demonstrate that ConfPO consistently outperforms uniform DAAs across various LLMs, delivering better alignment with zero additional computational overhead.

nan


Article 736

Title@2025-06-12 (4): IPA-CHILDES & G2P+: Feature-Rich Resources for Cross-Lingual Phonology and Phonemic Language Modeling

Title: IPA-CHILDES & G2P+: Feature-Rich Resources for Cross-Lingual Phonology and Phonemic Language Modeling IPA-CHILDES & G2P+: Feature-Rich-Ressourcen für Cross-Lingual Phonologie und Phonemic Language Modeling IPA-CHILDES & G2P+:跨语言歌曲和语音语言建模的地貌资源 2504.03036v3

Authors (2): Zébulon Goriely, Paula Buttery

In this paper, we introduce two resources: (i) G2P+, a tool for converting orthographic datasets to a consistent phonemic representation; and (ii) IPA CHILDES, a phonemic dataset of child-centered speech across 31 languages. Prior tools for grapheme-to-phoneme conversion result in phonemic vocabularies that are inconsistent with established phonemic inventories, an issue which G2P+ addresses by leveraging the inventories in the Phoible database. Using this tool, we augment CHILDES with phonemic transcriptions to produce IPA CHILDES. This new resource fills several gaps in existing phonemic datasets, which often lack multilingual coverage, spontaneous speech, and a focus on child-directed language. We demonstrate the utility of this dataset for phonological research by training phoneme language models on 11 languages and probing them for distinctive features, finding that the distributional properties of phonemes are sufficient to learn major class and place features cross-lingually.

nan


Article 737

Title@2025-06-12 (4): Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges

Title: Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges Pragmatics in the Era of Large Language Models: Eine Umfrage zu Datensätzen, Evaluation, Chancen und Herausforderungen 《大语言模式时代中的实用模型:关于数据集、评价、机遇和挑战的调查》 2502.12378v3

Authors (10): Bolei Ma, Yuting Li, Wei Zhou, Ziwei Gong, Yang Janet Liu, Katja Jasinskaja, Annemarie Friedrich, Julia Hirschberg, Frauke Kreuter, Barbara Plank

Understanding pragmatics-the use of language in context-is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatic phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.

nan


Article 738

Title@2025-06-12 (4): Encoding call-by-push-value in the pi-calculus

Title: Encoding call-by-push-value in the pi-calculus Kodierung des Call-by-Push-Wertes im Pi-Calculus Pi-calcululus 中的编码调用 by-push- 值 2506.10584v1

Authors (3): Benjamin Bennetzen, Nikolaj Rossander Kristensen, Peter Buus Steffensen

In this report we define an encoding of Levys call-by-push-value lambda-calculus (CBPV) in the pi-calculus, and prove that our encoding is both sound and complete. We present informal (by-hand) proofs of soundness, completeness, and all required lemmas. The encoding is specialized to the internal pi-calculus (pi-i-calculus) to circumvent certain challenges associated with using de Bruijn index in a formalization, and it also helps with bisimulation as early-, late- and open-bisimulation coincide in this setting, furthermore bisimulation is a congruence. Additionally, we argue that our encoding also satisfies the five criteria for good encodings proposed by Gorla, as well as show similarities between Milners and our encoding. This paper includes encodings from CBPV in the pi-i-calculus, asynchronous polyadic pi-calculus and the local pi-calculus. We begin a formalization of the proof in Coq for the soundness and completeness of the encoding in the pi-i-calculus. Not all lemmas used in the formalization are themselves formally proven. However, we argue that the non-proven lemmas are reasonable, as they are proven by hand, or amount to Coq formalities that are straightforward given informal arguments.

nan


Article 739

Title@2025-06-12 (4): BabyLM’s First Words: Word Segmentation as a Phonological Probing Task

Title: BabyLM’s First Words: Word Segmentation as a Phonological Probing Task BabyLMs erste Worte: Wortsegmentierung als phonologische Probing-Aufgabe BabyLM 的第一单词: 单词分割作为声学检测任务 2504.03338v3

Authors (2): Zébulon Goriely, Paula Buttery

Language models provide a key framework for studying linguistic theories based on prediction, but phonological analysis using large language models (LLMs) is difficult; there are few phonological benchmarks beyond English and the standard input representation used in LLMs (subwords of graphemes) is not suitable for analyzing the representation of phonemes. In this work, we demonstrate how word segmentation can be used as a phonological probing task, allowing us to study the representations learned by phoneme-based language models trained on child-directed speech across 31 languages. Following computational models of word segmentation, we present unsupervised methods for extracting word boundaries from a trained model using the observation that prediction-error peaks at the start of words. We also use linear probes to identify that these models implicitly track word boundaries, even when they do not appear in training. This cross-lingual work corroborates statistical learning theories of acquisition and empirically motivates new methods for training subword tokenizers.

nan


Article 740

Title@2025-06-12 (4): Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets

Title: Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets Menschliche und LLM-Biasen in Hate Speech Annotationen: Eine sozio-demographische Analyse von Annotatoren und Zielen 仇恨言论说明中的人类和LLM比阿斯语:对说明者和目标的社会-人口分析 2410.07991v6

Authors (5): Tommaso Giorgi, Lorenzo Cima, Tiziano Fagni, Marco Avvenuti, Stefano Cresci

The rise of online platforms exacerbated the spread of hate speech, demanding scalable and effective detection. However, the accuracy of hate speech detection systems heavily relies on human-labeled data, which is inherently susceptible to biases. While previous work has examined the issue, the interplay between the characteristics of the annotator and those of the target of the hate are still unexplored. We fill this gap by leveraging an extensive dataset with rich socio-demographic information of both annotators and targets, uncovering how human biases manifest in relation to the target’s attributes. Our analysis surfaces the presence of widespread biases, which we quantitatively describe and characterize based on their intensity and prevalence, revealing marked differences. Furthermore, we compare human biases with those exhibited by persona-based LLMs. Our findings indicate that while persona-based LLMs do exhibit biases, these differ significantly from those of human annotators. Overall, our work offers new and nuanced results on human biases in hate speech annotations, as well as fresh insights into the design of AI-driven hate speech detection systems.

nan


Article 741

Title@2025-06-12 (4): Reinforcing Multimodal Understanding and Generation with Dual Self-rewards

Title: Reinforcing Multimodal Understanding and Generation with Dual Self-rewards Verstärkung des multimodalen Verständnisses und der Erzeugung mit Dual Self-Rewards 加强多模式理解和多模式代代与双重奖赏 2506.07963v2

Authors (6): Jixiang Hong, Yiran Zhang, Guanzhong Wang, Yi Liu, Ji-Rong Wen, Rui Yan

Building upon large language models (LLMs), recent large multimodal models (LMMs) unify cross-model understanding and generation into a single framework. However, LMMs still struggle to achieve accurate image-text alignment, prone to generating text responses contradicting the visual input or failing to follow the text-to-image prompts. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks-either understanding or generation. In this work, based on the observation that understanding and generation are inverse dual tasks, we introduce a self-supervised dual reward mechanism to reinforce the understanding and generation capabilities of LMMs. Specifically, we sample multiple outputs for a given input in one task domain, then reverse the input-output pairs to compute the dual likelihood of the model as self-rewards for optimization. Extensive experimental results on visual understanding and generation benchmarks demonstrate that our method can effectively enhance the performance of the model without any external supervision, especially achieving remarkable improvements in text-to-image tasks.

nan


Article 742

Title@2025-06-12 (4): Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models

Title: Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models Obliviate: Effiziente Unvergesslichkeit für den Schutz geistigen Eigentums in großen Sprachmodellen 默认:在大语言模式中有效统一保护知识产权 2502.15010v2

Authors (2): Mark Russinovich, Ahmed Salem

Recent copyright agreements between AI companies and content creators underscore the need for fine-grained control over language models’ ability to reproduce copyrighted text. Existing defenses-ranging from aggressive unlearning to simplistic output filters-either sacrifice model utility or inadequately address verbatim leakage. We introduce Obliviate, a lightweight post-training method that surgically suppresses exact reproduction of specified sequences while preserving semantic understanding. Obliviate first identifies memorized passages and then, for each target token, minimally adjusts the model’s output distribution via a Kullback-Leibler divergence penalty to drive down the probability of exact reproduction. Simultaneously, we enforce a consistency loss on non-target tokens to retain the model’s fluency and task performance. We evaluate Obliviate on four popular 6-8B-parameter models (LLaMA-3.1, LLaMA-3.1-Instruct, Qwen-2.5, and Yi-1.5) using synthetic memorization benchmarks and organic copyrighted excerpts (e.g., Moby Dick, Frankenstein, Alice in Wonderland and Les Miserables). Across all settings, Obliviate reduces verbatim recall by two orders of magnitude (e.g., from hundreds of words to fewer than 12) while degrading downstream accuracy by at most 1% on HellaSwag, MMLU, TruthfulQA, and Winogrande. Furthermore, we benchmark Obliviate aganist different unlearning and copyright techniques using the MUSE and CoTaEval benchmarks. These results position Obliviate as a practical, high-fidelity solution for copyright compliance in deployed LLMs.

nan


Article 743

Title@2025-06-12 (4): Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs

Title: Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs Zuverlässiger Weg zur Vernunft: Destillieren effektiver Leitlinien für LLM-Reasoning mit Wissensgraphen 可靠理由说明:为学习图解的LLM 理由说明保留有效指导 2506.10508v1

Authors (6): Yilin Xiao, Chuang Zhou, Qinggang Zhang, Bo Li, Qing Li, Xiao Huang

Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is equally important as factual knowledge itself. Despite their potential, extracting reliable reasoning paths from KGs poses the following challenges: the complexity of graph structures and the existence of multiple generated paths, making it difficult to distinguish between useful and redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.

nan


Article 744

Title@2025-06-12 (4): Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps

Title: Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps Messung der Kette der Gedankentreue durch unlernende Vernunftschritte 通过 “ 不学习理性步骤 “ 衡量思考链的信念 2502.14829v2

Authors (4): Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasović, Yonatan Belinkov

When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. Despite much work on CoT prompting, it is unclear if reasoning verbalized in a CoT is faithful to the models’ parametric beliefs. We introduce a framework for measuring parametric faithfulness of generated reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an instance of this framework. FUR erases information contained in reasoning steps from model parameters, and measures faithfulness as the resulting effect on the model’s prediction. Our experiments with four LMs and five multi-hop multi-choice question answering (MCQA) datasets show that FUR is frequently able to precisely change the underlying models’ prediction for a given instance by unlearning key steps, indicating when a CoT is parametrically faithful. Further analysis shows that CoTs generated by models post-unlearning support different answers, hinting at a deeper effect of unlearning.

nan


Article 745

Title@2025-06-12 (4): Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models

Title: Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models Beyond Single-User Dialogue: Bewertung des Multi-User Dialogue State Tracking Fähigkeiten großer Sprachmodelle 超越单一用户对话:评估多用户对话国家跟踪大语言模式的能力 2506.10504v1

Authors (4): Sangmin Song, Juhwan Choi, JungMin Yun, YoungBin Kim

Large language models (LLMs) have demonstrated remarkable performance in zero-shot dialogue state tracking (DST), reducing the need for task-specific training. However, conventional DST benchmarks primarily focus on structured user-agent conversations, failing to capture the complexities of real-world multi-user interactions. In this study, we assess the robustness of LLMs in multi-user DST while minimizing dataset construction costs. Inspired by recent advances in LLM-based data annotation, we extend an existing DST dataset by generating utterances of a second user based on speech act theory. Our methodology systematically incorporates a second user’s utterances into conversations, enabling a controlled evaluation of LLMs in multi-user settings. Experimental results reveal a significant performance drop compared to single-user DST, highlighting the limitations of current LLMs in extracting and tracking dialogue states amidst multiple speakers. Our findings emphasize the need for future research to enhance LLMs for multi-user DST scenarios, paving the way for more realistic and robust DST models.

nan


Article 746

Title@2025-06-12 (4): Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics

Title: Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics Achten Sie auf den Style Gap: Meta-Evaluation von Style und Attribut-Transfer-Metriken 思维风格差距:对样式和属性转移的元评价 2502.15022v3

Authors (3): Amalie Brogaard Pauli, Isabelle Augenstein, Ira Assent

Large language models (LLMs) make it easy to rewrite a text in any style – e.g. to make it more polite, persuasive, or more positive – but evaluation thereof is not straightforward. A challenge lies in measuring content preservation: that content not attributable to style change is retained. This paper presents a large meta-evaluation of metrics for evaluating style and attribute transfer, focusing on content preservation. We find that meta-evaluation studies on existing datasets lead to misleading conclusions about the suitability of metrics for content preservation. Widely used metrics show a high correlation with human judgments despite being deemed unsuitable for the task – because they do not abstract from style changes when evaluating content preservation. We show that the overly high correlations with human judgment stem from the nature of the test data. To address this issue, we introduce a new, challenging test set specifically designed for evaluating content preservation metrics for style transfer. Using this dataset, we demonstrate that suitable metrics for content preservation for style transfer indeed are style-aware. To support efficient evaluation, we propose a new style-aware method that utilises small language models, obtaining a higher alignment with human judgements than prompting a model of a similar size as an autorater.

nan


Article 747

Title@2025-06-12 (4): Towards Large Language Models with Self-Consistent Natural Language Explanations

Title: Towards Large Language Models with Self-Consistent Natural Language Explanations Auf dem Weg zu großen Sprachmodellen mit selbstkonsistenten natürlichen Spracherklärungen 努力建立具有自我联系自然语言解释的大型语言模式 2506.07523v2

Authors (4): Sahar Admoni, Ofra Amir, Assaf Hallak, Yftah Ziser

Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their decisions. Yet, studies show that these post-hoc explanations often misrepresent the true decision process, as revealed by mismatches in feature importance. Despite growing evidence of this inconsistency, no systematic solutions have emerged, partly due to the high cost of estimating feature importance, which limits evaluations to small datasets. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB) - a large-scale benchmark of decisions spanning diverse tasks and models, each paired with LLM-generated explanations and corresponding feature importance scores. Analysis of PSCB reveals that self-consistency scores barely differ between correct and incorrect predictions. We also show that the standard metric fails to meaningfully distinguish between explanations. To overcome this limitation, we propose an alternative metric that more effectively captures variation in explanation quality. We use it to fine-tune LLMs via Direct Preference Optimization (DPO), leading to significantly better alignment between explanations and decision-relevant features, even under domain shift. Our findings point to a scalable path toward more trustworthy, self-consistent LLMs.

nan


Article 748

Title@2025-06-12 (4): Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models

Title: Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models Surface Fairness, Deep Bias: Eine vergleichende Studie über Bias in Sprachmodellen 地表公平、深比亚:语言模型比亚比较研究 2506.10491v1

Authors (4): Aleksandra Sorokovikova, Pavel Chizhov, Iuliia Eremenko, Ivan P. Yamshchikov

Modern language models are trained on large amounts of data. These data inevitably include controversial and stereotypical content, which contains all sorts of biases related to gender, origin, age, etc. As a result, the models express biased points of view or produce different results based on the assigned personality or the personality of the user. In this paper, we investigate various proxy measures of bias in large language models (LLMs). We find that evaluating models with pre-prompted personae on a multi-subject benchmark (MMLU) leads to negligible and mostly random differences in scores. However, if we reformulate the task and ask a model to grade the user’s answer, this shows more significant signs of bias. Finally, if we ask the model for salary negotiation advice, we see pronounced bias in the answers. With the recent trend for LLM assistant memory and personalization, these problems open up from a different angle: modern LLM users do not need to pre-prompt the description of their persona since the model already knows their socio-demographics.

nan


Article 749

Title@2025-06-12 (4): ClimateChat: Designing Data and Methods for Instruction Tuning LLMs to Answer Climate Change Queries

Title: ClimateChat: Designing Data and Methods for Instruction Tuning LLMs to Answer Climate Change Queries ClimateChat: Daten und Methoden für die Anleitung zur Anpassung von LLMs an Klimawandelfragen entwerfen ClimateChat:设计用于教学的数据和方法,用于指导如何引导LMLM 以应对气候变化询问 2506.13796v1

Authors (5): Zhou Chen, Xiao Wang, Yuanhong Liao, Ming Lin, Yuqi Bai

As the issue of global climate change becomes increasingly severe, the demand for research in climate science continues to grow. Natural language processing technologies, represented by Large Language Models (LLMs), have been widely applied to climate change-specific research, providing essential information support for decision-makers and the public. Some studies have improved model performance on relevant tasks by constructing climate change-related instruction data and instruction-tuning LLMs. However, current research remains inadequate in efficiently producing large volumes of high-precision instruction data for climate change, which limits further development of climate change LLMs. This study introduces an automated method for constructing instruction data. The method generates instructions using facts and background knowledge from documents and enhances the diversity of the instruction data through web scraping and the collection of seed instructions. Using this method, we constructed a climate change instruction dataset, named ClimateChat-Corpus, which was used to fine-tune open-source LLMs, resulting in an LLM named ClimateChat. Evaluation results show that ClimateChat significantly improves performance on climate change question-and-answer tasks. Additionally, we evaluated the impact of different base models and instruction data on LLM performance and demonstrated its capability to adapt to a wide range of climate change scientific discovery tasks, emphasizing the importance of selecting an appropriate base model for instruction tuning. This research provides valuable references and empirical support for constructing climate change instruction data and training climate change-specific LLMs.

nan


Article 750

Title@2025-06-12 (4): Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers

Title: Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers Tabelle-Text Alignment: Erklärung der Antragsprüfung gegen Tabellen in wissenschaftlichen Arbeiten 表-文字对齐:对照科学文件表格解释索赔核实 2506.10486v1

Authors (6): Xanh Ho, Sunisth Kumar, Yun-Ang Wu, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa

Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model’s reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.

nan


Article 751

Title@2025-06-12 (4): IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

Title: IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language IndoToxic2024: Ein demographischer Datensatz von Hass-Sprach- und Toxizitätstypen für indonesische Sprache Indotoxic2024:印度尼西亚语仇恨言论和毒性类型人口资料集 2406.19349v2

Authors (7): Lucky Susanto, Musa Izzanardi Wijanarko, Prasetia Anugrah Pratama, Traci Hong, Ika Idris, Alham Fikri Aji, Derry Wijaya

Hate speech poses a significant threat to social harmony. Over the past two years, Indonesia has seen a ten-fold increase in the online hate speech ratio, underscoring the urgent need for effective detection mechanisms. However, progress is hindered by the limited availability of labeled data for Indonesian texts. The condition is even worse for marginalized minorities, such as Shia, LGBTQ, and other ethnic minorities because hate speech is underreported and less understood by detection tools. Furthermore, the lack of accommodation for subjectivity in current datasets compounds this issue. To address this, we introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset. Comprising 43,692 entries annotated by 19 diverse individuals, the dataset focuses on texts targeting vulnerable groups in Indonesia, specifically during the hottest political event in the country: the presidential election. We establish baselines for seven binary classification tasks, achieving a macro-F1 score of 0.78 with a BERT model (IndoBERTweet) fine-tuned for hate speech classification. Furthermore, we demonstrate how incorporating demographic information can enhance the zero-shot performance of the large language model, gpt-3.5-turbo. However, we also caution that an overemphasis on demographic information can negatively impact the fine-tuned model performance due to data fragmentation.

nan


Article 752

Title@2025-06-12 (4): VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Title: VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models VScan: Rethinking Visual Token Reduction für effiziente große Vision-Sprache Modelle Vscan:重新思考如何降低视力,以建立高效的大型视觉语言模型 2505.22654v2

Authors (10): Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu

Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-arts on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4\% of the original performance. Code is available at https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan.

nan


Article 753

Title@2025-06-12 (4): Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts

Title: Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts Auf dem Weg zur robusten multimodalen Emotionserkennung unter fehlenden Modalitäten und Verteilungsverschiebungen 争取在缺失模式和分销转移模式下强有力地承认多模式情感 2506.10452v1

Authors (5): Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen

Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.

nan


Article 754

Title@2025-06-12 (4): Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations

Title: Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations Social Bias Benchmark for Generation: Ein Vergleich von Generation und QA-basierten Bewertungen 社会比重基准: 社会比重基准: 社会比比: 社会比比: 社会比比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比: 社会比 2503.06987v2

Authors (4): Jiho Jin, Woosung Kang, Junho Myung, Alice Oh

Measuring social bias in large language models (LLMs) is crucial, but existing bias evaluation methods struggle to assess bias in long-form generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form generation by having LLMs generate continuations of story prompts. Building our benchmark in English and Korean, we measure the probability of neutral and biased generations across ten LLMs. We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.

nan


Article 755

Title@2025-06-12 (4): Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty

Title: Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty Schnell auf dem Einfachen, Tief auf dem Harten: Effiziente Vernunft über Powered Length Penalty 快速快速执行 “ 容易 “ 、 “ 深沉:通过死刑有效解释理由 “ 2506.10446v1

Authors (6): Zehui Ling, Deshu Chen, Hongwei Zhang, Yifeng Jiao, Xin Guo, Yuan Cheng

Large language models (LLMs) have demonstrated significant advancements in reasoning capabilities, performing well on various challenging benchmarks. Techniques like Chain-of-Thought prompting have been introduced to further improve reasoning. However, these approaches frequently generate longer outputs, which in turn increase computational latency. Although some methods use reinforcement learning to shorten reasoning, they often apply uniform penalties without considering the problem’s complexity, leading to suboptimal outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by promoting conciseness for simpler problems while preserving sufficient reasoning for more complex ones for accuracy, thus improving the model’s overall performance. Specifically, we manage the model’s reasoning efficiency by dividing the reward function and including a novel penalty for output length. Our approach has yielded impressive outcomes in benchmark evaluations across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively simpler datasets GSM8K and MATH500, our method has effectively shortened output lengths while preserving or enhancing accuracy. On the more demanding AIME2024 dataset, our approach has resulted in improved accuracy.

nan


Article 756

Title@2025-06-12 (4): CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning

Title: CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning CheMatAgent: Verbesserung von LLMs für Chemie und Materialwissenschaft durch baumsuchebasiertes Tool Learning CheMatAgent:通过植树搜索工具学习加强化学和材料科学LLMs 2506.07551v2

Authors (10): Mengsong Wu, YaFei Wang, Yidong Ming, Yuqi An, Yuwei Wan, Wenliang Chen, Binbin Lin, Yuqiang Li, Tong Xie, Dongzhan Zhou

Large language models (LLMs) have recently demonstrated promising capabilities in chemistry tasks while still facing challenges due to outdated pretraining knowledge and the difficulty of incorporating specialized chemical expertise. To address these issues, we propose an LLM-based agent that synergistically integrates 137 external chemical tools created ranging from basic information retrieval to complex reaction predictions, and a dataset curation pipeline to generate the dataset ChemToolBench that facilitates both effective tool selection and precise parameter filling during fine-tuning and evaluation. We introduce a Hierarchical Evolutionary Monte Carlo Tree Search (HE-MCTS) framework, enabling independent optimization of tool planning and execution. By leveraging self-generated data, our approach supports step-level fine-tuning (FT) of the policy model and training task-adaptive PRM and ORM that surpass GPT-4o. Experimental evaluations demonstrate that our approach significantly improves performance in Chemistry QA and discovery tasks, offering a robust solution to integrate specialized tools with LLMs for advanced chemical applications. All datasets and code are available at https://github.com/AI4Chem/ChemistryAgent .

nan


Article 757

Title@2025-06-12 (4): ConvD: Attention Enhanced Dynamic Convolutional Embeddings for Knowledge Graph Completion

Title: ConvD: Attention Enhanced Dynamic Convolutional Embeddings for Knowledge Graph Completion ConvD: Aufmerksamkeitsverstärkte dynamische konvolutionäre Einbettungen für die Wissensgraphenvervollständigung ConvD: 关注增强动态动态嵌入,以完成知识图的完成 2312.07589v2

Authors (7): Wenbin Guo, Zhao Li, Xin Wang, Zirui Chen, Jun Zhao, Jianxin Li, Ye Yuan

Knowledge graphs often suffer from incompleteness issues, which can be alleviated through information completion. However, current state-of-the-art deep knowledge convolutional embedding models rely on external convolution kernels and conventional convolution processes, which limits the feature interaction capability of the model. This paper introduces a novel dynamic convolutional embedding model, ConvD, which directly reshapes relation embeddings into multiple internal convolution kernels. This approach effectively enhances the feature interactions between relation embeddings and entity embeddings. Simultaneously, we incorporate a priori knowledge-optimized attention mechanism that assigns different contribution weight coefficients to the multiple relation convolution kernels in dynamic convolution, further boosting the expressive power of the model. Extensive experiments on various datasets show that our proposed model consistently outperforms the state-of-the-art baseline methods, with average improvements ranging from 3.28% to 14.69% across all model evaluation metrics, while the number of parameters is reduced by 50.66% to 85.40% compared to other state-of-the-art models.

nan


Article 758

Title@2025-06-12 (4): PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs

Title: PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs PAL: Probing Audio-Encoder über LLMs – Eine Studie über den Informationstransfer von Audio-Encodern zu LLMs PAL:通过LLMs探查音频成象器 – – 研究从音频成象器向LLMs传送信息 2506.10423v1

Authors (7): Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson, Imran Razzak, Muhammad Awais

The integration of audio perception capabilities into Large Language Models (LLMs) has enabled significant advances in Audio-LLMs. Although application-focused developments, particularly in curating training data for specific capabilities e.g., audio reasoning, have progressed rapidly, the underlying mechanisms that govern efficient transfer of rich semantic representations from audio encoders to LLMs remain under-explored. We conceptualize effective audio-LLM interaction as the LLM’s ability to proficiently probe the audio encoder representations to satisfy textual queries. This paper presents a systematic investigation on how architectural design choices can affect that. Beginning with a standard Pengi/LLaVA-style audio-LLM architecture, we propose and evaluate several modifications guided by hypotheses derived from mechanistic interpretability studies and LLM operational principles. Our experiments demonstrate that: (1) delaying audio integration until the LLM’s initial layers establish textual context that enhances its ability to probe the audio representations for relevant information; (2) the LLM can proficiently probe audio representations exclusively through LLM layer’s attention submodule, without requiring propagation to its Feed-Forward Network (FFN) submodule; (3) an efficiently integrated ensemble of diverse audio encoders provides richer, complementary representations, thereby broadening the LLM’s capacity to probe a wider spectrum of audio information. All hypotheses are evaluated using an identical three-stage training curriculum on a dataset of 5.6 million audio-text pairs, ensuring controlled comparisons. Our final architecture, which incorporates all proposed modifications, achieves relative improvements from 10\% to 60\% over the baseline, validating our approach to optimizing cross-modal information transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/

nan


Article 759

Title@2025-06-12 (4): Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting

Title: Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting Jenseits des Schlachtfeldes: Framing Analyse der Medienabdeckung in der Konfliktberichterstattung 战场以外的战场:冲突报道中媒体报道的系统化分析 2506.10421v1

Authors (2): Avneet Kaur, Arnav Arora

Framing used by news media, especially in times of conflict, can have substantial impact on readers’ opinion, potentially aggravating the conflict itself. Current studies on the topic of conflict framing have limited insights due to their qualitative nature or only look at surface level generic frames without going deeper. In this work, we identify indicators of war and peace journalism, as outlined by prior work in conflict studies, in a corpus of news articles reporting on the Israel-Palestine war. For our analysis, we use computational approaches, using a combination of frame semantics and large language models to identify both communicative framing and its connection to linguistic framing. Our analysis reveals a higher focus on war based reporting rather than peace based. We also show substantial differences in reporting across the US, UK, and Middle Eastern news outlets in framing who the assailant and victims of the conflict are, surfacing biases within the media.

nan


Article 760

Title@2025-06-12 (4): Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?

Title: Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences? Brennen Sie nach dem Lesen: Erfassen multimodale große Sprachmodelle wirklich die Reihenfolge der Ereignisse in Bildsequenzen? Burn after read: 多式大语言模型在图像序列中是否真的能捕捉事件秩序? 2506.10415v1

Authors (4): Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt

This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied with a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.

nan


Article 761

Title@2025-06-12 (4): Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series

Title: Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series Zeit-IMM: Ein Datensatz und Benchmark für irreguläre multimodale Multivariate Zeitreihen 时间-IMM:非正常多式联运多变时间序列的数据集和基准 2506.10412v1

Authors (7): Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, Wei Wang

Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series. Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms. Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation. IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies. Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance. Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions. The dataset is publicly available at https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the benchmark library can be accessed at https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.

nan


Article 762

Title@2025-06-12 (4): PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier

Title: PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier PAG: Multi-Turn verstärkt LLM Selbstkorrektion mit Politik als Generativer Prüfer PAG: 多发强化LLM自我校正,政策作为产生验证 2506.10406v1

Authors (8): Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, Yu Yue, Qianchuan Zhao, Lin Yan

Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG’s dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.

nan


Article 763

Title@2025-06-12 (4): iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering

Title: iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering iQUEST: Ein iteratives Frage-Framework für die Beantwortung von Fragen in der Wissensdatenbank i. 知识基础问题解答的动态问题指导框架 2506.01784v2

Authors (2): Shuai Wang, Yinan Yu

While Large Language Models (LLMs) excel at many natural language processing tasks, they often suffer from factual inaccuracies in knowledge-intensive scenarios. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1)~maintaining coherent reasoning paths, and (2)~avoiding prematurely discarding critical multi-hop connections. To address these issues, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.

nan


Article 764

Title@2025-06-12 (4): AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

Title: AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving AgentThink: Ein einheitliches Framework für Tool-Augmented Chain-of-Thought Reasoning in Vision-Language-Modellen für autonomes Fahren Agent Think: 自主驾驶愿景-语言模型中工具推荐研究链理由统一框架 2505.15298v3

Authors (21): Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang

Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink’s core innovations include: (i) Structured Data Generation, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model’s tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.

nan


Article 765

Title@2025-06-12 (4): On Many-Shot In-Context Learning for Long-Context Evaluation

Title: On Many-Shot In-Context Learning for Long-Context Evaluation Auf viel-heißes In-Context-Lernen für die Lang-Kontext-Evaluierung 为长期内容评价进行许多热的内文学习 2411.07130v3

Authors (3): Kaijian Zou, Muhammad Khalifa, Lu Wang

Many-shot in-context learning (ICL) has emerged as a unique setup to both utilize and test the ability of large language models to handle long context. This paper delves into long-context language model (LCLM) evaluation through many-shot ICL. We first ask: what types of ICL tasks benefit from additional demonstrations, and how effective are they in evaluating LCLMs? We find that classification and summarization tasks show performance improvements with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends. Next, we investigate the extent to which different tasks necessitate retrieval versus global context understanding. We develop metrics to categorize ICL tasks into two groups: (i) similar-sample learning (SSL): tasks where retrieval of the most similar examples is sufficient for good performance, and (ii) all-sample learning (ASL): tasks that necessitate a deeper comprehension of all examples in the prompt. Lastly, we introduce a new many-shot ICL benchmark, MANYICLBENCH, to characterize model’s ability on both fronts and benchmark 12 LCLMs using MANYICLBENCH. We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.

nan


Article 766

Title@2025-06-12 (4): TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning

Title: TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning TableRAG: Ein Retrieval Augmented Generation Framework für heterogene Dokument-Reasoning 表RAG:异源文件说明理由的回收增加代际生成框架 2506.10380v1

Authors (3): Xiaohan Yu, Pu Jian, Chong Chen

Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, an hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state-of-the-art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.

nan


Article 767

Title@2025-06-12 (4): Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning

Title: Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning Hierarchische Latentenfähigkeiten von Sprachmodellen über das kausale Repräsentationslernen entdecken 通过因果代表制学习发现语言模式的分级本端能力 2506.10378v1

Authors (4): Jikai Jin, Vasilis Syrgkanis, Sham Kakade, Hanlin Zhang

Faithful evaluation of language model capabilities is crucial for deriving actionable insights that can inform model development. However, rigorous causal evaluations in this domain face significant methodological challenges, including complex confounding effects and prohibitive computational costs associated with extensive retraining. To tackle these challenges, we propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Crucially, these latent factors are identified as causally interrelated after appropriately controlling for the base model as a common confounder. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks from the Open LLM Leaderboard, we identify a concise three-node linear causal structure that reliably explains the observed performance variations. Further interpretation of this causal structure provides substantial scientific insights beyond simple numerical rankings: specifically, we reveal a clear causal direction starting from general problem-solving capabilities, advancing through instruction-following proficiency, and culminating in mathematical reasoning ability. Our results underscore the essential role of carefully controlling base model variations during evaluation, a step critical to accurately uncovering the underlying causal relationships among latent model capabilities.

nan


Article 768

Title@2025-06-12 (4): A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce

Title: A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce Ein minimalistischer Ansatz zur LLM-Vernunft: von der Abstoßung zur Verstärkung 从拒绝抽样到强化 2504.11343v2

Authors (11): Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong

Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical success in training models such as DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. Surprisingly, we find that a simple rejection sampling baseline, RAFT, which trains only on positively rewarded samples, yields competitive performance than GRPO and PPO. Our ablation studies reveal that GRPO’s main advantage arises from discarding prompts with entirely incorrect responses, rather than from its reward normalization. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples. Reinforce-Rej improves KL efficiency and stability, serving as a lightweight yet effective alternative to more complex RL algorithms. We advocate RAFT as a robust and interpretable baseline, and suggest that future advances should focus on more principled designs for incorporating negative samples, rather than relying on them indiscriminately. Our findings provide guidance for future work in reward-based LLM post-training.

nan


Article 769

Title@2025-06-12 (4): CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

Title: CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models CAF-I: Ein kollaboratives Multi-Agent-Framework für eine verbesserte Ironieerkennung mit großen Sprachmodellen CAF-I:采用大语言模式加强铁铁探测多机构合作多方协作框架 2506.08430v2

Authors (3): Ziqi. Liu, Ziyang. Zhou, Mingxuan. Hu

Large language model (LLM) have become mainstream methods in the field of sarcasm detection. However, existing LLM methods face challenges in irony detection, including: 1. single-perspective limitations, 2. insufficient comprehensive understanding, and 3. lack of interpretability. This paper introduces the Collaborative Agent Framework for Irony (CAF-I), an LLM-driven multi-agent system designed to overcome these issues. CAF-I employs specialized agents for Context, Semantics, and Rhetoric, which perform multidimensional analysis and engage in interactive collaborative optimization. A Decision Agent then consolidates these perspectives, with a Refinement Evaluator Agent providing conditional feedback for optimization. Experiments on benchmark datasets establish CAF-I’s state-of-the-art zero-shot performance. Achieving SOTA on the vast majority of metrics, CAF-I reaches an average Macro-F1 of 76.31, a 4.98 absolute improvement over the strongest prior baseline. This success is attained by its effective simulation of human-like multi-perspective analysis, enhancing detection accuracy and interpretability.

nan


Article 770

Title@2025-06-12 (4): Improving Fairness of Large Language Models in Multi-document Summarization

Title: Improving Fairness of Large Language Models in Multi-document Summarization Verbesserung der Fairness von großen Sprachmodellen in Multi-Dokument-Zusammenfassung 提高多文件总结中大语言模式的公平性 2506.07479v2

Authors (3): Haoyuan Li, Rui Zhang, Snigdha Chaturvedi

Fairness in multi-document summarization (MDS) is crucial for providing comprehensive views across documents with diverse social attribute values, which can significantly impact decision-making. For example, a summarization system that tends to overrepresent negative reviews of products can mislead customers into disregarding good products. Previous works measure fairness in MDS at two levels: summary-level and corpus-level. While summary-level fairness focuses on individual summaries, corpus-level fairness focuses on a corpus of summaries. Recent methods primarily focus on summary-level fairness. We propose FairPO, a preference tuning method that focuses on both summary-level and corpus-level fairness in MDS. To improve summary-level fairness, we propose to generate preference pairs by perturbing document sets. To improve corpus-level fairness, we propose fairness-aware preference tuning by dynamically adjusting the weights of preference pairs. Our experiments show that FairPO outperforms strong baselines while maintaining the critical qualities of summaries. The code is available at https://github.com/leehaoyuan/coverage_fairnes.

nan


Article 771

Title@2025-06-12 (4): SCORE: Story Coherence and Retrieval Enhancement for AI Narratives

Title: SCORE: Story Coherence and Retrieval Enhancement for AI Narratives SCORE: Story-Kohärenz und Retrieval-Verbesserung für KI-Erzählungen SCORE: “ 独立叙述 “ 的 “ 一致性 “ 和 “ 检索 “ 增强 “ 增强 “ 统一 “ 和 “ 检索 “ 增强 “ 增强 “ 独立叙述 “ 2503.23512v4

Authors (14): Qiang Yi, Yangfan He, Jianhui Wang, Xinyuan Song, Shiyao Qian, Xinhang Yuan, Li Sun, Yi Xin, Jingqun Tang, Keqin Li, Kuan Lu, Menghao Huo, Jiaqi Chen, Tianyu Shi

Large Language Models (LLMs) can generate creative and engaging narratives from user-specified input, but maintaining coherence and emotional depth throughout these AI-generated stories remains a challenge. In this work, we propose SCORE, a framework for Story Coherence and Retrieval Enhancement, designed to detect and resolve narrative inconsistencies. By tracking key item statuses and generating episode summaries, SCORE uses a Retrieval-Augmented Generation (RAG) approach, incorporating TF-IDF and cosine similarity to identify related episodes and enhance the overall story structure. Results from testing multiple LLM-generated stories demonstrate that SCORE significantly improves the consistency and stability of narrative coherence compared to baseline GPT models, providing a more robust method for evaluating and refining AI-generated narratives.

nan


Article 772

Title@2025-06-12 (4): An Analysis of Datasets, Metrics and Models in Keyphrase Generation

Title: An Analysis of Datasets, Metrics and Models in Keyphrase Generation Eine Analyse von Datensätzen, Metrics und Modellen in der Keyphrase-Generierung 对关键词生成中的数据集、计量和模型的分析 2506.10346v1

Authors (2): Florian Boudin, Akiko Aizawa

Keyphrase generation refers to the task of producing a set of words or phrases that summarises the content of a document. Continuous efforts have been dedicated to this task over the past few years, spreading across multiple lines of research, such as model architectures, data resources, and use-case scenarios. Yet, the current state of keyphrase generation remains unknown as there has been no attempt to review and analyse previous work. In this paper, we bridge this gap by presenting an analysis of over 50 research papers on keyphrase generation, offering a comprehensive overview of recent progress, limitations, and open challenges. Our findings highlight several critical issues in current evaluation practices, such as the concerning similarity among commonly-used benchmark datasets and inconsistencies in metric calculations leading to overestimated performances. Additionally, we address the limited availability of pre-trained models by releasing a strong PLM-based model for keyphrase generation as an effort to facilitate future research.

nan


Article 773

Title@2025-06-12 (4): Code Execution as Grounded Supervision for LLM Reasoning

Title: Code Execution as Grounded Supervision for LLM Reasoning Code-Execution als geerdete Überwachung für LLM-Reasoning 法规执行作为LLM理由的有限制的监督 2506.10343v1

Authors (3): Dongwon Jung, Wenxuan Zhou, Muhao Chen

Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into a natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.

nan


Article 774

Title@2025-06-12 (4): Provably Learning from Language Feedback

Title: Provably Learning from Language Feedback Wahrscheinlich von Sprachfeedback lernen 从语言反馈中学习 2506.10341v1

Authors (6): Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng

Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a complexity measure to characterize the hardness of LLF problems. We show that transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.

nan


Article 775

Title@2025-06-12 (4): Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs

Title: Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs Amulett: Neuausrichtung während der Testzeit für Personalisierte Präferenzanpassung von LLMs 缩略图:在试验期间重新对准,以适应LLMM的个性化偏好 2502.19148v3

Authors (8): Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng, Yaodong Yang

How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse regarding culture, values, or time. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users’ personalized preferences. To reduce the computational cost brought by this optimization process for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.

nan


Article 776

Title@2025-06-12 (4): Benchmarking LLMs for Environmental Review and Permitting

Title: Benchmarking LLMs for Environmental Review and Permitting Benchmarking LLMs für Umweltprüfung und Genehmigung 环境审查和许可基准确定LLMs 2407.07321v3

Authors (15): Rounak Meyur, Hung Phan, Koby Hayashi, Ian Stewart, Shivam Sharma, Sarthak Chaturvedi, Mike Parker, Dan Nally, Sadie Montgomery, Karl Pazdernik, Ali Jannesari, Mahantesh Halappanavar, Sai Munikoti, Sameera Horawalavithana, Anurag Acharya

The National Environment Policy Act (NEPA) stands as a foundational piece of environmental legislation in the United States, requiring federal agencies to consider the environmental impacts of their proposed actions. The primary mechanism for achieving this is through the preparation of Environmental Assessments (EAs) and, for significant impacts, comprehensive Environmental Impact Statements (EIS). Large Language Model (LLM)s’ effectiveness in specialized domains like NEPA remains untested for adoption in federal decision-making processes. To address this gap, we present NEPA Question and Answering Dataset (NEPAQuAD), the first comprehensive benchmark derived from EIS documents, along with a modular and transparent evaluation pipeline, MAPLE, to assess LLM performance on NEPA-focused regulatory reasoning tasks. Our benchmark leverages actual EIS documents to create diverse question types, ranging from factual to complex problem-solving ones. We built a modular and transparent evaluation pipeline to test both closed- and open-source models in zero-shot or context-driven QA benchmarks. We evaluate five state-of-the-art LLMs using our framework to assess both their prior knowledge and their ability to process NEPA-specific information. The experimental results reveal that all the models consistently achieve their highest performance when provided with the gold passage as context. While comparing the other context-driven approaches for each model, Retrieval Augmented Generation (RAG)-based approaches substantially outperform PDF document contexts, indicating that neither model is well suited for long-context question-answering tasks. Our analysis suggests that NEPA-focused regulatory reasoning tasks pose a significant challenge for LLMs, particularly in terms of understanding the complex semantics and effectively processing the lengthy regulatory documents.

nan


Article 777

Title@2025-06-12 (4): CHANCERY: Evaluating Corporate Governance Reasoning Capabilities in Language Models

Title: CHANCERY: Evaluating Corporate Governance Reasoning Capabilities in Language Models CHANCERY: Bewertung der Corporate Governance-Reasoning-Fähigkeiten in Sprachmodellen C. 机会:评价语言模式中的公司治理能力 2506.04636v2

Authors (5): Lucas Irwin, Arda Kaz, Peiyao Sheng, Sewoong Oh, Pramod Viswanath

Law has long been a domain that has been popular in natural language processing (NLP) applications. Reasoning (ratiocination and the ability to make connections to precedent) is a core part of the practice of the law in the real world. Nevertheless, while multiple legal datasets exist, none have thus far focused specifically on reasoning tasks. We focus on a specific aspect of the legal landscape by introducing a corporate governance reasoning benchmark (CHANCERY) to test a model’s ability to reason about whether executive/board/shareholder’s proposed actions are consistent with corporate governance charters. This benchmark introduces a first-of-its-kind corporate governance reasoning test for language models - modeled after real world corporate governance law. The benchmark consists of a corporate charter (a set of governing covenants) and a proposal for executive action. The model’s task is one of binary classification: reason about whether the action is consistent with the rules contained within the charter. We create the benchmark following established principles of corporate governance - 24 concrete corporate governance principles established in and 79 real life corporate charters selected to represent diverse industries from a total dataset of 10k real life corporate charters. Evaluations on state-of-the-art (SOTA) reasoning models confirm the difficulty of the benchmark, with models such as Claude 3.7 Sonnet and GPT-4o achieving 64.5% and 75.2% accuracy respectively. Reasoning agents exhibit superior performance, with agents based on the ReAct and CodeAct frameworks scoring 76.1% and 78.1% respectively, further confirming the advanced legal reasoning capabilities required to score highly on the benchmark. We also conduct an analysis of the types of questions which current reasoning models struggle on, revealing insights into the legal reasoning capabilities of SOTA models.

nan


Article 778

Title@2025-06-12 (4): Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs

Title: Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs Gepaarte Fertigstellung: Flexible Quantifizierung von Emissions-Framing auf Scale mit LLMs 提前完成:与LLMs一道灵活量化规模问题配置 2408.09742v2

Authors (2): Simon D Angus, Lachlan O’Neill

Detecting issue framing in text - how different perspectives approach the same topic - is valuable for social science and policy analysis, yet challenging for automated methods due to subtle linguistic differences. We introduce `paired completion’, a novel approach using LLM next-token log probabilities to detect contrasting frames using minimal examples. Through extensive evaluation across synthetic datasets and a human-labeled corpus, we demonstrate that paired completion is a cost-efficient, low-bias alternative to both prompt-based and embedding-based methods, offering a scalable solution for analyzing issue framing in large text collections, especially suited to low-resource settings.

nan


Article 779

Title@2025-06-12 (4): Detecting Sockpuppetry on Wikipedia Using Meta-Learning

Title: Detecting Sockpuppetry on Wikipedia Using Meta-Learning Sockepuppetry auf Wikipedia erkennen Mit Meta-Learning 在维基百科上用元学习探测袜子布料 2506.10314v1

Authors (2): Luc Raszewski, Christine De Kock

Malicious sockpuppet detection on Wikipedia is critical to preserving access to reliable information on the internet and preventing the spread of disinformation. Prior machine learning approaches rely on stylistic and meta-data features, but do not prioritise adaptability to author-specific behaviours. As a result, they struggle to effectively model the behaviour of specific sockpuppet-groups, especially when text data is limited. To address this, we propose the application of meta-learning, a machine learning technique designed to improve performance in data-scarce settings by training models across multiple tasks. Meta-learning optimises a model for rapid adaptation to the writing style of a new sockpuppet-group. Our results show that meta-learning significantly enhances the precision of predictions compared to pre-trained models, marking an advancement in combating sockpuppetry on open editing platforms. We release a new dataset of sockpuppet investigations to foster future research in both sockpuppetry and meta-learning fields.

nan


Article 780

Title@2025-06-12 (4): Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling

Title: Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling Effiziente Längenverallgemeinerbare Aufmerksamkeit über Causal Retrieval für die Lang-Kontext-Sprachenmodellierung 长文本语言建模通过 “ 目的检索 “ 吸引长文本语言建模 2410.01651v4

Authors (5): Xiang Hu, Zhihao Teng, Jun Zhao, Wei Wu, Kewei Tu

Despite the success of Transformers, handling long contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention. Thus Transformers often require post-training with a larger attention window, significantly increasing computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to 1000 times the pre-training context length while maintaining the ability to access distant information with a constant attention window size. For a given input sequence, we split it into chunks and use each chunk to retrieve top-k relevant past chunks for subsequent text generation. Specifically, unlike most previous works that use an off-the-shelf retriever, our key innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner. Such a mechanism accommodates retrieved chunks with a fixed-size attention window to achieve long-range information access, significantly reducing computational and memory costs during training and inference. Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths, which is 1000 times the training length.

nan


Article 781

Title@2025-06-12 (4): AC/DC: LLM-based Audio Comprehension via Dialogue Continuation

Title: AC/DC: LLM-based Audio Comprehension via Dialogue Continuation AC/DC: LLM-basiertes Audio-Verständnis über Dialog-Fortsetzung AC/DC:基于LLM的通过对话继续了解音频 2506.10312v1

Authors (5): Yusuke Fujita, Tomoya Mizumoto, Atsushi Kojima, Lianbo Liu, Yui Sudo

We propose an instruction-following audio comprehension model that leverages the dialogue continuation ability of large language models (LLMs). Instead of directly generating target captions in training data, the proposed method trains a model to produce responses as if the input caption triggered a dialogue. This dialogue continuation training mitigates the caption variation problem. Learning to continue a dialogue effectively captures the caption’s meaning beyond its surface-level words. As a result, our model enables zero-shot instruction-following capability without multitask instruction tuning, even trained solely on audio captioning datasets. Experiments on AudioCaps, WavCaps, and Clotho datasets with AudioBench audio-scene question-answering tests demonstrate our model’s ability to follow various unseen instructions.

nan


Article 782

Title@2025-06-12 (4): BeamLoRA: Beam-Constraint Low-Rank Adaptation

Title: BeamLoRA: Beam-Constraint Low-Rank Adaptation BeamLoRA: Beam-Constraint Low-Rank Anpassung BeamLORA: 束-节制低射线适应 2502.13604v2

Authors (10): Naibin Gu, Zhenyu Zhang, Xiyu Liu, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

Due to the demand for efficient fine-tuning of large language models, Low-Rank Adaptation (LoRA) has been widely adopted as one of the most effective parameter-efficient fine-tuning methods. Nevertheless, while LoRA improves efficiency, there remains room for improvement in accuracy. Herein, we adopt a novel perspective to assess the characteristics of LoRA ranks. The results reveal that different ranks within the LoRA modules not only exhibit varying levels of importance but also evolve dynamically throughout the fine-tuning process, which may limit the performance of LoRA. Based on these findings, we propose BeamLoRA, which conceptualizes each LoRA module as a beam where each rank naturally corresponds to a potential sub-solution, and the fine-tuning process becomes a search for the optimal sub-solution combination. BeamLoRA dynamically eliminates underperforming sub-solutions while expanding the parameter space for promising ones, enhancing performance with a fixed rank. Extensive experiments across three base models and 12 datasets spanning math reasoning, code generation, and commonsense reasoning demonstrate that BeamLoRA consistently enhances the performance of LoRA, surpassing the other baseline methods.

nan


Article 783

Title@2025-06-12 (4): Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs

Title: Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs Geplante interleaved Speech-Text-Schulung für Sprach-zu-Sprach-Übersetzung mit LLMs 配有LLMM的语音对语音翻译教学 2506.10299v1

Authors (7): Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe

Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs), which are fine-tuned on discrete speech units. In such approaches, modality adaptation from text to speech has been an issue. LLMs are trained on text-only data, which presents challenges to adapt them to speech modality with limited speech-to-speech data. To address the training difficulty, we propose scheduled interleaved speech–text training in this study. We use interleaved speech–text units instead of speech units during training, where aligned text tokens are interleaved at the word level. We gradually decrease the ratio of text as training progresses, to facilitate progressive modality adaptation from text to speech. We conduct experimental evaluations by fine-tuning LLaMA3.2-1B for S2ST on the CVSS dataset. We show that the proposed method consistently improves the translation performances, especially for languages with limited training data.

nan


Article 784

Title@2025-06-12 (4): “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context

Title: “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context “Check My Work?”: Sykopanzmessung in einem simulierten Bildungskontext “检查我的工作?” “测量模拟教育环境中的相对性” 2506.10297v1

Authors (1): Chuck Arvin

This study examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context, where sycophancy poses significant risks. Testing five different LLMs from the OpenAI GPT-4o and GPT-4.1 model classes across five experimental conditions, we show that response quality varies dramatically based on query framing. In cases where the student mentions an incorrect answer, the LLM correctness can degrade by as much as 15 percentage points, while mentioning the correct answer boosts accuracy by the same margin. Our results also show that this bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model, versus 8% for the GPT-4o model. Our analysis of how often LLMs “flip” their answer, and an investigation into token level probabilities, confirm that the models are generally changing their answers to answer choices mentioned by students in line with the sycophancy hypothesis. This sycophantic behavior has important implications for educational equity, as LLMs may accelerate learning for knowledgeable students while the same tools may reinforce misunderstanding for less knowledgeable students. Our results highlight the need to better understand the mechanism, and ways to mitigate, such bias in the educational context.

nan


Article 785

Title@2025-06-12 (4): Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages

Title: Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages Flick: Wenige Labels Textklassifizierung mit K-Aware Intermediate Learning in Multi-Task Low-Resource Sprachen Flick:使用K-Aware中级学习多种低资源语言的多种语言的标签文字分类, 2506.10292v1

Authors (6): Ali Almutairi, Abdullah Alsuhaibani, Shoaib Jameel, Usman Naseem, Gelareh Mohammadi, Imran Razzak

Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component, a departure from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels. We demonstrate Flick’s efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.

nan


Article 786

Title@2025-06-12 (4): Context Is Not Comprehension

Title: Context Is Not Comprehension Kontext ist nicht verständlich 背景不令人理解 2506.04907v4

Authors (2): Alex Pan, Mary-Anne Williams

The dominant way of judging Large Language Models (LLMs) has been to ask how well they can recall explicit facts from very long inputs. While today’s best models achieve near perfect recall, this masks a harder skill: performing multi-step reasoning and tracking intermediate state that never appears verbatim. We introduce Verbose ListOps (VLO), a benchmark that embeds deterministic ListOps computations inside narrative camouflage and, crucially, allows step-level evaluation of every intermediate result. Experiments show that models which solve raw ListOps with approximately 100% accuracy collapse on VLO after only 10,000 tokens. By exposing where a model’s reasoning chain first diverges, VLO moves assessment beyond sheer context length and toward genuine comprehension. VLO’s generation pipeline is task-agnostic: it can weave any deterministically verifiable reasoning schema – arithmetic, symbolic, abductive, inductive or defeasible – into narrative form. This makes VLO a reusable test-bed for the next wave of reasoning-centric model designs, not merely those with step-explicit scaffolds.

nan


Article 787

Title@2025-06-12 (4): Prompt-based Depth Pruning of Large Language Models

Title: Prompt-based Depth Pruning of Large Language Models Prompt-basierte Tiefenkorrektur von großen Sprachmodellen 大语言模式的即时深度定位 2502.04348v3

Authors (3): Juyun Wee, Minjae Park, Jaeho Lee

Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent – a block that is crucial for a task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference language models, and achieves better on-task performance than static depth pruning baselines.

nan


Article 788

Title@2025-06-12 (4): ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs

Title: ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs ClusterUCB: Effiziente Gradient-basierte Datenauswahl für gezielte Feinsteuerung von LLMs COCUCB: 高效率的逐步数据选择,以便有针对性地微调LLMM 2506.10288v1

Authors (6): Zige Wang, Qi Zhu, Fei Mi, Minghui Xu, Ruochun Jin, Wenjing Yang

Gradient-based data influence approximation has been leveraged to select useful data samples in the supervised fine-tuning of large language models. However, the computation of gradients throughout the fine-tuning process requires too many resources to be feasible in practice. In this paper, we propose an efficient gradient-based data selection framework with clustering and a modified Upper Confidence Bound (UCB) algorithm. Based on the intuition that data samples with similar gradient features will have similar influences, we first perform clustering on the training data pool. Then, we frame the inter-cluster data selection as a constrained computing budget allocation problem and consider it a multi-armed bandit problem. A modified UCB algorithm is leveraged to solve this problem. Specifically, during the iterative sampling process, historical data influence information is recorded to directly estimate the distributions of each cluster, and a cold start is adopted to balance exploration and exploitation. Experimental results on various benchmarks show that our proposed framework, ClusterUCB, can achieve comparable results to the original gradient-based data selection methods while greatly reducing computing consumption.

nan


Article 789

Title@2025-06-12 (4): Play to Generalize: Learning to Reason Through Game Play

Title: Play to Generalize: Learning to Reason Through Game Play Spielen Sie Generalize: Lernen, Vernunft durch Spiel zu lernen 玩一般游戏: 通过玩游戏学习理性 2506.08011v2

Authors (6): Yunfei Xie, Yinsong Ma, Shiyi Lan, Alan Yuille, Junfei Xiao, Chen Wei

Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g. Snake, significantly enhances its downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL, suggesting the capture of transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data in multimodal reasoning benchmarks, while preserving the base model’s performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pre-text tasks that unlock generalizable multimodal reasoning abilities in MLLMs.

nan


Article 790

Title@2025-06-12 (4): Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models

Title: Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models Haben Sprachmodelle Bayesische Gehirne? Beeindruckende stochastische und deterministische Entscheidungsmuster innerhalb großer Sprachmodelle 语言模式是否具有贝耶斯人脑? 区分大语言模式中的斯托卡和决定性决定模式 2506.10268v1

Authors (2): Andrea Yaoyun Cui, Pengfei Yu

Language models are essentially probability distributions over token sequences. Auto-regressive models generate sentences by iteratively computing and sampling from the distribution of the next token. This iterative sampling introduces stochasticity, leading to the assumption that language models make probabilistic decisions, similar to sampling from unknown distributions. Building on this assumption, prior research has used simulated Gibbs sampling, inspired by experiments designed to elicit human priors, to infer the priors of language models. In this paper, we revisit a critical question: Do language models possess Bayesian brains? Our findings show that under certain conditions, language models can exhibit near-deterministic decision-making, such as producing maximum likelihood estimations, even with a non-zero sampling temperature. This challenges the sampling assumption and undermines previous methods for eliciting human-like priors. Furthermore, we demonstrate that without proper scrutiny, a system with deterministic behavior undergoing simulated Gibbs sampling can converge to a “false prior.” To address this, we propose a straightforward approach to distinguish between stochastic and deterministic decision patterns in Gibbs sampling, helping to prevent the inference of misleading language model priors. We experiment on a variety of large language models to identify their decision patterns under various circumstances. Our results provide key insights in understanding decision making of large language models.

nan


Article 791

Title@2025-06-12 (4): Research Borderlands: Analysing Writing Across Research Cultures

Title: Research Borderlands: Analysing Writing Across Research Cultures Forschungsgrenzen: Analysieren des Schreibens über Forschungskulturen hinweg 研究边界地区:分析跨研究文化的写作 2506.00784v2

Authors (3): Shaily Bhatt, Tal August, Maria Antoniak

Improving cultural competence of language technologies is important. However most recent works rarely engage with the communities they study, and instead rely on synthetic setups and imperfect proxies of culture. In this work, we take a human-centered approach to discover and measure language-based cultural norms, and cultural competence of LLMs. We focus on a single kind of culture, research cultures, and a single task, adapting writing across research cultures. Through a set of interviews with interdisciplinary researchers, who are experts at moving between cultures, we create a framework of structural, stylistic, rhetorical, and citational norms that vary across research cultures. We operationalise these features with a suite of computational metrics and use them for (a) surfacing latent cultural norms in human-written research papers at scale; and (b) highlighting the lack of cultural competence of LLMs, and their tendency to homogenise writing. Overall, our work illustrates the efficacy of a human-centered approach to measuring cultural norms in human-written and LLM-generated texts.

nan


Article 792

Title@2025-06-12 (4): M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction

Title: M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction M-MRE: Ausdehnung des Effekts der gegenseitigen Verstärkung auf multimodale Informationsextraktion M-MRE:将相互强化效应扩大到多式联运信息提取 2504.17353v2

Authors (6): Chengguang Gan, Zhixi Cai, Yanbin Wei, Yunhao Liang, Shiwen Ni, Tatsunori Mori

Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection of information extraction and model interpretability. MRE aims to leverage the mutual understanding between tasks of different granularities, enhancing the performance of both coarse-grained and fine-grained tasks through joint modeling. While MRE has been explored and validated in the textual domain, its applicability to visual and multimodal domains remains unexplored. In this work, we extend MRE to the multimodal information extraction domain for the first time. Specifically, we introduce a new task: Multimodal Mutual Reinforcement Effect (M-MRE), and construct a corresponding dataset to support this task. To address the challenges posed by M-MRE, we further propose a Prompt Format Adapter (PFA) that is fully compatible with various Large Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can also be observed in the M-MRE task, a multimodal text-image understanding scenario. This provides strong evidence that MRE facilitates mutual gains across three interrelated tasks, confirming its generalizability beyond the textual domain.

nan